Auto Improvement Using MetamodelClassification - Example Usage
===

This notebook demonstrates two example uses of MetamodelClassification:

Example 1: Bank Marketing Dataset. <br>
Example 2: German Credit Risk Dataset. <br>

Read about the datasets [here](../../Docs/Data_Dictionaries/ReadMe.md) and their baselines [here](../../Docs/Models/Baseline).

In [1]:
from auto_improve_pipeline import MetamodelClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.pipeline import Pipeline

# Example 1: Improving Log Loss on the Bank Marketing Dataset

## Data Loading and Train Test Split

In [2]:
from data_acquisition.bank_marketing_loading import load_bank_marketing_dataset
from data_acquisition.bank_marketing_constants import DEPOSIT
from modeling.xgboost_bank_marketing_impl import preprocess_labels

bank_marketing_raw_df = load_bank_marketing_dataset()

raw_X = bank_marketing_raw_df.drop(columns=[DEPOSIT])
raw_y = bank_marketing_raw_df[DEPOSIT]
y = preprocess_labels(raw_y)

X_train, X_test, y_train, y_test = train_test_split(raw_X, y, test_size=0.3, random_state=2, stratify=y)

## Original Log Loss

In [3]:
from modeling.xgboost_bank_marketing_impl import get_feature_preprocessor_step, get_model_for_training

base_model_pipeline = Pipeline(steps=[
    ('preprocess', get_feature_preprocessor_step()),
    ('base_model', get_model_for_training())
])

base_model_pipeline.fit(X_train, y_train)
base_pred_test = base_model_pipeline.predict(X_test)  # hard 0/1 labels, shape (n_samples,)
# log_loss accepts these integer labels in place of probabilities because:
# base_pred_test[i] = 0 -> probability of the positive class for sample i is 0
# base_pred_test[i] = 1 -> probability of the positive class for sample i is 1
# Note: log_loss clips probabilities away from 0 and 1, so each misclassified sample adds a large fixed penalty.
base_model_log_loss_val = log_loss(y_test, base_pred_test)
print(f"Base Model Log Loss: {base_model_log_loss_val}")

Base Model Log Loss: 3.707508753699377
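
To illustrate why hard labels produce such a large log loss, the following is a minimal pure-Python sketch of binary log loss with clipping. The helper `binary_log_loss`, the clipping constant `eps=1e-15`, and the toy data are assumptions made for illustration only; they are not taken from this notebook's pipeline, and the clipping constant used by scikit-learn may differ between versions.

```python
import math

def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never receives 0
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)

# Hypothetical toy data: four samples, one of them misclassified.
y_true = [1, 0, 1, 0]
hard_labels = [1, 0, 0, 0]          # 0/1 predictions used as probabilities
soft_scores = [0.9, 0.2, 0.4, 0.1]  # same decisions at a 0.5 threshold

hard_loss = binary_log_loss(y_true, hard_labels)
soft_loss = binary_log_loss(y_true, soft_scores)
print(f"hard labels: {hard_loss:.4f}")
print(f"soft scores: {soft_loss:.4f}")
```

With hard labels, correct predictions contribute almost nothing and each error contributes roughly `-log(eps)`, so the loss is approximately the error rate scaled by that constant. Soft scores are penalized in proportion to how wrong they are.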


## Improved Log Loss

In [4]:
uq_pipeline = Pipeline(steps=[
    ('preprocess', get_feature_preprocessor_step()),
    ('uq_model', MetamodelClassification(base_model=get_model_for_training(),
                                         meta_model='gbm', meta_config={},
                                         random_seed=42))
])

uq_pipeline.fit(X_train, y_train)
# predict returns a pair: the predicted labels and a per-sample score
uq_pred_test, uq_pred_test_score = uq_pipeline.predict(X_test)

uq_model_log_loss_val = log_loss(y_test, uq_pred_test_score)
print(f"UQ Model Log Loss: {uq_model_log_loss_val}")

UQ Model Log Loss: 2.348401489338172
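
As a quick sanity check on the size of the change, the sketch below computes the absolute and relative reduction from the two values printed above, taken at face value. The variable names are illustrative and not part of the pipeline.

```python
# Values copied from the printed cell outputs in this notebook.
base_log_loss = 3.707508753699377  # base model, hard labels
uq_log_loss = 2.348401489338172    # MetamodelClassification scores

absolute_drop = base_log_loss - uq_log_loss
relative_drop = absolute_drop / base_log_loss

print(f"Absolute reduction: {absolute_drop:.4f}")
print(f"Relative reduction: {relative_drop:.1%}")
```

Keep in mind that the base value is computed from hard 0/1 labels and the second from scores, so part of the gap may reflect that difference in inputs rather than model quality alone.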


# Example 2: Improving Log Loss on the German Credit Risk Dataset