Auto Improvement Using MetamodelClassification - Example Usage
===

This notebook demonstrates two example uses of `MetamodelClassification`:

Example 1: Bank Marketing Dataset. <br>
Example 2: German Credit Risk Dataset. <br>

Read about the datasets [here](../../Docs/Data_Dictionaries/ReadMe.md) and their baselines [here](../../Docs/Models/Baseline).

In [1]:
from auto_improve_pipeline import MetamodelClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.pipeline import Pipeline

# Example 1: Improving Log Loss on the Bank Marketing Dataset

## Data Loading and Train Test Split

In [2]:
from data_acquisition.bank_marketing_loading import load_bank_marketing_dataset
from data_acquisition.bank_marketing_constants import DEPOSIT
from modeling.xgboost_bank_marketing_impl import preprocess_labels as preprocess_bank_marketing_labels

bank_marketing_raw_df = load_bank_marketing_dataset()

raw_X = bank_marketing_raw_df.drop(columns=[DEPOSIT])
raw_y = bank_marketing_raw_df[DEPOSIT]
y = preprocess_bank_marketing_labels(raw_y)

X_train, X_test, y_train, y_test = train_test_split(raw_X, y, test_size=0.3, random_state=2, stratify=y)
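
The `stratify=y` argument in the split above keeps the class balance the same in the training and test sets, which matters when one class is much rarer than the other. A minimal sketch on synthetic labels follows; the array sizes and the 20% positive rate are illustrative assumptions, not properties of the Bank Marketing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): 100 rows, 20% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)

# With stratification, both splits keep roughly the original positive rate.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"train positive rate: {train_pos_rate:.2f}")
print(f"test positive rate:  {test_pos_rate:.2f}")
```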

## Original Log Loss

In [3]:
import modeling.xgboost_bank_marketing_impl as bank_marketing_impl

base_model_pipeline = Pipeline(steps=[
    ('preprocess', bank_marketing_impl.get_feature_preprocessor_step()),
    ('base_model', bank_marketing_impl.get_model_for_training())
])

base_model_pipeline.fit(X_train, y_train)
base_pred_test = base_model_pipeline.predict(X_test)  # shape (n_samples,)
# base_pred_test holds hard 0/1 labels, which log_loss accepts as positive-class probabilities:
# base_pred_test[i] = 0 -> probability of the positive class for sample i is 0
# base_pred_test[i] = 1 -> probability of the positive class for sample i is 1
# Every misclassified sample therefore incurs the maximum (clipped) penalty.
base_model_log_loss_val = log_loss(y_test, base_pred_test)
print(f"Base Model Log Loss: {base_model_log_loss_val}")

Base Model Log Loss: 3.707508753699377
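
Because hard 0/1 labels are treated as fully confident probabilities, each misclassified sample contributes the largest penalty that `log_loss` allows after clipping. The base model score is therefore driven almost entirely by its error rate. The sketch below uses small made-up arrays (an assumption for illustration only) to contrast hard labels with graded scores that lead to the same decisions at a 0.5 threshold.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy ground truth and predictions, invented for this illustration.
y_true = np.array([0, 1, 1, 0, 1])
hard_labels = np.array([0.0, 1.0, 0.0, 0.0, 1.0])  # third sample is wrong
soft_scores = np.array([0.1, 0.9, 0.4, 0.2, 0.8])  # same decisions at 0.5

# Hard labels: the single mistake is penalized as a fully confident error.
hard_loss = log_loss(y_true, hard_labels)
# Graded scores: the same mistake is penalized only mildly (p = 0.4).
soft_loss = log_loss(y_true, soft_scores)

print(f"log loss with hard labels:   {hard_loss:.3f}")
print(f"log loss with graded scores: {soft_loss:.3f}")
```

The gap between the two numbers shows why replacing hard labels with calibrated scores can lower log loss even when the predicted classes do not change.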


## Improved Log Loss

In [4]:
uq_pipeline = Pipeline(steps=[
    ('preprocess', bank_marketing_impl.get_feature_preprocessor_step()),
    ('uq_model', MetamodelClassification(base_model=bank_marketing_impl.get_model_for_training(), 
                                         meta_model='gbm', meta_config={},
                                         random_seed=42))
])

uq_pipeline.fit(X_train, y_train)
uq_pred_test, uq_pred_test_score = uq_pipeline.predict(X_test)

uq_model_log_loss_val = log_loss(y_test, uq_pred_test_score)
print(f"UQ Model Log Loss: {uq_model_log_loss_val}")

UQ Model Log Loss: 2.348218595064319


# Example 2: Improving Log Loss on the German Credit Risk Dataset

## Data Loading and Train Test Split

In [15]:
from data_acquisition.german_credit_loading import load_german_credit_risk_dataset, TARGET_COL
import modeling.xgboost_german_credit_impl as german_credit_impl

german_credit_raw_df = load_german_credit_risk_dataset()
raw_X = german_credit_raw_df.drop(columns=[TARGET_COL])
raw_y = german_credit_raw_df[TARGET_COL]
y = german_credit_impl.preprocess_labels(raw_y)

X_train, X_test, y_train, y_test = train_test_split(raw_X, y, test_size=0.3, random_state=2, stratify=y)

## Original Log Loss

In [16]:
german_credit_preprocessor = german_credit_impl.get_feature_preprocessor_step()

X_train = german_credit_preprocessor.fit_transform(X_train, y_train)
X_test = german_credit_preprocessor.transform(X_test)

german_credit_base_model, german_credit_fit_params = german_credit_impl.get_model_for_training(X_train, y_train, X_test, y_test)

base_model_pipeline = Pipeline(steps=[
    ('base_model', german_credit_base_model)
])

base_model_pipeline.fit(X_train, y_train, **{f'base_model__{param}': val for param, val in german_credit_fit_params.items()})
base_pred_test = base_model_pipeline.predict(X_test)  # shape (n_samples,)
print(f"base_pred_test shape: {base_pred_test.shape}")
# base_pred_test holds hard 0/1 labels, which log_loss accepts as positive-class probabilities:
# base_pred_test[i] = 0 -> probability of the positive class for sample i is 0
# base_pred_test[i] = 1 -> probability of the positive class for sample i is 1
# Every misclassified sample therefore incurs the maximum (clipped) penalty.
base_model_log_loss_val = log_loss(y_test, base_pred_test)
print(f"Base Model Log Loss: {base_model_log_loss_val}")

[0]	validation_0-auc:0.77686	validation_1-auc:0.67772
[100]	validation_0-auc:0.88135	validation_1-auc:0.74302
[200]	validation_0-auc:0.89594	validation_1-auc:0.75190
[300]	validation_0-auc:0.90854	validation_1-auc:0.75667
[400]	validation_0-auc:0.91961	validation_1-auc:0.75947
[500]	validation_0-auc:0.92977	validation_1-auc:0.76222
[600]	validation_0-auc:0.93856	validation_1-auc:0.76693
[700]	validation_0-auc:0.94522	validation_1-auc:0.76958
[800]	validation_0-auc:0.95080	validation_1-auc:0.76989
[900]	validation_0-auc:0.95598	validation_1-auc:0.77053
[942]	validation_0-auc:0.95847	validation_1-auc:0.77085
base_pred_test shape: (300,)
Base Model Log Loss: 9.210486964838385
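
The `base_model__` prefix in the `fit` call above is how a scikit-learn `Pipeline` routes keyword arguments to a single named step. The sketch below shows the same pattern with a stand-in estimator; the step names `scale` and `clf`, the synthetic data, and the use of `sample_weight` are assumptions for illustration, whereas the notebook's actual fit parameters come from `german_credit_impl`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)
weights = np.where(y == 1, 2.0, 1.0)

pipe = Pipeline(steps=[("scale", StandardScaler()), ("clf", LogisticRegression())])

# Prefix each fit parameter with "<step name>__" so the pipeline
# forwards it only to that step's fit method.
fit_params = {"sample_weight": weights}
pipe.fit(X, y, **{f"clf__{name}": value for name, value in fit_params.items()})

# Fitting the final estimator directly with the same argument gives the same model.
direct = LogisticRegression().fit(
    StandardScaler().fit_transform(X), y, sample_weight=weights
)
same_coefficients = np.allclose(pipe.named_steps["clf"].coef_, direct.coef_)
print(f"pipeline and direct fit agree: {same_coefficients}")
```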


## Improved Log Loss

In [17]:
german_credit_base_model, german_credit_fit_params = german_credit_impl.get_model_for_training(X_train, y_train, X_test, y_test)

uq_pipeline = Pipeline(steps=[
    ('uq_model', MetamodelClassification(base_model=german_credit_base_model, 
                                         meta_model='gbm', meta_config={},
                                         random_seed=42))
])

uq_pipeline.fit(X_train, y_train)
uq_pred_test, uq_pred_test_score = uq_pipeline.predict(X_test)

uq_model_log_loss_val = log_loss(y_test, uq_pred_test_score)
print(f"UQ Model Log Loss: {uq_model_log_loss_val}")

UQ Model Log Loss: 0.9617334628179299
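
To compare the two examples on a common scale, the reported losses can be expressed as relative reductions. The helper below is a hypothetical convenience function, not part of `auto_improve_pipeline`; the numbers are copied from the cell outputs in this notebook.

```python
def relative_reduction(base_loss: float, improved_loss: float) -> float:
    """Return the fraction by which improved_loss is lower than base_loss."""
    return (base_loss - improved_loss) / base_loss


# (base model log loss, UQ model log loss) as printed by the cells above.
reported = {
    "Bank Marketing": (3.707508753699377, 2.348218595064319),
    "German Credit Risk": (9.210486964838385, 0.9617334628179299),
}

reductions = {
    name: relative_reduction(base, improved)
    for name, (base, improved) in reported.items()
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1%} lower log loss")
```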
