Improving Log Loss Using the IBM UQ360 Blackbox Metamodel
===

https://github.com/IBM/UQ360

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

# Loading the Bank Marketing Dataset

In [1]:
from data_acquisition.bank_marketing_loading import load_bank_marketing_dataset
from data_acquisition.bank_marketing_constants import DEPOSIT
from modeling.xgboost_bank_marketing_impl import preprocess_labels

bank_marketing_raw_df = load_bank_marketing_dataset()

raw_X = bank_marketing_raw_df.drop(columns=[DEPOSIT])
raw_y = bank_marketing_raw_df[DEPOSIT]
y = preprocess_labels(raw_y)

# Train Test Split

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(raw_X, y, test_size=0.3, random_state=2, stratify=y)

# Log Loss of the Base Model

In [3]:
from modeling.xgboost_bank_marketing_impl import get_feature_preprocessor_step, get_model_for_training
from sklearn.pipeline import Pipeline
from sklearn.metrics import log_loss

base_model_pipeline = Pipeline(steps=[
    ('preprocess', get_feature_preprocessor_step()),
    ('base_model', get_model_for_training())
])

base_model_pipeline.fit(X_train, y_train)
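base_pred_test = base_model_pipeline.predict(X_test)  # hard 0/1 labels, shape (n_samples,)
# log_loss reads a 1-D y_pred as the probability of the positive class, so the
# predicted integer labels are scored as fully confident predictions:
# base_pred_test[i] = 0 -> probability of the positive class for sample i is 0
# base_pred_test[i] = 1 -> probability of the positive class for sample i is 1
# log_loss clips these values away from 0 and 1, so every misclassified sample
# adds the same large penalty. If the base model exposes predict_proba, scoring
# predict_proba(X_test)[:, 1] would measure its own probability estimates instead.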
base_model_log_loss_val = log_loss(y_test, base_pred_test)
print(f"Base Model Log Loss: {base_model_log_loss_val}")

Base Model Log Loss: 3.707508753699377
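To make the effect of scoring hard labels concrete, here is a minimal sketch on toy data. The arrays below are made up for illustration and are not taken from the bank marketing dataset. It compares `log_loss` on probability estimates with `log_loss` on the same predictions thresholded to 0/1.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and positive-class probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.1, 0.6, 0.8, 0.7, 0.4])

# Threshold the probabilities to get hard 0/1 labels, as predict() would.
hard = (proba >= 0.5).astype(int)

# Log loss on the probability estimates.
soft_loss = log_loss(y_true, proba)

# Log loss on the hard labels: each wrong label is treated as a fully
# confident mistake, and only the clipping inside log_loss keeps it finite.
hard_loss = log_loss(y_true, hard)

error_rate = np.mean(hard != y_true)
print(f"soft log loss: {soft_loss:.4f}")
print(f"hard log loss: {hard_loss:.4f} (error rate {error_rate:.2f})")
```

With hard labels the log loss is essentially the error rate multiplied by a fixed clipping penalty, so it says nothing about how well calibrated the model's probabilities are.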


# Log loss of the Improved Model

## Case 1: using UQ during training

In this case the metamodel is trained together with the base model: a single `fit` call on `MetamodelClassification` fits both.

In [4]:
from uq360.algorithms.blackbox_metamodel import MetamodelClassification

uq_pipeline = Pipeline(steps=[
    ('preprocess', get_feature_preprocessor_step()),
    ('uq_model', MetamodelClassification(base_model=get_model_for_training(), 
                                         meta_model='gbm', meta_config={},
                                         random_seed=42))
])

uq_pipeline.fit(X_train, y_train)
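# predict returns the base model's labels and a per-sample confidence score
uq_pred_test, uq_pred_test_score = uq_pipeline.predict(X_test)

# Caveat: UQ360 describes this score as the confidence in the predicted label,
# whereas log_loss interprets a 1-D y_pred as the probability of the positive class.
uq_model_log_loss_val = log_loss(y_test, uq_pred_test_score)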
print(f"UQ Model Log Loss: {uq_model_log_loss_val}")

UQ Model Log Loss: 2.3487672914593642
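The idea behind a blackbox metamodel can be sketched with scikit-learn alone: fit a base classifier, then fit a second model on held-out data to predict whether the base classifier's label is correct. The snippet below is a simplified illustration under stated assumptions, not UQ360's implementation. The synthetic data, the `LogisticRegression` base model, and the `meta_features` helper are all hypothetical stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Hold out part of the training data for the meta model.
X_base, X_meta, y_base, y_meta = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0, stratify=y_train)

# Base model: any classifier with predict_proba would do here.
base = LogisticRegression(max_iter=1000).fit(X_base, y_base)


def meta_features(model, X):
    """Hypothetical meta features: raw inputs plus the base class probabilities."""
    return np.hstack([X, model.predict_proba(X)])


# Meta target: 1 where the base model's label is correct, 0 where it is wrong.
meta_target = (base.predict(X_meta) == y_meta).astype(int)
meta = GradientBoostingClassifier(random_state=0).fit(
    meta_features(base, X_meta), meta_target)

# At prediction time: labels come from the base model, and the meta model
# supplies a confidence that each of those labels is correct.
y_pred = base.predict(X_test)
y_score = meta.predict_proba(meta_features(base, X_test))[:, 1]
print(y_pred[:5], np.round(y_score[:5], 3))
```

Training the meta model on data the base model has not seen matters: on its own training data the base model looks more accurate than it is, which would make the confidence scores too optimistic.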


## Case 2: using UQ after training

In this case the base model has already been trained, and the metamodel is trained afterward on a held-out slice of the training data.

In [10]:
from sklearn.ensemble import GradientBoostingClassifier

# split training data into 'pre meta training' and 'meta training'
X_pre_meta_train, X_meta_train, y_pre_meta_train, y_meta_train = train_test_split(X_train, y_train, 
                                                                                  test_size=0.2, 
                                                                                  random_state=2, 
                                                                                  stratify=y_train)
preprocessor = get_feature_preprocessor_step().fit(X_pre_meta_train, y_pre_meta_train)

# simulate trained base model
pre_meta_base_model = get_model_for_training()

base_model_trained = pre_meta_base_model.fit(preprocessor.transform(X_pre_meta_train), y_pre_meta_train)

# train the meta model
meta_model = GradientBoostingClassifier()

uq_pipeline2 = Pipeline(steps=[
    ('uq_model', MetamodelClassification(base_model=base_model_trained, 
                                         meta_model=meta_model, meta_config={},
                                         random_seed=42))
])
uq_pipeline2.fit(X=None, y=None, 
                 uq_model__base_is_prefitted=True, 
                 uq_model__meta_train_data=(preprocessor.transform(X_meta_train), y_meta_train))

uq2_pred_test, uq2_pred_test_score = uq_pipeline2.predict(preprocessor.transform(X_test))

uq2_model_log_loss_val = log_loss(y_test, uq2_pred_test_score)
print(f"UQ Model Log Loss with Pre-Trained Base Model: {uq2_model_log_loss_val}")

UQ Model Log Loss with Pre-Trained Base Model: 2.3873139626865516
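If the score returned by the metamodel is a confidence that the predicted label is correct (which is how the UQ360 documentation describes `y_score`), it needs converting before it can stand in for the positive-class probability that `log_loss` expects. The sketch below shows one way to do that for a binary problem, on made-up arrays. It is an assumption about how one might evaluate such scores, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy values for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])                 # base model labels
y_score = np.array([0.9, 0.8, 0.3, 0.7, 0.4, 0.95])   # confidence y_pred is correct

# Option A: score the confidence against whether the label was actually correct.
is_correct = (y_pred == y_true).astype(int)
confidence_loss = log_loss(is_correct, y_score)

# Option B: convert the confidence into P(y = 1) and score against the labels.
# A confident "1" keeps the score; a confident "0" maps to 1 - score.
p_positive = np.where(y_pred == 1, y_score, 1.0 - y_score)
label_loss = log_loss(y_true, p_positive)

# Passing the raw confidence as if it were P(y = 1), for comparison.
naive_loss = log_loss(y_true, y_score)

print(f"confidence vs. correctness: {confidence_loss:.4f}")
print(f"converted P(y=1) vs. label: {label_loss:.4f}")
print(f"raw confidence vs. label:   {naive_loss:.4f}")
```

In the binary case options A and B give the same value, because a sample's loss is the negative log of the score when the label is right and of one minus the score when it is wrong, whichever way it is written. Scoring the raw confidence directly against the labels generally gives a different number.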


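# MetamodelClassification's Effect on Log Loss

Both cases report a lower log loss than the base model's 3.71: 2.35 when the metamodel is trained together with the base model (Case 1) and 2.39 when it is trained on top of a pre-fitted base model (Case 2). The base figure was computed from hard 0/1 labels rather than predicted probabilities, so the three numbers are not computed from the same kind of prediction.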