Improving Model Performance Using MACEst on the Bank Marketing Dataset 
===

This notebook tests whether MACEst can improve the ROC-AUC score of a trained model on the bank marketing dataset.
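
For reference, ROC-AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counting as half. Below is a minimal pure-Python sketch of that definition. The name `pairwise_roc_auc` is hypothetical and is not the helper this notebook imports, and the quadratic loop is only practical for small inputs.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")

    correctly_ranked = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correctly_ranked += 1.0
            elif p == n:
                correctly_ranked += 0.5  # ties count as half a correct ranking
    return correctly_ranked / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_roc_auc(labels, scores)  # 3 of the 4 pairs are ranked correctly
```

Because the metric depends only on how examples are ranked, any score that orders the positive class above the negative class can be evaluated this way.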

MACEst requires the data to be split into the following 'batches' (see [here](https://github.com/oracle/macest?tab=readme-ov-file#classification)); percentages are shares of the full dataset:
1. Regular Training Collection (34%)
2. Confidence Collection (66%)
   * conf_train (33%)
   * cal (33%)
      * cal_train (16.5%)
      * cal_test  (16.5%)

To compare the baseline model and the MACEst-augmented model fairly, both pipelines are fit on the same pool of data (everything except cal_test) and both are evaluated on the same held-out set, cal_test.
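
The percentages above follow from three nested splits. The sketch below reproduces that arithmetic in plain Python. The function name `nested_split_fractions` is hypothetical, and the default arguments mirror the `test_size` values used in the next section. Actual subset sizes will differ slightly because the splitter rounds to whole rows.

```python
def nested_split_fractions(first_test_size=0.66, second_test_size=0.5, third_test_size=0.5):
    """Return each subset's share of the full dataset after three nested splits."""
    regular_train = 1.0 - first_test_size
    conf = first_test_size                        # everything reserved for MACEst
    conf_train = conf * (1.0 - second_test_size)
    cal = conf * second_test_size
    cal_train = cal * (1.0 - third_test_size)
    cal_test = cal * third_test_size
    return {
        "regular_train": regular_train,
        "conf_train": conf_train,
        "cal_train": cal_train,
        "cal_test": cal_test,
    }


fractions = nested_split_fractions()
# Share of the data available for fitting (everything except cal_test).
fit_share = fractions["regular_train"] + fractions["conf_train"] + fractions["cal_train"]
```

With the defaults, roughly 83.5% of the data is available for fitting and 16.5% is held out for testing.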

# Data Loading, Preprocessing and Splitting

In [2]:
from data_acquisition.bank_marketing_loading import load_bank_marketing_dataset
from modeling.xgboost_bank_marketing_impl import preprocess_data
from sklearn.model_selection import train_test_split

bank_marketing_raw_data_df = load_bank_marketing_dataset()

data_X, data_y = preprocess_data(bank_marketing_raw_data_df)

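# Split 1: 34% regular training data, 66% reserved for MACEst.
X_regular_train, X_conf, y_regular_train, y_conf = train_test_split(
    data_X, data_y, stratify=data_y, test_size=0.66, random_state=10)

# Split 2: halve the MACEst data into conf_train (33%) and cal (33%).
X_conf_train, X_cal, y_conf_train, y_cal = train_test_split(
    X_conf, y_conf, stratify=y_conf, test_size=0.5, random_state=0)

# Split 3: halve cal into cal_train (16.5%) and cal_test (16.5%).
X_cal_train, X_cal_test, y_cal_train, y_cal_test = train_test_split(
    X_cal, y_cal, stratify=y_cal, test_size=0.5, random_state=0)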

# Training Probabilistic XGBoost on the Bank Marketing Dataset

In [3]:
import xgboost
import pandas as pd
from modeling.xgboost_bank_marketing_impl import train_model, evaluate_predictions_roc_auc_score

# Baseline: train on all the data the MACEst pipeline uses for fitting
# (regular_train + conf_train + cal_train).
no_macest_train_X = pd.concat([X_regular_train, X_conf_train, X_cal_train])
no_macest_train_y = pd.concat([y_regular_train, y_conf_train, y_cal_train])

no_macest_xgboost = train_model(no_macest_train_X, no_macest_train_y)

# Evaluate on cal_test, the same test set used for the MACEst pipeline.
no_macest_test_preds = no_macest_xgboost.predict(xgboost.DMatrix(X_cal_test))
no_macest_roc_auc_score_val = evaluate_predictions_roc_auc_score(y_cal_test, no_macest_test_preds)
print(f"ROC AUC Score for XGBoost without MACEst: {no_macest_roc_auc_score_val}")

Begin training
Parameters: { "max_dept", "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Finished training
ROC AUC Score for XGBoost without MACEst: 0.9284442313798231
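
The warning in the output above lists `max_dept` and `silent` as parameters XGBoost did not recognize. The tree-depth parameter is spelled `max_depth`, and `silent` has been superseded by `verbosity`, so the model was most likely trained with the default depth. The parameters used inside `train_model` are not shown in this notebook, so the following is only a hypothetical sketch of how such keys could be normalized; `normalize_xgboost_params` and the example values are placeholders, not the notebook's real settings.

```python
# Misspelled key seen in the warning, mapped to the real XGBoost parameter name.
KNOWN_RENAMES = {"max_dept": "max_depth"}


def normalize_xgboost_params(params):
    """Return a copy of params with the misspelled and deprecated keys fixed."""
    fixed = {}
    for key, value in params.items():
        if key == "silent":
            # `silent` was replaced by `verbosity` (0 = silent, 1 = warnings).
            fixed.setdefault("verbosity", 0 if value else 1)
        else:
            fixed[KNOWN_RENAMES.get(key, key)] = value
    return fixed


# Placeholder values for illustration only.
example_params = {"max_dept": 6, "silent": 1, "objective": "binary:logistic"}
fixed_params = normalize_xgboost_params(example_params)
```

If the intended depth differs from XGBoost's default, correcting the key would change both models, so the two ROC-AUC scores should be regenerated together.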


# Training Probabilistic XGBoost on the Bank Marketing Dataset with MACEst

In [4]:
from modeling.xgboost_bank_marketing_impl import _make_predict_wrapper
from macest.classification import models as cl_mod

point_pred_model = train_model(X_regular_train, y_regular_train)
# print("Finished training")
# NOTE: the wrapper below is currently disabled. It is meant to adapt
# point_pred_model.predict to the np.ndarray input that ModelWithConfidence
# passes in; without it, the Booster receives non-DMatrix input.
# point_pred_model.predict = _make_predict_wrapper(point_pred_model)
macest_model = cl_mod.ModelWithConfidence(point_pred_model,
                                          X_conf_train,
                                          y_conf_train)

macest_model.fit(X_cal_train, y_cal_train)

macest_test_preds = macest_model.predict_confidence_of_point_prediction(X_cal_test)

macest_roc_auc_score_val = evaluate_predictions_roc_auc_score(y_cal_test, macest_test_preds)
print(f"ROC AUC Score for XGBoost with MACEst: {macest_roc_auc_score_val}")

Begin training
Parameters: { "max_dept", "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Finished training


TypeError: ('Expecting data to be a DMatrix object, got: ', <class 'pandas.core.frame.DataFrame'>)
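
The `TypeError` above is raised because `ModelWithConfidence` calls `point_pred_model.predict` directly on the data it was given, while an XGBoost `Booster` only accepts a `DMatrix`. One possible way around this is a thin adapter, sketched below with hypothetical names (`ThresholdedPredictAdapter`, `positive_proba_fn`). It is not the notebook's `_make_predict_wrapper`, whose implementation is not shown here. The sketch assumes the booster outputs positive-class probabilities (for example with a `binary:logistic` objective) and that MACEst expects NumPy arrays in and class labels out. A stub stands in for the booster so the example runs without XGBoost.

```python
import numpy as np


class ThresholdedPredictAdapter:
    """Expose a NumPy-in, class-labels-out `predict` around a probability model."""

    def __init__(self, positive_proba_fn, threshold=0.5):
        # positive_proba_fn: callable mapping a 2-D float array to P(y = 1) per row.
        self._positive_proba_fn = positive_proba_fn
        self.threshold = threshold

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        proba = np.asarray(self._positive_proba_fn(X), dtype=float)
        return (proba >= self.threshold).astype(int)


# With a real Booster the callable could be (assumption: binary:logistic objective):
#     lambda X: booster.predict(xgboost.DMatrix(X))
# Here a stub stands in for the booster: a logistic function of the first feature.
def stub_positive_proba(X):
    return 1.0 / (1.0 + np.exp(-X[:, 0]))


adapter = ThresholdedPredictAdapter(stub_positive_proba)
labels = adapter.predict(np.array([[-2.0, 1.0], [0.5, 3.0], [4.0, 0.0]]))
```

The data passed to `ModelWithConfidence` and to `fit` would likely need the same conversion, for example with `X_conf_train.to_numpy()`. Note also that `predict_confidence_of_point_prediction` returns the confidence in whichever class was predicted rather than a positive-class score, so a ROC-AUC computed on it is not directly comparable with the baseline. MACEst's `predict_proba`, if available in the installed version, may be a closer match and is worth checking against the MACEst documentation.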