# Getting Started with the Kaggle TPS Challenge Using Tabular ML Toolkit

> A tutorial showcasing the tabular_ml_toolkit (tmlt) library on the Kaggle TPS Challenge, Nov 2021.

> tabular_ml_toolkit is a helper library to jumpstart your machine learning project on tabular (structured) data.

> It comes with model parallelism and cutting-edge hyperparameter search techniques.

> Under the hood, TMLT uses optuna, xgboost and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

### How to Best Use tabular_ml_toolkit

Start with your favorite model, then create **tmlt** with a single API call.

*Here we use XGBClassifier on the [Kaggle TPS Challenge (Nov 2021) data](https://www.kaggle.com/c/tabular-playground-series-nov-2021/data)*

In [1]:
from tabular_ml_toolkit.tmlt import *
from xgboost import XGBClassifier
import numpy as np



In [2]:
from sklearn.metrics import roc_auc_score, accuracy_score

In [3]:
# Dataset file names and Paths
DIRECTORY_PATH = "/Users/pankajmathur/kaggle_datasets/tps_nov_2021/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "kaggle_tps_output/"

#### Just point tmlt in the direction of your data

#### Tell it which columns are the index (idx) and the target in your tabular data

#### Tell it which problem type you are trying to solve

In [4]:
# create tmlt
tmlt = TMLT().prepare_data(
    train_file_path= DIRECTORY_PATH + TRAIN_FILE,
    test_file_path= DIRECTORY_PATH + TEST_FILE,
    # make sure to use the right index and target columns
    idx_col="id",
    target="target",
    random_state=42,
    problem_type="binary_classification", nrows=4000)


# supported problem types:
# "binary_classification"
# "multi_label_classification"
# "multi_class_classification"
# "regression"

2021-12-06 23:20:08,236 INFO 8 cores found, model and data parallel processing should worked!
2021-12-06 23:20:08,350 INFO DataFrame Memory usage decreased to 0.80 Mb (74.4% reduction)
2021-12-06 23:20:08,457 INFO DataFrame Memory usage decreased to 0.79 Mb (74.3% reduction)
2021-12-06 23:20:08,502 INFO PreProcessing will include target(s) encoding!
2021-12-06 23:20:08,503 INFO categorical columns are None, Preprocessing will done accordingly!


In [5]:
print(type(tmlt.dfl.X))
print(tmlt.dfl.X.shape)
print(type(tmlt.dfl.y))
print(tmlt.dfl.y.shape)
print(type(tmlt.dfl.X_test))
print(tmlt.dfl.X_test.shape)

<class 'pandas.core.frame.DataFrame'>
(4000, 100)
<class 'numpy.ndarray'>
(4000,)
<class 'pandas.core.frame.DataFrame'>
(4000, 100)


In [6]:
tmlt.dfl.X

id,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f90,f91,f92,f93,f94,f95,f96,f97,f98,f99
0,0.106628,3.593750,132.750000,3.183594,0.081970,1.188477,3.732422,2.265625,2.099609,0.012329,...,0.010742,1.098633,0.013329,-0.011719,0.052765,0.065430,4.210938,1.978516,0.085999,0.240479
1,0.125000,1.673828,76.562500,3.378906,0.099426,5.093750,1.275391,-0.471436,4.546875,0.037720,...,0.135864,3.460938,0.017059,0.124878,0.154053,0.606934,-0.267822,2.578125,-0.020874,0.024719
2,0.036316,1.497070,233.500000,2.195312,0.026917,3.126953,5.058594,3.849609,1.801758,0.057007,...,0.117310,4.882812,0.085205,0.032410,0.116089,-0.001689,-0.520020,2.140625,0.124451,0.148193
3,-0.014076,0.245972,780.000000,1.890625,0.006947,1.531250,2.697266,4.515625,4.503906,0.123474,...,-0.015350,3.474609,-0.017105,-0.008102,0.062012,0.041199,0.511719,1.968750,0.040009,0.044861
4,-0.003260,3.714844,156.125000,2.148438,0.018280,2.097656,4.156250,-0.038239,3.371094,0.034180,...,0.013779,1.910156,-0.042938,0.105591,0.125122,0.037506,1.043945,1.075195,-0.012817,0.072815
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,0.242188,2.324219,-19.109375,0.984375,0.036438,0.424561,2.269531,3.621094,4.062500,1.197266,...,-0.002506,3.064453,0.112427,0.100220,0.036530,0.451172,1.316406,4.625000,0.056183,0.029724
3996,0.138306,0.679199,37.125000,2.736328,-0.043549,0.514648,4.542969,3.132812,4.972656,0.097961,...,0.060730,4.125000,-0.031433,0.059143,0.164673,0.058075,-0.237427,2.123047,-0.049316,0.050842
3997,0.025436,1.316406,250.375000,3.689453,0.015312,2.490234,1.983398,3.556641,4.164062,0.156860,...,0.038269,4.667969,0.157593,0.102234,1.055664,0.031769,1.661133,1.484375,-0.027924,0.098083
3998,0.109253,2.169922,123.062500,3.279297,0.018204,3.630859,4.636719,4.507812,3.585938,0.037140,...,0.083191,3.623047,0.108765,0.111877,0.020645,0.125122,2.648438,2.753906,0.012726,0.035583


### Training


##### Create train and validation dataframes for quick preprocessing and training

In [7]:
%%time
# create a train/valid split to evaluate the model on the validation set
X_train, X_valid, y_train, y_valid = tmlt.dfl.create_train_valid(valid_size=0.2)

CPU times: user 5.46 ms, sys: 1.46 ms, total: 6.92 ms
Wall time: 6.05 ms


In [8]:
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

# print(X_train.columns.to_list())

(3200, 100)
(3200,)
(800, 100)
(800,)
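
For readers curious what an 80/20 hold-out split amounts to, here is a minimal NumPy sketch on synthetic data. It is not the implementation of `create_train_valid` (which may, for example, stratify on the target); the helper name `holdout_split` and the random arrays are assumptions made for illustration.

```python
import numpy as np

def holdout_split(X, y, valid_size=0.2, seed=42):
    """Shuffle row indices and hold out `valid_size` of them for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_valid = int(round(len(X) * valid_size))
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    return X[train_idx], X[valid_idx], y[train_idx], y[valid_idx]

# Synthetic stand-in for the 4000 x 100 training frame used above
X_demo = np.random.default_rng(0).normal(size=(4000, 100))
y_demo = np.random.default_rng(1).integers(0, 2, size=4000)

Xtr, Xva, ytr, yva = holdout_split(X_demo, y_demo)
print(Xtr.shape, ytr.shape, Xva.shape, yva.shape)
```

With `valid_size=0.2` and 4000 rows, the shapes match the ones printed by the notebook cell above.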


##### Now preprocess X_train and X_valid

NOTE: Preprocessing returns NumPy arrays, not pandas DataFrames.

In [9]:
%%time
X_train_np, X_valid_np = tmlt.pp_fit_transform(X_train, X_valid)

print(type(X_train_np))
print(X_train_np.shape)
# print(X_train_np)
print(type(X_valid_np))
print(X_valid_np.shape)
# print(X_valid_np)
print(type(y_valid))
print(type(y_train))

<class 'numpy.ndarray'>
(3200, 100)
<class 'numpy.ndarray'>
(800, 100)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
CPU times: user 24.3 ms, sys: 1.71 ms, total: 26 ms
Wall time: 24.7 ms
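
The key idea behind a fit/transform pair is that statistics are learned from the training split only and then reused on the validation split. The tutorial does not show which transforms tmlt applies internally, so the sketch below assumes plain standardization purely for illustration; `fit_scaler` and `apply_scaler` are hypothetical helpers, not part of tmlt.

```python
import numpy as np

def fit_scaler(X_train):
    """Learn per-column mean and std from the training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def apply_scaler(X, mean, std):
    """Apply previously learned statistics to any split."""
    return (X - mean) / std

rng = np.random.default_rng(42)
train_demo = rng.normal(loc=5.0, scale=2.0, size=(3200, 100))
valid_demo = rng.normal(loc=5.0, scale=2.0, size=(800, 100))

mean, std = fit_scaler(train_demo)
train_scaled = apply_scaler(train_demo, mean, std)
valid_scaled = apply_scaler(valid_demo, mean, std)  # reuses training statistics
print(type(train_scaled), train_scaled.shape, valid_scaled.shape)
```

Fitting on the training rows alone keeps information from the validation rows out of the model, which is what makes the validation score a fair estimate.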


#### Create a base XGB classifier model with your best-guess params

In [10]:
xgb_params = {
    # your best guess params
    'learning_rate':0.01,
    'eval_metric':'auc',
    # required for XGBClassifier, otherwise a warning is shown
    'use_label_encoder':False,
    # because 42 is the answer for all the randomness of this universe
    'random_state':42,
    #for GPU
    #'tree_method': 'gpu_hist',
    #'predictor': 'gpu_predictor',
}

xgb_model = XGBClassifier(**xgb_params)

In [11]:
%%time
# Now do model training
xgb_model.fit(X_train_np, y_train,
              verbose=False,
              #detect & avoid overfitting
              eval_set=[(X_train_np, y_train), (X_valid_np, y_valid)],
              eval_metric="auc",
              early_stopping_rounds=300
             )

#predict
preds = xgb_model.predict(X_valid_np)
preds_probs = xgb_model.predict_proba(X_valid_np)[:, 1]

# Metrics
auc = roc_auc_score(y_valid, preds_probs)
acc = accuracy_score(y_valid, preds)

print(f"AUC is : {auc} while Accuracy is : {acc} ")

AUC is : 0.6142302819132088 while Accuracy is : 0.6175 
CPU times: user 10.9 s, sys: 167 ms, total: 11.1 s
Wall time: 1.73 s
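
The two metrics printed above consume different inputs: ROC AUC ranks the predicted probabilities, while accuracy scores hard class labels. A tiny hand-checkable example (the four labels and probabilities are made up) shows the distinction:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

y_true = np.array([0, 0, 1, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8])   # predicted P(class 1)
labels = (probs >= 0.5).astype(int)        # hard predictions at a 0.5 threshold

auc_demo = roc_auc_score(y_true, probs)    # fraction of (pos, neg) pairs ranked correctly
acc_demo = accuracy_score(y_true, labels)  # fraction of labels that match
print(auc_demo, acc_demo)
```

Three of the four positive/negative pairs are ordered correctly, and three of the four thresholded labels are right, so both metrics happen to coincide here; on real data they usually differ.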


### Training Meta Ensemble Models

##### Make sure to preprocess the data

In [12]:
%%time
X_np, X_test_np = tmlt.pp_fit_transform(tmlt.dfl.X, tmlt.dfl.X_test)
y_np = tmlt.dfl.y

CPU times: user 273 ms, sys: 19.2 ms, total: 292 ms
Wall time: 45.2 ms


#### Base Model 1: linear SVM model

In [13]:
from sklearn.svm import LinearSVC

In [14]:
%%time

# OOF training and prediction on both train and test dataset by a given model
#choose model
linear_oof_model = LinearSVC(tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=42)

#fit and predict
linear_oof_model_preds, linear_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                                                    oof_model=linear_oof_model,
                                                                                    X = X_np,
                                                                                    y = y_np,
                                                                                    X_test = X_test_np)

if linear_oof_model_preds is not None:
    print(linear_oof_model_preds.shape)

if linear_oof_model_test_preds is not None:    
    print(linear_oof_model_test_preds.shape)

2021-12-06 23:20:10,385 INFO Training Started!
2021-12-06 23:20:10,418 INFO Training Finished!
2021-12-06 23:20:10,421 INFO fold: 1 OOF Model ROC AUC: 0.7257962604771115!
2021-12-06 23:20:10,426 INFO Training Started!
2021-12-06 23:20:10,453 INFO Training Finished!
2021-12-06 23:20:10,456 INFO fold: 2 OOF Model ROC AUC: 0.6957833655705996!
2021-12-06 23:20:10,461 INFO Training Started!
2021-12-06 23:20:10,492 INFO Training Finished!
2021-12-06 23:20:10,495 INFO fold: 3 OOF Model ROC AUC: 0.6616956802063185!
2021-12-06 23:20:10,500 INFO Training Started!
2021-12-06 23:20:10,538 INFO Training Finished!
2021-12-06 23:20:10,541 INFO fold: 4 OOF Model ROC AUC: 0.7076379002699064!
2021-12-06 23:20:10,546 INFO Training Started!
2021-12-06 23:20:10,580 INFO Training Finished!
2021-12-06 23:20:10,584 INFO fold: 5 OOF Model ROC AUC: 0.7227243154104317!
2021-12-06 23:20:10,588 INFO Mean OOF Model ROC AUC: 0.7027275043868735!


(4000,)
(4000,)
CPU times: user 793 ms, sys: 186 ms, total: 979 ms
Wall time: 206 ms
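
Out-of-fold (OOF) prediction is the mechanism that makes stacking safe: every training row is scored by a model that did not see it during fitting. The sketch below shows one common way to do this with scikit-learn. It is an assumption about the general technique, not the actual body of `do_oof_kfold_train_preds`; the helper `oof_predictions` and the synthetic data are hypothetical. Note that models without `predict_proba`, such as `LinearSVC`, would need `decision_function` instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def oof_predictions(model, X, y, X_test, n_splits=5, seed=42):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    oof = np.zeros(len(X))
    test_preds = np.zeros(len(X_test))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        # each training row is predicted by a model that never saw it
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        # test predictions are averaged over the fold models
        test_preds += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test_preds

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_test_demo = rng.normal(size=(200, 10))

oof_demo, test_demo = oof_predictions(LogisticRegression(), X_demo, y_demo, X_test_demo)
print(oof_demo.shape, test_demo.shape)
```

Both outputs are one-dimensional vectors with one value per row, which is why they can later be attached to `X` and `X_test` as single new columns.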


#### Base Model 2: Logistic Regression Model

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
%%time

# OOF training and prediction on both train and test dataset by a given model

#choose model
log_oof_model = LogisticRegression(solver='liblinear', random_state=42)

#fit and predict
log_oof_model_preds, log_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                                                    oof_model=log_oof_model,
                                                                                    X = X_np,
                                                                                    y = y_np,
                                                                                    X_test = X_test_np)
if log_oof_model_preds is not None:
    print(log_oof_model_preds.shape)

if log_oof_model_test_preds is not None:    
    print(log_oof_model_test_preds.shape)

2021-12-06 23:20:10,605 INFO Training Started!
2021-12-06 23:20:10,630 INFO Training Finished!
2021-12-06 23:20:10,633 INFO fold: 1 OOF Model ROC AUC: 0.7271695680206318!
2021-12-06 23:20:10,637 INFO Training Started!
2021-12-06 23:20:10,665 INFO Training Finished!
2021-12-06 23:20:10,667 INFO fold: 2 OOF Model ROC AUC: 0.694622823984526!
2021-12-06 23:20:10,672 INFO Training Started!
2021-12-06 23:20:10,699 INFO Training Finished!
2021-12-06 23:20:10,703 INFO fold: 3 OOF Model ROC AUC: 0.662114764667956!
2021-12-06 23:20:10,707 INFO Training Started!
2021-12-06 23:20:10,730 INFO Training Finished!
2021-12-06 23:20:10,733 INFO fold: 4 OOF Model ROC AUC: 0.7080501678057705!
2021-12-06 23:20:10,737 INFO Training Started!
2021-12-06 23:20:10,762 INFO Training Finished!
2021-12-06 23:20:10,764 INFO fold: 5 OOF Model ROC AUC: 0.7226276902067136!
2021-12-06 23:20:10,768 INFO Mean OOF Model ROC AUC: 0.7029170029371196!


(4000,)
(4000,)
CPU times: user 503 ms, sys: 116 ms, total: 618 ms
Wall time: 167 ms


#### Base Model 3: SKLearn MLP

In [17]:
from sklearn.neural_network import MLPClassifier

In [18]:
%%time

# OOF training and prediction on both train and test dataset by a given model

#choose model
mlp_oof_model = MLPClassifier(max_iter=1000, early_stopping=True)

#update the model on sklearn pipeline
# tmlt = tmlt.update_model(mlp_oof_model)

# # lets see updated sklearn pipeline with new model
# tmlt.spl

#fit and predict
mlp_oof_model_preds, mlp_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                                                    oof_model=mlp_oof_model,
                                                                                    X = X_np,
                                                                                    y = y_np,
                                                                                    X_test = X_test_np)
if mlp_oof_model_preds is not None:
    print(mlp_oof_model_preds.shape)

if mlp_oof_model_test_preds is not None:    
    print(mlp_oof_model_test_preds.shape)

2021-12-06 23:20:10,782 INFO Training Started!
2021-12-06 23:20:11,064 INFO Training Finished!
2021-12-06 23:20:11,067 INFO fold: 1 OOF Model ROC AUC: 0.7153836234687297!
2021-12-06 23:20:11,076 INFO Training Started!
2021-12-06 23:20:11,294 INFO Training Finished!
2021-12-06 23:20:11,297 INFO fold: 2 OOF Model ROC AUC: 0.6823726627981948!
2021-12-06 23:20:11,306 INFO Training Started!
2021-12-06 23:20:11,476 INFO Training Finished!
2021-12-06 23:20:11,479 INFO fold: 3 OOF Model ROC AUC: 0.6245067698259188!
2021-12-06 23:20:11,487 INFO Training Started!
2021-12-06 23:20:11,916 INFO Training Finished!
2021-12-06 23:20:11,920 INFO fold: 4 OOF Model ROC AUC: 0.6928606857812791!
2021-12-06 23:20:11,928 INFO Training Started!
2021-12-06 23:20:12,115 INFO Training Finished!
2021-12-06 23:20:12,119 INFO fold: 5 OOF Model ROC AUC: 0.6891696029992462!
2021-12-06 23:20:12,127 INFO Mean OOF Model ROC AUC: 0.6808586689746737!


(4000,)
(4000,)
CPU times: user 4.5 s, sys: 730 ms, total: 5.23 s
Wall time: 1.35 s


#### Base Model 4: TabNet

In [19]:
from pytorch_tabnet.tab_model import TabNetClassifier

In [20]:
%%time

# OOF training and prediction on both train and test dataset by a given model

#choose model
tabnet_oof_model = TabNetClassifier(optimizer_params=dict(lr=0.02), verbose=0)

#fit and predict
tabnet_oof_model_preds, tabnet_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                                                    oof_model=tabnet_oof_model,
                                                                                    X = X_np,
                                                                                    y = y_np,
                                                                                    X_test = X_test_np)

if tabnet_oof_model_preds is not None:
    print(tabnet_oof_model_preds.shape)

if tabnet_oof_model_test_preds is not None:
    print(tabnet_oof_model_test_preds.shape)

2021-12-06 23:20:12,142 INFO Training Started!
2021-12-06 23:20:20,277 INFO Training Finished!
2021-12-06 23:20:20,308 INFO fold: 1 OOF Model ROC AUC: 0.558401031592521!


Stop training because you reached max_epochs = 50 with best_epoch = 44 and best_val_0_auc = 0.5584
Best weights from best epoch are automatically used!


2021-12-06 23:20:20,423 INFO Training Started!
2021-12-06 23:20:22,374 INFO Training Finished!
2021-12-06 23:20:22,405 INFO fold: 2 OOF Model ROC AUC: 0.6584655061250806!



Early stopping occurred at epoch 11 with best_epoch = 1 and best_val_0_auc = 0.65847
Best weights from best epoch are automatically used!


2021-12-06 23:20:22,518 INFO Training Started!
2021-12-06 23:20:24,393 INFO Training Finished!
2021-12-06 23:20:24,425 INFO fold: 3 OOF Model ROC AUC: 0.6677756286266925!



Early stopping occurred at epoch 10 with best_epoch = 0 and best_val_0_auc = 0.66778
Best weights from best epoch are automatically used!


2021-12-06 23:20:24,539 INFO Training Started!
2021-12-06 23:20:26,372 INFO Training Finished!
2021-12-06 23:20:26,403 INFO fold: 4 OOF Model ROC AUC: 0.6324828168179388!



Early stopping occurred at epoch 10 with best_epoch = 0 and best_val_0_auc = 0.63248
Best weights from best epoch are automatically used!


2021-12-06 23:20:26,520 INFO Training Started!
2021-12-06 23:20:28,385 INFO Training Finished!
2021-12-06 23:20:28,416 INFO fold: 5 OOF Model ROC AUC: 0.6354975231739447!



Early stopping occurred at epoch 10 with best_epoch = 0 and best_val_0_auc = 0.6355
Best weights from best epoch are automatically used!


2021-12-06 23:20:28,533 INFO Mean OOF Model ROC AUC: 0.6305245012672356!


(4000,)
(4000,)
CPU times: user 17.6 s, sys: 5.93 s, total: 23.6 s
Wall time: 16.4 s


#### Now add the base models' predictions back to X and X_test

In [21]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["linear_preds"] = linear_oof_model_preds
tmlt.dfl.X_test["linear_preds"] = linear_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 101)
(4000, 101)


In [22]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["log_reg_preds"] = log_oof_model_preds
tmlt.dfl.X_test["log_reg_preds"] = log_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 102)
(4000, 102)


In [23]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["mlp_preds"] = mlp_oof_model_preds
tmlt.dfl.X_test["mlp_preds"] = mlp_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 103)
(4000, 103)


In [24]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["tabnet_preds"] = tabnet_oof_model_preds
tmlt.dfl.X_test["tabnet_preds"] = tabnet_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 104)
(4000, 104)
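
The same stacking step can be expressed on plain arrays: each base model contributes one extra feature column. This is a minimal sketch with random stand-in data (the array names are hypothetical), shown only to make the shape arithmetic explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
X_base = rng.normal(size=(4000, 100))             # stand-in for the original features
oof_cols = [rng.random(4000) for _ in range(4)]   # stand-ins for four base-model OOF vectors

# column_stack turns each 1-D vector into a new column next to the 2-D block
X_stacked = np.column_stack([X_base, *oof_cols])
print(X_stacked.shape)
```

Four base models add four columns, taking the feature count from 100 to 104 as in the cells above.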


In [25]:
# now just update the tmlt with this new X and X_test

In [26]:
tmlt = tmlt.update_dfl(X=tmlt.dfl.X, y=tmlt.dfl.y, X_test=tmlt.dfl.X_test)

2021-12-06 23:20:28,605 INFO categorical columns are None, Preprocessing will done accordingly!


#### Meta Model Training

##### Create train and validation dataframes for quick preprocessing and training

In [27]:
%%time
# create a train/valid split to evaluate the model on the validation set
X_train, X_valid, y_train, y_valid = tmlt.dfl.create_train_valid(valid_size=0.2)

print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

# print(X_train.columns.to_list())

(3200, 104)
(3200,)
(800, 104)
(800,)
CPU times: user 6.7 ms, sys: 1.61 ms, total: 8.31 ms
Wall time: 7.1 ms


##### Now preprocess X_train and X_valid

NOTE: Preprocessing returns NumPy arrays, not pandas DataFrames.

In [28]:
%%time
X_train_np, X_valid_np = tmlt.pp_fit_transform(X_train, X_valid)

print(type(X_train_np))
print(X_train_np.shape)
# print(X_train_np)
print(type(X_valid_np))
print(X_valid_np.shape)
# print(X_valid_np)
print(type(y_valid))
print(type(y_train))

<class 'numpy.ndarray'>
(3200, 104)
<class 'numpy.ndarray'>
(800, 104)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
CPU times: user 40 ms, sys: 3.39 ms, total: 43.4 ms
Wall time: 42.7 ms


In [29]:
# xgb_params = {
#     'objective': 'binary:logistic', 
#     'use_label_encoder': False,
#     'n_estimators': 40000,
#     'learning_rate': 0.18515462875481553,
#     'subsample': 0.97, 
#     'colsample_bytree': 0.32,
#     'max_depth': 1,
#     'booster': 'gbtree',
#     'gamma': 0.2, 
#     'tree_method': 'gpu_hist',
#     'reg_lambda': 0.11729916523488974, 
#     'reg_alpha': 0.6318827156945853,
#     'random_state': 42,
#     'n_jobs': 4, 
#     'min_child_weight': 256,
#     #for GPU
# #     'tree_method': 'gpu_hist',
# #     'predictor': 'gpu_predictor',
#     }

In [None]:
xgb_params = {
    'learning_rate': 0.21761562020600114,
    'eval_metric': 'auc',
    'use_label_encoder': False,
    'random_state': 42,
    'booster': 'gblinear',
    'colsample_bytree': 0.1027132584989078,
    'early_stopping_rounds': 171,
    'max_depth': 6,
    'n_estimators': 7000,
    'reg_alpha': 9.583579660175245e-06,
    'reg_lambda': 9.238315962782784e-05,
    'subsample': 0.4464473710560276,
    'tree_method': 'approx',
    # for GPU, use these instead of the 'tree_method' above
    #'tree_method': 'gpu_hist',
    #'predictor': 'gpu_predictor',
}

In [30]:
%%time
# Now refit the base xgb_model (created in In [10]) on the stacked features
xgb_model.fit(X_train_np, y_train,
              verbose=False,
              #detect & avoid overfitting
              eval_set=[(X_train_np, y_train), (X_valid_np, y_valid)],
              eval_metric="auc",
              early_stopping_rounds=300
             )

#predict
preds = xgb_model.predict(X_valid_np)
preds_probs = xgb_model.predict_proba(X_valid_np)[:, 1]

# Metrics
auc = roc_auc_score(y_valid, preds_probs)
acc = accuracy_score(y_valid, preds)

print(f"AUC is : {auc} while Accuracy is : {acc} ")

AUC is : 0.6986887604265654 while Accuracy is : 0.6825 
CPU times: user 11 s, sys: 121 ms, total: 11.1 s
Wall time: 1.56 s


### WOW!!!!

#### For the meta model, let's run an Optuna-based hyperparameter search to find the best params

In [31]:
# Make sure to supply an output directory path so the hyperparameter search is saved
study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH, opt_timeout=60)
print(study.best_trial)

2021-12-06 23:20:30,245 INFO Optimization Direction is: maximize
[I 2021-12-06 23:20:30,307] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.
2021-12-06 23:20:30,497 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, eval_set, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-06 23:20:37,693 INFO Training Ended!
2021-12-06 23:20:37,706 INFO roc_auc_score: 0.7720541653468482
2021-12-06 23:20:37,706 INFO log_loss: 0.5504097771365195
2021-12-06 23:20:37,707 INFO accuracy_score: 0.72375
2021-12-06 23:20:37,707 INFO f1_score: 0.6285714285714287
2021-12-06 23:20:37,708 INFO precision_score: 0.6515679442508711
2021-12-06 23:20:37,708 INFO recall_score: 0.6071428571428571
[I 2021-12-06 23:20:37,741] Trial 35 finished with value: 0.7720541653468482 and parameters: {'learning_rate': 0.06917055652243276, 'n_estimators': 7000, 'reg_lambda': 0.0005717720260170862, 'reg_alpha': 6.546458405891649e-07, 'subsample': 0.6290090458891053, 'colsample_bytree': 0.9155898469768312, 'max_depth': 7, 'early_stopping_rounds': 139, 'tree_method': 'approx', 'booster': 'gblinear'}. Best is trial 18 with value: 0.7734399746594869.
2021-12-06 23:20:37,923 INFO Training Started!


Parameters: { early_stopping_rounds, eval_set } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-06 23:21:17,526 INFO Training Ended!
2021-12-06 23:21:17,594 INFO roc_auc_score: 0.6885426565304614
2021-12-06 23:21:17,595 INFO log_loss: 0.6858780123793986
2021-12-06 23:21:17,595 INFO accuracy_score: 0.66875
2021-12-06 23:21:17,595 INFO f1_score: 0.5190562613430126
2021-12-06 23:21:17,596 INFO precision_score: 0.588477366255144
2021-12-06 23:21:17,596 INFO recall_score: 0.4642857142857143
[I 2021-12-06 23:21:17,637] Trial 36 finished with value: 0.6885426565304614 and parameters: {'learning_rate': 0.02265165070501582, 'n_estimators': 7000, 'reg_lambda': 2.624192127564028e-07, 'reg_alpha': 0.00036391611478680695, 'subsample': 0.5915030624370501, 'colsample_bytree': 0.23237799890226452, 'max_depth': 9, 'early_stopping_rounds': 164, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 0.740465140416424, 'grow_policy': 'depthwise'}. Best is trial 18 with value: 0.7734399746594869.
2021-12-06 23:21:17,776 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, eval_set, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-06 23:21:32,413 INFO Training Ended!
2021-12-06 23:21:32,425 INFO roc_auc_score: 0.7444303663815859
2021-12-06 23:21:32,426 INFO log_loss: 0.5777584500750527
2021-12-06 23:21:32,426 INFO accuracy_score: 0.7
2021-12-06 23:21:32,427 INFO f1_score: 0.574468085106383
2021-12-06 23:21:32,428 INFO precision_score: 0.6328125
2021-12-06 23:21:32,428 INFO recall_score: 0.525974025974026
[I 2021-12-06 23:21:32,452] Trial 37 finished with value: 0.7444303663815859 and parameters: {'learning_rate': 0.056158497794197924, 'n_estimators': 20000, 'reg_lambda': 0.010294961570250994, 'reg_alpha': 1.7037741384478345e-05, 'subsample': 0.4147626780645989, 'colsample_bytree': 0.397278539868545, 'max_depth': 6, 'early_stopping_rounds': 185, 'tree_method': 'approx', 'booster': 'gblinear'}. Best is trial 18 with value: 0.7734399746594869.


FrozenTrial(number=18, values=[0.7734399746594869], datetime_start=datetime.datetime(2021, 12, 6, 21, 57, 0, 88637), datetime_complete=datetime.datetime(2021, 12, 6, 21, 57, 4, 336934), params={'booster': 'gblinear', 'colsample_bytree': 0.1027132584989078, 'early_stopping_rounds': 171, 'learning_rate': 0.21761562020600114, 'max_depth': 6, 'n_estimators': 7000, 'reg_alpha': 9.583579660175245e-06, 'reg_lambda': 9.238315962782784e-05, 'subsample': 0.4464473710560276, 'tree_method': 'approx'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistribution(high=9, low=1, step=1), 'n_estimators': CategoricalDistribution(choices=(7000, 15000, 20000)), 'reg_alpha': LogUniformDistribution(high=100.0, low=1e-08), 'reg_lambda': LogUniformDistribut

##### Now update the meta model with the best params from the study, then update the sklearn pipeline with this new model

In [32]:
xgb_params.update(study.best_trial.params)
print("xgb_params", xgb_params)
updated_xgb_model = XGBClassifier(**xgb_params)

xgb_params {'learning_rate': 0.21761562020600114, 'eval_metric': 'auc', 'use_label_encoder': False, 'random_state': 42, 'booster': 'gblinear', 'colsample_bytree': 0.1027132584989078, 'early_stopping_rounds': 171, 'max_depth': 6, 'n_estimators': 7000, 'reg_alpha': 9.583579660175245e-06, 'reg_lambda': 9.238315962782784e-05, 'subsample': 0.4464473710560276, 'tree_method': 'approx'}
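
The `update` call above follows ordinary Python dictionary semantics: keys present in both dictionaries take the tuned value, and keys only in the base dictionary are kept. A small sketch with made-up values:

```python
base_params = {"learning_rate": 0.01, "eval_metric": "auc", "random_state": 42}
best_params = {"learning_rate": 0.2176, "booster": "gblinear"}  # hypothetical tuned values

merged = dict(base_params)   # copy so the defaults stay untouched
merged.update(best_params)   # tuned values win on overlapping keys
print(merged)
```

Copying first is optional, but it keeps the original defaults available if you want to compare a tuned model against the baseline later.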


#### Let's use K-Fold training with the best params

In [33]:
%%time
X_np, X_test_np = tmlt.pp_fit_transform(tmlt.dfl.X, tmlt.dfl.X_test)
y_np = tmlt.dfl.y

CPU times: user 493 ms, sys: 43.3 ms, total: 536 ms
Wall time: 83.6 ms


In [34]:
%%time
# k-fold training
xgb_model_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(X_np,
                                                                       y_np,
                                                                       X_test=X_test_np,
                                                                       n_splits=5,
                                                                       model=updated_xgb_model)

2021-12-06 23:21:32,574 INFO  model class:<class 'xgboost.sklearn.XGBClassifier'>
2021-12-06 23:21:32,578 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-06 23:21:42,349 INFO Training Finished!
2021-12-06 23:21:42,350 INFO Predicting Val Probablities!
2021-12-06 23:21:42,352 INFO Predicting Val Score!
2021-12-06 23:21:42,360 INFO fold: 1 roc_auc_score : 0.7861895551257254
2021-12-06 23:21:42,361 INFO fold: 1 log_loss : 0.5517830674474681
2021-12-06 23:21:42,361 INFO fold: 1 accuracy_score : 0.72125
2021-12-06 23:21:42,362 INFO fold: 1 f1_score : 0.6626323751891074
2021-12-06 23:21:42,363 INFO fold: 1 precision_score : 0.6616314199395771
2021-12-06 23:21:42,363 INFO fold: 1 recall_score : 0.6636363636363637
2021-12-06 23:21:42,364 INFO Predicting Test Probablities!
2021-12-06 23:21:42,370 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-06 23:21:51,670 INFO Training Finished!
2021-12-06 23:21:51,670 INFO Predicting Val Probablities!
2021-12-06 23:21:51,672 INFO Predicting Val Score!
2021-12-06 23:21:51,679 INFO fold: 2 roc_auc_score : 0.7962733720180528
2021-12-06 23:21:51,680 INFO fold: 2 log_loss : 0.5342360259930138
2021-12-06 23:21:51,680 INFO fold: 2 accuracy_score : 0.74125
2021-12-06 23:21:51,681 INFO fold: 2 f1_score : 0.6532663316582914
2021-12-06 23:21:51,682 INFO fold: 2 precision_score : 0.7303370786516854
2021-12-06 23:21:51,682 INFO fold: 2 recall_score : 0.5909090909090909
2021-12-06 23:21:51,683 INFO Predicting Test Probablities!
2021-12-06 23:21:51,688 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-06 23:22:01,036 INFO Training Finished!
2021-12-06 23:22:01,037 INFO Predicting Val Probablities!
2021-12-06 23:22:01,038 INFO Predicting Val Score!
  loss = -(transformed_labels * np.log(y_pred)).sum(axis=1)
2021-12-06 23:22:01,047 INFO fold: 3 roc_auc_score : 0.7733784655061251
2021-12-06 23:22:01,048 INFO fold: 3 log_loss : inf
2021-12-06 23:22:01,048 INFO fold: 3 accuracy_score : 0.72
2021-12-06 23:22:01,049 INFO fold: 3 f1_score : 0.643312101910828
2021-12-06 23:22:01,049 INFO fold: 3 precision_score : 0.6778523489932886
2021-12-06 23:22:01,050 INFO fold: 3 recall_score : 0.6121212121212121
2021-12-06 23:22:01,050 INFO Predicting Test Probablities!
2021-12-06 23:22:01,056 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-06 23:22:09,796 INFO Training Finished!
2021-12-06 23:22:09,797 INFO Predicting Val Probablities!
2021-12-06 23:22:09,799 INFO Predicting Val Score!
2021-12-06 23:22:09,806 INFO fold: 4 roc_auc_score : 0.789911040395777
2021-12-06 23:22:09,807 INFO fold: 4 log_loss : 0.5406639995649312
2021-12-06 23:22:09,808 INFO fold: 4 accuracy_score : 0.7425
2021-12-06 23:22:09,808 INFO fold: 4 f1_score : 0.6644951140065146
2021-12-06 23:22:09,809 INFO fold: 4 precision_score : 0.7208480565371025
2021-12-06 23:22:09,810 INFO fold: 4 recall_score : 0.6163141993957704
2021-12-06 23:22:09,810 INFO Predicting Test Probablities!
2021-12-06 23:22:09,816 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-06 23:22:19,205 INFO Training Finished!
2021-12-06 23:22:19,205 INFO Predicting Val Probablities!
2021-12-06 23:22:19,207 INFO Predicting Val Score!
2021-12-06 23:22:19,215 INFO fold: 5 roc_auc_score : 0.7993287769181713
2021-12-06 23:22:19,215 INFO fold: 5 log_loss : 0.5347861251386348
2021-12-06 23:22:19,216 INFO fold: 5 accuracy_score : 0.73375
2021-12-06 23:22:19,216 INFO fold: 5 f1_score : 0.6490939044481056
2021-12-06 23:22:19,217 INFO fold: 5 precision_score : 0.7137681159420289
2021-12-06 23:22:19,217 INFO fold: 5 recall_score : 0.595166163141994
2021-12-06 23:22:19,218 INFO Predicting Test Probablities!
2021-12-06 23:22:19,224 INFO  Mean Metrics Results from all Folds are: {'roc_auc_score': 0.7890162419927703, 'log_loss': inf, 'accuracy_score': 0.7317499999999999, 'f1_score': 0.6545599654425693, 'precision_score': 0.7008874040127365, 'recall_score': 0.6156294058408863}


CPU times: user 5min 36s, sys: 6.25 s, total: 5min 43s
Wall time: 46.7 s
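
The mean `log_loss` above is `inf` because fold 3 reported `inf`, and a single infinite fold makes the average infinite. A likely cause, inferred from the `np.log(y_pred)` warning line in the fold 3 output rather than confirmed by the tutorial, is a predicted probability of exactly 0 for the true class. The sketch below (a hypothetical `binary_log_loss` helper on made-up numbers) shows how that happens and how clipping avoids it.

```python
import numpy as np

def binary_log_loss(y_true, p, eps=None):
    """Mean negative log-likelihood; optionally clip p into [eps, 1 - eps]."""
    y_true = np.asarray(y_true)
    p = np.asarray(p, dtype=float)
    if eps is not None:
        p = np.clip(p, eps, 1 - eps)
    with np.errstate(divide="ignore"):
        # pick log(p) for positives and log(1 - p) for negatives
        log_lik = np.where(y_true == 1, np.log(p), np.log(1 - p))
    return -log_lik.mean()

y_demo = [1, 0, 1]
p_demo = [0.9, 0.2, 0.0]   # last row: true class given probability 0

raw_loss = binary_log_loss(y_demo, p_demo)
clipped_loss = binary_log_loss(y_demo, p_demo, eps=1e-15)
print(raw_loss, clipped_loss)
```

The other metrics in the summary are unaffected, so ROC AUC remains the more reliable number to compare across folds here.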


In [35]:
# predict on test dataset
if xgb_model_test_preds is not None:
    print(xgb_model_test_preds.shape)

(4000,)


In [36]:
# # take a weighted average of both k-fold models' predictions (weights sum to 1)
# final_preds = (0.45 * sci_model_preds) + (0.55 * xgb_model_test_preds)
# print(final_preds.shape)
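
A weighted blend like the one commented out above can be sketched as a convex combination. When the weights already sum to 1, no further division is needed. The `blend` helper and the sample vectors are hypothetical.

```python
import numpy as np

def blend(preds_a, preds_b, weight_a=0.45):
    """Convex combination of two prediction vectors; the weights sum to 1."""
    return weight_a * np.asarray(preds_a) + (1.0 - weight_a) * np.asarray(preds_b)

a = np.array([0.2, 0.8, 0.5])
b = np.array([0.4, 0.6, 0.5])
blended = blend(a, b)
print(blended)
```

Because the result is a convex combination, each blended value stays between the two input predictions for that row.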

#### Create Kaggle Predictions

In [37]:
# import pandas as pd
# sub = pd.read_csv(DIRECTORY_PATH + SAMPLE_SUB_FILE)
# sub['target'] = final_preds
# sub.to_csv('submission.csv', index=False)
