# Getting Started with the Kaggle TPS Challenge Using Tabular ML Toolkit

> A tutorial showcasing the tabular_ml_toolkit (tmlt) library on the Kaggle TPS Challenge, Nov 2021.

> tabular_ml_toolkit is a helper library to jumpstart your machine learning project based on tabular or structured data.

> It comes with model parallelism and cutting-edge hyperparameter search techniques.

> Under the hood, TMLT uses optuna, xgboost and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

### How to Best Use tabular_ml_toolkit

Start with your favorite model, then create **tmlt** with a single API call.

*Here we are using XGBClassifier on the [Kaggle TPS Challenge (Nov 2021) data](https://www.kaggle.com/c/tabular-playground-series-nov-2021/data).*

In [1]:
from tabular_ml_toolkit.tmlt import *
from xgboost import XGBClassifier
import numpy as np



In [2]:
from sklearn.metrics import roc_auc_score, accuracy_score

In [3]:
# Dataset file names and Paths
DIRECTORY_PATH = "/Users/pankajmathur/kaggle_datasets/tps_nov_2021/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "kaggle_tps_output/"

#### Just point tmlt in the direction of your data

#### Tell it which columns are the index (idx) and the target in your tabular data

#### Tell it what kind of problem you are trying to solve

In [4]:
# create tmlt
tmlt = TMLT().prepare_data(
    train_file_path= DIRECTORY_PATH + TRAIN_FILE,
    test_file_path= DIRECTORY_PATH + TEST_FILE,
    # make sure to use the right index and target columns
    idx_col="id",
    target="target",
    random_state=42,
    problem_type="binary_classification",
    nrows=4000
)


# supported problem types:
# "binary_classification"
# "multi_label_classification"
# "multi_class_classification"
# "regression"

2021-12-09 23:58:19,951 INFO 8 cores found, model and data parallel processing should worked!
2021-12-09 23:58:20,069 INFO DataFrame Memory usage decreased to 0.80 Mb (74.4% reduction)
2021-12-09 23:58:20,176 INFO DataFrame Memory usage decreased to 0.79 Mb (74.3% reduction)
2021-12-09 23:58:20,222 INFO PreProcessing will include target(s) encoding!
2021-12-09 23:58:20,222 INFO categorical columns are None, Preprocessing will done accordingly!


In [5]:
print(type(tmlt.dfl.X))
print(tmlt.dfl.X.shape)
print(type(tmlt.dfl.y))
print(tmlt.dfl.y.shape)
print(type(tmlt.dfl.X_test))
print(tmlt.dfl.X_test.shape)

<class 'pandas.core.frame.DataFrame'>
(4000, 100)
<class 'numpy.ndarray'>
(4000,)
<class 'pandas.core.frame.DataFrame'>
(4000, 100)


In [6]:
tmlt.dfl.X

id,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f90,f91,f92,f93,f94,f95,f96,f97,f98,f99
0,0.106628,3.593750,132.750000,3.183594,0.081970,1.188477,3.732422,2.265625,2.099609,0.012329,...,0.010742,1.098633,0.013329,-0.011719,0.052765,0.065430,4.210938,1.978516,0.085999,0.240479
1,0.125000,1.673828,76.562500,3.378906,0.099426,5.093750,1.275391,-0.471436,4.546875,0.037720,...,0.135864,3.460938,0.017059,0.124878,0.154053,0.606934,-0.267822,2.578125,-0.020874,0.024719
2,0.036316,1.497070,233.500000,2.195312,0.026917,3.126953,5.058594,3.849609,1.801758,0.057007,...,0.117310,4.882812,0.085205,0.032410,0.116089,-0.001689,-0.520020,2.140625,0.124451,0.148193
3,-0.014076,0.245972,780.000000,1.890625,0.006947,1.531250,2.697266,4.515625,4.503906,0.123474,...,-0.015350,3.474609,-0.017105,-0.008102,0.062012,0.041199,0.511719,1.968750,0.040009,0.044861
4,-0.003260,3.714844,156.125000,2.148438,0.018280,2.097656,4.156250,-0.038239,3.371094,0.034180,...,0.013779,1.910156,-0.042938,0.105591,0.125122,0.037506,1.043945,1.075195,-0.012817,0.072815
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,0.242188,2.324219,-19.109375,0.984375,0.036438,0.424561,2.269531,3.621094,4.062500,1.197266,...,-0.002506,3.064453,0.112427,0.100220,0.036530,0.451172,1.316406,4.625000,0.056183,0.029724
3996,0.138306,0.679199,37.125000,2.736328,-0.043549,0.514648,4.542969,3.132812,4.972656,0.097961,...,0.060730,4.125000,-0.031433,0.059143,0.164673,0.058075,-0.237427,2.123047,-0.049316,0.050842
3997,0.025436,1.316406,250.375000,3.689453,0.015312,2.490234,1.983398,3.556641,4.164062,0.156860,...,0.038269,4.667969,0.157593,0.102234,1.055664,0.031769,1.661133,1.484375,-0.027924,0.098083
3998,0.109253,2.169922,123.062500,3.279297,0.018204,3.630859,4.636719,4.507812,3.585938,0.037140,...,0.083191,3.623047,0.108765,0.111877,0.020645,0.125122,2.648438,2.753906,0.012726,0.035583


### Training


##### Create train and validation dataframes for quick preprocessing and training

In [7]:
%%time
# create a train/valid split to evaluate the model on the validation dataset
X_train, X_valid,  y_train, y_valid =  tmlt.dfl.create_train_valid(valid_size=0.2)

CPU times: user 6.41 ms, sys: 1.83 ms, total: 8.24 ms
Wall time: 6.76 ms


In [8]:
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

# print(X_train.columns.to_list())

(3200, 100)
(3200,)
(800, 100)
(800,)
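
As a rough picture of what an 80/20 hold-out split produces, here is a minimal sketch using scikit-learn's `train_test_split` on synthetic data of the same shape. It is not necessarily how `create_train_valid` is implemented inside tmlt; the synthetic data and the `stratify` choice are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tmlt.dfl.X / tmlt.dfl.y (4000 rows, 100 features)
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(4000, 100)), columns=[f"f{i}" for i in range(100)])
y = rng.integers(0, 2, size=4000)

# Hold out 20% of the rows for validation; stratify keeps the class ratio similar
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_valid.shape)
```

With 4000 rows and `test_size=0.2`, the split yields 3200 training rows and 800 validation rows, matching the shapes printed above.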


##### Now preprocess X_train and X_valid

NOTE: Preprocessing takes pandas DataFrames and returns NumPy arrays.

In [9]:
%%time
X_train_np,  X_valid_np = tmlt.pp_fit_transform(X_train, X_valid)

print(type(X_train_np))
print(X_train_np.shape)
# print(X_train_np)
print(type(X_valid_np))
print(X_valid_np.shape)
# print(X_valid_np)
print(type(y_valid))
print(type(y_train))

<class 'numpy.ndarray'>
(3200, 100)
<class 'numpy.ndarray'>
(800, 100)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
CPU times: user 24.3 ms, sys: 1.62 ms, total: 25.9 ms
Wall time: 24.7 ms
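
The fit-on-train, transform-on-valid pattern can be sketched with a plain scikit-learn pipeline. The imputer and scaler below are assumptions chosen for illustration; the actual steps inside `pp_fit_transform` may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(8, 3)), columns=["f0", "f1", "f2"])
valid_df = pd.DataFrame(rng.normal(size=(4, 3)), columns=["f0", "f1", "f2"])
train_df.iloc[0, 1] = np.nan  # a missing value for the imputer to fill

# Hypothetical numeric preprocessing: median imputation followed by scaling
numeric_pp = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Statistics are learned from the training rows only, then reused on validation rows
train_np = numeric_pp.fit_transform(train_df)
valid_np = numeric_pp.transform(valid_df)

print(type(train_np), train_np.shape, valid_np.shape)
```

Fitting only on the training rows keeps validation statistics out of the transform, which is why the two frames are passed separately.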


#### Create a base xgb classifier model with your best guess params

In [10]:
xgb_params = {
    # your best guess params
    'learning_rate':0.01,
    'eval_metric':'auc',
    # required for XGBClassifier, otherwise a warning is shown
    'use_label_encoder':False,
    # because 42 is the answer for all the randomness of this universe
    'random_state':42,
    #for GPU
    #'tree_method': 'gpu_hist',
    #'predictor': 'gpu_predictor',
}

xgb_model = XGBClassifier(**xgb_params)

In [11]:
%%time
# Now do model training
xgb_model.fit(X_train_np, y_train,
              verbose=False,
              #detect & avoid overfitting
              eval_set=[(X_train_np, y_train), (X_valid_np, y_valid)],
              eval_metric="auc",
              early_stopping_rounds=300
             )

#predict
preds = xgb_model.predict(X_valid_np)
preds_probs = xgb_model.predict_proba(X_valid_np)[:, 1]

# Metrics
auc = roc_auc_score(y_valid, preds_probs)
acc = accuracy_score(y_valid, preds)

print(f"AUC is : {auc} while Accuracy is : {acc} ")

AUC is : 0.6142302819132088 while Accuracy is : 0.6175 
CPU times: user 10.6 s, sys: 123 ms, total: 10.8 s
Wall time: 1.51 s
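
The two metrics above are computed from different inputs: AUC from the predicted probabilities, accuracy from the hard labels. A tiny hand-checkable sketch, with made-up scores and an assumed 0.5 threshold:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
probs = np.array([0.1, 0.4, 0.35, 0.8])  # predicted P(class == 1), made-up values
labels = (probs >= 0.5).astype(int)      # hard labels at a 0.5 threshold

auc = roc_auc_score(y_true, probs)       # depends only on how the scores rank the rows
acc = accuracy_score(y_true, labels)     # depends only on the thresholded labels

print(auc, acc)
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75, and three of the four hard labels are correct, so the accuracy is also 0.75.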


### Meta Ensemble Model Training

##### Make sure to preprocess the data

In [12]:
%%time
X_np, X_test_np = tmlt.pp_fit_transform(tmlt.dfl.X, tmlt.dfl.X_test)
y_np = tmlt.dfl.y

CPU times: user 270 ms, sys: 23.6 ms, total: 293 ms
Wall time: 46.9 ms


#### Base Model 1: linear SVM model

In [13]:
from sklearn.svm import LinearSVC

In [14]:
%%time

# Out-of-fold (OOF) training and prediction on both the train and test datasets with a given model
#choose model
linear_oof_model = LinearSVC(tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=42)

#fit and predict
linear_oof_model_preds, linear_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                                                    oof_model=linear_oof_model,
                                                                                    X = X_np,
                                                                                    y = y_np,
                                                                                    X_test = X_test_np)

if linear_oof_model_preds is not None:
    print(linear_oof_model_preds.shape)

if linear_oof_model_test_preds is not None:    
    print(linear_oof_model_test_preds.shape)

2021-12-09 23:58:25,808 INFO Training Started!
2021-12-09 23:58:25,840 INFO Training Finished!
2021-12-09 23:58:25,844 INFO fold: 1 OOF Model ROC AUC: 0.7257962604771115!
2021-12-09 23:58:25,848 INFO Training Started!
2021-12-09 23:58:25,881 INFO Training Finished!
2021-12-09 23:58:25,884 INFO fold: 2 OOF Model ROC AUC: 0.6957833655705996!
2021-12-09 23:58:25,890 INFO Training Started!
2021-12-09 23:58:25,920 INFO Training Finished!
2021-12-09 23:58:25,923 INFO fold: 3 OOF Model ROC AUC: 0.6616956802063185!
2021-12-09 23:58:25,928 INFO Training Started!
2021-12-09 23:58:25,963 INFO Training Finished!
2021-12-09 23:58:25,965 INFO fold: 4 OOF Model ROC AUC: 0.7076379002699064!
2021-12-09 23:58:25,969 INFO Training Started!
2021-12-09 23:58:25,994 INFO Training Finished!
2021-12-09 23:58:25,996 INFO fold: 5 OOF Model ROC AUC: 0.7227243154104317!
2021-12-09 23:58:26,000 INFO Mean OOF Model ROC AUC: 0.7027275043868735!


(4000,)
(4000,)
CPU times: user 501 ms, sys: 114 ms, total: 615 ms
Wall time: 194 ms
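
For readers new to out-of-fold (OOF) predictions, the idea can be sketched with scikit-learn alone. `oof_predict` below is a hypothetical helper written for this illustration, not tmlt's implementation of `do_oof_kfold_train_preds`; the synthetic data and the use of `predict_proba` are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def oof_predict(model, X, y, X_test, n_splits=5, seed=42):
    """Hypothetical helper: return (OOF train preds, fold-averaged test preds)."""
    oof = np.zeros(len(X))
    test_preds = np.zeros(len(X_test))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        # each training row is scored by a model that never saw it during fitting
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        # test rows are scored by every fold model, then averaged
        test_preds += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test_preds


X_all, y_all = make_classification(n_samples=600, n_features=20, random_state=0)
X_demo, y_demo, X_test_demo = X_all[:500], y_all[:500], X_all[500:]

oof, test_preds = oof_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo, X_test_demo
)
print(oof.shape, test_preds.shape, roc_auc_score(y_demo, oof))
```

Because every training row gets a prediction from a model that was not fitted on it, the OOF column can later be fed to a meta model with less risk of leaking the target. Models without `predict_proba`, such as `LinearSVC`, would need `decision_function` in this sketch instead.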


#### Base Model 2: Logistic Regression Model

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
%%time

# OOF training and prediction on both the train and test datasets with a given model

#choose model
log_oof_model = LogisticRegression(solver='liblinear', random_state=42)

#fit and predict
log_oof_model_preds, log_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                                                    oof_model=log_oof_model,
                                                                                    X = X_np,
                                                                                    y = y_np,
                                                                                    X_test = X_test_np)
if log_oof_model_preds is not None:
    print(log_oof_model_preds.shape)

if log_oof_model_test_preds is not None:    
    print(log_oof_model_test_preds.shape)

2021-12-09 23:58:27,128 INFO Training Started!
2021-12-09 23:58:27,157 INFO Training Finished!
2021-12-09 23:58:27,160 INFO fold: 1 OOF Model ROC AUC: 0.7271695680206318!
2021-12-09 23:58:27,165 INFO Training Started!
2021-12-09 23:58:27,196 INFO Training Finished!
2021-12-09 23:58:27,199 INFO fold: 2 OOF Model ROC AUC: 0.694622823984526!
2021-12-09 23:58:27,204 INFO Training Started!
2021-12-09 23:58:27,233 INFO Training Finished!
2021-12-09 23:58:27,236 INFO fold: 3 OOF Model ROC AUC: 0.662114764667956!
2021-12-09 23:58:27,241 INFO Training Started!
2021-12-09 23:58:27,264 INFO Training Finished!
2021-12-09 23:58:27,267 INFO fold: 4 OOF Model ROC AUC: 0.7080501678057705!
2021-12-09 23:58:27,271 INFO Training Started!
2021-12-09 23:58:27,298 INFO Training Finished!
2021-12-09 23:58:27,301 INFO fold: 5 OOF Model ROC AUC: 0.7226276902067136!
2021-12-09 23:58:27,304 INFO Mean OOF Model ROC AUC: 0.7029170029371196!


(4000,)
(4000,)
CPU times: user 469 ms, sys: 102 ms, total: 571 ms
Wall time: 179 ms


#### Base Model 3: SKLearn MLP

In [17]:
from sklearn.neural_network import MLPClassifier

In [18]:
%%time

# OOF training and prediction on both the train and test datasets with a given model

#choose model
mlp_oof_model = MLPClassifier(max_iter=1000, early_stopping=True)

#update the model on sklearn pipeline
# tmlt = tmlt.update_model(mlp_oof_model)

# # lets see updated sklearn pipeline with new model
# tmlt.spl

#fit and predict
mlp_oof_model_preds, mlp_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                                                    oof_model=mlp_oof_model,
                                                                                    X = X_np,
                                                                                    y = y_np,
                                                                                    X_test = X_test_np)
if mlp_oof_model_preds is not None:
    print(mlp_oof_model_preds.shape)

if mlp_oof_model_test_preds is not None:    
    print(mlp_oof_model_test_preds.shape)

2021-12-09 23:58:30,843 INFO Training Started!
2021-12-09 23:58:31,114 INFO Training Finished!
2021-12-09 23:58:31,118 INFO fold: 1 OOF Model ROC AUC: 0.7148613797549968!
2021-12-09 23:58:31,127 INFO Training Started!
2021-12-09 23:58:31,448 INFO Training Finished!
2021-12-09 23:58:31,452 INFO fold: 2 OOF Model ROC AUC: 0.6774081237911025!
2021-12-09 23:58:31,460 INFO Training Started!
2021-12-09 23:58:31,653 INFO Training Finished!
2021-12-09 23:58:31,656 INFO fold: 3 OOF Model ROC AUC: 0.6363958736299162!
2021-12-09 23:58:31,665 INFO Training Started!
2021-12-09 23:58:31,868 INFO Training Finished!
2021-12-09 23:58:31,871 INFO fold: 4 OOF Model ROC AUC: 0.6835524578230986!
2021-12-09 23:58:31,880 INFO Training Started!
2021-12-09 23:58:32,085 INFO Training Finished!
2021-12-09 23:58:32,088 INFO fold: 5 OOF Model ROC AUC: 0.6923646764021928!
2021-12-09 23:58:32,097 INFO Mean OOF Model ROC AUC: 0.6809165022802615!


(4000,)
(4000,)
CPU times: user 4.09 s, sys: 708 ms, total: 4.8 s
Wall time: 1.26 s


#### Base Model 4: TabNet

In [19]:
from pytorch_tabnet.tab_model import TabNetClassifier

In [20]:
%%time

# OOF training and prediction on both the train and test datasets with a given model

#choose model
tabnet_oof_model = TabNetClassifier(optimizer_params=dict(lr=0.02), verbose=0)

#fit and predict
tabnet_oof_model_preds, tabnet_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                                                    oof_model=tabnet_oof_model,
                                                                                    X = X_np,
                                                                                    y = y_np,
                                                                                    X_test = X_test_np)

if tabnet_oof_model_preds is not None:
    print(tabnet_oof_model_preds.shape)

if tabnet_oof_model_test_preds is not None:
    print(tabnet_oof_model_test_preds.shape)

2021-12-09 23:58:34,242 INFO Training Started!
2021-12-09 23:58:42,305 INFO Training Finished!
2021-12-09 23:58:42,335 INFO fold: 1 OOF Model ROC AUC: 0.558401031592521!


Stop training because you reached max_epochs = 50 with best_epoch = 44 and best_val_0_auc = 0.5584
Best weights from best epoch are automatically used!


2021-12-09 23:58:42,448 INFO Training Started!
2021-12-09 23:58:44,406 INFO Training Finished!
2021-12-09 23:58:44,437 INFO fold: 2 OOF Model ROC AUC: 0.6584655061250806!



Early stopping occurred at epoch 11 with best_epoch = 1 and best_val_0_auc = 0.65847
Best weights from best epoch are automatically used!


2021-12-09 23:58:44,547 INFO Training Started!
2021-12-09 23:58:46,455 INFO Training Finished!
2021-12-09 23:58:46,487 INFO fold: 3 OOF Model ROC AUC: 0.6677756286266925!



Early stopping occurred at epoch 10 with best_epoch = 0 and best_val_0_auc = 0.66778
Best weights from best epoch are automatically used!


2021-12-09 23:58:46,604 INFO Training Started!
2021-12-09 23:58:48,428 INFO Training Finished!
2021-12-09 23:58:48,458 INFO fold: 4 OOF Model ROC AUC: 0.6324828168179388!



Early stopping occurred at epoch 10 with best_epoch = 0 and best_val_0_auc = 0.63248
Best weights from best epoch are automatically used!


2021-12-09 23:58:48,572 INFO Training Started!
2021-12-09 23:58:50,446 INFO Training Finished!
2021-12-09 23:58:50,477 INFO fold: 5 OOF Model ROC AUC: 0.6354975231739447!



Early stopping occurred at epoch 10 with best_epoch = 0 and best_val_0_auc = 0.6355
Best weights from best epoch are automatically used!


2021-12-09 23:58:50,592 INFO Mean OOF Model ROC AUC: 0.6305245012672356!


(4000,)
(4000,)
CPU times: user 17.5 s, sys: 5.88 s, total: 23.4 s
Wall time: 16.4 s


#### Now add the base models' predictions back to X and X_test

In [21]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["linear_preds"] = linear_oof_model_preds
tmlt.dfl.X_test["linear_preds"] = linear_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 101)
(4000, 101)


In [22]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["log_reg_preds"] = log_oof_model_preds
tmlt.dfl.X_test["log_reg_preds"] = log_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 102)
(4000, 102)


In [23]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["mlp_preds"] = mlp_oof_model_preds
tmlt.dfl.X_test["mlp_preds"] = mlp_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 103)
(4000, 103)


In [24]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["tabnet_preds"] = tabnet_oof_model_preds
tmlt.dfl.X_test["tabnet_preds"] = tabnet_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 104)
(4000, 104)
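
The four cells above repeat the same two assignments. A compact way to express the same stacking step, sketched here on small synthetic frames with hypothetical prediction arrays, is to loop over a dictionary of base-model outputs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(6, 3)), columns=["f0", "f1", "f2"])
X_test_demo = pd.DataFrame(rng.normal(size=(4, 3)), columns=["f0", "f1", "f2"])

# Hypothetical base-model outputs: (OOF preds for train rows, averaged preds for test rows)
base_preds = {
    "linear_preds": (rng.random(6), rng.random(4)),
    "log_reg_preds": (rng.random(6), rng.random(4)),
}

# Each base model contributes one new feature column to both frames
for name, (oof_preds, test_preds) in base_preds.items():
    X_demo[name] = oof_preds
    X_test_demo[name] = test_preds

print(X_demo.shape, X_test_demo.shape)
```

Keeping the column names and order identical in both frames matters, because the meta model is fitted on one and predicts on the other.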


In [25]:
# now just update the tmlt with this new X and X_test

In [26]:
tmlt = tmlt.update_dfl(X=tmlt.dfl.X, y=tmlt.dfl.y, X_test=tmlt.dfl.X_test)

2021-12-09 23:59:07,293 INFO categorical columns are None, Preprocessing will done accordingly!


#### Meta Model Training

##### Create train and validation dataframes for quick preprocessing and training

In [27]:
%%time
# create a train/valid split to evaluate the model on the validation dataset
X_train, X_valid,  y_train, y_valid =  tmlt.dfl.create_train_valid(valid_size=0.2)

print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

# print(X_train.columns.to_list())

(3200, 104)
(3200,)
(800, 104)
(800,)
CPU times: user 7.03 ms, sys: 1.77 ms, total: 8.81 ms
Wall time: 7.15 ms


##### Now preprocess X_train and X_valid

NOTE: Preprocessing takes pandas DataFrames and returns NumPy arrays.

In [28]:
%%time
X_train_np,  X_valid_np = tmlt.pp_fit_transform(X_train, X_valid)

print(type(X_train_np))
print(X_train_np.shape)
# print(X_train_np)
print(type(X_valid_np))
print(X_valid_np.shape)
# print(X_valid_np)
print(type(y_valid))
print(type(y_train))

<class 'numpy.ndarray'>
(3200, 104)
<class 'numpy.ndarray'>
(800, 104)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
CPU times: user 47.8 ms, sys: 4.4 ms, total: 52.2 ms
Wall time: 50.6 ms


In [29]:
# xgb_params = {
#     'objective': 'binary:logistic', 
#     'use_label_encoder': False,
#     'n_estimators': 40000,
#     'learning_rate': 0.18515462875481553,
#     'subsample': 0.97, 
#     'colsample_bytree': 0.32,
#     'max_depth': 1,
#     'booster': 'gbtree',
#     'gamma': 0.2, 
#     'tree_method': 'gpu_hist',
#     'reg_lambda': 0.11729916523488974, 
#     'reg_alpha': 0.6318827156945853,
#     'random_state': 42,
#     'n_jobs': 4, 
#     'min_child_weight': 256,
#     #for GPU
# #     'tree_method': 'gpu_hist',
# #     'predictor': 'gpu_predictor',
#     }

In [30]:
xgb_params = {
    'learning_rate': 0.21761562020600114,
    'eval_metric': 'auc',
    'use_label_encoder': False,
    'random_state': 42,
    'booster': 'gblinear',
    'colsample_bytree': 0.1027132584989078,
    'early_stopping_rounds': 171,
    'max_depth': 6,
    'n_estimators': 7000,
    'reg_alpha': 9.583579660175245e-06,
    'reg_lambda': 9.238315962782784e-05,
    'subsample': 0.4464473710560276,
    'tree_method': 'approx',
    #for GPU
#     'tree_method': 'gpu_hist',
#     'predictor': 'gpu_predictor',
}

In [31]:
%%time
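# Now do model training
# NOTE: this reuses the xgb_model created in In [10]; the xgb_params from In [30]
# are not applied unless a new XGBClassifier(**xgb_params) is created first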
xgb_model.fit(X_train_np, y_train,
              verbose=False,
              #detect & avoid overfitting
              eval_set=[(X_train_np, y_train), (X_valid_np, y_valid)],
              eval_metric="auc",
              early_stopping_rounds=300
             )

#predict
preds = xgb_model.predict(X_valid_np)
preds_probs = xgb_model.predict_proba(X_valid_np)[:, 1]

# Metrics
auc = roc_auc_score(y_valid, preds_probs)
acc = accuracy_score(y_valid, preds)

print(f"AUC is : {auc} while Accuracy is : {acc} ")

AUC is : 0.6933467954809418 while Accuracy is : 0.67625 
CPU times: user 10.9 s, sys: 95.7 ms, total: 11 s
Wall time: 1.48 s


### WOW!!!!

#### For the meta model, let's do an Optuna-based hyperparameter search to find the best params

In [32]:
# Just make sure to supply an output directory path so the hyperparameter search is saved
study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH, opt_timeout=60)
print(study.best_trial)

2021-12-09 23:59:37,980 INFO Optimization Direction is: maximize
[I 2021-12-09 23:59:38,032] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.
2021-12-09 23:59:38,210 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, eval_set, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-09 23:59:47,028 INFO Training Ended!
2021-12-09 23:59:47,033 INFO roc_auc_score: 0.7741592756836658
2021-12-09 23:59:47,034 INFO accuracy_score: 0.72625
[I 2021-12-09 23:59:47,072] Trial 38 finished with value: 0.7741592756836658 and parameters: {'learning_rate': 0.07948707900984789, 'n_estimators': 15000, 'reg_lambda': 7.984356925605064e-06, 'reg_alpha': 2.5216595944303144e-06, 'subsample': 0.9949361731336916, 'colsample_bytree': 0.30017155189532796, 'max_depth': 5, 'early_stopping_rounds': 268, 'tree_method': 'approx', 'booster': 'gblinear'}. Best is trial 38 with value: 0.7741592756836658.
2021-12-09 23:59:47,269 INFO Training Started!


Parameters: { early_stopping_rounds, eval_set } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-10 00:00:17,929 INFO Training Ended!
2021-12-10 00:00:17,953 INFO roc_auc_score: 0.6579558652729384
2021-12-10 00:00:17,954 INFO accuracy_score: 0.65375
[I 2021-12-10 00:00:17,995] Trial 39 finished with value: 0.6579558652729384 and parameters: {'learning_rate': 0.08007436447328488, 'n_estimators': 15000, 'reg_lambda': 5.254743398476762e-06, 'reg_alpha': 6.9699386536621e-08, 'subsample': 0.9322044962669489, 'colsample_bytree': 0.5785734419018447, 'max_depth': 5, 'early_stopping_rounds': 333, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 0.034707793715938046, 'grow_policy': 'depthwise'}. Best is trial 38 with value: 0.7741592756836658.
2021-12-10 00:00:18,141 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, eval_set, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-10 00:00:27,180 INFO Training Ended!
2021-12-10 00:00:27,185 INFO roc_auc_score: 0.7741064829479463
2021-12-10 00:00:27,186 INFO accuracy_score: 0.72625
[I 2021-12-10 00:00:27,209] Trial 40 finished with value: 0.7741064829479463 and parameters: {'learning_rate': 0.03198809904064819, 'n_estimators': 15000, 'reg_lambda': 1.7499374468232044e-07, 'reg_alpha': 2.6296423826705236e-06, 'subsample': 0.8688234611344541, 'colsample_bytree': 0.32645970885560793, 'max_depth': 5, 'early_stopping_rounds': 219, 'tree_method': 'approx', 'booster': 'gblinear'}. Best is trial 38 with value: 0.7741592756836658.
2021-12-10 00:00:27,385 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, eval_set, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-10 00:00:37,591 INFO Training Ended!
2021-12-10 00:00:37,596 INFO roc_auc_score: 0.7740338929363321
2021-12-10 00:00:37,596 INFO accuracy_score: 0.72625
[I 2021-12-10 00:00:37,618] Trial 41 finished with value: 0.7740338929363321 and parameters: {'learning_rate': 0.028897229702683155, 'n_estimators': 15000, 'reg_lambda': 1.1888881323452074e-07, 'reg_alpha': 2.6601366349662993e-06, 'subsample': 0.9680189654657809, 'colsample_bytree': 0.32353956611499696, 'max_depth': 5, 'early_stopping_rounds': 220, 'tree_method': 'approx', 'booster': 'gblinear'}. Best is trial 38 with value: 0.7741592756836658.
2021-12-10 00:00:37,778 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, eval_set, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-10 00:00:47,908 INFO Training Ended!
2021-12-10 00:00:47,912 INFO roc_auc_score: 0.7739876992925773
2021-12-10 00:00:47,912 INFO accuracy_score: 0.72625
[I 2021-12-10 00:00:47,937] Trial 42 finished with value: 0.7739876992925773 and parameters: {'learning_rate': 0.025719759580870857, 'n_estimators': 15000, 'reg_lambda': 4.475205533321173e-08, 'reg_alpha': 1.8902434023049325e-06, 'subsample': 0.9801702965579114, 'colsample_bytree': 0.2874747934370655, 'max_depth': 5, 'early_stopping_rounds': 216, 'tree_method': 'approx', 'booster': 'gblinear'}. Best is trial 38 with value: 0.7741592756836658.


FrozenTrial(number=38, values=[0.7741592756836658], datetime_start=datetime.datetime(2021, 12, 9, 23, 59, 38, 76564), datetime_complete=datetime.datetime(2021, 12, 9, 23, 59, 47, 35226), params={'booster': 'gblinear', 'colsample_bytree': 0.30017155189532796, 'early_stopping_rounds': 268, 'learning_rate': 0.07948707900984789, 'max_depth': 5, 'n_estimators': 15000, 'reg_alpha': 2.5216595944303144e-06, 'reg_lambda': 7.984356925605064e-06, 'subsample': 0.9949361731336916, 'tree_method': 'approx'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistribution(high=9, low=1, step=1), 'n_estimators': CategoricalDistribution(choices=(7000, 15000, 20000)), 'reg_alpha': LogUniformDistribution(high=100.0, low=1e-08), 'reg_lambda': LogUniformDistr

##### Now update the meta model with the best params from the study, then update the sklearn pipeline with this new model

In [33]:
xgb_params.update(study.best_trial.params)
print("xgb_params", xgb_params)
updated_xgb_model = XGBClassifier(**xgb_params)

xgb_params {'learning_rate': 0.07948707900984789, 'eval_metric': 'auc', 'use_label_encoder': False, 'random_state': 42, 'booster': 'gblinear', 'colsample_bytree': 0.30017155189532796, 'early_stopping_rounds': 268, 'max_depth': 5, 'n_estimators': 15000, 'reg_alpha': 2.5216595944303144e-06, 'reg_lambda': 7.984356925605064e-06, 'subsample': 0.9949361731336916, 'tree_method': 'approx'}
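
The `update` call above is a plain dictionary merge: keys found by the search overwrite the hand-picked defaults, and keys the search did not touch are kept. A minimal sketch with illustrative values (not the actual study output):

```python
# Hand-picked defaults, similar in spirit to the xgb_params dict above
base_params = {"learning_rate": 0.01, "eval_metric": "auc", "random_state": 42}

# Illustrative stand-in for study.best_trial.params
best_trial_params = {"learning_rate": 0.0795, "n_estimators": 15000}

merged = dict(base_params)        # copy so the defaults stay untouched
merged.update(best_trial_params)  # tuned values win on key collisions

print(merged)
```

Copying before merging is a design choice for this sketch; the notebook updates `xgb_params` in place, which is also fine when the defaults are no longer needed.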


#### Let's Use K-Fold Training with best params

In [34]:
%%time
X_np, X_test_np = tmlt.pp_fit_transform(tmlt.dfl.X, tmlt.dfl.X_test)
y_np = tmlt.dfl.y

CPU times: user 560 ms, sys: 17.7 ms, total: 578 ms
Wall time: 78.9 ms


In [None]:
%%time
# k-fold training
xgb_model_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(X_np,
                                                                       y_np,
                                                                       X_test=X_test_np,
                                                                       n_splits=5,
                                                                       model=updated_xgb_model)

2021-12-10 00:51:39,405 INFO  model class:<class 'xgboost.sklearn.XGBClassifier'>
2021-12-10 00:51:39,410 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-10 00:52:00,853 INFO Training Finished!
2021-12-10 00:52:00,854 INFO Predicting Val Probablities!
2021-12-10 00:52:00,856 INFO Predicting Val Score!
2021-12-10 00:52:00,860 INFO fold: 1 roc_auc_score : 0.7867504835589942
2021-12-10 00:52:00,861 INFO fold: 1 accuracy_score : 0.72625
2021-12-10 00:52:00,861 INFO Predicting Test Probablities!
2021-12-10 00:52:00,867 INFO Training Started!


Parameters: { colsample_bytree, early_stopping_rounds, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




In [36]:
# predict on test dataset
if xgb_model_test_preds is not None:
    print(xgb_model_test_preds.shape)

(4000,)


In [None]:
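# # take a weighted average of two k-fold models' test predictions
# # (sci_model_preds is not defined in this notebook; the weights already sum to 1,
# #  so no further division is needed)
# final_preds = (0.45 * sci_model_preds) + (0.55 * xgb_model_test_preds)
# print(final_preds.shape)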
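
A weighted blend of two prediction vectors can be sketched with NumPy. The arrays below are hypothetical stand-ins for two models' test probabilities; `np.average` normalizes by the sum of the weights, so no extra division is required.

```python
import numpy as np

preds_a = np.array([0.2, 0.6, 0.9])  # hypothetical model A test probabilities
preds_b = np.array([0.4, 0.5, 0.7])  # hypothetical model B test probabilities

weights = np.array([0.45, 0.55])     # one weight per model

# Stack to shape (n_models, n_rows), then average across models with the weights
blend = np.average(np.vstack([preds_a, preds_b]), axis=0, weights=weights)

print(blend)
```

The result keeps one value per test row, and each value lies between the two models' predictions for that row.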

#### Create Kaggle Predictions

In [None]:
# import pandas as pd
# sub = pd.read_csv(DIRECTORY_PATH + SAMPLE_SUB_FILE)
# sub['target'] = final_preds
# sub.to_csv('submission.csv', index=False)
