# Getting Started with the Kaggle TPS Challenge Using Tabular ML Toolkit

> A tutorial showcasing the tabular_ml_toolkit (tmlt) library on the Kaggle TPS Challenge, Nov 2021.

> tabular_ml_toolkit is a helper library for jumpstarting machine learning projects based on tabular or structured data.

> It comes with model and data parallelism and cutting-edge hyperparameter search techniques.

> Under the hood, TMLT uses modin, optuna, xgboost and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

### How to Best Use tabular_ml_toolkit

Start with your favorite model, then create a **tmlt** instance with a single API call.

*Here we use XGBClassifier on the [Kaggle TPS Challenge (Nov 2021) data](https://www.kaggle.com/c/tabular-playground-series-nov-2021/data).*

In [2]:
from tabular_ml_toolkit.tmlt import *
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
import numpy as np

# for visualizing pipeline
from sklearn import set_config
set_config(display="diagram")

# just to measure fit performance
import time

In [3]:
from sklearn.metrics import roc_auc_score, accuracy_score

In [4]:
# Dataset file names and Paths
DIRECTORY_PATH = "/Users/pamathur/kaggle_datasets/tps_nov_2021/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "kaggle_tps_output/"
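
The later cells build file paths by concatenating `DIRECTORY_PATH + TRAIN_FILE`, which only works while `DIRECTORY_PATH` ends with a slash. A minimal sketch using the standard library's `pathlib` removes that dependency. Whether `prepare_data_for_training` accepts `Path` objects is not shown in this tutorial, so the sketch converts back to plain strings as an assumption.

```python
from pathlib import Path

# Same values as the cell above, minus the trailing slash on the directory.
DIRECTORY_PATH = Path("/Users/pamathur/kaggle_datasets/tps_nov_2021")
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"

# The "/" operator inserts the separator, so a missing slash cannot break the path.
train_path = str(DIRECTORY_PATH / TRAIN_FILE)
test_path = str(DIRECTORY_PATH / TEST_FILE)

print(train_path)
print(test_path)
```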

In [5]:
# TRY THIS using LOGISTIC Regression
# https://www.kaggle.com/maximkazantsev/tps-11-21-eda-xgboost-optuna

# ALSO TAKE OUT MODIN OR USE SOME FUNCTIONALITY TO USE BOTH

###### Create SVM model

In [6]:
skl_svm_model = LinearSVC(tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=42)

In [None]:
# create tmlt for svm model
tmlt = TMLT().prepare_data_for_training(
    train_file_path= DIRECTORY_PATH + TRAIN_FILE,
    #test_file_path= DIRECTORY_PATH + TEST_FILE,
    #make sure to use right index and target columns
    idx_col="id",
    target="target",
    model=skl_svm_model,
    random_state=42,
    problem_type="binary_classification", nrows=4000)
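
One thing to keep in mind with this model: `LinearSVC` does not implement `predict_proba`, so probability-based code such as `predict_proba(...)[:, 1]` will not work on it directly. A ranking metric such as `roc_auc_score` can use the output of `decision_function` instead. The sketch below uses synthetic data from `make_classification` as a stand-in for the Kaggle file; how TMLT itself scores SVM models is not shown in this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary dataset (assumption: a stand-in for the TPS data).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Same hyperparameters as the SVM model defined above.
model = LinearSVC(tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=42)
model.fit(X_train, y_train)

# LinearSVC exposes signed distances to the decision boundary, not probabilities.
has_proba = hasattr(model, "predict_proba")
scores = model.decision_function(X_valid)

# roc_auc_score only needs a ranking, so decision scores are sufficient.
auc = roc_auc_score(y_valid, scores)
print(f"predict_proba available: {has_proba}, AUC from decision scores: {auc:.3f}")
```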

#### Create a base xgb classifier model with your best guess params

In [4]:
xgb_params = {
    # your best guess params
    'learning_rate':0.01,
    'eval_metric':'auc',
    # required for XGBClassifier; otherwise a warning is shown
    'use_label_encoder':False,
    # because 42 is the answer for all the randomness of this universe
    'random_state':42,
    #for GPU
    #'tree_method': 'gpu_hist',
    #'predictor': 'gpu_predictor',
}

xgb_model = XGBClassifier(**xgb_params)

In [5]:
# create tmlt for xgb model
tmlt = TMLT().prepare_data_for_training(
    train_file_path= DIRECTORY_PATH + TRAIN_FILE,
    #test_file_path= DIRECTORY_PATH + TEST_FILE,
    #make sure to use right index and target columns
    idx_col="id",
    target="target",
    model=xgb_model,
    random_state=42,
    problem_type="classification", nrows=4000)

2021-11-23 14:07:20,402 INFO 12 cores found, model and data parallel processing should worked!
2021-11-23 14:07:39,201 INFO DataFrame Memory usage decreased to 0.80 Mb (74.4% reduction)
2021-11-23 14:07:39,201 INFO No test_file_path given, so training will continue without it!
2021-11-23 14:07:49,152 INFO PreProcessing will include target(s) encoding!
2021-11-23 14:07:49,153 INFO categorical columns are None, Preprocessing will done accordingly!


In [6]:
tmlt.spl
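
The data-preparation log reports "DataFrame Memory usage decreased to 0.80 Mb (74.4% reduction)". How TMLT achieves this is not shown in this tutorial; a common technique for such savings is downcasting numeric columns to narrower dtypes. The sketch below illustrates that idea with plain NumPy arrays. The `downcast` helper is hypothetical, and the float conversion is lossy by design.

```python
import numpy as np

def downcast(column):
    """Return the column in a narrower dtype when possible (hypothetical helper)."""
    if np.issubdtype(column.dtype, np.integer):
        # Pick the smallest signed integer type that holds the observed range.
        for dtype in (np.int8, np.int16, np.int32):
            info = np.iinfo(dtype)
            if info.min <= column.min() and column.max() <= info.max:
                return column.astype(dtype)
        return column
    if np.issubdtype(column.dtype, np.floating):
        # float64 -> float32 halves memory but loses precision.
        return column.astype(np.float32)
    return column

ints = np.arange(0, 100, dtype=np.int64)
floats = np.linspace(0.0, 1.0, 1000)  # float64 by default

small_ints = downcast(ints)
small_floats = downcast(floats)

for name, before, after in [("ints", ints, small_ints), ("floats", floats, small_floats)]:
    saved = 100 * (1 - after.nbytes / before.nbytes)
    print(f"{name}: {before.dtype} -> {after.dtype}, {saved:.1f}% smaller")
```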

#### Let's do a quick round of training

In [7]:
# tmlt.dfl.create_train_valid(valid_size=0.2)

In [8]:
# # Quick check on dataframe shapes
# print(f"X_train shape is {tmlt.dfl.X_train.shape}" )
# print(f"X_valid shape is {tmlt.dfl.X_valid.shape}" )
# print(f"y_train shape is {tmlt.dfl.y_train.shape}")
# print(f"y_valid shape is {tmlt.dfl.y_valid.shape}")

In [9]:
# # Fit
# start = time.time()
# # Now fit
# tmlt.spl.fit(tmlt.dfl.X_train, tmlt.dfl.y_train)
# end = time.time()
# print("Fit Time:", end - start)

# #predict
# preds = tmlt.spl.predict(tmlt.dfl.X_valid)
# preds_probs = tmlt.spl.predict_proba(tmlt.dfl.X_valid)[:, 1]

# # Metrics
# auc = roc_auc_score(tmlt.dfl.y_valid, preds_probs)
# acc = accuracy_score(tmlt.dfl.y_valid, preds)

# print(f"AUC is : {auc} while Accuracy is : {acc} ")
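
The commented cell above feeds hard labels from `predict` to `accuracy_score` and positive-class probabilities from `predict_proba` to `roc_auc_score`. The reason is that accuracy compares labels, while AUC measures how well the scores rank positives above negatives. A minimal pure-Python sketch of both metrics on made-up values follows; the helper functions are illustrative and are not part of scikit-learn or TMLT.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted exactly right."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and positive-class probabilities.
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

# Hard labels come from thresholding the probabilities at 0.5.
preds = [int(p >= 0.5) for p in probs]

acc = accuracy(y_true, preds)
auc = roc_auc(y_true, probs)
print(f"accuracy={acc:.4f}, auc={auc:.4f}")
```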

### Let's do an Optuna-based hyperparameter search to get the best params for fit

In [None]:
study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH, opt_timeout=360)

2021-11-23 14:25:46,244 INFO Optimization Direction is: minimize
[I 2021-11-23 14:25:46,274] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.
2021-11-23 14:25:49,535 INFO Training Started!
Parameters: { "colsample_bytree", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.

2021-11-23 14:26:07,809 INFO Training Ended!
2021-11-23 14:26:29,793 INFO log_loss: 0.6051906617963687
2021-11-23 14:26:29,794 INFO roc_auc_score: 0.7158958927251612
2021-11-23 14:26:29,794 INFO accuracy_score: 0.695
2021-11-23 14:26:29,795 INFO f1_score: 0.5836177474402731
2021-11-23 14:26:29,796 INFO precision_score: 0.6151079136690647
2021-11-23 14:26:29,797 INFO recall_score: 0.5551948051948052
[I 2021-11-23 14:26:29,834] Trial 4 finished with value: 0.6051906617963687 and parameters: {'learning_rate': 0.07778202222863026, 'n_estimators': 20000, 'reg_lambda': 1.1572194721196033e-05, 'reg_alpha': 3.079971779735798e-08, 'subsample': 0.8566092401661841, 'colsample_bytree': 0.7338078050434531, 'max_depth': 1, 'tree_method': 'exact', 'booster': 'gblinear'}. Best is trial 3 with value: 0.6051832608412951.
2021-11-23 14:26:33,655 INFO Training Started!
2021-11-23 14:26:56,977 INFO Training Ended!
2021-11-23 14:27:18,824 INFO log_loss: 1.3733108584548672
2021-11-23 14:27:18,826 INFO roc_auc_score: 0.600121423292155
2021-11-23 14:27:18,826 INFO accuracy_score: 0.59375
2021-11-23 14:27:18,827 INFO f1_score: 0.4574290484140234
2021-11-23 14:27:18,828 INFO precision_score: 0.47079037800687284
2021-11-23 14:27:18,828 INFO recall_score: 0.4448051948051948
[I 2021-11-23 14:27:18,860] Trial 5 finished with value: 1.3733108584548672 and parameters: {'learning_rate': 0.14680037130381238, 'n_estimators': 15000, 'reg_lambda': 4.211629467291321, 'reg_alpha': 5.624849648423015e-07, 'subsample': 0.11905642392373826, 'colsample_bytree': 0.7166667785774565, 'max_depth': 2, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 2.401734572428494e-07, 'grow_policy': 'depthwise'}. Best is trial 3 with value: 0.6051832608412951.
2021-11-23 14:27:22,683 INFO Training Started!
2021-11-23 14:27:50,993 INFO Training Ended!
2021-11-23 14:28:12,915 INFO log_loss: 0.7951592802736559
2021-11-23 14:28:12,916 INFO roc_auc_score: 0.6458663287931581
2021-11-23 14:28:12,916 INFO accuracy_score: 0.63875
2021-11-23 14:28:12,917 INFO f1_score: 0.5025817555938038
2021-11-23 14:28:12,918 INFO precision_score: 0.5347985347985348
2021-11-23 14:28:12,919 INFO recall_score: 0.474025974025974
[I 2021-11-23 14:28:12,956] Trial 6 finished with value: 0.7951592802736559 and parameters: {'learning_rate': 0.020919490032178578, 'n_estimators': 7000, 'reg_lambda': 1.0191685617159378e-06, 'reg_alpha': 2.7618753046849175e-08, 'subsample': 0.5364273278094787, 'colsample_bytree': 0.5406269791805867, 'max_depth': 3, 'tree_method': 'approx', 'booster': 'gbtree', 'gamma': 4.48740453005513e-06, 'grow_policy': 'depthwise'}. Best is trial 3 with value: 0.6051832608412951.
2021-11-23 14:28:16,633 INFO Training Started!
2021-11-23 14:28:35,484 INFO Training Ended!
2021-11-23 14:28:56,523 INFO log_loss: 0.6051877725403756
2021-11-23 14:28:56,524 INFO roc_auc_score: 0.7158958927251612
2021-11-23 14:28:56,525 INFO accuracy_score: 0.695
2021-11-23 14:28:56,526 INFO f1_score: 0.5836177474402731
2021-11-23 14:28:56,527 INFO precision_score: 0.6151079136690647
2021-11-23 14:28:56,527 INFO recall_score: 0.5551948051948052
[I 2021-11-23 14:28:56,560] Trial 7 finished with value: 0.6051877725403756 and parameters: {'learning_rate': 0.04988928951526494, 'n_estimators': 20000, 'reg_lambda': 1.7238404326454513e-05, 'reg_alpha': 6.60629606464357e-07, 'subsample': 0.2750714764806063, 'colsample_bytree': 0.325291757151274, 'max_depth': 9, 'tree_method': 'exact', 'booster': 'gblinear'}. Best is trial 3 with value: 0.6051832608412951.
2021-11-23 14:29:00,226 INFO Training Started!
2021-11-23 14:29:25,026 INFO Training Ended!

In [None]:
print(study.best_trial)
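
In the optimization log, each trial's reported value equals its `log_loss` line, and the study direction is `minimize`, so lower is better. A minimal sketch of binary log loss on made-up values gives a sense of scale: a model that always predicts 0.5 scores ln 2 ≈ 0.693, which means Trial 5 (1.373) did worse than that constant baseline. The `binary_log_loss` helper is illustrative and is not TMLT's implementation.

```python
import math

def binary_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        # Clip so that log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Made-up labels and predictions.
y_true = [0, 1, 1, 0]
confident_right = [0.1, 0.9, 0.8, 0.2]
always_half = [0.5, 0.5, 0.5, 0.5]
confident_wrong = [0.9, 0.1, 0.2, 0.8]

loss_right = binary_log_loss(y_true, confident_right)
loss_half = binary_log_loss(y_true, always_half)
loss_wrong = binary_log_loss(y_true, confident_wrong)

print(f"confident and right: {loss_right:.4f}")
print(f"always 0.5:          {loss_half:.4f}")
print(f"confident and wrong: {loss_wrong:.4f}")
```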

##### Now update the model with the best params from the study, then update the sklearn pipeline with the new model

In [None]:
xgb_params.update(study.best_trial.params)
print("Final xgb_params:", xgb_params)
xgb_model = XGBClassifier(**xgb_params)
tmlt.update_model(xgb_model)
tmlt.spl
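
The `xgb_params.update(...)` call above relies on plain dictionary semantics: keys found by the search overwrite the initial guesses, and keys the search never touched are kept. A small sketch with hypothetical tuned values (the real `study.best_trial.params` are not shown in this tutorial) makes the effect visible.

```python
# Base guesses, as in the xgb_params cell earlier.
xgb_params = {
    'learning_rate': 0.01,
    'eval_metric': 'auc',
    'use_label_encoder': False,
    'random_state': 42,
}

# Hypothetical stand-in for study.best_trial.params.
best_params = {'learning_rate': 0.05, 'max_depth': 3}

# update() mutates xgb_params in place: tuned keys win, the rest survive.
xgb_params.update(best_params)
print("Final xgb_params:", xgb_params)
```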

#### Let's Use K-Fold Training

In [13]:
# K-Fold fit and predict on test dataset
xgb_model_mean_metrics_results, xgb_model_test_preds = tmlt.do_kfold_training(
    n_splits=5,
    test_preds_metric=roc_auc_score)
if xgb_model_test_preds is not None:
    print(xgb_model_test_preds.shape)

Parameters: { "colsample_bytree", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.

2021-11-23 14:10:23,049 INFO fold: 1 log_loss : 0.6039021413773298
2021-11-23 14:10:23,050 INFO fold: 1 roc_auc_score : 0.7272275950999355
2021-11-23 14:10:23,051 INFO fold: 1 accuracy_score : 0.6925
2021-11-23 14:10:23,051 INFO fold: 1 f1_score : 0.5758620689655173
2021-11-23 14:10:23,052 INFO fold: 1 precision_score : 0.668
2021-11-23 14:10:23,053 INFO fold: 1 recall_score : 0.5060606060606061
2021-11-23 14:10:49,192 INFO fold: 2 log_loss : 0.6287158783921041
2021-11-23 14:10:49,193 INFO fold: 2 roc_auc_score : 0.694635718891038
2021-11-23 14:10:49,194 INFO fold: 2 accuracy_score : 0.6825
2021-11-23 14:10:49,195 INFO fold: 2 f1_score : 0.5876623376623378
2021-11-23 14:10:49,196 INFO fold: 2 precision_score : 0.6328671328671329
2021-11-23 14:10:49,197 INFO fold: 2 recall_score : 0.5484848484848485
2021-11-23 14:11:12,677 INFO fold: 3 log_loss : 0.65383338053427
2021-11-23 14:11:12,678 INFO fold: 3 roc_auc_score : 0.6620309477756288
2021-11-23 14:11:12,679 INFO fold: 3 accuracy_score : 0.6425
2021-11-23 14:11:12,680 INFO fold: 3 f1_score : 0.5119453924914675
2021-11-23 14:11:12,680 INFO fold: 3 precision_score : 0.5859375
2021-11-23 14:11:12,681 INFO fold: 3 recall_score : 0.45454545454545453
2021-11-23 14:11:34,620 INFO fold: 4 log_loss : 0.6231675453111529
2021-11-23 14:11:34,621 INFO fold: 4 roc_auc_score : 0.7080308427650268
2021-11-23 14:11:34,621 INFO fold: 4 accuracy_score : 0.69125
2021-11-23 14:11:34,622 INFO fold: 4 f1_score : 0.5957446808510638
2021-11-23 14:11:34,623 INFO fold: 4 precision_score : 0.65
2021-11-23 14:11:34,624 INFO fold: 4 recall_score : 0.5498489425981873
2021-11-23 14:11:58,562 INFO fold: 5 log_loss : 0.6006386911938898
2021-11-23 14:11:58,563 INFO fold: 5 roc_auc_score : 0.7226470152474571
2021-11-23 14:11:58,564 INFO fold: 5 accuracy_score : 0.6775
2021-11-23 14:11:58,565 INFO fold: 5 f1_score : 0.568561872909699
2021-11-23 14:11:58,565 INFO fold: 5 precision_score : 0.6367041198501873
2021-11-23 14:11:58,566 INFO fold: 5 recall_score : 0.513595166163142
2021-11-23 14:11:58,568 INFO  Mean Metrics Results from all Folds are: {'log_loss': 0.6220515273617493, 'roc_auc_score': 0.7029144239558172, 'accuracy_score': 0.6772500000000001, 'f1_score': 0.5679552705760171, 'precision_score': 0.634701750543464, 'recall_score': 0.5145070035704478}
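
The "Mean Metrics Results" line is the arithmetic mean of the five per-fold scores, which can be checked against the logged values. The sketch also shows one simple way to cut 4000 rows into 5 validation folds of 800 rows each. `kfold_indices` is a hypothetical helper using contiguous, unshuffled folds; TMLT may shuffle or stratify differently.

```python
import math

def kfold_indices(n_rows, n_splits):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size

# nrows=4000 and n_splits=5, as in the cells above.
folds = list(kfold_indices(4000, 5))
print([len(valid) for _, valid in folds])

# Per-fold scores copied from the k-fold log above.
fold_log_loss = [0.6039021413773298, 0.6287158783921041, 0.65383338053427,
                 0.6231675453111529, 0.6006386911938898]
fold_roc_auc = [0.7272275950999355, 0.694635718891038, 0.6620309477756288,
                0.7080308427650268, 0.7226470152474571]

mean_log_loss = sum(fold_log_loss) / len(fold_log_loss)
mean_roc_auc = sum(fold_roc_auc) / len(fold_roc_auc)
print(f"mean log_loss={mean_log_loss}, mean roc_auc_score={mean_roc_auc}")
```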
In [None]:
# # take weighted average of both k-fold models predictions
# final_preds = (0.45 * sci_model_preds) + (0.55 * xgb_model_test_preds)
# print(final_preds.shape)
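
A weighted average whose weights already sum to 1 needs no further division; dividing by 2 afterwards would halve every blended prediction. A minimal sketch with made-up probability arrays follows. The `blend` helper and the input values are hypothetical; `sci_model_preds` is not defined anywhere in this tutorial.

```python
import numpy as np

def blend(preds_a, preds_b, weight_a=0.45, weight_b=0.55):
    """Weighted average of two prediction arrays (hypothetical helper)."""
    if not np.isclose(weight_a + weight_b, 1.0):
        raise ValueError("weights should sum to 1")
    return weight_a * np.asarray(preds_a) + weight_b * np.asarray(preds_b)

# Made-up positive-class probabilities from two models.
svm_like_preds = np.array([0.2, 0.6, 0.9])
xgb_like_preds = np.array([0.4, 0.8, 0.7])

final_preds = blend(svm_like_preds, xgb_like_preds)
print(final_preds, final_preds.shape)
```

Each blended value stays between the two input predictions, which is a quick sanity check on the weights.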

#### Create Kaggle Predictions

In [None]:
# sub = pd.read_csv(DIRECTORY_PATH + SAMPLE_SUB_FILE)
# sub['target'] = final_preds
# sub.to_csv('submission.csv', index=False)
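
Before writing a submission it helps to confirm that the number of predictions matches the number of rows in the sample file. The sketch below does the same job as the commented pandas snippet using only the standard library, on a tiny made-up sample file so that it runs without the Kaggle data. `write_submission` and the ids are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def write_submission(sample_path, out_path, preds, target_col="target"):
    """Copy the sample submission, replacing target_col with preds (hypothetical helper)."""
    with open(sample_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    # Fail early instead of writing a misaligned file.
    if len(rows) != len(preds):
        raise ValueError(f"expected {len(rows)} predictions, got {len(preds)}")
    for row, pred in zip(rows, preds):
        row[target_col] = pred
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Made-up two-row sample file in a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample_submission.csv"
sample_path.write_text("id,target\n600000,0.5\n600001,0.5\n")

out_path = tmp_dir / "submission.csv"
write_submission(sample_path, out_path, [0.31, 0.71])
print(out_path.read_text())
```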
