# Getting Started with the Kaggle TPS Challenge Using Tabular ML Toolkit

> A tutorial showcasing the tabular_ml_toolkit (tmlt) library on the Kaggle TPS Challenge (Nov 2021).

> tabular_ml_toolkit is a helper library to jumpstart your machine learning project based on tabular or structured data.

> It comes with model and data parallelism and cutting-edge hyperparameter search techniques.

> Under the hood, TMLT uses modin, optuna, xgboost and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

### How to Best Use tabular_ml_toolkit

Start with your favorite model, then create a **tmlt** instance with a single API call.

*Here we use XGBClassifier on the [Kaggle TPS Challenge (Nov 2021) data](https://www.kaggle.com/c/tabular-playground-series-nov-2021/data).*

In [None]:
from tabular_ml_toolkit.tmlt import *
from xgboost import XGBClassifier
import numpy as np

# for visualizing pipeline
from sklearn import set_config
set_config(display="diagram")

# just to measure fit performance
import time

In [None]:
# Dataset file names and Paths
DIRECTORY_PATH = "/Users/pamathur/kaggle_datasets/tps_nov_2021/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "kaggle_tps_output/"
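
The later cells join `DIRECTORY_PATH` and the file names by string concatenation, which only works when the directory string ends with a slash. A minimal sketch of a more forgiving variant using `pathlib` follows. The helper name `build_paths` is hypothetical and not part of tmlt, and creating the output directory up front is a defensive assumption; the library may handle that itself.

```python
from pathlib import Path
import tempfile


def build_paths(directory, train_file="train.csv", test_file="test.csv",
                output_dir="kaggle_tps_output"):
    """Join dataset paths safely and make sure the output directory exists.

    Hypothetical helper for illustration only; not part of tabular_ml_toolkit.
    """
    base = Path(directory)
    out = Path(output_dir)
    # exist_ok=True makes repeated notebook runs harmless
    out.mkdir(parents=True, exist_ok=True)
    return {
        "train": str(base / train_file),
        "test": str(base / test_file),
        "output": str(out),
    }


# Demo in a throwaway directory so nothing is written to the real dataset folder
with tempfile.TemporaryDirectory() as tmp:
    paths = build_paths(tmp, output_dir=Path(tmp) / "kaggle_tps_output")
    print(paths["train"])
    print(paths["output"])
```

The returned strings can be passed wherever the notebook uses `DIRECTORY_PATH + TRAIN_FILE` or `OUTPUT_PATH`.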

#### Create a base xgb classifier model with your best guess params

In [None]:
xgb_params = {
    # your best guess params
    'learning_rate': 0.01,
    'eval_metric': 'auc',
    # required for XGBClassifier, otherwise a warning is shown
    'use_label_encoder': False,
    # because 42 is the answer to all the randomness of this universe
    'random_state': 42,
    # for GPU
    # 'tree_method': 'gpu_hist',
    # 'predictor': 'gpu_predictor',
}

xgb_model = XGBClassifier(**xgb_params)

In [None]:
# create tmlt for the xgb model
tmlt = TMLT().prepare_data_for_training(
    train_file_path=DIRECTORY_PATH + TRAIN_FILE,
    # test_file_path=DIRECTORY_PATH + TEST_FILE,
    # make sure to use the right index and target columns
    idx_col="id",
    target="target",
    model=xgb_model,
    random_state=42,
    problem_type="classification",
    # limit to 4000 rows for a quick run
    nrows=4000)

2021-11-23 00:58:11,440 INFO 12 cores found, model and data parallel processing should worked!
2021-11-23 00:58:28,221 INFO DataFrame Memory usage decreased to 0.80 Mb (74.4% reduction)
2021-11-23 00:58:28,221 INFO No test_file_path given, so training will continue without it!


[2m[36m(apply_list_of_funcs pid=96618)[0m 
[2m[36m(compute_sliced_len pid=96615)[0m 
[2m[36m(apply_func pid=96617)[0m 


2021-11-23 00:58:37,460 INFO PreProcessing will include target(s) encoding!
2021-11-23 00:58:37,461 INFO categorical columns are None, Preprocessing will done accordingly!


In [None]:
tmlt.spl

#### Let's do a quick round of training

In [None]:
tmlt.dfl.create_train_valid(valid_size=0.2)

In [None]:
# Quick check on dataframe shapes
print(f"X_train shape is {tmlt.dfl.X_train.shape}")
print(f"X_valid shape is {tmlt.dfl.X_valid.shape}")
print(f"y_train shape is {tmlt.dfl.y_train.shape}")
print(f"y_valid shape is {tmlt.dfl.y_valid.shape}")

X_train shape is (3200, 100)
X_valid shape is (800, 100)
y_train shape is (3200,)
y_valid shape is (800,)


In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score

In [None]:
# Fit the pipeline and time it
start = time.time()
tmlt.spl.fit(tmlt.dfl.X_train, tmlt.dfl.y_train)
end = time.time()
print("Fit Time:", end - start)

# Predict class labels and positive-class probabilities
preds = tmlt.spl.predict(tmlt.dfl.X_valid)
preds_probs = tmlt.spl.predict_proba(tmlt.dfl.X_valid)[:, 1]

# Metrics
auc = roc_auc_score(tmlt.dfl.y_valid, preds_probs)
acc = accuracy_score(tmlt.dfl.y_valid, preds)

print(f"AUC is : {auc} while Accuracy is : {acc} ")

[2m[36m(apply_list_of_funcs pid=96613)[0m 
Fit Time: 6.31636905670166
[2m[36m(compute_sliced_len pid=96613)[0m 
AUC is : 0.6137947418435223 while Accuracy is : 0.6175 
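
The two metrics above are fed different inputs: AUC is computed from the positive-class probabilities, while accuracy is computed from hard labels. The following dependency-free sketch shows why that matters. It is not how `roc_auc_score` is implemented; it uses the equivalent pairwise-ranking definition of AUC, and the toy labels, probabilities, and the 0.5 threshold are assumptions chosen for illustration.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    so only suitable for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, labels):
    """Fraction of hard labels that match the ground truth."""
    return sum(int(y == l) for y, l in zip(y_true, labels)) / len(y_true)


# Toy data (assumed): every probability is below 0.5, yet the ranking is mostly right
y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.45]
labels = [int(p >= 0.5) for p in probs]  # assumed 0.5 decision threshold

print("AUC:", pairwise_auc(y_true, probs))     # 0.75
print("Accuracy:", accuracy(y_true, labels))   # 0.5
```

Here the model ranks three of the four positive/negative pairs correctly, so AUC is 0.75, but every sample is labeled 0 at the 0.5 threshold, so accuracy is only 0.5. Passing hard labels to an AUC function would discard that ranking information.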


### Let's do an Optuna-based hyperparameter search to find the best params for fit

In [None]:
study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH, opt_timeout=60)

In [None]:
print(study.best_trial)

##### Now update the model with the best params from the study, then update the sklearn pipeline with the new model

In [None]:
xgb_params.update(study.best_trial.params)
print("Final xgb_params:", xgb_params)
xgb_model = XGBClassifier(**xgb_params)
tmlt.update_model(xgb_model)
tmlt.spl
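
The `update` call above relies on plain dictionary semantics: keys found by the search overwrite the best-guess values, and keys the search did not touch are kept. A small sketch of that merge follows. The `best_params` values are hypothetical, not taken from a real study, and the copy is only there so the original guess stays available for comparison.

```python
base_params = {
    "learning_rate": 0.01,
    "eval_metric": "auc",
    "random_state": 42,
}

# Hypothetical search result for illustration; a real study returns its own keys and values
best_params = {"learning_rate": 0.05, "max_depth": 6}

# Copy first so base_params is left untouched (the notebook cell updates in place instead)
merged = dict(base_params)
merged.update(best_params)

print(merged)
```

After the merge, `learning_rate` carries the searched value, `max_depth` is added, and `eval_metric` and `random_state` are carried over unchanged.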

#### Let's Use K-Fold Training

In [None]:
# K-Fold fit and predict on test dataset
xgb_model_preds_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(
    n_splits=5,
    test_preds_metric=roc_auc_score)
if xgb_model_test_preds is not None:
    print(xgb_model_test_preds.shape)

[2m[36m(apply_list_of_funcs pid=96615)[0m 
[2m[36m(apply_list_of_funcs pid=96617)[0m 
[2m[36m(apply_list_of_funcs pid=96618)[0m 
[2m[36m(apply_list_of_funcs pid=96609)[0m 


2021-11-23 00:59:45,633 INFO fold: 1 log_loss : 0.6621403547748923
2021-11-23 00:59:45,634 INFO fold: 1 roc_auc_score : 0.6182978723404255
2021-11-23 00:59:45,635 INFO fold: 1 accuracy_score : 0.605
2021-11-23 00:59:45,635 INFO fold: 1 f1_score : 0.3629032258064516
2021-11-23 00:59:45,636 INFO fold: 1 precision_score : 0.5421686746987951
2021-11-23 00:59:45,637 INFO fold: 1 recall_score : 0.2727272727272727


[2m[36m(apply_list_of_funcs pid=96614)[0m 
[2m[36m(apply_list_of_funcs pid=96608)[0m 
[2m[36m(apply_list_of_funcs pid=96612)[0m 


2021-11-23 01:00:03,036 INFO fold: 2 log_loss : 0.6640257256105542
2021-11-23 01:00:03,037 INFO fold: 2 roc_auc_score : 0.6078916827852998
2021-11-23 01:00:03,037 INFO fold: 2 accuracy_score : 0.61875
2021-11-23 01:00:03,038 INFO fold: 2 f1_score : 0.38383838383838387
2021-11-23 01:00:03,039 INFO fold: 2 precision_score : 0.5757575757575758
2021-11-23 01:00:03,040 INFO fold: 2 recall_score : 0.2878787878787879


[2m[36m(apply_list_of_funcs pid=96615)[0m 
[2m[36m(apply_list_of_funcs pid=96613)[0m 
[2m[36m(apply_list_of_funcs pid=96613)[0m 
[2m[36m(apply_list_of_funcs pid=96611)[0m 
[2m[36m(apply_list_of_funcs pid=96613)[0m 
[2m[36m(apply_list_of_funcs pid=96609)[0m 
[2m[36m(apply_list_of_funcs pid=96612)[0m 


2021-11-23 01:00:19,239 INFO fold: 3 log_loss : 0.662113243713975
2021-11-23 01:00:19,240 INFO fold: 3 roc_auc_score : 0.6047582205029014
2021-11-23 01:00:19,241 INFO fold: 3 accuracy_score : 0.6225
2021-11-23 01:00:19,242 INFO fold: 3 f1_score : 0.37860082304526754
2021-11-23 01:00:19,242 INFO fold: 3 precision_score : 0.5897435897435898
2021-11-23 01:00:19,243 INFO fold: 3 recall_score : 0.2787878787878788


[2m[36m(apply_list_of_funcs pid=96613)[0m 
[2m[36m(apply_list_of_funcs pid=96616)[0m 
[2m[36m(apply_list_of_funcs pid=96609)[0m 


2021-11-23 01:00:36,864 INFO fold: 4 log_loss : 0.6577297036349773
2021-11-23 01:00:36,864 INFO fold: 4 roc_auc_score : 0.6376876944582225
2021-11-23 01:00:36,865 INFO fold: 4 accuracy_score : 0.63375
2021-11-23 01:00:36,866 INFO fold: 4 f1_score : 0.40325865580448067
2021-11-23 01:00:36,867 INFO fold: 4 precision_score : 0.61875
2021-11-23 01:00:36,868 INFO fold: 4 recall_score : 0.2990936555891239


[2m[36m(apply_list_of_funcs pid=96608)[0m 
[2m[36m(apply_list_of_funcs pid=96608)[0m 


2021-11-23 01:00:54,497 INFO fold: 5 log_loss : 0.6580264708772302
2021-11-23 01:00:54,498 INFO fold: 5 roc_auc_score : 0.6333331186106584
2021-11-23 01:00:54,499 INFO fold: 5 accuracy_score : 0.60875
2021-11-23 01:00:54,499 INFO fold: 5 f1_score : 0.37274549098196397
2021-11-23 01:00:54,500 INFO fold: 5 precision_score : 0.5535714285714286
2021-11-23 01:00:54,501 INFO fold: 5 recall_score : 0.2809667673716012
2021-11-23 01:00:54,502 INFO kfold_metrics_results: [{'log_loss': 0.6621403547748923, 'roc_auc_score': 0.6182978723404255, 'accuracy_score': 0.605, 'f1_score': 0.3629032258064516, 'precision_score': 0.5421686746987951, 'recall_score': 0.2727272727272727}, {'log_loss': 0.6640257256105542, 'roc_auc_score': 0.6078916827852998, 'accuracy_score': 0.61875, 'f1_score': 0.38383838383838387, 'precision_score': 0.5757575757575758, 'recall_score': 0.2878787878787879}, {'log_loss': 0.662113243713975, 'roc_auc_score': 0.6047582205029014, 'accuracy_score': 0.6225, 'f1_score': 0.378600823045267
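
Per-fold numbers are easier to compare once they are reduced to a mean and a spread. The sketch below does that for two of the metrics, with the values copied from the fold logs above. The `summarize` helper is hypothetical and not part of tmlt, and it assumes a list of per-fold dicts shaped like the logged `kfold_metrics_results`.

```python
from statistics import mean, stdev

# Values copied from the fold logs above (two of the six logged metrics)
fold_metrics = [
    {"roc_auc_score": 0.6182978723404255, "accuracy_score": 0.605},
    {"roc_auc_score": 0.6078916827852998, "accuracy_score": 0.61875},
    {"roc_auc_score": 0.6047582205029014, "accuracy_score": 0.6225},
    {"roc_auc_score": 0.6376876944582225, "accuracy_score": 0.63375},
    {"roc_auc_score": 0.6333331186106584, "accuracy_score": 0.60875},
]


def summarize(results):
    """Return {metric: (mean, sample standard deviation)} across folds.

    Hypothetical helper for illustration; assumes every fold reports the same keys.
    """
    summary = {}
    for name in results[0]:
        values = [fold[name] for fold in results]
        summary[name] = (mean(values), stdev(values))
    return summary


summary = summarize(fold_metrics)
for name, (avg, spread) in summary.items():
    print(f"{name}: mean={avg:.4f} std={spread:.4f}")
```

The mean AUC across the five folds comes out at roughly 0.62, close to the single hold-out score from the quick training round.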

[2m[36m(apply_list_of_funcs pid=96608)[0m 


In [None]:
# # take a weighted average of both k-fold models' predictions
# # (the weights already sum to 1, so no further division is needed)
# final_preds = (0.45 * sci_model_preds) + (0.55 * xgb_model_test_preds)
# print(final_preds.shape)
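
A weighted average of two prediction vectors divides by the sum of the weights, so the blended values stay on the same probability scale as the inputs. The sketch below uses plain Python lists so it runs without dependencies; with NumPy arrays the same arithmetic applies elementwise. The `blend` helper and the toy predictions are hypothetical, and only the 0.45/0.55 weights come from the commented cell.

```python
def blend(preds_a, preds_b, weight_a=0.45, weight_b=0.55):
    """Weighted average of two equally long prediction sequences.

    Hypothetical helper for illustration. Normalizes by the sum of the
    weights, so the weights do not have to add up to 1.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("both prediction sequences must have the same length")
    total = weight_a + weight_b
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [(weight_a * a + weight_b * b) / total for a, b in zip(preds_a, preds_b)]


# Toy positive-class probabilities from two hypothetical models
model_a_preds = [0.2, 0.8]
model_b_preds = [0.4, 0.6]

final_preds = blend(model_a_preds, model_b_preds)
print([round(p, 2) for p in final_preds])  # [0.31, 0.69]
```

Each blended value lies between the two inputs it was built from, which is a quick sanity check on any blending formula.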

#### Create Kaggle Predictions

In [None]:
# import pandas as pd
# sub = pd.read_csv(DIRECTORY_PATH + SAMPLE_SUB_FILE)
# sub['target'] = final_preds
# sub.to_csv('submission.csv', index=False)
