# Getting Started with the Kaggle TPS Challenge Using Tabular ML Toolkit

> A tutorial showcasing the tabular_ml_toolkit library on the Kaggle TPS Challenge (Nov 2021).

> tabular_ml_toolkit is a helper library that speeds up machine learning projects based on tabular (structured) data.

> It comes with model parallelism and cutting-edge hyperparameter tuning techniques.

## Install

`pip install -U tabular_ml_toolkit`

## How to Best Use tabular_ml_toolkit

Start with your favorite model, then wrap it in a `TMLT` object with a single API call.

*You can use `TMLT` to quickly train any model that supports the scikit-learn `fit` and `predict` methods.*

*For example, here we use XGBClassifier from XGBoost on the [Kaggle TPS Challenge (Nov 2021) data](https://www.kaggle.com/c/tabular-playground-series-nov-2021/data).*

In [1]:
from tabular_ml_toolkit.tmlt import *
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
import pandas as pd
import numpy as np

# for visualizing pipeline
from sklearn import set_config
set_config(display="diagram")

# just to measure fit performance
import time

In [2]:
# Dataset file names and Paths
DIRECTORY_PATH = "/Users/pamathur/kaggle_datasets/tps_nov_2021/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "kaggle_tps_output/"

In [3]:
# create a base xgb classifier model
xgb_params = {
    'use_label_encoder':False,
    'eval_metric':'auc',
    'random_state':42,
    # for GPU
#     'tree_method': 'gpu_hist',
#     'predictor': 'gpu_predictor',
}

xgb_model = XGBClassifier(**xgb_params)

In [4]:
# create tmlt for xgb model
tmlt = TMLT().prepare_data_for_training(
    train_file_path= DIRECTORY_PATH + TRAIN_FILE,
    #test_file_path= DIRECTORY_PATH + TEST_FILE,
    #make sure to use right index and target column
    idx_col="id",
    target="target",
    model=xgb_model,
    random_state=42,
    problem_type="classification")

2021-11-22 00:50:06,944 INFO 12 cores found, parallel processing is enabled!
2021-11-22 00:50:27,378 INFO DataFrame Memory usage decreased to 119.59 Mb (74.4% reduction)
2021-11-22 00:50:27,379 INFO No test_file_path given, so training will continue without it!
2021-11-22 00:50:29,827 INFO PreProcessing will include target(s) encoding!
2021-11-22 00:50:29,847 INFO categorical columns are None, Preprocessing will done accordingly!


In [5]:
tmlt.spl

In [6]:
# print(type(tmlt.dfl.y))
# # print(tmlt.dfl.y.values[10])
# # print(type(tmlt.dfl.y.values[10]))
# tmlt.dfl.y

In [7]:
# tmlt.dfl.create_train_valid(valid_size=0.2)

In [8]:
# # Quick check on dataframe shapes
# print(f"X_train shape is {tmlt.dfl.X_train.shape}" )
# print(f"X_valid shape is {tmlt.dfl.X_valid.shape}" )
# print(f"y_train shape is {tmlt.dfl.y_train.shape}")
# print(f"y_valid shape is {tmlt.dfl.y_valid.shape}")

In [9]:
# # Fit
# start = time.time()
# # Now fit
# tmlt.spl.fit(tmlt.dfl.X_train, tmlt.dfl.y_train)
# end = time.time()
# print("Fit Time:", end - start)

# #predict
# preds = tmlt.spl.predict(tmlt.dfl.X_valid)
# preds_probs = tmlt.spl.predict_proba(tmlt.dfl.X_valid)[:, 1]

# # Metrics
# auc = roc_auc_score(tmlt.dfl.y_valid, preds_probs)
# acc = accuracy_score(tmlt.dfl.y_valid, preds)

# print(f"AUC is : {auc} while Accuracy is : {acc} ")
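
The hold-out evaluation in the commented cell above scores the same model in two ways: `roc_auc_score` is given positive-class probabilities, while `accuracy_score` is given hard labels. A minimal sketch of that distinction, using small hand-made arrays (hypothetical values, not TPS data) in place of `tmlt.dfl.y_valid` and the pipeline outputs:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical validation labels and positive-class probabilities
y_valid = [0, 0, 1, 1]
preds_probs = [0.1, 0.4, 0.35, 0.8]

# Hard labels from a 0.5 threshold (assumed here for illustration)
preds = [int(p >= 0.5) for p in preds_probs]

# AUC ranks the probabilities; accuracy compares the thresholded labels
auc = roc_auc_score(y_valid, preds_probs)
acc = accuracy_score(y_valid, preds)

print(f"AUC is : {auc} while Accuracy is : {acc}")
```

Passing thresholded labels to `roc_auc_score` would discard the ranking information the metric depends on, which is why the two calls take different inputs.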

#### Let's do an Optuna-based hyperparameter search to get the best params for fit

In [10]:
from sklearn.metrics import roc_auc_score, log_loss

In [None]:
study = tmlt.do_xgb_optuna_optimization(metrics=log_loss, output_dir_path=OUTPUT_PATH)

2021-11-22 00:50:29,928 INFO direction is: minimize
[I 2021-11-22 00:50:30,046] A new study created in RDB with name: tmlt_autoxgb
2021-11-22 00:50:30,777 INFO Training Started


Parameters: { "colsample_bytree", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.
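
The `direction is: minimize` log line is consistent with `log_loss` being a lower-is-better metric (a score such as `roc_auc_score` would instead be maximized). The sketch below is a plain-Python illustration of binary log loss, not the toolkit's or scikit-learn's implementation; the helper `binary_log_loss` and the sample probabilities are made up for this example.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [0, 1, 1, 0]
good_probs = [0.1, 0.9, 0.8, 0.2]  # confident and correct
bad_probs = [0.6, 0.4, 0.5, 0.7]   # leaning the wrong way

good_loss = binary_log_loss(y_true, good_probs)
bad_loss = binary_log_loss(y_true, bad_probs)

# Better predictions give a smaller loss, so a search should minimize it
print(good_loss < bad_loss)
```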




In [None]:
print(study.best_trial)

##### Now update the model with the best params from the study, then update the sklearn pipeline with the new model

In [None]:
# xgb_params =  study.best_trial.params
# xgb_model = XGBClassifier(**xgb_params)
# tmlt.update_model(xgb_model)
# tmlt.spl

#### Let's Use K-Fold Training

In [None]:
# check current pipeline
tmlt.spl

In [None]:
# K-Fold fit and predict on test dataset
xgb_model_metrics_score, xgb_model_test_preds= tmlt.do_kfold_training(n_splits=5, metrics=roc_auc_score)
print(xgb_model_test_preds.shape)
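
For readers unfamiliar with the pattern, here is a rough sketch of what K-fold training with averaged test predictions generally looks like. It is not the internals of `do_kfold_training`; it assumes scikit-learn's `StratifiedKFold`, a `LogisticRegression` stand-in model, and synthetic data from `make_classification` rather than the TPS files.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (hypothetical; not the TPS dataset)
X, y = make_classification(n_samples=600, n_features=10, random_state=42)
X_train, y_train = X[:500], y[:500]
X_test = X[500:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
test_preds = np.zeros(len(X_test))

for train_idx, valid_idx in skf.split(X_train, y_train):
    # Train a fresh model on the training folds only
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])

    # Score on the held-out fold using positive-class probabilities
    valid_probs = model.predict_proba(X_train[valid_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_train[valid_idx], valid_probs))

    # Accumulate the mean of the per-fold test predictions
    test_preds += model.predict_proba(X_test)[:, 1] / skf.get_n_splits()

print(np.mean(fold_scores), test_preds.shape)
```

Averaging the per-fold test predictions tends to be more stable than relying on a single train/validation split, at the cost of fitting the model once per fold.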

In [None]:
# # take the weighted average of both k-fold models' predictions
# # (the weights sum to 1, so no further division is needed)
# final_preds = (0.45 * sci_model_preds) + (0.55 * xgb_model_test_preds)
# print(final_preds.shape)
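
A weighted blend of two models' predictions can be sketched with `np.average`, which divides by the sum of the weights and so stays a true weighted mean even if the weights do not add up to 1. The arrays `preds_a` and `preds_b` below are hypothetical stand-ins for the per-model test predictions.

```python
import numpy as np

# Hypothetical positive-class probabilities from two models
preds_a = np.array([0.2, 0.8, 0.6])
preds_b = np.array([0.4, 0.6, 1.0])

# One weight per model, in the same order as the stacked rows
weights = np.array([0.45, 0.55])

# Weighted mean across models for each test row
final_preds = np.average(np.vstack([preds_a, preds_b]), axis=0, weights=weights)

print(final_preds.shape)
```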

#### Create Kaggle Predictions

In [None]:
# sub = pd.read_csv(DIRECTORY_PATH + SAMPLE_SUB_FILE)
# sub['target'] = final_preds
# sub.to_csv('submission.csv', index=False)
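
Before writing a submission, it can help to confirm that there is exactly one prediction per row of the sample submission. The sketch below uses a small in-memory DataFrame and made-up `id` values in place of `sample_submission.csv`, and writes to a string buffer instead of a file; those substitutions are assumptions made only for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for sample_submission.csv and the model predictions
sample_sub = pd.DataFrame({"id": [600000, 600001, 600002], "target": [0.5, 0.5, 0.5]})
final_preds = np.array([0.12, 0.87, 0.45])

# Guard against a length mismatch before overwriting the column
if len(final_preds) != len(sample_sub):
    raise ValueError("one prediction per test row is required")

sub = sample_sub.copy()
sub["target"] = final_preds

# Write without the DataFrame index so only the id and target columns appear
buffer = io.StringIO()
sub.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text.splitlines()[0])
```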
