# Getting Started with the Kaggle TPS Challenge Using Tabular ML Toolkit

> A tutorial showcasing the tabular_ml_toolkit (tmlt) library on the Kaggle TPS Challenge, Nov 2021.

> tabular_ml_toolkit is a helper library to jumpstart machine learning projects based on tabular (structured) data.

> It comes with model and data parallelism and cutting-edge hyperparameter search techniques.

> Under the hood, TMLT uses modin, optuna, xgboost, and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

### How to Best Use tabular_ml_toolkit

Start with your favorite model, then create a **tmlt** instance with a single API call.

*Here we use XGBClassifier on the [Kaggle TPS Challenge (Nov 2021) data](https://www.kaggle.com/c/tabular-playground-series-nov-2021/data).*

In [1]:
from tabular_ml_toolkit.tmlt import *
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier
import numpy as np

# for visualizing pipeline
from sklearn import set_config
set_config(display="diagram")

# just to measure fit performance
import time
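
The `time` import above is only there to measure how long fitting takes. As a minimal sketch, that bookkeeping can be wrapped in a small context manager. `timed` is a hypothetical helper name, not part of tmlt, and the summation inside the `with` block is just a stand-in for a real `fit` call. It uses `time.perf_counter`, which is intended for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Hypothetical helper (not part of tmlt): time the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # Optionally store the duration so it can be compared later.
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f}s")


timings = {}
with timed("fit", timings):
    # Stand-in workload; in the notebook this would be the fit call.
    total = sum(i * i for i in range(100_000))
```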

In [2]:
from sklearn.metrics import roc_auc_score, accuracy_score

In [3]:
# Dataset file names and Paths
DIRECTORY_PATH = "/Users/pamathur/kaggle_datasets/tps_nov_2021/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "kaggle_tps_output/"
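
Building file paths with `+` only works while `DIRECTORY_PATH` ends with a slash. As a hedged alternative, here is a small sketch using the standard library's `pathlib`, which joins the parts correctly either way. The variable names `train_path`, `test_path`, and `sample_sub_path` are illustrative. Whether tmlt accepts `Path` objects directly is not shown in this tutorial, so the sketch converts to `str` before use.

```python
from pathlib import Path

# Same values as the cell above, with the directory wrapped in a Path.
DIRECTORY_PATH = Path("/Users/pamathur/kaggle_datasets/tps_nov_2021/")
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"

# The / operator joins path parts, with or without a trailing slash.
train_path = DIRECTORY_PATH / TRAIN_FILE
test_path = DIRECTORY_PATH / TEST_FILE
sample_sub_path = DIRECTORY_PATH / SAMPLE_SUB_FILE

# Convert to a plain string for APIs that expect one.
train_path_str = str(train_path)
print(train_path_str)
```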

In [4]:
# TODO: try this using logistic regression
# https://www.kaggle.com/maximkazantsev/tps-11-21-eda-xgboost-optuna

# TODO: also take out modin, or add some functionality to use both

#### Create a base XGBClassifier model with your best-guess params

In [5]:
xgb_params = {
    # your best-guess params
    'learning_rate': 0.01,
    'eval_metric': 'auc',
    # required for XGBClassifier here; otherwise a warning is shown
    'use_label_encoder': False,
    # because 42 is the answer to all the randomness of this universe
    'random_state': 42,
    # for GPU
    # 'tree_method': 'gpu_hist',
    # 'predictor': 'gpu_predictor',
}

xgb_model = XGBClassifier(**xgb_params)

In [6]:
# create tmlt for the xgb model
tmlt = TMLT().prepare_data_for_training(
    train_file_path=DIRECTORY_PATH + TRAIN_FILE,
    test_file_path=DIRECTORY_PATH + TEST_FILE,
    # make sure to use the right index and target columns
    idx_col="id",
    target="target",
    model=xgb_model,
    random_state=42,
    problem_type="binary_classification",
    nrows=4000)


# supported problem_type values:
# "binary_classification"
# "multi_label_classification"
# "multi_class_classification"
# "regression"

2021-11-27 12:07:45,211 INFO 12 cores found, model and data parallel processing should worked!
2021-11-27 12:07:45,339 INFO DataFrame Memory usage decreased to 0.80 Mb (74.4% reduction)
2021-11-27 12:07:45,457 INFO DataFrame Memory usage decreased to 0.79 Mb (74.3% reduction)
2021-11-27 12:07:45,502 INFO categorical columns are None, Preprocessing will done accordingly!


In [7]:
tmlt.spl

#### Let's do a quick round of training

In [8]:
# tmlt.dfl.create_train_valid(valid_size=0.2)

In [9]:
# # Quick check on dataframe shapes
# print(f"X_train shape is {tmlt.dfl.X_train.shape}")
# print(f"X_valid shape is {tmlt.dfl.X_valid.shape}")
# print(f"y_train shape is {tmlt.dfl.y_train.shape}")
# print(f"y_valid shape is {tmlt.dfl.y_valid.shape}")
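
To make the `valid_size=0.2` idea concrete, here is a minimal plain-Python sketch of a holdout split: shuffle the row indices once with a fixed seed, then set aside 20% of them for validation. `holdout_split` is a hypothetical function for illustration only. How `create_train_valid` is implemented inside tmlt is not shown in this tutorial, and in practice scikit-learn's `train_test_split` does this job.

```python
import random


def holdout_split(rows, labels, valid_size=0.2, seed=42):
    """Hypothetical sketch: shuffle indices, keep the first share for validation."""
    if not 0.0 < valid_size < 1.0:
        raise ValueError("valid_size must be between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_valid = max(1, round(len(rows) * valid_size))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    X_train = [rows[i] for i in train_idx]
    X_valid = [rows[i] for i in valid_idx]
    y_train = [labels[i] for i in train_idx]
    y_valid = [labels[i] for i in valid_idx]
    return X_train, X_valid, y_train, y_valid


# Toy data: 10 rows with 2 features each, and a binary label per row.
rows = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_valid, y_train, y_valid = holdout_split(rows, labels, valid_size=0.2)
print(len(X_train), len(X_valid), len(y_train), len(y_valid))
```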

In [10]:
# # Fit and time the pipeline
# start = time.time()
# tmlt.spl.fit(tmlt.dfl.X_train, tmlt.dfl.y_train)
# end = time.time()
# print("Fit Time:", end - start)

# # Predict hard labels and positive-class probabilities
# preds = tmlt.spl.predict(tmlt.dfl.X_valid)
# preds_probs = tmlt.spl.predict_proba(tmlt.dfl.X_valid)[:, 1]

# # Metrics
# auc = roc_auc_score(tmlt.dfl.y_valid, preds_probs)
# acc = accuracy_score(tmlt.dfl.y_valid, preds)

# print(f"AUC is: {auc}, accuracy is: {acc}")
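
The cell above feeds two different outputs into the two metrics: AUC gets the positive-class probabilities, while accuracy gets the hard labels. The following plain-Python sketch shows why that matters. `roc_auc_pairwise` is a hypothetical helper that uses the pairwise definition of AUC (the share of positive/negative pairs where the positive is scored higher, with ties counting half). It is not how scikit-learn computes the metric internally, and the data is made up.

```python
def accuracy(y_true, y_pred):
    """Share of hard labels that match the truth."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up validation labels and predicted probabilities.
y_valid = [0, 0, 1, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.45]
hard_labels = [1 if p >= 0.5 else 0 for p in probs]  # threshold at 0.5

auc = roc_auc_pairwise(y_valid, probs)
acc = accuracy(y_valid, hard_labels)
print(f"AUC is: {auc:.4f}, accuracy is: {acc:.4f}")
```

AUC only depends on how the scores are ordered, so it ignores the 0.5 threshold that accuracy relies on.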

#### Base Model for Meta Ensemble Model

In [11]:
# # Out-of-fold (OOF) training and prediction on both the train and test datasets with a given model

# linear_oof_model = LinearSVC(tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=42)

# linear_oof_model_preds, linear_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(
#     n_splits=5,
#     oof_model=linear_oof_model)
# if linear_oof_model_preds is not None:
#     print(linear_oof_model_preds.shape)

# if linear_oof_model_test_preds is not None:
#     print(linear_oof_model_test_preds.shape)
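
For readers new to out-of-fold predictions, here is a minimal plain-Python sketch of the general idea: every training row is predicted by a model that never saw it, and test rows are predicted by each fold's model and averaged. `MeanModel`, `kfold_indices`, and `oof_predict` are hypothetical names for illustration. The internals of `do_oof_kfold_train_preds` are not shown in this tutorial, so treat this as an assumption about the technique, not a description of tmlt.

```python
import random


class MeanModel:
    """Toy stand-in for a real estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def kfold_indices(n_rows, n_splits, seed=42):
    """Shuffle row indices and deal them round-robin into n_splits folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[k::n_splits] for k in range(n_splits)]


def oof_predict(make_model, X, y, X_test, n_splits=5):
    oof = [None] * len(X)
    test_preds = [0.0] * len(X_test)
    for valid_idx in kfold_indices(len(X), n_splits):
        held_out = set(valid_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = make_model().fit([X[i] for i in train_idx],
                                 [y[i] for i in train_idx])
        # Each held-out row is predicted by a model that did not train on it.
        for i, pred in zip(valid_idx, model.predict([X[i] for i in valid_idx])):
            oof[i] = pred
        # Test predictions are averaged over the fold models.
        for j, pred in enumerate(model.predict(X_test)):
            test_preds[j] += pred / n_splits
    return oof, test_preds


X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
X_test = [[100.0], [200.0]]

oof_preds, test_preds = oof_predict(MeanModel, X, y, X_test, n_splits=5)
print(len(oof_preds), len(test_preds))
```

Because each OOF value comes from a model that did not see that row's target, the values can be added to the training data as a new feature with less risk of leaking the label into the meta model.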

In [12]:
# # add base model OOF predictions back to X and X_test before meta model training
# tmlt.dfl.X["linear_preds"] = linear_oof_model_preds
# tmlt.dfl.X_test["linear_preds"] = linear_oof_model_test_preds

In [13]:
# print(tmlt.dfl.X.shape)
# print(tmlt.dfl.X_test.shape)

#### For the meta model, let's do an Optuna-based hyperparameter search to find the best params

In [14]:
# study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH, opt_timeout=60)

In [15]:
# print(study.best_trial)
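
To show what a hyperparameter search does at its simplest, here is a plain-Python sketch of random search with an optional time budget. This is deliberately not Optuna and not its sampler, just the basic loop of "sample params, score them, keep the best". `random_search`, `toy_objective`, and the search space are all hypothetical. A real objective would train and validate a model.

```python
import random
import time


def random_search(objective, space, n_trials=50, timeout=None, seed=42):
    """Sample params uniformly from `space` and keep the best-scoring trial."""
    rng = random.Random(seed)
    deadline = None if timeout is None else time.monotonic() + timeout
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        # Stop early once the optional time budget (in seconds) is used up.
        if deadline is not None and time.monotonic() > deadline:
            break
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in space.items()}
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_objective(params):
    # Made-up score that peaks at learning_rate=0.1 and subsample=0.8.
    return (-((params["learning_rate"] - 0.1) ** 2)
            - (params["subsample"] - 0.8) ** 2)


space = {"learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)}
best_params, best_value = random_search(toy_objective, space, n_trials=200)
print(best_params, best_value)
```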

##### Now update the meta model with the best params from the study, then update the sklearn pipeline with the new model

In [16]:
# xgb_params.update(study.best_trial.params)
# print("Final xgb_params:", xgb_params)
# xgb_model = XGBClassifier(**xgb_params)
# tmlt.update_model(xgb_model)
# tmlt.spl
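
The merge step relies on plain dict semantics: keys found by the search overwrite the best-guess values, and keys the search did not touch (such as `random_state`) are kept. A small sketch with made-up search results follows. `best_trial_params` is a hypothetical stand-in for `study.best_trial.params`. It merges into a new dict rather than calling `update` in place, so the original guesses stay available for comparison.

```python
# Best-guess params, as defined earlier in the tutorial.
xgb_params = {
    "learning_rate": 0.01,
    "eval_metric": "auc",
    "use_label_encoder": False,
    "random_state": 42,
}

# Hypothetical search result; real values would come from study.best_trial.params.
best_trial_params = {"learning_rate": 0.05, "max_depth": 6}

# Later keys win, so tuned values override the guesses.
final_params = {**xgb_params, **best_trial_params}
print("Final xgb_params:", final_params)
```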

#### Let's Use K-Fold Training with best params

In [None]:
# K-Fold fit and predict on the test dataset
xgb_model_mean_metrics_results, xgb_model_test_preds = tmlt.do_kfold_training(
    n_splits=5,
    test_preds_metric=roc_auc_score)
if xgb_model_test_preds is not None:
    print(xgb_model_test_preds.shape)
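
The call above returns a mean metric and a single set of test predictions. As a rough sketch of how per-fold results are commonly combined, assuming simple averaging (the tutorial does not state what tmlt does internally), the fold scores and the per-row test predictions can each be averaged. All numbers below are made up.

```python
from statistics import mean

# Hypothetical per-fold results: (validation AUC, predictions for 3 test rows).
fold_results = [
    (0.70, [0.2, 0.8, 0.5]),
    (0.74, [0.3, 0.7, 0.5]),
    (0.72, [0.1, 0.9, 0.5]),
]

# Average the validation metric over the folds.
mean_auc = mean(auc for auc, _ in fold_results)

# zip(*...) regroups the predictions by test row, then each row is averaged.
mean_test_preds = [mean(row) for row in zip(*(preds for _, preds in fold_results))]

print(mean_auc, mean_test_preds)
```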

In [None]:
# # take a weighted average of both k-fold models' predictions
# # (the weights sum to 1, so no further division is needed)
# final_preds = (0.45 * sci_model_preds) + (0.55 * xgb_model_test_preds)
# print(final_preds.shape)
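
When the weights of a weighted average already sum to 1, the blend stays on the same scale as the inputs, so it needs no extra normalization. A minimal sketch with made-up predictions follows. `blend` is a hypothetical helper that also guards against mismatched lengths and weights that do not sum to 1.

```python
def blend(preds_a, preds_b, weight_a=0.45, weight_b=0.55):
    """Hypothetical helper: weighted average of two prediction lists."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    if abs(weight_a + weight_b - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return [weight_a * a + weight_b * b for a, b in zip(preds_a, preds_b)]


# Made-up probabilities from two models for the same two test rows.
model_a_preds = [0.2, 0.9]
model_b_preds = [0.4, 0.7]

final_preds = blend(model_a_preds, model_b_preds)
print(final_preds)
```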

#### Create Kaggle Predictions

In [None]:
# import pandas as pd  # explicit import, in case the star import above does not expose pd
# sub = pd.read_csv(DIRECTORY_PATH + SAMPLE_SUB_FILE)
# sub['target'] = final_preds
# sub.to_csv('submission.csv', index=False)
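
A common slip at this step is writing a prediction array whose length does not match the sample submission. Here is a hedged sketch of the same "copy the sample file, replace the target column" idea with a length check added. It uses the standard library's `csv` module instead of pandas so it runs anywhere. `write_submission` is a hypothetical helper, and the tiny sample file with its ids is a made-up stand-in for the real `sample_submission.csv`.

```python
import csv
import tempfile
from pathlib import Path


def write_submission(sample_path, preds, out_path, target_col="target"):
    """Copy the sample submission, replacing the target column with preds."""
    with open(sample_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    # Fail early if the predictions do not line up with the expected rows.
    if len(rows) != len(preds):
        raise ValueError(f"expected {len(rows)} predictions, got {len(preds)}")
    for row, pred in zip(rows, preds):
        row[target_col] = pred
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Tiny stand-in for the real sample submission file (ids are made up).
work_dir = Path(tempfile.mkdtemp())
sample_path = work_dir / "sample_submission.csv"
sample_path.write_text("id,target\n600000,0.5\n600001,0.5\n")

out_path = work_dir / "submission.csv"
write_submission(sample_path, [0.31, 0.79], out_path)
print(out_path.read_text())
```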
