# Getting Started with the Kaggle TPS Challenge Using Tabular ML Toolkit

> A tutorial showcasing the tabular_ml_toolkit library on the Kaggle TPS Challenge, Nov 2021.

> tabular_ml_toolkit (TMLT) is a helper library to jumpstart machine learning projects based on tabular (structured) data.

> It comes with model and data parallelism and efficient hyperparameter search techniques.

> Under the hood, TMLT uses modin, optuna, xgboost and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

In [1]:
# !pip install -U tabular_ml_toolkit

In [2]:
# !pip install -U pandas==1.3.4

In [3]:
from tabular_ml_toolkit.tmlt import *
from xgboost import XGBClassifier
import numpy as np

# for visualizing pipeline
from sklearn import set_config
set_config(display="diagram")

# just to measure fit performance
import time

In [4]:
from sklearn.metrics import roc_auc_score, accuracy_score

#### For the dataset, mount Google Drive (Colab only)

In [5]:
# from google.colab import drive
# drive.mount('/content/gdrive/')
# # drive.mount('/content/gdrive/', force_remount=True)

In [6]:
# import os
# COLAB_BASE_PATH = '/content/gdrive/MyDrive/pankaj_dev/kaggle'
# os.listdir(COLAB_BASE_PATH)

In [7]:
# # Dataset file names and Paths
# DIRECTORY_PATH = COLAB_BASE_PATH +"/tabular/tps_nov_2021/input/"
# TRAIN_FILE = "train.csv"
# TEST_FILE = "test.csv"
# SAMPLE_SUB_FILE = "sample_submission.csv"
# OUTPUT_PATH = COLAB_BASE_PATH + "/tabular/tps_nov_2021/output/"
# os.listdir(DIRECTORY_PATH)

In [8]:
# Dataset file names and Paths
DIRECTORY_PATH = "/Users/pamathur/kaggle_datasets/tps_nov_2021/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "kaggle_tps_output/"

#### Create a base xgb classifier model with your best guess params

In [9]:
xgb_params = {
    # your best-guess params
    'learning_rate': 0.01,
    'eval_metric': 'auc',
    # required for XGBClassifier, otherwise a warning is shown
    'use_label_encoder': False,
    # because 42 is the answer for all the randomness of this universe
    'random_state': 42,
    # for GPU; needs an XGBoost build with GPU support
    # (on CPU, use 'tree_method': 'hist' and drop 'predictor')
    'tree_method': 'gpu_hist',
    'predictor': 'gpu_predictor',
}

xgb_model = XGBClassifier(**xgb_params)

In [10]:
# create tmlt for xgb model
tmlt = TMLT().prepare_data_for_training(
    train_file_path= DIRECTORY_PATH + TRAIN_FILE,
    test_file_path= DIRECTORY_PATH + TEST_FILE,
    # make sure to use the right index and target columns
    idx_col="id",
    target="target",
    model=xgb_model,
    random_state=42,
    problem_type="binary_classification",
    nrows=4000
)

2021-11-25 11:15:07,878 INFO 12 cores found, model and data parallel processing should worked!
2021-11-25 11:15:08,026 INFO DataFrame Memory usage decreased to 0.80 Mb (74.4% reduction)
2021-11-25 11:15:08,174 INFO DataFrame Memory usage decreased to 0.79 Mb (74.3% reduction)
2021-11-25 11:15:08,233 INFO categorical columns are None, Preprocessing will done accordingly!


In [11]:
tmlt.spl

#### Let's do a quick round of training

In [12]:
# tmlt.dfl.create_train_valid(valid_size=0.2)

In [13]:
# # Quick check on dataframe shapes
# print(f"X_train shape is {tmlt.dfl.X_train.shape}" )
# print(f"X_valid shape is {tmlt.dfl.X_valid.shape}" )
# print(f"y_train shape is {tmlt.dfl.y_train.shape}")
# print(f"y_valid shape is {tmlt.dfl.y_valid.shape}")

In [14]:
# # Fit
# start = time.time()
# tmlt.spl.fit(tmlt.dfl.X_train, tmlt.dfl.y_train)
# end = time.time()
# print("Fit Time:", end - start)

# #predict
# preds = tmlt.spl.predict(tmlt.dfl.X_valid)
# preds_probs = tmlt.spl.predict_proba(tmlt.dfl.X_valid)[:, 1]

# # Val Metrics
# auc = roc_auc_score(tmlt.dfl.y_valid, preds_probs)
# acc = accuracy_score(tmlt.dfl.y_valid, preds)

# print(f"AUC is : {auc} while Accuracy is : {acc} ")

#### Meta Ensemble Model Training

#### Base Model 1: linear SVM model

In [15]:
from sklearn.svm import LinearSVC

In [16]:
# OOF training and prediction on both the train and test datasets with a given model

linear_oof_model = LinearSVC(tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=42)

linear_oof_model_preds, linear_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                          oof_model=linear_oof_model)
if linear_oof_model_preds is not None:
    print(linear_oof_model_preds.shape)

if linear_oof_model_test_preds is not None:    
    print(linear_oof_model_test_preds.shape)

2021-11-25 11:15:08,707 INFO fold: 1 OOF Model ROC AUC: 0.7259767891682785!
2021-11-25 11:15:09,130 INFO fold: 2 OOF Model ROC AUC: 0.6958091553836234!
2021-11-25 11:15:09,682 INFO fold: 3 OOF Model ROC AUC: 0.6614764667956157!
2021-11-25 11:15:10,115 INFO fold: 4 OOF Model ROC AUC: 0.7080050760440353!
2021-11-25 11:15:10,554 INFO fold: 5 OOF Model ROC AUC: 0.7223571396363027!
2021-11-25 11:15:10,560 INFO Mean OOF Model ROC AUC: 0.7027249254055712!


(4000,)
(4000,)
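
The out-of-fold (OOF) idea behind the call above can be sketched with plain scikit-learn. This is a generic illustration on synthetic data, not TMLT's implementation; the helper name `oof_predictions` and the toy dataset are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def oof_predictions(model, X, y, X_test, n_splits=5, seed=42):
    """Return (oof_preds, test_preds) for a binary classifier.

    Hypothetical helper for illustration only.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(X.shape[0])
    test = np.zeros(X_test.shape[0])
    for train_idx, valid_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        # Each training row is scored by a model that never saw it.
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        # Test predictions are averaged over the fold models.
        test += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test


# Synthetic stand-in for the competition data.
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=0)
X, y, X_test = X_all[:400], y_all[:400], X_all[400:]

oof, test = oof_predictions(LogisticRegression(max_iter=1000), X, y, X_test)
print(oof.shape, test.shape)
print("OOF ROC AUC:", roc_auc_score(y, oof))
```

Because every training row gets its prediction from a model that was fit without it, the OOF vector can later be used as a feature for a meta model without leaking the target.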


#### Base Model 2: Logistic Regression Model

In [17]:
from sklearn.linear_model import LogisticRegression

In [18]:
# OOF training and prediction on both the train and test datasets with a given model

log_oof_model = LogisticRegression(solver='liblinear', random_state=42)

log_oof_model_preds, log_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                          oof_model=log_oof_model)
if log_oof_model_preds is not None:
    print(log_oof_model_preds.shape)

if log_oof_model_test_preds is not None:    
    print(log_oof_model_test_preds.shape)

2021-11-25 11:15:10,841 INFO fold: 1 OOF Model ROC AUC: 0.7265248226950354!
2021-11-25 11:15:11,066 INFO fold: 2 OOF Model ROC AUC: 0.6951386202450032!
2021-11-25 11:15:11,287 INFO fold: 3 OOF Model ROC AUC: 0.6605157962604772!
2021-11-25 11:15:11,494 INFO fold: 4 OOF Model ROC AUC: 0.709422245698568!
2021-11-25 11:15:11,695 INFO fold: 5 OOF Model ROC AUC: 0.7191620662333563!
2021-11-25 11:15:11,701 INFO Mean OOF Model ROC AUC: 0.702152710226488!


(4000,)
(4000,)


#### Base Model 3: scikit-learn Neural Network (MLP)

In [19]:
from sklearn.neural_network import MLPClassifier

In [20]:
# OOF training and prediction on both the train and test datasets with a given model

mlp_oof_model = MLPClassifier(max_iter=1000, early_stopping=True)

mlp_oof_model_preds, mlp_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5,
                                                          oof_model=mlp_oof_model)
if mlp_oof_model_preds is not None:
    print(mlp_oof_model_preds.shape)

if mlp_oof_model_test_preds is not None:    
    print(mlp_oof_model_test_preds.shape)

2021-11-25 11:15:12,216 INFO fold: 1 OOF Model ROC AUC: 0.6886137975499677!
2021-11-25 11:15:12,708 INFO fold: 2 OOF Model ROC AUC: 0.6480657640232108!
2021-11-25 11:15:13,128 INFO fold: 3 OOF Model ROC AUC: 0.6095099935525468!
2021-11-25 11:15:13,816 INFO fold: 4 OOF Model ROC AUC: 0.6732457694264972!
2021-11-25 11:15:14,341 INFO fold: 5 OOF Model ROC AUC: 0.6675899741688622!
2021-11-25 11:15:14,355 INFO Mean OOF Model ROC AUC: 0.6574050597442169!


(4000,)
(4000,)


#### Now add the base models' predictions back to X and X_test



In [21]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["linear_preds"] = linear_oof_model_preds
tmlt.dfl.X_test["linear_preds"] = linear_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 101)
(4000, 101)


In [22]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["log_reg_preds"] = log_oof_model_preds
tmlt.dfl.X_test["log_reg_preds"] = log_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 102)
(4000, 102)


In [23]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["mlp_preds"] = mlp_oof_model_preds
tmlt.dfl.X_test["mlp_preds"] = mlp_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 103)
(4000, 103)


#### Meta Model Training (direct training, without TMLT)


In [24]:
xgb_params = {
    'objective': 'binary:logistic',
    'use_label_encoder': False,
    'n_estimators': 40000,
    'learning_rate': 0.18515462875481553,
    'subsample': 0.97,
    'colsample_bytree': 0.32,
    'max_depth': 1,
    'booster': 'gbtree',
    'gamma': 0.2,
    'reg_lambda': 0.11729916523488974,
    'reg_alpha': 0.6318827156945853,
    'random_state': 42,
    'n_jobs': 4,
    'min_child_weight': 256,
    # for GPU; needs an XGBoost build with GPU support
    # (on CPU, use 'tree_method': 'hist' and drop 'predictor')
    'tree_method': 'gpu_hist',
    'predictor': 'gpu_predictor',
    }

In [25]:
from sklearn.model_selection import StratifiedKFold

In [26]:
%%time

# Setting up fold parameters
splits = 10
skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)

# Creating an array of zeros for storing "out of fold" predictions
oof_preds = np.zeros((tmlt.dfl.X.shape[0],))
preds = 0
# model_fi = 0
total_mean_auc = 0

# Generating folds and making training and prediction for each of them
for num, (train_idx, valid_idx) in enumerate(skf.split(tmlt.dfl.X, tmlt.dfl.y)):
    tmlt.dfl.X_train, tmlt.dfl.X_valid = tmlt.dfl.X.loc[train_idx], tmlt.dfl.X.loc[valid_idx]
    tmlt.dfl.y_train, tmlt.dfl.y_valid = tmlt.dfl.y[train_idx], tmlt.dfl.y[valid_idx]
    
    model = XGBClassifier(**xgb_params)
    model.fit(tmlt.dfl.X_train, tmlt.dfl.y_train,
              verbose=False,
              # The parameters below help to detect and avoid overfitting
              eval_set=[(tmlt.dfl.X_train, tmlt.dfl.y_train), (tmlt.dfl.X_valid, tmlt.dfl.y_valid)],
              eval_metric="auc",
              early_stopping_rounds=300,
              )
    
    # Getting mean test data predictions (i.e. divided by number of splits)
    preds += model.predict_proba(tmlt.dfl.X_test)[:, 1] / splits
    
    # Getting mean feature importances (i.e. divided by number of splits)
    # model_fi += model.feature_importances_ / splits
    
    # Getting validation data predictions. Each fold model makes predictions on an unseen data.
    # So in the end it will be completely filled with unseen data predictions.
    # It will be used to evaluate hyperparameters performance only.
    oof_preds[valid_idx] = model.predict_proba(tmlt.dfl.X_valid)[:, 1]
    
    # Getting score for a fold model
    fold_auc = roc_auc_score(tmlt.dfl.y_valid, oof_preds[valid_idx])
    print(f"Fold {num} ROC AUC: {fold_auc}")

    # Getting mean score of all fold models (i.e. divided by number of splits)
    total_mean_auc += fold_auc / splits
    # NOTE: deleting this list only drops the list itself; the fold dataframes
    # remain referenced by tmlt.dfl until the next fold overwrites them
    unused_df_lst = [tmlt.dfl.X_train, tmlt.dfl.X_valid, tmlt.dfl.y_train, tmlt.dfl.y_valid]
    del unused_df_lst
    
print(f"\nOverall ROC AUC: {total_mean_auc}")

XGBoostError: [11:15:14] /Users/runner/miniforge3/conda-bld/xgboost-split_1634712680264/work/src/gbm/../common/common.h:157: XGBoost version not compiled with GPU support.
Stack trace:
  [bt] (0) 1   libxgboost.dylib                    0x000000017ba93c74 dmlc::LogMessageFatal::~LogMessageFatal() + 116
  [bt] (1) 2   libxgboost.dylib                    0x000000017bb2389e xgboost::gbm::GBTree::ConfigureUpdaters() + 478
  [bt] (2) 3   libxgboost.dylib                    0x000000017bb233c7 xgboost::gbm::GBTree::Configure(std::__1::vector<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > > const&) + 1207
  [bt] (3) 4   libxgboost.dylib                    0x000000017bb4070e xgboost::LearnerConfiguration::Configure() + 1502
  [bt] (4) 5   libxgboost.dylib                    0x000000017ba979b4 XGBoosterBoostedRounds + 116
  [bt] (5) 6   libffi.7.dylib                      0x0000000100d06ead ffi_call_unix64 + 85
  [bt] (6) 7   ???                                 0x00007ffeef442300 0x0 + 140732912640768



Fold 0 ROC AUC: 0.761928164444148
Fold 1 ROC AUC: 0.7610277116407352
Fold 2 ROC AUC: 0.7627903156056819
Fold 3 ROC AUC: 0.7632015002586378
Fold 4 ROC AUC: 0.7569241924918775

Overall ROC AUC: 0.7611743768882161
CPU times: user 53 s, sys: 384 ms, total: 53.4 s
Wall time: 26.7 s
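
The `XGBoostError` in the output above is raised when `'tree_method': 'gpu_hist'` is requested from an XGBoost build without GPU support. One way to keep a single parameter dictionary usable on both kinds of machines is a small switch like the sketch below. The helper name `with_device` is hypothetical and not part of TMLT or XGBoost; `'hist'` is XGBoost's CPU histogram tree method. On XGBoost 2.0 and later, `'gpu_hist'` and `'predictor'` are deprecated in favour of `tree_method='hist'` with `device='cuda'`, so adjust accordingly.

```python
def with_device(params, use_gpu):
    """Return a copy of `params` configured for GPU or CPU training.

    Hypothetical helper for illustration; the input dict is not modified.
    """
    out = dict(params)
    if use_gpu:
        out["tree_method"] = "gpu_hist"
        out["predictor"] = "gpu_predictor"
    else:
        # CPU histogram algorithm; the GPU predictor setting is dropped.
        out["tree_method"] = "hist"
        out.pop("predictor", None)
    return out


# Shortened stand-in for the parameter dictionary used in this notebook.
base_params = {
    "objective": "binary:logistic",
    "max_depth": 1,
    "tree_method": "gpu_hist",
    "predictor": "gpu_predictor",
}
cpu_params = with_device(base_params, use_gpu=False)
print(cpu_params)
```

The resulting dictionary can be passed to `XGBClassifier(**cpu_params)` in the same way as the original one.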

#### Let's do an Optuna-based hyperparameter search to find the best params for fit

##### Since the training dataset is large, let's give the Optuna study optimization 600 seconds (10 minutes)

In [27]:
# study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH,
#                                        use_gpu=True, opt_timeout=600)

**If the trials do not finish within 10 minutes, we can always come back and rerun this cell for additional hyperparameter search; Optuna saves the study and resumes from the last trial.**

In [28]:
# print(study.best_trial)

#### Now, let's update the model with best params from optuna study

**Make sure to update the sklearn pipeline with the new model too; only then will the sklearn pipeline not reuse cached models (estimators).**

In [29]:
# # xgb_params.update(study.best_trial.params)
# # xgb_params.update(new_xgb_params)
# # print("Final xgb_params:", xgb_params)
# xgb_model = XGBClassifier(**new_xgb_params)

# # update the sklearn pipeline so it does not use the cached model (estimator)
# tmlt.update_model(xgb_model)
# # let's see the sklearn pipeline
# tmlt.spl

#### Let's do K-Fold Training

In [30]:
# # K-Fold fit and predict on test dataset
# xgb_model_preds_metrics_score, xgb_model_test_preds= tmlt.do_kfold_training(n_splits=5, test_preds_metric=roc_auc_score)
# if xgb_model_test_preds is not None:
#     print(xgb_model_test_preds.shape)

#### Create Kaggle Predictions

In [31]:
import pandas as pd

In [32]:
sub = pd.read_csv(DIRECTORY_PATH + SAMPLE_SUB_FILE)
# sub['target'] = xgb_model_test_preds
sub['target'] = preds
sub.to_csv(OUTPUT_PATH + 'wed_nov_25_1042_submission.csv', index=False)
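
Two things can make the cell above fail: `to_csv` does not create a missing output directory, and assigning `preds` raises an error if its length differs from the sample submission (which can happen when only part of the test file was loaded via `nrows`). A defensive version might look like the sketch below; the tiny DataFrame, the prediction values and the temporary directory are stand-ins invented for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the sample submission and the test predictions.
sample_sub = pd.DataFrame({"id": [600000, 600001, 600002], "target": 0.5})
preds = np.array([0.12, 0.87, 0.45])

# Create the output directory if it does not exist yet.
output_dir = os.path.join(tempfile.mkdtemp(), "kaggle_tps_output")
os.makedirs(output_dir, exist_ok=True)

# Fail early with a clear message on a row-count mismatch.
if len(sample_sub) != len(preds):
    raise ValueError(
        f"{len(preds)} predictions for {len(sample_sub)} submission rows"
    )
sample_sub["target"] = preds

out_file = os.path.join(output_dir, "submission.csv")
sample_sub.to_csv(out_file, index=False)
print(os.listdir(output_dir))
```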

In [33]:
import os
os.listdir(OUTPUT_PATH)