# Getting Started with the Kaggle TPS Challenge Using Tabular ML Toolkit

> A tutorial showcasing the tabular_ml_toolkit library on the Kaggle TPS Challenge, Nov 2021.

> tabular_ml_toolkit is a helper library to jumpstart machine learning projects based on tabular (structured) data.

> It comes with model and data parallelism and efficient hyperparameter search techniques.

> Under the hood, TMLT uses modin, optuna, xgboost, and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

In [1]:
# !pip install -U tabular_ml_toolkit

In [2]:
# !pip install -U pandas==1.3.4

In [3]:
from tabular_ml_toolkit.tmlt import *
from xgboost import XGBClassifier
import numpy as np

# for visualizing pipeline
from sklearn import set_config
set_config(display="diagram")

# just to measure fit performance
import time

In [4]:
from sklearn.metrics import roc_auc_score, accuracy_score

#### For Dataset, Mount Google Drive

In [5]:
# from google.colab import drive
# drive.mount('/content/gdrive/')
# # drive.mount('/content/gdrive/', force_remount=True)

In [6]:
# import os
# COLAB_BASE_PATH = '/content/gdrive/MyDrive/pankaj_dev/kaggle'
# os.listdir(COLAB_BASE_PATH)

In [7]:
# # Dataset file names and Paths
# DIRECTORY_PATH = COLAB_BASE_PATH +"/tabular/tps_nov_2021/input/"
# TRAIN_FILE = "train.csv"
# TEST_FILE = "test.csv"
# SAMPLE_SUB_FILE = "sample_submission.csv"
# OUTPUT_PATH = COLAB_BASE_PATH + "/tabular/tps_nov_2021/output/"
# os.listdir(DIRECTORY_PATH)

In [8]:
# Dataset file names and Paths
DIRECTORY_PATH = "/Users/pamathur/kaggle_datasets/tps_nov_2021/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "kaggle_tps_output/"

#### Create a base sklearn model

In [9]:
from sklearn.linear_model import LogisticRegression
log_reg_model = LogisticRegression(solver='liblinear', random_state=42)

In [10]:
# create tmlt for the logistic regression model
tmlt = TMLT().prepare_data_for_training(
    train_file_path= DIRECTORY_PATH + TRAIN_FILE,
    test_file_path= DIRECTORY_PATH + TEST_FILE,
    # make sure to use the right index and target columns
    idx_col="id",
    target="target",
    model=log_reg_model,
    random_state=42,
    problem_type="binary_classification",
    nrows=4000
)

2021-11-28 17:41:05,158 INFO 12 cores found, model and data parallel processing should worked!
2021-11-28 17:41:05,273 INFO DataFrame Memory usage decreased to 0.80 Mb (74.4% reduction)
2021-11-28 17:41:05,420 INFO DataFrame Memory usage decreased to 0.79 Mb (74.3% reduction)
2021-11-28 17:41:05,475 INFO categorical columns are None, Preprocessing will done accordingly!
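
The log above reports a roughly 74% drop in DataFrame memory. TMLT's internals are not shown in this tutorial, but a common way to get a reduction like this is to downcast numeric columns to the smallest dtype that holds their values. A minimal sketch in plain pandas, assuming that approach (the `downcast_numeric` helper and the toy data are hypothetical and not part of TMLT):

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()
    # Integers shrink to the smallest integer type that fits their range
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Floats shrink to float32 where possible, which trades away some precision
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "f0": rng.normal(size=1000),                            # float64
    "f1": rng.integers(0, 100, size=1000, dtype=np.int64),  # int64
})
small = downcast_numeric(demo)

before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes ({100 * (1 - after / before):.1f}% reduction)")
```

Downcasting floats is lossy, so it is worth checking that model scores do not change before relying on it.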


In [11]:
tmlt.spl

#### Let's do a quick round of training

In [12]:
tmlt.dfl.create_train_valid(valid_size=0.2)

In [13]:
# # Quick check on dataframe shapes
# print(f"X_train shape is {tmlt.dfl.X_train.shape}" )
# print(f"X_valid shape is {tmlt.dfl.X_valid.shape}" )
# print(f"y_train shape is {tmlt.dfl.y_train.shape}")
# print(f"y_valid shape is {tmlt.dfl.y_valid.shape}")

In [14]:
# Fit
start = time.time()
tmlt.spl.fit(tmlt.dfl.X_train, tmlt.dfl.y_train)
end = time.time()
print("Fit Time:", end - start)

#predict
preds = tmlt.spl.predict(tmlt.dfl.X_valid)
preds_probs = tmlt.spl.predict_proba(tmlt.dfl.X_valid)[:, 1]

# Val Metrics
auc = roc_auc_score(tmlt.dfl.y_valid, preds_probs)
acc = accuracy_score(tmlt.dfl.y_valid, preds)

print(f"AUC is : {auc} while Accuracy is : {acc} ")



Fit Time: 0.11430573463439941
AUC is : 0.7159750818287403 while Accuracy is : 0.695 
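
The cell above passes `predict_proba` scores to `roc_auc_score` and hard labels from `predict` to `accuracy_score`. ROC AUC is a ranking metric, so feeding it thresholded labels throws away most of the ranking information. A small sketch with made-up scores (the arrays below are illustrative only, not taken from the TPS data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
probs = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.90])  # made-up positive-class scores
labels = (probs >= 0.5).astype(int)                      # hard labels at a 0.5 threshold

auc_from_probs = roc_auc_score(y_true, probs)    # uses the full ranking
auc_from_labels = roc_auc_score(y_true, labels)  # ranking collapsed to two values
acc = accuracy_score(y_true, labels)

print(f"AUC from probabilities: {auc_from_probs:.4f}")
print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Accuracy:               {acc:.4f}")
```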


In [15]:
# tmlt.do_cross_validation(cv=5, scoring="roc_auc")

In [16]:
# tmlt.do_kfold_training(n_splits=5,test_preds_metric=accuracy_score)

#### Meta Ensemble Model Training

#### Base Model 1: linear SVM model

In [17]:
from sklearn.svm import LinearSVC

In [18]:
# OOF training and prediction on both train and test dataset by a given model

#choose model
linear_oof_model = LinearSVC(tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=42)

# update the model in the sklearn pipeline and get the updated tmlt back
tmlt = tmlt.update_model(linear_oof_model)

# let's see the pipeline
tmlt.spl

In [19]:
#fit and predict
linear_oof_model_preds, linear_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5)

if linear_oof_model_preds is not None:
    print(linear_oof_model_preds.shape)

if linear_oof_model_test_preds is not None:    
    print(linear_oof_model_test_preds.shape)

2021-11-28 17:41:05,868 INFO fold: 1 OOF Model ROC AUC: 0.5420631850419084!
2021-11-28 17:41:05,989 INFO fold: 2 OOF Model ROC AUC: 0.569651837524178!
2021-11-28 17:41:06,138 INFO fold: 3 OOF Model ROC AUC: 0.5094906511927788!
2021-11-28 17:41:06,256 INFO fold: 4 OOF Model ROC AUC: 0.4881634125445281!
2021-11-28 17:41:06,386 INFO fold: 5 OOF Model ROC AUC: 0.5443219809455098!
2021-11-28 17:41:06,392 INFO Mean OOF Model ROC AUC: 0.5307382134497807!


(4000,)
(4000,)
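
For readers new to out-of-fold (OOF) predictions, the idea is that every training row gets a prediction from a model that never saw it, while test predictions are averaged over the fold models. The following is a rough sketch of that idea in plain scikit-learn on synthetic data. It is not the implementation of `do_oof_kfold_train_preds`, and the variable names are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic stand-in for the train and test sets
X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
oof_preds = np.zeros(len(X_train))   # one slot per training row
test_preds = np.zeros(len(X_test))   # running mean over fold models

for train_idx, valid_idx in skf.split(X_train, y_train):
    model = LinearSVC(dual=False, random_state=42)
    model.fit(X_train[train_idx], y_train[train_idx])
    # LinearSVC has no predict_proba, so use the signed distance to the hyperplane
    oof_preds[valid_idx] = model.decision_function(X_train[valid_idx])
    test_preds += model.decision_function(X_test) / n_splits

print(oof_preds.shape, test_preds.shape)
print("OOF ROC AUC:", roc_auc_score(y_train, oof_preds))
```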


#### Base Model 2: Logistic Regression Model

In [20]:
from sklearn.linear_model import LogisticRegression

In [21]:
# OOF training and prediction on both train and test dataset by a given model

#choose model
log_oof_model = LogisticRegression(solver='liblinear', random_state=42)

#update the model on sklearn pipeline
tmlt = tmlt.update_model(log_oof_model)

#fit and predict
log_oof_model_preds, log_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5)

if log_oof_model_preds is not None:
    print(log_oof_model_preds.shape)

if log_oof_model_test_preds is not None:    
    print(log_oof_model_test_preds.shape)

2021-11-28 17:41:06,531 INFO fold: 1 OOF Model ROC AUC: 0.5431270148291425!
2021-11-28 17:41:06,662 INFO fold: 2 OOF Model ROC AUC: 0.570896196002579!
2021-11-28 17:41:06,811 INFO fold: 3 OOF Model ROC AUC: 0.5097678916827854!
2021-11-28 17:41:06,938 INFO fold: 4 OOF Model ROC AUC: 0.4884726131964262!
2021-11-28 17:41:07,065 INFO fold: 5 OOF Model ROC AUC: 0.5453397664246742!
2021-11-28 17:41:07,071 INFO Mean OOF Model ROC AUC: 0.5315206964271214!


(4000,)
(4000,)


#### Base Model 3: SKLearn MLP

In [22]:
from sklearn.neural_network import MLPClassifier

In [23]:
# OOF training and prediction on both train and test dataset by a given model

#choose model
mlp_oof_model = MLPClassifier(max_iter=1000, early_stopping=True)

#update the model on sklearn pipeline
tmlt = tmlt.update_model(mlp_oof_model)


#fit and predict
mlp_oof_model_preds, mlp_oof_model_test_preds = tmlt.do_oof_kfold_train_preds(n_splits=5)

if mlp_oof_model_preds is not None:
    print(mlp_oof_model_preds.shape)

if mlp_oof_model_test_preds is not None:    
    print(mlp_oof_model_test_preds.shape)

2021-11-28 17:41:07,510 INFO fold: 1 OOF Model ROC AUC: 0.5425145067698259!
2021-11-28 17:41:07,902 INFO fold: 2 OOF Model ROC AUC: 0.5555738233397807!
2021-11-28 17:41:08,296 INFO fold: 3 OOF Model ROC AUC: 0.5118568665377177!
2021-11-28 17:41:08,760 INFO fold: 4 OOF Model ROC AUC: 0.4969756311236223!
2021-11-28 17:41:09,368 INFO fold: 5 OOF Model ROC AUC: 0.53995452173745!
2021-11-28 17:41:09,382 INFO Mean OOF Model ROC AUC: 0.5293750699016794!


(4000,)
(4000,)


#### Now add the base models' predictions back to X and X_test



In [24]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["linear_preds"] = linear_oof_model_preds
tmlt.dfl.X_test["linear_preds"] = linear_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 101)
(4000, 101)


In [25]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["log_reg_preds"] = log_oof_model_preds
tmlt.dfl.X_test["log_reg_preds"] = log_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 102)
(4000, 102)


In [26]:
# add base model OOF predictions back to X and X_test before meta model training
tmlt.dfl.X["mlp_preds"] = mlp_oof_model_preds
tmlt.dfl.X_test["mlp_preds"] = mlp_oof_model_test_preds

print(tmlt.dfl.X.shape)
print(tmlt.dfl.X_test.shape)

(4000, 103)
(4000, 103)
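
Appending base model predictions as extra feature columns is the classic stacking setup. As a point of comparison, scikit-learn ships a `StackingClassifier` that builds the same kind of feature matrix internally: with `passthrough=True` it keeps the original features and adds one column per base model for a binary target. This is a sketch of that alternative on synthetic data, not what TMLT does under the hood:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("linear", LinearSVC(dual=False, random_state=42)),
        ("log_reg", LogisticRegression(solver="liblinear", random_state=42)),
        ("mlp", MLPClassifier(max_iter=1000, early_stopping=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(solver="liblinear", random_state=42),
    cv=5,              # base model columns are built from out-of-fold predictions
    passthrough=True,  # keep the 20 original features next to the 3 new columns
)
stack.fit(X_tr, y_tr)

# Feature matrix seen by the final estimator: 3 base model columns + 20 originals
meta_features = stack.transform(X_va)
print(meta_features.shape)
```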


In [27]:
# update dfl, pp and spl and get new tmlt back
tmlt = tmlt.update_dfl(tmlt.dfl.X, tmlt.dfl.y, tmlt.dfl.X_test)
tmlt.spl

2021-11-28 17:41:09,474 INFO categorical columns are None, Preprocessing will done accordingly!


In [28]:
# let's see new columns added to new tmlt.dfl.X and tmlt.dfl.X_test
print(tmlt.dfl.X.columns.values.tolist())
print(tmlt.dfl.X_test.columns.values.tolist())

['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40', 'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50', 'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59', 'f60', 'f61', 'f62', 'f63', 'f64', 'f65', 'f66', 'f67', 'f68', 'f69', 'f70', 'f71', 'f72', 'f73', 'f74', 'f75', 'f76', 'f77', 'f78', 'f79', 'f80', 'f81', 'f82', 'f83', 'f84', 'f85', 'f86', 'f87', 'f88', 'f89', 'f90', 'f91', 'f92', 'f93', 'f94', 'f95', 'f96', 'f97', 'f98', 'f99', 'linear_preds', 'log_reg_preds', 'mlp_preds']
['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38',

#### META Model Training (direct training, without tmlt)


In [29]:
xgb_params = {
    'objective': 'binary:logistic', 
    'use_label_encoder': False,
    'n_estimators': 40000,
    'learning_rate': 0.18515462875481553,
    'subsample': 0.97, 
    'colsample_bytree': 0.32,
    'max_depth': 1,
    'booster': 'gbtree',
    'gamma': 0.2,
    'reg_lambda': 0.11729916523488974, 
    'reg_alpha': 0.6318827156945853,
    'random_state': 42,
    'n_jobs': 4, 
    'min_child_weight': 256,
    #for GPU
#     'tree_method': 'gpu_hist',
#     'predictor': 'gpu_predictor',
    }

In [30]:
# create xgb meta model
xgb_model = XGBClassifier(**xgb_params)

# update the sklearn pipeline so it does not reuse the cached model (estimator)
tmlt = tmlt.update_model(xgb_model)

# let's see the updated sklearn pipeline
tmlt.spl

In [31]:
%%time
xgb_model_preds_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(n_splits=5,
                                                                              test_preds_metric=roc_auc_score)

2021-11-28 17:41:09,573 INFO  model class:<class 'xgboost.sklearn.XGBClassifier'>
2021-11-28 17:41:41,536 INFO fold: 1 roc_auc_score : 0.6171373307543521
2021-11-28 17:41:41,536 INFO fold: 1 log_loss : 0.7281110190437176
2021-11-28 17:41:41,537 INFO fold: 1 accuracy_score : 0.57625
2021-11-28 17:41:41,537 INFO fold: 1 f1_score : 0.46275752773375595
2021-11-28 17:41:41,537 INFO fold: 1 precision_score : 0.4850498338870432
2021-11-28 17:41:41,538 INFO fold: 1 recall_score : 0.44242424242424244
2021-11-28 17:41:41,538 INFO Predicting Test Preds Probablities!
2021-11-28 17:42:14,031 INFO fold: 2 roc_auc_score : 0.6035138620245003
2021-11-28 17:42:14,032 INFO fold: 2 log_loss : 0.7519563890597784
2021-11-28 17:42:14,032 INFO fold: 2 accuracy_score : 0.595
2021-11-28 17:42:14,033 INFO fold: 2 f1_score : 0.4774193548387097
2021-11-28 17:42:14,033 INFO fold: 2 precision_score : 0.5103448275862069
2021-11-28 17:42:14,034 INFO fold: 2 recall_score : 0.4484848484848485
2021-11-28 17:42:14,034 INF

CPU times: user 10min 25s, sys: 4.33 s, total: 10min 29s
Wall time: 2min 47s


In [32]:
from sklearn.model_selection import StratifiedKFold

In [33]:
%%time

# Setting up fold parameters
splits = 10
skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)

# Creating an array of zeros for storing "out of fold" predictions
oof_preds = np.zeros((tmlt.dfl.X.shape[0],))
preds = 0
# model_fi = 0
total_mean_auc = 0

# Generating folds and making training and prediction for each of them
for num, (train_idx, valid_idx) in enumerate(skf.split(tmlt.dfl.X, tmlt.dfl.y)):
    tmlt.dfl.X_train, tmlt.dfl.X_valid = tmlt.dfl.X.loc[train_idx], tmlt.dfl.X.loc[valid_idx]
    tmlt.dfl.y_train, tmlt.dfl.y_valid = tmlt.dfl.y[train_idx], tmlt.dfl.y[valid_idx]
    
    model = XGBClassifier(**xgb_params)
    model.fit(tmlt.dfl.X_train, tmlt.dfl.y_train,
              verbose=False,
              # The parameters below help to detect and avoid overfitting
              eval_set=[(tmlt.dfl.X_train, tmlt.dfl.y_train), (tmlt.dfl.X_valid, tmlt.dfl.y_valid)],
              eval_metric="auc",
              early_stopping_rounds=300,
              )
    
    # Getting mean test data predictions (i.e. divided by number of splits)
    preds += model.predict_proba(tmlt.dfl.X_test)[:, 1] / splits
    
    # Getting mean feature importances (i.e. divided by number of splits)
    # model_fi += model.feature_importances_ / splits
    
    # Getting validation data predictions. Each fold model makes predictions on unseen data.
    # So in the end it will be completely filled with unseen data predictions.
    # It will be used to evaluate hyperparameters performance only.
    oof_preds[valid_idx] = model.predict_proba(tmlt.dfl.X_valid)[:, 1]
    
    # Getting score for a fold model
    fold_auc = roc_auc_score(tmlt.dfl.y_valid, oof_preds[valid_idx])
    print(f"Fold {num} ROC AUC: {fold_auc}")

    # Getting mean score of all fold models (i.e. divided by number of splits)
    total_mean_auc += fold_auc / splits
    # delete all dataframes after each fold
    unused_df_lst = [tmlt.dfl.X_train, tmlt.dfl.X_valid, tmlt.dfl.y_train, tmlt.dfl.y_valid]
    del unused_df_lst
    
print(f"\nOverall ROC AUC: {total_mean_auc}")

Fold 0 ROC AUC: 0.761928164444148
Fold 1 ROC AUC: 0.7610277116407352
Fold 2 ROC AUC: 0.7627903156056819
Fold 3 ROC AUC: 0.7632015002586378
Fold 4 ROC AUC: 0.7569241924918775

Overall ROC AUC: 0.7611743768882161
CPU times: user 53 s, sys: 384 ms, total: 53.4 s
Wall time: 26.7 s
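
One detail in the fold loop deserves a note: `StratifiedKFold.split` yields positional row indices, while `DataFrame.loc` looks rows up by index label. The two only agree when the index happens to be `0..n-1`. A minimal sketch of the difference on a toy frame (the data and names below are made up for illustration), showing why `.iloc` is the safer choice when the index is something like an `id` column:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy frame whose index labels (100, 101, ...) differ from row positions (0, 1, ...)
X = pd.DataFrame({"f0": np.arange(10, dtype=float)}, index=np.arange(100, 110))
y = pd.Series([0, 1] * 5, index=X.index)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, valid_idx = next(skf.split(X, y))

# Positional selection works regardless of the index labels
X_valid = X.iloc[valid_idx]
y_valid = y.iloc[valid_idx]
print("positions:", valid_idx.tolist(), "labels:", X_valid.index.tolist())

# Label-based selection fails here because labels 0..9 do not exist in the index
try:
    X.loc[valid_idx]
    loc_failed = False
except KeyError as err:
    loc_failed = True
    print("label-based lookup failed:", type(err).__name__)
```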

#### Let's run an Optuna-based hyperparameter search to find the best params for fit

##### The training dataset is large, so let's give the Optuna study 600 seconds (10 minutes) to optimize

In [34]:
# study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH,
#                                        use_gpu=True, opt_timeout=600)

**If the trials do not finish within 10 minutes, we can always come back and rerun this cell for additional hyperparameter search. Optuna saves the study and resumes from the last trial.**

In [35]:
# print(study.best_trial)

#### Now, let's update the model with best params from optuna study

**Make sure to update the sklearn pipeline with the new model too. Only then will the sklearn pipeline not reuse cached models (estimators).**
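
In plain scikit-learn terms, swapping the estimator means replacing the final step of the `Pipeline` rather than only rebinding a Python variable. A short sketch of that, assuming a two-step pipeline with hypothetical step names `scaler` and `model` (TMLT's own pipeline layout may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(solver="liblinear", random_state=42)),
])

# Creating a new estimator alone does not touch the pipeline
new_model = LinearSVC(dual=False, random_state=42)
print(type(pipe.named_steps["model"]).__name__)

# Replace the step by name so later fits use the new estimator
pipe.set_params(model=new_model)
print(type(pipe.named_steps["model"]).__name__)
```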

In [36]:
# # xgb_params.update(study.best_trial.params)
# # xgb_params.update(new_xgb_params)
# # print("Final xgb_params:", xgb_params)
# xgb_model = XGBClassifier(**new_xgb_params)

# # update the sklearn pipeline so it does not reuse the cached model (estimator)
# tmlt.update_model(xgb_model)
# # let's see the sklearn pipeline
# tmlt.spl

#### Let's do K-Fold Training

In [37]:
# # K-Fold fit and predict on test dataset
# xgb_model_preds_metrics_score, xgb_model_test_preds= tmlt.do_kfold_training(n_splits=5, test_preds_metric=roc_auc_score)
# if xgb_model_test_preds is not None:
#     print(xgb_model_test_preds.shape)

#### Create Kaggle Predictions

In [38]:
# import pandas as pd

In [39]:
# sub = pd.read_csv(DIRECTORY_PATH + SAMPLE_SUB_FILE)
# # sub['target'] = xgb_model_test_preds
# sub['target'] = preds
# sub.to_csv(OUTPUT_PATH + 'wed_nov_25_1042_submission.csv', index=False)

In [40]:
import os

os.listdir(OUTPUT_PATH)