In [1]:
%load_ext autoreload
%autoreload 2

# Getting Started Tutorial with Tabular ML Toolkit

> A tutorial on getting started with Tabular ML Toolkit

> tabular_ml_toolkit is a superfast helper library to speed up your machine learning project based on tabular or structured data.

> It comes with model parallelism and cutting-edge hyperparameter tuning techniques.

## Install

`pip install -U tabular_ml_toolkit`

## How to Best Use tabular_ml_toolkit

Start with your favorite model, then create an ML pipeline with a single API call.

*For example, here we use XGBRegressor from XGBoost on the [Melbourne Home Sale price data](https://www.kaggle.com/estrotococo/home-data-for-ml-course)*


*No need to install scikit-learn, as it comes preinstalled with Tabular_ML_Toolkit*

In [2]:
from tabular_ml_toolkit.tmlt import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import pandas as pd
import numpy as np

# for displaying diagram of pipelines 
from sklearn import set_config
set_config(display="diagram")

# Just to compare fit times
import time

In [3]:
# Dataset file names and Paths
DIRECTORY_PATH = "https://raw.githubusercontent.com/psmathur/tabular_ml_toolkit/master/input/home_data/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "do_xgb_optuna_optimization_output/"

In [4]:
from xgboost import XGBRegressor

xgb_params = {
    'n_estimators':250,
    'learning_rate':0.05,
    'random_state':42,
    # for GPU
#     'tree_method': 'gpu_hist',
#     'predictor': 'gpu_predictor',
}


# create xgb ml model
xgb_model = XGBRegressor(**xgb_params)

In [5]:
# create the ML pipeline for the XGBoost model
tmlt = TMLT().prepare_data_for_training(
    train_file_path= DIRECTORY_PATH+TRAIN_FILE,
    test_file_path= DIRECTORY_PATH+TEST_FILE,
    idx_col="Id", target="SalePrice",
    model=xgb_model,
    random_state=42,
    problem_type="regression")

2021-11-22 00:36:35,621 INFO 12 cores found, parallel processing is enabled!
2021-11-22 00:36:35,978 INFO DataFrame Memory usage decreased to 0.58 Mb (35.5% reduction)
2021-11-22 00:36:36,313 INFO DataFrame Memory usage decreased to 0.58 Mb (34.8% reduction)
2021-11-22 00:36:36,341 INFO Both Numerical & Categorical columns found, Preprocessing will done accordingly!
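
The `Memory usage decreased` log lines suggest that the toolkit shrinks column dtypes after loading the data. How it does so is not shown in this tutorial. The following is a minimal sketch of the general idea using NumPy only; `downcast_int_array` is a hypothetical helper for illustration, not part of tabular_ml_toolkit.

```python
import numpy as np


def downcast_int_array(values):
    """Store integer values in the smallest signed integer dtype that fits.

    Hypothetical helper for illustration; not part of tabular_ml_toolkit.
    """
    arr = np.asarray(values)
    # Leave non-integer or empty arrays untouched.
    if arr.size == 0 or not np.issubdtype(arr.dtype, np.integer):
        return arr
    lowest, highest = int(arr.min()), int(arr.max())
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        limits = np.iinfo(dtype)
        if limits.min <= lowest and highest <= limits.max:
            return arr.astype(dtype)
    return arr


# Toy column of years stored as 64-bit integers (made-up values).
year_built = np.array([1995, 2003, 1978, 2010], dtype=np.int64)
compact = downcast_int_array(year_built)
reduction = 100 * (1 - compact.nbytes / year_built.nbytes)
print(compact.dtype, f"{reduction:.1f}% smaller")
```

The values are unchanged; only the storage width shrinks, which is why this kind of step is safe to run before training.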


In [6]:
# let's see the default pipeline
tmlt.spl
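
The pipeline diagram itself does not survive in this export. Since the log reports that both numerical and categorical columns were found, a pipeline of this kind typically routes each column type through its own preprocessing. The sketch below shows that pattern with plain scikit-learn on a toy frame; the steps, column names, and values are assumptions for illustration and may differ from what `tmlt.spl` actually contains.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "Street": pd.Series(["Pave", "Pave", "Grvl", np.nan], dtype=object),
})

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["LotArea"]),
    ("cat", categorical_steps, ["Street"]),
])

features = preprocessor.fit_transform(toy)
print(features.shape)
```

A model such as `XGBRegressor` can then be appended as the final step of a `Pipeline` whose first step is `preprocessor`.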

#### To get a clear picture, let's do k-fold training on the model

In [7]:
# # k-fold training
# xgb_model_metrics_score, xgb_model_preds = tmlt.do_k_fold_training(n_splits=5,
#                                                                           metrics=mean_absolute_error,
#                                                                           random_state=42)
# print("mean metrics score:", np.mean(xgb_model_metrics_score))
# # predict
# print(xgb_model_preds.shape)
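
K-fold training splits the training rows into `n_splits` folds, trains on all but one fold, scores on the held-out fold, and repeats until every fold has been held out once. Below is a minimal pure-Python sketch of that loop. A mean predictor stands in for the real model, and all helper names and prices are made up for illustration; this is not the toolkit's implementation.

```python
import random
from statistics import mean


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def k_fold_indices(n_samples, n_splits, seed=42):
    """Shuffle row indices and deal them into n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_splits-th shuffled index, so sizes differ by at most one.
    return [indices[i::n_splits] for i in range(n_splits)]


def k_fold_baseline_scores(y, n_splits=5, seed=42):
    """Score a 'predict the training mean' stand-in model on each fold."""
    scores = []
    for valid_idx in k_fold_indices(len(y), n_splits, seed):
        held_out = set(valid_idx)
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = mean(train_y)  # stand-in for model.fit(...).predict(...)
        valid_y = [y[i] for i in valid_idx]
        scores.append(mean_absolute_error(valid_y, [prediction] * len(valid_y)))
    return scores


# Made-up target values, not taken from the dataset.
toy_prices = [100.0, 120.0, 90.0, 150.0, 130.0, 110.0, 95.0, 160.0, 125.0, 105.0]
fold_scores = k_fold_baseline_scores(toy_prices, n_splits=5)
print("fold scores:", fold_scores)
print("mean metrics score:", mean(fold_scores))
```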

##### Let's see if we can improve our k-fold score with hyperparameter tuning

In [8]:
# from sklearn.preprocessing import StandardScaler, MinMaxScaler
# from sklearn.impute import SimpleImputer
# # from sklearn.

In [9]:
study = tmlt.do_xgb_optuna_optimization(metrics=mean_absolute_error, output_dir_path=OUTPUT_PATH)
print(study.best_trial)

2021-11-22 00:36:36,596 INFO direction is: minimize
[I 2021-11-22 00:36:36,656] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.
2021-11-22 00:36:36,784 INFO Training Started
2021-11-22 00:36:46,970 INFO Training Ended
2021-11-22 00:36:46,970 INFO Predicting Score!
[I 2021-11-22 00:36:47,063] Trial 2 finished with value: 14777.86558219178 and parameters: {'learning_rate': 0.016012475961174746, 'reg_lambda': 0.3410194623899067, 'reg_alpha': 75.99856826761018, 'subsample': 0.2048463199761608, 'colsample_bytree': 0.7312842669839699, 'max_depth': 5, 'early_stopping_rounds': 427, 'n_estimators': 7000, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 0.021165662172103753, 'grow_policy': 'depthwise'}. Best is trial 2 with value: 14777.86558219178.
2021-11-22 00:36:47,178 INFO Training Started


Parameters: { "colsample_bytree", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




2021-11-22 00:36:51,213 INFO Training Ended
2021-11-22 00:36:51,214 INFO Predicting Score!
[I 2021-11-22 00:36:51,257] Trial 3 finished with value: 19141.308018514555 and parameters: {'learning_rate': 0.1062423877897573, 'reg_lambda': 0.005808903026169929, 'reg_alpha': 1.0500459880090612e-05, 'subsample': 0.83159728785066, 'colsample_bytree': 0.5767945256752227, 'max_depth': 3, 'early_stopping_rounds': 115, 'n_estimators': 15000, 'tree_method': 'hist', 'booster': 'gblinear'}. Best is trial 2 with value: 14777.86558219178.
2021-11-22 00:36:51,349 INFO Training Started
2021-11-22 00:36:57,134 INFO Training Ended
2021-11-22 00:36:57,134 INFO Predicting Score!
[I 2021-11-22 00:36:57,179] Trial 4 finished with value: 38001.63091288527 and parameters: {'learning_rate': 0.08036356904807763, 'reg_lambda': 9.340626577613357, 'reg_alpha': 0.014731814562689374, 'subsample': 0.48518133652685647, 'colsample_bytree': 0.310998960257031, 'max_depth': 8, 'early_stopping_rounds': 206, 'n_estimators': 20000, 'tree_method': 'exact', 'booster': 'gblinear'}. Best is trial 2 with value: 14777.86558219178.
2021-11-22 00:36:57,283 INFO Training Started
2021-11-22 00:37:06,728 INFO Training Ended
2021-11-22 00:37:06,729 INFO Predicting Score!
[I 2021-11-22 00:37:06,814] Trial 5 finished with value: 15790.595555971746 and parameters: {'learning_rate': 0.040464539475722844, 'reg_lambda': 0.41870054997274897, 'reg_alpha': 0.044991345887820125, 'subsample': 0.3235413400706302, 'colsample_bytree': 0.13862108405782803, 'max_depth': 5, 'early_stopping_rounds'

2021-11-22 00:37:33,444 INFO Training Ended
2021-11-22 00:37:33,445 INFO Predicting Score!
[I 2021-11-22 00:37:33,493] Trial 7 finished with value: 18879.80290828339 and parameters: {'learning_rate': 0.10916165762799426, 'reg_lambda': 0.012168383129923977, 'reg_alpha': 5.583278525477054e-07, 'subsample': 0.6225518824585848, 'colsample_bytree': 0.16327264951220277, 'max_depth': 9, 'early_stopping_rounds': 165, 'n_estimators': 7000, 'tree_method': 'exact', 'booster': 'gblinear'}. Best is trial 2 with value: 14777.86558219178.
2021-11-22 00:37:33,588 INFO Training Started
2021-11-22 00:37:56,455 INFO Training Ended
2021-11-22 00:37:56,456 INFO Predicting Score!
[I 2021-11-22 00:37:56,536] Trial 8 finished with value: 15265.181252675513 and parameters: {'learning_rate': 0.020965200696374087, 'reg_lambda': 24.456816640001072, 'reg_alpha': 3.5275989063112733e-08, 'subsample': 0.6356771229757, 'colsample_bytree': 0.7412094802852861, 'max_depth': 5, 'early_stopping_rounds

2021-11-22 00:37:59,282 INFO Training Ended
2021-11-22 00:37:59,283 INFO Predicting Score!
[I 2021-11-22 00:37:59,330] Trial 9 finished with value: 18221.209118150684 and parameters: {'learning_rate': 0.01825461532795196, 'reg_lambda': 2.0792749632525548e-08, 'reg_alpha': 0.029397059028886764, 'subsample': 0.48339755653116534, 'colsample_bytree': 0.2273769120512169, 'max_depth': 3, 'early_stopping_rounds': 318, 'n_estimators': 7000, 'tree_method': 'exact', 'booster': 'gblinear'}. Best is trial 2 with value: 14777.86558219178.
2021-11-22 00:37:59,441 INFO Training Started
2021-11-22 00:38:03,400 INFO Training Ended
2021-11-22 00:38:03,400 INFO Predicting Score!
[I 2021-11-22 00:38:03,453] Trial 10 finished with value: 16191.472134524829 and parameters: {'learning_rate': 0.1342586705409775, 'reg_lambda': 3.9216168505648526e-08, 'reg_alpha': 1.2093599153413054e-07, 'subsample': 0.8614151452926132, 'colsample_bytree': 0.5797515196740621, 'max_depth': 2, 'early_stoppin

2021-11-22 00:41:07,537 INFO Training Started
2021-11-22 00:41:23,589 INFO Training Ended
2021-11-22 00:41:23,590 INFO Predicting Score!
[I 2021-11-22 00:41:23,679] Trial 22 finished with value: 15029.193051690925 and parameters: {'learning_rate': 0.010010095321294295, 'reg_lambda': 3.031464733881546e-06, 'reg_alpha': 0.0009507849210810068, 'subsample': 0.5514834442906478, 'colsample_bytree': 0.6913143029874168, 'max_depth': 6, 'early_stopping_rounds': 484, 'n_estimators': 7000, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 0.03536312244259591, 'grow_policy': 'depthwise'}. Best is trial 2 with value: 14777.86558219178.
2021-11-22 00:41:23,813 INFO Training Started
2021-11-22 00:41:39,876 INFO Training Ended
2021-11-22 00:41:39,877 INFO Predicting Score!
[I 2021-11-22 00:41:39,959] Trial 23 finished with value: 15159.616277825342 and parameters: {'learning_rate': 0.014368264214164652, 'reg_lambda': 2.993065360049739e-06, 'reg_alpha': 0.002787101642637736, 's

2021-11-22 00:42:23,307 INFO Training Ended
2021-11-22 00:42:23,308 INFO Predicting Score!
[I 2021-11-22 00:42:23,349] Trial 27 finished with value: 18219.798112425087 and parameters: {'learning_rate': 0.01259223681961762, 'reg_lambda': 0.00012506920867547193, 'reg_alpha': 2.0133391972151626e-08, 'subsample': 0.6288525529379291, 'colsample_bytree': 0.6143491968543348, 'max_depth': 2, 'early_stopping_rounds': 398, 'n_estimators': 7000, 'tree_method': 'hist', 'booster': 'gblinear'}. Best is trial 2 with value: 14777.86558219178.
2021-11-22 00:42:23,487 INFO Training Started
2021-11-22 00:42:31,357 INFO Training Ended
2021-11-22 00:42:31,358 INFO Predicting Score!
[I 2021-11-22 00:42:31,425] Trial 28 finished with value: 15636.582994434932 and parameters: {'learning_rate': 0.027800769386821793, 'reg_lambda': 0.0224101411619606, 'reg_alpha': 2.669236231424427e-07, 'subsample': 0.17140740144187322, 'colsample_bytree': 0.5005343222851524, 'max_depth': 4, 'early_stopping

FrozenTrial(number=2, values=[14777.86558219178], datetime_start=datetime.datetime(2021, 11, 22, 0, 36, 36, 686161), datetime_complete=datetime.datetime(2021, 11, 22, 0, 36, 47, 25301), params={'booster': 'gbtree', 'colsample_bytree': 0.7312842669839699, 'early_stopping_rounds': 427, 'gamma': 0.021165662172103753, 'grow_policy': 'depthwise', 'learning_rate': 0.016012475961174746, 'max_depth': 5, 'n_estimators': 7000, 'reg_alpha': 75.99856826761018, 'reg_lambda': 0.3410194623899067, 'subsample': 0.2048463199761608, 'tree_method': 'hist'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'gamma': LogUniformDistribution(high=1.0, low=1e-08), 'grow_policy': CategoricalDistribution(choices=('depthwise', 'lossguide')), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistribution(high=9, l

2021-11-18 22:05:18,152 INFO direction is: minimize
[I 2021-11-18 22:05:18,207] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.
/Users/pankajmathur/anaconda3/envs/nbdev_env/lib/python3.9/site-packages/sklearn/model_selection/_split.py:676: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
2021-11-18 22:05:55,715 INFO fold: 1 , mean_absolute_error: 17737.759805757705
2021-11-18 22:06:31,210 INFO fold: 2 , mean_absolute_error: 13937.593455693494
2021-11-18 22:07:07,237 INFO fold: 3 , mean_absolute_error: 13931.05033979024
2021-11-18 22:07:48,523 INFO fold: 4 , mean_absolute_error: 12936.826492936643
2021-11-18 22:08:25,673 INFO fold: 5 , mean_absolute_error: 14848.775216716609
[I 2021-11-18 22:08:25,699] Trial 48 finished with value: 14678.401062178938 and parameters: {'learning_rate': 0.010227648390602546, 'reg_lambda': 8.015393563720193e-06, 'reg_alpha': 3.7753443233851705e-06, 'subsample': 0.5799411949016183, 'colsample_bytree': 0.613735233825501, 'max_depth': 4, 'early_stopping_rounds': 200, 'n_estimators': 20000, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 0.01783928295659629, 'grow_policy': 'lossguide'}. Best is trial 48 with value: 14678.401062178938.
/Users/pankajmathur/anaconda3/envs/nbdev_env/lib/python3.9/site-packages/sklearn/model_selection/_split.py:676: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
2021-11-18 22:09:02,651 INFO fold: 1 , mean_absolute_error: 17615.194228916953
2021-11-18 22:09:40,540 INFO fold: 2 , mean_absolute_error: 14338.525511023116
2021-11-18 22:10:18,411 INFO fold: 3 , mean_absolute_error: 13896.110378317637
2021-11-18 22:10:56,342 INFO fold: 4 , mean_absolute_error: 13176.088907320205
2021-11-18 22:11:35,685 INFO fold: 5 , mean_absolute_error: 14923.216475813357
[I 2021-11-18 22:11:35,704] Trial 49 finished with value: 14789.827100278253 and parameters: {'learning_rate': 0.01003614304176459, 'reg_lambda': 1.3143006220261207e-05, 'reg_alpha': 3.3506330151130134e-06, 'subsample': 0.543467363947305, 'colsample_bytree': 0.6249109087231277, 'max_depth': 4, 'early_stopping_rounds': 192, 'n_estimators': 20000, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 0.028252682493464184, 'grow_policy': 'lossguide'}. Best is trial 48 with value: 14678.401062178938.
FrozenTrial(number=48, values=[14678.401062178938], datetime_start=datetime.datetime(2021, 11, 18, 22, 5, 18, 249529), datetime_complete=datetime.datetime(2021, 11, 18, 22, 8, 25, 674630), params={'booster': 'gbtree', 'colsample_bytree': 0.613735233825501, 'early_stopping_rounds': 200, 'gamma': 0.01783928295659629, 'grow_policy': 'lossguide', 'learning_rate': 0.010227648390602546, 'max_depth': 4, 'n_estimators': 20000, 'reg_alpha': 3.7753443233851705e-06, 'reg_lambda': 8.015393563720193e-06, 'subsample': 0.5799411949016183, 'tree_method': 'hist'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'gamma': LogUniformDistribution(high=1.0, low=1e-08), 'grow_policy': CategoricalDistribution(choices=('depthwise', 'lossguide')), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistribution(high=9, low=1, step=1), 'n_estimators': CategoricalDistribution(choices=(7000, 15000, 20000)), 'reg_alpha': LogUniformDistribution(high=100.0, low=1e-08), 'reg_lambda': LogUniformDistribution(high=100.0, low=1e-08), 'subsample': UniformDistribution(high=1.0, low=0.1), 'tree_method': CategoricalDistribution(choices=('exact', 'approx', 'hist'))}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=49, state=TrialState.COMPLETE, value=None)
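
The `distributions` field of the `FrozenTrial` output spells out the search space: log-uniform ranges for `learning_rate`, `reg_lambda`, `reg_alpha`, and `gamma`, uniform ranges for `subsample` and `colsample_bytree`, integer ranges for `max_depth` and `early_stopping_rounds`, and categorical choices for the rest. The sketch below draws one candidate from that space with plain random sampling. This is a deliberate simplification: Optuna's own samplers are smarter than random search, and the helper names here are hypothetical. The rule that `gamma` and `grow_policy` are drawn only for the `gbtree` booster is inferred from the trial logs, where `gblinear` trials lack those keys.

```python
import math
import random


def log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    # Clamp to guard against floating-point rounding at the edges.
    return min(max(value, low), high)


def sample_xgb_params(rng):
    """Draw one candidate from the ranges listed in the FrozenTrial output."""
    params = {
        "learning_rate": log_uniform(rng, 0.01, 0.25),
        "reg_lambda": log_uniform(rng, 1e-8, 100.0),
        "reg_alpha": log_uniform(rng, 1e-8, 100.0),
        "subsample": rng.uniform(0.1, 1.0),
        "colsample_bytree": rng.uniform(0.1, 1.0),
        "max_depth": rng.randint(1, 9),                  # inclusive bounds
        "early_stopping_rounds": rng.randint(100, 500),  # inclusive bounds
        "n_estimators": rng.choice([7000, 15000, 20000]),
        "tree_method": rng.choice(["exact", "approx", "hist"]),
        "booster": rng.choice(["gbtree", "gblinear"]),
    }
    # Tree-only settings, drawn only when the tree booster was picked
    # (inferred from the logs, not confirmed by the toolkit's source).
    if params["booster"] == "gbtree":
        params["gamma"] = log_uniform(rng, 1e-8, 1.0)
        params["grow_policy"] = rng.choice(["depthwise", "lossguide"])
    return params


rng = random.Random(42)
candidate = sample_xgb_params(rng)
for name, value in sorted(candidate.items()):
    print(f"{name}: {value}")
```

Sampling on a log scale spreads draws evenly across orders of magnitude, which suits parameters such as `reg_alpha` whose range spans from 1e-08 to 100.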

In [10]:
study.best_trial.params

{'booster': 'gbtree',
 'colsample_bytree': 0.7312842669839699,
 'early_stopping_rounds': 427,
 'gamma': 0.021165662172103753,
 'grow_policy': 'depthwise',
 'learning_rate': 0.016012475961174746,
 'max_depth': 5,
 'n_estimators': 7000,
 'reg_alpha': 75.99856826761018,
 'reg_lambda': 0.3410194623899067,
 'subsample': 0.2048463199761608,
 'tree_method': 'hist'}
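
Because the study direction is `minimize`, the best trial is simply the one with the lowest metric value. The check below uses only the trial values that are visible in the 2021-11-22 log above (some trials are cut off in the output, so the list is incomplete); it is a sanity check, not a replacement for `study.best_trial`.

```python
# MAE values copied from the "Trial N finished with value" log lines above.
trial_values = {
    2: 14777.86558219178,
    3: 19141.308018514555,
    4: 38001.63091288527,
    5: 15790.595555971746,
    7: 18879.80290828339,
    8: 15265.181252675513,
    9: 18221.209118150684,
    10: 16191.472134524829,
    22: 15029.193051690925,
    23: 15159.616277825342,
    27: 18219.798112425087,
    28: 15636.582994434932,
}

# direction is "minimize", so the best trial has the smallest value.
best_number = min(trial_values, key=trial_values.get)
print("best trial:", best_number, "value:", trial_values[best_number])
```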

**Awesome, we found the best params in about six minutes (see the log timestamps above)!**

##### Now let's use the best params to update the model in our pipeline

In [11]:
# autoxgb_params = {'learning_rate': 0.016067642810265004,
# 'reg_lambda': 0.0005033307729410949,
# 'reg_alpha': 1.125131255655592e-06,
# 'subsample': 0.43211847297916883,
# 'colsample_bytree': 0.4106787563173376,
# 'max_depth': 5,
# 'early_stopping_rounds': 354,
# 'n_estimators': 7000,
# 'tree_method': 'approx',
# 'booster': 'gbtree',
# 'gamma': 0.2870988185671683,
# 'grow_policy': 'depthwise'}

In [12]:
# use the best params found by Optuna
xgb_params = study.best_trial.params
xgb_model = XGBRegressor(**xgb_params)
tmlt.update_model(xgb_model)
tmlt.spl
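
The best-trial params include `early_stopping_rounds`, which XGBoost flags as possibly unused in the training output. Depending on the XGBoost version, that setting may be expected by `fit` rather than by the model constructor. One option, sketched here as an assumption rather than as the toolkit's behavior, is to separate such keys before building the model. `FIT_ONLY_KEYS` is a hypothetical name, and whether `tmlt` accepts separate fit-time arguments is not shown in this tutorial.

```python
# Best params as printed by study.best_trial.params above.
best_params = {
    "booster": "gbtree",
    "colsample_bytree": 0.7312842669839699,
    "early_stopping_rounds": 427,
    "gamma": 0.021165662172103753,
    "grow_policy": "depthwise",
    "learning_rate": 0.016012475961174746,
    "max_depth": 5,
    "n_estimators": 7000,
    "reg_alpha": 75.99856826761018,
    "reg_lambda": 0.3410194623899067,
    "subsample": 0.2048463199761608,
    "tree_method": "hist",
}

# Hypothetical set of keys to keep out of the model constructor.
FIT_ONLY_KEYS = {"early_stopping_rounds"}

model_params = {k: v for k, v in best_params.items() if k not in FIT_ONLY_KEYS}
fit_params = {k: v for k, v in best_params.items() if k in FIT_ONLY_KEYS}

print("constructor params:", sorted(model_params))
print("fit-time params:", fit_params)
```

`model_params` could then be passed as `XGBRegressor(**model_params)`. Early stopping also needs a validation set at fit time, so dropping the key only silences the warning and does not enable early stopping by itself.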

In [13]:
# k-fold training
xgb_model_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(n_splits=5,
                                                                          metrics=mean_absolute_error,
                                                                          random_state=42)
# predict
print(xgb_model_test_preds.shape)



Parameters: { "early_stopping_rounds" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




2021-11-22 00:42:53,893 INFO fold: 1 , mean_absolute_error: 19148.493926583906
2021-11-22 00:43:06,375 INFO fold: 2 , mean_absolute_error: 15131.98073630137
2021-11-22 00:43:16,508 INFO fold: 3 , mean_absolute_error: 14903.897822131848
2021-11-22 00:43:27,856 INFO fold: 4 , mean_absolute_error: 13764.201894263699
2021-11-22 00:43:39,492 INFO fold: 5 , mean_absolute_error: 14727.987318065068
2021-11-22 00:43:39,493 INFO  mean metrics score: 15535.31233946918


(1459,)


mean metrics score: 14761.77

(1459,)
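
The `mean metrics score` of 15535.31 logged on 2021-11-22 is the arithmetic mean of the five per-fold MAE values from the same log, which can be checked directly:

```python
from statistics import mean

# Per-fold mean_absolute_error values copied from the 2021-11-22 log above.
fold_mae = [
    19148.493926583906,
    15131.98073630137,
    14903.897822131848,
    13764.201894263699,
    14727.987318065068,
]

mean_mae = mean(fold_mae)
print(f"mean metrics score: {mean_mae:.2f}")
```

The spread between folds (roughly 13764 to 19148) is a reminder that a single fold's score can be far from the average, which is why the k-fold mean is the number to compare across runs.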

#### Yes indeed, tuning the XGBoost model with Optuna has improved MAE over the earlier cross-validated model!

**Amazing, our MAE has dropped to 14761.77 with Optuna-based hyperparameter search. If we continue hyperparameter tuning, maybe we can do even better. Take that as a challenge!**