In [1]:
%load_ext autoreload
%autoreload 2

# Getting Started Tutorial with Tabular ML Toolkit

> A tutorial on getting started with Tabular ML Toolkit

> tabular_ml_toolkit is a helper library that speeds up machine learning projects built on tabular (structured) data.

> It comes with model parallelism and cutting-edge hyperparameter tuning techniques.

## Install

`pip install -U tabular_ml_toolkit`

## How to Best Use tabular_ml_toolkit

Start with your favorite model, then create an `MLPipeline` with a single API call.

*For example, here we use `XGBRegressor` from XGBoost on the [home sale price data](https://www.kaggle.com/estrotococo/home-data-for-ml-course) from Kaggle.*

*No need to install scikit-learn separately; it comes preinstalled with Tabular_ML_Toolkit.*

In [2]:
from tabular_ml_toolkit.mlpipeline import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import pandas as pd
import numpy as np

# for displaying diagram of pipelines 
from sklearn import set_config
set_config(display="diagram")

# Just to compare fit times
import time



In [3]:
# Dataset file names and Paths
DIRECTORY_PATH = "https://raw.githubusercontent.com/psmathur/tabular_ml_toolkit/master/input/home_data/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"

In [4]:
from xgboost import XGBRegressor

xgb_params = {
    'n_estimators':250,
    'learning_rate':0.05,
    'random_state':42,
    # for GPU
#     'tree_method': 'gpu_hist',
#     'predictor': 'gpu_predictor',
}


# create xgb ml model
xgb_model = XGBRegressor(**xgb_params)

In [5]:
# create ML pipeline for the XGBoost model
tmlt = MLPipeline().prepare_data_for_training(
    train_file_path= DIRECTORY_PATH+TRAIN_FILE,
    test_file_path= DIRECTORY_PATH+TEST_FILE,
    idx_col="Id", target="SalePrice",
    model=xgb_model,
    random_state=42)

2021-11-15 23:27:44,001 INFO 8 cores found, parallel processing is enabled!
2021-11-15 23:27:44,292 INFO DataFrame Memory usage decreased to 0.58 Mb (35.5% reduction)
2021-11-15 23:27:44,861 INFO DataFrame Memory usage decreased to 0.58 Mb (34.8% reduction)
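
The log lines above report that the DataFrame memory footprint shrank by roughly a third while the data was loaded. How the toolkit achieves this is not shown in this notebook. A common approach, and purely an assumption here, is to downcast numeric columns to smaller dtypes. The sketch below illustrates that idea on a synthetic frame; `downcast_numeric` and the column names are hypothetical and not part of tabular_ml_toolkit.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns stored in smaller dtypes.

    Hypothetical helper for illustration only; the toolkit's own
    memory reduction logic may work differently.
    """
    out = df.copy()
    # Integers: pick the smallest integer dtype that still holds every value.
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Floats: store as float32 (this trades some precision for memory).
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = out[col].astype(np.float32)
    return out


# Synthetic frame standing in for the housing data (assumption).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "rooms": rng.integers(1, 10, size=1_000),      # int64 by default
    "area": rng.uniform(30.0, 400.0, size=1_000),  # float64 by default
})

before = df.memory_usage(deep=True).sum()
small = downcast_numeric(df)
after = small.memory_usage(deep=True).sum()

print(f"Memory usage decreased to {after / 1024**2:.2f} Mb "
      f"({100 * (before - after) / before:.1f}% reduction)")
```

Storing floats as `float32` loses precision, so it is worth checking that your metric does not change noticeably before relying on this kind of downcasting.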


In [6]:
# tmlt.spl

#### To get a clearer picture, let's run k-fold training on the XGBoost model

In [11]:
# k-fold training
xgb_model_metrics_score, xgb_model_preds = tmlt.do_k_fold_training(n_splits=5,
                                                                   metrics=mean_absolute_error,
                                                                   random_state=42)
print("mean metrics score:", np.mean(xgb_model_metrics_score))
# shape of the returned predictions
print(xgb_model_preds.shape)

2021-11-15 23:27:51,400 INFO fold: 1 , mean_absolute_error: 18947.19236943493
2021-11-15 23:27:52,005 INFO fold: 2 , mean_absolute_error: 15652.96465646404
2021-11-15 23:27:52,662 INFO fold: 3 , mean_absolute_error: 16128.323335830479
2021-11-15 23:27:53,367 INFO fold: 4 , mean_absolute_error: 15037.816045055652
2021-11-15 23:27:54,048 INFO fold: 5 , mean_absolute_error: 17555.253585188355


mean metrics score: 16664.309998394692
(1459,)
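
For readers curious about what a k-fold run like the one logged above involves, here is a minimal sketch built on scikit-learn's `KFold`. It is not the toolkit's implementation: `k_fold_train` is a hypothetical helper, the data is synthetic, `LinearRegression` stands in for the XGBoost model, and averaging the test predictions across folds is an assumption about how a single prediction per test row could be produced.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold


def k_fold_train(model, X, y, X_test, n_splits=5, random_state=42):
    """Fit a fresh copy of `model` on each fold; return fold scores and averaged test predictions."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores, fold_preds = [], []
    for fold, (train_idx, valid_idx) in enumerate(kf.split(X), start=1):
        fold_model = clone(model)  # unfitted copy, so folds do not share state
        fold_model.fit(X[train_idx], y[train_idx])
        score = mean_absolute_error(y[valid_idx], fold_model.predict(X[valid_idx]))
        print(f"fold: {fold} , mean_absolute_error: {score}")
        scores.append(score)
        fold_preds.append(fold_model.predict(X_test))
    # Average the per-fold test predictions (assumption, see lead-in).
    return scores, np.mean(fold_preds, axis=0)


# Synthetic regression data (assumption): y is linear in X plus small noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 3))

scores, test_preds = k_fold_train(LinearRegression(), X, y, X_test)
print("mean metrics score:", np.mean(scores))
print(test_preds.shape)
```

Each row is used for validation exactly once, so the mean of the fold scores is a less noisy estimate than a single train/validation split.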


##### Let's see if we can improve our k-fold score with hyperparameter tuning


In [20]:
study = tmlt.do_xgb_optuna_optimization(task="regression", xgb_eval_metric="rmse",
                                        kfold_metrics=mean_absolute_error, output_dir_path="output/")
print(study.best_trial)

2021-11-15 23:45:44,713 INFO direction is: minimize
[I 2021-11-15 23:45:44,735] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.


Parameters: { early_stopping_rounds } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-11-15 23:46:27,876 INFO fold: 1 , mean_absolute_error: 18639.659139554795


2021-11-15 23:47:15,923 INFO fold: 2 , mean_absolute_error: 14579.810439854453


2021-11-15 23:48:04,696 INFO fold: 3 , mean_absolute_error: 14449.907440603596


2021-11-15 23:48:52,654 INFO fold: 4 , mean_absolute_error: 13730.850840111301


2021-11-15 23:49:42,643 INFO fold: 5 , mean_absolute_error: 14964.15269156678
[I 2021-11-15 23:49:42,670] Trial 32 finished with value: 15272.876110338184 and parameters: {'learning_rate': 0.011695606004809915, 'reg_lambda': 8.225407153633027e-07, 'reg_alpha': 0.07497277629406573, 'subsample': 0.6193108123085774, 'colsample_bytree': 0.749550584753849, 'max_depth': 6, 'early_stopping_rounds': 230, 'n_estimators': 20000, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 0.00044506298947651746, 'grow_policy': 'depthwise'}. Best is trial 31 with value: 15161.862516053083.


Parameters: { colsample_bytree, early_stopping_rounds, max_depth, subsample, tree_method } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-11-15 23:49:45,925 INFO fold: 1 , mean_absolute_error: 20660.13059182363


2021-11-15 23:49:49,167 INFO fold: 2 , mean_absolute_error: 16804.197051583906


2021-11-15 23:49:52,628 INFO fold: 3 , mean_absolute_error: 18372.172356592466


2021-11-15 23:49:55,760 INFO fold: 4 , mean_absolute_error: 17178.698282320205


2021-11-15 23:49:59,056 INFO fold: 5 , mean_absolute_error: 19840.807135595034
[I 2021-11-15 23:49:59,078] Trial 33 finished with value: 18571.201083583048 and parameters: {'learning_rate': 0.01494828183713036, 'reg_lambda': 2.0372404597067631e-07, 'reg_alpha': 0.0004681141874241499, 'subsample': 0.4560979285593141, 'colsample_bytree': 0.8418515093368574, 'max_depth': 5, 'early_stopping_rounds': 314, 'n_estimators': 7000, 'tree_method': 'hist', 'booster': 'gblinear'}. Best is trial 31 with value: 15161.862516053083.


2021-11-15 23:50:19,220 INFO fold: 1 , mean_absolute_error: 18967.44814854452


2021-11-15 23:50:39,859 INFO fold: 2 , mean_absolute_error: 14574.83781035959


2021-11-15 23:51:00,036 INFO fold: 3 , mean_absolute_error: 14777.20542594178


2021-11-15 23:51:20,609 INFO fold: 4 , mean_absolute_error: 14449.328017979453


2021-11-15 23:51:40,226 INFO fold: 5 , mean_absolute_error: 15539.640732020547
[I 2021-11-15 23:51:40,244] Trial 34 finished with value: 15661.69202696918 and parameters: {'learning_rate': 0.027144479610101583, 'reg_lambda': 1.2222118821306003e-07, 'reg_alpha': 0.18300630703030263, 'subsample': 0.6568068926863969, 'colsample_bytree': 0.9175936020919679, 'max_depth': 6, 'early_stopping_rounds': 168, 'n_estimators': 7000, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 0.6201174261247351, 'grow_policy': 'depthwise'}. Best is trial 31 with value: 15161.862516053083.


2021-11-15 23:52:03,843 INFO fold: 1 , mean_absolute_error: 18248.452402611303


2021-11-15 23:52:28,068 INFO fold: 2 , mean_absolute_error: 14796.746307791096


2021-11-15 23:52:54,757 INFO fold: 3 , mean_absolute_error: 14450.720636237158


2021-11-15 23:53:19,235 INFO fold: 4 , mean_absolute_error: 14190.637146832192


2021-11-15 23:53:43,141 INFO fold: 5 , mean_absolute_error: 15427.863762842466
[I 2021-11-15 23:53:43,160] Trial 35 finished with value: 15422.884051262843 and parameters: {'learning_rate': 0.017176534972555314, 'reg_lambda': 6.379202083079888e-07, 'reg_alpha': 0.013147773254711077, 'subsample': 0.5559462698979399, 'colsample_bytree': 0.5526981282005409, 'max_depth': 8, 'early_stopping_rounds': 229, 'n_estimators': 7000, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 0.00010121273930113025, 'grow_policy': 'depthwise'}. Best is trial 31 with value: 15161.862516053083.


FrozenTrial(number=31, values=[15161.862516053083], datetime_start=datetime.datetime(2021, 11, 15, 23, 34, 33, 726431), datetime_complete=datetime.datetime(2021, 11, 15, 23, 35, 51, 401636), params={'booster': 'gbtree', 'colsample_bytree': 0.7978993899268726, 'early_stopping_rounds': 219, 'gamma': 0.00012602511463355868, 'grow_policy': 'depthwise', 'learning_rate': 0.015379020226564632, 'max_depth': 5, 'n_estimators': 7000, 'reg_alpha': 0.03253148458263187, 'reg_lambda': 3.208442870212238e-07, 'subsample': 0.6347778191052607, 'tree_method': 'hist'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'gamma': LogUniformDistribution(high=1.0, low=1e-08), 'grow_policy': CategoricalDistribution(choices=('depthwise', 'lossguide')), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistributi

In [21]:
study.best_trial.params

{'booster': 'gbtree',
 'colsample_bytree': 0.7978993899268726,
 'early_stopping_rounds': 219,
 'gamma': 0.00012602511463355868,
 'grow_policy': 'depthwise',
 'learning_rate': 0.015379020226564632,
 'max_depth': 5,
 'n_estimators': 7000,
 'reg_alpha': 0.03253148458263187,
 'reg_lambda': 3.208442870212238e-07,
 'subsample': 0.6347778191052607,
 'tree_method': 'hist'}

**Awesome, we found the best params with k-fold validation; the best trial (trial 31) took a little over a minute to run!**

##### Now let's use the best params to update the model in our pipeline

In [22]:
autoxgb_params = {'learning_rate': 0.016067642810265004,
                  'reg_lambda': 0.0005033307729410949,
                  'reg_alpha': 1.125131255655592e-06,
                  'subsample': 0.43211847297916883,
                  'colsample_bytree': 0.4106787563173376,
                  'max_depth': 5,
                  'early_stopping_rounds': 354,
                  'n_estimators': 7000,
                  'tree_method': 'approx',
                  'booster': 'gbtree',
                  'gamma': 0.2870988185671683,
                  'grow_policy': 'depthwise'}

In [23]:
# use the best params found by Optuna
# (to try the alternative set instead: xgb_params = autoxgb_params)
xgb_params = study.best_trial.params
xgb_model = XGBRegressor(**xgb_params)
# swap the tuned model into the existing pipeline
tmlt.update_model(xgb_model)
tmlt.spl
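
The XGBoost warning in the logs (`Parameters: { early_stopping_rounds } might not be used`) suggests that, with the XGBoost version used for this run, `early_stopping_rounds` passed to the constructor was forwarded to the core library and ignored. Where this parameter belongs depends on the XGBoost version, so check the documentation for your installed release. As a hedged sketch, the tuned parameters can be split into two dictionaries so that each part can be routed to wherever your version expects it. The names `FIT_ONLY_KEYS`, `model_params`, and `fit_params` are ours, not part of the toolkit.

```python
# Best parameters copied from the study output shown above.
best_params = {
    "booster": "gbtree",
    "colsample_bytree": 0.7978993899268726,
    "early_stopping_rounds": 219,
    "gamma": 0.00012602511463355868,
    "grow_policy": "depthwise",
    "learning_rate": 0.015379020226564632,
    "max_depth": 5,
    "n_estimators": 7000,
    "reg_alpha": 0.03253148458263187,
    "reg_lambda": 3.208442870212238e-07,
    "subsample": 0.6347778191052607,
    "tree_method": "hist",
}

# Keys that are treated here as training-time settings rather than
# constructor arguments (assumption; depends on the XGBoost version).
FIT_ONLY_KEYS = {"early_stopping_rounds"}

model_params = {k: v for k, v in best_params.items() if k not in FIT_ONLY_KEYS}
fit_params = {k: v for k, v in best_params.items() if k in FIT_ONLY_KEYS}

print("constructor params:", sorted(model_params))
print("training-time params:", fit_params)
```

Whether the pipeline's k-fold helper accepts training-time arguments is not shown in this notebook, so treat the split as a starting point rather than a drop-in fix.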

In [18]:
# k-fold training with the tuned model
xgb_model_metrics_score, xgb_model_preds = tmlt.do_k_fold_training(n_splits=5,
                                                                   metrics=mean_absolute_error,
                                                                   random_state=42)
print("mean metrics score:", np.mean(xgb_model_metrics_score))
# shape of the returned predictions
print(xgb_model_preds.shape)



2021-11-15 23:44:19,642 INFO fold: 1 , mean_absolute_error: 18305.722629494863


2021-11-15 23:44:33,580 INFO fold: 2 , mean_absolute_error: 14528.402704944348


2021-11-15 23:44:47,748 INFO fold: 3 , mean_absolute_error: 13867.527704944348


2021-11-15 23:45:02,359 INFO fold: 4 , mean_absolute_error: 13539.36665239726


2021-11-15 23:45:16,811 INFO fold: 5 , mean_absolute_error: 15188.842867080479


mean metrics score: 15085.97251177226
(1459,)



#### Yes indeed, tuning the XGBoost model with Optuna has improved the MAE over the earlier cross-validated model!

**Amazing, our mean MAE has dropped from 16664.31 to 15085.97 with Optuna hyperparameter tuning. If we keep tuning, maybe we can do even better; take that as a challenge!**
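
To put a number on the improvement, the per-fold scores from the two k-fold runs logged above can be averaged and compared. This is plain arithmetic on values copied from the logs; the variable names are ours.

```python
from statistics import fmean

# Per-fold MAE of the baseline model (from the first k-fold run above).
baseline_folds = [
    18947.19236943493,
    15652.96465646404,
    16128.323335830479,
    15037.816045055652,
    17555.253585188355,
]

# Per-fold MAE of the Optuna-tuned model (from the final k-fold run above).
tuned_folds = [
    18305.722629494863,
    14528.402704944348,
    13867.527704944348,
    13539.36665239726,
    15188.842867080479,
]

baseline_mae = fmean(baseline_folds)
tuned_mae = fmean(tuned_folds)
improvement_pct = 100 * (baseline_mae - tuned_mae) / baseline_mae

print(f"baseline: {baseline_mae:.2f}, tuned: {tuned_mae:.2f}, "
      f"improvement: {improvement_pct:.1f}%")
```

The tuned model lowers the mean MAE by roughly 9.5 percent, and every individual fold improves as well.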