In [1]:
%load_ext autoreload
%autoreload 2

# Getting Started Tutorial with Tabular ML Toolkit

> A tutorial on getting started with Tabular ML Toolkit

> tabular_ml_toolkit is a helper library that speeds up machine learning projects built on tabular (structured) data.

> It comes with model parallelism and cutting-edge hyperparameter tuning techniques.

## Install

`pip install -U tabular_ml_toolkit`
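
To confirm the install worked before going further, a small check like the one below can help. It is only a sketch using the standard library; `installed_version` is a hypothetical helper name, and it assumes the distribution is registered under the same name used in the `pip install` command above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name matches the name passed to pip above.
version = installed_version("tabular_ml_toolkit")
print("tabular_ml_toolkit version:", version or "not installed")
```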

## How to Best Use tabular_ml_toolkit

Start with your favorite model, then create an MLPipeline with a single API call.

*For example, here we use XGBRegressor from XGBoost on the [Home Data for ML Course](https://www.kaggle.com/estrotococo/home-data-for-ml-course) house sale price dataset.*


*There is no need to install scikit-learn separately; it comes preinstalled with Tabular_ML_Toolkit.*

In [2]:
from tabular_ml_toolkit.mlpipeline import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import pandas as pd
import numpy as np

# for displaying diagram of pipelines 
from sklearn import set_config
set_config(display="diagram")

# Just to compare fit times
import time

In [3]:
# Dataset file names and Paths
DIRECTORY_PATH = "https://raw.githubusercontent.com/psmathur/tabular_ml_toolkit/master/input/home_data/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
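
The pipeline cell further down builds file locations with plain string concatenation, which only works while `DIRECTORY_PATH` ends with a slash. A slightly more defensive sketch is shown here; `data_url` is a hypothetical helper, not part of the toolkit.

```python
from urllib.parse import urljoin

DIRECTORY_PATH = "https://raw.githubusercontent.com/psmathur/tabular_ml_toolkit/master/input/home_data/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"


def data_url(base, file_name):
    """Join base and file name with exactly one slash between them."""
    # urljoin replaces the last path segment unless the base ends with "/",
    # so normalise the base first.
    return urljoin(base.rstrip("/") + "/", file_name)


train_url = data_url(DIRECTORY_PATH, TRAIN_FILE)
test_url = data_url(DIRECTORY_PATH, TEST_FILE)
print(train_url)
print(test_url)
```

The result is the same as `DIRECTORY_PATH+TRAIN_FILE` here, but it stays correct if the trailing slash is dropped from the base URL.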

In [4]:
from xgboost import XGBRegressor

xgb_params = {
    'n_estimators':250,
    'learning_rate':0.05,
    'random_state':42,
    # for GPU
#     'tree_method': 'gpu_hist',
#     'predictor': 'gpu_predictor',
}


# create xgb ml model
xgb_model = XGBRegressor(**xgb_params)
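
Instead of commenting the GPU lines in and out by hand, the params can be built with a flag. This is just a sketch: `build_xgb_params` is a hypothetical helper, and the GPU values are copied from the commented-out lines above. Depending on your XGBoost version, other GPU settings may be preferred, so check the XGBoost docs for the release you have installed.

```python
def build_xgb_params(use_gpu=False):
    """Return the tutorial's XGBoost params, optionally with the GPU settings."""
    params = {
        "n_estimators": 250,
        "learning_rate": 0.05,
        "random_state": 42,
    }
    if use_gpu:
        # Values taken from the commented-out lines in the cell above.
        params["tree_method"] = "gpu_hist"
        params["predictor"] = "gpu_predictor"
    return params


cpu_params = build_xgb_params()
gpu_params = build_xgb_params(use_gpu=True)
print(cpu_params)
print(gpu_params)
```

Either dictionary can then be unpacked into the model the same way as before, for example `XGBRegressor(**cpu_params)`.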

In [5]:
# create the ML pipeline for the XGBoost model (scikit-learn API)
tmlt = MLPipeline().prepare_data_for_training(
    train_file_path= DIRECTORY_PATH+TRAIN_FILE,
    test_file_path= DIRECTORY_PATH+TEST_FILE,
    idx_col="Id", target="SalePrice",
    model=xgb_model,
    random_state=42)

2021-11-16 18:10:34,678 INFO 12 cores found, parallel processing is enabled!
2021-11-16 18:10:35,180 INFO DataFrame Memory usage decreased to 0.58 Mb (35.5% reduction)
2021-11-16 18:10:35,546 INFO DataFrame Memory usage decreased to 0.58 Mb (34.8% reduction)


In [6]:
# tmlt.spl

#### To get a clear picture, let's do k-fold training on the model

In [7]:
# # k-fold training
# xgb_model_metrics_score, xgb_model_preds = tmlt.do_k_fold_training(n_splits=5,
#                                                                           metrics=mean_absolute_error,
#                                                                           random_state=42)
# print("mean metrics score:", np.mean(xgb_model_metrics_score))
# # predict
# print(xgb_model_preds.shape)
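
If k-fold training is new to you, here is the idea in plain Python. It is a toy sketch, not how `do_k_fold_training` is implemented: the target values are made up, the "model" simply predicts the mean of its training targets, and `k_fold_indices` and `mae` are hypothetical helper names.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, n_splits=5, random_state=42):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    # Deal the shuffled indices into n_splits folds, round-robin style.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Made-up target values, for illustration only.
y = [100.0, 120.0, 90.0, 150.0, 130.0, 110.0, 95.0, 160.0, 140.0, 105.0]

fold_scores = []
for train_idx, valid_idx in k_fold_indices(len(y), n_splits=5):
    # "Fit" on the training folds: predict their mean for every validation row.
    prediction = mean(y[i] for i in train_idx)
    y_valid = [y[i] for i in valid_idx]
    fold_scores.append(mae(y_valid, [prediction] * len(y_valid)))

print("fold scores:", fold_scores)
print("mean metrics score:", mean(fold_scores))
```

The final score is the average over folds, which is the same kind of number the `mean metrics score` line in this tutorial reports.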

##### Let's see if we can improve our k-fold score with hyperparameter tuning

In [8]:
# from sklearn.preprocessing import StandardScaler, MinMaxScaler
# from sklearn.impute import SimpleImputer
# # from sklearn.

In [9]:
study = tmlt.do_xgb_optuna_optimization(task="regression", xgb_eval_metric="rmse",
                                        kfold_metrics=mean_absolute_error, output_dir_path="output/")
print(study.best_trial)

2021-11-16 18:10:35,827 INFO direction is: minimize
[I 2021-11-16 18:10:35,921] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.


Parameters: { "colsample_bytree", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




2021-11-16 18:10:51,129 INFO fold: 1 , mean_absolute_error: 22159.765477846748


2021-11-16 18:11:05,466 INFO fold: 2 , mean_absolute_error: 16814.249438142124


2021-11-16 18:11:17,081 INFO fold: 3 , mean_absolute_error: 18457.338291952055


2021-11-16 18:11:24,688 INFO fold: 4 , mean_absolute_error: 17241.310399721748


2021-11-16 18:11:32,901 INFO fold: 5 , mean_absolute_error: 19821.63851468857
[I 2021-11-16 18:11:32,935] Trial 45 finished with value: 18898.860424470247 and parameters: {'learning_rate': 0.016515998563772637, 'reg_lambda': 8.088564149641936e-08, 'reg_alpha': 0.0005635063160681214, 'subsample': 0.22548198976416645, 'colsample_bytree': 0.8720130206059861, 'max_depth': 8, 'early_stopping_rounds': 436, 'n_estimators': 15000, 'tree_method': 'exact', 'booster': 'gblinear'}. Best is trial 38 with value: 15106.833885380993.
2021-11-16 18:12:01,464 INFO fold: 1 , mean_absolute_error: 18106.061403039384
2021-11-16 18:12:34,375 INFO fold: 2 , mean_absolute_error: 15288.579409246575
2021-11-16 18:13:26,053 INFO fold: 3 , mean_absolute_error: 15021.933647260274
2021-11-16 18:13:38,018 INFO fold: 4 , mean_absolute_error: 13583.052092251712
2021-11-16 18:13:51,454 INFO fold: 5 , mean_absolute_error: 15095.503478167808
[I 2021-11-16 18:13:51,476] Trial 46 finished with value: 1

FrozenTrial(number=47, values=[14894.671767979453], datetime_start=datetime.datetime(2021, 11, 16, 18, 13, 51, 481897), datetime_complete=datetime.datetime(2021, 11, 16, 18, 17, 7, 746829), params={'booster': 'gbtree', 'colsample_bytree': 0.6146669993264926, 'early_stopping_rounds': 194, 'gamma': 0.012933483229426676, 'grow_policy': 'lossguide', 'learning_rate': 0.010848218278471048, 'max_depth': 4, 'n_estimators': 20000, 'reg_alpha': 4.21300307891317e-06, 'reg_lambda': 1.0613008216761301e-05, 'subsample': 0.6052227077512764, 'tree_method': 'hist'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'gamma': LogUniformDistribution(high=1.0, low=1e-08), 'grow_policy': CategoricalDistribution(choices=('depthwise', 'lossguide')), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistributi
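
The trial output above lists `learning_rate` with a log-uniform distribution between 0.01 and 0.25. To get a feel for what that means, here is a small standard-library sketch of log-uniform sampling. It illustrates the distribution in general and is not Optuna's own sampler; `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
# Bounds copied from the learning_rate distribution in the trial output above.
learning_rates = [sample_log_uniform(0.01, 0.25, rng) for _ in range(1000)]

# 0.05 is the geometric midpoint of 0.01 and 0.25, so roughly half of the
# samples should fall below it, even though it sits near the low end linearly.
share_small = sum(lr < 0.05 for lr in learning_rates) / len(learning_rates)
print(f"min={min(learning_rates):.4f} max={max(learning_rates):.4f}")
print(f"share of samples below 0.05: {share_small:.2f}")
```

Sampling on a log scale spends as many draws between 0.01 and 0.05 as between 0.05 and 0.25, which suits parameters like learning rates that matter on a multiplicative scale.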

In [10]:
study.best_trial.params

{'booster': 'gbtree',
 'colsample_bytree': 0.6146669993264926,
 'early_stopping_rounds': 194,
 'gamma': 0.012933483229426676,
 'grow_policy': 'lossguide',
 'learning_rate': 0.010848218278471048,
 'max_depth': 4,
 'n_estimators': 20000,
 'reg_alpha': 4.21300307891317e-06,
 'reg_lambda': 1.0613008216761301e-05,
 'subsample': 0.6052227077512764,
 'tree_method': 'hist'}

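**Awesome, we found the best params with k-fold validation! According to the log timestamps, this run took about 6.5 minutes (18:10:35 to 18:17:07) and resumed an existing study.**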
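
As a quick sanity check on how the trial values relate to the fold scores, the numbers logged for trial 45 can be averaged by hand. The values below are copied from the log output above; the check suggests that the trial value is simply the mean of the five fold MAEs.

```python
from statistics import mean

# Fold scores and reported value for trial 45, copied from the log above.
trial_45_fold_mae = [
    22159.765477846748,
    16814.249438142124,
    18457.338291952055,
    17241.310399721748,
    19821.63851468857,
]
trial_45_value = 18898.860424470247

recomputed = mean(trial_45_fold_mae)
print("recomputed mean:", recomputed)
print("matches reported trial value:", abs(recomputed - trial_45_value) < 1e-6)
```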

##### Now let's use the best params to update the model in our pipeline

In [11]:
# autoxgb_params = {'learning_rate': 0.016067642810265004,
#                   'reg_lambda': 0.0005033307729410949,
#                   'reg_alpha': 1.125131255655592e-06,
#                   'subsample': 0.43211847297916883,
#                   'colsample_bytree': 0.4106787563173376,
#                   'max_depth': 5,
#                   'early_stopping_rounds': 354,
#                   'n_estimators': 7000,
#                   'tree_method': 'approx',
#                   'booster': 'gbtree',
#                   'gamma': 0.2870988185671683,
#                   'grow_policy': 'depthwise'}

In [12]:
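# xgb_params = autoxgb_params
# reuse the best hyperparameters found by the Optuna study
tmlt_xgb_params = study.best_trial.params
xgb_params = tmlt_xgb_params
# build a fresh model with the tuned params and swap it into the pipeline
xgb_model = XGBRegressor(**xgb_params)
tmlt.update_model(xgb_model)
# display the updated scikit-learn pipeline
tmlt.spl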
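
The k-fold run that follows prints a warning that `early_stopping_rounds` might not be used, which suggests that, in the XGBoost version used here, the model constructor does not consume that key. One way to deal with it is to separate such keys from the rest of the tuned params before building the model. This is a minimal sketch: `split_fit_params` is a hypothetical helper, not part of tabular_ml_toolkit or XGBoost, and the source does not show whether early stopping can then be passed through the toolkit's training call.

```python
def split_fit_params(params, fit_only_keys=("early_stopping_rounds",)):
    """Split tuned params into constructor params and fit-time params."""
    model_params = {k: v for k, v in params.items() if k not in fit_only_keys}
    fit_params = {k: v for k, v in params.items() if k in fit_only_keys}
    return model_params, fit_params


# Values copied from study.best_trial.params shown earlier.
best_params = {
    "booster": "gbtree",
    "colsample_bytree": 0.6146669993264926,
    "early_stopping_rounds": 194,
    "gamma": 0.012933483229426676,
    "grow_policy": "lossguide",
    "learning_rate": 0.010848218278471048,
    "max_depth": 4,
    "n_estimators": 20000,
    "reg_alpha": 4.21300307891317e-06,
    "reg_lambda": 1.0613008216761301e-05,
    "subsample": 0.6052227077512764,
    "tree_method": "hist",
}

model_params, fit_params = split_fit_params(best_params)
print("constructor params:", sorted(model_params))
print("fit-time params:", fit_params)
```

With this split, `XGBRegressor(**model_params)` would receive only the keys it is expected to consume. Keep in mind that without early stopping the model trains all `n_estimators` rounds.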

In [13]:
# k-fold training
xgb_model_metrics_score, xgb_model_preds = tmlt.do_k_fold_training(n_splits=5,
                                                                          metrics=mean_absolute_error,
                                                                          random_state=42)
print("mean metrics score:", np.mean(xgb_model_metrics_score))
# predict
print(xgb_model_preds.shape)



Parameters: { "early_stopping_rounds" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




2021-11-16 18:17:47,114 INFO fold: 1 , mean_absolute_error: 18164.97436857877


2021-11-16 18:18:27,455 INFO fold: 2 , mean_absolute_error: 14516.620277718323


2021-11-16 18:19:06,391 INFO fold: 3 , mean_absolute_error: 14323.742562071919


2021-11-16 18:19:47,218 INFO fold: 4 , mean_absolute_error: 13025.971826840754


2021-11-16 18:20:26,076 INFO fold: 5 , mean_absolute_error: 14920.972188035103


mean metrics score: 14990.456244648973
(1459,)


mean metrics score: 15447.19800406678
(1459,)
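
The reported mean can be reproduced from the five fold scores in the log, and the spread across folds is worth a look too. The sketch below uses values copied from the output above. It also assumes that the second printed score (15447.198...) is the earlier, untuned cross-validated result; the output does not say so explicitly, so treat the comparison as illustrative.

```python
from statistics import mean, stdev

# Fold MAEs of the tuned model, copied from the log above.
tuned_fold_mae = [
    18164.97436857877,
    14516.620277718323,
    14323.742562071919,
    13025.971826840754,
    14920.972188035103,
]
tuned_mean = mean(tuned_fold_mae)
print(f"mean MAE: {tuned_mean:.2f}")
print(f"fold-to-fold std: {stdev(tuned_fold_mae):.2f}")

# ASSUMPTION: this is the earlier, untuned cross-validated score.
earlier_mean = 15447.19800406678
improvement = (earlier_mean - tuned_mean) / earlier_mean
print(f"relative improvement: {improvement:.1%}")
```

Fold 1 is noticeably worse than the others, so the difference between the two mean scores is small compared with the fold-to-fold variation.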

#### Yes indeed, tuning the XGBoost model with Optuna has improved the MAE over the earlier cross-validated model!

**Amazing, our mean MAE has dropped to 14990.46 with Optuna hyperparameter tuning. If we keep tuning, maybe we can do even better; take that as a challenge!**