In [None]:
# hide
from nbdev.showdoc import *
from nbdev import *

# Tabular ML Toolkit

> A super fast helper library to jumpstart your machine learning project based on tabular or structured data.

> It comes with model parallelism and cutting-edge hyperparameter tuning techniques.

## Install

`pip install -U tabular_ml_toolkit`

## How to use

Start with your favorite model, then create an ML pipeline with a single API call.

*For example, here we use `XGBRegressor` from XGBoost on the [Melbourne Home Sale price data](https://www.kaggle.com/estrotococo/home-data-for-ml-course).*


*There is no need to install scikit-learn separately, as it comes preinstalled with Tabular_ML_Toolkit.*

In [None]:
from tabular_ml_toolkit.tmlt import *
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np
# just to measure fit time
import time

In [None]:
# Dataset file names and Paths
DIRECTORY_PATH = "https://raw.githubusercontent.com/psmathur/tabular_ml_toolkit/master/input/home_data/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
# Make sure to create this output directory next to this notebook
OUTPUT_PATH = "tutorial_output/"

In [None]:
# create xgb ml model
xgb_model = XGBRegressor(random_state=42)

# create the ml pipeline for the xgb model
tmlt = TMLT().prepare_data_for_training(
    train_file_path= DIRECTORY_PATH+TRAIN_FILE,
    test_file_path= DIRECTORY_PATH+TEST_FILE,
    idx_col="Id", target="SalePrice",
    model=xgb_model,
    random_state=42,
    problem_type="regression")

# visualize scikit-pipeline
# tmlt.spl

2021-11-20 21:20:06,045 INFO 12 cores found, parallel processing is enabled!
2021-11-20 21:20:06,576 INFO DataFrame Memory usage decreased to 0.58 Mb (35.5% reduction)
2021-11-20 21:20:06,980 INFO DataFrame Memory usage decreased to 0.58 Mb (34.8% reduction)
2021-11-20 21:20:07,008 INFO Both Numerical & Categorical columns found, Preprocessing will done accordingly!
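The "Memory usage decreased" lines in the log above suggest that numeric columns are stored in smaller dtypes after loading. The sketch below shows one common way to get that effect with plain pandas. It is not taken from the toolkit's source: the helper name `downcast_numeric` and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype pandas deems safe."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy frame with hypothetical columns, created with explicit 64-bit dtypes.
df = pd.DataFrame({
    "rooms": np.arange(1000, dtype="int64"),
    "price": np.arange(1000, dtype="float64") * 0.5,
})

before = df.memory_usage(deep=True).sum()
small = downcast_numeric(df)
after = small.memory_usage(deep=True).sum()

print(f"Memory usage decreased to {after / 1024**2:.2f} Mb "
      f"({100 * (before - after) / before:.1f}% reduction)")
```

Downcasting trades numeric range and precision for memory, so it is worth checking that the values survive the conversion before relying on it.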


In [None]:
study = tmlt.do_xgb_optuna_optimization(xgb_eval_metric="rmse",
                                        kfold_metrics=mean_absolute_error,
                                        output_dir_path=OUTPUT_PATH)
print(study.best_trial)

2021-11-20 21:20:07,013 INFO direction is: minimize
[I 2021-11-20 21:20:07,129] A new study created in RDB with name: tmlt_autoxgb


Parameters: { "colsample_bytree", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




2021-11-20 21:20:11,393 INFO fold: 1 , mean_absolute_error: 35790.80912885274


2021-11-20 21:20:17,146 INFO fold: 2 , mean_absolute_error: 33405.666309931505


2021-11-20 21:20:22,751 INFO fold: 3 , mean_absolute_error: 39933.20020869007


2021-11-20 21:20:27,901 INFO fold: 4 , mean_absolute_error: 35007.23145869007


2021-11-20 21:20:33,069 INFO fold: 5 , mean_absolute_error: 37255.71524507705
2021-11-20 21:20:33,070 INFO  mean metrics score: 36278.524470248296
[I 2021-11-20 21:20:33,109] Trial 0 finished with value: 36278.524470248296 and parameters: {'learning_rate': 0.06837917669134744, 'reg_lambda': 10.464291360599324, 'reg_alpha': 3.881790940186402e-08, 'subsample': 0.6307472664806286, 'colsample_bytree': 0.7793929777473942, 'max_depth': 7, 'early_stopping_rounds': 403, 'n_estimators': 15000, 'tree_method': 'approx', 'booster': 'gblinear'}. Best is trial 0 with value: 36278.524470248296.


2021-11-20 21:20:35,475 INFO fold: 1 , mean_absolute_error: 19432.099475599316


2021-11-20 21:20:37,915 INFO fold: 2 , mean_absolute_error: 17706.260675299658


2021-11-20 21:20:40,171 INFO fold: 3 , mean_absolute_error: 17637.29025979238


2021-11-20 21:20:42,608 INFO fold: 4 , mean_absolute_error: 16144.116933326199


2021-11-20 21:20:45,684 INFO fold: 5 , mean_absolute_error: 17603.451713666524
2021-11-20 21:20:45,685 INFO  mean metrics score: 17704.643811536815
[I 2021-11-20 21:20:45,710] Trial 1 finished with value: 17704.643811536815 and parameters: {'learning_rate': 0.013251610183817672, 'reg_lambda': 0.05070997103744175, 'reg_alpha': 6.5596377189764485, 'subsample': 0.24997325724596772, 'colsample_bytree': 0.26866824835445197, 'max_depth': 1, 'early_stopping_rounds': 143, 'n_estimators': 7000, 'tree_method': 'exact', 'booster': 'gblinear'}. Best is trial 1 with value: 17704.643811536815.
2021-11-20 21:21:39,657 INFO fold: 1 , mean_absolute_error: 18901.152236729453
2021-11-20 21:22:35,102 INFO fold: 2 , mean_absolute_error: 14733.510019798801
2021-11-20 21:23:27,829 INFO fold: 3 , mean_absolute_error: 14491.222201412671
2021-11-20 21:24:22,652 INFO fold: 4 , mean_absolute_error: 13826.06889447774
2021-11-20 21:25:17,423 INFO fold: 5 , mean_absolute_error: 15390.086740154109
2021-1

FrozenTrial(number=2, values=[15468.408018514552], datetime_start=datetime.datetime(2021, 11, 20, 21, 20, 45, 716981), datetime_complete=datetime.datetime(2021, 11, 20, 21, 25, 17, 425720), params={'booster': 'gbtree', 'colsample_bytree': 0.936441228462284, 'early_stopping_rounds': 323, 'gamma': 0.00016954628965916405, 'grow_policy': 'depthwise', 'learning_rate': 0.011129207646131652, 'max_depth': 9, 'n_estimators': 20000, 'reg_alpha': 52.967728527588584, 'reg_lambda': 0.30050997361951176, 'subsample': 0.5258942705414594, 'tree_method': 'hist'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'gamma': LogUniformDistribution(high=1.0, low=1e-08), 'grow_policy': CategoricalDistribution(choices=('depthwise', 'lossguide')), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistribution(h
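The value Optuna records for each trial in the log above appears to be the plain mean of the five per-fold MAE scores. The short check below reproduces that arithmetic from the logged numbers and picks the trial with the lowest mean, which is what `direction is: minimize` implies. The variable names are illustrative only.

```python
from statistics import fmean

# Per-fold mean_absolute_error values copied from the log output above.
fold_mae = {
    0: [35790.80912885274, 33405.666309931505, 39933.20020869007,
        35007.23145869007, 37255.71524507705],
    1: [19432.099475599316, 17706.260675299658, 17637.29025979238,
        16144.116933326199, 17603.451713666524],
    2: [18901.152236729453, 14733.510019798801, 14491.222201412671,
        13826.06889447774, 15390.086740154109],
}

# One score per trial: the average MAE across its folds.
trial_scores = {trial: fmean(scores) for trial, scores in fold_mae.items()}

# Lower MAE is better, so the best trial is the one with the minimum score.
best_trial = min(trial_scores, key=trial_scores.get)

for trial, score in trial_scores.items():
    print(f"trial {trial}: mean MAE = {score:.3f}")
print("best trial:", best_trial)
```

The means agree with the logged `mean metrics score` values for trials 0 and 1 and with the `values` field of the `FrozenTrial` for trial 2.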

#### Update XGB Model with best params

In [None]:
xgb_params =  study.best_trial.params
xgb_model = XGBRegressor(**xgb_params)
tmlt.update_model(xgb_model)
tmlt.spl
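The best-trial dictionary contains `early_stopping_rounds` alongside the regular model hyperparameters, and the log output further down in this notebook shows XGBoost warning that this parameter "might not be used" when it is passed this way. Whether it belongs in the constructor or in `fit` depends on the XGBoost version. If you need to separate such keys, a plain-Python sketch could look like the following. `split_params` and `FIT_ONLY_KEYS` are hypothetical names and not part of the toolkit or of XGBoost.

```python
# A subset of the best-trial params shown in the FrozenTrial output above.
best_params = {
    "booster": "gbtree",
    "early_stopping_rounds": 323,
    "learning_rate": 0.011129207646131652,
    "max_depth": 9,
    "n_estimators": 20000,
    "tree_method": "hist",
}

# Assumption: keys listed here are treated as fit-time arguments.
# Adjust the set to match the XGBoost version you have installed.
FIT_ONLY_KEYS = {"early_stopping_rounds"}


def split_params(params, fit_only_keys=FIT_ONLY_KEYS):
    """Split a params dict into (constructor params, fit-time params)."""
    model_params = {k: v for k, v in params.items() if k not in fit_only_keys}
    fit_params = {k: v for k, v in params.items() if k in fit_only_keys}
    return model_params, fit_params


model_params, fit_params = split_params(best_params)
print("constructor params:", sorted(model_params))
print("fit-time params:", fit_params)
```

`model_params` could then be passed as `XGBRegressor(**model_params)`, keeping the remaining keys for wherever your XGBoost version expects them.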

#### Let's run hyperparameter optimization to find the best params for data preprocessing

Let's give our grid search a maximum time budget of 6 minutes, because we don't have an eternity to wait for hyperparameter tuning!

In [None]:
# let's do tune grid search for Data PreProcessing hyperparams tuning

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer


# let's tune the data preprocessing hyperparams
param_grid = {
    "preprocessor__num_cols__scaler": [StandardScaler(), MinMaxScaler()],
    "preprocessor__cat_cols__imputer": [SimpleImputer(strategy='constant'),
                                        SimpleImputer(strategy='most_frequent')]
}

start = time.time()
# Now do tune grid search
tune_search = tmlt.do_tune_grid_search(param_grid=param_grid,
                                       cv=5,
                                       scoring='neg_mean_absolute_error',
                                       early_stopping=False,
                                       time_budget_s=360)
end = time.time()
print("Grid Search Time:", end - start)

print("Best params:")
print(tune_search.best_params_)

print(f"Internal CV Metrics score: {-1*(tune_search.best_score_):.3f}")



[2m[36m(_Trainable pid=13878)[0m Parameters: { "early_stopping_rounds" } might not be used.
[2m[36m(_Trainable pid=13878)[0m 
[2m[36m(_Trainable pid=13878)[0m   This could be a false alarm, with some parameters getting used by language bindings but
[2m[36m(_Trainable pid=13878)[0m   then being mistakenly passed down to XGBoost core, or some parameter actually being used
[2m[36m(_Trainable pid=13878)[0m   but getting flagged wrongly here. Please open an issue if you find any such cases.
[2m[36m(_Trainable pid=13878)[0m 
[2m[36m(_Trainable pid=13878)[0m 
[2m[36m(_Trainable pid=13874)[0m Parameters: { "early_stopping_rounds" } might not be used.
[2m[36m(_Trainable pid=13874)[0m 
[2m[36m(_Trainable pid=13874)[0m   This could be a false alarm, with some parameters getting used by language bindings but
[2m[36m(_Trainable pid=13874)[0m   then being mistakenly passed down to XGBoost core, or some parameter actually being used
[2m[36m(_Trainable pid=13874)[0m

SIGINT received (e.g. via Ctrl+C), ending Ray Tune run. This will try to checkpoint the experiment state one last time. Press CTRL+C one more time (or send SIGINT/SIGKILL/SIGTERM) to skip. 
[2m[36m(pid=13874)[0m 2021-11-20 21:29:57,210	ERROR worker.py:425 -- SystemExit was raised from the worker
[2m[36m(pid=13874)[0m Traceback (most recent call last):
[2m[36m(pid=13874)[0m   File "python/ray/_raylet.pyx", line 692, in ray._raylet.task_execution_handler
[2m[36m(pid=13874)[0m   File "python/ray/_raylet.pyx", line 521, in ray._raylet.execute_task
[2m[36m(pid=13874)[0m   File "python/ray/_raylet.pyx", line 558, in ray._raylet.execute_task
[2m[36m(pid=13874)[0m   File "python/ray/_raylet.pyx", line 565, in ray._raylet.execute_task
[2m[36m(pid=13874)[0m   File "python/ray/_raylet.pyx", line 569, in ray._raylet.execute_task
[2m[36m(pid=13874)[0m   File "python/ray/_raylet.pyx", line 519, in ray._raylet.execute_task.function_executor
[2m[36m(pid=13874)[0m   File "/Us

Trials did not complete: [_Trainable_f70e6_00000, _Trainable_f70e6_00001, _Trainable_f70e6_00002, _Trainable_f70e6_00003]
Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`


ZeroDivisionError: division by zero
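The `param_grid` used in this search offers two scalers and two imputers, so an exhaustive grid search has 2 x 2 = 4 combinations to evaluate. That matches the four trial names listed in the "Trials did not complete" line of the output above. The sketch below enumerates the grid with the standard library; plain strings stand in for the scikit-learn objects so that it runs without any dependencies.

```python
from itertools import product

# Same keys as the param_grid above; string labels replace the estimator objects.
param_grid = {
    "preprocessor__num_cols__scaler": ["StandardScaler", "MinMaxScaler"],
    "preprocessor__cat_cols__imputer": ["constant", "most_frequent"],
}

keys = list(param_grid)
# Cartesian product of all candidate values, one dict per combination.
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

for i, combo in enumerate(combinations):
    print(i, combo)
print("total combinations:", len(combinations))
```

With `cv=5`, each of these combinations is fitted once per fold, so the time budget has to cover roughly 20 model fits.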

If you want to customize the data loading and preprocessing steps, you can do so with the `DataFrameLoader` and `PreProcessor` classes. Please check the other tutorials and the detailed documentation of these classes for more options.
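Keys such as `preprocessor__num_cols__scaler` in the `param_grid` above follow scikit-learn's nested parameter convention, where each double underscore steps one level down into a pipeline. The sketch below builds a small scikit-learn pipeline whose step names mirror those keys to show how such a key resolves. The toy data, the column names, and the `LinearRegression` model are assumptions for illustration, and the pipeline the toolkit builds internally may be structured differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy data with hypothetical columns, including missing values.
X = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0],
    "kind": ["flat", np.nan, "house", "house"],
})
y = np.array([100.0, 160.0, 150.0, 240.0])

# Step names are chosen to match the param_grid keys used in this notebook.
num_cols = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_cols = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num_cols", num_cols, ["area"]),
    ("cat_cols", cat_cols, ["kind"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LinearRegression()),
])

# "preprocessor__num_cols__scaler" means:
# pipeline step "preprocessor" -> transformer "num_cols" -> step "scaler".
pipeline.set_params(
    preprocessor__num_cols__scaler=MinMaxScaler(),
    preprocessor__cat_cols__imputer=SimpleImputer(strategy="most_frequent"),
)

pipeline.fit(X, y)
preds = pipeline.predict(X)
print(preds.shape)
```

A grid search does the same thing as the `set_params` call here, once for every combination of candidate values.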

**Amazing, our 5-fold CV MAE dropped even further after just a few minutes of Tune grid search hyperparameter tuning!**

If we continue the hyperparameter tuning, maybe we can do even better. You can also try `early_stopping`; take that as a challenge!

###### Let's use our newly found params for 10-fold training and test predictions

##### Update PreProcessor on tmlt with best params found from tune grid search

In [None]:
pp_params = tmlt.get_preprocessor_best_params(tune_search)

# Update pipeline with updated preprocessor
tmlt.update_preprocessor(**pp_params)
tmlt.spl

##### Run 10-fold training on tmlt with the updated preprocessor and model

In [None]:
# k-fold training
xgb_model_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(n_splits=10,
                                                                          metrics=mean_absolute_error,
                                                                          random_state=42)
# Check test dataset prediction shape
print(xgb_model_test_preds.shape)

**Yay, we have a much better MAE with 10-fold training!**
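For readers who want to see what a k-fold training loop of this kind typically does, here is a minimal scikit-learn sketch. It is written under two assumptions that are not confirmed by the toolkit's documentation: the metric is averaged over the validation folds, and the test predictions are averaged over the fold models. The synthetic data and the `LinearRegression` model are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data standing in for the real train and test sets.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_test = rng.normal(size=(20, 3))

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []
test_preds = np.zeros(len(X_test))

for train_idx, valid_idx in kfold.split(X):
    # Fit a fresh model on the training part of this fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Score it on the held-out part.
    fold_scores.append(mean_absolute_error(y[valid_idx], model.predict(X[valid_idx])))
    # Accumulate this fold's share of the averaged test prediction.
    test_preds += model.predict(X_test) / kfold.get_n_splits()

print("mean MAE across folds:", np.mean(fold_scores))
print(test_preds.shape)
```

Averaging over ten fold models usually gives steadier test predictions than a single fit, at the cost of ten times the training work.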