# Getting Started Tutorial with TMLT (Tabular ML Toolkit)

> A tutorial on getting started with TMLT (Tabular ML Toolkit)

> tabular_ml_toolkit is a helper library for jump-starting machine learning projects on tabular (structured) data.

> It comes with model and data parallelism and cutting-edge hyperparameter search techniques.

> Under the hood, TMLT uses modin, optuna, xgboost, and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

## How to Best Use tabular_ml_toolkit

Start with your favorite model, then create a `TMLT` instance with a single API call.

*For example, here we use XGBRegressor on the [Melbourne Home Sale price data](https://www.kaggle.com/estrotococo/home-data-for-ml-course).*

```python
from tabular_ml_toolkit.tmlt import *
from sklearn.metrics import mean_absolute_error
import numpy as np

# Just to compare fit times
import time
```

```python
# Dataset file names and paths
DIRECTORY_PATH = "input/home_data/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "output/"
```

##### Create a base XGBoost regressor with your best-guess params

```python
from xgboost import XGBRegressor

xgb_params = {
    'learning_rate': 0.1,
    'use_label_encoder': False,
    'eval_metric': 'rmse',
    'random_state': 42,
    # for GPU
    # 'tree_method': 'gpu_hist',
    # 'predictor': 'gpu_predictor',
}
# create the xgb ml model
xgb_model = XGBRegressor(**xgb_params)
```

##### Point tmlt at your data, tell it which columns are the index (`idx_col`) and the target, and specify the problem type you are solving

```python
# tmlt
tmlt = TMLT().prepare_data_for_training(
    train_file_path=DIRECTORY_PATH + TRAIN_FILE,
    test_file_path=DIRECTORY_PATH + TEST_FILE,
    idx_col="Id", target="SalePrice",
    model=xgb_model,
    random_state=42,
    problem_type="regression")

# TMLT currently supports only the following problem_type values:
# "binary_classification"
# "multi_label_classification"
# "multi_class_classification"
# "regression"
```

2021-11-27 20:51:24,843 INFO 12 cores found, model and data parallel processing should worked!
2021-11-27 20:51:24,896 INFO DataFrame Memory usage decreased to 0.58 Mb (35.5% reduction)
2021-11-27 20:51:24,945 INFO DataFrame Memory usage decreased to 0.58 Mb (34.8% reduction)
2021-11-27 20:51:24,979 INFO Both Numerical & Categorical columns found, Preprocessing will done accordingly!


```python
# create a train/valid split to evaluate the model on the validation dataset
tmlt.dfl.create_train_valid(valid_size=0.2)

start = time.time()
# now fit
tmlt.spl.fit(tmlt.dfl.X_train, tmlt.dfl.y_train)
end = time.time()
print("Fit Time:", end - start)

# predict
preds = tmlt.spl.predict(tmlt.dfl.X_valid)
print('X_valid MAE:', mean_absolute_error(tmlt.dfl.y_valid, preds))
```

Fit Time: 0.25485920906066895
X_valid MAE: 15936.53249411387


In the background, the `prepare_data_for_training` method loads your input data into a Pandas DataFrame and separates X (features) from y (target).

It then preprocesses all numerical and categorical columns found in these DataFrames using scikit-learn pipelines, bundles the preprocessing with your given model, and returns an MLPipeline object. This instance holds the dataframeloader, preprocessor and scikit-learn pipeline instances (the dataframeloader and pipeline are accessed above as `tmlt.dfl` and `tmlt.spl`).
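As a rough illustration of this kind of bundling, here is a minimal scikit-learn sketch. It is not TMLT's internal code: the toy data, the step names, the imputation and encoding choices, and the `DecisionTreeRegressor` stand-in model are all assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Toy data, made up for illustration only.
X = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, float("nan"), 9550.0, 14260.0, 14115.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Grvl"],
})
y = pd.Series([208500, 181500, 223500, 140000, 250000, 143000])

# Numerical columns: fill missing values, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill missing values, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["LotArea"]),
    ("cat", categorical_steps, ["Street"]),
])

# Bundle preprocessing and model so fit/predict run both in one call.
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", DecisionTreeRegressor(random_state=42)),  # stand-in for your model
])
pipeline.fit(X, y)
preds = pipeline.predict(X)
print(preds.shape)
```

Keeping preprocessing inside the pipeline means the same transformations are learned on the training data and reapplied unchanged at prediction time.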

The `create_train_valid` method uses `valid_size` to split X (features) and y (target) into X_train, y_train, X_valid and y_valid, so you can call fit on X_train and y_train and predict on X_valid or X_test.
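The idea behind a hold-out split can be sketched in plain Python. The `train_valid_split` helper below is hypothetical and only illustrates what a `valid_size` fraction means; TMLT's own implementation may shuffle or stratify differently.

```python
import random


def train_valid_split(rows, targets, valid_size=0.2, seed=42):
    """Shuffle row indices, then hold out a `valid_size` fraction for validation."""
    if len(rows) != len(targets):
        raise ValueError("rows and targets must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(len(indices) * valid_size))
    valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
    return (
        [rows[i] for i in train_idx],
        [targets[i] for i in train_idx],
        [rows[i] for i in valid_idx],
        [targets[i] for i in valid_idx],
    )


# Ten toy rows; each target is 100 times the first feature.
X = [[i, i * 2] for i in range(10)]
y = [i * 100 for i in range(10)]
X_train, y_train, X_valid, y_valid = train_valid_split(X, y, valid_size=0.2)
print(len(X_train), len(X_valid))  # 8 2
```

Features and targets are indexed with the same shuffled positions, so each row stays paired with its own target.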


Please check the detailed documentation and source code for more information.

*NOTE: If you want to customize the data loading and preprocessing steps, you can do so with the `DataFrameLoader` and `PreProcessor` classes. Check the detailed documentation for these classes for more options.*



#### To get a clearer picture of model performance, let's do a quick cross-validation on our pipeline

```python
start = time.time()
# now do cross-validation
scores = tmlt.do_cross_validation(cv=5, scoring='neg_mean_absolute_error')
end = time.time()
print("Cross Validation Time:", end - start)

print("scores:", scores)
print("Average MAE score:", scores.mean())
```

Cross Validation Time: 1.2626621723175049
scores: [15752.16827643 16405.26146458 16676.95384739 14588.82684075
 17320.45218857]
Average MAE score: 16148.73252354452
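To make the mechanics behind the per-fold scores concrete, here is a small pure-Python sketch of k-fold cross-validation. It assumes contiguous, unshuffled folds and uses a mean-predicting baseline in place of a real model; the helper names are hypothetical and this is not how `do_cross_validation` is implemented.

```python
from statistics import fmean


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs over contiguous folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, valid_idx
        start = stop


def cross_val_mae(y, n_splits=5):
    """Score a baseline that always predicts the training-fold mean."""
    scores = []
    for train_idx, valid_idx in kfold_indices(len(y), n_splits):
        prediction = fmean(y[i] for i in train_idx)  # "fit" on the training folds
        y_valid = [y[i] for i in valid_idx]
        scores.append(mae(y_valid, [prediction] * len(y_valid)))
    return scores


y = [float(v) for v in range(10)]
scores = cross_val_mae(y, n_splits=5)
print("scores:", scores)
print("Average MAE score:", fmean(scores))
```

Every sample is used for validation exactly once, which is why the averaged score is usually a steadier estimate than a single hold-out split.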


*MAE came out slightly worse with cross-validation.*

*Let's see if we can improve our cross-validation score with hyperparameter tuning.*

**We are using an Optuna-based hyperparameter search here; make sure to supply a new directory path so the search is saved.**

```python
study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH, opt_timeout=60)
print(study.best_trial)
```

2021-11-27 20:52:12,151 INFO Optimization Direction is: minimize
[I 2021-11-27 20:52:12,215] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.
2021-11-27 20:52:12,537 INFO Training Started!


Parameters: { "early_stopping_rounds", "eval_set" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




2021-11-27 20:52:16,683 INFO Training Ended!
2021-11-27 20:52:16,802 INFO mean_absolute_error: 14961.287657855308
2021-11-27 20:52:16,802 INFO mean_squared_error: 710182288.1081377
2021-11-27 20:52:16,803 INFO r2_score: 0.9074117229274168
[I 2021-11-27 20:52:16,842] Trial 31 finished with value: 710182288.1081377 and parameters: {'learning_rate': 0.010287833814049732, 'n_estimators': 7000, 'reg_lambda': 2.2021084672156013, 'reg_alpha': 2.9596676195877394, 'subsample': 0.6027064970124942, 'colsample_bytree': 0.11131537174951261, 'max_depth': 4, 'early_stopping_rounds': 232, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 1.2611297024132594e-06, 'grow_policy': 'depthwise'}. Best is trial 23 with value: 604200048.7128911.
2021-11-27 20:52:17,088 INFO Training Started!


2021-11-27 20:52:26,236 INFO Training Ended!
2021-11-27 20:52:26,353 INFO mean_absolute_error: 14896.90144745291
2021-11-27 20:52:26,354 INFO mean_squared_error: 607292442.2350011
2021-11-27 20:52:26,355 INFO r2_score: 0.9208257346778856
[I 2021-11-27 20:52:26,390] Trial 32 finished with value: 607292442.2350011 and parameters: {'learning_rate': 0.01600677836151433, 'n_estimators': 7000, 'reg_lambda': 1.9491473446193701, 'reg_alpha': 0.864289638451314, 'subsample': 0.8805539578148631, 'colsample_bytree': 0.3144918754703741, 'max_depth': 4, 'early_stopping_rounds': 241, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 4.1194944286516154e-07, 'grow_policy': 'lossguide'}. Best is trial 23 with value: 604200048.7128911.
2021-11-27 20:52:26,630 INFO Training Started!


2021-11-27 20:52:41,225 INFO Training Ended!
2021-11-27 20:52:41,322 INFO mean_absolute_error: 15013.209158283391
2021-11-27 20:52:41,323 INFO mean_squared_error: 609257417.2319984
2021-11-27 20:52:41,323 INFO r2_score: 0.920569555873501
[I 2021-11-27 20:52:41,343] Trial 33 finished with value: 609257417.2319984 and parameters: {'learning_rate': 0.0177606536084275, 'n_estimators': 7000, 'reg_lambda': 2.174335552307863, 'reg_alpha': 0.732412893540352, 'subsample': 0.8988757563697343, 'colsample_bytree': 0.5597976047603338, 'max_depth': 4, 'early_stopping_rounds': 178, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 6.785043359143624e-07, 'grow_policy': 'lossguide'}. Best is trial 23 with value: 604200048.7128911.
2021-11-27 20:52:41,536 INFO Training Started!


2021-11-27 20:52:48,508 INFO Training Ended!
2021-11-27 20:52:48,579 INFO mean_absolute_error: 15094.149106378425
2021-11-27 20:52:48,580 INFO mean_squared_error: 612248651.396432
2021-11-27 20:52:48,581 INFO r2_score: 0.9201795810427524
[I 2021-11-27 20:52:48,599] Trial 34 finished with value: 612248651.396432 and parameters: {'learning_rate': 0.018375921886293854, 'n_estimators': 7000, 'reg_lambda': 27.063132685501202, 'reg_alpha': 0.25625985717334526, 'subsample': 0.8923260077170412, 'colsample_bytree': 0.5461041412290502, 'max_depth': 3, 'early_stopping_rounds': 153, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 5.57536980738242e-08, 'grow_policy': 'lossguide'}. Best is trial 23 with value: 604200048.7128911.
2021-11-27 20:52:48,792 INFO Training Started!


2021-11-27 20:52:53,070 INFO Training Ended!
2021-11-27 20:52:53,137 INFO mean_absolute_error: 15612.104465432363
2021-11-27 20:52:53,138 INFO mean_squared_error: 642768084.3513441
2021-11-27 20:52:53,139 INFO r2_score: 0.9162006846919929
[I 2021-11-27 20:52:53,162] Trial 35 finished with value: 642768084.3513441 and parameters: {'learning_rate': 0.016822984659983716, 'n_estimators': 7000, 'reg_lambda': 0.686990461824656, 'reg_alpha': 0.005456798619418751, 'subsample': 0.9390935972332031, 'colsample_bytree': 0.6237535008376071, 'max_depth': 2, 'early_stopping_rounds': 188, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 1.5334080777849023e-07, 'grow_policy': 'depthwise'}. Best is trial 23 with value: 604200048.7128911.
2021-11-27 20:52:53,356 INFO Training Started!


2021-11-27 20:53:03,169 INFO Training Ended!
2021-11-27 20:53:03,289 INFO mean_absolute_error: 15572.701840753425
2021-11-27 20:53:03,290 INFO mean_squared_error: 651125629.7493893
2021-11-27 20:53:03,291 INFO r2_score: 0.9151110901725037
[I 2021-11-27 20:53:03,321] Trial 36 finished with value: 651125629.7493893 and parameters: {'learning_rate': 0.02412163067898655, 'n_estimators': 7000, 'reg_lambda': 75.9045043850116, 'reg_alpha': 0.1420454367576083, 'subsample': 0.8907076548824316, 'colsample_bytree': 0.5588717381660775, 'max_depth': 3, 'early_stopping_rounds': 138, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 4.1577349379570457e-08, 'grow_policy': 'lossguide'}. Best is trial 23 with value: 604200048.7128911.
2021-11-27 20:53:03,541 INFO Training Started!


2021-11-27 20:53:17,175 INFO Training Ended!
2021-11-27 20:53:17,265 INFO mean_absolute_error: 15179.852833369006
2021-11-27 20:53:17,265 INFO mean_squared_error: 607046563.7856213
2021-11-27 20:53:17,266 INFO r2_score: 0.9208577904787392
[I 2021-11-27 20:53:17,285] Trial 37 finished with value: 607046563.7856213 and parameters: {'learning_rate': 0.01829836967730152, 'n_estimators': 7000, 'reg_lambda': 12.309123534329808, 'reg_alpha': 0.4197927584423797, 'subsample': 0.8957871298167308, 'colsample_bytree': 0.4683807902340768, 'max_depth': 4, 'early_stopping_rounds': 159, 'tree_method': 'hist', 'booster': 'gbtree', 'gamma': 2.6902639749404396e-07, 'grow_policy': 'lossguide'}. Best is trial 23 with value: 604200048.7128911.


FrozenTrial(number=23, values=[604200048.7128911], datetime_start=datetime.datetime(2021, 11, 22, 23, 39, 5, 962582), datetime_complete=datetime.datetime(2021, 11, 22, 23, 39, 41, 252155), params={'booster': 'gbtree', 'colsample_bytree': 0.5960603552824647, 'early_stopping_rounds': 401, 'gamma': 0.0005177750295162097, 'grow_policy': 'lossguide', 'learning_rate': 0.020767130829769383, 'max_depth': 6, 'n_estimators': 7000, 'reg_alpha': 0.0008846136538441224, 'reg_lambda': 0.0023056651712866118, 'subsample': 0.84754637782141, 'tree_method': 'hist'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'gamma': LogUniformDistribution(high=1.0, low=1e-08), 'grow_policy': CategoricalDistribution(choices=('depthwise', 'lossguide')), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistribution(
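
The printed trial shows that the search samples each parameter from a distribution (for example, `learning_rate` is log-uniform between 0.01 and 0.25) and keeps the trial with the lowest objective value until the time budget runs out. The sketch below imitates that loop with plain random search, which is a deliberately simpler technique than Optuna's samplers. The `toy_objective` function is a hypothetical stand-in for "train the model and return the validation error", and only three of the printed ranges are reproduced.

```python
import math
import random
import time


def sample_params(rng):
    """Draw one candidate using three ranges shown in the printed distributions."""
    return {
        # log-uniform: sample uniformly in log space, then exponentiate
        "learning_rate": math.exp(rng.uniform(math.log(0.01), math.log(0.25))),
        "colsample_bytree": rng.uniform(0.1, 1.0),
        "gamma": math.exp(rng.uniform(math.log(1e-8), math.log(1.0))),
    }


def toy_objective(params):
    """Hypothetical stand-in for training a model and returning validation error."""
    return (
        (math.log10(params["learning_rate"]) + 1.7) ** 2
        + (params["colsample_bytree"] - 0.6) ** 2
    )


def random_search(objective, timeout_s, max_trials=200, seed=42):
    """Minimize `objective` until the time budget or the trial cap is reached."""
    rng = random.Random(seed)
    deadline = time.monotonic() + timeout_s
    best_value, best_params = math.inf, None
    for _ in range(max_trials):
        if time.monotonic() >= deadline:
            break
        params = sample_params(rng)
        value = objective(params)
        if value < best_value:  # optimization direction: minimize
            best_value, best_params = value, params
    return best_value, best_params


best_value, best_params = random_search(toy_objective, timeout_s=1.0)
print(best_value, best_params)
```

Sampling `learning_rate` and `gamma` in log space spreads trials evenly across orders of magnitude, which matters when a range spans 1e-08 to 1.0.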

#### Let's use our newly found best params to update the model in the sklearn pipeline

```python
xgb_params.update(study.best_trial.params)
print("xgb_params", xgb_params)
xgb_model = XGBRegressor(**xgb_params)
tmlt.update_model(xgb_model)
tmlt.spl
```

xgb_params {'learning_rate': 0.020767130829769383, 'use_label_encoder': False, 'eval_metric': 'rmse', 'random_state': 42, 'booster': 'gbtree', 'colsample_bytree': 0.5960603552824647, 'early_stopping_rounds': 401, 'gamma': 0.0005177750295162097, 'grow_policy': 'lossguide', 'max_depth': 6, 'n_estimators': 7000, 'reg_alpha': 0.0008846136538441224, 'reg_lambda': 0.0023056651712866118, 'subsample': 0.84754637782141, 'tree_method': 'hist'}
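
Note how `dict.update` treats overlapping keys: values from the best trial replace the initial guesses, while keys that only exist in the base dict are kept. A small sketch with a subset of the values printed above (the variable names are chosen for this example only):

```python
# Subset of the initial best-guess params.
base_params = {"learning_rate": 0.1, "eval_metric": "rmse", "random_state": 42}
# Subset of the best trial's params from the search output.
best_trial_params = {
    "learning_rate": 0.020767130829769383,
    "max_depth": 6,
    "n_estimators": 7000,
}

# Copy first so the original guesses stay available for comparison.
merged = dict(base_params)
merged.update(best_trial_params)

# Keys whose base value was replaced by the search result.
overridden = sorted(
    key for key in best_trial_params
    if key in base_params and base_params[key] != best_trial_params[key]
)
print(merged["learning_rate"])
print(overridden)
```

Here only `learning_rate` is overridden, moving from the initial guess of 0.1 to the tuned value, which matches the merged `xgb_params` printed above.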


#### Now, let's run 5-fold training on the updated XGB model with the best params found by the Optuna search

```python
# k-fold training
xgb_model_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(n_splits=5, test_preds_metric=mean_absolute_error)
```

2021-11-27 20:53:17,396 INFO  model class:<class 'xgboost.sklearn.XGBRegressor'>


Parameters: { "early_stopping_rounds" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




2021-11-27 20:53:51,969 INFO fold: 1 mean_absolute_error : 18068.945660316782
2021-11-27 20:53:51,970 INFO fold: 1 mean_squared_error : 1765853576.5144272
2021-11-27 20:53:51,970 INFO fold: 1 r2_score : 0.6952312760692718


2021-11-27 20:54:28,390 INFO fold: 2 mean_absolute_error : 14867.686028467466
2021-11-27 20:54:28,391 INFO fold: 2 mean_squared_error : 768162815.6640443
2021-11-27 20:54:28,391 INFO fold: 2 r2_score : 0.8630994826381212


2021-11-27 20:54:59,955 INFO fold: 3 mean_absolute_error : 14586.54402557791
2021-11-27 20:54:59,956 INFO fold: 3 mean_squared_error : 589412516.5140059
2021-11-27 20:54:59,956 INFO fold: 3 r2_score : 0.9225699447768843


2021-11-27 20:55:35,714 INFO fold: 4 mean_absolute_error : 13956.20951947774
2021-11-27 20:55:35,715 INFO fold: 4 mean_squared_error : 453154881.61446327
2021-11-27 20:55:35,716 INFO fold: 4 r2_score : 0.9187626742033932


2021-11-27 20:56:16,811 INFO fold: 5 mean_absolute_error : 15341.814881207192
2021-11-27 20:56:16,812 INFO fold: 5 mean_squared_error : 654930416.2989283
2021-11-27 20:56:16,813 INFO fold: 5 r2_score : 0.9044971118349543
2021-11-27 20:56:17,027 INFO  Mean Metrics Results from all Folds are: {'mean_absolute_error': 15364.240023009417, 'mean_squared_error': 846302841.3211738, 'r2_score': 0.860832097904525}
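
The "Mean Metrics Results" line is the plain arithmetic mean of each metric over the five folds. This can be checked by averaging the per-fold values copied from the log above (the `fold_metrics` dictionary is built by hand for this check):

```python
from statistics import fmean

# Per-fold values copied from the k-fold training log.
fold_metrics = {
    "mean_absolute_error": [
        18068.945660316782, 14867.686028467466, 14586.54402557791,
        13956.20951947774, 15341.814881207192,
    ],
    "mean_squared_error": [
        1765853576.5144272, 768162815.6640443, 589412516.5140059,
        453154881.61446327, 654930416.2989283,
    ],
    "r2_score": [
        0.6952312760692718, 0.8630994826381212, 0.9225699447768843,
        0.9187626742033932, 0.9044971118349543,
    ],
}

# Average each metric across folds.
mean_metrics = {name: fmean(values) for name, values in fold_metrics.items()}
for name, value in mean_metrics.items():
    print(f"{name}: {value:.4f}")
```

Fold 1 is a clear outlier (r2 of about 0.70 versus 0.86 to 0.92 elsewhere), so it pulls the averaged MAE and MSE up noticeably.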


```python
# predictions on the test dataset
if xgb_model_test_preds is not None:
    print(xgb_model_test_preds.shape)
```

(1459,)



##### You can improve the metric scores further by running the Optuna search for longer or by rerunning the study; check the documentation for more details