# Getting Started Tutorial with TMLT (Tabular ML Toolkit)

> A tutorial on getting started with TMLT (Tabular ML Toolkit)

> tabular_ml_toolkit is a helper library to jumpstart your machine learning project based on Tabular or Structured data.

> It comes with model parallelism and cutting-edge hyperparameter search techniques.

> Under the hood, TMLT uses Optuna, XGBoost and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

## How to Best Use tabular_ml_toolkit

Start with your favorite model, then create a `TMLT` object with a single API call.

*For example, here we use XGBRegressor on the [home sale price data](https://www.kaggle.com/estrotococo/home-data-for-ml-course).*

In [None]:
from tabular_ml_toolkit.tmlt import *
from sklearn.metrics import mean_absolute_error
import numpy as np
from xgboost import XGBRegressor

In [None]:
# Dataset file names and Paths
DIRECTORY_PATH = "input/home_data/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "output/"

#### Point tmlt at your data, tell it which columns are the index (`idx_col`) and the target, and specify the problem type you are trying to solve

In [None]:
%%time
# tmlt
tmlt = TMLT().prepare_data(
    train_file_path= DIRECTORY_PATH+TRAIN_FILE,
    test_file_path= DIRECTORY_PATH+TEST_FILE,
    idx_col="Id", target="SalePrice",
    random_state=42,
    problem_type="regression")

# TMLT currently supports only the following problem_type values:

# "binary_classification"
# "multi_label_classification"
# "multi_class_classification"
# "regression"

2021-12-13 15:32:35,221 INFO 12 cores found, model and data parallel processing should worked!
2021-12-13 15:32:35,260 INFO DataFrame Memory usage decreased to 0.58 Mb (35.5% reduction)
2021-12-13 15:32:35,308 INFO DataFrame Memory usage decreased to 0.58 Mb (34.8% reduction)
2021-12-13 15:32:35,343 INFO Both Numerical & Categorical columns found, Preprocessing will done accordingly!


CPU times: user 290 ms, sys: 53.9 ms, total: 344 ms
Wall time: 342 ms


In [None]:
print(type(tmlt.dfl.X))
print(tmlt.dfl.X.shape)
print(type(tmlt.dfl.y))
print(tmlt.dfl.y.shape)
print(type(tmlt.dfl.X_test))
print(tmlt.dfl.X_test.shape)

<class 'pandas.core.frame.DataFrame'>
(1460, 79)
<class 'numpy.ndarray'>
(1460,)
<class 'pandas.core.frame.DataFrame'>
(1459, 79)


In [None]:
tmlt.dfl.X

| Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | 0 | ... | TA | Y |  |  |  | WD | Normal | CollgCr | VinylSd | VinylSd |
| 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | 0 | ... | TA | Y |  |  |  | WD | Normal | Veenker | MetalSd | MetalSd |
| 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | 0 | ... | TA | Y |  |  |  | WD | Normal | CollgCr | VinylSd | VinylSd |
| 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | 0 | ... | TA | Y |  |  |  | WD | Abnorml | Crawfor | Wd Sdng | Wd Shng |
| 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | 0 | ... | TA | Y |  |  |  | WD | Normal | NoRidge | VinylSd | VinylSd |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1456 | 60 | 62.0 | 7917 | 6 | 5 | 1999 | 2000 | 0.0 | 0 | 0 | ... | TA | Y |  |  |  | WD | Normal | Gilbert | VinylSd | VinylSd |
| 1457 | 20 | 85.0 | 13175 | 6 | 6 | 1978 | 1988 | 119.0 | 790 | 163 | ... | TA | Y |  | MnPrv |  | WD | Normal | NWAmes | Plywood | Plywood |
| 1458 | 70 | 66.0 | 9042 | 7 | 9 | 1941 | 2006 | 0.0 | 275 | 0 | ... | TA | Y |  | GdPrv | Shed | WD | Normal | Crawfor | CemntBd | CmentBd |
| 1459 | 20 | 68.0 | 9717 | 5 | 6 | 1950 | 1996 | 0.0 | 49 | 1029 | ... | TA | Y |  |  |  | WD | Normal | NAmes | MetalSd | MetalSd |


### Training

##### Create train and valid DataFrames for quick preprocessing and training

In [None]:
%%time
# create a train/valid split to evaluate the model on the valid dataset
X_train, X_valid, y_train, y_valid = tmlt.dfl.create_train_valid(valid_size=0.2)

CPU times: user 6.27 ms, sys: 1.69 ms, total: 7.96 ms
Wall time: 6.71 ms


In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

(1168, 79)
(1168,)
(292, 79)
(292,)
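
The shapes above are what a plain 80/20 hold-out split of 1460 rows produces. As a rough sketch of the idea, here is the same split done with scikit-learn's `train_test_split` on placeholder arrays. This illustrates the behaviour only; it is an assumption, not TMLT's actual implementation of `create_train_valid`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as tmlt.dfl.X / tmlt.dfl.y in this tutorial
X = np.zeros((1460, 79))
y = np.zeros(1460)

# Hold out 20% of the rows for validation; random_state makes the split repeatable
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_valid.shape)  # (1168, 79) (292, 79)
```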


In [None]:
# X_train.columns.to_list()

##### Now preprocess X_train and X_valid

NOTE: Preprocessing returns NumPy arrays, not pandas DataFrames.
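
Preprocessing of this kind usually also changes the number of columns, because categorical columns are expanded into several numeric ones (in this tutorial's output, 79 columns become 302). A minimal sketch of such a pipeline on a tiny made-up frame follows. The column values, imputation strategies and one-hot encoder are assumptions for illustration, not TMLT's actual preprocessing steps.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative frames: one numerical and one categorical column
train = pd.DataFrame({"LotArea": [8450.0, np.nan, 11250.0],
                      "SaleType": ["WD", "New", "WD"]})
valid = pd.DataFrame({"LotArea": [9550.0], "SaleType": ["COD"]})

num_pipe = Pipeline([("impute", SimpleImputer(strategy="median"))])
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(
    [("num", num_pipe, ["LotArea"]), ("cat", cat_pipe, ["SaleType"])],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Fit on the training rows only, then reuse the fitted transformer on validation rows
train_np = preprocessor.fit_transform(train)
valid_np = preprocessor.transform(valid)

print(type(train_np), train_np.shape, valid_np.shape)
```

Fitting on the training rows and only transforming the validation rows keeps information from the validation set out of the imputation statistics and the learned category list.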

In [None]:
%%time
X_train_np,  X_valid_np = tmlt.pp_fit_transform(X_train, X_valid)

CPU times: user 40.2 ms, sys: 4.01 ms, total: 44.2 ms
Wall time: 42.9 ms


In [None]:
print(type(X_train_np))
print(X_train_np.shape)
# print(X_train_np)
print(type(X_valid_np))
print(X_valid_np.shape)
# print(X_valid_np)
print(type(y_valid))
print(type(y_train))

<class 'numpy.ndarray'>
(1168, 302)
<class 'numpy.ndarray'>
(292, 302)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


#### Training

##### Create a base XGBoost regressor model with your best-guess params

In [None]:
xgb_params = {
    'learning_rate': 0.1,
    'use_label_encoder': False,
    'eval_metric': 'rmse',
    'random_state': 42,
    # for GPU, uncomment:
    # 'tree_method': 'gpu_hist',
    # 'predictor': 'gpu_predictor',
}
# create the XGBoost regression model
xgb_model = XGBRegressor(**xgb_params)

In [None]:
%%time
# Now do model training
xgb_model.fit(X_train_np, y_train,
              verbose=False,
              # monitor the train and valid sets to detect and avoid overfitting
              eval_set=[(X_train_np, y_train), (X_valid_np, y_valid)],
              eval_metric="mae",
              early_stopping_rounds=300
             )

# predict on the validation set
preds = xgb_model.predict(X_valid_np)
print('X_valid MAE:', mean_absolute_error(y_valid, preds))

X_valid MAE: 15915.75480254709
CPU times: user 5.53 s, sys: 119 ms, total: 5.65 s
Wall time: 516 ms


Behind the scenes, the `prepare_data` method loads your input data into pandas DataFrames, separates X (features) and y (target), and preprocesses all numerical and categorical data found in these DataFrames using scikit-learn pipelines. It then bundles the preprocessor and the data and returns a TMLT object; this class instance holds the dataframeloader and preprocessor instances.
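
In pandas terms, the loading and separation step might look roughly like the sketch below. The inline CSV with made-up rows and the `load_xy` helper are hypothetical and only illustrate the `idx_col` / `target` idea; TMLT's own loader does more, such as the memory reduction reported in the log earlier.

```python
import io
import pandas as pd

# A stand-in for train.csv: an id column, two features and the target (made-up rows)
csv_text = """Id,LotArea,SaleType,SalePrice
1,8000,WD,100000
2,9500,WD,120000
3,11000,New,150000
"""

def load_xy(csv_file, idx_col, target):
    """Hypothetical helper: read a CSV, index it by idx_col, split off the target."""
    df = pd.read_csv(csv_file, index_col=idx_col)
    y = df.pop(target).to_numpy()  # target as a NumPy array
    return df, y                   # the remaining columns are the features

X, y = load_xy(io.StringIO(csv_text), idx_col="Id", target="SalePrice")
print(X.shape, y.shape)  # (3, 2) (3,)
```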

The `create_train_valid` method uses `valid_size` to split X (features) and y (target) into X_train, X_valid, y_train and y_valid, so you can call fit methods on X_train and y_train and predict methods on X_valid or X_test.

See the detailed documentation and source code for more information.

*NOTE: If you want to customize the data loading and preprocessing steps, you can do so with the `DataFrameLoader` and `PreProcessor` classes. Check the detailed documentation for these classes for more options.*



#### To get a clearer picture of model performance, let's do a quick cross-validation on our pipeline

##### Make sure to preprocess the data

In [None]:
%%time
X_np, X_test_np = tmlt.pp_fit_transform(tmlt.dfl.X, tmlt.dfl.X_test)
y_np = tmlt.dfl.y

CPU times: user 733 ms, sys: 33.8 ms, total: 767 ms
Wall time: 67.9 ms


In [None]:
%%time
# Now do cross_validation
scores = tmlt.do_cross_validation(X_np, y_np, xgb_model, scoring='neg_mean_absolute_error', cv=5)

print("scores:", scores)
print("Average MAE score:", scores.mean())

scores: [15733.51983893 16386.18366064 16648.82777718 14571.39875856
 17295.16245719]
Average MAE score: 16127.018498501711
CPU times: user 629 ms, sys: 240 ms, total: 868 ms
Wall time: 3.73 s
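
A note on the scoring name: scikit-learn's `neg_mean_absolute_error` scorer returns the negated MAE so that a larger score is always better. The positive values printed above suggest the sign is flipped back before reporting, although that is an assumption about `do_cross_validation` rather than something confirmed here. A small sketch with plain scikit-learn on synthetic data shows the convention:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the preprocessed arrays
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# cross_val_score returns one score per fold; for this scorer they are all <= 0
neg_scores = cross_val_score(Ridge(), X, y, scoring="neg_mean_absolute_error", cv=5)

# Negate to get the usual MAE, where lower is better
mae_scores = -neg_scores
print("MAE per fold:", mae_scores)
print("Average MAE:", mae_scores.mean())
```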


*MAE came out slightly worse with cross-validation.*

*Let's see if we can improve our cross-validation score with hyperparameter tuning.*

**We are using Optuna-based hyperparameter search here!**

**TMLT has a built-in XGBoost Optuna optimization helper method!**

In [None]:
# Just make sure to supply an output directory path so the hyperparameter search is saved
study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH, opt_timeout=10)
print(study.best_trial)

2021-12-13 15:32:39,883 INFO Optimization Direction is: minimize
[I 2021-12-13 15:32:39,950] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.
2021-12-13 15:32:40,156 INFO Training Started!


Parameters: { "colsample_bytree", "early_stopping_rounds", "eval_set", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




2021-12-13 15:32:56,014 INFO Training Ended!
2021-12-13 15:32:56,018 INFO mean_absolute_error: 56522.86215753425
[I 2021-12-13 15:32:56,057] Trial 37 finished with value: 56522.86215753425 and parameters: {'learning_rate': 0.041663514767391316, 'n_estimators': 20000, 'reg_lambda': 64.73961557549606, 'reg_alpha': 0.09505502167403512, 'subsample': 0.3839427809918437, 'colsample_bytree': 0.7110855538865497, 'max_depth': 4, 'early_stopping_rounds': 134, 'tree_method': 'hist', 'booster': 'gblinear'}. Best is trial 37 with value: 56522.86215753425.


FrozenTrial(number=37, values=[56522.86215753425], datetime_start=datetime.datetime(2021, 12, 13, 15, 32, 39, 994989), datetime_complete=datetime.datetime(2021, 12, 13, 15, 32, 56, 19679), params={'booster': 'gblinear', 'colsample_bytree': 0.7110855538865497, 'early_stopping_rounds': 134, 'learning_rate': 0.041663514767391316, 'max_depth': 4, 'n_estimators': 20000, 'reg_alpha': 0.09505502167403512, 'reg_lambda': 64.73961557549606, 'subsample': 0.3839427809918437, 'tree_method': 'hist'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistribution(high=9, low=1, step=1), 'n_estimators': CategoricalDistribution(choices=(7000, 15000, 20000)), 'reg_alpha': LogUniformDistribution(high=100.0, low=1e-08), 'reg_lambda': LogUniformDistribution

#### Let's use our newly found best params to update the model in the sklearn pipeline

In [None]:
xgb_params.update(study.best_trial.params)
print("xgb_params", xgb_params)
updated_xgb_model = XGBRegressor(**xgb_params)

xgb_params {'learning_rate': 0.041663514767391316, 'use_label_encoder': False, 'eval_metric': 'rmse', 'random_state': 42, 'booster': 'gblinear', 'colsample_bytree': 0.7110855538865497, 'early_stopping_rounds': 134, 'max_depth': 4, 'n_estimators': 20000, 'reg_alpha': 0.09505502167403512, 'reg_lambda': 64.73961557549606, 'subsample': 0.3839427809918437, 'tree_method': 'hist'}
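
The merged dictionary pairs `'booster': 'gblinear'` with tree-only settings such as `max_depth` and `subsample`. XGBoost's linear booster does not use those, which is what the "might not be used" warnings in this tutorial's logs point at. If you want to avoid passing them, one option is to drop the tree-only keys before building the model. `drop_tree_only_params` below is a hypothetical helper, not part of TMLT or XGBoost, and its key list is limited to the parameters seen in this tutorial.

```python
# Tree-booster settings from this tutorial that the linear booster does not use
TREE_ONLY_KEYS = {"max_depth", "subsample", "colsample_bytree", "tree_method"}

def drop_tree_only_params(params):
    """Hypothetical helper: remove tree-only keys when booster is 'gblinear'."""
    if params.get("booster") != "gblinear":
        return dict(params)  # tree boosters keep everything
    return {k: v for k, v in params.items() if k not in TREE_ONLY_KEYS}

best_params = {
    "booster": "gblinear", "learning_rate": 0.04, "max_depth": 4,
    "subsample": 0.38, "colsample_bytree": 0.71, "tree_method": "hist",
    "reg_alpha": 0.095, "reg_lambda": 64.7,
}
print(sorted(drop_tree_only_params(best_params)))
# ['booster', 'learning_rate', 'reg_alpha', 'reg_lambda']
```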


#### Now, let's run 5-fold training on this updated XGB model with the best params found by the Optuna search

In [None]:
%%time

# k-fold training
xgb_model_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(X_np, y_np,
                                                                       model=updated_xgb_model,
                                                                       X_test=X_test_np,
                                                                       n_splits=5)

2021-12-13 15:32:56,182 INFO Training Started!


Parameters: { "colsample_bytree", "early_stopping_rounds", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.




2021-12-13 15:33:19,688 INFO Training Finished!
2021-12-13 15:33:19,689 INFO Predicting Val Score!
2021-12-13 15:33:19,693 INFO fold: 1 mean_absolute_error : 51693.192851027394
2021-12-13 15:33:19,694 INFO Predicting Test Scores!
2021-12-13 15:33:19,787 INFO Training Started!






2021-12-13 15:33:41,691 INFO Training Finished!
2021-12-13 15:33:41,691 INFO Predicting Val Score!
2021-12-13 15:33:41,696 INFO fold: 2 mean_absolute_error : 47925.780875428085
2021-12-13 15:33:41,697 INFO Predicting Test Scores!
2021-12-13 15:33:41,776 INFO Training Started!






2021-12-13 15:34:03,584 INFO Training Finished!
2021-12-13 15:34:03,585 INFO Predicting Val Score!
2021-12-13 15:34:03,590 INFO fold: 3 mean_absolute_error : 55366.43573416096
2021-12-13 15:34:03,590 INFO Predicting Test Scores!
2021-12-13 15:34:03,677 INFO Training Started!






2021-12-13 15:34:26,851 INFO Training Finished!
2021-12-13 15:34:26,852 INFO Predicting Val Score!
2021-12-13 15:34:26,857 INFO fold: 4 mean_absolute_error : 50643.58561643836
2021-12-13 15:34:26,858 INFO Predicting Test Scores!
2021-12-13 15:34:26,947 INFO Training Started!






2021-12-13 15:34:57,438 INFO Training Finished!
2021-12-13 15:34:57,439 INFO Predicting Val Score!
2021-12-13 15:34:57,444 INFO fold: 5 mean_absolute_error : 53430.911012414384
2021-12-13 15:34:57,444 INFO Predicting Test Scores!
2021-12-13 15:34:57,541 INFO  Mean Metrics Results from all Folds are: {'mean_absolute_error': 51811.98121789383}


CPU times: user 21min 44s, sys: 29.4 s, total: 22min 13s
Wall time: 2min 1s
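
For readers who want to see the general shape of k-fold training with test-set predictions, here is a sketch in plain scikit-learn. It averages the per-fold test predictions, which is one common choice; whether `do_kfold_training` aggregates in exactly this way is an assumption. The synthetic data and the `Ridge` model are placeholders.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-ins for X_np, y_np and X_test_np
X_all, y_all = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X, y, X_test = X_all[:250], y_all[:250], X_all[250:]

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores, test_preds = [], []

for fold, (train_idx, valid_idx) in enumerate(kfold.split(X), start=1):
    model = clone(Ridge())  # fresh, unfitted copy for every fold
    model.fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[valid_idx], model.predict(X[valid_idx]))
    fold_scores.append(mae)
    test_preds.append(model.predict(X_test))
    print(f"fold: {fold} mean_absolute_error: {mae:.3f}")

# Average the five sets of test predictions into one prediction per test row
mean_test_preds = np.mean(test_preds, axis=0)
print("mean MAE:", np.mean(fold_scores), "test preds shape:", mean_test_preds.shape)
```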


In [None]:
# predict on test dataset
if xgb_model_test_preds is not None:
    print(xgb_model_test_preds.shape)

(1459,)



##### You can improve the metric score further by running the Optuna search for longer or by rerunning the study; check the documentation for more details
