# Getting Started Tutorial with TMLT (Tabular ML Toolkit)

> A tutorial on getting started with TMLT (Tabular ML Toolkit)

> tabular_ml_toolkit is a helper library to jumpstart your machine learning project based on tabular or structured data.

> It comes with model and data parallelism and cutting-edge hyperparameter search techniques.

> Under the hood, TMLT uses modin, optuna, xgboost and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

## How to Best Use tabular_ml_toolkit

Start with your favorite model, then create a `TMLT` object with a single API call.

*For example, here we use XGBRegressor on the [Ames, Iowa home sale price data](https://www.kaggle.com/estrotococo/home-data-for-ml-course).*

In [1]:
from tabular_ml_toolkit.tmlt import *
from sklearn.metrics import mean_absolute_error
import numpy as np
from xgboost import XGBRegressor



In [2]:
# Dataset file names and Paths
DIRECTORY_PATH = "input/home_data/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "output/"
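The cells below join these constants with `+`, which works only because `DIRECTORY_PATH` ends in a slash. A minimal sketch of the same paths built with `pathlib`, which does not depend on the trailing slash. The variable names `data_dir`, `train_path` and `test_path` are illustrative and not part of TMLT, and whether `prepare_data` accepts `Path` objects directly is not confirmed here, so the sketch converts back to strings:

```python
from pathlib import Path

# Same locations as the constants above; Path joins segments
# without relying on a trailing "/" in the directory name.
data_dir = Path("input/home_data")
train_path = data_dir / "train.csv"
test_path = data_dir / "test.csv"

# as_posix() returns a forward-slash string that can be passed
# wherever a plain string path is expected.
print(train_path.as_posix())
print(test_path.as_posix())
```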

#### Point tmlt at your data, tell it which columns are the index (`idx_col`) and the target, and specify the type of problem you are solving

In [3]:
# Load the train and test data and set up TMLT
tmlt = TMLT().prepare_data(
    train_file_path=DIRECTORY_PATH + TRAIN_FILE,
    test_file_path=DIRECTORY_PATH + TEST_FILE,
    idx_col="Id", target="SalePrice",
    random_state=42,
    problem_type="regression")

# TMLT currently supports only the following problem_type values:
# "binary_classification"
# "multi_label_classification"
# "multi_class_classification"
# "regression"

2021-12-03 23:27:17,467 INFO 8 cores found, model and data parallel processing should worked!
2021-12-03 23:27:17,506 INFO DataFrame Memory usage decreased to 0.58 Mb (35.5% reduction)
2021-12-03 23:27:17,545 INFO DataFrame Memory usage decreased to 0.58 Mb (34.8% reduction)
2021-12-03 23:27:17,570 INFO Both Numerical & Categorical columns found, Preprocessing will done accordingly!
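Because `problem_type` is a plain string, a typo is easy to make. The following is a minimal sketch of checking the value against the four names listed above before calling `prepare_data`. The helper `check_problem_type` and the constant `SUPPORTED_PROBLEM_TYPES` are hypothetical names introduced for illustration; they are not part of TMLT, and how TMLT itself reacts to an unknown value is not shown in this tutorial.

```python
# The four problem_type values listed in this tutorial.
SUPPORTED_PROBLEM_TYPES = (
    "binary_classification",
    "multi_label_classification",
    "multi_class_classification",
    "regression",
)


def check_problem_type(problem_type: str) -> str:
    """Return problem_type unchanged, or raise ValueError if it is not listed above."""
    if problem_type not in SUPPORTED_PROBLEM_TYPES:
        raise ValueError(
            f"unsupported problem_type {problem_type!r}; "
            f"expected one of {SUPPORTED_PROBLEM_TYPES}"
        )
    return problem_type


print(check_problem_type("regression"))
```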


In [4]:
print(type(tmlt.dfl.X))
print(tmlt.dfl.X.shape)
print(type(tmlt.dfl.y))
print(tmlt.dfl.y.shape)
print(type(tmlt.dfl.X_test))
print(tmlt.dfl.X_test.shape)

<class 'pandas.core.frame.DataFrame'>
(1460, 79)
<class 'numpy.ndarray'>
(1460,)
<class 'pandas.core.frame.DataFrame'>
(1459, 79)


In [5]:
tmlt.dfl.X

| Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | 0 | ... | TA | Y |  |  |  | WD | Normal | CollgCr | VinylSd | VinylSd |
| 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | 0 | ... | TA | Y |  |  |  | WD | Normal | Veenker | MetalSd | MetalSd |
| 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | 0 | ... | TA | Y |  |  |  | WD | Normal | CollgCr | VinylSd | VinylSd |
| 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | 0 | ... | TA | Y |  |  |  | WD | Abnorml | Crawfor | Wd Sdng | Wd Shng |
| 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | 0 | ... | TA | Y |  |  |  | WD | Normal | NoRidge | VinylSd | VinylSd |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1456 | 60 | 62.0 | 7917 | 6 | 5 | 1999 | 2000 | 0.0 | 0 | 0 | ... | TA | Y |  |  |  | WD | Normal | Gilbert | VinylSd | VinylSd |
| 1457 | 20 | 85.0 | 13175 | 6 | 6 | 1978 | 1988 | 119.0 | 790 | 163 | ... | TA | Y |  | MnPrv |  | WD | Normal | NWAmes | Plywood | Plywood |
| 1458 | 70 | 66.0 | 9042 | 7 | 9 | 1941 | 2006 | 0.0 | 275 | 0 | ... | TA | Y |  | GdPrv | Shed | WD | Normal | Crawfor | CemntBd | CmentBd |
| 1459 | 20 | 68.0 | 9717 | 5 | 6 | 1950 | 1996 | 0.0 | 49 | 1029 | ... | TA | Y |  |  |  | WD | Normal | NAmes | MetalSd | MetalSd |


#### Create train and validation splits for quick preprocessing and training

In [6]:
%%time
# create a train/valid split to evaluate the model on the validation set
X_train, X_valid, y_train, y_valid = tmlt.dfl.create_train_valid(valid_size=0.2)

CPU times: user 4.87 ms, sys: 1.49 ms, total: 6.36 ms
Wall time: 5.19 ms


In [7]:
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

(1168, 79)
(1168,)
(292, 79)
(292,)
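The shapes above follow directly from `valid_size=0.2` applied to 1460 rows. As a rough illustration of what a hold-out split does, here is a standard-library sketch that shuffles row indices and sets aside 20% of them. It is not TMLT's implementation; `split_indices` and its `seed` parameter are hypothetical names used only for this example.

```python
import random


def split_indices(n_rows, valid_size, seed=42):
    """Shuffle row indices and hold out a valid_size fraction for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_valid = round(n_rows * valid_size)
    valid_idx = indices[:n_valid]
    train_idx = indices[n_valid:]
    return train_idx, valid_idx


train_idx, valid_idx = split_indices(1460, 0.2)
print(len(train_idx), len(valid_idx))
```

The two index lists are disjoint and together cover every row, which is what keeps the validation score an honest estimate.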


In [8]:
# X_train.columns.to_list()

##### Now preprocess X_train and X_valid

NOTE: Preprocessing takes pandas DataFrames and returns NumPy arrays.

In [9]:
X_train_np, X_valid_np = tmlt.pp_fit_transform(X_train, X_valid)

In [10]:
print(type(X_train_np))
print(X_train_np.shape)
# print(X_train_np)
print(type(X_valid_np))
print(X_valid_np.shape)
# print(X_valid_np)
print(type(y_valid))
print(type(y_train))

<class 'numpy.ndarray'>
(1168, 302)
<class 'numpy.ndarray'>
(292, 302)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
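Note that the feature count grows from 79 to 302 after preprocessing. A likely reason, assumed here rather than confirmed from the TMLT source, is that categorical columns are one-hot encoded, so each category becomes its own 0/1 column. A small standard-library sketch of that idea, using three `Neighborhood` values from the table above (the `one_hot` helper is illustrative only):

```python
def one_hot(rows, column):
    """Replace one categorical column with one 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"LotArea": 8450, "Neighborhood": "CollgCr"},
    {"LotArea": 9600, "Neighborhood": "Veenker"},
    {"LotArea": 9550, "Neighborhood": "Crawfor"},
]
encoded = one_hot(rows, "Neighborhood")
# One numeric column plus one column per category.
print(len(rows[0]), "->", len(encoded[0]))
```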


##### Create a base XGBoost regressor model with your best-guess params

In [11]:
xgb_params = {
    'learning_rate': 0.1,
    'use_label_encoder': False,
    'eval_metric': 'rmse',
    'random_state': 42,
    # Uncomment for GPU training:
    # 'tree_method': 'gpu_hist',
    # 'predictor': 'gpu_predictor',
}
# create the XGBoost model
xgb_model = XGBRegressor(**xgb_params)

In [12]:
# Train the model
xgb_model.fit(X_train_np, y_train,
              verbose=False,
              # track train and validation error to detect and avoid overfitting
              eval_set=[(X_train_np, y_train), (X_valid_np, y_valid)],
              eval_metric="mae",
              early_stopping_rounds=300
             )

# Predict on the validation set
preds = xgb_model.predict(X_valid_np)
print('X_valid MAE:', mean_absolute_error(y_valid, preds))

X_valid MAE: 15915.75480254709
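Mean absolute error is the average of the absolute differences between the true and predicted values, `MAE = (1/n) * sum(|y_i - pred_i|)`, so the figure above is in the same units as `SalePrice`. A plain-Python sketch of the computation follows; the three price pairs are made-up illustrative values, not taken from the dataset, and `mean_absolute_error_plain` is a hypothetical name chosen to avoid clashing with the scikit-learn function.

```python
def mean_absolute_error_plain(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only.
y_true = [200000.0, 150000.0, 300000.0]
y_pred = [210000.0, 140000.0, 330000.0]
print(mean_absolute_error_plain(y_true, y_pred))
```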


In the background, the `prepare_data` method loads your input data into pandas DataFrames, separates X (features) and y (target), and preprocesses all numerical and categorical data found in these DataFrames using scikit-learn pipelines. It then bundles the preprocessor and data and returns a TMLT object; this class instance holds the dataframeloader and preprocessor instances.

The `create_train_valid` method uses `valid_size` to split X (features) and y (target) into X_train, X_valid, y_train and y_valid, so you can call fit methods on X_train and y_train and predict methods on X_valid or X_test.

Please check the detailed documentation and source code for more information.

*NOTE: To customize the data loading and preprocessing steps, use the `DataFrameLoader` and `PreProcessor` classes. Check the detailed documentation for these classes for more options.*



#### To get a clearer picture of model performance, let's do a quick cross-validation on our pipeline

##### Make sure to preprocess the data

In [13]:
X_np, X_test_np = tmlt.pp_fit_transform(tmlt.dfl.X, tmlt.dfl.X_test)
y_np = tmlt.dfl.y

In [14]:
# Now do cross_validation
scores = tmlt.do_cross_validation(X_np, y_np, xgb_model, scoring='neg_mean_absolute_error', cv=5)

print("scores:", scores)
print("Average MAE score:", scores.mean())

scores: [15733.51983893 16386.18366064 16648.82777718 14571.39875856
 17295.16245719]
Average MAE score: 16127.018498501711
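With `cv=5`, the 1460 rows are divided into five folds, and each fold serves once as the validation set while the other four are used for training. The sketch below generates contiguous, unshuffled fold indices and recomputes the average from the five scores printed above. It only illustrates the idea; how `do_cross_validation` shuffles or stratifies the folds internally is not shown in this tutorial, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_rows, n_splits):
    """Yield (train_idx, valid_idx) index lists for contiguous, unshuffled folds."""
    fold_sizes = [
        n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size


folds = list(kfold_indices(1460, 5))
print([len(valid_idx) for _, valid_idx in folds])

# Per-fold MAE values copied from the output above.
scores = [15733.51983893, 16386.18366064, 16648.82777718, 14571.39875856, 17295.16245719]
average_mae = sum(scores) / len(scores)
print(average_mae)
```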


*MAE came out slightly worse with cross-validation.*

*Let's see if we can improve our cross-validation score with hyperparameter tuning.*

**We are using Optuna-based hyperparameter search here!**

**TMLT has a built-in XGBoost Optuna optimization helper method!**

In [15]:
# Make sure to supply an output directory path so the hyperparameter search is saved
study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH, opt_timeout=60)
print(study.best_trial)

2021-12-03 23:27:21,635 INFO Optimization Direction is: minimize
[I 2021-12-03 23:27:21,705] Using an existing study with name 'tmlt_autoxgb' instead of creating a new one.
2021-12-03 23:27:21,889 INFO Training Started!


Parameters: { early_stopping_rounds, eval_set } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-03 23:27:44,450 INFO Training Ended!
2021-12-03 23:27:44,506 INFO mean_absolute_error: 15705.327790560788
2021-12-03 23:27:44,506 INFO mean_squared_error: 730519420.0747098
2021-12-03 23:27:44,507 INFO r2_score: 0.9047603191386815
[I 2021-12-03 23:27:44,574] Trial 21 finished with value: 730519420.0747098 and parameters: {'learning_rate': 0.05435272187248115, 'n_estimators': 20000, 'reg_lambda': 0.00370827941660158, 'reg_alpha': 0.45862721196673417, 'subsample': 0.4279676226224459, 'colsample_bytree': 0.13626169760005338, 'max_depth': 4, 'early_stopping_rounds': 291, 'tree_method': 'exact', 'booster': 'gbtree', 'gamma': 0.9923916778361647, 'grow_policy': 'depthwise'}. Best is trial 7 with value: 641154325.9759644.
2021-12-03 23:27:44,725 INFO Training Started!


2021-12-03 23:28:45,483 INFO Training Ended!
2021-12-03 23:28:45,574 INFO mean_absolute_error: 14964.357796446919
2021-12-03 23:28:45,574 INFO mean_squared_error: 642176994.0808992
2021-12-03 23:28:45,575 INFO r2_score: 0.9162777466388357
[I 2021-12-03 23:28:45,643] Trial 22 finished with value: 642176994.0808992 and parameters: {'learning_rate': 0.013095709693444705, 'n_estimators': 15000, 'reg_lambda': 15.939521164976025, 'reg_alpha': 6.500430184440403e-06, 'subsample': 0.6654972930243539, 'colsample_bytree': 0.49970267751426295, 'max_depth': 6, 'early_stopping_rounds': 453, 'tree_method': 'approx', 'booster': 'gbtree', 'gamma': 0.000806260966811638, 'grow_policy': 'depthwise'}. Best is trial 7 with value: 641154325.9759644.


FrozenTrial(number=7, values=[641154325.9759644], datetime_start=datetime.datetime(2021, 12, 3, 20, 38, 21, 482153), datetime_complete=datetime.datetime(2021, 12, 3, 20, 39, 30, 395838), params={'booster': 'gbtree', 'colsample_bytree': 0.3466657613679916, 'early_stopping_rounds': 409, 'gamma': 7.315596726371822e-06, 'grow_policy': 'depthwise', 'learning_rate': 0.05781643806086814, 'max_depth': 9, 'n_estimators': 20000, 'reg_alpha': 7.916067802441731e-07, 'reg_lambda': 0.14511799018426277, 'subsample': 0.3166053794978003, 'tree_method': 'approx'}, distributions={'booster': CategoricalDistribution(choices=('gbtree', 'gblinear')), 'colsample_bytree': UniformDistribution(high=1.0, low=0.1), 'early_stopping_rounds': IntUniformDistribution(high=500, low=100, step=1), 'gamma': LogUniformDistribution(high=1.0, low=1e-08), 'grow_policy': CategoricalDistribution(choices=('depthwise', 'lossguide')), 'learning_rate': LogUniformDistribution(high=0.25, low=0.01), 'max_depth': IntUniformDistribution(
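In the log above, each trial's value matches the `mean_squared_error` logged just before it (for example 642176994.0808992 for trial 22), so the study appears to minimize MSE. MSE is expressed in squared target units, which makes it hard to compare with the MAE figures earlier. Taking the square root gives RMSE in the same units as `SalePrice`. A short sketch using the two values from the log:

```python
import math

# Values copied from the Optuna log above.
best_mse = 641154325.9759644      # best trial so far (trial 7)
trial_22_mse = 642176994.0808992  # trial 22

# RMSE is the square root of MSE and is in the units of the target.
best_rmse = math.sqrt(best_mse)
trial_22_rmse = math.sqrt(trial_22_mse)

print(round(best_rmse, 2), round(trial_22_rmse, 2))
```

RMSE weights large errors more heavily than MAE, so it is normally the larger of the two for the same predictions.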

#### Let's update the model with our newly found best params

In [16]:
xgb_params.update(study.best_trial.params)
print("xgb_params", xgb_params)
xgb_model = XGBRegressor(**xgb_params)

xgb_params {'learning_rate': 0.05781643806086814, 'use_label_encoder': False, 'eval_metric': 'rmse', 'random_state': 42, 'booster': 'gbtree', 'colsample_bytree': 0.3466657613679916, 'early_stopping_rounds': 409, 'gamma': 7.315596726371822e-06, 'grow_policy': 'depthwise', 'max_depth': 9, 'n_estimators': 20000, 'reg_alpha': 7.916067802441731e-07, 'reg_lambda': 0.14511799018426277, 'subsample': 0.3166053794978003, 'tree_method': 'approx'}
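The merged dictionary now carries `early_stopping_rounds`, which the XGBoost warnings in this tutorial report as a parameter that "might not be used" when passed to the model. A hedged sketch of separating such keys before building the model is shown below. Which keys count as fit-time arguments depends on the installed xgboost version, so `FIT_TIME_KEYS` is an assumption, and both it and `split_params` are illustrative names rather than TMLT or xgboost APIs. The dictionaries are shortened copies of the values printed above.

```python
# Assumption: the xgboost version in use treats this as a fit() argument.
FIT_TIME_KEYS = {"early_stopping_rounds"}


def split_params(params, fit_time_keys=FIT_TIME_KEYS):
    """Split a params dict into (constructor params, fit-time params)."""
    model_params = {k: v for k, v in params.items() if k not in fit_time_keys}
    fit_params = {k: v for k, v in params.items() if k in fit_time_keys}
    return model_params, fit_params


base_params = {"learning_rate": 0.1, "random_state": 42}
best_params = {
    "learning_rate": 0.05781643806086814,
    "max_depth": 9,
    "early_stopping_rounds": 409,
}

# Same effect as base_params.update(best_params): later values win,
# so the tuned learning_rate replaces the initial guess of 0.1.
merged = {**base_params, **best_params}
model_params, fit_params = split_params(merged)
print(model_params)
print(fit_params)
```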


#### Now let's run 5-fold training on the updated XGB model with the best params found by the Optuna search

In [17]:
# k-fold training (kept commented out: this call fails because `test_preds_metric` is not an accepted argument)
# xgb_model_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(X_np, y_np, n_splits=5, model=xgb_model, test_preds_metric=mean_absolute_error)

TypeError: do_kfold_training() got an unexpected keyword argument 'test_preds_metric'

In [18]:
# k-fold training
xgb_model_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(X_np, y_np, X_test=X_test_np, n_splits=5, model=xgb_model)

2021-12-03 23:29:32,035 INFO  model class:<class 'xgboost.sklearn.XGBRegressor'>


Parameters: { early_stopping_rounds, verbose } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




2021-12-03 23:30:45,553 INFO Predicting Score!
2021-12-03 23:30:45,612 INFO fold: 1 mean_absolute_error : 19005.10041202911
2021-12-03 23:30:45,613 INFO fold: 1 mean_squared_error : 2411222555.537417
2021-12-03 23:30:45,614 INFO fold: 1 r2_score : 0.5838470238202538


2021-12-03 23:32:01,470 INFO Predicting Score!
2021-12-03 23:32:01,534 INFO fold: 2 mean_absolute_error : 15078.704302226028
2021-12-03 23:32:01,535 INFO fold: 2 mean_squared_error : 564346640.105124
2021-12-03 23:32:01,535 INFO fold: 2 r2_score : 0.8994232141593029


2021-12-03 23:33:18,328 INFO Predicting Score!
2021-12-03 23:33:18,390 INFO fold: 3 mean_absolute_error : 14835.53285530822
2021-12-03 23:33:18,391 INFO fold: 3 mean_squared_error : 564983892.6053946
2021-12-03 23:33:18,392 INFO fold: 3 r2_score : 0.925779088874223


2021-12-03 23:34:36,560 INFO Predicting Score!
2021-12-03 23:34:36,631 INFO fold: 4 mean_absolute_error : 14078.252006635274
2021-12-03 23:34:36,631 INFO fold: 4 mean_squared_error : 521916667.0363479
2021-12-03 23:34:36,632 INFO fold: 4 r2_score : 0.9064357109700444


2021-12-03 23:35:53,506 INFO Predicting Score!
2021-12-03 23:35:53,576 INFO fold: 5 mean_absolute_error : 15609.898143193494
2021-12-03 23:35:53,577 INFO fold: 5 mean_squared_error : 660176798.8185495
2021-12-03 23:35:53,578 INFO fold: 5 r2_score : 0.9037320768471553
2021-12-03 23:35:53,579 INFO  Mean Metrics Results from all Folds are: {'mean_absolute_error': 15721.497543878426, 'mean_squared_error': 944529310.8205667, 'r2_score': 0.8438434229341959}
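The final log line is the plain average of the per-fold metrics. Recomputing it from the values logged above also makes it easy to spot that fold 1 is an outlier: its R2 of about 0.58 pulls the mean down to about 0.84 even though the other four folds are near or above 0.90. A small sketch follows; the 0.8 threshold is an arbitrary illustrative cutoff.

```python
from statistics import mean

# Per-fold metrics copied from the log above.
fold_mae = [19005.10041202911, 15078.704302226028, 14835.53285530822,
            14078.252006635274, 15609.898143193494]
fold_r2 = [0.5838470238202538, 0.8994232141593029, 0.925779088874223,
           0.9064357109700444, 0.9037320768471553]

print("mean MAE:", mean(fold_mae))
print("mean R2:", mean(fold_r2))

# Flag folds whose R2 falls below an illustrative threshold (1-based fold numbers).
threshold = 0.8
weak_folds = [i + 1 for i, r2 in enumerate(fold_r2) if r2 < threshold]
print("folds below threshold:", weak_folds)
```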


In [19]:
# check the shape of the test-set predictions returned by k-fold training
if xgb_model_test_preds is not None:
    print(xgb_model_test_preds.shape)
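Once test predictions are available, a common next step is to write them to a CSV file with one row per test `Id`. The sketch below uses only the standard library and writes to a temporary directory. The column names follow the `idx_col` and `target` used earlier, while the ids and prices are placeholder values, not real model output; `write_submission` is a hypothetical helper, and the layout of `sample_submission.csv` is assumed rather than shown in this tutorial.

```python
import csv
import tempfile
from pathlib import Path


def write_submission(ids, preds, path, id_col="Id", target_col="SalePrice"):
    """Write one (id, prediction) row per test example to a CSV file."""
    if len(ids) != len(preds):
        raise ValueError("ids and preds must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([id_col, target_col])
        writer.writerows(zip(ids, preds))


# Placeholder values for illustration only.
ids = [1, 2, 3]
preds = [169000.0, 187500.0, 183600.0]

out_path = Path(tempfile.mkdtemp()) / "submission.csv"
write_submission(ids, preds, out_path)
print(out_path.read_text())
```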


##### You can improve the metric scores further by running the Optuna search for longer or by rerunning the study; check the documentation for more details