# Getting Started Tutorial with TMLT (Tabular ML Toolkit)

> A tutorial on getting started with TMLT (Tabular ML Toolkit)

> tabular_ml_toolkit is a helper library to jumpstart your machine learning project based on tabular or structured data.

> It comes with model and data parallelism and cutting-edge hyperparameter search techniques.

> Under the hood, TMLT uses modin, optuna, xgboost and scikit-learn pipelines.

## Install

`pip install -U tabular_ml_toolkit`

## How to Best Use tabular_ml_toolkit

Start with your favorite model, then create a `TMLT` instance with a single API call.

*For example, here we use XGBRegressor on the [Ames, Iowa home sale price data](https://www.kaggle.com/estrotococo/home-data-for-ml-course).*

In [1]:
from tabular_ml_toolkit.tmlt import *
from sklearn.metrics import mean_absolute_error
import numpy as np
from xgboost import XGBRegressor



In [2]:
# Dataset file names and Paths
DIRECTORY_PATH = "input/home_data/"
TRAIN_FILE = "train.csv"
TEST_FILE = "test.csv"
SAMPLE_SUB_FILE = "sample_submission.csv"
OUTPUT_PATH = "output/"

#### Just point tmlt at your data, tell it which columns are the index (`idx_col`) and the target, and specify the type of problem you are trying to solve

In [3]:
# tmlt
tmlt = TMLT().prepare_data(
    train_file_path= DIRECTORY_PATH+TRAIN_FILE,
    test_file_path= DIRECTORY_PATH+TEST_FILE,
    idx_col="Id", target="SalePrice",
    random_state=42,
    problem_type="regression")

# TMLT currently supports only the following problem_type values:

# "binary_classification"
# "multi_label_classification"
# "multi_class_classification"
# "regression"

2021-12-03 01:01:39,544 INFO 8 cores found, model and data parallel processing should worked!
2021-12-03 01:01:39,583 INFO DataFrame Memory usage decreased to 0.58 Mb (35.5% reduction)
2021-12-03 01:01:39,620 INFO DataFrame Memory usage decreased to 0.58 Mb (34.8% reduction)
2021-12-03 01:01:39,644 INFO Both Numerical & Categorical columns found, Preprocessing will done accordingly!


In [4]:
print(type(tmlt.dfl.X))
print(tmlt.dfl.X.shape)
print(type(tmlt.dfl.y))
print(tmlt.dfl.y.shape)
print(type(tmlt.dfl.X_test))
print(tmlt.dfl.X_test.shape)

<class 'pandas.core.frame.DataFrame'>
(1460, 79)
<class 'numpy.ndarray'>
(1460,)
<class 'pandas.core.frame.DataFrame'>
(1459, 79)


In [5]:
tmlt.dfl.X

| Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | 0 | ... | TA | Y | | | | WD | Normal | CollgCr | VinylSd | VinylSd |
| 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | 0 | ... | TA | Y | | | | WD | Normal | Veenker | MetalSd | MetalSd |
| 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | 0 | ... | TA | Y | | | | WD | Normal | CollgCr | VinylSd | VinylSd |
| 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | 0 | ... | TA | Y | | | | WD | Abnorml | Crawfor | Wd Sdng | Wd Shng |
| 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | 0 | ... | TA | Y | | | | WD | Normal | NoRidge | VinylSd | VinylSd |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1456 | 60 | 62.0 | 7917 | 6 | 5 | 1999 | 2000 | 0.0 | 0 | 0 | ... | TA | Y | | | | WD | Normal | Gilbert | VinylSd | VinylSd |
| 1457 | 20 | 85.0 | 13175 | 6 | 6 | 1978 | 1988 | 119.0 | 790 | 163 | ... | TA | Y | | MnPrv | | WD | Normal | NWAmes | Plywood | Plywood |
| 1458 | 70 | 66.0 | 9042 | 7 | 9 | 1941 | 2006 | 0.0 | 275 | 0 | ... | TA | Y | | GdPrv | Shed | WD | Normal | Crawfor | CemntBd | CmentBd |
| 1459 | 20 | 68.0 | 9717 | 5 | 6 | 1950 | 1996 | 0.0 | 49 | 1029 | ... | TA | Y | | | | WD | Normal | NAmes | MetalSd | MetalSd |


#### Create train and validation DataFrames for quick preprocessing and training

In [6]:
%%time
# create a train/valid split to evaluate the model on the validation set
X_train, X_valid, y_train, y_valid = tmlt.dfl.create_train_valid(valid_size=0.2)

CPU times: user 5.56 ms, sys: 1.56 ms, total: 7.12 ms
Wall time: 5.91 ms


In [7]:
print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)

(1168, 79)
(1168,)
(292, 79)
(292,)


In [8]:
X_train.columns.to_list()

['MSSubClass',
 'LotFrontage',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal',
 'MoSold',
 'YrSold',
 'MSZoning',
 'Street',
 'Alley',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'MasVnrType',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'Electrical',
 'KitchenQual',
 'Functional',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'Ga

##### Preprocess X_train and X_valid

In [9]:
X_train_np,  X_valid_np = tmlt.pp_fit_transform(X_train, X_valid)
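
The call above fits the preprocessing on the training split and reuses it on the validation split. As a rough sketch of that fit-then-transform pattern, the block below uses plain scikit-learn on two tiny made-up frames. The choice of imputers, `StandardScaler` and `OneHotEncoder` is an assumption for illustration, not a description of TMLT's internal preprocessor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frames; only the column names are borrowed from the dataset.
train = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "Street": ["Pave", "Grvl", "Pave", np.nan],
})
valid = pd.DataFrame({
    "LotArea": [14260.0, 7917.0],
    "Street": ["Pave", "Dirt"],  # "Dirt" never appears in the training frame
})

num_cols = train.select_dtypes(include="number").columns.to_list()
cat_cols = train.select_dtypes(exclude="number").columns.to_list()

# Assumed preprocessing steps: impute + scale numbers, impute + one-hot categories.
preprocessor = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), num_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ],
    sparse_threshold=0.0,  # always return a dense ndarray
)

# Statistics (median, mean, categories) come from the training frame only.
train_np = preprocessor.fit_transform(train)
valid_np = preprocessor.transform(valid)
print(train_np.shape, valid_np.shape)
```

One-hot encoding is why a preprocessed array can have more columns than the input DataFrame: each category seen during fitting becomes its own column, and a category that only appears at transform time is encoded as all zeros.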

In [10]:
print(type(X_train_np))
print(X_train_np.shape)
print(X_train_np)
print(type(X_valid_np))
print(X_valid_np.shape)
print(X_valid_np)
print(type(y_valid))
print(type(y_train))

<class 'numpy.ndarray'>
(1168, 302)
[[-0.8667643   0.35953495 -0.21289571 ...  0.          0.
   0.        ]
 [ 0.07410996  0.04874271 -0.26524463 ...  1.          0.
   0.        ]
 [-0.63154574  0.27477343 -0.17784146 ...  0.          1.
   0.        ]
 ...
 [-0.8667643   0.07699655 -0.23409563 ...  0.          0.
   0.        ]
 [-0.16110861 -0.06427265 -0.28337613 ...  0.          1.
   0.        ]
 [ 1.48542135 -0.12078033 -0.65139925 ...  1.          0.
   0.        ]]
<class 'numpy.ndarray'>
(292, 302)
[[-0.8667643   0.35953495 -0.21159396 ...  0.          0.
   0.        ]
 [ 0.07410996  1.15064245  0.14564323 ...  0.          0.
   0.        ]
 [-0.63154574 -0.03601881 -0.16082574 ...  0.          0.
   1.        ]
 ...
 [ 0.07410996  0.16175807 -0.23158511 ...  0.          0.
   1.        ]
 [ 0.30932853  0.07699655 -0.14929596 ...  0.          0.
   0.        ]
 [-0.8667643   0.35953495 -0.2389307  ...  0.          0.
   0.        ]]
<class 'numpy.ndarray'>
<class 'numpy.nda

##### Create a base XGBoost regressor model with your best-guess params

In [11]:
xgb_params = {
    'learning_rate':0.1,
    'use_label_encoder':False,
    'eval_metric':'rmse',
    'random_state':42,
    # for GPU training, uncomment the two lines below
    # 'tree_method': 'gpu_hist',
    # 'predictor': 'gpu_predictor',
}
# create xgb ml model
xgb_model = XGBRegressor(**xgb_params)

In [12]:
# Now do model training
xgb_model.fit(X_train_np, y_train,
              verbose=True,
              # track train and validation error to detect overfitting
              eval_set=[(X_train_np, y_train), (X_valid_np, y_valid)],
              eval_metric="mae",
              early_stopping_rounds=300
             )

# predict on the validation set
preds = xgb_model.predict(X_valid_np)
print('X_valid MAE:', mean_absolute_error(y_valid, preds))

[0]	validation_0-mae:163571.12500	validation_1-mae:161393.54688
[1]	validation_0-mae:147516.60938	validation_1-mae:145594.76562
[2]	validation_0-mae:133031.25000	validation_1-mae:131271.89062
[3]	validation_0-mae:120025.90625	validation_1-mae:118728.56250
[4]	validation_0-mae:108270.36719	validation_1-mae:107347.47656
[5]	validation_0-mae:97682.67188	validation_1-mae:97039.68750
[6]	validation_0-mae:88152.20312	validation_1-mae:87739.09375
[7]	validation_0-mae:79561.57812	validation_1-mae:79410.71875
[8]	validation_0-mae:71819.76562	validation_1-mae:71818.29688
[9]	validation_0-mae:64848.34375	validation_1-mae:65190.00000
[10]	validation_0-mae:58572.50781	validation_1-mae:59032.60938
[11]	validation_0-mae:52900.61719	validation_1-mae:53628.83984
[12]	validation_0-mae:47809.79688	validation_1-mae:48795.12891
[13]	validation_0-mae:43199.58203	validation_1-mae:44768.75391
[14]	validation_0-mae:39026.89453	validation_1-mae:40993.75391
[15]	validation_0-mae:35295.15234	validation_1-mae:3774
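
The `early_stopping_rounds` argument in the training call above asks XGBoost to stop boosting once the metric on the last entry of `eval_set` (here the validation split) has not improved for that many rounds. The following is a minimal plain-Python sketch of that patience rule, assuming a metric where lower is better. The function name `best_iteration` and the score list are hypothetical, and this is not XGBoost's own implementation.

```python
def best_iteration(scores, patience):
    """Return (stop_round, best_round) for a metric where lower is better."""
    best_round = 0
    best_score = float("inf")
    for rnd, score in enumerate(scores):
        if score < best_score:
            # New best score: remember the round and reset the patience window.
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return rnd, best_round
    # Ran out of rounds before the patience window was exhausted.
    return len(scores) - 1, best_round


# Made-up validation scores, one per boosting round.
valid_mae = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.4]
stop_round, best_round = best_iteration(valid_mae, patience=3)
print(stop_round, best_round)
```

With a patience of 300, as in the call above, training can only stop early if the model is allowed to run for more than 300 boosting rounds.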

In the background, the `prepare_data` method loads your input data into a pandas DataFrame and separates X (features) from y (target).

The `prepare_data` method prepares X and y, then preprocesses all numerical and categorical data found in them using scikit-learn pipelines. It then bundles the preprocessed data with your given model and returns an object (the `tmlt` instance in the example above) that holds the dataframeloader, preprocessor and scikit-learn pipeline instances.
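
To make the loading step concrete, here is a small pandas-only sketch of reading a CSV, using an id column as the index, and separating the target from the features. The inline CSV rows are made up, and the code only mirrors the `idx_col` and `target` arguments shown earlier; how TMLT implements this internally is not shown here.

```python
import io

import pandas as pd

# A few made-up rows standing in for train.csv.
csv_text = """Id,LotArea,Street,SalePrice
1,8000,Pave,200000
2,9500,Pave,180000
3,11000,Grvl,220000
"""

# idx_col="Id" in the tutorial corresponds to index_col here.
df = pd.read_csv(io.StringIO(csv_text), index_col="Id")

# target="SalePrice": pop the target column out, leaving only features in df.
y = df.pop("SalePrice").to_numpy()
X = df

print(type(X), X.shape)
print(type(y), y.shape)
```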

The `create_train_valid` method uses `valid_size` to split X (features) and y (target) into X_train, y_train, X_valid and y_valid, so you can call fit methods on X_train and y_train and predict methods on X_valid or X_test.
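
For comparison, a hold-out split of the same size can be sketched with scikit-learn's `train_test_split`. Whether `create_train_valid` uses this function internally is an assumption; the arrays below are placeholders with the same number of rows as the training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 1460 rows, matching the row count of the training data.
X = np.arange(1460 * 2).reshape(1460, 2)
y = np.arange(1460)

# test_size plays the role of valid_size=0.2 in the tutorial.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
```

A 20% hold-out of 1460 rows gives 292 validation rows and 1168 training rows, which matches the shapes printed earlier in this tutorial.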


Please check the detailed documentation and source code for more information.

*NOTE: If you want to customize the data and preprocessing steps, you can do so by using the `DataFrameLoader` and `PreProcessor` classes. Check the detailed documentation for these classes for more options.*



#### To get a clearer picture of model performance, let's do a quick cross-validation on our pipeline

##### PreProcess the data

In [13]:
X_np, X_test_np = tmlt.pp_fit_transform(tmlt.dfl.X, tmlt.dfl.X_test)
y = tmlt.dfl.y

In [15]:
# Now do cross_validation
scores = tmlt.do_cross_validation(X_np, y, xgb_model, scoring='neg_mean_absolute_error', cv=5)

print("scores:", scores)
print("Average MAE score:", scores.mean())

scores: [15733.51983893 16386.18366064 16648.82777718 14571.39875856
 17295.16245719]
Average MAE score: 16127.018498501711
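
Note that the scoring string is `neg_mean_absolute_error` while the scores printed above are positive. In scikit-learn, that scorer returns the negated MAE so that higher is always better, and the sign has to be flipped to read it as an error. The sketch below shows this with `cross_val_score` on synthetic data and a `Ridge` model as a stand-in. That `do_cross_validation` performs the same sign flip is an assumption based on the output above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data, purely for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# scikit-learn returns negated MAE values for this scorer.
neg_scores = cross_val_score(
    Ridge(alpha=1.0), X, y, scoring="neg_mean_absolute_error", cv=5
)

# Flip the sign to get one ordinary MAE per fold.
mae_scores = -neg_scores
print("scores:", mae_scores)
print("Average MAE score:", mae_scores.mean())
```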


*MAE came out slightly worse with cross-validation.*

*Let's see if we can improve our cross-validation score with hyperparameter tuning.*

**We are using an Optuna-based hyperparameter search here. Make sure to supply a new directory path so the search is saved.**

In [None]:
study = tmlt.do_xgb_optuna_optimization(optuna_db_path=OUTPUT_PATH, opt_timeout=60)
print(study.best_trial)
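
To show the general shape of a hyperparameter search, the sketch below swaps in plain random search written with the standard library only. It is not Optuna and not TMLT's search: the objective function, the parameter ranges and the helper name `random_search` are all hypothetical stand-ins. It does show the two ideas that matter for the next step: each trial proposes a parameter dict, and the best dict is merged into the base parameters.

```python
import random


def objective(params):
    # Hypothetical stand-in for a validation error; lower is better.
    return (params["learning_rate"] - 0.05) ** 2 + 1e-3 * (params["max_depth"] - 6) ** 2


def random_search(n_trials, seed=42):
    """Try n_trials random parameter sets and keep the best one."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "max_depth": rng.randint(2, 10),  # inclusive on both ends
        }
        value = objective(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value


best_params, best_value = random_search(n_trials=200)

# Merge the best trial into the base parameters, overriding any shared keys.
base_params = {"learning_rate": 0.1, "random_state": 42}
base_params.update(best_params)
print(base_params, best_value)
```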

#### Let's use our newly found best params to update the model in the sklearn pipeline

In [None]:
xgb_params.update(study.best_trial.params)
print("xgb_params", xgb_params)
xgb_model = XGBRegressor(**xgb_params)
tmlt.update_model(xgb_model)
tmlt.spl

#### Now, let's run 5-fold training on this updated XGB model with the best params found by the Optuna search

In [None]:
# k-fold training
xgb_model_metrics_score, xgb_model_test_preds = tmlt.do_kfold_training(n_splits=5, test_preds_metric=mean_absolute_error)
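
As a rough picture of what K-fold training with test predictions involves, the sketch below trains one model per fold, scores it on the held-out fold, and averages the per-fold predictions on a test set. It uses scikit-learn's `KFold` with a `Ridge` model on synthetic data. Averaging the test predictions across folds is an assumption about the general technique, not a confirmed description of `do_kfold_training`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic train and test data, purely for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_test = rng.normal(size=(20, 3))

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
test_preds = np.zeros(len(X_test))

for train_idx, valid_idx in kfold.split(X):
    # Fit a fresh model on this fold's training rows.
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    # Score it on the rows held out for this fold.
    fold_scores.append(mean_absolute_error(y[valid_idx], model.predict(X[valid_idx])))
    # Accumulate an equal-weight average of the test predictions.
    test_preds += model.predict(X_test) / kfold.get_n_splits()

print("mean fold MAE:", np.mean(fold_scores))
print(test_preds.shape)
```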

In [None]:
# predict on test dataset
if xgb_model_test_preds is not None:
    print(xgb_model_test_preds.shape)


##### You can improve the metric score further by running the Optuna search for longer or by rerunning the study. Check the documentation for more details.