# AutoML with tpot and sklearn

## Let's use the diabetes dataset

In [16]:
from sklearn import datasets
import pandas as pd
diabetes = datasets.load_diabetes()
dia_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
dia_df.columns = ['age',
  'sex',
  'bmi',
  'bp',
  'tc',
  'ldl',
  'hdl',
  'tch',
  'ltg',
  'glu']
dia_df["target"] = diabetes.target

In [17]:
dia_df.head()

|   | age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target |
|---|-----|-----|-----|----|----|-----|-----|-----|-----|-----|--------|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |


In [18]:
diabetes.keys()

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])

In [19]:
diabetes['target'][0:5]

array([151.,  75., 141., 206., 135.])

## train test split

In [20]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(dia_df, train_size=0.8, test_size=0.2, random_state=1231)

In [21]:
len(df_train.index)

353

In [22]:
len(df_test.index)

89
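
The two sizes add up to the 442 rows of the dataset. As a rough check, and assuming scikit-learn rounds the test share up and the train share down when float sizes are given (my understanding of `train_test_split`, not something shown in this notebook), the arithmetic can be reproduced with the standard library alone:

```python
import math

n_samples = 442      # 353 train rows + 89 test rows reported above
train_size = 0.8
test_size = 0.2

# Assumption: the test count is rounded up and the train count is rounded down
n_test = math.ceil(test_size * n_samples)
n_train = math.floor(train_size * n_samples)

print(n_train, n_test)
```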

## AutoML workflow with tpot

In [24]:
from tpot import TPOTRegressor

# TPOT model search; the default scoring for TPOTRegressor is 'neg_mean_squared_error'
# use_dask=True is deliberately not set; see the technical tips below
tpot_model = TPOTRegressor(generations=5, population_size=50, cv=5, verbosity=2, random_state=1, n_jobs=-1)
# perform the model search
tpot_model.fit(df_train.iloc[:,0:10], df_train['target'])
# export the best model pipeline
tpot_model.export('automl_best_model_pipeline_by_tpot.py')

                                                                                                                               
Generation 1 - Current best internal CV score: -3146.3151389530494
Generation 2 - Current best internal CV score: -3130.063760139031
Generation 3 - Current best internal CV score: -3129.3452341689585
Generation 4 - Current best internal CV score: -3129.3452341689585
Generation 5 - Current best internal CV score: -3129.3452341689585

### Technical tips:
(1) `n_jobs=-1` uses as many cores as are available on the compute instance.

In this run, leaving `n_jobs` unset in `TPOTRegressor` resulted in the error message "A pipeline has not yet been optimized".

(2) `use_dask=True` could not be used here; setting it resulted in the same error as above.

## Results

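TPOT found that a lasso-based pipeline performs best: an `RBFSampler` feature transform followed by `LassoLarsCV`, as the fitted pipeline below shows. The test-set score is reported on the same negative mean squared error scale as the search.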

In [27]:
tpot_model.score(df_test.iloc[:,0:10], df_test['target'])

-2662.3504492777415
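
Because the search used the default `neg_mean_squared_error` scoring, the value above is a negated mean squared error. It is easier to read after flipping the sign and taking the square root, which gives an error in the units of the target. A small sketch using only numbers printed in this notebook (the helper name `neg_mse_to_rmse` is made up for illustration):

```python
import math

def neg_mse_to_rmse(neg_mse):
    """Convert a negated MSE score into an RMSE in target units."""
    return math.sqrt(-neg_mse)

cv_rmse = neg_mse_to_rmse(-3129.3452341689585)    # best internal CV score
test_rmse = neg_mse_to_rmse(-2662.3504492777415)  # score on the held-out test set

print(round(cv_rmse, 1), round(test_rmse, 1))
```

On this scale the cross-validation error and the test error land in the same range, roughly 52 to 56 target units.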

In [31]:
tpot_model.fitted_pipeline_

Pipeline(steps=[('rbfsampler',
                 RBFSampler(gamma=0.9500000000000001, random_state=1)),
                ('lassolarscv', LassoLarsCV(normalize=True))])

Calling `predict` on `fitted_pipeline_` gives the same result as calling it on the TPOT object (`tpot_model`) directly:

In [32]:
tpot_model.fitted_pipeline_.predict(df_test.iloc[0:5,0:10])

array([170.55528616, 210.79028115, 188.82781319,  90.71896874,
       126.41051848])

In [29]:
tpot_model.predict(df_test.iloc[0:5,0:10])

array([170.55528616, 210.79028115, 188.82781319,  90.71896874,
       126.41051848])

## Check out the best model pipeline by tpot

In [25]:
!cat automl_best_model_pipeline_by_tpot.py

import numpy as np
import pandas as pd
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -3129.3452341689585
exported_pipeline = make_pipeline(
    RBFSampler(gamma=0.9500000000000001),
    LassoLarsCV(normalize=True)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

## Save the best model

In [33]:
# We can only pickle fitted_pipeline_, not the TPOT object itself
import pickle
with open("diabetes_tpot_automl.model", "wb") as model_file:
    pickle.dump(tpot_model.fitted_pipeline_, model_file)
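
Loading the model back is the mirror image of the save step. The sketch below round-trips a plain dictionary as a stand-in for `tpot_model.fitted_pipeline_` (an assumption, made so the snippet runs without a fitted model) and writes to a temporary directory rather than the notebook's working directory:

```python
import os
import pickle
import tempfile

# Stand-in for tpot_model.fitted_pipeline_; any picklable object behaves the same way
stand_in_pipeline = {"steps": ["rbfsampler", "lassolarscv"], "random_state": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "diabetes_tpot_automl.model")

    # Save: open the file in binary write mode
    with open(model_path, "wb") as model_file:
        pickle.dump(stand_in_pipeline, model_file)

    # Load: open the file in binary read mode
    with open(model_path, "rb") as model_file:
        restored_pipeline = pickle.load(model_file)

print(restored_pipeline == stand_in_pipeline)
```

With the real file, the restored object would be the scikit-learn `Pipeline`, so `predict` can be called on it directly. Only unpickle files from a source you trust.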

## Use r2 to check the performance

In [35]:
from sklearn.metrics import r2_score
y_test_predicted=tpot_model.predict(df_test.iloc[:,0:10])
r2_score(df_test["target"],y_test_predicted)

0.5735120348720234
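
For reference, r2 compares the squared prediction error with the squared error of always predicting the mean: 1 - SS_res / SS_tot. A pure-Python sketch of that definition follows; the function name and the toy values are hypothetical and are not taken from the diabetes data:

```python
def r2_from_scratch(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, for plain Python sequences."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values, not taken from the diabetes data
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

toy_r2 = r2_from_scratch(y_true, y_pred)
print(toy_r2)
```

A score of 1 means perfect predictions and 0 means no better than predicting the mean, which puts the value above in context.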

Recall that the earlier hyperopt run reached an r2 score of 0.52, so this AutoML result (about 0.57) is slightly better.

## Use mean absolute error to score the predictions

In [36]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(df_test["target"],y_test_predicted)

42.193522111272294
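
Mean absolute error is the average absolute gap between prediction and target, so unlike a squared-error score it is expressed directly in target units. A pure-Python sketch of the definition, again on hypothetical toy values with a made-up function name:

```python
def mae_from_scratch(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy values, not taken from the diabetes data
toy_true = [100.0, 150.0, 200.0, 250.0]
toy_pred = [110.0, 140.0, 215.0, 245.0]

toy_mae = mae_from_scratch(toy_true, toy_pred)
print(toy_mae)
```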

This matches the conclusion from r2: the earlier hyperopt run had a mean absolute error of 44, so this AutoML result (about 42.2) is again slightly better.