# AutoML with hyperopt-sklearn

## Use the diabetes dataset

In [1]:
from sklearn import datasets
import pandas as pd
diabetes = datasets.load_diabetes()
dia_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
dia_df.columns = ['age',
  'sex',
  'bmi',
  'bp',
  'tc',
  'ldl',
  'hdl',
  'tch',
  'ltg',
  'glu']
dia_df["target"] = diabetes.target

## Train-test split

In [2]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(dia_df, train_size=0.8, test_size=0.2, random_state=1231)

## hyperopt-sklearn AutoML workflow

In [3]:
from hpsklearn import HyperoptEstimator
from hpsklearn import any_regressor
from hpsklearn import any_preprocessing
from hyperopt import tpe

# Define the search: any regressor, any preprocessing step,
# with candidates suggested by the TPE algorithm
model = HyperoptEstimator(
    regressor=any_regressor('rgr'),
    preprocessing=any_preprocessing('pre'),
    algo=tpe.suggest,
    max_evals=50,
    trial_timeout=30,
)
# Run the search on the ten feature columns
model.fit(df_train.iloc[:, 0:10], df_train['target'])

100%|███████████| 1/1 [00:01<00:00,  1.19s/trial, best loss: 0.6152122929564217]
100%|███████████| 2/2 [00:01<00:00,  1.19s/trial, best loss: 0.6152122929564217]
100%|███████████| 3/3 [00:01<00:00,  1.07s/trial, best loss: 0.6152122929564217]
100%|███████████| 4/4 [00:01<00:00,  1.09s/trial, best loss: 0.6152122929564217]
100%|███████████| 5/5 [00:01<00:00,  1.10s/trial, best loss: 0.6152122929564217]
100%|███████████| 6/6 [00:01<00:00,  1.32s/trial, best loss: 0.5690118956031025]
100%|███████████| 7/7 [00:01<00:00,  1.08s/trial, best loss: 0.5690118956031025]
100%|███████████| 8/8 [00:01<00:00,  1.08s/trial, best loss: 0.5690118956031025]
100%|███████████| 9/9 [00:01<00:00,  1.11s/trial, best loss: 0.5690118956031025]
100%|█████████| 10/10 [00:01<00:00,  1.09s/trial, best loss: 0.5690118956031025]
100%|█████████| 11/11 [00:01<00:00,  1.08s/trial, best loss: 0.5690118956031025]
100%|█████████| 12/12 [00:01<00:00,  1.10s/trial, best loss: 0.5690118956031025]
100%|█████████| 13/13 [00:01



100%|█████████| 18/18 [00:01<00:00,  1.09s/trial, best loss: 0.5690118956031025]
100%|█████████| 19/19 [00:01<00:00,  1.08s/trial, best loss: 0.5690118956031025]
100%|█████████| 20/20 [00:01<00:00,  1.11s/trial, best loss: 0.5690118956031025]
 95%|█████████████████████████████████▎ | 20/21 [00:00<?, ?trial/s, best loss=?]



100%|█████████| 21/21 [00:01<00:00,  1.72s/trial, best loss: 0.5690118956031025]
100%|█████████| 22/22 [00:01<00:00,  1.69s/trial, best loss: 0.5690118956031025]
100%|█████████| 23/23 [00:01<00:00,  1.72s/trial, best loss: 0.5690118956031025]
100%|█████████| 24/24 [00:01<00:00,  1.31s/trial, best loss: 0.5690118956031025]
100%|█████████| 25/25 [00:01<00:00,  1.29s/trial, best loss: 0.5690118956031025]
100%|█████████| 26/26 [00:01<00:00,  1.10s/trial, best loss: 0.5690118956031025]
100%|█████████| 27/27 [00:01<00:00,  1.43s/trial, best loss: 0.5690118956031025]
100%|█████████| 28/28 [00:01<00:00,  1.18s/trial, best loss: 0.5635741064360515]
100%|█████████| 29/29 [00:01<00:00,  1.16s/trial, best loss: 0.5635741064360515]
100%|█████████| 30/30 [00:01<00:00,  1.17s/trial, best loss: 0.5635741064360515]
100%|█████████| 31/31 [00:01<00:00,  1.13s/trial, best loss: 0.5635741064360515]
100%|█████████| 32/32 [00:01<00:00,  1.13s/trial, best loss: 0.5635741064360515]
100%|█████████| 33/33 [00:01



100%|█████████| 47/47 [00:01<00:00,  1.23s/trial, best loss: 0.5635741064360515]
100%|█████████| 48/48 [00:01<00:00,  1.13s/trial, best loss: 0.5635741064360515]
100%|█████████| 49/49 [00:01<00:00,  1.10s/trial, best loss: 0.5635741064360515]
100%|█████████| 50/50 [00:01<00:00,  1.15s/trial, best loss: 0.5635741064360515]


## Performance on the test data

In [4]:
model.score(df_test.iloc[:,0:10], df_test['target'])



0.5003099931636081

In [5]:
model.predict(df_test.iloc[1:3,0:10])



array([235.54794521, 214.87037037])

## Check the best model

In [6]:
print(model.best_model())

{'learner': AdaBoostRegressor(learning_rate=0.23890080410470052, loss='exponential',
                  n_estimators=124, random_state=np.int64(4)), 'preprocs': (PCA(n_components=2),), 'ex_preprocs': ()}


In this run, the best pipeline found by the hyperopt-sklearn AutoML approach
is an AdaBoostRegressor preceded by a PCA step with two components.

## Save the model

In [7]:
import pickle
# Serialize the fitted estimator; the context manager closes the file even on error
with open("diabetes_hyperopt_automl.model", "wb") as filehandler:
    pickle.dump(model, filehandler)
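
To reuse the saved estimator later, it can be read back with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the fitted `model`, so that it runs without hyperopt-sklearn installed. The helper names `save_object` and `load_object` and the file name `roundtrip_demo.model` are illustrative assumptions, not part of the notebook.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path):
    """Read back an object previously written by save_object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for the fitted estimator (assumption: any picklable
# object goes through the same save/load steps).
stand_in_model = {"learner": "AdaBoostRegressor", "n_estimators": 124}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "roundtrip_demo.model")
    save_object(stand_in_model, demo_path)
    restored = load_object(demo_path)

print(restored == stand_in_model)
```

With the real estimator, the restored object should expose the same `predict` and `score` methods, provided the loading environment has compatible versions of the libraries used for saving. Unpickling can execute arbitrary code, so only load files from a trusted source.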

## Use R2 to score the predictions

In [8]:
from sklearn.metrics import r2_score
y_test_predicted_hyperopt = model.predict(df_test.iloc[:, 0:10])
r2_score(df_test["target"], y_test_predicted_hyperopt)



0.5003099931636081
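
This value is identical to the one returned by `model.score` on the test data, which suggests that `score` reports the same metric here. R2 is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. As a minimal sketch of that formula, the plain-Python function below (the name `r2_manual` and the toy numbers are assumptions for illustration, not the diabetes data) computes it directly:

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared residuals between truth and prediction
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the true values
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

r2_demo = r2_manual(y_true_demo, y_pred_demo)
print(round(r2_demo, 4))  # 0.9486
```

A perfect prediction gives 1.0, and always predicting the mean of the true values gives 0.0, so the test score of about 0.50 sits halfway between those two reference points.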

## Use mean absolute error to score the predictions

In [10]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(df_test["target"],y_test_predicted_hyperopt)

47.030762009844864
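
The mean absolute error is the average of the absolute differences between true and predicted values, so it is expressed in the same units as the target. A minimal plain-Python sketch of the calculation follows; the function name `mae_manual` and the toy numbers are assumptions for illustration, not the diabetes data.

```python
def mae_manual(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

mae_demo = mae_manual(y_true_demo, y_pred_demo)
print(mae_demo)  # 0.5
```

Read this way, the result above means the predictions on the test set are off by about 47 target units on average.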