[View in Colaboratory](https://colab.research.google.com/github/gomerudo/auto-ml/blob/master/python/notebooks/benchmarking/TPOT.ipynb)

# TPOT benchmarking

## Installing the packages

In [0]:
# Install the packages TPOT depends on
!pip install numpy scipy scikit-learn pandas deap update_checker tqdm stopit

# Install XGBoost so that TPOT can use eXtreme Gradient Boosting models. XGBoost is entirely optional
!pip install xgboost

# Install TPOT itself
!pip install tpot

# Install OpenML's Python client and liac-arff from their GitHub branches
!pip install git+https://github.com/renatopp/liac-arff@master
!pip install git+https://github.com/openml/openml-python.git@develop
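
Because the OpenML packages above come from development branches, it can help to record which versions actually ended up installed before benchmarking. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are assumed to match the names used in the `pip install` commands above.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Distribution names are assumed to match the pip commands above
for name in ["tpot", "xgboost", "scikit-learn", "openml", "liac-arff"]:
    print(name, installed_version(name) or "not installed")
```

Keeping this output alongside the results makes it easier to tell later whether a score difference comes from the data or from a library upgrade.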


## Running the model on the benchmark datasets

In [0]:
import openml as oml
from openml import tasks, runs, datasets
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier
import time

################################################################################
########### Function to run a model and return the results as a list ###########
################################################################################
def runModel(dataset, metric = "accuracy", generations = 5, 
             sparse = False, population_size = 20) :
  
  # Get the features and the target
  X, y = dataset.get_data(target = dataset.default_target_attribute)
  # Obtain the train and validation sets with a 3/4 split
  X_train, X_val, y_train, y_val = train_test_split(X, y, 
                                                  train_size = 0.75, 
                                                  test_size = 0.25) 
  
  print("""
================================================================================
RUNNING TPOT CLASSIFIER FOR DATASET {}
================================================================================
  """.format(dataset.dataset_id))
  # Select TPOT's built-in sparse configuration when requested (None = default)
  tpot = TPOTClassifier(generations = generations,
                        population_size = population_size, verbosity = 2,
                        scoring = metric, n_jobs = -1,
                        config_dict = 'TPOT sparse' if sparse else None)
  
  start_time = time.time()
  tpot.fit(X_train, y_train)
  end_time = time.time()

  
  return [ dataset.dataset_id, end_time - start_time, 
          tpot.score(X_val, y_val), tpot.fitted_pipeline_ ]
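
The cost of each run is governed by `generations` and `population_size`. TPOT's documentation gives the total number of evaluated pipelines as `population_size + generations * offspring_size`, where `offspring_size` defaults to `population_size`. A small sketch of that arithmetic; the helper name `total_pipelines` is hypothetical and not part of TPOT.

```python
def total_pipelines(generations, population_size, offspring_size=None):
    """Number of pipelines TPOT evaluates, per the formula in its documentation."""
    if offspring_size is None:
        offspring_size = population_size  # TPOT's documented default
    return population_size + generations * offspring_size

# Values used by runModel throughout this notebook
print(total_pipelines(generations=5, population_size=20))
```

With the settings used here the result is 120, which matches the total shown in the `Optimization Progress` bars of the runs in this notebook.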


### Part 1

In [0]:
benchmarks_metrics = { 38 : "roc_auc", 46 : "accuracy", 
                      179 : "roc_auc", 184 : "accuracy" } 

In [0]:
################################################################################
##################################### MAIN #####################################
################################################################################
results = []

for datasetId, metric in benchmarks_metrics.items():
  dataset = oml.datasets.get_dataset(datasetId) 
  results.append(runModel(dataset, metric, generations = 5))



RUNNING TPOT CLASSIFIER FOR DATASET 38
  
Imputing missing values in feature set


Optimization Progress:  33%|███▎      | 40/120 [00:36<01:12,  1.10pipeline/s]

Generation 1 - Current best internal CV score: 0.9925559218902873


Optimization Progress:  50%|█████     | 60/120 [01:10<01:36,  1.62s/pipeline]

Generation 2 - Current best internal CV score: 0.9925559218902873


Optimization Progress:  67%|██████▋   | 80/120 [01:37<00:47,  1.18s/pipeline]

Generation 3 - Current best internal CV score: 0.994292884617962


Optimization Progress:  83%|████████▎ | 100/120 [02:02<00:19,  1.04pipeline/s]

Generation 4 - Current best internal CV score: 0.994740862852318




Generation 5 - Current best internal CV score: 0.995615710398992

Best pipeline: XGBClassifier(input_matrix, learning_rate=0.1, max_depth=7, min_child_weight=2, n_estimators=100, nthread=1, subsample=0.9000000000000001)
Imputing missing values in feature set

RUNNING TPOT CLASSIFIER FOR DATASET 46
  


Optimization Progress:  33%|███▎      | 40/120 [01:04<01:56,  1.46s/pipeline]

Generation 1 - Current best internal CV score: 0.9473221757322176


Optimization Progress:  50%|█████     | 60/120 [01:46<02:10,  2.17s/pipeline]

Generation 2 - Current best internal CV score: 0.9519264295676428


Optimization Progress:  67%|██████▋   | 80/120 [02:49<02:24,  3.61s/pipeline]

Generation 3 - Current best internal CV score: 0.9527597629009762


Optimization Progress:  84%|████████▍ | 101/120 [10:08<07:26, 23.52s/pipeline]

Generation 4 - Current best internal CV score: 0.9556834030683403




Generation 5 - Current best internal CV score: 0.9561122733612273

Best pipeline: ExtraTreesClassifier(XGBClassifier(Normalizer(input_matrix, norm=l2), learning_rate=0.5, max_depth=1, min_child_weight=4, n_estimators=100, nthread=1, subsample=0.7500000000000001), bootstrap=False, criterion=gini, max_features=0.8, min_samples_leaf=2, min_samples_split=19, n_estimators=100)





RUNNING TPOT CLASSIFIER FOR DATASET 179
  
Imputing missing values in feature set


Optimization Progress:  34%|███▍      | 41/120 [12:57<50:40, 38.48s/pipeline]

Generation 1 - Current best internal CV score: 0.90740367894238


Optimization Progress:  51%|█████     | 61/120 [21:24<29:22, 29.88s/pipeline]

Generation 2 - Current best internal CV score: 0.9083400999338844


Optimization Progress:  68%|██████▊   | 81/120 [31:39<20:49, 32.04s/pipeline]

Generation 3 - Current best internal CV score: 0.9083400999338844


Optimization Progress:  84%|████████▍ | 101/120 [38:42<06:30, 20.54s/pipeline]

Generation 4 - Current best internal CV score: 0.9094274815712264




Generation 5 - Current best internal CV score: 0.9096284800969592

Best pipeline: RandomForestClassifier(XGBClassifier(VarianceThreshold(input_matrix, threshold=0.2), learning_rate=0.1, max_depth=6, min_child_weight=8, n_estimators=100, nthread=1, subsample=0.9000000000000001), bootstrap=True, criterion=gini, max_features=0.25, min_samples_leaf=17, min_samples_split=2, n_estimators=100)
Imputing missing values in feature set





RUNNING TPOT CLASSIFIER FOR DATASET 184
  


Optimization Progress:  38%|███▊      | 45/120 [33:29<1:04:27, 51.56s/pipeline]

Generation 1 - Current best internal CV score: 0.7033025930422093


Optimization Progress:  55%|█████▌    | 66/120 [52:40<44:09, 49.06s/pipeline]  

Generation 2 - Current best internal CV score: 0.7033025930422093


Optimization Progress:  72%|███████▏  | 86/120 [1:02:59<19:16, 34.01s/pipeline]

Generation 3 - Current best internal CV score: 0.7190847866731527


Optimization Progress:  90%|█████████ | 108/120 [1:20:01<10:40, 53.38s/pipeline]

Generation 4 - Current best internal CV score: 0.7190847866731527




Generation 5 - Current best internal CV score: 0.7223636498279549

Best pipeline: KNeighborsClassifier(input_matrix, n_neighbors=9, p=1, weights=distance)
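
The verbose output above reports the best internal CV score after every generation. If the log is saved as text, those lines can be extracted with a regular expression to see how much each run improved. A minimal sketch; `parse_generation_scores` is a hypothetical helper, and the sample lines are copied from the dataset 184 run above.

```python
import re

GENERATION_RE = re.compile(
    r"Generation (\d+) - Current best internal CV score: ([0-9.]+)"
)

def parse_generation_scores(log_text):
    """Return (generation, score) tuples found in TPOT's verbose output."""
    return [(int(g), float(s)) for g, s in GENERATION_RE.findall(log_text)]

# Lines copied from the dataset 184 run
sample_log = """
Generation 1 - Current best internal CV score: 0.7033025930422093
Generation 2 - Current best internal CV score: 0.7033025930422093
Generation 3 - Current best internal CV score: 0.7190847866731527
Generation 4 - Current best internal CV score: 0.7190847866731527
Generation 5 - Current best internal CV score: 0.7223636498279549
"""

scores = parse_generation_scores(sample_log)
first, last = scores[0][1], scores[-1][1]
print("Improvement over {} generations: {:.4f}".format(len(scores), last - first))
```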



#### Results

In [0]:
import pandas as pd
pd.set_option('display.width', 1024)
pd.set_option('display.max_colwidth', 1000)

pd.DataFrame(results, columns = ["Dataset id", "Time", "Validation score", "Best pipeline"])

Index,Dataset id,Time,Validation score,Best pipeline
0,38,147.527569,0.998426,"Pipeline(memory=None,\n steps=[('xgbclassifier', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,\n max_depth=7, min_child_weight=2, missing=None, n_estimators=100,\n n_jobs=1, nthread=1, objective='binary:logistic', random_state=0,\n reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n silent=True, subsample=0.9000000000000001))])"
1,46,1000.625428,0.966165,"Pipeline(memory=None,\n steps=[('normalizer', Normalizer(copy=True, norm='l2')), ('stackingestimator', StackingEstimator(estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bytree=1, gamma=0, learning_rate=0.5, max_delta_step=0,\n max_depth=1, min_child_weight=4, missing=Non...imators=100, n_jobs=1,\n oob_score=False, random_state=None, verbose=0, warm_start=False))])"
2,179,3190.894094,0.914897,"Pipeline(memory=None,\n steps=[('variancethreshold', VarianceThreshold(threshold=0.2)), ('stackingestimator', StackingEstimator(estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,\n max_depth=6, min_child_weight=8, miss...n_jobs=1,\n oob_score=False, random_state=None, verbose=0,\n warm_start=False))])"
3,184,6221.501911,0.782435,"Pipeline(memory=None,\n steps=[('kneighborsclassifier', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n metric_params=None, n_jobs=1, n_neighbors=9, p=1,\n weights='distance'))])"
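
The `Time` column holds the difference of two `time.time()` calls, so it is in seconds. For the longer runs an hours:minutes:seconds form is easier to read. A minimal sketch with the values copied from the table above; `format_duration` is a hypothetical helper.

```python
from datetime import timedelta

# (dataset id, fit time in seconds) copied from the Part 1 results table
part1_times = [(38, 147.527569), (46, 1000.625428),
               (179, 3190.894094), (184, 6221.501911)]

def format_duration(seconds):
    """Format a duration in seconds as H:MM:SS, dropping fractions of a second."""
    return str(timedelta(seconds=int(seconds)))

for dataset_id, seconds in part1_times:
    print(dataset_id, format_duration(seconds))
```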


### Part 2

In [0]:
benchmarks_metrics2 = { 772 : "accuracy", 
                       917 : "accuracy", 1049 : "roc_auc" } 

In [9]:
results2 = []

for datasetId, metric in benchmarks_metrics2.items():
  dataset = oml.datasets.get_dataset(datasetId) 
  results2.append(runModel(dataset, metric, generations = 5))



RUNNING TPOT CLASSIFIER FOR DATASET 772
  


Optimization Progress:  33%|███▎      | 40/120 [00:20<00:50,  1.60pipeline/s]

Generation 1 - Current best internal CV score: 0.5756086693106477


Optimization Progress:  50%|█████     | 60/120 [00:34<00:42,  1.40pipeline/s]

Generation 2 - Current best internal CV score: 0.5756086693106477


Optimization Progress:  67%|██████▋   | 80/120 [00:53<00:39,  1.01pipeline/s]

Generation 3 - Current best internal CV score: 0.5756086693106477


Optimization Progress:  83%|████████▎ | 100/120 [01:07<00:14,  1.36pipeline/s]

Generation 4 - Current best internal CV score: 0.5756086693106477




Generation 5 - Current best internal CV score: 0.5805203970455869

Best pipeline: ExtraTreesClassifier(StandardScaler(input_matrix), bootstrap=True, criterion=entropy, max_features=0.25, min_samples_leaf=2, min_samples_split=10, n_estimators=100)

RUNNING TPOT CLASSIFIER FOR DATASET 917
  


Optimization Progress:  33%|███▎      | 40/120 [00:19<00:16,  4.85pipeline/s]

Generation 1 - Current best internal CV score: 0.8893114064328784


Optimization Progress:  50%|█████     | 60/120 [00:35<00:53,  1.12pipeline/s]

Generation 2 - Current best internal CV score: 0.8906536883120731


Optimization Progress:  67%|██████▋   | 80/120 [01:21<01:40,  2.51s/pipeline]

Generation 3 - Current best internal CV score: 0.8933293035246012


Optimization Progress:  83%|████████▎ | 100/120 [02:09<00:56,  2.85s/pipeline]

Generation 4 - Current best internal CV score: 0.8933293035246012




Generation 5 - Current best internal CV score: 0.8946538068358594

Best pipeline: ExtraTreesClassifier(PolynomialFeatures(RFE(input_matrix, criterion=gini, max_features=0.9000000000000001, n_estimators=100, step=0.9500000000000001), degree=2, include_bias=False, interaction_only=False), bootstrap=False, criterion=entropy, max_features=0.9500000000000001, min_samples_leaf=2, min_samples_split=4, n_estimators=100)

RUNNING TPOT CLASSIFIER FOR DATASET 1049
  


Optimization Progress:  33%|███▎      | 40/120 [00:19<00:37,  2.14pipeline/s]

Generation 1 - Current best internal CV score: 0.9371153250360559


Optimization Progress:  50%|█████     | 60/120 [00:33<00:53,  1.13pipeline/s]

Generation 2 - Current best internal CV score: 0.940880081192244


Optimization Progress:  67%|██████▋   | 80/120 [01:02<00:44,  1.11s/pipeline]

Generation 3 - Current best internal CV score: 0.940880081192244


Optimization Progress:  83%|████████▎ | 100/120 [01:19<00:16,  1.23pipeline/s]

Generation 4 - Current best internal CV score: 0.940880081192244




Generation 5 - Current best internal CV score: 0.9416452112600823

Best pipeline: GradientBoostingClassifier(LogisticRegression(input_matrix, C=0.01, dual=False, penalty=l1), learning_rate=0.1, max_depth=6, max_features=0.8, min_samples_leaf=6, min_samples_split=11, n_estimators=100, subsample=0.9000000000000001)


In [10]:
import pandas as pd
pd.set_option('display.width', 1024)
pd.set_option('display.max_colwidth', 1000)

pd.DataFrame(results2, columns = ["Dataset id", "Time", "Validation score", "Best pipeline"])

Index,Dataset id,Time,Validation score,Best pipeline
0,772,85.648197,0.506422,"Pipeline(memory=None,\n steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('extratreesclassifier', ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='entropy',\n max_depth=None, max_features=0.25, max_leaf_nodes=None,\n min_impurity_decrease=0.0, min_impu...imators=100, n_jobs=1,\n oob_score=False, random_state=None, verbose=0, warm_start=False))])"
1,917,170.42764,0.904,"Pipeline(memory=None,\n steps=[('rfe', RFE(estimator=ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',\n max_depth=None, max_features=0.9000000000000001,\n max_leaf_nodes=None, min_impurity_decrease=0.0,\n min_impurity_split=None, min_samples_leaf=1,\n min_sample...imators=100, n_jobs=1, oob_score=False, random_state=None,\n verbose=0, warm_start=False))])"
2,1049,105.912183,0.938437,"Pipeline(memory=None,\n steps=[('stackingestimator', StackingEstimator(estimator=LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,\n intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n penalty='l1', random_state=None, solver='liblinear', tol=0.0001,\n verbos...auto', random_state=None,\n subsample=0.9000000000000001, verbose=0, warm_start=False))])"
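
The internal CV score is computed on the training split, while the validation score comes from the held-out 25%. Comparing the two gives a rough indication of how well the selected pipeline generalizes. A minimal sketch using the Part 2 numbers above; `score_gaps` is a hypothetical helper, and because each run uses a single random split, small differences may simply be noise.

```python
# (dataset id, final internal CV score, validation score) copied from the Part 2 output
part2_scores = [
    (772, 0.5805203970455869, 0.506422),
    (917, 0.8946538068358594, 0.904),
    (1049, 0.9416452112600823, 0.938437),
]

def score_gaps(rows):
    """Map each dataset id to (validation score - internal CV score)."""
    return {dataset_id: val - cv for dataset_id, cv, val in rows}

gaps = score_gaps(part2_scores)
for dataset_id, gap in gaps.items():
    print("{}: {:+.4f}".format(dataset_id, gap))

# Dataset whose two estimates disagree the most
worst = max(gaps, key=lambda d: abs(gaps[d]))
print("Largest gap:", worst)
```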


### Part 3

In [0]:
benchmarks_metrics3 = { 1111 : "accuracy", 1120 : "accuracy", 
                       1128 : "roc_auc", 293 : "accuracy" } 

In [16]:
results3 = []

for datasetId, metric in benchmarks_metrics3.items():
  dataset = oml.datasets.get_dataset(datasetId) 
  results3.append(runModel(dataset, metric, generations = 5))



RUNNING TPOT CLASSIFIER FOR DATASET 1111
  
Imputing missing values in feature set


Optimization Progress:  46%|████▌     | 55/120 [1:14:09<1:29:33, 82.67s/pipeline] 

Generation 1 - Current best internal CV score: 0.9826666666666666


Optimization Progress:  68%|██████▊   | 81/120 [1:51:44<1:01:53, 95.21s/pipeline]

Generation 2 - Current best internal CV score: 0.9826666666666666


Optimization Progress:  88%|████████▊ | 106/120 [2:27:35<17:28, 74.92s/pipeline] 

Generation 3 - Current best internal CV score: 0.9826666666666666


Optimization Progress: 129pipeline [2:59:28, 96.44s/pipeline]

Generation 4 - Current best internal CV score: 0.9826666666666666




Generation 5 - Current best internal CV score: 0.9826666666666666

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=entropy, max_features=0.3, min_samples_leaf=6, min_samples_split=6, n_estimators=100)
Imputing missing values in feature set

RUNNING TPOT CLASSIFIER FOR DATASET 1120
  


Optimization Progress:  33%|███▎      | 40/120 [02:39<07:22,  5.53s/pipeline]

Generation 1 - Current best internal CV score: 0.8802663862600772


Optimization Progress:  50%|█████     | 60/120 [04:55<07:42,  7.70s/pipeline]

Generation 2 - Current best internal CV score: 0.8837714686295127


Optimization Progress:  67%|██████▋   | 80/120 [10:15<07:46, 11.67s/pipeline]

Generation 3 - Current best internal CV score: 0.8837714686295127


Optimization Progress:  83%|████████▎ | 100/120 [13:33<03:29, 10.47s/pipeline]

Generation 4 - Current best internal CV score: 0.8841219768664563




Generation 5 - Current best internal CV score: 0.8841219768664563

Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.1, max_depth=10, max_features=0.25, min_samples_leaf=4, min_samples_split=19, n_estimators=100, subsample=0.8500000000000001)



In [18]:
import pandas as pd
pd.set_option('display.width', 1024)
pd.set_option('display.max_colwidth', 1000)

pd.DataFrame(results3, columns = ["Dataset id", "Time", "Validation score", "Best pipeline"])

Index,Dataset id,Time,Validation score,Best pipeline
0,1111,12860.90598,0.9808,"Pipeline(memory=None,\n steps=[('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',\n max_depth=None, max_features=0.3, max_leaf_nodes=None,\n min_impurity_decrease=0.0, min_impurity_split=None,\n min_samples_leaf=6, min_samples_split=6,\n min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,\n oob_score=False, random_state=None, verbose=0,\n warm_start=False))])"
1,1120,1135.952204,0.878023,"Pipeline(memory=None,\n steps=[('gradientboostingclassifier', GradientBoostingClassifier(criterion='friedman_mse', init=None,\n learning_rate=0.1, loss='deviance', max_depth=10,\n max_features=0.25, max_leaf_nodes=None,\n min_impurity_decrease=0.0, min_impurity_split=None,\n ...auto', random_state=None,\n subsample=0.8500000000000001, verbose=0, warm_start=False))])"
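
The table above lists only datasets 1111 and 1120, although `benchmarks_metrics3` names four. Whatever the cause, a loop that records failures instead of stopping keeps the completed results usable. A minimal sketch, assuming a hypothetical `run_all` helper and a `runner` callable that stands in for the `get_dataset` plus `runModel` pair; `fake_runner` exists only to make the example self-contained.

```python
def run_all(benchmarks, runner):
    """Run `runner(dataset_id, metric)` per entry, collecting results and failures."""
    results, failures = [], {}
    for dataset_id, metric in benchmarks.items():
        try:
            results.append(runner(dataset_id, metric))
        except Exception as exc:  # keep going so earlier results are not lost
            failures[dataset_id] = repr(exc)
    return results, failures

def fake_runner(dataset_id, metric):
    """Stand-in for the real benchmark call; fails for one id to show the handling."""
    if dataset_id == 2:
        raise ValueError("simulated failure")
    return [dataset_id, metric]

results_demo, failures_demo = run_all(
    {1: "accuracy", 2: "roc_auc", 3: "accuracy"}, fake_runner)
print(results_demo)
print(failures_demo)
```

In this notebook the runner could be something like `lambda i, m: runModel(oml.datasets.get_dataset(i), m)`. Note that catching `Exception` does not cover a manual interrupt of the cell.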


### Part 4

In [0]:
benchmarks_metrics4 = { 389 : "accuracy", 293 : "accuracy" } 

In [27]:
results4 = []

for datasetId, metric in benchmarks_metrics4.items():
  dataset = oml.datasets.get_dataset(datasetId) 
  results4.append(runModel(dataset, metric, generations = 5, sparse = True))



RUNNING TPOT CLASSIFIER FOR DATASET 389
  


Optimization Progress:  34%|███▍      | 41/120 [17:24<16:58, 12.89s/pipeline]

Generation 1 - Current best internal CV score: 0.848791087706009


Optimization Progress:  51%|█████     | 61/120 [25:16<23:14, 23.64s/pipeline]

Generation 2 - Current best internal CV score: 0.848791087706009


Optimization Progress:  68%|██████▊   | 81/120 [36:18<23:19, 35.88s/pipeline]

Generation 3 - Current best internal CV score: 0.848791087706009


Optimization Progress:  85%|████████▌ | 102/120 [49:35<10:02, 33.45s/pipeline]

Generation 4 - Current best internal CV score: 0.848791087706009




Generation 5 - Current best internal CV score: 0.855360059265748

Best pipeline: XGBClassifier(input_matrix, learning_rate=0.1, max_depth=6, min_child_weight=7, n_estimators=100, nthread=1, subsample=0.8)




In [28]:
import pandas as pd
pd.set_option('display.width', 1024)
pd.set_option('display.max_colwidth', 1000)

pd.DataFrame(results4, columns = ["Dataset id", "Time", "Validation score", "Best pipeline"])

Index,Dataset id,Time,Validation score,Best pipeline
0,389,3502.628909,0.853896,"Pipeline(memory=None,\n steps=[('xgbclassifier', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,\n max_depth=6, min_child_weight=7, missing=None, n_estimators=100,\n n_jobs=1, nthread=1, objective='multi:softprob', random_state=0,\n reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n silent=True, subsample=0.8))])"


### Part 5