## TPOT: A Python tool for automating data science

http://www.randalolson.com/2016/05/08/tpot-a-python-tool-for-automating-data-science/

In [2]:
import sys
print(sys.version)
print(sys.executable)

3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]
/usr/bin/python3


In [1]:
# Load dependencies
import pandas as pd  
import numpy as np  

### MNIST dataset

In [3]:
# Load data
mnist_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/mnist.csv.gz', sep='\t', compression='gzip') 

In [10]:
print(type(mnist_data))
mnist_data.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,class,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,784
0,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Hyperparameter optimization is important

In [15]:
from sklearn.ensemble import RandomForestClassifier  
from sklearn.model_selection import cross_val_score  
  
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=10, n_jobs=-1),  
                            X=mnist_data.drop('class', axis=1).values,  
                            y=mnist_data.loc[:, 'class'].values,  
                            cv=10) 

In [16]:
print(cv_scores)  
print(np.mean(cv_scores)) 

[0.941      0.96042857 0.94828571 0.94185714 0.94971429 0.94171429
 0.94914286 0.94414286 0.94014286 0.95814286]
0.9474571428571428


**Motivation** <br/>
The random forest achieves an average of 94.7% cross-validation accuracy on MNIST. <br/>
However, what if we tuned the `n_estimators` hyperparameter and gave the random forest 100 trees instead of 10?
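
One way to answer that question is to rerun the same cross-validation with both settings. The sketch below does so on scikit-learn's small bundled digits dataset (`load_digits`), which is an assumption made so the example runs without a download; it is not the MNIST file loaded above, and the scores will differ. `compare_tree_counts` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def compare_tree_counts(X, y, tree_counts=(10, 100), cv=5, seed=0):
    """Return the mean cross-validation accuracy for each n_estimators value."""
    results = {}
    for n_trees in tree_counts:
        model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        results[n_trees] = float(np.mean(cross_val_score(model, X, y, cv=cv)))
    return results


# Stand-in data (assumption): 8x8 digit images bundled with scikit-learn,
# not the MNIST CSV used in the cells above.
X_digits, y_digits = load_digits(return_X_y=True)
tree_scores = compare_tree_counts(X_digits, y_digits)
for n_trees, score in tree_scores.items():
    print(f"{n_trees:>3} trees: {score:.3f}")
```

Fixing `random_state` keeps the comparison repeatable, so any difference between the two rows comes from the number of trees rather than from random seeding.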

## Model selection is important

In [17]:
# Load dataset
hill_valley_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_without_noise.csv.gz', sep='\t', compression='gzip')  

In [28]:
type(hill_valley_data)
hill_valley_data.head(7)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X92,X93,X94,X95,X96,X97,X98,X99,X100,class
0,1317.265789,1315.220951,1312.770581,1309.834252,1306.315588,1302.099102,1297.046401,1290.991646,1283.736109,1275.041652,...,1327.575109,1327.57535,1327.575552,1327.575719,1327.575859,1327.575976,1327.576074,1327.576155,1327.576223,0
1,7329.967624,7379.907443,7441.799231,7518.503422,7613.565031,7731.377492,7877.385707,8058.337694,8282.596458,8560.526497,...,7121.300474,7121.300438,7121.30041,7121.300387,7121.300368,7121.300353,7121.300341,7121.300331,7121.300323,1
2,809.42141,809.780119,810.207191,810.715653,811.321016,812.041748,812.899834,813.921452,815.137768,816.585886,...,807.545134,807.544181,807.543381,807.542709,807.542144,807.54167,807.541272,807.540937,807.540656,1
3,45334.20888,45334.21356,45334.21906,45334.2255,45334.23305,45334.24191,45334.2523,45334.26448,45334.27876,45334.29552,...,47550.92171,47224.45771,46946.07276,46708.68615,46506.25997,46333.64552,46186.45237,46060.93667,45953.90593,1
4,1.810359,1.810359,1.810359,1.810359,1.810359,1.810359,1.810359,1.810359,1.810359,1.810359,...,1.790275,1.794794,1.798296,1.80101,1.803114,1.804744,1.806008,1.806987,1.807746,0
5,2.073517,2.073546,2.073581,2.073622,2.073671,2.07373,2.073799,2.07388,2.073977,2.074092,...,2.074388,2.074227,2.074092,2.073977,2.07388,2.073799,2.07373,2.073671,2.073622,1
6,382.135104,382.135106,382.135109,382.135113,382.135117,382.135123,382.13513,382.135138,382.135148,382.135161,...,388.980545,387.663542,386.599919,385.740927,385.047197,384.486935,384.034462,383.669041,383.373923,1


In [29]:
# Target: class
print(hill_valley_data['class'].unique())

[0 1]


In [30]:
from sklearn.ensemble import RandomForestClassifier  
from sklearn.linear_model import LogisticRegression  
from sklearn.model_selection import cross_val_score  

cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),  
                            X=hill_valley_data.drop('class', axis=1).values,  
                            y=hill_valley_data.loc[:, 'class'].values,  
                            cv=10)  

In [31]:
# Print scores
print(cv_scores)  
print(np.mean(cv_scores))  

[0.6147541  0.62295082 0.63636364 0.61983471 0.65289256 0.5785124
 0.66942149 0.60330579 0.48760331 0.61157025]
0.6097209050264192


**Motivation** <br/>
What if we tried a different model, for example a logistic regression?

In [34]:
cv_scores = cross_val_score(LogisticRegression(max_iter = 1000),  
                            X = hill_valley_data.drop('class', axis=1).values,  
                            y = hill_valley_data.loc[:, 'class'].values,  
                            cv = 10)  

In [35]:
print(cv_scores)  
print(np.mean(cv_scores))

[1.         1.         1.         0.99173554 1.         1.
 1.         1.         1.         1.        ]
0.9991735537190083


We find that logistic regression is well suited to this signal processing task: it achieves near-100% cross-validation accuracy (99.9% on average) without any hyperparameter tuning at all.
<br/>
Always **try out many different machine learning models** for every machine learning task that you work on. <br/> Trying out and tuning different machine learning models is another tedious yet vitally important step of machine learning pipeline design.
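
A minimal sketch of that advice: loop over a few candidate classifiers and score each one with the same cross-validation setup. The synthetic dataset from `make_classification` and the particular list of candidates are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 300 samples, 20 features, 2 classes.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical shortlist of models to compare.
candidates = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# Score every candidate with the same 10-fold cross-validation.
model_scores = {
    name: float(np.mean(cross_val_score(model, X_demo, y_demo, cv=10)))
    for name, model in candidates.items()
}

# Print the candidates from best to worst mean accuracy.
for name, score in sorted(model_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {score:.3f}")
```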

## Feature Pre-processing is important

In [41]:
# Import noisy data
hill_valley_noisy_data = pd.read_csv('https://raw.githubusercontent.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/master/tpot-demo/Hill_Valley_with_noise.csv.gz', sep='\t', compression='gzip') 
hill_valley_noisy_data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X92,X93,X94,X95,X96,X97,X98,X99,X100,class
0,39.02,36.49,38.2,38.85,39.38,39.74,37.02,39.53,38.81,38.79,...,36.62,36.92,38.8,38.52,38.07,36.73,39.46,37.5,39.1,0
1,1.83,1.71,1.77,1.77,1.68,1.78,1.8,1.7,1.75,1.78,...,1.8,1.79,1.77,1.74,1.74,1.8,1.78,1.75,1.69,1
2,68177.69,66138.42,72981.88,74304.33,67549.66,69367.34,69169.41,73268.61,74465.84,72503.37,...,73438.88,71053.35,71112.62,74916.48,72571.58,66348.97,71063.72,67404.27,74920.24,1
3,44889.06,39191.86,40728.46,38576.36,45876.06,47034.0,46611.43,37668.32,40980.89,38466.15,...,42625.67,40684.2,46960.73,44546.8,45410.53,47139.44,43095.68,40888.34,39615.19,0
4,5.7,5.4,5.28,5.38,5.27,5.61,6.0,5.38,5.34,5.87,...,5.17,5.67,5.6,5.94,5.73,5.22,5.3,5.73,5.91,0


In [43]:
from sklearn.ensemble import RandomForestClassifier  
from sklearn.decomposition import PCA  
from sklearn.pipeline import make_pipeline  
from sklearn.model_selection import cross_val_score  
 
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),  
                            X=hill_valley_noisy_data.drop('class', axis=1).values,  
                            y=hill_valley_noisy_data.loc[:, 'class'].values,  
                            cv=10)  

In [44]:
# Print scores
print(cv_scores)
print(np.mean(cv_scores))

[0.51639344 0.54098361 0.52892562 0.60330579 0.6446281  0.58677686
 0.61983471 0.61157025 0.60330579 0.54545455]
0.5801178702072889


We again find that the **“tuned” random forest** averages a disappointing 58.0% cross-validation accuracy.

However, we can preprocess the features first, for example by **denoising them via Principal Component Analysis (PCA)**:

In [45]:
cv_scores = cross_val_score(make_pipeline(PCA(n_components=10),  
                                          RandomForestClassifier(n_estimators=100, n_jobs=-1)),  
                            X=hill_valley_noisy_data.drop('class', axis=1).values,  
                            y=hill_valley_noisy_data.loc[:, 'class'].values,  
                            cv=10)   

In [46]:
# Print scores
print(cv_scores)  
print(np.mean(cv_scores)) 

[0.93442623 0.96721311 0.87603306 0.95041322 0.95041322 0.90909091
 0.91735537 0.92561983 0.89256198 0.95867769]
0.9281804633518492
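
The choice of `n_components=10` in the pipeline above is itself a hyperparameter. As a hedged sketch, a grid search over the whole pipeline can compare several values. The synthetic data and the candidate grid below are assumptions for illustration, not the Hill-Valley data; the parameter name `pca__n_components` follows from the step names that `make_pipeline` generates automatically.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 50 features, only 5 of them informative.
X_noisy, y_noisy = make_classification(n_samples=300, n_features=50,
                                       n_informative=5, n_redundant=0,
                                       random_state=0)

pipeline = make_pipeline(PCA(), RandomForestClassifier(n_estimators=50, random_state=0))

# make_pipeline names each step after its lowercased class name,
# so the PCA parameter is addressed as "pca__n_components".
search = GridSearchCV(pipeline,
                      param_grid={"pca__n_components": [2, 5, 10, 20]},
                      cv=5)
search.fit(X_noisy, y_noisy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because PCA is fitted inside the pipeline, each cross-validation fold learns its components from that fold's training portion only, which avoids leaking information from the held-out samples.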


## Automating data science with TPOT

In [48]:
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier  

X = hill_valley_noisy_data.drop('class', axis=1).values  
y = hill_valley_noisy_data.loc[:, 'class'].values  
  
X_train, X_test, y_train, y_test = train_test_split(X, y,  
                                                    train_size=0.75,  
                                                    test_size=0.25)  

In [None]:
# Fit TPOT Classifier

my_tpot = TPOTClassifier(generations=10)  
my_tpot.fit(X_train, y_train)  
  
print(my_tpot.score(X_test, y_test))     

If we want to see what pipeline TPOT created, TPOT can export the corresponding scikit-learn code for us with the `export()` method:

In [None]:
my_tpot.export('exported_pipeline.py')