# TPOT

#### Author's description:

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data. Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there. TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.

#### Useful links:

[git](https://github.com/EpistasisLab/tpot),
[documentation](http://epistasislab.github.io/tpot/),
[installation](http://epistasislab.github.io/tpot/installing/),
[examples](http://epistasislab.github.io/tpot/examples/)

#### Usage Note

TPOT is a popular choice in production environments because its genetic search tends to improve accuracy over successive generations, it can build ensembles and stacked models, and it is easy to deploy thanks to its scikit-learn foundations and the Python pipeline code it exports.

## Install and import

In [1]:
!pip install tpot==0.10.2

Collecting tpot==0.10.2
  Downloading TPOT-0.10.2-py3-none-any.whl (75 kB)
Collecting deap>=1.0
  Downloading deap-1.3.1-cp36-cp36m-manylinux2010_x86_64.whl (157 kB)
Collecting update-checker>=0.16
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting stopit>=1.1.1
  Downloading stopit-1.1.2.tar.gz (18 kB)
Building wheels for collected packages: stopit
  Building wheel for stopit (setup.py) ... done
  Created wheel for stopit: filename=stopit-1.1.2-py3-none-any.whl size=11954 sha256=a04b505d6270a27d7199d5c9476de3bc2c0927ca321098c0994a218951838bfd
  Stored in directory: /home/ubuntu/.cache/pip/wheels/07/2e/ce/e558b7d4f9aafcdc0e5638ef890a3d5166d8a0f2c2dc768379
Successfully built stopit
Installing collected packages: deap, update-checker, stopit, tpot
Successfully installed deap-1.3.1 stopit-1.1.2 tpot-0.10.2 update-checker-0.18.0

In [1]:
import tpot
from tpot import TPOTClassifier
import sklearn
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [2]:
#tips and code in this notebook were originally written for v 0.10.2
tpot.__version__

'0.10.2'

## Heart Disease

#### load the heart disease dataset

The raw data can be found in the project files at /mnt/data/raw/heart.csv

Attribute documentation:

      age: age in years
      sex: sex (1 = male; 0 = female)
      cp: chest pain type
        -- Value 1: typical angina
        -- Value 2: atypical angina
        -- Value 3: non-anginal pain
        -- Value 4: asymptomatic
     trestbps: resting blood pressure (in mm Hg on admission to the 
        hospital)
     chol: serum cholesterol in mg/dl
     fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
     restecg: resting electrocardiographic results
        -- Value 0: normal
        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST 
                    elevation or depression of > 0.05 mV)
        -- Value 2: showing probable or definite left ventricular hypertrophy
                    by Estes' criteria
     thalach: maximum heart rate achieved
     exang: exercise induced angina (1 = yes; 0 = no)
     oldpeak: ST depression induced by exercise relative to rest
     slope: the slope of the peak exercise ST segment
        -- Value 1: upsloping
        -- Value 2: flat
        -- Value 3: downsloping
     ca: number of major vessels (0-3) colored by fluoroscopy
     thal: 
         3 = normal; 
         6 = fixed defect; 
         7 = reversible defect
     target: diagnosis of heart disease (angiographic disease status)
        -- Value 0: < 50% diameter narrowing
        -- Value 1: > 50% diameter narrowing

In [3]:
# column names
names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang', \
         'oldpeak','slope','ca','thal','target']

# load data from Domino project directory
hd_data = pd.read_csv("/mnt/data/raw/heart.csv", header=None, names=names)

In [4]:
# in case some data comes in as string, convert to numeric and coerce errors to NaN
for col in hd_data.columns:  # Iterate over columns
    hd_data[col] = pd.to_numeric(hd_data[col], errors='coerce')
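To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up column. The values are hypothetical and not taken from heart.csv; the point is only that entries that cannot be parsed as numbers become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical raw column: two parseable numbers and one placeholder string
raw = pd.Series(["63", "?", "41.5"])

# errors='coerce' replaces unparseable entries with NaN instead of raising
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned.tolist())
print(int(cleaned.isna().sum()), "value(s) coerced to NaN")
```

Rows that end up with NaN in a non-encoded column are removed later by `dropna`.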

In [5]:
# a function to do one hot encoding for categorical columns
def create_dummies(data, cols, drop1st=True):
    for c in cols:
        dummies_df = pd.get_dummies(data[c], prefix=c, drop_first=drop1st)  
        data=pd.concat([data, dummies_df], axis=1)
        data = data.drop([c], axis=1)
    return data

In [6]:
cats = ['cp', 'restecg', 'slope', 'ca', 'thal']
hd_data_ohe = create_dummies(hd_data, cats)
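As a possible alternative to the helper above, pandas can encode several columns in a single `get_dummies` call through its `columns` argument. The sketch below uses a toy frame with hypothetical values, not the heart disease data.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column
toy = pd.DataFrame({"age": [63, 41, 57], "cp": [1, 2, 3]})

# columns= selects which columns to encode; drop_first=True drops the first
# level of each encoded column, mirroring create_dummies(..., drop1st=True)
toy_ohe = pd.get_dummies(toy, columns=["cp"], drop_first=True)

print(list(toy_ohe.columns))  # ['age', 'cp_2', 'cp_3']
```

Dropping the first level avoids a redundant column: a row with zeros in both `cp_2` and `cp_3` corresponds to `cp == 1`.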

In [7]:
# drop nulls
hd_data_ohe.dropna(inplace=True)

#load the X and y set as a numpy array
X_hd_ohe = hd_data_ohe.drop('target', axis=1).values
y_hd_ohe = hd_data_ohe['target'].values

#build the train and test sets
X_hd_ohe_train, X_hd_ohe_test, y_hd_ohe_train, y_hd_ohe_test = \
    sklearn.model_selection.train_test_split(X_hd_ohe, y_hd_ohe, random_state=12)
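For reference, `train_test_split` holds out 25% of the rows when no `test_size` is given. The sketch below shows this on synthetic stand-in data and also passes `stratify=y`, which is an optional addition that the notebook does not use; it keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, imbalanced binary target
X = np.arange(300).reshape(100, 3)
y = np.array([0] * 80 + [1] * 20)

# With no test_size given, 25% of the rows are held out for testing.
# stratify=y preserves the 80/20 class ratio in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=12, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Stratifying can make test scores on small datasets less sensitive to how the split happens to fall.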

## Run TPOT

#### TPOTClassifier structure

class tpot.TPOTClassifier(generations=100, population_size=100,
                          offspring_size=None, mutation_rate=0.9,
                          crossover_rate=0.1,
                          scoring='accuracy', cv=5,
                          subsample=1.0, n_jobs=1,
                          max_time_mins=None, max_eval_time_mins=5,
                          random_state=None, config_dict=None,
                          template=None,
                          warm_start=False,
                          memory=None,
                          use_dask=False,
                          periodic_checkpoint_folder=None,
                          early_stop=None,
                          verbosity=0,
                          disable_update_check=False)

#### Popular settings

**generations**: int, optional (default=100).
Number of iterations to run the pipeline optimization process. TPOT will evaluate population_size + generations × offspring_size pipelines in total.

**population_size**: int, optional (default=100)
Number of individuals to retain in the genetic programming population every generation. Must be a positive number.

Generally, TPOT will work better when you give it more individuals with which to optimize the pipeline.

**offspring_size**: int, optional (default=None)
Number of offspring to produce in each genetic programming generation. Must be a positive number. By default, the number of offspring is equal to the population size.
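The pipeline count given under **generations** can be worked out with a few lines of arithmetic. The helper name `total_pipelines` is hypothetical and only illustrates the formula; it is not part of TPOT.

```python
def total_pipelines(generations, population_size, offspring_size=None):
    """Return population_size + generations * offspring_size."""
    # offspring_size falls back to population_size when it is not given
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


print(total_pipelines(100, 100))  # all defaults -> 10100
print(total_pipelines(5, 100))    # generations=5 -> 600
print(total_pipelines(2, 2, 4))   # tiny search -> 10
```

Even a short run therefore evaluates hundreds of pipelines, which is why a time limit is useful when experimenting.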

**scoring**: string or callable, optional (default='accuracy').
Function used to evaluate the quality of a given pipeline for the classification problem. The following built-in scoring functions can be used:

'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss','precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc'
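Since `scoring` also accepts a callable, one way to supply a metric outside the built-in list is scikit-learn's `make_scorer`. The sketch below builds an F2 scorer and calls it on a tiny synthetic dataset purely to show the calling convention; the choice of metric and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer

# Wrap a metric function into a scorer; beta=2 weights recall above precision
f2_scorer = make_scorer(fbeta_score, beta=2)

# Tiny synthetic dataset, used only to demonstrate how a scorer is called
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# A scorer is called as scorer(estimator, X, y) and returns a float
score = f2_scorer(model, X, y)
print(score)

# Assumed usage with TPOT (not run here):
# TPOTClassifier(scoring=f2_scorer, ...)
```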

**n_jobs**: integer, optional (default=1).
Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process.

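**max_time_mins**: integer or None, optional (default=None).
How many minutes TPOT has to optimize the pipeline. In this version, a value other than None overrides the generations parameter, and TPOT keeps running until max_time_mins minutes have elapsed.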

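**verbosity**: integer, optional (default=0).
How much information TPOT prints while it runs: 0 prints nothing, 1 prints minimal information, 2 prints more information and shows a progress bar, and 3 prints everything.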

**config_dict**: Python dictionary, string, or None, optional (default=None).
A configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process.

Possible inputs are:
* Python dictionary: TPOT will use your custom configuration,
* string 'TPOT light': TPOT will use a built-in configuration with only fast models and preprocessors,
* string 'TPOT MDR': TPOT will use a built-in configuration specialized for genomic studies,
* string 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or
* None: TPOT will use the default TPOTClassifier configuration.

http://epistasislab.github.io/tpot/using/#built-in-tpot-configurations
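A custom configuration is a plain dictionary whose keys are operator import paths and whose values map each hyperparameter name to the values TPOT may try. The sketch below is illustrative only: the operators and ranges are example choices, and `count_settings` is a hypothetical helper for sizing the search space, not a TPOT function.

```python
from math import prod

# Illustrative custom configuration: keys are operator import paths,
# values map hyperparameter names to candidate values (example choices)
custom_config = {
    "sklearn.naive_bayes.BernoulliNB": {
        "alpha": [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0],
        "fit_prior": [True, False],
    },
    "sklearn.linear_model.LogisticRegression": {
        "penalty": ["l2"],
        "C": [1e-2, 1e-1, 0.5, 1.0, 5.0, 10.0],
        "dual": [False],
    },
    # An empty dict means the operator has no hyperparameters to tune
    "sklearn.preprocessing.MaxAbsScaler": {},
}


def count_settings(config):
    """Count hyperparameter combinations per operator in a config dict."""
    return {
        name: prod(len(values) for values in grid.values())
        for name, grid in config.items()
    }


print(count_settings(custom_config))

# Assumed usage (not run here):
# TPOTClassifier(config_dict=custom_config, generations=5, ...)
```

Keeping the per-operator counts small is one way to make short, time-limited searches more productive.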

In [8]:
#default config_dict

tpot_hd = TPOTClassifier(generations=5, scoring='accuracy', n_jobs=4, \
                         max_time_mins=2, verbosity=2)
tpot_hd.fit(X_hd_ohe_train, y_hd_ohe_train)
tpot_hd.export('tpot_hd_pipeline.py')

Version 0.10.2 of tpot is outdated. Version 0.11.5 was released Monday June 01, 2020.



Generation 1 - Current best internal CV score: 0.8510979358805446

2.0537710166666665 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: LogisticRegression(BernoulliNB(MaxAbsScaler(input_matrix), alpha=1.0, fit_prior=True), C=1.0, dual=False, penalty=l2)


In [9]:
tpot_hd.score(X_hd_ohe_test, y_hd_ohe_test)

0.7763157894736842

In [10]:
#light config_dict

tpot_hd_light = TPOTClassifier(config_dict='TPOT light', generations=2, \
                         scoring='accuracy', n_jobs=4, max_time_mins=1, \
                         verbosity=2)
tpot_hd_light.fit(X_hd_ohe_train, y_hd_ohe_train)
tpot_hd_light.export('tpot_hd_light_pipeline.py')

Version 0.10.2 of tpot is outdated. Version 0.11.5 was released Monday June 01, 2020.



Generation 1 - Current best internal CV score: 0.8550505050505051
Generation 2 - Current best internal CV score: 0.8595959595959595
Generation 3 - Current best internal CV score: 0.8683926218708828
Generation 4 - Current best internal CV score: 0.8683926218708828
Generation 5 - Current best internal CV score: 0.8683926218708828
Generation 6 - Current best internal CV score: 0.8770882740447957
Generation 7 - Current best internal CV score: 0.8770882740447957
Generation 8 - Current best internal CV score: 0.8770882740447957
Generation 9 - Current best internal CV score: 0.8770882740447957
Generation 10 - Current best internal CV score: 0.8770882740447957
Generation 11 - Current best internal CV score: 0.8770882740447957
Generation 12 - Current best internal CV score: 0.8770882740447957
Generation 13 - Current best internal CV score: 0.8770882740447957
Generation 14 - Current best internal CV score: 0.8770882740447957
Generation 15 - Current best internal CV score: 0.8770882740447957
Gene

In [11]:
tpot_hd_light.score(X_hd_ohe_test, y_hd_ohe_test)

0.7894736842105263

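#### How to specify your parameter space

Passing a `config_dict` that contains a single estimator lets you define the hyperparameter ranges yourself, but TPOT then no longer searches across different models.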

In [12]:
# candidate values for each XGBClassifier hyperparameter
params = {'max_depth': np.arange(1,200,1),
          'learning_rate': np.arange(0.0001,0.1,0.0001),
          'n_estimators': np.arange(1,200,1),
          'nthread':[6],
          'gamma': np.arange(0.1,2,0.1),
          'reg_lambda': np.arange(0.1,200,1),
          'reg_alpha': np.arange(1,200,1),
          'min_child_weight': np.arange(1,200,1),
          # XGBoost requires subsample and colsample_* to lie in (0, 1]
          'subsample': np.linspace(0.1,1.0,10),
          'colsample_bytree': np.linspace(0.1,1.0,10),
          'colsample_bylevel': np.linspace(0.1,1.0,10)
         }

This search takes a long time to run, so the cells below are commented out and shown only to illustrate the usage.

In [13]:
# tpot_classifier = TPOTClassifier(generations=2, population_size=2, offspring_size=4, n_jobs=4, \
#                                 verbosity=2, \
#                                 config_dict={'xgboost.XGBClassifier': params}, scoring = 'accuracy')
# tpot_classifier.fit(X_hd_ohe_train, y_hd_ohe_train)

In [14]:
# tpot_classifier.export('tpot_xgb.py')

In [15]:
# tpot_classifier.score(X_hd_ohe_test, y_hd_ohe_test)

#### load the breast cancer dataset

In [16]:
from sklearn.datasets import load_breast_cancer
print(sklearn.datasets.load_breast_cancer()['DESCR'])

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        

In [17]:
#load from sklearn
X_bc, y_bc = sklearn.datasets.load_breast_cancer(return_X_y=True)

#build the train and test sets
X_bc_train, X_bc_test, y_bc_train, y_bc_test = \
    sklearn.model_selection.train_test_split(X_bc, y_bc, random_state=1)

In [18]:
#light config_dict

tpot_bc_light = TPOTClassifier(config_dict='TPOT light', generations=2, \
                         scoring='accuracy', n_jobs=4, max_time_mins=1, \
                         verbosity=2)
tpot_bc_light.fit(X_bc_train, y_bc_train)
tpot_bc_light.export('tpot_bc_light_pipeline.py')

Version 0.10.2 of tpot is outdated. Version 0.11.5 was released Monday June 01, 2020.



Generation 1 - Current best internal CV score: 0.9649234577551951
Generation 2 - Current best internal CV score: 0.9649234577551951
Generation 3 - Current best internal CV score: 0.9672490391505439
Generation 4 - Current best internal CV score: 0.9719555729268452
Generation 5 - Current best internal CV score: 0.9719555729268452
Generation 6 - Current best internal CV score: 0.9743358738844374

1.0151887 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: LogisticRegression(StandardScaler(KNeighborsClassifier(input_matrix, n_neighbors=28, p=2, weights=uniform)), C=0.5, dual=False, penalty=l2)


In [19]:
tpot_bc_light.score(X_bc_test, y_bc_test)

0.965034965034965

In [22]:
hd_acc = tpot_hd_light.score(X_hd_ohe_test, y_hd_ohe_test)
bc_acc = tpot_bc_light.score(X_bc_test, y_bc_test)

import json
with open('../dominostats.json', 'w') as f:
    f.write(json.dumps( {"HD_ACC": hd_acc, "BC_ACC": bc_acc}))
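As a quick sanity check on the stats file, the following sketch writes a dictionary of the same shape and reads it back. The scores are placeholders and the file goes to a temporary directory, both assumptions made so the example has no side effects; the notebook itself writes the real `hd_acc` and `bc_acc` values to `../dominostats.json`.

```python
import json
import os
import tempfile

# Placeholder scores for illustration only, not results from this notebook
stats = {"HD_ACC": 0.5, "BC_ACC": 0.5}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    stats_path = os.path.join(tmp_dir, "dominostats.json")
    with open(stats_path, "w") as f:
        json.dump(stats, f)  # equivalent to f.write(json.dumps(stats))
    with open(stats_path) as f:
        loaded = json.load(f)

print(loaded)
```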