[View in Colaboratory](https://colab.research.google.com/github/gomerudo/auto-ml/blob/master/python/notebooks/TPOT.ipynb)

# TPOT exploration

## Installing the packages

In [0]:
# Install the packages TPOT depends on
!pip install numpy scipy scikit-learn pandas deap update_checker tqdm stopit

# Install XGBoost so that TPOT can also search over eXtreme Gradient Boosting models (entirely optional)
!pip install xgboost

# Install TPOT itself
!pip install tpot

# Install the OpenML Python client and the ARFF parser it uses, both from GitHub
!pip install git+https://github.com/renatopp/liac-arff@master
!pip install git+https://github.com/openml/openml-python.git@develop


Collecting deap
  Downloading https://files.pythonhosted.org/packages/af/29/e7f2ecbe02997b16a768baed076f5fc4781d7057cd5d9adf7c94027845ba/deap-1.2.2.tar.gz (936kB)
Collecting update_checker
  Downloading https://files.pythonhosted.org/packages/17/c9/ab11855af164d03be0ff4fddd4c46a5bd44799a9ecc1770e01a669c21168/update_checker-0.16-py2.py3-none-any.whl
Collecting stopit
  Downloading https://files.pythonhosted.org/packages/35/58/e8bb0b0fb05baf07bbac1450c447d753da65f9701f551dca79823ce15d50/stopit-1.1.2.tar.gz
Building wheels for collected packages: deap, stopit
  Running setup.py bdist_wheel for deap ... done
  Stored in directory: /root/.cache/pip/wheels/22/ea/bf/dc7c8a2262025a0ab5da9ef02282c198be88902791ca0c6658
  Running setup.py bdist_wheel for stopit ... done
  Stored in directory: /root/.cache/pip/wheels/3c/85/2b/2580190404636bfc63e8de3dff62

## Fetching a public dataset (fraud)

In [0]:
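import openml as oml
from sklearn.model_selection import train_test_split

# Download OpenML dataset 1597 (fraud data)
dataset = oml.datasets.get_dataset(1597)
# Note: newer openml-python releases return additional values from get_data
# (categorical indicator, attribute names), so the unpacking may need adjusting
X, y = dataset.get_data(target=dataset.default_target_attribute)

# Hold out 25% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  train_size=0.75,
                                                  test_size=0.25)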

In [0]:
print("Shape of whole dataset:", X.shape)
print("Shape of train dataset:", X_train.shape)
print("Shape of validation dataset:", X_val.shape)

Shape of whole dataset: (284807, 29)
Shape of train dataset: (213605, 29)
Shape of validation dataset: (71202, 29)
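
The split above is random and unstratified, and it sets no `random_state`. Fraud data is usually heavily imbalanced, so it can be worth checking that both partitions keep the same share of positive cases. The following is a minimal sketch on synthetic data (the array names, the 1% positive rate, and the seed are assumptions for illustration, not properties of the OpenML dataset) showing how `stratify` and `random_state` make the split reproducible and class-balanced:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced dataset (NOT the OpenML fraud data):
# 10,000 rows, 29 features, 1% positive labels.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(10000, 29))
y_demo = np.zeros(10000, dtype=int)
y_demo[:100] = 1

# stratify keeps the class ratio identical in both partitions;
# random_state makes the split reproducible across runs.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo,
    train_size=0.75, test_size=0.25,
    stratify=y_demo, random_state=42)

print("Positive rate overall:   ", y_demo.mean())
print("Positive rate train:     ", y_tr.mean())
print("Positive rate validation:", y_va.mean())
```

With real data, the same two arguments can be passed to the `train_test_split` call used in this notebook.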


## Testing TPOT

In [0]:
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations = 5, population_size = 20, verbosity = 2, 
                      scoring = 'roc_auc', n_jobs = -1)
print("=====================================================================")
print("======================= RUNNING TPOT CLASSIFIER =====================")
print("=====================================================================\n")
tpot.fit(X_train, y_train)




Optimization Progress:  42%|████▎     | 51/120 [58:16<1:29:25, 77.75s/pipeline]

Generation 1 - Current best internal CV score: 0.9815067543598548


Optimization Progress:  64%|██████▍   | 77/120 [2:05:58<1:46:16, 148.30s/pipeline]

Generation 2 - Current best internal CV score: 0.9815067543598548


Optimization Progress:  83%|████████▎ | 100/120 [2:32:37<33:54, 101.73s/pipeline]

Generation 3 - Current best internal CV score: 0.9815067543598548


Optimization Progress: 125pipeline [3:06:08, 105.72s/pipeline]

Generation 4 - Current best internal CV score: 0.9815067543598548




Generation 5 - Current best internal CV score: 0.9815067543598548

Best pipeline: LogisticRegression(FastICA(input_matrix, tol=0.2), C=0.01, dual=True, penalty=l2)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=5,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=-1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=20,
        random_state=None, scoring='roc_auc', subsample=1.0,
        use_dask=False, verbosity=2, warm_start=False)
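
The total of 120 shown in the progress bar matches the evaluation budget that TPOT's documentation describes: the initial population plus one batch of offspring per generation, where `offspring_size` falls back to `population_size` when left as `None`. A small sketch of that arithmetic (the helper name `total_pipelines` is hypothetical, not part of TPOT):

```python
def total_pipelines(population_size, generations, offspring_size=None):
    """Number of pipelines evaluated, assuming TPOT's documented budget."""
    # When offspring_size is not given, it defaults to population_size
    if offspring_size is None:
        offspring_size = population_size
    # Initial population + offspring produced in each generation
    return population_size + generations * offspring_size

# Settings used in this notebook: population_size=20, generations=5
print(total_pipelines(20, 5))  # 120
```

Since each pipeline is cross-validated (`cv=5` here), this number is a useful first estimate of how long a run will take.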

## Validation

In [0]:
tpot.score(X_val, y_val)

0.9809710500201769
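
Because the classifier was created with `scoring = 'roc_auc'`, the value above should be read as an area under the ROC curve rather than an accuracy. The sketch below illustrates, on synthetic data with a plain scikit-learn model (both are stand-ins, not the notebook's data or TPOT's pipeline), how the scorer name `'roc_auc'` resolves to a ranking metric computed from continuous decision scores:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, weights=[0.95, 0.05],
    random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Look the scorer up by name, as scikit-learn does for a scoring string
auc_by_name = get_scorer("roc_auc")(model, X_va, y_va)

# Equivalent manual computation from the model's decision scores
auc_manual = roc_auc_score(y_va, model.decision_function(X_va))

print(auc_by_name, auc_manual)
```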

## Exporting and printing the best pipeline

In [0]:
# Write the best pipeline to a standalone script (named after the fraud data used here)
tpot.export('tpot_fraud_pipeline.py')

!cat tpot_fraud_pipeline.py

import numpy as np
import pandas as pd
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:0.9815067543598548
exported_pipeline = make_pipeline(
    FastICA(tol=0.2),
    LogisticRegression(C=0.01, dual=True, penalty="l2")
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
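
The exported script ends with `predict`, which returns hard class labels, whereas ROC AUC is computed from continuous scores. The sketch below rebuilds the same two-step pipeline on synthetic data (the data, variable names, and seeds are assumptions for illustration) and scores it with `predict_proba`. One adjustment is assumed: recent scikit-learn releases default to the `lbfgs` solver, which does not accept `dual=True`, so `solver="liblinear"` is set explicitly here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (NOT the OpenML fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, weights=[0.9, 0.1],
    random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

# Same steps and hyperparameters as the exported pipeline; the solver is
# an added assumption so that dual=True is accepted by newer scikit-learn.
pipe = make_pipeline(
    FastICA(tol=0.2, random_state=0),
    LogisticRegression(C=0.01, dual=True, penalty="l2", solver="liblinear"),
)
pipe.fit(X_tr, y_tr)

# Probability of the positive class, suitable for ROC AUC
proba = pipe.predict_proba(X_va)[:, 1]
auc = roc_auc_score(y_va, proba)
print("Validation ROC AUC on synthetic data:", auc)
```

The number printed here says nothing about the fraud dataset; it only demonstrates the scoring path.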
