**TPOT (Tree-based Pipeline Optimization Tool)** is a Python automated machine learning (AutoML) tool that optimizes machine learning pipelines using genetic programming. TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.

Once TPOT has finished searching, it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there. TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.

**Time Constraint with TPOT**
AutoML tools such as TPOT come with a few issues that a data scientist needs to deal with.

**Longer training time**
The main issue is search time. If you want TPOT to find a reasonably good pipeline for your dataset, you have to run it for a long time so that it can explore enough of the search space. It is often better to run multiple instances of TPOT in parallel.

AutoML algorithms aren't as simple as fitting one model on the dataset; they are considering multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline.

**Extensive compute**
As such, TPOT will take a while to run on larger datasets, but it's important to realize why. With the default TPOT settings (100 generations with a population size of 100), TPOT will evaluate roughly 10,000 pipeline configurations before finishing. To put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm and how long that grid search will take. That is 10,000 model configurations to evaluate with 10-fold cross-validation, which means that roughly 100,000 models are fit and evaluated on the training data in one grid search. That's a time-consuming procedure, even for simpler models like decision trees.
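The evaluation budget can be worked out by hand. The sketch below assumes TPOT evaluates the initial population once and then `offspring_size` new pipelines per generation, with `offspring_size` defaulting to `population_size`. The helper names `total_pipeline_evaluations` and `total_model_fits` are hypothetical and are not part of TPOT.

```python
def total_pipeline_evaluations(generations, population_size, offspring_size=None):
    """Pipelines evaluated in one run: initial population plus offspring per generation."""
    if offspring_size is None:
        # Assumption: offspring_size falls back to population_size when unset.
        offspring_size = population_size
    return population_size + generations * offspring_size


def total_model_fits(n_configurations, cv_folds):
    """Each configuration is fit once per cross-validation fold."""
    return n_configurations * cv_folds


default_budget = total_pipeline_evaluations(generations=100, population_size=100)
small_budget = total_pipeline_evaluations(generations=5, population_size=50)

print(default_budget)                 # 10100, i.e. roughly 10,000
print(small_budget)                   # 300
print(total_model_fits(10_000, 10))   # 100000 fits for the grid search analogy
```

The exact default figure is slightly above the rounded 10,000 because the initial population is evaluated as well.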

Typical TPOT runs will take hours to days to finish (unless it's a small dataset), but you can always interrupt the run partway through and see the best results so far. TPOT also provides a **warm_start** parameter that lets you restart a TPOT run from where it left off.
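As a minimal sketch of how `warm_start` might be used, the function below fits the same TPOT object twice so that the second call continues from the population left by the first. The function name `fit_in_two_sessions`, the `SESSION_KWARGS` dictionary, and the 5-minute budget are illustrative assumptions. TPOT is imported inside the function so the snippet loads even where TPOT is not installed.

```python
# Illustrative settings; the time budget is an assumption, not a recommendation.
SESSION_KWARGS = {
    "verbosity": 2,
    "max_time_mins": 5,
    "warm_start": True,  # reuse the population from the previous fit() call
}


def fit_in_two_sessions(X, y, **overrides):
    """Fit TPOT twice on the same data, continuing the search the second time."""
    from tpot import TPOTClassifier  # lazy import: requires TPOT to be installed

    tpot = TPOTClassifier(**{**SESSION_KWARGS, **overrides})
    tpot.fit(X, y)  # first session: starts from a fresh population
    tpot.fit(X, y)  # second session: continues from where the first stopped
    return tpot
```

Note that this only helps while the same `tpot` object is still alive, for example in a notebook session where you interrupted the first `fit` call.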

**AutoML algorithms can recommend different solutions for the same dataset**
If you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs may result in different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means that it uses randomness (in part) to search the possible pipeline space. When two TPOT runs recommend different pipelines, this means that the TPOT runs didn't converge due to lack of time or that multiple pipelines perform more-or-less the same on your dataset.
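One way to see this in practice is to run TPOT with several seeds and compare the pipelines it recommends. The sketch below is an assumption-laden illustration: the function name `pipelines_for_seeds`, the seed values, and the small `generations` and `population_size` settings are all hypothetical choices made to keep the run short. It relies on the `random_state` parameter and the `fitted_pipeline_` attribute of a fitted TPOT object.

```python
SEEDS = (0, 1, 2)  # arbitrary seeds chosen for illustration


def pipelines_for_seeds(X, y, seeds=SEEDS, generations=5, population_size=20):
    """Run one TPOT search per seed and return the recommended pipeline for each."""
    from tpot import TPOTClassifier  # lazy import: requires TPOT to be installed

    recommended = {}
    for seed in seeds:
        tpot = TPOTClassifier(
            generations=generations,
            population_size=population_size,
            random_state=seed,  # fixes the randomness used by this particular run
            verbosity=0,
        )
        tpot.fit(X, y)
        # Store a text description of the fitted scikit-learn pipeline.
        recommended[seed] = repr(tpot.fitted_pipeline_)
    return recommended
```

If the descriptions differ across seeds but the scores are close, several pipelines probably perform about equally well on the data; if the scores differ noticeably, the runs likely need more time.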

This is actually an advantage over fixed grid search techniques: TPOT is meant to be an assistant that gives you ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you might have never considered, then leaves the fine-tuning to more constrained parameter tuning techniques such as grid search.
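The hand-off to a constrained tuner could look like the sketch below. It assumes TPOT has suggested a `LinearSVC` and that you want to refine its `C` value with scikit-learn's `GridSearchCV`. The grid values and the function name `fine_tune_linear_svc` are hypothetical, and scikit-learn is imported inside the function so the snippet loads without it.

```python
# Hypothetical grid around a C value that a TPOT run might suggest.
PARAM_GRID = {"C": [5.0, 10.0, 15.0, 20.0, 25.0]}


def fine_tune_linear_svc(X, y, param_grid=None, cv=5):
    """Grid-search C for a LinearSVC and return the best parameters and CV score."""
    from sklearn.model_selection import GridSearchCV  # lazy imports
    from sklearn.svm import LinearSVC

    search = GridSearchCV(
        LinearSVC(dual=False),  # primal formulation, as TPOT often selects
        param_grid or PARAM_GRID,
        cv=cv,
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

The grid is deliberately narrow: TPOT has already done the broad exploration, so the grid search only refines the neighbourhood of its suggestion.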

TPOT has two main APIs:

**TPOTClassifier** : For classification use cases.

**TPOTRegressor** : For regression use cases.

## TPOT Example with Iris Dataset

In [1]:
import warnings
warnings.filterwarnings('ignore')

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split





In [2]:
iris = load_iris()
iris.data[0:5], iris.target

(array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2]]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]))

In [3]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    train_size=0.75, test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((112, 4), (38, 4), (112,), (38,))

In [4]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=5)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))


Generation 1 - Current best internal CV score: 0.9655467720685114
Generation 2 - Current best internal CV score: 0.9655467720685114
Generation 3 - Current best internal CV score: 0.9734848484848484
Generation 4 - Current best internal CV score: 0.9742424242424242

5.016276466666667 minutes have elapsed. TPOT will close down.
TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: LinearSVC(LogisticRegression(input_matrix, C=0.001, dual=False, penalty=l2), C=15.0, dual=False, loss=squared_hinge, penalty=l2, tol=0.0001)
1.0


In [5]:
tpot.export('tpot_iris_pipeline.py')

True

The resulting best pipeline is written to the **tpot_iris_pipeline.py** file in the same folder. We ran the search for only 5 minutes (`max_time_mins=5`), so TPOT reports the best pipeline it found within that budget. To give it a chance to find a better pipeline, remove the time limit and let the run finish on its own.

## TPOT Regressor with the Boston House Price Dataset

In [6]:
from tpot import TPOTRegressor
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older scikit-learn.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)



Generation 1 - Current best internal CV score: -11.134659635924695
Generation 2 - Current best internal CV score: -10.74747225137049
Generation 3 - Current best internal CV score: -10.242664452137216
Generation 4 - Current best internal CV score: -10.242664452137216
Generation 5 - Current best internal CV score: -9.935683614631834

Best pipeline: XGBRegressor(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), learning_rate=0.1, max_depth=10, min_child_weight=7, n_estimators=100, nthread=1, subsample=0.7500000000000001)


TPOTRegressor(config_dict=None, crossover_rate=0.1, cv=5,
       disable_update_check=False, early_stop=None, generations=5,
       max_eval_time_mins=5, max_time_mins=None, memory=None,
       mutation_rate=0.9, n_jobs=1, offspring_size=None,
       periodic_checkpoint_folder=None, population_size=50,
       random_state=None, scoring=None, subsample=1.0, use_dask=False,
       verbosity=2, warm_start=False)

In [7]:
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

-10.523367933668329


True
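The negative scores above are easier to read once converted. With `scoring=None`, `TPOTRegressor` falls back to negated mean squared error, so a score of -10.52 corresponds to an MSE of 10.52. The sketch below converts the scores printed above into root mean squared errors; the helper name `rmse_from_neg_mse` is hypothetical.

```python
import math


def rmse_from_neg_mse(neg_mse):
    """Convert a negated mean squared error into a root mean squared error."""
    return math.sqrt(-neg_mse)


test_rmse = rmse_from_neg_mse(-10.523367933668329)  # held-out test score printed above
cv_rmse = rmse_from_neg_mse(-9.935683614631834)     # best internal CV score, generation 5

print(round(test_rmse, 2))  # 3.24
print(round(cv_rmse, 2))    # 3.15
```

Since the Boston target is the median home value in thousands of dollars, an RMSE of about 3.24 means the predictions are off by roughly $3,240 on this test split.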