## TPOT AUTOML

The Tree-Based Pipeline Optimization Tool (TPOT) was one of the first AutoML methods and open-source software packages developed for the data science community. TPOT was developed by Dr. Randal Olson while he was a postdoctoral student with Dr. Jason H. Moore at the Computational Genetics Laboratory of the University of Pennsylvania, and it is still being extended and supported by this team.

The goal of TPOT is to automate the building of ML pipelines by combining a flexible expression-tree representation of pipelines with stochastic search algorithms such as genetic programming. TPOT uses the Python-based scikit-learn library as its menu of ML building blocks: the preprocessors, feature selectors, and models that a pipeline can be assembled from.
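
To make the expression-tree idea concrete, here is a minimal sketch in plain Python. It is not TPOT's internal representation: the three toy operators (`ZeroCount`, `Scale`, `SumThreshold`) and the nested-tuple layout are assumptions chosen for illustration. Each node names an operator, its child (another node or the raw input), and its hyperparameters, so a whole pipeline is a single tree that a search algorithm can rearrange.

```python
# Toy operators (hypothetical stand-ins for scikit-learn steps).
def add_zero_count(rows):
    """Feature constructor: append the number of zeros found in each row."""
    return [row + [row.count(0)] for row in rows]

def scale(rows, factor=1.0):
    """Preprocessor: multiply every value by a constant factor."""
    return [[value * factor for value in row] for row in rows]

def sum_threshold(rows, cutoff=0.0):
    """Toy 'model': predict 1 when the row sum exceeds the cutoff, else 0."""
    return [1 if sum(row) > cutoff else 0 for row in rows]

OPERATORS = {
    "ZeroCount": add_zero_count,
    "Scale": scale,
    "SumThreshold": sum_threshold,
}

def evaluate(node, data):
    """Recursively run the tree: evaluate the child first, then apply the operator."""
    if node == "input_matrix":
        return data
    name, child, params = node
    return OPERATORS[name](evaluate(child, data), **params)

def render(node):
    """Print the tree as a nested call, innermost step first."""
    if node == "input_matrix":
        return node
    name, child, params = node
    args = [render(child)] + [f"{key}={value}" for key, value in params.items()]
    return f"{name}({', '.join(args)})"

# A three-step pipeline: construct a feature, rescale, then classify.
tree = (
    "SumThreshold",
    ("Scale", ("ZeroCount", "input_matrix", {}), {"factor": 2.0}),
    {"cutoff": 10.0},
)

rows = [[0, 0, 1], [3, 2, 4]]
print(render(tree))          # SumThreshold(Scale(ZeroCount(input_matrix), factor=2.0), cutoff=10.0)
print(evaluate(tree, rows))  # [0, 1]
```

Because the pipeline is just data, swapping a subtree or changing a hyperparameter yields another valid pipeline, which is what makes this kind of representation convenient for evolutionary search.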

Reference: GitHub repository: https://github.com/EpistasisLab/tpot

### Genetic Programming

Genetic Programming (GP) is a type of Evolutionary Algorithm (EA), a subset of machine learning. EAs are used to discover solutions to problems that humans do not know how to solve directly. Free of human preconceptions or biases, the adaptive nature of EAs can generate solutions that are comparable to, and often better than, the best human efforts.

Inspired by biological evolution and its fundamental mechanisms, GP software systems implement an algorithm that uses random mutation, crossover, a fitness function, and multiple generations of evolution to resolve a user-defined task. GP can be used to discover a functional relationship between features in data (symbolic regression), to group data into categories (classification), and to assist in the design of electrical circuits, antennae, and quantum algorithms. In software engineering, GP is applied to code synthesis, genetic improvement, automatic bug-fixing, and the development of game-playing strategies, among other uses.

With the right data, computing power, and machine learning model, you can solve a wide range of problems, but knowing which model to use is challenging because there are so many candidates, such as decision trees, SVMs, and KNN. This is where genetic programming helps: it searches over candidate models and their settings automatically. Genetic algorithms are inspired by the Darwinian process of natural selection, and they are used to generate solutions to optimization and search problems in computer science.

Broadly speaking, genetic algorithms rely on three operations:

* Selection: You have a population of possible solutions to a given problem and a fitness function. At every iteration, you evaluate how fit each solution is using the fitness function.
* Crossover: You then select the fittest solutions and combine pairs of them (crossover) to create a new population.
* Mutation: You take those children, mutate them with some random modification, and repeat the process until you reach a sufficiently fit solution or exhaust your budget.
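
The three operations above can be sketched in a few lines of plain Python. This is a toy genetic algorithm on bit strings, not TPOT's implementation: the OneMax fitness function (count the 1 bits), tournament selection, single-point crossover, and all parameter values are assumptions picked to keep the example small.

```python
import random

def fitness(bits):
    """OneMax: a solution is fitter the more 1 bits it contains."""
    return sum(bits)

def select(population, k, rng):
    """Tournament selection: sample k solutions and keep the fittest."""
    return max(rng.sample(population, k), key=fitness)

def crossover(a, b, rng):
    """Single-point crossover: head of parent a joined to tail of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]

def evolve(n_bits=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]  # best-so-far fitness, one entry per generation
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            parent_a = select(population, 3, rng)
            parent_b = select(population, 3, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rate, rng))
        population = children
        # Remember the best solution ever seen, even if it was not inherited.
        best = max(population + [best], key=fitness)
        history.append(fitness(best))
    return best, history

best, history = evolve()
print("best fitness:", fitness(best))
print("best-so-far per generation:", history)
```

TPOT applies the same loop to pipelines rather than bit strings, with the cross-validation score playing the role of the fitness function, which is why the "Current best internal CV score" in its log never decreases from one generation to the next.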

Reference: http://geneticprogramming.com/

In [15]:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [16]:
iris = load_iris()
iris.data[0:5], iris.target

(array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2]]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]))

In [17]:

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    train_size=0.75, test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((112, 4), (38, 4), (112,), (38,))
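
The 112/38 split follows from how the fractional sizes are rounded on 150 samples. The sketch below reproduces the arithmetic in plain Python; it assumes scikit-learn rounds the test fraction up and the train fraction down, and the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, train_size, test_size):
    """Convert fractional split sizes to row counts (assumed rounding rule)."""
    n_test = math.ceil(test_size * n_samples)     # 0.25 * 150 = 37.5  -> 38
    n_train = math.floor(train_size * n_samples)  # 0.75 * 150 = 112.5 -> 112
    return n_train, n_test

print(split_sizes(150, 0.75, 0.25))  # (112, 38)
```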

In [7]:
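# Search for a pipeline for up to 10 minutes, reporting progress after each generation
tpot = TPOTClassifier(verbosity=2, max_time_mins=10)
tpot.fit(X_train, y_train)
# Score the best pipeline found on the held-out test set
print(tpot.score(X_test, y_test))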

Generation 1 - Current best internal CV score: 0.9727272727272727
Generation 2 - Current best internal CV score: 0.9734848484848484
Generation 3 - Current best internal CV score: 0.9825757575757577
Generation 4 - Current best internal CV score: 0.9825757575757577
Generation 5 - Current best internal CV score: 0.9825757575757577
Generation 6 - Current best internal CV score: 0.9825757575757577
Generation 7 - Current best internal CV score: 0.9825757575757577
Generation 8 - Current best internal CV score: 0.9825757575757577
Generation 9 - Current best internal CV score: 0.9825757575757577
Generation 10 - Current best internal CV score: 0.9825757575757577
Generation 11 - Current best internal CV score: 0.9825757575757577
Generation 12 - Current best internal CV score: 0.9825757575757577
Generation 13 - Current best internal CV score: 0.9825757575757577
Generation 14 - Current best internal CV score: 0.9825757575757577
Generation 15 - Current best internal CV score: 0.9825757575757577
Gene

In [11]:
tpot.fitted_pipeline_

Pipeline(memory=None,
     steps=[('zerocount', ZeroCount()), ('multinomialnb', MultinomialNB(alpha=0.001, class_prior=None, fit_prior=False))])

In [17]:
print(tpot.score(X_test, y_test))

1.0


In [13]:

# Write the best pipeline found to a standalone Python script
tpot.export('tpot_iris_pipeline.py')