## We'll now run an analysis with TPOT, an automated machine learning tool. It searches for a pipeline that maximizes the accuracy of our classification

We'll now use TPOT to find the best pipeline for our problem.

In [1]:
%matplotlib notebook

import numpy as np
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split





Importing the data

In [2]:
X = np.genfromtxt('./features/features_p8_p9_1.csv', delimiter=',')
y = np.genfromtxt('./features/output_p8_p9_1.csv', delimiter=',')


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.65, test_size=0.35)
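The split above draws a different random partition on every run and does not preserve the class proportions. A minimal sketch of a seeded, stratified variant with the same 65/35 proportions follows; the synthetic `X` and `y` and the seed `42` are placeholders, not our real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 60 samples of class 0, 40 of class 1.
rng = np.random.default_rng(0)
y = np.array([0] * 60 + [1] * 40)
X = rng.normal(size=(y.size, 4))

# stratify=y keeps the class proportions in both parts;
# random_state makes the partition reproducible (42 is an arbitrary seed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.65, test_size=0.35, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print("share of class 1 in train/test:", (y_train == 1).mean(), (y_test == 1).mean())
```

With a fixed seed, two runs of the notebook give the same train and test sets, which makes the scores comparable between runs.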


In [3]:
tpot = TPOTClassifier(verbosity=2, n_jobs=-1)

In [4]:
tpot.fit(X_train, y_train)


Generation 1 - Current best internal CV score: 0.976114081996435
Generation 2 - Current best internal CV score: 0.976114081996435
Generation 3 - Current best internal CV score: 0.976114081996435
Generation 4 - Current best internal CV score: 0.976114081996435
Generation 5 - Current best internal CV score: 0.976114081996435
Generation 6 - Current best internal CV score: 0.976114081996435
Generation 7 - Current best internal CV score: 0.976114081996435
Generation 8 - Current best internal CV score: 0.976114081996435
Generation 9 - Current best internal CV score: 0.9882352941176471
Generation 10 - Current best internal CV score: 0.9882352941176471
Generation 11 - Current best internal CV score: 0.9941176470588236
Generation 12 - Current best internal CV score: 0.9941176470588236
Generation 13 - Current best internal CV score: 0.9941176470588236
Generation 14 - Current best internal CV score: 0.9941176470588236
Generation 15 - Current best internal CV score: 0.9941176470588236
Generation 1

TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=100,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=-1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=100,
        random_state=None, scoring=None, subsample=1.0,
        template='RandomTree', use_dask=False, verbosity=2,
        warm_start=False)

Let's check the score of the pipeline TPOT found on the held-out test set:

In [5]:
tpot.score(X_test, y_test)

0.9560439560439561
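Note that with `scoring=None` we assume TPOT falls back to plain accuracy for classifiers, so the number above is the fraction of correct predictions rather than precision in the strict sense. The toy labels below (made up for illustration, not taken from our data) show that the two metrics can differ:

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up true and predicted labels for two players (8 and 9).
y_true = [8, 8, 8, 9, 9, 9, 9, 9]
y_pred = [8, 8, 9, 9, 9, 9, 9, 8]

# Accuracy: correct predictions over all predictions (6 of 8 here).
acc = accuracy_score(y_true, y_pred)

# Precision for player 8: of the samples predicted as 8, how many are really 8
# (2 of 3 here).
prec = precision_score(y_true, y_pred, pos_label=8)

print(f"accuracy  = {acc:.3f}")
print(f"precision = {prec:.3f}")
```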

## That is a high accuracy (0.956 on the test set). The chosen pipeline has two steps, as we can verify in the file exported by TPOT (tpot_pipeline.py):

* RFE feature reduction
* LinearSVC

In [6]:
tpot.export('tpot_pipeline.py')
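To give an idea of what such a two-step pipeline looks like in scikit-learn, here is a rough sketch on synthetic data. The estimator inside `RFE`, the number of features kept and all hyperparameters are assumptions; the real values are the ones written to the exported `tpot_pipeline.py`:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for our features (hypothetical).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Step 1: recursive feature elimination keeps 5 of the 10 features
#         (the inner estimator and the value 5 are assumptions).
# Step 2: a linear SVM classifies using the remaining features.
pipeline = make_pipeline(
    RFE(estimator=LinearSVC(dual=False, random_state=0), n_features_to_select=5),
    LinearSVC(dual=False, random_state=0),
)

# 5-fold cross-validated accuracy, the same kind of internal score TPOT reports.
scores = cross_val_score(pipeline, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```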

## Let's now try to classify six players (players 5 to 10) and see how accurate we can be (the search took too much computational time and did not finish executing)

In [2]:
X = np.genfromtxt('./features/features_p5_p6_p7_p8_p9_p10.csv', delimiter=',')
y = np.genfromtxt('./features/output_p5_p6_p7_p8_p9_p10.csv', delimiter=',')

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.65, test_size=0.35)

In [3]:
tpot = TPOTClassifier(verbosity=2, n_jobs=-1)

In [4]:
tpot.fit(X_train, y_train)


Generation 1 - Current best internal CV score: 0.7008033488590746
Generation 2 - Current best internal CV score: 0.7174055311226162
Generation 3 - Current best internal CV score: 0.7174055311226162
Generation 4 - Current best internal CV score: 0.7174055311226162
Generation 5 - Current best internal CV score: 0.723114797701448
Generation 6 - Current best internal CV score: 0.7250389689918165
Generation 7 - Current best internal CV score: 0.7250389689918165
Generation 8 - Current best internal CV score: 0.7250389689918165
Generation 9 - Current best internal CV score: 0.7270941193935683
Generation 10 - Current best internal CV score: 0.7288158684208897
Generation 11 - Current best internal CV score: 0.7288158684208897
Generation 12 - Current best internal CV score: 0.7288158684208897
Generation 13 - Current best internal CV score: 0.7639841464244403
Generation 14 - Current best internal CV score: 0.7639841464244403
Generation 15 - Current best internal CV score: 0.7639841464244403
Gener

TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=100,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=-1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=100,
        random_state=None, scoring=None, subsample=1.0,
        template='RandomTree', use_dask=False, verbosity=2,
        warm_start=False)
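The settings printed above help explain why the search is so expensive. As far as we understand the TPOT documentation, the total number of pipelines evaluated is `population_size + generations * offspring_size`, with `offspring_size` defaulting to `population_size`. The sketch below only does this arithmetic; the reduced values are hypothetical, and how `max_time_mins` interacts with `generations` depends on the TPOT version:

```python
def total_pipeline_evaluations(population_size, generations, offspring_size=None):
    """Total pipelines evaluated: the initial population plus the offspring
    of every generation (offspring_size defaults to population_size)."""
    if offspring_size is None:
        offspring_size = population_size
    return population_size + generations * offspring_size


# Defaults shown in the estimator repr above: 100 + 100 * 100 = 10100 pipelines,
# each one cross-validated with cv=5.
default_budget = total_pipeline_evaluations(population_size=100, generations=100)

# Hypothetical smaller budget; the parameter names appear in the repr above,
# the values are only an illustration.
reduced_kwargs = {
    "generations": 10,
    "population_size": 30,
    "max_time_mins": 60,   # wall-clock limit for the whole search
    "early_stop": 5,       # stop after 5 generations without improvement
    "verbosity": 2,
    "n_jobs": -1,
}
reduced_budget = total_pipeline_evaluations(
    reduced_kwargs["population_size"], reduced_kwargs["generations"]
)  # 30 + 10 * 30 = 330 pipelines

print(default_budget, reduced_budget)
# The dictionary could then be passed as TPOTClassifier(**reduced_kwargs).
```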

In [5]:
tpot.export('tpot_5players_pipeline.py')

In [6]:
tpot.score(X_test, y_test)

0.7741935483870968

## 77.4% accuracy for the six players, which is a big advance given that the previous group could not reach reasonable accuracy even with three players. This result was obtained with the following pipeline:
* SelectFromModel: dimensionality reduction through feature selection
* RBFSampler: approximates the feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform
* GradientBoostingClassifier: builds an additive model in a forward stage-wise fashion, which allows the optimization of arbitrary differentiable loss functions
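The three steps listed above can be chained in scikit-learn roughly as follows. This is only a sketch on synthetic data: the estimator inside `SelectFromModel`, `gamma`, `n_components` and the boosting settings are assumptions, while the values actually chosen are in the exported `tpot_5players_pipeline.py`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline

# Synthetic multi-class data standing in for our features (hypothetical).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipeline = make_pipeline(
    # Keep the features whose importance is above the mean importance.
    SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0)),
    # Map the kept features to random Fourier features approximating an RBF kernel.
    RBFSampler(gamma=0.5, n_components=100, random_state=0),
    # Boosted trees trained on the transformed features.
    GradientBoostingClassifier(n_estimators=50, random_state=0),
)
pipeline.fit(X, y)

predictions = pipeline.predict(X)
print("training accuracy:", (predictions == y).mean())
```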

It's important to note that, sampling the trajectories at 1 Hz, we were able to extract:
* 35 points for player 5
* 85 points for player 6
* 88 points for player 7
* 166 points for player 8
* 92 points for player 9
* 64 points for player 10

This is usually not enough data to train a model reliably.
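The counts above are also unbalanced, which matters when reading the score. A small sketch of the arithmetic, assuming the listed counts describe the whole dataset before the train/test split:

```python
# Points extracted per player, copied from the list above.
points_per_player = {5: 35, 6: 85, 7: 88, 8: 166, 9: 92, 10: 64}

total = sum(points_per_player.values())  # 530 points in total

# Share of the dataset belonging to each player.
shares = {player: n / total for player, n in points_per_player.items()}

# Accuracy of a trivial classifier that always predicts the most frequent player.
majority_baseline = max(points_per_player.values()) / total  # 166 / 530

# Ratio between the largest and the smallest class.
imbalance_ratio = max(points_per_player.values()) / min(points_per_player.values())

for player, share in shares.items():
    print(f"player {player}: {share:.1%}")
print(f"majority baseline = {majority_baseline:.3f}")
print(f"imbalance ratio   = {imbalance_ratio:.2f}")
```

Under that assumption, always predicting player 8 would already give about 31% accuracy, so this is the baseline the test score should be compared against.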