# MAGIC Gamma Telescope - TPOT Classification Study

The cells below load the data set and show its first few rows:

In [10]:
# Import required libraries
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import openml as oml

In [14]:
#Load the data
dataset = oml.datasets.get_dataset(1120)
X, y, attribute_names = dataset.get_data(target=dataset.default_target_attribute, return_attribute_names=True)
tele = pd.DataFrame(X, columns=attribute_names)
tele['Class'] = y
tele.head(5)

| | fLength: | fWidth: | fSize: | fConc: | fConc1: | fAsym: | fM3Long: | fM3Trans: | fAlpha: | fDist: | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.700399 | 22.011 | -8.2027 | 40.091999 | 81.882797 | 0 |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.823799 | -9.9574 | 6.3609 | 205.261002 | 0 |
| 2 | 162.052002 | 136.031006 | 4.0612 | 0.0374 | 0.0187 | 116.740997 | -64.858002 | -45.216 | 76.959999 | 256.787994 | 0 |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.210699 | -6.4633 | -7.1513 | 10.449 | 116.737 | 0 |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462006 | 0 |


# Data Analysis using TPOT

To begin our analysis, we divide the data into a training set and a validation set. The validation set serves only to estimate how the fitted pipeline performs on unseen data.

In [15]:
training_indices, validation_indices = train_test_split(tele.index, stratify=y, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size

(14265, 4755)
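The split above passes `stratify=y`, which keeps the class proportions the same in both parts. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; the helper `stratified_split` and the toy labels are hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test part.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels: 60 rows of class 0 and 40 rows of class 1.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)

# Share of class 1 in each part; both stay at the overall 40%.
train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_share, test_share)
```

Without stratification, a random split could leave one class over-represented in the validation set, which would skew the accuracy estimate.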

After that, we call `fit()` on the training set, `score()` on the validation set, and `export()` to save the best pipeline as a Python script.
An important TPOT parameter is the number of generations (the `generations` kwarg). Since our aim is only to illustrate the use of TPOT, we leave it at its default of 100 and instead bound the total running time via the `max_time_mins` kwarg. When a time limit is set, it effectively overrides the generation count: the estimator printed after the run reports `generations=1000000`. We also cap the time allowed for evaluating a single pipeline via `max_eval_time_mins`.

On a standard laptop with 4GB RAM, each generation takes approximately 5 minutes to run. Thus, for the default value of 100 generations and without an explicit time bound, the total run time would be roughly 8 hours.
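The estimate above is simple arithmetic, spelled out here as a small sketch. The variable names are illustrative; the two input figures are the ones quoted in the text.

```python
generations = 100            # TPOT default number of generations, as stated above
minutes_per_generation = 5   # rough per-generation time quoted for the 4GB laptop

total_minutes = generations * minutes_per_generation
total_hours = total_minutes / 60

print(f"{total_minutes} minutes = {total_hours:.1f} hours")  # 500 minutes = 8.3 hours
```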

In [16]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=4, max_eval_time_mins=0.04, population_size=15)
tpot.fit(tele.drop('Class',axis=1).loc[training_indices].values, tele.loc[training_indices,'Class'].values)

Optimization Progress: 37pipeline [00:28,  1.13pipeline/s]                  

Generation 1 - Current best internal CV score: 0.8373637522950468


Optimization Progress: 56pipeline [00:45,  1.10s/pipeline]                  

Generation 2 - Current best internal CV score: 0.8444444368310734


Optimization Progress: 75pipeline [01:01,  1.03s/pipeline]                  

Generation 3 - Current best internal CV score: 0.8444444368310734


Optimization Progress: 95pipeline [01:18,  1.14s/pipeline]

Generation 4 - Current best internal CV score: 0.8444444368310734


Optimization Progress: 114pipeline [01:31,  1.85pipeline/s]

Generation 5 - Current best internal CV score: 0.8444444368310734


Optimization Progress: 130pipeline [01:40,  1.87pipeline/s]

Generation 6 - Current best internal CV score: 0.8445838785648185


Optimization Progress: 148pipeline [01:51,  1.48pipeline/s]

Generation 7 - Current best internal CV score: 0.8445838785648185


Optimization Progress: 166pipeline [02:02,  1.36pipeline/s]

Generation 8 - Current best internal CV score: 0.8484410909914896


Optimization Progress: 181pipeline [02:08,  2.43pipeline/s]

Generation 9 - Current best internal CV score: 0.8484410909914896


Optimization Progress: 196pipeline [02:15,  2.29pipeline/s]

Generation 10 - Current best internal CV score: 0.8484410909914896


Optimization Progress: 214pipeline [02:30,  1.02s/pipeline]

Generation 11 - Current best internal CV score: 0.8484410909914896


Optimization Progress: 231pipeline [02:39,  1.13pipeline/s]

Generation 12 - Current best internal CV score: 0.8484410909914896


Optimization Progress: 252pipeline [02:55,  1.36pipeline/s]

Generation 13 - Current best internal CV score: 0.8484410909914896


Optimization Progress: 271pipeline [03:17,  2.28s/pipeline]

Generation 14 - Current best internal CV score: 0.8505436000360513


Optimization Progress: 291pipeline [03:32,  1.56pipeline/s]

Generation 15 - Current best internal CV score: 0.8505436000360513


Optimization Progress: 313pipeline [03:51,  1.61pipeline/s]

Generation 16 - Current best internal CV score: 0.8505436000360513


                                                           


4.027229666666667 minutes have elapsed. TPOT will close down.
TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: DecisionTreeClassifier(OneHotEncoder(LogisticRegression(VarianceThreshold(input_matrix, threshold=0.005), C=0.5, dual=False, penalty=l2), minimum_fraction=0.2, sparse=False), criterion=gini, max_depth=7, min_samples_leaf=18, min_samples_split=12)


TPOTClassifier(config_dict={'sklearn.naive_bayes.GaussianNB': {}, 'sklearn.naive_bayes.BernoulliNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.naive_bayes.MultinomialNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.tree.DecisionT....45,
        0.5 ,  0.55,  0.6 ,  0.65,  0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,
        0.95,  1.  ])}}}},
        crossover_rate=0.1, cv=5, disable_update_check=False,
        early_stop=None, generations=1000000, max_eval_time_mins=0.04,
        max_time_mins=4, memory=None, mutation_rate=0.9, n_jobs=1,
        offspring_size=15, periodic_checkpoint_folder=None,
        population_size=15, random_state=None, scoring=None, subsample=1.0,
        verbosity=2, warm_start=False)

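In the run shown above, 16 generations completed before the 4-minute limit was reached, each reporting the best internal cross-validation (CV) score found so far on the training set. The best pipeline reached a CV score of 0.8505 (85.05%). It applies a variance threshold to the inputs, fits a logistic regression whose output is added to the data as a synthetic feature, one-hot encodes the result, and passes it to a decision tree classifier that forms the final predictions. TPOT's search is stochastic (`random_state=None`), so results differ between runs: the score and export cells below carry earlier execution counts, and the exported script records a CV score of 0.8533 (85.335%) with slightly different hyperparameters.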
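The idea of a model that "adds synthetic features" for a later model can be sketched in a few lines. This is a simplified pure-Python illustration, not TPOT's `StackingEstimator` (which may also add class probabilities); `MeanThresholdClassifier` and `add_stacked_feature` are hypothetical names, and the toy classifier merely stands in for the logistic regression.

```python
class MeanThresholdClassifier:
    """Toy stand-in for a real estimator.

    Predicts 1 when the first feature exceeds its training mean, else 0.
    """

    def fit(self, X, y):
        self.threshold_ = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [1 if row[0] > self.threshold_ else 0 for row in X]


def add_stacked_feature(estimator, X, y):
    """Fit the estimator and append its predictions as an extra column."""
    estimator.fit(X, y)
    predictions = estimator.predict(X)
    return [list(row) + [pred] for row, pred in zip(X, predictions)]


# Hypothetical toy data: two features per row, binary labels.
X = [[1.0, 5.0], [2.0, 4.0], [8.0, 1.0], [9.0, 0.5]]
y = [0, 0, 1, 1]

X_stacked = add_stacked_feature(MeanThresholdClassifier(), X, y)
for row in X_stacked:
    print(row)  # original two features followed by the first model's prediction
```

A downstream classifier trained on `X_stacked` sees the first model's prediction as one more input column, which is how the decision tree in the pipeline can build on the logistic regression.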

Next, the accuracy on the held-out validation set is computed.

In [13]:
tpot.score(tele.drop('Class',axis=1).loc[validation_indices].values, tele.loc[validation_indices, 'Class'].values)

0.85573080967402737

As can be seen, the accuracy on the validation set is 85.573%.
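To put that figure in concrete terms, the accuracy can be converted back into a count of correctly classified rows. This sketch assumes the score is plain accuracy (the fraction of correct predictions) and uses the validation set size printed by the split cell.

```python
accuracy = 0.85573080967402737   # value returned by tpot.score() above
n_validation = 4755              # validation set size from the split

# Accuracy is correct / total, so the number of correct rows is accuracy * total.
n_correct = round(accuracy * n_validation)
n_wrong = n_validation - n_correct

print(n_correct, n_wrong)  # 4069 686
```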

In [14]:
tpot.export('tpot_MAGIC_Gamma_Telescope_pipeline.py')

True

In [None]:
# %load tpot_MAGIC_Gamma_Telescope_pipeline.py
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.tree import DecisionTreeClassifier
from tpot.builtins import StackingEstimator

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=42)

# Score on the training set was:0.853347788745
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=LogisticRegression(C=10.0, dual=False, penalty="l2")),
    DecisionTreeClassifier(criterion="gini", max_depth=7, min_samples_leaf=5, min_samples_split=7)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
