In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# About

This notebook looks at TPOT, an automated machine learning (AutoML) tool. TPOT uses an evolutionary algorithm to perform model selection and hyperparameter tuning.

# Dataset

Source: [Heterogeneity Activity Recognition Data Set](https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition)

The dataset and accompanying research can be found at UCI's dataset repository. The copy used in this notebook was transformed from the original dataset in `process_datasets.ipynb`.

In [2]:
data_directory = os.environ["DATASET"] + "/heterogeneity_activity_recognition"
data_path = f"{data_directory}/processed/phones.zip"

In [3]:
%time df = pd.read_csv(data_path); df.head(1)

CPU times: user 13 s, sys: 435 ms, total: 13.4 s
Wall time: 13.4 s


Unnamed: 0,arrival_time,target,user,x_accel,x_gyro,y_accel,y_gyro,z_accel,z_gyro
0,1424779162870,stand,f,-1.618774,0.009163,0.029892,-0.01741,10.02536,0.009163


# Factorize Categorical Data

TPOT expects only numeric data, but `target` and `user` are stored as strings. Both columns are categorical, each containing 10-20 unique values. `pd.factorize` assigns each category an integer code, starting at 0, based on the order in which it first appears.

The numerical representation can be decoded using the `*_uniques` variables (`target_uniques` and `user_uniques`). Each one holds the unique categories in order of first appearance, and the numeric code of a category is the position of its string in that collection.

```
Categorical data -> numerical representation  
["walk", "stand", "bike", "sit"] -> [0, 1, 2, 3]
```
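
As a quick check of this mapping, the sketch below runs `pd.factorize` on a small made-up series (the values are illustrative, not taken from the dataset) and decodes the codes back to strings by indexing into the returned uniques.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself factorizes df.target and df.user.
activities = pd.Series(["walk", "stand", "walk", "bike", "sit", "stand"])

# codes: one integer label per row; uniques: categories in order of first appearance.
codes, uniques = pd.factorize(activities)

# Decode by using each code as a position in `uniques`.
decoded = uniques[codes]

print(codes.tolist())                        # [0, 1, 0, 2, 3, 1]
print(list(uniques))                         # ['walk', 'stand', 'bike', 'sit']
print(list(decoded) == activities.tolist())  # True
```

The same indexing, for example `target_uniques[predictions]`, should turn numeric model predictions back into activity names.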

In [4]:
target_labels, target_uniques = pd.factorize(df.target)
user_labels, user_uniques = pd.factorize(df.user)

## Apply Change and Select Subset
Factorizations are applied in this step, and a subset of the dataset is selected.

In testing, TPOT took a very long time to complete a single generation on the full dataset, so I opted to select 100,000 random samples instead.

In [5]:
df = (
    df.assign(
        target=lambda _: target_labels,
        user=lambda _: user_labels
    ).sample(n=100_000)
)

In [6]:
df.head(1)

Unnamed: 0,arrival_time,target,user,x_accel,x_gyro,y_accel,y_gyro,z_accel,z_gyro
2792316,1424776859409,1,5,5.219009,0.008552,0.220764,-0.000611,8.22084,-0.009163


# Split Test And Train Sets

In [7]:
X_train, X_test, y_train, y_test = (
    train_test_split(df.drop(columns=["target"]), df.target)
)
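
`train_test_split` is called here with its defaults, which hold out 25% of the rows for testing and shuffle differently on each run. If a reproducible, class-balanced split is wanted, a minimal sketch could look like the following. The `random_state` value, the `stratify` argument, and the toy frame are assumptions for illustration and are not what this notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the factorized dataframe.
toy = pd.DataFrame({
    "x_accel": [float(i) for i in range(40)],
    "target": [0, 1, 2, 3] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns=["target"]),
    toy.target,
    test_size=0.25,       # same fraction as the default
    stratify=toy.target,  # keep class proportions similar in both splits
    random_state=0,       # assumed seed, only for repeatability
)

print(len(X_tr), len(X_te))  # 30 10
```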

In [8]:
X_train.head(1)

Unnamed: 0,arrival_time,user,x_accel,x_gyro,y_accel,y_gyro,z_accel,z_gyro
6870503,1424782840623,4,-2.236481,0.47905,-1.016281,-0.94046,5.656189,0.103027


In [9]:
X_test.head(1)

Unnamed: 0,arrival_time,user,x_accel,x_gyro,y_accel,y_gyro,z_accel,z_gyro
353997,1424781800470,4,-3.217767,0.007025,-0.459681,-0.015272,9.19362,0.020464


In [10]:
y_train.head(1)

6870503    4
Name: target, dtype: int64

In [11]:
y_test.head(1)

353997    0
Name: target, dtype: int64

# Classify

## Train

In [12]:
%%time
tpot = TPOTClassifier(generations=8, population_size=32, verbosity=2, n_jobs=1)
tpot.fit(X_train, y_train)

Generation 1 - Current best internal CV score: 0.9991466666666666
Generation 2 - Current best internal CV score: 0.9991466666666666
Generation 3 - Current best internal CV score: 0.9991466666666666
Generation 4 - Current best internal CV score: 0.9991733333333332
Generation 5 - Current best internal CV score: 0.9991733333333332
Generation 6 - Current best internal CV score: 0.9991733333333332
Generation 7 - Current best internal CV score: 0.9991733333333332
Generation 8 - Current best internal CV score: 0.9991733333333332

Best pipeline: RandomForestClassifier(MinMaxScaler(Nystroem(input_matrix, gamma=0.2, kernel=polynomial, n_components=1)), bootstrap=False, criterion=entropy, max_features=0.7500000000000001, min_samples_leaf=9, min_samples_split=16, n_estimators=100)
CPU times: user 7h 35min 29s, sys: 6min 25s, total: 7h 41min 54s
Wall time: 7h 31min 46s


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=8,
               max_eval_time_mins=5, max_time_mins=None, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=32,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)
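
To put the run time in context: per the TPOT documentation, a run evaluates `population_size + generations * offspring_size` pipelines, and `offspring_size` falls back to `population_size` when left as `None`. The arithmetic below applies that to the settings in this cell and to the reported wall time. The per-pipeline figure is only an average, since each candidate is cross-validated and evaluation times vary widely.

```python
generations = 8
population_size = 32
offspring_size = population_size  # offspring_size=None defaults to population_size

total_pipelines = population_size + generations * offspring_size
print(total_pipelines)  # 288

# Wall time reported above: 7h 31min 46s
wall_seconds = 7 * 3600 + 31 * 60 + 46
print(round(wall_seconds / total_pipelines, 1))  # 94.1 seconds per pipeline on average
```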

## Score

In [13]:
%time tpot.score(X_test, y_test)

CPU times: user 740 ms, sys: 526 ms, total: 1.27 s
Wall time: 301 ms


0.99948

# Export Model

In [14]:
tpot.export("../model/model_01.py")
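
`export` writes a standalone Python script for the best pipeline. The exported file itself is not shown in this notebook, so the following is only a hand-written sketch of an equivalent scikit-learn pipeline, built from the "Best pipeline" line printed during training. The synthetic data from `make_classification` and the `random_state` values are assumptions used to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# TPOT prints pipelines inside-out: the innermost step (Nystroem) runs first.
pipeline = make_pipeline(
    Nystroem(gamma=0.2, kernel="polynomial", n_components=1),
    MinMaxScaler(),
    RandomForestClassifier(
        bootstrap=False,
        criterion="entropy",
        max_features=0.75,
        min_samples_leaf=9,
        min_samples_split=16,
        n_estimators=100,
        random_state=0,  # assumed, for repeatability only
    ),
)

# Synthetic stand-in data; the notebook would use X_train / y_train instead.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
pipeline.fit(X_demo, y_demo)
print(pipeline.predict(X_demo[:5]).shape)  # (5,)
```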