This document will:
- demonstrate loading and inspecting features from our data set
- fit three simple models to the features, producing cross-validation scores for each.

Alden Bradford, February 15 2022

In [1]:
import numpy as np
import pandas as pd
from make_features import load_data, make_features

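We break up the data into two tables: `incidents`, with one row per incident, and `acceleration`, with one row per accelerometer sample. This puts the data in (roughly) "third normal form", i.e. per-incident columns are not repeated on every sample.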
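
If a single wide table is ever needed, the two tables can be recombined on their shared `incident_id`. The following is a minimal sketch using toy stand-ins for the two tables; the values, the `example` label, and the `*_demo` names are made up for illustration and do not come from the data set.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values, same index layout as above).
incidents_demo = pd.DataFrame(
    {"motion": ["other", "example"]},
    index=pd.Index([1, 2], name="incident_id"),
)
acceleration_demo = pd.DataFrame(
    {"x": [0.1, 0.2, 0.3], "y": [0.0, 0.1, 0.0], "z": [1.0, 0.9, 1.1]},
    index=pd.MultiIndex.from_tuples(
        [(1, 40), (1, 80), (2, 40)], names=["incident_id", "milliseconds"]
    ),
)

# Merging on incident_id repeats each incident's columns on every sample,
# which is exactly the redundancy the two-table layout avoids.
wide = acceleration_demo.reset_index().merge(
    incidents_demo.reset_index(), on="incident_id"
)
print(wide)
```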

In [2]:
incidents, acceleration = load_data(filename='../raw-data/har_raw.gz', no_cache=True, drop_batches=False)

In [3]:
display(incidents.head())
display(acceleration.head())

incident_id,hash_id,motion,occurrence_ts,confirmation_ts
729353,0c96025713b01a04beff5193cbf7d76d,other,NaT,NaT
729389,0c96025713b01a04beff5193cbf7d76d,other,NaT,NaT
729405,0c96025713b01a04beff5193cbf7d76d,other,NaT,NaT
730067,0c96025713b01a04beff5193cbf7d76d,other,NaT,NaT
730071,0c96025713b01a04beff5193cbf7d76d,other,NaT,NaT


incident_id,milliseconds,x,y,z
729353,40,-0.59,-1.02,-0.15
729353,80,-0.57,-1.1,-0.16
729353,120,-0.5,-1.21,-0.27
729353,160,-0.51,-1.29,-0.34
729353,200,-0.71,-1.26,-0.43


Let's measure how long it takes to make our features.

In [4]:
%%time
features = make_features(use_data=(incidents, acceleration))

CPU times: user 15 s, sys: 50.1 ms, total: 15.1 s
Wall time: 15.4 s


The first several features are computed using the entire data range:

In [5]:
pd.set_option('display.max_columns', 23)
features.iloc[:, :23].head()

incident_id,maximum,minimum,range,mean,standard deviation,variance,skew,kurtosis,total variation,mean x,mean y,mean z,peak x,peak y,peak z,stillness,middle of stillness,angular path length,biggest angle difference,angle between incident and vertical,low frequency power,medium frequency power,high frequency power
729353,3.299182,0.200998,3.098184,1.118297,0.433189,0.187653,1.235819,3.329116,63.578636,0.207787,-0.245013,0.098907,2.77,1.19,-1.34,0.060225,9.18,5.734693,3.129214,0.620191,0.202031,0.275075,0.024166
729389,3.102193,0.232809,2.869384,1.155901,0.357949,0.128128,2.07692,8.020846,30.870376,0.037547,-0.2608,0.867707,-0.59,-0.03,2.96,0.114159,12.62,3.209399,1.738934,0.553376,0.181291,0.030584,0.001282
729405,5.28337,0.037417,5.245954,1.030549,0.460845,0.212378,5.138048,41.19254,40.827839,0.026533,0.512987,-0.039627,-2.79,-3.19,-0.74,0.003798,14.42,6.588562,2.830798,2.023228,0.342383,0.196719,0.023568
730067,3.245381,0.305287,2.940095,1.006611,0.198571,0.03943,4.00856,43.424537,34.557172,0.255493,0.630773,0.514507,0.45,2.16,2.38,0.054385,9.22,2.285753,1.082905,0.38294,0.015772,0.052258,0.023664
730071,3.598389,0.179722,3.418667,1.00441,0.270733,0.073296,4.564226,36.720746,33.030022,0.054827,0.8228,0.389413,1.95,1.99,1.52,0.025806,11.54,1.522685,0.599753,0.640125,0.057455,0.060873,0.004153


Some of the features make sense to compute on a limited time range. For now they are computed across five evenly spaced, overlapping windows, though we may choose our windows more deliberately in the future. Here is the first window as an example; the numbers in each column name are the lowest and highest times included in the window, in milliseconds.
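
To make "evenly spaced overlapping windows" concrete, here is a small sketch of how such window boundaries could be generated. Only the first window (`0:5000`) and the count of five are confirmed above; the total recording length of 15000 ms and the helper name `evenly_spaced_windows` are assumptions made for illustration, and the actual logic in `make_features` may differ.

```python
def evenly_spaced_windows(total_ms, width_ms, count):
    """Return (start, stop) pairs for `count` windows of `width_ms`
    whose start times are spread evenly over [0, total_ms - width_ms]."""
    if count == 1:
        return [(0, width_ms)]
    step = (total_ms - width_ms) / (count - 1)
    starts = [round(i * step) for i in range(count)]
    return [(start, start + width_ms) for start in starts]


# Assumed recording length: 15 seconds (hypothetical, not confirmed above).
windows = evenly_spaced_windows(total_ms=15_000, width_ms=5_000, count=5)
print(windows)
# [(0, 5000), (2500, 7500), (5000, 10000), (7500, 12500), (10000, 15000)]
```

Under these assumptions each window overlaps its neighbour by half its width.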

In [6]:
features.iloc[:, 23:40].head()

incident_id,window 0:5000 maximum,window 0:5000 minimum,window 0:5000 range,window 0:5000 mean,window 0:5000 standard deviation,window 0:5000 variance,window 0:5000 skew,window 0:5000 kurtosis,window 0:5000 total variation,window 0:5000 mean x,window 0:5000 mean y,window 0:5000 mean z,window 0:5000 angular path length,window 0:5000 biggest angle difference,window 0:5000 low frequency power,window 0:5000 medium frequency power,window 0:5000 high frequency power
729353,2.397186,0.200998,2.196188,1.086205,0.393646,0.154957,0.825898,1.78488,21.72754,-0.06336,-0.72312,-0.27128,1.583145,0.692779,0.117772,0.169372,0.013098
729389,1.643229,0.232809,1.41042,1.039304,0.248141,0.061574,-0.019261,0.759484,8.858348,-0.03192,0.388,0.67592,2.406741,1.55812,0.055926,0.022229,0.001087
729405,2.831501,0.037417,2.794085,0.984442,0.424346,0.18007,0.681795,4.677479,16.605596,-0.0224,0.34152,-0.57232,2.65716,1.789926,0.03169,0.021377,0.005594
730067,1.793377,0.555788,1.237589,0.99367,0.181772,0.033041,1.107537,3.901957,11.328262,0.4388,0.4364,0.64888,0.995328,0.672004,0.005488,0.015711,0.004786
730071,1.476516,0.761643,0.714873,0.975025,0.122164,0.014924,1.292014,2.726762,9.068697,-0.05088,0.82752,0.3868,0.620609,0.337053,0.009785,0.010514,0.002967


A description of what each feature may represent, as well as how it is computed, is available in `make_features/features.py`.

Finally, we present a handful of simple baseline models which are quick to train. We evaluate each with stratified 5-fold cross-validation.

In [7]:
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

In [8]:
scoring = 'precision recall f1'.split()
classifiers = {
    'KNeighbors': KNeighborsClassifier(3),
    'Decision tree': DecisionTreeClassifier(max_depth=5),
    'Gaussian Naive Bayes': GaussianNB(),
}
X = StandardScaler().fit_transform(features)
y = incidents['motion'] != 'other'
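
The cell above standardizes `features` once, using every row, before cross-validation splits the data. A common alternative is to place the scaler inside a pipeline so that it is refit on the training portion of each fold. The sketch below shows that pattern on synthetic stand-in data (`X_demo` and `y_demo` are made up, not taken from the data set) with one of the three classifiers; it is an illustration of the technique, not the method used to produce the results in this document.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, only so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.random(200) < 0.3

# The scaler is a pipeline step, so cross_validate fits it on each
# training fold and applies it unchanged to the held-out fold.
model = make_pipeline(StandardScaler(), GaussianNB())
scores = cross_validate(
    model,
    X_demo,
    y_demo,
    scoring=["precision", "recall", "f1"],
    cv=StratifiedKFold(),
)
print(sorted(scores))
```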

In [9]:
for name, clf in classifiers.items():
    print(f'Results for the {name} classifier:')
    display(pd.DataFrame(cross_validate(clf, X, y, scoring=scoring, cv=StratifiedKFold())))
    print('*'*20)

Results for the KNeighbors classifier:


fold,fit_time,score_time,test_precision,test_recall,test_f1
0,0.001628,0.035571,0.333333,0.050847,0.088235
1,0.001755,0.03343,0.210526,0.067797,0.102564
2,0.00172,0.032899,0.212766,0.166667,0.186916
3,0.001688,0.034903,0.2,0.1,0.133333
4,0.001696,0.033417,0.194444,0.116667,0.145833


********************
Results for the Decision tree classifier:


fold,fit_time,score_time,test_precision,test_recall,test_f1
0,0.12169,0.003409,0.166667,0.016949,0.030769
1,0.117651,0.003031,0.1,0.016949,0.028986
2,0.115715,0.003033,0.066667,0.016667,0.026667
3,0.115695,0.003024,0.454545,0.083333,0.140845
4,0.115899,0.003055,0.238095,0.083333,0.123457


********************
Results for the Gaussian Naive Bayes classifier:


fold,fit_time,score_time,test_precision,test_recall,test_f1
0,0.004273,0.003668,0.145729,0.491525,0.224806
1,0.003643,0.00358,0.150376,0.677966,0.246154
2,0.003519,0.003563,0.102326,0.733333,0.179592
3,0.003459,0.003688,0.13245,0.666667,0.220994
4,0.00344,0.003746,0.119681,0.75,0.206422


********************


The metrics computed above are:
\begin{align*}
\text{precision} &= \frac{\text{true positives}}{\text{false positives} + \text{true positives}}
\\
\text{recall} &= \frac{\text{true positives}}{\text{false negatives} + \text{true positives}}
\\
\text{f1} &= \frac{2}{\frac{1}{\text{precision}}+\frac{1}{\text{recall}}}
\end{align*}
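
As a worked example of these formulas, the sketch below computes all three metrics from raw counts. The counts (3 true positives, 6 false positives, 56 false negatives) are inferred from the printed scores of the first KNeighbors fold rather than read from the data, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and f1 from confusion-matrix counts.

    Assumes tp > 0, so that neither ratio is zero or undefined.
    """
    precision = tp / (fp + tp)
    recall = tp / (fn + tp)
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two
    return precision, recall, f1


# Counts inferred from the first KNeighbors fold (an assumption).
precision, recall, f1 = precision_recall_f1(tp=3, fp=6, fn=56)
print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.333 0.051 0.088
```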

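We can see that:
- KNeighbors has fairly consistent precision (about 0.2 in four of the five folds) but low recall,
- decision tree has highly variable precision (from 0.07 to 0.45) and the lowest recall, and
- Gaussian naive Bayes trades low precision (0.10 to 0.15) for much higher recall (0.49 to 0.75), which gives it the highest f1 in four of the five folds.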
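
To compare the classifiers at a glance, the per-fold scores can be collapsed to a mean and standard deviation. The sketch below does this for `test_f1`, with the values copied from the tables above; the variable names are illustrative.

```python
import pandas as pd

# test_f1 per fold, copied from the cross-validation output above.
f1_scores = pd.DataFrame(
    {
        "KNeighbors": [0.088235, 0.102564, 0.186916, 0.133333, 0.145833],
        "Decision tree": [0.030769, 0.028986, 0.026667, 0.140845, 0.123457],
        "Gaussian Naive Bayes": [0.224806, 0.246154, 0.179592, 0.220994, 0.206422],
    }
)

# One row per classifier, with the mean and standard deviation across folds.
summary = f1_scores.agg(["mean", "std"]).T
print(summary.round(3))
```

With only five folds the standard deviation is a rough guide rather than a precise estimate.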