<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#EDA" data-toc-modified-id="EDA-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>EDA</a></span><ul class="toc-item"><li><span><a href="#Dataset-Information" data-toc-modified-id="Dataset-Information-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Dataset Information</a></span></li><li><span><a href="#Data-Cleaning" data-toc-modified-id="Data-Cleaning-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Data Cleaning</a></span></li><li><span><a href="#Data-Visualizations" data-toc-modified-id="Data-Visualizations-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Data Visualizations</a></span><ul class="toc-item"><li><span><a href="#Target-Variable" data-toc-modified-id="Target-Variable-3.3.1"><span class="toc-item-num">3.3.1&nbsp;&nbsp;</span>Target Variable</a></span></li><li><span><a href="#Features" data-toc-modified-id="Features-3.3.2"><span class="toc-item-num">3.3.2&nbsp;&nbsp;</span>Features</a></span></li></ul></li></ul></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#Sitting-Label" data-toc-modified-id="Sitting-Label-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Sitting Label</a></span></li><li><span><a href="#Indoors-Label" data-toc-modified-id="Indoors-Label-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Indoors Label</a></span></li><li><span><a href="#Phone-on-Table" data-toc-modified-id="Phone-on-Table-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Phone on Table</a></span></li></ul></li><li><span><a href="#Evaluate-Best-Models:-(KNN-&amp;-RFC)" 
data-toc-modified-id="Evaluate-Best-Models:-(KNN-&amp;-RFC)-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Evaluate Best Models: (KNN &amp; RFC)</a></span></li></ul></div>

# Introduction

First: Go out and find a dataset of interest. It could be from one of our recommended resources, some other aggregation, or scraped yourself. Just make sure it has lots of variables in it, including an outcome of interest to you.

Second: Explore the data. Get to know the data. Spend a lot of time going over its quirks and peccadilloes. You should understand how it was gathered, what's in it, and what the variables look like.

Third: Model your outcome of interest. You should try several different approaches and really work to tune a variety of models before using the model evaluation techniques to choose what you consider to be the best performer. Make sure to think about explanatory versus predictive power and experiment with both.

So, here is the deliverable: Prepare a slide deck and 15 minute presentation that guides viewers through your model. Be sure to cover a few specific things:

- A specified research question your model addresses
- How you chose your model specification and what alternatives you compared it to
- The practical uses of your model for an audience of interest
- Any weak points or shortcomings of your model

# Imports

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import time
from matplotlib.legend_handler import HandlerLine2D


from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import normalize
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn import ensemble
from sklearn.neural_network import MLPClassifier
from scipy.stats import ttest_ind

import operator
import os


%matplotlib inline

# Options for pandas
pd.options.display.max_columns = 100
pd.options.display.max_rows = 500

In [13]:
#use sample user to create separate list of features and target labels to be used for cell below

# sample_user = pd.read_csv('data/1538C99F-BA1E-4EFB-A949-6C7C47701B20.features_labels.csv.gz')

# list of one-hot encoded labels
#label_col = label_col = [col for col in list(sample_user.columns) if col[:6] == 'label:']
# list of explanatory features
#all_feature_list = [col for col in list(sample_user.columns) if col not in label_col and col != 'timestamp']

In [14]:
# #create master dataframe with all users
# directory = os.fsencode('data/')
# master_df = pd.DataFrame()

# for i,file in enumerate(os.listdir(directory)):
#     filename = os.fsdecode(file)
#     if filename.endswith(".gz") or filename.endswith(".csv"): 
#         user = pd.read_csv('data/' + filename)
#         #labels each observation by user number
#         user['user_num'] = i
        
#         #linear interpolate explanatory columns for each user df individually
#         nan_col = [col for col in all_feature_list if user[col].isnull().sum()/user[col].isnull().count()*100 > 0]
        
#         for col in nan_col:
#             user[col].interpolate(method='linear', limit_direction='both', inplace=True)
        
#         #fill nan with zero for dummy labels
#         for col in label_col:
#             user[col].fillna(0, axis=0, inplace=True)
#         #add user dataframe to master dataframe
#         master_df = master_df.append(user)
# master_df.to_csv('data/master_df.csv')

In [15]:
master_df = pd.read_csv('data/master_df.csv')

# EDA

## Dataset Information

There are 60 participants in this [Kaggle dataset](https://www.kaggle.com/yvaizman/the-extrasensory-dataset#user2.features_labels.csv) (34 females and 26 males). The majority of participants are UCSD undergraduate and graduate students or research assistants. Users participated for an average of 7.6 days (std: 3.2).

The data from each user was collected by a mobile app called ExtraSensory. The majority of the data was collected by built-in sensors in the user's phone and in the smartwatches provided to users. The sensor data was recorded in 20-second windows at intervals of about one minute, with some time gaps in between. The sensors range from movement sensors such as the accelerometer, gyroscope and magnetometer to audio and location sensors. For each sensor, the dataset contains several calculated metrics, such as the mean, median and standard deviation. Some sensor columns, such as location and audio, have missing observations for user privacy reasons. Additional binary columns, such as WiFi availability and time of day, were self-reported by the user.
 
The user also self-reported what activity they were doing or what location they were at within the recorded timeframe. The user was able to choose from general movement activities (e.g. lying down, sitting, standing in place, walking, or running), specific movement activities (e.g. eating, texting or exercising) or general locations (e.g. gym, workplace, or home).

Sensor Information according to [ExtraSensory UCSD dataset website](http://extrasensory.ucsd.edu/): 

- accelerometer: Tri-axial direction and magnitude of acceleration. 40Hz 
- gyroscope:	Rate of rotation around phone's 3 axes. 40Hz
- magnetometer:	Tri-axial direction and magnitude of magnetic field. 40Hz
- watch accelerometer:	Tri-axial acceleration from the watch. 25Hz
- watch compass:	Watch heading (degrees). nC samples (whenever changes in 1deg)
- location:	Latitude, longitude, altitude, speed, accuracies. nL samples (whenever changed enough)
- location (quick):	Quick location-variability features (no absolute coordinates) calculated on the phone.
- audio:	22kHz for ~20sec. Then 13 MFCC features from half overlapping 96msec frames.
- audio magnitude:	Max absolute value of recorded audio, before it was normalized.	
- phone state:	App status, battery state, WiFi availability, on the phone, time-of-day.
- additional:	Light, air pressure, humidity, temperature, proximity. If available sampled once in session
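
The feature columns in `master_df` follow a `sensor:group:statistic` naming pattern (for example `raw_acc:magnitude_stats:mean`), so the sensor behind each column can be recovered from its prefix. The following is a minimal sketch on a handful of column names copied from the dataframe; the helper `sensor_of` is a hypothetical name introduced only for illustration.

```python
from collections import Counter

# A few column names copied from master_df.
columns = [
    "raw_acc:magnitude_stats:mean",
    "raw_acc:3d:mean_x",
    "proc_gyro:magnitude_stats:mean",
    "proc_gyro:3d:std_y",
    "label:SITTING",
    "timestamp",
]

def sensor_of(column):
    """Return the prefix before the first colon, or None if there is none."""
    return column.split(":", 1)[0] if ":" in column else None

# Count feature columns per sensor, skipping labels and unprefixed columns.
sensor_counts = Counter(
    sensor_of(c) for c in columns
    if sensor_of(c) not in (None, "label")
)
print(sensor_counts)
```

Grouping columns this way makes it easier to see which sensors contribute the most features before any filtering is applied.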

In [16]:
master_df.head()

Unnamed: 0.1,Unnamed: 0,timestamp,raw_acc:magnitude_stats:mean,raw_acc:magnitude_stats:std,raw_acc:magnitude_stats:moment3,raw_acc:magnitude_stats:moment4,raw_acc:magnitude_stats:percentile25,raw_acc:magnitude_stats:percentile50,raw_acc:magnitude_stats:percentile75,raw_acc:magnitude_stats:value_entropy,raw_acc:magnitude_stats:time_entropy,raw_acc:magnitude_spectrum:log_energy_band0,raw_acc:magnitude_spectrum:log_energy_band1,raw_acc:magnitude_spectrum:log_energy_band2,raw_acc:magnitude_spectrum:log_energy_band3,raw_acc:magnitude_spectrum:log_energy_band4,raw_acc:magnitude_spectrum:spectral_entropy,raw_acc:magnitude_autocorrelation:period,raw_acc:magnitude_autocorrelation:normalized_ac,raw_acc:3d:mean_x,raw_acc:3d:mean_y,raw_acc:3d:mean_z,raw_acc:3d:std_x,raw_acc:3d:std_y,raw_acc:3d:std_z,raw_acc:3d:ro_xy,raw_acc:3d:ro_xz,raw_acc:3d:ro_yz,proc_gyro:magnitude_stats:mean,proc_gyro:magnitude_stats:std,proc_gyro:magnitude_stats:moment3,proc_gyro:magnitude_stats:moment4,proc_gyro:magnitude_stats:percentile25,proc_gyro:magnitude_stats:percentile50,proc_gyro:magnitude_stats:percentile75,proc_gyro:magnitude_stats:value_entropy,proc_gyro:magnitude_stats:time_entropy,proc_gyro:magnitude_spectrum:log_energy_band0,proc_gyro:magnitude_spectrum:log_energy_band1,proc_gyro:magnitude_spectrum:log_energy_band2,proc_gyro:magnitude_spectrum:log_energy_band3,proc_gyro:magnitude_spectrum:log_energy_band4,proc_gyro:magnitude_spectrum:spectral_entropy,proc_gyro:magnitude_autocorrelation:period,proc_gyro:magnitude_autocorrelation:normalized_ac,proc_gyro:3d:mean_x,proc_gyro:3d:mean_y,proc_gyro:3d:mean_z,proc_gyro:3d:std_x,proc_gyro:3d:std_y,...,label:FIX_running,label:BICYCLING,label:SLEEPING,label:LAB_WORK,label:IN_CLASS,label:IN_A_MEETING,label:LOC_main_workplace,label:OR_indoors,label:OR_outside,label:IN_A_CAR,label:ON_A_BUS,label:DRIVE_-_I_M_THE_DRIVER,label:DRIVE_-_I_M_A_PASSENGER,label:LOC_home,label:FIX_restaurant,label:PHONE_IN_POCKET,label:OR_exercise,label:COOKING,label:SHOPPING,label:STROLLING,label:DRINKING__ALCOHOL_,label:BATHING_-_SHOWER,label:CLEANING,label:DOING_LAUNDRY,label:WASHING_DISHES,label:WATCHING_TV,label:SURFING_THE_INTERNET,label:AT_A_PARTY,label:AT_A_BAR,label:LOC_beach,label:SINGING,label:TALKING,label:COMPUTER_WORK,label:EATING,label:TOILET,label:GROOMING,label:DRESSING,label:AT_THE_GYM,label:STAIRS_-_GOING_UP,label:STAIRS_-_GOING_DOWN,label:ELEVATOR,label:OR_standing,label:AT_SCHOOL,label:PHONE_IN_HAND,label:PHONE_IN_BAG,label:PHONE_ON_TABLE,label:WITH_CO-WORKERS,label:WITH_FRIENDS,label_source,user_num
0,0,1446762297,1.057536,0.040597,-0.048977,0.124759,1.053158,1.057091,1.060935,0.344809,6.683838,5.043598,0.001461,0.006744,0.00689,0.110444,0.431683,0.039615,0.38589,0.011074,0.026759,1.056769,0.017011,0.022961,0.040173,-0.35605,0.554204,-0.193868,0.048134,0.177067,0.325819,0.464058,0.006821,0.009408,0.018374,0.372994,4.851353,5.371533,3.93615,5.000829,4.418333,4.907916,3.472832,5.149856,0.025094,-0.001249,-0.00872,0.006114,0.139305,0.105654,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2,0
1,1,1446762357,1.057436,0.006165,-0.009415,0.018645,1.055086,1.057279,1.060143,1.014093,6.684595,5.042748,0.000283,0.001101,0.00112,0.02537,0.429888,0.119004,0.147357,0.011879,0.029614,1.05693,0.005343,0.004747,0.006148,-0.452366,-0.024794,0.325594,0.006097,0.003335,0.004951,0.007649,0.004764,0.006212,0.0074,1.679151,6.5527,5.17251,2.233978,3.352239,3.291783,4.208239,1.535571,5.051457,0.118545,0.00196,-0.004092,0.000778,0.002348,0.004326,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2,0
2,2,1446762417,1.056344,0.006302,-0.004635,0.013525,1.053282,1.056208,1.059165,1.429112,6.684594,5.043642,0.000189,0.000943,0.001062,0.021535,0.430053,0.118879,0.095604,0.012864,0.028037,1.055872,0.005085,0.004552,0.006283,-0.272253,0.079819,0.184492,0.003138,0.002708,0.004468,0.006665,0.001507,0.002382,0.003841,1.457233,6.432723,5.316336,3.282392,3.830023,2.61873,4.598896,2.058885,3.63039,0.200703,0.0012,-0.000388,-8.4e-05,0.002136,0.002836,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2,0
3,3,1446762487,1.056874,0.004767,-0.002796,0.007088,1.053958,1.05701,1.059996,2.190168,6.684602,5.043075,9.4e-05,0.000672,0.000914,0.010632,0.429714,3.230235,0.085321,0.011657,0.027129,1.056438,0.005518,0.004232,0.004721,-0.20612,0.098946,0.373992,0.004039,0.001741,0.001806,0.002655,0.003013,0.003841,0.004882,2.370593,6.594423,4.989078,1.225513,2.888764,2.714748,4.00933,1.178862,0.71343,0.160733,0.002215,-0.001624,0.000879,0.001638,0.00233,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2,0
4,4,1446762548,1.057353,0.005415,0.006585,0.010781,1.054161,1.0571,1.059976,1.827865,6.684599,5.043392,0.000166,0.000832,0.000914,0.012051,0.429827,0.099193,0.095939,0.012421,0.026983,1.05692,0.003926,0.004167,0.005391,-0.216187,-0.018669,0.203776,0.003714,0.00237,0.005078,0.008566,0.002609,0.003369,0.004392,1.049581,6.559021,4.863858,2.236227,3.875376,3.677466,5.038129,2.33522,4.935066,0.079536,0.002228,-0.001545,0.001001,0.00165,0.00253,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2,0
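
The `timestamp` column appears to hold Unix epoch seconds (an assumption based on the magnitude of the values), so the sampling interval can be checked directly. A small sketch using the five timestamps from the rows above:

```python
import pandas as pd

# Timestamps copied from the five rows above; assumed to be Unix epoch seconds.
ts = pd.Series([1446762297, 1446762357, 1446762417, 1446762487, 1446762548])

# Convert to datetimes and measure the gap between consecutive observations.
when = pd.to_datetime(ts, unit="s")
gaps = ts.diff().dropna().astype(int).tolist()

print(when.dt.strftime("%Y-%m-%d %H:%M:%S").tolist())
print(gaps)  # seconds between consecutive rows
```

The gaps for these rows are close to one minute, which is consistent with the collection schedule described in the dataset information.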


In [17]:
master_df.describe()

Unnamed: 0.1,Unnamed: 0,timestamp,raw_acc:magnitude_stats:mean,raw_acc:magnitude_stats:std,raw_acc:magnitude_stats:moment3,raw_acc:magnitude_stats:moment4,raw_acc:magnitude_stats:percentile25,raw_acc:magnitude_stats:percentile50,raw_acc:magnitude_stats:percentile75,raw_acc:magnitude_stats:value_entropy,raw_acc:magnitude_stats:time_entropy,raw_acc:magnitude_spectrum:log_energy_band0,raw_acc:magnitude_spectrum:log_energy_band1,raw_acc:magnitude_spectrum:log_energy_band2,raw_acc:magnitude_spectrum:log_energy_band3,raw_acc:magnitude_spectrum:log_energy_band4,raw_acc:magnitude_spectrum:spectral_entropy,raw_acc:magnitude_autocorrelation:period,raw_acc:magnitude_autocorrelation:normalized_ac,raw_acc:3d:mean_x,raw_acc:3d:mean_y,raw_acc:3d:mean_z,raw_acc:3d:std_x,raw_acc:3d:std_y,raw_acc:3d:std_z,raw_acc:3d:ro_xy,raw_acc:3d:ro_xz,raw_acc:3d:ro_yz,proc_gyro:magnitude_stats:mean,proc_gyro:magnitude_stats:std,proc_gyro:magnitude_stats:moment3,proc_gyro:magnitude_stats:moment4,proc_gyro:magnitude_stats:percentile25,proc_gyro:magnitude_stats:percentile50,proc_gyro:magnitude_stats:percentile75,proc_gyro:magnitude_stats:value_entropy,proc_gyro:magnitude_stats:time_entropy,proc_gyro:magnitude_spectrum:log_energy_band0,proc_gyro:magnitude_spectrum:log_energy_band1,proc_gyro:magnitude_spectrum:log_energy_band2,proc_gyro:magnitude_spectrum:log_energy_band3,proc_gyro:magnitude_spectrum:log_energy_band4,proc_gyro:magnitude_spectrum:spectral_entropy,proc_gyro:magnitude_autocorrelation:period,proc_gyro:magnitude_autocorrelation:normalized_ac,proc_gyro:3d:mean_x,proc_gyro:3d:mean_y,proc_gyro:3d:mean_z,proc_gyro:3d:std_x,proc_gyro:3d:std_y,...,label:FIX_running,label:BICYCLING,label:SLEEPING,label:LAB_WORK,label:IN_CLASS,label:IN_A_MEETING,label:LOC_main_workplace,label:OR_indoors,label:OR_outside,label:IN_A_CAR,label:ON_A_BUS,label:DRIVE_-_I_M_THE_DRIVER,label:DRIVE_-_I_M_A_PASSENGER,label:LOC_home,label:FIX_restaurant,label:PHONE_IN_POCKET,label:OR_exercise,label:COOKING,label:SHOPPING,label:STROLLING,label:DRINKING__ALCOHOL_,label:BATHING_-_SHOWER,label:CLEANING,label:DOING_LAUNDRY,label:WASHING_DISHES,label:WATCHING_TV,label:SURFING_THE_INTERNET,label:AT_A_PARTY,label:AT_A_BAR,label:LOC_beach,label:SINGING,label:TALKING,label:COMPUTER_WORK,label:EATING,label:TOILET,label:GROOMING,label:DRESSING,label:AT_THE_GYM,label:STAIRS_-_GOING_UP,label:STAIRS_-_GOING_DOWN,label:ELEVATOR,label:OR_standing,label:AT_SCHOOL,label:PHONE_IN_HAND,label:PHONE_IN_BAG,label:PHONE_ON_TABLE,label:WITH_CO-WORKERS,label:WITH_FRIENDS,label_source,user_num
count,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,372059.0,...,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0,388876.0
mean,3628.337429,1445842000.0,1.002346,0.038497,0.037367,0.072239,0.983452,0.998648,1.016713,2.046626,6.680333,5.039648,0.063489,0.288424,0.199017,0.313295,0.464775,2.179855,0.182105,0.003916,0.016552,-0.071176,0.045673,0.049431,0.051125,-0.005808,0.002985,-0.03275,0.172375,0.164725,0.221297,0.316538,0.07517,0.125659,0.21423,1.898218,6.315796,5.113895,2.184306,2.974982,2.572627,3.800722,1.681889,3.080371,0.116362,-0.001186,-0.001474,-0.001342,0.137125,0.155743,...,0.002806,0.013068,0.221698,0.009895,0.015712,0.014154,0.093356,0.488449,0.031804,0.016831,0.004721,0.021704,0.006552,0.408963,0.005403,0.061189,0.021428,0.010541,0.00484,0.002088,0.003816,0.005423,0.010181,0.001648,0.003276,0.034967,0.050787,0.00378,0.001417,0.001749,0.001689,0.095902,0.09978,0.043592,0.006917,0.007972,0.005889,0.003446,0.002062,0.001993,0.000514,0.099389,0.108937,0.038442,0.027834,0.312498,0.016216,0.063892,1.560999,31.547882
std,2471.349172,6079698.0,0.078578,0.095612,0.112654,0.169879,0.081889,0.075616,0.103872,0.61733,0.021039,0.025624,0.232695,0.805627,0.546393,0.770233,0.132117,2.509965,0.159501,0.288258,0.371528,0.867057,0.100417,0.114617,0.113071,0.283802,0.276332,0.300784,0.474775,0.34682,0.470446,0.665763,0.315094,0.424571,0.603644,0.781959,0.547539,0.426259,1.415175,1.287823,1.003097,1.158198,0.937498,48.957714,0.11296,0.208687,0.050944,0.034308,0.323105,0.354455,...,0.052893,0.113568,0.415389,0.098981,0.124359,0.118124,0.290931,0.499867,0.175479,0.128636,0.068549,0.145714,0.08068,0.491643,0.073305,0.239677,0.144808,0.102125,0.069399,0.045648,0.061657,0.073443,0.100384,0.040566,0.057144,0.183698,0.219564,0.061366,0.037615,0.04178,0.041069,0.294457,0.299707,0.204186,0.082883,0.088928,0.076512,0.0586,0.045366,0.044598,0.022672,0.299184,0.311561,0.19226,0.164497,0.463512,0.126305,0.24456,1.475694,18.633766
min,0.0,1433537000.0,0.018148,3e-05,-0.493806,3.9e-05,0.015845,0.017998,0.020365,0.009605,5.460637,4.338109,0.0,0.0,0.0,0.0,0.395625,0.0,0.0,-1.294366,-1.299951,-1.903621,0.000237,0.000261,0.0,-0.999806,-0.998505,-0.998895,0.000705,0.0,-1.752818,0.0,4.6e-05,0.000481,0.000823,-0.0,0.454157,0.0,0.0,0.0,0.0,0.0,-0.0,-29814.320733,0.0,-34.906586,-3.214968,-2.222142,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0
25%,1568.0,1441241000.0,0.992862,0.001727,-0.000805,0.002362,0.982991,0.991665,0.995106,1.674078,6.68438,5.042902,3.1e-05,0.000112,0.000121,0.001081,0.429609,0.3778,0.094581,-0.033524,-0.035414,-0.990759,0.001172,0.001139,0.001745,-0.079157,-0.070672,-0.093675,0.003,0.001291,0.00114,0.001932,0.00174,0.002544,0.00353,1.446497,6.273905,4.915394,1.121902,2.302947,2.227005,3.481251,1.199203,0.946252,0.071013,-0.000695,-0.000781,-0.001059,0.001568,0.001611,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,16.0
50%,3245.0,1444914000.0,1.001257,0.003219,0.000766,0.005007,0.995508,1.000152,1.003818,2.301166,6.684606,5.043353,0.000104,0.000445,0.000464,0.003524,0.429697,1.108336,0.122214,0.005043,0.002267,-0.130418,0.002278,0.002171,0.003471,-0.003554,-0.000144,-0.00484,0.008912,0.003748,0.004013,0.006348,0.005163,0.007589,0.010578,2.206309,6.53873,4.985739,1.728664,2.733991,2.560315,4.222649,1.350329,2.514553,0.095567,6e-06,0.0,6e-06,0.00417,0.004524,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,31.0
75%,5382.0,1448692000.0,1.012426,0.020922,0.009488,0.044513,1.003689,1.009114,1.018725,2.522701,6.68461,5.043579,0.002731,0.023489,0.028193,0.088281,0.432007,3.148837,0.196841,0.048802,0.05848,0.968126,0.036821,0.031298,0.037907,0.070181,0.071957,0.067363,0.102132,0.130511,0.192972,0.285138,0.03079,0.052216,0.098363,2.519097,6.587856,5.220709,3.400864,4.02655,3.162546,4.405638,2.043659,4.911345,0.128591,0.000538,0.000551,0.00064,0.07659,0.09751,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,48.0
max,11995.0,1464899000.0,3.185837,1.936343,2.47275,3.360718,1.942718,2.636697,3.958338,2.971272,6.684612,6.489025,5.008721,5.555127,5.55375,5.965551,4.328944,13.922462,0.983178,1.412232,1.356947,1.075647,2.063435,2.580865,1.398859,0.998049,0.999013,0.998292,34.906977,6.668213,9.800017,12.660385,34.906977,34.906977,34.906977,2.973136,6.684612,6.649564,5.882769,6.32512,5.853475,6.788357,5.991286,13.480792,1.0,34.855453,3.012568,3.429335,6.396027,6.149076,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,6.0,63.0


## Data Cleaning

Most of the data cleaning was done while adding individual users to the master dataframe (see the commented-out cell under Imports).

In [18]:
#columns that still contain missing values
master_nulls_list = list(master_df.columns[master_df.isnull().any()])

#some observations are still nan because interpolation doesn't work when all observations are nans. Use mean imputation for those features.
for col in master_nulls_list:
    master_df[col] = master_df[col].fillna(master_df[col].mean())
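
The imputation used in this notebook has two stages: linear interpolation within each user (in the commented-out loading cell) and then mean imputation for whatever is still missing. The sketch below illustrates the idea on a toy frame with made-up values; the column names `feat_a` and `feat_b` are hypothetical, and the grouped `transform` is one possible way to express the per-user step, not the loading cell's exact code.

```python
import numpy as np
import pandas as pd

# Toy data: user 0 has a gap in feat_a; user 1 has no feat_b readings at all.
toy = pd.DataFrame({
    "user_num": [0, 0, 0, 1, 1, 1],
    "feat_a":   [1.0, np.nan, 3.0, 10.0, 11.0, 12.0],
    "feat_b":   [2.0, 4.0, 6.0, np.nan, np.nan, np.nan],
})
features = ["feat_a", "feat_b"]

# Stage 1: interpolate linearly within each user, in both directions.
toy[features] = toy.groupby("user_num")[features].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)

# Stage 2: anything still missing had no observed values for that user,
# so fall back to the mean over all users.
toy[features] = toy[features].fillna(toy[features].mean())
print(toy)
```

Interpolating per user avoids blending sensor readings across different people, while the mean fallback only touches columns that a user never recorded.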

## Data Visualizations


### Target Variable 


label_col = [col for col in list(master_df.columns) if col[:6] == 'label:']

main_labels = ['label:LYING_DOWN', 'label:SITTING', 'label:FIX_running', 'label:OR_standing','label:SLEEPING', 
               'label:FIX_walking']

loc_labels = ['label:LAB_WORK', 'label:IN_CLASS', 'label:IN_A_MEETING', 'label:LOC_main_workplace','label:OR_indoors',
 'label:OR_outside', 'label:IN_A_CAR', 'label:ON_A_BUS', 'label:LOC_home', 'label:FIX_restaurant','label:SHOPPING',
'label:AT_A_PARTY', 'label:AT_A_BAR', 'label:LOC_beach', 'label:AT_THE_GYM', 'label:ELEVATOR', 'label:AT_SCHOOL']

not_secondary_labels = main_labels + loc_labels
secondary_labels = [col for col in label_col if col not in not_secondary_labels]

label_dict = {'main_label': main_labels, 'loc_label': loc_labels, 'secondary_label': secondary_labels}

for label, lst in label_dict.items():
    #init dict
    df_label_dict = {i: '' for i in range(len(master_df))}
    # add labels to dict if labels is present in respective index
    for i in range(len(master_df)):
        for col in lst:
            if master_df[col].iloc[i] == 1:
                #creates multiclass label
                df_label_dict[i] += col + ' '
    master_df[label] = pd.Series(df_label_dict).apply(lambda x: x if x != '' else np.nan)

#label distribution for each category of labels 

for label in label_dict.keys():
    
    label_freq = master_df[label].value_counts()[:10]
    plt.figure(figsize=(25,10))
    plt.barh(label_freq.index, label_freq.values)
    plt.title('{} Frequency'.format(label), fontdict={'fontsize': 30})
    plt.show()
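
The row-by-row loop that builds the combined label strings can be slow on a dataframe of this size. The sketch below shows one possible vectorized alternative on a toy one-hot frame; `combine_labels` is a hypothetical helper, the rows are made up, and unlike the loop above it does not leave a trailing space after the last label.

```python
import numpy as np
import pandas as pd

# Toy one-hot label frame with made-up rows.
toy = pd.DataFrame({
    "label:SITTING":     [1.0, 0.0, 1.0, 0.0],
    "label:OR_standing": [0.0, 1.0, 0.0, 0.0],
    "label:SLEEPING":    [0.0, 0.0, 1.0, 0.0],
})
cols = list(toy.columns)

def combine_labels(frame, columns):
    """Join the names of all active labels per row; NaN when none is active."""
    active = frame[columns].to_numpy() == 1
    names = np.array(columns, dtype=object)
    # An empty join means no label was active, so store NaN instead.
    joined = [" ".join(names[row]) or np.nan for row in active]
    return pd.Series(joined, index=frame.index)

toy["main_label"] = combine_labels(toy, cols)
print(toy["main_label"])
```

Because the boolean mask is computed once for the whole frame, the per-row work reduces to a string join.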

__<font size="5">Research Question </font>__

Which classification model has the highest AUC score for each of the target behavioral labels: sitting, indoors and phone on table? 


In [19]:
label_col = [col for col in list(master_df.columns) if col[:6] == 'label:']
all_feature_list = [col for col in list(master_df.columns) if col not in label_col and col != 'timestamp']


### Features

After looking at the correlation coefficients between the features and each binary target variable, none of the sensor or self-reported features had a strong correlation (correlation > .7). A t-test between the samples where the label is true and the samples where it is false may be helpful in determining, for each independent feature, whether there is a significant difference between the two groups.
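
A correlation check of this kind could look like the sketch below. It runs on synthetic data rather than `master_df`, and the column names `label:EXAMPLE`, `feat_shifted` and `feat_noise` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the real data: one binary label and two features.
label = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "label:EXAMPLE": label,
    "feat_shifted": rng.normal(loc=label * 0.5, scale=1.0),  # weakly related
    "feat_noise": rng.normal(size=n),                        # unrelated
})

# Pearson correlation of each feature with the binary label
# (equivalent to the point-biserial correlation).
corr = toy.drop(columns="label:EXAMPLE").corrwith(toy["label:EXAMPLE"])
strong = corr[corr.abs() > 0.7]

print(corr.round(3))
print("Strongly correlated features:", list(strong.index))
```

A feature can differ clearly between the two groups and still fall well below the 0.7 threshold, which is why a significance test is a useful complement.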

In [20]:
def t_test_plot_filter(data, feature_list, binary_label, pval_threshold=.05):
    
    t_test_dict = {}
    
    for col in feature_list:
        zeros_series = data[col][data[binary_label] == 0]
        ones_series = data[col][data[binary_label] == 1]
        
        #Welch's t-test (does not assume equal variances between the two groups)
        t_test, pval = ttest_ind(zeros_series, ones_series, equal_var=False)
            
        #keep only features with a defined, significant t-statistic
        if not np.isnan(t_test) and pval < pval_threshold:
            t_test_dict[col] = abs(t_test)
    
    #keep features whose |t| is more than twice the mean |t| of the significant features
    mean_t = np.mean(list(t_test_dict.values()))
    plot_feature_list = [col for col, t_val in t_test_dict.items() if t_val > mean_t * 2]
    
    return plot_feature_list
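
To make the test inside the filter concrete, here is a small sketch of Welch's t-test on synthetic values for a single feature. The two arrays are generated, not taken from the dataset, and their names are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Synthetic feature values for the two label groups.
label_off = rng.normal(loc=0.0, scale=1.0, size=500)
label_on = rng.normal(loc=1.0, scale=2.0, size=300)  # different mean and variance

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_val = ttest_ind(label_off, label_on, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_val:.3g}")
print("Significant at 0.05:", p_val < 0.05)
```

Welch's variant is a reasonable default here because the true and false groups of a label usually differ in both size and spread.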

In [21]:
#features that pass the t_test_plot_filter function are then used for violin plot visualizations
def target_violin_plot(data, feature_list, binary_label):
    #plot the features two at a time, side by side
    for i in range(0, len(feature_list), 2):
        
        plt.figure(figsize=(15,5))

        plt.subplot(1,2,1)
        sns.violinplot(x=binary_label, y=feature_list[i], data=data)
        
        #the second panel is skipped when the list has an odd number of features
        if i + 1 < len(feature_list):
            plt.subplot(1,2,2)
            sns.violinplot(x=binary_label, y=feature_list[i+1], data=data)
        
        plt.tight_layout()
        plt.show()
        

In [22]:
indoor_feature_list = t_test_plot_filter(master_df, all_feature_list, 'label:OR_indoors')
print('Feature Count: ', len(indoor_feature_list))
#target_violin_plot(master_df, indoor_feature_list[:10],'label:OR_indoors')

Feature Count:  31


In [23]:
sitting_feature_list = t_test_plot_filter(master_df, all_feature_list, 'label:SITTING')
print('Feature Count: ', len(sitting_feature_list))
#target_violin_plot(master_df, sitting_feature_list[:10], 'label:SITTING')

Feature Count:  25


In [24]:
phone_on_table_feature_list = t_test_plot_filter(master_df, all_feature_list, 'label:PHONE_ON_TABLE')
print('Feature Count: ', len(phone_on_table_feature_list))
#target_violin_plot(master_df, phone_on_table_feature_list[:10], 'label:PHONE_ON_TABLE')

Feature Count:  24


# Modeling

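Since there are 388,876 observations within the master dataframe, we will downsample the majority class of each target label to match the minority class, which balances the classes and reduces runtime.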

In [25]:
#downsamples majority class label to be equal with minority class label
#used for models with large amounts of data to decrease runtime
def downsample_df(data, binary_label):
    drop_size = abs(len(data[data[binary_label] == 0]) - len(data[data[binary_label] ==1]))

    if len(data[data[binary_label] == 0]) > len(data[data[binary_label] ==1]):
        drop_idx = data[data[binary_label] == 0].sample(drop_size).index
    else:
        drop_idx = data[data[binary_label] == 1].sample(drop_size).index
        
    balanced_df = data.drop(list(drop_idx)).copy()
    print('Label Distribution')
    print(balanced_df[binary_label].value_counts())
    return balanced_df
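
Because `sample` is called without a seed, each call to `downsample_df` drops a different set of rows. The sketch below shows a seeded variant of the same idea on a toy frame; `downsample_balanced` is a hypothetical name, and it uses a grouped sample rather than the drop-by-index approach above.

```python
import pandas as pd

# Toy frame with an imbalanced binary label: six negatives, two positives.
toy = pd.DataFrame({
    "feature": range(8),
    "label:EXAMPLE": [0, 0, 0, 0, 0, 0, 1, 1],
})

def downsample_balanced(data, binary_label, random_state=0):
    """Sample every class down to the size of the smallest class."""
    n_min = data[binary_label].value_counts().min()
    return data.groupby(binary_label).sample(n=n_min, random_state=random_state)

balanced = downsample_balanced(toy, "label:EXAMPLE")
print(balanced["label:EXAMPLE"].value_counts())
```

Fixing `random_state` makes the balanced subset reproducible between runs, which helps when comparing models on the same data.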

In [36]:
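def best_model_dict(model_dict, feature_list, binary_label):
    for name, model in model_dict.items():
        #model.parameters is the search grid attached to each estimator before this function is called
        clf = GridSearchCV(model, model.parameters, cv=5, scoring='roc_auc')
        #tune each model on its own balanced 5% sample of the master dataframe
        data = downsample_df(master_df.sample(frac=.05), binary_label)
        if name not in ['mlp', 'knn']:
            clf.fit(data[feature_list], data[binary_label])
        else:
            #normalize features for mlp and knn
            clf.fit(normalize(data[feature_list]), data[binary_label])
        
        #replace the untuned estimator with the best one found by the search
        model_dict[name] = clf.best_estimator_
    
    return model_dict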
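
As an alternative to attaching a `parameters` attribute to each estimator, the grid can be passed to `GridSearchCV` explicitly, and the normalization step can live inside a pipeline. The sketch below assumes synthetic data from `make_classification` in place of the sensor features and reuses the KNN grid from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Synthetic binary classification data standing in for the sensor features.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Normalizer applies the same row-wise scaling as normalize(), but as a step.
pipe = Pipeline([("norm", Normalizer()), ("knn", KNeighborsClassifier())])

# KNN grid from this notebook, addressed through the pipeline step name.
param_grid = {
    "knn__weights": ["uniform", "distance"],
    "knn__n_neighbors": [1, 2, 4, 8],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Keeping preprocessing inside the pipeline means the tuned object can be applied to raw features later without remembering which models need normalized input.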

In [39]:
def test_best_model(data, model_dict, feature_list, binary_label, downsample=False):
    
    if downsample == True:
        data = downsample_df(data, binary_label)
    else: 
        print('Label Distribution')
        print(data[binary_label].value_counts())
    
    
    for name,model in model_dict.items():
        if name not in ['knn', 'mlp']:
            train_x, test_x, train_y, test_y = train_test_split(data[feature_list], data[binary_label], test_size=.3)
        else:
            #normalize features
            train_x, test_x, train_y, test_y = train_test_split(normalize(data[feature_list]), data[binary_label], test_size=.3)
        print()
        print(name)
        start = time.time()
        model.fit(train_x, train_y)
        end = time.time()
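        score = model.score(test_x, test_y) * 100
        #baseline accuracy: share of the most frequent class in the test set
        base = max(np.unique(test_y, return_counts=True)[1])/len(test_y) * 100
        pred_y = model.predict(test_x)
        #ROC curve is built from hard class predictions, not predicted probabilities
        false_positive_rate, true_positive_rate, thresholds = roc_curve(test_y, pred_y)
        roc_auc = auc(false_positive_rate, true_positive_rate)*100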
        
        print('Base: ', base)
        print('AUC: ', roc_auc)
        print('Runtime: ', end - start)
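
Note that `test_best_model` builds the ROC curve from `model.predict`, i.e. hard 0/1 labels, so the reported AUC reflects a single operating point and equals the balanced accuracy. A common alternative is to score with the predicted probability of the positive class. The sketch below compares the two on synthetic data; it does not reproduce the notebook's numbers, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data; the notebook itself uses the sensor features instead.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=40, random_state=0)
model.fit(train_x, train_y)

# AUC from hard 0/1 predictions: the ROC curve has a single operating point.
auc_from_labels = roc_auc_score(test_y, model.predict(test_x))

# AUC from predicted probabilities of the positive class: uses the full ranking.
auc_from_scores = roc_auc_score(test_y, model.predict_proba(test_x)[:, 1])

print(f"AUC from labels: {auc_from_labels:.3f}")
print(f"AUC from scores: {auc_from_scores:.3f}")
```

The two values answer slightly different questions, so whichever is chosen should be used consistently across all models being compared.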
    



## Sitting Label

In [38]:
rfc = RandomForestClassifier()
rfc.parameters = {'criterion':('gini', 'entropy'), 'n_estimators':[5,10,20,40], 
                  'max_depth': [1,2,4,8,16,32], 'min_samples_leaf': [1,2,4,8]}

dt = DecisionTreeClassifier()
dt.parameters = {'criterion':('gini', 'entropy'), 'max_depth': [1,2,4,8,16,32], 'min_samples_leaf': [1,2,4,8]}

knn = KNeighborsClassifier()
knn.parameters = {'weights':('uniform', 'distance'), 'n_neighbors':[1,2,4,8]}

mlp = MLPClassifier()
mlp.parameters = {'hidden_layer_sizes':[10,15,20], 'activation':('relu', 'logistic'), 
              'learning_rate': ('constant', 'adaptive')}

sitting_model_dict = {'rfc': rfc, 'mlp':mlp, 'knn':knn, 'dt': dt}

sitting_model_dict = best_model_dict(sitting_model_dict, sitting_feature_list, 'label:SITTING')

sitting_model_dict

Label Distribution
0.0    6971
1.0    6971
Name: label:SITTING, dtype: int64
Label Distribution
0.0    6968
1.0    6968
Name: label:SITTING, dtype: int64






Label Distribution
1.0    7032
0.0    7032
Name: label:SITTING, dtype: int64
Label Distribution
1.0    6948
0.0    6948
Name: label:SITTING, dtype: int64


{'rfc': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
             max_depth=32, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=4, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=None,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False),
 'mlp': MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
        beta_2=0.999, early_stopping=False, epsilon=1e-08,
        hidden_layer_sizes=20, learning_rate='adaptive',
        learning_rate_init=0.001, max_iter=200, momentum=0.9,
        n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
        random_state=None, shuffle=True, solver='adam', tol=0.0001,
        validation_fraction=0.1, verbose=False, warm_start=False),
 'knn': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            m
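
The cell above attaches a `parameters` grid to each estimator and passes the dictionary to `best_model_dict`, whose body is not shown in this section. The following is a minimal sketch of how such a grid could be searched, assuming scikit-learn's `GridSearchCV` with ROC AUC scoring. The synthetic data, `cv=3`, and the names `search` and `best_knn` are illustrative assumptions, not the notebook's actual helper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's feature matrix and label (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

knn = KNeighborsClassifier()
# Same grid as the notebook's knn.parameters.
knn.parameters = {'weights': ('uniform', 'distance'), 'n_neighbors': [1, 2, 4, 8]}

# Try every combination in the grid and keep the one with the best mean ROC AUC.
search = GridSearchCV(knn, knn.parameters, scoring='roc_auc', cv=3)
search.fit(X, y)

best_knn = search.best_estimator_  # refit on all of X, y with the best parameters
print(search.best_params_)
```

Replacing each entry of the model dictionary with its `best_estimator_` would yield a dictionary of tuned models like the one printed above.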

In [40]:
test_best_model(master_df, sitting_model_dict, sitting_feature_list, 'label:SITTING', downsample=True)

Label Distribution
0.0    139593
1.0    139593
Name: label:SITTING, dtype: int64

rfc
Base:  50.23162519700081
AUC:  85.80816678037178
Runtime:  62.029295921325684

mlp
Base:  50.22446153111419
AUC:  82.17716789195683
Runtime:  118.5780999660492

knn
Base:  50.0095515545155
AUC:  86.714666206309
Runtime:  0.6499772071838379

dt
Base:  50.051339605520795
AUC:  81.36055438007976
Runtime:  3.9045300483703613
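
The "Label Distribution" output shows equal counts for both classes, which is what `downsample=True` is expected to produce. A minimal sketch of majority-class downsampling with pandas follows. The helper name `downsample_majority` and the toy frame are hypothetical, and the notebook's own implementation may differ.

```python
import pandas as pd

def downsample_majority(df, label, random_state=0):
    """Return a copy of df in which every class has as many rows as the rarest class."""
    n_min = df[label].value_counts().min()
    parts = [group.sample(n=n_min, random_state=random_state)
             for _, group in df.groupby(label)]
    # Shuffle so the classes are interleaved again.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy frame with a 7:3 class imbalance (illustrative data only).
toy = pd.DataFrame({'feature': range(10),
                    'label:SITTING': [1.0] * 7 + [0.0] * 3})
balanced = downsample_majority(toy, 'label:SITTING')
print(balanced['label:SITTING'].value_counts())
```

After balancing, a classifier that ignores the features scores close to 50%, which matches the `Base` values reported for each model.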


## Indoors Label

In [48]:
rfc = RandomForestClassifier()
rfc.parameters = {'criterion':('gini', 'entropy'), 'n_estimators':[5,10,20,40], 
                  'max_depth': [1,2,4,8,16,32], 'min_samples_leaf': [1,2,4,8]}

dt = DecisionTreeClassifier()
dt.parameters = {'criterion':('gini', 'entropy'), 'max_depth': [1,2,4,8,16,32], 'min_samples_leaf': [1,2,4,8]}

knn = KNeighborsClassifier()
knn.parameters = {'weights':('uniform', 'distance'), 'n_neighbors':[1,2,4,8]}

mlp = MLPClassifier()
mlp.parameters = {'hidden_layer_sizes':[10,15,20], 'activation':('relu', 'logistic'),
                  'learning_rate': ('constant', 'adaptive')}

indoors_model_dict = {'rfc': rfc, 'mlp':mlp, 'knn':knn, 'dt': dt}

indoors_model_dict = best_model_dict(indoors_model_dict, indoor_feature_list, 'label:OR_indoors')

indoors_model_dict

Label Distribution
0.0    9530
1.0    9530
Name: label:OR_indoors, dtype: int64
Label Distribution
0.0    9514
1.0    9514
Name: label:OR_indoors, dtype: int64
Label Distribution
0.0    9466
1.0    9466
Name: label:OR_indoors, dtype: int64
Label Distribution
1.0    9510
0.0    9510
Name: label:OR_indoors, dtype: int64


{'rfc': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
             max_depth=32, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=None,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False),
 'mlp': MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
        beta_2=0.999, early_stopping=False, epsilon=1e-08,
        hidden_layer_sizes=20, learning_rate='adaptive',
        learning_rate_init=0.001, max_iter=200, momentum=0.9,
        n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
        random_state=None, shuffle=True, solver='adam', tol=0.0001,
        validation_fraction=0.1, verbose=False, warm_start=False),
 'knn': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            m

In [49]:
test_best_model(master_df, indoors_model_dict, indoor_feature_list, 'label:OR_indoors', downsample=True)

Label Distribution
1.0    189946
0.0    189946
Name: label:OR_indoors, dtype: int64

rfc
Base:  50.190404324020776
AUC:  89.79041755193595
Runtime:  79.66787719726562

mlp
Base:  50.23164397023726
AUC:  81.78311240885178
Runtime:  176.98911881446838

knn
Base:  50.13863540642988
AUC:  85.82584879448235
Runtime:  1.8038630485534668

dt
Base:  50.01667134634283
AUC:  81.44132500096673
Runtime:  8.115172147750854
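
The model-setup cell is repeated verbatim for each label. One way to avoid the repetition is a small factory that returns fresh, unfitted estimators together with their grids. This is only a sketch: `make_model_dict` is a hypothetical helper, not something defined in the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def make_model_dict():
    """Hypothetical helper: new estimators, each carrying its search grid."""
    # Grid entries shared by the decision tree and the random forest.
    tree_grid = {'criterion': ('gini', 'entropy'),
                 'max_depth': [1, 2, 4, 8, 16, 32],
                 'min_samples_leaf': [1, 2, 4, 8]}

    rfc = RandomForestClassifier()
    rfc.parameters = {**tree_grid, 'n_estimators': [5, 10, 20, 40]}

    dt = DecisionTreeClassifier()
    dt.parameters = dict(tree_grid)

    knn = KNeighborsClassifier()
    knn.parameters = {'weights': ('uniform', 'distance'),
                      'n_neighbors': [1, 2, 4, 8]}

    mlp = MLPClassifier()
    mlp.parameters = {'hidden_layer_sizes': [10, 15, 20],
                      'activation': ('relu', 'logistic'),
                      'learning_rate': ('constant', 'adaptive')}

    return {'rfc': rfc, 'mlp': mlp, 'knn': knn, 'dt': dt}

# Each call returns independent objects, so labels never share a fitted model.
indoors_models = make_model_dict()
phone_models = make_model_dict()
```

The grids are the same values used in the cells above, so each per-label cell would reduce to a call such as `best_model_dict(make_model_dict(), indoor_feature_list, 'label:OR_indoors')`.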


## Phone on Table

In [50]:
rfc = RandomForestClassifier()
rfc.parameters = {'criterion':('gini', 'entropy'), 'n_estimators':[5,10,20,40], 
                  'max_depth': [1,2,4,8,16,32], 'min_samples_leaf': [1,2,4,8]}

dt = DecisionTreeClassifier()
dt.parameters = {'criterion':('gini', 'entropy'), 'max_depth': [1,2,4,8,16,32], 'min_samples_leaf': [1,2,4,8]}

knn = KNeighborsClassifier()
knn.parameters = {'weights':('uniform', 'distance'), 'n_neighbors':[1,2,4,8]}

mlp = MLPClassifier()
mlp.parameters = {'hidden_layer_sizes':[10,15,20], 'activation':('relu', 'logistic'),
                  'learning_rate': ('constant', 'adaptive')}

phone_on_table_model_dict = {'rfc': rfc, 'mlp':mlp, 'knn':knn, 'dt': dt}

phone_on_table_model_dict = best_model_dict(phone_on_table_model_dict, phone_on_table_feature_list, 'label:PHONE_ON_TABLE')

phone_on_table_model_dict

Label Distribution
0.0    6021
1.0    6021
Name: label:PHONE_ON_TABLE, dtype: int64
Label Distribution
1.0    6109
0.0    6109
Name: label:PHONE_ON_TABLE, dtype: int64
Label Distribution
1.0    6083
0.0    6083
Name: label:PHONE_ON_TABLE, dtype: int64
Label Distribution
1.0    6188
0.0    6188
Name: label:PHONE_ON_TABLE, dtype: int64


{'rfc': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
             max_depth=32, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=2, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=40, n_jobs=None,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False),
 'mlp': MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
        beta_2=0.999, early_stopping=False, epsilon=1e-08,
        hidden_layer_sizes=15, learning_rate='adaptive',
        learning_rate_init=0.001, max_iter=200, momentum=0.9,
        n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
        random_state=None, shuffle=True, solver='adam', tol=0.0001,
        validation_fraction=0.1, verbose=False, warm_start=False),
 'knn': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            m

In [51]:
test_best_model(master_df, phone_on_table_model_dict, phone_on_table_feature_list, 'label:PHONE_ON_TABLE', downsample=True)

Label Distribution
0.0    121523
1.0    121523
Name: label:PHONE_ON_TABLE, dtype: int64

rfc
Base:  50.11931864936775
AUC:  85.66539243603034
Runtime:  44.206029176712036

mlp
Base:  50.26743835203116
AUC:  73.88666373534437
Runtime:  176.32919096946716

knn
Base:  50.20435033052637
AUC:  80.2924009438971
Runtime:  6.139012813568115

dt
Base:  50.01508626601201
AUC:  79.01317675140044
Runtime:  2.8565049171447754


# Evaluate Best Models: (KNN & RFC)

Best models by AUC in the test runs above:

- Sitting Label: KNN (AUC 86.71)
- Indoors Label: RFC (AUC 89.79)
- Phone on Table Label: RFC (AUC 85.67)
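
The selection can be reproduced from the AUC values printed by the `test_best_model` calls above. A small sketch, with the values copied from those outputs and rounded to two decimals (the dictionary names `auc` and `best` are illustrative):

```python
# AUC (percent) per label and model, copied from the test_best_model outputs.
auc = {
    'label:SITTING':        {'rfc': 85.81, 'mlp': 82.18, 'knn': 86.71, 'dt': 81.36},
    'label:OR_indoors':     {'rfc': 89.79, 'mlp': 81.78, 'knn': 85.83, 'dt': 81.44},
    'label:PHONE_ON_TABLE': {'rfc': 85.67, 'mlp': 73.89, 'knn': 80.29, 'dt': 79.01},
}

# For each label, keep the model name with the highest AUC.
best = {label: max(scores, key=scores.get) for label, scores in auc.items()}
print(best)
# {'label:SITTING': 'knn', 'label:OR_indoors': 'rfc', 'label:PHONE_ON_TABLE': 'rfc'}
```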

In [60]:
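# `norm` is an ad hoc flag (not a scikit-learn attribute) that marks whether the
# features should be normalized before the model is cross-validated.
sitting_model_dict['knn'].norm = True
indoors_model_dict['rfc'].norm = False
phone_on_table_model_dict['rfc'].norm = False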

In [67]:
cross_val_dict = {'label:SITTING': ['knn', sitting_model_dict['knn'], sitting_feature_list], 
                  'label:OR_indoors': ['rfc', indoors_model_dict['rfc'], indoor_feature_list], 
                  'label:PHONE_ON_TABLE': ['rfc', phone_on_table_model_dict['rfc'], phone_on_table_feature_list]}


In [68]:
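for label, (name, model, features) in cross_val_dict.items():
    # Normalize the features only for models flagged with norm=True (KNN here).
    if model.norm:
        data = normalize(master_df[features])
    else:
        data = master_df[features]

    # 5-fold cross-validated ROC AUC on the full dataset.
    scores = cross_val_score(model, data, master_df[label], scoring='roc_auc', cv=5)
    print(name)
    print(label)
    # Report the mean AUC and two standard deviations across the folds.
    print("AUC: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))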

knn
label:SITTING
AUC: 0.83 (+/- 0.05)
rfc
label:OR_indoors
AUC: 0.84 (+/- 0.10)
rfc
label:PHONE_ON_TABLE
AUC: 0.72 (+/- 0.21)
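
The loop above normalizes the KNN features before cross-validation. Assuming `normalize` here is `sklearn.preprocessing.normalize` (its import is not shown in this section), it rescales each sample (row) to unit L2 norm rather than scaling each feature column. The sketch below illustrates that behavior on a toy array and, as one possible alternative, shows feature-wise scaling with `StandardScaler` inside a pipeline so that the scaler is refit on each training fold. The toy array, the synthetic dataset, and `n_neighbors=8` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, normalize

# Row-wise normalization: rows [3, 4] and [6, 8] both become [0.6, 0.8],
# so differences in overall magnitude between samples are discarded.
X_toy = np.array([[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]])
row_scaled = normalize(X_toy)  # defaults: norm='l2', axis=1
print(row_scaled)

# Alternative (assumption, not the notebook's method): standardize each feature
# column, fitted inside every cross-validation fold via a pipeline.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=8))
scores = cross_val_score(pipe, X_demo, y_demo, scoring='roc_auc', cv=5)
print("AUC: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

Which scaling suits KNN better depends on whether the magnitude of a sample's feature vector carries information for the label, so the two variants are worth comparing on the real features.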
