#Activity Recognition from Single Chest-Mounted Accelerometer Data Set 

Uncalibrated accelerometer data are collected from 15 participants performing 7 activities. The dataset provides challenges for identification and authentication of people using motion patterns.

####Data Set Information:

- The dataset collects data from a wearable accelerometer mounted on the chest 

- Sampling frequency of the accelerometer: 52 Hz 

- Accelerometer Data are Uncalibrated 

- Number of Participants: 15 

- Number of Activities: 7 

- Data Format: CSV

####Attribute Information:

- Data are separated by participant 

- Each file contains the following information 

- sequential number, x acceleration, y acceleration, z acceleration, label 

- Labels are codified by numbers 
    - 1: Working at Computer 
    - 2: Standing Up, Walking, and Going Up/Down Stairs
    - 3: Standing
    - 4: Walking
    - 5: Going Up/Down Stairs
    - 6: Walking and Talking with Someone 
    - 7: Talking while Standing
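
For readable output, the numeric codes above can be mapped to names. The following is a minimal sketch; `ACTIVITY_LABELS` and `label_name` are hypothetical names introduced for illustration and are not part of the dataset or the notebook.

```python
# Hypothetical lookup table built from the label list above.
ACTIVITY_LABELS = {
    1: "Working at Computer",
    2: "Standing Up, Walking and Going Up/Down Stairs",
    3: "Standing",
    4: "Walking",
    5: "Going Up/Down Stairs",
    6: "Walking and Talking with Someone",
    7: "Talking while Standing",
}


def label_name(code):
    """Return the activity name for a numeric code, or 'Unknown' if unlisted."""
    return ACTIVITY_LABELS.get(int(code), "Unknown")


print(label_name(4))
```

With pandas, the same table could be applied to a whole column via `df['activity'].map(ACTIVITY_LABELS)`.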

####Analysis:
- Import data from UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Activity+Recognition+from+Single+Chest-Mounted+Accelerometer#

- Create new features

- Classify the data using three methodologies. In this case I'm using:
    - Linear Regression,
    - Support Vector Machines (SVM), and 
    - Decision Trees / Random Forest

- Compute Measures of Prediction Goodness
    - Accuracy
    - Precision
    - Correlation
    - F1 
    - Recall
    - AUC
- Create a Confusion Matrix

- Display ROC Curve
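
Several of the measures listed above are available in `sklearn.metrics`. The sketch below runs on toy labels rather than the dataset; `summarize_predictions` is a hypothetical helper, and macro averaging across classes is an assumption, since the notebook does not state which averaging it uses.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def summarize_predictions(y_true, y_pred):
    """Collect multi-class metrics; macro averaging is an assumption here."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        # Rows are true classes, columns are predicted classes (sorted label order).
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }


# Toy labels, not taken from the dataset.
y_true = np.array([1, 1, 3, 3, 4, 4])
y_pred = np.array([1, 3, 3, 3, 4, 1])
report = summarize_predictions(y_true, y_pred)
print(report["confusion_matrix"])
```

ROC curves and AUC are defined for binary problems, so with seven activities they would need per-class scores, for example in a one-vs-rest setup.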

####Discussion:

I will first classify the data using Linear Regression, which is unlikely to do the job well because it treats the activity codes as a continuous quantity. I will then increase model complexity by classifying with SVM, and then with Random Forest. For each model, I will print prediction measures, a confusion matrix, and an ROC curve.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as skm
import sklearn.ensemble as sk
import pylab as pl
import numpy as np
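# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split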

In [2]:
%matplotlib inline

###Import Data

####Note: 
Combining all 15 sets of data is a bit too much for my computer to handle. I will just be combining three for now and proceeding accordingly. I will still write the code for combining and working with all 15 sets, but I will leave the extra code commented out.

In [3]:
column_names = ['xval', 'yval', 'zval', 'activity']
part1 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/1.csv', header=None, names=column_names)
# part2 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/2.csv', header=None, names=column_names)
# part3 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/3.csv', header=None, names=column_names)
# part4 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/4.csv', header=None, names=column_names)
# part5 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/5.csv', header=None, names=column_names)
part6 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/6.csv', header=None, names=column_names)
# part7 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/7.csv', header=None, names=column_names)
# part8 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/8.csv', header=None, names=column_names)
# part9 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/9.csv', header=None, names=column_names)
# part10 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/10.csv', header=None, names=column_names)
part11 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/11.csv', header=None, names=column_names)
# part12 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/12.csv', header=None, names=column_names)
# part13 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/13.csv', header=None, names=column_names)
# part14 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/14.csv', header=None, names=column_names)
# part15 = pd.read_csv('/Users/molliepettit/Desktop/Data Science/Projects/Accelerometer/15.csv', header=None, names=column_names)
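
The fifteen near-identical `read_csv` calls can also be written as a loop. This is a minimal sketch, not the notebook's code: `load_participants`, `COLUMN_NAMES`, and the `data_dir` argument are hypothetical names, and the files are assumed to be named `<participant number>.csv` as in the paths above.

```python
import os

import pandas as pd

COLUMN_NAMES = ['xval', 'yval', 'zval', 'activity']


def load_participants(data_dir, participant_ids):
    """Read '<id>.csv' for each participant into a dict of DataFrames."""
    parts = {}
    for pid in participant_ids:
        path = os.path.join(data_dir, '{}.csv'.format(pid))
        # The files have five columns but only four names are given, so pandas
        # uses the first column (the sequential number) as the index.
        parts[pid] = pd.read_csv(path, header=None, names=COLUMN_NAMES)
    return parts


# Usage with the three participants used here (the directory is a placeholder):
# parts = load_participants('/path/to/Accelerometer', [1, 6, 11])
# All fifteen: load_participants('/path/to/Accelerometer', range(1, 16))
```

Keeping the frames in a dict keyed by participant number makes it easy to switch between the three-participant subset and the full set by changing only the list of ids.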

###Add Features

While brainstorming features, I created every potential feature I could think of, without regard for redundancy or overfitting. I later commented out those I found unnecessary, and further on I drop the features found to be less important.

####Features created and notes:
- xdiff, ydiff, zdiff
    - The difference between row n and row n+1 for each axis
- xdiff_abs, ydiff_abs, zdiff_abs
    - Absolute value of xdiff, ydiff, and zdiff
    - Likely don't need both xdiff and xdiff_abs
- av_resultant_acc
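    - Average Resultant Acceleration, computed per sample as the magnitude of the acceleration vector
    - sqrt(x^2 + y^2 + z^2)
- xvy, xvz, yvz
    - Ratios of the absolute differences: xdiff_abs/ydiff_abs, xdiff_abs/zdiff_abs, and ydiff_abs/zdiff_abs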
- angle_x-y, angle_y-x, angle_z-y, angle_y-z, angle_x-z, angle_z-x
    - If one creates a right triangle using two of the vectors created from the xdiff, ydiff, or zdiff values, these will be the resulting non-90-degree angles
    - Only one of each pair is needed: the two angles in a pair are complementary (their magnitudes sum to 90 degrees), so they carry largely the same information. Therefore, I commented out every other angle.
- x_avg, y_avg, z_avg
    - Average of each axis
    - Given each activity, I found the average values of each axis for each participant
    - This is not very useful unless given a data set that you know includes one, and only one, activity. Therefore, this section is commented out. 
- x_std, y_std, z_std
    - Standard deviation for each axis
    - Again, this is not very useful unless given a data set that you know includes one, and only one, activity. This section is also commented out.
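
The per-activity averages and standard deviations described above can be expressed compactly with `groupby` and `transform`. This is a sketch on toy values, assuming the same column names as the notebook; `add_activity_stats` is a hypothetical helper. Because these columns are computed from the label itself, the caveat above still applies.

```python
import pandas as pd


def add_activity_stats(df):
    """Add per-activity mean and population std of each axis as new columns."""
    out = df.copy()
    for axis in ['x', 'y', 'z']:
        grouped = out.groupby('activity')[axis + 'val']
        out[axis + '_avg'] = grouped.transform('mean')
        # ddof=0 gives the population standard deviation, as np.std does by default
        out[axis + '_std'] = grouped.transform(lambda s: s.std(ddof=0))
    return out


# Toy values, not taken from the dataset.
toy = pd.DataFrame({
    'xval': [1.0, 3.0, 10.0, 14.0],
    'yval': [2.0, 2.0, 5.0, 7.0],
    'zval': [0.0, 4.0, 8.0, 8.0],
    'activity': [1, 1, 4, 4],
})
stats = add_activity_stats(toy)
print(stats[['activity', 'x_avg', 'x_std']])
```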
    
    
    
      

In [6]:
def add_features(df):
    #Difference in values from one row to the next
    df["xdiff"] = df["xval"].diff(-1)
    df["ydiff"] = df["yval"].diff(-1)
    df["zdiff"] = df["zval"].diff(-1)
    df["xdiff_abs"] = abs(df["xdiff"])
    df["ydiff_abs"] = abs(df["ydiff"])
    df["zdiff_abs"] = abs(df["zdiff"])
    
    #Average Resultant Acceleration: sqrt(x^2 + y^2 + z^2)
    df["av_resultant_acc"] = np.sqrt((df['xval'])**2 + (df['yval'])**2 + (df['zval'])**2)
    
    #Calculate ratios of the absolute differences: x to y, x to z, and y to z
    df['xvy'] = df['xdiff_abs']/df['ydiff_abs']
    df['xvz'] = df['xdiff_abs']/df['zdiff_abs']
    df['yvz'] = df['ydiff_abs']/df['zdiff_abs']
    
    #Calculate various angles
    #Only one of each pair is needed, as the two angles in a pair are complementary and therefore redundant
    df['angle_x-y'] = np.arcsin(df['ydiff']/(np.sqrt(df['xdiff']**2 + df['ydiff']**2)))
#     df['angle_y-x'] = np.arcsin(df['xdiff']/(np.sqrt(df['xdiff']**2 + df['ydiff']**2)))
    df['angle_z-y'] = np.arcsin(df['ydiff']/(np.sqrt(df['zdiff']**2 + df['ydiff']**2)))
#     df['angle_y-z'] = np.arcsin(df['zdiff']/(np.sqrt(df['zdiff']**2 + df['ydiff']**2)))
    df['angle_x-z'] = np.arcsin(df['zdiff']/(np.sqrt(df['xdiff']**2 + df['zdiff']**2)))
#     df['angle_z-x'] = np.arcsin(df['xdiff']/(np.sqrt(df['xdiff']**2 + df['zdiff']**2)))

#     #Find Average Values of each axis
#     #X-axis
#     df['x_avg'] = ""
#     df['x_avg'][df['activity']==1] = np.mean(df['xval'][df['activity']==1])
#     df['x_avg'][df['activity']==2] = np.mean(df['xval'][df['activity']==2])
#     df['x_avg'][df['activity']==3] = np.mean(df['xval'][df['activity']==3])
#     df['x_avg'][df['activity']==4] = np.mean(df['xval'][df['activity']==4])
#     df['x_avg'][df['activity']==5] = np.mean(df['xval'][df['activity']==5])
#     df['x_avg'][df['activity']==6] = np.mean(df['xval'][df['activity']==6])
#     df['x_avg'][df['activity']==7] = np.mean(df['xval'][df['activity']==7])

#     #y-axis
#     df['y_avg'] = ""
#     df['y_avg'][df['activity']==1] = np.mean(df['yval'][df['activity']==1])
#     df['y_avg'][df['activity']==2] = np.mean(df['yval'][df['activity']==2])
#     df['y_avg'][df['activity']==3] = np.mean(df['yval'][df['activity']==3])
#     df['y_avg'][df['activity']==4] = np.mean(df['yval'][df['activity']==4])
#     df['y_avg'][df['activity']==5] = np.mean(df['yval'][df['activity']==5])
#     df['y_avg'][df['activity']==6] = np.mean(df['yval'][df['activity']==6])
#     df['y_avg'][df['activity']==7] = np.mean(df['yval'][df['activity']==7])

#     #z-axis
#     df['z_avg'] = ""
#     df['z_avg'][df['activity']==1] = np.mean(df['zval'][df['activity']==1])
#     df['z_avg'][df['activity']==2] = np.mean(df['zval'][df['activity']==2])
#     df['z_avg'][df['activity']==3] = np.mean(df['zval'][df['activity']==3])
#     df['z_avg'][df['activity']==4] = np.mean(df['zval'][df['activity']==4])
#     df['z_avg'][df['activity']==5] = np.mean(df['zval'][df['activity']==5])
#     df['z_avg'][df['activity']==6] = np.mean(df['zval'][df['activity']==6])
#     df['z_avg'][df['activity']==7] = np.mean(df['zval'][df['activity']==7])
    
#     #Find Standard Deviation for each axis
#     #x-axis
#     df['x_std'] = ""
#     df['x_std'][df['activity']==1] = np.std(df['xval'][df['activity']==1])
#     df['x_std'][df['activity']==2] = np.std(df['xval'][df['activity']==2])
#     df['x_std'][df['activity']==3] = np.std(df['xval'][df['activity']==3])
#     df['x_std'][df['activity']==4] = np.std(df['xval'][df['activity']==4])
#     df['x_std'][df['activity']==5] = np.std(df['xval'][df['activity']==5])
#     df['x_std'][df['activity']==6] = np.std(df['xval'][df['activity']==6])
#     df['x_std'][df['activity']==7] = np.std(df['xval'][df['activity']==7])

#     #y-axis
#     df['y_std'] = ""
#     df['y_std'][df['activity']==1] = np.std(df['yval'][df['activity']==1])
#     df['y_std'][df['activity']==2] = np.std(df['yval'][df['activity']==2])
#     df['y_std'][df['activity']==3] = np.std(df['yval'][df['activity']==3])
#     df['y_std'][df['activity']==4] = np.std(df['yval'][df['activity']==4])
#     df['y_std'][df['activity']==5] = np.std(df['yval'][df['activity']==5])
#     df['y_std'][df['activity']==6] = np.std(df['yval'][df['activity']==6])
#     df['y_std'][df['activity']==7] = np.std(df['yval'][df['activity']==7])

#     #z-axis
#     df['z_std'] = ""
#     df['z_std'][df['activity']==1] = np.std(df['zval'][df['activity']==1])
#     df['z_std'][df['activity']==2] = np.std(df['zval'][df['activity']==2])
#     df['z_std'][df['activity']==3] = np.std(df['zval'][df['activity']==3])
#     df['z_std'][df['activity']==4] = np.std(df['zval'][df['activity']==4])
#     df['z_std'][df['activity']==5] = np.std(df['zval'][df['activity']==5])
#     df['z_std'][df['activity']==6] = np.std(df['zval'][df['activity']==6])
#     df['z_std'][df['activity']==7] = np.std(df['zval'][df['activity']==7])
    
    return df
        

In [7]:
add_features(part1)
# add_features(part2)
# add_features(part3)
# add_features(part4)
# add_features(part5)
add_features(part6)
# add_features(part7)
# add_features(part8)
# add_features(part9)
# add_features(part10)
add_features(part11)
# add_features(part12)
# add_features(part13)
# add_features(part14)
# add_features(part15)

Unnamed: 0,xval,yval,zval,activity,xdiff,ydiff,zdiff,xdiff_abs,ydiff_abs,zdiff_abs,av_resultant_acc,xvy,xvz,yvz,angle_x-y,angle_z-y,angle_x-z
0,1983,2438,1825,1,35,-4,28,35,4,28,3634.110345,8.750000,1.250000,0.142857,-0.113792,-0.141897,0.674741
1,1948,2442,1797,1,21,54,13,21,54,13,3603.786481,0.388889,1.615385,4.153846,1.199905,1.334551,0.554307
2,1927,2388,1784,1,-33,69,-47,33,69,47,3549.440660,0.478261,0.702128,1.468085,1.124691,0.972827,-0.958644
3,1960,2319,1831,1,-7,45,-40,7,45,40,3545.690624,0.155556,0.175000,1.125000,1.416478,0.844154,-1.397551
4,1967,2274,1871,1,-24,22,-11,24,22,11,3541.300044,1.090909,2.181818,2.000000,0.741947,1.107149,-0.429762
5,1991,2252,1882,1,-19,-11,-11,19,11,11,3546.478394,1.727273,1.727273,1.000000,-0.524796,-0.785398,-0.524796
6,2010,2263,1893,1,-27,-29,-26,27,29,26,3569.974510,0.931034,1.038462,1.115385,-0.821097,-0.839890,-0.766532
7,2037,2292,1919,1,-4,-14,-6,4,14,6,3617.346265,0.285714,0.666667,2.333333,-1.292497,-1.165905,-0.982794
8,2041,2306,1925,1,-2,8,12,2,8,12,3631.658299,0.250000,0.166667,0.666667,1.325818,0.588003,1.405648
9,2043,2298,1913,1,45,-38,29,45,38,29,3621.356376,1.184211,1.551724,1.310345,-0.701260,-0.918927,0.572460
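
To make the feature definitions concrete, the sketch below recomputes a few values for the first rows of the table above from the raw samples. It uses the same formulas as `add_features`; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# First three samples of participant 1, copied from the table above.
toy = pd.DataFrame({'xval': [1983, 1948, 1927],
                    'yval': [2438, 2442, 2388],
                    'zval': [1825, 1797, 1784]})

# diff(-1) subtracts the next row from the current one,
# so the last row has no successor and becomes NaN.
xdiff = toy['xval'].diff(-1)
ydiff = toy['yval'].diff(-1)

# Same angle formula as add_features, applied to the toy rows.
angle_xy = np.arcsin(ydiff / np.sqrt(xdiff**2 + ydiff**2))

# Per-sample magnitude of the acceleration vector.
resultant = np.sqrt(toy['xval']**2 + toy['yval']**2 + toy['zval']**2)

print(xdiff.tolist())
print(round(angle_xy.iloc[0], 6), round(resultant.iloc[0], 6))
```

The NaN in the last row of each participant's frame is one source of the missing values that are replaced later on.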


####Combine the three participant files into one dataframe:

In [9]:
frames = [part1, part6, part11]
# frames = [part1, part2, part3, part4, part5, part6, part7, part8, part9, part10, part11, part12, part13, part14, part15]

In [10]:
df = pd.concat(frames, ignore_index=True)

In [12]:
df

Unnamed: 0,xval,yval,zval,activity,xdiff,ydiff,zdiff,xdiff_abs,ydiff_abs,zdiff_abs,av_resultant_acc,xvy,xvz,yvz,angle_x-y,angle_z-y,angle_x-z
0,1502,2215,2153,1,-165,143,106,165,143,106,3434.768988,1.153846,1.556604,1.349057,0.714091,0.932913,0.571031
1,1667,2072,2047,1,56,115,141,56,115,141,3355.932359,0.486957,0.397163,0.815603,1.117638,0.684183,1.192738
2,1611,1957,1906,1,10,18,75,10,18,75,3171.435952,0.555556,0.133333,0.240000,1.063698,0.235545,1.438245
3,1601,1939,1831,1,-42,-26,-48,42,26,48,3110.543843,1.615385,0.875000,0.541667,-0.554307,-0.496423,-0.851966
4,1643,1965,1879,1,39,6,-42,39,6,42,3176.683018,6.500000,0.928571,0.142857,0.152649,0.141897,-0.822418
5,1604,1959,1921,1,-36,130,-19,36,130,19,3178.165823,0.276923,1.894737,6.842105,1.300643,1.425670,-0.485622
6,1640,1829,1940,1,33,-81,30,33,81,30,3130.246156,0.407407,1.100000,2.700000,-1.183921,-1.216091,0.737815
7,1607,1910,1910,1,61,-135,0,61,135,0,3143.031817,0.451852,inf,inf,-1.146403,-1.570796,0.000000
8,1546,2045,1910,1,17,-4,-62,17,4,62,3196.911165,4.250000,0.274194,0.064516,-0.231091,-0.064427,-1.303180
9,1529,2049,1972,1,-108,71,27,108,71,27,3228.780884,1.521127,4.000000,2.629630,0.581565,1.207403,0.244979


In [11]:
list(df.columns.values)

['xval',
 'yval',
 'zval',
 'activity',
 'xdiff',
 'ydiff',
 'zdiff',
 'xdiff_abs',
 'ydiff_abs',
 'zdiff_abs',
 'av_resultant_acc',
 'xvy',
 'xvz',
 'yvz',
 'angle_x-y',
 'angle_z-y',
 'angle_x-z']

####Clean / Prepare data
Replace inf and NaN values with a number (999, an arbitrary sentinel) so that the models will run.

In [13]:
df = df.replace([np.inf, -np.inf], 999)
df = df.replace(np.nan, 999)
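
The inf and NaN values come from the ratio and angle features: dividing a nonzero difference by zero gives inf, and 0/0 gives NaN. A small sketch on toy numbers, using the same sentinel of 999 as above (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy numerators and denominators, not taken from the dataset.
ratios = pd.DataFrame({'num': [3.0, 0.0, 2.0], 'den': [0.0, 0.0, 4.0]})
ratios['ratio'] = ratios['num'] / ratios['den']   # inf, NaN, 0.5

# Same replacement as above, written as a single chained expression.
cleaned = ratios.replace([np.inf, -np.inf], np.nan).fillna(999)
print(cleaned['ratio'].tolist())
```

A sentinel far outside the normal range of a feature behaves like an outlier, which may affect a linear model more than a tree-based one.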

####Split the data into training and test sets

In [15]:
X = df[['xval', 'yval', 'zval', 'xdiff', 'ydiff', 'zdiff', 'av_resultant_acc', 'xvy', 'xvz', 'yvz','angle_x-y', 'angle_z-y', 'angle_x-z']]
y = df['activity']
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.40)
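
The split above is random, so results change from run to run. As a sketch of one possible variation (not what the notebook does), `random_state` fixes the shuffle and `stratify` keeps the activity proportions equal in both sets; the toy frame below stands in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 20 rows, two activities with 10 rows each.
toy = pd.DataFrame({'xval': np.arange(20), 'activity': [1] * 10 + [4] * 10})
X_toy, y_toy = toy[['xval']], toy['activity']

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.40, random_state=0, stratify=y_toy)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```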

##Linear Regression:

In [16]:
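import statsmodels.api as sm
# OLS takes the data in its constructor; fit() takes no data arguments.
# Calling sm.OLS(y, X) and then fit(X_train, y_train) raised
# "TypeError: Could not compare ['pinv'] with block values",
# because the first positional argument of fit() is the solver method.
linreg = sm.OLS(y_train, X_train)
model_linear = linreg.fit()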