# Comparison between different users

This notebook compares and analyzes methods for determining the length of a shapelet using the peak method.

Assumptions
* Each shapelet has a single peak far greater than the rest.

TODO
* Improve the method to accept a shapelet with n peaks

In [22]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib qt
import peakutils

## Peak Detection
`peakutils.indexes` finds peaks in a time series array given a normalized threshold and a minimum distance between peaks.  Refer to the [documentation](http://pythonhosted.org/PeakUtils/).  `estimate_period` uses it to compute the average distance between consecutive peaks.  If fewer than two peaks are found, there is no distance to average and the method returns nan.

**The accuracy of the peak detection algorithm is assumed but not verified.  We should replace it with our own custom peak detection algorithm.**

In [38]:
def estimate_period(df, threshold, axis='x'):
    """Return the mean distance, in samples, between detected peaks on `axis`."""
    indexes = peakutils.indexes(df[axis].values, thres=threshold, min_dist=10)
    # Debug plot of the detected peaks:
    #peaks = df.iloc[indexes]
    #plt.close()
    #plt.plot(df[axis])
    #plt.plot(peaks[axis], 'r+')
    # Fewer than two peaks gives no peak-to-peak distance to average.
    if len(indexes) < 2:
        return float('nan')
    return float(np.diff(indexes).mean())
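
Since the plan is to replace PeakUtils with a custom detector, here is a minimal pure-Python sketch of the same idea: keep local maxima above a normalized threshold, then average the gaps between them. The name `mean_peak_distance` is hypothetical, the threshold is assumed to be relative to the min-max range of the window, and the left-to-right `min_dist` filter is a simplification, so this is not a drop-in copy of what PeakUtils does.

```python
import math

def mean_peak_distance(values, threshold=0.8, min_dist=10):
    """Mean spacing, in samples, between thresholded local maxima (nan if < 2 peaks)."""
    lo, hi = min(values), max(values)
    # Assumption: the normalized threshold is relative to the min-max range.
    cutoff = lo + threshold * (hi - lo)
    peaks = []
    for i in range(1, len(values) - 1):
        is_local_max = values[i - 1] < values[i] >= values[i + 1]
        if is_local_max and values[i] > cutoff:
            # Simplified spacing rule: skip a peak that is too close to the last one kept.
            if not peaks or i - peaks[-1] >= min_dist:
                peaks.append(i)
    if len(peaks) < 2:
        return float('nan')
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return sum(gaps) / len(gaps)

# Synthetic signal with one peak every 40 samples (at t = 10, 50, 90, ...).
signal = [math.sin(2 * math.pi * t / 40) for t in range(400)]
print(mean_peak_distance(signal))  # 40.0
```

A flat signal has no local maxima, so the function returns nan, matching the behavior described above for windows without peaks.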
    

Users who have data on walking: ['/102', '/103', '/107', '/110', '/116', '/118', '/121', '/122', '/125', '/131', '/133', '/140', '/143', '/144', '/148', '/149', '/153', '/159', '/161', '/166', '/171', '/174', '/179', '/181', '/182', '/184', '/188', '/189', '/192']

**Make sure your data file is in the right location on your local machine.**


In [24]:
columns = ["tick", "timestamp", "activity", "x", "y", "z", "user"]
walking116 = pd.read_csv('../../data/116/13_treadmill_3mph_0%.csv', names=columns, index_col=False)
walking125 = pd.read_csv('../../data/125/13_treadmill_3mph_0%.csv', names=columns, index_col=False)

## Cleaning data
I trimmed the beginning and end of each time series to remove bogus values.  The cut points were chosen visually.

In [25]:
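# Keep only the visually chosen range of valid samples for each user.
walking116 = walking116.iloc[2800:21000]
walking125 = walking125.iloc[2500:20000]
# Restart the index at 0; the old positions are kept in a new 'index' column.
walking116 = walking116.reset_index()
walking125 = walking125.reset_index()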

## Single Peak Detection
The parameters are the time series to search, the window size, and the threshold.  To improve the accuracy of the peak detection algorithm, we use a sliding window: rather than finding peaks over the entire data set, we take a fixed-length window and calculate the average peak distance as the window slides over the time series from beginning to end.  The function controls how far the window slides at each calculation and how big the window is.  It returns a dictionary holding one list of average peak distances for the upper peaks and one for the lower peaks (troughs) of the time series.

Parameters 
* timeseries(pandas dataframe) - dataframe of activity data. index starts at 0 and increases by 1.
* threshold(float between [0.,1.]) - Normalized threshold. Only the peaks with amplitude higher than the threshold will be detected.
* window_size(int) - size of the sliding window, less than the length of the time series.  If greater, window_size will be capped at the last index of the time series.
* axis(str) - either 'x', 'y', or 'z', corresponding to the axis to check
* step(int) - how far to slide the window for each calculation of peak distance.  If step is greater than window_size, the window size will be used as the step value.

Returns
* dict : 'upper' is the list of upper peak averages, 'lower' is the list of lower peak averages, 'threshold' and 'window' are the threshold and window size actually used

In [44]:
def find_single_peak_dist(timeseries, threshold=0.8, window_size=500, axis='x', step=100):
    if axis not in ('x', 'y', 'z'):
        raise ValueError('invalid axis value, use x, y, or z')
    # Cap the window at the last index of the series and the step at the window size.
    window_size = min(window_size, timeseries.last_valid_index())
    step = min(step, window_size)
    upper = []
    lower = []
    # Negate only the selected axis so that troughs become peaks; the other
    # columns are not used by estimate_period.
    ts_flipped = timeseries[[axis]] * -1
    for i in np.arange(0, timeseries.index.size - window_size, step):
        # Average peak distance inside the current window, for peaks and troughs.
        upper.append(estimate_period(timeseries.iloc[i:i+window_size], threshold, axis=axis))
        lower.append(estimate_period(ts_flipped.iloc[i:i+window_size], threshold, axis=axis))
    return { 'upper': upper, 'lower': lower, 'threshold': threshold, 'window': window_size }
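
To see which slices the sliding window actually visits, the sketch below lists the `(start, stop)` pairs for a given series length. `window_bounds` is a hypothetical helper, not part of the notebook. It mirrors the clamping and loop bounds of `find_single_peak_dist`, assuming a 0-based index so that `last_valid_index()` equals the length minus one.

```python
def window_bounds(n_samples, window_size=500, step=100):
    """List the (start, stop) slices visited by the sliding window."""
    # Mirror the clamping in find_single_peak_dist, assuming a 0..n-1 index
    # so that last_valid_index() equals n_samples - 1.
    window_size = min(window_size, n_samples - 1)
    step = min(step, window_size)
    return [(start, start + window_size)
            for start in range(0, n_samples - window_size, step)]

print(window_bounds(1000))
# [(0, 500), (100, 600), (200, 700), (300, 800), (400, 900)]
```

Because the loop's upper bound is exclusive, the window starting at `n_samples - window_size` is never evaluated. In the example, samples 900 to 999 fall outside every window.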

## Choosing the correct window size
It seems like we get better accuracy with a smaller window size, but we cannot go too small, otherwise the peak detection algorithm does not register any peaks.  If the window size is n and the shapelet size is s, n must be greater than s, otherwise we would never catch a consecutive pair of peaks.  Realistically, we would like to catch multiple peaks at a time and average their distances together, but to be guaranteed to catch at least a single period, n needs to be at least 2s.  Since we do not know the length of the shapelet beforehand, we will experiment with a number of window sizes to determine the correct size.
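
The bound on n can be checked by brute force. The sketch below assumes idealized peaks exactly s samples apart and counts how many fall in a window of n samples for every possible window offset. `peak_count_range` is a hypothetical helper used only for this illustration, and it is valid for windows shorter than about 19 periods.

```python
def peak_count_range(s, n):
    """Return the (min, max) number of peaks a window of n samples can contain
    when peaks occur every s samples, taken over all window offsets."""
    peaks = range(0, 20 * s, s)  # idealized, perfectly periodic peaks
    counts = []
    for start in range(s):  # one full period of offsets covers every case
        counts.append(sum(1 for p in peaks if start <= p < start + n))
    return min(counts), max(counts)

for n in (50, 100, 150, 200):
    print(n, peak_count_range(100, n))
# 50 (0, 1)
# 100 (1, 1)
# 150 (1, 2)
# 200 (2, 2)
```

With s = 100, a window of 100 samples never holds two peaks, a window of 150 holds two only for some offsets, and a window of 200 always holds two, which is the n ≥ 2s condition.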

We will use the default threshold value of 0.8 and a step value of half the window size.

In [56]:
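plt.close()
max_window = 500  # renamed from `max` to avoid shadowing the built-in
for i in np.arange(100, max_window, 100):
    info = find_single_peak_dist(walking116, window_size=i, step=i // 2)
    lower = pd.Series(info['lower'])
    # Note: reindex looks up existing labels, so a factor above 1 keeps only the
    # values whose label is a multiple of the factor and fills the rest with NaN.
    normed = lower.reindex(lower.index * int(max_window / i))
    plt.plot(normed)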
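
If the goal of the `reindex` call is to stretch each curve onto a common x axis, note that `Series.reindex` selects values by label rather than relabeling them. The sketch below contrasts the two on made-up numbers. The values and the factor are illustrative only, and assigning a new index is one possible alternative, not necessarily what was intended here.

```python
import pandas as pd

estimates = pd.Series([104.0, 102.0, 103.0, 101.0])  # made-up values
factor = 2

# reindex looks up the labels 0, 2, 4, 6 in the existing index 0..3,
# so only labels 0 and 2 are found and the rest become NaN.
looked_up = estimates.reindex(estimates.index * factor)

# Assigning a new index keeps every value and only changes its label.
relabeled = estimates.copy()
relabeled.index = estimates.index * factor

print(looked_up.tolist())   # [104.0, 103.0, nan, nan]
print(relabeled.tolist())   # [104.0, 102.0, 103.0, 101.0]
```

With a factor of 1, as for the larger window sizes in the loop above, both approaches leave the series unchanged.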

In [57]:
normed

0     104.666667
1     102.666667
2     103.000000
3     103.000000
4     102.333333
5     102.000000
6     102.333333
7     102.666667
8     102.666667
9     103.333333
10    102.000000
11    104.000000
12    153.500000
13    153.000000
14    103.000000
15    102.000000
16    102.000000
17    152.000000
18    102.000000
19    101.000000
20    101.500000
21    151.500000
22    101.333333
23    102.000000
24    102.000000
25    101.666667
26    101.666667
27    102.000000
28    101.666667
29    101.666667
         ...    
59    103.000000
60    102.000000
61    102.666667
62    102.333333
63    101.000000
64    102.000000
65    101.500000
66    101.666667
67    152.000000
68    102.000000
69    152.500000
70    102.500000
71    102.000000
72    101.666667
73    102.000000
74    153.000000
75    102.333333
76    102.333333
77    102.000000
78    101.333333
79    101.333333
80    102.000000
81    101.500000
82    152.500000
83    102.000000
84    152.500000
85    102.000000
86    101.6666