# Comparison between different users

This notebook compares and analyzes different methods of determining the length of a shapelet using the peak method.

Assumptions
* Each shapelet has a single peak far greater than the rest.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib qt
import peakutils

## Peak Detection
The peak detection algorithm finds peaks in a given time series array, keeping only those above a normalized threshold. Refer to the [documentation](http://pythonhosted.org/PeakUtils/). The method below uses it to find the average distance, in samples, between consecutive peaks. If fewer than two peaks are found, there is no distance to measure and the method returns NaN.

**The accuracy of the peak detection algorithm is assumed but not verified. We should replace it with our own custom peak detection algorithm.**

In [3]:
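def estimate_period(df, threshold):
    """Return the mean distance (in samples) between consecutive peaks in df['x'].

    Returns NaN when fewer than two peaks are found.
    """
    # thres is the normalized height threshold; min_dist is the minimum spacing between peaks
    indexes = peakutils.indexes(df['x'].values, thres=threshold, min_dist=10)
    if len(indexes) < 2:
        return float('nan')
    # Average the gaps between consecutive peak positions
    return float(np.diff(indexes).mean())

# To inspect the detected peaks visually, run the following on a window `df`:
#peaks = df.iloc[indexes]
#plt.close()
#plt.plot(df['x'])
#plt.plot(peaks['x'], 'r+')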
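As a starting point for the custom peak detector mentioned above, here is a minimal sketch of a threshold-based peak finder and the mean-spacing calculation built on it. It is not PeakUtils' implementation: the names `find_peaks_simple` and `mean_peak_spacing` are hypothetical, the threshold is assumed to be a fraction of the signal's range, and the cosine input is synthetic data with a known 50-sample period.

```python
import numpy as np

def find_peaks_simple(y, thres=0.8, min_dist=10):
    """Hypothetical stand-in for a peak finder (not the PeakUtils code).

    Returns sorted indexes of strict local maxima whose height is at least
    min(y) + thres * (max(y) - min(y)), with peaks at least min_dist apart.
    """
    y = np.asarray(y, dtype=float)
    if y.size < 3 or y.max() == y.min():
        return np.array([], dtype=int)
    cutoff = y.min() + thres * (y.max() - y.min())
    # Interior samples that are higher than both neighbours and above the cutoff
    interior = np.arange(1, y.size - 1)
    is_peak = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]) & (y[1:-1] >= cutoff)
    candidates = interior[is_peak]
    # Visit candidates from tallest to shortest, dropping any that sit
    # closer than min_dist to a peak that was already kept
    kept = []
    for idx in candidates[np.argsort(-y[candidates])]:
        if all(abs(idx - k) >= min_dist for k in kept):
            kept.append(int(idx))
    return np.array(sorted(kept), dtype=int)

def mean_peak_spacing(y, thres=0.8, min_dist=10):
    """Mean distance in samples between consecutive peaks, NaN if fewer than two."""
    peaks = find_peaks_simple(y, thres, min_dist)
    if peaks.size < 2:
        return float('nan')
    return float(np.diff(peaks).mean())

# Synthetic check: a cosine with a 50-sample period and peaks at 10, 60, 110, ...
t = np.arange(500)
y = np.cos(2 * np.pi * (t - 10) / 50)
spacing = mean_peak_spacing(y)
```

On this synthetic signal the spacing should come out as the 50-sample period. Real accelerometer data is noisy, so a detector like this would likely need smoothing before it could replace the library call.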

Users who have data on walking: ['/102', '/103', '/107', '/110', '/116', '/118', '/121', '/122', '/125', '/131', '/133', '/140', '/143', '/144', '/148', '/149', '/153', '/159', '/161', '/166', '/171', '/174', '/179', '/181', '/182', '/184', '/188', '/189', '/192']

**Make sure your data file is in the right location on your local machine.**


In [5]:
columns = ["tick", "timestamp", "activity", "x", "y", "z", "user"]
walking116 = pd.read_csv('../../data/116/13_treadmill_3mph_0%.csv', names=columns, index_col=False)
walking125 = pd.read_csv('../../data/125/13_treadmill_3mph_0%.csv', names=columns, index_col=False)

## Cleaning data
I trimmed the beginning and end of each time series because they contain bogus values. The cut points were chosen visually.

In [6]:
walking116 = walking116.iloc[2800:21000]
walking125 = walking125.iloc[2500:20000]
walking116 = walking116.reset_index()
walking125 = walking125.reset_index()

Initialize variables for the search.  Windows will be 1 second long.

In [7]:
window_size = 500
period_upper_116 = []
period_lower_116 = []
period_upper_125 = []
period_lower_125 = []
threshold = .8

In [8]:
# Step through each recording in non-overlapping windows. Negating the signal
# turns its troughs into peaks, which gives a second ("lower") period estimate.
# Only 'x' is negated because estimate_period reads no other column.
for i in range(0, len(walking116), window_size):
    window = walking116.iloc[i:i+window_size]
    period_upper_116.append(estimate_period(window, threshold))
    period_lower_116.append(estimate_period(window[['x']] * -1, threshold))
for i in range(0, len(walking125), window_size):
    window = walking125.iloc[i:i+window_size]
    period_upper_125.append(estimate_period(window, threshold))
    period_lower_125.append(estimate_period(window[['x']] * -1, threshold))

Convert the lists to pandas Series for easier analysis.

In [12]:
upper_116 = pd.Series(period_upper_116)
lower_116 = pd.Series(period_lower_116)
upper_125 = pd.Series(period_upper_125)
lower_125 = pd.Series(period_lower_125)
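One way the per-window estimates could then be compared between users is sketched below. The helper `summarize_periods` is hypothetical, and the `example` values are made up for illustration rather than taken from the recordings. Windows that produced no estimate (NaN) are counted but left out of the statistics.

```python
import numpy as np

def summarize_periods(periods):
    """Summarize per-window period estimates, ignoring windows that returned NaN."""
    arr = np.asarray(periods, dtype=float)
    valid = arr[~np.isnan(arr)]
    summary = {"n_windows": int(arr.size), "n_valid": int(valid.size)}
    if valid.size == 0:
        # No window produced an estimate, so every statistic is undefined
        summary.update(mean=float("nan"), median=float("nan"), std=float("nan"))
    else:
        summary.update(
            mean=float(valid.mean()),
            median=float(np.median(valid)),
            std=float(valid.std()),
        )
    return summary

# Illustrative values only: three usable windows and one without an estimate
example = [50.0, float("nan"), 52.0, 48.0]
summary = summarize_periods(example)
```

Calling it on each of the four period lists would give a like-for-like comparison of the two users, and the gap between `n_windows` and `n_valid` shows how often a window failed to yield an estimate.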

## Choosing the correct window size
A smaller window size seems to give better accuracy, but the window cannot be too small or the peak detection algorithm will not register any peaks. If the window size is n and the shapelet size is s, n cannot be less than s, otherwise a window would never contain a consecutive pair of peaks. Realistically, we would like to catch multiple peaks at a time and average their distances, but to be guaranteed at least one full period (two consecutive peaks) in every window, n needs to be at least 2s. Since we do not know the length of the shapelet beforehand, we will experiment with a number of window sizes to determine the correct one.
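The bound on n can be checked with a small counting sketch. It assumes an idealized signal whose peaks sit exactly s samples apart at integer positions, and the function names are hypothetical. For every possible alignment (phase) of the peaks relative to the window start, it counts how many peaks land inside a window of n samples.

```python
def peaks_in_window(n, s, phase):
    """Count peaks at positions phase, phase + s, phase + 2s, ... that fall in [0, n)."""
    return len(range(phase, n, s))

def worst_and_best_case(n, s):
    """Return (fewest, most) peaks a window of n samples can contain over all phases."""
    counts = [peaks_in_window(n, s, phase) for phase in range(s)]
    return min(counts), max(counts)

# Idealized shapelet length of 100 samples, tried against several window sizes
s = 100
results = {n: worst_and_best_case(n, s) for n in (50, 100, 150, 199, 200)}
```

A period estimate needs at least two peaks in the window. Under these assumptions the worst case only reaches two peaks once n is 2s. Between s and 2s, whether a window yields an estimate depends on where it happens to start.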

**TODO**