# Comparison between different users

This notebook compares and analyzes different methods of determining the length of a shapelet using the peak method.

Assumptions
* Each shapelet has a single peak far greater than the rest.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib qt
import peakutils

## Peak Detection
The peak detection algorithm finds peaks in a given time series array, subject to a threshold. Refer to the [PeakUtils documentation](http://pythonhosted.org/PeakUtils/). The method below uses the peak detection algorithm to find the average distance between consecutive peaks. If fewer than two peaks are found, there is no distance to measure and the method returns nan.

**The accuracy of the peak detection algorithm is assumed but not verified. We should replace it with our own custom peak detection algorithm.**
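
As a starting point for such a replacement, here is a minimal sketch of a threshold-based local-maximum detector. The name `find_peaks_simple` and its tie-breaking rules are hypothetical, and it assumes the normalized threshold is scaled to the min-max range of the signal. It is not claimed to reproduce the PeakUtils results exactly.

```python
import numpy as np

def find_peaks_simple(y, thres=0.3, min_dist=1):
    """Hypothetical helper: indices of local maxima above a normalized threshold."""
    y = np.asarray(y, dtype=float)
    if y.size < 3 or y.max() == y.min():
        return np.array([], dtype=int)
    # Assumption: thres in [0, 1] is scaled to the min-max range of the signal
    level = y.min() + thres * (y.max() - y.min())
    # Candidate: above the level, strictly greater than the left neighbour,
    # and at least as large as the right neighbour (edges are never peaks)
    inner = y[1:-1]
    mask = (inner > level) & (inner > y[:-2]) & (inner >= y[2:])
    candidates = np.flatnonzero(mask) + 1
    # Enforce min_dist greedily, visiting the tallest candidates first
    kept = []
    for idx in candidates[np.argsort(-y[candidates], kind="stable")]:
        if all(abs(idx - k) >= min_dist for k in kept):
            kept.append(int(idx))
    return np.array(sorted(kept), dtype=int)

# Synthetic check: a cosine with a period of 50 samples
t = np.arange(500)
signal = np.cos(2 * np.pi * t / 50)
peaks = find_peaks_simple(signal, thres=0.8, min_dist=10)
```

On this synthetic signal the detected peaks are spaced one period apart, which is the property the period estimate relies on. Real accelerometer data is noisier, so any such detector would still need to be checked against plots.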

In [3]:
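def estimate_period(df, threshold):
    """Return the mean distance, in samples, between consecutive peaks of df['x']."""
    # thres is normalized; min_dist is the minimum spacing between peaks, in samples
    indexes = peakutils.indexes(df['x'].values, thres=threshold, min_dist=10)
    # At least two peaks are needed to measure a distance
    if len(indexes) < 2:
        return float('nan')
    return np.diff(indexes).mean()
    # Disabled debug plot of the detected peaks:
    #peaks = df.iloc[indexes]
    #plt.close()
    #plt.plot(df['x'])
    #plt.plot(peaks['x'], 'r+')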
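
The averaging step can be checked by hand. The sketch below uses a hypothetical helper, `mean_peak_spacing`, and made-up peak positions to show that the mean of the consecutive gaps depends only on the first peak, the last peak, and the number of peaks.

```python
import numpy as np

def mean_peak_spacing(peak_indexes):
    """Hypothetical helper: mean gap between consecutive peak positions, or nan."""
    peak_indexes = np.asarray(peak_indexes)
    if peak_indexes.size < 2:
        return float('nan')
    return float(np.diff(peak_indexes).mean())

# Made-up peak positions; the gaps are 49, 51 and 48 samples
example = [12, 61, 112, 160]
spacing = mean_peak_spacing(example)

# The gaps telescope, so the mean equals (last - first) / (number of gaps)
telescoped = (example[-1] - example[0]) / (len(example) - 1)
```

Because the sum telescopes, a single missed or spurious peak in the middle of a window changes the estimate through the peak count rather than through the positions, which is one reason short windows may be more sensitive to detection errors.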

Users who have data on walking: ['/102', '/103', '/107', '/110', '/116', '/118', '/121', '/122', '/125', '/131', '/133', '/140', '/143', '/144', '/148', '/149', '/153', '/159', '/161', '/166', '/171', '/174', '/179', '/181', '/182', '/184', '/188', '/189', '/192']

**Make sure your data file is in the right location on your local machine.**


In [4]:
columns = ["tick", "timestamp", "activity", "x", "y", "z", "user"]
walking116 = pd.read_csv('../../data/116/13_treadmill_3mph_0%.csv', names=columns, index_col=False)
walking125 = pd.read_csv('../../data/125/13_treadmill_3mph_0%.csv', names=columns, index_col=False)

## Cleaning data
I trimmed the beginning and end of each time series because they contained bogus values. The cut points were chosen visually.

In [5]:
walking116 = walking116.iloc[2800:21000]
walking125 = walking125.iloc[2500:20000]
walking116 = walking116.reset_index()
walking125 = walking125.reset_index()

## Single Peak Detection
Our parameters are the time series to search, the window size, and the threshold to examine. To improve the accuracy of the peak detection algorithm, we use a sliding window. Rather than finding peaks over the entire data set, we use a fixed window length and calculate the average peak distance in each window as it moves over the time series from beginning to end. The window advances by its own length, so consecutive windows do not overlap. The function returns a dictionary holding one list of average peak distances for the upper peaks and one for the lower peaks (the peaks of the negated series), along with the threshold and window size used.

Parameters 
* timeseries (pandas DataFrame with an `x` column) - index starts at 0 and increases by 1.
* threshold (float in [0., 1.]) - normalized threshold. Only peaks with an amplitude higher than the threshold are detected.
* window_size (int) - size of the sliding window, which should be less than the length of the time series. If it is greater, window_size is set to the length of the time series.

Returns
* dict : 'upper' holds the upper peak averages, 'lower' holds the lower peak averages, and 'threshold' and 'window' record the parameters used.

In [9]:
def find_single_peak_dist(timeseries, threshold=0.8, window_size=500):
    # Clamp the window to the length of the time series
    if window_size > timeseries.index.size:
        window_size = timeseries.index.size
    upper = []
    lower = []
    # Negate the series so that troughs become peaks
    ts_flipped = timeseries * -1
    # Step by window_size: consecutive windows do not overlap
    for i in np.arange(0, timeseries.index.size, window_size):
        upper.append(estimate_period(timeseries.iloc[i:i+window_size], threshold))
        lower.append(estimate_period(ts_flipped.iloc[i:i+window_size], threshold))
    return { 'upper': upper, 'lower': lower, 'threshold': threshold, 'window': window_size }
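
To make the windowing concrete, the following sketch lists the window boundaries that the loop above visits. The helper `window_bounds` is hypothetical and only mirrors the `np.arange(0, size, window_size)` stepping together with the way `iloc` clips a slice at the end of the data.

```python
def window_bounds(n_samples, window_size):
    """Hypothetical helper: (start, stop) pairs for consecutive non-overlapping windows."""
    if n_samples <= 0:
        return []
    # Same clamping as find_single_peak_dist: never exceed the series length
    window_size = min(window_size, n_samples)
    return [(start, min(start + window_size, n_samples))
            for start in range(0, n_samples, window_size)]

# A series of 1200 samples with the default window of 500
bounds = window_bounds(1200, 500)
for start, stop in bounds:
    print(f"window covers samples {start}..{stop - 1} ({stop - start} samples)")
```

The last window is shorter than the others whenever the series length is not a multiple of the window size, so its average rests on fewer peaks and may come back as nan.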

In [11]:
peak_dist_info = find_single_peak_dist(walking116)
peak_dist_info['upper']

TypeError: list indices must be integers or slices, not str

## Choosing the correct window size
It seems we get better accuracy with a smaller window size, but we cannot go too small, otherwise the peak detection algorithm does not register any peaks. If the window size is n and the shapelet size is s, n cannot be less than s; otherwise we would never catch a consecutive pair of peaks. Realistically, we would like to catch multiple peaks at a time and average their distances together, but at a minimum, to be guaranteed to catch a single period, n needs to be at least 2s. Since we do not know the length of the shapelet beforehand, we will experiment with a number of window sizes to determine the correct size.
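
The bound on n can be illustrated with an idealized signal that has exactly one peak every s samples. The sketch below is only a counting argument under that assumption. The helper `peaks_per_window` is hypothetical, and s = 50 is an arbitrary choice, not a value measured from the walking data.

```python
def peaks_per_window(period, window_size, n_samples):
    """Hypothetical helper: count idealized peaks in each non-overlapping window."""
    # Idealized signal: exactly one peak every `period` samples
    peak_positions = range(0, n_samples, period)
    counts = []
    for start in range(0, n_samples, window_size):
        stop = min(start + window_size, n_samples)
        counts.append(sum(1 for p in peak_positions if start <= p < stop))
    return counts

s = 50  # assumed shapelet length in samples (illustrative only)
for n in (30, 50, 75, 100, 200):
    counts = peaks_per_window(s, n, n_samples=600)
    usable = sum(1 for c in counts if c >= 2)
    print(f"window={n:3d}: {usable}/{len(counts)} windows contain at least two peaks")
```

In this idealized setting, windows no longer than s never contain two peaks, windows between s and 2s contain a pair only some of the time, and full windows of length 2s or more always do. A trailing partial window can still fall short, and real data with uneven peak spacing would blur these boundaries.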

**TODO**