# Comparison between different users

This notebook compares and analyzes different methods of determining the length of a shapelet using the peak method.

Assumptions
* Each walking shapelet has a single peak far greater than the rest.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib qt
import math
#import series_computation_peak_detection as peak
import peakutils

## Peak Detection
This peak detection algorithm finds peaks in a given time series array that rise above a normalized threshold. Refer to the [documentation](http://pythonhosted.org/PeakUtils/). `estimate_period` uses the peak detection algorithm to find the median distance between consecutive peaks. If fewer than two peaks are found, it returns nan.

**Accuracy of the peak detection algorithm is assumed but not verified. We should replace this with our custom peak detection algorithm.**
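To make the idea concrete, here is a minimal sketch of threshold-based peak detection followed by a median spacing estimate. It is a toy stand-in written in plain Python, not PeakUtils itself: the helper `simple_peaks`, the synthetic cosine signal, and the way the normalized threshold is mapped onto the signal range are all assumptions made for illustration.

```python
import math
from statistics import median

def simple_peaks(y, thres=0.8, min_dist=10):
    """Toy threshold-based peak finder (illustrative, not PeakUtils)."""
    lo, hi = min(y), max(y)
    cutoff = lo + thres * (hi - lo)  # map the normalized threshold to signal units
    # Candidate peaks: local maxima above the cutoff.
    candidates = [i for i in range(1, len(y) - 1)
                  if y[i] > cutoff and y[i - 1] < y[i] >= y[i + 1]]
    # Keep the tallest candidates first and drop any closer than min_dist to a kept peak.
    kept = []
    for i in sorted(candidates, key=lambda k: y[k], reverse=True):
        if all(abs(i - j) >= min_dist for j in kept):
            kept.append(i)
    return sorted(kept)

# Synthetic signal with one dominant peak every 50 samples, starting at sample 10.
signal = [math.cos(2 * math.pi * (t - 10) / 50) for t in range(500)]
peaks = simple_peaks(signal)

# Distances between consecutive peaks; nan when fewer than two peaks were found.
spacing = [b - a for a, b in zip(peaks, peaks[1:])]
period = median(spacing) if spacing else float('nan')
print(peaks[:3], period)
```

Using the median rather than the mean keeps a single missed or spurious peak from shifting the estimate much.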

In [2]:
def estimate_period(df, threshold, axis='x'):
    # Indexes of peaks above the normalized threshold, at least 10 samples apart.
    indexes = peakutils.indexes(df[axis].values, thres=threshold, min_dist=10)
    #series, delta = peak.series_computation(df[axis].values, threshold)
    #indexes = peak.peakdet(series, delta)
    #peaks = df.iloc[indexes]
    #plt.close()
    #plt.plot(df['x'])
    #plt.plot(peaks['x'], 'r+')
    # At least two peaks are needed to measure a distance.
    if len(indexes) > 1:
        # Median distance between consecutive peaks.
        return float(np.median(np.diff(indexes)))
    return float('nan')
    

Users who have data on walking: ['/102', '/103', '/107', '/110', '/116', '/118', '/121', '/122', '/125', '/131', '/133', '/140', '/143', '/144', '/148', '/149', '/153', '/159', '/161', '/166', '/171', '/174', '/179', '/181', '/182', '/184', '/188', '/189', '/192']

**Make sure your data file is in the right location on your local machine.**


In [3]:
walking116 = pd.read_csv('../../data/116/13_treadmill_3mph_0%.csv', names=["tick", "timestamp",
                                             "activity", "x", "y",
                                             "z", "user"], index_col=False)
walking125 = pd.read_csv('../../data/125/13_treadmill_3mph_0%.csv', names=["tick", "timestamp",
                                             "activity", "x", "y",
                                             "z", "user"], index_col=False)

## Cleaning data
I trimmed the beginning and end of each time series because they contained bogus values. The cut points were chosen visually.

In [4]:
walking116 = walking116.iloc[2800:21000]
walking125 = walking125.iloc[2500:20000]
walking116 = walking116.reset_index()
walking125 = walking125.reset_index()

## Single Peak Detection
Our parameters are the time series to search, the window size, and the threshold to examine. To improve the accuracy of the peak detection algorithm, we use a sliding window. Rather than finding peaks over the entire data set, we use a fixed window length and calculate the peak distance within the window as it slides over the time series from beginning to end. The function gives control over how far to slide the window at each calculation and how big the window should be. It returns a dictionary holding the per-window peak distances for the upper half and the lower half of the time series.

Parameters 
* timeseries(pandas dataframe) - dataframe of activity data. index starts at 0 and increases by 1.
* threshold(float between [0.,1.]) - Normalized threshold. Only the peaks with amplitude higher than the threshold will be detected.
* window_size(int) - size of the sliding window, less than the length of the time series.  If greater, window_size will be set to the length of the time series.
* axis(str) - either 'x', 'y', or 'z', corresponding to the axis to check
* step(int) - how far to slide the window for each calculation of peak distance.  If step is greater than window_size, the window size will be used as the step value.

Returns
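* dict - 'upper' holds the per-window upper peak distances and 'lower' the per-window lower peak distances; 'threshold' and 'window' record the threshold and window size that were used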
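The clamping rules for `window_size` and `step` described above can be sketched as follows. The helper `window_starts` is hypothetical and only lists the start offset of each window. It assumes the last full window is included, which may differ by one window from the loop used in the notebook.

```python
def window_starts(n, window_size=500, step=100):
    """Start offsets of each full window over a series of n samples (illustrative helper)."""
    window_size = min(window_size, n)   # the window cannot exceed the series
    step = min(step, window_size)       # the step cannot exceed the window
    return list(range(0, n - window_size + 1, step))

print(window_starts(1000))            # [0, 100, 200, 300, 400, 500]
print(window_starts(300))             # window clamped to 300 -> [0]
print(window_starts(1000, 200, 400))  # step clamped to 200 -> [0, 200, 400, 600, 800]
```

Each offset `i` corresponds to the slice `timeseries.iloc[i:i + window_size]` on which one peak distance is estimated.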

In [417]:
def find_single_peak_dist(timeseries, threshold=0.8, window_size=500, axis='x', step=100):
    if axis not in ('x', 'y', 'z'):
        raise ValueError('invalid axis value, use x, y, or z')
    n = len(timeseries)
    # Clamp the window to the series length and the step to the window size.
    window_size = min(window_size, n)
    step = min(step, window_size)
    upper = []
    lower = []
    # Negate only the chosen axis so that valleys become peaks.
    ts_flipped = -timeseries[[axis]]
    # The + 1 keeps the last full window (and the single window when window_size == n).
    for i in range(0, n - window_size + 1, step):
        print(i)  # progress: start offset of the current window
        upper.append(estimate_period(timeseries.iloc[i:i+window_size], threshold, axis=axis))
        lower.append(estimate_period(ts_flipped.iloc[i:i+window_size], threshold, axis=axis))
    return { 'upper': pd.Series(upper), 'lower': pd.Series(lower), 'threshold': threshold, 'window': window_size }
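The lower peaks are found by negating the signal, so that valleys become peaks and the same detector can be reused. A small sketch of that idea, using a hypothetical `local_maxima` helper and made-up sample values rather than PeakUtils and real accelerometer data:

```python
def local_maxima(y):
    """Indexes of strict local maxima (illustrative helper, not PeakUtils)."""
    return [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] > y[i + 1]]

signal = [0, 3, 1, -2, 0, 4, 1, -3, 0]
flipped = [-v for v in signal]   # valleys of `signal` are peaks of `flipped`

print(local_maxima(signal))   # [1, 5] -> upper peaks
print(local_maxima(flipped))  # [3, 7] -> lower peaks (valleys) of the original
```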

## Choosing the correct window size
We get better accuracy with a smaller window size, but we cannot go too small; otherwise our peak detection algorithm does not register any peaks. If the window size is n and our shapelet size is s, n cannot be less than s, otherwise we would never catch any consecutive pair of peaks. Realistically, we would like to catch multiple peaks at a time and average their distances together, but at a minimum, to be guaranteed to catch a single period, n needs to be at least 2s. Since we do not know the length of the shapelet beforehand, we will experiment with a number of window sizes to determine the correct size.
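The 2s bound can be checked by brute force. The sketch below assumes an idealized signal with exactly one peak every s samples and counts how many peaks a window of n samples can contain over all possible offsets. The helper `peak_counts` is hypothetical.

```python
def peak_counts(s, n):
    """Min and max number of peaks (one every s samples) inside a window of n samples,
    taken over all window offsets."""
    counts = []
    for offset in range(s):
        # Peaks sit at 0, s, 2s, ...; the window covers [offset, offset + n).
        count = sum(1 for p in range(0, offset + n, s) if p >= offset)
        counts.append(count)
    return min(counts), max(counts)

s = 100
for n in (80, 150, 200):
    print(n, peak_counts(s, n))
# 80  (0, 1): never two peaks, so no distance can be measured
# 150 (1, 2): two peaks only for some offsets
# 200 (2, 2): always two peaks, so one full period is always captured
```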

We will use the default threshold value of 0.8.

We will use the smallest window size that yields a prediction for more than 99% of the windows (the `accuracy` value used in the code below). We start from a fixed window size (100 by default) and increment by 100 until this condition is fulfilled.

Define a function to find the optimal window size.

`is_accurate` computes the fraction of non-NaN values in a pandas Series. It returns True if that fraction is greater than `accuracy`, and False otherwise.
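The search described above can be sketched in plain Python. The names `coverage`, `smallest_adequate_window`, `toy_run`, and the `max_window` safety cap are assumptions for illustration. `toy_run` stands in for a real sliding-window run and simply pretends that windows shorter than 300 samples never see two peaks.

```python
import math

def coverage(estimates):
    """Fraction of window estimates that are not NaN."""
    valid = sum(1 for e in estimates if not math.isnan(e))
    return valid / len(estimates)

def smallest_adequate_window(run, start=100, increment=100, accuracy=0.99, max_window=5000):
    """Grow the window until run(window) yields enough non-NaN estimates.

    `run` is a hypothetical callable returning a list of per-window estimates.
    """
    window = start
    while window <= max_window:
        if coverage(run(window)) > accuracy:
            return window
        window += increment
    return None  # no window up to max_window was adequate

# Toy stand-in for a sliding-window run.
def toy_run(window):
    return [float('nan')] * 10 if window < 300 else [104.0] * 10

print(coverage([52.0, float('nan'), 51.0, 52.0]))  # 0.75
print(smallest_adequate_window(toy_run))           # 300
```

The cap on the window size is a design choice of this sketch: without it, a signal that never reaches the required coverage would keep the loop running indefinitely.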

In [307]:
def is_accurate(peak_avgs, accuracy=.99):
    # True when the fraction of non-NaN estimates exceeds `accuracy`.
    return (1 - peak_avgs.isnull().sum() / len(peak_avgs.index)) > accuracy

def find_window_size(time_series, upper=True, accuracy=.99, window_size=100, step=100, increment=100):
    side = 'upper' if upper else 'lower'
    # Start from an all-NaN series so the loop body runs at least once.
    data = pd.Series([math.nan])
    while not is_accurate(data, accuracy):
        info = find_single_peak_dist(time_series, window_size=window_size, step=step)
        data = pd.Series(info[side])
        window_size += increment
    # window_size was incremented once after the successful run, so undo that.
    return { 'window_size': window_size - increment, 'data': data }

In [386]:
accuracy = .99
starting_window_size = 500
increment_window_size = 100
walking116.size

145600

Find upper peak window size

In [387]:
upper = find_window_size(walking116, window_size=starting_window_size, step=increment_window_size)

In [388]:
upper['window_size']

500

Find the lower peak window size

In [311]:
lower = find_window_size(walking116, upper=False, window_size=starting_window_size, step=increment_window_size)

In [312]:
lower['window_size']

50000

Calling find_single_peak_dist directly

In [450]:
accuracy = .99
starting_window_size = 10000
increment_window_size = 1000
walking125.index.size

17500

In [451]:
data = find_single_peak_dist(walking125, window_size=starting_window_size, step=increment_window_size)

0
100
200
...
16300
16400


Plot the averages

In [452]:
plt.close()
plt.plot(data['upper'], ".")
plt.xlabel('Window Index')
plt.ylabel('Average Peak Distance')

<matplotlib.text.Text at 0x18b26af60>

In [453]:
plt.close()
plt.plot(data['lower'], ".")
plt.xlabel('Window Index')
plt.ylabel('Average Peak Distance')

<matplotlib.text.Text at 0x18be01160>

In [454]:
data['upper'].median()

104.0

In [455]:
data['upper'].mode()

0    105.0
dtype: float64

In [456]:
data['lower'].median()

52.0

In [457]:
data['lower'].mode()

0    52.0
dtype: float64

Plot the histograms of the averages to show the clustering

In [458]:
plt.close()
plt.hist(data['upper'].dropna(), bins=15, color='green')
plt.xlabel('averages')
plt.ylabel('frequency')

<matplotlib.text.Text at 0x18c679940>

In [459]:
data['upper'].std()

13.971130221810869

In [460]:
plt.close()
plt.hist(data['lower'].dropna(), bins=15, color='red')
plt.xlabel('averages')
plt.ylabel('frequency')

<matplotlib.text.Text at 0x18dc92240>

In [461]:
data['lower'].std()

4.07383581055768

## Upper vs lower
How do we select?  Discuss at next meeting.

## Shapelet search

### Selecting a shapelet
For now we will randomly select a shapelet.

Meeting Notes:
* Extract possible shapelets. Cut by a fixed window length and slide by a fixed step of 20.
* Run each shapelet against the entire time series.
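The two steps in the meeting notes could look roughly like the sketch below. It is only one possible reading of the notes: the helpers `extract_candidates` and `min_distance`, the candidate length of 100 samples, the synthetic cosine series, and the use of Euclidean distance are all assumptions.

```python
import math

def extract_candidates(series, length, step=20):
    """Cut fixed-length candidate subsequences, sliding by `step` samples."""
    return [series[i:i + length] for i in range(0, len(series) - length + 1, step)]

def min_distance(candidate, series):
    """Smallest Euclidean distance between the candidate and any same-length window."""
    m = len(candidate)
    best = math.inf
    for i in range(len(series) - m + 1):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(candidate, series[i:i + m])))
        best = min(best, d)
    return best

# Synthetic series standing in for one accelerometer axis.
series = [math.cos(2 * math.pi * t / 50) for t in range(300)]
candidates = extract_candidates(series, length=100)

print(len(candidates))                      # 11 candidates at offsets 0, 20, ..., 200
print(min_distance(candidates[0], series))  # 0.0, since the candidate matches itself
```

When a candidate is run against the series it was cut from, the trivial self-match always gives distance zero, so a real comparison would need to exclude it or compare against a different recording.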