# Dynamic Time Warping

In this notebook, you are going to classify time series data with the 1-NN algorithm, using two different approaches to compute the distance between time series: the Euclidean distance and the Dynamic Time Warping (DTW) distance. The comparison will be made for time series of equal length as well as for varying-length time series.

## Processing the data

The goal is to predict, from a day's hourly rental counts, whether that day is a working day. Start by loading the `hour.csv` file, in which each row describes the bike rental system for one hour. Take care to parse the date information properly, as done before. The number of rentals is recorded in the `cnt` column.

In [None]:
import pandas as pd

data = ...
print(data.shape)
data.head(10)

We want to operate on days, not on hours, but we need to keep track of the hourly data, as the sequences of hourly rentals will be our time series. The other variables are not necessary.
Find a way to aggregate the hourly observations, and create a dataframe with two columns: `counts` and `workingday`. The former should contain a list of the hourly counts. The latter should contain 0's or 1's indicating whether a given row corresponds to a working day or not (0 = no, 1 = yes).
Note that your lists should contain exactly 24 elements.
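
As a rough illustration of the aggregation step, here is a minimal sketch on a toy dataframe with 3 "hours" per day instead of 24. The `dteday` column name is an assumption about how the date is stored in `hour.csv`, and `toy`, `toy_ts` and `expected_len` are hypothetical names; this is one possible approach, not necessarily the intended solution.

```python
import pandas as pd

# Toy stand-in for the hourly data. `dteday` is an assumed column name;
# `cnt` and `workingday` are the columns named in the text above.
toy = pd.DataFrame({
    "dteday": pd.to_datetime(
        ["2011-01-01"] * 3 + ["2011-01-03"] * 3 + ["2011-01-04"] * 2
    ),
    "cnt": [16, 40, 32, 5, 2, 30, 7, 9],
    "workingday": [0, 0, 0, 1, 1, 1, 1, 1],
})

# Group the hourly rows by day and collect each day's counts into a list.
grouped = toy.groupby("dteday")
toy_ts = pd.DataFrame({
    "counts": grouped["cnt"].apply(list),
    "workingday": grouped["workingday"].first(),
})

# Keep only complete days (3 values here; 24 for the real data).
expected_len = 3
toy_ts = toy_ts[toy_ts["counts"].apply(len) == expected_len]
print(toy_ts)
```

The last day has only two hourly rows, so the length filter drops it; the same kind of check is one way to guarantee lists of exactly 24 elements on the real data.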

In [None]:
ts = ...
ts.head(20)

Now that your data is in the right format, use the **train_test_split** function of the **sklearn.model_selection** module to split it into a training set (67% of the data) and a test set (33% of the data). Make sure the shapes of the returned data structures make sense.

In [None]:
from sklearn.model_selection import train_test_split

# Fill in the line below to obtain train and test sets from your initial data
X_train, X_test, y_train, y_test = ...
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

## Implementing the algorithms

To perform the desired tasks, we need to implement several things:
- A function to compute the Euclidean distance between two time series
- A function to compute the DTW distance between two time series
- A function to classify a time series according to its nearest neighbor, using an arbitrary distance function
- A wrapper function to run our 1-NN implementation on the whole test set and compute the accuracy of the approach

Start by defining a function that, given two time series, returns the Euclidean distance between them. The formula for the Euclidean distance $d$ between two time series $a$ and $b$ of length $n$ is the following:
$$d = \sqrt{\sum_{i=1}^n (a_i - b_i)^2}$$
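
To make the formula concrete, here is a small worked example on two series of length 3. The helper name `euclidean_sketch` is hypothetical and is not the `euclid_dist` function you are asked to write; it only illustrates the arithmetic.

```python
import math

def euclidean_sketch(a, b):
    """Hypothetical helper: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1, 2, 3]
b = [4, 6, 3]
# Squared differences: (1-4)^2 = 9, (2-6)^2 = 16, (3-3)^2 = 0, sum = 25
print(euclidean_sketch(a, b))  # 5.0
```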

In [None]:
import math

def euclid_dist(s1, s2):
    ...

Now define a method that classifies one time series using the 1 Nearest Neighbor algorithm. This method takes 4 arguments:
- X_train: the time series of the training set
- y_train: the corresponding labels of the time series (working day or not)
- test_s: the instance to classify
- distance: the distance function to use (for the moment we only have the Euclidean distance available)

The returned value should be the prediction for the test instance, 0 or 1.
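
The nearest-neighbor rule described above could be sketched as follows on toy data. The names `nearest_label_sketch` and `abs_dist` are hypothetical, and the absolute-difference distance is only a stand-in for whatever distance function is passed in.

```python
def nearest_label_sketch(train_series, train_labels, query, distance):
    """Return the label of the training series closest to `query`."""
    best_label, best_dist = None, float("inf")
    for series, label in zip(train_series, train_labels):
        d = distance(series, query)
        if d < best_dist:  # keep the closest series seen so far
            best_dist, best_label = d, label
    return best_label

def abs_dist(a, b):
    """Stand-in distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

train_series = [[0, 0, 0], [10, 10, 10]]
train_labels = [0, 1]
# Distances from [9, 8, 10]: 27 to the first series, 3 to the second.
print(nearest_label_sketch(train_series, train_labels, [9, 8, 10], abs_dist))  # 1
```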

In [None]:
def oneNearestNeighbor(X_train, y_train, test_s, distance):
    ...

Define a function that will run your **oneNearestNeighbor** function on all the instances of the test set and return the classification accuracy. The function takes 5 arguments:
- X_train: the time series of the training set
- y_train: the corresponding labels of the time series (working day or not)
- X_test and y_test: same, but for the test set
- distance: the distance function to use 
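
Classification accuracy is the fraction of test instances whose prediction matches the true label. A minimal sketch of that computation, assuming a hypothetical `predict` callable that maps one series to a label (`accuracy_sketch` and `toy_predict` are illustrative names only):

```python
def accuracy_sketch(predict, test_series, test_labels):
    """Fraction of test series for which predict(series) equals the label."""
    correct = 0
    for series, label in zip(test_series, test_labels):
        if predict(series) == label:
            correct += 1
    return correct / len(test_labels)

def toy_predict(series):
    """Stand-in classifier: predicts 1 when the total count exceeds 10."""
    return 1 if sum(series) > 10 else 0

print(accuracy_sketch(toy_predict, [[1, 2], [8, 9], [3, 3]], [0, 1, 1]))
```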

In [None]:
def classify(X_train, y_train, X_test, y_test, distance):
    ...

Now use your methods to classify the test instances using the Euclidean distance. Is the performance good? What would be the performance of a baseline classifier which always predicts the majority class?
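
One way to answer the baseline question is to find the most frequent label in the training set and measure how often it appears in the test set. The sketch below does this on toy labels; `majority_baseline_accuracy` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / len(test_labels)

# Majority label in training is 1; it matches 3 of the 5 test labels.
print(majority_baseline_accuracy([1, 1, 0, 1], [1, 0, 1, 1, 0]))  # 0.6
```

On the real data, you would pass `y_train` and `y_test` and compare the result with the 1-NN accuracy.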

In [None]:
accuracy = classify(X_train, y_train, X_test, y_test, euclid_dist)
print(accuracy)

Next, you need to give an implementation of the DTW distance. Take some time to understand the distance, and write the code so that it matches what you have seen of Dynamic Time Warping in class.
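
For reference, here is a minimal sketch of one common DTW formulation. It assumes a squared-difference local cost and a final square root, and it uses a cost matrix shifted by one index rather than cells at position -1; the version seen in class may differ on any of these points. The name `dtw_sketch` is hypothetical.

```python
import math

def dtw_sketch(s1, s2):
    """DTW with squared-difference local cost (an assumed variant)."""
    n, m = len(s1), len(s2)
    # cost[i][j]: cheapest cumulative cost of aligning the first i points
    # of s1 with the first j points of s2. Row 0 and column 0 are the
    # base cases: inf everywhere except cost[0][0] = 0.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (s1[i - 1] - s2[j - 1]) ** 2
            cost[i][j] = local + min(
                cost[i - 1][j],      # s1 advances alone
                cost[i][j - 1],      # s2 advances alone
                cost[i - 1][j - 1],  # both advance
            )
    return math.sqrt(cost[n][m])

# The repeated 2 in the second series is absorbed by the warping.
print(dtw_sketch([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

Note that the two series in the example have different lengths, which the recurrence handles without any change.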

In [None]:
def DTWDistance(s1, s2): # returns the DTW distance between two time series s1 and s2
    ...
    # Remember to define the base cases, before any cells are filled in.
    # Hint: set the cells of the row and the column at position -1 to inf,
    # so that min() raises no error, and initialize the cell at position (-1,-1) to 0.
    ...
    
    return 

Run your **classify** function again, this time using **DTWDistance**. Is the performance better?

In [None]:
accuracy = classify(X_train, y_train, X_test, y_test, DTWDistance) # should take ~3 min to run
print(accuracy)

So far, all the time series have had the same length (24). Let's change that by arbitrarily removing the hourly counts of 50 or less. The next cell creates a new dataframe with varying-length time series.

In [None]:
def trim(row):  # 'trim' a time series by removing elements from it
    tmp = []
    for c in row.counts:
        if c > 50:
            tmp.append(c)
    row.counts = tmp
    return row

varts = ts.apply(trim, axis=1) # apply our trim function to all rows of the ts dataframe
varts.head()

In the next cell, we re-create our X and y matrices. This time, they contain time series of varying lengths.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(varts['counts'], varts['workingday'], test_size=0.33, random_state=42)

Finally, we compare the two distances, this time on the varying-length time series dataset. Do you notice any significant change in performance?

In [None]:
euclid_accuracy = classify(X_train, y_train, X_test, y_test, euclid_dist)
DTW_accuracy = classify(X_train, y_train, X_test, y_test, DTWDistance) 
print(euclid_accuracy)
print(DTW_accuracy)