# Distance and Approximations

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sktime.datasets import load_UCR_UEA_dataset

## Load the data

---
Load the Cylinder-Bell-Funnel (CBF) dataset either using the `load_UCR_UEA_dataset` function or by loading the data from the `data` folder. 

If you use the `load_UCR_UEA_dataset` function, be sure to set the `return_type` parameter to `numpy3D`. If you load the data locally, you can use the `np.load` function.

Be sure to load both the data and the classes. Store the data in a variable `X` and the classes in a variable `classes`. `X` should contain 930 univariate time series of length 128, and `classes` should contain the class label of each time series.

---

Store the first two time series of the dataset in the variables `ts1` and `ts2`. Plot the time series.
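As a hint for the indexing, here is a minimal sketch that assumes `X` follows the `numpy3D` layout `(n_instances, n_channels, n_timepoints)`. The array below is random stand-in data with the expected shape, not the CBF dataset.

```python
import numpy as np

# Stand-in for the real data: random values with the shape the exercise expects.
rng = np.random.default_rng(0)
X = rng.normal(size=(930, 1, 128))  # (n_instances, n_channels, n_timepoints)

# Univariate series have a single channel, so take channel 0 of each instance.
ts1 = X[0, 0]
ts2 = X[1, 0]
print(ts1.shape, ts2.shape)
```

Each extracted series is a flat array that can be passed directly to `plt.plot`.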

## Distances

In [7]:
from sktime.distances import distance

---

Calculate all the distances you saw in the lecture between `ts1` and `ts2`.
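Which distances the lecture covered is not stated here, so the following is only a sketch of the lock-step (point-by-point) family, written in plain NumPy rather than with `sktime.distances`. The helper name `minkowski_distance` and the two sine waves are illustrative assumptions.

```python
import numpy as np

def minkowski_distance(a, b, p=2.0):
    """L_p distance between two equal-length 1-D series (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("lock-step distances need series of equal length")
    diff = np.abs(a - b)
    if np.isinf(p):
        # Chebyshev distance: the largest point-wise difference
        return float(diff.max())
    return float(np.sum(diff ** p) ** (1.0 / p))

# Illustrative data: the same shape, shifted in time.
t = np.linspace(0, 2 * np.pi, 128)
a = np.sin(t)
b = np.sin(t - 0.5)

print("manhattan:", minkowski_distance(a, b, p=1))
print("euclidean:", minkowski_distance(a, b, p=2))
print("chebyshev:", minkowski_distance(a, b, p=np.inf))
```

These distances compare the series index by index, so a small time shift is penalized at every point. Elastic distances such as DTW relax that constraint.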

### Alignment

In [12]:
from sktime.alignment.dtw_numba import AlignerDtwNumba

---

Visualize the DTW alignment, using a Sakoe-Chiba window of 0.1.
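To make clear what an alignment is, here is a from-scratch sketch of DTW with a Sakoe-Chiba band that also returns the warping path. It is not the `AlignerDtwNumba` implementation. It assumes the window is a fraction of the series length and uses squared point-wise costs; libraries may define both differently.

```python
import numpy as np

def dtw_with_path(a, b, window=0.1):
    """DTW restricted to a Sakoe-Chiba band; returns (distance, warping path).

    Assumption: `window` is the band half-width as a fraction of the longer
    series. The distance is the square root of the accumulated squared cost.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # Band radius in samples; at least |n - m| so the last cell stays reachable.
    radius = max(int(np.ceil(window * max(n, m))), abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - radius), min(m, i + radius) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the last cell to the first, always taking the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    path.reverse()
    return float(np.sqrt(cost[n, m])), path

# Illustrative data: two sine waves, one shifted in time.
t = np.linspace(0, 2 * np.pi, 128)
a = np.sin(t)
b = np.sin(t - 0.5)
dist, path = dtw_with_path(a, b, window=0.1)
print("DTW distance:", round(dist, 3))
print("first aligned index pairs:", path[:5])
```

Each pair `(i, j)` in `path` says that point `i` of the first series is matched with point `j` of the second. One way to visualize the alignment is to plot both series and draw a line between every matched pair. A window of 0 forces the diagonal path, which reduces DTW to the Euclidean distance.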

## Approximation

In [17]:
from sktime.transformations.series.paa import PAA
from sktime.transformations.series.sax import SAX
from pyts.approximation import DiscreteFourierTransform
from sklearn.preprocessing import StandardScaler
from sktime.transformations.series.adapt import TabularToSeriesAdaptor
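def dft_inverse_trasform(X_dft, n_coefs, n_timestamps):
    # https://pyts.readthedocs.io/en/latest/auto_examples/approximation/plot_dft.html
    # Rebuild complex Fourier coefficients from the real-valued layout in X_dft
    # (column 0: zero-frequency term, then interleaved real and imaginary parts)
    # and invert them back to series of length n_timestamps.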
    n_samples = X_dft.shape[0]
    if n_coefs % 2 == 0:
        real_idx = np.arange(1, n_coefs, 2)
        imag_idx = np.arange(2, n_coefs, 2)
        X_dft_new = np.c_[
            X_dft[:, :1],
            X_dft[:, real_idx] + 1j * np.c_[X_dft[:, imag_idx],
                                            np.zeros((n_samples, ))]
        ]
    else:
        real_idx = np.arange(1, n_coefs, 2)
        imag_idx = np.arange(2, n_coefs + 1, 2)
        X_dft_new = np.c_[
            X_dft[:, :1],
            X_dft[:, real_idx] + 1j * X_dft[:, imag_idx]
        ]
    X_irfft = np.fft.irfft(X_dft_new, n_timestamps)
    return X_irfft

---
As you probably saw, the time series are very noisy. Your goal here is to denoise them without losing too much information. If you did things correctly, you should be able to see that the two time series you analyzed are quite similar. Now store the fourth time series (index `i=3`) in the variable `ts3` and plot it.

---
You should be able to see that `ts1` and `ts2` are quite similar, while `ts3` is quite different.

Your goal here is to approximate `ts1`, `ts2` and `ts3` using one of PAA, SAX, or DFT, so that you remove the noise but keep the general shape of the time series. At the end, `ts1_approx` and `ts2_approx` should still be similar, while `ts3_approx` should be quite different.
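To illustrate the DFT option, here is a minimal low-pass sketch built on `np.fft` rather than the pyts `DiscreteFourierTransform` imported above. The helper name `dft_lowpass`, the noisy sine wave, and the choice of 4 coefficients are assumptions for illustration only.

```python
import numpy as np

def dft_lowpass(ts, n_keep):
    """Keep the n_keep lowest-frequency Fourier coefficients and zero the rest."""
    ts = np.asarray(ts, dtype=float)
    coefs = np.fft.rfft(ts)
    coefs[n_keep:] = 0.0  # discard the high-frequency content
    return np.fft.irfft(coefs, n=len(ts))

# Illustrative data: a clean sine wave plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
clean = np.sin(t)
noisy = clean + rng.normal(scale=0.5, size=t.shape)

smooth = dft_lowpass(noisy, n_keep=4)
print("error before:", np.mean((noisy - clean) ** 2))
print("error after: ", np.mean((smooth - clean) ** 2))
```

Keeping fewer coefficients removes more noise but also flattens sharp features, so the number of coefficients is the knob to tune when balancing denoising against shape preservation.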

---

1. Approximate `ts1`, `ts2` and `ts3` using one of PAA, SAX, or DFT.

---

2. Plot the original time series and the approximated time series (`ts1` with `ts1_approx`, `ts2` with `ts2_approx`, `ts3` with `ts3_approx`). Then plot all the approximations in the same plot.

---
3. Evaluate the results qualitatively by looking at the plots, and quantitatively by computing the DTW distance between `ts1_approx` and `ts2_approx`, and between `ts1_approx` and `ts3_approx`.
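For the approximation step, PAA is the simplest of the three options: it replaces each segment of the series by its mean. The sketch below is a plain NumPy version, not the sktime `PAA` transformer. The helper names and the requirement that the length be divisible by the number of segments are simplifying assumptions.

```python
import numpy as np

def paa(ts, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width segment."""
    ts = np.asarray(ts, dtype=float)
    if len(ts) % n_segments != 0:
        raise ValueError("this sketch needs len(ts) divisible by n_segments")
    return ts.reshape(n_segments, -1).mean(axis=1)

def paa_expand(means, length):
    """Repeat each segment mean so the approximation has the original length."""
    return np.repeat(means, length // len(means))

# Illustrative data: a noisy sine wave of length 128.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
noisy = np.sin(t) + rng.normal(scale=0.5, size=t.shape)

means = paa(noisy, n_segments=16)
approx = paa_expand(means, len(noisy))
print(means.shape, approx.shape)
```

The expanded version is a step function that can be plotted on top of the original series. The shorter vector of segment means is the one to use when comparing approximations with a distance.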

---
Now approximate the entire dataset (`X`) using PCA. You can use the `PCATransformer` from sktime. Find the number of components that explain at least 70% of the variance. Then, approximate the dataset using that number of components.
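The component-selection logic can be sketched independently of any particular PCA class. The version below uses an SVD in NumPy instead of sktime's `PCATransformer` and assumes the data has already been flattened to 2D, for example with `X[:, 0, :]` for univariate series. The helper name and the synthetic data are illustrative.

```python
import numpy as np

def n_components_for_variance(X2d, threshold=0.70):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    centered = X2d - X2d.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratio)
    # First index where the cumulative ratio is >= threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Illustrative data: three latent factors mixed into 128 dimensions, plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 1.0])
mixing = rng.normal(size=(3, 128))
X2d = latent @ mixing + rng.normal(scale=0.1, size=(200, 128))

k, cumulative = n_components_for_variance(X2d, threshold=0.70)
print("components needed:", k)
print("cumulative ratio of the first 3:", cumulative[:3])
```

With scikit-learn's `PCA`, the same cumulative curve comes from `np.cumsum(pca.explained_variance_ratio_)` after fitting.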

---

Plot `ts1` and its approximation using PCA.

---
Now use `Tabularizer` and `PCA` with 2 components to compress the dataset. Then, plot the compressed dataset. Use the class labels to color the points.
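The compression itself is a projection onto the first two principal directions. Here is a NumPy sketch of that projection on synthetic three-class data. It stands in for the `Tabularizer` + `PCA` pipeline rather than reproducing it, and the helper name `pca_project` is hypothetical.

```python
import numpy as np

def pca_project(X2d, n_components=2):
    """Project the rows of X2d onto their first principal directions."""
    centered = X2d - X2d.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Illustrative data: three classes with different mean shapes plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
shapes = np.vstack([np.sin(t), np.cos(t), np.zeros_like(t)])
labels = np.repeat([0, 1, 2], 30)
X2d = shapes[labels] + rng.normal(scale=0.5, size=(90, 128))

Z = pca_project(X2d, n_components=2)
for c in np.unique(labels):
    print("class", c, "centroid:", Z[labels == c].mean(axis=0))
```

A scatter plot of `Z[:, 0]` against `Z[:, 1]`, colored by integer-encoded class labels, shows whether the classes separate in the compressed space.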

## Distance-based Classification

In [107]:
from sktime.datasets import load_UCR_UEA_dataset
from sktime.transformations.series.summarize import SummaryTransformer
from sklearn.neighbors import KNeighborsClassifier
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

---
In the cells below, we use a very simple `SummaryTransformer` to convert the time series into tabular form and a KNN classifier to classify them.

For this exercise you want to maximize accuracy and minimize runtime. Your goal is to beat our `SummaryTransformer` + KNN classifier. You can use any normalization, approximation and feature extraction method you want. The only constraint is that the classifier must still be a KNN: you can use KNN directly on the time series (`KNeighborsTimeSeriesClassifier`), or convert the series to tabular form in some way (as we did with the `SummaryTransformer`) and then use a standard KNN from sklearn (`KNeighborsClassifier`).

- Can you find a model that is more accurate?
- Can you find a model that is faster and more accurate?
- Can you find a model that is faster, uses fewer features and is more accurate?

**Note 1:** You can use the `%%time` magic command to measure the time it takes to run a cell. You have to count all the transformations and the training/test time.

**Note 2:** Training has to be done only on the training set, and the test set should be used only for evaluation.
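To show how the two notes fit together, here is a self-contained sketch of one timed experiment: PAA features followed by a 1-nearest-neighbor classifier, both written in NumPy. The data is synthetic (noisy sine versus noisy cosine), and all helper names are hypothetical. It illustrates the workflow, not a result on CBF.

```python
import time
import numpy as np

def paa_features(X2d, n_segments):
    """Segment means of every row; the row length must be divisible by n_segments."""
    return X2d.reshape(len(X2d), n_segments, -1).mean(axis=2)

def one_nn_predict(F_train, y_train, F_test):
    """Label each test row with the class of its closest training row (Euclidean)."""
    d2 = ((F_test[:, None, :] - F_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d2, axis=1)]

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)

def make_split(n_per_class):
    """Synthetic two-class data: class 0 is a noisy sine, class 1 a noisy cosine."""
    shapes = np.vstack([np.sin(t), np.cos(t)])
    X = np.repeat(shapes, n_per_class, axis=0)
    X = X + rng.normal(scale=0.5, size=X.shape)
    y = np.repeat([0, 1], n_per_class)
    return X, y

X_tr, y_tr = make_split(10)    # small training set
X_te, y_te = make_split(100)   # larger test set, used only for evaluation

# Time everything that counts: feature extraction on both splits and prediction.
start = time.perf_counter()
F_tr = paa_features(X_tr, n_segments=16)
F_te = paa_features(X_te, n_segments=16)
y_hat = one_nn_predict(F_tr, y_tr, F_te)
elapsed = time.perf_counter() - start

accuracy = float(np.mean(y_hat == y_te))
print(f"accuracy={accuracy:.3f}, n_features={F_tr.shape[1]}, time={elapsed:.4f}s")
```

PAA has no parameters to fit, so applying it to the test set does not leak information. A transformation that learns from data, such as scaling or PCA, should be fitted on the training split only and then applied to the test split.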

In [121]:
X_train, y_train = load_UCR_UEA_dataset("CBF", split="TRAIN", return_type="numpy3D")
X_test, y_test = load_UCR_UEA_dataset("CBF", split="TEST", return_type="numpy3D")
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((30, 1, 128), (30,), (900, 1, 128), (900,))

In [122]:
summary = SummaryTransformer()

In [125]:
%%time
X_features_train = summary.fit_transform(X_train)
X_features_test = summary.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_features_train, y_train)
y_pred = knn.predict(X_features_test)

CPU times: user 3.27 s, sys: 14.4 ms, total: 3.29 s
Wall time: 3.46 s


In [126]:
accuracy_score(y_test, y_pred)

0.6633333333333333

In [131]:
print("n_features:", X_features_train.shape[1])

n_features: 9


---
Good luck!

## Discussion

---

What's the effect of dynamic time warping when comparing time series? Was it useful in this dataset? Why?

Which approximation method did you use? Did it work well? Why?

Did approximation make the time series more similar?

What can you learn from the 2D representation of the dataset obtained with PCA?

What was the best method you found for classification? Why do you think it worked better than the others?