# Requirements to Run

Please see the notebook ml-supervised-exersises for the requirements to run this notebook.

In [None]:
import numpy as np
import pandas as pd

from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans
from umap import UMAP

import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import matplotlib.cm as colormap
import seaborn as sns

from IPython.display import Image

# Load in the Data

Note that the data should be split into train and test sets using the script in the bin directory before running this notebook.

When training different models, it is important to keep a separate test set that is never used for training. This lets us check that the model is not overfitting to the training data. The test set should only be used to evaluate the final model, never to tune it. If the model is tuned using the test set, the measured performance will be optimistic and the model may not generalise well to unseen data.

In the supervised task, it is important to have separate validation and test sets: the validation set is used for tuning, so the test set stays untouched and the model is not overfitted to the known labels.

For the unsupervised task, in certain cases it may not be wise to perform a train-test split. For example, if the goal is to cluster a fixed dataset, then the test set is not required as we are only interested in relationships within the current dataset. If the goal is to cluster data that is continually being collected, then a test set is required to evaluate the expected performance of the model on unseen data.

In this notebook, we assume we want methods that have the potential to generalise and so we will split the data into train, validation and test sets.
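
As a rough illustration of a three-way split, the sketch below uses scikit-learn's `train_test_split` twice on synthetic data. This is an assumption for illustration only: the split used by this notebook is produced by the script in the bin directory, which is not shown here, and the `toy_*` names and the 60/20/20 proportions are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real events: 100 samples, 8 features, 4 balanced classes.
rng = np.random.default_rng(0)
toy_data = rng.normal(size=(100, 8))
toy_labels = np.repeat(np.arange(4), 25)

# First hold out 20% as the test set, stratified so class proportions are preserved.
toy_trainval, toy_test, toy_trainval_labels, toy_test_labels = train_test_split(
    toy_data, toy_labels, test_size=0.2, stratify=toy_labels, random_state=0
)

# Then carve a validation set out of the remainder (25% of 80% = 20% overall).
toy_train, toy_val, toy_train_labels, toy_val_labels = train_test_split(
    toy_trainval,
    toy_trainval_labels,
    test_size=0.25,
    stratify=toy_trainval_labels,
    random_state=0,
)

print(toy_train.shape, toy_val.shape, toy_test.shape)
```

Splitting off the test set first means that nothing done later with the training and validation data can leak into it.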

In [None]:
# Switch the data type by changing the path
# 3 options by default: raw, raw-fft-split-1, raw-fft-split-3
# raw: time series data, 6000 samples, 100 Hz, padded
# raw-fft-split-1: absolute value of the FFT of the unpadded time series data, 512 bins
# raw-fft-split-3: time series split into 3 sections in time, absolute value of the FFT of each chunk, 512 bins per chunk

train_data = np.load("../datasets/llaima/raw-fft-split-3/train_data.npy")
train_labels = np.load("../datasets/llaima/train_labels.npy")

val_data = np.load("../datasets/llaima/raw-fft-split-3/val_data.npy")
val_labels = np.load("../datasets/llaima/val_labels.npy")

test_data = np.load("../datasets/llaima/raw-fft-split-3/test_data.npy")
test_labels = np.load("../datasets/llaima/test_labels.npy")

In [None]:
# Here we ensure that the data is in the correct format for the models
# Note that we reshape the data to be 1D arrays
# This only affects the case where we have split in time and computed the FFT on multiple segments
train_data = train_data.reshape(train_data.shape[0], -1)
val_data = val_data.reshape(val_data.shape[0], -1)
test_data = test_data.reshape(test_data.shape[0], -1)

In [None]:
# Combine our train and validation sets


In [None]:
label_dict = {
    0: "LP",
    1: "TC",
    2: "TR",
    3: "VT",
}

## What does the data look like

In [None]:
grid_size = 3
fig, ax = plt.subplots(grid_size, grid_size, figsize=(7, 7))
for i in range(grid_size):
    for j in range(grid_size):
        # Index of the sample shown in row i, column j (first grid_size**2 samples)
        idx = i * grid_size + j
        ax[i][j].plot(train_data[idx])
        ax[i][j].set_title(label_dict[train_labels[idx]])
fig.suptitle("Sample Events")
fig.tight_layout()

## Clustering

Clustering is a common type of unsupervised learning that aims to group similar instances together without the need for any labels. An example of this with 2D data is shown below (credit: DataCamp, https://www.datacamp.com/tutorial/k-means-clustering-python).

A common clustering method is K means clustering. This method aims to find a fixed number of clusters in the data. The number of clusters is a hyperparameter that needs to be set before training the model. The number of clusters can be set by the user or can be found using a search method such as grid search.
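
One possible way to search over the number of clusters is sketched below. It assumes the silhouette score as the selection criterion and uses synthetic blobs with three hand-picked centres; both choices, and the `toy_X`, `scores` and `best_k` names, are assumptions made for illustration rather than part of this notebook's pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (hypothetical centres).
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit K means for each candidate number of clusters and score the result.
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_X)
    scores[k] = silhouette_score(toy_X, labels_k)

# Keep the candidate with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(scores)
print("Selected number of clusters:", best_k)
```

The silhouette score needs no labels, so the same loop could be run on unlabelled data; on real data the maximum is often far less clear-cut than on this toy example.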

The K means algorithm works by randomly assigning each instance to a cluster. Then the mean of each cluster is calculated. The instances are then reassigned to the cluster with the closest mean. This process is repeated until the clusters no longer change.
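
The steps above can be sketched in plain NumPy as follows. This is a minimal illustration of the random-assignment variant described in the text, not the scikit-learn implementation; the function name `kmeans_sketch`, its parameters, the empty-cluster handling and the toy data are all assumptions.

```python
import numpy as np


def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Toy K means following the steps in the text (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each instance to a cluster.
    assignments = rng.integers(0, n_clusters, size=X.shape[0])
    for _ in range(n_iter):
        # Step 2: compute the mean of each cluster.
        # Assumption: an empty cluster is re-seeded with a random data point.
        centroids = np.stack([
            X[assignments == k].mean(axis=0)
            if np.any(assignments == k)
            else X[rng.integers(0, X.shape[0])]
            for k in range(n_clusters)
        ])
        # Step 3: reassign each instance to the cluster with the closest mean.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
    return assignments, centroids


# Two well-separated 2D blobs of 50 points each.
data_rng = np.random.default_rng(1)
toy_points = np.vstack([
    data_rng.normal(0, 1, size=(50, 2)),
    data_rng.normal(10, 1, size=(50, 2)),
])
toy_assignments, toy_centroids = kmeans_sketch(toy_points, n_clusters=2)
print(toy_centroids)
```

In practice the `KMeans` class from scikit-learn is preferable, since it adds a more careful initialisation and multiple restarts.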

In [None]:
Image(filename='images/kmeans.png', width=500)

In [None]:
kmeans_model = KMeans(n_clusters=4, random_state=0)

train_preds = kmeans_model.fit_predict(train_data)

How do we evaluate the performance of this? Computing the accuracy directly is not appropriate, because the cluster indices assigned by K means are arbitrary and do not correspond to the label indices. It would be nice to visualise the clusters instead. To do this, we can use a dimensionality reduction technique such as UMAP to reduce the data to 2 dimensions and then plot the clusters. This is shown below.

In [None]:
dimension_reducer = UMAP(n_components=2)
reduced_data = dimension_reducer.fit_transform(train_data)

In [None]:
df = pd.DataFrame({"x": reduced_data[:, 0], "y": reduced_data[:, 1], "label": train_labels, "pred": train_preds})
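df["label"] = df["label"].map(label_dict)
# Cluster indices are arbitrary, so name them as clusters rather than reusing the class names
df["pred"] = "cluster " + df["pred"].astype(str)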

fig, ax = plt.subplots(figsize=(7, 7))
sns.scatterplot(data=df, x="x", y="y", hue="label", style="pred", palette="tab10", ax=ax)

Above we can see that there is a relationship between the clusters obtained and the labels; however, the quality of the clustering is not clear. A useful metric for evaluating the similarity between the labels and the predictions is the adjusted Rand index. This metric has an upper bound of 1, reached when the two groupings are identical. Values close to 0 indicate agreement no better than random assignment, and negative values are possible when the agreement is worse than expected by chance.
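
To make the behaviour of the adjusted Rand index concrete, the small sketch below compares it with a naive accuracy on hand-made label arrays. The arrays and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

true_labels = np.array([0, 0, 1, 1, 2, 2])

# Same grouping as the true labels, but with the cluster indices renamed.
renamed_clusters = np.array([2, 2, 0, 0, 1, 1])
# Everything placed in a single cluster.
single_cluster = np.zeros(6, dtype=int)

# Naive accuracy compares indices directly, so it penalises the renaming.
naive_accuracy = np.mean(true_labels == renamed_clusters)

# The adjusted Rand index only looks at which samples are grouped together.
ari_renamed = adjusted_rand_score(true_labels, renamed_clusters)
ari_single = adjusted_rand_score(true_labels, single_cluster)

# A grouping that systematically disagrees with the labels scores below zero.
ari_negative = adjusted_rand_score([0, 0, 1, 1], [0, 1, 0, 1])

print("Naive accuracy (renamed clusters):", naive_accuracy)
print("ARI (renamed clusters):", ari_renamed)
print("ARI (single cluster):", ari_single)
print("ARI (systematic disagreement):", ari_negative)
```

The renamed clustering is a perfect grouping, yet its naive accuracy is zero, which is why a permutation-invariant metric is used here.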

In [None]:
rand_score = adjusted_rand_score(train_labels, train_preds)

print("Adjusted Rand Score: ", rand_score)

There are many methods to try and improve this score. A simple way could be to attempt to cluster the data using different numbers of clusters. An alternative would be to cluster the data based on a reduced number of dimensions. This could be done by using a dimensionality reduction technique such as PCA or UMAP. This is shown below.

In [None]:
# The goal is to simplify the problem; however, there is no need to reduce the data all the way to 2D for clustering.

# There are arguments regarding the use of UMAP for clustering - do so with caution!
# https://umap-learn.readthedocs.io/en/latest/clustering.html
dimension_reducer2 = UMAP(n_components=10, min_dist=0)
# Use the 10-dimensional reducer defined above, not the 2D one used for plotting
not_so_reduced_data = dimension_reducer2.fit_transform(train_data)

kmeans_model2 = KMeans(n_clusters=4, random_state=0)
train_preds2 = kmeans_model2.fit_predict(not_so_reduced_data)

rand_score2 = adjusted_rand_score(train_labels, train_preds2)

print("Adjusted Rand Score: ", rand_score2)

There is a slight increase in performance when clustering the reduced data, but the improvement is not significant. For unsupervised methods, the feature extraction and feature engineering phases of the process are very important. By transforming the data into a more suitable representation, the performance of the model could potentially be improved. Additionally, methods using autoencoders or other neural networks could be used to learn a more suitable representation of the data.
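
The same reduce-then-cluster idea can be sketched with PCA in place of UMAP. The example below runs on synthetic blobs rather than the notebook's data; the 50 input features, the 10 retained components and the variable names are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic high-dimensional data: 400 samples, 50 features, 4 groups.
blob_X, blob_y = make_blobs(n_samples=400, n_features=50, centers=4, random_state=0)

# Project onto the 10 directions of largest variance.
pca = PCA(n_components=10, random_state=0)
blob_X_reduced = pca.fit_transform(blob_X)

# Cluster in the reduced space and compare against the known groups.
pca_preds = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(blob_X_reduced)
pca_ari = adjusted_rand_score(blob_y, pca_preds)

print("Reduced shape:", blob_X_reduced.shape)
print("Adjusted Rand Score:", pca_ari)
```

PCA is linear and deterministic for a fixed solver, which makes it a useful baseline before moving to non-linear reducers; whether it helps on a given dataset has to be checked empirically.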

Potentially a different clustering method could also give better results! K means is the baseline method and is a good starting point, however there are many other methods that could be tried. A starting point is to try different clustering methods that are available in scikit-learn (https://scikit-learn.org/stable/modules/clustering.html). Different methods will have the potential to perform better on different types of data. It is important to try different methods and see which works best for the data at hand.
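
A minimal sketch of such a comparison is given below, assuming a toy two-moons dataset and three scikit-learn estimators. The choice of methods, the hyperparameters such as `eps=0.3`, and the variable names are illustrative assumptions, not recommendations for the data in this notebook.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Toy data: two interleaved half-moons, a shape that centroid-based methods struggle with.
moons_X, moons_y = make_moons(n_samples=200, noise=0.05, random_state=0)

candidate_models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative (single linkage)": AgglomerativeClustering(n_clusters=2, linkage="single"),
    "DBSCAN": DBSCAN(eps=0.3),
}

# Fit each method on the same data and score it against the known groups.
method_scores = {}
for name, model in candidate_models.items():
    moons_preds = model.fit_predict(moons_X)
    method_scores[name] = adjusted_rand_score(moons_y, moons_preds)

for name, score in method_scores.items():
    print(f"{name}: {score:.3f}")
```

All three estimators expose `fit_predict`, so the same loop could be pointed at other data by replacing `moons_X` and `moons_y`, with the number of clusters and other hyperparameters adjusted accordingly.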

In [None]:
Image(filename='images/cluster-comparison.png')