# Timeseries classification using LSTM

## Introduction

In this lab exercise, we will apply a simple LSTM network to timeseries classification.

*This lab is adapted from the example code on keras.io.*

## Setup

In [None]:
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

## Load the data: the FordA dataset

### Dataset description

The dataset we are using here is called FordA.
The data comes from the UCR archive.
The dataset contains 3601 training instances and another 1320 testing instances.
Each timeseries corresponds to a measurement of engine noise captured by a motor sensor.
For this task, the goal is to automatically detect the presence of a specific issue with
the engine. The problem is a balanced binary classification task. The full description of
this dataset can be found [here](http://www.j-wichard.de/publications/FordPaper.pdf).

### Read the data

We will use the `FordA_TRAIN` file for training and the
`FordA_TEST` file for testing. The simplicity of this dataset
allows us to demonstrate effectively how to use an LSTM for timeseries classification.
In both files, the first column holds the label and the remaining columns hold the
timeseries values.

In [None]:
train_data_url = 'https://raw.githubusercontent.com/nyp-sit/iti107/main/session-5/FordA_TRAIN.txt'
# sep=r"\s+" splits on any whitespace; delim_whitespace is deprecated in recent pandas
train_df = pd.read_csv(train_data_url, sep=r"\s+", header=None)
train_df

In [None]:
test_data_url = 'https://raw.githubusercontent.com/nyp-sit/iti107/main/session-5/FordA_TEST.txt'
# sep=r"\s+" splits on any whitespace; delim_whitespace is deprecated in recent pandas
test_df = pd.read_csv(test_data_url, sep=r"\s+", header=None)

In [None]:
# Column 0 holds the label; columns 1 onwards hold the timeseries values.
x_train, y_train = train_df.loc[:,1:].values, train_df.loc[:,0].values
x_test, y_test = test_df.loc[:,1:].values, test_df.loc[:,0].values

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
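
The dataset description calls the task balanced. A quick way to check that claim once the labels are loaded is to count each class with `np.unique`. The sketch below uses a small made-up label array (`toy_labels`, an assumption for illustration) so that it runs without downloading the data; with the real data you would pass `y_train` instead. The helper name `class_fractions` is also hypothetical.

```python
import numpy as np

def class_fractions(labels):
    """Return a dict mapping each label to its share of the samples."""
    values, counts = np.unique(labels, return_counts=True)
    total = counts.sum()
    return {float(v): float(c / total) for v, c in zip(values, counts)}

# Hypothetical labels using the same -1 / 1 coding as FordA.
toy_labels = np.array([-1.0, 1.0, 1.0, -1.0, 1.0, -1.0])
fractions = class_fractions(toy_labels)
print(fractions)
```

Fractions close to 0.5 for each class would be consistent with a balanced binary problem.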

## Visualize the data

Here we visualize one timeseries example for each class in the dataset.

In [None]:
classes = np.unique(np.concatenate((y_train, y_test), axis=0))

plt.figure()
for c in classes:
    c_x_train = x_train[y_train == c]
    plt.plot(c_x_train[0], label="class " + str(c))
plt.legend(loc="best")
plt.show()
plt.close()

## Standardize the data

All of our timeseries already have the same length (500). In general, however, raw
timeseries values can span very different ranges. This is not ideal for a neural network;
we should usually normalize the input values.
For this specific dataset, the data is already z-normalized: each timeseries sample
has a mean equal to zero and a standard deviation equal to one. This type of
normalization is very common for timeseries classification problems, see
[Bagnall et al. (2016)](https://link.springer.com/article/10.1007/s10618-016-0483-9).
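
Per-sample z-normalization subtracts each series' own mean and divides by its own standard deviation. Below is a minimal sketch, assuming a 2D array of shape (samples, timesteps). The helper name `z_normalize`, the `eps` guard, and the random input are illustrative and not part of the original lab.

```python
import numpy as np

def z_normalize(x, eps=1e-8):
    """Z-normalize each row (one timeseries per row) independently."""
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)  # eps guards against constant series

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(4, 500))  # synthetic stand-in data
normalized = z_normalize(raw)

# Each row should now have mean close to 0 and standard deviation close to 1.
print(normalized.mean(axis=1))
print(normalized.std(axis=1))
```

Applying the same two `mean`/`std` checks to `x_train` at this point should give values close to 0 and 1 if the data is z-normalized as described.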

Note that the timeseries data used here are univariate, meaning we only have one channel
per timeseries example.
Keras recurrent layers expect 3D input of shape (samples, timesteps, channels), so we
add a channel axis of size one using a simple reshaping via numpy.
This will allow us to construct a model that is easily applicable to multivariate
timeseries.

In [None]:
x_train = np.expand_dims(x_train, axis=2) 
x_test = np.expand_dims(x_test, axis=2) 

In [None]:
x_train.shape

Finally, the output layer needs one unit per class, so we count the number of classes
beforehand. We will train with `sparse_categorical_crossentropy`, which takes integer
class labels rather than one-hot vectors.

In [None]:
num_classes = len(np.unique(y_train))
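
As a worked illustration of the loss: for one sample, sparse categorical crossentropy is the negative log of the probability that the model assigns to the true class, where the label is used as an index into the output vector. The NumPy sketch below reproduces that formula on made-up probabilities. It is an illustration under that assumption, not the Keras implementation, and `sparse_cce` is a hypothetical helper name.

```python
import numpy as np

def sparse_cce(y_true, probs):
    """Mean of -log(probability assigned to the true class)."""
    y_true = np.asarray(y_true, dtype=int)
    # Use each integer label as a column index into its row of probabilities.
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

# Made-up softmax outputs for three samples and two classes.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
y_true = [0, 1, 1]
loss = sparse_cce(y_true, probs)
print(loss)
```

Because the labels act as indices, they must lie in the range 0 to `num_classes - 1`.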

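Now we shuffle the training set because we will be using the `validation_split` option
later when training. Keras takes the validation samples from the end of the arrays, so
the data should not be ordered in any meaningful way.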

In [None]:
idx = np.random.permutation(len(x_train))
x_train = x_train[idx]
y_train = y_train[idx]
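
The permutation above changes on every run. For a reproducible split, one option is a seeded generator. The sketch below uses toy arrays and an arbitrary seed (both assumptions for illustration) and shows which samples would end up in the held-out tail.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen for illustration

# Toy stand-ins for x_train / y_train, deliberately sorted by class.
x = np.arange(10).reshape(10, 1)
y = np.array([0] * 5 + [1] * 5)

idx = rng.permutation(len(x))  # one permutation keeps x and y aligned
x_shuffled, y_shuffled = x[idx], y[idx]

# With validation_split=0.2, the last 20% of the samples are held out.
n_val = int(len(x) * 0.2)
y_val = y_shuffled[-n_val:]
print(y_val)
```

Without the shuffle, the held-out tail of this toy array would contain only class 1.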

Map the labels to non-negative integers.
The original labels are -1 and 1; after replacing -1 with 0, the expected labels are 0 and 1.

In [None]:
y_train[y_train == -1] = 0
y_test[y_test == -1] = 0

## Build a model

We use two stacked LSTM layers to capture the temporal information, each returning its hidden state at every timestep. We then feed these per-timestep outputs into dense layers for classification.

Note that we set `return_sequences=True` so that each LSTM returns its hidden state at every timestep. The output is therefore a 3D tensor of shape (batch, timesteps, features). To apply a Dense layer to every timestep, we use the Keras `TimeDistributed` wrapper. Before the final Dense layer, we `Flatten` this output to a 2D shape (batch, features).
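
The shape bookkeeping described above can be traced with plain NumPy. This is a sketch of the shapes only: random numbers stand in for the LSTM outputs and for the Dense weights, the unit counts (32 and 16) follow the model in the next cell, and the short sequence length of 10 is an assumption to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, lstm_units, dense_units = 4, 10, 32, 16

# Stand-in for an LSTM output with return_sequences=True.
hidden = rng.normal(size=(batch, timesteps, lstm_units))

# One shared set of Dense weights, reused at every time step.
W = rng.normal(size=(lstm_units, dense_units))
b = np.zeros(dense_units)

# Apply the same Dense + ReLU to every time step at once.
per_step = np.maximum(hidden @ W + b, 0.0)   # (batch, timesteps, 16)

# Equivalent explicit loop over time steps, for comparison.
looped = np.stack(
    [np.maximum(hidden[:, t] @ W + b, 0.0) for t in range(timesteps)], axis=1
)

# Flattening merges the time and feature axes into one feature axis.
flat = per_step.reshape(batch, -1)           # (batch, timesteps * 16)

print(per_step.shape, flat.shape)
```

With the real sequence length of 500, the flattened feature axis would have 500 * 16 = 8000 entries per sample.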

In [None]:
def make_model(input_shape):
    input_layer = keras.layers.Input(input_shape)
    x = keras.layers.LSTM(32, return_sequences=True)(input_layer)
    x = keras.layers.LSTM(32, return_sequences=True)(x)
    x = keras.layers.TimeDistributed(keras.layers.Dense(16, activation='relu'))(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Flatten()(x)
    output_layer = keras.layers.Dense(num_classes, activation="softmax")(x)
    return keras.models.Model(inputs=input_layer, outputs=output_layer)

model = make_model(input_shape=x_train.shape[1:])
model.summary()

## Train the model

In [None]:
import os
import time

epochs = 250
batch_size = 256

def create_tb_callback():
    """Create a TensorBoard callback that logs to a new directory for each run."""
    root_logdir = os.path.join(os.curdir, "tb_logs")
    # Timestamped subdirectory so that runs do not overwrite each other.
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    run_logdir = os.path.join(root_logdir, run_id)
    return keras.callbacks.TensorBoard(run_logdir)


callbacks = [
    keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=50, restore_best_weights=True
    ),
#     keras.callbacks.ModelCheckpoint(
#         "best_model", save_best_only=True, monitor="val_accuracy"
#     ),
    
    create_tb_callback()
]
model.compile(
    optimizer=keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=callbacks,
    validation_split=0.2,
    verbose=1,
)

## Evaluate model on test data

In [None]:
# model = keras.models.load_model("best_model")

test_loss, test_acc = model.evaluate(x_test, y_test)

print("Test accuracy", test_acc)
print("Test loss", test_loss)

## Visualize training using TensorBoard

In [None]:
%load_ext tensorboard

%tensorboard --logdir tb_logs

We can see that the training accuracy reaches almost 1 after 100 epochs, while
the validation accuracy stays at around 0.88. The model is clearly overfitting. Try experimenting with other regularization methods, such as L1/L2 weight regularization.
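
As one possible starting point for that experiment, Keras layers accept a `kernel_regularizer` argument. The sketch below adds an L2 penalty to a smaller, single-LSTM variant of the model. The function name `make_regularized_model`, the penalty strength `1e-4`, and the layer sizes are assumptions to tune, not recommended values.

```python
from tensorflow import keras

def make_regularized_model(input_shape, num_classes, l2_strength=1e-4):
    """Variant of the lab model with L2 weight penalties (illustrative)."""
    reg = keras.regularizers.l2(l2_strength)
    inputs = keras.layers.Input(shape=input_shape)
    # The L2 penalty on the layer weights is added to the training loss.
    x = keras.layers.LSTM(
        32, return_sequences=True, kernel_regularizer=reg
    )(inputs)
    x = keras.layers.TimeDistributed(
        keras.layers.Dense(16, activation="relu", kernel_regularizer=reg)
    )(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Flatten()(x)
    outputs = keras.layers.Dense(
        num_classes, activation="softmax", kernel_regularizer=reg
    )(x)
    return keras.models.Model(inputs=inputs, outputs=outputs)

reg_model = make_regularized_model(input_shape=(500, 1), num_classes=2)
reg_model.summary()
```

Swapping in `keras.regularizers.l1` or `keras.regularizers.l1_l2` follows the same pattern. The model can then be compiled and trained exactly as before, and its validation curve compared against the original run in TensorBoard.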