# SCRIPT 02 - Train LSTM Model

In this script, the samples created with the previous script are used to train a model built as a linear stack of LSTM layers. Each LSTM layer is followed by a Batch Normalization layer, and a final Dense layer maps the output of the time-series layers to the expected result. The model is easy to customize, and each step is described in the text before its cell.

In this cell, important libraries are imported.

+ `tensorflow`
    + The library used to deal with all things deep learning in this script. With it, Keras is used as a submodule, and the model is defined, compiled, trained and stored.
+ `numpy`
    + Library used to open and store the training samples.
+ `os`
    + It is used for basic system functionalities, such as joining paths and creating folders.
+ `pyplot`
    + Used to plot data.
+ `IPython.display clear_output`
    + Used to clear the output of a cell and allow the creation of a graph for training curves that is updated on-the-fly.
+ `sklearn.metrics confusion_matrix`
    + Used to generate a confusion matrix, useful for model evaluation.
+ `pandas`
    + Necessary for using `disarray`
+ `disarray`
    + Used to generate many accuracy metrics that derive from a confusion matrix.
+ `glob`
    + Used to gather files from a folder that follow a certain name pattern. Used to get the model with the highest accuracy.

In [None]:
import tensorflow as tf
import numpy as np
import os
import matplotlib.pyplot as plt
from IPython.display import clear_output
from sklearn.metrics import confusion_matrix
import pandas as pd
import disarray
import glob

Next, the main parameters of the script are defined.

+ `model_id` (`string`)
    + String used to define an arbitrary id for the model being trained. This will be embedded into a file path, so please only use characters allowed in file paths.
+ `samples_id` (`string`)
    + The id of a group of samples previously created.
+ `path_models_folder` (`string`)
    + A path to the folder where all models should be stored. Can be the same for models with different ids.
+ `path_samples_folder` (`string`)
    + A path to the folder where previously created samples are stored.
+ `batch_size` (`int`)
    + The number of samples to be processed in each batch during training and prediction.
+ `epochs` (`int`)
    + The maximum number of epochs to be considered during training.

In [None]:
model_id = '001'
samples_id = '001'
path_models_folder = '/path/to/models/folder'
path_samples_folder = '/path/to/samples/folder'
batch_size = 2048
epochs = 200
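
Since `model_id` ends up inside a file path, it can be worth checking it before any folder is created. The sketch below is only an illustration: `is_safe_id` is a hypothetical helper that is not part of the original script, and the allowed character set (letters, digits, underscore, hyphen) is an assumption chosen to be conservative.

```python
import re

# Hypothetical helper: accept only letters, digits, underscores and hyphens,
# a conservative subset that is valid in file names on common systems.
def is_safe_id(identifier):
    return re.fullmatch(r'[A-Za-z0-9_-]+', identifier) is not None

# Example ids: two valid ones, one containing a path separator, one empty.
example_ids = ['001', 'lstm_v2', 'run/03', '']
safe = [is_safe_id(x) for x in example_ids]
print(safe)
```

An id such as `'run/03'` is rejected because the slash would create a nested folder rather than a single model folder.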

In the following code, a folder specific to the trained model is created. Its path is stored in `model_folder`.

In [None]:

model_folder = os.path.join(path_models_folder, model_id)
os.makedirs(model_folder, exist_ok=True)

This is one of the most important code snippets of the script: here the model is defined and compiled. Changes can easily be made here, such as changing the number of LSTM layers or the number of units in each of them. The loss function and the optimizer can also be changed here.

In [None]:
input_ = tf.keras.layers.Input(shape=[12,6], name='Input')
tensor = tf.keras.layers.LSTM(24, return_sequences=True, name='LSTM_1')(input_)
tensor = tf.keras.layers.BatchNormalization(name='BN_1')(tensor)
tensor = tf.keras.layers.LSTM(12, return_sequences=True, name='LSTM_2')(tensor)
tensor = tf.keras.layers.BatchNormalization(name='BN_2')(tensor)
tensor = tf.keras.layers.LSTM(6, return_sequences=True, name='LSTM_3')(tensor)
tensor = tf.keras.layers.BatchNormalization(name='BN_3')(tensor)
tensor = tf.keras.layers.LSTM(4, return_sequences=True, name='LSTM_4')(tensor)
tensor = tf.keras.layers.BatchNormalization(name='BN_4')(tensor)
tensor = tf.keras.layers.LSTM(2, return_sequences=False, name='LSTM_5')(tensor)
tensor = tf.keras.layers.BatchNormalization(name='BN_5')(tensor)
tensor = tf.keras.layers.Dense(2, activation='softmax', name='Dense_1')(tensor)

# Use the public Keras API (tf.keras) rather than the private tf.python namespace
model = tf.keras.models.Model(inputs=[input_], outputs=[tensor])
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              metrics=[tf.keras.metrics.CategoricalAccuracy()])

model.summary()
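
When the number of layers or units is changed, the total parameter count printed by `model.summary()` can be cross-checked by hand. The sketch below does this in plain Python for the stack defined above. It assumes the Keras default settings (standard LSTM with bias, Batch Normalization with scale and offset); the helper names are hypothetical and not part of the original script.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)

def batch_norm_params(features):
    # gamma and beta (trainable) plus moving mean and variance (non-trainable).
    return 4 * features

def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units

features = 6                      # last dimension of the Input layer
lstm_units = [24, 12, 6, 4, 2]    # units of LSTM_1 .. LSTM_5

total = 0
previous = features
for units in lstm_units:
    total += lstm_params(previous, units) + batch_norm_params(units)
    previous = units
total += dense_params(previous, 2)   # Dense_1 with 2 outputs
print(total)
```

Editing `lstm_units` in this sketch gives a quick estimate of how a different configuration changes the model size before building it.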

Defines a custom callback, in order to create a graph that shows loss and accuracy curves, updating it after each epoch.

In [None]:
class PlotLearning(tf.keras.callbacks.Callback):
    """
    Callback to plot the learning curves of the model during training.
    """
    def __init__(self, model_folder):
        # Initialize the base Callback before adding custom attributes
        super().__init__()
        self.model_folder = model_folder
        self.epochs = 0
    
    def on_train_begin(self, logs={}):
        self.metrics = {}
        for metric in logs:
            self.metrics[metric] = []
            
    def on_epoch_end(self, epoch, logs={}):
        self.epochs = epoch
        # Storing metrics
        for metric in logs:
            if metric in self.metrics:
                self.metrics[metric].append(logs.get(metric))
            else:
                self.metrics[metric] = [logs.get(metric)]
        
        # Plotting
        metrics = [x for x in logs if 'val' not in x]
        
        f, axs = plt.subplots(1, len(metrics), figsize=(15,5))
        clear_output(wait=True)

        for i, metric in enumerate(metrics):
            axs[i].plot(range(1, epoch + 2), 
                        self.metrics[metric], 
                        label=metric)
            # Check for the key itself: a missing key would raise KeyError and a value of 0.0 is falsy
            if 'val_' + metric in logs:
                axs[i].plot(range(1, epoch + 2), 
                            self.metrics['val_' + metric], 
                            label='val_' + metric)
                
            axs[i].legend()
            axs[i].grid()

        plt.tight_layout()
        plt.show()
    
    def on_train_end(self, logs={}):
        # Plotting
        metrics = [x for x in logs if 'val' not in x]
        
        f, axs = plt.subplots(1, len(metrics), figsize=(15,5))
        clear_output(wait=True)

        for i, metric in enumerate(metrics):
            axs[i].plot(range(1, self.epochs + 2), 
                        self.metrics[metric], 
                        label=metric)
            # Check for the key itself: a missing key would raise KeyError and a value of 0.0 is falsy
            if 'val_' + metric in logs:
                axs[i].plot(range(1, self.epochs + 2), 
                            self.metrics['val_' + metric], 
                            label='val_' + metric)
                
            axs[i].legend()
            axs[i].grid()

        plt.tight_layout()
        plt.savefig(os.path.join(self.model_folder, "training_metrics.png"))
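
The callback has to decide, for each training metric, whether a validation curve exists. A minimal sketch of that pairing logic, independent of Keras and matplotlib, is shown below. `pair_metrics` and the example values are hypothetical; testing for key membership is used because a validation value of exactly `0.0` would otherwise be treated as missing.

```python
def pair_metrics(logs):
    """Map each training metric name to its validation counterpart, or None."""
    pairs = {}
    for name in logs:
        if name.startswith('val_'):
            continue  # validation entries are looked up through their training name
        val_name = 'val_' + name
        pairs[name] = val_name if val_name in logs else None
    return pairs

# Hypothetical logs dictionary, shaped like the one Keras passes after an epoch.
example_logs = {'loss': 0.41, 'categorical_accuracy': 0.83,
                'val_loss': 0.0, 'val_categorical_accuracy': 0.80}
pairs = pair_metrics(example_logs)
print(pairs)
```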

The following code loads the training samples and converts the reference labels to one-hot encoding. The labels are shifted by 1 before conversion so that class indices start at 0.

In [None]:
# Use path_samples_folder, the variable defined in the parameters cell
samples_values = np.load(os.path.join(path_samples_folder, f'{samples_id}_scaled.npy'))
samples_reference = tf.keras.utils.to_categorical(np.load(os.path.join(path_samples_folder, f'{samples_id}_reference.npy'))-1)

samples_values.shape, samples_reference.shape
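
To make the label conversion concrete, the sketch below reproduces the same idea with NumPy only. The label values are hypothetical, and it is an assumption that the stored classes are numbered 1 and 2, which is what the `-1` shift and the two-unit output layer suggest.

```python
import numpy as np

# Hypothetical reference labels; the classes are assumed to be numbered 1 and 2.
labels = np.array([1, 2, 2, 1])

# Shift to zero-based indices, then pick the matching rows of an identity matrix.
zero_based = labels - 1
one_hot = np.eye(2, dtype='float32')[zero_based]
print(one_hot)
```

Each row holds a single 1 in the column of its class, which is the format expected by the categorical cross-entropy loss.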

In the following cell, the model is fitted to the training samples: the first 80% of the samples are used for training and the remaining 20% for validation. Note the `ModelCheckpoint` callback: it saves the model after an epoch only if that epoch gave the best result on the validation data so far. Therefore, the latest model saved in the model folder will always be the best one.

In [None]:
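# save_best_only monitors 'val_loss' by default; monitor validation accuracy explicitly instead
checkpoint = tf.keras.callbacks.ModelCheckpoint(os.path.join(model_folder,'model-{epoch:03d}.h5'),
                                                monitor='val_categorical_accuracy', mode='max',
                                                save_best_only=True)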

history_ = model.fit(samples_values[:int(len(samples_values)*0.8)], samples_reference[:int(len(samples_values)*0.8)],
                     validation_data=(samples_values[int(len(samples_values)*0.8):],
                                      samples_reference[int(len(samples_values)*0.8):]),
                     batch_size=batch_size,
                     epochs=epochs,
                     verbose=0,
                     callbacks=[checkpoint, PlotLearning(model_folder)])
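
The 80/20 split above is positional: the slicing takes the first 80% of the samples in stored order for training and the rest for validation. The sketch below shows the same split with the index computed once. `split_index` and the small arrays are hypothetical stand-ins for the real samples.

```python
import numpy as np

def split_index(n_samples, train_fraction=0.8):
    # Position at which the training part ends and the validation part begins.
    return int(n_samples * train_fraction)

# Hypothetical stand-ins: 10 samples of shape (12, 6) and one-hot labels.
values = np.arange(10 * 12 * 6, dtype='float32').reshape(10, 12, 6)
reference = np.eye(2, dtype='float32')[np.arange(10) % 2]

cut = split_index(len(values))
train_x, val_x = values[:cut], values[cut:]
train_y, val_y = reference[:cut], reference[cut:]
print(train_x.shape, val_x.shape)
```

Computing the index once keeps the training, evaluation and prediction cells consistent if the fraction is ever changed.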

The following code loads the saved model with the highest accuracy, because the model from the last training epoch is not necessarily the most accurate. This step helps to avoid overfitting.

In [None]:
models_paths = glob.glob(os.path.join(model_folder,'*.h5'))
models_paths.sort()
print('Highest Accuracy Model:', models_paths[-1])
model = tf.keras.models.load_model(models_paths[-1])
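
Picking the last element after sorting works because the epoch number in the file name is zero-padded, so alphabetical order matches numerical order. The sketch below illustrates this with hypothetical file names; with three digits the padding covers up to epoch 999.

```python
# Hypothetical checkpoint names following the 'model-{epoch:03d}.h5' pattern.
padded = ['model-{:03d}.h5'.format(e) for e in (2, 10, 1)]
latest_padded = sorted(padded)[-1]

# The same epochs without padding: alphabetical order no longer matches epoch order.
unpadded = ['model-{}.h5'.format(e) for e in (2, 10, 1)]
latest_unpadded = sorted(unpadded)[-1]

print(latest_padded, latest_unpadded)
```

Without padding, epoch 2 would sort after epoch 10 and the wrong file would be loaded.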

In the next cell, the model is evaluated, and a confusion matrix for the validation samples is generated.

In [None]:
model.evaluate(x=samples_values[int(len(samples_values)*0.8):], 
               y=samples_reference[int(len(samples_values)*0.8):],
               batch_size=batch_size)

pred = np.argmax(model.predict(samples_values[int(len(samples_values)*0.8):], batch_size=batch_size), axis=-1)

cm = confusion_matrix(y_pred = pred, y_true=np.argmax(samples_reference[int(len(samples_values)*0.8):], axis=-1))

cm
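
Several accuracy metrics can be derived directly from the confusion matrix. The notebook imports `disarray` for this purpose; the sketch below uses plain NumPy instead, on a hypothetical 2x2 matrix with made-up counts, to show how overall accuracy, recall and precision follow from the rows and columns.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions,
# which is the layout returned by sklearn.metrics.confusion_matrix.
cm_example = np.array([[50, 10],
                       [5, 35]])

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm_example) / cm_example.sum()
recall = np.diag(cm_example) / cm_example.sum(axis=1)      # per true class
precision = np.diag(cm_example) / cm_example.sum(axis=0)   # per predicted class

print(accuracy, recall, precision)
```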

Saves information about model layers to a .txt file, for later reference.

In [None]:
with open(os.path.join(model_folder,'model_layers.txt'), 'w+') as file:
    model.summary(print_fn=lambda x: file.write(x + '\n'))
    # Write the evaluation result and the confusion matrix on separate lines
    file.write(str(model.evaluate(x=samples_values[int(len(samples_values)*0.8):],
               y=samples_reference[int(len(samples_values)*0.8):],
               batch_size=batch_size)) + '\n')
    file.write(str(cm) + '\n')