# Time series classification - model development
This notebook is a stripped-down version of notebook 2, so that you can focus on model development.

# 1 Load Python modules

In [None]:
import time

import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

import tensorflow as tf # Fast numerical computation for machine learning, computations on GPU or CPU
import tensorflow.keras as keras # High-level interface to TensorFlow, making it easier to create neural networks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout

# General settings and variables
sns.set_style('whitegrid')
model_palette = ['rebeccapurple', 'mediumspringgreen']

class_names = ['cement', 'carpet']
class_colors = ['darkorange', 'steelblue']

# 2 Functions
Some functions, for convenience.

In [None]:
def load_data_and_labels(filename):
    ''' Load the dataset file and return the data and labels '''
    # Load the dataset
    url_root = 'https://raw.githubusercontent.com/jarusgnuj/ai-ml-wksh/master/data/UCR_TSC_archive/SonyAIBORobotSurface1_IoC'
    url = url_root+'/'+filename
    robot_df = pd.read_csv(url, sep='\t', header=None)
    print('Loaded from', url)
    robot_data = robot_df.values
    
    # Separate out the data and labels
    labels = robot_data[:,0].astype(int)
    data_samples = robot_data[:,1:]
    
    # Print information
    print('The shape of the data matrix is', data_samples.shape)
    print('Number of samples of class 0', (labels == 0).sum())
    print('Number of samples of class 1', (labels == 1).sum())

    return data_samples, labels


def plot_loss(log):
    ''' Plot the loss recorded in the log during model training '''
    ax = log[['loss', 'val_loss']].plot(title='Loss function during training', color=model_palette)
    ax.set_xlabel("Model training epoch")
    ax.set_ylabel("Loss")
    ax.legend(["training", "validation"]);
    
    
def plot_accuracy(log):
    ''' Plot the accuracy recorded in the log during model training '''
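    # Older Keras versions log 'acc'/'val_acc'; newer ones log 'accuracy'/'val_accuracy'.
    acc_key = 'acc' if 'acc' in log.columns else 'accuracy'
    ax = log[[acc_key, 'val_' + acc_key]].plot(title='Accuracy during training', color=model_palette)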
    ax.set_xlabel("Model training epoch")
    ax.set_ylabel("Accuracy")
    ax.legend(["training", "validation"]);

# 3 Load the development dataset

In [None]:
filename = 'SonyAIBORobotSurface1_IoC_DEV.txt'
data, labels = load_data_and_labels(filename)

## 3.1 Split into training and test sets

In [None]:
test_size = 100 ### CHANGE PARAMETER HERE ###

data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=test_size, random_state=21, stratify=labels)

# 4 Exercise 3a: Model development
In the build_model function you can change the number of nodes in each layer. You can also add and remove layers to change the number of layers. These are some of the model's "hyperparameters". 

Optional - use the dropout layer that is introduced at the end of this notebook if you wish.

+ Set test_size, batch_size, epochs to the values you settled on in the previous notebook.
+ Change model hyperparameters then build, train and evaluate the model.
  + You may wish to re-examine your choice of test_size, batch_size and epochs periodically.
+ What hyperparameter settings give consistently high validation accuracy?
+ When you have settled on the best set of hyperparameters, re-run model training five times and calculate the mean validation accuracy (see section 4.3.1).


## Competition part I
+ The best development model - highest mean validation accuracy over five training runs.
+ Tie-breaker - the lowest sample standard deviation.

## Competition part II 
This is in the next notebook.
+ Best performing model - the highest accuracy when tested on the final test dataset.

## 4.1 Build model
Create a multilayer perceptron (MLP) model.

In [None]:
# The size of the input vector
input_dim = data_train.shape[1]


def build_model(print_summary=False):
    ''' Return a model with randomly initialised weights '''
    ### CHANGE PARAMETERS HERE ###
    # Change the number of nodes in each layer.
    # Add or remove layers.
    model = Sequential([
        Dense(16, input_dim=input_dim, activation='relu', name='Layer1'), 
        Dense(8, activation='relu', name='Layer2'), 
        Dense(1, activation='sigmoid', name='OutputLayer')
    ])
    ### END OF CHANGE PARAMETERS ###

    optimizer = keras.optimizers.Adam() 
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    if print_summary:
        model.summary()  # summary() prints the table itself and returns None
    return model


model = build_model(print_summary=False)
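
To get a feel for how the layer sizes affect the size of the model, you can count the trainable parameters of a stack of Dense layers by hand: each layer has (inputs + 1) * nodes parameters, where the + 1 accounts for the bias of each node. The sketch below is only an illustration; `count_dense_params` is a hypothetical helper, and the input size of 70 is an assumed value (in this notebook the real value is `data_train.shape[1]`).

```python
def count_dense_params(input_dim, layer_sizes):
    """Count the weights and biases in a stack of fully connected layers."""
    total = 0
    n_in = input_dim
    for n_out in layer_sizes:
        total += (n_in + 1) * n_out  # n_in weights plus one bias for each node
        n_in = n_out                 # this layer's output feeds the next layer
    return total


example_input_dim = 70      # assumed input length, for illustration only
default_layers = [16, 8, 1]  # node counts matching the default build_model
n_params = count_dense_params(example_input_dim, default_layers)
print('Trainable parameters:', n_params)
```

Doubling the nodes in the first layer roughly doubles the total here, because most of the parameters sit between the input and the first layer.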

## 4.2 Train the model

In [None]:
batch_size = 8 ### CHANGE PARAMETER HERE ### 
epochs = 20 ### CHANGE PARAMETER HERE ###

model = build_model() # This re-initialises the model with random weights each time before we train it.

# Train
start = time.time()
hist = model.fit(data_train, labels_train, batch_size=batch_size, epochs=epochs, 
                 validation_data=(data_test, labels_test), verbose=1)
end = time.time()
log = pd.DataFrame(hist.history) 
print('Training complete in', round(end-start), 'seconds')
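
When choosing batch_size and epochs, it can help to know how many weight updates a training run performs: one update per batch, with the last (possibly smaller) batch of each epoch included. A minimal sketch, assuming a hypothetical training set of 500 samples; `weight_updates` is an illustrative helper, not part of Keras.

```python
import math


def weight_updates(n_train, batch_size, epochs):
    """Return (updates per epoch, updates over the whole training run)."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last batch may be partial
    return steps_per_epoch, steps_per_epoch * epochs


# 500 training samples is an assumed figure, for illustration only.
steps, total = weight_updates(n_train=500, batch_size=8, epochs=20)
print('Updates per epoch:', steps)
print('Updates in total:', total)
```

A smaller batch_size gives more, but noisier, updates per epoch; a larger one gives fewer, smoother updates.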

## 4.3 Evaluate

In [None]:
# Use the trained model to classify the test dataset.
result = model.evaluate(data_test, labels_test, batch_size=batch_size)
print('Validation accuracy:\t', result[1])
print('Validation loss:\t', result[0])
print('test_size:\t', test_size)
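
The two numbers reported above can be reproduced by hand from the model's predicted probabilities. The sketch below is a NumPy illustration of binary cross-entropy and of accuracy with a 0.5 threshold, using made-up labels and probabilities; the function names and the clipping constant `eps` are assumptions for this example, not Keras code.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_pred = (y_prob > threshold).astype(int)
    return float(np.mean(y_pred == y_true))


# Made-up values, for illustration only.
example_labels = np.array([1, 0, 1, 0])
example_probs = np.array([0.9, 0.1, 0.8, 0.4])
example_loss = binary_crossentropy(example_labels, example_probs)
example_acc = binary_accuracy(example_labels, example_probs)
print('Loss:', example_loss)
print('Accuracy:', example_acc)
```

Note that the loss rewards confident correct predictions, so two models with the same accuracy can still have different losses.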

### 4.3.1 Calculate the mean average validation accuracy
When you have settled on the best set of hyperparameters, re-run the model training 5 times. Enter your validation accuracy results in the array below to calculate the mean and sample standard deviation. The sample standard deviation measures how much your results vary between runs; a lower value means less variability.

In [None]:
val_acc_results = np.array([0.70, 0.80, 0.73, 0.88, 0.93]) ### CHANGE PARAMETERS HERE ###
print('Mean average:', val_acc_results.mean())
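# ddof=1 gives the sample standard deviation; NumPy's default (ddof=0) is the population value.
print('Sample standard deviation:', val_acc_results.std(ddof=1))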
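
A note on the standard deviation: NumPy's `std()` divides by n by default (`ddof=0`), which is the population standard deviation. The sample standard deviation divides by n - 1 and is obtained with `ddof=1`. The short sketch below compares the two on the placeholder results above; the variable names are chosen for this example only.

```python
import numpy as np

# Placeholder results from the cell above, for illustration only.
example_results = np.array([0.70, 0.80, 0.73, 0.88, 0.93])

population_std = example_results.std()     # divides by n (ddof=0, NumPy default)
sample_std = example_results.std(ddof=1)   # divides by n - 1

print('Population standard deviation:', population_std)
print('Sample standard deviation:', sample_std)
```

With only five runs the difference is noticeable, so use the same convention when comparing results for the tie-breaker.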

# 5 Optional

## 5.1 Optional : dropout
Dropout layers can improve a model's generalisation. A dropout layer randomly turns off some of the nodes during each training batch calculation. This can prevent a model from overfitting to the training data. Overfitting can result in high training accuracy but low validation accuracy. 
+ Does your model suffer from this?
+ What happens if you increase the number of nodes in the model? Does validation accuracy start to decrease, suggesting that overfitting is occurring?

Try adding dropout layers to your model. An example of such a model is given in the notebook Appendix_4_dropout.
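
To make the idea of "turning off nodes" concrete, here is a NumPy sketch of what a dropout step does to a batch of activations during training. It illustrates the idea and is not Keras's implementation; `dropout_forward` is a hypothetical function name. The kept values are scaled up by 1 / (1 - rate) so that the expected sum of the activations stays the same, which is also what Keras's Dropout layer does during training.

```python
import numpy as np


def dropout_forward(activations, rate, rng):
    """Zero a random fraction of activations and rescale the rest (training only)."""
    keep_mask = rng.random(activations.shape) >= rate  # True where a node is kept
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
example_activations = np.ones((4, 8))  # a made-up batch: 4 samples, 8 nodes
dropped = dropout_forward(example_activations, rate=0.5, rng=rng)
print(dropped)
```

With rate=0.5 every entry of the output is either 0 (dropped) or 2 (kept and rescaled). At prediction time no nodes are dropped.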

## 5.2 Optional : k-fold cross validation
k-fold cross validation is a method for repeatedly training and testing a model on different splits of the data. Take a look at the notebook Appendix_3_kfold_cross_validation if you'd like to try it.
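
As a rough picture of how the splitting works, the sketch below divides shuffled sample indices into k folds and uses each fold once for validation while the others are used for training. This is a plain, unstratified k-fold written in NumPy for illustration; `kfold_indices` is a hypothetical helper. The `RepeatedStratifiedKFold` class imported at the top of this notebook goes further by keeping the class proportions in each fold and repeating the procedure.

```python
import numpy as np


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index arrays for plain k-fold cross validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # k nearly equal groups of indices
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


# 20 samples and 5 folds are made-up values, for illustration only.
splits = list(kfold_indices(n_samples=20, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print('Fold', fold, 'train size', len(train_idx), 'validation size', len(val_idx))
```

Each sample appears in exactly one validation fold, so every sample is used for validation once and for training k - 1 times.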