# Deep Learning: common issues and solutions

This notebook presents various techniques. It must be handed in and will be marked.
You must add your own comments and tests; these are what will be assessed:
* Commentary when comparing different approaches
* Your own tests when trying different parameters

In [None]:
import pandas as pd
import numpy as np

In [None]:
import matplotlib.pyplot as plt

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import plot_model 

In [None]:
#!pip install keras-tuner --upgrade
import keras_tuner

In [None]:
### Some global constants
epochs=100
batch_size=256
patience=10
hidden_dim=256

In [None]:
### Usual function to babysit the network

# It is important to systematically observe the learning curves
def babysit(history):
    # `history` is the dict of per-epoch metrics, e.g. model.fit(...).history
    keys = [key for key in history.keys() if not key.startswith("val_")]
    # squeeze=False keeps `ax` 2-D even when there is a single metric to plot
    fig, ax = plt.subplots(nrows=1, ncols=len(keys), figsize=(18, 5), squeeze=False)
    for i, key in enumerate(keys):
        ax[0, i].plot(history[key], label=key)
        if "val_"+key in history.keys():
            ax[0, i].plot(history["val_"+key], label="val_"+key)
        ax[0, i].legend()
        ax[0, i].set_title(key)
    plt.show()

In [None]:
### Usual callback for training a deep learning model

# It is important to use early stopping systematically
callbacks_list = [EarlyStopping(monitor='val_accuracy', mode='max',
                                patience=patience,
                                restore_best_weights=True)]
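
To make the role of `patience` concrete, the sketch below mimics the early-stopping rule in plain Python. It is a simplified illustration of the idea, not Keras's implementation: the function `early_stopping` and the validation scores are made up for the example.

```python
def early_stopping(val_scores, patience):
    """Return (epoch at which training stops, epoch of the best score)."""
    best_score = float("-inf")
    best_epoch = -1
    wait = 0  # number of consecutive epochs without improvement
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch


# Hypothetical validation accuracies, one per epoch
scores = [0.60, 0.65, 0.70, 0.69, 0.68, 0.71, 0.70, 0.70, 0.69]
stop_epoch, best_epoch = early_stopping(scores, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the weights kept at the end are those of the best epoch, not those of the epoch at which training stopped.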

## 1. Today's lab

In this lab we use part of the 'Amazon_Unlocked_Mobile.csv' dataset published on Kaggle. The dataset contains the following information:
* Product Name
* Brand Name
* Price
* Rating
* Reviews
* Review Votes

We are mainly interested in the 'Reviews' (X) and the 'Rating' (y).

As you did in the previous lab, the goal is to try to predict the 'Rating' after reading the 'Reviews'.
We will mostly use this dataset as a case study to illustrate issues you may encounter with a Multilayer Perceptron or other Deep Learning architectures, namely:

1) **Text preprocessing with Tensorflow API**

2) **The vanishing gradient problem**

Problem: your model does not learn at all!

3) **Underfitting and Overfitting problems**

Problems:

* Underfitting: your model does not learn enough from the training set to hope for good generalization (good label prediction on new samples with unknown labels).
* Overfitting: your model fits the training set too closely, which can also prevent it from generalizing well to new samples with unknown labels.

4) **Starting, stopping, and resuming training**

Learning how to start, stop and resume the training of a deep learning model is a very important skill to master. At some point:

* You have limited time on a GPU instance (this can happen on Google Colab or when using the cheaper Amazon EC2 spot instances).
* Your SSH connection is broken.
* Your deep learning platform crashes and shuts down.

Imagine you've spent a whole week training a state-of-the-art deep neural network... and your model is lost due to a power failure!

5) **Find the best hyper-parameters with keras-tuner**

## 2. Dataset pre-processing

In this lab, we will simply re-use the sentiment analysis dataset from the previous lab.
We will stick to the tf-idf approach to turn each review into a vector.

### a) Essential reminder [About Train, validation and test sets](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7)
![test/train/val](https://miro.medium.com/max/1466/1*aNPC1ifHN2WydKHyEZYENg.png)

* **Training Dataset:** The sample of data used to fit the model.
* **Validation Dataset:** The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
* **Test Dataset:** The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

**If you use cross validation, concatenate Train and Validation set.**
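
As a reminder of what a hold-out split does, here is a minimal plain-Python sketch. The helper `split_indices`, the 20% validation fraction and the seed are assumptions chosen for illustration, not part of the lab's data pipeline.

```python
import random


def split_indices(n_samples, val_fraction=0.2, seed=0):
    """Return shuffled, disjoint (train, validation) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible split
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(10, val_fraction=0.2)
print(len(train_idx), len(val_idx))
```

The test set is never part of this split: it stays untouched until the final evaluation.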

In [None]:
TRAIN = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/train.csv.gz")
TEST = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/test.csv.gz")

TRAIN.head()

### b) Build X (features vectors) and y (labels)

In [None]:
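# Construct X_train and y_train, holding out part of TRAIN as a validation set
# (assumption: a stratified 80/20 split; X_val and y_val are needed further down)
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    TRAIN['Reviews'].fillna(""), TRAIN['Rating'],
    test_size=0.2, stratify=TRAIN['Rating'], random_state=42)
X_train.shape, y_train.shape, X_val.shape, y_val.shape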

In [None]:
# Construct X_test and y_test
X_test = TEST['Reviews'].fillna("")
y_test = TEST['Rating']
X_test.shape, y_test.shape

In [None]:
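# One-hot encode the ratings; OneHotEncoder expects 2-D input, hence to_frame()
# (on scikit-learn < 1.2, use sparse=False instead of sparse_output=False)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
y_train_encoded = ohe.fit_transform(y_train.to_frame())
y_val_encoded = ohe.transform(y_val.to_frame())
y_test_encoded = ohe.transform(y_test.to_frame())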

In [None]:
# Define constant
n_classes = len(np.unique(y_train))
# X_train holds raw review strings (1-D), so it has no feature dimension yet:
# the vector length is determined later by the vectorizer layer
n_classes

## 3. Text preprocessing with tensorflow

So far we have used `sklearn.feature_extraction.text.CountVectorizer` or `sklearn.feature_extraction.text.TfidfVectorizer`, preceded by our own preprocessing, to transform a text sequence into a vector. Unfortunately, these vectorizers cannot be placed inside a Tensorflow model, and the reverse, integrating a Tensorflow network into a sklearn pipeline, is not easy either.

Fortunately, Tensorflow has a similar function: [tf.keras.layers.TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization)

Look at the Tensorflow documentation to understand how it works.

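The main parameter is `output_mode`:
* "int": outputs integer indices, one integer index per split string token, i.e. an ID for each token. Index 0 is reserved for masked locations, which reduces the vocab size to max_tokens - 2 instead of max_tokens - 1.
* "multi_hot": outputs one vector per text, of size vocab_size or max_tokens, with a 1 at every index whose token appears at least once in the text.
* "count": like "multi_hot", but each entry holds the number of times the token appears in the text.
* "tf_idf": like "multi_hot", but each entry holds the TF-IDF weight of the token.

Below is a small example of use: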

In [None]:
corpus = ["I love chocolate and I hate beer",
          "I love beer and I hate chocolate",
          "I love beer and I love chocolate"]
corpus = tf.convert_to_tensor(corpus)

for output_mode in ['multi_hot', 'count', 'tf_idf', 'int']:
    print("-"*50)
    print("output_mode:", output_mode)
    vectorize_layer = layers.TextVectorization(output_mode=output_mode)
    vectorize_layer.adapt(corpus)  # Does the same thing as fit in scikit-learn
    print(vectorize_layer.get_vocabulary())
    print(vectorize_layer(corpus))
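
To see what the "int", "count" and "multi_hot" modes compute, the same toy corpus can be processed in plain Python. This is a sketch of the idea only, not TensorFlow's implementation: the helper names are made up, and the vocabulary order (decreasing frequency, ties broken alphabetically) and the reserved IDs 0 and 1 are assumptions that may differ from what `get_vocabulary()` returns.

```python
from collections import Counter

corpus = ["I love chocolate and I hate beer",
          "I love beer and I hate chocolate",
          "I love beer and I love chocolate"]


def tokenize(text):
    # Lowercase and split on whitespace (the toy corpus has no punctuation)
    return text.lower().split()


# Vocabulary ordered by decreasing frequency; ties broken alphabetically (assumption)
freq = Counter(tok for doc in corpus for tok in tokenize(doc))
vocab = sorted(freq, key=lambda tok: (-freq[tok], tok))

# "int" mode: one ID per token; 0 is kept for padding and 1 for unknown tokens
token_id = {tok: i + 2 for i, tok in enumerate(vocab)}


def to_int(text):
    return [token_id.get(tok, 1) for tok in tokenize(text)]


def to_count(text):
    # "count" mode: one slot per vocabulary entry, holding the token count
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocab]


def to_multi_hot(text):
    # "multi_hot" mode: 1 if the token is present at least once, else 0
    return [min(c, 1) for c in to_count(text)]


print(vocab)
for doc in corpus:
    print(to_int(doc), to_count(doc), to_multi_hot(doc))
```

Note that "int" keeps the word order (one ID per position), while "count" and "multi_hot" produce a fixed-length vector per text and discard the order.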

<font color="red">[TO DO STUDENTS]</font>

It is up to you to build examples with the other parameters of the TextVectorization layer.

In [None]:
""" Your code here """

<font color="red">[ TO DO STUDENTS]</font>

Initialize your vectorizer layer and adapt it to the training data. The rest of the notebook expects it to be named `vectorizer`.

In [None]:
""" Your code here """

## 4. Vanishing gradient problem

This problem can be encountered when training neural networks with gradient-based learning methods and backpropagation. In such methods, at each training iteration every weight of the network receives an update proportional to the partial derivative of the loss function with respect to that weight. The problem is that in some cases the gradient becomes vanishingly small, effectively preventing the weight from changing its value. This mostly occurs in very deep networks, especially with saturating activations such as sigmoid or tanh: by the chain rule, the gradient reaching the first layers is a product of one factor per layer, and many factors smaller than 1 multiply to almost zero.

Possible solutions, obviously with their pros and cons: 

* Reduce the depth of your network.
* Use non-saturating activation functions such as ReLU, i.e. ReLU(x) = max(0, x), which also promotes sparsity.
* Use residual connections, i.e. the output of each layer is layer(input) + input.
* Use normalization techniques, e.g. Batch Normalization.
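
The effect of depth can be checked with a few lines of arithmetic. The sketch below is a deliberately simplified scalar model, not a real network: it assumes 30 layers with unit weights and the same pre-activation z = 2 at every layer, and multiplies the per-layer derivatives as the chain rule does.

```python
import math

n_layers = 30
z = 2.0  # assumed pre-activation at every layer (tanh is close to saturation here)

tanh_slope = 1.0 - math.tanh(z) ** 2   # derivative of tanh at z
relu_slope = 1.0 if z > 0 else 0.0     # derivative of ReLU at z

# Chain rule through n_layers identical layers: one factor per layer
grad_tanh = tanh_slope ** n_layers
grad_relu = relu_slope ** n_layers
# Residual layer y = x + tanh(x): its derivative is 1 + tanh'(x)
grad_residual = (1.0 + tanh_slope) ** n_layers

print(f"tanh:     {grad_tanh:.3e}")
print(f"relu:     {grad_relu:.3e}")
print(f"residual: {grad_residual:.3e}")
```

In a real network the weights and pre-activations differ from layer to layer, so the numbers will not match, but the mechanism is the same: a long product of factors below 1 shrinks the gradient, whereas the identity path of a residual connection keeps each factor from collapsing towards 0.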

### a) Observe the vanishing gradient problem

<font color="red">[ TO DO STUDENTS]</font>

Design a function that builds an MLP from the following inputs and returns the model, ready to compile:

* vectorizer: the vectorizer layer used to transform a sentence into a vector
* activation: activation used at each hidden layer
* n_hiddenlayers: number of hidden layers in the network
* hidden_dim: shared number of neurons within each hidden layer

In [None]:
def build_model(vectorizer, activation, n_hiddenlayers, hidden_dim):
    """ Your code here """
    return model

In [None]:
# Build a network with 30 hidden layers with 'tanh' activations
model = build_model(vectorizer, 'tanh', 30, hidden_dim)

In [None]:
# Print the model
model.summary()

In [None]:
# Plot the model
plot_model(model, show_shapes=True,
    show_dtype=True,
    #show_layer_names=True,
    #rankdir="TB",
    #expand_nested=True,
    dpi=64,
    #layer_range=True,
    show_layer_activations=True,
)

<font color="red">You now know 2 ways to view your network. You can choose the one you prefer.</font>

In [None]:
# Configure the model and start training, use the defined early stopping
""" your code here """

In [None]:
# Plot the learning curves and analyze them
""" your code here """

<font color="red">[ TO DO STUDENTS]</font>

Is your network learning? Check your intuition by evaluating your model and looking at the confusion matrix.

In [None]:
# Evaluate the model
""" your code here """

In [None]:
# Print/plot the confusion matrix
""" your code here """

### b) Experiment on ReLU activation

<font color="red">[ TO DO STUDENTS]</font>

Change activation from 'tanh' to 'relu', still with a deep network.

In [None]:
# Build a network with 30 hidden layers with 'relu' activations
model = build_model(vectorizer, 'relu', 30, hidden_dim)

In [None]:
# Configure the model and start training
""" your code here """

In [None]:
# Plot the learning curves and analyze them
""" your code here """

<font color="red">[ TO DO STUDENTS]</font>

Does the network learn better? Does the network perform well? Study the learning curves and justify your statements with the study of its performance (classification report and confusion matrix)

In my runs, the ReLU activation seems to have helped with the vanishing gradient problem, but the model still struggles to learn and to reach good performance on the validation set.

### c) Experiment on residual connections

<font color="red">[ TO DO STUDENTS ]</font>
* Create a function to generate models with residual connections.
* Using ReLU activation + residual connections, are you able to get better results ?
* Provide here a description of the learning and performances of your network
* Compare it to previous models. What are your conclusions ?
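
As a reminder of what a residual connection computes, here is a plain-Python sketch of a single block. It is only an illustration of "layer(input) + input" on lists of floats; the helper names `dense_relu` and `residual_block` and the toy weights are made up, and this is not the Keras code expected in the exercise.

```python
def dense_relu(x, weights, bias):
    """One dense layer with ReLU: max(0, W x + b), with W given as a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def residual_block(x, weights, bias):
    """Output = layer(input) + input; both must have the same length."""
    out = dense_relu(x, weights, bias)
    assert len(out) == len(x), "residual connections need matching dimensions"
    return [o + xi for o, xi in zip(out, x)]


x = [1.0, -2.0]
weights = [[0.5, 0.0], [0.0, 0.5]]
bias = [0.0, 0.0]
y = residual_block(x, weights, bias)
print(y)
```

The dimension check matters in practice: the addition is only defined when the layer output has the same size as its input, so a first layer projecting the vectorizer output to `hidden_dim` is typically needed before stacking residual blocks.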

In [None]:
def build_residual_model(vectorizer, activation, n_hiddenlayers, hidden_dim):
    """ your code here """
    return model

In [None]:
# Build a network with 30 hidden layers with 'relu' activations
model = build_residual_model(vectorizer, 'relu', 30, hidden_dim)

In [None]:
# Print and plot the model --> What is the best solution ?

In [None]:
# Configure the model and start training
""" your code here """

In [None]:
# Plot the learning curves and analyze them
""" your code here """

### d) Experiment on Batch Normalization

<font color="red">[ TO DO STUDENTS ]</font>
* Step 1: adapt the build_model function to add batch normalization layers after the output of the hidden dense layers.
* Step 2: adapt the build_model function to use both batch normalization layers and residual connections.
* In both cases, use ReLU activation and 30 hidden layers as previously.
* Compare your results.
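
For reference, the normalization applied to one feature over a batch can be written in a few lines. This is a plain-Python sketch of the training-time formula only; the function name `batch_norm`, the sample values and the value of `eps` are chosen for illustration.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


activations = [10.0, 12.0, 14.0, 16.0]  # one feature, batch of 4 samples
normalized = batch_norm(activations)
print(normalized)
```

After normalization the feature has mean 0 and variance close to 1 over the batch; `gamma` and `beta` are learned during training. At inference time a batch normalization layer uses running averages of the mean and variance instead of the statistics of the current batch.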

In [None]:
def build_model_batch_normalization(vectorizer, activation, n_hiddenlayers, hidden_dim):
    """ Your code here """
    return model

In [None]:
def build_residual_model_residual_batch_normalization(vectorizer, activation, n_hiddenlayers, hidden_dim):
    """ your code here """
    return model

In [None]:
#Build and train the network with BatchNormalization layer
model = build_model_batch_normalization(vectorizer, 'relu', 30, hidden_dim)

# Configure the model and start training
""" your code here """

In [None]:
# Plot the learning curves and analyze them
""" your code here """

In [None]:
# Do the same with `build_residual_model_residual_batch_normalization`
""" your code here """

### e) What if you simply reduce the network depth ?

<font color="red">[ TO DO STUDENTS ]</font>
* Build an MLP with ReLU activation composed of 10 hidden layers
* Compare your results with the other models, both in terms of learning and performance

In [None]:
#Build and train the network without residual connections
model = build_model(vectorizer, 'relu', 10, hidden_dim)

In [None]:
# Configure the model and start training
""" your code here """

In [None]:
# Plot the learning curves and analyze them
""" your code here """

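You should observe a typical instance of overfitting.

## 5. Underfitting and Overfitting problems

The last experiment is a typical instance of overfitting: the model keeps improving on the training set while its validation performance stalls or degrades. This section explores ways to reduce it.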

### a) Decrease the network size ?

In [None]:
#Build and train the network without residual connections
model = build_model(vectorizer, 'relu', 5, hidden_dim)

In [None]:
# Configure the model and start training
""" your code here """

In [None]:
# Plot the learning curves and analyze them
""" your code here """

For me, it gets slightly better, but almost the same behavior is observed with 10 and 5 hidden layers.

What happens for you?

### b) Experiment on L2 regularization

<font color="red">[ TO DO STUDENTS ]</font>
* Check the Keras documentation on regularizers: https://keras.io/api/layers/regularizers/
* Add L2 regularization to the previous network: first with l2_reg = 0.01, then with l2_reg = 0.0001
* Compare the results
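
To get a feel for the two coefficients, recall that an L2 regularizer adds l2_reg times the sum of the squared weights to the loss. The sketch below computes this penalty in plain Python; the weights and the `data_loss` value are hypothetical numbers chosen for illustration.

```python
def l2_penalty(weights, l2_reg):
    """L2 penalty: l2_reg * sum of squared weights."""
    return l2_reg * sum(w ** 2 for w in weights)


weights = [0.5, -1.0, 2.0]
data_loss = 0.8  # hypothetical cross-entropy value

for l2_reg in (0.01, 0.0001):
    penalty = l2_penalty(weights, l2_reg)
    print(f"l2_reg={l2_reg}: penalty={penalty:.6f}, total loss={data_loss + penalty:.6f}")
```

A network has far more weights than this toy example, so with l2_reg = 0.01 the penalty can dominate the loss and push the model towards underfitting, while l2_reg = 0.0001 constrains the weights much more gently.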

In [None]:
# Design the model function
from tensorflow.keras import regularizers

def build_model_reg(vectorizer, activation, n_hiddenlayers, hidden_dim, l2_reg = 0.01):
    """ your code here """
    # Hint: use the kernel_regularizer and bias_regularizer parameters of Dense
    return model

In [None]:
#Build and train the network without residual connections
model = build_model_reg(vectorizer, 'relu', 5, hidden_dim, 0.01)


In [None]:
# Configure the model and start training
""" your code here """

In [None]:
# Plot the learning curves and analyze them
""" your code here """

<font color="red">[ TO DO STUDENTS]</font>

Reduce the L2 regularization coefficient in the loss from l2_reg = 0.01 to l2_reg = 0.0001 and repeat the same experiment.

In [None]:
""" your code here """

What is your conclusion ?

### c) Experiment on Dropout

<font color="red">[TO DO STUDENTS]</font>
* Observe the provided results for a dropout ratio of p=0.7 and p=0.3
* What are your conclusions ?
* In the end, considering all the explored settings in this Lab, what would you suggest as a network to get a better model ?
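
As a reminder of what the dropout ratio p controls, here is a plain-Python sketch of inverted dropout on a list of activations. The function `dropout` and its arguments are made up for illustration; in the lab itself the Keras `Dropout` layer does this for you.

```python
import random


def dropout(activations, p, training, rng):
    """Zero each activation with probability p and rescale the others by 1 / (1 - p)."""
    if not training:
        return list(activations)  # dropout is inactive at inference time
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]


rng = random.Random(0)
activations = [1.0] * 10
train_out = dropout(activations, p=0.7, training=True, rng=rng)
test_out = dropout(activations, p=0.7, training=False, rng=rng)
print(train_out)
print(test_out)
```

With p = 0.7 most units are silenced at each forward pass, which is a strong regularization; with p = 0.3 most units stay active. The rescaling keeps the expected value of each activation unchanged between training and inference.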

In [None]:
# Design the model function
from tensorflow.keras.layers import Dropout  # same namespace as the other imports

def build_model_dropout(vectorizer, activation, n_hiddenlayers, hidden_dim, p = 0.5):
    """ your code here """
    return model

In [None]:
# Build and train the network without residual connections
model = build_model_dropout(vectorizer, 'relu', 5, hidden_dim, 0.7)

In [None]:
# Configure the model and start training
""" your code here """

In [None]:
# Plot the learning curves and analyze them
""" your code here """

<font color="red">[ TO DO STUDENTS]</font>

Decrease the proportion of neurons deactivated at each forward pass from 0.7 to 0.3.

In [None]:
""" your code here """

What is your conclusion ?

## 6. Stop and resume training

Learning how to start, stop and resume the training of a deep learning model is a very important skill to master. At some point:

* You have limited time on a GPU instance (this can happen on Google Colab or when using the cheaper Amazon EC2 spot instances).
* Your SSH connection is broken.
* Your deep learning platform crashes and shuts down.

Imagine you've spent a whole week training a state-of-the-art deep neural network... and your model is lost due to a power failure! Fortunately, there is a solution - but when these situations occur, you need to know what to do:

1. Take a snapshot model that was saved/serialized to disk during training.
1. Load the model into memory.
1. Resume training where you left off.

Starting, stopping and resuming training is standard practice when setting the learning rate manually:

1. Start training your model until the loss/accuracy reaches a plateau.
1. Take a snapshot of your model every N epochs (typically N={1, 5, 10})
1. Stop training when you arrive at a plateau (by forcing out via ctrl + c or via early stopping).
1. Adjust your learning rate (typically by reducing it by an order of magnitude).
1. Restart the training script, starting from the last snapshot of the model weights

The ability to adjust the learning rate is an essential skill for any deep learning practitioner to master, so take the time to study and practice it!
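
The snapshot, reload and resume cycle described above can be sketched without any deep learning library. The example below is a plain-Python stand-in, not the Keras workflow: the "model" is a single weight `w` trained by gradient descent on (w - 3)^2, snapshots are JSON files, and the helpers `train` and `load_latest` are made up for illustration.

```python
import json
import os
import tempfile


def train(state, n_epochs, snapshot_dir):
    """Minimise f(w) = (w - 3)^2 by gradient descent, saving a snapshot per epoch."""
    for _ in range(n_epochs):
        grad = 2.0 * (state["w"] - 3.0)
        state["w"] -= state["lr"] * grad
        state["epoch"] += 1
        path = os.path.join(snapshot_dir, f"epoch_{state['epoch']:03d}.json")
        with open(path, "w") as f:
            json.dump(state, f)
    return state


def load_latest(snapshot_dir):
    """Reload the most recent snapshot (file names sort by epoch number)."""
    latest = sorted(os.listdir(snapshot_dir))[-1]
    with open(os.path.join(snapshot_dir, latest)) as f:
        return json.load(f)


snapshot_dir = tempfile.mkdtemp()
train({"w": 0.0, "lr": 0.1, "epoch": 0}, n_epochs=5, snapshot_dir=snapshot_dir)

resumed = load_latest(snapshot_dir)   # as if the process had been restarted
resumed["lr"] /= 10                   # reduce the learning rate by an order of magnitude
final = train(resumed, n_epochs=5, snapshot_dir=snapshot_dir)
print(final)
```

The snapshot stores everything needed to continue: the weight, the learning rate and the epoch counter. A weights-only snapshot would lose the last two, which is worth keeping in mind when comparing ways of saving a Keras model.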

<font color="red">Look at the documentation Tensorflow has proposed for ModelCheckpoint: we want to save the model at the end of each epoch so that we can restore it later.</font>

In [None]:
# Reuse one of the previous models and reset it
# Use sgd as optimizer and fix learning_rate = 0.1
# Use 2 callbacks : EarlyStopping and ModelCheckpoint
# Save model at each epoch
# Fit the network
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import ModelCheckpoint

model = build_model_dropout(vectorizer, 'relu', 5, hidden_dim, 0.5)

opt = optimizers.SGD(learning_rate=0.1) # Fix learning rate to 0.1
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

callbacks_list = [EarlyStopping(monitor='val_accuracy', mode='max',
                                patience=patience,
                                restore_best_weights=True),
                  ModelCheckpoint(filepath=...,
                                  # Complete, if necessary
                                 )]
history1 = model.fit(X_train, y_train_encoded, validation_data=(X_val, y_val_encoded),
                    epochs=epochs, batch_size=batch_size, callbacks=callbacks_list, verbose=0)

In [None]:
# Re-load the model
new_model = tf.keras.models.load_model(...)

What is the difference between load_model and load_weights ?

In [None]:
# Change learning_rate = 0.01
# `learning_rate` is the current attribute name; `lr` is a deprecated alias
new_model.optimizer.learning_rate.assign(0.01)

In [None]:
# Continue to fit the network
history2 = new_model.fit(X_train, y_train_encoded, validation_data=(X_val, y_val_encoded),
                    epochs=epochs, batch_size=batch_size, callbacks=callbacks_list, verbose=0)

## 7. Use Keras-tuner

<font color="red">[TO DO STUDENTS]</font>

Based on the previous experiments, use Keras-tuner to find the best possible network.

Keras-Tuner must build at least 3 different network architectures among the following:
* Dense cells only
* Addition of residuals
* Adding batch normalisation
* Adding dropout
* A combination of the different additions


In [None]:
""" your code here """