# **Machine Learning Lab 4EII - IA course**
***Lab 4 - Sequence of words prediction with RNN and LSTM***

## 1- **Introduction**

This lab aims at using variants of **recurrent neural networks** for semantic prediction on sequences of words. These RNN variants can be combined with dense layers or convolutional neural networks to solve temporal regression and classification problems.

In this lab you will learn to:
* Build variants of RNN networks: GRU, LSTM.
* Apply these networks to semantic word-sequence prediction.
* Build a network combining a CNN with an LSTM for semantic word-sequence prediction.
* Build an LSTM network that translates short English sentences into short French sentences.

## 2- **Module importation**
Import some useful and common Python modules.

In [0]:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, Convolution1D, Dropout, Input, MaxPooling1D, Activation, SimpleRNN, LSTM

## 3- **Download and study the IMDB dataset**



### 3.a - IMDB Dataset of 50K Movie Reviews

> The IMDB sentiment classification dataset consists of 50,000 movie reviews from IMDB users that are labeled as either positive (1) or negative (0). The reviews are preprocessed and each one is encoded as a sequence of word indexes in the form of integers. The words within the reviews are indexed by their overall frequency within the dataset. For example, the integer “2” encodes the second most frequent word in the data. The 50,000 reviews are split into 25,000 for training and 25,000 for testing.
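
> The frequency-based encoding described above can be illustrated on a toy corpus. This is only a sketch: the three sentences and the names `toy_reviews` and `word_to_rank` are made up for illustration and are not part of the IMDB data. The real Keras loader also reserves a few low indices for special tokens, which is why `printSentence` subtracts 3 when decoding.

```python
from collections import Counter

# Toy corpus standing in for the reviews (hypothetical data, not from IMDB).
toy_reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was great",
]

# Count how often each word occurs over the whole corpus.
counts = Counter(word for review in toy_reviews for word in review.split())

# Rank words by decreasing frequency (ties broken alphabetically).
# Rank 1 is the most frequent word, rank 2 the second most frequent, and so on.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_to_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

# Each review becomes a list of integer ranks.
encoded = [[word_to_rank[w] for w in review.split()] for review in toy_reviews]
print(word_to_rank)
print(encoded)
```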

> The dataset was created by researchers at Stanford University and published in a 2011 [paper](https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf), where they achieved 88.89% accuracy. In this lab we try to reach this performance.

> The following IMDBdataset class provides methods to download the IMDB dataset and show its features.   

In [0]:
class IMDBdataset:    
  def __init__(self, vocabulary_size=0):  
    self.vocabulary_size = vocabulary_size; 
    
    self.X_train = []; 
    self.X_test = []; 
    self.y_train = []; 
    self.y_test = []; 
    self.data = []; 
    self.targets = []; 
  
  def loadData(self):
    from tensorflow.keras.datasets import imdb;
    # Use the value stored on the instance, not a global variable.
    (self.X_train, self.y_train), (self.X_test, self.y_test) = imdb.load_data(num_words=self.vocabulary_size)
    self.data = np.concatenate((self.X_train, self.X_test), axis=0)
    self.targets = np.concatenate((self.y_train, self.y_test), axis=0)
  
  def paddingData(self, max_review_length):
    from tensorflow.keras.preprocessing import sequence
    self.X_train = sequence.pad_sequences(self.X_train, maxlen=max_review_length)
    self.X_test = sequence.pad_sequences(self.X_test, maxlen=max_review_length)
  
  def printInfo(self):
    print('Size of the training set ' + str(len(self.X_train)))
    print('Size of the testing set ' + str(len(self.X_test)))

    print("Categories:", np.unique(self.targets), ' <==> ', '[negative, positive]')
    print("Number of unique words:", len(np.unique(np.hstack(self.data))))
    print("Max sentence length :", np.max( [len(self.data[i]) for i in range(0, len(self.data)) ]) )

    length = [len(i) for i in self.data]
    print("Average Review length: %.2f " % np.mean(length))
    print("Standard Deviation:", round(np.std(length)))
  
  def printSentence(self, Id=0):
    from tensorflow.keras.datasets import imdb;
    if(Id >= len(self.data)):
      Id = 0;

    print("Label:", self.targets[Id])
    print("Number of words : ", len(self.data[Id]))
    print('Coded sentence : ', self.data[Id])

    index = imdb.get_word_index()
    reverse_index = dict([(value, key) for (key, value) in index.items()]) 
    decoded = " ".join( [reverse_index.get(i - 3, "#") for i in self.data[Id]] )
    print('Original sentence : ' + decoded) 


### 3.b - Download the IMDB dataset and analyze it

> Create an instance of the IMDBdataset class

> Set the vocabulary_size to 10000

> vocabulary_size is the number of different words used in the dataset  

In [0]:
vocabulary_size = 10000
# TO DO 

> download the dataset with loadData() method 

In [0]:
# TO DO 

> Print some information about the IMDB dataset: size of the training and testing sets, classes, number of words, ...

> Use printInfo() method  

In [0]:
# TO DO 

> You can use the printSentence(Id) method to print the sentence of index Id before (original sentence) and after tokenization.

> Note that the sentences are first tokenized as a pre-processing step: each word is represented by an integer in the interval [1, vocabulary_size-1]. In this dataset the words are ranked by frequency, so the most frequent word is represented by 1, and so on.

    



In [0]:
IdL = [1, 2, 8]
for sentenceId in IdL:
  # TO DO
  print(' -------- ')

### 3.c - Pad the dataset reviews to the same length

> You have noticed that the sentences do not all have the same length.

> The paddingData method is provided to pad the sentences so that they all have the same length, max_review_length.
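
> To see what padding does, here is a minimal NumPy sketch that mimics the default behaviour of Keras `pad_sequences` (zeros added at the front, longer sequences cut at the front). The helper `pad_pre` and the toy sequences are illustrative assumptions, not the library implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value`; keep only the last `maxlen` items of longer sequences."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]
        if trimmed:
            # Write the (possibly truncated) sequence at the right end of the row.
            out[row, -len(trimmed):] = trimmed
    return out

toy_sequences = [[5, 8], [3, 1, 4, 1, 5, 9], [7]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

> After padding, every review is a row of the same 2-D integer array, which is the shape the models below expect.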


In [0]:
max_review_length = 500
# TO DO 

## 4- **Neural network and CNN for IMDB classification problem**

> In this section you will develop simple NN and CNN models for the IMDB classification problem.


### 4.a - Two layers NN for IMDB dataset classification 

> Complete the following code to create a NN model with two layers (a hidden dense layer of size 16 with ReLU activation, followed by the output layer)
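
> As a reminder of what such a network computes, the forward pass and the binary cross-entropy loss can be sketched in plain NumPy. This is not the Keras solution to the exercise: the weights are random, and the names `two_layer_forward`, `binary_cross_entropy` and all `toy_*` sizes are assumptions chosen for illustration.

```python
import numpy as np

def two_layer_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(x @ W1 + b1, 0.0)   # hidden dense layer with ReLU
    logits = hidden @ W2 + b2                # output dense layer (one unit)
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid gives a probability in (0, 1)

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average the per-sample loss.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
toy_batch, toy_inputs, toy_hidden = 8, 12, 16
x = rng.normal(size=(toy_batch, toy_inputs))
y = rng.integers(0, 2, size=(toy_batch, 1))

W1 = rng.normal(scale=0.1, size=(toy_inputs, toy_hidden))
b1 = np.zeros(toy_hidden)
W2 = rng.normal(scale=0.1, size=(toy_hidden, 1))
b2 = np.zeros(1)

probs = two_layer_forward(x, W1, b1, W2, b2)
loss = binary_cross_entropy(y, probs)
print(probs.shape, loss)
```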

In [0]:
hidden_layer_dim = 16;
model = Sequential()

# TO DO 

print(model.summary())


> Train the model and assess its performance on the testing set. 

In [0]:
# Fit the model
logsNN1 = model.fit(dataset.X_train, dataset.y_train, validation_data=(dataset.X_test, dataset.y_test), epochs=10, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(dataset.X_test, dataset.y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [0]:
plt.plot(logsNN1.history['loss'], label='Loss (training data NN1)')
plt.plot(logsNN1.history['val_loss'], label='Loss (validation data NN1)')

plt.title('Binary cross entropy loss function')
plt.ylabel('Loss Error')
plt.xlabel('No. epoch')
plt.legend(loc="upper center")
plt.grid()
plt.show()


plt.plot(logsNN1.history['accuracy'], label='Accuracy (training data NN1)')
plt.plot(logsNN1.history['val_accuracy'], label='Accuracy (validation data NN1)')
plt.title('Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.grid()
plt.show()

### 4.b - Add an embedding layer  

> Perform a linear projection of the input from the input dimension to a more compact new dimension (embedding_dim)

> Each single word will be projected to a vector of dimension embedding_dim

> Use an Embedding layer to perform this projection, followed by a Flatten layer and the two dense layers (hidden layer of size 16 and output layer)

> Embedding layer must specify 3 arguments:

1. input_dim: this is the size of the vocabulary in the text data (vocabulary_size). 

2. output_dim: this is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. In this lab, we use output_dim = 20.

3. input_length: this is the length of input sequences, as you would define for any input layer of a Keras model (max_review_length).

> Further details on Embedding layer can be found [here](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/). 
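
> Conceptually, an embedding layer is a lookup table: row k of a trainable matrix is the vector of word k. The NumPy sketch below illustrates the resulting shapes before and after flattening. The matrix is random here (Keras learns it during training), and the `toy_*` sizes are assumptions, much smaller than the lab values.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_vocabulary_size = 10   # the lab uses 10000
toy_embedding_dim = 4      # the lab uses 20
toy_review_length = 5      # the lab uses 500

# One row per word index; these rows are the trainable parameters of the layer.
embedding_matrix = rng.normal(size=(toy_vocabulary_size, toy_embedding_dim))

# A batch of two padded reviews, each a sequence of word indices.
batch = np.array([[0, 0, 3, 7, 1],
                  [2, 9, 9, 4, 6]])

embedded = embedding_matrix[batch]             # (batch, length, embedding_dim)
flattened = embedded.reshape(len(batch), -1)   # (batch, length * embedding_dim)
print(embedded.shape, flattened.shape)
```

> With the lab values, each review would become a 500 x 20 array, flattened to 10000 values before the dense layers.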


In [0]:
embedding_dim=20
model = Sequential()

# TO DO 

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

In [0]:
# Fit the model
logsNN2 = model.fit(dataset.X_train, dataset.y_train, validation_data=(dataset.X_test, dataset.y_test), epochs=10, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(dataset.X_test, dataset.y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [0]:
plt.plot(logsNN2.history['loss'], label='Loss (training data NN2)')
plt.plot(logsNN2.history['val_loss'], label='Loss (validation data NN2)')

plt.title('Binary cross entropy loss function')
plt.ylabel('Loss Error')
plt.xlabel('No. epoch')
plt.legend(loc="center right")
plt.grid()
plt.show()


plt.plot(logsNN2.history['accuracy'], label='Accuracy (training data NN2)')
plt.plot(logsNN2.history['val_accuracy'], label='Accuracy (validation data NN2)')
plt.title('Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('No. epoch')
plt.legend(loc="lower right")
plt.grid()
plt.show()

### 4.c - Convolutional Neural Network  

> In this section you will build a convolutional neural network using one conv layer and one dense output layer.

> The conv layer is composed of:

1.   Convolution1D layer of 200 filters of kernel size 3, with padding='valid' and activation='relu'
2.   Max pooling layer (MaxPooling1D(pool_size=max_review_length-kernel_size+1))
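
> The pool size above follows from the output length of a 'valid' convolution: a sequence of length L filtered with a kernel of size k gives L - k + 1 positions, so pooling over all of them keeps one value per filter. The NumPy sketch below checks this on small toy sizes. The helper `conv1d_valid_relu` and the `toy_*` values are assumptions for illustration, not the Keras implementation.

```python
import numpy as np

def conv1d_valid_relu(x, kernels, bias):
    """x: (length, channels); kernels: (kernel_size, channels, filters); bias: (filters,)."""
    kernel_size = kernels.shape[0]
    out_len = x.shape[0] - kernel_size + 1          # 'valid' padding: no border added
    out = np.empty((out_len, kernels.shape[2]))
    for t in range(out_len):
        window = x[t:t + kernel_size]               # (kernel_size, channels)
        # Sum over kernel positions and channels, one result per filter.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)                     # ReLU activation

rng = np.random.default_rng(0)
toy_length, toy_channels, toy_kernel, toy_filters = 10, 4, 3, 6
x = rng.normal(size=(toy_length, toy_channels))
kernels = rng.normal(size=(toy_kernel, toy_channels, toy_filters))
bias = np.zeros(toy_filters)

features = conv1d_valid_relu(x, kernels, bias)   # (10 - 3 + 1, 6) = (8, 6)
pooled = features.max(axis=0)                    # max over all 8 positions -> (6,)
print(features.shape, pooled.shape)
```

> With max_review_length = 500 and kernel_size = 3, the convolution yields 498 positions, and a pool size of 498 reduces them to a single value per filter.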





In [0]:
dropout_prob = 0.4
graph_in = Input(shape=(max_review_length, embedding_dim))
kernel_size = 3;
nb_filters = 32;
model = Sequential()

# TO DO Embedding 

model.add(Dropout(dropout_prob))

# TO DO conv + max pooling  + output dense layer 

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['acc'])
print(model.summary())



In [0]:
# Fit the model
logsCNN1 = model.fit(dataset.X_train, dataset.y_train, validation_data=(dataset.X_test, dataset.y_test), epochs=10, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(dataset.X_test, dataset.y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [0]:
plt.plot(logsCNN1.history['loss'], label='Loss (training data CNN)')
plt.plot(logsCNN1.history['val_loss'], label='Loss (validation data CNN)')

plt.title('Binary cross entropy loss function')
plt.ylabel('Loss Error')
plt.xlabel('No. epoch')
plt.legend(loc="center right")
plt.grid()
plt.show()

plt.plot(logsCNN1.history['acc'], label='Accuracy (training data CNN)')
plt.plot(logsCNN1.history['val_acc'], label='Accuracy (validation data CNN)')
plt.title('Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('No. epoch')
plt.legend(loc="lower right")
plt.grid()
plt.show()

## 5- **Recurrent Neural Network and LSTM for IMDB dataset sentiment classification problem**

> Instead of the 1D conv layer used in the previous section, here you will use a simple RNN and then an LSTM

### 5.a - Build a model with a simple RNN layer SimpleRNN

> Use SimpleRNN layer with rnn_cells as parameter. 
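
> A simple RNN reads the sequence one step at a time and updates a hidden state with h_t = tanh(x_t W + h_{t-1} U + b). The NumPy sketch below shows this recurrence and the resulting parameter count. The function name `simple_rnn_forward`, the random weights and the `toy_*` sizes are assumptions for illustration only.

```python
import numpy as np

def simple_rnn_forward(x, W, U, b):
    """x: (timesteps, input_dim). Returns the last hidden state, of size units."""
    h = np.zeros(U.shape[0])
    for x_t in x:
        # The new state mixes the current input with the previous state.
        h = np.tanh(x_t @ W + h @ U + b)
    return h

rng = np.random.default_rng(0)
toy_timesteps, toy_input_dim, toy_units = 6, 4, 3
x = rng.normal(size=(toy_timesteps, toy_input_dim))
W = rng.normal(size=(toy_input_dim, toy_units))   # input-to-hidden weights
U = rng.normal(size=(toy_units, toy_units))       # hidden-to-hidden weights
b = np.zeros(toy_units)

h_last = simple_rnn_forward(x, W, U, b)
n_params = W.size + U.size + b.size               # units * (input_dim + units + 1)
print(h_last.shape, n_params)
```

> With rnn_cells = 64 and an embedding of size 20, the same formula gives 64 * (20 + 64 + 1) = 5440 parameters, which you can compare with model.summary().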

In [0]:
# Simple RNN
rnn_cells = 64; 
model = Sequential()

# TO DO 

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

In [0]:
# Fit the model
logsRNN = model.fit(dataset.X_train, dataset.y_train, validation_data=(dataset.X_test, dataset.y_test), epochs=10, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(dataset.X_test, dataset.y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [0]:
plt.plot(logsRNN.history['loss'], label='Loss (training data RNN)')
plt.plot(logsRNN.history['val_loss'], label='Loss (validation data RNN)')

plt.title('Binary cross entropy loss function')
plt.ylabel('Loss Error')
plt.xlabel('No. epoch')
plt.legend(loc="center right")
plt.grid()
plt.show()


plt.plot(logsRNN.history['accuracy'], label='Accuracy (training data RNN)')
plt.plot(logsRNN.history['val_accuracy'], label='Accuracy (validation data RNN)')
plt.title('Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('No. epoch')
plt.legend(loc="lower right")
plt.grid()
plt.show()

### 5.b - Build a model with an LSTM layer LSTM
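
> An LSTM keeps two states, a hidden state h and a cell state c, and uses three gates to control what is written to, kept in and read from the cell. The NumPy sketch below follows the standard LSTM equations with the gates stacked in the order input, forget, candidate, output. The function `lstm_forward`, the random weights and the `toy_*` sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, W, U, b):
    """x: (timesteps, input_dim); W: (input_dim, 4*units); U: (units, 4*units); b: (4*units,)."""
    units = U.shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    for x_t in x:
        z = x_t @ W + h @ U + b
        i = sigmoid(z[:units])                # input gate
        f = sigmoid(z[units:2 * units])       # forget gate
        g = np.tanh(z[2 * units:3 * units])   # candidate cell update
        o = sigmoid(z[3 * units:])            # output gate
        c = f * c + i * g                     # cell state: keep part of the past, add new content
        h = o * np.tanh(c)                    # hidden state exposed to the next layer
    return h, c

rng = np.random.default_rng(0)
toy_timesteps, toy_input_dim, toy_units = 6, 4, 3
x = rng.normal(size=(toy_timesteps, toy_input_dim))
W = rng.normal(size=(toy_input_dim, 4 * toy_units))
U = rng.normal(size=(toy_units, 4 * toy_units))
b = np.zeros(4 * toy_units)

h_last, c_last = lstm_forward(x, W, U, b)
n_params = W.size + U.size + b.size           # 4 * units * (input_dim + units + 1)
print(h_last.shape, c_last.shape, n_params)
```

> With lstm_cells = 64 and an embedding of size 20, the formula gives 4 * 64 * (20 + 64 + 1) = 21760 parameters, four times the simple RNN count.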

In [0]:
# LSTM 
model = Sequential()
lstm_cells =64; 

# TO DO 

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

In [0]:
# Fit the model
logsLSTM = model.fit(dataset.X_train, dataset.y_train, validation_data=(dataset.X_test, dataset.y_test), epochs=10, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(dataset.X_test, dataset.y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [0]:
plt.plot(logsLSTM.history['loss'], label='Loss (training data LSTM)')
plt.plot(logsLSTM.history['val_loss'], label='Loss (validation data LSTM)')

plt.title('Binary cross entropy loss function')
plt.ylabel('Loss Error')
plt.xlabel('No. epoch')
plt.legend(loc="center right")
plt.grid()
plt.show()


plt.plot(logsLSTM.history['accuracy'], label='Accuracy (training data LSTM)')
plt.plot(logsLSTM.history['val_accuracy'], label='Accuracy (validation data LSTM)')
plt.title('Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('No. epoch')
plt.legend(loc="lower right")
plt.grid()
plt.show()

> What are the advantages of using an LSTM rather than a simple RNN?

### 5.c - Build a model combining a conv layer with an LSTM layer

In [0]:
# conv + LSTM 
graph_in = Input(shape=(max_review_length, embedding_dim))
model = Sequential()

# TO DO 

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['acc'])
print(model.summary())


In [0]:
# Fit the model
logsCNNLSTM = model.fit(dataset.X_train, dataset.y_train, validation_data=(dataset.X_test, dataset.y_test), epochs=10, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(dataset.X_test, dataset.y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

> You can find [here](https://paperswithcode.com/sota/sentiment-analysis-on-imdb) a review of state-of-the-art models for IMDB sentiment classification. You can compare your performance with these models.


## 6- **LSTM for translating short English sentences into short French sentences (bonus question)**

> In this section you will develop a model based on LSTM for English to French short sentence translation. 

### 6.a - Download and prepare the dataset
> In this section you will download and prepare the English to French short sentences dataset.  

> The model is composed of one encoder (LSTM layer) and one decoder (LSTM layer + 1 dense output layer) 

> This model can translate sentences of variable length. The decoder states are initialized with the final encoder states; at each step, the decoder states are then updated from the previous decoder states and the previously decoded sample.

> Upload the dataset file fra.txt into the working directory

In [0]:
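import numpy as np
# Use the same Keras namespace as in the previous sections.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
# Only available in Google Colab (used to mount Google Drive if needed).
from google.colab import drive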



# Path to the data txt file on disk.

num_samples = 10000  # Number of samples to train on.

class datasetFrEg: 
  def __init__(self, data_path):  
    self.data_path = data_path; 
    
    self.X_train = []; 
    self.X_test = []; 
    self.y_train = []; 
    self.y_test = []; 
    self.data = []; 
    self.targets = []; 
  
    self.input_texts = []
    self.target_texts = []
    self.input_characters = set()
    self.target_characters = set()

  def loadData(self):
    with open(self.data_path, 'r', encoding='utf-8') as f:
      lines = f.read().split('\n')
    for line in lines[: min(num_samples, len(lines) - 1)]:
      # Keep the English and French fields; ignore any extra field (e.g. attribution).
      input_text, target_text = line.split('\t')[:2]
      # We use "tab" as the "start sequence" character
      # for the targets, and "\n" as "end sequence" character.
      target_text = '\t' + target_text + '\n'
      self.input_texts.append(input_text)
      self.target_texts.append(target_text)
      for char in input_text:
        if char not in self.input_characters:
            self.input_characters.add(char)
      for char in target_text:
        if char not in self.target_characters:
            self.target_characters.add(char)

  def printInfo(self):
    self.input_characters = sorted(list(self.input_characters))
    self.target_characters = sorted(list(self.target_characters))
    self.num_encoder_tokens = len(self.input_characters)
    self.num_decoder_tokens = len(self.target_characters)
    self.max_encoder_seq_length = max([len(txt) for txt in self.input_texts])
    self.max_decoder_seq_length = max([len(txt) for txt in self.target_texts])

    print('Number of samples:', len(self.input_texts))
    print('Number of unique input tokens:', self.num_encoder_tokens)
    print('Number of unique output tokens:', self.num_decoder_tokens)
    print('Max sequence length for inputs:', self.max_encoder_seq_length)
    print('Max sequence length for outputs:', self.max_decoder_seq_length)

  def printSample(self, Id):
    if Id < len(self.input_texts) : 
      print("# sentence En :", self.input_texts[Id], "# sentence Fr :", self.target_texts[Id])

  def vectorizeData(self):
    self.input_token_index = dict([(char, i) for i, char in enumerate(self.input_characters)])
    self.target_token_index = dict([(char, i) for i, char in enumerate(self.target_characters)])

    self.encoder_input_data = np.zeros((len(self.input_texts), self.max_encoder_seq_length, self.num_encoder_tokens), dtype='float32')
    self.decoder_input_data = np.zeros((len(self.input_texts), self.max_decoder_seq_length, self.num_decoder_tokens),dtype='float32')
    self.decoder_target_data = np.zeros((len(self.input_texts), self.max_decoder_seq_length, self.num_decoder_tokens),dtype='float32')

    for i, (input_text, target_text) in enumerate(zip(self.input_texts, self.target_texts)):
      for t, char in enumerate(input_text):
        self.encoder_input_data[i, t, self.input_token_index[char]] = 1.
      # Pad the rest of the encoder sequence with the space character.
      self.encoder_input_data[i, t + 1:, self.input_token_index[' ']] = 1.
      for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        self.decoder_input_data[i, t, self.target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            self.decoder_target_data[i, t - 1, self.target_token_index[char]] = 1.
      # Pad the rest of the decoder sequences with the space character.
      self.decoder_input_data[i, t + 1:, self.target_token_index[' ']] = 1.
      self.decoder_target_data[i, t:, self.target_token_index[' ']] = 1.

    # Reverse lookups (index -> character), built once after the loop.
    self.reverse_input_char_index = dict(
      (i, char) for char, i in self.input_token_index.items())
    self.reverse_target_char_index = dict(
      (i, char) for char, i in self.target_token_index.items())

> Download and parse the text file 

In [0]:
data_path = 'fra.txt'
dataFrEn = datasetFrEg(data_path);

In [0]:
dataFrEn.loadData();

> Print some global information about the dataset

In [0]:
dataFrEn.printInfo()

> Print some sentences 

In [0]:
IdL = [1, 3, 5, 13, 20];

for Id in IdL:  
  dataFrEn.printSample(Id)

> As with the IMDB dataset, the text needs to be vectorized: here each character is turned into a one-hot token
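
> The vectorization used for the decoder can be illustrated on a single toy sentence: every character becomes a one-hot row, and the target sequence is the input sequence shifted by one step, so the decoder learns to predict the next character. This sketch is independent of the `datasetFrEg` class; the sentence and the names `char_to_idx` and `decode` are assumptions for illustration.

```python
import numpy as np

# Toy target with the start (tab) and end (newline) markers used in this lab.
target_text = "\tSalut\n"
chars = sorted(set(target_text))
char_to_idx = {ch: k for k, ch in enumerate(chars)}
idx_to_char = {k: ch for ch, k in char_to_idx.items()}

decoder_input = np.zeros((len(target_text), len(chars)), dtype="float32")
decoder_target = np.zeros_like(decoder_input)
for t, ch in enumerate(target_text):
    decoder_input[t, char_to_idx[ch]] = 1.0
    if t > 0:
        # The target at step t - 1 is the character the decoder reads at step t.
        decoder_target[t - 1, char_to_idx[ch]] = 1.0

def decode(one_hot):
    """Turn one-hot rows back into a string."""
    return "".join(idx_to_char[int(k)] for k in one_hot.argmax(axis=1))

print(repr(decode(decoder_input)))
print(repr(decode(decoder_target[:-1])))   # last row is empty, so it is skipped
```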

In [0]:
dataFrEn.vectorizeData();

### 6.b - Build the translation model
> In this section you will build the translation model, composed of an encoder and a decoder that both rely on an LSTM layer with latent_dim cells.

In [0]:
batch_size = 64;  
epochs = 10; 
latent_dim = 256;


# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, dataFrEn.num_encoder_tokens))

encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]


# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, dataFrEn.num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.

# Create a LSTM layer for the decoder with latent_dim while activating return_sequences and return_state
decoder_lstm = # TO DO 
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

#TO DO add 1 output dense layer 
# what is the number of outputs 
decoder_dense = # TO DO 
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])

print(model.summary())

> Perform the training of the model

In [0]:
model.fit([dataFrEn.encoder_input_data, dataFrEn.decoder_input_data], dataFrEn.decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)

In [0]:
# Next: inference mode (sampling).
# Here's the drill:
# 1) encode input and retrieve initial decoder state
# 2) run one step of decoder with this initial state
# and a "start of sequence" token as target.
# Output will be the next target token
# 3) Repeat with the current target token and current states

# Define the encoder model 
encoder_model = Model(encoder_inputs, encoder_states)


decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

# Define the decoder model 
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

### 6.c - Test the translation model
> Here you can feed the decode_sequence function with a sentence to translate, taken from encoder_input_data, and print the translated sentence.

> You can increase the number of epochs to 100 to get a more accurate model.

In [0]:
# Reverse-lookup token index to decode sequences back to
# something readable.

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, dataFrEn.num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, dataFrEn.target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = dataFrEn.reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > dataFrEn.max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, dataFrEn.num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence


for seq_index in range(100):
    # Take one sequence (part of the training set) from encoder_input_data[i:i+1]
    # for trying out decoding with decode_sequence.

    # TODO 

    print('-')
    print('Input sentence:', dataFrEn.input_texts[seq_index])
    print('Decoded sentence:', decoded_sentence)