**Lab 1 and 2: Neural Machine Translation (Extra Guide)**

This week and the next, we will build a neural machine translation model based on the sequence-to-sequence (seq2seq) models proposed by Sutskever et al. (2014) and Cho et al. (2014). Seq2seq models are widely used in machine translation systems such as Google’s neural machine translation system (GNMT) (Wu et al., 2016).

The folder **nmt_lab_files** has been provided for you. This folder contains 3 files:
1. **data.30.vi** - a file where each line contains a Vietnamese sentence to be translated (i.e. the source sentences).
2. **data.30.en** - a file where each line contains the English sentence corresponding to the Vietnamese sentence at the same line position (i.e. the target sentences).
3. **nmt_model_keras.py** - incomplete code for this lab.

The pdf file provided contains an explanation of the code file and a guide on how to complete the code (by doing 3 tasks). Read the pdf file and complete the code as instructed.

## **LanguageDict**

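`LanguageDict` is a class for creating language dictionary objects. Each object holds the vocabulary of one language, including a `word2ids` mapping from words to integer ids.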
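As a rough illustration of what such an object holds, here is a minimal sketch of a word-to-id dictionary. The class name `ToyLanguageDict` and its `encode` method are hypothetical; the actual `LanguageDict` in `nmt_model_keras.py` may be organised differently.

```python
class ToyLanguageDict:
    """Hypothetical stand-in for LanguageDict: maps words to integer ids and back."""

    def __init__(self, sentences):
        # Ids 0 and 1 are reserved for the padding and unknown-word tokens.
        self.word2ids = {'<PAD>': 0, '<UNK>': 1}
        for sentence in sentences:
            for word in sentence:
                if word not in self.word2ids:
                    # Each new word gets the next free id.
                    self.word2ids[word] = len(self.word2ids)
        self.ids2word = {idx: word for word, idx in self.word2ids.items()}

    def encode(self, sentence):
        """Convert a list of words to ids, mapping unseen words to <UNK>."""
        unk = self.word2ids['<UNK>']
        return [self.word2ids.get(word, unk) for word in sentence]


source_dict = ToyLanguageDict([['i', 'like', 'rabbits']])
print(source_dict.word2ids)
# 'cats' was never seen, so it is encoded with the <UNK> id.
print(source_dict.encode(['i', 'like', 'cats']))
```

Reserving fixed ids for `<PAD>` and `<UNK>` means the same ids can be used for padding and for unseen words regardless of the corpus.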

## **The load_dataset() Method**

This helper method reads the source and target files to load `max_num_examples` sentences, splits those sentences into train, development and test sets, and returns the relevant data.
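The splitting step could look roughly like the sketch below. The helper `split_dataset` and its arguments are hypothetical and only illustrate the idea of holding out part of a parallel corpus; the real `load_dataset()` may choose its splits differently (the training log later in this guide reports 24000/3000/3000 train/dev/test examples).

```python
def split_dataset(source_sents, target_sents, num_dev, num_test):
    """Hypothetical helper: hold out the last sentence pairs for dev and test."""
    if len(source_sents) != len(target_sents):
        raise ValueError("source and target must have the same number of lines")
    num_train = len(source_sents) - num_dev - num_test
    # Keep source and target sentences paired so the split stays aligned.
    pairs = list(zip(source_sents, target_sents))
    train = pairs[:num_train]
    dev = pairs[num_train:num_train + num_dev]
    test = pairs[num_train + num_dev:]
    return train, dev, test


# Toy parallel corpus of 10 sentence pairs.
source = [['src', str(i)] for i in range(10)]
target = [['tgt', str(i)] for i in range(10)]
train, dev, test = split_dataset(source, target, num_dev=1, num_test=1)
print(len(train), len(dev), len(test))
```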

As an example of the output returned by this method, let's assume we are translating the sentence 'I like rabbits' from English to English (never the case in practice, but convenient for illustration), so that the tokenised and case-normalised source and target sentence lists are as follows:


```
# In Vietnamese this would actually be [['tôi', 'thích', 'thỏ']].
# We will use English to English here using the following code.
source_words = [['i', 'like', 'rabbits']]
target_words = [['i', 'like', 'rabbits']]
```
The word2ids for the source and target language dictionaries look as follows:
```
source_dict.word2ids = {'<PAD>': 0, '<UNK>': 1, 'i': 2, 'like': 3, 'rabbits':4}

# end and start tokens are added to the target words
target_dict.word2ids = {'<PAD>': 0, '<UNK>': 1, '<start>': 2, 'i': 3, 'like': 4, 'rabbits':5, '<end>':6}

```
Let's also assume that we are training and testing on this dataset of one sentence.
The **source words** for train/dev/test will be given as follows:
```
# [batch_size X max_sent_length]
source_words_train = [[2,3,4]] # corresponding to ['i', 'like', 'rabbits']
source_words_dev = [[2,3,4]]  # corresponding to ['i', 'like', 'rabbits']
source_words_test = [[2,3,4]] # corresponding to ['i', 'like', 'rabbits']
```
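The `[batch_size X max_sent_length]` shape implies that sentences of different lengths are brought to a common length. A minimal sketch, assuming shorter sentences are right-padded with the `<PAD>` id (0) and longer ones are truncated; the helper `pad_batch` is hypothetical, and this guide does not state which side the lab code pads on.

```python
PAD_ID = 0  # id of '<PAD>' in the example dictionaries


def pad_batch(id_sentences, max_sent_length, pad_id=PAD_ID):
    """Hypothetical helper: right-pad or truncate each sentence to max_sent_length."""
    padded = []
    for sent in id_sentences:
        sent = sent[:max_sent_length]  # truncate sentences that are too long
        padded.append(sent + [pad_id] * (max_sent_length - len(sent)))
    return padded


batch = [[2, 3, 4], [2, 3]]
print(pad_batch(batch, max_sent_length=4))
```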

The **target words** for the train data will be given as follows (dev/test do not need target words as the model will generate those):
```
target_words_train = [[2,3,4,5]] # corresponding to ['<start>', 'i', 'like', 'rabbits']
```

The **target word labels** for each word are the next word in the sentence. The target word labels for the train/dev/test data will be given as follows:
```
target_words_train_labels = [[3,4,5,6]] # corresponding to ['i', 'like', 'rabbits', '<end>']
target_words_dev_labels = [[3,4,5,6]] # corresponding to ['i', 'like', 'rabbits', '<end>']
target_words_test_labels = [[3,4,5,6]] # corresponding to ['i', 'like', 'rabbits', '<end>']
```
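The relationship between the decoder inputs and their labels can be sketched as follows: both come from the same sentence, with `<start>` prepended for the inputs and `<end>` appended for the labels. The helper name `make_decoder_pair` is hypothetical.

```python
target_word2ids = {'<PAD>': 0, '<UNK>': 1, '<start>': 2,
                   'i': 3, 'like': 4, 'rabbits': 5, '<end>': 6}


def make_decoder_pair(sentence, word2ids):
    """Hypothetical helper: return (decoder input ids, label ids) for one sentence."""
    unk = word2ids['<UNK>']
    ids = [word2ids.get(word, unk) for word in sentence]
    decoder_input = [word2ids['<start>']] + ids  # shifted right by one step
    labels = ids + [word2ids['<end>']]           # the next word at each position
    return decoder_input, labels


decoder_input, labels = make_decoder_pair(['i', 'like', 'rabbits'], target_word2ids)
print(decoder_input)
print(labels)
```

At every position `t`, `labels[t]` is the word that should follow `decoder_input[t]`, which is exactly what the decoder is trained to predict.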
The train target word labels are then expanded with an extra trailing dimension, giving the following dimensionality:
```
# [batch_size X max_sent_length X 1]
[[[3], [4], [5], [6]]]
```
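A short NumPy sketch of this expansion, assuming the labels are held in an array of shape `[batch_size X max_sent_length]`; the lab code may perform the same step with a different call.

```python
import numpy as np

target_words_train_labels = np.array([[3, 4, 5, 6]])  # shape (1, 4)

# Add a trailing axis of size 1 so each label sits in its own innermost list.
expanded = np.expand_dims(target_words_train_labels, axis=-1)

print(expanded.shape)
print(expanded.tolist())
```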






## **Neural Translation Model (NMT)**

For NMT, the network (a system of connected layers/models) used for training differs slightly from the network used for inference. Both use the encoder-decoder architecture.




### **Training Mode**

**Encoder**

Given:
- `source_words`: a `batch_size(num_sents) x max_sentence_length` array representing the source words. In our mini example, this would be the Vietnamese equivalent of `['i', 'like', 'rabbits']`, i.e. `[['tôi', 'thích', 'thỏ']]`.

The following steps comprise the encoder network:

1. Transform `source_words` into `source_words_embeddings` using a randomly initialized embedding lookup. `source_words_embeddings` is thus an array with the shape `batch_size(num_sents) x max_sentence_length x embedding_dim`.
2. Apply embedding dropout with `embedding_dropout_rate`.
3. Use a single `LSTM` with `hidden_size` units to learn a representation for the source words, i.e. to encode the input.

    (a.) The hidden and cell states for this `LSTM` are initialized to zeros (i.e. we leave the `initial_state = None` default as is).

    (b.) We save the `encoder_outputs` (the full sequence, not just the last state) and the encoder states (hidden and cell).

This way, the model encodes a representation for the source words. Task 1 guides you to complete the encoder part of the training model.
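To make the shapes concrete, the three encoder steps can be traced in plain NumPy. This is an illustrative sketch, not the Keras code that Task 1 asks for: the sizes, the helper `lstm_encode` and the weight names `W`, `U`, `b` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for the toy example.
source_vocab_size, embedding_dim, hidden_size = 5, 8, 6
embedding_dropout_rate = 0.2


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_encode(embeddings, W, U, b):
    """Run one LSTM over (batch, time, emb) inputs, starting from zero states."""
    batch_size, num_steps, _ = embeddings.shape
    hidden = U.shape[0]
    h = np.zeros((batch_size, hidden))  # hidden state initialised to zeros
    c = np.zeros((batch_size, hidden))  # cell state initialised to zeros
    outputs = []
    for t in range(num_steps):
        gates = embeddings[:, t, :] @ W + h @ U + b
        i, f, g, o = np.split(gates, 4, axis=1)  # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    # Return the full output sequence plus the final hidden and cell states.
    return np.stack(outputs, axis=1), h, c


# Step 1: randomly initialised embedding lookup.
embedding_table = rng.normal(scale=0.1, size=(source_vocab_size, embedding_dim))
source_words = np.array([[2, 3, 4]])                     # (batch_size, max_sentence_length)
source_words_embeddings = embedding_table[source_words]  # (1, 3, embedding_dim)

# Step 2: dropout on the embeddings (inverted dropout, training time only).
keep_mask = rng.random(source_words_embeddings.shape) >= embedding_dropout_rate
source_words_embeddings = source_words_embeddings * keep_mask / (1.0 - embedding_dropout_rate)

# Step 3: a single LSTM with hidden_size units.
W = rng.normal(scale=0.1, size=(embedding_dim, 4 * hidden_size))
U = rng.normal(scale=0.1, size=(hidden_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)
encoder_outputs, state_h, state_c = lstm_encode(source_words_embeddings, W, U, b)

print(encoder_outputs.shape, state_h.shape, state_c.shape)
```

Note that the last time step of `encoder_outputs` equals the final hidden state `state_h`; the cell state `state_c` is kept separately because the decoder needs both to be initialised.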


**Decoder (No Attention)**

Given:
- `target_words`: a `batch_size(i.e. num_sents in batch) x max_sentence_length` array representing the target words. This is the translation of the source words, shifted by one time step through a prepended `'<start>'` token, e.g. `['<start>', 'i', 'like', 'rabbits']`.

The decoding is done in the following steps:

1. Transform `target_words` into `target_words_embeddings` using a randomly initialized embedding lookup. `target_words_embeddings` is thus an array with the shape `batch_size x max_sentence_length x embedding_dim`.

2. Apply embedding dropout of `embedding_dropout_rate`.

3. Use a single `LSTM` with `hidden_size` units to learn a representation for the target words. The context is given to this model by using the encoder states to initialise the decoder LSTM. For example, the encoder state after `'thỏ'` (the last word in the input sequence, whose hidden representation summarises the whole sentence) is used when learning the representation for the `'<start>'` token.

4. For each token representation, we use a dense layer to output a vector of `target_vocab_size` probabilities, one for each candidate next word following the represented token. The output `decoder_outputs_train` is thus an array with the shape `batch_size x (max_sent_length + 1) x target_vocab_size`.
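Step 4 can be sketched in NumPy as a dense layer followed by a softmax over the target vocabulary. The random `decoder_hidden` array stands in for the decoder LSTM outputs, all sizes are hypothetical, and the cross-entropy at the end is an assumption about the training loss rather than something stated in this guide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes matching the toy example (4 target tokens, 7 target words).
batch_size, num_steps, hidden_size, target_vocab_size = 1, 4, 6, 7

# Stand-in for the decoder LSTM outputs: one hidden vector per target token.
decoder_hidden = rng.normal(size=(batch_size, num_steps, hidden_size))

# Dense layer mapping each hidden vector to one score per target word.
W = rng.normal(scale=0.1, size=(hidden_size, target_vocab_size))
b = np.zeros(target_vocab_size)
logits = decoder_hidden @ W + b

# Softmax over the vocabulary axis (max subtracted for numerical stability).
logits = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Assumed loss: average negative log-probability of the gold next word.
labels = np.array([[3, 4, 5, 6]])  # ['i', 'like', 'rabbits', '<end>']
gold_probs = np.take_along_axis(probs, labels[..., None], axis=-1).squeeze(-1)
loss = float(-np.log(gold_probs).mean())

print(probs.shape)
print(loss)
```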


### **Inference Mode**

**Encoder**

The inference time encoding follows the same steps as the training time encoding.


**Decoder (No attention)**

During training time, we passed a `batch_size(num_sents) x max_sentence_length` array representing the target words into the decoder LSTM. The `decoder_lstm` represents the given target sentence using the context from the encoder LSTM (representation for the source sentence).  

At test time, several things are different:

1. We no longer have access to a complete translation of the source sentence (recall that no `target_words` arrays exist for dev and test sets). Rather we initialise the target words array as follows:

    Each expected target sentence contains only a single token index, the index of the `'<start>'` token. So the target words array for dev/test is a `batch_size x 1` array (see the `nmt.eval()` function).

2. This `batch_size x 1` array is fed to the trained `decoder_lstm`, and the prediction is a `batch_size x 1 x target_vocab_size` array, so taking the argmax of this array along dimension 2 gives the most probable next word.
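The argmax step can be illustrated on a hand-written prediction array. The probabilities below are made up for a batch of two sentences and a target vocabulary of seven words.

```python
import numpy as np

# Hypothetical decoder output: batch_size=2, one time step, target_vocab_size=7.
step_probs = np.array([
    [[0.05, 0.05, 0.05, 0.60, 0.10, 0.10, 0.05]],
    [[0.05, 0.05, 0.05, 0.10, 0.10, 0.60, 0.05]],
])

# Pick the most probable word id for each sentence along the vocabulary axis.
next_words = np.argmax(step_probs, axis=2)

print(step_probs.shape)
print(next_words.tolist())
```

The result is again a `batch_size x 1` array, so it can be fed straight back into the decoder as the input for the next time step.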

For example, at time step 0 (the first time step) `step_target_words` is given: it is a `batch_size x 1` array containing the `'<start>'` token. The decoder's next-word prediction is then, for each sentence in the batch, the first actual word of the translation.


At the first time step, the `decoder_lstm` still uses the `encoder_states` as its initial states. At subsequent time steps, it uses its own states from the previous time step. We therefore loop over time steps, generating one new word at a time.
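The loop described above can be sketched for a single sentence as follows. `toy_decoder_step` is a hypothetical stand-in for one call to the trained decoder (it follows a fixed transition table instead of running an LSTM), and both the `max_steps` limit and stopping at `<end>` are assumptions about how decoding terminates.

```python
START_ID, END_ID = 2, 6
VOCAB_SIZE = 7
# Toy "model": after word k the most probable next word is TRANSITIONS[k].
TRANSITIONS = {2: 3, 3: 4, 4: 5, 5: 6}


def toy_decoder_step(prev_word, states):
    """Hypothetical stand-in for one decoder call: returns (scores, new states)."""
    scores = [0.0] * VOCAB_SIZE
    scores[TRANSITIONS.get(prev_word, END_ID)] = 1.0
    new_states = states + 1  # a real decoder would return updated (hidden, cell) states
    return scores, new_states


def greedy_decode(encoder_states, max_steps=10):
    states = encoder_states  # the first step starts from the encoder states
    word = START_ID          # the only target token known in advance
    generated = []
    for _ in range(max_steps):
        scores, states = toy_decoder_step(word, states)  # reuse the decoder's own states
        word = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        if word == END_ID:
            break
        generated.append(word)
    return generated


print(greedy_decode(encoder_states=0))
```

With the example target dictionary, the generated ids `3, 4, 5` correspond to `['i', 'like', 'rabbits']`.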





In [10]:
from google.colab import drive
drive.mount('/content/drive/')

In [11]:
# change this to the path to your folder. Remember to start from the home directory
PATH = 'MyDrive/NLP_NN_24/Lab1-2-NMT/nmt_lab_files'

In [12]:
PATH_TO_FOLDER = "/content/drive/" + PATH

In [13]:
import sys
sys.path.append(PATH_TO_FOLDER)

In [14]:
SOURCE_PATH = PATH_TO_FOLDER + '/data.30.vi'
TARGET_PATH = PATH_TO_FOLDER + '/data.30.en'

# SOURCE_PATH = './data.30.vi'
# TARGET_PATH = './data.30.en'

Let's install the SacreBLEU (https://github.com/mjpost/sacrebleu) package for BLEU computation.

In [15]:
!pip install sacrebleu



In [16]:
import nmt_model_keras as nmt

## **Training Without Attention**

If you have completed Tasks 1 and 2, you are ready to train the NMT model without attention.

Run the following cells to train the model for 10 epochs. The model summary is also shown below.

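If you're using a GPU, training should take roughly 10 minutes and you should get a test BLEU score between 5 and 6. Exact timings and scores vary with hardware and random initialisation: the run shown below took about 21 minutes and reached a test BLEU score of 4.97.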

In [17]:
nmt.main(SOURCE_PATH, TARGET_PATH, use_attention=False)

loading dictionaries
read 24000/3000/3000 train/dev/test batches
number of tokens in source: 2034, number of tokens in target:2506
Task 1(a): Creating the embedding lookups...

Task 1(b): Looking up source and target words...

Task 1(c): Creating an encoder




						 Train Model Summary.
Model: "model_6"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_11 (InputLayer)       [(None, None)]               0         []                            
                                                                                                  
 input_12 (InputLayer)       [(None, None)]               0         []                            
                                                                                                  
 embedding_4 (Embedding)     (None, None, 100)            203400    ['input_11[0][0]']            
                                                                                                  
 embedding_5 (Embedding)     (None, None, 100)            250600    ['input_12[0][0]']            
                                                                



Model BLEU score: 0.30
Time used for evaluate on dev set: 0 m 9 s
Starting training epoch 2/10
Time used for epoch 2: 1 m 57 s
Evaluating on dev set after epoch 2/10:




Model BLEU score: 1.88
Time used for evaluate on dev set: 0 m 9 s
Starting training epoch 3/10
Time used for epoch 3: 1 m 57 s
Evaluating on dev set after epoch 3/10:




Model BLEU score: 1.40
Time used for evaluate on dev set: 0 m 8 s
Starting training epoch 4/10
Time used for epoch 4: 1 m 56 s
Evaluating on dev set after epoch 4/10:




Model BLEU score: 2.44
Time used for evaluate on dev set: 0 m 8 s
Starting training epoch 5/10
Time used for epoch 5: 1 m 56 s
Evaluating on dev set after epoch 5/10:




Model BLEU score: 2.68
Time used for evaluate on dev set: 0 m 8 s
Starting training epoch 6/10
Time used for epoch 6: 1 m 59 s
Evaluating on dev set after epoch 6/10:




Model BLEU score: 3.00
Time used for evaluate on dev set: 0 m 8 s
Starting training epoch 7/10
Time used for epoch 7: 1 m 56 s
Evaluating on dev set after epoch 7/10:




Model BLEU score: 3.31
Time used for evaluate on dev set: 0 m 8 s
Starting training epoch 8/10
Time used for epoch 8: 1 m 56 s
Evaluating on dev set after epoch 8/10:




Model BLEU score: 3.95
Time used for evaluate on dev set: 0 m 8 s
Starting training epoch 9/10
Time used for epoch 9: 1 m 55 s
Evaluating on dev set after epoch 9/10:




Model BLEU score: 4.44
Time used for evaluate on dev set: 0 m 8 s
Starting training epoch 10/10
Time used for epoch 10: 1 m 57 s
Evaluating on dev set after epoch 10/10:




Model BLEU score: 4.76
Time used for evaluate on dev set: 0 m 8 s
Training finished!
Time used for training: 21 m 3 s
Evaluating on test set:




Model BLEU score: 4.97
Time used for evaluate on test set: 0 m 8 s


## **Training and Decoding with Attention**

The inputs to the attention layer are the encoder and decoder outputs. The attention mechanism:
1. Computes a score (Luong's dot product attention score) for each source word.
2. Weights the encoder representations using these scores.
3. Concatenates the weighted encoder representation with the decoder output.

This new decoder output is then the input to the `decoder_dense` layer.
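The three attention steps can be traced in NumPy as below. This is a shape-level sketch with random inputs, not the Keras solution to Task 3: the sizes are hypothetical, the softmax normalisation of the scores follows Luong et al.'s formulation, and the order of concatenation is an assumption, so check the pdf for what the task expects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 source words, 4 target tokens.
batch_size, source_len, target_len, hidden_size = 1, 3, 4, 6

encoder_outputs = rng.normal(size=(batch_size, source_len, hidden_size))
decoder_outputs = rng.normal(size=(batch_size, target_len, hidden_size))

# 1. Dot-product score between every decoder position and every source word.
scores = decoder_outputs @ encoder_outputs.transpose(0, 2, 1)  # (1, 4, 3)

# Normalise the scores over the source positions with a softmax.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# 2. Weighted sum of the encoder representations for each decoder position.
context = weights @ encoder_outputs  # (1, 4, 6)

# 3. Concatenate the weighted encoder representation with the decoder output.
new_decoder_outputs = np.concatenate([context, decoder_outputs], axis=-1)  # (1, 4, 12)

print(weights.shape, context.shape, new_decoder_outputs.shape)
```

Because of the concatenation, the dense layer that follows receives vectors of size `2 * hidden_size` instead of `hidden_size`.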

Step-by-step details for Task 3 are in the pdf file. Once you have completed this task, you are ready to train with attention. Training should take roughly 10 minutes on a GPU and you should get a test BLEU score of around 10. Timings and scores vary between runs: the run shown below took about 32 minutes and reached a test BLEU score of 8.56.

In [18]:
nmt.main(SOURCE_PATH, TARGET_PATH, use_attention=True)

loading dictionaries
read 24000/3000/3000 train/dev/test batches
number of tokens in source: 2034, number of tokens in target:2506
Task 1(a): Creating the embedding lookups...

Task 1(b): Looking up source and target words...

Task 1(c): Creating an encoder




						 Train Model Summary.
Model: "model_9"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_16 (InputLayer)       [(None, None)]               0         []                            
                                                                                                  
 embedding_6 (Embedding)     (None, None, 100)            203400    ['input_16[0][0]']            
                                                                                                  
 input_17 (InputLayer)       [(None, None)]               0         []                            
                                                                                                  
 dropout_6 (Dropout)         (None, None, 100)            0         ['embedding_6[0][0]']         
                                                                



Model BLEU score: 1.07
Time used for evaluate on dev set: 0 m 13 s
Starting training epoch 2/10
Time used for epoch 2: 3 m 0 s
Evaluating on dev set after epoch 2/10:




Model BLEU score: 1.05
Time used for evaluate on dev set: 0 m 12 s
Starting training epoch 3/10
Time used for epoch 3: 3 m 2 s
Evaluating on dev set after epoch 3/10:




Model BLEU score: 2.27
Time used for evaluate on dev set: 0 m 12 s
Starting training epoch 4/10
Time used for epoch 4: 2 m 59 s
Evaluating on dev set after epoch 4/10:




Model BLEU score: 2.99
Time used for evaluate on dev set: 0 m 12 s
Starting training epoch 5/10
Time used for epoch 5: 2 m 59 s
Evaluating on dev set after epoch 5/10:




Model BLEU score: 4.15
Time used for evaluate on dev set: 0 m 12 s
Starting training epoch 6/10
Time used for epoch 6: 3 m 2 s
Evaluating on dev set after epoch 6/10:




Model BLEU score: 4.81
Time used for evaluate on dev set: 0 m 13 s
Starting training epoch 7/10
Time used for epoch 7: 3 m 1 s
Evaluating on dev set after epoch 7/10:




Model BLEU score: 5.91
Time used for evaluate on dev set: 0 m 13 s
Starting training epoch 8/10
Time used for epoch 8: 2 m 59 s
Evaluating on dev set after epoch 8/10:




Model BLEU score: 6.82
Time used for evaluate on dev set: 0 m 13 s
Starting training epoch 9/10
Time used for epoch 9: 3 m 0 s
Evaluating on dev set after epoch 9/10:




Model BLEU score: 7.52
Time used for evaluate on dev set: 0 m 13 s
Starting training epoch 10/10
Time used for epoch 10: 3 m 5 s
Evaluating on dev set after epoch 10/10:




Model BLEU score: 8.06
Time used for evaluate on dev set: 0 m 12 s
Training finished!
Time used for training: 32 m 25 s
Evaluating on test set:




Model BLEU score: 8.56
Time used for evaluate on test set: 0 m 12 s
