# **Lab 5 and 6: Neural Machine Translation (Extra Guide)**

This week and the next, we'll build a neural machine translation model based on the sequence-to-sequence (seq2seq) models proposed by Sutskever et al. (2014) and Cho et al. (2014). The seq2seq model is widely used in machine translation systems such as Google’s neural machine translation system (GNMT) (Wu et al., 2016).

A folder, **nmt_lab_files**, has been provided for you. It contains 3 files:
1. **data.30.vi** - each line of this file contains a Vietnamese sentence to be translated (i.e. the source sentences). **Source**
2. **data.30.en** - each line of this file contains the English sentence corresponding to the Vietnamese sentence on the same line (i.e. the target sentences). **Target**
3. **nmt_model_keras.py** - the incomplete code for this lab.

The doc file provided contains an explanation of the code file and a guide on how to complete the code (by doing 3 tasks). Read the doc file and if you can, complete the code as instructed. When the code is completed, skip to section xx of this notebook. 

This notebook (prior to section xx) merely contains further explanation of sections of the code.

### Coursework disclaimer

Note: I decided against keeping the code as a separate Python script and instead transferred the pertinent sections (e.g. attention class, model class, etc.) to this notebook.

## Load relevant packages

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding,LSTM,Dropout,Dense,Layer
from tensorflow.keras import Model,Input
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K
import collections
import numpy as np
import time
from nltk.translate.bleu_score import corpus_bleu

## **LanguageDict**

`LanguageDict` is a class that builds the vocabulary for one language, together with the mappings between words and integer ids.

In [None]:
class LanguageDict():
  def __init__(self, sents):
    word_counter = collections.Counter(tok.lower() for sent in sents for tok in sent)

    self.vocab = []
    self.vocab.append('<pad>') #zero paddings
    self.vocab.append('<unk>')
    # add only words that appear more than 10 times in the corpus
    self.vocab.extend([t for t,c in word_counter.items() if c > 10])

    self.word2ids = {w:id for id, w in enumerate(self.vocab)}
    self.ids2word = dict([(value, key) for (key, value) in self.word2ids.items()])
    self.UNK = self.word2ids['<unk>']
    self.PAD = self.word2ids['<pad>']
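
To see how the frequency cut-off and the `<unk>` fallback interact, here is a minimal sketch on a toy corpus. The `build_vocab` helper and its `min_count` parameter are hypothetical (the class above hard-codes a threshold of 10); the sketch only mirrors the same idea.

```python
import collections

def build_vocab(sents, min_count=1):
    """Return (vocab, word2ids), keeping tokens seen more than `min_count` times.

    Hypothetical helper: it mirrors LanguageDict but makes the threshold a parameter.
    """
    counter = collections.Counter(tok.lower() for sent in sents for tok in sent)
    vocab = ['<pad>', '<unk>']  # ids 0 and 1 are reserved, as in LanguageDict
    vocab.extend(t for t, c in counter.items() if c > min_count)
    word2ids = {w: i for i, w in enumerate(vocab)}
    return vocab, word2ids

toy_sents = [['I', 'like', 'rabbits'], ['i', 'like', 'dogs']]
vocab, word2ids = build_vocab(toy_sents, min_count=1)

# tokens that did not make it into the vocabulary fall back to <unk>
unk = word2ids['<unk>']
encoded = [[word2ids.get(tok.lower(), unk) for tok in sent] for sent in toy_sents]

print(vocab)    # ['<pad>', '<unk>', 'i', 'like']
print(encoded)  # [[2, 3, 1], [2, 3, 1]]
```

Both 'rabbits' and 'dogs' occur only once, so they are encoded as `<unk>` (id 1).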

## **The <load_dataset()> Method**

This helper method reads from the source and target files to 
- load max_num_examples sentences, 
- split them into train, development and test sets, and
- return the relevant data.

The code for this is fully commented. 

<br>

As an example of the kind of output returned by this method, let's assume we are translating the sentence 'I like rabbits' from English to English (this of course is never the case), such that the tokenized and case-normalized source sentence list and target sentence list are as follows:


```
# In our case this would actually be [['tôi', 'thích', 'thỏ']], i.e. the Vietnamese equivalent of the English sentence. 
# We've used English to English here so we can follow along with the code.
source_words = [['i', 'like', 'rabbits']] 
target_words = [['i', 'like', 'rabbits']]
```
The word2ids for the source and target language dictionaries would look something like:
```
source_dict.word2ids = {'<pad>': 0, '<unk>': 1, 'i': 2, 'like': 3, 'rabbits':4}

# end and start tokens are added for the target words
target_dict.word2ids = {'<pad>': 0, '<unk>': 1, '<start>': 2, 'i': 3, 'like': 4, 'rabbits':5, '<end>':6}

```
Let's also assume that we are training and testing on this same dataset of one sentence.
The **source words** for train/dev/test will be given as
```
# a batch_size X max_sent_length array.
source_words_train = [[2,3,4]] # corresponding to ['i', 'like', 'rabbits']
source_words_dev = [[2,3,4]]  # corresponding to ['i', 'like', 'rabbits']
source_words_test = [[2,3,4]] # corresponding to ['i', 'like', 'rabbits']
```

The **target words** for train data will be given as follows (dev/test don't need target words as the model will provide this):
```
target_words_train = [[2,3,4,5]] # corresponding to ['<start>', 'i', 'like', 'rabbits']
```

The **target words labels** for each word will be the word after it. The target word labels for train/dev/test data will be given as follows
```
target_words_train_labels = [[3,4,5,6]] # corresponding to ['i', 'like', 'rabbits', '<end>']
target_words_dev_labels = [[3,4,5,6]] # corresponding to ['i', 'like', 'rabbits', '<end>']
target_words_test_labels = [[3,4,5,6]] # corresponding to ['i', 'like', 'rabbits', '<end>']
```
Finally, a trailing dimension is added to the training labels (shape `[num_sents, max_sent_length, 1]`), so they become:
`[[[3], [4], [5], [6]]]`
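
The shift between decoder inputs and labels can be reproduced in a few lines. This is a minimal sketch using the toy target dictionary from the example above, not the loading code itself; the variable names `decoder_input`, `labels` and `labels_expanded` are chosen for illustration.

```python
import numpy as np

word2ids = {'<pad>': 0, '<unk>': 1, '<start>': 2, 'i': 3, 'like': 4, 'rabbits': 5, '<end>': 6}
sent = ['<start>', 'i', 'like', 'rabbits', '<end>']

# decoder input: everything from <start> up to, but not including, <end>
decoder_input = [word2ids[tok] for tok in sent[:-1]]

# labels: the same sequence shifted left by one step, closed with <end>
labels = decoder_input[1:] + [word2ids['<end>']]

# add the batch dimension, then the trailing dimension used for the training labels
labels_expanded = np.expand_dims(np.array([labels]), axis=2)

print(decoder_input)          # [2, 3, 4, 5]
print(labels)                 # [3, 4, 5, 6]
print(labels_expanded.shape)  # (1, 4, 1)
```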






In [None]:
def load_dataset(source_path,target_path, max_num_examples=30000):
  ''' This helper method reads from the source and target files to load max_num_examples 
  sentences, split them into train, development and testing and return relevant data.
  Inputs:
    source_path (string): the full path to the source data, SOURCE_PATH
    target_path (string): the full path to the target data, TARGET_PATH
  Returns:
    train_data (list): a list of 3 elements: source_words, target words, target word labels
    dev_data (list): a list of 2 elements - source words, target word labels
    test_data (list): a list of 2 elements - source words, target word labels
    source_dict (LanguageDict): a LanguageDict object for the source language, Vietnamese.
    target_dict (LanguageDict): a LanguageDict object for the target language, English.
  ''' 
  # source_lines/target lines are list of strings such that each string is a sentence in the
  # corresponding file. len(source/target_lines) <= max_num_examples
  with open(source_path, encoding='utf-8') as f:
    source_lines = f.readlines()
  with open(target_path, encoding='utf-8') as f:
    target_lines = f.readlines()
  assert len(source_lines) == len(target_lines)
  if max_num_examples > 0:
    max_num_examples = min(len(source_lines), max_num_examples)
    source_lines = source_lines[:max_num_examples]
    target_lines = target_lines[:max_num_examples]

  # strip trailing/leading whitespaces and tokenize each sentence. 
  source_sents = [[tok.lower() for tok in sent.strip().split(' ')] for sent in source_lines]
  target_sents = [[tok.lower() for tok in sent.strip().split(' ')] for sent in target_lines]
  # for the target sentences, add <start> and <end> tokens to each sentence 
  for sent in target_sents:
    sent.append('<end>')
    sent.insert(0,'<start>')

  # create the LanguageDict objects for each file
  source_lang_dict = LanguageDict(source_sents)
  target_lang_dict = LanguageDict(target_sents)


  # for the source sentences.
  # we'll use this to split into train/dev/test 
  unit = len(source_sents)//10
  # get the sents-as-ids for each sentence
  source_words = [[source_lang_dict.word2ids.get(tok,source_lang_dict.UNK) for tok in sent] for sent in source_sents]
  # 8 parts (80%) of the sentences go to the training data. pad up to the maximum sentence length
  source_words_train = pad_sequences(source_words[:8*unit],padding='post')
  # 1 part (10%) of the sentences goes to the dev data. pad up to the maximum sentence length
  source_words_dev = pad_sequences(source_words[8*unit:9*unit],padding='post')
  # the remaining sentences (about 10%) go to the test data. pad up to the maximum sentence length
  source_words_test = pad_sequences(source_words[9*unit:],padding='post')


  eos = target_lang_dict.word2ids['<end>']
  # for each sentence, get the word index for the tokens from <start> to up to but not including <end>,
  target_words = [[target_lang_dict.word2ids.get(tok,target_lang_dict.UNK) for tok in sent[:-1]] for sent in target_sents]
  # select the training set and pad the sentences
  target_words_train = pad_sequences(target_words[:8*unit],padding='post')
  # the label for each target word is the next word after it
  target_words_train_labels = [sent[1:]+[eos] for sent in target_words[:8*unit]]
  # pad the labels. Dim = [num_sents, max_sent_length]
  target_words_train_labels = pad_sequences(target_words_train_labels,padding='post')
  # expand dimensions Dim = [num_sents, max_sent_length, 1]. 
  target_words_train_labels = np.expand_dims(target_words_train_labels,axis=2)

  # get the labels for the dev and test data. No need for inputs here. no need to expand dimensions
  target_words_dev_labels = pad_sequences([sent[1:] + [eos] for sent in target_words[8 * unit:9 * unit]], padding='post')
  target_words_test_labels = pad_sequences([sent[1:] + [eos] for sent in target_words[9 * unit:]], padding='post')

  # we have our data.
  train_data = [source_words_train,target_words_train,target_words_train_labels]
  dev_data = [source_words_dev,target_words_dev_labels]
  test_data = [source_words_test,target_words_test_labels]

  return train_data,dev_data,test_data,source_lang_dict,target_lang_dict
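
The split and the padding used above can be illustrated without Keras. The following is a minimal sketch; `split_sizes` and `pad_post` are hypothetical helpers that imitate the slicing by `unit` and `pad_sequences(..., padding='post')` for plain Python lists.

```python
def split_sizes(num_sents):
    """Sizes of the train/dev/test slices produced by slicing with `unit`."""
    unit = num_sents // 10
    train = 8 * unit               # source_words[:8*unit]
    dev = unit                     # source_words[8*unit:9*unit]
    test = num_sents - 9 * unit    # source_words[9*unit:], which absorbs any remainder
    return train, dev, test

def pad_post(seqs, pad_id=0):
    """Right-pad every sequence with `pad_id` up to the longest one in `seqs`."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

print(split_sizes(30000))            # (24000, 3000, 3000)
print(split_sizes(25))               # (16, 2, 7)
print(pad_post([[2, 3, 4], [2, 3]])) # [[2, 3, 4], [2, 3, 0]]
```

Two details follow from this: when the number of sentences is not a multiple of 10, the leftover sentences end up in the test set, and because each split is padded separately, each split is padded to the length of its own longest sentence.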


## Load the datasets

Let's load the datasets using the load function defined earlier.

In [None]:
source_path = "data.30.vi"
target_path = "data.30.en"

train_data,dev_data,test_data,source_lang_dict,target_lang_dict = load_dataset(source_path,target_path, max_num_examples=30000)

Let's now quickly check the data structure.

In [None]:
print(f"Shape of training set: {len(train_data)}")

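# train_data is [source_words, target_words, target_word_labels]:
# the first index picks the array, the second index picks the sentence.
print("source_words")
print(train_data[0][0])
print([source_lang_dict.ids2word[word] for word in train_data[0][0]])
print("target words")
print(train_data[1][0])
print([target_lang_dict.ids2word[word] for word in train_data[1][0]])
print("target word labels")
# the labels carry a trailing dimension of size 1, so flatten before the lookup
print([target_lang_dict.ids2word[word] for word in train_data[2][0].flatten()])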

Shape of training set: 3
source_words
[ 2  3  4  5  6  7  8  9 10 11  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0]
['khoa', 'học', 'đằng', 'sau', 'một', 'tiêu', 'đề', 'về', 'khí', 'hậu', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
target words
[12 13 14 15 16 17  9 18 19 20 21 22 23 24  2  3 25 26 27 28 29 15 30  1
 16 31 32 33 34 35]
['like', 'to', 'talk', 'you', 'today', 'about', '<end>', 'scale', 'of', 'scientific', 'effort', 'that', 'goes', 'into', '<start>', ':', 'making', 'see', 'in', 'paper', '.', 'you', 'they', '<unk>', 'today', 'are', 'both', 'two', 'same', 'field']
target word labels
['was', 'written', 'by', 'scientists', 'behind', '<unk>', 'effort', 'from', 'behind', '40', 'countries', 'wrote', 'almost', '<start>', ':', 'i', 'thousand', 'field', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 

In [None]:
dev_data[0]

array([[ 111,  963,  706, ...,    0,    0,    0],
       [ 334,  128,  181, ...,    0,    0,    0],
       [1393,   48,  106, ...,    0,    0,    0],
       ...,
       [ 334,  144,   63, ...,    0,    0,    0],
       [ 110,   75,   52, ...,    0,    0,    0],
       [1689, 1072,  432, ...,    0,    0,    0]], dtype=int32)

In [None]:
test_data

[array([[1689, 1072,  343, ...,    0,    0,    0],
        [ 238,  545,   29, ...,    0,    0,    0],
        [ 238,  190,  191, ...,    0,    0,    0],
        ...,
        [  64,   22,  190, ...,    0,    0,    0],
        [ 651,  652,  102, ...,    0,    0,    0],
        [  64,   22,  190, ...,    0,    0,    0]], dtype=int32),
 array([[   4,  260, 1865, ...,    0,    0,    0],
        [  74,  483, 1296, ...,    0,    0,    0],
        [ 100,   75,  124, ...,    0,    0,    0],
        ...,
        [  49,  100,   15, ...,    0,    0,    0],
        [ 462,  829,   22, ...,    0,    0,    0],
        [  49,  100,   15, ...,    0,    0,    0]], dtype=int32)]

## **The Neural Translation Model (NMT)**

For the NMT, the network (a system of connected layers/models) used for training differs slightly from the network used for inference. Both use the seq-to-seq encoder-decoder architecture. 




### **The training mode**

**Encoder**

Given:
- `source_words`: a `batch_size(num_sents) x max_sentence_length` array representing the source words. In our mini example, this would be the Vietnamese equivalent of `['i', 'like', 'rabbits']`; `[['tôi', 'thích', 'thỏ']]`

The following steps comprise the encoding network:

1. Transform `source_words` into `source_words_embeddings` using a randomly initialized embedding lookup. `source_words_embeddings` is thus a `batch_size(num_sents) x max_sentence_length x embedding_dim` array.
2. Apply embedding dropout of `embedding_dropout_rate`.
3. Use a single `LSTM` with `hidden_size` units to learn a representation for the source words, i.e. to encode the input. 

    (a.) The hidden and cell states for this `LSTM` are initialized to zeros (i.e. we leave the `initial_state = None` default as is).

    (b.) We save the `encoder_output` (the whole sequence, not just the last state) and the encoder (hidden and cell) states. 

This way, the model encodes a representation for the source words. Task 1 guides you to complete the encoder part of the training model.
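
The embedding lookup in step 1 is simply row indexing into a matrix. Here is a minimal NumPy sketch of the shapes involved, assuming a toy vocabulary of 5 words and 4-dimensional embeddings (both sizes are made up for illustration; the model uses the real vocabulary size and `embedding_size`).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 4   # toy sizes, assumed for this sketch

# a randomly initialized embedding table: one row per word id
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# two source sentences, the second one padded with id 0
source_words = np.array([[2, 3, 4],
                         [2, 3, 0]])

# the lookup: index the table with the id array
source_word_embeddings = embedding_matrix[source_words]

# positions holding the padding id are the ones an Embedding with mask_zero=True masks out
mask = source_words != 0

print(source_word_embeddings.shape)  # (2, 3, 4)
print(mask.tolist())                 # [[True, True, True], [True, True, False]]
```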

<br>

**Decoder (No Attention)**

Given:
- `target_words`: a `batch_size (i.e. num_sents in batch) x max_sentence_length+1` array representing the target words. This is the translation of the source words, shifted by one time step through a prepended `<start>` token: `['<start>', 'i', 'like', 'rabbits']`.

The decoding is in the following steps:

1. Transform `target_words` into `target_words_embeddings` using a randomly initialized embedding lookup. `target_words_embeddings` is thus a `batch_size x max_sentence_length+1 x embedding_dim` array.

2.  Apply embedding dropout of `embedding_dropout_rate`.

3. Use a single `LSTM` with `hidden_size` units to learn a representation for the target words. Context is given to this model by using the final encoder states (hidden and cell) to initialize the decoder LSTM. This way, the encoder's summary of the source sentence (`'tôi thích thỏ'` in our example) conditions the representation, and hence the next word prediction (see number 4), for every target token starting from `'<start>'`.

4. For each token representation, use a dense layer to predict a `target_vocab_size` vector which is the probability that any given word in the target vocabulary is the next word following the represented token. The output `decoder_outputs_train` is thus a `batch_size x max_sent_length x target_vocab_size` array.
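
Step 4 and the loss it is trained with can be sketched in NumPy. This is an illustrative sketch only: it assumes made-up logits for a 7-word target vocabulary and ignores the padding mask and batch averaging that Keras applies. The helper names `softmax` and `sparse_categorical_crossentropy` are defined here, not imported.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """probs: [batch, time, vocab]; labels: [batch, time, 1] integer ids."""
    picked = np.take_along_axis(probs, labels, axis=-1)  # probability of the true word
    return -np.log(picked[..., 0])                       # [batch, time]

# toy logits that favour the correct next word at each of the 4 time steps
logits = np.zeros((1, 4, 7))
logits[0, np.arange(4), [3, 4, 5, 6]] = 5.0

probs = softmax(logits)                           # [1, 4, 7], each row sums to 1
labels = np.array([[[3], [4], [5], [6]]])         # ['i', 'like', 'rabbits', '<end>']
loss = sparse_categorical_crossentropy(probs, labels)

print(probs.argmax(axis=-1))  # [[3 4 5 6]]
print(loss.shape)             # (1, 4)
```

The trailing dimension of size 1 on `labels` is the one added by `np.expand_dims` in `load_dataset()`.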


### **The Inference Mode**

**Encoder**

The inference time encoding follows the same steps as training time encoding.

<br>

**Decoder (No attention)**

During training time, we passed a `batch_size(num_sents) x max_sentence_length` array representing the target words into the decoder lstm. The decoder_lstm learns how to represent a given target sentence using the context from the encoder lstm (that learns to represent a source sentence).  

At test time, several things are different:

1. We no longer have access to a complete translation of the source sentence (recall that no target_words array exists for the dev and test sets). Rather, we initialize the target_words array as follows:

    Each expected sentence contains only a single token index, the index of the `'<start>'` token. So, the target_word_dev/test is a `batch_size x 1` array (see the nmt.eval() function for this).

2. This `batch_size x 1` array is fed to the trained decoder_lstm and the predicted array is a `batch_size x 1 x target_vocab_size` array, such that taking the argmax of this array across dimension 2 gives the most probable next word. 

    For example, at time_step `0`, the first time step, where the `step_target_words` given is the `batch_size x 1` array containing the `'<start>'` token, the decoder predicts the first word of each sentence in the batch. 

3. At the first time step, the decoder_lstm still uses the encoder_states as its initial states. At subsequent time steps, it uses its own states from the previous time step. This is also what the decoder_lstm does at training time, but it is made more explicit here as we loop over time steps using a for loop (see nmt.eval()).
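
The loop described above can be sketched for a single sentence in plain Python. Everything here is hypothetical: `greedy_decode` and `toy_step` are illustration-only names, and `toy_step` stands in for one call to the trained decoder. Unlike `nmt.eval()`, which always runs `max_target_step` steps for the whole batch and cuts each sentence at `<end>` afterwards, this sketch stops as soon as `<end>` is predicted.

```python
def greedy_decode(step_fn, initial_state, sos_id, eos_id, max_steps):
    """Feed the previous prediction back in until <end> or max_steps is reached."""
    token, state, output = sos_id, initial_state, []
    for _ in range(max_steps):
        scores, state = step_fn(token, state)                    # one decoder step
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        if token == eos_id:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Stand-in for the decoder: always predicts the id that follows `token`."""
    vocab_size = 7
    scores = [0.0] * vocab_size
    scores[min(token + 1, vocab_size - 1)] = 1.0
    return scores, state + 1   # the "state" here is just a step counter

# <start> = 2 and <end> = 6, as in the toy target dictionary
print(greedy_decode(toy_step, 0, sos_id=2, eos_id=6, max_steps=30))  # [3, 4, 5]
```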





### Attention
**encoder_outputs** has a shape of [batch_size, **max_source_sent_len**, hidden_size]

**decoder_outputs** has a shape of [batch_size, **max_target_sent_len**, hidden_size]

In [None]:
class AttentionLayer(Layer):
  def compute_mask(self, inputs, mask=None):
    if mask is None:
      return None
    return mask[1]

  def compute_output_shape(self, input_shape):
    return (input_shape[1][0],input_shape[1][1],input_shape[1][2]*2)


  def call(self, inputs, mask=None):
    encoder_outputs, decoder_outputs = inputs

    """
    Task 3 attention
    
    Start
    """

    #=======================================#
    #the transpose of the last 2 dimensions.
    #=======================================# 
    # Note the first dimension is the batch size and shouldn't be touched.
    
    decoder_outputs_T =  K.permute_dimensions(decoder_outputs,(0,2,1)) #out: [batch_size, hidden_size, max_target_sent_len]

    #=========================================#
    #Dot of encoder outputs and decoder outputs
    #=========================================#
    # axes = [2,1] contracts axis 2 of encoder_outputs (hidden_size)
    # with axis 1 of decoder_outputs_T (also hidden_size, after the transpose)
    luong_score = K.batch_dot(encoder_outputs,
                        decoder_outputs_T,
                        axes =[2,1]) # out: [batch_size, max_source_sent_len, max_target_sent_len]

    #==========================================================#
    #perform softmax to get the probability distribution/weights
    #==========================================================#
    luong_score_softmax = K.softmax(luong_score, axis=1)

    #=========================================#
    #Prepare inputs of weighted sum (expansion)
    #=========================================# 

    # We add singleton dimensions so that the element-wise product below broadcasts
    # the weights over hidden_size and the encoder outputs over max_target_sent_len
    luong_score_softmax_expand = K.expand_dims(luong_score_softmax,-1) # out [batch_size, max_source_sent_len, max_target_sent_len, 1]

    encoder_outputs_expand = K.expand_dims(encoder_outputs,2) # out [batch_size, max_source_sent_len, 1, hidden_size]

    #=============================#
    #Attention score (weighted sum)
    #=============================#

    #Finally we are going to create the encoder_vector by doing element-wise multiplication

    product = encoder_outputs_expand*luong_score_softmax_expand

    #The last step is to sum the max_source_sent_len dimension to create the encoder_vector.
    encoder_vector = K.sum(product,axis = 1)
    
    """
    End Task 3
    """

    
    # [batch,max_dec,2*hidden size]
    new_decoder_outputs = K.concatenate([decoder_outputs, encoder_vector])

    return new_decoder_outputs
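
As a shape check on the layer above, the same computation can be written in NumPy. This is a sketch under simplifying assumptions: random inputs, toy sizes, and no handling of padded positions. The function name `dot_attention` is hypothetical.

```python
import numpy as np

def dot_attention(encoder_outputs, decoder_outputs):
    """Dot-product attention: returns [decoder_outputs ; context] and the weights."""
    # scores[b, s, t] = encoder_outputs[b, s, :] . decoder_outputs[b, t, :]
    scores = np.einsum('bsh,bth->bst', encoder_outputs, decoder_outputs)
    # softmax over the source axis (axis=1), shifted for numerical stability
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # weighted sum of encoder outputs for every target position: [batch, tgt, hidden]
    context = np.einsum('bst,bsh->bth', weights, encoder_outputs)
    return np.concatenate([decoder_outputs, context], axis=-1), weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(2, 5, 8))   # [batch_size, max_source_sent_len, hidden_size]
dec = rng.normal(size=(2, 3, 8))   # [batch_size, max_target_sent_len, hidden_size]

out, w = dot_attention(enc, dec)
print(out.shape)  # (2, 3, 16)
print(w.shape)    # (2, 5, 3)
```

The `einsum` call for `context` is equivalent to the expand, multiply and sum steps in the layer: expanding both tensors to four dimensions and summing over the source axis yields the same `[batch_size, max_target_sent_len, hidden_size]` array.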

In [None]:
class NmtModel(object):
  def __init__(self,source_dict,target_dict,use_attention):
    ''' The model initialization function initializes network parameters.
    Inputs:
      source_dict (LanguageDict): a LanguageDict object for the source language, Vietnamese.
      target_dict (LanguageDict): a LanguageDict object for the target language, English.
      use_attention (bool): if True, use attention.
    Returns:
      None.
    '''
    # the number of hidden units used by the LSTM
    self.hidden_size = 200
    # the size of the word embeddings being used
    self.embedding_size = 100
    # the dropout rate for the hidden layers
    self.hidden_dropout_rate=0.2
    # the dropout rate for the word embeddings
    self.embedding_dropout_rate = 0.2
    # batch size
    self.batch_size = 100

    # the maximum length of the target sentences
    #Used in the inference step
    self.max_target_step = 30

    # vocab size for source and target; we'll use everything we receive
    self.vocab_target_size = len(target_dict.vocab)
    self.vocab_source_size = len(source_dict.vocab)

    # instances of the dictionaries
    self.target_dict = target_dict
    self.source_dict = source_dict

    # special tokens to indicate sentence starts and ends.
    self.SOS = target_dict.word2ids['<start>']
    self.EOS = target_dict.word2ids['<end>']

    # Boolean: whether to use attention or not
    self.use_attention = use_attention

    print("number of tokens in source: %d, number of tokens in target:%d" % (self.vocab_source_size,self.vocab_target_size))



  def build(self):

    #-------------------------Train Models------------------------------
    source_words = Input(shape=(None,),dtype='int32')
    target_words = Input(shape=(None,), dtype='int32')

    """
    Task 1 encoder
    
    Start
    """
    # The train encoder
    # (a.) Create two randomly initialized embedding lookups, one for the source, another for the target. 
    print('Task 1(a): Creating the embedding lookups...')
    embeddings_source = Embedding(self.vocab_source_size, self.embedding_size, name='embedding_source', #Note the first argument here is the vocabulary size
                        	embeddings_initializer='glorot_uniform', mask_zero=True, trainable=True)
    embeddings_target = Embedding(self.vocab_target_size, self.embedding_size, name='embedding_target', #Note the first argument here is the vocabulary size
                        	embeddings_initializer='glorot_uniform', mask_zero=True, trainable=True) 
    
    # (b.) Look up the embeddings for source words and for target words. Apply dropout to each encoded input
    print('\nTask 1(b): Looking up source and target words...')
    source_word_embeddings = embeddings_source(source_words)
    target_words_embeddings = embeddings_target(target_words)

    source_word_embeddings = Dropout(self.embedding_dropout_rate, 
                             input_shape = source_word_embeddings.shape, 
                             name = "dropout_source_embedding",seed=1010)(source_word_embeddings)

    target_words_embeddings = Dropout(self.embedding_dropout_rate, 
                          input_shape = target_words_embeddings.shape, 
                          name = "dropout_target_embedding",seed=1010)(target_words_embeddings)



    # (c.) An encoder LSTM() with return sequences set to True
    print('\nTask 1(c): Creating an encoder')
    encoder_lstm = LSTM(self.hidden_size, return_sequences = True, return_state = True, name = "encoder_LSTM")

    # encoder_outputs = hidden state at every time step
    # encoder_state_h = hidden_state at final time_step
    # encoder_state_c = cell state

    encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm(source_word_embeddings)
    """
    End Task 1
    """
    encoder_states = [encoder_state_h,encoder_state_c]

    # The train decoder
    decoder_lstm = LSTM(self.hidden_size, recurrent_dropout=self.hidden_dropout_rate, 
                        return_sequences=True, return_state=True, name = "decoder_LSTM")
    decoder_outputs_train,_,_ = decoder_lstm(target_words_embeddings,initial_state=encoder_states)

    if self.use_attention:
      decoder_attention = AttentionLayer()
      decoder_outputs_train = decoder_attention([encoder_outputs,decoder_outputs_train])

    decoder_dense = Dense(self.vocab_target_size,activation='softmax')
    decoder_outputs_train = decoder_dense(decoder_outputs_train)

    # compiling the train model.
    adam = Adam(learning_rate=0.01,clipnorm=5.0)
    self.train_model = Model([source_words,target_words], decoder_outputs_train)
    self.train_model.compile(optimizer=adam,loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    # at this point you can print model summary for the train model
    print('\t\t\t\t\t\t Train Model Summary.')
    self.train_model.summary()



    #-------------------------Inference Models------------------------------
    # The inference encoder 
    self.encoder_model = Model(source_words,[encoder_outputs,encoder_state_h,encoder_state_c])
    # at this point you can print the summary for the encoder model.
    print('\t\t\t\t\t\t Inference Time Encoder Model Summary.')
    self.encoder_model.summary()

    # The decoder model
    # specifying the inputs to the decoder
    decoder_state_input_h = Input(shape=(self.hidden_size,)) # last hidden State
    decoder_state_input_c = Input(shape=(self.hidden_size,)) # cell state
    encoder_outputs_input = Input(shape=(None,self.hidden_size,)) # encoder outputs

    """
    Task 2 decoder for inference
    
    Start
    """
    # Task 2 (a.) Get the decoded outputs
    print('\n Putting together the decoder states')
    # get the initial states for the decoder, decoder_states
    # decoder states are the hidden and cell states from the training stage


    decoder_states = [decoder_state_input_h, decoder_state_input_c]

    # use decoder states as input to the decoder lstm to get the decoder outputs, h, and c for test time inference
    decoder_outputs_test,decoder_state_output_h, decoder_state_output_c = decoder_lstm(target_words_embeddings,
                                                                                       initial_state = decoder_states)


    # Task 2 (b.) Add attention if attention
    if self.use_attention:
      decoder_outputs_test = decoder_attention([encoder_outputs_input, 
                                                decoder_outputs_test])

    # Task 2 (c.) pass the decoder_outputs_test (with or without attention) to the decoder dense layer
    
    decoder_outputs_test = decoder_dense(decoder_outputs_test)

    """
    End Task 2 
    """
    # put the model together
    self.decoder_model = Model([target_words,decoder_state_input_h,decoder_state_input_c,encoder_outputs_input],
                               [decoder_outputs_test,decoder_state_output_h,decoder_state_output_c])
    # you can now view the model summary
    print('\t\t\t\t\t\t Decoder Inference Model summary')
    self.decoder_model.summary()



  def time_used(self, start_time):
    curr_time = time.time()
    used_time = curr_time-start_time
    m = used_time // 60
    s = used_time - 60 * m
    return "%d m %d s" % (m, s)



  def train(self,train_data,dev_data,test_data, epochs):
    start_time = time.time()
    for epoch in range(epochs):
      print("Starting training epoch {}/{}".format(epoch + 1, epochs))
      epoch_time = time.time()
      source_words_train, target_words_train, target_words_train_labels = train_data

      self.train_model.fit([source_words_train,target_words_train],target_words_train_labels,batch_size=self.batch_size)

      print("Time used for epoch {}: {}".format(epoch + 1, self.time_used(epoch_time)))
      dev_time = time.time()
      print("Evaluating on dev set after epoch {}/{}:".format(epoch + 1, epochs))
      self.eval(dev_data)
      print("Time used for evaluate on dev set: {}".format(self.time_used(dev_time)))

    print("Training finished!")
    print("Time used for training: {}".format(self.time_used(start_time)))

    print("Evaluating on test set:")
    test_time = time.time()
    self.eval(test_data)
    print("Time used for evaluate on test set: {}".format(self.time_used(test_time)))



  def get_target_sentences(self, sents,vocab,reference=False):
    str_sents = []
    num_sent, max_len = sents.shape
    for i in range(num_sent):
      str_sent = []
      for j in range(max_len):
        t = sents[i,j].item()
        if t == self.SOS:
          continue
        if t == self.EOS:
          break

        str_sent.append(vocab[t])
      if reference:
        str_sents.append([str_sent])
      else:
        str_sents.append(str_sent)
    return str_sents



  def eval(self, dataset,print_outputs = False):
    # get the source words and target_word_labels for the eval dataset
    source_words, target_words_labels = dataset
    vocab = self.target_dict.vocab

    # using the same encoding network used during training time, encode the source words
    encoder_outputs, state_h,state_c = self.encoder_model.predict(source_words,batch_size=self.batch_size)
    # for max_target_step steps, feed the step target words into the decoder.
    predictions = []
    step_target_words = np.ones([source_words.shape[0],1]) * self.SOS #start with <Start> symbol, initialized as a vector of <Start> symbols
    for _ in range(self.max_target_step):
      
      step_decoder_outputs, state_h,state_c = self.decoder_model.predict([step_target_words,state_h,state_c,encoder_outputs],batch_size=self.batch_size)
      step_target_words = np.argmax(step_decoder_outputs,axis=2)
      predictions.append(step_target_words)

    # predictions is a [time_step x batch_size x 1] array. We use get_target_sentence() to recover the batch_size sentences
    candidates = self.get_target_sentences(np.concatenate(predictions,axis=1),vocab)
    references = self.get_target_sentences(target_words_labels,vocab,reference=True)

    # score using nltk bleu scorer
    score = corpus_bleu(references,candidates)
    print("Model BLEU score: %.2f" % (score*100.0))

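    # Modification: optionally return the sentences so they can be inspected
    if print_outputs:
      # source ids are looked up directly in the source vocabulary; the <start>/<end>
      # ids checked by get_target_sentences() belong to the target dictionary only
      source_vocab = self.source_dict.vocab
      sources = [[source_vocab[t] for t in sent] for sent in source_words]
      return sources,  candidates, references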



## Output printing function

This function prints a set number of translations from the test set. This is done by modifying the `eval()` function in the model class, adding a parameter that allows it to return the following:

1. Sources = Source sentences in the test set
2. Candidates = predicted translations
3. References = actual translations

One important thing to note is that there may be `<unk>` tokens in any of these lists. This is because each language dictionary only keeps words that occur more than 10 times in the corpus, so rarer tokens are mapped to `<unk>`.

Moreover, an `<unk>` appears in a predicted translation whenever that token has the highest probability of being the next word, since greedy inference always picks the most probable token. One way to fix this would be to apply some mechanism to replace `<unk>` tokens (e.g. take the 2nd most likely token), or to use a different type of inference altogether (e.g. beam search).
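
One possible form of the replacement mechanism mentioned above is to exclude `<unk>` before taking the argmax, which amounts to falling back to the next most likely token. This is a minimal sketch and not part of the lab code; `argmax_without_unk` is a hypothetical helper and the probabilities are made up.

```python
import numpy as np

def argmax_without_unk(step_probs, unk_id):
    """Pick the most probable token per sentence, never choosing `unk_id`."""
    masked = step_probs.copy()
    masked[..., unk_id] = -np.inf   # rule out <unk> before the argmax
    return np.argmax(masked, axis=-1)

# toy decoder output for one step: [batch_size=2, 1, target_vocab_size=3]
step_probs = np.array([[[0.1, 0.6, 0.3]],
                       [[0.2, 0.1, 0.7]]])

plain = np.argmax(step_probs, axis=2)                 # greedy choice, may pick <unk>
no_unk = argmax_without_unk(step_probs, unk_id=1)     # <unk> has id 1

print(plain.tolist())   # [[1], [2]]
print(no_unk.tolist())  # [[2], [2]]
```

The result has the same `batch_size x 1` shape as `step_target_words` in `eval()`, so it could in principle be fed back into the decoder in the same way.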

In [None]:
def print_examples(model, example_no = 10):
  """
  Prints out a set number of translations from the test set
  """

  sources,  candidates, references = model.eval(test_data,print_outputs=True)

  for i in range(example_no):

    print(f"example:{i+1}")
    print(f"Source sentence: {' '.join(sources[i]).replace('<pad>', '')}")
    print(f"Predicted translation: {' '.join(candidates[i]).replace('<pad>', '')}")
    print(f"Actual translation: {' '.join(references[i][0]).replace('<pad>', '')}")

In [None]:
def main(source_path, target_path, use_attention):
  max_example = 30000
  print('loading dictionaries')
  train_data, dev_data, test_data, source_dict, target_dict = load_dataset(source_path,target_path,max_num_examples=max_example)
  print("read %d/%d/%d train/dev/test batches" % (len(train_data[0]),len(dev_data[0]), len(test_data[0])))

  model = NmtModel(source_dict,target_dict,use_attention)
  model.build()
  model.train(train_data,dev_data,test_data,10)

## **Training Without Attention**

If you've completed Tasks 1 and 2, you are ready to train the NMT model without attention.

Run the following cells to train the model for 10 epochs. They also show the summary of each model you encapsulated.

If you're using a GPU, training will take no more than 10 minutes and you will get a BLEU score between 4 and 5. 

### Architecture

In [None]:
#Clear session prior to creating the architecture
tf.keras.backend.clear_session()
model = NmtModel(source_lang_dict, target_lang_dict,False)
model.build()

number of tokens in source: 2034, number of tokens in target:2506
Task 1(a): Creating the embedding lookups...

Task 1(b): Looking up source and target words...

Task 1(c): Creating an encoder
						 Train Model Summary.
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_source (Embedding)   (None, None, 100)    203400      ['input_1[0][0]']                
                                                                       


 Putting together the decoder states
						 Decoder Inference Model summary
Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_target (Embedding)   (None, None, 100)    250600      ['input_2[0][0]']                
                                                                                                  
 dropout_target_embedding (Drop  (None, None, 100)   0           ['embedding_target[0][0]']       
 out)                                                                                             
                                                                                                  
 input_3 (Input

### Training and test evaluation 

In [None]:
model.train(train_data,dev_data,test_data,10)

Starting training epoch 1/10
Time used for epoch 1: 0 m 18 s
Evaluating on dev set after epoch 1/10:
Model BLEU score: 1.54
Time used for evaluate on dev set: 0 m 8 s
Starting training epoch 2/10
Time used for epoch 2: 0 m 14 s
Evaluating on dev set after epoch 2/10:
Model BLEU score: 2.25
Time used for evaluate on dev set: 0 m 6 s
Starting training epoch 3/10
Time used for epoch 3: 0 m 20 s
Evaluating on dev set after epoch 3/10:
Model BLEU score: 3.56
Time used for evaluate on dev set: 0 m 6 s
Starting training epoch 4/10
Time used for epoch 4: 0 m 14 s
Evaluating on dev set after epoch 4/10:
Model BLEU score: 3.80
Time used for evaluate on dev set: 0 m 6 s
Starting training epoch 5/10
Time used for epoch 5: 0 m 14 s
Evaluating on dev set after epoch 5/10:
Model BLEU score: 4.18
Time used for evaluate on dev set: 0 m 6 s
Starting training epoch 6/10
Time used for epoch 6: 0 m 20 s
Evaluating on dev set after epoch 6/10:
Model BLEU score: 4.54
Time used for evaluate on dev set: 0 m 6 

### Sample output

In [None]:
print_examples(model)

Model BLEU score: 5.65
example:1
Source sentence: trích dẫn thứ hai đến từ người đứng đầu cơ quan quản lý dịch vụ tài chính vương quốc anh .         
Predicted translation: the <unk> <unk> , the <unk> <unk> , and the <unk> <unk> <unk> <unk> <unk> .
Actual translation: the second quote is from the head of the u.k. financial services <unk> .
example:2
Source sentence: chuyện trở nên tồi tệ hơn .                       
Predicted translation: it &apos;s really quite <unk> .
Actual translation: it gets worse .
example:3
Source sentence: chuyện gì đang diễn ra ở đây ? sao chuyện này lại có thể ?               
Predicted translation: what &apos;s happening today ? what is happening ?
Actual translation: what &apos;s happening here ? how can this be possible ?
example:4
Source sentence: thật không may , câu trả lời là đúng vậy đấy .                  
Predicted translation: in fact , it &apos;s not a <unk> .
Actual translation: unfortunately , the answer is yes .
example:5
Source sentence: nhưn

## **Training with Attention**

The inputs to the attention layer are the encoder and decoder outputs. The attention mechanism:
1. Computes a score (a Luong score) for each source word.
2. Weights the source words by their Luong scores.
3. Concatenates the weighted encoder representation with the decoder_output.

This new decoder output will now be the input to the decoder_dense layer.
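
The three steps above can be sketched in plain NumPy as follows. This is only an illustration under some assumptions, not the Task 3 solution: it uses the dot-product form of the Luong score, normalises the scores with a softmax over source positions, ignores padding, and the function name `luong_dot_attention` is made up for this example. The doc file remains the reference for the Keras implementation.

```python
import numpy as np

def luong_dot_attention(encoder_outputs, decoder_outputs):
    """Dot-product (Luong-style) attention.

    encoder_outputs: (batch, src_len, hidden)
    decoder_outputs: (batch, tgt_len, hidden)
    Returns the new decoder output of shape (batch, tgt_len, 2 * hidden)
    and the attention weights of shape (batch, tgt_len, src_len).
    """
    # Step 1: one score per (target position, source word) pair
    scores = np.einsum('bth,bsh->bts', decoder_outputs, encoder_outputs)

    # Softmax over the source axis (assumed normalisation), shifted for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Step 2: weight the encoder outputs by their scores and sum over source words
    context = np.einsum('bts,bsh->bth', weights, encoder_outputs)

    # Step 3: concatenate the weighted encoder representation with the decoder output
    new_decoder_outputs = np.concatenate([context, decoder_outputs], axis=-1)
    return new_decoder_outputs, weights

# Toy shapes: batch of 2, 5 source words, 3 target words, hidden size 4
rng = np.random.default_rng(0)
enc = rng.normal(size=(2, 5, 4))
dec = rng.normal(size=(2, 3, 4))
new_dec, attn = luong_dot_attention(enc, dec)
print(new_dec.shape, attn.shape)  # (2, 3, 8) (2, 3, 5)
```

Note that the last dimension doubles after the concatenation, so the dense layer that follows receives vectors of size 2 * hidden rather than hidden.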

The Task 3 description in the doc file outlines these steps in detail. Once you have completed this task, you are ready to train with attention. Training should take no more than 10 minutes on a GPU, and you should get a BLEU score of about 15.

### Architecture

In [None]:
#Clear session prior to creating the architecture
tf.keras.backend.clear_session()
model_attention = NmtModel(source_lang_dict, target_lang_dict,True)
model_attention.build()

number of tokens in source: 2034, number of tokens in target:2506
Task 1(a): Creating the embedding lookups...

Task 1(b): Looking up source and target words...

Task 1(c): Creating an encoder
						 Train Model Summary.
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_source (Embedding)   (None, None, 100)    203400      ['input_1[0][0]']                
                                                                                                  
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                       


 Putting together the decoder states
						 Decoder Inference Model summary
Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding_target (Embedding)   (None, None, 100)    250600      ['input_2[0][0]']                
                                                                                                  
 dropout_target_embedding (Drop  (None, None, 100)   0           ['embedding_target[0][0]']       
 out)                                                                                             
                                                                                                  
 input_3 (Input

### Training and test evaluation 

In [None]:
model_attention.train(train_data,dev_data,test_data,10)

Starting training epoch 1/10
Time used for epoch 1: 0 m 20 s
Evaluating on dev set after epoch 1/10:
Model BLEU score: 5.49
Time used for evaluate on dev set: 0 m 8 s
Starting training epoch 2/10
Time used for epoch 2: 0 m 20 s
Evaluating on dev set after epoch 2/10:
Model BLEU score: 10.99
Time used for evaluate on dev set: 0 m 7 s
Starting training epoch 3/10
Time used for epoch 3: 0 m 20 s
Evaluating on dev set after epoch 3/10:
Model BLEU score: 13.50
Time used for evaluate on dev set: 0 m 7 s
Starting training epoch 4/10
Time used for epoch 4: 0 m 15 s
Evaluating on dev set after epoch 4/10:
Model BLEU score: 15.02
Time used for evaluate on dev set: 0 m 7 s
Starting training epoch 5/10
Time used for epoch 5: 0 m 15 s
Evaluating on dev set after epoch 5/10:
Model BLEU score: 15.25
Time used for evaluate on dev set: 0 m 7 s
Starting training epoch 6/10
Time used for epoch 6: 0 m 15 s
Evaluating on dev set after epoch 6/10:
Model BLEU score: 14.98
Time used for evaluate on dev set: 0

### Sample output

In [None]:
print_examples(model_attention)

Model BLEU score: 16.25
example:1
Source sentence: trích dẫn thứ hai đến từ người đứng đầu cơ quan quản lý dịch vụ tài chính vương quốc anh .         
Predicted translation: the second path came from the head of the <unk> <unk> of the <unk> england .
Actual translation: the second quote is from the head of the u.k. financial services <unk> .
example:2
Source sentence: chuyện trở nên tồi tệ hơn .                       
Predicted translation: so obviously it &apos;s worse .
Actual translation: it gets worse .
example:3
Source sentence: chuyện gì đang diễn ra ở đây ? sao chuyện này lại có thể ?               
Predicted translation: what &apos;s going on here ? why can you ?
Actual translation: what &apos;s happening here ? how can this be possible ?
example:4
Source sentence: thật không may , câu trả lời là đúng vậy đấy .                  
Predicted translation: unfortunately , the answer is correct .
Actual translation: unfortunately , the answer is yes .
example:5
Source sentence: nhưng