<a href="https://colab.research.google.com/github/kintumiku/hse_nlp/blob/master/dl.ai_Copy_of_NMT_with_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1:  Neural Machine Translation

Welcome to the first assignment of Course 4. In this assignment you will explore neural machine translation using LSTMs with attention. You will implement an encoder-decoder model from scratch and use it to translate English sentences into German.

Machine translation is an important task in natural language processing. It is useful not only for translating one language into another but also for word sense disambiguation. For example, if you are not sure whether the word *bank* in a sentence refers to the financial institution or to the land alongside a river, you can translate the sentence (into French, say) and see whether it comes out as `banc` or `banque`. Beyond this application, neural machine translation is extremely useful in its own right. By completing this assignment you will learn to:

- Use built-in functions to preprocess your data
- Implement an encoder-decoder system
- Understand how attention works
- Build the model from scratch
- Evaluate your model
- Translate your own sentences

As you can tell, this model is slightly different from the ones you have already implemented. It uses an LSTM to encode a sentence into a context vector, feeds that context vector into another LSTM, and uses that second LSTM to decode the target translation while keeping track of the attention. Don't worry about the details yet; we will explain everything in this notebook.

# Part 1: Importing the Data

We will start by importing a few packages and the data so you can do some exploration. We have provided several built-in helper functions that you can use.

If the next line throws an error, run it again (it should work).

In [1]:
!pip install -q -U trax==1.2.4

ERROR: kfac 0.2.2 has requirement tensorflow-probability==0.8, but you'll have tensorflow-probability 0.7.0 which is incompatible.


In [2]:
import os
import time
import numpy as np
import gin
from tensor2tensor import problems
import trax
from trax import layers as tl
from termcolor import colored
import tensorflow as tf
#from utils import PrepareAttentionInput



In [3]:
data_dir = os.path.expanduser("data")
!rm -rf {data_dir}
!mkdir {data_dir}
tmp_dir = os.path.expanduser("tmp")
!rm -rf {tmp_dir}
!mkdir {tmp_dir}

# Fetch the en-de translation data.
# The generate_data method of a problem will download data and process it into
# a standard format ready for training and evaluation.
ende_problem = problems.problem("translate_ende_wmt32k")
# ende_problem.generate_data(data_dir, tmp_dir)

# Instead of the above commented-out line, for speed, we copy the data. 
!gsutil -q cp gs://tensor2tensor-data/tmp/vocab.translate_ende* {data_dir}
!gsutil -m -q cp gs://tensor2tensor-data/tmp/translate_ende* {data_dir}

In [6]:
encoders = ende_problem.feature_encoders('./data/')  # Tokenizer: base of subword modeling.

## 1.1 Encode & Decode helper functions


The cell above loads in the encoder for you. Given any data set, you have to be able to map words to their indices, and indices to their words. The inputs and outputs to your trax models are usually tensors of numbers where each number corresponds to a word. If you were to process your data manually, you would have to make use of the following: 

- <span style='color:blue'> word2Ind: </span> a dictionary mapping the word to its index.
- <span style='color:blue'> ind2Word:</span> a dictionary mapping the index to its word.
- <span style='color:blue'> word2Count:</span> a dictionary mapping the word to the number of times it appears. 
- <span style='color:blue'> num_words:</span> total number of words that have appeared. 
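
As a rough illustration of these four structures, here is a minimal sketch that builds them from a toy corpus. The corpus, the reserved indices 0 (padding) and 1 (end of sentence), and the function name `build_vocab` are assumptions made for this example; the notebook itself relies on the subword encoder loaded above.

```python
from collections import Counter

def build_vocab(sentences):
    """Build word2Ind, ind2Word, word2Count and num_words from whitespace-split sentences."""
    word2Count = Counter(word for s in sentences for word in s.split())
    # Reserve 0 for padding and 1 for end-of-sentence (assumed convention).
    word2Ind = {"<pad>": 0, "<EOS>": 1}
    for word in sorted(word2Count):
        word2Ind[word] = len(word2Ind)
    # Invert the mapping to go from index back to word.
    ind2Word = {ind: word for word, ind in word2Ind.items()}
    # Total number of word occurrences in the corpus.
    num_words = sum(word2Count.values())
    return word2Ind, ind2Word, word2Count, num_words

word2Ind, ind2Word, word2Count, num_words = build_vocab(["I am hungry", "I am here"])
print(word2Ind["hungry"], ind2Word[2], word2Count["I"], num_words)
```

A word-level vocabulary like this cannot represent unseen words, which is one reason the provided encoder works on subwords instead.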

Since you have already implemented these in previous assignments of the specialization, we will provide you with helper functions that will do this for you. Run the cell below to get the following functions:

- <span style='color:blue'> encode: </span> converts a text sentence to its corresponding token array (i.e. list of indices). Also converts words to subwords.
- <span style='color:blue'> decode: </span> converts a token array to its corresponding sentence (i.e. string).

In [7]:
# encode function
def encode(input_str, output_str=None):
    """Input str to features dict, ready for inference"""
    EOS = 1
    inputs =  encoders["inputs"].encode(input_str)+ [EOS]  # Add EOS index to mark the end of the sentence (input)
    batch_inputs = np.reshape(inputs, [1, -1])  # Add the batch dimension to the front of the shape
    return batch_inputs

# decode function
def decode(integers):
    """List of ints to str"""
    integers = list(np.squeeze(integers)) # Remove the dimensions of size 1
    EOS = 1
    if EOS in integers:
        integers = integers[:integers.index(EOS)] # Remove the EOS index to decode only those tokens which have a translation
    return encoders["inputs"].decode(integers)



## 1.2 Defining parameters

You will now set a few hyperparameters, such as the batch size and the maximum sequence length, that control how your data is batched for training. The data preprocessing techniques we use come from Tensor2Tensor (T2T), a package developed by Google. Run the cell below to set the hyperparameters for the T2T data pipeline. The cell uses `gin`, a lightweight configuration framework that lets you set default parameter values. Take a look at how to set the parameters [here](https://github.com/google/gin-config/blob/master/docs/index.md).

In [8]:
# Parameters that control how the T2T problem is batched.
gin.bind_parameter('batch_fn.batch_size_per_device', 64)
gin.bind_parameter('batch_fn.eval_batch_size', 64)
gin.bind_parameter('batch_fn.max_eval_length', 512)
gin.bind_parameter('batch_fn.bucket_length', 32)
gin.bind_parameter('batch_fn.buckets_include_inputs_in_length', True)
# Preprocessing just cuts off too long sequences.
gin.bind_parameter('shuffle_and_batch_data.preprocess_fun',
                   trax.supervised.inputs.wmt_preprocess)
gin.bind_parameter('wmt_preprocess.max_length', 256)
gin.bind_parameter('wmt_preprocess.max_eval_length', 512)

## 1.3 Exploring the data

Now that you have defined your parameters, you can look at the data. The functions we gave you do the same preprocessing you have done repeatedly throughout the specialization; we provide them so you can focus on building the model from scratch. Feel free to look at the tokenized versions of your data.


In [10]:
# Shutting down the warnings. 
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # ignore these 2 lines
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

# These are the english to german inputs
ende_inputs = trax.supervised.inputs.inputs("t2t_translate_ende_wmt32k",
                                            data_dir='./data/')
# using your data generator to see the data 
input_batch, target_batch = next(ende_inputs.train_stream(1))

In [11]:
# Test what it produces.
print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'), decode(input_batch[0]))
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ','red') , input_batch[0])
print(colored('THIS IS THE GERMAN TRANSLATION: \n','red'), decode(target_batch[0]))
print(colored('THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: \n', 'red'), target_batch[0])

THIS IS THE ENGLISH SENTENCE: 
 I would have liked to have had him here, as I would like to remind him of paragraph 6 of the resolution, in which we ask the services to go through the budget with a fine toothcomb with the aim of identifying potential savings.
THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: 
 [   46    98    43  6855     9    43   214   802   439     2    36    46
    98   151     9  6058   802     7  5766   330     7     4   998     2
     6    65    60  1150     4   508     9   491   317     4   700    30
    13  3835 22519 17702     5    30     4  2775     7 28866  6899  1424
  5473     3     1     0     0     0     0     0     0     0     0     0
     0     0     0     0]
THIS IS THE GERMAN TRANSLATION: 
 Ich hätte ihn gerne hier mit dabei gehabt, und ich hätte ihn gerne an Absatz 6 der Entschließung erinnert, in dem wir die Dienststellen auffordern, den Haushaltsplan genau unter die Lupe zu nehmen, um potenzielle Einsparungen zu e

You can see that `ende_inputs` (the English-to-German data generator) returns batches of data. You can use `next` to get the next batch. Take a minute or two to understand the tokens. What does `1` correspond to? Why are we using `0`?
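
To make the roles of the two special tokens concrete, the sketch below inspects a short token array. The values are the tail of the tokenized English sentence printed above; the helper name `split_tokens` is hypothetical, and the reading of `1` as end-of-sentence and `0` as padding follows the `encode` and `decode` helpers defined earlier.

```python
import numpy as np

EOS, PAD = 1, 0

def split_tokens(tokens):
    """Return (content tokens before the first EOS, number of padding tokens)."""
    tokens = np.asarray(tokens)
    eos_positions = np.flatnonzero(tokens == EOS)
    # If there is no EOS, treat the whole array as content.
    end = int(eos_positions[0]) if eos_positions.size else len(tokens)
    n_padding = int(np.sum(tokens == PAD))
    return tokens[:end], n_padding

tail = [6899, 1424, 5473, 3, 1, 0, 0, 0]
content, n_padding = split_tokens(tail)
print(content, n_padding)
```

The zeros carry no meaning of their own: they fill shorter sentences up to the length of the longest sentence in the batch so that the batch forms a rectangular array.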

# Part 2: Neural Machine Translation with Attention

Now that we have given you the data generator and handled the preprocessing for you, it is time to build your own model. We saved you time because you have already processed data earlier in this specialization, so we would rather you spend your time on the interesting part. You will implement a neural machine translation model with attention from scratch. Concretely, you will understand how the encoder works, how the decoder works, and how the attention heads are used.

### 2.1 Encoder 

- <span style="color:red">input_encoder </span>: Input encoder runs on the English sentence and creates activations that will be the keys and values for attention. This is a serial function which uses:
<ul>
    <li> 
        <span style="color:blue">tl.Embedding </span>: takes in the dimension of the model and the input vocab size
   </li> 
   <li> 
        An array of n_encoder_layers of LSTMs each of dimension d_model
   </li> 
</ul>

- <span style="color:red">pre-attention input </span>: Prepares the queries, keys, and values
<ul>
    <li> 
        <span style="color:blue">tl.Embedding </span>: 
   </li> 
   <li> 
        An array of n_encoder_layers of LSTMs each of dimension d_model
   </li> 
</ul>

### 2.2 Attention Explained

<img src = "att.png">

Here we denote the queries by Q, the keys by K, and the values by V. We learn these embeddings by learning $W_Q, W_K, W_V$. In the image above, imagine that K is the key, say the English word embedding, and Q is the query, say the German word embedding. The dot product of similar vectors tends to be large, so think of it as a similarity operation: if the query and the key are similar, the dot product will be large; if they are different, it will be small. You then take the softmax of these dot products and multiply the result by V. You can think of V as the value, again the English word embedding. The result tells you how much weight to put on each English word when translating. So each word has its own key.

A query $q$ assigns each key $k$ a probability that $k$ is a match for $q$. We measure similarity by taking the dot product of the vectors, so $q$ and $k$ are similar if $q \cdot k$ is large. These similarity scores do not add up to 1, so they cannot be used as probabilities directly. To turn them into probabilities, and to focus attention on the best-matching keys, we use the softmax, the same function used in the cross-entropy loss. You can therefore compute the matrix of query-key probabilities, often called the “attention weights”, as $\mathrm{softmax}(QK^T)$. This matrix has shape $L_Q \times L_K$: each query-key pair gets a probability.

In the final step, we take the values, another matrix with the same shape as the keys (and often identical to the keys), and compute a weighted sum, weighting each value $v_i$ by the probability that the key $k_i$ matches the query. This can be computed very efficiently as a matrix multiplication: we multiply the attention weights by the values, and that's it!
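
The three steps described above (dot products, softmax, weighted sum of values) can be sketched in NumPy. This is a simplified, single-head illustration, not the `tl.AttentionQKV` layer used later: it leaves out the learned projections $W_Q, W_K, W_V$, masking, and any scaling the library may apply. The function names and the random inputs are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def dot_product_attention(queries, keys, values):
    """queries: [L_Q, d], keys: [L_K, d], values: [L_K, d_v]."""
    scores = queries @ keys.T            # similarity of every query with every key, [L_Q, L_K]
    weights = softmax(scores, axis=-1)   # attention weights, each row sums to 1
    return weights @ values, weights     # weighted sum of values, [L_Q, d_v]

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries (e.g. decoder positions)
K = rng.normal(size=(5, 4))   # 5 keys (e.g. encoder positions)
V = K                         # values are often the same as the keys
output, weights = dot_product_attention(Q, K, V)
print(weights.shape, output.shape)  # (3, 5) (3, 4)
```

Each row of `weights` says how strongly one query attends to each of the five key positions.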

 

In [12]:
def PrepareAttentionInput(encoder_activations, decoder_activations, inputs):
    """Prepare queries, keys, values and mask for attention."""
    keys = values = encoder_activations
    queries = decoder_activations
    
    # Mask is 1 where inputs are not padding (0) and 0 where they are padding.
    mask = (inputs != 0)
    # We need to add axes to the mask for attention heads and decoder length.
    mask = trax.math.numpy.reshape(mask, (mask.shape[0], 1, 1, mask.shape[1]))
    # Broadcast so mask shape is [batch, 1 for heads, decoder-len, encoder-len].
    mask = mask + trax.math.numpy.zeros((1, 1, decoder_activations.shape[1], 1))
    return queries, keys, values, mask
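
The mask construction in the function above relies on broadcasting. The following sketch repeats the same three steps with plain NumPy on a tiny batch so the resulting shape can be checked; the helper name `attention_mask` and the token values are invented for illustration.

```python
import numpy as np

def attention_mask(inputs, decoder_len):
    """Build a [batch, 1, decoder_len, encoder_len] mask from padded input tokens."""
    mask = (inputs != 0)                                        # [batch, encoder_len]
    mask = np.reshape(mask, (mask.shape[0], 1, 1, mask.shape[1]))
    # Adding zeros of shape [1, 1, decoder_len, 1] broadcasts along the decoder axis.
    mask = mask + np.zeros((1, 1, decoder_len, 1))
    return mask

inputs = np.array([[46, 98, 1, 0, 0],
                   [46, 1, 0, 0, 0]])
mask = attention_mask(inputs, decoder_len=3)
print(mask.shape)     # (2, 1, 3, 5)
print(mask[0, 0, 0])  # 1.0 for real tokens, 0.0 for padding
```

Every decoder position of a sentence gets the same row, because padding depends only on the encoder input.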

### 2.3 Decoder 

<ul>
    <li> 
        <span style="color:blue">tl.LSTM:  </span> takes in the dimension of the model
   </li> 
   <li> 
        An array of n_decoder_layers of LSTMs each of dimension d_model
   </li> 
    <li> 
        <span style="color:blue">tl.Dense:  </span> the dense layer, which maps the activations to the size of the target vocabulary
   </li> 
    <li> 
        <span style="color:blue">tl.LogSoftmax:  </span> log-softmax over the vocabulary of size V
   </li> 
</ul>



In this part you will complete the following components:

### 2.4 Implementation Overview


- <span style="color:red">Pre_attention_decoder </span>: runs on the targets and creates activations that are used as queries in attention. This is a serial function which takes the following:
<ul>
    <li> 
        <span style="color:blue"> tl.ShiftRight(mode=mode) </span>: shifts the target tokens one position to the right, so that the decoder predicts each token from the ones before it (teacher forcing)
   </li> 
   <li> 
        <span style="color:blue"> tl.Embedding </span>: takes in the dimension of the model and the target vocab size
   </li> 
   <li> 
        <span style="color:blue"> tl.LSTM </span>: an LSTM layer of dimension d_model
   </li> 
</ul>
- <span style="color:red">Model:</span> the final model
<ul>
    <li> 
        <span style="color:blue"> tl.Select </span>: duplicate the input so you can keep track of it when doing the attention
   </li> 
   <li> 
        <span style="color:blue"> tl.Parallel </span>: Run input encoder on the input and pre-attention decoder on the target
   </li> 
    
   <li> 
        <span style="color:blue"> tl.Fn(PrepareAttentionInput, n_out=4) </span>: Prepare queries, keys, values and mask for attention.
   </li> 
   <li> 
        <span style="color:blue"> tl.Residual(tl.AttentionQKV(...)) </span>: Run the attention layer, add to the pre-attention decoder
   </li> 
   <li> 
        <span style="color:blue"> tl.Select(...) </span>: drop the attention mask
   </li> 
   <li> 
        <span style="color:blue"> Array of LSTMs</span>: run the rest of the decoder
   </li> 
   <li> 
        <span style="color:blue"> tl.Dense </span>: Prepare output by making it the right size
   </li> 
   <li> 
        <span style="color:blue"> tl.LogSoftmax </span>: Log-softmax for output.
   </li> 
</ul>
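
The `tl.ShiftRight` step in the pre-attention decoder is what makes teacher forcing work: the decoder sees the target delayed by one position, so its output at position t can only depend on target tokens before t. The NumPy function below is a sketch of that idea, assumed to mirror the training-mode behaviour of the layer; `shift_right` and the token values are illustrative only.

```python
import numpy as np

def shift_right(tokens, n_shifts=1):
    """Prepend n_shifts zeros along the time axis and drop the last n_shifts tokens."""
    tokens = np.asarray(tokens)
    padded = np.pad(tokens, [(0, 0), (n_shifts, 0)], mode="constant")
    return padded[:, :-n_shifts]

targets = np.array([[12, 7, 5, 1]])   # a target sentence ending in EOS (1)
decoder_input = shift_right(targets)
print(decoder_input)                  # decoder input: 0, 12, 7, 5
```

At position 0 the decoder receives the padding token and must predict 12; at position 1 it receives 12 and must predict 7, and so on.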

In [13]:
# NMT model with attention.
def NMTAttn(input_vocab_size=33300,
            target_vocab_size=33300,
            d_model=1024,
            n_encoder_layers=2,
            n_decoder_layers=2,
            n_attention_heads=4,
            attention_dropout=0.0,
            mode='train'):
    """Returns an LSTM sequence-to-sequence model with attention.

    The input to the model is a pair (input tokens, target tokens), e.g.,
    an English sentence (tokenized) and its translation into German (tokenized).

    Args:
        input_vocab_size: int: vocab size of the input
        target_vocab_size: int: vocab size of the target
        d_model: int:  depth of embedding (n_units in the LSTM cell)
        n_encoder_layers: int: number of LSTM layers in the encoder
        n_decoder_layers: int: number of LSTM layers in the decoder after attention
        n_attention_heads: int: number of attention heads
        attention_dropout: float, dropout for the attention layer
        mode: str: 'train', 'eval' or 'predict', predict mode is for fast inference

    Returns:
        An LSTM sequence-to-sequence model with attention.
    """

  ### START CODE HERE ###
    input_encoder = tl.Serial(
      tl.Embedding(d_model, input_vocab_size),
      [tl.LSTM(d_model) for _ in range(n_encoder_layers)],
    )
  ### END CODE HERE ###

  ### START CODE HERE ###
    pre_attention_decoder = tl.Serial(
      tl.ShiftRight(mode=mode),  # Teacher forcing: predict O_t from O_{t-1}...
      tl.Embedding(d_model, target_vocab_size),
      tl.LSTM(d_model)
    )
  ### END CODE HERE ###

  ### START CODE HERE ###
    
    model = tl.Serial(
      # Copy input tokens and target tokens as they will be needed later.
      tl.Select([0, 1, 0, 1]),
      # Run input encoder on the input and pre-attention decoder on the target.
      tl.Parallel(input_encoder, pre_attention_decoder),
      # Prepare queries, keys, values and mask for attention.
      tl.Fn(PrepareAttentionInput, n_out=4),
      # Run the attention layer, add to the pre-attention decoder.
      tl.Residual(tl.AttentionQKV(d_model, n_heads=n_attention_heads,
                                  dropout=attention_dropout, mode=mode)),
      tl.Select([0], n_in=2),  # Drop attention mask (not needed).
      # Run the rest of the RNN decoder.
      [tl.LSTM(d_model) for _ in range(n_decoder_layers)],
      # Prepare output by making it the right size.
      tl.Dense(target_vocab_size),
      # Log-softmax for output.
      tl.LogSoftmax()
  )
    return model
  ### END CODE HERE ###
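
One way to follow the model above is to trace what sits on the data stack after each layer. The sketch below does this with plain Python lists of labels. The helpers `select` and `apply_layer` are hypothetical stand-ins that only mimic how many items each layer consumes and produces, under the assumption that layers inside `tl.Serial` operate on the top of a shared stack; no Trax code is run.

```python
def select(stack, indices, n_in=None):
    """Mimic a Select-style layer: take n_in items off the top, push back those at `indices`."""
    n_in = max(indices) + 1 if n_in is None else n_in  # assumed default
    taken, rest = stack[:n_in], stack[n_in:]
    return [taken[i] for i in indices] + rest

def apply_layer(stack, n_in, outputs):
    """Mimic a layer that consumes n_in items from the top and pushes `outputs`."""
    return list(outputs) + stack[n_in:]

stack = ["inputs", "targets"]
stack = select(stack, [0, 1, 0, 1])                                  # copy inputs and targets
stack = apply_layer(stack, 2, ["enc_act", "dec_act"])                # Parallel(encoder, decoder)
stack = apply_layer(stack, 3, ["queries", "keys", "values", "mask"]) # PrepareAttentionInput
stack = apply_layer(stack, 4, ["attended", "mask"])                  # Residual(AttentionQKV)
stack = select(stack, [0], n_in=2)                                   # drop the mask
stack = apply_layer(stack, 1, ["log_probs"])                         # LSTMs, Dense, LogSoftmax
print(stack)  # ['log_probs', 'targets']
```

The trace ends with the log-probabilities next to the untouched copy of the targets, which is the pair a loss function needs.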

# Part 3: Training

Now you are going to train your model. As usual, you have to define the cost function and the optimizer, and decide whether to train on a `gpu` or a `cpu`. Here you will train on a CPU, using a training function you are about to write. Because full training takes a couple of hours, we will also give you a pre-trained version of the model, which you could train a little more and use to translate your own sentences.

### 3.1 Defining the optimizer and the loss function

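In this part you will define the optimizer and the loss function. A single optimizer updates all the parameters of the model (encoder, attention, and decoder) together.

You will use the Adam optimizer, a variant of stochastic gradient descent, to update the parameters, together with a negative log-likelihood (cross-entropy) loss:

$$J = - \frac{1}{m}\sum_{i=1}^m \log (\hat y^{(i)}) $$

Here $m$ is the number of target tokens and $\hat y^{(i)}$ is the probability the model assigns to the correct token at position $i$.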
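
The loss above can be computed directly from the log-softmax outputs of a model. The sketch below does so in NumPy for a three-token toy vocabulary. The function name, the example probabilities, and the choice to leave padding positions out of the average are assumptions made for illustration, not a description of `tl.CrossEntropyLoss`.

```python
import numpy as np

def negative_log_likelihood(log_probs, targets, pad_id=0):
    """log_probs: [length, vocab] log-softmax outputs; targets: [length] token ids."""
    log_probs = np.asarray(log_probs)
    targets = np.asarray(targets)
    # Pick the log-probability assigned to the correct token at each position.
    picked = log_probs[np.arange(len(targets)), targets]
    # Assumed convention: padding positions do not contribute to the loss.
    weights = (targets != pad_id).astype(float)
    return -np.sum(picked * weights) / np.sum(weights)

probs = np.array([[0.1, 0.2, 0.7],
                  [0.1, 0.8, 0.1],
                  [0.9, 0.05, 0.05]])
targets = np.array([2, 1, 0])   # the last position is padding
loss = negative_log_likelihood(np.log(probs), targets)
print(round(float(loss), 2))    # about 0.29
```

The closer the model's probability for the correct token is to 1, the closer that token's contribution to the loss is to 0.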



### 3.2 Training the model


You will now write a function that takes in your model and trains it. To train your model you have to decide how many epochs to run. Here, each `epoch` is one call to `trainer.train_epoch(train_steps, eval_steps)`, which runs `train_steps` training steps on batches from your training iterator and then `eval_steps` evaluation steps.

**Instructions:** Implement the `train_model` function below to train the neural network above. Here is a list of things you should do: 
- Create the trainer object by calling `trax.supervised.Trainer` and pass in the following: 
    - <span style='color:blue'> model </span> = NMTAttn
    - <span style='color:blue'> loss_fn </span> = tl.CrossEntropyLoss()
    - <span style='color:blue'> optimizer </span> = trax.optimizers.Adam
    - <span style='color:blue'> lr_schedule </span> = trax.lr.MultifactorSchedule
    - <span style='color:blue'> inputs </span> = ende_inputs
    - <span style='color:blue'> output_dir </span> = output_dir
    
You will be using a cross-entropy loss with the Adam optimizer. Please read the Trax documentation to get a full understanding. Make sure you call `trainer.train_epoch(train_steps, eval_steps)` once per epoch, `n_epochs` times in total.

In [14]:
# GRADED FUNCTION: train_model
def train_model(NMTAttn, n_epochs, train_steps, eval_steps, output_dir = "~/"):
    '''
    Input: 
        NMTAttn - the model you are building
        n_epochs - number of epochs to train for
        train_steps - number of training steps per epoch
        eval_steps - number of evaluation steps per epoch
        output_dir - folder to save your file
    Output:
        trainer - the trainer object
    '''
    output_dir = os.path.expanduser(output_dir)  # expand '~' to the home directory
    # Learning-rate schedule: constant factor with linear warmup and rsqrt decay.
    lr_schedule = lambda h: trax.lr.MultifactorSchedule(
        h, factors='constant * linear_warmup * rsqrt_decay',
        constant=0.05, warmup_steps=1000)
    ### START CODE HERE ###
    trainer = trax.supervised.Trainer(
        model=NMTAttn,
        loss_fn=tl.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam,  
        lr_schedule=lr_schedule,
        inputs= ende_inputs, 
        output_dir=output_dir)
    for _ in range(n_epochs):
        trainer.train_epoch(train_steps, eval_steps)
    ### END CODE HERE ###
    return trainer
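
The `factors` string passed to the schedule names three multiplicative terms. The function below is a plain-Python sketch of one plausible reading of those names, using the `constant` and `warmup_steps` values from the cell above. The exact formulas inside `trax.lr.MultifactorSchedule` are an assumption here, so treat the numbers as indicative only.

```python
import math

def multifactor_lr(step, constant=0.05, warmup_steps=1000):
    """Assumed reading of 'constant * linear_warmup * rsqrt_decay'."""
    linear_warmup = min(1.0, step / warmup_steps)           # ramps from 0 to 1 during warmup
    rsqrt_decay = 1.0 / math.sqrt(max(step, warmup_steps))  # decays as 1/sqrt(step) after warmup
    return constant * linear_warmup * rsqrt_decay

for step in (10, 1000, 4000):
    print(step, multifactor_lr(step))
```

Under this reading the learning rate is still tiny after 10 steps, peaks at the end of warmup, and then decays slowly.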



Notice that the model is being trained for only 1 epoch of 10 steps, so it will be very different from the fully trained model (which you'll use later).

In [15]:
output_dir = os.path.expanduser('~/output_dir/')
n_epochs  = 1
train_steps = 10
eval_steps = 10
trainer = train_model(NMTAttn, n_epochs, train_steps, eval_steps, output_dir)




Step     10: Ran 10 train steps in 219.06 secs
Step     10: Evaluation
Step     10: train                   accuracy |  0.00000000
Step     10: train                       loss |  10.22048855
Step     10: train         neg_log_perplexity | -10.22048855
Step     10: train          sequence_accuracy |  0.00000000
Step     10: train weights_per_batch_per_core |  2048.00000000
Step     10: eval                    accuracy |  0.00000000
Step     10: eval                        loss |  10.22193718
Step     10: eval          neg_log_perplexity | -10.22193718
Step     10: eval           sequence_accuracy |  0.00000000
Step     10: eval  weights_per_batch_per_core |  2048.00000000
Step     10: Finished evaluation


**Expected Output:** The loss should be around 10.
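
A loss near 10 is roughly what a model that has learned almost nothing would score. As a sanity check, and assuming such a model spreads its probability evenly over the default vocabulary of 33300 tokens, the negative log-likelihood of any token is the log of the vocabulary size:

```python
import math

vocab_size = 33300  # default target_vocab_size of NMTAttn above
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 2))  # 10.41
```

The reported value of about 10.22 sits just below this baseline, which fits a model that has taken only 10 training steps.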

# Part 4: Evaluation

### 4.1 Loading in a trained model

In this part you will evaluate the model by loading the exact same architecture you coded, already trained for you to save time. Please run the cell below to load the model.

In [18]:
model = NMTAttn(mode='eval')
# Copy pre-trained model
!gsutil cp gs://tensor2tensor-data/tmp/nmt/model.pkl* {tmp_dir}
model.init_from_file( "./tmp/model.pkl")

Copying gs://tensor2tensor-data/tmp/nmt/model.pkl...
Copying gs://tensor2tensor-data/tmp/nmt/model.pkl.gz...
/ [2 files][  3.2 GiB/  3.2 GiB]   41.4 MiB/s                                   
Operation completed over 2 objects/3.2 GiB.                                      


# Part 5: Testing with your own input

You will now test the model on your own input by implementing greedy decoding. This consists of two functions. The first one identifies the next symbol: it takes the argmax of the output of your model and returns that index.

**Instructions:** Implement the `next_symbol` function, which takes in `input_tokens` and `cur_output_tokens` and returns the index of the next word.



In [19]:
# Decoding functions.
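def next_symbol(input_tokens, cur_output_tokens):
    """Return the index of the most likely next token given the tokens decoded so far."""
    # Pad the current output with zeros up to a power-of-2 length.
    padded_length = 2**int(np.ceil(np.log2(len(cur_output_tokens) + 3)))
    padded = cur_output_tokens + [0] * (padded_length - len(cur_output_tokens))
    padded_with_batch = np.array(padded)[None, :]  # Add the batch dimension
    # The model returns the log-probabilities plus a second output that is ignored here.
    output, _ = model((input_tokens, padded_with_batch), n_accelerators=1)
    # The targets are shifted right inside the model, so position
    # len(cur_output_tokens) holds the prediction for the next token.
    log_probs = output[0, len(cur_output_tokens), :]
    return int(np.argmax(log_probs))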
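
The first line of `next_symbol` rounds the padded length up to a power of two. The snippet below isolates that expression so its effect can be seen for a few output lengths. The likely motivation, stated here as an assumption, is to keep the number of distinct input shapes small, since each new shape may need to be compiled separately.

```python
import numpy as np

def padded_length(n_tokens):
    """Same expression as in next_symbol: smallest power of 2 that is >= n_tokens + 3."""
    return 2 ** int(np.ceil(np.log2(n_tokens + 3)))

for n in (0, 1, 2, 5, 6, 13, 14):
    print(n, padded_length(n))
```

The padded length is always larger than the number of tokens decoded so far, so indexing the output at position `len(cur_output_tokens)` stays in range.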

Now you will implement the `greedy_decode` algorithm, which calls the `next_symbol` function. It takes in the input sentence and returns the decoded sentence.

**Instructions**: Implement the `greedy_decode` algorithm.

In [20]:
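def greedy_decode(input_sentence):
    """Translate input_sentence by repeatedly picking the most likely next token."""
    input_tokens = encode(input_sentence)
    cur_output_tokens = []
    cur_output = 0
    EOS = 1
    # Keep generating until the model emits the end-of-sentence token.
    while cur_output != EOS:
        cur_output = next_symbol(input_tokens, cur_output_tokens)
        cur_output_tokens.append(cur_output)
    return decode(cur_output_tokens)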

sentence = "I am hungry."
print(sentence)
print(greedy_decode(sentence))

I am hungry.
Ich bin hungrig.
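
The greedy decoding loop can be tried without the 3.2 GiB model by swapping in a toy next-symbol function. In the sketch below, `TOY_TRANSITIONS`, `toy_next_symbol`, and `greedy_decode_tokens` are all hypothetical; the `max_length` guard is an addition that is not in the notebook's `greedy_decode`, included because a model that never emits EOS would otherwise loop forever.

```python
EOS = 1

# Hypothetical stand-in for the model: maps the last emitted token to the next one.
TOY_TRANSITIONS = {None: 7, 7: 4, 4: 9, 9: EOS}

def toy_next_symbol(cur_output_tokens):
    """Return the next token given the tokens emitted so far."""
    last = cur_output_tokens[-1] if cur_output_tokens else None
    return TOY_TRANSITIONS.get(last, EOS)

def greedy_decode_tokens(next_symbol_fn, max_length=50):
    """Append one token at a time until EOS is produced or max_length is reached."""
    output = []
    while len(output) < max_length:
        token = next_symbol_fn(output)
        output.append(token)
        if token == EOS:
            break
    return output

print(greedy_decode_tokens(toy_next_symbol))  # [7, 4, 9, 1]
```

Greedy decoding commits to the single best token at every step, so it is fast but cannot revise an early choice that later turns out to be poor.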
