# Seq2Seq model and Evaluation metric - Machine Translation

### Tutorial Topics
- Machine Translation:
    - Seq2Seq model
    - Evaluation metric

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.2.0) 
- Jupyter (latest)
- torchtext
- NLTK

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Seq2Seq model

In this tutorial, we will build a neural network that translates French sentences into English sentences.

We will introduce an important architecture in machine translation: the [sequence to sequence network](http://arxiv.org/abs/1409.3215), in which two recurrent neural networks work together to transform one sequence (e.g., a sentence) into another. An encoder network condenses an input sequence into a **single vector**, and a decoder network unfolds that vector into a new sequence in the target language.

# Sequence to Sequence Learning

A [Sequence to Sequence network](http://arxiv.org/abs/1409.3215), or seq2seq network, or [Encoder Decoder network](https://arxiv.org/pdf/1406.1078v3.pdf), is a model consisting of two separate RNNs called the **`encoder`** and **`decoder`**. The `encoder` reads an input sequence one token at a time and outputs a vector at each step. The final output of the encoder is kept as the **context** vector. In a classification task, we use this **context** vector as a summary of the input sequence. In a seq2seq model, the decoder uses this context vector as its initial state to generate the translation. We will discuss the details in later sections.  

![](https://i.imgur.com/tVtHhNp.png)

 Picture Courtesy: https://i.imgur.com/tVtHhNp.png
 
When using a single RNN, there is a one-to-one relationship between `inputs` and `outputs`. However, there is no direct one-to-one relationship between the words of the source language and those of the target language.

Consider a simple sentence "`Je ne suis pas le chat noir`" &rarr; "`I am not the black cat`". Many of the words have a fairly direct translation, like "chat" &rarr; "cat". However, the differing grammars cause words to appear in different orders, e.g. "chat noir" and "black cat". There is also the "ne ... pas" &rarr; "not" construction, which makes the two sentences differ in length.

With the seq2seq model, by encoding many source inputs into one vector and decoding from one vector into many target outputs, we are freed from the constraints of sequence order and length. The encoded sequence is represented by a single $N$-dimensional vector. In the ideal case, this vector can be considered a summary of the sequence.

The flow of rest of this tutorial is as follows:
1. Preparing data
2. Encoder
3. Decoder
4. Seq2seq
5. Training the model
6. Loading the trained model checkpoint
7. Evaluation

### Required imports

In [None]:
!pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 torchtext==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [None]:
!pip install --upgrade spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import unicodedata
import string
import re
import random
import time
import datetime
import math

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
import torchtext
# from torchtext.legacy import data
import spacy
import numpy as np

In [None]:
torch.__version__

'1.13.1+cu116'

Here we also define a `device` constant that selects the GPU (via CUDA) when one is available and falls back to the CPU otherwise.

Later, when we create tensors and models, this variable determines whether they stay on the CPU or are moved to the GPU.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


## 1. Preparing Data

***Define tokenizers:***
We first create the tokenizers. A tokenizer turns a string containing a sentence into a list of individual tokens.

`spaCy` has a model for each language (`fr_core_news_sm` for French and `en_core_web_sm` for English), which needs to be loaded so that we can access its tokenizer.

***Note***: the models must first be downloaded using the following on the command line:

```
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
```
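
To make concrete what a tokenizer returns, the sketch below uses a regular expression as a rough stand-in. It is not spaCy's algorithm (spaCy applies language-specific rules), and the name `simple_tokenize` is made up for this illustration.

```python
import re

# Hypothetical stand-in for a spaCy tokenizer: words and punctuation marks
# become separate tokens. Real spaCy models treat contractions, elisions
# (e.g. French "l'homme"), and many other cases differently.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(sentence):
    """Lowercase a sentence and split it into word and punctuation tokens."""
    return _TOKEN_RE.findall(sentence.lower())


print(simple_tokenize("Je ne suis pas le chat noir."))
```

The output is a list of lowercase strings, which is the same kind of object the spaCy tokenizers hand to the vocabulary-building code.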

In [None]:
import spacy.cli

spacy.cli.download("en_core_web_sm")
spacy.cli.download("fr_core_news_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


In [None]:
import fr_core_news_sm
import en_core_web_sm

spacy_fr = fr_core_news_sm.load()
spacy_en = en_core_web_sm.load()


In [None]:
### LATEST TORCHTEXT ###
from torchtext.data.utils import get_tokenizer

spacy_en_tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
spacy_fr_tokenizer = get_tokenizer("spacy", language="fr_core_news_sm")

In [None]:
### LATEST TORCHTEXT ###

from collections import OrderedDict, Counter
from torchtext.vocab import vocab
import io

path = '/content/drive/MyDrive/COLX_531_lab3_jhlbxx/data/'
train_fn = 'train_eng_fre.tsv'
valid_fn = 'val_eng_fre.tsv'
test_fn = 'test_eng_fre.tsv'


def build_vocab(filepath, src_tokenizer, trg_tokenizer):
  src_counter, trg_counter = Counter(), Counter()
  with open(filepath, encoding="utf-8") as f:
    for i, line in enumerate(f.readlines()):
      if i == 0:  # skip header
        continue
      # split line and tokenize accordingly
      trg_line, src_line = line.strip("\n").split("\t")
      src_counter.update(src_tokenizer(src_line.lower()))
      trg_counter.update(trg_tokenizer(trg_line.lower()))
    
    # sort and wrap as OrderedDict
    # ordered_src = OrderedDict(sorted(src_counter.items(), key=lambda x: x[1], reverse=True))
    # ordered_trg = OrderedDict(sorted(trg_counter.items(), key=lambda x: x[1], reverse=True))
    ordered_src = sorted(src_counter.items(), key=lambda x: x[1], reverse=True)
    ordered_trg = sorted(trg_counter.items(), key=lambda x: x[1], reverse=True)
    
    # build vocab objects
    # NOTE: OrderedDict as input Requires torchtext >= 0.10.0. Using Counter for now
    src_vocab = vocab(
      src_counter, 
      min_freq=2, 
      specials=('<unk>', '<pad>', '<bos>', '<eos>')
    )

    trg_vocab = vocab(
      trg_counter, 
      min_freq=2,
      specials=('<unk>', '<pad>', '<bos>', '<eos>')
    )
    
    return src_vocab, trg_vocab

src_vocab, trg_vocab = build_vocab(
  path + train_fn, 
  spacy_fr_tokenizer,
  spacy_en_tokenizer
)

In [None]:
### LATEST TORCHTEXT ###

import io

# Define default index to assign to OOV tokens
unk_token = '<unk>'
src_vocab.set_default_index(src_vocab[unk_token])
trg_vocab.set_default_index(trg_vocab[unk_token])

def data_process(path, split):
  raw_iter = iter(io.open(path + split, encoding="utf-8"))
  data = []
  for i, item in enumerate(raw_iter):
    if i == 0:
      continue
    trg_raw, src_raw = item.strip("\n").split("\t")
    src_tensor = torch.tensor(
        [src_vocab[token] for token in spacy_fr_tokenizer(src_raw.lower())],
        dtype=torch.long
      )
    trg_tensor = torch.tensor(
        [trg_vocab[token] for token in spacy_en_tokenizer(trg_raw.lower())],
        dtype=torch.long
      )
    data.append((src_tensor, trg_tensor))

  return data

In [None]:
train_data = data_process(path, train_fn)
val_data = data_process(path, valid_fn)
test_data = data_process(path, test_fn)

In [None]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(val_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


trg_sent = [trg_vocab.get_itos()[i] for i in train_data[0][1]]
src_sent = [src_vocab.get_itos()[i] for i in train_data[0][0]]
print(trg_sent, src_sent)

In [None]:
trg_sent = [trg_vocab.itos[i] for i in val_data[100][1]]
src_sent = [src_vocab.itos[i] for i in val_data[100][0]]
print(trg_sent, src_sent)

['an', 'older', ',', 'overweight', 'man', 'flips', 'a', 'pancake', 'while', 'making', 'breakfast', '.'] ['un', 'homme', 'âgé', 'en', 'surpoids', 'fait', 'sauter', 'une', 'crêpe', 'en', 'préparant', 'le', 'petit', 'déjeuner', '.']


In [None]:
### LATEST TORCHTEXT ###

print(f"Unique tokens in source (fr) vocabulary: {len(src_vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(trg_vocab)}")

Unique tokens in source (fr) vocabulary: 6471
Unique tokens in target (en) vocabulary: 5893
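
The vocabulary sizes above depend on two choices made in `build_vocab`: tokens seen fewer than `min_freq` times are dropped, and the special tokens take the first indices. The plain-Python sketch below illustrates that idea. It is not torchtext's implementation, the names `build_toy_vocab` and `lookup` are made up, and the order torchtext assigns to ordinary tokens may differ.

```python
from collections import Counter


def build_toy_vocab(token_lists, min_freq=2,
                    specials=("<unk>", "<pad>", "<bos>", "<eos>")):
    """Map tokens to integer ids: specials first, then frequent tokens."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    stoi = {tok: idx for idx, tok in enumerate(specials)}
    # Counter preserves first-seen order, so the ids are deterministic.
    for tok, freq in counter.items():
        if freq >= min_freq and tok not in stoi:
            stoi[tok] = len(stoi)
    return stoi


def lookup(stoi, token, unk_token="<unk>"):
    """Return the id of `token`, falling back to the <unk> id for OOV tokens."""
    return stoi.get(token, stoi[unk_token])


corpus = [["un", "chat", "noir"], ["un", "chien", "noir"]]
toy_vocab = build_toy_vocab(corpus)
print(toy_vocab)
print(lookup(toy_vocab, "chien"))  # seen only once, so it maps to <unk>
```

Falling back to the `<unk>` id for unseen tokens plays the role that `set_default_index` plays for the real vocabulary objects.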


In [None]:
# trg_stoi = trg_vocab.get_stoi()  # torchtext >0.10.0
trg_stoi = trg_vocab.stoi
print(trg_stoi['<pad>'])

1


In [None]:
import pickle

with open("/content/drive/MyDrive/models/data/trg_vocab", "wb") as f:
     pickle.dump(trg_vocab, f)

with open("/content/drive/MyDrive/models/data/src_vocab", "wb") as f:
     pickle.dump(src_vocab, f)

In [None]:
### LATEST TORCHTEXT ###

BATCH_SIZE = {
    "train": 16,
    "val": 256,
    "test": 256
}

PAD_IDX = trg_vocab['<pad>']
BOS_IDX = trg_vocab['<bos>']
EOS_IDX = trg_vocab['<eos>']

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def generate_batch(data_batch):
  src_batch, trg_batch = [], []
  for (src_item, trg_item) in data_batch:
    src_batch.append(torch.cat([torch.tensor([BOS_IDX]), src_item, torch.tensor([EOS_IDX])], dim=0))
    trg_batch.append(torch.cat([torch.tensor([BOS_IDX]), trg_item, torch.tensor([EOS_IDX])], dim=0))
  src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
  trg_batch = pad_sequence(trg_batch, padding_value=PAD_IDX)
  return src_batch, trg_batch

In [None]:
train_iter = DataLoader(train_data, batch_size=BATCH_SIZE["train"],
                        shuffle=True, collate_fn=generate_batch)
valid_iter = DataLoader(val_data, batch_size=BATCH_SIZE["val"],
                        shuffle=True, collate_fn=generate_batch)
test_iter = DataLoader(test_data, batch_size=BATCH_SIZE["test"],
                       shuffle=True, collate_fn=generate_batch)

In [None]:
# batch example of training data
for batch in train_iter:
    src, trg = batch
    print('tensor size of source language:', src.shape)
    print('tensor size of target language:', trg.shape)
    print('the tensor of first example in target language:', trg[:, 0])
    break

tensor size of source language: torch.Size([27, 16])
tensor size of target language: torch.Size([24, 16])
the tensor of first example in target language: tensor([   2,   21,  964,  362, 2688,  202,   96, 2161,  151,   14,    3,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1])
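
In the tensor above, the sentence starts with `2` (`<bos>`), ends with `3` (`<eos>`), and is then filled with `1` (`<pad>`) up to the longest sentence in the batch. The sketch below reproduces that layout with plain Python lists. It only illustrates what `generate_batch` does with `pad_sequence`; the function name `pad_batch` and the token ids in the example are made up.

```python
PAD, BOS, EOS = 1, 2, 3  # ids of the special tokens in this tutorial


def pad_batch(sequences, pad=PAD, bos=BOS, eos=EOS):
    """Wrap each sequence in <bos>/<eos> and pad to the longest one.

    Returns a time-major nested list of shape [max_len, batch_size],
    mirroring the [seq_len, batch_size] tensors used in this tutorial.
    """
    wrapped = [[bos] + list(seq) + [eos] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    padded = [seq + [pad] * (max_len - len(seq)) for seq in wrapped]
    # Transpose from [batch_size, max_len] to [max_len, batch_size].
    return [list(step) for step in zip(*padded)]


batch = pad_batch([[21, 964, 362], [14]])
for step in batch:
    print(step)
```

Each printed row is one time step across the batch, which is why `trg[:, 0]` selects the first sentence.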


## Building the Seq2Seq Model

## 2. Encoder

![](https://pytorch.org/tutorials/_images/seq2seq.png)

First, we'll build the encoder model that encodes the French sentence. We use a single-layer, uni-directional LSTM.

As in the classification task (covered in DSCI 572), we pass only the output of the embedding layer to the LSTM layer. The LSTM layer returns `outputs`, `hidden` and `cell`. `hidden` is the final hidden state of the LSTM layer (t=seq_len), and `cell` is its final cell state (t=seq_len). Together, `hidden` and `cell` can be considered the **context** representation of the source sentence. 

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dropout = dropout
        self.n_layers = n_layers

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, enc_hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.lstm(embedded)
       
        # outputs are always from the top hidden layer, if bidirectional outputs are concatenated.
        # outputs shape [sequence_length, batch_size, hidden_dim * num_directions]
        # hidden is of shape [num_layers * num_directions, batch_size, hidden_size]
        # cell is of shape [num_layers * num_directions, batch_size, hidden_size]
        
        return hidden, cell

## 3. Decoder

![](https://pytorch.org/tutorials/_images/seq2seq.png)

Next up is the decoder. The decoder is a uni-directional LSTM.


At time step $t$, the inputs of the decoder LSTM are the embedded word vector of the $t$-th target word, $y_t$, the previous decoder hidden state, $h_{t-1}$, and the previous decoder cell state, $c_{t-1}$.

$$h_t, c_t = \text{DecoderLSTM}(y_t, (h_{t-1}, c_{t-1}))$$

Specifically, we use the last `hidden state` and `cell state` of the encoder LSTM as the initial states of the decoder LSTM (i.e., $h_{0}, c_{0}$) rather than initializing them randomly. 

We then pass the hidden state of the LSTM layer, $h_t$, through the linear layer, $f$, to predict the next word in the target sentence, $\hat{y}_{t+1}$. 

$$\hat{y}_{t+1} = f(h_t)$$
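
As a small worked example of this last step, the sketch below applies a linear layer to a toy hidden vector and picks the highest-scoring token. The sizes (hidden size 2, vocabulary size 3) and all numbers are made up for illustration, and plain Python lists stand in for tensors.

```python
def linear(hidden, weight, bias):
    """Compute logits = W h + b for one hidden vector (plain Python lists)."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]


def greedy_token(logits):
    """Return the index of the largest logit (the most likely token)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy values, assumed for illustration only.
h_t = [1.0, -1.0]
W = [[0.5, 0.0],   # row for token 0
     [1.0, -1.0],  # row for token 1
     [0.0, 2.0]]   # row for token 2
b = [0.0, 0.0, 0.1]

logits = linear(h_t, W, b)
print(logits, greedy_token(logits))
```

The linear layer yields one score per vocabulary entry; turning the scores into a single token is a separate step (here, a greedy argmax).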

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, dec_hid_dim, n_layers, dropout):
        super().__init__()

        self.emb_dim = emb_dim
        self.output_dim = output_dim
        self.dec_hid_dim = dec_hid_dim
        self.n_layers = n_layers
        self.dropout = dropout

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.lstm = nn.LSTM(emb_dim, dec_hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(dec_hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
             
        # input is of shape [batch_size]
        # hidden is of shape [n_layer * num_directions, batch_size, hidden_size]
        # cell is of shape [n_layer * num_directions, batch_size, hidden_size]
        
        input = input.unsqueeze(0)
        
        # input shape is now [1, batch_size]; the extra dimension is the sequence length,
        # so this is a batch of batch_size sequences, each containing 1 token index.
        # after embedding, the LSTM receives the rank 3 tensor it expects.
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]    
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        
        # output shape is [sequence_len, batch_size, hidden_dim * num_directions]
        # hidden shape is [num_layers * num_directions, batch_size, hidden_dim]
        # cell shape is [num_layers * num_directions, batch_size, hidden_dim]

        # sequence_len and num_directions will always be 1 in the decoder.
        # output shape is [1, batch_size, hidden_dim]
        # hidden shape is [num_layers, batch_size, hidden_dim]
        # cell shape is [num_layers, batch_size, hidden_dim]
        
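        # use the top-layer output so this also works when n_layers > 1
        # (for n_layers == 1, output and hidden hold the same values)
        prediction = self.fc_out(output.squeeze(0)) # squeeze to [batch_size, hidden_dim]
        # prediction shape is [batch_size, output_dim]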
        
        return prediction, hidden, cell

## 4. Seq2Seq


![](https://pytorch.org/tutorials/_images/seq2seq.png)

The `encoder` returns both the final `hidden state` and `cell state` to be used as the initial `hidden state` and `cell state` for the `decoder`.

Briefly going over all of the steps:
- the `outputs` tensor is created to hold all predictions, $\hat{Y} = \{\hat{y_0}, \hat{y_1} ... \hat{y_t}\}$;
- the source sequence, $X = \{x_0,x_1,..., x_t\}$, is fed into the encoder to receive last hidden state, $h^{Encoder}_t$, and last cell state $c^{Encoder}_t$;
- the initial decoder hidden state is set to be the $h^{Encoder}_t$, and the initial decoder cell state is set to be the $c^{Encoder}_t$. (i.e., $h^{Decoder}_0$ = $h^{Encoder}_t$; $c^{Decoder}_0$ = $c^{Encoder}_t$);
- we use a batch of `<bos>` tokens as the first `input` (i.e., $y_1$);
- we then decode within a loop, `for i in range(1, t)`, where `t` is the maximal length of the target sentences in the batch:
    - insert the input token $y_i$, the previous hidden state, $h^{Decoder}_{i-1}$, and the previous cell state, $c^{Decoder}_{i-1}$, into the decoder;
    - receive a prediction, $\hat{y}_{i+1}$, which holds a score for every word in the target vocabulary, a new hidden state, $h^{Decoder}_{i}$, and a new cell state, $c^{Decoder}_{i}$;
    - decide whether to **teacher force** and set the next input accordingly: if teacher forcing is on, the next input is the gold token $y_{i+1}$; otherwise, it is the most likely token according to the prediction $\hat{y}_{i+1}$.
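
The loop described above can be sketched without any neural network. In the code below, `step_fn` is a hypothetical stand-in for the decoder, the token ids are made up, and `decode_with_teacher_forcing` is an illustrative name, not part of the model code in this tutorial.

```python
import random


def decode_with_teacher_forcing(gold, step_fn, teacher_forcing_ratio, rng):
    """Run a decoding loop over `gold` (which starts with <bos>).

    `step_fn(token, state)` stands in for the decoder and returns
    (predicted_token, new_state). Returns the inputs fed to the decoder
    and the predictions it made.
    """
    state = None
    inputs, predictions = [], []
    token = gold[0]  # the first input is <bos>
    for t in range(1, len(gold)):
        inputs.append(token)
        predicted, state = step_fn(token, state)
        predictions.append(predicted)
        # With probability `teacher_forcing_ratio`, feed the gold token for
        # position t; otherwise feed the token the decoder just predicted.
        use_teacher_force = rng.random() < teacher_forcing_ratio
        token = gold[t] if use_teacher_force else predicted
    return inputs, predictions


def toy_step(token, state):
    """Toy 'decoder' that always predicts token + 100 and keeps no state."""
    return token + 100, state


gold = [2, 7, 8, 3]  # <bos>, two words, <eos> (ids are made up)
rng = random.Random(0)
forced = decode_with_teacher_forcing(gold, toy_step, 1.0, rng)
free = decode_with_teacher_forcing(gold, toy_step, 0.0, rng)
print(forced)
print(free)
```

With a ratio of 1.0 the decoder only ever sees gold tokens, while with 0.0 each prediction becomes the next input, so an early mistake propagates through the rest of the sequence.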

In [None]:
class Seq2Seq(nn.Module):
    ''' This class contains the implementation of the complete sequence to sequence network.
    It uses the encoder to produce the context vectors.
    It uses the decoder to produce the predicted target sentence.
    Args:
        encoder: An Encoder class instance.
        decoder: A Decoder class instance.
        device: The torch device on which the output tensor is allocated.
    '''
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src is of shape [src_sequence_len, batch_size]
        # trg is of shape [targ_sequence_len, batch_size]
        # if teacher_forcing_ratio is 0.5 we use ground-truth inputs 50% of time and 50% time we use decoder outputs.

        batch_size = trg.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        # to store the outputs of the decoder
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)

        # context vector, last hidden and cell state of encoder to initialize the decoder
        hidden, cell = self.encoder(src)

        # first input to the decoder is the <bos> tokens
        input = trg[0, :]

        for t in range(1, max_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            # draw a random number in [0, 1) and teacher force if it is below the ratio
            # if the ratio is 1.0, use_teacher_force is always True
            # if the ratio is 0.0, use_teacher_force is always False
            # if the ratio is 0.4, use_teacher_force is True about 40% of the time
            use_teacher_force = random.random() < teacher_forcing_ratio 
            top1 = output.max(1)[1]
            # decide the next input based on use_teacher_force:
            # if teacher forcing is on, the next input is the gold token at position t;
            # otherwise, it is the token the decoder just predicted for position t.
            input = (trg[t] if use_teacher_force else top1) 

        # outputs is of shape [sequence_len, batch_size, output_dim]
        return outputs

## 5. Training the Seq2Seq Model
We instantiate our encoder, decoder and seq2seq model (placing it on the GPU if we have one). 

In [None]:
torch.cuda.is_available()

True

In [None]:
#INPUT_DIM = len(SRC.vocab) # tokens in source vocabulary
#OUTPUT_DIM = len(TRG.vocab) # tokens in target vocabulary
INPUT_DIM = len(src_vocab) # tokens in source vocabulary
OUTPUT_DIM = len(trg_vocab) # tokens in target vocabulary

# hyperparameters
ENC_EMB_DIM = 256 # encoder embedding size
DEC_EMB_DIM = 256 # decoder embedding size
ENC_HID_DIM = 512 # encoder hidden size
DEC_HID_DIM = 512 # decoder hidden size
ENC_DROPOUT = 0.5 # dropout for encoder
DEC_DROPOUT = 0.3 # dropout for decoder
N_LAYERS = 1 # number of LSTM layers
LEARNING_RT = 0.001 # learning rate

# model
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, DEC_HID_DIM, N_LAYERS, DEC_DROPOUT)
model = Seq2Seq(enc, dec, device).to(device)



We use a simplified version of the **weight initialization scheme**. Here, we initialize all biases to zero and draw all weights from a normal distribution with mean 0 and standard deviation 0.01, i.e., $\mathcal{N}(0, 0.01^2)$.

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6471, 256)
    (lstm): LSTM(256, 512, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (lstm): LSTM(256, 512, dropout=0.3)
    (fc_out): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
)

Calculate the number of parameters.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 9,342,213 trainable parameters
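
The total above can be checked by hand from the layer sizes printed in the model summary. The sketch below assumes the standard parameter layout of a single-layer, uni-directional `nn.LSTM` (four gates, each with an input weight matrix, a hidden weight matrix, and two bias vectors); the helper names are made up for this illustration.

```python
def lstm_params(input_size, hidden_size):
    """Parameters of a single-layer, uni-directional LSTM.

    Four gates, each with an input weight matrix, a hidden weight matrix,
    and two bias vectors (PyTorch keeps bias_ih and bias_hh separately).
    """
    return 4 * (hidden_size * input_size
                + hidden_size * hidden_size
                + 2 * hidden_size)


def embedding_params(vocab_size, emb_dim):
    """One emb_dim-sized vector per vocabulary entry."""
    return vocab_size * emb_dim


def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output feature."""
    return in_features * out_features + out_features


SRC_VOCAB, TRG_VOCAB = 6471, 5893  # vocabulary sizes reported earlier
EMB_DIM, HID_DIM = 256, 512

encoder_total = embedding_params(SRC_VOCAB, EMB_DIM) + lstm_params(EMB_DIM, HID_DIM)
decoder_total = (embedding_params(TRG_VOCAB, EMB_DIM)
                 + lstm_params(EMB_DIM, HID_DIM)
                 + linear_params(HID_DIM, TRG_VOCAB))
print(f"{encoder_total + decoder_total:,}")
```

The output projection (`fc_out`) and the two embedding tables account for most of the parameters, because each scales with a vocabulary size.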


Create an optimizer.

In [None]:
optimizer = optim.Adam(model.parameters(), lr = LEARNING_RT)

Initialize the loss function. The pad token needs to be ignored.
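
To see what ignoring the pad token means for the loss value, here is a plain-Python sketch of a mean cross-entropy that skips positions whose target is the ignored index. It illustrates the idea behind the `ignore_index` argument rather than PyTorch's actual implementation, and the function name and toy numbers are made up.

```python
import math


def cross_entropy_ignore(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignored.

    `logits` is a list of per-position score lists; `targets` holds gold ids.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded positions contribute nothing to the loss
        # log-sum-exp with the max subtracted for numerical stability
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else float("nan")


PAD_ID = 1  # index of <pad> in this tutorial's vocabularies
toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
toy_targets = [2, 0, PAD_ID]

masked = cross_entropy_ignore(toy_logits, toy_targets, PAD_ID)
unmasked = cross_entropy_ignore(toy_logits, toy_targets, ignore_index=-100)
print(masked, unmasked)
```

Without the mask, the padded position would be scored like a real word and would distort the average, which matters here because short sentences in a batch consist mostly of padding.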

In [None]:
#TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
print('<pad> token index: ', PAD_IDX)
## we will ignore the pad token in true target set
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)

<pad> token index:  1


### Testing Model With a Single Batch
We will run the model on the first training batch to test our code.

In [None]:
clip = 1
model.train()

for i, (src, trg) in enumerate(train_iter):
    
    # read the source sentence and target sentence
    # src = batch.SRC
    # trg = batch.TRG
    src, trg = src.to(device), trg.to(device)

    # clear the gradient buffer
    optimizer.zero_grad()

    # forward pass
    output = model(src, trg)
    #trg = [trg len, batch size]
    #output = [trg len, batch size, output dim]

    output_dim = output.shape[-1]

    output = output[1:].view(-1, output_dim)
    trg = trg[1:].view(-1)

    #trg = [(trg len - 1) * batch size]
    #output = [(trg len - 1) * batch size, output dim]
    
    # compute the loss
    loss = criterion(output, trg)
    
    # compute the gradients
    loss.backward()

    # clip the gradients to prevent gradient explosion problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    
    # update the parameters
    optimizer.step()

    print(loss/src.shape[1])
    break

tensor(0.5426, device='cuda:0', grad_fn=<DivBackward0>)
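
The gradient clipping step in the cell above rescales all gradients together when their joint norm exceeds `clip`. The sketch below shows that idea on plain Python lists. It is an illustration under the assumption of L2-norm clipping, not PyTorch's code, and `clip_by_global_norm` is a made-up name.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so that their joint L2 norm is at most `max_norm`.

    `grads` is a list of flat gradient lists, one per parameter.
    Returns the scaled gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale factor is capped at 1 so small gradients stay unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm


clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after, clipped)
```

Because every gradient is multiplied by the same factor, the update keeps its direction and only its length shrinks.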


## Full training process
Once the single-batch test succeeds, we can define the full training loop as follows:

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, (src, trg) in enumerate(iterator):
        
        src, trg = src.to(device), trg.to(device)
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        # the loss function expects 2d logits and 1d targets,
        # so flatten the trg and output tensors, skipping the <bos> position
        # trg shape should be [(sequence_len - 1) * batch_size]
        # output shape should be [(sequence_len - 1) * batch_size, output_dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

...and the evaluation loop, remembering to set the model to `eval` mode and turn off teacher forcing (i.e., teacher forcing ratio = 0).

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, (src, trg) in enumerate(iterator):

            src, trg = src.to(device), trg.to(device)

            output = model(src, trg, 0) # turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Count the running time.

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

## Training the model

We will train the model for `N_EPOCHS` epochs (15 in the cell below). At the end of each epoch, we save a checkpoint and evaluate on the development set. We print the loss and perplexity on the training and development sets.

In [None]:
N_EPOCHS = 15
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iter, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iter, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # Create checkpoint at end of each epoch
    state_dict_model = model.state_dict() 
    state = {
        'epoch': epoch,
        'state_dict': state_dict_model,
        'optimizer': optimizer.state_dict()
        }

    torch.save(state, "/content/drive/MyDrive/models/model_result/uni_LSTM/seq2seq_"+str(epoch+1)+".pt")

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 1m 13s
	Train Loss: 4.663 | Train PPL: 105.992
	 Val. Loss: 4.834 |  Val. PPL: 125.754
Epoch: 02 | Time: 1m 12s
	Train Loss: 4.004 | Train PPL:  54.795
	 Val. Loss: 4.322 |  Val. PPL:  75.367
Epoch: 03 | Time: 1m 13s
	Train Loss: 3.536 | Train PPL:  34.346
	 Val. Loss: 4.072 |  Val. PPL:  58.679
Epoch: 04 | Time: 1m 13s
	Train Loss: 3.171 | Train PPL:  23.827
	 Val. Loss: 3.837 |  Val. PPL:  46.403
Epoch: 05 | Time: 1m 13s
	Train Loss: 2.872 | Train PPL:  17.665
	 Val. Loss: 3.679 |  Val. PPL:  39.607
Epoch: 06 | Time: 1m 14s
	Train Loss: 2.630 | Train PPL:  13.870
	 Val. Loss: 3.599 |  Val. PPL:  36.568
Epoch: 07 | Time: 1m 13s
	Train Loss: 2.428 | Train PPL:  11.341
	 Val. Loss: 3.569 |  Val. PPL:  35.470
Epoch: 08 | Time: 1m 13s
	Train Loss: 2.260 | Train PPL:   9.586
	 Val. Loss: 3.558 |  Val. PPL:  35.108
Epoch: 09 | Time: 1m 13s
	Train Loss: 2.110 | Train PPL:   8.247
	 Val. Loss: 3.532 |  Val. PPL:  34.195
Epoch: 10 | Time: 1m 12s
	Train Loss: 1.982 | Train PPL
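
The perplexity (PPL) values in the log are the exponential of the corresponding mean cross-entropy loss. The short sketch below makes that relationship explicit; the function name `perplexity` is ours, and the loss values are copied from the first epoch of the log, where they are rounded to three decimals.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)


# A model that guesses uniformly over V tokens has loss ln(V), so its
# perplexity is V: it is as uncertain as a uniform choice among V tokens.
V = 5893  # target vocabulary size reported earlier
uniform_ppl = perplexity(math.log(V))
print(uniform_ppl)

# Rounded losses from epoch 1; the results only approximately match the
# logged perplexities, which were computed from the unrounded losses.
print(perplexity(4.663), perplexity(4.834))
```

Read this way, a validation perplexity of about 35 means the model is, on average, about as uncertain as if it were choosing uniformly among 35 words at each step.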

## 6. Load Checkpoint
We will use the best model for the following process.

In [None]:
with open("/content/drive/MyDrive/models/data/src_vocab","rb") as f:
     src_vocab = pickle.load(f)

with open("/content/drive/MyDrive/models/data/trg_vocab","rb") as f:
     trg_vocab = pickle.load(f)

Load trained model to `model_best` and put model on device.

In [None]:
# INPUT_DIM = len(SRC_saved.vocab)
# OUTPUT_DIM = len(TRG_saved.vocab)
INPUT_DIM = len(src_vocab) # tokens in source vocabulary
OUTPUT_DIM = len(trg_vocab) # tokens in target vocabulary
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.3
N_LAYERS = 1
LEARNING_RT = 0.001
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, DEC_HID_DIM, N_LAYERS, DEC_DROPOUT)

model_best = Seq2Seq(enc, dec, device)

In [None]:
model_best.load_state_dict(torch.load('/content/drive/MyDrive/models/model_result/uni_LSTM/seq2seq_9.pt')['state_dict'])
model_best = model_best.to(device)

In [None]:
# ### INSERT YOUR REFACTORED INFERENCE FUNCTION HERE ###

# def inference(model_best, trg_vocab, test_iter, attention=False, max_trg_len = 64):
#     ### INSERT REFACTORED CODE HERE ###
#     return corpus_bleu_score

In [None]:
def inference(model, trg_vocab, test_iter, attention=False, max_trg_len=64):
    from nltk.translate.bleu_score import corpus_bleu

    def convert_itos(convert_vocab, token_ids):
        # look up the vocabulary tables once instead of once per token
        itos = convert_vocab.get_itos()
        eos_idx = convert_vocab['<eos>']
        list_string = []
        for i in token_ids:
            if i == eos_idx:
                break
            list_string.append(itos[i])
        return list_string

    model.eval()
    all_trg = []
    all_translated_trg = []

    TRG_PAD_IDX = trg_vocab['<pad>']

    with torch.no_grad():
        for i, (src, trg) in enumerate(test_iter):
            src, trg = src.to(device), trg.to(device)
            batch_size = trg.shape[1]
            # dummy target filled with <pad>; with teacher forcing off,
            # only its shape and its first row are used by the model
            trg_placeholder = torch.full(
                (max_trg_len, batch_size), TRG_PAD_IDX, dtype=torch.long, device=device
            )
            if attention:
                output, _ = model(src, trg_placeholder, 0)
            else:
                output = model(src, trg_placeholder, 0)
            output_translate = output[1:]
            all_trg.append(trg[1:].cpu())

            prob, token_id = output_translate.data.topk(1)
            translation_token_id = token_id.squeeze(2).cpu()
            all_translated_trg.append(translation_token_id)

    all_gold_text = []
    all_translated_text = []
    for i in range(len(all_trg)):
        cur_gold = all_trg[i]
        cur_translation = all_translated_trg[i]
        for j in range(cur_gold.shape[1]):
            gold_converted_strings = convert_itos(trg_vocab, cur_gold[:, j])
            trans_converted_strings = convert_itos(trg_vocab, cur_translation[:, j])
            all_gold_text.append(gold_converted_strings)
            all_translated_text.append(trans_converted_strings)

    corpus_all_gold_text = [[item] for item in all_gold_text]
    corpus_bleu_score = corpus_bleu(corpus_all_gold_text, all_translated_text)
    return corpus_bleu_score


In [None]:
print(inference(model_best, trg_vocab, test_iter, attention=False, max_trg_len=64))

0.2482460803690209
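
To give some intuition for the score above, the sketch below computes a simplified, sentence-level BLEU in plain Python: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is not NLTK's `corpus_bleu`, which aggregates counts over the whole corpus and by default uses n-grams up to length 4; the function names, `max_n=2`, and the example sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision: each hypothesis n-gram is credited at most
    as many times as it occurs in the reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


def toy_bleu(reference, hypothesis, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [modified_precision(reference, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Penalize hypotheses that are shorter than the reference.
    ref_len, hyp_len = len(reference), len(hypothesis)
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_mean)


reference = "the cat is on the mat".split()
print(modified_precision(reference, "the the the the the the the".split(), 1))
print(toy_bleu(reference, reference))
print(toy_bleu(reference, "the cat is on a mat".split()))
```

Clipping is what stops a degenerate output such as "the the the ..." from scoring well, and the brevity penalty stops very short outputs from doing so.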


## Reference 
* https://pytorch.org/docs/stable/nn.html
* https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb
* https://arxiv.org/abs/1409.3215
* https://github.com/graviraja/seq2seq
* https://github.com/eladhoffer/seq2seq.pytorch
* https://github.com/spro/practical-pytorch/tree/master/seq2seq-translation
* http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/
* https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
* https://leimao.github.io/blog/Entropy-Perplexity/