In [1]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
os.chdir("/content/drive/My Drive/Colab Notebooks/VietAI/FoDL2/RNN/torch_tutorial")



# NLP From Scratch: Translation with a Sequence to Sequence Network and Attention
**Author**: [Sean Robertson](https://github.com/spro)

This is the third and final tutorial on doing "NLP From Scratch", where we
write our own classes and functions to preprocess the data to do our NLP
modeling tasks. We hope after you complete this tutorial that you'll proceed to
learn how `torchtext` can handle much of this preprocessing for you in the
three tutorials immediately following this one.

In this project we will be teaching a neural network to translate from
French to English.


    [KEY: > input, = target, < output]

    > il est en train de peindre un tableau .
    = he is painting a picture .
    < he is painting a picture .

    > pourquoi ne pas essayer ce vin delicieux ?
    = why not try that delicious wine ?
    < why not try that delicious wine ?

    > elle n est pas poete mais romanciere .
    = she is not a poet but a novelist .
    < she not not a poet but a novelist .

    > vous etes trop maigre .
    = you re too skinny .
    < you re all alone .

... to varying degrees of success.

This is made possible by the simple but powerful idea of the [sequence
to sequence network](https://arxiv.org/abs/1409.3215), in which two
recurrent neural networks work together to transform one sequence to
another. An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.

.. figure:: /_static/img/seq-seq-images/seq2seq.png
   :alt:

To improve upon this model we'll use an [attention
mechanism](https://arxiv.org/abs/1409.0473), which lets the decoder
learn to focus over a specific range of the input sequence.

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  https://pytorch.org/ For installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user


It would also be useful to know about Sequence to Sequence networks and
how they work:

-  [Learning Phrase Representations using RNN Encoder-Decoder for
   Statistical Machine Translation](https://arxiv.org/abs/1406.1078)
-  [Sequence to Sequence Learning with Neural
   Networks](https://arxiv.org/abs/1409.3215)
-  [Neural Machine Translation by Jointly Learning to Align and
   Translate](https://arxiv.org/abs/1409.0473)
-  [A Neural Conversational Model](https://arxiv.org/abs/1506.05869)

You will also find the previous tutorials on
:doc:`/intermediate/char_rnn_classification_tutorial`
and :doc:`/intermediate/char_rnn_generation_tutorial`
helpful as those concepts are very similar to the Encoder and Decoder
models, respectively.

**Requirements**


In [34]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

import numpy as np
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Loading data files

The data for this project is a set of many thousands of English to
French translation pairs.

[This question on Open Data Stack
Exchange](https://opendata.stackexchange.com/questions/3888/dataset-of-sentences-translated-into-many-languages)
pointed me to the open translation site https://tatoeba.org/ which has
downloads available at https://tatoeba.org/eng/downloads - and better
yet, someone did the extra work of splitting language pairs into
individual text files here: https://www.manythings.org/anki/

The English to French pairs are too big to include in the repository, so
download to ``data/eng-fra.txt`` before continuing. The file is a tab
separated list of translation pairs:


    I am cold.    J'ai froid.

**Note:** Download the data from
[here](https://download.pytorch.org/tutorial/data.zip)
and extract it to the current directory.



Similar to the character encoding used in the character-level RNN
tutorials, we will be representing each word in a language as a one-hot
vector, or giant vector of zeros except for a single one (at the index
of the word). Compared to the dozens of characters that might exist in a
language, there are many, many more words, so the encoding vector is much
larger. We will, however, cheat a bit and trim the data to only use a few
thousand words per language.
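
A minimal sketch of the one-hot idea, using a hypothetical five-word vocabulary. The names `toy_vocab` and `one_hot` are illustrative and are not part of the tutorial code:

```python
# Hypothetical toy vocabulary; the real vocabularies here hold a few thousand words.
toy_vocab = ["SOS", "EOS", "i", "am", "cold"]
word2index = {word: i for i, word in enumerate(toy_vocab)}

def one_hot(word):
    """Return a list of zeros with a single 1 at the word's index."""
    vec = [0] * len(toy_vocab)
    vec[word2index[word]] = 1
    return vec

print(one_hot("cold"))  # [0, 0, 0, 0, 1]
```

In practice the networks never build these vectors explicitly: an embedding layer looks up a row by word index, which gives the same result as multiplying the one-hot vector by the embedding matrix.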

.. figure:: /_static/img/seq-seq-images/word-encoding.png
   :alt:





We'll need a unique index per word to use as the inputs and targets of
the networks later. To keep track of all this we will use a helper class
called ``Lang`` which has word → index (``word2index``) and index → word
(``index2word``) dictionaries, as well as a count of each word
``word2count`` which will be used to replace rare words later.




In [35]:
SOS_token = 0  # Start of Sentence token
EOS_token = 1  # End of Sentence token

class Lang:
    def __init__(self, name):
        self.name = name  # Initialize the language name
        self.word2index = {}  # A dictionary to map words to their corresponding indexes
        self.word2count = {}  # A dictionary to store word frequencies
        self.index2word = {0: "SOS", 1: "EOS"}  # A dictionary to map indexes to words, including special tokens
        self.n_words = 2  # Count of unique words in the language, initialized to 2 for SOS and EOS tokens

    def addSentence(self, sentence):
        # This method takes a sentence as input and adds its words to the language representation
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        # This method adds a word to the language representation
        if word not in self.word2index:
            # If the word is not already in the vocabulary, assign it a new index
            self.word2index[word] = self.n_words
            self.word2count[word] = 1  # Initialize its frequency to 1
            self.index2word[self.n_words] = word  # Update the reverse mapping
            self.n_words += 1  # Increment the word count
        else:
            # If the word is already in the vocabulary, increase its frequency
            self.word2count[word] += 1


The files are all in Unicode. To simplify, we will turn Unicode
characters to ASCII, make everything lowercase, and trim most
punctuation.
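
These steps can be traced on a single sentence. The sketch below applies them one at a time to a made-up example string so the intermediate results are visible; the regular expressions are the ones used in this notebook's `normalizeString`:

```python
import re
import unicodedata

raw = "Elle n'est pas poète, mais romancière."

# Step 1: lowercase and trim surrounding whitespace.
step1 = raw.lower().strip()

# Step 2: decompose accented characters (NFD) and drop the combining marks ("Mn").
step2 = "".join(
    c for c in unicodedata.normalize("NFD", step1)
    if unicodedata.category(c) != "Mn"
)

# Step 3: put a space in front of ., ! and ?
step3 = re.sub(r"([.!?])", r" \1", step2)

# Step 4: collapse every run of characters other than letters, ! and ? into one space.
step4 = re.sub(r"[^a-zA-Z!?]+", " ", step3).strip()

print(step4)  # elle n est pas poete mais romanciere
```

Note that the period disappears in step 4 because `.` is not in the set of kept characters, while `!` and `?` survive as separate tokens.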




In [36]:
# Import the unicodedata module for Unicode character handling
import unicodedata

# Import the re module for regular expressions
import re

# Function to convert a Unicode string to plain ASCII
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Function to lowercase, trim, and remove non-letter characters
def normalizeString(s):
    # Convert Unicode characters to ASCII
    s = unicodeToAscii(s.lower().strip())

    # Separate punctuation marks (., !, ?) from words with spaces
    s = re.sub(r"([.!?])", r" \1", s)

    # Replace each run of characters other than letters, ! and ? with a single space
    # (this also removes the periods that were spaced out above)
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)

    return s.strip()  # Remove leading and trailing spaces


To read the data file we will split the file into lines, and then split
lines into pairs. The files are all English → Other Language, so to
translate from Other Language → English we add a ``reverse`` flag that
reverses the pairs.
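
As a small illustration of the splitting and reversing, here is a sketch that works on an in-memory string instead of the data file. The two sample lines are made up; they only mimic the tab-separated layout:

```python
# Two hypothetical lines in the "English<TAB>French" layout of the data file.
sample_text = "I am cold.\tJ'ai froid.\nShe is here.\tElle est ici."

# Split the text into lines, then each line into an [english, french] pair.
sample_pairs = [line.split("\t") for line in sample_text.strip().split("\n")]

# Reversing each pair turns English -> French data into French -> English data.
reversed_pairs = [list(reversed(pair)) for pair in sample_pairs]

print(sample_pairs[0])    # ['I am cold.', "J'ai froid."]
print(reversed_pairs[0])  # ["J'ai froid.", 'I am cold.']
```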




In [37]:
# Function to read language pairs from a file
def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split it into lines (the with-block closes the file afterwards)
    with open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and normalize each element
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs if specified, and create instances of the Lang class
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs


Since there are a *lot* of example sentences and we want to train
something quickly, we'll trim the data set to only relatively short and
simple sentences. Here the maximum length is 10 tokens: the filter keeps
sentences with fewer than 10 words (counting any ending punctuation that
survives normalization), which leaves room for the EOS token added later.
We also filter to sentences that translate to the form "I am" or "He is"
etc. (accounting for apostrophes replaced earlier).
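
The filter can be tried on a few hand-written pairs. This is a sketch with a shortened prefix list and invented sentences; `keep` is a hypothetical name for the same test that the notebook's `filterPair` performs:

```python
MAX_LEN = 10  # same value as the tutorial's MAX_LENGTH
prefixes = ("i am ", "i m ", "he is", "he s ")  # shortened, illustrative subset

def keep(pair):
    """Keep a [french, english] pair if both sides are short and the
    English side starts with one of the prefixes."""
    french, english = pair
    return (
        len(french.split(" ")) < MAX_LEN
        and len(english.split(" ")) < MAX_LEN
        and english.startswith(prefixes)  # str.startswith accepts a tuple
    )

candidates = [
    ["j ai froid", "i am cold"],                             # kept
    ["le chat est noir", "the cat is black"],                # dropped: no matching prefix
    ["il est " + "tres " * 9 + "grand", "he is very tall"],  # dropped: French side too long
]
print([keep(p) for p in candidates])  # [True, False, False]
```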




In [38]:
MAX_LENGTH = 10  # Maximum sentence length allowed

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

# Function to filter a single language pair
def filterPair(p):
    # Check if the lengths of both sentences are less than MAX_LENGTH
    # and if the second sentence starts with one of the specified English prefixes
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)

# Function to filter a list of language pairs
def filterPairs(pairs):
    # Use a list comprehension to filter pairs using the filterPair function
    return [pair for pair in pairs if filterPair(pair)]


The full process for preparing the data is:

-  Read text file and split into lines, split lines into pairs
-  Normalize text, filter by length and content
-  Make word lists from sentences in pairs




In [39]:
import random
# Function to prepare language pairs for machine translation
def prepareData(lang1, lang2, reverse=False):
    # Call the readLangs function to read language pairs from a file
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)

    # Print the number of sentence pairs read from the file
    print("Read %s sentence pairs" % len(pairs))

    # Filter the pairs using the filterPairs function
    pairs = filterPairs(pairs)

    # Print the number of sentence pairs after filtering
    print("Trimmed to %s sentence pairs" % len(pairs))

    # Count the unique words in the input and output languages
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])

    # Print the vocabulary sizes of both languages
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)

    return input_lang, output_lang, pairs

# Call the prepareData function with 'eng' and 'fra' languages, and reverse the pairs
input_lang, output_lang, pairs = prepareData('eng', 'fra', True)

# Print a random pair from the prepared dataset
print(random.choice(pairs))


Reading lines...
Read 135842 sentence pairs
Trimmed to 11445 sentence pairs
Counting words...
Counted words:
fra 4601
eng 2991
['ils sont tous dingues', 'they re all nuts']


## The Seq2Seq Model

A Recurrent Neural Network, or RNN, is a network that operates on a
sequence and uses its own output as input for subsequent steps.

A [Sequence to Sequence network](https://arxiv.org/abs/1409.3215), or
seq2seq network, or [Encoder Decoder
network](https://arxiv.org/pdf/1406.1078v3.pdf), is a model
consisting of two RNNs called the encoder and decoder. The encoder reads
an input sequence and outputs a single vector, and the decoder reads
that vector to produce an output sequence.

.. figure:: /_static/img/seq-seq-images/seq2seq.png
   :alt:

Unlike sequence prediction with a single RNN, where every input
corresponds to an output, the seq2seq model frees us from sequence
length and order, which makes it ideal for translation between two
languages.

Consider the sentence ``Je ne suis pas le chat noir`` → ``I am not the
black cat``. Most of the words in the input sentence have a direct
translation in the output sentence, but are in slightly different
orders, e.g. ``chat noir`` and ``black cat``. Because of the ``ne/pas``
construction there is also one more word in the input sentence. It would
be difficult to produce a correct translation directly from the sequence
of input words.

With a seq2seq model the encoder creates a single vector which, in the
ideal case, encodes the "meaning" of the input sequence: a single point
in some N-dimensional space of sentences.




### The Encoder

The encoder of a seq2seq network is an RNN that outputs some value for
every word from the input sentence. For every input word the encoder
outputs a vector and a hidden state, and uses the hidden state for the
next input word.
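
The way the hidden state is carried from word to word can be sketched without PyTorch. The toy recurrence below is a plain scalar RNN, not a GRU, and its weights and inputs are made-up numbers; it only illustrates the "output plus hidden state per word" pattern:

```python
import math

def toy_rnn_step(x, hidden, w_in=0.5, w_hid=0.8):
    """One step of a toy scalar RNN (not a GRU): returns (output, new_hidden)."""
    new_hidden = math.tanh(w_in * x + w_hid * hidden)
    # For a plain RNN the step output is the new hidden state itself.
    return new_hidden, new_hidden

hidden = 0.0  # initial hidden state
outputs = []
for x in [1.0, -2.0, 0.5]:  # stand-ins for three embedded input words
    output, hidden = toy_rnn_step(x, hidden)
    outputs.append(output)

# One output per input word, plus the final hidden state.
print(len(outputs), outputs[-1] == hidden)  # 3 True
```

The same relationship holds for a single-layer, one-directional GRU: its output at the last time step equals its final hidden state.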

.. figure:: /_static/img/seq-seq-images/encoder-network.png
   :alt:





In [40]:
# Import necessary libraries
import torch.nn as nn

# Define the EncoderRNN class, which is a neural network module
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        # Initialize the EncoderRNN class
        super(EncoderRNN, self).__init__()

        # Store the hidden size for later use
        self.hidden_size = hidden_size

        # Create an embedding layer
        self.embedding = nn.Embedding(input_size, hidden_size)

        # Create a GRU (Gated Recurrent Unit) layer
        # This layer takes hidden_size input units and produces hidden_size output units
        # batch_first=True means the input should have shape (batch_size, sequence_length, input_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

        # Create a dropout layer to prevent overfitting
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        # Pass the input through the embedding layer and apply dropout
        embedded = self.dropout(self.embedding(input))

        # Pass the embedded input through the GRU layer
        # The GRU returns two values: output and hidden state
        output, hidden = self.gru(embedded)

        # Return the output and hidden state
        return output, hidden


### The Decoder

The decoder is another RNN that takes the encoder output vector(s) and
outputs a sequence of words to create the translation.




#### Simple Decoder

In the simplest seq2seq decoder we use only the last output of the encoder.
This last output is sometimes called the *context vector* as it encodes
context from the entire sequence. This context vector is used as the
initial hidden state of the decoder.

At every step of decoding, the decoder is given an input token and
hidden state. The initial input token is the start-of-string ``<SOS>``
token, and the first hidden state is the context vector (the encoder's
last hidden state).
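
The decoding loop itself can be sketched in plain Python. Here `toy_decoder_step` is a hypothetical stand-in for one decoder step with a hard-coded four-word vocabulary; a real decoder would run an embedding, a GRU and a linear layer instead:

```python
SOS, EOS = 0, 1
MAX_STEPS = 10

# Hypothetical stand-in for one decoder step: maps (input token, hidden) to
# (scores over a 4-word vocabulary, new hidden).
def toy_decoder_step(token, hidden):
    next_token = {0: 2, 2: 3, 3: 1}[token]  # SOS -> 2 -> 3 -> EOS
    scores = [1.0 if i == next_token else 0.0 for i in range(4)]
    return scores, hidden + 1

token, hidden = SOS, 0  # hidden would be the encoder's last hidden state
decoded = []
for _ in range(MAX_STEPS):
    scores, hidden = toy_decoder_step(token, hidden)
    # Greedy choice: feed the best-scoring word back in as the next input.
    token = max(range(len(scores)), key=scores.__getitem__)
    if token == EOS:
        break
    decoded.append(token)

print(decoded)  # [2, 3]
```

Unlike this sketch, the `forward` method of the decoder in this notebook always runs for `MAX_LENGTH` steps and does not stop early at EOS.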

.. figure:: /_static/img/seq-seq-images/decoder-network.png
   :alt:





In [41]:

# Define the DecoderRNN class, which is a neural network module
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        # Initialize the DecoderRNN class
        super(DecoderRNN, self).__init__()

        # Create an embedding layer
        self.embedding = nn.Embedding(output_size, hidden_size)

        # Create a GRU (Gated Recurrent Unit) layer
        # Similar to the EncoderRNN, this GRU layer takes hidden_size input units and produces hidden_size output units
        # batch_first=True means the input should have shape (batch_size, sequence_length, input_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

        # Create a linear (fully connected) layer to produce output
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        batch_size = encoder_outputs.size(0)

        # Initialize the decoder input with a placeholder (SOS_token)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_token)

        # Initialize the decoder hidden state with the encoder's final hidden state
        decoder_hidden = encoder_hidden

        # Initialize a list to store decoder outputs at each time step
        decoder_outputs = []

        for i in range(MAX_LENGTH):
            # Perform a single decoding step
            decoder_output, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)

            # Append the decoder output to the list of outputs
            decoder_outputs.append(decoder_output)

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1)  # Teacher forcing
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        # Concatenate the decoder outputs along the sequence length dimension
        decoder_outputs = torch.cat(decoder_outputs, dim=1)

        # Apply log softmax to the concatenated outputs
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)

        # Return decoder outputs, decoder hidden state, and None (for consistency in the training loop)
        return decoder_outputs, decoder_hidden, None

    def forward_step(self, input, hidden):
        # Pass the input through the embedding layer
        output = self.embedding(input)

        # Apply ReLU activation function
        output = F.relu(output)

        # Pass the output through the GRU layer and get new hidden states
        output, hidden = self.gru(output, hidden)

        # Pass the GRU output through the linear layer to get the decoder's output
        output = self.out(output)

        # Return the output and updated hidden states
        return output, hidden


I encourage you to train and observe the results of this model, but to
save space we'll be going straight for the gold and introducing the
Attention Mechanism.




#### Attention Decoder

If only the context vector is passed between the encoder and decoder,
that single vector carries the burden of encoding the entire sentence.

Attention allows the decoder network to "focus" on a different part of
the encoder's outputs for every step of the decoder's own outputs. First
we calculate a set of *attention weights*. These will be multiplied by
the encoder output vectors to create a weighted combination. The result
(called ``context`` in the code) should contain information about
that specific part of the input sequence, and thus help the decoder
choose the right output words.
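
A minimal numeric sketch of that weighted combination, in plain Python. The encoder vectors and alignment scores are made-up numbers; only the softmax-then-weighted-sum pattern is the point:

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical encoder output vectors (one per input word), 2 dimensions each.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
scores = [0.0, 0.0, math.log(2.0)]  # made-up alignment scores

weights = softmax(scores)  # approximately [0.25, 0.25, 0.5]

# Weighted sum of the encoder vectors, one dimension at a time.
context = [
    sum(w * vec[d] for w, vec in zip(weights, encoder_outputs))
    for d in range(2)
]
print(weights, context)
```

The third input position has the largest score, so it contributes the most to the context vector.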

.. figure:: https://i.imgur.com/1152PYf.png
   :alt:

Calculating the attention weights is done with a small feed-forward
network (the ``BahdanauAttention`` module in the code), using the
decoder's hidden state as the query and the encoder outputs as the keys.
Because every training batch is padded to a fixed maximum sentence
length, each decoding step produces one weight per input position up to
that length. No padding mask is applied in this notebook's code, so the
padded positions of shorter sentences receive weights as well.

.. figure:: /_static/img/seq-seq-images/attention-decoder-network.png
   :alt:


Bahdanau attention, also known as additive attention, is a commonly used
attention mechanism in sequence-to-sequence models, particularly in neural
machine translation tasks. It was introduced by Bahdanau et al. in their
paper titled [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf).
This attention mechanism employs a learned alignment model to compute attention
scores between the encoder and decoder hidden states. It utilizes a feed-forward
neural network to calculate alignment scores.

However, there are alternative attention mechanisms available, such as Luong attention,
which in its simplest ("dot") form computes attention scores by taking the dot product
between the decoder hidden state and the encoder hidden states. That form does not involve
the non-linear transformation used in Bahdanau attention.

In this tutorial, we will be using Bahdanau attention. However, it would be a valuable
exercise to explore modifying the attention mechanism to use Luong attention.
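
Written out, the two scoring rules can be sketched as follows. Here s is the decoder hidden state (the query), h_i is the i-th encoder output (a key), and the symbols follow the notation of the Bahdanau et al. paper. The bias terms carried by the ``nn.Linear`` layers in this notebook's code are omitted for readability:

```latex
\begin{aligned}
e_{t,i} &= v_a^{\top} \tanh\left(W_a s_{t-1} + U_a h_i\right) && \text{(additive, Bahdanau)} \\
e_{t,i} &= s_t^{\top} h_i && \text{(dot product, Luong)} \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}, \qquad c_t = \sum_{i} \alpha_{t,i} h_i
\end{aligned}
```

In both cases the scores are normalized with a softmax into weights, and the context vector is the weighted sum of the encoder outputs.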



In [42]:
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        # Learnable linear transformations for query, keys, and attention scores
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    def forward(self, query, keys):
        # Calculate attention scores using query and keys
        scores = self.Va(torch.tanh(self.Wa(query) + self.Ua(keys)))
        scores = scores.squeeze(2).unsqueeze(1)  # Adjust the shape

        # Apply softmax to obtain attention weights
        weights = F.softmax(scores, dim=-1)

        # Calculate context vector by weighted sum of keys
        context = torch.bmm(weights, keys)

        return context, weights

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        # Create an embedding layer
        self.embedding = nn.Embedding(output_size, hidden_size)

        # Create the Bahdanau attention mechanism
        self.attention = BahdanauAttention(hidden_size)

        # Create a GRU layer with input size of 2 * hidden_size
        self.gru = nn.GRU(2 * hidden_size, hidden_size, batch_first=True)

        # Create an output linear layer
        self.out = nn.Linear(hidden_size, output_size)

        # Create a dropout layer for regularization
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None):
        # Initialize variables and tensors
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_token)
        decoder_hidden = encoder_hidden
        decoder_outputs = []
        attentions = []

        for i in range(MAX_LENGTH):
            # Perform a single decoding step
            decoder_output, decoder_hidden, attn_weights = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )

            # Append decoder output and attention weights to lists
            decoder_outputs.append(decoder_output)
            attentions.append(attn_weights)

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1)  # Teacher forcing
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        # Concatenate decoder outputs and apply log softmax
        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)

        # Concatenate attention weights
        attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, decoder_hidden, attentions

    def forward_step(self, input, hidden, encoder_outputs):
        # Pass input through embedding layer with dropout
        embedded = self.dropout(self.embedding(input))

        # Permute hidden state to match the dimensions for attention
        query = hidden.permute(1, 0, 2)

        # Calculate attention context and attention weights
        context, attn_weights = self.attention(query, encoder_outputs)

        # Concatenate embedded input and context vector
        input_gru = torch.cat((embedded, context), dim=2)

        # Pass through the GRU layer to get output and updated hidden state
        output, hidden = self.gru(input_gru, hidden)

        # Pass through the output linear layer
        output = self.out(output)

        return output, hidden, attn_weights


<div class="alert alert-info"><h4>Note</h4><p>There are other forms of attention that work around the length
  limitation by using a relative position approach. Read about "local
  attention" in [Effective Approaches to Attention-based Neural Machine
  Translation](https://arxiv.org/abs/1508.04025).</p></div>

## Training

### Preparing Training Data

To train, for each pair we will need an input tensor (indexes of the
words in the input sentence) and target tensor (indexes of the words in
the target sentence). While creating these vectors we will append the
EOS token to both sequences.
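
A small sketch of that conversion with a hypothetical three-word vocabulary. It also right-pads with zeros to a fixed length, as the `get_dataloader` function in this notebook does for its batches (note that index 0 doubles as the SOS index there):

```python
EOS = 1
MAX_LEN = 10  # same value as the tutorial's MAX_LENGTH

# Hypothetical vocabulary; indexes 0 and 1 are reserved for SOS and EOS.
toy_word2index = {"je": 2, "suis": 3, "fatigue": 4}

def sentence_to_padded_ids(sentence):
    """Map words to indexes, append EOS, then right-pad with zeros."""
    ids = [toy_word2index[w] for w in sentence.split(" ")]
    ids.append(EOS)
    return ids + [0] * (MAX_LEN - len(ids))

print(sentence_to_padded_ids("je suis fatigue"))
# [2, 3, 4, 1, 0, 0, 0, 0, 0, 0]
```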




In [43]:

# Function to convert a sentence to a list of word indexes using the provided language object.
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

# Function to convert a sentence to a PyTorch tensor of word indexes.
def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(1, -1)

# Function to convert a pair of input and target sentences to PyTorch tensors.
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)

# Function to prepare data and create a DataLoader for training.
def get_dataloader(batch_size):
    # Read and filter the sentence pairs with prepareData, which also builds the input and output languages.
    input_lang, output_lang, pairs = prepareData('eng', 'fra', True)

    # Initialize numpy arrays to store input and target sentence indexes.
    n = len(pairs)
    input_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)
    target_ids = np.zeros((n, MAX_LENGTH), dtype=np.int32)

    # Loop through pairs and convert sentences to indexes, adding EOS tokens.
    for idx, (inp, tgt) in enumerate(pairs):
        inp_ids = indexesFromSentence(input_lang, inp)
        tgt_ids = indexesFromSentence(output_lang, tgt)
        inp_ids.append(EOS_token)
        tgt_ids.append(EOS_token)
        input_ids[idx, :len(inp_ids)] = inp_ids
        target_ids[idx, :len(tgt_ids)] = tgt_ids

    # Create a PyTorch TensorDataset from the indexed input and target data.
    train_data = TensorDataset(torch.LongTensor(input_ids).to(device),
                               torch.LongTensor(target_ids).to(device))

    # Create a random sampler and DataLoader for training.
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

    # Return the input and output languages along with the DataLoader.
    return input_lang, output_lang, train_dataloader


### Training the Model

To train we run the input sentence through the encoder, and keep track
of every output and the latest hidden state. Then the decoder is given
the ``<SOS>`` token as its first input, and the last hidden state of the
encoder as its first hidden state.

"Teacher forcing" is the concept of using the real target outputs as
each next input, instead of using the decoder's guess as the next input.
Using teacher forcing causes it to converge faster but [when the trained
network is exploited, it may exhibit
instability](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.4095&rep=rep1&type=pdf).

You can observe outputs of teacher-forced networks that read with
coherent grammar but wander far from the correct translation -
intuitively it has learned to represent the output grammar and can "pick
up" the meaning once the teacher tells it the first few words, but it
has not properly learned how to create the sentence from the translation
in the first place.

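Because of the freedom PyTorch's autograd gives us, we can randomly
choose to use teacher forcing or not with a simple if statement. Note
that the decoders in this notebook do not expose a
``teacher_forcing_ratio`` setting: they use teacher forcing whenever
``target_tensor`` is passed and feed back their own predictions otherwise.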
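
A minimal sketch of mixing the two modes with a probability, here called `teacher_forcing_ratio`. The helper `choose_next_input` is hypothetical and is not part of the decoder code in this notebook, which uses teacher forcing whenever a target tensor is supplied:

```python
import random

def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng):
    """Pick the decoder's next input: the true target with probability
    teacher_forcing_ratio, otherwise the decoder's own prediction."""
    if rng.random() < teacher_forcing_ratio:
        return target_token
    return predicted_token

rng = random.Random(0)  # seeded so the sketch is repeatable
always = [choose_next_input("target", "guess", 1.0, rng) for _ in range(5)]
never = [choose_next_input("target", "guess", 0.0, rng) for _ in range(5)]
print(always, never)
```

A ratio of 1.0 always feeds the target and a ratio of 0.0 always feeds the prediction; values in between mix the two.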


Here's a summary of what this training function does:

- It takes as input a DataLoader (`dataloader`) for training data, an encoder model (`encoder`), a decoder model (`decoder`), optimizer objects for both the encoder and decoder (`encoder_optimizer` and `decoder_optimizer`), and a loss criterion (`criterion`).

- It initializes a variable `total_loss` to accumulate the loss over all batches in this epoch.

- It iterates through the data batches in the DataLoader, where each batch contains input and target tensors.

- For each batch:
  - Resets the gradients of both the encoder and decoder to zero.

  - Passes the input tensor through the encoder to get encoder outputs and hidden state.

  - Passes the encoder outputs, encoder hidden state, and target tensor through the decoder to obtain decoder outputs.

  - Calculates the loss between the decoder outputs (reshaped for the loss calculation) and the target tensor.

  - Performs backpropagation by computing gradients for both the encoder and decoder.

  - Updates the parameters of both the encoder and decoder using their respective optimizers.

  - Accumulates the loss for this batch in the `total_loss` variable.

- After processing all batches, it calculates and returns the average loss for this epoch by dividing the `total_loss` by the number of batches in the DataLoader.

This function is designed to train a sequence-to-sequence model using a DataLoader of training data. It computes the loss and performs parameter updates for both the encoder and decoder in each batch, ultimately returning the average loss for the entire epoch.
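
The loss step can be illustrated by hand. The sketch below computes a log-softmax over made-up scores and then the mean negative log-likelihood of the target indexes, which is the quantity `nn.NLLLoss` returns with its default mean reduction once the decoder outputs are flattened over batch and time steps. The helper names are illustrative:

```python
import math

def log_softmax(scores):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    top = max(scores)
    log_total = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_total for s in scores]

def nll_loss(log_probs_per_step, targets):
    """Mean negative log-likelihood of the target index at each step."""
    picked = [-lp[t] for lp, t in zip(log_probs_per_step, targets)]
    return sum(picked) / len(picked)

# Two decoding steps over a 3-word vocabulary (made-up scores).
step_scores = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]

loss = nll_loss([log_softmax(s) for s in step_scores], targets)
print(round(loss, 4))
```

The first step is fairly confident in the correct word and contributes a small loss; the second step is uniform over three words and contributes log(3).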




In [44]:
def train_epoch(dataloader, encoder, decoder, encoder_optimizer,
                decoder_optimizer, criterion):
    # Initialize the total loss for this epoch
    total_loss = 0

    # Iterate over the data batches in the DataLoader
    for data in dataloader:
        input_tensor, target_tensor = data

        # Zero the gradients of both the encoder and decoder
        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()

        # Pass the input tensor through the encoder to get encoder outputs and hidden state
        encoder_outputs, encoder_hidden = encoder(input_tensor)

        # Pass the encoder outputs, encoder hidden state, and target tensor through the decoder
        decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)

        # Calculate the loss between the decoder outputs and the target tensor
        loss = criterion(
            decoder_outputs.view(-1, decoder_outputs.size(-1)),
            target_tensor.view(-1)
        )

        # Perform backpropagation by computing gradients
        loss.backward()

        # Update the parameters of both the encoder and decoder
        encoder_optimizer.step()
        decoder_optimizer.step()

        # Accumulate the loss for this batch
        total_loss += loss.item()

    # Calculate and return the average loss for this epoch
    return total_loss / len(dataloader)


This is a helper function to print time elapsed and estimated time
remaining given the current time and progress %.
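
The estimate is a simple proportion: if a given fraction of the work took the elapsed time so far, the total is the elapsed time divided by that fraction. The sketch below uses the same arithmetic but takes the elapsed seconds as an argument so the result is repeatable; `as_minutes` and `eta_string` are illustrative names:

```python
import math

def as_minutes(seconds):
    """Format a number of seconds as 'Xm Ys'."""
    minutes = math.floor(seconds / 60)
    return "%dm %ds" % (minutes, seconds - minutes * 60)

def eta_string(elapsed_seconds, fraction_done):
    """Elapsed time followed by the estimated remaining time."""
    estimated_total = elapsed_seconds / fraction_done
    remaining = estimated_total - elapsed_seconds
    return "%s (- %s)" % (as_minutes(elapsed_seconds), as_minutes(remaining))

# 120 seconds for 25% of the work implies 480 seconds in total, so 360 remain.
print(eta_string(120, 0.25))  # 2m 0s (- 6m 0s)
```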




In [45]:
import time
import math

def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

The whole training process looks like this:

-  Start a timer
-  Initialize optimizers and criterion
-  Create set of training pairs
-  Start empty losses array for plotting

Then we call ``train_epoch`` once per epoch and occasionally print the
progress (% of epochs, time so far, estimated time remaining) and average loss.





      
Here's an overview of what this training loop does:

- It takes as input a DataLoader (`train_dataloader`) for training data, an encoder model (`encoder`), a decoder model (`decoder`), the number of training epochs (`n_epochs`), and optional hyperparameters like learning rate (`learning_rate`), print frequency (`print_every`), and plot frequency (`plot_every`).

- It initializes variables for tracking time and losses and sets up optimizers (Adam) and a loss criterion (Negative Log-Likelihood Loss).

- It iterates over the specified number of training epochs, where each epoch consists of calling the `train_epoch` function for training.

- Inside the epoch loop:
  - It calculates the loss for the current epoch using the `train_epoch` function and updates the total loss for printing and plotting.

  - It prints the average loss every `print_every` epochs to monitor training progress.

  - It stores the average loss for plotting every `plot_every` epochs.

- After all epochs are completed, it visualizes the training loss using the `showPlot` function.

This code defines the main training loop for training a sequence-to-sequence model and includes mechanisms for tracking and visualizing training progress.

In [46]:
def train(train_dataloader, encoder, decoder, n_epochs, learning_rate=0.001,
          print_every=100, plot_every=100):
    # Initialize variables for tracking and visualization
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    # Define optimizers and loss criterion
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()

    # Loop over the specified number of epochs
    for epoch in range(1, n_epochs + 1):
        # Train the model for one epoch using the provided DataLoader
        loss = train_epoch(train_dataloader, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)

        # Accumulate loss for printing and plotting
        print_loss_total += loss
        plot_loss_total += loss

        # Print progress every 'print_every' epochs
        if epoch % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, epoch / n_epochs),
                                        epoch, epoch / n_epochs * 100, print_loss_avg))

        # Store the average loss for plotting every 'plot_every' epochs
        if epoch % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    # Visualize the training loss
    showPlot(plot_losses)
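
The print and plot bookkeeping in `train` can be isolated into a few lines of plain Python. The sketch below is an illustration only: `windowed_averages` is a hypothetical helper, and the loss values are made up rather than taken from a real run.

```python
def windowed_averages(epoch_losses, every):
    # Hypothetical helper mirroring the print/plot bookkeeping in train:
    # accumulate per-epoch losses and emit their mean every `every` epochs
    averages = []
    running_total = 0.0
    for epoch, loss in enumerate(epoch_losses, start=1):
        running_total += loss
        if epoch % every == 0:
            averages.append(running_total / every)
            running_total = 0.0
    return averages

# Made-up per-epoch losses, not taken from a real run
losses = [4.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.25]
print(windowed_averages(losses, every=2))  # [3.0, 1.0, 0.5]
```

The seventh loss never appears in the output: epochs that do not fill a complete window are not reported, which is also what happens in `train` when `n_epochs` is not a multiple of `print_every` or `plot_every`.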


### Plotting results

Plotting is done with matplotlib, using the array of loss values
``plot_losses`` saved while training.




In [47]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np

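def showPlot(points):
    # plt.subplots() already creates the figure, so a separate
    # plt.figure() call would only leave an extra empty figure behind
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    ax.plot(points)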

## Evaluation

Evaluation is mostly the same as training, but there are no targets so
we simply feed the decoder's predictions back to itself for each step.
Every time it predicts a word we add it to the output string, and if it
predicts the EOS token we stop there. We also store the decoder's
attention outputs for display later.





Here's a summary of what this evaluation function does:

- It takes as input the encoder (`encoder`), decoder (`decoder`), a source sentence (`sentence`), and the input and output language objects (`input_lang` and `output_lang`).

- Within a `torch.no_grad()` block, it ensures that gradient computation is disabled, as we're only evaluating the model, not training it.

- It converts the input sentence to a PyTorch tensor using the `tensorFromSentence` function.

- It passes the input tensor through the encoder to obtain encoder outputs and the encoder's final hidden state.

- It then passes the encoder outputs and hidden state through the decoder to generate the output sequence. The decoder's attention weights (`decoder_attn`) are also returned.

- It selects the top predicted index for each output token and stores them in `decoded_ids`.

- It iterates through the predicted token indices, converting them back to words using the `index2word` mapping in `output_lang`.

- If the end-of-sequence token (`EOS_token`) is encountered, it appends '<EOS>' to the list and breaks the loop.

- Finally, it returns the list of decoded words and the attention weights (`decoder_attn`) for further analysis or visualization.

In [48]:
def evaluate(encoder, decoder, sentence, input_lang, output_lang):
    # Disable gradient computation since we're only evaluating
    with torch.no_grad():
        # Convert the input sentence to a PyTorch tensor
        input_tensor = tensorFromSentence(input_lang, sentence)

        # Pass the input tensor through the encoder to obtain encoder outputs and hidden state
        encoder_outputs, encoder_hidden = encoder(input_tensor)

        # Pass the encoder outputs and hidden state through the decoder
        decoder_outputs, decoder_hidden, decoder_attn = decoder(encoder_outputs, encoder_hidden)

        # Get the top predicted indices for each output token
        _, topi = decoder_outputs.topk(1)

        # Squeeze the top indices to remove unnecessary dimensions
        decoded_ids = topi.squeeze()

        # Initialize an empty list to store decoded words
        decoded_words = []

        # Iterate through the predicted token indices
        for idx in decoded_ids:
            # Check if the current index corresponds to the end-of-sequence token
            if idx.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            # Append the decoded word (using index2word mapping) to the list
            decoded_words.append(output_lang.index2word[idx.item()])

    # Return the list of decoded words and the attention weights (decoder_attn)
    return decoded_words, decoder_attn
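
The decoding steps at the end of `evaluate` can be tried on a hand-built tensor. In this sketch the vocabulary, the `EOS` index, and the scores are all hypothetical; only the `topk`, `squeeze`, and stop-at-EOS logic follows the function above.

```python
import torch

# Hypothetical toy vocabulary; index 1 plays the role of EOS_token
EOS = 1
index2word = {0: 'SOS', 1: 'EOS', 2: 'he', 3: 'is', 4: 'tall'}

# Stand-in for decoder_outputs with shape (batch=1, steps=5, vocab=5).
# Each step gets its highest score at the index we want to "predict".
wanted = [2, 3, 4, 1, 2]  # the final 2 comes after EOS and must be ignored
decoder_outputs = torch.full((1, 5, 5), -10.0)
for step, idx in enumerate(wanted):
    decoder_outputs[0, step, idx] = 0.0

# topk(1) returns (values, indices) along the last dimension
_, topi = decoder_outputs.topk(1)   # shape (1, 5, 1)
decoded_ids = topi.squeeze()        # shape (5,)

decoded_words = []
for idx in decoded_ids:
    if idx.item() == EOS:
        decoded_words.append('<EOS>')
        break
    decoded_words.append(index2word[idx.item()])

print(decoded_words)  # ['he', 'is', 'tall', '<EOS>']
```

Anything the decoder emits after the first EOS token is discarded, which is why the trailing `he` does not show up.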


We can evaluate random sentences from the training set and print out the
input, target, and output to make some subjective quality judgements:




In [49]:
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, _ = evaluate(encoder, decoder, pair[0], input_lang, output_lang)
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')

## Training and Evaluating

With all these helper functions in place (it looks like extra work, but
it makes it easier to run multiple experiments) we can actually
initialize a network and start training.

Remember that the input sentences were heavily filtered. For this small
dataset we can use relatively small networks of 128 hidden nodes and a
single GRU layer. In the run recorded below, 80 epochs took about 27
minutes and gave some reasonable results.

.. Note::
   If you run this notebook you can train, interrupt the kernel,
   evaluate, and continue training later. Comment out the lines where the
   encoder and decoder are initialized and run ``train`` again. Each call
   to ``train`` creates new Adam optimizers, so optimizer state is not
   carried over between calls.




In [50]:
hidden_size = 128
batch_size = 32

input_lang, output_lang, train_dataloader = get_dataloader(batch_size)

encoder = EncoderRNN(input_lang.n_words, hidden_size).to(device)
decoder = AttnDecoderRNN(hidden_size, output_lang.n_words).to(device)

train(train_dataloader, encoder, decoder, 80, print_every=5, plot_every=5)

Reading lines...
Read 135842 sentence pairs
Trimmed to 11445 sentence pairs
Counting words...
Counted words:
fra 4601
eng 2991
1m 45s (- 26m 28s) (5 6%) 1.5348
3m 30s (- 24m 31s) (10 12%) 0.6784
5m 12s (- 22m 35s) (15 18%) 0.3458
6m 54s (- 20m 43s) (20 25%) 0.1888
8m 36s (- 18m 56s) (25 31%) 0.1159
10m 18s (- 17m 10s) (30 37%) 0.0801
12m 0s (- 15m 26s) (35 43%) 0.0611
13m 41s (- 13m 41s) (40 50%) 0.0505
15m 24s (- 11m 58s) (45 56%) 0.0436
17m 9s (- 10m 17s) (50 62%) 0.0394
18m 52s (- 8m 34s) (55 68%) 0.0360
20m 35s (- 6m 51s) (60 75%) 0.0338
22m 17s (- 5m 8s) (65 81%) 0.0319
23m 59s (- 3m 25s) (70 87%) 0.0310
25m 41s (- 1m 42s) (75 93%) 0.0294
27m 23s (- 0m 0s) (80 100%) 0.0281


Before evaluating, switch the encoder and decoder to ``eval`` mode so that their dropout layers are disabled.
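To see what the mode switch changes, here is a minimal sketch using a standalone `nn.Dropout` layer. The dropout probability and the input are arbitrary choices for illustration and are not the values used by the models in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()            # training mode: elements are randomly zeroed
train_out = dropout(x)     # survivors are scaled by 1 / (1 - p) = 2.0

dropout.eval()             # evaluation mode: dropout is the identity
eval_out = dropout(x)

print(train_out)           # each value is either 0.0 or 2.0
print(eval_out)            # all ones, identical to the input
```

Calling `.eval()` on a module applies the same switch to every submodule, so one call per network is enough.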



In [51]:
encoder.eval()
decoder.eval()
evaluateRandomly(encoder, decoder)

> je suis desole de decevoir vos plans
= i m sorry to upset your plans
< i m sorry to upset your plans <EOS>

> je suis desole si c est une question idiote
= i m sorry if this is a stupid question
< i m sorry if this is a stupid question <EOS>

> c est un type assez connu
= he is a pretty great guy
< he is a pretty great guy <EOS>

> j essaye de sauver la vie de tom
= i m trying to save tom s life
< i m trying to save tom s life <EOS>

> ta cuisine me manquera
= i m going to miss your cooking
< i m going to miss your cooking after this <EOS>

> je me fais vieux
= i m getting old
< i m getting old <EOS>

> vous n etes pas invite
= you aren t invited
< you aren t invited <EOS>

> je crains que tom ne se perde
= i m afraid tom will get lost
< i m afraid tom will get lost <EOS>

> tu gagnes n est ce pas ?
= you re winning aren t you ?
< you re winning aren t you ? <EOS>

> il sait parler japonais
= he s able to speak japanese
< he s able to speak japanese <EOS>



### Visualizing Attention

A useful property of the attention mechanism is that its outputs are
highly interpretable. Because the attention weights are applied to
specific encoder outputs of the input sequence, we can see where the
network focuses most at each time step.
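
A small sketch makes "weighting encoder outputs" concrete. The sizes and the alignment scores below are made up, and no particular scoring function is assumed; the sketch only shows how a softmax over scores yields per-step weights and how those weights combine the encoder outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: 1 sentence, 4 input positions, hidden size 3
encoder_outputs = torch.randn(1, 4, 3)

# Made-up alignment scores for 2 decoding steps over the 4 input positions
scores = torch.tensor([[[5.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 4.0, 0.0]]])

# Softmax turns scores into weights that sum to 1 across input positions
attn_weights = F.softmax(scores, dim=-1)            # shape (1, 2, 4)

# Each context vector is a weighted sum of the encoder outputs
context = torch.bmm(attn_weights, encoder_outputs)  # shape (1, 2, 3)

print(attn_weights.sum(dim=-1))     # each row sums to 1
print(attn_weights.argmax(dim=-1))  # tensor([[0, 2]])
```

Each row of `attn_weights[0]` is one decoding step, and the largest entry in a row marks the input position that step attends to most.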

You could simply run ``plt.matshow(attentions)`` to see attention output
displayed as a matrix. For a better viewing experience we will do the
extra work of adding axes and labels:




In [52]:
def showAttention(input_sentence, output_words, attentions):
    # Create a heatmap visualization of attention weights
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.cpu().numpy(), cmap='bone')
    fig.colorbar(cax)

    # Place one tick per column/row first, then label the ticks; setting
    # labels without fixing the tick positions triggers a matplotlib warning
    x_labels = input_sentence.split(' ') + ['<EOS>']
    ax.set_xticks(list(range(len(x_labels))))
    ax.set_xticklabels(x_labels, rotation=90)
    ax.set_yticks(list(range(len(output_words))))
    ax.set_yticklabels(output_words)

    # Display the heatmap
    plt.show()


def evaluateAndShowAttention(input_sentence):
    # Evaluate the input sentence, get output words and attention weights
    output_words, attentions = evaluate(encoder, decoder, input_sentence, input_lang, output_lang)

    # Print the input sentence and the generated output
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))

    # Show the attention heatmap
    showAttention(input_sentence, output_words, attentions[0, :len(output_words), :])

evaluateAndShowAttention('il n est pas aussi grand que son pere')

evaluateAndShowAttention('je suis trop fatigue pour conduire')

evaluateAndShowAttention('je suis desole si c est une question idiote')

evaluateAndShowAttention('je suis reellement fiere de vous')

input = il n est pas aussi grand que son pere
output = he is not as tall as his father <EOS>
input = je suis trop fatigue pour conduire
output = i m too tired to drive it <EOS>




input = je suis desole si c est une question idiote
output = i m sorry if this is a stupid question <EOS>
input = je suis reellement fiere de vous
output = i m really proud of you <EOS>


## Exercises

-  Try with a different dataset

   -  Another language pair
   -  Human → Machine (e.g. IoT commands)
   -  Chat → Response
   -  Question → Answer

-  Replace the embeddings with pretrained word embeddings such as ``word2vec`` or
   ``GloVe``
-  Try with more layers, more hidden units, and more sentences. Compare
   the training time and results.
-  If you use a translation file where pairs have two of the same phrase
   (``I am test \t I am test``), you can use this as an autoencoder. Try
   this:

   -  Train as an autoencoder
   -  Save only the Encoder network
   -  Train a new Decoder for translation from there
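
For the pretrained-embeddings exercise, one possible starting point is `nn.Embedding.from_pretrained`. The sketch below is an illustration under stated assumptions: the vocabulary and the 3-dimensional vectors are invented, and real ``word2vec`` or ``GloVe`` vectors would be loaded from a file instead.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary and 3-dimensional "pretrained" vectors; real
# word2vec or GloVe vectors would be loaded from a file instead
word2index = {'SOS': 0, 'EOS': 1, 'he': 2, 'is': 3}
pretrained = {'he': [0.1, 0.2, 0.3], 'is': [0.4, 0.5, 0.6]}

embedding_dim = 3
weights = torch.zeros(len(word2index), embedding_dim)
for word, index in word2index.items():
    if word in pretrained:
        weights[index] = torch.tensor(pretrained[word])
    # words without a pretrained vector keep the zero initialisation here

# freeze=True keeps the vectors fixed; freeze=False would fine-tune them
embedding = nn.Embedding.from_pretrained(weights, freeze=True)

print(embedding(torch.tensor([2])))    # the vector stored for 'he'
print(embedding.weight.requires_grad)  # False
```

If the embedding size in the encoder is tied to `hidden_size`, as the constructor call in the training cell suggests, the pretrained vectors need the same dimension or a projection layer in between.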


