## Neural Machine Translation using Encoder-Decoder Architecture

The aim of this notebook is to implement a Neural Machine Translation (NMT) model using the basic [encoder-decoder](https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf) approach with the [Bahdanau attention mechanism](https://arxiv.org/pdf/1409.0473.pdf).

In [1]:
%%capture
#!mkdir MNT-Dataset
#!wget -P MNT-Dataset/ https://www.manythings.org/anki/spa-eng.zip
#!unzip MNT-Dataset/spa-eng.zip -d MNT-Dataset/

In [2]:
# import libraries
import torch
import spacy

import numpy as np
import pandas as pd
from torch import nn
import multiprocessing as mp

from typing import List
from tqdm.notebook import tqdm
from torch.nn.utils.rnn import pad_sequence
from torchmetrics.functional import bleu_score
from sklearn.model_selection import train_test_split


In [3]:
# load dataset
dataset = pd.read_table("MNT-Dataset/spa.txt", header=None, names=["english", "spanish", "ref"]).drop(labels=["ref"], axis=1)
print(dataset.shape)
dataset.english = dataset.english.str.lower()
dataset.spanish = dataset.spanish.str.lower()
dataset.head()

(139705, 2)


| | english | spanish |
|---|---|---|
| 0 | go. | ve. |
| 1 | go. | vete. |
| 2 | go. | vaya. |
| 3 | go. | váyase. |
| 4 | hi. | hola. |


In [4]:
# define a tokenizer using spacy
class Tokenizer:
    def __init__(self, language: str = None) -> None:
        """
        A simple tokenizer class that uses Spacy to tokenize text.

        Parameters:
        ------------
            language (str, optional): The language of the text to be tokenized. Defaults to None.
                Supported languages are 'sp' for Spanish and 'en' for English.
        """

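        if language == "sp":
            self.nlp = spacy.load("es_core_news_sm")  # load the Spanish Spacy model
        elif language == "en":
            self.nlp = spacy.load("en_core_web_sm")  # load the English Spacy model
        else:
            # fail early instead of raising AttributeError on the first call
            raise ValueError(f"Unsupported language: {language!r}. Use 'sp' or 'en'.")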

    def __call__(self, text: str) -> List[str]:
        """
        Tokenizes a given text using the Spacy tokenizer.

        Args:
            text (str): The text to be tokenized.

        Returns:
            A list of strings representing the tokens in the text.
        """

        return [w.text for w in self.nlp.tokenizer(text)]  # return the text tokens

In [5]:
# Now we define a language class that represents a language and its vocabulary
class Lang:
    def __init__(self, name:str, language:str="sp"):
        """
        A class for language preprocessing and encoding. It uses a tokenizer to split text into tokens, and encodes
        these tokens into integer values. It also provides methods to add sentences and words to the vocabulary, and to
        transform text into its encoded form.

        Parameters:
        -----------
        name : str
            A name for the language object.
        language : str, default='sp'
            The language of the text to process. Currently supported languages are 'sp' (Spanish) and 'en' (English).
        """
        
        self.name = name
        self.language = language
        self.word2index = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}
        self.word2count = {}
        self.index2word = {0: "<pad>", 1: "<start>", 2: "<end>", 3: "<unk>"}
        self.n_words = 4  # Count <pad>, <start>, <end> and <unk>
        self.tokenizer = Tokenizer(language)

    def addSentence(self, sentence:str):
        """
        Add a sentence to the vocabulary.

        Parameters:
        -----------
        sentence : str
            The sentence to add.
        """
        
        for word in self.tokenizer(sentence):
            self.addWord(word)

    def addWord(self, word:str):
        """
        Add a word to the vocabulary.

        Parameters:
        -----------
        word : str
            The word to add.
        """
        
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1

    def fit(self, dataset:List[str]):
        """
        Build the vocabulary from a dataset.

        Parameters:
        -----------
        dataset : list
            A list of sentences to add to the vocabulary.
        """
        
        for data in tqdm(dataset):
            self.addSentence(data)

    def transform(self, text:str, padding:bool=True):
        """
        Transform text into its encoded form.

        Parameters:
        -----------
        text : str
            The text to encode.
        padding : bool, default=True
            Whether to wrap the sequence in the <start> and <end> tokens.

        Returns:
        --------
        encoding : list
            A list of integers representing the encoded sequence.
        """

        tokens = self.tokenizer(text)
        if padding:
            tokens = ["<start>"] + tokens + ["<end>"]

        # unknown words map to the <unk> index
        encoding = [self.word2index.get(tk, self.word2index["<unk>"]) for tk in tokens]

        return encoding
    
    def inverse_transform(self, tokens:List):
        """
        Decodes the encoded sequence of integers using the vocabulary of the language.

        Parameters:
        -----------
            tokens: list
                The encoded sequence of integers to decode.

        Returns:
        --------
            str: The decoded sentence.
        """
        
        words = [self.index2word[tk] for tk in tokens]

        return " ".join(words)
    
    @staticmethod
    def right_padding_per_batch(batch: tuple):
        """
        Pads the sequence of tokens with 0s to match the sequence length per batch. 
        This method is passed as the collate_fn argument of the DataLoader class.

        Parameters:
        -----------
            batch: tuple 
                The sequence of tokens to pad.

        Returns:
        --------
            tuple: The padded sequence of tokens in the batch.
        """
    
        en_text_bs, sp_text_bs = [], []

        for en_text, sp_text in batch:
            en_text_bs.append(en_text)
            sp_text_bs.append(sp_text)

        en_text_bs = pad_sequence(en_text_bs, padding_value=0)
        sp_text_bs = pad_sequence(sp_text_bs, padding_value=0)

        return en_text_bs, sp_text_bs
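
To make the encoding step concrete, here is a minimal, dependency-free sketch of the same idea. It assumes a plain whitespace tokenizer instead of spaCy, and the names `build_vocab` and `encode` are hypothetical helpers used only for illustration; they are not part of the `Lang` class.

```python
from typing import Dict, List

SPECIALS = {"<pad>": 0, "<start>": 1, "<end>": 2, "<unk>": 3}


def build_vocab(sentences: List[str]) -> Dict[str, int]:
    """Map every whitespace-separated token to an integer index."""
    word2index = dict(SPECIALS)
    for sentence in sentences:
        for word in sentence.split():  # assumption: whitespace tokenizer, not spaCy
            if word not in word2index:
                word2index[word] = len(word2index)
    return word2index


def encode(sentence: str, word2index: Dict[str, int]) -> List[int]:
    """Wrap the sentence in <start>/<end> and replace unseen words by <unk>."""
    tokens = ["<start>"] + sentence.split() + ["<end>"]
    return [word2index.get(tk, word2index["<unk>"]) for tk in tokens]


vocab = build_vocab(["hola mundo", "hola tom"])
print(encode("hola tom", vocab))     # [1, 4, 6, 2]
print(encode("adiós mundo", vocab))  # [1, 3, 5, 2]
```

The second call shows why `<unk>` matters: "adiós" was never seen while building the vocabulary, so it is mapped to index 3 instead of raising a `KeyError`.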



# Data loader

We define a custom dataset that outputs the token indices of the Spanish and English sentences.
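
Sentences in a batch have different lengths, so they are right-padded with the `<pad>` index (0) before being stacked. The sketch below reproduces in plain Python the layout that `pad_sequence` returns with its default `batch_first=False`, that is `[max_len, batch_size]`. The helper `right_pad_time_major` is hypothetical and only meant as an illustration.

```python
from typing import List


def right_pad_time_major(batch: List[List[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad each sequence to the longest one and return rows indexed by time step."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in batch]
    # transpose from [batch_size, max_len] to [max_len, batch_size]
    return [list(step) for step in zip(*padded)]


batch = [[1, 7, 9, 2], [1, 5, 2]]
print(right_pad_time_major(batch))
# [[1, 1], [7, 5], [9, 2], [2, 0]]
```

Each inner list is one time step across the batch, which is the layout a GRU with `batch_first=False` expects.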

In [6]:
# Create custom dataset
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, dataset:pd.DataFrame, sp_lang:Lang=None, en_lang:Lang=None):
        """
        A PyTorch custom dataset for language translation.

        Parameters:
        ----------
        dataset : pandas DataFrame
            The dataset containing the English and Spanish sentences.
        sp_lang: Lang
            The language object for the Spanish language. Default None
        en_lang: Lang
            The language object for the English language. Default None
        """
    
        self.dataset = dataset

        if isinstance(sp_lang, Lang) and isinstance(en_lang, Lang):
            self.sp_lang = sp_lang
            self.en_lang = en_lang
            
        else:
            # Initialize language objects for Spanish and English
            self.sp_lang = Lang("sp", language="sp")
            self.sp_lang.fit(dataset.spanish)

            self.en_lang = Lang("en", language="en")
            self.en_lang.fit(dataset.english)

    def __len__(self):
        """
        Returns the number of samples in the dataset.

        Returns:
        -------
        int
            The number of samples in the dataset

        """
        
        return len(self.dataset)

    def __getitem__(self, idx):
        """
        Returns a sample from the dataset.

        Parameters:
        ----------
        idx : int
            The index of the sample to return.

        Returns:
        -------
        tuple of torch.Tensor
            The English sentence and the Spanish sentence as tensors.

        """
        
        # Get the Spanish and English sentences from the dataset
        # (positional lookup avoids converting the whole column to a list on every call)
        sp_text = self.dataset.spanish.iloc[idx]
        en_text = self.dataset.english.iloc[idx]

        # Transform the Spanish and English sentences using the language objects
        sp_text = self.sp_lang.transform(sp_text)
        en_text = self.en_lang.transform(en_text)

        # Convert the transformed sentences to tensors
        sp_text = torch.tensor(sp_text, dtype=torch.long)
        en_text = torch.tensor(en_text, dtype=torch.long)

        return en_text, sp_text

In [7]:
# test the dataset
ds_train = CustomDataset(dataset)


In [8]:
# get the spanish and English vocab size
sp_vocab = ds_train.sp_lang.n_words
en_vocab = ds_train.en_lang.n_words

# Encoder

The encoder ($e$) used here is similar to the encoder implemented in NMT-RRN-NO-Attention but, as proposed in the original paper that introduced the attention mechanism, it uses a bidirectional GRU. The decoder is not bidirectional, so the encoder `hidden` state cannot be passed to it directly: concatenating the forward and backward states gives a vector of size `2 * hidden_dim`, while the decoder expects `hidden_dim`. We therefore use a linear layer to project the concatenated encoder hidden state down to the dimension the decoder expects.


First, we compute the hidden states and outputs

$$\overrightarrow{O_t}, \overrightarrow{h_t} = GRU(x_t, \overrightarrow{h_{t-1}})$$
$$\overleftarrow{O_t}, \overleftarrow{h_t} = GRU(x_t, \overleftarrow{h_{t-1}})$$

Then, we use a linear layer '$a$' followed by a $\tanh$ to reduce the dimension of the encoder hidden state

$$h_t = \tanh(a([\overrightarrow{h_t}, \overleftarrow{h_t}]))$$


In [9]:
class Encoder(nn.Module):
    def __init__(
        self, input_size, embedding_dim=100, n_layers=2, hidden_dim=10, dropout=0.5
    ):
        """
        Encoder class

        Parameters:
        -----------
        input_size: int
            The size of the input vocabulary
        embedding_dim: int
            The size of the embeddings
        n_layers: int
            The number of layers in the encoder
        hidden_dim: int
            The size of the hidden dimension
        dropout: float
            The dropout rate
        """

        super().__init__()
        self.input_size = input_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # Embedding layer
        # input shape: either [batch_size, seq_len] or [seq_len, batch_size]
        self.embeddings = nn.Embedding(input_size, embedding_dim)

        # GRU layers
        # input shape: [batch_size, seq_len, features] if batch_first=True
        # input shape: [seq_len, batch_size, features] if batch_first=False
        self.rnn = nn.GRU(
            embedding_dim,
            hidden_dim,
            n_layers,
            dropout=dropout,
            batch_first=False,
            bidirectional=True,
        )

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

        # Linear layer
        self.fc = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, x):
        """
        Forward pass of the encoder

        Parameters:
        -----------
        x: torch.Tensor
            The input to the encoder

        Returns:
        -----------
        x: torch.Tensor
            The output of the encoder
        hidden: torch.Tensor
            The hidden state of the encoder

        Notes:
        -----------
        input x shape: [seq_len, batch_size]
        output x shape: [seq_len, batch_size, n_directions * hidden_dim]
        hidden shape: [n_layers, batch_size, hidden_dim]
        """

        # x shape: [seq_len, batch_size]

        # Embedding Layer
        # output shape: [seq_len, batch_size, emb_dim]
        x = self.dropout(self.embeddings(x))

        # GRU Layer
        # output(x) shape: [seq_len, batch_size, num_directions * hidden_dim]
        # output(hidden) shape: [n_layers * n_directions, batch_size, hidden_dim]
        x, hidden = self.rnn(x)

        # hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # x always holds the outputs of the last layer only
        # concatenate both directions of each layer and project them to hidden_dim
        # output(hidden) shape: [n_layers, batch_size, hidden_dim]
        hidden = torch.cat((hidden[::2, :, :], hidden[1::2, :, :]), dim=-1)
        hidden = torch.tanh(self.fc(hidden))

        return x, hidden
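
PyTorch stacks the final hidden states of a bidirectional, multi-layer GRU as `[forward_1, backward_1, forward_2, backward_2, ...]` along the first axis. The toy sketch below uses labelled strings in place of tensors (an illustrative stand-in, not the real data) to show how the strided slices `[::2]` and `[1::2]` pair the two directions of each layer before the linear layer is applied.

```python
n_layers = 2

# stand-in for the first axis of `hidden`, shape [n_layers * 2, batch_size, hidden_dim]
hidden = ["forward_1", "backward_1", "forward_2", "backward_2"]

forward_states = hidden[::2]    # every even position: one forward state per layer
backward_states = hidden[1::2]  # every odd position: one backward state per layer

# pairing them layer by layer mimics torch.cat(..., dim=-1) on the feature axis
per_layer = list(zip(forward_states, backward_states))
print(per_layer)
# [('forward_1', 'backward_1'), ('forward_2', 'backward_2')]
```

After the pairing there is one entry per layer, which is why the projected hidden state has `n_layers` rows instead of `n_layers * 2`.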


### Test Encoder

We test that the encoder works properly using some dummy inputs.

In [10]:
# Test encoder
encoder = Encoder(100)

x = torch.tensor([[1, 3, 4], [4, 5, 6]], dtype=torch.long)
x, hidden = encoder(x.T)
print(x.shape)
print(hidden.shape)

torch.Size([3, 2, 20])
torch.Size([2, 2, 10])


# Attention Mechanism

In this implementation we use the [Bahdanau attention mechanism](https://arxiv.org/pdf/1409.0473.pdf), one of the first attention mechanisms proposed for neural machine translation. The attention mechanism is introduced as part of the decoder. More specifically, the hidden state of the decoder is written as:

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

Unlike the [S2S NMT](https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf) model, the context vector $c_i$ now depends on the sequence of hidden states $(h_1, h_2, \ldots, h_{T_x})$ to which the encoder maps the input sentence. The context vector $c_i$ is computed as a weighted sum of these hidden states:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j$$

The weight $\alpha_{ij}$ of each $h_j$ is computed by

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}$$

with

$$e_{ij} = a(s_{i-1}, h_j)$$

where $a$ is a small feed-forward network (the alignment model) that scores how well the inputs around position $j$ match the output at position $i$.
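
The two formulas above can be checked by hand on a tiny example. The sketch below is plain Python with made-up numbers (three encoder states of dimension 2 and arbitrary scores, all of them assumptions); `softmax` and `context_vector` are hypothetical helpers that mirror the equations, not the `Attention` module defined in this notebook.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of alignment scores e_ij."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(alphas: List[float], encoder_states: List[List[float]]) -> List[float]:
    """Weighted sum c_i = sum_j alpha_ij * h_j."""
    dim = len(encoder_states[0])
    return [sum(a * h[d] for a, h in zip(alphas, encoder_states)) for d in range(dim)]


# toy values (assumptions): three encoder states of dimension 2 and their scores
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
e = [0.0, 0.0, math.log(2.0)]

alpha = softmax(e)
c = context_vector(alpha, h)
print(alpha)  # approximately [0.25, 0.25, 0.5]
print(c)      # approximately [0.75, 0.75]
```

The third state has the highest score, so it receives half of the attention weight and dominates the context vector.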

In [11]:
class Attention(nn.Module):
    def __init__(self, hidden_dim: int) -> None:
        """
        Attention Mechanism

        Parameters
        ----------
        hidden_dim : int
            hidden dimension of the RNN
        """

        super().__init__()

        self.hidden_dim = hidden_dim

        # Linear layers
        self.attn = nn.Linear(self.hidden_dim * 3, self.hidden_dim)
        self.v = nn.Linear(self.hidden_dim, 1)

    def forward(self, decoder_hidden, encoder_output):
        # decoder_hidden shape: [n_layers, batch_size, hidden_dim]
        # encoder_output shape: [seq_len, batch_size, hidden_dim * 2]

        seq_len = encoder_output.shape[0]

        # take the last decoder layer and repeat it for every source position
        # decoder_hidden shape: [seq_len, batch_size, hidden_dim]
        decoder_hidden = decoder_hidden[[-1], :, :].repeat(seq_len, 1, 1)

        # e shape: [seq_len, batch_size, hidden_dim]
        e = torch.tanh(self.attn(torch.cat((decoder_hidden, encoder_output), dim=-1)))

        # e shape: [seq_len, batch_size, 1]
        e = self.v(e)

        # alpha shape: [seq_len, batch_size, 1]
        alpha = torch.softmax(e, dim=0)

        # context shape: [1, batch_size, hidden_dim * 2]
        context = torch.sum(alpha * encoder_output, dim=0).unsqueeze(0)

        return context


### Test the attention mechanism

Now we test the attention mechanism using some dummy inputs

In [12]:
# test Attention
bs = 2
hidden_dim = 10
seq_len = 3
n_layers = 2

decoder_hidden = torch.rand((n_layers, bs, hidden_dim))
encoder_output = torch.rand((seq_len, bs, hidden_dim * 2))

att = Attention(hidden_dim)
context = att(decoder_hidden, encoder_output)
context.shape


torch.Size([1, 2, 20])

# Decoder

The decoder makes use of the attention layer defined above, which takes the previous decoder hidden state $s_{t-1}$ and all of the encoder hidden states $\{h_1, h_2, \ldots, h_{T_x}\}$, and returns the context vector $c_t$. To compute the decoder hidden state $s_t$ at each time step, we use the following equation:

$$s_t = d(y_{t-1}, s_{t-1}, c_t)$$

where $y_{t-1}$ is the previously predicted token, $s_{t-1}$ is the previous hidden state and $c_t$ is the context vector computed by the attention layer.

We then pass the output of the decoder to a linear layer $f$ to predict the next word in the target sentence, $\hat y_t$:

$$\hat y_t = f(s_t)$$
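
At inference time the decoder runs one token at a time and feeds each prediction back as the next input. The sketch below shows that loop in plain Python. `step_fn` is a hypothetical stand-in for one decoder step (embedding, attention, GRU and linear layer followed by an argmax) that returns the index of the most likely next token together with the new state; the toy step function is an assumption made only so the example runs.

```python
from typing import Callable, List, Tuple

START, END = 1, 2  # indices of <start> and <end> in the vocabulary


def greedy_decode(
    step_fn: Callable[[int, int], Tuple[int, int]],
    state: int,
    max_len: int = 200,
) -> List[int]:
    """Feed each predicted token back into the decoder until <end> or max_len."""
    token, outputs = START, []
    for _ in range(max_len):
        token, state = step_fn(token, state)
        outputs.append(token)
        if token == END:
            break
    return outputs


# toy step function (assumption): emits 5, 6, 7 and then <end>
def toy_step(prev_token: int, state: int) -> Tuple[int, int]:
    script = [5, 6, 7, END]
    return script[state], state + 1


print(greedy_decode(toy_step, state=0))  # [5, 6, 7, 2]
```

The `max_len` bound guards against a model that never emits `<end>`, which is common early in training.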

In [13]:
class Decoder(nn.Module):
    def __init__(
        self,
        input_size,
        output_size,
        embedding_dim=100,
        n_layers=2,
        hidden_dim=10,
        dropout=0.5,
    ):
        """
        A class representing a decoder in a sequence-to-sequence model.

        Parameters:
        -----------
        input_size: int
            The number of input tokens.
        output_size: int
            The number of output tokens.
        embedding_dim: int
            The dimension of the embedding layer.
        n_layers: int
            The number of layers in the RNN.
        hidden_dim: int
            The number of hidden units in the RNN.
        dropout: float
            The dropout rate. 
        """

        super().__init__()

        self.input_size = input_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # Embedding layer
        # input shape: either [batch_size, seq_len] or [seq_len, batch_size]
        self.embeddings = nn.Embedding(input_size, embedding_dim)

        # GRU layers
        # input shape: [batch_size, seq_len, features] if batch_first=True
        # input shape: [seq_len, batch_size, features] if batch_first=False
        self.rnn = nn.GRU(
            embedding_dim + hidden_dim * 2,
            hidden_dim,
            n_layers,
            batch_first=False,
            dropout=dropout,
        )

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)

        # Dropout layer
        self.dropout = nn.Dropout(dropout)

        # Attention
        self.attn = Attention(self.hidden_dim)

    def forward(self, x, decoder_hidden, encoder_output):
        """
        Perform forward pass of the decoder.

        Parameters:
        -----------
        x: torch.Tensor
            The input tensor with shape [seq_len=1, batch_size].
        decoder_hidden: torch.Tensor
            The hidden state of the decoder.
        encoder_output: torch.Tensor
            The output of the encoder.

        Returns:
        --------
        x: torch.Tensor
            The output logits of the decoder with shape [1, batch_size, output_size].
        hidden: torch.Tensor
            The hidden state of the decoder.
        """    


        # x shape: [seq_len=1, batch_size]
        # decoder_hidden shape: [n_layers, batch_size, hidden_dim]
        # encoder_output shape: [src_len, batch_size, hidden_dim * 2]

        # Embeddings
        # x shape: [seq_len=1, batch_size, emb_dim]
        x = self.dropout(self.embeddings(x))

        # Attention mechanism
        # context shape: [1, batch_size, hidden_dim * 2]
        context = self.attn(decoder_hidden, encoder_output)

        # Concatenation
        # x shape: [seq_len=1, batch_size, emb_dim + hidden_dim * 2]
        x = torch.cat((x, context), dim=2)

        # GRU Layer
        # x shape: [seq_len=1, batch_size, hidden_dim]
        # hidden shape: [n_layers, batch_size, hidden_dim]
        x, hidden = self.rnn(x, decoder_hidden)

        # Fully connected layer
        # x shape: [seq_len=1, batch_size, output_size]
        x = self.fc(x)

        return x, hidden


### Test the decoder

Now we test that the decoder is working as we expect.

In [14]:
# test Decoder
bs = 2
hidden_dim = 10
seq_len = 3
n_layers = 2

decoder_hidden = torch.rand((n_layers, bs, hidden_dim))
encoder_output = torch.rand((seq_len, bs, hidden_dim * 2))
x = torch.tensor([[1], [4]], dtype=torch.long)

decoder = Decoder(100, output_size=100)
x, hidden = decoder(x.T, decoder_hidden, encoder_output)
print(x.shape)
print(hidden.shape)

torch.Size([1, 2, 100])
torch.Size([2, 2, 10])


In [15]:
class NMT(nn.Module):
    def __init__(
        self,
        en_vocab,
        sp_vocab,
        en_lang,
        sp_lang,
        embedding_dim,
        n_layers=2,
        hidden_dim=10,
    ):
        """
        Neural Machine Translation model (NMT) based on an encoder-decoder architecture with Bahdanau attention.

        Parameters
        ----------
        en_vocab : int
            Size of the English vocabulary.
        sp_vocab : int
            Size of the Spanish vocabulary.
        en_lang : object
            English language object.
        sp_lang : object
            Spanish language object.
        embedding_dim : int
            Dimension of the word embedding space.
        n_layers : int, optional
            Number of layers in the encoder and decoder (default is 2).
        hidden_dim : int, optional
            Dimension of the hidden state in the encoder and decoder (default is 10).
        """

        super().__init__()

        self.en_vocab = en_vocab
        self.sp_vocab = sp_vocab
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.sp_lang = sp_lang
        self.en_lang = en_lang

        # Define Encoder and Decoder
        self.encoder = Encoder(en_vocab, embedding_dim, n_layers, hidden_dim)
        self.decoder = Decoder(sp_vocab, sp_vocab, embedding_dim, n_layers, hidden_dim)

    def forward(self, x, y):
        """
        Forward pass of the NMT model.

        Parameters
        ----------
        x : tensor
            Tensor of shape (seq_len, batch_size) containing the input sequences in English.
        y : tensor
            Tensor of shape (seq_len, batch_size) containing the target sequences in Spanish.

        Returns
        -------
        outputs : tensor
            Tensor of shape (seq_len, batch_size, sp_vocab) containing the predicted Spanish sequences.
        """

        # shape x: [seq_len, batch_size]
        # shape y: [seq_len, batch_size]

        target_len = y.shape[0]
        batch_size = x.shape[1]

        # outputs tensor, created on the same device as the input
        # outputs shape: [seq_len, batch_size, vocab_size]
        outputs = torch.zeros(target_len, batch_size, self.sp_vocab, device=x.device)

        # Encoder
        # y_encoder shape: [seq_len, batch_size, num_directions * hidden_dim]
        # hidden shape: = [n_layers * n_directions, batch_size, hidden_dim]
        y_encoder, hidden = self.encoder(x)

        # Initial prediction
        # x_decoder shape: [1, batch_size]
        x_decoder = y[[0], :]

        for t in range(1, target_len):
            # output shape: [1, batch_size, output_size]
            # hidden shape: [n_layers, batch_size, hidden_dim]
            output, hidden = self.decoder(x_decoder, hidden, y_encoder)
            outputs[[t], :, :] = output
            y_decoder = output.argmax(-1)
            # teacher forcing: feed the ground-truth token with probability 0.5,
            # otherwise feed the model's own prediction
            x_decoder = y[[t], :] if np.random.random() < 0.5 else y_decoder

        # output shape: [seq_len, batch_size, vocab_size]
        return outputs

    def translate_sentence(self, x):
        """
        Translate an English sentence to Spanish.

        Parameters
        ----------
        x : str
            English sentence to translate.

        Returns
        -------
        translation : str
            The predicted Spanish sentence, ending with <end> if the model produced it.
        """

        self.eval()

        # Transform input text to tokens
        x = (
            torch.Tensor(self.en_lang.transform(x))
            .long()
            .reshape(-1, 1)
            .to(self.device)
        )

        # define output array
        outputs = []

        # Initial token <start>
        x_decoder = torch.tensor([[1]], dtype=torch.long, device=self.device)

        # no gradients are needed at inference time
        with torch.no_grad():
            # pass sentence to the encoder
            y_encoder, hidden = self.encoder(x)

            # decode greedily until <end> (index 2) is predicted or 200 tokens are produced
            for _ in range(200):
                output, hidden = self.decoder(x_decoder, hidden, y_encoder)
                x_decoder = output.argmax(-1)
                outputs.append(x_decoder.item())

                if outputs[-1] == 2:
                    break

        return self.sp_lang.inverse_transform(outputs)

    def config_model(self, device="cuda"):
        """
        Configure the NMT model.

        Parameters
        ----------
        device : str, optional
            Device to use (default is "cuda").
        """

        # define device to operate
        self.device = device

        # set model's device
        self.to(self.device)

        # define loss function
        self.loss = nn.CrossEntropyLoss(ignore_index=0)

        # define optimizer
        self.optimizer = torch.optim.Adam(self.parameters())

    def train_one_epoch(self, train_loader):
        """
        Train the NMT model for one epoch.

        Parameters
        ----------
        train_loader : DataLoader
            DataLoader object containing the training data.

        Returns
        -------
        logs : dict
            Dictionary containing the training loss and BLEU score.
        """

        running_loss = 0
        bleu = 0

        self.train()

        bar = tqdm(train_loader, leave=True)

        for step, (x, y) in enumerate(bar, 1):

            self.optimizer.zero_grad()

            # set device
            x, y = x.to(self.device), y.to(self.device)

            # forward pass
            logits = self(x, y)  # shape: [seq_len, batch_size, vocab_size]

            # Remove <start> from target
            y = y[1:]
            logits = logits[1:]

            # Compute loss
            loss = self.loss(logits.reshape(-1, logits.shape[2]), y.reshape(-1))

            # Compute gradients
            loss.backward()

            # Clip the gradient norm so that it does not exceed 1
            # (this must run after backward(), once the gradients exist)
            torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=1)

            # Update weights
            self.optimizer.step()

            # compute running loss
            running_loss += loss.item()

            # predictions
            y_pred = logits.argmax(-1).detach().cpu().numpy()
            y_pred = [
                self.sp_lang.inverse_transform(y_pred[:, i]) for i in range(x.shape[1])
            ]

            # true labels
            y = y.detach().cpu().numpy()
            y = [self.sp_lang.inverse_transform(y[:, i]) for i in range(x.shape[1])]

            bleu += bleu_score(y_pred, y).item()

            bar.set_description(
                f"Train loss {round(running_loss/step, 3)}, "
                f"Train BLEU {round(bleu/step, 3)}"
            )

        logs = {
            "Train loss": round(running_loss / step, 3),
            "Train BLEU": round(bleu / step, 3),
        }

        return logs

    def test_one_epoch(self, test_loader):
        """
        Test the NMT model for one epoch.

        Parameters
        ----------
        test_loader : DataLoader
            DataLoader object containing the test data.

        Returns
        -------
        logs : dict
            Dictionary containing the test loss and BLEU score.
        """

        running_loss = 0
        bleu = 0

        self.eval()

        with torch.no_grad():
            bar = tqdm(test_loader, leave=True)

            for step, (x, y) in enumerate(bar, 1):

                # set device
                x, y = x.to(self.device), y.to(self.device)

                # forward pass
                logits = self(x, y)  # shape: [seq_len, batch_size, vocab_size]

                # Remove <start> from target
                y = y[1:]
                logits = logits[1:]

                # Compute loss
                loss = self.loss(logits.reshape(-1, logits.shape[2]), y.reshape(-1))

                # compute running loss
                running_loss += loss.item()

                # predictions
                y_pred = logits.argmax(-1).detach().cpu().numpy()
                y_pred = [
                    self.sp_lang.inverse_transform(y_pred[:, i])
                    for i in range(x.shape[1])
                ]

                # true labels
                y = y.detach().cpu().numpy()
                y = [self.sp_lang.inverse_transform(y[:, i]) for i in range(x.shape[1])]

                bleu += bleu_score(y_pred, y).item()

                bar.set_description(
                    f"Test loss {round(running_loss/step, 3)}, "
                    f"Test BLEU {round(bleu/step, 3)}"
                )

        # build the logs once, after every batch has been evaluated
        logs = {
            "Test loss": round(running_loss / step, 3),
            "Test BLEU": round(bleu / step, 3),
        }

        return logs

    def fit(self, train_loader, test_loader, epochs: int = 1):
        """
        Train and evaluate the model for N epochs

        Parameters:
        -----------
        train_loader : DataLoader
            DataLoader object containing the training data.
        test_loader : DataLoader
            DataLoader object containing the test data.
        epochs: int
            Number of epochs to train and evaluate the model
        """

        bar = tqdm(range(epochs))

        for epoch in bar:
            train_logs = self.train_one_epoch(train_loader)
            test_logs = self.test_one_epoch(test_loader)

            # dict.update returns None, so merge into a new dictionary instead
            logs = {**train_logs, **test_logs}

            print(self.translate_sentence("the man who sold the world"))

            bar.set_description(str(logs))
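
Gradient clipping only has an effect once the gradients exist, so it belongs between the backward pass and the optimizer step. The sketch below illustrates, with plain Python lists standing in for gradient tensors, the rescaling by global norm that `clip_grad_norm_` performs. `clip_by_global_norm` is a hypothetical helper, and it omits the small epsilon that PyTorch adds to the denominator.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float) -> List[List[float]]:
    """Rescale all gradients so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    return [[g * scale for g in grad] for grad in grads]


grads = [[3.0, 0.0], [0.0, 4.0]]  # global norm is 5
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped)  # approximately [[0.6, 0.0], [0.0, 0.8]]
```

Gradients whose global norm is already below `max_norm` are left untouched, so clipping changes only the magnitude of large updates, not their direction.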

# Train and Evaluate the NMT model

In [16]:
# split data into train and test
train_df, test_df = train_test_split(dataset, test_size=0.1, random_state=42)

# define train and test dataset
ds_train = CustomDataset(train_df)
ds_test = CustomDataset(test_df, sp_lang=ds_train.sp_lang, en_lang=ds_train.en_lang)

# define train and test dataloader
loader_train  = torch.utils.data.DataLoader(ds_train, batch_size=64, num_workers=8, shuffle=True, collate_fn=Lang.right_padding_per_batch)
loader_test  = torch.utils.data.DataLoader(ds_test, batch_size=64, num_workers=8, shuffle=False, collate_fn=Lang.right_padding_per_batch)

# define Spanish and English vocab size
sp_vocab = ds_train.sp_lang.n_words
en_vocab = ds_train.en_lang.n_words


In [18]:
x, y = next(iter(loader_test))

In [23]:
y[:, 0]

tensor([    1,   481,  4645,    37,     3,    66, 21890,     4,    87,  2760,
           12,     2,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0])

In [17]:
# instantiate the NMT model
nmt = NMT(en_vocab, sp_vocab, ds_train.en_lang, ds_train.sp_lang, embedding_dim=300, n_layers=2, hidden_dim=512)
nmt.config_model(device="cuda")

In [18]:
nmt.fit(loader_train, loader_test, epochs = 10)

el policía le la la la la ? <end>
el hombre hombre el el el mundo . <end>
el hombre dijo que el el mundo ? <end>
el hombre quién conoce el mundo del mundo . <end>
el hombre quién está el todo el mundo ? <end>
el hombre cree que la mundo más mundo . <end>
el hombre conocía el mundo . <end>
el hombre le panfletos el mundo . <end>
el hombre dijo el mundo mundial . <end>
el hombre que conoce el mundo mundo . <end>


In [19]:
torch.save(nmt, "./NMT-GRU.pth")

In [20]:
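# the whole module was pickled, so recent PyTorch versions need weights_only=False
nmt = torch.load("./NMT-GRU.pth", weights_only=False)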

In [31]:
for _, text in dataset.sample(5).iterrows():
    
    print("------------------------------------------------------------------------------------------------------")
    print("input text: ",text.english)
    print("translation: ",nmt.translate_sentence(text.english))
    print("-----------------------------------------------------------------------------------------------------\n")

------------------------------------------------------------------------------------------------------
input text:  where's tom's computer?
translation:  ¿ dónde está el ordenador de tom ? <end>
-----------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------
input text:  is he your teacher?
translation:  ¿ es tu profesor ? <end>
-----------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------
input text:  tom was a coal miner.
translation:  tom fue un carbón de carbón . <end>
-----------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------
input text:  it made my mother