<a href="https://colab.research.google.com/github/nathan-young1/Unlocking-Languages-Dive-into-Transformer-based-Translation-with-PyTorch/blob/main/Unlocking_Languages_Dive_into_Transformer_based_Translation_with_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **English to French Translation with Transformers**

In this notebook, you'll learn how to build your own 🤖 English → French translator using Transformers, the state-of-the-art model for natural language processing. We'll use the Europarl parallel corpus, which is extracted from the proceedings of the European Parliament from 1996 to 2011 [[EN->FR Dataset](https://www.statmt.org/europarl/)], and train the model with PyTorch Lightning⚡️, a framework that makes training fast and easy.

You'll also:

- ⚙️ Preprocess your text data into token tensors
- 🧑‍🔧 Design an encoder-decoder transformer architecture in PyTorch
- 💬 Translate English sentences into French

Transformers are amazing models that use self-attention to capture the meaning and context of words. You'll see how they can help you create a powerful and elegant translator.

🙂 Let's get started!


# **📊 Data Exploration**

In this tutorial, we will explore the **Europarl** dataset, a widely used parallel corpus for training and evaluating machine translation systems. The dataset consists of the speeches delivered at the **European Parliament**, covering a variety of topics and domains. The original release covered **11 different languages**; later releases, including the v7 release we download below, cover 21.

The creation of the dataset was led by **Philipp Koehn**, a leading researcher and author in the field of machine translation.

Here we will focus on just the **English to French** translation pair.

In [40]:
# Downloading the Dataset
import requests
import os

url = "http://www.statmt.org/europarl/v7/fr-en.tgz"
filename = "fr-en.tgz"

if not os.path.exists(filename):
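    # Stream the archive to disk in chunks so it never has to fit in memory
    with requests.get(url, stream=True) as r:
        r.raise_for_status()  # stop on an HTTP error instead of saving an error page
        with open(filename, "wb") as w:
            for chunk in r.iter_content(chunk_size=1 << 20):
                w.write(chunk)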
    print("File downloaded")
else:
    print("File already exists")

File downloaded


In [41]:
# Import the module for working with compressed tar files
import tarfile

# Open the gzipped tar archive in read mode (this gives us a TarFile object)
with tarfile.open("fr-en.tgz", "r:gz") as tar:
    # Extract all the files and directories to the current folder
    tar.extractall()

#### **Preprocessing**
We will clean the text before feeding it to our Neural Machine Translation (NMT) model.
This is useful because it can:

- Remove irrelevant or noisy information, such as HTML tags, unusual symbols, capitalization, or non-printable characters, that might confuse the NMT model or reduce its performance.

- Normalize the text to a consistent format, such as lowercase letters and UTF-8 encoding, that can be easily processed by the NMT model.

And many other benefits...

> By cleaning the text, we can make it more suitable for NMT and improve the quality and accuracy of the translation output. 😊

In [20]:
import re

# Define a function for text cleaning
def clean_text(text):

    # Convert text to lowercase
    text = str(text).lower().strip()

    # Remove any trailing newline (strip() above already handles it; kept as a safeguard)
    text = text.rstrip('\n')

    # Remove HTML tags, then replace every character outside the allowed set with a space
    text = re.sub(r"<[^>]+>", "", text)
    text = re.sub(r"[^a-zA-ZÀ-ÿ0-9\s.,;!?':()\[\]{}-]", " ", text)  # Keep letters (including accented ones), digits and selected punctuation

    # Remove excessive whitespace (more than one space)
    text = re.sub(r"\s+", " ", text)

    text = text.encode("utf-8", errors="ignore").decode("utf-8")  # Drop anything that cannot be encoded as UTF-8

    return text

In [43]:
import pandas as pd

# Read the two text files, create a pandas Series (similar to a Python list) from each, and clean the text by
# applying our clean_text function to every element.
# Note: the encoding is set explicitly so accented French characters are read correctly on every platform.

with open('/content/europarl-v7.fr-en.en', 'r', encoding='utf-8') as en_file, open('/content/europarl-v7.fr-en.fr', 'r', encoding='utf-8') as fr_file:
    en = pd.Series(en_file.readlines(), name='en').apply(clean_text)
    fr = pd.Series(fr_file.readlines(), name='fr').apply(clean_text)

In [44]:
# Let's merge the two series objects into a Pandas Dataframe, which will create two columns
# 'en' and 'fr' corresponding to the sentence pairs.

translation_df = pd.concat([en, fr], axis=1)

# Show 5 sample rows of the dataframe.
translation_df.head()

|   | en | fr |
|---|----|----|
| 0 | resumption of the session | reprise de la session |
| 1 | i declare resumed the session of the european ... | je déclare reprise la session du parlement eur... |
| 2 | although, as you will have seen, the dreaded '... | comme vous avez pu le constater, le grand bogu... |
| 3 | you have requested a debate on this subject in... | vous avez souhaité un débat à ce sujet dans le... |
| 4 | in the meantime, i should like to observe a mi... | en attendant, je souhaiterais, comme un certai... |


In [None]:
# There are just above 2 Million sentence pairs.
len(translation_df)

2007723

In [45]:
# We will Create an SQLite Database for the dataframe for faster access to specific lines.
from sqlalchemy import create_engine

DATABASE_NAME = 'translation.db'
TABLE_NAME = 'en_fr'

# Create a SQLite database
engine = create_engine(f'sqlite:///{DATABASE_NAME}')

# Store the pandas dataframe in the SQLite db.
# if_exists='replace' keeps the table free of duplicate rows if this cell is run more than once.
translation_df.to_sql(TABLE_NAME, engine, if_exists='replace')

print('Moved to sqlite db')

Moved to sqlite db


In [46]:
TRANSLATION_DB_FILE_PATH = '/content/translation.db'
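
To see why a database helps here, the sketch below stores a couple of sentence pairs in a SQLite table and fetches one row by its index, which is the same kind of lookup the dataset class performs later. The table layout mirrors what `to_sql` produces (an `index` column plus `en` and `fr`). The in-memory connection, the second sample row, and the `fetch_pair` helper are assumptions made for illustration only.

```python
import sqlite3

# In-memory database so this sketch leaves no file behind (illustrative assumption).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE en_fr ("index" INTEGER PRIMARY KEY, en TEXT, fr TEXT)')

# Toy rows: the first comes from the dataframe preview, the second is made up.
rows = [
    (0, "resumption of the session", "reprise de la session"),
    (1, "good morning", "bonjour"),
]
conn.executemany("INSERT INTO en_fr VALUES (?, ?, ?)", rows)


def fetch_pair(connection, index):
    """Return the (en, fr) pair stored at `index`, or None if it does not exist."""
    cursor = connection.execute(
        'SELECT en, fr FROM en_fr WHERE "index" = ?', (index,)
    )
    return cursor.fetchone()


pair = fetch_pair(conn, 0)
print(pair)
```

Because only the requested row is read, the full two million sentence pairs never need to sit in memory at once.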

# **Text Tokenization**

Text tokenization is a key step in natural language processing (NLP) that enables computers to understand human language. It converts text into numerical representations that are easier for machines to interpret.

<br>

### **Why Tokenize Text ❓**

Human communication has structure through words, sentences, grammar, and so on. Tokenization exposes this structure so that machines can discover patterns, relationships, and meaning. It splits text into smaller units called tokens that can be assigned numerical values.

Common types of tokens are:

- **Words**: Words are the basic units of meaning in language, such as "love", "Paris", or "the".

- **Subwords**: Subwords are parts of words that have some meaning or function, such as prefixes, suffixes, or stems.

Without tokenization, text is just a sequence of characters without shape or meaning. Tokens add form that machines can process.

<br>

### **Challenges in Tokenization 🚧**

Tokenizing human text is challenging because languages have complex and diverse rules:

- Word spaces vary across languages. Some languages, such as Chinese and Japanese, do not use whitespace to separate words.

- Made-up words, such as names, slang, or acronyms, may not be recognized or split correctly by tokenizers.

- The same word can have different meanings depending on the context. Capturing the nuances of language is difficult for machines.

Good tokenizers handle these challenges with techniques such as vocabulary lists, multi-word tokens, word splitting rules, and more.

<br>

### **Assigning Meaning Through Embeddings 📈**

Tokenization gives text structure; embeddings give it meaning. They map each token to a numerical vector that captures information about how the token is used in the language. Similar tokens have similar embeddings (e.g. eat & drink, walk & run, cat & dog).

**Note**: We will come back to this later...


# Using a Tokenizer 🛠️

Now that you know what text tokenization is and why it is important, let's see how we can do it in practice. We will use a pretrained multi-lingual tokenizer from the **Hugging Face Tokenizers Library** for both English and French texts.

**Here are some of the steps performed by the tokenizer:**

1. The tokenizer uses **WordPiece**, a way to split words into smaller parts that make sense. For example, the word "Transformer" can be split into "Trans" and "##former" (the `##` prefix marks a piece that continues a word). This helps the model handle words it has never seen and keeps the vocabulary small.

    WordPiece builds its vocabulary by repeatedly merging pieces that frequently occur together in a text. For example, if the text has many words that end with "ing", WordPiece will merge "i", "n", and "g" into one part. **This way, having seen "eat" and "ing", the model can infer the meaning of the word "eating".**

2. It splits on **whitespace** and **punctuation**. The tokenizer we load below is a *cased* model, so it does not lowercase text itself; our sentences are already lowercase because `clean_text` did that earlier. For example, the cleaned sentence "hello world!" is split into "hello", "world", and "!".

3. It uses **special tokens** to mark the boundaries of a sentence, and to handle unknown or padding tokens. I am going to be using the [CLS] and [SEP] tokens as the beginning and ending of a sentence. For example, the cleaned sentence "i love paris" will be tokenized along the lines of "[CLS]", "i", "love", "paris", "[SEP]".

So the special tokens are:
- "[CLS]" for start of sentence
- "[SEP]" for end of sentence
- "[UNK]" for unknown token
- "[PAD]" for padding token

<br>
📝 Technical Detail: [CLS] and [SEP] are meant as classification and separation tokens in the pretrained tokenizer, but since it does not have [SOS] and [EOS] tokens (which is expected for a BERT tokenizer), I am repurposing them as start and end markers.
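
The splitting step described above can be sketched with a toy greedy longest-match tokenizer. This is a simplified illustration, not the Hugging Face implementation: the vocabulary `toy_vocab` and the function `wordpiece_tokenize` are hypothetical, and real WordPiece vocabularies are learned from data.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest pieces found in `vocab`, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"eat", "##ing", "trans", "##former", "walk", "##ed"}

print(wordpiece_tokenize("eating", toy_vocab))       # ['eat', '##ing']
print(wordpiece_tokenize("transformer", toy_vocab))  # ['trans', '##former']
print(wordpiece_tokenize("zebra", toy_vocab))        # ['[UNK]']
```

A word that cannot be assembled from known pieces collapses to `[UNK]`, which is why a large subword vocabulary keeps unknown tokens rare.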

In [None]:
# Install the necessary library
!pip install transformers

In [None]:
# Download the pretrained tokenizer
from transformers import AutoTokenizer

# Save the name of the model whose tokenizer we are using. We will need it later.
PRE_TRAINED_MODEL_NAME = "distilbert/distilbert-base-multilingual-cased"

# Download the tokenizer
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

In [None]:
# Get the vocabulary size of our tokenizer (the number of unique words, subwords, etc. that the tokenizer understands).
# Note: Any text not in its vocabulary is replaced with the [UNK] special token.
# Note: The special tokens are also included in the size of the vocabulary.
tokenizer.vocab_size

119547

In [None]:
# Special tokens available in the tokenizer.
tokenizer.all_special_tokens

['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']

In [None]:
# View of 5 tokens and their number (called token id) in our vocabulary
print (list(tokenizer.get_vocab().items()) [21:26])

[('étend', 52192), ('crkva', 63736), ('suicide', 35656), ('русӣ', 21302), ('aérea', 74214)]


In [None]:
# Let's test our tokenizer out on both english and french text.
tokenizer(['français bonjour', 'morning francais']).input_ids

[[101, 12501, 22873, 98214, 10129, 102], [101, 28757, 63184, 12985, 102]]

In [None]:
# Above you see two lists where the first and last numbers are 101 and 102 respectively.
# These correspond to our [CLS] and [SEP] tokens, which we will be using at the beginning and
# ending of a sentence.
print('[CLS] : ', tokenizer.cls_token_id)
print('[SEP] : ', tokenizer.sep_token_id)

# But what do these numbers mean? Well, let's continue our journey 🛣️.

[CLS] :  101
[SEP] :  102


### Word Embeddings: A Simple Explanation 📚

Word embeddings are vectors of real numbers (e.g. [0.1, 0.4, -0.8]), one per token in your vocabulary. They represent the semantic meaning of words in a way that is efficient and comparable.

To create these word embeddings, we first had to tokenize the text, that is, convert the words into numbers. This assigns a unique integer to each token in the vocabulary. For example, the sentence “morning francais” was tokenized as [28757, 63184, 12985] (Note: this is the tokenization without the [CLS] and [SEP] that the tokenizer added).

But these numbers don’t tell us much about the words. They don’t tell us what the words mean, or how they are related to each other. For example, we don’t know if dog and cat are similar or different, or if apple and orange are fruits or colors.

That’s why we need an embedding layer, which maps these numbers to their corresponding vectors.
    
    For example:

    dog -- (tokenized to 1) -- mapped to row 1 of matrix: [0.2, -0.1, 0.5, …]
    cat -- (tokenized to 23) -- mapped to row 23 of matrix:  [0.3, -0.2, 0.4, …]
    apple -- (tokenized to 40) -- mapped to row 40 of matrix:  [-0.1, 0.4, 0.2, …]
    orange -- (tokenized to 22) -- mapped to row 22 of matrix:  [-0.2, 0.3, 0.1, …]
    ...

These vectors, called embeddings, are like arrows that point in different directions and have different lengths (the size of the arrow, not the number of elements; e.g. the arrow [0.5, 0.7] is longer than the arrow [0.2, 0.1]).

These embeddings are stored in an embedding matrix, and while our neural network is being trained it adjusts the vectors to match the meaning and context of the words. For example, it tends to make the vectors of similar words point in similar directions, and the vectors of unrelated words point in different directions. The lengths of the vectors are learned as well, although direction is what is usually compared.

This enables the model to understand the contextual meaning of text better, but it requires training on millions or even billions of books, texts, etc.

That's why it is more common to use embedding layers of models that have already been trained.

Note: As you saw, the embedding layer maps the token numbers from the tokenizer to rows in the embedding matrix. This means the embedding matrix will be of size (num_rows, num_cols) -> (tokenizer_vocab_size, embed_dim), where embed_dim is the number of elements in the embedding vector used to represent a token. Common sizes are 256, 300, 512, 768, etc.

Note: This also implies that the tokenizer and the embedding layer should come from the same pre-trained model, so that the mapping from token number to row matches.
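
As a rough illustration of the lookup and of what "similar direction" means, the sketch below reuses the made-up three-element vectors from the example above and compares them with cosine similarity. The dictionary standing in for the embedding matrix, the token ids, and the helper names are all assumptions for illustration; the real layer is the `nn.Embedding` loaded in the next cell.

```python
import math

# Toy "embedding matrix": row number -> vector (made-up values from the example above).
embedding_rows = {
    1: [0.2, -0.1, 0.5],    # dog
    23: [0.3, -0.2, 0.4],   # cat
    40: [-0.1, 0.4, 0.2],   # apple
    22: [-0.2, 0.3, 0.1],   # orange
}
toy_token_ids = {"dog": 1, "cat": 23, "apple": 40, "orange": 22}


def embed(word):
    """Token -> token id -> row of the embedding matrix."""
    return embedding_rows[toy_token_ids[word]]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


dog_cat = cosine_similarity(embed("dog"), embed("cat"))
dog_apple = cosine_similarity(embed("dog"), embed("apple"))
print(f"dog vs cat:   {dog_cat:.3f}")
print(f"dog vs apple: {dog_apple:.3f}")
```

With these toy values, dog and cat point in nearly the same direction, while dog and apple do not.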

In [2]:
# Get the embedding layer from our pre-trained model.

from transformers import AutoModelForMaskedLM

# Note 👀 how we are using the same model name.
pre_trained_model = AutoModelForMaskedLM.from_pretrained(PRE_TRAINED_MODEL_NAME) # downloads the model.

# Fetch the embedding layer from the pre-trained model.
embedding_layer = pre_trained_model.get_input_embeddings()

# This line tells PyTorch we don't intend to train the embedding layer any further.
# It freezes the layer's knowledge so we don't scramble it while the rest of our model is still starting to learn.
embedding_layer = embedding_layer.requires_grad_(False)


In [None]:
print('Vocabulary Size :', tokenizer.vocab_size)
print('Embedding Layer :', embedding_layer)

# As you can see the number of rows of the embedding_layer match up with the vocab size.
# We can also see that the embed_dim size (num_of elements in embedding vector) used here is 768.

Vocabulary Size : 119547
Embedding Layer : Embedding(119547, 768, padding_idx=0)


### Preparing Our Translation Dataset

In [None]:
# Install pytorch lightning
!pip install lightning

In [4]:
# Pytorch
import torch
# To access our sqlite db
import sqlite3

# other tools needed
from torch.utils.data import *
import torch.nn as nn

### Understanding the form of Our Dataset.

The transformer we are going to use (which we will see later) is going to have an encoder and decoder layer.

<br>

#### Encoder Overview 🔀

The 🕵️‍♂️ encoder creates meaningful representations of its inputs (in this case, an English sentence) 💬.

<br>

#### Decoder Overview 📡

The 📡 decoder then uses these representations to perform a specific task (translating to French 🇫🇷 in this example).

<br>

#### Translation Approach 🎯

We will implement translation using a common technique called **next token prediction** ⏭️.

At each timestep, the decoder uses the encoder's output, together with the tokens produced so far, to predict the next token in the translated sequence.  

#### Example 💡

For example, to translate "Beautiful day" into "Belle journée", we first append special start `[CLS]` and end `[SEP]` tokens to the French translation:

> "[CLS] Belle journée [SEP]"

Our goal is to train the model to predict the next token at each step:

1. When it sees `[CLS]`, predict "Belle"
2. When it sees "Belle", predict "journée"
3. When it sees "journée", predict `[SEP]` ✅  

<br>

#### Teacher Forcing 👩‍🏫  

Training the model one step at a time like this 👆👆, feeding each prediction back in before making the next one, would be very slooow.

So instead, we use a technique called **teacher forcing** 👩‍🏫.

We provide the ground truth translation, up to but not including the `[SEP]` token, to the decoder during training: just "`[CLS]` Belle journée".

Then we set the model's target outputs to be "Belle journée `[SEP]`".
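
The shift between decoder input and target can be sketched in a few lines. The token strings below stand in for token ids purely for readability; the dataset class later in this notebook applies the same slicing to real ids.

```python
# The full French sequence, wrapped in the start and end tokens.
tokens = ["[CLS]", "Belle", "journée", "[SEP]"]

decoder_input = tokens[:-1]  # everything except the end token
decoder_label = tokens[1:]   # everything except the start token

# At each position, the label is the token that follows the input token.
for seen, target in zip(decoder_input, decoder_label):
    print(f"sees {seen!r:10} -> should predict {target!r}")
```

Both sequences have the same length, so every training step for the sentence is computed in a single pass.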

<br>

#### Causal Masking 😷  

Since we are passing the French translation to the decoder during training, we employ something called **causal masking** 😷 (more on this later) in the decoder to prevent it from cheating by looking ahead at the tokens it is supposed to predict.

This forces the model to predict the next token based only on the encoder outputs and the tokens that came before it in the decoder input.  
<br>
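
A causal mask is simply a triangular grid. The sketch below builds one in plain Python, using the convention (an assumption here) that `True` marks a position that must stay hidden. The helper name `causal_mask` is hypothetical.

```python
def causal_mask(size):
    """Return a size x size grid where True marks a hidden (future) position."""
    return [[col > row for col in range(size)] for row in range(size)]


tokens = ["[CLS]", "Belle", "journée"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    visible = [t for t, hidden in zip(tokens, row) if not hidden]
    print(f"{token!r} can attend to {visible}")
```

Each row can see itself and everything before it, but nothing after it.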

#### Inference Process 🤔

At inference time, we pass an English input to the encoder and just the `[CLS]` token to the decoder initially.

We then feed the model's predicted token from the previous timestep back into the decoder to predict the next token.

We continue this loop until the `[SEP]` token is predicted, indicating the ✅ end of translation.   
<br>
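
The loop above can be sketched as follows. This is a minimal outline under stated assumptions: `toy_predict_next` is a hypothetical stand-in for the trained model (a real model would also receive the encoder output), and the `max_len` guard is added so the loop always terminates.

```python
def greedy_decode(predict_next, start_token, end_token, max_len):
    """Generate tokens one at a time until end_token appears or max_len is reached."""
    generated = [start_token]
    while len(generated) < max_len:
        next_token = predict_next(generated)  # the model sees everything generated so far
        generated.append(next_token)
        if next_token == end_token:
            break
    return generated


# Hypothetical stand-in for a trained model: look up the next token from the last one.
toy_next_token = {"[CLS]": "Belle", "Belle": "journée", "journée": "[SEP]"}


def toy_predict_next(prefix):
    return toy_next_token[prefix[-1]]


translation = greedy_decode(toy_predict_next, "[CLS]", "[SEP]", max_len=10)
print(translation)  # ['[CLS]', 'Belle', 'journée', '[SEP]']
```

Always picking the single most likely token at each step is called greedy decoding; it is the simplest strategy, though not the only one.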

#### Note 📝
In most sequence-to-sequence models (like ours), the encoder inputs do not contain start/end tokens - those are appended to the decoder inputs only.

This allows the decoder to know when to start 🏁 and stop ✋ generating the translation.

<br>

#### **Batching**
Batching allows us to make more efficient use of computing power by training on batches of sentence pairs rather than one sentence pair at a time. Averaging the gradient over a batch also gives smoother, less noisy updates than learning from one sentence pair at a time.

However, sentences come in varying lengths. To create same-sized batches, shorter sentences will be padded with special `[PAD]` tokens to match the length of the longest sentence in the batch.

We will configure the model to ignore these pad tokens. Along with the batch of sentences, the actual length of each sentence will also be passed to the model so it knows how much of each sentence is real tokens versus padding.

This contextual information on real sentence lengths allows the model to differentiate between content words and pads inserted to standardize batch lengths.
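
Padding and length tracking can be sketched in plain Python as below. The token ids, the helper `pad_batch`, and the pad id of 0 are assumptions for illustration; the notebook itself reads the pad id from the tokenizer and pads with PyTorch utilities.

```python
PAD_ID = 0  # assumed pad token id for this sketch


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and remember the real lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


batch = [[7, 8, 9, 10], [5, 6], [3]]  # made-up token ids
padded, lengths = pad_batch(batch)

# True wherever a position is padding, derived from the real lengths.
pad_mask = [[pos >= n for pos in range(len(padded[0]))] for n in lengths]

print(padded)
print(lengths)
print(pad_mask)
```

The mask is what lets the model skip the padded positions instead of treating them as real words.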

<br>


#### **Using this information let's create the dataset for our model**

In [5]:
# Define some constants for our model architecture
TOKEN_LIMIT = 350  # The maximum number of tokens our model can handle in a sentence
PAD_IDX = tokenizer.pad_token_id  # The pad token id to use in padding shorter sentences in the batch

# This class inherits from a pytorch dataset, and its function is to load and determine how to get a particular sample
# from the data using the __getitem__ function. The collate_fn function's job is to get many samples and batch them together.
class EN_FR_Dataset(Dataset):

    TOTAL_SAMPLES = 2_007_723  # Number of sentence pairs in our dataset

    def __init__(self, *, db_path, tokenizer):
        # Create a connection to the database
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()

        self.tokenizer = tokenizer

    def __getitem__(self, index):
        # Execute a query to fetch the English and French sentences at the given index
        self.cursor.execute('SELECT en, fr FROM en_fr WHERE "index" = ?', (index,))
        row = self.cursor.fetchone()

        # Raise an exception if the row is not found
        if row is None:
            raise Exception('Row not found at index', index)

        try:
            # Encode the English and French sentences using the tokenizer. The special start [CLS] and end [SEP]
            # tokens are added only to the French translation, which serves as both the decoder input and the target output of the model.
            en_encoded = self.tokenizer.encode(row[0], add_special_tokens=False)
            fr_encoded = self.tokenizer.encode(row[1], add_special_tokens=True)

            # If either sentence is too long for the model, fall back to the first sample.
            # Note: fr_encoded already includes [CLS] and [SEP], so the -2 simply leaves a small safety margin on the French side.
            if len(en_encoded) > TOKEN_LIMIT or len(fr_encoded) > TOKEN_LIMIT - 2:
                return self.__getitem__(0)

            # Convert the encoded sentences to PyTorch tensors
            en_tensor = torch.tensor(en_encoded)
            fr_decoder_input = torch.tensor(fr_encoded[:-1])  # Input for decoder (excluding end token)
            fr_decoder_label = torch.tensor(fr_encoded[1:])  # Label for decoder (excluding start token)

        except Exception:
            # If any error occurs, return the first sample as a fallback
            return self.__getitem__(0)

        # Return the tensors along with the actual length of the English sentence.
        return en_tensor, len(en_encoded), fr_decoder_input, fr_decoder_label

    def __len__(self):
        # Return the total number of samples in the dataset
        return EN_FR_Dataset.TOTAL_SAMPLES

    def close(self):
        # Close the database connection
        self.conn.close()

    @staticmethod
    def collate_fn(batch):
        # `batch` is a list of samples of the form [(en_tensor, len(en_encoded), fr_decoder_input, fr_decoder_label), ...]
        # so we unpack every tuple in the list.
        en_sentences, en_seq_lengths, fr_decoder_inputs, fr_decoder_labels = zip(*batch)

        # Pad every sequence in the batch to the max sequence length using pad_sequence with PAD_IDX for padding
        padded_en_sentences = nn.utils.rnn.pad_sequence(en_sentences, batch_first=True, padding_value=PAD_IDX)
        padded_fr_decoder_inputs = nn.utils.rnn.pad_sequence(fr_decoder_inputs, batch_first=True, padding_value=PAD_IDX)
        padded_fr_decoder_labels = nn.utils.rnn.pad_sequence(fr_decoder_labels, batch_first=True, padding_value=PAD_IDX)

        # Convert sequence lengths to a LongTensor (Pytorch requires it this way).
        en_seq_lengths = torch.as_tensor(en_seq_lengths, dtype=torch.long)

        # Return the padded tensors and sequence lengths
        return padded_en_sentences, en_seq_lengths, padded_fr_decoder_inputs, padded_fr_decoder_labels


In [6]:
# Import LightningDataModule from PyTorch Lightning
import lightning as L

# Define a custom class that inherits from LightningDataModule
# A LightningDataModule organizes data-related code
# It separates data processing from model training
# It also makes data code reusable and shareable
class TranslationDataModule(L.LightningDataModule):

  # Define the constructor with three arguments
  # batch_size: number of samples per iteration
  # num_workers: number of processes for data loading
  # train_ds: dataset object with translation pairs
  def __init__(self, batch_size, num_workers, train_ds):
    super().__init__()

    # Assign the arguments to attributes
    self.batch_size = batch_size
    self.num_workers = num_workers
    self.train_ds = train_ds

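  # Override the setup method
  # This method is called before data loaders are created (Lightning may call it once per stage)
  # It is used for data preprocessing or splitting
  def setup(self, stage: str):
    # Split train_ds into training and validation datasets with an 80/20 ratio.
    # The guard makes sure we only split once, even if setup() runs again for another stage.
    if not hasattr(self, "val_ds"):
      self.train_ds, self.val_ds = random_split(
        self.train_ds,
        (0.8, 0.2)
      )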

  # Define the dataloader methods for both training and validation.
  # It returns a DataLoader object for training data
  # A DataLoader handles batching, shuffling, and sampling of the data gotten from our dataset.
  # It also supports multiprocessing and prefetching
  def train_dataloader(self):
    # Create and return a DataLoader with these arguments:
    # - dataset: train_ds with training data
    # - batch_size: batch_size attribute
    # - num_workers: num_workers attribute
    # - shuffle: True to shuffle the training data before each epoch (An epoch is a full iteration through our entire dataset).
    # - prefetch_factor: 2 to prefetch 2 batches per worker
    # - collate_fn: our function to combine samples into a batch
    return DataLoader(
      self.train_ds,
      batch_size=self.batch_size,
      num_workers=self.num_workers,
      shuffle=True,
      prefetch_factor=2,
      collate_fn=EN_FR_Dataset.collate_fn
      )

  # Define the val_dataloader method
  # It returns a DataLoader object for validation data
  # It is similar to train_dataloader, but uses val_ds as dataset
  def val_dataloader(self):
    # Note: We don't shuffle the validation dataset. We use this dataset to evaluate the model's
    # performance after each epoch.
    return DataLoader(
      self.val_ds,
      batch_size=self.batch_size,
      num_workers=self.num_workers,
      prefetch_factor=2,
      collate_fn=EN_FR_Dataset.collate_fn
      )

In [47]:
# create the dataset by pointing to the sqlite db file we created earlier.
translation_dataset = EN_FR_Dataset(db_path=TRANSLATION_DB_FILE_PATH, tokenizer=tokenizer)

In [None]:
# test our dataset by attempting to get the first sample
translation_dataset[0]

(tensor([39429, 94118, 10108, 10105, 30066]),
 5,
 tensor([  101, 42330, 10104, 10109, 30066]),
 tensor([42330, 10104, 10109, 30066,   102]))

In [48]:
# create our translation data module that handles loading batches of data from our dataset.
translation_datamodule = TranslationDataModule(
    batch_size=24,
    num_workers=2,
    train_ds=translation_dataset)

# Note: Larger batch sizes (e.g. 128 or even 1024) are typically used to make full use of the GPU.
# But as you will see, the model is large and my GPU memory was small (just 15 GB), so I had to go with a batch size of 24.

# **Transformer Architecture**
Phew! 😌 We've done the hard work of creating a dataset to train our model. Now it's time to enjoy the fruits of our labor - building a transformer model.

We will be building a popular variant of the original transformer called the ReZero Transformer. It's a neat variant that's simpler, faster, and more stable.

How cool is that? 😎 Let's continue our journey 🛣️


<img src="https://onedrive.live.com/embed?resid=8C3CCBBA832CF1E0%21601&authkey=%21AHHEOlJE806ebXk&width=724&height=832" width="724" height="832" />

The ReZero transformer is a simple modification of the standard transformer architecture that improves signal propagation and convergence speed. It removes the layer normalization (the Norm in Add & Norm in the figure above 👆) and instead multiplies the output of each sub-layer by a learned scalar, initialized to zero, before adding it to the residual skip connection.

This means that each layer starts out passing its input through unchanged and only adds a small amount of new information to it. How much it adds is learned during training and can grow as needed.
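
The ReZero update can be written as `x + alpha * sublayer(x)`, where `alpha` is a single learned number that starts at zero. The sketch below shows the effect on a plain list of numbers; `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward block, and the value `0.1` is an arbitrary example of what training might produce.

```python
def rezero_residual(x, sublayer, alpha):
    """ReZero update: add the sub-layer output, scaled by alpha, to the input."""
    return [xi + alpha * fi for xi, fi in zip(x, sublayer(x))]


def toy_sublayer(x):
    # Hypothetical stand-in for attention or a feed-forward block.
    return [2.0 * xi + 1.0 for xi in x]


x = [1.0, -2.0, 0.5]

at_init = rezero_residual(x, toy_sublayer, alpha=0.0)        # alpha starts at zero
after_training = rezero_residual(x, toy_sublayer, alpha=0.1)  # example learned value

print(at_init)         # identical to x: the layer is an identity at the start
print(after_training)  # a small contribution from the sub-layer is mixed in
```

Starting every layer as an identity is what lets the signal pass cleanly through a deep stack at the beginning of training.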

⭐ Picture credits: [Borealis AI](https://www.borealisai.com/research-blogs/tutorial-17-transformers-iii-training/#Better_methods_for_training_transformers)

<img src="https://onedrive.live.com/embed?resid=8C3CCBBA832CF1E0%21598&authkey=%21ANA81xeriJk1m5o&width=2560&height=641" width="800" height="201" />

## **Explanation & Implementation**
In this section, we will explain and implement our transformer step by step. We have already learned about the **embedding layer**, which maps `token ids` to `vectors`.

This layer is shown as the **input and output embeddings** in the transformer picture above 👆👆.

So, we will start our explanation from **positional encoding**.


### **Positional Encodings**
The position of each word in a sentence affects its meaning. For example, "I love pizza" 🍕 and "Pizza love I" have the same words, but different orders and meanings.

We can use positional encoding to represent the position of each word as a vector, similar to `word embeddings`. This way, we can teach a computer to understand the order and the context of the words in a sentence.

There are two ways of creating positional encoding vectors:

- Fixed positional encoding: The vectors are predefined and fixed. Formulas based on sine and cosine functions are used:

$$PE_{(pos,2i)} = \sin(pos / 10000^{2i / d_{model}})$$
$$PE_{(pos,2i+1)} = \cos(pos / 10000^{2i / d_{model}})$$

where $pos$ is the position, $i$ indexes pairs of dimensions, and $d_{model}$ is the size of the embedding dimension (number of elements in the positional encoding vector). This approach captures relative distances and handles variable-length sentences. *`[This is what the original transformer paper did]`*.

- Learned positional encoding: The vectors are randomly initialized and learned by the model. An embedding layer is used, with a different embedding for each position. This way the model learns the positional information to add to the data, enabling it to potentially capture more complex patterns. *`[This is what BERT-style models, including our pre-trained model, do]`*.
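
For reference, the fixed sine/cosine variant from the first bullet can be sketched in plain Python. This notebook uses the learned variant instead, so the function below is only an illustration of the formulas; its name and the tiny sizes are assumptions.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model):
    """Build a num_positions x d_model table from the sine/cosine formulas."""
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for even_col in range(0, d_model, 2):
            # even_col plays the role of 2i in the formulas.
            angle = pos / (10000 ** (even_col / d_model))
            row[even_col] = math.sin(angle)          # PE(pos, 2i)
            if even_col + 1 < d_model:
                row[even_col + 1] = math.cos(angle)  # PE(pos, 2i+1)
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(num_positions=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(value, 3) for value in row])
```

Position 0 always encodes to alternating 0 and 1, and later columns change more slowly with position than earlier ones.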

Note: When using learned positional encodings, we need to put a limit on the number of tokens our model can take in. This is because the positional encoding layer has an embedding matrix with as many rows as our maximum number of tokens. Its number of columns must also match that of the word embeddings because we add (`+`) them together; this way we get a new vector that contains both the meaning and the position of each word.

In this tutorial, we use the positional encoding layer from the `pre_trained_model`. It has a token limit of 512, and our model uses `TOKEN_LIMIT = 350`, so we are good to go.


In [7]:
pos_embedding_layer = pre_trained_model.get_position_embeddings()
pos_embedding_layer = pos_embedding_layer.requires_grad_(False) # Freezes knowledge as we have seen before.

pos_embedding_layer # Note how it supports up to 512 tokens and has an embed dim of 768 just like our word embedding layer.

Embedding(512, 768)

In [8]:
class LearnedPositionalEncoding(nn.Module):

    def __init__(self, *, pos_embed_layer):
        """
        Initializes the module with a learned positional embedding layer.

        Args:
            pos_embed_layer: Our Pre-Trained nn.Embedding layer containing positional encodings.
        """
        super().__init__()
        self.positional_embedding = pos_embed_layer

    def forward(self, X):
        """
        Adds learned positional encodings to the input sequence.

        Args:
            X: Input sequence of token embeddings (batch_size, sequence_length, embedding_dim).

        Returns:
            Encoded sequence with positional information added (same shape as X).
        """

        position_indices = torch.arange(X.size(1), device=X.device)  # Get position indices
        positional_embeddings = self.positional_embedding(position_indices)  # Lookup position embeddings

        # Expand positional embeddings for batch-wise operation
        expanded_positional_embeddings = positional_embeddings.unsqueeze(0)

        return X + expanded_positional_embeddings  # Add embeddings to input sequence


### **Multi-Head Attention & Masked Multi-Head Attention**

<img src="https://onedrive.live.com/embed?resid=8C3CCBBA832CF1E0%21597&authkey=%21ABpALYIXmMaQ8os&width=1834&height=842" width="700" height="300" />

**Before we understand Multi-Head & Masked Multi Head Attention we need to understand Scaled Dot-Product Attention**

#### Scaled Dot-Product Attention (SDPA)

Imagine a vibrant party teeming with words, each eager to understand its peers. This is the essence of Scaled Dot-Product Attention (SDPA), the heart of the Transformer, a powerful language model. But before we hit the dance floor, let's meet the key players:

**Word Embeddings:** Think of these as name tags at the party, each word getting a unique vector representing its meaning. But in sentences like "Do what is Right" and "Shift to your Right," the word "Right" has the same tag despite differing contexts.

**Enter SDPA, the party game changer!** It introduces three roles:

- **Query (Q):** The curious word asking, "Who has the information I need?"
- **Key (K):** Like a name tag, revealing relevant skills or knowledge.
- **Value (V):** The actual hidden talent or information the word possesses.

Now, picture each word comparing its query (e.g., "Who has the information I need?") with everyone's keys (e.g., the information they have). The more relevant the key's information, the higher the "attention score" it gets. Think of it as noticing someone with a matching talent you need!

But how does everyone stay informed about these scores? This is where the **`attention weight matrix`** comes in! It's like a giant scoreboard displayed at the party, where each cell shows the attention score between a specific word pair. For example, the cell at row "Right" and column "Shift" would hold the score indicating how well these words relate.

But how does this translate into updating the word embeddings?
That's where **softmax** steps in, acting as the **regulator** who assigns weights to each score based on its relative importance.

Imagine the **regulator** listening to each word's scores and saying, "Okay, so 'Shift' is quite relevant for determining 'Right' meaning in this context, so you get a high weight."

Here's how softmax works its magic:

1. **Listen to All Scores:** It takes all the attention scores for a specific word (a row in our `attention weight matrix`) as input.

2. **Apply the Formula:** It uses a mathematical formula to consider the relative strength of each score compared to the others. This ensures that scores that are much higher than others have a larger impact on the final weights.

3. **Distribute the Weights:** Softmax transforms these relative strengths into weights between 0 and 1 that add up to 1, ensuring everyone gets a share of attention but the most relevant ones get a bigger slice.
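
As a small illustration of the three steps above, here is softmax applied to one row of scores. The numbers are made up for this sketch and do not come from the model.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw attention scores of one word against three words
# (one row of the score matrix).
scores = torch.tensor([2.0, 1.0, 0.1])

# Softmax exponentiates each score and divides by the sum of the exponentials.
weights = F.softmax(scores, dim=-1)

print(weights)        # approximately [0.659, 0.242, 0.099]
print(weights.sum())  # the weights add up to 1
```

Notice that the highest score gets the biggest slice, but the lower scores still receive a little attention.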

Now, each word has a set of weights based on its interactions with others. Think of it as collecting insights from the most relevant conversations at the party. These weights are then used to perform a weighted combination of everyone's values in order to create a richer, context-aware representation of the word's meaning.

In our example, "Right" in "Do what is Right" might learn a stronger "morality" value, while in "Shift to your Right," it gains a stronger "direction" value.

**The result? Word embeddings that truly reflect their meaning in each sentence, just like people adapting their communication based on the context!**

**Bonus Math (optional):**

The core calculation behind SDPA involves the attention weight matrix and softmax, expressed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \cdot V$$

where:

- $Q$, $K$, $V$ are matrices whose rows are the query, key, and value vectors of every word.
> *Note: $K^T$ is the transpose of $K$. Transposing lines up the vector dimensions (apples to apples 🍏) so that each query can be compared with every key.*

- **$\cdot$** denotes the dot product (measuring similarity).

- $\sqrt{d_k}$ is a scaling factor for stability, where $d_k$ is the size of each key vector (so very high scores don't totally drown out lower ones).

- $\text{softmax}$ distributes weights between 0 and 1 based on attention scores.

- **$\cdot$** $V$ uses the **attention weights** (the softmax output) to do a weighted combination of everyone's values to get a new, richer embedding.
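
The formula above can be sketched in a few lines of PyTorch. This is a minimal illustration for a single sentence, without heads or masking; the function name and the tensor sizes are choices made for this sketch.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Illustrative SDPA: Q and K are (seq_len, d_k), V is (seq_len, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-1, -2) / math.sqrt(d_k)  # (seq_len, seq_len) similarity scores
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V, weights                        # weighted combination of the values

torch.manual_seed(0)
# 4 words, each with vectors of size 8 (arbitrary sizes for the sketch).
Q, K, V = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each row of `weights` is one word's view of the whole sentence, and each row of `output` is that word's new context-aware embedding.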

#### **Multi-Head Attention**
This is a mechanism in the Self-Attention process where multiple "heads" or focus groups are used. Remember our word party? Multi-head attention throws another twist! Imagine multiple groups (heads) at the party, each focusing on different aspects of the conversation. For instance, one head might concentrate on semantic context, while another might focus on emotional context.

**Here are the details:**

* Each group (head) has its own "attention weight matrix" showing how relevant other words are to their focus.
* They analyze independently, like different games happening at once.
* In the end, they concatenate (Concat in the image above 👆👆) their insights, creating a richer understanding of each word, like piecing together clues from different groups.

#### **Masked Multi-Head Attention**
This is used in the decoder layer to prevent the model from seeing future words. This is achieved by replacing entries above the main diagonal of the attention matrix with `-inf` before performing softmax, a technique known as **Causal Masking**. (Imagine it as putting black tape above the main diagonal of the attention weight matrix).

For example, in the sentence "The cat sat on the mat.", for the word "sat", Masked Multi-Head Attention only considers "The", "cat", and "sat" itself, ignoring "on", "the", and "mat".
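
To make the "black tape" concrete, here is a small sketch of causal masking on a 4-token sequence. The scores are placeholder zeros, so every allowed position ends up with an equal share of attention.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.zeros(seq_len, seq_len)  # placeholder scores: every pair is equally relevant

# -inf strictly above the main diagonal, 0 on and below it.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# After softmax, the -inf entries become exactly 0.
weights = F.softmax(scores + causal_mask, dim=-1)
print(weights)  # row i spreads its attention evenly over positions 0..i only
```

The first row can only look at itself, while the last row can look at the whole sequence.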

In addition to causal masking, a **Padding Mask** is used to prevent the model from attending to `[PAD]` tokens added to equalize the lengths of sequences in a batch. The attention scores of `[PAD]` tokens are set to `-inf`, ensuring these tokens do not affect the final attention output.

In the case of padding mask, if we have a batch of two sequences: ["The cat sat down", "Good Morning [PAD] [PAD]"], the model's focus remains solely on the meaningful words in the sequence.
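
The batch above can be sketched as follows. This is a minimal illustration with placeholder scores, assuming the `[PAD]` tokens sit at the end of each sequence; the variable names are chosen for this sketch.

```python
import torch
import torch.nn.functional as F

# Real (unpadded) lengths of the two sequences; both are padded to length 4.
seq_lengths = torch.tensor([4, 2])
max_len = 4

# True where a key position holds a [PAD] token.
positions = torch.arange(max_len)
pad_mask = positions[None, :] >= seq_lengths[:, None]   # (batch, max_len)

scores = torch.zeros(2, max_len, max_len)                # placeholder scores (batch, query, key)
scores = scores.masked_fill(pad_mask[:, None, :], float("-inf"))
weights = F.softmax(scores, dim=-1)

print(weights[1, 0])  # tensor([0.5000, 0.5000, 0.0000, 0.0000])
```

In the second sequence, all attention goes to "Good" and "Morning", and none to the two `[PAD]` positions.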

Note: Multi-Head & Masked Multi-Head Attention also have a projection layer (The Linear in the above 👆👆 image). Its job is to project these embeddings, updated from word context, to a more concise form for the Model.

💡Note: In our implementation below, instead of creating multiple heads, each with its own networks for the queries (Q), keys (K) and values (V), we use one big network each for Q, K, and V and split the output of these networks among the heads. (This basically does the same thing as creating multiple heads separately; it is just a more compute-efficient way.)
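
The splitting trick can be sketched with plain tensor reshapes. The sizes below are arbitrary and the random tensor stands in for the output of one shared projection network.

```python
import torch

batch_size, seq_len, dim_qkv, num_heads = 2, 5, 16, 4  # arbitrary illustrative sizes
head_dim = dim_qkv // num_heads

queries = torch.randn(batch_size, seq_len, dim_qkv)     # stand-in for one shared projection's output

# Split the last dimension into heads: (batch, seq, dim) -> (batch, heads, seq, head_dim)
per_head = queries.reshape(batch_size, seq_len, num_heads, head_dim).permute(0, 2, 1, 3)

# "Concatenate" the heads again: (batch, heads, seq, head_dim) -> (batch, seq, dim)
merged = per_head.permute(0, 2, 1, 3).reshape(batch_size, seq_len, dim_qkv)

print(per_head.shape)                # torch.Size([2, 4, 5, 4])
print(torch.equal(merged, queries))  # True
```

Each head simply receives its own slice of the big projection, so no information is lost when the heads are merged back.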

In [9]:
import torch.nn.functional as f

In [10]:
class MultiHeadAttention(nn.Module):

    def __init__(self, *, dim_qkv, dim_model, num_heads, causal=False, **kwargs):
        super().__init__(**kwargs)

        # Ensure dim_qkv (embedding dimension) is divisible by number of heads
        assert dim_qkv % num_heads == 0, "DIM_QKV must be divisible by num_heads."

        self.num_heads = num_heads
        self.causal = causal

        # Scaling factor for attention scores
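        # (square root of the per-head dimension; computed without relying on the undocumented `torch.math` alias)
        self.scale_factor = (dim_qkv // num_heads) ** 0.5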

        # The neural networks for queries, keys, and values from the word embeddings
        self.obtain_queries = nn.Linear(dim_model, dim_qkv, bias=False)
        self.obtain_keys = nn.Linear(dim_model, dim_qkv, bias=False)
        self.obtain_values = nn.Linear(dim_model, dim_qkv, bias=False)

        # Final linear projection to get our updated context information back to the same shape as
        # the input (concise form for the model).
        self.projection = nn.Linear(dim_qkv, dim_model)

    def forward(self, queries, keys, values, seq_lengths=None):

        # Get our queries, keys, and values from the embeddings
        queries = self.obtain_queries(queries)
        keys = self.obtain_keys(keys)
        values = self.obtain_values(values)

        # Reshape for multi-head attention (will shape our matrix as if we create multiple heads separately).
        queries = self.parallel_reshape(queries)
        keys = self.parallel_reshape(keys)
        values = self.parallel_reshape(values)

        # Calculate attention scores (Remember our attention weight matrix 🙂)
        attention = queries @ keys.transpose(-1, -2)

        # Scale scores to prevent saturation (so high scores don't totally drown lower ones)
        attention = attention / self.scale_factor

        # Apply causal mask (used when we pass causal=True, i.e. masked multi-head attention)
        if self.causal:
            attention = attention + self.get_causal_mask(queries)
        elif seq_lengths is not None:
            # Apply padding mask (performed only for regular multi-head attention).
            # Note: we skip the padding mask for causal attention because, with the padding at the
            # end of each sequence, a real token can only see earlier positions, so the padded
            # tokens can't spoil the embeddings of actual tokens.
            attention = torch.masked_fill(attention, self.get_timeseq_mask(seq_lengths), -torch.inf)

        # Calculate attention distribution (softmax)
        attention = f.softmax(attention, dim=-1)

        # Apply attention distribution to values (Weighted combination of the values to form better contextual embeddings)
        X = attention @ values

        # Reshape back to original (This is similar to us concatenating the information from multiple heads).
        X = self.reverse_reshape(X)

        # Final linear projection (Project back our rich embeddings to same shape as the inputs, a concise form for the model)
        X = self.projection(X)

        return X

    def parallel_reshape(self, tensor):
        """
        Reshapes the input tensor for efficient dot product computation in multi-head attention.

        This function rearranges the dimensions of the input tensor to facilitate parallel
        computation across multiple attention heads. It achieves this by:

        1. Reshaping the tensor from (batch_size, seq_len, dim) to (batch_size, seq_len, num_heads, head_dim).
        2. Permuting the dimensions to (batch_size, num_heads, seq_len, head_dim).

        This new format enables efficient computation of attention scores between queries, keys, and values
        from different heads in parallel.
        """
        Batch_size, Seq_len = tensor.shape[0], tensor.shape[1]
        return tensor.reshape(Batch_size, Seq_len, self.num_heads, -1).permute(0, 2, 1, 3)

    def reverse_reshape(self, tensor):
        """
        Reshapes the output tensor from multi-head attention back to its original format.

        This function reverses the reshaping performed in `parallel_reshape` to obtain the original
        tensor format (batch_size, seq_len, dim) used by the rest of the model. It achieves this by:

        1. Permuting the dimensions to (batch_size, seq_len, num_heads, head_dim).
        2. Reshaping the tensor to (batch_size, seq_len, dim).
        """
        Batch_size, Seq_len = tensor.shape[0], tensor.shape[2]
        return tensor.permute(0, 2, 1, 3).reshape(Batch_size, Seq_len, -1)

    def get_timeseq_mask(self, x_lengths):
        """
        Creates a mask to prevent attention to padded tokens in the sequence.

        This function generates a mask that prevents the attention mechanism from attending to padded
        tokens in the input sequence. This is crucial because attending to irrelevant padded information
        can negatively impact the model's performance.

        The mask is built by first marking positions whose index is smaller than the corresponding
        sequence length (valid positions) as `True`, and then inverting the result, so the returned
        mask is `True` at padded positions and `False` at valid ones.
        """
        max_seq_len = x_lengths.max().item()  # Get the maximum sequence length in the batch

        # Create a sequence of numbers from 0 to max_seq_len - 1 (representing positions in the sequence)
        ids = torch.arange(0, max_seq_len, device=x_lengths.device)

        # Broadcast the sequence lengths to create a comparison matrix (batch_size, seq_len)
        # True where a position's id is less than the corresponding sequence length (valid position)
        mask = ids[None, :] < x_lengths[:, None]

        # Invert the mask to obtain True for positions that should be masked (padded tokens)
        return ~mask[:, None, None]


    def get_causal_mask(self, X):
        """
        Creates a mask to prevent attention to future tokens in the decoder, enforcing causality.

        This function generates a mask that prevents the decoder from attending to tokens that appear
        later in the sequence. This is crucial to enforce causality, as the decoder should only use
        information available up to the current position to predict the next token.

        The mask holds `-inf` at all positions above the main diagonal and `0` everywhere else. Adding
        it to the attention scores blocks the attention mechanism from looking ahead. This mimics the
        real-world scenario where we cannot predict the future.
        """

        max_seq_len = X.size(2)  # Get sequence length from the input tensor
        mask = nn.Transformer.generate_square_subsequent_mask(max_seq_len, device=X.device) # generate mask above the main diagonal (like our black tape).
        return mask


### **The Encoder Block**

<img src="https://onedrive.live.com/embed?resid=8C3CCBBA832CF1E0%21602&authkey=%21AOjn7M-A8lP-0Is&width=272&height=390" width="272" height="390" />

We crossed out the **Add&Norm** because we are replacing it with **ReZero** in this model (as we will see shortly).

The first layer shown is a **Multi-Head Attention** layer within the encoder block. This is called a **Self-Attention** layer because the multi-head attention mechanism draws its queries, keys and values only from the encoder input embeddings themselves.

In other words, the input embeddings are enriched solely based on relationships within themselves, without any external context.

We also notice the **skip connection** arrow that bypasses this Self-Attention layer, connecting the input directly to the output. This allows signals to propagate easily through the model. With ReZero, we omit the Layer normalization in the Add&Norm, instead doing a weighted addition as follows:

$$\text{skip} + \text{re_zero_weight} \times \text{output}$$

There is also a **dropout** layer (not shown in the image) that randomly sets some outputs to zero with a certain probability during training. This discourages the model from relying too heavily on any single feature, forcing it to utilize as much information as it can get. So the ReZero connection actually looks like:

$$\text{skip} + \text{re_zero_weight} \times \text{dropout}(\text{output})$$
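
The connection above can be sketched as a small wrapper module. `ReZeroResidual` is a hypothetical helper written for this illustration, and the dropout rate and layer sizes are arbitrary.

```python
import torch
from torch import nn

class ReZeroResidual(nn.Module):
    """Hypothetical helper: wraps any sub-layer with a ReZero connection."""

    def __init__(self, sublayer, dropout_rate=0.1):
        super().__init__()
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout_rate)
        self.re_zero_weight = nn.Parameter(torch.tensor(0.0))  # learnable, starts at zero

    def forward(self, x):
        # skip + re_zero_weight * dropout(output)
        return x + self.re_zero_weight * self.dropout(self.sublayer(x))

block = ReZeroResidual(nn.Linear(8, 8))
x = torch.randn(2, 3, 8)
print(torch.equal(block(x), x))  # True: at initialization the block passes its input through unchanged
```

Because the weight starts at zero, every block begins as an identity function, and the model learns how much of each sub-layer's output to let through.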

Next is the **feed forward network**, whose purpose is to process all the information extracted by the previous layers into a more organized and understandable representation for later stages of the model. It also employs a residual skip connection.


In [27]:
# Note: In transformers, we usually set a size called dim_model. We do this so that the output
# of all layers, blocks and sub-blocks in the models have the same size since they will almost
# always be interacting with each other.

class EncoderBlock(nn.Module):
    """
    Represents a single encoder block in a Transformer model with ReZero modification.
    """

    def __init__(self, dim_qkv, dim_model, num_heads, dim_ffn, dropout_rate, **kwargs):
        super().__init__(**kwargs)

        # Multi-Head Self-Attention Layer
        self.multi_head_attn = MultiHeadAttention(
            dim_qkv=dim_qkv,  # Dimension of query, key, and value vectors
            dim_model=dim_model,  # Size of dim_model
            num_heads=num_heads  # Number of attention heads
        )

        # Feed-Forward Network
        self.ffn = nn.Sequential(
            nn.Linear(dim_model, dim_ffn),  # First linear layer
            nn.ReLU(),  # ReLU activation for non-linearity (to aid learning complex patterns)
            nn.Linear(dim_ffn, dim_model)  # Second linear layer
        )

        # Dropout for regularization (forcing the model to utilize as much information as it can get.)
        self.dropout = nn.Dropout(dropout_rate)

        # ReZero parameter (learnable weight for weighted addition, typically initialized to zero)
        self.reZero = nn.Parameter(torch.tensor(0.0))

    def forward(self, X, X_len):
        """
        Better encodes the input sequence X (in this case, our English sentence).

        X_len is the actual length of every sentence in the batch (remember we padded the batch).
        It is passed to the self-attention layer so that it knows which pad tokens to
        mask during attention.
        """

        # Skip connection for residual addition
        skip = X

        # Multi-Head Self-Attention with ReZero
        X = self.multi_head_attn(queries=X, keys=X, values=X, seq_lengths=X_len)
        X = self.dropout(X)
        X = skip + self.reZero * X  # ReZero weighted addition

        # Feed-Forward Network with ReZero
        skip = X
        X = self.ffn(X)
        X = self.dropout(X)
        X = skip + self.reZero * X  # ReZero weighted addition

        return X


In [12]:
class EncoderLayer(nn.Module):
    """
    Composes multiple EncoderBlocks to form a deep encoder layer.

    The EncoderLayer holds a collection of chained EncoderBlock modules
    that are applied sequentially to the input. By chaining multiple blocks,
    the model can learn increasingly complex and abstract patterns in the
    input text enabling it to create better representations.
    """

    def __init__(self,
                 num_encoder_blocks,
                 dim_qkv,
                 dim_model,
                 num_heads,
                 dim_ffn,
                 dropout_rate,
                 **kwargs) -> None:

        super().__init__(**kwargs)

        # Collection of Encoder Blocks
        self.encoder_blocks = nn.ModuleList([
            EncoderBlock(
                dim_qkv,
                dim_model,
                num_heads,
                dim_ffn,
                dropout_rate
            ) for _ in range(num_encoder_blocks)
        ])

    def forward(self, X, X_len):
        """
        Passes the input through each Encoder Block sequentially.
        """
        for encoder in self.encoder_blocks:
            X = encoder(X, X_len)

        return X


### **The Decoder Block**

<img src="https://onedrive.live.com/embed?resid=8C3CCBBA832CF1E0%21603&authkey=%21ALcQ4yPyrJ1GpqM&width=269&height=425" width="269" height="425" />

Just like in the encoder block we crossed out the **Add&Norm** because we are replacing it with **ReZero** in this model.
<br>

**Decoder Block Similarities and Key Differences:**

The decoder block, like the encoder block, consists of multiple layers that process information sequentially. However, the decoder has two key differences crucial for translating text:

**1. Masked Multi-Head Attention:**

- This layer handles **self-attention** within the decoder's input, like the encoder block.

- **Crucial Difference:** It uses **causal masking** 😷 to prevent the model from "cheating" by peeking at future tokens during training. This aligns with our "teacher forcing" technique.

- Imagine it as building a sentence one word at a time, without knowing what the next word will be. This enforces learning based on context and previously generated words.

**2. Cross Attention:**

( The Multi-Head Attention layer highlighted with yellow 🟨, also notice the two arrows ⤴⤴ coming from outside [actually from the encoder block] )

- This layer captures the relationship between the **decoder input** (queries) and the **encoder's enriched representation** (keys and values).

- Think of it as the decoder consulting an "information sheet" (encoder's representation) while masked from its future words, helping it understand the context and generate relevant translations.

- The **skip connection** above this layer combines the information from the previous Masked Multi Head Attention and the encoded context (Cross Attention).

📝 Note: In Cross Attention we also have to pass the actual lengths of the encoder sentences for masking, so that we don't pull encoded information from the `[PAD]` tokens.
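
Here is a rough sketch of cross attention with the encoder padding mask. To keep it short it uses a single head and leaves out the learned Q, K, V projections; all sizes and variable names are assumptions made for this illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, dec_len, enc_len, dim = 2, 3, 5, 8        # arbitrary illustrative sizes
dec_states = torch.randn(batch, dec_len, dim)    # queries come from the decoder
enc_outputs = torch.randn(batch, enc_len, dim)   # keys and values come from the encoder
enc_seq_lengths = torch.tensor([5, 3])           # the second source sentence ends with 2 [PAD] tokens

# One score per (decoder position, encoder position) pair.
scores = dec_states @ enc_outputs.transpose(-1, -2) / dim ** 0.5       # (batch, dec_len, enc_len)

# True at encoder [PAD] positions, which are then hidden from every decoder position.
pad_mask = torch.arange(enc_len)[None, :] >= enc_seq_lengths[:, None]  # (batch, enc_len)
scores = scores.masked_fill(pad_mask[:, None, :], float("-inf"))

weights = F.softmax(scores, dim=-1)
context = weights @ enc_outputs                                         # (batch, dec_len, dim)
print(weights.shape, context.shape)
```

Note that the attention matrix is not square here: its rows follow the decoder length and its columns follow the encoder length.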

In [26]:
class DecoderBlock(nn.Module):
    """
    Represents a single decoder block in a Transformer model with ReZero modification.
    """

    def __init__(self, dim_qkv, dim_model, num_heads, dim_ffn, dropout_rate, **kwargs):
        super().__init__(**kwargs)

        # Masked Multi-Head Attention for self-attention within decoder input
        self.masked_multi_head_attn = MultiHeadAttention(
            dim_qkv=dim_qkv,
            dim_model=dim_model,
            num_heads=num_heads,
            causal=True  # Use causal masking to prevent future token peeking
        )

        # Cross Attention to attend to the encoded representation
        self.cross_attn = MultiHeadAttention(
            dim_qkv=dim_qkv,
            dim_model=dim_model,
            num_heads=num_heads
        )

        # Feed-Forward Network just like we saw in encoder block.
        self.ffn = nn.Sequential(
            nn.Linear(dim_model, dim_ffn),
            nn.ReLU(),
            nn.Linear(dim_ffn, dim_model)
        )

        # Dropout for regularization (just like in encoder block).
        self.dropout = nn.Dropout(dropout_rate)

        # ReZero parameter for weighted addition
        self.reZero = nn.Parameter(torch.tensor(0.0))

    def forward(self, X, enc_outputs, enc_seq_lengths):
        """
        Processes the decoder input sequence X using attention to the encoded representation.

        Args:
            X (torch.Tensor): Decoder input sequence of shape (batch_size, seq_len, dim_model).

            enc_outputs (torch.Tensor): Encoded output from the encoder of shape (batch_size, enc_seq_len, dim_model).

            enc_seq_lengths (torch.Tensor): Sequence lengths for the encoder of shape (batch_size,).

        Returns:
            torch.Tensor: Updated decoder output of shape (batch_size, seq_len, dim_model).
        """

        # Skip connection for residual addition
        skip = X

        # Masked Multi-Head Attention (Self-Attention)
        # - Prevents peeking at future words during training
        # - Focuses on context within the decoder's input
        X = self.masked_multi_head_attn(queries=X, keys=X, values=X)
        X = self.dropout(X)
        X = skip + self.reZero * X  # ReZero weighted addition

        # Skip connection for residual addition
        skip = X

        # Cross Attention
        # - Attends to the encoded representation for context
        # - Combines information from decoder and encoder
        # - Also pass the encoder sequence lengths so we don't try to get encoded information from [PAD] tokens.
        X = self.cross_attn(queries=X, keys=enc_outputs, values=enc_outputs, seq_lengths=enc_seq_lengths)
        X = self.dropout(X)
        X = skip + self.reZero * X  # ReZero weighted addition

        # Skip connection for residual addition
        skip = X

        # Feed-Forward Network (just like we have seen before).
        X = self.ffn(X)
        X = self.dropout(X)
        X = skip + self.reZero * X  # ReZero weighted addition

        return X


In [14]:
class DecoderLayer(nn.Module):
    """
    Composes multiple DecoderBlocks to form a deeper decoder layer.
    """

    def __init__(self, num_decoder_blocks, dim_qkv, dim_model, num_heads, dim_ffn, dropout_rate, **kwargs):
        super().__init__(**kwargs)

        # Collection of Decoder Blocks
        self.decoder_blocks = nn.ModuleList([
            DecoderBlock(
                dim_qkv,
                dim_model,
                num_heads,
                dim_ffn,
                dropout_rate
            ) for _ in range(num_decoder_blocks)
        ])

    def forward(self, X, enc_out, enc_seq_lengths):
        """
        Passes the input through each Decoder Block sequentially.
        """
        for decoder in self.decoder_blocks:
            X = decoder(X, enc_out, enc_seq_lengths)

        return X

Wow 😁, we have come a long way! Before we start putting the pieces together to build our Transformer, let's define some hyperparameters for our model (settings we choose ourselves, such as the sizes of the components in our architecture).

In [15]:
# Model Hyper-Parameters

# To ensure the output of all blocks, sub-blocks in the model have the same size as they will be interacting.
DIM_MODEL = 768

# The size our feed-forward networks first expand their inputs to before applying nn.ReLU (non-linearity to capture complex patterns)
# and then projecting back to the input size.
DIM_FFN = 784

# The Size of the queries, keys and values in the attention blocks.
# Note: This will be shared equally among all the heads, so it must be divisible by the number of heads we choose.
DIM_QKV = 512

# The probability with which each output is turned off (zeroed). 0.2 means 20%.
DROPOUT_RATE = 0.2

# The vocabulary size.
VOCAB_SIZE = tokenizer.vocab_size

### **Putting the Pieces Together: Transformer**

<img src="https://onedrive.live.com/embed?resid=8C3CCBBA832CF1E0%21604&authkey=%21AHp5UHZoaJ4wLnA&width=724&height=832" width="724" height="832" />

Let's talk about the newly added portion (the section on top highlighted in green). This section explains how we use the information from the decoder layer (which also incorporates the information from the encoder layer) to predict the next translated word. To do this, we have a linear layer that takes this information and assigns prediction scores (called logits) to every word in our vocabulary.

The output size of this layer should match the vocabulary size, so that each word has a corresponding score. The higher the score, the more confident the model is that that word is the next one.

We also have a softmax layer that transforms these scores into probabilities between 0 and 1. These probabilities are then outputted by the model.

📝 Note: We asterisk (*) the softmax function because the specific error function we are going to use in PyTorch (as you will see below) will automatically apply it for us, so we just need to return the logits.
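
A quick sketch of why returning the logits is enough: softmax turns them into probabilities but keeps their ranking. The logits below are made-up scores for a 4-word vocabulary.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1.5, -0.3, 4.0, 0.2])  # made-up scores for a 4-word vocabulary
probs = F.softmax(logits, dim=-1)

print(probs.sum())                        # the probabilities add up to 1
print(probs.argmax() == logits.argmax())  # tensor(True): softmax keeps the ranking
```

So the most likely next word can be read directly from the logits, and the softmax is only needed when we want actual probabilities.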

Now we need an error function, which is a function that our model tries to minimize. By minimizing the error, the model learns how to perform our task (language translation in this case).

We are going to use CrossEntropyLoss as our error function. This function takes the logits representing the model's prediction of the next translated word and compares it with the actual next word. It then gives an error value based on how far off the model's prediction is from the actual word. The model can then adjust itself accordingly (thereby learning).

**Bonus Technical Detail**: CrossEntropyLoss is a negative log loss function. How does this work? Well, we take the logits and apply a softmax function on them to convert them into probabilities between 0 and 1.

We want the model to predict 1 for the actual next word (meaning the other probabilities will be 0), so how do we get an error value out of these? Well, we apply a logarithm function on the model's prediction of the actual word. If the model correctly predicts 1, then log(1) is 0, so the error value is 0.

This tells the model that it is on the right track and there is no error. But as the model's prediction for the actual word drops below 1 toward 0, the logarithm approaches $-\infty$, which means the value becomes a very large negative number.

This is a problem because we want to minimize the error function, and the way it is shaped now, the model would learn to predict 0 for the right word in order to reduce the error.

So here is where the trick comes in: we multiply the logarithm function by `-1`, so that now for predictions less than 1, the error value starts to approach $\infty$ instead. This forces the model to learn to predict 1 for the right word.

This is why it is called negative log loss: $-\log(\text{model's prediction for the actual next word})$.

Because we predict the next word at every position in a sample, and for every sample in the batch, we compute all of these error values and average them to get a single error value to minimize.
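
The steps above can be checked by hand against PyTorch's built-in function. This is a minimal sketch on random logits; the padding index `PAD = 0` and the tiny sizes are assumptions for the illustration, not the notebook's actual values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
PAD = 0                                  # assumed padding index for this sketch
batch, seq_len, vocab = 2, 4, 6          # arbitrary illustrative sizes
logits = torch.randn(batch, seq_len, vocab)
targets = torch.tensor([[3, 1, 5, 2],
                        [4, 2, PAD, PAD]])

# Manual version: softmax -> log of the probability of the correct word -> negate -> average.
log_probs = F.log_softmax(logits, dim=-1)
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(actual next word)
real = targets != PAD                                             # skip padded positions
manual_loss = -picked[real].mean()

# Built-in version: it expects the vocabulary dimension second, hence the permute.
builtin_loss = F.cross_entropy(logits.permute(0, 2, 1), targets, ignore_index=PAD)

print(torch.allclose(manual_loss, builtin_loss))  # True
```

Both versions average only over the real tokens, which is what ignoring the padding index achieves.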


In [16]:
# Import necessary to calculate the accuracy.
from torchmetrics import Accuracy

In [17]:
# We will be using a PyTorch Lightning module. Lightning makes the training process easier to build
# and gets rid of boilerplate code used in vanilla PyTorch.

class Transformer(L.LightningModule):
    """
    This class defines the Transformer model for the translation task.

    Attributes:
        pos_embed_layer (LearnedPositionalEncoding): Adds positional encodings to the input sequences.
        embedding_layer (nn.Embedding): Embeds the input tokens into vector representations.
        encoder_layer (EncoderLayer): Encoder layer of the model with stacked Encoder Blocks.
        decoder_layer (DecoderLayer): Decoder layer of the model with stacked Decoder Blocks.
        output (nn.Linear): Final layer mapping decoder outputs to vocabulary size for prediction.
        loss_fn (nn.CrossEntropyLoss): Loss function with padding token ignored.
        accuracy (Accuracy): Metric for calculating accuracy of model translation with padding token ignored.
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        # Positional Encoding
        self.pos_embed_layer = LearnedPositionalEncoding(
            pos_embed_layer=pos_embedding_layer  # Inject our pre-trained positional embeddings layer.
        )

        # Inject our pre-trained token embedding layer.
        self.embedding_layer = embedding_layer

        # Encoder
        self.encoder_layer = EncoderLayer(
            num_encoder_blocks=4,  # Hyperparameter: Number of encoder blocks
            dim_qkv=DIM_QKV,  # Hyperparameter: Dimension of query, key, and value vectors
            dim_model=DIM_MODEL,  # Hyperparameter: Ensure similar output sizes across the model
            num_heads=8,  # Hyperparameter: Number of attention heads in each block
            dim_ffn=DIM_FFN,  # Hyperparameter: Dimension of feed-forward network
            dropout_rate=DROPOUT_RATE  # Hyperparameter: Dropout rate for regularization
        )

        # Decoder
        self.decoder_layer = DecoderLayer(
            num_decoder_blocks=4,  # Hyperparameter: Number of decoder blocks
            dim_qkv=DIM_QKV,  # Hyperparameter: Dimension of query, key, and value vectors
            dim_model=DIM_MODEL,  # Hyperparameter: Ensure similar output sizes across the model
            num_heads=8,  # Hyperparameter: Number of attention heads in each block
            dim_ffn=DIM_FFN,  # Hyperparameter: Dimension of feed-forward network
            dropout_rate=DROPOUT_RATE  # Hyperparameter: Dropout rate for regularization
        )

        # Output Layer
        self.output = nn.Linear(DIM_MODEL, VOCAB_SIZE)  # Project to vocabulary size for prediction

        # Loss Function and Metrics
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # Ignore padding tokens
        self.accuracy = Accuracy(
            task='multiclass',
            num_classes=VOCAB_SIZE,
            ignore_index=PAD_IDX # Ignore padding tokens
        )

    def forward(self, X_enc, X_enc_len, X_dec):
        """
        Forward pass through the Transformer model.

        Args:
            X_enc (torch.Tensor): Input sequence for the encoder.
            X_enc_len (torch.Tensor): Sequence lengths for the encoder.
            X_dec (torch.Tensor): Input sequence for the decoder.

        Returns:
            torch.Tensor: Logits of the predicted tokens.
        """

        # Get embeddings
        X_enc = self.embedding_layer(X_enc)  # Embed encoder input
        X_dec = self.embedding_layer(X_dec)  # Embed decoder input

        # Add positional encodings
        X_enc = self.pos_embed_layer(X_enc)  # Add positional information to encoder
        X_dec = self.pos_embed_layer(X_dec)  # Add positional information to decoder

        # Encoder outputs
        X_enc = self.encoder_layer(X_enc, X_enc_len)  # Pass through encoder layers

        # Decoder outputs (which also incorporates the information from the encoder layer)
        X_dec = self.decoder_layer(X_dec, enc_out=X_enc, enc_seq_lengths=X_enc_len)

        # Final output layer (maps to vocabulary size) to get the logits of the predictions.
        logits = self.output(X_dec)

        return logits

    def training_step(self, batch, batch_idx):
        """
        Performs a single training step.

        Args:
            batch (dict): Batch of data containing encoded and decoded sequences.
            batch_idx (int): Index of the current batch.

        Returns:
            loss: Calculated loss value for the current batch.
        """

        return self._common_step(batch, 'train')

    def validation_step(self, batch, batch_idx):
        """
        Performs a single validation step.

        Args:
            batch (dict): Batch of data containing encoded and decoded sequences.
            batch_idx (int): Index of the current batch.

        Returns:
            loss: Calculated loss value for the current batch.
        """

        return self._common_step(batch, 'val')

    def _common_step(self, batch, prefix):
        """
        Shared logic for training and validation steps.

        Args:
            batch (dict): Batch of data containing encoded and decoded sequences.
            prefix (str): Prefix for logging metrics ('train' or 'val').

        Returns:
            loss: Calculated loss value for the current batch.
        """

        # Unpack batch data
        X_enc, X_enc_len, X_dec, Y_dec = batch

        # Run forward pass (asking the model to make prediction).
        logits = self.forward(X_enc, X_enc_len, X_dec)

        # Calculate loss (ignoring padding tokens)
        # Note: nn.CrossEntropyLoss expects the class dimension second, so permute(...) rearranges the logits
        # from [batch, seq_len, vocab] to [batch, vocab, seq_len], as described in the PyTorch documentation.
        loss = self.loss_fn(logits.permute(0, 2, 1), Y_dec)

        # Calculate and log accuracy to our training progress bar so we can see if
        # the model is improving while training (ignoring padding tokens).
        self.log_dict({
            f'{prefix} acc': self.accuracy(logits.permute(0,2,1), Y_dec),
            f'{prefix} loss': loss
        },
        prog_bar=True) # show metrics on the progress bar.

        return loss

    def configure_optimizers(self):
        """
        Configures the optimizer used for training the model. Think of it as a coach that tells the model how to
        improve its components; that is why we pass it the model parameters, so it can guide their tuning.

        Returns:
            torch.optim.Optimizer: The chosen optimizer instance.
        """

        optimizer = torch.optim.Adam(self.parameters(), lr=5e-4)  # Example using Adam with learning rate 5e-4
        return optimizer
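
To see why `_common_step` permutes the logits before computing the loss, here is a minimal sketch with tiny made-up shapes. `PAD_IDX = 0` and the tensor sizes are assumptions for illustration only, not the notebook's real values. `nn.CrossEntropyLoss` expects the class dimension in second position, so `[batch, seq_len, vocab]` logits are rearranged to `[batch, vocab, seq_len]`. Flattening the batch and sequence dimensions is an equivalent alternative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_IDX = 0                      # assumed padding id for this sketch only
batch, seq_len, vocab = 2, 4, 5  # toy sizes, not the notebook's

logits = torch.randn(batch, seq_len, vocab)          # model output: [batch, seq_len, vocab]
targets = torch.tensor([[1, 2, 3, PAD_IDX],
                        [4, 1, PAD_IDX, PAD_IDX]])   # target ids: [batch, seq_len]

loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # padded positions do not count

# Option 1: move the class (vocab) dimension to position 1 -> [batch, vocab, seq_len]
loss_permuted = loss_fn(logits.permute(0, 2, 1), targets)

# Option 2: flatten batch and sequence -> [batch * seq_len, vocab] vs [batch * seq_len]
loss_flat = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))

print(loss_permuted.item(), loss_flat.item())  # the two values match
```

Both calls average the loss over the non-padding positions only, which is why they agree.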



### **Architecture Summary**
With the aid of torchinfo, we are going to print a summary of our model to see details such as the number of trainable and non-trainable parameters (remember that we froze the parameters of our pre-trained word and positional embedding layers).

In [None]:
!pip install torchinfo

In [36]:
from torchinfo import summary

batch_size = 24 # will emulate passing a batch size of 24 through the model.
summary(
    Transformer(),
    input_data = [
        torch.randint(low=2, high=250, size=(batch_size, TOKEN_LIMIT)), # emulate encoder inputs
        torch.full([batch_size], TOKEN_LIMIT), # emulate lengths of encoder inputs
        torch.randint(low=2, high=250, size=(batch_size, TOKEN_LIMIT)) # emulate decoder inputs
    ],
    device='cpu'
)

Layer (type:depth-idx)                        Output Shape              Param #
Transformer                                   [24, 350, 119547]         --
├─Embedding: 1-1                              [24, 350, 768]            (91,812,096)
├─Embedding: 1-2                              [24, 350, 768]            (recursive)
├─LearnedPositionalEncoding: 1-3              [24, 350, 768]            --
│    └─Embedding: 2-1                         [350, 768]                (393,216)
├─LearnedPositionalEncoding: 1-4              [24, 350, 768]            (recursive)
│    └─Embedding: 2-2                         [350, 768]                (recursive)
├─EncoderLayer: 1-5                           [24, 350, 768]            --
│    └─ModuleList: 2-3                        --                        --
│    │    └─EncoderBlock: 3-1                 [24, 350, 768]            2,779,409
│    │    └─EncoderBlock: 3-2                 [24, 350, 768]            2,779,409
│    │    └─EncoderBlock: 3-3        

### **Training Our Model**

In [34]:
# Since we are going to train our model for several epochs (an epoch is one full pass through our dataset [of 2_007_723 sentence pairs]),
# we want to save the model at intervals so that, if anything unforeseen happens, we don't have to start from the beginning.
from lightning.pytorch.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath='latest_ckpt/',
    filename='en_fr_model',
    every_n_train_steps=200, # After how many batches should we save the model.
    save_last=True, # Also keep a copy of the most recent checkpoint (last.ckpt).
    # Keep only one checkpoint from the top-k ranking. Note: "best" is only meaningful when a `monitor`
    # metric is set; since none is set here, the most recently saved checkpoint is the one that is kept.
    save_top_k=1
)

In [38]:
# We use a trainer provided by Pytorch Lightning to train the model
# (Note: It will automatically use a GPU, TPU, IPU or HPU if any is available).
trainer = L.Trainer(
    max_epochs=5, # Train for 5 epochs.
    callbacks=[checkpoint_callback] # Register our checkpoint_callback to tell the trainer to save the checkpoints.
)

INFO: GPU available: False, used: False
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs


In [39]:
# create an instance of the transformer to train.
transformer_model = Transformer()

In [None]:
# pass our instance to the trainer for training
trainer.fit(
    transformer_model, # our transformer instance
    translation_datamodule # the data module we previously created, will be used for training the model.
)

INFO: 
  | Name            | Type                      | Params
--------------------------------------------------------------
0 | pos_embed_layer | LearnedPositionalEncoding | 393 K 
1 | embedding_layer | Embedding                 | 91.8 M
2 | encoder_layer   | EncoderLayer              | 11.1 M
3 | decoder_layer   | DecoderLayer              | 17.4 M
4 | output          | Linear                    | 91.9 M
5 | loss_fn         | CrossEntropyLoss          | 0     
6 | accuracy        | MulticlassAccuracy        | 0     
--------------------------------------------------------------
120 M     Trainable params
92.2 M    Non-trainable params
212 M     Total params
850.667   Total estimated model params size (MB)

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

### **Testing Our Model**
I have already trained a model. Although I only trained it for half of the first epoch, it reached train acc = 0.62, that is, 62% accuracy at next-word prediction on the training data. I will use that model here.

Note: This model is not included with the notebook because of some issues I had while creating it, but you can train your own (half of the first epoch should take about 3 hours on the Nvidia T4 or P100 GPUs available on Colab or Kaggle).

In [28]:
# This line is how I loaded the model on my machine; it won't work on yours,
# as stated in the markdown above 👆👆
my_trained_model_path = '/content/drive/MyDrive/Public Educational Notebooks/Lang_Translate/Resources/model.ckpt'

loaded_model = Transformer.load_from_checkpoint(my_trained_model_path)

In [31]:
def naive_greedy_decoding(en_sentence, transformer_model, tokenizer, max_output_len=TOKEN_LIMIT):
    """
    Performs basic greedy decoding on the provided English sentence using the given Transformer model.

    This method serves for educational purposes and demonstrates a simple greedy decoding approach.
    For practical applications, more efficient methods like KV caching are used.

    Args:
        en_sentence (str): The English sentence to translate.
        transformer_model (nn.Module): The trained Transformer model.
        tokenizer (transformers.PreTrainedTokenizer): The tokenizer used for text and vocabulary handling.
        max_output_len (int, optional): Maximum length of the decoded French sentence. Defaults to TOKEN_LIMIT (350 for our model).

    Returns:
        str: The decoded French sentence generated by the model.

    Raises:
        Exception: If the input sentence is empty.
    """

    # **Evaluation mode and GPU usage:**
    transformer_model = transformer_model.eval().cuda()  # Switch to evaluation mode and move to GPU

    # **Input validation:**
    assert isinstance(en_sentence, str), "The english sentence should be a string"
    en_sentence = en_sentence.strip()

    if en_sentence == "":
        raise Exception('Text should not be empty')

    en_sentence = clean_text(en_sentence)  # Apply any necessary text cleaning

    # **Tokenization and sequence lengths:**
    en_sen = tokenizer.encode(en_sentence, return_tensors='pt')  # Tokenize English sentence
    en_sen_len = torch.tensor([en_sen.shape[1]])  # Calculate input sequence length

    # **Initialize decoded sentence and loop:**
    fr_decoded = [tokenizer.cls_token_id]  # Start with [CLS] token representing our Start Of Sentence
    for _ in range(max_output_len):

        # **Disable gradient calculation for efficiency:**
        with torch.no_grad():
            # **Forward pass through the Transformer:**
            log_proba = transformer_model(
                en_sen.cuda(),  # Input English sentence on GPU
                en_sen_len.cuda(),  # Input English sentence length on GPU
                torch.tensor([fr_decoded]).cuda()  # Current decoded sentence on GPU
            )  # Shape: [1, len(fr_decoded), vocab_size]

            # **Greedy decoding: choose the token with the highest score at the last position**
            next_word_id = torch.argmax(log_proba, dim=-1)[0][-1].item()  # Convert to a plain Python int
            fr_decoded.append(next_word_id)  # Add predicted token to decoded sentence

            # **Early stopping if [SEP] (our end-of-sentence) token is predicted**
            if next_word_id == tokenizer.sep_token_id:
                break

    # **Decode tokens back to human-readable text:**
    fr_text = tokenizer.decode(
        fr_decoded,
        clean_up_tokenization_spaces=True  # Remove extra spaces around punctuation
    ).replace(' ##', '')  # Join subwords back into complete words

    return fr_text
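
The final `.replace(' ##', '')` in the function above joins subword pieces back into whole words. Here is a minimal sketch of that step on its own, assuming WordPiece-style pieces where `##` marks a continuation of the previous piece. The `pieces` list is made up for illustration and is not real tokenizer output.

```python
# Hypothetical WordPiece-style pieces: "##" means "glue me to the previous piece".
pieces = ["la", "dé", "##li", "##bér", "##ation", "."]

decoded = " ".join(pieces)            # pieces separated by spaces
joined = decoded.replace(" ##", "")   # drop the space and the "##" marker

print(decoded)  # la dé ##li ##bér ##ation .
print(joined)   # la délibération .
```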

In [32]:
text = "I will like to address the court tomorrow, after much delibration on the matter"

naive_greedy_decoding(
    text,
    loaded_model,
    tokenizer
    )

'[CLS] Je voudrais adresser à la cour de justice demain, après la délibération de la question. [SEP]'

<img src="https://onedrive.live.com/embed?resid=8C3CCBBA832CF1E0%21605&authkey=%21AN-UxmT6yETwyKs&width=975&height=298" width="975" height="298" />

### **Beam Search**
Machine translation models generate output sequences like translated sentences. A common challenge is how to select the best words at each step, given the previous words and the input sequence.

One simple method is **greedy decoding**, which predicts the most likely word at each step, building the sentence word by word. However, this can lead to suboptimal results as the model might get stuck in local optima, missing better translations overall.

A more sophisticated method is **beam search**, which is a decoding algorithm that can generate sequences of words from a probability distribution over the vocabulary. It is often used in natural language processing tasks such as machine translation, text summarization, and image captioning, where the output is a sequence of words.

Unlike greedy decoding, which only selects the most likely word at each step, beam search keeps track of a fixed number of candidates (called the beam size) and expands them until the end of the sequence or a stop token is reached. This allows beam search to explore more possible sequences and find a better solution than greedy decoding.

The main steps of beam search are:

1. **Maintain multiple candidate translations (beams).** At each step, instead of just the single best word, we consider the top `k` most likely words for **each existing beam**.

2. **Expand each beam with the chosen word.** This effectively creates `k` new beams for each existing one, exploring different translation paths in parallel.

3. **Evaluate and score new beams.** We consider both the probability of the current word and the overall translation's likelihood based on previous words.

4. **Keep the top `k` beams.** We prune unlikely translations, focusing on the most promising candidates.

5. **Repeat steps 1-4 until reaching a maximum length or end-of-sentence marker.**

By considering multiple possibilities simultaneously, beam search **increases the chance of finding better translations** compared to greedy decoding. However, it comes with increased computational cost due to handling more candidate sequences.

Beam search often improves translation quality because an early, locally attractive word choice no longer locks in the rest of the sentence. It also returns several ranked candidates together with their scores, which is useful for tasks that need multiple outputs or for inspecting which alternatives the model considered.
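
The steps above can be sketched in plain Python on a toy problem. Everything here is an assumption made for illustration: `TOY_MODEL` is a hand-written table of next-token probabilities standing in for the Transformer, and `beam_search` is a simplified helper without length normalization, not the function used later in this notebook. With a beam width of 1 it behaves like greedy decoding.

```python
import math

EOS = "<eos>"

# Hypothetical "model": next-token probabilities for a given prefix.
# Any prefix that is not listed ends the sentence with probability 1.
TOY_MODEL = {
    (): {"le": 0.6, "un": 0.4},
    ("le",): {"chat": 0.55, "chien": 0.45},
    ("un",): {"chat": 0.9, "chien": 0.1},
}

def next_log_probs(prefix):
    """Return {token: log-probability} for the token that follows `prefix`."""
    probs = TOY_MODEL.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(beam_width, max_len=10):
    beams = [([], 0.0)]  # each beam: (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Steps 1-3: expand every beam with every possible next token and score it.
        candidates = []
        for tokens, score in beams:
            for tok, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + log_p))
        # Step 4: keep only the `beam_width` best candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))  # this hypothesis is complete
            else:
                beams.append((tokens, score))
        if not beams:  # Step 5: stop once every kept hypothesis has ended
            break
    finished.extend(beams)  # hypotheses cut off by max_len, if any
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished

greedy_tokens, greedy_score = beam_search(beam_width=1)[0]
beam_tokens, beam_score = beam_search(beam_width=2)[0]
print(greedy_tokens, round(math.exp(greedy_score), 2))  # ['le', 'chat', '<eos>'] 0.33
print(beam_tokens, round(math.exp(beam_score), 2))      # ['un', 'chat', '<eos>'] 0.36
```

Greedy decoding commits to "le" because it is the most likely first word, while a beam of width 2 also keeps "un" alive and finds the sequence with the higher overall probability.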

In [96]:
import numpy as np

def naive_beam_search(en_sentence, transformer_model, tokenizer, beam_width, max_output_len=250):
    """
    Performs basic beam search decoding on the provided English sentence using the given Transformer model.

    This method serves for educational purposes and demonstrates a simple beam search decoding approach.
    For practical applications, more advanced and optimized beam search implementations are available.

    Args:
        en_sentence (str): The English sentence to translate.
        transformer_model (nn.Module): The trained Transformer model.
        tokenizer (transformers.PreTrainedTokenizer): The tokenizer used for text and vocabulary handling.
        beam_width (int): The number of beams to keep during decoding.
        max_output_len (int, optional): Maximum length of the decoded French sentence. Defaults to 250.

    Returns:
        tuple: A NumPy array with the decoded beam translations and a tensor with their final scores.
    """

    # Set BOS and EOS token IDs for consistency
    tokenizer.bos_token_id = tokenizer.cls_token_id
    tokenizer.eos_token_id = tokenizer.sep_token_id

    # Switch model to evaluation mode and GPU for efficiency
    transformer_model = transformer_model.eval().cuda()

    # Input validation
    assert isinstance(en_sentence, str), "The english sentence should be a string"
    en_sentence = en_sentence.strip()

    if not en_sentence:
        raise ValueError('Text should not be empty')
    en_sentence = clean_text(en_sentence)  # Apply any necessary text cleaning

    # Tokenize English sentence and calculate sequence length
    en_sen = tokenizer.encode(en_sentence, return_tensors='pt')
    en_sen_len = torch.tensor([en_sen.shape[1]])

    # Initialize beam search data structures
    with torch.no_grad():
        # Get logits for first word from all beams
        logits = transformer_model(
            en_sen.cuda(),
            en_sen_len.cuda(),
            torch.tensor([[tokenizer.bos_token_id]]).cuda()
            )[0, 0]

        logits = torch.log_softmax(logits, dim=-1)  # Log-softmax (numerically more stable than log(softmax(x)))

        # Select top k logits and indices for each beam
        top_k_logits, top_k_ind = torch.topk(logits, k=beam_width)

        # Initialize beam translations, lengths, and EOS flags
        fr_alt_translations = np.empty(beam_width, dtype=object)
        fr_alt_translations[:] = [[tokenizer.bos_token_id, top_k_ind[i].item()] for i in range(beam_width)]
        alt_sen_length = torch.tensor([2] * beam_width).cuda()  # Start with length 2 (BOS + first word)
        non_eos_ind = torch.tensor([True] * beam_width).cuda()  # All beams active initially

    # Main beam search loop
    for _ in range(max_output_len):

        # Initialize storage for next word logits
        store_logits = torch.full((beam_width, tokenizer.vocab_size), -torch.inf).cuda()

        # Set logits for beams that already reached EOS
        store_logits[~non_eos_ind, tokenizer.eos_token_id] = top_k_logits[~non_eos_ind]

        with torch.no_grad():
            # Get logits for next word predictions for active beams
            store_logits[non_eos_ind] = transformer_model(
                en_sen.repeat(sum(non_eos_ind), 1).cuda(),
                en_sen_len.repeat(sum(non_eos_ind)).cuda(),
                torch.tensor(
                    fr_alt_translations[non_eos_ind.cpu()].tolist()
                    ).cuda()
            )[:, -1]  # Get logits for last word in each sentence

        # Mask unknown token with -inf logits
        # store_logits[:, tokenizer.unk_token_id] = -torch.inf

        # Apply log-softmax and add previous beam logits for active beams
        store_logits[non_eos_ind] = torch.log_softmax(store_logits[non_eos_ind], dim=-1)
        store_logits[non_eos_ind] += top_k_logits[non_eos_ind].view(-1, 1)

        # Sentence length normalization:
        len_norm_factor = 1/alt_sen_length.view(-1, 1) ** 0.7  # Calculate length normalization factor
        store_logits *= len_norm_factor  # Apply normalization to logits

        # Select top k next word predictions and indices:
        top_k_logits, top_k_ind = torch.topk(store_logits.view(-1), k=beam_width)  # Reshape and get top k

        # Extract next word IDs and beam indices:
        next_token_id = top_k_ind % tokenizer.vocab_size  # Extract word IDs from indices
        beam_ind = top_k_ind // tokenizer.vocab_size  # Extract beam indices

        # Identify beams that haven't reached EOS:
        non_eos_ind = next_token_id != tokenizer.eos_token_id  # Check for EOS token in predictions

        # Reorder the stored translations and lengths to match the beams that survived the top-k selection:
        fr_alt_translations = fr_alt_translations[beam_ind.cpu()]  # Translations of the selected beams
        alt_sen_length = alt_sen_length[beam_ind.cpu()]  # Lengths of the selected beams

        # Update active beams with new word and increment length:
        for idx, next_id in zip(torch.nonzero(non_eos_ind).flatten(), next_token_id[non_eos_ind]):
            fr_alt_translations[idx] = fr_alt_translations[idx] + [next_id.item()]  # Add next word
            alt_sen_length[idx] += 1  # Increase sentence length

        # Early stopping if all beams have reached EOS:
        if sum(non_eos_ind) == 0:
            break  # No active beams left, stop decoding

    # Decode final translations and return results:
    for i, translation in enumerate(fr_alt_translations):
        fr_alt_translations[i] = tokenizer.decode(
            translation,
            clean_up_tokenization_spaces=True,
            skip_special_tokens=True
        ).replace(' ##', '')  # Decode, remove spaces, combine subwords

    return fr_alt_translations, top_k_logits  # Return decoded translations and their final length-normalized scores
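
The function above divides scores by `length ** 0.7` before ranking the beams. Here is a minimal sketch of what that normalization does, using made-up scores: the helper name `length_normalized` and the two hypotheses are assumptions for illustration, and only the exponent 0.7 mirrors the code above. Every extra token adds a negative log-probability, so without normalization longer sentences are penalized simply for being longer.

```python
def length_normalized(log_prob_sum, length, alpha=0.7):
    """Cumulative log-probability divided by length ** alpha."""
    return log_prob_sum / (length ** alpha)

# Hypothetical hypotheses: (cumulative log-probability, number of tokens)
short = (-4.0, 4)
long = (-5.0, 8)

# Ranking by the raw cumulative log-probability favors the shorter hypothesis.
raw_winner = "short" if short[0] > long[0] else "long"

# Ranking by the length-normalized score favors the longer one here.
norm_winner = "short" if length_normalized(*short) > length_normalized(*long) else "long"

print(raw_winner, norm_winner)  # short long
```

In this sketch the raw cumulative log-probability is kept unchanged and the normalized value is used only for comparing hypotheses.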


In [118]:
text = '''
    While we disagree on specifics, we have a duty to serve the citizens of this country
    to the best of our abilities.
'''

naive_beam_search(
    text,
    loaded_model,
    tokenizer,
    3)

(array(["Bien que nous ne sommes pas d'accords spécifiques, nous avons le devoir de servir aux citoyens de ce pays à la meilleure qualité de nos capacités.",
        "Bien que nous ne sommes pas d'accords spécifiques, nous avons le devoir de servir aux citoyens de ce pays à la meilleure capacité de nos capacités.",
        "Bien que nous ne sommes pas d'accords spécifiques, nous devons nous servir devoir aux citoyens de ce pays à la meilleure qualité de nos capacités."],
       dtype=object),
 tensor([-2.7071e-05, -3.5338e-05, -3.2246e-04], device='cuda:0'))

<img src="https://onedrive.live.com/embed?resid=8C3CCBBA832CF1E0%21606&authkey=%21AMbl-Me9ZEZ-WvI&width=990&height=394" width="990" height="394" />

📝 Note: Although our model performs fairly well on the kind of language you would normally hear in parliament, it struggles with everyday text 👇👇

In [125]:
text = '''
    Wow!, What a beautiful day for hunting deers.
'''

naive_beam_search(
    text,
    loaded_model,
    tokenizer,
    3)

(array(["Ce qui se passe, c'est un beau jour pour la chasse à des chauffeurs.",
        "Ce qui se passe, c'est un beau jour pour la chasse à la chasse aux chauffeurs.",
        "Ce qui se passe aujourd'hui, c'est un beau jour pour la chasse aux chauffeurs."],
       dtype=object),
 tensor([-6.3252e-05, -3.1468e-03, -3.7728e-03], device='cuda:0'))

<img src="https://onedrive.live.com/embed?resid=8C3CCBBA832CF1E0%21607&authkey=%21APkk4TZjfybDQPw&width=979&height=297" width="979" height="297" />

📝 This is mostly due to the two reasons below:

1: I haven't trained the model long enough for it to start generalizing (I only trained for half of the first epoch).

2: The dataset should be more diverse than parliamentary texts alone; that would let the model learn more complex language nuances.

State-of-the-art translation models are much larger than ours, often with billions of parameters, and they are trained on hundreds of millions of sentence pairs.

### **Congratulations** 🎉🎉
You can now build a transformer for Language Translations. 😊

Follow me on:

* **[LinkedIn Profile](https://www.linkedin.com/in/jonathan-okorie-843126216/)** for questions, deep learning projects, chat, etc.

* **[Twitter Profile](https://twitter.com/Nathan_Young_1)** for bite-sized knowledge & (questionable) puns.

![Celebration](https://www.bing.com/th/id/OGC.da35457a6e4969036e52aa30920fbda9?pid=1.7&rurl=https%3a%2f%2fmedia.giphy.com%2fmedia%2fyidUzEKG4AaHWCpsPu%2fgiphy.gif&ehk=zy7t4EILSylbdm0EJC%2fkPu12DvoLx0AD89jDzaXKT8o%3d "celebration")