## Machine Translation Using a Seq2Seq Architecture

---
The goal of this colab is to get you more familiar with Seq2Seq models and their challenges. To that end, you will work on a machine translation problem: the input is a sentence in English, and the output is its translation in French, just like what happens with Google Translate.


**Just to give you a heads up:** Our model won't perform like Google Translate, but we will at least get an idea of how Google Translate works and of the challenges that come with a translation problem.

## Importing Libraries

We start by importing NLTK, pandas, Matplotlib, and PyTorch, along with a few standard-library modules.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
import pandas as pd
nltk.download('punkt')
nltk.download('punkt_tab')
import matplotlib.pyplot as plt
import string
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, Dataset
from torch.nn.utils.rnn import pack_padded_sequence
import random

Upload your data here. Here is the [Drive link](https://drive.google.com/drive/folders/10ncj3w7kI9GPx_rz-WfKEGCv4Dz1EYf6?usp=sharing)

## Getting the data

In [None]:
en = pd.read_csv('Datasets/en.csv')
fr = pd.read_csv('Datasets/fr.csv')
english_sentences = en.iloc[:, 0]
french_sentences = fr.iloc[:, 0]

**How many sentences does each of the files contain?**

In [None]:
len(english_sentences), len(french_sentences)

Now let us concatenate the two dataframes into a single dataframe called **combined_df**, where one column holds the English sentences and the other holds the French sentences.

In [None]:
en.columns = ['English']
fr.columns = ['French']
combined_df = pd.concat([en, fr], axis=1)
print(combined_df.head())

Pick a sentence and print it in both languages

In [None]:
print(combined_df.iloc[0, 0])
print(combined_df.iloc[0, 1])

## Cleaning Data

As we can see, the data is almost clean; we just need to remove the punctuation and convert everything to lowercase.

In [None]:
'''remove the punctuation'''
combined_df['English'] = combined_df['English'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
combined_df['French'] = combined_df['French'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

''' convert to lowercase'''
combined_df['English'] = combined_df['English'].apply(lambda x: x.lower())
combined_df['French'] = combined_df['French'].apply(lambda x: x.lower())
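
To see exactly what this cleaning step does to a sentence, here is a minimal sketch on two made-up sentences. The helper name `clean_sentence` and the sample strings are illustrative and not part of the notebook. Note that `string.punctuation` includes the apostrophe, so contractions and elided French articles get fused into a single word:

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_sentence(sentence: str) -> str:
    """Strip ASCII punctuation, then lowercase (hypothetical helper)."""
    return sentence.translate(PUNCT_TABLE).lower()

print(clean_sentence("She is driving, isn't she?"))  # she is driving isnt she
print(clean_sentence("Elle conduit l'auto."))        # elle conduit lauto
```

This is harmless as long as the same cleaning is applied at training and at translation time, but it does mean that `l'auto` and `auto` end up as unrelated vocabulary entries.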

Make sure that the punctuation is removed by printing the example that you printed earlier.

In [None]:
combined_df.iloc[0, 0], combined_df.iloc[0, 1]

## Exploring the Data

Add a column **ENG Length** to the dataset that shows how many words each English sentence contains, and do the same for French in a column called **FR Length**.

In [None]:
# add a column to the dataframe called "ENG Length" and "FR Length" that contains the length of the English and French sentences respectively
combined_df['ENG Length'] = combined_df['English'].apply(lambda x: len(x.split()))
combined_df['FR Length'] = combined_df['French'].apply(lambda x: len(x.split()))

print(combined_df.head())


Visualize the distribution of the lengths of the English and French sentences.

In [None]:
plt.figure(figsize=(12, 6))
plt.hist(combined_df['ENG Length'], bins=10, alpha=0.5, label='English Sentence Lengths',density=True) # density=True normalizes the data
plt.hist(combined_df['FR Length'], bins=10, alpha=0.5, label='French Sentence Lengths', density=True) # density=True normalizes the data (getting an approximation of the pdf)
plt.xlabel('Sentence Length (Word Count)')
plt.ylabel('Density')  # density=True plots a normalized histogram, not raw counts
plt.title('Distribution of Sentence Lengths (English vs. French)')
plt.legend()

plt.show()

Get the maximum length of an English sentence and the maximum length of a French sentence.

In [None]:
max_eng_length = combined_df['ENG Length'].max()
max_fr_length = combined_df['FR Length'].max()
max_eng_length, max_fr_length

## Preprocessing the Data

In order for the data to be fed to the model, it has to be tokenized and padded.

#### Tokenization

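**To tokenize English and French sentences, we can use only one tokenizer. True or False?**

False in general: English and French have different vocabularies, so each language gets its own token-to-index mapping. In this notebook, though, the splitting itself is done with the same basic word tokenizer, `nltk.word_tokenize`, for both languages.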

Tokenize the sentences that we have.

In [None]:
def tokenize_sentences(sentences):
    return [word_tokenize(sentence.lower()) for sentence in sentences]
english_sentences = combined_df['English']
french_sentences = combined_df['French']
english_tokenized = tokenize_sentences(english_sentences)
french_tokenized = tokenize_sentences(french_sentences)
print(english_tokenized[0])
print(french_tokenized[0])
print(len(english_tokenized), len(french_tokenized))

**How many unique words do we have in English and in French?**

In [None]:
def build_vocabulary(tokenized_sentences, special_tokens=["<PAD>", "<SOS>", "<EOS>", "<UNK>"]):
    all_tokens_list = []
    for sentence in tokenized_sentences:
        all_tokens_list.extend(sentence)
    vocab = {token: idx for idx, token in enumerate(special_tokens)}
    for token in all_tokens_list:
        if token not in vocab:
            vocab[token] = len(vocab) # updating each token with a unique index, since len(vocab) is changing each time
    # Inverse mapping, used later to decode indices back into tokens
    index_to_token = {idx: token for token, idx in vocab.items()}
    return vocab, index_to_token
english_vocab, english_index_to_token = build_vocabulary(english_tokenized)
french_vocab, french_index_to_token = build_vocabulary(french_tokenized)

# save these dictionaries for later use
import pickle
with open('english_vocab.pkl', 'wb') as f:
    pickle.dump(english_vocab, f)
with open('english_index_to_token.pkl', 'wb') as f:
    pickle.dump(english_index_to_token, f)
with open('french_vocab.pkl', 'wb') as f:
    pickle.dump(french_vocab, f)
with open('french_index_to_token.pkl', 'wb') as f:
    pickle.dump(french_index_to_token, f)

print("English Vocabulary Size:", len(english_vocab))
print("French Vocabulary Size:", len(french_vocab))
print(english_vocab)
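
As a quick illustration of how the indices are assigned, here is a minimal sketch on a toy corpus. The function name `toy_vocabulary` and the sentences are made up for this example; the logic mirrors the idea above: special tokens take indices 0 to 3, and every new word gets the next free index in order of first appearance.

```python
SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

def toy_vocabulary(tokenized_sentences):
    """Map each token to a unique index, special tokens first (illustrative helper)."""
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    for sentence in tokenized_sentences:
        for token in sentence:
            # len(vocab) is evaluated before insertion, so it is the next free index.
            vocab.setdefault(token, len(vocab))
    return vocab

toy_corpus = [["she", "is", "driving"], ["she", "likes", "trucks"]]
toy_vocab = toy_vocabulary(toy_corpus)
print(toy_vocab)

# Words never seen while building the vocabulary fall back to <UNK>.
encoded = [toy_vocab.get(token, toy_vocab["<UNK>"]) for token in ["he", "is", "driving"]]
print(encoded)  # [3, 5, 6]
```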

#### Padding + Batching + putting them in sequences

**What should be the length of the sequences that we have after padding?**

For the padding length, I use the length of the longest sentence in the dataset.

Perform padding on the sequences that we have.

In [None]:
class TranslationDataset(Dataset):
    def __init__(self, src_sentences, tgt_sentences, src_vocab, tgt_vocab, max_src_len, max_tgt_len):
        '''
        src_sentences: List of source sentences (as tokens) -> English
        tgt_sentences: List of target sentences (as tokens) -> French
        src_vocab: Source vocabulary -> Maps English Tokens to indices
        tgt_vocab: Target vocabulary -> Maps French Tokens to indices
        max_src_len: Maximum length of source sentences
        max_tgt_len: Maximum length of target sentences
        '''
        self.src_sentences = src_sentences
        self.tgt_sentences = tgt_sentences
        self.src_vocab = src_vocab
        self.tgt_vocab = tgt_vocab
        self.max_src_len = max_src_len
        self.max_tgt_len = max_tgt_len
    
    def __len__(self):
        # determine the dataset length, for later use !!!!
        return len(self.src_sentences) 
    def __getitem__(self, idx):
        '''
        idx: Index of the sentence pair to retrieve

        Returns:
        src_indices: Indices of the source sentence
        src_length: Actual length of the source sentence
        tgt_indices: Indices of the target sentence
        '''
        # Get english(source) and target sentence at this index 
        src_sentence = self.src_sentences[idx]
        tgt_sentence = self.tgt_sentences[idx]
        
        # Convert each sentence to token indices and pad it to the max length (as discussed in the report)
        src_indices = self._convert_and_pad(src_sentence, self.src_vocab, self.max_src_len, "<PAD>",adding_sos_eos=False)
        tgt_indices = self._convert_and_pad(tgt_sentence, self.tgt_vocab, self.max_tgt_len, "<PAD>",adding_sos_eos=True)
        # Get actual lengths (ignoring <PAD> tokens) (like we talked about in the report)
        src_length = min(len(src_sentence), self.max_src_len) # actual length of the source sentence, which we will need in the encoder
        # tgt_length = min(len(tgt_sentence), self.max_tgt_len), we don't need this, because we are not using it in the decoder
        return torch.tensor(src_indices), src_length, torch.tensor(tgt_indices)
    def _convert_and_pad(self, sentence, vocab, max_len, pad_token,adding_sos_eos=False):
        '''Convert tokens to indices'''
        if adding_sos_eos:
            sentence = ["<SOS>"] + sentence + ["<EOS>"]
        indices = [vocab.get(token, vocab["<UNK>"]) for token in sentence] # maps each token to its index in the vocabulary, if not found, it maps it to the index of the <UNK> token
        '''Truncate or pad to max length'''
        indices = indices[:max_len] + [vocab[pad_token]] * (max_len - len(indices))
        return indices
def collate_batch(batch):
    '''
    This function is used to combine individual samples into batches with consistent shapes, required by the DataLoader

    batch: List of individual samples, where each sample is the tuple (src_indices, src_length, tgt_indices) returned by __getitem__
    '''
    src_batch, src_lengths, tgt_batch = zip(*batch)
    # src_batch will be a list of source tensors, each of shape (max_src_len)
    # src_lengths will be a list of source lengths(ignoring padding) 
    # tgt_batch will be a list of target tensors, each of shape (max_tgt_len)

    # stacking these tensors to form a single tensor for each of the source and target sequences
    src_batch = torch.stack(src_batch) # shape: (batch_size, max_src_len)
    tgt_batch = torch.stack(tgt_batch) # shape: (batch_size, max_tgt_len)
    src_lengths = torch.tensor(src_lengths) # shape: (batch_size)   
    return src_batch, src_lengths, tgt_batch

max_src_len = 26  
max_tgt_len = 26  
batch_size = 256
train_dataset = TranslationDataset(src_sentences=english_tokenized, tgt_sentences=french_tokenized, src_vocab=english_vocab, tgt_vocab=french_vocab, max_src_len=max_src_len, max_tgt_len=max_tgt_len)
# A custom collate_fn is used because, besides stacking the padded tensors, each batch must also carry the true (unpadded) source lengths.
# collate_batch is passed directly rather than wrapped in a lambda: with num_workers > 0, platforms that start workers with "spawn" need to pickle it, and lambdas cannot be pickled.
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch, num_workers=4, pin_memory=True, prefetch_factor=2)
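
The pad-or-truncate step can be checked by hand on a tiny example. This is a minimal sketch with made-up indices; the function name `pad_or_truncate` and the constants are illustrative, and the expression is the same slicing trick used in `_convert_and_pad`.

```python
PAD, SOS, EOS = 0, 1, 2  # same index order as the special tokens in the vocabulary

def pad_or_truncate(indices, max_len, add_sos_eos=False):
    """Return a list of exactly max_len indices (illustrative helper)."""
    if add_sos_eos:
        indices = [SOS] + indices + [EOS]
    # A negative repeat count yields an empty list, so long inputs are only truncated.
    return indices[:max_len] + [PAD] * (max_len - len(indices))

print(pad_or_truncate([7, 8, 9], 6))                    # [7, 8, 9, 0, 0, 0]
print(pad_or_truncate([7, 8, 9], 6, add_sos_eos=True))  # [1, 7, 8, 9, 2, 0]
print(pad_or_truncate([7, 8, 9], 4, add_sos_eos=True))  # [1, 7, 8, 9]
```

The last call shows a detail worth keeping in mind: `<SOS>` and `<EOS>` count toward the maximum length, so a target sentence with more than `max_tgt_len - 2` words loses its `<EOS>` token to truncation.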

## Modeling

After preprocessing the data, we can build our model. Start by building a baseline architecture relying on unidirectional RNNs, LSTMs, or GRUs. It is a good idea to look up how Seq2Seq models are built; in Keras there are layers such as RepeatVector and TimeDistributed that help, whereas this notebook builds the model in PyTorch with an explicit encoder and decoder.

#### Encoder

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, num_layers=2):
        '''
        input_dim: The size of the input vocabulary (It tells the model how many unique tokens it can expect to see)

        '''
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim) # it will have a matrix of size (input_dim, embedding_dim)
        # where each row of the matrix will represent the embedding of a token in the vocabulary
        '''Now each token in my input sequence is an index that refers to a row in this embedding matrix.'''
        '''This lookup returns a new tensor of shape (batch_size, max_length, embedding_dim)'''
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True) # batch_first=True means that the first dimension of the input and output will be the batch size (batch_size, max_length, hidden_dim)
        # note that in LSTM, the time step happens under the hood
        # LSTM treats each position along the sequence length as a time step
        
    def forward(self, src, src_lengths):
        # src shape: (batch_size, max_length) - a batch of sequences of token indices
        # src_lengths shape: (batch_size) - the actual lengths of each sequence without padding
        '''If src has shape (batch_size, max_length) (a batch of padded sequences of token indices), the embedding layer’s output will be a tensor of shape (batch_size, max_length, embedding_dim)'''
        embedded = self.embedding(src)  # shape: (batch_size, max_length, embedding_dim)
        '''LSTM will process only the non-padded elements'''
        packed_embedded = pack_padded_sequence(embedded, src_lengths, batch_first=True, enforce_sorted=False) # 
        packed_outputs, (hidden, cell) = self.lstm(packed_embedded)
        # hidden and cell shapes: (num_layers, batch_size, hidden_dim)
        # Note if you need the outputs you need to unpack the sequences using pad_packed_sequence
        # outputs, _ = pad_packed_sequence(packed_outputs, batch_first=True) #outputs shape: (batch_size, max_length, hidden_dim)
        return hidden, cell
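
Why bother packing the sequences? The sketch below, on a small randomly initialized LSTM with made-up dimensions, suggests the reason: with packing, the final hidden state is the one reached after the last real token; without it, the LSTM keeps updating its state on the padding positions. The tensor sizes and variable names here are assumptions chosen for illustration only.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

# One sequence with 2 real time steps, padded with zeros to length 5.
seq = torch.randn(1, 2, 4)
padded = torch.cat([seq, torch.zeros(1, 3, 4)], dim=1)

with torch.no_grad():
    # Reference: run the LSTM on the unpadded sequence only.
    _, (h_ref, _) = lstm(seq)

    # Packed: the LSTM stops after the 2 real time steps.
    packed = pack_padded_sequence(padded, torch.tensor([2]), batch_first=True, enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)

    # Not packed: the 3 padding positions are processed like real inputs.
    _, (h_padded, _) = lstm(padded)

print("packed matches reference:  ", torch.allclose(h_ref, h_packed, atol=1e-6))
print("unpacked matches reference:", torch.allclose(h_ref, h_padded, atol=1e-6))
```

Since the decoder is initialized from exactly this hidden state, packing keeps the amount of padding in a batch from changing what the decoder receives.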


#### Decoder

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, num_layers=2, teacher_forcing_ratio=0.5):
        super(Decoder, self).__init__()
        self.teacher_forcing_ratio = teacher_forcing_ratio

        # Embedding layer for the output vocabulary( French vocabulary)
        self.embedding = nn.Embedding(output_dim, embedding_dim)
    
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        
        # Fully connected layer to generate predictions over vocabulary, which then by the use of the final softmax layer(which is applied by default in the cross-entropy loss, so we will NOT do it here) will give the probabilities of each token in the vocabulary
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, input_token, hidden, cell):
        # input_token shape: (batch_size) - single token for each sequence in the batch, since later in the full model we will be looping over the sequence length(1 token at a time)
        
        '''Step 1: Embed the input token'''
        embedded = self.embedding(input_token).unsqueeze(1)  # Shape: (batch_size, 1, embedding_dim)
        
        '''Step 2: Pass the embedded input through the LSTM'''
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))  # Shape: (batch_size, 1, hidden_dim)
        
        '''Step 3: Generate prediction for the token'''
        prediction = self.fc_out(output.squeeze(1))  # Shape: (batch_size, output_dim)
        
        return prediction, hidden, cell

#### Full Model (Encoder + Decoder)

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder,  teacher_forcing_ratio=0.5):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.teacher_forcing_ratio = teacher_forcing_ratio
    def forward(self, src, src_lengths, tgt):
        batch_size = src.shape[0]
        tgt_sequence_length = tgt.shape[1]
        tgt_vocab_size = self.decoder.fc_out.out_features
        # decoder_outputs (predictions at each time step)
        outputs = torch.zeros(batch_size, tgt_sequence_length - 1, tgt_vocab_size)

        # Encode the source sequence
        hidden, cell = self.encoder(src, src_lengths)

        # The first input to the decoder is the <SOS> token
        input_token = tgt[:, 0]  # Shape: (batch_size)

        for t in range(1, tgt_sequence_length):
            # Pass the input token through the decoder
            output, hidden, cell = self.decoder(input_token, hidden, cell)
            outputs[:, t - 1, :] = output

            # will i use teacher forcing?
            teacher_force = random.random() < self.teacher_forcing_ratio

            # Get the next input token
            top1 = output.argmax(1) # maximum value along the second dimension
            input_token = tgt[:, t] if teacher_force else top1
        return outputs
    def inference(self, src, src_lengths, max_len=26, start_token=1, end_token=2):
        '''Will output the generated indices of the target sequence'''
        # src: Source input sequence - Shape: (batch_size, src_sequence_length) (English)
        # src_lengths: Actual lengths of each sequence in src (ignores padding)
        # max_len: Maximum length of the generated sequence 
        # start_token: Start-of-sequence token index
        # end_token: End-of-sequence token index
        
        batch_size = src.shape[0]

        # Encode the source sequence
        hidden, cell = self.encoder(src, src_lengths)

        # start the model with input token with the <SOS> token
        input_token = torch.tensor([start_token] * batch_size)  # Shape: (batch_size)

        # here we will store the generated tokens, and NOT the probabilities
        generated_tokens = torch.zeros(batch_size, max_len).long()
        for t in range(max_len):
            # Pass the input token through the decoder
            output, hidden, cell = self.decoder(input_token, hidden, cell)

            # Get the predicted token
            predicted_token = output.argmax(1)  # Shape: (batch_size)
            generated_tokens[:, t] = predicted_token

            
            if (predicted_token == end_token).all(): # Check if all sequences have predicted <EOS>, then we are done generating
                break
            # Update input token for the next time step (like before)
            input_token = predicted_token
        return generated_tokens
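
The teacher forcing decision inside `forward` boils down to a coin flip per time step. Here is a minimal, framework-free sketch of that decision; the function name `choose_next_input`, the seeded `rng`, and the toy token lists are illustrative assumptions.

```python
import random

def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng):
    """Pick the decoder's next input for one time step (illustrative helper)."""
    use_teacher = rng.random() < teacher_forcing_ratio
    return target_token if use_teacher else predicted_token

rng = random.Random(0)
target = ["elle", "conduit", "le", "camion"]     # ground-truth tokens
predicted = ["elle", "mange", "le", "camion"]    # what the model guessed

for ratio in (0.0, 0.5, 1.0):
    inputs = [choose_next_input(t, p, ratio, rng) for t, p in zip(target, predicted)]
    print(ratio, inputs)
```

With a ratio of 0 the decoder always consumes its own predictions (as it must at inference time), and with a ratio of 1 it always consumes the ground truth. Note that in the model above the coin is flipped once per time step for the whole batch, not once per sentence.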

Compile and train the model.
**FYI:** While specifying the architecture of your model and the number of epochs for training, keep in mind that your model might take A LOT of time to train.

In [None]:
input_dim = len(english_vocab)
output_dim = len(french_vocab)
embedding_dim = 256
hidden_dim = 512
num_layers = 2
learning_rate = 0.001
teacher_forcing_ratio = 0.5
num_epochs = 52

encoder = Encoder(input_dim, embedding_dim, hidden_dim, num_layers)
decoder = Decoder(output_dim, embedding_dim, hidden_dim, num_layers, teacher_forcing_ratio=teacher_forcing_ratio)
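model = Seq2Seq(encoder, decoder, teacher_forcing_ratio=teacher_forcing_ratio)
# Resume from a previously saved checkpoint; comment this line out when training from scratch (the file will not exist yet).
model.load_state_dict(torch.load('seq2seq_model_52.pth', weights_only=True))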
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=french_vocab["<PAD>"])

epoch_losses = []
for epoch in range(0,num_epochs):
    model.train()
    epoch_loss = 0

    for src, src_lengths, tgt in train_dataloader:
        optimizer.zero_grad()
        output = model(src, src_lengths, tgt)  # Shape: (batch_size, tgt_sequence_length - 1, output_dim)

        # Reshape output and target for calculating loss !!!!
        output = output.reshape(-1, output_dim)
        tgt = tgt[:, 1:].reshape(-1)  # Exclude the first <SOS> token (because we don't need to calculate loss for it)

        loss = criterion(output, tgt)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
    # Average the loss over the epoch and save a checkpoint after every epoch: this code runs on a remote server, and the saved models are compared later on.
    avg_loss = epoch_loss / len(train_dataloader)
    epoch_losses.append(avg_loss)
    s = f"seq2seq_model_{epoch + 1}.pth"
    torch.save(model.state_dict(), s)
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss}")
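
The reshape before the loss and the effect of `ignore_index` can be verified on a tiny example. This is a minimal sketch with made-up logits; the tensor values and variable names are assumptions for illustration only.

```python
import torch
from torch import nn

PAD = 0  # index of <PAD>, as in the vocabularies built earlier

# Decoder scores for a batch of 1 sentence, 3 time steps, vocabulary of 4 tokens.
logits = torch.tensor([[[0.1, 2.0, 0.3, 0.2],
                        [0.5, 0.1, 1.5, 0.4],
                        [3.0, 0.2, 0.1, 0.6]]])   # shape: (1, 3, 4)
targets = torch.tensor([[1, 2, PAD]])              # the last position is padding

# CrossEntropyLoss expects (N, num_classes) scores and (N,) class indices.
flat_logits = logits.reshape(-1, logits.shape[-1])  # shape: (3, 4)
flat_targets = targets.reshape(-1)                   # shape: (3,)

masked_loss = nn.CrossEntropyLoss(ignore_index=PAD)(flat_logits, flat_targets)

# The same value, computed over the two non-padding positions only.
manual_loss = nn.functional.cross_entropy(flat_logits[:2], flat_targets[:2])

print(masked_loss.item(), manual_loss.item())
```

The two numbers agree: padded positions contribute nothing to the loss, and the mean is taken over the real tokens only.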
        

Define a function that takes an input sentence in English and returns the output sentence in French.

In [None]:
def translate_to_french(sentence,  french_vocab_inverse = french_index_to_token ,english_vocab = english_vocab,max_src_len = 26):
    '''
    This function takes an English sentence, converts it to tokens and then to indices, passes these indices to the model, uses model.inference to generate the indices of the French sentence, then converts these indices back to tokens and finally to a sentence.
    sentence: English sentence to translate
    french_vocab_inverse: Inverse of the French vocabulary (index to token)
    english_vocab: English vocabulary (token to index)
    max_src_len: Maximum length of source sentences 
    '''
    model.eval()
    sentence = word_tokenize(sentence.lower())
    indices = [english_vocab.get(token, english_vocab["<UNK>"]) for token in sentence]
    lengths = torch.tensor([min(len(indices), max_src_len)])

    # Pad/Truncate 
    if len(indices) < max_src_len:
        indices += [english_vocab["<PAD>"]] * (max_src_len - len(indices))
    elif len(indices) > max_src_len:
        indices = indices[:max_src_len]
    indices = torch.tensor(indices).unsqueeze(0)
    with torch.no_grad():
        translated_indices = model.inference(indices, lengths)

    # indices to tokens (in the French vocab)
    translated_tokens = []
    for index in translated_indices[0]:
        token = french_vocab_inverse[index.item()]
        if token == "<EOS>":
            break
        translated_tokens.append(token)
    return ' '.join(translated_tokens)
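
The last step of the function, turning generated indices back into text, can be isolated as follows. This is a minimal sketch with a made-up index-to-token table; the helper name `indices_to_sentence` and the toy mapping are illustrative.

```python
def indices_to_sentence(indices, index_to_token, eos_token="<EOS>"):
    """Decode indices into a sentence, stopping at the first <EOS> (illustrative helper)."""
    tokens = []
    for idx in indices:
        token = index_to_token[idx]
        if token == eos_token:
            break
        tokens.append(token)
    return " ".join(tokens)

toy_index_to_token = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>",
                      4: "elle", 5: "conduit", 6: "le", 7: "camion"}

# The trailing zeros mimic the untouched positions of the zero-initialized output tensor.
generated = [4, 5, 6, 7, 2, 0, 0]
print(indices_to_sentence(generated, toy_index_to_token))  # elle conduit le camion
```

Because decoding stops at the first `<EOS>`, the zero entries left in `generated_tokens` after the loop in `inference` breaks early never show up as `<PAD>` in the output.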

Test the following sentence

In [None]:
input_sentence = "she is driving the truck"  # renamed from `input` to avoid shadowing the built-in
print("English:", input_sentence)
print("French:", translate_to_french(input_sentence))

# Note: the report includes screenshots from a small website that I created to showcase the model's results (its code is not provided here, since we were required to submit only the notebook)

Try to improve your model by modifying the architecture to take into account bidirectionality which is very useful in Machine Translation. Create a new model called model2

In [None]:
class Encoder_2(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, num_layers=2):
        '''
        input_dim: The size of the input vocabulary (It tells the model how many unique tokens it can expect to see)
        '''
        super().__init__()  # Encoder_2 does not inherit from Encoder, so super(Encoder, self) would raise a TypeError

        self.embedding = nn.Embedding(input_dim, embedding_dim) # it will have a matrix of size (input_dim, embedding_dim)
        # where each row of the matrix will represent the embedding of a token in the vocabulary
        '''Now each token in my input sequence is an index that refers to a row in this embedding matrix.'''
        '''This lookup returns a new tensor of shape (batch_size, max_length, embedding_dim)'''
        
        '''Modification: Make the LSTM bidirectional by setting bidirectional=True'''
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True, bidirectional=True) # batch_first=True puts the batch dimension first; because the LSTM is bidirectional, its outputs have shape (batch_size, max_length, 2 * hidden_dim)
        # note that in LSTM, the time step happens under the hood
        # LSTM treats each position along sequence length as a time step

    def forward(self, src, src_lengths):
        # src shape: (batch_size, max_length) - a batch of sequences of token indices
        # src_lengths shape: (batch_size) - the actual lengths of each sequence without padding

        '''If src has shape (batch_size, max_length) (a batch of padded sequences of token indices), the embedding layer’s output will be a tensor of shape (batch_size, max_length, embedding_dim)'''
        embedded = self.embedding(src)  # shape: (batch_size, max_length, embedding_dim)
        

        '''LSTM will process only the non-padded elements'''
        packed_embedded = pack_padded_sequence(embedded, src_lengths, batch_first=True, enforce_sorted=False) 
        packed_outputs, (hidden, cell) = self.lstm(packed_embedded)
        # hidden and cell shapes: (num_layers * num_directions, batch_size, hidden_dim)
        
        '''Modification: Since the LSTM is bidirectional, we need to handle the hidden and cell states accordingly.
        The hidden and cell states from the bidirectional LSTM have shapes:
        - hidden: (num_layers * num_directions, batch_size, hidden_dim)
        We need to combine the hidden states from both directions to pass to the decoder.
        One common approach is to sum the hidden states from both directions.
        '''
        # Reshape hidden and cell to (num_layers, num_directions, batch_size, hidden_dim)
        num_layers = self.lstm.num_layers
        num_directions = 2  # Because bidirectional=True
        hidden = hidden.view(num_layers, num_directions, hidden.size(1), hidden.size(2))
        cell = cell.view(num_layers, num_directions, cell.size(1), cell.size(2))

        # Sum the hidden states from both directions
        hidden = hidden.sum(dim=1)  # Now shape: (num_layers, batch_size, hidden_dim)
        cell = cell.sum(dim=1)      # Now shape: (num_layers, batch_size, hidden_dim)

        # Note if you need the outputs you need to unpack the sequences using pad_packed_sequence
        # outputs, _ = pad_packed_sequence(packed_outputs, batch_first=True) #outputs shape: (batch_size, max_length, hidden_dim * num_directions)
    
        return hidden, cell

class Decoder_2(nn.Module):
    def __init__(self, output_dim, embedding_dim, hidden_dim, num_layers=2, teacher_forcing_ratio=0.5):
        super().__init__()  # Decoder_2 does not inherit from Decoder, so super(Decoder, self) would raise a TypeError
        self.teacher_forcing_ratio = teacher_forcing_ratio

        # Embedding layer for the output vocabulary( French vocabulary)
        self.embedding = nn.Embedding(output_dim, embedding_dim)
    
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        
        # Fully connected layer to generate predictions over vocabulary
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, input_token, hidden, cell):
        # input_token shape: (batch_size) - single token for each sequence in the batch, since later in the full model we will be looping over the sequence length(1 token at a time)
        
        '''Step 1: Embed the input token'''
        embedded = self.embedding(input_token).unsqueeze(1)  # Shape: (batch_size, 1, embedding_dim)
        
        '''Step 2: Pass the embedded input through the LSTM'''
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))  # Shape: (batch_size, 1, hidden_dim)
        
        '''Step 3: Generate prediction for the token'''
        prediction = self.fc_out(output.squeeze(1))  # Shape: (batch_size, output_dim)
        
        return prediction, hidden, cell

class Seq2Seq_2(nn.Module):
    def __init__(self, encoder, decoder,  teacher_forcing_ratio=0.5):
        super().__init__()  # Seq2Seq_2 does not inherit from Seq2Seq, so super(Seq2Seq, self) would raise a TypeError
        self.encoder = encoder
        self.decoder = decoder
        self.teacher_forcing_ratio = teacher_forcing_ratio
    def forward(self, src, src_lengths, tgt):
        batch_size = src.shape[0]
        tgt_sequence_length = tgt.shape[1]
        tgt_vocab_size = self.decoder.fc_out.out_features

        # Initialize tensor to hold decoder outputs
        outputs = torch.zeros(batch_size, tgt_sequence_length - 1, tgt_vocab_size)

        # Encode the source sequence
        hidden, cell = self.encoder(src, src_lengths)

        # The first input to the decoder is the <SOS> token
        input_token = tgt[:, 0]  # Shape: (batch_size)

        for t in range(1, tgt_sequence_length):
            # Pass the input token through the decoder
            output, hidden, cell = self.decoder(input_token, hidden, cell)

            # Store the output prediction
            outputs[:, t - 1, :] = output

            # Decide whether to use teacher forcing
            teacher_force = random.random() < self.teacher_forcing_ratio

            # Get the next input token
            top1 = output.argmax(1)

            input_token = tgt[:, t] if teacher_force else top1

        return outputs
    def inference(self, src, src_lengths, max_len=26, start_token=1, end_token=2):
        '''Will output the generated indices of the target sequence'''
        # src: Source input sequence - Shape: (batch_size, src_sequence_length) (English)
        # src_lengths: Actual lengths of each sequence in src (ignores padding)
        # max_len: Maximum length of the generated sequence 
        # start_token: Start-of-sequence token index
        # end_token: End-of-sequence token index
        
        batch_size = src.shape[0]

        # Encode the source sequence
        hidden, cell = self.encoder(src, src_lengths)

        # Initialize the input token with the <SOS> token
        input_token = torch.tensor([start_token] * batch_size)  # Shape: (batch_size)

        # Initialize tensor to hold generated tokens
        generated_tokens = torch.zeros(batch_size, max_len).long()

        for t in range(max_len):
            # Pass the input token through the decoder
            output, hidden, cell = self.decoder(input_token, hidden, cell)

            # Get the predicted token
            predicted_token = output.argmax(1)  # Shape: (batch_size)
            generated_tokens[:, t] = predicted_token

            # Check if all sequences have predicted <EOS>
            if (predicted_token == end_token).all():
                break

            # Update input token for the next time step
            input_token = predicted_token

        return generated_tokens
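
The reshape-and-sum step in `Encoder_2.forward` relies on how PyTorch orders the final hidden states of a bidirectional LSTM: forward and backward states alternate, layer by layer. The sketch below, with made-up dimensions, checks that the `view(...).sum(dim=1)` used above is the same as adding each layer's forward state to its backward state.

```python
import torch
from torch import nn

torch.manual_seed(0)
num_layers, hidden_dim, batch_size = 2, 3, 4
lstm = nn.LSTM(input_size=5, hidden_size=hidden_dim, num_layers=num_layers,
               batch_first=True, bidirectional=True)

with torch.no_grad():
    _, (hidden, _) = lstm(torch.randn(batch_size, 6, 5))

print(hidden.shape)  # (num_layers * 2, batch_size, hidden_dim)

# Approach used in Encoder_2: split out the direction axis, then sum over it.
summed = hidden.view(num_layers, 2, batch_size, hidden_dim).sum(dim=1)

# Equivalent slicing: even rows are forward states, odd rows are backward states.
manual = hidden[0::2] + hidden[1::2]

print(summed.shape, torch.allclose(summed, manual))
```

Summing keeps the state at `hidden_dim`, so the decoder's LSTM needs no change. Concatenating the two directions is another common choice, but it would require a decoder hidden size of `2 * hidden_dim` or an extra projection layer.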

Compile and train your new model.

In [None]:
max_src_len = 26  
max_tgt_len = 26  
batch_size = 32 
train_dataset = TranslationDataset(src_sentences=english_tokenized, tgt_sentences=french_tokenized, src_vocab=english_vocab, tgt_vocab=french_vocab, max_src_len=max_src_len, max_tgt_len=max_tgt_len)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
input_dim = len(english_vocab)
output_dim = len(french_vocab)
embedding_dim = 256
hidden_dim = 512
num_layers = 3
learning_rate = 0.001
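teacher_forcing_ratio = 0 # set to 0 here just for experimentation
num_epochs = 15
# NOTE: this cell instantiates the unidirectional Encoder/Decoder/Seq2Seq classes.
# To train the bidirectional variant, swap in Encoder_2, Decoder_2 and Seq2Seq_2 and drop the load_state_dict call, since a checkpoint saved from the unidirectional model would no longer load.
encoder = Encoder(input_dim, embedding_dim, hidden_dim, num_layers)
decoder = Decoder(output_dim, embedding_dim, hidden_dim, num_layers, teacher_forcing_ratio=teacher_forcing_ratio)
# Pass the ratio to Seq2Seq as well: its forward() reads self.teacher_forcing_ratio, which otherwise stays at the default of 0.5.
model = Seq2Seq(encoder, decoder, teacher_forcing_ratio=teacher_forcing_ratio)
model.load_state_dict(torch.load('seq2seq_model_9.pth',weights_only=True))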
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=french_vocab["<PAD>"])
print("Training the model...")
epoch_losses = []
for epoch in range(9,num_epochs):
    model.train()
    epoch_loss = 0

    for src, src_lengths, tgt in train_dataloader:
        optimizer.zero_grad()

        # Forward pass through the Seq2Seq model
        output = model(src, src_lengths, tgt)  # Shape: (batch_size, tgt_sequence_length - 1, output_dim)

        # Reshape output and target for calculating loss
        output = output.reshape(-1, output_dim)
        tgt = tgt[:, 1:].reshape(-1)  # Exclude the first <SOS> token

        # Calculate loss
        loss = criterion(output, tgt)

        # Backpropagation and optimization
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    # Calculate average loss for the epoch
    avg_loss = epoch_loss / len(train_dataloader)
    epoch_losses.append(avg_loss)

    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss}")
    s = f"seq2seq_model_{epoch + 1}.pth"
    torch.save(model.state_dict(), s)

Define a new function that relies on your new model to make predictions.

In [None]:
def translate_to_french(sentence,  french_vocab_inverse = french_index_to_token ,english_vocab = english_vocab,max_src_len = 26):
    '''
    This function takes an English sentence, converts it to tokens and then to indices, passes these indices to the model, uses model.inference to generate the indices of the French sentence, then converts these indices back to tokens and finally to a sentence.
    sentence: English sentence to translate
    french_vocab_inverse: Inverse of the French vocabulary (index to token)
    english_vocab: English vocabulary (token to index)
    max_src_len: Maximum length of source sentences 
    '''
    model.eval()
    # Tokenize the sentence
    sentence = word_tokenize(sentence.lower())
    
    indices = []
    for token in sentence:
        indices.append(english_vocab.get(token, english_vocab["<UNK>"])) # if token not found, use the index of the <UNK> token
    # Get the lengths of the sequences, before padding 
    lengths = torch.tensor([min(len(indices), max_src_len)]) # shape: (1); capped at max_src_len because longer inputs are truncated below

    # Pad the sequence
    if len(indices) < max_src_len:
        indices += [english_vocab["<PAD>"]] * (max_src_len - len(indices))
    # truncate 
    elif len(indices) > max_src_len:
        indices = indices[:max_src_len]
    indices = torch.tensor(indices).unsqueeze(0) # shape: (1, max_src_len)
    with torch.no_grad():
        translated_indices = model.inference(indices, lengths)
    # index to token 
    translated_tokens = []
    for index in translated_indices[0]:
        token = french_vocab_inverse[index.item()]
        if token == "<EOS>":
            break
        translated_tokens.append(token)
    # to make it a string (sentence) not a list of tokens
    return ' '.join(translated_tokens)

In [None]:
input_sentence = "she is driving the truck"  # renamed from `input` to avoid shadowing the built-in
print("English:", input_sentence)
print("French:", translate_to_french(input_sentence))

**What is another adjustment in terms of architecture that you might be able to do to improve your model?**

Please see the report for details; these are just brief answers.
One thing we could do is add layers, and maybe remove teacher forcing so that the model learns the patterns independently. Adding dropout may also keep the model from always associating the same words together (as discussed in the report and in the demo).

**What are some additional ways to improve the performance of our model?**

Same here, I will answer only briefly; everything is covered in the report and in the demo video. We can also add an attention mechanism, so that the model focuses on different parts of the source sentence while generating each token, which addresses problems that we also discuss in the report and in the demo :)
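
To make the attention idea concrete, here is a minimal, framework-free sketch of one decoder step. It uses dot-product scoring, which is only one of several possible scoring functions and is an assumption of this example; the function name and the toy vectors are made up as well.

```python
import math

def dot_product_attention(decoder_state, encoder_outputs):
    """Return (weights, context) for a single decoder step (illustrative helper)."""
    # Score each source position by its similarity to the current decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_outputs]
    # Softmax over the source positions (shifted by the max for numerical stability).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # The context vector is the weighted average of the encoder outputs.
    dim = len(decoder_state)
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs)) for i in range(dim)]
    return weights, context

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one vector per source token
decoder_state = [2.0, 0.0]                               # current decoder hidden state
weights, context = dot_product_attention(decoder_state, encoder_outputs)
print(weights)
print(context)
```

Instead of squeezing the whole source sentence into the encoder's final hidden state, the decoder would recompute such a context vector at every step, which typically helps most on longer sentences.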

# Video Recording Link

**A short (10 minutes max) recorded video where you explain your solution.
Make sure your face is visible in the video, as if you’re presenting your
work during a job interview.**

[Share The Link Here]