## TC 5033
## Deep Learning
## Transformers

Emmanuel Francisco González Velázquez - A01364577

Oscar Israel Lerma Franco - A01380817

Jesús Mario Martínez Díaz - A01740049

Eduardo Selim Martínez Mayorga - A01795167

José Antonio Hernández Hernández - A01381334

#### Activity 4: Implementing a Translator

- Objective

To understand the Transformer Architecture by Implementing a translator.

- Instructions

    This activity requires submission in teams. While teamwork is encouraged, each member is expected to contribute individually to the assignment. The final submission should feature the best arguments and solutions from each team member. Only one person per team needs to submit the completed work, but it is imperative that the names of all team members are listed in a Markdown cell at the very beginning of the notebook (either the first or second cell). Failure to include all team member names will result in the grade being awarded solely to the individual who submitted the assignment, with zero points given to other team members (no exceptions will be made to this rule).

    Follow the provided code. The code already implements a transformer from scratch as explained in one of [week's 9 videos](https://youtu.be/XefFj4rLHgU)

    Since the provided code already implements a simple translator, your job for this assignment is to understand it fully, and document it using pictures, figures, and markdown cells.  You should test your translator with at least 10 sentences. The dataset used for this task was obtained from [Tatoeba, a large dataset of sentences and translations](https://tatoeba.org/en/downloads).
  
- Evaluation Criteria

    - Code Readability and Comments
    - Traning a translator
    - Translating at least 10 sentences.

- Submission

Submit this Jupyter Notebook in canvas with your complete solution, ensuring your code is well-commented and includes Markdown cells that explain your design choices, results, and any challenges you encountered.



#### Script to convert csv to text file

In [2]:
#This script requires to convert the TSV file to CSV
# easiest way is to open it in Calc or excel and save as csv
import pandas as pd
PATH = '/content/Sentence pairs in English-Spanish - 2024-11-14.tsv'
df = pd.read_csv(PATH, sep='\t', on_bad_lines="skip")

In [3]:
eng_spa_cols = df.iloc[:, [1, 3]]
eng_spa_cols['length'] = eng_spa_cols.iloc[:, 0].str.len()
eng_spa_cols = eng_spa_cols.sort_values(by='length')
eng_spa_cols = eng_spa_cols.drop(columns=['length'])

output_file_path = '/content/Sentence pairs in English-Spanish - 2024-11-14.txt'
eng_spa_cols.to_csv(output_file_path, sep='\t', index=False, header=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eng_spa_cols['length'] = eng_spa_cols.iloc[:, 0].str.len()


## Transformer - Attention is all you need

In [4]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import math
import numpy as np
import re

torch.manual_seed(23)

<torch._C.Generator at 0x7cbdc1a5fdd0>

In [5]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [6]:
MAX_SEQ_LEN = 128

## PositionalEmbedding

The `PositionalEmbedding` class implements positional encoding to add information about the position of tokens in a sequence to their embeddings.

- **Attributes:**
  - `pos_embed_matrix`: Precomputed matrix of positional embeddings based on sine and cosine functions.
- **Initialization:**
  - Computes the positional encoding for a sequence up to `max_seq_len` with embedding dimension `d_model`.
- **Forward Pass:**
  - Adds the positional embedding to the input tensor `x`.

## MultiHeadAttention

The `MultiHeadAttention` class implements the multi-head attention mechanism.

- **Attributes:**
  - `W_q`, `W_k`, `W_v`: Linear transformations for queries, keys, and values.
  - `W_o`: Linear transformation for the concatenated output of all heads.
  - `d_k`, `d_v`: Dimensions of keys/values per head.
  - `num_heads`: Number of attention heads.
- **Methods:**
  - `forward(Q, K, V, mask)`: Computes the multi-head attention, handling queries (`Q`), keys (`K`), and values (`V`).
  - `scale_dot_product(Q, K, V, mask)`: Implements scaled dot-product attention.
- **Forward Pass:**
  - Splits input into multiple heads, computes attention, and recombines the heads.

## PositionFeedForward

The `PositionFeedForward` class implements a two-layer feed-forward network used in Transformer layers.

- **Attributes:**
  - `linear1`, `linear2`: Fully connected layers.
- **Forward Pass:**
  - Applies a ReLU activation between the two linear layers.

In [7]:
class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_seq_len = MAX_SEQ_LEN):
        super().__init__()
        self.pos_embed_matrix = torch.zeros(max_seq_len, d_model, device=device)
        token_pos = torch.arange(0, max_seq_len, dtype = torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0)/d_model))
        self.pos_embed_matrix[:, 0::2] = torch.sin(token_pos * div_term)
        self.pos_embed_matrix[:, 1::2] = torch.cos(token_pos * div_term)
        self.pos_embed_matrix = self.pos_embed_matrix.unsqueeze(0).transpose(0,1)

    def forward(self, x):
#         print(self.pos_embed_matrix.shape)
#         print(x.shape)
        return x + self.pos_embed_matrix[:x.size(0), :]

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model = 512, num_heads = 8):
        super().__init__()
        assert d_model % num_heads == 0, 'Embedding size not compatible with num heads'

        self.d_v = d_model // num_heads
        self.d_k = self.d_v
        self.num_heads = num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask = None):
        batch_size = Q.size(0)
        '''
        Q, K, V -> [batch_size, seq_len, num_heads*d_k]
        after transpose Q, K, V -> [batch_size, num_heads, seq_len, d_k]
        '''
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2 )

        weighted_values, attention = self.scale_dot_product(Q, K, V, mask)
        weighted_values = weighted_values.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads*self.d_k)
        weighted_values = self.W_o(weighted_values)

        return weighted_values, attention


    def scale_dot_product(self, Q, K, V, mask = None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention = F.softmax(scores, dim = -1)
        weighted_values = torch.matmul(attention, V)

        return weighted_values, attention


class PositionFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))

class EncoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.droupout1 = nn.Dropout(dropout)
        self.droupout2 = nn.Dropout(dropout)

    def forward(self, x, mask = None):
        attention_score, _ = self.self_attn(x, x, x, mask)
        x = x + self.droupout1(attention_score)
        x = self.norm1(x)
        x = x + self.droupout2(self.ffn(x))
        return self.norm2(x)

class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([EncoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)
    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

class DecoderSubLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

    def forward(self, x, encoder_output, target_mask=None, encoder_mask=None):
        attention_score, _ = self.self_attn(x, x, x, target_mask)
        x = x + self.dropout1(attention_score)
        x = self.norm1(x)

        encoder_attn, _ = self.cross_attn(x, encoder_output, encoder_output, encoder_mask)
        x = x + self.dropout2(encoder_attn)
        x = self.norm2(x)

        ff_output = self.feed_forward(x)
        x = x + self.dropout3(ff_output)
        return self.norm3(x)

class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([DecoderSubLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output, target_mask, encoder_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, target_mask, encoder_mask)
        return self.norm(x)

## Sumary

1. **Positional Encoding:**

   - Adds positional information to token embeddings.
   - Implemented using sine and cosine functions.

2. **Multi-Head Attention:**

   - Splits input into multiple attention heads to capture different aspects of the relationships between tokens.

3. **Feed-Forward Network:**

   - Provides additional processing between attention layers.

4. **Encoder-Decoder Structure:**

   - The encoder processes the input sequence.
   - The decoder generates the output sequence, attending to both the input sequence (via cross-attention) and its own predictions.

5. **Layer Normalization and Dropout:**

   - Used to stabilize and regularize training.

## Transformer class creation
1. **Initialization (`__init__`):**
   - Sets up the embeddings for source and target vocabularies.
   - Includes positional embeddings for sequence order information.
   - Constructs the encoder and decoder stacks.
   - Adds a final output layer to map decoder outputs to target vocabulary predictions.

2. **Forward Pass (`forward`):**
   - Creates source and target masks to handle padding and autoregressive decoding.
   - Embeds and applies positional encodings to source and target sequences.
   - Processes the source sequence through the encoder to generate encoded representations.
   - Passes the target sequence, encoder outputs, and masks through the decoder.
   - Maps decoder outputs to predictions via the output layer.

3. **Masking (`mask`):**
   - Generates source and target masks to distinguish padding tokens.
   - Applies a triangular mask to the target sequence for autoregressive decoding.
   - Outputs both source and target masks.

The `Transformer` class integrates these steps seamlessly, providing a robust implementation of the Transformer architecture.

In [8]:
class Transformer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, num_layers,
                 input_vocab_size, target_vocab_size, 
                 max_len=MAX_SEQ_LEN, dropout=0.1):
        super().__init__()
        # Embedding layers for source and target vocabularies
        self.encoder_embedding = nn.Embedding(input_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(target_vocab_size, d_model)
        
        # Positional embedding to provide positional information to tokens
        self.pos_embedding = PositionalEmbedding(d_model, max_len)

        # Encoder and decoder components of the Transformer
        self.encoder = Encoder(d_model, num_heads, d_ff, num_layers, dropout)
        self.decoder = Decoder(d_model, num_heads, d_ff, num_layers, dropout)

        # Output linear layer to map decoder outputs to the target vocabulary size
        self.output_layer = nn.Linear(d_model, target_vocab_size)
        
    def forward(self, source, target):
        # Create masks for the source and target sequences
        source_mask, target_mask = self.mask(source, target)

        # Apply embedding and positional encoding to the source sequence
        source = self.encoder_embedding(source) * math.sqrt(self.encoder_embedding.embedding_dim)
        source = self.pos_embedding(source)

        # Pass the encoded source sequence through the encoder
        encoder_output = self.encoder(source, source_mask)

        # Apply embedding and positional encoding to the target sequence
        target = self.decoder_embedding(target) * math.sqrt(self.decoder_embedding.embedding_dim)
        target = self.pos_embedding(target)

        # Pass the target sequence and encoder output through the decoder
        output = self.decoder(target, encoder_output, target_mask, source_mask)

        # Apply the output layer to produce final predictions
        return self.output_layer(output)
        
    def mask(self, source, target):
        # Create a mask for the source sequence, marking non-padding tokens
        source_mask = (source != 0).unsqueeze(1).unsqueeze(2)

        # Create a mask for the target sequence, marking non-padding tokens
        target_mask = (target != 0).unsqueeze(1).unsqueeze(2)

        # Generate a triangular mask for the target sequence to ensure autoregressive decoding
        size = target.size(1)
        no_mask = torch.tril(torch.ones((1, size, size), device=device)).bool()
        target_mask = target_mask & no_mask

        # Return both source and target masks
        return source_mask, target_mask

#### Simple test

# Explanation:
1. Parameters:
   - `seq_len_source` and `seq_len_target` define the lengths of the sequences.
   - `batch_size` determines the number of sequences in a batch.
   - `input_vocab_size` and `target_vocab_size` specify the size of the cabularies for the source and target.
2. Random Sequence Generation:
   - torch.randint`(start, end, size)` generates a tensor with random integers 
     in the range [start, end) with the specified size.
3. Output:
   - source and target tensors represent the generated sequences for the urce and target respectively.
   - This setup is often used to simulate input-output pairs for quence-based models during prototyping.


In [9]:
seq_len_source = 10
seq_len_target = 10
batch_size = 2
input_vocab_size = 50
target_vocab_size = 50

source = torch.randint(1, input_vocab_size, (batch_size, seq_len_source))
target = torch.randint(1, target_vocab_size, (batch_size, seq_len_target))

# Explanation:
1. Parameters for the Transformer model:
   - `d_model`: Dimensionality of the model embeddings (512).
   - `num_heads`: Number of attention heads in the multi-head attention mechanism (8).
   - `d_ff`: Dimensionality of the feed-forward network (2048).
   - `num_layers`: Number of layers in the Transformer encoder/decoder stack (6).
2. Model Initialization:
   - Transformer(d_model, num_heads, d_ff, num_layers, input_vocab_size, 
     target_vocab_size, max_len, dropout) initializes a Transformer model with the
     specified parameters, vocabulary sizes, sequence length, and dropout rate.
3. Device Placement:
   - model.to(device): Moves the model to the specified device (e.g., CPU or GPU).
   - source.to(device) and target.to(device): Move the source and target tensors
     to the same device as the model for compatibility during training or inference.

In [10]:
d_model = 512
num_heads = 8
d_ff = 2048
num_layers = 6

model = Transformer(d_model, num_heads, d_ff, num_layers,
                  input_vocab_size, target_vocab_size,
                  max_len=MAX_SEQ_LEN, dropout=0.1)

model = model.to(device)
source = source.to(device)
target = target.to(device)

In [11]:
output = model(source, target)

In [12]:
# Expected output shape -> [batch, seq_len_target, target_vocab_size] i.e. [2, 10, 50]
print(f'ouput.shape {output.shape}')

ouput.shape torch.Size([2, 10, 50])


### Translator Eng-Spa

In [13]:
PATH = '/content/Sentence pairs in English-Spanish - 2024-11-14.txt'

In [14]:
with open(PATH, 'r', encoding='utf-8') as f:
    lines = f.readlines()
eng_spa_pairs = [line.strip().split('\t') for line in lines if '\t' in line]

In [15]:
eng_spa_pairs[:10]

[['OK.', 'Bueno.'],
 ['Hi.', '¡Hola!'],
 ['Go!', '¡Ve!'],
 ['Hi.', 'Hola.'],
 ['So?', '¿Entonces?'],
 ['So?', '¿Y qué?'],
 ['So?', '¿Y?'],
 ['OK.', '¡Órale!'],
 ['Go.', 'Vete.'],
 ['Go.', 'Vaya.']]

In [16]:
eng_sentences = [pair[0] for pair in eng_spa_pairs]
spa_sentences = [pair[1] for pair in eng_spa_pairs]

In [17]:
print(eng_sentences[:10])
print(spa_sentences[:10])


['OK.', 'Hi.', 'Go!', 'Hi.', 'So?', 'So?', 'So?', 'OK.', 'Go.', 'Go.']
['Bueno.', '¡Hola!', '¡Ve!', 'Hola.', '¿Entonces?', '¿Y qué?', '¿Y?', '¡Órale!', 'Vete.', 'Vaya.']


In [18]:
def preprocess_sentence(sentence):
    sentence = sentence.lower().strip()
    sentence = re.sub(r'[" "]+', " ", sentence)
    sentence = re.sub(r"[á]+", "a", sentence)
    sentence = re.sub(r"[é]+", "e", sentence)
    sentence = re.sub(r"[í]+", "i", sentence)
    sentence = re.sub(r"[ó]+", "o", sentence)
    sentence = re.sub(r"[ú]+", "u", sentence)
    sentence = re.sub(r"[^a-z]+", " ", sentence)
    sentence = sentence.strip()
    sentence = '<sos> ' + sentence + ' <eos>'
    return sentence

In [19]:
s1 = '¿Hola @ cómo estás? 123'

In [20]:
print(s1)
print(preprocess_sentence(s1))

¿Hola @ cómo estás? 123
<sos> hola como estas <eos>


In [21]:
eng_sentences = [preprocess_sentence(sentence) for sentence in eng_sentences]
spa_sentences = [preprocess_sentence(sentence) for sentence in spa_sentences]

In [22]:
spa_sentences[:10]

['<sos> bueno <eos>',
 '<sos> hola <eos>',
 '<sos> ve <eos>',
 '<sos> hola <eos>',
 '<sos> entonces <eos>',
 '<sos> y que <eos>',
 '<sos> y <eos>',
 '<sos> orale <eos>',
 '<sos> vete <eos>',
 '<sos> vaya <eos>']

In [23]:
def build_vocab(sentences):
    words = [word for sentence in sentences for word in sentence.split()]
    word_count = Counter(words)
    sorted_word_counts = sorted(word_count.items(), key=lambda x:x[1], reverse=True)
    word2idx = {word: idx for idx, (word, _) in enumerate(sorted_word_counts, 2)}
    word2idx['<pad>'] = 0
    word2idx['<unk>'] = 1
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word

In [24]:
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
spa_word2idx, spa_idx2word = build_vocab(spa_sentences)
eng_vocab_size = len(eng_word2idx)
spa_vocab_size = len(spa_word2idx)

In [25]:
print(eng_vocab_size, spa_vocab_size)

27651 46929


## Explanation:
1. Initialization (__init__ method):
   - `eng_sentences` and `spa_sentences`: Lists of English and Spanish sentences respectively.
   - `eng_word2idx` and `spa_word2idx`: Dictionaries mapping words to their corresponding indices.
2. Length (__len__ method):
   - Returns the total number of English-Spanish sentence pairs in the dataset.
3. Accessing Items (__getitem__ method):
   - Retrieves a sentence pair (English and Spanish) at a specified index (idx).
   - Tokenizes the sentences into words and maps each word to its index using the word2idx dictionaries.
   - If a word is not found in the dictionary, it defaults to the index of the `<unk>` (unknown) token.
   - Converts the resulting lists of indices into PyTorch tensors for use in training or inference.

In [26]:
class EngSpaDataset(Dataset):
    def __init__(self, eng_sentences, spa_sentences, eng_word2idx, spa_word2idx):
        self.eng_sentences = eng_sentences
        self.spa_sentences = spa_sentences
        self.eng_word2idx = eng_word2idx
        self.spa_word2idx = spa_word2idx

    def __len__(self):
        return len(self.eng_sentences)

    def __getitem__(self, idx):
        eng_sentence = self.eng_sentences[idx]
        spa_sentence = self.spa_sentences[idx]
        # return tokens idxs
        eng_idxs = [self.eng_word2idx.get(word, self.eng_word2idx['<unk>']) for word in eng_sentence.split()]
        spa_idxs = [self.spa_word2idx.get(word, self.spa_word2idx['<unk>']) for word in spa_sentence.split()]

        return torch.tensor(eng_idxs), torch.tensor(spa_idxs)

1. Input:
   - `batch`: A list of tuples, where each tuple contains an English sentence tensor and a Spanish sentence tensor.

2. Unpacking:
   - `eng_batch` and `spa_batch`: Unzip the batch into separate lists for English and Spanish sequences.

3. Truncation:
   - Each sequence in eng_batch and spa_batch is truncated to a maximum length (`MAX_SEQ_LEN`) to ensure uniform ngth.
   - clone().detach(): Ensures the tensor is detached from its computation graph and cloned to avoid in-place erations.

4. Padding:
   - torch.nn.utils.rnn.pad_sequence is used to pad the sequences in each batch to the same length.
   - batch_first=True ensures the padded tensors have dimensions `(batch_size, seq_len)`.
   - padding_value=0 specifies the value used for padding.

5. Output:
   - eng_batch: A padded tensor of English sentence indices with dimensions `(batch_size, max_seq_len)`.
   - spa_batch: A padded tensor of Spanish sentence indices with dimensions `(batch_size, max_seq_len)`.

In [27]:
def collate_fn(batch):
    eng_batch, spa_batch = zip(*batch)
    eng_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in eng_batch]
    spa_batch = [seq[:MAX_SEQ_LEN].clone().detach() for seq in spa_batch]
    eng_batch = torch.nn.utils.rnn.pad_sequence(eng_batch, batch_first=True, padding_value=0)
    spa_batch = torch.nn.utils.rnn.pad_sequence(spa_batch, batch_first=True, padding_value=0)
    return eng_batch, spa_batch


## Model training
1. Parameters:
   - model: The sequence-to-sequence model to be trained.
   - dataloader: A PyTorch DataLoader that provides batches of English and Spanish sequences.
   - loss_function: The criterion to compute the loss (e.g., CrossEntropyLoss).
   - optimiser: The optimizer used for updating model parameters (e.g., Adam).
   - epochs: Number of training epochs.
2. Training Setup:
   - model.train(): Sets the model to training mode.
   - total_loss: Accumulates the total loss over all batches in an epoch.
3. Batch Processing:
   - eng_batch and spa_batch: Batches of English and Spanish sequences moved to the specified device.
   - target_input: Prepared by slicing spa_batch to exclude the last token (decoder input).
   - target_output: Prepared by slicing spa_batch to exclude the first token and flattening the tensor.
4. Forward Pass:
   - model(eng_batch, target_input): Runs the model with English inputs and decoder inputs.
   - output: Reshaped to match the dimensions required by the loss function.
5. Loss Calculation:
   - loss_function(output, target_output): Computes the loss between the model output and target output.
6. Backpropagation and Optimization:
   - loss.backward(): Computes the gradients.
   - optimiser.step(): Updates the model parameters using the computed gradients.
   - optimiser.zero_grad(): Resets gradients for the next iteration.
7. Epoch Summary:
   - avg_loss: The average loss over all batches in an epoch.
   - Prints the epoch number and average loss.

In [28]:
def train(model, dataloader, loss_function, optimiser, epochs):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for i, (eng_batch, spa_batch) in enumerate(dataloader):
            eng_batch = eng_batch.to(device)
            spa_batch = spa_batch.to(device)
            # Decoder preprocessing
            target_input = spa_batch[:, :-1]
            target_output = spa_batch[:, 1:].contiguous().view(-1)
            # Zero grads
            optimiser.zero_grad()
            # run model
            output = model(eng_batch, target_input)
            output = output.view(-1, output.size(-1))
            # loss\
            loss = loss_function(output, target_output)
            # gradient and update parameters
            loss.backward()
            optimiser.step()
            total_loss += loss.item()

        avg_loss = total_loss/len(dataloader)
        print(f'Epoch: {epoch}/{epochs}, Loss: {avg_loss:.4f}')



1. Batch Size:
   - `BATCH_SIZE` = 64: Specifies the number of samples per batch.
2. Dataset:
   - `EngSpaDataset`: A custom dataset containing English and Spanish sentence pairs.
   - `eng_sentences` and `spa_sentences`: Lists of English and Spanish sentences.
   - `eng_word2idx` and `spa_word2idx`: Word-to-index mappings for English and Spanish vocabularies.
3. DataLoader:
   - DataLoader(dataset, batch_size, shuffle, collate_fn):
     - dataset: The `EngSpaDataset` instance that provides sentence pairs.
     - batch_size: Number of samples per batch (64 in this case).
     - shuffle=True: Randomizes the order of samples in each epoch for better training.
     - collate_fn: A custom collate function for preprocessing batches (e.g., truncation and padding).

In [29]:
BATCH_SIZE = 64
dataset = EngSpaDataset(eng_sentences, spa_sentences, eng_word2idx, spa_word2idx)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)

1. Parameters:
   - `d_model`: Defines the size of the model's embedding vectors (512).
   - `num_heads`: Sets the number of heads in the multi-head attention mechanism (8).
   - `d_ff`: Specifies the dimensionality of the feed-forward layer (2048).
   - `num_layers`: The number of encoder and decoder layers in the Transformer (6).
   - `input_vocab_size`: Vocabulary size for the source language (e.g., English).
   - `target_vocab_size`: Vocabulary size for the target language (e.g., Spanish).
   - `max_len`: The maximum sequence length supported by the model.
   - `dropout`: The dropout rate applied during training (0.1).

In [30]:
model = Transformer(d_model=512, num_heads=8, d_ff=2048, num_layers=6,
                    input_vocab_size=eng_vocab_size, target_vocab_size=spa_vocab_size,
                    max_len=MAX_SEQ_LEN, dropout=0.1)

In [31]:
model = model.to(device)
loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimiser = optim.Adam(model.parameters(), lr=0.0001)


In [32]:
train(model, dataloader, loss_function, optimiser, epochs = 10)

Epoch: 0/10, Loss: 3.5947
Epoch: 1/10, Loss: 2.2026
Epoch: 2/10, Loss: 1.7022
Epoch: 3/10, Loss: 1.3742
Epoch: 4/10, Loss: 1.1222
Epoch: 5/10, Loss: 0.9198
Epoch: 6/10, Loss: 0.7546
Epoch: 7/10, Loss: 0.6260
Epoch: 8/10, Loss: 0.5306
Epoch: 9/10, Loss: 0.4632


1. Data preparation functions:
   - sentence_to_indices: Converts a sentence into a list of word indices using a word2idx mapping.
   - indices_to_sentence: Converts a list of word indices back into a readable sentence using an idx2word mapping.

2. Translation Function:
   - Preprocesses and tokenizes an input sentence.
   - Generates an output sequence using the Transformer model.
   - Stops when the <eos> (end-of-sequence) token is generated or when the maximum sequence length is reached.
   - Converts the generated indices back to a readable sentence.
   - Copiar código

In [33]:
def sentence_to_indices(sentence, word2idx):
    return [word2idx.get(word, word2idx['<unk>']) for word in sentence.split()]

def indices_to_sentence(indices, idx2word):
    return ' '.join([idx2word[idx] for idx in indices if idx in idx2word and idx2word[idx] != '<pad>'])

def translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    model.eval()
    sentence = preprocess_sentence(sentence)
    input_indices = sentence_to_indices(sentence, eng_word2idx)
    input_tensor = torch.tensor(input_indices).unsqueeze(0).to(device)

    # Initialize the target tensor with <sos> token
    tgt_indices = [spa_word2idx['<sos>']]
    tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)

    with torch.no_grad():
        for _ in range(max_len):
            output = model(input_tensor, tgt_tensor)
            output = output.squeeze(0)
            next_token = output.argmax(dim=-1)[-1].item()
            tgt_indices.append(next_token)
            tgt_tensor = torch.tensor(tgt_indices).unsqueeze(0).to(device)
            if next_token == spa_word2idx['<eos>']:
                break

    return indices_to_sentence(tgt_indices, spa_idx2word)

## Evaluate_translations function:

- Takes a trained model and a list of input sentences.
- Translates each sentence using the translate_sentence function.
- Prints the original sentence and its translation in Spanish.
- Example Sentences are given

- The model is moved to the specified device, and translations are evaluated and displayed in the console.

In [36]:
def evaluate_translations(model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    for sentence in sentences:
        translation = translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len, device)
        print(f'Input sentence: {sentence}')
        print(f'Traducción: {translation}')
        print()

# Example sentences to test the translator
test_sentences = [
    "Hello, how are you?",
    "I am learning artificial intelligence.",
    "Artificial intelligence is great.",
    "Good night!"
]

# Assuming the model is trained and loaded
# Set the device to 'cpu' or 'cuda' as needed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Evaluate translations
evaluate_translations(model, test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)


Input sentence: Hello, how are you?
Traducción: <sos> hola como estas <eos>

Input sentence: I am learning artificial intelligence.
Traducción: <sos> estoy aprendiendo inteligencia artificial <eos>

Input sentence: Artificial intelligence is great.
Traducción: <sos> la inteligencia artificial es buenisima <eos>

Input sentence: Good night!
Traducción: <sos> buenas noches <eos>



In [47]:
def evaluate_translations(model, sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device='cpu'):
    for sentence in sentences:
        translation = translate_sentence(model, sentence, eng_word2idx, spa_idx2word, max_len, device)
        print(f'Input sentence: {sentence}')
        print(f'Traducción: {translation}')
        print()

# Example sentences to test the translator
test_sentences = [
    "I want to pass this course",
    "I have two brothers",
    "We are from Mexico",
    "This is an interesting homework",
    "The sun is shining brightly today",
    "My favorite sport is Baseball",
    "Today we do some homework",
    "I need to learn more languajes",
    "I like to listening to music",
    "The man studies in the school"
]

# Assuming the model is trained and loaded
# Set the device to 'cpu' or 'cuda' as needed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Evaluate translations
evaluate_translations(model, test_sentences, eng_word2idx, spa_idx2word, max_len=MAX_SEQ_LEN, device=device)

Input sentence: I want to pass this course
Traducción: <sos> quiero pasar este curso <eos>

Input sentence: I have two brothers
Traducción: <sos> tengo dos hermanos <eos>

Input sentence: We are from Mexico
Traducción: <sos> somos de mexico <eos>

Input sentence: This is an interesting homework
Traducción: <sos> este es un tarea interesante <eos>

Input sentence: The sun is shining brightly today
Traducción: <sos> el sol esta brillando intensamente hoy <eos>

Input sentence: My favorite sport is Baseball
Traducción: <sos> el beisbol es mi deporte favorito <eos>

Input sentence: Today we do some homework
Traducción: <sos> hoy hacemos un poco de tarea <eos>

Input sentence: I need to learn more languajes
Traducción: <sos> necesito aprender mas para aprender <eos>

Input sentence: I like to listening to music
Traducción: <sos> me gusta escuchar musica <eos>

Input sentence: The man studies in the school
Traducción: <sos> el hombre estudia en la escuela <eos>



## Conclusions

The interesting first hurdle was the training data (Sentence pairs in English-Spanish - 2024-11-14.tsv) , originally expected to be parsed with tab (\t) characters, on line 11864 the format seems to break, either deleting the line or reconstructing it with tabs manually solves the problem (which apparently only happened on some systems). 

The second challenge was, of course the training function, using local hardware did not appear to work as well as it used to, giving the error: 

```CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 7.78 GiB total capacity; 7.16 GiB already allocated; 14.38 MiB free; 7.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF```

(The error above was of course an out of memory exception on the local graphic card.)
The solution seemed at first modification of the layers of the model, which did not turn out to be a satisfactory result, at first, given the training time was still an issue. Reliant on a more powerful graphics card, training took about 54 minutes. 

Modifying the batch size and the os environment variable ‘max_split_size_mb’ did in fact help solve the problem of a memory exception.


