<a href="https://colab.research.google.com/github/juacardonahe/Curso_NLP/blob/main/2_Mecanismos_Transformers/2.1_Encoder_Decoder/2_1_1_Seq2Seq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://raw.githubusercontent.com/juacardonahe/Curso_NLP/refs/heads/main/data/UnFieldB.png" width="40%">

# **Natural Language Processing (NLP)**
### Departamento de Ingeniería Eléctrica, Electrónica y Computación
#### Universidad Nacional de Colombia - Sede Manizales

#### Created by: Juan José Cardona H.
#### Reviewed by: Diego A. Perez

#**2.1.1 - Seq2Seq Model using Encoder–Decoder**

In this notebook, we’ll build a straightforward sequence-to-sequence (Seq2Seq) network with an Encoder–Decoder setup in PyTorch. Seq2Seq models power applications like translating text between languages, condensing articles, and answering questions by reading one sequence (for example, a sentence) and producing another sequence (for example, its translation).

We’ll focus on a toy machine‑translation problem: converting German sentences into English. To do this, we’ll use the parallel sentence pairs provided by the Multi30k dataset.

**About Multi30k**

Multi30k consists of roughly 30,000 aligned German–English sentence pairs. Each entry pairs a German sentence with its English counterpart, which makes the dataset well suited to training small translation models.

Our workflow will include tokenizing both German and English texts, constructing vocabularies for each language, and training our Seq2Seq model to take German input and generate the corresponding English output.

#**What is done in this mini project?**

1. **Data Preprocessing:**  
   We’ll tokenize sentences in German and English and prepare the dataset for training.

2. **Model Architecture:**  
   - **Encoder:** Reads the source (German) sentence and converts it into a context vector.  
   - **Decoder:** Uses the context vector to generate the target (English) sentence.

3. **Training:**  
   We’ll train the model on the dataset to minimize the loss between the predicted and actual target sentences.

4. **Evaluation:**  
   We’ll evaluate the model by translating new sentences and computing the BLEU score, which is commonly used to evaluate the quality of machine translations.


##**1. Installing libraries**
We will use libraries such as torch for deep learning, torchtext to manage datasets, and spacy for tokenizing text.

In [None]:
!pip install torch torchtext spacy
!python -m spacy download de_core_news_sm
!python -m spacy download en_core_web_sm

###**Import libraries**

In [None]:
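import random
import spacy
import torch
import torch.nn as nn
import torch.optim as optim
# Field and BucketIterator are torchtext's legacy API: torchtext.data up to 0.8,
# torchtext.legacy in 0.9-0.11, removed in 0.12 (install an older release).
try:
    from torchtext.legacy.data import Field, BucketIterator
    from torchtext.legacy.datasets import Multi30k
except ImportError:
    from torchtext.data import Field, BucketIterator
    from torchtext.datasets import Multi30k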

##**2. Data Pre-Processing**

In [2]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_de(text):
    # Tokenize German text and reverse the token order
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    # Tokenize English text (order preserved)
    return [tok.text for tok in spacy_en.tokenizer(text)]

###**Field Definitions**

We use `torchtext`’s `Field` to define how the data will be processed. We specify how to tokenize the data and handle the start (`<sos>`) and end (`<eos>`) tokens.

In [3]:
SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)
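
To make the effect of these settings concrete, here is a minimal sketch of what each sentence looks like after processing. It is an illustration only: `preprocess` is a hypothetical helper, and whitespace splitting is assumed as a stand-in for the spaCy tokenizers.

```python
def preprocess(sentence, reverse=False):
    """Mimic the Field pipeline: tokenize, lowercase, optionally reverse, add markers."""
    # Assumption: str.split() replaces the spaCy tokenizers used in the notebook.
    tokens = [tok.lower() for tok in sentence.split()]
    if reverse:
        tokens = tokens[::-1]  # tokenize_de reverses the German tokens
    return ['<sos>'] + tokens + ['<eos>']

src = preprocess('Zwei Hunde spielen', reverse=True)
trg = preprocess('Two dogs play')
print(src)  # ['<sos>', 'spielen', 'hunde', 'zwei', '<eos>']
print(trg)  # ['<sos>', 'two', 'dogs', 'play', '<eos>']
```

The `<sos>` and `<eos>` markers wrap the sequence after any reversal, so the decoder always starts from `<sos>` and learns to stop at `<eos>`.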

###**Loading and Building Vocab**

We now load the dataset and build the vocabulary for the source (German) and target (English) languages.

In [None]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

`min_freq=2` ensures that only words that appear at least twice in the training data are included in the vocabulary. Rarer words are mapped to the unknown token `<unk>`.
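
The frequency cutoff can be sketched in plain Python. This is not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, and the order of the special tokens is an assumption made for the example.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=2,
                specials=('<unk>', '<pad>', '<sos>', '<eos>')):
    """Return (stoi, itos), keeping only tokens seen at least min_freq times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent first; ties are broken alphabetically so the result is deterministic.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Map tokens to indices; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

corpus = [['a', 'dog', 'runs'], ['a', 'dog', 'sleeps'], ['a', 'cat', 'runs']]
stoi, itos = build_vocab(corpus, min_freq=2)
print(itos)                                         # ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'dog', 'runs']
print(numericalize(['a', 'cat', 'sleeps'], stoi))   # [4, 0, 0]
```

Here `cat` and `sleeps` occur only once, so both fall back to index 0 (`<unk>`).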

##**3. Encoder**
The Encoder is a multi-layer GRU that reads a sequence of words (in this case, a German sentence) and returns its final hidden state, which serves as the context vector.

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src = [src_len, batch_size]
        embedded = self.dropout(self.embedding(src))  # [src_len, batch_size, emb_dim]
        outputs, hidden = self.rnn(embedded)
        return hidden

##**4. Decoder**
The Decoder generates the target sentence (in English) one word at a time, conditioned on the context vector produced by the encoder.

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, dropout=dropout)
        self.fc_out = nn.Linear(hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden):
        # input = [batch_size]
        input = input.unsqueeze(0)  # [1, batch_size]
        embedded = self.dropout(self.embedding(input))  # [1, batch_size, emb_dim]
        output, hidden = self.rnn(embedded, hidden)  # [1, batch_size, hid_dim], [n_layers, batch_size, hid_dim]
        prediction = self.fc_out(output.squeeze(0))  # [batch_size, output_dim]
        return prediction, hidden

##**5. Seq2Seq Model**

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src = [src_len, batch_size], trg = [trg_len, batch_size]
        trg_len = trg.shape[0]
        batch_size = trg.shape[1]
        trg_vocab_size = self.decoder.fc_out.out_features

        # outputs[0] stays zero; it is skipped when the loss is computed
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        hidden = self.encoder(src)

        input = trg[0, :]  # first decoder input is the <sos> token
        for t in range(1, trg_len):
            output, hidden = self.decoder(input, hidden)
            outputs[t] = output
            top1 = output.argmax(1)  # most likely next token
            # teacher forcing: feed the ground-truth token with the given probability
            input = trg[t] if random.random() < teacher_forcing_ratio else top1

        return outputs
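
The teacher forcing step in `forward` decides, at every time step, whether the decoder's next input is the ground-truth token or its own prediction. The sketch below isolates that decision in plain Python. It is illustrative only: `decode_with_teacher_forcing` and `dummy_decoder` are hypothetical stand-ins, and the "decoder" simply predicts the same word every time.

```python
import random

def decode_with_teacher_forcing(target, predict_next, teacher_forcing_ratio, rng):
    """Walk the target as Seq2Seq.forward does; return the decoder inputs and predictions."""
    inputs, predictions = [], []
    current = target[0]  # decoding always starts from <sos>
    for t in range(1, len(target)):
        inputs.append(current)
        predicted = predict_next(current)
        predictions.append(predicted)
        # With probability teacher_forcing_ratio, feed the true token next.
        use_truth = rng.random() < teacher_forcing_ratio
        current = target[t] if use_truth else predicted
    return inputs, predictions

def dummy_decoder(token):
    # Hypothetical stand-in for the trained decoder: always predicts "the".
    return 'the'

target = ['<sos>', 'two', 'dogs', 'play', '<eos>']
rng = random.Random(0)
forced_inputs, _ = decode_with_teacher_forcing(target, dummy_decoder, 1.0, rng)
free_inputs, _ = decode_with_teacher_forcing(target, dummy_decoder, 0.0, rng)
print(forced_inputs)  # ['<sos>', 'two', 'dogs', 'play']
print(free_inputs)    # ['<sos>', 'the', 'the', 'the']
```

With a ratio of 1.0 the decoder always sees the reference sentence; with 0.0 it only ever sees its own output, which is what happens at inference time.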

##**6. Training the model**
We define the training loop, where we feed in the German sentence, run it through the Seq2Seq model, and compute the loss using Cross Entropy.

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0

    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg

        optimizer.zero_grad()
        output = model(src, trg)

        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()

    return epoch_loss / len(iterator)
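
The `clip_grad_norm_` call in `train` rescales all gradients together whenever their combined L2 norm exceeds `clip`, which guards against exploding gradients in recurrent networks. A minimal plain-Python sketch of the same idea follows, using a flat list of numbers instead of parameter tensors; `clip_by_global_norm` is a hypothetical helper, not the PyTorch function.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their L2 norm is at most max_norm; return (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever shrink: the scale factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # roughly [0.6, 0.8]: same direction, norm about 1

small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)    # [0.3, 0.4]: already within the limit, left unchanged
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes.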

##**7. Evaluate the model**
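Evaluation mirrors training, with three differences: we switch the model to evaluation mode (which disables dropout), wrap the loop in `torch.no_grad()` so no gradients are computed, and set the teacher forcing ratio to 0 so the decoder always consumes its own predictions.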

In [None]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            output = model(src, trg, 0)  # teacher forcing ratio 0

            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()

    return epoch_loss / len(iterator)

##**8. Initialize the model**
Now, we initialize the model, optimizer, and loss function, and specify hyperparameters.

In [None]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Seq2Seq(enc, dec, device).to(device)
optimizer = optim.Adam(model.parameters())
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)
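
Passing `ignore_index=TRG_PAD_IDX` tells the loss to skip padded positions, so with the default mean reduction the average is taken over real tokens only. The sketch below reproduces that behaviour in plain Python on lists of logits; `cross_entropy` is a hypothetical helper and the numbers are made up for the illustration.

```python
import math

def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over the positions whose target is not ignore_index."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        # log-sum-exp with the maximum subtracted for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        counted += 1
    return total / counted  # assumes at least one non-padded target

PAD = 1  # assumed padding index for this toy example
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [0, 2, PAD]
loss = cross_entropy(logits, targets, ignore_index=PAD)
loss_without_pad_row = cross_entropy(logits[:2], targets[:2], ignore_index=PAD)
print(loss == loss_without_pad_row)  # True: the padded row does not affect the loss
```

Without the mask, long runs of `<pad>` in short sentences would dominate the average and the model would be rewarded for predicting padding.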

##**9. Training loop**
We train the model for a set number of epochs.


In [None]:
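N_EPOCHS = 10
CLIP = 1
BATCH_SIZE = 128  # assumed value; no batch size is set elsewhere in the notebook

# Build the iterators used below; BucketIterator batches sentences of similar length
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)

for epoch in range(N_EPOCHS):
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)

    print(f'Epoch: {epoch+1}')
    print(f'Train Loss: {train_loss:.3f} | Val. Loss: {valid_loss:.3f}')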
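
The losses printed by this loop are mean cross-entropy values, which are easier to interpret as perplexity (the exponential of the loss). The sketch below shows the conversion and one simple way to pick the best epoch. The loss values are invented for the illustration and are not results from this model.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)

# Hypothetical validation losses, one per epoch (illustration only).
valid_losses = [4.2, 3.6, 3.9]

# Index of the epoch with the lowest validation loss.
best_epoch = min(range(len(valid_losses)), key=valid_losses.__getitem__)

for epoch, loss in enumerate(valid_losses):
    marker = ' <- best' if epoch == best_epoch else ''
    print(f'Epoch {epoch + 1}: loss {loss:.3f} | ppl {perplexity(loss):.1f}{marker}')
```

A common pattern is to save the model weights whenever the validation loss improves, so the final evaluation uses the best epoch rather than the last one.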

##**10. BLEU Score**
Finally, we can test our model by translating new sentences and evaluating the BLEU score.

In [None]:
def translate_sentence(sentence, src_field, trg_field, model, device, max_len=50):
    model.eval()
    # Reuse tokenize_de so the source is reversed exactly as during training
    tokens = [token.lower() for token in tokenize_de(sentence)]
    tokens = [src_field.init_token] + tokens + [src_field.eos_token]

    src_indexes = [src_field.vocab.stoi[token] for token in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)  # [src_len, 1]

    with torch.no_grad():
        hidden = model.encoder(src_tensor)

    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    for _ in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
        with torch.no_grad():
            output, hidden = model.decoder(trg_tensor, hidden)
        pred_token = output.argmax(1).item()  # greedy choice
        trg_indexes.append(pred_token)
        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    return trg_tokens[1:]  # drop the leading <sos>

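This function translates a sentence from German to English using the trained Seq2Seq model. It decodes greedily, picking the most likely token at each step, and stops when the model emits `<eos>` or after `max_len` tokens.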
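
To score translations with BLEU, we compare the n-grams of each candidate with those of its reference. The sketch below is a minimal plain-Python version of corpus-level BLEU with a single reference per sentence and uniform weights over 1- to 4-grams. `ngrams` and `corpus_bleu` are hypothetical helpers written for this illustration, not a library API, and the sentences are toy examples.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per candidate and uniform n-gram weights."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # Clipped matches: & on Counters keeps the element-wise minimum count.
            matches[n - 1] += sum((cand_counts & ref_counts).values())
            totals[n - 1] += sum(cand_counts.values())
    if cand_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives a score of zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes candidates shorter than their references.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)

reference = ['two', 'dogs', 'play', 'in', 'the', 'park']
exact = corpus_bleu([reference], [reference])
one_wrong = corpus_bleu([['two', 'dogs', 'play', 'in', 'a', 'park']], [reference])
print(exact)      # 1.0
print(one_wrong)  # about 0.537
```

In this notebook, the candidates would be the outputs of `translate_sentence` with any trailing `<eos>` removed, and the references the tokenized English sentences of the test set.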