# Transformers: Attention Architecture for Machine Translation

In this notebook we implement a **Transformer** model from scratch using PyTorch. This architecture, introduced in the paper "Attention is All You Need" (Vaswani et al., 2017), was a turning point in Natural Language Processing (NLP) and is the basis of modern models such as BERT and GPT.

## Introduction

### Objectives

1. **Understand the Transformer architecture** and how the attention mechanism (self-attention) processes text sequences.
2. **Implement the key components from scratch**: Positional Encoding, Multi-Head Attention, Encoder, and Decoder.
3. **Train an English-Spanish machine translation model** using the complete Transformer architecture.
4. **Explore the use of masks** for autoregressive training and for handling padding in variable-length sequences.

### Contents

1. Introduction to the Transformer architecture and its relevance in modern NLP.
2. Data preparation and vocabulary construction for machine translation.
3. Implementation of **Positional Encoding** to capture position information in the sequences.
4. Development of the **Multi-Head Attention** mechanism and the scaled dot-product attention behind it.
5. Construction of the **Encoder** and **Decoder** layers and their components: self-attention, encoder-decoder attention, and feed-forward networks.
6. Implementation of **masks** for padding and look-ahead masking in the decoder.
7. Training of the complete Transformer model and evaluation on sentence translation.

### The Transformer concept

The Transformer architecture changed sequence processing by **completely removing recurrent networks** (RNNs and LSTMs) and relying solely on attention mechanisms. Unlike traditional sequential architectures, the Transformer can process all elements of a sequence in parallel, which makes training significantly more efficient.

**Main components:**

- **Encoder**: Processes the input sequence through self-attention layers and feed-forward networks, producing a contextualized representation of each token.
- **Decoder**: Generates the output sequence autoregressively, using both self-attention over the tokens generated so far and cross-attention over the encoder output.
- **Multi-Head Attention**: Lets the model attend to different representations and positions of the sequence simultaneously, capturing complex relationships between words.
- **Positional Encoding**: Since the Transformer has no inherent notion of word order, positional information is added using sine and cosine functions.

This architecture has become the de facto standard for NLP tasks and underlies today's most advanced language models.

<div align="center">
    <img src="https://d1.awsstatic.com/GENAI-1.151ded5440b4c997bac0642ec669a00acff2cca1.png" width="600px">
</div>

### Translation Dataset

For this notebook we use the Tatoeba English-Spanish translation dataset, which contains sentence pairs in both languages. The goal is to train a Transformer model that learns to translate sentences from English to Spanish. The dataset includes short and medium-length sentences from a variety of everyday contexts, which exposes the model to varied linguistic patterns.

The dataset is available at: [Tatoeba Downloads](https://tatoeba.org/en/downloads)

### References

- [Attention is All You Need](https://arxiv.org/abs/1706.03762) - Vaswani et al. (2017)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) - Jay Alammar (encoder + decoder)
- [Transformer Explainer](https://poloclub.github.io/transformer-explainer/) - Interactive visualization (decoder only)
- [LLM Visualization](https://bbycroft.net/llm) - Brendan Bycroft (decoder only)

---

In [1]:
# !pip install torchinfo

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

from torchinfo import summary

import math
import numpy as np
import os
import re
from pathlib import Path
from collections import Counter

In [3]:
# Fix the seed so that results are reproducible
SEED = 23

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [4]:
import sys

# choose the device to run on
DEVICE = "cpu"  # default: CPU
if torch.cuda.is_available():
    DEVICE = "cuda"  # use the CUDA GPU when one is available
elif torch.backends.mps.is_available():
    DEVICE = "mps"  # otherwise use MPS (Apple Silicon) if available
elif torch.xpu.is_available():
    DEVICE = "xpu"  # otherwise use XPU (Intel GPU) if available

print(f"Usando {DEVICE}")

NUM_WORKERS = 0  # Windows and macOS can have problems with multiple workers
if sys.platform == "linux":
    NUM_WORKERS = 4  # number of data-loading workers (tune for your machine)

print(f"Usando {NUM_WORKERS}")

Usando cuda
Usando 4


## Loading and Preprocessing the Data

We read the translation dataset and apply basic preprocessing to clean and normalize the text before feeding it to the model: we lowercase the text, remove some special characters, and filter out the longest sentences.

In [5]:
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Insert spaces around the punctuation marks we want to keep
    text = re.sub(r"([¿?¡!,])", r" \1 ", text)

    # Replace everything that is not a letter, a digit, or a kept symbol with a space
    text = re.sub(r"[^a-zA-Z0-9áéíóúüñ¿?¡!,]+", " ", text)

    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text
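
To make the effect of these steps concrete, here is a small self-contained sketch. It repeats the same regular expressions so it can run on its own; the three example sentences are made up for illustration. Note that apostrophes are not in the list of kept symbols, so contractions such as "it's" are split into two tokens.

```python
import re


def clean_text(text):
    # Same steps as the notebook's clean_text, repeated so the example runs on its own
    text = text.lower()
    text = re.sub(r"([¿?¡!,])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z0-9áéíóúüñ¿?¡!,]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical inputs chosen to exercise each rule
examples = ["It's about to rain!", "¿Cómo estás?", "He died at the age of 54."]
cleaned = [clean_text(s) for s in examples]

for raw, out in zip(examples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

Kept punctuation becomes its own whitespace-separated token, which is why a later `split()` is enough to tokenize the sentences.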

In [6]:
DATA_PATH = str(Path("data") / "English-Spanish.tsv")

MAX_SENTENCE_LENGTH = 15  # Maximum sentence length (in words) that we keep


def load_data(source_file, max_words=5):
    # Read as UTF-8 explicitly (assumed file encoding) so accented characters
    # load the same way on every platform
    with open(source_file, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Split the sentences into two lists
    input_texts = []
    target_texts = []

    for line in lines:
        elements = line.split("\t")

        # Columns 1 and 3 hold the source and target sentences
        input_text = elements[1]
        target_text = elements[3]

        input_text_clean = clean_text(input_text)
        target_text_clean = clean_text(target_text)

        # Keep only pairs where both sentences have at most max_words words
        if (
            len(input_text_clean.split()) <= max_words
            and len(target_text_clean.split()) <= max_words
        ):
            input_texts.append(input_text_clean)
            target_texts.append(target_text_clean)

    return input_texts, target_texts


src_texts, trg_texts = load_data(DATA_PATH, MAX_SENTENCE_LENGTH)

In [7]:
print(f"Number of samples: {len(src_texts)}")

Number of samples: 264266


In [8]:
random_idx = np.random.randint(0, len(src_texts), 10)
for idx in random_idx:
    print(f"Input: {src_texts[idx]}")
    print(f"Target: {trg_texts[idx]}\n")

Input: as usual , the physics teacher was late for class
Target: como de costumbre , el profesor de física llegó tarde a clase

Input: nobody seems to know where jean is
Target: nadie parece saber dónde está jean

Input: how many pupils are there in your school ?
Target: ¿ cuántos alumnos hay en tu escuela ?

Input: he died at the age of 54
Target: murió a la edad de 54 años

Input: i have a lot of things to do this morning
Target: tengo muchas cosas que hacer esta mañana

Input: it s about to rain
Target: está a punto de llover

Input: i won t die
Target: no moriré

Input: they are about the same age
Target: ellos tienen más o menos la misma edad

Input: tom gave me a pen
Target: tom me dio un bolígrafo

Input: the tea we had there was excellent
Target: el té que tomamos allí era excelente



## Building the Vocabularies

We need to build one vocabulary per language in the dataset. Each vocabulary must be able to map words to integer indices and back.

> Note: to reduce training time, the vocabulary can be limited to the most common words in each language. `FREQ_THRESHOLD` sets the minimum number of occurrences a word needs in order to be included; with `FREQ_THRESHOLD = 1`, as set below, every word is kept.

We also need to add the special tokens:

- `SOS` (Start of Sentence): Marks the start of a sentence.
- `EOS` (End of Sentence): Marks the end of a sentence.
- `UNK` (Unknown): Stands for a word that is not in the vocabulary.
- `PAD` (Padding): Used to pad sequences to the same length.
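
The interaction between the frequency threshold and the `<UNK>` token can be seen on a toy corpus. The following is a minimal sketch, not the notebook's `Vocab` class: the helper names `build_word2index` and `encode` and the three sentences are hypothetical, and the special tokens are assumed to take indices 0 to 3 in the order listed above.

```python
from collections import Counter

SPECIAL_TOKENS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]


def build_word2index(sentences, min_freq=1):
    # Count every whitespace-separated word in the corpus
    counts = Counter(word for s in sentences for word in s.split())
    # Special tokens come first, so <PAD> is 0 and <UNK> is 3
    word2index = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for word, count in counts.items():
        if count >= min_freq:
            word2index[word] = len(word2index)
    return word2index


def encode(sentence, word2index):
    # Words missing from the vocabulary fall back to the <UNK> index
    unk = word2index["<UNK>"]
    return [word2index.get(w, unk) for w in sentence.split()]


toy_sentences = ["tom likes tea", "tom likes coffee", "mary likes tea"]
vocab_all = build_word2index(toy_sentences, min_freq=1)
vocab_min2 = build_word2index(toy_sentences, min_freq=2)

print(len(vocab_all), len(vocab_min2))
encoded = encode("mary likes tea", vocab_min2)
print(encoded)
```

With `min_freq=2`, "mary" and "coffee" appear only once and are dropped, so "mary" is encoded with the `<UNK>` index. A higher threshold shrinks the embedding and output layers at the cost of more unknown tokens.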

In [9]:
PAD_TOKEN = "<PAD>"
SOS_TOKEN = "<SOS>"
EOS_TOKEN = "<EOS>"
UNK_TOKEN = "<UNK>"
FREQ_THRESHOLD = 1  # Minimum frequency for a word to be included in the vocabulary


class Vocab:
    def __init__(self):
        # maps words to indices
        self.word2index = {}
        # maps indices to words
        self.index2word = {}
        # word counter
        self.word_count = Counter()
        self.index = 0

        # Special tokens
        self.add_special_tokens()

    def add_special_tokens(self):
        self.add_word(PAD_TOKEN)
        self.add_word(SOS_TOKEN)
        self.add_word(EOS_TOKEN)
        self.add_word(UNK_TOKEN)

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.index
            self.index2word[self.index] = word
            self.index += 1

    def build_vocab(self, sentences, min_freq=1):
        word_counter = Counter()
        for sentence in sentences:
            for word in sentence.split():
                word_counter[word] += 1

        # Drop words that do not reach the minimum frequency
        words = [word for word, count in word_counter.items() if count >= min_freq]

        # Add the remaining words to the vocabulary
        for word in words:
            self.add_word(word)
            self.word_count[word] = word_counter[word]

    def __len__(self):
        return len(self.word2index)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.index2word.get(key, UNK_TOKEN)
        if isinstance(key, str):
            return self.word2index.get(key, self.word2index[UNK_TOKEN])


# Build the vocabularies
SRC_VOCAB = Vocab()
TRG_VOCAB = Vocab()

SRC_VOCAB.build_vocab(src_texts, min_freq=FREQ_THRESHOLD)
TRG_VOCAB.build_vocab(trg_texts, min_freq=FREQ_THRESHOLD)

SRC_VOCAB_SIZE = len(SRC_VOCAB)
TRG_VOCAB_SIZE = len(TRG_VOCAB)

print(f"English vocab size: {SRC_VOCAB_SIZE}")
print(f"Spanish vocab size: {TRG_VOCAB_SIZE}")

English vocab size: 25033
Spanish vocab size: 45139


Some examples of how the vocabularies are used:

- `SRC_VOCAB['hello']`: Returns the index of the word "hello" in the source vocabulary.
- `TRG_VOCAB['hola']`: Returns the index of the word "hola" in the target vocabulary.
- `TRG_VOCAB['palabra_no_existente']`: Returns the index of the unknown token (`<UNK>`) in the target vocabulary.
- `TRG_VOCAB[10]`: Returns the word at index 10 of the target vocabulary.
- `TRG_VOCAB[3]`: Returns the word at index 3 of the target vocabulary.
- `TRG_VOCAB[PAD_TOKEN]`: Returns the index of the padding token in the target vocabulary.

In [10]:
print(SRC_VOCAB["hello"])
print(TRG_VOCAB["hola"])
print(TRG_VOCAB["palabra_no_existente"])
print(TRG_VOCAB[10])
print(TRG_VOCAB[3])
print(TRG_VOCAB[PAD_TOKEN])

979
1308
3
irme
<UNK>
0


In [11]:
# Encode a sentence as a list of vocabulary indices
def encode_sentence(sentence, vocab):
    return [vocab[word] for word in sentence.split()]


print(encode_sentence("hello world", SRC_VOCAB))
print(encode_sentence("hola mundo extraterrestre", TRG_VOCAB))

[979, 59]
[1308, 63, 26196]


In [12]:
def decode_sentence(indices, vocab):
    return " ".join(
        [
            vocab[idx]
            for idx in indices
            if idx != vocab[PAD_TOKEN]
            and idx != vocab[EOS_TOKEN]
            and idx != vocab[SOS_TOKEN]
        ]
    )


print(decode_sentence([10, 11, 12], TRG_VOCAB))

irme a dormir


## Translation Dataset

We work with an **English-Spanish** translation dataset made up of pairs of simple sentences. The goal is for the model to learn to translate an English sentence into its Spanish equivalent.

#### Example pairs:

- **English**: "hello" → **Spanish**: "hola"
- **English**: "how are you?" → **Spanish**: "¿cómo estás?"

### Preparing the Dataset

Each sentence is tokenized and converted to numeric indices from its vocabulary. In the target sequences we add the special tokens `<SOS>` and `<EOS>` to mark the start and end of each translation.

In [13]:
class TranslationDataset(Dataset):
    def __init__(self, source_sentences, target_sentences):
        super(TranslationDataset, self).__init__()
        self.source_sentences = source_sentences
        self.target_sentences = target_sentences

    def __len__(self):
        return len(self.source_sentences)

    def __getitem__(self, idx):
        source_sentence = self.source_sentences[idx]
        target_sentence = self.target_sentences[idx]
        encoded_source_sentence = encode_sentence(source_sentence, SRC_VOCAB)
        encoded_target_sentence = encode_sentence(target_sentence, TRG_VOCAB)

        # Add the special <SOS> and <EOS> tokens
        encoded_target_sentence = (
            [TRG_VOCAB[SOS_TOKEN]] + encoded_target_sentence + [TRG_VOCAB[EOS_TOKEN]]
        )

        x = torch.tensor(encoded_source_sentence, dtype=torch.long)
        y = torch.tensor(encoded_target_sentence, dtype=torch.long)

        return x, y


# Create the dataset and split it into training (90%) and validation (10%)
train_dataset = TranslationDataset(src_texts, trg_texts)
val_len = int(0.10 * len(train_dataset))
train_len = len(train_dataset) - val_len
train_dataset, val_dataset = random_split(train_dataset, [train_len, val_len])

Because the sentences have different lengths, we use padding so that all sequences in a batch have the same length.

`collate_fn` is an optional argument of PyTorch's DataLoader that customizes how individual samples are combined into a batch. Here it pads the source and target sequences and stacks them into batches.

In [14]:
def collate_fn(batch):
    sources = [item[0] for item in batch]
    targets = [item[1] for item in batch]

    # Pad the sequences to the length of the longest one in the batch
    sources_padded = pad_sequence(
        sources,
        padding_value=SRC_VOCAB[PAD_TOKEN],
        batch_first=True,
        padding_side="right",
    )
    targets_padded = pad_sequence(
        targets,
        padding_value=TRG_VOCAB[PAD_TOKEN],
        batch_first=True,
        padding_side="right",
    )

    return sources_padded, targets_padded

In [15]:
BATCH_SIZE = 128  # 128 or less for class

train_loader = DataLoader(
    train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn
)
valid_loader = DataLoader(
    val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn
)

## Modelo

> **Encoder**: The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, positionwise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is $LayerNorm(x + Sublayer(x))$, where $Sublayer(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$.
>
> **Decoder**: The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

Original paper: [Attention is All You Need](https://arxiv.org/abs/1706.03762)

<div align="center">
    <img src="https://d1.awsstatic.com/GENAI-1.151ded5440b4c997bac0642ec669a00acff2cca1.png" width="400px">
</div>
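
The masking described in the decoder quote above can be sketched without any framework. The following is an illustrative plain-Python version, not the notebook's implementation: the function names are hypothetical, `PAD_IDX = 0` is an assumption that matches `<PAD>` being the first token added to the vocabulary, and `True` means "this position may be attended to".

```python
PAD_IDX = 0  # assumed padding index


def make_padding_mask(token_ids, pad_idx=PAD_IDX):
    """True for real tokens, False for padding positions."""
    return [tok != pad_idx for tok in token_ids]


def make_look_ahead_mask(seq_len):
    """Lower-triangular mask: query position i may attend to key positions 0..i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def make_target_mask(token_ids, pad_idx=PAD_IDX):
    """Combine both masks: a key is visible only if it is not in the future and not padding."""
    pad = make_padding_mask(token_ids, pad_idx)
    causal = make_look_ahead_mask(len(token_ids))
    n = len(token_ids)
    return [[causal[i][j] and pad[j] for j in range(n)] for i in range(n)]


# Hypothetical target sequence: <SOS> w1 w2 <EOS> <PAD>
trg = [1, 10, 11, 2, 0]
mask = make_target_mask(trg)

for row in mask:
    print("".join("1" if keep else "0" for keep in row))
```

Each printed row corresponds to one query position. The first row sees only `<SOS>`, and no row ever sees the padded last column. In a tensor implementation the same pattern would be broadcast over the batch and head dimensions.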

### Model hyperparameters
Adjust these so that the model can be trained within a reasonable time during class.

In [16]:
# Model parameters
SRC_PAD_IDX = SRC_VOCAB[PAD_TOKEN]
TRG_PAD_IDX = TRG_VOCAB[PAD_TOKEN]
D_MODEL = 256
NUM_LAYERS = 6  # in class, use 4 or 2
FORWARD_EXPANSION = 4
HEADS = 8
DROPOUT = 0.1
MAX_SENTENCE_LENGTH_MODEL = (
    MAX_SENTENCE_LENGTH + 2
)  # account for the SOS and EOS tokens

### Positional Encoding

<div style="text-align: center;">
    <img src="https://kazemnejad.com/img/transformer_architecture_positional_encoding/model_arc.jpg" width="1200"/>
</div>

> Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed. In this work, we use sine and cosine functions of different frequencies:
> $$
> \begin{aligned}
> \text{PE}_{(pos,2i)} &= \sin(pos / 10000^{2i/d_{\text{model}}}) \\
> \text{PE}_{(pos,2i+1)} &= \cos(pos / 10000^{2i/d_{\text{model}}})
> \end{aligned}
> $$
> where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

> Note: $d_{model}$ = embedding dimension

This **form of positional encoding** has the following properties:
- Each position gets a unique representation (determined by the frequency and the dimension).
- The relationship between two positions at a given distance is consistent regardless of the sequence length.
- The encoding can be computed for any position without retraining, which may help the model generalize to longer sequences.
- Its computation is deterministic.

**Why use sine and cosine functions?** They let the model capture distance relationships between positions through linear combinations. See [Linear Relationships in the Transformer’s Positional Encoding](https://blog.timodenk.com/linear-relationships-in-the-transformers-positional-encoding/)
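
The linear relationship can be written out explicitly. As an illustrative derivation (a standard trigonometric identity, not taken from the notebook), define the frequency $\omega_i = 1 / 10000^{2i/d_{\text{model}}}$ for the pair of dimensions $(2i, 2i+1)$. The angle-addition formulas then give:

$$
\begin{pmatrix} \text{PE}_{(pos+k,\,2i)} \\ \text{PE}_{(pos+k,\,2i+1)} \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \text{PE}_{(pos,\,2i)} \\ \text{PE}_{(pos,\,2i+1)} \end{pmatrix}
$$

The matrix depends only on the offset $k$ and not on $pos$, so moving $k$ positions forward is the same rotation everywhere in the sequence. This is what "a linear function of $PE_{pos}$" means in the quote above.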

**Why 10000?** It is a hyperparameter that sets the range of frequencies used by the sine and cosine functions. If it were very small, two different positions could end up with very similar representations; if it were very large, the differences between nearby positions could become negligible. "The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$." That is, the first pair of dimensions ($i=0$) has a high frequency (fast cycles), repeating about every 6.28 positions ($2\pi$), while the last dimensions have much lower frequencies (slow cycles), which lets them capture long-range patterns.
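
The wavelength range can be checked numerically. This is a small sketch using only the standard library; the helper name `wavelengths` is hypothetical, and `d_model = 256` is used as an example value. Since dimension pair $2i$ uses $\sin(pos / 10000^{2i/d_{\text{model}}})$, its wavelength is $2\pi \cdot 10000^{2i/d_{\text{model}}}$.

```python
import math


def wavelengths(d_model, base=10_000):
    # One wavelength per (sin, cos) pair: 2*pi * base^(2i / d_model)
    return [2 * math.pi * base ** (two_i / d_model) for two_i in range(0, d_model, 2)]


w = wavelengths(256)

print(f"number of frequencies: {len(w)}")
print(f"shortest wavelength:   {w[0]:.4f}")
print(f"longest wavelength:    {w[-1]:.1f}")
print(f"upper bound 10000*2pi: {10_000 * 2 * math.pi:.1f}")
print(f"ratio between neighbours: {w[1] / w[0]:.4f}")
```

The shortest wavelength is exactly $2\pi$. The longest one stays slightly below $10000 \cdot 2\pi$, because the largest exponent is $(d_{\text{model}} - 2)/d_{\text{model}}$ rather than 1, and the constant ratio between neighbours is what makes the progression geometric.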

**Why are they added rather than concatenated?**
- Adding keeps the dimensionality constant, which keeps the computational cost down.
- Empirically, adding embeddings and positional encodings has proven effective. See [Rethinking Positional Encoding in Language Pre-training](https://arxiv.org/abs/2006.15595)

Useful links:
- [Transformer Architecture: The Positional Encoding](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)
- [\[video\] ¿Por qué estas REDES NEURONALES son tan POTENTES? ](https://www.youtube.com/watch?v=xi94v_jl26U)
- [\[video\] How do Transformer Models keep track of the order of words? Positional Encoding](https://www.youtube.com/watch?v=IHu3QehUmrQ)


In [17]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len):
        super(PositionalEncoding, self).__init__()
        assert d_model % 2 == 0, "d_model must be even to use sinusoids"

        self.max_len = max_len

        pe = torch.zeros((max_len, d_model))
        # pe: [max_len, d_model]
        pos = torch.arange(0, max_len, 1)
        # pos: [max_len]
        pos = pos.unsqueeze(1)
        # pos: [max_len, 1]

        two_i = torch.arange(0, d_model, 2)  # [0, 2, 4, ..., d_model - 2]
        # two_i: [d_model / 2]
        div = 10_000 ** (two_i / d_model)
        # div: [d_model / 2]

        pe[:, 0::2] = torch.sin(pos / div)
        pe[:, 1::2] = torch.cos(pos / div)

        pe = pe.unsqueeze(0)
        # pe[1, max_len, d_model]

        # Register 'pe' as a buffer so that it is not updated during training: https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_buffer
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]
        # pe: [1, max_len, d_model]
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len, :]


summary(PositionalEncoding(D_MODEL, MAX_SENTENCE_LENGTH_MODEL))

Layer (type:depth-idx)                   Param #
PositionalEncoding                       --
Total params: 0
Trainable params: 0
Non-trainable params: 0

It has no trainable parameters! For `d_model = 4` and `max_len = 6`, the expected encoding table is:

```python
[[ 0.          1.          0.          1.        ]
 [ 0.84147096  0.54030234  0.00999983  0.99995   ]
 [ 0.9092974  -0.41614684  0.01999867  0.9998    ]
 [ 0.14112    -0.9899925   0.0299955   0.99955004]
 [-0.7568025  -0.6536436   0.03998933  0.9992001 ]
 [-0.9589243   0.2836622   0.04997917  0.99875027]]
```

In [18]:
# Parameters
d_model = 4
max_len = 6

# Create a PositionalEncoding instance
pos_encoding = PositionalEncoding(d_model, max_len)
# Extract the encoding table as a NumPy array
positional_encoded_values = pos_encoding.pe.squeeze().numpy()
# Print the numeric values
print(positional_encoded_values)

[[ 0.          1.          0.          1.        ]
 [ 0.84147096  0.54030234  0.00999983  0.99995   ]
 [ 0.9092974  -0.41614684  0.01999867  0.9998    ]
 [ 0.14112    -0.9899925   0.0299955   0.99955004]
 [-0.7568025  -0.6536436   0.03998933  0.9992001 ]
 [-0.9589243   0.2836622   0.04997917  0.99875027]]


In [19]:
rand_embed = torch.rand(
    1, 5, d_model
)  # a batch with a single sequence of 5 tokens
print(rand_embed, "\n")  # original embedding
print(pos_encoding(rand_embed))  # embedding + pe

tensor([[[0.0184, 0.6415, 0.3315, 0.1464],
         [0.9324, 0.0599, 0.1288, 0.0022],
         [0.1819, 0.0601, 0.0801, 0.8526],
         [0.6830, 0.9237, 0.5524, 0.6168],
         [0.9921, 0.7302, 0.1769, 0.9967]]]) 

tensor([[[ 0.0184,  1.6415,  0.3315,  1.1464],
         [ 1.7739,  0.6002,  0.1388,  1.0021],
         [ 1.0912, -0.3560,  0.1001,  1.8524],
         [ 0.8241, -0.0662,  0.5824,  1.6163],
         [ 0.2353,  0.0765,  0.2169,  1.9959]]])


### Multi-Head Attention

> We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$ . We compute the matrix of outputs as:
> 
> $$
> \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
> $$
>
> ...
>
> We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
>
> ...
> 
> Instead of performing a single attention function with $d_{\text{model}}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values:
> 
> $$
> \begin{aligned}
> \text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O \\
> \text{where head}_i &= \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
> \end{aligned}
> $$


Useful links:
- [\[AI by hand\] 11. Self Attention](https://aibyhand.substack.com/p/11-can-you-calculate-self-attention)
- [\[video\] ¿Qué es un TRANSFORMER?](https://www.youtube.com/watch?v=aL-EmKuB078)

<div style="text-align: center;">
    <img src="https://yjucho1.github.io/assets/img/2018-10-13/transformer.png" width="1200"/>
</div>

> The Transformer uses multi-head attention in three different ways:
> - In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models.
> - The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
> - Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
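
Before the batched multi-head implementation, it may help to see scaled dot-product attention for a single head on plain lists. This is a minimal sketch under stated assumptions: the function names and the toy matrices are hypothetical, the mask uses `True` for "may attend", and every query row is assumed to have at least one visible key (otherwise the softmax is undefined).

```python
import math


def softmax(row):
    # Subtract the maximum for numerical stability; exp(-inf) evaluates to 0.0
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: [q_len][d_k], K: [k_len][d_k], V: [k_len][d_v], mask: [q_len][k_len] of bools."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            # Hidden positions get -inf so that softmax assigns them zero weight
            scores = [s if keep else float("-inf") for s, keep in zip(scores, mask[i])]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Toy self-attention: two tokens, d_k = d_v = 2, with a look-ahead mask
X = [[1.0, 0.0], [0.0, 1.0]]
causal_mask = [[True, False], [True, True]]
out, attn = scaled_dot_product_attention(X, X, X, causal_mask)

print(attn)
print(out)
```

The first token can only attend to itself, so its attention row is `[1.0, 0.0]` and its output equals its own value vector. The second token splits its weight between both positions.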

In [20]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, heads):
        super(MultiHeadAttention, self).__init__()

        self.d_k = d_model // heads
        self.heads = heads
        self.d_model = d_model

        assert (
            self.d_k * heads == d_model
        ), "d_model must be divisible by the number of heads"

        self.keys = nn.Linear(d_model, d_model)
        self.values = nn.Linear(d_model, d_model)
        self.queries = nn.Linear(d_model, d_model)

        self.output = nn.Linear(d_model, d_model)

    def forward(self, values, keys, queries, mask):
        # values = [batch_size, values_len, d_model]
        # keys = [batch_size, keys_len, d_model]
        # queries = [batch_size, queries_len, d_model]
        # values_len = keys_len
        batch_size = values.size(0)
        values_len, keys_len, queries_len = (
            values.size(1),
            keys.size(1),
            queries.size(1),
        )

        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)
        # values = [batch_size, values_len, d_model]
        # keys = [batch_size, keys_len, d_model]
        # queries = [batch_size, queries_len, d_model]

        # split into heads
        values = values.reshape(batch_size, values_len, self.heads, self.d_k)
        keys = keys.reshape(batch_size, keys_len, self.heads, self.d_k)
        queries = queries.reshape(batch_size, queries_len, self.heads, self.d_k)

        # compute the attention scores
        queries = torch.permute(queries, (0, 2, 1, 3))
        # queries = [batch_size, heads, queries_len, d_k]
        keys = torch.permute(keys, (0, 2, 3, 1))
        # keys = [batch_size, heads, d_k, keys_len]
        attention_score = torch.matmul(queries, keys)
        # attention_score: [batch_size, heads, queries_len, keys_len]
        attention_score = attention_score / (self.d_k**0.5)

        # apply the mask if one is given (positions where the mask is False/0 are hidden)
        if mask is not None:
            attention_score = torch.masked_fill(
                attention_score, mask == 0, float("-inf")
            )

        attention_score = torch.softmax(attention_score, dim=-1)
        # attention_score: [batch_size, heads, queries_len, keys_len]
        values = torch.permute(values, (0, 2, 1, 3))
        # values: [batch_size, heads, values_len, d_k]
        # remember: keys_len = values_len
        attention = torch.matmul(attention_score, values)
        # attention: [batch_size, heads, queries_len, d_k]

        # merge the heads back together
        attention = torch.reshape(
            torch.permute(attention, (0, 2, 1, 3)),
            (batch_size, queries_len, self.d_model),
        )
        # attention: [batch_size, queries_len, d_model]

        output = self.output(attention)
        # output: [batch_size, queries_len, d_model]

        return output


summary(MultiHeadAttention(d_model=D_MODEL, heads=HEADS))

Layer (type:depth-idx)                   Param #
MultiHeadAttention                       --
├─Linear: 1-1                            65,792
├─Linear: 1-2                            65,792
├─Linear: 1-3                            65,792
├─Linear: 1-4                            65,792
Total params: 263,168
Trainable params: 263,168
Non-trainable params: 0
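
The parameter counts in this summary can be reproduced by hand. As a quick sanity check (the helper `linear_params` is hypothetical; it simply counts one weight matrix plus one bias vector per linear layer, with `d_model = 256` as configured above):

```python
def linear_params(in_features, out_features, bias=True):
    # Weight matrix entries plus, optionally, one bias per output feature
    return in_features * out_features + (out_features if bias else 0)


d_model = 256

# Queries, keys, values and output projections all map d_model -> d_model
per_projection = linear_params(d_model, d_model)
mha_total = 4 * per_projection

print(per_projection)
print(mha_total)
```

The count does not depend on the number of heads: splitting `d_model` into `heads` chunks of size `d_k` only reshapes the projected tensors and adds no parameters.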

### Feed-Forward

> In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
> 
> $$
> \text{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2
> $$
>
> (...) The dimensionality of input and output is $d_{\text{model}} = 512$, and the inner-layer has dimensionality $d_{\text{ff}} = 2048$.

In [21]:
class FeedForward(nn.Module):
    def __init__(self, d_model, expansion_factor):
        super(FeedForward, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion_factor * d_model),
            nn.ReLU(),
            nn.Linear(expansion_factor * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)


summary(FeedForward(d_model=D_MODEL, expansion_factor=FORWARD_EXPANSION))

Layer (type:depth-idx)                   Param #
FeedForward                              --
├─Sequential: 1-1                        --
│    └─Linear: 2-1                       263,168
│    └─ReLU: 2-2                         --
│    └─Linear: 2-3                       262,400
Total params: 525,568
Trainable params: 525,568
Non-trainable params: 0
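
The same arithmetic explains the feed-forward counts and how `FORWARD_EXPANSION` relates to the paper's dimensions. This is a small illustrative check; the helper `ffn_params` is hypothetical and assumes both linear layers have biases.

```python
def ffn_params(d_model, expansion_factor):
    d_ff = expansion_factor * d_model
    first = d_model * d_ff + d_ff      # d_model -> d_ff, weights + biases
    second = d_ff * d_model + d_model  # d_ff -> d_model, weights + biases
    return d_ff, first, second


# This notebook's configuration: d_model = 256, expansion factor 4
d_ff, first, second = ffn_params(256, 4)
print(d_ff, first, second, first + second)

# The paper's base model: d_model = 512 with the same factor gives d_ff = 2048
paper_d_ff, _, _ = ffn_params(512, 4)
print(paper_d_ff)
```

With 525,568 parameters, the feed-forward block is about twice the size of the attention block in each layer, so it accounts for most of a layer's parameters.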

### Encoder

<div align="center">
    <img src="https://www.researchgate.net/publication/334288604/figure/fig1/AS:778232232148992@1562556431066/The-Transformer-encoder-structure.ppm" width="400px">
</div>

> **Encoder**: The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, positionwise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is $LayerNorm(x + Sublayer(x))$, where $Sublayer(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$.
>
> ...
>
> **Residual Dropout**: During training, we apply dropout to the output of each sub-layer before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a dropout rate of $P_{drop} = 0.1$.
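
As a minimal illustration of the $LayerNorm(x + Sublayer(x))$ pattern with residual dropout, the sketch below uses a plain `nn.Linear` as a stand-in for the sub-layer (an assumption for brevity; in the model it is attention or the feed-forward network) and toy dimensions that are not the notebook's hyperparameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8                              # toy size, not the notebook's D_MODEL
x = torch.randn(2, 5, d_model)           # [batch, seq_len, d_model]

sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or feed-forward
dropout = nn.Dropout(p=0.1)
norm = nn.LayerNorm(d_model)
dropout.eval()                           # disable dropout so the result is deterministic

# Dropout is applied to the sub-layer output, then the residual is added and normalized.
out = norm(x + dropout(sublayer(x)))

print(out.shape)                         # same shape as x
print(out.mean(dim=-1).abs().max())      # per-position mean is close to 0
```

Because the normalization runs over the last dimension, every position ends up with approximately zero mean and unit variance regardless of what the sub-layer produced.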

In [22]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout, forward_expansion):
        super(EncoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model, heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.feed_forward = FeedForward(d_model, forward_expansion)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        # x: [Batch, seq_len, d_model]
        # mask: [Batch, 1, 1, seq_len] or None
        attention = self.self_attention(x, x, x, mask)
        # attention [Batch, seq_len, d_model]
        attention = self.dropout(attention)
        x = self.norm1(attention + x)
        # x: [Batch, seq_len, d_model]
        forward = self.feed_forward(x)
        # forward: [Batch, seq_len, d_model]
        forward = self.dropout(forward)
        out = self.norm2(forward + x)
        # out: [Batch, seq_len, d_model]
        return out


rand_input = torch.randn(
    (BATCH_SIZE, MAX_SENTENCE_LENGTH_MODEL, D_MODEL), dtype=torch.float
)  # a batch of embedding sequences
rand_mask = torch.ones(
    (BATCH_SIZE, 1, 1, MAX_SENTENCE_LENGTH_MODEL), dtype=torch.bool
)  # attention mask (nothing is masked here)

summary(
    EncoderLayer(
        d_model=D_MODEL,
        heads=HEADS,
        dropout=DROPOUT,
        forward_expansion=FORWARD_EXPANSION,
    ),
    input_data=(rand_input, rand_mask),
)

Layer (type:depth-idx)                   Output Shape              Param #
EncoderLayer                             [128, 17, 256]            --
├─MultiHeadAttention: 1-1                [128, 17, 256]            --
│    └─Linear: 2-1                       [128, 17, 256]            65,792
│    └─Linear: 2-2                       [128, 17, 256]            65,792
│    └─Linear: 2-3                       [128, 17, 256]            65,792
│    └─Linear: 2-4                       [128, 17, 256]            65,792
├─Dropout: 1-2                           [128, 17, 256]            --
├─LayerNorm: 1-3                         [128, 17, 256]            512
├─FeedForward: 1-4                       [128, 17, 256]            --
│    └─Sequential: 2-5                   [128, 17, 256]            --
│    │    └─Linear: 3-1                  [128, 17, 1024]           263,168
│    │    └─ReLU: 3-2                    [128, 17, 1024]           --
│    │    └─Linear: 3-3                  [128, 17, 256]        

In [23]:
class Encoder(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        d_model,
        num_layers,
        heads,
        forward_expansion,
        dropout,
        max_length,
    ):
        super(Encoder, self).__init__()
        self.word_embedding = nn.Embedding(
            src_vocab_size, d_model, padding_idx=SRC_VOCAB[PAD_TOKEN]
        )
        self.position_embedding = PositionalEncoding(d_model, max_length)
        self.dropout = nn.Dropout(dropout)
        self.d_model = d_model

        self.layers = nn.ModuleList(
            [
                EncoderLayer(d_model, heads, dropout, forward_expansion)
                for _ in range(num_layers)
            ]
        )

    def forward(self, x, mask):
        # x: [batch_size, seq_len]
        embeddings = self.word_embedding(x)
        # embeddings: [batch_size, seq_len, d_model]
        embeddings = self.position_embedding(embeddings)
        # embeddings: [batch_size, seq_len, d_model]
        out = self.dropout(embeddings)
        # out: [batch_size, seq_len, d_model]

        for layer in self.layers:
            out = layer(out, mask)
        # out: [batch_size, seq_len, d_model]

        return out


rand_input = torch.randint(
    0, SRC_VOCAB_SIZE, (BATCH_SIZE, MAX_SENTENCE_LENGTH_MODEL)
)  # a batch of token-id sequences
rand_mask = torch.ones(
    (BATCH_SIZE, 1, 1, MAX_SENTENCE_LENGTH_MODEL), dtype=torch.bool
)  # attention mask (nothing is masked here)

summary(
    Encoder(
        src_vocab_size=SRC_VOCAB_SIZE,
        d_model=D_MODEL,
        num_layers=NUM_LAYERS,
        heads=HEADS,
        forward_expansion=FORWARD_EXPANSION,
        dropout=DROPOUT,
        max_length=MAX_SENTENCE_LENGTH_MODEL,
    ),
    input_data=(rand_input, rand_mask),
)

Layer (type:depth-idx)                   Output Shape              Param #
Encoder                                  [128, 17, 256]            --
├─Embedding: 1-1                         [128, 17, 256]            6,408,448
├─PositionalEncoding: 1-2                [128, 17, 256]            --
├─Dropout: 1-3                           [128, 17, 256]            --
├─ModuleList: 1-4                        --                        --
│    └─EncoderLayer: 2-1                 [128, 17, 256]            --
│    │    └─MultiHeadAttention: 3-1      [128, 17, 256]            263,168
│    │    └─Dropout: 3-2                 [128, 17, 256]            --
│    │    └─LayerNorm: 3-3               [128, 17, 256]            512
│    │    └─FeedForward: 3-4             [128, 17, 256]            525,568
│    │    └─Dropout: 3-5                 [128, 17, 256]            --
│    │    └─LayerNorm: 3-6               [128, 17, 256]            512
│    └─EncoderLayer: 2-2                 [128, 17, 256]           

### Decoder

<div align="center">
    <img src="https://d1.awsstatic.com/GENAI-1.151ded5440b4c997bac0642ec669a00acff2cca1.png" width="400px">
</div>

> **Decoder**: The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

In [24]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, heads, forward_expansion, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model, heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.encoder_attention = MultiHeadAttention(d_model, heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.feed_forward = FeedForward(d_model, forward_expansion)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        # x: [Batch, trg_seq_len, d_model]
        # enc_out: [Batch, src_seq_len, d_model]
        # src_mask: [Batch, 1, 1, src_seq_len] or None
        # trg_mask: [Batch, 1, trg_seq_len, trg_seq_len] or None

        # Self-attention with the look-ahead mask
        self_attn = self.self_attention(x, x, x, trg_mask)
        # self_attn: [Batch, trg_seq_len, d_model]
        self_attn = self.dropout(self_attn)
        x = self.norm1(self_attn + x)

        # Attention over the encoder output
        enc_attn = self.encoder_attention(enc_out, enc_out, x, src_mask)
        # enc_attn: [Batch, trg_seq_len, d_model]

        enc_attn = self.dropout(enc_attn)
        x = self.norm2(enc_attn + x)

        # Feed-forward network
        forward = self.feed_forward(x)
        # forward: [Batch, trg_seq_len, d_model]
        forward = self.dropout(forward)
        out = self.norm3(forward + x)
        # out: [Batch, trg_seq_len, d_model]

        return out


rand_enc_out = torch.randn(
    (BATCH_SIZE, MAX_SENTENCE_LENGTH_MODEL, D_MODEL), dtype=torch.float
)  # a batch of encoder outputs
rand_trg_input = torch.randn(
    (BATCH_SIZE, MAX_SENTENCE_LENGTH_MODEL, D_MODEL), dtype=torch.float
)  # a batch of embedding sequences (decoder input)
rand_src_mask = torch.ones(
    (BATCH_SIZE, 1, 1, MAX_SENTENCE_LENGTH_MODEL), dtype=torch.bool
)  # encoder attention mask (nothing is masked here)
rand_trg_mask = torch.ones(
    (BATCH_SIZE, 1, MAX_SENTENCE_LENGTH_MODEL, MAX_SENTENCE_LENGTH_MODEL),
    dtype=torch.bool,
)  # decoder attention mask (nothing is masked here)

summary(
    DecoderLayer(
        d_model=D_MODEL,
        heads=HEADS,
        forward_expansion=FORWARD_EXPANSION,
        dropout=DROPOUT,
    ),
    input_data=(rand_trg_input, rand_enc_out, rand_src_mask, rand_trg_mask),
)

Layer (type:depth-idx)                   Output Shape              Param #
DecoderLayer                             [128, 17, 256]            --
├─MultiHeadAttention: 1-1                [128, 17, 256]            --
│    └─Linear: 2-1                       [128, 17, 256]            65,792
│    └─Linear: 2-2                       [128, 17, 256]            65,792
│    └─Linear: 2-3                       [128, 17, 256]            65,792
│    └─Linear: 2-4                       [128, 17, 256]            65,792
├─Dropout: 1-2                           [128, 17, 256]            --
├─LayerNorm: 1-3                         [128, 17, 256]            512
├─MultiHeadAttention: 1-4                [128, 17, 256]            --
│    └─Linear: 2-5                       [128, 17, 256]            65,792
│    └─Linear: 2-6                       [128, 17, 256]            65,792
│    └─Linear: 2-7                       [128, 17, 256]            65,792
│    └─Linear: 2-8                       [128, 17, 256] 

In [25]:
class Decoder(nn.Module):
    def __init__(
        self,
        trg_vocab_size,
        d_model,
        num_layers,
        heads,
        forward_expansion,
        dropout,
        max_length,
    ):
        super(Decoder, self).__init__()
        self.d_model = d_model

        self.word_embedding = nn.Embedding(
            trg_vocab_size, d_model, padding_idx=TRG_VOCAB[PAD_TOKEN]
        )
        self.position_embedding = PositionalEncoding(d_model, max_length)
        self.dropout = nn.Dropout(dropout)

        self.layers = nn.ModuleList(
            [
                DecoderLayer(d_model, heads, forward_expansion, dropout)
                for _ in range(num_layers)
            ]
        )

        self.fc_out = nn.Linear(d_model, trg_vocab_size)

    def forward(self, x, enc_out, src_mask, trg_mask):
        # x : [Batch, trg_seq_len]
        # enc_out:  [Batch, src_seq_len, d_model]
        # src_mask: [Batch, 1, 1, src_seq_len] or None
        # trg_mask: [Batch, 1, trg_seq_len, trg_seq_len] or None

        # Input embeddings
        x = self.word_embedding(x)
        # x: [Batch, trg_seq_len, d_model]
        x = self.position_embedding(x)
        # x: [Batch, trg_seq_len, d_model]
        x = self.dropout(x)
        # x: [Batch, trg_seq_len, d_model]

        # Run through the decoder layers
        for layer in self.layers:
            x = layer(x, enc_out, src_mask, trg_mask)
            # x: [Batch, trg_seq_len, d_model]

        # Projection onto the output vocabulary
        out = self.fc_out(x)
        # out: [Batch, trg_seq_len, trg_vocab_size]
        return out


rand_enc_out = torch.randn(
    (BATCH_SIZE, MAX_SENTENCE_LENGTH_MODEL, D_MODEL), dtype=torch.float
)  # a batch of encoder outputs
rand_trg_input = torch.randint(
    0, TRG_VOCAB_SIZE, (BATCH_SIZE, MAX_SENTENCE_LENGTH_MODEL)
)  # a batch of token-id sequences
rand_src_mask = torch.ones(
    (BATCH_SIZE, 1, 1, MAX_SENTENCE_LENGTH_MODEL), dtype=torch.bool
)  # encoder attention mask (nothing is masked here)
rand_trg_mask = torch.ones(
    (BATCH_SIZE, 1, MAX_SENTENCE_LENGTH_MODEL, MAX_SENTENCE_LENGTH_MODEL),
    dtype=torch.bool,
)  # decoder attention mask (nothing is masked here)

summary(
    Decoder(
        trg_vocab_size=TRG_VOCAB_SIZE,
        d_model=D_MODEL,
        num_layers=NUM_LAYERS,
        heads=HEADS,
        forward_expansion=FORWARD_EXPANSION,
        dropout=DROPOUT,
        max_length=MAX_SENTENCE_LENGTH_MODEL,
    ),
    input_data=(rand_trg_input, rand_enc_out, rand_src_mask, rand_trg_mask),
)

Layer (type:depth-idx)                   Output Shape              Param #
Decoder                                  [128, 17, 45139]          --
├─Embedding: 1-1                         [128, 17, 256]            11,555,584
├─PositionalEncoding: 1-2                [128, 17, 256]            --
├─Dropout: 1-3                           [128, 17, 256]            --
├─ModuleList: 1-4                        --                        --
│    └─DecoderLayer: 2-1                 [128, 17, 256]            --
│    │    └─MultiHeadAttention: 3-1      [128, 17, 256]            263,168
│    │    └─Dropout: 3-2                 [128, 17, 256]            --
│    │    └─LayerNorm: 3-3               [128, 17, 256]            512
│    │    └─MultiHeadAttention: 3-4      [128, 17, 256]            263,168
│    │    └─Dropout: 3-5                 [128, 17, 256]            --
│    │    └─LayerNorm: 3-6               [128, 17, 256]            512
│    │    └─FeedForward: 3-7             [128, 17, 256]          

### Masking

#### Encoder Padding Mask

In the encoder, every token can attend to every token in the input sequence (there is no look-ahead masking). A mask is applied only to the padding tokens, so that the model does not attend to those irrelevant positions.

In [26]:
def create_src_mask(src, src_pad_idx):
    # src: [batch_size, src_len]
    src_mask = (src != src_pad_idx).unsqueeze(1).unsqueeze(2)
    # [batch_size, 1, 1, src_len]
    # recall that the mask is applied to the attention score matrix, whose shape is [batch_size, heads, queries_len, keys_len]
    # it broadcasts over heads because the mask is the same for every head
    # it broadcasts over queries_len because the number of queries is not known here (the decoder also uses this mask); what matters is that the keys (src_len, the columns) are masked
    return src_mask

In [27]:
input_test = torch.tensor([[1, 2, 3, 0, 0]], dtype=torch.long)  # token ids are integers
input_test_mask = create_src_mask(input_test, SRC_VOCAB[PAD_TOKEN])
print(input_test_mask)

tensor([[[[ True,  True,  True, False, False]]]])


Let's see how the mask can be used to mask the attention scores.

In [28]:
src_len = input_test.size(1)  # length of the input sequence
rand_input = torch.rand(
    (1, 1, src_len, src_len), dtype=torch.float
)  # random attention scores
print("Before masking:")
print(rand_input)

Before masking:
tensor([[[[0.8890, 0.1382, 0.5053, 0.1419, 0.9124],
          [0.4866, 0.5986, 0.6645, 0.8106, 0.3378],
          [0.8874, 0.6087, 0.3668, 0.0065, 0.1461],
          [0.5273, 0.5827, 0.5564, 0.2997, 0.5261],
          [0.5466, 0.0291, 0.2278, 0.8868, 0.6011]]]])


In [29]:
masked = torch.masked_fill(
    rand_input, ~input_test_mask, float("-inf")
)  # apply the mask: padded key positions get a score of -inf
print("After masking:")
print(masked)

After masking:
tensor([[[[0.8890, 0.1382, 0.5053,   -inf,   -inf],
          [0.4866, 0.5986, 0.6645,   -inf,   -inf],
          [0.8874, 0.6087, 0.3668,   -inf,   -inf],
          [0.5273, 0.5827, 0.5564,   -inf,   -inf],
          [0.5466, 0.0291, 0.2278,   -inf,   -inf]]]])
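
The reason for filling with `-inf` becomes visible once the softmax is applied: `exp(-inf)` is exactly 0, so padded keys receive zero attention weight while each row still sums to 1. The sketch below is a self-contained toy example; it assumes padding id 0 and two heads, and it does not use the notebook's vocabulary.

```python
import torch

# Hypothetical toy batch: one sentence, token id 0 assumed to be padding.
src = torch.tensor([[1, 2, 3, 0, 0]])
pad_idx = 0
mask = (src != pad_idx).unsqueeze(1).unsqueeze(2)   # [1, 1, 1, 5]

torch.manual_seed(0)
scores = torch.rand(1, 2, 5, 5)                     # [batch, heads, queries, keys]

# The mask broadcasts over the heads and queries dimensions.
masked = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(weights[0, 0])         # the last two columns are exactly 0
print(weights.sum(dim=-1))   # every row still sums to 1
```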


#### Decoder Look-Ahead Masking

> We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

We use [torch.tril](https://pytorch.org/docs/stable/generated/torch.tril.html) to obtain the lower-triangular part of a matrix.

In [30]:
torch.tril(
    torch.full((5, 5), True), diagonal=0
)  # changing `diagonal` controls which part of the lower-triangular matrix is kept

tensor([[ True, False, False, False, False],
        [ True,  True, False, False, False],
        [ True,  True,  True, False, False],
        [ True,  True,  True,  True, False],
        [ True,  True,  True,  True,  True]])

In [31]:
def create_trg_mask(trg, trg_pad_idx):
    # trg: [batch_size, trg_len]
    trg_pad_mask = (trg != trg_pad_idx).unsqueeze(1).unsqueeze(2)
    # trg_pad_mask: [batch_size, 1, 1, trg_len]

    # Build the look-ahead mask
    trg_len = trg.size(1)
    trg_sub_mask = torch.tril(
        torch.full((trg_len, trg_len), True, device=trg.device)
    )  # torch.full allocates on the default device (usually the CPU), so we pass device=trg.device to match 'trg'
    # trg_sub_mask: [trg_len, trg_len]
    trg_mask = trg_pad_mask & trg_sub_mask  # combine both masks (broadcasting expands them to a common shape)
    # trg_mask: [batch_size, 1, trg_len, trg_len]
    return trg_mask

In [32]:
input_test = torch.tensor([[1, 2, 3, 0, 0]], dtype=torch.long)  # token ids are integers
print(create_trg_mask(input_test, 0))

tensor([[[[ True, False, False, False, False],
          [ True,  True, False, False, False],
          [ True,  True,  True, False, False],
          [ True,  True,  True, False, False],
          [ True,  True,  True, False, False]]]])
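
To check that a lower-triangular mask really prevents information from flowing backwards, one can alter the later positions of a sequence and confirm that the attention output at earlier positions does not move. The sketch below uses a stripped-down single-head attention with no learned projections; `causal_self_attention` is a hypothetical helper for illustration, not the notebook's `MultiHeadAttention`.

```python
import math

import torch


def causal_self_attention(x):
    # x: [seq_len, d]; queries, keys and values are x itself (no projections)
    seq_len, d = x.shape
    scores = x @ x.T / math.sqrt(d)                                   # [seq_len, seq_len]
    look_ahead = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~look_ahead, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x                          # [seq_len, d]


torch.manual_seed(0)
x = torch.randn(5, 8)
x_changed = x.clone()
x_changed[3:] = torch.randn(2, 8)      # alter only positions 3 and 4

out = causal_self_attention(x)
out_changed = causal_self_attention(x_changed)

# Positions 0..2 cannot see positions 3 and 4, so their outputs are unaffected.
print(torch.allclose(out[:3], out_changed[:3]))
```

This property is what makes it valid to train on all target positions in parallel: position $i$ is computed as if the later tokens were not there.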


### Transformer

Useful links:
- [\[visualization\] LLM Visualization](https://bbycroft.net/llm)
- [\[visualization\] TRANSFORMER EXPLAINER](https://poloclub.github.io/transformer-explainer/)


<div align="center">
    <img src="https://d1.awsstatic.com/GENAI-1.151ded5440b4c997bac0642ec669a00acff2cca1.png" width="400px">
</div>

In [33]:
class Transformer(nn.Module):
    def __init__(
        self,
        src_vocab_size,
        trg_vocab_size,
        src_pad_idx,
        trg_pad_idx,
        d_model=512,
        num_layers=6,
        forward_expansion=4,
        heads=8,
        dropout=0.1,
        max_length=100,
    ):
        super(Transformer, self).__init__()

        self.encoder = Encoder(
            src_vocab_size,
            d_model,
            num_layers,
            heads,
            forward_expansion,
            dropout,
            max_length,
        )

        self.decoder = Decoder(
            trg_vocab_size,
            d_model,
            num_layers,
            heads,
            forward_expansion,
            dropout,
            max_length,
        )

        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx

    def forward(self, src, trg):
        src_mask = create_src_mask(src, self.src_pad_idx)
        trg_mask = create_trg_mask(trg, self.trg_pad_idx)

        enc_src = self.encoder(src, src_mask)
        out = self.decoder(trg, enc_src, src_mask, trg_mask)
        return out


summary(
    Transformer(
        src_vocab_size=SRC_VOCAB_SIZE,
        trg_vocab_size=TRG_VOCAB_SIZE,
        src_pad_idx=SRC_PAD_IDX,
        trg_pad_idx=TRG_PAD_IDX,
        d_model=D_MODEL,
        num_layers=NUM_LAYERS,
        forward_expansion=FORWARD_EXPANSION,
        heads=HEADS,
        dropout=DROPOUT,
        max_length=MAX_SENTENCE_LENGTH_MODEL,
    ),
    input_data=(
        torch.randint(0, SRC_VOCAB_SIZE, (BATCH_SIZE, MAX_SENTENCE_LENGTH_MODEL)),
        torch.randint(0, TRG_VOCAB_SIZE, (BATCH_SIZE, MAX_SENTENCE_LENGTH_MODEL)),
    ),
    depth=5,
)

Layer (type:depth-idx)                        Output Shape              Param #
Transformer                                   [128, 17, 45139]          --
├─Encoder: 1-1                                [128, 17, 256]            --
│    └─Embedding: 2-1                         [128, 17, 256]            6,408,448
│    └─PositionalEncoding: 2-2                [128, 17, 256]            --
│    └─Dropout: 2-3                           [128, 17, 256]            --
│    └─ModuleList: 2-4                        --                        --
│    │    └─EncoderLayer: 3-1                 [128, 17, 256]            --
│    │    │    └─MultiHeadAttention: 4-1      [128, 17, 256]            --
│    │    │    │    └─Linear: 5-1             [128, 17, 256]            65,792
│    │    │    │    └─Linear: 5-2             [128, 17, 256]            65,792
│    │    │    │    └─Linear: 5-3             [128, 17, 256]            65,792
│    │    │    │    └─Linear: 5-4             [128, 17, 256]            65,7

## Training

In [34]:
LR = 0.0005

model = Transformer(
    src_vocab_size=SRC_VOCAB_SIZE,
    trg_vocab_size=TRG_VOCAB_SIZE,
    src_pad_idx=SRC_PAD_IDX,
    trg_pad_idx=TRG_PAD_IDX,
    d_model=D_MODEL,
    num_layers=NUM_LAYERS,
    forward_expansion=FORWARD_EXPANSION,
    heads=HEADS,
    dropout=DROPOUT,
    max_length=MAX_SENTENCE_LENGTH_MODEL,
).to(DEVICE)

criterion = nn.CrossEntropyLoss(
    ignore_index=TRG_PAD_IDX,
    label_smoothing=0.05,
).to(DEVICE)

optimizer = optim.Adam(model.parameters(), lr=LR)

In [35]:
def process_batch(src, trg, model, criterion):
    # src: [BATCH_SIZE, SRC_LEN]
    # trg: [BATCH_SIZE, TRG_LEN]

    input_trg = trg[:, :-1]  # Decoder input: every token except the last
    # input_trg: [batch_size, trg_len - 1]

    target_trg = trg[:, 1:]  # Loss target: every token except the first
    # target_trg: [batch_size, trg_len - 1]

    # Forward pass through the model
    output = model(src, input_trg)
    # output: [batch_size, trg_len - 1, trg_vocab_size]

    # Flatten the dimensions to compute the loss
    output = output.reshape(-1, output.size(-1))
    # output: [batch_size * (trg_len - 1), trg_vocab_size]

    target_trg = target_trg.reshape(-1)
    # target_trg: [batch_size * (trg_len - 1)]

    return criterion(output, target_trg)
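
The one-position shift between decoder input and loss target is easier to see on a tiny batch. The sketch below uses hypothetical token ids (0 for `<PAD>`, 1 for `<SOS>`, 2 for `<EOS>`) and random logits in place of a model. It also checks that `ignore_index` makes the loss an average over the non-padding targets only; label smoothing is left out to keep the comparison exact.

```python
import torch
import torch.nn as nn

# Hypothetical ids: 0 = <PAD>, 1 = <SOS>, 2 = <EOS>, anything else is a word.
trg = torch.tensor([[1, 5, 6, 7, 2],
                    [1, 8, 2, 0, 0]])

input_trg = trg[:, :-1]      # what the decoder sees
target_trg = trg[:, 1:]      # what it must predict, shifted one step to the left
print(input_trg.tolist())    # [[1, 5, 6, 7], [1, 8, 2, 0]]
print(target_trg.tolist())   # [[5, 6, 7, 2], [8, 2, 0, 0]]

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(2, 4, vocab_size).reshape(-1, vocab_size)  # stand-in for model output
flat_target = target_trg.reshape(-1)

loss_ignoring_pad = nn.CrossEntropyLoss(ignore_index=0)(logits, flat_target)

keep = flat_target != 0
loss_non_pad_only = nn.CrossEntropyLoss()(logits[keep], flat_target[keep])

print(torch.allclose(loss_ignoring_pad, loss_non_pad_only))
```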

In [36]:
def evaluate_epoch(model, criterion, valid_loader):
    model.eval()
    valid_loss = 0

    with torch.no_grad():
        for src, trg in valid_loader:
            src = src.to(DEVICE)
            trg = trg.to(DEVICE)

            loss = process_batch(src, trg, model, criterion)

            valid_loss += loss.item()

    return valid_loss / len(valid_loader)

In [37]:
def train_epoch(model, criterion, train_loader, optimizer):
    model.train()
    train_loss = 0

    for src, trg in train_loader:
        src = src.to(DEVICE)
        # src: [BATCH_SIZE, SRC_LEN]
        trg = trg.to(DEVICE)
        # trg: [BATCH_SIZE, TRG_LEN]

        optimizer.zero_grad()

        loss = process_batch(src, trg, model, criterion)

        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    return train_loss / len(train_loader)

In [38]:
def train(model, train_loader, valid_loader, optimizer, criterion, n_epochs):
    for epoch in range(n_epochs):
        train_loss = train_epoch(model, criterion, train_loader, optimizer)
        val_loss = evaluate_epoch(model, criterion, valid_loader)
        print(
            f"Epoch {epoch + 1} Train Loss: {train_loss:.4f} Val Loss: {val_loss:.4f}"
        )

In [None]:
NUM_EPOCHS = 10  # 5 is enough for the class session; the log below comes from a 20-epoch run

train(model, train_loader, valid_loader, optimizer, criterion, NUM_EPOCHS)

Epoch 1 Train Loss: 3.6644 Val Loss: 2.5896
Epoch 2 Train Loss: 2.3532 Val Loss: 2.2277
Epoch 3 Train Loss: 2.0148 Val Loss: 2.1012
Epoch 4 Train Loss: 1.8351 Val Loss: 2.0485
Epoch 5 Train Loss: 1.7139 Val Loss: 1.9969
Epoch 6 Train Loss: 1.6249 Val Loss: 1.9784
Epoch 7 Train Loss: 1.5591 Val Loss: 1.9629
Epoch 8 Train Loss: 1.5055 Val Loss: 1.9405
Epoch 9 Train Loss: 1.4631 Val Loss: 1.9419
Epoch 10 Train Loss: 1.4282 Val Loss: 1.9431
Epoch 11 Train Loss: 1.3991 Val Loss: 1.9447
Epoch 12 Train Loss: 1.3750 Val Loss: 1.9353
Epoch 13 Train Loss: 1.3512 Val Loss: 1.9361
Epoch 14 Train Loss: 1.3312 Val Loss: 1.9338
Epoch 15 Train Loss: 1.3137 Val Loss: 1.9390
Epoch 16 Train Loss: 1.2961 Val Loss: 1.9359
Epoch 17 Train Loss: 1.2805 Val Loss: 1.9453
Epoch 18 Train Loss: 1.2667 Val Loss: 1.9413
Epoch 19 Train Loss: 1.2543 Val Loss: 1.9493
Epoch 20 Train Loss: 1.2414 Val Loss: 1.9477
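
Reading the log above, the validation loss flattens after roughly epoch 8 while the training loss keeps falling. A few lines of plain Python over the logged numbers (copied verbatim from the output) make that explicit; this is only a reading aid for this particular run, not a general stopping rule.

```python
train_losses = [
    3.6644, 2.3532, 2.0148, 1.8351, 1.7139, 1.6249, 1.5591, 1.5055, 1.4631, 1.4282,
    1.3991, 1.3750, 1.3512, 1.3312, 1.3137, 1.2961, 1.2805, 1.2667, 1.2543, 1.2414,
]
val_losses = [
    2.5896, 2.2277, 2.1012, 2.0485, 1.9969, 1.9784, 1.9629, 1.9405, 1.9419, 1.9431,
    1.9447, 1.9353, 1.9361, 1.9338, 1.9390, 1.9359, 1.9453, 1.9413, 1.9493, 1.9477,
]

# Epoch (1-indexed) with the lowest validation loss.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
print(best_epoch, val_losses[best_epoch - 1])            # 14 1.9338

# Improvement from epoch 8 onwards: validation (to its best value) vs. training.
val_gain = round(val_losses[7] - min(val_losses), 4)
train_gain = round(train_losses[7] - train_losses[-1], 4)
print(val_gain, train_gain)                              # 0.0067 0.2641
```

After epoch 8 the training loss drops by about 0.26 while the best validation loss improves by less than 0.01, which suggests that the additional epochs mostly fit the training set.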


In [40]:
def translate_sentence(model, sentence, max_len):
    # Switch to evaluation mode
    model.eval()

    # Clean the input sentence
    sentence = clean_text(sentence)

    # Encode the source sentence as vocabulary indices
    sentence_indexes = encode_sentence(sentence, SRC_VOCAB)
    # Convert to a tensor and add a leading batch dimension
    sentence_tensor = torch.tensor(sentence_indexes).unsqueeze(0).to(DEVICE)

    # Build the source mask (so that padding is not attended to)
    src_mask = create_src_mask(sentence_tensor, SRC_VOCAB[PAD_TOKEN])

    with torch.no_grad():
        # Run the encoder once
        enc_src = model.encoder(sentence_tensor, src_mask)

        # The first decoder token is <SOS>
        input_tokens = torch.tensor([TRG_VOCAB[SOS_TOKEN]]).unsqueeze(0).to(DEVICE)

        # Tokens generated by the decoder so far
        generated_tokens = []

        # Step-by-step (greedy) decoding
        for _ in range(max_len):
            trg_mask = create_trg_mask(input_tokens, TRG_VOCAB[PAD_TOKEN])

            # Run the whole sequence generated so far through the decoder
            output = model.decoder(input_tokens, enc_src, src_mask, trg_mask)

            # Only the prediction at the last position matters
            output = output[:, -1, :]

            # Pick the highest-scoring token (greedy choice)
            top1 = output.argmax(1).item()

            # Stop decoding if the predicted token is <EOS>
            if top1 == TRG_VOCAB[EOS_TOKEN]:
                break

            generated_tokens.append(top1)

            # Append the predicted token to the decoder input for the next step
            input_tokens = torch.cat(
                [input_tokens, torch.tensor([[top1]]).to(DEVICE)], dim=1
            )

        # Map the predicted indices back to words using the vocabulary
        predicted_sentence = decode_sentence(generated_tokens, TRG_VOCAB)

    return predicted_sentence

In [41]:
sentences = [
    "I am hungry",
    "I am tired",
    "I am happy",
    "I'm sad",
    "I am angry",
    "every time I study, I get sleepy",
    "I am going to the gym",
    "I am going to the beach",
    "I am going to the supermarket",
    "I'm going to the movies",
    "I don't know what to do",
    "I love deep learning",
    "I can't open the door",
    "you can go if you want to",
    "i'm going to the party",
    "where does all this come from ?",
    "I can read your mind",
    "I can't believe it",
    "I can't believe you",
    "I can't believe this",
    "I didn't like it",
    "You can do it",
    "Do you speak Italian?",
    "Do you want to learn Spanish?",
    "I want to learn French",
    "Do you want to go to the movies?",
    "this is my favorite song",
    "I can't wait to see you",
    "see you later",
    "have a nice day",
    "we'll talk later",
    "let's grab a coffee sometime",
    "they're coming to the party",
    "she's my best friend",
    "he's a great guy",
    "My class is in 30 minutes.",
    "I have class tomorrow.",
    "This is a very long sentence, let's see how the model handles it.",
    "Can you help me with my homework?",
    "What time is it?",
    "Where is the nearest restaurant?",
    "How do I get to the airport?",
    "I would like to make a reservation.",
    "The fish is fresh and delicious.",
    "I need to buy a new laptop.",
    "My favorite color is blue.",
    "I enjoy hiking on the weekends.",
    "The weather is nice today.",
]

for sentence in sentences:
    print(f"Input: {sentence}")
    print(
        f"Translation: {translate_sentence(model, sentence, MAX_SENTENCE_LENGTH_MODEL)}\n"
    )

Input: I am hungry
Translation: tengo hambre

Input: I am tired
Translation: estoy cansado

Input: I am happy
Translation: estoy feliz

Input: I'm sad
Translation: estoy triste

Input: I am angry
Translation: estoy enojada

Input: every time I study, I get sleepy
Translation: cada vez que estudio me da sueño

Input: I am going to the gym
Translation: voy al gimnasio

Input: I am going to the beach
Translation: voy a la playa

Input: I am going to the supermarket
Translation: voy al supermercado

Input: I'm going to the movies
Translation: voy al cine

Input: I don't know what to do
Translation: no sé qué hacer

Input: I love deep learning
Translation: me encanta aprender profundo

Input: I can't open the door
Translation: no puedo abrir la puerta

Input: you can go if you want to
Translation: puedes ir si quieres

Input: i'm going to the party
Translation: voy a la fiesta

Input: where does all this come from ?
Translation: ¿ de dónde viene todo esto ?

Input: I can read your mind
Tran

In [42]:
translate_sentence(model, "Attention is all you need", MAX_SENTENCE_LENGTH_MODEL)

'la atención es todo lo que necesitas'

## Exercise
Modify the inference so that it is not deterministic: instead of always taking the most probable token, sample from the probability distribution that the model produces at each step. Include parameters such as temperature and top-k sampling to control the diversity of the generated translations. See [torch.multinomial](https://docs.pytorch.org/docs/stable/generated/torch.multinomial.html), [torch.topk](https://docs.pytorch.org/docs/stable/generated/torch.topk.html#torch.topk).
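
One possible starting point for the sampling step is sketched below. `sample_next_token` and its parameter names are hypothetical, the logits are a made-up toy vector, and the function handles a single unbatched step only; wiring it into the decoding loop is left as part of the exercise.

```python
import torch


def sample_next_token(logits, temperature=1.0, top_k=None):
    # logits: [vocab_size], unnormalized scores for the next token
    # temperature must be > 0; values below 1 sharpen the distribution, above 1 flatten it
    logits = logits / temperature

    if top_k is not None:
        # Keep only the k highest-scoring tokens and renormalize over them.
        top_values, top_indices = torch.topk(logits, k=top_k)
        probs = torch.softmax(top_values, dim=-1)
        choice = torch.multinomial(probs, num_samples=1)
        return top_indices[choice].item()

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()


torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])   # toy scores over 5 tokens
samples = [sample_next_token(logits, temperature=0.8, top_k=3) for _ in range(20)]
print(samples)   # only indices 0, 1 and 2 can appear
```

With `top_k=1` the function reduces to the greedy choice, which gives a quick sanity check. In a decoding loop like `translate_sentence`, a call of this kind would take the place of `output.argmax(1).item()`, applied to the logits of the last position.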