### Sequence to Sequence Models





Machine translation, the task of translating a sentence from one language to another, is a natural use case for sequence-to-sequence models.
The representation is intuitive: a sentence can be treated as a sequence of words.


### Seq2Seq

Ideally, a model first has to understand the input sentence in the source language. This part is handled by the so-called “encoder”.
The model then has to express that meaning in another language; the component that does this is called the “decoder”.

<img src="./img/transformers/Seq2Seq.png" alt="seq2seq" width="700"/>

The goal is to transform an input sequence (source) to a new one (target). The two sequences can be of the same or arbitrary length.

For years, recurrent neural networks (RNNs) were the default choice for this kind of task. The reason is simple: we liked to treat sequences sequentially. Sounds obvious and optimal? Attention mechanisms, and eventually transformers, proved that it was not!

### A comprehensive view of encoder and decoder

Let’s suppose that the encoder and decoder are stacked RNN/LSTM cells. The encoder reads the input one timestep at a time and produces a context vector z, which can be regarded as a compressed representation of the input.

<img src="./img/transformers/s2sLSTM.png" alt="s2sLSTM" width="400"/>

The decoder receives the context vector z and generates the output sequence.

We can think of the input sequence as the representation of a sentence in English and the output as the same sentence in French.
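
To make the data flow concrete, here is a minimal sketch in plain Python (no PyTorch, so every step is visible). It is not a trained model: the weights are random, each token is stood in for by a single number, and the names `rnn_step`, `encode`, `decode`, and `HIDDEN` are hypothetical. It only illustrates that an input of any length is folded into a fixed-size vector z, from which an output of a different length can be unrolled.

```python
import math
import random


def rnn_step(state, x, w_state, w_in):
    """One vanilla RNN step: tanh(w_state @ state + w_in * x)."""
    return [
        math.tanh(sum(w * s for w, s in zip(row, state)) + w_x * x)
        for row, w_x in zip(w_state, w_in)
    ]


def encode(inputs, w_state, w_in):
    """Fold the whole input sequence into one fixed-size context vector z."""
    state = [0.0] * len(w_in)
    for x in inputs:
        state = rnn_step(state, x, w_state, w_in)
    return state  # z is the encoder's final hidden state


def decode(z, steps, w_state, w_in, w_out):
    """Unroll the decoder from z, feeding each output back in as the next input."""
    state, y, outputs = list(z), 0.0, []
    for _ in range(steps):
        state = rnn_step(state, y, w_state, w_in)
        y = sum(w * s for w, s in zip(w_out, state))
        outputs.append(y)
    return outputs


def random_vector(n):
    """Small random weights; a stand-in for parameters that would be learned."""
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


random.seed(0)
HIDDEN = 4  # assumed hidden size, chosen only for illustration

# Random, untrained weights for the encoder and the decoder.
enc_w_state = [random_vector(HIDDEN) for _ in range(HIDDEN)]
enc_w_in = random_vector(HIDDEN)
dec_w_state = [random_vector(HIDDEN) for _ in range(HIDDEN)]
dec_w_in = random_vector(HIDDEN)
dec_w_out = random_vector(HIDDEN)

source = [0.1, 0.7, -0.3, 0.5, 0.2]  # stand-in for five source tokens
z = encode(source, enc_w_state, enc_w_in)
target = decode(z, 3, dec_w_state, dec_w_in, dec_w_out)

print(len(source), len(z), len(target))  # 5 4 3
```

Whatever the length of `source`, `z` always has `HIDDEN` entries. That fixed size is what lets the decoder produce a target of a different length, and it is also the only channel through which the decoder sees the input.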


<img src="./img/transformers/s2s_rnn.png" alt="s2s_rnn" width="400"/>

In fact, RNN-based architectures used to work very well, especially with LSTM components.

The problem? They only worked well for short sequences (<20 timesteps). Visually:

<img src="./img/transformers/s2s_limitations.png" alt="s2s_limitations" width="400"/>


### The limitations of RNNs

The main issue is that the intermediate representation z has a fixed size, so it cannot encode information from all the input timesteps equally well.

This is commonly known as the **bottleneck problem**: the single vector z needs to capture all the information about the source sentence.

In addition, RNNs tend to forget information from timesteps that are far in the past.


For example, consider this 97-word sentence:

<img src="./img/transformers/97_words_sentance.png" alt="97-word sentence" width="700"/>

**"blind man"** is key to understanding the text. The vector z will be unable to preserve the information from the first few words as well as it preserves the 97th word.

As a result, the system pays more attention to the last parts of the sequence. **This is usually not the optimal way to approach a sequence task.**
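
This bias toward recent inputs can be illustrated with a deliberately simplified recurrence. The sketch below assumes a scalar, linear RNN `h_t = w * h_{t-1} + x_t` with a hypothetical recurrent weight `w = 0.5`. This is not how an LSTM actually behaves; it only measures how much the final state changes when the first input alone changes.

```python
def final_state(inputs, w=0.5):
    """Run the toy recurrence h_t = w * h_{t-1} + x_t and return the last state."""
    h = 0.0
    for x in inputs:
        h = w * h + x
    return h


def influence_of_first_input(length, w=0.5):
    """Change in the final state when the first input goes from 0 to 1."""
    baseline = [0.0] * length
    bumped = [1.0] + [0.0] * (length - 1)
    return final_state(bumped, w) - final_state(baseline, w)


# Sequence lengths chosen for illustration, including the 97-step case.
for length in (1, 5, 20, 97):
    print(length, influence_of_first_input(length))
```

Under these assumptions the influence of the first input is `w ** (length - 1)`, so it shrinks exponentially as the sequence gets longer, while the last input always contributes at full strength. Gated cells such as LSTMs are designed to slow this decay.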





### Attention

Attention was introduced to address these limitations of Seq2Seq models.

The core idea is that the context vector z should have access to **all parts** of the input sequence instead of just the last one.

We can look at all the different words at the same time and learn to **“pay attention”** to the correct ones depending on the task at hand.

Attention is simply a notion of memory gained from attending to multiple inputs through time.



### Attention as an alignment between words

The attention score describes the relationship between two states, the previous decoder state $y_{i-1}$ and an encoder hidden state $h$, and captures how “aligned” they are.

There are many different ways to compute that score. The simplest one computes it as the dot product between the two states, $y_{i-1}^{\top} h$.
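
As a minimal sketch of the dot-product variant, the plain-Python code below scores one decoder state against every encoder state, normalizes the scores with a softmax, and returns the weighted sum of the encoder states as the context vector. The function names and the toy vectors are hypothetical, and the sketch assumes the decoder and encoder states have the same dimension, which a plain dot product requires.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(y_prev, encoder_states):
    """Return (weights, context) for one decoder state over all encoder states."""
    # One alignment score per encoder state.
    scores = [dot(y_prev, h) for h in encoder_states]
    # Weights are non-negative and sum to 1.
    weights = softmax(scores)
    # Context vector: weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Toy values chosen for illustration only.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_prev = [2.0, 0.0]

weights, context = dot_product_attention(y_prev, encoder_states)
print(weights)
print(context)
```

Here the first and third encoder states both get a score of 2.0 and the second gets 0.0, so most of the weight goes to the first and third states and the context vector is dominated by them.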


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """
    This class defines an attention mechanism, which is a component in neural networks 
    that helps the model to dynamically focus on certain parts of the input.
    It inherits from nn.Module, which is the base class for all neural network modules in PyTorch.
    """

    def __init__(self, y_dim: int, h_dim: int):
        """
        The constructor for the Attention class.
        
        Args:
        y_dim (int): The dimension of the input tensor y.
        h_dim (int): The dimension of the input tensor h.
        """
        super().__init__()  # Initializes the base class nn.Module.
        
        # Store the dimensions of y and h in instance variables for later use.
        self.y_dim = y_dim
        self.h_dim = h_dim

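        # Define a learnable weight matrix W as a parameter of the module.
        # The shape of W is (y_dim, h_dim), so y @ W maps a vector from y_dim to h_dim.
        # torch.empty only allocates memory, so W is initialized explicitly
        # (Xavier uniform) instead of being left with arbitrary values.
        self.W = nn.Parameter(torch.empty(self.y_dim, self.h_dim))
        nn.init.xavier_uniform_(self.W)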

    def forward(self, y: torch.Tensor, h: torch.Tensor):
        """
        The forward pass of the Attention mechanism.
        
        Args:
        y (torch.Tensor): The query states y with shape (num_queries, y_dim).
        h (torch.Tensor): The attended states h with shape (num_states, h_dim).
        
        Returns:
        torch.Tensor: Context vectors of shape (num_queries, h_dim), one per row of y.
        """
        
        # Compute the attention scores.
        # This is done by matrix-multiplying the input y with the weight matrix W,
        # and then with the transpose of h (a learned generalization of the dot product).
        # The result is a score matrix of shape (num_queries, num_states),
        # where score[i, j] measures how well y[i] aligns with h[j].
        score = torch.matmul(torch.matmul(y, self.W), h.T)

        # Apply the softmax function over the last dimension (the h states).
        # This normalizes each row so that it sums up to 1,
        # making row i a probability distribution over the h states.
        z = F.softmax(score, dim=-1)

        # Multiply the normalized scores with the input h.
        # This step computes, for each row of y, a weighted sum of the h vectors,
        # with the weights given by the attention scores.
        # The result is a tensor of shape (num_queries, h_dim),
        # containing the attended features.
        return torch.matmul(z, h)


<img src="./img/transformers/self_attention.png" alt="self_attention" width="700"/>

# Transformers Architecture Overview 

### 1. Sets and tokenization

<img src="./img/transformers/sets_and_tokenization.png" alt="sets_and_tokenization" width="700"/>
