## "Attention Is All You Need"

In [11]:
# imports
import torch
import torch.nn as nn

In [12]:
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.heads_dim = embed_size // heads

        assert (self.heads_dim * heads == embed_size), "embed_size must be divisible by heads"

        self.values = nn.Linear(self.heads_dim, self.heads_dim, bias=False)
        self.keys = nn.Linear(self.heads_dim, self.heads_dim, bias=False)
        self.queries = nn.Linear(self.heads_dim, self.heads_dim, bias=False)
        self.fc_out = nn.Linear(self.heads * self.heads_dim, embed_size, bias=False)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]
        # split the embedding into self.heads pieces
        values = values.reshape(N, value_len, self.heads, self.heads_dim)
        keys = keys.reshape(N, key_len, self.heads, self.heads_dim)
        queries = query.reshape(N, query_len, self.heads, self.heads_dim)


        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        # queries shape : (N, query_len, heads, heads_dim)
        # keys shape : (N, key_len, heads, heads_dim)
        # energy shape : (N, heads, query_len, key_len)
        # it says for each word in our target (query_len)
        # how much are we going to pay attention to each word in the source sentence (key_len)
        # torch.bmm could compute the same product; torch.einsum is used here

        if mask is not None:
            energy = energy.masked_fill(mask == 0, float('-inf'))

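        # note: this divides by sqrt(embed_size); the paper divides by sqrt(d_k),
        # which corresponds to sqrt(heads_dim) here
        attention = torch.softmax(energy / (self.embed_size ** (1 / 2)), dim=3)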

        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len,self.heads*self.heads_dim
        )

        # attention shape : (N, heads, query_len, key_len)
        # values shape : (N, value_len, heads, heads_dim)
        # after einsum, multiplied to be : (N, query_len, heads, heads_dim)
        # then flatten last two dimensions

        out = self.fc_out(out)
        return out

# Multi-Head Self-Attention

The implementation above corresponds to the **Multi-Head Self-Attention Mechanism** described in the paper *"Attention Is All You Need"* by Vaswani et al. Here's a breakdown of its components and their roles:

## Key Concepts
1. **Self-Attention Mechanism**:
   The goal of self-attention is to compute a weighted sum of the values, where the weights are determined by the similarity between the query and keys.

   \[
   \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
   \]

   - \(Q\): Queries.
   - \(K\): Keys.
   - \(V\): Values.
   - \(d_k\): Dimensionality of the key/query vectors.

2. **Multi-Head Attention**:
   Instead of performing a single attention computation, the mechanism is split into \(h\) attention heads to capture different aspects of the sequence.

3. **Scaling**:
   The dot product of queries and keys is divided by \(\sqrt{d_k}\). Without this, large dot products push the softmax into saturated regions where its gradients become very small.
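
To make the scaling argument concrete, here is a small sketch that is not part of the original notebook. It assumes queries and keys whose components are independent with unit variance; under that assumption the raw dot products have a variance close to \(d_k\), and dividing by \(\sqrt{d_k}\) brings it back to roughly 1. The value `d_k = 64` and the sample count are illustrative choices.

```python
import torch

torch.manual_seed(0)

d_k = 64              # illustrative key/query dimensionality (assumption)
num_samples = 10_000  # illustrative sample count (assumption)

# Random queries and keys with unit-variance components.
q = torch.randn(num_samples, d_k)
k = torch.randn(num_samples, d_k)

raw_scores = (q * k).sum(dim=-1)         # one dot product per row
scaled_scores = raw_scores / d_k ** 0.5  # the scaling used in the attention formula

raw_var = raw_scores.var().item()
scaled_var = scaled_scores.var().item()
print(f"variance before scaling: {raw_var:.2f}")
print(f"variance after scaling:  {scaled_var:.2f}")
```

Scores with a variance near 1 keep the softmax away from the regime where one entry dominates and the remaining gradients shrink toward zero.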

---



## Code Explanation
### Initialization
- **`embed_size`**: Size of the input embeddings.
- **`heads`**: Number of attention heads.
- **`heads_dim`**: Dimensionality of each head, computed with integer division (the constructor asserts that `embed_size` is divisible by `heads`):
  \[
  \text{heads\_dim} = \frac{\text{embed\_size}}{\text{heads}}
  \]

- **Linear Layers**:
  - `self.values`, `self.keys`, and `self.queries` are `heads_dim`-to-`heads_dim` projections without bias. They are applied after the split into heads, so the same weights are shared by every head.
  - `self.fc_out` combines the outputs of all heads into the original embedding dimension.

### Forward Pass
#### 1. Input Shapes
- Input tensors `values`, `keys`, and `query` typically have shapes:
  \[
  (N, \text{seq\_len}, \text{embed\_size})
  \]
  where \(N\) is the batch size, and \(\text{seq\_len}\) is the sequence length.

#### 2. Splitting into Heads
The embeddings are reshaped to split the embedding into multiple heads:
\[
\text{Shape after reshaping: } (N, \text{seq\_len}, \text{heads}, \text{heads\_dim})
\]
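
The reshape can be checked on a toy tensor. The sketch below uses illustrative sizes (not the notebook's defaults) and shows that head `h` simply holds a contiguous slice of the embedding, and that flattening the last two dimensions restores the original tensor.

```python
import torch

N, seq_len, embed_size, heads = 2, 5, 8, 4  # illustrative sizes (assumption)
heads_dim = embed_size // heads

x = torch.arange(N * seq_len * embed_size, dtype=torch.float32).reshape(
    N, seq_len, embed_size
)

# Split the last dimension into (heads, heads_dim).
x_heads = x.reshape(N, seq_len, heads, heads_dim)

# Head h holds the contiguous slice [h * heads_dim, (h + 1) * heads_dim).
h = 1
same_slice = torch.equal(
    x_heads[:, :, h, :], x[:, :, h * heads_dim:(h + 1) * heads_dim]
)

# Flattening the last two dimensions undoes the split.
x_merged = x_heads.reshape(N, seq_len, heads * heads_dim)
print(x_heads.shape, same_slice, torch.equal(x_merged, x))
```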

#### 3. Linear Transformations
- Values (\(V\)), Keys (\(K\)), and Queries (\(Q\)) are transformed using their respective linear layers.

#### 4. Energy Calculation
The dot product between queries and keys is computed by summing over the head dimension \(d\):
\[
\text{energy}[n, h, q, k] = \sum_{d} Q[n, q, h, d] \cdot K[n, k, h, d]
\]
This is implemented as `torch.einsum("nqhd,nkhd->nhqk", [queries, keys])`.

The resulting shape of the energy is:
\[
(N, \text{heads}, \text{query\_len}, \text{key\_len})
\]
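
For readers less familiar with `einsum`, the following sketch checks the energy computation against an explicit batched matrix product. The tensor sizes are illustrative, and the `permute`-based variant is only one possible equivalent formulation.

```python
import torch

torch.manual_seed(0)

N, query_len, key_len, heads, heads_dim = 2, 5, 7, 4, 8  # illustrative sizes

queries = torch.randn(N, query_len, heads, heads_dim)
keys = torch.randn(N, key_len, heads, heads_dim)

# Sum over d for every (batch, head, query, key) combination.
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)

# Same result with matmul: move heads in front, then multiply Q by K^T.
q_heads_first = queries.permute(0, 2, 1, 3)  # (N, heads, query_len, heads_dim)
k_transposed = keys.permute(0, 2, 3, 1)      # (N, heads, heads_dim, key_len)
energy_mm = q_heads_first @ k_transposed     # (N, heads, query_len, key_len)

print(energy.shape, torch.allclose(energy, energy_mm, atol=1e-5))
```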

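#### 5. Masking (Optional)
- If a mask is provided, it is broadcast against the energy tensor, and positions with a mask value of \(0\) are set to \(-\infty\) so that they receive zero weight after the softmax:
\[
\text{energy}[n, h, q, k] = -\infty \quad \text{if } \text{mask}[n, h, q, k] = 0
\]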
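
A tiny example shows the effect of the fill value. The scores and mask below are made up for illustration. The second part highlights a caveat of using \(-\infty\): a row in which every key is masked has no valid softmax.

```python
import torch

energy = torch.tensor([[1.0, 2.0, 3.0]])  # scores of one query over three keys
mask = torch.tensor([[1, 1, 0]])          # 0 marks a key to ignore

masked_energy = energy.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked_energy, dim=-1)
print(weights)  # the masked key gets a weight of exactly zero

# Caveat: if every key of a row is masked, the softmax yields NaN.
all_masked = torch.full((1, 3), float("-inf"))
all_masked_weights = torch.softmax(all_masked, dim=-1)
print(all_masked_weights)
```

With the masks built later in this notebook the fully masked case only arises for a source sentence that consists of padding alone. A large negative finite fill value is a common way to sidestep it.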

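#### 6. Attention Weights
- Apply the softmax function across the \(\text{key\_len}\) dimension. Note that the code divides by \(\sqrt{\text{embed\_size}}\), whereas the formula in the paper divides by \(\sqrt{d_k}\), which corresponds to \(\sqrt{\text{heads\_dim}}\) here:
\[
\text{attention} = \text{Softmax}\left(\frac{\text{energy}}{\sqrt{\text{embed\_size}}}\right)
\]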
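
The divisor matters for how peaked the attention distribution is. The sketch below compares the divisor used in the `SelfAttention` code, \(\sqrt{\text{embed\_size}}\), with \(\sqrt{\text{heads\_dim}}\) on the same made-up scores. The sizes match the model defaults used later in the notebook; the random scores are an assumption for illustration.

```python
import torch

torch.manual_seed(0)

embed_size, heads = 256, 8
heads_dim = embed_size // heads

# Made-up scores of one query over ten keys.
energy = torch.randn(1, 10) * heads_dim ** 0.5

attn_embed = torch.softmax(energy / embed_size ** 0.5, dim=-1)  # divisor in the code
attn_head = torch.softmax(energy / heads_dim ** 0.5, dim=-1)    # sqrt(d_k) divisor

# A larger divisor flattens the distribution, so its largest weight is smaller.
print(attn_embed.max().item(), attn_head.max().item())
```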

#### 7. Weighted Sum of Values
Using the attention weights, compute a weighted sum of the values:
\[
\text{out}[n, q, h, d] = \sum_{k} \text{attention}[n, h, q, k] \cdot \text{values}[n, k, h, d]
\]

The output shape becomes:
\[
(N, \text{query\_len}, \text{heads}, \text{heads\_dim})
\]
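
One way to convince yourself that this einsum is a weighted sum over keys is to use one-hot attention weights: if every query puts all of its weight on a single key, the output must equal that key's value vector. The sketch below does this with illustrative sizes and then flattens the heads as in the next step.

```python
import torch

torch.manual_seed(0)

N, heads, query_len, key_len, heads_dim = 2, 4, 5, 7, 8  # illustrative sizes
values = torch.randn(N, key_len, heads, heads_dim)

# Every query attends only to key index 2.
attention = torch.zeros(N, heads, query_len, key_len)
attention[..., 2] = 1.0

out = torch.einsum("nhql,nlhd->nqhd", attention, values)

# The value vector at key 2, repeated for every query position.
expected = values[:, 2].unsqueeze(1).expand(N, query_len, heads, heads_dim)
print(torch.allclose(out, expected))

# Flatten heads and heads_dim back into a single embedding dimension.
merged = out.reshape(N, query_len, heads * heads_dim)
print(merged.shape)
```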

#### 8. Combining Heads
- The output is reshaped to combine the dimensions of `heads` and `heads_dim`:
\[
(N, \text{query\_len}, \text{embed\_size})
\]

#### 9. Final Linear Transformation
- The combined output is passed through `self.fc_out` to project it back to the original embedding dimension.

---


In [13]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion*embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion*embed_size, embed_size)
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask):
        attention = self.attention(value,key,query,mask)
        x = self.dropout(self.norm1(attention + query))
        forward = self.feed_forward(x)
        out = self.dropout(self.norm2(forward + x))
        return out

# Transformer Block

The above implementation corresponds to a single **Transformer Block**, combining **self-attention**, **residual connections**, and **position-wise feed-forward networks** with layer normalization and dropout. This block is a key part of both the encoder and decoder in the Transformer architecture.

---

## Key Components
1. **Multi-Head Self-Attention**:
   The block begins with a self-attention mechanism, implemented using the `SelfAttention` class. This allows the model to compute dependencies between all positions in the sequence.

2. **Residual Connections**:
   Residual connections are added after both the self-attention and feed-forward layers to prevent vanishing gradients and to stabilize training:
   \[
   \text{output} = \text{LayerNorm}(\text{input} + \text{layer\_output})
   \]

3. **Layer Normalization**:
   Layer normalization normalizes the inputs across the embedding dimensions to improve convergence.

4. **Feed-Forward Network (FFN)**:
   A two-layer feed-forward network is applied to each position independently. The dimensions are expanded temporarily to increase model capacity:
   \[
   \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
   \]

5. **Dropout**:
   Dropout is applied after each add-and-normalize step, i.e. to the output of `norm1` and of `norm2`, to reduce overfitting.

---

## Code Explanation
### Initialization (`__init__`)
- **`SelfAttention(embed_size, heads)`**: Computes multi-head self-attention.
- **Layer Normalization**:
  - `self.norm1`: Applied after the self-attention layer.
  - `self.norm2`: Applied after the feed-forward layer.
- **Feed-Forward Network**:
  - Temporarily expands the embedding size by a factor of `forward_expansion` before projecting back to the original size.
  - Structure:
    \[
    \text{FFN}(x) = \text{Linear}(\text{ReLU}(\text{Linear}(x)))
    \]
- **Dropout**:
  - Regularization to prevent overfitting during training.
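
The feed-forward network described above acts on each position independently. A quick check, using illustrative sizes and the same `nn.Sequential` layout as the `TransformerBlock`, is to compare the output for the whole sequence with the output for a single position fed in alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_size, forward_expansion = 16, 4  # illustrative sizes (assumption)

feed_forward = nn.Sequential(
    nn.Linear(embed_size, forward_expansion * embed_size),
    nn.ReLU(),
    nn.Linear(forward_expansion * embed_size, embed_size),
)

x = torch.randn(2, 5, embed_size)  # (N, seq_len, embed_size)

full = feed_forward(x)              # all positions at once
single = feed_forward(x[:, 3, :])   # only position 3

# Position 3 gets the same result either way: no mixing across positions.
print(torch.allclose(full[:, 3, :], single, atol=1e-5))
```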

---

### Forward Pass
#### 1. Self-Attention
The block first computes self-attention using the `SelfAttention` module:
\[
\text{attention} = \text{SelfAttention}(value, key, query, mask)
\]
- **Input shapes**:
  \[
  (N, \text{seq\_len}, \text{embed\_size})
  \]
- **Output shape**:
  \[
  (N, \text{seq\_len}, \text{embed\_size})
  \]

#### 2. Add & Normalize (First Residual Connection)
The result of self-attention is added to the original query (residual connection) and normalized:
\[
x = \text{LayerNorm}(\text{attention} + \text{query})
\]
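
The add-and-normalize step can be sketched in isolation. Random tensors stand in for the real attention output and query here, and the sizes are illustrative. With a freshly initialized `nn.LayerNorm` (weight 1, bias 0), each position ends up with approximately zero mean and unit variance across the embedding dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_size = 16  # illustrative size (assumption)
norm = nn.LayerNorm(embed_size)

query = torch.randn(2, 5, embed_size)
attention = torch.randn(2, 5, embed_size)  # stand-in for the attention output

x = norm(attention + query)  # residual connection followed by layer norm

mean = x.mean(dim=-1)                 # one mean per position
var = x.var(dim=-1, unbiased=False)   # one variance per position
print(mean.abs().max().item(), var.min().item(), var.max().item())
```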

#### 3. Feed-Forward Network
The normalized output passes through the feed-forward network:
\[
\text{forward} = \text{FFN}(x)
\]

#### 4. Add & Normalize (Second Residual Connection)
The output of the feed-forward network is added to the input of the feed-forward layer and normalized:
\[
\text{out} = \text{LayerNorm}(\text{forward} + x)
\]

#### 5. Dropout
Dropout is applied after each residual connection to further regularize the model.

---

### Final Output
The Transformer Block outputs the following:
- **Output shape**:
  \[
  (N, \text{seq\_len}, \text{embed\_size})
  \]

This forms the processed representation of the input sequence, which can be passed to subsequent Transformer Blocks or other parts of the model.

---

### Summary
This implementation of the Transformer Block includes:
- **Multi-Head Self-Attention** for capturing dependencies between sequence elements.
- **Feed-Forward Networks** for per-position transformations.
- **Residual Connections**, **Layer Normalization**, and **Dropout** for training stability and regularization.

The Transformer Block is the foundation for building both the **encoder** and **decoder** in the Transformer architecture.


In [14]:
class Encoder(nn.Module):
    def __init__(self,
                 src_vocab_size,
                 embed_size,
                 num_layers,
                 heads,
                 device,
                 forward_expansion,
                 dropout,
                 max_length
    ):
        super(Encoder, self).__init__()
        self.embed_size = embed_size
        self.device = device
        self.word_embeddings = nn.Embedding(src_vocab_size, embed_size)
        self.position_embeddings = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList([
            TransformerBlock(embed_size, heads, dropout=dropout, forward_expansion=forward_expansion)
            for _ in range(num_layers)
        ])
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length).expand(N,seq_length).to(self.device)

        out = self.dropout(self.word_embeddings(x) + self.position_embeddings(positions))

        for layer in self.layers:
            out = layer(out, out, out, mask)

        return out

# Encoder

The Encoder is responsible for processing the input sequence into a representation that captures its contextual meaning, which can then be used by the Decoder. This implementation consists of embedding layers, positional encoding, and a stack of Transformer blocks.

---

## Key Components
1. **Input Embedding**:
   Converts input tokens (word indices) into dense vector representations.

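2. **Positional Encoding**:
   Adds positional information to the embeddings, enabling the model to capture sequential relationships. This implementation uses a learned `nn.Embedding` over position indices rather than the fixed sinusoidal encoding from the paper.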

3. **Stack of Transformer Blocks**:
   Multiple `TransformerBlock` layers are applied in sequence to process the embeddings. Each block combines:
   - Multi-head self-attention.
   - Feed-forward networks.
   - Residual connections and normalization.

4. **Dropout**:
   Regularization to prevent overfitting during training.

---

## Code Explanation
### Initialization (`__init__`)
1. **Parameters**:
   - `src_vocab_size`: Size of the source vocabulary.
   - `embed_size`: Dimensionality of the embeddings.
   - `num_layers`: Number of Transformer blocks in the encoder.
   - `heads`: Number of attention heads in each block.
   - `device`: Device for computation (e.g., CPU or GPU).
   - `forward_expansion`: Expansion factor for the feed-forward layer in each Transformer block.
   - `dropout`: Dropout rate for regularization.
   - `max_length`: Maximum sequence length supported by the encoder.

2. **Embedding Layers**:
   - `self.word_embeddings`: Maps input tokens to dense vectors of size `embed_size`.
   - `self.position_embeddings`: Maps positions to dense vectors of size `embed_size`.

3. **Transformer Layers**:
   - `self.layers`: A list of `num_layers` `TransformerBlock` modules.

4. **Dropout**:
   - Applied after adding embeddings and positional encodings.

---

### Forward Pass
#### 1. Input Dimensions
- Input tensor `x` has shape:
  \[
  (N, \text{seq\_length})
  \]
  where \(N\) is the batch size, and \(\text{seq\_length}\) is the length of the input sequence.

#### 2. Positional Encoding
- Compute positional indices for each token:
  \[
  \text{positions} = [0, 1, 2, \ldots, \text{seq\_length}-1]
  \]
  The shape of `positions` is:
  \[
  (N, \text{seq\_length})
  \]

#### 3. Combine Embeddings
- Add the word embeddings and positional embeddings:
  \[
  \text{out} = \text{Dropout}(\text{word\_embeddings}(x) + \text{position\_embeddings}(\text{positions}))
  \]
  The shape of `out` is:
  \[
  (N, \text{seq\_length}, \text{embed\_size})
  \]
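
The position indices and the embedding sum can be reproduced on a small batch. The vocabulary size, `max_length`, and `embed_size` below are illustrative, and dropout is left out so the shapes are the only thing on display.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, max_length, embed_size = 10, 100, 16  # illustrative sizes

word_embeddings = nn.Embedding(vocab_size, embed_size)
position_embeddings = nn.Embedding(max_length, embed_size)

x = torch.tensor([[1, 5, 6, 4], [1, 8, 7, 3]])  # (N, seq_length) token indices
N, seq_length = x.shape

# The same row [0, 1, ..., seq_length - 1] for every sequence in the batch.
positions = torch.arange(0, seq_length).expand(N, seq_length)

out = word_embeddings(x) + position_embeddings(positions)
print(positions)
print(out.shape)
```

Because `position_embeddings` only has `max_length` rows, a sequence longer than `max_length` would produce an out-of-range index.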

#### 4. Pass Through Transformer Blocks
- Pass `out` through each `TransformerBlock`:
  \[
  \text{out} = \text{TransformerBlock}(\text{out}, \text{out}, \text{out}, \text{mask})
  \]
  - The same tensor is used for `value`, `key`, and `query` in self-attention.
  - The shape remains:
    \[
    (N, \text{seq\_length}, \text{embed\_size})
    \]

#### 5. Output
- The final output represents the processed sequence, capturing its contextual information:
  \[
  \text{out} = (N, \text{seq\_length}, \text{embed\_size})
  \]

---

## Summary
The Encoder processes input sequences as follows:
1. Converts tokens into embeddings.
2. Adds positional information.
3. Applies a stack of `TransformerBlock`s to refine the representation.

The Encoder's output is used by the Decoder for tasks like machine translation, text summarization, and more.


In [15]:
class DecoderBlock(nn.Module):
    def __init__(self, embed_size, heads, forward_expansion, dropout, device):
        super(DecoderBlock, self).__init__()
        self.attention = SelfAttention(embed_size, heads)
        self.norm = nn.LayerNorm(embed_size)
        self.transformer_block = TransformerBlock(embed_size, heads, dropout, forward_expansion)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, src_mask, trg_mask):
        attention = self.attention(x, x, x, trg_mask)
        query = self.dropout(self.norm(attention + x))
        out = self.transformer_block(value, key, query, src_mask)
        return out

# Decoder Block

The Decoder Block processes the target sequence, generating a contextual representation that incorporates both the target sequence and the source sequence information. This is achieved using self-attention, encoder-decoder attention, and feed-forward layers.

---

## Key Components
1. **Self-Attention**:
   Allows the decoder to attend to the target sequence itself, ensuring that the decoder only "sees" the past tokens (up to the current position).

2. **Residual Connections and Normalization**:
   Residual connections are applied after both the self-attention and encoder-decoder attention layers, followed by layer normalization.

3. **Encoder-Decoder Attention**:
   This cross-attention mechanism allows the decoder to attend to the encoder's output, integrating information from the source sequence.

4. **Transformer Block**:
   Incorporates multi-head attention, feed-forward layers, residual connections, and normalization, similar to the encoder block.

5. **Dropout**:
   Regularization to prevent overfitting during training.

---

## Code Explanation
### Initialization (`__init__`)
1. **Parameters**:
   - `embed_size`: Dimensionality of the embeddings.
   - `heads`: Number of attention heads in multi-head attention.
   - `forward_expansion`: Expansion factor for the feed-forward network.
   - `dropout`: Dropout rate for regularization.
   - `device`: Device for computation (e.g., CPU or GPU).

2. **Components**:
   - `self.attention`: Implements masked self-attention for the target sequence.
   - `self.norm`: Layer normalization applied after self-attention.
   - `self.transformer_block`: Processes the encoder-decoder attention and feed-forward steps.
   - `self.dropout`: Applies dropout after residual connections.

---

### Forward Pass
#### 1. Masked Self-Attention
- The decoder attends to the target sequence using `SelfAttention`. The `trg_mask` ensures that the decoder can only attend to past tokens (and the current token), preventing it from "cheating" during training:
  \[
  \text{attention} = \text{SelfAttention}(x, x, x, \text{trg\_mask})
  \]
- The shape of `attention` is:
  \[
  (N, \text{trg\_len}, \text{embed\_size})
  \]

#### 2. Add & Normalize
- A residual connection adds the input `x` to the output of self-attention, followed by layer normalization and then dropout:
  \[
  \text{query} = \text{Dropout}(\text{LayerNorm}(\text{attention} + x))
  \]

#### 3. Encoder-Decoder Attention
- The `TransformerBlock` uses the encoder's output (`value` and `key`) and the updated `query` to compute encoder-decoder attention. The `src_mask` ensures that irrelevant parts of the source sequence are ignored:
  \[
  \text{out} = \text{TransformerBlock}(\text{value}, \text{key}, \text{query}, \text{src\_mask})
  \]
- The shape of `out` is:
  \[
  (N, \text{trg\_len}, \text{embed\_size})
  \]
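
The shapes in encoder-decoder attention are easy to mix up, because queries come from the target side while keys and values come from the source side. The sketch below is a deliberately simplified single-head version without learned projections or masks (an assumption made for brevity, not the notebook's `SelfAttention`), with random stand-ins for the encoder output and decoder state.

```python
import torch

torch.manual_seed(0)

N, src_len, trg_len, embed_size = 2, 9, 7, 16  # illustrative sizes

enc_out = torch.randn(N, src_len, embed_size)  # stand-in for the encoder output
query = torch.randn(N, trg_len, embed_size)    # stand-in for the decoder state

# One score per (target position, source position) pair.
scores = query @ enc_out.transpose(1, 2) / embed_size ** 0.5
weights = torch.softmax(scores, dim=-1)  # normalized over source positions

# Each target position receives a weighted sum of source vectors.
out = weights @ enc_out
print(scores.shape, out.shape)
```

The output keeps the target length, which is why the residual connection inside the `TransformerBlock` adds the `query` rather than the encoder output.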

---

### Summary
The Decoder Block performs the following:
1. **Masked Self-Attention**:
   Ensures the decoder focuses only on the tokens seen so far in the target sequence.

2. **Encoder-Decoder Attention**:
   Allows the decoder to attend to relevant parts of the source sequence output from the encoder.

3. **Residual Connections, Normalization, and Dropout**:
   Improve training stability and prevent overfitting.

This structure is repeated in the full Transformer Decoder, with multiple decoder blocks stacked sequentially.

---

### Output
- The output shape is:
  \[
  (N, \text{trg\_len}, \text{embed\_size})
  \]
- It represents the processed target sequence, incorporating information from both the target and source sequences.


In [16]:
class Decoder(nn.Module):

    def __init__(self, trg_vocab_size, embed_size, num_layers, heads, forward_expansion,dropout, device,max_length):
        super(Decoder, self).__init__()
        self.device = device
        self.word_embeddings = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embeddings = nn.Embedding(max_length, embed_size)

        self.layers = nn.ModuleList([
            DecoderBlock(embed_size, heads, forward_expansion, dropout, device)
            for _ in range(num_layers)
        ])

        self.fc_out = nn.Linear(embed_size, trg_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask, trg_mask):
        N, seq_length = x.shape
        position = torch.arange(0, seq_length).expand(N, seq_length).to(self.device)
        x = self.dropout((self.word_embeddings(x)) + (self.position_embeddings(position)))

        for layer in self.layers:
            x = layer(x, enc_out, enc_out, src_mask, trg_mask)

        out = self.fc_out(x)
        return out

# Transformer Decoder

The `Decoder` takes the target sequence (the ground-truth target during training, or the tokens predicted so far during inference) together with the encoder output and produces a score for every vocabulary entry at each position. It consists of multiple stacked `DecoderBlock` layers.

---

## Key Components

1. **Word Embeddings**:
   - Converts token indices in the target vocabulary into dense vector representations of size `embed_size`.

2. **Positional Embeddings**:
   - Encodes positional information for each token in the sequence to preserve order, as Transformers lack inherent sequential bias.

3. **Decoder Blocks**:
   - Each `DecoderBlock` processes the current representation of the target sequence while incorporating the encoded source sequence.

4. **Dropout**:
   - Helps regularize the model by preventing overfitting.

5. **Final Linear Layer**:
   - Maps the `embed_size` vectors to the size of the target vocabulary (`trg_vocab_size`). The result is unnormalized scores (logits); no softmax is applied inside the model.

---

## Code Walkthrough

### Initialization (`__init__`)
1. **Parameters**:
   - `trg_vocab_size`: Size of the target vocabulary.
   - `embed_size`: Dimensionality of token embeddings.
   - `num_layers`: Number of stacked decoder blocks.
   - `heads`: Number of attention heads in multi-head attention.
   - `forward_expansion`: Expansion factor for the feed-forward network inside each block.
   - `dropout`: Dropout rate for regularization.
   - `device`: The computation device (e.g., CPU or GPU).
   - `max_length`: Maximum target sequence length supported by the position embeddings.

2. **Components**:
   - `word_embeddings`: Maps target vocabulary indices to dense vectors.
   - `position_embeddings`: Encodes positional information for tokens.
   - `layers`: A stack of `DecoderBlock` instances.
   - `fc_out`: Maps the decoder's final outputs to the target vocabulary space.
   - `dropout`: Applies dropout to the sum of word and position embeddings.

---

### Forward Pass
#### Inputs:
- `x`: Target sequence (as token indices) of shape \((N, \text{trg\_len})\).
- `enc_out`: Output of the encoder, which represents the source sequence context.
- `src_mask`: Source sequence mask to ignore padding tokens.
- `trg_mask`: Target sequence mask to prevent the decoder from attending to future tokens.

#### 1. Embedding and Positional Encoding:
\[
\text{x\_embed} = \text{word\_embeddings}(x)
\]
\[
\text{pos\_embed} = \text{position\_embeddings}(\text{position})
\]
\[
x = \text{Dropout}(\text{x\_embed} + \text{pos\_embed})
\]

Here:
- `x_embed` captures the semantic representation of the target tokens.
- `pos_embed` encodes the token positions.
- Shape of `x`: \((N, \text{trg\_len}, \text{embed\_size})\).

#### 2. Decoder Blocks:
Each `DecoderBlock` processes the sequence and integrates source context:
\[
x = \text{DecoderBlock}(x, \text{enc\_out}, \text{enc\_out}, \text{src\_mask}, \text{trg\_mask})
\]
This is repeated for all layers in `self.layers`.

#### 3. Final Linear Layer:
The final layer projects the output into the target vocabulary space:
\[
\text{out} = \text{fc\_out}(x)
\]
- Shape of `out`: \((N, \text{trg\_len}, \text{trg\_vocab\_size})\).
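
Since `fc_out` is a plain linear layer, its output consists of unnormalized scores (logits). The sketch below shows one way to turn such scores into probabilities and greedy token predictions. A random tensor stands in for the decoder output, and the variable names are hypothetical.

```python
import torch

torch.manual_seed(0)

N, trg_len, trg_vocab_size = 2, 7, 10
logits = torch.randn(N, trg_len, trg_vocab_size)  # stand-in for the decoder output

# Softmax over the vocabulary dimension gives one distribution per position.
probs = torch.softmax(logits, dim=-1)

# Greedy choice: the most likely token at each position.
predicted_tokens = probs.argmax(dim=-1)
print(probs.sum(dim=-1))
print(predicted_tokens.shape)
```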

---

## Output:
- `out`: Unnormalized scores (logits) over the target vocabulary for each position in the target sequence.
- Shape: \((N, \text{trg\_len}, \text{trg\_vocab\_size})\).

---

## Summary of Functionality:
1. **Embeddings**:
   Combine semantic and positional embeddings for the target sequence.

2. **Masked Self-Attention**:
   Prevents information leakage by ensuring the decoder can only attend to already-seen tokens.

3. **Encoder-Decoder Attention**:
   Incorporates context from the encoder output, allowing the decoder to relate target tokens to the source sequence.

4. **Prediction**:
   Produces logits over the target vocabulary for each position in the target sequence.

This class forms the backbone of the decoder side in sequence-to-sequence tasks, such as machine translation and text generation.


In [17]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx,
                 embed_size=256, num_layers=6, forward_expansion=4,
                 heads=8, dropout=0.1, device='cpu',max_length=100):
        super(Transformer, self).__init__()
        self.encoder = Encoder(src_vocab_size,embed_size, num_layers,
                               heads,device,forward_expansion,dropout,max_length)

        self.decoder = Decoder(trg_vocab_size,embed_size,num_layers,
                               heads,forward_expansion,dropout,device,max_length)

        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device


    def make_src_mask(self, src):
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)
        # (N, 1, 1, src_length)
        return src_mask.to(self.device)

    def make_trg_mask(self, trg):
        N, trg_len = trg.shape
        trg_mask = torch.tril(torch.ones((trg_len,trg_len))).expand(N, 1,trg_len, trg_len)
        return trg_mask.to(self.device)

    def forward(self, src, trg):
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        enc_src = self.encoder(src, src_mask)
        out = self.decoder(trg, enc_src, src_mask, trg_mask)
        return out

# Transformer: Full Model

The `Transformer` class combines the **Encoder** and **Decoder** components, managing masks and facilitating end-to-end forward propagation.

---

## Key Components

1. **Encoder**:
   - Processes the source sequence, mapping it into a context-rich latent representation.
   - Parameters:
     - `src_vocab_size`: Vocabulary size for the source language.
     - `embed_size`: Dimensionality of embeddings.
     - `num_layers`: Number of stacked Transformer blocks.
     - `heads`: Number of attention heads.
     - `forward_expansion`: Expansion factor for feed-forward layers.
     - `dropout`: Dropout rate for regularization.
     - `max_length`: Maximum input sequence length.

2. **Decoder**:
   - Generates the output sequence using the encoded representation and target sequence.
   - Parameters mirror the encoder, except the target vocabulary size (`trg_vocab_size`).

3. **Masks**:
   - **Source Mask (`src_mask`)**:
     - Ensures that padding tokens in the source sequence are ignored during attention.
   - **Target Mask (`trg_mask`)**:
     - Prevents the decoder from attending to future tokens during training, preserving causality.

4. **Device Management**:
   - Moves the masks and position indices to the specified device (e.g., `cpu` or `cuda`).

---

## Code Walkthrough

### Initialization (`__init__`)
1. **Encoder and Decoder Initialization**:
   - Creates encoder and decoder with matching embedding sizes, layer counts, and attention heads.
   
2. **Padding Indices**:
   - `src_pad_idx`: Index of padding token in the source vocabulary.
   - `trg_pad_idx`: Index of padding token in the target vocabulary. It is stored but not used by `make_trg_mask` in this implementation.

3. **Device Assignment**:
   - The model is designed to run on the specified computation device.

---

### Mask Creation Functions
#### `make_src_mask(src)`
- Purpose: Masks out padding tokens in the source sequence.
\[
\text{src\_mask} = (\text{src} \neq \text{src\_pad\_idx}).\text{unsqueeze}(1).\text{unsqueeze}(2)
\]
- Shape:
  \[
  (N, 1, 1, \text{src\_len})
  \]
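
A short example with a made-up batch shows what the source mask looks like and how its two singleton dimensions let it broadcast over heads and query positions. The pad index `0` matches the demo at the end of the notebook; the zero-filled `energy` tensor is only a placeholder.

```python
import torch

src_pad_idx = 0
src = torch.tensor([[1, 5, 6, 0, 0], [1, 8, 7, 3, 2]])  # first row ends in padding

src_mask = (src != src_pad_idx).unsqueeze(1).unsqueeze(2)
print(src_mask.shape)     # (N, 1, 1, src_len)
print(src_mask[0, 0, 0])  # False at the padded positions

# Broadcasting against an energy tensor of shape (N, heads, query_len, key_len).
heads, query_len = 4, 3
energy = torch.zeros(src.shape[0], heads, query_len, src.shape[1])
masked = energy.masked_fill(src_mask == 0, float("-inf"))
print(masked[0, 0])  # the last two key columns are -inf for every query
```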

#### `make_trg_mask(trg)`
- Purpose: Masks out future tokens in the target sequence.
- Creates a lower triangular matrix:
\[
\text{trg\_mask} = \text{torch.tril}(\text{torch.ones}((\text{trg\_len}, \text{trg\_len})))
\]
- Expanded to match batch size:
\[
\text{trg\_mask}.\text{expand}(N, 1, \text{trg\_len}, \text{trg\_len})
\]
- Shape:
  \[
  (N, 1, \text{trg\_len}, \text{trg\_len})
  \]
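
Printing the mask for a short sequence makes the causal structure visible: row `i` has ones only in columns `0..i`, so position `i` can attend to itself and to earlier positions. The batch size and length below are illustrative.

```python
import torch

N, trg_len = 2, 4  # illustrative sizes

trg_mask = torch.tril(torch.ones((trg_len, trg_len))).expand(
    N, 1, trg_len, trg_len
)
print(trg_mask.shape)
print(trg_mask[0, 0])
```

Note that `expand` returns a view without copying data, so the same lower-triangular matrix is shared across the batch.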

---

### Forward Pass
#### Inputs:
1. `src`: Source sequence of shape \((N, \text{src\_len})\).
2. `trg`: Target sequence of shape \((N, \text{trg\_len})\).

#### Steps:
1. **Source Mask**:
   \[
   \text{src\_mask} = \text{make\_src\_mask(src)}
   \]

2. **Target Mask**:
   \[
   \text{trg\_mask} = \text{make\_trg\_mask(trg)}
   \]

3. **Encoder**:
   - Processes the source sequence:
   \[
   \text{enc\_src} = \text{encoder(src, src\_mask)}
   \]
   - Shape of `enc_src`: \((N, \text{src\_len}, \text{embed\_size})\).

4. **Decoder**:
   - Combines target sequence with encoder output:
   \[
   \text{out} = \text{decoder(trg, enc\_src, src\_mask, trg\_mask)}
   \]
   - Shape of `out`: \((N, \text{trg\_len}, \text{trg\_vocab\_size})\).

---

### Output:
- `out`: Logits over the target vocabulary for each position in the target sequence.
- Shape: \((N, \text{trg\_len}, \text{trg\_vocab\_size})\).

---

## Summary of Functionality:
1. **Encoder**:
   - Maps the source sequence to a contextual representation.

2. **Masks**:
   - Handle padding and causality to ensure proper attention.

3. **Decoder**:
   - Incorporates encoder context and generates output predictions.

This implementation provides the framework for sequence-to-sequence tasks like translation, summarization, and text generation.


In [18]:
if __name__ == '__main__':
    device = torch.device("cpu")
    #device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.tensor([[1,5,6,4,3,9,5,2,0],[1,8,7,3,4,5,6,7,2]]).to(device)
    trg = torch.tensor([[1,7,4,3,5,9,2,0],[1,5,6,2,4,7,6,2]]).to(device)

    src_pad_idx, trg_pad_idx = 0,0
    src_vocab_size, trg_vocab_size = 10,10
    model = Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx).to(device)

    out = model(x, trg[:,:-1])
    print(out.shape)

torch.Size([2, 7, 10])
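
The call `model(x, trg[:,:-1])` drops the last target token, which is why the output has 7 positions instead of 8. The notebook stops at the shape check, so the following is only a sketch of how such an output is commonly paired with labels during training, under the assumption that position `t` should predict token `t + 1` and that padding is excluded from the loss. A random tensor stands in for the model output, and `decoder_input`, `labels`, and `loss` are hypothetical names.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0], [1, 5, 6, 2, 4, 7, 6, 2]])
trg_pad_idx, trg_vocab_size = 0, 10

decoder_input = trg[:, :-1]  # what the decoder sees
labels = trg[:, 1:]          # what it should predict, shifted by one position

# Stand-in for model(x, decoder_input): (N, trg_len - 1, trg_vocab_size).
logits = torch.randn(2, decoder_input.shape[1], trg_vocab_size)

# cross_entropy expects (num_items, num_classes) scores and (num_items,) labels.
loss = F.cross_entropy(
    logits.reshape(-1, trg_vocab_size),
    labels.reshape(-1),
    ignore_index=trg_pad_idx,  # assumption: padded positions carry no loss
)
print(decoder_input.shape, labels.shape, loss.item())
```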
