# Language Models Explained Visually

The following resources explain language models visually:

- [Brendan Bycroft's LLM Visualization](https://bbycroft.net/llm)
- [Polo Club's Transformer Explainer](https://poloclub.github.io/transformer-explainer/)



# Corpus
    # A collection of text documents used to train a language model. The corpus can be a collection of books, articles, or any text data.
    
    
# Vocabulary
    # The set of unique tokens (words, sub-words, or characters) that a model can understand. The vocabulary is typically derived from
    # the training corpus and includes common words and special tokens like [PAD], [UNK], [CLS], and [SEP].
    
    
# Attention Mechanism
    # A mechanism that allows the model to focus on important words in a sequence, enabling the model to handle long-range dependencies 
    # and capture context.


# Tokens
    # The smallest unit of input or output the model processes. Tokens can represent words, sub-words, or even characters, depending 
    # on the tokenization strategy.
    
    
# Tokenization
    # The process of converting raw text into tokens (usually words, sub-words, or characters) that the model can process. 
    # Tokenizers break down text based on a model's vocabulary (e.g., Byte Pair Encoding or WordPiece).
    # Byte Pair Encoding (BPE):
        # A tokenization algorithm that starts from individual characters and iteratively merges the most frequent pair of adjacent 
        # symbols in a corpus to create a vocabulary of variable-length tokens. BPE is widely used in NLP tasks, including machine 
        # translation and text generation.
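
A single BPE merge step can be sketched as follows. This is a minimal illustration, not a production tokenizer: the toy word counts and the helper names `most_frequent_pair` and `merge_pair` are assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: each word is a tuple of characters with a count.
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
          tuple("bun"): 4, tuple("hugs"): 5}
pair = most_frequent_pair(corpus)        # ("u", "g") occurs 20 times
corpus_after = merge_pair(corpus, pair)  # "hug" becomes ("h", "ug")
```

Repeating these two steps until the vocabulary reaches a target size yields the list of merges that a BPE tokenizer applies to new text.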


# Embeddings
    # Dense vector representations of tokens (words/sub-words) that capture semantic meaning. Used in LLMs to map input tokens into 
    # a continuous vector space where similar meanings are close together. Instead of treating each word as a unique, isolated token, 
    # embeddings allow words with similar meanings to be represented by vectors (arrays of numbers) that are close together 
    # in a multi-dimensional space.
    
    # example: Word2Vec, GloVe, FastText, BERT embeddings.
        # Imagine you have the words: 
        # "dog", "cat", "apple", "banana" 
        # "dog" -> [0.1, 0.2, 0.3, 0.4], 
        # "cat" -> [0.2, 0.3, 0.4, 0.5], 
        # "apple" -> [0.3, 0.4, 0.5, 0.6], 
        # "banana" -> [0.4, 0.5, 0.6, 0.7]
        # The embeddings for "dog" and "cat" are closer together than "dog" and "apple" because "dog" and "cat" are semantically
        # similar (both animals) compared to "dog" and "apple" (different categories).
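
The closeness claim can be checked with a short sketch that measures Euclidean distance between the toy vectors above. The choice of Euclidean distance is an assumption for illustration; cosine similarity is another common choice for real embeddings.

```python
import math

# Toy 4-dimensional embeddings copied from the example above.
embeddings = {
    "dog": [0.1, 0.2, 0.3, 0.4],
    "cat": [0.2, 0.3, 0.4, 0.5],
    "apple": [0.3, 0.4, 0.5, 0.6],
    "banana": [0.4, 0.5, 0.6, 0.7],
}

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

dog_cat = euclidean_distance(embeddings["dog"], embeddings["cat"])      # about 0.2
dog_apple = euclidean_distance(embeddings["dog"], embeddings["apple"])  # about 0.4
```

A smaller distance means the two tokens sit closer together in the embedding space.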
        
        
# Part-of-Speech (POS) Tagging:
    # Assigning each word in a sentence a grammatical category (e.g., noun, verb, adjective). Helps the model understand the 
    # structure of sentences and the role of each word, which is useful for tasks like parsing, translation, and question answering.


# Named Entity Recognition (NER):
    # Identifying and categorizing entities (names, dates, locations, organizations, etc.) in text.


# Stemming or Lemmatization:
    # Reducing words to their base or root form. Stemming is a rule-based process that removes prefixes or suffixes, while lemmatization 
    # uses a vocabulary and morphological analysis to return the base form of a word. 
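    # Examples: a stemmer may reduce "studies" to "studi", while a lemmatizer returns "study"; lemmatization also maps "better" to "good".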
    

# Sampling Techniques
    # Methods used to generate outputs from a model, such as greedy search (selecting the most likely next token), beam search (exploring 
    # multiple token sequences), and temperature sampling (introducing randomness to outputs).


# Beam Search
    # A search strategy used during text generation to explore multiple possible token sequences and select the most likely ones. 
    # It reduces the likelihood of poor-quality outputs compared to greedy search.
    
    
# Greedy Search
    # A simpler search method where the model always selects the most probable next token. It is fast but may lead to less coherent 
    # or repetitive outputs.


# Autoregressive Models
    # LLMs like GPT, which generate text one token at a time, predicting the next token based on previously generated tokens. 
    # This type of model is suitable for tasks like text generation.
    
    
# Masked Language Models (MLM)
    # Models like BERT that learn by predicting masked-out tokens in a sentence, using the surrounding context. These models are 
    # bidirectional, meaning they consider context from both directions.
    
    
# Zero-Shot Learning
    # The model’s ability to perform tasks without explicit examples in the training data. For example, a zero-shot LLM can classify 
    # text without having seen labeled examples for that specific task.
    
    
# Few-Shot Learning
    # The model can generalize from only a few examples during inference. For instance, by providing the model a few sample questions 
    # and answers, it can handle similar tasks effectively.
    
    
# Fine-Tuning vs. Transfer Learning
    # Fine-Tuning: The process of adapting a pretrained LLM to a specific task (e.g., classification, question answering) by training 
        # it further on task-specific labeled data.
    # Transfer Learning: Leveraging knowledge from a pretrained model and applying it to a new but related task, without needing to 
        # retrain from scratch.


# Temperature Sampling
    # A technique used during text generation to control the randomness of the output. Higher temperatures (e.g., 1.0) result in 
    # more diverse outputs, while lower temperatures (e.g., 0.2) make the model more deterministic.
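
The effect of temperature can be sketched by dividing the logits by the temperature before applying softmax. The logits below are made-up values and `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                        # hypothetical next-token scores
sharp = softmax_with_temperature(logits, 0.2)   # mass concentrates on the top token
flat = softmax_with_temperature(logits, 1.0)    # mass is spread more evenly
```

Sampling from `sharp` almost always returns the first token, while sampling from `flat` produces more varied outputs.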





> LLM Training

# 1. PEFT + LoRA (Parameter Efficient Fine-tuning + Low-Rank Adaptation)
    # Description: Freezes the pre-trained weights and trains only small low-rank adapter matrices injected into selected layers, conserving memory and improving efficiency.
    # Use Case: Fine-tuning large models on limited hardware, since only a small fraction of the parameters is updated.


# 2. Quantization-Aware Training (QAT)
    # Description: Simulates low-precision arithmetic (e.g., INT8) during training so the model learns weights that stay accurate once quantized from high precision (e.g., FP32).
    # Benefits: Yields a smaller, faster model at inference time, usually with less accuracy loss than quantizing after training.
    # Challenges: Adds training overhead and requires careful monitoring to ensure model quality isn’t degraded.


# 3. Gradient Checkpointing
    # Description: Saves memory by storing only a subset of activations during the forward pass and recomputing the rest during backpropagation.
    # Use Case: Reduces memory usage at the cost of extra computation, which slows down training.


# 4. Distributed Training
    # Description: Splits the model and data across multiple devices or nodes for faster training.
    # Key Techniques:
        # FSDP (Fully Sharded Data Parallel): Shards model weights and optimizer states across devices.
        # DeepSpeed Zero Redundancy Optimizer (ZeRO): Partitions optimizer states, gradients, and model parameters across devices to save memory and optimize training efficiency.


> LLM Inference

# 1. Post-Training Quantization (PTQ)
    # Description: Quantizes a model’s weights and activations after training to reduce memory usage.
    # Use Case: Reduces memory footprint for serving models at lower precision (e.g., FP32 → INT8).


# 2. Distributed Inference
    # Description: Partitioning model weights across multiple devices to handle large models.
    # Techniques:
    # Model Partitioning: Divides a large model across multiple GPUs or nodes for more efficient computation.
    # In-flight Batching: Enables the processing of new requests while others are still being computed, improving GPU utilization.


# 3. Dynamic Batching & Continuous Batching
    # Description: Dynamically adjusts batch sizes during inference to maximize GPU utilization, reducing latency.
    # Benefits: Ensures high throughput and efficiency, especially for models with varying input lengths.

> Optimization Techniques

# 1. TensorRT-LLM
    # Description: Optimizes models with kernel fusion and memory techniques like KV caching, Paged Attention, and FlashAttention.
    # Benefits: Improves performance but requires conversion into TensorRT format for use.


# 2. vLLM
    # Description: An inference engine that uses Paged Attention to reduce resource wastage, optimizing memory usage and improving throughput.
    # Benefits: High efficiency in processing tokens compared to traditional methods.


# 3. DeepSpeed-FastGen
    # Description: A DeepSpeed serving system, built on DeepSpeed-MII and DeepSpeed-Inference, for fast, efficient model serving.
    # Key Features: Supports Dynamic SplitFuse batching, improving latency and throughput for large models.
    

# Key Considerations
    # Memory Constraints: LLM training and inference are memory-intensive processes. Techniques like PEFT, QAT, and gradient 
        # checkpointing can help mitigate memory limitations.
    # Model Size: Models with billions of parameters may require distributed training or inference strategies to handle the memory demands.
    # Efficiency: Methods like mixed precision, distributed training, and dynamic batching are key to improving efficiency in 
        # training and inference.
    # Latency: Techniques like dynamic batching and continuous batching can help reduce inference latency, especially for real-time 
        # applications.
    # Throughput: Distributed inference and model partitioning can improve throughput by leveraging multiple devices for 
        # parallel processing.
    # Resource Optimization: Techniques like TensorRT-LLM and vLLM optimize memory usage and improve performance for large models.
    # Scalability: Distributed training and inference methods enable scaling LLMs to handle larger models and datasets efficiently.
    # Model Serving: Systems like DeepSpeed-FastGen provide solutions for serving large language 
        # models effectively.
    # Performance Trade-offs: Quantization and distributed strategies may impact model accuracy, so careful monitoring and tuning 
        # are essential to maintain performance.



# Gradient Descent in Machine Learning

Gradient Descent is an optimization algorithm used to minimize the cost (or loss) function in machine learning. It works by iteratively updating the parameters (weights) of the model in the direction that reduces the error (cost) the most, i.e., in the direction of the **negative gradient** of the loss function.

Let’s assume we have a loss function $J(\theta)$, where $\theta$ represents the parameters (weights) of our model. The goal is to minimize this loss function, i.e., find the set of parameters that gives us the lowest possible value for $J(\theta)$.

Gradient Descent works by updating the parameters as follows:

$$
\theta = \theta - \alpha \cdot \nabla_\theta J(\theta)
$$

Where:
- $ \theta $ is the **parameter** (or weight) of the model.
- $ \alpha $ is the **learning rate** (a small positive value, typically between 0 and 1).
- $ \nabla_\theta J(\theta) $ is the **gradient** (the vector of partial derivatives) of the loss function with respect to the parameters $ \theta $.
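
The update rule can be sketched on a one-parameter problem. The loss $J(\theta) = (\theta - 3)^2$ and the helper name `gradient_descent` are assumptions chosen for illustration; the minimum is at $\theta = 3$.

```python
def gradient_descent(grad, theta, alpha=0.1, steps=100):
    """Repeatedly apply theta = theta - alpha * grad(theta)."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# Hypothetical loss J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
```

With $\alpha = 0.1$ the distance to the minimum shrinks by a factor of 0.8 per step, so `theta_star` ends up very close to 3.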


## Loss Function Example: Binary Cross-Entropy (BCE)

For binary classification problems, one commonly used loss function is the **Binary Cross-Entropy (BCE)** loss. The BCE loss function is given by:

$$
\mathcal{L}(\hat{y}, y) = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

Where:
- $ \hat{y}_i $ is the predicted probability for the $ i^{th} $ sample.
- $ y_i $ is the true label (0 or 1) for the $ i^{th} $ sample.
- $ N $ is the number of samples in the dataset.

The predicted value $ \hat{y}_i $ is typically computed using a **sigmoid activation function**:

$$
\hat{y}_i = \sigma(w^T x_i + b)
$$

Where:
- $ w $ are the weights of the model.
- $ x_i $ is the input features of the $ i^{th} $ sample.
- $ b $ is the bias term.
- $ \sigma(z) = \frac{1}{1 + e^{-z}} $ is the sigmoid function.

The goal of training is to minimize this loss function using gradient descent, iteratively adjusting the weights $ w $ and bias $ b $ to reduce the difference between predicted and true labels.
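
The BCE formula can be sketched in plain Python. The helper names and the clipping constant `eps` are assumptions for this example; clipping only guards against taking $\log(0)$.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over N samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

good = bce_loss([1, 0], [0.9, 0.1])  # confident and correct: low loss
bad = bce_loss([1, 0], [0.1, 0.9])   # confident and wrong: high loss
```

Confident wrong predictions are penalized far more heavily than confident correct ones, which is what drives the gradient updates.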


## Explanation:

- **Gradient ($ \nabla_\theta J(\theta) $)**: This tells us the direction in which the function $ J(\theta) $ increases the most, and its magnitude tells us how steep the slope is at that point. We want to move **in the opposite direction** (down the slope) to minimize the loss.
  
- **Learning rate ($ \alpha $)**: This determines the size of the steps we take in the direction opposite to the gradient. A small learning rate means small steps, and a large learning rate means larger steps. A learning rate that is too small makes the process slow, while one that is too large can cause overshooting and prevent convergence.

- **loss.backward()** - In PyTorch, computes the gradients of the loss with respect to the model's parameters using backpropagation.
- **optimizer.step()** - In PyTorch, updates the model's parameters based on the computed gradients, performing the gradient descent step.



## Calculating Gradients with Respect to Parameters

To update the parameters using gradient descent, we need to calculate the gradient of the loss function with respect to each parameter (weights and biases). Here's how it's done for the Binary Cross-Entropy (BCE) loss in a logistic regression model.

### 1. Binary Cross-Entropy Loss:

For binary classification, the BCE loss function is:

$$
\mathcal{L}(\hat{y}, y) = - \left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]
$$

Where $ \hat{y} = \sigma(w \cdot x + b) $, and $ \sigma(z) = \frac{1}{1 + e^{-z}} $ is the sigmoid activation function.

### 2. Gradient of Loss with respect to Parameters:

- The gradient of the loss with respect to $ w $ (weight):

$$
\frac{\partial \mathcal{L}}{\partial w} = (\hat{y} - y) \cdot x
$$

- The gradient of the loss with respect to $ b $ (bias):

$$
\frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y
$$

Where:
- $ \hat{y} $ is the predicted probability ($ \hat{y} = \sigma(w \cdot x + b) $).
- $ y $ is the true label (0 or 1).
- $ x $ is the input feature vector.

These gradients are used to update the parameters in the direction that minimizes the loss:

$$
w = w - \alpha \cdot \frac{\partial \mathcal{L}}{\partial w}
$$
$$
b = b - \alpha \cdot \frac{\partial \mathcal{L}}{\partial b}
$$

Where $ \alpha $ is the learning rate.
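
The gradient formulas can be checked numerically and then used for one update step. This is a sketch for a single training example; the starting values of `w`, `b`, `x`, `y` and the helper names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(w, b, x, y):
    y_hat = predict(w, b, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def gradients(w, b, x, y):
    """Analytic gradients: dL/dw = (y_hat - y) * x and dL/db = y_hat - y."""
    err = predict(w, b, x) - y
    return [err * xi for xi in x], err

# Hypothetical example: w.x + b = 0, so y_hat = 0.5 and the error is -0.5.
w, b, x, y = [0.5, -0.3], 0.1, [1.0, 2.0], 1
grad_w, grad_b = gradients(w, b, x, y)

# Central finite difference as an independent check of dL/db.
h = 1e-6
numeric_b = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)

# One gradient descent step with learning rate alpha.
alpha = 0.1
w_new = [wi - alpha * gi for wi, gi in zip(w, grad_w)]
b_new = b - alpha * grad_b
```

The loss at `(w_new, b_new)` is lower than at `(w, b)`, which is the behavior the update rule is meant to produce.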



# Variants of Gradient Descent in Machine Learning

## Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a variant of the basic gradient descent algorithm where the parameters are updated using only one data point (randomly selected) at a time, rather than the entire dataset. Each update is much cheaper, so the parameters are updated far more often, but the updates are noisier.

The update rule for SGD is:

$$
\theta = \theta - \alpha \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
$$

Where:
- $ \theta $ is the parameter (or weight).
- $ \alpha $ is the learning rate.
- $ \nabla_\theta J(\theta; x^{(i)}, y^{(i)}) $ is the gradient of the cost function, computed with respect to the $i$-th training example $ (x^{(i)}, y^{(i)}) $.

Since SGD uses one data point at a time, it is computationally more efficient but introduces more variance in the updates. This often causes the algorithm to fluctuate around the minimum rather than converging smoothly.

## Mini-Batch Gradient Descent (MBGD)

Mini-Batch Gradient Descent is a compromise between the standard (Batch) Gradient Descent and Stochastic Gradient Descent. In mini-batch GD, instead of using the full dataset or a single data point, a small random subset of the data (mini-batch) is used to compute the gradient.

The update rule for Mini-Batch GD is:

$$
\theta = \theta - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
$$

Where:
- $ m $ is the number of samples in the mini-batch.

Mini-batch gradient descent provides a balance between the stable gradient estimates of batch gradient descent and the cheap, frequent updates of stochastic gradient descent.
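
The three variants differ only in how many samples feed each update, which the sketch below makes explicit through `batch_size`. The noiseless data set $y = 2x$, the per-sample loss $\frac{1}{2}(\theta x - y)^2$, and the function name `minibatch_gd` are assumptions chosen for illustration.

```python
import random

def minibatch_gd(xs, ys, batch_size, alpha=0.1, epochs=50, seed=0):
    """Fit y = theta * x using the averaged gradient of each mini-batch."""
    rng = random.Random(seed)
    theta = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Per-sample gradient of 0.5 * (theta * x - y)^2 is (theta * x - y) * x.
            grad = sum((theta * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            theta -= alpha * grad
    return theta

# Hypothetical data generated from y = 2x, so the best theta is 2.
xs = [i / 10 for i in range(1, 21)]
ys = [2 * x for x in xs]

theta_sgd = minibatch_gd(xs, ys, batch_size=1)           # stochastic GD
theta_mini = minibatch_gd(xs, ys, batch_size=4)          # mini-batch GD
theta_batch = minibatch_gd(xs, ys, batch_size=len(xs))   # full-batch GD
```

With the same number of epochs, the smaller batch sizes perform many more updates, so they get closer to $\theta = 2$ here than the full-batch run does.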

## Adagrad (Adaptive Gradient Algorithm)

Adagrad is an adaptive learning rate method. It adjusts the learning rate for each parameter individually based on the past gradient updates. Parameters that have larger gradients (more significant updates) will have smaller learning rates, while parameters with smaller gradients will have larger learning rates.

The update rule for Adagrad is:

$$
\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot g_t
$$

Where:
- $ g_t = \nabla_\theta J(\theta_{t-1}) $ is the gradient at time step $ t $.
- $ G_t $ is the sum of the squared gradients up to time step $ t $:

$$
G_t = \sum_{i=1}^{t} g_i^2
$$

- $ \epsilon $ is a small constant added to prevent division by zero (typically $ 10^{-8} $).
- $ \alpha $ is the learning rate.

The adaptive nature of Adagrad makes it effective for sparse data (where most features are zero) or data with varying levels of gradients.
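
A scalar sketch of the Adagrad update follows. The loss $J(\theta) = (\theta - 3)^2$, the learning rate, and the function name `adagrad` are assumptions chosen for illustration.

```python
import math

def adagrad(grad, theta, alpha=1.0, eps=1e-8, steps=200):
    """Scale each step by the inverse square root of the accumulated squared gradients."""
    G = 0.0
    for _ in range(steps):
        g = grad(theta)
        G += g * g  # G only ever grows
        theta -= alpha / math.sqrt(G + eps) * g
    return theta

# Hypothetical loss J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_adagrad = adagrad(lambda t: 2 * (t - 3), theta=0.0)
```

Because `G` never decreases, the effective learning rate `alpha / sqrt(G)` can only shrink as training proceeds.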

## RMSprop (Root Mean Square Propagation)

RMSprop is another adaptive learning rate method, designed to solve some issues with Adagrad, especially the fact that Adagrad's learning rate can shrink too much over time. RMSprop uses a moving average of squared gradients to scale the learning rate, which helps keep the updates stable.

The update rule for RMSprop is:

$$
\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t
$$

Where:
- $ E[g^2]_t $ is the moving average of squared gradients at time step $ t $:

$$
E[g^2]_t = \beta \cdot E[g^2]_{t-1} + (1-\beta) \cdot g_t^2
$$

where $ \beta $ is a smoothing factor (often close to $ 0.9 $).
- $ g_t $ is the gradient at time step $ t $.
- $ \alpha $ is the learning rate.
- $ \epsilon $ is a small constant added to prevent division by zero.

RMSprop helps improve convergence by adapting the learning rate to the magnitude of recent gradients, which is useful for non-stationary objectives.
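
A scalar sketch of the RMSprop update follows. The loss $J(\theta) = (\theta - 3)^2$, the hyperparameter values, and the function name `rmsprop` are assumptions chosen for illustration.

```python
import math

def rmsprop(grad, theta, alpha=0.01, beta=0.9, eps=1e-8, steps=1000):
    """Scale each step by a moving average of recent squared gradients."""
    avg_sq = 0.0
    for _ in range(steps):
        g = grad(theta)
        avg_sq = beta * avg_sq + (1 - beta) * g * g
        theta -= alpha / math.sqrt(avg_sq + eps) * g
    return theta

# Hypothetical loss J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_rmsprop = rmsprop(lambda t: 2 * (t - 3), theta=0.0)
```

Unlike Adagrad's ever-growing sum, the moving average forgets old gradients, so the step size stays close to `alpha` instead of decaying toward zero.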

## Adam (Adaptive Moment Estimation)

Adam combines ideas from both Adagrad and RMSprop. It computes adaptive learning rates for each parameter, but also takes into account the momentum of past gradients (i.e., the exponentially decaying average of past gradients) to improve optimization.

The update rule for Adam is:

$$
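\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
$$

Where:
- $ g_t $ is the gradient at time step $ t $.
- $ m_t $ is the first moment estimate (an exponentially decaying mean of gradients):

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

with $ \beta_1 $ being the decay rate (typically $ 0.9 $).

- $ v_t $ is the second moment estimate (an exponentially decaying mean of squared gradients, i.e., the uncentered variance):

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

with $ \beta_2 $ being another decay rate (typically $ 0.999 $).

- $ \hat{m}_t $ and $ \hat{v}_t $ are the bias-corrected estimates, which compensate for the initialization $ m_0 = v_0 = 0 $:

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$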

- $ \alpha $ is the learning rate.
- $ \epsilon $ is a small constant added to prevent division by zero.

Adam is widely used because it combines the benefits of both momentum and adaptive learning rates, making it well-suited for a variety of machine learning tasks.
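
A scalar sketch of the Adam update follows. It applies the standard bias correction $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$ before each step. The loss $J(\theta) = (\theta - 3)^2$, the learning rate, and the function name `adam` are assumptions chosen for illustration.

```python
import math

def adam(grad, theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Combine a momentum-style average of gradients with per-parameter scaling."""
    m = 0.0  # first moment estimate
    v = 0.0  # second moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Hypothetical loss J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_adam = adam(lambda t: 2 * (t - 3), theta=0.0)
```

On the very first step the bias-corrected ratio is `g / |g|`, so the parameter moves by roughly `alpha` regardless of the gradient's scale.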




<p align="center">
  <img src="transformer.png" alt="Transformer architecture" />
</p>


# Transformer Model: Step-by-Step Workflow

Transformers are a foundational architecture in modern deep learning, particularly in natural language processing (NLP). Below is a comprehensive step-by-step guide outlining the complete workflow of a transformer model, from input tensors to the final output.

## 1. Input Preparation

### 1.1. Raw Text Input
- **Description:** The process begins with raw text data, such as sentences or paragraphs.
- **Example:** `"The cat sat on the mat."`

### 1.2. Tokenization
- **Description:** Converts raw text into discrete tokens (words, subwords, or characters).
- **Substeps:**
  - **a. Splitting:** Break text into tokens based on spaces or specific rules.
  - **b. Subword Tokenization (Optional):** Further splits rare words into subword units using algorithms like Byte Pair Encoding (BPE) or WordPiece.
- **Example:** `"The cat sat on the mat."` → `["The", "cat", "sat", "on", "the", "mat", "."]`

### 1.3. Numerical Encoding
- **Description:** Maps tokens to unique numerical identifiers using a vocabulary index.
- **Substeps:**
  - **a. Vocabulary Lookup:** Each token is assigned an integer based on its position in the vocabulary.
  - **b. Handling Unknown Tokens:** Tokens not present in the vocabulary are mapped to a special `[UNK]` token.
- **Example:** `["The", "cat", "sat", "on", "the", "mat", "."]` → `[101, 2024, 2003, 1037, 7099, 2527, 1012]`
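
The tokenization and encoding steps can be sketched together. The regular expression, the tiny vocabulary, and its integer IDs are assumptions made for this example and do not match any real tokenizer's vocabulary.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def encode(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

# Hypothetical vocabulary with made-up IDs.
vocab = {"[UNK]": 0, "The": 1, "cat": 2, "sat": 3, "on": 4, "the": 5, "mat": 6, ".": 7}

tokens = tokenize("The cat sat on the mat.")
ids = encode(tokens, vocab)             # [1, 2, 3, 4, 5, 6, 7]
unknown = encode(["dog"], vocab)        # [0], because "dog" is not in the vocabulary
```

Real subword tokenizers rarely emit `[UNK]`, because rare words are split into smaller known pieces instead.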

## 2. Embedding Layer

### 2.1. Token Embeddings
- **Description:** Transforms numerical token IDs into dense vector representations.
- **Substeps:**
  - **a. Embedding Matrix:** A learnable matrix where each row corresponds to a token's embedding.
  - **b. Lookup:** Each token ID retrieves its corresponding embedding vector.
- **Example:** `Embedding Matrix [Vocab Size x d_model]` → `X = [n_tokens x d_model]`

### 2.2. Positional Encodings
- **Description:** Adds information about the position of each token in the sequence to the token embeddings.
- **Substeps:**
  - **a. Sinusoidal Encoding (Fixed):** Uses sine and cosine functions of different frequencies.
  - **b. Learned Positional Embeddings (Learnable):** Embeddings are learned during training.
- **Example:**
  $$
  \begin{aligned}
  \text{Positional Encoding}_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\
  \text{Positional Encoding}_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
  \end{aligned}
  $$
  
### 2.3. Combined Embeddings
- **Description:** Summing token embeddings with positional encodings to incorporate positional information.
- **Formula:**
  $$
  E = X + \text{Positional Encodings}
  $$
- **Result:** A matrix `E` representing the input sequence with positional information.
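
The sinusoidal encoding and the element-wise sum can be sketched in plain Python. The function names and the all-zero toy embedding matrix `X` are assumptions for illustration.

```python
import math

def positional_encoding(n_tokens, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(n_tokens)]
    for pos in range(n_tokens):
        for dim in range(0, d_model, 2):  # dim plays the role of 2i in the formula
            angle = pos / (10000 ** (dim / d_model))
            pe[pos][dim] = math.sin(angle)
            if dim + 1 < d_model:
                pe[pos][dim + 1] = math.cos(angle)
    return pe

def add_positions(X, pe):
    """E = X + positional encodings, element by element."""
    return [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, pe)]

pe = positional_encoding(n_tokens=3, d_model=4)
X = [[0.0] * 4 for _ in range(3)]  # hypothetical token embeddings
E = add_positions(X, pe)
```

Position 0 always encodes to alternating 0 and 1 values, since $\sin(0) = 0$ and $\cos(0) = 1$.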

## 3. Transformer Architecture

The transformer consists of an **Encoder** and a **Decoder**, each composed of multiple layers. Below, we outline the main components and their subcomponents.

### 3.1. Encoder

#### 3.1.1. Multi-Head Self-Attention
- **Description:** Allows the model to focus on different parts of the input sequence simultaneously.
- **Substeps:**
  - **a. Linear Projections:** Compute queries (Q), keys (K), and values (V) using learned weight matrices.
    $$
    Q = E W_Q, \quad K = E W_K, \quad V = E W_V
    $$
  
  - **b. Attention Scores:** For each query vector in Q, compute the dot product with all key vectors in K. These scores indicate the relevance or similarity between each query and all keys.
    - *High Score:* The corresponding value should be given more attention.
    - *Low Score:* The corresponding value should be given less attention.
      $$
      \text{Attention Scores} = Q K^T
      $$
  - **c. Scaled Dot-Product Attention:** Calculate attention scores, apply scaling, softmax, and compute weighted sums.
    $$
    \begin{aligned}
    \text{Scores} &= \frac{Q K^T}{\sqrt{d_k}} \\
    \text{Attention Weights} &= \text{softmax}(\text{Scores}) \\
    \text{Attention Output} &= \text{Attention Weights} \times V
    \end{aligned}
    $$
    - Scaling by $\sqrt{d_k}$ prevents the dot products from growing too large, which can push the softmax function into regions with very small gradients.
    - The attention weights determine how much each value vector (from V) contributes to the final output.
    - Multiplying the weights by V aggregates the value vectors according to their relevance to each query, producing a weighted sum that captures contextual information.
  
  - **d. Concatenation:** Concatenate outputs from all attention heads.
  - **e. Final Linear Projection:** Apply a linear transformation to the concatenated output.
    $$
    \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O
    $$
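
Scaled dot-product attention for a single head can be sketched with plain lists. The helper names and the toy identity matrices are assumptions for illustration; multi-head attention runs several such heads on separate projections and concatenates the results.

```python
import math

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

# Hypothetical toy input: two tokens with 2-dimensional queries, keys, and values.
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(Q, K, V)
```

Each query matches its own key best here, so every token places the largest weight on itself.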

#### 3.1.2. Add & Norm
- **Description:**  
  Incorporates a **residual (skip) connection** by adding the sub-layer's input $ E $ to its output, followed by **layer normalization**. This helps preserve the original information and stabilizes the training process.

- **Formula:**
  $$
  \text{Output} = \text{LayerNorm}(E + \text{MultiHead}(Q, K, V))
  $$
  $$
  \text{or}
  $$
  $$
  \text{Output} = \text{LayerNorm}(E + \text{Attention Output})
  $$


#### 3.1.3. Feed-Forward Network (FFN)
- **Description:**  
  Processes each position independently through a two-layer fully connected network to capture complex patterns and non-linear relationships.
  
- **Substeps:**
  - **a. Linear Transformation and Activation:**  
    **Purpose:** Expands the dimensionality of the input to increase the model's capacity to learn.  
    **Formula:**  
    $$
    \text{FFN}_1 = \text{ReLU}(E W_1 + b_1)
    $$
    - $ E $: **Input** to the FFN, i.e., the **Output** of the previous **Add & Norm** step (not the original embedding matrix).
    - $ W_1 $: **Weight matrix** for the first linear layer.
    - $ b_1 $: **Bias vector** for the first linear layer.
    - **ReLU:** Activation function introducing non-linearity.
  
  - **b. Linear Transformation:**  
    **Purpose:** Reduces the dimensionality back to the original size, ensuring consistency in the model's architecture.  
    **Formula:**  
    $$
    \text{FFN}_2 = \text{FFN}_1 W_2 + b_2
    $$
    - $ W_2 $: **Weight matrix** for the second linear layer.
    - $ b_2 $: **Bias vector** for the second linear layer.

#### 3.1.4. Add & Norm
- **Description:**  
  Adds a **residual (skip) connection** by combining the FFN's output with its input $ E $, followed by **layer normalization**. This step helps preserve the original information and stabilizes the training process.
  
- **Formula:**
  $$
  \text{Encoder Output} = \text{LayerNorm}(E + \text{FFN}_2)
  $$
  - $ E $: **Input** to the Feed-Forward Network (output from the previous **Add & Norm** step).
  - $ \text{FFN}_2 $: **Output** from the Feed-Forward Network.
  - **LayerNorm:** Normalizes the combined output at each position across its feature dimension to have a mean of 0 and a variance of 1 (usually followed by a learned scale and shift), enhancing training stability and performance.
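
The feed-forward sub-layer with its residual connection and normalization can be sketched for a single position. The weights below are made-up values, the helper names are assumptions, and the learned scale and shift of layer normalization are omitted for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's features to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, W, b):
    """Compute x W + b for a vector x and a (d_in x d_out) matrix W."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU after the first layer
    return linear(hidden, W2, b2)

def ffn_block(x, W1, b1, W2, b2):
    """LayerNorm(x + FFN(x)): residual connection followed by normalization."""
    return layer_norm([a + f for a, f in zip(x, ffn(x, W1, b1, W2, b2))])

# Hypothetical weights: d_model = 2 expanded to d_ff = 4 and projected back to 2.
W1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b2 = [0.0, 0.0]

x = [1.0, 2.0]
ffn_out = ffn(x, W1, b1, W2, b2)           # [-1.0, 1.0]
block_out = ffn_block(x, W1, b1, W2, b2)   # close to [-1.0, 1.0] after normalization
```

The same weights are applied to every position independently, which is why the network is called position-wise.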

---

### 3.2. Decoder

While the **Encoder** processes the input sequence to generate contextualized representations, the **Decoder** generates the output sequence by leveraging these representations and the previously generated tokens. The Decoder consists of multiple layers, each containing three main sub-layers:

1. **Masked Multi-Head Self-Attention**
2. **Multi-Head Cross-Attention**
3. **Position-wise Feed-Forward Network (FFN)**

Each sub-layer is followed by **Residual Connections** and **Layer Normalization**, similar to the Encoder.


#### 3.2.1. Masked Multi-Head Self-Attention
- **Description:** Similar to encoder's self-attention but with masking to prevent attending to future tokens.
- **Purpose:**  
  - **Prevent Information Leakage:** By masking future tokens, the model ensures that the prediction for the current token doesn't incorporate information from tokens that haven't been generated yet.
  - **Maintain Causality:** Ensures that the generation process respects the sequential nature of language.

- **Substeps:**
  - **a. Masking:** Apply a causal mask to ensure autoregressive property.
    $$
    \begin{aligned}
    \text{Masked Scores} &= \frac{Q K^T}{\sqrt{d_k}} + M \\
    \text{Attention Weights} &= \text{softmax}(\text{Masked Scores})
    \end{aligned}
    $$
  - **b. Compute Attention Output:** As in encoder.
    $$
    \text{MaskedAttention Output} = \text{MultiHead}(Q, K, V) \quad \text{with Masking}
    $$
    $$
    \text{or}
    $$ 
    $$
    \text{MaskedAttention Output} = \text{Attention Weights} \times V
    $$

- **How Masking Works:**
  - **Mask Matrix:** A matrix $M$ with $0$ on and below the diagonal and $-\infty$ above it, so the attention scores for future tokens are set to $-\infty$.
  - **Application:** Before applying the softmax function, the mask is added to the attention scores to nullify the influence of future tokens.
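
The masking step can be sketched by building the mask and applying it before softmax. The function names and the all-zero toy scores are assumptions for illustration.

```python
import math

def causal_mask(n):
    """0 where a position may attend (j <= i), -inf for future positions (j > i)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply softmax row by row."""
    weights = []
    for s_row, m_row in zip(scores, mask):
        masked = [s + m for s, m in zip(s_row, m_row)]
        peak = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(v - peak) for v in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical scores: all zeros, so unmasked positions share the weight equally.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
# weights[0] == [1.0, 0.0, 0.0] and weights[1] == [0.5, 0.5, 0.0]
```

Every entry above the diagonal receives exactly zero weight, so no position can draw information from a later one.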


#### 3.2.2. Add & Norm (Post Masked Self-Attention)
- **Description:** Residual connection and layer normalization.
- **Formula:**
  $$
  \text{Output} = \text{LayerNorm}(E + \text{Masked MultiHead}(Q, K, V))
  $$

#### 3.2.3. Multi-Head Attention over Encoder Outputs (or Multi-Head Cross-Attention)
- **Description:** Allows the decoder to attend to the encoder's output.
- **Components:**
  - **Queries (Q):** Derived from the Decoder's previous sub-layer (Masked Self-Attention output).
  - **Keys (K) & Values (V):** Derived from the Encoder's final output.

- **Substeps:**
  - **a. Compute Q from Decoder:** 
    $$
    Q = \text{Decoder Output} W_Q'
    $$
  - **b. Compute K and V from Encoder:**
    $$
    K = \text{Encoder Output} W_K', \quad V = \text{Encoder Output} W_V'
    $$
  - **c. Compute Attention Output:** As in encoder.

#### 3.2.4. Add & Norm
- **Description:**   Adds another **residual (skip) connection** by combining the input $ E' $ (output from the previous Add & Norm step) with the cross-attention output, followed by **layer normalization**. This step further integrates information from the Encoder while maintaining training stability.

- **Formula:**
  $$
  \text{Output} = \text{LayerNorm}(\text{Decoder Output} + \text{MultiHead}(Q, K, V))
  $$

#### 3.2.5. Feed-Forward Network (FFN) - Position Wise Feed Forward Network
- **Description:** Same as encoder's FFN.
- **Substeps:**
  - **a. Linear Transformation and Activation:**
    $$
    \text{FFN}_1 = \text{ReLU}(\text{Output} W_1 + b_1)
    $$
  - **b. Linear Transformation:**
    $$
    \text{FFN}_2 = \text{FFN}_1 W_2 + b_2
    $$

#### 3.2.6. Add & Norm (Post Feed-Forward Network)
- **Description:** Adds a final **residual (skip) connection** by combining the FFN's output $ \text{FFN}_2 $ with its input $ E'' $ (output from the previous Add & Norm step), followed by **layer normalization**. This final normalization ensures that the Decoder's output is well-conditioned for generating predictions.

- **Formula:**
  $$
  \text{Decoder Output} = \text{LayerNorm}(\text{Output from Add \& Norm} + \text{FFN}_2)
  $$


## 4. Output Embeddings and Generation

- **Description:**  
  The **Decoder Output** is transformed into **output embeddings** which are then used to generate the final predictions (e.g., token probabilities).

### 4.1. Final Linear Layer
- **Description:** Transforms decoder outputs to match the size of the target vocabulary.
- **Formula:**
  $$
  \text{Logits} = \text{Decoder Output} W_O + b_O
  $$

### 4.2. Softmax
- **Description:** Converts logits into probability distributions over the vocabulary.
- **Formula:**
  $$
  \text{Probabilities} = \text{softmax}(\text{Logits})
  $$

### 4.3. Token Selection
- **Description:** A decoding strategy picks the next token in the sequence from the probability distribution.
- **Methods:**
  - **a. Greedy Search:** Select the token with the highest probability.
  - **b. Beam Search:** Explore multiple sequences to find the most likely output.
  - **c. Sampling:** Randomly sample tokens based on their probabilities.
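
Greedy selection and sampling can be contrasted in a few lines. The probability vectors and function names are assumptions for illustration; beam search is omitted because it tracks several candidate sequences rather than a single next token.

```python
import random

def greedy_select(probs):
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_select(probs, rng):
    """Draw a token index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)                      # seeded for reproducibility
probs = [0.1, 0.7, 0.2]                     # hypothetical next-token probabilities
greedy_token = greedy_select(probs)         # always 1
sampled_token = sample_select(probs, rng)   # 0, 1, or 2, most often 1
certain_token = sample_select([0.0, 1.0, 0.0], rng)  # always 1
```

Greedy selection is deterministic, while sampling trades some likelihood for diversity.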

### 4.4. Output Text
- **Description:** Converts selected token IDs back to human-readable text.
- **Example:** `[101, 2024, 2003, 1012]` → `"The cat sat."`

## 5. Summary of Transformer Steps

1. **Input Preparation:**
   - Raw Text Input
   - Tokenization
   - Numerical Encoding

2. **Embedding Layer:**
   - Token Embeddings
   - Positional Encodings
   - Combined Embeddings

3. **Transformer Architecture:**
   - **Encoder:**
     - Multi-Head Self-Attention
     - Add & Norm
     - Feed-Forward Network (FFN)
     - Add & Norm
   - **Decoder:**
     - Masked Multi-Head Self-Attention
     - Add & Norm
     - Multi-Head Attention over Encoder Outputs
     - Add & Norm
     - Feed-Forward Network (FFN)
     - Add & Norm

4. **Output Generation:**
   - Final Linear Layer
   - Softmax
   - Token Selection
   - Output Text



## 6. Masked Multi-Head Attention vs. Regular Multi-Head Attention

- **Masked Multi-Head Attention:**  
  - **Used In:** Decoder's self-attention sub-layer.
  - **Function:** Prevents the model from attending to future tokens in the sequence during training, maintaining causality.
  
- **Regular Multi-Head Attention:**  
  - **Used In:** Encoder's self-attention sub-layers and Decoder's cross-attention sub-layer.
  - **Function:** Allows the model to attend to all positions in the input or encoder's output without restrictions.

- **Key Difference:**  
  The **masking** in the Decoder's self-attention ensures that each position can only attend to previous and current positions, not future ones, which is essential for tasks like language generation where the model should not peek ahead.
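
The effect of the mask can be seen on a small score matrix. The sketch below is a plain-Python illustration under assumed values: the `scores` matrix is hypothetical, and `attention_weights` is a helper written for this example rather than a library function.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row i holds the weights position i assigns to every position j."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = list(scores[i])
        if causal:
            # Future positions (j > i) are set to -inf, so softmax gives them weight 0.
            row = [row[j] if j <= i else float("-inf") for j in range(n)]
        weights.append(softmax(row))
    return weights

# Hypothetical raw attention scores for a 3-token sequence.
scores = [
    [0.5, 1.0, 0.2],
    [0.3, 0.8, 1.5],
    [1.2, 0.1, 0.4],
]

regular = attention_weights(scores, causal=False)  # encoder-style: every position visible
masked = attention_weights(scores, causal=True)    # decoder-style: no peeking ahead

for row in masked:
    print([round(w, 3) for w in row])
```

In the masked result every entry above the diagonal is zero, so the first token can only attend to itself, while the last token sees the whole sequence.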

---

### 6.1. Differences Between Encoder Inputs and Decoder Outputs

- **Encoder Inputs:**
  - **Nature:** The source sequence (e.g., a sentence in the source language).
  - **Processing:** Fully accessible to all Encoder sub-layers; each token can attend to every other token in the sequence.
  
- **Decoder Outputs:**
  - **Nature:** The target sequence being generated (e.g., a sentence in the target language).
  - **Processing:** 
    - **Masked Self-Attention:** Each position can only attend to the current and previous tokens in the target sequence.
    - **Cross-Attention:** Each position can attend to all tokens in the Encoder's output, integrating source information.
    - **Autoregressive Generation:** Each new token is generated based on previously generated tokens and the Encoder's context.

- **Summary of Differences:**
  | Feature                     | Encoder Inputs                      | Decoder Outputs                                |
  |-----------------------------|-------------------------------------|------------------------------------------------|
  | **Access to Sequence**      | Full access to entire input sequence| Limited to past and current tokens (masked)    |
  | **Attention Mechanism**     | Self-Attention (no masking)         | Masked Self-Attention & Cross-Attention        |
  | **Generation Flow**         | Processes input in parallel         | Generates output sequentially (autoregressive) |
  | **Integration with Encoder**| Independent of the Decoder          | Attends to Encoder's output for context        |

---

### 6.2. Flow Overview of Decoder Layer

1. **Masked Multi-Head Self-Attention:**
   - **Input:** Previously generated tokens (shifted right target sequence).
   - **Process:** Allows attention only to past and present tokens.
   - **Output:** Contextualized representation of the target sequence up to the current token.

2. **Add & Norm:**
   - **Input:** Output from Masked Self-Attention.
   - **Process:** Residual connection + Layer Normalization.
   - **Output:** Normalized representation for the next sub-layer.

3. **Multi-Head Cross-Attention:**
   - **Input:** Encoder's output and normalized target representation.
   - **Process:** Integrates information from the source sequence.
   - **Output:** Enhanced representation combining source and target contexts.

4. **Add & Norm:**
   - **Input:** Output from Cross-Attention.
   - **Process:** Residual connection + Layer Normalization.
   - **Output:** Normalized representation for the FFN.

5. **Feed-Forward Network (FFN):**
   - **Input:** Normalized representation from Cross-Attention.
   - **Process:** Two linear transformations with ReLU activation.
   - **Output:** Transformed representation ready for final normalization.

6. **Add & Norm:**
   - **Input:** FFN's output.
   - **Process:** Residual connection + Layer Normalization.
   - **Output:** **Decoder Output**, ready for the final prediction layer.

7. **Output Projection and Generation:**
   - **Input:** Decoder Output.
   - **Process:** Linear transformation + Softmax to generate token probabilities.
   - **Output:** Predicted next token in the sequence.
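
The **Add & Norm** step that recurs three times in the flow above can be sketched for a single token vector. This is a simplified illustration: the input values are made up, and the learned scale and shift parameters that real layer normalization also applies are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization."""
    residual = [a + b for a, b in zip(sublayer_input, sublayer_output)]
    return layer_norm(residual)

x = [1.0, 2.0, 3.0, 4.0]          # hypothetical input to a sub-layer
sub_out = [0.5, -0.5, 0.5, -0.5]  # hypothetical output of that sub-layer

y = add_and_norm(x, sub_out)
print([round(v, 4) for v in y])
```

The residual term lets the original signal bypass the sub-layer, and the normalization keeps the scale of the activations stable from one sub-layer to the next.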

---

### 6.3. Key Differences Between Encoder Inputs and Decoder Outputs

Understanding the distinction between what the Encoder receives and what the Decoder produces is crucial for grasping the Transformer's functionality.

- **Encoder:**
  - **Input:** Entire source sequence (e.g., a sentence in English).
  - **Output:** Contextualized representations of the source sequence, capturing relationships and dependencies between tokens.
  
- **Decoder:**
  - **Input:** Previously generated tokens in the target sequence (shifted right during training) and the Encoder's output.
  - **Output:** Predictions for the next token in the target sequence, based on both the generated tokens so far and the source context.

- **Autoregressive Nature:**
  - **Encoder:** Processes all input tokens simultaneously.
  - **Decoder:** Generates tokens one by one, each time conditioning on the previously generated tokens and the Encoder's output.

- **Interaction:**
  - **Encoder:** Independent of the Decoder.
  - **Decoder:** Relies on the Encoder's output through the Cross-Attention mechanism to inform its token generation.

---

### 6.4. Summary

The **Transformer Decoder** complements the Encoder by generating the output sequence in a manner that is both informed by the source context and conditioned on previously generated tokens. Key components like **Masked Multi-Head Self-Attention** ensure that the generation process remains autoregressive and causally consistent, while **Multi-Head Cross-Attention** integrates valuable information from the Encoder's output. The **Add & Norm** steps maintain stability and facilitate effective gradient flow, enabling the model to learn deep and complex representations.

Understanding the interplay between these components is essential for leveraging the Transformer architecture in tasks such as machine translation, text generation, and more.



# Basic Tips

> Embeddings

## 1. What Are Text Embeddings?

Text embeddings map pieces of text, such as sentences or paragraphs, into vectors in a high-dimensional space. The core idea is to represent words or sentences as points in this space such that semantically similar pieces of text are positioned closer together. This space is often referred to as the **embedding space** or **latent space**.

- Each text element (word, sentence, or document) is converted into a vector of numbers.
- These vectors typically have hundreds or even thousands of dimensions (e.g., 768, 1536, etc.).
  
For example, the sentence "I love programming" might be represented as:

$$
\mathbf{v} = \begin{bmatrix} 0.25 & 0.13 & 0.98 & \dots & 0.51 \end{bmatrix}
$$

where each entry in the vector corresponds to a learned feature in the embedding space.

---

## 2. Tokenization and Embedding Calculation

Text embeddings are typically generated by processing text in smaller units called **tokens**. These tokens are often words, subwords, or characters, depending on the model architecture. Here's the general process:

### Step 1: **Tokenization**

Tokenization is the process of splitting text into smaller units, usually words or subwords. For example:

- "I love programming" → Tokens: ["I", "love", "programming"]

Some tokenizers break down words into subwords to handle out-of-vocabulary (OOV) words, like:

- "unhappiness" → Tokens: ["un", "happiness"]
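
One simple way to obtain such a split is greedy longest-match-first lookup against a subword vocabulary. The sketch below assumes a tiny hand-written `vocab` and a hypothetical helper name; real tokenizers such as WordPiece or BPE learn their vocabularies from a corpus, and WordPiece additionally marks continuation pieces with a `##` prefix, which is omitted here.

```python
def greedy_subword_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: treat the whole word as unknown.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical subword vocabulary, for illustration only.
vocab = {"i", "love", "programming", "un", "happiness", "happy"}

print(greedy_subword_tokenize("unhappiness", vocab))
print(greedy_subword_tokenize("programming", vocab))
print(greedy_subword_tokenize("xyz", vocab))
```

Words that are in the vocabulary stay whole, unseen words are broken into known pieces, and only words with no matching pieces fall back to the unknown token.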

### Step 2: Mapping Tokens to Vectors


The first step in processing a token (word) is to convert it into a **vector representation** called an **embedding**. 

- **Token Embeddings**: Each token is mapped to a high-dimensional vector, represented as **E**.
  
    $$
    \mathbf{E} = W_{\text{emb}}^{\top} \mathbf{x}
    $$
    where:
    - **E** is the token embedding vector (dimension $d$),
    - **$W_{\text{emb}}$** is the embedding matrix (dimension $V \times d$, with $V$ the vocabulary size),
    - **x** is the one-hot vector (dimension $V$) that marks the token's index in the vocabulary.

##### Explanation of Embedding Matrix ($ W_{\text{emb}} $)

To understand embeddings more deeply, let’s break down the **embedding matrix**:

The **embedding matrix** $ W_{\text{emb}} $ is essentially a lookup table that maps each word in the vocabulary to a vector in a high-dimensional space. This matrix has dimensions $ V \times d $, where:
- **V** is the size of the vocabulary (the number of unique words or tokens),
- **d** is the dimensionality of the vector space (embedding size).

Each row of this matrix corresponds to the **embedding** of a word in the vocabulary. If the word "dog" has index $k$ in the vocabulary, its embedding is simply the **$k$-th row** of the matrix, so in practice the multiplication by a one-hot vector is implemented as a row lookup.

##### Example:
Consider a small vocabulary with **V = 4** words: ["cat", "dog", "sat", "mat"], and each word is represented by a 3-dimensional embedding vector (**d = 3**). The embedding matrix might look like this:

$$
W_{\text{emb}} = \begin{bmatrix}
0.1 & 0.3 & 0.5 \\
0.4 & 0.2 & 0.7 \\
0.6 & 0.8 & 0.1 \\
0.9 & 0.5 & 0.2
\end{bmatrix}
$$

If the word "dog" corresponds to index 2 (counting from 1), its embedding vector **E_dog** is the 2nd row:

$$
\mathbf{E}_{\text{dog}} = \begin{bmatrix} 0.4 \\ 0.2 \\ 0.7 \end{bmatrix}
$$
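
The lookup in this example can be checked in a few lines of plain Python. The sketch reuses the matrix and vocabulary from the example; the helper names `one_hot` and `embed_by_matmul` are made up for illustration. Note that Python counts from 0, so "dog" has index 1 in code.

```python
# Embedding matrix from the example above: one row per word, d = 3 columns.
W_emb = [
    [0.1, 0.3, 0.5],  # "cat"
    [0.4, 0.2, 0.7],  # "dog"
    [0.6, 0.8, 0.1],  # "sat"
    [0.9, 0.5, 0.2],  # "mat"
]
vocab = ["cat", "dog", "sat", "mat"]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(x, W):
    """Multiply a one-hot vector with the embedding matrix (selects one row)."""
    d = len(W[0])
    return [sum(x[r] * W[r][c] for r in range(len(W))) for c in range(d)]

idx = vocab.index("dog")  # 0-based index 1, i.e. the 2nd row
by_lookup = W_emb[idx]
by_matmul = embed_by_matmul(one_hot(idx, len(vocab)), W_emb)

print(by_lookup)
print(by_matmul)
```

Both routes return the same vector, which is why embedding layers are implemented as table lookups rather than as explicit matrix products.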

---

#### **Positional Encoding**  
To account for the order of tokens in a sequence, we introduce **positional encodings**. These are vectors that encode the position of each token in the sequence.

The positional encoding for a position $ p $ and dimension $ i $ is defined as:

$$
\text{PE}_{p, 2i} = \sin \left( \frac{p}{n^{2i/d}} \right)
$$
$$
\text{PE}_{p, 2i+1} = \cos \left( \frac{p}{n^{2i/d}} \right)
$$

Where:
- **p** is the position of the token in the sequence (e.g., 1 for the first word),
- **n** is a user-defined scalar, set to 10,000 by the authors of *Attention Is All You Need*,
- **d** is the dimensionality of the embedding space (e.g., 512),
- **i** indexes the sine/cosine pairs (from 0 to $ d/2 - 1 $), so dimensions $ 2i $ and $ 2i+1 $ share the same frequency.

For example, for a sequence of 5 tokens, the positional encoding for the first token (at position $ p=1 $) with dimension $ d=4 $ will be calculated as:

- $ \text{PE}_{1, 0} = \sin\left(\frac{1}{10000^{0/4}}\right) $ (pair $ i=0 $)
- $ \text{PE}_{1, 1} = \cos\left(\frac{1}{10000^{0/4}}\right) $ (pair $ i=0 $)
- $ \text{PE}_{1, 2} = \sin\left(\frac{1}{10000^{2/4}}\right) $ (pair $ i=1 $)
- $ \text{PE}_{1, 3} = \cos\left(\frac{1}{10000^{2/4}}\right) $ (pair $ i=1 $)

These values are added to the token embeddings to form the final input for the model:

$$
\mathbf{X}_{\text{final}} = \mathbf{E} + \mathbf{PE}
$$
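
The two sinusoidal formulas above translate directly into code. The following is a minimal plain-Python sketch; the function name `positional_encoding` and the 4-dimensional token embedding `E_tok` are assumptions introduced for the example.

```python
import math

def positional_encoding(p, d, n=10000.0):
    """Sinusoidal encoding of position p for an embedding of even size d."""
    pe = []
    for k in range(d):
        i = k // 2  # dimensions 2i and 2i+1 share the same frequency
        angle = p / (n ** (2 * i / d))
        # Even dimensions use sine, odd dimensions use cosine.
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

pe_1 = positional_encoding(p=1, d=4)

# Hypothetical 4-dimensional token embedding, for illustration only.
E_tok = [0.4, 0.2, 0.7, 0.1]
x_final = [e + pe for e, pe in zip(E_tok, pe_1)]

print([round(v, 4) for v in pe_1])
print([round(v, 4) for v in x_final])
```

Because the encoding depends only on the position and the dimension, it can be computed once for all positions up to the maximum sequence length and reused for every input.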

---


## Attention Mechanism in Transformers: Detailed Mathematical Explanation

In Transformer models like **BERT** and **GPT**, the **self-attention mechanism** allows each token in a sequence to attend to other tokens dynamically (all tokens in BERT, only the current and earlier tokens in GPT), adjusting its representation based on the context of the sentence. Below is the breakdown of key components of attention and their mathematical formulation.

### 1. **Query, Key, and Value Vectors (Q, K, V)**

In the context of attention, each token is projected into three different vectors: **Query (Q)**, **Key (K)**, and **Value (V)**. These vectors help compute which tokens are relevant to one another based on their interactions.

#### 1.1. **Query (Q)**, **Key (K)**, and **Value (V)** Definition:

- **Query (Q)**: Represents a question or request for information. For each token, a query vector is created that defines what information it is looking for in the other tokens.
- **Key (K)**: Represents the context or feature of each token. It's used to determine if the token is relevant to the current query.
- **Value (V)**: Represents the actual information that will be passed through if the token is relevant to the query.

#### 1.2. **Mathematical Formulation of Q, K, V:**

Each token's embedding $ t_i $ is projected into the query, key, and value vectors using learned weight matrices:

$$
Q_i = W_Q \cdot t_i \quad \text{(Query vector for token \( t_i \))}
$$
$$
K_i = W_K \cdot t_i \quad \text{(Key vector for token \( t_i \))}
$$
$$
V_i = W_V \cdot t_i \quad \text{(Value vector for token \( t_i \))}
$$

Where:
- $ W_Q, W_K, W_V $ are learned weight matrices,
- $ t_i $ is the embedding of the $ i $-th token.


### 2. **Attention Score Calculation**

To determine the relevance of each token in the sequence to a given token (based on the query and key), we calculate an **attention score** using the **dot product** between the query vector of token $ t_i $ and the key vector of token $ t_j $.

#### 2.1. **Attention Score Formula:**

$$
\text{Attention\_Score}_{ij} = \frac{Q_i \cdot K_j}{\sqrt{d_k}}
$$

Where:
- $ d_k $ is the dimensionality of the key vectors; dividing by $ \sqrt{d_k} $ keeps the scores from growing with the dimension, which stabilizes gradients,
- $ Q_i $ is the query vector of token $ t_i $,
- $ K_j $ is the key vector of token $ t_j $.

The dot product between the query and key determines the **similarity** or **relevance** between two tokens.


### 3. **Attention Weights Calculation**

Once the attention scores are computed, we apply the **softmax** function to convert these scores into **attention weights**. The softmax function normalizes the attention scores to ensure that the sum of the attention weights across all tokens is 1.

#### 3.1. **Attention Weight Formula:**

$$
\alpha_{ij} = \frac{\exp(\text{Attention\_Score}_{ij})}{\sum_{k=1}^{n} \exp(\text{Attention\_Score}_{ik})}
$$

Where:
- $ \alpha_{ij} $ is the attention weight for the pair of tokens $ (t_i, t_j) $,
- The denominator sums over all tokens $ k $ in the sequence, ensuring the weights are normalized.

This step determines how much attention token $ t_i $ should give to token $ t_j $.


### 4. **Contextualized Embedding Calculation**

The final step is to compute the **contextualized embedding** for each token. This is done by taking a weighted sum of the **value vectors** based on the attention weights.

#### 4.1. **Contextualized Embedding Formula:**

$$
\mathbf{t}_i^\text{contextualized} = \sum_{j=1}^{n} \alpha_{ij} \cdot V_j
$$

Where:
- $ \mathbf{t}_i^\text{contextualized} $ is the updated embedding for token $ t_i $ after attending to all other tokens,
- $ \alpha_{ij} $ is the attention weight for token $ t_j $ as calculated earlier,
- $ V_j $ is the value vector for token $ t_j $.

This gives us a new representation for each token, which now incorporates contextual information from the entire sequence.
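
Steps 2 to 4 (scores, weights, weighted sum) fit into one short function. The sketch below is plain Python and assumes the query, key, and value vectors have already been computed; the `Q`, `K`, and `V` values are hypothetical, and the learned projections from step 1 are skipped.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention; returns (contextualized vectors, weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Step 2: scaled scores of this query against every key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        # Step 3: softmax turns the scores into weights that sum to 1.
        alpha = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        context = [sum(a * v[c] for a, v in zip(alpha, V)) for c in range(len(V[0]))]
        weights.append(alpha)
        outputs.append(context)
    return outputs, weights

# Hypothetical vectors for a 3-token sequence with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, alpha = self_attention(Q, K, V)
for row in out:
    print([round(v, 3) for v in row])
```

Each output row is a weighted average of the value vectors, so it always lies between the smallest and largest values in each column of `V`.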


### 5. **Multi-Head Attention**

Instead of performing attention only once, **multi-head attention** performs attention multiple times in parallel. Each "head" operates on the input using different learned weight matrices, allowing the model to focus on different aspects of the sequence simultaneously.

#### 5.1. **Multi-Head Attention Formula:**

For each head $ i $, attention is calculated as:

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

Where:
- $ W_i^Q, W_i^K, W_i^V $ are the weight matrices for the $ i $-th attention head.

After computing attention for all heads, the outputs are concatenated and passed through a final linear transformation:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O
$$

Where:
- $ h $ is the number of attention heads,
- $ W^O $ is the output weight matrix.

This allows the model to capture a rich set of relationships in the input sequence, each head focusing on different features or aspects of the data.
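
The two formulas above can be sketched for self-attention, where the queries, keys, and values all come from the same input `X`. This is an illustrative plain-Python version with randomly initialized (untrained) weights; the sizes, the `random_matrix` helper, and the variable names are assumptions for the example.

```python
import math
import random

def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over whole matrices."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_O):
    """Run every head on X, concatenate per token, then apply W_O."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n_tokens, d_model, num_heads = 3, 4, 2  # hypothetical sizes
d_head = d_model // num_heads

X = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(num_heads)]
W_O = random_matrix(num_heads * d_head, d_model, rng)

out = multi_head(X, heads, W_O)
print(len(out), "tokens x", len(out[0]), "dimensions")
```

Each head works in a smaller subspace of size `d_model // num_heads`, so the concatenated result has the same width as the input and the total cost stays close to that of a single full-width head.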

---

### Summary of Key Components

- **Query (Q)**: A vector representing the information we’re looking for.  $ Q_i = W_Q t_i$
- **Key (K)**: A vector representing the context of each token. $\quad K_i = W_K t_i$
- **Value (V)**: A vector containing the actual information that will be passed on if the token matches the query. $\quad V_i = W_V t_i $
- **Attention Score**: The similarity between the query and the key, used to determine relevance. $ \text{Attention\_Score}_{ij} = \frac{Q_i \cdot K_j}{\sqrt{d_k}} $
- **Attention Weight**: Normalized attention score using softmax, which decides how much influence each token has on others.
  $ \alpha_{ij} = \frac{\exp(\text{Attention\_Score}_{ij})}{\sum_{k=1}^{n} \exp(\text{Attention\_Score}_{ik})} $
- **Contextualized Embedding**: The weighted sum of the value vectors, giving each token a new representation based on its context in the sequence. $ \mathbf{t}_i^\text{contextualized} = \sum_{j=1}^{n} \alpha_{ij} V_j $
- **Multi-Head Attention**: Performs multiple attention operations in parallel, allowing the model to focus on different aspects of the sequence simultaneously.

---

## 3. Distance and Similarity in Embedding Space

Once we have our tokens represented as vectors, we can use these embeddings for various tasks, such as measuring text similarity. A key feature of embedding spaces is that semantically similar items are placed closer together. 

Two common distance metrics are:

### a. Euclidean Distance

Euclidean distance is the straight-line distance between two vectors in the embedding space. For two vectors $\mathbf{v}_1 = [v_{1,1}, v_{1,2}, \dots, v_{1,n}]$ and $\mathbf{v}_2 = [v_{2,1}, v_{2,2}, \dots, v_{2,n}]$, the Euclidean distance $d(\mathbf{v}_1, \mathbf{v}_2)$ is given by:

$$
d(\mathbf{v}_1, \mathbf{v}_2) = \sqrt{\sum_{i=1}^{n} (v_{1,i} - v_{2,i})^2}
$$

### b. Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors. It ranges from $-1$ (pointing in opposite directions) to $1$ (pointing in the same direction). The formula for cosine similarity is:

$$
\text{cosine\_similarity}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{||\mathbf{v}_1|| ||\mathbf{v}_2||}
$$

where:
- $\mathbf{v}_1 \cdot \mathbf{v}_2$ is the dot product of the vectors.
- $||\mathbf{v}_1||$ and $||\mathbf{v}_2||$ are the magnitudes (norms) of the vectors.

Cosine similarity works well when we care about the **relative weights** of features (terms) rather than their absolute magnitudes. It is widely used in NLP for comparing text.
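
The difference between the two metrics shows up when one vector is a scaled copy of another. The sketch below implements both formulas in plain Python; the vectors `a`, `b`, and `c` are made-up values chosen to illustrate the point.

```python
import math

def euclidean_distance(v1, v2):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    return dot / (norm1 * norm2)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, twice the length
c = [-1.0, -2.0, -3.0]  # opposite direction to a

print("euclidean(a, b):", round(euclidean_distance(a, b), 4))
print("cosine(a, b):   ", round(cosine_similarity(a, b), 4))
print("cosine(a, c):   ", round(cosine_similarity(a, c), 4))
```

Euclidean distance treats `a` and `b` as far apart because their lengths differ, whereas cosine similarity treats them as identical in direction, which is the behavior described above.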

---

## 4. Example: Text Embedding in Action

Let’s say we have the following three sentences:

1. "I love programming."
2. "I enjoy coding."
3. "I dislike programming."

Assume the embeddings of these sentences (in a 3-dimensional space) are as follows:

$$
\mathbf{v_1} = \begin{bmatrix} 0.2 & 0.4 & 0.9 \end{bmatrix}, \quad \mathbf{v_2} = \begin{bmatrix} 0.1 & 0.5 & 0.8 \end{bmatrix}, \quad \mathbf{v_3} = \begin{bmatrix} 0.7 & 0.2 & 0.1 \end{bmatrix}
$$

Now, we compute cosine similarity between:

- **Sentence 1 and Sentence 2**:

$$
\text{cosine\_similarity}(\mathbf{v_1}, \mathbf{v_2}) = \frac{(0.2)(0.1) + (0.4)(0.5) + (0.9)(0.8)}{\sqrt{(0.2^2 + 0.4^2 + 0.9^2)} \cdot \sqrt{(0.1^2 + 0.5^2 + 0.8^2)}}
$$

$$
= \frac{0.02 + 0.2 + 0.72}{\sqrt{0.04 + 0.16 + 0.81} \cdot \sqrt{0.01 + 0.25 + 0.64}} = \frac{0.94}{\sqrt{1.01} \cdot \sqrt{0.9}}
$$

$$
\approx \frac{0.94}{1.0050 \cdot 0.9487} \approx \frac{0.94}{0.9534} \approx 0.986
$$

This indicates a **high similarity** between "I love programming" and "I enjoy coding."

- **Sentence 1 and Sentence 3**:

Now, calculate the cosine similarity between the first and third sentences:

$$
\text{cosine\_similarity}(\mathbf{v_1}, \mathbf{v_3}) = \frac{(0.2)(0.7) + (0.4)(0.2) + (0.9)(0.1)}{\sqrt{(0.2^2 + 0.4^2 + 0.9^2)} \cdot \sqrt{(0.7^2 + 0.2^2 + 0.1^2)}}
$$

$$
= \frac{0.14 + 0.08 + 0.09}{\sqrt{0.04 + 0.16 + 0.81} \cdot \sqrt{0.49 + 0.04 + 0.01}} = \frac{0.31}{\sqrt{1.01} \cdot \sqrt{0.54}}
$$

$$
\approx \frac{0.31}{1.0050 \cdot 0.7348} \approx \frac{0.31}{0.7385} \approx 0.420
$$

This indicates a **much lower similarity** between "I love programming" and "I dislike programming."

---

## 5. Conclusion

Text embeddings are a powerful way to represent text data numerically, allowing machines to perform tasks such as text classification, sentiment analysis, and semantic search. Through the use of distance metrics like **Euclidean distance** and **Cosine similarity**, we can quantify how similar two pieces of text are and apply this to a variety of NLP applications. Understanding embeddings and how they map text into high-dimensional spaces is crucial for building more effective and intuitive machine learning models.


> Text Embeddings in Code

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import numpy as np

# Step 1: Generate a simple random corpus from a fixed word list
words = ["apple", "banana", "cherry", "dog", "elephant", "fish", "grape", "house", "ice", "jungle", "kite", "lemon"]
num_sentences = 5
max_sentence_length = 6

# Generate random sentences of 3 to max_sentence_length words
corpus = []
for _ in range(num_sentences):
    # np.random.randint excludes the upper bound, hence the + 1
    sentence_length = np.random.randint(3, max_sentence_length + 1)
    sentence = np.random.choice(words, sentence_length, replace=True)
    corpus.append(sentence)

# Step 2: Create a token-to-index dictionary for words in the corpus
word_to_index = {word: idx for idx, word in enumerate(words)}

# Step 3: Convert words in the corpus to indices using the dictionary
corpus_indices = [[word_to_index[word] for word in sentence] for sentence in corpus]

# Step 4: Define an embedding layer
embedding_dim = 512
vocab_size = len(words)
token_embedding_table = nn.Embedding(vocab_size, embedding_dim)

# Step 5: Pass the indices through the embedding layer
embeddings = [token_embedding_table(torch.tensor(sentence)) for sentence in corpus_indices]

# Print the embeddings for each word in the corpus
print("\nWord Embeddings for the Corpus:")
for i, sentence_embedding in enumerate(embeddings):
    print(f"Sentence {i + 1}:")
    for word_idx, word_embedding in zip(corpus_indices[i], sentence_embedding):
        print(f"Word '{words[word_idx]}': {word_embedding[:5]}...")  # Print first 5 elements of each embedding
    print()  # New line between sentences


In [32]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super(BigramLanguageModel, self).__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        # Fetch logits (predicted scores)
        logits = self.token_embedding_table(index)

        # Compute loss if targets are provided
        loss = None
        if targets is not None:
            # Reshape logits and targets for cross-entropy
            B, T, C = logits.shape  # Batch size, sequence length (time steps), vocab size (channels)
            logits = logits.view(B * T, C)  
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, index, max_new_tokens):
        # Generate a sequence of tokens
        for _ in range(max_new_tokens):
            # Predict next token based on current sequence
            logits, _ = self.forward(index)
            logits = logits[:, -1, :]  # Focus on the last time step
            probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities

            # Sample next token from the probability distribution
            index_next = torch.multinomial(probs, num_samples=1)    # Sample one token
            
            # Append the new token to the sequence
            index = torch.cat((index, index_next), dim=1)
        
        return index

# Instantiate and run the model
vocab_size = 100  # Example vocab size
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = BigramLanguageModel(vocab_size).to(device)

# Example context and generation
context = torch.zeros((1, 1), dtype=torch.long, device=device)  # Initial context
generated_tokens = model.generate(context, max_new_tokens=500)
# print(generated_tokens)


## Word2Vec

**Word2Vec** is a neural network-based technique used to generate vector representations (embeddings) of words in a corpus of text. The key idea behind Word2Vec is to transform words into dense, continuous vectors in a high-dimensional space, where words with similar meanings are located closer to each other. These word embeddings capture semantic relationships between words, such as synonyms, antonyms, and analogies (e.g., "king" - "man" + "woman" = "queen"). Word2Vec achieves this by training a model to predict the surrounding words (context) given a target word, or vice versa, depending on the chosen model. The two main architectures for Word2Vec are the **Skip-gram model** and the **Continuous Bag of Words (CBOW) model**.

In the **Skip-gram** model, the goal is to predict the context words given a target word. For example, in the sentence "The cat sat on the mat", if "sat" is the target word and the context window is 2, the model tries to predict the surrounding words "The", "cat", "on", and "the" ("mat" lies outside the window). The model learns the vector representation for each word by adjusting the embeddings so that words occurring in similar contexts (like "cat" and "dog") have similar vectors. This process is repeated over a large corpus of text, allowing the embeddings to capture semantic information based on word co-occurrences. Word2Vec is efficient because it uses a shallow neural network with a simple architecture that can be trained quickly on large datasets. The result is a set of word vectors that can be used in various downstream NLP tasks, such as sentiment analysis, machine translation, and information retrieval.
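
How skip-gram turns a sentence into training examples can be sketched without any framework. The helper below is a hypothetical illustration in plain Python; it uses a context window of 2 on each side, which is an assumption of this example, so for the target "sat" the word "mat" falls outside the window.

```python
def skipgram_pairs(tokens, window_size):
    """Return (target, context) pairs for every token and its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(i - window_size, 0)
        end = min(i + window_size + 1, len(tokens))
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window_size=2)

# Context words the model would be asked to predict for the target "sat".
sat_context = [context for target, context in pairs if target == "sat"]

print(sat_context)
print(len(pairs), "training pairs in total")
```

A larger window produces more pairs per target and tends to capture broader topical similarity, while a smaller window emphasizes close syntactic neighbors.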


In [43]:
training_pairs[15]

(4, 2)

In [47]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader

# Step 1: Prepare the data
text = "The quick brown fox jumps over the lazy dog"

# Tokenize the text and build a vocabulary
tokens = text.lower().split()
vocab = Counter(tokens)  # Count frequency of each word
vocab_size = len(vocab)

# Map words to indices and vice versa
word_to_index = {word: idx for idx, (word, _) in enumerate(vocab.items())}
index_to_word = {idx: word for word, idx in word_to_index.items()}

# Hyperparameters
embedding_dim = 10  # Size of the word embeddings
window_size = 2  # Context window size (skip-gram)

# Create training pairs (target, context)
training_pairs = []
for i, word in enumerate(tokens):
    target = word_to_index[word]
    # Get the context window around the target word
    context = list(set(tokens[max(i - window_size, 0):i] + tokens[i + 1:i + window_size + 1]))
    context = [word_to_index[context_word] for context_word in context if context_word != word]
    
    for context_word in context:
        training_pairs.append((target, context_word))

# Step 2: Define the Word2Vec Model (Skip-gram)
class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        self.in_embeddings = nn.Embedding(vocab_size, embedding_dim)  # Input embedding layer
        self.out_embeddings = nn.Embedding(vocab_size, embedding_dim)  # Output embedding layer
    
    def forward(self, target, context):
        # Get embeddings for target and context words
        in_emb = self.in_embeddings(target)  # (B, E)
        out_emb = self.out_embeddings(context)  # (B, E)
        
        # Compute logits (dot product between target and context word embeddings)
        return torch.sum(in_emb * out_emb, dim=1)

# Step 3: Create Dataset and DataLoader
class Word2VecDataset(Dataset):
    def __init__(self, pairs):
        # Convert pairs to tensors
        self.targets = torch.tensor([pair[0] for pair in pairs])
        self.contexts = torch.tensor([pair[1] for pair in pairs])
    
    def __len__(self):
        return len(self.targets)
    
    def __getitem__(self, idx):
        return self.targets[idx], self.contexts[idx]

# Step 4: Train the model
dataset = Word2VecDataset(training_pairs)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

model = Word2Vec(vocab_size, embedding_dim)
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()  # Binary cross-entropy loss

# Training loop
epochs = 100
for epoch in range(epochs):
    total_loss = 0
    for target, context in dataloader:
        optimizer.zero_grad()
        
        # Compute model output
        output = model(target, context)
        
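        # Create positive labels (1 for correct context).
        # Note: this demo trains on positive pairs only. Skip-gram with negative sampling
        # also feeds random (target, non-context) pairs with label 0; without them the loss
        # can be driven toward zero simply by making every dot product large.
        labels = torch.ones_like(output, dtype=torch.float)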
        
        # Calculate the loss 
        loss = loss_fn(output, labels)
        total_loss += loss.item()
        
        # Backpropagate and optimize
        loss.backward()
        optimizer.step()
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {total_loss/len(dataloader)}")

# Step 5: Check the word embeddings learned by the model
word_embeddings = model.in_embeddings.weight.data.numpy()

# Show word embeddings for the first 5 words
for idx in range(min(5, vocab_size)):
    word = index_to_word[idx]
    embedding = word_embeddings[idx]
    print(f"Word: {word}, Embedding: {embedding}")

Epoch [10/100], Loss: 0.21508773763974506
Epoch [20/100], Loss: 0.07223054816325505
Epoch [30/100], Loss: 0.038460445155700046
Epoch [40/100], Loss: 0.025005883164703847
Epoch [50/100], Loss: 0.01808421897391478
Epoch [60/100], Loss: 0.013966864347457886
Epoch [70/100], Loss: 0.011278403705606859
Epoch [80/100], Loss: 0.00939424525325497
Epoch [90/100], Loss: 0.008014310896396638
Epoch [100/100], Loss: 0.006963123753666878
Word: the, Embedding: [ 0.65964425 -1.0634614  -0.07938081 -1.0143552  -0.42179868  1.3326547
 -0.32117814 -1.9897895  -1.6684294  -0.01161219]
Word: quick, Embedding: [ 2.8160932  -0.24703808  0.58954406 -0.66629213 -0.77631253 -0.5413049
 -1.2220969  -1.0079648  -1.1804886  -1.2120961 ]
Word: brown, Embedding: [ 1.0991999  -1.2361306  -0.11001078 -0.96111774 -0.1683051   0.40376255
 -1.2056465  -1.2161037  -1.1382229   1.7400821 ]
Word: fox, Embedding: [ 0.29302096 -0.86034954 -1.805973    0.1150915  -0.595443   -0.59565365
 -0.04477623 -1.3482325  -0.30772996 -0.9

> LLM in code

In [67]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the GPT-like Model with Embedding Layers
class SimpleGPT(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, num_layers, num_classes, max_len=512):
        super(SimpleGPT, self).__init__()
        
        # Embedding layer: Map tokens to embedding vectors
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        # Positional encoding (used to add positions to the embeddings)
        self.positional_encoding = nn.Parameter(torch.zeros(1, max_len, embed_size))
        
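        # Transformer layer. Note: nn.Transformer is an encoder-decoder stack, whereas
        # GPT is decoder-only; it is used here only to keep the demo short.
        self.transformer = nn.Transformer(
            d_model=embed_size,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=embed_size*4,
            batch_first=True  # inputs are shaped (batch, seq_len, embed_size)
        )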
        
        # Output layer for classification or next-token prediction
        self.fc_out = nn.Linear(embed_size, num_classes)
        
    def forward(self, x):
        # Get token embeddings
        embeddings = self.embedding(x)
        
        # Add positional encoding to embeddings
        seq_len = x.size(1)
        embeddings = embeddings + self.positional_encoding[:, :seq_len, :]
        
        # Pass through the transformer
        transformer_out = self.transformer(embeddings, embeddings)
        
        # Use the last token's representation for classification or next token prediction
        output = self.fc_out(transformer_out[:, -1, :])
        
        return output


# Hyperparameters
vocab_size = 5000  # Example vocabulary size
embed_size = 256  # Embedding dimension
num_heads = 8     # Number of attention heads
num_layers = 6    # Number of transformer layers
num_classes = vocab_size  # For next-token prediction (same size as vocab)
max_len = 512     # Max sequence length
batch_size = 32   # Batch size
learning_rate = 1e-4

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Instantiate the model
model = SimpleGPT(vocab_size, embed_size, num_heads, num_layers, num_classes, max_len)
model.to(device)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
def train(model, train_loader, criterion, optimizer, num_epochs=5):
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0
        for batch in train_loader:
            inputs, targets = batch
            inputs, targets = inputs.to(device), targets.to(device)  # Move data to device
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss / len(train_loader)}")

# Sample DataLoader for training (replace with actual data)
from torch.utils.data import DataLoader, TensorDataset

# Simulated training data (replace with actual tokenized text)
train_data = torch.randint(0, vocab_size, (1000, max_len))  # 1000 sequences
train_labels = torch.randint(0, vocab_size, (1000,))  # Corresponding next token (or class)

# DataLoader
train_dataset = TensorDataset(train_data, train_labels)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Train the model
train(model, train_loader, criterion, optimizer)

# Inference function for a sample text
def infer(model, text, vocab, max_len=512):
    model.to(device)
    model.eval()
    
    # Tokenize input text (basic tokenizer for demo purposes)
    tokens = torch.tensor([vocab.get(word, 0) for word in text.split()]).unsqueeze(0).to(device)  # Adding batch dimension and moving to device
    tokens = tokens[:, :max_len]  # Ensure the sequence length is within max_len
    
    # Get the model's prediction
    with torch.no_grad():
        output = model(tokens)
    
    # Convert output to probabilities (for next token prediction)
    prob = torch.softmax(output, dim=-1)
    
    # Get the predicted token (highest probability)
    predicted_token_idx = torch.argmax(prob, dim=-1).item()
    
    # Decode the token back to text (inverse vocab lookup)
    reverse_vocab = {v: k for k, v in vocab.items()}
    predicted_word = reverse_vocab.get(predicted_token_idx, "<UNK>")
    
    return predicted_word

# Sample Vocab (just for illustration)
vocab = {str(i): i for i in range(vocab_size)}  # Just a simple numerical vocabulary, replace with actual vocab

# Sample input text
sample_text = "this is a sample sentence"

# Perform inference
predicted_word = infer(model, sample_text, vocab)
print(f"Predicted next word: {predicted_word}")


Epoch [1/5], Loss: 8.687405735254288
Epoch [2/5], Loss: 8.092840701341629
Epoch [3/5], Loss: 7.657473549246788
Epoch [4/5], Loss: 7.372279062867165
Epoch [5/5], Loss: 7.210036337375641
Predicted next word: 3797



# Worked Example of Attention with a Toy Corpus

We’ve explained the mathematics behind single-head and multi-head attention. Now, let’s apply these steps to a simple, concrete example. This will demonstrate how the calculations flow from inputs all the way through to the final attention output.

## Toy Setup

### Assumptions and Simplifications

- **Vocabulary and Embeddings:**  
  We have a small vocabulary with three tokens:  
  1. "The"
  2. "cat"
  3. "sat"

  We assume we already have embeddings for these tokens. Let's say `d_model = 4` for simplicity. Thus, each token is represented by a 4-dimensional vector. For demonstration:
  - "The" → $[0.1, 0.3, 0.5, 0.7]$
  - "cat" → $[0.2, 0.4, 0.4, 0.6]$
  - "sat" → $[0.15, 0.25, 0.5, 0.1]$

- **Input Sequence:**  
  Suppose our input sequence is: "The cat sat"

So we have 3 tokens, hence $n=3$.

- **Positional Encodings:**  
For simplicity, we use a toy positional encoding instead of a realistic (e.g., sinusoidal or learned) one. It adds a small, unique offset at each token position:
- Position 1 encoding: $[0.0, 0.1, 0.0, 0.0]$
- Position 2 encoding: $[0.0, 0.0, 0.1, 0.0]$
- Position 3 encoding: $[0.0, 0.0, 0.0, 0.1]$

After adding positional encodings:
- For "The" (position 1): $[0.1, 0.3+0.1, 0.5, 0.7] = [0.1, 0.4, 0.5, 0.7]$
- For "cat" (position 2): $[0.2, 0.4, 0.4+0.1, 0.6] = [0.2, 0.4, 0.5, 0.6]$
- For "sat" (position 3): $[0.15, 0.25, 0.5, 0.1+0.1] = [0.15, 0.25, 0.5, 0.2]$

Thus, our input matrix $X$ is:
$$
X = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\[6pt]
  0.2 & 0.4 & 0.5 & 0.6 \\[6pt]
  0.15 & 0.25 & 0.5 & 0.2
\end{bmatrix}
$$
where each row is a token embedding with positional info.
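
The construction of $X$ can be sketched in a few lines of plain Python. This is a minimal illustration that only uses the toy embeddings and offsets listed above; the variable names (`embeddings`, `positions`, `tokens`) are chosen here for illustration and do not come from any library.

```python
# Toy token embeddings from the text (d_model = 4).
embeddings = {
    "The": [0.1, 0.3, 0.5, 0.7],
    "cat": [0.2, 0.4, 0.4, 0.6],
    "sat": [0.15, 0.25, 0.5, 0.1],
}

# Toy positional encodings: one small offset per position.
positions = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.1],
]

tokens = ["The", "cat", "sat"]

# Row i of X = embedding of token i + encoding of position i (element-wise).
# Rounding only hides floating-point noise in the printed values.
X = [
    [round(e + p, 4) for e, p in zip(embeddings[tok], pos)]
    for tok, pos in zip(tokens, positions)
]

for tok, row in zip(tokens, X):
    print(tok, row)
```

Because the offset depends only on the position, the same word at a different position would get a different row in $X$.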

- **Parameters (W_Q, W_K, W_V):**  
Let’s define:
- $d_{\text{model}} = 4$
- Single-head attention: $d_k = d_{\text{model}} = 4$

Example parameter matrices:
$$
W_Q = \begin{bmatrix}
0.5 & 0.1 & 0.0 & 0.3 \\[6pt]
0.4 & 0.2 & 0.1 & 0.0 \\[6pt]
0.3 & 0.3 & 0.3 & 0.3 \\[6pt]
0.2 & 0.1 & 0.5 & 0.4
\end{bmatrix}, \quad
W_K = \begin{bmatrix}
0.1 & 0.4 & 0.0 & 0.0 \\[6pt]
0.0 & 0.5 & 0.2 & 0.1 \\[6pt]
0.3 & 0.0 & 0.3 & 0.3 \\[6pt]
0.2 & 0.2 & 0.1 & 0.0
\end{bmatrix}, \quad
W_V = \begin{bmatrix}
0.2 & 0.1 & 0.0 & 0.0 \\[6pt]
0.0 & 0.3 & 0.5 & 0.0 \\[6pt]
0.1 & 0.1 & 0.1 & 0.4 \\[6pt]
0.0 & 0.2 & 0.1 & 0.1
\end{bmatrix}
$$

### Step-by-Step Computation

#### 1. Compute Q, K, V

- $Q = XW_Q$:
Let's multiply $X$ (3x4) by $W_Q$ (4x4):

For the first token "The":
$$
Q_1 = [0.1, 0.4, 0.5, 0.7]
\begin{bmatrix}
  0.5 & 0.1 & 0.0 & 0.3 \\
  0.4 & 0.2 & 0.1 & 0.0 \\
  0.3 & 0.3 & 0.3 & 0.3 \\
  0.2 & 0.1 & 0.5 & 0.4
\end{bmatrix}
$$

Compute row-by-column:
- $Q_1[1] = 0.1*0.5 + 0.4*0.4 + 0.5*0.3 + 0.7*0.2 = 0.05 + 0.16 + 0.15 + 0.14 = 0.5$
- $Q_1[2] = 0.1*0.1 + 0.4*0.2 + 0.5*0.3 + 0.7*0.1 = 0.01 + 0.08 + 0.15 + 0.07 = 0.31$
- $Q_1[3] = 0.1*0.0 + 0.4*0.1 + 0.5*0.3 + 0.7*0.5 = 0 + 0.04 + 0.15 + 0.35 = 0.54$
- $Q_1[4] = 0.1*0.3 + 0.4*0.0 + 0.5*0.3 + 0.7*0.4 = 0.03 + 0 + 0.15 + 0.28 = 0.46$

So $Q_1 = [0.5, 0.31, 0.54, 0.46]$.

For the second token "cat":
$$
Q_2 = [0.2, 0.4, 0.5, 0.6]W_Q
$$
- $Q_2[1] = 0.2*0.5 + 0.4*0.4 + 0.5*0.3 + 0.6*0.2 = 0.1 + 0.16 + 0.15 + 0.12 = 0.53$
- $Q_2[2] = 0.2*0.1 + 0.4*0.2 + 0.5*0.3 + 0.6*0.1 = 0.02 + 0.08 + 0.15 + 0.06 = 0.31$
- $Q_2[3] = 0.2*0.0 + 0.4*0.1 + 0.5*0.3 + 0.6*0.5 = 0 + 0.04 + 0.15 + 0.3 = 0.49$
- $Q_2[4] = 0.2*0.3 + 0.4*0.0 + 0.5*0.3 + 0.6*0.4 = 0.06 + 0 + 0.15 + 0.24 = 0.45$

So $Q_2 = [0.53, 0.31, 0.49, 0.45]$.

For the third token "sat":
$$
Q_3 = [0.15, 0.25, 0.5, 0.2]W_Q
$$
- $Q_3[1] = 0.15*0.5 + 0.25*0.4 + 0.5*0.3 + 0.2*0.2 = 0.075 + 0.1 + 0.15 + 0.04 = 0.365$
- $Q_3[2] = 0.15*0.1 + 0.25*0.2 + 0.5*0.3 + 0.2*0.1 = 0.015 + 0.05 + 0.15 + 0.02 = 0.235$
- $Q_3[3] = 0.15*0.0 + 0.25*0.1 + 0.5*0.3 + 0.2*0.5 = 0 + 0.025 + 0.15 + 0.1 = 0.275$
- $Q_3[4] = 0.15*0.3 + 0.25*0.0 + 0.5*0.3 + 0.2*0.4 = 0.045 + 0 + 0.15 + 0.08 = 0.275$

So $Q_3 = [0.365, 0.235, 0.275, 0.275]$.

Thus:
$$
Q = \begin{bmatrix}
  0.5 & 0.31 & 0.54 & 0.46 \\[4pt]
  0.53 & 0.31 & 0.49 & 0.45 \\[4pt]
  0.365 & 0.235 & 0.275 & 0.275
\end{bmatrix}
$$

- $K = XW_K$:
Similarly, multiply $X$ by $W_K$.

For "The":
$$
K_1 = [0.1,0.4,0.5,0.7]W_K
$$
Compute:
- $K_1[1] = 0.1*0.1 + 0.4*0.0 + 0.5*0.3 + 0.7*0.2 = 0.01+0+0.15+0.14=0.3$
- $K_1[2] = 0.1*0.4 + 0.4*0.5 + 0.5*0.0 + 0.7*0.2 = 0.04+0.2+0+0.14=0.38$
- $K_1[3] = 0.1*0.0 + 0.4*0.2 + 0.5*0.3 + 0.7*0.1 = 0+0.08+0.15+0.07=0.3$
- $K_1[4] = 0.1*0.0 + 0.4*0.1 + 0.5*0.3 + 0.7*0.0 = 0+0.04+0.15+0=0.19$

$K_1 = [0.3, 0.38, 0.3, 0.19]$.

For "cat":
$$
K_2 = [0.2,0.4,0.5,0.6]W_K
$$
- $K_2[1] = 0.2*0.1 + 0.4*0.0 + 0.5*0.3 + 0.6*0.2 = 0.02+0+0.15+0.12=0.29$
- $K_2[2] = 0.2*0.4 + 0.4*0.5 + 0.5*0.0 + 0.6*0.2 = 0.08+0.2+0+0.12=0.4$
- $K_2[3] = 0.2*0.0 + 0.4*0.2 + 0.5*0.3 + 0.6*0.1 = 0+0.08+0.15+0.06=0.29$
- $K_2[4] = 0.2*0.0 + 0.4*0.1 + 0.5*0.3 + 0.6*0.0 = 0+0.04+0.15+0=0.19$

$K_2 = [0.29, 0.4, 0.29, 0.19]$.

For "sat":
$$
K_3 = [0.15,0.25,0.5,0.2]W_K
$$
- $K_3[1] = 0.15*0.1 + 0.25*0.0 + 0.5*0.3 + 0.2*0.2 = 0.015+0+0.15+0.04=0.205$
- $K_3[2] = 0.15*0.4 + 0.25*0.5 + 0.5*0.0 + 0.2*0.2 = 0.06+0.125+0+0.04=0.225$
- $K_3[3] = 0.15*0.0 + 0.25*0.2 + 0.5*0.3 + 0.2*0.1 = 0+0.05+0.15+0.02=0.22$
- $K_3[4] = 0.15*0.0 + 0.25*0.1 + 0.5*0.3 + 0.2*0.0 = 0+0.025+0.15+0=0.175$

$K_3 = [0.205,0.225,0.22,0.175]$.

Thus:
$$
K = \begin{bmatrix}
  0.3 & 0.38 & 0.3 & 0.19 \\[4pt]
  0.29 & 0.4 & 0.29 & 0.19 \\[4pt]
  0.205 & 0.225 & 0.22 & 0.175
\end{bmatrix}
$$

- $V = XW_V$:
Similarly:

For "The":
- $V_1[1] = 0.1*0.2 + 0.4*0.0 + 0.5*0.1 + 0.7*0.0 = 0.02+0+0.05+0=0.07$
- $V_1[2] = 0.1*0.1 + 0.4*0.3 + 0.5*0.1 + 0.7*0.2 = 0.01+0.12+0.05+0.14=0.32$
- $V_1[3] = 0.1*0.0 + 0.4*0.5 + 0.5*0.1 + 0.7*0.1 = 0+0.2+0.05+0.07=0.32$
- $V_1[4] = 0.1*0.0 + 0.4*0.0 + 0.5*0.4 + 0.7*0.1 = 0+0+0.2+0.07=0.27$

$V_1 = [0.07, 0.32, 0.32, 0.27]$.

For "cat":
- $V_2[1] = 0.2*0.2 + 0.4*0.0 + 0.5*0.1 + 0.6*0.0 = 0.04+0+0.05+0=0.09$
- $V_2[2] = 0.2*0.1 + 0.4*0.3 + 0.5*0.1 + 0.6*0.2 = 0.02+0.12+0.05+0.12=0.31$
- $V_2[3] = 0.2*0.0 + 0.4*0.5 + 0.5*0.1 + 0.6*0.1 = 0+0.2+0.05+0.06=0.31$
- $V_2[4] = 0.2*0.0 + 0.4*0.0 + 0.5*0.4 + 0.6*0.1 = 0+0+0.2+0.06=0.26$

$V_2 = [0.09,0.31,0.31,0.26]$.

For "sat":
- $V_3[1] = 0.15*0.2 + 0.25*0.0 + 0.5*0.1 + 0.2*0.0 = 0.03+0+0.05+0=0.08$
- $V_3[2] = 0.15*0.1 + 0.25*0.3 + 0.5*0.1 + 0.2*0.2 = 0.015+0.075+0.05+0.04=0.18$
- $V_3[3] = 0.15*0.0 + 0.25*0.5 + 0.5*0.1 + 0.2*0.1 = 0+0.125+0.05+0.02=0.195$
- $V_3[4] = 0.15*0.0 + 0.25*0.0 + 0.5*0.4 + 0.2*0.1 = 0+0+0.2+0.02=0.22$

$V_3 = [0.08,0.18,0.195,0.22]$.

Thus:
$$
V = \begin{bmatrix}
  0.07 & 0.32 & 0.32 & 0.27 \\[4pt]
  0.09 & 0.31 & 0.31 & 0.26 \\[4pt]
  0.08 & 0.18 & 0.195 & 0.22
\end{bmatrix}
$$
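
The three projections can be checked mechanically. The sketch below uses plain Python lists and a small hypothetical helper `matmul` (written here, not taken from a library) with the $X$, $W_Q$, $W_K$, and $W_V$ values from the text.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[0.1, 0.4, 0.5, 0.7],
     [0.2, 0.4, 0.5, 0.6],
     [0.15, 0.25, 0.5, 0.2]]

W_Q = [[0.5, 0.1, 0.0, 0.3],
       [0.4, 0.2, 0.1, 0.0],
       [0.3, 0.3, 0.3, 0.3],
       [0.2, 0.1, 0.5, 0.4]]

W_K = [[0.1, 0.4, 0.0, 0.0],
       [0.0, 0.5, 0.2, 0.1],
       [0.3, 0.0, 0.3, 0.3],
       [0.2, 0.2, 0.1, 0.0]]

W_V = [[0.2, 0.1, 0.0, 0.0],
       [0.0, 0.3, 0.5, 0.0],
       [0.1, 0.1, 0.1, 0.4],
       [0.0, 0.2, 0.1, 0.1]]

# Each projection maps the 3x4 input to a 3x4 matrix (one row per token).
Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)

for name, M in (("Q", Q), ("K", K), ("V", V)):
    print(name, [[round(v, 4) for v in row] for row in M])
```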

#### 2. Compute the Attention Scores

$$
\text{Scores} = QK^T
$$

- Dimension check: $Q \in \mathbb{R}^{3 \times 4}, K \in \mathbb{R}^{3 \times 4}$, so $K^T \in \mathbb{R}^{4 \times 3}$. Thus, $\text{Scores} \in \mathbb{R}^{3 \times 3}$.

Compute $\text{Scores}[i,j]$ = $Q_i \cdot K_j$:

- $\text{Scores}[1,1] = Q_1 \cdot K_1 = [0.5*0.3 + 0.31*0.38 + 0.54*0.3 + 0.46*0.19]$
= $0.15 + 0.1178 + 0.162 + 0.0874 = 0.5172$

- $\text{Scores}[1,2] = Q_1 \cdot K_2 = [0.5*0.29 + 0.31*0.4 + 0.54*0.29 + 0.46*0.19]$
= $0.145 + 0.124 + 0.1566 + 0.0874 = 0.513$

- $\text{Scores}[1,3] = Q_1 \cdot K_3 = [0.5*0.205 + 0.31*0.225 + 0.54*0.22 + 0.46*0.175]$
= $0.1025 + 0.06975 + 0.1188 + 0.0805 = 0.37155$

- $\text{Scores}[2,1] = Q_2 \cdot K_1$
= $[0.53*0.3 + 0.31*0.38 + 0.49*0.3 + 0.45*0.19]$
= $0.159 + 0.1178 + 0.147 + 0.0855 = 0.5093$

- $\text{Scores}[2,2] = Q_2 \cdot K_2$
= $[0.53*0.29 + 0.31*0.4 + 0.49*0.29 + 0.45*0.19]$
= $0.1537 + 0.124 + 0.1421 + 0.0855 = 0.5053$

- $\text{Scores}[2,3] = Q_2 \cdot K_3$
= $[0.53*0.205 + 0.31*0.225 + 0.49*0.22 + 0.45*0.175]$
= $0.10865 + 0.06975 + 0.1078 + 0.07875 = 0.36495$

- $\text{Scores}[3,1] = Q_3 \cdot K_1$
= $[0.365*0.3 + 0.235*0.38 + 0.275*0.3 + 0.275*0.19]$
= $0.1095 + 0.0893 + 0.0825 + 0.05225 = 0.33355$

- $\text{Scores}[3,2] = Q_3 \cdot K_2$
= $[0.365*0.29 + 0.235*0.4 + 0.275*0.29 + 0.275*0.19]$
= $0.10585 + 0.094 + 0.07975 + 0.05225 = 0.33185$

- $\text{Scores}[3,3] = Q_3 \cdot K_3$
= $[0.365*0.205 + 0.235*0.225 + 0.275*0.22 + 0.275*0.175]$
= $0.074825 + 0.052875 + 0.0605 + 0.048125 = 0.236325$

Thus:
$$
\text{Scores} = \begin{bmatrix}
0.5172 & 0.513  & 0.37155 \\[4pt]
0.5093 & 0.5053 & 0.36495 \\[4pt]
0.33355 & 0.33185 & 0.236325
\end{bmatrix}
$$
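
As a cross-check of the nine dot products, here is a short sketch in plain Python. It starts from the $Q$ and $K$ matrices computed above; `dot` is a small helper defined here for illustration.

```python
Q = [[0.5, 0.31, 0.54, 0.46],
     [0.53, 0.31, 0.49, 0.45],
     [0.365, 0.235, 0.275, 0.275]]

K = [[0.3, 0.38, 0.3, 0.19],
     [0.29, 0.4, 0.29, 0.19],
     [0.205, 0.225, 0.22, 0.175]]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# scores[i][j] = Q_i . K_j, which is entry (i, j) of Q K^T.
scores = [[dot(q, k) for k in K] for q in Q]

for row in scores:
    print([round(s, 6) for s in row])
```

Row $i$ of the result holds the raw affinity of token $i$ (as a query) with every token in the sequence (as keys).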

#### 3. Scale the Scores

$$
\tilde{\text{Scores}} = \frac{\text{Scores}}{\sqrt{d_k}} = \frac{\text{Scores}}{\sqrt{4}} = \frac{\text{Scores}}{2}
$$

$$
\tilde{\text{Scores}} = \begin{bmatrix}
0.2586 & 0.2565 & 0.185775 \\[4pt]
0.25465 & 0.25265 & 0.182475 \\[4pt]
0.166775 & 0.165925 & 0.1181625
\end{bmatrix}
$$

#### 4. Apply Softmax to Each Row

The **softmax** function is defined as follows:

$$
\text{softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}
$$

where $\mathbf{z} = (z_1, z_2, \ldots, z_N)$ is a vector of real numbers and $N$ is the dimension of the vector $\mathbf{z}$.
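
A minimal sketch of this definition in plain Python follows. Subtracting the largest entry before exponentiating is a common numerical-stability step that is an addition here, not part of the formula above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(z):
    """Return softmax(z) for a list of real numbers."""
    m = max(z)                              # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# First row of the scaled scores from step 3.
row = [0.2586, 0.2565, 0.185775]
weights = softmax(row)

print([round(w, 4) for w in weights])
print("sum =", round(sum(weights), 6))
```

The outputs are positive and sum to 1, so each row of scaled scores becomes a set of mixing weights over the tokens.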


For the first row:
- Sum = $\exp(0.2586) + \exp(0.2565) + \exp(0.185775)$
- $\exp(0.2586) \approx 1.295$
- $\exp(0.2565) \approx 1.292$
- $\exp(0.185775) \approx 1.204$

Sum ≈ 1.295 + 1.292 + 1.204 = 3.791

So:
- $A[1,1] = 1.295/3.791 \approx 0.3416$
- $A[1,2] = 1.292/3.791 \approx 0.3408$
- $A[1,3] = 1.204/3.791 \approx 0.3176$

For the second row:
- $\exp(0.25465) \approx 1.290$
- $\exp(0.25265) \approx 1.287$
- $\exp(0.182475) \approx 1.200$

Sum ≈ 1.290 + 1.287 + 1.200 = 3.777

- $A[2,1] = 1.290/3.777 ≈ 0.3415$
- $A[2,2] = 1.287/3.777 ≈ 0.3407$
- $A[2,3] = 1.200/3.777 ≈ 0.3178$

For the third row:
- $\exp(0.166775) \approx 1.181$
- $\exp(0.165925) \approx 1.180$
- $\exp(0.1181625) \approx 1.1255$

Sum ≈ 1.181 + 1.180 + 1.1255 = 3.4865

- $A[3,1] = 1.181/3.4865 ≈ 0.3388$
- $A[3,2] = 1.180/3.4865 ≈ 0.3385$
- $A[3,3] = 1.1255/3.4865 ≈ 0.3227$

So our attention weight matrix $A$ is approximately:
$$
A = \begin{bmatrix}
0.3416 & 0.3408 & 0.3176 \\[4pt]
0.3415 & 0.3407 & 0.3178 \\[4pt]
0.3388 & 0.3385 & 0.3227
\end{bmatrix}
$$

#### 5. Compute the Final Output

$$
\text{AttOutput} = A V
$$

- Dimension: $A \in \mathbb{R}^{3 \times 3}, V \in \mathbb{R}^{3 \times 4}$, resulting in $\text{AttOutput} \in \mathbb{R}^{3 \times 4}$.

For each row $i$:
$$
\text{AttOutput}_i = \sum_{j=1}^{3} A[i,j] V_j
$$

- For the first token:
$$
\text{AttOutput}_1 = 0.3416V_1 + 0.3408V_2 + 0.3176V_3
$$
Recall:
- $V_1 = [0.07,0.32,0.32,0.27]$
- $V_2 = [0.09,0.31,0.31,0.26]$
- $V_3 = [0.08,0.18,0.195,0.22]$

Compute component-wise:
- Dim1: $0.3416*0.07 + 0.3408*0.09 + 0.3176*0.08 = 0.023912 + 0.030672 + 0.025408 = 0.079992$
- Dim2: $0.3416*0.32 + 0.3408*0.31 + 0.3176*0.18 = 0.109312 + 0.105648 + 0.057168 = 0.272128$
- Dim3: $0.3416*0.32 + 0.3408*0.31 + 0.3176*0.195 = 0.109312 + 0.105648 + 0.061932 = 0.276892$
- Dim4: $0.3416*0.27 + 0.3408*0.26 + 0.3176*0.22 = 0.092232 + 0.088608 + 0.069872 = 0.250712$

$\text{AttOutput}_1 \approx [0.0800, 0.2721, 0.2769, 0.2507]$

- For the second token:
$$
\text{AttOutput}_2 = 0.3415V_1 + 0.3407V_2 + 0.3178V_3
$$
Repeat similarly:
- Dim1: $0.3415*0.07 + 0.3407*0.09 + 0.3178*0.08 = 0.023905 + 0.030663 + 0.025424 = 0.079992$
- Dim2: $0.3415*0.32 + 0.3407*0.31 + 0.3178*0.18 = 0.10928 + 0.105617 + 0.057204 = 0.272101$
- Dim3: $0.3415*0.32 + 0.3407*0.31 + 0.3178*0.195 = 0.10928 + 0.105617 + 0.061971 = 0.276868$
- Dim4: $0.3415*0.27 + 0.3407*0.26 + 0.3178*0.22 = 0.092205 + 0.088582 + 0.069916 = 0.250703$

$\text{AttOutput}_2 \approx [0.0800, 0.2721, 0.2769, 0.2507]$

- For the third token:
$$
\text{AttOutput}_3 = 0.3388V_1 + 0.3385V_2 + 0.3227V_3
$$
- Dim1: $0.3388*0.07 + 0.3385*0.09 + 0.3227*0.08 = 0.023716 + 0.030465 + 0.025816 = 0.079997$
- Dim2: $0.3388*0.32 + 0.3385*0.31 + 0.3227*0.18 = 0.108416 + 0.104935 + 0.058086 = 0.271437$
- Dim3: $0.3388*0.32 + 0.3385*0.31 + 0.3227*0.195 = 0.108416 + 0.104935 + 0.062927 = 0.276278$
- Dim4: $0.3388*0.27 + 0.3385*0.26 + 0.3227*0.22 = 0.091476 + 0.088010 + 0.070994 = 0.250480$

$\text{AttOutput}_3 \approx [0.08, 0.2714, 0.2763, 0.2505]$

Final $\text{AttOutput}$:
$$
\text{AttOutput} \approx \begin{bmatrix}
0.0800 & 0.2721 & 0.2769 & 0.2507 \\[4pt]
0.0800 & 0.2721 & 0.2769 & 0.2507 \\[4pt]
0.08   & 0.2714 & 0.2763 & 0.2505
\end{bmatrix}
$$
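
The five steps above can be chained into one short function. The sketch below is a plain-Python illustration under the toy setup of this example; `matmul`, `transpose`, `softmax`, and `attention` are helper names defined here, not library calls.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(z):
    m = max(z)  # shift for numerical stability; does not change the result
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention without masking."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                 # step 2
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # step 3
    weights = [softmax(row) for row in scaled]                       # step 4
    return weights, matmul(weights, V)                               # step 5

X = [[0.1, 0.4, 0.5, 0.7], [0.2, 0.4, 0.5, 0.6], [0.15, 0.25, 0.5, 0.2]]
W_Q = [[0.5, 0.1, 0.0, 0.3], [0.4, 0.2, 0.1, 0.0],
       [0.3, 0.3, 0.3, 0.3], [0.2, 0.1, 0.5, 0.4]]
W_K = [[0.1, 0.4, 0.0, 0.0], [0.0, 0.5, 0.2, 0.1],
       [0.3, 0.0, 0.3, 0.3], [0.2, 0.2, 0.1, 0.0]]
W_V = [[0.2, 0.1, 0.0, 0.0], [0.0, 0.3, 0.5, 0.0],
       [0.1, 0.1, 0.1, 0.4], [0.0, 0.2, 0.1, 0.1]]

A, output = attention(X, W_Q, W_K, W_V)
for row in output:
    print([round(v, 4) for v in row])
```

Printing `A` and `output` gives a way to check the hand calculation; small differences in the last digit come from rounding the intermediate exponentials by hand.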

## Interpretation

- Each row of $\text{AttOutput}$ is the transformed representation of the corresponding token after attending to all tokens in the sequence (including itself).
- Notice that the rows are fairly similar, which reflects the similarity in the queries and keys for this small artificial example. In a more complex and varied sequence, these values would differ more significantly.
- In practice, multiple heads are used, and their outputs are concatenated to capture various patterns. Here, we demonstrated just a single-head scenario.






### Decoder Architecture in Transformers
#### 6. Apply Masking

**Masking** is used to prevent attention to certain positions. In this example, we will apply a **causal mask** to prevent each token from attending to future tokens. This is particularly useful in decoder architectures to maintain the autoregressive property.

- **Causal Mask Matrix ($M$):**

  The causal mask ensures that each position can only attend to itself and previous positions. For our 3-token sequence:

  $$
  M = \begin{bmatrix}
  0 & -\infty & -\infty \\[4pt]
  0 & 0 & -\infty \\[4pt]
  0 & 0 & 0
  \end{bmatrix}
  $$

  - $0$ allows attention.
  - $-\infty$ effectively masks out the position by making its softmax probability zero.

- **Apply Mask to Scores:**

  $$
  \text{Masked Scores} = \text{Scores} + M
  $$

  Performing element-wise addition:

  $$
  \text{Masked Scores} = \begin{bmatrix}
  0.5172 & -\infty & -\infty \\[4pt]
  0.5093 & 0.5053 & -\infty \\[4pt]
  0.33355 & 0.33185 & 0.236325
  \end{bmatrix}
  $$

  **Explanation:**  
  - The first token ("The") cannot attend to the second ("cat") or third ("sat") token, hence the two $-\infty$ entries.
  - The second token ("cat") cannot attend to the third token ("sat"), hence one $-\infty$ entry.
  - The third token ("sat") can attend to all tokens, including itself.

#### 7. Scale the Scores

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients. Masked entries stay at $-\infty$.

$$
\tilde{\text{Scores}} = \frac{\text{Masked Scores}}{\sqrt{d_k}} = \frac{\text{Masked Scores}}{\sqrt{4}} = \frac{\text{Masked Scores}}{2}
$$

$$
\tilde{\text{Scores}} = \begin{bmatrix}
0.2586 & -\infty & -\infty \\[4pt]
0.25465 & 0.25265 & -\infty \\[4pt]
0.166775 & 0.165925 & 0.1181625
\end{bmatrix}
$$

#### 8. Apply Softmax to Each Row

The **softmax** function converts the scaled scores into probabilities.

$$
\text{softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}
$$

**Applying Softmax:**

- **First Row:**
  $$
  \mathbf{z}_1 = [0.2586, -\infty, -\infty]
  $$
  - $\exp(0.2586) \approx 1.295$
  - $\exp(-\infty) = 0$
  - $\exp(-\infty) = 0$

  Sum: $1.295 + 0 + 0 = 1.295$

  Softmax:
  $$
  A[1,1] = \frac{1.295}{1.295} = 1.0 \\
  A[1,2] = \frac{0}{1.295} = 0.0 \\
  A[1,3] = \frac{0}{1.295} = 0.0
  $$

- **Second Row:**
  $$
  \mathbf{z}_2 = [0.25465, 0.25265, -\infty]
  $$
  - $\exp(0.25465) \approx 1.290$
  - $\exp(0.25265) \approx 1.287$
  - $\exp(-\infty) = 0$

  Sum: $1.290 + 1.287 + 0 = 2.577$

  Softmax:
  $$
  A[2,1] = \frac{1.290}{2.577} \approx 0.500 \\
  A[2,2] = \frac{1.287}{2.577} \approx 0.500 \\
  A[2,3] = \frac{0}{2.577} = 0.0
  $$

- **Third Row:**
  $$
  \mathbf{z}_3 = [0.166775, 0.165925, 0.1181625]
  $$
  - $\exp(0.166775) \approx 1.181$
  - $\exp(0.165925) \approx 1.180$
  - $\exp(0.1181625) \approx 1.1255$

  Sum: $1.181 + 1.180 + 1.1255 = 3.4865$

  Softmax:
  $$
  A[3,1] = \frac{1.181}{3.4865} \approx 0.339 \\
  A[3,2] = \frac{1.180}{3.4865} \approx 0.338 \\
  A[3,3] = \frac{1.1255}{3.4865} \approx 0.323
  $$

Thus, our attention weight matrix $A$ is approximately:

$$
A = \begin{bmatrix}
1.000 & 0.0 & 0.0 \\[4pt]
0.500 & 0.500 & 0.0 \\[4pt]
0.339 & 0.338 & 0.323
\end{bmatrix}
$$

**Effect of Masking:**
- **First Token ("The"):** Can only attend to itself. No attention to "cat" or "sat".
- **Second Token ("cat"):** Can only attend to "The" and itself. No attention to "sat".
- **Third Token ("sat"):** Can attend to all tokens, including itself.

#### 9. Compute the Final Output

$$
\text{AttOutput} = A V
$$

- **Dimension Check:**  
  $A \in \mathbb{R}^{3 \times 3}$, $V \in \mathbb{R}^{3 \times 4}$, so $\text{AttOutput} \in \mathbb{R}^{3 \times 4}$.

- **Calculation:**

  - **For the first token ("The"):**
    $$
    \text{AttOutput}_1 = 1.0 \times V_1 + 0.0 \times V_2 + 0.0 \times V_3 \\
    = 1.0 \times [0.07, 0.32, 0.32, 0.27] + 0.0 \times [0.09, 0.31, 0.31, 0.26] + 0.0 \times [0.08, 0.18, 0.195, 0.22] \\
    = [0.070, 0.320, 0.320, 0.270]
    $$

  - **For the second token ("cat"):**
    $$
    \text{AttOutput}_2 = 0.500 \times V_1 + 0.500 \times V_2 + 0.0 \times V_3 \\
    = 0.500 \times [0.07, 0.32, 0.32, 0.27] + 0.500 \times [0.09, 0.31, 0.31, 0.26] + 0.0 \times [0.08, 0.18, 0.195, 0.22] \\
    = [0.035, 0.16, 0.16, 0.135] + [0.045, 0.155, 0.155, 0.13] + [0.0, 0.0, 0.0, 0.0] \\
    = [0.080, 0.315, 0.315, 0.265]
    $$

  - **For the third token ("sat"):**
    $$
    \text{AttOutput}_3 = 0.339 \times V_1 + 0.338 \times V_2 + 0.323 \times V_3 \\
    = 0.339 \times [0.07, 0.32, 0.32, 0.27] + 0.338 \times [0.09, 0.31, 0.31, 0.26] + 0.323 \times [0.08, 0.18, 0.195, 0.22] \\
    = [0.02373, 0.10848, 0.10848, 0.09153] + [0.03042, 0.10478, 0.10478, 0.08788] + [0.02584, 0.05814, 0.062985, 0.07106] \\
    = [0.07999, 0.27140, 0.276245, 0.25047]
    $$

Thus, the final attention output matrix $\text{AttOutput}$ is approximately:

$$
\text{AttOutput} \approx \begin{bmatrix}
0.070 & 0.320 & 0.320 & 0.270 \\[4pt]
0.080 & 0.315 & 0.315 & 0.265 \\[4pt]
0.080 & 0.271 & 0.276 & 0.250
\end{bmatrix}
$$

**Interpretation:**
- **First Token ("The"):** Its representation is exactly its own value vector $V_1$, because the causal mask hides both later tokens.
- **Second Token ("cat"):** It averages the value vectors of "The" and "cat", ignoring "sat".
- **Third Token ("sat"):** It attends to all three tokens, incorporating information from "The", "cat", and itself.
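
The masking steps can be sketched in plain Python as follows. This is a minimal illustration that starts from the unmasked `scores` and the value matrix `V` of the worked example; `causal_mask` and `softmax` are helper names defined here. It relies on `math.exp(float("-inf"))` evaluating to `0.0`, which is how a $-\infty$ entry ends up with zero weight.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """n x n additive mask: 0 on and below the diagonal, -inf above it."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(z):
    m = max(z)                               # finite, since the diagonal is never masked
    exps = [math.exp(v - m) for v in z]      # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.5172, 0.5130, 0.37155],
          [0.5093, 0.5053, 0.36495],
          [0.33355, 0.33185, 0.236325]]
V = [[0.07, 0.32, 0.32, 0.27],
     [0.09, 0.31, 0.31, 0.26],
     [0.08, 0.18, 0.195, 0.22]]
d_k = 4

M = causal_mask(len(scores))
masked = [[s + m for s, m in zip(srow, mrow)] for srow, mrow in zip(scores, M)]
scaled = [[s / math.sqrt(d_k) for s in row] for row in masked]
A = [softmax(row) for row in scaled]

# Each output row is a weighted average of the value vectors it may see.
output = [[sum(a * v for a, v in zip(arow, col)) for col in zip(*V)] for arow in A]

for arow, orow in zip(A, output):
    print([round(a, 3) for a in arow], [round(o, 3) for o in orow])
```

Row $i$ of `A` has non-zero weights only in columns $1, \ldots, i$, which is the autoregressive property a decoder needs.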

#### 10. Incorporate Multi-Head Attention (Optional)

**Multi-Head Attention** allows the model to attend to information from different representation subspaces at different positions. Here's how to extend our example to multi-head attention.

- **Assumptions:**
  - Number of heads: $h = 2$
  - Dimension per head: $d_k = d_v = d_{\text{model}} / h = 2$

- **Parameter Matrices for Each Head:**
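  Each head has its own $W_Q$, $W_K$, and $W_V$ of shape $d_{\text{model}} \times d_k = 4 \times 2$. For simplicity, we reuse the single-head matrices from above: head 1 takes their first two columns and head 2 takes their last two columns.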

  **Head 1:**
  $$
  W_{Q}^{(1)} = \begin{bmatrix}
  0.5 & 0.1 \\
  0.4 & 0.2 \\
  0.3 & 0.3 \\
  0.2 & 0.1
  \end{bmatrix}, \quad
  W_{K}^{(1)} = \begin{bmatrix}
  0.1 & 0.4 \\
  0.0 & 0.5 \\
  0.3 & 0.0 \\
  0.2 & 0.2
  \end{bmatrix}, \quad
  W_{V}^{(1)} = \begin{bmatrix}
  0.2 & 0.1 \\
  0.0 & 0.3 \\
  0.1 & 0.1 \\
  0.0 & 0.2
  \end{bmatrix}
  $$

  **Head 2:**
  $$
  W_{Q}^{(2)} = \begin{bmatrix}
  0.0 & 0.3 \\
  0.1 & 0.0 \\
  0.3 & 0.3 \\
  0.5 & 0.4
  \end{bmatrix}, \quad
  W_{K}^{(2)} = \begin{bmatrix}
  0.0 & 0.0 \\
  0.2 & 0.1 \\
  0.3 & 0.3 \\
  0.1 & 0.0
  \end{bmatrix}, \quad
  W_{V}^{(2)} = \begin{bmatrix}
  0.0 & 0.0 \\
  0.5 & 0.0 \\
  0.1 & 0.4 \\
  0.1 & 0.1
  \end{bmatrix}
  $$

- **Compute Q, K, V for Each Head:**

  **Head 1:**
  $$
  Q^{(1)} = X W_{Q}^{(1)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.5 & 0.1 \\
  0.4 & 0.2 \\
  0.3 & 0.3 \\
  0.2 & 0.1
  \end{bmatrix}
  = \begin{bmatrix}
  0.5 & 0.31 \\
  0.53 & 0.31 \\
  0.365 & 0.235
  \end{bmatrix}
  $$

  $$
  K^{(1)} = X W_{K}^{(1)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.1 & 0.4 \\
  0.0 & 0.5 \\
  0.3 & 0.0 \\
  0.2 & 0.2
  \end{bmatrix}
  = \begin{bmatrix}
  0.3 & 0.38 \\
  0.29 & 0.4 \\
  0.205 & 0.225
  \end{bmatrix}
  $$

  $$
  V^{(1)} = X W_{V}^{(1)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.2 & 0.1 \\
  0.0 & 0.3 \\
  0.1 & 0.1 \\
  0.0 & 0.2
  \end{bmatrix}
  = \begin{bmatrix}
  0.07 & 0.32 \\
  0.09 & 0.31 \\
  0.08 & 0.18
  \end{bmatrix}
  $$

  **Head 2:**
  $$
  Q^{(2)} = X W_{Q}^{(2)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.0 & 0.3 \\
  0.1 & 0.0 \\
  0.3 & 0.3 \\
  0.5 & 0.4
  \end{bmatrix}
  = \begin{bmatrix}
  0.54 & 0.46 \\
  0.49 & 0.45 \\
  0.275 & 0.275
  \end{bmatrix}
  $$

  $$
  K^{(2)} = X W_{K}^{(2)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.0 & 0.0 \\
  0.2 & 0.1 \\
  0.3 & 0.3 \\
  0.1 & 0.0
  \end{bmatrix}
  = \begin{bmatrix}
  0.3 & 0.19 \\
  0.29 & 0.19 \\
  0.22 & 0.175
  \end{bmatrix}
  $$

  $$
  V^{(2)} = X W_{V}^{(2)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.0 & 0.0 \\
  0.5 & 0.0 \\
  0.1 & 0.4 \\
  0.1 & 0.1
  \end{bmatrix}
  = \begin{bmatrix}
  0.32 & 0.27 \\
  0.31 & 0.26 \\
  0.195 & 0.22
  \end{bmatrix}
  $$

  Because each head's weights are column blocks of the single-head matrices, $Q^{(1)}$ and $Q^{(2)}$ are simply the first two and last two columns of the single-head $Q$, and likewise for $K$ and $V$.

- **Compute Attention for Each Head:**

  For each head, perform the same steps as single-head attention:
  
  1. Compute Scores: $Q^{(h)} {K^{(h)}}^T$
  2. Apply Masking (if necessary)
  3. Scale Scores by $\sqrt{d_k}$ (here $\sqrt{2} \approx 1.4142$, since each head has $d_k = 2$)
  4. Apply Softmax
  5. Compute Attention Output: $A^{(h)} V^{(h)}$

  **Head 1:**

  - **Scores:**
    $$
    \text{Scores}^{(1)} = Q^{(1)} {K^{(1)}}^T = \begin{bmatrix}
    0.5 & 0.31 \\
    0.53 & 0.31 \\
    0.365 & 0.235
    \end{bmatrix}
    \begin{bmatrix}
    0.3 & 0.29 & 0.205 \\
    0.38 & 0.4 & 0.225
    \end{bmatrix}
    = \begin{bmatrix}
    0.2678 & 0.2690 & 0.17225 \\
    0.2768 & 0.2777 & 0.1784 \\
    0.1988 & 0.19985 & 0.1277
    \end{bmatrix}
    $$

    For example, $\text{Scores}^{(1)}[1,1] = 0.5*0.3 + 0.31*0.38 = 0.15 + 0.1178 = 0.2678$.

  - **Apply Masking:**
    $$
    M = \begin{bmatrix}
    0 & -\infty & -\infty \\
    0 & 0 & -\infty \\
    0 & 0 & 0
    \end{bmatrix}
    $$
    $$
    \text{Masked Scores}^{(1)} = \text{Scores}^{(1)} + M = \begin{bmatrix}
    0.2678 & -\infty & -\infty \\
    0.2768 & 0.2777 & -\infty \\
    0.1988 & 0.19985 & 0.1277
    \end{bmatrix}
    $$

  - **Scale:**
    $$
    \tilde{\text{Scores}}^{(1)} = \frac{\text{Masked Scores}^{(1)}}{\sqrt{2}} \approx \begin{bmatrix}
    0.1894 & -\infty & -\infty \\
    0.1957 & 0.1964 & -\infty \\
    0.1406 & 0.1413 & 0.0903
    \end{bmatrix}
    $$

  - **Softmax:**
    $$
    A^{(1)} = \text{softmax}(\tilde{\text{Scores}}^{(1)}) \approx \begin{bmatrix}
    1.0000 & 0.0 & 0.0 \\
    0.4998 & 0.5002 & 0.0 \\
    0.3388 & 0.3390 & 0.3222
    \end{bmatrix}
    $$

  - **Attention Output:**
    $$
    \text{AttOutput}^{(1)} = A^{(1)} V^{(1)} = \begin{bmatrix}
    1.0000 & 0.0 & 0.0 \\
    0.4998 & 0.5002 & 0.0 \\
    0.3388 & 0.3390 & 0.3222
    \end{bmatrix}
    \begin{bmatrix}
    0.07 & 0.32 \\
    0.09 & 0.31 \\
    0.08 & 0.18
    \end{bmatrix}
    \approx \begin{bmatrix}
    0.0700 & 0.3200 \\
    0.0800 & 0.3150 \\
    0.0800 & 0.2715
    \end{bmatrix}
    $$

  **Head 2:**

  - **Scores:**
    $$
    \text{Scores}^{(2)} = Q^{(2)} {K^{(2)}}^T = \begin{bmatrix}
    0.54 & 0.46 \\
    0.49 & 0.45 \\
    0.275 & 0.275
    \end{bmatrix}
    \begin{bmatrix}
    0.3 & 0.29 & 0.22 \\
    0.19 & 0.19 & 0.175
    \end{bmatrix}
    = \begin{bmatrix}
    0.2494 & 0.2440 & 0.1993 \\
    0.2325 & 0.2276 & 0.18655 \\
    0.13475 & 0.1320 & 0.108625
    \end{bmatrix}
    $$

    For example, $\text{Scores}^{(2)}[1,1] = 0.54*0.3 + 0.46*0.19 = 0.162 + 0.0874 = 0.2494$.

  - **Apply Masking:**
    The same $3 \times 3$ causal mask $M$ applies, because the mask depends only on the sequence length, not on the head dimension:

    $$
    \text{Masked Scores}^{(2)} = \text{Scores}^{(2)} + M = \begin{bmatrix}
    0.2494 & -\infty & -\infty \\
    0.2325 & 0.2276 & -\infty \\
    0.13475 & 0.1320 & 0.108625
    \end{bmatrix}
    $$

  - **Scale:**
    $$
    \tilde{\text{Scores}}^{(2)} = \frac{\text{Masked Scores}^{(2)}}{\sqrt{2}} \approx \begin{bmatrix}
    0.1764 & -\infty & -\infty \\
    0.1644 & 0.1609 & -\infty \\
    0.0953 & 0.0933 & 0.0768
    \end{bmatrix}
    $$

  - **Softmax:**
    $$
    A^{(2)} = \text{softmax}(\tilde{\text{Scores}}^{(2)}) \approx \begin{bmatrix}
    1.0000 & 0.0 & 0.0 \\
    0.5009 & 0.4991 & 0.0 \\
    0.3356 & 0.3349 & 0.3295
    \end{bmatrix}
    $$

  - **Attention Output:**
    $$
    \text{AttOutput}^{(2)} = A^{(2)} V^{(2)} = \begin{bmatrix}
    1.0000 & 0.0 & 0.0 \\
    0.5009 & 0.4991 & 0.0 \\
    0.3356 & 0.3349 & 0.3295
    \end{bmatrix}
    \begin{bmatrix}
    0.32 & 0.27 \\
    0.31 & 0.26 \\
    0.195 & 0.22
    \end{bmatrix}
    \approx \begin{bmatrix}
    0.3200 & 0.2700 \\
    0.3150 & 0.2650 \\
    0.2755 & 0.2502
    \end{bmatrix}
    $$

- **Concatenate Heads and Project:**

  After computing attention outputs for all heads, concatenate them:

  $$
  \text{Concat} = [\text{AttOutput}^{(1)}, \text{AttOutput}^{(2)}] \approx \begin{bmatrix}
  0.0700 & 0.3200 & 0.3200 & 0.2700 \\
  0.0800 & 0.3150 & 0.3150 & 0.2650 \\
  0.0800 & 0.2715 & 0.2755 & 0.2502
  \end{bmatrix}
  $$

  Then apply a final linear projection:

  Let’s define $W_O \in \mathbb{R}^{(h \cdot d_v) \times d_{\text{model}}}$ as:

  $$
  W_O = \begin{bmatrix}
  0.1 & 0.0 & 0.2 & 0.1 \\
  0.0 & 0.1 & 0.0 & 0.2 \\
  0.3 & 0.1 & 0.0 & 0.0 \\
  0.0 & 0.2 & 0.1 & 0.1
  \end{bmatrix}
  $$

  Compute the final output:

  $$
  \text{FinalOutput} = \text{Concat} \cdot W_O = \begin{bmatrix}
  0.0700 & 0.3200 & 0.3200 & 0.2700 \\
  0.0800 & 0.3150 & 0.3150 & 0.2650 \\
  0.0800 & 0.2715 & 0.2755 & 0.2502
  \end{bmatrix}
  \begin{bmatrix}
  0.1 & 0.0 & 0.2 & 0.1 \\
  0.0 & 0.1 & 0.0 & 0.2 \\
  0.3 & 0.1 & 0.0 & 0.0 \\
  0.0 & 0.2 & 0.1 & 0.1
  \end{bmatrix}
  $$

  For example, the first column of the result is:
  - Row 1: $0.0700*0.1 + 0.3200*0.0 + 0.3200*0.3 + 0.2700*0.0 = 0.1030$
  - Row 2: $0.0800*0.1 + 0.3150*0.0 + 0.3150*0.3 + 0.2650*0.0 = 0.1025$
  - Row 3: $0.0800*0.1 + 0.2715*0.0 + 0.2755*0.3 + 0.2502*0.0 = 0.0907$

  Completing the remaining columns in the same way gives:

  $$
  \text{FinalOutput} \approx \begin{bmatrix}
  0.1030 & 0.1180 & 0.0410 & 0.0980 \\
  0.1025 & 0.1160 & 0.0425 & 0.0975 \\
  0.0907 & 0.1047 & 0.0410 & 0.0873
  \end{bmatrix}
  $$

#### 11. Final Representation

The final output represents the attended information for each token, enriched by multiple attention heads capturing diverse patterns.
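
The whole multi-head computation can be sketched in plain Python. This is an illustrative sketch under the assumptions of this example: two heads whose weights are column slices of the single-head $W_Q$, $W_K$, $W_V$ (which matches the per-head matrices listed above), a causal mask, scaling by $\sqrt{d_k}$ with the per-head $d_k = 2$, and the toy $W_O$. The helper names (`matmul`, `softmax`, `columns`, `head_attention`) are defined here and are not library functions.

```python
import math

NEG_INF = float("-inf")

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def columns(M, start, stop):
    """Column slice M[:, start:stop] of a list-of-rows matrix."""
    return [row[start:stop] for row in M]

def head_attention(X, W_q, W_k, W_v, causal=True):
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    weights = []
    for i, q in enumerate(Q):
        row = [NEG_INF if causal and j > i
               else sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
               for j, k in enumerate(K)]
        weights.append(softmax(row))
    return matmul(weights, V)

X = [[0.1, 0.4, 0.5, 0.7], [0.2, 0.4, 0.5, 0.6], [0.15, 0.25, 0.5, 0.2]]
W_Q = [[0.5, 0.1, 0.0, 0.3], [0.4, 0.2, 0.1, 0.0],
       [0.3, 0.3, 0.3, 0.3], [0.2, 0.1, 0.5, 0.4]]
W_K = [[0.1, 0.4, 0.0, 0.0], [0.0, 0.5, 0.2, 0.1],
       [0.3, 0.0, 0.3, 0.3], [0.2, 0.2, 0.1, 0.0]]
W_V = [[0.2, 0.1, 0.0, 0.0], [0.0, 0.3, 0.5, 0.0],
       [0.1, 0.1, 0.1, 0.4], [0.0, 0.2, 0.1, 0.1]]
W_O = [[0.1, 0.0, 0.2, 0.1], [0.0, 0.1, 0.0, 0.2],
       [0.3, 0.1, 0.0, 0.0], [0.0, 0.2, 0.1, 0.1]]

d_model, h = 4, 2
d_head = d_model // h

# One attention output (3 x d_head) per head.
heads = [
    head_attention(X,
                   columns(W_Q, s, s + d_head),
                   columns(W_K, s, s + d_head),
                   columns(W_V, s, s + d_head))
    for s in range(0, d_model, d_head)
]

# Concatenate along the feature axis, then project with W_O.
concat = [sum((head[i] for head in heads), []) for i in range(len(X))]
final_output = matmul(concat, W_O)

for row in final_output:
    print([round(v, 4) for v in row])
```

Each head sees the same tokens but scores them with its own 2-dimensional queries and keys, so the two heads can produce different attention weights for the same position.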

## Summary of Enhanced Steps

1. **Input Preparation:**
   - Create input embeddings and add positional encodings.

2. **Linear Projections:**
   - Compute $Q = XW_Q$, $K = XW_K$, $V = XW_V$ for each head.

3. **Attention Scores:**
   - Compute $\text{Scores} = QK^T$.

4. **Apply Masking:**
   - Add mask matrix $M$ to $\text{Scores}$ to obtain $\text{Masked Scores}$.

5. **Scaling:**
   - Divide the scores by $\sqrt{d_k}$, where $d_k$ is the key dimension of the head.

6. **Softmax:**
   - Apply softmax to obtain attention weights $A$.

7. **Attention Output:**
   - Compute $\text{AttOutput} = A V$.

8. **Multi-Head Concatenation (if applicable):**
   - Concatenate outputs from all heads and apply final linear projection.

9. **Final Representation:**
   - The final output represents the attended information for each token.



## Summary of Steps

1. Took the input sequence and created $X$.
2. Computed $Q = XW_Q$, $K = XW_K$, $V = XW_V$.
3. Calculated $\text{Scores} = QK^T$, then scaled them.
4. Applied softmax to get attention weights $A$.
5. Computed the final output as $A V$.

This step-by-step example shows how the linear algebra operations translate into the final attended representation.


## Training and Inference Steps for a Transformer-based Language Model

We have obtained the final attention output (`AttOutput`) for a sequence. Recall the final `AttOutput` (for a single layer or after the attention mechanism) is:

$$
\text{AttOutput} \approx \begin{bmatrix}
0.0800 & 0.2721 & 0.2769 & 0.2507 \\[4pt]
0.0800 & 0.2721 & 0.2769 & 0.2507 \\[4pt]
0.08   & 0.2714 & 0.2763 & 0.2505
\end{bmatrix}
$$

This represents the transformed representations of each token in the input sequence after applying attention. Each row corresponds to a token position in the sequence, and each column is one of the model’s hidden dimensions ($d_{\text{model}} = 4$ in this example).

## From AttOutput to Predictions

To make predictions about the next token, the model uses a final linear layer and a softmax to produce a probability distribution over the vocabulary.

### Final Linear Projection

Suppose we have a vocabulary $\mathcal{V}$ of size $|\mathcal{V}| = V$. We introduce a parameter matrix:
$$
W_{\text{out}} \in \mathbb{R}^{d_{\text{model}} \times V}
$$
For our example, let’s assume $V = 5$ (a tiny vocabulary for demonstration) and $d_{\text{model}} = 4$.

A single row of `AttOutput` (say the last token’s representation) is multiplied by $W_{\text{out}}$ to produce logits (unnormalized scores) for the next token:
$$
Z_t = \text{AttOutput}_t W_{\text{out}}
$$
$\text{AttOutput}_t \in \mathbb{R}^{1 \times 4}$ and $W_{\text{out}} \in \mathbb{R}^{4 \times 5}$, thus $Z_t \in \mathbb{R}^{1 \times 5}$.

For the sake of example, let’s pick the last row of AttOutput:
$$
\text{AttOutput}_3 = [0.08, 0.2714, 0.2763, 0.2505].
$$

Assume:
$$
W_{\text{out}} = \begin{bmatrix}
0.2 & 0.1 & 0.0 & 0.3 & 0.1 \\[4pt]
0.0 & 0.5 & 0.1 & 0.0 & 0.2 \\[4pt]
0.1 & 0.1 & 0.3 & 0.3 & 0.1 \\[4pt]
0.2 & 0.2 & 0.0 & 0.1 & 0.0 
\end{bmatrix}
$$

Compute $Z_3$:
$$
Z_3 = \text{AttOutput}_3 W_{\text{out}}
$$

Component-wise:
- $Z_3[1] = 0.08*0.2 + 0.2714*0.0 + 0.2763*0.1 + 0.2505*0.2 = 0.016 + 0 + 0.02763 + 0.0501 = 0.09373$
- $Z_3[2] = 0.08*0.1 + 0.2714*0.5 + 0.2763*0.1 + 0.2505*0.2 = 0.008 + 0.1357 + 0.02763 + 0.0501 = 0.22143$
- $Z_3[3] = 0.08*0.0 + 0.2714*0.1 + 0.2763*0.3 + 0.2505*0.0 = 0 + 0.02714 + 0.08289 + 0 = 0.11003$
- $Z_3[4] = 0.08*0.3 + 0.2714*0.0 + 0.2763*0.3 + 0.2505*0.1 = 0.024 + 0 + 0.08289 + 0.02505 = 0.13194$
- $Z_3[5] = 0.08*0.1 + 0.2714*0.2 + 0.2763*0.1 + 0.2505*0.0 = 0.008 + 0.05428 + 0.02763 + 0 = 0.08991$

So:
$$
Z_3 \approx [0.09373, \; 0.22143, \; 0.11003, \; 0.13194, \; 0.08991].
$$
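
The component-wise products above can be checked with a short script. This is only a sketch in plain Python; the variable names are illustrative, and the values are copied from the example.

```python
att_output_3 = [0.08, 0.2714, 0.2763, 0.2505]       # last row of AttOutput
W_out = [
    [0.2, 0.1, 0.0, 0.3, 0.1],
    [0.0, 0.5, 0.1, 0.0, 0.2],
    [0.1, 0.1, 0.3, 0.3, 0.1],
    [0.2, 0.2, 0.0, 0.1, 0.0],
]

# One logit per vocabulary entry: dot product of the row vector with each column.
logits = [sum(x * w for x, w in zip(att_output_3, column)) for column in zip(*W_out)]
print([round(z, 5) for z in logits])
# [0.09373, 0.22143, 0.11003, 0.13194, 0.08991]
```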

### Softmax for Next-Token Prediction

The probability distribution for the next token given the first three tokens is:
$$
p_{\theta}(w_4 \mid w_1, w_2, w_3) = \text{softmax}(Z_3).
$$

Compute softmax:
$$
\exp(Z_3[1]) = \exp(0.09373) \approx 1.0981
$$
$$
\exp(Z_3[2]) = \exp(0.22143) \approx 1.2479
$$
$$
\exp(Z_3[3]) = \exp(0.11003) \approx 1.1162
$$
$$
\exp(Z_3[4]) = \exp(0.13194) \approx 1.1410
$$
$$
\exp(Z_3[5]) = \exp(0.08991) \approx 1.0940
$$

Sum these:
$$
S = 1.0981 + 1.2479 + 1.1162 + 1.1410 + 1.0940 \approx 5.6972
$$

Thus:
$$
p_{\theta}(w_4 = i \mid w_{1:3}) = \frac{\exp(Z_3[i])}{S}
$$

- $p_{\theta}(w_4=1) \approx 1.0981 / 5.6972 \approx 0.1927$
- $p_{\theta}(w_4=2) \approx 1.2479 / 5.6972 \approx 0.2190$
- $p_{\theta}(w_4=3) \approx 1.1162 / 5.6972 \approx 0.1960$
- $p_{\theta}(w_4=4) \approx 1.1410 / 5.6972 \approx 0.2003$
- $p_{\theta}(w_4=5) \approx 1.0940 / 5.6972 \approx 0.1920$

We have a probability distribution over the next token.
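
A softmax like the one above might be written as follows. Subtracting the largest logit before exponentiating is a common numerical-stability trick that the hand calculation does not need; it leaves the resulting probabilities unchanged. The function name is an illustrative choice.

```python
import math

def softmax(logits):
    """Softmax that subtracts the max logit first to avoid overflow."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

Z_3 = [0.09373, 0.22143, 0.11003, 0.13194, 0.08991]
probs = softmax(Z_3)
print([round(p, 3) for p in probs])
print(sum(probs))   # sums to 1 up to floating-point error
```

Because the logits are all close together, the distribution is nearly uniform, with token 2 only slightly ahead.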

## Target Labels and Loss Calculation

In training, we have a ground truth next token. Suppose the correct next token (the label) is $w_4 = 2$. The one-hot target vector $y$ for this position is:
$$
y = [0, 1, 0, 0, 0].
$$

The model’s predicted distribution for $w_4$ is:
$$
\hat{p} = [0.1927,\;0.2190,\;0.1960,\;0.2003,\;0.1920].
$$

The loss function commonly used is the cross-entropy:
$$
\mathcal{L}(\theta) = -\sum_{i=1}^{V} y_i \log \hat{p}_i.
$$

Since $y_i=0$ for all $i \neq 2$ and $y_2=1$:
$$
\mathcal{L}(\theta) = -\log(\hat{p}_2) = -\log(0.2190) \approx 1.519.
$$
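
As a quick sanity check, the sketch below evaluates the cross-entropy both as the full sum over the vocabulary and as the shortcut $-\log \hat{p}_2$, and compares it with the loss of a uniform guess over five tokens. The helper name `cross_entropy` is an assumption for illustration.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target: -log of the target's probability."""
    return -math.log(probs[target_index])

p_hat = [0.1927, 0.2190, 0.1960, 0.2003, 0.1920]
y = [0, 1, 0, 0, 0]

# Full sum over the vocabulary; every term with y_i = 0 drops out.
loss_full = -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p_hat))
loss_short = cross_entropy(p_hat, y.index(1))
uniform_loss = math.log(len(p_hat))     # loss of a uniform guess over 5 tokens
print(round(loss_full, 3), round(loss_short, 3), round(uniform_loss, 3))
```

The loss is only slightly below that of a uniform guess, which is what one would expect from an untrained model.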

## Backpropagation and Parameter Updates

To train the model, we compute the gradient of the loss $\mathcal{L}(\theta)$ with respect to all parameters ($\theta$ includes $W_Q, W_K, W_V, W_{\text{out}}$, and others).

1. **Gradient with respect to Output Layer:**

   For the output weights $W_{\text{out}}$, the gradient is derived from:
   $$
   \frac{\partial \mathcal{L}}{\partial Z_3[i]} = \hat{p}_i - y_i.
   $$

   Here:
   - For $i=2$ (correct token), $\hat{p}_2 - y_2 = 0.2190 - 1 = -0.7810$.
   - For other $i$, $\hat{p}_i - 0 = \hat{p}_i$.

   Thus:
   $$
   \frac{\partial \mathcal{L}}{\partial Z_3} = [0.1927,\; -0.7810,\; 0.1960,\;0.2003,\;0.1920].
   $$

   Then:
   $$
   \frac{\partial \mathcal{L}}{\partial W_{\text{out}}} = \text{AttOutput}_3^T \frac{\partial \mathcal{L}}{\partial Z_3}
   $$
   where $\text{AttOutput}_3 \in \mathbb{R}^{1 \times 4}$ and $\frac{\partial \mathcal{L}}{\partial Z_3} \in \mathbb{R}^{1 \times 5}$, so the resulting gradient is $\in \mathbb{R}^{4 \times 5}$.

   This will update $W_{\text{out}}$.

2. **Gradient flows backward:**
   
   The gradient also propagates back into `AttOutput_3`:
   $$
   \frac{\partial \mathcal{L}}{\partial \text{AttOutput}_3} = \frac{\partial \mathcal{L}}{\partial Z_3} W_{\text{out}}^T.
   $$

   From there, we continue backpropagation:
   - Through the attention mechanism: $\frac{\partial \mathcal{L}}{\partial V}$, $\frac{\partial \mathcal{L}}{\partial A}$, and hence $\frac{\partial \mathcal{L}}{\partial Q}, \frac{\partial \mathcal{L}}{\partial K}$, and so on.
   - Through $Q = XW_Q$, $K = XW_K$, $V = XW_V$, we get gradients w.r.t. $W_Q, W_K, W_V$.

   Every parameter in the model receives a gradient signal that tells it how to adjust to reduce the loss.

3. **Gradient Descent Update:**
   
   Once we have $\frac{\partial \mathcal{L}}{\partial \theta}$, we perform a parameter update:
   $$
   \theta \leftarrow \theta - \eta \nabla_{\theta}\mathcal{L}(\theta),
   $$
   where $\eta$ is the learning rate.

   Over many training steps with many examples, the parameters $W_Q, W_K, W_V, W_{\text{out}}$, and others, adjust so that the model becomes better at predicting correct next tokens, thus giving Q, K, and V their functional meaning.
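
The first two gradient steps can be sketched directly from the formulas above. The block uses plain Python lists; the variable names are illustrative, and the numbers are those of the example.

```python
att_output_3 = [0.08, 0.2714, 0.2763, 0.2505]
p_hat = [0.1927, 0.2190, 0.1960, 0.2003, 0.1920]
y = [0, 1, 0, 0, 0]
W_out = [
    [0.2, 0.1, 0.0, 0.3, 0.1],
    [0.0, 0.5, 0.1, 0.0, 0.2],
    [0.1, 0.1, 0.3, 0.3, 0.1],
    [0.2, 0.2, 0.0, 0.1, 0.0],
]

# dL/dZ_3 = p_hat - y
grad_logits = [p - t for p, t in zip(p_hat, y)]

# dL/dW_out = AttOutput_3^T (dL/dZ_3): an outer product of shape 4 x 5
grad_W_out = [[x * g for g in grad_logits] for x in att_output_3]

# dL/dAttOutput_3 = (dL/dZ_3) W_out^T: shape 1 x 4
grad_att_output_3 = [sum(g * w for g, w in zip(grad_logits, row)) for row in W_out]

print(len(grad_W_out), len(grad_W_out[0]), len(grad_att_output_3))   # 4 5 4
```

Only the column of `grad_W_out` that belongs to the correct token is negative, so a gradient-descent step raises that token's logit and lowers the others.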

## Training Summary

- We start with a forward pass that produces probability distributions over next tokens.
- We compute the loss using the target token.
- We backpropagate the loss to find gradients w.r.t. all parameters.
- We use gradient descent (or Adam or another optimizer) to update the parameters.
- We repeat for many examples and epochs until the model converges.

## Inference (After Training)

After training, we fix the parameters $\theta$. Given a prompt $(w_1, w_2, w_3)$:
1. Compute `AttOutput` just like during training (forward pass only, no backprop).
2. Compute $\hat{p}(w_4 \mid w_{1:3}) = \text{softmax}(Z_3)$.
3. Choose the next token from this distribution, either by taking the most probable token (greedy decoding) or by sampling stochastically.
4. Append the chosen token to the sequence and repeat until a termination condition.

During inference, no gradients are computed, and no parameter updates are made. We simply use the learned parameters to generate new text.
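
The two ways of choosing a token might look like this in plain Python. The function names and the fixed random seed are assumptions for illustration; the probabilities are those of the worked example.

```python
import random

def greedy(probs):
    """Return the zero-based index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample(probs, rng):
    """Draw one token index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.1927, 0.2190, 0.1960, 0.2003, 0.1920]
rng = random.Random(0)                       # fixed seed, for repeatability only

print(greedy(probs) + 1)                     # 1-based token id, as in the text
draws = [sample(probs, rng) + 1 for _ in range(1000)]
print({token: draws.count(token) for token in range(1, 6)})
```

Greedy decoding always returns token 2 here, while sampling returns every token at roughly its predicted frequency.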

---


# Detailed Training and Inference Process for a Transformer-based Language Model

In the previous sections, we computed the attention output (`AttOutput`) for a given input sequence. Now, we will delve deeper into the training process of a Transformer-based language model, starting from the `AttOutput`, and progress through loss calculation, backpropagation, gradient descent, and finally inference. This explanation focuses on the mathematical foundations and step-by-step calculations involved.

## Transformer Layer Overview

A **Transformer layer** is the core component of Transformer-based models like GPT and BERT. Each Transformer layer comprises two main sub-layers:

1. **Multi-Head Self-Attention Mechanism**
2. **Position-Wise Feed-Forward Network (FFN)**

Each sub-layer is followed by **residual connections** and **layer normalization** to stabilize and enhance training.

### 1. Multi-Head Self-Attention

The multi-head self-attention mechanism allows the model to attend to different parts of the input sequence simultaneously. Mathematically, for each head $ h $ in the multi-head attention:

$$
\text{head}_h = \text{Attention}(QW_Q^{(h)}, KW_K^{(h)}, VW_V^{(h)})
$$

Where:
- In self-attention, the inputs $ Q $, $ K $, and $ V $ are all the layer input $ X $, so head $ h $ works with $ XW_Q^{(h)} $, $ XW_K^{(h)} $, and $ XW_V^{(h)} $
- $ W_Q^{(h)}, W_K^{(h)}, W_V^{(h)} $ are the projection matrices for head $ h $; they play the role that $ W_Q, W_K, W_V $ played in the single-head example

The attention mechanism is defined as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

Here, $ d_k $ is the dimensionality of the key vectors.

After computing all heads, their outputs are concatenated and projected:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_H)W_O
$$

Where $ H $ is the number of heads and $ W_O $ is the output projection matrix.
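
A minimal sketch of the concatenate-and-project step follows, in plain Python. The sizes (3 tokens, $d_{\text{model}} = 4$, two heads of width 2), the selector-style projection matrices, and the identity $ W_O $ are all hypothetical choices that keep the example easy to follow.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate the per-head outputs along the feature axis, then project.
    concat = [sum((head[t] for head in outputs), []) for t in range(len(X))]
    return matmul(concat, W_O)

X = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]    # selects dims 1-2
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # selects dims 3-4
heads = [(first_half,) * 3, (second_half,) * 3]
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
out = multi_head(X, heads, W_O)
print(len(out), len(out[0]))   # 3 4
```

Each head sees only its own projection of the input, and the concatenation restores the original width before the output projection is applied.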

### 2. Position-Wise Feed-Forward Network (FFN)

After the multi-head attention, each Transformer layer applies a feed-forward neural network independently to each position:

$$
\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
$$

Where:
- $ W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}} $
- $ W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}} $
- $ b_1, b_2 $ are bias vectors
- $ d_{\text{ff}} $ is the dimensionality of the feed-forward layer

### 3. Residual Connections and Layer Normalization

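Each sub-layer is wrapped in a residual connection and followed by layer normalization, with the second sub-layer taking the output of the first as its input:

$$
\begin{aligned}
& X' = \text{LayerNorm}(X + \text{MultiHead}(Q, K, V)) \\
& \text{Output} = \text{LayerNorm}(X' + \text{FFN}(X'))
\end{aligned}
$$

The residual paths give gradients a direct route through the layer, which mitigates vanishing gradients, and the normalization stabilizes training. The numerical example that follows omits both steps to keep the arithmetic short.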
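
The add-and-normalize step can be sketched for a single token vector as below. This simplified version omits the learned gain and bias of a full layer normalization, and the sub-layer output is a made-up vector; both are assumptions for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (almost) unit variance.

    The learned gain and bias of a full LayerNorm are omitted here.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [0.08, 0.2714, 0.2763, 0.2505]           # one token's input
sublayer_out = [0.11, 0.38, 0.65, 0.92]      # hypothetical sub-layer output
normed = add_and_norm(x, sublayer_out)
print([round(v, 4) for v in normed])
```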

## From `AttOutput` to Predictions

Given the attention output matrix (`AttOutput`):

$$
\text{AttOutput} \approx \begin{bmatrix}
0.0794 & 0.2722 & 0.2779 & 0.2508 \\
0.0792 & 0.2721 & 0.2779 & 0.2507 \\
0.08   & 0.2714 & 0.2763 & 0.2505
\end{bmatrix} \in \mathbb{R}^{3 \times 4}
$$

Each row corresponds to a token in the input sequence, and each column represents a hidden dimension ($d_{\text{model}} = 4$ in this example).

### 1. Position-Wise Feed-Forward Network (FFN)

Assume simple FFN parameters for illustration:

$$
W_1 = \begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
0.1 & 0.3 & 0.5 \\
\end{bmatrix} \in \mathbb{R}^{4 \times 3}, \quad
W_2 = \begin{bmatrix}
0.2 & 0.4 & 0.6 & 0.8 \\
0.1 & 0.3 & 0.5 & 0.7 \\
0.0 & 0.2 & 0.4 & 0.6 \\
\end{bmatrix} \in \mathbb{R}^{3 \times 4}
$$

**No biases** are considered for simplicity.

#### Applying FFN to Each Position

For each row in `AttOutput`, compute:

$$
\text{FFN}(x) = \text{ReLU}(x W_1) W_2
$$

Let's compute the FFN output for the **first token**:

1. **Compute $x W_1$:**

$$
x = [0.0794, 0.2722, 0.2779, 0.2508] \in \mathbb{R}^{1 \times 4}
$$

$$
x W_1 = [0.0794 \times 0.1 + 0.2722 \times 0.4 + 0.2779 \times 0.7 + 0.2508 \times 0.1, \; \ldots ]
$$

Calculate each component:

$$
\begin{aligned}
& y_1 = 0.0794 \times 0.1 + 0.2722 \times 0.4 + 0.2779 \times 0.7 + 0.2508 \times 0.1 \\
& \quad = 0.00794 + 0.10888 + 0.19453 + 0.02508 \approx 0.33643 \\
& y_2 = 0.0794 \times 0.2 + 0.2722 \times 0.5 + 0.2779 \times 0.8 + 0.2508 \times 0.3 \\
& \quad = 0.01588 + 0.1361 + 0.22232 + 0.07524 \approx 0.44954 \\
& y_3 = 0.0794 \times 0.3 + 0.2722 \times 0.6 + 0.2779 \times 0.9 + 0.2508 \times 0.5 \\
& \quad = 0.02382 + 0.16332 + 0.25011 + 0.1254 \approx 0.56265 \\
\end{aligned}
$$

$$
x W_1 \approx [0.33643, 0.44954, 0.56265]
$$

2. **Apply ReLU:**

$$
\text{ReLU}(x W_1) = [\max(0, 0.33643), \max(0, 0.44954), \max(0, 0.56265)] = [0.33643, 0.44954, 0.56265]
$$

3. **Compute $\text{ReLU}(x W_1) W_2$:**

$$
\text{ReLU}(x W_1) W_2 = [0.33643, 0.44954, 0.56265] \begin{bmatrix}
0.2 & 0.4 & 0.6 & 0.8 \\
0.1 & 0.3 & 0.5 & 0.7 \\
0.0 & 0.2 & 0.4 & 0.6 \\
\end{bmatrix}
$$

Calculate each component:

$$
\begin{aligned}
& o_1 = 0.33643 \times 0.2 + 0.44954 \times 0.1 + 0.56265 \times 0.0 = 0.067286 + 0.044954 + 0 = 0.11224 \\
& o_2 = 0.33643 \times 0.4 + 0.44954 \times 0.3 + 0.56265 \times 0.2 = 0.134572 + 0.134862 + 0.11253 \approx 0.3820 \\
& o_3 = 0.33643 \times 0.6 + 0.44954 \times 0.5 + 0.56265 \times 0.4 = 0.201858 + 0.22477 + 0.22506 \approx 0.6517 \\
& o_4 = 0.33643 \times 0.8 + 0.44954 \times 0.7 + 0.56265 \times 0.6 = 0.269144 + 0.314678 + 0.33759 \approx 0.9214 \\
\end{aligned}
$$

$$
\text{FFNOutput}_1 \approx [0.11224, 0.3820, 0.6517, 0.9214]
$$

Repeat similarly for the **second** and **third tokens**:

$$
\text{FFNOutput} \approx \begin{bmatrix}
0.11224 & 0.3820 & 0.6517 & 0.9214 \\
0.1105  & 0.3752 & 0.6431 & 0.9108 \\
0.1130  & 0.3798 & 0.6485 & 0.9150 \\
\end{bmatrix} \in \mathbb{R}^{3 \times 4}
$$
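
Hand arithmetic of this kind is easy to check with a short script. The sketch below applies the bias-free FFN to the first token only; the helper names (`relu`, `vecmat`, `ffn`) are illustrative.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def vecmat(v, M):
    """Row vector times matrix (matrix stored as a list of rows)."""
    return [sum(a * b for a, b in zip(v, col)) for col in zip(*M)]

def ffn(x, W_1, W_2):
    """Bias-free position-wise FFN: ReLU(x W_1) W_2."""
    return vecmat(relu(vecmat(x, W_1)), W_2)

W_1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [0.1, 0.3, 0.5]]
W_2 = [[0.2, 0.4, 0.6, 0.8], [0.1, 0.3, 0.5, 0.7], [0.0, 0.2, 0.4, 0.6]]
x_1 = [0.0794, 0.2722, 0.2779, 0.2508]       # first row of AttOutput

hidden = relu(vecmat(x_1, W_1))              # shape 1 x 3
output = ffn(x_1, W_1, W_2)                  # shape 1 x 4
print([round(v, 5) for v in hidden])
print([round(v, 4) for v in output])
```

The same function is applied to every row independently, which is what "position-wise" means.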

### 2. Output Projection and Loss Calculation

#### 1. Final Linear Projection

To produce predictions for the next token, apply a linear projection followed by a softmax to obtain a probability distribution over the vocabulary.

Assume the vocabulary size $ V = 5 $, and $ W_{\text{out}} $ is defined as:

$$
W_{\text{out}} = \begin{bmatrix}
0.2 & 0.1 & 0.0 & 0.3 & 0.1 \\
0.0 & 0.5 & 0.1 & 0.0 & 0.2 \\
0.1 & 0.1 & 0.3 & 0.3 & 0.1 \\
0.2 & 0.2 & 0.0 & 0.1 & 0.0 
\end{bmatrix} \in \mathbb{R}^{4 \times 5}
$$

For the **third token** ($ t = 3 $), compute the logits $ Z_3 $:

$$
Z_3 = \text{FFNOutput}_3 W_{\text{out}} \in \mathbb{R}^{1 \times 5}
$$

Given:

$$
\text{FFNOutput}_3 = [0.1130, 0.3798, 0.6485, 0.9150]
$$

Compute each component:

$$
\begin{aligned}
& Z_3[1] = 0.1130 \times 0.2 + 0.3798 \times 0.0 + 0.6485 \times 0.1 + 0.9150 \times 0.2 = 0.0226 + 0 + 0.06485 + 0.1830 = 0.27045 \\
& Z_3[2] = 0.1130 \times 0.1 + 0.3798 \times 0.5 + 0.6485 \times 0.1 + 0.9150 \times 0.2 = 0.0113 + 0.1899 + 0.06485 + 0.1830 = 0.44805 \\
& Z_3[3] = 0.1130 \times 0.0 + 0.3798 \times 0.1 + 0.6485 \times 0.3 + 0.9150 \times 0.0 = 0 + 0.03798 + 0.19455 + 0 = 0.23253 \\
& Z_3[4] = 0.1130 \times 0.3 + 0.3798 \times 0.0 + 0.6485 \times 0.3 + 0.9150 \times 0.1 = 0.0339 + 0 + 0.19455 + 0.0915 = 0.3200 \\
& Z_3[5] = 0.1130 \times 0.1 + 0.3798 \times 0.2 + 0.6485 \times 0.1 + 0.9150 \times 0.0 = 0.0113 + 0.07596 + 0.06485 + 0 = 0.1521 \\
\end{aligned}
$$

$$
Z_3 \approx [0.27045, \; 0.44805, \; 0.23253, \; 0.3200, \; 0.1521]
$$

#### 2. Softmax for Next-Token Prediction

Apply the softmax function to convert logits into probabilities:

$$
p_{\theta}(w_4 \mid w_1, w_2, w_3) = \text{softmax}(Z_3)
$$

Compute exponentials:

$$
\begin{aligned}
& \exp(Z_3[1]) = \exp(0.27045) \approx 1.310 \\
& \exp(Z_3[2]) = \exp(0.44805) \approx 1.565 \\
& \exp(Z_3[3]) = \exp(0.23253) \approx 1.262 \\
& \exp(Z_3[4]) = \exp(0.3200) \approx 1.377 \\
& \exp(Z_3[5]) = \exp(0.1521) \approx 1.164 \\
\end{aligned}
$$

Sum of exponentials:

$$
S = 1.310 + 1.565 + 1.262 + 1.377 + 1.164 \approx 6.678
$$

Compute probabilities:

$$
\begin{aligned}
& p(w_4=1) = \frac{\exp(Z_3[1])}{S} \approx \frac{1.310}{6.678} \approx 0.196 \\
& p(w_4=2) = \frac{\exp(Z_3[2])}{S} \approx \frac{1.565}{6.678} \approx 0.235 \\
& p(w_4=3) = \frac{\exp(Z_3[3])}{S} \approx \frac{1.262}{6.678} \approx 0.189 \\
& p(w_4=4) = \frac{\exp(Z_3[4])}{S} \approx \frac{1.377}{6.678} \approx 0.206 \\
& p(w_4=5) = \frac{\exp(Z_3[5])}{S} \approx \frac{1.164}{6.678} \approx 0.175 \\
\end{aligned}
$$

Thus:

$$
p_{\theta}(w_4 \mid w_1, w_2, w_3) \approx [0.196, \; 0.235, \; 0.189, \; 0.206, \; 0.175]
$$

## Target Labels and Loss Calculation

During training, each input sequence has a corresponding target sequence. The goal is for the model to predict the next token in the sequence. 

**Example:**

Assume the correct next token ($ w_4 $) is token **2**.

### 1. One-Hot Encoding of Target

The target vector $ y $ is a one-hot encoded vector indicating the correct next token:

$$
y = [0, 1, 0, 0, 0]
$$

### 2. Cross-Entropy Loss

The loss function commonly used is the cross-entropy loss, which measures the difference between the predicted probability distribution $ \hat{p} $ and the target distribution $ y $:

$$
\mathcal{L}(\theta) = -\sum_{i=1}^{V} y_i \log \hat{p}_i
$$

Given $ y = [0, 1, 0, 0, 0] $ and $ \hat{p} = [0.196, \; 0.235, \; 0.189, \; 0.206, \; 0.175] $:

$$
\mathcal{L}(\theta) = -\log(\hat{p}_2) = -\log(0.235) \approx 1.448
$$

## Backpropagation and Parameter Updates

To minimize the loss $ \mathcal{L}(\theta) $, we perform backpropagation to compute gradients and update the model parameters accordingly.

### 1. Gradient with respect to Logits ($ Z_3 $)

The gradient of the loss with respect to each logit $ Z_3[i] $ is:

$$
\frac{\partial \mathcal{L}}{\partial Z_3[i]} = \hat{p}_i - y_i
$$

For our example:

$$
\frac{\partial \mathcal{L}}{\partial Z_3} = [0.196, \; 0.235 - 1, \; 0.189, \; 0.206, \; 0.175] = [0.196, \; -0.765, \; 0.189, \; 0.206, \; 0.175]
$$
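
The identity $ \partial \mathcal{L} / \partial Z_3 = \hat{p} - y $ can be checked numerically with central differences. This is a sketch under stated assumptions: the function names and the step size `h` are illustrative choices, and the logits are those of the example.

```python
import math

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, target):
    return -math.log(softmax(logits)[target])

Z_3 = [0.27045, 0.44805, 0.23253, 0.3200, 0.1521]
target = 1                                    # token 2, zero-based

probs = softmax(Z_3)
analytic = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Central differences: nudge one logit at a time and watch the loss change.
h = 1e-6
numeric = []
for i in range(len(Z_3)):
    up = Z_3[:i] + [Z_3[i] + h] + Z_3[i + 1:]
    down = Z_3[:i] + [Z_3[i] - h] + Z_3[i + 1:]
    numeric.append((loss(up, target) - loss(down, target)) / (2 * h))

print(max(abs(a - n) for a, n in zip(analytic, numeric)))   # close to zero
```

The entries of the analytic gradient also sum to zero, because both $ \hat{p} $ and $ y $ sum to one.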

### 2. Gradient with respect to Output Weights ($ W_{\text{out}} $)

The gradient of the loss with respect to $ W_{\text{out}} $ is:

$$
\frac{\partial \mathcal{L}}{\partial W_{\text{out}}} = \text{FFNOutput}_3^\top \cdot \frac{\partial \mathcal{L}}{\partial Z_3}
$$

Given:

$$
\text{FFNOutput}_3 = [0.1130, \; 0.3798, \; 0.6485, \; 0.9150] \in \mathbb{R}^{1 \times 4}
$$

$$
\frac{\partial \mathcal{L}}{\partial Z_3} = [0.196, \; -0.765, \; 0.189, \; 0.206, \; 0.175] \in \mathbb{R}^{1 \times 5}
$$

Compute each element of $ \frac{\partial \mathcal{L}}{\partial W_{\text{out}}} $:

$$
\frac{\partial \mathcal{L}}{\partial W_{\text{out}}} = \begin{bmatrix}
0.1130 \times 0.196 & 0.1130 \times (-0.765) & 0.1130 \times 0.189 & 0.1130 \times 0.206 & 0.1130 \times 0.175 \\
0.3798 \times 0.196 & 0.3798 \times (-0.765) & 0.3798 \times 0.189 & 0.3798 \times 0.206 & 0.3798 \times 0.175 \\
0.6485 \times 0.196 & 0.6485 \times (-0.765) & 0.6485 \times 0.189 & 0.6485 \times 0.206 & 0.6485 \times 0.175 \\
0.9150 \times 0.196 & 0.9150 \times (-0.765) & 0.9150 \times 0.189 & 0.9150 \times 0.206 & 0.9150 \times 0.175 \\
\end{bmatrix}
$$

Calculating each element:

$$
\frac{\partial \mathcal{L}}{\partial W_{\text{out}}} \approx \begin{bmatrix}
0.022148 & -0.086445 & 0.021357 & 0.023278 & 0.019775 \\
0.074441 & -0.290547 & 0.071782 & 0.078239 & 0.066465 \\
0.127106 & -0.496103 & 0.122567 & 0.133591 & 0.113488 \\
0.179340 & -0.699975 & 0.172935 & 0.188490 & 0.160125 \\
\end{bmatrix} \in \mathbb{R}^{4 \times 5}
$$

### 3. Gradient with respect to FFN Output ($ \text{FFNOutput}_3 $)

The gradient of the loss with respect to $ \text{FFNOutput}_3 $ is:

$$
\frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3} = \frac{\partial \mathcal{L}}{\partial Z_3} W_{\text{out}}^\top
$$

Given:

$$
W_{\text{out}}^\top = \begin{bmatrix}
0.2 & 0.0 & 0.1 & 0.2 \\
0.1 & 0.5 & 0.1 & 0.2 \\
0.0 & 0.1 & 0.3 & 0.0 \\
0.3 & 0.0 & 0.3 & 0.1 \\
0.1 & 0.2 & 0.1 & 0.0 \\
\end{bmatrix} \in \mathbb{R}^{5 \times 4}
$$

Each component of $ \frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3} $ is the dot product of $ \frac{\partial \mathcal{L}}{\partial Z_3} $ with one column of $ W_{\text{out}}^\top $ (equivalently, one row of $ W_{\text{out}} $):

$$
\begin{aligned}
& \frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3}[1] = 0.196 \times 0.2 + (-0.765) \times 0.1 + 0.189 \times 0.0 + 0.206 \times 0.3 + 0.175 \times 0.1 \\
& \quad = 0.0392 - 0.0765 + 0 + 0.0618 + 0.0175 = 0.0420 \\
& \frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3}[2] = 0.196 \times 0.0 + (-0.765) \times 0.5 + 0.189 \times 0.1 + 0.206 \times 0.0 + 0.175 \times 0.2 \\
& \quad = 0 - 0.3825 + 0.0189 + 0 + 0.0350 = -0.3286 \\
& \frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3}[3] = 0.196 \times 0.1 + (-0.765) \times 0.1 + 0.189 \times 0.3 + 0.206 \times 0.3 + 0.175 \times 0.1 \\
& \quad = 0.0196 - 0.0765 + 0.0567 + 0.0618 + 0.0175 = 0.0791 \\
& \frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3}[4] = 0.196 \times 0.2 + (-0.765) \times 0.2 + 0.189 \times 0.0 + 0.206 \times 0.1 + 0.175 \times 0.0 \\
& \quad = 0.0392 - 0.1530 + 0 + 0.0206 + 0 = -0.0932 \\
\end{aligned}
$$

$$
\frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3} \approx [0.0420, \; -0.3286, \; 0.0791, \; -0.0932] \in \mathbb{R}^{1 \times 4}
$$

### 4. Gradient Through FFN

#### 4.1 Gradient Through $ W_2 $

The loss here is evaluated only at the third position, so the gradient reaches the FFN through the third token alone. Let $ x_3 = \text{AttOutput}_3 = [0.08, \; 0.2714, \; 0.2763, \; 0.2505] $ be that token's FFN input. The gradient with respect to $ W_2 $ is:

$$
\frac{\partial \mathcal{L}}{\partial W_2} = \text{ReLU}(x_3 W_1)^\top \cdot \frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3}
$$

Given (computing $ x_3 W_1 $ exactly as for the first token):

$$
\text{ReLU}(x_3 W_1) \approx [0.33502, \; 0.44789, \; 0.56076] \in \mathbb{R}^{1 \times 3}
$$

$$
\frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3} = [0.0420, \; -0.3286, \; 0.0791, \; -0.0932] \in \mathbb{R}^{1 \times 4}
$$

Compute each element of $ \frac{\partial \mathcal{L}}{\partial W_2} $ as an outer product:

$$
\frac{\partial \mathcal{L}}{\partial W_2} = \begin{bmatrix}
0.33502 \times 0.0420 & 0.33502 \times (-0.3286) & 0.33502 \times 0.0791 & 0.33502 \times (-0.0932) \\
0.44789 \times 0.0420 & 0.44789 \times (-0.3286) & 0.44789 \times 0.0791 & 0.44789 \times (-0.0932) \\
0.56076 \times 0.0420 & 0.56076 \times (-0.3286) & 0.56076 \times 0.0791 & 0.56076 \times (-0.0932) \\
\end{bmatrix}
$$

$$
\frac{\partial \mathcal{L}}{\partial W_2} \approx \begin{bmatrix}
0.0141 & -0.1101 & 0.0265 & -0.0312 \\
0.0188 & -0.1472 & 0.0354 & -0.0417 \\
0.0236 & -0.1843 & 0.0444 & -0.0523 \\
\end{bmatrix} \in \mathbb{R}^{3 \times 4}
$$

#### 4.2 Gradient Through $ W_1 $

The gradient with respect to $ W_1 $ follows from the chain rule through $ W_2 $ and the ReLU, where $ \odot $ denotes element-wise multiplication:

$$
\frac{\partial \mathcal{L}}{\partial W_1} = x_3^\top \cdot \left[ \left( \frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3} W_2^\top \right) \odot \text{ReLU}'(x_3 W_1) \right]
$$

Since all pre-activations are positive, $ \text{ReLU}'(x_3 W_1) = [1, \; 1, \; 1] $ and the expression reduces to:

$$
\frac{\partial \mathcal{L}}{\partial W_1} = x_3^\top \cdot \left( \frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3} W_2^\top \right)
$$

Given:

$$
W_2^\top = \begin{bmatrix}
0.2 & 0.1 & 0.0 \\
0.4 & 0.3 & 0.2 \\
0.6 & 0.5 & 0.4 \\
0.8 & 0.7 & 0.6 \\
\end{bmatrix} \in \mathbb{R}^{4 \times 3}
$$

Compute the intermediate product, one component per column of $ W_2^\top $:

$$
\begin{aligned}
& \text{Component 1} = 0.0420 \times 0.2 + (-0.3286) \times 0.4 + 0.0791 \times 0.6 + (-0.0932) \times 0.8 \\
& \quad = 0.0084 - 0.13144 + 0.04746 - 0.07456 \approx -0.1501 \\
& \text{Component 2} = 0.0420 \times 0.1 + (-0.3286) \times 0.3 + 0.0791 \times 0.5 + (-0.0932) \times 0.7 \\
& \quad = 0.0042 - 0.09858 + 0.03955 - 0.06524 \approx -0.1201 \\
& \text{Component 3} = 0.0420 \times 0.0 + (-0.3286) \times 0.2 + 0.0791 \times 0.4 + (-0.0932) \times 0.6 \\
& \quad = 0 - 0.06572 + 0.03164 - 0.05592 \approx -0.0900 \\
\end{aligned}
$$

$$
\frac{\partial \mathcal{L}}{\partial \text{FFNOutput}_3} W_2^\top \approx [-0.1501, \; -0.1201, \; -0.0900] \in \mathbb{R}^{1 \times 3}
$$

Now, compute $ \frac{\partial \mathcal{L}}{\partial W_1} $ as the outer product of $ x_3^\top \in \mathbb{R}^{4 \times 1} $ with this row vector:

$$
\frac{\partial \mathcal{L}}{\partial W_1} = \begin{bmatrix}
0.08 \times (-0.1501) & 0.08 \times (-0.1201) & 0.08 \times (-0.0900) \\
0.2714 \times (-0.1501) & 0.2714 \times (-0.1201) & 0.2714 \times (-0.0900) \\
0.2763 \times (-0.1501) & 0.2763 \times (-0.1201) & 0.2763 \times (-0.0900) \\
0.2505 \times (-0.1501) & 0.2505 \times (-0.1201) & 0.2505 \times (-0.0900) \\
\end{bmatrix}
$$

$$
\frac{\partial \mathcal{L}}{\partial W_1} \approx \begin{bmatrix}
-0.0120 & -0.0096 & -0.0072 \\
-0.0407 & -0.0326 & -0.0244 \\
-0.0415 & -0.0332 & -0.0249 \\
-0.0376 & -0.0301 & -0.0225 \\
\end{bmatrix} \in \mathbb{R}^{4 \times 3}
$$
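
One way to validate a hand-derived gradient of this kind is to compare it with a finite-difference estimate. The sketch below backpropagates through the bias-free FFN and the output projection for the third token, then checks a single entry of $ \partial \mathcal{L} / \partial W_1 $. The function names and the step size `eps` are assumptions, and the script recomputes the forward pass at full precision instead of using the rounded intermediate values.

```python
import math

def vecmat(v, M):
    return [sum(a * b for a, b in zip(v, col)) for col in zip(*M)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, W_1, W_2, W_out, target):
    pre = vecmat(x, W_1)
    hidden = [max(0.0, v) for v in pre]
    probs = softmax(vecmat(vecmat(hidden, W_2), W_out))
    return pre, hidden, probs, -math.log(probs[target])

W_1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [0.1, 0.3, 0.5]]
W_2 = [[0.2, 0.4, 0.6, 0.8], [0.1, 0.3, 0.5, 0.7], [0.0, 0.2, 0.4, 0.6]]
W_out = [[0.2, 0.1, 0.0, 0.3, 0.1], [0.0, 0.5, 0.1, 0.0, 0.2],
         [0.1, 0.1, 0.3, 0.3, 0.1], [0.2, 0.2, 0.0, 0.1, 0.0]]
x, target = [0.08, 0.2714, 0.2763, 0.2505], 1     # third token, label = token 2

pre, hidden, probs, loss = forward(x, W_1, W_2, W_out, target)
d_logits = [p - (i == target) for i, p in enumerate(probs)]            # p - y
d_ffn = [sum(g * w for g, w in zip(d_logits, row)) for row in W_out]   # dZ W_out^T
d_W2 = [[h * g for g in d_ffn] for h in hidden]                        # outer product
d_hidden = [sum(g * w for g, w in zip(d_ffn, row)) for row in W_2]     # d_ffn W_2^T
d_pre = [g if p > 0 else 0.0 for g, p in zip(d_hidden, pre)]           # ReLU'
d_W1 = [[xi * g for g in d_pre] for xi in x]                           # outer product

# Finite-difference check on a single entry of W_1.
eps = 1e-6
W_1[1][2] += eps
loss_up = forward(x, W_1, W_2, W_out, target)[3]
W_1[1][2] -= 2 * eps
loss_down = forward(x, W_1, W_2, W_out, target)[3]
W_1[1][2] += eps                                   # restore the original value
numeric = (loss_up - loss_down) / (2 * eps)
print(abs(numeric - d_W1[1][2]))                   # close to zero
```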

### 5. Parameter Updates

After computing the gradients for all parameters, update them using an optimization algorithm like **Gradient Descent** or **Adam**.

#### Gradient Descent Update Rule

$$
\theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}(\theta)
$$

Where:
- $ \theta $ represents all model parameters ($ W_Q, W_K, W_V, W_{\text{out}}, W_1, W_2, \ldots $)
- $ \eta $ is the learning rate
- $ \nabla_{\theta} \mathcal{L}(\theta) $ is the gradient of the loss with respect to the parameters

**Example Update for $ W_{\text{out}} $:**

Assume a learning rate $ \eta = 0.01 $.

$$
W_{\text{out}}^{\text{new}} = W_{\text{out}} - 0.01 \times \frac{\partial \mathcal{L}}{\partial W_{\text{out}}}
$$

Each element of $ W_{\text{out}} $ is updated accordingly.

**Example Update for $ W_1 $:**

$$
W_1^{\text{new}} = W_1 - 0.01 \times \frac{\partial \mathcal{L}}{\partial W_1}
$$

Similarly, update $ W_Q, W_K, W_V, W_2 $, etc.
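
To make the difference between the two optimizers concrete, the sketch below applies one plain gradient-descent step and one Adam step to a single row of weights. The gradient values are made up for illustration, and the Adam hyperparameters (`beta1`, `beta2`, `eps`) are the commonly used defaults, not values taken from this notebook.

```python
import math

def sgd_step(params, grads, lr):
    """Plain gradient descent: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(params, grads)]

def adam_step(params, grads, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on flat lists; returns (params, m, v)."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grads)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grads)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]        # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    params = [p - lr * mh / (math.sqrt(vh) + eps)
              for p, mh, vh in zip(params, m_hat, v_hat)]
    return params, m, v

row = [0.2, 0.1, 0.0, 0.3, 0.1]              # first row of W_out
grad = [0.02, -0.09, 0.02, 0.02, 0.02]       # hypothetical gradient for that row

sgd_row = sgd_step(row, grad, lr=0.01)
adam_row, m, v = adam_step(row, grad, m=[0.0] * 5, v=[0.0] * 5, t=1)
print(sgd_row)
print(adam_row)
```

On its first step Adam moves every weight by about the learning rate, whatever the gradient's magnitude, while plain gradient descent scales the step by the gradient itself.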

### 6. Iterative Training Process

The training process involves iterating over numerous input sequences (or mini-batches), performing forward and backward passes, and updating the model parameters to minimize the loss.

**Training Steps:**

1. **Forward Pass:**
   - Embed input tokens and add positional encodings.
   - Pass embeddings through multi-head attention and FFN within Transformer layers.
   - Project to output logits and compute softmax probabilities.
   - Calculate loss using target labels.

2. **Backward Pass:**
   - Compute gradients of the loss with respect to all parameters via backpropagation.

3. **Parameter Update:**
   - Adjust parameters in the direction that minimizes the loss using the optimization algorithm.

4. **Repeat:**
   - Continue this process for many examples and epochs until the model converges (i.e., the loss stabilizes or decreases below a threshold).
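
The four training steps can be sketched as a loop. To stay short, this toy version trains only $ W_{\text{out}} $ on the single example from the text and treats `FFNOutput_3` as a fixed feature vector; the learning rate and the number of steps are arbitrary choices for illustration.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def vecmat(v, M):
    return [sum(a * b for a, b in zip(v, col)) for col in zip(*M)]

features = [0.1130, 0.3798, 0.6485, 0.9150]      # FFNOutput_3 from the text
target = 1                                        # token 2, zero-based
W_out = [[0.2, 0.1, 0.0, 0.3, 0.1], [0.0, 0.5, 0.1, 0.0, 0.2],
         [0.1, 0.1, 0.3, 0.3, 0.1], [0.2, 0.2, 0.0, 0.1, 0.0]]
lr = 0.5                                          # arbitrary, chosen for a short demo

losses = []
for step in range(20):
    probs = softmax(vecmat(features, W_out))                      # 1. forward pass
    losses.append(-math.log(probs[target]))                       #    loss
    d_logits = [p - (i == target) for i, p in enumerate(probs)]   # 2. backward pass
    for r, f in enumerate(features):                              # 3. parameter update
        for c, g in enumerate(d_logits):
            W_out[r][c] -= lr * f * g                             # 4. repeat

print(round(losses[0], 3), round(losses[-1], 3))
```

The loss falls at every step because each update raises the logit of the correct token and lowers the others.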

## Inference (Generation Phase)

After training, the model can generate new text by performing forward passes without updating parameters.

### 1. Given a Prompt

Suppose the trained model receives the prompt:

$$
(w_1, w_2, w_3) = (\text{"The"}, \text{"cat"}, \text{"sat"})
$$

### 2. Forward Pass to Predict Next Token

1. **Embedding and Positional Encoding:** Embed the prompt tokens and add positional encodings to obtain the input matrix $ X \in \mathbb{R}^{3 \times 4} $, exactly as in the training forward pass.

2. **Multi-Head Attention and FFN:**

$$
\text{AttOutput} \approx \begin{bmatrix}
0.0794 & 0.2722 & 0.2779 & 0.2508 \\
0.0792 & 0.2721 & 0.2779 & 0.2507 \\
0.08   & 0.2714 & 0.2763 & 0.2505 \\
\end{bmatrix} \in \mathbb{R}^{3 \times 4}
$$

$$
\text{FFNOutput} \approx \begin{bmatrix}
0.11224 & 0.3820 & 0.6517 & 0.9214 \\
0.1105  & 0.3752 & 0.6431 & 0.9108 \\
0.1130  & 0.3798 & 0.6485 & 0.9150 \\
\end{bmatrix} \in \mathbb{R}^{3 \times 4}
$$

3. **Output Projection:**

For the **third token** ($ t = 3 $):

$$
Z_3 \approx [0.27045, \; 0.44805, \; 0.23253, \; 0.3200, \; 0.1521] \in \mathbb{R}^{1 \times 5}
$$

4. **Softmax to Get Probabilities:**

$$
p_{\theta}(w_4 \mid w_{1:3}) \approx [0.196, \; 0.235, \; 0.189, \; 0.206, \; 0.175]
$$

### 3. Sampling the Next Token

Based on the probability distribution:

$$
p_{\theta}(w_4 \mid w_{1:3}) \approx [0.196, \; 0.235, \; 0.189, \; 0.206, \; 0.175]
$$

- **Greedy Decoding:** Choose the token with the highest probability.

$$
w_4 = \text{argmax}(p_{\theta}(w_4 \mid w_{1:3})) = 2 \quad (\text{"cat"})
$$

- **Stochastic Sampling:** Sample a token based on the probability distribution.

### 4. Iterative Generation

Append $ w_4 $ to the sequence and repeat the forward pass to generate $ w_5 $:

$$
(w_1, w_2, w_3, w_4) = (\text{"The"}, \text{"cat"}, \text{"sat"}, \text{"cat"})
$$

Continue this process until reaching a termination condition (e.g., maximum length, end-of-sequence token).
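
The generation loop can be sketched as follows. `toy_next_token_probs` is a hypothetical stand-in for the trained model's forward pass, and the end-of-sequence id `EOS = 5` is likewise an assumption; only the predict, append, repeat structure is the point.

```python
EOS = 5   # hypothetical end-of-sequence token id

def toy_next_token_probs(tokens, vocab_size=5):
    """Hypothetical stand-in for a trained model's forward pass.

    Puts most of the probability on the id that follows the last token.
    """
    favored = min(tokens[-1] + 1, vocab_size)
    return [0.6 if token == favored else 0.1 for token in range(1, vocab_size + 1)]

def generate(prompt, next_token_probs, max_new_tokens=10):
    """Greedy autoregressive decoding: predict, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        next_token = max(range(len(probs)), key=probs.__getitem__) + 1   # 1-based id
        tokens.append(next_token)
        if next_token == EOS:            # termination condition
            break
    return tokens

print(generate([1, 2, 3], toy_next_token_probs))   # [1, 2, 3, 4, 5]
```

Both termination conditions from the text appear here: the loop stops at the end-of-sequence token or after `max_new_tokens` steps, whichever comes first.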

## Summary of Mathematical Operations

1. **Forward Pass:**
   - **Embedding:** $ X = \text{Embeddings}(w_1, w_2, w_3) $
   - **Multi-Head Attention:** Compute $ \text{AttOutput} $
   - **Feed-Forward Network:** Compute $ \text{FFNOutput} $
   - **Output Projection:** Compute $ Z_t = \text{FFNOutput}_t W_{\text{out}} $
   - **Softmax:** Compute $ p_{\theta}(w_{t+1} \mid w_{1:t}) $

2. **Loss Calculation (Training):**
   - **Cross-Entropy Loss:** $ \mathcal{L}(\theta) = -\log(p_{\theta}(w_{t+1} \mid w_{1:t})) $

3. **Backward Pass (Training):**
   - **Compute Gradients:** $ \nabla_{\theta} \mathcal{L}(\theta) $ via backpropagation.
   - **Update Parameters:** $ \theta \leftarrow \theta - \eta \nabla_{\theta} \mathcal{L}(\theta) $

4. **Inference (Generation):**
   - **Generate Next Token:** Sample $ w_{t+1} $ from $ p_{\theta}(w_{t+1} \mid w_{1:t}) $
   - **Repeat:** Append $ w_{t+1} $ and continue.

## Conclusion

This detailed mathematical walkthrough illustrates how a Transformer-based language model is trained and used for inference:

1. **Transformer Layer:**
   - **Multi-Head Attention:** Computes context-aware representations by attending to different parts of the input sequence.
   - **Feed-Forward Network:** Applies non-linear transformations to each position independently.
   - **Residual Connections and Layer Normalization:** Stabilize training and facilitate gradient flow.

2. **Training Process:**
   - **Forward Pass:** Compute predictions using the Transformer layers.
   - **Loss Calculation:** Measure discrepancy between predictions and true labels using cross-entropy loss.
   - **Backward Pass:** Compute gradients via backpropagation.
   - **Parameter Update:** Adjust model parameters to minimize loss using gradient descent or optimizers like Adam.

3. **Inference:**
   - **Prompting:** Provide an initial sequence.
   - **Generating Tokens:** Use the trained model to predict and sample the next token iteratively.

Through end-to-end training, the model learns to adjust $ W_Q $, $ W_K $, and $ W_V $ (along with all other parameters) to produce meaningful attention distributions that enhance its ability to predict and generate coherent and contextually relevant text.



In [14]:
import torch
import torch.nn as nn
import numpy as np

# Step 1: Generate a simple corpus (same as before)
words = ["apple", "banana", "cherry", "dog", "elephant", "fish", "grape", "house", "ice", "jungle", "kite", "lemon"]
num_sentences = 5
max_sentence_length = 6

# Generate random sentences
corpus = []
for _ in range(num_sentences):
    sentence_length = np.random.randint(3, max_sentence_length)
    sentence = np.random.choice(words, sentence_length, replace=True)
    corpus.append(sentence)

# Step 2: Create a token-to-index dictionary for words in the corpus
word_to_index = {word: idx for idx, word in enumerate(words)}

# Step 3: Convert words in the corpus to indices using the dictionary
corpus_indices = [[word_to_index[word] for word in sentence] for sentence in corpus]

# Step 4: Define an embedding layer
embedding_dim = 512
vocab_size = len(words)
token_embedding_table = nn.Embedding(vocab_size, embedding_dim)

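# Step 5: Pass the indices through the embedding layer
# The input tensor must have shape (B, T), where B is batch size and T is sequence length,
# so shorter sentences are padded to the length of the longest one.
# Note: the pad value 0 is also the index of "apple", so padding is indistinguishable from that word here.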
max_length = max(len(sentence) for sentence in corpus_indices)
padded_corpus_indices = [sentence + [0] * (max_length - len(sentence)) for sentence in corpus_indices]

# Convert to a tensor of shape (B, T)
input_tensor = torch.tensor(padded_corpus_indices)

# Step 6: Get the embeddings for the input tensor
embeddings = token_embedding_table(input_tensor)  # This will have shape (B, T, embedding_dim)

# Print the embeddings for each word in the corpus
print("\nWord Embeddings for the Corpus (shape: B, T, embedding_dim):")
B, T = input_tensor.shape  # B = batch size, T = sequence length
print(f"Input Tensor Shape: {input_tensor.shape}")
print(f"Embeddings Shape: {embeddings.shape}")

# Optionally: To view the embeddings of each token in the first sentence
print("\nEmbeddings for the first sentence:")
print(embeddings[0])  # This will show the embeddings for the first sentence



Word Embeddings for the Corpus (shape: B, T, embedding_dim):
Input Tensor Shape: torch.Size([5, 5])
Embeddings Shape: torch.Size([5, 5, 512])

Embeddings for the first sentence:
tensor([[ 0.7161,  0.5239,  1.9768,  ..., -1.5047,  0.8310, -0.2109],
        [ 0.3967, -0.8570, -0.6796,  ..., -1.0295,  0.6432, -0.0169],
        [-0.1029, -0.2704, -1.1114,  ..., -1.3107, -0.2067, -1.4294],
        [-0.0183, -0.7638, -0.5150,  ..., -0.0061, -0.2643, -0.8754],
        [-0.0183, -0.7638, -0.5150,  ..., -0.0061, -0.2643, -0.8754]],
       grad_fn=<SelectBackward0>)
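
The cell above pads with index 0, which `word_to_index` also assigns to "apple". One possible way to keep padding distinct, sketched below, is to reserve index 0 for a dedicated pad token and pass `padding_idx` to `nn.Embedding`. The `<pad>` token name, the shortened word list, and the two example sentences are assumptions made for illustration.

```python
import torch
import torch.nn as nn

words = ["apple", "banana", "cherry"]  # shortened word list for the example
PAD = "<pad>"                          # hypothetical padding token

# Reserve index 0 for padding and shift the real words up by one
word_to_index = {PAD: 0, **{w: i + 1 for i, w in enumerate(words)}}

sentences = [["apple", "banana", "cherry"], ["banana"]]
max_length = max(len(s) for s in sentences)
padded = [
    [word_to_index[w] for w in s] + [word_to_index[PAD]] * (max_length - len(s))
    for s in sentences
]
input_tensor = torch.tensor(padded)  # (B, T)

# padding_idx initializes that row to zeros and excludes it from gradient updates
embedding = nn.Embedding(len(word_to_index), 8, padding_idx=word_to_index[PAD])
embeddings = embedding(input_tensor)  # (B, T, 8)

# Boolean mask marking real tokens, useful for ignoring padding later
pad_mask = input_tensor != word_to_index[PAD]
print(pad_mask)
```
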


In [18]:
torch.tensor(padded_corpus_indices).shape

torch.Size([5, 5])

In [29]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import numpy as np

# Step 1: Generate a simple corpus (same as before)
words = ["apple", "banana", "cherry", "dog", "elephant", "fish", "grape", "house", "ice", "jungle", "kite", "lemon"]
num_sentences = 5
max_sentence_length = 6

# Generate random sentences
corpus = []
for _ in range(num_sentences):
    sentence_length = np.random.randint(3, max_sentence_length)
    sentence = np.random.choice(words, sentence_length, replace=True)
    corpus.append(sentence)

# Step 2: Create a token-to-index dictionary for words in the corpus
word_to_index = {word: idx for idx, word in enumerate(words)}

# Step 3: Convert words in the corpus to indices using the dictionary
corpus_indices = [[word_to_index[word] for word in sentence] for sentence in corpus]

# Step 4: Define an embedding layer
embedding_dim = 512
vocab_size = len(words)
token_embedding_table = nn.Embedding(vocab_size, embedding_dim)

max_length = max(len(sentence) for sentence in corpus_indices)
padded_corpus_indices = [sentence + [0] * (max_length - len(sentence)) for sentence in corpus_indices]

# Convert to a tensor of shape (B, T)
input_tensor = torch.tensor(padded_corpus_indices)

# Step 5: Pass the indices through the embedding layer
embeddings = token_embedding_table(input_tensor)

B, T, C = embeddings.shape

logits = embeddings.view(B*T, C)  # Flatten to (B*T, C): one row per token position
targets = torch.randint(0, vocab_size, (B*T,))  # Random targets for illustration, one per position

print(f"Input Shape: {embeddings.shape}")
print(f"Logits Shape: {logits.shape}")
print(f"Targets Shape: {targets.shape}")

Input Shape: torch.Size([5, 5, 512])
Logits Shape: torch.Size([25, 512])
Targets Shape: torch.Size([25])
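
The reshaping in the cell above is the step that usually precedes a cross-entropy loss. As a hedged sketch of how that might look end to end, the block below adds a hypothetical output projection (`lm_head`, not part of the original cell) that maps each 512-dimensional embedding to one score per vocabulary entry, then flattens logits and targets so that every token position contributes one row. Random tensors stand in for the real embeddings and targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

B, T, C, vocab_size = 5, 5, 512, 12   # shapes matching the cell above

embeddings = torch.randn(B, T, C)     # random stand-in for the embedding output
lm_head = nn.Linear(C, vocab_size)    # hypothetical projection to vocabulary logits

logits = lm_head(embeddings)                       # (B, T, vocab_size)
targets = torch.randint(0, vocab_size, (B, T))     # random targets for illustration

# F.cross_entropy expects (N, num_classes) logits and (N,) class indices
loss = F.cross_entropy(
    logits.view(B * T, vocab_size),
    targets.view(B * T),
)

print(f"Logits Shape: {logits.shape}")
print(f"Loss: {loss.item():.4f}")
```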
