# Language Models Explained Visually

Discover comprehensive visual explanations of language models through the following resources:

- [Brendan Bycroft's LLM Visualization](https://bbycroft.net/llm)
- [Polo Club's Transformer Explainer](https://poloclub.github.io/transformer-explainer/)
- [HuggingFace Transformers code for all Tasks](https://huggingface.co/docs/transformers/en/tasks/sequence_classification)
- [HuggingFace Tasks](https://huggingface.co/tasks)


## Transformers Tutorial
- [Transformers from scratch](https://peterbloem.nl/blog/transformers)


# Corpus
    # A collection of text documents used to train a language model. The corpus can be a collection of books, articles, or any text data.
    
    
# Vocabulary
    # The set of unique tokens (words, sub-words, or characters) that a model can understand. The vocabulary is typically derived from
    # the training corpus and includes common words and special tokens like [PAD], [UNK], [CLS], and [SEP].
    
    
# Attention Mechanism
    # A mechanism that allows the model to focus on important words in a sequence, enabling the model to handle long-range dependencies 
    # and capture context.

    # Self-Attention:
        # A mechanism in which every token in a sequence attends to the tokens of that same sequence, so each token's representation reflects its context.
            # Encoder: Uses full attention, allowing each token to attend to every other token in the input sequence.
            # Decoder: Uses causal (autoregressive) masking to prevent tokens from attending to future tokens in the output sequence.

    # Cross-Attention:
        # a mechanism in transformers where one sequence (e.g., the decoder's sequence) attends to another sequence 
        # (e.g., the encoder's output) to extract relevant context.
            # Encoder-Decoder Attention: The decoder attends to the encoder's output to incorporate relevant information.
    
# Tokens
    # The smallest unit of input or output the model processes. Tokens can represent words, sub-words, or even characters, depending 
    # on the tokenization strategy.
    
    
# Tokenization
    # The process of converting raw text into tokens (usually words, sub-words, or characters) that the model can process. 
    # Tokenizers break down text based on a model's vocabulary (e.g., Byte Pair Encoding or WordPiece).
    # Byte Pair Encoding (BPE):
        # A tokenization algorithm that starts from individual characters and iteratively merges the most frequent pair of adjacent symbols 
        # in a corpus to create a vocabulary of variable-length tokens. BPE is widely used in NLP tasks, including machine translation and text generation.
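
The merge loop described above can be sketched in a few lines. This is a minimal illustration rather than a production tokenizer: the toy corpus, its word frequencies, and the helper names `most_frequent_pair` and `merge_pair` are all assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency, each word split into characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)
```

After the first two merges the frequent suffix `est` has become a single token, which is the kind of sub-word unit BPE is meant to discover.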


# Embeddings
    # Dense vector representations of tokens (words/sub-words) that capture semantic meaning. Used in LLMs to map input tokens into 
    # a continuous vector space where similar meanings are close together. Instead of treating each word as a unique, isolated token, 
    # embeddings allow words with similar meanings to be represented by vectors (arrays of numbers) that are close together 
    # in a multi-dimensional space.
    
    # Examples of embedding models: Word2Vec, GloVe, FastText, BERT.
        # Toy illustration with made-up 4-dimensional vectors for the words: 
        # "dog", "cat", "apple", "banana" 
        # "dog" -> [0.1, 0.2, 0.3, 0.4], 
        # "cat" -> [0.2, 0.3, 0.4, 0.5], 
        # "apple" -> [0.3, 0.4, 0.5, 0.6], 
        # "banana" -> [0.4, 0.5, 0.6, 0.7]
        # The embeddings for "dog" and "cat" are closer together than "dog" and "apple" because "dog" and "cat" are semantically
        # similar (both animals) compared to "dog" and "apple" (different categories).
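
As a quick check of the toy vectors above, the sketch below measures closeness with Euclidean distance. The choice of distance is an assumption for illustration; real systems often use cosine similarity, which would not separate these particular vectors well because they point in nearly the same direction.

```python
import math

# The made-up vectors from the example above.
embeddings = {
    "dog": [0.1, 0.2, 0.3, 0.4],
    "cat": [0.2, 0.3, 0.4, 0.5],
    "apple": [0.3, 0.4, 0.5, 0.6],
    "banana": [0.4, 0.5, 0.6, 0.7],
}

def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_dog_cat = euclidean(embeddings["dog"], embeddings["cat"])
d_dog_apple = euclidean(embeddings["dog"], embeddings["apple"])
```

Here `d_dog_cat` comes out smaller than `d_dog_apple`, matching the claim that "dog" sits closer to "cat" than to "apple".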
        
        
# Part-of-Speech (POS) Tagging:
    # Assigning each word in a sentence a grammatical category (e.g., noun, verb, adjective). Helps the model understand the 
    # structure of sentences and the role of each word, which is useful for tasks like parsing, translation, and question answering


# Named Entity Recognition (NER):
    # Identifying and categorizing entities (names, dates, locations, organizations, etc.) in text.


# Stemming or Lemmatization:
    # Reducing words to their base or root form. Stemming is a rule-based process that removes prefixes or suffixes, while lemmatization 
    # uses a vocabulary and morphological analysis to return the base form of a word. 
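    # Examples: a Porter-style stemmer maps "studies" -> "studi", while a lemmatizer maps "studies" -> "study" and "better" -> "good".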
    

# Sampling Techniques
    # Methods used to generate outputs from a model, such as greedy search (selecting the most likely next token), beam search (exploring 
    # multiple token sequences), and temperature sampling (introducing randomness to outputs).


# Beam Search
    # A search strategy used during text generation to explore multiple possible token sequences and select the most likely ones. 
    # It reduces the likelihood of poor-quality outputs compared to greedy search.
    
    
# Greedy Search
    # A simpler search method where the model always selects the most probable next token. It is fast but may lead to less coherent 
    # or repetitive outputs.


# Autoregressive Models
    # Models like GPT that generate text one token at a time, predicting each next token from the tokens generated so far. 
    # This type of model is suitable for tasks like text generation.
    
    
# Masked Language Models (MLM)
    # Models like BERT that learn by predicting masked-out tokens in a sentence, using the surrounding context. These models are 
    # bidirectional, meaning they consider context from both directions.
    
    
# Zero-Shot Learning
    # The model’s ability to perform tasks without explicit examples in the training data. For example, a zero-shot LLM can classify 
    # text without having seen labeled examples for that specific task.
    
    
# Few-Shot Learning
    # The model can generalize from only a few examples during inference. For instance, by providing the model a few sample questions 
    # and answers, it can handle similar tasks effectively.
    
    
# Fine-Tuning vs. Transfer Learning
    # Fine-Tuning: The process of adapting a pretrained LLM to a specific task (e.g., classification, question answering) by training 
        # it further on task-specific labeled data.
    # Transfer Learning: Leveraging knowledge from a pretrained model and applying it to a new but related task, without needing to 
        # retrain from scratch.


# Temperature Sampling
    # A technique used during text generation to control the randomness of the output. Higher temperatures (e.g., 1.0) result in 
    # more diverse outputs, while lower temperatures (e.g., 0.2) make the model more deterministic.
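
A minimal sketch of how temperature reshapes the next-token distribution: the logits are divided by the temperature before the softmax. The logit values and function names are assumptions chosen for illustration.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
p_low = softmax_with_temperature(logits, 0.2)  # sharply peaked on the top token
p_one = softmax_with_temperature(logits, 1.0)  # the unmodified distribution
token = sample_token(logits, 1.0, random.Random(0))
```

With a temperature of 0.2 almost all probability mass lands on the highest-scoring token, so sampling behaves much like greedy search.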





> LLM Training

# 1. PEFT + LoRA (Parameter Efficient Fine-tuning + Low-Rank Adaptation)
    # Description: Freezes the pre-trained weights and trains only small low-rank adapter matrices injected alongside selected weight matrices, conserving memory and improving efficiency.
    # Use Case: Helps in fine-tuning large models on limited hardware, since the original model stays frozen and only a small fraction of parameters is updated.
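
To make the savings concrete, the sketch below compares trainable-parameter counts for one weight matrix when it is updated directly versus through a rank-`r` LoRA pair. The layer size and rank are hypothetical values chosen for illustration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in           # updating W (d_out x d_in) directly
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora

# Hypothetical 4096 x 4096 projection with rank 8.
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
ratio = lora / full
```

For these assumed sizes the adapter trains 1/256 of the parameters that full fine-tuning of the same matrix would.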


# 2. Quantization-Aware Training (QAT)
    # Description: Simulates low-precision arithmetic (e.g., INT8) during training or fine-tuning so the model learns weights that stay accurate after quantization.
    # Benefits: The quantized model is smaller and faster at inference, and usually loses less accuracy than a model quantized only after training.
    # Challenges: Adds training overhead and requires careful monitoring to ensure model quality isn’t degraded.


# 3. Gradient Checkpointing
    # Description: Saves memory by storing only selected activations during the forward pass and recomputing the rest during backpropagation.
    # Use Case: Reduces memory usage at the cost of extra computation, which slows down training.


# 4. Distributed Training
    # Description: Splits the model and data across multiple devices or nodes for faster training.
    # Key Techniques:
        # FSDP (Fully Sharded Data Parallel): Shards model parameters, gradients, and optimizer states across devices.
        # DeepSpeed Zero Redundancy Optimizer (ZeRO): Partitions optimizer states, gradients, and parameters across devices to remove redundant copies and save memory.


> LLM Inference

# 1. Post-Training Quantization (PTQ)
    # Description: Quantizes a model’s weights and activations after training to reduce memory usage.
    # Use Case: Reduces memory footprint for serving models at lower precision (e.g., FP32 → INT8).
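
The FP32 → INT8 idea can be sketched with symmetric "absmax" quantization of a weight list. This is one simple scheme chosen for illustration, not necessarily what any given serving stack uses, and it covers weights only; quantizing activations usually also needs calibration data. The weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.90]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is at most half the scale.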


# 2. Distributed Inference
    # Description: Partitioning model weights across multiple devices to handle large models.
    # Techniques:
        # Model Partitioning: Divides a large model across multiple GPUs or nodes for more efficient computation.
        # In-flight Batching: Enables the processing of new requests while others are still being computed, improving GPU utilization.


# 3. Dynamic Batching & Continuous Batching
    # Description: Dynamically adjusts batch sizes during inference to maximize GPU utilization, reducing latency.
    # Benefits: Ensures high throughput and efficiency, especially for models with varying input lengths.

> Optimization Techniques

# 1. TensorRT-LLM
    # Description: Optimizes models with kernel fusion and memory techniques like KV caching, Paged Attention, and FlashAttention.
    # Benefits: Improves performance but requires conversion into TensorRT format for use.


# 2. vLLM
    # Description: An inference engine that uses Paged Attention to reduce resource wastage, optimizing memory usage and improving throughput.
    # Benefits: High efficiency in processing tokens compared to traditional methods.


# 3. DeepSpeed-FastGen
    # Description: An LLM serving system built on DeepSpeed's inference stack (DeepSpeed-MII and DeepSpeed-Inference) for fast, efficient model serving.
    # Key Features: Supports Dynamic SplitFuse batching, improving latency and throughput for large models.
    

# Key Considerations
    # Memory Constraints: LLM training and inference are memory-intensive processes. Techniques like PEFT, QAT, and gradient 
        # checkpointing can help mitigate memory limitations.
    # Model Size: Models with billions of parameters may require distributed training or inference strategies to handle the memory demands.
    # Efficiency: Methods like mixed precision, distributed training, and dynamic batching are key to improving efficiency in 
        # training and inference.
    # Latency: Techniques like dynamic batching and continuous batching can help reduce inference latency, especially for real-time 
        # applications.
    # Throughput: Distributed inference and model partitioning can improve throughput by leveraging multiple devices for 
        # parallel processing.
    # Resource Optimization: Techniques like TensorRT-LLM and vLLM optimize memory usage and improve performance for large models.
    # Scalability: Distributed training and inference methods enable scaling LLMs to handle larger models and datasets efficiently.
    # Model Serving: Systems like DeepSpeed-FastGen focus on serving large language 
        # models efficiently.
    # Performance Trade-offs: Quantization and distributed strategies may impact model accuracy, so careful monitoring and tuning 
        # are essential to maintain performance.



# Gradient Descent in Machine Learning

Gradient Descent is an optimization algorithm used to minimize the cost (or loss) function in machine learning. It works by iteratively updating the parameters (weights) of the model in the direction that reduces the error (cost) the most, i.e., in the direction of the **negative gradient** of the loss function.

Let’s assume we have a loss function $J(\theta)$, where $\theta$ represents the parameters (weights) of our model. The goal is to minimize this loss function, i.e., find the set of parameters that gives us the lowest possible value for $J(\theta)$.

Gradient Descent works by updating the parameters as follows:

$$
\theta = \theta - \alpha \cdot \nabla_\theta J(\theta)
$$

Where:
- $ \theta $ is the **parameter** (or weight) of the model.
- $ \alpha $ is the **learning rate** (a small positive value, typically between 0 and 1).
- $ \nabla_\theta J(\theta) $ is the **gradient** (the vector of partial derivatives) of the loss function with respect to the parameters $ \theta $.
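
As a small worked example of the update rule, the sketch below minimizes the one-dimensional loss $J(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$. The loss, starting point, and learning rate are assumptions picked so the behaviour is easy to follow.

```python
def gradient_descent(grad, theta, alpha, steps):
    """Repeatedly apply theta <- theta - alpha * grad(theta)."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3) and its minimum at theta = 3.
theta_min = gradient_descent(lambda t: 2 * (t - 3), theta=0.0, alpha=0.1, steps=100)
```

With this learning rate each step shrinks the distance to the minimum by a factor of 0.8, so `theta_min` ends up very close to 3.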


## Loss Function Example: Binary Cross-Entropy (BCE)

For binary classification problems, one commonly used loss function is the **Binary Cross-Entropy (BCE)** loss. The BCE loss function is given by:

$$
\mathcal{L}(\hat{y}, y) = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

Where:
- $ \hat{y}_i $ is the predicted probability for the $ i^{th} $ sample.
- $ y_i $ is the true label (0 or 1) for the $ i^{th} $ sample.
- $ N $ is the number of samples in the dataset.

The predicted value $ \hat{y}_i $ is typically computed using a **sigmoid activation function**:

$$
\hat{y}_i = \sigma(w^T x_i + b)
$$

Where:
- $ w $ are the weights of the model.
- $ x_i $ is the input features of the $ i^{th} $ sample.
- $ b $ is the bias term.
- $ \sigma(z) = \frac{1}{1 + e^{-z}} $ is the sigmoid function.

The goal of training is to minimize this loss function using gradient descent, iteratively adjusting the weights $ w $ and bias $ b $ to reduce the difference between predicted and true labels.


## Explanation:

- **Gradient ($ \nabla_\theta J(\theta) $)**: This points in the direction in which the function $ J(\theta) $ increases the fastest, and its magnitude tells us how steep the slope is at that point. We want to move **in the opposite direction** (down the slope) to minimize the loss.
  
- **Learning rate ($ \alpha $)**: This determines the size of the steps we take in the direction of the gradient. A small learning rate means small steps, and a large learning rate means larger steps. Too small of a learning rate will make the process slow, while too large of a learning rate could cause overshooting and prevent convergence.

- **loss.backward()** - In PyTorch, computes the gradients of the loss with respect to the model's parameters using backpropagation.
- **optimizer.step()** - In PyTorch, updates the model's parameters based on the computed gradients, performing the gradient descent step.



## Calculating Gradients with Respect to Parameters

To update the parameters using gradient descent, we need to calculate the gradient of the loss function with respect to each parameter (weights and biases). Here's how it's done for the Binary Cross-Entropy (BCE) loss in a logistic regression model.

### 1. Binary Cross-Entropy Loss:

For a single sample in a binary classification problem, the BCE loss is:

$$
\mathcal{L}(\hat{y}, y) = - \left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]
$$

Where $ \hat{y} = \sigma(w \cdot x + b) $, and $ \sigma(z) = \frac{1}{1 + e^{-z}} $ is the sigmoid activation function.

### 2. Gradient of Loss with respect to Parameters:

- The gradient of the loss with respect to $ w $ (weight):

$$
\frac{\partial \mathcal{L}}{\partial w} = (\hat{y} - y) \cdot x
$$

- The gradient of the loss with respect to $ b $ (bias):

$$
\frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y
$$

Where:
- $ \hat{y} $ is the predicted probability ($ \hat{y} = \sigma(w \cdot x + b) $).
- $ y $ is the true label (0 or 1).
- $ x $ is the input feature vector.

These gradients are used to update the parameters in the direction that minimizes the loss:

$$
w = w - \alpha \cdot \frac{\partial \mathcal{L}}{\partial w}
$$
$$
b = b - \alpha \cdot \frac{\partial \mathcal{L}}{\partial b}
$$

Where $ \alpha $ is the learning rate.
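
The gradients and update rules above can be put together into a small logistic regression trainer. This is a minimal sketch in plain Python: the one-feature dataset, the learning rate, and the number of epochs are assumptions for illustration, and the gradients are averaged over all samples (batch gradient descent).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """y_hat = sigmoid(w . x + b)"""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def mean_bce(w, b, data, eps=1e-12):
    """Binary cross-entropy averaged over the dataset."""
    total = 0.0
    for x, y in data:
        y_hat = min(max(predict(w, b, x), eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return total / len(data)

def train(data, alpha=0.5, epochs=500):
    n_features = len(data[0][0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in data:
            error = predict(w, b, x) - y            # (y_hat - y)
            for j, xj in enumerate(x):
                grad_w[j] += error * xj / len(data)  # dL/dw = (y_hat - y) * x
            grad_b += error / len(data)              # dL/db = (y_hat - y)
        w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
        b -= alpha * grad_b
    return w, b

# Hypothetical one-feature dataset: small x -> label 0, large x -> label 1.
data = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
initial_loss = mean_bce([0.0], 0.0, data)
w, b = train(data)
final_loss = mean_bce(w, b, data)
```

Starting from zero weights the loss equals $\log 2$, and it decreases as the decision boundary settles between the two groups of points.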



# Variants of Gradient Descent in Machine Learning

## Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a variant of the basic gradient descent algorithm where the parameters are updated using only one data point (randomly selected) at a time, rather than the entire dataset. Each update is therefore much cheaper, which often speeds up progress on large datasets, but the updates are noisier.

The update rule for SGD is:

$$
\theta = \theta - \alpha \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
$$

Where:
- $ \theta $ is the parameter (or weight).
- $ \alpha $ is the learning rate.
- $ \nabla_\theta J(\theta; x^{(i)}, y^{(i)}) $ is the gradient of the cost function, computed with respect to the $i$-th training example $ (x^{(i)}, y^{(i)}) $.

Since SGD uses one data point at a time, it is computationally more efficient but introduces more variance in the updates. This often causes the algorithm to fluctuate around the minimum rather than converging smoothly.

## Mini-Batch Gradient Descent (MBGD)

Mini-Batch Gradient Descent is a compromise between the standard (Batch) Gradient Descent and Stochastic Gradient Descent. In mini-batch GD, instead of using the full dataset or a single data point, a small random subset of the data (mini-batch) is used to compute the gradient.

The update rule for Mini-Batch GD is:

$$
\theta = \theta - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta J(\theta; x^{(i)}, y^{(i)})
$$

Where:
- $ m $ is the number of samples in the mini-batch.

Mini-batch gradient descent balances the stable gradient estimates of batch gradient descent against the cheap, frequent updates of stochastic gradient descent, and its batched computation maps well onto vectorized hardware such as GPUs.

## Adagrad (Adaptive Gradient Algorithm)

Adagrad is an adaptive learning rate method. It adjusts the learning rate for each parameter individually based on the past gradient updates. Parameters that have larger gradients (more significant updates) will have smaller learning rates, while parameters with smaller gradients will have larger learning rates.

The update rule for Adagrad is:

$$
\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot g_t
$$

Where:
- $ g_t = \nabla_\theta J(\theta_{t-1}) $ is the gradient at time step $ t $.
- $ G_t $ is the sum of the squared gradients up to time step $ t $:

$$
G_t = \sum_{i=1}^{t} g_i^2
$$

- $ \epsilon $ is a small constant added to prevent division by zero (typically $ 10^{-8} $).
- $ \alpha $ is the learning rate.

The adaptive nature of Adagrad makes it effective for sparse data (where most features are zero) or data with varying levels of gradients.

## RMSprop (Root Mean Square Propagation)

RMSprop is another adaptive learning rate method, designed to solve some issues with Adagrad, especially the fact that Adagrad's learning rate can shrink too much over time. RMSprop uses a moving average of squared gradients to scale the learning rate, which helps keep the updates stable.

The update rule for RMSprop is:

$$
\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t
$$

Where:
- $ E[g^2]_t $ is the moving average of squared gradients at time step $ t $:

$$
E[g^2]_t = \beta \cdot E[g^2]_{t-1} + (1-\beta) \cdot g_t^2
$$

where $ \beta $ is a smoothing factor (often close to $ 0.9 $).
- $ g_t $ is the gradient at time step $ t $.
- $ \alpha $ is the learning rate.
- $ \epsilon $ is a small constant added to prevent division by zero.

RMSprop helps improve convergence by adapting the learning rate to the magnitude of recent gradients, which is useful for non-stationary objectives.
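
To see the difference between the two accumulators, the sketch below feeds a constant gradient into both rules and records the effective step size $\alpha / \sqrt{\cdot + \epsilon}$ at each step. The constant gradient and the hyperparameter values are assumptions for illustration.

```python
import math

def adagrad_step_sizes(grads, alpha=0.1, eps=1e-8):
    """Effective step size under Adagrad: G_t is a running sum of g^2."""
    G, sizes = 0.0, []
    for g in grads:
        G += g * g
        sizes.append(alpha / math.sqrt(G + eps))
    return sizes

def rmsprop_step_sizes(grads, alpha=0.1, beta=0.9, eps=1e-8):
    """Effective step size under RMSprop: E[g^2]_t is a moving average of g^2."""
    avg, sizes = 0.0, []
    for g in grads:
        avg = beta * avg + (1 - beta) * g * g
        sizes.append(alpha / math.sqrt(avg + eps))
    return sizes

grads = [1.0] * 100  # hypothetical constant gradient
ada = adagrad_step_sizes(grads)
rms = rmsprop_step_sizes(grads)
```

After 100 steps Adagrad's step size has shrunk to about a tenth of $\alpha$ and keeps falling, while RMSprop's settles near $\alpha$ because old gradients are forgotten.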

## Adam (Adaptive Moment Estimation)

Adam combines ideas from both Adagrad and RMSprop. It computes adaptive learning rates for each parameter, but also takes into account the momentum of past gradients (i.e., the exponentially decaying average of past gradients) to improve optimization.

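The update rule for Adam is:

$$
\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
$$

Here, $ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $ and $ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $ are bias-corrected versions of the moment estimates, which compensate for $ m_0 $ and $ v_0 $ being initialized at zero.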

Where:
- $ m_t $ is the first moment estimate (mean of gradients), typically:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

with $ \beta_1 $ being the decay rate (typically $ 0.9 $).

- $ v_t $ is the second moment estimate (uncentered variance of the gradients), typically:

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

with $ \beta_2 $ being another decay rate (typically $ 0.999 $).

- $ \alpha $ is the learning rate.
- $ \epsilon $ is a small constant added to prevent division by zero.

Adam is widely used because it combines the benefits of both momentum and adaptive learning rates, making it well-suited for a variety of machine learning tasks.
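
A minimal scalar sketch of Adam, including the bias-correction step from the original algorithm. The test loss $J(\theta) = (\theta - 3)^2$, the learning rate, and the step count are assumptions for illustration.

```python
import math

def adam_minimize(grad, theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function given its gradient, using Adam updates."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment (moving mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (moving mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)           # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# J(theta) = (theta - 3)^2 has gradient 2 * (theta - 3).
theta_adam = adam_minimize(lambda t: 2 * (t - 3), theta=0.0)
```

Early steps have a size close to $\alpha$ regardless of the gradient's scale, which is one reason Adam tends to need less learning-rate tuning than plain gradient descent.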




<p align="center">
  <img src="transformer.png" alt="Feed-Forward Network" />
</p>


# Transformer Model: Step-by-Step Workflow

Transformers are a foundational architecture in modern deep learning, particularly in natural language processing (NLP). Below is a comprehensive step-by-step guide outlining the complete workflow of a transformer model, from input tensors to the final output.

## 1. Input Preparation

### 1.1. Raw Text Input
- **Description:** The process begins with raw text data, such as sentences or paragraphs.
- **Example:** `"The cat sat on the mat."`

### 1.2. Tokenization
- **Description:** Converts raw text into discrete tokens (words, subwords, or characters).
- **Substeps:**
  - **a. Splitting:** Break text into tokens based on spaces or specific rules.
  - **b. Subword Tokenization (Optional):** Further splits rare words into subword units using algorithms like Byte Pair Encoding (BPE) or WordPiece.
- **Example:** `"The cat sat on the mat."` → `["The", "cat", "sat", "on", "the", "mat", "."]`

### 1.3. Numerical Encoding
- **Description:** Maps tokens to unique numerical identifiers using a vocabulary index.
- **Substeps:**
  - **a. Vocabulary Lookup:** Each token is assigned an integer based on its position in the vocabulary.
  - **b. Handling Unknown Tokens:** Tokens not present in the vocabulary are mapped to a special `[UNK]` token.
- **Example (illustrative IDs; actual values depend on the tokenizer's vocabulary):** `["The", "cat", "sat", "on", "the", "mat", "."]` → `[101, 2024, 2003, 1037, 7099, 2527, 1012]`
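
Steps 1.2 and 1.3 can be sketched as a regex-based splitter followed by a vocabulary lookup. The tiny vocabulary, its IDs, and the helper names are assumptions for illustration; real tokenizers use much larger vocabularies and sub-word rules.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical vocabulary mapping each token to an integer ID.
vocab = {"[PAD]": 0, "[UNK]": 1, "The": 2, "cat": 3, "sat": 4,
         "on": 5, "the": 6, "mat": 7, ".": 8}

def encode(tokens, vocab):
    """Look up each token's ID, falling back to [UNK] for unknown tokens."""
    unk = vocab["[UNK]"]
    return [vocab.get(tok, unk) for tok in tokens]

tokens = tokenize("The cat sat on the mat.")
ids = encode(tokens, vocab)
ids_with_unknown = encode(tokenize("The dog sat."), vocab)
```

In the second call the word "dog" is not in the vocabulary, so it is mapped to the `[UNK]` ID.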

## 2. Embedding Layer

### 2.1. Token Embeddings
- **Description:** Transforms numerical token IDs into dense vector representations.
- **Substeps:**
  - **a. Embedding Matrix:** A learnable matrix where each row corresponds to a token's embedding.
  - **b. Lookup:** Each token ID retrieves its corresponding embedding vector.
- **Example:** `Embedding Matrix [Vocab Size x d_model]` → `X = [n_tokens x d_model]`

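    $$
    \mathbf{e} = W_{\text{emb}}^{\top} \mathbf{x}
    $$
    where:
    - **e** is the token embedding vector (dimension $d_{\text{model}}$), which becomes one row of `X`,
    - **$W_{\text{emb}}$** is the embedding matrix of shape `[Vocab Size x d_model]`,
    - **x** is the one-hot vector (length `Vocab Size`) for the token index, so the product simply selects that token's row of $W_{\text{emb}}$.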


### 2.2. Positional Encodings
- **Description:** Adds information about the position of each token in the sequence to the token embeddings.
- **Substeps:**
  - **a. Sinusoidal Encoding (Fixed):** Uses sine and cosine functions of different frequencies.
  - **b. Learned Positional Embeddings (Learnable):** Embeddings are learned during training.
- **Formula (sinusoidal encoding):**
  $$
  \begin{aligned}
  \text{Positional Encoding}_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\
  \text{Positional Encoding}_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
  \end{aligned}
  $$
  
### 2.3. Combined Embeddings
- **Description:** Summing token embeddings with positional encodings to incorporate positional information.
- **Formula:**
  $$
  E = X + \text{Positional Encodings}
  $$
- **Result:** A matrix `E` representing the input sequence with positional information.
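
The sinusoidal encoding and the summation in 2.3 can be sketched as follows. The token embedding values in `X` are made up, and `d_model = 4` is an assumption chosen to keep the example small.

```python
import math

def positional_encoding(n_tokens, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(n_tokens)]
    for pos in range(n_tokens):
        for i in range(0, d_model, 2):  # i plays the role of 2i in the formula
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Hypothetical token embeddings X for 3 tokens with d_model = 4.
X = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 1.0, 1.1, 1.2]]

pe = positional_encoding(n_tokens=3, d_model=4)
# E = X + positional encodings, added element by element.
E = [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, pe)]
```

Position 0 always receives the pattern `[0, 1, 0, 1, ...]`, and later positions rotate through the sine and cosine waves at different frequencies.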

## 3. Transformer Architecture

The transformer consists of an **Encoder** and a **Decoder**, each composed of multiple layers. Below, we outline the main components and their subcomponents.

### 3.1. Encoder

#### 3.1.1. Multi-Head Self-Attention
- **Description:** Allows the model to focus on different parts of the input sequence simultaneously.
- **Substeps:**
  - **a. Linear Projections:** Compute queries (Q), keys (K), and values (V) using learned weight matrices.
    $$
    Q = E W_Q, \quad K = E W_K, \quad V = E W_V
    $$
  
    Each token's embedding $ E_i $ (row $ i $ of $ E $) is projected into query, key, and value vectors using the same learned weight matrices:

    $$
    Q_i = E_i W_Q \quad \text{(query vector for token } i \text{)}
    $$
    $$
    K_i = E_i W_K \quad \text{(key vector for token } i \text{)}
    $$
    $$
    V_i = E_i W_V \quad \text{(value vector for token } i \text{)}
    $$

    Where:
    - $ W_Q, W_K, W_V $ are learned weight matrices,
    - $ E_i $ is the embedding (with positional information) of token $ i $.


  - **b. Attention Scores:** For each query vector in Q, compute the dot product with all key vectors in K. These scores indicate the relevance or similarity between each query and all keys.
    - *High Score:* The corresponding value should be given more attention.
    - *Low Score:* The corresponding value should be given less attention.
      $$
      \text{Attention Scores} = Q K^T
      $$
       **Example for $ N=4 $:**
          $$
          Q = 
          \begin{bmatrix}
          q_{1} \\[6pt]
          q_{2} \\[6pt]
          q_{3} \\[6pt]
          q_{4}
          \end{bmatrix}_{4 \times d_k}, \quad
          K =
          \begin{bmatrix}
          k_{1} \\[6pt]
          k_{2} \\[6pt]
          k_{3} \\[6pt]
          k_{4}
          \end{bmatrix}_{4 \times d_k}
          $$

      The score matrix (before scaling and softmax) is:
          $$
          QK^T = 
          \begin{bmatrix}
          q_{1}\cdot k_{1}^T & q_{1}\cdot k_{2}^T & q_{1}\cdot k_{3}^T & q_{1}\cdot k_{4}^T \\[6pt]
          q_{2}\cdot k_{1}^T & q_{2}\cdot k_{2}^T & q_{2}\cdot k_{3}^T & q_{2}\cdot k_{4}^T \\[6pt]
          q_{3}\cdot k_{1}^T & q_{3}\cdot k_{2}^T & q_{3}\cdot k_{3}^T & q_{3}\cdot k_{4}^T \\[6pt]
          q_{4}\cdot k_{1}^T & q_{4}\cdot k_{2}^T & q_{4}\cdot k_{3}^T & q_{4}\cdot k_{4}^T
          \end{bmatrix}_{4 \times 4}
          $$
      Each row holds the raw scores of a single query token against every token in the sequence. Only after scaling and softmax do the rows become attention weights that sum to 1.
      
  - **c. Scaled Dot-Product Attention:** Calculate attention scores, apply scaling, softmax, and compute weighted sums.
    $$
    \begin{aligned}
    \text{Scores} &= \frac{Q K^T}{\sqrt{d_k}} \\
    \text{Attention Weights} &= \text{softmax}(\text{Scores}) \\
    \text{Attention Output} &= \text{Attention Weights} \times V
    \end{aligned}
    $$
    - Scaling prevents the dot products from growing too large, which can push the softmax function into regions with very small gradients.
    - The softmax is applied row-wise, so each row of the attention weights sums to 1. These weights determine how much each value vector (from V) contributes to the final output.
    - Multiplying the weights by V aggregates the value vectors based on their relevance to each query, producing a weighted sum that captures contextual information.
  
  - **d. Concatenation:** Concatenate outputs from all attention heads.
    $$
    \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \in \mathbb{R}^{N \times (h \cdot d_k)}
    $$
  - **e. Final Linear Projection:** Apply a linear transformation to the concatenated output.
    $$
    \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O
    $$
    - Here, $ W_O $ is a learnable weight matrix of size $(h \cdot d_k) \times d_{model}$.
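
Substeps b and c can be sketched for a single attention head using plain Python lists. The matrices `Q`, `K`, and `V` below are made-up values, and the helpers `matmul`, `transpose`, and `softmax` are written out only for this example.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]  # row-wise softmax
    return matmul(weights, V), weights

# Hypothetical example with 2 tokens and d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each query matches its own key best here, so each output row is a mix of both value vectors that leans toward the value at the same position. A multi-head layer would run several such heads with different projections, concatenate their outputs, and apply $W_O$.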

#### 3.1.2. Add & Norm
- **Description:**  
  Incorporates a **residual (skip) connection** by adding the sub-layer's input $ E $ to its output, followed by **layer normalization**. This helps preserve the original information and stabilizes the training process.

- **Formula:**
  $$
  \text{Output} = \text{LayerNorm}(E + \text{MultiHead}(Q, K, V))
  $$
  $$
  \text{or}
  $$
  $$
  \text{Output} = \text{LayerNorm}(E + \text{Attention Output})
  $$


#### 3.1.3. Feed-Forward Network (FFN)
- **Description:**  
  Processes each position independently through a two-layer fully connected network to capture complex patterns and non-linear relationships.
  
- **Substeps:**
  - **a. Linear Transformation and Activation:**  
    **Purpose:** Expands the dimensionality of the input to increase the model's capacity to learn.  
    **Formula:**  
    $$
    \text{FFN}_1 = \text{ReLU}(E W_1 + b_1)
    $$
    - $ E $: **Input** to the FFN, i.e. the **Output** of the previous **Add & Norm** step (not the raw embeddings).
    - $ W_1 $: **Weight matrix** for the first linear layer.
    - $ b_1 $: **Bias vector** for the first linear layer.
    - **ReLU:** Activation function introducing non-linearity.
  
  - **b. Linear Transformation:**  
    **Purpose:** Reduces the dimensionality back to the original size, ensuring consistency in the model's architecture.  
    **Formula:**  
    $$
    \text{FFN}_2 = \text{FFN}_1 W_2 + b_2
    $$
    - $ W_2 $: **Weight matrix** for the second linear layer.
    - $ b_2 $: **Bias vector** for the second linear layer.

#### 3.1.4. Add & Norm
- **Description:**  
  Adds a **residual (skip) connection** by combining the FFN's output with its input $ E $, followed by **layer normalization**. This step helps preserve the original information and stabilizes the training process.
  
- **Formula:**
  $$
  \text{Encoder Output} = \text{LayerNorm}(E + \text{FFN}_2)
  $$
  - $ E $: **Input** to the Feed-Forward Network (output from the previous **Add & Norm** step).
  - $ \text{FFN}_2 $: **Output** from the Feed-Forward Network.
  - **LayerNorm:** Normalizes the combined output to have a mean of 0 and a variance of 1, enhancing training stability and performance.
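
The FFN and the Add & Norm step that follows it can be sketched for a single position. The weights are made-up values with `d_model = 2` and a hidden size of 4, and the learnable scale and shift that layer normalization normally carries are omitted to keep the sketch short.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, W, b):
    """Compute x W + b for a vector x and a matrix W of shape (d_in, d_out)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn_block(x, W1, b1, W2, b2):
    """FFN_1 = ReLU(x W1 + b1), FFN_2 = FFN_1 W2 + b2, then LayerNorm(x + FFN_2)."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    ffn_out = linear(hidden, W2, b2)
    return layer_norm([xi + oi for xi, oi in zip(x, ffn_out)])

# Hypothetical weights: d_model = 2, hidden size = 4.
x = [1.0, 2.0]
W1 = [[1.0, 0.0, -1.0, 0.5],
      [0.0, 1.0, 0.5, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
b2 = [0.0, 0.0]
y = ffn_block(x, W1, b1, W2, b2)
```

The same weights are applied to every position in the sequence, which is why the network is called position-wise.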

---

### 3.2. Decoder

While the **Encoder** processes the input sequence to generate contextualized representations, the **Decoder** generates the output sequence by leveraging these representations and the previously generated tokens. The Decoder consists of multiple layers, each containing three main sub-layers:

1. **Masked Multi-Head Self-Attention**
2. **Multi-Head Cross-Attention**
3. **Position-wise Feed-Forward Network (FFN)**

Each sub-layer is followed by **Residual Connections** and **Layer Normalization**, similar to the Encoder.


#### 3.2.1. Masked Multi-Head Self-Attention
- **Description:** Similar to encoder's self-attention but with masking to prevent attending to future tokens.
- **Purpose:**  
  - **Prevent Information Leakage:** By masking future tokens, the model ensures that the prediction for the current token doesn't incorporate information from tokens that haven't been generated yet.
  - **Maintain Causality:** Ensures that the generation process respects the sequential nature of language.

- **Substeps:**
  - **a. Masking:** Add a causal mask $M$ (with `0` for allowed positions and `-∞` for future positions) to the scaled scores to enforce the autoregressive property.
    $$
    \text{Masked Scores} = \frac{Q K^T}{\sqrt{d_k}} + M
    $$
  
  $$
  \text{Raw Scores (no mask)} = 
  \begin{bmatrix}
  s_{00} & s_{01} & s_{02} & s_{03} \\
  s_{10} & s_{11} & s_{12} & s_{13} \\
  s_{20} & s_{21} & s_{22} & s_{23} \\
  s_{30} & s_{31} & s_{32} & s_{33} \\
  \end{bmatrix}
  $$

  
  $$
  \text{Mask Matrix } M =
  \begin{bmatrix}
  0 & -\infty & -\infty & -\infty \\
  0 & 0 & -\infty & -\infty \\
  0 & 0 & 0 & -\infty \\
  0 & 0 & 0 & 0
  \end{bmatrix}
  $$

**At Each Timestep $t$:**
- At $t=0$: Only token 0 visible:
  $$
  \text{Masked Scores} = 
  \begin{bmatrix}
  s_{00}
  \end{bmatrix}
  $$
- At $t=1$: Tokens 0 and 1 visible:
  $$
  \text{Masked Scores} = 
  \begin{bmatrix}
  s_{00} & -\infty \\
  s_{10} & s_{11}
  \end{bmatrix}
  $$
- At $t=2$: Tokens 0, 1, 2 visible:
  $$
  \text{Masked Scores} = 
  \begin{bmatrix}
  s_{00} & -\infty & -\infty \\
  s_{10} & s_{11} & -\infty \\
  s_{20} & s_{21} & s_{22}
  \end{bmatrix}
  $$
- At $t=3$: All four tokens visible, which gives the full masked score matrix:
  $$
  \text{Masked Scores} = 
  \begin{bmatrix}
  s_{00} & -\infty & -\infty & -\infty \\
  s_{10} & s_{11} & -\infty & -\infty \\
  s_{20} & s_{21} & s_{22} & -\infty \\
  s_{30} & s_{31} & s_{32} & s_{33} \\
  \end{bmatrix}
  $$

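- Applying Softmax (row-wise):
  $$
  \text{Attention Weights} = \text{softmax}(\text{Masked Scores})
  $$

  $$
  \text{Attention Weights} = 
  \begin{bmatrix}
  a_{00} & 0 & 0 & 0 \\
  a_{10} & a_{11} & 0 & 0 \\
  a_{20} & a_{21} & a_{22} & 0 \\
  a_{30} & a_{31} & a_{32} & a_{33} \\
  \end{bmatrix}
  $$
  Here $a_{ij}$ are the normalized weights, not the raw scores: each row sums to 1 (so $a_{00} = 1$), and the $-\infty$ entries become exactly 0 because $e^{-\infty} = 0$.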

- **b. Compute Attention Output:** As in encoder.
  $$
  \text{MaskedAttention Output} = \text{MultiHead}(Q, K, V) \quad \text{with Masking}
  $$
  $$
  \text{or}
  $$ 
  $$
  \text{MaskedAttention Output} = \text{Attention Weights} \times V
  $$

- **How Masking Works:**
  - **Mask Matrix:** A triangular matrix that masks (sets to $-\infty$) the attention scores for future tokens.
  - **Application:** Before applying the softmax function, the mask is added to the attention scores to nullify the influence of future tokens.
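
The masking procedure can be sketched directly: build the mask, add it to the scores, and apply a row-wise softmax. The all-zero score matrix is an assumption chosen so the resulting weights are easy to check by hand.

```python
import math

def causal_mask(n):
    """0 on and below the diagonal (allowed), -inf above it (future positions)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Add the causal mask to the (already scaled) scores, then softmax each row."""
    mask = causal_mask(len(scores))
    return [softmax([s + m for s, m in zip(s_row, m_row)])
            for s_row, m_row in zip(scores, mask)]

# Hypothetical scaled scores for 4 tokens; equal scores make the result easy to read.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_attention_weights(scores)
```

With equal scores, token `i` spreads its attention evenly over tokens `0..i` and gives exactly zero weight to every later token.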


#### 3.2.2. Add & Norm (Post Masked Self-Attention)
- **Description:** Residual connection and layer normalization.
- **Formula:**
  $$
  E' = \text{LayerNorm}(E + \text{Masked MultiHead}(Q, K, V))
  $$
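
As a rough sketch of the Add & Norm step, the snippet below adds a sub-layer's output to its input and then normalizes each token vector. It assumes the learnable scale and shift of layer normalization are fixed at 1 and 0, and the input values are hypothetical.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and (almost) unit variance.
    The learnable scale and shift are omitted, i.e. treated as 1 and 0."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization, token by token."""
    return [
        layer_norm([x + y for x, y in zip(x_row, y_row)])
        for x_row, y_row in zip(sublayer_input, sublayer_output)
    ]

# Hypothetical decoder embeddings E and masked self-attention output (2 tokens, d_model = 4).
E = [[0.1, 0.4, 0.5, 0.7], [0.2, 0.4, 0.5, 0.6]]
attention_output = [[0.07, 0.32, 0.32, 0.27], [0.08, 0.31, 0.31, 0.26]]

out = add_and_norm(E, attention_output)
for row in out:
    print([round(v, 3) for v in row])
```

Normalization is applied across the feature dimension of each token separately, which is why the helper works on one row at a time.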

#### 3.2.3. Multi-Head Attention over Encoder Outputs (or Multi-Head Cross-Attention)
- **Description:** Allows the decoder to attend to the encoder's output.
- **Components:**
  - **Queries (Q):** Derived from the Decoder's previous sub-layer (Masked Self-Attention output).
  - **Keys (K) & Values (V):** Derived from the Encoder's final output.

- **Substeps:**
  - **a. Compute Q from Decoder:** 
    $$
    Q = \text{Decoder Output} W_Q'
    $$
  - **b. Compute K and V from Encoder:**
    $$
    K = \text{Encoder Output} W_K', \quad V = \text{Encoder Output} W_V'
    $$
  - **c. Compute Attention Output:** As in encoder.
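
A minimal single-head sketch of cross-attention follows. The point it illustrates is where each input comes from: queries are projected from the decoder states, while keys and values are projected from the encoder output. The function name, the tiny shapes, and the identity projection matrices are assumptions chosen to keep the numbers easy to follow.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Single-head cross-attention: Q from the decoder, K and V from the encoder."""
    q = matmul(decoder_states, w_q)   # (target_len, d_k)
    k = matmul(encoder_states, w_k)   # (source_len, d_k)
    v = matmul(encoder_states, w_v)   # (source_len, d_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # (target_len, source_len)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

# Hypothetical sizes: 2 target positions, 3 source positions, d_model = d_k = 2.
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 2.0], [0.5, 0.5], [2.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned W_Q', W_K', W_V'

output, weights = cross_attention(decoder_states, encoder_states,
                                  identity, identity, identity)
print(weights)  # one row per target position, one column per source position
print(output)
```

No causal mask is applied here, since every target position may look at the whole source sequence.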

#### 3.2.4. Add & Norm
- **Description:** Adds another **residual (skip) connection** by combining the input $E'$ (the output of the previous Add & Norm step) with the cross-attention output, followed by **layer normalization**. This step further integrates information from the Encoder while maintaining training stability.

- **Formula:**
  $$
  E'' = \text{LayerNorm}(E' + \text{MultiHead}(Q, K, V))
  $$

#### 3.2.5. Feed-Forward Network (FFN) - Position Wise Feed Forward Network
- **Description:** Same as encoder's FFN.
- **Substeps:**
  - **a. Linear Transformation and Activation:**
    $$
    \text{FFN}_1 = \text{ReLU}(\text{Output} W_1 + b_1)
    $$
  - **b. Linear Transformation:**
    $$
    \text{FFN}_2 = \text{FFN}_1 W_2 + b_2
    $$
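
The two FFN substeps can be sketched as below. The weights, biases, and sizes (`d_model = 2`, hidden size 4) are hypothetical and chosen so the result can be checked by hand. The same weights are applied to every position independently, which is what "position-wise" refers to.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: ReLU(x W1 + b1) W2 + b2, applied to each row of x."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

# Hypothetical parameters: d_model = 2, hidden size = 4.
x = [[1.0, -1.0], [0.5, 2.0]]
w1 = [[1.0, 0.0, -1.0, 0.5],
      [0.0, 1.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.0, 2.0]]
b2 = [0.1, -0.1]

ffn_out = feed_forward(x, w1, b1, w2, b2)
print(ffn_out)
```

For the first row, the hidden pre-activations are `[1, -1, -2, 1]`; ReLU zeroes the two negative entries before the second linear layer is applied.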

#### 3.2.6. Add & Norm (Post Feed-Forward Network)
- **Description:** Adds a final **residual (skip) connection** by combining the FFN's output $\text{FFN}_2$ with its input $E''$ (the output of the previous Add & Norm step), followed by **layer normalization**. This final normalization ensures that the Decoder's output is well-conditioned for generating predictions.

- **Formula:**
  $$
  \text{Decoder Output} = \text{LayerNorm}(E'' + \text{FFN}_2)
  $$


## 4. Output Embeddings and Generation

- **Description:**  
  The **Decoder Output** is transformed into **output embeddings** which are then used to generate the final predictions (e.g., token probabilities).

### 4.1. Final Linear Layer
- **Description:** Transforms decoder outputs to match the size of the target vocabulary.
- **Formula:**
  $$
  \text{Logits} = \text{Decoder Output} W_O + b_O
  $$

### 4.2. Softmax
- **Description:** Converts logits into probability distributions over the vocabulary.
- **Formula:**
  $$
  \text{Probabilities} = \text{softmax}(\text{Logits})
  $$

### 4.3. Token Selection
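- **Description:** A decoding strategy picks the next token from the probability distribution. The simplest choice is the single most probable token, but other strategies trade determinism for diversity or overall sequence quality.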
- **Methods:**
  - **a. Greedy Search:** Select the token with the highest probability.
  - **b. Beam Search:** Explore multiple sequences to find the most likely output.
  - **c. Sampling:** Randomly sample tokens based on their probabilities.
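
The sketch below contrasts greedy search with sampling on one set of logits. The four-word vocabulary and the logit values are hypothetical, and beam search is left out because it needs to track several partial sequences at once.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    """Draw one token index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical vocabulary and logits for a single decoding step.
vocab = ["the", "cat", "sat", "mat"]
logits = [1.0, 3.0, 0.5, 2.0]
probs = softmax(logits)

rng = random.Random(0)  # seeded only so repeated runs draw the same samples
greedy_token = vocab[greedy(probs)]
sampled_tokens = [vocab[sample(probs, rng)] for _ in range(5)]

print("probabilities:", [round(p, 3) for p in probs])
print("greedy:", greedy_token)
print("sampled:", sampled_tokens)
```

Greedy search always returns the same token for the same logits, whereas sampling can return any token with non-zero probability.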

### 4.4. Output Text
- **Description:** Converts selected token IDs back to human-readable text.
- **Example:** `[101, 2024, 2003, 102]` → `"The cat sat."` (illustrative IDs only; the actual mapping depends on the tokenizer's vocabulary)

---

## 5. Masked Multi-Head Attention vs. Regular Multi-Head Attention

- **Masked Multi-Head Attention:**  
  - **Used In:** Decoder's self-attention sub-layer.
  - **Function:** Prevents the model from attending to future tokens in the sequence during training, maintaining causality.
  
- **Regular Multi-Head Attention:**  
  - **Used In:** Encoder's self-attention sub-layers and Decoder's cross-attention sub-layer.
  - **Function:** Allows the model to attend to all positions in the input or encoder's output without restrictions.

- **Key Difference:**  
  The **masking** in the Decoder's self-attention ensures that each position can only attend to previous and current positions, not future ones, which is essential for tasks like language generation where the model should not peek ahead.

---

### 5.1 Differences Between Encoder Inputs and Decoder Outputs

- **Encoder Inputs:**
  - **Nature:** The source sequence (e.g., a sentence in the source language).
  - **Processing:** Fully accessible to all Encoder sub-layers; each token can attend to every other token in the sequence.
  
- **Decoder Outputs:**
  - **Nature:** The target sequence being generated (e.g., a sentence in the target language).
  - **Processing:** 
    - **Masked Self-Attention:** Each position can only attend to itself and previous tokens in the target sequence.
    - **Cross-Attention:** Each position can attend to all tokens in the Encoder's output, integrating source information.
    - **Autoregressive Generation:** Each new token is generated based on previously generated tokens and the Encoder's context.

- **Summary of Differences:**
  | Feature                     | Encoder Inputs                      | Decoder Outputs                                |
  |-----------------------------|-------------------------------------|------------------------------------------------|
  | **Access to Sequence**      | Full access to entire input sequence| Limited to past and current tokens (masked)    |
  | **Attention Mechanism**     | Self-Attention (no masking)         | Masked Self-Attention & Cross-Attention        |
  | **Generation Flow**         | Processes input in parallel         | Generates output sequentially (autoregressive) |
  | **Integration with Encoder**| Independent of Decoder              | Attends to Encoder's output for context        |

---


In [67]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Example small vocabulary (just for demonstration, real training would require more data)
word_list = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
vocab_size = len(word_list)
word_to_idx = {w: i for i, w in enumerate(word_list)}
idx_to_word = {i: w for w, i in word_to_idx.items()}

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super(BigramLanguageModel, self).__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        # logits: [Batch, Time, Vocab]
        logits = self.token_embedding_table(index)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, index, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self.forward(index)
            # Focus on last timestep for next token prediction
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            index_next = torch.multinomial(probs, num_samples=1)    # Sample from the distribution
            index = torch.cat((index, index_next), dim=1)
            
            # Decode every token generated so far and print the running sequence
            generated_words = [idx_to_word[i] for i in index.view(-1).tolist()]
            print(generated_words)

        return index

# Instantiate and run the model (it is untrained, so the sampled words are essentially random)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BigramLanguageModel(vocab_size).to(device)

# Start generation with the token "the"
start_word = "the"
context = torch.tensor([[word_to_idx[start_word]]], dtype=torch.long, device=device)

# Generate more words and print as they come out
model.generate(context, max_new_tokens=20)


['the', 'over']
['the', 'over', 'fox']
['the', 'over', 'fox', 'lazy']
['the', 'over', 'fox', 'lazy', 'jumps']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps', 'jumps']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps', 'jumps', 'jumps']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps', 'jumps', 'jumps', 'lazy']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps', 'jumps', 'jumps', 'lazy', 'jumps']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps', 'jumps', 'jumps', 'lazy', 'jumps', 'quick']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps', 'jumps', 'jumps', 'lazy', 'jumps', 'quick', 'brown']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps', 'jumps', 'jumps', 'lazy', 'jumps', 'quick', 'brown', 'brown']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps', 'jumps', 'jumps', 'lazy', 'jumps', 'quick', 'brown', 'brown', 'lazy']
['the', 'over', 'fox', 'lazy', 'jumps', 'jumps', 'jumps', 'jumps', 'lazy', 'jumps', 'quick', 'brown', 'brown', 'lazy', 'jumps']
['th

tensor([[0, 5, 3, 6, 4, 4, 4, 4, 6, 4, 1, 2, 2, 6, 4, 6, 4, 1, 2, 2, 3]],
       device='cuda:0')

> LLM in code

In [67]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the GPT-like Model with Embedding Layers
class SimpleGPT(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, num_layers, num_classes, max_len=512):
        super(SimpleGPT, self).__init__()
        
        # Embedding layer: Map tokens to embedding vectors
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        # Positional encoding (used to add positions to the embeddings)
        self.positional_encoding = nn.Parameter(torch.zeros(1, max_len, embed_size))
        
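        # Transformer layer. Note: nn.Transformer is a full encoder-decoder stack,
        # whereas GPT itself is decoder-only.
        self.transformer = nn.Transformer(
            d_model=embed_size,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=embed_size*4,
            batch_first=True  # inputs are [batch, seq_len, embed_size]; the default layout is [seq_len, batch, embed_size]
        )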
        
        # Output layer for classification or next-token prediction
        self.fc_out = nn.Linear(embed_size, num_classes)
        
    def forward(self, x):
        # Get token embeddings
        embeddings = self.embedding(x)
        
        # Add positional encoding to embeddings
        seq_len = x.size(1)
        embeddings = embeddings + self.positional_encoding[:, :seq_len, :]
        
        # Pass through the transformer
        transformer_out = self.transformer(embeddings, embeddings)
        
        # Use the last token's representation for classification or next token prediction
        output = self.fc_out(transformer_out[:, -1, :])
        
        return output


# Hyperparameters
vocab_size = 5000  # Example vocabulary size
embed_size = 256  # Embedding dimension
num_heads = 8     # Number of attention heads
num_layers = 6    # Number of transformer layers
num_classes = vocab_size  # For next-token prediction (same size as vocab)
max_len = 512     # Max sequence length
batch_size = 32   # Batch size
learning_rate = 1e-4

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Instantiate the model
model = SimpleGPT(vocab_size, embed_size, num_heads, num_layers, num_classes, max_len)
model.to(device)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
def train(model, train_loader, criterion, optimizer, num_epochs=5):
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0
        for batch in train_loader:
            inputs, targets = batch
            inputs, targets = inputs.to(device), targets.to(device)  # Move data to device
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss / len(train_loader)}")

# Sample DataLoader for training (replace with actual data)
from torch.utils.data import DataLoader, TensorDataset

# Simulated training data (replace with actual tokenized text)
train_data = torch.randint(0, vocab_size, (1000, max_len))  # 1000 sequences
train_labels = torch.randint(0, vocab_size, (1000,))  # Corresponding next token (or class)

# DataLoader
train_dataset = TensorDataset(train_data, train_labels)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Train the model
train(model, train_loader, criterion, optimizer)

# Inference function for a sample text
def infer(model, text, vocab, max_len=512):
    model.to(device)
    model.eval()
    
    # Tokenize input text (basic tokenizer for demo purposes)
    tokens = torch.tensor([vocab.get(word, 0) for word in text.split()]).unsqueeze(0).to(device)  # Adding batch dimension and moving to device
    tokens = tokens[:, :max_len]  # Ensure the sequence length is within max_len
    
    # Get the model's prediction
    with torch.no_grad():
        output = model(tokens)
    
    # Convert output to probabilities (for next token prediction)
    prob = torch.softmax(output, dim=-1)
    
    # Get the predicted token (highest probability)
    predicted_token_idx = torch.argmax(prob, dim=-1).item()
    
    # Decode the token back to text (inverse vocab lookup)
    reverse_vocab = {v: k for k, v in vocab.items()}
    predicted_word = reverse_vocab.get(predicted_token_idx, "<UNK>")
    
    return predicted_word

# Sample Vocab (just for illustration)
vocab = {str(i): i for i in range(vocab_size)}  # Just a simple numerical vocabulary, replace with actual vocab

# Sample input text
sample_text = "this is a sample sentence"

# Perform inference
predicted_word = infer(model, sample_text, vocab)
print(f"Predicted next word: {predicted_word}")


Epoch [1/5], Loss: 8.687405735254288
Epoch [2/5], Loss: 8.092840701341629
Epoch [3/5], Loss: 7.657473549246788
Epoch [4/5], Loss: 7.372279062867165
Epoch [5/5], Loss: 7.210036337375641
Predicted next word: 3797



# Worked Example of Attention with a Toy Corpus

We’ve explained the mathematics behind single-head and multi-head attention. Now, let’s apply these steps to a simple, concrete example. This will demonstrate how the calculations flow from inputs all the way through to the final attention output.

## Toy Setup

### Assumptions and Simplifications

- **Vocabulary and Embeddings:**  
  We have a small vocabulary with three tokens:  
  1. "The"
  2. "cat"
  3. "sat"

  We assume we already have embeddings for these tokens. Let's say `d_model = 4` for simplicity. Thus, each token is represented by a 4-dimensional vector. For demonstration:
  - "The" → $[0.1, 0.3, 0.5, 0.7]$
  - "cat" → $[0.2, 0.4, 0.4, 0.6]$
  - "sat" → $[0.15, 0.25, 0.5, 0.1]$

- **Input Sequence:**  
  Suppose our input sequence is: "The cat sat"

So we have 3 tokens, hence $n=3$.

- **Positional Encodings:**  
For simplicity, we skip the usual sinusoidal positional encodings and use a toy encoding that adds a small, unique offset for each token position:
- Position 1 encoding: $[0.0, 0.1, 0.0, 0.0]$
- Position 2 encoding: $[0.0, 0.0, 0.1, 0.0]$
- Position 3 encoding: $[0.0, 0.0, 0.0, 0.1]$

After adding positional encodings:
- For "The" (position 1): $[0.1, 0.3+0.1, 0.5, 0.7] = [0.1, 0.4, 0.5, 0.7]$
- For "cat" (position 2): $[0.2, 0.4, 0.4+0.1, 0.6] = [0.2, 0.4, 0.5, 0.6]$
- For "sat" (position 3): $[0.15, 0.25, 0.5, 0.1+0.1] = [0.15, 0.25, 0.5, 0.2]$

Thus, our input matrix $X$ is:
$$
X = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\[6pt]
  0.2 & 0.4 & 0.5 & 0.6 \\[6pt]
  0.15 & 0.25 & 0.5 & 0.2
\end{bmatrix}
$$
where each row is a token embedding with positional info.
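
The construction of $X$ can be reproduced with a few lines of plain Python. The variable names are chosen for this sketch; the numbers are the toy embeddings and positional offsets listed above.

```python
# Toy token embeddings (d_model = 4) from the setup above.
embeddings = {
    "The": [0.1, 0.3, 0.5, 0.7],
    "cat": [0.2, 0.4, 0.4, 0.6],
    "sat": [0.15, 0.25, 0.5, 0.1],
}

# Toy positional encodings: one small offset per position.
positional = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.1],
]

tokens = ["The", "cat", "sat"]

# Element-wise sum of each token's embedding and its position's encoding.
# round() only hides floating-point noise in the printed values.
X = [
    [round(e + p, 10) for e, p in zip(embeddings[token], positional[i])]
    for i, token in enumerate(tokens)
]

for token, row in zip(tokens, X):
    print(token, row)
```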

- **Parameters (W_Q, W_K, W_V):**  
Let’s define:
- $d_{\text{model}} = 4$
- Single-head attention: $d_k = d_{\text{model}} = 4$

Example parameter matrices:
$$
W_Q = \begin{bmatrix}
0.5 & 0.1 & 0.0 & 0.3 \\[6pt]
0.4 & 0.2 & 0.1 & 0.0 \\[6pt]
0.3 & 0.3 & 0.3 & 0.3 \\[6pt]
0.2 & 0.1 & 0.5 & 0.4
\end{bmatrix}, \quad
W_K = \begin{bmatrix}
0.1 & 0.4 & 0.0 & 0.0 \\[6pt]
0.0 & 0.5 & 0.2 & 0.1 \\[6pt]
0.3 & 0.0 & 0.3 & 0.3 \\[6pt]
0.2 & 0.2 & 0.1 & 0.0
\end{bmatrix}, \quad
W_V = \begin{bmatrix}
0.2 & 0.1 & 0.0 & 0.0 \\[6pt]
0.0 & 0.3 & 0.5 & 0.0 \\[6pt]
0.1 & 0.1 & 0.1 & 0.4 \\[6pt]
0.0 & 0.2 & 0.1 & 0.1
\end{bmatrix}
$$

### Step-by-Step Computation

#### 1. Compute Q, K, V

- $Q = XW_Q$:
Let's multiply $X$ (3x4) by $W_Q$ (4x4):

For the first token "The":
$$
Q_1 = [0.1, 0.4, 0.5, 0.7]
\begin{bmatrix}
  0.5 & 0.1 & 0.0 & 0.3 \\
  0.4 & 0.2 & 0.1 & 0.0 \\
  0.3 & 0.3 & 0.3 & 0.3 \\
  0.2 & 0.1 & 0.5 & 0.4
\end{bmatrix}
$$

Compute row-by-column:
- $Q_1[1] = 0.1*0.5 + 0.4*0.4 + 0.5*0.3 + 0.7*0.2 = 0.05 + 0.16 + 0.15 + 0.14 = 0.5$
- $Q_1[2] = 0.1*0.1 + 0.4*0.2 + 0.5*0.3 + 0.7*0.1 = 0.01 + 0.08 + 0.15 + 0.07 = 0.31$
- $Q_1[3] = 0.1*0.0 + 0.4*0.1 + 0.5*0.3 + 0.7*0.5 = 0 + 0.04 + 0.15 + 0.35 = 0.54$
- $Q_1[4] = 0.1*0.3 + 0.4*0.0 + 0.5*0.3 + 0.7*0.4 = 0.03 + 0 + 0.15 + 0.28 = 0.46$

So $Q_1 = [0.5, 0.31, 0.54, 0.46]$.

For the second token "cat":
$$
Q_2 = [0.2, 0.4, 0.5, 0.6]W_Q
$$
- $Q_2[1] = 0.2*0.5 + 0.4*0.4 + 0.5*0.3 + 0.6*0.2 = 0.1 + 0.16 + 0.15 + 0.12 = 0.53$
- $Q_2[2] = 0.2*0.1 + 0.4*0.2 + 0.5*0.3 + 0.6*0.1 = 0.02 + 0.08 + 0.15 + 0.06 = 0.31$
- $Q_2[3] = 0.2*0.0 + 0.4*0.1 + 0.5*0.3 + 0.6*0.5 = 0 + 0.04 + 0.15 + 0.3 = 0.49$
- $Q_2[4] = 0.2*0.3 + 0.4*0.0 + 0.5*0.3 + 0.6*0.4 = 0.06 + 0 + 0.15 + 0.24 = 0.45$

So $Q_2 = [0.53, 0.31, 0.49, 0.45]$.

For the third token "sat":
$$
Q_3 = [0.15, 0.25, 0.5, 0.2]W_Q
$$
- $Q_3[1] = 0.15*0.5 + 0.25*0.4 + 0.5*0.3 + 0.2*0.2 = 0.075 + 0.1 + 0.15 + 0.04 = 0.365$
- $Q_3[2] = 0.15*0.1 + 0.25*0.2 + 0.5*0.3 + 0.2*0.1 = 0.015 + 0.05 + 0.15 + 0.02 = 0.235$
- $Q_3[3] = 0.15*0.0 + 0.25*0.1 + 0.5*0.3 + 0.2*0.5 = 0 + 0.025 + 0.15 + 0.1 = 0.275$
- $Q_3[4] = 0.15*0.3 + 0.25*0.0 + 0.5*0.3 + 0.2*0.4 = 0.045 + 0 + 0.15 + 0.08 = 0.275$

So $Q_3 = [0.365, 0.235, 0.275, 0.275]$.

Thus:
$$
Q = \begin{bmatrix}
  0.5 & 0.31 & 0.54 & 0.46 \\[4pt]
  0.53 & 0.31 & 0.49 & 0.45 \\[4pt]
  0.365 & 0.235 & 0.275 & 0.275
\end{bmatrix}
$$

- $K = XW_K$:
Similarly, multiply $X$ by $W_K$.

For "The":
$$
K_1 = [0.1,0.4,0.5,0.7]W_K
$$
Compute:
- $K_1[1] = 0.1*0.1 + 0.4*0.0 + 0.5*0.3 + 0.7*0.2 = 0.01+0+0.15+0.14=0.3$
- $K_1[2] = 0.1*0.4 + 0.4*0.5 + 0.5*0.0 + 0.7*0.2 = 0.04+0.2+0+0.14=0.38$
- $K_1[3] = 0.1*0.0 + 0.4*0.2 + 0.5*0.3 + 0.7*0.1 = 0+0.08+0.15+0.07=0.3$
- $K_1[4] = 0.1*0.0 + 0.4*0.1 + 0.5*0.3 + 0.7*0.0 = 0+0.04+0.15+0=0.19$

$K_1 = [0.3, 0.38, 0.3, 0.19]$.

For "cat":
$$
K_2 = [0.2,0.4,0.5,0.6]W_K
$$
- $K_2[1] = 0.2*0.1 + 0.4*0.0 + 0.5*0.3 + 0.6*0.2 = 0.02+0+0.15+0.12=0.29$
- $K_2[2] = 0.2*0.4 + 0.4*0.5 + 0.5*0.0 + 0.6*0.2 = 0.08+0.2+0+0.12=0.4$
- $K_2[3] = 0.2*0.0 + 0.4*0.2 + 0.5*0.3 + 0.6*0.1 = 0+0.08+0.15+0.06=0.29$
- $K_2[4] = 0.2*0.0 + 0.4*0.1 + 0.5*0.3 + 0.6*0.0 = 0+0.04+0.15+0=0.19$

$K_2 = [0.29, 0.4, 0.29, 0.19]$.

For "sat":
$$
K_3 = [0.15,0.25,0.5,0.2]W_K
$$
- $K_3[1] = 0.15*0.1 + 0.25*0.0 + 0.5*0.3 + 0.2*0.2 = 0.015+0+0.15+0.04=0.205$
- $K_3[2] = 0.15*0.4 + 0.25*0.5 + 0.5*0.0 + 0.2*0.2 = 0.06+0.125+0+0.04=0.225$
- $K_3[3] = 0.15*0.0 + 0.25*0.2 + 0.5*0.3 + 0.2*0.1 = 0+0.05+0.15+0.02=0.22$
- $K_3[4] = 0.15*0.0 + 0.25*0.1 + 0.5*0.3 + 0.2*0.0 = 0+0.025+0.15+0=0.175$

$K_3 = [0.205,0.225,0.22,0.175]$.

Thus:
$$
K = \begin{bmatrix}
  0.3 & 0.38 & 0.3 & 0.19 \\[4pt]
  0.29 & 0.4 & 0.29 & 0.19 \\[4pt]
  0.205 & 0.225 & 0.22 & 0.175
\end{bmatrix}
$$

- $V = XW_V$:
Similarly:

For "The":
- $V_1[1] = 0.1*0.2 + 0.4*0.0 + 0.5*0.1 + 0.7*0.0 = 0.02+0+0.05+0=0.07$
- $V_1[2] = 0.1*0.1 + 0.4*0.3 + 0.5*0.1 + 0.7*0.2 = 0.01+0.12+0.05+0.14=0.32$
- $V_1[3] = 0.1*0.0 + 0.4*0.5 + 0.5*0.1 + 0.7*0.1 = 0+0.2+0.05+0.07=0.32$
- $V_1[4] = 0.1*0.0 + 0.4*0.0 + 0.5*0.4 + 0.7*0.1 = 0+0+0.2+0.07=0.27$

$V_1 = [0.07, 0.32, 0.32, 0.27]$.

For "cat":
- $V_2[1] = 0.2*0.2 + 0.4*0.0 + 0.5*0.1 + 0.6*0.0 = 0.04+0+0.05+0=0.09$
- $V_2[2] = 0.2*0.1 + 0.4*0.3 + 0.5*0.1 + 0.6*0.2 = 0.02+0.12+0.05+0.12=0.31$
- $V_2[3] = 0.2*0.0 + 0.4*0.5 + 0.5*0.1 + 0.6*0.1 = 0+0.2+0.05+0.06=0.31$
- $V_2[4] = 0.2*0.0 + 0.4*0.0 + 0.5*0.4 + 0.6*0.1 = 0+0+0.2+0.06=0.26$

$V_2 = [0.09,0.31,0.31,0.26]$.

For "sat":
- $V_3[1] = 0.15*0.2 + 0.25*0.0 + 0.5*0.1 + 0.2*0.0 = 0.03+0+0.05+0=0.08$
- $V_3[2] = 0.15*0.1 + 0.25*0.3 + 0.5*0.1 + 0.2*0.2 = 0.015+0.075+0.05+0.04=0.18$
- $V_3[3] = 0.15*0.0 + 0.25*0.5 + 0.5*0.1 + 0.2*0.1 = 0+0.125+0.05+0.02=0.195$
- $V_3[4] = 0.15*0.0 + 0.25*0.0 + 0.5*0.4 + 0.2*0.1 = 0+0+0.2+0.02=0.22$

$V_3 = [0.08,0.18,0.195,0.22]$.

Thus:
$$
V = \begin{bmatrix}
  0.07 & 0.32 & 0.32 & 0.27 \\[4pt]
  0.09 & 0.31 & 0.31 & 0.26 \\[4pt]
  0.08 & 0.18 & 0.195 & 0.22
\end{bmatrix}
$$
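
The three projections can be checked with a small script. This is only a sketch in plain Python (the `matmul` helper is written for this example), using the $X$, $W_Q$, $W_K$, and $W_V$ values defined above.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

X = [[0.1, 0.4, 0.5, 0.7],
     [0.2, 0.4, 0.5, 0.6],
     [0.15, 0.25, 0.5, 0.2]]

W_Q = [[0.5, 0.1, 0.0, 0.3],
       [0.4, 0.2, 0.1, 0.0],
       [0.3, 0.3, 0.3, 0.3],
       [0.2, 0.1, 0.5, 0.4]]
W_K = [[0.1, 0.4, 0.0, 0.0],
       [0.0, 0.5, 0.2, 0.1],
       [0.3, 0.0, 0.3, 0.3],
       [0.2, 0.2, 0.1, 0.0]]
W_V = [[0.2, 0.1, 0.0, 0.0],
       [0.0, 0.3, 0.5, 0.0],
       [0.1, 0.1, 0.1, 0.4],
       [0.0, 0.2, 0.1, 0.1]]

Q = matmul(X, W_Q)
K = matmul(X, W_K)
V = matmul(X, W_V)

for name, matrix in (("Q", Q), ("K", K), ("V", V)):
    print(name, [[round(v, 4) for v in row] for row in matrix])
```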

#### 2. Compute the Attention Scores

$$
\text{Scores} = QK^T
$$

- Dimension check: $Q \in \mathbb{R}^{3 \times 4}, K \in \mathbb{R}^{3 \times 4}$, so $K^T \in \mathbb{R}^{4 \times 3}$. Thus, $\text{Scores} \in \mathbb{R}^{3 \times 3}$.

Compute $\text{Scores}[i,j]$ = $Q_i \cdot K_j$:

- $\text{Scores}[1,1] = Q_1 \cdot K_1 = [0.5*0.3 + 0.31*0.38 + 0.54*0.3 + 0.46*0.19]$
= $0.15 + 0.1178 + 0.162 + 0.0874 = 0.5172$

- $\text{Scores}[1,2] = Q_1 \cdot K_2 = [0.5*0.29 + 0.31*0.4 + 0.54*0.29 + 0.46*0.19]$
= $0.145 + 0.124 + 0.1566 + 0.0874 = 0.513$

- $\text{Scores}[1,3] = Q_1 \cdot K_3 = [0.5*0.205 + 0.31*0.225 + 0.54*0.22 + 0.46*0.175]$
= $0.1025 + 0.06975 + 0.1188 + 0.0805 = 0.37155$

- $\text{Scores}[2,1] = Q_2 \cdot K_1$
= $[0.53*0.3 + 0.31*0.38 + 0.49*0.3 + 0.45*0.19]$
= $0.159 + 0.1178 + 0.147 + 0.0855 = 0.5093$

- $\text{Scores}[2,2] = Q_2 \cdot K_2$
= $[0.53*0.29 + 0.31*0.4 + 0.49*0.29 + 0.45*0.19]$
= $0.1537 + 0.124 + 0.1421 + 0.0855 = 0.5053$

- $\text{Scores}[2,3] = Q_2 \cdot K_3$
= $[0.53*0.205 + 0.31*0.225 + 0.49*0.22 + 0.45*0.175]$
= $0.10865 + 0.06975 + 0.1078 + 0.07875 = 0.36495$

- $\text{Scores}[3,1] = Q_3 \cdot K_1$
= $[0.365*0.3 + 0.235*0.38 + 0.275*0.3 + 0.275*0.19]$
= $0.1095 + 0.0893 + 0.0825 + 0.05225 = 0.33355$

- $\text{Scores}[3,2] = Q_3 \cdot K_2$
= $[0.365*0.29 + 0.235*0.4 + 0.275*0.29 + 0.275*0.19]$
= $0.10585 + 0.094 + 0.07975 + 0.05225 = 0.33185$

- $\text{Scores}[3,3] = Q_3 \cdot K_3$
= $[0.365*0.205 + 0.235*0.225 + 0.275*0.22 + 0.275*0.175]$
= $0.074825 + 0.052875 + 0.0605 + 0.048125 = 0.236325$

Thus:
$$
\text{Scores} = \begin{bmatrix}
0.5172 & 0.513  & 0.37155 \\[4pt]
0.5093 & 0.5053 & 0.36495 \\[4pt]
0.33355 & 0.33185 & 0.236325
\end{bmatrix}
$$

#### 3. Scale the Scores

$$
\tilde{\text{Scores}} = \frac{\text{Scores}}{\sqrt{d_k}} = \frac{\text{Scores}}{\sqrt{4}} = \frac{\text{Scores}}{2}
$$

$$
\tilde{\text{Scores}} = \begin{bmatrix}
0.2586 & 0.2565 & 0.185775 \\[4pt]
0.25465 & 0.25265 & 0.182475 \\[4pt]
0.166775 & 0.165925 & 0.1181625
\end{bmatrix}
$$
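
Steps 2 and 3 can be sketched together: each score is a dot product between one query row and one key row, and the whole matrix is then divided by $\sqrt{d_k}$. The snippet starts from the $Q$ and $K$ values computed in step 1; the variable names are chosen for this example.

```python
import math

Q = [[0.5, 0.31, 0.54, 0.46],
     [0.53, 0.31, 0.49, 0.45],
     [0.365, 0.235, 0.275, 0.275]]
K = [[0.3, 0.38, 0.3, 0.19],
     [0.29, 0.4, 0.29, 0.19],
     [0.205, 0.225, 0.22, 0.175]]

d_k = len(Q[0])

# Scores[i][j] = Q_i . K_j, which is the same as (Q K^T)[i][j].
scores = [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K] for q_row in Q]

# Divide every entry by sqrt(d_k) = 2.
scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]

print([[round(s, 6) for s in row] for row in scores])
print([[round(s, 7) for s in row] for row in scaled])
```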

#### 4. Apply Softmax to Each Row

The **softmax** function is defined as follows:

$$
\text{softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}
$$

where $\mathbf{z} = (z_1, z_2, \ldots, z_N)$ is a vector of real numbers and $N$ is the dimension of the vector $\mathbf{z}$.


For the first row:
- Sum = $\exp(0.2586) + \exp(0.2565) + \exp(0.185775)$
- $\exp(0.2586) \approx 1.295$
- $\exp(0.2565) \approx 1.292$
- $\exp(0.185775) \approx 1.204$

Sum ≈ 1.295 + 1.292 + 1.204 = 3.791

So:
- $A[1,1] = 1.295/3.791 ≈ 0.3418$
- $A[1,2] = 1.292/3.791 ≈ 0.3408$
- $A[1,3] = 1.204/3.791 ≈ 0.3177$

For the second row:
- $\exp(0.25465) \approx 1.290$
- $\exp(0.25265) \approx 1.287$
- $\exp(0.182475) \approx 1.200$

Sum ≈ 1.290 + 1.287 + 1.200 = 3.777

- $A[2,1] = 1.290/3.777 ≈ 0.3415$
- $A[2,2] = 1.287/3.777 ≈ 0.3407$
- $A[2,3] = 1.200/3.777 ≈ 0.3178$

For the third row:
- $\exp(0.166775) \approx 1.181$
- $\exp(0.165925) \approx 1.180$
- $\exp(0.1181625) \approx 1.1255$

Sum ≈ 1.181 + 1.180 + 1.1255 = 3.4865

- $A[3,1] = 1.181/3.4865 ≈ 0.3388$
- $A[3,2] = 1.180/3.4865 ≈ 0.3385$
- $A[3,3] = 1.1255/3.4865 ≈ 0.3227$

So our attention weight matrix $A$ is approximately:
$$
A = \begin{bmatrix}
0.3418 & 0.3408 & 0.3177 \\[4pt]
0.3415 & 0.3407 & 0.3178 \\[4pt]
0.3388 & 0.3385 & 0.3227
\end{bmatrix}
$$
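
The row-wise softmax can be sketched as follows, starting from the scaled scores of step 3. Subtracting the row maximum before exponentiating is a common stability trick and does not change the result. The output agrees with the hand calculation to about three decimal places; the small differences in the fourth decimal come from rounding the exponentials by hand.

```python
import math

def softmax(row):
    """Softmax over one row, shifted by the row maximum for numerical stability."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

scaled_scores = [
    [0.2586, 0.2565, 0.185775],
    [0.25465, 0.25265, 0.182475],
    [0.166775, 0.165925, 0.1181625],
]

A = [softmax(row) for row in scaled_scores]

for row in A:
    print([round(a, 4) for a in row], "row sum =", round(sum(row), 6))
```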

#### 5. Compute the Final Output

$$
\text{AttOutput} = A V
$$

- Dimension: $A \in \mathbb{R}^{3 \times 3}, V \in \mathbb{R}^{3 \times 4}$, resulting in $\text{AttOutput} \in \mathbb{R}^{3 \times 4}$.

For each row $i$:
$$
\text{AttOutput}_i = \sum_{j=1}^{3} A[i,j] V_j
$$

- For the first token:
$$
\text{AttOutput}_1 = 0.3418V_1 + 0.3408V_2 + 0.3177V_3
$$
Recall:
- $V_1 = [0.07,0.32,0.32,0.27]$
- $V_2 = [0.09,0.31,0.31,0.26]$
- $V_3 = [0.08,0.18,0.195,0.22]$

Compute component-wise:
- Dim1: $0.3418*0.07 + 0.3408*0.09 + 0.3177*0.08 = 0.0239 + 0.0307 + 0.0254 = 0.0800$
- Dim2: $0.3418*0.32 + 0.3408*0.31 + 0.3177*0.18 = 0.1094 + 0.1056 + 0.0572 = 0.2722$
- Dim3: $0.3418*0.32 + 0.3408*0.31 + 0.3177*0.195 = 0.1094 + 0.1056 + 0.0620 = 0.2770$
- Dim4: $0.3418*0.27 + 0.3408*0.26 + 0.3177*0.22 = 0.0923 + 0.0886 + 0.0699 = 0.2508$

$\text{AttOutput}_1 = [0.0800, 0.2722, 0.2770, 0.2508]$

- For the second token:
$$
\text{AttOutput}_2 = 0.3415V_1 + 0.3407V_2 + 0.3178V_3
$$
Repeat similarly:
- Dim1: $0.3415*0.07 + 0.3407*0.09 + 0.3178*0.08 = 0.023905 + 0.030663 + 0.025424 = 0.079992$
- Dim2: $0.3415*0.32 + 0.3407*0.31 + 0.3178*0.18 = 0.10928 + 0.105617 + 0.057204 = 0.272101$
- Dim3: $0.3415*0.32 + 0.3407*0.31 + 0.3178*0.195 = 0.10928 + 0.105617 + 0.061971 = 0.276868$
- Dim4: $0.3415*0.27 + 0.3407*0.26 + 0.3178*0.22 = 0.092205 + 0.088582 + 0.069916 = 0.250703$

$\text{AttOutput}_2 \approx [0.0800, 0.2721, 0.2769, 0.2507]$

- For the third token:
$$
\text{AttOutput}_3 = 0.3388V_1 + 0.3385V_2 + 0.3227V_3
$$
- Dim1: $0.3388*0.07 + 0.3385*0.09 + 0.3227*0.08 = 0.023716 + 0.030465 + 0.025816 = 0.079997$
- Dim2: $0.3388*0.32 + 0.3385*0.31 + 0.3227*0.18 = 0.108416 + 0.104935 + 0.058086 = 0.271437$
- Dim3: $0.3388*0.32 + 0.3385*0.31 + 0.3227*0.195 = 0.108416 + 0.104935 + 0.062927 = 0.276278$
- Dim4: $0.3388*0.27 + 0.3385*0.26 + 0.3227*0.22 = 0.091476 + 0.08801 + 0.070994 = 0.25048$

$\text{AttOutput}_3 \approx [0.08, 0.2714, 0.2763, 0.2505]$

Final $\text{AttOutput}$:
$$
\text{AttOutput} \approx \begin{bmatrix}
0.0800 & 0.2722 & 0.2770 & 0.2508 \\[4pt]
0.0800 & 0.2721 & 0.2769 & 0.2507 \\[4pt]
0.08   & 0.2714 & 0.2763 & 0.2505
\end{bmatrix}
$$

## Interpretation

- Each row of $\text{AttOutput}$ is the transformed representation of the corresponding token after attending to all tokens in the sequence (including itself).
- Notice that the rows are fairly similar, which reflects the similarity in the queries and keys for this small artificial example. In a more complex and varied sequence, these values would differ more significantly.
- In practice, multiple heads are used, and their outputs are concatenated to capture various patterns. Here, we demonstrated just a single-head scenario.
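
The five steps of the worked example can be collected into one function. This is a minimal single-head sketch in plain Python, not a reference implementation; the function and helper names are chosen for this example, and the inputs are the toy $X$, $W_Q$, $W_K$, and $W_V$ from the setup.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(x, w_q, w_k, w_v):
    """softmax(Q K^T / sqrt(d_k)) V for one head, without masking."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]            # K^T
    scaled = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, v), weights

X = [[0.1, 0.4, 0.5, 0.7], [0.2, 0.4, 0.5, 0.6], [0.15, 0.25, 0.5, 0.2]]
W_Q = [[0.5, 0.1, 0.0, 0.3], [0.4, 0.2, 0.1, 0.0],
       [0.3, 0.3, 0.3, 0.3], [0.2, 0.1, 0.5, 0.4]]
W_K = [[0.1, 0.4, 0.0, 0.0], [0.0, 0.5, 0.2, 0.1],
       [0.3, 0.0, 0.3, 0.3], [0.2, 0.2, 0.1, 0.0]]
W_V = [[0.2, 0.1, 0.0, 0.0], [0.0, 0.3, 0.5, 0.0],
       [0.1, 0.1, 0.1, 0.4], [0.0, 0.2, 0.1, 0.1]]

att_output, A = single_head_attention(X, W_Q, W_K, W_V)
for row in att_output:
    print([round(v, 4) for v in row])
```

Because the three rows of attention weights are nearly uniform, the three output rows come out almost identical, which matches the observation in the interpretation above.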






### Decoder Architecture in Transformers
#### 6. Apply Masking

**Masking** is used to prevent attention to certain positions. In this example, we will apply a **causal mask** to prevent each token from attending to future tokens. This is particularly useful in decoder architectures to maintain the autoregressive property.
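
As a sketch of how a causal mask changes the toy example, the snippet below adds a lower-triangular mask to the raw scores from step 2, scales, applies softmax, and multiplies by $V$. The helper names are chosen for this example. With a strictly causal mask, the first token can attend only to itself, so its output is exactly its own value vector.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Softmax over one row; masked (-inf) entries receive weight 0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores Q K^T and values V from the worked example.
scores = [[0.5172, 0.513, 0.37155],
          [0.5093, 0.5053, 0.36495],
          [0.33355, 0.33185, 0.236325]]
V = [[0.07, 0.32, 0.32, 0.27],
     [0.09, 0.31, 0.31, 0.26],
     [0.08, 0.18, 0.195, 0.22]]
d_k = 4
n = len(scores)

# Causal mask: position i may attend to positions j <= i only.
mask = [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

masked = [[s + m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(scores, mask)]
scaled = [[s / math.sqrt(d_k) for s in row] for row in masked]
A_masked = [softmax(row) for row in scaled]

# Weighted sum of value vectors for each query position.
masked_output = [
    [sum(w * v_row[d] for w, v_row in zip(weights, V)) for d in range(len(V[0]))]
    for weights in A_masked
]

for weights, out in zip(A_masked, masked_output):
    print([round(w, 4) for w in weights], "->", [round(o, 4) for o in out])
```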

- **Causal Mask Matrix ($M$):**

  The causal mask ensures that each position can only attend to itself and previous positions. For our 3-token sequence:

  $$
  M = \begin{bmatrix}
  0 & -\infty & -\infty \\[4pt]
  0 & 0 & -\infty \\[4pt]
  0 & 0 & 0
  \end{bmatrix}
  $$

  - $0$ allows attention.
  - $-\infty$ effectively masks out the position by making its softmax probability zero.

- **Apply Mask to Scores:**

  $$
  \text{Masked Scores} = \text{Scores} + M
  $$

  Performing element-wise addition:

  $$
  \text{Masked Scores} = \begin{bmatrix}
  0.5172 & -\infty & -\infty \\[4pt]
  0.5093 & 0.5053 & -\infty \\[4pt]
  0.33355 & 0.33185 & 0.236325
  \end{bmatrix}
  $$

  **Explanation:**  
  - The first token ("The") can attend only to itself, so its scores for "cat" and "sat" become $-\infty$.
  - The second token ("cat") can attend to "The" and to itself, but not to the third token ("sat"), hence $-\infty$ in the last column.
  - The third token ("sat") can attend to all tokens, including itself.

#### 7. Scale the Scores

Scaling helps in stabilizing gradients during training. Because $-\infty / 2 = -\infty$, masking before scaling gives the same result as scaling first and then adding $M$.

$$
\tilde{\text{Scores}} = \frac{\text{Masked Scores}}{\sqrt{d_k}} = \frac{\text{Masked Scores}}{\sqrt{4}} = \frac{\text{Masked Scores}}{2}
$$

$$
\tilde{\text{Scores}} = \begin{bmatrix}
0.2586 & -\infty & -\infty \\[4pt]
0.25465 & 0.25265 & -\infty \\[4pt]
0.166775 & 0.165925 & 0.1181625
\end{bmatrix}
$$

#### 8. Apply Softmax to Each Row

The **softmax** function converts the scaled scores into probabilities.

$$
\text{softmax}(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}
$$

**Applying Softmax:**

- **First Row:**
  $$
  \mathbf{z}_1 = [0.2586, -\infty, -\infty]
  $$
  - $\exp(0.2586) \approx 1.295$
  - $\exp(-\infty) = 0$
  - $\exp(-\infty) = 0$

  Sum: $1.295 + 0 + 0 = 1.295$

  Softmax:
  $$
  A[1,1] = \frac{1.295}{1.295} = 1.0 \\
  A[1,2] = \frac{0}{1.295} = 0.0 \\
  A[1,3] = \frac{0}{1.295} = 0.0
  $$

- **Second Row:**
  $$
  \mathbf{z}_2 = [0.25465, 0.25265, -\infty]
  $$
  - $\exp(0.25465) \approx 1.290$
  - $\exp(0.25265) \approx 1.287$
  - $\exp(-\infty) = 0$

  Sum: $1.290 + 1.287 + 0 = 2.577$

  Softmax:
  $$
  A[2,1] = \frac{1.290}{2.577} \approx 0.500 \\
  A[2,2] = \frac{1.287}{2.577} \approx 0.500 \\
  A[2,3] = \frac{0}{2.577} = 0.0
  $$

- **Third Row:**
  $$
  \mathbf{z}_3 = [0.166775, 0.165925, 0.1181625]
  $$
  - $\exp(0.166775) \approx 1.181$
  - $\exp(0.165925) \approx 1.180$
  - $\exp(0.1181625) \approx 1.1255$

  Sum: $1.181 + 1.180 + 1.1255 = 3.4865$

  Softmax:
  $$
  A[3,1] = \frac{1.181}{3.4865} \approx 0.339 \\
  A[3,2] = \frac{1.180}{3.4865} \approx 0.339 \\
  A[3,3] = \frac{1.1255}{3.4865} \approx 0.323
  $$

Thus, our attention weight matrix $A$ is approximately:

$$
A = \begin{bmatrix}
1.0 & 0.0 & 0.0 \\[4pt]
0.500 & 0.500 & 0.0 \\[4pt]
0.339 & 0.339 & 0.323
\end{bmatrix}
$$

**Effect of Masking:**
- **First Token ("The"):** Can only attend to itself. No attention to "cat" or "sat".
- **Second Token ("cat"):** Can only attend to "The" and itself. No attention to "sat".
- **Third Token ("sat"):** Can attend to all tokens, including itself.

#### 9. Compute the Final Output

$$
\text{AttOutput} = A V
$$

- **Dimension Check:**  
  $A \in \mathbb{R}^{3 \times 3}$, $V \in \mathbb{R}^{3 \times 4}$, so $\text{AttOutput} \in \mathbb{R}^{3 \times 4}$.

- **Calculation:**

  - **For the first token ("The"):**
    $$
    \text{AttOutput}_1 = 1.0 \times V_1 + 0.0 \times V_2 + 0.0 \times V_3 \\
    = 1.0 \times [0.07, 0.32, 0.32, 0.27] \\
    = [0.07, 0.32, 0.32, 0.27]
    $$

  - **For the second token ("cat"):**
    $$
    \text{AttOutput}_2 = 0.500 \times V_1 + 0.500 \times V_2 + 0.0 \times V_3 \\
    = 0.500 \times [0.07, 0.32, 0.32, 0.27] + 0.500 \times [0.09, 0.31, 0.31, 0.26] + 0.0 \times [0.08, 0.18, 0.195, 0.22] \\
    = [0.035, 0.16, 0.16, 0.135] + [0.045, 0.155, 0.155, 0.13] + [0.0, 0.0, 0.0, 0.0] \\
    = [0.080, 0.315, 0.315, 0.265]
    $$

  - **For the third token ("sat"):**
    $$
    \text{AttOutput}_3 = 0.339 \times V_1 + 0.339 \times V_2 + 0.323 \times V_3 \\
    = 0.339 \times [0.07, 0.32, 0.32, 0.27] + 0.339 \times [0.09, 0.31, 0.31, 0.26] + 0.323 \times [0.08, 0.18, 0.195, 0.22] \\
    = [0.02373, 0.10848, 0.10848, 0.09153] + [0.03051, 0.10509, 0.10509, 0.08814] + [0.02584, 0.05814, 0.06299, 0.07106] \\
    = [0.08008, 0.27171, 0.27656, 0.25073]
    $$

Thus, the final attention output matrix $\text{AttOutput}$ is approximately:

$$
\text{AttOutput} \approx \begin{bmatrix}
0.070 & 0.320 & 0.320 & 0.270 \\[4pt]
0.080 & 0.315 & 0.315 & 0.265 \\[4pt]
0.080 & 0.272 & 0.277 & 0.251
\end{bmatrix}
$$

**Interpretation:**
- **First Token ("The"):** Its representation is exactly its own value vector $V_1$, because the causal mask hides "cat" and "sat".
- **Second Token ("cat"):** Its representation is approximately the average of the value vectors of "The" and "cat", ignoring "sat".
- **Third Token ("sat"):** It attends to all three tokens, incorporating information from "The", "cat", and itself.

#### 10. Incorporate Multi-Head Attention (Optional)

**Multi-Head Attention** allows the model to attend to information from different representation subspaces at different positions. Here's how to extend our example to multi-head attention.

- **Assumptions:**
  - Number of heads: $h = 2$
  - Dimension per head: $d_k = d_v = d_{\text{model}} / h = 2$

- **Parameter Matrices for Each Head:**
  For simplicity, we define separate $W_Q$, $W_K$, and $W_V$ for each head. Assume these are predefined.

  **Head 1:**
  $$
  W_{Q}^{(1)} = \begin{bmatrix}
  0.5 & 0.1 \\
  0.4 & 0.2 \\
  0.3 & 0.3 \\
  0.2 & 0.1
  \end{bmatrix}, \quad
  W_{K}^{(1)} = \begin{bmatrix}
  0.1 & 0.4 \\
  0.0 & 0.5 \\
  0.3 & 0.0 \\
  0.2 & 0.2
  \end{bmatrix}, \quad
  W_{V}^{(1)} = \begin{bmatrix}
  0.2 & 0.1 \\
  0.0 & 0.3 \\
  0.1 & 0.1 \\
  0.0 & 0.2
  \end{bmatrix}
  $$

  **Head 2:**
  $$
  W_{Q}^{(2)} = \begin{bmatrix}
  0.0 & 0.3 \\
  0.1 & 0.0 \\
  0.3 & 0.3 \\
  0.5 & 0.4
  \end{bmatrix}, \quad
  W_{K}^{(2)} = \begin{bmatrix}
  0.0 & 0.0 \\
  0.2 & 0.1 \\
  0.3 & 0.3 \\
  0.1 & 0.0
  \end{bmatrix}, \quad
  W_{V}^{(2)} = \begin{bmatrix}
  0.0 & 0.0 \\
  0.5 & 0.0 \\
  0.1 & 0.4 \\
  0.1 & 0.1
  \end{bmatrix}
  $$

- **Compute Q, K, V for Each Head:**

  **Head 1:**
  $$
  Q^{(1)} = X W_{Q}^{(1)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.5 & 0.1 \\
  0.4 & 0.2 \\
  0.3 & 0.3 \\
  0.2 & 0.1
  \end{bmatrix}
  = \begin{bmatrix}
  0.5 & 0.31 \\
  0.53 & 0.31 \\
  0.365 & 0.235
  \end{bmatrix}
  $$

  $$
  K^{(1)} = X W_{K}^{(1)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.1 & 0.4 \\
  0.0 & 0.5 \\
  0.3 & 0.0 \\
  0.2 & 0.2
  \end{bmatrix}
  = \begin{bmatrix}
  0.3 & 0.38 \\
  0.29 & 0.4 \\
  0.205 & 0.225
  \end{bmatrix}
  $$

  $$
  V^{(1)} = X W_{V}^{(1)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.2 & 0.1 \\
  0.0 & 0.3 \\
  0.1 & 0.1 \\
  0.0 & 0.2
  \end{bmatrix}
  = \begin{bmatrix}
  0.07 & 0.32 \\
  0.09 & 0.31 \\
  0.08 & 0.18
  \end{bmatrix}
  $$

  **Head 2:**
  $$
  Q^{(2)} = X W_{Q}^{(2)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.0 & 0.3 \\
  0.1 & 0.0 \\
  0.3 & 0.3 \\
  0.5 & 0.4
  \end{bmatrix}
  = \begin{bmatrix}
  0.54 & 0.46 \\
  0.49 & 0.45 \\
  0.275 & 0.275
  \end{bmatrix}
  $$

  $$
  K^{(2)} = X W_{K}^{(2)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.0 & 0.0 \\
  0.2 & 0.1 \\
  0.3 & 0.3 \\
  0.1 & 0.0
  \end{bmatrix}
  = \begin{bmatrix}
  0.3 & 0.19 \\
  0.29 & 0.19 \\
  0.22 & 0.175
  \end{bmatrix}
  $$

  $$
  V^{(2)} = X W_{V}^{(2)} = \begin{bmatrix}
  0.1 & 0.4 & 0.5 & 0.7 \\
  0.2 & 0.4 & 0.5 & 0.6 \\
  0.15 & 0.25 & 0.5 & 0.2
  \end{bmatrix}
  \begin{bmatrix}
  0.0 & 0.0 \\
  0.5 & 0.0 \\
  0.1 & 0.4 \\
  0.1 & 0.1
  \end{bmatrix}
  = \begin{bmatrix}
  0.32 & 0.27 \\
  0.31 & 0.26 \\
  0.195 & 0.22
  \end{bmatrix}
  $$

- **Compute Attention for Each Head:**

  For each head, perform the same steps as single-head attention, now with $d_k = 2$:
  
  1. Compute Scores: $Q^{(h)} {K^{(h)}}^T$
  2. Apply Masking: add the causal mask $M$
  3. Scale Scores: divide by $\sqrt{d_k} = \sqrt{2} \approx 1.414$
  4. Apply Softmax to each row
  5. Compute Attention Output: $A^{(h)} V^{(h)}$

  Both heads share the same $3 \times 3$ causal mask, which lets each token attend only to itself and to earlier tokens:
  $$
  M = \begin{bmatrix}
  0 & -\infty & -\infty \\
  0 & 0 & -\infty \\
  0 & 0 & 0
  \end{bmatrix}
  $$

  **Head 1:**

  - **Scores:**
    $$
    \text{Scores}^{(1)} = Q^{(1)} {K^{(1)}}^T = \begin{bmatrix}
    0.5 & 0.31 \\
    0.53 & 0.31 \\
    0.365 & 0.235
    \end{bmatrix}
    \begin{bmatrix}
    0.3 & 0.29 & 0.205 \\
    0.38 & 0.4 & 0.225
    \end{bmatrix}
    = \begin{bmatrix}
    0.2678 & 0.2690 & 0.17225 \\
    0.2768 & 0.2777 & 0.1784 \\
    0.1988 & 0.19985 & 0.1277
    \end{bmatrix}
    $$
    For example, the top-left entry is $0.5 \times 0.3 + 0.31 \times 0.38 = 0.2678$.

  - **Apply Masking:**
    $$
    \text{Masked Scores}^{(1)} = \text{Scores}^{(1)} + M = \begin{bmatrix}
    0.2678 & -\infty & -\infty \\
    0.2768 & 0.2777 & -\infty \\
    0.1988 & 0.19985 & 0.1277
    \end{bmatrix}
    $$

  - **Scale:**
    $$
    \tilde{\text{Scores}}^{(1)} = \frac{\text{Masked Scores}^{(1)}}{\sqrt{2}} \approx \begin{bmatrix}
    0.1894 & -\infty & -\infty \\
    0.1957 & 0.1964 & -\infty \\
    0.1406 & 0.1413 & 0.0903
    \end{bmatrix}
    $$

  - **Softmax:**
    $$
    A^{(1)} = \text{softmax}(\tilde{\text{Scores}}^{(1)}) \approx \begin{bmatrix}
    1.000 & 0.0 & 0.0 \\
    0.500 & 0.500 & 0.0 \\
    0.339 & 0.339 & 0.322
    \end{bmatrix}
    $$

  - **Attention Output:**
    $$
    \text{AttOutput}^{(1)} = A^{(1)} V^{(1)} \approx \begin{bmatrix}
    1.000 & 0.0 & 0.0 \\
    0.500 & 0.500 & 0.0 \\
    0.339 & 0.339 & 0.322
    \end{bmatrix}
    \begin{bmatrix}
    0.07 & 0.32 \\
    0.09 & 0.31 \\
    0.08 & 0.18
    \end{bmatrix}
    \approx \begin{bmatrix}
    0.070 & 0.320 \\
    0.080 & 0.315 \\
    0.080 & 0.272
    \end{bmatrix}
    $$

  **Head 2:**

  - **Scores:**
    $$
    \text{Scores}^{(2)} = Q^{(2)} {K^{(2)}}^T = \begin{bmatrix}
    0.54 & 0.46 \\
    0.49 & 0.45 \\
    0.275 & 0.275
    \end{bmatrix}
    \begin{bmatrix}
    0.3 & 0.29 & 0.22 \\
    0.19 & 0.19 & 0.175
    \end{bmatrix}
    = \begin{bmatrix}
    0.2494 & 0.2440 & 0.1993 \\
    0.2325 & 0.2276 & 0.18655 \\
    0.13475 & 0.1320 & 0.108625
    \end{bmatrix}
    $$
    For example, the top-left entry is $0.54 \times 0.3 + 0.46 \times 0.19 = 0.2494$.

  - **Apply Masking:**
    $$
    \text{Masked Scores}^{(2)} = \text{Scores}^{(2)} + M = \begin{bmatrix}
    0.2494 & -\infty & -\infty \\
    0.2325 & 0.2276 & -\infty \\
    0.13475 & 0.1320 & 0.108625
    \end{bmatrix}
    $$

  - **Scale:**
    $$
    \tilde{\text{Scores}}^{(2)} = \frac{\text{Masked Scores}^{(2)}}{\sqrt{2}} \approx \begin{bmatrix}
    0.1764 & -\infty & -\infty \\
    0.1644 & 0.1609 & -\infty \\
    0.0953 & 0.0933 & 0.0768
    \end{bmatrix}
    $$

  - **Softmax:**
    $$
    A^{(2)} = \text{softmax}(\tilde{\text{Scores}}^{(2)}) \approx \begin{bmatrix}
    1.000 & 0.0 & 0.0 \\
    0.501 & 0.499 & 0.0 \\
    0.336 & 0.335 & 0.329
    \end{bmatrix}
    $$

  - **Attention Output:**
    $$
    \text{AttOutput}^{(2)} = A^{(2)} V^{(2)} \approx \begin{bmatrix}
    1.000 & 0.0 & 0.0 \\
    0.501 & 0.499 & 0.0 \\
    0.336 & 0.335 & 0.329
    \end{bmatrix}
    \begin{bmatrix}
    0.32 & 0.27 \\
    0.31 & 0.26 \\
    0.195 & 0.22
    \end{bmatrix}
    \approx \begin{bmatrix}
    0.320 & 0.270 \\
    0.315 & 0.265 \\
    0.275 & 0.250
    \end{bmatrix}
    $$

  In both heads the first row of the output equals the first row of $V^{(h)}$, because under the causal mask the first token can attend only to itself.

- **Concatenate Heads and Project:**

  After computing attention outputs for all heads, concatenate them along the feature dimension:

  $$
  \text{Concat} = [\text{AttOutput}^{(1)}, \text{AttOutput}^{(2)}] \approx \begin{bmatrix}
  0.070 & 0.320 & 0.320 & 0.270 \\
  0.080 & 0.315 & 0.315 & 0.265 \\
  0.080 & 0.272 & 0.275 & 0.250
  \end{bmatrix}
  $$

  Then apply a final linear projection. Define $W_O \in \mathbb{R}^{(h \cdot d_v) \times d_{\text{model}}}$ as:

  $$
  W_O = \begin{bmatrix}
  0.1 & 0.0 & 0.2 & 0.1 \\
  0.0 & 0.1 & 0.0 & 0.2 \\
  0.3 & 0.1 & 0.0 & 0.0 \\
  0.0 & 0.2 & 0.1 & 0.1
  \end{bmatrix}
  $$

  Compute the final output, using the rounded entries of $\text{Concat}$:

  $$
  \text{FinalOutput} = \text{Concat} \cdot W_O \approx \begin{bmatrix}
  0.1030 & 0.1180 & 0.0410 & 0.0980 \\
  0.1025 & 0.1160 & 0.0425 & 0.0975 \\
  0.0905 & 0.1047 & 0.0410 & 0.0874
  \end{bmatrix}
  $$

  For example, the top-left entry is $0.070 \times 0.1 + 0.320 \times 0.0 + 0.320 \times 0.3 + 0.270 \times 0.0 = 0.103$.

#### 8. Final Representation

The final output represents the attended information for each token, enriched by multiple attention heads capturing diverse patterns.
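
The multi-head computation in this section can be reproduced end to end in plain Python. The sketch below makes two assumptions explicit: a strictly causal mask (token $i$ attends only to positions $\le i$) and scaling by $\sqrt{d_k}$ with $d_k = 2$. The helper names `matmul`, `softmax`, and `causal_attention` are ours; the input $X$, the per-head projection matrices, and $W_O$ are the ones listed above.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax; masked (-inf) entries receive weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(X, W_q, W_k, W_v):
    """One attention head with a causal mask and 1/sqrt(d_k) scaling."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(W_k[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    n = len(scores)
    weights = []
    for i in range(n):
        # Block future positions (j > i), then scale and normalise the row
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        weights.append(softmax([s / math.sqrt(d_k) for s in masked]))
    return matmul(weights, V)

X = [[0.1, 0.4, 0.5, 0.7], [0.2, 0.4, 0.5, 0.6], [0.15, 0.25, 0.5, 0.2]]

heads = [
    (  # Head 1: W_Q, W_K, W_V
        [[0.5, 0.1], [0.4, 0.2], [0.3, 0.3], [0.2, 0.1]],
        [[0.1, 0.4], [0.0, 0.5], [0.3, 0.0], [0.2, 0.2]],
        [[0.2, 0.1], [0.0, 0.3], [0.1, 0.1], [0.0, 0.2]],
    ),
    (  # Head 2: W_Q, W_K, W_V
        [[0.0, 0.3], [0.1, 0.0], [0.3, 0.3], [0.5, 0.4]],
        [[0.0, 0.0], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0]],
        [[0.0, 0.0], [0.5, 0.0], [0.1, 0.4], [0.1, 0.1]],
    ),
]

W_O = [
    [0.1, 0.0, 0.2, 0.1],
    [0.0, 0.1, 0.0, 0.2],
    [0.3, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.1],
]

head_outputs = [causal_attention(X, *weights) for weights in heads]
# Concatenate the per-head outputs along the feature dimension: 3 x (2 + 2)
concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
final_output = matmul(concat, W_O)

for row in final_output:
    print([round(v, 4) for v in row])
```

Because the code keeps full floating-point precision, its last digits can differ slightly from a hand calculation that rounds intermediate results.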

## Summary of Enhanced Steps

1. **Input Preparation:**
   - Create input embeddings and add positional encodings.

2. **Linear Projections:**
   - Compute $Q = XW_Q$, $K = XW_K$, $V = XW_V$ for each head.

3. **Attention Scores:**
   - Compute $\text{Scores} = QK^T$.

4. **Apply Masking:**
   - Add mask matrix $M$ to $\text{Scores}$ to obtain $\text{Masked Scores}$.

5. **Scaling:**
   - Divide the masked scores by $\sqrt{d_k}$, where $d_k$ is the key dimension of each head.

6. **Softmax:**
   - Apply softmax to obtain attention weights $A$.

7. **Attention Output:**
   - Compute $\text{AttOutput} = A V$.

8. **Multi-Head Concatenation (if applicable):**
   - Concatenate outputs from all heads and apply final linear projection.

9. **Final Representation:**
   - The final output represents the attended information for each token.
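
The steps above can be condensed into two formulas. This restatement uses the symbols already introduced in the walkthrough ($M$ for the mask, $d_k$ for the per-head key dimension, $W_O$ for the output projection); writing the head index as a superscript $(i)$ running from $1$ to $h$ is our notational choice.

$$
\text{AttOutput}^{(i)} = \text{softmax}\!\left(\frac{Q^{(i)} {K^{(i)}}^T + M}{\sqrt{d_k}}\right) V^{(i)}, \qquad Q^{(i)} = X W_Q^{(i)}, \quad K^{(i)} = X W_K^{(i)}, \quad V^{(i)} = X W_V^{(i)}
$$

$$
\text{FinalOutput} = [\text{AttOutput}^{(1)}, \dots, \text{AttOutput}^{(h)}] \, W_O
$$

With $h = 1$ and no output projection, the first formula reduces to the single-head case.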



## Summary of Steps

1. Took the input sequence and created $X$.
2. Computed $Q = XW_Q$, $K = XW_K$, $V = XW_V$.
3. Calculated $\text{Scores} = QK^T$, applied the mask, then scaled the result by $1/\sqrt{d_k}$.
4. Applied softmax to get attention weights $A$.
5. Computed the final output as $A V$.

This step-by-step example shows how the linear algebra operations translate into the final attended representation.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

###############################################################################
# NOTE:
# This is an illustrative (pseudo) code template designed to show how you might
# structure training pipelines for different tasks using Transformers in PyTorch.
# It is NOT fully functional or optimized for production use.
#
# The key idea:
# - The core Transformer architecture is often the same (or very similar).
# - Task differences are primarily in:
#   1. Data Representation (input-output format, multimodal encoders, etc.)
#   2. Additional heads or modules depending on the modality and task.
#   3. Training objective / loss function.
#
# Tasks shown conceptually:
# - Q/A (Text-to-Text)
# - Summarization (Text-to-Text)
# - Content creation (e.g. text generation)
# - Sentiment analysis (Text classification)
# - Chatbot (Text-to-Text, possibly instruction-tuned)
# - Image-to-Audio (Vision encoder + Audio decoder)
# - Image-to-Image (Vision-to-Vision transform, possibly using diffusion models or VQ-VAE decoders)
# - Text-to-Image (Text encoder + Image decoder/generator)
# - Text-to-Video (Text encoder + Video generator)
#
# In practice, each of these tasks might require a different specialized architecture,
# especially when moving beyond text (multimodal). Below, we give a rough template
# and indicate what changes.
###############################################################################

########################################
# Basic Transformer Building Block
########################################
class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        # batch_first=True because every input in this file is shaped [B, T, E]
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x, mask=None):
        # Self-attention
        attn_out, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = x + attn_out
        x = self.norm1(x)

        # Feed-forward
        ff_out = self.fc(x)
        x = x + ff_out
        x = self.norm2(x)
        return x

class TransformerDecoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
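        # batch_first=True because every input in this file is shaped [B, T, E]
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)

    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None):
        # Self-attention (causal mask for autoregressive decoding)
        if tgt_mask is None:
            # Default to a causal mask: True marks future positions that must not be attended to
            T = tgt.size(1)
            tgt_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tgt.device), diagonal=1)
        attn_out, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = tgt + attn_out
        tgt = self.norm1(tgt)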

        # Cross-attention
        cross_out, _ = self.cross_attn(tgt, memory, memory, attn_mask=memory_mask)
        tgt = tgt + cross_out
        tgt = self.norm2(tgt)

        # Feed-forward
        ff_out = self.fc(tgt)
        tgt = tgt + ff_out
        tgt = self.norm3(tgt)
        return tgt

########################################
# Text-Only Transformer (e.g., for Q/A, Summarization, Chatbot)
########################################
class TextTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, num_heads=4, ff_dim=1024, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(512, embed_dim)  # positional embedding
        self.encoder = nn.ModuleList([TransformerEncoderLayer(embed_dim, num_heads, ff_dim) for _ in range(num_layers)])
        self.decoder = nn.ModuleList([TransformerDecoderLayer(embed_dim, num_heads, ff_dim) for _ in range(num_layers)])
        self.output_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, src_tokens, tgt_tokens, src_mask=None, tgt_mask=None):
        # Encode
        src_seq_len = src_tokens.size(1)
        src_positions = torch.arange(src_seq_len, device=src_tokens.device).unsqueeze(0)
        src_embed = self.embedding(src_tokens) + self.pos_embed(src_positions)

        memory = src_embed
        for layer in self.encoder:
            memory = layer(memory, mask=src_mask)

        # Decode
        tgt_seq_len = tgt_tokens.size(1)
        tgt_positions = torch.arange(tgt_seq_len, device=tgt_tokens.device).unsqueeze(0)
        tgt_embed = self.embedding(tgt_tokens) + self.pos_embed(tgt_positions)

        out = tgt_embed
        for layer in self.decoder:
            # src_mask is an [S, S] encoder mask, so it is not reused as the [T, S] cross-attention mask
            out = layer(out, memory, tgt_mask=tgt_mask)

        logits = self.output_head(out)
        return logits

    # Training for Q/A, Summarization, Chatbot: 
    # - The model architecture remains the same.
    # - The difference is in the data:
    #   Q/A: src=input passage+question, tgt=answer
    #   Summarization: src=original text, tgt=summary
    #   Chatbot: src=dialogue history, tgt=next reply
    # 
    # All are text-to-text. The difference is just *how you present the input-output pairs*.

# Example dataset loaders for different tasks
class TaskDataset(Dataset):
    def __init__(self, inputs, targets, tokenizer):
        self.inputs = inputs
        self.targets = targets
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        src = self.tokenizer(self.inputs[idx])
        tgt = self.tokenizer(self.targets[idx])
        return torch.tensor(src), torch.tensor(tgt)

# Dummy tokenizer for illustration.
# Note: Python salts str hashes per process, so these IDs are not stable across runs.
def dummy_tokenizer(text, vocab_size=1000):
    return [hash(token) % vocab_size for token in text.split()]

# Datasets for different tasks
question_answering_data = TaskDataset(
    inputs=["What is the capital of France?", "Who wrote 1984?"],
    targets=["Paris", "George Orwell"],
    tokenizer=dummy_tokenizer
)

summarization_data = TaskDataset(
    inputs=["This is a long article about AI and its applications."],
    targets=["AI and its applications."],
    tokenizer=dummy_tokenizer
)

chatbot_data = TaskDataset(
    inputs=["chat: Hello, how are you?", "chat: What is your name?"],
    targets=["I am fine, thank you.", "I am a chatbot."],
    tokenizer=dummy_tokenizer
)

########################################
# Text Classification Head (e.g., Sentiment Analysis)
########################################
class TextClassifier(nn.Module):
    def __init__(self, base_model, num_classes):
        super().__init__()
        self.base_model = base_model  # a TextTransformer; only its embeddings and encoder stack are used
        self.classifier = nn.Linear(base_model.embedding.embedding_dim, num_classes)

    def forward(self, src_tokens, src_mask=None):
        # We'll just use the encoder part of a transformer for classification
        src_seq_len = src_tokens.size(1)
        src_positions = torch.arange(src_seq_len, device=src_tokens.device).unsqueeze(0)
        embed = self.base_model.embedding(src_tokens) + self.base_model.pos_embed(src_positions)
        
        x = embed
        for layer in self.base_model.encoder:
            x = layer(x, mask=src_mask)

        # Take the representation of the first token (e.g. [CLS])
        cls_repr = x[:, 0, :]
        logits = self.classifier(cls_repr)
        return logits

    # For sentiment analysis, we don't need a decoder.
    # We just encode the text and classify. This changes the output head
    # (instead of text generation, we predict class logits).


########################################
# Multimodal Extensions
########################################

# For Image-to-Audio, Text-to-Image, etc., we need different encoders/decoders.

# Example: Image Encoder (e.g., a Vision Transformer (ViT)-style encoder)
class VisionEncoder(nn.Module):
    def __init__(self, image_patch_dim, embed_dim=256, num_heads=4, ff_dim=1024, num_layers=4):
        super().__init__()
        # image_patch_dim: dimension after splitting image into patches and flattening
        self.linear = nn.Linear(image_patch_dim, embed_dim)
        self.pos_embed = nn.Embedding(512, embed_dim)
        self.layers = nn.ModuleList([TransformerEncoderLayer(embed_dim, num_heads, ff_dim) for _ in range(num_layers)])

    def forward(self, image_patches):
        # image_patches: [B, num_patches, patch_dim]
        B, T, _ = image_patches.size()
        positions = torch.arange(T, device=image_patches.device).unsqueeze(0)
        x = self.linear(image_patches) + self.pos_embed(positions)

        for layer in self.layers:
            x = layer(x)
        return x  # This acts like "memory" for the decoder

# For Audio Decoding or Image Decoding, the architecture might vary significantly.
# Here, we might have a Transformer decoder that outputs spectrogram tokens for audio,
# or latent codes for an image decoder (like VQ-VAE codes).

class AudioDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, num_heads=4, ff_dim=1024, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(512, embed_dim)
        self.layers = nn.ModuleList([TransformerDecoderLayer(embed_dim, num_heads, ff_dim) for _ in range(num_layers)])
        self.output_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, memory, tgt_tokens, tgt_mask=None):
        # memory from vision encoder
        T = tgt_tokens.size(1)
        positions = torch.arange(T, device=tgt_tokens.device).unsqueeze(0)
        x = self.embedding(tgt_tokens) + self.pos_embed(positions)

        for layer in self.layers:
            x = layer(x, memory, tgt_mask=tgt_mask)

        logits = self.output_head(x)
        return logits

# For text-to-image or text-to-video: 
# - Text-to-image might have a text encoder (like the TextTransformer encoder) and a custom image decoder.
# - Text-to-video might be similar, but the decoder would produce a sequence of video frames (tokens), possibly using a specialized tokenization.

# The main changes are:
# 1. Add a different encoder for non-text input (images, audio features).
# 2. Add a different decoder for non-text output (image tokens, audio tokens).
# 3. Data presentation: 
#    - Image-to-Audio: Input: image (converted to patches), Output: sequence of audio tokens (like spectrogram patches)
#    - Image-to-Image: Input: image patches, Output: another set of image tokens (e.g., style transfer)
#    - Text-to-Image: Input: text tokens, Output: image tokens
#    - Text-to-Video: Input: text tokens, Output: video tokens (multiple frames encoded as patches)
#
# Each requires specialized data loaders that convert raw data (images, audio, video) into token-like sequences.
# Often, a learned codebook or a pretrained VAE (for images) or audio tokenizer is used.

########################################
# Example Dummy Training Loop (Text-to-Text)
########################################
from torch.nn.utils.rnn import pad_sequence

# Reserved token IDs for this illustration (an assumption; a real tokenizer defines its own)
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2
NUM_SPECIAL = 3

# Q/A training: Input: "context + question", Output: "answer"
class QADataset(Dataset):
    def __init__(self, questions, contexts, answers, tokenizer):
        self.questions = questions
        self.contexts = contexts
        self.answers = answers
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        # This is just a conceptual example
        src = self.tokenizer(self.contexts[idx] + " " + self.questions[idx])
        # Wrap the answer in BOS/EOS so that even a one-token answer survives the shift below
        tgt = [BOS_ID] + self.tokenizer(self.answers[idx]) + [EOS_ID]
        return torch.tensor(src), torch.tensor(tgt)

def pad_collate(batch):
    # Sequences in a batch have different lengths, so pad them to a common length
    srcs, tgts = zip(*batch)
    return (pad_sequence(list(srcs), batch_first=True, padding_value=PAD_ID),
            pad_sequence(list(tgts), batch_first=True, padding_value=PAD_ID))

# Instantiate a text model
vocab_size = 10000
model = TextTransformer(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy data
questions = ["What is the capital of France?", "Who wrote 1984?"]
contexts = ["France is a country in Europe. Its capital is Paris.", "1984 is a novel by George Orwell."]
answers = ["Paris", "George Orwell"]
# Dummy tokenizer; IDs below NUM_SPECIAL stay reserved for PAD/BOS/EOS
tokenizer = lambda x: [NUM_SPECIAL + hash(t) % (vocab_size - NUM_SPECIAL) for t in x.split()]

dataset = QADataset(questions, contexts, answers, tokenizer)
loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=pad_collate)

for src, tgt in loader:
    # Shift tgt by one for teacher forcing:
    # Input to decoder: tgt[:, :-1]
    # Targets for loss: tgt[:, 1:]
    logits = model(src, tgt[:, :-1])
    # logits: [B, T, vocab_size]
    # targets: [B, T]
    # Padded target positions are excluded from the loss
    loss = F.cross_entropy(logits.transpose(1, 2), tgt[:, 1:], ignore_index=PAD_ID)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print("Loss:", loss.item())

# This shows a typical text-to-text training step.
# For other tasks, you'd change:
# - The dataset (how data is loaded and tokenized)
# - Possibly the architecture (add a vision encoder for image tasks, etc.)
# - The loss function (maybe multiple heads or different output spaces)
#
# For advanced multimodal tasks, you'd have something like:
#   encoder = VisionEncoder(image_patch_dim=...),
#   decoder = AudioDecoder(vocab_size=...)
# and train similarly, but feeding image patches to the encoder and audio tokens to the decoder.

###############################################################################
# Summary:
# - Q/A, Summarization, Content creation, Chatbot: Mostly text-to-text.
#   * Same model architecture (encoder-decoder).
#   * Different datasets and prompts.
# 
# - Sentiment Analysis: Classification head on top of an encoder. Same model layers,
#   but different final layer (linear classifier) and data labeling.
#
# - Image-to-Audio, Image-to-Image, Text-to-Image, Text-to-Video:
#   * Typically involves a multimodal architecture:
#       - Add a Vision Encoder for images.
#       - Add a different Decoder (or generation head) for the output modality.
#   * Data presentation changes: images become patches, audio/video are tokenized,
#     text remains text tokens.
#   * Training objective still involves predicting the next "token" in the output modality.
###############################################################################
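
The teacher-forcing shift used in the training loop is easier to see on a single toy sequence. The sketch below is plain Python (no PyTorch required). The token IDs and the names `BOS`, `EOS`, and `causal_mask` are illustrative assumptions and not part of the notebook. The mask follows the boolean convention of `nn.MultiheadAttention`'s `attn_mask`, where `True` marks a position that may not be attended to.

```python
BOS, EOS = 1, 2        # hypothetical special-token IDs
answer = [57, 912]     # hypothetical token IDs for a two-token answer

tgt = [BOS] + answer + [EOS]
decoder_input = tgt[:-1]   # what the decoder is fed
labels = tgt[1:]           # what it must predict at each position

def causal_mask(size):
    """Return a size x size mask where True blocks attention to a future position."""
    return [[j > i for j in range(size)] for i in range(size)]

mask = causal_mask(len(decoder_input))

for i, target in enumerate(labels):
    # Tokens the decoder may look at when predicting position i
    visible = [tok for tok, blocked in zip(decoder_input, mask[i]) if not blocked]
    print(f"step {i}: visible={visible} -> predict {target}")
```

At step `i` the decoder sees only `decoder_input[0..i]` and is trained to emit `labels[i]`, which is the next token of the original target.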


> BERT TOKENS VS BERT EMBEDDINGS VS BERT MODEL ARCHITECTURE

In [None]:
import torch
from transformers import BertTokenizer, BertModel

# -------------------------- 1. BERT Tokens -----------------------------
# Step 1: Tokenize a given input sentence
pretrained_model = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(pretrained_model)

# Example sentence
text = "The cat sat on the mat."

# Tokenize the sentence
tokens = tokenizer.tokenize(text)  # Step 1: Get BERT tokens (no [CLS]/[SEP] are added here)
input_ids = tokenizer.convert_tokens_to_ids(tokens)  # Convert tokens to input IDs

print("Step 1: BERT Tokens")
print(f"Tokens: {tokens}")
print(f"Token IDs: {input_ids}\n")

# ---------------------- 2. BERT Embeddings ----------------------------
# Step 2: Prepare input tensors and extract embeddings
encoded_input = tokenizer(text, return_tensors="pt")  # Adds [CLS]/[SEP] and builds input_ids, token_type_ids, attention_mask

# Load the BERT model
model = BertModel.from_pretrained(pretrained_model)

# Extract Input Embeddings
with torch.no_grad():
    # Word Embeddings
    word_embeddings = model.embeddings.word_embeddings(encoded_input["input_ids"])

    # Position Embeddings
    position_ids = torch.arange(0, encoded_input["input_ids"].size(1)).unsqueeze(0)
    position_embeddings = model.embeddings.position_embeddings(position_ids)

    # Segment (Token Type) Embeddings
    token_type_embeddings = model.embeddings.token_type_embeddings(encoded_input["token_type_ids"])

# Combine all embeddings (BERT's embedding layer then applies LayerNorm and dropout to this sum)
input_embeddings = word_embeddings + position_embeddings + token_type_embeddings

print("Step 2: BERT Embeddings")
print(f"Word Embeddings Shape: {word_embeddings.shape}")
print(f"Position Embeddings Shape: {position_embeddings.shape}")
print(f"Segment Embeddings Shape: {token_type_embeddings.shape}")
print(f"Combined Input Embeddings Shape: {input_embeddings.shape}\n")
# Shape: (batch_size, sequence_length, hidden_size)

# ---------------------- 3. BERT Model/Architecture ---------------------
# Step 3: Pass embeddings through the Transformer architecture
with torch.no_grad():
    outputs = model(**encoded_input)

# Outputs
last_hidden_state = outputs.last_hidden_state  # Contextualized token representations
pooler_output = outputs.pooler_output          # [CLS] hidden state passed through a dense layer + tanh, for sentence-level tasks

print("Step 3: BERT Model/Architecture")
print(f"Last Hidden State Shape: {last_hidden_state.shape}")
print(f"Pooler Output Shape: {pooler_output.shape}")
# Shape: (batch_size, sequence_length, hidden_size) for last_hidden_state
# Shape: (batch_size, hidden_size) for pooler_output

# --------------------------- Recap -------------------------------------
"""
1. BERT Tokens:
   - Tokenization breaks text into subword tokens.
   - Tokens are converted into IDs that BERT understands.

2. BERT Embeddings:
   - Input IDs are converted into embeddings (Word, Position, and Segment embeddings).
        - Token Embeddings: Convert each token (word/subword) into a vector.
        - Segment Embeddings: Distinguish between different sentences in sentence-pair tasks (e.g., sentence A vs. sentence B).
        - Position Embeddings: Represent the position of each token in the sequence since Transformers lack inherent positional information.

   - Combined embeddings are the input to BERT's Transformer layers.

3. BERT Model/Architecture:
   - The Transformer layers contextualize the embeddings.
   - Outputs include:
       - last_hidden_state: Token-level contextual embeddings.
       - pooler_output: Sentence-level embedding, i.e. the [CLS] hidden state after a dense layer and tanh.
"""

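
Step 1 of the recap says that tokenization breaks text into subword tokens. BERT's WordPiece tokenizer does this with a greedy longest-match-first search over its vocabulary, and the sketch below imitates that idea on a tiny made-up vocabulary. The vocabulary and the function name `wordpiece_tokenize` are hypothetical and exist only for illustration; the real tokenizer also handles casing, punctuation, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (hypothetical); the bert-base-uncased vocabulary has about 30,000 entries
toy_vocab = {"the", "cat", "sat", "play", "##ing", "##ed", "un", "##believ", "##able"}

for w in ["cat", "playing", "unbelievable", "dog"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

With this toy vocabulary, "playing" splits into `play` and `##ing`, while "dog" has no matching pieces and maps to `[UNK]`.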

# AI Notebooks

## Generative AI and LLM Projects Classification
Code location: `C:\Users\pault\Documents\3. AI and Machine Learning\2. Deep Learning\1. VERY IMPORTANT\AI notebooks\notebooks`

### 1. Natural Language Processing (NLP)
These projects focus on understanding, generating, or interacting with text.

#### **Text Generation and Editing**
- Grammar correction (`grammar-correction`)
- Text summarization (`summarization-genai`)
- Instruction-following LLMs (`dolly-2-instruction-following`)
- Content creation (e.g., blogs, articles, or creative writing)
- Text-guided image editing (`instruct-pix2pix-image-editing`)

#### **Question Answering and Chatbots**
- Question answering systems (`llm-question-answering`, `table-question-answering`)
- Conversational AI/chatbots (`llm-chatbot`, `llava-multimodal-chatbot`)
- Task-specific chatbots (`nano-llava-multimodal-chatbot`, `mobilevlm-language-assistant`)

#### **Named Entity Recognition (NER)**
- Extracting structured information from text (`named-entity-recognition`, `nuextract-structure-extraction`)

#### **Language Translation and Multilingual Models**
- Cross-lingual translation (`cross-lingual-books-alignment`)
- Massively multilingual speech (`mms-massively-multilingual-speech`)

#### **RAG and Knowledge Retrieval**
- Retrieval-Augmented Generation (RAG) (`llm-rag-langchain`, `llm-rag-llamaindex`)
- Knowledge-based question answering (`knowledge-graphs-conve`)

---

### 2. Computer Vision
These projects involve image understanding, generation, or manipulation.

#### **Image Generation**
- Text-to-image (`stable-diffusion-text-to-image`, `text-to-image-genai`, `wuerstchen-image-generation`)
- Paint-by-example (`paint-by-example`)
- Style transfer (`style-transfer-webcam`, `pixart`)

#### **Object Detection and Segmentation**
- Object detection from webcam (`object-detection-webcam`, `hello-detection`)
- Image segmentation (`segment-anything`, `oneformer-segmentation`)

#### **Image-to-Image Transformation**
- Sketch-to-image (`sketch-to-image-pix2pix-turbo`)
- Image colorization (`ddcolor-image-colorization`)

#### **Multimodal Image Understanding**
- Visual language processing (`blip-visual-language-processing`)
- Zero-shot image classification (`clip-zero-shot-image-classification`, `siglip-zero-shot-image-classification`)

---

### 3. Speech and Audio Processing
Projects that focus on generating, understanding, or manipulating audio.

#### **Speech-to-Text**
- Automatic Speech Recognition (ASR) (`distil-whisper-asr`, `whisper-asr-genai`)
- Subtitle generation from audio (`whisper-subtitles-generation`)

#### **Text-to-Speech**
- Text-to-speech conversion (`parler-tts-text-to-speech`, `outetts-text-to-speech`)
- Multilingual TTS systems (`bark-text-to-audio`)

#### **Voice Conversion**
- Voice cloning or conversion (`freevc-voice-conversion`, `softvc-voice-conversion`)

#### **Music and Sound Generation**
- Text-to-music (`riffusion-text-to-music`)
- Audio generation (`sound-generation-audioldm2`, `stable-audio`)

---

### 4. Multimodal AI
Models combining text, images, audio, and video.

#### **Multimodal Chatbots**
- Multimodal assistants (`kosmos2-multimodal-large-language-model`, `llava-next-multimodal-chatbot`)

#### **Image-to-Audio or Video**
- Text-to-video generation (`zeroscope-text2video`)
- Text-to-audio generation (`bark-text-to-audio`)

#### **Cross-Modal Retrieval**
- Text-to-video retrieval (`s3d-mil-nce-text-to-video-retrieval`)
- Mobile-based video search (`mobileclip-video-search`)

---

### 5. Specialized Applications
Projects addressing domain-specific challenges.

#### **Healthcare**
- 3D segmentation for medical images (`3D-segmentation-point-clouds`)
- CT scan segmentation and quantization (`ct-segmentation-quantize`)

#### **Education**
- Explainable AI (`explainable-ai-1-basic`, `explainable-ai-3-map-interpretation`)
- Knowledge-based tutors or learning assistants

#### **Industrial Applications**
- Meter reading (`meter-reader`)
- Vehicle detection and recognition (`vehicle-detection-and-recognition`)
- Person tracking and counting (`person-tracking-webcam`, `person-counting-webcam`)

---

### 6. Optimization and Efficiency
Improving model performance or enabling deployment on resource-constrained devices.

#### **Model Quantization and Optimization**
- Post-training quantization (`pytorch-post-training-quantization-nncf`)
- Quantization-aware training (`pytorch-quantization-aware-training`, `tensorflow-quantization-aware-training`)

#### **Deployment Tools**
- Model conversion to OpenVINO (`pytorch-to-openvino`, `tensorflow-object-detection-to-openvino`)
- Lightweight implementations for mobile devices (`amused-lightweight-text-to-image`)

---

### 7. Emerging Domains
Innovative and experimental areas in GenAI.

#### **Real-Time Interaction**
- Real-time animation of humans (`animate-anyone`, `dynamicrafter-animating-images`)
- Pose estimation (`pose-estimation-webcam`, `3D-pose-estimation-webcam`)

#### **Image Depth and Monodepth**
- Depth estimation (`depth-anything`, `vision-monodepth`)

#### **Generative Diffusion Models**
- Stable diffusion for various tasks (`stable-diffusion-v2`, `stable-diffusion-v3`, `stable-diffusion-xl`)
- Video generation with diffusion models (`stable-video-diffusion`)


# OpenAI Notebooks

## Categorization of AI Projects

### 1. Multimodal Applications
#### 1.1 Vision
- GPT_with_vision_for_video_understanding.ipynb
- Tag_caption_images_with_GPT4V.ipynb
- Image_generations_edits_and_variations_with_DALL-E.ipynb
- How_to_create_dynamic_masks_with_DALL-E_and_Segment_Anything.ipynb
- Creating_slides_with_Assistants_API_and_DALL-E3.ipynb
- Vision_Fine_tuning_on_GPT4o_for_Visual_Question_Answering.ipynb
- Using_GPT4_Vision_With_Function_Calling.ipynb

#### 1.2 Audio
- Whisper_correct_misspelling.ipynb
- Whisper_processing_guide.ipynb
- Whisper_prompting_guide.ipynb
- voice_translation_into_different_languages_using_GPT-4o.ipynb
- steering_tts.ipynb

#### 1.3 Text + Visuals
- chat_with_your_own_data.ipynb
- Using_vision_modality_for_RAG_with_Pinecone.ipynb

---

### 2. Natural Language Processing (NLP)
#### 2.1 Text Classification
- Fine-tuned_classification.ipynb
- Multiclass_classification_for_transactions.ipynb
- Zero-shot_classification_with_embeddings.ipynb
- Classification_using_embeddings.ipynb

#### 2.2 Named Entity Recognition (NER) and Entity Extraction
- Named_Entity_Recognition_to_enrich_text.ipynb
- Entity_extraction_for_long_documents.ipynb

#### 2.3 Question Answering (QA)
- Question_answering_using_a_search_API.ipynb
- Question_answering_using_embeddings.ipynb
- QA_with_Langchain_AnalyticDB_and_OpenAI.ipynb
- QA_with_Langchain_Qdrant_and_OpenAI.ipynb
- QA_with_Langchain_Tair_and_OpenAI.ipynb
- question-answering-with-weaviate-and-openai.ipynb
- olympics-2-create-qa.ipynb
- olympics-3-train-qa.ipynb

#### 2.4 Summarization
- Summarizing_long_documents.ipynb
- How_to_eval_abstractive_summarization.ipynb

#### 2.5 Semantic Search
- Embedding_Wikipedia_articles_for_search.ipynb
- Semantic_text_search_using_embeddings.ipynb
- elasticsearch-semantic-search.ipynb
- OpenAI_wikipedia_semantic_search.ipynb
- Semantic_Search.ipynb

---

### 3. Machine Learning & Embeddings
#### 3.1 Embedding Applications
- Using_embeddings.ipynb
- custom_image_embedding_search.ipynb
- Recommendation_using_embeddings.ipynb
- Regression_using_embeddings.ipynb
- Visualizing_embeddings_in_2D.ipynb
- Visualizing_embeddings_in_3D.ipynb
- Visualizing_embeddings_in_Kangas.ipynb
- Visualizing_embeddings_in_wandb.ipynb
- Visualizing_embeddings_with_Atlas.ipynb

#### 3.2 Clustering and Dimensionality Reduction
- Clustering.ipynb
- Clustering_for_transaction_classification.ipynb

#### 3.3 Embedding Search
- Using_Chroma_for_embeddings_search.ipynb
- semantic_search_using_mongodb_atlas_vector_search.ipynb
- Using_Pinecone_for_embeddings_search.ipynb
- Using_Redis_for_embeddings_search.ipynb
- Using_Qdrant_for_embeddings_search.ipynb
- Using_Typesense_for_embeddings_search.ipynb
- Using_MyScale_for_embeddings_search.ipynb
- Using_Weaviate_for_embeddings_search.ipynb
- Filtered_search_with_Milvus_and_OpenAI.ipynb
- Filtered_search_with_Zilliz_and_OpenAI.ipynb
- Redis-hybrid-query-examples.ipynb

---

### 4. Retrieval-Augmented Generation (RAG)
#### 4.1 General
- RAG_with_graph_db.ipynb
- GPT4_Retrieval_Augmentation.ipynb
- hyde-with-chroma-and-openai.ipynb
- generative-search-with-weaviate-and-openai.ipynb
- Evaluate_RAG_with_LlamaIndex.ipynb
- ft_retrieval_augmented_generation_qdrant.ipynb

#### 4.2 Document Parsing and Knowledge Base Integration
- Parse_PDF_docs_for_RAG.ipynb
- financial_document_analysis_with_llamaindex.ipynb
- deeplake_langchain_qa.ipynb
- Using_Pinecone_for_embeddings_search.ipynb

---

### 5. Assistants and Function Calling
#### 5.1 Assistants API
- Assistants_API_overview_python.ipynb
- Using_tool_required_for_customer_service.ipynb
- Using_reasoning_for_data_validation.ipynb

#### 5.2 Function Calling
- Function_calling_with_an_OpenAPI_spec.ipynb
- Fine_tuning_for_function_calling.ipynb
- Using_chained_calls_for_o1_structured_outputs.ipynb
- Function_calling_finding_nearby_places.ipynb

#### 5.3 Workflow and Automation
- How_to_automate_S3_storage_with_functions.ipynb
- Openai_monitoring_with_wandb_weave.ipynb
- orchestration_agents.ipynb

---

### 6. Multimodal Generative AI
#### 6.1 DALL-E
- DALL-E.ipynb
- How_to_create_dynamic_masks_with_DALL-E_and_Segment_Anything.ipynb

#### 6.2 Whisper
- whisper.ipynb
- Whisper_correct_misspelling.ipynb
- Whisper_processing_guide.ipynb
- Whisper_prompting_guide.ipynb

---

### 7. Tools and Integration
#### 7.1 Database and Vector Stores
- Getting_started_with_bigquery_vector_search_and_openai.ipynb
- Getting_started_with_AnalyticDB_and_OpenAI.ipynb
- redisjson.ipynb

#### 7.2 Miscellaneous
- api_request_parallel_processor.py
- batch_processing.ipynb


> Llama Fine-Tuning

In [None]:
# ============================================
# A Quick Cheat Sheet for Fine-Tuning LLMs
# Using LoRA and QLoRA (Step-by-Step Walkthrough)
# ============================================
#
# This code cell provides a concise, step-by-step cheat sheet for 
# fine-tuning Large Language Models (LLMs) using LoRA and QLoRA. 
# Each step includes sample commands, key parameters, and explanations 
# in comments. Adapt as needed for your own training scripts!

# --------------------------------------------------
# STEP 1: INSTALL DEPENDENCIES
# --------------------------------------------------
# Make sure you have the following libraries installed:
#   - transformers (latest)
#   - accelerate
#   - bitsandbytes
#   - peft
#   - datasets
#   - (optional) wandb for experiment tracking
#
# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q -U datasets scipy ipywidgets matplotlib

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from datasets import load_dataset
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model, PeftModel

# --------------------------------------------------
# STEP 2: DATA PREPARATION
# --------------------------------------------------
# 1. Load your dataset(s). In many cases, you'll have train/eval/test splits.
# 2. Define a prompt format function to guide how the data is turned into text.
# 3. Tokenize the prompts to create input_ids and labels for causal LM.

train_data = load_dataset('gem', 'viggo', split='train')
eval_data  = load_dataset('gem', 'viggo', split='validation')

# Example prompt format function
def create_prompt(example):
    # Customize your prompt as needed:
    prompt = (
        f"Given a target sentence, construct the underlying meaning representation.\n\n"
        f"Target Sentence:\n{example['target']}\n\n"
        f"Meaning Representation:\n{example['meaning_representation']}\n"
    )
    return prompt

# We'll create a tokenizer later when we load the model. For demonstration, here's a placeholder.
# tokenized_train_data = train_data.map(lambda x: tokenizer(create_prompt(x)), batched=False)

# --------------------------------------------------
# STEP 3: LOAD THE BASE MODEL (LLAMA 2 EXAMPLE)
# --------------------------------------------------
# For QLoRA, we use bitsandbytes to load the model in 4-bit precision.
# You can adapt for other LLMs as well.

base_model_id = "meta-llama/Llama-2-7b-hf"  # Example: 7B Llama 2

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                # 4-bit quantization
    bnb_4bit_use_double_quant=True,   # Double quantization for improved memory usage
    bnb_4bit_quant_type="nf4",        # Normal Float 4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Make sure the tokenizer has a pad_token or set it to eos.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# --------------------------------------------------
# STEP 4: PREPARE MODEL FOR K-BIT TRAINING
# --------------------------------------------------
# We use PEFT's utility function to prepare the model for 8-bit/4-bit training.
# It freezes the base model weights and casts layer norms to fp32 for stability.

model.gradient_checkpointing_enable()  # Optional: trades extra compute for lower memory
model = prepare_model_for_kbit_training(model)

# --------------------------------------------------
# STEP 5: SET UP LoRA OR QLoRA ADAPTERS
# --------------------------------------------------
# LoRA config parameters:
#   - r: Rank of the adapter matrix (larger => more capacity).
#   - lora_alpha: Scaling factor.
#   - lora_dropout: Dropout for LoRA layers.
#   - target_modules: Which model modules to apply LoRA to.
# For QLoRA, use the same approach but ensure you loaded the model in 4bit above.

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj", "lm_head"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)  # Apply LoRA adapters

# Utility to see trainable params
def print_trainable_parameters(m):
    trainable_params = 0
    all_params = 0
    for _, p in m.named_parameters():
        all_params += p.numel()
        if p.requires_grad:
            trainable_params += p.numel()
    print(f"Trainable params: {trainable_params} | All params: {all_params} | Trainable%: {100 * trainable_params / all_params:.2f}%")

print_trainable_parameters(model)

# The KV cache is not needed during training and conflicts with gradient checkpointing.
# Set use_cache back to True before inference to speed up generation.
model.config.use_cache = False

# --------------------------------------------------
# STEP 6: TOKENIZATION PIPELINE (UPDATED)
# --------------------------------------------------
# We'll tokenize with appropriate padding and truncation for your max_length.

def tokenize_function(ex):
    prompt = create_prompt(ex)
    tokenized = tokenizer(
        prompt,
        truncation=True,
        max_length=512,         # Adjust as needed
        padding="max_length"
    )
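    # Causal LM training: labels match input_ids (the model shifts them internally).
    # Note: the data collator in Step 7 rebuilds labels and masks pad positions with -100.
    tokenized["labels"] = tokenized["input_ids"].copy()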
    return tokenized

tokenized_train_data = train_data.map(tokenize_function, batched=False)
tokenized_eval_data  = eval_data.map(tokenize_function, batched=False)

# --------------------------------------------------
# STEP 7: TRAINING PREP WITH TRANSFORMERS
# --------------------------------------------------
# We'll create a Trainer or Accelerate-based training loop.
# For demonstration, here's a minimal Trainer approach:

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./lora-qlora-output",
    per_device_train_batch_size=2,  # Adjust based on GPU memory
    per_device_eval_batch_size=2,   # Adjust as needed
    gradient_accumulation_steps=4,
    eval_strategy="steps",         # named evaluation_strategy in transformers < 4.41
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    logging_steps=25,
    num_train_epochs=1,            # or use max_steps if you prefer
    learning_rate=2e-4,
    bf16=True,                     # set bf16 or fp16 as your hardware allows
    optim="paged_adamw_8bit",      # recommended for 4-bit/8-bit
    report_to="none"               # or "wandb" if you want to track metrics
)

# Data collator: ensure we handle LM-style tasks properly
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_eval_data,
    data_collator=data_collator
)

# --------------------------------------------------
# STEP 8: TRAIN THE MODEL
# --------------------------------------------------
# This will run the LoRA/QLoRA fine-tuning. Watch out for OOM (Out of Memory) errors.

trainer.train()

# --------------------------------------------------
# STEP 9: SAVING AND LOADING THE FINETUNED ADAPTER
# --------------------------------------------------
# Once training completes, you can save the PEFT LoRA adapter weights with:
trainer.save_model("./lora-qlora-output")

# Later, to reload, you do:
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
lora_model = PeftModel.from_pretrained(base_model, "./lora-qlora-output")

# --------------------------------------------------
# STEP 10: INFERENCE
# --------------------------------------------------
# Let's see how the model performs with a test prompt after fine-tuning.

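# For best results, format the prompt the same way as create_prompt() in Step 2.
sample_prompt = "Please convert the following text into a meaning representation: ... "
lora_model.eval()  # disable dropout (including LoRA dropout) for inference
tokens = tokenizer(sample_prompt, return_tensors="pt").to(lora_model.device)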
with torch.no_grad():
    output_ids = lora_model.generate(**tokens, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# --------------------------------------------------
# NOTE ON QLoRA vs LoRA
# --------------------------------------------------
# - QLoRA: A memory-efficient form of LoRA that uses 4-bit quantization for base model weights.
# - LoRA: Traditional approach that can run at full precision or 8-bit if loaded that way.
#
# Both rely on the PEFT library's approach of training rank-decomposed adapter matrices 
# while freezing most of the base model's parameters.

# --------------------------------------------------
# DONE!
# --------------------------------------------------
# This cheat sheet has walked you through:
#  1) Installing dependencies
#  2) Loading data and formatting prompts
#  3) Loading a base LLM in 4-bit quantization (QLoRA approach)
#  4) Preparing the model for k-bit training
#  5) Setting LoRA config for target modules
#  6) Tokenizing data for causal LM
#  7) Creating a Transformers Trainer
#  8) Fine-tuning with LoRA/QLoRA
#  9) Saving/loading your LoRA adapters
# 10) Running inference with the finetuned model

print("Cheat Sheet for LoRA & QLoRA Fine-tuning Complete!")

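As a rough check on Step 5, the number of parameters a LoRA adapter adds can be worked out by hand. The sketch below assumes each adapted linear layer of shape (d_out, d_in) gains two matrices, A (r x d_in) and B (d_out x r), and that the low-rank update is scaled by `lora_alpha / r` (the standard LoRA scaling). The hidden size of 4096 is an assumption matching Llama-2-7B, and the helper names are illustrative rather than part of PEFT.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A (standard LoRA scaling)."""
    return lora_alpha / r


hidden = 4096            # assumed: hidden size of Llama-2-7B
full = hidden * hidden   # weights in one square projection such as q_proj
added = lora_added_params(hidden, hidden, r=8)

print(f"full q_proj weights : {full:,}")
print(f"LoRA params (r=8)   : {added:,}")
print(f"fraction trained    : {added / full:.4%}")
print(f"update scaling      : {lora_scaling(16, 8)}")
```

Doubling `r` doubles the adapter size, while raising `lora_alpha` at fixed `r` only changes how strongly the update is weighted.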

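One detail of Steps 3, 6, and 7 is worth spelling out. With `mlm=False`, `DataCollatorForLanguageModeling` rebuilds `labels` from `input_ids` and sets every position equal to the pad token id to -100, which the loss ignores. Because Step 3 reuses EOS as the pad token, any genuine EOS token in a sequence is masked as well. The following is a pure-Python sketch of that masking rule using made-up token ids; it mirrors the behaviour but is not the library's code.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_pad_labels(input_ids: List[int], pad_token_id: int) -> List[int]:
    """Copy input_ids into labels, replacing pad positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical ids: 2 plays the role of EOS and, as in Step 3, of the pad token.
eos_id = pad_id = 2
input_ids = [1, 345, 678, 910, eos_id, pad_id, pad_id, pad_id]

labels = mask_pad_labels(input_ids, pad_id)
print(labels)  # the real EOS at index 4 is masked along with the padding
```

If the model should learn when to stop generating, one common workaround is to give the tokenizer a dedicated pad token instead of reusing EOS.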
> LLM Training

In [None]:
# ==============================================================================
# CHEAT SHEET: ADVANCED TRAINING TECHNIQUES FOR LARGE MODELS
# ==============================================================================
#
# This code cell is a walkthrough detailing how to implement:
#   1) Parameter-Efficient Fine-Tuning (PEFT) + LoRA
#   2) Quantization-Aware Training (QAT)
#   3) Gradient Checkpointing
#   4) Distributed Training (FSDP, ZeRO)
#
# Each section includes explanatory code and use cases. 
# Adapt these examples to your specific model and training environment.
# ==============================================================================

# ==============================================================================
# 1. PEFT + LoRA (Parameter-Efficient Fine-Tuning + Low-Rank Adaptation)
# ------------------------------------------------------------------------------
# DESCRIPTION:
#   - Fine-tunes only small "adapter" layers (LoRA) added on top of a large 
#     pre-trained model, freezing most of the base model's parameters.
#   - This conserves memory and improves efficiency.
# USE CASE:
#   - Large language models (e.g., LLaMA 2) for domain-specific tasks with 
#     limited data. Train minimal parameters while keeping the rest of the 
#     model untouched.
# ==============================================================================

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# EXAMPLE: Loading a base LLM and applying PEFT + LoRA

base_model_id = "meta-llama/Llama-2-7b-hf"  # Example: LLaMA 2 7B
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
)

# OPTIONAL: Use 8-bit or 4-bit quantization for memory savings 
# (uncomment if needed; this replaces the full-precision load above)
# from transformers import BitsAndBytesConfig
# bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
# model = AutoModelForCausalLM.from_pretrained(
#     base_model_id,
#     quantization_config=bnb_config,
#     device_map="auto"
# )
# model = prepare_model_for_kbit_training(model)

# LoRA Config: 
lora_config = LoraConfig(
    r=8,                       # Rank of low-rank matrices
    lora_alpha=16,            # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Convert your base model to a PEFT model using LoRA:
lora_model = get_peft_model(model, lora_config)

# Now you can train 'lora_model' on your domain-specific data. 
# Example of checking trainable params:
def print_trainable_parameters(m):
    trainable_params = 0
    total_params = 0
    for _, p in m.named_parameters():
        total_params += p.numel()
        if p.requires_grad:
            trainable_params += p.numel()
    print(f"Trainable params: {trainable_params} / {total_params} "
          f"({100 * trainable_params/total_params:.2f}%)")

print_trainable_parameters(lora_model)

# ==============================================================================
# 2. Quantization-Aware Training (QAT)
# ------------------------------------------------------------------------------
# DESCRIPTION:
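#   - Quantization converts model weights from high precision (e.g., FP32) to 
#     lower precision (e.g., FP16, INT8, or INT4). QAT goes one step further: 
#     quantization is simulated during training so the model learns to 
#     tolerate the rounding error.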
# BENEFITS:
#   - Saves memory and can reduce training time.
# CHALLENGES:
#   - Must monitor potential accuracy degradation. 
#   - Not all ops and hardware fully support certain precisions.
# EXAMPLE:
#   - Using `bitsandbytes` via the Transformers/PEFT quantization integration 
#     to load weights in 4-bit/8-bit before fine-tuning.
# ==============================================================================

# DEMO: Loading a model in 8-bit with bitsandbytes (quantized loading, not full QAT).
# Partial steps were already shown above, so here's a simplified snippet:

"""
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # outlier threshold: activations above it are handled in fp16
    llm_int8_has_fp16_weight=True  # keep 16-bit main weights, useful for fine-tuning
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
# Then proceed with normal training steps while carefully tracking performance.
"""

# NOTE: For actual QAT, you might combine these quantization steps with 
# further fine-tuning under quantized conditions.

# ==============================================================================
# 3. Gradient Checkpointing
# ------------------------------------------------------------------------------
# DESCRIPTION:
#   - Saves memory by only storing select intermediate activations needed for 
#     backward pass, recomputing the rest during backprop.
# USE CASE:
#   - Useful for large models / limited GPU memory. 
# TRADE-OFF:
#   - Slower training due to re-computation, but significantly lower memory usage.
# EXAMPLE:
# ==============================================================================
"""
# If your model is a Transformers-based model:
model.gradient_checkpointing_enable()

# This will reduce memory usage at the cost of extra compute during backprop.
# Ensure that your training loop or Trainer approach is aware of this setting 
# so no unexpected issues arise (like weird memory or speed metrics).
"""

# ==============================================================================
# 4. Distributed Training
# ------------------------------------------------------------------------------
# DESCRIPTION:
#   - Splits data and/or model across multiple GPUs or nodes to reduce training 
#     time and handle bigger models. 
# KEY TECHNIQUES:
#   - FSDP (Fully Sharded Data Parallel): 
#       Shards model parameters, gradients, and optimizer states across devices. 
#   - DeepSpeed ZeRO (Zero Redundancy Optimizer): 
#       Distributes model parameters, gradients, and optimizer states among 
#       workers to save memory and improve throughput.
# ==============================================================================
"""
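# EXAMPLE: Using PyTorch FSDP
#   1. Import FSDP and the distributed package:
import os
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

#   2. Launch one process per GPU (e.g., with torchrun), then initialize 
#      the process group inside the script:
#       torchrun --nproc_per_node=4 your_script.py
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # LOCAL_RANK is set by torchrun
#   3. Wrap your model. (The `wrap` helper only takes effect inside an 
#      `enable_wrap` context, so wrap with FSDP directly.)
model = FSDP(model, device_id=torch.cuda.current_device())
#   4. Train as usual, ensuring the data is distributed among processes.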

# EXAMPLE: Using DeepSpeed ZeRO
#   1. Install DeepSpeed.
#   2. Use a deepspeed config file specifying "zero_optimization" stages 1, 2, or 3.
#   3. Launch training with deepspeed script:
#       deepspeed --num_gpus=4 your_script.py --deepspeed_config ds_config.json
#   4. Model parameters, gradients, and optimizer states are partitioned 
#      across devices.

# Both FSDP and ZeRO require a distributed environment setup. 
# They are powerful for large-scale training while controlling 
# memory usage effectively.
"""

# ==============================================================================
# END CHEAT SHEET
# ==============================================================================
print("Cheat sheet complete! Review code, tailor for your environment, and happy training!")

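To make the ZeRO stages from section 4 concrete, the sketch below estimates per-GPU memory for model states under mixed-precision Adam. It assumes the accounting used in the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states (master weights, momentum, and variance). Activations, buffers, and fragmentation are ignored. The function name and the 7.5B-parameter, 64-GPU setting are illustrative.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB, 1e9 bytes) for model states under ZeRO."""
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus   # stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus   # stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus  # stage 3: also partition parameters
    return num_params * (params + grads + optim) / 1e9


for stage in range(4):
    gb = zero_model_state_gb(7.5e9, num_gpus=64, stage=stage)
    print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

Stage 0 here means plain data parallelism with every GPU holding a full copy. FSDP's full sharding partitions the same three kinds of state, so its savings are roughly comparable to stage 3.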