
# Task
Code the attention mechanism from scratch (Self Attention and Multi Head Attention). Take one example sentence and show the calculation of Self and Multi Head Attention. Summarize the whole exercise by suggesting what will happen to these context embeddings. Keep the code simple for beginners and include comments.

## Implement self-attention

### Subtask:
Write Python code to implement the self-attention mechanism from scratch, including the calculation of Query, Key, and Value matrices, the dot product, scaling, softmax, and the final output.


**Reasoning**:
The first step is to implement the self-attention mechanism as a Python function, including the definition of weight matrices, calculation of Q, K, V, dot product, scaling, softmax, and the final output, based on the instructions.
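
As a compact reference, the steps above correspond to the standard scaled dot-product attention formula. The symbols $X$ (the input embeddings) and $d_k$ (the width of the Key vectors) are notation assumed here, not names taken from the code:

$$
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
$$

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The softmax is applied to each row separately, so every row of the weight matrix sums to 1. In the implementation below the projections keep the full width, so $d_k$ equals `embedding_dim`.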



In [1]:
import numpy as np

def self_attention(input_embeddings):
    """
    Implements the self-attention mechanism.

    Args:
        input_embeddings: Input embeddings of shape (sequence_length, embedding_dim).

    Returns:
        Output of the self-attention mechanism of shape (sequence_length, embedding_dim).
    """
    sequence_length, embedding_dim = input_embeddings.shape

    # 2. Define weight matrices for Query, Key, and Value (W_Q, W_K, W_V).
    #    In a trained model these are learned; here they are only randomly
    #    initialized and never updated. For simplicity, the output dimension
    #    is the same as the input embedding dimension.
    np.random.seed(42) # for reproducibility
    w_q = np.random.rand(embedding_dim, embedding_dim)
    w_k = np.random.rand(embedding_dim, embedding_dim)
    w_v = np.random.rand(embedding_dim, embedding_dim)

    # 3. Calculate the Query, Key, and Value matrices
    query = np.dot(input_embeddings, w_q)
    key = np.dot(input_embeddings, w_k)
    value = np.dot(input_embeddings, w_v)

    # 4. Compute the dot product of the Query and Key matrices.
    dot_product = np.dot(query, key.T) # Transpose Key for matrix multiplication

    # 5. Scale the dot product by sqrt(d_k); here d_k equals embedding_dim
    scaling_factor = np.sqrt(embedding_dim)
    scaled_dot_product = dot_product / scaling_factor

    # 6. Apply the softmax function to the scaled dot product to obtain attention weights.
    #    We need to apply softmax row-wise.
    #    Subtracting the max for numerical stability
    exp_scaled_dot_product = np.exp(scaled_dot_product - np.max(scaled_dot_product, axis=1, keepdims=True))
    attention_weights = exp_scaled_dot_product / np.sum(exp_scaled_dot_product, axis=1, keepdims=True)

    # 7. Multiply the attention weights by the Value matrix to get the final self-attention output.
    output = np.dot(attention_weights, value)

    # 8. Return the self-attention output.
    return output

# Example Usage:
# Let's create a dummy input embedding for a sentence with 3 words and embedding dimension 4.
dummy_input_embeddings = np.array([
    [1.0, 0.5, 0.1, 0.9],  # Embedding for word 1
    [0.2, 0.8, 0.3, 0.7],  # Embedding for word 2
    [0.6, 0.4, 0.9, 0.2]   # Embedding for word 3
])

attention_output = self_attention(dummy_input_embeddings)
print("Input Embeddings Shape:", dummy_input_embeddings.shape)
print("Self-Attention Output Shape:", attention_output.shape)
print("\nSelf-Attention Output:")
display(attention_output)

Input Embeddings Shape: (3, 4)
Self-Attention Output Shape: (3, 4)

Self-Attention Output:


array([[0.41424831, 1.28155963, 1.19660905, 1.47464873],
       [0.41537054, 1.2600382 , 1.18050333, 1.46096009],
       [0.41012037, 1.269359  , 1.1786554 , 1.47522522]])
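
The softmax step subtracts the row maximum "for numerical stability". The following minimal sketch illustrates why that matters. The function names `naive_softmax` and `stable_softmax` and the example scores are made up for this illustration and are not part of the notebook's code.

```python
import numpy as np

def naive_softmax(scores):
    # Direct definition: exponentiate, then normalize each row.
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

def stable_softmax(scores):
    # Subtract each row's maximum first. The result is mathematically the
    # same, but the largest exponent becomes exp(0) = 1, so nothing overflows.
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

small_scores = np.array([[1.0, 2.0, 3.0]])
large_scores = np.array([[1000.0, 1001.0, 1002.0]])

# For moderate scores both versions agree.
print("Agree on small scores:", np.allclose(naive_softmax(small_scores),
                                            stable_softmax(small_scores)))

# For large scores np.exp overflows to inf, and inf / inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = naive_softmax(large_scores)
stable_large = stable_softmax(large_scores)

print("Naive softmax on large scores: ", naive_large)
print("Stable softmax on large scores:", stable_large)
```

The scores in this notebook are small, so both versions would give the same answer here. The stable form is a cheap habit that avoids surprises once scores grow.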

## Implement multi-head attention

### Subtask:
Write Python code to implement the multi-head attention mechanism from scratch, building upon the self-attention implementation. This will involve splitting the input into multiple heads, applying self-attention to each head, and concatenating the results.


**Reasoning**:
Define the `multi_head_attention` function, calculate head dimensions, and implement the splitting and linear transformations for Q, K, V, and the final output.
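
For reference, multi-head attention is commonly written as follows, where $h$ is the number of heads and $W_i^Q$, $W_i^K$, $W_i^V$ are per-head projection matrices (notation assumed here, not taken from the code):

$$
\mathrm{head}_i = \mathrm{Attention}\!\left(X W_i^Q,\; X W_i^K,\; X W_i^V\right), \qquad i = 1, \dots, h
$$

$$
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_O
$$

The implementation below uses one large matrix per projection and splits the result into `num_heads` column blocks. That is equivalent to using separate per-head matrices, each being a column block of the large one. Inside each head the scaling factor is $\sqrt{d_k}$ with $d_k$ equal to `head_dim`.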



In [2]:
def multi_head_attention(input_embeddings, num_heads):
    """
    Implements the multi-head attention mechanism.

    Args:
        input_embeddings: Input embeddings of shape (sequence_length, embedding_dim).
        num_heads: The number of attention heads.

    Returns:
        Output of the multi-head attention mechanism of shape (sequence_length, embedding_dim).
    """
    sequence_length, embedding_dim = input_embeddings.shape

    # 2. Calculate the dimension of each head.
    if embedding_dim % num_heads != 0:
        raise ValueError("Embedding dimension must be divisible by the number of heads")
    head_dim = embedding_dim // num_heads

    # 4. Initialize weight matrices for Query, Key, Value, and the final output.
    #    For simplicity, we'll initialize single larger matrices and reshape/split them later.
    np.random.seed(42) # for reproducibility
    w_q_all = np.random.rand(embedding_dim, embedding_dim)
    w_k_all = np.random.rand(embedding_dim, embedding_dim)
    w_v_all = np.random.rand(embedding_dim, embedding_dim)
    w_o = np.random.rand(embedding_dim, embedding_dim)

    # 3. Split the input embeddings into multiple "heads" - this will be done implicitly
    #    by splitting the Q, K, V matrices after the initial linear transformation.

    # 5. Calculate the Query, Key, and Value matrices for all heads combined initially.
    query_all = np.dot(input_embeddings, w_q_all)
    key_all = np.dot(input_embeddings, w_k_all)
    value_all = np.dot(input_embeddings, w_v_all)

    # Reshape Q, K, V to split into multiple heads
    # Shape becomes (sequence_length, num_heads, head_dim)
    query_all_reshaped = query_all.reshape(sequence_length, num_heads, head_dim)
    key_all_reshaped = key_all.reshape(sequence_length, num_heads, head_dim)
    value_all_reshaped = value_all.reshape(sequence_length, num_heads, head_dim)

    # Transpose to (num_heads, sequence_length, head_dim) for easier processing per head
    query_all_transposed = query_all_reshaped.transpose(1, 0, 2)
    key_all_transposed = key_all_reshaped.transpose(1, 0, 2)
    value_all_transposed = value_all_reshaped.transpose(1, 0, 2)


    attention_outputs_per_head = []
    # 6. Apply self_attention to each head.
    for i in range(num_heads):
        q_head = query_all_transposed[i] # Shape (sequence_length, head_dim)
        k_head = key_all_transposed[i]   # Shape (sequence_length, head_dim)
        v_head = value_all_transposed[i] # Shape (sequence_length, head_dim)

        # Apply self-attention logic (dot product, scaling, softmax, multiply by V)
        # This part mirrors the self_attention function logic, but applied to individual heads
        dot_product_head = np.dot(q_head, k_head.T)
        scaling_factor_head = np.sqrt(head_dim)
        scaled_dot_product_head = dot_product_head / scaling_factor_head

        exp_scaled_dot_product_head = np.exp(scaled_dot_product_head - np.max(scaled_dot_product_head, axis=1, keepdims=True))
        attention_weights_head = exp_scaled_dot_product_head / np.sum(exp_scaled_dot_product_head, axis=1, keepdims=True)

        output_head = np.dot(attention_weights_head, v_head) # Shape (sequence_length, head_dim)
        attention_outputs_per_head.append(output_head)

    # 7. Concatenate the outputs of all the heads.
    # Shape becomes (num_heads, sequence_length, head_dim)
    concatenated_heads = np.stack(attention_outputs_per_head, axis=0)
    # Reshape to (sequence_length, num_heads * head_dim), which is (sequence_length, embedding_dim)
    concatenated_heads_reshaped = concatenated_heads.transpose(1, 0, 2).reshape(sequence_length, embedding_dim)

    # 8. Apply a final linear transformation.
    final_output = np.dot(concatenated_heads_reshaped, w_o)

    # 9. Return the final multi-head attention output.
    return final_output

# Example Usage:
# Let's use the same dummy input embedding and test with 2 heads.
dummy_input_embeddings = np.array([
    [1.0, 0.5, 0.1, 0.9],  # Embedding for word 1
    [0.2, 0.8, 0.3, 0.7],  # Embedding for word 2
    [0.6, 0.4, 0.9, 0.2]   # Embedding for word 3
])
num_heads = 2 # Must be a divisor of embedding_dim (4)

multi_head_output = multi_head_attention(dummy_input_embeddings, num_heads)

print("Input Embeddings Shape:", dummy_input_embeddings.shape)
print("Multi-Head Attention Output Shape:", multi_head_output.shape)
print("\nMulti-Head Attention Output:")
display(multi_head_output)

Input Embeddings Shape: (3, 4)
Multi-Head Attention Output Shape: (3, 4)

Multi-Head Attention Output:


array([[2.08600928, 1.83908121, 2.41701368, 2.39544226],
       [2.07620086, 1.8254594 , 2.41172336, 2.3813065 ],
       [2.08055114, 1.832296  , 2.41240516, 2.38511854]])
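
The reshape and transpose steps can be hard to picture. The sketch below uses a stand-in matrix with the entries 0 to 11 (an assumption made purely for readability) to show that each head simply receives a contiguous block of columns, and that the merge step at the end reverses the split.

```python
import numpy as np

sequence_length, embedding_dim, num_heads = 3, 4, 2
head_dim = embedding_dim // num_heads

# A stand-in for query_all: entries 0..11 so each position is easy to trace.
query_all = np.arange(sequence_length * embedding_dim, dtype=float).reshape(
    sequence_length, embedding_dim
)

# Same split as in multi_head_attention: reshape, then move heads to the front.
per_head = query_all.reshape(sequence_length, num_heads, head_dim).transpose(1, 0, 2)

for i in range(num_heads):
    # Head i should hold columns i*head_dim .. (i+1)*head_dim - 1.
    column_block = query_all[:, i * head_dim:(i + 1) * head_dim]
    print(f"Head {i}:\n{per_head[i]}")
    print("Equals column block:", np.array_equal(per_head[i], column_block))

# Merging reverses the split: transpose back, then flatten the last two axes.
merged = per_head.transpose(1, 0, 2).reshape(sequence_length, embedding_dim)
print("Round trip restores the matrix:", np.array_equal(merged, query_all))
```

With 2 heads and an embedding dimension of 4, head 0 sees columns 0 and 1 of every row and head 1 sees columns 2 and 3.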

## Example calculation

### Subtask:
Choose a simple sentence and demonstrate the step-by-step calculation of self-attention and multi-head attention using the implemented code.


**Reasoning**:
Choose a simple sentence, create dummy input embeddings, and then call the self-attention and multi-head attention functions with these inputs to demonstrate the calculations and print the results.



In [3]:
# 1. Choose a simple sentence
sentence = "Hello world"
words = sentence.split()
sequence_length = len(words)
embedding_dim = 4 # Choose a simple embedding dimension

print(f"Chosen sentence: '{sentence}'")
print(f"Words: {words}")
print(f"Sequence length: {sequence_length}")
print(f"Embedding dimension: {embedding_dim}")

# 2. Create dummy input embeddings
# Shape: (sequence_length, embedding_dim)
np.random.seed(0) # Set seed for reproducibility
input_embeddings = np.random.rand(sequence_length, embedding_dim)

# 3. Print the chosen sentence and the dummy input embeddings
print("\nDummy Input Embeddings:")
display(input_embeddings)

# 4. Call the self_attention function and print the output
print("\nCalculating Self-Attention:")
self_attention_output = self_attention(input_embeddings)
print("Self-Attention Output:")
display(self_attention_output)

# 5. Call the multi_head_attention function and print the output
num_heads = 2 # Set explicitly so this cell does not rely on a variable from the previous cell
print(f"\nCalculating Multi-Head Attention with {num_heads} heads:")
# Ensure embedding_dim is divisible by num_heads
if embedding_dim % num_heads != 0:
    raise ValueError(f"Embedding dimension ({embedding_dim}) must be divisible by the number of heads ({num_heads})")

multi_head_attention_output = multi_head_attention(input_embeddings, num_heads)
print("Multi-Head Attention Output:")
display(multi_head_attention_output)

Chosen sentence: 'Hello world'
Words: ['Hello', 'world']
Sequence length: 2
Embedding dimension: 4

Dummy Input Embeddings:


array([[0.5488135 , 0.71518937, 0.60276338, 0.54488318],
       [0.4236548 , 0.64589411, 0.43758721, 0.891773  ]])


Calculating Self-Attention:
Self-Attention Output:


array([[0.48703578, 1.26063123, 1.17903779, 1.5423837 ],
       [0.4873165 , 1.26078774, 1.17857997, 1.54167794]])


Calculating Multi-Head Attention with 2 heads:
Multi-Head Attention Output:


array([[2.15381008, 1.8670745 , 2.55620525, 2.47266139],
       [2.15397661, 1.86714714, 2.55631435, 2.47285283]])
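
The cell above prints only the final outputs. To see the intermediate values, the following sketch repeats the self-attention steps for "Hello world" and prints every matrix along the way. It assumes the same seeds as the cells above (0 for the embeddings, 42 for the weights), so the final output should line up with the self-attention result shown earlier. The short variable names are chosen for this illustration only.

```python
import numpy as np

words = ["Hello", "world"]
embedding_dim = 4

# Same seeds as the cells above, so the numbers should match their output.
np.random.seed(0)
x = np.random.rand(len(words), embedding_dim)

np.random.seed(42)
w_q = np.random.rand(embedding_dim, embedding_dim)
w_k = np.random.rand(embedding_dim, embedding_dim)
w_v = np.random.rand(embedding_dim, embedding_dim)

# Step 1: project the embeddings into Query, Key, and Value matrices.
query, key, value = x @ w_q, x @ w_k, x @ w_v

# Step 2: raw scores, one row per query word and one column per key word.
scores = query @ key.T

# Step 3: scale by sqrt(d_k); d_k equals embedding_dim here.
scaled = scores / np.sqrt(embedding_dim)

# Step 4: row-wise softmax turns the scores into attention weights.
exp_scaled = np.exp(scaled - scaled.max(axis=1, keepdims=True))
weights = exp_scaled / exp_scaled.sum(axis=1, keepdims=True)

# Step 5: each output row is a weighted sum of the Value rows.
output = weights @ value

for name, matrix in [("Query", query), ("Key", key), ("Value", value),
                     ("Scores", scores), ("Scaled scores", scaled),
                     ("Attention weights", weights), ("Output", output)]:
    print(f"{name} {matrix.shape}:\n{np.round(matrix, 4)}\n")
```

The attention weights form a 2 x 2 matrix: row 0 says how much "Hello" draws from each of the two words, and row 1 says the same for "world".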

## Summarize embedding transformation

### Subtask:
Explain how the attention mechanism transforms the input embeddings into context embeddings, highlighting the role of weighted sums and the capture of dependencies.


**Reasoning**:
Explain how the attention mechanism transforms input embeddings into context embeddings, covering the role of input embeddings, attention weights, weighted sums, and how this captures dependencies. Also, briefly mention multi-head attention's role.



In [4]:
# 1. Explain how input embeddings represent individual words or tokens.
print("1. Input Embeddings:")
print("   Input embeddings are numerical representations of individual words or tokens.")
print("   They are typically dense vectors where similar words have similar vector representations.")
print("   These embeddings are the starting point for the attention mechanism.")
print(f"   In our example, the input embeddings have a shape of (sequence_length={sequence_length}, embedding_dim={embedding_dim}).")
print(f"   Each row represents a word's initial embedding.")
print("-" * 50)

# 2. Describe how the attention mechanism calculates attention weights.
print("2. Calculating Attention Weights:")
print("   The core idea of attention is to determine how much each input element should 'attend' to other input elements.")
print("   This is done by calculating attention weights based on the relationships between Query (Q), Key (K), and Value (V) vectors.")
print("   - Query: Represents the current element's desire to find relevant information in other elements.")
print("   - Key: Represents the information that each element makes available to others.")
print("   - Value: Represents the actual content or information of each element.")
print("   The attention weights are calculated by taking the dot product of one element's Query with the Keys of all elements (including its own), scaling by the square root of the Key dimension, and applying a softmax function.")
print("   The softmax ensures that the weights for a given query sum up to 1, indicating a probability distribution over the input elements.")
print("-" * 50)

# 3. Explain that the attention output is a weighted sum of the Value vectors.
print("3. Weighted Sum of Value Vectors:")
print("   Once the attention weights are calculated, the output for each input element is computed as a weighted sum of the Value vectors of *all* input elements.")
print("   Each Value vector is multiplied by its corresponding attention weight (how much the current element should attend to that specific element's Value).")
print("   These weighted Value vectors are then summed up to produce the final output vector for the current input element.")
print("-" * 50)

# 4. Discuss how this captures dependencies and creates "context embeddings".
print("4. Capturing Dependencies and Creating Context Embeddings:")
print("   This weighted sum is crucial because it allows the model to incorporate information from all other words in the sequence when representing a single word.")
print("   Words that are highly relevant to the current word (as determined by the attention weights) contribute more significantly to its output representation.")
print("   This mechanism connects any two words directly, regardless of their distance in the sequence, whereas recurrent neural networks (RNNs) pass information step by step and tend to lose long-range dependencies.")
print("   The resulting output vectors are often referred to as 'context embeddings' because they are no longer just representations of individual words in isolation, but are enriched with contextual information from the entire sequence.")
print("-" * 50)

# 5. Briefly mention multi-head attention.
print("5. Role of Multi-Head Attention:")
print("   Multi-head attention enhances this process by performing the attention calculation multiple times in parallel, using different sets of learned Query, Key, and Value weight matrices for each 'head'.")
print("   This allows the model to attend to different aspects of the relationships between words simultaneously.")
print("   For example, one head might focus on syntactic relationships, while another might focus on semantic relationships.")
print("   The outputs from the different heads are then concatenated and linearly transformed to produce the final multi-head attention output.")
print(f"   In our example, we used {num_heads} heads, so attention is computed {num_heads} times in parallel, each time on a different slice of the Query, Key, and Value projections.")
print("-" * 50)

print("\nIn summary, the attention mechanism transforms initial word embeddings into context embeddings by dynamically weighting the contributions of all other words in the sequence based on their relevance, thereby capturing complex dependencies.")

1. Input Embeddings:
   Input embeddings are numerical representations of individual words or tokens.
   They are typically dense vectors where similar words have similar vector representations.
   These embeddings are the starting point for the attention mechanism.
   In our example, the input embeddings have a shape of (sequence_length=2, embedding_dim=4).
   Each row represents a word's initial embedding.
--------------------------------------------------
2. Calculating Attention Weights:
   The core idea of attention is to determine how much each input element should 'attend' to other input elements.
   This is done by calculating attention weights based on the relationships between Query (Q), Key (K), and Value (V) vectors.
   - Query: Represents the current element's desire to find relevant information in other elements.
   - Key: Represents the information that each element makes available to others.
   - Value: Represents the actual content or information of each element.
   [remaining output truncated]
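
The weighted-sum idea from point 3 is easiest to see with tiny hand-picked numbers. In the sketch below, both the attention weights and the Value vectors are hypothetical values chosen for readability. They are not produced by the functions above.

```python
import numpy as np

# Hand-picked attention weights for a two-word sequence (hypothetical values;
# each row sums to 1, as softmax guarantees).
attention_weights = np.array([
    [0.9, 0.1],   # word 1 attends mostly to itself
    [0.5, 0.5],   # word 2 attends equally to both words
])

# Hand-picked Value vectors, chosen so each word's contribution is easy to see.
value = np.array([
    [1.0, 0.0],   # Value vector of word 1
    [0.0, 1.0],   # Value vector of word 2
])

# Matrix form, as used in self_attention.
context = attention_weights @ value

# The same result for word 1, written out as an explicit weighted sum.
word_1_context = (attention_weights[0, 0] * value[0]
                  + attention_weights[0, 1] * value[1])

print("Context embeddings:\n", context)
print("Word 1 as explicit weighted sum:", word_1_context)
```

Word 1 keeps 90% of its own Value vector and takes 10% from word 2, so its context embedding is `[0.9, 0.1]`. Word 2 mixes both Value vectors equally and ends up at `[0.5, 0.5]`.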

## Summary:

### Data Analysis Key Findings

*   The self-attention mechanism was successfully implemented in Python, demonstrating the calculation of Query, Key, and Value matrices, scaled dot product, softmax, and weighted sum to produce context embeddings.
*   The multi-head attention mechanism was implemented by projecting the input into Query, Key, and Value matrices, splitting those matrices into heads, applying the self-attention logic independently to each head, concatenating the results, and applying a final linear transformation.
*   Using the simple sentence "Hello world" and dummy input embeddings, both the self-attention and multi-head attention functions ran successfully and produced output embeddings of the same shape as the input embeddings. Because the weight matrices are random rather than trained, the output values illustrate the mechanics only: the two output rows are nearly identical and do not yet encode meaningful context.
*   The explanation detailed how the attention mechanism calculates attention weights based on Query, Key, and Value relationships and uses these weights to compute a weighted sum of Value vectors, effectively capturing dependencies between words and creating context embeddings.
*   Multi-head attention was explained as a method to capture different aspects of word relationships simultaneously by running the attention process in parallel across multiple heads.

### Insights or Next Steps

*   The resulting context embeddings from both self-attention and multi-head attention layers serve as richer representations of words that incorporate information from the entire sequence. These embeddings can then be used as input for subsequent layers in a neural network (e.g., feed-forward networks) for downstream tasks like classification, translation, or text generation.
*   Further exploration could involve visualizing the attention weights for the example sentence to understand which words each word is attending to, providing insight into the captured dependencies.
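
As a starting point for inspecting the attention weights, the sketch below prints a labelled weight table per head for "Hello world". The helper `attention_weights_per_head` is hypothetical and written only for this illustration. It assumes the same seeds and the same order of random draws as `multi_head_attention` (W_Q first, then W_K), and it uses `np.random.RandomState` so that NumPy's global random state is not reset by the helper.

```python
import numpy as np

def attention_weights_per_head(input_embeddings, num_heads, seed=42):
    """Hypothetical helper: returns weights of shape (num_heads, seq_len, seq_len)."""
    sequence_length, embedding_dim = input_embeddings.shape
    head_dim = embedding_dim // num_heads

    # Same seed and draw order as multi_head_attention (W_Q first, then W_K).
    rng = np.random.RandomState(seed)
    w_q = rng.rand(embedding_dim, embedding_dim)
    w_k = rng.rand(embedding_dim, embedding_dim)

    # Project, then split into heads: (num_heads, seq_len, head_dim).
    query = (input_embeddings @ w_q).reshape(
        sequence_length, num_heads, head_dim).transpose(1, 0, 2)
    key = (input_embeddings @ w_k).reshape(
        sequence_length, num_heads, head_dim).transpose(1, 0, 2)

    # Batched matrix product: scores[h] = query[h] @ key[h].T, scaled per head.
    scores = query @ key.transpose(0, 2, 1) / np.sqrt(head_dim)

    # Row-wise softmax within each head.
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

words = ["Hello", "world"]
np.random.seed(0)  # same embeddings as the example cell
input_embeddings = np.random.rand(len(words), 4)

weights = attention_weights_per_head(input_embeddings, num_heads=2)

for head_index, head_weights in enumerate(weights):
    print(f"Head {head_index}")
    # Header row: the words being attended to.
    print(" " * 8 + "".join(f"{w:>8}" for w in words))
    # One row per word doing the attending.
    for word, row in zip(words, head_weights):
        print(f"{word:>8}" + "".join(f"{weight:8.3f}" for weight in row))
```

With random, untrained weights these tables show the mechanics rather than any linguistic pattern. In a trained model, the same kind of table is what reveals which words a head focuses on.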
