# Simple Attention Mechanism Explained

This notebook walks through a simple attention mechanism implementation with detailed explanations at each step. We'll use very descriptive variable names and avoid complex math notation to make it accessible for everyone.

## What is Attention?

At its core, **attention** is about helping a model focus on the most relevant parts of the input when producing each part of the output.

Think of it like this: When you're reading a long document and trying to answer a specific question, you don't give equal focus to every word. You *pay attention* to the parts most relevant to your question. Attention mechanisms help neural networks do the same thing.

## Scaled Dot-Product Attention

We'll implement scaled dot-product attention, the basic attention operation from the "Attention Is All You Need" paper (2017) by Vaswani et al., which introduced the Transformer architecture.

In [None]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

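# Set random seeds for reproducibility
# (the TensorFlow seed fixes the random initial weights of the Dense layers used later)
np.random.seed(42)
tf.random.set_seed(42)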

## The Three Key Players: Queries, Keys, and Values

Before we dive into code, let's understand the three main components of attention:

1. **Queries**: Think of these as "questions" or "search terms" that help you look for relevant information
2. **Keys**: These are like "labels" or "tags" for different pieces of information
3. **Values**: These are the actual information or content you want to extract

With attention, we use the similarity between queries and keys to determine how much of each value to extract.

### Real-World Analogy: Library Search

Imagine you're at a library:
- **Query**: The topic you're researching (e.g., "French cuisine")
- **Keys**: The titles or categories of books on the shelves
- **Values**: The actual content inside each book

You compare your query ("French cuisine") to all the keys (book titles), find the most relevant ones, and then extract information from their values (book contents). Books with titles more similar to your query get more of your attention.
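
To make the analogy concrete, here is a minimal NumPy sketch with one query and three "books". The book titles, the two-feature vectors, and the names `book_keys` and `book_values` are all made up for illustration; they are not learned embeddings, and the scaling step is left out to keep the example tiny.

```python
import numpy as np

# Hypothetical 2-feature vectors: [how much "cooking", how much "France"]
query = np.array([1.0, 1.0])              # "French cuisine"
book_keys = np.array([
    [1.0, 1.0],   # "Cooking in Paris"
    [1.0, 0.0],   # "Grilling Basics"
    [0.0, 0.0],   # "Car Repair"
])
# One made-up number per book standing in for its content
book_values = np.array([10.0, 20.0, 30.0])

# Similarity of the query to each key (dot product)
scores = book_keys @ query                # [2., 1., 0.]

# Softmax turns the scores into weights that sum to 1
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

# The result is a blend of the values, dominated by the best match
blended_value = weights @ book_values

print("Attention weights:", np.round(weights, 3))
print("Blended value:", round(float(blended_value), 3))
```

The best-matching book gets the largest weight, but the other books still contribute a little. Attention is a soft lookup, not an exact-match search.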

In [None]:
def simple_attention_mechanism(query_vectors, key_vectors, value_vectors):
    """
    A simple implementation of scaled dot-product attention.
    
    Parameters:
    -----------
    query_vectors : numpy array of shape (batch_size, query_sequence_length, feature_dimension)
        The queries - what we're using to search for relevant information
        
    key_vectors : numpy array of shape (batch_size, key_sequence_length, feature_dimension)
        The keys - what we're searching through to find matches to our queries
        
    value_vectors : numpy array of shape (batch_size, key_sequence_length, value_dimension)
        The values - the actual information we want to extract based on query-key matching
        Note: key_sequence_length must match value_sequence_length as they're paired
        
    Returns:
    --------
    weighted_sum_of_values : numpy array of shape (batch_size, query_sequence_length, value_dimension)
        Each query gets its own weighted sum of values
        
    attention_weights : numpy array of shape (batch_size, query_sequence_length, key_sequence_length)
        How much each query attended to each key (useful for visualization)
    """
    
    # Step 1: Calculate similarity between queries and keys using dot product
    # For each query, we compute how similar it is to each key
    # Higher dot product = more similarity = more attention
    similarity_scores = np.matmul(query_vectors, np.transpose(key_vectors, (0, 2, 1)))
    
    # Step 2: Scale the similarity scores
    # Dot products tend to grow with the feature dimension, and large scores push
    # softmax toward a near one-hot output whose gradients are extremely small.
    # Dividing by the square root of the key dimension keeps the scores in a moderate range.
    dimension_of_key_vectors = key_vectors.shape[-1]
    scaled_similarity_scores = similarity_scores / np.sqrt(dimension_of_key_vectors)
    
    # Step 3: Convert similarity scores to probabilities using softmax
    # This ensures all attention weights for each query sum to 1
    # Each query now has a probability distribution over all keys
    
    # First, we need a softmax function
    def softmax(x):
        # Subtract max for numerical stability (prevents overflow)
        shifted_x = x - np.max(x, axis=-1, keepdims=True)
        # Calculate exponential values
        exp_x = np.exp(shifted_x)
        # Normalize to get probabilities
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
    
    # Apply softmax to get attention weights
    attention_weights = softmax(scaled_similarity_scores)
    
    # Step 4: Use attention weights to create a weighted sum of values
    # This gives us a new representation focused on the most relevant parts
    weighted_sum_of_values = np.matmul(attention_weights, value_vectors)
    
    return weighted_sum_of_values, attention_weights
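
The scaling in Step 2 is easier to appreciate with numbers. The sketch below assumes that every entry of the queries and keys is drawn independently from a standard normal distribution (a simplifying assumption, not something real embeddings guarantee). Under that assumption the spread of the raw dot products grows like the square root of the dimension, while the scaled scores stay near 1. The helper name `score_spread` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_spread(dimension, number_of_pairs=4000):
    """Standard deviation of query-key dot products, before and after scaling."""
    queries = rng.standard_normal((number_of_pairs, dimension))
    keys = rng.standard_normal((number_of_pairs, dimension))
    # Row-wise dot product: one score per (query, key) pair
    raw_scores = np.sum(queries * keys, axis=-1)
    scaled_scores = raw_scores / np.sqrt(dimension)
    return raw_scores.std(), scaled_scores.std()

spreads = {dimension: score_spread(dimension) for dimension in (4, 64, 512)}
for dimension, (raw_spread, scaled_spread) in spreads.items():
    print(f"dimension={dimension:4d}  raw spread={raw_spread:6.2f}  "
          f"scaled spread={scaled_spread:.2f}")
```

Without scaling, larger dimensions would feed softmax ever larger numbers, and almost all of the attention would collapse onto a single key.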

## Understanding with a Simple Example

Let's create a tiny example to visualize exactly what's happening:

In [None]:
# Create a simple example with small vectors
batch_size = 1                  # Just one example for simplicity
query_sequence_length = 2       # We have 2 queries (imagine 2 words in an output sentence)
key_value_sequence_length = 3   # We have 3 keys/values (imagine 3 words in an input sentence)
feature_dimension = 4           # Each vector has 4 features
value_dimension = 6             # Values have 6 features (could be different from query/key)

# Create sample queries, keys, and values
# In real applications, these would come from learned embeddings or transformations
example_queries = np.random.randn(batch_size, query_sequence_length, feature_dimension)
example_keys = np.random.randn(batch_size, key_value_sequence_length, feature_dimension)
example_values = np.random.randn(batch_size, key_value_sequence_length, value_dimension)

# Run our attention mechanism
attention_output, attention_weights = simple_attention_mechanism(
    example_queries, example_keys, example_values)

print(f"Shape of attention output: {attention_output.shape}")
print(f"Shape of attention weights: {attention_weights.shape}")
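
If the shapes feel opaque, it can help to write the same computation with `np.einsum`, which names every axis explicitly. This is an alternative formulation for illustration only, using freshly generated random inputs with the same shapes as above; it is not how the function in this notebook is written.

```python
import numpy as np

rng = np.random.default_rng(42)
queries = rng.standard_normal((1, 2, 4))   # (batch, query positions, features)
keys = rng.standard_normal((1, 3, 4))      # (batch, key positions, features)
values = rng.standard_normal((1, 3, 6))    # (batch, key positions, value features)

# b = batch, q = query position, k = key position, f = feature
# Summing over f gives one score per (query, key) pair
scores = np.einsum("bqf,bkf->bqk", queries, keys) / np.sqrt(keys.shape[-1])

# Softmax over the key axis
shifted = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)

# v = value feature; summing over k blends the values for each query
output = np.einsum("bqk,bkv->bqv", weights, values)

print(output.shape)          # (1, 2, 6): one blended value vector per query
print(weights.shape)         # (1, 2, 3): one weight per (query, key) pair
print(weights.sum(axis=-1))  # every row of weights sums to 1
```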

### Visualization of Attention Weights

Let's visualize the attention weights to understand what's happening:

In [None]:
# Visualize the attention weights
plt.figure(figsize=(10, 6))
plt.imshow(attention_weights[0], cmap='viridis')
plt.colorbar(label='Attention Weight')
plt.xlabel('Keys (Input Words)')
plt.ylabel('Queries (Output Words)')
plt.title('Attention Weights: How Much Each Query Focuses on Each Key')

# Add text annotations
for i in range(query_sequence_length):
    for j in range(key_value_sequence_length):
        plt.text(j, i, f'{attention_weights[0, i, j]:.2f}',
                 ha="center", va="center", color="white" if attention_weights[0, i, j] < 0.5 else "black")

# Use more intuitive labels
plt.xticks(range(key_value_sequence_length), [f'Input {i+1}' for i in range(key_value_sequence_length)])
plt.yticks(range(query_sequence_length), [f'Output {i+1}' for i in range(query_sequence_length)])

plt.tight_layout()
plt.show()

## A More Realistic Example: Sentence Translation

Let's create a more realistic example where we use attention for translating a simple sentence.

We'll simulate a very simple English → Spanish translation task.

In this example, we'll manually create our vectors rather than training a model, but the attention mechanism works the same way.

In [None]:
# Define a simple English sentence and its Spanish translation
english_sentence = ["The", "cat", "is", "sleeping", "on", "the", "mat"]
spanish_sentence = ["El", "gato", "está", "durmiendo", "en", "la", "estera"]

# For simplicity, we'll manually create word embeddings
# In a real system, these would be learned by the model
# We'll use 5-dimensional embeddings
embedding_dimension = 5

# Create random but fixed embeddings for our words
np.random.seed(123)  # For reproducibility
english_embeddings = {word: np.random.randn(embedding_dimension) for word in english_sentence}
spanish_embeddings = {word: np.random.randn(embedding_dimension) for word in spanish_sentence}

# Convert sentences to sequences of embeddings
english_vectors = np.array([english_embeddings[word] for word in english_sentence])
spanish_vectors = np.array([spanish_embeddings[word] for word in spanish_sentence])

# Reshape for our attention function (add batch dimension)
english_vectors = english_vectors.reshape(1, len(english_sentence), embedding_dimension)
spanish_vectors = spanish_vectors.reshape(1, len(spanish_sentence), embedding_dimension)

# In this example:
# - Keys and Values are from the English sentence (source language)
# - Queries are from the Spanish sentence (target language)
# This simulates attention during translation

translation_attention_output, translation_attention_weights = simple_attention_mechanism(
    query_vectors=spanish_vectors,   # Target language (what we're generating)
    key_vectors=english_vectors,     # Source language (what we're translating from)
    value_vectors=english_vectors    # Using same vectors as values for simplicity
)

In [None]:
# Visualize the translation attention weights
plt.figure(figsize=(12, 8))
plt.imshow(translation_attention_weights[0], cmap='YlOrRd')
plt.colorbar(label='Attention Weight')
plt.xlabel('English Words (Source)')
plt.ylabel('Spanish Words (Target)')
plt.title('Translation Attention Weights: Word Alignment Visualization')

# Add text annotations for the weights
for i in range(len(spanish_sentence)):
    for j in range(len(english_sentence)):
        plt.text(j, i, f'{translation_attention_weights[0, i, j]:.2f}',
                 ha="center", va="center",
                 color="black" if translation_attention_weights[0, i, j] < 0.15 else "white")

# Use actual words as labels
plt.xticks(range(len(english_sentence)), english_sentence, rotation=45)
plt.yticks(range(len(spanish_sentence)), spanish_sentence)

plt.tight_layout()
plt.show()

## Understanding the Attention Map

In the visualization above, each cell shows how much a Spanish word (target) attends to each English word (source). With the `YlOrRd` colormap, darker (redder) cells indicate stronger attention.

In an ideal translation system, we would see:
- "El" attending most to "The"
- "gato" attending most to "cat"
- "está durmiendo" attending most to "is sleeping"
- And so on...

Because our embeddings are random, the map above shows no meaningful alignment. In a trained model, you would typically see patterns like these emerge.
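
To see what an ideal alignment looks like, we can hand-build embeddings instead of drawing them at random. The sketch below makes a strong assumption purely for illustration: every English word gets its own orthogonal direction, and every Spanish word has exactly the same vector as its translation. Real learned embeddings are never this tidy.

```python
import numpy as np

english_sentence = ["The", "cat", "is", "sleeping", "on", "the", "mat"]
spanish_sentence = ["El", "gato", "está", "durmiendo", "en", "la", "estera"]

# Hand-built embeddings (an assumption): scaled one-hot vectors,
# shared between each English word and its Spanish translation
number_of_words = len(english_sentence)
english_vectors = 4.0 * np.eye(number_of_words)[np.newaxis, :, :]
spanish_vectors = english_vectors.copy()

# Scaled dot-product attention weights, Spanish queries over English keys
scores = spanish_vectors @ english_vectors.transpose(0, 2, 1)
scores = scores / np.sqrt(english_vectors.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# Report the English word each Spanish word attends to most
for row, spanish_word in enumerate(spanish_sentence):
    best_match = english_sentence[int(np.argmax(weights[0, row]))]
    print(f"{spanish_word:>10} -> {best_match}")
```

Plotting `weights[0]` with the same heatmap code as above would show a bright diagonal, which is the pattern described in the list.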

## Implementing Attention in TensorFlow/Keras

Now let's implement the same attention mechanism using TensorFlow/Keras for use in neural networks:

In [None]:
import tensorflow as tf
from tensorflow.keras import layers

class ScaledDotProductAttention(layers.Layer):
    """Custom layer implementing scaled dot-product attention."""
    
    def __init__(self, **kwargs):
        super(ScaledDotProductAttention, self).__init__(**kwargs)
    
    def call(self, query_tensor, key_tensor, value_tensor, mask=None):
        """Forward pass for the attention layer.
        
        Parameters:
        -----------
        query_tensor: Tensor of shape (..., query_sequence_length, query_dimension)
            The queries we use to search through the keys
            
        key_tensor: Tensor of shape (..., key_sequence_length, key_dimension)
            The keys we're searching through (must have same dimension as queries)
            
        value_tensor: Tensor of shape (..., key_sequence_length, value_dimension)
            The values we extract based on attention weights
            
        mask: Optional tensor for masking certain positions (e.g., padding)
        
        Returns:
        --------
        attention_output: Tensor of shape (..., query_sequence_length, value_dimension)
            The weighted combination of values
            
        attention_weights: Tensor of shape (..., query_sequence_length, key_sequence_length)
            The attention weights for visualization
        """
        # Step 1: Calculate similarity scores using tensor dot product
        # We transpose the last two dimensions of key_tensor to align with query_tensor
        query_key_similarity_scores = tf.matmul(query_tensor, key_tensor, transpose_b=True)
        
        # Step 2: Scale the similarity scores by square root of dimension
        # This prevents the softmax from having extremely small gradients
        dimension_of_keys = tf.cast(tf.shape(key_tensor)[-1], tf.float32)
        scaled_similarity_scores = query_key_similarity_scores / tf.math.sqrt(dimension_of_keys)
        
        # Step 3: Apply mask if provided (useful for padding or future-masking)
        if mask is not None:
            # The mask is expected to hold 1 at positions to hide and 0 elsewhere,
            # with a shape that broadcasts against the score tensor.
            # Adding a very large negative value to the hidden positions
            # forces softmax to give ~0 attention to them
            scaled_similarity_scores += (mask * -1e9)
        
        # Step 4: Convert to probabilities with softmax
        # This makes all attention weights for each query sum to 1
        attention_probability_weights = tf.nn.softmax(scaled_similarity_scores, axis=-1)
        
        # Step 5: Create weighted sum of values using attention weights
        # Each query gets its own custom weighted blend of values
        attention_weighted_output = tf.matmul(attention_probability_weights, value_tensor)
        
        return attention_weighted_output, attention_probability_weights
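
The mask step is the part of this layer that the translation example never exercises, so here is a small NumPy sketch of it. It assumes the same convention as the layer (1 marks a position to hide, 0 a position to keep). The helper name `masked_softmax` is hypothetical, and all scores are set to zero so that the effect of the mask is easy to read.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis; positions where mask == 1 are hidden."""
    scores = scores + mask * -1e9
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Equal scores: without a mask, every weight would be 1/3
scores = np.zeros((1, 3, 3))   # (batch, query positions, key positions)

# Padding mask: the third key is padding and is hidden from every query.
# Shape (1, 1, 3) broadcasts across the query axis.
padding_mask = np.array([[[0.0, 0.0, 1.0]]])
print(masked_softmax(scores, padding_mask)[0])

# Look-ahead mask: query i may only see keys 0..i (ones above the diagonal)
look_ahead_mask = np.triu(np.ones((3, 3)), k=1)[np.newaxis, :, :]
print(masked_softmax(scores, look_ahead_mask)[0])
```

When the scores carry an extra head axis, as in `(batch, heads, query, key)`, the mask needs a matching extra axis so that it still broadcasts, for example `(batch, 1, 1, key)` for a padding mask.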

Let's test our TensorFlow attention layer with the same translation example:

In [None]:
# Convert our numpy arrays to TensorFlow tensors
tf_spanish_vectors = tf.constant(spanish_vectors, dtype=tf.float32)
tf_english_vectors = tf.constant(english_vectors, dtype=tf.float32)

# Create our attention layer
attention_layer = ScaledDotProductAttention()

# Apply the attention
tf_output, tf_weights = attention_layer(
    query_tensor=tf_spanish_vectors,
    key_tensor=tf_english_vectors,
    value_tensor=tf_english_vectors
)

# Convert back to numpy for visualization
tf_attention_weights = tf_weights.numpy()

# Compare with our previous implementation
print("Maximum absolute difference between implementations:", 
      np.max(np.abs(tf_attention_weights - translation_attention_weights)))

# They should be very close. Small differences are expected because the TensorFlow
# tensors are float32 while the NumPy arrays are float64.
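
The size of that difference can be estimated without TensorFlow at all. The sketch below runs the same attention-weight computation twice in NumPy, once in float64 and once in float32, on random inputs shaped like the translation example. The helper name `attention_weights_only` and the tolerance of `1e-5` are assumptions chosen for this illustration.

```python
import numpy as np

def attention_weights_only(queries, keys):
    """Scaled dot-product attention weights, computed in the dtype of the inputs."""
    scale = keys.shape[-1] ** 0.5   # plain Python float, so float32 inputs stay float32
    scores = queries @ keys.transpose(0, 2, 1) / scale
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(123)
queries = rng.standard_normal((1, 7, 5))
keys = rng.standard_normal((1, 7, 5))

weights_float64 = attention_weights_only(queries, keys)
weights_float32 = attention_weights_only(queries.astype(np.float32),
                                         keys.astype(np.float32))

largest_difference = np.max(np.abs(weights_float64 - weights_float32))
print(f"Largest difference: {largest_difference:.2e}")
print("Close enough:", np.allclose(weights_float64, weights_float32, atol=1e-5))
```

When comparing two implementations, a tolerance-based check such as `np.allclose` is usually more informative than testing for exact equality.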

## Building a Multi-Head Attention Mechanism

Now that we understand the basic attention mechanism, let's implement Multi-Head Attention, which is a key component of Transformers.

**Why Multiple Attention Heads?**

Using multiple attention heads allows the model to focus on different aspects of the input simultaneously. For example:
- One head might focus on grammatical relationships
- Another might focus on semantic meanings
- A third might focus on contextual clues

This is like having multiple people read the same document, each focusing on different aspects, and then combining their insights.

In [None]:
class MultiHeadAttention(layers.Layer):
    """Multi-head attention as described in 'Attention Is All You Need'."""
    
    def __init__(self, embedding_dimension, number_of_attention_heads):
        """Initialize the multi-head attention layer.
        
        Parameters:
        -----------
        embedding_dimension: Integer
            Dimension of the input embeddings
            
        number_of_attention_heads: Integer
            Number of parallel attention heads to use
        """
        super(MultiHeadAttention, self).__init__()
        self.number_of_attention_heads = number_of_attention_heads
        self.embedding_dimension = embedding_dimension
        
        # Ensure the embedding dimension is divisible by the number of heads
        assert embedding_dimension % number_of_attention_heads == 0, \
            f"Embedding dimension {embedding_dimension} must be divisible by number of heads {number_of_attention_heads}"
        
        # Calculate dimension per head (how we'll split the embedding)
        self.dimension_per_attention_head = embedding_dimension // number_of_attention_heads
        
        # Create linear transformations for queries, keys, and values
        # Each Dense layer projects to the full embedding dimension; split_heads later
        # slices that output so that every head works with its own learned projection
        self.query_projection_layer = layers.Dense(embedding_dimension)
        self.key_projection_layer = layers.Dense(embedding_dimension)
        self.value_projection_layer = layers.Dense(embedding_dimension)
        
        # Final output projection combines outputs from all heads
        self.output_projection_layer = layers.Dense(embedding_dimension)
        
        # The basic attention mechanism we'll use for each head
        self.scaled_dot_product_attention = ScaledDotProductAttention()
    
    def split_heads(self, input_tensor, batch_size):
        """Split the embedding dimension into multiple heads.
        
        This reshapes the input tensor to separate out the head dimension
        and then transposes to get the right shape for attention calculation.
        
        Parameters:
        -----------
        input_tensor: Tensor of shape (batch_size, sequence_length, embedding_dimension)
            The input embeddings to split into heads
            
        batch_size: Integer
            Batch size of the input
            
        Returns:
        --------
        split_tensor: Tensor of shape (batch_size, num_heads, sequence_length, dim_per_head)
            The input tensor reshaped to separate out the attention heads
        """
        # Reshape to separate the embedding dimension into heads
        # From: (batch_size, sequence_length, embedding_dimension)
        # To: (batch_size, sequence_length, number_of_attention_heads, dimension_per_attention_head)
        split_tensor = tf.reshape(
            input_tensor,
            (batch_size, -1, self.number_of_attention_heads, self.dimension_per_attention_head)
        )
        
        # Transpose to get shape (batch_size, number_of_attention_heads, sequence_length, dimension_per_attention_head)
        # This puts the head dimension where we need it for parallel attention calculation
        return tf.transpose(split_tensor, perm=[0, 2, 1, 3])
    
    def call(self, query_input, key_input, value_input, mask=None):
        """Forward pass for multi-head attention.
        
        Parameters:
        -----------
        query_input: Tensor of shape (batch_size, query_sequence_length, embedding_dimension)
            Input tensor for queries
            
        key_input: Tensor of shape (batch_size, key_sequence_length, embedding_dimension)
            Input tensor for keys
            
        value_input: Tensor of shape (batch_size, key_sequence_length, embedding_dimension)
            Input tensor for values (one value per key, so the sequence lengths must match)
            
        mask: Optional tensor for masking certain positions
        
        Returns:
        --------
        output: Tensor of shape (batch_size, query_sequence_length, embedding_dimension)
            The multi-head attention output
            
        attention_weights: Dictionary of attention weights from each head
            Useful for visualization and analysis
        """
        batch_size = tf.shape(query_input)[0]
        
        # Step 1: Apply linear projections to create queries, keys, and values
        # These projections create different representations for each head
        query_projections = self.query_projection_layer(query_input)
        key_projections = self.key_projection_layer(key_input)
        value_projections = self.value_projection_layer(value_input)
        
        # Step 2: Split projections into multiple heads
        # This allows each head to focus on different aspects of the input
        query_multi_head = self.split_heads(query_projections, batch_size)
        key_multi_head = self.split_heads(key_projections, batch_size)
        value_multi_head = self.split_heads(value_projections, batch_size)
        
        # Step 3: Apply scaled dot-product attention for each head
        # Each head calculates its own attention weights and weighted outputs
        attention_output_per_head, attention_weights_per_head = self.scaled_dot_product_attention(
            query_multi_head, key_multi_head, value_multi_head, mask)
        
        # Step 4: Transpose and reshape to combine all heads' outputs
        # First transpose from: (batch_size, num_heads, seq_len, dim_per_head)
        # To: (batch_size, seq_len, num_heads, dim_per_head)
        transposed_attention_output = tf.transpose(attention_output_per_head, perm=[0, 2, 1, 3])
        
        # Reshape to: (batch_size, seq_len, embedding_dimension)
        # This combines all the heads' outputs back into the original embedding dimension
        combined_attention_output = tf.reshape(
            transposed_attention_output,
            (batch_size, -1, self.embedding_dimension)
        )
        
        # Step 5: Apply final output projection
        # This learns how to best combine the outputs from all heads
        final_output = self.output_projection_layer(combined_attention_output)
        
        # Create a dictionary of attention weights from each head for visualization
        attention_weights_dict = {f"head_{i+1}": attention_weights_per_head[:, i, :, :]
                                 for i in range(self.number_of_attention_heads)}
        
        return final_output, attention_weights_dict
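
The reshape and transpose in `split_heads` (and their reverse in Step 4) are the easiest part of this class to get wrong, so it is worth tracing them on a tiny tensor. The sketch below repeats the same two operations in NumPy on the numbers 0 to 11, with sizes chosen only for readability.

```python
import numpy as np

batch_size, sequence_length = 1, 3
number_of_heads, dimension_per_head = 2, 2
embedding_dimension = number_of_heads * dimension_per_head

# Fill the tensor with 0..11 so it is easy to see where each number ends up
embeddings = np.arange(batch_size * sequence_length * embedding_dimension, dtype=float)
embeddings = embeddings.reshape(batch_size, sequence_length, embedding_dimension)

# Split: (batch, seq, embed) -> (batch, seq, heads, dim_per_head)
#                            -> (batch, heads, seq, dim_per_head)
split = embeddings.reshape(batch_size, sequence_length, number_of_heads, dimension_per_head)
split = split.transpose(0, 2, 1, 3)

print("Head 1 sees:\n", split[0, 0])   # first half of every word's features
print("Head 2 sees:\n", split[0, 1])   # second half of every word's features

# Merge: undo the transpose, then flatten the two head axes back together
merged = split.transpose(0, 2, 1, 3).reshape(batch_size, sequence_length, embedding_dimension)
print("Round trip recovers the input:", np.array_equal(merged, embeddings))
```

Each head receives a contiguous slice of every word's feature vector. Reshaping straight to `(batch, heads, seq, dim_per_head)` without the transpose would mix features from different words.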

In [None]:
# Let's test the multi-head attention with our translation example
embedding_dimension = 5  # Matches our example vectors
number_of_heads = 1      # For simplicity first, will try more later

# Create multi-head attention layer
multi_head_attention = MultiHeadAttention(
    embedding_dimension=embedding_dimension,
    number_of_attention_heads=number_of_heads
)

# Apply multi-head attention to our translation example
mha_output, mha_weights_dict = multi_head_attention(
    query_input=tf_spanish_vectors,
    key_input=tf_english_vectors,
    value_input=tf_english_vectors
)

print(f"Multi-head attention output shape: {mha_output.shape}")
print(f"Attention weights for head 1 shape: {mha_weights_dict['head_1'].shape}")
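
One consequence of this design is that the number of trainable parameters does not depend on the number of heads. The back-of-the-envelope count below assumes the four `Dense(embedding_dimension)` layers from the class above, inputs whose last dimension equals `embedding_dimension`, and biases enabled (the Keras default for `Dense`). The function name is hypothetical.

```python
def multi_head_attention_parameter_count(embedding_dimension):
    """Weights plus biases of the four Dense(embedding_dimension) layers."""
    # Each layer has a square weight matrix and one bias per output feature
    parameters_per_dense_layer = (embedding_dimension * embedding_dimension
                                  + embedding_dimension)
    # Query, key, value, and output projections
    return 4 * parameters_per_dense_layer

for dimension in (5, 10, 512):
    print(dimension, multi_head_attention_parameter_count(dimension))
# 5 120
# 10 440
# 512 1050624
```

Adding heads only changes how the projected features are grouped, so it makes each head narrower without making the layer bigger.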

In [None]:
# Visualize the attention weights from the first head
head1_weights = mha_weights_dict['head_1'].numpy()

plt.figure(figsize=(12, 8))
plt.imshow(head1_weights[0], cmap='plasma')
plt.colorbar(label='Attention Weight')
plt.xlabel('English Words (Source)')
plt.ylabel('Spanish Words (Target)')
plt.title('Multi-Head Attention (Head 1): Word Alignment Visualization')

# Add text annotations for the weights
for i in range(len(spanish_sentence)):
    for j in range(len(english_sentence)):
        plt.text(j, i, f'{head1_weights[0, i, j]:.2f}',
                 ha="center", va="center",
                 color="black" if head1_weights[0, i, j] < 0.15 else "white")

# Use actual words as labels
plt.xticks(range(len(english_sentence)), english_sentence, rotation=45)
plt.yticks(range(len(spanish_sentence)), spanish_sentence)

plt.tight_layout()
plt.show()

## Trying with Multiple Attention Heads

Now let's see what happens when we use multiple attention heads. This better reflects how attention works in actual Transformer models.

In [None]:
# We need to adjust embedding dimension to be divisible by number of heads
# Let's use a larger embedding size
multi_head_embedding_dimension = 10  # Divisible by 2 and 5
number_of_heads = 2

# Create new random embeddings with the larger dimension
np.random.seed(456)
english_multi_head_embeddings = {word: np.random.randn(multi_head_embedding_dimension) 
                                for word in english_sentence}
spanish_multi_head_embeddings = {word: np.random.randn(multi_head_embedding_dimension) 
                                for word in spanish_sentence}

# Convert to sequences
english_vectors_multi = np.array([english_multi_head_embeddings[word] for word in english_sentence])
spanish_vectors_multi = np.array([spanish_multi_head_embeddings[word] for word in spanish_sentence])

# Reshape for multi-head attention
english_vectors_multi = english_vectors_multi.reshape(1, len(english_sentence), multi_head_embedding_dimension)
spanish_vectors_multi = spanish_vectors_multi.reshape(1, len(spanish_sentence), multi_head_embedding_dimension)

# Convert to TensorFlow tensors
tf_english_vectors_multi = tf.constant(english_vectors_multi, dtype=tf.float32)
tf_spanish_vectors_multi = tf.constant(spanish_vectors_multi, dtype=tf.float32)

# Create multi-head attention with 2 heads
multi_head_attention = MultiHeadAttention(
    embedding_dimension=multi_head_embedding_dimension,
    number_of_attention_heads=number_of_heads
)

# Apply multi-head attention
mha_output_multi, mha_weights_multi = multi_head_attention(
    query_input=tf_spanish_vectors_multi,
    key_input=tf_english_vectors_multi,
    value_input=tf_english_vectors_multi
)

In [None]:
# Visualize attention weights for each head side by side
plt.figure(figsize=(18, 8))

# Plot the first head
plt.subplot(1, 2, 1)
head1_weights_multi = mha_weights_multi['head_1'].numpy()
plt.imshow(head1_weights_multi[0], cmap='YlOrRd')
plt.colorbar(label='Attention Weight')
plt.xlabel('English Words (Source)')
plt.ylabel('Spanish Words (Target)')
plt.title('Head 1 Attention Weights')
plt.xticks(range(len(english_sentence)), english_sentence, rotation=45)
plt.yticks(range(len(spanish_sentence)), spanish_sentence)

# Plot the second head
plt.subplot(1, 2, 2)
head2_weights_multi = mha_weights_multi['head_2'].numpy()
plt.imshow(head2_weights_multi[0], cmap='YlOrRd')
plt.colorbar(label='Attention Weight')
plt.xlabel('English Words (Source)')
plt.ylabel('Spanish Words (Target)')
plt.title('Head 2 Attention Weights')
plt.xticks(range(len(english_sentence)), english_sentence, rotation=45)
plt.yticks(range(len(spanish_sentence)), spanish_sentence)

plt.tight_layout()
plt.show()

## Summary: Understanding Attention

In this notebook, we've broken down the attention mechanism into its simplest form:

1. **Basic Attention**: A weighted sum where the weights come from comparing queries to keys
   - Calculate similarity scores between queries and keys (dot product)
   - Scale scores and convert to probabilities (softmax)
   - Use these probabilities to create a weighted sum of values

2. **Multi-Head Attention**: Multiple parallel attention operations that let the model focus on different aspects
   - Project queries, keys, and values into different subspaces for each head
   - Apply basic attention in each subspace
   - Combine the results from all heads

These attention mechanisms form the foundation of modern Transformer models such as BERT, GPT, and other large language models.

The key insight is that attention allows the model to dynamically focus on the most relevant parts of the input for each part of the output, rather than treating all input elements equally.