
![Logo](logo.jpg)

# Understanding the Attention Mechanism in Book Recommendations: A Practical Tutorial

This tutorial demonstrates how attention mechanisms work using a simple book recommendation system. We'll explore both single-head and multi-head attention, visualizing how each one weighs books against a user's preferences to produce a recommendation.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Sample books and their features (embeddings)

 **Data Structure**  
 We start with a dictionary of books, where each book has:  
 - An embedding vector representing three features: magic, adventure, and romance  
 - A rating score  

 Each embedding component ranges from 0 to 1; a higher value indicates a stronger presence of that feature.

In [3]:
books = {
    "Harry Potter": {
        "embedding": np.array([0.9, 0.8, 0.3]),  # [magic, adventure, romance]
        "rating": 4.8
    },
    "Lord of the Rings": {
        "embedding": np.array([0.8, 0.9, 0.2]),
        "rating": 4.9
    },
    "Romeo and Juliet": {
        "embedding": np.array([0.1, 0.2, 0.9]),
        "rating": 4.5
    },
    "Sherlock Holmes": {
        "embedding": np.array([0.2, 0.9, 0.3]),
        "rating": 4.6
    },
    "Twilight": {
        "embedding": np.array([0.4, 0.3, 0.9]),
        "rating": 4.0
    }
}

## Create KQV (Keys, Query, Values)

- Keys: Book features (embeddings)
- Values: Book ratings
- Query: User preferences (what the user likes)

In [4]:
def create_KQV():
    # Keys: book features
    keys = np.array([book["embedding"] for book in books.values()])
    
    # Values: book ratings
    values = np.array([book["rating"] for book in books.values()])
    
    # Query: user preferences (example: likes magic and adventure)
    query = np.array([0.8, 0.7, 0.2])  # These are my preferences: I like magic and adventure
    
    return keys, query, values

## Single-Head Attention

This function implements the attention mechanism in three steps:
- Compute similarity scores between query and keys
- Apply softmax to get attention weights
- Calculate weighted sum of values
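
The three steps can also be written in symbols. The notation here is an assumption made for illustration: $K$ is the matrix of book embeddings (one row per book), $q$ is the user's preference vector, $v$ is the vector of ratings, and $T$ is the temperature.

```latex
s = \frac{K q}{T}, \qquad
w_i = \frac{e^{s_i}}{\sum_j e^{s_j}}, \qquad
\hat{r} = \sum_i w_i \, v_i
```

Here $s$ holds one similarity score per book, $w$ holds the attention weights (non-negative, summing to 1), and $\hat{r}$ is the predicted rating.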

In [9]:
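def compute_attention(keys, query, values, temperature=1.0):
    # Similarity between the query and each key (dot product), scaled by temperature:
    # a lower temperature sharpens the weights, a higher one flattens them
    attention_scores = np.dot(keys, query) / temperature

    # Softmax; subtracting the max score avoids overflow and leaves the result unchanged
    exp_scores = np.exp(attention_scores - np.max(attention_scores))
    attention_weights = exp_scores / np.sum(exp_scores)

    # Weighted sum of values: the predicted rating
    weighted_sum = np.sum(values * attention_weights)

    return attention_weights, weighted_sum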
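
To get a feel for the `temperature` parameter, the sketch below runs the same three steps at several temperatures on the tutorial's book data. The helper `softmax_attention` is a hypothetical stand-in for `compute_attention`, repeated here only so the example runs on its own; the temperature values are arbitrary choices for illustration.

```python
import numpy as np

# Embeddings and ratings copied from the books dictionary above
keys = np.array([
    [0.9, 0.8, 0.3],  # Harry Potter
    [0.8, 0.9, 0.2],  # Lord of the Rings
    [0.1, 0.2, 0.9],  # Romeo and Juliet
    [0.2, 0.9, 0.3],  # Sherlock Holmes
    [0.4, 0.3, 0.9],  # Twilight
])
values = np.array([4.8, 4.9, 4.5, 4.6, 4.0])
query = np.array([0.8, 0.7, 0.2])  # likes magic and adventure


def softmax_attention(keys, query, values, temperature=1.0):
    """Hypothetical stand-in for compute_attention (same three steps)."""
    scores = keys @ query / temperature
    exp_scores = np.exp(scores - scores.max())  # shift by the max for numerical stability
    weights = exp_scores / exp_scores.sum()
    return weights, float(weights @ values)


results = {}
for temperature in (0.1, 1.0, 10.0):  # arbitrary illustrative values
    weights, rating = softmax_attention(keys, query, values, temperature)
    results[temperature] = (weights, rating)
    print(f"T={temperature:>4}: weights={np.round(weights, 3)}, predicted rating={rating:.2f}")
```

A lower temperature concentrates the weight on the best-matching books, while a higher temperature spreads it toward uniform, which pulls the prediction toward the plain mean of the ratings.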

## Visualization

The visualization shows:
- How similar each book is to user preferences for each feature
- The attention weights assigned to each book
- The final predicted rating

In [10]:
def visualize_attention():
    keys, query, values = create_KQV()
    attention_weights, predicted_rating = compute_attention(keys, query, values)
    
    plt.figure(figsize=(15, 5))
    
    # 1. Query-Key Similarity per feature
    plt.subplot(1, 2, 1)
    similarity_matrix = np.zeros((3, len(books)))  # 3 features × N books
    for i in range(3):  # for each feature
        # The smaller the difference between query and key, the greater the similarity
        similarity_matrix[i] = 1 - np.abs(query[i] - keys[:, i])
    
    book_names = list(books.keys())
    feature_names = ['Magic', 'Adventure', 'Romance']
    
    sns.heatmap(
        similarity_matrix,
        xticklabels=book_names,
        yticklabels=feature_names,
        annot=True,
        fmt='.2f',
        cmap='RdBu',
        vmin=0,
        vmax=1,
        center=0.5
    )
    plt.title('Feature-wise Similarity (1 - |query - key|)')
    plt.xlabel('Books')
    plt.ylabel('Features')
    
    # 2. Attention weights for each book
    plt.subplot(1, 2, 2)
    plt.bar(book_names, attention_weights)
    plt.xticks(rotation=45, ha='right')
    plt.title(f'Attention Weights (Predicted Rating: {predicted_rating:.2f})')
    plt.xlabel('Books')
    plt.ylabel('Attention Weight')
    
    plt.tight_layout()
    plt.show()

In [None]:
# Visualization
visualize_attention()

## Multi-head Attention

The multi-head attention mechanism:
- Uses multiple projection matrices to focus on different aspects of the features
- Each head has its own perspective on the data
- Combines results from all heads for final prediction
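
One way to see what "its own perspective" means: projecting both the query and the keys with a matrix `W` before taking dot products is the same as scoring with a re-weighted dot product that uses `M = W @ W.T`. The sketch below checks this numerically. The matrix `W` is a hypothetical example chosen for readability, not one of the tutorial's projection matrices.

```python
import numpy as np

query = np.array([0.8, 0.7, 0.2])  # user preferences from this tutorial
keys = np.array([
    [0.9, 0.8, 0.3],  # Harry Potter
    [0.1, 0.2, 0.9],  # Romeo and Juliet
])

# Hypothetical projection matrix for one head (illustrative values only):
# keeps magic, halves adventure, nearly drops romance
W = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.1],
])

# Scores computed the way a head does: project, then take dot products
scores_projected = (keys @ W) @ (query @ W)

# The same scores written as a re-weighted dot product with M = W W^T
M = W @ W.T
scores_reweighted = keys @ M @ query

print("projected:  ", scores_projected)
print("re-weighted:", scores_reweighted)
print("M =\n", M)
```

With this `W`, the diagonal of `M` shows how much each feature counts in the head's similarity score, so different matrices make different heads sensitive to different features.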

## Implement Multi-Head Attention

Let's implement the multi-head attention mechanism:  
Step 1: Initialize variables. First, we need lists to track the attention weights and the weighted values for each head.  
Step 2: Process each head. For each attention head, we need to:
- Transform the input using projection matrices
- Compute attention weights
- Store the results  

Step 3: Combine results. After processing all heads, compute the final prediction.
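
The steps can be summarized in symbols. This is a sketch under assumed notation: $W_h$ is the projection matrix of head $h$, $H$ is the number of heads, $d$ is the embedding dimension (`d_model`), $K$, $q$, $v$ are the keys, query, and values, and $T$ is the temperature. It also assumes the scores are scaled by $\sqrt{d}$, as the comments in the code below suggest.

```latex
q_h = q W_h, \qquad K_h = K W_h, \qquad
w_h = \operatorname{softmax}\!\left(\frac{K_h \, q_h}{T \sqrt{d}}\right), \qquad
\hat{r}_h = \sum_i w_{h,i} \, v_i

\text{per-book rating: } \bar{r}_i = \frac{1}{H} \sum_{h=1}^{H} w_{h,i} \, v_i,
\qquad
\text{overall prediction: } \bar{r} = \frac{1}{H} \sum_{h=1}^{H} \hat{r}_h
```

The per-book form corresponds to averaging each book's attention-weighted rating across heads; the overall form averages the heads' single predicted ratings.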

In [7]:
def multi_head_attention(keys, query, values, n_heads=2, temperature=1.0):
    """Multi-head attention mechanism.

    Parameters:
    - keys: array of book features (embeddings)
    - query: user preferences
    - values: book ratings
    - n_heads: number of attention heads (default: 2)
    - temperature: scaling factor for attention scores

    Returns:
    - attention_per_head: list of attention weights for each head
    - final_ratings: attention-weighted rating of each book, averaged over heads
    - projection_matrices: matrices used for each head
    """
    # Dimension of the input vectors: 3 (magic, adventure, romance).
    # Attention scores are scaled by np.sqrt(d_model).
    d_model = query.shape[0]

    # Each head has its own projection matrix W, which lets it focus on different aspects of the input data.
    projection_matrices = {
        0: np.array([[0.8, 0.2, 0.0],    # First head - higher weight for magic
                     [0.2, 0.7, 0.1],
                     [0.0, 0.1, 0.9]]),
        1: np.array([[0.1, 0.8, 0.1],    # Second head - higher weight for adventure
                     [0.1, 0.1, 0.8],
                     [0.8, 0.1, 0.1]])
    }
    if n_heads > len(projection_matrices):
        raise ValueError(f"Only {len(projection_matrices)} projection matrices are defined")

    # Step 1. Initialize empty lists to store results from each head
    attention_per_head = []  # attention weights for each head
    values_per_head = []     # weighted sums (predicted ratings) for each head

    # Step 2. Process each attention head
    for head_idx in range(n_heads):
        W = projection_matrices[head_idx]

        # Project the query and the keys into the head's space
        q_transformed = np.dot(query, W)
        k_transformed = np.dot(keys, W)

        # Compute attention for this head, scaling the scores by sqrt(d_model)
        weights, weighted_sum = compute_attention(
            k_transformed, q_transformed, values,
            temperature=temperature * np.sqrt(d_model)
        )

        # Store the results
        attention_per_head.append(weights)
        values_per_head.append(weighted_sum)

    # Step 3. Combine results
    # For each head, weight every book's rating by that head's attention
    book_ratings_per_head = [values * weights for weights in attention_per_head]

    # Average the weighted ratings across heads for each book
    final_ratings = np.mean(book_ratings_per_head, axis=0)

    # Alternative: a single predicted rating, averaged over all heads
    # final_values = np.mean(values_per_head)

    return attention_per_head, final_ratings, projection_matrices

## Visualize results of Multi-Head Attention

In [8]:
def visualize_multi_head_attention():
    keys, query, values = create_KQV()
    attention_per_head, final_ratings, projection_matrices = multi_head_attention(keys, query, values)
    
    plt.figure(figsize=(15, 10))
    book_names = list(books.keys())
    feature_names = ['Magic', 'Adventure', 'Romance']
    
    # 1. Feature similarity for each head
    for head in range(len(attention_per_head)):
        plt.subplot(2, len(attention_per_head), head + 1)
        
        q_transformed = np.dot(query, projection_matrices[head])
        k_transformed = np.dot(keys, projection_matrices[head])
        
        similarity = np.zeros((len(feature_names), len(book_names)))
        for i in range(len(feature_names)):
            similarity[i] = 1 - np.abs(q_transformed[i] - k_transformed[:, i])
        
        sns.heatmap(
            similarity,
            xticklabels=book_names,
            yticklabels=feature_names,
            annot=True,
            fmt='.2f',
            cmap='RdBu',
            vmin=0,
            vmax=1,
            center=0.5
        )
        plt.title(f'Head {head+1} Feature Similarity\n(After Projection)')
        plt.xticks(rotation=45, ha='right')
    
    # 2. Attention weights for each head
    for head in range(len(attention_per_head)):
        plt.subplot(2, len(attention_per_head), head + 1 + len(attention_per_head))
        plt.bar(book_names, attention_per_head[head])
        plt.title(f'Head {head+1} Attention Weights')
        plt.xticks(rotation=45, ha='right')
        plt.ylabel('Attention Weight')
    
    plt.tight_layout()
    plt.show()
    
    # 3. Final ratings for each book (average from all heads)
    plt.figure(figsize=(10, 5))
    plt.bar(book_names, final_ratings)
    plt.title('Final Book Ratings (Average from all heads)')
    plt.xticks(rotation=45, ha='right')
    plt.ylabel('Rating')
    plt.tight_layout()
    plt.show()


In [None]:
visualize_multi_head_attention()
