# Attention with Frameworks

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

A whole new world of opportunities opens up when we use the framework's layer implementations of the attention components. As of July 2023 we have one layer implemented:

- MultiheadAttention: The general-purpose attention everyone uses, and the one we will learn in this demo! It is basically several attention heads running in parallel, with their outputs concatenated.

Let's get to it!


## Prep

In [None]:
!pip install --upgrade  textblob gensim pytorch-nlp swifter


Let's run some helper code to import what we need and set up the GPU if one is available.

In [None]:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import itertools
import sys
from textblob import TextBlob, Word
import numpy as np
import random
import re
import swifter
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

import os
import pandas as pd
import gensim
import warnings
import nltk

max_length = 100
# Hyperparameters
embedding_dim = 100  # embedding dimension
hidden_dim = 100  # LSTM hidden dimensions
num_layers = 1  # number of LSTM layers
batch_size = 64  # batch size


def set_seeds_and_trace():
  # Seed every random number generator we rely on, including PyTorch's
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  random.seed(42)
  torch.manual_seed(42)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
set_seeds_and_trace()
warnings.filterwarnings('ignore')
nltk.download('punkt')
textblob_tokenizer = lambda x: TextBlob(x).words


## Custom Attention

The easiest way to test attention in PyTorch is to create a simple model that uses such a layer, so we will do just that! This also shows how easy it is to add attention to your models, which we will do extensively when creating THE Transformer from scratch.

Notice we need a custom model class because the inputs need to be the query, key, and value, and they could come from different embeddings as well.

In [None]:
class DotProductAttention(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, query, key, value):
        """
        Forward pass for the Dot Product Attention.

        Args:
        - query: A tensor of shape (batch_size, query_length, dimensions)
        - key: A tensor of shape (batch_size, key_length, dimensions)
        - value: A tensor of shape (batch_size, value_length, dimensions) where key_length == value_length

        Returns:
        - The context vector and the attention weights.
        """
        # Calculate the scores: (batch_size, query_length, key_length)
        scores = torch.bmm(query, key.transpose(1, 2))
        # Apply the softmax over the key dimension to get attention weights
        attention_weights = F.softmax(scores, dim=-1)
        # Create the context vector: (batch_size, query_length, dimensions)
        context = torch.bmm(attention_weights, value)
        return context, attention_weights

In [None]:
batch_size = 2
query_length = 20
key_value_length = 10
dimensions = 3

In [None]:
model = DotProductAttention()

Oh no! We still need inputs to call the model. Well, that is simple: let's simulate a batch of 2 sentences!

In [None]:
query = torch.randn(batch_size, query_length, dimensions)
key = torch.randn(batch_size, key_value_length, dimensions)
value = torch.randn(batch_size, key_value_length, dimensions)

In [None]:
context, attention_weights = model(query, key, value)

In [None]:
context.shape

In [None]:
attention_weights.shape

Notice that attention adds very few parameters (this dot-product version adds none at all), passes a lot of knowledge to the following layers, and is parallelizable.
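
To make the "very few parameters" point concrete, here is a minimal sketch that counts trainable parameters. The names `TinyDotProductAttention` and `count_parameters` are hypothetical and exist only for this illustration; the attention itself is plain (unscaled) dot-product attention, assumed to match what the class above is meant to compute.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDotProductAttention(nn.Module):
    """Hypothetical stand-in for the attention class above (unscaled dot product)."""

    def forward(self, query, key, value):
        # (batch, q_len, dim) x (batch, dim, k_len) -> (batch, q_len, k_len)
        scores = torch.bmm(query, key.transpose(1, 2))
        weights = F.softmax(scores, dim=-1)
        # (batch, q_len, k_len) x (batch, k_len, dim) -> (batch, q_len, dim)
        context = torch.bmm(weights, value)
        return context, weights


def count_parameters(module):
    """Number of trainable parameters in a module (illustrative helper)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


attention = TinyDotProductAttention()
n_attention_params = count_parameters(attention)
# For comparison: a small linear layer has 3*3 weights + 3 biases
n_linear_params = count_parameters(nn.Linear(3, 3))

query = torch.randn(2, 20, 3)
key = torch.randn(2, 10, 3)
value = torch.randn(2, 10, 3)
context, weights = attention(query, key, value)
print(n_attention_params, n_linear_params, context.shape, weights.shape)
```

The whole batch is handled by two batched matrix multiplications and one softmax, which is why this kind of layer parallelizes so well.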

## MultiHead Attention

Now you are ready to see Multi-Head Attention. The idea is quite simple: just as a CNN has many filters, each checking a different aspect of an image, having many attention heads lets us check different aspects of our input, globally. As a picture, it looks like this:

<figure>
<center>
<img src='https://www.dropbox.com/s/wjfxpap06viclhv/mha.png?raw=1'  />
<figcaption>Attention</figcaption></center>
</figure>

Each head performs scaled dot-product attention, as we did before with the weird formula, softmax(Q K^T / sqrt(d_k)) V, and then we concatenate the outputs of all heads!
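
The "split into heads, attend, concatenate" idea can be sketched by hand as follows. This is an illustration under simplifying assumptions, not how the framework layer is implemented: the helper names `split_heads` and `naive_multi_head_attention` are made up for this example, and the learned input and output projections of a real multi-head layer are left out.

```python
import math
import torch
import torch.nn.functional as F


def split_heads(x, num_heads):
    """(batch, seq, embed) -> (batch, heads, seq, head_dim)."""
    batch, seq, embed = x.shape
    head_dim = embed // num_heads
    return x.view(batch, seq, num_heads, head_dim).transpose(1, 2)


def naive_multi_head_attention(query, key, value, num_heads):
    """Scaled dot-product attention per head, then concatenation.

    Illustration only: no learned projections are applied.
    """
    q, k, v = (split_heads(t, num_heads) for t in (query, key, value))
    head_dim = q.size(-1)
    # softmax(Q K^T / sqrt(d_k)) V, computed for all heads at once
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    weights = F.softmax(scores, dim=-1)   # (batch, heads, q_len, k_len)
    per_head = weights @ v                # (batch, heads, q_len, head_dim)
    # Concatenate the heads back into one embedding per position
    batch, _, q_len, _ = per_head.shape
    concat = per_head.transpose(1, 2).reshape(batch, q_len, num_heads * head_dim)
    return concat, weights


x = torch.randn(1, 3, 4)  # (batch_size, sequence_length, embed_size)
output, head_weights = naive_multi_head_attention(x, x, x, num_heads=2)
print(output.shape, head_weights.shape)
```

Each head only sees its own slice of the embedding, so the heads are free to focus on different relationships between positions.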

In [None]:
embed_size = 4  # Embedding size
num_heads = 2   # Number of attention heads
sequence_length = 3  # Sequence length for each input
batch_size = 1  # Batch size

In [None]:
class MultiHeadAttentionModel(nn.Module):
    def __init__(self, embed_size, num_heads):
        super().__init__()
        # embed_size must be divisible by num_heads
        self.multihead_attn = nn.MultiheadAttention(embed_size, num_heads)

    def forward(self, query, key, value):
        # In practice, attention is often applied to a sequence of embeddings with padding.
        # An attention mask could be used to ignore the padding or past/future tokens.
        # Here we do not use such masks for simplicity.

        # Inputs arrive as (batch_size, sequence_length, embed_size). With the default
        # batch_first=False, nn.MultiheadAttention requires (sequence_length, batch_size, embed_size)
        query = query.transpose(0, 1)  # Transpose for the multihead attention input requirements
        key = key.transpose(0, 1)
        value = value.transpose(0, 1)

        # Forward pass of the multihead attention
        # attn_output is the attention applied embeddings (context vectors)
        # attn_output_weights are the attention weights
        attn_output, attn_output_weights = self.multihead_attn(query, key, value)
        return attn_output, attn_output_weights
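
The comments above mention masks without using them. As a hedged sketch of what that could look like, the example below passes a padding mask and a causal mask directly to `nn.MultiheadAttention`. The toy sizes and the choice of which position counts as padding are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size, num_heads = 4, 2
# With the default batch_first=False the layer expects (seq, batch, embed)
mha = nn.MultiheadAttention(embed_size, num_heads)

# One sentence of length 3; we pretend its last position is padding
x = torch.randn(3, 1, embed_size)
# Shape (batch, key_len); True means "ignore this key position"
key_padding_mask = torch.tensor([[False, False, True]])

# Causal mask of shape (query_len, key_len); True means "not allowed to attend",
# so position i cannot look at positions after i
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)

_, padded_weights = mha(x, x, x, key_padding_mask=key_padding_mask)
_, causal_weights = mha(x, x, x, attn_mask=causal_mask)
print(padded_weights)
print(causal_weights)
```

In the first result the column for the padded position is all zeros, and in the second everything above the diagonal is zero, while each row of weights still sums to 1.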


In [None]:
# Initialize the model
multihead_attn_model = MultiHeadAttentionModel(embed_size, num_heads)

# Dummy data in batch-first format; the model transposes it to sequence-first internally
query = torch.randn(batch_size, sequence_length, embed_size)
key = torch.randn(batch_size, sequence_length, embed_size)
value = torch.randn(batch_size, sequence_length, embed_size)


In [None]:
# Forward pass
attn_output, attn_output_weights = multihead_attn_model(query, key, value)


In [None]:
# Transpose back to (batch_size, sequence_length, embed_size) for the output
attn_output = attn_output.transpose(0, 1)

attn_output, attn_output_weights

In [None]:
attn_output.shape

In [None]:
attn_output_weights.shape

**Can you guess where each value in those two shapes comes from?**

Again, notice that even an attention layer as complex as multi-head attention did not add many parameters, yet it adds a lot of lexical intelligence.
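
To check the parameter claim yourself, here is a small sketch that counts the parameters of the framework layer for the toy sizes used in this demo (`embed_size = 4`, `num_heads = 2`). The variable names `param_counts` and `total_params` are just for this example.

```python
import torch.nn as nn

embed_size, num_heads = 4, 2
mha = nn.MultiheadAttention(embed_size, num_heads)

# Number of values in each named parameter tensor of the layer
param_counts = {name: p.numel() for name, p in mha.named_parameters()}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

The layer holds one stacked input projection for query, key, and value plus one output projection, each with a bias. The heads split the embedding between them rather than each getting their own full-size weights, so adding heads does not add parameters.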