<a href="https://colab.research.google.com/github/pmadhyastha/INM434/blob/main/attention_and_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
__author__ = "Pranava Madhyastha" 
__version__ = "INM434/IN3045 City, University of London, Spring 2023"

# Attention and Transformers

Today we will try to take a peek into the concepts of attention and the modelling framework called Transformers. 

### TODO: Please print out the values and verify if you are able to understand each step below. Use lecture slides as the reference. 

In [2]:
import torch
import torch.nn.functional as F

# The sample sentence

We will begin with a sample sentence that we have seen in the lectures: `the cat is on a new mat`


In [3]:
# Define the input sentence
input_sentence = 'the cat is on a new mat'

# Creation of the dictionary 

The next step is to create a dictionary that maps each word in the input sentence to a unique index. This is done by splitting the sentence into a list of words. 

The list of words is then sorted, and each word is assigned an index using the enumerate() function. The result is a dictionary that maps each word to a unique index.

In [5]:
# Create a dictionary that maps each word to a unique index
word_to_index = {word: i for i, word in enumerate(sorted(input_sentence.split()))}
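
One detail worth checking for other sentences: our sample sentence has no repeated words, so sorting the raw word list is enough. The sketch below uses a hypothetical sentence with a repeated word (not the one from the lecture) to show what the same recipe would do, and how deduplicating with `set()` first keeps the indices contiguous. The names `naive_vocab` and `dedup_vocab` are only illustrative.

```python
# Hypothetical sentence with a repeated word ("the"); not the sentence used in this lab
sentence = 'the cat sat on the mat'

# Same recipe as above: enumerate the sorted word list, duplicates included
naive_vocab = {word: i for i, word in enumerate(sorted(sentence.split()))}

# Deduplicate first so that the indices run from 0 to len(vocab) - 1 without gaps
dedup_vocab = {word: i for i, word in enumerate(sorted(set(sentence.split())))}

print(naive_vocab)  # 'the' gets index 5 although the dictionary has only 5 entries
print(dedup_vocab)  # 'the' gets index 4
```

With the naive version, an embedding layer of size `len(naive_vocab)` would be too small for the largest index, so the deduplicated version is the safer choice when words can repeat.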

# Tensor of indices

The input sentence is then converted to a tensor of indices using the word_to_index dictionary. The tensor is created by iterating over each word in the input sentence (after stripping any commas; this particular sentence has none) and looking up its index in the dictionary. The resulting tensor is a one-dimensional tensor of integer indices, one per word, in sentence order.

In [6]:
# Convert the input sentence to a tensor of indices using the word_to_index dictionary
input_tensor = torch.tensor([word_to_index[word] for word in input_sentence.replace(',', '').split()])
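
To verify this step, it may help to print the indices and map them back to words. The sketch below repeats the two steps above in a self-contained form; `index_to_word` and `decoded` are illustrative names that do not appear elsewhere in this notebook.

```python
import torch

input_sentence = 'the cat is on a new mat'
word_to_index = {word: i for i, word in enumerate(sorted(input_sentence.split()))}
input_tensor = torch.tensor([word_to_index[word] for word in input_sentence.split()])

print(word_to_index)  # alphabetical order: a=0, cat=1, is=2, mat=3, new=4, on=5, the=6
print(input_tensor)   # tensor([6, 1, 2, 5, 0, 4, 3])

# Invert the dictionary and decode the indices back into the sentence
index_to_word = {i: word for word, i in word_to_index.items()}
decoded = ' '.join(index_to_word[i] for i in input_tensor.tolist())
print(decoded)
```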


# The embedding layer

The next step is to define an embedding layer. The embedding layer is a PyTorch module that learns a lookup table of word embeddings for a vocabulary of size n and embedding dimensionality d. In this case, the embedding layer is initialized with a vocabulary size equal to the number of unique words in the input sentence and an embedding dimensionality of 16. The input tensor is then passed through the embedding layer to produce the embedded sentence, one 16-dimensional vector per word. The .detach() method detaches the embedded sentence from its computation graph, so no gradients flow back into the embedding weights.

In [7]:
# Define the embedding layer
embedding_layer = torch.nn.Embedding(len(word_to_index), 16)

# Embed the input tensor to get the embedded sentence
embedded_sentence = embedding_layer(input_tensor).detach()
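
The "lookup table" description can be checked directly: embedding a tensor of indices returns the corresponding rows of the layer's weight matrix. The following is a minimal sketch with the same sizes as above; the fixed seed and the name `looked_up` are assumptions made only for this illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed, only so the sketch is reproducible

vocab_size, embedding_dim = 7, 16
embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
input_tensor = torch.tensor([6, 1, 2, 5, 0, 4, 3])

# Forward pass through the layer, detached from the computation graph
embedded_sentence = embedding_layer(input_tensor).detach()

# The same result obtained by indexing rows of the weight matrix
looked_up = embedding_layer.weight[input_tensor].detach()

print(embedded_sentence.shape)                    # torch.Size([7, 16])
print(torch.equal(embedded_sentence, looked_up))  # True
print(embedded_sentence.requires_grad)            # False
```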

The size of the embedding is then defined by taking the shape of the embedded sentence and extracting its second dimension.

In [8]:
# Define the size of the embedding
embedding_size = embedded_sentence.shape[1]

# Query, key and value 

The sizes of the query, key, and value vectors are defined. Recall that these vectors are used in the attention mechanism to compute the attention weights and context vectors. In this case, the query and key vectors have a size of 24, while the value vector has a size of 28.

In [9]:
# Define the sizes of the query, key, and value vectors
query_size, key_size, value_size = 24, 24, 28


# Initialisation and projection

Query, key, and value weight matrices are then defined using PyTorch's random number generator. The weight matrices are used to compute the query, key, and value vectors for each word in the input sentence. Refer to the lecture slides! 

In [10]:
# Define the query, key, and value weight matrices
query_weights = torch.rand(query_size, embedding_size)
key_weights = torch.rand(key_size, embedding_size)
value_weights = torch.rand(value_size, embedding_size)

# Example computation 

The query, key, and value vectors for the second word in the input sentence are then computed by multiplying the corresponding weight matrices by the embedding vector for the second word.

In [11]:
# Get the query, key, and value vectors for the second word in the input sentence
x_2 = embedded_sentence[1]
query_2 = query_weights.matmul(x_2)
key_2 = key_weights.matmul(x_2)
value_2 = value_weights.matmul(x_2)

# Parallelisation

The keys and values for all words in the input sentence are then computed by multiplying the corresponding weight matrices by the transpose of the embedded sentence. The transpose operation is used to swap the rows and columns of the embedded sentence, making it possible to compute the dot product between the weight matrices and the embedding vectors. The transpose operation is then applied again to swap the rows and columns back to their original order.

In [12]:
# Compute the keys and values for all words in the input sentence
keys = key_weights.matmul(embedded_sentence.transpose(0, 1)).transpose(0, 1)
values = value_weights.matmul(embedded_sentence.transpose(0, 1)).transpose(0, 1)
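
The double transpose can be checked against two alternatives: multiplying the embedded sentence by the transposed weight matrix, and projecting each word separately in a loop. The sketch below uses random stand-in tensors of the same shapes as above (not the actual embeddings); `keys_direct` and `keys_loop` are illustrative names.

```python
import torch

torch.manual_seed(0)
seq_len, embedding_size, key_size = 7, 16, 24

# Random stand-ins; float64 keeps the comparison free of rounding noise
embedded_sentence = torch.randn(seq_len, embedding_size, dtype=torch.float64)
key_weights = torch.rand(key_size, embedding_size, dtype=torch.float64)

# Form used in this notebook: (W X^T)^T
keys_transposed = key_weights.matmul(embedded_sentence.transpose(0, 1)).transpose(0, 1)

# Equivalent form without the double transpose: X W^T
keys_direct = embedded_sentence.matmul(key_weights.T)

# Word-by-word projection, as done for the second word earlier
keys_loop = torch.stack([key_weights.matmul(x) for x in embedded_sentence])

print(keys_transposed.shape)  # torch.Size([7, 24])
print(torch.allclose(keys_transposed, keys_direct))
print(torch.allclose(keys_transposed, keys_loop))
```

All three give one key vector per word; the matrix forms simply compute every word in a single call.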

# Computing the attention weights

The attention weights for the second word in the input sentence are then computed by applying the softmax function to the dot products between the query vector for the second word and the key vectors for all words in the input sentence. Before the softmax, the dot products are divided by the square root of the key size, which keeps the scores from growing with the vector dimension. The softmax then normalizes the scaled scores into a set of attention weights that sum to one.
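
To see what the scaling does, the sketch below compares the softmax of raw and scaled scores for one query. The query and keys are random stand-ins with the sizes used in this notebook, and `unscaled` and `scaled` are illustrative names.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, key_size = 7, 24

# Random stand-ins for one query vector and the key vectors of all words
query = torch.rand(key_size)
keys = torch.rand(seq_len, key_size)

# One dot product per word in the sentence
scores = keys.matmul(query)

unscaled = F.softmax(scores, dim=0)
scaled = F.softmax(scores / key_size ** 0.5, dim=0)

print(scaled.sum())                   # the weights sum to one
print(unscaled.max(), scaled.max())   # scaling makes the distribution less peaked
```

Dividing by a constant larger than one can only flatten the softmax, so the largest scaled weight is never bigger than the largest unscaled weight.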




# Computing `h` or the context vector

The context vector for the second word in the input sentence is then computed by multiplying the attention weights by the value vectors for all words in the input sentence. The result is a weighted sum of the value vectors, where the weights are given by the attention weights.

In [16]:
# Compute the attention weights for the second word in the input sentence
omega_2 = query_2.matmul(keys.T)
attention_weights_2 = F.softmax(omega_2 / key_size **0.5, dim=0)


# Compute the context vector for the second word in the input sentence
context_vector_2 = attention_weights_2.matmul(values)
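
The same computation can be carried out for every word at once by stacking all queries into a matrix. The sketch below is one way to do this under the shapes used above, with random stand-in embeddings and weights; `queries` and `context_vectors_all` are names introduced here for illustration. It then checks that the second row matches the single-word computation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, embedding_size = 7, 16
query_size, key_size, value_size = 24, 24, 28

# Random stand-ins; float64 keeps the final comparison free of rounding noise
embedded_sentence = torch.randn(seq_len, embedding_size, dtype=torch.float64)
query_weights = torch.rand(query_size, embedding_size, dtype=torch.float64)
key_weights = torch.rand(key_size, embedding_size, dtype=torch.float64)
value_weights = torch.rand(value_size, embedding_size, dtype=torch.float64)

# Project all words at once
queries = embedded_sentence.matmul(query_weights.T)  # (7, 24)
keys = embedded_sentence.matmul(key_weights.T)       # (7, 24)
values = embedded_sentence.matmul(value_weights.T)   # (7, 28)

# Row i holds the attention weights of word i over all words
attention_weights_all = F.softmax(queries.matmul(keys.T) / key_size ** 0.5, dim=-1)
context_vectors_all = attention_weights_all.matmul(values)  # (7, 28)

# Single-word computation for the second word, as in the cell above
query_2 = query_weights.matmul(embedded_sentence[1])
attention_weights_2 = F.softmax(query_2.matmul(keys.T) / key_size ** 0.5, dim=0)
context_vector_2 = attention_weights_2.matmul(values)

print(context_vectors_all.shape)
print(torch.allclose(context_vectors_all[1], context_vector_2))
```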

# Multi-head attention

The code then defines the number of attention heads to use, which is set to 3. It also defines new query, key, and value weight matrices for the multi-head attention mechanism. The size of each weight matrix is expanded to include a third dimension for the number of attention heads.

Refer to the lecture notes to understand the concept of Multi-head attention.

In [17]:
# Define the number of attention heads
num_heads = 3

# Define the query, key, and value weight matrices for the multi-head attention
multihead_query_weights = torch.rand(num_heads, query_size, embedding_size)
multihead_key_weights = torch.rand(num_heads, key_size, embedding_size)
multihead_value_weights = torch.rand(num_heads, value_size, embedding_size)


The query vector for the second word in the input sentence is obtained by multiplying the embedded vector of the second word with the multi-head query weight matrix. This is done separately for each attention head.

In [18]:
# Get the query vector for the second word in the input sentence for each attention head
multihead_query_2 = multihead_query_weights.matmul(x_2)

# Repeat 

The embedded sentence is repeated for each attention head using the repeat function. This allows each head to have its own set of keys and values for computing attention.

In [21]:
# Repeat the embedded sentence for each attention head
repeated_embedded_sentence = embedded_sentence.unsqueeze(0).repeat(num_heads, 1, 1)

The keys and values for all words in the input sentence are computed for each attention head by multiplying the repeated embedded sentence with the multi-head key and value weight matrices. The transpose function is used to swap the last two dimensions so that the matrix multiplication can be computed, and then to swap them back.

In [22]:
# Compute the keys and values for all words in the input sentence for each attention head
multihead_keys = multihead_key_weights.matmul(repeated_embedded_sentence.transpose(1, 2)).transpose(1, 2)
multihead_values = multihead_value_weights.matmul(repeated_embedded_sentence.transpose(1, 2)).transpose(1, 2)

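# Compute the attention weights and context vectors for each attention head.
# unsqueeze(1) turns each head's query into a (1, query_size) row, so that matmul pairs
# the query of head i only with the keys of head i (a 2-D query would be broadcast across all heads).
attention_weights = F.softmax(multihead_query_2.unsqueeze(1).matmul(multihead_keys.transpose(1, 2)) / key_size**0.5, dim=2)
# Drop the singleton dimension again: one context vector of size value_size per head
context_vectors = attention_weights.matmul(multihead_values).squeeze(1)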
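
Shapes are easy to get wrong in the multi-head case, so a shape check is worthwhile. The sketch below uses random stand-in tensors with the sizes from this notebook; all variable names are illustrative. It shows that `matmul` broadcasts a 2-D sentence over the head dimension (so an explicit repeat is optional), that a 2-D query multiplied with 3-D keys mixes heads, and that adding a singleton dimension gives one set of weights per head.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_heads, seq_len, embedding_size = 3, 7, 16
query_size, key_size, value_size = 24, 24, 28

# Random stand-ins; float64 keeps the final comparison free of rounding noise
embedded_sentence = torch.randn(seq_len, embedding_size, dtype=torch.float64)
w_query = torch.rand(num_heads, query_size, embedding_size, dtype=torch.float64)
w_key = torch.rand(num_heads, key_size, embedding_size, dtype=torch.float64)
w_value = torch.rand(num_heads, value_size, embedding_size, dtype=torch.float64)

# matmul broadcasts the 2-D sentence over the head dimension
keys = embedded_sentence.matmul(w_key.transpose(1, 2))      # (3, 7, 24)
values = embedded_sentence.matmul(w_value.transpose(1, 2))  # (3, 7, 28)
query_2 = w_query.matmul(embedded_sentence[1])              # (3, 24)

# A 2-D query against 3-D keys is broadcast: every head's query meets every head's keys
cross_head_scores = query_2.matmul(keys.transpose(1, 2))             # (3, 3, 7)

# With a singleton dimension, head i's query meets only head i's keys
per_head_scores = query_2.unsqueeze(1).matmul(keys.transpose(1, 2))  # (3, 1, 7)

weights = F.softmax(per_head_scores / key_size ** 0.5, dim=-1)
context = weights.matmul(values).squeeze(1)                          # (3, 28)

# Head 0 computed on its own, as in the single-head section
single_head = F.softmax(keys[0].matmul(query_2[0]) / key_size ** 0.5, dim=0).matmul(values[0])

print(cross_head_scores.shape, per_head_scores.shape, context.shape)
print(torch.allclose(context[0], single_head))
```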


### TODO: Please follow the annotated transformer: http://nlp.seas.harvard.edu/annotated-transformer/ and consider re-writing and running it here on Google Colab. 



# BERT based classifier

In the following sections, we are going to work on the BERT classifier for sentiment analysis - trained and tested over the IMDB dataset. 

In [None]:
!pip install transformers
!pip install datasets # remember lab3? 

The above code gets us pre-optimised versions (and pre-trained weights) for a BERT based model. We are also downloading the datasets library - we had also done this before. 

In [None]:
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizer
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favour of PyTorch's
from datasets import load_dataset

# Load the dataset

dataset = load_dataset('imdb')

We will not work with BERT directly (as it may take a while to run); instead we shall use a "distilled" version of BERT (with a reduced model size) called [DistilBERT](https://arxiv.org/abs/1910.01108). The cell above loads the IMDB dataset from Hugging Face. The code below loads the pre-trained DistilBERT tokenizer and model from the transformers library; tokenizing the text and converting the data to the PyTorch format happens later, just before training.

In [None]:
# Load the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')


The code below defines the model architecture as a PyTorch neural network module. The SentimentClassifier class takes the DistilBERT model as input and defines the forward method. It takes the last hidden state from the output of the DistilBERT model, keeps the vector at the first position (the [CLS] token), and passes it through a linear layer to get one logit per review.

In [None]:
# Define the model architecture
class SentimentClassifier(nn.Module):
    def __init__(self, model):
        super(SentimentClassifier, self).__init__()
        self.model = model
        self.linear = nn.Linear(768, 1)
        
    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state[:, 0, :]
        logits = self.linear(last_hidden_state)
        return logits.squeeze(-1)

# Instantiate the model
model = SentimentClassifier(model)
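
The slicing in the forward method can be tried without downloading any weights. The sketch below replaces the DistilBERT output with a random tensor of the same layout (batch, sequence length, hidden size 768); the batch size, the sequence length and the variable names are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, hidden_size = 4, 10, 768

# Random stand-in for outputs.last_hidden_state
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Keep only the first position of every sequence (the [CLS] token)
cls_representation = last_hidden_state[:, 0, :]  # (4, 768)

# One logit per example; squeeze(-1) turns shape (4, 1) into (4,)
linear = nn.Linear(hidden_size, 1)
logits = linear(cls_representation).squeeze(-1)

print(cls_representation.shape)  # torch.Size([4, 768])
print(logits.shape)              # torch.Size([4])
```

The final squeeze matters because the loss used later expects the logits and the labels to have the same shape.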



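The code below tokenizes the data using the DistilBERT tokenizer, converts it to the PyTorch format, and defines the training and evaluation loops. The training loop sets the model to training mode and iterates through each batch of the training data; it calculates the loss, computes the gradients, and updates the model parameters using the AdamW optimizer. The evaluation loop switches the model to evaluation mode and returns the accuracy on the test data.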

In [None]:
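# Tokenize the data
def tokenize(batch):
    # Pad every review to the same fixed length so that the DataLoader can stack them into batches
    return tokenizer(batch['text'], padding='max_length', truncation=True)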

train_dataset = dataset['train'].map(tokenize, batched=True)
test_dataset = dataset['test'].map(tokenize, batched=True)

# Set the data format
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

# Set up the optimizer and loss function
optimizer = AdamW(model.parameters(), lr=5e-5)
loss_fn = nn.BCEWithLogitsLoss()

# Define the training loop
def train(model, train_loader, optimizer, loss_fn, device):
    model.train()
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = loss_fn(outputs, labels.float())
        print(loss.item())  # print the loss as a plain number
        loss.backward()
        optimizer.step()

# Define the evaluation loop
def evaluate(model, test_loader, device):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            outputs = model(input_ids, attention_mask)
            predictions = torch.round(torch.sigmoid(outputs))

            total += labels.size(0)
            correct += (predictions == labels).sum().item()

    return correct / total
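
The loss and the prediction rule used above can be tried on a handful of numbers. The sketch below uses hypothetical logits and labels (not model outputs) to show that `BCEWithLogitsLoss` gives the same value as a sigmoid followed by `BCELoss`, and that rounding the sigmoid is the same as thresholding the logit at zero.

```python
import torch
import torch.nn as nn

# Hypothetical logits and labels for four reviews
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
labels = torch.tensor([1, 0, 0, 0])

# The loss works on raw logits and applies the sigmoid internally
loss_with_logits = nn.BCEWithLogitsLoss()(logits, labels.float())
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), labels.float())

# Prediction rule from the evaluation loop
predictions = torch.round(torch.sigmoid(logits))
accuracy = (predictions == labels).float().mean().item()

print(loss_with_logits.item(), loss_two_step.item())
print(predictions)  # tensor([1., 0., 1., 0.])
print(accuracy)     # 0.75
```

The single-step loss is generally preferred because it is numerically more stable for large positive or negative logits.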




We will now formally train the model. 

In [None]:
# Train the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)

for epoch in range(1):
    train(model, train_loader, optimizer, loss_fn, device)
    accuracy = evaluate(model, test_loader, device)
    print(f'Epoch {epoch+1} - Test Accuracy: {accuracy:.3f}')

# Todo: 
- Compare this with your models from Lab 3 and Lab 4. 
- Test it on other sentiment datasets. Does it generalise across datasets? 
- How good is the generalisation? 
- Have a look at the Hugging Face library and replace DistilBERT with the bert-base-uncased version of BERT, then rerun the model. Compare the results. 


