## Basic Setup

Run the cells below for the basic setup of this notebook.

In [1]:
try:
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    print('No colab environment, assuming local setup.')

if IN_COLAB:
    drive.mount('/content/drive')

    # TODO: Enter the foldername in your Drive where you have saved the unzipped
    # tutorials folder, e.g. 'alphafold-decoded/tutorials'
    FOLDERNAME = None
    assert FOLDERNAME is not None, "[!] Enter the foldername."

    # Now that we've mounted your Drive, this ensures that
    # the Python interpreter of the Colab VM can load
    # python files from within it.
    import sys
    sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))
    %cd /content/drive/My\ Drive/$FOLDERNAME

    print('Connected COLAB to Google Drive.')

import os
    
base_folder = 'attention'
control_folder = f'{base_folder}/control_values'

assert os.path.isdir(control_folder), 'Folder "control_values" not found, make sure that FOLDERNAME is set correctly.' if IN_COLAB else 'Folder "control_values" not found, make sure that your root folder is set correctly.'

In [2]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
import math
import torch
import os

# Attention

Attention is the underlying mechanism behind most of the biggest breakthroughs in machine learning in recent years. Google published the original transformer paper under the name 'Attention Is All You Need', and so far it has lived up to that title.

In this Notebook, we will implement the following attention mechanisms:

- MultiHeadAttention
- Gated MultiHeadAttention
- Global Gated MultiHeadAttention

These modules will do the heavy lifting for the Evoformer, the first part of AlphaFold's architecture. The rest of the Evoformer will mostly be about stacking the layers correctly. All of them will be implemented in the class `MultiHeadAttention`.

To get started, head over to `mha.py` and implement the `__init__` method and `prepare_qkv`. Don't worry about the `is_global` parameter for now; treat it as if it were set to False. `prepare_qkv` rearranges the query, key and value embeddings so that the different heads are split up and the attention dimension is moved to a fixed position.
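
To get a feeling for what splitting up the heads means, here is a minimal plain-Python sketch. It assumes that the channels of each head are stored contiguously in the last dimension, and it only handles a single (tokens, N_head * c) input. The name `split_heads` and the toy sizes are made up for illustration; the real `prepare_qkv` works on torch tensors and also has to deal with batch dimensions and `attn_dim`.

```python
def split_heads(x, n_head):
    """Rearrange (tokens, n_head * c) nested lists into (n_head, tokens, c)."""
    c = len(x[0]) // n_head
    # For every head h, take the h-th slice of width c from every token.
    return [[token[h * c:(h + 1) * c] for token in x] for h in range(n_head)]


# Two tokens, two heads with c = 3 channels each.
x = [
    [0, 1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10, 11],
]
heads = split_heads(x, n_head=2)
print(heads[0])  # head 0 sees the first three channels of every token
print(heads[1])  # head 1 sees the last three channels of every token
```

After the split, each head can run its own attention over the token dimension without seeing the channels of the other heads.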

Run the following code cell to check your implementation.

In [4]:
from attention.control_values.attention_checks import c_in, c, N_head, attn_dim
from attention.control_values.attention_checks import test_module_shape

from attention.mha import MultiHeadAttention

mha = MultiHeadAttention(c_in, c, N_head, attn_dim, gated=True)
mha_bias = MultiHeadAttention(c_in, c, N_head, attn_dim, gated=True, use_bias_for_embeddings=True)

test_module_shape(mha, 'mha_init', control_folder)
test_module_shape(mha_bias, 'mha_bias_init', control_folder)



In [5]:
from attention.control_values.attention_checks import test_module_method

mha = MultiHeadAttention(c_in, c, N_head, attn_dim=attn_dim, gated=True)

test_module_method(mha, 'mha_prep_qkv', ('q', 'k', 'v'), ('q_prep', 'k_prep', 'v_prep'), control_folder, mha.prepare_qkv)

Next, implement the forward pass through the MultiHeadAttention module. Again, don't worry about global attention for now. The method contains step-by-step instructions for the implementation. You can implement the following modes one-by-one and check each step:
- non-gated non-bias
- gated non-bias 
- gated with bias

The cell checks your implementation in this order.

In [6]:
from attention.control_values.attention_checks import c_in, c, N_head, attn_dim
from attention.control_values.attention_checks import test_module_forward

mha_ungated = MultiHeadAttention(c_in, c, N_head, attn_dim=attn_dim, gated=False)
test_module_forward(mha_ungated, 'mha_ungated_forward', 'x', 'out', control_folder)

mha_gated = MultiHeadAttention(c_in, c, N_head, attn_dim=attn_dim, gated=True)
test_module_forward(mha_gated, 'mha_gated_forward', 'x', 'out', control_folder)

test_module_forward(mha_gated, 'mha_gated_bias_forward', ('x', 'bias'), 'out', control_folder)




Lastly, we will implement the global self-attention mechanism. It is used in the ExtraMSA stack in AlphaFold to account for the large number of sequences.

Global self-attention has two major differences:
- For the key and value embeddings, only one head is used
- The query vectors will be averaged over the query dimension, so that only one query vector will be used for the attention mechanism

Thinking back to the attention mechanism, the number of query vectors determines the number of outputs of the layer, so the global attention variant would reduce the number of outputs to one. However, AlphaFold only uses gated global attention, and the number of outputs is restored when broadcasting the weighted value vectors against the gate embedding.
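
The following sketch walks through that argument for a single head, using plain Python lists instead of tensors. It is only an illustration under simplifying assumptions: the function name `global_attention` is hypothetical, the gate values are given directly instead of being computed by a sigmoid-activated linear layer, and the scores are scaled by sqrt(c) as in standard attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(q, k, v, gate):
    """Single-head global attention. q, gate: (n_q, c); k, v: (n_k, c)."""
    c = len(q[0])
    # Average the queries over the query dimension: one query vector remains.
    q_mean = [sum(row[i] for row in q) / len(q) for i in range(c)]
    # One attention score per key.
    scores = [sum(a * b for a, b in zip(q_mean, key)) / math.sqrt(c) for key in k]
    weights = softmax(scores)
    # A single weighted average of the value vectors.
    pooled = [sum(w * val[i] for w, val in zip(weights, v)) for i in range(c)]
    # Broadcasting against the per-position gate restores n_q outputs.
    return [[g * p for g, p in zip(g_row, pooled)] for g_row in gate]


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[2.0, 0.0], [0.0, 4.0]]
gate = [[1.0, 1.0], [0.5, 0.5], [0.0, 1.0]]  # stands in for sigmoid outputs

out = global_attention(q, k, v, gate)
print(len(out), len(out[0]))  # one output vector per query position
```

All three outputs share the same pooled value vector; only the gate makes them differ between positions.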

Implement the method `prepare_qkv_global`. Also, modify the `__init__` method so that key and value embeddings use only one head when `is_global` is set, and modify the `forward` method so that `prepare_qkv_global` is called instead of `prepare_qkv` if `is_global` is set. You won't have to make any other modifications to `forward`, but it might be helpful to look through the function carefully and see why that's the case.

Test your code with the following cells.

In [7]:
from attention.control_values.attention_checks import c_in, c, N_head, attn_dim
from attention.control_values.attention_checks import test_module_shape

mha_global = MultiHeadAttention(c_in, c, N_head, attn_dim, gated=False, is_global=True)

test_module_shape(mha_global, 'mha_global_init', control_folder)

In [8]:
mha_global = MultiHeadAttention(c_in, c, N_head, attn_dim, gated=False, is_global=True)

test_module_method(mha_global, 'mha_global_prep_qkv', ('q_global', 'k_global', 'v_global'), ('q', 'k', 'v'), control_folder, mha_global.prepare_qkv_global)

In [9]:

mha_global = MultiHeadAttention(c_in, c, N_head, attn_dim, gated=False, is_global=True)

test_module_forward(mha_global, 'mha_global_forward', 'x', 'out', control_folder)

## Task: Sentiment Analysis

In this section, we'll put our newly built MultiHeadAttention module to work on a natural language task. Specifically, we will build a model for sentiment analysis, which means classifying text (in this case sentences from movie reviews) as either positive or negative.

This isn't directly linked to implementing AlphaFold, so if you're in a hurry, feel free to skip this section. However, natural language processing has become one of the most relevant topics in AI, and I think it's really cool to see how simply these models can be built.

The following picture shows Google's Transformer architecture. The decoder (the generating part) of the model is grayed out, as we only need to implement the encoder to extract the semantic meaning. Implementing the decoder would enable the model to actually generate text as well, like a translated version or a response.

<figure align=center style="padding: 30px">
<img src='images/transformer.png' height=600px>
<figcaption>Source: Vaswani et al. Attention Is All You Need.</figcaption>
</figure>

* Input Embeddings: The input sentence is tokenized by breaking it into word fragments based on a pre-defined dictionary ('breathtaking' -> 'breath', '##taking'), which are then replaced by indices ('breath', '##taking' -> 3052, 17904). These indices are replaced by learned embedding vectors. 

* Positional Encoding: Attention, and therefore transformers, have no inherent grasp of the order of their inputs. Attention is a set-to-set operation. To account for that, the inputs are changed based on their position by adding positional encodings. These encodings can be either static functions (like sinusoidal encodings) or learned, where each position index is just replaced with a learned embedding vector. Both give similar performance and we'll use learned embeddings, as they're simpler to implement.

* Multi-Head Attention: We know that one. The only thing to take care of here is masking key vectors that are not actually part of the sentence but just padded so that each element in the batch has the same length.

* Add & Norm: Previous values are added, followed by a LayerNorm.

* Feed Forward: Linear - GELU - Linear Feed-forward model. GELU is similar to ReLU but smooth.
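
For comparison with the learned position embeddings we use, this is roughly what the static sinusoidal alternative from the original Transformer paper looks like. It is a plain-Python sketch for a single position; the function name `sinusoidal_encoding` is made up, and `dim` is assumed to be even.

```python
import math


def sinusoidal_encoding(position, dim, base=10000.0):
    """Positional encoding vector of even size `dim` for one position."""
    encoding = []
    for i in range(0, dim, 2):
        # Each sin/cos pair uses its own frequency, which decreases with i.
        angle = position / base ** (i / dim)
        encoding.append(math.sin(angle))  # even channel
        encoding.append(math.cos(angle))  # odd channel
    return encoding


pe0 = sinusoidal_encoding(0, dim=4)
pe1 = sinusoidal_encoding(1, dim=4)
print(pe0)
print(pe1)
```

The encoding vector is simply added to the word embedding at the same position, just like a learned position embedding would be.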

The tokenizer we are using adds a special start token to the sentence, and we'll use this to do our classification. After the encoder, we'll apply a two-layer feed-forward neural net to the feature at the start position. The encoder can use all its layers to accumulate the semantic meaning of the sentence into this token before our classification.

We'll start the implementation by modifying the forward pass of `MultiHeadAttention` so that it can make use of an attention mask. Basically, that means adding a large negative value (like `-1e8`) to all raw attention scores where the attention mask is zero. This is done before the softmax, so the masked keys receive practically zero attention weight and the values in the next layer are the same as they would be if the sequence didn't contain these tokens.

Note that we only allow attention masks of shape (\*, k), where '\*' represents batch dimensions and k the key dimension. Such a mask can only hide a key completely, for all queries in a text, which is the right way to treat padded tokens that should never be attended to. For the masked attention in a decoder, we would need to allow masks of shape (\*, q, k), since the masks for that task hide keys depending on the query: queries can only attend to keys at the same timestep or before.
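
To see why adding a large negative number before the softmax is enough, here is a small plain-Python sketch for the scores of one query against four keys. The helper names are made up for illustration, and the real implementation has to broadcast the mask over heads and queries on torch tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(scores, attention_mask, neg=-1e8):
    """Softmax over keys where masked keys (mask == 0) get `neg` added."""
    biased = [s + (0.0 if keep else neg) for s, keep in zip(scores, attention_mask)]
    return softmax(biased)


scores = [2.0, 1.0, 0.5, 3.0]  # raw scores of one query against four keys
attention_mask = [1, 1, 0, 0]  # the last two keys are padding

masked = masked_softmax(scores, attention_mask)
unpadded = softmax(scores[:2])  # the same sentence without the padded keys

print(masked)
print(unpadded)
```

The masked keys end up with zero weight, and the remaining weights match the ones computed on the unpadded sequence, even though one of the padded keys had the largest raw score.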

Modify the forward pass in `MultiHeadAttention` and check your implementation by running the following cell.

In [10]:
from attention.control_values.attention_checks import test_module_method

mha = MultiHeadAttention(c_in, c, N_head, attn_dim, use_bias_for_embeddings=True)

test_module_method(mha, 'attention_mask', ('x', 'fake_attention_mask'), 'out', control_folder, lambda x, attention_mask: mha(x, attention_mask=attention_mask))



Next, we'll implement the attention block from the encoder, consisting of the steps Multi-Head Attention - Add & Norm - Feed Forward - Add & Norm. Notably, in the transformer architecture, the key, value and query embeddings typically use an embedding size of `c = c_in / N_head`, so that the output layer (which comes after concatenating the outputs of the individual attention heads) has the dimensions c_in -> c_in. Implement the `__init__` method and the forward pass and check your implementation by running the following two cells.
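
The order of operations in the block can be sketched for a single token vector in plain Python. This is only meant to show the data flow: `toy_attention` and `toy_feed_forward` are hypothetical stand-ins for the real sublayers, and the LayerNorm here has no learnable scale and shift.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def add_and_norm(x, sublayer_out):
    """Residual connection followed by LayerNorm."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


def toy_attention(x):
    return [0.5 * v for v in x]  # stand-in for multi-head attention


def toy_feed_forward(x):
    return [gelu(v) for v in x]  # stand-in for Linear - GELU - Linear


def block(x):
    x = add_and_norm(x, toy_attention(x))
    return add_and_norm(x, toy_feed_forward(x))


# Head size as described above: c = c_in / N_head (toy values).
c_in, n_head = 128, 4
c = c_in // n_head

out = block([1.0, 2.0, 3.0, 4.0])
print(c, out)
print(gelu(-3.0), gelu(0.0), gelu(3.0))
```

The last line also shows the GELU behavior: close to zero for clearly negative inputs and close to the identity for clearly positive ones, with a smooth transition in between.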

In [11]:
from attention.sentiment_analysis import AttentionBlock
from attention.control_values.attention_checks import hidden_size, intermediate_size, N_head

# Check for __init__

attn_block = AttentionBlock(hidden_size, intermediate_size, N_head)
test_module_shape(attn_block, 'attn_block', control_folder)

In [12]:

attn_block = AttentionBlock(hidden_size, intermediate_size, N_head)
test_module_forward(attn_block, 'attn_block_forward', ('sentiment_attn_input', 'sentiment_attn_mask'), 'out', control_folder)

### Preparing the Input
Natural language processing works with text, and that can be a bit messy, especially during the tokenization process. You have to decide how many words or word pieces you want in your vocabulary, which ones to include, and write code to actually parse the text into tokens.

We'll use a tokenizer from HuggingFace for this task. HuggingFace is a leading provider of NLP tools and pre-trained models, making it easier to implement state-of-the-art NLP techniques.

The following cell loads the tokenizer and demonstrates how it is used.

In [13]:
from transformers import AutoTokenizer

# We build our model similar to this architecture
# and use the same tokenizer.
distilbert_model_name = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(distilbert_model_name, resume_download=None)

tokens = tokenizer("You're breathtaking.")
print('Tokens:')
print(tokens)
print()

decoded_tokens = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
print('Decoded tokens: ')
print(decoded_tokens)

You can see that two special tokens were added ([CLS] as start token and [SEP] as end token) and that the string was converted to lower-case. The model we're using (distilbert-base-uncased) doesn't distinguish case.

We'll be working with fixed-size input, padding short sequences and truncating long ones. That works as shown here:

In [14]:
sentences = ["You're breathtaking.", "short"]
tokens = tokenizer(sentences, padding='max_length', truncation=True, max_length=6)
decoded_tokens = [tokenizer.convert_ids_to_tokens(tokens['input_ids'][i]) for i in range(len(sentences))]

print('Tokens: ', tokens, '')
print('Decoded tokens: ', decoded_tokens)

You can see that the long sentence was truncated to six tokens, while the short one was padded. These padded tokens are also set to 0 in the attention_mask.
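
Conceptually, the padding and truncation step boils down to something like the following sketch. It is not the tokenizer's actual implementation: the function name, the token ids and the padding id of 0 are assumptions for illustration, and the real tokenizer additionally keeps the closing [SEP] token when it truncates.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = ids[:max_length]  # truncate long sequences
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad  # pad short sequences
    # 1 marks real tokens, 0 marks padding that must not be attended to.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


long_ids = [11, 12, 13, 14, 15, 16, 17, 18]  # hypothetical token ids
short_ids = [11, 12, 13]

print(pad_or_truncate(long_ids, max_length=6))
print(pad_or_truncate(short_ids, max_length=6))
```

The attention mask is exactly what the masked forward pass of `MultiHeadAttention` consumes later on.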

We provide you with the code for loading the train and validation set in the following cell. Please read through it carefully and make sure you understand it.

In [53]:
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset('glue', 'sst2')

def preprocess_function(examples):
    return tokenizer(examples['sentence'], padding='max_length', truncation=True, max_length=128)

# dataset.map adds the new keys from preprocess_function (input_ids and attention_mask)
# to the already existing ones (sentence, label, idx)
encoded_dataset = dataset.map(preprocess_function, batched=True)

# HuggingFace datasets support multiple libraries. 
# We explicitly specify we're using torch, so that the entries
# in the dataset are mapped to torch tensors
encoded_dataset.set_format('torch')

# Dataloaders are splitting the dataset into batches. 
# We are using only the first 3000 samples for train and 300 for validation.
train_loader = DataLoader(encoded_dataset['train'].select(list(range(3000))), batch_size=16)
# train_loader = DataLoader(encoded_dataset['train'], batch_size=16)
val_loader = DataLoader(encoded_dataset['validation'].select(list(range(300))), batch_size=16)

Let's look at some examples from the dataset. See if you can work out which label is used for positive and which for negative sentiment.

In [36]:
batch = next(iter(train_loader))

print('Sentences: ')
print(batch['sentence'])
print()

print('Labels: ')
print(batch['label'])
print()

print('Input ids: ')
print(batch['input_ids'].shape, batch['input_ids'].dtype)
print('First twenty tokens of the first sentence:')
print(batch['input_ids'][0, :20])
print()

print('Attention mask: ')
print(batch['attention_mask'].shape, batch['attention_mask'].dtype)
print('Attention mask for first twenty tokens of the first sentence:')
print(batch['attention_mask'][0, :20])

### Building the Model

If you look closely at the transformer architecture, you'll see that there's not too much work left to do for the model aside from the attention block we already implemented. We need to create word and position embeddings and a feed-forward model for the final output.

For the word and position embeddings, we use the `nn.Embedding` module. It's basically a one-hot encoding combined with a bias-free linear layer, or, equivalently, a lookup table that maps an index i to the i-th row of its weight matrix.
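
The equivalence between the two views can be checked by hand on a toy table. This sketch uses nested lists with one row per vocabulary entry, which is also how `nn.Embedding` stores its weight; the helper names are made up for illustration.

```python
def one_hot(index, size):
    """Vector of zeros with a single one at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(vec, matrix):
    """Compute vec @ matrix for an (n,) vector and an (n, d) matrix."""
    d = len(matrix[0])
    return [sum(vec[r] * matrix[r][col] for r in range(len(matrix))) for col in range(d)]


# Toy embedding table: vocabulary of 4 tokens, embedding size 3.
table = [
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]

index = 2
via_lookup = table[index]
via_one_hot = vec_times_matrix(one_hot(index, len(table)), table)
print(via_lookup, via_one_hot)
```

Both paths give the same vector, but the lookup avoids building the one-hot vector and multiplying by mostly zeros, which matters for a vocabulary of tens of thousands of tokens.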

We use learned position embeddings, which means converting the indices [0, 1, 2, ..., max_length - 1] to embeddings via `nn.Embedding`. After adding the position and word embeddings, the sum is passed through LayerNorm.

After the attention blocks, we select the output at the position of the first token. The model is supposed to learn to accumulate the semantic meaning of the text in this token over the previous layers. This tensor is passed through a two-layer neural network with two outputs to classify the text as either negative or positive.

We'll use two versions of the model, one with a small architecture that we train from scratch and one with a large architecture that's pretrained on internet text. The following cell specifies the parameters for the architectures:

In [37]:
from types import SimpleNamespace
from transformers import AutoConfig

# In contrast to the naming in our methods, dim is our hidden_size 
# (the dimensions of the attention block inputs and outputs) and 
# hidden_dim is our intermediate_size (the dimension in the 
# feed-forward part of the attention block).
# n_heads are the attention heads, n_layers the number of attention 
# blocks and max_position_embeddings the maximum input length
small_config = SimpleNamespace(vocab_size=30522, dim=128, hidden_dim=256, n_heads=4, n_layers=2, max_position_embeddings=128)

# This config uses the same names as above
large_config = AutoConfig.from_pretrained(distilbert_model_name, resume_download=None)

Start by implementing the `__init__` method in SentimentAnalysis and check your implementation by running the following cell.

In [38]:
from attention.sentiment_analysis import SentimentAnalysis

model = SentimentAnalysis(small_config.vocab_size, small_config.dim, small_config.hidden_dim, small_config.n_heads, small_config.n_layers, small_config.max_position_embeddings)

test_module_shape(model, 'small_sentiment_init', control_folder)

Next, implement the forward pass and check your implementation by running the following cell.

In [39]:
model = SentimentAnalysis(small_config.vocab_size, small_config.dim, small_config.hidden_dim, small_config.n_heads, small_config.n_layers, small_config.max_position_embeddings)

test_module_forward(model, 'small_sentiment_forward', ('input_ids', 'attention_mask'), 'out', control_folder)

### Training the Model

You already know how basic training works: Forward pass, loss computation, backpropagation and weight updates using gradient descent. In this notebook, we won't implement the training loop ourselves but use PyTorch Lightning for that. To do so, we need to prepare a wrapper for our module. The wrapper needs to implement the following methods:

- `__init__`: Storing the wrapped module and initialization of the criterion (the loss function).
- `forward`: Simply forwards the input to the wrapped module.
- `training_step`: Extracts input and labels from the batch, calculates training metrics (loss and accuracy) and returns the loss.
- `validation_step`: Mostly identical to `training_step`. This function is called for every validation batch at the end of an epoch to calculate validation metrics.
- `configure_optimizers`: Defines the optimizer to use. For plain gradient descent, this function would need to return an instance of `torch.optim.SGD` (stochastic gradient descent). We are using AdamW, an adaptive variant of gradient descent that's less sensitive to the choice of the learning rate.
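
The two metrics computed in `training_step` and `validation_step` can be written out by hand as follows. This sketch assumes that the criterion is the cross-entropy loss, which is the usual choice for classification but is an assumption here. In the wrapper itself you would use the corresponding torch functions on tensors rather than these list-based helpers.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy of a batch of logits against integer class labels."""
    losses = []
    for row, label in zip(logits, labels):
        m = max(row)
        # log-sum-exp, shifted by the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


def accuracy(logits, labels):
    """Fraction of samples whose largest logit matches the label."""
    preds = [row.index(max(row)) for row in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]  # three samples, two classes
labels = [0, 1, 1]

loss = cross_entropy(logits, labels)
acc = accuracy(logits, labels)
print(loss, acc)
```

Only the loss is returned to the trainer for backpropagation; the accuracy is just logged, since it isn't differentiable.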

After implementing the wrapper, PyTorch Lightning takes care of the training and the logging of metrics. Implement the class `SentimentWrapper` by following the TODO messages. After that, you can check your implementation by trying to fit the dataset.

If you are getting errors, it might be easier to explicitly set `accelerator='cpu'` in the arguments for `Trainer`, as errors on the GPU often come without descriptive debugging information. For the training however, you'll want to use a GPU. If you are working with Colab, make sure to select a runtime with one.

In [50]:
from attention.sentiment_analysis import SentimentWrapper
from pytorch_lightning import Trainer

model = SentimentAnalysis(small_config.vocab_size, small_config.dim, small_config.hidden_dim, small_config.n_heads, small_config.n_layers, small_config.max_position_embeddings)

model_wrapper = SentimentWrapper(model, learning_rate=1e-3)

trainer = Trainer(max_epochs=10)
trainer.fit(model_wrapper, train_loader, val_loader)


In our experiment, we reached a train accuracy of 96% and a validation accuracy of 64% at the end of training. That's a huge gap and a clear sign of overfitting! 

The problem can be mitigated by training on a larger dataset. If you want to, you can try increasing the number of samples in the dataloader above, maybe from 3000 to 30000.

### Fine-tuning a Model

Even with a larger training set, you'll probably end up with a model that's heavily overfitted and with bad validation performance, even if you go up to the full training size of 67000 samples. 

This is a general problem for natural language processing: Aside from easy patterns (like looking for certain buzzwords), language is inherently complicated. But, luckily, the rules are pretty much the same for all language applications. That's why it's common practice to pretrain language models on a really large dataset of internet text, often with next-word prediction as the task, and then fine-tune them to the task at hand.

We'll do that here. Our model architecture was built so that it matches that of the model 'distilbert' when using the parameters from `large_config`. At the bottom of the sentiment_analysis file, we have a small method for renaming the weight names of DistilBERT to the ones we are using. All we've got to do is load the weights from the pretrained model, rename them and load them into our model.
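
The renaming step is plain dictionary work. The following sketch shows the general idea with a prefix-based mapping; the function `rename_keys`, the parameter names and the mapping are all hypothetical and are not the actual mapping used by `map_keynames_from_distilbert`.

```python
def rename_keys(named_parameters, prefix_map):
    """Return a new dict in which matching key prefixes are replaced."""
    renamed = {}
    for name, value in named_parameters:
        for old, new in prefix_map.items():
            if name.startswith(old):
                name = new + name[len(old):]
                break
        renamed[name] = value
    return renamed


# Hypothetical names, for illustration only.
prefix_map = {
    'embeddings.word_embeddings.': 'word_embeddings.',
    'transformer.layer.': 'blocks.',
}
pretrained = [
    ('embeddings.word_embeddings.weight', 'W_emb'),
    ('transformer.layer.0.attention.q_lin.weight', 'W_q'),
    ('unrelated.bias', 'b'),
]

renamed = rename_keys(pretrained, prefix_map)
print(list(renamed))
```

Since the weights are loaded with `strict=False`, keys that don't match are skipped silently. In PyTorch, `load_state_dict` returns the lists of missing and unexpected keys, so printing its return value is a cheap way to check that only the parameters you expect (like the new classification head) were left uninitialized.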

In [52]:
from transformers import DistilBertModel
from attention.sentiment_analysis import map_keynames_from_distilbert

large_model = SentimentAnalysis(large_config.vocab_size, large_config.dim, large_config.hidden_dim, large_config.n_heads, large_config.n_layers, large_config.max_position_embeddings)

distilbert = DistilBertModel.from_pretrained(distilbert_model_name)

parameters = map_keynames_from_distilbert(distilbert.named_parameters())

large_model.load_state_dict(parameters, strict=False)

In [55]:
##########################################################################
# TODO: Initialize a pytorch lightning trainer and a SentimentWrapper    #
#   for the large model. Then, use the fit method to fit the model       #
#   to the dataset. Make sure you're using a small training set of maybe #
#   3000 samples again, as the larger model trains slowly as it is.      #
#   For finetuning, a smaller learning rate like 2e-5 is often better.   #
##########################################################################

trainer = Trainer(max_epochs=5)  
model_wrapper = SentimentWrapper(large_model, learning_rate=2e-5)
trainer.fit(model_wrapper, train_loader, val_loader)

##########################################################################
#               END OF YOUR CODE                                         #
##########################################################################

Using the pretrained model, we achieved a validation accuracy of about 82%. That's pretty good for the size of the dataset! See if you can beat that with different parameters or a larger training set. If you want to, you can try out the larger model with some reviews of your own in the next cell. You'll probably have little luck with sarcastic reviews, and may get some weird responses to easy reviews as well.

You can also try out some reviews from the official test part of the sst2 dataset ([link](https://huggingface.co/datasets/stanfordnlp/sst2/viewer/default/test)) which might be better at matching the tone of the train set.

In [81]:
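def predict_review(text):
    # Tokenize the review exactly like the training data (padded to 128 tokens).
    encoding = preprocess_function({'sentence': text})
    input_ids = torch.tensor(encoding['input_ids'])
    attention_mask = torch.tensor(encoding['attention_mask'])

    # Switch off training-only behavior and gradient tracking for inference.
    model_wrapper.eval()
    with torch.no_grad():
        out = model_wrapper(input_ids, attention_mask)

    # In SST-2, label 1 is positive and label 0 is negative.
    scores = torch.softmax(out, dim=-1)
    return {'Positive': scores[1].item(), 'Negative': scores[0].item()}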

text = input('Please provide a really short movie review: ')
out = predict_review(text)
print(out)

## Conclusion

With this chapter, we are done with the introductory material. In the next chapter, we will implement the input feature extractor, the module that builds the numeric input tensors for the model from the raw MSA text file.

If you want to learn more about attention, you can check out the later assignments from CS231n (the Computer Vision course from Stanford we suggested in the last chapter) or the [Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/), an online Jupyter Notebook that explains the Transformer Architecture, which powers modern LLMs like ChatGPT.