# Week 6: Transformers

This week's coding assignment is short. You will implement a function that computes multi-head attention for a sequence of input tokens. You will not build or train the rest of the transformer, due to time and compute constraints. Instead, you will work through another notebook that fine-tunes BERT for text classification and answer some questions about it.


### Part I: Self Attention
For this part, you will implement self-attention, the core mechanism in transformers, in a simplified setting. Given the input token sequence and projection matrices, your task is to write a function that computes the transformed sequence after applying self-attention. In practice, you would use an autograd library such as TensorFlow or PyTorch, but since you won't be training the model here, NumPy will suffice. Don't worry about training; your only job is to apply the layer.


Let's begin by defining the components. We will just use random matrices for the weights since we won't be able to train them. Once again, don't worry about optimization and focus on implementing the inference function, assuming that these are the weight matrices we get. 

In [67]:
# Mount your drive before running this. The easiest way to do this is:
# click the Files icon on the left of this page -> Mount Drive (third icon)

# modify this as needed
path = '/content/drive/MyDrive/Colab Notebooks/test_case/'

import numpy as np

d_qkv = 32    # dimension of our query, key, and value vectors
d_model = 256 # dimension of our model
input_len = 32

# the input x consists of 
# input_len tokens 
# each represented by a vector of dimension d_model
x = np.load(path+'x.npy')

# projection matrices to query, keys, and values
W_q = np.load(path+'W_q.npy')
W_k = np.load(path+'W_k.npy')
W_v = np.load(path+'W_v.npy')

# final projection
W_o = np.load(path+'W_o.npy')

Now let's implement the attention layer. You will be doing a lot of matrix multiplications, so to save some pain thinking about how to reshape and transpose them, I recommend checking out the [einsum function](https://rockt.github.io/2018/04/30/einsum). 
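As a quick warm-up for the einsum notation, here is a minimal sketch on small random arrays. The array names and sizes are made up for illustration and are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(3, 5))
D = rng.normal(size=(6, 3))
W = rng.normal(size=(2, 3, 5))   # a stack of two (3, 5) matrices

# 'ij,jk->ik': the repeated index j is summed over, giving an ordinary matrix product
C = np.einsum('ij,jk->ik', A, B)

# 'ij,kj->ik': contract over the second axis of both operands,
# which equals A @ D.T without writing the transpose explicitly
E = np.einsum('ij,kj->ik', A, D)

# 'ij,hjk->hik': the index h is kept, so A is multiplied with every matrix in the stack
F = np.einsum('ij,hjk->hik', A, W)

print(C.shape, E.shape, F.shape)
```

An index that appears in the inputs but not after `->` is summed over; an index that appears in the output is kept. The third pattern is the one that becomes useful once the weights are stacked per head.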

![](https://www.researchgate.net/publication/350311050/figure/fig2/AS:1004363044642817@1616470218633/Transformers-Scaled-Dot-Product-Attention-and-Multi-Head-Attention-From-Vas-17.ppm)

You will implement the computation in the left diagram. First, however, you need to obtain the queries, keys, and values by projecting each token with the respective matrix. If $x_i$ is token $i$'s vector (treated as a row vector), then $\text{query}_i = x_i W_q$. For efficiency, avoid computing these projections one token at a time; instead compute a matrix $Q$ such that $Q[i] = \text{query}_i$ (and similarly $K$ and $V$ for keys and values).

$$
\alpha_{ij} = \text{softmax}_j\left(\frac{\text{query}_i \cdot \text{key}_j}{\sqrt{d_{qkv}}}\right)
$$

Then you will compute the attention weights according to the above formula. In words, the attention of token $i$ on token $j$ is the dot product between $\text{query}_i$ and $\text{key}_j$, scaled down by a factor of $\sqrt{d_{qkv}}$. Remember to take the softmax at the end, over $j$ for each fixed $i$, so that the attention weights of every token sum to 1.
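The attention-weight formula can be sketched on toy data as follows. The sizes and the random `Q` and `K` here are assumptions for illustration only, not the assignment's inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 8                      # toy sizes, not the assignment's
Q = rng.normal(size=(n_tokens, d))      # Q[i] = query_i
K = rng.normal(size=(n_tokens, d))      # K[j] = key_j

# scores[i, j] = query_i . key_j / sqrt(d)
scores = Q @ K.T / np.sqrt(d)

# Softmax over j (the last axis). Subtracting the row-wise max first
# avoids overflow in exp and does not change the result.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

print(weights.shape)          # one row of weights per query token
print(weights.sum(axis=-1))   # each row sums to 1 up to floating-point error
```

Normalizing along the wrong axis is an easy mistake to make here: the rows (one per query) must sum to 1, not the columns.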

$$
\text{out}_i = \sum_j \alpha_{ij} \cdot \text{value}_j
$$

Now, apply the attention weights by computing a weighted sum of the values. 
Finally, project back to the dimension of the model using $y_i = \text{out}_i W_o$ (with $\text{out}_i$ as a row vector). Here, $y_i$ is the representation of token $x_i$ after applying attention: it has queried for and combined information from the other tokens, and is thus "contextualized".
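The weighted sum over values is the same thing as a single matrix product, which a small check can make concrete. This is a sketch on toy data; the sizes, the random attention weights, and the toy `W_o` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_v, d_out = 3, 4, 6          # toy sizes chosen for illustration

# Toy attention weights: non-negative rows that sum to 1
A = rng.random(size=(n_tokens, n_tokens))
A = A / A.sum(axis=1, keepdims=True)
V = rng.normal(size=(n_tokens, d_v))    # V[j] = value_j
W_o = rng.normal(size=(d_v, d_out))     # hypothetical output projection

# Literal translation of out_i = sum_j alpha_ij * value_j
out_loop = np.zeros((n_tokens, d_v))
for i in range(n_tokens):
    for j in range(n_tokens):
        out_loop[i] += A[i, j] * V[j]

# The same computation as one matrix product
out = A @ V

# Final projection: row i of y is y_i
y = out @ W_o
print(np.allclose(out, out_loop), y.shape)
```

Writing the sum as `A @ V` avoids materializing any intermediate three-dimensional array.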


In [68]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [69]:
print(x.shape)
print(W_q.shape)
print(W_k.shape)
print(W_v.shape)
print(W_o.shape)

def softmax(x):
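    # subtract the row-wise max for numerical stability (softmax is shift-invariant)
    x_max = np.max(x, axis=1, keepdims=True)
    e_x = np.exp(x - x_max)
    # normalize each row so that it sums to 1
    return e_x / np.sum(e_x, axis=1, keepdims=True)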

def apply_self_attention(x, W_q, W_k, W_v, W_o):
  
  ### YOUR CODE HERE
  # find Q, K, V
  Q = x @ W_q # (32, 32)
  K = x @ W_k # (32, 32)
  V = x @ W_v # (32, 32)
  # compute attention weights
  Aij = Q @ K.T / (d_qkv ** 0.5) # (32, 32)
  A = softmax(Aij) # (32, 32)
  # apply attention to values
  out = A @ V # (32, 32); out[i] = sum_j A[i, j] * V[j]
  # final projection
  return out @ W_o

# expected output
y_ = np.load(path + 'y.npy')
y = apply_self_attention(x, W_q, W_k, W_v, W_o)
assert np.allclose(y, y_)

(32, 256)
(256, 32)
(256, 32)
(256, 32)
(32, 256)


Now, let's increase the complexity and implement multi-head attention. Instead of having one matrix per projection, you will have `n_heads` sets of $\{W_q, W_k, W_v\}$, each representing a separate version of self-attention running in parallel. In the implementation, we usually stack these matrices into a *tensor* (think of it as a 3D array) of shape [n_heads, d_model, d_qkv] to speed up computation.

Each attention head will compute contextualized embeddings as in the previous exercise. The main difference is that before the final projection, you will concatenate the outputs of all the heads and project back down to the dimension of the model. 

So if $\text{out}_i^j$ is the output at token $i$ from attention head $j$, the combined output is $y_i = \text{Concat}(\text{out}_i^1, \ldots, \text{out}_i^{n\_heads})W_o$.

Feel free to reuse code from the previous exercise. If you really want to get familiar with manipulating these matrices and tensors, use einsum. You can avoid using any for loops at all. 
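One possible loop-free formulation with einsum is sketched below. It assumes the weight layout described above (`W_q`, `W_k`, `W_v` of shape [n_heads, d_model, d_qkv], `W_o` of shape [n_heads * d_qkv, d_model]) and that heads are concatenated in index order. The function and helper names are hypothetical, and the demo uses small random weights rather than the assignment's files.

```python
import numpy as np

def softmax_last(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multihead_attention_einsum(x, W_q, W_k, W_v, W_o):
    # x: (n, d_model); W_q, W_k, W_v: (h, d_model, d_qkv); W_o: (h * d_qkv, d_model)
    n_heads, _, d_qkv = W_q.shape
    # Project every token once per head: (h, n, d_qkv)
    Q = np.einsum('nd,hdk->hnk', x, W_q)
    K = np.einsum('nd,hdk->hnk', x, W_k)
    V = np.einsum('nd,hdk->hnk', x, W_v)
    # Per-head scaled dot products: scores[h, i, j] = Q[h, i] . K[h, j] / sqrt(d_qkv)
    scores = np.einsum('hik,hjk->hij', Q, K) / np.sqrt(d_qkv)
    A = softmax_last(scores)                      # softmax over j
    out = np.einsum('hij,hjk->hik', A, V)         # (h, n, d_qkv)
    # Concatenate heads along the feature axis, head 0 first: (n, h * d_qkv)
    concat = out.transpose(1, 0, 2).reshape(x.shape[0], n_heads * d_qkv)
    return concat @ W_o

# Toy demo with random weights (sizes are arbitrary)
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(5, 16))
Wq_toy = rng.normal(size=(3, 16, 4))
Wk_toy = rng.normal(size=(3, 16, 4))
Wv_toy = rng.normal(size=(3, 16, 4))
Wo_toy = rng.normal(size=(12, 16))
y_toy = multihead_attention_einsum(x_toy, Wq_toy, Wk_toy, Wv_toy, Wo_toy)
print(y_toy.shape)
```

The `transpose` followed by `reshape` plays the role of the concatenation: for each token, the per-head outputs are laid out side by side before the final projection.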

In [70]:
# parameters for multihead attention
n_heads = 8

# projection matrices to query, keys, and values
W_q = np.load(path+'W_q_mh.npy')
W_k = np.load(path+'W_k_mh.npy')
W_v = np.load(path+'W_v_mh.npy')

# final projection
W_o = np.load(path+'W_o_mh.npy')

In [71]:
print(x.shape)
print(W_q.shape)
print(W_k.shape)
print(W_v.shape)
print(W_o.shape)

def apply_multihead_attention(x, W_q, W_k, W_v, W_o):
  
  ### YOUR CODE HERE
  out_list = np.empty([n_heads, x.shape[0], d_qkv]) # one (input_len, d_qkv) output per head
  for h in range(n_heads):
    # find Q, K, V
    Q = x @ W_q[h]
    K = x @ W_k[h]
    V = x @ W_v[h]
    # compute attention weights
    A = Q @ K.T / (d_qkv ** 0.5)
    A = softmax(A) 
    # apply attention to values
    out = A @ V # out[i] = sum_j A[i, j] * V[j]
    out_list[h] = out
  # concatenate head outputs and project
  y = np.concatenate(out_list, axis=1) @ W_o
  return y

# expected output
y_ = np.load(path + 'y_mh.npy')
assert np.allclose(y_, apply_multihead_attention(x, W_q, W_k, W_v, W_o))

(32, 256)
(8, 256, 32)
(8, 256, 32)
(8, 256, 32)
(256, 256)


Hint: Keep track of the shapes of matrices that you compute. 

### Part II: BERT Fine-Tuning Walkthrough

Take a look at [this notebook](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb#scrollTo=TJmV43-aMYPF), which teaches you how to load a pre-trained BERT model and fine-tune it for text classification. Pay *attention* to the workflow and answer the following questions:





1. What architecture is used on top of BERT as the final classifier?

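A feed-forward classification head (linear layers with dropout) applied to the first token's output from the Transformer encoder, not another Transformer.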

YOUR ANSWER HERE

2. Which function prepares text to be fed into BERT?

The tokenizer (`DistilBertTokenizer.from_pretrained`)



YOUR ANSWER HERE

3. What are the roles of the special tokens [CLS] and [SEP] in BERT's input? You might want to look this up.

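[CLS]: marks the start of the sequence; its final hidden state serves as the aggregate representation of the whole input for classification.
[SEP]: separates segments (for example, the two sentences of a sentence pair) and marks the end of the input.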

YOUR ANSWER HERE

4. What is your intuition for why pre-training improves performance on downstream tasks?

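Pre-training lets the model learn general-purpose language representations from large amounts of unlabeled text, so fine-tuning starts from a good initialization and needs less labeled data and less training time.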

YOUR ANSWER HERE