# CS549 Machine Learning
# Assignment 11: Transformer and Transformer-based Models

**Author:** Yang Xu, Assistant Professor of Computer Science, San Diego State University

**Total points: 10**

In this assignment, you will: 
1) Implement the **multi-head attention** sublayer in a transformer encoder.

In [314]:
import math
import numpy as np
import torch
import torch.nn as nn
from scipy.special import softmax

print(torch.__version__)

1.10.2


## Task 1. Implement the multi-head attention sublayer
**Points: 5**

### 1.1 Initialize input data
Step 1, generate random input data of shape $\text{n\_inputs}\times \text{d\_model}$. *Hint*: Use `np.random.rand()`.
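As a small illustration of why the seed line in the next cell matters, the sketch below reseeds NumPy before each call to `np.random.rand()` and checks that the same array comes back. The helper name `make_inputs` is hypothetical and not part of the assignment.

```python
import numpy as np

def make_inputs(n_inputs, d_model, seed=0):
    """Hypothetical helper: reseed NumPy, then draw uniform values in [0, 1)."""
    np.random.seed(seed)
    return np.random.rand(n_inputs, d_model)

a = make_inputs(3, 512)
b = make_inputs(3, 512)

print(a.shape)                        # (3, 512)
print(np.array_equal(a, b))           # True: same seed, same values
print(a.min() >= 0.0, a.max() < 1.0)  # values lie in [0, 1)
```

Without the reseeding step, the second call would continue the random stream and return different values, which is why the graded cells fix the seed.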

In [315]:
np.random.seed(0) # Do not remove this line

d_model = 512
n_inputs = 3

### START YOUR CODE ###
x = np.random.rand(n_inputs, d_model)
### END YOUR CODE ###

In [316]:
# Do not change the code in this cell
print('x:', x)
print('x.shape:', x.shape)

x: [[0.5488135  0.71518937 0.60276338 ... 0.44613551 0.10462789 0.34847599]
 [0.74009753 0.68051448 0.62238443 ... 0.6204999  0.63962224 0.9485403 ]
 [0.77827617 0.84834527 0.49041991 ... 0.07382628 0.49096639 0.7175595 ]]
x.shape: (3, 512)


**Expected output**\
x: [[0.5488135  0.71518937 0.60276338 ... 0.44613551 0.10462789 0.34847599]\
 [0.74009753 0.68051448 0.62238443 ... 0.6204999  0.63962224 0.9485403 ]\
 [0.77827617 0.84834527 0.49041991 ... 0.07382628 0.49096639 0.7175595 ]]\
x.shape: (3, 512)

---
### 1.2 Create weight matrices for *query*, *key*, and *value*

Step 2, create the weight matrices with the correct dimensions.

Let's start with `W_query` and `Q`. *Hint*: First create an empty tensor `W` of shape `(d_model, d_k)` using the `torch.empty()` function, then initialize it with `nn.init.xavier_uniform_()`.

After `W_query` is initialized, we can get the query matrix `Q` by multiplying `x` by `W_query`. *Hint*: Use `np.matmul()`.
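To make the initialization less of a black box, here is a minimal NumPy-only sketch of what Xavier (Glorot) uniform initialization does with its default gain of 1: it samples from $U(-a, a)$ with $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$. The names `bound`, `W_sketch`, and `x_sketch` are illustrative assumptions. Because NumPy's random generator is used as a stand-in for PyTorch's, the sampled values will not match the expected output of the assignment; only the range and the shapes carry over.

```python
import math
import numpy as np

d_model, d_k, n_inputs = 512, 64, 3

# For a 2-D weight, fan_in + fan_out is the sum of its two dimensions.
bound = math.sqrt(6.0 / (d_model + d_k))
print(round(bound, 4))  # 0.1021

# Stand-in sampler (assumption): NumPy uniform instead of nn.init.xavier_uniform_.
rng = np.random.default_rng(0)
W_sketch = rng.uniform(-bound, bound, size=(d_model, d_k))

# The W_query values printed in this assignment fall inside the same range.
printed = np.array([-0.00076412, 0.05475055, -0.0840017, -0.07511146, -0.03930965])
print(np.all(np.abs(printed) < bound))  # True

# Shape check for the projection: (n_inputs, d_model) @ (d_model, d_k).
x_sketch = rng.random((n_inputs, d_model))
print(np.matmul(x_sketch, W_sketch).shape)  # (3, 64)
```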

In [317]:
torch.manual_seed(0) # Do not remove this line

n_heads = 8
d_k = d_model // n_heads

### START YOUR CODE ###
W = torch.empty((d_model, d_k)) # Create an empty tensor W with the correct dimension.
### END YOUR CODE ###

nn.init.xavier_uniform_(W) # Randomly initialize the values in the tensor.
W_query = W.data.numpy() # Get the values as a NumPy array (shares memory with W)

### START YOUR CODE ###
Q = np.matmul(x, W_query)
### END YOUR CODE ###

In [318]:
# Do not change the code in this cell
print('W_query[0,:5]:', W_query[0,:5])
print('W_query.shape:', W_query.shape)
print('Q[0, :5]:', Q[0,:5])
print('Q.shape:', Q.shape)

W_query[0,:5]: [-0.00076412  0.05475055 -0.0840017  -0.07511146 -0.03930965]
W_query.shape: (512, 64)
Q[0, :5]: [-0.22772415  0.48167861  1.48693408 -1.00410576  0.19323685]
Q.shape: (3, 64)


**Expected output**\
W_query[0,:5]: [-0.00076412  0.05475055 -0.0840017  -0.07511146 -0.03930965]\
W_query.shape: (512, 64)\
Q[0, :5]: [-0.22772415  0.48167861  1.48693408 -1.00410576  0.19323685]\
Q.shape: (3, 64)

---
Next, repeat for `W_key` & `K`, and `W_value` & `V`.

In [319]:
torch.manual_seed(1) # Do not remove this line

### START YOUR CODE ###
W = torch.empty((d_model, d_k)) # Create an empty tensor W with the correct dimension.
### END YOUR CODE ###

nn.init.xavier_uniform_(W)
W_key = W.data.numpy()

### START YOUR CODE ###
K = np.matmul(x, W_key)
### END YOUR CODE ###

In [320]:
torch.manual_seed(2) # Do not remove this line

### START YOUR CODE ###
W = torch.empty((d_model, d_k)) # Create an empty tensor W with the correct dimension.
### END YOUR CODE ###

nn.init.xavier_uniform_(W)
W_value = W.data.numpy()

### START YOUR CODE ###
V = np.matmul(x, W_value)
### END YOUR CODE ###

In [321]:
# Do not change the code in this cell
print('K[0,:5]', K[0,:5])
print('K.shape', K.shape)
print('V[0,:5]', V[0,:5])
print('V.shape', V.shape)

K[0,:5] [ 0.2283654  -0.65482728 -0.07202067  0.49886374  0.57045028]
K.shape (3, 64)
V[0,:5] [-0.44997754  0.92097362 -0.76932045  0.03289757 -0.49462588]
V.shape (3, 64)


**Expected output**\
K[0,:5] [ 0.2283654  -0.65482728 -0.07202067  0.49886374  0.57045028]\
K.shape (3, 64)\
V[0,:5] [-0.44997754  0.92097362 -0.76932045  0.03289757 -0.49462588]\
V.shape (3, 64)

---
### 1.3 Compute attention scores and weighted output

Step 3, compute the attention scores using the matrices `Q` and `K`, following the equation:

\begin{equation}
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{equation}

in which dividing by $\sqrt{d_k}$ scales the dot products so that their magnitude does not grow with the key dimension $d_k$.
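The effect of the $\sqrt{d_k}$ factor can be checked numerically. The sketch below assumes query and key entries are independent with zero mean and unit variance, which is an idealization and not the actual distribution of `Q` and `K` in this assignment. Under that assumption, the dot product of two $d_k$-dimensional vectors has a standard deviation of about $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
n_pairs = 10_000

# Idealized queries and keys (assumption): i.i.d. standard normal entries.
q = rng.standard_normal((n_pairs, d_k))
k = rng.standard_normal((n_pairs, d_k))

dots = np.sum(q * k, axis=1)   # one dot product per (q, k) pair
scaled = dots / np.sqrt(d_k)   # the scaling used in the attention equation

print(dots.std())    # close to sqrt(64) = 8
print(scaled.std())  # close to 1
```

Keeping the scores near unit scale stops the softmax from saturating, where one weight approaches 1 and the rest approach 0.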

*Hint*: You should first compute `attn_scores`, which is the unnormalized score. Then you can apply the `softmax()` function imported from `scipy` to calculate the normalized scores. Note that you need to specify the `axis` argument correctly when you call `softmax()`.
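To see what the `axis` argument controls, the following sketch uses a NumPy-only row-wise softmax in place of `scipy.special.softmax(scores, axis=1)`. The function name `softmax_rows` is hypothetical. The input is the unnormalized score matrix printed in this assignment, so the first row of the result can be compared with the expected output.

```python
import numpy as np

def softmax_rows(scores):
    """Hypothetical stand-in for scipy.special.softmax(scores, axis=1)."""
    # Subtracting the row maximum leaves the result unchanged but avoids overflow.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

scores = np.array([[-0.75497307, -0.97036233, -0.85112729],
                   [ 0.23777018, -0.70730381, -0.37639239],
                   [ 0.21608578, -0.73905372, -0.89881112]])

weights = softmax_rows(scores)
print(weights.sum(axis=1))  # every row sums to 1
print(weights.sum(axis=0))  # columns generally do not
print(weights[0])           # approximately [0.368 0.297 0.335]
```

Each row holds the scores of one query against all keys, so normalizing along `axis=1` makes each query's weights sum to 1. Normalizing along `axis=0` would instead mix weights across different queries.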

In [322]:
### START YOUR CODE ###
attn_scores = np.matmul(Q, K.T) / np.sqrt(d_k)
### END YOUR CODE ###

### START YOUR CODE ###
attn_scores_norm = softmax(attn_scores, axis=1)
### END YOUR CODE ###

In [323]:
# Do not change the code in this cell
print('attn_scores.shape:', attn_scores.shape)
print('Unnormalized attn_scores:', attn_scores)
print('Normalized atten_scores:', attn_scores_norm)

attn_scores.shape: (3, 3)
Unnormalized attn_scores: [[-0.75497307 -0.97036233 -0.85112729]
 [ 0.23777018 -0.70730381 -0.37639239]
 [ 0.21608578 -0.73905372 -0.89881112]]
Normalized atten_scores: [[0.36838498 0.29700212 0.33461289]
 [0.51820328 0.20140013 0.2803966 ]
 [0.58387084 0.22464925 0.19147991]]


**Expected output**\
attn_scores.shape: (3, 3)\
Unnormalized attn_scores: [[-0.75497307 -0.97036233 -0.85112729]\
 [ 0.23777018 -0.70730381 -0.37639239]\
 [ 0.21608578 -0.73905372 -0.89881112]]\
Normalized atten_scores: [[0.36838498 0.29700212 0.33461289]\
 [0.51820328 0.20140013 0.2803966 ]\
 [0.58387084 0.22464925 0.19147991]]

---

Step 4, finally, compute the output as the weighted sum of the rows of the value matrix `V`, using the `attn_scores_norm` computed above as the weights.

*Hint*: `attn_scores_norm[0,:]` is the weight for the first output `weighted_output[0,:]`, \
so the computation is:\
`weighted_output[0,:] = attn_scores_norm[0,0] * V[0,:] + attn_scores_norm[0,1] * V[1,:] + attn_scores_norm[0,2] * V[2,:]`. \
But you can achieve this with one line of code using `np.matmul()`.
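The equivalence between the explicit sum in the hint and a single matrix product can be verified on small random data. In this sketch, `weights` and `V_demo` are made-up stand-ins for `attn_scores_norm` and `V`, not the values from the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins (assumption): row-normalized weights and a value matrix.
weights = rng.random((3, 3))
weights /= weights.sum(axis=1, keepdims=True)
V_demo = rng.standard_normal((3, 64))

# Explicit weighted sum for the first output row, as written in the hint.
explicit_row0 = (weights[0, 0] * V_demo[0, :]
                 + weights[0, 1] * V_demo[1, :]
                 + weights[0, 2] * V_demo[2, :])

# One matrix product computes every output row at once.
vectorized = np.matmul(weights, V_demo)

print(np.allclose(explicit_row0, vectorized[0, :]))  # True
print(vectorized.shape)                              # (3, 64)
```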

In [324]:
### START YOUR CODE ###
weighted_output = attn_scores_norm @ V
### END YOUR CODE ###

print('weighted_output[0,:5]:', weighted_output[0,:5])
print('weighted_output.shape:', weighted_output.shape)

weighted_output[0,:5]: [-0.37040031  0.493314   -0.78595572  0.09711595 -0.33551551]
weighted_output.shape: (3, 64)


**Expected output**\
weighted_output[0,:5]: [-0.37040031  0.493314   -0.78595572  0.09711595 -0.33551551]\
weighted_output.shape: (3, 64)

---
**Congratulations!** You have finished Task 1, and now you know how to implement the self-attention module, which is the core technique of the Transformer.
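As a recap, Steps 2 to 4 can be collected into one function, and running it once per head gives a rough picture of how a multi-head layer is assembled. This is a sketch under stated assumptions, not the reference solution. The names `softmax_rows`, `self_attention`, and `multi_head` are hypothetical, the weights are drawn with NumPy inside the Xavier uniform range instead of with `nn.init.xavier_uniform_()`, and the final output projection that a full multi-head layer applies after concatenation is omitted.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax (hypothetical stand-in for scipy's softmax with axis=1)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def self_attention(x, W_query, W_key, W_value):
    """Single-head scaled dot-product self-attention, following Steps 2 to 4."""
    Q = np.matmul(x, W_query)
    K = np.matmul(x, W_key)
    V = np.matmul(x, W_value)
    d_k = K.shape[1]
    weights = softmax_rows(np.matmul(Q, K.T) / np.sqrt(d_k))
    return np.matmul(weights, V)

rng = np.random.default_rng(0)
d_model, n_heads, n_inputs = 512, 8, 3
d_k = d_model // n_heads                # 64, as in the assignment
bound = np.sqrt(6.0 / (d_model + d_k))  # Xavier uniform range (assumed init)

x = rng.random((n_inputs, d_model))

heads = []
for _ in range(n_heads):
    # Each head gets its own query, key, and value weights.
    W_q = rng.uniform(-bound, bound, (d_model, d_k))
    W_k = rng.uniform(-bound, bound, (d_model, d_k))
    W_v = rng.uniform(-bound, bound, (d_model, d_k))
    heads.append(self_attention(x, W_q, W_k, W_v))

# Concatenating the heads restores the model dimension: 8 * 64 = 512.
multi_head = np.concatenate(heads, axis=1)
print(heads[0].shape)    # (3, 64)
print(multi_head.shape)  # (3, 512)
```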