# 10-714 Homework 4 Extension

This homework is an extension of homework 4, in which you will implement the Transformer architecture. Everything you need to implement for this assignment is in the file `python/needle/nn/nn_transformer.py`; the rest of the needle library remains the same. Because this extension builds on homework 4, make sure to copy over your solutions from homework 4.

In [1]:
!make

-- Found pybind11: /root/miniconda3/envs/LightTorch/lib/python3.9/site-packages/pybind11/include (found version "2.13.1")
-- Found cuda, building cuda backend
Sun Aug 11 22:18:38 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3090         On | 00000000:3D:00.0 Off |                  N/A |
| 50%   38C    P8               33W / 350W|     23MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |

In [2]:
%set_env PYTHONPATH ./python
%set_env NEEDLE_BACKEND nd

env: PYTHONPATH=./python
env: NEEDLE_BACKEND=nd


In [3]:
import sys
sys.path.append('./python')

In [None]:
# Download the Penn Treebank (PTB) dataset

import urllib.request
import os

ptb_data = "https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb."
# Create the target directory first; urlretrieve does not create missing directories
os.makedirs('./data/ptb', exist_ok=True)
for f in ['train.txt', 'test.txt', 'valid.txt']:
    # Skip files that have already been downloaded
    if not os.path.exists(os.path.join('./data/ptb', f)):
        urllib.request.urlretrieve(ptb_data + f, os.path.join('./data/ptb', f))

## Transformers

In the previous homework, you implemented two sequence models: the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM). These models were once the state-of-the-art and default architecture choices for sequence modelling tasks, including language generation, until the paper "[Attention Is All You Need](https://arxiv.org/abs/1706.03762)" (Vaswani et al. 2017) introduced the Transformer. Since then, Transformers have become the standard and most performant class of models on language tasks.

You will be implementing a Transformer in `python/needle/nn/nn_transformer.py`.

Transformers are composed of three main components that you will implement.
1. A masked multi-head attention mechanism that adaptively focuses on different timesteps of a sequence. 
2. A residual block consisting of the attention layer followed by a two-layer neural network applied independently at each timestep. 
3. A Transformer model consisting of several stacked residual blocks (in this homework you will implement a decoder-only transformer).

![model](https://miro.medium.com/v2/1*ZCFSvkKtppgew3cc7BIaug.png)

The figure above shows the Transformer architecture from Vaswani et al. 2017. The version of the Transformer you will implement is nearly identical, but has layer normalization applied at the start of each residual block (referred to as a [prenorm variant](https://arxiv.org/abs/2002.04745) of the Transformer).

## Part 1: Implementing the Multi-Head Attention Activation Layer

In this subproblem, you will implement the `forward` function of a "base" attention activation layer `MultiHeadAttention` in `python/needle/nn/nn_transformer.py`. This activation layer takes three inputs:
<p style="text-align: center;">multi-head queries $Q \in R^{B \times H \times T \times D}$, keys $K \in R^{B \times H \times T \times D}$, and values $V \in R^{B \times H \times T \times D}$</p>

where $B$ is the batch size, $H$ is the number of attention heads, $T$ is the sequence length, and $D$ is the hidden dimension of each head.

The attention output $X \in R^{B \times H \times T \times D}$ is computed as follows:

<p style="text-align: center;">$X = \text{softmax}(\frac{Q K^T}{\sqrt{D}}) V$</p>

Note that the matrix multiplications above are batched over the leading $B$ and $H$ axes. This functionality is not natively supported in needle yet, so we have provided a convenience function `matmul` for batched matrix multiplication in `MultiHeadAttention`. Your goal in this section is to return $X$ given the input queries, keys, and values.
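As a reference for the expected shapes, here is a minimal NumPy sketch of the unmasked formula above. It is not the needle implementation: `softmax` and `attention` are illustrative helper names, and NumPy's `@` operator stands in for the provided `matmul`.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Unmasked attention for q, k, v of shape (B, H, T, D)."""
    d = q.shape[-1]
    # Batched Q K^T: swap the last two axes of k, giving scores of shape (B, H, T, T).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    probs = softmax(scores, axis=-1)
    # Weighted sum of the values, shape (B, H, T, D).
    return probs @ v, probs

rng = np.random.default_rng(0)
B, H, T, D = 2, 4, 5, 8
q, k, v = (rng.standard_normal((B, H, T, D)) for _ in range(3))
x, probs = attention(q, k, v)
print(x.shape, probs.shape)  # (2, 4, 5, 8) (2, 4, 5, 5)
```

Each row of `probs` sums to 1 along the last axis, which is a quick sanity check for your own implementation.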

For an autoregressive Transformer, this attention should support causal masking using the function `self.create_causal_mask` we have provided. This ensures that the prediction of the next token depends only on the tokens that precede it. Specifically, causal masking applies a mask before the softmax, so that the softmax probabilities are computed over a masked version of the matrix $\frac{Q K^T}{\sqrt{D}}$.
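The effect of a causal mask can be seen in a small NumPy sketch. The helper `causal_mask` below is hypothetical and only mimics what a mask of this kind typically looks like (a large negative value strictly above the diagonal); use the provided `self.create_causal_mask` in your actual solution.

```python
import numpy as np

def causal_mask(t):
    # Hypothetical helper: a large negative value strictly above the diagonal, 0 elsewhere.
    return np.triu(np.ones((t, t)), k=1) * -np.finfo(np.float32).max

def masked_softmax(scores, mask):
    # Add the mask BEFORE the softmax so masked positions receive zero probability.
    z = scores + mask
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T = 4
scores = np.zeros((1, 1, T, T))          # uniform scores, shape (B, H, T, T)
probs = masked_softmax(scores, causal_mask(T))
print(np.round(probs[0, 0], 2))
```

With uniform scores, row $i$ of the result is uniform over positions $0, \dots, i$ and exactly zero for every later position, so no timestep attends to its future.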

In addition, your implementation should apply dropout to the attention softmax $\text{softmax}(\frac{Q K^T}{\sqrt{D}})$. You can use the `self.dropout` function of the `MultiHeadAttention` module.
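For intuition, the sketch below applies inverted dropout to a stand-in attention matrix in NumPy. It assumes the usual convention of zeroing entries with probability `p` and rescaling the survivors by `1 / (1 - p)`; the `dropout` function here is illustrative and is not the needle `Dropout` module.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    # Inverted dropout: zero each entry with probability p, rescale the rest by 1 / (1 - p).
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

rng = np.random.default_rng(0)
probs = np.full((1, 1, 3, 3), 1.0 / 3.0)   # stand-in for the softmax output
dropped = dropout(probs, p=0.5, rng=rng)
unchanged = dropout(probs, p=0.5, rng=rng, training=False)
print(dropped[0, 0])
```

Note that after dropout the rows no longer sum to 1; this is expected, since dropout is applied to the probabilities after the softmax rather than to the scores before it.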

Importantly, this layer is only an activation function, and has no trainable variables (these come later).

Once you have finished your implementation, test your code with the following test cases.

In [4]:
!python3 -m pytest -l -v -k "attention_activation"

platform linux -- Python 3.9.19, pytest-8.3.1, pluggy-1.5.0 -- /root/miniconda3/envs/LightTorch/bin/python3
cachedir: .pytest_cache
rootdir: /root/workspace/LightTorch/archive/hw4_extra
plugins: anyio-4.4.0
collected 112 items / 96 deselected / 16 selected

tests/hw4_extra/test_transformer.py::test_attention_activation[cpu-0.0-False-64-31-5-4] PASSED [  6%]
tests/hw4_extra/test_transformer.py::test_attention_activation[cpu-0.0-False-64-31-5-8] PASSED [ 12%]
tests/hw4_extra/test_transformer.py::test_attention_activation[cpu-0.0-True-64-31-5-4] PASSED [ 18%]
tests/hw4_extra/test_transformer.py::test_attention_activation[cpu-0.0-True-64-31-5-8] PASSED [ 25%]
tests/hw4_extra/test_transformer.py::test_attention_activation[cpu-0.1-False-64-31-5-4] PASSED [ 31%]
tests/hw4_extra/test_transformer.py::test_attention_activation[cpu-0.1-False-64-31-5-8] PASSED

In [5]:
!python3 -m mugrade submit "YOUR_KEY_HERE" -k "attention_activation"

submit
platform linux -- Python 3.9.19, pytest-8.3.1, pluggy-1.5.0
rootdir: /root/workspace/LightTorch/archive/hw4_extra
plugins: anyio-4.4.0
collecting ... Using needle backend
collected 4 items / 3 deselected / 1 selected

tests/hw4_extra/test_transformer.py F

_________________________ submit_attention_activation __________________________

self = <urllib3.connectionpool.HTTPSConnectionPool object at 0x7f7e3ebe9eb0>
conn = <urllib3.connection.HTTPSConnection object at 0x7f7e3ec0b6d0>
method = 'POST'
url = '/_/api/submission?user_key=YOUR_KEY_HERE&func_name=attention_activation'
body = None
headers = {'User-Agent': 'python-requests/2.32.3', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '0'}
retries = Retry(total=0, connect=None, read=False, redirect=None, status=None)
timeout = Timeout(connect=None, read=None, total=None), chunked = False
response_conn = <urlli

## Part 2: Implementing the Self-Attention Layer with Trainable Parameters

In this subproblem, you will use the `MultiHeadAttention` class you just implemented, and wrap it in a subclass of `Module` called `AttentionLayer` in `python/needle/nn/nn_transformer.py`. 

This layer implements self-attention with prenorm (when `k` and `v` are `None` in the `self.forward` call) and cross-attention (when `k` and `v` are provided in the `self.forward` call). We have provided skeleton code with the appropriate layer attributes defined. Your job is to write the forward pass of the `AttentionLayer`. Note that you are implementing multi-head attention, where the number of attention heads is given by the `self.num_head` attribute of the `AttentionLayer` class.

Given queries $Q \in R^{B \times T \times D'}$, keys $K \in R^{B \times T \times D'}$, and values $V \in R^{B \times T \times D'}$, where $B$ is the batch size, $T$ is the sequence length, and $D'$ is the embedding dimension, this layer performs the following computation sequentially:

(1) map queries, keys, and values to heads.

<p style="text-align: center;">$Q' = \text{LayerNorm}_q (Q) \; W_q$</p>

<p style="text-align: center;">$K' = \text{LayerNorm}_k (K) \; W_k$</p>

<p style="text-align: center;">$V' = \text{LayerNorm}_v (V) \; W_v$</p>

where $\text{LayerNorm}_q$, $\text{LayerNorm}_k$, and $\text{LayerNorm}_v$ are the prenorm layers `self.prenorm_q`, `self.prenorm_k`, and `self.prenorm_v` respectively.
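Step (1) can be sketched in NumPy for the query branch as follows. This is only an illustration under simplifying assumptions: the `layer_norm` helper omits the learnable scale and bias, and `w_q` is a random stand-in for $W_q$. In needle you may also need to flatten the $(B, T)$ axes into one before calling layers that expect 2-D inputs, then reshape back.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) axis; learnable scale and bias are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
B, T, D_in, H, D = 2, 5, 16, 4, 8           # D_in plays the role of D'
q = rng.standard_normal((B, T, D_in))
w_q = rng.standard_normal((D_in, H * D))    # stand-in for W_q

q_normed = layer_norm(q)
q_proj = q_normed @ w_q                     # shape (B, T, H * D)
print(q_proj.shape)  # (2, 5, 32)
```

The key and value branches follow the same pattern with their own prenorm and projection.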

(2) unravel heads from the channels axis.

<p style="text-align: center;">$Q' \in R^{B \times T \times (HD)} \to Q' \in R^{B \times H \times T \times D} $</p>

<p style="text-align: center;">$K' \in R^{B \times T \times (HD)} \to K' \in R^{B \times H \times T \times D} $</p>

<p style="text-align: center;">$V' \in R^{B \times T \times (HD)} \to V' \in R^{B \times H \times T \times D} $</p>

where $H$ and $D$ are `self.num_head` and `self.head_dim` respectively.
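A minimal NumPy sketch of step (2) is shown below; `split_heads` is an illustrative name, not part of the starter code. The channels axis is first reshaped into $(H, D)$ and the head axis is then moved in front of the time axis.

```python
import numpy as np

def split_heads(x, num_head):
    """(B, T, H * D) -> (B, H, T, D)."""
    b, t, hd = x.shape
    d = hd // num_head
    # Split the channels axis into (H, D), then move the head axis ahead of time.
    return x.reshape(b, t, num_head, d).transpose(0, 2, 1, 3)

B, T, H, D = 2, 5, 4, 8
x = np.arange(B * T * H * D, dtype=np.float64).reshape(B, T, H * D)
heads = split_heads(x, H)
print(heads.shape)  # (2, 4, 5, 8)
```

Under this layout, head $h$ holds channels $hD$ through $(h+1)D - 1$ of the projected input, for every batch element and timestep.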

(3) compute the multi-head attention activation.

<p style="text-align: center;">$X = \text{softmax}(\frac{Q' (K')^T}{\sqrt{D}}) V'$</p>

<p style="text-align: center;">$X \in R^{B \times H \times T \times D} \to X \in R^{B \times T \times H \times D} $</p>

<p style="text-align: center;">$X \in R^{B \times T \times H \times D} \to X \in R^{B \times T \times (HD)}$</p>

The last two steps transpose and then reshape $X$ so that the hidden states have the correct shape.
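The order of these two operations matters. The NumPy sketch below (with the illustrative helper `merge_heads`) contrasts transpose-then-reshape with a direct reshape, which silently mixes timesteps and heads.

```python
import numpy as np

def merge_heads(x):
    """(B, H, T, D) -> (B, T, H * D)."""
    b, h, t, d = x.shape
    # Move time ahead of heads first, then flatten (H, D) into one channels axis.
    return x.transpose(0, 2, 1, 3).reshape(b, t, h * d)

B, H, T, D = 1, 2, 3, 2
x = np.arange(B * H * T * D).reshape(B, H, T, D)

merged = merge_heads(x)
naive = x.reshape(B, T, H * D)   # reshape without the transpose: wrong layout

print(merged[0, 0])  # [0 1 6 7]  -> timestep 0 of head 0, then timestep 0 of head 1
print(naive[0, 0])   # [0 1 2 3]  -> timesteps 0 and 1 of head 0 mixed together
```

Only the transposed version keeps each output row tied to a single timestep.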

(4) project back to the input space of the layer with `self.out_projection`

<p style="text-align: center;">$X' = X \; W_o$</p>

Your goal in this part is to return $X'$ in the `self.forward` call of `AttentionLayer`. For debugging, you may capture the `probs` variable returned by the inner `MultiHeadAttention` module and store it in an attribute of the attention layer, such as `self.probs`.

Once finished, you may test your layer with the following test cases.

In [10]:
!python3 -m pytest -l -v -k "attention_layer" 

platform linux -- Python 3.9.19, pytest-8.3.1, pluggy-1.5.0 -- /root/miniconda3/envs/LightTorch/bin/python3
cachedir: .pytest_cache
rootdir: /root/workspace/LightTorch/archive/hw4_extra
plugins: anyio-4.4.0
collected 112 items / 80 deselected / 32 selected

tests/hw4_extra/test_transformer.py::test_attention_layer[cpu-0.0-False-32-8-27-5-4] PASSED [  3%]
tests/hw4_extra/test_transformer.py::test_attention_layer[cpu-0.0-False-32-8-27-5-8] PASSED [  6%]
tests/hw4_extra/test_transformer.py::test_attention_layer[cpu-0.0-False-32-8-27-11-4] PASSED [  9%]
tests/hw4_extra/test_transformer.py::test_attention_layer[cpu-0.0-False-32-8-27-11-8] PASSED [ 12%]
tests/hw4_extra/test_transformer.py::test_attention_layer[cpu-0.0-True-32-8-27-5-4] PASSED [ 15%]
tests/hw4_extra/test_transformer.py::test_attention_layer[cpu-0.0-True-32-8-27-5-8] PASSED [ 18%]
te

In [None]:
!python3 -m mugrade submit "YOUR_KEY_HERE" -k "attention_layer"

## Part 3: Implementing a Prenorm Residual Transformer Layer

You now have all the parts necessary to build a full Transformer. In this subproblem, you will combine the attention layer with a feedforward network to form a stackable residual block. We have provided starter code in the `TransformerLayer` class.

You will need to define the necessary class attributes in the `self.__init__` call of the module `TransformerLayer`, and fill in the forward pass in `self.forward`. Your transformer layer should support dropout applied to $X'$ from the previous step before adding a residual connection. Implement the following pseudocode of the layer, properly handling the intermediate tensor shapes:

Here $x$ denotes the current sequence of hidden states:

<p style="text-align: center;">$x = x + \text{Dropout}(\text{Attention}(x))$</p>
<p style="text-align: center;">$x = x + \text{Dropout}(\text{Linear}_{2}(\text{Dropout}(\text{ReLU}(\text{Linear}_{1}(\text{LayerNorm1d}(x))))))$</p>

For the MLP, there are two Linear layers $\text{Linear}_{1}$ and $\text{Linear}_{2}$:
- $\text{Linear}_{1}$: input shape `q_features`, output shape `hidden_size`
- $\text{Linear}_{2}$: input shape `hidden_size`, output shape `q_features`
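The pseudocode above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the needle solution: `toy_attention` is a hypothetical stand-in for the `AttentionLayer` (which applies its own prenorm), the layer norm omits learnable parameters, and the weights are random arrays rather than `Linear` modules.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis (learnable scale and bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def dropout(x, p, rng, training):
    # Inverted dropout; a no-op in evaluation mode or when p == 0.
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def transformer_layer(x, attention, w1, b1, w2, b2, p, rng, training):
    # Residual branch 1: attention sublayer (assumed to apply its own prenorm).
    x = x + dropout(attention(x), p, rng, training)
    # Residual branch 2: prenorm, Linear_1, ReLU, dropout, Linear_2, dropout.
    h = layer_norm(x)
    h = dropout(np.maximum(h @ w1 + b1, 0.0), p, rng, training)
    h = dropout(h @ w2 + b2, p, rng, training)
    return x + h

def toy_attention(h):
    # Hypothetical stand-in for the AttentionLayer; any shape-preserving map works here.
    return 0.1 * h

rng = np.random.default_rng(0)
B, T, q_features, hidden_size = 2, 5, 8, 16
x = rng.standard_normal((B, T, q_features))
w1 = rng.standard_normal((q_features, hidden_size)) * 0.1
b1 = np.zeros(hidden_size)
w2 = rng.standard_normal((hidden_size, q_features)) * 0.1
b2 = np.zeros(q_features)

y = transformer_layer(x, toy_attention, w1, b1, w2, b2, p=0.1, rng=rng, training=False)
print(y.shape)  # (2, 5, 8)
```

Because both branches are residual, the output has the same shape as the input, which is what makes the block stackable.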

Once finished, run the following test cases.

In [6]:
!python3 -m pytest -l -v -k "transformer_layer"

platform linux -- Python 3.9.19, pytest-8.3.1, pluggy-1.5.0 -- /root/miniconda3/envs/LightTorch/bin/python3
cachedir: .pytest_cache
rootdir: /root/workspace/LightTorch/archive/hw4_extra
plugins: anyio-4.4.0
collected 112 items / 80 deselected / 32 selected

tests/hw4_extra/test_transformer.py::test_transformer_layer[cpu-0.0-False-64-32-8-27-5-2] PASSED [  3%]
tests/hw4_extra/test_transformer.py::test_transformer_layer[cpu-0.0-False-64-32-8-27-5-4] PASSED [  6%]
tests/hw4_extra/test_transformer.py::test_transformer_layer[cpu-0.0-False-64-32-8-27-11-2] PASSED [  9%]
tests/hw4_extra/test_transformer.py::test_transformer_layer[cpu-0.0-False-64-32-8-27-11-4] PASSED [ 12%]
tests/hw4_extra/test_transformer.py::test_transformer_layer[cpu-0.0-True-64-32-8-27-5-2] PASSED [ 15%]
tests/hw4_extra/test_transformer.py::test_transformer_layer[cpu-0.0-True-64-32-8-27-5-4]

In [8]:
!python3 -m mugrade submit "YOUR_KEY_HERE" -k "transformer_layer"

submit
platform linux -- Python 3.9.19, pytest-8.3.1, pluggy-1.5.0
rootdir: /root/workspace/LightTorch/archive/hw4_extra
plugins: anyio-4.4.0
collecting ... Using needle backend
collected 4 items / 3 deselected / 1 selected

tests/hw4_extra/test_transformer.py ^C


## Part 4: Implementing the Transformer Model

In this subsection, you will compose the residual transformer layers you implemented in the previous part to build the full Transformer model. Fill in the code in the `Transformer` class by defining a set of `num_layers` `TransformerLayer` modules with the appropriate parameters passed in from the parent `Transformer` class. Then, implement the `self.forward` call of the `Transformer`.

As is, your current Transformer layers are permutation-invariant: they cannot tell which position each token occupies in the sequence. To break this symmetry, you will add a positional embedding to your Transformer.
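This symmetry is easy to check numerically. The sketch below uses a deliberately simplified setting (single head, no causal mask, no projections, $Q = K = V = x$) and shows that shuffling the input tokens simply shuffles the outputs in the same way, so nothing in the computation depends on position.

```python
import numpy as np

def self_attention(x):
    # Unmasked single-head self-attention with Q = K = V = x, for x of shape (T, D).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
perm = rng.permutation(6)

out = self_attention(x)
out_perm = self_attention(x[perm])
print(np.allclose(out_perm, out[perm]))  # True
```

Adding a position-dependent vector to each token embedding breaks this property, because the same token then produces a different input at different positions.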

The original Transformer paper uses sinusoidal positional embeddings, which are added to the input embeddings before the first `TransformerLayer`. These work well, but a more common strategy in modern Transformers is to learn the positional embeddings.

To do this, you should use `needle.nn.Embedding`. In your Transformer implementation, create a learnable positional encoding using `needle.nn.Embedding` from homework 4, with `num_embeddings` set as `sequence_len`. Given an input sequence, you should create a tensor that has the timestep id of each token in the sequence (timesteps have increasing value, representing the position of a token in time), and use it like a word id. 

Last, add the created positional encoding to the input token embeddings before your transformer layers.
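The lookup-and-add step can be sketched in NumPy as below. Here `pos_table` is a random stand-in for the weights of a learnable embedding with `num_embeddings = sequence_len`, and the batch-first layout is an assumption made for illustration; if your `Embedding` follows a `(seq_len, batch)` input layout, a transpose may be needed.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, D = 2, 5, 8
sequence_len = 20                                     # maximum supported length (num_embeddings)

pos_table = rng.standard_normal((sequence_len, D))    # stand-in for learnable embedding weights
tokens = rng.standard_normal((B, T, D))               # stand-in for input token embeddings

# Timestep ids 0, 1, ..., T-1, repeated for every sequence in the batch.
positions = np.broadcast_to(np.arange(T), (B, T))
pos_emb = pos_table[positions]                        # embedding lookup, shape (B, T, D)

x = tokens + pos_emb                                  # input to the first transformer layer
print(positions[0], x.shape)  # [0 1 2 3 4] (2, 5, 8)
```

Every sequence in the batch receives the same positional vectors, and only the first `T` rows of the table are used for a sequence of length `T`.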

Once complete, run the following test cases.

In [4]:
!python3 -m pytest -l -v -k "transformer_model"

platform linux -- Python 3.9.19, pytest-8.3.1, pluggy-1.5.0 -- /root/miniconda3/envs/LightTorch/bin/python3
cachedir: .pytest_cache
rootdir: /root/workspace/LightTorch/archive/hw4_extra
plugins: anyio-4.4.0
collected 112 items / 80 deselected / 32 selected

tests/hw4_extra/test_transformer.py::test_transformer_model[cpu-0.0-False-32-8-2-64-27-5-8] FAILED [  3%]
tests/hw4_extra/test_transformer.py::test_transformer_model[cpu-0.0-False-32-8-2-64-27-11-8] FAILED [  6%]
tests/hw4_extra/test_transformer.py::test_transformer_model[cpu-0.0-False-32-8-4-64-27-5-8] FAILED [  9%]
tests/hw4_extra/test_transformer.py::test_transformer_model[cpu-0.0-False-32-8-4-64-27-11-8] FAILED [ 12%]
tests/hw4_extra/test_transformer.py::test_transformer_model[cpu-0.0-True-32-8-2-64-27-5-8] FAILED [ 15%]
tests/hw4_extra/test_transformer.py::test_transformer_model[cpu-0.0-True-32-8-2-64-

In [None]:
!python3 -m mugrade submit "YOUR_KEY_HERE" -k "transformer_model"

Now, you can train a Transformer language model on the Penn Treebank dataset:

Note: make sure to initialize a Transformer model in the `LanguageModel` class of `apps/models.py`. Also, for Transformers, the final linear head `self.linear` should take `embedding_size` as its input dimension.

In [4]:
import needle as ndl
sys.path.append('./apps')
from models import LanguageModel
from simple_ml import train_ptb, evaluate_ptb

device = ndl.cuda()
corpus = ndl.data.Corpus("data/ptb")
train_data = ndl.data.batchify(corpus.train, batch_size=64, device=device, dtype="float32")
model = LanguageModel(20, len(corpus.dictionary), hidden_size=32, num_layers=12, seq_model='transformer', seq_len=20, device=device)
train_ptb(model, train_data, seq_len=20, n_epochs=10, device=device, lr=0.003, optimizer=ndl.optim.Adam)
evaluate_ptb(model, train_data, seq_len=20, device=device)

Using needle backend


100%|██████████| 727/727 [13:02<00:00,  1.08s/it]


avg_acc: 0.04345477862700544, avg_loss: 7.289129734039307


100%|██████████| 727/727 [12:54<00:00,  1.07s/it]


avg_acc: 0.05107200647249191, avg_loss: 6.596351623535156


100%|██████████| 727/727 [12:54<00:00,  1.06s/it]


avg_acc: 0.05193916546168147, avg_loss: 6.58458137512207


100%|██████████| 727/727 [12:52<00:00,  1.06s/it]


avg_acc: 0.051618553329201955, avg_loss: 6.578758716583252


100%|██████████| 727/727 [12:53<00:00,  1.06s/it]


avg_acc: 0.05231249569648144, avg_loss: 6.57459831237793


100%|██████████| 727/727 [12:55<00:00,  1.07s/it]


avg_acc: 0.05308067375886525, avg_loss: 6.567375659942627


100%|██████████| 727/727 [12:52<00:00,  1.06s/it]


avg_acc: 0.05300321042484335, avg_loss: 6.559388637542725


100%|██████████| 727/727 [14:41<00:00,  1.21s/it]  


avg_acc: 0.052470650003442816, avg_loss: 6.5526275634765625


 46%|████▌     | 336/727 [05:55<06:53,  1.06s/it]


KeyboardInterrupt: 