# From attention to transformers


This tutorial takes a close look at the attention mechanism. By the end, you should be able to write a self-attention layer and build your own transformer model from scratch.

In established libraries such as **torch**, the attention code is hard to read because of efficiency optimizations and the many conditional paths using **if** and **else**. Here, we will write **a more readable yet functionally equivalent model** and verify its outputs against the official implementation.



### General note for GPU training (in colab)

* First, please use the GPU runtime. Once it is active, `!nvidia-smi` will run without error.
  1. Click on "Runtime" in the top menu bar.
  2. Select "Change runtime type" from the drop-down menu.
  3. In the "Runtime type" section, select "GPU" as the hardware accelerator.
  4. Click "Save" to apply the changes.


* What should I do about the **CUDA out of memory** error? (This is THE most common error in DL.)
![](https://miro.medium.com/v2/resize:fit:828/format:webp/1*enMsxkgJ1eb9XvtWju5V8Q.png)
  1. In a Colab notebook, **unfortunately, you need to restart the kernel after an OOM error**. Otherwise it will keep happening no matter what.
  2. Change the model to save memory. This usually means decreasing the batch size, the number of layers, the max sequence length, or the hidden / embedding dimension.
  3. If you are familiar with mixed precision training, you can switch to low-precision `fp16` numbers for weights and inputs.

* What should I do about the **device-side assert triggered** error?
  > RuntimeError: CUDA error: device-side assert triggered
  > CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  > For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

  * Usually this happens because an embedding layer receives an index (token id or position id) that is outside its table.
  * It could be something else, which will be harder to debug...
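
Since an out-of-range embedding index is the usual cause, one way to get a readable error is to check the indices before the lookup. The helper name `check_embedding_indices` and the toy sizes below are illustrative assumptions, not part of the tutorial code. A minimal sketch:

```python
import torch
import torch.nn as nn


def check_embedding_indices(indices: torch.Tensor, embedding: nn.Embedding) -> None:
    """Raise a readable error if any index falls outside the embedding table.

    Hypothetical helper, for illustration only.
    """
    lowest, highest = int(indices.min()), int(indices.max())
    if lowest < 0 or highest >= embedding.num_embeddings:
        raise ValueError(
            f"index range [{lowest}, {highest}] does not fit an embedding "
            f"table with {embedding.num_embeddings} rows"
        )


emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # valid ids are 0..9
good_ids = torch.tensor([[0, 3, 9]])
bad_ids = torch.tensor([[0, 3, 10]])  # 10 is one past the last valid id

check_embedding_indices(good_ids, emb)  # passes silently
try:
    check_embedding_indices(bad_ids, emb)
    caught = False
except ValueError as err:
    caught = True
    print("caught:", err)
```

Running a check like this on the token ids and position ids right before the model call usually points to the offending input faster than the asynchronous CUDA stack trace does.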

In [None]:
# import locale
# locale.getpreferredencoding = lambda: "UTF-8" # to fix a potential locale bug
!nvidia-smi
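
The same check can be made from Python. The sketch below is an assumption about how one might write device-agnostic code; the rest of this notebook calls `.cuda()` directly and therefore still needs a GPU runtime.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(2, 3, device=device)  # tensors can be created directly on the chosen device
print(device, x.device)
```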

### Imports

In [None]:
!pip install torch
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math
import matplotlib.pyplot as plt

In [None]:
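seed = 42  # a fixed random seed makes the random number generators produce a deterministic sequence, so results are reproducible
np.random.seed(seed)
torch.manual_seed(seed)  # seed PyTorch's random number generator so PyTorch operations are deterministic
torch.cuda.manual_seed(seed)  # make random number generation on CUDA devices reproducible as well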
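
As a small illustration of what seeding does (a sketch, separate from the tutorial code): resetting the seed restarts the random sequence, while drawing again without reseeding continues it.

```python
import torch

torch.manual_seed(42)
first = torch.randn(3)
torch.manual_seed(42)  # resetting the seed restarts the random sequence
second = torch.randn(3)
third = torch.randn(3)  # drawn without reseeding, so it continues the sequence

print(torch.equal(first, second))  # same seed, same draw
print(torch.equal(second, third))  # later draw from the same stream
```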

## Self-Attention Mechanism: Single Head

![](https://raw.githubusercontent.com/Animadversio/TransformerFromScratch/main/media/AttentionSchematics_white-01.png)

In [None]:
embdim = 256  # embedding dimension: each token is represented by a 256-dimensional vector
headdim = 64  # dimension of each attention head; in multi-head attention the embedding is split across several such heads
tokens = torch.randn(1, 5, embdim)  # batch, tokens, embedding: 1 batch of 5 tokens, each a 256-dim vector drawn from a standard normal
Wq = torch.randn(embdim, headdim) / math.sqrt(embdim)  # query weights, shape (256, 64): input embedding dim x head dim; dividing by sqrt(embdim) scales the initialization, which helps training stability
Wk = torch.randn(embdim, headdim) / math.sqrt(embdim)

# Keeping the value projection at the full embedding dimension preserves information: the self-attention layer can re-mix and re-weight the tokens without losing any of it, which matters for information flow in deep networks
Wv = torch.randn(embdim, embdim) / math.sqrt(embdim)  # note: the value matrix keeps dimension embdim instead of shrinking to headdim

Fill in the score matrix computation

In [None]:
qis = torch.einsum("BSE,EH->BSH", tokens, Wq)  # batch x seqlen x headdim; tokens is (batch, seqlen, embdim) and Wq is (embdim, headdim), giving one query vector per token
kis = torch.einsum("BTE,EH->BTH", tokens, Wk) # batch x seqlen x headdim
vis = torch.einsum("BTE,EF->BTF", tokens, Wv) # batch x seqlen x embeddim
#### ------ Add your code here:compute query-key similarities. ------ ####
scoremat = torch.einsum('bsq,bkq->bsk', qis, kis) # output: batch x seqlen (Query) x seqlen (Key)
#### ------ End ------ ####
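attmat = F.softmax(scoremat / math.sqrt(headdim), dim=2)  # scale the scores by 1/sqrt(headdim) so they do not grow with the head dimension (a standard Transformer trick against vanishing or exploding gradients), then softmax over the key axis so every entry lies in [0, 1] and every row sums to 1; attmat[b, s, t] is the attention weight of query token s on key token t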
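
To see why the scores are divided by `sqrt(headdim)`, the sketch below compares the spread of raw and scaled dot products for random standard-normal vectors. It is an illustration under that assumption (unit-variance queries and keys), not part of the tutorial code.

```python
import math
import torch

torch.manual_seed(0)
num_pairs, headdim = 10000, 64
q = torch.randn(num_pairs, headdim)
k = torch.randn(num_pairs, headdim)

raw_scores = (q * k).sum(dim=1)  # one dot product per (q, k) pair
scaled_scores = raw_scores / math.sqrt(headdim)

print(f"std of raw scores:    {raw_scores.std().item():.2f}")     # close to sqrt(64) = 8
print(f"std of scaled scores: {scaled_scores.std().item():.2f}")  # close to 1
```

Without the scaling, the softmax inputs get larger as `headdim` grows, which pushes the softmax toward a one-hot output with very small gradients.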

Some checks to make sure each score corresponds to the product of the right query-key pair.

In [None]:
assert(torch.isclose(scoremat[0,1,2], qis[0,1,:]@kis[0,2,:]))  # torch.isclose checks whether two tensors are element-wise equal within a tolerance; if they are, the assert passes without raising
assert(torch.isclose(scoremat[0,3,4], qis[0,3,:]@kis[0,4,:]))
assert(torch.isclose(scoremat[0,2,2], qis[0,2,:]@kis[0,2,:]))

In [None]:
zis = torch.einsum("BST,BTF->BSF", attmat, vis)  # attention-weighted sum of the value vectors; attmat is (batch, seqlen, seqlen) and holds each position's weights over all positions, vis is (batch, seqlen, embdim); zis is (batch, seqlen, embdim), one new embedding per position
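
Putting the cells above together, the single-head computation can be summarized in one formula. The symbols are notation introduced here for illustration: $X$ is `tokens`, $Q = XW_q$, $K = XW_k$, $V = XW_v$, and $d_h$ is `headdim`.

$$
Z = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_h}}\right)V
$$

The softmax is taken along the key axis (row-wise). In the code, `scoremat` is $QK^\top$, `attmat` is the softmax term, and `zis` is $Z$.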

In PyTorch, these operations are packed into the function `F.scaled_dot_product_attention`. So let's test our implementation of single-head attention against it.

In [None]:
attn_torch = F.scaled_dot_product_attention(qis,kis,vis)
assert(torch.allclose(attn_torch, zis, atol=1E-6,rtol=1E-6))

## Multi-head attention

In [None]:
embdim = 768  # embedding dimension: the size of each token vector
headcnt = 12  # number of heads: how many independent attention mechanisms the embedding dimension is split into
headdim = embdim // headcnt
assert headdim * headcnt == embdim
tokens = torch.randn(1, 5, embdim) # batch, tokens, embedding
Wq = torch.randn(embdim, headcnt * headdim) / math.sqrt(embdim) # heads packed in a single dim
Wk = torch.randn(embdim, headcnt * headdim) / math.sqrt(embdim) # heads packed in a single dim
Wv = torch.randn(embdim, headcnt * headdim) / math.sqrt(embdim) # heads packed in a single dim

In [None]:
batch, token_num, _ = tokens.shape
qis = torch.einsum("BSE,EH->BSH", tokens, Wq)
kis = torch.einsum("BTE,EH->BTH", tokens, Wk)
vis = torch.einsum("BTE,EH->BTH", tokens, Wv)  # qis, kis, vis each have shape (batch size, sequence length, headcnt * headdim)
# split the single hidden dim into the heads
qis_mh = qis.view(batch, token_num, headcnt, headdim)
kis_mh = kis.view(batch, token_num, headcnt, headdim)
vis_mh = vis.view(batch, token_num, headcnt, headdim)  # reshaping unpacks the heads that were stored side by side; the shape becomes (batch size, sequence length, headcnt, headdim), so later steps can operate on each head separately

Now your challenge is to compute multihead attention using `einsum`

In [None]:
#### ------ Add your code here: compute query-key similarities. ------ ####
scoremat_mh = torch.einsum('bshd,bthd->bhst', qis_mh, kis_mh)  # Output: batch x headcnt x seqlen (query) x seqlen (key)
#### ------ End ------ ####
attmat_mh = F.softmax(scoremat_mh / math.sqrt(headdim), dim=-1)
zis_mh = torch.einsum("BCST,BTCH->BSCH", attmat_mh, vis_mh)  # batch x seqlen (query) x headcnt x headdim
zis = zis_mh.reshape(batch, token_num, headcnt * headdim)

Let's validate that the tensor multiplication is correct.

In [None]:
# raw attention scores of the attention head with index 1 (the second head)
assert (torch.allclose(scoremat_mh[0, 1], qis_mh[0,:,1] @ kis_mh[0,:,1,:].T))
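
Another way to read the `view` trick: head `h` only ever uses its own block of `headdim` columns of `Wq` and `Wk`. The sketch below checks this on a small example; the sizes and the chosen head index are arbitrary assumptions for illustration.

```python
import math
import torch

torch.manual_seed(0)
embdim, headcnt, seqlen = 16, 4, 5
headdim = embdim // headcnt
tokens = torch.randn(1, seqlen, embdim)
Wq = torch.randn(embdim, embdim) / math.sqrt(embdim)
Wk = torch.randn(embdim, embdim) / math.sqrt(embdim)

# Packed route: project once, then split the last dim into (headcnt, headdim).
q_mh = (tokens @ Wq).view(1, seqlen, headcnt, headdim)
k_mh = (tokens @ Wk).view(1, seqlen, headcnt, headdim)
scores_packed = torch.einsum("bshd,bthd->bhst", q_mh, k_mh)

# Per-head route: head h only sees its own block of columns.
h = 2
cols = slice(h * headdim, (h + 1) * headdim)
q_h = tokens[0] @ Wq[:, cols]  # seqlen x headdim
k_h = tokens[0] @ Wk[:, cols]  # seqlen x headdim
scores_single = q_h @ k_h.T    # seqlen x seqlen

print(torch.allclose(scores_packed[0, h], scores_single, atol=1e-5))
```

So multi-head attention is `headcnt` independent single-head attentions whose projection matrices are stored side by side in one weight.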

In [None]:
print(tokens.shape)
print(qis_mh.shape)
print(kis_mh.shape)
print(vis_mh.shape)
print(attmat_mh.shape)
print(zis_mh.shape)
print(zis.shape)

In `torch` this operation is packed into `nn.MultiheadAttention`, which includes the input projection, the attention and the output projection. So note that the inputs to the `mha.forward` function are the *token_embeddings*, not the Q, K, V that we passed to `F.scaled_dot_product_attention`.

In [None]:
mha = nn.MultiheadAttention(embdim, headcnt, batch_first=True,)  # embdim is the embedding dimension (768), headcnt the number of heads (12); batch_first=True means the first input dimension is the batch
print(mha.in_proj_weight.shape)  # 3 * embdim x embdim: the weights that project the input to queries, keys and values, stacked one after another
mha.in_proj_weight.data = torch.cat([Wq, Wk, Wv], dim=1).T  # concatenate our Wq, Wk, Wv column-wise, then transpose to match the shape of in_proj_weight

In [None]:
attn_out, attn_weights = mha(tokens, tokens, tokens, average_attn_weights=False,)  # pass tokens as query, key and value; with average_attn_weights=False, attn_weights holds the weights of every head rather than their average
assert torch.allclose(attmat_mh, attn_weights, atol=1e-6, rtol=1e-6)  # torch.allclose checks that all elements agree within the absolute (atol) and relative (rtol) tolerances

In `nn.MultiheadAttention` there is an output projection `out_proj`, applied to the attention output. It is a linear layer with bias. We can validate that passing our output `zis` through this projection gives the same result as the output of `mha`.

In [None]:
print(mha.out_proj)
assert torch.allclose(attn_out, mha.out_proj(zis), atol=1e-6, rtol=1e-6)
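
The whole pipeline can also be checked in the opposite direction: instead of copying our weights into `mha`, read the weights out of a freshly initialized `nn.MultiheadAttention` and redo its forward pass by hand. This is a sketch with small, arbitrary sizes; it relies on the module attributes `in_proj_weight`, `in_proj_bias` and `out_proj` and on the default `dropout=0.0`.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seqlen, embdim, headcnt = 2, 6, 32, 4
headdim = embdim // headcnt
mha = nn.MultiheadAttention(embdim, headcnt, batch_first=True)
x = torch.randn(batch, seqlen, embdim)

with torch.no_grad():
    # in_proj_weight / in_proj_bias stack the query, key and value projections along dim 0
    w_q, w_k, w_v = mha.in_proj_weight.chunk(3, dim=0)
    b_q, b_k, b_v = mha.in_proj_bias.chunk(3, dim=0)
    q = F.linear(x, w_q, b_q).view(batch, seqlen, headcnt, headdim)
    k = F.linear(x, w_k, b_k).view(batch, seqlen, headcnt, headdim)
    v = F.linear(x, w_v, b_v).view(batch, seqlen, headcnt, headdim)

    scores = torch.einsum("bshd,bthd->bhst", q, k) / math.sqrt(headdim)
    weights = scores.softmax(dim=-1)                 # batch x headcnt x seqlen x seqlen
    z = torch.einsum("bhst,bthd->bshd", weights, v)  # batch x seqlen x headcnt x headdim
    manual_out = mha.out_proj(z.reshape(batch, seqlen, embdim))

    ref_out, ref_weights = mha(x, x, x, average_attn_weights=False)

print(torch.allclose(manual_out, ref_out, atol=1e-5))
print(torch.allclose(weights, ref_weights, atol=1e-5))
```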

### Causal attention mask

For models such as GPT, each token can only attend to tokens before it, thus the attention score needs to be modified before entering softmax.

The common way of masking is to add a large negative number to the locations that you'd not want the model to attend to.

In [None]:
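attn_mask = torch.ones(token_num,token_num,)  # all-ones matrix of shape token_num x token_num, where token_num is the sequence length
attn_mask = -1E4 * torch.triu(attn_mask,1)  # large negative values strictly above the diagonal: added to the scores before softmax, they push the weights on future tokens to nearly zero, which gives causal masking
attn_mask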

In [None]:
scoremat_mh_msk = torch.einsum("BSCH,BTCH->BCST", qis_mh, kis_mh)  # batch x headcnt x seqlen (query) x seqlen (key)
scoremat_mh_msk += attn_mask  # add the attn mask to the scores before SoftMax normalization
attmat_mh_msk = F.softmax(scoremat_mh_msk / math.sqrt(headdim), dim=-1)
zis_mh_msk = torch.einsum("BCST,BTCH->BSCH", attmat_mh_msk, vis_mh)  # batch x seqlen (query) x headcnt x headdim
zis_msk = zis_mh_msk.reshape(batch, token_num, headcnt * headdim)
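
The effect of the additive mask is easiest to see on a toy score matrix where all scores are equal. The sketch below uses the same `-1E4` mask construction as above; the `-inf` variant with `masked_fill` is shown as a common alternative and is an assumption, not something this tutorial uses.

```python
import torch
import torch.nn.functional as F

seqlen = 4
scores = torch.zeros(seqlen, seqlen)  # equal scores, so any asymmetry comes from the mask
causal_mask = -1e4 * torch.triu(torch.ones(seqlen, seqlen), 1)  # same additive mask as above
weights = F.softmax(scores + causal_mask, dim=-1)
print(weights)
# Row i spreads its weight evenly over keys 0..i and gives zero weight to later keys.

# Alternative (an assumption, not used in this tutorial): mask with -inf via masked_fill.
future = torch.triu(torch.ones(seqlen, seqlen, dtype=torch.bool), 1)
weights_inf = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
print(torch.allclose(weights, weights_inf))
```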

**Note** The `is_causal` parameter should create a causal mask automatically, but because of a recent PyTorch bug it doesn't work. So beware~
https://github.com/pytorch/pytorch/issues/99282

In [None]:
attn_out_causal, attn_weights_causal = mha(tokens, tokens, tokens, average_attn_weights=False, attn_mask=attn_mask)  # pass the same inputs plus the attention mask to the nn.MultiheadAttention instance mha to get the output attn_out_causal and the attention weights attn_weights_causal

In [None]:
assert torch.allclose(attn_weights_causal, attmat_mh_msk, atol=1e-6, rtol=1e-6)
assert torch.allclose(attn_out_causal, mha.out_proj(zis_msk), atol=1e-6, rtol=1e-6)
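
The functional API `F.scaled_dot_product_attention` has its own `is_causal` flag. Whichever way the issue linked above affects your PyTorch version, comparing the flag against an explicit mask is a cheap sanity check. A sketch with arbitrary small shapes, assuming PyTorch 2.0 or later:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, headcnt, seqlen, headdim = 1, 2, 5, 8
q = torch.randn(batch, headcnt, seqlen, headdim)
k = torch.randn(batch, headcnt, seqlen, headdim)
v = torch.randn(batch, headcnt, seqlen, headdim)

# Boolean mask: True marks the (query, key) pairs that are allowed to attend.
allowed = torch.tril(torch.ones(seqlen, seqlen, dtype=torch.bool))
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=allowed)
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(out_masked, out_causal, atol=1e-6))
```

Note that the boolean convention here (True means "may attend") differs from the additive float mask used in the cells above, where masked positions carry a large negative number.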

In [None]:
plt.figure()
for head in range(headcnt):
    plt.subplot(3, 4, head + 1)
    plt.imshow(attn_weights_causal[0, head].detach().numpy())  # show the attention weight matrix of each head, which helps to see how the model distributes its attention, especially in causal (sequence generation) settings
    plt.title(f"head {head}")
    plt.axis("off")
plt.show()

## Transformer Block

Having gained some intuition about the attention layer, let's build it into a transformer. A vanilla transformer block usually looks like this. Note that there are slight differences between the transformer blocks in GPT2, BERT and other models, but they generally have the following components:

* Transformer Block
  * Layernorm
  * Skip connections
  * Multi-head attention
  * MLP, Feedforward net


In [None]:
class TransformerBlock_simple(nn.Module):

    def __init__(self, embdim, headcnt, *args, dropout=0.0, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.ln1 = nn.LayerNorm(embdim)  # layer normalization of the previous layer's output, which helps stabilize and speed up training
        self.ln2 = nn.LayerNorm(embdim)
        self.attn = nn.MultiheadAttention(embdim, headcnt, batch_first=True,)  # multi-head self-attention with batch-first inputs, which lets the model relate different positions of the input sequence
        self.ffn = nn.Sequential(
            nn.Linear(embdim, 4 * embdim),  # feed-forward network: two linear layers with a GELU in between; the hidden width is 4 * embdim
            nn.GELU(),
            nn.Linear(4 * embdim, embdim),
            nn.Dropout(dropout),
        )

    def forward(self, x, is_causal=True):
        batch, token_num, hidden_dim = x.shape
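        if is_causal:  # causal masking controls the information flow: position i can only be influenced by positions up to and including i, as is common in generation tasks
            # build the causal attention mask on the same device as the input; if is_causal is False the mask is None
            attn_mask = torch.ones(token_num, token_num, device=x.device)
            attn_mask = -1E4 * torch.triu(attn_mask,1)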
        else:
            attn_mask = None

        residue = x
        x = self.ln1(x)  # apply layer normalization (self.ln1) to the input x
        #### ------ Add your code here: multihead attention ------ ####
        attn_output, attn_weights = self.attn(x, x, x, attn_mask=attn_mask)  # first output is the output latent states
        #### ------ End ------ ####
        x = residue + attn_output  # add the attention output of the normalized input back to the original input (residual connection)

        residue = x
        x = self.ln2(x)
        ffn_output = self.ffn(x)
        output = residue + ffn_output
        return output
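
In equations, the `forward` method above is the pre-LayerNorm arrangement, where normalization is applied before each sub-layer and the residual is added afterwards. The notation is introduced here for illustration: $\mathrm{LN}_1$ and $\mathrm{LN}_2$ are `ln1` and `ln2`, $\mathrm{MHA}$ is `attn` (with the optional causal mask), and $\mathrm{FFN}$ is `ffn`.

$$
\begin{aligned}
x' &= x + \mathrm{MHA}\big(\mathrm{LN}_1(x)\big) \\
y &= x' + \mathrm{FFN}\big(\mathrm{LN}_2(x')\big)
\end{aligned}
$$

Both steps keep the shape (batch, tokens, `embdim`), so blocks of this kind can be stacked directly.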

Compare the implementation with the schematic and see if it makes more sense!


*Attention Block*


![BERT (Transformer encoder)](https://iq.opengenus.org/content/images/2020/06/encoder-1.png)


# Image Classification

Now we use the Transformer architecture for image classification.

### Imports

In [None]:
!pip install transformers
!pip install torchvision

## Import transformers
from transformers import get_linear_schedule_with_warmup
from transformers import BertForSequenceClassification
from transformers import BertModel, BertTokenizer, BertConfig

import os
from os.path import join
from tqdm.notebook import tqdm, trange
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW, Adam
from torch.utils.data import Dataset, DataLoader
from torchvision.utils import make_grid, save_image
import matplotlib.pyplot as plt
from torchvision.datasets import MNIST, CIFAR10
from torchvision import datasets, transforms


### Preparing Image Dataset
Load the dataset. Note that the augmentations are necessary: without augmentation, the Transformer will overfit very quickly.

In [None]:
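!mkdir -p data  # create a data subdirectory in the current working directory to store the dataset (-p avoids an error if it already exists)

# Load the CIFAR-10 training set with the CIFAR10 class from torchvision.datasets. The transform argument lists the preprocessing and augmentation steps: random horizontal flip, random crop and normalization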
dataset = CIFAR10(root='./data/', train=True, download=True, transform=
transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
]))
# augmentations are super important for CNN training, or the model will overfit very fast without achieving good generalization accuracy
# Load the CIFAR-10 test set as the validation set; it is converted to tensors and normalized, but no augmentation is applied
val_dataset = CIFAR10(root='./data/', train=False, download=True, transform=transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),]))

Citing https://openreview.net/pdf?id=SCN8UaetXx,

> "Visual Transformers. Despite some previous work in which attention is used inside the convolutional layers of a CNN [57, 26], the first fully-transformer architectures for vision are iGPT [8] and ViT [17]. The former is trained using a "masked-pixel" self-supervised approach, similar in spirit to the common masked-word task used, for instance, in BERT [15] and in GPT [45] (see below). On the other hand, ViT is trained in a supervised way, using a special "class token" and a classification head attached to the final embedding of this token. Both methods are computationally expensive and, despite their very good results when trained on huge datasets, they underperform ResNet architectures when trained from scratch using only ImageNet-1K [17, 8]. VideoBERT [51] is conceptually similar to iGPT, but, rather than using pixels as tokens, each frame of a video is holistically represented by a feature vector, which is quantized using an off-the-shelf pretrained video classification model. DeiT [53] trains ViT using distillation information provided by a pretrained CNN."

### Transformer model for images

In [None]:
config = BertConfig(hidden_size=256, intermediate_size=1024, num_hidden_layers=12,
                    num_attention_heads=8, max_position_embeddings=256,
                    vocab_size=100, bos_token_id=101, eos_token_id=102,
                    cls_token_id=103, )  # BertConfig holds the configuration of the BERT model; here it sets key hyperparameters such as the hidden size, the intermediate size and the number of attention heads
model = BertModel(config).cuda()  # create the BERT model instance from the config
patch_embed = nn.Conv2d(3, config.hidden_size, kernel_size=4, stride=4).cuda()  # convolution layer that turns the input image into embeddings of small patches, the input format the BERT model expects
CLS_token = nn.Parameter(torch.randn(1, 1, config.hidden_size, device="cuda") / math.sqrt(config.hidden_size))
readout = nn.Sequential(nn.Linear(config.hidden_size, config.hidden_size),  # readout head that turns the BERT output into the classification result: a linear layer, a GELU and another linear layer with 10 outputs, one per CIFAR-10 class
                        nn.GELU(),
                        nn.Linear(config.hidden_size, 10)
                        ).cuda()
for module in [patch_embed, readout, model, CLS_token]:
    module.cuda()

optimizer = AdamW([*model.parameters(),
                   *patch_embed.parameters(),
                   *readout.parameters(),
                   CLS_token], lr=5e-4)  # AdamW optimizes all parameters of the model, the patch embedding and the readout head, plus the CLS token; the learning rate is 5e-4
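
To make the patch embedding concrete, the sketch below runs the same kind of `Conv2d` on CPU with a small hidden size (an assumption for speed; the cell above uses 256) and checks two things: the token count, and that a convolution with `kernel_size == stride` is the same as cutting the image into non-overlapping patches and applying one linear map to each.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden, patch = 16, 4  # small hidden size for illustration
patch_embed = nn.Conv2d(3, hidden, kernel_size=patch, stride=patch)
imgs = torch.randn(2, 3, 32, 32)  # CIFAR-10-sized inputs

with torch.no_grad():
    patch_embs = patch_embed(imgs).flatten(2).permute(0, 2, 1)  # (2, 64, hidden): 8 x 8 patches
    # Same result via explicit patches: cut out 4x4 blocks, then apply one linear map.
    patches = F.unfold(imgs, kernel_size=patch, stride=patch).transpose(1, 2)  # (2, 64, 3*4*4)
    linear_embs = patches @ patch_embed.weight.view(hidden, -1).T + patch_embed.bias

cls = torch.zeros(1, 1, hidden)  # stand-in for the learnable CLS_token
input_embs = torch.cat([cls.expand(imgs.shape[0], 1, -1), patch_embs], dim=1)

print(patch_embs.shape, input_embs.shape)
print(torch.allclose(patch_embs, linear_embs, atol=1e-5))
```

With 32 x 32 images and 4 x 4 patches, the sequence has 64 patch tokens plus 1 CLS token, which fits within `max_position_embeddings=256` in the config above.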

In [None]:
batch_size = 192 # 96
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
model.train()
loss_list = []
acc_list = []
correct_cnt = 0
total_loss = 0
for epoch in trange(10, leave=False):
    model.train()  # switch back to training mode each epoch; the validation pass below sets eval mode
    pbar = tqdm(train_loader, leave=False)
    for i, (imgs, labels) in enumerate(pbar):
        patch_embs = patch_embed(imgs.cuda())
        #### ------ Add your code here: replace the None with the correct order of the embedding dimension. ------ ####
        patch_embs = patch_embs.flatten(2).permute(0, 2, 1)  # hint: (batch_size, HW, hidden); flatten and permute rearrange the patch embeddings into the layout the model expects
        #### ------ End ------ ####
        # print(patch_embs.shape)
        input_embs = torch.cat([CLS_token.expand(imgs.shape[0], 1, -1), patch_embs], dim=1)  # prepend the CLS token to the patch embeddings to form the full input embeddings
        # print(input_embs.shape)
        output = model(inputs_embeds=input_embs)
        logit = readout(output.last_hidden_state[:, 0, :])
        loss = F.cross_entropy(logit, labels.cuda())
        # print(loss)
        loss.backward()  # backpropagate, then take an optimizer step
        optimizer.step()
        optimizer.zero_grad()
        pbar.set_description(f"loss: {loss.item():.4f}")
        total_loss += loss.item() * imgs.shape[0]
        correct_cnt += (logit.argmax(dim=1) == labels.cuda()).sum().item()

    loss_list.append(round(total_loss / len(dataset), 4))
    acc_list.append(round(correct_cnt / len(dataset), 4))
    # test on validation set
    model.eval()  # put the model in evaluation mode
    correct_cnt = 0
    total_loss = 0

    with torch.no_grad():  # gradients are not needed for validation, which saves memory
        for i, (imgs, labels) in enumerate(val_loader):
            patch_embs = patch_embed(imgs.cuda())
            #### ------ Add your code here: replace the None with the correct order of the embedding dimension. ------ ####
            patch_embs = patch_embs.flatten(2).permute(0, 2, 1)  # hint: (batch_size, HW, hidden)
            #### ------ End ------ ####
            input_embs = torch.cat([CLS_token.expand(imgs.shape[0], 1, -1), patch_embs], dim=1)
            output = model(inputs_embeds=input_embs)
            logit = readout(output.last_hidden_state[:, 0, :])
            loss = F.cross_entropy(logit, labels.cuda())
            total_loss += loss.item() * imgs.shape[0]
            correct_cnt += (logit.argmax(dim=1) == labels.cuda()).sum().item()  # accumulate loss and accuracy only; no backward pass or optimizer step

    print(f"val loss: {total_loss / len(val_dataset):.4f}, val acc: {correct_cnt / len(val_dataset):.4f}")

    correct_cnt = 0
    total_loss = 0
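
The loop above multiplies each batch loss by `imgs.shape[0]` before summing. The reason is that `F.cross_entropy` returns a mean over the batch and the last batch is usually smaller, so averaging the batch means directly would over-weight it. A small sketch with made-up logits and labels:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_samples, num_classes, batch_size = 10, 3, 4  # batches of 4, 4 and 2 samples
logits = torch.randn(num_samples, num_classes)
labels = torch.randint(0, num_classes, (num_samples,))

total_loss, batch_means = 0.0, []
for start in range(0, num_samples, batch_size):
    batch_logits = logits[start:start + batch_size]
    batch_labels = labels[start:start + batch_size]
    loss = F.cross_entropy(batch_logits, batch_labels)  # mean over this batch
    total_loss += loss.item() * batch_logits.shape[0]   # undo the mean, as in the loop above
    batch_means.append(loss.item())

weighted = total_loss / num_samples               # what the training loop reports
naive = sum(batch_means) / len(batch_means)       # plain average of batch means
exact = F.cross_entropy(logits, labels).item()    # mean over the whole dataset at once
print(f"weighted: {weighted:.4f}  naive: {naive:.4f}  exact: {exact:.4f}")
```

The weighted value matches the loss computed over all samples at once, up to floating-point error.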

In [None]:
#### ------ Add your code here: plot the training loss curve to show its variation with the epoch. ------ ####
# hints: use the data in list 'loss_list' and 'acc_list' to plot the curve via plt.plot()
plt.plot(loss_list, label='Training Loss', marker='o')
plt.title('Training Loss Curve')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.show()
#### ------ End ------ ####

In [None]:
#### ------ Add your code here: plot the accuracy score curve to show its variation with the epoch. ------ ####
# hints: use the data in list 'loss_list' and 'acc_list' to plot the curve via plt.plot()
plt.plot(acc_list, label='Training Accuracy', marker='o')
plt.title('Accuracy Score Curve')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()
#### ------ End ------ ####

In [None]:
torch.save(model.state_dict(),"bert.pth")
!du -sh bert.pth

**Reference:**
Tutorial for Harvard Medical School ML from Scratch Series: Transformer from Scratch (https://github.com/Animadversio/TransformerFromScratch?tab=readme-ov-file).