# GPT-2 Masked Multi-head Self-attention Explained

## Environment Setup

1. Set up a Python 3.9 environment

In [1]:
%%capture captured_output
!/home/ma-user/anaconda3/bin/conda create -n python-3.9.0 python=3.9.0 -y --override-channels --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
!/home/ma-user/anaconda3/envs/python-3.9.0/bin/pip install ipykernel

In [2]:
import json
import os

data = {
   "display_name": "python-3.9.0",
   "env": {
      "PATH": "/home/ma-user/anaconda3/envs/python-3.9.0/bin:/home/ma-user/anaconda3/envs/python-3.7.10/bin:/modelarts/authoring/notebook-conda/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/ma-user/modelarts/ma-cli/bin:/home/ma-user/modelarts/ma-cli/bin"
   },
   "language": "python",
   "argv": [
      "/home/ma-user/anaconda3/envs/python-3.9.0/bin/python",
      "-m",
      "ipykernel",
      "-f",
      "{connection_file}"
   ]
}

# create the kernel directory (and any missing parents) if it does not exist yet
os.makedirs("/home/ma-user/anaconda3/share/jupyter/kernels/python-3.9.0/", exist_ok=True)

with open('/home/ma-user/anaconda3/share/jupyter/kernels/python-3.9.0/kernel.json', 'w') as f:
    json.dump(data, f, indent=4)

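***Note: after the code above finishes running, use the kernel selector in the upper-left or upper-right corner to switch the kernel to python-3.9.0.***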

2. Install MindSpore 2.2.14, MindNLP, and related dependencies. For the official MindNLP repository, see: MindNLP

In [6]:

!pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.2.14/MindSpore/unified/x86_64/mindspore-2.2.14-cp39-cp39-linux_x86_64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install tokenizers==0.15.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
!wget https://repo.mindspore.cn/mindspore-lab/mindnlp/daily/202405/20240527/master_20240527120020_a8299282e6686ff94519b7a6acbb61dcf7942116_newest/any/mindnlp-0.3.1+20240527-py3-none-any.whl
!pip install mindnlp-0.3.1+20240527-py3-none-any.whl


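***Note: after the installation commands above complete, click the restart kernel icon at the top to restart the kernel before continuing with the experiment.***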

In [2]:
# code from mindnlp and huggingface transformers

In [7]:
import numpy as np
import mindspore
from mindspore import nn, ops, Tensor

## GPT-2 Self-attention: 1- Creating queries, keys, and values

![gpt2-self-attention-3.png](https://jalammar.github.io/images/gpt2/gpt2-self-attention-3.png)

In [8]:
batch_size = 1
seq_len = 10
embed_dim = 768

# input x: (1, 10, 768)
x = Tensor(np.random.randn(batch_size, seq_len, embed_dim), mindspore.float32)

In [9]:
from mindnlp._legacy.functional import split
from mindnlp.transformers.ms_utils import Conv1D

# query = Wq * X, key = Wk * X, value = Wv * X
# c_attn: (1, 10, 768*3) --> query, key, value: (1, 10, 768), (1, 10, 768), (1, 10, 768) 
c_attn = Conv1D(3 * embed_dim, embed_dim)
query, key, value = split(c_attn(x), embed_dim, axis=2)
query.shape, key.shape, value.shape


((1, 10, 768), (1, 10, 768), (1, 10, 768))
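
The projection above can be mirrored in plain NumPy. This is a minimal sketch that assumes `Conv1D` behaves like the Hugging Face GPT-2 `Conv1D`, i.e. an affine map `x @ W + b` with `W` of shape `(in_features, out_features)`. The names `w_attn` and `b_attn` and the random weights are illustrative and are not taken from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embed_dim = 1, 10, 768

x = rng.standard_normal((batch_size, seq_len, embed_dim)).astype(np.float32)

# Hypothetical weights standing in for the parameters of c_attn
w_attn = (0.02 * rng.standard_normal((embed_dim, 3 * embed_dim))).astype(np.float32)
b_attn = np.zeros(3 * embed_dim, dtype=np.float32)

# One matrix multiply produces queries, keys, and values side by side
qkv = x @ w_attn + b_attn                      # (1, 10, 2304)

# Cut the last axis into three equal chunks of width embed_dim
query, key, value = np.split(qkv, 3, axis=-1)  # each (1, 10, 768)
print(query.shape, key.shape, value.shape)
```

Fusing the three projections into one weight matrix is only a layout choice: each chunk equals the result of multiplying `x` by the corresponding block of columns of `w_attn`.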

![gpt2-self-attention-split-attention-heads-1.png](https://jalammar.github.io/images/gpt2/gpt2-self-attention-split-attention-heads-1.png)

![gpt2-self-attention-split-attention-heads-2.png](https://jalammar.github.io/images/gpt2/gpt2-self-attention-split-attention-heads-2.png)

In [10]:
def split_heads(tensor, num_heads, attn_head_size):
    """
    Splits hidden_size dim into attn_head_size and num_heads
    """
    # (batch_size, seq_len, hidden_size) --> (batch_size, seq_len, num_heads, attn_head_size)
    new_shape = tensor.shape[:-1] + (num_heads, attn_head_size)
    tensor = tensor.view(new_shape)
    # (batch_size, seq_len, num_heads, attn_head_size) --> (batch_size, num_heads, seq_len, attn_head_size)
    return ops.transpose(tensor, (0, 2, 1, 3))  

In [11]:
num_heads = 12
head_dim = embed_dim // num_heads

# (1, 10, 768) --> (1, 10, 12, 64) --> (1, 12, 10, 64)
query = split_heads(query, num_heads, head_dim)
key = split_heads(key, num_heads, head_dim)
value = split_heads(value, num_heads, head_dim)

query.shape, key.shape, value.shape

((1, 12, 10, 64), (1, 12, 10, 64), (1, 12, 10, 64))
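
To see what the reshape and transpose do to the data, the following NumPy sketch (the helper name `split_heads_np` is hypothetical) checks that head `h` holds exactly columns `h * head_dim` to `(h + 1) * head_dim` of the original hidden dimension.

```python
import numpy as np

def split_heads_np(t, num_heads):
    """(batch, seq_len, hidden) -> (batch, num_heads, seq_len, head_dim)."""
    batch, seq_len, hidden = t.shape
    head_dim = hidden // num_heads
    t = t.reshape(batch, seq_len, num_heads, head_dim)
    return t.transpose(0, 2, 1, 3)

t = np.arange(1 * 10 * 768, dtype=np.float32).reshape(1, 10, 768)
heads = split_heads_np(t, num_heads=12)
print(heads.shape)

# Head 3 is a contiguous slice of 64 columns of the hidden dimension
head_3_matches = np.array_equal(heads[0, 3], t[0, :, 3 * 64:4 * 64])
print(head_3_matches)
```

No values are mixed across heads; the split is purely a change of view on the same numbers.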

## GPT-2 Self-attention: 2- Scoring

![gpt2-self-attention-scoring.png](https://jalammar.github.io/images/gpt2/gpt2-self-attention-scoring.png)

![](https://jalammar.github.io/images/gpt2/gpt2-self-attention-scoring-2.png)

In [12]:
# dot product of q and k
# q: (1, 12, 10, 64), k^T: (1, 12, 64, 10)
# attn_weights: (1, 12, 10, 10)
attn_weights = ops.matmul(query, key.swapaxes(-1, -2))

attn_weights.shape

(1, 12, 10, 10)

![](https://jalammar.github.io/images/gpt2/transformer-decoder-attention-mask-dataset.png)

In [13]:
# lower-triangular matrix used to implement masked multi-head attention
# it ensures that a position cannot attend to future positions
max_positions = seq_len

bias = Tensor(np.tril(np.ones((max_positions, max_positions))).reshape(
              (1, 1, max_positions, max_positions)), mindspore.bool_)
bias

Tensor(shape=[1, 1, 10, 10], dtype=Bool, value=
[[[[ True, False, False ... False, False, False],
   [ True,  True, False ... False, False, False],
   [ True,  True,  True ... False, False, False],
   ...
   [ True,  True,  True ...  True, False, False],
   [ True,  True,  True ...  True,  True, False],
   [ True,  True,  True ...  True,  True,  True]]]])
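
A smaller version of the same mask is easier to read in full. The sketch below uses NumPy and a sequence length of 4 (an arbitrary choice for illustration): row `i` marks the key positions that query position `i` may attend to.

```python
import numpy as np

seq_len = 4
# Lower-triangular boolean mask: True on and below the main diagonal
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))

# Query position i can see i + 1 key positions (itself and everything before it)
visible_per_row = mask.sum(axis=1)
print(visible_per_row)
```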

![](https://jalammar.github.io/images/gpt2/queries-keys-attention-mask.png)

![](https://jalammar.github.io/images/gpt2/transformer-attention-mask.png)

In [14]:
from mindnlp._legacy.functional import where, softmax

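# scale the scores by sqrt(head_dim)
attn_weights = attn_weights / ops.sqrt(ops.scalar_to_tensor(value.shape[-1]))
# pick the mask rows for the current query positions and the columns for all key positions
query_length, key_length = query.shape[-2], key.shape[-2]
causal_mask = bias[:, :, key_length - query_length: key_length, :key_length].bool()
# keep the scores where the mask is True; fill masked (future) positions with the most negative float32
mask_value = Tensor(np.finfo(np.float32).min, dtype=attn_weights.dtype)
attn_weights = where(causal_mask, attn_weights, mask_value)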
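
In this notebook `query_length` equals `key_length`, so the slice simply returns the whole mask. The offset `key_length - query_length` matters when there are fewer queries than keys, for example when decoding one new token against cached keys (an assumed usage that this notebook does not exercise). A NumPy sketch with a hypothetical helper `causal_slice`:

```python
import numpy as np

max_positions = 6
bias = np.tril(np.ones((max_positions, max_positions), dtype=bool))
bias = bias.reshape(1, 1, max_positions, max_positions)

def causal_slice(bias, query_length, key_length):
    # Rows: the last `query_length` positions; columns: all `key_length` keys
    return bias[:, :, key_length - query_length:key_length, :key_length]

# Full sequence: 4 queries against 4 keys gives a 4x4 lower-triangular mask
full = causal_slice(bias, query_length=4, key_length=4)

# Single new token: 1 query against 4 keys, and it may see every earlier key
step = causal_slice(bias, query_length=1, key_length=4)

print(full.shape, step.shape)
print(step[0, 0])
```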

In [15]:
np.finfo(np.float32).min

np.float32(-3.4028235e+38)

In [16]:
attn_weights[0, 0]

Tensor(shape=[10, 10], dtype=Float32, value=
[[-3.11613619e-01, -3.40282347e+38, -3.40282347e+38 ... -3.40282347e+38, -3.40282347e+38, -3.40282347e+38],
 [-2.06547290e-01, -7.39235580e-02, -3.40282347e+38 ... -3.40282347e+38, -3.40282347e+38, -3.40282347e+38],
 [-3.00765842e-01,  5.50258100e-01,  3.82859051e-01 ... -3.40282347e+38, -3.40282347e+38, -3.40282347e+38],
 ...
 [-4.89603162e-01,  1.24119543e-01,  5.49100041e-01 ... -5.90216100e-01, -3.40282347e+38, -3.40282347e+38],
 [-2.93846786e-01, -5.76866195e-02,  3.15943241e-01 ...  8.10846090e-02, -9.59210396e-02, -3.40282347e+38],
 [ 3.16390023e-02, -4.88018245e-03, -1.06886245e-01 ... -6.14760891e-02, -4.29923803e-01, -4.14124966e-01]])

![](https://jalammar.github.io/images/gpt2/transformer-attention-masked-scores-softmax.png)

In [17]:
attn_weights = softmax(attn_weights, axis=-1)
attn_weights.shape

(1, 12, 10, 10)

In [18]:
attn_weights[0, 0]

Tensor(shape=[10, 10], dtype=Float32, value=
[[ 1.00000000e+00,  0.00000000e+00,  0.00000000e+00 ...  0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
 [ 4.66892570e-01,  5.33107400e-01,  0.00000000e+00 ...  0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
 [ 1.87860817e-01,  4.39978272e-01,  3.72160882e-01 ...  0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
 ...
 [ 7.49308616e-02,  1.38419464e-01,  2.11720958e-01 ...  6.77587166e-02,  0.00000000e+00,  0.00000000e+00],
 [ 8.04141685e-02,  1.01834655e-01,  1.47965670e-01 ...  1.16993889e-01,  9.80145559e-02,  0.00000000e+00],
 [ 1.05273820e-01,  1.01498663e-01,  9.16557387e-02 ...  9.59137827e-02,  6.63538650e-02,  6.74105063e-02]])
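
The effect of masking followed by softmax can be reproduced on a small example. The NumPy sketch below (the function name `masked_softmax` is hypothetical) fills masked positions with the most negative finite float, as in the cells above. Those positions receive zero probability after the exponential, and each row still sums to 1.

```python
import numpy as np

def masked_softmax(scores, mask):
    # Replace masked positions with the most negative finite value of the dtype
    fill = np.finfo(scores.dtype).min
    masked = np.where(mask, scores, fill)
    # Subtract the row maximum for numerical stability before exponentiating
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4)).astype(np.float32)
mask = np.tril(np.ones((4, 4), dtype=bool))

probs = masked_softmax(scores, mask)
print(np.round(probs, 3))
```

The first row always becomes `[1, 0, 0, 0]`, because the first token can only attend to itself.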

![](https://jalammar.github.io/images/gpt2/gpt2-self-attention-multihead-sum-1.png)

In [19]:
attn_output = ops.matmul(attn_weights, value)

attn_output.shape

(1, 12, 10, 64)

## GPT-2 Self-attention: 3.5- Merge attention heads

![](https://jalammar.github.io/images/gpt2/gpt2-self-attention-merge-heads-1.png)

In [20]:
def merge_heads(tensor, num_heads, attn_head_size):
    """
    Merges attn_head_size dim and num_heads dim into hidden_size
    """
    # (batch_size, num_heads, seq_len, attn_head_size) --> (batch_size, seq_len, num_heads, attn_head_size)
    tensor = ops.transpose(tensor, (0, 2, 1, 3))
    # (batch_size, seq_len, num_heads, attn_head_size) --> (batch_size, seq_len, hidden_size)
    new_shape = tensor.shape[:-2] + (num_heads * attn_head_size,)
    return tensor.view(new_shape)

In [21]:
# (1, 12, 10, 64) --> (1, 10, 12, 64) --> (1, 10, 768)
attn_output = merge_heads(attn_output, num_heads, head_dim)

attn_output.shape

(1, 10, 768)
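
Merging heads is the exact inverse of splitting them. The NumPy sketch below (helper names are hypothetical) checks that splitting and then merging returns the original tensor unchanged.

```python
import numpy as np

def split_heads_np(t, num_heads):
    batch, seq_len, hidden = t.shape
    t = t.reshape(batch, seq_len, num_heads, hidden // num_heads)
    return t.transpose(0, 2, 1, 3)

def merge_heads_np(t):
    batch, num_heads, seq_len, head_dim = t.shape
    # Move the head axis back next to head_dim, then flatten the two together
    return t.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * head_dim)

x = np.random.default_rng(0).standard_normal((1, 10, 768))
roundtrip = merge_heads_np(split_heads_np(x, num_heads=12))
print(roundtrip.shape, np.array_equal(roundtrip, x))
```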

## GPT-2 Self-attention: 4- Projecting

![](https://jalammar.github.io/images/gpt2/gpt2-self-attention-project-1.png)

In [22]:
c_proj = Conv1D(embed_dim, embed_dim)

In [23]:
attn_output = c_proj(attn_output)
attn_output.shape

(1, 10, 768)
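
Putting the steps together, the whole layer fits in one short function. This is a NumPy sketch under the same assumptions as the cells above (fused QKV projection, scaling by `sqrt(head_dim)`, lower-triangular mask). The function name, the weight names, and the small dimensions are illustrative only. The final check perturbs the last token and confirms that the outputs at earlier positions do not change, which is the property the causal mask is meant to guarantee.

```python
import numpy as np

def causal_self_attention(x, w_attn, b_attn, w_proj, b_proj, num_heads):
    """NumPy sketch of GPT-2 style masked multi-head self-attention."""
    batch, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads

    def split(t):
        return t.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)

    # 1. Create queries, keys, values and split them into heads
    q, k, v = (split(t) for t in np.split(x @ w_attn + b_attn, 3, axis=-1))

    # 2. Score: scaled dot product, causal mask, softmax
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(mask, scores, np.finfo(scores.dtype).min)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # 3. Weighted sum of values, then merge the heads
    out = (weights @ v).transpose(0, 2, 1, 3).reshape(batch, seq_len, embed_dim)

    # 4. Output projection
    return out @ w_proj + b_proj

rng = np.random.default_rng(0)
embed_dim, num_heads, seq_len = 64, 4, 6
x = rng.standard_normal((1, seq_len, embed_dim))
w_attn = 0.02 * rng.standard_normal((embed_dim, 3 * embed_dim))
b_attn = np.zeros(3 * embed_dim)
w_proj = 0.02 * rng.standard_normal((embed_dim, embed_dim))
b_proj = np.zeros(embed_dim)

y = causal_self_attention(x, w_attn, b_attn, w_proj, b_proj, num_heads)

# Causality check: changing the last token must leave earlier outputs untouched
x_changed = x.copy()
x_changed[0, -1] += 1.0
y_changed = causal_self_attention(x_changed, w_attn, b_attn, w_proj, b_proj, num_heads)
earlier_unchanged = np.allclose(y[0, :-1], y_changed[0, :-1])
print(y.shape, earlier_unchanged)
```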