### Self-Attention: Model and Formulas

- Reference: [Self-attention计算方法](https://blog.csdn.net/weixin_43282288/article/details/103513107)

### Step 1: Initialize Q, K, V

- $W^q, W^k, W^v$ are randomly initialized weight matrices, sized to the dimensions of $Q$, $K$, $V$.
- Multiply the input $x$ by $W^q, W^k, W^v$ to obtain $q, k, v$.
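
Written out with shapes, the projections look as follows. This assumes, as the notebook's code does, that each row of $x$ is one token and that all three weight matrices are square ($d \times d$); other layouts are possible.

$$
    q = x W^q, \quad k = x W^k, \quad v = x W^v, \qquad x \in \mathbb{R}^{n \times d}, \quad W^q, W^k, W^v \in \mathbb{R}^{d \times d}
$$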

In [1]:
import numpy as np
from numpy.random import randn

# d is the feature dimension
d = 256
# n is the sequence length (seq_len)
n = 32
# x is the input matrix, one row per token
x = randn(n, d)
x.shape

(32, 256)

In [2]:
# Randomly initialize the weight matrices
wq = randn(d, d)
wk = randn(d, d)
wv = randn(d, d)

wq.shape, wk.shape, wv.shape

((256, 256), (256, 256), (256, 256))

In [3]:
# Multiply the input x by each weight matrix wq, wk, wv (matrix multiplication) to get q, k, v

# x(32, 256) @ w(256, 256)
q = x @ wq
k = x @ wk
v = x @ wv

q.shape, k.shape, v.shape

((32, 256), (32, 256), (32, 256))

### Step 2: Compute the attention scores

$$
    \alpha_{j,i} = q^{(j)} \left(k^{(i)}\right)^{T} / \sqrt{d}
$$


In [5]:
a = (q @ k.T ) / np.sqrt(d)
a.shape

(32, 32)
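
As a sanity check on the formula, the sketch below compares one entry of the matrix product with the per-pair dot product. The small sizes, the seed, and the names `rng`, `a_ji`, and `match` are illustrative choices, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8  # small illustrative sizes
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))

# All scores at once: row j holds the scores of query j against every key
a = (q @ k.T) / np.sqrt(d)

# One score computed directly from the per-element formula
j, i = 2, 1
a_ji = q[j] @ k[i] / np.sqrt(d)

match = np.isclose(a[j, i], a_ji)
print(match)
```

Row `j` of `a` therefore holds the scores of query `j` against all keys, which is why the softmax later runs over the last axis.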

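### Step 3: Clip the scores

In [7]:
# Clip a to [-100, 100] so that np.exp in the softmax step cannot overflow; the signature is np.clip(x, x_min, x_max)
a = np.clip(a, -100, 100)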
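
To see why clipping helps, here is a small sketch on made-up scores; the array `scores` and the names `naive`, `clipped`, and `safe` are illustrative and not from the notebook. In float64, `np.exp` overflows to `inf` for inputs above roughly 709, and `inf / inf` then yields `nan`.

```python
import numpy as np

# Made-up scores with a very large entry
scores = np.array([[1000.0, 0.0, -1000.0]])

# Unclipped softmax: exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)

# Clipping first keeps every exponent within float64 range
clipped = np.clip(scores, -100, 100)  # argument order: lower bound, then upper bound
safe = np.exp(clipped) / np.sum(np.exp(clipped), axis=-1, keepdims=True)

print(np.isnan(naive).any(), np.isfinite(safe).all())
```

Note that clipping changes the scores it touches, so the resulting distribution is only an approximation of the unclipped softmax whenever scores fall outside the range.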

### Step 4: Softmax

$$
    \hat{\alpha}_{1, i}=\exp \left(\alpha_{1, i}\right) / \sum_{j} \exp \left(\alpha_{1, j}\right)
$$

In [9]:
# Row-wise softmax: each row of a is normalized to sum to 1
a = np.exp(a) / np.sum(np.exp(a), axis = -1, keepdims=True)
a.shape

(32, 32)
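
A common alternative to clipping, which this notebook does not use, is to subtract each row's maximum before exponentiating. Softmax is unchanged by adding a constant to a row, so the result is the same in exact arithmetic while the largest exponent becomes 0. The helper name `softmax_rows` and the sample values below are hypothetical.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax that subtracts the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # largest entry per row becomes 0
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Made-up scores: the first row would overflow a plain np.exp
scores = np.array([[1000.0, 999.0, -1000.0],
                   [0.0, 0.0, 0.0]])
probs = softmax_rows(scores)
print(probs.sum(axis=-1))
```

Unlike clipping, this shift does not alter the relative differences between scores in a row.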

### Step 5: Multiply a by v

$$
    b^{1}=\sum_{i} \hat{\alpha}_{1, i} v^{i}
$$


In [None]:
# b(32, 256) = a(32,32) @ v(32,256)
b = a @ v
b.shape
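
The steps in this section can be gathered into a single function. The sketch below is one way to do it, not a reference implementation: the function name `self_attention`, the `clip` parameter, and the seeded generator `rng` are assumptions introduced here for illustration.

```python
import numpy as np

def self_attention(x, wq, wk, wv, clip=100.0):
    """Single-head self-attention on a (seq_len, d) input, following the steps above."""
    # Step 1: project the input into queries, keys, and values
    q, k, v = x @ wq, x @ wk, x @ wv
    # Step 2: scaled dot-product scores, shape (seq_len, seq_len)
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    # Step 3: clip so that np.exp cannot overflow
    scores = np.clip(scores, -clip, clip)
    # Step 4: row-wise softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Step 5: weighted sum of the values, shape (seq_len, d)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 32, 256
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

b, weights = self_attention(x, wq, wk, wv)
print(b.shape, weights.shape)
```

Returning the attention weights alongside the output makes it easy to confirm that each row sums to 1.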