Credit to: [Notes: hwcoder](https://hwcoder.top/Manual-Coding-1) and [Zhihu: A close reading of Recurrent Models of Visual Attention](https://zhuanlan.zhihu.com/p/555778029)

- Additive Attention
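From the 2014 paper [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473).
This paper was not the first to propose an attention mechanism, but it introduced attention to machine translation.
The core idea: when generating each target word, the model automatically assigns a weight (attention) to every word of the source sentence, then takes the weighted sum to form a "context vector" that serves as auxiliary information for the current step.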

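For each decoder time step $i$ we have $s_{i-1}$, the hidden state from the previous step (Equation 1 in the original paper), and $\{h_1,...,h_{T_x}\}$, the encoder hidden states at all time steps (for an input of $T_x$ tokens).
To generate the target word $y_i$, the model has to decide which tokens of the source sentence that word should attend to most.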

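(Original paper, page 7, A.2.2 decoder architecture)
Step 1: compute the alignment score
For decoder step $i$ and encoder step $j$,
compute how well the decoder state $s_{i-1}$ matches the encoder hidden state $h_j$ of the $j$-th token:
$e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$ where $v_a \in \mathbb{R}^{n'} \ \ \ \ W_a \in \mathbb{R}^{n' \times n} \ \ \ U_a \in \mathbb{R}^{n' \times 2n}$ are trainable parameters and $\tanh$ is the activation function. ($U_a$ has $2n$ columns because the paper's bidirectional encoder concatenates the forward and backward states.) In my humble opinion, this is essentially an MLP with one hidden layer, acting as a universal approximator that estimates the similarity between $s_{i-1}$ and $h_j$, where $s_{i-1}$ carries the decoder's context just before the current word is generated.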

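Step 2: normalize the scores with softmax to get the attention weights $\alpha_{ij}$
For each encoder position $j$, compute a probability weight $\alpha_{ij}$ with $\sum_j \alpha_{ij} = 1$:
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$
This is just a softmax, which (1) amplifies the differences between scores through $\exp$ and (2) normalizes the $e_{ij}$ into non-negative weights.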
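A small NumPy sketch of this normalization step, assuming a made-up score vector `e` for a single decoder step (the values are purely illustrative). It subtracts the maximum score before exponentiating, which leaves the result unchanged but avoids overflow for large scores.

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.exp(e - np.max(e))  # subtracting the max does not change the result
    return z / z.sum()

# Hypothetical alignment scores e_{i1}, ..., e_{i4} for one decoder step i
e = np.array([1.0, 2.0, 3.0, 4.0])
alpha = softmax(e)

print(alpha)        # non-negative weights, largest for the largest score
print(alpha.sum())  # sums to 1 up to floating-point error
# exp amplifies gaps: a score difference of 3 becomes a weight ratio of exp(3)
print(alpha[3] / alpha[0], np.exp(3.0))
```

Shifting every score by the same constant gives the same weights, which is why subtracting the maximum is safe.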

Step 3: use the attention weights to compute the context vector $c_i$
$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

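$c_i$ is the context vector for decoder step $i$: a weighted sum of the encoder hidden states, where each weight $\alpha_{ij}$ (the attention value) measures how relevant encoder token $j$ is to the decoder's $i$-th token.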

In [2]:
import numpy as np
np.random.seed(2025)

In [21]:
def additive_attention(s_prev, h, Wa, Ua, va):
    """
    :param s_prev: (n,) previous decoder hidden state s_{i-1}
    :param h: (Tx, n) all encoder hidden states
    :param Wa: (d, n) projection matrix for s_prev
    :param Ua: (d, n) projection matrix for h
    :param va: (d,) scoring vector
    :return: context: (n,) context vector c_i; alpha: (Tx,) attention weights
             (Tx is the input length, i.e. the number of encoder steps)
    """
    Tx, n = h.shape
    # Repeat s_prev for every encoder position: (n,) -> (Tx, n)
    s_expanded = np.tile(s_prev, (Tx, 1))

    # e_ij = va^T tanh(Wa s_{i-1} + Ua h_j), computed for all j at once -> (Tx,)
    e = np.tanh(s_expanded @ Wa.T + h @ Ua.T) @ va
    print(e.shape)  # shape check: (Tx,)
    # alpha_ij = softmax(e_ij)
    alpha = np.exp(e - np.max(e))  # subtract the max for numerical stability
    alpha = alpha / np.sum(alpha)  # (Tx,)
    print(alpha.shape, h.shape)  # shape check: prints (5,) (5, 4) in the test below
    # context vector c_i = sum_j alpha_ij * h_j
    # Equivalent: np.sum(alpha[:, np.newaxis] * h, axis=0), (Tx, n) summed over axis 0 -> (n,)
    context = alpha @ h  # (Tx,) @ (Tx, n) = (n,)
    return context, alpha


def test_additive_attention():
    Tx = 5 # length of encoder
    n = 4 # hidden_size (defined in RNN/GRU/LSTM)
    d = 3 # attention dimension
    
    s_prev = np.random.randn(n)            # decoder state (n,)
    h = np.random.randn(Tx, n)             # encoder hidden states (Tx, n)
    Wa = np.random.randn(d, n)              
    Ua = np.random.randn(d, n)
    va = np.random.randn(d)
    context, alpha = additive_attention(s_prev, h, Wa, Ua, va)

    print("Context vector (c_i):", context)
    print("Attention weights (α):", alpha)
    print("Sum of attention weights:", np.sum(alpha))  # should be close to 1
    
test_additive_attention()

(5,)
(5,) (5, 4)
Context vector (c_i): [-0.07463606  0.69465917  0.25594682 -0.61618757]
Attention weights (α): [0.16482456 0.15648484 0.1736355  0.34484006 0.16021504]
Sum of attention weights: 1.0
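As a cross-check, the three steps can also be written as explicit loops and compared against a vectorized version. The sketch below is self-contained and makes two assumptions that differ from the cell above: the function names are illustrative, and the encoder states have size $2n$ (as with the paper's bidirectional encoder), so `Ua` has shape `(d, 2n)` to match $U_a \in \mathbb{R}^{n' \times 2n}$.

```python
import numpy as np

def additive_attention_loop(s_prev, h, Wa, Ua, va):
    """Literal, loop-based version of the three steps (reference only)."""
    Tx = h.shape[0]
    e = np.empty(Tx)
    for j in range(Tx):
        # Step 1: e_ij = va^T tanh(Wa s_{i-1} + Ua h_j)
        e[j] = va @ np.tanh(Wa @ s_prev + Ua @ h[j])
    # Step 2: softmax over the encoder positions j
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Step 3: c_i = sum_j alpha_ij h_j
    context = np.zeros(h.shape[1])
    for j in range(Tx):
        context += alpha[j] * h[j]
    return context, alpha

def additive_attention_vec(s_prev, h, Wa, Ua, va):
    """Vectorized version; broadcasting replaces explicit tiling of s_prev."""
    e = np.tanh(s_prev @ Wa.T + h @ Ua.T) @ va  # (d,) + (Tx, d) -> (Tx, d) -> (Tx,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    return alpha @ h, alpha

rng = np.random.default_rng(0)
Tx, n, d = 5, 4, 3
s_prev = rng.standard_normal(n)
h = rng.standard_normal((Tx, 2 * n))  # assumed: forward + backward states concatenated
Wa = rng.standard_normal((d, n))
Ua = rng.standard_normal((d, 2 * n))
va = rng.standard_normal(d)

c_loop, a_loop = additive_attention_loop(s_prev, h, Wa, Ua, va)
c_vec, a_vec = additive_attention_vec(s_prev, h, Wa, Ua, va)
print(np.allclose(c_loop, c_vec), np.allclose(a_loop, a_vec))  # True True
print(c_vec.shape, a_vec.shape)                                # (8,) (5,)
```

Note that the context vector has the same size as the encoder states ($2n$ here), not the decoder state.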


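Here is a PyTorch version as a bonus.
$W_a$ and $U_a$ are implemented with `nn.Linear`, which supports batched input:
$$
s_{\text{exp}} \in \mathbb{R}^{\text{batch} \times T_x \times \text{hidden}} \quad \times \quad W_a^\top \in \mathbb{R}^{\text{hidden} \times \text{attn}} \quad \Rightarrow \quad \mathbb{R}^{\text{batch} \times T_x \times \text{attn}}
$$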
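The shape bookkeeping can be checked without PyTorch. The NumPy sketch below uses made-up sizes; `Wa_T` and `Ua_T` are illustrative names for the `(hidden, attn)` matrices that sit on the right-hand side of the product. It also shows that explicitly expanding the decoder state along the time axis gives the same result as projecting once and letting broadcasting do the rest.

```python
import numpy as np

batch, Tx, hidden, attn = 2, 5, 4, 3
rng = np.random.default_rng(0)

s_prev = rng.standard_normal((batch, hidden))
h = rng.standard_normal((batch, Tx, hidden))
Wa_T = rng.standard_normal((hidden, attn))  # right-hand factor of the product
Ua_T = rng.standard_normal((hidden, attn))

# Explicit expansion: copy s_prev along a new time axis, then project
s_exp = np.repeat(s_prev[:, None, :], Tx, axis=1)  # (batch, Tx, hidden)
proj_expanded = s_exp @ Wa_T                       # (batch, Tx, attn)

# Broadcasting alternative: project once, then add a length-1 time axis
proj_broadcast = (s_prev @ Wa_T)[:, None, :]       # (batch, 1, attn)

pre_act = proj_broadcast + h @ Ua_T                # broadcasts to (batch, Tx, attn)
print(proj_expanded.shape, pre_act.shape)              # (2, 5, 3) (2, 5, 3)
print(np.allclose(proj_expanded + h @ Ua_T, pre_act))  # True
```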


In [23]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_size, attention_dim):
        super().__init__()
        self.Wa = nn.Linear(hidden_size, attention_dim, bias=False)
        self.Ua = nn.Linear(hidden_size, attention_dim, bias=False)
        self.va = nn.Parameter(torch.randn(attention_dim))  # learnable vector

    def forward(self, s_prev: torch.Tensor, h: torch.Tensor):
        """
        :param s_prev: (batch, hidden_size) previous decoder hidden state
        :param h: (batch, Tx, hidden_size) all encoder hidden states
        :return: context: (batch, hidden_size), alpha: (batch, Tx)
        """
        # 1. expand s_prev → (batch, Tx, hidden_size)
        _, Tx, _ = h.size()
        s_exp = s_prev.unsqueeze(1).expand(-1, Tx, -1)
        # s_prev (batch, hidden_size) -> unsqueeze(1) (batch, 1, hidden_size) -> expand (batch, Tx, hidden_size)
        # so s_exp => (batch, Tx, hidden_size)
        # 2. compute the attention energy scores
        e = torch.tanh(self.Wa(s_exp) + self.Ua(h))  # (batch, Tx, attn_dim)
        e_scores = torch.matmul(e, self.va)  # (batch, Tx, attn_dim) @ (attn_dim,) → (batch, Tx)

        # 3. softmax over encoder positions gives the attention distribution
        alpha = F.softmax(e_scores, dim=1)  # (batch, Tx)

        # 4. context vector c_i = alpha @ h
        context = torch.bmm(alpha.unsqueeze(1), h)  # torch.bmm((batch, 1, Tx), (batch, Tx, hidden)) → (batch, 1, hidden)
        context = context.squeeze(1)  # (batch, hidden_size)

        return context, alpha
def test_torch_additive_attention():
    batch_size = 2
    Tx = 5              # number of encoder time steps
    hidden_size = 4     # encoder/decoder hidden dim
    attention_dim = 3   # dimension of the attention hidden space

    # select the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # random inputs: decoder state + encoder states
    s_prev = torch.randn(batch_size, hidden_size).to(device)       # (batch, hidden_size)
    h = torch.randn(batch_size, Tx, hidden_size).to(device)        # (batch, Tx, hidden_size)

    # initialize the attention module
    attn = AdditiveAttention(hidden_size, attention_dim).to(device)

    # forward pass
    context, alpha = attn(s_prev, h)

    # check the outputs
    print("s_prev shape:", s_prev.shape)
    print("h shape:", h.shape)
    print("context shape:", context.shape)  # (batch, hidden_size)
    print("alpha shape:", alpha.shape)      # (batch, Tx)
    print("alpha sum (per batch):", alpha.sum(dim=1))  # should be close to 1
test_torch_additive_attention()

s_prev shape: torch.Size([2, 4])
h shape: torch.Size([2, 5, 4])
context shape: torch.Size([2, 4])
alpha shape: torch.Size([2, 5])
alpha sum (per batch): tensor([1., 1.], device='cuda:0', grad_fn=<SumBackward1>)


In [24]:
%reset -f

- Transformer: Multi-Head Attention (MHA)

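How to get from additive attention (AA) to MHA: $q = W_q s_{i-1} \quad k_j = W_k h_j \quad v_j = W_v h_j$
Seen this way, Q, K, and V map closely onto additive attention: the query comes from the decoder state, and the keys and values come from the encoder states. The score $v_a^\top \tanh(\cdot)$ is replaced by the scaled dot product $q^\top k_j / \sqrt{d_k}$, and the context becomes $\sum_j \alpha_{ij} v_j$.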
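To make the mapping concrete, here is a minimal single-head sketch in NumPy, assuming the decoder state is the only query and the encoder states supply the keys and values. The function name, the sizes `d_k` and `d_v`, and the random weights are illustrative assumptions; this is one head only, not a full multi-head layer.

```python
import numpy as np

def scaled_dot_product_attention(s_prev, h, Wq, Wk, Wv):
    """Single-head attention with the decoder state as the only query."""
    q = Wq @ s_prev   # (d_k,)
    K = h @ Wk.T      # (Tx, d_k), row j is k_j = Wk h_j
    V = h @ Wv.T      # (Tx, d_v), row j is v_j = Wv h_j
    # The scaled dot product replaces va^T tanh(Wa s + Ua h_j)
    scores = K @ q / np.sqrt(q.shape[0])  # (Tx,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax over encoder positions
    return alpha @ V, alpha               # context is a weighted sum of values

rng = np.random.default_rng(0)
Tx, n, d_k, d_v = 5, 4, 3, 6
s_prev = rng.standard_normal(n)
h = rng.standard_normal((Tx, n))
Wq = rng.standard_normal((d_k, n))
Wk = rng.standard_normal((d_k, n))
Wv = rng.standard_normal((d_v, n))

context, alpha = scaled_dot_product_attention(s_prev, h, Wq, Wk, Wv)
print(context.shape, alpha.shape)  # (6,) (5,)
print(alpha.sum())                 # close to 1
```

Multi-head attention runs several such heads in parallel, each with its own projections, and concatenates their outputs; that part is not shown here.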