# Determining how to extract QK and OV matrices from GPT-Neo attention heads
We'd like to analyze the composition terms for attention heads across GPT-Neo models, but to do that we first need to know how to extract the QK and OV matrices from the model. This notebook extracts those matrices and checks them by reproducing the self-attention mechanism from them. We'll work with the first layer for simplicity.

## Importing the model

In [1]:
import numpy as np
import torch
from transformers import AutoModelForCausalLM

# Print NumPy arrays with 4 digits of precision
np.set_printoptions(precision=4)



In [2]:
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-125M"
)

In [3]:
type(model)

transformers.models.gpt_neo.modeling_gpt_neo.GPTNeoForCausalLM

The results below come from experimenting with the code [for this class](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neo/modeling_gpt_neo.py) and adjusting how the parameters are sliced and combined until I could reliably reproduce parts of the model's output. What follows is a cleaned-up summary of that process.

## Extracting the parameters from the layer

In [4]:
# Layer 0 self-attention; `h` is the ModuleList holding the transformer layers
att0 = model.transformer.h[0].attn.attention
att0

GPTNeoSelfAttention(
  (attn_dropout): Dropout(p=0, inplace=False)
  (resid_dropout): Dropout(p=0, inplace=False)
  (k_proj): Linear(in_features=768, out_features=768, bias=False)
  (v_proj): Linear(in_features=768, out_features=768, bias=False)
  (q_proj): Linear(in_features=768, out_features=768, bias=False)
  (out_proj): Linear(in_features=768, out_features=768, bias=True)
)

In [5]:
# Query projection
Q = att0.q_proj.weight.data.numpy()

# Key projection
K = att0.k_proj.weight.data.numpy()

# Value projection
V = att0.v_proj.weight.data.numpy()

# Output projection (the only one of the four projections with a bias)
O = att0.out_proj.weight.data.numpy()
Ob = att0.out_proj.bias.data.numpy()
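
A note on layout, since the slicing later in this notebook depends on it: `torch.nn.Linear` stores its weight with shape `(out_features, in_features)` and computes `x @ W.T` (plus a bias if present). Assuming each head occupies a contiguous block of the output features, which is how `headQK` and `headOV` below treat them, head `i` of a projection corresponds to a block of rows of the weight matrix. The sketch below illustrates this with small random matrices rather than the model's weights; the sizes and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_heads = 8, 2            # toy sizes (the notebook's model uses 768 and 12)
head_dim = hidden // n_heads

# Same layout as a torch.nn.Linear weight: (out_features, in_features)
W = rng.standard_normal((hidden, hidden))
x = rng.standard_normal((3, hidden))   # 3 toy input vectors

# Bias-free Linear forward pass
full = x @ W.T

# Head 1: slice the output features, or equivalently slice the weight rows first
i = 1
from_output = full[:, i*head_dim:(i+1)*head_dim]
from_rows = x @ W[i*head_dim:(i+1)*head_dim, :].T

print(np.allclose(from_output, from_rows))
```

The two routes agree, which is why a head's query, key, and value weights can be taken as row blocks of `Q`, `K`, and `V`.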

In [6]:
# The model's config object, which stores the hyperparameters
config = model.config

## Running self-attention as a reference

Making some dummy embedding vectors.

Generating "fully random" vectors often results in attention weights equal to the identity matrix, which makes for a poor replication test: multiplying by the identity is a no-op, so it would not show whether we apply the attention weights correctly.

Instead, we'll generate a base vector and add a small amount of noise to it.

In [7]:
np.random.seed(2349058)
base_dv = np.random.rand(1, 768)

# sequence length X hidden dimension
dvs = np.empty((5, 768), dtype=np.float32)
dvs[0, :] = base_dv
dvs[1:, :] = np.random.randn(4, 768)*0.05 + base_dv

# Making a tensor for PyTorch code
# batch X sequence length X hidden dimension
dvs_t = torch.tensor(dvs).view(1, 5, 768)

The next cells run the body of the `att0.forward` method manually, step by step.

In [8]:
num_heads = config.num_heads
head_dim = config.hidden_size // num_heads

print(f"# heads: {num_heads}, head dimension: {head_dim}")

# heads: 12, head dimension: 64


In [9]:
q = att0._split_heads(att0.q_proj(dvs_t), num_heads, head_dim)
k = att0._split_heads(att0.k_proj(dvs_t), num_heads, head_dim)
v = att0._split_heads(att0.v_proj(dvs_t), num_heads, head_dim)

# ao = value-multiplied attention output per head
# aw = attention weights
ao, aw = att0._attn(q, k, v)

ao_ = att0._merge_heads(ao, num_heads, head_dim)
# fo = final output
fo = att0.out_proj(ao_)
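
For reference, splitting and merging heads amounts to a reshape plus a transpose. The NumPy sketch below mirrors what `_split_heads` and `_merge_heads` appear to do in the linked source; that is an assumption from reading the code, not a guarantee across library versions, and the names `split_heads` and `merge_heads` are illustrative.

```python
import numpy as np

def split_heads(x, num_heads, head_dim):
    # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
    batch, seq, _ = x.shape
    return x.reshape(batch, seq, num_heads, head_dim).transpose(0, 2, 1, 3)

def merge_heads(x, num_heads, head_dim):
    # (batch, heads, seq, head_dim) -> (batch, seq, hidden)
    batch, _, seq, _ = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq, num_heads * head_dim)

# Toy input with the same shape as dvs_t: 1 batch, 5 positions, 768 features
x = np.arange(1 * 5 * 768, dtype=np.float32).reshape(1, 5, 768)
heads = split_heads(x, 12, 64)
print(heads.shape)

# Head i holds features i*64 .. (i+1)*64 of every position
print(np.array_equal(heads[0, 3], x[0, :, 3*64:4*64]))

# Merging undoes the split
print(np.array_equal(merge_heads(heads, 12, 64), x))
```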

Checking the attention weights. Again, if these turn out to be the identity matrix, these vectors aren't the best test.

In [10]:
# selecting head 0
aw[0, 0]

tensor([[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.2767e-05, 9.9999e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [8.8918e-13, 1.4486e-12, 1.0000e+00, 0.0000e+00, 0.0000e+00],
        [3.4189e-10, 3.6326e-09, 1.0000e+00, 3.0019e-06, 0.0000e+00],
        [3.2485e-11, 2.9238e-11, 1.0000e+00, 2.6672e-09, 1.8745e-08]],
       grad_fn=<SelectBackward0>)

In [11]:
# selecting head 1
aw[0, 1]

tensor([[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [7.9256e-06, 9.9999e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [4.6404e-12, 3.7506e-07, 1.0000e+00, 0.0000e+00, 0.0000e+00],
        [1.3137e-11, 8.1386e-07, 1.0000e+00, 1.5599e-16, 0.0000e+00],
        [1.4078e-11, 9.4648e-07, 1.0000e+00, 4.9371e-16, 1.6365e-12]],
       grad_fn=<SelectBackward0>)

In [12]:
fo[..., :5]

tensor([[[-23.1595,  -4.3884,   1.0211, -35.5319,  53.8781],
         [-23.1733,  -9.9036,   2.5395, -32.8883,  52.2064],
         [-23.7669,  -5.7476,   4.7038, -33.8980,  52.6261],
         [-20.4753,  -7.8682,  -1.8134, -38.4741,  53.1547],
         [-22.1756,  -2.8764,   0.0873, -33.7324,  54.9025]]],
       grad_fn=<SliceBackward0>)

## Reproducing self-attention using the parameters

In [13]:
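def headQK(Q, K, head_index):
    """Combined QK matrix (hidden x hidden) for one head."""
    assert 0 <= head_index < num_heads

    i = head_index
    # Head i uses a block of rows of the query and key projection weights
    Qh = Q[i*head_dim:(i+1)*head_dim, :]
    Kh = K[i*head_dim:(i+1)*head_dim, :]

    return Qh.T @ Kh


def headOV(O, V, head_index):
    """Combined OV matrix (hidden x hidden) for one head."""
    assert 0 <= head_index < num_heads

    i = head_index
    # Head i uses a block of columns of the output projection
    # and a block of rows of the value projection
    Oh = O[:, i*head_dim:(i+1)*head_dim]
    Vh = V[i*head_dim:(i+1)*head_dim, :]

    return Oh @ Vh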
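
In matrix form, the two helpers compute the following. The notation is introduced here for illustration only: $X$ is the (sequence length × hidden) input; $W_Q^h$, $W_K^h$, $W_V^h$ are the (head dimension × hidden) row blocks of the projection weights for head $h$; $W_O^h$ is the (hidden × head dimension) column block of the output weight; $b_O$ is the output bias; and $A^h$ is the attention-weight matrix of head $h$ after causal masking and softmax. No $1/\sqrt{d}$ scaling is written, consistent with the unscaled scores used in the reproduction in this notebook.

$$
S^h = \left(X W_Q^{h\top}\right)\left(X W_K^{h\top}\right)^{\top}
    = X \, \underbrace{W_Q^{h\top} W_K^{h}}_{\texttt{headQK}} \, X^{\top},
\qquad
A^h = \operatorname{softmax}\!\left(\operatorname{mask}(S^h)\right)
$$

$$
\text{output} = \sum_h A^h X \, W_V^{h\top} W_O^{h\top} + b_O
              = \sum_h A^h X \, \Big(\underbrace{W_O^{h} W_V^{h}}_{\texttt{headOV}}\Big)^{\top} + b_O
$$
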

### Reproducing the attention weights

In [15]:
# The model stores the causal mask in a buffer named "bias".
# The buffer covers the maximum context size, but we only need
# the top-left 5x5 block for our example of 5 vectors.
causal_mask = att0.bias[0, 0, :5, :5].numpy()

causal_mask

array([[1, 0, 0, 0, 0],
       [1, 1, 0, 0, 0],
       [1, 1, 1, 0, 0],
       [1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1]], dtype=uint8)
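
The same lower-triangular pattern can be built directly with NumPy, which is handy for checking the slice or for experimenting without the model. This is a sketch; `seq_len` and `mask` are illustrative names, and the `uint8` dtype is chosen only to match the output shown above.

```python
import numpy as np

seq_len = 5
# Position i may attend to positions 0..i, so keep the lower triangle
mask = np.tril(np.ones((seq_len, seq_len), dtype=np.uint8))
print(mask)
```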

In [16]:
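def attention_weights(inputs, Q, K, head_index):
    """Causally masked attention weights for one head."""
    # Raw scores for every (query, key) pair. No 1/sqrt(head_dim)
    # scaling is applied; the unscaled scores reproduce `aw`.
    raw = inputs @ headQK(Q, K, head_index) @ inputs.T
    final = torch.nn.functional.softmax(
        # Masked positions get a large negative score, so their
        # softmax weight is zero
        torch.tensor(np.where(causal_mask == 1, raw, -1e9)),
        dim=-1,
    ).numpy()

    return final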
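
The masked softmax step does not need PyTorch. Below is a minimal NumPy sketch of the same step, assuming a square score matrix and a 0/1 lower-triangular mask; `masked_softmax` and the toy `scores` are illustrative and not part of the model code. Filling masked positions with a large negative number makes their exponential underflow to zero, so each row is normalized over the allowed positions only.

```python
import numpy as np

def masked_softmax(scores, mask, fill=-1e9):
    # Replace disallowed positions with a large negative score
    masked = np.where(mask == 1, scores, fill)
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((5, 5)).astype(np.float32)
mask = np.tril(np.ones((5, 5), dtype=np.uint8))

w = masked_softmax(scores, mask)
print(w.round(4))
```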

In [17]:
attention_weights(dvs, Q, K, 0)

array([[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
       [1.2767e-05, 9.9999e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00],
       [8.8901e-13, 1.4484e-12, 1.0000e+00, 0.0000e+00, 0.0000e+00],
       [3.4183e-10, 3.6322e-09, 1.0000e+00, 3.0012e-06, 0.0000e+00],
       [3.2482e-11, 2.9234e-11, 1.0000e+00, 2.6669e-09, 1.8745e-08]],
      dtype=float32)

In [18]:
attention_weights(dvs, Q, K, 1)

array([[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
       [7.9251e-06, 9.9999e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00],
       [4.6404e-12, 3.7506e-07, 1.0000e+00, 0.0000e+00, 0.0000e+00],
       [1.3138e-11, 8.1397e-07, 1.0000e+00, 1.5600e-16, 0.0000e+00],
       [1.4078e-11, 9.4644e-07, 1.0000e+00, 4.9375e-16, 1.6364e-12]],
      dtype=float32)

These match the `aw` outputs above. Moving on to the whole self-attention mechanism.

In [19]:
def selfattention(inputs, Q, K, O, V, head_index):
    aw = attention_weights(inputs, Q, K, head_index)
    return aw @ inputs @ headOV(O, V, head_index).T
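
The per-head form works because concatenating the head outputs and applying the output projection gives the same result as projecting each head with its own column block of `O` and summing. The sketch below checks that identity on small random matrices; the sizes and names are illustrative and the arrays are not model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, hidden, n_heads = 4, 8, 2
head_dim = hidden // n_heads

# Output projection weight in Linear layout: (out_features, in_features)
O = rng.standard_normal((hidden, hidden))
# One (seq, head_dim) block of value-weighted outputs per head
head_outputs = rng.standard_normal((n_heads, seq, head_dim))

# Route 1: merge heads along the feature axis, then project (bias omitted)
merged = np.concatenate(list(head_outputs), axis=-1)
via_merge = merged @ O.T

# Route 2: project each head with its column block of O, then sum
via_sum = sum(
    head_outputs[i] @ O[:, i*head_dim:(i+1)*head_dim].T
    for i in range(n_heads)
)

print(np.allclose(via_merge, via_sum))
```

The bias is left out of the sketch because it is added once after the sum, not once per head.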

Summing across heads and adding the bias

In [20]:
result = sum(selfattention(dvs, Q, K, O, V, i) for i in range(num_heads)) + Ob
result[..., :5]

array([[-23.1595,  -4.3884,   1.0211, -35.5319,  53.8781],
       [-23.1733,  -9.9036,   2.5395, -32.8883,  52.2064],
       [-23.7669,  -5.7476,   4.7038, -33.898 ,  52.6261],
       [-20.4753,  -7.8682,  -1.8134, -38.4741,  53.1547],
       [-22.1756,  -2.8764,   0.0873, -33.7324,  54.9025]], dtype=float32)

This matches the final output above :tada: