### Multi-head Attention Matrices


Here we search for a possible mathematical interpretation of the linear transformations involved in the computation of $Q, K, V$ in multi-head attention. With $Q = X\cdot W_Q$ and $K = X\cdot W_K$, the attention scores are

$$ Q\cdot K^T = X\,(W_Q \cdot W_K^T)\,X^T$$

Specifically, we want to answer the following questions:


In single head attention:

- how different is the matrix $W_Q \cdot W_K^T$ from the identity matrix? If we find a significant difference, the attention is doing something more than $X\cdot X^T$, i.e. the projection of the buffer onto itself;

- is $W_Q \cdot W_K^T$ symmetric? If it is symmetric and positive semidefinite, it can be factored as $A\cdot A^T$, and the attention scores (before the softmax) can be seen as $X' \cdot X'^T$ where $X' = X\cdot A$.
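
The two questions above can be made concrete with a small numerical sketch. Everything here is a toy assumption: `X`, `W_Q` and `W_K` are random stand-ins with made-up sizes, not GPT-2 weights. The sketch checks the identity $Q\cdot K^T = X\,(W_Q \cdot W_K^T)\,X^T$ and computes one possible pair of diagnostics (relative Frobenius distances) for "distance from the identity" and "asymmetry".

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 4      # toy sizes, not GPT-2's

X = rng.normal(size=(n_tokens, d_model))  # one row per token
W_Q = rng.normal(size=(d_model, d_head))  # stand-in for one head's query projection
W_K = rng.normal(size=(d_model, d_head))  # stand-in for one head's key projection

Q, K = X @ W_Q, X @ W_K
M = W_Q @ W_K.T                           # the (d_model, d_model) matrix under study

# Both ways of computing the attention scores should agree
identity_holds = np.allclose(Q @ K.T, X @ M @ X.T)

# Relative distance from the identity matrix (Frobenius norm)
eye = np.eye(d_model)
dist_from_identity = np.linalg.norm(M - eye) / np.linalg.norm(eye)

# Relative asymmetry: 0 would mean M is exactly symmetric
asymmetry = np.linalg.norm(M - M.T) / np.linalg.norm(M)

# M is a product of (d_model, d_head) and (d_head, d_model) factors
rank_M = np.linalg.matrix_rank(M)
```

One thing this sketch suggests: for a single head, $W_Q \cdot W_K^T$ has rank at most `d_head`, so when `d_head < d_model` it cannot equal the full identity matrix. The comparison with the identity is therefore a matter of degree.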

**NOTES** Which model should we choose?

- `GPT2Model`: the base transformer; it outputs the hidden states (the embedded buffer, $X$, after all the transformations).
- `GPT2LMHeadModel`: the base transformer plus the layer that computes the token logits, from which the probabilities follow as $p(t_i) \propto \exp(\vec{x}_N\cdot \vec{x}_{t_i})$ for each token $t_i$ in the vocabulary.
- `GPT2DoubleHeadsModel`: has both the language modeling head and a multiple-choice classification head. Used for multiple-choice Q&A.
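
The formula given for `GPT2LMHeadModel` in the list above can be sketched with toy sizes. Here `E` is a random stand-in for the embedding matrix and `x_N` for the final hidden state of the last position; both are assumptions for illustration, not values read from GPT-2.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4                  # toy sizes, not GPT-2's

E = rng.normal(size=(vocab_size, d_model))   # stand-in embedding matrix, one row per token
x_N = rng.normal(size=d_model)               # stand-in final hidden state of the last position

logits = E @ x_N                             # logits[i] = x_N . x_{t_i}

probs = np.exp(logits - logits.max())        # subtract the max for numerical stability
probs /= probs.sum()                         # softmax: p(t_i) proportional to exp(logits[i])

next_token = int(np.argmax(probs))           # greedy choice of the next token
```

Because the softmax is monotonic, the greedy choice is the same whether it is taken over `logits` or over `probs`.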


We need to use `GPT2LMHeadModel.from_pretrained()`. It returns an instance of `GPT2LMHeadModel`, a subclass of `transformers.PreTrainedModel` that inherits all of its methods. Check its documentation.


Also check `transformers.GenerationMixin.generate()` and its documentation.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import numpy as np

# Load pretrained model
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token_id = tokenizer.eos_token_id
inputs = tokenizer(["Today we can go to"], return_tensors="pt")  # attributes: input_ids, attention_mask
print(inputs.input_ids)
outputs = model.generate(**inputs, max_new_tokens = 1,
                        return_dict_in_generate = True,
                        output_scores = True,
                        temperature = 1.0,
                        output_logits = True,
                        output_hidden_states = True,
                        output_attentions = True,
                        pad_token_id = 50256)

type(outputs) # transformers.generation.utils.GenerateDecoderOnlyOutput

tensor([[8888,  356,  460,  467,  284]])


transformers.generation.utils.GenerateDecoderOnlyOutput

Documentation [here](https://huggingface.co/docs/transformers/v4.51.3/en/internal/generation_utils#transformers.generation.GenerateDecoderOnlyOutput)

**Notes**
Scores are equal to logits in the case of greedy decoding, but differ for decoding methods that process the logits, such as beam search or top-k sampling. See the doc page [Generation Strategies](https://huggingface.co/docs/transformers/en/generation_strategies#decoding-strategies) and this [forum discussion](https://discuss.huggingface.co/t/what-is-the-difference-between-logits-and-scores/79796/3).
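
To see why scores can differ from logits, here is a minimal sketch of what a top-k filter does to a logits vector. The function `top_k_filter` is a hypothetical helper written for this note, not the library's implementation, and the input values are made up.

```python
import numpy as np

def top_k_filter(logits, k):
    """Keep the k largest logits and set the others to -inf (toy version).

    Ties at the k-th value are all kept, so more than k entries may survive.
    """
    logits = np.asarray(logits, dtype=float)
    kth_largest = np.sort(logits)[-k]
    return np.where(logits >= kth_largest, logits, -np.inf)

raw_logits = np.array([2.0, 0.5, 1.0, -1.0, 3.0])   # made-up "logits"
processed = top_k_filter(raw_logits, k=2)           # the processed "scores"

# With k equal to the vocabulary size nothing is filtered out
unfiltered = top_k_filter(raw_logits, k=len(raw_logits))
```

After filtering, only the two largest entries remain finite, so a softmax over `processed` assigns zero probability to every other token.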



- `logits` (`tuple(torch.FloatTensor)`, optional, returned when `output_logits=True`): unprocessed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of `torch.FloatTensor` with up to `max_new_tokens` elements (one element for each generated token), with each tensor of shape `(batch_size, config.vocab_size)`.

- `attentions` (`tuple(tuple(torch.FloatTensor))`, optional, returned when `output_attentions=True`): tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of `torch.FloatTensor` of shape `(batch_size, num_heads, generated_length, sequence_length)`.

- `hidden_states` (`tuple(tuple(torch.FloatTensor))`, optional, returned when `output_hidden_states=True`): tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of `torch.FloatTensor` of shape `(batch_size, generated_length, hidden_size)`.
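
To make the `attentions` shape concrete, the following sketch builds causal attention weights of shape `(batch_size, num_heads, seq_len, seq_len)` from random queries and keys. The sizes and the random `Q`, `K` are assumptions for illustration; this is the textbook scaled dot-product attention with a causal mask, not code taken from the library.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_heads, seq_len, d_head = 1, 2, 5, 4   # toy sizes

Q = rng.normal(size=(batch_size, num_heads, seq_len, d_head))
K = rng.normal(size=(batch_size, num_heads, seq_len, d_head))

# Scaled dot-product scores, shape (batch, heads, seq, seq)
scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)

# Causal mask: position i may only attend to positions j <= i
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

# Row-wise softmax (each row keeps at least its diagonal entry, so the max is finite)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Each row of `weights[b, h]` sums to 1, and the entries above the diagonal are exactly zero.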

### Logits

In [80]:
print(outputs.sequences.shape)
print(tokenizer.decode(outputs.sequences[0][0], skip_special_tokens=False))  # first token only
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=False))  # prompt plus generated token

torch.Size([1, 6])
Today
Today we can go to the


In [81]:
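# scores[0] is the first generation step; the second [0] selects the first batch item.
# With greedy decoding these scores equal the raw logits (see the note above).
logits = outputs.scores[0][0]
top_values, top_indices = torch.topk(logits, k=10, largest=True)  # largest=False returns the smallest values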
for idx, val in zip(top_indices.tolist(), top_values.tolist()):
    print(f"Index: {idx}, Value: {val}, Decoded: {tokenizer.decode(idx)}")

Index: 262, Value: -80.64991760253906, Decoded:  the
Index: 257, Value: -81.82210540771484, Decoded:  a
Index: 670, Value: -82.49638366699219, Decoded:  work
Index: 1175, Value: -82.70538330078125, Decoded:  war
Index: 597, Value: -82.73980712890625, Decoded:  any
Index: 3993, Value: -82.82905578613281, Decoded:  sleep
Index: 674, Value: -83.12396240234375, Decoded:  our
Index: 1194, Value: -83.35417175292969, Decoded:  another
Index: 3996, Value: -83.4070816040039, Decoded:  bed
Index: 477, Value: -83.7303237915039, Decoded:  all
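
The logits above are not probabilities yet. As a small worked example, the sketch below takes the top three values copied from the output above and applies a softmax. Note the assumption: normalizing over three tokens only gives probabilities relative to each other, not the model's probabilities over the full vocabulary. The odds ratio between two tokens does not depend on that normalization.

```python
import math

# Top-3 logits copied from the output above
top_logits = {
    " the": -80.64991760253906,
    " a": -81.82210540771484,
    " work": -82.49638366699219,
}

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Probabilities renormalized over these three tokens only
relative_probs = dict(zip(top_logits, softmax(list(top_logits.values()))))

# Odds of " the" versus " a": depends only on the difference of the two logits
odds_the_vs_a = math.exp(top_logits[" the"] - top_logits[" a"])
```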


### Hidden states at the end of each layer

We can follow the buffer as it exits each layer. `hidden_states` contains the hidden states at the output of each layer, plus the initial embedding output. In this case, $1 + 12 = 13$.

In [91]:
print(len(outputs.hidden_states[0]))

13


In [None]:
print(outputs.hidden_states[0][0][0].shape)  # hidden_states[generation step][layer][batch item] is a tensor of shape (input length, D)

torch.Size([5, 768])


# Accessing model parameters

Now we retrieve the model parameters, which are all learned during training and kept fixed during inference. These include:

- the embedding map, E
- the attention matrices, in each layer and each head $(W_Q, W_K, W_V)$ 
- the neural net weights, in each layer


When instantiating the model with `from_pretrained()`, the model is set to evaluation mode by default (via `model.eval()`), which deactivates dropout. To train the model, first set it back to training mode with `model.train()`.
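
The effect of the two modes can be seen on a single `torch.nn.Dropout` module. This is a standalone sketch, not the GPT-2 model itself; `p=0.5` is chosen here only to make the effect obvious (the model printed below uses `p=0.1`).

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # p=0.5 is an illustrative choice
x = torch.ones(1000)

drop.eval()                # evaluation mode: dropout acts as the identity
y_eval = drop(x)

drop.train()               # training mode: entries are zeroed with probability p
y_train = drop(x)          # and the survivors are scaled by 1 / (1 - p)
```

In evaluation mode the output equals the input; in training mode every entry is either `0` or `1 / (1 - p)`.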

In [92]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [99]:
# dir(model) # check all methods and attributes of the class

In [98]:
state_dict = model.state_dict()
for name, weights in state_dict.items():
    print(name, weights.shape)

transformer.wte.weight torch.Size([50257, 768])
transformer.wpe.weight torch.Size([1024, 768])
transformer.h.0.ln_1.weight torch.Size([768])
transformer.h.0.ln_1.bias torch.Size([768])
transformer.h.0.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.0.attn.c_attn.bias torch.Size([2304])
transformer.h.0.attn.c_proj.weight torch.Size([768, 768])
transformer.h.0.attn.c_proj.bias torch.Size([768])
transformer.h.0.ln_2.weight torch.Size([768])
transformer.h.0.ln_2.bias torch.Size([768])
transformer.h.0.mlp.c_fc.weight torch.Size([768, 3072])
transformer.h.0.mlp.c_fc.bias torch.Size([3072])
transformer.h.0.mlp.c_proj.weight torch.Size([3072, 768])
transformer.h.0.mlp.c_proj.bias torch.Size([768])
transformer.h.1.ln_1.weight torch.Size([768])
transformer.h.1.ln_1.bias torch.Size([768])
transformer.h.1.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.1.attn.c_attn.bias torch.Size([2304])
transformer.h.1.attn.c_proj.weight torch.Size([768, 768])
transformer.h.1.attn.c_proj.bias 

In [None]:
embedding_matrix = state_dict["transformer.wte.weight"] # E
lm_head_matrix = state_dict["lm_head.weight"]
torch.equal(embedding_matrix, lm_head_matrix) # True

True

### Layer-specific parameters


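| Parameter | Module | Shape |
|---|---|---|
| `ln_1.weight` | LayerNorm | [768] |
| `ln_1.bias` | LayerNorm | [768] |
| `attn.c_attn.weight` | Conv1D | [768, 2304] |
| `attn.c_attn.bias` | Conv1D | [2304] |
| `attn.c_proj.weight` | Conv1D | [768, 768] |
| `attn.c_proj.bias` | Conv1D | [768] |
| `ln_2.weight` | LayerNorm | [768] |
| `ln_2.bias` | LayerNorm | [768] |
| `mlp.c_fc.weight` | Conv1D | [768, 3072] |
| `mlp.c_fc.bias` | Conv1D | [3072] |
| `mlp.c_proj.weight` | Conv1D | [3072, 768] |
| `mlp.c_proj.bias` | Conv1D | [768] |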


MLP stands for "Multi-Layer Perceptron": it is the feed-forward neural network inside each block.
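
The `attn.c_attn.weight` entry listed above has shape [768, 2304], i.e. three blocks of 768 columns. A possible way to recover per-head matrices is sketched below. It rests on two assumptions that should be checked against the source code linked below: that the columns are laid out as $[W_Q \,|\, W_K \,|\, W_V]$, and that each head owns a contiguous block of 64 columns. The weight itself is replaced by a random stand-in so the sketch runs without loading the model; `head_slice` is a hypothetical helper.

```python
import numpy as np

d_model, n_head = 768, 12
d_head = d_model // n_head     # 64 columns per head

rng = np.random.default_rng(0)
# Random stand-in for state_dict["transformer.h.0.attn.c_attn.weight"], shape (768, 2304)
c_attn_weight = rng.normal(scale=0.02, size=(d_model, 3 * d_model))

# Assumed layout: columns are [W_Q | W_K | W_V], each of shape (768, 768)
W_Q, W_K, W_V = np.split(c_attn_weight, 3, axis=1)

def head_slice(W, h):
    """Columns of W for head h, assuming heads are contiguous blocks of d_head columns."""
    return W[:, h * d_head:(h + 1) * d_head]

h = 0
# Per-head matrix W_Q^h (W_K^h)^T, shape (768, 768), rank at most d_head
M_h = head_slice(W_Q, h) @ head_slice(W_K, h).T
```

With the real weights, `c_attn_weight` would be replaced by `state_dict["transformer.h.0.attn.c_attn.weight"].numpy()`, and `M_h` is the matrix to compare with the identity and to test for symmetry.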

**SOURCE CODE** [HERE](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py)