# Downloading GPT-2 (Generative Pre-trained Transformer) from OpenAI


This notebook contains a beginner's guide to using the Transformers library. It covers:

- 0: installation
- 1: loading model
- 2: basic text generation
- 3: accessing next token probabilities
- 4: accessing hidden states
- 5: accessing model parameters

#### Step 0: Install libraries

You need to install Transformers and PyTorch. Run from a bash shell (inside a notebook cell, prefix each command with `!`):

```{bash}
pip install transformers
pip install torch
```

#### Step 1: Load GPT-2

How can we use GPT-2? Basically, we have two options:

- **the "low level" option**: use GPT-2 specific API (Application Programming Interface). 

  - Provides finer control. You specify attention masks, padding tokens, decoding, etc. You’re interacting **directly** with the model and tokenizer objects. 
  - **Better for customization**, e.g., adding constraints, working with batches, doing masked generation, etc.

    ```{python}
    from transformers import GPT2Tokenizer, GPT2LMHeadModel
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    ```


- **the "high level" option**: use the  `transformers.pipeline` API. 
  - This is a "wrapper" around the model. You don't need to encode or decode anything manually: it handles tokenization, decoding and attention masks under the hood.
  - It is quicker to set up and easier to use.

    ```{python}
    from transformers import pipeline, set_seed
    set_seed(42)
    input_text = ["Hello", "Hello dear!"]
    generator = pipeline('text-generation', model='gpt2', device=-1)
    generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
    ```

Let's look at a first example of text generation, using both methods.

### 2: Basic Text Generation

#### 2.1: GPT-2 API.

GPT-2 is provided as an API (Application Programming Interface) inside the Python library Transformers from HuggingFace.
This library allows us to load pre-trained models and easily use them for tasks like text generation.

From the main page of the Transformers documentation, look for: Transformers/API/Text Models/GPT-2. Alternatively, follow [this link](https://huggingface.co/docs/transformers/en/model_doc/gpt2).

##### A first basic example of text generation.

- **Import the required classes**

We import `transformers.GPT2Tokenizer` and `transformers.GPT2LMHeadModel`. 
These allow us to tokenize (convert text into tokens the model can process) and load the model.
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel
```

- **Model selection (pt.1)**

We load the _smallest_ GPT-2 model and tokenizer by default using:
```python
model_name = "gpt2"  # You can also try "gpt2-medium", "gpt2-large", or "gpt2-xl"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
```
This loads the pre-trained weights from HuggingFace. The `from_pretrained()` method is inherited from `transformers.PreTrainedModel` (along with all its other methods) and lets you load weights, configurations, etc. Check the documentation for details.

- **About the components**

1. _Tokenizer class_: this class takes care of converting text into tokens (integers that correspond to words, subwords, or characters in the model's vocabulary) and vice versa. If you use the flag `return_tensors='pt'`, the tokenizer returns PyTorch tensors (instead of just a list of token IDs). This is necessary because the model expects tensors as input.

Example:
```python
inputs = tokenizer("Once upon a time", return_tensors='pt')
```

2. _LMHeadModel class_: this class represents the GPT-2 model with a language modeling head (a linear layer that produces the scores from which the probability distribution over the vocabulary for the next token is computed). The model predicts the next token given the previous tokens. This head should not be confused with the attention heads inside each transformer block: there are several of those, and you can think of them as "specialists" that each focus on a different aspect of language (syntax, long-range dependencies, punctuation, ...).


3. _Padding_: when you provide multiple input sequences (e.g., a batch of sentences), they will often have different lengths.
Neural networks need inputs of the same size, so padding tokens are added (to the right or to the left of the sequences) to make them the same length.

Example (right padding to length 3, with 0 as a placeholder pad ID):
"Hello" → [15496, 0, 0]
"Hello world" → [15496, 995, 0]

NOTICE: Here 0 only stands in for a padding token. GPT-2 doesn't have an official padding token (in its actual vocabulary, ID 0 is the token `!`, as the encoded input of "Hello dear!" in section 2.1 shows), but you can define one if needed.

4. _Attention mask_:
The model needs to know which tokens are real content and which are padding.
The attention mask is a tensor with:

1 → position contains a real token (to attend to)

0 → position contains padding (to ignore)

The tokenizer generates this mask automatically and it is returned by the tokenizer under the key “attention_mask”.

Example:
```python
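# GPT-2 has no pad token by default, so one must be set before using padding=True
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer(["Hello", "Hello world"], return_tensors="pt", padding=True)
inputs["attention_mask"]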
```
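To make padding and the attention mask concrete without loading the tokenizer, here is a minimal pure-Python sketch. The helper `pad_batch` is hypothetical (it is not part of Transformers); it only mimics what the tokenizer returns under `input_ids` and `attention_mask`, using the token IDs quoted above and GPT-2's `<|endoftext|>` ID (50256) as the pad token.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad lists of token IDs to the longest length and build the matching masks.

    Hypothetical helper for illustration only, not a Transformers function.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads = [pad_id] * n_pad
        real = [1] * len(seq)          # 1 -> real token
        ignored = [0] * n_pad          # 0 -> padding
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(ignored + real)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(real + ignored)
    return input_ids, attention_mask


# "Hello" -> [15496], "Hello world" -> [15496, 995] (IDs from the example above)
batch = [[15496], [15496, 995]]
EOS_ID = 50256  # GPT-2's <|endoftext|> ID, reused as pad token

ids, mask = pad_batch(batch, pad_id=EOS_ID, side="left")
print(ids)   # [[50256, 15496], [15496, 995]]
print(mask)  # [[0, 1], [1, 1]]
```

With `side="right"` the pad IDs and the zeros of the mask move to the end of each row instead.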

**More about model selection (there are many other possibilities beyond the one shown above)...**

- **GPT2Model**: the bare GPT-2 transformer, i.e. the core network. It includes token embeddings, positional embeddings, stacked transformer blocks (self-attention + feed-forward layers + layer norms), and the final hidden states (the outputs of the transformer layers). It does not include a final linear layer that maps hidden states to vocabulary logits (i.e. no prediction head).
It outputs the hidden states: the embedded buffer, X, after all the transformations, namely the contextualized token representations. It is useful when you want to:

1. use the transformer's attention and representation power, but apply your own task-specific head (e.g. classification, regression, custom scoring);

2. analyze or visualize the hidden states or attention weights.

Example:
```python
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model(**inputs)

hidden_states = outputs.last_hidden_state  # shape: (batch_size, sequence_length, hidden_dim)
```
Here, "hidden_states" contains the vector representations for each token after the transformer layers.

- **GPT2LMHeadModel**: the GPT-2 transformer plus a language modeling head on top. It outputs one unnormalized score per vocabulary token (the **logits**); applying a softmax turns them into next-token probabilities, so you can do text generation or compute next-token likelihoods.

Formula:
$$
p(t_i) \propto \exp(\vec{x}_N \cdot \vec{x}_{t_i})
$$
where $\vec{x}_N$ is the final hidden state for the last token, and $\vec{x}_{t_i}$ is the embedding of candidate token $t_i$ (this is computed $\forall t_i$).

_NOTICE_: There are two types of _heads_ you should not confuse:

1. **Attention heads**: individual parallel attention mechanisms inside each transformer layer. 

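For example, GPT-2 has $n_{head} = 12$: each block has 12 attention heads.
In each head, we compute:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

where $Q$, $K$ and $V$ are the query, key and value matrices of that head, and $d_k$ is the dimension of the key vectors.

2. **LM head (language modeling head)**: linear projection (dense layer) that maps hidden states to vocabulary logits. It maps each token's final hidden vector to one score per vocabulary token; a softmax over these scores gives the probability distribution used to generate text.

$$ \text{logits} = \text{hidden}_{\text{state}} \cdot W_{\text{lm}}^T + b_{\text{lm}} $$

where:

- hidden_state: output from last transformer layer
- W_lm: matrix mapping hidden dimension → vocab size
- b_lm: optional bias (GPT-2's `lm_head` is created with `bias=False`, as the architecture printed in section 5 shows, so this term is absent)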

- **GPT2DoubleHeadsModel**: has both the language modeling head and an additional classification head. Useful for multiple-choice tasks (e.g., pick the right ending to a story).

Also check `transformers.GenerationMixin.generate()` and its documentation.
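The attention formula from the note on attention heads can be sketched for a single head in plain Python. This is only an illustration of $\text{softmax}(QK^T/\sqrt{d_k})V$ on toy matrices: the function names and the numbers are made up, and the causal mask that GPT-2 additionally applies (each token attends only to itself and to earlier tokens) is left out for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head (no causal mask in this sketch)."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One score per key: q . k / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


# Toy example: 2 tokens, d_k = 2 (in GPT-2 small each head works with 768 / 12 = 64 dimensions)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
print(result)  # each row is a weighted average of the rows of V
```

Each query matches its own key best here, so row 0 of the result lies closer to `V[0]` and row 1 closer to `V[1]`.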

In [None]:
# Import tokenizer and model
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Padding in the following is on the left because GPT-2 is decoder-only and generates left to right:
# new tokens are appended after the last position, so that position must hold a real token, not padding.
# Briefly:
#   ENCODER -> processes the entire input sequence at once and produces a contextual representation for every input token
#   DECODER -> generates the output sequence token by token

tokenizer = GPT2Tokenizer.from_pretrained("gpt2", padding_side = "left")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Define the input (different lengths, so padding will be needed)
input_text = ["Hello", "Hello dear!"]

# Set the pad token 
# GPT-2 doesn't have a native pad token (it was trained without one)
# So here you're telling it "when padding, use the EOS (End Of Sequence) as padding token"
tokenizer.pad_token = tokenizer.eos_token

# Tokenize with padding (strings -> token IDs)
# Returns a dictionary
# {
#  'input_ids': tensor of shape (2, max_sequence_length),
#  'attention_mask': tensor of 1s (for real tokens) and 0s (for pads)
# }
# Ex:
# input_ids =
# [[ <pad> <pad> token1 token2 ],
#  [ <pad> token1 token2 token3 ]]
# attention_mask =
# [[0 0 1 1],
#  [0 1 1 1]]
padded_sequences = tokenizer(input_text, padding=True, return_tensors="pt")

# Generate the output
output = model.generate(padded_sequences["input_ids"], # input tokens (the model produces a continuation of them)
                        attention_mask=padded_sequences["attention_mask"], # makes the model ignore pad tokens (mask = 0 on pad positions)
                        pad_token_id=tokenizer.eos_token_id, # token ID to use as pad during generation
                        max_length=50, # maximum total length in tokens (prompt + generated tokens)
                        do_sample=True, # if True, sample the next token at random instead of always choosing the most likely one (more varied output)
                        top_k=5) # restrict sampling to the 5 most likely tokens; only used when do_sample=True

# Loop through and print
# NOTICE: encoding and decoding here have nothing to do with encoder/decoder blocks of the transformer
# Encoded input -> token IDs (integers) that correspond to the input string (what the tokenizer outputs when it “encodes” text into model input)
# Decoded input -> converting those token IDs back into text (what the tokenizer does when it “decodes” token IDs into readable text)
for i in range(output.shape[0]):
    print(f"Sequence {i}: ")
    print("Decoded input: ", tokenizer.decode(padded_sequences["input_ids"][i]))
    print("Encoded input: ", padded_sequences["input_ids"][i])
    print("Attention mask: ", padded_sequences["attention_mask"][i])
    print("Decoded output: ", tokenizer.decode(output[i],skip_special_tokens=False))
    print("_________________________")

Sequence 0: 
Decoded input:  <|endoftext|><|endoftext|>Hello
Encoded input:  tensor([50256, 50256, 15496])
Attention mask:  tensor([0, 0, 1])
Decoded output:  <|endoftext|><|endoftext|>Hello, I am not sure what I would like to hear. What would you like? What would you like me to hear? You're not sure.

You know what I would like to know? You're not sure. What
_________________________
Sequence 1: 
Decoded input:  Hello dear!
Encoded input:  tensor([15496, 13674,     0])
Attention mask:  tensor([1, 1, 1])
Decoded output:  Hello dear!

This is my first post, so I'll start with my thoughts on how to use this tool, how to make it easy for you to use and how it works for you. The first thing I will do is to make
_________________________


_NOTICE:_

In sequence 0, 50256 is GPT-2's `eos_token_id`, corresponding to the token `<|endoftext|>`, which is used here as the pad token.
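The effect of `do_sample=True` together with `top_k` in the cell above can be illustrated with a small standard-library sketch. The function `sample_top_k` and the toy logits are hypothetical; the code shows the idea of top-k sampling (keep the k best-scoring tokens, renormalize, draw one at random), not the actual implementation inside `generate()`.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Draw one token ID at random among the k highest-scoring ones."""
    # Indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to those k tokens (unnormalized weights are enough for choices())
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0, 3.0]  # made-up scores for a 6-token vocabulary
rng = random.Random(0)                        # fixed seed, playing the role of set_seed()
draws = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # only IDs 5 and 0, the two highest-scoring tokens, can be drawn
```

With `k=1` the draw is deterministic and reduces to greedy decoding: the highest-scoring token is always returned.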

#### 2.2: `transformers.pipeline` API.

Now we run the same example as above, but using the higher-level interface provided by the `pipeline` function. Encoding and decoding are handled under the hood.

In [None]:
# Import pipeline: Hugging Face's high-level helper that wraps model + tokenizer + generation logic into one easy interface
from transformers import pipeline, set_seed

# Set a seed (to make the output reproducible)
set_seed(42)

# Define the input
input_text = ["Hello", "Hello dear!"]

# Do the generation
# For a list of inputs, the output of the pipeline is a list of lists
# Outer list: one entry per input prompt
# Inner list: one entry per returned sequence (if num_return_sequences = 1, inner list has one item)
# So:
# [
#  [
#    {"generated_text": "original_input_text + model_generated_continuation"}
#  ]
# ]
generator = pipeline('text-generation', model='gpt2', device=-1) # Use device=0 for GPU, or device=-1 for CPU
output = generator(input_text,
                    pad_token_id = 50256,
                    truncation = True,
                    max_length=30,
                    temperature=0.1,
                    num_return_sequences=1)

for idx, field in enumerate(output):
    print() # Put a space from previous automatically generated output
    print(f"Sequence {idx}: ")
    print("Decoded input: ", input_text[idx])
    # print("Field:", field)
    print("Decoded Output:", field[0]["generated_text"]) # Each field corresponds to a single input prompt
    print("_________________________")

Device set to use cpu



[{'generated_text': "Hello, I'm not sure if you're aware of the fact that I'm a member of the American Association of Chiefs of Police. I'm a"}]
Sequence 0: 
Decoded input:  Hello
Decoded Output: Hello, I'm not sure if you're aware of the fact that I'm a member of the American Association of Chiefs of Police. I'm a
_________________________

[{'generated_text': "Hello dear! I'm sorry, but I'm not sure what to do. I'm not sure if I should go back to the hospital or not"}]
Sequence 1: 
Decoded input:  Hello dear!
Decoded Output: Hello dear! I'm sorry, but I'm not sure what to do. I'm not sure if I should go back to the hospital or not
_________________________


### 3: How to access token probabilities

For this - as well as many other things - you have to use the low level GPT2 API. 

The **logits** are the scalar products between the last row of the buffer, after transformation, and each of the embedded tokens in the vocabulary:

$$ \vec{x}_N \cdot E^T $$
 (here $^T$ stands for "transposed")

_NOTICE_: $E$ is a $K \times D$ matrix ($K$ = vocabulary size, $D$ = embedding dimension) such that the embedded sequence (embedded buffer) is given by $X_{seq} = T_{matrix} \cdot E$, where $T_{matrix}$ is an $N \times K$ matrix ($N$ = number of tokens in the buffer) whose rows are the one-hot encodings of the tokens.

(In practice, however, $X_{seq} = E[token\_ids]$, which is conceptually the same but cheaper to compute.)

N.B.: the logits do not simply reproduce the one-hot rows of $T_{matrix}$, because the original $X_{seq}$ has been transformed by the layers in between. The logit vector ($\in \mathbb{R}^K$) contains one score for each token in the vocabulary.
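A toy example may help to see that the one-hot product $T_{matrix} \cdot E$ and the row lookup $E[token\_ids]$ give the same result, and how the logits arise as scalar products with the rows of $E$. The matrix `E`, the token IDs and the vector `x_N` below are invented for illustration (K = 4, D = 2); in the real model $\vec{x}_N$ comes out of the transformer layers.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy embedding matrix E: K = 4 vocabulary entries, D = 2 dimensions (made-up values)
E = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [2.0, -1.0]]
token_ids = [2, 0, 3]  # a buffer of N = 3 tokens

# One-hot route: T_matrix is N x K, then X_seq = T_matrix . E
T_matrix = [[1.0 if j == t else 0.0 for j in range(len(E))] for t in token_ids]
X_matmul = matmul(T_matrix, E)

# Lookup route: X_seq = E[token_ids]
X_lookup = [E[t] for t in token_ids]
print(X_matmul == X_lookup)  # True

# Logits: scalar product of a (made-up) final hidden state with every row of E
x_N = [1.0, 2.0]
logits = [sum(x * e for x, e in zip(x_N, row)) for row in E]
print(logits)  # [1.0, 2.0, 3.0, 0.0] -> one score per vocabulary token
```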

The **probabilities** are given by:

$$ p \propto \exp\left(\frac{\vec{x}_N \cdot E^T}{T}\right) $$

in particular:

$$ \text{softmax}\left(\frac{\vec{x}_N \cdot E^T}{T}\right) $$

where the $T$ in the denominator is the temperature parameter (not to be confused with the transpose in $E^T$).
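As a quick numerical check of the role of the temperature, the following sketch applies a softmax with temperature to three made-up logits. The function name `softmax_with_temperature` is ours, not a Transformers API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities: softmax(logits / temperature)."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
for T in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(toy_logits, temperature=T)
    print(T, [round(p, 3) for p in probs])
```

A low temperature concentrates almost all the probability on the highest logit, while a high temperature pushes the distribution towards uniform.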



#### First way:

In [None]:
# Loading tokenizer and model, setting the token for padding
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", padding_side = "left")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Define a new prompt and tokenize it
prompt = "The American flag's colors are red, blue and"
inputs = tokenizer(prompt, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])

# Get the outputs
import torch
with torch.no_grad(): # during inference you don't need gradients (in training they are needed so PyTorch can compute weight updates via backpropagation)
                      # this saves memory and computation because intermediate results for gradients are not stored
    outputs = model(**inputs) # pass the tokenized input to the model and get the model's output
                              # ** unpacks the dictionary "inputs" into keyword arguments (so it's the same as model(input_ids=..., attention_mask=...) )

# Get the logits (via the `logits` attribute of the model output)
logits = outputs.logits  # Shape: [1, seq_len, vocab_size]
                            # 1 -> batch size (you passed one input sequence)
                            # seq_len -> number of tokens in your prompt: N
                            # vocab_size -> number of tokens in GPT-2's vocabulary (50257 for base GPT-2): K

# Print logits size
print(logits.shape)

# Remove the batch dimension (since it is 1) and print [seq_len, vocab_size] raw logits matrix
# Each row corresponds to one position in the prompt
# Each row is a vector of 50257 values - one for each possible next token at that position
print()
print(logits[0])

torch.Size([1, 10, 50257])

tensor([[-36.2872, -35.0111, -38.0791,  ..., -40.5161, -41.3758, -34.9191],
        [-85.1435, -82.5817, -88.0494,  ..., -88.4072, -90.8886, -84.2703],
        [-86.6003, -85.0928, -92.4016,  ..., -98.3911, -91.8806, -89.0551],
        ...,
        [-86.1226, -85.5085, -86.6623,  ..., -95.5519, -89.5766, -85.6829],
        [ -0.4879,   1.0927,  -3.0591,  ..., -11.6097,  -8.8209,  -1.0988],
        [-74.8958, -72.4673, -75.6806,  ..., -83.4975, -78.3614, -74.6660]])


#### Second way:

When calling `model.generate()`, set the flags for the outputs you want to `True`.

```python
model.generate(
    ...,
    return_dict_in_generate=True,
    output_scores=True,
    output_logits=True,
    output_hidden_states=True,
    output_attentions=True
)
```
You're asking generate() to:

1. Return detailed intermediate outputs (not just the generated sequences)
2. Package them in a `GenerateDecoderOnlyOutput` object, a structured dictionary-like object. Inside this object we have sequences, scores, logits, hidden_states, attentions...

Documentation [here](https://huggingface.co/docs/transformers/v4.51.3/en/internal/generation_utils#transformers.generation.GenerateDecoderOnlyOutput)

_NOTICE:_
Scores are equal to logits in the case of greedy decoding, but differ for other decoding methods such as beam search or top-k sampling.

- Greedy decoding -> the model just picks the token with the highest logit at each step: no extra re-weighting is applied (the score of a token is just its logit).

- Beam search/Top-k/Sampling -> these methods track multiple hypotheses at once or apply filters. They modify or re-weight the logits during generation, so the score of a token under these strategies is no longer the raw logit: it reflects the additional constraints/adjustments made during decoding.

See the doc page [Generation Strategies](https://huggingface.co/docs/transformers/en/generation_strategies#decoding-strategies). See [here](https://discuss.huggingface.co/t/what-is-the-difference-between-logits-and-scores/79796/3) for forum discussion.
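The difference between logits and scores can be illustrated with a toy post-processing step. The function `process_logits` below is a hypothetical stand-in for what happens between raw logits and scores; it assumes only two adjustments, temperature scaling and a top-k filter that sets every other entry to $-\infty$, and does not reproduce the library's actual code path.

```python
def process_logits(logits, temperature=1.0, top_k=None):
    """Toy version of the logits -> scores step (assumption: temperature + top-k only)."""
    scores = [v / temperature for v in logits]
    if top_k is not None:
        # Keep the top_k largest scores, mask everything else with -inf
        threshold = sorted(scores, reverse=True)[top_k - 1]
        scores = [s if s >= threshold else float("-inf") for s in scores]
    return scores


raw_logits = [1.0, 3.0, 2.0, 0.5]  # made-up values

# No adjustment (the greedy case): scores coincide with the raw logits
print(process_logits(raw_logits))                   # [1.0, 3.0, 2.0, 0.5]

# Top-k filtering: tokens outside the top 2 can no longer be selected
print(process_logits(raw_logits, top_k=2))          # [-inf, 3.0, 2.0, -inf]

# Temperature: all scores are rescaled
print(process_logits(raw_logits, temperature=0.5))  # [2.0, 6.0, 4.0, 1.0]
```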

The fields of the returned object are described below, followed by an example:


<details>
<summary>sequences</summary>

* **Always returned**
* **Type:** `torch.LongTensor`
* **Content:** Final generated token IDs
* **Shape:** `(batch_size, sequence_length)`

</details>


<details>
<summary>logits</summary>

* **Returned when:** `output_logits=True`
* **Type:** `tuple(torch.FloatTensor)`
* **Content:** Unprocessed prediction scores before softmax
* **Shape of each tensor:** `(batch_size, vocab_size)`
* **Tuple length:** up to `max_new_tokens`

</details>


<details>
<summary>scores</summary>

* **Returned when:** `output_scores=True`
* **Type:** `tuple(torch.FloatTensor)`
* **Content:** The scores used at each generation step for sampling or selection
* **Shape of each tensor:** `(batch_size, vocab_size)`
* **Tuple length:** up to `max_new_tokens`

</details>


<details>
<summary>attentions</summary>

* **Returned when:** `output_attentions=True`
* **Type:** `tuple(tuple(torch.FloatTensor))`
* **Content:** Attention maps
* **Shape of each tensor:** `(batch_size, num_heads, generated_length, sequence_length)`
* **Structure:**

  * Outer tuple: one per generated token
  * Inner tuple: one per decoder layer

</details>


<details>
<summary>hidden_states</summary>

* **Returned when:** `output_hidden_states=True`
* **Type:** `tuple(tuple(torch.FloatTensor))`
* **Content:** Hidden state activations
* **Shape of each tensor:** `(batch_size, generated_length, hidden_size)`
* **Structure:**

  * Outer tuple: one per generated token
  * Inner tuple: one per decoder layer

</details>


In [7]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import numpy as np

# Load pretrained model
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token_id = tokenizer.eos_token_id
inputs =  tokenizer(["Today we can go to"], return_tensors="pt") # attributes: input_ids, attention_mask

# Print input token IDs (one per input token)
print("Input ids:", inputs.input_ids, "\n")

# Generate the output: the flag marked "here" tells generate() to return
#  a structured object that contains not just the sequences,
#  but also extra info (e.g., scores, hidden states, attentions...)
# If you look at the source code, you'll see:
# return GenerateDecoderOnlyOutput(...)
outputs = model.generate(**inputs, max_new_tokens = 2,
                        return_dict_in_generate = True, # here
                        output_scores = True,
                        temperature = 1.0,
                        output_logits = True,
                        output_hidden_states = True,
                        output_attentions = True,
                        pad_token_id = 50256)

# Check this by printing the class/type of outputs (not a plain tensor, list or dict!)
print("Outputs class:", type(outputs), "\n") 

# Prints
print("Outputs sequences shape:", outputs.sequences.shape, "\n") # tensor of shape (batch_size, prompt_length + number of generated tokens)
print("First output word:", tokenizer.decode(outputs.sequences[0][0], skip_special_tokens =False), "\n") # First token of the first sequence, decoded to a string; sequences include the prompt, so this is the first prompt token
print("Whole output:",tokenizer.decode(outputs.sequences[0][:], skip_special_tokens =False), "\n") # Same for the full first sequence of output token IDs

# Take the scores for the first generated token (for the first sample in the batch)
# With greedy decoding, as here, the scores coincide with the raw logits (see the NOTICE above)
print(tokenizer.decode(outputs.sequences[0][:]))
logits = outputs.scores[0][0]
# Find the top-5 highest logits (scores) across the vocabulary
# Values -> actual logit scores
# Indices -> vocab indices (token IDs) corresponding to these scores
top_values, top_indices = torch.topk(logits, k=5, largest=True)  # or largest=False for smallest
# For each, print index, value and decoded word
for idx, val in zip(top_indices.tolist(), top_values.tolist()):
    print(f"Index: {idx}, Value: {val}, Decoded: {tokenizer.decode(idx)}")

print("\n\n")


# Same for the second generated token
logits = outputs.scores[1][0]
# Find the top-5 highest logits (scores) across the vocabulary
# Values -> actual logit scores
# Indices -> vocab indices (token IDs) corresponding to these scores
top_values, top_indices = torch.topk(logits, k=5, largest=True)  # or largest=False for smallest
# For each, print index, value and decoded word
for idx, val in zip(top_indices.tolist(), top_values.tolist()):
    print(f"Index: {idx}, Value: {val}, Decoded: {tokenizer.decode(idx)}")

Input ids: tensor([[8888,  356,  460,  467,  284]]) 

Outputs class: <class 'transformers.generation.utils.GenerateDecoderOnlyOutput'> 

Outputs sequences shape: torch.Size([1, 7]) 

First output word: Today 

Whole output: Today we can go to the next 

Today we can go to the next
Index: 262, Value: -80.64991760253906, Decoded:  the
Index: 257, Value: -81.82210540771484, Decoded:  a
Index: 670, Value: -82.49638366699219, Decoded:  work
Index: 1175, Value: -82.70538330078125, Decoded:  war
Index: 597, Value: -82.73980712890625, Decoded:  any



Index: 1306, Value: -88.45703887939453, Decoded:  next
Index: 886, Value: -89.5257339477539, Decoded:  end
Index: 2003, Value: -89.5971450805664, Decoded:  future
Index: 966, Value: -89.78964233398438, Decoded:  point
Index: 9231, Value: -89.84014892578125, Decoded:  polls


### 4: Accessing hidden states


Hidden states are the token representations in the embedding space $\mathbb{R}^D$.

We can follow the buffer as it exits each of the layers: the model returns the hidden states at the output of each layer, plus the initial embedding output. In this case there are $1 + 12 = 13$ of them.

In [None]:
# Check the number of hidden-state tensors for the first generation step (12 decoder layers + embedding output)
print("Hidden states number:", len(outputs.hidden_states[0]), "\n")

# Print shape of hidden states
# outputs.hidden_states is a tuple over generation steps; each entry is a tuple with one tensor per layer
print("Shape hidden states attribute:", np.shape(outputs.hidden_states[0]))
print("So, the hidden states of one generation step are shaped (layers (including embedding), batch_size, seq_length, hidden_size)\n")

# Get the hidden vector of the first token position at each layer for the first generated token
#first_token_hidden_states = [layer[0] for layer in outputs.hidden_states[0]]

#print(outputs.hidden_states[0][0]) # generation step 0, layer 0, shape (1, 5, 768)
#print(outputs.hidden_states[0][0].shape) # generation step 0, layer 0, shape (1, 5, 768)

layer_idx = 1                                                  # 0 to 12 (included)
hidden_states_layer_1 = outputs.hidden_states[0][layer_idx][0] # shape (5, 768) = (token sequence length, D)

# Print the shape (NOTICE: the batch dimension of size 1 is no longer included)
print(hidden_states_layer_1.shape)
for i, h in enumerate(hidden_states_layer_1):
    print(f"Token {i}: \n")
    print(h)
    print("\n\n")

Hidden states number: 13 

Shape hidden states attribute: (13, 1, 5, 768)
So, the hidden states of one generation step are shaped (layers (including embedding), batch_size, seq_length, hidden_size)

torch.Size([5, 768])
Token 0: 

tensor([ 1.5737e+00, -4.1554e-01,  4.5012e-01,  8.7572e-01, -8.9828e-01,
        -9.3156e-01,  6.4798e-01, -1.7194e+00, -1.3558e+00, -9.4803e-01,
        -1.3937e+00,  1.0653e+00,  1.3543e-01, -2.5946e-01,  2.2569e+00,
         1.6581e+00,  6.1315e-02, -3.6221e-01, -5.4917e-01, -2.3977e-01,
        -1.1843e-01, -8.6638e-02, -1.6552e+00, -1.3035e+00,  2.1897e-01,
        -1.5130e-01, -1.1095e+00,  5.3940e-01, -9.3040e-01, -1.6228e+00,
         1.7367e-01, -7.7553e-02, -2.1432e-01,  1.4262e+00, -4.2878e-01,
         6.7391e-01, -1.3886e+00,  1.5062e+00, -1.8832e-01,  7.2701e-01,
        -5.0284e-01, -5.7902e-01,  1.1196e+00, -1.2278e+00,  1.4119e-02,
         5.3088e-01,  8.9699e-01, -3.9212e-01, -2.8636e+00,  4.7158e-01,
        

### 5: Accessing model parameters

Now we retrieve the model parameters, which are all learned during training and kept fixed during inference. These include:

- the embedding map, E
- the attention matrices, in each layer and each head $(W_Q, W_K, W_V)$
- the feed-forward network (MLP) weights, in each layer


_NOTICE_: 

When instantiating the model with `from_pretrained()`, dropout is deactivated by default, because the model is automatically set to evaluation mode (`model.eval()`).

To train or fine-tune the model, you should first set it back to training mode with `model.train()`: this reactivates dropout.
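What "deactivating dropout" means can be shown with a minimal sketch. The function `dropout` below is a hypothetical pure-Python illustration of (inverted) dropout, not PyTorch's implementation: in training mode each value is zeroed with probability `p` and the survivors are rescaled by `1 / (1 - p)`, while in evaluation mode the input passes through unchanged.

```python
import random


def dropout(values, p, training, rng):
    """Illustrative inverted dropout on a list of floats."""
    if not training or p == 0.0:
        return list(values)            # evaluation mode: identity
    keep = 1.0 - p
    # Training mode: drop with probability p, rescale survivors by 1 / (1 - p)
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]

print(dropout(hidden, p=0.1, training=False, rng=rng))  # [1.0, 2.0, 3.0, 4.0]
print(dropout(hidden, p=0.1, training=True, rng=rng))   # some entries may be zeroed, the rest divided by 0.9
```

The value `p=0.1` matches the `Dropout(p=0.1, ...)` modules in the architecture printed below.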

In [31]:
# Check model architecture
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

_NOTICE:_

The model has two main components:
- transformer -> core of GPT-2, made of embeddings + layers
- lm_head -> final layer that maps hidden states to logits (one score per vocabulary token)

Inside `transformer` we have:
- token embeddings (wte) -> Maps token IDs (0..50256) to 768-dimensional vectors
- positional embedding (wpe) -> Adds position information for up to 1024 positions
- 12 transformer blocks (ModuleList) -> each of the 12 blocks has two LayerNorms, an attention module and an MLP (feed-forward net). The attention involves:

    - c_attn: computes queries, keys, values (3 × 768 = 2304 outputs)
    - c_proj: projects back to hidden size
    - attn_dropout: dropout in attention (active only in train mode)
    - resid_dropout: dropout on residual connections


We can furthermore retrieve all learned parameters of the model as an OrderedDict.

state_dict: maps parameter names (strings) → tensors (weights or biases)

In [32]:
# Loop to see all weights tensors' names and shape
state_dict = model.state_dict()
for name, weights in state_dict.items():
    print(name, weights.shape)

transformer.wte.weight torch.Size([50257, 768])
transformer.wpe.weight torch.Size([1024, 768])
transformer.h.0.ln_1.weight torch.Size([768])
transformer.h.0.ln_1.bias torch.Size([768])
transformer.h.0.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.0.attn.c_attn.bias torch.Size([2304])
transformer.h.0.attn.c_proj.weight torch.Size([768, 768])
transformer.h.0.attn.c_proj.bias torch.Size([768])
transformer.h.0.ln_2.weight torch.Size([768])
transformer.h.0.ln_2.bias torch.Size([768])
transformer.h.0.mlp.c_fc.weight torch.Size([768, 3072])
transformer.h.0.mlp.c_fc.bias torch.Size([3072])
transformer.h.0.mlp.c_proj.weight torch.Size([3072, 768])
transformer.h.0.mlp.c_proj.bias torch.Size([768])
transformer.h.1.ln_1.weight torch.Size([768])
transformer.h.1.ln_1.bias torch.Size([768])
transformer.h.1.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.1.attn.c_attn.bias torch.Size([2304])
transformer.h.1.attn.c_proj.weight torch.Size([768, 768])
transformer.h.1.attn.c_proj.bias 

Some examples of extraction...

In [33]:
# Extracts the word/token embedding matrix E from the state_dict
embedding_matrix = state_dict["transformer.wte.weight"] 

# Extract the weights of the language modeling head
lm_head_matrix = state_dict["lm_head.weight"]

# Check whether embedding_matrix and lm_head_matrix contain exactly the same values
torch.equal(embedding_matrix, lm_head_matrix) # True

True

_NOTICE:_

In GPT-2, weight tying is used by design: the same matrix is shared by the
input token embeddings and the output linear layer that produces the logits.

Why?

Tying input/output embeddings reduces the number of parameters: the $50257 \times 768$ matrix is stored only once.
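To put a number on the saving, we can count parameters directly from the shapes printed by `state_dict` above. The sketch below assumes that all 12 blocks have the same shapes as block 0 (the architecture lists `12 x GPT2Block`) and that the final `ln_f` has a weight and a bias of size 768 each; the variable names are ours.

```python
VOCAB, N_POS, D, N_LAYERS = 50257, 1024, 768, 12

# Shapes of one transformer block, copied from the state_dict listing (block 0)
block_shapes = {
    "ln_1.weight": (768,), "ln_1.bias": (768,),
    "attn.c_attn.weight": (768, 2304), "attn.c_attn.bias": (2304,),
    "attn.c_proj.weight": (768, 768), "attn.c_proj.bias": (768,),
    "ln_2.weight": (768,), "ln_2.bias": (768,),
    "mlp.c_fc.weight": (768, 3072), "mlp.c_fc.bias": (3072,),
    "mlp.c_proj.weight": (3072, 768), "mlp.c_proj.bias": (768,),
}


def numel(shape):
    """Number of entries in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


per_block = sum(numel(s) for s in block_shapes.values())
embeddings = VOCAB * D + N_POS * D   # wte + wpe
final_ln = 2 * D                     # ln_f weight + bias (assumed from the architecture print)
total_tied = embeddings + N_LAYERS * per_block + final_ln
tying_saving = VOCAB * D             # lm_head reuses wte instead of storing its own matrix

print(f"{per_block:,}")                  # 7,087,872 per block
print(f"{total_tied:,}")                 # 124,439,808 with tied weights
print(f"{total_tied + tying_saving:,}")  # 163,037,184 if lm_head had its own matrix
```

Under these assumptions, tying saves about 38.6 million parameters, close to a quarter of what the untied model would need.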

