# Downloading GPT-2 (Generative Pre-trained Transformer) from OpenAI


This notebook contains a beginner's guide to using the `transformers` library. It covers:

- 0: installation
- 1: loading model
- 2: basic text generation
- 3: accessing next token probabilities
- 4: accessing hidden states
- 5: accessing model parameters

#### Step 0: Install libraries

You need to install `transformers` and PyTorch. Run from a bash shell (inside a notebook cell, prefix each command with `!`):

```{bash}
pip install transformers
pip install torch
```

#### Step 1: Load GPT-2

How can we use GPT-2? Basically, we have two options:

- **the "low level" option**: use the GPT-2 specific API.

  - Provides finer control. You specify attention masks, padding tokens, decoding, etc. You're interacting **directly** with the model and tokenizer objects.
  - **Better for customization**, e.g. adding constraints, working with batches, doing masked generation, etc.

    ```{python}
    from transformers import GPT2Tokenizer, GPT2LMHeadModel
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    ```


- **the "high level" option**: use the `transformers.pipeline` API.
  - This is a "wrapper" around the model. You don't need to manually encode or decode anything: it automatically handles tokenization, decoding and attention masks under the hood.
  - It is quicker to set up and easier to use.

    ```{python}
    from transformers import pipeline, set_seed
    set_seed(42)
    input_text = ["Hello", "Hello dear!"]
    generator = pipeline('text-generation', model='gpt2', device=-1)
    generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
    ```

Let's look at a first example of text generation, using both methods.

### 2: Basic Text Generation

#### 2.1: GPT-2 API.

GPT-2 is provided as an API (Application Programming Interface) inside the Python library "Transformers" from HuggingFace.

From the main page of the Transformers documentation, look for: Transformers/API/Text Models/GPT-2. Alternatively, follow [this link](https://huggingface.co/docs/transformers/en/model_doc/gpt2).

##### A first basic example of text generation.

- Import `transformers.GPT2Tokenizer` and `transformers.GPT2LMHeadModel`. The argument `"gpt2"` passed to `from_pretrained()` loads the GPT-2 model and tokenizer in the base version (i.e. the smallest model). If you want, you can specify other variants: `"gpt2-medium"`, `"gpt2-large"`, `"gpt2-xl"`.

- **GPT2Tokenizer class**: Contains both the encoder and the decoder. The flag `return_tensors='pt'` tells the tokenizer to return PyTorch tensors (not just lists), because that's what the model expects.

- **GPT2LMHeadModel class**: The GPT-2 model architecture that generates text (predicts the next token based on the previous tokens).

- **Padding**: The input text is always tokenized and converted into a tensor, which is a multidimensional rectangular array. You may provide an input consisting of several sequences. In general, after tokenization the sequences have different lengths, so they cannot be stored in a single tensor as they are. Therefore, padding tokens are added to the right or to the left of the shorter sequences to make all lengths equal. For generation with GPT-2, pad on the left so that the new tokens directly follow the prompt.

- **Attention**: The attention mask is a binary tensor marking the padded positions so that the model does not attend to them. For the GPT2Tokenizer, 1 indicates a value that should be attended to, while 0 indicates a padded value. The tokenizer returns this mask in its output dictionary under the key `attention_mask`.
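
The padding and attention-mask logic described above can be sketched in plain Python, without loading the tokenizer. This is only an illustration: `pad_batch` is a hypothetical helper, not part of `transformers`. The token ids are the ones this notebook prints for "Hello" and "Hello dear!", and 50256 is the id of `<|endoftext|>`, which is reused as the padding token.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, ones = [pad_id] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + seq)
            attention_mask.append([0] * n_pad + ones)
        else:
            input_ids.append(seq + pads)
            attention_mask.append(ones + [0] * n_pad)
    return input_ids, attention_mask

# Token ids of "Hello" and "Hello dear!"; 50256 is <|endoftext|>, used as pad token.
ids, mask = pad_batch([[15496], [15496, 13674, 0]], pad_id=50256, side="left")
print(ids)   # [[50256, 50256, 15496], [15496, 13674, 0]]
print(mask)  # [[0, 0, 1], [1, 1, 1]]
```

After padding, both rows have length 3, so the batch fits in one rectangular tensor, and the mask tells the model which positions to ignore.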



**NOTES** Which model should we choose?

- GPT2Model: the base transformer; outputs the hidden states (the embedded buffer, X, after all the transformations).
- GPT2LMHeadModel: the base transformer plus the language modeling head, which computes the token logits $\vec{x}_N\cdot \vec{x}_{t_i}$ for each token $t_i$ in the vocabulary. The probabilities follow as $p(t_i) \propto \exp(\vec{x}_N\cdot \vec{x}_{t_i})$.
- GPT2DoubleHeadsModel: has both the language modeling head and a multiple-choice classification head. Used for multiple-choice Q&A.


We need to use `GPT2LMHeadModel.from_pretrained()`. `GPT2LMHeadModel` is a subclass of `transformers.PreTrainedModel` and inherits all its methods. Check its documentation.


Also check `transformers.GenerationMixin.generate()` and its documentation.

In [30]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2", padding_side = "left")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_text = ["Hello", "Hello dear!"]
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse the end-of-text token
padded_sequences = tokenizer(input_text, padding=True, return_tensors="pt")

output = model.generate(padded_sequences["input_ids"],
                        attention_mask=padded_sequences["attention_mask"],
                        pad_token_id=tokenizer.eos_token_id,
                        max_length=50,   # maximum total length: prompt plus generated tokens
                        do_sample=True,  # sample the next token instead of always taking the most likely one (more varied output)
                        top_k=5)         # sample only among the 5 most likely tokens; used only if do_sample=True

for i in range(output.shape[0]):
    print(f"Sequence {i}: ")
    print("Decoded input: ", tokenizer.decode(padded_sequences["input_ids"][i]))
    print("Encoded input: ", padded_sequences["input_ids"][i])
    print("Attention mask: ", padded_sequences["attention_mask"][i])
    print("Decoded output: ", tokenizer.decode(output[i],skip_special_tokens=False))
    print("_________________________")

Sequence 0: 
Decoded input:  <|endoftext|><|endoftext|>Hello
Encoded input:  tensor([50256, 50256, 15496])
Attention mask:  tensor([0, 0, 1])
Decoded output:  <|endoftext|><|endoftext|>Hello

Hi everyone!

The last few months have been a whirlwind.

I'm so happy and thankful to everyone for taking the time to read through this thread.

We've had a great run at this game
_________________________
Sequence 1: 
Decoded input:  Hello dear!
Encoded input:  tensor([15496, 13674,     0])
Attention mask:  tensor([1, 1, 1])
Decoded output:  Hello dear! You are my daughter and your daughter's sister. I will be happy to see you soon, but I want your daughter to see you soon. I will be very pleased to hear of your success. Please, please let us meet.
_________________________


#### 2.2: `transformers.pipeline` API.

Now we run the same example as above, but using the higher-level interface provided by the `pipeline` function. Encoding and decoding happen under the hood.

In [31]:
from transformers import pipeline, set_seed
set_seed(42)
input_text = ["Hello", "Hello dear!"]
generator = pipeline('text-generation', model='gpt2', device=-1) # Use device=0 for GPU, or device=-1 for CPU
output = generator(input_text,
                    pad_token_id = 50256,
                    truncation = True,
                    max_length=30,
                    temperature=0.1,
                    num_return_sequences=1)



for idx, field in enumerate(output):
    print(f"Sequence {idx}: ")
    print("Decoded input: ", input_text[idx])
    print("Decoded Output:", field[0]["generated_text"])
    print("_________________________")

Device set to use cpu


Sequence 0: 
Decoded input:  Hello
Decoded Output: Hello, I'm not sure if you're aware of the fact that I'm a member of the American Association of Chiefs of Police. I'm a
_________________________
Sequence 1: 
Decoded input:  Hello dear!
Decoded Output: Hello dear! I'm sorry, but I'm not sure what to do. I'm not sure if I should go back to the hospital or not
_________________________


### 3: How to access token probabilities

For this, as well as for many other things, you have to use the low-level GPT-2 API.

The **logits** are the scalar products between the last row of the buffer, after all the transformations, and each of the embedded tokens in the vocabulary:

$$ \vec{\ell} = \vec{x}_N \cdot E^T $$

The **probabilities** are obtained by applying a softmax to the logits:

$$ p(t_i) = \frac{\exp\left(\ell_i / T\right)}{\sum_j \exp\left(\ell_j / T\right)} $$

where $\ell_i$ is the logit of token $t_i$ and $T$ is the temperature parameter.
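
To make the role of the temperature concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The function name and the three toy logits are hypothetical, chosen only for illustration. Lowering $T$ concentrates the probability mass on the largest logit.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert a list of logits into probabilities: softmax(logits / T)."""
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the result.
    max_scaled = max(scaled)
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical logits for a three-token vocabulary.
toy_logits = [2.0, 1.0, 0.0]
probs_t1 = softmax_with_temperature(toy_logits, temperature=1.0)
probs_cold = softmax_with_temperature(toy_logits, temperature=0.1)
print(probs_t1)
print(probs_cold)
```

With `temperature=0.1` almost all of the probability goes to the first token, which is why low temperatures give nearly deterministic output.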

#### First way:

In [32]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The American flag's colors are red, blue and"
inputs = tokenizer(prompt, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])  # the prompt split into its tokens

# Forward pass without gradient tracking: we only need the outputs
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits  # Shape: [1, seq_len, vocab_size]
print(logits.shape)
print(logits[0])  # one row of vocabulary logits per input position

torch.Size([1, 10, 50257])
tensor([[-36.2872, -35.0111, -38.0791,  ..., -40.5161, -41.3758, -34.9191],
        [-85.1435, -82.5817, -88.0494,  ..., -88.4072, -90.8886, -84.2703],
        [-86.6003, -85.0928, -92.4016,  ..., -98.3911, -91.8806, -89.0551],
        ...,
        [-86.1226, -85.5085, -86.6623,  ..., -95.5519, -89.5766, -85.6829],
        [ -0.4879,   1.0927,  -3.0591,  ..., -11.6097,  -8.8209,  -1.0988],
        [-74.8958, -72.4673, -75.6806,  ..., -83.4975, -78.3614, -74.6660]])
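
The logits printed above are not yet probabilities. A minimal sketch of the remaining step, assuming PyTorch is installed: take the row for the last position, apply a softmax, and rank the tokens. So that the snippet runs without downloading the model, a random tensor with a hypothetical vocabulary of 8 tokens stands in for `outputs.logits`.

```python
import torch

torch.manual_seed(0)
# Stand-in for `outputs.logits`: shape [batch, seq_len, vocab_size], with a
# hypothetical vocabulary of 8 tokens instead of GPT-2's 50257.
logits = torch.randn(1, 10, 8)

next_token_logits = logits[0, -1]                  # row for the last position predicts the next token
probs = torch.softmax(next_token_logits, dim=-1)   # probabilities over the vocabulary
top_probs, top_ids = torch.topk(probs, k=3)

for p, idx in zip(top_probs.tolist(), top_ids.tolist()):
    # With the real tokenizer, tokenizer.decode(idx) would give the token text.
    print(f"token id {idx}: p = {p:.3f}")
```

With the real model, replacing the random tensor by `outputs.logits` gives the probability of every vocabulary token following the prompt.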


#### Second way:

When calling `model.generate()`, set `return_dict_in_generate=True` together with `output_scores=True` or `output_logits=True`.


Documentation [here](https://huggingface.co/docs/transformers/v4.51.3/en/internal/generation_utils#transformers.generation.GenerateDecoderOnlyOutput)

**Notes**
Scores are the logits after the processing applied by the decoding strategy. They are equal to the logits in the case of greedy decoding, but differ for more elaborate decoding methods like beam search or `top_k` sampling. See the doc page [Generation Strategies](https://huggingface.co/docs/transformers/en/generation_strategies#decoding-strategies). See [here](https://discuss.huggingface.co/t/what-is-the-difference-between-logits-and-scores/79796/3) for a forum discussion.
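
To see why scores can differ from raw logits, here is a sketch of a top-k filter in plain Python. It is an illustration of the idea, not the library's code: `top_k_filter` and the five logits are hypothetical. Everything outside the k largest entries is set to minus infinity, so those tokens get zero probability after the softmax.

```python
import math

def top_k_filter(logits, k):
    """Keep the k largest logits; set every other entry to -inf."""
    # Threshold = k-th largest value. Ties at the threshold are all kept.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [value if value >= threshold else -math.inf for value in logits]

raw_logits = [1.5, -0.3, 0.8, 2.1, -1.0]   # hypothetical raw logits
scores = top_k_filter(raw_logits, k=2)
print(scores)  # [1.5, -inf, -inf, 2.1, -inf]
```

Here `raw_logits` plays the role of the logits and `scores` the role of the processed scores. The two agree only on the entries that survive the filter.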



- `logits` (`tuple(torch.FloatTensor)`, optional, returned when `output_logits=True`): Unprocessed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step. Tuple of `torch.FloatTensor` with up to `max_new_tokens` elements (one element for each generated token), with each tensor of shape `(batch_size, config.vocab_size)`.

- `attentions` (`tuple(tuple(torch.FloatTensor))`, optional, returned when `output_attentions=True`): Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of `torch.FloatTensor` of shape `(batch_size, num_heads, generated_length, sequence_length)`.

- `hidden_states` (`tuple(tuple(torch.FloatTensor))`, optional, returned when `output_hidden_states=True`): Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of `torch.FloatTensor` of shape `(batch_size, generated_length, hidden_size)`.

In [33]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# Load pretrained model
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token_id = tokenizer.eos_token_id
inputs = tokenizer(["Today we can go to"], return_tensors="pt") # attributes: input_ids, attention_mask
print(inputs.input_ids)
outputs = model.generate(**inputs, max_new_tokens = 1,
                        return_dict_in_generate = True,
                        output_scores = True,
                        temperature = 1.0,
                        output_logits = True,
                        output_hidden_states = True,
                        output_attentions = True,
                        pad_token_id = 50256)

type(outputs) # transformers.generation.utils.GenerateDecoderOnlyOutput


print(outputs.sequences.shape)
print(tokenizer.decode(outputs.sequences[0][0], skip_special_tokens =False))
print(tokenizer.decode(outputs.sequences[0][:], skip_special_tokens =False))


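# scores[generation step][batch item]; with greedy decoding the scores coincide with the raw logits
logits = outputs.scores[0][0]
top_values, top_indices = torch.topk(logits, k=10, largest=True)  # the 10 largest values (largest=False gives the smallest)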
for idx, val in zip(top_indices.tolist(), top_values.tolist()):
    print(f"Index: {idx}, Value: {val}, Decoded: {tokenizer.decode(idx)}")

tensor([[8888,  356,  460,  467,  284]])
torch.Size([1, 6])
Today
Today we can go to the
Index: 262, Value: -80.64991760253906, Decoded:  the
Index: 257, Value: -81.82210540771484, Decoded:  a
Index: 670, Value: -82.49638366699219, Decoded:  work
Index: 1175, Value: -82.70538330078125, Decoded:  war
Index: 597, Value: -82.73980712890625, Decoded:  any
Index: 3993, Value: -82.82905578613281, Decoded:  sleep
Index: 674, Value: -83.12396240234375, Decoded:  our
Index: 1194, Value: -83.35417175292969, Decoded:  another
Index: 3996, Value: -83.4070816040039, Decoded:  bed
Index: 477, Value: -83.7303237915039, Decoded:  all
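
As an illustration of what sampling with `top_k=5` would do at this step, the sketch below takes the five largest values printed above, renormalizes them with a softmax, and draws tokens from the result using only the standard library. The probabilities are relative to these five candidates only, not the model's probabilities over the full vocabulary. The seed and the number of draws are arbitrary choices.

```python
import math
import random

# The five largest scores printed above for the prompt "Today we can go to".
top5 = {" the": -80.64991760253906, " a": -81.82210540771484,
        " work": -82.49638366699219, " war": -82.70538330078125,
        " any": -82.73980712890625}

# Softmax restricted to the five candidates (shifted by the max for stability).
max_score = max(top5.values())
weights = {tok: math.exp(score - max_score) for tok, score in top5.items()}
total = sum(weights.values())
probs = {tok: w / total for tok, w in weights.items()}

rng = random.Random(0)  # fixed seed so the draws are reproducible
samples = rng.choices(list(probs), weights=list(probs.values()), k=1000)
for tok, p in probs.items():
    print(f"{tok!r}: p={p:.3f}, drawn {samples.count(tok)} times out of 1000")
```

Although the raw values are all close to -80, only their differences matter: " the" is about one unit ahead of " a" and therefore receives more than half of the probability among these five tokens.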


### 4: Accessing hidden states


Hidden states are the token representations in the embedding space $\mathbb{R}^D$.

We can follow the buffer as it exits each of the layers. The model returns the hidden states at the output of each layer, plus the output of the initial embedding. In this case, that gives $1 + 12 = 13$ tensors.

In [34]:
print(len(outputs.hidden_states[0]))  # hidden states for the first generated token: embedding output + 12 layers
print(outputs.hidden_states[0][0][0].shape) # hidden_states[generated token number][layer number][batch item] is a tensor (input length x D)

hidden_states = outputs.hidden_states[0][0][0]

# representation of the first input token at layer index 0 (the embedding output)
hidden_states[0, :]

13
torch.Size([5, 768])


tensor([ 2.3067e-02, -2.9040e-01,  1.1197e-01, -3.2291e-03,  2.9128e-03,
        -2.1564e-01, -1.8018e-01, -1.8713e-01, -8.6993e-02, -3.5113e-01,
        -6.8742e-02, -4.7279e-02,  1.2248e-01,  3.9539e-02,  8.5879e-02,
        -1.3172e-01,  1.1782e-01, -6.5752e-02,  7.8041e-02, -1.0637e-01,
        -3.3029e-02, -6.2341e-02, -8.8152e-02, -4.7329e-02,  1.4917e-01,
        -7.3244e-02, -6.4885e-02,  2.1092e-01,  3.2153e-02,  1.3472e-01,
        -2.0389e-02, -3.5955e-01,  5.8926e-02,  3.9860e-03,  9.3552e-02,
        -7.5463e-02, -1.0719e+00,  1.0046e-01, -4.4027e-02,  9.4923e-02,
         6.0284e-03, -5.4183e-02,  6.3331e-02, -1.3404e-01,  9.1742e-02,
        -8.9475e-02,  3.1307e-02, -9.8643e-02, -3.6463e-02,  2.7996e-01,
         7.4173e-02, -2.9777e-02,  6.6384e-01, -8.4596e-02, -5.1499e-02,
         3.8400e+00, -4.7607e-03,  5.3434e-02, -3.4203e-02, -1.4839e-01,
         1.1489e-01, -2.1606e-01, -3.7943e-02, -3.5273e-02,  2.3708e-01,
         3.3337e-02, -2.6543e-02, -5.4957e-01,  1.8

### 5: Accessing model parameters

Now we retrieve the model parameters, which are learned during training and kept fixed during inference. These include:

- the embedding map, E
- the attention matrices, in each layer and each head $(W_Q, W_K, W_V)$
- the feed-forward (MLP) weights, in each layer


When instantiating the model using `from_pretrained()`, the model is put in evaluation mode by default (via `model.eval()`), which deactivates dropout. To train the model, you should first set it back to training mode with `model.train()`.
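
What the two modes mean for a dropout layer can be sketched in plain Python. This is a simplified illustration of inverted dropout, not PyTorch's implementation: the `dropout` function below is hypothetical, and `p=0.1` matches the dropout probability GPT-2 uses.

```python
import random

def dropout(values, p=0.1, training=True, rng=random):
    """Inverted dropout on a list of floats (sketch of the idea)."""
    if not training:
        return list(values)          # evaluation mode: identity
    keep = 1.0 - p
    # Training mode: zero each entry with probability p, rescale survivors by 1/keep.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
x = [1.0] * 10
print(dropout(x, p=0.1, training=False))          # unchanged
print(dropout(x, p=0.1, training=True, rng=rng))  # entries are either 0.0 or scaled up by 1/0.9
```

In evaluation mode the output is deterministic, which is what you want when generating text or inspecting hidden states.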

In [35]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [36]:
state_dict = model.state_dict()
for name, weights in state_dict.items():
    print(name, weights.shape)

transformer.wte.weight torch.Size([50257, 768])
transformer.wpe.weight torch.Size([1024, 768])
transformer.h.0.ln_1.weight torch.Size([768])
transformer.h.0.ln_1.bias torch.Size([768])
transformer.h.0.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.0.attn.c_attn.bias torch.Size([2304])
transformer.h.0.attn.c_proj.weight torch.Size([768, 768])
transformer.h.0.attn.c_proj.bias torch.Size([768])
transformer.h.0.ln_2.weight torch.Size([768])
transformer.h.0.ln_2.bias torch.Size([768])
transformer.h.0.mlp.c_fc.weight torch.Size([768, 3072])
transformer.h.0.mlp.c_fc.bias torch.Size([3072])
transformer.h.0.mlp.c_proj.weight torch.Size([3072, 768])
transformer.h.0.mlp.c_proj.bias torch.Size([768])
transformer.h.1.ln_1.weight torch.Size([768])
transformer.h.1.ln_1.bias torch.Size([768])
transformer.h.1.attn.c_attn.weight torch.Size([768, 2304])
transformer.h.1.attn.c_attn.bias torch.Size([2304])
transformer.h.1.attn.c_proj.weight torch.Size([768, 768])
transformer.h.1.attn.c_proj.bias 
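
The listing has no separate $W_Q$, $W_K$, $W_V$ entries: they are stored fused in `attn.c_attn.weight`, of shape `[768, 2304]` with $2304 = 3 \times 768$. The sketch below computes which columns belong to a given head, assuming the layout used by the Hugging Face implementation as I understand it (queries first, then keys, then values, each split into 12 contiguous heads of 64 columns). `qkv_column_slices` is a hypothetical helper.

```python
def qkv_column_slices(head, d_model=768, n_head=12):
    """Column ranges of the fused c_attn weight that belong to one head."""
    head_dim = d_model // n_head                     # 64 for GPT-2 base
    start = head * head_dim
    return {
        name: slice(offset + start, offset + start + head_dim)
        for name, offset in (("W_Q", 0), ("W_K", d_model), ("W_V", 2 * d_model))
    }

slices = qkv_column_slices(head=0)
print(slices)
# With the notebook's state_dict loaded, the query matrix of head 0 in layer 0 would be:
# state_dict["transformer.h.0.attn.c_attn.weight"][:, slices["W_Q"]]   # shape [768, 64]
```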

In [37]:
embedding_matrix = state_dict["transformer.wte.weight"] # E
lm_head_matrix = state_dict["lm_head.weight"]
torch.equal(embedding_matrix, lm_head_matrix) # True

True
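
As a sanity check on the shapes listed above, the total number of parameters can be recomputed by hand. This sketch uses only the dimensions shown in the model printout. It assumes that `lm_head.weight` shares its values with `transformer.wte.weight`, as the equality check above suggests, and therefore counts that matrix once.

```python
d_model, n_layer, vocab_size, n_positions = 768, 12, 50257, 1024
d_mlp = 4 * d_model  # 3072

embeddings = vocab_size * d_model + n_positions * d_model   # wte + wpe
layer_norm = 2 * d_model                                     # weight + bias

per_block = (
    layer_norm                                   # ln_1
    + d_model * 3 * d_model + 3 * d_model        # attn.c_attn
    + d_model * d_model + d_model                # attn.c_proj
    + layer_norm                                 # ln_2
    + d_model * d_mlp + d_mlp                    # mlp.c_fc
    + d_mlp * d_model + d_model                  # mlp.c_proj
)

# lm_head.weight is not added: it is assumed to be shared with wte.weight.
total = embeddings + n_layer * per_block + layer_norm       # last term: ln_f
print(f"{total:,}")  # 124,439,808
```

The result is the roughly 124 million parameters of the base GPT-2 model, of which about 38.6 million sit in the token embedding matrix alone.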