# GPT1 - Improving Language Understanding by Generative Pre-Training

## Paper

This notebook has been automatically translated to make it accessible to more people, please let me know if you see any typos.

[Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) is the GPT1 paper. Before reading on, it helps to recall the context. Before GPT, language models were based on recurrent neural networks (RNNs). These worked relatively well on specific tasks, but their pre-training could not easily be reused and fine-tuned for other tasks. They also had limited memory: given a very long sentence, they did not remember its beginning very well.

## Architecture

Before we talk about the architecture of GPT1, let's recall what the Transformer architecture looks like.

![transformer architecture](https://maximofn.com/wp-content/uploads/2023/12/transformer-scaled.webp)

GPT1 is based on the decoder of the Transformer. Since there is no encoder, a single decoder block looks like this:

![decoder architecture](https://maximofn.com/wp-content/uploads/2024/06/transformer_decoder_only-scaled.webp)

The cross-attention between the encoder output and the decoder is removed, since there is no encoder to attend to.

In the GPT1 paper they propose the following architecture

![gpt1 architecture](https://maximofn.com/wp-content/uploads/2024/06/GPT1_architecture.webp)

It corresponds to the Transformer decoder block seen above, stacked 12 times.
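
To make the decoder-only idea concrete, here is a minimal sketch of masked (causal) self-attention, the only attention left in each block once cross-attention is removed. It is not the GPT1 implementation: the function name `causal_self_attention`, the random weights and the toy sizes are assumptions made for illustration, and dropout, biases, the feed-forward layer and layer normalization are left out.

```python
import torch

def causal_self_attention(x, w_qkv, w_out, n_heads):
    """Multi-head self-attention where each position only sees itself and earlier positions."""
    batch, seq_len, d_model = x.shape
    head_dim = d_model // n_heads
    # One projection produces queries, keys and values at once
    q, k, v = (x @ w_qkv).split(d_model, dim=-1)
    # (batch, seq, d_model) -> (batch, heads, seq, head_dim)
    q, k, v = (t.view(batch, seq_len, n_heads, head_dim).transpose(1, 2) for t in (q, k, v))
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5  # (batch, heads, seq, seq)
    # Lower-triangular mask: position i may attend to positions 0..i only
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    out = (weights @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
    return out @ w_out

torch.manual_seed(0)
d_model, n_heads = 16, 4  # toy sizes (GPT1 uses 768 and 12)
w_qkv = torch.randn(d_model, 3 * d_model)
w_out = torch.randn(d_model, d_model)
x = torch.randn(1, 5, d_model)
y = causal_self_attention(x, w_qkv, w_out, n_heads)

# Changing the last token must not change the outputs of earlier positions
x_changed = x.clone()
x_changed[0, -1] += 1.0
y_changed = causal_self_attention(x_changed, w_qkv, w_out, n_heads)
print(y.shape)                                        # torch.Size([1, 5, 16])
print(torch.allclose(y[:, :-1], y_changed[:, :-1]))   # True
```

The mask is what lets the same forward pass be used for next-token prediction at every position: no position can look at the tokens it is supposed to predict.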

## Paper summary

The most interesting ideas in the paper are:

 * The model is first trained on a large corpus of unlabeled text. This unsupervised pre-training produces a high-capacity language model.
 * The model is then fine-tuned on supervised NLP tasks with labeled datasets. During fine-tuning it is trained not only on the target task but also, as an auxiliary objective, on predicting the next token. According to the paper, this improves the generalization of the supervised model and speeds up convergence.
 * As already mentioned, the paper uses the transformer architecture, whereas until then language models were mostly built on RNNs. According to the paper, this makes what is learned during pre-training (on the unlabeled text corpus) easier to transfer to supervised tasks. That is, thanks to the use of transformers, it was possible to train on a whole corpus of text and then fine-tune the model on supervised tasks.
 * They evaluated the model in four types of language comprehension tasks:
    * Natural language inference
    * Question answering
    * Semantic similarity
    * Text classification
 * The general, task-agnostic model (pre-trained on the unlabeled corpus and then fine-tuned) outperforms discriminatively trained models that use architectures designed specifically for each task, improving the state of the art in 9 of the 12 tasks studied. The authors also analyze the zero-shot behavior of the pre-trained model in four different settings and show that it acquires linguistic knowledge that is useful for downstream tasks.
 * In recent years, researchers had demonstrated the benefits of using embeddings, which are trained on unlabeled corpora, to improve performance on a variety of tasks. However, these approaches primarily transfer information at the word level, whereas the use of transformers trained on large unsupervised text corpora captures higher-level, sentence-level semantics.
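
In the paper's notation, the pre-training and fine-tuning objectives summarized above can be written as follows. Here $\mathcal{U}$ is the unlabeled corpus of tokens $u_i$, $k$ is the size of the context window, $\Theta$ are the model parameters, $\mathcal{C}$ is the labeled dataset of input token sequences $x^1, \ldots, x^m$ with label $y$, and $\lambda$ is the weight of the auxiliary language modeling objective (the paper reports using $\lambda = 0.5$). All three quantities are log-likelihoods to be maximized.

$$
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
$$

$$
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)
$$

$$
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
$$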

## Text generation

Let's see how to generate text with a pre-trained GPT1

First you have to install `ftfy` and `spacy` via

```bash
pip install ftfy spacy
```

Once installed, you must download the spacy language model you wish to use. For example, to download the English model, you can run:

```bash
python -m spacy download en_core_web_sm
```

To generate text we will use the model from the [GPT1](https://huggingface.co/openai-community/openai-gpt) repository of Hugging Face.

We import the libraries

In [1]:
import torch
from transformers import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel, AutoTokenizer

Notice that we imported both `OpenAIGPTTokenizer` and `AutoTokenizer`. The GPT1 [model card](https://huggingface.co/openai-community/openai-gpt) says to use `OpenAIGPTTokenizer`, but in the [transformers](https://maximofn.com/hugging-face-transformers/) library post we explain that `AutoTokenizer` should be used to load tokenizers. So let's try both.

In [2]:
ckeckpoints = "openai-community/openai-gpt"
tokenizer = OpenAIGPTTokenizer.from_pretrained(ckeckpoints)
auto_tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)

input_tokens = tokenizer("Hello, my dog is cute and", return_tensors="pt")
input_auto_tokens = auto_tokenizer("Hello, my dog is cute and", return_tensors="pt")

print(f"input tokens: \n{input_tokens}")
print(f"input auto tokens: \n{input_auto_tokens}")

input tokens: 
{'input_ids': tensor([[3570,  240,  547, 2585,  544, 4957,  488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
input auto tokens: 
{'input_ids': tensor([[3570,  240,  547, 2585,  544, 4957,  488]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


As you can see, both tokenizers produce the same tokens. To keep the code general, so that changing the checkpoint does not require changing the code, let's use `AutoTokenizer`.

We then create the device, the tokenizer and the model

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(ckeckpoints).to(device)

Now that we have instantiated the model, let's see how many parameters it has

In [None]:
params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {round(params/1e6)}M")

Number of parameters: 117M


In an era of models with billions of parameters, GPT1 had only 117 million.
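
As a sanity check, this number can be roughly reproduced by hand. The sketch below is a back-of-the-envelope estimate and not the library's own counting: it takes the sizes reported in the paper (12 blocks, 768-dimensional states, 3072-dimensional feed-forward layer, 512 positions) and a vocabulary of 40478 tokens, and it assumes biases on every linear layer, two layer norms per block, no final layer norm, and an output head that shares its weights with the token embedding. The function name `estimate_gpt1_parameters` is made up for this example.

```python
def estimate_gpt1_parameters(vocab_size=40478, n_positions=512, d_model=768,
                             n_layers=12, d_ff=3072):
    """Rough parameter count for a GPT1-sized decoder-only transformer."""
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * d_model + n_positions * d_model
    # Fused query/key/value projection plus the attention output projection
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two linear layers of the position-wise feed-forward network
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector
    layer_norms = 2 * (2 * d_model)
    per_block = attention + feed_forward + layer_norms
    return embeddings + n_layers * per_block

total = estimate_gpt1_parameters()
print(f"{total:,} parameters, about {round(total / 1e6)}M")
# 116,534,784 parameters, about 117M
```

Under these assumptions roughly a quarter of the parameters sit in the token embedding table, and the rest in the 12 blocks.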

We create the input tokens for the model

In [4]:
input_sentence = "Hello, my dog is cute and"
input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)

input_tokens

{'input_ids': tensor([[3570,  240,  547, 2585,  544, 4957,  488]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

We pass them to the model to generate the output tokens.

In [5]:
output_tokens = model.generate(**input_tokens)

print(f"output tokens: \n{output_tokens}")

output tokens: 
tensor([[ 3570,   240,   547,  2585,   544,  4957,   488,   249,   719,   797,
           485,   921,   575,   562,   246,  1671,   239,   244, 40477,   244]],
       device='cuda:0')




We decode the tokens to obtain the output sentence

In [6]:
decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print(f"decoded output: \n{decoded_output}")

decoded output: 
hello, my dog is cute and i'm going to take him for a walk. " 
 "


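We have now generated text with GPT1. Note that the output stops after 20 tokens, which is the default `max_length` of `generate` when no length argument is given.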

### Generate text token by token

#### Greedy search

We have used `model.generate` to generate the output tokens all at once, but let's see how to generate them one by one. To do this, instead of using `model.generate` we are going to use `model`, which actually calls the `model.forward` method.

In [7]:
outputs = model(**input_tokens)

outputs

CausalLMOutput(loss=None, logits=tensor([[[ -5.9486,  -5.8697, -18.4258,  ...,  -9.7371, -10.4495,   0.8814],
         [ -6.1212,  -4.8031, -14.3970,  ...,  -6.5411,  -9.5051,  -1.2015],
         [ -7.4231,  -6.3615, -14.7297,  ..., -10.4575,  -8.4600,  -1.5183],
         ...,
         [ -5.4751,  -5.8803, -13.7767,  ..., -10.5048, -12.4167,  -6.1584],
         [ -7.2052,  -6.0198, -21.5040,  ..., -16.2941, -14.0494,  -1.2416],
         [ -7.7240,  -7.3631, -17.3174,  ..., -12.1546, -12.3327,  -1.7169]]],
       device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)

The output contains a lot of data. First let's look at the output keys

In [8]:
outputs.keys()

odict_keys(['logits'])

In this case we only have the logits of the model, let's see their size

In [9]:
logits = outputs.logits

logits.shape

torch.Size([1, 7, 40478])

Let's see how many tokens we had in the input.

In [10]:
input_tokens.input_ids.shape

torch.Size([1, 7])

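The output has one logits vector per input token. This is expected: the logits at position `i` are the model's prediction for the token at position `i + 1`, so only the last position predicts a token we have not seen yet.

We take the logits at the last position of the output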

In [11]:
nex_token_logits = logits[0,-1]

nex_token_logits.shape

torch.Size([40478])

There are 40478 logits in total, i.e. the vocabulary has 40478 tokens. We need to find the token with the highest probability, so we first compute the softmax

In [12]:
softmax_logits = torch.softmax(nex_token_logits, dim=0)

softmax_logits.shape

torch.Size([40478])

In [13]:
next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)

next_token_prob, next_token_id

(tensor(0.1898, device='cuda:0', grad_fn=<MaxBackward0>),
 tensor(249, device='cuda:0'))

We have obtained the next token. Now we decode it

In [14]:
tokenizer.decode(next_token_id.item())

'i'

We have obtained the next token with the greedy method, i.e. by taking the token with the highest probability. But as we saw in the transformers library post, in the [ways to generate text](https://maximofn.com/hugging-face-transformers/#Formas-de-generaci%C3%B3n-de-texto) section, sampling, top-k, top-p, etc. can also be used.
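
For comparison with greedy search, here is a minimal sketch of temperature plus top-k sampling applied to a single vector of next-token logits. The function name `sample_next_token`, its default values and the toy logits are assumptions made for this example; it is not what `model.generate` does internally.

```python
import torch

def sample_next_token(next_token_logits, temperature=1.0, top_k=50, generator=None):
    """Sample one token id from the top_k most likely tokens of a 1D logits vector."""
    logits = next_token_logits / temperature
    # Keep only the top_k largest logits and remember which token ids they belong to
    top_values, top_indices = torch.topk(logits, k=min(top_k, logits.size(-1)))
    # Renormalize over the kept tokens and draw one of them
    probs = torch.softmax(top_values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1, generator=generator)
    return top_indices[choice].item()

toy_logits = torch.tensor([0.1, 3.0, -1.0, 2.5, 0.0, 2.0])
generator = torch.Generator().manual_seed(0)

# With top_k=3 only the ids 1, 3 and 5 can ever be drawn
samples = [sample_next_token(toy_logits, top_k=3, generator=generator) for _ in range(10)]
print(samples)

# With top_k=1 sampling reduces to greedy search
print(sample_next_token(toy_logits, top_k=1) == toy_logits.argmax().item())  # True
```

With the real model, `nex_token_logits` from the cell above could be passed in place of `toy_logits`. The rest of this section sticks to greedy search.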

Let's put everything into a function and see what comes out if we generate a few tokens

In [15]:
def generate_next_greedy_token(input_sentence, tokenizer, model, device):
    input_tokens = tokenizer(input_sentence, return_tensors="pt").to(device)
    outputs = model(**input_tokens)
    logits = outputs.logits
    nex_token_logits = logits[0,-1]
    softmax_logits = torch.softmax(nex_token_logits, dim=0)
    next_token_prob, next_token_id = torch.max(softmax_logits, dim=0)
    return next_token_prob, next_token_id

In [16]:
def generate_greedy_text(input_sentence, tokenizer, model, device, max_length=20):
    generated_text = input_sentence
    for _ in range(max_length):
        next_token_prob, next_token_id = generate_next_greedy_token(generated_text, tokenizer, model, device)
        generated_text += tokenizer.decode(next_token_id.item())
    return generated_text

Now we generate text

In [17]:
generate_greedy_text("Hello, my dog is cute and", tokenizer, model, device)

'Hello, my dog is cute andi."\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

The output is quite repetitive as already seen in the [ways to generate text](https://maximofn.com/hugging-face-transformers/#Formas-de-generaci%C3%B3n-de-texto)
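
The result also differs from what `model.generate` produced for the same prompt. One likely reason is that `generate_greedy_text` decodes every new token to text, appends it without a separator and tokenizes the whole string again, so that `and` followed by `i` becomes `andi`. A sketch of a loop that stays in token ID space and only decodes at the end is shown below. The name `greedy_generate_ids` is made up, and the only assumption about the model is that calling it with `input_ids` returns an object with a `logits` attribute of shape `(batch, sequence, vocabulary)`, as `OpenAIGPTLMHeadModel` does. `ToyModel` is a hypothetical stand-in so that the example runs without downloading anything.

```python
from types import SimpleNamespace

import torch
import torch.nn.functional as F

@torch.no_grad()
def greedy_generate_ids(model, input_ids, max_new_tokens=20):
    """Greedy decoding that appends token ids instead of re-tokenizing decoded text."""
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids=ids).logits                     # (batch, seq, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # (batch, 1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

class ToyModel:
    """Hypothetical stand-in for a causal LM: always predicts (last token + 1) % vocab."""
    def __init__(self, vocab_size=10):
        self.vocab_size = vocab_size

    def __call__(self, input_ids):
        logits = F.one_hot((input_ids + 1) % self.vocab_size, self.vocab_size).float()
        return SimpleNamespace(logits=logits)

print(greedy_generate_ids(ToyModel(), torch.tensor([[3]]), max_new_tokens=4))
# tensor([[3, 4, 5, 6, 7]])
```

With the real model, the call would be `greedy_generate_ids(model, input_tokens.input_ids)`, followed by `tokenizer.decode` on the first row of the result.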

## Fine tuning GPT

### Loss calculation

Before we start fine-tuning GPT1, let's look at one thing. Earlier, we obtained the output of the model like this:

In [19]:
outputs = model(**input_tokens)

outputs

CausalLMOutput(loss=None, logits=tensor([[[ -5.9486,  -5.8697, -18.4258,  ...,  -9.7371, -10.4495,   0.8814],
         [ -6.1212,  -4.8031, -14.3970,  ...,  -6.5411,  -9.5051,  -1.2015],
         [ -7.4231,  -6.3615, -14.7297,  ..., -10.4575,  -8.4600,  -1.5183],
         ...,
         [ -5.4751,  -5.8803, -13.7767,  ..., -10.5048, -12.4167,  -6.1584],
         [ -7.2052,  -6.0198, -21.5040,  ..., -16.2941, -14.0494,  -1.2416],
         [ -7.7240,  -7.3631, -17.3174,  ..., -12.1546, -12.3327,  -1.7169]]],
       device='cuda:0', grad_fn=<UnsafeViewBackward0>), hidden_states=None, attentions=None)

You can see that we get `loss=None`.

In [20]:
print(outputs.loss)

None


As we are going to need the loss to do the fine tuning, let's see how to obtain it.

If we go to the documentation of the [forward](https://huggingface.co/docs/transformers/model_doc/openai-gpt#transformers.OpenAIGPTLMHeadModel.forward) method of `OpenAIGPTLMHeadModel`, we see that it returns an object of type `transformers.modeling_outputs.CausalLMOutput`. The documentation of [transformers.modeling_outputs.CausalLMOutput](https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/output#transformers.modeling_outputs.CausalLMOutput) in turn says that `loss` is returned when `labels` is passed to the `forward` method.

If we go to the source code of the [forward](https://github.com/huggingface/transformers/blob/main/src/transformers/models/openai/modeling_openai.py#L544) method, we see this code block

```python
        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```

In other words, the `loss` is calculated as follows

 * Shift of logits and labels: the logits (`lm_logits`) and the labels (`labels`) are shifted by one position, so that the logits at position `n` are compared with the token at position `n + 1`. In other words, each token is predicted only from the tokens before it.
 * CrossEntropyLoss: an instance of `CrossEntropyLoss()` is created.
 * Flatten tokens: the logits are reshaped with `view(-1, shift_logits.size(-1))` to shape `(batch * (seq_len - 1), vocab_size)` and the labels with `view(-1)` to shape `(batch * (seq_len - 1),)`, which is the layout that `CrossEntropyLoss` expects.
 * Loss calculation: finally, the loss is calculated by calling the `CrossEntropyLoss()` instance with the flattened logits and the flattened labels.

In summary, `loss` is calculated as the cross-entropy loss between shifted and flattened logits and shifted and flattened labels.
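
The steps above can be reproduced on toy tensors without loading the model. In this sketch the sizes are arbitrary and the logits are random numbers standing in for real model outputs; it compares the shift-and-flatten computation with an explicit loop over every (position, next token) pair.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 2, 5, 11           # toy sizes, chosen arbitrarily
logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))

# Shift: the logits at position t are scored against the token at position t + 1
shift_logits = logits[..., :-1, :].contiguous()   # (batch, seq_len - 1, vocab)
shift_labels = labels[..., 1:].contiguous()       # (batch, seq_len - 1)

# Flatten and average the cross-entropy over all predicted positions
loss = F.cross_entropy(shift_logits.view(-1, vocab), shift_labels.view(-1))

# The same value computed pair by pair, to make the shift explicit
log_probs = torch.log_softmax(logits, dim=-1)
nll = [-log_probs[b, t, labels[b, t + 1]]
       for b in range(batch) for t in range(seq_len - 1)]
manual_loss = torch.stack(nll).mean()

print(shift_logits.shape, shift_labels.shape)
print(torch.allclose(loss, manual_loss))  # True
```

Note that the last logits vector of each sequence is never scored, since there is no following token to compare it with, and the first token is never used as a target.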

Therefore, if we pass the labels to the `forward` method, it will return the `loss`.

In [21]:
outputs = model(**input_tokens, labels=input_tokens.input_ids)

outputs.loss

tensor(4.2607, device='cuda:0', grad_fn=<NllLossBackward0>)

### Dataset

For training we will use [short-jokes-dataset](https://huggingface.co/datasets/Maximofn/short-jokes-dataset), a dataset with about 231 thousand English jokes.

Download the dataset

In [22]:
from datasets import load_dataset

jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes

DatasetDict({
    train: Dataset({
        features: ['ID', 'Joke'],
        num_rows: 231657
    })
})

Let's take a look at it

In [23]:
jokes["train"][0]

{'ID': 1,
 'Joke': '[me narrating a documentary about narrators] "I can\'t hear what they\'re saying cuz I\'m talking"'}

### PyTorch training

First let's see how the training would be done in pure PyTorch.

> Restart the notebook to avoid problems with the GPU memory.

In [24]:
import torch
from transformers import OpenAIGPTLMHeadModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

ckeckpoints = "openai-community/openai-gpt"
tokenizer = AutoTokenizer.from_pretrained(ckeckpoints)
model = OpenAIGPTLMHeadModel.from_pretrained(ckeckpoints)

model = model.to(device)

#### PyTorch dataset

We create a PyTorch dataset class

In [25]:
from torch.utils.data import Dataset

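class JokesDataset(Dataset):
    """Wraps the jokes dataset and returns each joke as raw text plus its tokens."""
    def __init__(self, dataset, tokenizer):
        self.dataset = dataset
        self.joke = "JOKE: "  # prefix that marks the start of every training example
        # GPT-2 style end marker. The GPT1 tokenizer defines no such special token,
        # so it is most likely split into ordinary sub-word tokens.
        self.end_of_text_token = "<|endoftext|>"
        self.tokenizer = tokenizer
        
    def __len__(self):
        return len(self.dataset["train"])

    def __getitem__(self, item):
        sentence = self.joke + self.dataset["train"][item]["Joke"] + self.end_of_text_token
        # return_tensors="pt" yields tensors of shape (1, seq_len)
        tokens = self.tokenizer(sentence, return_tensors="pt")
        return sentence, tokens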

We instantiate it

In [26]:
dataset = JokesDataset(jokes, tokenizer=tokenizer)

Here is an example

In [27]:
sentence, tokens = dataset[5]
print(sentence)
tokens.input_ids.shape, tokens.attention_mask.shape

JOKE: Why can't Barbie get pregnant? Because Ken comes in a different box. Heyooooooo<|endoftext|>


(torch.Size([1, 30]), torch.Size([1, 30]))

#### Dataloader

We now create a PyTorch dataloader

In [28]:
from torch.utils.data import DataLoader

BS = 1
joke_dataloader = DataLoader(dataset, batch_size=BS, shuffle=True)

Let's look at a batch

In [29]:
sentences, tokens = next(iter(joke_dataloader))
len(sentences), tokens.input_ids.shape, tokens.attention_mask.shape

(1, torch.Size([1, 1, 29]), torch.Size([1, 1, 29]))
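
The extra dimension in `torch.Size([1, 1, 29])` appears because each item already has shape `(1, seq_len)` and the dataloader stacks the items of a batch along a new leading dimension. The toy example below reproduces this without the tokenizer; `ToyTokenDataset` is a hypothetical stand-in for `JokesDataset` that returns fixed-length tensors.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyTokenDataset(Dataset):
    """Hypothetical stand-in for JokesDataset: every item is shaped (1, seq_len)."""
    def __init__(self, seq_len=29, size=4):
        self.items = [torch.arange(seq_len).unsqueeze(0) for _ in range(size)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]

loader = DataLoader(ToyTokenDataset(), batch_size=1)
batch = next(iter(loader))
print(batch.shape)     # torch.Size([1, 1, 29]): (batch, 1, seq_len)
print(batch[0].shape)  # torch.Size([1, 29]): indexing with [0] removes the batch dimension
```

This is why the training loop below takes `tokens.input_ids[0]` before passing the tokens to the model. It also shows why the batch size is 1 here: the jokes have different lengths, so they cannot be stacked without padding.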

#### Training

In [30]:
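# transformers.AdamW was deprecated and has been removed from recent releases;
# torch.optim.AdamW is the replacement (its default weight decay and epsilon differ slightly)
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
import tqdm

BATCH_SIZE = 32       # number of packed sequences accumulated per optimizer step
EPOCHS = 5
LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000
MAX_SEQ_LEN = 500     # maximum number of tokens in one packed sequence

model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
# Note: with num_training_steps=-1 the learning rate would drop to 0 once warmup ends
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=-1)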
proc_seq_count = 0
batch_count = 0

tmp_jokes_tens = None

for epoch in range(EPOCHS):
    
    print(f"EPOCH {epoch} started" + '=' * 30)
    progress_bar = tqdm.tqdm(joke_dataloader, desc="Training")
    
    for sample in progress_bar:

        sentence, tokens = sample
        
        #################### "Fit as many joke sequences into MAX_SEQ_LEN sequence as possible" logic start ####
        joke_tens = tokens.input_ids[0].to(device)

        # Skip sample from dataset if it is longer than MAX_SEQ_LEN
        if joke_tens.size()[1] > MAX_SEQ_LEN:
            continue
        
        # The first joke sequence in the sequence
        if not torch.is_tensor(tmp_jokes_tens):
            tmp_jokes_tens = joke_tens
            continue
        else:
            # The next joke does not fit in so we process the sequence and leave the last joke 
            # as the start for next sequence 
            if tmp_jokes_tens.size()[1] + joke_tens.size()[1] > MAX_SEQ_LEN:
                work_jokes_tens = tmp_jokes_tens
                tmp_jokes_tens = joke_tens
            else:
                #Add the joke to sequence, continue and try to add more
                tmp_jokes_tens = torch.cat([tmp_jokes_tens, joke_tens[:,1:]], dim=1)
                continue
        ################## Sequence ready, process it through the model ##################
            
        outputs = model(work_jokes_tens, labels=work_jokes_tens)
        loss = outputs.loss
        loss.backward()
                       
        proc_seq_count = proc_seq_count + 1
        if proc_seq_count == BATCH_SIZE:
            proc_seq_count = 0    
            batch_count += 1
            optimizer.step()
            scheduler.step() 
            optimizer.zero_grad()
            model.zero_grad()

        progress_bar.set_postfix({'loss': loss.item(), 'lr': scheduler.get_last_lr()[0]})
        if batch_count == 10:
            batch_count = 0





Training: 100%|██████████| 231657/231657 [11:31<00:00, 334.88it/s, loss=2.88, lr=2.93e-6]




Training: 100%|██████████| 231657/231657 [11:30<00:00, 335.27it/s, loss=2.49, lr=5.87e-6]




Training: 100%|██████████| 231657/231657 [11:17<00:00, 341.75it/s, loss=2.57, lr=8.81e-6]




Training: 100%|██████████| 231657/231657 [11:18<00:00, 341.27it/s, loss=2.41, lr=1.18e-5]




Training: 100%|██████████| 231657/231657 [11:19<00:00, 341.04it/s, loss=2.49, lr=1.47e-5]
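
The learning rates in the progress bars keep growing, which suggests that the run never left the warmup phase. This can be checked with a small re-implementation of the multiplier that, as far as we understand it, `get_linear_schedule_with_warmup` applies to the base learning rate. The function name `linear_warmup_factor` is made up, and the step counts are assumptions inferred from the logs: a learning rate of `1.47e-5` after five epochs corresponds to about 2450 optimizer steps, i.e. roughly 490 per epoch.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear increase from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay towards 0 at num_training_steps, never below 0
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

LEARNING_RATE = 3e-5
WARMUP_STEPS = 5000

# Assumed step counts: end of epoch 1, end of epoch 5, end of warmup
for step in (490, 2450, 5000):
    lr = LEARNING_RATE * linear_warmup_factor(step, WARMUP_STEPS, num_training_steps=-1)
    print(f"step {step}: lr = {lr:.3g}")
```

Under these assumptions the learning rate reaches only about half of `LEARNING_RATE` by the end of training, and with `num_training_steps=-1` it would fall to 0 as soon as the 5000 warmup steps were completed. Passing the real number of optimizer steps as `num_training_steps`, with a shorter warmup, would give the usual warmup followed by linear decay.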


#### Inference

Let's see how well the model makes jokes.

In [35]:
sentence_joke = "JOKE:"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)

print(f"decoded joke: \n{decoded_output_joke}")

decoded joke: 
joke : what do you call a group of people who are not afraid of the dark? a group


You can see that when you pass it a sequence starting with `JOKE:`, it returns a joke. But if you pass it a different sequence, it does not

In [39]:
sentence_joke = "My dog is cute and"
input_tokens_joke = tokenizer(sentence_joke, return_tensors="pt").to(device)
output_tokens_joke = model.generate(**input_tokens_joke)
decoded_output_joke = tokenizer.decode(output_tokens_joke[0], skip_special_tokens=True)

print(f"decoded joke: \n{decoded_output_joke}")

decoded joke: 
my dog is cute and i'm not sure if i should be offended or not. " 

