## Clean GPT2

- Tokenizer
- Pretrained
- Sampler
- Supervised FT
- RLHF

# What is GPT-2?

## Introduction to GPT-2

[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), released by OpenAI in 2019, marked a pivotal moment in natural language processing (NLP) and the broader AI community. This transformer-based language model demonstrated a then-unprecedented ability to generate coherent, human-like text, setting a new standard for what AI could achieve in language understanding and generation.

## How GPT-2 Works

At its core, GPT-2 operates like a highly sophisticated text prediction system. Given a sequence of tokens (roughly, word pieces), it predicts the most likely next token and appends it, repeatedly extending the text one token at a time ("autoregressive" generation). The model achieves this through:

1. A transformer architecture that processes text using [attention](https://arxiv.org/pdf/1706.03762) mechanisms, allowing it to consider relationships between words regardless of their distance in the text
2. **Unsupervised learning** on a massive dataset of internet text, enabling it to capture patterns in human language without explicit training labels
3. A large-scale architecture (up to 1.5 billion parameters) that can store and utilize complex language patterns.
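
The autoregressive loop itself can be illustrated without a neural network. The sketch below is a toy stand-in, not GPT-2: it swaps the transformer for a table of bigram counts and always picks the most likely next token. The helper names `train_bigram` and `generate` are hypothetical and exist only for this illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_new_tokens):
    """Greedy autoregressive decoding: append the most likely next token, then repeat."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:  # no known continuation, so stop early
            break
        next_token = followers.most_common(1)[0][0]
        out.append(next_token)
    return out

corpus = "the cat sat on the mat and the cat sat on the rug".split()
bigram_counts = train_bigram(corpus)
print(" ".join(generate(bigram_counts, ["the"], max_new_tokens=4)))  # the cat sat on the
```

GPT-2 follows the same outer loop, but it conditions on the whole preceding sequence rather than only the last token, and it usually samples from the predicted distribution instead of always taking the top choice.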

## Historical Significance

GPT-2's release was notable for several reasons:

- It demonstrated that scaling up model size and training data could lead to predictably better performance, a trend later formalized as "[Scaling Laws](https://arxiv.org/pdf/2001.08361#page=3&org=openai)"
- It challenged the prevailing wisdom in AI research by showing that simple architectures at scale could outperform complex architectural innovations
- Its capabilities were significant enough that OpenAI initially delayed the full release due to concerns about potential misuse
- It helped establish the foundation for modern language models and showed the potential of unsupervised learning for NLP tasks

## Impact on AI Development
GPT-2's success triggered a series of transformative changes in the field:

- **The Scaling Race** - GPT-2 sparked an industry-wide competition to build ever larger models. This "scaling race" led to rapid advances, with organizations like Google, OpenAI, and Anthropic pushing the boundaries of model size and capability. The focus shifted from architectural innovation to scaling existing architectures effectively.

- **Emergence of Meta Learning** - One of GPT-2's most surprising findings was that it could attempt new tasks from a prompt alone, with no task-specific training. This behavior, later explored more deeply with GPT-3 and GPT-4 as "[few-shot learning](https://arxiv.org/pdf/2005.14165)" (adapting to a new task from a handful of examples in the prompt), suggested that large language models could develop *meta-learning* capabilities, learning how to learn during pre-training.

- **Emergent Capabilities** - GPT-2 began revealing what we now call "emergent abilities": capabilities that appear suddenly above certain scale thresholds. This observation, discussed in later work such as the [GPT-4 technical report](https://arxiv.org/pdf/2303.08774), suggested that scaling language models could lead to qualitatively new behaviors that are difficult to predict in advance. For instance, the ability to perform basic arithmetic or follow implicit reasoning steps emerged without explicit training for these tasks.


## Broader Implications

- The success of GPT-2 influenced the development of multimodal models, showing how scaling could benefit domains beyond text.
- It sparked important discussions about AI safety and ethics, leading to more thoughtful release strategies for powerful AI systems.
- The model demonstrated that unsupervised pre-training could capture significant world knowledge, laying groundwork for future work in knowledge representation and reasoning.

These developments fundamentally changed how researchers and organizations approach AI development, shifting focus from small, specialized models to large, general-purpose systems capable of emergent behaviors and meta-learning.

# Tokenizer

## Setup (don't read)

In [1]:
from dataclasses import dataclass
import torch

In [8]:
# Run on "mps" (Mac M series) or "cuda" (NVIDIA GPUs) if available, else run on "cpu"
device = torch.device("mps" if torch.backends.mps.is_available() 
                      else "cuda" if torch.cuda.is_available() 
                      else "cpu")

# Preprocessing: The Tokenizer 
GPT-2's input is natural language (i.e. a string of characters), but ML models take vectors as input. To bridge the gap, the **tokenizer** splits the text into units called **tokens**, and each token is then converted into a vector.

### Splitting language to tokens
A token is a substring that is a member of the **vocabulary** set. But how should we build the **vocabulary**?

Could we take every word from every dictionary ever made and make each word a token? No, this could not handle arbitrary text (e.g. typos, punctuation, URLs).

Could we just use every character available on the keyboard? No, this throws away the structure within words (e.g. "language" is more meaningful than "gangeula").

The most common practice is called **Byte-Pair Encoding** (BPE). It addresses both problems above by giving us a general way of splitting language that is also efficient. However, it is far from a perfect system, as it is the source of many quirks (e.g. models being bad at counting).

High-Level algorithm:
1. Start with an initial vocabulary of all individual characters as tokens
2. Find the most common pair of tokens in the text, merge this pair into a new token, and re-tokenize the text with the new token
3. Repeat step 2 until you reach a desired vocabulary size or no more pairs can be merged
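
The three steps above can be sketched in plain Python. This is a simplified illustration rather than GPT-2's actual tokenizer, which operates on bytes and pre-splits text with a regular expression before merging. The function names `merge_pair` and `train_bpe` are hypothetical.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    tokens = list(text)                      # step 1: one token per character
    merges = []
    for _ in range(num_merges):              # step 3: repeat
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:                        # nothing repeats, so merging gains nothing
            break
        tokens = merge_pair(tokens, pair)    # step 2: merge and re-tokenize
        merges.append(pair)
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", num_merges=3)
print(merges)   # the learned merge rules, in order
print(tokens)   # the text re-tokenized with those rules
```

On this tiny corpus the first two merges build the token `low` out of single characters, which is the same mechanism that lets frequent words become single tokens in a real vocabulary.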

<details>
<summary>Note: a space (" ") counts as a character, so merges that include a space are very common</summary>

```python
import tiktoken 

tokenizer = tiktoken.get_encoding('gpt2')

print(tokenizer.encode(" a")) # [257]
print(tokenizer.encode("a")) # [64]
print(tokenizer.encode("a ")) # [64, 220]
print(tokenizer.encode(" i")) # [1312]
print(tokenizer.encode("i")) # [72]
print(tokenizer.encode("i ")) # [72, 220]
```

</details>

### Converting tokens into vectors
This process is pretty straightforward. We can convert each token to a **one-hot encoding** over the vocabulary. A **one-hot encoding** vector is filled with zeros at every position, except for a one at the position corresponding to the token's index in the vocabulary.

A key intuition about **one-hot encodings** is that they let you treat each token index independently: two tokens are not "closer" to each other just because their integer indices happen to be numerically close.

$$
t_i = (0, \dots, 0, 1, 0, \dots, 0) \quad \text{is the one-hot encoding for the }i\text{th token (length }d_{vocab}\text{)}
$$
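
As a small illustration of the definition above, the sketch below builds a one-hot vector by hand and multiplies it by a toy embedding table. The table values and the helper names `one_hot` and `matvec` are hypothetical, chosen only for this example. The point is that multiplying a one-hot vector by a matrix simply selects one row, which is why an embedding layer can be implemented as a table lookup.

```python
def one_hot(index, d_vocab):
    """Return a length-d_vocab list with a 1 at `index` and 0 elsewhere."""
    if not 0 <= index < d_vocab:
        raise ValueError("token index out of range")
    return [1 if j == index else 0 for j in range(d_vocab)]

def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vector, matrix)) for c in range(n_cols)]

# Hypothetical toy embedding table: d_vocab = 4 tokens, 2 dimensions each
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

t_2 = one_hot(2, d_vocab=4)
print(t_2)                     # [0, 0, 1, 0]
print(matvec(t_2, embedding))  # [0.5, 0.6], the same values as embedding[2]
```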

<details>
<summary>Shortcomings of tokenization</summary>

**Capitalization and leading spaces matter**

```python
import tiktoken 

tokenizer = tiktoken.get_encoding('gpt2')

print(tokenizer.encode("Michael")) # [13256]
print(tokenizer.encode(" Michael")) # [3899]
print(tokenizer.encode(" michael")) # [285, 40302]
print(tokenizer.encode("michael")) # [76, 40302]
```

**Arithmetic does not make sense**

Common digit sequences are bundled together into single tokens, so numbers are split at arbitrary places.

```python
import tiktoken 

tokenizer = tiktoken.get_encoding('gpt2')

print(tokenizer.encode("56873+3184623=123456789-1000000000")) # [49211, 4790, 10, 36042, 3510, 1954, 28, 10163, 2231, 3134, 4531, 12, 16, 10535, 830]
```

</details>

In this notebook, we will not be implementing the tokenizer from scratch. Instead, we will import OpenAI's [tiktoken](https://github.com/openai/tiktoken) library to use the official tokenizer.

For a full walkthrough of implementing a tokenizer, check out Karpathy's video "Let's build the GPT Tokenizer".

In [None]:
import tiktoken

tokenizer = tiktoken.get_encoding('gpt2')

reference_text = "A day without laughter is a day" # "A day without laughter is a day wasted" - Charlie Chaplin
print("Reference text: " + reference_text)

tokens = tokenizer.encode(reference_text)
print("Tokenized sequence: " + str(tokens))

reconstructed_reference_text = tokenizer.decode(tokens)
print("Reconstructed reference text: " + reconstructed_reference_text)

Reference text: A day without laughter is a day
Tokenized sequence: [32, 1110, 1231, 20263, 318, 257, 1110]
Reconstructed reference text: A day without laughter is a day


### Model

Now that we have a tokenizer, we can start using the model to generate outputs. We will load the pretrained weights through Hugging Face's Transformers library.

In [4]:
from transformers import AutoModelForCausalLM

# Use Hugging Face's Transformers library to load the model weights
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Put the model on our device
model = model.to(device)

In [5]:
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
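
The layer shapes in the printout above are enough to estimate the model's parameter count by hand. The arithmetic below is a back-of-the-envelope sketch under two assumptions that the printout does not show directly: every `Conv1D` and `LayerNorm` carries a bias term, and `lm_head` shares its weight matrix with `wte` and so adds no parameters. The helper `linear` is a hypothetical name for this example.

```python
d_vocab, n_ctx, d_model, n_layers = 50257, 1024, 768, 12

def linear(n_in, n_out):
    """Parameters in a weight matrix plus a bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model                      # scale and shift
block = (
    layer_norm                                # ln_1
    + linear(d_model, 3 * d_model)            # attn.c_attn (nf=2304)
    + linear(d_model, d_model)                # attn.c_proj
    + layer_norm                              # ln_2
    + linear(d_model, 4 * d_model)            # mlp.c_fc (nf=3072)
    + linear(4 * d_model, d_model)            # mlp.c_proj
)
embeddings = d_vocab * d_model + n_ctx * d_model    # wte + wpe
total = embeddings + n_layers * block + layer_norm  # + ln_f; lm_head assumed tied to wte

print(f"{total:,}")  # 124,439,808
```

Under those assumptions the total comes to roughly 124M parameters, with about two thirds of them in the twelve transformer blocks and most of the rest in the token embedding table.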


In [6]:

reference_text = "What is the meaning of life?"
tokens = tokenizer.encode(reference_text)
tokens = torch.tensor(tokens, device=device) # Convert the Python list to a PyTorch tensor so the model accepts it
outputs = model(tokens)

In [7]:
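num_sequences = 5
max_sequence_length = 30

reference_text = "If computers were one day exceeded human level intelligence,"
tokens = tokenizer.encode(reference_text) # Python list of 10 token ids
tokens = torch.tensor(tokens, device=device) # Tensor with shape (10,), already on `device`
tokens = tokens.unsqueeze(0) # Add a batch dimension: (10,) -> (1, 10)
x = tokens.repeat(num_sequences, 1) # Copy the prompt `num_sequences` times: (1, 10) -> (5, 10)

with torch.no_grad(): # Inference only, so skip gradient tracking
    while x.size(1) < max_sequence_length:
        logits = model(x).logits # model(x) returns an output object; its .logits has shape (5, T, 50257)
        logits = logits[:, -1, :] # Keep only the last position's logits: (5, 50257)
        # The original cell stops here. One possible completion: sample the next token and append it.
        probs = torch.softmax(logits, dim=-1) # Logits -> probabilities over the vocabulary
        next_token = torch.multinomial(probs, num_samples=1) # One sampled token per sequence: (5, 1)
        x = torch.cat((x, next_token), dim=1) # Append it: (5, T) -> (5, T + 1)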


In [None]:
@dataclass
class Config:
    vocab_size: int = 50304 # 50257 (GPT-2's vocabulary size) rounded up to a multiple of 64
    block_size: int = 1024 # Maximum context length
    n_layers: int = 12
    n_head: int = 12
    n_embd: int = 768

In [None]:
import tiktoken
import torch

enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Hello, I am an language model,")
tokens = torch.tensor(tokens, dtype=torch.long)

print(tokens)
print(tokens.shape)


tokens1 = enc.encode("Hello, I am an language model, I have a total of 124M")
tokens1 = torch.tensor(tokens1, dtype=torch.long)
print(tokens1)
print(tokens1.shape)

tensor([15496,    11,   314,   716,   281,  3303,  2746,    11])
torch.Size([8])
tensor([15496,    11,   314,   716,   281,  3303,  2746,    11,   314,   423,
          257,  2472,   286, 19755,    44])
torch.Size([15])


In [None]:
logits = model(tokens.to(device)) # Move the tokens to the model's device first; mixing CPU and MPS tensors raises a RuntimeError
print(logits.logits.shape)

In [None]:
logits1 = model(tokens1.to(device)) # Move the tokens to the model's device first; mixing CPU and MPS tensors raises a RuntimeError
print(logits1.logits.shape)

In [None]:
logits2 = logits1.logits[0:8, :]
print(logits2.shape)

torch.Size([8, 50257])


In [None]:
print(torch.equal(logits.logits, logits2))

True

