# 🤖 Model

This notebook describes and prepares the models used in this experiment suite.

## Setup 

In [5]:
%reload_ext autoreload
%autoreload 2

In [6]:
import autorootcwd

In [7]:
import warnings 
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=FutureWarning)

In [9]:
from src.config import ModelConfig
from src.model import GPT2, GPT2Config

import torch
import pandas as pd

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, pipeline

## GPT-2 Family

We want to use a family of LLMs for our experiments. A good candidate is GPT-2:

- Family of models with different, but not too large sizes (124M, 355M, 774M, 1.5B)
- Openly available paper [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- Open-source weights available on [Hugging Face](https://huggingface.co/openai-community/gpt2)
- Minimal custom PyTorch implementation available in [NanoGPT](https://github.com/karpathy/nanoGPT), together with performance numbers and validation on common benchmarks

The only drawback seems to be that the tokenizer is a bit simplistic, but it will be good enough for our purposes. Let's get familiar with the model family by loading its weights and running some inference.

In [None]:
def get_gpt2_num_params(config: GPT2Config):
    d = config.n_embd
    # Token embeddings, position embeddings, and the final LayerNorm weight
    non_layer_params = config.vocab_size * d + config.block_size * d + d
    # Per block: attention (4d^2), MLP (8d^2), and two LayerNorm weights (2d)
    layer_params = (4 * d * d) + (8 * d * d) + (2 * d)
    if config.bias:
        non_layer_params += d  # final LayerNorm bias
        # Attention biases (3d + d), MLP biases (4d + d), two LayerNorm biases (2d)
        layer_params += (4 * d) + (5 * d) + (2 * d)
    return config.n_layer * layer_params + non_layer_params

models = ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]
cols = {"Num. Layers": "n_layer", "Num. Heads": "n_head", "Embedding Dim.": "n_embd"}
rows = []
for model_name in models:
    config = AutoConfig.from_pretrained(f"openai-community/{model_name}")
    filtered_config = {k: v for k, v in config.__dict__.items() if k in cols.values()}
    row = [filtered_config[key] for key in cols.values()]
    row.extend([int(get_gpt2_num_params(GPT2Config(**filtered_config)) / 1e6)])
    rows.append(row)
pd.DataFrame(rows, index=models, columns=list(cols.keys()) + ["Num. Params (M)"])

Nice, we get models between 124M and 1.5B parameters, which is perfect for our experiments. They differ only in the number of layers, the number of heads, and the embedding dimension.
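
As a sanity check on the table, the count can be reproduced by hand. The snippet below is a minimal sketch, not the code from `src.model`: `SketchConfig` is a hypothetical stand-in for `GPT2Config`, and the defaults (vocabulary of 50,257 tokens, context length of 1,024, biases enabled) are assumed to match the published GPT-2 checkpoints.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for GPT2Config with assumed GPT-2 defaults."""
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    vocab_size: int = 50257
    block_size: int = 1024
    bias: bool = True


def count_params(cfg: SketchConfig) -> int:
    d = cfg.n_embd
    # Token and position embeddings (the LM head is tied to the token embedding)
    embeddings = cfg.vocab_size * d + cfg.block_size * d
    attention = 3 * d * d + d * d   # fused QKV projection + output projection
    mlp = 4 * d * d + 4 * d * d     # up projection + down projection
    layer = attention + mlp + 2 * d  # plus two LayerNorm weights
    final_norm = d
    if cfg.bias:
        # Attention biases (3d + d), MLP biases (4d + d), two LayerNorm biases
        layer += (3 * d + d) + (4 * d + d) + 2 * d
        final_norm += d
    return embeddings + cfg.n_layer * layer + final_norm


small = count_params(SketchConfig())
medium = count_params(SketchConfig(n_layer=24, n_head=16, n_embd=1024))
print(f"{small:,}")   # 124,439,808
print(f"{medium:,}")  # 354,823,168
```

Note that `n_head` does not enter the count: the heads only partition the embedding dimension, so they change the shape of the attention computation but not the number of weights.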

## Hugging Face GPT-2


In [None]:
# Load GPT-2 (124M) from HF
model_name = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_gpt2 = AutoModelForCausalLM.from_pretrained(model_name)

print(f"Loaded {hf_gpt2.config._name_or_path} with {hf_gpt2.num_parameters() / 1e6:.2f}M parameters")

In [None]:
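# Generate a sequence (use the GPU if one is available, otherwise the CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text-generation", model=hf_gpt2, tokenizer=tokenizer, pad_token_id=tokenizer.eos_token_id, max_new_tokens=10, device=device)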
generated = pipe("Hello World!")
print(generated[0]["generated_text"])

## PyTorch GPT-2

Based on the ~250 LOC implementation of GPT-2 from [NanoGPT](https://github.com/karpathy/nanoGPT), we have a custom PyTorch model that can load checkpoints from, and save checkpoints to, Hugging Face.

In [None]:
# Initialize custom GPT-2 (PyTorch) with random weights
tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2 = GPT2(GPT2Config())

print(f"Initialized GPT-2 with {gpt2.num_parameters() / 1e6:.2f}M parameters")

In [None]:
tokenizer.decode(gpt2.generate(tokenizer.encode("Hello World!", return_tensors="pt"), 10)[0].tolist())

In [None]:
# Load custom GPT-2 from the pre-trained Hugging Face weights
gpt2 = GPT2.from_hf(hf_gpt2)
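
One detail that a conversion from the Hugging Face checkpoint usually has to handle: the Hugging Face GPT-2 stores its attention and MLP projections in `Conv1D` modules, whose weights have shape `(in_features, out_features)`, whereas `nn.Linear` expects `(out_features, in_features)`. The sketch below illustrates the idea on plain nested lists. Whether `GPT2.from_hf` works exactly like this is an assumption, and `needs_transpose` and `transpose` are hypothetical helpers written only for this illustration.

```python
# Suffixes of the HF GPT-2 weights that are stored in Conv1D layout
CONV1D_SUFFIXES = (
    "attn.c_attn.weight",
    "attn.c_proj.weight",
    "mlp.c_fc.weight",
    "mlp.c_proj.weight",
)


def needs_transpose(key: str) -> bool:
    """Return True if the HF weight must be transposed for an nn.Linear layer."""
    return key.endswith(CONV1D_SUFFIXES)


def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*matrix)]


# Toy "state dict": a 2x3 Conv1D weight and a LayerNorm weight
hf_state = {
    "transformer.h.0.attn.c_attn.weight": [[1, 2, 3], [4, 5, 6]],
    "transformer.h.0.ln_1.weight": [1.0, 1.0],
}

converted = {
    key: transpose(value) if needs_transpose(key) else value
    for key, value in hf_state.items()
}
print(converted["transformer.h.0.attn.c_attn.weight"])  # [[1, 4], [2, 5], [3, 6]]
```

With real tensors the same step would be a single `.t()` call per matching key; all other weights can be copied as they are.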

In [None]:
print(tokenizer.decode(gpt2.generate(tokenizer.encode("Hello World!", return_tensors="pt"), 10)[0].tolist()))

## PyTorch <> HF Conversion

The goal is to convert freely between PyTorch and Hugging Face models. Let's test this by saving the PyTorch model in Hugging Face format, loading it back, and generating a sequence with the reloaded model.

In [None]:
# Round trip: save the local PyTorch model in Hugging Face format and load it back
gpt2.save_pretrained("gpt2")

gpt2 = GPT2.from_pretrained("gpt2")
print(tokenizer.decode(gpt2.generate(tokenizer.encode("Hello World!", return_tensors="pt"), 10)[0].tolist()))

## Push to Hugging Face Hub

Finally, let's push fresh versions of all model sizes to Hugging Face.

In [None]:
# Push GPT-2 Small
with open("configs/model/gpt2-small.toml", "r") as f:
    model_config = ModelConfig(**dict(map(lambda s: s.strip().split(" = "), f.readlines())))

# Initialize fresh GPT-2 Small
gpt2_small = GPT2(GPT2Config(**model_config.dict()))

# Push to HF Hub
repo_name = "gpt2-small-fresh"
gpt2_small.push_to_hub(repo_name, use_auth_token=True)
tokenizer.push_to_hub(repo_name, use_auth_token=True)

In [None]:
# Push GPT-2 Medium
with open("configs/model/gpt2-medium.toml", "r") as f:
    model_config = ModelConfig(**dict(map(lambda s: s.strip().split(" = "), f.readlines())))

# Initialize fresh GPT-2 Medium
gpt2_medium = GPT2(GPT2Config(**model_config.dict()))

# Push to HF Hub
repo_name = "gpt2-medium-fresh"
gpt2_medium.push_to_hub(repo_name, use_auth_token=True)
tokenizer.push_to_hub(repo_name, use_auth_token=True)

In [None]:
# Push GPT-2 Large
with open("configs/model/gpt2-large.toml", "r") as f:
    model_config = ModelConfig(**dict(map(lambda s: s.strip().split(" = "), f.readlines())))

# Initialize fresh GPT-2 Large
gpt2_large = GPT2(GPT2Config(**model_config.dict()))

# Push to HF Hub
repo_name = "gpt2-large-fresh"
gpt2_large.push_to_hub(repo_name, use_auth_token=True)
tokenizer.push_to_hub(repo_name, use_auth_token=True)

In [None]:
# Push GPT-2 XL
with open("configs/model/gpt2-xl.toml", "r") as f:
    model_config = ModelConfig(**dict(map(lambda s: s.strip().split(" = "), f.readlines())))

# Initialize fresh GPT-2 XL
gpt2_xl = GPT2(GPT2Config(**model_config.dict()))

# Push to HF Hub
repo_name = "gpt2-xl-fresh"
gpt2_xl.push_to_hub(repo_name, use_auth_token=True)
tokenizer.push_to_hub(repo_name, use_auth_token=True)

## Load from Hugging Face Hub

Let's test if we can load the models from the Hugging Face Hub again.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("mikasenghaas/gpt2-small-fresh")
gpt2_small = GPT2.from_pretrained("mikasenghaas/gpt2-small-fresh")

In [None]:
print(tokenizer.decode(gpt2_small.generate(tokenizer.encode("Hello World!", return_tensors="pt"), 10)[0].tolist()))

## Model Sharding

In [25]:
from src.utils import get_model, get_sharded_model

model_config = ModelConfig(n_layer=12, n_head=12, n_embd=768, parameter_sharing=False)
model = get_model(model_config)
print(f"Loaded GPT-2 with {model.num_parameters() / 1e6:.2f}M parameters")

Loaded GPT-2 with 162.32M parameters


In [26]:
# Generate fake world
from src.world import World

world0 = World(local_rank=0, world_size=2, debug=True)
world1 = World(local_rank=1, world_size=2, debug=True)

In [27]:
# Shard model
from copy import deepcopy

shard0 = get_sharded_model(deepcopy(model), world0)
shard1 = get_sharded_model(deepcopy(model), world1)

print(f"Shard 0: {shard0.num_parameters() / 1e6:.2f}M parameters")
print(f"Shard 1: {shard1.num_parameters() / 1e6:.2f}M parameters")
assert shard0.num_parameters() + shard1.num_parameters() == model.num_parameters()

Shard 0: 81.16M parameters
Shard 1: 81.16M parameters
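
To build intuition for what a shard contains, here is a minimal sketch of one common way to assign transformer blocks to pipeline stages: contiguous, near-equal ranges. This is not the implementation of `get_sharded_model`; `layer_range` is a hypothetical helper, and the actual split in `src.utils` may differ.

```python
def layer_range(n_layer: int, world_size: int, rank: int) -> range:
    """Return the contiguous block indices assigned to one pipeline stage."""
    per_rank, extra = divmod(n_layer, world_size)
    # The first `extra` ranks each take one additional block
    start = rank * per_rank + min(rank, extra)
    stop = start + per_rank + (1 if rank < extra else 0)
    return range(start, stop)


for rank in range(2):
    print(rank, list(layer_range(12, 2, rank)))
# 0 [0, 1, 2, 3, 4, 5]
# 1 [6, 7, 8, 9, 10, 11]
```

In such a scheme the first stage would additionally hold the embeddings, and the last stage the final LayerNorm and the LM head.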


Note that the original GPT-2 model shares its weights between the token embedding and the LM head. In pipeline-parallel training this sharing is harder to implement, because the two layers sit at opposite ends of the model and therefore typically on different shards, so we default to not sharing parameters for simplicity and comparability between methods. However, the resulting difference in parameter count may make comparisons with GPT-2 baselines difficult.
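
The size of this gap can be estimated with a little arithmetic. The sketch below rests on three assumptions that are not stated in the notebook: a vocabulary padded to 50,304 tokens, biases enabled, and a parameter count that leaves out the position embeddings (as NanoGPT does by default). Under these assumptions the arithmetic reproduces the 162.32M printed above, but the actual configuration in `src` may differ.

```python
n_layer, d = 12, 768
vocab_size = 50304  # ASSUMPTION: vocabulary padded to a multiple of 64

# Per block: 12d^2 weights, 9d projection biases, 4d LayerNorm weights and biases
per_layer = 12 * d * d + 13 * d
blocks = n_layer * per_layer
final_norm = 2 * d                 # weight and bias
token_embedding = vocab_size * d   # same size as the bias-free LM head

# ASSUMPTION: position embeddings are left out of the count
untied = token_embedding + blocks + final_norm + token_embedding
tied = untied - token_embedding

print(f"untied: {untied / 1e6:.2f}M")                # untied: 162.32M
print(f"tied:   {tied / 1e6:.2f}M")                  # tied:   123.69M
print(f"extra:  {(untied - tied) / 1e6:.2f}M")       # extra:  38.63M
```

In other words, untying the LM head adds one extra `vocab_size * n_embd` matrix, roughly 39M parameters for the smallest model.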

In [35]:
# Forward pass
input_ids = torch.randint(0, 50257, (1, 1024))
model_out = model.forward(input_ids=input_ids)
shard_out = shard1.forward(hidden_states=shard0.forward(input_ids=input_ids))

assert torch.allclose(model_out, shard_out)

Nice! We have successfully sharded the model and get equivalent forward passes. Let's check the backward pass.