# Importing OpenAI Weights

In this notebook, I'll try to load OpenAI's official pretrained GPT-2 weights into my own GPT model.

I'll be importing code from [gpt.ipynb](./gpt.ipynb), so refer to that when necessary.

In [12]:
import import_ipynb
# Import the notebook gpt.ipynb
import gpt # type: ignore
import torch
import numpy as np
import tiktoken

def get_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    elif torch.backends.mps.is_available():
        return torch.device("mps")
    else:
        return torch.device("cpu")

tokenizer = tiktoken.get_encoding("gpt2")

## Downloading the gpt_download.py script

This script is provided with the book *Build a Large Language Model (From Scratch)* by Sebastian Raschka, which I'm following here.

In [13]:
import urllib.request
from pathlib import Path

def ensure_script():
    url = (
        "https://raw.githubusercontent.com/rasbt/"
        "LLMs-from-scratch/main/ch05/"
        "01_main-chapter-code/gpt_download.py"
    )
    filename = url.split('/')[-1]
    if Path(filename).exists():
        # nothing to do
        return
    print(f"Downloading {filename}")
    urllib.request.urlretrieve(url, filename)

ensure_script()

## Running gpt_download.py

For the 124M model, the script downloads the following files into `gpt2/124M` (skipping any that are already present and up to date):
- checkpoint
- encoder.json
- hparams.json
- model.ckpt.data-00000-of-00001
- model.ckpt.index
- model.ckpt.meta
- vocab.bpe
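
As a quick check after the download, something like the following could confirm that all seven files are in place. This is only a sketch: `missing_checkpoint_files` is a hypothetical helper, not part of `gpt_download.py`, and the `gpt2/124M` directory is assumed from the `models_dir` and `model_size` used in this notebook.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]

def missing_checkpoint_files(model_dir):
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]

# Assumed location; adjust if models_dir or model_size differ.
missing = missing_checkpoint_files("gpt2/124M")
print("Missing files:", missing if missing else "none")
```

An empty list means every file is present; it says nothing about whether the files are complete or uncorrupted.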

In [14]:
from gpt_download import download_and_load_gpt2
settings, params = download_and_load_gpt2(
    model_size="124M", models_dir="gpt2"
)
assert settings['n_embd'] == 768

File already exists and is up-to-date: gpt2/124M/checkpoint
File already exists and is up-to-date: gpt2/124M/encoder.json
File already exists and is up-to-date: gpt2/124M/hparams.json
File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/124M/model.ckpt.index
File already exists and is up-to-date: gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: gpt2/124M/vocab.bpe


## Define the OpenAI model config

The GPT-2 model sizes differ in three basic hyperparameters: embedding dimension, number of transformer layers, and number of attention heads.
We'll focus on the 124M version, at least initially, so we'll create `NEW_CONFIG`
with the matching settings.

In [15]:
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

NEW_CONFIG: gpt.GPTConfigDict = gpt.GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs["gpt2-small (124M)"])
# We had set the context_length to 256 before, but we need it back at 1024.
NEW_CONFIG.update({"context_length": 1024})

# QKV Bias is not so popular anymore, but GPT-2 used it, so we will too.
NEW_CONFIG.update({"qkv_bias": True})
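
As a sanity check on these settings, the sketch below estimates each model's parameter count from the config alone. It rests on assumptions not stated in this notebook: a vocabulary of 50,257 tokens, a feed-forward hidden size of 4 × `emb_dim`, biases on every linear layer, and (by default) an output head that shares the token embedding matrix. `count_params` is a hypothetical helper, not something from `gpt.ipynb`.

```python
# Rough parameter count for a GPT-2 style model, from the config values alone.
VOCAB_SIZE = 50257      # assumption: GPT-2 BPE vocabulary size
CONTEXT_LENGTH = 1024   # matches the context_length set in NEW_CONFIG

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

def count_params(emb_dim, n_layers, tied_head=True):
    d = emb_dim
    # Token embeddings plus positional embeddings.
    embeddings = VOCAB_SIZE * d + CONTEXT_LENGTH * d
    # Fused QKV projection plus attention output projection (weights + biases).
    attention = (d * 3 * d + 3 * d) + (d * d + d)
    # Two feed-forward layers; assumption: hidden size is 4 * emb_dim.
    feedforward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d)
    block = attention + feedforward + layer_norms
    final_norm = 2 * d
    # A tied head reuses the token embedding matrix and adds no parameters.
    head = 0 if tied_head else VOCAB_SIZE * d
    return embeddings + n_layers * block + final_norm + head

for name, cfg in model_configs.items():
    total = count_params(cfg["emb_dim"], cfg["n_layers"])
    head_dim = cfg["emb_dim"] // cfg["n_heads"]
    print(f"{name}: {total:,} parameters, head_dim={head_dim}")
```

Under these assumptions the small config comes to 124,439,808 parameters with a shared head and 163,037,184 with a separate one, and every size works out to a per-head dimension of 64.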

## Create a new model based on GPT-2 and transfer weights

This cell is long because every layer has to be copied over individually. The `assign`
helper makes the overwrite "safe": it raises an error if the downloaded array's shape
doesn't match the parameter it is replacing.

In [16]:
def assign(left, right):
    # Refuse to overwrite a parameter with an array of a different shape.
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, "
                         f"Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))

def load_weights_into_gpt(model: gpt.SimplifiedGPT, params):
    # Restore the token embeddings and positional embeddings
    model.positional_embedding.weight = assign(model.positional_embedding.weight, params['wpe'])
    model.token_embedding.weight = assign(model.token_embedding.weight, params['wte'])

    # For each transformer block...
    for b in range(len(params["blocks"])):
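        # ...restore the attention QKV weights. The checkpoint stores Q, K and V
        # fused in one (in, 3 * out) array, so split it and transpose each part.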
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        model.transformer_blocks[b].attention.w_query.weight = assign(
            model.transformer_blocks[b].attention.w_query.weight, q_w.T)
        model.transformer_blocks[b].attention.w_key.weight = assign(
            model.transformer_blocks[b].attention.w_key.weight, k_w.T)
        model.transformer_blocks[b].attention.w_value.weight = assign(
            model.transformer_blocks[b].attention.w_value.weight, v_w.T)

        # and the QKV biases
        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        model.transformer_blocks[b].attention.w_query.bias = assign(
            model.transformer_blocks[b].attention.w_query.bias, q_b)
        model.transformer_blocks[b].attention.w_key.bias = assign(
            model.transformer_blocks[b].attention.w_key.bias, k_b)
        model.transformer_blocks[b].attention.w_value.bias = assign(
            model.transformer_blocks[b].attention.w_value.bias, v_b)

        # and the attention output projection
        model.transformer_blocks[b].attention.w_out.weight = assign(
            model.transformer_blocks[b].attention.w_out.weight, 
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        model.transformer_blocks[b].attention.w_out.bias = assign(
            model.transformer_blocks[b].attention.w_out.bias, 
            params["blocks"][b]["attn"]["c_proj"]["b"])

        # and the FeedForward layer weights
        model.transformer_blocks[b].feedforward.layers[0].weight = assign(
            model.transformer_blocks[b].feedforward.layers[0].weight, 
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        model.transformer_blocks[b].feedforward.layers[0].bias = assign(
            model.transformer_blocks[b].feedforward.layers[0].bias, 
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        model.transformer_blocks[b].feedforward.layers[2].weight = assign(
            model.transformer_blocks[b].feedforward.layers[2].weight, 
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        model.transformer_blocks[b].feedforward.layers[2].bias = assign(
            model.transformer_blocks[b].feedforward.layers[2].bias, 
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        # and the LayerNorm scale and shift weights
        model.transformer_blocks[b].layer_norm_1.scale = assign(
            model.transformer_blocks[b].layer_norm_1.scale, 
            params["blocks"][b]["ln_1"]["g"])
        model.transformer_blocks[b].layer_norm_1.shift = assign(
            model.transformer_blocks[b].layer_norm_1.shift, 
            params["blocks"][b]["ln_1"]["b"])
        model.transformer_blocks[b].layer_norm_2.scale = assign(
            model.transformer_blocks[b].layer_norm_2.scale, 
            params["blocks"][b]["ln_2"]["g"])
        model.transformer_blocks[b].layer_norm_2.shift = assign(
            model.transformer_blocks[b].layer_norm_2.shift, 
            params["blocks"][b]["ln_2"]["b"])

    # and finally, restore the final norm scale and shift layers
    model.layer_norm.scale = assign(model.layer_norm.scale, params["g"])
    model.layer_norm.shift = assign(model.layer_norm.shift, params["b"])

    # GPT-2 ties the output head to the token embeddings, so the head is
    # loaded from "wte" as well (as a separate copy here, not a shared tensor).
    model.output.weight = assign(model.output.weight, params["wte"])


model = gpt.SimplifiedGPT(NEW_CONFIG)
model.eval()
load_weights_into_gpt(model.float(), params)
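
The split-and-transpose step for the attention weights is easy to get wrong, so here is a small NumPy-only sketch of why it works. It uses random stand-in arrays rather than the real checkpoint, and it assumes that the fused `c_attn` weight has shape `(emb_dim, 3 * emb_dim)` and that the target layers behave like `torch.nn.Linear`, which stores its weight as `(out, in)` and computes `x @ W.T + b`. The `linear` function below is a hypothetical stand-in for such a layer.

```python
import numpy as np

emb_dim = 768
rng = np.random.default_rng(0)

# Stand-ins for params["blocks"][b]["attn"]["c_attn"]["w"] and ["b"].
c_attn_w = rng.standard_normal((emb_dim, 3 * emb_dim))
c_attn_b = rng.standard_normal(3 * emb_dim)

# Split the fused arrays into query, key and value parts along the last axis.
q_w, k_w, v_w = np.split(c_attn_w, 3, axis=-1)
q_b, k_b, v_b = np.split(c_attn_b, 3, axis=-1)

def linear(x, weight, bias):
    # Mimics a Linear layer whose weight is stored as (out, in).
    return x @ weight.T + bias

x = rng.standard_normal((4, emb_dim))

# What the fused projection computes in one step.
fused_out = x @ c_attn_w + c_attn_b

# What three separate layers compute once each part has been transposed.
separate_out = np.concatenate(
    [linear(x, q_w.T, q_b), linear(x, k_w.T, k_b), linear(x, v_w.T, v_b)],
    axis=-1,
)

print(q_w.shape)                             # shape of each split part
print(np.allclose(fused_out, separate_out))  # the two paths agree
```

Because each of the Q, K and V matrices is square, the shape check in `assign` cannot detect a forgotten transpose there; a numerical comparison like this one is what would reveal it.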

In [17]:
gpt.manual_seed(123)
trainable_model = gpt.GPTModel(NEW_CONFIG, model=model, force_cpu=False)
gpt.prompt(trainable_model, "Every effort moves you", temperature=1.5, max_tokens=25)

Every effort moves you as far as the hand can go until the end of your turn unless something interrupts your control
flow. As you may observe I
