# Importing OpenAI Weights

In this notebook, I'll attempt to import OpenAI's official pretrained GPT-2 weights into my own GPT model.

I'll be importing code from [gpt.ipynb](./gpt.ipynb), so refer to that when necessary.

In [1]:
import import_ipynb
# Import the notebook gpt.ipynb
import gpt # type: ignore
import torch
import numpy as np
import tiktoken

def get_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    elif torch.backends.mps.is_available():
        return torch.device("mps:0")
    else:
        return torch.device("cpu")
tokenizer = tiktoken.get_encoding("gpt2")

env: CUDA_LAUNCH_BLOCKING=1
env: CUBLAS_WORKSPACE_CONFIG=:4096:8


## Downloading the gpt_download.py script

This script was provided as part of the book Build a Large Language Model (From Scratch), which I'm following here.

In [2]:
import urllib.request
from pathlib import Path

def ensure_script():
    url = (
        "https://raw.githubusercontent.com/rasbt/"
        "LLMs-from-scratch/main/ch05/"
        "01_main-chapter-code/gpt_download.py"
    )
    filename = url.split('/')[-1]
    if Path(filename).exists():
        # nothing to do
        return
    print(f"Downloading {filename}")
    urllib.request.urlretrieve(url, filename)

ensure_script()

## Running gpt_download.py

Importing this script gives us `download_and_load_gpt2`. When it is called for a given model size, it downloads the following files:
- checkpoint
- encoder.json
- hparams.json
- model.ckpt.data-00000-of-00001
- model.ckpt.index
- model.ckpt.meta
- vocab.bpe
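
Of these files, `hparams.json` is the one that describes the architecture. As a minimal sketch, assuming the usual GPT-2 key names (`n_vocab`, `n_ctx`, `n_embd`, `n_head`, `n_layer`) and the 124M values, the mapping onto the config keys used later in this notebook might look like this. The helper `hparams_to_config` is hypothetical and is not part of `gpt_download.py`.

```python
import json

# Assumed contents of gpt2/124M/hparams.json (values for the 124M model).
HPARAMS_JSON = (
    '{"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768, '
    '"n_head": 12, "n_layer": 12}'
)

def hparams_to_config(hparams: dict) -> dict:
    """Hypothetical helper: rename OpenAI's hparams keys to this notebook's config keys."""
    return {
        "context_length": hparams["n_ctx"],
        "emb_dim": hparams["n_embd"],
        "n_heads": hparams["n_head"],
        "n_layers": hparams["n_layer"],
    }

hparams = json.loads(HPARAMS_JSON)
config_update = hparams_to_config(hparams)
print(config_update)
```

A dict like `config_update` could be passed to `.update()` on a copy of the base config, in the same way the hand-written settings are applied.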

In [3]:
from gpt_download import download_and_load_gpt2

2025-06-21 15:55:05.596477: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-21 15:55:05.858154: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750546505.992315    4662 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750546506.033466    4662 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1750546506.282526    4662 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

## Define the OpenAI model config

These are the basic hyperparameters that distinguish the various OpenAI GPT-2 models.
We'll be focusing on the 124M version, at least initially, so we'll create `GPT_SMALL`
with the right settings, followed by one config for each of the larger models.
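
As a rough sanity check on the size labels, the parameter count can be estimated from `emb_dim` and `n_layers` alone. This is only a sketch under stated assumptions: a vocabulary of 50257 tokens, a context length of 1024, a feed-forward hidden size of 4 × `emb_dim`, biases on every linear layer, and an output head that shares its weights with the token embedding. The function `count_gpt2_params` is a hypothetical helper, not something from `gpt.ipynb`.

```python
def count_gpt2_params(emb_dim: int, n_layers: int,
                      vocab_size: int = 50257,
                      context_length: int = 1024) -> int:
    """Estimate the parameter count of a GPT-2 style model with a tied output head."""
    # Token embeddings plus positional embeddings
    embeddings = vocab_size * emb_dim + context_length * emb_dim
    # Fused query/key/value projection (weights + biases)
    qkv = emb_dim * 3 * emb_dim + 3 * emb_dim
    # Attention output projection
    attn_out = emb_dim * emb_dim + emb_dim
    # Feed-forward: expand to 4 * emb_dim, then project back (assumed ratio)
    ff = (emb_dim * 4 * emb_dim + 4 * emb_dim) + (4 * emb_dim * emb_dim + emb_dim)
    # Two LayerNorms per block, each with a scale and a shift vector
    norms = 2 * 2 * emb_dim
    per_block = qkv + attn_out + ff + norms
    # Final LayerNorm after the last block
    final_norm = 2 * emb_dim
    return embeddings + n_layers * per_block + final_norm

sizes = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24},
}
for name, cfg in sizes.items():
    print(f"{name}: {count_gpt2_params(**cfg):,}")
```

Note that `n_heads` does not appear: splitting the embedding across more heads changes how attention is computed, not how many weights there are. A model that keeps a separate, untied output layer would have an extra `vocab_size * emb_dim` parameters on top of this estimate.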

In [4]:
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

GPT_SMALL: gpt.GPTConfigDict = gpt.GPT_CONFIG_124M.copy()
GPT_SMALL.update(model_configs["gpt2-small (124M)"])
# We had set the context_length to 256 before, but we need it back at 1024.
GPT_SMALL.update({"context_length": 1024})

# QKV Bias is not so popular anymore, but GPT-2 used it, so we will too.
GPT_SMALL.update({"qkv_bias": True})

GPT_CONFIG_355M: gpt.GPTConfigDict = gpt.GPT_CONFIG_124M.copy()
GPT_CONFIG_355M.update(model_configs["gpt2-medium (355M)"])
GPT_CONFIG_355M.update({"context_length": 1024, "qkv_bias": True})

GPT_CONFIG_774M: gpt.GPTConfigDict = gpt.GPT_CONFIG_124M.copy()
GPT_CONFIG_774M.update(model_configs["gpt2-large (774M)"])
GPT_CONFIG_774M.update({"context_length": 1024, "qkv_bias": True})

GPT_CONFIG_1558M: gpt.GPTConfigDict = gpt.GPT_CONFIG_124M.copy()
GPT_CONFIG_1558M.update(model_configs["gpt2-xl (1558M)"])
GPT_CONFIG_1558M.update({"context_length": 1024, "qkv_bias": True})

## Create a new model based on GPT-2 and transfer weights

This could get long, because there are a lot of layers to copy. The `assign` helper
"safely" overwrites each weight in our model: it checks that the source and
destination shapes match before replacing the parameter.
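
The two recurring steps in the loader are splitting the fused QKV matrix and transposing weights. Here is a small NumPy sketch of why both are needed, assuming the checkpoint stores the fused weight as `(in, 3 * out)` and that our linear layers follow the `torch.nn.Linear` convention of storing `(out, in)` and computing `x @ weight.T + bias`. The tiny `emb_dim` and the random values are stand-ins, not real weights.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 8  # tiny stand-in for 768

# Assumed checkpoint layout: fused QKV weight (in, 3 * out) and bias (3 * out,)
c_attn_w = rng.normal(size=(emb_dim, 3 * emb_dim))
c_attn_b = rng.normal(size=(3 * emb_dim,))

# Split the fused tensors into separate query, key and value parts
q_w, k_w, v_w = np.split(c_attn_w, 3, axis=-1)
q_b, k_b, v_b = np.split(c_attn_b, 3, axis=-1)

# A Linear-style layer stores its weight as (out, in), hence the transpose
linear_weight = q_w.T

x = rng.normal(size=(2, emb_dim))
# Query as the fused projection would compute it
q_from_fused = (x @ c_attn_w + c_attn_b)[:, :emb_dim]
# Query as a Linear-style layer would compute it from the transposed weight
q_from_linear = x @ linear_weight.T + q_b

print(q_w.shape, linear_weight.shape)
print(np.allclose(q_from_fused, q_from_linear))
```

One caveat this makes visible: the Q, K, V and attention output matrices are square, so `q_w` and `q_w.T` have the same shape. The shape check in `assign` cannot catch a forgotten `.T` on those layers; it only catches it on the non-square feed-forward weights.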

In [None]:
def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, "
                         f"Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))

def load_weights_into_gpt(model: gpt.SimplifiedGPT, params):
    # Restore the token embeddings and positional embeddings
    model.positional_embedding.weight = assign(model.positional_embedding.weight, params['wpe'])
    model.token_embedding.weight = assign(model.token_embedding.weight, params['wte'])

    # For each transformer block...
    for b in range(len(params["blocks"])):
        # ...restore the attention QKV weights
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        model.transformer_blocks[b].attention.w_query.weight = assign(
            model.transformer_blocks[b].attention.w_query.weight, q_w.T)
        model.transformer_blocks[b].attention.w_key.weight = assign(
            model.transformer_blocks[b].attention.w_key.weight, k_w.T)
        model.transformer_blocks[b].attention.w_value.weight = assign(
            model.transformer_blocks[b].attention.w_value.weight, v_w.T)

        # and the QKV biases
        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        model.transformer_blocks[b].attention.w_query.bias = assign(
            model.transformer_blocks[b].attention.w_query.bias, q_b)
        model.transformer_blocks[b].attention.w_key.bias = assign(
            model.transformer_blocks[b].attention.w_key.bias, k_b)
        model.transformer_blocks[b].attention.w_value.bias = assign(
            model.transformer_blocks[b].attention.w_value.bias, v_b)

        # and the attention output projection
        model.transformer_blocks[b].attention.w_out.weight = assign(
            model.transformer_blocks[b].attention.w_out.weight, 
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        model.transformer_blocks[b].attention.w_out.bias = assign(
            model.transformer_blocks[b].attention.w_out.bias, 
            params["blocks"][b]["attn"]["c_proj"]["b"])

        # and the FeedForward layer weights
        model.transformer_blocks[b].feedforward.layers[0].weight = assign(
            model.transformer_blocks[b].feedforward.layers[0].weight, 
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        model.transformer_blocks[b].feedforward.layers[0].bias = assign(
            model.transformer_blocks[b].feedforward.layers[0].bias, 
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        model.transformer_blocks[b].feedforward.layers[2].weight = assign(
            model.transformer_blocks[b].feedforward.layers[2].weight, 
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        model.transformer_blocks[b].feedforward.layers[2].bias = assign(
            model.transformer_blocks[b].feedforward.layers[2].bias, 
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        # and the LayerNorm scale and shift weights
        model.transformer_blocks[b].layer_norm_1.scale = assign(
            model.transformer_blocks[b].layer_norm_1.scale, 
            params["blocks"][b]["ln_1"]["g"])
        model.transformer_blocks[b].layer_norm_1.shift = assign(
            model.transformer_blocks[b].layer_norm_1.shift, 
            params["blocks"][b]["ln_1"]["b"])
        model.transformer_blocks[b].layer_norm_2.scale = assign(
            model.transformer_blocks[b].layer_norm_2.scale, 
            params["blocks"][b]["ln_2"]["g"])
        model.transformer_blocks[b].layer_norm_2.shift = assign(
            model.transformer_blocks[b].layer_norm_2.shift, 
            params["blocks"][b]["ln_2"]["b"])

    # and finally, restore the final norm scale and shift layers
    model.layer_norm.scale = assign(model.layer_norm.scale, params["g"])
    model.layer_norm.shift = assign(model.layer_norm.shift, params["b"])

    # GPT-2 ties the output head to the token embedding matrix, so reuse `wte` here.
    model.output.weight = assign(model.output.weight, params["wte"])


def load_openai_model(config: gpt.GPTConfigDict, size: str) -> gpt.GPTModel:
    settings, params = download_and_load_gpt2(
        model_size=size, models_dir="gpt2"
    )
    model = gpt.SimplifiedGPT(config)
    model.eval()
    load_weights_into_gpt(model, params)
    trainable_model = gpt.GPTModel(config, gpt.DEFAULT_TRAINING_CONFIG, model=model, force_cpu=True)
    print(f"{size} model loaded.")
    return trainable_model

# Uncomment one of the following to load that model
# model = load_openai_model(GPT_SMALL, "124M")
model = load_openai_model(GPT_CONFIG_355M, "355M")
# model = load_openai_model(GPT_CONFIG_774M, "774M")
# model = load_openai_model(GPT_CONFIG_1558M, "1558M") # needs force_cpu=True on my system or it crashes


File already exists and is up-to-date: gpt2/355M/checkpoint
File already exists and is up-to-date: gpt2/355M/encoder.json
File already exists and is up-to-date: gpt2/355M/hparams.json
File already exists and is up-to-date: gpt2/355M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/355M/model.ckpt.index
File already exists and is up-to-date: gpt2/355M/model.ckpt.meta
File already exists and is up-to-date: gpt2/355M/vocab.bpe
355M model loaded.


In [6]:
gpt.manual_seed(123)
gpt.prompt(model, "GitHub is", temperature=1.5, max_tokens=1024)

GitHub is the place for organize our parts so please consider coming. If you want have a look around it. As you may
observe I did not write a whole new article about it, just took a little while to prepare things, as it has been quite
the last few months. Please use my work you got from this article. And feel free to ask me anything because they're all
useful. If you have questions please refer back to article where GitBook is discussed. Also you can download my book
"Kudos" [download it here, and don´t miss it. It is recommended for this article]. Just enjoy working in GitBook. Enjoy!
Posted by Greg Smith at 13:55


In [None]:
import textwrap

def chat_gpt(model, prompt, temperature=1.5, max_tokens=128):
    base = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n"
    tail = "\n\n### Response:\n"
    model_input = base + prompt + tail
    completion = model.prompt(model_input, temperature=temperature, max_tokens=max_tokens)
    response = completion[len(model_input):].strip()
    print(textwrap.fill(response, width=120))
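
The slicing in `chat_gpt` assumes that `model.prompt` returns the prompt text followed by the generated text. Here is a minimal sketch of that assumption, using a hypothetical `EchoModel` stub in place of the real model so that it runs without any weights. The names `EchoModel` and `extract_response` are made up for illustration.

```python
class EchoModel:
    """Hypothetical stand-in: returns the prompt followed by a fixed continuation."""

    def prompt(self, text: str, temperature: float = 1.0, max_tokens: int = 128) -> str:
        return text + " 4 \n"

BASE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
    "\n\n### Instruction:\n"
)
TAIL = "\n\n### Response:\n"

def extract_response(model, prompt: str) -> str:
    model_input = BASE + prompt + TAIL
    completion = model.prompt(model_input)
    # Drop the echoed prompt and keep only the generated part
    return completion[len(model_input):].strip()

response = extract_response(EchoModel(), "Answer this question: what is 2+2?")
print(repr(response))
```

If the real `prompt` method returned only the newly generated text, the slice would cut off the start of the response instead, so this is worth checking against `gpt.ipynb`.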

In [20]:
chat_gpt(model, "Answer this question: what is 2+2?")

As mentioned above a response may not even be submitted by application. One reason could be because of memory capacity
limitation or time-out caused by browser rendering the link request while waiting in browser response.  If users of
browsers are expecting to retrieve this data with HTML, it is recommended that to provide valid HTML code for that part
of request.  An effective approach of creating code (HTML or image: nth c# #if !(?=!HTML)(?:\  ){function* response-
tally-html-t1 (i:Integer){if(t1){$((this == "Response: " ?  Response : "<script type= 'text/javascript'> %7Escript
<video id= \" '+ t1+' "' />' ');}}}}",t1: 0}else response-tally-html-t2 (i:Integer): $("iframe<(i<=?2||>=?2)+\">
"){if(-j1){(i : t2 ? : t1).= jQuery.(i)</t2>(;) }try if(t2){$("p.post("*p://[a-z]*
