# LLM Internals

I, George Luiz Bittencourt (gbittencourt@microsoft.com), created this notebook for my talk on how large language models work.

The instructions to set up the environment were taken from [this notebook](https://colab.research.google.com/drive/1Gsgdydt4KgTm3S_Dbc_Gz08S2mQG4G8X?usp=sharing#scrollTo=P5dJV3xsu_89). Thanks to the authors!

## Setup

Support for mxfp4 in transformers is still bleeding edge, so we need recent versions of PyTorch and CUDA to be able to install the `mxfp4` Triton kernels.

We also need to install transformers from source, and we uninstall `torchvision` and `torchaudio` to avoid dependency conflicts.

In [None]:
%pip install --upgrade torch accelerate kernels
%pip install git+https://github.com/huggingface/transformers triton==3.4 git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
%pip uninstall torchvision torchaudio -y
%pip list | grep -E "transformers|triton|torch|accelerate|kernels"

## Load the model from Hugging Face

The following code will load the model [openai/gpt-oss-20b](https://hf.co/openai/gpt-oss-20b) from Hugging Face using the transformers library.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, Mxfp4Config

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)
print(config)

quantization_config = Mxfp4Config.from_dict(config.quantization_config)
print(quantization_config)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype="auto",
    device_map="cuda",
)

GptOssConfig {
  "architectures": [
    "GptOssForCausalLM"
  ],
  "attention_bias": true,
  "attention_dropout": 0.0,
  "eos_token_id": 200002,
  "experts_per_token": 4,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2880,
  "initial_context_length": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 2880,
  "layer_types": [
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention"
  ],
  "max_position_embeddings": 131072,
  "model_type": "gpt_oss",
  "num_attention_head

## Sample Request

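The following string is a sample request to a coding assistant (the system prompt identifies it as GitHub Copilot in VS Code) that can call two GitHub tools. The body was copied as-is from a real request to OpenAI.

The string is parsed into a Python object, which is used throughout the rest of the notebook.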

In [2]:
import json, pprint

request_str = """
{
	"messages": [
		{
			"role": "system",
			"content": " You are an expert AI programming assistant, working with a user in the VS Code editor. When asked for your name, you must respond with 'GitHub Copilot'. Follow the user's requirements carefully & to the letter. Follow Microsoft content policies. Avoid content that violates copyrights. If you are asked to generate content that is harmful, hateful, racist, sexist, lewd, or violent, only respond with 'Sorry, I can't assist with that.' Keep your answers short and impersonal. <instructions> You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks. The user will ask a question, or ask you to perform a task, and it may require lots of research to answer correctly. There is a selection of tools that let you perform actions or retrieve helpful context to answer the user's question. You will be given some context and attachments along with the user prompt. You can use them if they are relevant to the task, and ignore them if not. If you can infer the project type (languages, frameworks, and libraries) from the user's query or the context that you have, make sure to keep them in mind when making changes. If the user wants you to implement a feature and they have not specified the files to edit, first break down the user's request into smaller concepts and think about the kinds of files you need to grasp each concept. If you aren't sure which tool is relevant, you can call multiple tools. You can call tools repeatedly to take actions or gather as much context as needed until you have completed the task fully. Don't give up unless you are sure the request cannot be fulfilled with the tools you have. It's YOUR RESPONSIBILITY to make sure that you have done all you can to collect necessary context. When reading files, prefer reading large meaningful chunks rather than consecutive small sections to minimize tool calls and gain better context. 
Don't make assumptions about the situation- gather context first, then perform the task or answer the question. Think creatively and explore the workspace in order to make a complete fix. Don't repeat yourself after a tool call, pick up where you left off. You don't need to read a file if it's already provided in context. </instructions> "
		},
		{
			"role": "user",
			"content": "quais os meus repositórios?"
		}
	],
	"max_completion_tokens": 500,
	"temperature": 1,
	"top_p": 1,
	"frequency_penalty": 0,
	"presence_penalty": 0,
	"model": "openai/gpt-oss-20b",
	"reasoning_effort": "medium",
	"tools": [
		{
			"function": {
				"name": "mcp_github_get_me",
				"description": "Get details of the authenticated GitHub user. Use this when a request is about the user's own profile for GitHub. Or when information is missing to build other tool calls.",
				"parameters": {
					"properties": {},
					"type": "object"
				}
			},
			"type": "function"
		},
		{
			"function": {
				"name": "mcp_github_search_repositories",
				"description": "Find GitHub repositories by name, description, readme, topics, or other metadata. Perfect for discovering projects, finding examples, or locating specific repositories across GitHub.",
				"parameters": {
					"properties": {
						"page": {
							"description": "Page number for pagination (min 1)",
							"minimum": 1,
							"type": "number"
						},
						"perPage": {
							"description": "Results per page for pagination (min 1, max 100)",
							"maximum": 100,
							"minimum": 1,
							"type": "number"
						},
						"query": {
							"description": "Repository search query. Examples: 'machine learning in:name stars:>1000 language:python', 'topic:react', 'user:facebook'. Supports advanced search syntax for precise filtering.",
							"type": "string"
						}
					},
					"required": [
						"query"
					],
					"type": "object"
				}
			},
			"type": "function"
		}
	]
}
"""

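# strict=False lets json.loads accept the literal line break inside the system prompt string.
request = json.loads(request_str, strict=False)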
pprint.pp(request)

{'messages': [{'role': 'system',
               'content': ' You are an expert AI programming assistant, '
                          'working with a user in the VS Code editor. When '
                          "asked for your name, you must respond with 'GitHub "
                          "Copilot'. Follow the user's requirements carefully "
                          '& to the letter. Follow Microsoft content policies. '
                          'Avoid content that violates copyrights. If you are '
                          'asked to generate content that is harmful, hateful, '
                          'racist, sexist, lewd, or violent, only respond with '
                          "'Sorry, I can't assist with that.' Keep your "
                          'answers short and impersonal. <instructions> You '
                          'are a highly sophisticated automated coding agent '
                          'with expert-level knowledge across many different '
                       

## Tokenizer

A tokenizer in a language model splits text into smaller units—like words, subwords, or characters—so the model can process and understand language effectively. Common tokenization methods include:

1. **Whitespace-based**: splits by spaces and punctuation.
2. **Character-based**: treats each character as a token.
3. **WordPiece**: breaks words into subword units based on frequency.
4. **Byte Pair Encoding (BPE)**: merges frequent character pairs iteratively to form subwords.
5. **Unigram Language Model**: selects subwords based on probability.
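
To make the BPE idea concrete, here is a minimal sketch of the training loop on a toy corpus. It is a simplification and not the tokenizer used by gpt-oss: it works on characters rather than bytes, and the function names (`train_bpe`, `most_frequent_pair`, `merge_pair`) are made up for this illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lot", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # {('low',): 3, ('lo', 't'): 1}
```

After two merges the frequent word `low` has become a single token, while the rarer `lot` is still split into `lo` and `t`.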

The vocabulary size of a tokenizer is the total number of unique tokens it can recognize and produce. It depends on the tokenization method and on how the tokenizer was trained. Larger vocabularies encode the same text in fewer tokens but need a bigger embedding table, which costs more memory and computation. Smaller vocabularies need less memory but may split even common words into many tokens.

In [23]:
len(tokenizer.vocab)

200019

Given a text, we can convert it into a list of token IDs, which are later sent to the language model.

In [24]:
tokenizer("Hello world!")

{'input_ids': [13225, 2375, 0], 'attention_mask': [1, 1, 1]}

We can also map the token IDs back to their token strings. Converting IDs back into text is what happens when the model finishes generating. The `Ġ` prefix marks a token that starts with a space.

In [25]:
tokenizer.convert_ids_to_tokens([13225, 2375, 0])

['Hello', 'Ġworld', '!']
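
As a rough illustration of how those token strings become text again, the sketch below joins them and turns the `Ġ` marker back into a space. This only handles the space marker; real decoding (for example `tokenizer.decode`) reverses the full byte-to-character mapping. The helper name `tokens_to_text` is hypothetical.

```python
def tokens_to_text(tokens):
    """Join byte-level BPE token strings and map the 'Ġ' marker back to a space."""
    # Simplification: only the space marker is handled here, which is enough for plain ASCII text.
    return "".join(tokens).replace("Ġ", " ")


print(tokens_to_text(["Hello", "Ġworld", "!"]))  # Hello world!
```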

We can also look up the token ID of a specific token string in the vocabulary.

In [26]:
tokenizer.vocab["Hello"]

13225

A tokenizer uses special tokens to help the language model understand structure, context, and tasks. Here are the most common ones:

1. **[CLS] (or <s>)**: Marks the beginning of a sequence. Often used for classification tasks.
2. **[SEP] (or </s>)**: Separates segments or sentences, useful in tasks like question answering.
3. **[PAD]**: Used to pad sequences to a uniform length.
4. **[MASK]**: Indicates masked tokens in masked language modeling (e.g., BERT).
5. **[UNK]**: Represents unknown or out-of-vocabulary tokens.
6. **[BOS] / [EOS]**: Begin and end of sequence markers, often used in generative models.

These tokens are part of the tokenizer's vocabulary and help the model interpret input correctly across different tasks.
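
To show what a padding token is for, here is a small sketch that pads token-ID lists to a common length and builds the matching `attention_mask`, the same field returned by the tokenizer call earlier. The function name `pad_batch` and the `PAD_ID` value are assumptions made for this example, and right padding is used only for simplicity.

```python
PAD_ID = 199999  # hypothetical pad ID chosen for illustration only


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-ID lists to the same length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[13225, 2375, 0], [13225]])
print(batch["input_ids"])       # [[13225, 2375, 0], [13225, 199999, 199999]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```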

In [27]:
tokenizer.added_tokens_decoder

{199998: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 199999: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 200000: AddedToken("<|reserved_200000|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 200001: AddedToken("<|reserved_200001|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 200002: AddedToken("<|return|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 200003: AddedToken("<|constrain|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 200004: AddedToken("<|reserved_200004|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 200005: AddedToken("<|channel|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
 200006: AddedToken("<|start|>", rstrip=False, ls

## Parameters

In [28]:
model.state_dict()

OrderedDict([('model.embed_tokens.weight',
              tensor([[-2.9883e-01, -6.9531e-01,  1.9141e+00,  ..., -1.0059e-01,
                       -2.8931e-02,  1.0107e-01],
                      [-3.3203e-01, -3.6328e-01,  1.9141e+00,  ...,  5.9082e-02,
                        2.3560e-02,  1.9824e-01],
                      [ 2.0801e-01,  2.2507e-04,  4.7461e-01,  ...,  7.0801e-03,
                        3.3447e-02,  3.6523e-01],
                      ...,
                      [-1.6113e-02, -1.6846e-02,  7.2266e-02,  ..., -2.3804e-03,
                        1.7395e-03,  3.6163e-03],
                      [-4.9438e-03,  2.2461e-02, -3.2715e-02,  ...,  7.7438e-04,
                        1.2589e-03, -4.4861e-03],
                      [ 5.4016e-03, -8.1177e-03,  7.2937e-03,  ..., -2.6093e-03,
                       -3.9978e-03, -8.3008e-03]], device='cuda:0', dtype=torch.bfloat16)),
             ('model.layers.0.self_attn.sinks',
              tensor([ 2.5156,  0.5586,  1.7188,  0.91

In [29]:
# Look up the embedding matrix once instead of rebuilding the state dict for every print.
embed_weight = model.state_dict()['model.embed_tokens.weight']
shape = embed_weight.shape

print(f"shape: {shape}")
print(f"num_dimensions: {len(shape)}")
print(f"bytes: {embed_weight.nbytes}")
print(f"dtype: {embed_weight.dtype}")
# bfloat16 uses 2 bytes per element, so rows * columns * 2 should match nbytes.
print(f"total_math: {shape[0] * shape[1] * 2}")

shape: torch.Size([201088, 2880])
num_dimensions: 2
bytes: 1158266880
dtype: torch.bfloat16
total_math: 1158266880
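
The numbers above can be checked by hand. The sketch below redoes the arithmetic with plain integers copied from the output: one embedding row per token, 2880 values per row, and 2 bytes per bfloat16 value. It also shows that the matrix has more rows than the 200019 entries reported by `len(tokenizer.vocab)`; the notebook does not say why, so treating the extra rows as padding or reserved slots is only a guess.

```python
rows, cols = 201088, 2880   # shape of model.embed_tokens.weight printed above
bytes_per_element = 2       # bfloat16 stores each value in 16 bits
vocab_size = 200019         # len(tokenizer.vocab) from the tokenizer section

n_params = rows * cols
n_bytes = n_params * bytes_per_element
extra_rows = rows - vocab_size  # rows with no vocabulary entry (reason not stated in the notebook)

print(f"parameters: {n_params:,}")           # parameters: 579,133,440
print(f"bytes:      {n_bytes:,}")            # bytes:      1,158,266,880
print(f"GiB:        {n_bytes / 2**30:.2f}")  # GiB:        1.08
print(f"extra rows: {extra_rows}")           # extra rows: 1069
```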


In [30]:
len(model.state_dict()["model.embed_tokens.weight"][tokenizer.vocab["world"]])

2880

In [31]:
model.state_dict()["model.embed_tokens.weight"][tokenizer.vocab["world"]][:30]

tensor([-7.0312e-02, -9.1797e-01,  7.9102e-02,  8.2520e-02, -2.2656e-01,
        -4.9375e+00,  2.0447e-03,  4.7852e-01,  5.7188e+00, -2.1094e+00,
        -1.0645e-01,  7.5989e-03,  1.6875e+00,  2.5312e+00,  5.3223e-02,
        -1.6699e-01,  2.8516e-01, -1.7676e-01,  7.3047e-01, -1.1978e-03,
         1.2031e+00,  5.2246e-02, -1.9844e+00,  2.0508e-02, -8.1543e-02,
         1.3977e-02,  1.3867e-01,  1.6968e-02,  4.0283e-02, -1.5078e+00],
       device='cuda:0', dtype=torch.bfloat16)
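
Indexing the weight matrix with a token ID, as in the two cells above, is all an embedding lookup does: each token ID selects one row. Note that `"world"` and `"Ġworld"` are different vocabulary entries, so they select different rows. The toy sketch below uses a made-up three-token vocabulary and 4-dimensional vectors instead of 2880; all names and values are hypothetical.

```python
# Toy vocabulary and embedding table with made-up values (4 dimensions instead of 2880).
toy_vocab = {"Hello": 0, "Ġworld": 1, "!": 2}
toy_embeddings = [
    [0.10, -0.30, 0.70, 0.00],   # row 0 -> "Hello"
    [-0.07, -0.92, 0.08, 0.08],  # row 1 -> "Ġworld"
    [0.21, 0.00, 0.47, 0.01],    # row 2 -> "!"
]


def embed(tokens):
    """Map token strings to IDs, then select the matching embedding rows."""
    return [toy_embeddings[toy_vocab[token]] for token in tokens]


vectors = embed(["Hello", "Ġworld", "!"])
print(len(vectors), len(vectors[0]))  # 3 4
```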

## Process request

The following code extracts from the request the parameters that the model will use later on.

In [32]:
# Extract the thread messages from the request.
messages = request["messages"]

# The tools the model **might** use.
tools = request["tools"]

# The number of tokens to generate.
max_completion_tokens = request["max_completion_tokens"]

# And the reasoning effort.
reasoning_effort = request["reasoning_effort"]

print(tokenizer.apply_chat_template(
    messages,
    tools=tools,
    reasoning_effort=reasoning_effort,
    tokenize=False
))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-13

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

 You are an expert AI programming assistant, working with a user in the VS Code editor. When asked for your name, you must respond with 'GitHub Copilot'. Follow the user's requirements carefully & to the letter. Follow Microsoft content policies. Avoid content that violates copyrights. If you are asked to generate content that is harmful, hateful, racist, sexist, lewd, or violent, only respond with 'Sorry, I can't assist with that.' Keep your answers short and impersonal. <instructions> You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks.

In [33]:
print(tokenizer.apply_chat_template(
    messages,
    tools=tools,
    reasoning_effort=reasoning_effort,
    tokenize=False,
    add_generation_prompt=True
))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-13

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

 You are an expert AI programming assistant, working with a user in the VS Code editor. When asked for your name, you must respond with 'GitHub Copilot'. Follow the user's requirements carefully & to the letter. Follow Microsoft content policies. Avoid content that violates copyrights. If you are asked to generate content that is harmful, hateful, racist, sexist, lewd, or violent, only respond with 'Sorry, I can't assist with that.' Keep your answers short and impersonal. <instructions> You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks.
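
The rendered prompts above follow a repeating `<|start|>{role}<|message|>{content}<|end|>` pattern. The sketch below reproduces only that skeleton to show what `add_generation_prompt=True` changes. It is not the real chat template: it leaves out the built-in system header, the mapping of the request's system message to the `developer` role seen in the output, the tool definitions, and the channels. The trailing `<|start|>assistant` is an assumption based on the generated text in this notebook, and `render_prompt` is a hypothetical helper.

```python
def render_prompt(messages, add_generation_prompt=False):
    """Render messages as <|start|>{role}<|message|>{content}<|end|> blocks."""
    parts = [
        f"<|start|>{message['role']}<|message|>{message['content']}<|end|>"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from this point (assumed suffix).
        parts.append("<|start|>assistant")
    return "".join(parts)


toy_messages = [{"role": "user", "content": "quais os meus repositórios?"}]
print(render_prompt(toy_messages))
print(render_prompt(toy_messages, add_generation_prompt=True))
```

Without the generation prompt the text ends with a closed user turn; with it, the text ends with an open assistant turn for the model to complete.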

In [34]:
inputs = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    reasoning_effort=reasoning_effort,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)
inputs

{'input_ids': tensor([[200006,  17360, 200008,  ..., 200007, 200006, 173781]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}
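
Once the prompt is a sequence of token IDs, generation is an autoregressive loop: the model scores every possible next token, one token is chosen and appended, and the longer sequence is fed back in. The sketch below shows that loop with greedy selection and a hypothetical stand-in for the model (`toy_model`). The real `generate` call may sample instead of taking the highest score, depending on the generation settings.

```python
def greedy_generate(next_token_logits, prompt_ids, max_new_tokens, eos_id):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        # Pick the index (token ID) with the largest score.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break  # stop early at the end-of-sequence token
    return ids


def toy_model(ids):
    """Hypothetical stand-in for a model: favors (last_id + 1) over a 5-token vocabulary."""
    logits = [0.0] * 5
    logits[(ids[-1] + 1) % 5] = 1.0
    return logits


print(greedy_generate(toy_model, [0], max_new_tokens=10, eos_id=3))  # [0, 1, 2, 3]
print(greedy_generate(toy_model, [0], max_new_tokens=2, eos_id=3))   # [0, 1, 2]
```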

In [35]:
# Move the prompt tensors to the same device as the model (the GPU).
inputs = inputs.to(model.device)

# Generate up to max_completion_tokens new tokens, then decode only the part after the prompt.
generated = model.generate(**inputs, max_new_tokens=max_completion_tokens)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

<|channel|>analysis<|message|>User is speaking Portuguese: "quais os meus repositórios?" meaning "which are my repositories?" We need to call mcp_github_get_me? Actually to get user profile details maybe. We need repository list. There's no direct tool to list repos. We can search for repositories with user: <username>. But we don't know username. We can call mcp_github_get_me to get username. Then use search_repositories with query "user:<username>".

We have no authentication context. But assume we can get user info. Use get_me. Then search repositories. Let's do that.<|end|><|start|>assistant<|channel|>commentary to=functions.mcp_github_get_me <|constrain|>json<|message|>{}<|call|>commentary to=functions.mcp_github_search_repositories <|constrain|>json<|message|>{"page":1,"perPage":30,"query":"user:"}<|call|>commentary<|message|>We likely need to specify the username from get_me. Let's assume get_me returns something like {login:"username"}.

We need to first call get_me.<|end|><|st
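
The generated text above contains tool calls of the form `to=functions.NAME <|constrain|>json<|message|>{...}<|call|>`. A client has to pull these out before it can run the tools. The sketch below does that with a regular expression written against the output shown here. It is a simplified illustration under that assumption, not the official parser for this format, and `extract_tool_calls` is a hypothetical helper.

```python
import json
import re

# Matches: to=functions.NAME <|constrain|>json<|message|>{...}<|call|>
TOOL_CALL = re.compile(
    r"to=functions\.(?P<name>[\w-]+)\s*<\|constrain\|>json<\|message\|>(?P<args>.*?)<\|call\|>",
    re.DOTALL,
)


def extract_tool_calls(text):
    """Return (tool_name, arguments) pairs found in the generated text."""
    return [(m["name"], json.loads(m["args"])) for m in TOOL_CALL.finditer(text)]


# Fragment copied from the generated output above.
sample = (
    "<|start|>assistant<|channel|>commentary to=functions.mcp_github_get_me "
    "<|constrain|>json<|message|>{}<|call|>"
    "commentary to=functions.mcp_github_search_repositories "
    '<|constrain|>json<|message|>{"page":1,"perPage":30,"query":"user:"}<|call|>'
)

for name, args in extract_tool_calls(sample):
    print(name, args)
```

The second call carries the incomplete query `user:`, which reflects the model not yet knowing the username at that point.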

In [16]:
print(tokenizer.decode(generated[0]))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-08-13

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

 You are an expert AI programming assistant, working with a user in the VS Code editor. When asked for your name, you must respond with 'GitHub Copilot'. Follow the user's requirements carefully & to the letter. Follow Microsoft content policies. Avoid content that violates copyrights. If you are asked to generate content that is harmful, hateful, racist, sexist, lewd, or violent, only respond with 'Sorry, I can't assist with that.' Keep your answers short and impersonal. <instructions> You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks.