[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openai/openai-cookbook/blob/main/articles/gpt-oss/run-colab.ipynb)

# Run OpenAI gpt-oss 20B in a FREE Google Colab

OpenAI released `gpt-oss` [120B](https://hf.co/openai/gpt-oss-120b) and [20B](https://hf.co/openai/gpt-oss-20b). Both models are Apache 2.0 licensed.

Specifically, `gpt-oss-20b` was made for lower latency and local or specialized use cases (21B parameters with 3.6B active parameters).

Because the models were trained with native MXFP4 quantization, the 20B model is easy to run even in resource-constrained environments like Google Colab.
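
As a rough back-of-the-envelope check of why the model fits, the sketch below estimates weight storage for 21B parameters. It assumes the MXFP4 layout of 4-bit elements sharing one 8-bit scale per block of 32 (about 4.25 bits per weight) and, as a simplification, treats every parameter as quantized. Some tensors are likely kept in higher precision, so the real checkpoint will be somewhat larger. The helper name `weight_gigabytes` is illustrative only.

```python
def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 21e9          # total parameter count quoted above
MXFP4_BITS = 4 + 8 / 32    # assumption: 4-bit values + one 8-bit scale per 32 values
BF16_BITS = 16             # baseline for comparison

# Assumption: every parameter is stored in the given format.
mxfp4_gb = weight_gigabytes(NUM_PARAMS, MXFP4_BITS)
bf16_gb = weight_gigabytes(NUM_PARAMS, BF16_BITS)

print(f"MXFP4: ~{mxfp4_gb:.1f} GB, bf16: ~{bf16_gb:.1f} GB")
```

Under these assumptions the quantized weights need roughly a quarter of the memory that bf16 would.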

Authored by: [Pedro](https://huggingface.co/pcuenq) and [VB](https://huggingface.co/reach-vb)

## Setup environment

Since support for MXFP4 in transformers is bleeding edge, we need a recent version of PyTorch and CUDA to be able to install the `mxfp4` Triton kernels.

We also install `transformers`, `triton`, and `kernels`, and we uninstall `torchvision` and `torchaudio` to avoid dependency conflicts.

In [1]:
!pip install -q --upgrade torch

In [2]:
!pip install -q transformers triton==3.4 kernels

In [3]:
!pip uninstall -q torchvision torchaudio -y

Please restart your Colab runtime session after installing the packages above.
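
After the restart, it can help to confirm which package versions the runtime actually sees. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the install and uninstall cells above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# torchvision and torchaudio should show up as "not installed" after the cells above.
report = installed_versions(
    ["torch", "transformers", "triton", "kernels", "torchvision", "torchaudio"]
)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```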

## Load the model from Hugging Face in Google Colab

We load the model from here: [openai/gpt-oss-20b](https://hf.co/openai/gpt-oss-20b)

In [4]:
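from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

# Download the tokenizer and the model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the checkpoint's dtype; newer transformers releases call this argument `dtype`
    device_map="cuda",   # place the whole model on the GPU
)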

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!


## Set up messages / chat

You can provide an optional system prompt, or pass only the user input.

In [11]:
messages = [
    {"role": "system", "content": "You are a helpful AI agent with access to tools"},
    {"role": "user", "content": "what is 23408038 * 9876?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

<|channel|>analysis<|message|>We need to multiply 23408038 * 9876. Let's compute. 23,408,038 * 9,876.

We can do long multiplication or use calculator. Let's break: 23,408,038 * 9,876 = 23,408,038 * (10,000 - 124). 23,408,038*10,000 = 234,080,380,000. Subtract 23,408,038*124. 23,408,038*124 = 23,408,038*100 + *20 + *4 = 2,340,803,800 + 468,160,760 + 93,632,152 = sum them: 2,340,803,800 + 468,160,760 = 2,808,964,560; + 93,632,152 = 2,902,596,712. So 234,080,380,000 - 2,902,596,712 = 231,177,783,288. Let's double-check: Compute 23,408,038*124 more carefully.

Compute 23,408,038 * 124: 23,408,038*120 + *4. 23,408,038*120 = 23,408,038*12*10: 23,408,038*12 = 23,408,038*10 + *2 = 234,080,380 + 46,816,076 = 280,896,456; times 10 gives 2,808,964,560. Adding *4: 23,408,038*4 = 93,632,152. Sum = 2,902,596,712. Yes.

Subtract from 234,080,380,000 gives 231,177,783,288. That seems correct.

Alternatively, check with direct product: 23,408,038 * 9,876. Let's do digits: 23,408,038 * 9876.

Alternati
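
The decoded text carries format markers such as `<|channel|>analysis<|message|>`. Below is a minimal sketch for splitting such text by channel, assuming each segment starts with `<|channel|>NAME<|message|>` and runs until the next special token or the end of the text. The terminator tokens `<|end|>`, `<|return|>`, and `<|start|>`, as well as the `final` channel used in the second example, are assumptions about the chat format that do not appear in the output above. The function name `split_channels` is hypothetical.

```python
import re

# One segment: channel name, then the message body up to the next special token
# (assumed terminators) or the end of the string for truncated generations.
SEGMENT = re.compile(
    r"<\|channel\|>([^<]+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>|<\|start\|>|\Z)",
    re.DOTALL,
)


def split_channels(decoded: str):
    """Return a list of (channel, text) pairs found in the decoded output."""
    return [(name.strip(), body.strip()) for name, body in SEGMENT.findall(decoded)]


# A generation that was cut off mid-analysis, like the one above.
truncated = "<|channel|>analysis<|message|>We need to multiply 23408038 * 9876."
print(split_channels(truncated))

# A hypothetical complete generation with an assumed final channel.
complete = (
    "<|channel|>analysis<|message|>Compute the product.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>231,177,783,288<|return|>"
)
print(split_channels(complete))
```

The output above stops mid-word because generation hit `max_new_tokens=500` while the model was still in the analysis channel. A larger limit would give it more room to reach an answer.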

## Specify Reasoning Effort

Pass `reasoning_effort` as an additional argument to `apply_chat_template()`. Supported values are `"low"`, `"medium"` (default), or `"high"`.

In [10]:
messages = [
    {"role": "system", "content": "You are a helpful AI agent with access to tools"},
    {"role": "user", "content": "what is 23408038 * 9876?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="medium",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

<|channel|>analysis<|message|>Need to compute 23,408,038 * 9,876. Let's calculate.

First, 23,408,038 * 9,876.

Break 9,876 = 10,000 - 124.

Compute 23,408,038 * 10,000 = 234,080,380,000

Now compute 23,408,038 * 124 = 23,408,038 * (100 + 20 + 4)

= 23,408,038 * 100 = 2,340,803,800
+ 23,408,038 * 20 = 468,160,760
+ 23,408,038 * 4 = 93,632,152

Now sum: 2,340,803,800 + 468,160,760 = 2,808,964,560
Add 93,632,152 = 2,902,596,712

So 23,408,038 * 124 = 2,902,596,712

Now subtract from 234,080,380,000:

234,080,380,000 - 2,902,596,712 = 231,177,783,288

Double-check using alternative method: Use multiplication algorithm.

Let's quick cross-check: 23,408,038 * 9,876.

We can also do 23,408,038 * 9,876 = 23,408,038 * (9,000 + 876)

First 23,408,038 * 9,000 = 23,408,038 * 9 * 1,000 = 210,672,342 * 1,000 = 210,672,342,000

Wait 23,408,038 * 9 = 210,672,342 correct.

Now 23,408,038 * 876 = 23,408,038 * (800 + 70 + 6)

= 23,408,038 * 800 = 23,408,038 *8 *100 = 187,264,304 *100 = 18,726,430,400

2
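
Both traces are cut off before a final answer, so it is worth checking the intermediate values they report. The short sketch below uses Python's exact integer arithmetic to verify the figures that appear in the model's reasoning; the variable names are illustrative.

```python
a, b = 23408038, 9876

# First decomposition used in the trace: 9,876 = 10,000 - 124
assert a * 10_000 == 234_080_380_000
assert a * 124 == 2_902_596_712
assert a * 10_000 - a * 124 == a * b

# Partial results from the cross-check: 9,876 = 9,000 + 800 + 70 + 6
assert a * 9_000 == 210_672_342_000
assert a * 800 == 18_726_430_400

product = a * b
print(f"{product:,}")  # 231,177,783,288
```

The intermediate values in the trace match, and the product agrees with the 231,177,783,288 the model arrived at before being truncated.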

## Try out other prompts and ideas!

Check out our blog post for other ideas: [https://hf.co/blog/welcome-openai-gpt-oss](https://hf.co/blog/welcome-openai-gpt-oss)