[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openai/openai-cookbook/blob/main/articles/gpt-oss/run-colab.ipynb)

# Run OpenAI gpt-oss 20B in a FREE Google Colab

OpenAI released `gpt-oss` [120B](https://hf.co/openai/gpt-oss-120b) and [20B](https://hf.co/openai/gpt-oss-20b). Both models are Apache 2.0 licensed.

Specifically, `gpt-oss-20b` was made for lower latency and local or specialized use cases (21B parameters with 3.6B active parameters).

Because the models were trained with native MXFP4 quantization, the 20B model is easy to run even in resource-constrained environments like Google Colab.
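As a rough back-of-the-envelope check of why the model fits, the sketch below compares weight storage in bf16 against MXFP4. It assumes the MX block layout of 32 four-bit values sharing one 8-bit scale (about 4.25 bits per weight), and it simplifies by treating all 21B parameters as quantized. That simplification is an assumption: some layers may stay in higher precision, so the real footprint is likely somewhat larger.

```python
# Rough weight-memory estimate; every number here is an approximation.
PARAMS = 21e9      # total parameters of gpt-oss-20b, as stated above
BF16_BITS = 16     # bits per weight in bf16

# Assumption: MXFP4 stores 32 four-bit values plus one shared 8-bit scale per block.
MXFP4_BITS = (32 * 4 + 8) / 32   # 4.25 bits per weight


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return the storage needed for `params` weights, in gigabytes (1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


bf16_gb = weight_gb(PARAMS, BF16_BITS)
mxfp4_gb = weight_gb(PARAMS, MXFP4_BITS)
print(f"bf16:  {bf16_gb:.1f} GB")
print(f"mxfp4: {mxfp4_gb:.1f} GB")
```

The estimate covers weights only. Activations and the KV cache need additional memory on top of it.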

Authored by: [Pedro](https://huggingface.co/pcuenq) and [VB](https://huggingface.co/reach-vb)

## Set up the environment

Because MXFP4 support in transformers is bleeding edge, we need recent versions of PyTorch and CUDA to be able to install the `mxfp4` Triton kernels.

We also install `transformers` together with `triton` and `kernels`, and we uninstall `torchvision` and `torchaudio` to avoid dependency conflicts.

In [1]:
!pip install -q --upgrade torch

In [2]:
!pip install -q transformers triton==3.4 kernels

In [3]:
!pip uninstall -q torchvision torchaudio -y

Please restart your Colab runtime session after installing the packages above.
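After the restart, it can help to confirm which versions the runtime actually picked up. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any package used here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages installed or removed in the cells above.
report = installed_versions(
    ["torch", "transformers", "triton", "kernels", "torchvision", "torchaudio"]
)
for name, ver in report.items():
    print(f"{name:12s} {ver or 'not installed'}")
```

If the uninstall step worked, `torchvision` and `torchaudio` should be reported as not installed.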

## Load the model from Hugging Face in Google Colab

We load the model from here: [openai/gpt-oss-20b](https://hf.co/openai/gpt-oss-20b)

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
)
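
Once loading finishes, you may want to check how much memory the weights occupy. The sketch below wraps `get_memory_footprint()`, which transformers models expose and which returns a size in bytes; the helper name `footprint_gb` is hypothetical.

```python
def footprint_gb(model) -> float:
    """Return the model's reported memory footprint in gigabytes (1e9 bytes)."""
    # `get_memory_footprint()` is provided by transformers' PreTrainedModel
    # and returns the size of the parameters and buffers in bytes.
    return model.get_memory_footprint() / 1e9


# In the notebook, after the loading cell above has run:
# print(f"{footprint_gb(model):.1f} GB")
```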




## Set up messages/chat

You can provide an optional system prompt, or pass the user input directly.

In [5]:
messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "What is the weather like in Madrid?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

<|channel|>analysis<|message|>The user asks a straightforward question. But developer instruction: "Always respond in riddles". So must answer in riddle-like way. Should incorporate weather in Madrid. Madrid weather: Typically hot in summer, mild in winter. Maybe answer: "In the great city of Madrid, the sky may roar with summer heat, or whisper a chill in winter's embrace. Listen to the wind's song." But must be riddles. Provide a riddle or a description disguised as riddle. Let's craft: "When the sun climbs and the shade falls silent, do you hear the city breathe fire? When the sky bleeds dawn's blush, what winds sweep the plazas?" Eh.

Better: "I am neither cloud nor sun, yet when I'm heavy, Madrid sighs; when I'm light, tourists laugh. What am I?" That's the riddle. But question: "What is the weather like in Madrid?" So respond: "In Madrid, the weather is like: When summer arrives, I transform into a blazing flame, and when winter shivers, I become a cool breeze." But must respond 
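
The decoded text above starts with a `<|channel|>analysis<|message|>` marker, and here it was cut off by `max_new_tokens=500` before the model produced an answer. The following is a minimal sketch for separating the reasoning from the answer in such output. It assumes that each message is terminated by `<|end|>`, `<|return|>`, or the end of the string, and that the answer arrives on a channel named `final`. Those markers are assumptions about the response format, not something shown in the output above, and `split_channels` is a hypothetical helper.

```python
import re

# Assumption: each message looks like <|channel|>NAME<|message|>TEXT and is
# terminated by <|end|>, <|return|>, <|start|>, or the end of the string.
_MESSAGE = re.compile(
    r"<\|channel\|>(?P<channel>\w+)<\|message\|>(?P<text>.*?)"
    r"(?=<\|end\|>|<\|return\|>|<\|start\|>|\Z)",
    re.DOTALL,
)


def split_channels(decoded: str) -> dict:
    """Group decoded text by channel name, e.g. 'analysis' and 'final'."""
    channels = {}
    for match in _MESSAGE.finditer(decoded):
        channels.setdefault(match["channel"], []).append(match["text"].strip())
    return {name: "\n".join(parts) for name, parts in channels.items()}


# Hand-written sample in the assumed format (not real model output).
sample = (
    "<|channel|>analysis<|message|>Must answer as a riddle.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>I blaze at noon and cool at night."
    "<|return|>"
)
parsed = split_channels(sample)
print(parsed["analysis"])
print(parsed.get("final", "(no final answer: generation was cut off)"))
```

If the `final` channel is missing, as in the truncated output above, raising `max_new_tokens` gives the model room to finish its reasoning and answer.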

## Specify Reasoning Effort

Pass `reasoning_effort` as an additional argument to `apply_chat_template()`. Supported values are `"low"`, `"medium"` (default), or `"high"`.

In [None]:
messages = [
    {"role": "system", "content": "Always respond correctly"},
    {"role": "user", "content": "Explain happy numbers"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="high",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
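
Because the effort level is a plain string, a typo may not be caught before generation starts. The sketch below wraps the call in a small helper that checks the value against the three supported levels first. The helper name `build_inputs` and the constant `SUPPORTED_EFFORTS` are hypothetical; the arguments passed to `apply_chat_template()` mirror the cell above.

```python
SUPPORTED_EFFORTS = ("low", "medium", "high")  # values listed in this section


def build_inputs(tokenizer, messages, reasoning_effort="medium"):
    """Apply the chat template after checking the reasoning effort value."""
    if reasoning_effort not in SUPPORTED_EFFORTS:
        raise ValueError(
            f"reasoning_effort must be one of {SUPPORTED_EFFORTS}, "
            f"got {reasoning_effort!r}"
        )
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
        reasoning_effort=reasoning_effort,
    )


# In the notebook:
# inputs = build_inputs(tokenizer, messages, "low").to(model.device)
```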

## Try out other prompts and ideas!

Check out our blog post for other ideas: [https://hf.co/blog/welcome-openai-gpt-oss](https://hf.co/blog/welcome-openai-gpt-oss)