# How to run gpt-oss with Hugging Face Transformers

The Transformers library by Hugging Face provides a flexible way to load and run large language models locally or on a server. This guide walks through running [OpenAI gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) or [OpenAI gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) with Transformers in three ways: the high-level `pipeline` abstraction, low-level `generate` calls with raw token IDs, and serving models locally with `transformers serve` in a way that is compatible with the Responses API.

**Bonus:** You can also fine-tune the models with Transformers; see [our fine-tuning guide](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transformers).

## Pick your model

Both **gpt-oss** models are available on Hugging Face:

- **`openai/gpt-oss-20b`**
  - ~16 GB VRAM requirement when using MXFP4
  - Great for single high-end consumer GPUs
- **`openai/gpt-oss-120b`**
  - Requires ≥60 GB VRAM or a multi-GPU setup
  - Ideal for H100-class hardware

Both are **MXFP4 quantized** by default. Note that MXFP4 is supported on Hopper or later architectures. This includes data center GPUs such as H100 or GB200, as well as the latest RTX 50xx family of consumer cards.
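
One way to check whether a given GPU falls in that range is to compare its CUDA compute capability against Hopper's. The sketch below assumes that "Hopper or later" corresponds to a compute capability of at least 9.0; `supports_mxfp4` is a hypothetical helper name, and the `torch` import is optional so the function can be tried on a machine without PyTorch.

```python
def supports_mxfp4(capability):
    """Return True if a (major, minor) compute capability is Hopper (9.0) or newer.

    Assumption: "Hopper or later" is treated as compute capability >= 9.0.
    """
    return tuple(capability) >= (9, 0)


# For illustration: an A40 is Ampere (8.6), an H100 is Hopper (9.0).
print(supports_mxfp4((8, 6)))
print(supports_mxfp4((9, 0)))

try:
    import torch

    if torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        print(f"GPU 0 capability {cap}: MXFP4 supported = {supports_mxfp4(cap)}")
except ImportError:
    pass  # torch is not installed; the helper still works on plain tuples
```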

If you use `bfloat16` instead of MXFP4, memory consumption will be larger (~48 GB for the 20b parameter model).
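
These figures can be sanity-checked with a back-of-the-envelope estimate of weight storage. The sketch below assumes roughly 21B parameters for the 20b model (an approximation, not stated above), 16 bits per parameter for `bfloat16`, and about 4.25 bits per parameter for MXFP4 (4-bit values plus one 8-bit scale shared by each block of 32). It ignores activations, the KV cache, and the fact that not every layer is quantized, so real usage will be higher than these numbers.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS_20B = 21e9       # assumption: approximate total parameter count
BF16_BITS = 16
MXFP4_BITS = 4 + 8 / 32   # 4-bit elements plus an 8-bit scale per 32-element block

print(f"bf16 : {weight_memory_gb(N_PARAMS_20B, BF16_BITS):.1f} GB")
print(f"MXFP4: {weight_memory_gb(N_PARAMS_20B, MXFP4_BITS):.1f} GB")
```

Both estimates cover weights only, which is why they come out below the totals quoted above.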

**Note:** At the time of writing (Transformers 4.56.1), an outdated `torchvision` dependency can prevent `pipeline` from being imported from `transformers`. Upgrading `torchvision`, as the install cells below do, works around this.

## Quick setup

### 1. Install dependencies

It's recommended to create a fresh Python environment. Install `transformers`, `accelerate`, and the Triton kernels for MXFP4 compatibility:

In [None]:
!pip install -U transformers kernels torch accelerate torchvision

Collecting accelerate
  Downloading accelerate-1.10.1-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.10.1-py3-none-any.whl (374 kB)
Installing collected packages: accelerate
Successfully installed accelerate-1.10.1


In [1]:

import transformers, torch, accelerate, platform

print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"Accelerate version: {accelerate.__version__}")
dev = (
    "cuda" if torch.cuda.is_available() else
    ("mps" if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available() else "cpu")
)
print(f"Selected device: {dev} • Python {platform.python_version()}")
if dev == "cuda":
    print(f"CUDA devices: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")


Transformers version: 4.56.1
PyTorch version: 2.8.0+cu128
Accelerate version: 1.10.1
Selected device: cuda • Python 3.11.11
CUDA devices: 1
  GPU 0: NVIDIA A40


### 2. (Optional) Enable multi-GPU

If you're running large models, use Accelerate or torchrun to handle device mapping automatically.

For this notebook, we'll use the `device_map="auto"` parameter which automatically distributes the model across available GPUs.
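
With `device_map="auto"`, placement can also be capped per device through the `max_memory` argument of `from_pretrained`, which maps GPU indices (and optionally `"cpu"`) to strings such as `"40GiB"`. The sketch below builds such a mapping; `build_max_memory` and the headroom values are hypothetical and chosen only for illustration.

```python
def build_max_memory(gpu_total_gib, headroom_gib=2, cpu_gib=None):
    """Build a max_memory mapping: GPU index -> "<n>GiB", optionally plus "cpu".

    gpu_total_gib: total memory of each GPU, in GiB.
    headroom_gib: memory left free on each GPU for activations and the KV cache.
    """
    max_memory = {
        idx: f"{max(int(total) - headroom_gib, 1)}GiB"
        for idx, total in enumerate(gpu_total_gib)
    }
    if cpu_gib is not None:
        max_memory["cpu"] = f"{int(cpu_gib)}GiB"
    return max_memory


print(build_max_memory([48, 48], headroom_gib=4, cpu_gib=64))
```

The resulting dictionary would be passed as `from_pretrained(..., device_map="auto", max_memory=...)`.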

## Create an OpenAI Responses / Chat Completions endpoint

To launch a server, simply use the `transformers serve` CLI command:

```bash
transformers serve
```

The simplest way to interact with the server is through the transformers chat CLI:

```bash
transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-20b
```

or by sending an HTTP request with cURL:

```bash
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-20b"}'
```

Additional use cases, like integrating `transformers serve` with Cursor and other tools, are detailed in [the documentation](https://huggingface.co/docs/transformers/main/serving).
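
A request can also be sent from Python. The sketch below uses only the standard library and assumes that the server started by `transformers serve` is listening on `localhost:8000` and exposes an OpenAI-style `/v1/chat/completions` route. `build_chat_request` and `send` are hypothetical helpers, and the network call stays commented out so nothing is sent until the server is running.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, max_tokens=256):
    """Build (but do not send) a POST request for an OpenAI-style chat endpoint."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": False,  # ask for a single JSON body instead of a stream
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_chat_request(
    "http://localhost:8000",
    "openai/gpt-oss-20b",
    [{"role": "user", "content": "hello"}],
)
print(req.full_url)
# reply = send(req)  # uncomment once the server is running
```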

## Quick inference with pipeline

The easiest way to run the gpt-oss models is with the Transformers high-level `pipeline` API:

In [2]:
!pip install -U torchvision

Collecting torchvision
  Downloading torchvision-0.23.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Downloading torchvision-0.23.0-cp311-cp311-manylinux_2_28_x86_64.whl (8.6 MB)
Installing collected packages: torchvision
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.22.0.dev20250319+cu128
    Uninstalling torchvision-0.22.0.dev20250319+cu128:
      Successfully uninstalled torchvision-0.22.0.dev20250319+cu128
Successfully installed torchvision-0.23.0


In [None]:
from transformers import pipeline, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Initialize the text generation pipeline
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

print("Pipeline initialized successfully!")


In [None]:
# Example conversation with the model using the chat template
messages = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

result = generator(
    prompt,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
)

print("Generated response:")
print(result[0]["generated_text"])

## Advanced inference with `.generate()`

If you want more control, you can load the model and tokenizer manually and invoke the `.generate()` method:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

# Load tokenizer and model
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto", trust_remote_code=True
)

print("Model loaded successfully!")
print(f"Model device: {next(model.parameters()).device}")

In [None]:
# Example generation with manual control
messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

# Apply chat template and tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # return input_ids and attention_mask as a dict so **inputs works below
).to(model.device)

print(f"Input tokens shape: {inputs['input_ids'].shape}")

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode the full response
full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nFull response:")
print(full_response)

# Extract only the generated part
input_length = inputs['input_ids'].shape[-1]
generated_response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
print("\nGenerated response only:")
print(generated_response)

# Decode and extract only the assistant's final content (Harmony-style)
text = tokenizer.decode(outputs[0], skip_special_tokens=False)
final_marker = "<|channel|>final<|message|>"
if final_marker in text:
    text = text.split(final_marker, 1)[1]
# The final message is closed by <|return|>; earlier messages are closed by <|end|>
for end_marker in ("<|return|>", "<|end|>"):
    text = text.split(end_marker, 1)[0]
print("Final:", text.strip()[:1000])
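
When both the reasoning and the answer are needed, the decoded text can be split by channel instead of keeping only the final part. The sketch below assumes the decoded completion uses the `<|channel|>NAME<|message|>...` layout, with each message closed by `<|end|>` or `<|return|>`. `split_channels` is a hypothetical helper and the sample string is made up for illustration; headers that carry extra fields, such as tool-call recipients, are not handled.

```python
import re

# One message: a channel header, then content up to a closing token or end of text.
_MESSAGE_RE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|end\|>|<\|return\|>|\Z)",
    re.DOTALL,
)


def split_channels(text):
    """Return a list of (channel, content) pairs found in harmony-formatted text."""
    return [(name, body.strip()) for name, body in _MESSAGE_RE.findall(text)]


sample = (
    "<|channel|>analysis<|message|>The user wants a short definition.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>MXFP4 is a 4-bit float format.<|return|>"
)
for channel, content in split_channels(sample):
    print(f"{channel}: {content}")
```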


## Streaming tokens (prints only assistant **final**)

In [None]:
import threading, sys
from transformers import TextIteratorStreamer

def stream_final_only(model, tokenizer, messages, generate_kwargs):
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to(model.device)
    # skip_prompt keeps the prompt out of the stream; special tokens are kept so channel markers stay visible
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

    # Run generation in a background thread so the streamer can be consumed here
    t = threading.Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, **generate_kwargs))
    t.start()

    buf, printing_final = "", False
    # Harmony markers: the final channel header, and the tokens that close a message
    final_token, end_tokens = "<|channel|>final<|message|>", ("<|return|>", "<|end|>")
    for piece in streamer:
        if not printing_final:
            buf += piece  # buffer until the final channel header has fully arrived
            if final_token not in buf:
                continue
            printing_final = True
            piece = buf.split(final_token, 1)[1]
        for end_token in end_tokens:
            piece = piece.replace(end_token, "")
        sys.stdout.write(piece); sys.stdout.flush()
    t.join()

messages2 = [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "What’s the difference between analysis and final channels?"}
]
gen = dict(max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.9)
stream_final_only(model, tokenizer, messages2, gen)
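
The filtering logic can be exercised without loading a model by feeding it a list of text pieces. The sketch below is a generator version of the same idea, run on a made-up piece list. It assumes the final channel is introduced by `<|channel|>final<|message|>` and closed by `<|return|>` or `<|end|>`; `final_only` is a hypothetical name.

```python
FINAL_MARKER = "<|channel|>final<|message|>"
END_MARKERS = ("<|return|>", "<|end|>")


def final_only(pieces):
    """Yield only the text that belongs to the final channel of a streamed reply."""
    buf, in_final = "", False
    for piece in pieces:
        if in_final:
            text = piece
        else:
            buf += piece  # the marker may be split across pieces, so buffer until it is complete
            if FINAL_MARKER not in buf:
                continue
            in_final = True
            text = buf.split(FINAL_MARKER, 1)[1]
        for marker in END_MARKERS:
            text = text.replace(marker, "")
        if text:
            yield text


fake_stream = [
    "<|channel|>analysis<|message|>thinking",
    "<|end|><|start|>assistant<|chan",
    "nel|>final<|message|>Hel",
    "lo ",
    "world<|return|>",
]
print("".join(final_only(fake_stream)))
```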

## Chat template and tool calling

OpenAI gpt-oss models use the [harmony response format](https://cookbook.openai.com/article/harmony) for structuring messages, including reasoning and tool calls.

To construct prompts you can use the built-in chat template of Transformers. Alternatively, you can install and use the [openai-harmony library](https://github.com/openai/harmony) for more control.

### Using the built-in chat template:

In [None]:
# Example with system prompt and chat template
messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "What is the weather like in Madrid?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Generate with the chat template
generated = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Extract only the assistant's response
response = tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print("Assistant's riddle response:")
print(response)
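
For tool calling, `apply_chat_template` accepts a `tools` argument containing JSON-schema descriptions of the available functions. The sketch below builds one such description by hand. `get_current_weather` and its parameters are hypothetical, and the final call is shown as a comment because it needs the tokenizer and model loaded earlier.

```python
import json

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",  # hypothetical tool name
        "description": "Get the current weather in a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name, e.g. Madrid"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}

tool_messages = [{"role": "user", "content": "What is the weather like in Madrid?"}]

print(json.dumps(weather_tool, indent=2))

# With the tokenizer from the cells above, the schema is passed alongside the messages:
# inputs = tokenizer.apply_chat_template(
#     tool_messages, tools=[weather_tool], add_generation_prompt=True,
#     return_tensors="pt", return_dict=True,
# ).to(model.device)
```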

### Using the openai-harmony library

For more advanced control over the conversation format, you can use the openai-harmony library. 

First, install it:
```bash
pip install openai-harmony
```

**Note:** The following cell shows how to use the harmony library. The code is commented out and requires `openai-harmony` to be installed before it will run.

In [None]:
# Example using openai-harmony library (requires installation)
# Uncomment and run if you have openai-harmony installed

'''
import json
from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent
)

# Load harmony encoding
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

# Build conversation
convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
    Message.from_role_and_content(
        Role.DEVELOPER,
        DeveloperContent.new().with_instructions("Always respond in riddles")
    ),
    Message.from_role_and_content(Role.USER, "What is the weather like in SF?")
])

# Render prompt
prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
stop_token_ids = encoding.stop_tokens_for_assistant_actions()

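# Generate (generate expects a tensor of token IDs rather than a Python list;
# torch is already imported at the top of this notebook)
outputs = model.generate(
    input_ids=torch.tensor([prefill_ids], device=model.device),
    max_new_tokens=128,
    eos_token_id=stop_token_ids
)

# Parse completion tokens (converted back to a plain list of ints)
completion_ids = outputs[0][len(prefill_ids):].tolist()
entries = encoding.parse_messages_from_completion_tokens(completion_ids, Role.ASSISTANT)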

for message in entries:
    print(json.dumps(message.to_dict(), indent=2))
'''

print("Harmony library example code shown above (commented out)")
print("Note: The Developer role in Harmony maps to the system prompt in the chat template.")

## Multi-GPU & distributed inference

The large gpt-oss-120b fits on a single H100 GPU when using MXFP4. If you want to run it on multiple GPUs, you can:

- Use `tp_plan="auto"` for automatic placement and tensor parallelism
- Launch with `accelerate launch` or `torchrun` for distributed setups
- Leverage Expert Parallelism
- Use specialised Flash attention kernels for faster inference
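
As a rough mental model of expert parallelism, each rank holds a slice of a layer's experts, and tokens are routed to whichever rank owns the selected expert. The sketch below shows only the partitioning step in plain Python. The expert count of 128 is an assumption used for illustration, `experts_for_rank` and `owner_of` are hypothetical helpers, and the sharding that Transformers actually performs may differ.

```python
def experts_for_rank(num_experts, world_size, rank):
    """Return the contiguous range of expert indices owned by one rank."""
    if num_experts % world_size != 0:
        raise ValueError("this sketch assumes experts divide evenly across ranks")
    per_rank = num_experts // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def owner_of(expert_id, num_experts, world_size):
    """Return the rank that owns a given expert under the same partitioning."""
    return expert_id // (num_experts // world_size)


NUM_EXPERTS = 128  # assumption, for illustration
WORLD_SIZE = 4     # e.g. four processes launched with torchrun

for rank in range(WORLD_SIZE):
    owned = experts_for_rank(NUM_EXPERTS, WORLD_SIZE, rank)
    print(f"rank {rank}: experts {owned.start}..{owned.stop - 1}")
```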

### Example multi-GPU setup:

In [None]:
# Multi-GPU inference example (requires multiple GPUs)
# This cell demonstrates the configuration but may not run on single GPU systems

'''
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    # Enable Expert Parallelism
    "distributed_config": DistributedConfig(enable_expert_parallel=1),
    # Enable Tensor Parallelism
    "tp_plan": "auto",
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())
'''

print("Multi-GPU setup example shown above (commented out)")
print("\nTo run this on a node with four GPUs, use:")
print("torchrun --nproc_per_node=4 your_script.py")

# Show current GPU configuration
if torch.cuda.is_available():
    print(f"\nCurrent setup:")
    print(f"Available GPUs: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"    Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.1f} GB")
else:
    print("\nNo CUDA GPUs available")

## Additional Examples and Tips

### Memory management

In [None]:
# Check memory usage
if torch.cuda.is_available():
    print("GPU Memory Usage:")
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1e9
        cached = torch.cuda.memory_reserved(i) / 1e9
        total = torch.cuda.get_device_properties(i).total_memory / 1e9
        print(f"  GPU {i}: {allocated:.1f}GB allocated, {cached:.1f}GB cached, {total:.1f}GB total")

# Clear cache if needed
# torch.cuda.empty_cache()

### Batch processing example

In [None]:
# Example of processing multiple prompts
batch_messages = [
    [{"role": "user", "content": "What is machine learning?"}],
    [{"role": "user", "content": "Explain quantum computing."}],
    [{"role": "user", "content": "What is the future of AI?"}]
]

print("Processing batch of prompts...")
for i, messages in enumerate(batch_messages):
    print(f"\n--- Prompt {i+1} ---")
    print(f"Input: {messages[0]['content']}")

    # You can use either the pipeline or manual generation here
    # Using pipeline for simplicity:
    if 'generator' in locals():
        result = generator(
            messages,
            max_new_tokens=100,
            temperature=0.7,
        )
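        # With chat-style input, generated_text is the message list with the assistant reply appended last
        reply = result[0]['generated_text'][-1]['content']
        print(f"Output: {reply[-200:]}...")  # Show last 200 chars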
    else:
        print("Generator not available - run the pipeline example first")
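
The loop above sends prompts one at a time. To run several prompts in a single forward pass, a decoder-only model needs them padded on the left so that every sequence ends at the same position, which is why a tokenizer may be loaded with `padding_side="left"`. The sketch below shows what left padding produces, using made-up token IDs and a hypothetical `left_pad` helper rather than a real tokenizer.

```python
def left_pad(sequences, pad_id):
    """Left-pad token ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))  # 0 marks padding
    return input_ids, attention_mask


ids, mask = left_pad([[11, 12, 13], [21], [31, 32]], pad_id=0)
for row, m in zip(ids, mask):
    print(row, m)
```

Because generation continues from the last position of each row, the padding has to sit on the left for the real tokens to stay adjacent to the newly generated ones.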

## Summary

This notebook demonstrated various ways to run OpenAI's gpt-oss models using Hugging Face Transformers:

1. **Quick setup** with required dependencies
2. **Pipeline API** for simple, high-level inference
3. **Manual generation** with `.generate()` for more control
4. **Chat templates** for conversation-style interactions
5. **Harmony library integration** for advanced message formatting
6. **Multi-GPU configurations** for large-scale inference

### Key takeaways:
- Start with the pipeline API for quick experimentation
- Use manual tokenization and generation for production deployments
- Consider MXFP4 quantization for memory efficiency on compatible hardware
- Leverage multi-GPU setups for the larger 120B model
- Use proper chat templates for conversation-style applications

### Next steps:
- Explore fine-tuning capabilities
- Set up serving endpoints for production use
- Experiment with different sampling strategies
- Integrate with your specific use case or application

For more advanced topics, check out the [OpenAI Cookbook](https://cookbook.openai.com) and [Hugging Face Transformers documentation](https://huggingface.co/docs/transformers).