# gpt‑oss × Transformers — Interactive Notebook

This notebook turns the **run-transformers.md** guide into an interactive playground. It includes cells for environment setup, quick & advanced inference, serving with `transformers serve`, distributed/multi‑GPU hints, and a live **ipywidgets chat UI**.

**Models:** `openai/gpt-oss-20b`, `openai/gpt-oss-120b`  
**Quantization:** MXFP4 by default (Hopper or later: H100/GB200/RTX 50xx).  
**Notes:** If you use `bfloat16`, memory usage rises (~48 GB for the 20B model).  
**Hardware:** 20B ≈16 GB VRAM with MXFP4; 120B ≥60 GB VRAM or multi‑GPU.

> Tip: Run each cell in order the first time. You can skip pieces you don't need later.


## 0) Install/Update Dependencies
The guide recommends `transformers`, `accelerate`, `torch`, Triton 3.4, and MXFP4 kernels. We also install `ipywidgets` for the UI and `openai-harmony` for prompt tooling.

**Heads‑up:** the guide installs the `kernels` package alongside Triton for MXFP4 support. If it fails to install in your environment, you can omit it and continue, but MXFP4 may then be unavailable and the model may load in a higher‑memory format.

Execute the cell below to install/update the basics (uncomment as needed):

In [None]:
# If you're running locally, uncomment and run this once.
# !pip -q install -U transformers accelerate torch triton==3.4 ipywidgets openai-harmony
# MXFP4 kernels from the guide (skip if installation fails in your environment):
# !pip -q install kernels

# Enable widgets in some classic notebook setups (often not needed in JupyterLab 3+):
# !jupyter nbextension enable --py widgetsnbextension

import platform, sys
print('Python', sys.version)
print('Platform', platform.platform())

## 1) GPU / Environment Check
Check PyTorch + CUDA availability and device details.

In [None]:
try:
    import torch
    print('torch.__version__ =', torch.__version__)
    print('CUDA available    =', torch.cuda.is_available())
    if torch.cuda.is_available():
        print('CUDA device count =', torch.cuda.device_count())
        for i in range(torch.cuda.device_count()):
            print(f' - [{i}]', torch.cuda.get_device_name(i))
            cc = torch.cuda.get_device_capability(i)
            print('    compute capability:', cc)
        try:
            free_mem = torch.cuda.mem_get_info()[0] / (1024**3)
            print(f'Approx free GPU memory: {free_mem:.1f} GB')
        except Exception:
            pass
except Exception as e:
    print('Torch not available or failed to import:', e)
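
The compute capability printed above can be compared against the "Hopper or later" requirement for MXFP4 stated at the top of this notebook. The following is a minimal sketch under the assumption that Hopper reports compute capability 9.0 and that newer architectures report higher values; `supports_mxfp4` is a hypothetical helper, not part of `torch` or `transformers`.

```python
from typing import Tuple

# Assumption: Hopper GPUs report compute capability 9.0, and later
# architectures report higher values, so a tuple comparison is enough.
HOPPER_CAPABILITY: Tuple[int, int] = (9, 0)

def supports_mxfp4(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is Hopper or later."""
    return tuple(capability) >= HOPPER_CAPABILITY

# Example with hard-coded capabilities; in the notebook you would pass
# torch.cuda.get_device_capability(i) instead.
for name, cc in [('A100', (8, 0)), ('H100', (9, 0)), ('RTX 4090', (8, 9))]:
    print(f'{name}: MXFP4 kernels expected = {supports_mxfp4(cc)}')
```

Treat the result as a hint only: the installed Triton and kernel packages also determine whether MXFP4 is actually used.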

## 2) Quick Inference via `pipeline`
The high‑level Transformers `pipeline` makes it simple to run gpt‑oss models. This mirrors the guide's usage but wraps it in a function for convenience.

In [None]:
from typing import List, Dict
from contextlib import suppress
try:
    import torch
    from transformers import pipeline
except Exception as e:
    print('Missing deps — run the install cell above. Error:', e)

def make_pipeline(model_name: str = 'openai/gpt-oss-20b', dtype: str = 'auto'):
    """Create a text-generation pipeline for gpt-oss models."""
    torch_dtype = 'auto'
    if dtype.lower() in {'bf16','bfloat16'}:
        torch_dtype = torch.bfloat16 if hasattr(torch, 'bfloat16') else 'auto'
    return pipeline(
        'text-generation',
        model=model_name,
        torch_dtype=torch_dtype,
        device_map='auto'
    )

def run_pipeline_chat(
    generator,
    system_prompt: str,
    user_prompt: str,
    max_new_tokens: int = 200,
    temperature: float = 1.0,
):
    messages = []
    if system_prompt.strip():
        messages.append({'role': 'system', 'content': system_prompt.strip()})
    messages.append({'role': 'user', 'content': user_prompt})
    try:
        result = generator(
            messages,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
        )
        # For chat input, 'generated_text' holds the full message list; return the last reply
        with suppress(Exception):
            generated = result[0]['generated_text']
            if isinstance(generated, list):
                return generated[-1]['content']
            return generated
        return str(result)
    except TypeError:
        # Some older pipelines accept a string; fallback to simple concatenation
        prompt = (system_prompt + '\n\n' if system_prompt.strip() else '') + user_prompt
        result = generator(prompt, max_new_tokens=max_new_tokens, temperature=temperature)
        return result[0]['generated_text'] if isinstance(result, list) else str(result)
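
To make the shapes involved concrete, here is a small sketch that runs without loading a model. It assumes that chat-style pipeline calls return `[{'generated_text': [...messages...]}]` with the assistant reply as the last message; `build_messages` and `extract_reply` are hypothetical helper names, and `fake_result` is hand-written rather than real model output.

```python
from typing import Any, Dict, List

def build_messages(system_prompt: str, user_prompt: str) -> List[Dict[str, str]]:
    """Build the chat message list, skipping an empty system prompt."""
    messages = []
    if system_prompt.strip():
        messages.append({'role': 'system', 'content': system_prompt.strip()})
    messages.append({'role': 'user', 'content': user_prompt})
    return messages

def extract_reply(result: Any) -> str:
    """Pull the assistant text out of a pipeline result.

    Assumes chat-style calls return [{'generated_text': [...messages...]}]
    and plain-string calls return [{'generated_text': '...'}].
    """
    generated = result[0]['generated_text']
    if isinstance(generated, list):
        return generated[-1]['content']
    return generated

# Hand-written stand-in for a pipeline result (not real model output).
fake_result = [{'generated_text': build_messages('', 'Hi') + [
    {'role': 'assistant', 'content': 'Hello!'}]}]
print(extract_reply(fake_result))
```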


## 3) Interactive Chat Playground (ipywidgets)
Use the dropdowns and sliders below to load a model and generate responses.
The first time you load a model, its weights are downloaded (time depends on your bandwidth and disk).

In [None]:
import traceback
from IPython.display import display, Markdown
import ipywidgets as widgets

model_dd = widgets.Dropdown(
    options=['openai/gpt-oss-20b', 'openai/gpt-oss-120b'],
    value='openai/gpt-oss-20b',
    description='Model:',
    style={'description_width': '70px'},
    layout=widgets.Layout(width='50%'),
)
dtype_dd = widgets.Dropdown(
    options=['auto', 'bfloat16'],
    value='auto',
    description='dtype:',
    style={'description_width': '70px'},
)
load_btn = widgets.Button(description='Load / Reload Model', button_style='info')
load_out = widgets.Output()

sys_in = widgets.Textarea(
    value='You are a helpful assistant.',
    description='System',
    layout=widgets.Layout(width='100%', height='80px'),
    style={'description_width': '70px'}
)
usr_in = widgets.Textarea(
    value='Explain what MXFP4 quantization is.',
    description='User',
    layout=widgets.Layout(width='100%', height='80px'),
    style={'description_width': '70px'}
)

temp_sl = widgets.FloatSlider(value=1.0, min=0.0, max=2.0, step=0.05, description='Temp:',
                              style={'description_width': '70px'}, readout_format='.2f')
tokens_sl = widgets.IntSlider(value=200, min=16, max=4096, step=16, description='Max tokens:',
                               style={'description_width': '90px'})

gen_btn = widgets.Button(description='Generate', button_style='primary')
out = widgets.Output()
pipe_holder = {'pipe': None}

def on_load_clicked(_):
    load_out.clear_output()
    out.clear_output()
    with load_out:
        try:
            display(Markdown(f"**Loading** `{model_dd.value}` with dtype `{dtype_dd.value}`..."))
            pipe_holder['pipe'] = make_pipeline(model_dd.value, dtype_dd.value)
            display(Markdown('✅ **Model loaded**'))
        except Exception:
            traceback.print_exc()

def on_generate_clicked(_):
    out.clear_output()
    with out:
        if pipe_holder['pipe'] is None:
            display(Markdown('⚠️ Load a model first.'))
            return
        try:
            text = run_pipeline_chat(
                pipe_holder['pipe'],
                sys_in.value,
                usr_in.value,
                max_new_tokens=int(tokens_sl.value),
                temperature=float(temp_sl.value),
            )
            display(Markdown('**Output**'))
            display(Markdown(f'```\n{text}\n```'))
        except Exception:
            traceback.print_exc()

load_btn.on_click(on_load_clicked)
gen_btn.on_click(on_generate_clicked)

ui = widgets.VBox([
    widgets.HBox([model_dd, dtype_dd, load_btn]),
    load_out,
    sys_in,
    usr_in,
    widgets.HBox([temp_sl, tokens_sl]),
    gen_btn,
    out
])
display(ui)
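
The playground above sends a single system and user message per click. If you want multi-turn conversations, one possible approach is to keep a running message list and pass it to the pipeline each time. The `ChatHistory` class below is a hypothetical sketch of that idea, not something the guide provides.

```python
from typing import Dict, List

class ChatHistory:
    """Hypothetical helper that accumulates messages across turns."""

    def __init__(self, system_prompt: str = '') -> None:
        self.messages: List[Dict[str, str]] = []
        if system_prompt.strip():
            self.messages.append({'role': 'system', 'content': system_prompt.strip()})

    def add_user(self, text: str) -> None:
        self.messages.append({'role': 'user', 'content': text})

    def add_assistant(self, text: str) -> None:
        self.messages.append({'role': 'assistant', 'content': text})

    def reset(self) -> None:
        """Drop everything except the system prompt, if any."""
        self.messages = [m for m in self.messages if m['role'] == 'system']

history = ChatHistory('You are a helpful assistant.')
history.add_user('Explain what MXFP4 quantization is.')
history.add_assistant('(model reply would go here)')
history.add_user('Summarize that in one sentence.')
print([m['role'] for m in history.messages])
```

In the widget callback you would call `history.add_user(...)` before generating and `history.add_assistant(...)` with the returned text afterwards.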

## 4) Advanced Inference with `.generate()`
Manual control using `AutoModelForCausalLM` and `AutoTokenizer`, including the chat template.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = 'openai/gpt-oss-20b'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype='auto',
    device_map='auto'
)

messages = [
    {'role': 'user', 'content': 'Explain what MXFP4 quantization is.'},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors='pt',
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7
)

print(tokenizer.decode(outputs[0]))
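
The decoded string above contains the prompt and any intermediate channels as well as the answer. The original guide (see the appendix) isolates the answer by splitting on the `<|channel|>final<|message|>` marker. The sketch below wraps that in a hypothetical helper, `extract_final_text`; the trailing end-token names it strips are an assumption about the Harmony format, and the sample string is hand-written rather than real model output.

```python
FINAL_MARKER = '<|channel|>final<|message|>'

def extract_final_text(decoded: str, end_tokens=('<|return|>', '<|end|>')) -> str:
    """Return the text after the last final-channel marker.

    Falls back to the whole string when the marker is absent. The end-token
    names are an assumption about the Harmony format; adjust as needed.
    """
    text = decoded.split(FINAL_MARKER)[-1]
    for token in end_tokens:
        if text.endswith(token):
            text = text[: -len(token)]
    return text.strip()

# Hand-written example string, not real model output.
sample = ('<|channel|>analysis<|message|>thinking...<|end|>'
          '<|start|>assistant<|channel|>final<|message|>MXFP4 is a 4-bit format.<|return|>')
print(extract_final_text(sample))
```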

## 5) Chat Template & Tool Calling (Harmony)
The gpt‑oss models use the Harmony response format. You can either rely on the built‑in chat template or use the `openai-harmony` library to build/parse prompts and completions.

In [None]:
# If needed, install the library first in the install cell above: !pip -q install openai-harmony
import json
from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
)
from transformers import AutoModelForCausalLM, AutoTokenizer

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
    Message.from_role_and_content(
        Role.DEVELOPER,
        DeveloperContent.new().with_instructions('Always respond in riddles')
    ),
    Message.from_role_and_content(Role.USER, 'What is the weather like in SF?')
])

prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
stop_token_ids = encoding.stop_tokens_for_assistant_actions()

model_name = 'openai/gpt-oss-20b'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype='auto', device_map='auto')

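import torch  # needed to wrap the token IDs in a tensor

# generate() expects a tensor of shape (batch, seq_len), not a nested list
input_ids = torch.tensor([prefill_ids], device=model.device)
outputs = model.generate(
    input_ids=input_ids,
    max_new_tokens=128,
    eos_token_id=stop_token_ids
)

# Keep only the newly generated tokens, as plain ints for the Harmony parser
completion_ids = outputs[0][len(prefill_ids):].tolist()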
entries = encoding.parse_messages_from_completion_tokens(completion_ids, Role.ASSISTANT)
for message in entries:
    print(json.dumps(message.to_dict(), indent=2))

## 6) Serve an OpenAI‑style Responses endpoint
You can serve the model locally via the `transformers serve` CLI to expose a `/v1/responses` endpoint. Run these **commands in a terminal** (they are shown here for reference):

In [None]:
# --- Terminal commands (reference) ---
# transformers serve
# transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-20b
# curl -X POST http://localhost:8000/v1/responses \
#   -H 'Content-Type: application/json' \
#   -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-20b"}'

print('See comments above for terminal commands. You can copy/paste them into your shell.')

### Client example (from a notebook) against `transformers serve`
Adjust the URL and payload to your needs.

In [None]:
import json, requests
url = 'http://localhost:8000/v1/responses'
payload = {
    'messages': [
        {'role': 'system', 'content': 'hello'},
        {'role': 'user', 'content': 'Say hi in one sentence.'}
    ],
    'temperature': 0.7,
    'max_tokens': 64,
    'stream': False,
    'model': 'openai/gpt-oss-20b'
}
try:
    r = requests.post(url, json=payload, timeout=30)
    print(r.status_code)
    print(r.text[:1000])
except Exception as e:
    print('Request failed (is the server running?):', e)
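
Before sending requests, it can help to confirm that something is listening on the server port. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and port 8000 is taken from the commands above. It only checks that a TCP connection succeeds, not that the process behind it is `transformers serve`.

```python
import socket

def is_port_open(host: str = 'localhost', port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

print('Server reachable:', is_port_open())
```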

## 7) Multi‑GPU & Distributed Inference
Sketch for automatic placement, tensor parallelism, and expert parallelism (edit for your cluster).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
try:
    from transformers.distributed import DistributedConfig
except Exception:
    class DistributedConfig:
        def __init__(self, **kwargs):
            self.kwargs = kwargs

model_path = 'openai/gpt-oss-120b'
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side='left')

device_map = {
    'distributed_config': DistributedConfig(enable_expert_parallel=1),
    'tp_plan': 'auto',
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype='auto',
    attn_implementation='kernels-community/vllm-flash-attn3',
    **device_map,
)

messages = [
    {'role': 'user', 'content': 'Explain how expert parallelism works in large language models.'}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors='pt',
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0])
print('Model response (truncated):\n', response[:1000])

# To launch across 4 GPUs from a script:
# torchrun --nproc_per_node=4 generate.py
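
When a script is launched with `torchrun`, every worker process runs the same code, so output is repeated once per GPU unless you guard it. `torchrun` exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each worker. The sketch below reads them with a hypothetical helper, `read_dist_info`, and falls back to single-process defaults when they are absent.

```python
import os
from typing import Mapping, NamedTuple

class DistInfo(NamedTuple):
    rank: int
    local_rank: int
    world_size: int

def read_dist_info(env: Mapping[str, str] = os.environ) -> DistInfo:
    """Read the rank variables that torchrun exports to each worker.

    Defaults describe a single, non-distributed process.
    """
    return DistInfo(
        rank=int(env.get('RANK', 0)),
        local_rank=int(env.get('LOCAL_RANK', 0)),
        world_size=int(env.get('WORLD_SIZE', 1)),
    )

info = read_dist_info()
if info.rank == 0:
    # Only the first process prints, so output is not repeated per GPU.
    print(f'Running with world_size={info.world_size}')
```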


## 8) Troubleshooting Checklist
- **Out of memory**: Use MXFP4 on Hopper‑class GPUs; reduce `max_new_tokens`; try `device_map='auto'`.  
- **Kernel/attn issues**: Ensure Triton is compatible with your CUDA; stick to Triton 3.4 as suggested.  
- **Slow loads**: Reuse the local weight cache; set `HF_HOME` (or the older `TRANSFORMERS_CACHE`) to a fast disk.  
- **No widgets**: Ensure `ipywidgets` is installed and enabled in your notebook frontend.  
- **Server 404/500**: Verify `transformers serve` version and model path.  
- **bfloat16 errors**: Fall back to `torch_dtype='auto'` or ensure hardware support.
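
For the out-of-memory item, one way to apply "reduce `max_new_tokens`" automatically is to retry with a smaller budget. The sketch below is a hypothetical helper, not part of Transformers: it halves the budget after each failure until a floor is reached. It catches the built-in `MemoryError` by default so it runs anywhere; with PyTorch you would likely pass `oom_errors=(torch.cuda.OutOfMemoryError,)` instead.

```python
from typing import Callable

def generate_with_backoff(
    generate: Callable[[int], str],
    max_new_tokens: int = 512,
    min_new_tokens: int = 16,
    oom_errors: tuple = (MemoryError,),
) -> str:
    """Call generate(max_new_tokens), halving the budget after each OOM error."""
    budget = max_new_tokens
    while True:
        try:
            return generate(budget)
        except oom_errors:
            if budget <= min_new_tokens:
                raise  # Already at the floor; give up and surface the error.
            budget = max(min_new_tokens, budget // 2)

# Stand-in generator that "runs out of memory" above 128 tokens.
def fake_generate(n: int) -> str:
    if n > 128:
        raise MemoryError('simulated OOM')
    return f'ok with {n} tokens'

print(generate_with_backoff(fake_generate))
```

This only helps when generation length drives the memory pressure; if the weights themselves do not fit, the load step fails before any retry can happen.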

## Appendix: Original Guide (for reference)
The content below reproduces the original `run-transformers.md` guide:

<details>
<summary>Click to expand</summary>

# How to run gpt-oss with Hugging Face Transformers

The Transformers library by Hugging Face provides a flexible way to load and run large language models locally or on a server. This guide will walk you through running [OpenAI gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) or [OpenAI gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using Transformers, either with a high-level pipeline or via low-level `generate` calls with raw token IDs.

We'll cover the use of [OpenAI gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) or [OpenAI gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) with the high-level pipeline abstraction, low-level `generate` calls, and serving models locally with `transformers serve`, in a way compatible with the Responses API.

In this guide we’ll run through various optimised ways to run the **gpt-oss models via Transformers.**

Bonus: You can also fine-tune models via transformers, [check out our fine-tuning guide here](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transformers).

## Pick your model

Both **gpt-oss** models are available on Hugging Face:

- **`openai/gpt-oss-20b`**
  - \~16GB VRAM requirement when using MXFP4
  - Great for single high-end consumer GPUs
- **`openai/gpt-oss-120b`**
  - Requires ≥60GB VRAM or multi-GPU setup
  - Ideal for H100-class hardware

Both are **MXFP4 quantized** by default. Please note that MXFP4 is supported on Hopper or later architectures. This includes data center GPUs such as H100 or GB200, as well as the latest RTX 50xx family of consumer cards.

If you use `bfloat16` instead of MXFP4, memory consumption will be larger (\~48 GB for the 20b parameter model).

## Quick setup

1. **Install dependencies**  
   It’s recommended to create a fresh Python environment. Install transformers, accelerate, as well as the Triton kernels for MXFP4 compatibility:

```bash
pip install -U transformers accelerate torch triton==3.4 kernels
```

2. **(Optional) Enable multi-GPU**  
   If you’re running large models, use Accelerate or torchrun to handle device mapping automatically.

## Create an OpenAI Responses / Chat Completions endpoint

To launch a server, simply use the `transformers serve` CLI command:

```bash
transformers serve
```

The simplest way to interact with the server is through the `transformers chat` CLI:

```bash
transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-20b
```

or by sending an HTTP request with cURL, e.g.

```bash
curl -X POST http://localhost:8000/v1/responses -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-20b"}'
```

Additional use cases, like integrating `transformers serve` with Cursor and other tools, are detailed in [the documentation](https://huggingface.co/docs/transformers/main/serving).

## Quick inference with pipeline

The easiest way to run the gpt-oss models is with the Transformers high-level `pipeline` API:

```py
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto"  # Automatically place on available GPUs
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

result = generator(
    messages,
    max_new_tokens=200,
    temperature=1.0,
)

print(result[0]["generated_text"])
```

## Advanced inference with `.generate()`

If you want more control, you can load the model and tokenizer manually and invoke the `.generate()` method:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7
)

print(tokenizer.decode(outputs[0]))
```

## Chat template and tool calling

OpenAI gpt-oss models use the [harmony response format](https://cookbook.openai.com/article/harmony) for structuring messages, including reasoning and tool calls.

To construct prompts you can use the built-in chat template of Transformers. Alternatively, you can install and use the [openai-harmony library](https://github.com/openai/harmony) for more control.

To use the chat template:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "system", "content": "Always respond in riddles"},
    {"role": "user", "content": "What is the weather like in Madrid?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1] :]))
```

To integrate the [`openai-harmony`](https://github.com/openai/harmony) library to prepare prompts and parse responses, first install it like this:

```bash
pip install openai-harmony
```

Here’s an example of how to use the library to build your prompts and encode them to tokens:

```py
import json
from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent
)
from transformers import AutoModelForCausalLM, AutoTokenizer

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

# Build conversation
convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
    Message.from_role_and_content(
        Role.DEVELOPER,
        DeveloperContent.new().with_instructions("Always respond in riddles")
    ),
    Message.from_role_and_content(Role.USER, "What is the weather like in SF?")
])

# Render prompt
prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
stop_token_ids = encoding.stop_tokens_for_assistant_actions()

# Load model
model_name = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Generate
outputs = model.generate(
    input_ids=[prefill_ids],
    max_new_tokens=128,
    eos_token_id=stop_token_ids
)

# Parse completion tokens
completion_ids = outputs[0][len(prefill_ids):]
entries = encoding.parse_messages_from_completion_tokens(completion_ids, Role.ASSISTANT)

for message in entries:
    print(json.dumps(message.to_dict(), indent=2))
```

Note that the `Developer` role in Harmony maps to the `system` prompt in the chat template.

## Multi-GPU & distributed inference

The large gpt-oss-120b fits on a single H100 GPU when using MXFP4. If you want to run it on multiple GPUs, you can:

- Use `tp_plan="auto"` for automatic placement and tensor parallelism
- Launch with `accelerate launch` or `torchrun` for distributed setups
- Leverage Expert Parallelism
- Use specialised Flash attention kernels for faster inference

```py
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    # Enable Expert Parallelism
    "distributed_config": DistributedConfig(enable_expert_parallel=1),
    # Enable Tensor Parallelism
    "tp_plan": "auto",
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
     {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())
```

You can then run this on a node with four GPUs via

```bash
torchrun --nproc_per_node=4 generate.py
```


</details>

_Notebook generated on 2025-09-17T23:43:44Z_