gpt-oss-120b inference failed running on 16 GPUs, single node and with tp_plan="auto" #40953

@yuanhangsu1986

System Info

  • transformers version: 4.57.0.dev0
  • Platform: Linux-5.4.0-80-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.34.5
  • Safetensors version: 0.5.3
  • Accelerate version: 1.6.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.6.0+cu124 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: Tesla V100-SXM3-32GB-H

Who can help?

@SunMarc @zucchini-nlp @vasqu @ArthurZucker @Cyrilvallez
Running inference with tensor parallelism (tp_plan="auto") on the gpt-oss-120b model, on a single node with 16 GPUs, fails in the attention layer with the following error:

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 314, in forward
[rank0]:     key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
[rank0]: RuntimeError: shape '[1, 89, -1, 64]' is invalid for input of size 2848

It seems the projected tensor, with only 2848 elements, is smaller than the minimum the view into hidden_shape requires. This happens because TP partitions the k_proj weight matrix across the 16 ranks.
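
To make the arithmetic concrete, here is a minimal sketch, assuming gpt-oss-120b's config values (num_key_value_heads=8, head_dim=64), which are consistent with the numbers in the error:

num_kv_heads, head_dim = 8, 64
seq_len, tp_size = 89, 16

full_width = num_kv_heads * head_dim    # 512 k_proj output features per token
per_rank_width = full_width // tp_size  # 32 features per token, i.e. half a KV head
print(per_rank_width * seq_len)         # 2848 -- the "input of size 2848" in the error
# view(1, 89, -1, 64) requires a multiple of 89 * 64 = 5696 elements, so it fails.

So with 8 KV heads, a 16-way shard leaves each rank holding half a head, and no value of the -1 dimension can make the view succeed.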

I'm wondering whether changing self.head_dim from 64 to 32 would be a possible solution.
In other words, in pseudo-code:

k_proj = self.k_proj(hidden_states)
if torch.numel(k_proj) < math.prod((*input_shape, 1, self.head_dim)):
    # rescale head_dim by the ratio of actual to expected element counts
    head_dim = self.head_dim * torch.numel(k_proj) // math.prod((*input_shape, 1, self.head_dim))
...
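
For concreteness, here is a standalone sketch of this rescaling idea (hypothetical, not the transformers implementation; the sizes mirror the failing case above, and input_shape stands in for hidden_states.shape[:-1]):

import math
import torch

input_shape = (1, 89)            # (batch, seq_len) from the failing call
config_head_dim = 64             # self.head_dim from the config
key_states = torch.randn(2848)   # what the 16-way-sharded k_proj yields here

expected = math.prod((*input_shape, 1, config_head_dim))  # 5696: one head per rank
head_dim = config_head_dim
if key_states.numel() < expected:
    # Shrink head_dim proportionally so the per-rank tensor reshapes cleanly.
    head_dim = config_head_dim * key_states.numel() // expected  # -> 32

key_states = key_states.view(*input_shape, -1, head_dim).transpose(1, 2)
print(key_states.shape)          # torch.Size([1, 1, 89, 32])

One caveat: head_dim also enters the rotary embedding and the attention scaling (head_dim**-0.5), so a halved head_dim would change the attention math, not just the memory layout, and is probably not a drop-in fix.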

One thing to note is that the GPUs we use are V100s, which is why we need this many GPUs to run gpt-oss-120b.
Here is the official fix for running gpt-oss on older GPUs.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

To reproduce:

  • Python script (gpt-oss-120b.py):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")
# Keep the MXFP4 weights quantized; tp_plan="auto" shards the model across all ranks.
quantization_config = Mxfp4Config(dequantize=False)
model_kwargs = dict(attn_implementation="eager", dtype=torch.bfloat16, use_cache=True, tp_plan="auto", quantization_config=quantization_config)
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b", **model_kwargs).cuda()

SYSTEM_PROMPT = "Please answer the following question in English."
USER_PROMPT = "What is the capital of Australia?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

gen_kwargs = {"max_new_tokens": 512, "do_sample": True, "temperature": 0.6, "top_p": None, "top_k": None}

output_ids = model.generate(input_ids, **gen_kwargs)
response = tokenizer.batch_decode(output_ids)[0]
print(response)
  • Launch the job with torchrun:
torchrun --nproc-per-node 16 gpt-oss-120b.py
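
For what it's worth, a possibly simpler workaround (untested here, and assuming the sharded model still fits in memory) would be to cap the TP degree at the number of KV heads, so each rank holds a whole head:

torchrun --nproc-per-node 8 gpt-oss-120b.py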

Expected behavior

The script should run without error.
