Description
System Info
- `transformers` version: 4.57.0.dev0
- Platform: Linux-5.4.0-80-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.34.5
- Safetensors version: 0.5.3
- Accelerate version: 1.6.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.6.0+cu124 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: Tesla V100-SXM3-32GB-H
Who can help?
@SunMarc @zucchini-nlp @vasqu @ArthurZucker @Cyrilvallez
Running inference with tensor parallelism (TP) on the gpt-oss-120b model on a node with 16 GPUs, I got the following error in the attention layer:
```
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 314, in forward
[rank0]:     key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
[rank0]: RuntimeError: shape '[1, 89, -1, 64]' is invalid for input of size 2848
```
It seems the projected tensor has only 2848 elements, fewer than the minimum required by `hidden_shape`. This happens because of the TP partitioning of the projection matrix.
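If I read the gpt-oss-120b config correctly (8 KV heads, head_dim 64) and `k_proj` is split column-wise across the 16 ranks, the numbers line up with the error. A rough sanity check of the shape arithmetic (the config values and the column-wise split are my assumptions, not taken from the traceback):

```python
# Rough per-rank shape arithmetic (assuming num_key_value_heads=8 and head_dim=64
# from the gpt-oss-120b config, and a column-wise 16-way split of k_proj).
seq_len = 89                    # sequence length from the error message
num_key_value_heads = 8
head_dim = 64
tp_size = 16

k_out_per_rank = num_key_value_heads * head_dim // tp_size   # 512 // 16 = 32
numel_per_rank = seq_len * k_out_per_rank                    # 89 * 32 = 2848, matching the error

# view(1, 89, -1, 64) needs a multiple of 89 * 64 = 5696 elements,
# so each rank only holds half a KV head and the view fails.
print(numel_per_rank, seq_len * head_dim)                    # -> 2848 5696
```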
I'm wondering whether changing `self.head_dim` from 64 to 32 is a possible solution to this?
In other words, pseudo-code:
```python
k_proj = self.k_proj(hidden_states)
expected_numel = math.prod((*input_shape, 1, self.head_dim))
if torch.numel(k_proj) < expected_numel:
    # shrink head_dim to match the k_proj slice this TP rank actually holds
    head_dim = self.head_dim * torch.numel(k_proj) // expected_numel
    ...
```
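If the shape arithmetic above is right, a per-rank `head_dim` of 32 would make the `view` succeed, but each rank would then effectively hold half a KV head, and I'm not sure the rest of the attention path (RoPE, KV cache, `o_proj`) stays consistent in that case, so this is more a question than a proposed patch.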
One thing to note is that the GPUs we used are V100s, which is why we need that many GPUs to run gpt-oss-120b. Here is the official fix for running gpt-oss on older GPUs.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
To reproduce:
- Python code (`gpt-oss-120b.py`):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

quantization_config = Mxfp4Config(dequantize=False)
model_kwargs = dict(attn_implementation="eager", dtype=torch.bfloat16, use_cache=True, tp_plan="auto", quantization_config=quantization_config)
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b", **model_kwargs).cuda()

SYSTEM_PROMPT = "Please answer the following question in English."
USER_PROMPT = "What is the capital of Australia?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

gen_kwargs = {"max_new_tokens": 512, "do_sample": True, "temperature": 0.6, "top_p": None, "top_k": None}
output_ids = model.generate(input_ids, **gen_kwargs)
response = tokenizer.batch_decode(output_ids)[0]
print(response)
```
- Launch the job with torchrun:
```
torchrun --nproc-per-node 16 gpt-oss-120b.py
```
Expected behavior
Should run without error