Description
System Info
- `transformers` version: 4.57.0.dev0
- Platform: Linux-5.4.0-80-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.34.5
- Safetensors version: 0.5.3
- Accelerate version: 1.6.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.6.0+cu124 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: Tesla V100-SXM3-32GB-H
Who can help?
@SunMarc @zucchini-nlp @vasqu @ArthurZucker @Cyrilvallez
Running inference with tensor parallelism (TP) on the gpt-oss-120b model on a node with 16 GPUs, I got the following error in the attention layer:
```
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 314, in forward
[rank0]:     key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
[rank0]: RuntimeError: shape '[1, 89, -1, 64]' is invalid for input of size 2848
```
It seems the projected tensor has only 2848 elements, fewer than the minimum required by `hidden_shape`. This happens because of the TP partitioning of the projection matrix.
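If I read the gpt-oss-120b config correctly (8 KV heads, head_dim 64) and `k_proj` is split column-wise across the 16 ranks, the numbers line up with the error. A rough sanity check of the shape arithmetic (the config values and the column-wise split are my assumptions, not taken from the traceback):

```python
# Rough per-rank shape arithmetic (assuming num_key_value_heads=8 and head_dim=64
# from the gpt-oss-120b config, and a column-wise 16-way split of k_proj).
seq_len = 89                    # sequence length from the error message
num_key_value_heads = 8
head_dim = 64
tp_size = 16

k_out_per_rank = num_key_value_heads * head_dim // tp_size   # 512 // 16 = 32
numel_per_rank = seq_len * k_out_per_rank                    # 89 * 32 = 2848, matching the error

# view(1, 89, -1, 64) needs a multiple of 89 * 64 = 5696 elements,
# so each rank only holds half a KV head and the view fails.
print(numel_per_rank, seq_len * head_dim)                    # -> 2848 5696
```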
I'm wondering whether changing `self.head_dim` from 64 to 32 is a possible solution to this?
In other words, pseudo-code:
```python
k_proj = self.k_proj(hidden_states)
expected_numel = math.prod((*input_shape, 1, self.head_dim))
if torch.numel(k_proj) < expected_numel:
    # shrink head_dim to match the k_proj slice this TP rank actually holds
    head_dim = self.head_dim * torch.numel(k_proj) // expected_numel
    ...
```
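If the shape arithmetic above is right, a per-rank `head_dim` of 32 would make the `view` succeed, but each rank would then effectively hold half a KV head, and I'm not sure the rest of the attention path (RoPE, KV cache, `o_proj`) stays consistent in that case, so this is more a question than a proposed patch.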
One thing to note is that the GPUs we used are V100s, which is why we need that many GPUs to run gpt-oss-120b. Here is the official fix for running gpt-oss on older GPUs.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
To reproduce:
- Python code (`gpt-oss-120b.py`):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

quantization_config = Mxfp4Config(dequantize=False)
model_kwargs = dict(attn_implementation="eager", dtype=torch.bfloat16, use_cache=True, tp_plan="auto", quantization_config=quantization_config)
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-120b", **model_kwargs).cuda()

SYSTEM_PROMPT = "Please answer the following question in English."
USER_PROMPT = "What is the capital of Australia?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

gen_kwargs = {"max_new_tokens": 512, "do_sample": True, "temperature": 0.6, "top_p": None, "top_k": None}
output_ids = model.generate(input_ids, **gen_kwargs)
response = tokenizer.batch_decode(output_ids)[0]
print(response)
```
- Launch the job with torchrun:
```
torchrun --nproc-per-node 16 gpt-oss-120b.py
```
Expected behavior
Should run without error