[Bug] GOT-OCR2.0 crashes with num_beams > 1 during inference (shape mismatch in Qwen2 attention) #7789

@RhiteshKS

Describe the bug
I’m encountering a runtime crash when using beam search (num_beams > 1) with the GOT-OCR2.0 model in swift infer.
This happens consistently and appears to be a beam dimension handling bug in the multimodal generation path (Qwen2-based GOT OCR).
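For context, a minimal sketch of the contract beam search expects (plain PyTorch, not the GOT code path; shapes and names here are hypothetical): every batch-indexed tensor, including any multimodal features passed alongside input_ids, is repeated num_beams times along dim 0, and the hidden dimension is never touched.

import torch

batch_size, num_beams, seq_len, hidden = 1, 5, 12, 1024

input_ids = torch.randint(0, 151860, (batch_size, seq_len))
image_feats = torch.randn(batch_size, 256, hidden)  # hypothetical vision features

# Beam search turns batch_size into batch_size * num_beams by repeating
# along dim 0; the feature/hidden dimensions must stay untouched.
input_ids = input_ids.repeat_interleave(num_beams, dim=0)
image_feats = image_feats.repeat_interleave(num_beams, dim=0)

print(input_ids.shape)    # torch.Size([5, 12])
print(image_feats.shape)  # torch.Size([5, 256, 1024])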

Model
stepfun-ai/GOT-OCR2_0

Command used
CUDA_VISIBLE_DEVICES=0 swift infer \
--adapters checkpoint-226000 \
--temperature 0 \
--num_beams 5 \
--repetition_penalty 1.08 \
--val_dataset test.jsonl \
--max_new_tokens 4096 \
--stream false \
--result_path 226k_beam5_results.jsonl

Expected behavior
Inference should complete normally with beam search enabled.

Actual behavior
Inference crashes immediately with a matrix shape mismatch during attention projection.

Full traceback
run sh: /home/vlm/dataset/printed/myenv/bin/python3 /home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/cli/infer.py --adapters checkpoint-226000 --temperature 0 --num_beams 5 --repetition_penalty 1.08 --val_dataset test.jsonl --max_new_tokens 4096 --stream false --result_path 226k_beam5_results.jsonl
[INFO:swift] Successfully registered /home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/dataset/data/dataset_info.json.
[INFO:swift] Loading the model using model_dir: checkpoint-226000
[INFO:swift] Successfully loaded /home/vlm/dataset/printed/checkpoint-226000/args.json.
[INFO:swift] rank: -1, local_rank: -1, world_size: 1, local_world_size: 1
[INFO:swift] Downloading the model from ModelScope Hub, model_id: stepfun-ai/GOT-OCR2_0
Downloading Model from https://www.modelscope.cn to directory: /home/vlm/.cache/modelscope/hub/models/stepfun-ai/GOT-OCR2_0
[INFO:swift] Loading the model using model_dir: /home/vlm/.cache/modelscope/hub/models/stepfun-ai/GOT-OCR2_0
torch_dtype is deprecated! Use dtype instead!
[INFO:swift] Because len(args.val_dataset) > 0, setting split_dataset_ratio: 0.0
[INFO:swift] Setting args.lazy_tokenize: True
[INFO:swift] Setting args.eval_human: False
[INFO:swift] Global seed set to 42
[INFO:swift] args: InferArguments(model='stepfun-ai/GOT-OCR2_0', model_type='got_ocr2', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl=None, new_special_tokens=[], num_labels=None, problem_type=None, rope_scaling=None, device_map=None, max_memory={}, max_model_len=None, local_repo_path=None, init_strategy=None, template='got_ocr2', system=None, max_length=32768, truncation_strategy='delete', max_pixels=None, agent_template=None, norm_bbox=None, use_chat_template=True, padding_free=False, padding_side='right', loss_scale='default', sequence_parallel_size=1, response_prefix=None, template_backend='swift', dataset=[], val_dataset=['page_test.jsonl'], split_dataset_ratio=0.0, data_seed=42, dataset_num_proc=1, load_from_cache_file=True, dataset_shuffle=True, val_dataset_shuffle=False, streaming=False, interleave_prob=None, stopping_strategy='first_exhausted', shuffle_buffer_size=1000, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, model_name=None, model_author=None, custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=4096, temperature=0.0, top_k=None, top_p=None, repetition_penalty=1.08, num_beams=5, stream=False, stop_words=[], logprobs=False, top_logprobs=None, ckpt_dir='/home/vlm/dataset/printed/hindi_master_aug_dpi2/v0-20260113-092657/checkpoint-226000', lora_modules=[], tuner_backend='peft', train_type='lora', adapters=['/home/vlm/dataset/printed/hindi_master_aug_dpi2/v0-20260113-092657/checkpoint-226000'], external_plugins=[], seed=42, model_kwargs={}, load_args=True, load_data_args=False, packing=False, packing_length=None, lazy_tokenize=True, cached_dataset=[], custom_register_path=[], use_hf=False, hub_token=None, ddp_timeout=18000000, ddp_backend=None, ignore_args_error=False, use_swift_lora=False, vllm_gpu_memory_utilization=0.9, vllm_tensor_parallel_size=1, vllm_pipeline_parallel_size=1, vllm_enable_expert_parallel=False, vllm_max_num_seqs=256, vllm_max_model_len=None, vllm_disable_custom_all_reduce=True, vllm_enforce_eager=False, vllm_limit_mm_per_prompt={}, vllm_max_lora_rank=16, vllm_enable_prefix_caching=False, vllm_use_async_engine=False, vllm_quantization=None, vllm_reasoning_parser=None, vllm_disable_cascade_attn=False, vllm_data_parallel_size=1, gpu_memory_utilization=None, tensor_parallel_size=None, limit_mm_per_prompt=None, data_parallel_size=None, use_async_engine=None, sglang_tp_size=1, sglang_pp_size=1, sglang_dp_size=1, sglang_ep_size=1, sglang_enable_ep_moe=False, sglang_mem_fraction_static=None, sglang_context_length=None, sglang_disable_cuda_graph=False, sglang_quantization=None, sglang_kv_cache_dtype='auto', sglang_enable_dp_attention=False, sglang_disable_custom_all_reduce=True, lmdeploy_tp=1, lmdeploy_session_len=None, lmdeploy_cache_max_entry_count=0.8, lmdeploy_quant_policy=0, lmdeploy_vision_batch_size=1, merge_lora=False, safe_serialization=True, max_shard_size='5GB', infer_backend='pt', result_path='/home/vlm/dataset/printed/226k_beam5_results.jsonl', write_batch_size=1000, metric=None, max_batch_size=1, val_dataset_sample=None)
[INFO:swift] Downloading the model from ModelScope Hub, model_id: stepfun-ai/GOT-OCR2_0
Downloading Model from https://www.modelscope.cn to directory: /home/vlm/.cache/modelscope/hub/models/stepfun-ai/GOT-OCR2_0
[INFO:swift] Loading the model using model_dir: /home/vlm/.cache/modelscope/hub/models/stepfun-ai/GOT-OCR2_0
[INFO:swift] model_kwargs: {'device_map': 'cuda:0'}
torch_dtype is deprecated! Use dtype instead!
[INFO:swift] default_system: ' You should follow the instructions carefully and explain your answers in detail.'
[INFO:swift] max_length: 32768
[INFO:swift] response_prefix: ''
[INFO:swift] agent_template: hermes
[INFO:swift] norm_bbox: norm1000
[INFO:swift] Setting ROOT_IMAGE_DIR: None. You can adjust this hyperparameter through the environment variable: ROOT_IMAGE_DIR.
[INFO:swift] model: PeftModelForCausalLM(
(base_model): LoraModel(
(model): GOTQwenForCausalLM(
(model): GOTQwenModel(
(embed_tokens): Embedding(151860, 1024)
(layers): ModuleList(
(0-23): 24 x Qwen2DecoderLayer(
(self_attn): Qwen2Attention(
(q_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=1024, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(k_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=1024, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(v_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=1024, bias=True)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(o_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
)
(mlp): Qwen2MLP(
(gate_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=2816, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=2816, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(up_proj): lora.Linear(
(base_layer): Linear(in_features=1024, out_features=2816, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=1024, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=2816, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(down_proj): lora.Linear(
(base_layer): Linear(in_features=2816, out_features=1024, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.05, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=2816, out_features=32, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=32, out_features=1024, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(lora_magnitude_vector): ModuleDict()
)
(act_fn): SiLUActivation()
)
(input_layernorm): Qwen2RMSNorm((1024,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((1024,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((1024,), eps=1e-06)
(rotary_emb): Qwen2RotaryEmbedding()
(vision_tower_high): ImageEncoderViT(
(patch_embed): PatchEmbed(
(proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
)
(blocks): ModuleList(
(0-11): 12 x Block(
(norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(attn): Attention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(proj): Linear(in_features=768, out_features=768, bias=True)
)
(norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
(mlp): MLPBlock(
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(act): GELU(approximate='none')
)
)
)
(neck): Sequential(
(0): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): LayerNorm2d()
(2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(3): LayerNorm2d()
)
(net_2): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(net_3): Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
)
(mm_projector_vary): Linear(in_features=1024, out_features=1024, bias=True)
)
(lm_head): Linear(in_features=1024, out_features=151860, bias=False)
)
)
)
[INFO:swift] Start time of running main: 2026-01-18 15:33:37.360416
[INFO:swift] swift.version: 2.36.0
[INFO:swift] request_config: RequestConfig(max_tokens=4096, temperature=0.0, top_k=None, top_p=None, repetition_penalty=1.08, num_beams=5, stop=[], seed=None, stream=False, logprobs=False, top_logprobs=None, n=1, best_of=None, presence_penalty=0.0, frequency_penalty=0.0, length_penalty=1.0, return_details=False)
[INFO:swift] val_dataset: Dataset({
features: ['images', 'messages'],
num_rows: 505
})
0%| | 0/505 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/cli/infer.py", line 5, in
infer_main()
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer.py", line 291, in infer_main
return SwiftInfer(args).main()
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
result = self.run()
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer.py", line 91, in run
result = self.infer_dataset()
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer.py", line 247, in infer_dataset
result_list += self._batch_infer(shard_dataset, request_config)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer.py", line 278, in _batch_infer
resp_list = self.infer(val_dataset, request_config, template=self.template, use_tqdm=True, **self.infer_kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer_engine/pt_engine.py", line 562, in infer
res += self._infer(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer_engine/pt_engine.py", line 525, in _infer
res = infer_func(**kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/infer/infer_engine/pt_engine.py", line 370, in _infer_full
output = dict(template.generate(self.model, **generate_kwargs))
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/swift/llm/template/base.py", line 682, in generate
return model.generate(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/peft/peft_model.py", line 1973, in generate
outputs = self.base_model.generate(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2564, in generate
result = decoding_method(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 3265, in _beam_search
model_outputs = self(**model_inputs, return_dict=True)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/.cache/huggingface/modules/transformers_modules/GOT_hyphen_OCR2_0/modeling_GOT.py", line 347, in forward
outputs = self.model(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/.cache/huggingface/modules/transformers_modules/GOT_hyphen_OCR2_0/modeling_GOT.py", line 300, in forward
return super(GOTQwenModel, self).forward(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/utils/generic.py", line 1064, in wrapper
outputs = func(self, *args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 384, in forward
hidden_states = decoder_layer(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in call
return super().call(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 234, in forward
hidden_states, _ = self.self_attn(
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 182, in forward
attn_output = self.o_proj(attn_output)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 757, in forward
result = self.base_layer(x, *args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vlm/dataset/printed/myenv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 134, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (288x5120 and 1024x1024)
0%| | 0/505 [00:08<?, ?it/s]

Failing frame (for quick reference):

File "transformers/models/qwen2/modeling_qwen2.py", line 182, in forward
attn_output = self.o_proj(attn_output)

Note:

  • The dimension 5120 = 1024 (hidden_size) × 5 (num_beams), which suggests the beam dimension is being folded into the hidden dimension instead of being kept on the batch dimension (see the sketch below).
  • The error occurs inside the Qwen2 attention path (o_proj).
  • This only happens for multimodal generation (vision + text).
  • Greedy decoding (num_beams=1) works correctly.
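To illustrate the first point, here is a small standalone sketch (plain PyTorch with hypothetical shapes, not the actual GOT/Qwen2 code) that reproduces the same error message when the five beam replicas end up folded into the feature dimension instead of the batch dimension:

import torch
import torch.nn.functional as F

num_beams, seq_len, hidden = 5, 288, 1024  # 288 is a hypothetical prompt length
o_proj_weight = torch.randn(hidden, hidden)

# Correct layout: beams live on the batch dim -> (5, 288, 1024); o_proj works.
attn_output = torch.randn(num_beams, seq_len, hidden)
F.linear(attn_output, o_proj_weight)

# Layout implied by the traceback: the 5 beam replicas are folded into the
# feature dim -> (288, 5120), which cannot multiply o_proj's (1024, 1024) weight.
folded = attn_output.permute(1, 0, 2).reshape(seq_len, num_beams * hidden)
try:
    F.linear(folded, o_proj_weight)
except RuntimeError as e:
    print(e)  # mat1 and mat2 shapes cannot be multiplied (288x5120 and 1024x1024)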

Your hardware and system info
ms-swift version: 2.36.0
transformers version: 4.57.1
torch: 2.9.0+cu128
cuda available: True
cuda version: 12.8
gpu: NVIDIA RTX 6000 Ada Generation
Python version: 3.10.12
OS: Ubuntu 24.04.3 LTS
