Fix flash attention crash with 3D position_ids (Qwen3.5) #44911
ouroborosscr wants to merge 4 commits into huggingface:main
Conversation
Qwen3.5 uses 3D position_ids [3, batch, seq_len] for multi-dimensional rotary embedding. _is_packed_sequence() misinterprets this as a packed sequence, causing cu_seqlens to be constructed with 3x the actual token count. Flash attention then reads beyond tensor boundaries, resulting in CUDA illegal memory access. Add a dimensionality check to reject >2D position_ids, since packed sequences always use 2D [batch, seq_len] format.
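The 3× inflation is easy to demonstrate in a few lines (a simplified sketch with assumed shapes, using NumPy in place of torch; this is not the actual transformers code path):

```python
import numpy as np

# Simplified illustration with assumed shapes; not the actual transformers code.
batch, seq_len = 2, 8
pos_2d = np.tile(np.arange(seq_len), (batch, 1))   # [batch, seq_len]   - ordinary positions
pos_3d = np.stack([pos_2d, pos_2d, pos_2d])        # [3, batch, seq_len] - Qwen3.5-style mRoPE

# If the packed-sequence path builds cu_seqlens from the flattened
# position_ids, the 3D tensor contributes three times the real token
# count, so flash attention indexes past the K/V buffers.
real_tokens = pos_2d.size       # 16
inflated_tokens = pos_3d.size   # 48 -> 3x too many
print(real_tokens, inflated_tokens)
```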
Hi, would you mind checking out this comment: https://github.com/QwenLM/Qwen3.5/issues/104#issuecomment-4105702644? (BTW, I'm not sure if multimodal sequence packing is supported yet. If it isn't, the current fix is fine.)
@JJJYmmm's proposed fix has been applied to this pull request.
ArthurZucker
left a comment
Ty both! Do you mind adding a direct ref to the issue in the packed-sequence doc? (We want to make sure vendors/authors are credited!) 🤗
zucchini-nlp
left a comment
Nope, this should not be the way. Please see the comment under the issue; we already pass text-only positions without THW.
Done! Added a direct reference to both issues and credited the contributors in the docstring. Thanks for the reminder 🤗 @ArthurZucker
@zucchini-nlp Oops, I overlooked that text_position_ids would be set to None when position_ids.shape[0] == 3. Previously, I thought this was just a corner case where users skipped preparing 4D position ids while calling the model directly. Sorry for the confusion. 🥹
@zucchini-nlp @JJJYmmm My testing is done. With transformers==5.3.0.dev0, there is no illegal memory access.
Now: (screenshot)
Past: (screenshot)
Nice @ouroborosscr, then we can close the PR as there is no bug. The next release will include the working FA for Qwen3 :)
What does this PR do?
Qwen3.5 uses a hybrid architecture (GatedDeltaNet + standard attention) with 3D `position_ids` of shape `[3, batch_size, seq_len]` for multi-dimensional rotary embedding. The function `_is_packed_sequence()` in `modeling_flash_attention_utils.py` does not handle >2D tensors, causing it to misidentify the input as a packed sequence. This leads to `cu_seqlens` being constructed with 3× the actual token count, and `flash_attn_varlen_func` reads beyond tensor boundaries, resulting in `CUDA error: illegal memory access`.

The fix: add `if position_ids.dim() > 2: return False` at the top of `_is_packed_sequence()`, since packed sequences always use 2D `[batch, seq_len]` position_ids.

Intercepted evidence before the crash:
Fixes #44910
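For reference, the guard can be sketched as follows. This uses NumPy as a stand-in for torch (`ndim` in place of `.dim()`), and the packed-sequence heuristic shown is illustrative only, not the exact transformers implementation:

```python
import numpy as np

def is_packed_sequence(position_ids: np.ndarray) -> bool:
    """Sketch of the packed-sequence check (illustrative heuristic)."""
    # The fix: reject >2D position_ids up front. Multi-dimensional RoPE
    # models such as Qwen3.5 pass [3, batch, seq_len] position_ids, and
    # packed sequences are always 2D [batch, seq_len].
    if position_ids.ndim > 2:
        return False
    # Illustrative heuristic: a packed row restarts from 0 mid-sequence,
    # so its per-row position increments contain a negative jump.
    return bool((np.diff(position_ids, axis=-1) < 0).any())

# 3D mRoPE positions (Qwen3.5-style) are no longer mistaken for packing:
mrope = np.stack([np.tile(np.arange(4), (2, 1))] * 3)  # shape [3, 2, 4]
packed = np.array([[0, 1, 2, 0, 1]])                   # two sequences packed in one row
print(is_packed_sequence(mrope), is_packed_sequence(packed))  # False True
```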
Before submitting
Who can review?
@vasqu @ArthurZucker @Cyrilvallez (attention)
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.