
Fix CUDA ONNX Attention: min_bias_align crash on SM<80 and MEA NaN for fully-masked batches#27831

Open
titaiwangms wants to merge 1 commit into main from titaiwang/fix_cuda

Conversation

Contributor

@titaiwangms titaiwangms commented Mar 24, 2026

Summary

Fixes two bugs in the CUDA ONNX Attention operator:

  1. min_bias_align crash on SM<80: The alignment check for the Memory Efficient Attention (MEA) bias was too strict on SM<80 GPUs, incorrectly rejecting MEA and forcing a fallback to the unfused path. Fixed by using a conservative 4*sizeof(T) alignment that is valid across all SM architectures.

  2. MEA NaN for fully-masked batches: When all positions in a batch are masked (nonpad_kv_seqlen=0), CUTLASS MEA computes 1/s_prime where s_prime=0, producing NaN in the output. Added ZeroOutputForFullyMaskedBatches kernel to zero the output for these batches before MEA runs.

Additional improvements

  • Set is_bsnh explicitly to false in the unfused decode path (which is BNSH-only)
  • Added TODO(titaiwang) documenting Flash Attention's semantic mismatch with ONNX spec for bool attn_mask + past_key (Flash interprets bool mask as padding mask, spec treats it as general attention mask)
  • Improved comments and docstrings for routing logic and ConvertAttnMaskToBias

Related

@yuslepukhin yuslepukhin requested review from Copilot and tianleiwu and removed request for Copilot March 24, 2026 19:34
Contributor

@github-actions github-actions bot left a comment


You can commit the suggested changes from lintrunner.

titaiwangms added a commit that referenced this pull request Mar 24, 2026
…igned dims

PR #27831 fell back to CUBLAS_DEFAULT_MATH which still uses TF32 on Ampere+ GPUs
(SM>=80) since cuBLAS 11.0. Changed to CUBLAS_PEDANTIC_MATH when dimensions are
not 4-aligned to guarantee no tensor core usage, preventing CUDA error 716
(misaligned address) on CUDA 12.9+.

Three-way logic in all three float cuBLAS GEMM helpers:
- TF32 requested + dimensions aligned: CUBLAS_TF32_TENSOR_OP_MATH
- TF32 requested + dimensions NOT aligned: CUBLAS_PEDANTIC_MATH
- TF32 not requested: CUBLAS_DEFAULT_MATH (unchanged)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the titaiwang/fix_cuda branch 2 times, most recently from 38fef79 to 0796037 on March 25, 2026 20:22
@titaiwangms titaiwangms changed the title Fix TF32 misaligned address error in cuBLAS GEMM functions Fix misaligned BiasLoader access in CUTLASS FMHA attention dispatch Mar 25, 2026
@titaiwangms
Contributor Author

titaiwangms commented Mar 25, 2026

@titaiwangms titaiwangms force-pushed the titaiwang/fix_cuda branch 2 times, most recently from 9a1bf88 to 2cccef2 on March 26, 2026 23:44
Contributor

@github-actions github-actions bot left a comment


You can commit the suggested changes from lintrunner.

@titaiwangms titaiwangms changed the title Fix misaligned BiasLoader access in CUTLASS FMHA attention dispatch Fix ONNX Attention CUDA: bias alignment, unfused decode concat, and MEA NaN Mar 27, 2026
@titaiwangms titaiwangms requested a review from Copilot March 27, 2026 05:48
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes correctness and stability issues in the CUDA EP implementation of the ONNX Attention op, focusing on decode-time bool masking with past KV, and NaNs in CUTLASS Memory Efficient Attention (MEA) for fully-masked batches.

Changes:

  • Aligns MEA bias-stride eligibility logic in attention.cc and adds post-MEA output zeroing for fully-masked batches.
  • Fixes unfused decode behavior for bool masks with past KV by using variable-length KV concat consistent with Flash layout semantics (and zero-inits present buffers).
  • Adds C++ and Python tests covering decode bool-mask edge cases (partial masks, divergent per-batch seqlens).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • onnxruntime/test/python/transformers/test_onnx_attention/test_mha.py: Adds CUDA-only graph-level tests for bool-mask decode with past KV (partial mask + divergent seqlens).
  • onnxruntime/test/providers/cpu/llm/attention_op_test.cc: Adds CUDA execution tests forcing the unfused path and verifying variable-length concat correctness in decode.
  • onnxruntime/core/providers/cuda/llm/attention_mask_impl.h: Declares a CUDA helper to zero outputs for fully-masked batches (seqlens_k==0).
  • onnxruntime/core/providers/cuda/llm/attention_mask_impl.cu: Implements and instantiates the ZeroOutputForFullyMaskedBatches kernel and launcher.
  • onnxruntime/core/providers/cuda/llm/attention.cc: Applies fully-masked output zeroing for the MEA and unfused nonpad paths; fixes unfused bool-mask decode concat semantics; updates the MEA eligibility alignment check.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.



@titaiwangms titaiwangms marked this pull request as ready for review March 27, 2026 17:13
@titaiwangms titaiwangms reopened this Mar 27, 2026

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.



@titaiwangms
Contributor Author

  1. UnfusedRunner decode with bool mask + past KV (attention.cc)
    When a 4D bool attention mask is provided with past key/value in decode mode, the unfused path now uses variable-length concat (LaunchConcatNewToPastKV), placing the new token at position seqlens_k[b] to match the Flash Attention present_key/value layout contract. This fixes incorrect attention where ConcatPastToPresent placed the new token at a fixed position that did not match the mask.

This does not seem right. I will need to take another look.

- Use 4*sizeof(T) convention for min_bias_align (matches contrib_ops)
- Fix unfused decode with bool mask + past KV: variable-length concat
  placing new token at seqlens_k[b] position (Flash layout contract)
- Add ZeroOutputForFullyMaskedBatches kernel for MEA path (CUTLASS
  epilogue produces NaN when s_prime=0 for fully-masked batches)
- Fix is_bsnh mismatch in LaunchConcatNewToPastKV call
- Fix BFloat16 linker error: use OrtToCudaType instead of ToCudaType for
  direct CUDA kernel launches
- Zero-init present buffers before variable-length concat
- Add C++ tests: partial mask decode, multi-batch divergent seqlens,
  all-false mask decode
- Add Python graph-level tests: partial mask decode, multi-batch
  divergent seqlens

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.



@titaiwangms titaiwangms changed the title Fix ONNX Attention CUDA: bias alignment, unfused decode concat, and MEA NaN Fix CUDA ONNX Attention: min_bias_align crash on SM<80 and MEA NaN for fully-masked batches Mar 28, 2026
@titaiwangms titaiwangms marked this pull request as ready for review March 28, 2026 18:43
