Validate per-column weight_scale/weight_zero_point shape in CPU QAttention; harden integer arithmetic in QAttention and AttentionBase #28480
Conversation
…hape

The CPU QAttention kernel treats any non-scalar weight_scale or weight_zero_point as per-column quantization but never verifies that the tensor size matches the expected 3 * hidden_size. The GEMM batch loop then indexes dequant_scales and weight_zp_data with offsets up to ~3 * hidden_size - head_size, which can read past the end of the buffer when a model supplies an undersized per-column tensor.

Add explicit validation in QAttention<T>::Compute requiring per-column weight_scale and weight_zero_point to be 1-D tensors of size 3 * hidden_size, and move construction of dequant_scales after the shape checks so malformed inputs fail fast. CUDA QAttention is unaffected (it already enforces scalar-only). Add regression tests covering wrong-size and wrong-rank per-column weight_scale and weight_zero_point inputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
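For orientation, a minimal sketch of the kind of check this adds — helper and variable names are illustrative, not the committed onnxruntime code:

```cpp
// Sketch only -- names are illustrative, not the committed code.
// Any non-scalar scale/zero-point is treated as per-column quantization, so it
// must carry one value per output column of the packed QKV weight: exactly
// 3 * hidden_size elements, or the dequant GEMM loop indexes past the buffer.
const TensorShape& scale_shape = weight_scale->Shape();
if (scale_shape.Size() > 1 &&  // per-column path
    (scale_shape.NumDimensions() != 1 ||
     scale_shape[0] != static_cast<int64_t>(3) * hidden_size)) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "weight_scale must be a 1-D tensor with 3 * hidden_size "
                         "elements when used for per-column quantization");
}
```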
…Pack

Replace static_cast<size_t> with onnxruntime::narrow<size_t> when converting int64_t weight dimensions in QAttention<T>::PrePack. narrow throws on negative values instead of silently producing a huge size_t, matching the defensive-narrowing pattern used in Compute. PrePack already bails out on inconsistent shapes via an explicit check, so this is defense-in-depth rather than a security fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
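A standalone illustration of the behavioral difference, using gsl::narrow, which onnxruntime::narrow mirrors (assuming an exception-enabled build):

```cpp
#include <cstdint>
#include <cstdio>
#include <gsl/narrow>  // gsl::narrow / gsl::narrowing_error

int main() {
  int64_t dim = -1;  // a malformed (negative) weight dimension from a model
  // static_cast wraps silently to SIZE_MAX -- a huge but "plausible" size_t.
  std::printf("%zu\n", static_cast<size_t>(dim));
  try {
    (void)gsl::narrow<size_t>(dim);  // sign changes in the conversion...
  } catch (const gsl::narrowing_error&) {
    std::puts("rejected before any allocation happens");  // ...so it throws
  }
}
```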
Add SafeInt-based bounds checks to integer multiplications in QAttention's PrePack and Compute paths, refactor the gemm allocation to share a single SafeInt-validated batch*sequence*hidden value with the K/V pointer offsets, and remove redundant static_cast<int>s in the per-iteration index math. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
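As a sketch of the pattern (the SafeInt library is vendored by onnxruntime; the function and parameter names here are hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include "SafeInt.hpp"  // SafeInt library; exact include path in ORT may differ

// One overflow-checked element count, computed once and shared by the gemm
// allocation and the K/V pointer offsets. SafeInt throws SafeIntException on
// overflow or on a negative input instead of silently wrapping.
size_t PackedQkvElementCount(int64_t batch_size, int64_t sequence_length,
                             int64_t hidden_size) {
  return SafeInt<size_t>(batch_size) * sequence_length * hidden_size * 3;
}
```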
Replace static_cast<int> conversions of attribute values, shape dims, and parameter outputs with narrow<int>, which throws on lossy conversion. Without this, an int64_t value differing from its int representation by a multiple of 2^32 (e.g., num_heads = 2^32, rotary_embedding_dim = 2^32 + 1) silently truncates and is then propagated to downstream kernels as an attacker-controlled small/zero value, leading to division by zero or out-of-bounds indexing. Also drop the static_cast<int> on past_dims[2] and past_dims[4] in CheckInputs so the equality check compares the full int64_t shape value; previously, an attacker-supplied past tensor with past_dims[2] equal to num_heads_ + k * 2^32 would pass validation despite having the wrong physical shape. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
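A self-contained demonstration of the truncation (modular wrap-around, well-defined since C++20 and the observed behavior on mainstream compilers before that):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Attribute values that differ from an int by a multiple of 2^32:
  int64_t num_heads = int64_t{1} << 32;         // 4294967296
  int64_t rotary_dim = (int64_t{1} << 32) + 1;  // 4294967297

  // static_cast<int> keeps only the low 32 bits:
  std::cout << static_cast<int>(num_heads) << '\n';   // 0 -> division by zero
  std::cout << static_cast<int>(rotary_dim) << '\n';  // 1 -> bogus small value
  // narrow<int> would throw on both instead of propagating them downstream.
}
```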
Pull request overview
This PR hardens CPU QAttention/AttentionBase against malformed model inputs by adding missing shape validation for per-column weight_scale/weight_zero_point and by tightening integer conversions/overflow behavior to prevent OOB reads and unsafe arithmetic.
Changes:
- Add validation that per-column `weight_scale` and `weight_zero_point` are 1-D tensors of size `3 * hidden_size` in CPU `QAttention`.
- Replace truncating `static_cast<int>` conversions with `narrow<int>` and introduce additional `SafeInt`-checked arithmetic in `QAttention` and `AttentionBase`.
- Add regression tests covering invalid per-column scale/zero-point rank/size for CPU `QAttention`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/contrib_ops/cpu/quantization/attention_quant.cc | Adds per-column scale/zp shape validation and hardens integer arithmetic/offset calculations in CPU QAttention. |
| onnxruntime/contrib_ops/cpu/bert/attention_base.h | Hardens conversions/shape comparisons to avoid truncation-based validation bypasses. |
| onnxruntime/test/contrib_ops/quantize_attention_op_test.cc | Adds CPU-only regression tests for invalid per-column scale/zero-point shapes/ranks. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Without this check, head_size = hidden_size / num_heads_ silently truncates when hidden_size is not a multiple of num_heads_, leaving the high-index portion of the hidden dimension uncovered by the per-head GEMM loop and producing incorrect results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
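A minimal sketch of the guard and the failure mode it closes (error text and surrounding code are illustrative, not the committed change):

```cpp
// hidden_size = 5, num_heads_ = 2: head_size = 5 / 2 truncates to 2, so the
// per-head GEMM loop touches only num_heads_ * head_size = 4 of the 5 columns.
// Reject up front instead of silently computing on a partial hidden dimension.
if (hidden_size % num_heads_ != 0) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "hidden_size must be divisible by num_heads");
}
const int head_size = hidden_size / num_heads_;  // exact from here on
```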
Test coverage gaps:

- No test for the new `hidden_size % num_heads_` check (commit 098e4ea). A test with e.g. `hidden_size = 5`, `num_heads = 2` that expects failure with "must be divisible by num_heads" should be added.
- No test for the `hidden_size_x3 % 3 == 0` check. A test with `weights_dims[1]` not divisible by 3 (e.g., 13) would cover this.
- No test for `narrow<>` throwing on overflow (e.g., a `num_heads` attribute of `INT_MAX + 1`).

Existing happy-path tests (17 `QAttention*` tests) cover normal operation and are reported as passing.
Adds two regression tests for QAttention input validation:

- InvalidHiddenSizeNotDivisibleByNumHeads exercises the new hidden_size % num_heads_ check added in 098e4ea.
- InvalidNumHeadsOverflowsInt exercises the narrow<int>(num_heads) check in AttentionBase's constructor; gsl::narrowing_error is raised during session init.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the bias-dim multiple-of-3 and hidden_size-divisible-by-num_heads checks out of QAttention::Compute and into AttentionBase::CheckInputs so they apply to all packed-QKV callers and produce clearer error messages. Add regression tests for the three validation paths flagged in PR review: invalid bias dim, hidden_size not divisible by num_heads, and num_heads overflowing int. Factor the three tests' boilerplate into a shared RunQAttentionExpectFailure helper. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
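A hedged sketch of what one of these tests might look like; the real `RunQAttentionExpectFailure` signature in `quantize_attention_op_test.cc` may differ:

```cpp
#include <gtest/gtest.h>

// Defined in quantize_attention_op_test.cc; these parameters are hypothetical.
// Intent: build a QAttention node with the given dims, run it, and assert the
// run fails with the expected error substring.
void RunQAttentionExpectFailure(int batch_size, int sequence_length,
                                int hidden_size, int num_heads,
                                const char* expected_error);

TEST(QAttentionTest, InvalidHiddenSizeNotDivisibleByNumHeads) {
  // hidden_size = 5 cannot be split evenly across num_heads = 2, so
  // AttentionBase::CheckInputs rejects the node at Run() time.
  RunQAttentionExpectFailure(/*batch_size=*/1, /*sequence_length=*/2,
                             /*hidden_size=*/5, /*num_heads=*/2,
                             /*expected_error=*/"divisible by num_heads");
}
```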
Added some additional tests.
Description
The CPU `QAttention` kernel did not validate the shape of per-column `weight_scale` and `weight_zero_point` inputs against the expected `3 * hidden_size`. A model could supply a per-column tensor smaller than expected, causing the GEMM dequantization loop to read past the end of the buffer (offsets up to `~3 * hidden_size - head_size`).

This PR adds the missing shape validation and, while in the area, hardens integer arithmetic across `QAttention` and `AttentionBase` against malformed shape attributes / dimensions.

Changes
onnxruntime/contrib_ops/cpu/quantization/attention_quant.cc

- Validate that per-column `weight_scale` and `weight_zero_point` are 1-D with size `3 * hidden_size`; reject otherwise.
- Use `narrow<int>` / `narrow<size_t>` when converting `int64_t` shape dims, so out-of-range values throw rather than silently truncating.
- Use `SafeInt` for multiplications whose operands are not provably bounded by upstream validation (`loop_len`, `input_offset`, `qkv_offset`, the gemm allocation, and `packed_weights_data_size` in `PrePack`).
- Share a single `SafeInt`-validated `batch_size * sequence_length * hidden_size` value between the gemm allocation and the K/V pointer offsets.
- Remove redundant `static_cast<int>`s in the per-iteration index math.
- Drop the `hidden_size_x3 % 3 == 0` and `hidden_size % num_heads_ == 0` checks here; they are now enforced uniformly in `AttentionBase::CheckInputs` with clearer error messages.

onnxruntime/contrib_ops/cpu/bert/attention_base.h

- Replace `static_cast<int>` with `narrow<int>` for `num_heads_`, `rotary_embedding_`, the `parameters` struct outputs, and `GetPresent`'s `past_sequence_length`. Without this, any `int64_t` value outside the `int` range (e.g., a `num_heads` attribute of `2^31`, or a `past` sequence length of `2^31`) silently truncates to an unrelated `int` value that is then propagated to downstream kernels and used in arithmetic, enabling division by zero, sign flips, or out-of-bounds indexing.
- Drop the `static_cast<int>` from the `past_dims[2]` / `past_dims[4]` shape comparisons so the equality check uses the full `int64_t` value; previously a `past` tensor whose dim's low 32 bits happened to match `num_heads_` (or `k_hidden_size / num_heads_`) would pass validation despite having the wrong physical shape.
- In `CheckInputs`, when `require_same_hidden_size_` is true, reject `bias_dims[0]` not a multiple of 3 with a clear error (Q, K, V are packed and share a hidden size); see the sketch after this list.
- In `CheckInputs`, when `qkv_hidden_sizes` is not set, also reject `q_hidden_size % num_heads_ != 0` (mirrors the existing check on the `qkv_hidden_sizes` path).

onnxruntime/test/contrib_ops/quantize_attention_op_test.cc

- New regression tests for per-column scale/zero-point shapes:
  - `InvalidWeightScalePerColumnShape`
  - `InvalidWeightScalePerColumnRank`
  - `InvalidWeightZeroPointPerColumnShape`
  - `InvalidWeightZeroPointPerColumnRank`
- Additional validation tests (sharing a `RunQAttentionExpectFailure` helper):
  - `InvalidBiasDimNotMultipleOfThree`
  - `InvalidHiddenSizeNotDivisibleByNumHeads`
  - `InvalidNumHeadsOverflowsInt` (`num_heads = INT_MAX + 1` triggers `gsl::narrowing_error`)
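For the bias check called out above, a sketch of the rejection (illustrative, not the committed code):

```cpp
// Sketch only. Q, K and V biases are packed into one 1-D tensor, so its
// length must split evenly into three equal per-matrix hidden sizes.
if (require_same_hidden_size_ && bias_dims[0] % 3 != 0) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "Input 'bias' dimension 0 should be a multiple of 3, got ",
                         bias_dims[0]);
}
```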
Testing

All `QAttention*` / `AttentionTest*` / `MultiHeadAttention*` tests (97/97) pass locally on CPU Release build.