
Validate per-column weight_scale/weight_zero_point shape in CPU QAttention; harden integer arithmetic in QAttention and AttentionBase #28480

Merged

edgchen1 merged 9 commits into main from edgchen1/fix-qattention-cpu-scale-shape-validation on May 14, 2026

Conversation

@edgchen1 (Contributor) commented May 12, 2026

Description

The CPU QAttention kernel did not validate the shape of per-column weight_scale and weight_zero_point inputs against the expected 3 * hidden_size. A model could supply a per-column
tensor smaller than expected, causing the GEMM dequantization loop to read past the end of the buffer (offsets up to ~3 * hidden_size - head_size).

This PR adds the missing shape validation and, while in the area, hardens integer arithmetic across QAttention and AttentionBase against malformed shape attributes / dimensions.

Changes

onnxruntime/contrib_ops/cpu/quantization/attention_quant.cc

  • Validate per-column weight_scale and weight_zero_point are 1-D with size 3 * hidden_size; reject otherwise (see the sketch after this list).
  • Use narrow<int> / narrow<size_t> when converting int64_t shape dims, so out-of-range values throw rather than silently truncating.
  • Use SafeInt for multiplications whose operands are not provably bounded by upstream validation (loop_len, input_offset, qkv_offset, the gemm allocation, and
    packed_weights_data_size in PrePack).
  • Refactor the gemm allocation and Q/K/V pointer arithmetic to share a single SafeInt-validated batch_size * sequence_length * hidden_size value.
  • Drop a few redundant static_cast<int>s in the per-iteration index math.
  • Remove the hidden_size_x3 % 3 == 0 and hidden_size % num_heads_ == 0 checks here; they are now enforced uniformly in AttentionBase::CheckInputs with clearer error messages.
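A minimal standalone sketch of the kind of per-column check described in the first bullet; the Shape type, helper name, and error handling here are illustrative stand-ins, not the actual ORT kernel code:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative stand-in for the tensor shape type the kernel inspects.
struct Shape {
  std::vector<int64_t> dims;
  size_t NumDimensions() const { return dims.size(); }
  int64_t Size() const {
    int64_t n = 1;
    for (int64_t d : dims) n *= d;
    return n;
  }
};

// Per-column quantization parameters must carry one value per output column of
// the packed Q/K/V weight, i.e. exactly 3 * hidden_size values in a 1-D tensor.
void ValidatePerColumnParam(const Shape& shape, int64_t hidden_size, const std::string& name) {
  const int64_t expected = 3 * hidden_size;
  if (shape.NumDimensions() != 1 || shape.Size() != expected) {
    throw std::invalid_argument(name + " must be 1-D with " + std::to_string(expected) + " elements");
  }
}
```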

onnxruntime/contrib_ops/cpu/bert/attention_base.h

  • Replace static_cast<int> with narrow<int> for num_heads_, rotary_embedding_, the parameters struct outputs, and GetPresent's past_sequence_length (see the narrowing sketch after this list). Without this, any
    int64_t value outside the int range (e.g., a num_heads attribute of 2^31, or a past sequence length of 2^31) silently truncates to an unrelated int value that is then
    propagated to downstream kernels and used in arithmetic, enabling division by zero, sign flips, or out-of-bounds indexing.
  • Drop the static_cast<int> from the past_dims[2] / past_dims[4] shape comparisons so the equality check uses the full int64_t value; previously a past tensor whose dim's low 32
    bits happened to match num_heads_ (or k_hidden_size / num_heads_) would pass validation despite having the wrong physical shape.
  • In CheckInputs, when require_same_hidden_size_ is true, reject bias_dims[0] not a multiple of 3 with a clear error (Q, K, V are packed and share a hidden size).
  • In CheckInputs, when qkv_hidden_sizes is not set, also reject q_hidden_size % num_heads_ != 0 (mirrors the existing check on the qkv_hidden_sizes path).
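A standalone illustration of why checked narrowing matters here; checked_narrow below mirrors the spirit of gsl::narrow / onnxruntime::narrow but is a hand-rolled stand-in, not the library implementation:

```cpp
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Checked narrowing: throws if the round-trip changes the value.
template <typename To, typename From>
To checked_narrow(From v) {
  const To r = static_cast<To>(v);
  if (static_cast<From>(r) != v) {
    throw std::runtime_error("narrowing conversion changed the value");
  }
  return r;
}

int main() {
  const int64_t num_heads = int64_t{1} << 32;         // 2^32, outside the int range
  std::cout << static_cast<int>(num_heads) << "\n";   // silently truncates (typically to 0)
  try {
    checked_narrow<int>(num_heads);                   // throws instead of truncating
  } catch (const std::exception& e) {
    std::cout << "rejected: " << e.what() << "\n";
  }
  return 0;
}
```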

onnxruntime/test/contrib_ops/quantize_attention_op_test.cc

  • 4 regression tests for the per-column shape validation (a hedged sketch of one such test follows this list):
    • InvalidWeightScalePerColumnShape
    • InvalidWeightScalePerColumnRank
    • InvalidWeightZeroPointPerColumnShape
    • InvalidWeightZeroPointPerColumnRank
  • 3 regression tests for the divisibility / narrowing checks (sharing a RunQAttentionExpectFailure helper):
    • InvalidBiasDimNotMultipleOfThree
    • InvalidHiddenSizeNotDivisibleByNumHeads
    • InvalidNumHeadsOverflowsInt (num_heads = INT_MAX + 1 triggers gsl::narrowing_error)
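For illustration only, a sketch of what one of the per-column shape tests could look like with ORT's OpTester; the shapes, dummy values, header paths, and expected-failure handling are assumptions, and the PR's actual tests (and its RunQAttentionExpectFailure helper) are not shown here:

```cpp
#include "gtest/gtest.h"
#include "test/providers/provider_test_utils.h"

namespace onnxruntime {
namespace test {

// Hypothetical sketch: per-column weight_scale with the wrong number of elements
// should be rejected rather than read out of bounds.
TEST(QAttentionTest, InvalidWeightScalePerColumnShapeSketch) {
  OpTester test("QAttention", 1, kMSDomain);
  test.AddAttribute<int64_t>("num_heads", 2);

  // batch=1, sequence=2, hidden=4, so a per-column weight_scale needs 3 * 4 = 12 values.
  test.AddInput<float>("input", {1, 2, 4}, std::vector<float>(8, 0.1f));
  test.AddInput<int8_t>("weight", {4, 12}, std::vector<int8_t>(48, 1));
  test.AddInput<float>("bias", {12}, std::vector<float>(12, 0.0f));
  test.AddInput<float>("input_scale", {1}, {0.05f});
  // Wrong size: 5 values instead of 12.
  test.AddInput<float>("weight_scale", {5}, std::vector<float>(5, 0.05f));

  // Output values are never compared for an expected-failure run.
  test.AddOutput<float>("output", {1, 2, 4}, std::vector<float>(8, 0.0f));

  // Empty expected-failure string: only require that the run fails.
  test.Run(OpTester::ExpectResult::kExpectFailure, "");
}

}  // namespace test
}  // namespace onnxruntime
```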

Testing

All QAttention* / AttentionTest* / MultiHeadAttention* tests (97/97) pass locally on a CPU Release build.

edgchen1 and others added 5 commits May 11, 2026 19:04
…hape

The CPU QAttention kernel treats any non-scalar weight_scale or
weight_zero_point as per-column quantization but never verified the
tensor size matched the expected 3 * hidden_size. The GEMM batch loop
then indexes dequant_scales and weight_zp_data with offsets up to
~3 * hidden_size - head_size, which can read past the end of the buffer
when a model supplies an undersized per-column tensor.

Add explicit validation in QAttention<T>::Compute requiring per-column
weight_scale and weight_zero_point to be 1-D tensors of size
3 * hidden_size, and move construction of dequant_scales after the
shape checks so malformed inputs fail fast. CUDA QAttention is
unaffected (it already enforces scalar-only).

Adds regression tests covering wrong-size and wrong-rank per-column
weight_scale and weight_zero_point inputs.

Additionally, harden the surrounding shape handling: use narrow<int>
when narrowing int64_t shape dimensions to int (so out-of-range values
throw instead of silently truncating), require the weights second
dimension to be a positive multiple of 3, and use SafeInt for the
loop_len and input_offset multiplications that could otherwise
overflow int.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Pack

Replace static_cast<size_t> with onnxruntime::narrow<size_t> when converting int64_t weight dimensions in QAttention<T>::PrePack. narrow throws on negative values instead of silently producing a huge size_t, matching the defensive-narrowing pattern used in Compute. PrePack already bails out on inconsistent shapes via an explicit check, so this is defense-in-depth rather than a security fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add SafeInt-based bounds checks to integer multiplications in QAttention's
PrePack and Compute paths, refactor the gemm allocation to share a single
SafeInt-validated batch*sequence*hidden value with the K/V pointer offsets,
and remove redundant static_cast<int>s in the per-iteration index math.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
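
The overflow behavior that SafeInt guards against in that allocation-size math can be illustrated standalone; checked_mul below is a hand-rolled equivalent in effect, not the SafeInt library, and the sizes are arbitrary:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Multiply two sizes, throwing instead of wrapping on overflow. This is the
// effect SafeInt<size_t> provides for the gemm allocation and Q/K/V offsets.
size_t checked_mul(size_t a, size_t b) {
  if (b != 0 && a > std::numeric_limits<size_t>::max() / b) {
    throw std::overflow_error("size computation overflows size_t");
  }
  return a * b;
}

int main() {
  // A single validated batch_size * sequence_length * hidden_size value,
  // reused for the gemm buffer size and the Q/K/V pointer offsets.
  const size_t batch_size = 2, sequence_length = 128, hidden_size = 768;
  const size_t qkv_unit = checked_mul(checked_mul(batch_size, sequence_length), hidden_size);
  const size_t gemm_size = checked_mul(qkv_unit, 3);  // packed Q, K, V
  (void)gemm_size;
  return 0;
}
```
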
Replace static_cast<int> conversions of attribute values, shape dims, and
parameter outputs with narrow<int>, which throws on lossy conversion.
Without this, an int64_t value differing from its int representation by a
multiple of 2^32 (e.g., num_heads = 2^32, rotary_embedding_dim = 2^32 + 1)
silently truncates and is then propagated to downstream kernels as an
attacker-controlled small/zero value, leading to division by zero or
out-of-bounds indexing.

Also drop the static_cast<int> on past_dims[2] and past_dims[4] in
CheckInputs so the equality check compares the full int64_t shape value;
previously, an attacker-supplied past tensor with past_dims[2] equal to
num_heads_ + k * 2^32 would pass validation despite having the wrong
physical shape.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI left a comment:

Pull request overview

This PR hardens CPU QAttention/AttentionBase against malformed model inputs by adding missing shape validation for per-column weight_scale/weight_zero_point and by tightening integer conversions/overflow behavior to prevent OOB reads and unsafe arithmetic.

Changes:

  • Add validation that per-column weight_scale and weight_zero_point are 1-D tensors of size 3 * hidden_size in CPU QAttention.
  • Replace truncating static_cast<int> conversions with narrow<int> and introduce additional SafeInt-checked arithmetic in QAttention and AttentionBase.
  • Add regression tests covering invalid per-column scale/zero-point rank/size for CPU QAttention.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Reviewed files:

  • onnxruntime/contrib_ops/cpu/quantization/attention_quant.cc: Adds per-column scale/zp shape validation and hardens integer arithmetic/offset calculations in CPU QAttention.
  • onnxruntime/contrib_ops/cpu/bert/attention_base.h: Hardens conversions/shape comparisons to avoid truncation-based validation bypasses.
  • onnxruntime/test/contrib_ops/quantize_attention_op_test.cc: Adds CPU-only regression tests for invalid per-column scale/zero-point shapes/ranks.


@edgchen1 edgchen1 requested a review from yuslepukhin May 12, 2026 23:47
Without this check, head_size = hidden_size / num_heads_ silently truncates
when hidden_size is not a multiple of num_heads_, leaving the high-index
portion of the hidden dimension uncovered by the per-head GEMM loop and
producing incorrect results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
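
A quick numeric illustration of that truncation (the values and variable names are just for exposition):

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const int64_t hidden_size = 5;
  const int64_t num_heads = 2;
  const int64_t head_size = hidden_size / num_heads;  // integer division truncates to 2
  // The per-head loops then cover num_heads * head_size = 4 columns,
  // leaving the 5th column of the hidden dimension unprocessed.
  assert(num_heads * head_size < hidden_size);
  return 0;
}
```
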
@edgchen1 edgchen1 requested a review from yuslepukhin May 13, 2026 19:00
@yuslepukhin (Member) left a comment:

Test coverage gaps:

No test for the new hidden_size % num_heads_ check (commit 098e4ea). A test with e.g. hidden_size=5, num_heads=2 that expects failure with "must be divisible by num_heads" should be added.

No test for the hidden_size_x3 % 3 == 0 check. A test with weights_dims[1] not divisible by 3 (e.g., 13) would cover this.

No test for narrow<> throwing on overflow (e.g., num_heads attribute = INT_MAX + 1).

Existing happy-path tests (17 QAttention* tests) cover normal operation and are reported as passing.

edgchen1 and others added 2 commits May 13, 2026 14:54
Adds two regression tests for QAttention input validation:

- InvalidHiddenSizeNotDivisibleByNumHeads exercises the new hidden_size % num_heads_ check added in 098e4ea.

- InvalidNumHeadsOverflowsInt exercises the narrow<int>(num_heads) check in AttentionBase's constructor; gsl::narrowing_error is raised during session init.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the bias-dim multiple-of-3 and hidden_size-divisible-by-num_heads checks out of QAttention::Compute and into AttentionBase::CheckInputs so they apply to all packed-QKV callers and produce clearer error messages.

Add regression tests for the three validation paths flagged in PR review: invalid bias dim, hidden_size not divisible by num_heads, and num_heads overflowing int. Factor the three tests' boilerplate into a shared RunQAttentionExpectFailure helper.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@edgchen1 (Contributor, Author) commented, replying to the review above:

Added some additional tests.

@edgchen1 edgchen1 requested a review from yuslepukhin May 14, 2026 18:16
@yuslepukhin (Member) left a comment:

LGTM

@edgchen1 edgchen1 merged commit 18ae6da into main May 14, 2026
93 of 94 checks passed
@edgchen1 edgchen1 deleted the edgchen1/fix-qattention-cpu-scale-shape-validation branch May 14, 2026 22:48