Validate per-column weight_scale/weight_zero_point shape in CPU QAttention; harden integer arithmetic in QAttention and AttentionBase #28480
Conversation
…hape

The CPU QAttention kernel treats any non-scalar weight_scale or weight_zero_point as per-column quantization but never verifies that the tensor size matches the expected 3 * hidden_size. The GEMM batch loop then indexes dequant_scales and weight_zp_data with offsets up to ~3 * hidden_size - head_size, which can read past the end of the buffer when a model supplies an undersized per-column tensor.

Add explicit validation in QAttention<T>::Compute requiring per-column weight_scale and weight_zero_point to be 1-D tensors of size 3 * hidden_size, and move construction of dequant_scales after the shape checks so malformed inputs fail fast. CUDA QAttention is unaffected (it already enforces scalar-only). Add regression tests covering wrong-size and wrong-rank per-column weight_scale and weight_zero_point inputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
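For orientation, a minimal sketch of the kind of check this adds — helper and variable names are illustrative, not the committed onnxruntime code:

```cpp
// Sketch only -- names are illustrative, not the committed code.
// Any non-scalar scale/zero-point is treated as per-column quantization, so it
// must carry one value per output column of the packed QKV weight: exactly
// 3 * hidden_size elements, or the dequant GEMM loop indexes past the buffer.
const TensorShape& scale_shape = weight_scale->Shape();
if (scale_shape.Size() > 1 &&  // per-column path
    (scale_shape.NumDimensions() != 1 ||
     scale_shape[0] != static_cast<int64_t>(3) * hidden_size)) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "weight_scale must be a 1-D tensor with 3 * hidden_size "
                         "elements when used for per-column quantization");
}
```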
…Pack

Replace static_cast<size_t> with onnxruntime::narrow<size_t> when converting int64_t weight dimensions in QAttention<T>::PrePack. narrow throws on negative values instead of silently producing a huge size_t, matching the defensive-narrowing pattern used in Compute. PrePack already bails out on inconsistent shapes via an explicit check, so this is defense-in-depth rather than a security fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
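A standalone illustration of the behavioral difference, using gsl::narrow, which onnxruntime::narrow mirrors (assuming an exception-enabled build):

```cpp
#include <cstdint>
#include <cstdio>
#include <gsl/narrow>  // gsl::narrow / gsl::narrowing_error

int main() {
  int64_t dim = -1;  // a malformed (negative) weight dimension from a model
  // static_cast wraps silently to SIZE_MAX -- a huge but "plausible" size_t.
  std::printf("%zu\n", static_cast<size_t>(dim));
  try {
    (void)gsl::narrow<size_t>(dim);  // sign changes in the conversion...
  } catch (const gsl::narrowing_error&) {
    std::puts("rejected before any allocation happens");  // ...so it throws
  }
}
```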
Add SafeInt-based bounds checks to integer multiplications in QAttention's PrePack and Compute paths, refactor the gemm allocation to share a single SafeInt-validated batch*sequence*hidden value with the K/V pointer offsets, and remove redundant static_cast<int>s in the per-iteration index math. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
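As a sketch of the pattern (the SafeInt library is vendored by onnxruntime; the function and parameter names here are hypothetical):

```cpp
#include <cstddef>
#include <cstdint>
#include "SafeInt.hpp"  // SafeInt library; exact include path in ORT may differ

// One overflow-checked element count, computed once and shared by the gemm
// allocation and the K/V pointer offsets. SafeInt throws SafeIntException on
// overflow or on a negative input instead of silently wrapping.
size_t PackedQkvElementCount(int64_t batch_size, int64_t sequence_length,
                             int64_t hidden_size) {
  return SafeInt<size_t>(batch_size) * sequence_length * hidden_size * 3;
}
```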
Replace static_cast<int> conversions of attribute values, shape dims, and parameter outputs with narrow<int>, which throws on lossy conversion. Without this, an int64_t value differing from its int representation by a multiple of 2^32 (e.g., num_heads = 2^32, rotary_embedding_dim = 2^32 + 1) silently truncates and is then propagated to downstream kernels as an attacker-controlled small/zero value, leading to division by zero or out-of-bounds indexing. Also drop the static_cast<int> on past_dims[2] and past_dims[4] in CheckInputs so the equality check compares the full int64_t shape value; previously, an attacker-supplied past tensor with past_dims[2] equal to num_heads_ + k * 2^32 would pass validation despite having the wrong physical shape. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
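A self-contained demonstration of the truncation (modular wrap-around, well-defined since C++20 and the observed behavior on mainstream compilers before that):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Attribute values that differ from an int by a multiple of 2^32:
  int64_t num_heads = int64_t{1} << 32;         // 4294967296
  int64_t rotary_dim = (int64_t{1} << 32) + 1;  // 4294967297

  // static_cast<int> keeps only the low 32 bits:
  std::cout << static_cast<int>(num_heads) << '\n';   // 0 -> division by zero
  std::cout << static_cast<int>(rotary_dim) << '\n';  // 1 -> bogus small value
  // narrow<int> would throw on both instead of propagating them downstream.
}
```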
Pull request overview
This PR hardens CPU QAttention/AttentionBase against malformed model inputs by adding missing shape validation for per-column weight_scale/weight_zero_point and by tightening integer conversions/overflow behavior to prevent OOB reads and unsafe arithmetic.
Changes:
- Add validation that per-column `weight_scale` and `weight_zero_point` are 1-D tensors of size `3 * hidden_size` in CPU `QAttention`.
- Replace truncating `static_cast<int>` conversions with `narrow<int>` and introduce additional `SafeInt`-checked arithmetic in `QAttention` and `AttentionBase`.
- Add regression tests covering invalid per-column scale/zero-point rank/size for CPU `QAttention`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/contrib_ops/cpu/quantization/attention_quant.cc | Adds per-column scale/zp shape validation and hardens integer arithmetic/offset calculations in CPU QAttention. |
| onnxruntime/contrib_ops/cpu/bert/attention_base.h | Hardens conversions/shape comparisons to avoid truncation-based validation bypasses. |
| onnxruntime/test/contrib_ops/quantize_attention_op_test.cc | Adds CPU-only regression tests for invalid per-column scale/zero-point shapes/ranks. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Without this check, head_size = hidden_size / num_heads_ silently truncates when hidden_size is not a multiple of num_heads_, leaving the high-index portion of the hidden dimension uncovered by the per-head GEMM loop and producing incorrect results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
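A minimal sketch of the guard and the failure mode it closes (error text and surrounding code are illustrative, not the committed change):

```cpp
// hidden_size = 5, num_heads_ = 2: head_size = 5 / 2 truncates to 2, so the
// per-head GEMM loop touches only num_heads_ * head_size = 4 of the 5 columns.
// Reject up front instead of silently computing on a partial hidden dimension.
if (hidden_size % num_heads_ != 0) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "hidden_size must be divisible by num_heads");
}
const int head_size = hidden_size / num_heads_;  // exact from here on
```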
Test coverage gaps:

- No test for the new `hidden_size % num_heads_` check (commit 098e4ea). A test with e.g. `hidden_size = 5`, `num_heads = 2` that expects failure with "must be divisible by num_heads" should be added.
- No test for the `hidden_size_x3 % 3 == 0` check. A test with `weights_dims[1]` not divisible by 3 (e.g., 13) would cover this.
- No test for `narrow<>` throwing on overflow (e.g., a `num_heads` attribute of `INT_MAX + 1`).

Existing happy-path tests (17 `QAttention*` tests) cover normal operation and are reported as passing.
Adds two regression tests for QAttention input validation:

- InvalidHiddenSizeNotDivisibleByNumHeads exercises the new hidden_size % num_heads_ check added in 098e4ea.
- InvalidNumHeadsOverflowsInt exercises the narrow<int>(num_heads) check in AttentionBase's constructor; gsl::narrowing_error is raised during session init.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the bias-dim multiple-of-3 and hidden_size-divisible-by-num_heads checks out of QAttention::Compute and into AttentionBase::CheckInputs so they apply to all packed-QKV callers and produce clearer error messages. Add regression tests for the three validation paths flagged in PR review: invalid bias dim, hidden_size not divisible by num_heads, and num_heads overflowing int. Factor the three tests' boilerplate into a shared RunQAttentionExpectFailure helper. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
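A hedged sketch of what one of these tests might look like; the real `RunQAttentionExpectFailure` signature in `quantize_attention_op_test.cc` may differ:

```cpp
#include <gtest/gtest.h>

// Defined in quantize_attention_op_test.cc; these parameters are hypothetical.
// Intent: build a QAttention node with the given dims, run it, and assert the
// run fails with the expected error substring.
void RunQAttentionExpectFailure(int batch_size, int sequence_length,
                                int hidden_size, int num_heads,
                                const char* expected_error);

TEST(QAttentionTest, InvalidHiddenSizeNotDivisibleByNumHeads) {
  // hidden_size = 5 cannot be split evenly across num_heads = 2, so
  // AttentionBase::CheckInputs rejects the node at Run() time.
  RunQAttentionExpectFailure(/*batch_size=*/1, /*sequence_length=*/2,
                             /*hidden_size=*/5, /*num_heads=*/2,
                             /*expected_error=*/"divisible by num_heads");
}
```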
Added some additional tests.
Description
The CPU `QAttention` kernel did not validate the shape of per-column `weight_scale` and `weight_zero_point` inputs against the expected `3 * hidden_size`. A model could supply a per-column tensor smaller than expected, causing the GEMM dequantization loop to read past the end of the buffer (offsets up to `~3 * hidden_size - head_size`).

This PR adds the missing shape validation and, while in the area, hardens integer arithmetic across `QAttention` and `AttentionBase` against malformed shape attributes / dimensions.

Changes
onnxruntime/contrib_ops/cpu/quantization/attention_quant.cc

- Validate that per-column `weight_scale` and `weight_zero_point` are 1-D with size `3 * hidden_size`; reject otherwise.
- Use `narrow<int>` / `narrow<size_t>` when converting `int64_t` shape dims, so out-of-range values throw rather than silently truncating.
- Use `SafeInt` for multiplications whose operands are not provably bounded by upstream validation (`loop_len`, `input_offset`, `qkv_offset`, the gemm allocation, and `packed_weights_data_size` in `PrePack`).
- Share a single `SafeInt`-validated `batch_size * sequence_length * hidden_size` value between the gemm allocation and the K/V pointer offsets.
- Remove redundant `static_cast<int>`s in the per-iteration index math.
- Drop the `hidden_size_x3 % 3 == 0` and `hidden_size % num_heads_ == 0` checks here; they are now enforced uniformly in `AttentionBase::CheckInputs` with clearer error messages.

onnxruntime/contrib_ops/cpu/bert/attention_base.h

- Replace `static_cast<int>` with `narrow<int>` for `num_heads_`, `rotary_embedding_`, the `parameters` struct outputs, and `GetPresent`'s `past_sequence_length`. Without this, any `int64_t` value outside the `int` range (e.g., a `num_heads` attribute of `2^31`, or a `past` sequence length of `2^31`) silently truncates to an unrelated `int` value that is then propagated to downstream kernels and used in arithmetic, enabling division by zero, sign flips, or out-of-bounds indexing.
- Drop the `static_cast<int>` from the `past_dims[2]` / `past_dims[4]` shape comparisons so the equality check uses the full `int64_t` value; previously a `past` tensor whose dim's low 32 bits happened to match `num_heads_` (or `k_hidden_size / num_heads_`) would pass validation despite having the wrong physical shape.
- In `CheckInputs`, when `require_same_hidden_size_` is true, reject `bias_dims[0]` not a multiple of 3 with a clear error (Q, K, V are packed and share a hidden size); see the sketch after this list.
- In `CheckInputs`, when `qkv_hidden_sizes` is not set, also reject `q_hidden_size % num_heads_ != 0` (mirrors the existing check on the `qkv_hidden_sizes` path).

onnxruntime/test/contrib_ops/quantize_attention_op_test.cc

- New regression tests for per-column scale/zero-point shapes:
  - `InvalidWeightScalePerColumnShape`
  - `InvalidWeightScalePerColumnRank`
  - `InvalidWeightZeroPointPerColumnShape`
  - `InvalidWeightZeroPointPerColumnRank`
- Additional validation tests (sharing a `RunQAttentionExpectFailure` helper):
  - `InvalidBiasDimNotMultipleOfThree`
  - `InvalidHiddenSizeNotDivisibleByNumHeads`
  - `InvalidNumHeadsOverflowsInt` (`num_heads = INT_MAX + 1` triggers `gsl::narrowing_error`)
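For the bias check called out above, a sketch of the rejection (illustrative, not the committed code):

```cpp
// Sketch only. Q, K and V biases are packed into one 1-D tensor, so its
// length must split evenly into three equal per-matrix hidden sizes.
if (require_same_hidden_size_ && bias_dims[0] % 3 != 0) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "Input 'bias' dimension 0 should be a multiple of 3, got ",
                         bias_dims[0]);
}
```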
Testing

All `QAttention*` / `AttentionTest*` / `MultiHeadAttention*` tests (97/97) pass locally on CPU Release build.