Add position_ids bounds validation to WebGPU/JS RotaryEmbedding kernels #28214
titaiwangms wants to merge 9 commits into main
Conversation
Add shader-side bounds checks to the WebGPU RotaryEmbedding and FusedQKRotaryEmbedding GPU shaders to prevent out-of-bounds reads from cos_cache/sin_cache when position_ids values exceed the cache dimensions.

For RotaryEmbeddingProgram:
- Check raw_pos < 0 to catch negative position_ids (i32 from truncated int64 avoids u32 wraparound bypass)
- Check position_id >= cos_cache_shape[0] after u32 conversion and sequence offset addition
- On OOB, pass through input unchanged (matches CUDA kernel behavior)

For FusedQKRotaryEmbeddingProgram:
- Check position_id >= cos_cache_shape[0] before accessing cos/sin cache
- On OOB, pass through both Q and K inputs unchanged

This complements the CPU and CUDA fixes from PR #27597 (commit 056bab3), which missed the WebGPU execution provider.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
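The i32-vs-u32 distinction in the commit message matters because WGSL storage truncates int64 position_ids to 32 bits: checking the truncated value as u32 would let a negative int64 wrap around to a huge positive index and bypass the bounds check. A standalone TypeScript sketch of the aliasing behavior (illustration only, not ORT code):

```typescript
// Simulate how a 64-bit position_id looks after truncation to 32 bits,
// interpreted either as signed (i32) or unsigned (u32).
const asI32 = (v: bigint): bigint => BigInt.asIntN(32, v);
const asU32 = (v: bigint): bigint => BigInt.asUintN(32, v);

// A negative int64 stays negative as i32, so a `raw_pos < 0` check catches it...
console.log(asI32(-5n)); // -5n
// ...but reinterpreted as u32 it wraps to a huge index (an OOB read if used).
console.log(asU32(-5n)); // 4294967291n

// A positive int64 above INT32_MAX also aliases to a negative i32,
// so the same `raw_pos < 0` branch catches it as well.
console.log(asI32(2147483648n)); // -2147483648n
```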
Add host-side validation of position_ids values before shader dispatch
in all three WebGPU RotaryEmbedding implementations. This prevents
out-of-bounds reads from cos_cache/sin_cache when position_ids values
exceed the cache dimensions.
Changes:
1. contrib_ops/webgpu/bert/rotary_embedding.cc:
- Add InputMemoryType(OrtMemTypeCPUInput, 1) to keep position_ids
on CPU for validation
- Add bounds checking in ComputeInternal() before shader dispatch:
format 0 (scalar): base_pos in [0, max_seq_len - seq_len]
format 1 (2D array): each value in [0, max_sequence_length)
- Returns INVALID_ARGUMENT error on violation
- Shader-side bounds checks remain as defense-in-depth
2. core/providers/webgpu/llm/rotary_embedding.cc:
- Add InputMemoryType(OrtMemTypeCPUInput, 3) for optional
position_ids input
- Add bounds checking in the position_ids != nullptr branch
- Returns INVALID_ARGUMENT error on violation
3. js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts:
- Add value validation in validateInputs() using getBigInt64Array()
- Validates both format 0 (scalar offset) and format 1 (2D array)
- Throws Error with descriptive message on violation
All three implementations follow the same validation pattern as the
CPU contrib fix (PR #27597), returning errors rather than silently
passing through.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
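The host-side pattern described above (format 0 scalar offset vs. format 1 per-token array) can be sketched language-agnostically. A hypothetical TypeScript version of the two-format bounds check (names are illustrative, not the actual ORT code):

```typescript
// Hypothetical sketch of the two-format position_ids bounds check.
// Returns an error message on violation, or null when all values are in range.
function validatePositionIds(
  posIds: bigint[],
  seqLen: number,
  maxSeqLen: number, // cos_cache/sin_cache row count
): string | null {
  if (posIds.length === 1) {
    // Format 0: single base offset; the shader adds sequence_idx, so the
    // effective positions are [base, base + seqLen - 1].
    const base = posIds[0];
    if (base < 0n || base > BigInt(maxSeqLen - seqLen)) {
      return `position_ids base ${base} with sequence_length ${seqLen} exceeds cache range [0, ${maxSeqLen})`;
    }
  } else {
    // Format 1: flattened (batch_size, sequence_length) array; every value
    // must lie in [0, maxSeqLen).
    for (let i = 0; i < posIds.length; i++) {
      if (posIds[i] < 0n || posIds[i] >= BigInt(maxSeqLen)) {
        return `position_ids value ${posIds[i]} at index ${i} is out of range [0, ${maxSeqLen})`;
      }
    }
  }
  return null;
}
```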
Add WebGPU-targeted OOB unit tests to both contrib (kMSDomain) and ONNX domain test files. Tests verify that out-of-bounds, negative, and format-0 overflow position_ids values are rejected with INVALID_ARGUMENT, matching the host-side validation added to the WebGPU RotaryEmbedding kernels. Tests gracefully skip via GTEST_SKIP() when WebGPU EP is not available.

Address readability review feedback:
- FusedQK shader: clarify why no negative check (position_id derived from past_seqlen + sequence_idx, always non-negative)
- ONNX domain kernel: clarify why no format 0 check (ONNX RotaryEmbedding always uses explicit position_ids, no base-offset mode)

Address code review finding: the shared shader treats single-element position_ids as format 0 (base offset + sequence_idx), so the ONNX domain host-side validation must also check that base_pos + sequence_length - 1 < max_sequence_length. Also add a corresponding format-0 OOB WebGPU test case.

Fix Web CI precheck: run prettier on the TypeScript file to match the project's JS/TS formatting requirements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Pull request overview
This PR completes the RotaryEmbedding position_ids bounds-checking security fix for the WebGPU EP and the JS WebGPU implementation, aligning them with the earlier CPU/CUDA hardening work.
Changes:
- Add host-side position_ids bounds validation to WebGPU RotaryEmbedding kernels (ONNX + contrib/MS domain), plus shader-side pass-through on OOB as defense-in-depth.
- Add TypeScript-side position_ids bounds validation for the JS WebGPU RotaryEmbedding op.
- Add new WebGPU-focused unit tests covering negative and out-of-range position_ids cases.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/webgpu/llm/rotary_embedding.cc | Adds host-side position_ids bounds validation and forces CPU input for validation. |
| onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc | Adds host-side bounds validation, marks position_ids as CPU input, and adds shader-side OOB pass-through. |
| js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts | Adds JS-side bounds checks for position_ids using BigInt64 reads. |
| onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc | Adds ONNX-domain WebGPU EP failure tests for OOB/negative position_ids. |
| onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc | Adds contrib/MS-domain WebGPU EP failure tests for OOB/negative position_ids. |
```cpp
const auto half_rotary_embedding_dim = onnxruntime::narrow<uint32_t>(cos_cache->Shape()[1]);
const auto head_size = rotary_embedding_dim_ == 0 ? half_rotary_embedding_dim * 2 : hidden_size / num_heads_;
const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
```
This code path doesn’t explicitly handle sequence_length > max_sequence_length (cos_cache->Shape()[0]) for the non-packed case, while the CPU contrib kernel returns NOT_IMPLEMENTED when cache update would be required. As written, that scenario will surface as an INVALID_ARGUMENT from the bounds check (or could run if format-1 ids are in-range), which diverges across EPs. Consider adding a !is_packed_batching_ && sequence_length > max_sequence_length NOT_IMPLEMENTED check before position_id validation.
Suggested change:
```cpp
if (!is_packed_batching_ && static_cast<int64_t>(sequence_length) > max_sequence_length) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
                         "RotaryEmbedding does not support sequence_length ", sequence_length,
                         " exceeding cos/sin cache length ", max_sequence_length,
                         " for non-packed batching.");
}
```
```cpp
const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
const auto* pos_ids_data = position_ids->Data<int64_t>();
const auto pos_ids_size = position_ids->Shape().Size();
if (pos_ids_size == 1) {
  // Format 0: single base offset. Shader adds sequence_idx, so effective range is
  // [base_pos, base_pos + sequence_length - 1]. All must be < max_sequence_length.
  int64_t base_pos = pos_ids_data[0];
  int64_t max_valid_base = max_sequence_length - static_cast<int64_t>(sequence_length);
  if (base_pos < 0 || base_pos > max_valid_base) {
    return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                           "position_ids base value ", base_pos,
                           " with sequence_length ", sequence_length,
                           " exceeds cos/sin cache range [0, ", max_sequence_length, ")");
  }
} else {
  // Format 1: 2D array (batch_size, sequence_length). Each value must be in [0, max_sequence_length).
  for (int64_t i = 0; i < pos_ids_size; ++i) {
    int64_t pos = pos_ids_data[i];
    if (pos < 0 || pos >= max_sequence_length) {
      return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                             "position_ids value ", pos, " at index ", i,
                             " is out of range [0, ", max_sequence_length, ")");
    }
  }
}
```
For the ONNX-domain RotaryEmbedding, CPU/CUDA input checking requires position_ids to be rank-2 when provided. This WebGPU validation instead treats Size()==1 as a “base offset” format and accepts rank-1 inputs, which risks cross-EP semantic divergence. If rank-1 is not part of the ONNX op contract, consider rejecting it with an INVALID_ARGUMENT shape error to align behavior across EPs.
Suggested change:
```cpp
const auto& position_ids_shape = position_ids->Shape();
if (position_ids_shape.NumDimensions() != 2) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "position_ids must be a rank-2 tensor when provided. Got rank ",
                         position_ids_shape.NumDimensions());
}
const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
const auto* pos_ids_data = position_ids->Data<int64_t>();
const auto pos_ids_size = position_ids_shape.Size();
// position_ids is a 2D array (batch_size, sequence_length). Each value must be in
// [0, max_sequence_length).
for (int64_t i = 0; i < pos_ids_size; ++i) {
  int64_t pos = pos_ids_data[i];
  if (pos < 0 || pos >= max_sequence_length) {
    return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                           "position_ids value ", pos, " at index ", i,
                           " is out of range [0, ", max_sequence_length, ")");
  }
}
```
```cpp
// Format 0: single value. Effective positions = [7, 8] — position 8 is out of range [0, 8).
test.AddInput<int64_t>("position_ids", {1}, {7});
```
This new WebGPU test for the ONNX-domain op passes position_ids with shape {1} (base-offset semantics). CPU/CUDA kernels for the ONNX-domain RotaryEmbedding currently require rank-2 position_ids when provided, so this test codifies a WebGPU-only behavior. Consider changing the test to use a (batch_size, sequence_length) position_ids tensor (or, if base-offset is intended for the ONNX op, update the other EPs/helpers accordingly).
Suggested change:
```cpp
// Explicit position_ids for the batch/sequence. Positions = [7, 8] — position 8 is out of range [0, 8).
test.AddInput<int64_t>("position_ids", {batch_size, sequence_length}, {7, 8});
```
```typescript
// Validate position_ids values are within cos/sin cache bounds.
const positionIdsData = positionIds.getBigInt64Array();
if (positionIdsData.length === 1) {
  // Format 0: single base offset. Effective positions are [base_pos, base_pos + sequence_length - 1].
  const basePos = positionIdsData[0];
```
The new bounds validation runs before the existing sequenceLength > maxSequenceLength check. For format-0 position_ids this can now throw a “position_ids base value … exceeds … range” error when the real issue is that updating cos/sin cache isn’t supported, which differs from the native kernels’ error behavior. Consider performing the sequenceLength > maxSequenceLength (NOT_IMPLEMENTED) check before validating position_ids bounds to keep error semantics consistent.
```cpp
if (position_ids != nullptr) {
  // position_ids provided: cos/sin cache is 2D (max_pos, D/2)
  // Validate position_ids values are within cache bounds (position_ids kept on CPU via InputMemoryType).
  const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
```
When position_ids is provided, CPU/CUDA kernels reject sequence_length > max_sequence_length (cos/sin cache length) with NOT_IMPLEMENTED (cache update not supported). This WebGPU path only checks individual IDs (and the base offset case), so a format-1 tensor with all IDs in-range could still run even when sequence_length > max_sequence_length, diverging from other EPs. Consider adding the same sequence_length > max_sequence_length NOT_IMPLEMENTED check here for parity.
Suggested change:
```cpp
const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
if (static_cast<int64_t>(sequence_length) > max_sequence_length) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
                         "sequence_length ", sequence_length,
                         " exceeds cos/sin cache length ", max_sequence_length,
                         " when position_ids is provided. Cache update is not supported.");
}
```
```cpp
    .TypeConstraint("T", WebGpuSupportedFloatTypes())
    .TypeConstraint("M", DataTypeImpl::GetTensorType<int64_t>())
    .InputMemoryType(OrtMemTypeCPUInput, 3),  // position_ids on CPU for bounds validation
RotaryEmbedding);
```
KernelDef marks input 3 (position_ids) as OrtMemTypeCPUInput, but the compute program still binds position_ids as a WebGPU buffer input. WebGpuContext::Run binds inputs by casting Tensor::DataRaw() to WGPUBuffer, so a CPU tensor here will fail validation/crash. Instead, keep position_ids as a GPU tensor for the shader and do validation via an explicit GPU→CPU copy (or create a temporary GPU copy after CPU validation).
```cpp
    .TypeConstraint("M", DataTypeImpl::GetTensorType<int64_t>())
    .InputMemoryType(OrtMemTypeCPUInput, 1),  // position_ids on CPU for bounds validation
```
KernelDef marks input 1 (position_ids) as OrtMemTypeCPUInput, but the shader program binds position_ids as an input buffer. WebGpuContext::Run expects all program inputs to be WebGPU buffers and casts Tensor::DataRaw() to WGPUBuffer, so forcing CPU memory here will break execution. Consider validating via an explicit GPU→CPU copy while keeping the shader input on GPU (or copy CPU data into a temporary GPU tensor used by the program).
Suggested change:
```cpp
    .TypeConstraint("M", DataTypeImpl::GetTensorType<int64_t>()),
```
```cpp
// Validate position_ids values are within cos/sin cache bounds (position_ids kept on CPU via InputMemoryType).
const auto* pos_ids_data = position_ids->Data<int64_t>();
const auto pos_ids_size = position_ids->Shape().Size();
if (pos_ids_size == 1) {
```
Bounds validation treats any position_ids tensor with Size()!=1 as “format 1”, but it doesn’t enforce the schema constraints (either shape (1) or (batch_size, sequence_length)). If an invalid-but-broadcastable shape is provided, validation may pass while the shader uses broadcast semantics, leading to incorrect results. Suggest explicitly validating rank/dims to match the operator schema before iterating values.
Suggested change:
```cpp
const auto& position_ids_shape = position_ids->Shape();
const auto position_ids_rank = position_ids_shape.NumDimensions();
const auto pos_ids_size = position_ids_shape.Size();
const bool is_format_0 = position_ids_rank == 1 && pos_ids_size == 1;
const bool is_format_1 =
    position_ids_rank == 2 &&
    position_ids_shape[0] == static_cast<int64_t>(batch_size) &&
    position_ids_shape[1] == static_cast<int64_t>(sequence_length);
if (!is_format_0 && !is_format_1) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "position_ids must have shape (1) or (",
                         static_cast<int64_t>(batch_size), ", ",
                         static_cast<int64_t>(sequence_length), "), but got rank ",
                         position_ids_rank);
}
// Validate position_ids values are within cos/sin cache bounds (position_ids kept on CPU via InputMemoryType).
const auto* pos_ids_data = position_ids->Data<int64_t>();
if (is_format_0) {
```
tianleiwu left a comment
Review Summary
The validation logic and shader-side defense-in-depth checks are well-designed and mirror the CPU/CUDA approach correctly. The test coverage is thorough. However, both kernel registrations have a critical issue that will cause a runtime crash.
Critical: InputMemoryType(OrtMemTypeCPUInput) + AddInputs() incompatibility
Both kernel registrations add .InputMemoryType(OrtMemTypeCPUInput, ...) to keep position_ids on CPU for host-side validation. However, ComputeInternal still passes position_ids to AddInputs() for the shader. The WebGPU program execution path in webgpu_context.cc (around L464) does:

```cpp
bind_buffers.push_back(reinterpret_cast<WGPUBuffer>(const_cast<void*>(inputs[i].tensor->DataRaw())));
```

This assumes all program inputs have GPU buffer handles in DataRaw(). A CPU-resident tensor's DataRaw() returns a heap pointer, producing an invalid WGPUBuffer handle that will crash or corrupt at dispatch.
No existing WebGPU kernel combines InputMemoryType(OrtMemTypeCPUInput) with AddInputs() on the same tensor. Existing kernels (Reduce, Pad, Resize, GQA, etc.) read CPU-pinned inputs for host-side configuration only and never submit them as shader bindings.
Recommended fix
The simplest approach is to remove both InputMemoryType directives and the host-side C++ validation, relying solely on the shader-side defense-in-depth checks (which are already correct in this PR). This matches CUDA's approach: device-side bounds checking with pass-through on OOB. The TS validation in rotary-embedding.ts can remain since it runs on CPU during validateInputs() before shader dispatch.
If host-side rejection is desired, the kernel would need to explicitly copy position_ids from GPU to CPU (via DataTransferManager) for validation, then pass the original GPU tensor to AddInputs().
What's good
- Shader-side OOB checks: raw_pos < 0 catches negative i32 from truncated int64; position_id >= max_position catches OOB after u32 conversion + offset. Pass-through behavior matches CUDA.
- FusedQKRotaryEmbeddingProgram bounds check on computed position_id correctly handles both Q and K outputs, including the conditional K-head guard.
- TS validation using getBigInt64Array() avoids int64 truncation issues.
- 7 new tests cover format-0, format-1, negative, and batch OOB comprehensively.
TensorViewImpl.getBigInt64Array() threw RangeError when the WASM heap offset was not 8-byte aligned. Fix by detecting unaligned offsets and copying the bytes into an aligned buffer before creating the BigInt64Array. This fixes the Web CI failure where all RotaryEmbedding tests failed with 'start offset of BigInt64Array should be a multiple of 8'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
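The alignment fallback can be illustrated outside ORT. A minimal sketch, assuming a raw ArrayBuffer standing in for the WASM heap (illustrative helper name, not the actual init.ts code):

```typescript
// Sketch: read int64 values at an arbitrary byte offset of a heap-like buffer.
// BigInt64Array views require 8-byte-aligned offsets; for unaligned offsets we
// copy the bytes into a fresh (always aligned) buffer. The copy is read-only
// with respect to the source: writes to it do not propagate back to the heap.
function getBigInt64ArraySafe(heap: ArrayBuffer, byteOffset: number, length: number): BigInt64Array {
  if (byteOffset % 8 === 0) {
    return new BigInt64Array(heap, byteOffset, length); // zero-copy view
  }
  const bytes = new Uint8Array(heap, byteOffset, length * 8);
  return new BigInt64Array(bytes.slice().buffer); // aligned copy
}

// An int64 written at unaligned offset 4 is still readable through the helper;
// constructing `new BigInt64Array(heap, 4, 1)` directly would throw RangeError.
const heap = new ArrayBuffer(16);
new DataView(heap).setBigInt64(4, 42n, true);
console.log(getBigInt64ArraySafe(heap, 4, 1)[0]); // 42n
```

Note the endianness caveat: typed arrays use the platform byte order (little-endian on effectively all JS engines), which is why the DataView write above uses littleEndian = true.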
Force-pushed d1cc33c to 5920aa7
Remove InputMemoryType(OrtMemTypeCPUInput) from both WebGPU kernel registrations (contrib and ONNX domain) and the associated host-side position_ids value scanning. InputMemoryType is incompatible with AddInputs() — a CPU tensor's DataRaw() would be cast to WGPUBuffer, causing a crash at dispatch time.

Defense strategy is now:
- Shader-side: WGSL bounds checks pass through input unchanged on OOB (same as CUDA kernel behavior)
- JSEP/browser: TypeScript validation in rotary-embedding.ts catches OOB before shader dispatch
- init.ts: getBigInt64Array() handles unaligned WASM heap offsets

WebGPU OOB tests changed from kExpectFailure to kExpectSuccess, verifying pass-through behavior (output equals input on OOB). ONNX domain tests updated to use rank-2 position_ids for cross-EP consistency. TS validation reordered per Copilot review: sequence_length check before per-value bounds validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
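The pass-through contract the updated tests assert (output equals input when a position is OOB) can be expressed as a tiny reference model. A hypothetical TypeScript helper using interleaved rotation, not the actual shader:

```typescript
// Reference model of the OOB contract: rotate one token's vector when posId
// indexes a valid cache row; otherwise return the input unchanged, mirroring
// the WGSL/CUDA pass-through behavior.
function applyRotary(x: number[], posId: number, cos: number[][], sin: number[][]): number[] {
  if (posId < 0 || posId >= cos.length) {
    return x.slice(); // OOB → pass-through
  }
  const out = new Array<number>(x.length);
  for (let j = 0; j * 2 < x.length; j++) {
    const re = x[2 * j];
    const im = x[2 * j + 1];
    out[2 * j] = re * cos[posId][j] - im * sin[posId][j];
    out[2 * j + 1] = re * sin[posId][j] + im * cos[posId][j];
  }
  return out;
}
```

A test written against this contract checks kExpectSuccess with output == input for OOB and negative posId, rather than expecting an error status.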
tianleiwu left a comment
Round-2 review (head 9744bfb).
Summary
The high-severity concern from the prior round — InputMemoryType(OrtMemTypeCPUInput) + AddInputs(position_ids) would crash because the same tensor cannot be both CPU-resident and bound as a WebGPU storage buffer — has been resolved. The author removed the host-side C++ value scanning and InputMemoryType directives and now relies on shader-side defense-in-depth (OOB → pass-through input) for both contrib and ONNX-domain WebGPU kernels, plus host-side validation on the JS path via getBigInt64Array() in validateInputs. Tests were converted to kExpectSuccess with output == input. The pass-through semantics now match the CUDA kernel for cross-EP consistency.
Resolving my two prior threads:
- contrib_ops/webgpu/bert/rotary_embedding.cc (shader as primary protection) — addressed.
- test/contrib_ops/rotary_embedding_op_test.cc (tests must verify pass-through, not failure) — addressed.
Remaining items (all nits, non-blocking)
1. Shader comment understates what raw_pos < 0 catches — onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc
The WebGPU shader storage representation truncates int64 to i32 (see ProgramVariableDataType::Int64 → value type i32 in core/providers/webgpu/shader_variable.cc). Any positive int64 value greater than INT32_MAX will alias to a negative i32 and be caught by the raw_pos < 0 branch. That is safe (still pass-through), but the inline comment only mentions "negative position_ids". Consider expanding the comment to note that very large positive int64 values also fall into this branch by virtue of i32 truncation, so a future maintainer doesn't think the check is incomplete.
2. JSEP getBigInt64Array alignment fallback returns a read-only copy — js/web/lib/wasm/jsep/init.ts and js/web/lib/wasm/jsep/tensor-view.ts
The inline comment in init.ts correctly notes that the unaligned path returns a read-only copy (mutations don't propagate to the WASM heap). All current call sites (pad.ts, split.ts, reduce.ts, slice.ts, tile.ts, expand.ts, cumsum.ts, resize.ts, gather-block-quantized.ts, the new rotary-embedding.ts) are read-only iterations, so this is safe today. Consider promoting that read-only caveat into the TensorView.getBigInt64Array() JSDoc in tensor-view.ts so future consumers don't accidentally write through the returned view expecting WASM-heap propagation.
3. Format-0 detection in TS is contrib-op-specific — js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts
positionIdsElementCount === 1 triggers the format-0 base-offset branch. The ONNX-domain RotaryEmbedding schema requires rank-2 position_ids (no base-offset mode). If this validateInputs is shared with the ONNX-domain op (now or later), the format-0 branch should be guarded by op identity. A short comment confirming the JSEP path scope would prevent silent acceptance of malformed shapes against the ONNX-domain op.
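The op-identity guard suggested in this nit can be sketched as a small predicate (hypothetical helper and domain check, not the actual rotary-embedding.ts code):

```typescript
// Hypothetical guard: only the contrib (com.microsoft) RotaryEmbedding accepts
// the single-element "base offset" format 0; the ONNX-domain op requires a
// rank-2 position_ids tensor, so format 0 must not apply to it.
function isFormat0Allowed(opDomain: string, positionIdsDims: readonly number[]): boolean {
  const elementCount = positionIdsDims.reduce((a, b) => a * b, 1);
  return opDomain === 'com.microsoft' && elementCount === 1;
}
```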
4. Pre-existing related gap (informational) — separately noted by the Copilot reviewer at L159 of contrib_ops/webgpu/bert/rotary_embedding.cc: when position_ids is provided, CPU/CUDA reject sequence_length > max_sequence_length with NOT_IMPLEMENTED, while the WebGPU contrib path (after this PR) silently produces pass-through for the OOB tokens. This was the case before this PR too. Worth tracking as a follow-up if cross-EP behavior parity is desired here, but it does not need to block this security fix.
Verdict
Looks good. The shader bounds checks are correctly designed (both for the standalone RotaryEmbeddingProgram and the FusedQKRotaryEmbeddingProgram Q+K pass-through path), tests are appropriate, and the JS validation is functionally correct.
This PR adds position_ids bounds checking to WebGPU and JS RotaryEmbedding implementations, completing the security fix started in PR #27597 (commit 056bab3) which covered CPU and CUDA.
Problem
The com.microsoft::RotaryEmbedding kernel uses position_ids as row indices into cos_cache/sin_cache without bounds validation. While PR #27597 fixed the CPU and CUDA paths, the WebGPU and JS implementations were still missing bounds checks, which could produce silently wrong results (WebGPU hardware clamps OOB reads).
Changes
Security
Addresses the same vulnerability as #27597 (OOB read via position_ids, CVSS 7.5-9.1) for WebGPU/JS execution providers.
Testing