Add position_ids bounds validation to WebGPU/JS RotaryEmbedding kernels #28214
titaiwangms wants to merge 9 commits into main
Conversation
Add shader-side bounds checks to the WebGPU RotaryEmbedding and FusedQKRotaryEmbedding GPU shaders to prevent out-of-bounds reads from cos_cache/sin_cache when position_ids values exceed the cache dimensions.

For RotaryEmbeddingProgram:
- Check raw_pos < 0 to catch negative position_ids (i32 from truncated int64 avoids u32 wraparound bypass)
- Check position_id >= cos_cache_shape[0] after u32 conversion and sequence offset addition
- On OOB, pass through input unchanged (matches CUDA kernel behavior)

For FusedQKRotaryEmbeddingProgram:
- Check position_id >= cos_cache_shape[0] before accessing cos/sin cache
- On OOB, pass through both Q and K inputs unchanged

This complements the CPU and CUDA fixes from PR #27597 (commit 056bab3), which missed the WebGPU execution provider.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
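The i32-vs-u32 distinction in the commit message matters because WGSL storage truncates int64 position_ids to 32 bits: checking the truncated value as u32 would let a negative int64 wrap around to a huge positive index and bypass the bounds check. A standalone TypeScript sketch of the aliasing behavior (illustration only, not ORT code):

```typescript
// Simulate how a 64-bit position_id looks after truncation to 32 bits,
// interpreted either as signed (i32) or unsigned (u32).
const asI32 = (v: bigint): bigint => BigInt.asIntN(32, v);
const asU32 = (v: bigint): bigint => BigInt.asUintN(32, v);

// A negative int64 stays negative as i32, so a `raw_pos < 0` check catches it...
console.log(asI32(-5n)); // -5n
// ...but reinterpreted as u32 it wraps to a huge index (an OOB read if used).
console.log(asU32(-5n)); // 4294967291n

// A positive int64 above INT32_MAX also aliases to a negative i32,
// so the same `raw_pos < 0` branch catches it as well.
console.log(asI32(2147483648n)); // -2147483648n
```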
Add host-side validation of position_ids values before shader dispatch
in all three WebGPU RotaryEmbedding implementations. This prevents
out-of-bounds reads from cos_cache/sin_cache when position_ids values
exceed the cache dimensions.
Changes:
1. contrib_ops/webgpu/bert/rotary_embedding.cc:
- Add InputMemoryType(OrtMemTypeCPUInput, 1) to keep position_ids
on CPU for validation
- Add bounds checking in ComputeInternal() before shader dispatch:
format 0 (scalar): base_pos in [0, max_seq_len - seq_len]
format 1 (2D array): each value in [0, max_sequence_length)
- Returns INVALID_ARGUMENT error on violation
- Shader-side bounds checks remain as defense-in-depth
2. core/providers/webgpu/llm/rotary_embedding.cc:
- Add InputMemoryType(OrtMemTypeCPUInput, 3) for optional
position_ids input
- Add bounds checking in the position_ids != nullptr branch
- Returns INVALID_ARGUMENT error on violation
3. js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts:
- Add value validation in validateInputs() using getBigInt64Array()
- Validates both format 0 (scalar offset) and format 1 (2D array)
- Throws Error with descriptive message on violation
All three implementations follow the same validation pattern as the
CPU contrib fix (PR #27597), returning errors rather than silently
passing through.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
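The host-side pattern described above (format 0 scalar offset vs. format 1 per-token array) can be sketched language-agnostically. A hypothetical TypeScript version of the two-format bounds check (names are illustrative, not the actual ORT code):

```typescript
// Hypothetical sketch of the two-format position_ids bounds check.
// Returns an error message on violation, or null when all values are in range.
function validatePositionIds(
  posIds: bigint[],
  seqLen: number,
  maxSeqLen: number, // cos_cache/sin_cache row count
): string | null {
  if (posIds.length === 1) {
    // Format 0: single base offset; the shader adds sequence_idx, so the
    // effective positions are [base, base + seqLen - 1].
    const base = posIds[0];
    if (base < 0n || base > BigInt(maxSeqLen - seqLen)) {
      return `position_ids base ${base} with sequence_length ${seqLen} exceeds cache range [0, ${maxSeqLen})`;
    }
  } else {
    // Format 1: flattened (batch_size, sequence_length) array; every value
    // must lie in [0, maxSeqLen).
    for (let i = 0; i < posIds.length; i++) {
      if (posIds[i] < 0n || posIds[i] >= BigInt(maxSeqLen)) {
        return `position_ids value ${posIds[i]} at index ${i} is out of range [0, ${maxSeqLen})`;
      }
    }
  }
  return null;
}
```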
Add WebGPU-targeted OOB unit tests to both contrib (kMSDomain) and ONNX domain test files. Tests verify that out-of-bounds, negative, and format-0 overflow position_ids values are rejected with INVALID_ARGUMENT, matching the host-side validation added to the WebGPU RotaryEmbedding kernels. Tests gracefully skip via GTEST_SKIP() when WebGPU EP is not available.

Address readability review feedback:
- FusedQK shader: clarify why no negative check (position_id derived from past_seqlen + sequence_idx, always non-negative)
- ONNX domain kernel: clarify why no format 0 check (ONNX RotaryEmbedding always uses explicit position_ids, no base-offset mode)

Address code review finding: the shared shader treats single-element position_ids as format 0 (base offset + sequence_idx), so the ONNX domain host-side validation must also check that base_pos + sequence_length - 1 < max_sequence_length. Also add a corresponding format-0 OOB WebGPU test case.

Fix Web CI precheck: run prettier on the TypeScript file to match the project's JS/TS formatting requirements.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Pull request overview
This PR completes the RotaryEmbedding position_ids bounds-checking security fix for the WebGPU EP and the JS WebGPU implementation, aligning them with the earlier CPU/CUDA hardening work.
Changes:
- Add host-side position_ids bounds validation to WebGPU RotaryEmbedding kernels (ONNX + contrib/MS domain), plus shader-side pass-through on OOB as defense-in-depth.
- Add TypeScript-side position_ids bounds validation for the JS WebGPU RotaryEmbedding op.
- Add new WebGPU-focused unit tests covering negative and out-of-range position_ids cases.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/webgpu/llm/rotary_embedding.cc | Adds host-side position_ids bounds validation and forces CPU input for validation. |
| onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc | Adds host-side bounds validation, marks position_ids as CPU input, and adds shader-side OOB pass-through. |
| js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts | Adds JS-side bounds checks for position_ids using BigInt64 reads. |
| onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc | Adds ONNX-domain WebGPU EP failure tests for OOB/negative position_ids. |
| onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc | Adds contrib/MS-domain WebGPU EP failure tests for OOB/negative position_ids. |
```cpp
const auto half_rotary_embedding_dim = onnxruntime::narrow<uint32_t>(cos_cache->Shape()[1]);
const auto head_size = rotary_embedding_dim_ == 0 ? half_rotary_embedding_dim * 2 : hidden_size / num_heads_;
const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
```
This code path doesn’t explicitly handle sequence_length > max_sequence_length (cos_cache->Shape()[0]) for the non-packed case, while the CPU contrib kernel returns NOT_IMPLEMENTED when cache update would be required. As written, that scenario will surface as an INVALID_ARGUMENT from the bounds check (or could run if format-1 ids are in-range), which diverges across EPs. Consider adding a !is_packed_batching_ && sequence_length > max_sequence_length NOT_IMPLEMENTED check before position_id validation.
Suggested change:
```cpp
if (!is_packed_batching_ && static_cast<int64_t>(sequence_length) > max_sequence_length) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
                         "RotaryEmbedding does not support sequence_length ", sequence_length,
                         " exceeding cos/sin cache length ", max_sequence_length,
                         " for non-packed batching.");
}
```
```cpp
const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
const auto* pos_ids_data = position_ids->Data<int64_t>();
const auto pos_ids_size = position_ids->Shape().Size();
if (pos_ids_size == 1) {
  // Format 0: single base offset. Shader adds sequence_idx, so effective range is
  // [base_pos, base_pos + sequence_length - 1]. All must be < max_sequence_length.
  int64_t base_pos = pos_ids_data[0];
  int64_t max_valid_base = max_sequence_length - static_cast<int64_t>(sequence_length);
  if (base_pos < 0 || base_pos > max_valid_base) {
    return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                           "position_ids base value ", base_pos,
                           " with sequence_length ", sequence_length,
                           " exceeds cos/sin cache range [0, ", max_sequence_length, ")");
  }
} else {
  // Format 1: 2D array (batch_size, sequence_length). Each value must be in [0, max_sequence_length).
  for (int64_t i = 0; i < pos_ids_size; ++i) {
    int64_t pos = pos_ids_data[i];
    if (pos < 0 || pos >= max_sequence_length) {
      return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                             "position_ids value ", pos, " at index ", i,
                             " is out of range [0, ", max_sequence_length, ")");
    }
  }
}
```
For the ONNX-domain RotaryEmbedding, CPU/CUDA input checking requires position_ids to be rank-2 when provided. This WebGPU validation instead treats Size()==1 as a “base offset” format and accepts rank-1 inputs, which risks cross-EP semantic divergence. If rank-1 is not part of the ONNX op contract, consider rejecting it with an INVALID_ARGUMENT shape error to align behavior across EPs.
Suggested change:
```cpp
const auto& position_ids_shape = position_ids->Shape();
if (position_ids_shape.NumDimensions() != 2) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "position_ids must be a rank-2 tensor when provided. Got rank ",
                         position_ids_shape.NumDimensions());
}
const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
const auto* pos_ids_data = position_ids->Data<int64_t>();
const auto pos_ids_size = position_ids_shape.Size();
// position_ids is a 2D array (batch_size, sequence_length). Each value must be in
// [0, max_sequence_length).
for (int64_t i = 0; i < pos_ids_size; ++i) {
  int64_t pos = pos_ids_data[i];
  if (pos < 0 || pos >= max_sequence_length) {
    return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                           "position_ids value ", pos, " at index ", i,
                           " is out of range [0, ", max_sequence_length, ")");
  }
}
```
```cpp
// Format 0: single value. Effective positions = [7, 8] — position 8 is out of range [0, 8).
test.AddInput<int64_t>("position_ids", {1}, {7});
```
This new WebGPU test for the ONNX-domain op passes position_ids with shape {1} (base-offset semantics). CPU/CUDA kernels for the ONNX-domain RotaryEmbedding currently require rank-2 position_ids when provided, so this test codifies a WebGPU-only behavior. Consider changing the test to use a (batch_size, sequence_length) position_ids tensor (or, if base-offset is intended for the ONNX op, update the other EPs/helpers accordingly).
Suggested change:
```cpp
// Explicit position_ids for the batch/sequence. Positions = [7, 8] — position 8 is out of range [0, 8).
test.AddInput<int64_t>("position_ids", {batch_size, sequence_length}, {7, 8});
```
```typescript
// Validate position_ids values are within cos/sin cache bounds.
const positionIdsData = positionIds.getBigInt64Array();
if (positionIdsData.length === 1) {
  // Format 0: single base offset. Effective positions are [base_pos, base_pos + sequence_length - 1].
  const basePos = positionIdsData[0];
```
The new bounds validation runs before the existing sequenceLength > maxSequenceLength check. For format-0 position_ids this can now throw a “position_ids base value … exceeds … range” error when the real issue is that updating cos/sin cache isn’t supported, which differs from the native kernels’ error behavior. Consider performing the sequenceLength > maxSequenceLength (NOT_IMPLEMENTED) check before validating position_ids bounds to keep error semantics consistent.
```cpp
if (position_ids != nullptr) {
  // position_ids provided: cos/sin cache is 2D (max_pos, D/2)
  // Validate position_ids values are within cache bounds (position_ids kept on CPU via InputMemoryType).
  const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
```
When position_ids is provided, CPU/CUDA kernels reject sequence_length > max_sequence_length (cos/sin cache length) with NOT_IMPLEMENTED (cache update not supported). This WebGPU path only checks individual IDs (and the base offset case), so a format-1 tensor with all IDs in-range could still run even when sequence_length > max_sequence_length, diverging from other EPs. Consider adding the same sequence_length > max_sequence_length NOT_IMPLEMENTED check here for parity.
Suggested change:
```cpp
const auto max_sequence_length = static_cast<int64_t>(cos_cache->Shape()[0]);
if (static_cast<int64_t>(sequence_length) > max_sequence_length) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, NOT_IMPLEMENTED,
                         "sequence_length ", sequence_length,
                         " exceeds cos/sin cache length ", max_sequence_length,
                         " when position_ids is provided. Cache update is not supported.");
}
```
```cpp
    .TypeConstraint("T", WebGpuSupportedFloatTypes())
    .TypeConstraint("M", DataTypeImpl::GetTensorType<int64_t>())
    .InputMemoryType(OrtMemTypeCPUInput, 3),  // position_ids on CPU for bounds validation
RotaryEmbedding);
```
KernelDef marks input 3 (position_ids) as OrtMemTypeCPUInput, but the compute program still binds position_ids as a WebGPU buffer input. WebGpuContext::Run binds inputs by casting Tensor::DataRaw() to WGPUBuffer, so a CPU tensor here will fail validation/crash. Instead, keep position_ids as a GPU tensor for the shader and do validation via an explicit GPU→CPU copy (or create a temporary GPU copy after CPU validation).
```cpp
    .TypeConstraint("M", DataTypeImpl::GetTensorType<int64_t>())
    .InputMemoryType(OrtMemTypeCPUInput, 1),  // position_ids on CPU for bounds validation
```
KernelDef marks input 1 (position_ids) as OrtMemTypeCPUInput, but the shader program binds position_ids as an input buffer. WebGpuContext::Run expects all program inputs to be WebGPU buffers and casts Tensor::DataRaw() to WGPUBuffer, so forcing CPU memory here will break execution. Consider validating via an explicit GPU→CPU copy while keeping the shader input on GPU (or copy CPU data into a temporary GPU tensor used by the program).
Suggested change:
```cpp
    .TypeConstraint("M", DataTypeImpl::GetTensorType<int64_t>()),
```
```cpp
// Validate position_ids values are within cos/sin cache bounds (position_ids kept on CPU via InputMemoryType).
const auto* pos_ids_data = position_ids->Data<int64_t>();
const auto pos_ids_size = position_ids->Shape().Size();
if (pos_ids_size == 1) {
```
Bounds validation treats any position_ids tensor with Size()!=1 as “format 1”, but it doesn’t enforce the schema constraints (either shape (1) or (batch_size, sequence_length)). If an invalid-but-broadcastable shape is provided, validation may pass while the shader uses broadcast semantics, leading to incorrect results. Suggest explicitly validating rank/dims to match the operator schema before iterating values.
Suggested change:
```cpp
const auto& position_ids_shape = position_ids->Shape();
const auto position_ids_rank = position_ids_shape.NumDimensions();
const auto pos_ids_size = position_ids_shape.Size();
const bool is_format_0 = position_ids_rank == 1 && pos_ids_size == 1;
const bool is_format_1 =
    position_ids_rank == 2 &&
    position_ids_shape[0] == static_cast<int64_t>(batch_size) &&
    position_ids_shape[1] == static_cast<int64_t>(sequence_length);
if (!is_format_0 && !is_format_1) {
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "position_ids must have shape (1) or (",
                         static_cast<int64_t>(batch_size), ", ",
                         static_cast<int64_t>(sequence_length), "), but got rank ",
                         position_ids_rank);
}
// Validate position_ids values are within cos/sin cache bounds (position_ids kept on CPU via InputMemoryType).
const auto* pos_ids_data = position_ids->Data<int64_t>();
if (is_format_0) {
```
tianleiwu left a comment
Review Summary
The validation logic and shader-side defense-in-depth checks are well-designed and mirror the CPU/CUDA approach correctly. The test coverage is thorough. However, both kernel registrations have a critical issue that will cause a runtime crash.
Critical: InputMemoryType(OrtMemTypeCPUInput) + AddInputs() incompatibility
Both kernel registrations add .InputMemoryType(OrtMemTypeCPUInput, ...) to keep position_ids on CPU for host-side validation. However, ComputeInternal still passes position_ids to AddInputs() for the shader. The WebGPU program execution path in webgpu_context.cc (around L464) does:

```cpp
bind_buffers.push_back(reinterpret_cast<WGPUBuffer>(const_cast<void*>(inputs[i].tensor->DataRaw())));
```

This assumes all program inputs have GPU buffer handles in DataRaw(). A CPU-resident tensor's DataRaw() returns a heap pointer, producing an invalid WGPUBuffer handle that will crash or corrupt at dispatch.
No existing WebGPU kernel combines InputMemoryType(OrtMemTypeCPUInput) with AddInputs() on the same tensor. Existing kernels (Reduce, Pad, Resize, GQA, etc.) read CPU-pinned inputs for host-side configuration only and never submit them as shader bindings.
Recommended fix
The simplest approach is to remove both InputMemoryType directives and the host-side C++ validation, relying solely on the shader-side defense-in-depth checks (which are already correct in this PR). This matches CUDA's approach: device-side bounds checking with pass-through on OOB. The TS validation in rotary-embedding.ts can remain since it runs on CPU during validateInputs() before shader dispatch.
If host-side rejection is desired, the kernel would need to explicitly copy position_ids from GPU to CPU (via DataTransferManager) for validation, then pass the original GPU tensor to AddInputs().
What's good
- Shader-side OOB checks: raw_pos < 0 catches negative i32 from truncated int64; position_id >= max_position catches OOB after u32 conversion + offset. Pass-through behavior matches CUDA.
- FusedQKRotaryEmbeddingProgram bounds check on computed position_id correctly handles both Q and K outputs, including the conditional K-head guard.
- TS validation using getBigInt64Array() avoids int64 truncation issues.
- 7 new tests cover format-0, format-1, negative, and batch OOB comprehensively.
TensorViewImpl.getBigInt64Array() threw RangeError when the WASM heap offset was not 8-byte aligned. Fix by detecting unaligned offsets and copying the bytes into an aligned buffer before creating the BigInt64Array. This fixes the Web CI failure where all RotaryEmbedding tests failed with 'start offset of BigInt64Array should be a multiple of 8'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
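The alignment fallback can be illustrated outside ORT. A minimal sketch, assuming a raw ArrayBuffer standing in for the WASM heap (illustrative helper name, not the actual init.ts code):

```typescript
// Sketch: read int64 values at an arbitrary byte offset of a heap-like buffer.
// BigInt64Array views require 8-byte-aligned offsets; for unaligned offsets we
// copy the bytes into a fresh (always aligned) buffer. The copy is read-only
// with respect to the source: writes to it do not propagate back to the heap.
function getBigInt64ArraySafe(heap: ArrayBuffer, byteOffset: number, length: number): BigInt64Array {
  if (byteOffset % 8 === 0) {
    return new BigInt64Array(heap, byteOffset, length); // zero-copy view
  }
  const bytes = new Uint8Array(heap, byteOffset, length * 8);
  return new BigInt64Array(bytes.slice().buffer); // aligned copy
}

// An int64 written at unaligned offset 4 is still readable through the helper;
// constructing `new BigInt64Array(heap, 4, 1)` directly would throw RangeError.
const heap = new ArrayBuffer(16);
new DataView(heap).setBigInt64(4, 42n, true);
console.log(getBigInt64ArraySafe(heap, 4, 1)[0]); // 42n
```

Note the endianness caveat: typed arrays use the platform byte order (little-endian on effectively all JS engines), which is why the DataView write above uses littleEndian = true.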
Force-pushed d1cc33c to 5920aa7
Remove InputMemoryType(OrtMemTypeCPUInput) from both WebGPU kernel registrations (contrib and ONNX domain) and the associated host-side position_ids value scanning. InputMemoryType is incompatible with AddInputs() — a CPU tensor's DataRaw() would be cast to WGPUBuffer, causing a crash at dispatch time.

Defense strategy is now:
- Shader-side: WGSL bounds checks pass through input unchanged on OOB (same as CUDA kernel behavior)
- JSEP/browser: TypeScript validation in rotary-embedding.ts catches OOB before shader dispatch
- init.ts: getBigInt64Array() handles unaligned WASM heap offsets

WebGPU OOB tests changed from kExpectFailure to kExpectSuccess, verifying pass-through behavior (output equals input on OOB). ONNX domain tests updated to use rank-2 position_ids for cross-EP consistency. TS validation reordered per Copilot review: sequence_length check before per-value bounds validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
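The pass-through contract the updated tests assert (output equals input when a position is OOB) can be expressed as a tiny reference model. A hypothetical TypeScript helper using interleaved rotation, not the actual shader:

```typescript
// Reference model of the OOB contract: rotate one token's vector when posId
// indexes a valid cache row; otherwise return the input unchanged, mirroring
// the WGSL/CUDA pass-through behavior.
function applyRotary(x: number[], posId: number, cos: number[][], sin: number[][]): number[] {
  if (posId < 0 || posId >= cos.length) {
    return x.slice(); // OOB → pass-through
  }
  const out = new Array<number>(x.length);
  for (let j = 0; j * 2 < x.length; j++) {
    const re = x[2 * j];
    const im = x[2 * j + 1];
    out[2 * j] = re * cos[posId][j] - im * sin[posId][j];
    out[2 * j + 1] = re * sin[posId][j] + im * cos[posId][j];
  }
  return out;
}
```

A test written against this contract checks kExpectSuccess with output == input for OOB and negative posId, rather than expecting an error status.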
tianleiwu left a comment
Round-2 review (head 9744bfb).
Summary
The high-severity concern from the prior round — InputMemoryType(OrtMemTypeCPUInput) + AddInputs(position_ids) would crash because the same tensor cannot be both CPU-resident and bound as a WebGPU storage buffer — has been resolved. The author removed the host-side C++ value scanning and InputMemoryType directives and now relies on shader-side defense-in-depth (OOB → pass-through input) for both contrib and ONNX-domain WebGPU kernels, plus host-side validation on the JS path via getBigInt64Array() in validateInputs. Tests were converted to kExpectSuccess with output == input. The pass-through semantics now match the CUDA kernel for cross-EP consistency.
Resolving my two prior threads:
- contrib_ops/webgpu/bert/rotary_embedding.cc (shader as primary protection) — addressed.
- test/contrib_ops/rotary_embedding_op_test.cc (tests must verify pass-through, not failure) — addressed.
Remaining items (all nits, non-blocking)
1. Shader comment understates what raw_pos < 0 catches — onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc
The WebGPU shader storage representation truncates int64 to i32 (see ProgramVariableDataType::Int64 → value type i32 in core/providers/webgpu/shader_variable.cc). Any positive int64 value greater than INT32_MAX will alias to a negative i32 and be caught by the raw_pos < 0 branch. That is safe (still pass-through), but the inline comment only mentions "negative position_ids". Consider expanding the comment to note that very large positive int64 values also fall into this branch by virtue of i32 truncation, so a future maintainer doesn't think the check is incomplete.
2. JSEP getBigInt64Array alignment fallback returns a read-only copy — js/web/lib/wasm/jsep/init.ts and js/web/lib/wasm/jsep/tensor-view.ts
The inline comment in init.ts correctly notes that the unaligned path returns a read-only copy (mutations don't propagate to the WASM heap). All current call sites (pad.ts, split.ts, reduce.ts, slice.ts, tile.ts, expand.ts, cumsum.ts, resize.ts, gather-block-quantized.ts, the new rotary-embedding.ts) are read-only iterations, so this is safe today. Consider promoting that read-only caveat into the TensorView.getBigInt64Array() JSDoc in tensor-view.ts so future consumers don't accidentally write through the returned view expecting WASM-heap propagation.
3. Format-0 detection in TS is contrib-op-specific — js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts
positionIdsElementCount === 1 triggers the format-0 base-offset branch. The ONNX-domain RotaryEmbedding schema requires rank-2 position_ids (no base-offset mode). If this validateInputs is shared with the ONNX-domain op (now or later), the format-0 branch should be guarded by op identity. A short comment confirming the JSEP path scope would prevent silent acceptance of malformed shapes against the ONNX-domain op.
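The op-identity guard suggested in this nit can be sketched as a small predicate (hypothetical helper and domain check, not the actual rotary-embedding.ts code):

```typescript
// Hypothetical guard: only the contrib (com.microsoft) RotaryEmbedding accepts
// the single-element "base offset" format 0; the ONNX-domain op requires a
// rank-2 position_ids tensor, so format 0 must not apply to it.
function isFormat0Allowed(opDomain: string, positionIdsDims: readonly number[]): boolean {
  const elementCount = positionIdsDims.reduce((a, b) => a * b, 1);
  return opDomain === 'com.microsoft' && elementCount === 1;
}
```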
4. Pre-existing related gap (informational) — separately noted by the Copilot reviewer at L159 of contrib_ops/webgpu/bert/rotary_embedding.cc: when position_ids is provided, CPU/CUDA reject sequence_length > max_sequence_length with NOT_IMPLEMENTED, while the WebGPU contrib path (after this PR) silently produces pass-through for the OOB tokens. This was the case before this PR too. Worth tracking as a follow-up if cross-EP behavior parity is desired here, but it does not need to block this security fix.
Verdict
Looks good. The shader bounds checks are correctly designed (both for the standalone RotaryEmbeddingProgram and the FusedQKRotaryEmbeddingProgram Q+K pass-through path), tests are appropriate, and the JS validation is functionally correct.
This PR adds position_ids bounds checking to WebGPU and JS RotaryEmbedding implementations, completing the security fix started in PR #27597 (commit 056bab3) which covered CPU and CUDA.
Problem
The com.microsoft::RotaryEmbedding kernel uses position_ids as row indices into cos_cache/sin_cache without bounds validation. While PR #27597 fixed the CPU and CUDA paths, the WebGPU and JS implementations were still missing bounds checks, which could produce silently wrong results (WebGPU hardware clamps OOB reads).
Changes
Security
Addresses the same vulnerability as #27597 (OOB read via position_ids, CVSS 7.5-9.1) for WebGPU/JS execution providers.
Testing