Integrate ONNX 1.22.0 (opset 27) — issue #28752 #28754

Merged
titaiwangms merged 32 commits into microsoft:main from titaiwangms:integrate-onnx-1.22.0rc1
Jun 16, 2026
Conversation

@titaiwangms titaiwangms commented Jun 2, 2026

Contributor

Integrate ONNX 1.22.0rc1 (opset 27)

Resolves #28752.

Pin: onnx/onnx@bc3be77bec2f628788796dff60819186bacf49df (VERSION_NUMBER 1.22.0rc1).
ONNX 1.21.0 → 1.22.0rc1. Max ai.onnx opset 26 → 27. IR version unchanged (13 / 0x0D).

This is the RC validation phase of an incremental integration (same strategy as the ONNX 1.21 bump, #27601). The formal v1.22.0 GitHub release is still a draft (no git tag yet), so re-pinning to the released tag is deferred to Phase 2 (see Follow-ups). Landing the RC now validates ONNX 1.22 against ORT before ONNX publishes the formal release.


Update — ONNX 1.22.0 FINAL re-pin + rebase onto upstream/main + closes #28969

ONNX published the formal v1.22.0 GitHub release, so this PR is re-pinned rc2 → FINAL (onnx/onnx@v1.22.0), which is the Phase-2 step deferred in the rc1 description below. The branch was also rebased onto upstream/main to pick up the intervening optimizer/opset-26 work. The released tag tarball has a different asset hash than the RCs, so the vcpkg MS-internal asset mirror was re-seeded for the final tag (otherwise the --use_vcpkg legs fail with a 404).

Also closes #28969 (WebGPU binary-elementwise broadcast SIZE_MAX underflow). ONNX 1.22's expanded-Attention reference tests exposed a latent WebGPU bug where a broadcast shape computed dim - 1 on a zero/unit dimension and underflowed to SIZE_MAX; the fix is included here and the previously-skipped reference tests are re-enabled.
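
The underflow described above can be illustrated with a minimal sketch. Python integers do not wrap, so the snippet emulates an unsigned `size_t` with a bit mask. The 64-bit width and all helper names are assumptions made for illustration; this is not the WebGPU EP's actual code.

```python
SIZE_MAX = (1 << 64) - 1  # assumption: 64-bit size_t


def size_t_sub(a: int, b: int) -> int:
    """Emulate unsigned size_t subtraction, which wraps modulo 2**64."""
    return (a - b) & SIZE_MAX


def last_index_unchecked(dim: int) -> int:
    """Buggy pattern: silently assumes dim >= 1."""
    return size_t_sub(dim, 1)


def last_index_checked(dim: int) -> int:
    """Guarded pattern: handle an empty dimension before subtracting."""
    return size_t_sub(dim, 1) if dim > 0 else 0


if __name__ == "__main__":
    print(last_index_unchecked(0) == SIZE_MAX)  # True: 0 - 1 wrapped around
    print(last_index_checked(0))                # 0
```

The guarded variant is only one possible shape for such a fix; the point is that the subtraction must not run on an unsigned zero.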

Opset-27 *CurrentOpset test handling. ONNX 1.22.0 FINAL ships DomainToVersionRange map-max 27 while the last released opset is 26, so opset 27 stays under development for the whole 1.22 cycle. Strict legs (the default, or ALLOW_RELEASED_ONNX_OPSET_ONLY=1) therefore throw "Opset 27 under development" at model load on every *CurrentOpset fusion test that builds at the max opset. These tests now load with per-model ModelOptions{/*allow_released_opsets_only*/ false, /*strict_shape_type_inference*/ false}, extending the existing 38f17243b / GatherToSlice precedent to the rest of the *CurrentOpset suite. This is leg-agnostic (exercises opset 27 on every CI leg, not just the relaxed ones) and preserves opset coverage (vs. GTEST_SKIP). Each call site is annotated with a one-line WHY + tracking issue (#28966) so the relaxation can be removed once opset 27 is released.
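
As a rough model of the gate described above, the sketch below shows why a model built at the map-max opset loads only when the released-only check is relaxed. The constants come from this description; the function name and error text layout are hypothetical and do not mirror ORT's real load path.

```python
LAST_RELEASED_OPSET = 26  # last released ai.onnx opset, per this description
MAX_KNOWN_OPSET = 27      # DomainToVersionRange map-max in ONNX 1.22.0


def check_model_opset(opset: int, allow_released_opsets_only: bool) -> None:
    """Raise if the opset is unknown, or unreleased while strict mode is on."""
    if opset > MAX_KNOWN_OPSET:
        raise ValueError(f"Opset {opset} is not known")
    if allow_released_opsets_only and opset > LAST_RELEASED_OPSET:
        raise ValueError(f"Opset {opset} under development")


if __name__ == "__main__":
    check_model_opset(27, allow_released_opsets_only=False)  # relaxed: loads
    try:
        check_model_opset(27, allow_released_opsets_only=True)
    except ValueError as err:
        print(err)
```

Relaxing the check per model, rather than skipping the test, keeps the opset-27 path exercised on strict legs as well.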

Resolves #28752 (unchanged). Closes #28969.

Update — ONNX 1.22.0rc2 re-pin + ConvTranspose conforms to ONNX output_shape spec

Since the original rc1 description below, this PR was re-pinned rc1 → rc2 (onnx/onnx@b124e0188a, VERSION_NUMBER 1.22.0rc2) to pick up the upstream Xcode/iOS CMake fix (onnx#8056). rc2 also carries onnx#8051, which tightened convTransposeShapeInference to reject an output_shape/output_padding whose size does not match the number of spatial dimensions (per the ONNX spec clarification onnx#5400). ONNX Runtime now conforms to that spec instead of patching ONNX to preserve a non-standard form.

⚠️ Breaking change: ConvTranspose output_shape now follows the ONNX spec (spatial dimensions only). ORT previously also accepted a non-standard rank + 2 form that included batch and channel, i.e. (N, C, H, W). As of ONNX 1.22, a rank + 2 output_shape on a ConvTranspose whose input has a statically-known rank is rejected at Graph::Resolve with "Attribute output_shape has incorrect size". Migration: specify output_shape with spatial dimensions only, e.g. {1, 1, 1, 14} → {1, 14} (batch and channel are always inferred from the input and weight, so results are identical; the kernel ignores N, C). Models whose ConvTranspose input has a dynamic/unknown rank are unaffected: ONNX skips the size check and ORT computes the same result (covered by the new ConvTranspose_RankPlus2_OutputShape_DynamicRankInput_Runtime test).
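
The migration above can be sketched as a small helper that drops the N, C prefix from a rank + 2 output_shape. The function name is hypothetical and the helper works on plain lists; it is an illustration of the rule, not part of ORT or ONNX.

```python
from typing import List, Sequence


def to_spatial_output_shape(output_shape: Sequence[int], input_rank: int) -> List[int]:
    """Return output_shape with spatial dimensions only.

    input_rank is the rank of the ConvTranspose input (N, C, spatial...).
    """
    num_spatial = input_rank - 2
    if num_spatial < 1:
        raise ValueError("ConvTranspose input needs at least one spatial dimension")
    if len(output_shape) == num_spatial:
        return list(output_shape)      # already spec-conformant
    if len(output_shape) == input_rank:
        return list(output_shape[2:])  # drop the non-standard N, C prefix
    raise ValueError("output_shape has incorrect size")


if __name__ == "__main__":
    print(to_spatial_output_shape([1, 1, 1, 14], input_rank=4))  # [1, 14]
```

Because it needs the input rank, a helper like this only applies where that rank is statically known, which is exactly the case ONNX 1.22 now rejects.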

Patch inventory — supersedes "2 files, 3 hunks" below. cmake/patches/onnx/onnx.patch (and its byte-identical binskim.patch mirror) carries only the ONNX_MINIMAL_BUILD option hunk and the GroupNormalization-18 .Deprecate() removal — no ConvTranspose hunks. rc2's strict shape-inference check is kept as-is; ORT's own test models were conformed to the spec. The upstream archive hash, deps.txt, portfile.cmake, vcpkg.json, and the submodule pin are unchanged.

Additional rc2 test conformance. rc2 also tightened convPoolShapeInference to reject Conv inputs with rank < 3 ("Input tensor must have at least 3 dimensions"). The hand-authored model in onnxruntime/test/python/quantization/test_op_split.py declared a spec-invalid rank-2 Conv input/weight; it was conformed to a valid NCHW shape ([6, 3] → [1, 1, 6, 3], weight → [2, 1, 1, 1]), keeping the quantized-Split graph and expected outputs identical. No ORT source change.
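
One way to picture the input-shape conform is padding leading unit dimensions until the shape is NCHW, which leaves the element count unchanged. The helper below is a hypothetical illustration and is not taken from test_op_split.py.

```python
from math import prod
from typing import List, Sequence


def lift_to_nchw(shape: Sequence[int]) -> List[int]:
    """Pad a rank < 3 shape with leading 1s up to rank 4 (N, C, H, W)."""
    if len(shape) >= 3:
        return list(shape)  # already passes the rank >= 3 check
    return [1] * (4 - len(shape)) + list(shape)


if __name__ == "__main__":
    old = [6, 3]
    new = lift_to_nchw(old)
    print(new)                      # [1, 1, 6, 3]
    print(prod(old) == prod(new))   # True: same number of elements
```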

This note should also seed the GitHub Release notes for the ONNX 1.22 / opset 27 milestone and the squash-commit message.


What changed (29 files)

Version plumbing

  • cmake/deps.txt — onnx archive URL → rc1 commit zip + SHA1 421e5a9afb6c41a54696e424e5b9a3796aab6821.
  • cmake/external/onnx — submodule → bc3be77b.
  • cmake/vcpkg-ports/onnx/portfile.cmake: REF commit form + tar.gz SHA512 e0c526f5…3ce467.
  • cmake/vcpkg-ports/onnx/vcpkg.json: version-semver 1.22.0, port-version 0.
  • cmake/patches/onnx/onnx.patch + cmake/vcpkg-ports/onnx/binskim.patch: byte-identical rebase onto 1.22 (2 files, 3 hunks). Kept the ONNX_MINIMAL_BUILD option (restructured for 1.22's new onnx_core OBJECT-lib / add_subdirectory(onnx) layout) and the GroupNormalization-18 .Deprecate() removal; dropped the Utils.cmake protobuf-warnings hunk (already merged upstream in 1.22).

Opset-27 op enablement (Range)

  • onnxruntime/core/providers/cpu/generator/range.cc — split into versioned [11, 26] + a new unversioned 27 registration. The opset-27 kernel natively supports the existing common numeric types (float/double/int16/int32/int64). fp16 Range is covered via ONNX's Range-27 function body, which ORT expands into primitive ops at partition time. bf16 Range is deferred to that same function expansion — there is no native bf16 kernel, and its bf16 reference node test (test_range_bfloat16_type_positive_delta, base + _expanded) is not exercised by the Python/numpy ONNX backend series, whose harness cannot materialize bf16 (Numpy_type 256); a native fp16/bf16 kernel + stash_type handling is a follow-up (efficiency, not correctness).
  • onnxruntime/core/providers/cpu/cpu_execution_provider.cc — versioned the Range forward-declare + BuildKernelCreateInfo entries and added the opset-27 registration.
  • CUDA Range — same versioned [11, 26] + opset-27 split as CPU (onnxruntime/core/providers/cuda/generator/range.cc + cuda_execution_provider.cc); GPU-verified locally: onnx_test_runner -e cuda 8/8 opset-27 Range node tests pass, native Range-27 placed on CUDAExecutionProvider (fp16/bf16 via function expansion).
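
The versioned [11, 26] plus opset-27 split can be pictured as two registrations with disjoint version intervals. The sketch below is a simplified interval model with hypothetical names; ORT's real kernel-matching rules are more nuanced and are not reproduced here.

```python
from typing import List, NamedTuple, Optional


class KernelReg(NamedTuple):
    op: str
    start: int
    end: Optional[int]  # None means open-ended


def find_kernel(regs: List[KernelReg], op: str, version: int) -> Optional[KernelReg]:
    """Return the first registration whose version interval covers `version`."""
    for reg in regs:
        if reg.op == op and reg.start <= version and (reg.end is None or version <= reg.end):
            return reg
    return None


# Range after the split described above: versioned [11, 26] and a new 27+ entry.
RANGE_REGS = [KernelReg("Range", 11, 26), KernelReg("Range", 27, None)]

if __name__ == "__main__":
    print(find_kernel(RANGE_REGS, "Range", 11))
    print(find_kernel(RANGE_REGS, "Range", 27))
```

With disjoint intervals, each version resolves to exactly one registration, which keeps the advertised opset boundaries unambiguous.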

Optimizer / EP opset ceilings

  • …/transpose_optimization/optimizer_api.h: kMaxSupportedOpset 26 → 27.
  • coreml/nnapi/vsinpu/webnn base_op_builder.h: GetMaxSupportedOpSet() 25 → 27 (upper guard only; per-op support checks still gate, and these EPs gain no new kernels here).
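
A minimal sketch of what "upper guard only" means: the ceiling is checked first, and the per-op support check still decides the outcome. The function and parameter names are hypothetical and do not correspond to the EPs' actual builder code.

```python
from typing import Callable


def node_is_candidate(node_opset: int, max_supported_opset: int,
                      per_op_check: Callable[[], bool]) -> bool:
    """Apply the opset ceiling, then defer to the per-op support check."""
    if node_opset > max_supported_opset:
        return False       # rejected by the upper guard, per-op check never runs
    return per_op_check()  # the ceiling alone never grants support


if __name__ == "__main__":
    print(node_is_candidate(26, 25, lambda: True))   # False: capped at 25
    print(node_is_candidate(26, 27, lambda: True))   # True: reaches per-op check
    print(node_is_candidate(27, 27, lambda: False))  # False: per-op check gates
```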

Fusion updates

  • onnxruntime/core/optimizer/gather_fusion.cc — GatherToSlice Range version list {1,11} → {1,11,27}.
  • onnxruntime/core/optimizer/embed_layer_norm_fusion.cc — add 27 to the two Range path-matchers (parent_path_3/4) so embedding fusion still matches opset-27 models.
  • onnxruntime/test/optimizer/graph_transform_test.cc — new opset-27 GatherToSliceFusion test.
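
The effect of extending the Range version list can be sketched as a membership test on the node's SinceVersion. The helper below is a hypothetical stand-in for the version-list checks the fusions perform, not ORT's implementation.

```python
from typing import FrozenSet

RANGE_VERSIONS_BEFORE: FrozenSet[int] = frozenset({1, 11})
RANGE_VERSIONS_AFTER: FrozenSet[int] = frozenset({1, 11, 27})


def matches_op_version(op_type: str, since_version: int,
                       expected_op: str, versions: FrozenSet[int]) -> bool:
    """True if the node is the expected op at one of the accepted versions."""
    return op_type == expected_op and since_version in versions


if __name__ == "__main__":
    # A Range node from an opset-27 model carries SinceVersion 27.
    print(matches_op_version("Range", 27, "Range", RANGE_VERSIONS_BEFORE))  # False
    print(matches_op_version("Range", 27, "Range", RANGE_VERSIONS_AFTER))   # True
```

Without 27 in the list, the pattern match fails and the fusion is silently skipped for opset-27 models rather than raising an error.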

Requirements (7 bumped)

  • All 7 CI requirements.txt files: onnx==1.22.0rc1 (rc1 wheel is on PyPI). The 3 transformers pins remain frozen at 1.18.0 (unrelated to this bump; intentionally untouched).
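
When several requirements files must carry the same pin, a small audit helper can flag stragglers. This is a sketch that works on file contents passed in as strings; the function names are hypothetical and no such script is part of this PR.

```python
import re
from typing import List

# Matches "onnx==<version>" at the start of a line, but not "onnxruntime==...".
_ONNX_PIN = re.compile(r"^[ \t]*onnx[ \t]*==[ \t]*([^\s#;]+)", re.MULTILINE)


def onnx_pins(requirements_text: str) -> List[str]:
    """Return every onnx version pinned in the given requirements text."""
    return _ONNX_PIN.findall(requirements_text)


def stale_pins(requirements_text: str, expected: str) -> List[str]:
    """Return pinned onnx versions that differ from the expected one."""
    return [v for v in onnx_pins(requirements_text) if v != expected]


if __name__ == "__main__":
    sample = "numpy\nonnx==1.22.0rc1\nonnxruntime==1.0\n"
    print(onnx_pins(sample))                 # ['1.22.0rc1']
    print(stale_pins(sample, "1.22.0rc1"))   # []
```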

Generated docs / test data

  • js/web/docs/webgl-operators.md — regenerated.
  • docs/OperatorKernels.md: surgical edit of the CPU EP and CUDA EP Range rows (27+ plus a [11, 26] continuation each); see caveats.
  • onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc: comment-only change that documents why no opset-27 CPU exclusions are needed (all opset-27 node tests pass via function expansion).

Docs

  • .agents/skills/onnx-opset-bump-checklist/SKILL.md — new reusable checklist skill distilled from this integration. Now also documents the "bump all execution providers together" tradition (CPU + CUDA + JS/DML assessment in one pass) so future opset bumps don't ship a partial EP set.

Validation (CPU EP + CUDA EP, Linux x64)

  • Full build ✅
  • --minimal_build extended build ✅ (validates the rebased ONNX_MINIMAL_BUILD patch hunk independently of the vcpkg mirror path)
  • onnxruntime_test_all ✅ — 1595 passed / 0 failed
  • onnx_test_runner -e cpu on the ONNX 1.22 opset-27 node tests ✅ — 62/62 pass via ONNX function-body expansion (run with ALLOW_RELEASED_ONNX_OPSET_ONLY=0), including CausalConvWithState, LinearAttention, and fp16/bf16 Range — despite no native kernels for them.
  • CUDA EP (H100): built --use_cuda clean in both Debug and RelWithDebInfo ✅; onnx_test_runner -e cuda on the opset-27 Range node tests ✅ — 8/8 pass, with native Range-27 placed on CUDAExecutionProvider (no CPU fallback) and fp16/bf16 covered via function-body expansion.

Standing caveats (please read before reviewing)

  1. CUDA EP now locally verified for Range; other GPU EPs/ops still CI-only. The CUDA EP was built and the opset-27 Range node tests run locally on an H100 (8/8 pass). DML and the remaining GPU EPs/ops were not exercised here. Function-body expansion is EP-agnostic, so other opset-27 models are expected to run on those EPs too, but broader GPU coverage remains a CI/follow-up item.
  2. OperatorKernels.md updated surgically (CPU and CUDA Range rows only). A CPU-only full regen would destructively wipe the CUDA/DML/other-EP sections (the generator only emits rows for the EPs in the built module). A correct multi-EP regen needs a build per EP and is a follow-up.
  3. Opset 27 is "under development" in ONNX's released-versions map. ORT's load-time validation rejects opset-27 models unless ALLOW_RELEASED_ONNX_OPSET_ONLY=0 (ORT CI already sets this). The opset-27 schemas are always compiled in from the submodule regardless — this gate only affects model load-time acceptance, not schema availability.
  4. EP GetMaxSupportedOpSet jumped 25 → 27 (skips 26). This is an upper guard only; raising it merely lets opset-26/27 nodes reach the per-op support checks that still gate correctness. No regression — it also retroactively un-caps opset-26 for these EPs.
  5. iOS/macOS Xcode framework build is currently broken by an upstream ONNX CMake regression (the onnx_core OBJECT-library split in Remove glob calls from ONNX CMake code onnx/onnx#7733 reintroduced the Xcode breakage originally fixed by Revert ONNX CMake changes onnx/onnx#7515 for Build failure with Xcode generator onnx/onnx#7514). This is NOT caused by this opset bump. Tracked upstream at onnx/onnx#8053. Non-Xcode builds (Linux/Windows/Android/WASM) and all CPU/CUDA validation are unaffected. This resolves at the Phase 2 formal v1.22.0 re-pin once ONNX ships the fix.

Follow-ups (explicitly NOT in this PR)

  • GPU/multi-EP coverage: run opset-27 CUDA/DML node tests; regenerate OperatorKernels.md across all EPs.
  • JS EP Range [11, 26] + 27 split (currently registered open-ended at 11; mirror the CPU/CUDA versioned split).
  • DML Range opset-27 assessment (DML uses its own REG_INFO registration system — assess whether an opset-27 entry is needed).
  • WebGPU EP Range opset-27 split — range.cc registers Range .SinceVersion(11) open-ended, so it already claims opset-27 Range; only the new bf16 type is unsupported and falls back via the T type-constraint (function expansion). Mirror the CPU/CUDA versioned [11, 26] + 27 split.
  • Native kernels: implement CPU (and EP) CausalConvWithState and LinearAttention kernels, and a native fp16/bf16 + stash_type Range-27 kernel (replace today's function-expansion path with efficient kernels).
  • Phase 2 — formal v1.22.0 re-pin: re-pin deps.txt/submodule/portfile/requirements to the released tag once ONNX publishes it (currently blocked on ONNX tagging the release); upload the tag tarball to the vcpkg mirror. This also restores the iOS/macOS Xcode framework build once the upstream onnx OBJECT-library Xcode regression (caveat 5) is resolved and re-pinned.
  • Tooling: fix the pre-existing crash in find_optimizer_opset_version_updates_required.py (placeholder ver parsed as int) so it can be relied on for future bumps.

titaiwangms added a commit to titaiwangms/onnxruntime that referenced this pull request Jun 2, 2026
… tradition

When an op's kernel set changes for the new opset (e.g. Range gaining fp16/bf16 at
opset 27), version-split / bump that op's registration in EVERY EP that registers it
(CPU and CUDA at minimum) in the SAME PR, so no EP silently lags behind CPU and the
advertised opset boundaries stay consistent. Even an open-ended kernel that already
binds the new opset (e.g. CUDA Range at SinceVersion(11)) should still be version-split
for convention/clarity. Worked example cited: PR microsoft#28754 split Range [11,26]+27 in both
CPU and CUDA (verified).

- §1 Group B: added the all-EP tradition callout + an EP checklist (grep each provider
  dir; split cpu/cuda/js/rocm macro registrations; assess dml/webgpu/coreml/nnapi/etc.
  per their own systems; bump coreml/nnapi/vsinpu/webnn GetMaxSupportedOpSet ceilings).
- §11: added a cross-EP consistency convention note distinguishing splitting (clarity)
  from binding-coverage (correctness).
- §6 checklist: Group B line now calls out version-splitting in every EP.

Agent-signed-off: Architect (bcad189c) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Agent-signed-off: Architect (bcad189c) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the integrate-onnx-1.22.0rc1 branch from 34b6359 to 06ced99 Compare June 2, 2026 23:34
titaiwangms added a commit to titaiwangms/onnxruntime that referenced this pull request Jun 3, 2026
…2 (PR microsoft#28754 CI Issue B)

ONNX 1.22's onnx_core(OBJECT)/onnx_proto target split lists the generated
${ONNX_PROTO_SRCS} (.pb.cc) as compiled sources in BOTH targets (by design:
onnx_core's objects feed onnx + onnx_cpp2py_export with hidden visibility, while
onnx_proto is a standalone static lib with different defines). Xcode's new build
system forbids two targets independently producing the same onnx-data.pb.cc.

Extend the existing onnx_proto_gen custom target to also DEPEND on
${ONNX_PROTO_SRCS}, making it the single owner of the protoc generation step.
Both libraries already depend on onnx_proto_gen, so they now consume the
pre-generated sources (and still each compile their own object). This is CMake's
documented add_custom_target-driver pattern for multi-target generated outputs
(add_custom_command docs) and changes no defines/visibility, so it is safe for
the normal Make/Ninja build.

- cmake/patches/onnx/onnx.patch: new CMakeLists.txt hunk (now 2 files / 4 hunks)
- cmake/vcpkg-ports/onnx/binskim.patch: byte-identical mirror (verified sha256)
- onnx-opset-bump-checklist SKILL.md §7: hunk-count annotation 3 -> 4

Verified on Linux (Ninja, Debug): patch applies clean (git apply + patch -p1);
onnx + onnx_proto configure, proto generates, both targets compile onnx-ml.pb.cc,
libonnx.a + libonnx_proto.a link. Xcode itself is CI-verified only (no macOS host);
iOS_CI_on_Mac + React-Native-iOS are the oracles.

Agent-signed-off: Architect (bcad189c) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms added a commit to titaiwangms/onnxruntime that referenced this pull request Jun 3, 2026
…icrosoft#28754 CI Issue B facet 2)

ONNX 1.22 splits onnx into onnx_core(OBJECT)/onnx_proto; the aggregate
`onnx` target is add_library(onnx $<TARGET_OBJECTS:onnx_core>) with no real
sources. Xcode's generator won't archive a static lib whose only sources are
$<TARGET_OBJECTS:...>, so no libonnx.a is emitted and ORT's iOS framework link
fails. Guard the onnx target with if(CMAKE_GENERATOR STREQUAL "Xcode") to add a
generated empty dummy source forcing the archive; else()-branch leaves
Make/Ninja/MSVC byte-unchanged. Mirrored byte-identically into binskim.patch;
skill §7 hunk count updated 4->5.

Agent-signed-off: Architect (bcad189c) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the integrate-onnx-1.22.0rc1 branch from 20b4010 to 85a11f2 Compare June 3, 2026 20:57
titaiwangms added a commit to titaiwangms/onnxruntime that referenced this pull request Jun 3, 2026
…28754 CI Issue B facet 3)

The onnx_core(OBJECT) aggregate consumed via $<TARGET_OBJECTS:onnx_core> gives
the Xcode generator no build-order guarantee, so onnx's libtool archive step
races ahead of onnx_core's compilation and fails with 'libtool: can't open
file: onnx_core.../defs.o (No such file or directory)'. Under the Xcode guard,
build libonnx.a from the generated empty source and link onnx_core via
target_link_libraries(onnx PRIVATE onnx_core) instead of loose TARGET_OBJECTS:
cmake records a proper target-level dependency (onnx_core compiles first) and
still archives every onnx_core object into libonnx.a, so ORT's ar -x over
$<TARGET_FILE:onnx> extracts the full symbol set. else()-branch leaves
Make/Ninja/MSVC byte-unchanged. Mirrored byte-identically into binskim.patch;
skill §7 annotation updated.

Agent-signed-off: Architect (bcad189c) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the integrate-onnx-1.22.0rc1 branch from 017edb3 to 85a11f2 Compare June 3, 2026 22:31
titaiwangms added a commit to titaiwangms/onnxruntime that referenced this pull request Jun 9, 2026
… leniency)

ONNX 1.22 (rc2, cherry-pick onnx#8051) tightened convTransposeShapeInference to
fail_shape_inference when an output_shape/output_padding attribute size does not
match the number of spatial dimensions. ORT historically also accepted a
non-spec rank+2 (full N,C,H,W) output_shape form. This is Option A: instead of
patching ONNX to restore the leniency (Option B, commit 031c777), conform
ORT's own test models to the spec so the onnx.patch ConvTranspose hunks never
land in main.

Changes:
- conv_transpose_op_test.cc: 10 output_shape attributes -> spatial-only
  (N,C prefix dropped; Y_shape/expected_vals unchanged). Keep B's
  InvalidKernelShape expected-message fix (ONNX rejects kernel_shape rank at
  Graph::Resolve). Add a new InferenceSession-based regression test
  (ConvTranspose_RankPlus2_OutputShape_DynamicRankInput_Runtime) that feeds an
  unknown-rank input so the kept rank+2 kernel branch stays exercised.
- xnnpack_basic_test.cc: 1 output_shape attribute -> spatial-only.
- conv_transpose_attributes.h: keep the rank+2 toleration (runtime-reachable for
  dynamic-rank inputs) and document why it is retained; no behavior change.
- onnx.patch / binskim.patch: unchanged at the 3-hunk base (no ConvTranspose
  reverts) since this branch is built on the pre-B base.

Breaking change: models specifying a rank+2 output_shape AND a statically-known
input rank now fail to load under ONNX 1.22 with 'Attribute output_shape has
incorrect size'. Migration: use spatial-only output_shape. See PR microsoft#28754
description.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the integrate-onnx-1.22.0rc1 branch from 031c777 to 2f63b0a Compare June 9, 2026 20:18
@titaiwangms titaiwangms force-pushed the integrate-onnx-1.22.0rc1 branch from 01b9673 to e72e1de Compare June 10, 2026 20:54
titaiwangms and others added 16 commits June 15, 2026 20:07
Pin ONNX to rel-1.22.0 HEAD commit bc3be77bec2f628788796dff60819186bacf49df (VERSION_NUMBER 1.22.0rc1):
- cmake/deps.txt: commit-archive zip URL + SHA1 421e5a9afb6c41a54696e424e5b9a3796aab6821
- cmake/external/onnx: submodule -> bc3be77b
- cmake/vcpkg-ports/onnx/vcpkg.json: version-semver 1.22.0, port-version 0
- cmake/vcpkg-ports/onnx/portfile.cmake: REF commit form + tar.gz SHA512 e0c526f5...3ce467

Rebase cmake/patches/onnx/onnx.patch to ONNX 1.22 and mirror byte-identically into binskim.patch:
- Kept ONNX_MINIMAL_BUILD option (rebased context)
- Restructured the minimal-build source-selection hunk for ONNX 1.22's new onnx_core OBJECT library / add_subdirectory(onnx) layout
- Dropped the Utils.cmake protobuf_warnings hunk (already removed upstream in 1.22)
- Kept the GroupNormalization-18 .Deprecate() removal (still present in 1.22)

Agent-signed-off: Developer (8ac66e2a) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ONNX 1.22.0rc1 is published on PyPI (verified via pip index versions onnx --pre), so all 7 requirements files use the published wheel pin onnx==1.22.0rc1 rather than the git source pin.

Agent-signed-off: Developer (0ede529c) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…o 27 and register Range at opset 27

- optimizer_api.h: kMaxSupportedOpset 26 -> 27
- Range CPU kernel split into VERSIONED [11,26] + new non-versioned [27] reusing the existing kernel (Range-27 adds fp16/bf16 types and a stash_type attr; the existing common numeric types bind, fp16/bf16 is a deferred enhancement)
- cpu_execution_provider.cc: versioned the Range forward-declare and BuildKernelCreateInfo entries and added the opset-27 Range registration

Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…doc (ONNX 1.22.0rc1, issue microsoft#28752)

- onnx_backend_test_series_filters.jsonc: exclude deferred opset-27 ops CausalConvWithState and LinearAttention (whole families, base + _expanded), and the Range-27 float16/bfloat16 node tests (ORT CPU Range-27 reuses the existing kernel registration which supports only the common numeric types; fp16/bf16 is a tracked follow-up). float/int32 Range tests remain enabled.
- js/web/docs/webgl-operators.md: regenerated via npm run build:doc; now lists CausalConvWithState and LinearAttention as unsupported WebGL ops, reflecting the opset-27 schemas pulled in by the submodule bump.

Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ONNX 1.22 re-registers Range with an opset-27 schema (adds stash_type attr, widens type constraints; signature unchanged: 3 scalar inputs -> 1-D output).

- gather_fusion.cc: add 27 to GatherToSlice Range version list {1,11}->{1,11,27}; add opset-27 GatherToSliceFusion test.
- embed_layer_norm_fusion.cc: add 27 to the two Range path matchers (parent_path_3/4) so the embedding fusion still matches opset-27 models.
- coreml/nnapi/vsinpu/webnn base_op_builder.h: bump default GetMaxSupportedOpSet 25->27, matching the lockstep convention of prior ONNX-integration PRs (microsoft#26579, microsoft#25678, microsoft#24449) so opset-27 nodes are not spuriously rejected.

qdq_util.cc / layout_transformation_potentially_added_ops.h / kernel_type_str_resolver_utils.cc were checked and need no change (none reference the opset-27 ops Range/CausalConvWithState/LinearAttention).

Agent-signed-off: Developer (0ede529c) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Completes the T6 opset-27 sweep that the prior commit (c6366eb) missed for these
two files. ONNX 1.22 re-registers Range with an opset-27 schema (operator_sets.h
line 1460); find_optimizer_opset_version_updates_required.py flagged:
  'Newer opset found for kOnnxDomain.Range. Latest:27 Optimizer support ends at 11.
   File: gather_fusion.cc'

- gather_fusion.cc:309: GatherToSlice Range version list {1,11}->{1,11,27}. Range-27's
  signature is unchanged (3 scalar inputs -> 1-D output; only adds stash_type attr and
  wider type constraints), and the fusion depends only on that signature, so extending
  the accepted SinceVersion list is safe.
- graph_transform_test.cc: new OpSet-27 (int64) GatherToSliceFusion block mirroring the
  existing OpSet-12/OpSet-14 blocks, proving Range-27 -> Gather fuses to Slice.

Agent-signed-off: Developer (0ede529c) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…0rc1)

Update the CPU ai.onnx Range entry to reflect T2's split registration: a non-versioned opset-27 kernel (27+) plus the versioned [11, 26] kernel, matching the output of tools/python/gen_opkernel_doc.py against a fresh RelWithDebInfo build.

The doc was updated surgically rather than via a full regen because this validation build is CPU-only (no CUDA/DML EPs available on this Linux host; DML is Windows-only). A full regen would have destructively dropped the CUDA and DmlExecutionProvider sections. The CPU-section delta was verified to be exactly this Range change by diffing the freshly generated CPU section against the committed doc.

Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…WithState/LinearAttention/Range-fp16-bf16 exclusions (ONNX 1.22.0rc1, issue microsoft#28752)

T7 validation showed all 62 opset-27 backend node tests (the complete set: base + _expanded, including fp16/bf16) pass on the CPU EP. These ops are ONNX function ops (and Range-27 carries a function body), so ORT expands them into primitive nodes at partition time and executes correctly despite no native kernel and the CPU Range-27 kernel registering only common numeric types. Verified via onnx_test_runner -e cpu (62/62 succeeded, output-compared) under ALLOW_RELEASED_ONNX_OPSET_ONLY=0. Removing the global filters restores backend-test coverage per design-review Minor-1.

Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s work today via ONNX function-body expansion (ONNX 1.22.0rc1, issue microsoft#28752)

The prior comment said float16/bfloat16 'support is a follow-up enhancement', which a maintainer could misread as fp16/bf16 Range being broken. Clarify that such models execute correctly today because Range-27 carries an ONNX function body that ORT expands at graph-partition time; the follow-up is an efficient native kernel, not a functional fix. Comment-only change; verified onnxruntime_providers recompiles and all 20 GatherToSliceFusion/EmbedLayerNormFusion optimizer tests pass.

Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…27 filter comment (ONNX 1.22.0rc1, issue microsoft#28752)

The opset-27 validation only exercised the CPU EP. That limitation was previously only implicit (via the 'if a non-CPU EP fails in CI' clause). Add an explicit NOTE so a future maintainer cannot miss that GPU/CUDA/DML EPs were not exercised in this validation env. Comment-only; JSONC still parses (302 entries unchanged).

Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… tradition

When an op's kernel set changes for the new opset (e.g. Range gaining fp16/bf16 at
opset 27), version-split / bump that op's registration in EVERY EP that registers it
(CPU and CUDA at minimum) in the SAME PR, so no EP silently lags behind CPU and the
advertised opset boundaries stay consistent. Even an open-ended kernel that already
binds the new opset (e.g. CUDA Range at SinceVersion(11)) should still be version-split
for convention/clarity. Worked example cited: PR microsoft#28754 split Range [11,26]+27 in both
CPU and CUDA (verified).

- §1 Group B: added the all-EP tradition callout + an EP checklist (grep each provider
  dir; split cpu/cuda/js/rocm macro registrations; assess dml/webgpu/coreml/nnapi/etc.
  per their own systems; bump coreml/nnapi/vsinpu/webnn GetMaxSupportedOpSet ceilings).
- §11: added a cross-EP consistency convention note distinguishing splitting (clarity)
  from binding-coverage (correctness).
- §6 checklist: Group B line now calls out version-splitting in every EP.

Agent-signed-off: Architect (bcad189c) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Agent-signed-off: Architect (bcad189c) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…message

Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…PSET_ONLY legs

Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms and others added 12 commits June 15, 2026 20:08
…efault

Agent-signed-off: Developer (d307842f) [claude-opus-4.6 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Re-pin the ONNX C++ source dependency to rel-1.22.0 HEAD
(onnx/onnx@b124e0188a, VERSION_NUMBER 1.22.0rc2), which carries the
upstream Xcode/iOS CMake fix onnx#8056 (cherry-picked via onnx#8066).
With the fix upstream, cmake/patches/onnx/onnx.patch collapses back to
the 3 ORT-only hunks (ONNX_MINIMAL_BUILD option + minimal-build source
branch rebased onto rc2's post-8056 onnx-target layout, and the
GroupNormalization-18 .Deprecate() removal); binskim.patch mirrors it
byte-for-byte.

- cmake/deps.txt: onnx archive zip + SHA1 -> rc2
- cmake/external/onnx: submodule pointer -> b124e0188a
- cmake/vcpkg-ports/onnx/portfile.cmake: REF + tar.gz SHA512 -> rc2
- cmake/patches/onnx/onnx.patch + binskim.patch: regenerated for rc2

rc1..rc2 source delta (bc3be77be..b124e0188a, 3 commits): besides the
Xcode/CMake restructure (onnx#8056) and the version bump, the range also
carries runtime-touching but schema-NEUTRAL hardening that is compiled
into ORT:
  - onnx#8051 (via microsoft#8058): Conv/Pool/RoiPool/ConvTranspose shape-inference
    guards (reject <min-rank inputs, non-positive dilations/kernel/strides,
    negative pads) plus a behavior-identical auto_pad residual fix.
  - onnx#8066 cherrypicks: onnx/checker.cc raw_data size validation for
    packed sub-byte tensors, and a Conv weight/input spatial-rank guard.
There are ZERO operator-schema/opset/type-constraint/attribute changes in
the range, so these deltas only reject previously-malformed inputs and do
not change results for any valid model.

The 7 CI requirements.txt files intentionally stay at onnx==1.22.0rc1:
the rc2 wheel is not yet on PyPI, and the iOS/Xcode build consumes the
GitHub source archive (deps.txt), not the wheel. Given the schema-neutral
delta above, the rc1 wheel stays functionally compatible; bump to rc2 once
ONNX publishes the wheel.

Validated: minimal-build extended gate passes (ONNX_MINIMAL_BUILD=ON);
onnx.patch applies cleanly to the rc2 source; binskim.patch byte-identical.

Agent-signed-off: Developer (a478a765) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ror gotcha in onnx-opset-bump-checklist

The vcpkg asset cache runs with x-block-origin (no GitHub fallback), so a
missing blob hard-fails every --use_vcpkg leg with a 404. Each archive bump
(rc1->rc2->formal) is a new SHA512 = new un-mirrored blob; a green rc1 run
doesn't mean rc2 is mirrored. Added a read-only curl probe and the bare-SHA512
blob-key detail.

Agent-signed-off: Architect (f1afcb8a) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…for vcpkg asset-mirror gotcha

Extends the onnx-opset-bump-checklist mirroring section with: (h) the exact
vcpkg log signature (404 + 'x-block-origin set' on a /artifacts/<sha512> URL,
vcpkg legs fail while FetchContent legs pass); (i) ordered fix options
(Terrapin self-seed Windows leg / az blob upload under bare-SHA512 name /
EngSys ticket) with a verify-via-curl-200 step. References the architect
rc2 mirror runbook artifact.

Agent-signed-off: Architect (f1afcb8a) [claude-opus-4.8 via copilot]
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… leniency)

ONNX 1.22 (rc2, cherry-pick microsoft#8051) tightened convTransposeShapeInference to
fail_shape_inference when an output_shape/output_padding attribute size does not
match the number of spatial dimensions. ORT historically also accepted a
non-spec rank+2 (full N,C,H,W) output_shape form. This is Option A: instead of
patching ONNX to restore the leniency (Option B, commit 031c777), conform
ORT's own test models to the spec so the onnx.patch ConvTranspose hunks never
land in main.

Changes:
- conv_transpose_op_test.cc: 10 output_shape attributes -> spatial-only
  (N,C prefix dropped; Y_shape/expected_vals unchanged). Keep B's
  InvalidKernelShape expected-message fix (ONNX rejects kernel_shape rank at
  Graph::Resolve). Add a new InferenceSession-based regression test
  (ConvTranspose_RankPlus2_OutputShape_DynamicRankInput_Runtime) that feeds an
  unknown-rank input so the kept rank+2 kernel branch stays exercised.
- xnnpack_basic_test.cc: 1 output_shape attribute -> spatial-only.
- conv_transpose_attributes.h: keep the rank+2 toleration (runtime-reachable for
  dynamic-rank inputs) and document why it is retained; no behavior change.
- onnx.patch / binskim.patch: unchanged at the 3-hunk base (no ConvTranspose
  reverts) since this branch is built on the pre-B base.

Breaking change: models specifying a rank+2 output_shape AND a statically-known
input rank now fail to load under ONNX 1.22 with 'Attribute output_shape has
incorrect size'. Migration: use spatial-only output_shape. See PR microsoft#28754
description.
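
The size rule behind this breaking change can be sketched as follows. This is a minimal illustration, not the ONNX implementation: `IsSpatialOnlyOutputShape` and `DropBatchAndChannel` are hypothetical helper names, and the only rule assumed is the one stated above (the attribute must have one entry per spatial dimension, i.e. input rank minus 2).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true when output_shape has exactly one entry per
// spatial dimension (input rank minus the N and C dimensions).
bool IsSpatialOnlyOutputShape(std::size_t input_rank,
                              const std::vector<int64_t>& output_shape) {
  return input_rank >= 2 && output_shape.size() == input_rank - 2;
}

// Hypothetical migration helper: converts the non-spec rank+2 form
// (N, C, spatial...) to the spatial-only form by dropping the N, C prefix.
// Shapes that are not in the rank+2 form are returned unchanged.
std::vector<int64_t> DropBatchAndChannel(std::size_t input_rank,
                                         const std::vector<int64_t>& output_shape) {
  assert(input_rank >= 2);  // ConvTranspose inputs carry at least N and C
  if (output_shape.size() == input_rank) {
    return std::vector<int64_t>(output_shape.begin() + 2, output_shape.end());
  }
  return output_shape;
}
```

For a rank-4 (N,C,H,W) input, a full `{1, 3, 8, 8}` output_shape is the rejected form and `{8, 8}` is the conforming one.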

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ONNX 1.22.0rc2 tightened convPoolShapeInference to reject Conv inputs
with rank < 3 at model load ("Input tensor must have at least 3
dimensions"). test_op_split.py declared a spec-invalid rank-2 input
[6,3] feeding Conv (weight also rank-2), which rc2 now rejects, breaking
test_quantize_split and test_quantize_split_s8s8 on 16 CI legs.

Conform the test model to the ONNX Conv spec (Option-A philosophy):
- input [6,3] -> NCHW [1,1,6,3]
- conv_weight [6,3] -> 4D [2,1,1,1] (M=2 output channels, 1x1 kernel)
- data_reader feed [6,3] -> [1,1,6,3]

Conv output is now [1,2,6,3] = 36 elements, unchanged downstream:
reshape -> [3,12], Split(axis=0, [1,1,1]) -> three [1,12] outputs.
Op-type-count assertions are unaffected (only ranks changed).
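
The shape arithmetic above can be checked with a short sketch. It assumes the test's Conv uses the ONNX default attributes (stride 1, dilation 1, no padding, group 1), which the commit message does not state; `ConvOutputShape` and `ElementCount` are illustrative names, not functions from the test file.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Output shape of a 2-D Conv under the assumed defaults: stride 1,
// dilation 1, no padding. input is NCHW, weight is [M, C/group, kH, kW].
std::vector<int64_t> ConvOutputShape(const std::vector<int64_t>& input,
                                     const std::vector<int64_t>& weight) {
  assert(input.size() == 4 && weight.size() == 4);
  return {input[0], weight[0],
          input[2] - weight[2] + 1,   // H_out = H - kH + 1
          input[3] - weight[3] + 1};  // W_out = W - kW + 1
}

// Total number of elements in a shape.
int64_t ElementCount(const std::vector<int64_t>& shape) {
  int64_t count = 1;
  for (int64_t dim : shape) count *= dim;
  return count;
}
```

With input `{1, 1, 6, 3}` and weight `{2, 1, 1, 1}` the result is `{1, 2, 6, 3}`, whose 36 elements equal the element count of the `[3, 12]` reshape target.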

Validated against an rc2-linked onnxruntime build: full quantization
unittest suite (314 tests) passes; the two previously-failing split
tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n + atol override for fp16 causal_conv

Two ONNX-1.22 backend node-test failures were unmasked once the python quant
suite stopped aborting before the OnnxBackendNodeModelTest stage:

1. test_attention_4d_softcap_neginf_mask_expanded (+ _poison_expanded) ERROR
   on the macOS-arm64-webgpu Release backend-conformance leg: the ONNX
   function-EXPANDED Attention reference decomposition trips a SizeFromDimension
   underflow (dim = SIZE_MAX) downstream of the bias Add (prime suspect
   softmax.cc:177). The native FUSED Attention kernel
   (test_attention_4d_softcap_neginf_mask[_poison], no _expanded) passes on
   every arch, and CPU EP passes the expanded tests on x64 + Linux-arm64 and in
   the ONNX ReferenceEvaluator -- so this is not user-facing; only the expanded
   reference graph trips on that build.

   Placed in the GLOBAL current_failing_tests list (matching the existing
   expanded-attention webgpu-skip precedent at L38/40/42), so the 2 _expanded
   REFERENCE decompositions are skipped on all configs including CPU. Global was
   chosen over the current_failing_tests_WEBGPU section for a deterministic green
   that is independent of the supports_device('WEBGPU') runtime condition. The
   FUSED production Attention kernel stays fully covered on every arch; only the
   2 expanded reference variants stop running. Tracked for a follow-up fix.

2. test_causal_conv_with_state_silu_fp16 (+ _expanded) is a single-ULP fp16
   tolerance miss on arm64 (max rel diff 0.001174). Given an atol override of
   5e-4 (mirroring test_attention_4d_fp16) rather than a skip, so coverage is
   retained on all platforms.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ow (microsoft#28969)

The vectorized broadcast path counted trailing "shared" dimensions with a
loop that treated an exhausted operand's implicit size-1 dim as a match
against a literal 1. For unequal-rank operands with leading unit dims
(e.g. Add lhs=[1,1,6,6] + rhs=[6,6]), num_shared_dimension grew past the
smaller operand's rank, so lhs/rhs SizeFromDimension(rank - num_shared)
underflowed (size_t wrap to SIZE_MAX) and tripped ORT_ENFORCE in
TensorShape::SizeFromDimension, failing every WebGPU binary op at the
[...,1,d,e] + [d,e] corner.

Extract the shared-trailing-dim math into a deviceless free helper
CountSharedTrailingDimensions that breaks as soon as EITHER operand runs
out of real dimensions, bounding the count by min(lhs rank, rhs rank).
The shared-dim product (and thus the divisible-by-4 vectorize decision)
is unchanged for all existing cases; only the previously-underflowing
corner is corrected.
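
A minimal sketch of the difference, using plain `std::vector<int64_t>` shapes instead of `TensorShape`. The function names are hypothetical and the matching rule (a trailing dim is shared when both sizes are equal) is an assumption based on the description above; the real helper may apply further bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape = std::vector<int64_t>;

// Pre-fix behavior as described above: an exhausted operand is padded with an
// implicit 1, so leading unit dims of the longer operand keep "matching".
std::size_t CountSharedUnbounded(const Shape& lhs, const Shape& rhs) {
  std::size_t shared = 0;
  const std::size_t max_rank = std::max(lhs.size(), rhs.size());
  while (shared < max_rank) {
    const int64_t lhs_dim = shared < lhs.size() ? lhs[lhs.size() - 1 - shared] : 1;
    const int64_t rhs_dim = shared < rhs.size() ? rhs[rhs.size() - 1 - shared] : 1;
    if (lhs_dim != rhs_dim) break;
    ++shared;
  }
  return shared;
}

// Post-fix behavior: stop as soon as either operand runs out of real dims,
// so the count never exceeds min(lhs rank, rhs rank).
std::size_t CountSharedBounded(const Shape& lhs, const Shape& rhs) {
  std::size_t shared = 0;
  const std::size_t min_rank = std::min(lhs.size(), rhs.size());
  while (shared < min_rank &&
         lhs[lhs.size() - 1 - shared] == rhs[rhs.size() - 1 - shared]) {
    ++shared;
  }
  return shared;
}

// Product of dims from `start` to the end; `start` must not exceed the rank.
int64_t SizeFromDimensionSketch(const Shape& shape, std::size_t start) {
  assert(start <= shape.size());
  int64_t size = 1;
  for (std::size_t i = start; i < shape.size(); ++i) size *= shape[i];
  return size;
}
```

For lhs=[1,1,6,6] and rhs=[6,6], the unbounded count is 4, so `rhs.size() - 4` wraps around as a `size_t`; the bounded count is 2, and `rhs.size() - 2` is a valid start index of 0.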

Add a deviceless gtest on the helper and an end-to-end OpTester
regression on DefaultWebGpuExecutionProvider (Add [1,1,6,6]+[6,6]) that
fails pre-fix with the SIZE_MAX SizeFromDimension enforce and passes
post-fix. Verified locally against lavapipe software Vulkan; the full
elementwise/broadcast suite (62 tests) stays green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
 fixed)

Remove the two global skips for
test_attention_4d_softcap_neginf_mask_expanded and its _poison_expanded
variant. They were added only to dodge the WebGPU binary-elementwise
broadcast SizeFromDimension underflow (microsoft#28969), which is now fixed in
this branch by the CountSharedTrailingDimensions helper. The expanded
function-reference Attention tests can run again on every config.

The fp16 causal_conv atol override in
onnx_backend_test_series_overrides.jsonc is an independent tolerance fix
and is intentionally left in place.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pe, filter-vs-override, latent bugs)

Add three gotchas to .agents/skills/onnx-opset-bump-checklist/SKILL.md:
- (j) the Linux webgpu CI leg is build-only; ONNX backend node tests
  (OnnxBackendNodeModelTest) only execute on the macOS-arm64 webgpu leg, so a
  green Linux webgpu leg does not mean WebGPU actually ran them.
- (k) a filter-vs-override decision rubric: filters.jsonc SKIPs for a real EP bug
  (cite issue + removal condition), overrides.jsonc RELAXes ATOL for benign
  fp16/ULP diffs (prefer over a skip when the kernel is correct) — but only after
  root-causing the diff as ~1 ULP; unexplained/large/growing diffs are bugs.
- (l) new upstream reference tests can expose latent EP bugs (e.g. microsoft#28969, a
  WebGPU broadcast underflow surfaced by ONNX 1.22 expanded-Attention tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…on Linux without a GPU

Add .agents/skills/webgpu-local-testing/SKILL.md covering how to build and run
ONNX Runtime WebGPU provider tests on Linux using a software Vulkan adapter
(Mesa lavapipe): why software Vulkan suffices for host-side enforce/shape bugs
and MatMul-free kernels, dnf install on Azure Linux, the --use_webgpu build flag,
the onnxruntime_provider_test target with VK_ICD_FILENAMES, the lavapipe
MatMul-family crash gotcha, and the fact that the Linux webgpu CI leg is
build-only.

Scope is called out explicitly: any MatMul-containing graph (including the
expanded-Attention node tests that motivated microsoft#28969) cannot run on lavapipe and
is validated only on macOS-arm64 Metal; microsoft#28969 itself was validated on lavapipe
via a standalone Add-broadcast OpTester proxy, not the expanded-Attention node
test. The lavapipe ICD path is noted as arch-specific (x86_64 vs aarch64).

Cross-reference the new skill from ort-test (running WebGPU tests locally) and
ort-build (--use_webgpu key flag).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ONNX 1.22.0 was released 2026-06-15 (tag v1.22.0, commit 2bb50465). The
rc2->final delta touched only CI workflow yml + VERSION_NUMBER -- no operator
schema, opset, or backend-testdata change -- so this is pure version plumbing.

- cmake/deps.txt: onnx archive -> refs/tags/v1.22.0.zip (SHA1 2b2cd58a...)
- cmake/external/onnx submodule -> 2bb50465112feca9003e1ed654d77f01ff1415ca
- cmake/vcpkg-ports/onnx/portfile.cmake: REF v1.22.0 + tar.gz SHA512 13fafff0...
- 7 CI requirements.txt: onnx==1.22.0rc1 -> onnx==1.22.0 (now on PyPI); the 3
  transformers-model requirements stay frozen at onnx==1.18.0.
- onnx.patch / binskim.patch unchanged (source identical rc2<->final; still apply).
- filters.jsonc integration comment: 1.22.0rc1 -> 1.22.0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the integrate-onnx-1.22.0rc1 branch from e72e1de to 0a25d14 Compare June 15, 2026 20:22
@titaiwangms titaiwangms changed the title Integrate ONNX 1.22.0rc1 (opset 27) — issue #28752 Integrate ONNX 1.22.0 (opset 27) — issue #28752 Jun 15, 2026
titaiwangms and others added 2 commits June 15, 2026 22:06
The *CurrentOpset fusion regression tests build/load models stamped at the
current ONNX opset (27 in ONNX 1.22), which is still under development while
opset 26 is the last released version. Under ORT's default strict load-time
validation (ALLOW_RELEASED_ONNX_OPSET_ONLY unset or '1'), loading such a model
throws, so these tests failed on every strict CI leg.

Pass ModelOptions{allow_released_opsets_only=false, strict_shape_type_inference=false}
through the model-construction/load path of these tests (mirroring the existing
GatherToSliceFusion opset-27 precedent) so they RUN and PASS on every leg,
strict or not, preserving opset-27 fusion coverage with no masking.

- 9 TestGraphTransformer calls (Gelu/FastGelu/BiasGelu/MatMulAdd/DivMul/QuickGelu,
  LayerNorm/SkipLayerNorm, GQA-Qwen): append the ModelOptions argument.
- AttentionFusionMobileClipMhaCurrentOpsetTest (TransformerTester): thread an
  optional ModelOptions through TransformerTester; load the serialized model via
  the istream Load overload so allow_released_opsets_only is honored (the byte/
  proto Load overloads hardcode it). No product-code change.
- 3 EmbedLayerNorm tests: pass ModelOptions to Model::Load in LoadModelAtCurrentOpset.
- ReshapeFusionOpsetTest (ENABLE_TRAINING-only): its opset loop includes the
  current opset; apply the same ModelOptions to both TestGraphTransformer calls so
  it runs on strict training builds too. Training-gated (validated by analogy to
  the non-training pattern; not compiled in the default build).

Validated: 13 non-training tests PASS under default(strict), =1, and =0; full
onnxruntime_test_all under strict passes 1820 tests with no 'Opset 27 under
development' throw.
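
The strict-versus-relaxed load decision described in this commit can be sketched as below. Everything here is a stand-in: `ModelOptionsSketch` only mirrors the two field names quoted above, and `CheckOpsetSketch` is a hypothetical reduction of the load-time check, not onnxruntime code.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in mirroring the two ModelOptions fields named in this PR; the real
// onnxruntime type may carry more state.
struct ModelOptionsSketch {
  bool allow_released_opsets_only;
  bool strict_shape_type_inference;
};

// Hypothetical load-time check: reject a model stamped above the last
// released opset when only released opsets are allowed.
void CheckOpsetSketch(int model_opset, int last_released_opset,
                      const ModelOptionsSketch& options) {
  assert(model_opset > 0 && last_released_opset > 0);
  if (options.allow_released_opsets_only && model_opset > last_released_opset) {
    throw std::runtime_error("Opset " + std::to_string(model_opset) +
                             " under development");
  }
}

// Convenience wrapper so the outcome can be inspected without try/catch.
bool LoadsSketch(int model_opset, int last_released_opset,
                 const ModelOptionsSketch& options) {
  try {
    CheckOpsetSketch(model_opset, last_released_opset, options);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Under this model, an opset-27 test model fails to load with `{true, false}` while opset 26 is the last released version, and loads with `{false, false}` regardless of how the CI leg is configured.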

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add WHY comments + tracking issue refs (microsoft#28966, and microsoft#28969 on the WebGPU
attention-fusion path) to the ModelOptions{allow_released_opsets_only=false}
call sites in the *CurrentOpset fusion tests, so a future reader knows they can
be removed once ONNX opset 27 ships. No test logic or ModelOptions args change.

Extend the onnx-opset-bump-checklist skill with three hard-won gotchas from the
1.22.0 integration: (m) the vcpkg MS-internal asset mirror must be Terrapin-seeded
with the new tag tarball or every --use_vcpkg leg 404s; (n) a FINAL onnx release
can still ship a map-max opset > last released opset (1.22.0: 27 > 26), leaving it
under-development; (o) prefer per-model ModelOptions{allow_released_opsets_only=false}
over per-leg CI env flips or GTEST_SKIP.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms

Contributor Author

Review summary (review-team pass) — ONNX 1.22.0 / opset 27

Structurally matches the merged 1.21 bump (#27601): same version-plumbing, same versioned-split pattern for new-opset ops, same kMaxSupportedOpset bump. The EP GetMaxSupportedOpSet 25→27 jump is a legitimate catch-up (1.21 left these EPs at 25), not a regression. All spec claims were verified against the pinned 1.22 commit. No Critical/Major blockers.

Actionable

1. (Minor / consistency) WebGPU Range missed from the follow-up list
onnxruntime/core/providers/webgpu/generator/range.cc:96 registers Range open-ended at .SinceVersion(11), so WebGPU will claim opset-27 Range — the exact situation deferred for JS/DML, but WebGPU isn't in the follow-up list. Same risk class as the already-accepted JS deferral, so not a blocker. Suggest adding WebGPU alongside JS/DML in the follow-ups (or splitting it now).

2. (Minor / open question) bf16 Range-27 is untested; description slightly overstates
onnx_backend_test_series_filters.jsonc excludes both test_range_bfloat16_* base and _expanded variants (the Python harness can't materialize bf16), so bf16 Range-27 has no passing test. fp16 is covered (its node test isn't excluded and passes via function expansion). The description says "fp16/bf16 Range … pass", but bf16 is actually excluded. Low risk (same EP-agnostic function-expansion path as fp16). Suggest softening the wording, or adding a C++ OpTester bf16 Range-27 test that bypasses the numpy harness.

Optional polish (readability, non-blocking)

  • Extract the ModelOptions{/*allow_released_opsets_only*/ false, /*strict_shape_type_inference*/ false} literal (repeated ~10×) into a named kAllowUnreleasedOpset constant — the GatherToSlice site already shows the pattern.
  • binary_elementwise_broadcast_utils.h: move the header-design rationale out of the function docstring; rename dimA/dimB → lhs_dim/rhs_dim; the in-loop comment largely duplicates the block comment above.
  • Add @param model_options to the single-opset TestGraphTransformer doxygen overload (only the vector<int> overload was updated).

Praise

Generated by a 5-reviewer pass (readability, code, critical, deep-spec, integration); the two Minor items above were re-verified against the source before posting.

@tianleiwu tianleiwu left a comment

Contributor

Review summary — ONNX 1.22.0 / opset 27 integration

Careful, well-documented version bump. I independently re-verified the risky
plumbing and the operator/optimizer/test changes; no blocking issues, only
optional nitpicks. Verdict: approve-leaning (commenting, not gating).

Independent verification (re-checked at this head)

| Check | Result |
|---|---|
| cmake/external/onnx submodule pin | 2bb50465112feca9003e1ed654d77f01ff1415ca = v1.22.0 tag commit ✅ |
| cmake/deps.txt SHA1 of v1.22.0.zip | sha1sum of fresh download matches 2b2cd58a… |
| portfile.cmake SHA512 of v1.22.0.tar.gz | sha512sum of fresh download matches 13fafff0… |
| vcpkg MS asset mirror (portfile SHA512) | …/artifacts/13fafff0… → HTTP 200 (mirrored; --use_vcpkg legs won't 404) ✅ |
| onnx.patch / binskim.patch | byte-identical (sha1 6a4e6ed8…) ✅ |
| requirements pins | all 7 CI files at 1.22.0; transformers files correctly left at 1.18.0 |

The onnx.patch rebase is sound: ONNX_MINIMAL_BUILD was re-expressed against
1.22's add_subdirectory(onnx) layout via target_sources(onnx PRIVATE … data_type_utils.cc),
the GroupNormalization-18 .Deprecate() removal is kept, and the now-upstreamed
Utils.cmake protobuf-warnings hunk is correctly dropped.

Correctness highlights

  • Range opset-27 split (CPU + CUDA): versioned [11,26] + new 27, matching
    forward-declares and BuildKernelCreateInfo. fp16/bf16 correctly fall through to
    ONNX function-body expansion; stash_type is irrelevant to the native path.
  • WebGPU broadcast fix (#28969): CountSharedTrailingDimensions now breaks once
    either operand is exhausted, bounding the shared run by min(lhs_rank, rhs_rank).
    Every divergence from the old inline loop is exactly a previously-underflowing
    case (SizeFromDimension(rank − num_shared) size_t wrap), so the fix is strictly
    safer. Deviceless unit test + end-to-end Add-broadcast coverage is thorough, and
    the Dawn-free header avoids a webgpu-provider link dependency in the CPU test TU.
  • ConvTranspose output_shape conformance (breaking): rank+2 form now rejected
    at Graph::Resolve for static rank (onnx#5400); the retained kernel branch is
    correctly documented as runtime-reachable only for dynamic-rank inputs, and the
    new ConvTranspose_RankPlus2_OutputShape_DynamicRankInput_Runtime test exercises
    exactly that path.
  • Backend-test filters: ^test_flexattention_(?!.*expanded) is correct given
    ONNX's _cpu/_cuda method-name suffix — it excludes base test_flexattention_cpu
    while preserving the _expanded_ver26 variants.
  • Test infra: threading ModelOptions + mirroring strict_shape_type_inference
    onto the session and loading via std::istream (so allow_released_opsets_only is
    honored) is the right way to exercise under-development opset 27 on strict legs.
    Each relaxed call site is annotated with #28966.
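
The negative-lookahead filter from the backend-test bullet above can be exercised in isolation. This sketch uses `std::regex` (ECMAScript grammar) as a stand-in for whatever regex engine the Python test harness applies, which is an assumption; `IsExcludedByFlexAttentionFilter` is a hypothetical name, and the second test name in the note is illustrative rather than copied from the ONNX test suite.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when the filter pattern matches, i.e. the test would be excluded.
// The lookahead (?!.*expanded) rejects any name that contains "expanded"
// after the "test_flexattention_" prefix.
bool IsExcludedByFlexAttentionFilter(const std::string& test_name) {
  assert(!test_name.empty());
  static const std::regex pattern(R"re(^test_flexattention_(?!.*expanded))re");
  return std::regex_search(test_name, pattern);
}
```

Here `test_flexattention_cpu` is excluded, while a name such as `test_flexattention_expanded_ver26_cpu` is kept because the lookahead fails.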

Minor / optional (non-blocking)

  1. Spurious 1 in Range version lists. {1, 11, 27} in gather_fusion.cc and
    embed_layer_norm_fusion.cc carries a leading 1, but Range has no opset-1
    schema, so 1 can never match a node's SinceVersion. Harmless and pre-existing,
    but {11, 27} would be marginally cleaner since these lines are already being
    edited.
  2. Deferred follow-ups confirmed, not missed: JS Range stays open-ended at 11
    (still matches opset-27 nodes), DML/ROCm unsplit, multi-EP OperatorKernels.md
    regen, and a native fp16/bf16 Range-27 kernel are all listed as explicit
    follow-ups in the description.

The new onnx-opset-bump-checklist and webgpu-local-testing skills capture the
exact gotchas hit here — a good durable artifact for the next bump.

titaiwangms and others added 2 commits June 16, 2026 00:22
…exec-11)

Test-only / docs-only readability polish on top of the opset-27 ModelOptions
fusion-test fix. No behavior change.

1. Extract the magic boolean used at every *CurrentOpset fusion-test site.
   Introduce `constexpr bool kAllowReleasedOpsetsOnly` (in the shared
   graph_transform_test_builder.h, namespace onnxruntime::test) and use it as
   the ModelOptions::allow_released_opsets_only first argument at all 14 call
   sites across graph_transform_test.cc, graph_transform_test_layernorm.cc,
   group_query_attention_pre_norm_fusion_test.cc, and the GatherToSlice
   precedent. The constant mirrors the ctor argument name exactly so each site
   reads ModelOptions{kAllowReleasedOpsetsOnly, ...} (false = do not restrict to
   released opsets, i.e. load models stamped at the not-yet-released opset).
   strict_shape_type_inference=false behavior unchanged.

2. binary_elementwise_broadcast_utils.h: add Doxygen @param/@return docs to
   CountSharedTrailingDimensions and rename the local dimA/dimB -> lhs_dim/
   rhs_dim for clarity. Stays inline + Dawn-free (only tensor_shape.h);
   behavior unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address tianleiwu's review nit on PR microsoft#28754: the Range op was introduced at
ONNX opset 11 (there is no opset-1 Range schema), so the leading `1` in the
`{1, 11, 27}` version lists is dead and never matches. Trim it to `{11, 27}`,
keeping 27 so opset-27 Range nodes still match.

Sites:
- onnxruntime/core/optimizer/gather_fusion.cc (Range->Gather->Slice matcher)
- onnxruntime/core/optimizer/embed_layer_norm_fusion.cc (two Range path-matchers)

No behavior change: opset-1 Range never existed, so removing it cannot drop any
real match; 11 and 27 are preserved.
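
Why the trim is behavior-neutral can be seen with a small exact-set matcher. `MatchesAnyVersion` is a hypothetical stand-in for the matching rule described here, not the optimizer's actual helper.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical exact-set matcher: a node matches only if its SinceVersion is
// literally one of the listed versions.
bool MatchesAnyVersion(int node_since_version, const std::vector<int>& versions) {
  assert(node_since_version > 0);
  return std::find(versions.begin(), versions.end(), node_since_version) !=
         versions.end();
}
```

A Range node can only report a SinceVersion for which a schema exists (11 or 27 per this commit), and for both of those values the lists `{1, 11, 27}` and `{11, 27}` give the same answer.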

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review, @tianleiwu, and for independently re-verifying the version pins, vcpkg mirror, onnx.patch rebase, and the Range/WebGPU/ConvTranspose changes.

Good catch on the spurious leading 1 in the Range version lists. Since Range was introduced at ONNX opset 11, there's no opset-1 schema, so the 1 could never match a node's SinceVersion — it was dead. I've cleaned it up to {11, 27} (keeping 27 for the opset-27 Range nodes) at both sites:

  • onnxruntime/core/optimizer/gather_fusion.cc
  • onnxruntime/core/optimizer/embed_layer_norm_fusion.cc

The change is behavior-identical (exact-set version matching, the 1 was unreachable). Pushed as db8d8bbc24; CI is re-running on the new head.

The other items you flagged (JS Range left open-ended at 11, DML/ROCm unsplit, OperatorKernels.md regen, native fp16/bf16 Range-27 kernel) are intentional follow-ups noted in the PR description, not oversights. Let me know if you'd like any of those pulled into this PR instead.

Could you take another look and formally approve when you have a chance? Thanks!

@tianleiwu tianleiwu left a comment

Contributor

Re-review at db8d8bbc (delta since my prior review at 80674c19)

Re-reviewed the two commits added since my last pass. The bump stays clean — no blocking issues from my side.

Resolved

  • Spurious leading 1 in the Range version lists (my prior nitpick): now {11, 27} in gather_fusion.cc and embed_layer_norm_fusion.cc. Verified behavior-identical — IsSupportedOptypeVersionAndDomain does exact-set matching and Range has no opset-1 schema, so the 1 was unreachable.

Independent re-verification at this head

  • Range-27 native kernel + stash_type: confirmed against the pinned ONNX 1.22.0 source (onnx/defs/generator/defs.cc, Range/27 + BuildFunctionBodyRange27). Opset 27 only adds float16/bfloat16 to T and a stash_type attribute that is consumed only when T ∈ {float16, bfloat16} ("Has no effect for other types"). The native CPU/CUDA kernels keep the 5-type constraint {float, double, int16, int32, int64}, for which the opset-27 function body is identical to opset 11 — so the [11,26] + 27 split is behavior-preserving and fp16/bf16 correctly route through ONNX function expansion. The kernel comments are accurate.
  • Broadcast util refactor (binary_elementwise_broadcast_utils.h): the doc + lhs_dim/rhs_dim rename does not change the loop's underflow-safe bound min(lhs rank, rhs rank, output_rank-1). Still correct.

Confirming the two minor items from your review-team pass (non-blocking)

  • WebGPU Range open-ended at .SinceVersion(11) (webgpu/generator/range.cc:96, not in this diff): I verified it is behavior-correct for its T ∈ {float, int32, int64} constraint — none of those types are touched by the opset-27 change and fp16/bf16 aren't registered there, so the SinceVersion(11) kernel that now also serves opset-27 nodes produces identical results. Purely a consistency/clarity gap; agree it belongs in the follow-up list alongside the JS/DML deferral (or split now). Not a blocker.
  • bf16 Range-27 test coverage / description wording: agree — the backend filter excludes both the base and _expanded bf16 variants, so bf16 has no passing test (fp16 does, via function expansion). Softening the description's "fp16/bf16 … pass" or adding a C++ OpTester bf16 case would make it precise. Minor.

The named-constant + broadcast-doc polish in d00fd69f reads well.

Verdict: no blocking issues; the remaining items are documented, non-blocking follow-ups.

@titaiwangms titaiwangms merged commit 43fd961 into microsoft:main Jun 16, 2026
86 checks passed
@titaiwangms titaiwangms deleted the integrate-onnx-1.22.0rc1 branch June 16, 2026 20:46