[CoreML EP] Support Gather with scalar 'indices'#28278
Open
maxwbuckley wants to merge 11 commits into
Open
Conversation
ee07752 to
28796b4
Compare
The CoreML GatherOpBuilder rejected rank-0 (scalar) indices because CoreML's `gather` requires rank-1+ indices and the workaround would change the output rank. Fixes microsoft#28180 by performing the workaround internally: when indices are scalar, the builder now emits reshape(indices, shape=[1]) -> indices_1d gather(data, indices_1d, axis) -> shape data_shape with axis=1 squeeze(., axes=[axis]) -> ONNX gather output shape in both the MLProgram and NeuralNetwork emitters. We use `reshape` rather than `expand_dims` because CoreML internally pads scalars and expand_dims on the padded tensor can push the apparent rank past CoreML's rank-5 limit. Restrictions: - 'data' must have a fully static shape (we claim a static intermediate shape between gather and squeeze). - 'data' rank capped at 4 (the rank-5 case still trips CoreML's compiler with "Invalid rank: 6", so we keep the conservative boundary). Dynamic-shape and rank-5+ scalar Gather still falls back to CPU. Motivated by StyleGAN-family generators (e.g. GFPGAN), which select per-layer style codes with a scalar-index Gather. Combined with microsoft#28270 (pre-opset-13 Split), GFPGAN-1024 (input 1x3x512x512) goes from 9 partitions / 534 CoreML nodes to 1 partition / 557 CoreML nodes on M3 Max. Adds six tests covering both NeuralNetwork and MLProgram emitters: GatherScalarIndicesAxis1 - axis=1, mid-rank squeeze GatherScalarIndicesAxis0 - axis=0, leading-axis squeeze GatherScalarIndicesNegativeAxis - axis=-1, HandleNegativeAxis path GatherScalarIndicesFloat16 - fp16 data (MLProgram only) GatherScalarIndicesInt64Data - int64 data, both formats GatherScalarIndicesRank4Data - rank-4 data, exercises the cap Updates the comment on the existing GatherWithScalarIndices test to reflect that scalar Gather is now supported when 'data' is fully static and rank-4 or below; the test continues to exercise the dynamic-shape fall-back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
28796b4 to
8d966dd
Compare
Resolves conflict in coreml_basic_test.cc where this branch's GatherScalarIndices* tests landed in the same file region as the Split11/13/7 tests merged via microsoft#28270. Both sets are preserved sequentially (Gather first, then Split). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four small follow-ups based on review-style scrutiny of the scalar
'indices' path:
1. Validate `indices` dtype in IsOpSupportedImpl. The MLProgram dispatch
reads `indices_dtype` via GetType and silently defaulted to INT32 if
the call failed. ONNX Gather's schema constrains indices to INT32 or
INT64; reject anything else up front so the dispatch can trust the
value (and ORT_RETURN_IF_NOT the GetType call as a defensive backstop).
2. Add a TODO above the data-rank-4 cap explaining the underlying CoreML
compiler bug ("Invalid rank: 6" on rank-5 reshape+gather intermediates)
and noting the cap can be lifted to 5 when Apple fixes it.
3. Document why the NN-format scalar path uses expand_dims rather than
reshape, parallel to the existing comment on the MLProgram path. NN's
expand_dims doesn't pad rank internally, so we don't hit the
apparent-rank inflation that drove reshape+gather there.
4. Two new negative tests:
- GatherScalarIndicesDynamicDataNotSupported — dynamic 'data' shape
with scalar indices falls back (was rejected silently; now covered).
- GatherScalarIndicesRank5DataNotSupported — rank-5 'data' with
scalar indices falls back (the bug-driven cap above is now tested).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brief comment explaining that the static_cast normalises the proto- generated enum (whose underlying type isn't guaranteed) to the codebase's int32_t convention for ONNX dtype codes — and that the "INT64" in the constant name refers to the dtype tag (7), not int64 data being narrowed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The proto enum *is* typed (`enum TensorProto_DataType : int` per onnx-ml.pb.h), so the previous "underlying type isn't guaranteed" was wrong. The actual reason for the cast is narrower: `int` and `int32_t` are distinct C++ types (even when layout-compatible on the platforms ORT targets), and the codebase uses `int32_t` for ONNX dtype codes. Comment now states that accurately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two constexpr dtype tags only ever appeared in `indices_dtype == kInt64 ? kInt64 : kInt32`, which was a no-op once IsOpSupportedImpl gates indices to exactly INT32 or INT64 — the expression simplifies to just `indices_dtype`. Drop the constants and pass `indices_dtype` directly to AddIntermediateOperationOutput. This also makes the dtype handling consistent with the surrounding `output_dtype` line, which already uses an inline static_cast on the proto enum without a named constant. (Net effect: less code, less to read, no behavior change. Spotted in review.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The other six positive scalar-indices tests use INT64 indices (the PyTorch default), so the INT32-indices code path through both the new dtype gating in IsOpSupportedImpl and the indices_dtype path-through to the reshape's intermediate output was untested. Adds GatherScalarIndicesInt32Indices to cover it on both NN and MLProgram formats. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tyle The earlier rewrite of the Gather builder dropped the "// coreml docs claims validate_indices is optional but in practice it is required" comment and changed `const auto validate_indices = false;` to `constexpr bool validate_indices = false;`. Both were drive-by churn unrelated to the scalar-indices fix this PR is supposed to add. The comment in particular is institutional context — without it, the next reader sees `validate_indices = false` being passed via AddScalarConstant and reasonably wonders why we bother (the CoreML docs say it's optional). Restoring it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # onnxruntime/test/providers/coreml/coreml_basic_test.cc
Contributor
Author
|
@yuslepukhin thank you again for all your support here 👍 |
yuslepukhin
reviewed
May 14, 2026
…test - Use TensorShapeVector for the small inline `gather_shape` buffer (was std::vector<int64_t>). - Add GatherScalarIndicesRank1Data covering rank-1 data with scalar indices (output is rank-0), the previously-untested corner of the static-shape branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member
|
Has merge conflicts. |
This was referenced May 20, 2026
# Conflicts: # onnxruntime/test/providers/coreml/coreml_basic_test.cc
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
The CoreML `GatherOpBuilder` rejected rank-0 (scalar) `indices` because CoreML's `gather` requires rank-1+ indices and the obvious workaround would change the output rank — see the `// Don't allow scalar 'indices' input.` comment at `gather_op_builder.cc:90`. This PR performs the workaround internally:
```
reshape(indices, shape=[1]) -> indices_1d
gather(data, indices_1d, axis) -> data_shape with the gather axis = 1
squeeze(., axes=[axis]) -> ONNX gather output shape
```
…in both the MLProgram and NeuralNetwork emitters. The squeeze restores the original ONNX output rank, so caller-visible Gather semantics are unchanged. `reshape` is used rather than `expand_dims` because CoreML internally pads scalars and `expand_dims` on the padded tensor can push the apparent rank past the rank-5 limit on high-rank `data`.
Restrictions:
Dynamic-shape and rank-5+ scalar Gather still falls back to CPU (preserves the existing `GatherWithScalarIndices` test, whose data is dynamic-shape).
Fixes #28180.
Motivation
StyleGAN-family generators (StyleGAN, StyleGAN2, GFPGAN, …) select per-layer style codes with a scalar-index Gather. The resulting graph alternates between Gather and the rest of the generator, splitting the CoreML subgraph repeatedly.
On GFPGAN-1024 (
[1, 3, 512, 512]), this PR moves all 16 scalar Gathers off CPU and the model lands on a single CoreML partition.M3 Max, MLProgram, batch 1, 3 × 100-iter steady-state runs (n=300):
Mean −8.8%, stddev −50%, P99 −17%, max −18.5% — eliminating the CPU↔CoreML round-trip on every scalar Gather both speeds up the steady state and tightens the tail.
Striking secondary effect: the worst-case run with the fix (88.03 ms) is faster than the mean run without it (89.68 ms). Every single fixed inference over n=300 lands below the unfixed average.
Tests
Six new tests in `onnxruntime/test/providers/coreml/coreml_basic_test.cc` covering distinct code paths, exercised on both NeuralNetwork and MLProgram emitters where the dtype is supported:
Each verifies CoreML output against the CPU EP reference and asserts `ExpectedEPNodeAssignment::All`. The existing `GatherWithScalarIndices` test (dynamic-shape data) is updated only in its comment to reflect the new precise condition; it still exercises the CPU fall-back as before.
All pass locally on macOS 26.3 / M3 Max.