
[CoreML EP] Add QuickGelu support#28184

Merged
yuslepukhin merged 4 commits into microsoft:main from maxwbuckley:coreml-quickgelu
Apr 28, 2026

Conversation


maxwbuckley (Contributor) commented Apr 22, 2026

Description

Adds support for com.microsoft:QuickGelu (x * Sigmoid(alpha * x)) to the CoreML Execution Provider's MLProgram path. The builder decomposes QuickGelu into three MIL ops (mul / sigmoid / mul), matching the op's own schema function-body in contrib_defs.cc:605-631 and the approach the QNN EP already uses in qnn/builder/opbuilder/quick_gelu_op_builder.cc. Only the MLProgram path is implemented; NeuralNetwork is deprecated on Apple Silicon.
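For reference, the op's semantics and the mul / sigmoid / mul decomposition can be sketched in scalar form (a hypothetical standalone sketch; the actual builder wires MIL ops rather than computing values):

```cpp
#include <cmath>

// QuickGelu reference semantics: x * Sigmoid(alpha * x).
float Sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

float QuickGelu(float x, float alpha) {
  return x * Sigmoid(alpha * x);
}

// The three-op decomposition the builder emits, step by step.
float QuickGeluDecomposed(float x, float alpha) {
  float t0 = alpha * x;    // MIL mul(x, alpha)
  float t1 = Sigmoid(t0);  // MIL sigmoid
  return x * t1;           // MIL mul(x, t1)
}
```

Both forms compute the same value, which is why the decomposition is semantics-preserving.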

Adds CoreMLExecutionProviderTest.QuickGeluTest which builds a single com.microsoft:QuickGelu node with non-default alpha=1.5 and verifies the entire graph is claimed by the CoreML EP via ExpectedEPNodeAssignment::All. Verified with a negative test: temporarily removing the CreateQuickGeluOpBuilder registration causes the new test to fail with a VerifyEPNodeAssignment fatal failure, proving it genuinely exercises the CoreML path.

Also updates coreml_supported_mlprogram_ops.md.

Motivation and Context

Fixes #28183.

QuickGelu is produced by ORT's own QuickGeluFusion optimizer pass (onnxruntime/core/optimizer/quick_gelu_fusion.cc), which runs at ORT_ENABLE_EXTENDED and therefore also at ORT_ENABLE_ALL, the default session optimization level. Any model containing the x * sigmoid(alpha * x) pattern (CLIP, several mobile transformers, the DWPose pose estimator) is silently rewritten by ORT into a graph with QuickGelu nodes that the CoreML EP then rejects. The fusion turns 3 supported primitives into 1 unsupported op, making it strictly harmful for CoreML.

On the DWPose dw-ll_ucoco_384.onnx model with batch=1 and ORT_ENABLE_EXTENDED, 76 QuickGelu nodes get produced. Running the result on the CoreML EP:

| ORT build | CoreML subgraphs | Inference (ms) |
| --- | --- | --- |
| main (QuickGelu rejected) | ~80 (each QuickGelu is a graph break) | 54.77 |
| this PR (QuickGelu supported) | 10 | 13.91 |

The remaining breaks come from other ops (see "Related CoreML EP gaps" below). This patch alone yields a ~4× speedup at the EXTENDED level.

Even at the default ORT_ENABLE_ALL with a symbolic batch dim (where partial shape inference inhibits most fusions), 3 QuickGelu nodes still get produced — so this patch helps any CoreML user who hasn't explicitly downgraded to ORT_ENABLE_BASIC.

Related CoreML EP gaps observed (out of scope for this PR)

With QuickGelu fixed, the remaining 9 CPU-fallback nodes on the EXTENDED-optimized DWPose pose model are:

  • com.microsoft:FusedConv (×4) — produced by ConvActivationFusion. Fuses Conv + activation into one node. Same failure mode as QuickGelu: Conv and the activations (Relu, Sigmoid, HardSigmoid, etc.) are individually CoreML-supported, but the fused form isn't. Decomposition is straightforward — emit the underlying conv MIL op, then the corresponding activation.
  • com.microsoft:FusedMatMul (×2, from MatMulScaleFusion) — MatMul * alpha with an optional transpose. Decomposition: matmul + scalar mul.
  • ai.onnx:Split (×2) — pre-existing CoreML EP gap unrelated to fusion. CoreML MIL has a native split op; this one is a straight op-builder omission.
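The FusedMatMul decomposition suggested above can be illustrated in scalar form on 2×2 matrices (a hypothetical sketch with illustrative names, not EP code, and ignoring the optional transpose attribute):

```cpp
#include <array>

using Mat2 = std::array<std::array<float, 2>, 2>;

// FusedMatMul(A, B, alpha) decomposed as a plain matmul followed by a
// scalar mul, as proposed above.
Mat2 MatMulScale(const Mat2& a, const Mat2& b, float alpha) {
  Mat2 c{};
  for (int i = 0; i < 2; ++i) {
    for (int j = 0; j < 2; ++j) {
      float acc = 0.0f;
      for (int k = 0; k < 2; ++k) acc += a[i][k] * b[k][j];  // matmul
      c[i][j] = alpha * acc;                                 // scalar mul
    }
  }
  return c;
}
```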

Happy to send follow-up PRs for any of these after this one lands, following the same pattern. Flagging here so they're on the EP coverage roadmap.


Copilot AI left a comment


Pull request overview

Adds CoreML EP MLProgram support for the com.microsoft:QuickGelu contrib op by lowering it into existing MIL primitives, improving CoreML graph-claim coverage for models affected by ORT’s QuickGeluFusion.

Changes:

  • Register a new CoreML op builder for com.microsoft:QuickGelu and decompose it into mul -> sigmoid -> mul in the MLProgram path.
  • Add a CoreML EP unit test that builds a single-node QuickGelu model (with non-default alpha) and verifies full graph assignment to CoreML.
  • Update the MLProgram supported-ops documentation to include com.microsoft:QuickGelu.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tools/ci_build/github/apple/coreml_supported_mlprogram_ops.md | Documents QuickGelu as supported in the MLProgram path. |
| onnxruntime/test/providers/coreml/coreml_basic_test.cc | Adds CoreMLExecutionProviderTest.QuickGeluTest for EP assignment + output verification. |
| onnxruntime/core/providers/coreml/builders/op_builder_factory.h | Declares CreateQuickGeluOpBuilder. |
| onnxruntime/core/providers/coreml/builders/op_builder_factory.cc | Registers the QuickGelu builder in the factory. |
| onnxruntime/core/providers/coreml/builders/impl/quick_gelu_op_builder.cc | Implements QuickGelu decomposition into MIL ops for MLProgram. |


Comment thread on onnxruntime/test/providers/coreml/coreml_basic_test.cc (outdated)
@yuslepukhin
Member

The PR may require rebase from main when the pipelines are fixed.

@maxwbuckley
Contributor Author

Thanks for catching that — rebased on main and dropped the misleading comment along with the redundant params.fp32_abs_err assignment (1e-5f was already the default). The test passes with the plain EPVerificationParams{ExpectedEPNodeAssignment::All} constructor now.

@yuslepukhin yuslepukhin requested a review from Copilot April 23, 2026 19:20
@maxwbuckley
Contributor Author

Note: I force-pushed this branch too (one commit, same "rebase + address Copilot's inline comment" amendment pattern). Realized on #28182 that force-pushing wipes the "changes since last review" view — sorry for the same here. Going forward I'll stack follow-up commits instead of amending.

Delta since the original commit (275498df6a):

  1. Fix misleading tolerance comment — drop the params.fp32_abs_err = 1e-5f line (1e-5f was already the default so it wasn't loosening anything) and the comment claiming otherwise. Test now uses EPVerificationParams{ExpectedEPNodeAssignment::All} inline.
  2. Rebase onto current main — no code changes, fast-forward over 4265122712.


Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.



Comment thread on onnxruntime/core/providers/coreml/builders/impl/quick_gelu_op_builder.cc (outdated)
@yuslepukhin
Member

Apart from the comments above: the core QuickGelu implementation is functionally correct, spec-compliant, and exception-safe. Recommended actions before merge:

  • Add a FLOAT16 test (the most impactful gap)
  • Split out the unrelated NuGet/pipeline changes into their own PR
  • Consider making the non-MLProgram path return an error instead of silent OK (optional, matches existing convention)
  • Consider the alpha ≈ 1.0 optimization (optional)

maxwbuckley added a commit to maxwbuckley/onnxruntime that referenced this pull request Apr 24, 2026
Adds `CoreMLExecutionProviderTest.QuickGeluTestFp16` — same single-node
model and non-default alpha=1.5 as the existing QuickGeluTest, but with
FLOAT16 input/output. Exercises the MLFloat16 branch of the alpha-scalar
wiring in `QuickGeluOpBuilder::AddToModelBuilderImpl`.

Tolerance widened to 2e-2 (fp16 ulp at magnitude 20 is ~0.01).

Addresses review feedback on microsoft#28184.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Contributor Author

Addressing feedback in stacked commits.

Just pushed (7f36cfcf56): added the FLOAT16 test, CoreMLExecutionProviderTest.QuickGeluTestFp16. Same single-node model and non-default alpha=1.5 as the fp32 variant, with FLOAT16 input/output. Exercises the MLFloat16 branch of the alpha-scalar wiring. Tolerance 2e-2 (fp16 ulp at magnitude 20 is ~0.01, with headroom for the 3-op decomposition).
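As a sanity check on that tolerance: IEEE binary16 has 10 explicit mantissa bits, so the ulp at magnitude m is 2^(floor(log2 m) − 10). A hypothetical helper (not ORT code):

```cpp
#include <cmath>

// Ulp of an IEEE binary16 value at the given (normal, positive) magnitude:
// 10 explicit mantissa bits, so ulp = 2^(exponent - 10).
float Fp16Ulp(float magnitude) {
  int exponent = static_cast<int>(std::floor(std::log2(magnitude)));
  return std::ldexp(1.0f, exponent - 10);
}
```

At magnitude 20 this gives 2^-6 = 0.015625, so the 2e-2 tolerance sits just above one ulp, leaving modest headroom for the 3-op decomposition's accumulated rounding.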

Still on my list for follow-up commits:

  • Shape availability check in IsOpSupportedImpl (Copilot's line-45 comment)
  • Fail-fast on the non-MLProgram path in AddToModelBuilderImpl (Copilot's line-48 comment + your "silent OK" note)
  • alpha ≈ 1.0 skip optimization (optional)

One clarification needed: your review mentioned "Split out the unrelated NuGet/pipeline changes into their own PR" — but this PR only touches 5 files, all CoreML-EP-related (quick_gelu_op_builder.cc, op_builder_factory.{cc,h}, coreml_basic_test.cc, coreml_supported_mlprogram_ops.md). I don't see any NuGet or pipeline changes on the branch. Could you point me at what you're seeing? It's possible you're thinking of a different PR, or there's something showing up for you that isn't showing up in gh pr view 28184 --json files for me.

maxwbuckley added a commit to maxwbuckley/onnxruntime that referenced this pull request Apr 24, 2026
When `alpha` is within 1e-6 of 1.0 (e.g. CLIP's `x * sigmoid(x)`), skip
the leading `mul(x, alpha)` in `QuickGeluOpBuilder::AddToModelBuilderImpl`
and feed `x` straight into `sigmoid`. Saves one MIL op per QuickGelu
node and avoids the rounding it would introduce. Mirrors the same
optimization in the QNN builder
(`qnn/builder/opbuilder/quick_gelu_op_builder.cc:42-49`).

Adds `CoreMLExecutionProviderTest.QuickGeluTestAlphaOne` covering the
`alpha=1.0` branch with `ExpectedEPNodeAssignment::All`. Verified via
negative test: temporarily forcing `skip_alpha_mul` for all alphas
causes the alpha=1.5 tests (fp32 + fp16) to fail with a tolerance
mismatch while alpha=1.0 still passes, proving both branches are
exercised.

Addresses optional review feedback on microsoft#28184.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maxwbuckley
Contributor Author

Just pushed a2781ee6e1: alpha ≈ 1.0 skip optimization.

When |alpha - 1.0| < 1e-6 (CLIP's x * sigmoid(x)), skip the leading mul(x, alpha) and feed x straight into sigmoid. Saves one MIL op per node and avoids its rounding. Same logic the QNN builder uses (quick_gelu_op_builder.cc:42-49 there).

Added CoreMLExecutionProviderTest.QuickGeluTestAlphaOne covering that branch. Verified via the same negative-test discipline as the other tests: temporarily forcing skip_alpha_mul = true for all alphas causes the alpha=1.5 tests (fp32 + fp16) to fail while alpha=1.0 still passes — confirming both branches are genuinely exercised.
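The branch structure can be sketched in scalar form (hypothetical; the real builder decides which MIL ops to emit, and skip_alpha_mul mirrors the flag named in the commit message):

```cpp
#include <cmath>

float Sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// alpha ~= 1.0 fast path: elide the leading mul and feed x into sigmoid.
float QuickGeluWithSkip(float x, float alpha) {
  const bool skip_alpha_mul = std::fabs(alpha - 1.0f) < 1e-6f;
  const float scaled = skip_alpha_mul ? x : alpha * x;  // leading mul elided
  return x * Sigmoid(scaled);
}
```

Both branches agree at alpha=1.0 (1.0f * x is exact in float), which is why the skip is safe as well as cheaper.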

Remaining items from Copilot's review (shape availability check in IsOpSupportedImpl, fail-fast for non-MLProgram path in AddToModelBuilderImpl) coming in one more stacked commit.

maxwbuckley added a commit to maxwbuckley/onnxruntime that referenced this pull request Apr 24, 2026
…LProgram

Two defensive checks in `QuickGeluOpBuilder`:

1. `IsOpSupportedImpl` now calls `GetShape(...)` on input 0 and returns
   false (with VERBOSE log) if shape info is unavailable, matching the
   hard requirement in `AddToModelBuilderImpl`. Previously the EP could
   claim a QuickGelu node and then fail at model-build time if shape
   inference was incomplete upstream. Matches the pattern used in e.g.
   `conv_op_builder.cc` and `batch_norm_op_builder.cc`.

2. `AddToModelBuilderImpl` replaces the `if (CreateMLProgram()) { ... }`
   guard with an `ORT_RETURN_IF_NOT` at the top. The old form silently
   returned `Status::OK()` without emitting any op if called in
   NeuralNetwork mode — an invalid CoreML model. `IsOpSupportedImpl`
   gates this, but defense-in-depth is cheap here.

Addresses Copilot's two inline review comments on microsoft#28184.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yuslepukhin
Member

I am seeing lint failures. Please ensure you install lintrunner, run lintrunner init, then run lintrunner -a on all files prior to a push.

@yuslepukhin
Member

One clarification needed: your review mentioned "Split out the unrelated NuGet/pipeline changes into their own PR" — but this PR only touches 5 files, all CoreML-EP-related

This turns out to be a residual from my local merge.

@yuslepukhin
Member

LGTM. Need to resolve conflicts.

@maxwbuckley
Contributor Author

Conflicts resolved via merge commit 2a85e2cf12 (kept the branch append-only so the 'changes since last review' view stays intact). The conflict was only in coreml_basic_test.cc where #28182's HardSigmoidTest landed in the same spot I was appending the QuickGelu* tests — both now present side-by-side. Other files auto-merged clean.

Also re-ran lintrunner locally with the pinned versions from requirements-lintrunner.txt — reports clean on all 5 files. All 4 CoreML-EP tests (HardSigmoidTest, QuickGeluTest, QuickGeluTestAlphaOne, QuickGeluTestFp16) pass against the rebuilt binary.

@maxwbuckley
Contributor Author

Thanks for the approval! Quick note on the React Native CI Android failure — that ran after the approval and looks unrelated to this PR (the diff touches only onnxruntime/core/providers/coreml/builders/..., onnxruntime/test/providers/coreml/coreml_basic_test.cc, and the CoreML supported-ops doc — no Android / React Native code). The job log shows it failed in the React Native pipeline itself, not in anything our patch could affect. Happy to retry or rebase if it'd help, but otherwise hopefully it's just a transient.

maxwbuckley and others added 4 commits April 26, 2026 20:25
Adds support for `com.microsoft:QuickGelu` (`x * Sigmoid(alpha * x)`) to
the CoreML Execution Provider's MLProgram path. QuickGelu is produced by
ORT's own `QuickGeluFusion` optimizer pass (`ORT_ENABLE_EXTENDED` and
above, which includes the default `ORT_ENABLE_ALL`), so any model with
the `x * sigmoid(alpha * x)` pattern in it ends up with an op CoreML
rejects — turning 3 supported primitives into 1 unsupported op and
making the fusion a net negative for CoreML.

The builder decomposes QuickGelu into three MIL ops (`mul` / `sigmoid` /
`mul`), matching the op's own schema function-body in `contrib_defs.cc`
and the approach the QNN EP already uses in
`qnn/builder/opbuilder/quick_gelu_op_builder.cc`. Only the MLProgram
path is implemented; NeuralNetwork is deprecated on Apple Silicon.

Adds `CoreMLExecutionProviderTest.QuickGeluTest` which builds a single
`com.microsoft:QuickGelu` node with non-default alpha=1.5 and verifies
the entire graph is claimed by the CoreML EP via
`ExpectedEPNodeAssignment::All`. Verified via negative test: temporarily
removing the `CreateQuickGeluOpBuilder` registration causes the new test
to fail with a `VerifyEPNodeAssignment` fatal failure, proving it
genuinely exercises the CoreML path.

Also updates `coreml_supported_mlprogram_ops.md`.

Fixes microsoft#28183.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(The remaining three commit messages are reproduced verbatim earlier in this thread.)
@yuslepukhin yuslepukhin merged commit a53d6d7 into microsoft:main Apr 28, 2026
94 of 102 checks passed
