[Core] Add SpaceToDepth fusion pattern#27747
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new Level1 optimizer fusion that recognizes a 4x Slice (stride-2, phase offsets) + Concat(axis=1) downsample pattern and replaces it with SpaceToDepth (plus optional Gather to preserve non-canonical block ordering).
Changes:
- Introduces
SliceConcatToSpaceToDepthFusiongraph transformer and integrates it into Level1 transformer generation. - Implements fusion logic to replace Slice/Concat subgraphs with
SpaceToDepth, optionally followed byGatherfor channel reordering. - Adds optimizer unit tests covering canonical, Constant-node-input, and permuted block order cases.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| onnxruntime/test/optimizer/graph_transform_test.cc | Adds tests validating the new fusion behavior (including Constant-based Slice inputs and permuted block order). |
| onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.h | Declares the new SliceConcatToSpaceToDepthFusion transformer. |
| onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.cc | Implements the matching and rewrite logic for Slice+Concat → SpaceToDepth (+ optional Gather). |
| onnxruntime/core/optimizer/graph_transformer_utils.cc | Registers the new transformer in the Level1 transformer pipeline. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
This PR introduces a new Level1 graph optimization that recognizes a specific Slice×4 + Concat downsampling pattern and fuses it into a SpaceToDepth (and optionally a Gather to preserve non-canonical slice ordering), improving runtime efficiency for eligible models.
Changes:
- Added
SliceConcatToSpaceToDepthFusiontransformer implementation and integrated it into Level1 transformer generation. - Added optimizer tests covering canonical fusion, constant-input variants, permuted block order (requires
Gather), and negative (non-triggering) scenarios. - Updated minimal-build optimizer source allowlist to include the new transformer.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/optimizer/graph_transform_test.cc | Adds test coverage for the new Slice+Concat → SpaceToDepth fusion (and non-trigger cases). |
| onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.h | Declares the new graph transformer. |
| onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.cc | Implements pattern matching and rewrite to SpaceToDepth (+ optional Gather). |
| onnxruntime/core/optimizer/graph_transformer_utils.cc | Registers the transformer in Level1 pipeline. |
| cmake/onnxruntime_optimizer.cmake | Adds new source files to minimal-build allowlist. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.cc
Outdated
Show resolved
Hide resolved
onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.cc
Outdated
Show resolved
Hide resolved
onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.cc
Outdated
Show resolved
Hide resolved
tianleiwu
left a comment
There was a problem hiding this comment.
Summary
This PR introduces a highly valuable graph optimization connecting a common pattern generated by YOLO's Focus layers (Concat of 4 phase-shifted strided Slice operations) into ONNX's SpaceToDepth operator. The transformation correctly mitigates the heavy memory bus constraints imposed by piecewise slicing and concatenation, yielding performance wins for YOLO variants.
The mathematical mapping leveraged here relies on an elegant quirk of ONNXRuntime's SpaceToDepth native output layout p * C + c, perfectly aligning with the channel dimension interleaved sequence of Concat([slices...], axis=1). The algorithm is robustly designed, securely verifying phase overlaps without hardcoding static dimension constraints where unnecessary.
There are just a few isolated concerns to address, primarily involving the optimizer's execution flow within minimal builds and a few overly restrictive structural matching rules.
Findings by Component
Subgraph Matching & Analysis (slice_concat_to_space_to_depth_fusion.cc)
Positive: The way GetSliceInfo integrates axes tracking correctly populates missing slice domains with .starts zero-fills and .ends INT_MAX fills is immaculate. Likewise, TryGetPhasePermutation perfectly computes the re-indexing necessary to handle alternate Focus phase orderings cleanly via an auxiliary Gather.
⚠️ Concern - Missed Match on Overshoot Ends: InIsFullExtentEnd(Line 183), the checkend == dim_valuerisks rejecting functionally equivalent full-extent slices. WhenSliceoperations specify overshoot values (e.g.,ends = [1000]for a dimension of size8), the ONNX specification natively clamps this, meaning it captures the full extent.- Fix Suggestion: Relax the check to allow
end >= dim_valuegiven thatSliceclamps at boundaries.return TryGetStaticInputDim(input, axis, dim_value) && end >= dim_value;
- Fix Suggestion: Relax the check to allow
⚠️ Concern - Missing Node Edge Generation for Gather: As identified in another review, in non-canonical cases the createdGathernode is disjointed from theSpaceToDepthnode within the internal Graph dependency mappings. WhileReplacement_endties the IO args properly, the Node edge is missing.- Fix Suggestion: Add explicit topological edging right after binding the execution provider.
gather.SetExecutionProviderType(provider_type); graph.AddEdge(space_to_depth.Index(), gather.Index(), 0, 0); // Connect Data Edge replacement_end = &gather;
- Fix Suggestion: Add explicit topological edging right after binding the execution provider.
Graph Pass Initialization (GraphTransformer::ApplyImpl)
⚠️ Concern - Caching Stale Operator Topologies: WithinSliceConcatToSpaceToDepthFusion::ApplyImpl(Line 549),graph_viewer.GetNodesInTopologicalOrder()evaluates the graph nodes exactly once. BecauseFuseSliceConcatToSpaceToDepthdrastically alters graph boundaries (mutating upstream links and deleting multiple nodes), relying on the staticnode_topology_listfor subsequent nodes may reference invalidated pointer offsets downstream (especially in consecutive pattern instances).- Fix Suggestion: Re-initialize the outer loop execution over
GraphViewerwhenevermodifiedis markedtrue, similarly to how other level 1 graph rewrites handle cascading topology updates.
- Fix Suggestion: Re-initialize the outer loop execution over
Code Style & Hygiene
Positive: Naming schemes, constant declarations, inline vectors allocation tracking, and helper extractions all align beautifully with the surrounding codebase in optimizer. Utilizing MoveAllNodeInputEdges and carefully wiping the auxiliary constant edges for the 1st slice node was a brilliant strategy to bypass native ONNX initializer mismatches.
--
This review focuses selectively to cover unique structural and boundary concerns. Outstanding points involving minimal build lookups and Float16 constraints raised by Copilot in earlier reviews remain equally valid and advised.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.cc
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.cc
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
onnxruntime/core/optimizer/slice_concat_to_space_to_depth_fusion.cc
Outdated
Show resolved
Hide resolved
…on.cc Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Description
Fuses a pattern of 4 Slices + 1 Concat into 1 SpaceToDepth op (+1 Gather if the order doesn't match the expected default pattern). Saves about ~1ms for Yolox_tiny which is a sub-15ms model on an AVX512 machine. So, it is good savings.
Motivation and Context
Improves performance especially when the SpaceToDepth operation occurs in a cheap real-time model
Original SpaceToDepth operation in the model (4 slices + 1 Concat is a lot of memory transactions that can be replaced by one optimized kernel):