
[Cuda Plugin] Refactor CUDA ops — Move shared CPU/CUDA helper code from .cc to headers#27617

Merged
tianleiwu merged 4 commits into main from tlwu/20260310/refactoring_cuda_op on Mar 12, 2026
Conversation

@tianleiwu (Contributor) commented Mar 11, 2026

Description

This PR refactors several CPU operator helper functions by moving their implementations from .cc files into .h headers, using the #ifdef SHARED_PROVIDER / #else inline pattern. This is a prerequisite for the CUDA Plugin EP work, where CUDA kernels are built into a standalone shared library (libonnxruntime_providers_cuda_plugin.so) that cannot link against the CPU provider's .cc object files.

Why This Refactoring Is Needed

The CUDA Plugin EP compiles CUDA operator kernels into a separate shared library that communicates with the ORT core through the ORT EP Plugin API. In this architecture, kernel source files cannot depend on framework-internal symbols that live in the CPU provider static library (libonnxruntime_providers.a). Many CUDA kernels inherit from CPU base classes and call shared helper/validation methods (e.g., SliceBase::PrepareForCompute, SplitBase::PrepareForCompute, ScatterND::ValidateShapes, TileOp::IsTileMemcpy, PadBase::ComputePads) whose implementations currently live in CPU .cc files.

In the in-tree CUDA EP build (SHARED_PROVIDER mode), these helpers are accessed through the ProviderHostCPU DLL-boundary virtual table bridge. However, the plugin EP does not use this bridge — it uses EP API adapters and force-included headers instead. To make these helpers available in the plugin build without duplicating code, this PR moves the implementations into headers as inline functions under #ifndef SHARED_PROVIDER guards. The SHARED_PROVIDER (in-tree) build path retains the existing declaration-only signatures that route through ProviderHostCPU.

This pattern has already been successfully applied to other operators (e.g., Einsum). This PR extends it to the remaining operators that need it.

Summary of Changes

Helper functions moved from .cc to .h (inline under #ifndef SHARED_PROVIDER)

| Operator | File | Functions Moved |
|----------|------|-----------------|
| Slice | `cpu/tensor/slice.h` | `SliceBase::FlattenOutputDims`, `SliceBase::PrepareForCompute` (both overloads), `SliceBase::FillVectorsFromInput`, `slice_detail::CopyInputData<T>` |
| Split | `cpu/tensor/split.h` | `SplitBase::PrepareForCompute` |
| ScatterND | `cpu/tensor/scatter_nd.h` | `ScatterND::ValidateShapes` |
| Tile | `cpu/tensor/tile.h` | `TileOp::IsTileMemcpy` |
| Pad | `cpu/tensor/padbase.h` | `PadBase::ComputePadsImpl` (new template method replacing `ComputePads` for cross-context compatibility) |
| BiasGelu | `contrib_ops/cpu/bert/bias_gelu_helper.h` | `bias_gelu_helper::CheckInputs` (templatized on context type) |
| EmbedLayerNorm | `contrib_ops/cpu/bert/embed_layer_norm_helper.h` | `embed_layer_norm::CheckInputs` (templatized on context type) |
| NonMaxSuppression | `cpu/object_detection/non_max_suppression.h` + new `non_max_suppression_helper.h` | `NonMaxSuppressionBase` refactored into `NonMaxSuppressionBaseImpl<KernelInfoType, KernelContextType>` template for plugin compatibility |

Deleted .cc files (implementations moved to headers)

  • contrib_ops/cpu/bert/bias_gelu_helper.cc
  • contrib_ops/cpu/bert/embed_layer_norm_helper.cc

Provider bridge additions

  • Added Tensor::DataAsSpan<int32_t>() support through the shared provider interface (provider_interfaces.h, provider_wrappedtypes.h, provider_bridge_ort.cc). This was needed because slice_detail::CopyInputData<int32_t> calls Tensor::DataAsSpan<int32_t>(), which was not previously bridged.

CUDA-side updates

  • cuda/tensor/slice.h: Updated Slice constructor to use the new SliceBase(info, dynamic, 0) overload (template-based constructor compatible with both adapter and real OpKernelInfo).
  • cuda/tensor/pad.cc: Updated call from PadBase::ComputePads to PadBase::ComputePadsImpl.
  • cuda/tensor/scatter_nd.cc: Templatized InitializeElementCountsAndInputDimsSpanOrGpu on KernelContextType (also fixed a typo: InitiliazeElement... → InitializeElement...).
  • cuda/object_detection/non_max_suppression.h: Updated to use NonMaxSuppressionBaseImpl<OpKernelInfo, OpKernelContext> instead of NonMaxSuppressionBase.

New file

  • cpu/object_detection/non_max_suppression_helper.h: Contains the template-based NonMaxSuppressionBaseImpl class, separating it from the CPU-specific NonMaxSuppression kernel registration.

Testing

  • Existing unit tests cover all affected operators (Slice, Split, ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm, NonMaxSuppression).
  • No behavioral changes — all function logic is identical; only the location (header vs. source) and linkage (inline vs. external) changed.
  • The SHARED_PROVIDER code path (in-tree CUDA EP build) is unchanged — declarations remain and route through the existing ProviderHostCPU bridge.

Motivation and Context

This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels as a standalone shared library that can be updated independently of the ORT core. The refactoring enables ~10 additional CUDA operators to compile in the plugin build by making their CPU-side validation and preparation helpers available as header-inline functions.

@tianleiwu tianleiwu marked this pull request as draft March 11, 2026 00:33
@tianleiwu tianleiwu requested a review from Copilot March 11, 2026 02:40
Copilot AI left a comment


Pull request overview

This PR refactors CPU/CUDA shared helper code to better support the CUDA shared-provider/plugin build, primarily by moving a number of small CPU helpers into headers (or into SHARED_PROVIDER bridge-forwarded declarations) and extending the provider bridge API for additional tensor span access.

Changes:

  • Extend the shared provider bridge to support Tensor::DataAsSpan<int32_t>().
  • Refactor several CPU helper implementations (Slice/Split/Tile/ScatterND/Pad/NMS) into headers with SHARED_PROVIDER-aware declarations, and adjust CUDA kernels to use the new shared helpers.
  • Move contrib BERT helper implementations (embed_layer_norm, bias_gelu) from .cc into headers for non-SHARED_PROVIDER builds.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| `onnxruntime/core/session/provider_bridge_ort.cc` | Implements new `Tensor__DataAsSpan_int32` host callback. |
| `onnxruntime/core/providers/shared_library/provider_interfaces.h` | Adds `Tensor__DataAsSpan_int32` to ProviderHost ABI. |
| `onnxruntime/core/providers/shared_library/provider_wrappedtypes.h` | Adds `Tensor::DataAsSpan<int32_t>` specialization for shared providers. |
| `onnxruntime/core/providers/cuda/tensor/slice.h` | Updates CUDA Slice to use new SliceBase ctor signature. |
| `onnxruntime/core/providers/cuda/tensor/scatter_nd.cc` | Generalizes a helper to accept templated context type. |
| `onnxruntime/core/providers/cuda/tensor/pad.cc` | Switches CUDA Pad to call `ComputePadsImpl`. |
| `onnxruntime/core/providers/cuda/object_detection/non_max_suppression.h` | Switches CUDA NMS to a templated shared helper base. |
| `onnxruntime/core/providers/cpu/tensor/tile.h` | Makes `IsTileMemcpy` inline for non-shared builds; declares for SHARED_PROVIDER. |
| `onnxruntime/core/providers/cpu/tensor/tile.cc` | Removes out-of-line `IsTileMemcpy` implementation (now header/bridge-based). |
| `onnxruntime/core/providers/cpu/tensor/split.h` | Makes `SplitBase::PrepareForCompute` inline for non-shared builds; declares for SHARED_PROVIDER. |
| `onnxruntime/core/providers/cpu/tensor/split.cc` | Removes out-of-line `PrepareForCompute` implementation. |
| `onnxruntime/core/providers/cpu/tensor/slice.h` | Refactors Slice helpers into header for non-shared builds; adds int32 indices support path. |
| `onnxruntime/core/providers/cpu/tensor/slice.cc` | Removes out-of-line Slice helper implementations. |
| `onnxruntime/core/providers/cpu/tensor/scatter_nd.h` | Makes `ValidateShapes` inline for non-shared builds; declares for SHARED_PROVIDER. |
| `onnxruntime/core/providers/cpu/tensor/scatter_nd.cc` | Removes out-of-line `ValidateShapes` implementation. |
| `onnxruntime/core/providers/cpu/tensor/padbase.h` | Moves small helpers inline for non-shared builds; keeps SHARED_PROVIDER declarations. |
| `onnxruntime/core/providers/cpu/tensor/pad.cc` | Removes out-of-line `HandleDimValueZero` and `ComputePads` wrappers. |
| `onnxruntime/core/providers/cpu/object_detection/non_max_suppression_helper.h` | Introduces templated NMS shared helper implementation. |
| `onnxruntime/core/providers/cpu/object_detection/non_max_suppression.h` | Routes CPU NMS static helpers via templated impl for non-shared builds. |
| `onnxruntime/core/providers/cpu/object_detection/non_max_suppression.cc` | Removes out-of-line CPU NMS static helper implementations. |
| `onnxruntime/contrib_ops/cpu/bert/embed_layer_norm_helper.h` | Moves `CheckInputs` implementation into header for non-shared builds. |
| `onnxruntime/contrib_ops/cpu/bert/embed_layer_norm_helper.cc` | Deleted (implementation moved to header / bridge path). |
| `onnxruntime/contrib_ops/cpu/bert/bias_gelu_helper.h` | Moves `CheckInputs` implementation into header for non-shared builds. |
| `onnxruntime/contrib_ops/cpu/bert/bias_gelu_helper.cc` | Deleted (implementation moved to header / bridge path). |


@tianleiwu tianleiwu requested a review from Copilot March 11, 2026 03:07
@tianleiwu tianleiwu marked this pull request as ready for review March 11, 2026 03:08
Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.



@tianleiwu tianleiwu changed the title from "[Cuda Plugin] Refactoring cpu/cuda shared code" to "[Cuda Plugin] Refactor CUDA ops — Move shared CPU/CUDA helper code from .cc to headers" on Mar 11, 2026
@tianleiwu tianleiwu requested a review from Copilot March 11, 2026 03:56
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@tianleiwu tianleiwu merged commit 201e240 into main Mar 12, 2026
91 of 93 checks passed
@tianleiwu tianleiwu deleted the tlwu/20260310/refactoring_cuda_op branch March 12, 2026 20:21
tianleiwu added a commit that referenced this pull request Mar 13, 2026
…de from .cc to headers (Part 2) (#27628)

## Description

This PR continues the refactoring effort started in PR #27617, moving
additional CPU operator helper function implementations from `.cc` files
into `.h` headers using the `#ifdef SHARED_PROVIDER` / `#else` inline
pattern. This is a prerequisite for the **CUDA Plugin EP** work, where
CUDA kernels are built into a standalone shared library
(`libonnxruntime_providers_cuda_plugin.so`) that cannot link against the
CPU provider's `.cc` object files.

### Why This Refactoring Is Needed

The CUDA Plugin EP compiles CUDA operator kernels into a separate shared
library that communicates with the ORT core through the ORT EP Plugin
API. In this architecture, kernel source files **cannot** depend on
framework-internal symbols that live in the CPU provider static library
(`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base
classes and call shared helper/validation methods whose implementations
currently live in CPU `.cc` files.

In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are
accessed through the `ProviderHostCPU` DLL-boundary virtual table
bridge. However, the plugin EP does not use this bridge — it uses EP API
adapters and force-included headers instead. To make these helpers
available in the plugin build without duplicating code, this PR moves
the implementations into headers as `inline` functions under `#ifndef
SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path
retains the existing declaration-only signatures that route through
`ProviderHostCPU`.

### Refactoring Patterns Used

1. **Inline move**: Function body moved from `.cc` to `.h`, wrapped in
`#ifndef SHARED_PROVIDER` with `inline` linkage. The `#ifdef
SHARED_PROVIDER` path keeps the original declaration.
2. **Template-on-context**: Methods like `PrepareCompute`,
`PrepareForCompute`, and `GetPresent` are templatized on
`KernelContextType` so they work with both `OpKernelContext` (in-tree)
and the plugin EP's adapter context.
3. **Template-on-info**: Constructors and initialization methods (e.g.,
`RoiAlignBase`, `CropBase`, `SpaceDepthBase`) are templatized on
`KernelInfoType` with `info.template GetAttr<T>(...)` calls, making them
compatible with both `OpKernelInfo` and the plugin's
`OpKernelInfoAdapter`.
4. **Helper extraction**: Free helper functions (e.g.,
`CheckROIAlignValidInput`, `GetAxis`, `AdjustOutputSizeAsPolicy`) moved
inline into headers.

## Summary of Changes

### Helper functions moved from `.cc` to `.h` (inline under `#ifndef
SHARED_PROVIDER`)

| Operator | Header File | Functions Moved |
|----------|-------------|-----------------|
| **AttentionBase** | `contrib_ops/cpu/bert/attention_base.h` |
`AttentionBase::CheckInputs` (both overloads),
`AttentionBase::CheckMask`, `AttentionBase::GetPresent` (templatized on
`TOpKernelContext`) |
| **LongformerAttentionBase** |
`contrib_ops/cpu/bert/longformer_attention_base.h` |
`LongformerAttentionBase::CheckInputs` |
| **CumSum** | `cpu/math/cumsum.h` | `GetAxis` (free function) |
| **RoiAlign** | `cpu/object_detection/roialign.h` |
`CheckROIAlignValidInput` (free function), `RoiAlignBase` constructor
templatized on `TKernelInfo` |
| **Concat** | `cpu/tensor/concatbase.h` |
`ConcatBase::PrepareForCompute` (templatized, delegates to
`PrepareForComputeImpl`) |
| **Gather** | `cpu/tensor/gatherbase.h` |
`GatherBase::PrepareForCompute` (templatized, delegates to
`PrepareForComputeImpl`) |
| **Unsqueeze** | `cpu/tensor/unsqueeze.h` |
`UnsqueezeBase::PrepareCompute` (templatized on `KernelContextType`) |
| **Upsample** | `cpu/tensor/upsamplebase.h` |
`UpsampleBase::AdjustOutputSizeAsPolicy`,
`upsamplebase_helper::AdjustOutputSizeAsPolicy` (free helper) |

### Constructor templatization (for plugin EP adapter compatibility)

| Class | Header File | Change |
|-------|-------------|--------|
| **CropBase** | `contrib_ops/cpu/crop.h` | Constructor templatized on
`KernelInfoType`, `GetAttrsOrDefault` calls use `info.template` syntax |
| **SpaceDepthBase** | `cpu/tensor/space_depth_ops.h` | Constructor
templatized on `KernelInfoType`, `GetAttr` call uses `info.template`
syntax |
| **RoiAlignBase** | `cpu/object_detection/roialign.h` | Constructor
templatized on `TKernelInfo`, all `GetAttr` calls use `info.template`
syntax |

### CUDA-side updates

| File | Change |
|------|--------|
| `cuda/tensor/upsample.cc` | Added explicit template instantiations for
`Upsample<float>`, `Upsample<double>`, `Upsample<MLFloat16>`,
`Upsample<int32_t>`, `Upsample<uint8_t>` (needed because
`AdjustOutputSizeAsPolicy` implementation moved to header) |

### Files with code removed (moved to headers)

| Source File | Lines Removed | Moved To |
|-------------|---------------|----------|
| `contrib_ops/cpu/bert/attention_base.cc` | ~333 | `attention_base.h` |
| `contrib_ops/cpu/bert/longformer_attention_base.cc` | ~133 |
`longformer_attention_base.h` |
| `cpu/math/cumsum.cc` | ~23 | `cumsum.h` |
| `cpu/object_detection/roialign.cc` | ~74 | `roialign.h` |
| `cpu/tensor/concat.cc` | ~8 | `concatbase.h` |
| `cpu/tensor/gather.cc` | ~4 | `gatherbase.h` |
| `cpu/tensor/unsqueeze.cc` | ~51 | `unsqueeze.h` |
| `cpu/tensor/upsample.cc` | ~44 | `upsamplebase.h` |

## Testing

- Existing unit tests cover all affected operators (Attention,
LongformerAttention, CumSum, RoiAlign, Concat, Gather, Unsqueeze,
Upsample, Crop, SpaceToDepth/DepthToSpace).
- No behavioral changes — all function logic is identical; only the
location (header vs. source) and linkage (inline vs. external) changed.
- The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged —
declarations remain and route through the existing `ProviderHostCPU`
bridge.

## Motivation and Context

This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels
as a standalone shared library that can be updated independently of the
ORT core. The refactoring enables additional CUDA operators to compile
in the plugin build by making their CPU-side validation and preparation
helpers available as header-inline functions.

This PR is a direct continuation of PR #27617 which applied the same
pattern to Slice, Split, ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm,
and NonMaxSuppression operators.