
[Cuda Plugin] Refactor CUDA ops — Move shared CPU/CUDA helper code from .cc to headers#27617

Merged
tianleiwu merged 4 commits into main from tlwu/20260310/refactoring_cuda_op on Mar 12, 2026
Conversation

@tianleiwu (Contributor) commented Mar 11, 2026

Description

This PR refactors several CPU operator helper functions by moving their implementations from .cc files into .h headers, using the #ifdef SHARED_PROVIDER / #else inline pattern. This is a prerequisite for the CUDA Plugin EP work, where CUDA kernels are built into a standalone shared library (libonnxruntime_providers_cuda_plugin.so) that cannot link against the CPU provider's .cc object files.

Why This Refactoring Is Needed

The CUDA Plugin EP compiles CUDA operator kernels into a separate shared library that communicates with the ORT core through the ORT EP Plugin API. In this architecture, kernel source files cannot depend on framework-internal symbols that live in the CPU provider static library (libonnxruntime_providers.a). Many CUDA kernels inherit from CPU base classes and call shared helper/validation methods (e.g., SliceBase::PrepareForCompute, SplitBase::PrepareForCompute, ScatterND::ValidateShapes, TileOp::IsTileMemcpy, PadBase::ComputePads) whose implementations currently live in CPU .cc files.

In the in-tree CUDA EP build (SHARED_PROVIDER mode), these helpers are accessed through the ProviderHostCPU DLL-boundary virtual table bridge. However, the plugin EP does not use this bridge — it uses EP API adapters and force-included headers instead. To make these helpers available in the plugin build without duplicating code, this PR moves the implementations into headers as inline functions under #ifndef SHARED_PROVIDER guards. The SHARED_PROVIDER (in-tree) build path retains the existing declaration-only signatures that route through ProviderHostCPU.

This pattern has already been successfully applied to other operators (e.g., Einsum). This PR extends it to the remaining operators that need it.

Summary of Changes

Helper functions moved from .cc to .h (inline under #ifndef SHARED_PROVIDER)

| Operator | File | Functions Moved |
|----------|------|-----------------|
| Slice | `cpu/tensor/slice.h` | `SliceBase::FlattenOutputDims`, `SliceBase::PrepareForCompute` (both overloads), `SliceBase::FillVectorsFromInput`, `slice_detail::CopyInputData<T>` |
| Split | `cpu/tensor/split.h` | `SplitBase::PrepareForCompute` |
| ScatterND | `cpu/tensor/scatter_nd.h` | `ScatterND::ValidateShapes` |
| Tile | `cpu/tensor/tile.h` | `TileOp::IsTileMemcpy` |
| Pad | `cpu/tensor/padbase.h` | `PadBase::ComputePadsImpl` (new template method replacing `ComputePads` for cross-context compatibility) |
| BiasGelu | `contrib_ops/cpu/bert/bias_gelu_helper.h` | `bias_gelu_helper::CheckInputs` (templatized on context type) |
| EmbedLayerNorm | `contrib_ops/cpu/bert/embed_layer_norm_helper.h` | `embed_layer_norm::CheckInputs` (templatized on context type) |
| NonMaxSuppression | `cpu/object_detection/non_max_suppression.h` + new `non_max_suppression_helper.h` | `NonMaxSuppressionBase` refactored into `NonMaxSuppressionBaseImpl<KernelInfoType, KernelContextType>` template for plugin compatibility |

Deleted .cc files (implementations moved to headers)

  • contrib_ops/cpu/bert/bias_gelu_helper.cc
  • contrib_ops/cpu/bert/embed_layer_norm_helper.cc

Provider bridge additions

  • Added Tensor::DataAsSpan<int32_t>() support through the shared provider interface (provider_interfaces.h, provider_wrappedtypes.h, provider_bridge_ort.cc). This was needed because slice_detail::CopyInputData<int32_t> calls Tensor::DataAsSpan<int32_t>(), which was not previously bridged.

CUDA-side updates

  • cuda/tensor/slice.h: Updated Slice constructor to use the new SliceBase(info, dynamic, 0) overload (template-based constructor compatible with both adapter and real OpKernelInfo).
  • cuda/tensor/pad.cc: Updated call from PadBase::ComputePads to PadBase::ComputePadsImpl.
  • cuda/tensor/scatter_nd.cc: Templatized InitializeElementCountsAndInputDimsSpanOrGpu on KernelContextType (also fixed a typo: InitiliazeElement... → InitializeElement...).
  • cuda/object_detection/non_max_suppression.h: Updated to use NonMaxSuppressionBaseImpl<OpKernelInfo, OpKernelContext> instead of NonMaxSuppressionBase.

New file

  • cpu/object_detection/non_max_suppression_helper.h: Contains the template-based NonMaxSuppressionBaseImpl class, separating it from the CPU-specific NonMaxSuppression kernel registration.

Testing

  • Existing unit tests cover all affected operators (Slice, Split, ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm, NonMaxSuppression).
  • No behavioral changes — all function logic is identical; only the location (header vs. source) and linkage (inline vs. external) changed.
  • The SHARED_PROVIDER code path (in-tree CUDA EP build) is unchanged — declarations remain and route through the existing ProviderHostCPU bridge.

Motivation and Context

This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels as a standalone shared library that can be updated independently of the ORT core. The refactoring enables ~10 additional CUDA operators to compile in the plugin build by making their CPU-side validation and preparation helpers available as header-inline functions.

@tianleiwu tianleiwu marked this pull request as draft March 11, 2026 00:33
@tianleiwu tianleiwu requested a review from Copilot March 11, 2026 02:40
Copilot AI left a comment


Pull request overview

This PR refactors CPU/CUDA shared helper code to better support the CUDA shared-provider/plugin build, primarily by moving a number of small CPU helpers into headers (or into SHARED_PROVIDER bridge-forwarded declarations) and extending the provider bridge API for additional tensor span access.

Changes:

  • Extend the shared provider bridge to support Tensor::DataAsSpan<int32_t>().
  • Refactor several CPU helper implementations (Slice/Split/Tile/ScatterND/Pad/NMS) into headers with SHARED_PROVIDER-aware declarations, and adjust CUDA kernels to use the new shared helpers.
  • Move contrib BERT helper implementations (embed_layer_norm, bias_gelu) from .cc into headers for non-SHARED_PROVIDER builds.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| `onnxruntime/core/session/provider_bridge_ort.cc` | Implements new `Tensor__DataAsSpan_int32` host callback. |
| `onnxruntime/core/providers/shared_library/provider_interfaces.h` | Adds `Tensor__DataAsSpan_int32` to ProviderHost ABI. |
| `onnxruntime/core/providers/shared_library/provider_wrappedtypes.h` | Adds `Tensor::DataAsSpan<int32_t>` specialization for shared providers. |
| `onnxruntime/core/providers/cuda/tensor/slice.h` | Updates CUDA Slice to use new SliceBase ctor signature. |
| `onnxruntime/core/providers/cuda/tensor/scatter_nd.cc` | Generalizes a helper to accept templated context type. |
| `onnxruntime/core/providers/cuda/tensor/pad.cc` | Switches CUDA Pad to call `ComputePadsImpl`. |
| `onnxruntime/core/providers/cuda/object_detection/non_max_suppression.h` | Switches CUDA NMS to a templated shared helper base. |
| `onnxruntime/core/providers/cpu/tensor/tile.h` | Makes `IsTileMemcpy` inline for non-shared builds; declares for SHARED_PROVIDER. |
| `onnxruntime/core/providers/cpu/tensor/tile.cc` | Removes out-of-line `IsTileMemcpy` implementation (now header/bridge-based). |
| `onnxruntime/core/providers/cpu/tensor/split.h` | Makes `SplitBase::PrepareForCompute` inline for non-shared builds; declares for SHARED_PROVIDER. |
| `onnxruntime/core/providers/cpu/tensor/split.cc` | Removes out-of-line `PrepareForCompute` implementation. |
| `onnxruntime/core/providers/cpu/tensor/slice.h` | Refactors Slice helpers into header for non-shared builds; adds int32 indices support path. |
| `onnxruntime/core/providers/cpu/tensor/slice.cc` | Removes out-of-line Slice helper implementations. |
| `onnxruntime/core/providers/cpu/tensor/scatter_nd.h` | Makes `ValidateShapes` inline for non-shared builds; declares for SHARED_PROVIDER. |
| `onnxruntime/core/providers/cpu/tensor/scatter_nd.cc` | Removes out-of-line `ValidateShapes` implementation. |
| `onnxruntime/core/providers/cpu/tensor/padbase.h` | Moves small helpers inline for non-shared builds; keeps SHARED_PROVIDER declarations. |
| `onnxruntime/core/providers/cpu/tensor/pad.cc` | Removes out-of-line `HandleDimValueZero` and `ComputePads` wrappers. |
| `onnxruntime/core/providers/cpu/object_detection/non_max_suppression_helper.h` | Introduces templated NMS shared helper implementation. |
| `onnxruntime/core/providers/cpu/object_detection/non_max_suppression.h` | Routes CPU NMS static helpers via templated impl for non-shared builds. |
| `onnxruntime/core/providers/cpu/object_detection/non_max_suppression.cc` | Removes out-of-line CPU NMS static helper implementations. |
| `onnxruntime/contrib_ops/cpu/bert/embed_layer_norm_helper.h` | Moves `CheckInputs` implementation into header for non-shared builds. |
| `onnxruntime/contrib_ops/cpu/bert/embed_layer_norm_helper.cc` | Deleted (implementation moved to header / bridge path). |
| `onnxruntime/contrib_ops/cpu/bert/bias_gelu_helper.h` | Moves `CheckInputs` implementation into header for non-shared builds. |
| `onnxruntime/contrib_ops/cpu/bert/bias_gelu_helper.cc` | Deleted (implementation moved to header / bridge path). |


@tianleiwu tianleiwu requested a review from Copilot March 11, 2026 03:07
@tianleiwu tianleiwu marked this pull request as ready for review March 11, 2026 03:08
Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.



@tianleiwu tianleiwu changed the title from "[Cuda Plugin] Refactoring cpu/cuda shared code" to "[Cuda Plugin] Refactor CUDA ops — Move shared CPU/CUDA helper code from .cc to headers" on Mar 11, 2026
@tianleiwu tianleiwu requested a review from Copilot March 11, 2026 03:56
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@tianleiwu tianleiwu merged commit 201e240 into main Mar 12, 2026
91 of 93 checks passed
@tianleiwu tianleiwu deleted the tlwu/20260310/refactoring_cuda_op branch March 12, 2026 20:21
tianleiwu added a commit that referenced this pull request Mar 13, 2026
…de from .cc to headers (Part 2) (#27628)

## Description

This PR continues the refactoring effort started in PR #27617, moving
additional CPU operator helper function implementations from `.cc` files
into `.h` headers using the `#ifdef SHARED_PROVIDER` / `#else` inline
pattern. This is a prerequisite for the **CUDA Plugin EP** work, where
CUDA kernels are built into a standalone shared library
(`libonnxruntime_providers_cuda_plugin.so`) that cannot link against the
CPU provider's `.cc` object files.

### Why This Refactoring Is Needed

The CUDA Plugin EP compiles CUDA operator kernels into a separate shared
library that communicates with the ORT core through the ORT EP Plugin
API. In this architecture, kernel source files **cannot** depend on
framework-internal symbols that live in the CPU provider static library
(`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base
classes and call shared helper/validation methods whose implementations
currently live in CPU `.cc` files.

In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are
accessed through the `ProviderHostCPU` DLL-boundary virtual table
bridge. However, the plugin EP does not use this bridge — it uses EP API
adapters and force-included headers instead. To make these helpers
available in the plugin build without duplicating code, this PR moves
the implementations into headers as `inline` functions under `#ifndef
SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path
retains the existing declaration-only signatures that route through
`ProviderHostCPU`.

### Refactoring Patterns Used

1. **Inline move**: Function body moved from `.cc` to `.h`, wrapped in
`#ifndef SHARED_PROVIDER` with `inline` linkage. The `#ifdef
SHARED_PROVIDER` path keeps the original declaration.
2. **Template-on-context**: Methods like `PrepareCompute`,
`PrepareForCompute`, and `GetPresent` are templatized on
`KernelContextType` so they work with both `OpKernelContext` (in-tree)
and the plugin EP's adapter context.
3. **Template-on-info**: Constructors and initialization methods (e.g.,
`RoiAlignBase`, `CropBase`, `SpaceDepthBase`) are templatized on
`KernelInfoType` with `info.template GetAttr<T>(...)` calls, making them
compatible with both `OpKernelInfo` and the plugin's
`OpKernelInfoAdapter`.
4. **Helper extraction**: Free helper functions (e.g.,
`CheckROIAlignValidInput`, `GetAxis`, `AdjustOutputSizeAsPolicy`) moved
inline into headers.

## Summary of Changes

### Helper functions moved from `.cc` to `.h` (inline under `#ifndef
SHARED_PROVIDER`)

| Operator | Header File | Functions Moved |
|----------|-------------|-----------------|
| **AttentionBase** | `contrib_ops/cpu/bert/attention_base.h` |
`AttentionBase::CheckInputs` (both overloads),
`AttentionBase::CheckMask`, `AttentionBase::GetPresent` (templatized on
`TOpKernelContext`) |
| **LongformerAttentionBase** |
`contrib_ops/cpu/bert/longformer_attention_base.h` |
`LongformerAttentionBase::CheckInputs` |
| **CumSum** | `cpu/math/cumsum.h` | `GetAxis` (free function) |
| **RoiAlign** | `cpu/object_detection/roialign.h` |
`CheckROIAlignValidInput` (free function), `RoiAlignBase` constructor
templatized on `TKernelInfo` |
| **Concat** | `cpu/tensor/concatbase.h` |
`ConcatBase::PrepareForCompute` (templatized, delegates to
`PrepareForComputeImpl`) |
| **Gather** | `cpu/tensor/gatherbase.h` |
`GatherBase::PrepareForCompute` (templatized, delegates to
`PrepareForComputeImpl`) |
| **Unsqueeze** | `cpu/tensor/unsqueeze.h` |
`UnsqueezeBase::PrepareCompute` (templatized on `KernelContextType`) |
| **Upsample** | `cpu/tensor/upsamplebase.h` |
`UpsampleBase::AdjustOutputSizeAsPolicy`,
`upsamplebase_helper::AdjustOutputSizeAsPolicy` (free helper) |

### Constructor templatization (for plugin EP adapter compatibility)

| Class | Header File | Change |
|-------|-------------|--------|
| **CropBase** | `contrib_ops/cpu/crop.h` | Constructor templatized on
`KernelInfoType`, `GetAttrsOrDefault` calls use `info.template` syntax |
| **SpaceDepthBase** | `cpu/tensor/space_depth_ops.h` | Constructor
templatized on `KernelInfoType`, `GetAttr` call uses `info.template`
syntax |
| **RoiAlignBase** | `cpu/object_detection/roialign.h` | Constructor
templatized on `TKernelInfo`, all `GetAttr` calls use `info.template`
syntax |

### CUDA-side updates

| File | Change |
|------|--------|
| `cuda/tensor/upsample.cc` | Added explicit template instantiations for
`Upsample<float>`, `Upsample<double>`, `Upsample<MLFloat16>`,
`Upsample<int32_t>`, `Upsample<uint8_t>` (needed because
`AdjustOutputSizeAsPolicy` implementation moved to header) |

### Files with code removed (moved to headers)

| Source File | Lines Removed | Moved To |
|-------------|---------------|----------|
| `contrib_ops/cpu/bert/attention_base.cc` | ~333 | `attention_base.h` |
| `contrib_ops/cpu/bert/longformer_attention_base.cc` | ~133 |
`longformer_attention_base.h` |
| `cpu/math/cumsum.cc` | ~23 | `cumsum.h` |
| `cpu/object_detection/roialign.cc` | ~74 | `roialign.h` |
| `cpu/tensor/concat.cc` | ~8 | `concatbase.h` |
| `cpu/tensor/gather.cc` | ~4 | `gatherbase.h` |
| `cpu/tensor/unsqueeze.cc` | ~51 | `unsqueeze.h` |
| `cpu/tensor/upsample.cc` | ~44 | `upsamplebase.h` |

## Testing

- Existing unit tests cover all affected operators (Attention,
LongformerAttention, CumSum, RoiAlign, Concat, Gather, Unsqueeze,
Upsample, Crop, SpaceToDepth/DepthToSpace).
- No behavioral changes — all function logic is identical; only the
location (header vs. source) and linkage (inline vs. external) changed.
- The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged —
declarations remain and route through the existing `ProviderHostCPU`
bridge.

## Motivation and Context

This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels
as a standalone shared library that can be updated independently of the
ORT core. The refactoring enables additional CUDA operators to compile
in the plugin build by making their CPU-side validation and preparation
helpers available as header-inline functions.

This PR is a direct continuation of PR #27617 which applied the same
pattern to Slice, Split, ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm,
and NonMaxSuppression operators.