[Cuda Plugin] Refactor CUDA ops — Move shared CPU/CUDA helper code from .cc to headers #27617
Pull request overview
This PR refactors CPU/CUDA shared helper code to better support the CUDA shared-provider/plugin build, primarily by moving a number of small CPU helpers into headers (or into SHARED_PROVIDER bridge-forwarded declarations) and extending the provider bridge API for additional tensor span access.
Changes:
- Extend the shared provider bridge to support `Tensor::DataAsSpan<int32_t>()`.
- Refactor several CPU helper implementations (Slice/Split/Tile/ScatterND/Pad/NMS) into headers with `SHARED_PROVIDER`-aware declarations, and adjust CUDA kernels to use the new shared helpers.
- Move contrib BERT helper implementations (`embed_layer_norm`, `bias_gelu`) from `.cc` into headers for non-`SHARED_PROVIDER` builds.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/core/session/provider_bridge_ort.cc | Implements new Tensor__DataAsSpan_int32 host callback. |
| onnxruntime/core/providers/shared_library/provider_interfaces.h | Adds Tensor__DataAsSpan_int32 to ProviderHost ABI. |
| onnxruntime/core/providers/shared_library/provider_wrappedtypes.h | Adds Tensor::DataAsSpan<int32_t> specialization for shared providers. |
| onnxruntime/core/providers/cuda/tensor/slice.h | Updates CUDA Slice to use new SliceBase ctor signature. |
| onnxruntime/core/providers/cuda/tensor/scatter_nd.cc | Generalizes a helper to accept templated context type. |
| onnxruntime/core/providers/cuda/tensor/pad.cc | Switches CUDA Pad to call ComputePadsImpl. |
| onnxruntime/core/providers/cuda/object_detection/non_max_suppression.h | Switches CUDA NMS to a templated shared helper base. |
| onnxruntime/core/providers/cpu/tensor/tile.h | Makes IsTileMemcpy inline for non-shared builds; declares for SHARED_PROVIDER. |
| onnxruntime/core/providers/cpu/tensor/tile.cc | Removes out-of-line IsTileMemcpy implementation (now header/bridge-based). |
| onnxruntime/core/providers/cpu/tensor/split.h | Makes SplitBase::PrepareForCompute inline for non-shared builds; declares for SHARED_PROVIDER. |
| onnxruntime/core/providers/cpu/tensor/split.cc | Removes out-of-line PrepareForCompute implementation. |
| onnxruntime/core/providers/cpu/tensor/slice.h | Refactors Slice helpers into header for non-shared builds; adds int32 indices support path. |
| onnxruntime/core/providers/cpu/tensor/slice.cc | Removes out-of-line Slice helper implementations. |
| onnxruntime/core/providers/cpu/tensor/scatter_nd.h | Makes ValidateShapes inline for non-shared builds; declares for SHARED_PROVIDER. |
| onnxruntime/core/providers/cpu/tensor/scatter_nd.cc | Removes out-of-line ValidateShapes implementation. |
| onnxruntime/core/providers/cpu/tensor/padbase.h | Moves small helpers inline for non-shared builds; keeps SHARED_PROVIDER declarations. |
| onnxruntime/core/providers/cpu/tensor/pad.cc | Removes out-of-line HandleDimValueZero and ComputePads wrappers. |
| onnxruntime/core/providers/cpu/object_detection/non_max_suppression_helper.h | Introduces templated NMS shared helper implementation. |
| onnxruntime/core/providers/cpu/object_detection/non_max_suppression.h | Routes CPU NMS static helpers via templated impl for non-shared builds. |
| onnxruntime/core/providers/cpu/object_detection/non_max_suppression.cc | Removes out-of-line CPU NMS static helper implementations. |
| onnxruntime/contrib_ops/cpu/bert/embed_layer_norm_helper.h | Moves CheckInputs implementation into header for non-shared builds. |
| onnxruntime/contrib_ops/cpu/bert/embed_layer_norm_helper.cc | Deleted (implementation moved to header / bridge path). |
| onnxruntime/contrib_ops/cpu/bert/bias_gelu_helper.h | Moves CheckInputs implementation into header for non-shared builds. |
| onnxruntime/contrib_ops/cpu/bert/bias_gelu_helper.cc | Deleted (implementation moved to header / bridge path). |
…de from .cc to headers (Part 2) (#27628)

## Description

This PR continues the refactoring effort started in PR #27617, moving additional CPU operator helper function implementations from `.cc` files into `.h` headers using the `#ifdef SHARED_PROVIDER` / `#else` inline pattern. This is a prerequisite for the **CUDA Plugin EP** work, where CUDA kernels are built into a standalone shared library (`libonnxruntime_providers_cuda_plugin.so`) that cannot link against the CPU provider's `.cc` object files.

### Why This Refactoring Is Needed

The CUDA Plugin EP compiles CUDA operator kernels into a separate shared library that communicates with the ORT core through the ORT EP Plugin API. In this architecture, kernel source files **cannot** depend on framework-internal symbols that live in the CPU provider static library (`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base classes and call shared helper/validation methods whose implementations currently live in CPU `.cc` files.

In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are accessed through the `ProviderHostCPU` DLL-boundary virtual table bridge. However, the plugin EP does not use this bridge — it uses EP API adapters and force-included headers instead. To make these helpers available in the plugin build without duplicating code, this PR moves the implementations into headers as `inline` functions under `#ifndef SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path retains the existing declaration-only signatures that route through `ProviderHostCPU`.

### Refactoring Patterns Used

1. **Inline move**: Function body moved from `.cc` to `.h`, wrapped in `#ifndef SHARED_PROVIDER` with `inline` linkage. The `#ifdef SHARED_PROVIDER` path keeps the original declaration.
2. **Template-on-context**: Methods like `PrepareCompute`, `PrepareForCompute`, and `GetPresent` are templatized on `KernelContextType` so they work with both `OpKernelContext` (in-tree) and the plugin EP's adapter context.
3. **Template-on-info**: Constructors and initialization methods (e.g., `RoiAlignBase`, `CropBase`, `SpaceDepthBase`) are templatized on `KernelInfoType` with `info.template GetAttr<T>(...)` calls, making them compatible with both `OpKernelInfo` and the plugin's `OpKernelInfoAdapter`.
4. **Helper extraction**: Free helper functions (e.g., `CheckROIAlignValidInput`, `GetAxis`, `AdjustOutputSizeAsPolicy`) moved inline into headers.

## Summary of Changes

### Helper functions moved from `.cc` to `.h` (inline under `#ifndef SHARED_PROVIDER`)

| Operator | Header File | Functions Moved |
|----------|-------------|-----------------|
| **AttentionBase** | `contrib_ops/cpu/bert/attention_base.h` | `AttentionBase::CheckInputs` (both overloads), `AttentionBase::CheckMask`, `AttentionBase::GetPresent` (templatized on `TOpKernelContext`) |
| **LongformerAttentionBase** | `contrib_ops/cpu/bert/longformer_attention_base.h` | `LongformerAttentionBase::CheckInputs` |
| **CumSum** | `cpu/math/cumsum.h` | `GetAxis` (free function) |
| **RoiAlign** | `cpu/object_detection/roialign.h` | `CheckROIAlignValidInput` (free function), `RoiAlignBase` constructor templatized on `TKernelInfo` |
| **Concat** | `cpu/tensor/concatbase.h` | `ConcatBase::PrepareForCompute` (templatized, delegates to `PrepareForComputeImpl`) |
| **Gather** | `cpu/tensor/gatherbase.h` | `GatherBase::PrepareForCompute` (templatized, delegates to `PrepareForComputeImpl`) |
| **Unsqueeze** | `cpu/tensor/unsqueeze.h` | `UnsqueezeBase::PrepareCompute` (templatized on `KernelContextType`) |
| **Upsample** | `cpu/tensor/upsamplebase.h` | `UpsampleBase::AdjustOutputSizeAsPolicy`, `upsamplebase_helper::AdjustOutputSizeAsPolicy` (free helper) |

### Constructor templatization (for plugin EP adapter compatibility)

| Class | Header File | Change |
|-------|-------------|--------|
| **CropBase** | `contrib_ops/cpu/crop.h` | Constructor templatized on `KernelInfoType`; `GetAttrsOrDefault` calls use `info.template` syntax |
| **SpaceDepthBase** | `cpu/tensor/space_depth_ops.h` | Constructor templatized on `KernelInfoType`; `GetAttr` call uses `info.template` syntax |
| **RoiAlignBase** | `cpu/object_detection/roialign.h` | Constructor templatized on `TKernelInfo`; all `GetAttr` calls use `info.template` syntax |

### CUDA-side updates

| File | Change |
|------|--------|
| `cuda/tensor/upsample.cc` | Added explicit template instantiations for `Upsample<float>`, `Upsample<double>`, `Upsample<MLFloat16>`, `Upsample<int32_t>`, `Upsample<uint8_t>` (needed because the `AdjustOutputSizeAsPolicy` implementation moved to the header) |

### Files with code removed (moved to headers)

| Source File | Lines Removed | Moved To |
|-------------|---------------|----------|
| `contrib_ops/cpu/bert/attention_base.cc` | ~333 | `attention_base.h` |
| `contrib_ops/cpu/bert/longformer_attention_base.cc` | ~133 | `longformer_attention_base.h` |
| `cpu/math/cumsum.cc` | ~23 | `cumsum.h` |
| `cpu/object_detection/roialign.cc` | ~74 | `roialign.h` |
| `cpu/tensor/concat.cc` | ~8 | `concatbase.h` |
| `cpu/tensor/gather.cc` | ~4 | `gatherbase.h` |
| `cpu/tensor/unsqueeze.cc` | ~51 | `unsqueeze.h` |
| `cpu/tensor/upsample.cc` | ~44 | `upsamplebase.h` |

## Testing

- Existing unit tests cover all affected operators (Attention, LongformerAttention, CumSum, RoiAlign, Concat, Gather, Unsqueeze, Upsample, Crop, SpaceToDepth/DepthToSpace).
- No behavioral changes — all function logic is identical; only the location (header vs. source) and linkage (inline vs. external) changed.
- The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged — declarations remain and route through the existing `ProviderHostCPU` bridge.

## Motivation and Context

This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels as a standalone shared library that can be updated independently of the ORT core. The refactoring enables additional CUDA operators to compile in the plugin build by making their CPU-side validation and preparation helpers available as header-inline functions. This PR is a direct continuation of PR #27617, which applied the same pattern to Slice, Split, ScatterND, Tile, Pad, BiasGelu, EmbedLayerNorm, and NonMaxSuppression operators.
## Description
This PR refactors several CPU operator helper functions by moving their implementations from `.cc` files into `.h` headers, using the `#ifdef SHARED_PROVIDER` / `#else` inline pattern. This is a prerequisite for the CUDA Plugin EP work, where CUDA kernels are built into a standalone shared library (`libonnxruntime_providers_cuda_plugin.so`) that cannot link against the CPU provider's `.cc` object files.

### Why This Refactoring Is Needed
The CUDA Plugin EP compiles CUDA operator kernels into a separate shared library that communicates with the ORT core through the ORT EP Plugin API. In this architecture, kernel source files cannot depend on framework-internal symbols that live in the CPU provider static library (`libonnxruntime_providers.a`). Many CUDA kernels inherit from CPU base classes and call shared helper/validation methods (e.g., `SliceBase::PrepareForCompute`, `SplitBase::PrepareForCompute`, `ScatterND::ValidateShapes`, `TileOp::IsTileMemcpy`, `PadBase::ComputePads`) whose implementations currently live in CPU `.cc` files.

In the in-tree CUDA EP build (`SHARED_PROVIDER` mode), these helpers are accessed through the `ProviderHostCPU` DLL-boundary virtual table bridge. However, the plugin EP does not use this bridge — it uses EP API adapters and force-included headers instead. To make these helpers available in the plugin build without duplicating code, this PR moves the implementations into headers as `inline` functions under `#ifndef SHARED_PROVIDER` guards. The `SHARED_PROVIDER` (in-tree) build path retains the existing declaration-only signatures that route through `ProviderHostCPU`.

This pattern has already been successfully applied to other operators (e.g., `Einsum`). This PR extends it to the remaining operators that need it.

## Summary of Changes
### Helper functions moved from `.cc` to `.h` (inline under `#ifndef SHARED_PROVIDER`)

| Header File | Functions Moved |
|-------------|-----------------|
| `cpu/tensor/slice.h` | `SliceBase::FlattenOutputDims`, `SliceBase::PrepareForCompute` (both overloads), `SliceBase::FillVectorsFromInput`, `slice_detail::CopyInputData<T>` |
| `cpu/tensor/split.h` | `SplitBase::PrepareForCompute` |
| `cpu/tensor/scatter_nd.h` | `ScatterND::ValidateShapes` |
| `cpu/tensor/tile.h` | `TileOp::IsTileMemcpy` |
| `cpu/tensor/padbase.h` | `PadBase::ComputePadsImpl` (new template method replacing `ComputePads` for cross-context compatibility) |
| `contrib_ops/cpu/bert/bias_gelu_helper.h` | `bias_gelu_helper::CheckInputs` (templatized on context type) |
| `contrib_ops/cpu/bert/embed_layer_norm_helper.h` | `embed_layer_norm::CheckInputs` (templatized on context type) |
| `cpu/object_detection/non_max_suppression.h` + new `non_max_suppression_helper.h` | `NonMaxSuppressionBase` refactored into the `NonMaxSuppressionBaseImpl<KernelInfoType, KernelContextType>` template for plugin compatibility |

### Deleted `.cc` files (implementations moved to headers)

- `contrib_ops/cpu/bert/bias_gelu_helper.cc`
- `contrib_ops/cpu/bert/embed_layer_norm_helper.cc`

### Provider bridge additions
Added `Tensor::DataAsSpan<int32_t>()` support through the shared provider interface (`provider_interfaces.h`, `provider_wrappedtypes.h`, `provider_bridge_ort.cc`). This was needed because `slice_detail::CopyInputData<int32_t>` calls `Tensor::DataAsSpan<int32_t>()`, which was not previously bridged.

### CUDA-side updates
- `cuda/tensor/slice.h`: Updated the `Slice` constructor to use the new `SliceBase(info, dynamic, 0)` overload (template-based constructor compatible with both the adapter and the real `OpKernelInfo`).
- `cuda/tensor/pad.cc`: Updated the call from `PadBase::ComputePads` to `PadBase::ComputePadsImpl`.
- `cuda/tensor/scatter_nd.cc`: Templatized `InitializeElementCountsAndInputDimsSpanOrGpu` on `KernelContextType` (also fixed a typo: `InitiliazeElement...` → `InitializeElement...`).
- `cuda/object_detection/non_max_suppression.h`: Updated to use `NonMaxSuppressionBaseImpl<OpKernelInfo, OpKernelContext>` instead of `NonMaxSuppressionBase`.

### New file
- `cpu/object_detection/non_max_suppression_helper.h`: Contains the template-based `NonMaxSuppressionBaseImpl` class, separating it from the CPU-specific `NonMaxSuppression` kernel registration.

## Testing
- The `SHARED_PROVIDER` code path (in-tree CUDA EP build) is unchanged — declarations remain and route through the existing `ProviderHostCPU` bridge.

## Motivation and Context
This is part of the ongoing CUDA Plugin EP effort to build CUDA kernels as a standalone shared library that can be updated independently of the ORT core. The refactoring enables ~10 additional CUDA operators to compile in the plugin build by making their CPU-side validation and preparation helpers available as header-inline functions.