
[CUDA] Extend Pad support through opset 25 #27708

Closed
tianleiwu wants to merge 5 commits into main from tlwu/20260317/cuda_pad

Conversation

@tianleiwu
Contributor

@tianleiwu tianleiwu commented Mar 17, 2026

Description

This PR updates the CUDA Pad kernel to support the ONNX Pad schema split from opset 18 through opset 25 instead of stopping at the older registration boundary. It also implements CUDA wrap mode support so newer Pad registrations are backed by real kernel behavior, and adds targeted tests to cover the newly supported opset ranges.

Summary of Changes

Kernel registration and opset coverage

| File | Change |
|---|---|
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds CUDA Pad kernel registrations for opset ranges 18, 19-20, 21-22, 23, 24, and 25, matching the current ONNX Pad schema evolution. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers the new Pad kernel versions in the CUDA EP registry and keeps them grouped under the existing per-opset sections for consistency with the rest of the file. |

CUDA Pad implementation

| File | Change |
|---|---|
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends the Pad kernel interface to pass effective sliced extents and per-axis input offsets into the CUDA implementation. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Adds CUDA wrap mode handling for both the general Pad kernel and the NCHW H/W-specialized kernel path, and updates the dispatch logic for the new mode. |
| onnxruntime/core/providers/cuda/tensor/pad.cc | Computes the effective sliced input extents/offsets needed for wrap behavior with negative pads, and routes wrap through the generic implementation instead of the optimized non-wrap-only path. |

Test coverage

| File | Change |
|---|---|
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Adds CUDA-only Pad coverage for edge across opsets 18-25 and wrap across opsets 19-25, and updates the existing wrap test comment to reflect the new CUDA support. |

Testing

  • Built the touched CUDA and test translation units in build/cuda/Release, including pad_impl.cu, pad.cc, cuda_execution_provider.cc, and pad_test.cc.
  • Added CUDA-only coverage for edge mode on opsets 18-25 and wrap mode on opsets 19-25.
  • Full onnxruntime_test_all was not run locally.

Motivation and Context

Related issues: #26393.

Pad evolved after opset 18 in ways that matter for CUDA placement: opset 19 introduced wrap, and later opsets continued the schema/version split while broadening supported types. Before this change, CUDA Pad registration did not line up with those newer schemas, and CUDA did not implement wrap, which made newer Pad models fall back or remain unsupported on the CUDA execution provider. This change aligns CUDA registration with the ONNX Pad versions now used by the runtime and makes the exposed support match actual kernel behavior.

Checklist

  • Tests added/updated
  • Documentation updated (if applicable)
  • No breaking changes (or documented in description)
  • CI passes

@hariharans29
Member

Is this also related - #27416 ?

Contributor

Copilot AI left a comment


Pull request overview

This PR extends the CUDA Pad kernel’s ONNX opset coverage through opset 25, aligning CUDA registrations with the post-opset-18 ONNX schema splits, and adds CUDA wrap mode behavior plus targeted CUDA-only tests for the newly supported opset ranges.

Changes:

  • Added CUDA Pad kernel registrations for opsets 18, 19–20, 21–22, 23, 24, and 25 (and updated CUDA EP kernel registry accordingly).
  • Extended the CUDA Pad implementation to support wrap mode, including handling negative pads (slicing) via effective extents and per-axis input offsets.
  • Added CUDA-only tests to validate edge (opsets 18–25) and wrap (opsets 19–25) behavior, and updated operator kernel documentation to reflect the new opset splits.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds versioned kernel registrations through opset 25; computes effective extents/offsets and routes wrap through the generic implementation. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends PadImpl interface to accept effective extents and input offsets. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Implements wrap coordinate handling for the generic pad kernel (and adds a wrap branch in the NCHW kernel). |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Declares/registers the newly versioned CUDA Pad kernels for opsets 18–25. |
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Adds CUDA-only tests covering the newly supported opset ranges for edge and wrap. |
| docs/OperatorKernels.md | Updates the published CUDA kernel opset coverage for Pad to reflect the new version splits up to opset 25. |


@tianleiwu
Contributor Author

Is this also related - #27416 ?

Yes, this is related to #27416 and overlaps in the same CUDA Pad support area.

From what I checked, #27416 adds CUDA Pad support through opset 23, while this PR supports through opset 25 and includes the OperatorKernels doc update. The wrap implementations also differ. Let me do some comparison to decide whether to consolidate or supersede one of the two PRs.

@tianleiwu tianleiwu marked this pull request as ready for review March 19, 2026 18:03
@tianleiwu tianleiwu requested a review from Copilot March 19, 2026 18:03
Contributor

Copilot AI left a comment


Pull request overview

Extends CUDA Pad to align with ONNX Pad schema splits through opset 25 and adds CUDA wrap mode implementation, with targeted CUDA-only tests for the newly supported opset ranges.

Changes:

  • Register CUDA Pad kernels across opset ranges 18, 19–20, 21–22, 23, 24, and 25.
  • Implement CUDA wrap mode support and plumb effective sliced extents/offsets into the CUDA kernels.
  • Add CUDA-only tests for edge (opset 18–25) and wrap (opset 19–25).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Adds CUDA-only Pad tests for edge/wrap across supported opsets and updates wrap-mode comment. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends PadImpl interface to accept effective extents and per-axis offsets. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Implements wrap mode coordinate mapping and updates kernel dispatch. |
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds per-opset kernel registrations and computes extents/offsets for wrap behavior; routes wrap via generic path. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers the new per-opset CUDA Pad kernel variants in the EP registry. |
| docs/OperatorKernels.md | Updates documented CUDA Pad opset coverage to match new registrations. |
Comments suppressed due to low confidence (1)

onnxruntime/core/providers/cuda/tensor/pad.cc:1

  • effective_input_extents and input_offsets are now passed into the CUDA kernel for all pad modes, even though only wrap uses them. This increases kernel parameter size and can increase register/constant memory pressure for common modes (e.g., constant), potentially reducing occupancy. Consider splitting into two kernel entry points/signatures: one specialized for non-wrap (original parameter list) and one for wrap (extended parameters), dispatching based on mode_.


Wrap
};

__device__ __forceinline__ int64_t WrapCoordinate(int64_t coord, int64_t extent) {
Comment on lines +51 to +78
if (out_coord < lower_pads[dim]) {
switch ((PadMode)pad_mode) {
case PadMode::Constant:
use_pad_value = true;
break;
case PadMode::Edge:
in_coord = 0;
break;
case PadMode::Reflect:
in_coord = lower_pads[dim] - out_coord;
break;
case PadMode::Wrap:
break;
}
} else if (out_coord >= lower_pads[dim] + input_dims[dim]) {
switch ((PadMode)pad_mode) {
case PadMode::Constant:
use_pad_value = true;
break;
case PadMode::Edge:
in_coord = input_dims[dim] - 1;
break;
case PadMode::Reflect:
in_coord = input_dims[dim] - 2 - (out_coord - (lower_pads[dim] + input_dims[dim]));
break;
case PadMode::Wrap:
break;
}
Comment on lines 825 to +833
|PRelu|*in* X:**T**<br> *in* slope:**T**<br> *out* Y:**T**|16+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[9, 15]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[7, 8]|**T** = tensor(double), tensor(float), tensor(float16)|
|Pad|*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *in* axes:**Tind**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *out* output:**T**|18+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|Pad|*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *in* axes:**Tind**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *out* output:**T**|25+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||24|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||23|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||[21, 22]|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||[19, 20]|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||18|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
@tianleiwu
Contributor Author

This PR is superseded by #27774

@tianleiwu tianleiwu closed this Mar 19, 2026