[CUDA] Extend Pad support through opset 25 (#27708)
Conversation
Is this also related - #27416?
Pull request overview
This PR extends the CUDA Pad kernel’s ONNX opset coverage through opset 25, aligning CUDA registrations with the post-opset-18 ONNX schema splits, and adds CUDA wrap mode behavior plus targeted CUDA-only tests for the newly supported opset ranges.
Changes:
- Added CUDA `Pad` kernel registrations for opsets 18, 19–20, 21–22, 23, 24, and 25 (and updated the CUDA EP kernel registry accordingly).
- Extended the CUDA `Pad` implementation to support `wrap` mode, including handling negative pads (slicing) via effective extents and per-axis input offsets.
- Added CUDA-only tests to validate `edge` (opsets 18–25) and `wrap` (opsets 19–25) behavior, and updated operator kernel documentation to reflect the new opset splits.
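The negative-pad (slicing) handling mentioned above can be sketched as a small per-axis computation. This is a hedged sketch of the assumed semantics, not the PR's actual code; `EffectiveExtentAndOffset` is a hypothetical name:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not the PR's actual function): ONNX Pad allows
// negative pad values, which slice the input instead of padding it. Per
// axis, fold the negative parts into an effective extent plus a starting
// offset so the kernel only ever reads the surviving slice of the input.
void EffectiveExtentAndOffset(int64_t extent, int64_t lower_pad, int64_t upper_pad,
                              int64_t& effective_extent, int64_t& input_offset) {
  input_offset = lower_pad < 0 ? -lower_pad : 0;        // elements sliced off the front
  int64_t trim_upper = upper_pad < 0 ? -upper_pad : 0;  // elements sliced off the back
  effective_extent = std::max<int64_t>(0, extent - input_offset - trim_upper);
}
```

For example, an axis of extent 5 with pads (-1, -2) yields offset 1 and effective extent 2, i.e. the middle slice of the input.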
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds versioned kernel registrations through opset 25; computes effective extents/offsets and routes wrap through the generic implementation. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends the PadImpl interface to accept effective extents and input offsets. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Implements wrap coordinate handling for the generic pad kernel (and adds a wrap branch in the NCHW kernel). |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Declares/registers the newly versioned CUDA Pad kernels for opsets 18–25. |
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Adds CUDA-only tests covering the newly supported opset ranges for edge and wrap. |
| docs/OperatorKernels.md | Updates the published CUDA kernel opset coverage for Pad to reflect the new version splits up to opset 25. |
Yes, this is related to #27416 and overlaps in the same CUDA Pad support area. From what I checked, #27416 adds CUDA Pad support through opset 23, while this PR supports through opset 25 and includes the OperatorKernels doc update. The wrap implementation is also different. Let me do a comparison to decide whether to consolidate the two PRs or supersede one of them.
Pull request overview
Extends CUDA Pad to align with ONNX Pad schema splits through opset 25 and adds CUDA wrap mode implementation, with targeted CUDA-only tests for the newly supported opset ranges.
Changes:
- Register CUDA Pad kernels across opset ranges 18, 19–20, 21–22, 23, 24, and 25.
- Implement CUDA `wrap` mode support and plumb effective sliced extents/offsets into the CUDA kernels.
- Add CUDA-only tests for `edge` (opsets 18–25) and `wrap` (opsets 19–25).
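The heart of `wrap` mode is mapping an out-of-range coordinate back into the input by tiling. A minimal host-side sketch of that arithmetic follows; the PR's device code declares a `WrapCoordinate` helper, but this standalone version only assumes the standard modulo-normalization approach, not the PR's exact body:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of wrap-mode coordinate folding: maps a coordinate that may be
// negative or >= extent into [0, extent) by tiling the input.
int64_t WrapCoordinate(int64_t coord, int64_t extent) {
  int64_t r = coord % extent;  // in C++, % keeps the dividend's sign
  if (r < 0) r += extent;      // normalize into [0, extent)
  return r;
}
```

The explicit normalization matters because C++ `%` returns a negative remainder for negative coordinates, which occur in the leading pad region.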
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Adds CUDA-only Pad tests for edge/wrap across supported opsets and updates wrap-mode comment. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends PadImpl interface to accept effective extents and per-axis offsets. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Implements wrap mode coordinate mapping and updates kernel dispatch. |
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds per-opset kernel registrations and computes extents/offsets for wrap behavior; routes wrap via generic path. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers the new per-opset CUDA Pad kernel variants in the EP registry. |
| docs/OperatorKernels.md | Updates documented CUDA Pad opset coverage to match new registrations. |
Comments suppressed due to low confidence (1)
onnxruntime/core/providers/cuda/tensor/pad.cc:1
`effective_input_extents` and `input_offsets` are now passed into the CUDA kernel for all pad modes, even though only `wrap` uses them. This increases kernel parameter size and can increase register/constant-memory pressure for the common modes (e.g., `constant`), potentially reducing occupancy. Consider splitting into two kernel entry points/signatures: one specialized for non-wrap (original parameter list) and one for wrap (extended parameters), dispatching based on `mode_`.
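The suggested split could look like the following. This is a sketch with hypothetical names; the real entry points would be CUDA kernel launches carrying the full parameter lists rather than functions returning strings:

```cpp
#include <cassert>
#include <string>

enum class PadMode { Constant, Edge, Reflect, Wrap };

// Stand-ins for the two kernel entry points. In a real implementation these
// would be CUDA launches: the non-wrap one keeping the original parameter
// list, the wrap one taking the extended extents/offsets parameters.
std::string PadImplNonWrap() { return "non-wrap"; }
std::string PadImplWrap() { return "wrap"; }

// Host-side dispatch on the pad mode, so constant/edge/reflect launches
// never pay for the wrap-only parameters.
std::string DispatchPad(PadMode mode) {
  return mode == PadMode::Wrap ? PadImplWrap() : PadImplNonWrap();
}
```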
```cpp
  // ...
  Wrap
};

__device__ __forceinline__ int64_t WrapCoordinate(int64_t coord, int64_t extent) {
  // ...
}

if (out_coord < lower_pads[dim]) {
  switch ((PadMode)pad_mode) {
    case PadMode::Constant:
      use_pad_value = true;
      break;
    case PadMode::Edge:
      in_coord = 0;
      break;
    case PadMode::Reflect:
      in_coord = lower_pads[dim] - out_coord;
      break;
    case PadMode::Wrap:
      break;
  }
} else if (out_coord >= lower_pads[dim] + input_dims[dim]) {
  switch ((PadMode)pad_mode) {
    case PadMode::Constant:
      use_pad_value = true;
      break;
    case PadMode::Edge:
      in_coord = input_dims[dim] - 1;
      break;
    case PadMode::Reflect:
      in_coord = input_dims[dim] - 2 - (out_coord - (lower_pads[dim] + input_dims[dim]));
      break;
    case PadMode::Wrap:
      break;
  }
}
```
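A standalone host-side restatement of the branch logic above may help make it concrete. This is a sketch: `MapCoordinate` is an illustrative name, constant mode is omitted because it reads the pad value rather than an input coordinate, and the wrap branches are filled in here with modulo normalization, which the device snippet instead leaves to its `WrapCoordinate` helper:

```cpp
#include <cassert>
#include <cstdint>

enum class PadMode { Edge, Reflect, Wrap };

// For one axis: given an output coordinate, the lower pad, and the input
// extent, return the input coordinate the given pad mode reads from.
int64_t MapCoordinate(PadMode mode, int64_t out_coord, int64_t lower_pad, int64_t extent) {
  int64_t in_coord = out_coord - lower_pad;  // identity mapping when in range
  if (out_coord < lower_pad) {               // output lies in the leading pad
    switch (mode) {
      case PadMode::Edge:    in_coord = 0; break;
      case PadMode::Reflect: in_coord = lower_pad - out_coord; break;
      case PadMode::Wrap:    in_coord = ((in_coord % extent) + extent) % extent; break;
    }
  } else if (out_coord >= lower_pad + extent) {  // output lies in the trailing pad
    switch (mode) {
      case PadMode::Edge:    in_coord = extent - 1; break;
      case PadMode::Reflect: in_coord = extent - 2 - (out_coord - (lower_pad + extent)); break;
      case PadMode::Wrap:    in_coord = ((in_coord % extent) + extent) % extent; break;
    }
  }
  return in_coord;
}
```

For input `[a, b, c]` (extent 3) with lower pad 2, wrap produces `[b, c, a, b, c, a, b]`, so output position 0 maps to input index 1, matching the modulo arithmetic above.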
|PRelu|*in* X:**T**<br> *in* slope:**T**<br> *out* Y:**T**|16+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[9, 15]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[7, 8]|**T** = tensor(double), tensor(float), tensor(float16)|
|Pad|*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *in* axes:**Tind**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *out* output:**T**|25+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||24|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||23|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||[21, 22]|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||[19, 20]|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||18|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
This PR is superseded by #27774
Description
This PR updates the CUDA Pad kernel to support the ONNX Pad schema split from opset 18 through opset 25 instead of stopping at the older registration boundary. It also implements CUDA `wrap` mode support so the newer Pad registrations are backed by real kernel behavior, and adds targeted tests to cover the newly supported opset ranges.
Summary of Changes
Kernel registration and opset coverage
- `onnxruntime/core/providers/cuda/tensor/pad.cc`: adds versioned kernel registrations for opsets 18, 19–20, 21–22, 23, 24, and 25, matching the current ONNX Pad schema evolution.
- `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`: registers the new per-opset kernel variants in the CUDA EP kernel registry.

CUDA Pad implementation
- `onnxruntime/core/providers/cuda/tensor/pad_impl.h`: extends the `PadImpl` interface to accept effective extents and per-axis input offsets.
- `onnxruntime/core/providers/cuda/tensor/pad_impl.cu`: adds `wrap` mode handling for both the general Pad kernel and the NCHW H/W-specialized kernel path, and updates the dispatch logic for the new mode.
- `onnxruntime/core/providers/cuda/tensor/pad.cc`: routes `wrap` through the generic implementation instead of the optimized non-wrap-only path.

Test coverage
- `onnxruntime/test/providers/cpu/tensor/pad_test.cc`: adds CUDA-only tests for `edge` across opsets 18–25 and `wrap` across opsets 19–25, and updates the existing wrap test comment to reflect the new CUDA support.

Testing
- Built in `build/cuda/Release`, including `pad_impl.cu`, `pad.cc`, `cuda_execution_provider.cc`, and `pad_test.cc`.
- Validated `edge` mode on opsets 18-25 and `wrap` mode on opsets 19-25.
- The full `onnxruntime_test_all` suite was not run locally.

Motivation and Context
Related issues: #26393.
Pad evolved after opset 18 in ways that matter for CUDA placement: opset 19 introduced `wrap`, and later opsets continued the schema/version split while broadening supported types. Before this change, CUDA Pad registration did not line up with those newer schemas, and CUDA did not implement `wrap`, which made newer Pad models fall back or remain unsupported on the CUDA execution provider. This change aligns CUDA registration with the ONNX Pad versions now used by the runtime and makes the exposed support match actual kernel behavior.