
[CUDA] Extend Pad support through opset 25 #27708

Closed
tianleiwu wants to merge 5 commits into main from tlwu/20260317/cuda_pad

Conversation

@tianleiwu
Contributor

@tianleiwu tianleiwu commented Mar 17, 2026

Description

This PR updates the CUDA Pad kernel to support the ONNX Pad schema split from opset 18 through opset 25 instead of stopping at the older registration boundary. It also implements CUDA wrap mode support so newer Pad registrations are backed by real kernel behavior, and adds targeted tests to cover the newly supported opset ranges.

Summary of Changes

Kernel registration and opset coverage

| File | Change |
|---|---|
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds CUDA Pad kernel registrations for opset ranges 18, 19-20, 21-22, 23, 24, and 25, matching the current ONNX Pad schema evolution. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers the new Pad kernel versions in the CUDA EP registry and keeps them grouped under the existing per-opset sections for consistency with the rest of the file. |

CUDA Pad implementation

| File | Change |
|---|---|
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends the Pad kernel interface to pass effective sliced extents and per-axis input offsets into the CUDA implementation. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Adds CUDA wrap mode handling for both the general Pad kernel and the NCHW H/W-specialized kernel path, and updates the dispatch logic for the new mode. |
| onnxruntime/core/providers/cuda/tensor/pad.cc | Computes the effective sliced input extents/offsets needed for wrap behavior with negative pads, and routes wrap through the generic implementation instead of the optimized non-wrap-only path. |

Test coverage

| File | Change |
|---|---|
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Adds CUDA-only Pad coverage for edge across opsets 18-25 and wrap across opsets 19-25, and updates the existing wrap test comment to reflect the new CUDA support. |

Testing

  • Built the touched CUDA and test translation units in build/cuda/Release, including pad_impl.cu, pad.cc, cuda_execution_provider.cc, and pad_test.cc.
  • Added CUDA-only coverage for edge mode on opsets 18-25 and wrap mode on opsets 19-25.
  • Full onnxruntime_test_all was not run locally.

Motivation and Context

Related issues: #26393.

Pad evolved after opset 18 in ways that matter for CUDA placement: opset 19 introduced wrap, and later opsets continued the schema/version split while broadening supported types. Before this change, CUDA Pad registration did not line up with those newer schemas, and CUDA did not implement wrap, which made newer Pad models fall back or remain unsupported on the CUDA execution provider. This change aligns CUDA registration with the ONNX Pad versions now used by the runtime and makes the exposed support match actual kernel behavior.

Checklist

  • Tests added/updated
  • Documentation updated (if applicable)
  • No breaking changes (or documented in description)
  • CI passes

@hariharans29
Member

Is this also related - #27416 ?

Contributor

Copilot AI left a comment


Pull request overview

This PR extends the CUDA Pad kernel’s ONNX opset coverage through opset 25, aligning CUDA registrations with the post-opset-18 ONNX schema splits, and adds CUDA wrap mode behavior plus targeted CUDA-only tests for the newly supported opset ranges.

Changes:

  • Added CUDA Pad kernel registrations for opsets 18, 19–20, 21–22, 23, 24, and 25 (and updated CUDA EP kernel registry accordingly).
  • Extended the CUDA Pad implementation to support wrap mode, including handling negative pads (slicing) via effective extents and per-axis input offsets.
  • Added CUDA-only tests to validate edge (opsets 18–25) and wrap (opsets 19–25) behavior, and updated operator kernel documentation to reflect the new opset splits.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds versioned kernel registrations through opset 25; computes effective extents/offsets and routes wrap through the generic implementation. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends PadImpl interface to accept effective extents and input offsets. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Implements wrap coordinate handling for the generic pad kernel (and adds a wrap branch in the NCHW kernel). |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Declares/registers the newly versioned CUDA Pad kernels for opsets 18–25. |
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Adds CUDA-only tests covering the newly supported opset ranges for edge and wrap. |
| docs/OperatorKernels.md | Updates the published CUDA kernel opset coverage for Pad to reflect the new version splits up to opset 25. |


@tianleiwu
Contributor Author

Is this also related - #27416 ?

Yes, this is related to #27416 and overlaps in the same CUDA Pad support area.

From what I checked, #27416 adds CUDA Pad support through opset 23, while this PR supports through opset 25 and includes the OperatorKernels doc update. The wrap implementations also differ. Let me do some comparison to decide whether to consolidate or supersede one of the two PRs.

@tianleiwu tianleiwu marked this pull request as ready for review March 19, 2026 18:03
@tianleiwu tianleiwu requested a review from Copilot March 19, 2026 18:03
Contributor

Copilot AI left a comment


Pull request overview

Extends CUDA Pad to align with ONNX Pad schema splits through opset 25 and adds CUDA wrap mode implementation, with targeted CUDA-only tests for the newly supported opset ranges.

Changes:

  • Register CUDA Pad kernels across opset ranges 18, 19–20, 21–22, 23, 24, and 25.
  • Implement CUDA wrap mode support and plumb effective sliced extents/offsets into the CUDA kernels.
  • Add CUDA-only tests for edge (opset 18–25) and wrap (opset 19–25).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/tensor/pad_test.cc | Adds CUDA-only Pad tests for edge/wrap across supported opsets and updates wrap-mode comment. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.h | Extends PadImpl interface to accept effective extents and per-axis offsets. |
| onnxruntime/core/providers/cuda/tensor/pad_impl.cu | Implements wrap mode coordinate mapping and updates kernel dispatch. |
| onnxruntime/core/providers/cuda/tensor/pad.cc | Adds per-opset kernel registrations and computes extents/offsets for wrap behavior; routes wrap via generic path. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers the new per-opset CUDA Pad kernel variants in the EP registry. |
| docs/OperatorKernels.md | Updates documented CUDA Pad opset coverage to match new registrations. |
Comments suppressed due to low confidence (1)

onnxruntime/core/providers/cuda/tensor/pad.cc:1

  • effective_input_extents and input_offsets are now passed into the CUDA kernel for all pad modes, even though only wrap uses them. This increases kernel parameter size and can increase register/constant memory pressure for common modes (e.g., constant), potentially reducing occupancy. Consider splitting into two kernel entry points/signatures: one specialized for non-wrap (original parameter list) and one for wrap (extended parameters), dispatching based on mode_.


Wrap
};

__device__ __forceinline__ int64_t WrapCoordinate(int64_t coord, int64_t extent) {
Comment on lines +51 to +78
if (out_coord < lower_pads[dim]) {
switch ((PadMode)pad_mode) {
case PadMode::Constant:
use_pad_value = true;
break;
case PadMode::Edge:
in_coord = 0;
break;
case PadMode::Reflect:
in_coord = lower_pads[dim] - out_coord;
break;
case PadMode::Wrap:
break;
}
} else if (out_coord >= lower_pads[dim] + input_dims[dim]) {
switch ((PadMode)pad_mode) {
case PadMode::Constant:
use_pad_value = true;
break;
case PadMode::Edge:
in_coord = input_dims[dim] - 1;
break;
case PadMode::Reflect:
in_coord = input_dims[dim] - 2 - (out_coord - (lower_pads[dim] + input_dims[dim]));
break;
case PadMode::Wrap:
break;
}
Comment on lines 825 to +833
|PRelu|*in* X:**T**<br> *in* slope:**T**<br> *out* Y:**T**|16+|**T** = tensor(double), tensor(float), tensor(float16)|
|||[9, 15]|**T** = tensor(double), tensor(float), tensor(float16)|
|||[7, 8]|**T** = tensor(double), tensor(float), tensor(float16)|
|Pad|*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *in* axes:**Tind**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *out* output:**T**|18+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|Pad|*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *in* axes:**Tind**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *in* pads:**tensor(int64)**<br> *in* constant_value:**T**<br> *out* output:**T**<br><br>or<br><br>*in* data:**T**<br> *out* output:**T**|25+|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||24|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||23|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||[21, 22]|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||[19, 20]|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
|||18|**T** = tensor(bool), tensor(double), tensor(float), tensor(float16)|
@tianleiwu
Contributor Author

This PR is superseded by #27774

@tianleiwu tianleiwu closed this Mar 19, 2026