[CPU/CUDA ep] Improve DeformConv op performance #27824
tianleiwu merged 54 commits into microsoft:main
Conversation
Force-pushed from 860a063 to fea461b
Force-pushed from b14d3b8 to 3623db2
Hi @tianleiwu, sorry to tag you here; could you please help trigger a Copilot code review for me? This commit implements performance optimizations for the DeformConv operator on the main branch, achieving ~65% speedup on CPU and ~30% on GPU. All unit tests have passed locally, lint checks are clean, and the performance report on the real model (BiRefNet) is included above. I've also added additional comments to improve readability and maintainability. If you have any suggestions or would like me to make any changes, I'd be happy to address them promptly.
Force-pushed from 7f4e638 to 37099b6
Pull request overview
This PR improves DeformConv performance across CPU and CUDA execution providers by refactoring hot paths and reducing GPU workspace/kernel overhead, while also adding additional bounds/indexing safeguards and clarifying documentation.
Changes:
- CPU: refactors im2col to precompute and reuse an AoSoA bilinear sampling plan (SIMD-friendly) and adds overflow-safe stride/dimension handling.
- CUDA: removes the GEMM-output staging/copy kernel by writing GEMM results directly into `Y` (zero-copy), and adds a faster bias-add path with an optional 2D kernel launch.
- Shared: introduces `DeformConvValidateAndComputeCommonDims` to centralize derived dimension computation.
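For illustration, one block of the AoSoA sampling plan mentioned above could be laid out as follows. Only `kPlanAoSoALanes` is named in the PR; the struct and field names here are hypothetical, a sketch of the layout idea rather than the actual implementation:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of one AoSoA sampling-plan block. Each field is a
// contiguous 8-wide lane group, matching one 256-bit AVX2 float register.
constexpr int kPlanAoSoALanes = 8;

struct PlanBlock {
  // Flattened input indices of the four bilinear corners, one slot per lane.
  int32_t idx00[kPlanAoSoALanes], idx01[kPlanAoSoALanes];
  int32_t idx10[kPlanAoSoALanes], idx11[kPlanAoSoALanes];
  // Matching bilinear weights (zeroed for out-of-bounds corners).
  float w00[kPlanAoSoALanes], w01[kPlanAoSoALanes];
  float w10[kPlanAoSoALanes], w11[kPlanAoSoALanes];
};

// Gather/interpolate for one lane; because each field is contiguous, a loop
// over the lanes of a block can vectorize to one AVX2 op per field.
inline float SampleLane(const PlanBlock& b, int lane, const float* img) {
  return b.w00[lane] * img[b.idx00[lane]] + b.w01[lane] * img[b.idx01[lane]] +
         b.w10[lane] * img[b.idx10[lane]] + b.w11[lane] * img[b.idx11[lane]];
}
```

Because the plan stores only indices and premultiplied weights, it can be computed once per offset group and reused for every input channel, which is where the CPU speedup comes from.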
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/test/providers/cpu/nn/deform_conv_op_test.cc | Updates test comment to reflect mask semantics more generally. |
| onnxruntime/core/providers/cuda/nn/deform_conv_impl.h | Updates CUDA kernel entry-point docs; removes GEMM-output copy entry point; extends bias API signature. |
| onnxruntime/core/providers/cuda/nn/deform_conv_impl.cu | Implements zero-copy-related kernel changes (im2col masking specialization, bias add rework, 32/64-bit index selection helpers). |
| onnxruntime/core/providers/cuda/nn/deform_conv.cc | Updates host orchestration: new chunk sizing, removes temp GEMM output buffer, writes GEMM directly to NCHW, passes grid-y limit for bias. |
| onnxruntime/core/providers/cpu/nn/deform_conv_attributes.h | Adds shared derived-dim helper used by CPU/CUDA. |
| onnxruntime/core/providers/cpu/nn/deform_conv.cc | Major CPU refactor: sampling plan + SIMD-friendly fill, overflow-checked strides, improved bias add. |
Force-pushed from f938f4c to f892ad8
In the end, I decided not to make too many changes; the remaining optimizations would be practically negligible. I only fixed the issues reported by Copilot, then rebased the changes onto the latest `main`.
Force-pushed from f892ad8 to cf3d79c
Failed runner:
It doesn't seem to be caused by this PR; it might be due to an unstable runner. You can just rerun the CI.
Rebase to latest main branch? Waiting for the merge of the PR titled "Fix WebGPU Windows CI timeouts by removing redundant tests and sharding provider tests" to prevent pipeline timeouts.
Force-pushed from cf3d79c to 8c6ed74
tianleiwu left a comment
Summary
A substantial, well-motivated performance overhaul of CPU and CUDA DeformConv. The CPU path is redesigned around a precomputed AoSoA bilinear sampling plan that amortizes interpolation setup across channels, while the CUDA path eliminates a temp buffer + scatter kernel by writing GEMM output directly to NCHW Y via strided batched GEMM.
Positives:
- AoSoA layout (`kPlanAoSoALanes = 8`) aligns with 256-bit AVX2; the gather/interpolate inner loop can SIMD-unroll 8-wide.
- Plan reuse across channels within an offset group is the key insight that eliminates redundant bilinear coordinate work.
- Zero-copy GEMM output with `cublasGemmStridedBatchedHelper`: the stride algebra is correct for direct NCHW writes.
- `UseMask` promoted to a template parameter eliminates runtime branches from the hottest kernel loop.
- The branchless bilinear interpolation (safe-address + validity masks) eliminates warp divergence.
- Overflow checks (`CheckedMulSizeT`, `CheckedBatchSpan`) are systematic and well-placed.

One high-priority issue (signed integer overflow UB in `CeilDiv`) and a few suggestions below.
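As an aside, the branchless fast-floor and corner-validity tricks praised above can be sketched as follows. `DeformConvFastFloor` is the PR's name; the bodies here are illustrative, not the actual implementation:

```cpp
#include <cassert>

// Sketch of a fast floor: truncate toward zero, then step down once for
// negative fractional values. Avoids a std::floor library call.
inline int FastFloor(float v) {
  const int i = static_cast<int>(v);       // truncation rounds toward zero
  return i - (v < static_cast<float>(i));  // bool converts to 0 or 1
}

// Inverted bounds check: casting to unsigned folds (h >= 0 && h < H) into a
// single compare, so an out-of-range corner simply contributes weight 0 and
// the inner loop stays branch-free.
inline float CornerWeight(int h, int w, int H, int W, float weight) {
  const bool valid = (static_cast<unsigned>(h) < static_cast<unsigned>(H)) &
                     (static_cast<unsigned>(w) < static_cast<unsigned>(W));
  return valid ? weight : 0.0f;
}
```

Evaluating all four corners this way keeps the sampling loop free of unpredictable branches, which is what the review credits for eliminating warp divergence on the CUDA side.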
| # | Severity | Component | Issue |
|---|---|---|---|
| 1 | High | CUDA `GetDeformConvParallelChunkSize` | `CeilDiv` signed integer overflow UB when `N` is large |
| 2 | Suggestion | CUDA zero-copy GEMM | Behavioral change from evenly-divisible to uneven chunks should be documented |
| 3 | Suggestion | CUDA im2col kernel | 5×5 specialization removed; mention in the PR description |
| 4 | Suggestion | Tests | No new test cases for AoSoA tail, prime batch, 7×7 kernel, or overflow paths |
| 5 | Nitpick | CUDA im2col kernel | `offset_byte_offset` is an element offset, not a byte offset |
| 6 | Nitpick | CUDA bias kernel | `max_grid_y > 32` threshold lacks a rationale comment |
| 7 | Nitpick | CPU deform_conv | Plan block allocation is not zero-initialized (currently safe but fragile) |
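On issue 1: the classic `(a + b - 1) / b` form computes `a + b - 1` first, which is signed-overflow UB when `a` is near the type's maximum. Dividing first avoids any intermediate larger than `a`. This is a sketch of the safe rewrite (assuming `a >= 0` and `b > 0`), not the PR's actual `CeilDiv`:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Overflow-safe ceiling division: no intermediate ever exceeds a, so the
// result is well-defined even for a == numeric_limits<T>::max().
template <typename T>
constexpr T CeilDivSafe(T a, T b) {
  return a / b + (a % b != 0 ? 1 : 0);
}
```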
tianleiwu left a comment
Code Review: PR 27824
High-Priority
- `onnxruntime/core/providers/cuda/nn/deform_conv.cc`

  The new balanced chunking helper can now produce a final iteration with `cur_parallel == 1` after earlier iterations used a larger `n_parallel_imgs`, but the `cur_parallel == 1` GEMM fast path still computes `stride_col` from the outer-scope `col_stride` (`kernel_dim * n_parallel_imgs * output_image_size`). `DeformConvIm2ColImpl` repacks `col_buffer` for each iteration using the current `cur_parallel`, so on a one-image tail the actual per-group stride in `col_buffer` is only `kernel_dim * output_image_size`. As soon as `group > 1`, the strided-batched cuBLAS call skips past the compactly written tail buffer and reads uninitialized memory for later groups, corrupting the last chunk. This regression was masked before because the old divisor-based chunking avoided one-image tails after larger chunks. The fix is to derive the group stride from the current iteration (`kernel_dim * cur_out_size`, which collapses to `kernel_dim * output_image_size` here) and to add a CUDA test that combines `group > 1` with a non-divisible batch size.
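The stride derivation the review asks for can be sketched on the host side like this. The names mirror the review text (`cur_parallel`, `cur_out_size`, `kernel_dim`), but the helpers themselves are illustrative, not the PR's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Per-group column stride must come from the CURRENT iteration's chunk size,
// not the outer-scope maximum: col_buffer is repacked each iteration for
// cur_parallel images, so groups are kernel_dim * cur_out_size apart.
int64_t GroupColStride(int64_t kernel_dim, int64_t cur_parallel,
                       int64_t output_image_size) {
  const int64_t cur_out_size = cur_parallel * output_image_size;
  return kernel_dim * cur_out_size;
}

// Host-loop shape from the PR: the tail chunk may hold fewer images, so the
// stride must shrink with it (the buggy path kept the full-chunk stride).
int64_t TailChunkStride(int64_t N, int64_t k, int64_t kernel_dim,
                        int64_t output_image_size) {
  int64_t stride = 0;
  for (int64_t b = 0; b < N; b += k) {
    const int64_t cur_parallel = std::min(k, N - b);
    stride = GroupColStride(kernel_dim, cur_parallel, output_image_size);
  }
  return stride;  // stride used on the final (possibly one-image) chunk
}
```

With `N = 9` and `k = 4`, the chunks are 4, 4, 1; the tail stride is `kernel_dim * output_image_size`, a quarter of the full-chunk stride, which is exactly the discrepancy that corrupts grouped reads when the outer-scope stride is reused.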
Suggestion
- `onnxruntime/core/providers/cpu/nn/deform_conv.cc`
- `onnxruntime/core/providers/cuda/nn/deform_conv_impl.cu`

  The late `ORT_CPU_RESTRICT`/`__restrict__` pass now marks external model inputs (`X`, `offset`, `mask`) as non-aliasing. ONNX Runtime does not guarantee that distinct tensor inputs are backed by distinct memory regions, and DeformConv takes the same element type for all three tensors, so callers can legally bind overlapping views of one allocation. Under those conditions the new qualifiers make the optimized code undefined behavior and give the compiler permission to reorder loads as if aliasing were impossible. Unless the op contract explicitly forbids overlapping input buffers, the restrict annotations should be limited to internally owned temporaries (`data_col`, `sampling_plan_blocks`, etc.) rather than user-provided input tensors.
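The safer split the review suggests can be illustrated as follows. The function and parameter names here are hypothetical; `ORT_CPU_RESTRICT` maps to a compiler-specific `__restrict__` (shown in GCC/Clang form):

```cpp
#include <cassert>

// Risky: promising the compiler that X and offset never alias, when ORT
// allows callers to bind overlapping views of one allocation:
//   void Gather(const float* __restrict__ X, const float* __restrict__ offset);
//
// Safer: restrict only internally owned scratch buffers, which by
// construction never alias user tensors.
void FillColumns(const float* X,                // user input: may alias others
                 float* __restrict__ data_col,  // owned scratch: never aliases
                 int n) {
  for (int i = 0; i < n; ++i) data_col[i] = 2.0f * X[i];
}
```

The point is not that `restrict` is wrong in general, but that the aliasing promise must be backed by an ownership guarantee the runtime actually provides.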
tianleiwu left a comment
Thanks for the cleanup here. The CPU/CUDA refactor is much easier to follow now, but I found one remaining correctness issue in the grouped CUDA tail path and one follow-up coverage gap.
These two issues have been fixed in commit cdb979c, and I've reviewed the code again; there shouldn't be any similar issues left. Could you please review it again? Thank you very much! Besides that, I actually have another question: I've currently set
I also ran separate benchmarks for the hand-coded AVX2 and AVX512 implementations, as well as the AoSoA, AoS, and SoA memory layouts, and found that the differences weren't significant; they can essentially be considered noise. It seems the bottleneck is still random memory reads. I'll leave it at that for now and not make any changes.
tianleiwu left a comment
Thanks for the updates. Current head looks sound overall; I left one small non-blocking suggestion inline around reusing the new checked common-dimension computation before CUDA chunk sizing.
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).






Description
Improve DeformConv op performance
Motivation and Context
This PR consolidates a series of optimizations targeting the `DeformConv` (Deformable Convolution) operator across both CPU and CUDA execution providers.

1. CPU Optimizations & Refactoring
The CPU execution path has been heavily refactored to minimize branching in hot paths, maximize vectorization, and safely handle edge cases.
- Precomputes an AoSoA bilinear sampling plan (lane width `kPlanAoSoALanes`) that is reused across channels during the `im2col` gathering phase.
- Adds `DeformConvKernelMetaCacheData` to cache static convolution geometry (e.g., `kH`, `kW`, `padding`, `dilation`) instead of recomputing it in every `Compute()` step.
- Introduces `DeformConvFastFloor` and an inverted bounds check with bitwise operations to evaluate all four corners simultaneously, removing `std::floor` calls and unpredictable branches from the operator's hottest path.
- Uses `concurrency::ThreadPool::TryParallelFor` to split fine-grained work effectively, drastically improving thread pool scaling.
- Adds the overflow-checked helpers `CheckedMulSizeT` and `CheckedBatchSpan` to keep size computations within `size_t` range, preventing integer overflow vulnerabilities.
- Reduces `div`/`mod` operations in inner loops, applying `ORT_CPU_RESTRICT` and force-inlining.

2. GPU (CUDA) Optimizations
The CUDA implementation was optimized to reduce memory footprint and eliminate unnecessary kernel launches.
- Removes the `gemm_output_buffer` allocation entirely. By carefully configuring the `stride_c` parameter (`stride_c_y = M * output_image_size`), `cublasGemmStridedBatchedHelper` now writes the computed output directly into the correct NCHW memory layout of the final `Y` tensor.
- Removes the `DeformConvCopyGemmOutputRowMajorToNCHW` custom kernel and its associated dispatch logic. This reduces kernel launch overhead, lowers GPU memory bandwidth pressure, and simplifies the overall CUDA execution pipeline.
- Updates the `bytes_per_image` calculation for workspace memory to reflect the removal of the GEMM output buffer, allowing the operator to potentially process more images in parallel under the same memory constraints.

3. Changed
- The chunk size `k` is chosen so that the number of outer rounds is minimized under the temp-memory cap; `k` does not have to divide `N`. The host loop uses `cur_parallel = min(k, N - b)`, so the last chunk may be smaller. This is the intended default behavior for this EP (not yet in a formal release).
- Removed the specialized 5×5 `kH`/`kW` path. Rationale: 5×5 is less common in current stacks (often replaced by stacked 3×3), while specializing 7×7 targets common large-kernel cases. Older DCN/detection models that still use 5×5 deformable conv will take the dynamic path; correctness is unchanged, only compile-time unrolling differs.
- If `Y` overlaps any input buffer, results can be incorrect regardless of `restrict`, because output writes may clobber source elements before they are fully consumed. `restrict` further tightens this by introducing undefined behavior when aliasing assumptions are violated.

Summary
In the current implementation, CPU performance is 33x that of TorchVision (the main branch is 15x). If we implemented AVX2/AVX512 optimizations from scratch, we could reach roughly 36x. However, I haven't found any similar reference code in the ONNX Runtime repository.
This PR also significantly improves parallelism (both ORT and TorchVision are configured with 16 threads).
Open Question for Reviewers
Regarding CUDA Temporary Memory Allocation:
Currently, the effective maximum temporary memory for CUDA is calculated using a heuristic (`total_global_mem * 0.1` or similar logic in `GetDeformConvEffectiveMaxTempBytes`). While the removal of `gemm_output_buffer` has reduced the memory footprint per image, I am not entirely certain whether this 10% threshold is still the most appropriate value for balancing parallel image processing (`n_parallel_imgs`) against overall VRAM consumption in large models.

I would appreciate any feedback or suggestions on whether we should tune this threshold, or whether there is a more robust way to dynamically determine the optimal temporary workspace size for `DeformConv` in ORT.
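For reference, the chunk-size policy described in this PR (minimize outer rounds under the temp-memory cap, then balance the chunks) can be sketched as below. The helper name is illustrative, not the actual `GetDeformConvEffectiveMaxTempBytes` logic, and the cap is assumed to come from a heuristic like `total_global_mem * 0.1`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Pick a chunk size k: first find the largest chunk the memory cap allows,
// derive the fewest rounds achievable with it, then balance k across those
// rounds. k need not divide N; the host loop's min(k, N - b) handles the tail.
int64_t PickChunkSize(int64_t N, int64_t bytes_per_image,
                      int64_t max_temp_bytes) {
  int64_t max_k = std::max<int64_t>(1, max_temp_bytes / bytes_per_image);
  max_k = std::min(max_k, N);
  const int64_t rounds = (N + max_k - 1) / max_k;  // fewest rounds under cap
  return (N + rounds - 1) / rounds;                // balanced chunk size
}
```

For example, with `N = 9` images at 100 bytes each and a 400-byte cap, a naive greedy split would give chunks 4, 4, 1, while this policy gives 3, 3, 3 in the same three rounds; the balanced tail is exactly the behavioral change the review asks to document.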