[NVPTX] Use PRMT instruction to lower i16 bswap #168968

Chengjunp · 2025-11-20T23:24:09Z

Previously, i16 bswap was lowered using multiple shift and OR operations. This patch adds a pattern to directly lower i16 bswap using the PRMT (permute) instruction, which is more efficient.

llvmbot · 2025-11-20T23:24:32Z

@llvm/pr-subscribers-backend-nvptx

Author: Chengjun (Chengjunp)

Changes

Previously, i16 bswap was lowered using multiple shift and OR operations. This patch adds a pattern to directly lower i16 bswap using the PRMT (permute) instruction, which is more efficient.

Full diff: https://github.com/llvm/llvm-project/pull/168968.diff

3 Files Affected:

(modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (-2)
(modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+4)
(modified) llvm/test/CodeGen/NVPTX/bswap.ll (+7-7)

diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index a77eb0240e677..cc675ff6ff7c7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -713,8 +713,6 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
                        Custom);
   }
 
-  setOperationAction(ISD::BSWAP, MVT::i16, Expand);
-
   setOperationAction(ISD::BR_JT, MVT::Other, Custom);
   setOperationAction(ISD::BRIND, MVT::Other, Expand);
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 8b129e7e5eeae..2654d3fbef6f6 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -2477,6 +2477,10 @@ include "NVPTXIntrinsics.td"
 //   unpack). sm_20 supports native 32-bit register, but not native 16-bit
 // register.
 
+def : Pat <
+  (i16 (bswap i16:$a)),
+  (i16 (CVT_u16_u32 (PRMT_B32rii (i32 (CVT_u32_u16 $a, CvtNONE)), (i32 0), (i32 0x0001), PrmtNONE), CvtNONE))>;
+
 def : Pat <
   (i32 (bswap i32:$a)),
   (PRMT_B32rii $a, (i32 0), (i32 0x0123), PrmtNONE)>;
diff --git a/llvm/test/CodeGen/NVPTX/bswap.ll b/llvm/test/CodeGen/NVPTX/bswap.ll
index e3d1c80922609..a12deed544642 100644
--- a/llvm/test/CodeGen/NVPTX/bswap.ll
+++ b/llvm/test/CodeGen/NVPTX/bswap.ll
@@ -10,16 +10,16 @@ target triple = "nvptx64-nvidia-cuda"
 define i16 @bswap16(i16 %a) {
 ; CHECK-LABEL: bswap16(
 ; CHECK:       {
-; CHECK-NEXT:    .reg .b16 %rs<5>;
-; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-NEXT:    .reg .b16 %rs<3>;
+; CHECK-NEXT:    .reg .b32 %r<4>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.b16 %rs1, [bswap16_param_0];
-; CHECK-NEXT:    shr.u16 %rs2, %rs1, 8;
-; CHECK-NEXT:    shl.b16 %rs3, %rs1, 8;
-; CHECK-NEXT:    or.b16 %rs4, %rs3, %rs2;
-; CHECK-NEXT:    cvt.u32.u16 %r1, %rs4;
-; CHECK-NEXT:    st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT:    cvt.u32.u16 %r1, %rs1;
+; CHECK-NEXT:    prmt.b32 %r2, %r1, 0, 0x1U;
+; CHECK-NEXT:    cvt.u16.u32 %rs2, %r2;
+; CHECK-NEXT:    cvt.u32.u16 %r3, %rs2;
+; CHECK-NEXT:    st.param.b32 [func_retval0], %r3;
 ; CHECK-NEXT:    ret;
   %b = tail call i16 @llvm.bswap.i16(i16 %a)
   ret i16 %b

AlexMaclean

LGTM.

If you wanted to go above and beyond, I think it would be best to instead implement this (and all other bswap expansion) as a custom legalization to allow for DAGCombiner to optimize them.

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

Chengjunp · 2025-11-20T23:51:25Z

If you wanted to go above and beyond, I think it would be best to instead implement this (and all other bswap expansion) as a custom legalization to allow for DAGCombiner to optimize them.

I see. Do you mean that we just lower all the bswap into bit operations first and let DAGCombiner decide how to optimize using PRMT?

Artem-B

I'm curious whether this makes any difference on the SASS level. The old shift/or pattern looks simple enough for ptxas to deal with.

github-actions · 2025-11-21T00:09:42Z

🐧 Linux x64 Test Results

186431 tests passed
4864 tests skipped

Chengjunp · 2025-11-21T00:30:18Z

I'm curious whether this makes any difference on the SASS level. The old shift/or pattern looks simple enough for ptxas to deal with.

Here is the sass comparison. We do not have the optimization for the old pattern. We can add those kinds of folding in the DAGCombiner in the next step, but I think it may be beyond the scope of this change.

Artem-B · 2025-11-21T00:55:01Z

Here is the sass comparison.

Fascinating. New code is clearly an improvement, but it looks like we can (sometimes?) make it even better. Right now ptxas generates prmt to extend i16->i32, and only then swaps the bytes around.

 PRMT R4, R4, 0x7710, RZ 
 PRMT R4, R4, 0x7701, RZ

We only need one PRMT for that. Unfortunately, I see no easy way to avoid cvt.u32.u16. The only way that I managed to make ptxas generate only one PRMT is by using an extending load: https://godbolt.org/z/6PWx3hxda

Looks like this may need to be handled by ptxas.
I mean -- we can convert load i16 + cvt into an extending load, and that would be a useful peephole optimization, but in general case byteswap may not be in immediate proximity to benefit from that, and we'll still end up with the redundant PRMT.
What we can't do is tell ptxas -- "here's an i16 argument, but it already lives in an i32 register under the hood, please let me use it as an argument of a 32-bit PRMT".

AlexMaclean · 2025-11-21T03:18:31Z

If you wanted to go above and beyond, I think it would be best to instead implement this (and all other bswap expansion) as a custom legalization to allow for DAGCombiner to optimize them.

I see. Do you mean that we just lower all the bswap into bit operations first and let DAGCombiner decide how to optimize using PRMT?

sorta, something along the lines of LowerVECTOR_SHUFFLE. That is use the NVPTXISD::PRTM node plus zext, trunc to represent this same expansion during operation legalization (instead of in ISel where it is now). Does that make sense?

Use PRMT instruction to lower i16 bswap

c396152

Chengjunp requested review from AlexMaclean and Prince781 November 20, 2025 23:24

Chengjunp self-assigned this Nov 20, 2025

Chengjunp added the backend:NVPTX label Nov 20, 2025

Merge branch 'main' into dev/chengjunp/bswap16

f29d093

AlexMaclean approved these changes Nov 20, 2025

View reviewed changes

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td Outdated Show resolved Hide resolved

AlexMaclean requested a review from Artem-B November 20, 2025 23:45

Artem-B approved these changes Nov 21, 2025

View reviewed changes

Format

2de19db

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[NVPTX] Use PRMT instruction to lower i16 bswap #168968

[NVPTX] Use PRMT instruction to lower i16 bswap #168968

Uh oh!

Chengjunp commented Nov 20, 2025

Uh oh!

llvmbot commented Nov 20, 2025

Uh oh!

AlexMaclean left a comment

Uh oh!

Uh oh!

Chengjunp commented Nov 20, 2025

Uh oh!

Artem-B left a comment

Uh oh!

github-actions bot commented Nov 21, 2025

Uh oh!

Chengjunp commented Nov 21, 2025

Uh oh!

Artem-B commented Nov 21, 2025 •

edited

Loading

Uh oh!

AlexMaclean commented Nov 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[NVPTX] Use PRMT instruction to lower i16 bswap #168968

Are you sure you want to change the base?

[NVPTX] Use PRMT instruction to lower i16 bswap #168968

Uh oh!

Conversation

Chengjunp commented Nov 20, 2025

Uh oh!

llvmbot commented Nov 20, 2025

Uh oh!

AlexMaclean left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Chengjunp commented Nov 20, 2025

Uh oh!

Artem-B left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Nov 21, 2025

🐧 Linux x64 Test Results

Uh oh!

Chengjunp commented Nov 21, 2025

Uh oh!

Artem-B commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AlexMaclean commented Nov 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Artem-B commented Nov 21, 2025 •

edited

Loading