Skip to content

Conversation

@Chengjunp
Copy link
Contributor

Previously, i16 bswap was lowered using multiple shift and OR operations. This patch adds a pattern to directly lower i16 bswap using the PRMT (permute) instruction, which is more efficient.

@llvmbot
Copy link
Member

llvmbot commented Nov 20, 2025

@llvm/pr-subscribers-backend-nvptx

Author: Chengjun (Chengjunp)

Changes

Previously, i16 bswap was lowered using multiple shift and OR operations. This patch adds a pattern to directly lower i16 bswap using the PRMT (permute) instruction, which is more efficient.


Full diff: https://github.com/llvm/llvm-project/pull/168968.diff

3 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (-2)
  • (modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+4)
  • (modified) llvm/test/CodeGen/NVPTX/bswap.ll (+7-7)
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index a77eb0240e677..cc675ff6ff7c7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -713,8 +713,6 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
                        Custom);
   }
 
-  setOperationAction(ISD::BSWAP, MVT::i16, Expand);
-
   setOperationAction(ISD::BR_JT, MVT::Other, Custom);
   setOperationAction(ISD::BRIND, MVT::Other, Expand);
 
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 8b129e7e5eeae..2654d3fbef6f6 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -2477,6 +2477,10 @@ include "NVPTXIntrinsics.td"
 //   unpack). sm_20 supports native 32-bit register, but not native 16-bit
 // register.
 
+def : Pat <
+  (i16 (bswap i16:$a)),
+  (i16 (CVT_u16_u32 (PRMT_B32rii (i32 (CVT_u32_u16 $a, CvtNONE)), (i32 0), (i32 0x0001), PrmtNONE), CvtNONE))>;
+
 def : Pat <
   (i32 (bswap i32:$a)),
   (PRMT_B32rii $a, (i32 0), (i32 0x0123), PrmtNONE)>;
diff --git a/llvm/test/CodeGen/NVPTX/bswap.ll b/llvm/test/CodeGen/NVPTX/bswap.ll
index e3d1c80922609..a12deed544642 100644
--- a/llvm/test/CodeGen/NVPTX/bswap.ll
+++ b/llvm/test/CodeGen/NVPTX/bswap.ll
@@ -10,16 +10,16 @@ target triple = "nvptx64-nvidia-cuda"
 define i16 @bswap16(i16 %a) {
 ; CHECK-LABEL: bswap16(
 ; CHECK:       {
-; CHECK-NEXT:    .reg .b16 %rs<5>;
-; CHECK-NEXT:    .reg .b32 %r<2>;
+; CHECK-NEXT:    .reg .b16 %rs<3>;
+; CHECK-NEXT:    .reg .b32 %r<4>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.b16 %rs1, [bswap16_param_0];
-; CHECK-NEXT:    shr.u16 %rs2, %rs1, 8;
-; CHECK-NEXT:    shl.b16 %rs3, %rs1, 8;
-; CHECK-NEXT:    or.b16 %rs4, %rs3, %rs2;
-; CHECK-NEXT:    cvt.u32.u16 %r1, %rs4;
-; CHECK-NEXT:    st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT:    cvt.u32.u16 %r1, %rs1;
+; CHECK-NEXT:    prmt.b32 %r2, %r1, 0, 0x1U;
+; CHECK-NEXT:    cvt.u16.u32 %rs2, %r2;
+; CHECK-NEXT:    cvt.u32.u16 %r3, %rs2;
+; CHECK-NEXT:    st.param.b32 [func_retval0], %r3;
 ; CHECK-NEXT:    ret;
   %b = tail call i16 @llvm.bswap.i16(i16 %a)
   ret i16 %b

Copy link
Member

@AlexMaclean AlexMaclean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

If you wanted to go above and beyond, I think it would be best to instead implement this (and all other bswap expansion) as a custom legalization to allow for DAGCombiner to optimize them.

@AlexMaclean AlexMaclean requested a review from Artem-B November 20, 2025 23:45
@Chengjunp
Copy link
Contributor Author

If you wanted to go above and beyond, I think it would be best to instead implement this (and all other bswap expansion) as a custom legalization to allow for DAGCombiner to optimize them.

I see. Do you mean that we just lower all the bswap into bit operations first and let DAGCombiner decide how to optimize using PRMT?

Copy link
Member

@Artem-B Artem-B left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious whether this makes any difference on the SASS level. The old shift/or pattern looks simple enough for ptxas to deal with.

@github-actions
Copy link

🐧 Linux x64 Test Results

  • 186431 tests passed
  • 4864 tests skipped

@Chengjunp
Copy link
Contributor Author

I'm curious whether this makes any difference on the SASS level. The old shift/or pattern looks simple enough for ptxas to deal with.

Here is the sass comparison. We do not have the optimization for the old pattern. We can add those kinds of folding in the DAGCombiner in the next step, but I think it may be beyond the scope of this change.

@Artem-B
Copy link
Member

Artem-B commented Nov 21, 2025

Here is the sass comparison.

Fascinating. New code is clearly an improvement, but it looks like we can (sometimes?) make it even better. Right now ptxas generates prmt to extend i16->i32, and only then swaps the bytes around.

 PRMT R4, R4, 0x7710, RZ 
 PRMT R4, R4, 0x7701, RZ 

We only need one PRMT for that. Unfortunately, I see no easy way to avoid cvt.u32.u16. The only way that I managed to make ptxas generate only one PRMT is by using an extending load: https://godbolt.org/z/6PWx3hxda

Looks like this may need to be handled by ptxas.
I mean -- we can convert load i16 + cvt into an extending load, and that would be a useful peephole optimization, but in general case byteswap may not be in immediate proximity to benefit from that, and we'll still end up with the redundant PRMT.
What we can't do is tell ptxas -- "here's an i16 argument, but it already lives in an i32 register under the hood, please let me use it as an argument of a 32-bit PRMT".

@AlexMaclean
Copy link
Member

If you wanted to go above and beyond, I think it would be best to instead implement this (and all other bswap expansion) as a custom legalization to allow for DAGCombiner to optimize them.

I see. Do you mean that we just lower all the bswap into bit operations first and let DAGCombiner decide how to optimize using PRMT?

sorta, something along the lines of LowerVECTOR_SHUFFLE. That is use the NVPTXISD::PRTM node plus zext, trunc to represent this same expansion during operation legalization (instead of in ISel where it is now). Does that make sense?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants