-
Notifications
You must be signed in to change notification settings - Fork 15.3k
[NVPTX] Use PRMT instruction to lower i16 bswap #168968
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@llvm/pr-subscribers-backend-nvptx Author: Chengjun (Chengjunp) ChangesPreviously, i16 Full diff: https://github.com/llvm/llvm-project/pull/168968.diff 3 Files Affected:
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
index a77eb0240e677..cc675ff6ff7c7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
@@ -713,8 +713,6 @@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
Custom);
}
- setOperationAction(ISD::BSWAP, MVT::i16, Expand);
-
setOperationAction(ISD::BR_JT, MVT::Other, Custom);
setOperationAction(ISD::BRIND, MVT::Other, Expand);
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 8b129e7e5eeae..2654d3fbef6f6 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -2477,6 +2477,10 @@ include "NVPTXIntrinsics.td"
// unpack). sm_20 supports native 32-bit register, but not native 16-bit
// register.
+def : Pat <
+ (i16 (bswap i16:$a)),
+ (i16 (CVT_u16_u32 (PRMT_B32rii (i32 (CVT_u32_u16 $a, CvtNONE)), (i32 0), (i32 0x0001), PrmtNONE), CvtNONE))>;
+
def : Pat <
(i32 (bswap i32:$a)),
(PRMT_B32rii $a, (i32 0), (i32 0x0123), PrmtNONE)>;
diff --git a/llvm/test/CodeGen/NVPTX/bswap.ll b/llvm/test/CodeGen/NVPTX/bswap.ll
index e3d1c80922609..a12deed544642 100644
--- a/llvm/test/CodeGen/NVPTX/bswap.ll
+++ b/llvm/test/CodeGen/NVPTX/bswap.ll
@@ -10,16 +10,16 @@ target triple = "nvptx64-nvidia-cuda"
define i16 @bswap16(i16 %a) {
; CHECK-LABEL: bswap16(
; CHECK: {
-; CHECK-NEXT: .reg .b16 %rs<5>;
-; CHECK-NEXT: .reg .b32 %r<2>;
+; CHECK-NEXT: .reg .b16 %rs<3>;
+; CHECK-NEXT: .reg .b32 %r<4>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
; CHECK-NEXT: ld.param.b16 %rs1, [bswap16_param_0];
-; CHECK-NEXT: shr.u16 %rs2, %rs1, 8;
-; CHECK-NEXT: shl.b16 %rs3, %rs1, 8;
-; CHECK-NEXT: or.b16 %rs4, %rs3, %rs2;
-; CHECK-NEXT: cvt.u32.u16 %r1, %rs4;
-; CHECK-NEXT: st.param.b32 [func_retval0], %r1;
+; CHECK-NEXT: cvt.u32.u16 %r1, %rs1;
+; CHECK-NEXT: prmt.b32 %r2, %r1, 0, 0x1U;
+; CHECK-NEXT: cvt.u16.u32 %rs2, %r2;
+; CHECK-NEXT: cvt.u32.u16 %r3, %rs2;
+; CHECK-NEXT: st.param.b32 [func_retval0], %r3;
; CHECK-NEXT: ret;
%b = tail call i16 @llvm.bswap.i16(i16 %a)
ret i16 %b
|
AlexMaclean
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
If you wanted to go above and beyond, I think it would be best to instead implement this (and all other bswap expansion) as a custom legalization to allow for DAGCombiner to optimize them.
I see. Do you mean that we just lower all the bswap into bit operations first and let DAGCombiner decide how to optimize using PRMT? |
Artem-B
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious whether this makes any difference on the SASS level. The old shift/or pattern looks simple enough for ptxas to deal with.
🐧 Linux x64 Test Results
|
Here is the sass comparison. We do not have the optimization for the old pattern. We can add those kinds of folding in the DAGCombiner in the next step, but I think it may be beyond the scope of this change. |
Fascinating. New code is clearly an improvement, but it looks like we can (sometimes?) make it even better. Right now ptxas generates prmt to extend i16->i32, and only then swaps the bytes around. We only need one PRMT for that. Unfortunately, I see no easy way to avoid cvt.u32.u16. The only way that I managed to make ptxas generate only one PRMT is by using an extending load: https://godbolt.org/z/6PWx3hxda Looks like this may need to be handled by ptxas. |
sorta, something along the lines of |
Previously, i16
bswapwas lowered using multiple shift and OR operations. This patch adds a pattern to directly lower i16bswapusing thePRMT(permute) instruction, which is more efficient.