[X86] Prefer andl to andb to save one byte encoding when using with bzhi or bextr #86921

Open · phoebewang wants to merge 1 commit into main from andl
Conversation

@phoebewang (Contributor)

No description provided.

@llvmbot (Collaborator) commented Mar 28, 2024

@llvm/pr-subscribers-backend-x86

Author: Phoebe Wang (phoebewang)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/86921.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelDAGToDAG.cpp (+18-3)
  • (modified) llvm/test/CodeGen/X86/extract-lowbits.ll (+2-2)
diff --git a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
index 4e4241efd63d6b..dcc279f0d34d79 100644
--- a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
@@ -3928,9 +3928,17 @@ bool X86DAGToDAGISel::matchBitExtract(SDNode *Node) {
 
   SDLoc DL(Node);
 
-  // Truncate the shift amount.
-  NBits = CurDAG->getNode(ISD::TRUNCATE, DL, MVT::i8, NBits);
-  insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+  if (NBits.getSimpleValueType() != MVT::i8) {
+    // Truncate the shift amount.
+    NBits = CurDAG->getNode(ISD::TRUNCATE, DL, MVT::i8, NBits);
+    insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+  }
+
+  // Turn (i32)(x & imm8) into (i32)x & imm32.
+  ConstantSDNode *Imm = nullptr;
+  if (NBits->getOpcode() == ISD::AND)
+    if ((Imm = dyn_cast<ConstantSDNode>(NBits->getOperand(1))))
+      NBits = NBits->getOperand(0);
 
   // Insert 8-bit NBits into lowest 8 bits of 32-bit register.
   // All the other bits are undefined, we do not care about them.
@@ -3945,6 +3953,13 @@ bool X86DAGToDAGISel::matchBitExtract(SDNode *Node) {
                   0);
   insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
 
+  if (Imm) {
+    NBits =
+        CurDAG->getNode(ISD::AND, DL, MVT::i32, NBits,
+                        CurDAG->getConstant(Imm->getZExtValue(), DL, MVT::i32));
+    insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+  }
+
   // We might have matched the amount of high bits to be cleared,
   // but we want the amount of low bits to be kept, so negate it then.
   if (NegateNBits) {
diff --git a/llvm/test/CodeGen/X86/extract-lowbits.ll b/llvm/test/CodeGen/X86/extract-lowbits.ll
index 848b920490ab83..85b242c22c47cb 100644
--- a/llvm/test/CodeGen/X86/extract-lowbits.ll
+++ b/llvm/test/CodeGen/X86/extract-lowbits.ll
@@ -439,14 +439,14 @@ define i64 @bzhi64_a0_masked(i64 %val, i64 %numlowbits) nounwind {
 ;
 ; X64-BMI1-LABEL: bzhi64_a0_masked:
 ; X64-BMI1:       # %bb.0:
-; X64-BMI1-NEXT:    andb $63, %sil
+; X64-BMI1-NEXT:    andl $63, %esi
 ; X64-BMI1-NEXT:    shll $8, %esi
 ; X64-BMI1-NEXT:    bextrq %rsi, %rdi, %rax
 ; X64-BMI1-NEXT:    retq
 ;
 ; X64-BMI2-LABEL: bzhi64_a0_masked:
 ; X64-BMI2:       # %bb.0:
-; X64-BMI2-NEXT:    andb $63, %sil
+; X64-BMI2-NEXT:    andl $63, %esi
 ; X64-BMI2-NEXT:    bzhiq %rsi, %rdi, %rax
 ; X64-BMI2-NEXT:    retq
   %numlowbits.masked = and i64 %numlowbits, 63

@danilaml (Collaborator)

If it's an all-around win (i.e., it's at least as fast as andb in all cases and just takes fewer bytes to encode), then I don't see why it has to be part of matchBitExtract and can't be its own DAG combine/transform that does // Turn (i32)(x & imm8) into (i32)x & imm32. I think there are already some combines like that, so I'm not sure why they haven't triggered for this case.

@phoebewang (Contributor, Author) commented Mar 28, 2024

> If it's an all-around win (i.e., it's at least as fast as andb in all cases and just takes fewer bytes to encode), then I don't see why it has to be part of matchBitExtract and can't be its own DAG combine/transform that does // Turn (i32)(x & imm8) into (i32)x & imm32. I think there are already some combines like that, so I'm not sure why they haven't triggered for this case.

I checked them on uops.info; they have identical throughput/latency. I think the reason is that matchBitExtract happens during instruction selection, and there is no DAG combine after it.

@topperc (Collaborator) commented Mar 28, 2024

Doesn't this cause an H register merge for eax/ebx/ecx/edx if ah/bh/ch/dh have been written previously?

Is the 1-byte saving from the REX prefix that is needed to access the low registers sil/dil/bpl/spl? They look to be the same length for other registers.

If the register is al/eax, the andl encoding is larger. https://godbolt.org/z/3xYPnjqas
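
For reference, a sketch of the encodings in question (AT&T syntax; the byte sequences follow the standard x86-64 encoding rules and are shown here for illustration):

    andb $63, %al    # 24 3f          (2 bytes, AL-specific opcode)
    andb $63, %bl    # 80 e3 3f       (3 bytes)
    andb $63, %sil   # 40 80 e6 3f    (4 bytes, REX prefix required to address SIL)

    andl $63, %eax   # 83 e0 3f       (3 bytes)
    andl $63, %ebx   # 83 e3 3f       (3 bytes)
    andl $63, %esi   # 83 e6 3f       (3 bytes)

So the 32-bit form saves a byte only for SIL/DIL/BPL/SPL, is the same size for the legacy byte registers, and is one byte larger for AL.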

@goldsteinn (Contributor)

> Doesn't this cause an H register merge for eax/ebx/ecx/edx if ah/bh/ch/dh have been written previously?

Wouldn't that happen either way? In either test case there is at least one 32-bit instruction. IIRC, once the H register is dirty, there is always a merging uop when the 32/64-bit register (or any part of it, for that matter) is used.

> Is the 1-byte saving from the REX prefix that is needed to access the low registers sil/dil/bpl/spl? They look to be the same length for other registers.

> If the register is al/eax, the andl encoding is larger. https://godbolt.org/z/3xYPnjqas

@topperc (Collaborator) commented Mar 28, 2024

>> Doesn't this cause an H register merge for eax/ebx/ecx/edx if ah/bh/ch/dh have been written previously?
>
> Wouldn't that happen either way? In either test case there is at least one 32-bit instruction. IIRC, once the H register is dirty, there is always a merging uop when the 32/64-bit register (or any part of it, for that matter) is used.

I guess I didn't pay enough attention to notice this is specific to the bit-extract code that generates bzhi or bextr. That should really be spelled out in the description. The title makes it sound like a generic optimization.

@goldsteinn (Contributor)

> I guess I didn't pay enough attention to notice this is specific to the bit-extract code that generates bzhi or bextr. That should really be spelled out in the description. The title makes it sound like a generic optimization.

+1

@phoebewang changed the title from "[X86] Prefer andl to andb to save one byte encoding" to "[X86] Prefer andl to andb to save one byte encoding when using with bzhi or bextr" on Mar 29, 2024
@phoebewang (Contributor, Author)

Title changed, sorry for the misleading wording.

@phoebewang (Contributor, Author)

> Is the 1-byte saving from the REX prefix that is needed to access the low registers sil/dil/bpl/spl? They look to be the same length for other registers.

> If the register is al/eax, the andl encoding is larger. https://godbolt.org/z/3xYPnjqas

Thanks for the points. I was assuming an 8-bit register always has a longer encoding. So the situation is: AL is one byte shorter, BL/CL/DL and R8B~R15B are the same, while SIL/DIL/BPL/SPL are one byte longer. Maybe it still looks like a win in general?

One thing I didn't mention: the reporter thinks this can also avoid false-dependence stalls, but I didn't find that in the SOM or in hasPartialRegUpdate. The weak evidence is that GCC always generates a 32-bit AND, even in 32-bit mode: https://godbolt.org/z/hvvM3fzM9

Do you think this change is worth it?
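
To illustrate with the BMI2 case from the test diff above (a sketch based on this PR's test changes; the code behind the godbolt link may differ in detail):

    # GCC, and Clang with this patch:
    andl  $63, %esi          # 3 bytes: 83 e6 3f
    bzhiq %rsi, %rdi, %rax

    # Clang before this patch:
    andb  $63, %sil          # 4 bytes: 40 80 e6 3f (REX prefix needed for %sil)
    bzhiq %rsi, %rdi, %rax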

@phoebewang (Contributor, Author)

Ping?

@danilaml (Collaborator)

I'm still not sure this is the best place to implement this, if we really want it (I defer to others on that). The optimization seems too general to me to live in this one specific place, and I'm not sure we can guarantee that other optimizations won't produce similar code needing similar fixups. Aren't there any peephole x86 machine passes that swap instructions for shorter equivalents to save on size?

@phoebewang (Contributor, Author)

> I'm still not sure this is the best place to implement this, if we really want it (I defer to others on that). The optimization seems too general to me to live in this one specific place, and I'm not sure we can guarantee that other optimizations won't produce similar code needing similar fixups. Aren't there any peephole x86 machine passes that swap instructions for shorter equivalents to save on size?

AFAIK, we have three classes of size optimization on X86:

  • EVEX to VEX/legacy encoding compression, done in a machine pass; see X86CompressEVEX.cpp (a sketch of this class follows at the end of this comment);
  • Miscellaneous compression done during MCInst lowering; see X86EncodingOptimization.cpp;
  • Specific compression done during instruction selection; see X86InstrShiftRotate.td.

As you can see, we optimize differently for different cases. What we do here is similar to class 3. We do it specifically for bzhi/bextr because:

  • As @topperc pointed out, it is not a general optimization;
  • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.

Does that sound more reasonable?
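
As a rough illustration of class 1 (illustrative encodings, not taken from the pass itself): when an instruction uses no EVEX-only features such as zmm registers, masking, or broadcast, it can be re-encoded with the shorter VEX prefix:

    vmovaps %xmm1, %xmm0    # EVEX encoding: 62 f1 7c 08 28 c1 (6 bytes)
    vmovaps %xmm1, %xmm0    # VEX encoding:  c5 f8 28 c1       (4 bytes)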

@topperc (Collaborator) commented Apr 16, 2024

> • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.

It's been a long time since I looked, but I thought gcc generally avoids 8-bit registers, not just for bzhi/bextr.

@phoebewang (Contributor, Author)

>> • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.
>
> It's been a long time since I looked, but I thought gcc generally avoids 8-bit registers, not just for bzhi/bextr.

Do you have an example in mind? I did a simple experiment, and the Clang code generation looks good to me: https://godbolt.org/z/6h3W348WM

@topperc (Collaborator) commented Apr 16, 2024

>>> • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.
>>
>> It's been a long time since I looked, but I thought gcc generally avoids 8-bit registers, not just for bzhi/bextr.
>
> Do you have an example in mind? I did a simple experiment, and the Clang code generation looks good to me: https://godbolt.org/z/6h3W348WM

gcc uses %eax, clang uses %al https://godbolt.org/z/vfbezP9ac, but like I said, it's been a long time since I looked at this. My recollection was that gcc promotes more operations to 32-bit registers than clang does. I long ago thought maybe we should promote i8 to i32 like we do i16, through IsDesirableToPromoteOp.
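
A minimal sketch of the kind of difference being described (hypothetical source such as char f(char a, char b) { return a & b; }; the actual code at the godbolt link may differ):

    # gcc: promotes the i8 AND to the full 32-bit register
    movl %esi, %eax
    andl %edi, %eax
    retq

    # clang: keeps the operation on the 8-bit subregister
    movl %esi, %eax
    andb %dil, %al
    retq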

@phoebewang (Contributor, Author)

>>>> • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.
>>>
>>> It's been a long time since I looked, but I thought gcc generally avoids 8-bit registers, not just for bzhi/bextr.
>>
>> Do you have an example in mind? I did a simple experiment, and the Clang code generation looks good to me: https://godbolt.org/z/6h3W348WM
>
> gcc uses %eax, clang uses %al https://godbolt.org/z/vfbezP9ac, but like I said, it's been a long time since I looked at this. My recollection was that gcc promotes more operations to 32-bit registers than clang does. I long ago thought maybe we should promote i8 to i32 like we do i16, through IsDesirableToPromoteOp.

Thanks @topperc for the pointer! i16 shows more benefit than i8: https://godbolt.org/z/d1ofbTros
For i8, the benefit depends on the frequency of AL vs. SIL/DIL/BPL/SPL; though the latter group has four registers, I think AL occurs more often in practice, not to mention the H-register case. So I think we don't need to apply this to i8 in general.

@phoebewang phoebewang closed this May 2, 2024
@phoebewang phoebewang deleted the andl branch May 2, 2024 09:15
@phoebewang phoebewang restored the andl branch May 2, 2024 09:20
@phoebewang phoebewang reopened this May 2, 2024