[X86] Prefer andl to andb to save one byte encoding when using with bzhi or bextr #86921

Open · phoebewang wants to merge 1 commit into main from andl
Conversation

@phoebewang (Contributor)

No description provided.

@llvmbot (Collaborator) commented Mar 28, 2024

@llvm/pr-subscribers-backend-x86

Author: Phoebe Wang (phoebewang)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/86921.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelDAGToDAG.cpp (+18-3)
  • (modified) llvm/test/CodeGen/X86/extract-lowbits.ll (+2-2)
diff --git a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
index 4e4241efd63d6b..dcc279f0d34d79 100644
--- a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
@@ -3928,9 +3928,17 @@ bool X86DAGToDAGISel::matchBitExtract(SDNode *Node) {
 
   SDLoc DL(Node);
 
-  // Truncate the shift amount.
-  NBits = CurDAG->getNode(ISD::TRUNCATE, DL, MVT::i8, NBits);
-  insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+  if (NBits.getSimpleValueType() != MVT::i8) {
+    // Truncate the shift amount.
+    NBits = CurDAG->getNode(ISD::TRUNCATE, DL, MVT::i8, NBits);
+    insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+  }
+
+  // Turn (i32)(x & imm8) into (i32)x & imm32.
+  ConstantSDNode *Imm = nullptr;
+  if (NBits->getOpcode() == ISD::AND)
+    if ((Imm = dyn_cast<ConstantSDNode>(NBits->getOperand(1))))
+      NBits = NBits->getOperand(0);
 
   // Insert 8-bit NBits into lowest 8 bits of 32-bit register.
   // All the other bits are undefined, we do not care about them.
@@ -3945,6 +3953,13 @@ bool X86DAGToDAGISel::matchBitExtract(SDNode *Node) {
                   0);
   insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
 
+  if (Imm) {
+    NBits =
+        CurDAG->getNode(ISD::AND, DL, MVT::i32, NBits,
+                        CurDAG->getConstant(Imm->getZExtValue(), DL, MVT::i32));
+    insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+  }
+
   // We might have matched the amount of high bits to be cleared,
   // but we want the amount of low bits to be kept, so negate it then.
   if (NegateNBits) {
diff --git a/llvm/test/CodeGen/X86/extract-lowbits.ll b/llvm/test/CodeGen/X86/extract-lowbits.ll
index 848b920490ab83..85b242c22c47cb 100644
--- a/llvm/test/CodeGen/X86/extract-lowbits.ll
+++ b/llvm/test/CodeGen/X86/extract-lowbits.ll
@@ -439,14 +439,14 @@ define i64 @bzhi64_a0_masked(i64 %val, i64 %numlowbits) nounwind {
 ;
 ; X64-BMI1-LABEL: bzhi64_a0_masked:
 ; X64-BMI1:       # %bb.0:
-; X64-BMI1-NEXT:    andb $63, %sil
+; X64-BMI1-NEXT:    andl $63, %esi
 ; X64-BMI1-NEXT:    shll $8, %esi
 ; X64-BMI1-NEXT:    bextrq %rsi, %rdi, %rax
 ; X64-BMI1-NEXT:    retq
 ;
 ; X64-BMI2-LABEL: bzhi64_a0_masked:
 ; X64-BMI2:       # %bb.0:
-; X64-BMI2-NEXT:    andb $63, %sil
+; X64-BMI2-NEXT:    andl $63, %esi
 ; X64-BMI2-NEXT:    bzhiq %rsi, %rdi, %rax
 ; X64-BMI2-NEXT:    retq
   %numlowbits.masked = and i64 %numlowbits, 63

@danilaml (Collaborator)

If it's an all-around win (i.e., it's at least as fast as andb in all cases and just takes fewer bytes to encode), then I don't see why it has to be part of matchBitExtract and can't be its own DAG combine/transform that does // Turn (i32)(x & imm8) into (i32)x & imm32. I think there are already some combines like that, so I'm not sure why they haven't triggered for this case.

@phoebewang (Contributor, Author) commented Mar 28, 2024

> If it's an all-around win (i.e., it's at least as fast as andb in all cases and just takes fewer bytes to encode), then I don't see why it has to be part of matchBitExtract and can't be its own DAG combine/transform that does // Turn (i32)(x & imm8) into (i32)x & imm32. I think there are already some combines like that, so I'm not sure why they haven't triggered for this case.

I checked them on uops.info; they have identical throughput/latency. I think the reason is that matchBitExtract happens during instruction selection, and there is no DAG combine after it.

@topperc (Collaborator) commented Mar 28, 2024

Doesn't this cause an H register merge for eax/ebx/ecx/edx if ah/bh/ch/dh have been written previously?

Is the 1-byte saving from the REX prefix that is needed to access the low registers sil/dil/bpl/spl? They look to be the same length for other registers.

If the register is al/eax, the andl encoding is larger. https://godbolt.org/z/3xYPnjqas
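
For reference, a sketch of the encodings in question (AT&T syntax; the byte sequences follow the standard x86-64 encoding rules and are shown here for illustration):

    andb $63, %al    # 24 3f          (2 bytes, AL-specific opcode)
    andb $63, %bl    # 80 e3 3f       (3 bytes)
    andb $63, %sil   # 40 80 e6 3f    (4 bytes, REX prefix required to address SIL)

    andl $63, %eax   # 83 e0 3f       (3 bytes)
    andl $63, %ebx   # 83 e3 3f       (3 bytes)
    andl $63, %esi   # 83 e6 3f       (3 bytes)

So the 32-bit form saves a byte only for SIL/DIL/BPL/SPL, is the same size for the legacy byte registers, and is one byte larger for AL.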

@goldsteinn (Contributor)

> Doesn't this cause an H register merge for eax/ebx/ecx/edx if ah/bh/ch/dh have been written previously?

Wouldn't that happen either way? In either test case there is at least one 32-bit instruction. IIRC, once the H register is dirty, there is always a merging uop when the 32/64-bit register (or any part of it, for that matter) is used.

> Is the 1-byte saving from the REX prefix that is needed to access the low registers sil/dil/bpl/spl? They look to be the same length for other registers.

> If the register is al/eax, the andl encoding is larger. https://godbolt.org/z/3xYPnjqas

@topperc (Collaborator) commented Mar 28, 2024

>> Doesn't this cause an H register merge for eax/ebx/ecx/edx if ah/bh/ch/dh have been written previously?
>
> Wouldn't that happen either way? In either test case there is at least one 32-bit instruction. IIRC, once the H register is dirty, there is always a merging uop when the 32/64-bit register (or any part of it, for that matter) is used.

I guess I didn't pay enough attention to notice this is specific to the bit-extract code that generates bzhi or bextr. That should really be spelled out in the description. The title makes it sound like a generic optimization.

@goldsteinn (Contributor)

> I guess I didn't pay enough attention to notice this is specific to the bit-extract code that generates bzhi or bextr. That should really be spelled out in the description. The title makes it sound like a generic optimization.

+1

@phoebewang changed the title from "[X86] Prefer andl to andb to save one byte encoding" to "[X86] Prefer andl to andb to save one byte encoding when using with bzhi or bextr" on Mar 29, 2024
@phoebewang (Contributor, Author)

Title changed, sorry for the misleading wording.

@phoebewang (Contributor, Author)

> Is the 1-byte saving from the REX prefix that is needed to access the low registers sil/dil/bpl/spl? They look to be the same length for other registers.

> If the register is al/eax, the andl encoding is larger. https://godbolt.org/z/3xYPnjqas

Thanks for the points. I was assuming an 8-bit register always has a longer encoding. So the situation is: AL is one byte shorter, BL/CL/DL and R8B~R15B are the same, while SIL/DIL/BPL/SPL are one byte longer. Maybe it still looks like a win in general?

One thing I didn't mention: the reporter thinks this can also avoid false-dependence stalls, but I didn't find that in the SOM or in hasPartialRegUpdate. The weak evidence is that GCC always generates a 32-bit AND, even in 32-bit mode: https://godbolt.org/z/hvvM3fzM9

Do you think this change is worth it?
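
To illustrate with the BMI2 case from the test diff above (a sketch based on this PR's test changes; the code behind the godbolt link may differ in detail):

    # GCC, and Clang with this patch:
    andl  $63, %esi          # 3 bytes: 83 e6 3f
    bzhiq %rsi, %rdi, %rax

    # Clang before this patch:
    andb  $63, %sil          # 4 bytes: 40 80 e6 3f (REX prefix needed for %sil)
    bzhiq %rsi, %rdi, %rax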

@phoebewang (Contributor, Author)

Ping?

@danilaml (Collaborator)

I'm still not sure this is the best place to implement this, if we really want it (I defer to others on that). The optimization seems too general to me to live in this one specific place, and I'm not sure we can guarantee that other optimizations won't produce similar code needing similar fixups. Aren't there any peephole x86 machine passes that swap instructions for shorter equivalents to save on size?

@phoebewang (Contributor, Author)

> I'm still not sure this is the best place to implement this, if we really want it (I defer to others on that). The optimization seems too general to me to live in this one specific place, and I'm not sure we can guarantee that other optimizations won't produce similar code needing similar fixups. Aren't there any peephole x86 machine passes that swap instructions for shorter equivalents to save on size?

AFAIK, we have three classes of size optimization on X86:

  • EVEX to VEX/legacy encoding compression, done in a machine pass; see X86CompressEVEX.cpp (a sketch of this class follows at the end of this comment);
  • Miscellaneous compression done during MCInst lowering; see X86EncodingOptimization.cpp;
  • Specific compression done during instruction selection; see X86InstrShiftRotate.td.

As you can see, we optimize differently for different cases. What we do here is similar to class 3. We do it specifically for bzhi/bextr because:

  • As @topperc pointed out, it is not a general optimization;
  • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.

Does that sound more reasonable?
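
As a rough illustration of class 1 (illustrative encodings, not taken from the pass itself): when an instruction uses no EVEX-only features such as zmm registers, masking, or broadcast, it can be re-encoded with the shorter VEX prefix:

    vmovaps %xmm1, %xmm0    # EVEX encoding: 62 f1 7c 08 28 c1 (6 bytes)
    vmovaps %xmm1, %xmm0    # VEX encoding:  c5 f8 28 c1       (4 bytes)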

@topperc (Collaborator) commented Apr 16, 2024

> • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.

It's been a long time since I looked, but I thought gcc generally avoids 8-bit registers, not just for bzhi/bextr.

@phoebewang (Contributor, Author)

>> • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.
>
> It's been a long time since I looked, but I thought gcc generally avoids 8-bit registers, not just for bzhi/bextr.

Do you have an example in mind? I did a simple experiment, and the Clang code generation looks good to me: https://godbolt.org/z/6h3W348WM

@topperc (Collaborator) commented Apr 16, 2024

>>> • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.
>>
>> It's been a long time since I looked, but I thought gcc generally avoids 8-bit registers, not just for bzhi/bextr.
>
> Do you have an example in mind? I did a simple experiment, and the Clang code generation looks good to me: https://godbolt.org/z/6h3W348WM

gcc uses %eax, clang uses %al https://godbolt.org/z/vfbezP9ac, but like I said, it's been a long time since I looked at this. My recollection was that gcc promotes more operations to 32-bit registers than clang does. I long ago thought maybe we should promote i8 to i32 like we do i16, through IsDesirableToPromoteOp.
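
A minimal sketch of the kind of difference being described (hypothetical source such as char f(char a, char b) { return a & b; }; the actual code at the godbolt link may differ):

    # gcc: promotes the i8 AND to the full 32-bit register
    movl %esi, %eax
    andl %edi, %eax
    retq

    # clang: keeps the operation on the 8-bit subregister
    movl %esi, %eax
    andb %dil, %al
    retq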

@phoebewang (Contributor, Author)

>>>> • As I mentioned above, it is considered beneficial for bzhi/bextr, which matches GCC's behavior.
>>>
>>> It's been a long time since I looked, but I thought gcc generally avoids 8-bit registers, not just for bzhi/bextr.
>>
>> Do you have an example in mind? I did a simple experiment, and the Clang code generation looks good to me: https://godbolt.org/z/6h3W348WM
>
> gcc uses %eax, clang uses %al https://godbolt.org/z/vfbezP9ac, but like I said, it's been a long time since I looked at this. My recollection was that gcc promotes more operations to 32-bit registers than clang does. I long ago thought maybe we should promote i8 to i32 like we do i16, through IsDesirableToPromoteOp.

Thanks @topperc for the pointer! i16 shows more benefit than i8: https://godbolt.org/z/d1ofbTros
For i8, the benefit depends on the frequency of AL vs. SIL/DIL/BPL/SPL; though the latter group has four registers, I think AL occurs more often in practice, not to mention the H-register case. So I think we don't need to apply this to i8 in general.

@phoebewang phoebewang closed this May 2, 2024
@phoebewang phoebewang deleted the andl branch May 2, 2024 09:15
@phoebewang phoebewang restored the andl branch May 2, 2024 09:20
@phoebewang phoebewang reopened this May 2, 2024