[X86] Prefer andl to andb to save one byte encoding when using with bzhi or bextr #86921
Conversation
@llvm/pr-subscribers-backend-x86

Author: Phoebe Wang (phoebewang)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/86921.diff — 2 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
index 4e4241efd63d6b..dcc279f0d34d79 100644
--- a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
@@ -3928,9 +3928,17 @@ bool X86DAGToDAGISel::matchBitExtract(SDNode *Node) {
SDLoc DL(Node);
- // Truncate the shift amount.
- NBits = CurDAG->getNode(ISD::TRUNCATE, DL, MVT::i8, NBits);
- insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+ if (NBits.getSimpleValueType() != MVT::i8) {
+ // Truncate the shift amount.
+ NBits = CurDAG->getNode(ISD::TRUNCATE, DL, MVT::i8, NBits);
+ insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+ }
+
+ // Turn (i32)(x & imm8) into (i32)x & imm32.
+ ConstantSDNode *Imm = nullptr;
+ if (NBits->getOpcode() == ISD::AND)
+ if ((Imm = dyn_cast<ConstantSDNode>(NBits->getOperand(1))))
+ NBits = NBits->getOperand(0);
// Insert 8-bit NBits into lowest 8 bits of 32-bit register.
// All the other bits are undefined, we do not care about them.
@@ -3945,6 +3953,13 @@ bool X86DAGToDAGISel::matchBitExtract(SDNode *Node) {
0);
insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+ if (Imm) {
+ NBits =
+ CurDAG->getNode(ISD::AND, DL, MVT::i32, NBits,
+ CurDAG->getConstant(Imm->getZExtValue(), DL, MVT::i32));
+ insertDAGNode(*CurDAG, SDValue(Node, 0), NBits);
+ }
+
// We might have matched the amount of high bits to be cleared,
// but we want the amount of low bits to be kept, so negate it then.
if (NegateNBits) {
diff --git a/llvm/test/CodeGen/X86/extract-lowbits.ll b/llvm/test/CodeGen/X86/extract-lowbits.ll
index 848b920490ab83..85b242c22c47cb 100644
--- a/llvm/test/CodeGen/X86/extract-lowbits.ll
+++ b/llvm/test/CodeGen/X86/extract-lowbits.ll
@@ -439,14 +439,14 @@ define i64 @bzhi64_a0_masked(i64 %val, i64 %numlowbits) nounwind {
;
; X64-BMI1-LABEL: bzhi64_a0_masked:
; X64-BMI1: # %bb.0:
-; X64-BMI1-NEXT: andb $63, %sil
+; X64-BMI1-NEXT: andl $63, %esi
; X64-BMI1-NEXT: shll $8, %esi
; X64-BMI1-NEXT: bextrq %rsi, %rdi, %rax
; X64-BMI1-NEXT: retq
;
; X64-BMI2-LABEL: bzhi64_a0_masked:
; X64-BMI2: # %bb.0:
-; X64-BMI2-NEXT: andb $63, %sil
+; X64-BMI2-NEXT: andl $63, %esi
; X64-BMI2-NEXT: bzhiq %rsi, %rdi, %rax
; X64-BMI2-NEXT: retq
%numlowbits.masked = and i64 %numlowbits, 63
If it's an all-around win (i.e. it's at least as fast as andb in all cases and just takes fewer bytes to encode), then I don't see why it has to be part of matchBitExtract and can't be its own DAG combine/transform.
I checked them on uops.info; they have identical throughput/latency. I think the reason is that matchBitExtract happens during instruction selection, so there's no DAG combine after it.
Doesn't this cause an H-register merge for eax/ebx/ecx/edx if ah/bh/ch/dh have been written previously? Is the 1-byte saving from the REX prefix needed to access the low registers sil/dil/bpl/spl? They look to be the same length for the other registers. If the register is al/eax, the andl encoding is larger. https://godbolt.org/z/3xYPnjqas
It would be either way, no? In either test case there is at least one 32-bit instruction. IIRC once the H register is
I guess I didn't pay enough attention to notice that this is specific to the bit-extract code that generates bzhi or bextr. That should really be spelled out in the description; the title makes it sound like a generic optimization.
+1
Title changed, sorry for the misleading title.
Thanks for the pointers. I was thinking an 8-bit register always has a longer encoding. So the situation is: AL is one byte shorter, BL/CL/DL and R8B–R15B are the same, while SIL/DIL/BPL/SPL are one byte longer. Maybe it still looks like a win in general? One thing I didn't mention: the reporter thinks this can also avoid false-dependence stalls, but I didn't find that in the SOM nor in hasPartialRegUpdate. The weak evidence is that GCC always generates a 32-bit AND, even in 32-bit mode: https://godbolt.org/z/hvvM3fzM9 Do you think this change is worthwhile?
Ping?
I'm still not sure this is the best place to implement this, if we really want it (I defer to others on this). The optimization seems too general to live in this one specific place, and I'm not sure we can guarantee that other optimizations won't produce similar code needing similar fix-ups. Aren't there any peephole x86 machine passes that swap instructions for equivalents acting on smaller sizes to save on size?
AFAIK, we have 3 classes of size optimizations on X86:

You can see we optimize them differently for different cases.

Does that sound more reasonable?
It's been a long time since I looked, but I thought GCC generally avoids 8-bit registers, not just for bzhi/bextr.
Do you have an example in mind? I did a simple experiment, and the Clang code generation looks good to me: https://godbolt.org/z/6h3W348WM
GCC uses %eax, Clang uses %al (https://godbolt.org/z/vfbezP9ac), but like I said, it's been a long time since I looked at this. My recollection is that GCC promotes more operations to 32-bit registers than Clang does. I long ago thought maybe we should promote i8 to i32 like we do i16 through IsDesirableToPromoteOp.
Thanks @topperc for the pointer! i16 has more benefits compared to i8: https://godbolt.org/z/d1ofbTros