-
Notifications
You must be signed in to change notification settings - Fork 10.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AIX][TLS] Optimize the small local-exec access sequence for non-zero offsets #71485
[AIX][TLS] Optimize the small local-exec access sequence for non-zero offsets #71485
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The testcase updates look reasonable, but I didn't review every change.
@@ -1514,13 +1522,41 @@ void PPCAsmPrinter::emitInstruction(const MachineInstr *MI) { | |||
case PPC::LWA: { | |||
// Verify alignment is legal, so we don't create relocations | |||
// that can't be supported. | |||
unsigned OpNum = (MI->getOpcode() == PPC::STD) ? 2 : 1; | |||
unsigned OpNum; | |||
if (Subtarget->hasAIXSmallLocalExecTLS()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does hasAIXSmallLocalExecTLS() affect OpNum? Is a comment needed?
int64_t Offset = MO.getOffset(); | ||
// Non-zero offsets for lwa/ld/std require special handling and are | ||
// handled here. | ||
if (!Offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is opposite the preceding comment, and I wouldn't use a 'break' inside 2 if statments. Why not use if (Offset) { ...
break; | ||
|
||
LowerPPCMachineInstrToMCInst(MI, TmpInst, *this); | ||
if (Offset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check redundant?
int64_t Offset = MO.getOffset(); | ||
// Non-zero offsets for loads/stores require special handling and are | ||
// handled here. For `addi`, all offsets are handled here. | ||
if (!Offset && !IsMIADDI8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above: I would use if (Offset || IsMIADDI8) {...
// we need to create a new MCExpr that adds the non-zero offset to the address | ||
// of the local-exec variable that will be used in either an addi, load or | ||
// store. However, the final displacement for these instructions must be | ||
// between [-32768, 32768), so if the TLS address + it's non-zero offset is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// between [-32768, 32768), so if the TLS address + it's non-zero offset is | |
// between [-32768, 32768), so if the TLS address + its non-zero offset is |
// Handle the written offset for cases where: | ||
// address of the TLS variable + the offset is greater than 32KB. | ||
|
||
// Get the address in the range of 0 to 64KB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can simplify this.The amount you have to subtract is ((FinalAddress + 32768 ) & ~0xFFFF)
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: la r3, mySmallLocalExecTLS5[TL]@le(r13) | ||
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r4, 332(r3) | ||
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r4, mySmallLocalExecTLSv1[TL]@le+24(r13) | ||
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r3, mySmallLocalExecTLS2[TL]@le-65216(r13) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the assembly prints:
TLSVar[storageMappingClass]@le + OffsetDelta(r13)
Where OffsetDelta
= Offset - (Multiple of 64KB)
. Basically, it will print the single OffsetDelta
value.
Stephen and I discussed this, and perhaps another way we could print it is the more explicit way of separating everything out:
TLSVar[storageMappingClass]@le + Offset - (Multiple of 64KB) (r13)
I experimented with this a bit and an example of the assembly would look like:
stw r3, (mySmallLocalExecTLS2[TL]@le+320)-65536(r13)
Not sure if that's clearer or looks better/worse, but the current way might look a bit cleaner.
If anyone has any opinions on this, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the alternate way would look be better, because it's easier to debug or understand the assembly code. In addition, the "-65536" is required because of an AIX assembler limitation. If the assembler were fixed, the extra subtraction could be eliminated. But I could accept either output format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial group code review comments.
Also:
Title should be updated.Might be good to add the zero-based offset patch in the description.
Done the above two items.
// InitialADDI is the addi feeding into N (also an addi), and the addi that | ||
// we want optimized out. | ||
SDValue InitialADDI = N->getOperand(0); | ||
if (!InitialADDI.isMachineOpcode() || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be better to separate the checks into a separate function so we can check it in the load/store case, too.
@@ -7735,7 +7792,15 @@ void PPCDAGToDAGISel::PeepholePPC64() { | |||
ImmOpnd = CurDAG->getTargetConstant(Offset, SDLoc(ImmOpnd), | |||
ImmOpnd.getValueType()); | |||
} else if (Offset != 0) { | |||
continue; | |||
if (!HasAIXSmallLocalExecTLS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing checks for thread pointer/TLS variable or not?
Double check this.
@@ -7735,7 +7792,15 @@ void PPCDAGToDAGISel::PeepholePPC64() { | |||
ImmOpnd = CurDAG->getTargetConstant(Offset, SDLoc(ImmOpnd), | |||
ImmOpnd.getValueType()); | |||
} else if (Offset != 0) { | |||
continue; | |||
if (!HasAIXSmallLocalExecTLS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be better to do:
if (HasAIXSmallLocalExecTLS) {
...
} else
continue;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the latest changes look good
const MachineOperand &MO = MI->getOperand(OpNum); | ||
if (MO.isGlobal()) { | ||
const DataLayout &DL = MO.getGlobal()->getParent()->getDataLayout(); | ||
if (MO.getGlobal()->getPointerAlignment(DL) < 4) | ||
llvm_unreachable("Global must be word-aligned for LD, STD, LWA!"); | ||
|
||
// A faster non-TOC-based local-exec sequence is represented by | ||
// `lwa`/`ld`/`std` directingly loading or storing off of the thread |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit, but there are a lot of load and store instructions. I don't think you need to list lwa, ld, and std explicitly.
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: la r3, mySmallLocalExecTLS5[TL]@le(r13) | ||
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r4, 332(r3) | ||
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r4, mySmallLocalExecTLSv1[TL]@le+24(r13) | ||
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r3, mySmallLocalExecTLS2[TL]@le-65216(r13) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the alternate way would look be better, because it's easier to debug or understand the assembly code. In addition, the "-65536" is required because of an AIX assembler limitation. If the assembler were fixed, the extra subtraction could be eliminated. But I could accept either output format.
// Such instructions do not otherwise arise. | ||
if ((MO.getTargetFlags() & PPCII::MO_TPREL_FLAG) != 0) { | ||
assert(HasAIXSmallLocalExecTLS && | ||
"lwa/ld/std with thread-pointer only expected with " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another nit: I would change "lwa/ld/std" to "load/store"
const MachineOperand &MO = MI->getOperand(OpNum); | ||
if (MO.isGlobal()) { | ||
const DataLayout &DL = MO.getGlobal()->getParent()->getDataLayout(); | ||
if (MO.getGlobal()->getPointerAlignment(DL) < 4) | ||
llvm_unreachable("Global must be word-aligned for LD, STD, LWA!"); | ||
|
||
// A faster non-TOC-based local-exec sequence is represented by | ||
// directingly loading or storing off of the thread pointer and with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
directingly -> directly
// non-zero offset to the TLS variable address. | ||
// For when TLS variables are extern, this is safe to do because we can | ||
// assume that the address of extern TLS variables are zero. | ||
if ((FinalAddress < 32768) || IsGlobalADeclaration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we don't need || IsGlobalADeclaration since in
ptrdiff_t FinalAddress = (TLSVarAddress + Offset);
TLSVarAddress was initialized to 0 for IsGlobalADeclaration.
So it will already pass FinalAddress < 32768 for IsGlobalADeclaration.
789629a
to
c6d882b
Compare
fdcfeef
to
cf3957f
Compare
ec5f16e
to
c9ea0b0
Compare
Ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than a couple of nits, I think the code and the generated assembly look good. Factoring out the +/- 65536 in the assemble code helps anyone reading the the assembly listing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
2601089
to
3563b6a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@@ -7578,6 +7668,10 @@ void PPCDAGToDAGISel::PeepholePPC64() { | |||
if (isVSXSwap(SDValue(N, 0))) | |||
reduceVSXSwap(N, CurDAG); | |||
|
|||
// This optimization is performed for non-TOC-based local-exec accesses. | |||
if (HasAIXSmallLocalExecTLS) | |||
foldADDIForLocalExecAccesses(N, CurDAG); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
continue
the loop earlier will save compiler compile time ,
anyway, you can keep your code. I don't have a strong preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
assert((InstDisp < 32768) || | ||
(InstDisp >= -32768) && | ||
"Expecting the instruction displacement for local-exec TLS " | ||
"variables to be between [-32768, 32768)!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not a review comment)
Per my understanding, allow peephole for non-zero offsets but without changes to AsmPrinter, we'll get assembler error like The displacement must be greater than or equal to -32768 and less than or equal to 32767
(no-integrated-as).
Since code here is to rewrite offsets exceeding upper limit (32768) into negative, will this assert be hit if the offset is even larger? (for example, twice the size as objects in aix-small-local-exec-tls-largeaccess.ll
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For larger variables, the assert will be not triggered. This is because in the initial patch that introduced this feature, I only restricted this non-TOC-based access sequence if the size of the TLS variable is less than 32751 within PPCISelLowering.cpp
.
constexpr uint64_t AIXSmallTlsPolicySizeLimit = 32751;
....
// With the -maix-small-local-exec-tls option, produce a faster access
// sequence for local-exec TLS variables where the offset from the TLS
// base is encoded as an immediate operand.
//
// We only utilize the faster local-exec access sequence when the TLS
// variable has a size within the policy limit. We treat types that are
// not sized or are empty as being over the policy size limit.
if (HasAIXSmallLocalExecTLS && IsTLSLocalExecModel) {
Type *GVType = GV->getValueType();
if (GVType->isSized() && !GVType->isEmptyTy() &&
GV->getParent()->getDataLayout().getTypeAllocSize(GVType) <=
AIXSmallTlsPolicySizeLimit)
return DAG.getNode(PPCISD::Lo, dl, PtrVT, VariableOffsetTGA, TLSReg);
}
So in the case where the size is 8187
, the test case for the large variable in aix-small-local-exec-tls-largeaccess.ll
looks like:
stw r3, mySmallLocalExecTLSv1[TL]@le(r13)
If the size is increased past AIXSmallTlsPolicySizeLimit
, it will load from the TOC instead and do the regular local-exec sequence, which is expected:
ld r4, L..C0(r2) # target-flags(ppc-tprel) @mySmallLocalExecTLSv1
li r3, 1
add r4, r13, r4
stw r3, 0(r4)
. . .
Hope the above answers your question.
…sequence for non-zero offsets This patch utilizes the -maix-small-local-exec-tls option to produce a faster, non-TOC-based access sequence for the local-exec TLS model, specifically for when the offsets from the TLS variable are non-zero. In particular, this patch produces either a single: - addi/la with a displacement off of R13 plus a non-zero offset for when an address is calculated, or - load or store off of R13 plus a non-zero offset for when an address is calculated and used for further access Where R13 is the thread pointer, respectively. In order to produce a single addi or load/store off of the thread pointer with a non-zero offset, this patch also adds the necessary support in the assembly printer when printing these instructions. Specifically: - The non-zero offset is added to the TLS variable address when the address of the TLS variable + it's offset is less than 32KB. - Otherwise, when the address of the TLS variable + its offset is greater than 32KB, the non-zero offset (multiplied by a multiple of 64KB) is subtracted from the TLS address. This handling in the assembly printer is necessary to ensure that the TLS address + the non-zero offset is between [-32768, 32768), so that the total displacement can fit within the addi/load/store instructions.
…move unnecessary breaks
…ry variables, etc.
// access. The first operand of InitialADDI should be the thread pointer, | ||
// which has been checked in isEligibleToFoldADDIForLocalExecAccesses(). | ||
SDValue TPRegNode = InitialADDI.getOperand(0); | ||
RegisterSDNode *TPReg = dyn_cast_or_null<RegisterSDNode>(TPRegNode.getNode()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dyn_cast_or_null
means TPRegNode.getNode()
is nullable, is it true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I think I wanted to change this initially but had forgot. I don't expect these to be null, so I think I will probably just do dyn_cast()
.
859368e
to
cf5df00
Compare
… offsets (llvm#71485) This patch utilizes the -maix-small-local-exec-tls option to produce a faster, non-TOC-based access sequence for the local-exec TLS model. Specifically, for when the offsets from the TLS variable are non-zero. In particular, this patch produces either a single: - addi/la with a displacement off of R13 plus a non-zero offset for when an address is calculated, or - load or store off of R13 plus a non-zero offset for when an address is calculated and used for further access where R13 is the thread pointer, respectively. In order to produce a single addi or load/store off of the thread pointer with a non-zero offset, this patch also adds the necessary support in the assembly printer when printing these instructions. Specifically: - The non-zero offset is added to the TLS variable address when the address of the TLS variable + it's offset is less than 32KB. - Otherwise, when the address of the TLS variable + its offset is greater than 32KB, the non-zero offset (and a multiple of 64KB) is subtracted from the TLS address. This handling in the assembly printer is necessary to ensure that the TLS address + the non-zero offset is between [-32768, 32768), so that the total displacement can fit within the addi/load/store instructions. This patch is meant to be a follow-up to 3f46e54 (where the optimization occurs for when the offset is zero).
… offsets (llvm#71485) This patch utilizes the -maix-small-local-exec-tls option to produce a faster, non-TOC-based access sequence for the local-exec TLS model. Specifically, for when the offsets from the TLS variable are non-zero. In particular, this patch produces either a single: - addi/la with a displacement off of R13 plus a non-zero offset for when an address is calculated, or - load or store off of R13 plus a non-zero offset for when an address is calculated and used for further access where R13 is the thread pointer, respectively. In order to produce a single addi or load/store off of the thread pointer with a non-zero offset, this patch also adds the necessary support in the assembly printer when printing these instructions. Specifically: - The non-zero offset is added to the TLS variable address when the address of the TLS variable + it's offset is less than 32KB. - Otherwise, when the address of the TLS variable + its offset is greater than 32KB, the non-zero offset (and a multiple of 64KB) is subtracted from the TLS address. This handling in the assembly printer is necessary to ensure that the TLS address + the non-zero offset is between [-32768, 32768), so that the total displacement can fit within the addi/load/store instructions. This patch is meant to be a follow-up to 3f46e54 (where the optimization occurs for when the offset is zero).
… offsets (llvm#71485) This patch utilizes the -maix-small-local-exec-tls option to produce a faster, non-TOC-based access sequence for the local-exec TLS model. Specifically, for when the offsets from the TLS variable are non-zero. In particular, this patch produces either a single: - addi/la with a displacement off of R13 plus a non-zero offset for when an address is calculated, or - load or store off of R13 plus a non-zero offset for when an address is calculated and used for further access where R13 is the thread pointer, respectively. In order to produce a single addi or load/store off of the thread pointer with a non-zero offset, this patch also adds the necessary support in the assembly printer when printing these instructions. Specifically: - The non-zero offset is added to the TLS variable address when the address of the TLS variable + it's offset is less than 32KB. - Otherwise, when the address of the TLS variable + its offset is greater than 32KB, the non-zero offset (and a multiple of 64KB) is subtracted from the TLS address. This handling in the assembly printer is necessary to ensure that the TLS address + the non-zero offset is between [-32768, 32768), so that the total displacement can fit within the addi/load/store instructions. This patch is meant to be a follow-up to 3f46e54 (where the optimization occurs for when the offset is zero).
This patch utilizes the -maix-small-local-exec-tls option to produce a faster,
non-TOC-based access sequence for the local-exec TLS model. Specifically, for
when the offsets from the TLS variable are non-zero.
In particular, this patch produces either a single:
access where R13 is the thread pointer, respectively.
In order to produce a single addi or load/store off of the thread pointer with a non-zero offset,
this patch also adds the necessary support in the assembly printer when printing these instructions.
Specifically:
TLS variable + it's offset is less than 32KB.
non-zero offset (and a multiple of 64KB) is subtracted from the TLS address.
This handling in the assembly printer is necessary to ensure that the TLS address + the non-zero offset
is between [-32768, 32768), so that the total displacement can fit within the addi/load/store instructions.
This patch is meant to be a follow-up to 3f46e54 (where the
optimization occurs for when the offset is zero).