Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIX][TLS] Optimize the small local-exec access sequence for non-zero offsets #71485

Merged

Conversation

amy-kwan
Copy link
Contributor

@amy-kwan amy-kwan commented Nov 7, 2023

This patch utilizes the -maix-small-local-exec-tls option to produce a faster,
non-TOC-based access sequence for the local-exec TLS model. Specifically, for
when the offsets from the TLS variable are non-zero.

In particular, this patch produces either a single:

  • addi/la with a displacement off of R13 plus a non-zero offset for when an address is calculated, or
  • load or store off of R13 plus a non-zero offset for when an address is calculated and used for further
    access where R13 is the thread pointer, respectively.

In order to produce a single addi or load/store off of the thread pointer with a non-zero offset,
this patch also adds the necessary support in the assembly printer when printing these instructions.

Specifically:

  • The non-zero offset is added to the TLS variable address when the address of the
    TLS variable + it's offset is less than 32KB.
  • Otherwise, when the address of the TLS variable + its offset is greater than 32KB, the
    non-zero offset (and a multiple of 64KB) is subtracted from the TLS address.

This handling in the assembly printer is necessary to ensure that the TLS address + the non-zero offset
is between [-32768, 32768), so that the total displacement can fit within the addi/load/store instructions.

This patch is meant to be a follow-up to 3f46e54 (where the
optimization occurs for when the offset is zero).

Copy link

github-actions bot commented Nov 7, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

Copy link
Contributor

@stephenpeckham stephenpeckham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testcase updates look reasonable, but I didn't review every change.

@@ -1514,13 +1522,41 @@ void PPCAsmPrinter::emitInstruction(const MachineInstr *MI) {
case PPC::LWA: {
// Verify alignment is legal, so we don't create relocations
// that can't be supported.
unsigned OpNum = (MI->getOpcode() == PPC::STD) ? 2 : 1;
unsigned OpNum;
if (Subtarget->hasAIXSmallLocalExecTLS())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does hasAIXSmallLocalExecTLS() affect OpNum? Is a comment needed?

int64_t Offset = MO.getOffset();
// Non-zero offsets for lwa/ld/std require special handling and are
// handled here.
if (!Offset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is opposite the preceding comment, and I wouldn't use a 'break' inside 2 if statments. Why not use if (Offset) { ...

break;

LowerPPCMachineInstrToMCInst(MI, TmpInst, *this);
if (Offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check redundant?

llvm/lib/Target/PowerPC/PPCAsmPrinter.cpp Show resolved Hide resolved
int64_t Offset = MO.getOffset();
// Non-zero offsets for loads/stores require special handling and are
// handled here. For `addi`, all offsets are handled here.
if (!Offset && !IsMIADDI8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above: I would use if (Offset || IsMIADDI8) {...

// we need to create a new MCExpr that adds the non-zero offset to the address
// of the local-exec variable that will be used in either an addi, load or
// store. However, the final displacement for these instructions must be
// between [-32768, 32768), so if the TLS address + it's non-zero offset is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// between [-32768, 32768), so if the TLS address + it's non-zero offset is
// between [-32768, 32768), so if the TLS address + its non-zero offset is

// Handle the written offset for cases where:
// address of the TLS variable + the offset is greater than 32KB.

// Get the address in the range of 0 to 64KB.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can simplify this.The amount you have to subtract is ((FinalAddress + 32768 ) & ~0xFFFF)

llvm/lib/Target/PowerPC/PPCAsmPrinter.cpp Show resolved Hide resolved
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: la r3, mySmallLocalExecTLS5[TL]@le(r13)
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r4, 332(r3)
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r4, mySmallLocalExecTLSv1[TL]@le+24(r13)
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r3, mySmallLocalExecTLS2[TL]@le-65216(r13)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the assembly prints:

TLSVar[storageMappingClass]@le + OffsetDelta(r13)

Where OffsetDelta = Offset - (Multiple of 64KB). Basically, it will print the single OffsetDelta value.

Stephen and I discussed this, and perhaps another way we could print it is the more explicit way of separating everything out:

TLSVar[storageMappingClass]@le + Offset - (Multiple of 64KB) (r13)

I experimented with this a bit and an example of the assembly would look like:

stw r3, (mySmallLocalExecTLS2[TL]@le+320)-65536(r13)

Not sure if that's clearer or looks better/worse, but the current way might look a bit cleaner.

If anyone has any opinions on this, please let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the alternate way would look be better, because it's easier to debug or understand the assembly code. In addition, the "-65536" is required because of an AIX assembler limitation. If the assembler were fixed, the extra subtraction could be eliminated. But I could accept either output format.

Copy link
Contributor Author

@amy-kwan amy-kwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial group code review comments.

Also:

  • Title should be updated.
  • Might be good to add the zero-based offset patch in the description.

Done the above two items.

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp Show resolved Hide resolved
llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp Show resolved Hide resolved
// InitialADDI is the addi feeding into N (also an addi), and the addi that
// we want optimized out.
SDValue InitialADDI = N->getOperand(0);
if (!InitialADDI.isMachineOpcode() ||
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to separate the checks into a separate function so we can check it in the load/store case, too.

@@ -7735,7 +7792,15 @@ void PPCDAGToDAGISel::PeepholePPC64() {
ImmOpnd = CurDAG->getTargetConstant(Offset, SDLoc(ImmOpnd),
ImmOpnd.getValueType());
} else if (Offset != 0) {
continue;
if (!HasAIXSmallLocalExecTLS)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing checks for thread pointer/TLS variable or not?
Double check this.

@@ -7735,7 +7792,15 @@ void PPCDAGToDAGISel::PeepholePPC64() {
ImmOpnd = CurDAG->getTargetConstant(Offset, SDLoc(ImmOpnd),
ImmOpnd.getValueType());
} else if (Offset != 0) {
continue;
if (!HasAIXSmallLocalExecTLS)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to do:

if (HasAIXSmallLocalExecTLS) {
  ...
} else
  continue;

@amy-kwan amy-kwan changed the title [AIX][TLS] Optimize the -maix-small-local-exec-tls local-exec access sequence for non-zero offsets [AIX][TLS] Optimize the small local-exec access sequence for non-zero offsets Nov 22, 2023
Copy link
Contributor

@stephenpeckham stephenpeckham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the latest changes look good

const MachineOperand &MO = MI->getOperand(OpNum);
if (MO.isGlobal()) {
const DataLayout &DL = MO.getGlobal()->getParent()->getDataLayout();
if (MO.getGlobal()->getPointerAlignment(DL) < 4)
llvm_unreachable("Global must be word-aligned for LD, STD, LWA!");

// A faster non-TOC-based local-exec sequence is represented by
// `lwa`/`ld`/`std` directingly loading or storing off of the thread
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit, but there are a lot of load and store instructions. I don't think you need to list lwa, ld, and std explicitly.

; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: la r3, mySmallLocalExecTLS5[TL]@le(r13)
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r4, 332(r3)
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r4, mySmallLocalExecTLSv1[TL]@le+24(r13)
; SMALL-LOCAL-EXEC-SMALLCM64-NEXT: stw r3, mySmallLocalExecTLS2[TL]@le-65216(r13)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the alternate way would look be better, because it's easier to debug or understand the assembly code. In addition, the "-65536" is required because of an AIX assembler limitation. If the assembler were fixed, the extra subtraction could be eliminated. But I could accept either output format.

// Such instructions do not otherwise arise.
if ((MO.getTargetFlags() & PPCII::MO_TPREL_FLAG) != 0) {
assert(HasAIXSmallLocalExecTLS &&
"lwa/ld/std with thread-pointer only expected with "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another nit: I would change "lwa/ld/std" to "load/store"

const MachineOperand &MO = MI->getOperand(OpNum);
if (MO.isGlobal()) {
const DataLayout &DL = MO.getGlobal()->getParent()->getDataLayout();
if (MO.getGlobal()->getPointerAlignment(DL) < 4)
llvm_unreachable("Global must be word-aligned for LD, STD, LWA!");

// A faster non-TOC-based local-exec sequence is represented by
// directingly loading or storing off of the thread pointer and with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

directingly -> directly

// non-zero offset to the TLS variable address.
// For when TLS variables are extern, this is safe to do because we can
// assume that the address of extern TLS variables are zero.
if ((FinalAddress < 32768) || IsGlobalADeclaration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we don't need || IsGlobalADeclaration since in
ptrdiff_t FinalAddress = (TLSVarAddress + Offset);
TLSVarAddress was initialized to 0 for IsGlobalADeclaration.
So it will already pass FinalAddress < 32768 for IsGlobalADeclaration.

@amy-kwan amy-kwan force-pushed the amy-kwan/small-local-exec-tls-nonzero-offsets branch from 789629a to c6d882b Compare December 1, 2023 05:50
@amy-kwan amy-kwan force-pushed the amy-kwan/small-local-exec-tls-nonzero-offsets branch from fdcfeef to cf3957f Compare December 7, 2023 21:35
@amy-kwan amy-kwan force-pushed the amy-kwan/small-local-exec-tls-nonzero-offsets branch 2 times, most recently from ec5f16e to c9ea0b0 Compare January 2, 2024 19:19
@amy-kwan
Copy link
Contributor Author

amy-kwan commented Jan 8, 2024

Ping.

Copy link
Contributor

@stephenpeckham stephenpeckham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than a couple of nits, I think the code and the generated assembly look good. Factoring out the +/- 65536 in the assemble code helps anyone reading the the assembly listing.

llvm/lib/Target/PowerPC/PPCAsmPrinter.cpp Outdated Show resolved Hide resolved
llvm/lib/Target/PowerPC/PPCAsmPrinter.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@stephenpeckham stephenpeckham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@amy-kwan amy-kwan force-pushed the amy-kwan/small-local-exec-tls-nonzero-offsets branch from 2601089 to 3563b6a Compare January 24, 2024 21:06
Copy link
Contributor

@diggerlin diggerlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@@ -7578,6 +7668,10 @@ void PPCDAGToDAGISel::PeepholePPC64() {
if (isVSXSwap(SDValue(N, 0)))
reduceVSXSwap(N, CurDAG);

// This optimization is performed for non-TOC-based local-exec accesses.
if (HasAIXSmallLocalExecTLS)
foldADDIForLocalExecAccesses(N, CurDAG);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

continue the loop earlier will save compiler compile time ,
anyway, you can keep your code. I don't have a strong preference.

llvm/lib/Target/PowerPC/PPCAsmPrinter.cpp Show resolved Hide resolved
@diggerlin diggerlin self-requested a review January 27, 2024 16:00
Copy link
Contributor

@diggerlin diggerlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +1649 to +1652
assert((InstDisp < 32768) ||
(InstDisp >= -32768) &&
"Expecting the instruction displacement for local-exec TLS "
"variables to be between [-32768, 32768)!");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not a review comment)

Per my understanding, allow peephole for non-zero offsets but without changes to AsmPrinter, we'll get assembler error like The displacement must be greater than or equal to -32768 and less than or equal to 32767 (no-integrated-as).

Since code here is to rewrite offsets exceeding upper limit (32768) into negative, will this assert be hit if the offset is even larger? (for example, twice the size as objects in aix-small-local-exec-tls-largeaccess.ll)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For larger variables, the assert will be not triggered. This is because in the initial patch that introduced this feature, I only restricted this non-TOC-based access sequence if the size of the TLS variable is less than 32751 within PPCISelLowering.cpp.

constexpr uint64_t AIXSmallTlsPolicySizeLimit = 32751;
....

     // With the -maix-small-local-exec-tls option, produce a faster access
      // sequence for local-exec TLS variables where the offset from the TLS
      // base is encoded as an immediate operand.
      //
      // We only utilize the faster local-exec access sequence when the TLS
      // variable has a size within the policy limit. We treat types that are
      // not sized or are empty as being over the policy size limit.
      if (HasAIXSmallLocalExecTLS && IsTLSLocalExecModel) {
        Type *GVType = GV->getValueType();
        if (GVType->isSized() && !GVType->isEmptyTy() &&
            GV->getParent()->getDataLayout().getTypeAllocSize(GVType) <=
                AIXSmallTlsPolicySizeLimit)
          return DAG.getNode(PPCISD::Lo, dl, PtrVT, VariableOffsetTGA, TLSReg);
      }

So in the case where the size is 8187, the test case for the large variable in aix-small-local-exec-tls-largeaccess.ll looks like:

        stw r3, mySmallLocalExecTLSv1[TL]@le(r13)

If the size is increased past AIXSmallTlsPolicySizeLimit, it will load from the TOC instead and do the regular local-exec sequence, which is expected:

        ld r4, L..C0(r2)                        # target-flags(ppc-tprel) @mySmallLocalExecTLSv1
        li r3, 1
        add r4, r13, r4
        stw r3, 0(r4)
. . .

Hope the above answers your question.

…sequence for non-zero offsets

This patch utilizes the -maix-small-local-exec-tls option to produce a faster,
non-TOC-based access sequence for the local-exec TLS model, specifically for
when the offsets from the TLS variable are non-zero.

In particular, this patch produces either a single:
- addi/la with a displacement off of R13 plus a non-zero offset for
  when an address is calculated, or
- load or store off of R13 plus a non-zero offset for when an address is
  calculated and used for further access
Where R13 is the thread pointer, respectively.

In order to produce a single addi or load/store off of the thread pointer with
a non-zero offset, this patch also adds the necessary support in the assembly
printer when printing these instructions.

Specifically:
- The non-zero offset is added to the TLS variable address when the address of
  the TLS variable + it's offset is less than 32KB.
- Otherwise, when the address of the TLS variable + its offset is greater than
  32KB, the non-zero offset (multiplied by a multiple of 64KB) is subtracted
  from the TLS address.

This handling in the assembly printer is necessary to ensure that the TLS
address + the non-zero offset is between [-32768, 32768), so that the total
displacement can fit within the addi/load/store instructions.
// access. The first operand of InitialADDI should be the thread pointer,
// which has been checked in isEligibleToFoldADDIForLocalExecAccesses().
SDValue TPRegNode = InitialADDI.getOperand(0);
RegisterSDNode *TPReg = dyn_cast_or_null<RegisterSDNode>(TPRegNode.getNode());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dyn_cast_or_null means TPRegNode.getNode() is nullable, is it true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I think I wanted to change this initially but had forgot. I don't expect these to be null, so I think I will probably just do dyn_cast().

@amy-kwan amy-kwan force-pushed the amy-kwan/small-local-exec-tls-nonzero-offsets branch from 859368e to cf5df00 Compare February 1, 2024 13:57
@amy-kwan amy-kwan merged commit 2a50921 into llvm:main Feb 1, 2024
3 of 4 checks passed
smithp35 pushed a commit to smithp35/llvm-project that referenced this pull request Feb 1, 2024
… offsets (llvm#71485)

This patch utilizes the -maix-small-local-exec-tls option to produce a
faster,
non-TOC-based access sequence for the local-exec TLS model.
Specifically, for
when the offsets from the TLS variable are non-zero.

In particular, this patch produces either a single:
- addi/la with a displacement off of R13 plus a non-zero offset for when
an address is calculated, or
- load or store off of R13 plus a non-zero offset for when an address is
calculated and used for further
  access where R13 is the thread pointer, respectively.

In order to produce a single addi or load/store off of the thread
pointer with a non-zero offset,
this patch also adds the necessary support in the assembly printer when
printing these instructions.

Specifically:
- The non-zero offset is added to the TLS variable address when the
address of the
  TLS variable + it's offset is less than 32KB.
- Otherwise, when the address of the TLS variable + its offset is
greater than 32KB, the
non-zero offset (and a multiple of 64KB) is subtracted from the TLS
address.

This handling in the assembly printer is necessary to ensure that the
TLS address + the non-zero offset
is between [-32768, 32768), so that the total displacement can fit
within the addi/load/store instructions.

This patch is meant to be a follow-up to
3f46e54 (where the
optimization occurs for when the offset is zero).
carlosgalvezp pushed a commit to carlosgalvezp/llvm-project that referenced this pull request Feb 1, 2024
… offsets (llvm#71485)

This patch utilizes the -maix-small-local-exec-tls option to produce a
faster,
non-TOC-based access sequence for the local-exec TLS model.
Specifically, for
when the offsets from the TLS variable are non-zero.

In particular, this patch produces either a single:
- addi/la with a displacement off of R13 plus a non-zero offset for when
an address is calculated, or
- load or store off of R13 plus a non-zero offset for when an address is
calculated and used for further
  access where R13 is the thread pointer, respectively.

In order to produce a single addi or load/store off of the thread
pointer with a non-zero offset,
this patch also adds the necessary support in the assembly printer when
printing these instructions.

Specifically:
- The non-zero offset is added to the TLS variable address when the
address of the
  TLS variable + it's offset is less than 32KB.
- Otherwise, when the address of the TLS variable + its offset is
greater than 32KB, the
non-zero offset (and a multiple of 64KB) is subtracted from the TLS
address.

This handling in the assembly printer is necessary to ensure that the
TLS address + the non-zero offset
is between [-32768, 32768), so that the total displacement can fit
within the addi/load/store instructions.

This patch is meant to be a follow-up to
3f46e54 (where the
optimization occurs for when the offset is zero).
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request Feb 5, 2024
… offsets (llvm#71485)

This patch utilizes the -maix-small-local-exec-tls option to produce a
faster,
non-TOC-based access sequence for the local-exec TLS model.
Specifically, for
when the offsets from the TLS variable are non-zero.

In particular, this patch produces either a single:
- addi/la with a displacement off of R13 plus a non-zero offset for when
an address is calculated, or
- load or store off of R13 plus a non-zero offset for when an address is
calculated and used for further
  access where R13 is the thread pointer, respectively.

In order to produce a single addi or load/store off of the thread
pointer with a non-zero offset,
this patch also adds the necessary support in the assembly printer when
printing these instructions.

Specifically:
- The non-zero offset is added to the TLS variable address when the
address of the
  TLS variable + it's offset is less than 32KB.
- Otherwise, when the address of the TLS variable + its offset is
greater than 32KB, the
non-zero offset (and a multiple of 64KB) is subtracted from the TLS
address.

This handling in the assembly printer is necessary to ensure that the
TLS address + the non-zero offset
is between [-32768, 32768), so that the total displacement can fit
within the addi/load/store instructions.

This patch is meant to be a follow-up to
3f46e54 (where the
optimization occurs for when the offset is zero).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants