Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

-g causes different code to be generated due to PostRA scheduling differences #36588

Open
llvmbot opened this issue Apr 25, 2018 · 13 comments
Open

-g causes different code to be generated due to PostRA scheduling differences #36588

llvmbot opened this issue Apr 25, 2018 · 13 comments

Comments

@llvmbot
Copy link
Collaborator

@llvmbot llvmbot commented Apr 25, 2018

Bugzilla Link 37240
Version trunk
OS All
Blocks #37076
Reporter LLVM Bugzilla Contributor
CC @Arnaud-de-Grandmaison-ARM,@francisvm,@froydnj,@hiraditya,@jmorse,@OCHyams,@pogo59,@smithp35,@yuanfang-chen

Extended Description

When -g (or just -gline-tables-only) is passed to clang, CFI_INSTRUCTIONs are added to the function prologue by the Prologue/Epilogue insertion pass. These instructions act as scheduling region barriers when the PostRA scheduler is run. This can lead to different schedules for the prologue block, which can further lead to differences in tail merging/duplication.

A simple example the illustrates the problem:

target triple = "aarch64-linux-gnu"

@​X1 = global i32 0, align 4
@​X2 = global i32 0, align 4
@​X3 = global i32 0, align 4
@​X4 = global i32 0, align 4

define void @​test(i32 %i) #​0 {
entry:
%0 = load i32, i32* @​X1, align 4
%x1 = add i32 %0, 1
%x2 = add i32 %0, 2
%x3 = add i32 %0, 3
%x4 = add i32 %0, 4
tail call void @​foo()
store i32 %x1, i32* @​X1, align 4
store i32 %x2, i32* @​X2, align 4
store i32 %x3, i32* @​X3, align 4
store i32 %x4, i32* @​X4, align 4
ret void
}

declare void @​foo()

attributes #​0 = { nounwind }

!llvm.dbg.cu = !{#0}
!llvm.module.flags = !{#3, !​4, !​5}
!​0 = distinct !DICompileUnit(language: DW_LANG_C99, file: !​1, producer: "clang version 7.0.0 (trunk 330790) (llvm/trunk 330787)", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !​2)
!​1 = !DIFile(filename: "test.c", directory: "")
!​2 = !{}
!​3 = !{i32 2, !"Dwarf Version", i32 4}
!​4 = !{i32 2, !"Debug Info Version", i32 3}
!​5 = !{i32 1, !"wchar_size", i32 4}

when compiled as is (i.e. with debug info), produces the following code:

    stp     x23, x22, [sp, #-48]!   // 8-byte Folded Spill
    stp     x21, x20, [sp, #​16]     // 8-byte Folded Spill
    stp     x19, x30, [sp, #​32]     // 8-byte Folded Spill
    .cfi_def_cfa_offset 48
    .cfi_offset w30, -8
    .cfi_offset w19, -16
    .cfi_offset w20, -24
    .cfi_offset w21, -32
    .cfi_offset w22, -40
    .cfi_offset w23, -48
    adrp    x19, X1
    ldr     w8, [x19, :lo12:X1]
    add     w20, w8, #​1             // =1
    add     w21, w8, #​2             // =2
    add     w22, w8, #​3             // =3
    add     w23, w8, #​4             // =4
    bl      foo
    adrp    x8, X2
    str     w20, [x19, :lo12:X1]
    str     w21, [x8, :lo12:X2]
    ldp     x19, x30, [sp, #​32]     // 8-byte Folded Reload
    ldp     x21, x20, [sp, #​16]     // 8-byte Folded Reload
    adrp    x9, X3
    adrp    x10, X4
    str     w22, [x9, :lo12:X3]
    str     w23, [x10, :lo12:X4]
    ldp     x23, x22, [sp], #​48     // 8-byte Folded Reload
    ret

but when the debug metadata is commented out, it produces the following code instead:

    stp     x23, x22, [sp, #-48]!   // 8-byte Folded Spill
    stp     x19, x30, [sp, #​32]     // 8-byte Folded Spill
    adrp    x19, X1
    ldr     w8, [x19, :lo12:X1]
    stp     x21, x20, [sp, #​16]     // 8-byte Folded Spill
    add     w20, w8, #​1             // =1
    add     w21, w8, #​2             // =2
    add     w22, w8, #​3             // =3
    add     w23, w8, #​4             // =4
    bl      foo
    adrp    x8, X2
    str     w20, [x19, :lo12:X1]
    str     w21, [x8, :lo12:X2]
    ldp     x19, x30, [sp, #​32]     // 8-byte Folded Reload
    ldp     x21, x20, [sp, #​16]     // 8-byte Folded Reload
    adrp    x9, X3
    adrp    x10, X4
    str     w22, [x9, :lo12:X3]
    str     w23, [x10, :lo12:X4]
    ldp     x23, x22, [sp], #​48     // 8-byte Folded Reload
    ret

Note the different position in the schedule of the first adrp and ldr instructions.

@pogo59
Copy link
Collaborator

@pogo59 pogo59 commented Apr 25, 2018

I was under the impression that there was an existing bug for this,
but I'm not finding it, so here's what I know about this situation.

The .cfi_* directives are there to allow generating information that
allows a debugger (or unwinder, or backtrace, or whatever) to navigate
the stack frames, and correctly restore the state of a previous frame.
That information has specific details about which register is stored
where, and when. Allowing a scheduler to reorder instructions across
.cfi_* directives would make that information incorrect. That's why
the post RA scheduler treats them as scheduling barriers.

@llvmbot
Copy link
Collaborator Author

@llvmbot llvmbot commented Apr 25, 2018

It seems to me that we should be able to model the relevant dependencies (e.g. by adding some implicit reads/writes to the CFI_INSTRUCTION) such that only valid re-orderings would occur. We might also need to teach the scheduler to schedule these instructions as soon as they are available, or in some other way prevent their presence from altering scheduling decisions.

That being said, we'd need to weigh the complexity of doing all of this against the benefit of having slightly more consistent code generation w.r.t. debug flags. FWIW, this isn't something I am actively pursuing.

@pogo59
Copy link
Collaborator

@pogo59 pogo59 commented Apr 25, 2018

It is a given, among people who do debug info seriously, that changing
code generation when you say -g is a Bad Thing. In the worst case,
actual bugs can hide behind what look like innocuous differences.

And if this actually interferes with other optimizations like tail merging
or outlining, then it might get a little higher on someone's radar.

Thanks for filing the bug!

@llvmbot
Copy link
Collaborator Author

@llvmbot llvmbot commented Sep 6, 2019

The scheduling regions [I, RegionEnd] will be broken by cfg instructions in PostRA Machine Instruction Scheduler (postmisched) pass.

track the code when getSchedRegions:
if (isSchedBoundary(&MI, &*MBB, MF, TII))

if (isSchedBoundary(&MI, &*MBB, MF, TII))
break;


isSchedBoundary()
{
return MI->isCall() || TII->isSchedulingBoundary(*MI, MBB, *MF);
}


TII->isSchedulingBoundary(*MI, MBB, *MF);
{
if (MI.isTerminator() || MI.isPosition())
return true;
}


// llvm/include/llvm/CodeGen/MachineInstr.h#L1024
bool isPosition() const { return isLabel() || isCFIInstruction(); }

The given CFI instruction is considered as a scheduling boundary when MI isPosition=true. Then, one Region is broke as two different Regions(that makes wrong behavior).

I am going to ignore CFG instruction before isSchedBoundary() and push_back to Regions without CFG instructions. Or is there any other better solution to fix this issue?

@jmorse
Copy link
Member

@jmorse jmorse commented Sep 10, 2019

Chris wrote:

I am going to ignore CFG instruction before isSchedBoundary() and push_back
to Regions without CFG instructions. Or is there any other better solution
to fix this issue?

From what Paul wrote in comment 1, I believe the risk is that machine instructions will be rescheduled in such a way that the CFI instructions are no longer correct, which will mislead debuggers. Fully fixing this probably means the scheduler needs to understand CFI instructions, so that it can fix up the changes that it makes.

An alternative interpretation would be that it's better to create broken debug info than to change the generated code, but that might be difficult to argue.

@llvmbot
Copy link
Collaborator Author

@llvmbot llvmbot commented Sep 17, 2019

In this case, cfi instructions are recognized as SchedulingBoundary and break one Region into two Regions. For example, the su(0) and su(1) are one Region, and su(9) and su(10) is another Region, they should schedule in one Region, but in fact they do not schedule as two Regions caused by cfi barriers. After replaced the MI.isPosition with MI.isLable in isSchedulingBoundary, cfi instructions will not break the Regions then, and those cfi instructions will be pushed in the same Regions with machine instructions and do scheduling together.

here is debug information:

Instruction sequence before schedule:
su(0) - machine instruction sunit
su(1)
su(2~8)-these are cfi instructions
su(9)
su(10)


after acheduling (no code change) - scheduling no change, postmisched fails
su(0)
su(1)
su(2~8)-these are cfi instructions
su(9)
su(10)


After scheduling (with code change) - the machine instruction is correct, but cfi moved after su(0).
su(1)
su(9)
su(10)
su(0)
su(2~8) - cfi instruction.

code change:
TII->isSchedulingBoundary(*MI, MBB, *MF);
{

  • if (MI.isTerminator() || MI.isPosition())
  • if (MI.isTerminator() || MI.isLabel())
    return true;
    }

Could the issue be fixed with this change? In this way the cfi instruction will join scheduling and reserved in code. Not sure if this has risk somewhere.
or could I ignore the cfi instruction just like handling with debug instruction while scheduling(not push to Region, not do scheduling)?

@llvmbot
Copy link
Collaborator Author

@llvmbot llvmbot commented Sep 26, 2019

I think your idea is not a valid fix for the problem. Although the problem of having differing assembly will certainly be fixed by your change, it would move cfi instructions in an invalid way and away from stack altering instructions.

Please see this discussion on the mailing list: http://lists.llvm.org/pipermail/llvm-dev/2019-September/135433.html

And this patch on Phabricator: https://reviews.llvm.org/D68076

@OCHyams
Copy link
Contributor

@OCHyams OCHyams commented Dec 5, 2019

Hi Chris, I've been digging around and, with some help, I think I now
understand what's going on here. I hope this sheds some light on the matter.

There is a TL;DR near the end. The following write-up follows my investigation
as someone who previously had no knowledge on cfi instructions.

Clickbait title: CFI_INSTRUCTIONs don't need to be scheduling barriers.

Scheduling barriers

If an instruction, implicitly or otherwise, modifies the stack pointer
register (SP) it is treated as a scheduling barrier. CFI_INSTRUCTIONs in LLVM
mir are also always treated as scheduling barriers.

Call Frame Information (CFI) instructions

CFI instructions are used to work out offset of the Canonical Frame Address
(CFA), the stack slot containing the return address, from SP for the current
frame at any given time during program execution. By definition then any SP
modifying instruction requires a CFI_INSTRUCTION follow it to describe the
modification (if we're emitting emitting CFI, of course).

CFI_INSTRUCTIONs that we care about for this example:
def_cfa_offset <x: int> : CFA offset from SP
offset <r: reg>, <x: int> : Reg value stored at offset from CFA

x86

When we spill registers on x86 we will most likely push r. push modifies
SP, so it's a scheduling barrier, and if we want CFI it also requires a
CFI_INSTRUCTION.

Assuming the CFI_INSTRUCTIONs correctly always follow the instruction that
they describe it doesn't matter whether they are scheduling barriers or not.
Illustrated in this example:

--- Without CFI
MOV64rr $eax, $r15
PUSH64r $rbp // [WRITE SP: SCHED BARRIER]
MOV64rr $eax, $r15
...
--- With CFI
MOV64rr $eax, $r15
PUSH64r $rbp // [WRITE SP: SCHED BARRIER]
CFI_INSTRUCTION def_cfa_offset 8 // [CFI: SCHED BARRIER] Follow PUSH64r
CFI_INSTRUCTION offset $rbp -16 // [CFI: SCHED BARRIER] Follow PUSH64r
MOV64rr $eax, $r15
...

Goeff's aarch64 reproducer

Taken from this ticket, note the new comments in the square brackets.

stp x23, x22, [sp, #-48]! // 8-byte Folded Spill [READ-WRITE SP]
stp x21, x20, [sp, #​16] // 8-byte Folded Spill [READ SP]
stp x19, x30, [sp, #​32] // 8-byte Folded Spill [READ SP]
.cfi_def_cfa_offset 48
.cfi_offset w30, -8
.cfi_offset w19, -16
.cfi_offset w20, -24
.cfi_offset w21, -32
.cfi_offset w22, -40
.cfi_offset w23, -48

Here, there are two spills which do not modify SP but do require
CFI_INSTRUCTIONs.

[TL;DR]
This is the disconnect: Not all instructions which require CFI_INSTRUCTIONs
modify SP.

By definition, instructions which modify SP are both scheduling barriers
and require a CFI_INSTRUCTION def_cfa_offset. Spills, however, may not modify
SP and therefore may not be scheduling barriers. So the instruction
CFI_INSTRUCTION offset may incorrectly introduce a scheduling barrier and
cause a codegen change.

Potential fix (suggested by @​jmorse offline)

Pull out CFI_INSTRUCTIONS before scheduling. Map them to the instructions
which they describe. Then after scheduling, reinsert the CFI_INSTRUCTIONs
such that they follow the instructions they describe. This can be done by
the scheduler itself, and it seems to be an easier-to-implement version of
fix #​2 suggested in the llvm-dev discussion here [0].

[0] http://lists.llvm.org/pipermail/llvm-dev/2019-September/135433.html

@llvmbot
Copy link
Collaborator Author

@llvmbot llvmbot commented Dec 9, 2019

Pull out CFI_INSTRUCTIONS before scheduling. Map them to the instructions
which they describe. Then after scheduling, reinsert the CFI_INSTRUCTIONs
such that they follow the instructions they describe. This can be done by
the scheduler itself, and it seems to be an easier-to-implement version of
fix #​2 suggested in the llvm-dev discussion here [0].
[0] http://lists.llvm.org/pipermail/llvm-dev/2019-September/135433.html

Thanks Orlando for analyzing, not sure if David or someone else are working on this issue. If not, let me have a try to fix.

@llvmbot
Copy link
Collaborator Author

@llvmbot llvmbot commented Dec 11, 2019

Not sure if my solution is workable as below:

In getSchedRegions(), the region will be detected from RegionEnd to RegionBegin,
during this detecting, when found CFI, then splice it to RegionEnd and set the CFI as RegionEnd. So that it will not be recognized as SchedBoundary to split one Region for scheduling.

for example, original MBB:

frame-setup STPXi killed $x20, killed $x19, $sp, 4  ------(Region 2 Begin)
frame-setup CFI_INSTRUCTION def_cfa_offset 48  ------(Region 2 End)
frame-setup CFI_INSTRUCTION offset $w19, -8
renamable $w23 = ADDWri killed renamable $w8, 4, 0 ------(Region 1 Begin)
BL @&#8203;foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit-def $sp  ------(Region 1 End)

After reorder CFI by splice to END:

frame-setup STPXi killed $x20, killed $x19, $sp, 4 ------(Region 1 Begin)
renamable $w23 = ADDWri killed renamable $w8, 4, 0 
frame-setup CFI_INSTRUCTION def_cfa_offset 48 ------(Region 1 End)
frame-setup CFI_INSTRUCTION offset $w19, -8
BL @&#8203;foo, csr_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit-def $sp

@llvmbot
Copy link
Collaborator Author

@llvmbot llvmbot commented Dec 11, 2019

Only aarch64/arm64 meet this issue, and other platform does not, X86 for example disabled post-MI-sched:

lib/Target/X86/X86Schedule.td

// IssueWidth is analogous to the number of decode units. Core and its
// descendents, including Nehalem and SandyBridge have 4 decoders.
// Resources beyond the decoder operate on micro-ops and are bufferred
// so adjacent micro-ops don't directly compete.
//
// MicroOpBufferSize > 1 indicates that RAW dependencies can be
// decoded in the same cycle. The value 32 is a reasonably arbitrary
// number of in-flight instructions.
//
// HighLatency=10 is optimistic. X86InstrInfo::isHighLatencyDef
// indicates high latency opcodes. Alternatively, InstrItinData
// entries may be included here to define specific operand
// latencies. Since these latencies are not used for pipeline hazards,
// they do not need to be exact.

// The GenericX86Model contains no instruction schedules
// and disables PostRAScheduler.
class GenericX86Model : SchedMachineModel {
let IssueWidth = 4;
let MicroOpBufferSize = 32;
let LoadLatency = 4;
let HighLatency = 10;
let PostRAScheduler = 0; -----disabled
let CompleteModel = 0;
}

@hiraditya
Copy link
Collaborator

@hiraditya hiraditya commented Aug 8, 2020

@vedantk
Copy link
Collaborator

@vedantk vedantk commented Nov 27, 2021

mentioned in issue #37076

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants