Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spill/Reload of Constant across CPSCall #11

Open
kavon opened this issue Jun 12, 2017 · 10 comments
Open

Spill/Reload of Constant across CPSCall #11

kavon opened this issue Jun 12, 2017 · 10 comments

Comments

@kavon
Copy link
Owner

kavon commented Jun 12, 2017

In the example below, the constant global's address is spilled and reloaded from the C stack across this call, which we should not have happen.

; ModuleID = 'spill_nt.ll'
source_filename = "spill_nt.ll"

@ghczmprim_GHCziTypes_True_closure = external global i8

declare ghccc { i64, i64* } @bar(i64, i64*)

; Function Attrs: naked
define ghccc { i64, i64* } @foo(i64 %__x, i64* %__y) #0 {
cazq:
  %lnbeg = ptrtoint i8* @ghczmprim_GHCziTypes_True_closure to i64
  %lnbeh = add i64 %lnbeg, 2
  %retVals = call ghccc { i64, i64* } ({ i64, i64* } (i64, i64*)*, i64, i32, i16, ...) 
  		@llvm.experimental.cpscall.sl_i64p0i64s.p0f_sl_i64p0i64si64p0i64f(
  			{ i64, i64* } (i64, i64*)* @bar, i64 1337, i32 2, i16 1, 
  			i64 %lnbeh, i64* %__y)
  %lnbeg2 = ptrtoint i8* @ghczmprim_GHCziTypes_True_closure to i64
  %lnbeh2 = add i64 %lnbeg2, 2
  %updated = insertvalue { i64, i64* } %retVals, i64 %lnbeh2, 0
  ret { i64, i64* } %updated
}

declare ghccc { i64, i64* } 
	@llvm.experimental.cpscall.sl_i64p0i64s.p0f_sl_i64p0i64si64p0i64f(
		{ i64, i64* } (i64, i64*)*, i64, i32, i16, ...)

attributes #0 = { naked }
	.section	__TEXT,__text,regular,pure_instructions
	.macosx_version_min 10, 11
	.globl	_foo
	.p2align	4, 0x90
_foo:                                   ## @foo
	.cfi_startproc
## BB#0:                                ## %cazq
	movq	_ghczmprim_GHCziTypes_True_closure@GOTPCREL(%rip), %r13
	addq	$2, %r13
	movq	%r13, (%rsp)            ## 8-byte Spill
	leaq	L1337_0(%rip), %rax
	movq	%rax, 2(%rbp)
	jmp	_bar                    ## TAILCALL
LBB0_1:
L1337_0:
	movq	(%rsp), %r13            ## 8-byte Reload
	movq	(%r13), %rax
	jmpq	*%rax                   ## TAILCALL
	.cfi_endproc


.subsections_via_symbols
@kavon kavon added the bug label Jun 12, 2017
@kavon
Copy link
Owner Author

kavon commented Jun 12, 2017

Looks like this will be tricky to fix. This is the MBB before expanding the CPSCall, and vreg3 is reused after the call.

# Machine code for function foo: IsSSA, TracksLiveness
Function Live Ins: %RBP in %vreg1

BB#0: derived from LLVM BB %cazq
    Live Ins: %RBP
	%vreg1<def> = COPY %RBP; GR64:%vreg1
	%vreg2<def> = MOV64rm %RIP, 1, %noreg, <ga:@ghczmprim_GHCziTypes_True_closure>[TF=5], %noreg; mem:LD8[GOT] GR64:%vreg2
	%vreg3<def,tied1> = ADD64ri8 %vreg2<tied0>, 2, %EFLAGS<imp-def,dead>; GR64:%vreg3,%vreg2
	ADJCALLSTACKDOWN64 0, 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
	%R13<def> = COPY %vreg3; GR64:%vreg3
	%RBP<def> = COPY %vreg1; GR64:%vreg1
	CPSCALLd64 <ga:@bar>, 1337, 2, 1, <regmask>, %RSP<imp-use>, %R13<imp-use>, %RBP<imp-use>, %RSP<imp-def>, %R13<imp-def>, %RBP<imp-def>
	ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
	%vreg4<def> = COPY %R13; GR64:%vreg4
	%vreg5<def> = COPY %RBP; GR64:%vreg5
	%R13<def> = COPY %vreg3; GR64:%vreg3
	%RBP<def> = COPY %vreg5; GR64:%vreg5
	CPSRET 0, %R13, %RBP

@kavon
Copy link
Owner Author

kavon commented Jun 12, 2017

Looks like it happens during initial IR -> DAG construction. What a weird time to do an optimization like this.

*** IR Dump After Module Verifier ***
; Function Attrs: naked
define ghccc { i64, i64* } @foo(i64 %__x, i64* %__y) #0 {
cazq:
  %lnbeg = ptrtoint i8* @ghczmprim_GHCziTypes_True_closure to i64
  %lnbeh = add i64 %lnbeg, 2
  %retVals = call ghccc { i64, i64* } ({ i64, i64* } (i64, i64*)*, i64, i32, i16, ...) @llvm.experimental.cpscall.sl_i64p0i64s.p0f_sl_i64p0i64si64p0i64f({ i64, i64* } (i64, i64*)* @bar, i64 1337, i32 2, i16 1, i64 %lnbeh, i64* %__y)
  %lnbeg2 = ptrtoint i8* @ghczmprim_GHCziTypes_True_closure to i64
  %lnbeh2 = add i64 %lnbeg2, 2
  %updated = insertvalue { i64, i64* } %retVals, i64 %lnbeh2, 0
  ret { i64, i64* } %updated
}



=== foo
Initial selection DAG: BB#0 'foo:cazq'
SelectionDAG has 30 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
  t7: i64 = add GlobalAddress:i64<i8* @ghczmprim_GHCziTypes_True_closure> 0, Constant:i64<2>
  t8: i64 = GlobalAddress<{ i64, i64* } (i64, i64*)* @bar> 0
    t13: ch,glue = callseq_start t0, TargetConstant:i64<0>, TargetConstant:i64<0>
  t15: ch,glue = CopyToReg t13, Register:i64 %R13, t7
    t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
  t17: ch,glue = CopyToReg t15, Register:i64 %RBP, t4, t15:1
  t20: ch,glue = X86ISD::CPS_CALL t17, TargetGlobalAddress:i64<{ i64, i64* } (i64, i64*)* @bar> 0, Constant:i64<1337>, Constant:i32<2>, Constant:i16<1>, Register:i64 %R13, Register:i64 %RBP, RegisterMask:Untyped, t17:1
  t21: ch,glue = callseq_end t20, TargetConstant:i64<0>, TargetConstant:i64<0>, t20:1
  t22: i64,ch,glue = CopyFromReg t21, Register:i64 %R13, t21:1
  t23: i64,ch,glue = CopyFromReg t22:1, Register:i64 %RBP, t22:2
    t24: i64,i64 = merge_values t22, t23
  t25: i64,i64 = merge_values t7, t24:1
  t27: ch,glue = CopyToReg t23:1, Register:i64 %R13, t25
  t28: ch,glue = CopyToReg t27, Register:i64 %RBP, t25:1, t27:1
  t29: ch = X86ISD::CPS_RET t28, TargetConstant:i32<0>, Register:i64 %R13, Register:i64 %RBP, t28:1

@kavon
Copy link
Owner Author

kavon commented Jun 12, 2017

Turns out that it's due to SelectionDAG memoizing nodes within a block. The add nodes are identical and so the CSEMap returns the previously constructed one (t7 in the first output). When I do a CSE.clear(), we get a fresh node back on the 2nd visit (t25 in the second output).

➤ llc reload.ll 

; Function Attrs: naked
define ghccc { i64, i64* } @foo(i64 %__x, i64* %__y) #0 {
cazq:
  %lnbeg = ptrtoint i8* @ghczmprim_GHCziTypes_True_closure to i64
  %lnbeh = add i64 %lnbeg, 2
  %retVals = call ghccc { i64, i64* } ({ i64, i64* } (i64, i64*)*, i64, i32, i16, ...) @llvm.experimental.cpscall.sl_i64p0i64s.p0f_sl_i64p0i64si64p0i64f({ i64, i64* } (i64, i64*)* @bar, i64 1337, i32 2, i16 1, i64 %lnbeh, i64* %__y)
  %lnbeg2 = ptrtoint i8* @ghczmprim_GHCziTypes_True_closure to i64
  %lnbeh2 = add i64 %lnbeg2, 2
  %updated = insertvalue { i64, i64* } %retVals, i64 %lnbeh2, 0
  ret { i64, i64* } %updated
}

t7: i64 = add GlobalAddress:i64<i8* @ghczmprim_GHCziTypes_True_closure> 0, Constant:i64<2>
t7: i64 = add GlobalAddress:i64<i8* @ghczmprim_GHCziTypes_True_closure> 0, Constant:i64<2>
kavon@cronus:~/m/playground
➤ llc reload.ll 

; Function Attrs: naked
define ghccc { i64, i64* } @foo(i64 %__x, i64* %__y) #0 {
cazq:
  %lnbeg = ptrtoint i8* @ghczmprim_GHCziTypes_True_closure to i64
  %lnbeh = add i64 %lnbeg, 2
  %retVals = call ghccc { i64, i64* } ({ i64, i64* } (i64, i64*)*, i64, i32, i16, ...) @llvm.experimental.cpscall.sl_i64p0i64s.p0f_sl_i64p0i64si64p0i64f({ i64, i64* } (i64, i64*)* @bar, i64 1337, i32 2, i16 1, i64 %lnbeh, i64* %__y)
  %lnbeg2 = ptrtoint i8* @ghczmprim_GHCziTypes_True_closure to i64
  %lnbeh2 = add i64 %lnbeg2, 2
  %updated = insertvalue { i64, i64* } %retVals, i64 %lnbeh2, 0
  ret { i64, i64* } %updated
}

t7: i64 = add GlobalAddress:i64<i8* @ghczmprim_GHCziTypes_True_closure> 0, Constant:i64<2>
t25: i64 = add GlobalAddress:i64<i8* @ghczmprim_GHCziTypes_True_closure> 0, Constant:i64<2>
Assertion failed: ((Ptr & 1) && "Not a bucket pointer"), function GetBucketPtr, file /Users/kavon/msr/llvm-dev/src/lib/Support/FoldingSet.cpp, line 203.
0  llc                      0x000000010c6c9fec llvm::sys::PrintStackTrace(llvm::raw_ostream&) + 60
1  llc                      0x000000010c6ca479 PrintStackTraceSignalHandler(void*) + 25
2  llc                      0x000000010c6c7209 llvm::sys::RunSignalHandlers() + 425
3  llc                      0x000000010c6ca6f2 SignalHandler(int) + 354
4  libsystem_platform.dylib 0x00007fff9056b52a _sigtramp + 26
5  libsystem_platform.dylib 0x00007fff6c2fd5c8 _sigtramp + 3688439992
6  libsystem_c.dylib        0x00007fff90aac6df abort + 129
7  libsystem_c.dylib        0x00007fff90a73dd8 basename + 0
8  llc                      0x000000010c5c9f18 GetBucketPtr(void*) + 104
9  llc                      0x000000010c5c9e55 llvm::FoldingSetImpl::RemoveNode(llvm::FoldingSetImpl::Node*) + 181
10 llc                      0x000000010c4277c2 llvm::SelectionDAG::RemoveNodeFromCSEMaps(llvm::SDNode*) + 1874
11 llc                      0x000000010c427a8f llvm::SelectionDAG::DeleteNode(llvm::SDNode*) + 47
12 llc                      0x000000010c1bb7fb (anonymous namespace)::DAGCombiner::deleteAndRecombine(llvm::SDNode*) + 235
13 llc                      0x000000010c1bffb4 (anonymous namespace)::DAGCombiner::visitMERGE_VALUES(llvm::SDNode*) + 212
14 llc                      0x000000010c1bc536 (anonymous namespace)::DAGCombiner::visit(llvm::SDNode*) + 118
15 llc                      0x000000010c1bbf78 (anonymous namespace)::DAGCombiner::combine(llvm::SDNode*) + 56
16 llc                      0x000000010c1bb226 (anonymous namespace)::DAGCombiner::Run(llvm::CombineLevel) + 1286
17 llc                      0x000000010c1baca8 llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level) + 104
18 llc                      0x000000010c4a46db llvm::SelectionDAGISel::CodeGenAndEmitDAG() + 4235
19 llc                      0x000000010c4a3642 llvm::SelectionDAGISel::SelectBasicBlock(llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void>, false, true>, llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void>, false, true>, bool&) + 322
20 llc                      0x000000010c4a3204 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) + 10180
21 llc                      0x000000010c49f44f llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) + 2415
22 llc                      0x000000010abb64db (anonymous namespace)::X86DAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&) + 59
23 llc                      0x000000010b6d8c31 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) + 449
24 llc                      0x000000010bc0535f llvm::FPPassManager::runOnFunction(llvm::Function&) + 399
25 llc                      0x000000010bc05675 llvm::FPPassManager::runOnModule(llvm::Module&) + 117
26 llc                      0x000000010bc0635f (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) + 1967
27 llc                      0x000000010bc05936 llvm::legacy::PassManagerImpl::run(llvm::Module&) + 342
28 llc                      0x000000010bc06f51 llvm::legacy::PassManager::run(llvm::Module&) + 33
29 llc                      0x000000010aab51d2 compileModule(char**, llvm::LLVMContext&) + 21938
30 llc                      0x000000010aaaf8d7 main + 2471
31 libdyld.dylib            0x00007fff946f75ad start + 1
32 libdyld.dylib            0x0000000000000002 start + 1804634710
Stack dump:
0.	Program arguments: llc reload.ll 
1.	Running pass 'Function Pass Manager' on module 'reload.ll'.
2.	Running pass 'X86 DAG->DAG Instruction Selection' on function '@foo'
fish: “llc reload.ll” terminated by signal SIGABRT (Abort)
kavon@cronus:~/m/playground

@kavon
Copy link
Owner Author

kavon commented Jun 13, 2017

Attempts to clear the CSEMap and/or other data structures used for CSE have failed. There are too many assumptions baked into the implementation of CSE during the phases of isel.

One key assumption seems to be that all nodes created exist in one of these data structures, unless if doNotCSE says otherwise. Thus, much of the functionality fails with an error if they are cleared before the block is finished being selected.

@kavon
Copy link
Owner Author

kavon commented Jun 13, 2017

Current solution I'm thinking of is to run a PreISel IR->IR pass. It's the perfect time to do it since that happens after codegen prepare and immediately before isel starts.

Also, there's already infrastructure for this, which can even be target specific:

  • TargetPassConfig::addISelPrepare() -- this function has target-independent ISelPrep passes.
  • TargetPassConfig::addPreISel() -- called by the above function so that targets can insert their own isel prep passes.
  • X86PassConfig::addPreISel() -- an example of a target specific pre-isel pass.

@kavon
Copy link
Owner Author

kavon commented Jun 14, 2017

CSE breaks things again:

# *** IR Dump After Machine Common Subexpression Elimination ***:
# Machine code for function foo: IsSSA, TracksLiveness
Function Live Ins: %RBP in %vreg3

BB#0: derived from LLVM BB %cazq
    Live Ins: %RBP
  %vreg3<def> = COPY %RBP; GR64:%vreg3
  %vreg4<def> = MOV64rm %RIP, 1, %noreg, <ga:@ghczmprim_GHCziTypes_True_closure>[TF=5], %noreg; mem:LD8[GOT] GR64:%vreg4
  %vreg5<def,tied1> = ADD64ri8 %vreg4<tied0>, 2, %EFLAGS<imp-def,dead>; GR64:%vreg5,%vreg4
  ADJCALLSTACKDOWN64 0, 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
  %R13<def> = COPY %vreg5; GR64:%vreg5
  %RBP<def> = COPY %vreg3; GR64:%vreg3
  ADJCALLSTACKUP64 0, 0, %RSP<imp-def>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
  %vreg10<def> = LEA64r %RIP, 1, %noreg, <MCSym=L1337_0>, %noreg; GR64:%vreg10
  MOV64mr %RBP, 1, %noreg, 2, %noreg, %vreg10<kill>; GR64:%vreg10
  TCRETURNdi64 <ga:@bar>, 0, %RSP<imp-use>, %R13<imp-use>, %RBP<imp-use>
    Successors according to CFG: BB#2(?%)

BB#2: EH LANDING PAD
    Live Ins: %R13 %RBP
    Predecessors according to CFG: BB#0
  EH_LABEL <MCSym=L1337_0>
  %vreg7<def> = COPY %RBP<kill>; GR64:%vreg7
  %vreg1<def> = COPY %vreg7; GR64:%vreg1,%vreg7
    Successors according to CFG: BB#1(0x80000000 / 0x80000000 = 100.00%)

BB#1: derived from LLVM BB %cazq.split
    Predecessors according to CFG: BB#2
  %R13<def> = COPY %vreg5; GR64:%vreg5
  %RBP<def> = COPY %vreg1; GR64:%vreg1
  %vreg11<def> = MOV64rm %R13, 1, %noreg, 0, %noreg; GR64:%vreg11
  TCRETURNri64 %vreg11, 0, %R13<imp-use>, %RBP<imp-use>; GR64:%vreg11

# End machine code for function foo.

@kavon
Copy link
Owner Author

kavon commented Jun 20, 2017

The workaround for this is to pass -disable-machine-cse to llc.

The best solution to this is to not mark a retpt as the successor of the block it was split from, because it's not actually a successor. There is no local predecessor within the function for a retpt, but this breaks fundamental assumptions made by LLVM, namely, that all blocks are reachable from the entry node.

@kavon kavon added enhancement and removed bug labels Jun 20, 2017
@kavon kavon added the bug label Aug 29, 2017
@kavon
Copy link
Owner Author

kavon commented Aug 29, 2017

It seems that the prep pass and disabling machine CSE does not completely solve the issue.

Here's why binary-trees fails when -O is run (LLVM opts are only mem2reg globalopt):

_r7IJ_info$def:
## BB#0:                                ## %c7WR
	subq	$24, %rsp
	movq	%r14, %rax
	leaq	-48(%rbp), %rcx
	cmpq	%r15, %rcx
	jb	LBB2_26             # basic heap/stack test
## BB#1:
	movq	_ghczmprim_GHCziTypes_False_closure@GOTPCREL(%rip), %rcx
	incq	%rcx
	movq	%rcx, 16(%rsp)          ## 8-byte Spill
	movq	_ghczmprim_GHCziTypes_True_closure@GOTPCREL(%rip), %rcx
	addq	$2, %rcx
	movq	%rcx, 8(%rsp)           ## 8-byte Spill

# ... several CPS calls later within this function ...

LBB2_18:                                ## %c7Xj
	movq	8(%rbp), %rdi
	movq	32(%rbp), %rsi
	andl	$7, %ebx
	addq	$40, %rbp
	cmpq	$2, %rbx
	jne	LBB2_19
## BB#20:                               ## %c7Xv
	movq	16(%rsp), %rax          ## 8-byte Reload
# .... other stuff ....
LBB2_19:                                ## %c7Xr
	movq	8(%rsp), %rax           ## 8-byte Reload

The LLVM for this is quite different.

c7Xj:
; ...
  switch i64 %ln829, label %c7Xr [i64 1, label %c7Xr
i64 2, label %c7Xv]
c7Xr:
; ...
  %ln82g = ptrtoint i8* @ghczmprim_GHCziTypes_True_closure to i64
  %ln82h = add i64 %ln82g, 2
c7Xv:
; ...
  %ln82o = ptrtoint i8* @ghczmprim_GHCziTypes_False_closure to i64
  %ln82p = add i64 %ln82o, 1

We only reference those globals once each within the function, and LLVM does something stupid by floating both of those global offset computations to the beginning of the function, and "passing" that value via spill/reload from the stack. Why would anyone think this is a good optimization for LLVM to ever perform?

This again goes back to the discussion in #19 where we just make LLVM not do this non-sensical thing.

@kavon kavon added major and removed enhancement labels Aug 29, 2017
@kavon
Copy link
Owner Author

kavon commented Aug 30, 2017

It seems that this is caused by an optimization in llc, since compiling with -O0 causes the computations to not be floated up.

@kavon
Copy link
Owner Author

kavon commented Aug 30, 2017

These constants were hoisted by Machine LICM. The workaround for this case is to use -disable-machine-licm.

It really shouldn't happen at all. It makes the code worse to hoist this. The "loop" here is actually a GC loop I think, which is not hot.

@kavon kavon changed the title Spill/Reload across CPSCall Spill/Reload of Constant across CPSCall Sep 1, 2017
@kavon kavon added the major label Sep 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant