Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BOLT] Fix local out-of-range stub issue that leads to infinite loop in LongJmp pass #73918

Merged

Conversation

linsinan1995
Copy link
Member

@linsinan1995 linsinan1995 commented Nov 30, 2023

If a local stub is out-of-range, at LongJmp we will try to find another local stub first. However, there seems to be a problem with the code logic which leads to an infinite loop at LongJmp pass in my workload.

https://github.com/llvm/llvm-project/blob/main/bolt/lib/Passes/LongJmp.cpp#L203-L209

  } else if (LocalStubsIter != Stubs.end() &&
             LocalStubsIter->second.count(TgtBB)) {
    // If we are replacing a local stub (because it is now out of range),
    // use its target instead of creating a stub to jump to another stub
    TgtSym = BC.MIB->getTargetSymbol(*TgtBB->begin()); // (1)
    TgtBB = BB.getSuccessor(TgtSym, BI);                              // (2)
  }

TgtSym now is the target of the local stub (statement 1), and thus it is not a successor of BB (statement 2), and thus TgtBB will be set to nullptr.

After this patch, we first convert the target of BB back to the target of the local stub, and then look up for other valid local stubs... I tested it on my workload, and it works fine after this change.

@linsinan1995
Copy link
Member Author

More detail for the infinite loop:

BB has a local stub .LStub1111 that is out of range, and the execution path entered the problematic code piece.

.LFT2057:
    00019e80: 	ldr	x0, [x25]
    00019e84: 	mov	x1, x24
    00019e88: 	ldr	x8, [x0]
    00019e8c: 	ldr	x8, [x8, #0x10]
    00019e90: 	blr	x8 # handler: 0; action: 0
    00019e94: 	tbz	w0, #0x0, .LStub1111
    00019e98: 	b	.Ltmp22
preds: .LStub1110
succs: .LStub1111 .Ltmp22

After the problematic code, TgtSym and TgtBBwere changed from both .LStub1111 into .Ltmp2459 and null. Then a new local stub .LStub1370 was created to replace the out-of-range stub.

TgtSym .LStub1111 => .Ltmp2459
TgtBB  .LStub1111 => null

** createNewStub .LStub1370 **

.LFT2057:
    00019e80: 	ldr	x0, [x25]
    00019e84: 	mov	x1, x24
    00019e88: 	ldr	x8, [x0]
    00019e8c: 	ldr	x8, [x8, #0x10]
    00019e90: 	blr	x8 # handler: 0; action: 0
    00019e94: 	tbz	w0, #0x0, .LStub1370
    00019e98: 	b	.Ltmp22
preds: .LStub1110
succs: .LStub1111 .Ltmp22

likely because of we did not remove the relation between BB .LFT2057 and .LStub1111, so the target of branch instruction was changed back to .LStub1111 via BinaryFunction::fixBranches

  During fixBranch
    00000000: 	tbz	w0, #0x0, .LStub1370
                  =>
    00000000: 	tbz	w0, #0x0, .LStub1111

Then Bolt was stuck in the loop of back-and-forth conversions between .LStub1111 and .LStub1370 .

TgtSym = BC.MIB->getTargetSymbol(*TgtBB->begin());
TgtBB = BB.getSuccessor(TgtSym, BI);
BB.replaceSuccessor(TgtBB, TgtBB->getSuccessor(TgtSym, BI), BI.Count,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TgtBB->getSuccessor(TgtSym, BI) might return NULL, since the veneer/stub "successor" might be the other function (e.g. function call was replaced by call to the veneer and jump from it). replaceSuccessor would fail in this case. So replaceSuccessor logic and further execution count calculations must be call under "TgtBB->getSuccessor(TgtSym, BI)" condition. The last getsuccessor might be be called outside this condition, since NULL is expected return in such case.

// local stub. We replace with the target of the local stub instead
// of creating a stub to jump to another stub.
// e.g.
// change the out-of-range stub
Copy link
Member

@yota9 yota9 Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably just text would be more useful here, just add smth about that we need to also return BB's successors to previous state i.e. set it to the target of stub in case it was a local stub.

@yota9
Copy link
Member

yota9 commented Dec 4, 2023

Thanks for the patch @linsinan1995 ! I leaved a couple of comments above, I think there might be a problem with this patch and it needs a small adjustment.

@linsinan1995 linsinan1995 force-pushed the fix-longjmp-out-of-range-local-stub branch from 06f4b4d to cd5c4f4 Compare December 4, 2023 12:53
@linsinan1995
Copy link
Member Author

Hi @yota9

Thank you for pointing out the issue. An update has been made.

Copy link
Member

@yota9 yota9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Thanks for the fix! Just fix a small nit please.
Let's wait 1 more day, maybe META team has some more comments, then fill free to commit :)

"At least equal or greater than the branch count.");
TgtBB->setExecutionCount(TgtBB->getExecutionCount() - BI.Count);
}
TgtBB = TgtBBSucc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please add new line after }

@aaupov
Copy link
Contributor

aaupov commented Dec 5, 2023

Can you please retitle to an imperative statement containing what you're changing? Your branch name fix-longjmp-out-of-range-local-stub is a good title.
Please also add a test.

TgtSym = BC.MIB->getTargetSymbol(*TgtBB->begin());
TgtBB = BB.getSuccessor(TgtSym, BI);
auto *TgtBBSucc = TgtBB->getSuccessor(TgtSym, BI);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also replace auto with BinaryBasicBlock please, we prefer not to use auto.

@linsinan1995 linsinan1995 changed the title [BOLT] a local out-of-range stub might lead to infinite loop in LongJmp [BOLT] Fix local out-of-range stub issue that leads to infinite loop in LongJmp pass Dec 5, 2023
@linsinan1995 linsinan1995 force-pushed the fix-longjmp-out-of-range-local-stub branch from cd5c4f4 to f03a10d Compare December 5, 2023 07:16
@linsinan1995
Copy link
Member Author

Can you please retitle to an imperative statement containing what you're changing? Your branch name fix-longjmp-out-of-range-local-stub is a good title. Please also add a test.

Hi @aaupov . it is a bit tricky to build a case that has a local out-of-range stub. any suggestion?

@yota9
Copy link
Member

yota9 commented Dec 5, 2023

@aaupov We don't have a tests for LongJmp usually. This case is quite hard to reproduce. I think we should continue without it for now..

Copy link
Contributor

@rafaelauler rafaelauler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this fix, @linsinan1995. The only thing I would do differently is to add an assert that TgtSym is not null after line 208 "TgtSym = ....getTargetSymbol(*TgtBB->begin())", so we are not taken by surprise if we change the contents of a stub and all of a sudden the first instruction is not the branch anymore, in which case we will silently return nullptr to TgtSym and break LongJmp. There are a lot of assumptions in that line, which, admittedly, is part of the original code and not your code.

@linsinan1995 linsinan1995 force-pushed the fix-longjmp-out-of-range-local-stub branch from f03a10d to b5b8c8e Compare December 6, 2023 09:30
@linsinan1995
Copy link
Member Author

Thanks a lot for this fix, @linsinan1995. The only thing I would do differently is to add an assert that TgtSym is not null after line 208 "TgtSym = ....getTargetSymbol(*TgtBB->begin())", so we are not taken by surprise if we change the contents of a stub and all of a sudden the first instruction is not the branch anymore, in which case we will silently return nullptr to TgtSym and break LongJmp. There are a lot of assumptions in that line, which, admittedly, is part of the original code and not your code.

Added. Investigating bugs related to LongJmp is quite troublesome. Adding an assertion to prevent the continuation of the subsequent logic when TgtSym is null is indeed very helpful for debugging. Thanks for the suggestion.

TgtSym = BC.MIB->getTargetSymbol(*TgtBB->begin());
TgtBB = BB.getSuccessor(TgtSym, BI);
assert(TgtSym && "First instruction is expected to be a branch.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first instruction doesn't have to be a branch. It is probably would be ADRP one for stub. More proper would be "Expected first instruction to contain a target symbol" or smth like that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Changed.

@linsinan1995 linsinan1995 force-pushed the fix-longjmp-out-of-range-local-stub branch from b5b8c8e to f5b3bd9 Compare December 6, 2023 10:44
If a local stub is out-of-range, at LongJmp we will try to find another
local stub first. However, The original implementation do not work as
expected and it leads to an infinite loop between replaceTargetWithStub
and fixBranches.

After this patch, we first convert the target of BB back to the target
of the local stub, and then look up for other valid local stubs and so
on.
@linsinan1995 linsinan1995 force-pushed the fix-longjmp-out-of-range-local-stub branch from f5b3bd9 to e3f587f Compare December 6, 2023 10:45
@linsinan1995 linsinan1995 merged commit fdb13cf into llvm:main Dec 11, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants