-
Notifications
You must be signed in to change notification settings - Fork 6.3k
8364757: Missing Store nodes caused by bad wiring in PhaseIdealLoop::insert_post_loop #27225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Clean up and add comments Move the store rewiring out of the phi loop, use loop->tail() instead Add comments in src
Add two more test cases Rename test and improve headers Remove useless space in test
👋 Welcome back bmaillard! A progress list of the required criteria for merging this PR into |
@benoitmaillard This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 606 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@eme64, @mhaessig, @rwestrel) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
@benoitmaillard The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
PhaseIdealLoop::insert_post_loop
PhaseIdealLoop::insert_post_loop
Webrevs
|
Not a review but a comment on the missing Phis. Your description makes it sound like if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @benoitmaillard !
And thanks for all the explanations.
It seems the missing Phi at the OuterStripMinedLoop are a decision that implies that Stores will just sort of "hang" between loop exit and SafePoint. That is now the new "invariant". Fine for now, but we may want to reconsider adding the Phi for the OuterStripMinedLoop eventually.
I have read through the PR, and was a little confused about names, so bear with my comments 😅
On the algo level I was wondering if it is possible to have a chain of stores between the exit and SafePoint? Do you have such examples?
src/hotspot/share/opto/loopnode.hpp
Outdated
// Find the last memory node in the loop when following memory usages | ||
Node *find_mem_out_outer_strip_mined(Node* store, IdealLoopTree* outer_loop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of the method is a bit confusing. And the comment seems to suggest something different than what the code says.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name was really bad indeed, sorry for that. I have renamed it to find_last_store_in_outer_loop
, and added a comment to explain why we have the guarantee of a linear graph here.
Node* PhaseIdealLoop::find_mem_out_outer_strip_mined(Node* store, IdealLoopTree* outer_loop) { | ||
Node* out = store; | ||
// Follow the memory uses until we get out of the loop | ||
while (true) { | ||
Node* unique_next = nullptr; | ||
for (DUIterator_Fast imax, l = out->fast_outs(imax); l < imax; l++) { | ||
Node* next = out->fast_out(l); | ||
if (next->is_Mem() && next->in(MemNode::Memory) == out) { | ||
IdealLoopTree* output_loop = get_loop(get_ctrl(next)); | ||
if (outer_loop->is_member(output_loop)) { | ||
assert(unique_next == nullptr, "memory node should only have one usage in the loop body"); | ||
unique_next = next; | ||
} | ||
} | ||
} | ||
if (unique_next == nullptr) { | ||
break; | ||
} | ||
out = unique_next; | ||
} | ||
return out; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note from later me: I was quite confused here. I thought this was going to be some general function that should handle all sorts of memory flow in the loop, but that is not the case. I'll leave all my comments here just to show you what I as the reader thought when reading it ;)
Below, in a code comment you say that this method does:
Find the last memory node in the loop when following memory usages
What happens here if we hit an if-diamond (or more complicated), where there can be multiple memory uses, that are then merged again by a memory phi?
store
|
+--------+
| |
store store
| |
+---+ +--+
| |
phi
|
store -> the last one in the loop
I wonder if this is somehow possible. There are surely some IGVN optimizations that would common the stores here, and so the graph would probably have to be even more complicated. But I'm simply wondering if it could be possible that we would have branches / phis in the memory graph. Or what guarantees us that the graph is really linear here?
I'm also not sure how to parse the method name:
find_mem_out_outer_strip_mined
- find "mem out" outer-strip-mined <loop?>
- find mem outside of outer-strip-mined loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we would trigger your assert if we found a branch:
assert(unique_next == nullptr, "memory node should only have one usage in the loop body");
Now we usually only do pre-main-post for relatively small loop bodies, see LoopUnrollLimit
. But I wonder if we ever decided to increase this limit, would we then encounter such more complicated memory graphs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I have been misled by the names / comments.
You are really looking for the last store in the outer_loop
. And we do have the guarantee of a linear memory graph because it is the one between if_false
and SafePoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better method name would help a lot ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens here if we hit an if-diamond (or more complicated), where there can be multiple memory uses, that are then merged again by a memory phi?
This actually cannot happen because of the conditions in PhaseIdealLoop::try_move_store_after_loop
. There, before moving the store, we make sure that any user of the store is either:
- the
Phi
node attached to the loop head - outside of the loop body
This means we cannot have any branch (though we can have chains), and it guarantees that the memory subgraph is linear within the loop body.
Now we usually only do pre-main-post for relatively small loop bodies, see LoopUnrollLimit. But I wonder if we ever decided to increase this limit, would we then encounter such more complicated memory graphs?
I think the answer is the same here, it really depends on PhaseIdealLoop::try_move_store_after_loop
.
Node* next = out->fast_out(l); | ||
if (next->is_Mem() && next->in(MemNode::Memory) == out) { | ||
IdealLoopTree* output_loop = get_loop(get_ctrl(next)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep the names for next
and output_loop
consistent. Maybe next_loop
? Or just call them use
and use_loop
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I have changed it to use
and use_loop
.
for (DUIterator j = if_false->outs(); if_false->has_out(j); j++) { | ||
Node* store = if_false->out(j)->isa_Store(); | ||
// We don't make changes if the memory input is in the loop body as well | ||
if (store && !outer_loop->is_member(get_loop(get_ctrl(store->in(MemNode::Memory))))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (store && !outer_loop->is_member(get_loop(get_ctrl(store->in(MemNode::Memory))))) { | |
if (store != nullptr && !outer_loop->is_member(get_loop(get_ctrl(store->in(MemNode::Memory))))) { |
No implicit null or zero checks, see hotspot style guide ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop nesting check looks a bit convoluted. Consider refactoring a little. Could you get rid of the !
by swapping things around?
get_loop(get_ctrl(store->in(MemNode::Memory))))->is_member(outer_loop)
Does not look that much better either... hmm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No implicit null or zero checks, see hotspot style guide ;)
Missed that, thanks for the reminder!
The loop nesting check looks a bit convoluted. Consider refactoring a little. Could you get rid of the ! by swapping things around?
I personally think it looks more intuitive with the !
, but I agree it is a bit convoluted. I have added an intermediate variable to make it more readable.
const Node* if_false = loop->tail()->in(0)->as_BaseCountedLoopEnd()->proj_out(false); | ||
for (DUIterator j = if_false->outs(); if_false->has_out(j); j++) { | ||
Node* store = if_false->out(j)->isa_Store(); | ||
// We don't make changes if the memory input is in the loop body as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? I suppose that is because there must be a Phi in the loop then, right? Maybe state that in the comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a memory input that is outside of the loop body is the situation where we would normally expect a Phi
, and this is where we would like to intervene.
If the memory input is in the loop body as well, we can safely assume it is still correct as the whole body get cloned as a unit.
I have updated the comment, I hope it is clearer now.
Node* mem_out = find_mem_out_outer_strip_mined(store, outer_loop); | ||
Node* store_new = old_new[store->_idx]; | ||
store_new->set_req(MemNode::Memory, mem_out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be that there are multiple stores in a chain after the loop exit and before the SafePoint?
Loop
Exit
store1
store2
store3
SafePoint
If so, they all have the same control, namely at the if_false
.
Their memory state should be ordered, where store2 depends on store1 and store3 on store2. Only store1 should then really have its memory input updated.
Your code now finds the store_new
for each of store1, store2 and store3, and sets all of their memory inputs to mem_out
. But that means that the "new" stores all have the same memory input, and are not in a chain any more. Did I see this right? Is that ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this can happen. This is actually what we test with the last test case (test3
), and this is why we have the following:
// We don't make changes if the memory input is in the loop body as well
if (store && !outer_loop->is_member(get_loop(get_ctrl(store->in(MemNode::Memory))))) {
In this case, the condition is only true for store1
(as its memory input would be last memory operation before the loop, or the memory Parm
), but not for store2
nor store3
. We would only end up rewiring store1
, and leave store2
and store3
as they are.
Does that make sense?
static public void test3(A a1, A a2) { | ||
a1.field = 0; | ||
a2.field = 0; | ||
for (int i = 0; i < 20000; i++) { | ||
a1.field += i; | ||
a2.field += i; | ||
} | ||
a1.field = 0; | ||
a2.field = 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the field stores both float out of the loop, and end up in a chain between exit and safepoint? Might be nice to add some comments to these tests so we can see what examples you already cover and if we might need some more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the entire chain floats out of the loop (each store is moved successively). I have added some comments about the structure that we are trying expose, and changed the test slightly as well.
@eme64 Thanks a lot for your detailed comments, this is really helpful. I have tried to address all of them, let me you what you think once you get the chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this and for the clear analysis of this tricky issue, @benoitmaillard!
Your solution seems good, but I have a few coding suggestions below.
…and make sure we return a store
Co-authored-by: Manuel Hässig <manuel@haessig.org>
Thank you for the review @mhaessig. I have adressed your comments, let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than that, looks good to me.
// is in the loop body as well, then we can safely assume it is still correct as the entire | ||
// body was cloned as a unit | ||
IdealLoopTree* input_loop = get_loop(get_ctrl(store->in(MemNode::Memory))); | ||
if (!outer_loop->is_member(input_loop)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Actually, I wonder if a new method that also does the get_ctrl()
(or ctrl_or_self()
), wouldn't be useful given that pattern must be quite common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. It seems there are a lot of occurrences indeed, maybe I should address this in a separate RFE. Btw, it seems we could also change the return type of PhaseIdealLoop::is_member
from int
to bool
, to stay consistent with IdealLoopTree::is_member
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, no reason for a return type of int
. Sure, a separate RFE works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have filed JDK-8369002.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing my comments. Looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the change. Looks good to me.
Thank you for your review @rwestrel! And apologies for not replying to your comment earlier, I saw it right before leaving on vacation and then forgot. I agree with what you said, and I may have overlooked that aspect while writing my explanation. Thanks for clearing that out. |
/integrate |
@benoitmaillard |
/sponsor |
Going to push as commit 7231916.
Your commit was automatically rebased without conflicts. |
@mhaessig @benoitmaillard Pushed as commit 7231916. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
This PR introduces a fix for wrong results caused by missing
Store
nodes in C2 IR due to incorrect wiring inPhaseIdealLoop::insert_post_loop
.Context
The issue was initially found by the fuzzer. After some trial and error, and with the help of @chhagedorn I was able to reduce the reproducer to something very simple. After being compiled by C2, the execution of the following method led to the last statement (
x = 0
) to be ignored:After some investigation and discussions with @robcasloz and @chhagedorn, it appeared that this issue is linked to how safepoints are inserted into long running loops, causing the loop to be transformed into a nested loop with an
OuterStripMinedLoop
node.Store
node are moved out of the inner loop when encountering this pattern, and the associatedPhi
nodes are removed in order to avoid inhibiting loop optimizations taking place later. This was initially adressed in JDK-8356708 by making the necessary corrections in macro expansion. As explained in the next section, this is not enough here as macro expansion happens too late.This PR aims at addressing the specific case of the wrong wiring of
Store
nodes in post loops, but on the longer term further investigations into the missingPhi
node issue are necessary, as they are likely to cause other issues (cf. related JBS issues).Detailed Analysis
In
PhaseIdealLoop::create_outer_strip_mined_loop
, a simpleCountedLoop
is turned into a nested loop with anOuterStripMinedLoop
. The body of the initial loop remains in the inner loop, but the safepoint is moved to the outer loop. Later, we attempt to moveStore
nodes after the inner loop inPhaseIdealLoop::try_move_store_after_loop
. When theStore
node is moved to the outer loop, we also get rid of its inputPhi
node in order not to confuse loop optimizations happening later.This only becomes a problem in
PhaseIdealLoop::insert_post_loop
, where we clone the body of the inner/outer loop for the iterations remaining after unrolling. There, we usePhi
nodes to do the necessary rewiring between the original body and the cloned one. Because we do not havePhi
nodes for the movedStore
nodes, their memory inputs may end up being incorrect.This is what the IR looks like after the creation of the post loop in our reproducer:
On the screenshot, node
118 StoreI
takes directly24 StoreI
as memory input, even though it is obvious that96 CountedLoopEnd
(to which73 NodeI
is attached) is a predecessor of114 CountedLoopEnd
in the CFG.After that, we observe a succession of IGVN optimizations that eventually lead to the generation of wrong code:
IfFalse
projection of128 If
becomes dead, as the the post loop is always executed (number of iterations is known)121 Region
and123 Phi
are subsequently eliminated (as a result of the dead path)Phi
disappeared,118 StoreI
becomes the memory input of89 StoreI
118 StoreI
is eliminated because it is directly followed by a write at the same memory location89 StoreI
is replaced by24 StoreI
as anIdentity
optimizations because it is stores the same value at the same locationNode
89 StoreI
corresponds to the lastx = 0
assignment, and its elimination directly causes the wrong result (the store node from theOuterStripMinedLoop
remains, as it is used by the safepoint).Proposed Fix
As mentioned previously, the impact of the missing
Phi
nodes need to be investigated further, as it it likely that this causes other bugs in the compilation process. This is a "local fix" for the specific issue ofStore
nodes moved out of the inner loop.The approach here is to do the wiring directly in
PhaseIdealLoop::insert_post_loop
, right after having done the usual rewiring based on thePhi
nodes. As the conditions for movingStore
nodes out of the loop are quite restrictive, the pattern is predictable:Store
nodes are attached to thefalse
projection of the innerCountedLoopEnd
, right before the safepoint in the CFG.In the simplest case, the memory input of new version of the store node is outside of the loop body. In the cloned node, we change it to point to its original version instead (as the original store is always executed before).
It may also be that the memory input of the new node points to another memory node in the loop body. This can happen in the case where we have:
Here, the second store has the first one as memory input, as
a1
anda2
may be aliases. In this case, we only need to change the memory input of the first store in the chain, and it needs to point to the last memory node in the chain in the original version of the loop.Testing
Thank you for reviewing!
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27225/head:pull/27225
$ git checkout pull/27225
Update a local copy of the PR:
$ git checkout pull/27225
$ git pull https://git.openjdk.org/jdk.git pull/27225/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 27225
View PR using the GUI difftool:
$ git pr show -t 27225
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27225.diff
Using Webrev
Link to Webrev Comment