Skip to content

Conversation

@rwestrel
Copy link
Contributor

@rwestrel rwestrel commented May 3, 2024

In the test case:

long i;
for (; i > 0; i--) {
    res += 42 / ((int) i);

The long counted loop phi has type [1..100]. As a consequence, the
ConvL2I also has type [1..100]. The DivI node that follows can't
fault: it is not guarded by a zero check and has no control set.

The ConvL2I is split through phi and so is the DiVI node:
PhaseIdealLoop::cannot_split_division() returns true because the
value coming from the backedge into the DivI (when it is about to be
split thru phi) is the result of the ConvL2I which has type
[1..100] so is not zero as far as the compiler can tell.

On the last iteration of the loop, i is 1. Because the DivI was split
thru Phi, it computes the value for the following iteration, so for i
= 0. This causes a crash when the compiled code runs.

The same problem can't happen with an int counted loop because logic
in PhaseIdealLoop::split_thru_phi() prevents a ConvI2L from being
split thru phi. I propose to fix this the same way: in the test case,
it's not true that once the ConvL2I is split thru phi it keeps type
[1..100]. The fix is fairly conservative because it's base on the
existing logic for ConvI2L: we would want to not split a ConvL2I
only a counted loopd but. I suppose the same is true for the ConvI2L
and I thought it would be best to revisit both together.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8331575: C2: crash when ConvL2I is split thru phi at LongCountedLoop (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/19086/head:pull/19086
$ git checkout pull/19086

Update a local copy of the PR:
$ git checkout pull/19086
$ git pull https://git.openjdk.org/jdk.git pull/19086/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 19086

View PR using the GUI difftool:
$ git pr show -t 19086

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/19086.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented May 3, 2024

👋 Welcome back roland! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented May 3, 2024

@rwestrel This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8331575: C2: crash when ConvL2I is split thru phi at LongCountedLoop

Reviewed-by: chagedorn, epeter

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 158 new commits pushed to the master branch:

  • ab8d7b0: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs
  • fe8a2af: 8307778: com/sun/jdi/cds tests fail with jtreg's Virtual test thread factory
  • 95f79c6: 8332253: Linux arm32 build fails after 8292591
  • b687aa5: 8332176: Refactor ClassListParser::parse()
  • 4083255: 8316138: Add GlobalSign 2 TLS root certificates
  • 43b109b: 8330066: HeapDumpPath and HeapDumpGzipLevel VM options do not mention HeapDumpBeforeFullGC and HeapDumpAfterFullGC
  • 7cff04f: 8330276: Console methods with explicit Locale
  • 8a4315f: 8331987: Enhance stacktrace clarity for CompletableFuture CancellationException
  • 491b3b4: 8332256: Shenandoah: Do not visit heap threads during shutdown
  • 9c02c8d: 8332255: Shenandoah: Remove duplicate definition of init mark closure
  • ... and 148 more: https://git.openjdk.org/jdk/compare/7a41a525deb796396ade1456f1d0a101ac705014...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk
Copy link

openjdk bot commented May 3, 2024

@rwestrel The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label May 3, 2024
@rwestrel rwestrel changed the title xb8331575: C2: crash when ConvL2I is split thru phi at LongCountedLoop 8331575: C2: crash when ConvL2I is split thru phi at LongCountedLoop May 3, 2024
@openjdk openjdk bot added the rfr Pull request is ready for review label May 3, 2024
@mlbridge
Copy link

mlbridge bot commented May 3, 2024

Webrevs

Copy link
Member

@chhagedorn chhagedorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also add the regression tests from the duplicated issue JDK-8298851.

// ConvI2L may have type information on it which is unsafe to push up
if ((n->Opcode() == Op_ConvI2L && n->bottom_type() != TypeLong::LONG) ||
(n->Opcode() == Op_ConvL2I && n->bottom_type() != TypeInt::INT)) {
// ConvI2L/ConvL2I may have type information on it which is unsafe to push up
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good and we should probably move forward with that.

But I'm still wondering though, if these bailouts are really needed in the general case. It seems like this problem is mainly for loop phis. Couldn't we check the types of loop phi inputs and bail out if one includes zero? IIUC, the backedge should be an AddL with type [0..99], i.e. post-decremented. So, pushing through seems wrong in this case since the backedge type includes zero. But it could be detected and prevented. However, if the phi has type [5..100], for example, then it should be safe. We probably then just need to update the type of the pushed-through ConvL2I to whatever the type of the input is.

This type checking approach could work in the general case. But I'm not sure though, if it's beneficial to split these Conv nodes through phis in general. But it seems the bailouts have only been introduced due to correctness bugs and not due to performance reasons. Anyway, this should be investigated separately, including benchmarking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I'm still wondering though, if these bailouts are really needed in the general case. It seems like this problem is mainly for loop phis. Couldn't we check the types of loop phi inputs and bail out if one includes zero?

Are we sure divisions are the only cause of bugs? My understanding of this issue is that once pushed thru phi, the type of the ConvL2I is simply not correct and that's the root cause. I wonder if we could get other failures because of this: maybe a node becoming top because of the incorrect type or an out of bound array access.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure divisions are the only cause of bugs?

Not 100% sure. But the only cases I've observed so far are with division/mod where they float above and end up being executed too early (the result is never actually observed, though).

that once pushed thru phi, the type of the ConvL2I is simply not correct and that's the root cause.

Yes, that's my understanding, too. But since the AddL input into the loop iv phi contains zero, it raised the question if we could actually detect that and do our decision based on whether the input contains zero instead of simply disabling pushing ConvL2I (and ConvI2L) nodes through phis entirely.

It also seems that it's only a problem with loop iv phis because we improve the iv type in such a way that some of the possible values of the backedge are excluded. So, maybe a first step could be to allow splitting the Conv* nodes through non-loop-iv phi nodes. However, there might also be other non-loop-iv phi problems I'm currently not aware of. Nevertheless, it might be worth to investigate further in a separate RFE.

I wonder if we could get other failures because of this: maybe a node becoming top because of the incorrect type or an out of bound array access.

Could very well be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also seems that it's only a problem with loop iv phis because we improve the iv type in such a way that some of the possible values of the backedge are excluded. So, maybe a first step could be to allow splitting the Conv* nodes through non-loop-iv phi nodes. However, there might also be other non-loop-iv phi problems I'm currently not aware of. Nevertheless, it might be worth to investigate further in a separate RFE.

I agree that it would be worth investigating further.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 6, 2024
Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable.


I guess the issue is that ConvL2I and ConvI2L are also type nodes, which can restrict their type, just like CastII nodes. And that restricting of the type is only true under a certain if-branch.

But if the ConvI2L were not a type-node, then it would not restrict type, and you could simply push it through phis. Right?

Why do we have type restriction mixed into ConvI2L? Could that not be separated out into a CastII / CastLL?

Maybe we could generally separate ConvI2L, type restriction, and pinning? CastII also does multiple things, and it has hurt us many times in the past. Would this sort of maximal separation and specialization not be more "see of nodes" style?

Anyway, this would be interesting to look into for a future RFE.

* @run main/othervm -XX:-BackgroundCompilation -XX:-TieredCompilation -XX:-UseOnStackReplacement
* -XX:+StressGCM -XX:StressSeed=92643864 TestLongCountedLoopConvL2I
* @run main/othervm -XX:-BackgroundCompilation -XX:-TieredCompilation -XX:-UseOnStackReplacement
* -XX:+StressGCM TestLongCountedLoopConvL2I
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have a run that allows OSR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also add -XX:+UnlockDiagnosticVMOptions for the stress flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in new commit.

@rwestrel
Copy link
Contributor Author

You could also add the regression tests from the duplicated issue JDK-8298851.

I added one of them because it doesn't seem to need StressGCM. Does it really make sense to add all of them?

@rwestrel
Copy link
Contributor Author

I guess the issue is that ConvL2I and ConvI2L are also type nodes, which can restrict their type, just like CastII nodes. And that restricting of the type is only true under a certain if-branch.

That's not entirely true here. The ConvL2I captures the type of its input so not a narrower type. The problem is that the type is that of a Phi for a counted loop and once pushed through phi, the type captured by the ConvI2L becomes incorrect.

@eme64
Copy link
Contributor

eme64 commented May 13, 2024

I guess the issue is that ConvL2I and ConvI2L are also type nodes, which can restrict their type, just like CastII nodes. And that restricting of the type is only true under a certain if-branch.

That's not entirely true here. The ConvL2I captures the type of its input so not a narrower type. The problem is that the type is that of a Phi for a counted loop and once pushed through phi, the type captured by the ConvI2L becomes incorrect.

So what exactly is it that guarantees the correctness of the phi range under the counted loop that is not true when you push it back? I mean I would assume the phi can only have values that its inputs actually produce, so its inputs cannot have wildly different ranges, right? At some point, this range must be established by some control flow, at which point we can do the "type restriction".

I would now have to dive into the code and debug if the "type restriction" for counted loop phi happens purely because of the input values, or because of explicitly restrincting the type of the ConvI2L. But I do see that there is some new ConvI2LNode(input, type) cases where we do restrict the type of a ConvI2L.

@rwestrel
Copy link
Contributor Author

Before split if:

long i = 100;
for (; i > 0;) {
  // i here is 1..100
  int j = (int)i; // ConvL2I type is 1..100, same as loop phi
  int k = 42 / j;
   i--;
}

after split if:

long i = 100;
int j = 100;
int k = 0;
for (; i > 0;) {
  // i here is 1..100
   i--;
  // i here is 0..99
  j = (int)i; // ConvL2I type is still 1..100 which is not correct
  k = 42 / j;
}

@eme64
Copy link
Contributor

eme64 commented May 14, 2024

@rwestrel which "split_if" optimization was applied in your example? Split the ConvI2L through the phi? If so, the problem seems to be that the ConvI2L floats by the exit-check, right?

after split if:

long i = 100;
int j = 100;
int k = 0;
for (; i > 0;) {
  // i here is 1..100
   i--;
  // i here is 0..99
  exit check
  // i here is 1..99
  j = (int)i; // ConvL2I type is still 1..100 which is not correct
  k = 42 / j;
}

I guess the issue is that the ConvL2I was somehow pinned inside the loop, after the CountedLoop, by the phi. But when the ConvL2I is split into the backedge, it does not stay in the backedge but floats further, passes by the exit-check and goes into the last iteration -> BOOM.

How exactly did we narrow the type to 1...100? I guess that that is some smart logic in the trip count Phi node, right? If instead we had a CastLL for the exit check that narrows the type, then the CastLL would remain after the split-if, and the split ConvL2I could not float from the backedge into the loop body of the last iteration.

So I guess that is really a limitation: a trip count Phi specifically does the narrowing, and so you cannot just split past it. The question is if that is really nice, or if we could do it differently, e.g. via a CastLL/CastII on the exit-check?

@rwestrel
Copy link
Contributor Author

@rwestrel which "split_if" optimization was applied in your example? Split the ConvI2L through the phi? If so, the problem seems to be that the ConvI2L floats by the exit-check, right?

Yes.

So I guess that is really a limitation: a trip count Phi specifically does the narrowing, and so you cannot just split past it. The question is if that is really nice, or if we could do it differently, e.g. via a CastLL/CastII on the exit-check?

The issue involves conv nodes when split thru phi at a counted loop. That's a narrow corner case. I think fixing it by addressing the corner case where it occurs as proposed is simpler than trying a most general fix which can have hard to anticipate consequences.

@eme64
Copy link
Contributor

eme64 commented May 14, 2024

@rwestrel Yes, I'm totally fine with the fix. It simply applies the int case to long.

In a future RFE, we could at least restrict the "bailout" to trip-count Phi's, and not all Phi's.

In even further RFE's, we could consider doing the type narrowing not in the trip-count phi, but via casts at the checks. That would be a more unified solution. Generally, I feel like we are struggling way too much with all the different ways one can pin and narrow types: it is all mixed into trip-count phi's, Cast's, Conv's etc. Who really can understand all the complicated interactions? It seem we keep piling on special-case logic, but it is a endless whack-a-mole game. Every fix is "simple" but the sum of all those fixes is far from "simple" ;)

Copy link
Member

@chhagedorn chhagedorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good, thanks for adding the test!

@rwestrel
Copy link
Contributor Author

FTR, I double checked that fuzzer test failures from JDK-8298851 are indeed the same issue and are fixed with this.

@rwestrel
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented May 16, 2024

Going to push as commit f398cd2.
Since your change was applied there have been 160 commits pushed to the master branch:

  • 96c5c3f: 8329998: Remove double initialization for parts of small TypeArrays in ZObjArrayAllocator
  • ee4a9d3: 8321622: ClassFile.verify(byte[] bytes) throws unexpected ConstantPoolException, IAE
  • ab8d7b0: 8324517: C2: crash in compiled code because of dependency on removed range check CastIIs
  • fe8a2af: 8307778: com/sun/jdi/cds tests fail with jtreg's Virtual test thread factory
  • 95f79c6: 8332253: Linux arm32 build fails after 8292591
  • b687aa5: 8332176: Refactor ClassListParser::parse()
  • 4083255: 8316138: Add GlobalSign 2 TLS root certificates
  • 43b109b: 8330066: HeapDumpPath and HeapDumpGzipLevel VM options do not mention HeapDumpBeforeFullGC and HeapDumpAfterFullGC
  • 7cff04f: 8330276: Console methods with explicit Locale
  • 8a4315f: 8331987: Enhance stacktrace clarity for CompletableFuture CancellationException
  • ... and 150 more: https://git.openjdk.org/jdk/compare/7a41a525deb796396ade1456f1d0a101ac705014...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label May 16, 2024
@openjdk openjdk bot closed this May 16, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 16, 2024
@openjdk
Copy link

openjdk bot commented May 16, 2024

@rwestrel Pushed as commit f398cd2.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

3 participants