8325469: Freeze/Thaw code can crash in the presence of OSR frames #18637

Closed
wants to merge 7 commits into from

Contributor

@pchilano pchilano commented Apr 4, 2024

Freeze/thaw code assumes that a compiled frame for a method where num_stack_arg_slots() > 0 will always have the arguments set up above the metadata at the bottom of the frame. But when converting an interpreter frame to a compiled frame during OSR, we don't explicitly leave room for the stack arguments after popping the interpreter frame. All parameters needed are read from the "buf" array and stored inside the frame before calling OSR_migration_end().

This mismatch in how the stack looks and what we assume can lead to different crashes. In particular the issue happens when the OSR conversion happens for the bottom-most frame in the stack. If the OSR frame has a caller in the stack then there is no issue on freezing/thawing. I added more details about this in the bug comments.

When the OSR conversion happens for the bottom-most frame, a later freeze/thaw can crash in all cases: freeze_fast/thaw_fast, freeze_fast/thaw_slow, freeze_slow/thaw_slow. When freezing fast, thawing either fast or slow can lead to reading past the bottom of the stackChunk or writing below the allocated space in the stack. The freeze slow case is almost okay, except that it uncovered an invalid assert that is triggered if the size of the OSR frame plus all the other frames we freeze takes less space than the size of locals minus parameters of the interpreter frame that underwent OSR. I also added more details about these in the bug comments.

I tested different fixes, but I think the most straightforward one is to add _num_stack_arg_slots in the nmethod class and initialize it accordingly depending on whether the nmethod is an OSR one or not.
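
As a rough illustration of that choice, here is a toy Java model (not HotSpot code). The class name `NmethodModel`, its fields, and the value `-1` used for the entry marker are stand-ins chosen for this sketch; only the conditional follows the idea of initializing the cached value differently for OSR and non-OSR nmethods.

```java
// Toy model only: the real change lives in HotSpot's C++ nmethod class.
class NmethodModel {
    // Stand-in for HotSpot's InvocationEntryBci (marks a normal, non-OSR entry).
    static final int INVOCATION_ENTRY_BCI = -1;

    final int entryBci;
    final int numStackArgSlots;

    NmethodModel(int entryBci, int methodStackArgSlots) {
        this.entryBci = entryBci;
        // An OSR nmethod gets its state from the OSR buffer, so this model
        // assumes no stack argument area for it.
        this.numStackArgSlots =
            (entryBci != INVOCATION_ENTRY_BCI) ? 0 : methodStackArgSlots;
    }

    boolean isOsr() {
        return entryBci != INVOCATION_ENTRY_BCI;
    }

    public static void main(String[] args) {
        NmethodModel normal = new NmethodModel(INVOCATION_ENTRY_BCI, 4);
        NmethodModel osr = new NmethodModel(17, 4); // 17 is an arbitrary loop bci
        System.out.println("normal entry slots: " + normal.numStackArgSlots);
        System.out.println("OSR entry slots:    " + osr.numStackArgSlots);
    }
}
```

The point of the model is that the value belongs to the compiled code rather than to the method: two nmethods for the same method can report different numbers of stack argument slots.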

The patch includes a new test that exercises all these possible combinations: OSR frame at the bottom of the stack or not, freezing fast/slow, and thawing fast/slow. The bottom case where we freeze fast and thaw slow reproduces the originally reported crash. There are actually two different failure modes depending on whether this is a thaw top or return barrier case. The other bottom cases lead to the other crashes described in the bug comments.
The new test uncovers another bug besides the OSR issues, but since it's a different one I filed a separate JBS issue (JDK-8329665) and made this a dependent PR.
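
For orientation, the scenario space can be enumerated with a small sketch. It is not the test added by the patch: the names `FreezeThawMatrix`, `Mode`, and `Scenario` are hypothetical, and the three freeze/thaw pairs are simply the ones named in this description (whether the real test covers other pairs is not stated here).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enumeration of the scenarios described above; not the real test.
class FreezeThawMatrix {
    // The three freeze/thaw pairs named in the PR description.
    enum Mode { FREEZE_FAST_THAW_FAST, FREEZE_FAST_THAW_SLOW, FREEZE_SLOW_THAW_SLOW }

    record Scenario(boolean osrAtBottom, Mode mode) {
        // Per the description, this combination reproduces the originally reported crash.
        boolean reproducesOriginalCrash() {
            return osrAtBottom && mode == Mode.FREEZE_FAST_THAW_SLOW;
        }
    }

    static List<Scenario> allScenarios() {
        List<Scenario> scenarios = new ArrayList<>();
        for (boolean osrAtBottom : new boolean[] {true, false}) {
            for (Mode mode : Mode.values()) {
                scenarios.add(new Scenario(osrAtBottom, mode));
            }
        }
        return scenarios;
    }

    public static void main(String[] args) {
        for (Scenario s : allScenarios()) {
            String note = s.reproducesOriginalCrash() ? "  <- originally reported crash" : "";
            System.out.println(s + note);
        }
    }
}
```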

I tested the current patch with the new test and also ran it through mach5 tiers1-6.

Thanks,
Patricio


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8325469: Freeze/Thaw code can crash in the presence of OSR frames (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18637/head:pull/18637
$ git checkout pull/18637

Update a local copy of the PR:
$ git checkout pull/18637
$ git pull https://git.openjdk.org/jdk.git pull/18637/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 18637

View PR using the GUI difftool:
$ git pr show -t 18637

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18637.diff


bridgekeeper bot commented Apr 4, 2024

👋 Welcome back pchilanomate! A progress list of the required criteria for merging this PR into pr/18632 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk bot commented Apr 4, 2024

@pchilano This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8325469: Freeze/Thaw code can crash in the presence of OSR frames

Reviewed-by: rpressler, dlong

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 5 new commits pushed to the master branch:

  • 76cbe48: 8329430: MetaspaceShared::preload_and_dump should clear pending exception
  • f7c8413: 8326116: JFR: Add help option to -XX:StartFlightRecording
  • 941bee1: 8327640: Allow NumberFormat strict parsing
  • 2ede143: 8330279: Typo in java.text.Bidi class description
  • 90df3b7: 8329190: (ch) DatagramChannel.receive should throw ClosedChannelException when called on closed channel

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk bot commented Apr 4, 2024

@pchilano The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Apr 4, 2024
Contributor Author

pchilano commented Apr 4, 2024

/label remove core-libs

@openjdk openjdk bot removed the core-libs core-libs-dev@openjdk.org label Apr 4, 2024
openjdk bot commented Apr 4, 2024

@pchilano
The core-libs label was successfully removed.

@pchilano pchilano marked this pull request as ready for review April 4, 2024 21:06
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 4, 2024
mlbridge bot commented Apr 4, 2024

Webrevs

@dean-long
Member

This looks good, but have you considered computing the value every time instead of caching it in _num_stack_arg_slots and increasing the size of every nmethod?

@@ -801,6 +802,7 @@ nmethod::nmethod(

init_defaults();
_entry_bci = entry_bci;
_num_stack_arg_slots = entry_bci != InvocationEntryBci ? 0 : _method->constMethod()->num_stack_arg_slots();
Member

If I understand correctly, is the condition on this line the actual fix?

Contributor Author

Yes. The point is that _num_stack_arg_slots should not be fixed for a given Method, as it is now, but should depend on the actual nmethod.

while (!cont.isDone()) {
cont.run();
if (freezeFast && !thawFast && fooCallCount == 2) {
// All frames freezed in last yield should be compiled
Member

freezed -> frozen

Contributor Author

Fixed.

// provoke OSR compilation
for (int i = 0; i < 500_000 * fooCallCount; i++) {
}
fooCallCount++;
Member

Perhaps use WhiteBox to check if we're OSRed?

Contributor Author

I'll test using isMethodCompiled(m, true) as another condition to break the loop.
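
A minimal sketch of a loop with both exit conditions might look like the following. It is an assumption-laden stand-in, not the test code: `OsrSpinLoop`, `spinUntilOsr`, and the `isOsrCompiled` supplier are hypothetical names, the supplier takes the place of a call such as `isMethodCompiled(m, true)` so the example runs without the WhiteBox test library, and the polling interval is arbitrary.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: spin until an iteration cap is hit or a compiled check succeeds.
class OsrSpinLoop {
    // Returns the iteration index at which the loop stopped.
    static int spinUntilOsr(int maxIterations, BooleanSupplier isOsrCompiled) {
        int i = 0;
        for (; i < maxIterations; i++) {
            // Poll only every 1024 iterations to keep the loop body cheap.
            if ((i & 0x3FF) == 0 && isOsrCompiled.getAsBoolean()) {
                break;
            }
        }
        return i;
    }

    public static void main(String[] args) {
        int[] polls = {0};
        // Pretend the method becomes OSR-compiled on the fourth poll.
        int stoppedAt = spinUntilOsr(5_000_000, () -> ++polls[0] > 3);
        System.out.println("stopped at iteration " + stoppedAt + " after " + polls[0] + " polls");
    }
}
```

Keeping the iteration cap as a fallback means the loop still terminates if the compiled check never succeeds, for example when compilation is disabled.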

// provoke OSR compilation
for (int i = 0; i < 5_000_000 * fooCallCount; i++) {
}
fooCallCount++;
Member

Ditto. Perhaps use WhiteBox to check if we're OSRed?

Contributor Author

pchilano commented Apr 8, 2024

> This looks good, but have you considered computing the value every time instead of caching it in _num_stack_arg_slots and increasing the size of every nmethod?

Since this is used in the thaw fast path too, I wanted to avoid the extra load of constMethod if possible, but I think either case is fine. Moving _is_unlinked to where the other booleans are defined actually keeps the size of the nmethod the same as before (368 bytes). What do you think?
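
To see why moving a boolean next to other booleans can absorb a new field, here is a small layout calculator. It is a generic sketch under simple assumptions (each field is aligned to its own size, and the total is rounded up to the largest alignment); the field sizes are invented for illustration and are not the actual nmethod layout.

```java
// Hypothetical struct-size calculator; the field lists below are made up.
class FieldLayout {
    // Rounds value up to the next multiple of alignment.
    static int alignUp(int value, int alignment) {
        return (value + alignment - 1) / alignment * alignment;
    }

    // Computes the size of a struct whose fields are laid out in the given order,
    // assuming each field's alignment equals its size in bytes.
    static int sizeOf(int... fieldSizes) {
        int offset = 0;
        int maxAlign = 1;
        for (int size : fieldSizes) {
            offset = alignUp(offset, size) + size;
            maxAlign = Math.max(maxAlign, size);
        }
        return alignUp(offset, maxAlign);
    }

    public static void main(String[] args) {
        // pointer, int, bool, int, bool
        System.out.println("before:           " + sizeOf(8, 4, 1, 4, 1));
        // same order with a new int appended at the end
        System.out.println("new int appended: " + sizeOf(8, 4, 1, 4, 1, 4));
        // new int added and the booleans grouped together
        System.out.println("bools grouped:    " + sizeOf(8, 4, 4, 4, 1, 1));
    }
}
```

In this made-up layout the struct is 24 bytes before the change, 32 bytes if the new int is simply appended, and 24 bytes again once the booleans are grouped, because the padding that used to follow each boolean is reclaimed.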

@dean-long
Member

> Since this is used in the thaw fast path too, I wanted to avoid the extra load of constMethod if possible, but I think either case is fine. Moving _is_unlinked to where the other booleans are defined actually keeps the size of the nmethod the same as before (368 bytes). What do you think?

Can you do a performance measurement to see if the extra load actually makes a difference? I think @vnkozlov is also doing nmethod field reordering/compaction, so the relative overhead of an extra field might not remain 0.

Member

pron commented Apr 10, 2024

It may be hard to do a proper measurement because the number of methods in our microbenchmarks is small. We're also talking about an extra branch, I think. This is code that can be called a million times per second per core. It's very performance sensitive. So I would prefer to first see if there's an impact on nmethod size, and only if there is, consider whether the speed implications are acceptable.

@dean-long
Member

OK, let's go with the new nmethod field.

@openjdk-notifier openjdk-notifier bot changed the base branch from pr/18632 to master April 16, 2024 14:13
@openjdk-notifier

The parent pull request that this pull request depends on has now been integrated and the target branch of this pull request has been updated. This means that changes from the dependent pull request can start to show up as belonging to this pull request, which may be confusing for reviewers. To remedy this situation, simply merge the latest changes from the new target branch into this pull request by running commands similar to these in the local repository for your personal fork:

git checkout JDK-8325469
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# if there are conflicts, follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk bot commented Apr 16, 2024

@pchilano this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8325469
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Apr 16, 2024
openjdk bot commented Apr 16, 2024

⚠️ @pchilano This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@openjdk openjdk bot added ready Pull request is ready to be integrated and removed merge-conflict Pull request has merge conflict with target branch labels Apr 16, 2024
@pchilano
Contributor Author

Thanks for the reviews @pron and @dean-long!

@pchilano
Contributor Author

/integrate

openjdk bot commented Apr 17, 2024

Going to push as commit fd331ff.
Since your change was applied there have been 30 commits pushed to the master branch:

  • 9fd7802: 8325494: C2: Broken graph after not skipping CastII node anymore for Assertion Predicates after JDK-8309902
  • 192ec38: 8329595: spurious variable "might not have been initialized" on static final field
  • 03e8417: 8329948: Remove string template feature
  • ff3e76f: 8330053: JFR: Use LocalDateTime instead ZonedDateTime
  • 811aadd: 8324683: Unify AttachListener code for Posix platforms
  • 5841cb3: 8330107: Separate out "awt" libraries from Awt2dLibraries.gmk
  • 89129e3: 8212895: ChronoField.INSTANT_SECONDS's range doesn't match the range of Instant
  • 9445047: 8330163: C2: improve CMoveNode::Value() when condition is always true or false
  • d2f9a1e: Merge
  • 33d7127: 8322122: Enhance generation of addresses
  • ... and 20 more: https://git.openjdk.org/jdk/compare/f11a496de61d800a680517457eb43b078a633953...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 17, 2024
@openjdk openjdk bot closed this Apr 17, 2024
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 17, 2024
openjdk bot commented Apr 17, 2024

@pchilano Pushed as commit fd331ff.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@vnkozlov
Contributor

Hi @pchilano

This change did affect my PR, which tries to reduce the nmethod header size: #18768.

I am fine with caching the value in nmethod, but why did you use an int field for it? It is a u2 in constMethod.hpp#L209.

I am currently resolving the conflict in my PR with your changes and I am planning to use u2 for it in nmethod too. Are you okay with that?

@pchilano
Contributor Author

> Hi @pchilano
>
> This change did affect my PR, which tries to reduce the nmethod header size: #18768.
>
> I am fine with caching the value in nmethod, but why did you use an int field for it? It is a u2 in constMethod.hpp#L209.
>
> I am currently resolving the conflict in my PR with your changes and I am planning to use u2 for it in nmethod too. Are you okay with that?

Yes. I just used int because that was the return type of num_stack_arg_slots(), which I moved from method.hpp, but I missed that the field can just be defined as a u2 instead.

@vnkozlov
Contributor

> Yes. I just used int because that was the return type of num_stack_arg_slots(), which I moved from method.hpp, but I missed that the field can just be defined as a u2 instead.

Okay. Thanks!

Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
4 participants