8318586: Explicitly handle upcall stub allocation failure #16311

Closed
wants to merge 14 commits into from

Conversation

@JornVernee (Member) commented Oct 23, 2023

Explicitly handle UpcallStub allocation failures by terminating. Currently we might try to use the returned nullptr, which would fail sooner or later. This patch just makes the termination explicit.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

  • JDK-8318586: Explicitly handle upcall stub allocation failure (Bug - P3)
  • JDK-8318653: UpcallTestHelper::runInNewProcess waits for forked process without timeout (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16311/head:pull/16311
$ git checkout pull/16311

Update a local copy of the PR:
$ git checkout pull/16311
$ git pull https://git.openjdk.org/jdk.git pull/16311/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 16311

View PR using the GUI difftool:
$ git pr show -t 16311

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16311.diff

Webrev

Link to Webrev Comment

bridgekeeper bot commented Oct 23, 2023

👋 Welcome back jvernee! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk bot commented Oct 23, 2023

@JornVernee The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler and core-libs labels Oct 23, 2023
@JornVernee (Member Author)

/solves JDK-8318653

openjdk bot commented Oct 24, 2023

@JornVernee
Adding additional issue to solves list: 8318653: UpcallTestHelper::runInNewProcess waits for forked process without timeout.
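
The second issue concerns waiting on a forked process without a timeout. As a rough illustration of a bounded wait (not the actual UpcallTestHelper change, which is not shown in this conversation), the sketch below uses the standard `Process.waitFor(long, TimeUnit)` overload. The class name `BoundedWaitSketch`, the method `waitForOrKill`, and the choice to kill the child and throw `IllegalStateException` on timeout are assumptions made for this example.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not the JDK test code: waits for a child process with a deadline.
final class BoundedWaitSketch {

    /** Returns the exit code, or kills the process and throws if it does not exit in time. */
    static int waitForOrKill(Process process, long timeout, TimeUnit unit) {
        try {
            // waitFor(timeout, unit) returns false if the process is still running at the deadline.
            if (!process.waitFor(timeout, unit)) {
                process.destroyForcibly();
                throw new IllegalStateException(
                        "Forked process did not exit within " + timeout + " " + unit);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            process.destroyForcibly();
            throw new IllegalStateException("Interrupted while waiting for forked process", e);
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws IOException {
        // Assumes a `java` launcher is available on the PATH.
        Process child = new ProcessBuilder("java", "-version").inheritIO().start();
        System.out.println("exit code: " + waitForOrKill(child, 30, TimeUnit.SECONDS));
    }
}
```

With a bounded wait, a hung child turns into a test failure with a clear message instead of a test run that never finishes.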

@JornVernee JornVernee marked this pull request as ready for review October 24, 2023 14:44
@openjdk openjdk bot added the rfr label Oct 24, 2023
mlbridge bot commented Oct 24, 2023

Webrevs

@shipilev (Member) left a comment

I find it pretty weird to terminate the VM if we cannot allocate an upcall stub. Does this mean the user code could actually terminate the VM on this fatal? The unit test suggests so.

Can the VM code actually handle things without an upcall stub present, if e.g. memory is exhausted?

@@ -758,6 +758,9 @@ UpcallStub* UpcallStub::create(const char* name, CodeBuffer* cb, jobject receive
   {
     MutexLocker mu(CodeCache_lock, Mutex::_no_safepoint_check_flag);
     blob = new (size) UpcallStub(name, cb, size, receiver, frame_data_offset);
     if (blob == nullptr) {
Member

I think it would be safer to call into fatal without having CodeCache_lock held. Move it out of MutexLocker scope?

Member Author

This pattern follows what is done in RuntimeStub::new_runtime_stub, which also calls fatal under the lock.

I agree it's probably better to call outside of the lock (and that is something I noticed in the original change for RuntimeStub as well). I'm happy to fix it for both.

Member

It is okay to handle RuntimeStub in a separate (and more cleanly backportable) PR. Let's make the new code do the right thing from the start.
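
The suggestion in this thread, to allocate under the lock but report the failure only after the lock has been released, can be sketched as follows. This is a Java analogue for illustration only: the real code is HotSpot C++ using `MutexLocker` and `CodeCache_lock`, and every name below (`CODE_CACHE_LOCK`, `tryAllocate`, `createStub`, `reportFailure`) is hypothetical. An `IllegalStateException` stands in for the VM's `fatal`.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of "allocate under the lock, handle failure outside it".
final class StubLockScopeSketch {
    // Stand-in for CodeCache_lock.
    static final ReentrantLock CODE_CACHE_LOCK = new ReentrantLock();

    // Stand-in for the remaining space in the code cache, in bytes.
    static int remaining = 64;

    // Records whether the failure handler ran while the lock was still held.
    static boolean lockHeldAtFailure;

    // Returns null when the request does not fit, mirroring a nullptr result.
    static byte[] tryAllocate(int size) {
        if (size > remaining) {
            return null;
        }
        remaining -= size;
        return new byte[size];
    }

    static byte[] createStub(int size) {
        byte[] blob;
        CODE_CACHE_LOCK.lock();
        try {
            blob = tryAllocate(size); // only the allocation runs under the lock
        } finally {
            CODE_CACHE_LOCK.unlock();
        }
        if (blob == null) {
            reportFailure(size); // the lock has already been released here
        }
        return blob;
    }

    static void reportFailure(int size) {
        lockHeldAtFailure = CODE_CACHE_LOCK.isHeldByCurrentThread();
        throw new IllegalStateException("Failed to allocate stub of " + size + " bytes");
    }

    public static void main(String[] args) {
        System.out.println("allocated " + createStub(16).length + " bytes");
    }
}
```

Keeping the failure path outside the critical section means that whatever the error handler does (logging, aborting, or later throwing an exception) it never runs while other threads are blocked on the lock.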

@JornVernee (Member Author)

> I find it pretty weird to terminate the VM if we cannot allocate an upcall stub. Does this mean the user code could actually terminate the VM on this fatal? The unit test suggests so.
>
> Can the VM code actually handle things without an upcall stub present, if e.g. memory is exhausted?

I think the question is whether the user can do anything reasonable if the allocation fails. Upcall stubs are allocated as a result of a call to Linker.upcallStub. That means that, one way or another, we cannot satisfy a direct user request, and the allocation failure would be visible. Whether that is through a fatal error or some kind of exception, I'm not sure.

But it sounds like you're saying that plain user code should never result in a VM error (if we can help it). That is something I agree with. We'd have to throw some exception from Linker.upcallStub if the allocation fails (not sure if OOME is the right one for CodeCache exhaustion). And probably the same for Linker.downcallHandle.

@shipilev (Member) commented Oct 25, 2023

> But it sounds like you're saying that plain user code should never result in a VM error (if we can help it).

Yes, exactly. Granted, there are resource exhaustion situations where overall progress could be sluggish (e.g. if we are near a Java heap OOME), but we don't usually elevate that to globally shutting down the JVM, unless the user explicitly requests this, e.g. with -XX:+ExitOnOutOfMemoryError.

I think the situation for RuntimeStub-s is a bit different, as we expect to have a more or less constant number of them, mostly allocated upfront. Failing to allocate a RuntimeStub then looks like a configuration issue. But for UpcallStub-s -- correct me if I am wrong here -- we can have an unbounded number of them, right? That exposes us to a globally visible VM failure if there is misbehaving code.

It is not a problem with this particular PR, which I think is fine, but it exposes the larger, more important architectural question.

@JornVernee (Member Author)

> But it sounds like you're saying that plain user code should never result in a VM error (if we can help it).
>
> Yes, exactly. Granted, there are resource exhaustion situations where overall progress could be sluggish (e.g. if we are near a Java heap OOME), but we don't usually elevate that to globally shutting down the JVM, unless the user explicitly requests this, e.g. with -XX:+ExitOnOutOfMemoryError.
>
> I think the situation for RuntimeStub-s is a bit different, as we expect to have a more or less constant number of them, mostly allocated upfront. Failing to allocate a RuntimeStub then looks like a configuration issue. But for UpcallStub-s -- correct me if I am wrong here -- we can have an unbounded number of them, right? That exposes us to a globally visible VM failure if there is misbehaving code.

FWIW, we use RuntimeStub for downcall stubs allocated by the linker, so there is also an unbounded number of those. (Well, technically it is bounded by the 255-argument limit times all argument type combinations, but that number is very large.)

> It is not a problem with this particular PR, which I think is fine, but it exposes the larger, more important architectural question.

Ok, I'll discuss with the others in the FFM team. I think if we turn this failure into an OOME (I'm leaning that way at the moment), then we also need a spec change + CSR.

I'll wait with this PR until we reach some conclusion.

@shipilev (Member)

> I'll wait with this PR until we reach some conclusion.

I think we can proceed with this PR. The explicit failure is still better than a failure somewhere downstream. That is, this PR does not change the failure mode substantially, right? If something else is doable, like throwing the actual exception, we can replace this fatal with throwing the exception later.

@JornVernee (Member Author)

> this PR does not change the failure mode substantially, right?

That's right.

openjdk bot commented Oct 25, 2023

@JornVernee This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8318586: Explicitly handle upcall stub allocation failure
8318653: UpcallTestHelper::runInNewProcess waits for forked process without timeout

Reviewed-by: shade, mcimadamore

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 11 new commits pushed to the master branch:

  • c6a8278: 8321127: ProblemList java/util/stream/GatherersTest.java
  • a3eb664: 8315701: [macos] Regression: KeyEvent has different keycode on different keyboard layouts
  • 6aba6aa: 8320347: Emulate vblendvp[sd] on ECore
  • 6938474: 8320916: jdk/jfr/event/gc/stacktrace/TestParallelMarkSweepAllocationPendingStackTrace.java failed with "OutOfMemoryError: GC overhead limit exceeded"
  • da09eab: 8319980: [JVMCI] libgraal should reuse Thread instances as C2 does
  • 33b26f7: 8319123: Implement JEP 461: Stream Gatherers (Preview)
  • 04ad98e: 8315458: Implement JEP 463: Implicitly Declared Classes and Instance Main Method (Second Preview)
  • 03759e8: 8320304: Refactor and simplify monitor deflation functions
  • da7cf25: 8320665: update jdk_core at open/test/jdk/TEST.groups
  • c9d15f7: 8321025: Enable Neoverse N1 optimizations for Neoverse V2
  • ... and 1 more: https://git.openjdk.org/jdk/compare/8b102ed6b4f595f07c0e741328f5fcac65320461...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Oct 25, 2023
@dholmes-ora (Member)

Is stub allocation the same as other VM C heap exhaustion cases? They will terminate the VM too. Otherwise it would be better to report an error and have the Java code throw an OOME.

@JornVernee (Member Author)

> Is stub allocation the same as other VM C heap exhaustion cases?

I think it depends on the particular case. For example, Unsafe::allocateMemory can also throw an OOME.

I think an allocation failure in the particular case of the FFM Linker allocating stubs is something that we can reasonably report back to the user.

Looking at this again, I realize that we're also allocating a BufferBlob when creating the CodeBuffer, which is a far bigger allocation, and we never check whether that allocation fails either. I'll have a more thorough look at this.

I agree that bubbling up the allocation failures as OOME would be better.

@dougxc (Member) commented Oct 26, 2023

I agree that avoiding a VM fatal error is preferable, like a recent change to make JVMCI RuntimeStub creation failure result in a BailoutException instead of a fatal error.

@JornVernee (Member Author)

I've uploaded another version that throws an OOME when the allocation of a downcall or upcall stub fails. (On x64 only for now; I'll look at the other platforms as well.)

Let me know if that seems better.
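
The change in failure mode described here, from a fatal VM error to an OutOfMemoryError that the caller can observe, can be illustrated with a small model. This is not the HotSpot or Linker implementation: `BoundedStubCache` and `UpcallOomeSketch` are hypothetical stand-ins for a fixed-size code cache and for user code that requests stubs repeatedly (as repeated calls to Linker.upcallStub would), and the sizes and error message are made up.

```java
// Hypothetical fixed-size cache that reports exhaustion with an exception
// instead of terminating the process.
final class BoundedStubCache {
    private final int capacity;
    private int used;

    BoundedStubCache(int capacity) {
        this.capacity = capacity;
    }

    /** Reserves size bytes and returns their offset, or throws OutOfMemoryError if they do not fit. */
    int allocate(int size) {
        if (size > capacity - used) {
            throw new OutOfMemoryError("Cannot allocate stub of " + size + " bytes");
        }
        int offset = used;
        used += size;
        return offset;
    }

    int used() {
        return used;
    }
}

final class UpcallOomeSketch {
    /** Requests stubs until the cache is exhausted and returns how many requests succeeded. */
    static int fillUntilExhausted(BoundedStubCache cache, int stubSize) {
        int created = 0;
        try {
            while (true) {
                cache.allocate(stubSize);
                created++;
            }
        } catch (OutOfMemoryError e) {
            // The caller sees the failure and the process keeps running.
            return created;
        }
    }

    public static void main(String[] args) {
        BoundedStubCache cache = new BoundedStubCache(100);
        System.out.println("stubs created before exhaustion: " + fillUntilExhausted(cache, 30));
    }
}
```

Because OutOfMemoryError is an unchecked VirtualMachineError, callers need no `throws` clause to let it propagate, and code that can recover (for example by closing arenas that own older stubs) can catch it at a point of its choosing.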

@@ -529,6 +529,7 @@ static Linker nativeLinker() {
* @throws IllegalArgumentException if {@code !address.isNative()}, or if {@code address.equals(MemorySegment.NULL)}.
* @throws IllegalArgumentException if an invalid combination of linker options is given.
* @throws IllegalCallerException If the caller is in a module that does not have native access enabled.
* @throws OutOfMemoryError if the runtime does not have the memory needed to create the downcall handle.
Member Author

Suggestions for the phrasing here are welcome. I think we should use something that works for both downcall handles and upcall stubs though.

Member

OOME is pretty much understood to be possible anywhere, given it is a VirtualMachineError. We often do not document it explicitly. The risk with documenting it is that it gives the impression that other methods, which don't document it, can never throw it. A rough grep for @throws OutOfMemoryError reveals only 15 classes in java.base that explicitly document this.

Contributor

Taking inspiration from other methods that throw this exception, maybe something like this might work:

if the downcall method handle cannot be allocated by the Linker

Contributor

That said, I also agree with @dholmes-ora: I'm not sure how important it is to document OOME here, since it reflects an internal state of the JVM rather than something the client can do something about.

E.g. if you create an allocator with SegmentAllocator::slicingAllocator, at some point you are going to run out of space in the underlying segment, so it makes sense to report failures (and to document why that happens). But in this case the documentation is going to be very vague, and I don't think it provides a lot of value.

Member Author

Ok. I figured it was similar to Unsafe::allocateMemory, which also documents OOME. But then again, the user is not directly interested in memory in this case.

I'll remove these @throws tags then.

@JornVernee (Member Author)

I've also removed the test that tries to trigger an OOME when allocating downcall stubs. It does not seem possible to isolate that particular code path (unless a direct whitebox API is added maybe, but that also kinda defeats the purpose of testing), which led to a flaky test. I've left the test for upcall stubs, as that seems to work well enough (but I might need to drop that as well).

@JornVernee (Member Author)

Okay, I have (finally) updated all the other platforms. Please take another look. Thanks.

@mcimadamore (Contributor) left a comment

Latest changes look good

openjdk bot commented Nov 30, 2023

@JornVernee this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout UpcallStubAllocFailure
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict label and removed the ready and rfr labels Nov 30, 2023
@openjdk openjdk bot added the ready and rfr labels and removed the merge-conflict label Nov 30, 2023
@JornVernee (Member Author)

/integrate

openjdk bot commented Nov 30, 2023

Going to push as commit e96e191.
Since your change was applied there have been 18 commits pushed to the master branch:

  • 630bafd: 8320826: call allocate_shared_strings_array after all strings are interned
  • 0a60b0f: 8302233: HSS/LMS: keytool and jarsigner changes
  • 7ad7005: 8315034: File.mkdirs() occasionally fails to create folders on Windows shared folder
  • 41daa3b: 8320239: add dynamic switch for JvmtiVTMSTransitionDisabler sync protocol
  • 7c135c3: 8321066: Multiple JFR tests have started failing
  • 8bedb28: 8321119: Disable java/foreign/TestHandshake.java on Zero VMs
  • b1cbf55: 8321018: Parallel: Make some methods in ParCompactionManager private
  • c6a8278: 8321127: ProblemList java/util/stream/GatherersTest.java
  • a3eb664: 8315701: [macos] Regression: KeyEvent has different keycode on different keyboard layouts
  • 6aba6aa: 8320347: Emulate vblendvp[sd] on ECore
  • ... and 8 more: https://git.openjdk.org/jdk/compare/8b102ed6b4f595f07c0e741328f5fcac65320461...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated label Nov 30, 2023
@openjdk openjdk bot closed this Nov 30, 2023
@openjdk openjdk bot removed the ready and rfr labels Nov 30, 2023
openjdk bot commented Nov 30, 2023

@JornVernee Pushed as commit e96e191.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@JornVernee JornVernee deleted the UpcallStubAllocFailure branch November 30, 2023 19:32
Labels
core-libs, hotspot-compiler, integrated
5 participants