Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK-8155004: CrashOnOutOfMemoryError doesn't work for OOM caused by inability to create threads #3586

Conversation

tstuefe
Copy link
Member

@tstuefe tstuefe commented Apr 20, 2021

Greetings,

this is an old issue and I'd like to fix it. If we fail to create a java thread due to platform limitations we throw an OOM. But we fail to honor the various xxxOnOutOfMemoryError switches.

The fix is very straightforward.

If fixes

  • CrashOnOutOfMemoryError
  • ExitOnOutOfMemoryError
  • HeapDumpOnOutOfMemoryError
  • and, in theory "OnOutOfMemoryError=".

the latter only in theory because most of the times whatever prevented the thread to start up will also prevent the fork needed to get the user command running.

One remaining question, maybe for a future RFE, is how we want to handle native threads creation error. AFAICS currently, failing to create a native thread may or may not result in a fatal shutdown, a log output, or just be ignored, depending on the thread.

If ...OnOutOfMemoryError is specified, should native thread creation failure be handled the same way as a java thread?

  • No if I take the option name literally, since there is no OOM involved
  • Yes if I take into account what these switches are actually used for - analysis or quick shutdown of a JVM inside a container in case of an OOM. Since it is completely random which thread is hit by the limit.

Thanks, Thomas


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8155004: CrashOnOutOfMemoryError doesn't work for OOM caused by inability to create threads

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3586/head:pull/3586
$ git checkout pull/3586

Update a local copy of the PR:
$ git checkout pull/3586
$ git pull https://git.openjdk.java.net/jdk pull/3586/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3586

View PR using the GUI difftool:
$ git pr show -t 3586

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3586.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Apr 20, 2021

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Apr 20, 2021

@tstuefe The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Apr 20, 2021
@tstuefe tstuefe marked this pull request as ready for review April 20, 2021 11:48
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 20, 2021
@mlbridge
Copy link

mlbridge bot commented Apr 20, 2021

Webrevs

@tstuefe tstuefe marked this pull request as draft April 20, 2021 11:56
@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 20, 2021
Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Thomas,

As I wrote in the bug report my take on these flags are that they are only intended for a Java heap OOME not arbitrary resource exhaustion where the VM chooses to throw OOME. I have previously rejected a number of "bug" reports about this.

I was wondering if AbortVMOnException would not suffice here for you support case?

Thanks,
David

@tstuefe tstuefe marked this pull request as ready for review April 20, 2021 14:21
@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 20, 2021
@tstuefe
Copy link
Member Author

tstuefe commented Apr 20, 2021

I would fix it because its a well known and documented way to handle OOMs in a VM. In fact its much better known than AbortVMOnException, and has been around for much longer. People use these switches, expect them to work, but they don't.

Other reasons include:

  • maybe the customer does not want to crash but just a clean exit
  • maybe the customer wants to have a heap dump to take a look at who created that many java threads
  • Fix is really really simple

Thanks, Thomas

@tstuefe
Copy link
Member Author

tstuefe commented Apr 20, 2021

p.s. I do not understand your quotes around "bug", nor your hesitance to fix this. I have seen quite a number of desperate setups customers have around crashy or OOMy JVMs, to analyse and to restart quickly. Anything in that area we can do helps.

One example, take a look at the way CloudFoundry handles OOMs in a JVM. They have a JVMTI agent hooked up to resource-exhausted, then attempt to run analysis code (written in Java too) to dump the heap. This is so fragile.

xxxOnOutOfMemoryError could really help in these scenarios, if it would function reliably.

@YaSuenag
Copy link
Member

Mostly agree with @tstuefe . I give +1 to this PR. I also have seen OOMs due to native thread creation, then I advised to watch the log to reboot the system quickly.
(I've provided HeapStats to my customers to do it - it is JVMTI agent and similar way to CloudFoundry's agent)

However I want to know why JVM ignores it now. Because JVM cannot work correctly if the error is caused by ENOMEM?

@tstuefe
Copy link
Member Author

tstuefe commented Apr 21, 2021

Mostly agree with @tstuefe . I give +1 to this PR. I also have seen OOMs due to native thread creation, then I advised to watch the log to reboot the system quickly.
(I've provided HeapStats to my customers to do it - it is JVMTI agent and similar way to CloudFoundry's agent)

However I want to know why JVM ignores it now. Because JVM cannot work correctly if the error is caused by ENOMEM?

Its simply not implemented. The ...OnOutOfMemoryError handling needs to be added to places where OOM is thrown, and is missing from a couple of places.

@YaSuenag
Copy link
Member

Its simply not implemented. The ...OnOutOfMemoryError handling needs to be added to places where OOM is thrown, and is missing from a couple of places.

Strictly speaking, we need to think about OOME from Java layer as well. For example, OOME would be thrown when direct buffer cannot allocate native memory (-XX:MaxDirectMemorySize). It is not caught by JVMTI ResourceExhausted event, however it might be important information to know system memory usage.

I've found out similar issue on JFR, and I've posted #1403 , however it might not be accepted because C2 might eliminate throwing exceptions. IMHO *OnOutOfMemoryError, unified logging (-Xlog:exceptions), JVMTI event, JFR events should be hook all of OOMEs, they should be able to hook same source at least.

@tstuefe
Copy link
Member Author

tstuefe commented Apr 21, 2021

Its simply not implemented. The ...OnOutOfMemoryError handling needs to be added to places where OOM is thrown, and is missing from a couple of places.

Strictly speaking, we need to think about OOME from Java layer as well. For example, OOME would be thrown when direct buffer cannot allocate native memory (-XX:MaxDirectMemorySize). It is not caught by JVMTI ResourceExhausted event, however it might be important information to know system memory usage.

Please open a new issue for that, we can then discuss this separately.

I've found out similar issue on JFR, and I've posted #1403 , however it might not be accepted because C2 might eliminate throwing exceptions. IMHO *OnOutOfMemoryError, unified logging (-Xlog:exceptions), JVMTI event, JFR events should be hook all of OOMEs, they should be able to hook same source at least.

Fully agree. In fact, I think semantically binding the switches to OutOfMemoryError is a design flaw, since you can have any number of resource bottlenecks where one might wish to restart the VM but which do not result in a OOM in java; one example is the creation of native threads by the VM.

But that is a discussion for another day. We seem to have two opposite factions in this discussion. The "works as designed and leave it be" faction and the one wanting to cover as much resource exhaustion cases as possible by these switches. How do we proceed?

@mlbridge
Copy link

mlbridge bot commented Apr 21, 2021

Mailing list message from David Holmes on hotspot-dev:

Hi Thomas,

On 21/04/2021 12:53 am, Thomas Stuefe wrote:

On Tue, 20 Apr 2021 10:52:26 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

Greetings,

this is an old issue and I'd like to fix it. If we fail to create a java thread due to platform limitations we throw an OOM. But we fail to honor the various xxxOnOutOfMemoryError switches.

The fix is very straightforward.

One remaining question, maybe for a future RFE, is how we want to handle native threads creation error. AFAICS currently, failing to create a native thread may or may not result in a fatal shutdown, a log output, or just be ignored, depending on the thread.

If `...OnOutOfMemoryError` is specified, should native thread creation failure be handled the same way as a java thread?

- No if I take the option name literally, since there is no OOM involved
- Yes if I take into account what these switches are actually used for - analysis or quick shutdown of a JVM inside a container in case of an OOM. Since it is completely random which thread is hit by the limit.

Thanks, Thomas

p.s. I do not understand your quotes around "bug", nor your hesitance to fix this. I have seen quite a number of desperate setups customers have around crashy or OOMy JVMs, to analyse and to restart quickly. Anything in that area we can do helps.

My hesitance is because the intent of these flags is to provide a
response for Java heap exhaustion, not to provide a response to every
place that java.lang.OutOfMemoryError may be thrown. You've picked this
one case of native thread exhaustion to add-in but that just makes it
inconsistent. Why should that case be handled and not all the others? It
is unfortunate that there is more than one way to read OutOfMemoryError,
but it should be read as out-of-Java-heap-memory-error. We have had bug
reports when someone uses this flag and their code does "throw new
OutOfMemoryError() and they want to know why the VM didn't crash!

I wouldn't object to a general broadening of what this flag applies to
but I do object to cherry-picking a single case, resulting in the flag
have no well-defined meaning at all.

And per your latest email perhaps we need new generic flags for
resource-exhaustion.

Cheers,
David
-----

One example, take a look at the way CloudFoundry handles OOMs in a JVM. They have a JVMTI agent hooked up to resource-exhausted, then attempt to run analysis code (written in Java too) to dump the heap. This is so fragile.

1 similar comment
@mlbridge
Copy link

mlbridge bot commented Apr 21, 2021

Mailing list message from David Holmes on hotspot-dev:

Hi Thomas,

On 21/04/2021 12:53 am, Thomas Stuefe wrote:

On Tue, 20 Apr 2021 10:52:26 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

Greetings,

this is an old issue and I'd like to fix it. If we fail to create a java thread due to platform limitations we throw an OOM. But we fail to honor the various xxxOnOutOfMemoryError switches.

The fix is very straightforward.

One remaining question, maybe for a future RFE, is how we want to handle native threads creation error. AFAICS currently, failing to create a native thread may or may not result in a fatal shutdown, a log output, or just be ignored, depending on the thread.

If `...OnOutOfMemoryError` is specified, should native thread creation failure be handled the same way as a java thread?

- No if I take the option name literally, since there is no OOM involved
- Yes if I take into account what these switches are actually used for - analysis or quick shutdown of a JVM inside a container in case of an OOM. Since it is completely random which thread is hit by the limit.

Thanks, Thomas

p.s. I do not understand your quotes around "bug", nor your hesitance to fix this. I have seen quite a number of desperate setups customers have around crashy or OOMy JVMs, to analyse and to restart quickly. Anything in that area we can do helps.

My hesitance is because the intent of these flags is to provide a
response for Java heap exhaustion, not to provide a response to every
place that java.lang.OutOfMemoryError may be thrown. You've picked this
one case of native thread exhaustion to add-in but that just makes it
inconsistent. Why should that case be handled and not all the others? It
is unfortunate that there is more than one way to read OutOfMemoryError,
but it should be read as out-of-Java-heap-memory-error. We have had bug
reports when someone uses this flag and their code does "throw new
OutOfMemoryError() and they want to know why the VM didn't crash!

I wouldn't object to a general broadening of what this flag applies to
but I do object to cherry-picking a single case, resulting in the flag
have no well-defined meaning at all.

And per your latest email perhaps we need new generic flags for
resource-exhaustion.

Cheers,
David
-----

One example, take a look at the way CloudFoundry handles OOMs in a JVM. They have a JVMTI agent hooked up to resource-exhausted, then attempt to run analysis code (written in Java too) to dump the heap. This is so fragile.

@tstuefe
Copy link
Member Author

tstuefe commented Apr 21, 2021

Hi David,

(snip)

p.s. I do not understand your quotes around "bug", nor your hesitance to fix this. I have seen quite a number of desperate setups customers have around crashy or OOMy JVMs, to analyse and to restart quickly. Anything in that area we can do helps.

My hesitance is because the intent of these flags is to provide a
response for Java heap exhaustion, not to provide a response to every
place that java.lang.OutOfMemoryError may be thrown. You've picked this
one case of native thread exhaustion to add-in but that just makes it
inconsistent. Why should that case be handled and not all the others? It
is unfortunate that there is more than one way to read OutOfMemoryError,
but it should be read as out-of-Java-heap-memory-error. We have had bug
reports when someone uses this flag and their code does "throw new
OutOfMemoryError() and they want to know why the VM didn't crash!

I wouldn't object to a general broadening of what this flag applies to
but I do object to cherry-picking a single case, resulting in the flag
have no well-defined meaning at all.

Okay, I understand your concern now better.

If you do not object in principle to broadening of ..OnOutOfMemoryError to cover more cases, but just to the fact that this RFE only handles this one case of resource exhaustion, I think we have common ground. Both Yasumasa and me are fully in favour of making the ..OnOutOfMemoryError coverage as complete as possible. I would prefer to do this one RFE at a time, since that way we can put in work when we have free cycles, and these smaller patches are more easy to backport.

Today ..OnOutOfMemoryError is invoked on OOM for both heap- and metaspace exhaustion, and AFAICS has been this way since at least jdk 11. So it is not consistently applied to java heap ooms only.

And per your latest email perhaps we need new generic flags for
resource-exhaustion.

I'd like that very much. But let me make an argument for fixing the ..OnOutOfMemoryError first, where fixing means making them apply to as many cases of thrown OOMs as possible:

The public does not know when these hooks fire and reasonably assume they fire for all instances of OOM, regardless the cause. To my knowledge we never told them differently. Since it is difficult enough to educate them about new options - look at AbortVMOnException, not many seem to know that one - I'd go with the flow and let the switches do what the public always thought they would do. There are a lot of StackOverFlow articles describing these switches, for instance.

Cheers, Thomas

Cheers,
David

@YaSuenag
Copy link
Member

We seem to have two opposite factions in this discussion. The "works as designed and leave it be" faction and the one wanting to cover as much resource exhaustion cases as possible by these switches. How do we proceed?

I like latter, but we need to discuss more because, for example, as you know, description of HeapDumpOnOutOfMemoryError says it affects when an allocation from the Java heap or the permanent generation cannot be satisfied. We will change this spec.

I cannot understand "a discussion for another day" well, but...
UL (-Xlog:exceptions) has the most coverage of OOMEs (Throwables), so we need to follow it. It will be bigger change than what you thought, but my patch ( #1403 ) will help you.
IMHO you can work for it in this PR, and we can work for the others (JFR, JVMTI) in other issues. Is it ok? Otherwise do you want to work just for OOME of native thread creation at first? Ether way, it is the best if these work are acomplished in JDK 17 😉

@mlbridge
Copy link

mlbridge bot commented Apr 21, 2021

Mailing list message from David Holmes on hotspot-dev:

Hi Thomas,

On 21/04/2021 5:48 pm, Thomas Stuefe wrote:

Hi David,

(snip)

p.s. I do not understand your quotes around "bug", nor your hesitance to fix this. I have seen quite a number of desperate setups customers have around crashy or OOMy JVMs, to analyse and to restart quickly. Anything in that area we can do helps.

My hesitance is because the intent of these flags is to provide a
response for Java heap exhaustion, not to provide a response to every
place that java.lang.OutOfMemoryError may be thrown. You've picked this
one case of native thread exhaustion to add-in but that just makes it
inconsistent. Why should that case be handled and not all the others? It
is unfortunate that there is more than one way to read OutOfMemoryError,
but it should be read as out-of-Java-heap-memory-error. We have had bug
reports when someone uses this flag and their code does "throw new
OutOfMemoryError() and they want to know why the VM didn't crash!

I wouldn't object to a general broadening of what this flag applies to
but I do object to cherry-picking a single case, resulting in the flag
have no well-defined meaning at all.

Okay, I understand your concern now better.

If you do not object in principle to broadening of ..OnOutOfMemoryError to cover more cases, but just to the fact that this RFE only handles this one case of resource exhaustion, I think we have common ground. Both Yasumasa and me are fully in favour of making the ..OnOutOfMemoryError coverage as complete as possible. I would prefer to do this one RFE at a time, since that way we can put in work when we have free cycles, and these smaller patches are more easy to backport.

I've done some more research and I'm less convinced that this is worth
pursuing. You might think it good to add more cases to when these flags
get applied but what if existing users of these flags do not want them
to be applied to these new cases? You might argue that being unable to
start a native thread is something that should trigger a heapdump, or
abort the VM, but somebody else may not want that. And even if we agree
that most people might want this case, what about the other cases? I
certainly don't think that any throwing of OutOfMemoryError should
trigger these flags as many of them do not actually reflect being
out-of-memory: the OOME is just a proxy for out-of-some-resource.

Today ..OnOutOfMemoryError is invoked on OOM for both heap- and metaspace exhaustion, and AFAICS has been this way since at least jdk 11. So it is not consistently applied to java heap ooms only.

Okay you got me, but these are both allocation regions directly related
to Java-level allocation: either instances or classes.

And per your latest email perhaps we need new generic flags for
resource-exhaustion.

I'd like that very much. But let me make an argument for fixing the ..OnOutOfMemoryError first, where fixing means making them apply to as many cases of thrown OOMs as possible:

The public does not know when these hooks fire and reasonably assume they fire for all instances of OOM, regardless the cause. To my knowledge we never told them differently. Since it is difficult enough to educate them about new options - look at AbortVMOnException, not many seem to know that one - I'd go with the flow and let the switches do what the public always thought they would do. There are a lot of StackOverFlow articles describing these switches, for instance.

An initial google search did not reveal to me any huge issues with these
flags. And most of the time people want to get a heap dump because they
only expect HeapDumpOnOutOfMemory to do a heap dump when out of heap memory.

I think if you want to offer similar options for other circumstances,
then it is better to add new flags to deal with that, rather than
takeover the existing ones and potentially change behaviour in a way
that users have no control over.

Sorry.

Cheers,
David

@openjdk
Copy link

openjdk bot commented Apr 21, 2021

@tstuefe This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8155004: CrashOnOutOfMemoryError doesn't work for OOM caused by inability to create threads

Reviewed-by: mbaesken

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 33 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 21, 2021
@MBaesken
Copy link
Member

Hello , the description of the flags in globals.hpp says :

ExitOnOutOfMemoryError
JVM exits on the first occurrence of an out-of-memory error thrown from JVM

CrashOnOutOfMemoryError
JVM aborts, producing an error log and core/mini dump, on the first occurrence of an out-of-memory error thrown from JVM

So I think the patch Thomas did is fine, it improves the situation where customers expect an exit/crash, when the flag is set accordingly. But they do not get it at the moment.

@tstuefe
Copy link
Member Author

tstuefe commented Apr 21, 2021

Hi David,

okay, I'll withdraw the patch and close the issue as wont-fix. We'll keep the fixes downstream.

Wrt new flags, this sounds good. JDK 17 would be a good release for such a new flag, but I am not sure yet if I can find the spare cycles.

Cheers, Thomas

@tstuefe tstuefe closed this Apr 21, 2021
@tstuefe tstuefe deleted the JDK-8155004-native-thread-oom-should-trigger-oom-handling branch May 7, 2021 08:01
@electrum
Copy link

electrum commented Sep 14, 2021

Until the JVM has an option to handle this, you can use the simple jvmkill JVMTI agent which does a SIGKILL to the current process on the JVMTI_EVENT_RESOURCE_EXHAUSTED event.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org ready Pull request is ready to be integrated rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

5 participants