Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8292083: Detected container memory limit may exceed physical machine memory #9880

Closed
wants to merge 25 commits into from

Conversation

jmtd
Copy link

@jmtd jmtd commented Aug 15, 2022

We discovered some systems configured with cgroups v1 which report a bogus container memory limit value which is above the physical memory of the host. OpenJDK then calculates flags such as InitialHeapSize based on this invalid value; this can be larger than the available memory which can result in the OS terminating the process due to OOM.

hotspot's container awareness attempts to sanity check the limit value by ensuring it's below _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(), but that still leaves a large range of potential invalid values between physical RAM and that ceiling value.

Cgroups V1 in particular returns an uninitialised value for the memory limit when one has not been explicitly set. Cgroups v2 does not suffer the same problem: however, it's possible for any value to be set for the max memory, including values exceeding the available physical memory, in either v1 or v2.

This fixes the problem in two places. Further work may be required in the area of Java metrics / MXBeans. I'd also look again at whether the existing ceiling value _unlimited_memory serves any useful purpose. I personally don't feel those improvements should hold up this fix.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8292083: Detected container memory limit may exceed physical machine memory

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9880/head:pull/9880
$ git checkout pull/9880

Update a local copy of the PR:
$ git checkout pull/9880
$ git pull https://git.openjdk.org/jdk pull/9880/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9880

View PR using the GUI difftool:
$ git pr show -t 9880

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9880.diff

@jmtd jmtd force-pushed the 8292083-cgroups-badmaxmem branch from 170257f to 0bf9076 Compare August 15, 2022 14:52
@bridgekeeper
Copy link

bridgekeeper bot commented Aug 15, 2022

👋 Welcome back jdowland! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Aug 15, 2022

@jmtd The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Aug 15, 2022
The memory limit within a cgroup might be higher than the amount
of physical memory, so clamp it.
Signed-off-by: Jonathan Dowland <jdowland@redhat.com>
@jmtd jmtd force-pushed the 8292083-cgroups-badmaxmem branch from 0bf9076 to 2de864c Compare August 15, 2022 14:58
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 15, 2022
@mlbridge
Copy link

mlbridge bot commented Aug 15, 2022

Copy link
Contributor

@jerboaa jerboaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks mostly good. I'd prefer if we changed the test to not rely on InitialHeapSize as that might get ergonomically set.

src/hotspot/os/linux/os_linux.cpp Outdated Show resolved Hide resolved
src/hotspot/os/linux/osContainer_linux.cpp Outdated Show resolved Hide resolved
@jmtd
Copy link
Author

jmtd commented Aug 17, 2022

Thanks for the review! I agree with all your points.

@jmtd
Copy link
Author

jmtd commented Aug 17, 2022

This looks mostly good. I'd prefer if we changed the test to not rely on InitialHeapSize as that might get ergonomically set.

Hmm I see what you mean, yes. I liked checking a Flag, versus (or as well as) checking a trace log line, as I felt it gave better assurance. But even with -XX:InitialRAMPercentage set, InitialHeapSize is still indeed ergonomic. I'll check over all the other flags and see if there's a better candidate.

@jerboaa
Copy link
Contributor

jerboaa commented Aug 17, 2022

This fixes the problem in two places. Further work may be required in the area of Java metrics / MXBeans.

I've filed https://bugs.openjdk.org/browse/JDK-8292541 for the Java side.

@jerboaa
Copy link
Contributor

jerboaa commented Aug 17, 2022

This looks mostly good. I'd prefer if we changed the test to not rely on InitialHeapSize as that might get ergonomically set.

Hmm I see what you mean, yes. I liked checking a Flag, versus (or as well as) checking a trace log line, as I felt it gave better assurance. But even with -XX:InitialRAMPercentage set, InitialHeapSize is still indeed ergonomic. I'll check over all the other flags and see if there's a better candidate.

Hence my suggestion to check what os::physical_memory() returns via WhiteBox. Many GC heap settings are derived from physical memory. So checking that it never exceeds actual physical host memory should be sufficient.

src/hotspot/os/linux/osContainer_linux.cpp Outdated Show resolved Hide resolved
src/hotspot/os/linux/os_linux.cpp Outdated Show resolved Hide resolved
Refactor the ternary expression into an if/else chain and expand it
to the third case (memory limit equal to or exceeding physical RAM)

Format the trace log message for that case to match that of the other
two

Adjust the other two to incorporate physical RAM into the log message
@openjdk-notifier
Copy link

@jmtd Please do not rebase or force-push to an active PR as it invalidates existing review comments. All changes will be squashed into a single commit automatically when integrating. See OpenJDK Developers’ Guide for more information.

 * split assignment to mem_limit from reading it
 * nest if expressions to avoid comparing mem_limit twice
Copy link
Contributor

@jerboaa jerboaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come to think of this some more the better way to fix this would be to change OSContainer::memory_limit_in_bytes() and return -1 (for unlimited) if the detected limit exceeds host memory.

src/hotspot/os/linux/os_linux.cpp Outdated Show resolved Hide resolved
Only do so in os::physical_memory()
Cgroups code used this to override the real host RAM value with the
container memory limit. We don't do this any more so this routine
is not needed. Linux::physical_memory()/_physical_memory will now
always correspond to the hosts physical RAM, unaffected by cgroups
limits.
Remove unneeded local variable host_memory
Rewrite the test to run two containers. First time, capture the logging
to get the reported physical memory size. Derive a bad value from this
(*10). Second run, set the container memory limit to the bad value.
Check the trace log for a line indicating this was detected and ignored.
@jerboaa
Copy link
Contributor

jerboaa commented Aug 18, 2022

It appears os::Linux::available_memory() would need similar treatment, still.

src/hotspot/os/linux/os_linux.cpp Outdated Show resolved Hide resolved
@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 24, 2022
@jmtd
Copy link
Author

jmtd commented Aug 24, 2022

Looks good to me. Please get a second review for this, though, if you can.

Thanks @jerboaa much appreciated.

@tstuefe, you provided some valuable feedback on this PR in an earlier incarnation. Would you be prepared to take another look?

@jmtd
Copy link
Author

jmtd commented Aug 24, 2022

GHA reports that 1/9 tests in linux-x86 / test (hs/tier1 gc) failed. Digging into the logs, this was gc/metaspace/TestMetaspacePerfCounters.java#Shenandoah-32, which seems unrelated to these changes. I've just ran this one locally and it passed for me.

@tstuefe
Copy link
Member

tstuefe commented Aug 25, 2022

Hmm, I get uncomfortable if APIs mix layers and try to be smart.

As in this case, an API that claims to return CG mem limit but mixes in knowledge about the total physical memory. Then we have os::Linux::available_memory() in turn reporting CG mem limit instead if we are containerized. It is getting hard to understand who reports what under which conditions.

In general, I prefer dumb APIs that do exactly what they are named for, and some arbitration layer atop of it resolving conflicts and deciding whose values to use. Makes the code much easier to understand.

That does not mean that I want a rewrite of this patch. I'm just not sure this is the right direction. I have to think this through some more.

@jerboaa
Copy link
Contributor

jerboaa commented Aug 25, 2022

Hmm, I get uncomfortable if APIs mix layers and try to be smart.

As in this case, an API that claims to return CG mem limit but mixes in knowledge about the total physical memory. Then we have os::Linux::available_memory() in turn reporting CG mem limit instead if we are containerized. It is getting hard to understand who reports what under which conditions.

In general, I prefer dumb APIs that do exactly what they are named for, and some arbitration layer atop of it resolving conflicts and deciding whose values to use. Makes the code much easier to understand.

All fair points, but this patch doesn't change the status quo. For example, CgroupSubsystem::active_processor_count() uses os::Linux::active_processor_count() to bound the cores reported to the JVM above by the physical cores (or virtualized cores in a VM). Similarly, CgroupV1Subsystem uses _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(); Again, OS code. The latter, I believe, since day one of the container support code.

I guess it would be conceivable to move the "arbitration" to OSContainer dealing with limits exceeding physical memory. It could either pass upper bounds down to the CG layer or let it do it's thing first and do the sanity checking after. Would something like that be more amenable?

The gist of this patch is code like this:

jlong CgroupV1Subsystem::read_memory_limit_in_bytes() {
  GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes",
                     "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit);
  if (memlimit >= _unlimited_memory) {
   ...
  } else {
    return (jlong)memlimit;
  }

... might return arbitrary large values on some systems (note that _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size() which can be fairly large; we've observed systems where the cgroup interface files have 92233720365056 whereas the _unlimited_memory value was 9223372036854771712). That on a system with 8GB physical memory.

That does not mean that I want a rewrite of this patch. I'm just not sure this is the right direction. I have to think this through some more.

OK.

@tstuefe
Copy link
Member

tstuefe commented Aug 25, 2022

So, before we had:

os::physical_memory()  
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    os::Linux::physical_memory()

Made sense if one thinks of the os layer as the arbiter that has to lie to the caller to abstract aways platform stuff, and the Linux layer as the real truth source.

Similar for available memory:

os::available_memory()
        |
        v
os::Linux::available_memory()
        |
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    or returns host available memory

Already a bit crooked, since here the splicing is done at the ::Linux layer.

With this patch we have:

os::physical_memory()  
        |
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    os::Linux::physical_memory()
        |
        |-----------------------------------------
        |                                        |
        v                                        v
os::Linux::physical_memory()            or returns cgroup limit

and

os::available_memory()
        |
        v
os::Linux::available_memory()
        |
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    or returns host physical memory
        |
        |-----------------------------------------
        |                                        |
        v                                        v
os::Linux::physical_memory()            or returns cgroup limit

This is getting a bit hard to understand. The only one behaving as advertised is os::Linux::physical_memory(), everyone else is somehow splicing other information in. Of course a lot of this is preexisting and has nothing to do with this patch. But it deepens the confusion.

One thing I don't understand is, why does this calculation have to be done at every invocation to os::available_memory()/os::physical_memory()/OSContainer::memory_limit_in_bytes() etc? Yes, cgroup limits can change, but AFAIK there is nothing in the VM that can react to these changes anyway. Heap geometry etc. gets sorted out at VM start.

So why do we do not just do this:

  // We need to update the amount of physical memory now that
  // cgroup subsystem files have been processed.
-  if ((mem_limit = cgroup_subsystem->memory_limit_in_bytes()) > 0) {
+  size_t mem_limit = cgroup_subsystem->memory_limit_in_bytes();
+  if (mem_limit > 0 && mem_limit < os::Linux::physical_limit()) {
    os::Linux::set_physical_memory(mem_limit);
    log_info(os, container)("Memory Limit is: " JLONG_FORMAT, mem_limit);
  }

and be done with it?

@tstuefe
Copy link
Member

tstuefe commented Aug 25, 2022

The gist of this patch is code like this:

jlong CgroupV1Subsystem::read_memory_limit_in_bytes() {
  GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes",
                     "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit);
  if (memlimit >= _unlimited_memory) {
   ...
  } else {
    return (jlong)memlimit;
  }

... might return arbitrary large values on some systems (note that _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size() which can be fairly large; we've observed systems where the cgroup interface files have 92233720365056 whereas the _unlimited_memory value was 9223372036854771712). That on a system with 8GB physical memory.

Okay, makes sense to fix it. But why not return "invalid" or "not set" and give the caller the responsibility to deal with it?

@tstuefe
Copy link
Member

tstuefe commented Aug 25, 2022

I think the gist of my remark is that I would like the layers to behave consistently.

I see that CgroupSubsystem::memory_limit_in_bytes() is only used in two places, os::Linux::available_memory() and os::physical_memory.

I would say let the os layer lie and Linux and CgroupSystem be the truth. Then we end up with a clear hierarchy:

  • let os::Linux::available_memory() and os::Linux::physical_memory() return the pure host values
  • let the cgroup system return the pure cgroup values
  • let os::available_memory() and os::physical_memory() return either one or the other depending on what makes sense.

In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).

Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.

@jerboaa
Copy link
Contributor

jerboaa commented Aug 25, 2022

The gist of this patch is code like this:

jlong CgroupV1Subsystem::read_memory_limit_in_bytes() {
  GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes",
                     "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit);
  if (memlimit >= _unlimited_memory) {
   ...
  } else {
    return (jlong)memlimit;
  }

... might return arbitrary large values on some systems (note that _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size() which can be fairly large; we've observed systems where the cgroup interface files have 92233720365056 whereas the _unlimited_memory value was 9223372036854771712). That on a system with 8GB physical memory.

Okay, makes sense to fix it. But why not return "invalid" or "not set" and give the caller the responsibility to deal with it?

Because you don't know? There is nothing in the cg1 interface files which would tell you that. So you have to come up with a heuristic for "unlimited". For cg2 you have max, cg1 just contains random large numbers (if unset).

@tstuefe
Copy link
Member

tstuefe commented Aug 25, 2022

The gist of this patch is code like this:

jlong CgroupV1Subsystem::read_memory_limit_in_bytes() {
  GET_CONTAINER_INFO(julong, _memory->controller(), "/memory.limit_in_bytes",
                     "Memory Limit is: " JULONG_FORMAT, JULONG_FORMAT, memlimit);
  if (memlimit >= _unlimited_memory) {
   ...
  } else {
    return (jlong)memlimit;
  }

Because you don't know? There is nothing in the cg1 interface files which would tell you that. So you have to come up with a heuristic for "unlimited". For cg2 you have max, cg1 just contains random large numbers (if unset).

Oh, ok. Fair enough. Then my only question is at what layer we want the heuristic to happen.

@jerboaa
Copy link
Contributor

jerboaa commented Aug 25, 2022

I think the gist of my remark is that I would like the layers to behave consistently.

I see that CgroupSubsystem::memory_limit_in_bytes() is only used in two places, os::Linux::available_memory() and os::physical_memory.

You mean OSContainer::memory_limit_in_bytes() right?

I would say let the os layer lie and Linux and CgroupSystem be the truth. Then we end up with a clear hierarchy:

There is also this OSContainer hybrid ;-)

* let `os::Linux::available_memory()` and `os::Linux::physical_memory()` return the pure host values
* let the cgroup system return the pure cgroup values
* let `os::available_memory()` and `os::physical_memory()` return either one or the other depending on what makes sense.

In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).

Sounds reasonable to me.

Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.

+1

@tstuefe
Copy link
Member

tstuefe commented Aug 25, 2022

I think the gist of my remark is that I would like the layers to behave consistently.
I see that CgroupSubsystem::memory_limit_in_bytes() is only used in two places, os::Linux::available_memory() and os::physical_memory.

You mean OSContainer::memory_limit_in_bytes() right?

Right.

I would say let the os layer lie and Linux and CgroupSystem be the truth. Then we end up with a clear hierarchy:

There is also this OSContainer hybrid ;-)

Yes, that is what I meant with "CgroupSystem". Sorry, I collapsed that with its CGV1/V2 implementations for less confusion :)

@jmtd
Copy link
Author

jmtd commented Aug 26, 2022

Hi @tstuefe ,

Thanks for the detailed explanation (and diagrams!) of your concern about the complexity of this. I understand what you mean. This PR was my first proper look at these subsystems, and I had some trouble unwinding it all.

The structuring you lay out in this comment sounds good to me, with some caveats:

In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).

Sadly, in the case of cgroups v1, not only is there no notion of "invalid" to report, but we also need to know, at that level of abstraction, if the values we are reading exceed physical memory, because that's used to decide whether to check for a "heirarchical memory limit" or not. So, even after arranging things as you suggest, there will be a need for the bottom cgroups layer (in the v1 case) to call up into the Linux:: layer.

@jerboaa :

There is also this OSContainer hybrid ;-)

I got the impression this exists because the abstractions as they existed prior to v2 support needed to be extended to support cgroups v2 later on. As part of revisiting this, I wonder if we could merge OSContainer/CgroupSystem.

@tstuefe:

Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.

It would be my preference to do this in a separate RFE if that's acceptable to you. In an ideal world we would get this bug fix into jdk mainline this side of the October CPU cut-off (Aug 30 I think), as I also plan to backport to 17, 11 and 8. If you are happy with that, please mark "reviewed" in GitHub and I'll integrate and raise the RFE issue.

@tstuefe
Copy link
Member

tstuefe commented Aug 26, 2022

Hi @tstuefe ,

Thanks for the detailed explanation (and diagrams!) of your concern about the complexity of this. I understand what you mean. This PR was my first proper look at these subsystems, and I had some trouble unwinding it all.

The structuring you lay out in this comment sounds good to me, with some caveats:

In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible).

Sadly, in the case of cgroups v1, not only is there no notion of "invalid" to report, but we also need to know, at that level of abstraction, if the values we are reading exceed physical memory, because that's used to decide whether to check for a "heirarchical memory limit" or not. So, even after arranging things as you suggest, there will be a need for the bottom cgroups layer (in the v1 case) to call up into the Linux:: layer.

Not necessarily. You could hand in an upper reasonable bound as argument.

But I'd be happy if we can get at least to a consistent layering like this:

  • os::Linux::get_xxx_memory -> the bottom layer, looks at host values
  • Cgroup subsystem -> middle layer, has access to os::Linux::get_xxx_memory
  • os::get_xxxx_memory() -> top, uses both Cgroup subsystem and os::Linux::get_xxx_memory.

@jerboaa :

There is also this OSContainer hybrid ;-)

I got the impression this exists because the abstractions as they existed prior to v2 support needed to be extended to support cgroups v2 later on. As part of revisiting this, I wonder if we could merge OSContainer/CgroupSystem.

+1

@tstuefe:

Would that make sense? I don't think this would be a huge effort. We also could do it in a separate RFE.

It would be my preference to do this in a separate RFE if that's acceptable to you. In an ideal world we would get this bug fix into jdk mainline this side of the October CPU cut-off (Aug 30 I think), as I also plan to backport to 17, 11 and 8. If you are happy with that, please mark "reviewed" in GitHub and I'll integrate and raise the RFE issue.

Okay, lets ship this.

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jmtd
Copy link
Author

jmtd commented Aug 26, 2022

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Aug 26, 2022
@openjdk
Copy link

openjdk bot commented Aug 26, 2022

@jmtd
Your change (at version af509e1) is now ready to be sponsored by a Committer.

@jerboaa
Copy link
Contributor

jerboaa commented Aug 26, 2022

/sponsor

@jerboaa
Copy link
Contributor

jerboaa commented Aug 26, 2022

@jmtd Please file an RFE ticket for the proposed API cleanup. Thanks!

@openjdk
Copy link

openjdk bot commented Aug 26, 2022

Going to push as commit f694f8a.
Since your change was applied there have been 59 commits pushed to the master branch:

  • f91943c: 8292868: Explicitly pass a third temp register to MacroAssembler::store_heap_oop for aarch64
  • 6354a57: 8290711: assert(false) failed: infinite loop in PhaseIterGVN::optimize
  • 3e18773: 8292880: Improve debuggee logging for com/sun/jdi/ClassUnloadEventTest.java
  • 909e1ed: 8292919: Build failure due to UseJVMCICompiler was not declared when C2 is disabled after JDK-8292691
  • 55f5a83: 8282410: Remove SA ProcDebugger support
  • d83faea: 8292250: Create test for co-located JDI MethodEntry, Step, and Breakpoint events
  • e353b57: 8292890: Remove PrintTouchedMethodsAtExit and LogTouchedMethods
  • 95a33fe: 8292314: Cleanup legacy address handling
  • 5d799d8: 8292304: [REDO] JDK-8289208 Test DrawRotatedStringUsingRotatedFont.java occasionally crashes on MacOS
  • 4f50316: 8292680: Convert Dictionary to ConcurrentHashTable
  • ... and 49 more: https://git.openjdk.org/jdk/compare/e561933907bbab0a42f1796fa12f582b3a347312...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 26, 2022
@openjdk openjdk bot closed this Aug 26, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Aug 26, 2022
@openjdk
Copy link

openjdk bot commented Aug 26, 2022

@jerboaa @jmtd Pushed as commit f694f8a.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@jmtd
Copy link
Author

jmtd commented Aug 26, 2022

@jmtd Please file an RFE ticket for the proposed API cleanup. Thanks!

I've filed https://bugs.openjdk.org/browse/JDK-8292984 . Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
5 participants