8292083: Detected container memory limit may exceed physical machine memory #9880
Conversation
Force-pushed from 170257f to 0bf9076.
The memory limit within a cgroup might be higher than the amount of physical memory, so clamp it.
Signed-off-by: Jonathan Dowland <jdowland@redhat.com>
Force-pushed from 0bf9076 to 2de864c.
Webrevs
This looks mostly good. I'd prefer if we changed the test to not rely on InitialHeapSize, as that might get ergonomically set.
Thanks for the review! I agree with all your points.
Hmm, I see what you mean, yes. I liked checking a Flag, versus (or as well as) checking a trace log line, as I felt it gave better assurance. But even with
I've filed https://bugs.openjdk.org/browse/JDK-8292541 for the Java side.
Hence my suggestion to check what
Thanks Severin
* Refactor the ternary expression into an if/else chain and expand it to the third case (memory limit equal to or exceeding physical RAM)
* Format the trace log message for that case to match that of the other two
* Adjust the other two to incorporate physical RAM into the log message
@jmtd Please do not rebase or force-push to an active PR as it invalidates existing review comments. All changes will be squashed into a single commit automatically when integrating. See OpenJDK Developers’ Guide for more information.
* split assignment to mem_limit from reading it
* nest if expressions to avoid comparing mem_limit twice
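The clamping logic these two commits describe has roughly the following shape. This is a hedged sketch only, with illustrative variable names and log messages, assuming hotspot's OSContainer API and unified logging macros are in scope; it is not the exact code in the patch.

```cpp
// Hedged sketch of the nested if/else chain described in the commit messages.
// phys_mem is the host's physical RAM; names and messages are illustrative only.
julong effective_memory_sketch(julong phys_mem) {
  jlong mem_limit = OSContainer::memory_limit_in_bytes();  // read once, reuse below
  if (mem_limit > 0) {
    if ((julong)mem_limit < phys_mem) {
      // A sane container limit below physical RAM: honour it.
      log_trace(os, container)("container memory limit " JLONG_FORMAT
                               " below physical memory " JULONG_FORMAT ", using limit",
                               mem_limit, phys_mem);
      return (julong)mem_limit;
    }
    // Third case: limit equal to or exceeding physical RAM: ignore it.
    log_trace(os, container)("container memory limit " JLONG_FORMAT
                             " ignored, exceeds physical memory " JULONG_FORMAT,
                             mem_limit, phys_mem);
  } else {
    // Unlimited (or unreadable): fall back to physical RAM.
    log_trace(os, container)("container memory unlimited, using physical memory "
                             JULONG_FORMAT, phys_mem);
  }
  return phys_mem;
}
```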
Come to think of this some more, the better way to fix this would be to change OSContainer::memory_limit_in_bytes() to return -1 (for unlimited) if the detected limit exceeds host memory, rather than only doing the clamping in os::physical_memory().
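As a rough illustration of that suggestion (not the eventual patch), the clamp could live in the container layer itself, so callers simply see "unlimited". read_cgroup_memory_limit() below is a hypothetical placeholder for the existing cgroup file reads.

```cpp
// Hypothetical sketch of the reviewer's suggestion: report -1 (unlimited)
// from the container layer when the detected limit is not credible.
jlong memory_limit_in_bytes_sketch() {
  jlong limit = read_cgroup_memory_limit();          // hypothetical raw read
  julong host_mem = os::Linux::physical_memory();    // host RAM, unaffected by cgroups
  if (limit <= 0 || (julong)limit >= host_mem) {
    return -1;  // unset, or at/above host RAM: treat as unlimited
  }
  return limit;
}
```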
Cgroups code used this to override the real host RAM value with the container memory limit. We don't do this any more, so this routine is not needed. Linux::physical_memory()/_physical_memory will now always correspond to the host's physical RAM, unaffected by cgroups limits.
Remove unneeded local variable host_memory
Rewrite the test to run two containers. On the first run, capture the logging to get the reported physical memory size and derive a bad value from it (×10). On the second run, set the container memory limit to the bad value and check the trace log for a line indicating this was detected and ignored.
It appears
GHA reports that 1/9 tests in linux-x86 / test (hs/tier1 gc) failed. Digging into the logs, this was gc/metaspace/TestMetaspacePerfCounters.java#Shenandoah-32, which seems unrelated to these changes. I've just run this one locally and it passed for me.
Hmm, I get uncomfortable if APIs mix layers and try to be smart. As in this case, an API that claims to return the CG mem limit but mixes in knowledge about the total physical memory. Then we have os::Linux::available_memory() in turn reporting the CG mem limit instead if we are containerized. It is getting hard to understand who reports what under which conditions. In general, I prefer dumb APIs that do exactly what they are named for, and some arbitration layer atop of them resolving conflicts and deciding whose values to use. That makes the code much easier to understand. That does not mean that I want a rewrite of this patch. I'm just not sure this is the right direction. I have to think this through some more.
All fair points, but this patch doesn't change the status quo. For example, I guess it would be conceivable to move the "arbitration" to

The gist of this patch is code like this:

... might return arbitrarily large values on some systems (note that
OK.
So, before we had:
Made sense if one thinks of the

Similar for available memory:
Already a bit crooked, since here the splicing is done at the

With this patch we have:
and
This is getting a bit hard to understand. The only one behaving as advertised is

One thing I don't understand is, why does this calculation have to be done at every invocation of os::available_memory()/os::physical_memory()/OSContainer::memory_limit_in_bytes() etc.? Yes, cgroup limits can change, but AFAIK there is nothing in the VM that can react to these changes anyway. Heap geometry etc. gets sorted out at VM start. So why do we not just do this:

and be done with it?
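The code blocks from this exchange aren't preserved above; as a stand-in for the "do it once" idea, here is an illustrative sketch (hypothetical field and function names, not the reviewer's actual snippet) of resolving the effective value a single time during VM initialization:

```cpp
// Illustrative only: derive the effective memory ceiling once at startup and
// cache it, rather than re-reading cgroup files on every query.
static julong _effective_memory = 0;   // hypothetical cached field

void init_effective_memory_sketch() {
  julong phys_mem = os::Linux::physical_memory();
  jlong cg_limit = OSContainer::is_containerized()
                   ? OSContainer::memory_limit_in_bytes() : -1;
  // Use the container limit only if it is set and plausible.
  _effective_memory = (cg_limit > 0 && (julong)cg_limit < phys_mem)
                      ? (julong)cg_limit : phys_mem;
}

julong physical_memory_sketch() {
  return _effective_memory;            // no per-call cgroup reads
}
```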
Okay, makes sense to fix it. But why not return "invalid" or "not set" and give the caller the responsibility to deal with it?
I think the gist of my remark is that I would like the layers to behave consistently. I see that

I would say let the

In addition, let the cgroup subsystem return defined values for "invalid" (if that is possible). Would that make sense? I don't think this would be a huge effort. We could also do it in a separate RFE.
Because you don't know? There is nothing in the cg1 interface files which would tell you that. So you have to come up with a heuristic for "unlimited". For cg2 you have
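For illustration of why the heuristic is needed: when no limit is set, cgroup v1's memory.limit_in_bytes reports a huge page-aligned number (roughly LONG_MAX rounded down to a page boundary), whereas cgroup v2's memory.max contains the literal string "max". A hedged, standalone sketch of telling the two apart (parse_memory_limit() is a made-up helper, not JDK code):

```cpp
#include <climits>
#include <cstdlib>
#include <cstring>

// Illustrative only: map a raw limit string to a value, with -1 meaning "unlimited".
long parse_memory_limit(const char* line, long page_size) {
  // cgroup v2: an unset memory.max is the literal string "max".
  if (strcmp(line, "max") == 0) {
    return -1;
  }
  long value = atol(line);
  // cgroup v1: an unset memory.limit_in_bytes shows up as a very large number,
  // so "unlimited" has to be inferred from the magnitude alone.
  long v1_unlimited = (LONG_MAX / page_size) * page_size;
  if (value >= v1_unlimited) {
    return -1;
  }
  return value;
}
```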
Oh, ok. Fair enough. Then my only question is at what layer we want the heuristic to happen.
You mean
There is also this
Sounds reasonable to me.
+1
Right.
Yes, that is what I meant by "CgroupSystem". Sorry, I collapsed that with its CGV1/V2 implementations for less confusion :)
Hi @tstuefe, thanks for the detailed explanation (and diagrams!) of your concern about the complexity of this. I understand what you mean. This PR was my first proper look at these subsystems, and I had some trouble unwinding it all. The structuring you lay out in this comment sounds good to me, with some caveats:
Sadly, in the case of cgroups v1, not only is there no notion of "invalid" to report, but we also need to know, at that level of abstraction, if the values we are reading exceed physical memory, because that's used to decide whether to check for a "hierarchical memory limit" or not. So, even after arranging things as you suggest, there will be a need for the bottom cgroups layer (in the v1 case) to call up into the Linux:: layer. @jerboaa:
I got the impression this exists because the abstractions as they existed prior to v2 support needed to be extended to support cgroups v2 later on. As part of revisiting this, I wonder if we could merge OSContainer/CgroupSystem.
It would be my preference to do this in a separate RFE if that's acceptable to you. In an ideal world we would get this bug fix into jdk mainline this side of the October CPU cut-off (Aug 30 I think), as I also plan to backport to 17, 11 and 8. If you are happy with that, please mark "reviewed" in GitHub and I'll integrate and raise the RFE issue.
Not necessarily. You could hand in a reasonable upper bound as an argument. But I'd be happy if we can get at least to a consistent layering like this:
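The layering diagram from the original comment isn't reproduced above. As a stand-in for the "hand in an upper reasonable bound as argument" idea, a hedged sketch (hypothetical names) in which the cgroup layer never reaches up into the Linux layer:

```cpp
// Illustrative only: the cgroup layer takes a ceiling as an argument instead of
// querying the Linux layer itself, keeping the dependency one-way.
jlong memory_limit_in_bytes_sketch(julong upper_bound) {
  jlong limit = read_cgroup_memory_limit();   // hypothetical raw read
  if (limit <= 0 || (julong)limit >= upper_bound) {
    return -1;  // unset or implausible: report "unlimited"
  }
  return limit;
}

// A caller higher up (e.g. the os layer) supplies the bound:
//   jlong limit = memory_limit_in_bytes_sketch(os::Linux::physical_memory());
```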
+1
Okay, let's ship this.
LGTM
/integrate
/sponsor
@jmtd Please file an RFE ticket for the proposed API cleanup. Thanks!
Going to push as commit f694f8a.
Your commit was automatically rebased without conflicts. |
I've filed https://bugs.openjdk.org/browse/JDK-8292984. Thanks
We discovered some systems configured with cgroups v1 that report a bogus container memory limit value above the physical memory of the host. OpenJDK then calculates flags such as InitialHeapSize based on this invalid value; the result can be larger than the available memory, which can lead to the OS terminating the process due to OOM.
Hotspot's container awareness attempts to sanity-check the limit value by ensuring it's below _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size(), but that still leaves a large range of potential invalid values between physical RAM and that ceiling value.

Cgroups v1 in particular returns an uninitialised value for the memory limit when one has not been explicitly set. Cgroups v2 does not suffer the same problem; however, it's possible for any value to be set for the max memory, including values exceeding the available physical memory, in either v1 or v2.
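To make the size of that gap concrete, a small standalone illustration (the page size and memory figures are assumptions for the example, not measurements; int64_t/INT64_MAX stand in for long/LONG_MAX for portability):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative numbers only: the old ceiling sits so far above real RAM that a
// bogus limit of 100 TiB on a 64 GiB host still passes the old sanity check.
int main() {
  int64_t page_size = 4096;                                    // assumed 4 KiB pages
  int64_t unlimited = (INT64_MAX / page_size) * page_size;     // mirrors _unlimited_memory
  int64_t phys_mem  = 64LL * 1024 * 1024 * 1024;               // example 64 GiB host
  int64_t bogus     = 100LL * 1024 * 1024 * 1024 * 1024;       // bogus 100 TiB "limit"
  printf("passes old ceiling check: %d, exceeds host RAM: %d\n",
         (int)(bogus < unlimited), (int)(bogus > phys_mem));
  return 0;
}
```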
This fixes the problem in two places. Further work may be required in the area of Java metrics / MXBeans. I'd also look again at whether the existing ceiling value _unlimited_memory serves any useful purpose. I personally don't feel those improvements should hold up this fix.

Progress
Issue
Reviewers
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9880/head:pull/9880
$ git checkout pull/9880
Update a local copy of the PR:
$ git checkout pull/9880
$ git pull https://git.openjdk.org/jdk pull/9880/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 9880
View PR using the GUI difftool:
$ git pr show -t 9880
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9880.diff