Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8253435: Cgroup: 'stomping of _mount_path' crash if manually mounted cpusets exist #295

Closed
wants to merge 1 commit into from

Conversation

@simonis
Copy link
Member

@simonis simonis commented Sep 22, 2020

Hi,

can I please have a review (or an idea for a better fix) for this PR?

If a tool like cpuset is used to manually create and manage cpusets the cgroups detections will be confused and crash in a debug build or behave unexpectedly in a product build.

The problem is that the additionally mounted cpuset will be interpreted as if it was belonging to Cgroup controller:

$ grep cgroup /proc/self/mountinfo
36 25 0:30 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:9 - tmpfs tmpfs ro,mode=755
49 36 0:43 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:23 - cgroup cgroup rw,memory
50 36 0:44 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime shared:24 - cgroup cgroup rw,rdma
...
43 36 0:37 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,cpuset
121 32 0:37 / /cpusets rw,relatime shared:69 - cgroup none rw,cpuset

The current fix solves this problem for manually created cpusets which don't have a "mount source" but this is yet another heuristic. I'm open to better solutions for detecting cpusets which don't don't belong to a Cgroup.

Thanks,
Volker


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8253435: Cgroup: 'stomping of _mount_path' crash if manually mounted cpusets exist

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/295/head:pull/295
$ git checkout pull/295

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Sep 22, 2020

👋 Welcome back simonis! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Sep 22, 2020
@openjdk
Copy link

@openjdk openjdk bot commented Sep 22, 2020

@simonis The following label will be automatically applied to this pull request: hotspot-runtime.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label (add|remove) "label" command.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 22, 2020

Webrevs

@jerboaa
Copy link
Contributor

@jerboaa jerboaa commented Sep 22, 2020

Did you run container tests with this?

@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 22, 2020

Mailing list message from Bob Vandette on hotspot-runtime-dev:

Yuk. I just fixed a bug which caused us to use the mount source for the cgroup type. Not fixing
that bug would have hidden your problem.

Are there any hints in /proc/self/cgroup or /proc/self/mounts that we could use to eliminate this manual mount?

I?d be tempted to eliminate mountinfo entries that are 1) duplicate controllers and 2) not in ?/sys/fs/cgroup? mount point.

Bob.

@simonis
Copy link
Member Author

@simonis simonis commented Sep 23, 2020

Did you run container tests with this?

Yes. It took me some time to find out that I have to set -Djdk.test.docker.image.name=ubuntu and -Djdk.test.docker.image.version=18.04 on Ubuntu in order to run them :)

For release builds they all pass with and without the change (except TestJFRWithJMX.java which always fails, but I don' t think that's related to this issue). Fast- and slowdebug builds don't even start without the fix and pass all the tests with it (again except TestJFRWithJMX.java).

@simonis
Copy link
Member Author

@simonis simonis commented Sep 23, 2020

Mailing list message from Bob Vandette on hotspot-runtime-dev:

Yuk. I just fixed a bug which caused us to use the mount source for the cgroup type. Not fixing
that bug would have hidden your problem.

Sorry, but I don't understand. which bug are you speaking of and has it been fixed in the jdk already?

Are there any hints in /proc/self/cgroup or /proc/self/mounts that we could use to eliminate this manual mount?

Not that I'm aware of. I couldn't find any.

I?d be tempted to eliminate mountinfo entries that are 1) duplicate controllers and 2) not in ?/sys/fs/cgroup? mount point.

It's not easy to remove the right duplicate :)

Checking for /sys/fs/cgroup as mount point is probably safer although that path is also just a convention as far as I know.

So what about the following solution:

  • record the mount point for memory, cpu and cpuacct
  • when hitting cpuset and its mountpoint is different from the recorded one ignore it. If there's no recorded mount prefix ignore the entry if its mount prefix is not /sys/fs/cgroup.

I think this is the best we can do if we don't want to parse mountinfo two times.

What do you think?

Bob.

@jerboaa
Copy link
Contributor

@jerboaa jerboaa commented Sep 23, 2020

Did you run container tests with this?

Yes. It took me some time to find out that I have to set -Djdk.test.docker.image.name=ubuntu and -Djdk.test.docker.image.version=18.04 on Ubuntu in order to run them :)

Yes, this is a bit painful. I should have said that.

For release builds they all pass with and without the change (except TestJFRWithJMX.java which always fails, but I don' t think that's related to this issue). Fast- and slowdebug builds don't even start without the fix and pass all the tests with it (again except TestJFRWithJMX.java).

Hmm, which ones did you run? It seems odd that they fail to run in fastdebug config.

FWIW, I've crafted a regression test for this issue. Please include something like that if you can:
simonis#1

A fix like this should make it pass (uses the /sys/fs/cgroup convention) - on top of your change:

diff --git a/src/hotspot/os/linux/cgroupSubsystem_linux.cpp b/src/hotspot/os/linux/cgroupSubsystem_linux.cpp
index 21be7a8260c..4c6ae541929 100644
--- a/src/hotspot/os/linux/cgroupSubsystem_linux.cpp
+++ b/src/hotspot/os/linux/cgroupSubsystem_linux.cpp
@@ -295,10 +295,10 @@ bool CgroupSubsystemFactory::determine_type(CgroupInfo* cg_infos,
         // Skip cgroup2 fs lines on hybrid or unified hierarchy.
         continue;
       }
-      if (strcmp("none", tmpsource) == 0) {
-        // Skip cpusets created manually or by cset/cpuset (https://github.com/lpechacek/cpuset)
-        // The "mount source" for these mounts is usually "none" while the source of "true" Cgroup
-        // controllers is usually "cgroup". But this is just another heuristic...
+      if (strcmp(tmpmount, "/sys/fs/cgroup") < 0) {
+        // Skip potentially duplicate, manually mounted cgroup controllers
+        // not on /sys/fs/cgroup
+        log_info(os, container)("%s not mounted at /sys/fs/cgroup, skipping!", tmpmount);

@jerboaa
Copy link
Contributor

@jerboaa jerboaa commented Sep 23, 2020

So what about the following solution:

* record the mount point for `memory`, `cpu` and `cpuacct`
* when hitting `cpuset` and its mountpoint is different from the recorded one ignore it. If there's no recorded mount prefix ignore the entry if its mount prefix is not `/sys/fs/cgroup`.

Sounds sensible to me.

@jerboaa
Copy link
Contributor

@jerboaa jerboaa commented Sep 23, 2020

Mailing list message from Bob Vandette on hotspot-runtime-dev:
Yuk. I just fixed a bug which caused us to use the mount source for the cgroup type. Not fixing
that bug would have hidden your problem.

Sorry, but I don't understand. which bug are you speaking of and has it been fixed in the jdk already?

@bobvandette probably meant https://bugs.openjdk.java.net/browse/JDK-8252359. A little correction, though. We didn't use the mount source as the cgroup type before JDK-8252359, but we relied on the mount source to be cgroup or cgroup2 before JDK-8252359, which wasn't the case on those affected systems. They had the controller name as the mount source.

Bob is right, though, prior JDK-8252359, you wouldn't have hit the assert because of what I just said above. Your extra cpuset entries have none as mount source.

@jerboaa
Copy link
Contributor

@jerboaa jerboaa commented Sep 23, 2020

Did you run container tests with this?

For the record, the important one to run is this one:

test/hotspot/jtreg/containers/cgroup/CgroupSubsystemFactory.java

It's independent of your hosts cgroup files. I believe that test broke with your proposed v1 fix because after your patch any none entries would be skipped. All of them are none for this test after JDK-8252359.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 23, 2020

Mailing list message from Bob Vandette on hotspot-runtime-dev:

On Sep 23, 2020, at 5:41 AM, Severin Gehwolf <sgehwolf at openjdk.java.net> wrote:

On Wed, 23 Sep 2020 09:15:38 GMT, Volker Simonis <simonis at openjdk.org> wrote:

_Mailing list message from [Bob Vandette](mailto:bob.vandette at oracle.com) on
[hotspot-runtime-dev](mailto:hotspot-runtime-dev at openjdk.java.net):_ Yuk. I just fixed a bug which caused us to use the
mount source for the cgroup type. Not fixing that bug would have hidden your problem.

Sorry, but I don't understand. which bug are you speaking of and has it been fixed in the jdk already?

@bobvandette probably meant https://bugs.openjdk.java.net/browse/JDK-8252359. A little correction, though. We didn't
use the mount source as the cgroup type before JDK-8252359, but we relied on the mount source to be `cgroup` or
`cgroup2` before JDK-8252359, which wasn't the case on those affected systems. They had the controller name as the
mount source.

That?s the bug I was referring to.

Bob.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 23, 2020

Mailing list message from Bob Vandette on hotspot-runtime-dev:

On Sep 23, 2020, at 5:34 AM, Severin Gehwolf <sgehwolf at openjdk.java.net> wrote:

On Wed, 23 Sep 2020 09:15:38 GMT, Volker Simonis <simonis at openjdk.org> wrote:

So what about the following solution:

* record the mount point for `memory`, `cpu` and `cpuacct`
* when hitting `cpuset` and its mountpoint is different from the recorded one ignore it. If there's no recorded mount
prefix ignore the entry if its mount prefix is not `/sys/fs/cgroup`.

Sounds sensible to me.

I?m ok with that approach as well.

Bob.

@simonis simonis force-pushed the JDK-8253435-cpusets-fix branch from 568c489 to 2949e5f Sep 24, 2020
@simonis
Copy link
Member Author

@simonis simonis commented Sep 24, 2020

Hi Severin, Bob,

so here comes the new version as discussed. @jerboaa thanks for the additional test. I've merged it and extended it such that it checks both variants, when the manual csets controller comes before and when it comes after all the other controllers to exercise both code path in the detection.

All container tests pass now (except TestJFRWithJMX.java which doesn't seem to be related to this issue and change).

Are you OK with the change?

Thank you and best regards,
Volker

Copy link
Contributor

@jerboaa jerboaa left a comment

I'm not sure this extra gymnastics will be worth the added complexity. This makes the code somewhat harder to follow and, essentially, we end up checking whether or not the cgroup controller is being mounted under /sys/fs/cgroup. The code tracks any interesting controller, like memory, cpu, cpuset and cpuacct and records the mount point. It's going to be /sys/fs/cgroup for almost all cases, would it not?

The skip for non-/sys/fs/cgroup controllers ends up being either /fo/bar/baz or whatever the lead-up path to the first seen "interesting" controller is or /sys/fs/cgroup. I'm not convinced it's really anything other than /sys/fs/cgroup. How about a simpler solution like this?

jerboaa@1323315

This seems to work for me.

src/hotspot/os/linux/cgroupSubsystem_linux.cpp Outdated Show resolved Hide resolved
src/hotspot/os/linux/cgroupSubsystem_linux.cpp Outdated Show resolved Hide resolved
@bobvandette
Copy link
Contributor

@bobvandette bobvandette commented Sep 24, 2020

I was expecting to see some logic in this "else if" section that recorded the first occurance but did the validation on the second pass (cg_infos[CPUSET_IDX]._mount_path != NULL). When this situation is detected, we accept the mount with the /sys/fs/cgroup.

    } else if (strcmp(token, "cpuset") == 0) {
      assert(cg_infos[CPUSET_IDX]._mount_path == NULL, "stomping of _mount_path");
      cg_infos[CPUSET_IDX]._mount_path = os::strdup(tmpmount);
      cg_infos[CPUSET_IDX]._root_mount_path = os::strdup(tmproot);
      cg_infos[CPUSET_IDX]._data_complete = true;

@simonis simonis force-pushed the JDK-8253435-cpusets-fix branch from 2949e5f to 7753bc5 Sep 25, 2020
@simonis
Copy link
Member Author

@simonis simonis commented Sep 25, 2020

I was expecting to see some logic in this "else if" section that recorded the first occurance but did the validation on the second pass (cg_infos[CPUSET_IDX]._mount_path != NULL). When this situation is detected, we accept the mount with the /sys/fs/cgroup.

    } else if (strcmp(token, "cpuset") == 0) {
      assert(cg_infos[CPUSET_IDX]._mount_path == NULL, "stomping of _mount_path");
      cg_infos[CPUSET_IDX]._mount_path = os::strdup(tmpmount);
      cg_infos[CPUSET_IDX]._root_mount_path = os::strdup(tmproot);
      cg_infos[CPUSET_IDX]._data_complete = true;

You're right, the logic to ignore a cpusets entry should be moved into the else if (strcmp(token, "cpuset") == 0) and I've done that in the new version of my patch.

However, the problem is that we don't know if the first or the second occurrence of cpusets´ is *the right one*. I'm also not sure if the Cgroup controllers are ALWAYS mounted under /sys/fs/cgroup` (see my answer to Severin's comments). If you can confirm that that's really the case, we could go with the simple solution proposed by Severin.

Otherwise, my current solution tries to be conservative and does not assume a predefined mount point for Cgroup controllers. Instead it records the mount point of the first controller out of cpu, cpuacct and memory and checks that all the other controllers from this set are mounted under the same location. In addition, it only accepts cpusets if it is under the same mount point like the other controllers or under /sys/fs/cgroup if no other controller has been seen before.

@bobvandette
Copy link
Contributor

@bobvandette bobvandette commented Sep 25, 2020

In my suggestion, it doesn't matter which entry is first. If we see the manual on first, record it. When the second one comes around, replace the first one if it's /sys/fs/cgroup. In the very unlikely event that there isn't a second non-manual one, I think we still want to record the manaul mount point since there could be a cpuset limit setup which we should respect.

As for the second point about /sys/fs/cgroup, I think that using this string is just as good if not better than assuming they are all mounted in the same subdirectory.

If you follow my suggestion, then we will only be subjected to a failure to mount if there are two cpuset mount entries AND neither are mounted on /sys/fs/cgroup/cpuset.

@jerboaa
Copy link
Contributor

@jerboaa jerboaa commented Sep 25, 2020

In my suggestion, it doesn't matter which entry is first. If we see the manual on first, record it. When the second one comes around, replace the first one if it's /sys/fs/cgroup. In the very unlikely event that there isn't a second non-manual one, I think we still want to record the manaul mount point since there could be a cpuset limit setup which we should respect.

As for the second point about /sys/fs/cgroup, I think that using this string is just as good if not better than assuming they are all mounted in the same subdirectory.

If you follow my suggestion, then we will only be subjected to a failure to mount if there are two cpuset mount entries AND neither are mounted on /sys/fs/cgroup/cpuset.

Like this perhaps?
jerboaa@45608cf

I tend to think that in actual container workloads it might be unlikely to actually see multiple cpuset mounts. So in a sense that's a special case. The general case, before this bug, shouldn't be penalized. So perhaps the above would be worth considering.

@bobvandette
Copy link
Contributor

@bobvandette bobvandette commented Sep 25, 2020

In my suggestion, it doesn't matter which entry is first. If we see the manual on first, record it. When the second one comes around, replace the first one if it's /sys/fs/cgroup. In the very unlikely event that there isn't a second non-manual one, I think we still want to record the manaul mount point since there could be a cpuset limit setup which we should respect.
As for the second point about /sys/fs/cgroup, I think that using this string is just as good if not better than assuming they are all mounted in the same subdirectory.
If you follow my suggestion, then we will only be subjected to a failure to mount if there are two cpuset mount entries AND neither are mounted on /sys/fs/cgroup/cpuset.

Like this perhaps?
jerboaa@45608cf

I tend to think that in actual container workloads it might be unlikely to actually see multiple cpuset mounts. So in a sense that's a special case. The general case, before this bug, shouldn't be penalized. So perhaps the above would be worth considering.

Yes, that's exactly what I was thinking. In the manual case from your example, we'd record "/" as the mount point if it showed up first and then overwrite it if /sys/fs/cgroup came along.

@simonis simonis force-pushed the JDK-8253435-cpusets-fix branch from 7753bc5 to b3d0f28 Sep 28, 2020
@simonis
Copy link
Member Author

@simonis simonis commented Sep 28, 2020

OK, I just want to get this done so I can finally use debug builds again.

I've picked your version now in the hope that you'll review that :)

I only changed the condition

strcmp(cg_infos[CPUSET_IDX]._mount_path, "/sys/fs/cgroup") < 0

to

strstr(cg_infos[CPUSET_IDX]._mount_path, "/sys/fs/cgroup") != cg_infos[CPUSET_IDX]._mount_path

otherwise you'd still choose the alternative cpuset if that was mounted on a mount point which is lexicographically after /sys/fs/cgroup (e.g. /tmp/cgroups).

I've also adapted the logs to reflect the fact that the current solution will simply choose the cpusets from the second mount in the (unusual?) case where the "normal" Cgroups are not mounted to /sys/fs/cgroup.

Hope that's still fine.

Copy link
Contributor

@jerboaa jerboaa left a comment

Looks fine to me. Thanks for your patience!

@openjdk
Copy link

@openjdk openjdk bot commented Sep 28, 2020

@simonis This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for more details.

After integration, the commit message for the final commit will be:

8253435: Cgroup: 'stomping of _mount_path' crash if manually mounted cpusets exist

Reviewed-by: sgehwolf, bobv

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 85 new commits pushed to the master branch:

  • 8e338f6: 8253646: ZGC: Avoid overhead of sorting ZStatIterableValues on bootstrap
  • ec9bee6: 8253015: Aarch64: Move linux code out from generic CPU feature detection
  • 16b8c39: 8253053: Javadoc clean up in Authenticator and BasicAuthenicator
  • 840aa2b: 8253424: Add support for running pre-submit testing using GitHub Actions
  • 8e87d46: 8252857: AArch64: Shenandoah C1 CAS is not sequentially consistent
  • c2692f8: 8225329: -XX:+PrintBiasedLockingStatistics causes crash during initia…
  • e9c1782: 8252752: Clear card table for old regions during scan in G1
  • 276fcee: 8252835: Revert fix for JDK-8246051
  • ca1ed16: 8253639: Change os::attempt_reserve_memory_at parameter order
  • fed3636: 8252219: C2: Randomize IGVN worklist for stress testing
  • ... and 75 more: https://git.openjdk.java.net/jdk/compare/2e30ff61b0394793655d6c6a593a5de8eb9c07b9...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Sep 28, 2020
@simonis
Copy link
Member Author

@simonis simonis commented Sep 28, 2020

Looks fine to me. Thanks for your patience!

Thanks Severin.

I'll wait one more day to also give Bob a chance to look at the final version and push after that.

Best regards,
Volker

@bobvandette
Copy link
Contributor

@bobvandette bobvandette commented Sep 28, 2020

Looks good Volker.

@mlbridge
Copy link

@mlbridge mlbridge bot commented Sep 28, 2020

Mailing list message from Bob Vandette on hotspot-runtime-dev:

Looks good Volker.

Bob.

@simonis
Copy link
Member Author

@simonis simonis commented Sep 28, 2020

Looks good Volker.

Thanks Bob.

@simonis
Copy link
Member Author

@simonis simonis commented Sep 28, 2020

/reviewer credit @bobvandette

@openjdk
Copy link

@openjdk openjdk bot commented Sep 28, 2020

@simonis
Reviewer bobv successfully credited.

@simonis
Copy link
Member Author

@simonis simonis commented Sep 28, 2020

/integrate

@openjdk openjdk bot closed this Sep 28, 2020
@openjdk openjdk bot added integrated and removed ready rfr labels Sep 28, 2020
@openjdk
Copy link

@openjdk openjdk bot commented Sep 28, 2020

@simonis Since your change was applied there have been 85 commits pushed to the master branch:

  • 8e338f6: 8253646: ZGC: Avoid overhead of sorting ZStatIterableValues on bootstrap
  • ec9bee6: 8253015: Aarch64: Move linux code out from generic CPU feature detection
  • 16b8c39: 8253053: Javadoc clean up in Authenticator and BasicAuthenicator
  • 840aa2b: 8253424: Add support for running pre-submit testing using GitHub Actions
  • 8e87d46: 8252857: AArch64: Shenandoah C1 CAS is not sequentially consistent
  • c2692f8: 8225329: -XX:+PrintBiasedLockingStatistics causes crash during initia…
  • e9c1782: 8252752: Clear card table for old regions during scan in G1
  • 276fcee: 8252835: Revert fix for JDK-8246051
  • ca1ed16: 8253639: Change os::attempt_reserve_memory_at parameter order
  • fed3636: 8252219: C2: Randomize IGVN worklist for stress testing
  • ... and 75 more: https://git.openjdk.java.net/jdk/compare/2e30ff61b0394793655d6c6a593a5de8eb9c07b9...master

Your commit was automatically rebased without conflicts.

Pushed as commit 0054c15.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants