
BUILD-278: fix cgroupv2 memory max defaulting #252

Merged
merged 2 commits into openshift:master from gabemontero:cgroup2-retry Aug 24, 2021

Conversation

gabemontero
Contributor

See results from #246 (comment) and later discussion along with openshift/release#19115

This is an attempt to adjust and address the various concerns.

/assign @kolyshkin
/assign @nalind
/assign @adambkaplan

@vrutkovs FYI

@gabemontero
Contributor Author

/hold

@vrutkovs has openshift/release#20125 up so that we can test these cgroupv2 changes via a new optional e2e job (thanks @vrutkovs !!)

once that merges, we'll drive testing of this change from that job

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 13, 2021
@vrutkovs
Member

/test e2e-aws-cgroupsv2

@vrutkovs
Member

again

error: failed to retrieve cgroup limits: cannot determine cgroup limits: open /sys/fs/cgroup/memory.max: no such file or directory

in the build logs

@gabemontero
Contributor Author

again

error: failed to retrieve cgroup limits: cannot determine cgroup limits: open /sys/fs/cgroup/memory.max: no such file or directory

in the build logs

yep

I'll start adding debug prints along the various paths here in this PR and we'll go from there.
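For context, here is a minimal Go sketch of the kind of cgroup v2 detection and defaulting this PR is aiming at (hypothetical helper names and package, not the actual diff): treat the node as cgroup v2 when /sys/fs/cgroup/cgroup.controllers exists, and fall back to an "unlimited" default instead of failing when memory.max is absent or reads "max".

    package cgroups

    import (
        "bytes"
        "math"
        "os"
        "strconv"
    )

    // isCgroupV2 reports whether the unified (v2) hierarchy is mounted, using
    // the presence of cgroup.controllers as the signal.
    func isCgroupV2() bool {
        _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
        return err == nil
    }

    // memoryLimitBytes returns the effective cgroup v2 memory limit, treating
    // a missing or "max" memory.max as "no limit" rather than a fatal error.
    func memoryLimitBytes() int64 {
        data, err := os.ReadFile("/sys/fs/cgroup/memory.max")
        if err != nil {
            // No memory.max for this cgroup: default instead of failing the build.
            return math.MaxInt64
        }
        s := string(bytes.TrimSpace(data))
        if s == "max" {
            return math.MaxInt64
        }
        v, err := strconv.ParseInt(s, 10, 64)
        if err != nil {
            return math.MaxInt64
        }
        return v
    }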

@gabemontero
Contributor Author

I may have figured out what was wrong with the patch here while adding debug. Will add that change in addition to the debug. Should be up shortly.

@gabemontero
Contributor Author

AWS VpcLimitExceeded error on last e2e-aws-builds failure

will retest after the other e2e's come in just in case this is widespread

@vrutkovs
Member

/test e2e-aws-cgroupsv2

@gabemontero
Contributor Author

gabemontero commented Jul 14, 2021

A combination of incorrect debug output and, perhaps, some progress on the error ... the message is slightly different. Looking.

Log Tail:	GGM cgroup2 true
		GGM file /sys/fs/cgroup/memory.max go err open /sys/fs/cgroup/memory.max: no such file or directory
		GGM stat on dir got err &fs.PathError{Op:"open", Path:"/sys/fs/cgroup/memory.max", Err:0x2}
		GGM notExist false
		error: failed to retrieve cgroup limits: read cgroup paren...d498b2b2c61f86213fd5f42e3293f49f1298185122f58e994855.scope]

@gabemontero
Contributor Author

Another test had a more thorough dump on the error. We are no longer in the "no memory.max provided" default path, but we get subsequent errors. Investigating:

2021-07-14T13:45:46.844072193Z I0714 13:45:46.844054       1 util_linux.go:57] found cgroup values map: map[:/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod340d7682_7be8_4207_b567_23c490506337.slice/crio-e8bdd41c69d41f8060534462dd978b49ab44caa665022fa97e6ed2f9b935a310.scope]
2021-07-14T13:45:46.850317302Z F0714 13:45:46.850283       1 helpers.go:115] error: failed to retrieve cgroup limits: read cgroup parent: could not find memory cgroup subsystem in map map[:/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod340d7682_7be8_4207_b567_23c490506337.slice/crio-e8bdd41c69d41f8060534462dd978b49ab44caa665022fa97e6ed2f9b935a310.scope]
2021-07-14T13:45:46.850481404Z goroutine 1 [running]:
2021-07-14T13:45:46.850481404Z k8s.io/klog/v2.stacks(0xc000012001, 0xc000a461c0, 0x157, 0x1a8)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
2021-07-14T13:45:46.850481404Z k8s.io/klog/v2.(*loggingT).output(0x3c32c60, 0xc000000003, 0x0, 0x0, 0xc000036070, 0x302d4cf, 0xa, 0x73, 0xcb8800)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/klog/v2/klog.go:970 +0x191
2021-07-14T13:45:46.850481404Z k8s.io/klog/v2.(*loggingT).printDepth(0x3c32c60, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc0005f81c0, 0x1, 0x1)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/klog/v2/klog.go:733 +0x16f
2021-07-14T13:45:46.850481404Z k8s.io/klog/v2.FatalDepth(...)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/klog/v2/klog.go:1495
2021-07-14T13:45:46.850481404Z k8s.io/kubectl/pkg/cmd/util.fatal(0xc00022e8c0, 0x128, 0x1)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:93 +0x288
2021-07-14T13:45:46.850481404Z k8s.io/kubectl/pkg/cmd/util.checkErr(0x2c02e80, 0xc0004d5d80, 0x2a404e0)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:188 +0x935
2021-07-14T13:45:46.850481404Z k8s.io/kubectl/pkg/cmd/util.CheckErr(...)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/kubectl/pkg/cmd/util/helpers.go:115
2021-07-14T13:45:46.850481404Z main.NewCommandDockerBuilder.func1(0xc000384000, 0xc00007e840, 0x0, 0x1)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/cmd/builder.go:125 +0x129
2021-07-14T13:45:46.850481404Z main.main.func2(0xc000384000, 0xc00007e840, 0x0, 0x1)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/cmd/main.go:85 +0x239
2021-07-14T13:45:46.850481404Z github.com/spf13/cobra.(*Command).execute(0xc000384000, 0xc000050070, 0x1, 0x1, 0xc000384000, 0xc000050070)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/github.com/spf13/cobra/command.go:856 +0x2c2
2021-07-14T13:45:46.850481404Z github.com/spf13/cobra.(*Command).ExecuteC(0xc000384000, 0xc000520720, 0x2938405, 0xd)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/github.com/spf13/cobra/command.go:960 +0x375
2021-07-14T13:45:46.850481404Z github.com/spf13/cobra.(*Command).Execute(...)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/github.com/spf13/cobra/command.go:897
2021-07-14T13:45:46.850481404Z main.main()
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/cmd/main.go:88 +0xac5
2021-07-14T13:45:46.850481404Z 
2021-07-14T13:45:46.850481404Z goroutine 6 [chan receive]:
2021-07-14T13:45:46.850481404Z k8s.io/klog/v2.(*loggingT).flushDaemon(0x3c32c60)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
2021-07-14T13:45:46.850481404Z created by k8s.io/klog/v2.init.0
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/klog/v2/klog.go:418 +0xdf
2021-07-14T13:45:46.850481404Z 
2021-07-14T13:45:46.850481404Z goroutine 10 [syscall]:
2021-07-14T13:45:46.850481404Z os/signal.signal_recv(0x0)
2021-07-14T13:45:46.850481404Z 	/usr/lib/golang/src/runtime/sigqueue.go:168 +0xa5
2021-07-14T13:45:46.850481404Z os/signal.loop()
2021-07-14T13:45:46.850481404Z 	/usr/lib/golang/src/os/signal/signal_unix.go:23 +0x25
2021-07-14T13:45:46.850481404Z created by os/signal.Notify.func1.1
2021-07-14T13:45:46.850481404Z 	/usr/lib/golang/src/os/signal/signal.go:151 +0x45
2021-07-14T13:45:46.850481404Z 
2021-07-14T13:45:46.850481404Z goroutine 11 [chan receive]:
2021-07-14T13:45:46.850481404Z main.main.func1(0xc000330d80)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/cmd/main.go:34 +0x38
2021-07-14T13:45:46.850481404Z created by main.main
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/cmd/main.go:33 +0x14a
2021-07-14T13:45:46.850481404Z 
2021-07-14T13:45:46.850481404Z goroutine 12 [select]:
2021-07-14T13:45:46.850481404Z k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x2a404a8, 0x2c05a40, 0xc000576000, 0x1, 0xc000054c60)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x118
2021-07-14T13:45:46.850481404Z k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x2a404a8, 0x12a05f200, 0x0, 0x1, 0xc000054c60)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
2021-07-14T13:45:46.850481404Z k8s.io/apimachinery/pkg/util/wait.Until(...)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
2021-07-14T13:45:46.850481404Z k8s.io/apimachinery/pkg/util/wait.Forever(0x2a404a8, 0x12a05f200)
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:81 +0x4f
2021-07-14T13:45:46.850481404Z created by k8s.io/component-base/logs.InitLogs
2021-07-14T13:45:46.850481404Z 	/go/src/github.com/openshift/builder/vendor/k8s.io/component-base/logs/logs.go:58 +0x8a
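For reference, the "could not find memory cgroup subsystem in map" failure above is consistent with how /proc/self/cgroup looks on a cgroup v2 node: there is a single 0::<path> line, so a map keyed by controller name has only an empty-string key and no "memory" entry. A small illustrative sketch, not the builder's actual parsing code:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("/proc/self/cgroup")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        paths := map[string]string{}
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            // v1 lines look like "4:memory:/kubepods.slice/...";
            // the single v2 line looks like "0::/kubepods.slice/...".
            parts := strings.SplitN(scanner.Text(), ":", 3)
            if len(parts) == 3 {
                paths[parts[1]] = parts[2]
            }
        }
        if _, ok := paths["memory"]; !ok {
            fmt.Printf("no \"memory\" key; map is %v\n", paths)
        }
    }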

@gabemontero
Contributor Author

next tweak is up regarding the default when no memory.max is present

@gabemontero
Contributor Author

/test e2e-aws-cgroupsv2

@gabemontero
Contributor Author

VpcLimitExceeded with latest e2e-aws-image-ecosystem

@gabemontero
Contributor Author

Down to one failure in e2e-aws-cgroupsv2 @vrutkovs

: [sig-builds][Feature:Builds] s2i build with a quota Building from a template should create an s2i build with a quota and run it [Skipped:Disconnected] [Suite:openshift/conformance/parallel] expand_less

quota and cgroup v2 certainly could have a relationship that is unique to this test vs. the others

investigating

@gabemontero
Contributor Author

So I see this in the buildah debug @vrutkovs @nalind @adambkaplan

    time="2021-07-14T16:33:45Z" level=debug msg="Running [\"runc\" \"start\" \"buildah-buildah101642493\"]"
    MEMORY=cat: /sys/fs/cgroup/memory/memory.limit_in_bytes: No such file or directory
    MEMORYSWAP=cat: /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes: No such file or directory
    /tmp/scripts/assemble: line 16: /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us: No such file or directory

A bit of a mystery after some simple searches/greps.

The test errors out at https://github.com/openshift/origin/blob/master/test/extended/builds/s2i_quota.go#L58

This is the test bc (BuildConfig): https://github.com/openshift/origin/blob/master/test/extended/builds/s2i_quota.go#L58

Where that builder image's assemble script is at https://github.com/openshift/build-test-images/blob/master/simples2i/s2i/assemble

Also in the log, we are not finding any memory.max file:

Caching blobs under "/var/cache/blobs".
GGM cgroup2 true
GGM file /sys/fs/cgroup/memory.max go err open /sys/fs/cgroup/memory.max: no such file or directory
GGM stat on dir got err <nil>
GGM notExist false
GGM returning default for cgroupv2
I0714 16:32:30.924788       1 builder.go:375] Running build with cgroup limits: api.CGroupLimits{MemoryLimitBytes:0, CPUShares:0, CPUPeriod:0, CPUQuota:0, MemorySwap:0, Parent:""}

Even though we've specified limits at https://github.com/openshift/origin/blob/master/test/extended/testdata/builds/test-s2i-build-quota.json#L14-L15

Seems like those should translate to a memory.max file and the analogous cpu max file.

Feels like a lower level k8s / crio / cgroup setup error.

WDYT

@vrutkovs
Member

At this point I'm fine with skipping this particular test in the cgroupsv2 suite for now.

It appears the build quota setting doesn't get passed to cgroups - or the builder container can't read it?

@adambkaplan
Contributor

I assumed that these resource limits get translated into container resource limits. What I am not aware of is how those container resource limits are made visible within the container itself.

@gabemontero
Contributor Author

gabemontero commented Jul 14, 2021

the test works with "cgroups v1", as exhibited by that same test passing in e2e-aws-builds since /sys/fs/cgroup/memory/memory.limit_in_bytes is found

What I can't find yet is where that MEMORY=cat: /sys/fs/cgroup/memory/memory.limit_in_bytes is coming from. Seeing where that is would help connect some dots.

It seems like a similar check for cgroups v2, like what we are doing in this PR with util_linux.go, is needed wherever that cat is.

Merging this PR to make some progress with the CI tests you are pursuing @vrutkovs is perhaps one thing, but I don't see us handing BUILD-278 off to QE until we see these limits handled properly.

@gabemontero
Contributor Author

OK, @nalind @adambkaplan and I had a pow-wow in our team's scrum today.

  1. @nalind found the mystery MEMORY=cat: /sys/fs/cgroup/memory/memory.limit_in_bytes source: https://github.com/openshift/origin/blob/master/test/extended/testdata/builds/build-quota/.s2i/bin/assemble

  2. I will need to update that file to echo both the cgroup v1 and v2 locations ... as long as one of the echoes produces the MEMORY=419430400 string and the MEMORYSWAP one we should be good (see the sketch after this list)

  3. that said, there is still a question of whether the correct cgroupv2 file is getting created / updated when we set resource limits on the build pod.

  4. so I'll be posting more debug in the build pod to traverse /sys/fs/cgroup and confirm what its listing looks like compared to @vrutkovs debug at BUILD-278: Check cgroup v1 and cgroup v2 Files for Quota #246 (comment)

  5. when he frees up from meetings today @vrutkovs, @nalind will be posting some questions here specifically for you around how the nodes are getting set up in your test, to potentially reconcile exactly what we see from the build pod when we traverse /sys/fs/cgroup
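As a sketch of what item 2 above needs (the real change is to the bash assemble script linked in item 1; this Go version is illustrative only): probe the v1 location first, then the v2 location, and report whichever exists so the quota test sees MEMORY=419430400 on either kind of node.

    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        candidates := []string{
            "/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
            "/sys/fs/cgroup/memory.max",                   // cgroup v2
        }
        for _, p := range candidates {
            if data, err := os.ReadFile(p); err == nil {
                fmt.Printf("MEMORY=%s", data)
                return
            }
        }
        fmt.Println("MEMORY=unknown")
    }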

@nalind
Member

nalind commented Jul 15, 2021

The debug logic @gabemontero added to the builder here was attempting to check the contents of the /sys/fs/cgroup directory that it was given, and there didn't appear to be a memory.max node there.

On a cgroupv2 node, the container should be seeing the controllers that the runtime configured for it listed in /sys/fs/cgroup/cgroup.controllers, along with contents that look more like the container case from #246 (comment), or really any other container. What do /sys/fs/cgroup/cgroup.controllers and /sys/fs/cgroup look like in the builder container in a build pod, and if they're different from what we get for a container run using oc run, why are they different?
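The check @nalind is describing is simple enough to illustrate (a cat of the same file from oc run or oc debug gives the same answer; this is just a hedged sketch):

    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        // On a cgroup v2 node the runtime should delegate controllers to the
        // container; this reports what the container actually sees.
        data, err := os.ReadFile("/sys/fs/cgroup/cgroup.controllers")
        if err != nil {
            fmt.Println("no cgroup.controllers:", err)
            return
        }
        fmt.Printf("delegated controllers: %s", data)
    }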

@gabemontero
Contributor Author

I've pushed the debug to list the files under /sys/fs/cgroup from within the build pod, as well as print the contents of any file that starts with memory, to confirm whether any of them reflect the memory limit our quota test places on the build pod.

I'll post data from the debug assuming we get a valid run.
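Roughly, the temporary debug described above amounts to something like the following (assumed shape, not the exact commit):

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "strings"
    )

    func main() {
        root := "/sys/fs/cgroup"
        entries, err := os.ReadDir(root)
        if err != nil {
            fmt.Println("read dir:", err)
            return
        }
        for _, e := range entries {
            fmt.Println(e.Name())
            // Dump any memory* file so the build log shows whether the pod's
            // limit is reflected at this level of the hierarchy.
            if !e.IsDir() && strings.HasPrefix(e.Name(), "memory") {
                data, err := os.ReadFile(filepath.Join(root, e.Name()))
                if err != nil {
                    fmt.Println("  read:", err)
                    continue
                }
                fmt.Printf("  %s = %s", e.Name(), data)
            }
        }
    }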

@gabemontero
Contributor Author

But with everything I listed, it seems like either /sys/fs/cgroup/user.slice/memory.max or /sys/fs/cgroup/pids/memory.max is the one.

I did go back to the reference https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#basics that @nalind pointed me to, but could not find references to either of those two locations for memory.max.

Any thoughts @nalind or @kolyshkin ?

In the interim, I am going to add some temporary debug back into this PR to see if either of those files is present and visible to us as we process in util_linux.go, and see if they have the expected value. If we get a match, I'll update the openshift/origin test with the final solution. Otherwise, I'm going to need to merge another temporary change to the openshift/origin tests where I echo the name of the file, then cat the contents, so we can nail down which memory.max should be inspected.

Unfortunately, only the pod-specific memory.max files had the correct value. All the top-level ones either had "max" or much larger numbers that looked like the total amount of memory on the node.

So the bash script test on the openshift/origin side will need to search for the memory.max file that has the correct amount under kubepods.slice

like /sys/fs/cgroup/kubepods.slice/kubepods-pod1460b72e_88ad_4d68_b63a_e579fa7daae9.slice/memory.max

sigh
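Illustrative only (the real check will live in the bash assemble script on the openshift/origin side), but the search it needs amounts to walking kubepods.slice for a memory.max whose value matches the quota, 419430400 in the test:

    package main

    import (
        "bytes"
        "fmt"
        "io/fs"
        "os"
        "path/filepath"
    )

    func main() {
        const want = "419430400" // the memory limit the quota test applies
        root := "/sys/fs/cgroup/kubepods.slice"
        _ = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
            if err != nil || d.IsDir() || d.Name() != "memory.max" {
                return nil
            }
            data, readErr := os.ReadFile(path)
            if readErr == nil && string(bytes.TrimSpace(data)) == want {
                fmt.Println("found pod limit in", path)
            }
            return nil
        })
    }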

@gabemontero
Contributor Author

/retest

@gabemontero
Contributor Author

/retest

@gabemontero
Contributor Author

So the latest test change from openshift/origin#26395 was able to post the expected setting of MEMORY when quota was applied.

However, MEMORYSWAP did not show up. And with the debug present so far, I'm not seeing an analogous setting of swap for cgroupv2 like we did with cgroupv1 in the memory swap files found under /sys/fs/cgroup.

Will need to circle back with the doc previously cited and the folks on this PR to see if that is expected for cgroupv2, or if we need to investigate further.

@gabemontero
Contributor Author

/test e2e-aws-builds

@gabemontero
Contributor Author

So the latest test change from openshift/origin#26395 was able to post the expected setting of MEMORY when quota was applied.

However, MEMORYSWAP did not show up. And with the debug present so far, I'm not seeing an analogous setting of swap for cgroupv2 like we did with cgroupv1 in the memory swap files found under /sys/fs/cgroup.

Will need to circle back with the doc previously cited and the folks on this PR to see if that is expected for cgroupv2, or if we need to investigate further.

OK, after some research on my end and an exchange with @nalind on Slack, confirmed that the expectations for swap in v1 are not the same as in v2. With v1, the memory+swap limit (memory.memsw.limit_in_bytes) covers memory plus swap, while with v2, memory.swap.max is just swap. So for v2 we should see a 0.

And I do see that with my latest debug.

So we need one more openshift/origin PR and a tweak to the quota assemble script to account for this difference.
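A tiny sketch of the v1/v2 difference just described, reading both swap-related files (illustrative only; the actual tweak lands in the origin-side assemble script):

    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        // cgroup v1: the combined memory+swap limit the old assemble script read.
        if data, err := os.ReadFile("/sys/fs/cgroup/memory/memory.memsw.limit_in_bytes"); err == nil {
            fmt.Printf("v1 memory+swap limit: %s", data)
        }
        // cgroup v2: a swap-only limit; with no swap allowance the quota test
        // should read back 0 here rather than a memory+swap total.
        if data, err := os.ReadFile("/sys/fs/cgroup/memory.swap.max"); err == nil {
            fmt.Printf("v2 swap limit: %s", data)
        }
    }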

@gabemontero
Contributor Author

/retest

@gabemontero
Contributor Author

cgroupv2 passed !!!

and the e2e-aws-builds failure was a sig-etcd flake ... all the sig-builds tests passed !!!

going to remove the debug commit and see about getting clean e2e's next go around and getting lgtm / approve labels from the reviewers :-)

@gabemontero
Contributor Author

sig-builds passed

unrelated flakes in cgroupv2

/test e2e-aws-cgroupv2

@gabemontero
Contributor Author

/test e2e-aws-cgroupsv2

@openshift-ci
Contributor

openshift-ci bot commented Aug 19, 2021

@gabemontero: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test e2e-aws
  • /test e2e-aws-builds
  • /test e2e-aws-image-ecosystem
  • /test images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test e2e-aws-cgroupsv2
  • /test e2e-aws-proxy

Use /test all to run all jobs.

In response to this:

sig-builds passed

unrelated flakes in cgroupv2

/test e2e-aws-cgroupv2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gabemontero
Contributor Author

/retest

@vrutkovs
Member

/retest

Contributor

@adambkaplan adambkaplan left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 20, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 20, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 20, 2021
@gabemontero
Contributor Author

/assign @jitendar-singh

Hey @jitendar-singh - so we are ready to talk about what level of verification you may or may not need to do with this, and thus how comfortable you are with applying the qe-approved label.

At its core, this change just adds the memory limit from the k8s pod memory resource limit to the appropriate option field for buildah before we call it.

buildah ultimately passes that to the opencontainers code, which basically sets the Linux-level memory limits.

So,

  1. how to set up with cgroupv2 ... is it viable for you to install a cluster in that fashion?

  2. what is needed / possible to verify that buildah/opencontainers/Linux has the memory limit?

For 1), the best I can tell you is that you have to install a cluster and then apply the necessary machine config operator changes to have it boot in cgroup v2 mode, like the e2e-aws-cgroupsv2 job that @vrutkovs set up for us does ... running this PR's changes against that job is how I did development here.

@vrutkovs please assist if I'm missing key details there.

What I found around that machine config change and setting the required kernel arguments is https://github.com/openshift/release/blob/master/ci-operator/step-registry/openshift/manifests/cgroupsv2/openshift-manifests-cgroupsv2-commands.sh

Get those machine config YAMLs saved, then per the OCP doc apply those machine config changes and reboot the nodes/cluster as needed for them to take effect, with the underlying hosts then running in cgroup v2 mode.

For 2), unless running with maximum trace on a build prints sufficient data to confirm the memory limit is set, I think we need to link up with @nalind and see if inspection of /sys/fs/cgroup like in https://github.com/openshift/origin/blob/master/test/extended/testdata/builds/build-quota/.s2i/bin/assemble is sufficient, or if there is some additional verification in the build container we should do.

Or for both 1) and 2), we say that since cgroupv2 is a dev/tech preview only, the PR verification is sufficient for QE coverage on this?

Hopefully we can discuss during scrum / office hours on Monday/Tuesday next week.

thanks

@gabemontero
Contributor Author

quick update: based on the discussion (for RH internal only) in https://coreos.slack.com/archives/C02258G4S79/p1629482111160000 we seem to be leaning toward PR validation being sufficient and just applying the QE label and merging, but I'm still waiting to see whether we have consensus with @jitendar-singh and @adambkaplan

@adambkaplan
Contributor

/label qe-approved

Since cgroupsv2 isn't even tech preview yet, CI tests passing is sufficient here.

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Aug 24, 2021
@openshift-merge-robot openshift-merge-robot merged commit a757261 into openshift:master Aug 24, 2021
@gabemontero gabemontero deleted the cgroup2-retry branch August 24, 2021 18:13