BUILD-278: fix cgroupv2 memory max defaulting #252
Conversation
/hold @vrutkovs has openshift/release#20125 up so that we can test these cgroupv2 changes via a new optional e2e job (thanks @vrutkovs !!). Once that merges, we'll drive testing of this change from that job.
/test e2e-aws-cgroupsv2
[…] again in the build logs
yep, I'll start adding debug prints along the various paths here in this PR and we'll go from there.
I may have figured out what was wrong with the patch here while adding debug. Will add that change in addition to the debug. Should be up shortly.
AWS VpcLimitExceeded error on the last e2e-aws-builds failure; will retest after the other e2e's come in, just in case this is widespread.
/test e2e-aws-cgroupsv2
Combo of incorrect debug and, yes, perhaps having progressed past the error some ... the message is slightly different. Looking.
Another test had a more thorough dump on the error. We are not in the "no memory.max provided" default path, but get subsequent errors. Investigating:
(force-pushed from 78b5bf2 to 38eeace)
next tweak up wrt the default when no memory.max is provided
/test e2e-aws-cgroupsv2
VpcLimitExceeded with latest e2e-aws-image-ecosystem
Down to one failure in e2e-aws-cgroupsv2 @vrutkovs : [sig-builds][Feature:Builds] s2i build with a quota Building from a template should create an s2i build with a quota and run it [Skipped:Disconnected] [Suite:openshift/conformance/parallel] ... quota and cgroups2 certainly could have a relationship that is unique vs. the other tests; investigating
So I see this in the buildah debug @vrutkovs @nalind @adambkaplan
A bit of a mystery after some simple searches/greps. The test errors out at https://github.com/openshift/origin/blob/master/test/extended/builds/s2i_quota.go#L58 This is the test bc: https://github.com/openshift/origin/blob/master/test/extended/builds/s2i_quota.go#L58 Where that builder image's assemble script is at https://github.com/openshift/build-test-images/blob/master/simples2i/s2i/assemble Also in the log, we are not finding any memory.max file:
Even though we've specified limits at https://github.com/openshift/origin/blob/master/test/extended/testdata/builds/test-s2i-build-quota.json#L14-L15. Seems like those should translate to a memory.max setting. Feels like a lower-level k8s / cri-o / cgroup setup error. WDYT?
At this point I'm fine with skipping this particular test in the cgroupsv2 suite for now. It appears the build quota setting doesn't get passed to cgroups - or the builder container can't read it?
I assumed that these resource limits get translated into container resource limits. What I am not aware of is how those container resource limits are made visible within the container itself.
the test works with "cgroups v1", as exhibited by that same test passing in e2e-aws-builds since […]. What I can't find yet is where that […]. It seems like a similar check for cgroups v2, like what we are doing in this PR with util_linux.go, is needed wherever that […]. Perhaps merging this PR to make some progress with the CI tests you are pursuing @vrutkovs is one option, but […]
OK, @nalind @adambkaplan and I had a pow-wow in our team's scrum today.
The debug logic @gabemontero added to the builder here was attempting to check the contents of the […]. On a cgroupv2 node, the container should be seeing the controllers that the runtime configured for it listed in […].
(force-pushed from 38eeace to 35f98c8)
I've pushed the debug to list the files under […]. I'll post data from the debug assuming we get a valid run.
Unfortunately, only the pod-specific memory.max files had the correct value. All those top-level ones either had […]. So the bash script test on the openshift/origin side will have to search for the […] instead. sigh
/retest |
/retest |
So the latest test change from openshift/origin#26395 was able to post the expected setting of […]. However, […]. Will need to circle back with the doc previously cited and the folks on this PR to see if that is expected for cgroupv2, or if we need to investigate further.
/test e2e-aws-builds
(force-pushed from a4af671 to 2154fc6)
OK, after some research on my end and an exchange with @nalind on slack, confirmed that the expectations for swap in v1 are not the same as in v2. With v1, […]. And I do see that with my latest debug. So we need one more openshift/origin PR and a tweak to the quota assemble script to account for this difference.
/retest |
cgroupv2 passed !!! and the e2e-aws-builds failure was a sig-etcd flake ... all the sig-builds tests passed !!! Going to remove the debug commit, see about getting clean e2e's next go-around, and get lgtm / approve labels from the reviewers :-)
(force-pushed from 2154fc6 to 04e350b)
sig-builds passed; unrelated flakes in cgroupsv2
/test e2e-aws-cgroupv2
/test e2e-aws-cgroupsv2
@gabemontero: The specified target(s) for […]
The following commands are available to trigger optional jobs: […]
Use […] In response to this: […]
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest |
/retest |
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: adambkaplan, gabemontero. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files: […]
Approvers can indicate their approval by writing /approve in a comment.
/assign @jitendar-singh Hey @jitendar-singh - so we are ready to talk about what level of verification you may or may not need to do with this, and thus how comfortable you are with applying the qe-approved label. At its core, this change just adds the memory limit from the k8s pod memory resource limit to the appropriate option field for buildah before we call it. buildah ultimately passes that to the opencontainers code, which sets the linux-level memory limits. So, the two things to sort out are 1) standing up a cluster running in cgroup v2 mode, and 2) verifying the memory limit actually takes effect in the build container.
For 1), the best I can tell you is that you have to install a cluster and then apply the necessary machine config operator changes to have it boot in cgroup v2 mode, like the e2e-aws-cgroupsv2 job that @vrutkovs set up for us does ... running this PR's changes against that job is how I did development here. @vrutkovs please assist if I'm missing key details there. What I found around that machine config change and setting the required kernel arguments is https://github.com/openshift/release/blob/master/ci-operator/step-registry/openshift/manifests/cgroupsv2/openshift-manifests-cgroupsv2-commands.sh ... get those machine config yamls saved, then per the OCP doc apply those machine config changes and reboot the nodes/cluster as needed for them to take effect, with the underlying hosts now running in cgroupv2 mode.
For 2), unless running with maximum trace on a build prints sufficient data to confirm the memory limit is set, I think we need to link up with @nalind and see if the inspection of /sys/fs/cgroup like in https://github.com/openshift/origin/blob/master/test/extended/testdata/builds/build-quota/.s2i/bin/assemble is sufficient, or if there is additional verification in the build container we should do.
Or for both 1) and 2), we say that since cgroupv2 is dev/tech preview only, the PR verification is sufficient for QE coverage on this? Hopefully we can discuss during scrum / office hours on Monday/Tuesday next week. thanks
quick update: based on the discussion (RH internal only) in https://coreos.slack.com/archives/C02258G4S79/p1629482111160000 we seem to be leaning toward PR validation being sufficient and just applying the QE label and merging, but I'm still waiting to hear whether we have consensus from @jitendar-singh and @adambkaplan
/label qe-approved Since cgroupsv2 isn't even tech preview yet, CI tests passing is sufficient here. |
See results from #246 (comment) and later discussion along with openshift/release#19115
This is an attempt to adjust and address the various concerns.
/assign @kolyshkin
/assign @nalind
/assign @adambkaplan
@vrutkovs FYI