
Cherry pick of #78495: Fix issues in kubelet for AArch64 resulting in kubelet crashing on startup #79671

Conversation

@odinuge (Member) commented Jul 2, 2019

TL;DR: This fixes issues in kubelet for AArch64 that cause it to crash on startup. Without backporting this, the first working release for AArch64 running Linux 5.0+ will not land until September (v1.16.0).

When starting, kubelet exits with status 1 and prints:

Failed to start ContainerManager failed to initialize top level QOS containers: failed to update top level Burstable QOS cgroup : failed to set supported cgroup subsystems for cgroup [kubepods burstable]: Failed to set config for supported subsystems : failed to write 4611686018427387904 to hugetlb.64kB.limit_in_bytes: open /sys/fs/cgroup/hugetlb/kubepods/burstable/hugetlb.64kB.limit_in_bytes: permission denied
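
For context on that error: the failing path is built by formatting the hugepage size into a hugetlb cgroup file name, and the kernel exposes that file with an upper-case "KB" suffix (e.g. hugetlb.64KB.limit_in_bytes), while the path in the error uses lower-case "kB". A minimal sketch of that mapping, using a simplified helper that is illustrative only and not the actual kubelet/runc code:

```go
// Sketch only: illustrates how a hugepage size maps to a hugetlb cgroup file
// name, and why the "kB" vs "KB" suffix matters. Not the actual kubelet/runc
// implementation.
package main

import "fmt"

// hugetlbLimitFile builds the cgroup file name for a hugepage size in bytes.
// The kernel names these files with upper-case binary suffixes ("KB", "MB",
// "GB"); emitting "64kB" instead yields the failing path in the error above.
func hugetlbLimitFile(sizeBytes uint64) string {
	suffixes := []string{"B", "KB", "MB", "GB"}
	i := 0
	for sizeBytes >= 1024 && i < len(suffixes)-1 {
		sizeBytes /= 1024
		i++
	}
	return fmt.Sprintf("hugetlb.%d%s.limit_in_bytes", sizeBytes, suffixes[i])
}

func main() {
	fmt.Println(hugetlbLimitFile(64 * 1024)) // hugetlb.64KB.limit_in_bytes
	fmt.Println(hugetlbLimitFile(2 << 20))   // hugetlb.2MB.limit_in_bytes
}
```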

This is tested on AArch64, and works as expected: #79671 (comment)

I am a bit unhappy about having to bump dependencies in a backport, but I do think it is required... This is the result of "wrong code" in k8s and runc, just something we hadn't seen enabled by default before Linux 5.0. :/

As @Pennyzct said in the original PR, #78495 (comment), the only way to run a working kubelet on AArch64 (64-bit Arm) with Linux 5.0+ is currently to compile your own version from master. No previous releases will work (though I guess those predating hugetlb support should), and 1.16 (with the fix) will not land before mid-September.

There are also people in k3s reporting this issue: rancher/k3s#474

Again, I am not a huge fan of this, but I think we should at least consider cherry-picking it to 1.15 so that there is one supported version that works. It is also possible to backport even further, but I think doing only 1.15 is a reasonable middle ground.

For more background, read through the original PR, #78495 (the whole discussion, not just the first post).


Cherry pick of #78495 on release-1.15.

#78495: Update dependency opencontainer/runc

@k8s-ci-robot (Contributor) commented Jul 2, 2019

@odinuge: This PR is not for the master branch but does not have the cherry-pick-approved label. Adding the do-not-merge/cherry-pick-not-approved label.

To approve the cherry-pick, please ping the kubernetes/patch-release-team in a comment when ready.

See also Kubernetes Patch Releases

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor) commented Jul 2, 2019

Hi @odinuge. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@odinuge (Member, Author) commented Jul 2, 2019

/assign @derekwaynecarr
/assign @mattjmcnaughton

Need to verify on AArch64. I don't have any hardware at hand right now.

/hold

@mattjmcnaughton (Contributor) commented Jul 3, 2019

Please ping me when you've verified on the proper hardware and I will take a look.

/ok-to-test

@odinuge (Member, Author) commented Jul 3, 2019

Thanks @mattjmcnaughton!

This also depends on 51177c1. I guess I can just cherry-pick that commit (and not the whole PR) into this PR?

Since runc also depends on the newest version of libseccomp, IMO this PR should also be cherry-picked: #79287

@mattjmcnaughton (Contributor) commented Jul 4, 2019

Hmmm, @odinuge, to be honest I'm getting a little nervous about how big this cherry-pick is (especially with all the dependency updates we need to bring in). Would you be comfortable sharing more about the impact of this bug, and how difficult you imagine it would be for those impacted to upgrade to a newer version of k8s?

@odinuge (Member, Author) commented Jul 4, 2019

TL;DR: Without backporting this, the first working release for AArch64 running Linux 5.0+ will not land until September (1.16).

Yes, of course, @mattjmcnaughton! 😄

I am a bit unhappy about having to bump dependencies in a backport myself, but I do think it is required... This is the result of "wrong code" in k8s and runc, just something we hadn't seen enabled by default before Linux 5.0. :/

As @Pennyzct said in the original PR, #78495 (comment), the only way to run a working kubelet on AArch64 (64-bit Arm) with Linux 5.0+ is currently to compile your own version from master. No previous releases will work (though I guess those predating hugetlb support should), and 1.16 (with the fix) will not land before mid-September.

There are also people in k3s reporting this issue: rancher/k3s#474

Again, I am not a huge fan of this, but I think we should at least consider cherry-picking it to 1.15 so that there is one supported version that works. It is also possible to backport even further, but I think doing only 1.15 is a reasonable middle ground.

@cblecker (Member) commented Jul 4, 2019

Please update the title/body to reflect that this is now more than just an automated cherry-pick. The exact details of what you're cherry-picking should be clear.

My biggest question is: when did we announce support for AArch64? Have we announced support at all?

@odinuge changed the title from "Automated cherry pick of #78495: Update dependency opencontainer/runc" to "Cherry pick of #78495: Fix issues in kubelet for AArch64" on Jul 4, 2019

@odinuge (Member, Author) commented Jul 18, 2019

/hold cancel
/kind bug
/priority important-soon

@odinuge (Member, Author) commented Jul 18, 2019

This should be good to go now, @kubernetes/patch-release-team! Looks like @k8s-ci-robot is struggling a bit, though (probably because the GitHub API is quite unstable at the moment).

@cblecker (Member) commented Jul 18, 2019

/kind bug
/lgtm
/approve

@k8s-ci-robot (Contributor) commented Jul 18, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cblecker, derekwaynecarr, odinuge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@odinuge (Member, Author) commented Jul 19, 2019

/priority important-soon

@walduino commented Aug 1, 2019

Any chance this gets approved soon? My Arm cluster is purely for educational purposes. I replaced one of the control plane nodes with an Orange Pi 3 SBC, which runs Linux 5.1. I understand that only a few people are waiting for this, so there's little pressure. I CAN wait until September to start playing with k8s again, but I would rather not HAVE TO. //cheerz

@Pennyzct commented Aug 2, 2019

Hi guys. I'm also working on ARM for Kata Containers and need to build a k8s cluster for it. The kernel version needs to be newer than v5.0.x, so it looks like v1.15.x is the best option. ;) I really hope this PR gets landed. ;)

@vielmetti commented Aug 2, 2019

Hoping that @deitch can take a peek at this and make a recommendation.

@cblecker (Member) commented Aug 2, 2019

/milestone v1.15

@k8s-ci-robot added this to the v1.15 milestone on Aug 2, 2019

@deitch (Contributor) commented Aug 6, 2019

Trying to wrap my head around all of this.

Basically, the full fix is on master, targeted at 1.16. Because the full fix requires changing dependencies, we are very hesitant to backport it into 1.15 (I have been in the go modules "dependency hell" too many times; I am not disappointed by go modules themselves, but by how much they try to update and "magic" things). So the fix that @derekwaynecarr called "reasonable and small" is just to replace "kB" with "KB"? It won't properly fix the hugetlb issue the way #78495 does on master, but it would at least prevent kubelet from crashing?
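
A minimal sketch of the small "kB" to "KB" substitution being discussed here; illustrative only, since the function name and call site below are assumptions rather than the actual backported change:

```go
// Sketch of the "kB"-to-"KB" normalization discussed above; illustrative
// only, not the actual backported patch.
package main

import (
	"fmt"
	"strings"
)

// normalizeHugetlbSize rewrites a lower-case "kB" size suffix into the
// upper-case "KB" form the kernel uses for hugetlb cgroup file names.
func normalizeHugetlbSize(size string) string {
	return strings.ReplaceAll(size, "kB", "KB")
}

func main() {
	// The failing file name from the error earlier in this thread.
	fmt.Println("hugetlb." + normalizeHugetlbSize("64kB") + ".limit_in_bytes")
	// Output: hugetlb.64KB.limit_in_bytes
}
```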

@tpepper (Contributor) commented Aug 6, 2019

Given the delays we had on the branch for the recent CVEs, and given that conversation on this cherry-pick has continued a fair amount after it got lgtm/approve, can I get an affirmation from folks (e.g. @derekwaynecarr, @cblecker, @dims, @mattjmcnaughton) that you all do want this merged and that things are complete? We've begun merging content for 1.15.3 with a target of 2019-08-19.

@cblecker (Member) left a review comment (and subsequently approved the change)

@tpepper Thanks for checking, Tim. I appreciate your diligence, considering the background on this one.

Yes, I feel comfortable signing off on this one as the right approach for 1.15.

@dims (Member) commented Aug 7, 2019

+1 from me as well

@k8s-ci-robot merged commit b3a664e into kubernetes:release-1.15 on Aug 8, 2019

22 of 23 checks passed:

pull-kubernetes-kubemark-e2e-gce-big: Job triggered.
cla/linuxfoundation: odinuge authorized.
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Skipped.
pull-kubernetes-e2e-gce-csi-serial: Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-iscsi: Skipped.
pull-kubernetes-e2e-gce-iscsi-serial: Skipped.
pull-kubernetes-e2e-gce-storage-slow: Skipped.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-node-e2e-containerd: Skipped.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.