Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cgroup hugetlb size prefix for kB #78495

Merged
merged 3 commits into from Jun 28, 2019

Conversation

@odinuge
Copy link
Member

commented May 29, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:
The hugetlb cgroup control files (introduced here in 2012:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773)
use "KB" and not "kB"
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349).

The behavior in the kernel has not changed since the introduction, and the current code using "kB" will therefore fail on devices with small amounts of ram (see #77169) running a kernel with config flag CONFIG_HUGETLBFS=y

As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB",
"MB" and "GB" are used, so the others may be removed as well.

Here is a real world example of the files inside the
"/sys/kernel/mm/hugepages/" directory:

  • "hugepages-64kB"
  • "hugepages-2048kB"
  • "hugepages-32768kB"
  • "hugepages-1048576kB"

And the corresponding cgroup files:

  • "hugetlb.64KB._____"
  • "hugetlb.2MB._____"
  • "hugetlb.32MB._____"
  • "hugetlb.1GB._____"

Noticed this when tinkering around with kubernetes on a Raspberry PI (1GB ram) running Arch Linux (Linux k8s-003 5.1.5-1-ARCH #1 SMP Sat May 25 13:23:49 MDT 2019 aarch64 GNU/Linux)

Which issue(s) this PR fixes:

Fixes #77169

Special notes for your reviewer:

/priority important-soon

WIP beacuse it depends on this upstream fix in runc:
opencontainers/runc#2065

Does this PR introduce a user-facing change?:

Fix kubelet errors in AArch64 with huge page sizes smaller than 1MiB
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented May 29, 2019

Welcome @odinuge!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented May 29, 2019

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented May 29, 2019

Hi @odinuge. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@odinuge

This comment has been minimized.

Copy link
Member Author

commented May 29, 2019

CLA should be good to go now.

@mattjmcnaughton
Copy link
Contributor

left a comment

Thanks for your work on this issue @odinuge :)

I think the largest question I'm trying to wrap my head around is whether this feature ever worked, or if it has been broken since it was first introduced?

If the former, will this change cause issues for those for whom it was working before?

Additionally, thoughts on how hard it would be to add a unit/integration test to assert this behavior is working?

Lmk when the change on which this depends is merged, and I'll mark this as ok-to-test.

@odinuge

This comment has been minimized.

Copy link
Member Author

commented May 30, 2019

Hi @mattjmcnaughton,

I have twisted my head around this quite a few times myself. HugePages smaller than 1MB doesn't make that much sense after all... And the code is kinda strange, as some stuff is done inside runc, and more or less the same is done inside k8s.

After a bit more research I think I found a relavant patch (doing changes inside architecture code of arm64 only), https://lkml.org/lkml/2018/10/23/143. The line add_huge_page_size(PMD_SIZE * CONT_PMDS) with the config of the kernel I use, ends up as PMD_SIZE = 1UL << 12 and CONT_PMDS = 1UL << 4. Resulting in add_huge_page_size(1UL << 16), aka. HugePage size of 64kB.

So, it looks like this that patch introduced the HugePage size 64kB in arm64, even tho such small HugePages has been supported in the kernel from day 1.

If the former, will this change cause issues for those for whom it was working before?

No. Using kB instead of KB for the cgroup(v1) control files has never worked (the kernel has used KB from day 1), but as I said over, i guess no one would have seen this before the patch I referred to. 😄

Additionally, thoughts on how hard it would be to add a unit/integration test to assert this behavior is working?

Yee, I do indeed prefer writing some tests. The file containing the code has ~18% coverage, but the func using it has 0.. :( Guess I can write a simple test ensuring KB is used, but I don't think that would give any value, or, what do you think?

Also, as I said in the runc-PR, this looks like the result of a quick copy-paste of the ByteSizes, meant for making huge numbers more human readable, from the docker/go-units package. https://github.com/docker/go-units/blob/519db1ee28dcc9fd2474ae59fca29a810482bfb1/size.go#L37. And another copy-paste from runc into kubernetes 😄

Lmk when the change on which this depends is merged, and I'll mark this as ok-to-test.

Will do 😄

@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor

commented May 31, 2019

Hi @mattjmcnaughton,

I have twisted my head around this quite a few times myself. HugePages smaller than 1MB doesn't make that much sense after all... And the code is kinda strange, as some stuff is done inside runc, and more or less the same is done inside k8s.

After a bit more research I think I found a relavant patch (doing changes inside architecture code of arm64 only), https://lkml.org/lkml/2018/10/23/143. The line add_huge_page_size(PMD_SIZE * CONT_PMDS) with the config of the kernel I use, ends up as PMD_SIZE = 1UL << 12 and CONT_PMDS = 1UL << 4. Resulting in add_huge_page_size(1UL << 16), aka. HugePage size of 64kB.

So, it looks like this that patch introduced the HugePage size 64kB in arm64, even tho such small HugePages has been supported in the kernel from day 1.

If the former, will this change cause issues for those for whom it was working before?

No. Using kB instead of KB for the cgroup(v1) control files has never worked (the kernel has used KB from day 1), but as I said over, i guess no one would have seen this before the patch I referred to.

Additionally, thoughts on how hard it would be to add a unit/integration test to assert this behavior is working?

Yee, I do indeed prefer writing some tests. The file containing the code has ~18% coverage, but the func using it has 0.. :( Guess I can write a simple test ensuring KB is used, but I don't think that would give any value, or, what do you think?

Also, as I said in the runc-PR, this looks like the result of a quick copy-paste of the ByteSizes, meant for making huge numbers more human readable, from the docker/go-units package. https://github.com/docker/go-units/blob/519db1ee28dcc9fd2474ae59fca29a810482bfb1/size.go#L37. And another copy-paste from runc into kubernetes

Lmk when the change on which this depends is merged, and I'll mark this as ok-to-test.

Will do

Thanks for your detail here :) Tbh, I don't have a ton of experience with the cgroup manager component of the kubelet, so I'm not positive what exact type of test coverage already exists. At the very least, adding a unit test would be helpful. It would also be interesting to have a integration testing the entire huge page workflow (if that's possible). I'm not sure how difficult that integration test would be to write (and it also may be more appropriate to do it outside of this diff).

@odinuge

This comment has been minimized.

Copy link
Member Author

commented May 31, 2019

After some considerations I made the list of units public/expoted in runc, and wrote a few tests there instead: https://github.com/opencontainers/runc/pull/2065/files. When/If that change is merged into upstream, we can just that constant instead, and avoid any problems with regression. 😄

A integration test for cgroup (all the different parts) handling inside kubelet would be nice, but that is way outside my expertise atm... Without some serious rewrites, doing that in a general way (for testing this particular problem) would be quite hard I imagine. Internally runc does: files, err := ioutil.ReadDir("/sys/kernel/mm/hugepages"), aka. reads directly from the system. Since only aarch64(armv8 64bits, not 100% sure about arm32) have hugePages smaller than 1MB, some serous file mocking, and code rewriting, would be required to test it on other architectures.

@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Jun 1, 2019

After some considerations I made the list of units public/expoted in runc, and wrote a few tests there instead: https://github.com/opencontainers/runc/pull/2065/files. When/If that change is merged into upstream, we can just that constant instead, and avoid any problems with regression.

A integration test for cgroup (all the different parts) handling inside kubelet would be nice, but that is way outside my expertise atm... Without some serious rewrites, doing that in a general way (for testing this particular problem) would be quite hard I imagine. Internally runc does: files, err := ioutil.ReadDir("/sys/kernel/mm/hugepages"), aka. reads directly from the system. Since only aarch64(armv8 64bits, not 100% sure about arm32) have hugePages smaller than 1MB, some serous file mocking, and code rewriting, would be required to test it on other architectures.

Cool, this sounds great! Agree that an integration test is a large undertaking and we shouldn't block this immediate fix on it. Can you please ping me when your upstream runc change is merged?

Thanks for your work here :)

@odinuge odinuge force-pushed the odinuge:cgroups-hugetlb branch from 2bebb65 to 15ff978 Jun 6, 2019

@k8s-ci-robot k8s-ci-robot added size/L and removed size/XS labels Jun 6, 2019

@liggitt

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

I'm assuming we'll want to pick the kB -> KB fix back to previous releases. Doing so by just changing our inline list will make this much easier than also needing to bump dependencies, right?

re-read your comment, sounds like the runc bump is also required to fix the issue. If that's true, then using the constant from runc is more reasonable, but makes fixing this in past releases harder.

@odinuge

This comment has been minimized.

Copy link
Member Author

commented Jun 28, 2019

Yes, correct, the runc bump is required. I think this should be cherry-picked into the previous releases, and as you say, I guess it would require some work (especially pre go.mod).

odinuge added some commits Jun 20, 2019

Fix cgroup hugetlb size prefix for kB
Use the exported list from runc that uses "KB" and not "kB".

This issue breaks kubelet on AArch64 (arm 64).

var HugePageSizeUnitList = []string{"B", "KB", "MB", "GB", "TB", "PB"}

The hugetlb cgroup control files (introduced here in 2012:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773)
use "KB" and not "kB"
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349).

The behavior in the kernel has not changed since the introduction, and
the current code using "kB" will therefore fail on devices with huge
pages smaller than 1MiB. This is the case for AArch64.

As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB",
"MB" and "GB" are used, so the others may be removed as well.

Here is a real world example of the files inside the
"/sys/kernel/mm/hugepages/" directory:
- "hugepages-64kB"
- "hugepages-2048kB"
- "hugepages-32768kB"
- "hugepages-1048576kB"

And the corresponding cgroup files:
- "hugetlb.64KB._____"
- "hugetlb.2MB._____"
- "hugetlb.32MB._____"
- "hugetlb.1GB._____"

Signed-off-by: Odin Ugedal <odin@ugedal.com>

@odinuge odinuge force-pushed the odinuge:cgroups-hugetlb branch from f39fb83 to 4ee5fe2 Jun 28, 2019

@odinuge

This comment has been minimized.

Copy link
Member Author

commented Jun 28, 2019

Did a rebase now. As you see above @liggitt, @derekwaynecarr (from sig-node & approver in pkg/kubelet) commented with both /lgtm and /approve (the lgtm was removed because of the rebase), so the code has already been approved from someone in sig-node. But because of the changes in go.mod & go.sum, that is not enough.

Also, how far back do you think we should cherry-pick this?

@liggitt

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Jun 28, 2019

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, liggitt, odinuge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@liggitt

This comment has been minimized.

Copy link
Member

commented Jun 28, 2019

I'll defer to sig-node on the severity of the issue this is fixing vs the risk of back-porting

@odinuge

This comment has been minimized.

Copy link
Member Author

commented Jun 28, 2019

That sounds like a good idea @liggitt! Thanks! 😄

@thockin thockin removed their assignment Jun 28, 2019

@k8s-ci-robot k8s-ci-robot merged commit ca6113f into kubernetes:master Jun 28, 2019

22 of 23 checks passed

pull-kubernetes-kubemark-e2e-gce-big Job triggered.
Details
cla/linuxfoundation odinuge authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Job succeeded.
Details
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details
@Pennyzct

This comment has been minimized.

Copy link

commented Jul 1, 2019

Hi~@odinuge @liggit again~ Penny from ARM. super excited to see this PR got merged. ;)
Looking forward to see this patch backed port to v1.14 and v1.15. Otherwise, if you try to set up k8s cluster on AArch64 with kernel 5.0+, building from the scratch maybe the only option.😭😢

@odinuge

This comment has been minimized.

Copy link
Member Author

commented Jul 1, 2019

Hi @Pennyzct! Yes, currently, the only way to run k8s on AArch64 is to compile the master branch of k8s. I think this can become a huge pain point when people gets their hands on the new Raspberry Pi, and starts testing with 5.0 kernels and k8s. :/

I will make a patch for the v1.15 branch as quick as I can, and since the difference between master and v1.15 isn't that big, it shouldn't be a huge problem. When it comes to back porting it further, to v1.14 etc., I need some input from the sig-node team to hear if they want it or not, ref.

I'll defer to sig-node on the severity of the issue this is fixing vs the risk of back-porting

I can certainly do all the back porting, as long as there is will in sig-node to get it merged! 😄

Any thoughts @derekwaynecarr & @mattjmcnaughton?

@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Jul 3, 2019

I agree let's start with 1.15.

For me, the largest question is how often someone would be using this hugetlb functionality, and in what scenarios. I'd be hesitant to upgrade runc in past versions if the hugetlb functionality either didn't enjoy widespread use or was predominantly used in non-production instances (in which an upgrade to 1.15.x would be less onerous).

Thoughts?

odinuge added a commit to odinuge/kubernetes that referenced this pull request Jul 7, 2019

Fix cgroup hugetlb size prefix for kB
Mitigation of the issue fixed in master where hugetlb prefix for "KiB"
is "kB" in runc, but the correct is "KB"
See opencontainers/runc#2065
and kubernetes#78495
for more info.

---

The hugetlb cgroup control files (introduced here in 2012:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773)
use "KB" and not "kB"
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349).

The behavior in the kernel has not changed since the introduction, and the current code using "kB" will therefore fail on devices with small amounts of ram (see kubernetes#77169) running a kernel with config flag CONFIG_HUGETLBFS=y

As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB",
"MB" and "GB" are used, so the others may be removed as well.

Here is a real world example of the files inside the
"/sys/kernel/mm/hugepages/" directory:

"hugepages-64kB"
"hugepages-2048kB"
"hugepages-32768kB"
"hugepages-1048576kB"
And the corresponding cgroup files:

"hugetlb.64KB._____"
"hugetlb.2MB._____"
"hugetlb.32MB._____"
"hugetlb.1GB._____"
@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Jul 8, 2019

The issue is how often hugetlb is used with that specific page size. I will look at the backport, I am not adverse to it.

odinuge added a commit to odinuge/kubernetes that referenced this pull request Jul 8, 2019

Fix cgroup hugetlb size prefix for kB
Mitigation of the issue fixed in master where hugetlb prefix for "KiB"
is "kB" in runc, but the correct is "KB"
See opencontainers/runc#2065
and kubernetes#78495
for more info.

---

The hugetlb cgroup control files (introduced here in 2012:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773)
use "KB" and not "kB"
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349).

The behavior in the kernel has not changed since the introduction, and the current code using "kB" will therefore fail on devices with small amounts of ram (see kubernetes#77169) running a kernel with config flag CONFIG_HUGETLBFS=y

As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB",
"MB" and "GB" are used, so the others may be removed as well.

Here is a real world example of the files inside the
"/sys/kernel/mm/hugepages/" directory:

"hugepages-64kB"
"hugepages-2048kB"
"hugepages-32768kB"
"hugepages-1048576kB"
And the corresponding cgroup files:

"hugetlb.64KB._____"
"hugetlb.2MB._____"
"hugetlb.32MB._____"
"hugetlb.1GB._____"

odinuge added a commit to odinuge/kubernetes that referenced this pull request Jul 8, 2019

Fix cgroup hugetlb size prefix for kB
Mitigation of the issue fixed in master where hugetlb prefix for "KiB"
is "kB" in runc, but the correct is "KB"
See opencontainers/runc#2065
and kubernetes#78495
for more info.

---

The hugetlb cgroup control files (introduced here in 2012:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773)
use "KB" and not "kB"
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349).

The behavior in the kernel has not changed since the introduction, and the current code using "kB" will therefore fail on devices with small amounts of ram (see kubernetes#77169) running a kernel with config flag CONFIG_HUGETLBFS=y

As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB",
"MB" and "GB" are used, so the others may be removed as well.

Here is a real world example of the files inside the
"/sys/kernel/mm/hugepages/" directory:

"hugepages-64kB"
"hugepages-2048kB"
"hugepages-32768kB"
"hugepages-1048576kB"
And the corresponding cgroup files:

"hugetlb.64KB._____"
"hugetlb.2MB._____"
"hugetlb.32MB._____"
"hugetlb.1GB._____"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.