
Set LimitNOFILE=1048576 in containerd unit file to match upstream containerd. #760

Merged
merged 1 commit into kubernetes-sigs:master on Aug 8, 2019

Conversation

jbarrick-mesosphere
Contributor

@jbarrick-mesosphere jbarrick-mesosphere commented Aug 7, 2019

The Ubuntu containerd package sets LimitNOFILE=infinity, but this causes issues with various applications: NFS and MySQL are unable to run in kind on some systems.

This fixes the containerd limit to match upstream containerd and resolves issues with NFS and MySQL.

To reproduce the issue, on the host run sudo sysctl -w fs.nr_open=1073741816 (this is the default on Arch Linux), then follow the steps below.

Broken:

kind create cluster
export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
# print the default ulimit -n:
kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
# run a broken service (this will eventually be OOM killed by the kernel):
# NFS gets OOM killed:
kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p
# MySQL uses tons of memory and is extremely slow (sometimes it gets OOM killed):
kubectl run mysql -it --restart=Never --env=MYSQL_ROOT_PASSWORD=test --env=MYSQL_DATABASE=kudo --image=mysql:5.7
kind delete cluster

Broken output:

$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.15.0) 🖼
 ✓ Preparing nodes 📦 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl cluster-info
$ export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
$ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
1073741816
pod "test" deleted
$ kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p
If you don't see a command prompt, try pressing enter.
pod default/test terminated (Error)
$ dmesg  |grep oom
[ 4019.292178] oom_reaper: reaped process 10173 (rpcinfo), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
$ kubectl run mysql -it --restart=Never --env=MYSQL_ROOT_PASSWORD=test --env=MYSQL_DATABASE=kudo --image=mysql:5.7

If you don't see a command prompt, try pressing enter.
Initializing database
2019-08-07T20:20:37.948900Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
...
2019-08-07T20:21:40.875839Z 0 [Note] mysqld: ready for connections.
Version: '5.7.27'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)

Fixed:

# Now try with the patch:
kind create cluster --image=jbarrickmesosphere/node:v1.15.0
export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
# print the default ulimit -n:
kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
# the services work now:
# NFS does not get OOM killed (the command runs - which errors, but that is expected):
kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p
# MySQL takes five seconds instead of a minute to start on my machine:
kubectl run mysql -it --restart=Never --env=MYSQL_ROOT_PASSWORD=test --env=MYSQL_DATABASE=kudo --image=mysql:5.7

Fixed output:

$ docker build -t jbarrickmesosphere/node:v1.15.0 .
Sending build context to Docker daemon  6.656kB
Step 1/2 : FROM kindest/node:v1.15.0
 ---> 1a930cac869e
Step 2/2 : COPY 10-limits.conf /etc/systemd/system/containerd.service.d/10-limits.conf
 ---> 4b3f03870e73
Successfully built 4b3f03870e73
Successfully tagged jbarrickmesosphere/node:v1.15.0
$ kind create cluster --image=jbarrickmesosphere/node:v1.15.0

Creating cluster "kind" ...
 ✓ Ensuring node image (jbarrickmesosphere/node:v1.15.0) 🖼
 ✓ Preparing nodes 📦 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl cluster-info
$ export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
                                                                      
$ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'

Error from server (Forbidden): pods "test" is forbidden: error looking up service account default/default: serviceaccount "default" not found
$ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
                                                                                       
1048576
pod "test" deleted
$ kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p

rpcinfo: can't contact portmapper: RPC: Remote system error - No such file or directory
pod default/test terminated (Error)
$ kubectl run mysql -it --restart=Never --env=MYSQL_ROOT_PASSWORD=test --env=MYSQL_DATABASE=kudo --image=mysql:5.7

If you don't see a command prompt, try pressing enter.
2019-08-07T20:30:46.856020Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2019-08-07T20:30:51.285083Z 0 [Note] mysqld: ready for connections.
Version: '5.7.27'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)
^C
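For reference, the drop-in baked into the patched image above might look like the following (a sketch: the path matches the COPY destination in the Dockerfile output, and the value mirrors the LimitNOFILE line in the upstream containerd unit):

```ini
# /etc/systemd/system/containerd.service.d/10-limits.conf
# Override the Ubuntu package's LimitNOFILE=infinity with the upstream value.
[Service]
LimitNOFILE=1048576
```

systemd merges drop-ins under containerd.service.d/ into the unit at daemon start, so the override takes effect when the node boots.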

We can see upstream's unit file here: https://github.com/containerd/containerd/blob/d57cf6f151e444d125407526bf58bb8e79c5e47a/containerd.service#L17, and the background for why the limit was set there is in containerd/containerd#3201.

I believe this also fixes the issue referenced in #118 (comment), where rimusz said NFS does not work in kind.

Set LimitNOFILE=1048576 in containerd unit file to match upstream containerd.

The Ubuntu containerd package sets LimitNOFILE=infinity, but this causes issues with various applications: NFS and MySQL are unable to run in kind on some systems.

This fixes the containerd limit to match upstream containerd and resolves issues with NFS and MySQL.
@k8s-ci-robot
Contributor

Welcome @jbarrick-mesosphere!

It looks like this is your first PR to kubernetes-sigs/kind 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kind has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 7, 2019
@k8s-ci-robot
Contributor

Hi @jbarrick-mesosphere. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 7, 2019
@jbarrick-mesosphere
Contributor Author

jbarrick-mesosphere commented Aug 7, 2019

I finally found the setting on the host that causes this:

$ sudo sysctl fs.nr_open
fs.nr_open = 1073741816
$ sudo sysctl -w fs.nr_open=1048576
fs.nr_open = 1048576
$ sudo sysctl fs.nr_open           
fs.nr_open = 1048576
$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.15.0) 🖼
 ✓ Preparing nodes 📦 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl cluster-info
$ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'                                          
                                                                                       
1048576
pod "test" deleted
$ kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p
                                                                                                                                 
rpcinfo: can't contact portmapper: RPC: Remote system error - No such file or directory
pod default/test terminated (Error)
$

I think the root cause of this is:

  • kind uses systemd 240
  • Arch Linux sets sysctl fs.nr_open set to a high value.

It seems related to this issue (also where I found the fix): systemd/systemd#11510

I do still think the best thing to do is set LimitNOFILE in our containerd.service, but in the meantime it can be worked around at the host level by setting sudo sysctl -w fs.nr_open=1048576.

@aojea
Contributor

aojea commented Aug 7, 2019

Thanks for the detailed description. Just a few comments, because the relation between the file descriptor limit and the problems you are referencing isn't clear to me.
The MariaDB issue that you are referencing is closed because it's a problem that was fixed in MariaDB.
The NFS issues in kind are because NFS support for overlayfs was added in kernel version 4.16; previous kernel versions won't allow using NFS with kind since you can't export the filesystem: http://lkml.iu.edu/hypermail/linux/kernel/1802.0/02787.html

@jbarrick-mesosphere
Contributor Author

The applications seem to do big up-front memory allocations based on the number of file descriptors they have available.

For example, NFS gets the number of file descriptors using this method:

And then does an allocation based on that:

In this case, that allocation would be 1073741816 * sizeof(int) (4 bytes) ≈ 4.3 GB.
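To make that arithmetic concrete, here is a small Python sketch of the sizing pattern described above (a hypothetical illustration only — the real NFS code is C, and the 4-byte slot size is an assumption):

```python
import resource

# The pattern described above: size a per-descriptor table from the
# RLIMIT_NOFILE soft limit, one 4-byte int per descriptor slot.
def table_bytes(nofile_limit, slot_size=4):
    return nofile_limit * slot_size

# The limit seen inside the broken cluster (fs.nr_open on Arch Linux):
print(table_bytes(1073741816))  # 4294967264 bytes, roughly 4.3 GB

# For comparison, the same table sized from this process's soft limit:
soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(table_bytes(soft))
```

With a sane soft limit of 1048576 the table is only 4 MB, which is why the patched image behaves normally.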

@jbarrick-mesosphere
Contributor Author

jbarrick-mesosphere commented Aug 7, 2019

You are also correct that this is fixed if I try MariaDB, but it doesn't look like the fix made it into MySQL. It seems to be a similar issue of a large up-front allocation: systemd/systemd#11510 (comment)

@aojea
Contributor

aojea commented Aug 7, 2019

I'm not arguing against that, but I can't see evidence that's happening in your NFS example. Have you double-checked with top or ps that the NFS pod is allocating 5 GB of RAM? How much RAM do you have in your system?

@aojea
Contributor

aojea commented Aug 7, 2019

My point is that this is an application problem: allocating memory based on the number of file descriptors without the correct sanity checks. I'm not in favor of working around that in kind :)

@jbarrick-mesosphere
Contributor Author

For NFS, I showed the OOM kill from dmesg:

[ 4019.292178] oom_reaper: reaped process 10173 (rpcinfo), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

And here's the RAM usage from htop:

[htop screenshots: 2019-08-07-150730_1277x258_scrot, 2019-08-07-150822_1278x238_scrot]

I don't really think it's a workaround in kind to align it with the containerd upstream configuration.

@aojea
Contributor

aojea commented Aug 7, 2019

Please don't misunderstand me, I just want to fully understand this so we can understand the consequences; we don't know if there are kind users that need to go beyond the 1048576 limit.

It seems this setting on Arch Linux systems, where DefaultLimitNOFILE=infinity is set in /etc/systemd/system.conf, is affecting more software (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=920913), and the affected programs are starting to be patched. This confirms the theory that it is a bug in the applications, which should be fixed there.

Also, I was checking the containerd issue you mentioned, and the rationale for including a limit upstream was just the opposite: increasing the limit.

recent version of Container Optimized OS and saw that the process has the limit set to 65536.
I believe it should be set to 1048576 like docker does right now: moby/moby@428d733

@neolit123
Member

neolit123 commented Aug 8, 2019

The applications seem to do big up-front memory allocations based on the number of file descriptors they have available.

on Ubuntu 17.10 for me getrlimit(RLIMIT_NOFILE, &rl) returns 1000(1024) for the hard limit, which is a sane default.

for systemd:

sudo sysctl fs.nr_open
fs.nr_open = 1048576

but if an application is making the assumption that it can mmap any arbitrary getrlimit value without capping it, it can easily OOM. so that seems like a bug on the app side.

in terms of capping the containerd service in kind, i guess it's OK, if Infinity was problematic in systemd at some point.
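The capping neolit123 describes can be sketched as follows (illustrative only — the affected programs are C, but the logic is the same: clamp whatever getrlimit reports before using it to size allocations; the name usable_fd_limit and the ceiling constant are assumptions for this sketch):

```python
import resource

# A sane ceiling; 1048576 is the value this PR sets via LimitNOFILE.
SANE_MAX_FDS = 1048576

def usable_fd_limit():
    """Return the RLIMIT_NOFILE soft limit, clamped to a sane ceiling."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY is a sentinel value; treat it as "unbounded".
    if soft == resource.RLIM_INFINITY or soft > SANE_MAX_FDS:
        return SANE_MAX_FDS
    return soft

print(usable_fd_limit())  # never exceeds 1048576
```

An application sizing its descriptor tables from usable_fd_limit() instead of the raw getrlimit value cannot be tricked into a multi-gigabyte allocation by LimitNOFILE=infinity.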

@BenTheElder
Member

Matching the upstream unit makes sense to me, I assume the Ubuntu package just doesn't have this change yet as their release train takes a bit.

For precedent: the other override we apply is also a patch I sent to the upstream unit that hasn't landed in the package we have installed yet.

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 8, 2019
Member

@BenTheElder BenTheElder left a comment


Thanks for the PR!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, jbarrick-mesosphere

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 8, 2019
@BenTheElder
Member

(I would also agree that these seem like bugs but, this decision to change the limit was already made upstream anyhow and seems reasonable to me)

@aojea
Contributor

aojea commented Aug 8, 2019

/retest

Well, it seems this is only happening on Arch Linux; that's the only distro with no limit in systemd.
With this we only solve it for the apps that run inside kind; the ones that run outside containerd will still fail.
Totally agree with all of you on capping it, but I'd prefer to fix it in Ubuntu upstream rather than only in our images.

@k8s-ci-robot k8s-ci-robot merged commit 5315b86 into kubernetes-sigs:master Aug 8, 2019
@aojea
Contributor

aojea commented Aug 8, 2019

By the way thanks @jbarrick-mesosphere for your patience and your explanations 😄

@aojea
Contributor

aojea commented Aug 8, 2019

For reference https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1839445

@jbarrick-mesosphere
Contributor Author

Thanks everyone 👍

@kwenzh

kwenzh commented Aug 11, 2023

The applications seem to do big up-front memory allocations based on the number of file descriptors they have available.

For example, NFS gets the number of file descriptors using this method:

And then does an allocation based on that:

In this case, that allocation would be 1073741816 * sizeof(int) (4 bytes) ≈ 4.3 GB.

The same issue here: ulimit -n is 1073741816, and when mounting NFS it happens:
mount.nfs: timeout set for Fri Aug 11 02:43:32 2023
mount.nfs: trying text-based options 'nolock,tcp,vers=3,addr=10.72xxx'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query failed: RPC: Success
mount.nfs: mount to NFS server '10.72xx:/_test' failed: RPC Error: Success

6 participants