
Set LimitNOFILE=1048576 in containerd unit file to match upstream containerd. #760

Merged
merged 1 commit into kubernetes-sigs:master on Aug 8, 2019

Conversation

jbarrick-mesosphere
Contributor

@jbarrick-mesosphere jbarrick-mesosphere commented Aug 7, 2019

The Ubuntu containerd package sets LimitNOFILE=infinity, but this causes issues with various applications: NFS and MySQL are unable to run in kind on some systems.

This fixes the containerd limit to match upstream containerd and resolves issues with NFS and MySQL.

To reproduce the issue, on the host run sudo sysctl -w fs.nr_open=1073741816 (this is the default on Arch Linux), then follow the steps below.

Broken:

kind create cluster
export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
# print the default ulimit -n:
kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
# run a broken service (this will eventually be OOM killed by the kernel):
# NFS gets OOM killed:
kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p
# MySQL uses tons of memory and is extremely slow (sometimes it gets OOM killed):
kubectl run mysql -it --restart=Never --env=MYSQL_ROOT_PASSWORD=test --env=MYSQL_DATABASE=kudo --image=mysql:5.7
kind delete cluster

Broken output:

$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.15.0) 🖼
 ✓ Preparing nodes 📦 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl cluster-info
$ export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
$ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
1073741816
pod "test" deleted
$ kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p
If you don't see a command prompt, try pressing enter.
pod default/test terminated (Error)
$ dmesg  |grep oom
[ 4019.292178] oom_reaper: reaped process 10173 (rpcinfo), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
$ kubectl run mysql -it --restart=Never --env=MYSQL_ROOT_PASSWORD=test --env=MYSQL_DATABASE=kudo --image=mysql:5.7

If you don't see a command prompt, try pressing enter.
Initializing database
2019-08-07T20:20:37.948900Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
...
2019-08-07T20:21:40.875839Z 0 [Note] mysqld: ready for connections.
Version: '5.7.27'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)

Fixed:

# Now try with the patch:
kind create cluster --image=jbarrickmesosphere/node:v1.15.0
export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
# print the default ulimit -n:
kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
# the services work now:
# NFS does not get OOM killed (the command runs - which errors, but that is expected):
kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p
# MySQL takes five seconds instead of a minute to start on my machine:
kubectl run mysql -it --restart=Never --env=MYSQL_ROOT_PASSWORD=test --env=MYSQL_DATABASE=kudo --image=mysql:5.7

Fixed output:

$ docker build -t jbarrickmesosphere/node:v1.15.0 .
Sending build context to Docker daemon  6.656kB
Step 1/2 : FROM kindest/node:v1.15.0
 ---> 1a930cac869e
Step 2/2 : COPY 10-limits.conf /etc/systemd/system/containerd.service.d/10-limits.conf
 ---> 4b3f03870e73
Successfully built 4b3f03870e73
Successfully tagged jbarrickmesosphere/node:v1.15.0
$ kind create cluster --image=jbarrickmesosphere/node:v1.15.0

Creating cluster "kind" ...
 ✓ Ensuring node image (jbarrickmesosphere/node:v1.15.0) 🖼
 ✓ Preparing nodes 📦 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl cluster-info
$ export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
                                                                      
$ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'

Error from server (Forbidden): pods "test" is forbidden: error looking up service account default/default: serviceaccount "default" not found
$ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
                                                                                       
1048576
pod "test" deleted
$ kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p

rpcinfo: can't contact portmapper: RPC: Remote system error - No such file or directory
pod default/test terminated (Error)
$ kubectl run mysql -it --restart=Never --env=MYSQL_ROOT_PASSWORD=test --env=MYSQL_DATABASE=kudo --image=mysql:5.7

If you don't see a command prompt, try pressing enter.
2019-08-07T20:30:46.856020Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2019-08-07T20:30:51.285083Z 0 [Note] mysqld: ready for connections.
Version: '5.7.27'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server (GPL)
^C
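For reference, the drop-in baked into the patched image above might look like the following (a sketch: the path matches the COPY destination in the Dockerfile output, and the value mirrors the LimitNOFILE line in the upstream containerd unit):

```ini
# /etc/systemd/system/containerd.service.d/10-limits.conf
# Override the Ubuntu package's LimitNOFILE=infinity with the upstream value.
[Service]
LimitNOFILE=1048576
```

systemd merges drop-ins under containerd.service.d/ into the unit at daemon start, so the override takes effect when the node boots.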

We can see upstream's unit file here: https://github.com/containerd/containerd/blob/d57cf6f151e444d125407526bf58bb8e79c5e47a/containerd.service#L17, and the background for why the limit was set there is in containerd/containerd#3201.

I believe this also fixes the issue referenced in #118 (comment), where rimusz said NFS does not work in kind.

Set LimitNOFILE=1048576 in containerd unit file to match upstream containerd.

The Ubuntu containerd package sets LimitNOFILE=infinity, but this causes issues with various applications: NFS and MySQL are unable to run in kind on some systems.

This fixes the containerd limit to match upstream containerd and resolves issues with NFS and MySQL.
@k8s-ci-robot
Contributor

Welcome @jbarrick-mesosphere!

It looks like this is your first PR to kubernetes-sigs/kind 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kind has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 7, 2019
@k8s-ci-robot
Contributor

Hi @jbarrick-mesosphere. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 7, 2019
@jbarrick-mesosphere
Contributor Author

jbarrick-mesosphere commented Aug 7, 2019

I finally found the setting on the host that causes this:

$ sudo sysctl fs.nr_open
fs.nr_open = 1073741816
$ sudo sysctl -w fs.nr_open=1048576
fs.nr_open = 1048576
$ sudo sysctl fs.nr_open           
fs.nr_open = 1048576
$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.15.0) 🖼
 ✓ Preparing nodes 📦 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl cluster-info
$ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'                                          
                                                                                       
1048576
pod "test" deleted
$ kubectl run -it --restart=Never test --image=quay.io/kubernetes_incubator/nfs-provisioner:latest --command -- rpcinfo -p
                                                                                                                                 
rpcinfo: can't contact portmapper: RPC: Remote system error - No such file or directory
pod default/test terminated (Error)
$

I think the root cause of this is:

  • kind uses systemd 240
  • Arch Linux sets sysctl fs.nr_open set to a high value.

It seems related to this issue (also where I found the fix): systemd/systemd#11510

I do still think the best thing to do is set LimitNOFILE in our containerd.service, but in the meantime it can be worked around at the host level by setting sudo sysctl -w fs.nr_open=1048576.

@aojea
Contributor

aojea commented Aug 7, 2019

Thanks for the detailed description. Just a few comments, because the relation between the file descriptor limit and the problems you are referencing isn't clear to me.
The MariaDB issue that you are referencing is closed because it's a problem that was fixed in MariaDB.
The NFS issues in kind are because NFS support for overlayfs was added in kernel version 4.16; previous kernel versions won't allow using NFS with kind since you can't export the filesystem: http://lkml.iu.edu/hypermail/linux/kernel/1802.0/02787.html

@jbarrick-mesosphere
Contributor Author

The applications seem to do big up-front memory allocations based on the number of file descriptors they have available.

For example, NFS gets the number of file descriptors using this method:

And then does an allocation based on that:

In this case, that allocation would be 1073741816 * sizeof(int) (4 bytes) ≈ 4.3 GB.
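To make that arithmetic concrete, here is a small Python sketch of the sizing pattern described above (a hypothetical illustration only — the real NFS code is C, and the 4-byte slot size is an assumption):

```python
import resource

# The pattern described above: size a per-descriptor table from the
# RLIMIT_NOFILE soft limit, one 4-byte int per descriptor slot.
def table_bytes(nofile_limit, slot_size=4):
    return nofile_limit * slot_size

# The limit seen inside the broken cluster (fs.nr_open on Arch Linux):
print(table_bytes(1073741816))  # 4294967264 bytes, roughly 4.3 GB

# For comparison, the same table sized from this process's soft limit:
soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(table_bytes(soft))
```

With a sane soft limit of 1048576 the table is only 4 MB, which is why the patched image behaves normally.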

@jbarrick-mesosphere
Contributor Author

jbarrick-mesosphere commented Aug 7, 2019

You are also correct that this is fixed if I try MariaDB, but it doesn't look like the fix made it into MySQL. It seems to be a similar issue of a large up-front allocation: systemd/systemd#11510 (comment)

@aojea
Contributor

aojea commented Aug 7, 2019

I'm not arguing against that, but I can't see evidence that's happening in your NFS example. Have you double-checked with top or ps that the NFS pod is allocating 5 GB of RAM? How much RAM do you have in your system?

@aojea
Contributor

aojea commented Aug 7, 2019

My point is that this is an application problem: allocating memory based on the number of file descriptors without the correct sanity checks. I'm not in favor of working around that in kind :)

@jbarrick-mesosphere
Contributor Author

For NFS, I showed the OOM kill from dmesg:

[ 4019.292178] oom_reaper: reaped process 10173 (rpcinfo), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

And here's the RAM usage from htop:

[htop screenshots: 2019-08-07-150730_1277x258_scrot, 2019-08-07-150822_1278x238_scrot]

I don't really think it's a workaround in kind to align it with the containerd upstream configuration.

@aojea
Contributor

aojea commented Aug 7, 2019

Please don't misunderstand me, I just want to fully understand this so we can understand the consequences; we don't know if there are kind users that need to go beyond the 1048576 limit.

It seems this setting on Arch Linux systems, where DefaultLimitNOFILE=infinity is set in /etc/systemd/system.conf, is affecting more software (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=920913), and the affected programs are starting to be patched. This confirms the theory that it is a bug in the applications, which should be fixed there.

Also, I was checking the containerd issue you mentioned, and the rationale for including a limit upstream was just the opposite: increasing the limit.

recent version of Container Optimized OS and saw that the process has the limit set to 65536.
I believe it should be set to 1048576 like docker does right now: moby/moby@428d733

@neolit123
Member

neolit123 commented Aug 8, 2019

The applications seem to do big up-front memory allocations based on the number of file descriptors they have available.

on Ubuntu 17.10 for me getrlimit(RLIMIT_NOFILE, &rl) returns 1000(1024) for the hard limit, which is a sane default.

for systemd:

sudo sysctl fs.nr_open
fs.nr_open = 1048576

but if an application is making the assumption that it can mmap any arbitrary getrlimit value without capping it, it can easily OOM. so that seems like a bug on the app side.

in terms of capping the containerd service in kind, i guess it's OK, if Infinity was problematic in systemd at some point.
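The capping neolit123 describes can be sketched as follows (illustrative only — the affected programs are C, but the logic is the same: clamp whatever getrlimit reports before using it to size allocations; the name usable_fd_limit and the ceiling constant are assumptions for this sketch):

```python
import resource

# A sane ceiling; 1048576 is the value this PR sets via LimitNOFILE.
SANE_MAX_FDS = 1048576

def usable_fd_limit():
    """Return the RLIMIT_NOFILE soft limit, clamped to a sane ceiling."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY is a sentinel value; treat it as "unbounded".
    if soft == resource.RLIM_INFINITY or soft > SANE_MAX_FDS:
        return SANE_MAX_FDS
    return soft

print(usable_fd_limit())  # never exceeds 1048576
```

An application sizing its descriptor tables from usable_fd_limit() instead of the raw getrlimit value cannot be tricked into a multi-gigabyte allocation by LimitNOFILE=infinity.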

@BenTheElder
Member

Matching the upstream unit makes sense to me, I assume the Ubuntu package just doesn't have this change yet as their release train takes a bit.

For precedent: the other override we apply is also a patch I sent to the upstream unit that hasn't landed in the package we have installed yet.

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 8, 2019
Member

@BenTheElder BenTheElder left a comment


Thanks for the PR!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, jbarrick-mesosphere

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 8, 2019
@BenTheElder
Member

(I would also agree that these seem like bugs but, this decision to change the limit was already made upstream anyhow and seems reasonable to me)

@aojea
Contributor

aojea commented Aug 8, 2019

/retest

Well, it seems this is only happening on Arch Linux; that's the only distro with no limit in systemd.
With this we only solve it for the apps that run inside kind; the ones that run outside containerd will still fail.
Totally agree with all of you on capping it, but I'd prefer to fix it in Ubuntu upstream rather than only in our images.

@k8s-ci-robot k8s-ci-robot merged commit 5315b86 into kubernetes-sigs:master Aug 8, 2019
@aojea
Contributor

aojea commented Aug 8, 2019

By the way thanks @jbarrick-mesosphere for your patience and your explanations 😄

@aojea
Contributor

aojea commented Aug 8, 2019

For reference https://bugs.launchpad.net/ubuntu/+source/containerd/+bug/1839445

@jbarrick-mesosphere
Contributor Author

Thanks everyone 👍

@kwenzh

kwenzh commented Aug 11, 2023

The applications seem to do big up-front memory allocations based on the number of file descriptors they have available.

For example, NFS gets the number of file descriptors using this method:

And then does an allocation based on that:

In this case, that allocation would be 1073741816 * sizeof(int) (4 bytes) ≈ 4.3 GB.

The same issue here: ulimit -n is 1073741816, and when mounting NFS it happens:
mount.nfs: timeout set for Fri Aug 11 02:43:32 2023
mount.nfs: trying text-based options 'nolock,tcp,vers=3,addr=10.72xxx'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: portmap query failed: RPC: Success
mount.nfs: mount to NFS server '10.72xx:/_test' failed: RPC Error: Success

6 participants