
use infinity for LimitNOFILE #1799

Merged
merged 1 commit into kubernetes-sigs:master on Aug 21, 2020

Conversation

BenTheElder
Member

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 21, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 21, 2020
Member

@neolit123 neolit123 left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 21, 2020
@k8s-ci-robot k8s-ci-robot merged commit 6dbfc9b into kubernetes-sigs:master Aug 21, 2020
@aojea
Contributor

aojea commented Aug 21, 2020

what can I say ;)
#760 (comment)

@neolit123
Member

neolit123 commented Aug 21, 2020 via email

@BenTheElder
Member Author

In both kind PRs we are matching the upstream containerd unit file at that point in time. The containerd PR is linked here.

@thomasjm

thomasjm commented Jul 23, 2022

Oh whoops, thanks for pointing out the PR. I'd like to make a bit of an argument against this change :)

"Because containerd does it" doesn't seem that compelling, when looking through containerd's reasoning seems to be a mix of "because we can" and "because we want to disable accounting (presumably because of some perf overhead from this, but I couldn't immediately find numbers on it)".

The reasoning above ignores the fact that this harms reproducibility in kind, because now your LimitNOFILE gets inherited from whatever the host system chooses. This can cause any number of application failures, both due to overly low and overly high limits. (Indeed, both this PR and the containerd one have a wave of application breakage issues referencing them.)

For example: things like NFS and MySQL get unhappy when LimitNOFILE is too low, while other applications listed in this PR allocate too much memory when the limit is high. In my case, a high limit makes fork/exec take too long because an application tries to close all file descriptors by looping over every descriptor up to the limit. It's hard to fully blame this on an "application bug," because this looping technique is the traditional POSIX way to do it, and only in recent years have platform-specific alternatives emerged.
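For concreteness, here is one way to see what limit workloads actually end up with (a minimal sketch, assuming the default node container name kind-control-plane); with LimitNOFILE=infinity this is not a fixed number, it depends on the host:

```sh
# Inspect the open-files limit that containerd (and therefore the pods it
# starts) ended up with inside a kind node.
docker exec kind-control-plane sh -c \
  'grep -i "open files" /proc/$(systemctl show -p MainPID --value containerd)/limits'
```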

Is there any way we could revisit this? Unless the perf issue is actually significant it seems better to me to keep the explicit limit.

@BenTheElder
Member Author

Sorry, I've been on vacation and not fully available.

"Because containerd does it" doesn't seem that compelling,

This is the reason both changes were ultimately accepted, matching upstream.
We strive to be as vanilla as possible, excluding deviations that are necessary.

The reasoning above ignores the fact that this harms reproducibility in kind, because now your LimitNOFILE gets inherited from whatever the host system chooses. This can cause any number of application failures, both due to overly low and overly high limits. (Indeed, both this PR and the containerd one have a wave of application breakage issues referencing them.)

Resource limits are simply not going to be reproducible in KIND, and it's not a suitable tool for testing them. When you share a host kernel there are any number of leaky limits; for example, inotify limits are not namespaced.

Is there any way we could revisit this? Unless the perf issue is actually significant it seems better to me to keep the explicit limit.

I'm not fully convinced. If we had reason to expect a standard limit on production clusters, maybe; but as it stands, applications need to handle varying limits better if they're meant to be portable to different Kubernetes clusters. kind's goal is to test Kubernetes itself, and to help test applications on standard, portable Kubernetes.

@thomasjm

thomasjm commented Aug 5, 2022

Sorry for the delay, I had to think about this a bit and explore alternatives.

I'll make a halfhearted argument but really I'd settle for a workaround, perhaps a way to make this configurable?

Resource limits are something that are just not going to be reproducible in KIND and it's not a suitable tool for testing them...kind's goal is to test Kubernetes itself, and to help test applications on Kubernetes, standard, portable Kubernetes

Right, Kind's main page says it is primarily for testing Kubernetes, but also for local development or CI. Speaking as one of the latter types of users, it seems to me that Kind's value is in bundling up the huge complexity of a Kubernetes deployment and setting it up for me in a reproducible way. From my perspective, there's no essential difference between resource limits and any other kind of complexity that Kind encapsulates. Nor is the fact that other leaky limits exist a great argument against improving reproducibility by pinning down some limits. (Maybe those other limits should be pinned down too!)

FWIW, I think setting a limit like 1 million for RLIMIT_NOFILE would be representative of normal systems and cloud Kubernetes providers; that's the number my Ubuntu box comes with. The NixOS machine where I run CI, on the other hand, chooses 1 billion, and going with that number causes application problems that are hard to resolve :P

However, I'd be totally happy if there were a way for a Kind user to tweak this. I tried passing --default-ulimit nofile=1024000:1024000 to the Docker daemon; that works for normal Docker containers but seems to be overridden by Kind. I tried an init container calling ulimit -n 1024000 but that didn't work either. Is there perhaps a way to adjust how Kind does it without totally recompiling Kind?

@BenTheElder
Member Author

Right, Kind's main page says it is primarily for testing Kubernetes, but also for local development or CI. Speaking as one of the latter types of users, it seems to me that Kind's value is in bundling up the huge complexity of a Kubernetes deployment and setting it up for me in a reproducible way.

Er, right, but the intention is for kind to represent, to the extent possible, 100% standard Kubernetes, so there's some confidence that if an app runs locally it will run in production.

It's also very important that when we test Kubernetes, we're testing as close to standard / default as possible so we're not giving Kubernetes CI a false sense of working due to being in a non-standard configuration.

If you're running on a bog standard Kubernetes cluster using containerd as the runtime, odds are very good the containerd systemd unit matches upstream, which does not set an arbitrary fixed limit.

FWIW, I think setting a limit like 1 million for RLIMIT_NOFILE would be representative of normal systems and cloud Kubernetes providers. That's the number my Ubuntu box comes with.

For the host, perhaps, but not if you install the containerd package? Unless Ubuntu is providing their own packaging and overriding this; last I checked theirs matched upstream, but it's been a while.

Regarding cloud Kubernetes providers, I checked GKE, since I work there and my work covers creating those clusters, and it is infinity as expected:

```sh
bentheelder@gke-cluster-1-default-pool-eb5c9ce6-7052 ~ $ cat /usr/lib/systemd/system/multi-user.target.wants/containerd.service | grep -i nofile
LimitNOFILE=infinity
```

Nor is the fact that other leaky limits exist a great argument against improving reproducibility by pinning down some limits. (Maybe those other limits should be pinned down too!)

They would be, if it were feasible, because nodes should not all be pretending to have the host's full CPU, memory, etc. But it turns out nested Docker containers are not particularly compatible with doing this, so kind and similarly constructed nested-container alternatives are currently a bad way to test things around resource limits.

I tried an init container calling ulimit -n 1024000 but that didn't work either. Is there perhaps a way to adjust how Kind does it without totally recompiling Kind?

To implement what you're describing, you'd need to set the ulimit on the node, or within the application container. You should be able to do that; for example, you can customize the node image you're using, or you can exec into the node and change it.
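For example, a rough sketch of the exec-into-the-node approach, using a systemd drop-in rather than editing the unit file in place (assumes the default node container name kind-control-plane; the 1048576 value is illustrative):

```sh
# Override LimitNOFILE for containerd on a running kind node via a systemd
# drop-in, then restart containerd so the new limit takes effect.
docker exec kind-control-plane sh -c '
  mkdir -p /etc/systemd/system/containerd.service.d
  printf "[Service]\nLimitNOFILE=1048576\n" \
    > /etc/systemd/system/containerd.service.d/10-limit-nofile.conf
  systemctl daemon-reload
  systemctl restart containerd
'
```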

In my case, a high limit makes fork/exec takes too long because an application tries to close all file descriptors by looping over all descriptors up to the limit. It's hard to fully blame this on an "application bug," because this looping technique is the traditional POSIX way to do it and only in recent years have platform-specific alternatives emerged.

I can't say I've ever seen an application do this; do you have any pointers? I would expect reasonable applications to track their file handles rather than brute-force closing them ...


TL;DR: I would still prefer to make changes to this unit file in containerd upstream, and kind will match them.
I have a feeling the maintainers would not be persuaded by this application behavior, but I can't say for certain.

The same problem should be hit by this application on any non-kind cluster shipping the current upstream containerd unit.

@thomasjm

thomasjm commented Aug 5, 2022

Regarding cloud kubernetes providers, I checked GKE since I work there and work covers creating those, and it is infinity as expected.

Ah, but the VM's kernel itself (or some other mechanism, I don't pretend to fully understand this) sets a lower limit. The existence of limits like this on most systems is probably why this problem doesn't come up more often. On GKE:

```sh
tom@gke-... ~ $ cat /proc/1/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max open files            1048576              1048576              files
```

Thus, the decision for Kind to set the limit to infinity isn't really a decision for "no limit"; it's a decision to "take whatever number the host kernel uses and hope you're on a sane system." Unfortunately NixOS seems not to be so sane. This explains why applications with the problematic behavior usually perform fine on clusters.

it turns out nested docker containers are not particularly compatible with doing this, so kind and similarly constructed nested-container alternatives are currently a bad way to test things around resource limits.

Ah, that's news to me. Although I don't really want to test them, just make them be reliably sane.

You should be able to do that, for example you can customize the node image you're using, or you can exec to the node and change it.

Okay, thanks! I guess I can do the former with some Nix magic or the latter with an exec to sed the unit file and then systemctl restart...

I can't say I've ever seen an application do this, do you have any pointers?

The thing I'm wrestling with is an insufficiently sophisticated file descriptor closing system in Haskell's process library. Someone else experienced the same thing with an SFTP server.

I have a feeling the maintainers would not be persuaded by this application behavior but I can't say for certain.

Oh, I don't think I'll suggest it there; containerd is a general-purpose tool and I wouldn't advocate for this change in general.

@thomasjm

thomasjm commented Aug 8, 2022

Thinking about it a bit more, maybe the right place to apply consistent limits would actually be the Docker configuration, since that seems like the part most analogous to the GKE VM. This sort of thing can be done by passing --ulimit to docker run.

One thing that's a complete mystery to me is why it didn't work when I tried setting the default-ulimits field in the Docker daemon config. It seemed to have no effect; does Kind invoke Docker in some way that circumvents these limits? Or does some component in the system raise its limits to the system limit?
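For reference, what was tried looks roughly like this (a sketch; the values are illustrative, and the daemon needs a restart to pick up the change):

```sh
# /etc/docker/daemon.json with a default nofile ulimit for every container the
# daemon starts (which would include kind nodes), then restart the daemon.
cat /etc/docker/daemon.json
{
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Soft": 1024000, "Hard": 1024000 }
  }
}
sudo systemctl restart docker
```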

@BenTheElder
Member Author

BenTheElder commented Aug 10, 2022

The thing I'm wrestling with is an insufficiently sophisticated file descriptor closing system in haskell/process#189. Someone else experienced the same thing with an SFTP server.

Interesting, thanks.

Thus, the decision for Kind to set the limit to Infinity isn't really a decision for "no limit," it's the decision to "take whatever number the host kernel uses and hope you're on a sane system." Unfortunately NixOS seems not to be so sane. This explains why applications with the problematic behavior usually perform fine on clusters.

That's the same decision in containerd upstream?

If we want to limit differently than the host, we should probably do so at the node level, but then the question is: if the host doesn't have a reasonable limit, what limit should we set for the kind node, and why? This would be a breaking change for another user with a particular host limit; right now they can instead tune their host limit.

Sounds like, if not the Haskell library, then NixOS should be reconfigured?

It seemed to have no effect, does Kind invoke Docker in some way that circumvents these limits? Or does some component in the system raise its limits to the system limit?

Not to my knowledge; kind does not invoke Docker with these settings, so the default should take effect. Docker may need to be restarted first.

Relatedly, I think you should be able to set the ulimit in the application container (not another container in the pod) if you only need to ensure a maximum (you can't raise the limit, but you can lower it), just by calling ulimit: docker-library/rabbitmq#545 (comment)
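A minimal sketch of that (the entrypoint path is a placeholder; this can only lower the limit, never raise it):

```sh
# Wrap the container's command so the soft limit is dropped before the real
# entrypoint starts; a process may always lower its own RLIMIT_NOFILE.
sh -c 'ulimit -Sn 1048576 && exec /usr/local/bin/my-app'
```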

... also related kubernetes/kubernetes#3595 😞
