
use infinity for LimitNOFILE #1799

Merged
merged 1 commit into kubernetes-sigs:master on Aug 21, 2020

Conversation

BenTheElder
Member

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 21, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 21, 2020
Member

@neolit123 neolit123 left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 21, 2020
@k8s-ci-robot k8s-ci-robot merged commit 6dbfc9b into kubernetes-sigs:master Aug 21, 2020
@aojea
Contributor

aojea commented Aug 21, 2020

what can I say ;)
#760 (comment)

@neolit123
Member

neolit123 commented Aug 21, 2020 via email

@BenTheElder
Member Author

In both kind PRs we are matching the upstream containerd unit file at that point in time. The containerd PR is linked here.

@thomasjm

thomasjm commented Jul 23, 2022

Oh whoops, thanks for pointing out the PR. I'd like to make a bit of an argument against this change :)

"Because containerd does it" doesn't seem that compelling, when looking through containerd's reasoning seems to be a mix of "because we can" and "because we want to disable accounting (presumably because of some perf overhead from this, but I couldn't immediately find numbers on it)".

The reasoning above ignores the fact that this harms reproducibility in kind, because now your LimitNOFILE gets inherited from whatever the host system chooses. This can cause any number of application failures, both due to overly low and overly high limits. (Indeed, both this PR and the containerd one have a wave of application breakage issues referencing them.)

For example: things like NFS and MySQL get unhappy when LimitNOFILE is too low, while other applications listed in this PR allocate too much memory when the limit is high. In my case, a high limit makes fork/exec take too long because an application tries to close all file descriptors by looping over every descriptor up to the limit. It's hard to fully blame this on an "application bug," because this looping technique is the traditional POSIX way to do it, and only in recent years have platform-specific alternatives emerged.
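For concreteness, here is one way to see what limit workloads actually end up with (a minimal sketch, assuming the default node container name kind-control-plane); with LimitNOFILE=infinity this is not a fixed number, it depends on the host:

```sh
# Inspect the open-files limit that containerd (and therefore the pods it
# starts) ended up with inside a kind node.
docker exec kind-control-plane sh -c \
  'grep -i "open files" /proc/$(systemctl show -p MainPID --value containerd)/limits'
```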

Is there any way we could revisit this? Unless the perf issue is actually significant it seems better to me to keep the explicit limit.

@BenTheElder
Member Author

Sorry, I've been on vacation and not fully available.

"Because containerd does it" doesn't seem that compelling,

This is the reason both changes were ultimately accepted, matching upstream.
We strive to be as vanilla as possible, excluding deviations that are necessary.

The reasoning above ignores the fact that this harms reproducibility in kind, because now your LimitNOFILE gets inherited from whatever the host system chooses. This can cause any number of application failures, both due to overly low and overly high limits. (Indeed, both this PR and the containerd one have a wave of application breakage issues referencing them.)

Resource limits are simply not going to be reproducible in KIND, and it's not a suitable tool for testing them. When you share a host kernel there are any number of leaky limits; for example, inotify limits are not namespaced.

Is there any way we could revisit this? Unless the perf issue is actually significant it seems better to me to keep the explicit limit.

I'm not fully convinced. If we had reason to expect a standard limit on production clusters, maybe; but as it stands, applications need to handle varying limits better if they're meant to be portable to different Kubernetes clusters. kind's goal is to test Kubernetes itself, and to help test applications on standard, portable Kubernetes.

@thomasjm

thomasjm commented Aug 5, 2022

Sorry for the delay, I had to think about this a bit and explore alternatives.

I'll make a halfhearted argument but really I'd settle for a workaround, perhaps a way to make this configurable?

Resource limits are something that are just not going to be reproducible in KIND and it's not a suitable tool for testing them...kind's goal is to test Kubernetes itself, and to help test applications on Kubernetes, standard, portable Kubernetes

Right, Kind's main page says it is primarily for testing Kubernetes, but also for local development or CI. Speaking as one of the latter types of users, it seems to me that Kind's value is in bundling up the huge complexity of a Kubernetes deployment and setting it up for me in a reproducible way. From my perspective, there's no essential difference between resource limits and any other kind of complexity that Kind encapsulates. Nor is the fact that other leaky limits exist a great argument against improving reproducibility by pinning down some limits. (Maybe those other limits should be pinned down too!)

FWIW, I think setting a limit like 1 million for RLIMIT_NOFILE would be representative of normal systems and cloud Kubernetes providers; that's the number my Ubuntu box comes with. The NixOS machine where I run CI, on the other hand, chooses 1 billion, and going with that number causes application problems that are hard to resolve :P

However, I'd be totally happy if there were a way for a Kind user to tweak this. I tried passing --default-ulimit nofile=1024000:1024000 to the Docker daemon; that works for normal Docker containers but seems to be overridden by Kind. I tried an init container calling ulimit -n 1024000 but that didn't work either. Is there perhaps a way to adjust how Kind does it without totally recompiling Kind?

@BenTheElder
Member Author

Right, Kind's main page says it is primarily for testing Kubernetes, but also for local development or CI. Speaking as one of the latter types of users, it seems to me that Kind's value is in bundling up the huge complexity of a Kubernetes deployment and setting it up for me in a reproducible way.

Er, right, but the intention is for kind to represent, to the extent possible, 100% standard Kubernetes, so there's some confidence that if an app runs locally it will run in production.

It's also very important that when we test Kubernetes, we're testing as close to standard / default as possible so we're not giving Kubernetes CI a false sense of working due to being in a non-standard configuration.

If you're running on a bog standard Kubernetes cluster using containerd as the runtime, odds are very good the containerd systemd unit matches upstream, which does not set an arbitrary fixed limit.

FWIW, I think setting a limit like 1 million for RLIMIT_NOFILE would be representative of normal systems and cloud Kubernetes providers. That's the number my Ubuntu box comes with.

For the host, perhaps, but not if you install the containerd package? Unless Ubuntu is providing their own packaging and overriding this; last I checked theirs matched upstream, but it's been a while.

Regarding cloud Kubernetes providers, I checked GKE, since I work there and my work covers creating those clusters, and it is infinity as expected:

```sh
bentheelder@gke-cluster-1-default-pool-eb5c9ce6-7052 ~ $ cat /usr/lib/systemd/system/multi-user.target.wants/containerd.service | grep -i nofile
LimitNOFILE=infinity
```

Nor is the fact that other leaky limits exist a great argument against improving reproducibility by pinning down some limits. (Maybe those other limits should be pinned down too!)

They would be, if it were feasible, because nodes should not all be pretending to have the host's full CPU, memory, etc. But it turns out nested Docker containers are not particularly compatible with doing this, so kind and similarly constructed nested-container alternatives are currently a bad way to test things around resource limits.

I tried an init container calling ulimit -n 1024000 but that didn't work either. Is there perhaps a way to adjust how Kind does it without totally recompiling Kind?

To implement what you're describing, you'd need to set the ulimit on the node, or within the application container. You should be able to do that; for example, you can customize the node image you're using, or you can exec into the node and change it.
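For example, a rough sketch of the exec-into-the-node approach, using a systemd drop-in rather than editing the unit file in place (assumes the default node container name kind-control-plane; the 1048576 value is illustrative):

```sh
# Override LimitNOFILE for containerd on a running kind node via a systemd
# drop-in, then restart containerd so the new limit takes effect.
docker exec kind-control-plane sh -c '
  mkdir -p /etc/systemd/system/containerd.service.d
  printf "[Service]\nLimitNOFILE=1048576\n" \
    > /etc/systemd/system/containerd.service.d/10-limit-nofile.conf
  systemctl daemon-reload
  systemctl restart containerd
'
```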

In my case, a high limit makes fork/exec takes too long because an application tries to close all file descriptors by looping over all descriptors up to the limit. It's hard to fully blame this on an "application bug," because this looping technique is the traditional POSIX way to do it and only in recent years have platform-specific alternatives emerged.

I can't say I've ever seen an application do this; do you have any pointers? I would expect reasonable applications to track their file handles rather than brute-force closing them ...


TL;DR: I would still prefer to make changes to this unit file in containerd upstream, and kind will match them.
I have a feeling the maintainers would not be persuaded by this application behavior, but I can't say for certain.

The same problem should be hit by this application on any non-kind cluster shipping the current upstream containerd unit.

@thomasjm

thomasjm commented Aug 5, 2022

Regarding cloud kubernetes providers, I checked GKE since I work there and work covers creating those, and it is infinity as expected.

Ah, but the VM's kernel itself (or some other mechanism, I don't pretend to fully understand this) sets a lower limit. The existence of limits like this on most systems is probably why this problem doesn't come up more often. On GKE:

```sh
tom@gke-... ~ $ cat /proc/1/limits
Limit                     Soft Limit           Hard Limit           Units
...
Max open files            1048576              1048576              files
```

Thus, the decision for Kind to set the limit to infinity isn't really a decision for "no limit"; it's a decision to "take whatever number the host kernel uses and hope you're on a sane system." Unfortunately NixOS seems not to be so sane. This explains why applications with the problematic behavior usually perform fine on clusters.

it turns out nested docker containers are not particularly compatible with doing this, so kind and similarly constructed nested-container alternatives are currently a bad way to test things around resource limits.

Ah, that's news to me. Although I don't really want to test them, just make them be reliably sane.

You should be able to do that, for example you can customize the node image you're using, or you can exec to the node and change it.

Okay, thanks! I guess I can do the former with some Nix magic or the latter with an exec to sed the unit file and then systemctl restart...

I can't say I've ever seen an application do this, do you have any pointers?

The thing I'm wrestling with is an insufficiently sophisticated file descriptor closing system in Haskell's process library. Someone else experienced the same thing with an SFTP server.

I have a feeling the maintainers would not be persuaded by this application behavior but I can't say for certain.

Oh, I don't think I'll suggest it there; containerd is a general-purpose tool and I wouldn't advocate for this change in general.

@thomasjm

thomasjm commented Aug 8, 2022

Thinking about it a bit more, maybe the right place to apply consistent limits would actually be the Docker configuration, since that seems like the part most analogous to the GKE VM. This sort of thing can be done by passing --ulimit to docker run.

One thing that's a complete mystery to me is why it didn't work when I tried setting the default-ulimits field in the Docker daemon config. It seemed to have no effect; does Kind invoke Docker in some way that circumvents these limits? Or does some component in the system raise its limits to the system limit?
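For reference, what was tried looks roughly like this (a sketch; the values are illustrative, and the daemon needs a restart to pick up the change):

```sh
# /etc/docker/daemon.json with a default nofile ulimit for every container the
# daemon starts (which would include kind nodes), then restart the daemon.
cat /etc/docker/daemon.json
{
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Soft": 1024000, "Hard": 1024000 }
  }
}
sudo systemctl restart docker
```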

@BenTheElder
Member Author

BenTheElder commented Aug 10, 2022

The thing I'm wrestling with is an insufficiently sophisticated file descriptor closing system in haskell/process#189. Someone else experienced the same thing with an SFTP server.

Interesting, thanks.

Thus, the decision for Kind to set the limit to Infinity isn't really a decision for "no limit," it's the decision to "take whatever number the host kernel uses and hope you're on a sane system." Unfortunately NixOS seems not to be so sane. This explains why applications with the problematic behavior usually perform fine on clusters.

That's the same decision in containerd upstream?

If we want to limit differently than the host, we should probably do so at the node level, but then the question is: if the host doesn't have a reasonable limit, what limit should we set for the kind node, and why? This would be a breaking change for another user with a particular host limit; right now they can instead tune their host limit.

Sounds like, if not the Haskell library, then NixOS should be reconfigured?

It seemed to have no effect, does Kind invoke Docker in some way that circumvents these limits? Or does some component in the system raise its limits to the system limit?

Not to my knowledge; kind does not invoke Docker with these settings, so the default should take effect. Docker may need to be restarted first.

Relatedly, I think you should be able to set the ulimit in the application container (not another container in the pod) if you only need to ensure a maximum (you can't raise the limit, but you can lower it), just by calling ulimit: docker-library/rabbitmq#545 (comment)
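A minimal sketch of that (the entrypoint path is a placeholder; this can only lower the limit, never raise it):

```sh
# Wrap the container's command so the soft limit is dropped before the real
# entrypoint starts; a process may always lower its own RLIMIT_NOFILE.
sh -c 'ulimit -Sn 1048576 && exec /usr/local/bin/my-app'
```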

... also related kubernetes/kubernetes#3595 😞
