Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: automatically bump RLIMIT_NOFILE on Unix #46279

Closed
bradfitz opened this issue May 20, 2021 · 52 comments
Closed

runtime: automatically bump RLIMIT_NOFILE on Unix #46279

bradfitz opened this issue May 20, 2021 · 52 comments

Comments

@bradfitz
Copy link
Contributor

bradfitz commented May 20, 2021

I just read http://0pointer.net/blog/file-descriptor-limits.html which in a nutshell says:

  • don't use select
  • systemd sets the RLIMIT_NOFILE soft limit to 1024 for compatibility reasons, to not break select users
  • systemd sets the RLIMIT_NOFILE hard limit 512K, for programs that want more (without escalation), but by raising their soft limit past 1024, they're implicitly acknowledging that select won't work.

I realize that since Go doesn't use select, the Go runtime could automatically do this fd soft limit bumping on Linux.

We do have a Select wrapper at https://pkg.go.dev/golang.org/x/sys/unix#Select, though, so perhaps we could do the same thing we did for #42347 in 18510ae (https://go-review.googlesource.com/c/go/+/299671) and do the bumping conditionally based on whether the unix.Select func is in the binary. Or cgo too, I suppose.

I suspect many users are unaware of this 512K hard limit that's free to bump up to. I certainly was unaware. (I normally have to go in and manual tweak my systemd limits instead, usually in response to problems once I hit the limit...) I think fixing it automatically would help more users than it'd hurt. (I actually can't think how it'd hurt anybody?)

I don't think we need it as a backpressure mechanism. As the blog post mentions, memory limits are already that mechanism.

/cc @ianlancetaylor @aclements @rsc @randall77

@ianlancetaylor
Copy link
Contributor

The limitation on select is not a kernel limitation. It's a limitation on the implementation of fd_set in glibc. And we've inherited that limitation in x/sys/unix.FdSet, but perhaps we could fix that. If we did, then we could raise the soft limit to the hard limit unconditionally.

I note that on my Debian system the soft and hard limits are both 131072. On my CentOS 6 system the soft limit is 1024 and the hard limit is 4096. On my recent Fedora system the soft limit is 1024 and the hard limit is 524288.

@bradfitz
Copy link
Contributor Author

bradfitz commented May 20, 2021

Yeah, I saw FdSet was struct { Bits [16]int64 }. Make it opaque with a [16]int64 used by default and a spill-over lazily-allocated bitmap when (*FDSet).Set(fd int) is called with a "big" fd? Doable, but I wonder if it's worth the effort. Does anybody actually use unix.Select?

We'd still need a conditional mechanism regardless for cgo I assume, as we wouldn't know whether C code was using select as easily?

FWIW, on my various Debian (buster) & Ubuntu (focal LTS, hirsute) machines here, I see 1024 & 1048576.

@bradfitz
Copy link
Contributor Author

bradfitz commented May 20, 2021

Does anybody actually use unix.Select?

GitHub code search says https://github.com/search?l=&p=2&q=unix.Select+language%3AGo&type=Code .... it's mostly wireguard-go's rwcancel package.

(cc @zx2c4 as FYI)

@zx2c4
Copy link
Contributor

zx2c4 commented May 20, 2021

I'm happy to get rid of that and replace it with poll. (Want to send a patch?) Done: https://git.zx2c4.com/wireguard-go/commit/?id=a9b377e9e10eb5194c0bdff32136c11b17253bfd

This proposal sounds like a good idea, with the caveat that we probably shouldn't do it in initialization for -buildmode=shared.

@rsc
Copy link
Contributor

rsc commented Nov 10, 2021

What happens today, even in programs that do nothing but file I/O (no select etc), is that if you open too many files you get errors. Auto-bumping would let those programs run longer.

If Go did it at startup, it would be inherited by non-Go programs that we fork+exec. That is a potential incompatibility, but probably not a large one. Technically, I suppose we could undo it in the subprocess between fork and exec.

@rsc
Copy link
Contributor

rsc commented Nov 10, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@nightlyone
Copy link
Contributor

To summarize the limitating use cases where we should not be raising the soft limit.

  • select implementation from glibc (or other libc like musl too?) is used
  • cgo is used because dynamic linking can load anything with dlopen and also can cause exec calls in places we don't know.
  • select implementation from our own syscalls or unix package is used, unless that one is changed as suggested above.
  • NSS (user/group lookup and DNS lookup) is used from glibc.
  • and we need to reset, if we call the exec family of syscalls

@rsc
Copy link
Contributor

rsc commented Dec 1, 2021

One problem with restoring the limit in exec is we won't know if the limit was intentionally changed by the program in the interim. What about programs that explicitly raise the limit and then exec today? We would be dropping it back down.

It seems like if we are going to raise the limit, we should just do that, not try to put it back.

I just ran into this problem with gofmt on my Mac, where the limit defaults to 256 (and gofmt was editing many files in parallel). I'd love for Go to raise the limit there too.

How much does it really matter if we raise the limit for a subprocess?

People can always set the hard limit if they want Go not to try to bump the soft limit up.

@rsc
Copy link
Contributor

rsc commented Dec 8, 2021

It's pretty awful that the limit is breaking completely reasonable Go programs like gofmt -w.
It seems very wrong for gofmt to have to put a bump in.
It seems like we should bump the limit at startup - Go doesn't use select.

It's very hard to see any programs benefiting from this limit in practice anymore.
I understand that systemd can't make such a global decision, but I think Go can.

@zx2c4
Copy link
Contributor

zx2c4 commented Dec 8, 2021

I think that seems quite reasonable.

We can even document this in unix.Select/syscall.Select and mark them as deprecated so that editors bring attention to them and maybe add something to go vet too. It seems always possible to move to poll or similar.

@rsc
Copy link
Contributor

rsc commented Dec 15, 2021

Not sure anyone is using syscall.Select for fd's anyway.
Every time I've used it in the past decade it has been to get a sub-second-resolution sleeping API (selecting on no fds).

@rsc
Copy link
Contributor

rsc commented Dec 15, 2021

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@AlekSi
Copy link
Contributor

AlekSi commented Dec 15, 2021

Should the title be updated to mention Unix or something instead of Linux?
Personally, I constantly run into that limitation on macOS; would like to see that resolved.

@ianlancetaylor
Copy link
Contributor

The considerations may be different on different Unix systems. On Linux the details are somewhat specific to systemd.

It may well be appropriate to do this on macOS also, but I don't know what the tradeoffs are there. Why does macOS have a default low limit?

@AlekSi
Copy link
Contributor

AlekSi commented Dec 15, 2021

From what I was able to find, that default goes back to the very first OS X release and probably even back to BSD. The constant is there.

Of course, not doing that on macOS is not a deal-breaker but an annoyance.

@kolyshkin
Copy link
Contributor

The only issue I am aware of that can arise if RLIMIT_NOFILE is set to a very high value is, some binaries (that may be executed from a Go program and thus inherit the limit) want to do something like this (pseudocode):

for fd := 3; fd < getrlimit(RLIMIT_NOFILE); fd++ {
      close(fd) // or set CLOEXEC flag
}

For a specific example, rpm package manager used to do that (fixed by rpm-software-management/rpm@5e6f05c), and also some older version of Python (but I'm not sure).

Most probably this should not be an issue, since Docker also does a similar thing (moby/moby#38814) and since everyone seems to be using containers now, let's hope that issues like this are fixed (yet better, maybe some programs have even started using close_range()).

Also, this is surely not a showstopper to accept the proposal -- just something to keep in mind.

@rsc rsc changed the title proposal: runtime: automatically bump RLIMIT_NOFILE on Linux? proposal: runtime: automatically bump RLIMIT_NOFILE on Unix Jan 5, 2022
@rsc
Copy link
Contributor

rsc commented Jan 5, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: runtime: automatically bump RLIMIT_NOFILE on Unix runtime: automatically bump RLIMIT_NOFILE on Unix Jan 5, 2022
@rsc rsc removed this from the Proposal milestone Jan 5, 2022
frezbo added a commit to frezbo/talos that referenced this issue May 9, 2023
Now Go only sets the rlimit for the parent and any fork/exec'ed process
gets the rlimit that was the default before fork/exec. Ref: golang/go#46279

This fix got backported to [Go 1.20.4](golang/go@ecf7e00) breaking Talos.
Talos used to set rlimit in the [`SetRLimit`](https://github.com/siderolabs/talos/blob/v1.4.2/internal/app/machined/pkg/runtime/v1alpha1/v1alpha1_sequencer_tasks.go#L302) sequencer task.
This means any process started by `wrapperd` gets the default Rlimit
(1024). Fix this by explicitly setting `rlimit` in `wrapperd` before we
drop any capabilities.

Fixes: siderolabs#7198

Signed-off-by: Noel Georgi <git@frezbo.dev>
frezbo added a commit to frezbo/talos that referenced this issue May 9, 2023
Now Go only sets the rlimit for the parent and any fork/exec'ed process
gets the rlimit that was the default before fork/exec. Ref: golang/go#46279

This fix got backported to [Go 1.20.4](golang/go@ecf7e00) breaking Talos.
Talos used to set rlimit in the [`SetRLimit`](https://github.com/siderolabs/talos/blob/v1.4.2/internal/app/machined/pkg/runtime/v1alpha1/v1alpha1_sequencer_tasks.go#L302) sequencer task.
This means any process started by `wrapperd` gets the default Rlimit
(1024). Fix this by explicitly setting `rlimit` in `wrapperd` before we
drop any capabilities.

Fixes: siderolabs#7198

Signed-off-by: Noel Georgi <git@frezbo.dev>
(cherry picked from commit a2565f6)
gopherbot pushed a commit to golang/gofrontend that referenced this issue May 11, 2023
As of https://go.dev/cl/476695 golang.org/x/sys/unix can call
syscall.prlimit, so we need such a function in libgo.

For golang/go#46279
Fixes golang/go#59712

Change-Id: I87ad6daaba68c188fb0abecb30f7d574db1f2600
Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/486576
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
nstester pushed a commit to nstester/gcc that referenced this issue May 11, 2023
As of https://go.dev/cl/476695 golang.org/x/sys/unix can call
syscall.prlimit, so we need such a function in libgo.

For golang/go#46279
Fixes golang/go#59712

Reviewed-on: https://go-review.googlesource.com/c/gofrontend/+/486576
bradfitz pushed a commit to tailscale/go that referenced this issue May 25, 2023
…ild process

If we increased the NOFILE rlimit when starting the program,
restore the original rlimit when forking a child process.

In CL 393354 the os package was changed to raise the open file rlimit
at program start. That code is not inherently tied to the os package.
This CL moves it into the syscall package.

This is a backport of CLs 476096 and 476097 from trunk.

For golang#46279
Fixes golang#59064

Change-Id: Ib813de896de0a5d28fa2b29afdf414a89fbe7b2a
Reviewed-on: https://go-review.googlesource.com/c/go/+/478659
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Tobias Klauser <tobias.klauser@gmail.com>
bradfitz pushed a commit to tailscale/go that referenced this issue May 25, 2023
…ild process

If we increased the NOFILE rlimit when starting the program,
restore the original rlimit when forking a child process.

In CL 393354 the os package was changed to raise the open file rlimit
at program start. That code is not inherently tied to the os package.
This CL moves it into the syscall package.

This is a backport of CLs 476096 and 476097 from trunk.

For golang#46279
Fixes golang#59064

Change-Id: Ib813de896de0a5d28fa2b29afdf414a89fbe7b2a
Reviewed-on: https://go-review.googlesource.com/c/go/+/478659
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Tobias Klauser <tobias.klauser@gmail.com>
wmfgerrit pushed a commit to wikimedia/operations-puppet that referenced this issue Sep 22, 2023
Stop managing limitnofile, Bookworm systemd default is soft 1024 and
hard 524288.  Further, golang >= 1.19 'os' package bumps soft to hard
limit automatically (cfr golang/go#46279)

For example thanos-query-frontend where we don't explicitly set
limitnofile:

  root@titan2001:~# cat /proc/$(pgrep -f "thanos query-frontend")/limits | grep files
  Max open files            524288               524288               files
  root@titan2001:~# systemctl show thanos-query-frontend | grep -i limitnofile
  LimitNOFILE=524288
  LimitNOFILESoft=1024

Bug: T346950
Change-Id: I3df68d63c66293e4425cb7e67670e44459c4d474
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/539435 mentions this issue: syscall: copy rlimit.go's build constraint to rlimit_test.go

gopherbot pushed a commit that referenced this issue Nov 2, 2023
Tests in rlimit_test.go exist to test the behavior of automatically
bumping RLIMIT_NOFILE on Unix implemented in rlimit.go (issue #46279),
with darwin-specific behavior split out into rlimit_darwin.go and
the rest left empty in rlimit_stub.go.

Since the behavior happens only on Unix, it doesn't make sense to test
it on other platforms. Copy rlimit.go's 'unix' build constraint to
rlimit_test.go to accomplish that.

Also simplify the build constraint in rlimit_stub.go while here,
so that its maintenance is easier and it starts to match all
non-darwin Unix GOOS values (previously, 'hurd' happened to be missed).

In particular, this fixes a problem where TestOpenFileLimit was
failing in some environments when testing the wasip1/wasm port.
The RLIMIT_NOFILE bumping behavior isn't implemented there, so
the test was testing the environment and not the Go project.

Updates #46279.
For #61116.

Change-Id: Ic993f9cfc021d4cda4fe3d7fed8e2e180f78a2ca
Reviewed-on: https://go-review.googlesource.com/c/go/+/539435
Reviewed-by: Johan Brandhorst-Satzkorn <johan.brandhorst@gmail.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/540615 mentions this issue: [release-branch.go1.21] syscall: copy rlimit.go's build constraint to rlimit_test.go

gopherbot pushed a commit that referenced this issue Nov 7, 2023
… rlimit_test.go

Tests in rlimit_test.go exist to test the behavior of automatically
bumping RLIMIT_NOFILE on Unix implemented in rlimit.go (issue #46279),
with darwin-specific behavior split out into rlimit_darwin.go and
the rest left empty in rlimit_stub.go.

Since the behavior happens only on Unix, it doesn't make sense to test
it on other platforms. Copy rlimit.go's 'unix' build constraint to
rlimit_test.go to accomplish that.

Leave out the simplification of the build constraint in rlimit_stub.go
so that this CL remains a test-only fix.

In particular, this fixes a problem where TestOpenFileLimit was
failing in some environments when testing the wasip1/wasm port.
The RLIMIT_NOFILE bumping behavior isn't implemented there, so
the test was testing the environment and not the Go project.

Updates #46279.
For #61116.
Fixes #63994.

Change-Id: Ic993f9cfc021d4cda4fe3d7fed8e2e180f78a2ca
Cq-Include-Trybots: luci.golang.try:go1.21-wasip1-wasm_wasmtime
Reviewed-on: https://go-review.googlesource.com/c/go/+/539435
Reviewed-by: Johan Brandhorst-Satzkorn <johan.brandhorst@gmail.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
(cherry picked from commit b7cbcf0)
Reviewed-on: https://go-review.googlesource.com/c/go/+/540615
Reviewed-by: Heschi Kreinick <heschi@google.com>
Auto-Submit: Heschi Kreinick <heschi@google.com>
tagatac added a commit to tagatac/bagoup that referenced this issue Sep 12, 2024
**What is changing**: Use the value returned from `ulimit -n` as the
open files soft limit instead of the value returned from the syscall
`getrlimit`.

**Why this change is being made**: Since Go 1.19, during initialization,
Go programs unconditionally set a high open files soft limit for
themselves without modifying the system-wide defaults. This soft limit
also does not apply to subprocesses run via the `exec` package. Because
`wkhtmltopdf` is run as a subprocess, the soft limit set by the Go
program during initialization does not give us any information about the
limits that will be applied to `wkhtmltopdf`. Two approaches would be to
1. always call `setrlimit` on every run so that the soft limit will
apply to subprocesses, or
2. use `ulimit` to check the soft limit that will be applied to
`wkhtmltopdf`.

I chose to implement approach 2. to reduce the number of syscalls.

More details:
- https://go.dev/src/syscall/rlimit.go
- golang/go#46279
- https://stackoverflow.com/q/73640931/5403337
-
https://www.perplexity.ai/search/explain-why-golang-returns-an-tho.xk6ARhOs6ZSPr4kx8A

**Related issue(s)**: Fixes #63

**Follow-up changes needed**: None

**Is the change completely covered by unit tests? If not, why not?**:
Yes
debarshiray added a commit to debarshiray/podman that referenced this issue Oct 11, 2024
Starting from commit 9126b45 ("Up default Podman rlimits to
avoid max open files"), Podman started bumping its soft limit for the
maximum number of open file descriptors (RLIMIT_NOFILE or ulimit -n) to
permit exposing a large number of ports to a container.  This was later
fine-tuned in commit a2c1a2d ("podman: bump RLIMIT_NOFILE also
without CAP_SYS_RESOURCE").

Unfortunately, this also increases the limits for 'podman exec' sessions
running in containers created with:
  $ podman create --network host --ulimit host ...

This is what Toolbx uses to provide a containerized interactive command
line environment for software development and troubleshooting the host
operating system.

It confuses developers and system administrators debugging a process
that's leaking file descriptors and crashing on the host OS.  The
crashes either don't reproduce inside the container or they take a lot
longer to reproduce, both of which are frustrating.

Therefore, it will be good to retain the limits, at least for this
specific scenario.

It turns out that since this code was written, the Go runtime has had
two interesting changes.

Starting from Go 1.19 [1], the Go runtime bumps the soft limit for
RLIMIT_NOFILE for all Go programs [2].  This means that there's no
longer any need for Podman to bump it's own limits, because it switched
from requiring Go 1.18 to 1.20 in commit 4dd58f2 ("Move golang
requirement from 1.18 to 1.20").  It's probably good to still log the
detected limits, in case Go's behaviour changes.

Not everybody was happy with this [3], because the higher limits got
propagated to child processes spawned by Go programs.  Among other
things, this can break old programs using select(2) [4].  So, Go's
behaviour was fine-tuned to restore the original soft limit for
RLIMIT_NOFILE when forking a child process [5].

With these two changes in Go, which Podman already uses, if the bumping
of RLIMIT_NOFILE is left to the Go runtime, then the limits are no
longer increased for 'podman exec' sesssions.  Otherwise, if Podman
continues to bump the soft limit for RLIMIT_NOFILE on its own, then it
prevents the Go runtime from restoring the original limits when forking,
and leads to the higher limits in 'podman exec' sessions.

Note that this doesn't impact the resource limits for the container's
entry point, and 'podman exec' sessions continue to respect any limits
that were explicitly set with the --ulimit option.

Arguably, users deploying server-side applications in Podman containers
predominantly use the entry point through 'podman run', and they can
continue unimpacted, and Toolbx users can keep using 'podman exec'
without any surprises.

The existing 'podman run --ulimit host ... ulimit -Hn' test in
test/e2e/run_test.go was extended to also check the soft limit.  The
similar test for 'podman exec' was moved from test/e2e/toolbox_test.go
to test/e2e/exec_test.go for consistency and because there's nothing
Toolbx specific about it.  The test was similarly extended, and updated
to be more idiomatic.

Note that the Alpine Linux image doesn't have a standalone binary for
'ulimit' and it's picky about the order in which the options are listed.
The -H or -S must come first, followed by a space, and then the -n.

[1] https://go.dev/doc/go1.19#runtime

[2] Go commit 8427429c592588af ("os: raise open file rlimit at startup")
    golang/go@8427429c592588af
    golang/go#46279

[3] containerd/containerd#8249

[4] http://0pointer.net/blog/file-descriptor-limits.html

[5] Go commit f5eef58e4381259c ("syscall: restore original NOFILE ...")
    golang/go@f5eef58e4381259c
    golang/go#46279

Fixes: containers#17681

Signed-off-by: Debarshi Ray <rishi@fedoraproject.org>
debarshiray added a commit to debarshiray/podman that referenced this issue Oct 11, 2024
Starting from commit 9126b45 ("Up default Podman rlimits to
avoid max open files"), Podman started bumping its soft limit for the
maximum number of open file descriptors (RLIMIT_NOFILE or ulimit -n) to
permit exposing a large number of ports to a container.  This was later
fine-tuned in commit a2c1a2d ("podman: bump RLIMIT_NOFILE also
without CAP_SYS_RESOURCE").

Unfortunately, this also increases the limits for 'podman exec' sessions
running in containers created with:
  $ podman create --network host --ulimit host ...

This is what Toolbx uses to provide a containerized interactive command
line environment for software development and troubleshooting the host
operating system.

It confuses developers and system administrators debugging a process
that's leaking file descriptors and crashing on the host OS.  The
crashes either don't reproduce inside the container or they take a lot
longer to reproduce, both of which are frustrating.

Therefore, it will be good to retain the limits, at least for this
specific scenario.

It turns out that since this code was written, the Go runtime has had
two interesting changes.

Starting from Go 1.19 [1], the Go runtime bumps the soft limit for
RLIMIT_NOFILE for all Go programs [2].  This means that there's no
longer any need for Podman to bump it's own limits, because it switched
from requiring Go 1.18 to 1.20 in commit 4dd58f2 ("Move golang
requirement from 1.18 to 1.20").  It's probably good to still log the
detected limits, in case Go's behaviour changes.

Not everybody was happy with this [3], because the higher limits got
propagated to child processes spawned by Go programs.  Among other
things, this can break old programs using select(2) [4].  So, Go's
behaviour was fine-tuned to restore the original soft limit for
RLIMIT_NOFILE when forking a child process [5].

With these two changes in Go, which Podman already uses, if the bumping
of RLIMIT_NOFILE is left to the Go runtime, then the limits are no
longer increased for 'podman exec' sessions.  Otherwise, if Podman
continues to bump the soft limit for RLIMIT_NOFILE on its own, then it
prevents the Go runtime from restoring the original limits when forking,
and leads to the higher limits in 'podman exec' sessions.

Note that this doesn't impact the resource limits for the container's
entry point, and 'podman exec' sessions continue to respect any limits
that were explicitly set with the --ulimit option.

Arguably, users deploying server-side applications in Podman containers
predominantly use the entry point through 'podman run', and they can
continue unimpacted, and Toolbx users can keep using 'podman exec'
without any surprises.

The existing 'podman run --ulimit host ... ulimit -Hn' test in
test/e2e/run_test.go was extended to also check the soft limit.  The
similar test for 'podman exec' was moved from test/e2e/toolbox_test.go
to test/e2e/exec_test.go for consistency and because there's nothing
Toolbx specific about it.  The test was similarly extended, and updated
to be more idiomatic.

Note that the Alpine Linux image doesn't have a standalone binary for
'ulimit' and it's picky about the order in which the options are listed.
The -H or -S must come first, followed by a space, and then the -n.

[1] https://go.dev/doc/go1.19#runtime

[2] Go commit 8427429c592588af ("os: raise open file rlimit at startup")
    golang/go@8427429c592588af
    golang/go#46279

[3] containerd/containerd#8249

[4] http://0pointer.net/blog/file-descriptor-limits.html

[5] Go commit f5eef58e4381259c ("syscall: restore original NOFILE ...")
    golang/go@f5eef58e4381259c
    golang/go#46279

Fixes: containers#17681

Signed-off-by: Debarshi Ray <rishi@fedoraproject.org>
debarshiray added a commit to debarshiray/podman that referenced this issue Oct 11, 2024
Starting from commit 9126b45 ("Up default Podman rlimits to
avoid max open files"), Podman started bumping its soft limit for the
maximum number of open file descriptors (RLIMIT_NOFILE or ulimit -n) to
permit exposing a large number of ports to a container.  This was later
fine-tuned in commit a2c1a2d ("podman: bump RLIMIT_NOFILE also
without CAP_SYS_RESOURCE").

Unfortunately, this also increases the limits for 'podman exec' sessions
running in containers created with:
  $ podman create --network host --ulimit host ...

This is what Toolbx uses to provide a containerized interactive command
line environment for software development and troubleshooting the host
operating system.

It confuses developers and system administrators debugging a process
that's leaking file descriptors and crashing on the host OS.  The
crashes either don't reproduce inside the container or they take a lot
longer to reproduce, both of which are frustrating.

Therefore, it will be good to retain the limits, at least for this
specific scenario.

It turns out that since this code was written, the Go runtime has had
two interesting changes.

Starting from Go 1.19 [1], the Go runtime bumps the soft limit for
RLIMIT_NOFILE for all Go programs [2].  This means that there's no
longer any need for Podman to bump it's own limits, because it switched
from requiring Go 1.18 to 1.20 in commit 4dd58f2 ("Move golang
requirement from 1.18 to 1.20").  It's probably good to still log the
detected limits, in case Go's behaviour changes.

Not everybody was happy with this [3], because the higher limits got
propagated to child processes spawned by Go programs.  Among other
things, this can break old programs using select(2) [4].  So, Go's
behaviour was fine-tuned to restore the original soft limit for
RLIMIT_NOFILE when forking a child process [5].

With these two changes in Go, which Podman already uses, if the bumping
of RLIMIT_NOFILE is left to the Go runtime, then the limits are no
longer increased for 'podman exec' sessions.  Otherwise, if Podman
continues to bump the soft limit for RLIMIT_NOFILE on its own, then it
prevents the Go runtime from restoring the original limits when forking,
and leads to the higher limits in 'podman exec' sessions.

Note that this doesn't impact the resource limits for the container's
entry point, and 'podman exec' sessions continue to respect any limits
that were explicitly set with the --ulimit option.

Arguably, users deploying server-side applications in Podman containers
predominantly use the entry point through 'podman run', and they can
continue unimpacted, and Toolbx users can keep using 'podman exec'
without any surprises.

The existing 'podman run --ulimit host ... ulimit -Hn' test in
test/e2e/run_test.go was extended to also check the soft limit.  The
similar test for 'podman exec' was moved from test/e2e/toolbox_test.go
to test/e2e/exec_test.go for consistency and because there's nothing
Toolbx specific about it.  The test was similarly extended, and updated
to be more idiomatic.

Note that the Alpine Linux image doesn't have a standalone binary for
'ulimit' and it's picky about the order in which the options are listed.
The -H or -S must come first, followed by a space, and then the -n.

[1] https://go.dev/doc/go1.19#runtime

[2] Go commit 8427429c592588af ("os: raise open file rlimit at startup")
    golang/go@8427429c592588af
    golang/go#46279

[3] containerd/containerd#8249

[4] http://0pointer.net/blog/file-descriptor-limits.html

[5] Go commit f5eef58e4381259c ("syscall: restore original NOFILE ...")
    golang/go@f5eef58e4381259c
    golang/go#46279

Fixes: containers#17681

Signed-off-by: Debarshi Ray <rishi@fedoraproject.org>
debarshiray added a commit to debarshiray/podman that referenced this issue Oct 14, 2024
Starting from commit 9126b45 ("Up default Podman rlimits to
avoid max open files"), Podman started bumping its soft limit for the
maximum number of open file descriptors (RLIMIT_NOFILE or ulimit -n) to
permit exposing a large number of ports to a container.  This was later
fine-tuned in commit a2c1a2d ("podman: bump RLIMIT_NOFILE also
without CAP_SYS_RESOURCE").

Unfortunately, this also increases the limits for 'podman exec' sessions
running in containers created with:
  $ podman create --network host --ulimit host ...

This is what Toolbx uses to provide a containerized interactive command
line environment for software development and troubleshooting the host
operating system.

It confuses developers and system administrators debugging a process
that's leaking file descriptors and crashing on the host OS.  The
crashes either don't reproduce inside the container or they take a lot
longer to reproduce, both of which are frustrating.

Therefore, it will be good to retain the limits, at least for this
specific scenario.

It turns out that since this code was written, the Go runtime has had
two interesting changes.

Starting from Go 1.19 [1], the Go runtime bumps the soft limit for
RLIMIT_NOFILE for all Go programs [2].  This means that there's no
longer any need for Podman to bump it's own limits, because it switched
from requiring Go 1.18 to 1.20 in commit 4dd58f2 ("Move golang
requirement from 1.18 to 1.20").  It's probably good to still log the
detected limits, in case Go's behaviour changes.

Not everybody was happy with this [3], because the higher limits got
propagated to child processes spawned by Go programs.  Among other
things, this can break old programs using select(2) [4].  So, Go's
behaviour was fine-tuned to restore the original soft limit for
RLIMIT_NOFILE when forking a child process [5].

With these two changes in Go, which Podman already uses, if the bumping
of RLIMIT_NOFILE is left to the Go runtime, then the limits are no
longer increased for 'podman exec' sessions.  Otherwise, if Podman
continues to bump the soft limit for RLIMIT_NOFILE on its own, then it
prevents the Go runtime from restoring the original limits when forking,
and leads to the higher limits in 'podman exec' sessions.

The existing 'podman run --ulimit host ... ulimit -Hn' test in
test/e2e/run_test.go was extended to also check the soft limit.  The
similar test for 'podman exec' was moved from test/e2e/toolbox_test.go
to test/e2e/exec_test.go for consistency and because there's nothing
Toolbx specific about it.  The test was similarly extended, and updated
to be more idiomatic.

Due to the behaviour of the Go runtime noted above, and since the tests
are written in Go, the current or soft limit for RLIMIT_NOFILE returned
by syscall.Getrlimit() is the same as the hard limit.

The Alpine Linux image doesn't have a standalone binary for 'ulimit' and
it's picky about the order in which the options are listed.  The -H or
-S must come first, followed by a space, and then the -n.

[1] https://go.dev/doc/go1.19#runtime

[2] Go commit 8427429c592588af ("os: raise open file rlimit at startup")
    golang/go@8427429c592588af
    golang/go#46279

[3] containerd/containerd#8249

[4] http://0pointer.net/blog/file-descriptor-limits.html

[5] Go commit f5eef58e4381259c ("syscall: restore original NOFILE ...")
    golang/go@f5eef58e4381259c
    golang/go#46279

Fixes: containers#17681

Signed-off-by: Debarshi Ray <rishi@fedoraproject.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests