
Mark net.ipv4.ip_local_reserved_ports as a safe sysctl #111144

Closed
hindessm opened this issue Jul 14, 2022 · 13 comments · Fixed by #115374
Labels
kind/feature Categorizes issue or PR as related to a new feature.
sig/network Categorizes an issue or PR as relevant to SIG Network.
sig/node Categorizes an issue or PR as relevant to SIG Node.
sig/security Categorizes an issue or PR as relevant to SIG Security.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@hindessm

What would you like to be added?

The net.ipv4.ip_local_reserved_ports sysctl is namespaced, so it should be safe to permit. It is an often-necessary counterpart to the net.ipv4.ip_local_port_range sysctl, which is currently in the default safe list. It would therefore be helpful to include it in the default safe list as well.

Why is this needed?

If you have a pod with multiple containers that sets net.ipv4.ip_local_port_range to a large range such as the commonly used '1025 65535' to maximise the number of available ports, and whose containers bind to local ports within that range, then a container doing a bind can deadlock, repeatedly failing to start because another container has already taken that port as an ephemeral port. If the ports needed by the bind can be added to the reserved ports list, this issue can be avoided without unnecessarily shrinking the local port range or adding coupling to ensure the containers start in a particular order.
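
As a minimal sketch, assuming the change is accepted, this is the kind of pod-level configuration it would enable without explicitly allowing the sysctl via the kubelet's --allowed-unsafe-sysctls flag (written with the k8s.io/api/core/v1 types; the pod name, container names, and port values are made up for illustration):

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    pod := corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "example"},
        Spec: corev1.PodSpec{
            SecurityContext: &corev1.PodSecurityContext{
                Sysctls: []corev1.Sysctl{
                    // Already on the default safe list today.
                    {Name: "net.ipv4.ip_local_port_range", Value: "1025 65535"},
                    // The sysctl this issue asks to mark safe: keep the
                    // listener's ports out of the ephemeral pool.
                    {Name: "net.ipv4.ip_local_reserved_ports", Value: "8080,9090"},
                },
            },
            Containers: []corev1.Container{
                {Name: "server", Image: "example/server"},
                {Name: "sidecar", Image: "example/sidecar"},
            },
        },
    }
    fmt.Printf("sysctls: %+v\n", pod.Spec.SecurityContext.Sysctls)
}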

@hindessm hindessm added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 14, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 14, 2022
@hindessm
Author

/sig network security

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. sig/security Categorizes an issue or PR as relevant to SIG Security. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 14, 2022
@rajibmitra rajibmitra removed their assignment Jul 14, 2022
@dcbw
Member

dcbw commented Jul 21, 2022

/assign danwinship

@dcbw
Member

dcbw commented Jul 21, 2022

/sig node
/triage accepted
/assign aojea

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 21, 2022
@aojea
Member

aojea commented Jul 21, 2022

@hindessm just for clarification, you are requesting to add ip_local_reserved_ports to this list, right?

func SafeSysctlAllowlist() []string {
    return []string{
        "kernel.shm_rmid_forced",
        "net.ipv4.ip_local_port_range",
        "net.ipv4.tcp_syncookies",
        "net.ipv4.ping_group_range",
        "net.ipv4.ip_unprivileged_port_start",
    }
}
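
If it helps, a minimal sketch of what the requested change might look like, assuming a straightforward append to the list above; the actual fix (see #115374) may differ, for example by gating the new entry on kernel version:

func SafeSysctlAllowlist() []string {
    return []string{
        "kernel.shm_rmid_forced",
        "net.ipv4.ip_local_port_range",
        "net.ipv4.tcp_syncookies",
        "net.ipv4.ping_group_range",
        "net.ipv4.ip_unprivileged_port_start",
        // Proposed addition from this issue.
        "net.ipv4.ip_local_reserved_ports",
    }
}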

@aojea
Member

aojea commented Jul 21, 2022

/sig node
/cc @mrunalp

@hindessm
Author

hindessm commented Jul 21, 2022

@aojea I'd not previously looked at the source code but yes, that is exactly what I was hoping for and would be fantastic. Thanks for your attention to this issue.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 19, 2022
@pacoxu
Member

pacoxu commented Oct 20, 2022

/remove-lifecycle stale

There is a limitation that host-network pods cannot use it; this is already enforced, so it would be safe to allow.

  • If the pod uses hostNetwork, users will get the error Pod forbidden sysctl: "net.ipv4.ip_local_reserved_ports" not allowed with host net enabled.

Meanwhile, there is a common problem with this sysctl on old kernels (< 3.12). See moby/moby#8674 (comment) or istio/istio#36560. Users may get the error below when applying this sysctl.

Warning FailedCreatePodSandBox 5s (x25 over 42s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: open /proc/sys/net/ipv4/ip_local_reserved_ports: no such file or directory: unknown

I got this in a testing cluster running Kubernetes v1.25 and containerd v1.6.0.

Linux paco 3.10.0-1160.62.1.el7.x86_64 #1 SMP Tue Apr 5 16:57:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
[root@paco ~]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
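
To illustrate the kernel-dependency concern, here is a standalone Go sketch (not kubelet or runc code; the helper name sysctlSupported is made up for this example) that checks whether the running kernel exposes the sysctl before asking for it:

package main

import (
    "fmt"
    "os"
    "strings"
)

// sysctlSupported reports whether the kernel exposes the given sysctl
// by checking for its /proc/sys entry.
func sysctlSupported(name string) bool {
    path := "/proc/sys/" + strings.ReplaceAll(name, ".", "/")
    _, err := os.Stat(path)
    return err == nil
}

func main() {
    if !sysctlSupported("net.ipv4.ip_local_reserved_ports") {
        // On kernels older than 3.12 this entry does not exist, which is
        // what produces the "no such file or directory" error above.
        fmt.Println("net.ipv4.ip_local_reserved_ports is not available; skipping")
        return
    }
    fmt.Println("net.ipv4.ip_local_reserved_ports is available")
}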

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 18, 2023
@hindessm
Author

/remove-lifecycle stale

It would be a shame not to fix this. As I mentioned when I originally opened it, it is often needed alongside net.ipv4.ip_local_port_range (which is in the safe list) to avoid timing issues in fairly common multi-container pod configurations that result in crash loop backoff of containers that want a port they are never going to get until the container which has used it as an ephemeral port is (manually) restarted.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2023
@pacoxu
Member

pacoxu commented Jan 28, 2023

The PR is easy to raise, like #115374.

  • However, we need approval from the SIG Network/Node sides to make sure this has no side effects.

@aojea
Member

aojea commented Jan 28, 2023

to avoid timing issues in fairly common multi-container pod configurations that result in crash loop backoff of containers that want a port they are never going to get until the container which has used it as an ephemeral port is (manually) restarted

@hindessm just out of curiosity, how common is this? I don't remember many issues about this problem.

@hindessm
Author

Apologies, my statement was probably misleading. I used the phrase "fairly common" in relation to configurations which could theoretically have the race condition; I didn't mean to imply that the race was likely to be lost often.

The potentially problematic configuration would need multiple containers in a pod, with some containers listening on ports, other containers making outgoing connections, and the ephemeral port range overlapping with the listening ports. I suspect that pods configured like this are fairly common - sidecar containers are a popular pattern, and the ephemeral port range is often extended to a broader range of ports such as 1024-65535.

The problem occurs if a sidecar container makes any connections at startup - to obtain config, perform auth, etc. - before another container's listener has attempted to bind to a required port, since it is possible for the sidecar's connections to use the listener's port as an ephemeral port. If this connection in the sidecar remains open, the listening container goes into crash loop backoff, repeatedly failing to bind, and something external has to kill the sidecar to force it to release the port.
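
To make the collision concrete, here is a standalone Go sketch (not taken from our deployment; port 8080 and the loopback addresses are arbitrary) in which an outgoing connection pinned to a local port stops a later listener from binding that port:

package main

import (
    "fmt"
    "net"
)

func main() {
    // A peer for the "sidecar" to connect to, on an arbitrary port.
    upstream, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        panic(err)
    }
    defer upstream.Close()

    // Simulate the sidecar winning the race: its outgoing connection is
    // pinned to local port 8080, the port the main container needs later.
    d := net.Dialer{LocalAddr: &net.TCPAddr{IP: net.ParseIP("127.0.0.1"), Port: 8080}}
    conn, err := d.Dial("tcp", upstream.Addr().String())
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // The main container now tries to bind its listening port and fails
    // with "address already in use" until the sidecar connection closes.
    _, err = net.Listen("tcp", "127.0.0.1:8080")
    fmt.Println("bind result:", err)
}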

Even given a potentially problematic configuration, the problem might not occur often because:
a) the listening container wins the race to bind,
b) a vital ephemeral port isn't chosen, or
c) a vital ephemeral port is chosen but only for a short-lived connection, so the listening container perhaps restarts only once - which is unfortunate but not a big deal.
So, in a lot of configurations with the race condition, it is likely that the main container wins the race to bind to its required ports and the issue either does not occur or results in a single restart that might not be noticed.

The situation where we see it happening involves a server that delays binding to vital ports until it has reloaded/recovered state, combined with a sidecar that opens a few long-lived connections. Even with this increased timing window it doesn't happen often, but when it does it requires intervention to resolve the crash loop backoff, as the restarted listening container will continue to fail until the sidecar releases the port. It would be simple to avoid if the reserved ports setting were permitted on the container service we deploy onto, as we could protect the problematic ports and still keep the expanded ephemeral port range.
