
In IPVS mode, host services become available on ClusterIPs #72236

Closed
fasaxc opened this issue Dec 20, 2018 · 85 comments · Fixed by #108460
Labels
area/ipvs area/kube-proxy kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@fasaxc
Contributor

fasaxc commented Dec 20, 2018

What happened:

In IPVS mode, if a service on the host is listening on 0.0.0.0:<some port>, traffic from pods on the local host (or incoming traffic from other hosts, if routing allows it) to any ClusterIP on <some port> reaches the host service.

For example, a pod can access its host's ssh service at something like 10.96.0.1:22

What you expected to happen:

ClusterIPs should reject or drop traffic to unexpected ports (as is the behaviour in iptables mode).

How to reproduce it (as minimally and precisely as possible):

  • Turn on IPVS mode for kube-proxy
  • Configure ssh to listen on 0.0.0.0 (the default in most installations)
  • Exec into a pod and ssh to, say, 10.96.0.1:22; the connection should reach the host (see the sketch below).
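
A minimal check from a pod, assuming 10.96.0.1 is the kubernetes.default ClusterIP in your cluster, that a pod named test-pod exists, and that its image ships nc (names here are illustrative):

# Any ClusterIP works; only the port (22) has to match a host listener.
kubectl exec -it test-pod -- nc -v 10.96.0.1 22
# In IPVS mode this prints the node's SSH banner (SSH-2.0-OpenSSH_...);
# in iptables mode the connection should be rejected instead.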

Anything else we need to know?:

This unexpected behaviour has some security impact; it's an extra, unexpected way for packets to reach the host. In addition, it's very hard for a NetworkPolicy provider (such as Calico) to secure this path, because IPVS captures traffic that, to the kernel's policy engines, looks like it's going to be terminated at the host. For example, in iptables, there's no way to tell the difference between traffic that is about to be terminated by IPVS and traffic that's going to go to a local service.

Related Calico issue; user was trying to block the traffic with Calico policy but was unable to (because Calico has to whitelist all potential IPVS traffic): projectcalico/calico#2279

Environment:

  • Kubernetes version (use kubectl version): 1.12
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 20, 2018
@fasaxc
Contributor Author

fasaxc commented Dec 20, 2018

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 20, 2018
@cjcullen
Member

@rramkumar1 @prameshj @bowei

@rramkumar1
Contributor

/assign @m1093782566

@uablrek
Contributor

uablrek commented Dec 27, 2018

The ClusterIP addresses (and external addresses) are added to the local kube-ipvs0 device, which must be done to get traffic to IPVS;

# ip addr show dev kube-ipvs0
13: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default 
    link/ether 92:0b:28:ce:3f:77 brd ff:ff:ff:ff:ff:ff
    inet 12.0.0.1/32 brd 12.0.0.1 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 12.0.0.2/32 brd 12.0.0.2 scope global kube-ipvs0
       valid_lft forever preferred_lft forever

The 12.0.0.x addresses are my ClusterIPs for kubernetes and coredns.
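
A quick way to see the mismatch on a node (a sketch; ipvsadm must be installed, and the addresses are the ones shown above):

# List the virtual services kube-proxy actually programmed into IPVS;
# only these IP:port pairs are meant to be load-balanced.
ipvsadm -L -n
# List the addresses assigned to the dummy device; any other port on
# these IPs falls through to whatever the host itself is listening on.
ip -4 addr show dev kube-ipvs0 | awk '/inet /{print $2}'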

The problem however seems to be the route that is automatically added to the local routing table;

# ip ro show table local
local 12.0.0.1 dev kube-ipvs0 proto kernel scope host src 12.0.0.1 
local 12.0.0.2 dev kube-ipvs0 proto kernel scope host src 12.0.0.2 
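
You can ask the kernel directly how it will treat a packet to one of these addresses (a sketch using the 12.0.0.1 address above; exact output varies by kernel):

ip route get 12.0.0.1
# local 12.0.0.1 dev lo table local src 12.0.0.1 ...
# "local" means the packet is delivered to the host's own stack, where
# any process bound to 0.0.0.0 can answer it.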

Now I can ssh to my local machine using these addresses as described in this issue;

# ssh 12.0.0.1 hostname
vm-004

But if the entry in the local routing table is removed it does not work any more;

# ip ro del 12.0.0.1/32 table local
# ssh 12.0.0.1 hostname
(hangs...)

I think removing the entry in the local routing table immediately after adding the address to the kube-ipvs0 interface is the best way to fix this problem.

@uablrek
Contributor

uablrek commented Jan 2, 2019

Related kube-router issue; cloudnativelabs/kube-router#282

@uablrek
Contributor

uablrek commented Jan 5, 2019

Pod-to-pod traffic does not work when the local table entry is removed.

So I must withdraw my proposal.

Please read more in cloudnativelabs/kube-router#623

@emptywee

emptywee commented Jan 15, 2019

In my case (Kubernetes v1.11.5 with kube-proxy in the iptables mode), I get the very same behavior.

$ kubectl -n kube-system exec -ti test-pod -- bash
bash-4.2# telnet 10.149.0.1 22
Trying 10.149.0.1...
Connected to 10.149.0.1.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.6
^]
telnet> q
Connection closed.
bash-4.2# exit

Where 10.149.0.1 is the kubernetes.default service of the ClusterIP type.

$ kubectl -n kube-system logs kube-proxy-rnpkbw201 | head | grep Using
I1205 23:44:23.227397       1 server_others.go:140] Using iptables Proxier.

@m1093782566
Contributor

@emptywee

You should use ping instead of telnet.

@emptywee

@m1093782566 maybe I misread the original description of the issue, but I took it to mean that it was possible to ssh to the sshd service running on a host from a pod, using the IP address of any service created in Kubernetes, when kube-proxy is in ipvs mode, as opposed to iptables mode. In my case the behavior is the same with either mode.

Why should I use ICMP and ping when @fasaxc used ssh in the provided example?

@m1093782566
Contributor

ping clusterip inside cluster.

ssh/telnet clusterip:port outside cluster.

@emptywee

/shrug

I can ping the ClusterIP and I can telnet to the host's sshd from a pod using the clusterip:22 address, regardless of kube-proxy mode.

@k8s-ci-robot k8s-ci-robot added the ¯\_(ツ)_/¯ label Jan 15, 2019
@emptywee

All I wanted to say is that the behavior is the same regardless of the kube-proxy mode, while the issue starter said it was different between the iptables proxier and the ipvs proxier. Maybe it's different in Kubernetes 1.12, but with 1.11 here it's exactly the same.

@song-jiang

@emptywee The issue is about accessing a service (e.g. 10.149.0.1:80) with an invalid port (22) from a pod. Since port 22 is not a valid port for the service, in iptables mode accessing 10.149.0.1:22 should fail. However, ipvs mode attaches 10.149.0.1 to a local interface (kube-ipvs0), so accessing 10.149.0.1:22 from a pod is directed to the local host's sshd process.
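
For reference, you can list which ports a Service actually defines; anything else on its ClusterIP is an invalid port (a sketch using the kubernetes.default Service mentioned above):

kubectl get svc kubernetes -o jsonpath='{.spec.ports[*].port}'
# Typically prints just 443 - so 10.149.0.1:22 is not a valid
# service port and should never reach anything.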

@emptywee

@song-jiang I understand that, and in my case even with iptables it's still working as though I am in the ipvs mode. As I've shown above: #72236 (comment)

@fasaxc
Contributor Author

fasaxc commented Jan 16, 2019

@emptywee did you happen to switch from IPVS to iptables mode without rebooting? I'm wondering if there was an old kube-ipvs device hanging around to give the IPVS behaviour. I don't have a rig handy to check the iptables behaviour but I thought I'd seen that traffic get dropped before.

@emptywee

@fasaxc no, I have two clusters, one in ipvs mode and one that has never been in it, which I was going to switch to ipvs soon. Tested the described behavior in both of them and it worked identically.

@fasaxc
Contributor Author

fasaxc commented Jan 16, 2019

@emptywee that's odd then, does the service's IP show up in ip addr on the host? Is it the host that you're able to reach (service traffic typically gets forwarded to the default gateway if kube-proxy's rules get missed)?

@emptywee

@fasaxc yeah, that could be the case, as we have BGP routing set up, so the service and pod IP space is routable in our network.

@thockin thockin added triage/unresolved Indicates an issue that can not or will not be resolved. and removed ¯\_(ツ)_/¯ labels Mar 7, 2019
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2021
@Jc2k

Jc2k commented Dec 22, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2021
@jpiper

jpiper commented Jan 27, 2022

I recently discovered this when testing IPVS mode on kube-proxy, wondering why the hell I was able to SSH into a load balancer and get into the underlying host.

Not only does this apply to type: ClusterIP, it also applies to type: LoadBalancer. This effectively means that when you set up a LoadBalancer, all of the host's services are exposed on the LoadBalancer IP as well (which is truly awful).

A quick hack to patch both of these issues is, as mentioned above, to use iptables to ensure that traffic bound for the LoadBalancer/ClusterIP subnets gets rejected if the port number doesn't match the one in the service definition.

Obviously, depending on your iptables setup, the placement of these rules within your chains will be different, but the general gist is:

iptables -I INPUT 2 -d <VIP_CIDR> -m set ! --match-set KUBE-LOAD-BALANCER dst,dst -j REJECT
iptables -I INPUT 2 -d <SERVICE_CIDR> -m set ! --match-set KUBE-CLUSTER-IP dst,dst -j REJECT
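
If you try this, it may help to first confirm that the ipsets referenced by the rules exist on the node (kube-proxy's IPVS proxier maintains them; the names are the ones used in the rules above):

# KUBE-CLUSTER-IP and KUBE-LOAD-BALANCER are ip,port sets kept in sync
# with the Service definitions by kube-proxy in IPVS mode.
ipset list KUBE-CLUSTER-IP | head
ipset list KUBE-LOAD-BALANCER | head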

@Jc2k

Jc2k commented Jan 27, 2022

I tried escalating this to the security team but they weren't able to help!

At this point, the public issue is the right place to continue the discussion. You can also propose it as an agenda item at the SIG Network Kube Proxy meeting.

Unfortunately I can't easily reproduce it at the moment (too many mitigations) so I don't feel like I can strongly campaign to get this fixed on my own. Can you detail your setup a bit to help? Were the load balancers on internet facing IPs? What's your cni?

@jpiper

jpiper commented Jan 27, 2022

In my test rig I'm on bare metal; I put kube-proxy in ipvs mode and then use MetalLB to assign the VIPs from some private IP space. I have tried both Cilium and Calico for the CNI and both have this bug (my hunch is that if I tried Cilium's kube-proxy replacement this wouldn't be an issue).

I haven't looked into it, but I wouldn't be surprised if the cloud load balancer providers (e.g. https://github.com/kubernetes-sigs/aws-load-balancer-controller) put a security policy on the load balancer itself that only allows access on the configured ports, so this dodgy behaviour won't be seen because the LB drops traffic to non-configured ports before it gets to the host.

On bare metal, however, BGP just tells the switch "give me all traffic for this IP", and if your network ACLs allow it, it will forward all traffic for the LB IP on to the k8s host, which then happily exposes the host services on this IP.

@jpiper

jpiper commented Jan 27, 2022

or (and I suspect this is more likely) none of the cloud providers support IPVS out of the box so this isn't really on anyone's radar

Azure/AKS#1846
aws/containers-roadmap#142
https://stackoverflow.com/questions/54238807/is-there-a-way-to-enable-ipvs-proxy-mode-on-gke-cluster

@Jc2k

Jc2k commented Jan 27, 2022

Thanks @jpiper. Yeah, that's basically my setup (Calico + MetalLB), with VIPs coming from public IP space (not RFC 1918).

In the original Calico ticket (projectcalico/calico#2279) @KashifSaadat had Calico-managed iptables rules that couldn't block this traffic. So there was a bug in kube-proxy AND Calico couldn't even mitigate it.

I think Calico-managed pre-DNAT rules can block this traffic, but at this point my environments have too many other layers of filtering in place for me to draw any experimental conclusions with confidence.

And even if it does, it's not a replacement for the bug being fixed.

kube-router's kube-proxy replacement did have this issue; their fix is here.

I do wonder how many live systems are unknowingly exposed like this.

@jpiper

jpiper commented Feb 2, 2022

@Jc2k How do we go about proposing this as an agenda item then? This seems like an awful bug to have.

@Jc2k

Jc2k commented Feb 2, 2022

@jpiper I was given this link but that's all.

That doc mentions that someone called hanamant is involved in IPVS at that meeting; is that @hanamantagoudvk? Maybe they can help us get this security issue resolved.

@Jc2k

Jc2k commented Feb 2, 2022

@jpiper given you reproduced it on cilium, maybe worth dropping a line to security@cilium.io as well?

@jpiper

jpiper commented Feb 2, 2022

@Jc2k in the meantime, I'm just going to use kube-router's IPVS proxy implementation. I can delete the old kube-proxy DaemonSet and then install kube-router; it clears up all the old cruft and just works, and I don't get this issue or the other one in #75262 either. I verified this approach works with Calico; I will need to verify Cilium as well.

@Jc2k

Jc2k commented Feb 2, 2022

Oh, that's a really good tip. I've been bitten by #75262 too.

@uablrek
Contributor

uablrek commented Mar 2, 2022

The proposal in #72236 (comment) seems simple;

iptables -I INPUT 2 -d <VIP_CIDR> -m set ! --match-set KUBE-LOAD-BALANCER dst,dst -j REJECT
iptables -I INPUT 2 -d <SERVICE_CIDR> -m set ! --match-set KUBE-CLUSTER-IP dst,dst -j REJECT

But in practice (in code) it is actually hard for several reasons;

  • There is no VIP_CIDR.
  • The SERVICE_CIDR is not available to kube-proxy and it can't be read from the api-server. There seems to be some ideological reason for this, and arguing to make it available is not well received.
  • The filter table (at least the INPUT chain) is updated mostly (only?) by kubelet, not by kube-proxy. So there is a chance of conflicts.

That said, it is of course possible, but it will require a larger PR than you might think at first. I guess that's why it hasn't been done.

A way to implement this may be;

  1. Create and maintain a KUBE-IPVS0-IPS ipset (hash:ip) that holds all addresses assigned to kube-ipvs0
  2. Add iptables rules as below
  3. Test really hard every case where both kubelet and kube-proxy update the INPUT filter chain

iptables -A INPUT -m set --match-set KUBE-LOAD-BALANCER dst,dst -j ACCEPT
iptables -A INPUT -m set --match-set KUBE-CLUSTER-IP dst,dst -j ACCEPT
iptables -A INPUT -m set --match-set KUBE-IPVS0-IPS dst -j REJECT

This is repeated for IPv6, but that is a minor problem.
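
A minimal shell sketch of steps 1 and 2, using the set name above (the real fix would do this in kube-proxy's Go code, not in a script):

# Step 1: a hash:ip set holding every address assigned to kube-ipvs0
# (kube-proxy would keep this in sync as services come and go).
ipset create KUBE-IPVS0-IPS hash:ip
for ip in $(ip -4 addr show dev kube-ipvs0 | awk '/inet /{print $2}' | cut -d/ -f1); do
    ipset add KUBE-IPVS0-IPS "$ip"
done
# Step 2: append the three iptables rules listed above to the INPUT
# chain (and repeat with ip6tables for IPv6).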

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2022
@Jc2k

Jc2k commented May 31, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 31, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2022
@uablrek
Contributor

uablrek commented Aug 30, 2022

An open PR exists that fixes this problem.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 30, 2022
@uablrek
Contributor

uablrek commented Sep 2, 2022

And there was much rejoicing 😄
