
Networking breaks when using cilium with strict kube-proxy replacement #27619

Closed
StefanCenusa opened this issue Sep 29, 2020 · 31 comments
Labels
area/networking kind/docs lifecycle/staleproof

Comments

@StefanCenusa

StefanCenusa commented Sep 29, 2020

Bug description
This issue is related to a thread I've started on istio's slack channel.

It affects inter-service telemetry, but it might impact other features as well, because traffic is not treated as HTTP where it should be.

My setup is a kubernetes 1.18 cluster with cilium 1.8 configured with kubeProxyReplacement=strict. This means that the kube-proxy component from kubernetes is replaced and cilium handles its duties.
I'm not an expert in how cilium works, but this mode should improve service-to-service communication (and other networking paths) by leveraging eBPF functionality.

I have noticed (using tcpdump) that when this mode is enabled, if from one pod I make requests to another service (eg. curl http://servicename.namespace), the connections are "magically" made directly to the destination pod (pod-ip:target-port), rather than going through the ClusterIP of the destination service. I don't know the internals of how istio-proxy is configured or how the metadata filter works, but this behaviour seems to make istio-proxy think that requests go directly to pod-ip:container-port, thus no route from the istio config is matched and traffic goes through some default TCP path.
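A rough way to reproduce this observation (all names below are placeholders, not literal values from my cluster):

   # ClusterIP of the destination service, for comparison
   kubectl -n namespace get svc servicename -o jsonpath='{.spec.clusterIP}'

   # From the client pod, call the service by name
   kubectl -n namespace exec deploy/client -c app -- curl -s http://servicename.namespace

   # On the node hosting the client pod, watch the outgoing SYNs; with
   # kubeProxyReplacement=strict the destination is the backend pod IP,
   # not the ClusterIP printed above
   tcpdump -n -i any 'tcp[tcpflags] & tcp-syn != 0 and port 80'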

[ ] Docs
[ ] Installation
[X] Networking
[ ] Performance and Scalability
[X] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior
Istio metadata headers should be added to requests even when using Cilium's kube-proxy replacement.

Steps to reproduce the bug
To test this bug, I installed kube-proxy and changed Cilium's config to kubeProxyReplacement=partial, and inter-service telemetry started to work with no changes to the Istio setup at all. Moreover, when inspecting traffic between pods using tcpdump, I was able to see some packets being sent to the ClusterIP of the destination service (this didn't happen previously).
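(For reference, a sketch of how the mode can be switched with Helm; the exact flags depend on how Cilium was installed:)

   helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
     --set kubeProxyReplacement=partial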

Version (include the output of istioctl version --remote and kubectl version --short and helm version if you used Helm)

client version: 1.7.2
control plane version: 1.7.2
data plane version: 1.7.2 (1225 proxies)

How was Istio installed?
istioctl and istio-operator

Environment where bug was observed (cloud vendor, OS, etc)
aws, kops cluster

I am not able to share a config dump of my proxies, but this issue can be reproduced on any test cluster with cilium configured as described.

@howardjohn
Member

This essentially bypasses Istio load balancing. We do not support routing "direct to pod ip" traffic like they are doing

@StefanCenusa
Author

@howardjohn Thanks for the quick reply! It makes sense to be this way. Unfortunately I've lost a lot of time debugging this issue 😔 Maybe this info will be useful to other istio users that face similar issues or have similar setups.
I would suggest adding this information somewhere in istio's "getting started" or "setup" docs 😄

@howardjohn
Member

@StefanCenusa I think that makes sense. We should add the "direct to pod ip" to https://istio.io/latest/docs/ops/deployment/requirements/ and the cilium specific part (not sure where). I wasn't aware that is what the cilium no kube-proxy mode did, so good to know

@howardjohn added the kind/need more info label and removed the area/extensions and telemetry label on Sep 29, 2020
@StefanCenusa
Author

Regarding Cilium, I wasn't aware either. But now I found this in their docs:

Host-reachable services act transparent to Cilium’s lower layer datapath in that upon connect system call (TCP, connected UDP) or sendmsg as well as recvmsg (UDP) the destination IP is checked for an existing service IP and one of the service backends is selected as a target, meaning, while the application is assuming its connection to the service address, the corresponding kernel’s socket is actually connected to the backend address and therefore no additional lower layer NAT is required.

Thanks again! Should I close this issue or leave it open for the docs improvement part?

@howardjohn added the kind/docs label and removed the kind/need more info label on Sep 29, 2020
@howardjohn
Member

Sorry, added the wrong label - let's keep this open.

@tgraf

tgraf commented Sep 29, 2020

We have just been made aware of this. We can definitely add a note to our docs to disable host-services in order to skip the load-balancing decision. Before doing that, I would like to understand a bit better why translating the ClusterIP to a pod IP before Istio sees it is causing a problem.

Context: I'm a Cilium developer.

@howardjohn
Member

@tgraf Istio routing is based on:

  • HTTP traffic - we look only at host header, by binding to 0.0.0.0:<PORT> and doing routing in Envoy Routes.
  • Non HTTP traffic - we create listeners for <SERVICE_VIP>:<PORT>.
  • Auto detected traffic (no protocol specified) is also using <SERVICE_VIP>:<PORT> listeners.

So explicitly declared HTTP traffic will work (both in theory and as I confirmed), since the routing doesn't take the IP into account, but everything else is not going to hit the appropriate configuration.
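For reference, one hypothetical way to see this split on a sidecar-injected pod (pod name and namespace are placeholders):

   istioctl proxy-config listeners <pod-name> -n <namespace>
   # HTTP services show up as 0.0.0.0:<PORT> listeners routed by Host header,
   # while TCP / auto-detected services are bound to their ClusterIP; traffic sent
   # straight to a pod IP matches neither and falls through to the passthrough path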

@mandarjog
Contributor

mandarjog commented Sep 29, 2020

@tgraf does disabling host services translation significantly reduce functionality provided by cilium?

@tgraf

tgraf commented Sep 29, 2020

@tgraf does disabling host services translation significantly reduce functionality provided by cilium?

No, it does not. The functionality when implemented on the network level is the same, just with a much higher overhead as each network packet has to be altered.

We will point out in the Cilium documentation that host services are currently not compatible with Istio due to the way Istio does its redirect.

On that note, what if Cilium did the redirect of all traffic? Similar to how host services can currently bypass Istio, they can also be used in favor of Istio to redirect traffic to the sidecar.

Non HTTP traffic - we create listeners for <SERVICE_VIP>:<PORT>.

This then means that headless services are not compatible either if the DNS directly returns pod IPs?

@howardjohn
Member

This then means that headless services are not compatible either if the DNS directly returns pod IPs?

Good point - for headless services specifically we create listeners for each pod IP. The assumption we are depending on here, and why we do this only for headless services, is that users will not have massive headless services as this will lead to scalability bottlenecks.

@costinm
Contributor

costinm commented Sep 29, 2020

How would Cilium redirect all traffic to Istio? We do need some metadata - like the original DST:port - that we get using a system call. We have discussed different ways to get this metadata - like a 'proxy protocol' prefix, or tunneling it in H2 CONNECT - but it'll take some time.

@tgraf

tgraf commented Oct 9, 2020

How would Cilium redirect all traffic to Istio? We do need some metadata - like the original DST:port - that we get using a system call. We have discussed different ways to get this metadata - like a 'proxy protocol' prefix, or tunneling it in H2 CONNECT - but it'll take some time.

We have already built this using TPROXY so you will see the original network headers. This is also how Cilium redirects packets to Envoy when Istio is not in play. In the past, we have also shared metadata with Envoy and have built an eBPF metadata reader in Envoy to share arbitrary data between the Cilium datapath and Envoy.

@mandarjog
Contributor

@tgraf is this available in OSS and does it require CAP_NET_ADMIN to read eBPF metadata?

@tgraf

tgraf commented Oct 9, 2020

@tgraf is this available in OSS

Yes, all of this is available in Cilium today.

and does it require CAP_NET_ADMIN to read eBPF metadata?

Yes, when using TPROXY with IP_TRANSPARENT it will require CAP_NET_ADMIN. We also used to have code to redirect to a known listener port on Envoy which does not require CAP_NET_ADMIN.

@howardjohn added this to P1 in Prioritization on Oct 23, 2020
@istio-policy-bot added the lifecycle/stale label on Jan 8, 2021
@howardjohn
Member

howardjohn commented Jan 8, 2021 via email

@istio-policy-bot removed the lifecycle/stale label on Jan 8, 2021
@howardjohn added the lifecycle/staleproof label on Feb 10, 2021
@brb

brb commented Oct 18, 2021

Just a heads-up - we released the fix (cilium/cilium#17154) in Cilium v1.10.5, which allows Cilium's KPR to cooperate with Istio's dataplane. The cilium-agent option is named --bpf-lb-sock-hostns-only, while in Helm it is hostServices.hostNamespaceOnly. You need to set it to true in order to prevent Cilium's socket LB from translating service connections. Otherwise, the Istio sidecar proxy won't be able to determine the original destination.
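For reference, a sketch of how this could be set with Helm (values as named above; adjust to your install method):

   helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
     --set hostServices.hostNamespaceOnly=true
   # equivalent cilium-agent flag / ConfigMap entry: bpf-lb-sock-hostns-only=true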

@dntosas
Contributor

dntosas commented Oct 19, 2021

Just a heads-up - we released the fix (cilium/cilium#17154) in Cilium v1.10.5, which allows Cilium's KPR to cooperate with Istio's dataplane. The cilium-agent option is named --bpf-lb-sock-hostns-only, while in Helm it is hostServices.hostNamespaceOnly. You need to set it to true in order to prevent Cilium's socket LB from translating service connections. Otherwise, the Istio sidecar proxy won't be able to determine the original destination.

Applied the above configuration on our clusters; everything works like a charm 🚀

@howardjohn changed the title from "Telemetry v2 breaks when using cilium with strict kube-proxy replacement" to "Networking breaks when using cilium with strict kube-proxy replacement" on Aug 24, 2022
@howardjohn
Member

Maybe we can add a check to istioctl precheck to ensure compatible cilium settings are used (bpf-lb-sock-hostns-only)
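Until such a precheck exists, a manual inspection along those lines might look like this (assuming the default cilium-config ConfigMap name):

   kubectl -n kube-system get configmap cilium-config \
     -o jsonpath='{.data.bpf-lb-sock-hostns-only}'
   # empty or "false" means the socket LB still translates service IPs inside pods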

@brb

brb commented Aug 25, 2022

Maybe we can add a check to istioctl precheck to ensure compatible cilium settings are used (bpf-lb-sock-hostns-only)

👍 We could make the setting accessible through Cilium's Config API.

@alfsch

alfsch commented Dec 20, 2022

Any news on this?

@howardjohn
Member

@alfsch The only thing Istio can ever do about this is warn that it's enabled. The option fundamentally breaks Istio. No news on a warning being added.

@brb

brb commented Feb 2, 2023

@howardjohn Is there an easy way to detect when Cilium runs with Istio?

@howardjohn
Member

Statically we can try to read Cilium's config and parse the ConfigMap for this value, but I don't know how reliable (or stable) that is. I do wonder if we could add a runtime check. We have a validation container that tests that our iptables rules actually work; I wonder if we could extend this... something like sending a request to kubernetes.default and asserting we get the ClusterIP instead of some other IP when we capture. WDYT?

One issue with that is that we want to remove the validation container, but I think there will always be a place we can run it; it just may move.
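A very rough sketch of that idea (purely illustrative; it inspects the kernel socket rather than capturing packets, and assumes bash, getent and ss are available in the container):

   expected=$(getent hosts kubernetes.default.svc | awk '{print $1}')  # ClusterIP from DNS
   exec 3<>/dev/tcp/kubernetes.default.svc/443                         # open a TCP connection
   ss -tn state established | grep -q "$expected" \
     && echo "connect() still targets the ClusterIP" \
     || echo "connect() was rewritten to a backend IP (socket LB active)"
   exec 3<&- 3>&-                                                      # close the connection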

@brb

brb commented Feb 3, 2023

I'm more interested in detecting the presence of Istio from Cilium's PoV. I'd like to add a warning if KPR / socket LB is enabled when running with Istio.

@howardjohn
Member

Ah got it. At install time you probably have Cilium before Istio, so I would imagine you want it at runtime.

Some possible detection mechanisms:

  • istio-system namespace exists
  • Pod has istio-proxy container
  • pod has a container running as UID 1337 with name "envoy"?
  • Pod has 2 containers and some specific iptables rules
  • Pod has 2 containers and listens on some specific ports

I am not sure what type of data you have where you would run this, but I'd imagine we could find an approach that works

One note: this would break all sidecar service meshes, so you might want a more generic approach.
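A rough sketch of the first two detection bullets (purely illustrative):

   kubectl get ns istio-system >/dev/null 2>&1 && echo "istio-system namespace present"
   kubectl get pods -A -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.name}{"\n"}{end}{end}' \
     | grep -qx istio-proxy && echo "pods with an istio-proxy sidecar found"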

@alfsch

alfsch commented Feb 15, 2023

For all who have trouble with Cilium in strict mode without kube-proxy:

From my point of view, the root cause of the issues was our Cilium version < 1.12.x and Linux kernel < 5.7, due to the use of Ubuntu 20.04 as the Kubernetes node image.

We had the same issues as described here with HTTP(S) communication across namespaces using services whose ports map to different destination ports. Using the same ports mitigated the problem for some pods, but not for all;
e.g. a Java-based service worked well, but Python services with an async lib had trouble.

After updating to Cilium 1.12.6 we still faced the issues, but when looking into the logs of the Cilium pods, I found warnings in the startup section about possible problems with Istio when using a kernel lower than 5.7.

We updated the cluster nodes to a current Ubuntu 22.04 with kernel 5.15.x and voilà, all our issues were gone. We don't even need the port mapping change to use the same destination ports.

The command line that works for us to render the Cilium chart is:

   helm template cilium cilium/cilium --version 1.12.6 --namespace kube-system --set cni.exclusive=false --set operator.replicas=1 \
     --set hubble.tls.auto.method=cronJob --set hubble.relay.enabled=true --set hubble.ui.enabled=true \
     --set socketLB.hostNamespaceOnly=true --set nodePort.enabled=true --set hostPort.enabled=true \
     --set externalIPs.enabled=false --set bpf.masquerade=false \
     --set kubeProxyReplacement=strict --set k8sServiceHost=CHANGEME --set k8sServicePort=CHANGEME

kubernetes node:

System Info:
  Kernel Version:             5.15.0-60-generic
  OS Image:                   Ubuntu 22.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.16
  Kubelet Version:            v1.22.16

I found this because I saw a warning during startup. @brb Is the feature you want already implemented?

@boriscosic

boriscosic commented Apr 28, 2023

Resolved: need kernel 5.7 at least
k logs -f cilium-vd6mb | grep cookie
Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), wait-for-node-init (init), clean-cilium-state (init), install-cni-binaries (init)
level=warning msg="Without network namespace cookie lookup functionality, BPF datapath cannot distinguish root and non-root namespace, skipping socket-level loadbalancing will not work. Istio routing chains will be missed. Needs kernel version >= 5.7" subsys=daemon

Still seeing this issue on AWS EKS 1.23:

Linux 5.4.238-148.347.amzn2.x86_64 #1 SMP Thu Apr 6 19:42:57 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Minimal install:

helm install cilium cilium/cilium \
    --version 1.13.2 \
    --namespace kube-system \
    --set kubeProxyReplacement=strict \
    --set socketLB.hostNamespaceOnly=true

On the left-hand side is the application pod and on the right-hand side is the istio-proxy sidecar of said application pod. tcpdump on istio-proxy shows requests going to the endpoints of coredns, instead of kube-dns.

Prior to replacing kube-proxy we were seeing correct calls:

20:20:22.043611 IP test-cf8977485-gfh4n.33835 > kube-dns.kube-system.svc.cluster.local.53: 10927+ PTR? 0.100.51.198.in-addr.arpa. (43)
20:20:22.046156 IP kube-dns.kube-system.svc.cluster.local.53 > test-cf8977485-gfh4n.33835: 10927 NXDomain 0/1/0 (125)

After replacing kube-proxy:

05:35:36.927550 IP test-cf8977485-xx76b.46530 > 10-0-0-15.kube-dns.kube-system.svc.cluster.local.domain: 24413+ A? address.internal.default.svc.cluster.local. (60)
05:35:36.927600 IP test-cf8977485-xx76b.46530 > 10-0-0-15.kube-dns.kube-system.svc.cluster.local.domain: 46683+ AAAA? address.internal.default.svc.cluster.local. (60)
05:35:36.928428 IP test-cf8977485-xx76b.48588 > ip-10-0-0-202.us-west-2.compute.internal.domain: 9899+ A? address.internal.svc.cluster.local. (52)

Any thoughts on what else to try? After installing cilium, nodes were terminated to ensure all configurations applied correctly.

(Screenshot: application pod on the left, istio-proxy sidecar on the right.)

@gyutaeb

gyutaeb commented Jun 12, 2023

Kernel v5.7 includes the patch torvalds/linux@f318903.
Cilium needs this patch when --set socketLB.hostNamespaceOnly=true is used.
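A quick way to check whether the nodes meet that requirement (standard node status fields):

   kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
   # every node should report a kernel >= 5.7 for socketLB.hostNamespaceOnly to take effect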

@christian-posta
Contributor

Seems like we can close this?

@boriscosic

Thanks @christian-posta, @gyutaeb. Confirmed with EKS 1.24 (kernel v5.10) that this issue is resolved.

@keithmattix
Contributor

Closing
