ingress gateway restarts with tcp_cluster_rewrite #21676

Closed
ayj opened this issue Feb 29, 2020 · 22 comments · Fixed by istio/proxy#2774
Assignees
Labels
area/extensions and telemetry feature/Multi-cluster issues related with multi-cluster support
Milestone

Comments

@ayj
Contributor

ayj commented Feb 29, 2020

Steps to reproduce:

Follow the replicated control plane setup through testing the example service (see https://preliminary.istio.io/docs/setup/install/multicluster/gateways/#configure-the-example-services). The curl request fails with a 503. The destination ingress gateway logs show an exception and a backtrace. The readiness checks then fail and the pod is restarted.

The stack trace points to https://github.com/istio/proxy/blob/release-1.5/src/envoy/tcp/tcp_cluster_rewrite/tcp_cluster_rewrite.cc, where the *.global suffix is rewritten to *.svc.cluster.local.
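
For readers unfamiliar with the filter, the rewrite described above amounts to a regular-expression substitution on the cluster name. The following is a minimal Python sketch of that behavior, not the proxy's C++ code. The pattern `\.global$` and the replacement `.svc.cluster.local` are assumptions chosen to match the description in this issue, and `rewrite_cluster` is a hypothetical helper name.

```python
import re

# Assumed values for illustration only; the real ones come from the
# filter's configuration in the gateway's EnvoyFilter.
CLUSTER_PATTERN = re.compile(r"\.global$")
CLUSTER_REPLACEMENT = ".svc.cluster.local"


def rewrite_cluster(name: str) -> str:
    """Return the cluster name with a trailing .global suffix rewritten."""
    return CLUSTER_PATTERN.sub(CLUSTER_REPLACEMENT, name, count=1)


if __name__ == "__main__":
    print(rewrite_cluster("outbound_.8000_._.httpbin.bar.global"))
```

Names that do not end in .global pass through unchanged, so the rewritten name only resolves if a cluster with the .svc.cluster.local name actually exists on the gateway.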

config_dump: https://gist.github.com/ayj/bb456945b45fb450c63c1539d27a72e4
logs: https://gist.github.com/ayj/6658f586b028bdea12991a876a2a1dcf
logs w/level=trace: https://gist.github.com/ayj/5c07b20d651bb6be48a08b653f0ff590

@ayj ayj changed the title ingress gateway panics when receiving traffic from ingress gateway restarts with tcp_cluster_rewrite Feb 29, 2020
@ayj
Contributor Author

ayj commented Feb 29, 2020

cc @douglas-reid @PiotrSikora (last two people to touch tcp_cluster_rewrite.cc)

@ayj
Contributor Author

ayj commented Mar 2, 2020

@rshriram, the gateway proxy is crashing somewhere near the sni_filter and tcp_cluster_rewrite filters. Any ideas? The latter filter hasn't changed in a while. Possibly a bad interaction with other filters?

@FrimIdan
Contributor

FrimIdan commented Mar 4, 2020

@ayj did you use the following command to install the multicluster setup?
istioctl manifest apply -f install/kubernetes/operator/examples/multicluster/values-istio-multicluster-gateways.yaml

I failed to find that file. Is this the one: operator/data/examples/multicluster/values-istio-multicluster-gateways.yaml?

@costinm
Contributor

costinm commented Mar 4, 2020

One thing we can check is whether using a 1.4-based ingress (which is compatible with the 1.5 control plane) is a reasonable workaround until 1.5.1 ships.

@hzxuzhonghu
Member

It seems outbound_.8000_._.httpbin.bar.svc.cluster.local does not exist from the config dump, not sure if this matters
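
One way to check a config dump for a given cluster is to collect every cluster name from it and test for membership. The sketch below assumes the usual Envoy admin `config_dump` layout (a top-level `configs` list whose clusters section carries `static_clusters` and `dynamic_active_clusters`, each entry wrapping a `cluster` object). `cluster_names` is a hypothetical helper, and the sample data is synthetic rather than taken from the linked gist.

```python
import json
import sys


def cluster_names(config_dump: dict) -> set:
    """Collect cluster names from an Envoy config_dump document."""
    names = set()
    for section in config_dump.get("configs", []):
        # Static and dynamically delivered (CDS) clusters are listed separately.
        for key in ("static_clusters", "dynamic_active_clusters"):
            for entry in section.get(key, []):
                name = entry.get("cluster", {}).get("name")
                if name:
                    names.add(name)
    return names


if __name__ == "__main__":
    # Usage: python check_clusters.py config_dump.json <cluster-name>
    with open(sys.argv[1]) as f:
        dump = json.load(f)
    print(sys.argv[2] in cluster_names(dump))
```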

@esnible
Contributor

esnible commented Mar 5, 2020

Does this not suggest a user could do a denial-of-service on Istio by supplying any .global hostname in a request? Is the single-cluster setup immune?

cc @mbanikazemi

@mbanikazemi
Contributor

mbanikazemi commented Mar 5, 2020

@hzxuzhonghu fwiw, in my setup I do see the outbound_.8000_._.httpbin.bar.svc.cluster.local cluster, and Envoy still crashes, probably at Envoy::Tcp::TcpClusterRewrite::TcpClusterRewriteFilter::onNewConnection().

@hzxuzhonghu
Member

We should check whether disabling the envoy.filters.network.wasm filter fixes this, so that we can narrow the scope.

@linsun
Member

linsun commented Mar 5, 2020

@esnible good question, I'm also curious about the impact of this issue on single-cluster setups.

cc 1.5 release mgrs fyi @fpesce @dgn @johnma14

@howardjohn
Member

It seems outbound_.8000_._.httpbin.bar.svc.cluster.local does not exist from the config dump, not sure if this matters

There was a PR to disable these by default I think

@howardjohn
Member

Does this not suggest a user could do a denial-of-service on Istio by supplying any .global hostname in a request? Is the single-cluster setup immune?

This filter is only set if multicluster is enabled:

{{- if .Values.global.multiCluster.enabled }}

Single cluster is not impacted.

@hzxuzhonghu
Member

It seems outbound_.8000_._.httpbin.bar.svc.cluster.local does not exist from the config dump, not sure if this matters

There was a PR to disable these by default I think

I think you mean the one from @vikaschoudhary16. That PR disables normal clusters; the clusters here are sni-dnat clusters, which are not affected.

@howardjohn
Member

howardjohn commented Mar 6, 2020 via email

@hzxuzhonghu
Member

But what was tested here is release-1.5, and I checked my deployment: my gateway has ISTIO_META_ROUTER_MODE: "sni-dnat".

@hzxuzhonghu
Member

ISTIO_META_ROUTER_MODE: "sni-dnat"

@incfly

incfly commented Mar 17, 2020

/cc @incfly

@sdake
Member

sdake commented Mar 18, 2020

cc @sdake

@ayj ayj assigned yxue Mar 18, 2020
@sdake
Member

sdake commented Mar 19, 2020

@ayj et al, for context on why this is needed.

I think I had put this information in a PR somewhere, or it's in a design doc, but I struggle to find it even with Google search, so I'll do the evil deed of repeating it here :)

  1. auto-passthrough when exposed via the ingress gateways as implemented in Istio 1.1-1.4 is described here: add istiod port into gateway #21803
  2. AUTO_PASSTHROUGH was implemented in 1.1 AFAICR; however, it wasn’t documented until 1.2, here: https://archive.istio.io/v1.2/docs/reference/config/networking/v1alpha3/gateway/#Server-TLSOptions-TLSmode
  3. MC uses the automatic passthrough introduced here: https://github.com/istio/istio/blob/release-1.1/install/kubernetes/helm/istio/charts/gateways/templates/preconfigured.yaml#L147-L158
  4. MC triggers an EnvoyFilter of type NETWORK to be inserted after the SNI filter: https://github.com/istio/istio/blob/release-1.1/install/kubernetes/helm/istio/charts/gateways/templates/preconfigured.yaml#L171-L184
  5. MC uses tcp_cluster_rewrite to ask Envoy to rewrite the cluster name from service.svc.cluster.global to service.svc.cluster.local: https://github.com/istio/proxy/blob/master/src/envoy/tcp/tcp_cluster_rewrite/tcp_cluster_rewrite.cc#L57-L58
  6. Mutual TLS is implemented by this DR during installation: https://github.com/istio/istio/blob/release-1.1/install/kubernetes/helm/istio/charts/gateways/templates/preconfigured.yaml#L186-L205

Finally:
Ingress traffic is then forwarded by Envoy to the correct Envoy as dictated by the DR, GW, and EF to Pilot.
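
As an illustration of the names involved in the steps above, the cluster names seen in this issue (for example outbound_.8000_._.httpbin.bar.svc.cluster.local) follow a direction, port, subset, hostname layout joined by `_.`. The Python sketch below splits such a name into its parts. It assumes that layout holds for every name passed in, and `ClusterKey` and `parse_cluster_key` are hypothetical names used only for this example.

```python
from typing import NamedTuple


class ClusterKey(NamedTuple):
    direction: str
    port: int
    subset: str
    hostname: str


def parse_cluster_key(name: str) -> ClusterKey:
    """Split a name like outbound_.8000_._.host into its four fields."""
    # At most three splits, so the hostname is kept whole.
    parts = name.split("_.", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected cluster name: {name!r}")
    direction, port, subset, hostname = parts
    return ClusterKey(direction, int(port), subset, hostname)


if __name__ == "__main__":
    print(parse_cluster_key("outbound_.8000_._.httpbin.bar.global"))
```

An empty subset, as in the names from this issue, shows up as an empty string in the third field.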

istio-testing pushed a commit that referenced this issue Mar 20, 2020
- *fixes* ingress gateway restarts with tcp_cluster_rewrite #21676
- *fixes* stackdriver to separate the bucket definitions used for bytes distributions from the definitions used for latency measurements.
@mbanikazemi
Contributor

In case someone refers to the list of things done by this filter as listed above, note that the filter changes the cluster name from .global (generally to .svc.cluster.local) and not to .global.

@sdake
Member

sdake commented Mar 28, 2020

@mbanikazemi thanks! I have improved the reasoning with your comments.

Cheers
-steve

@irisdingbj irisdingbj added the feature/Multi-cluster issues related with multi-cluster support label Mar 31, 2020
@SRodi
Member

SRodi commented Oct 18, 2020

Any updates on this one? I am still getting 503 while using Istio version 1.7.3.

kubectl exec --context=$CTX_1 $SLEEP_POD -n foo -c sleep -- curl -I httpbin.bar.global:8000/headers
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0    91    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/1.1 503 Service Unavailable
content-length: 91
content-type: text/plain
date: Sun, 18 Oct 2020 17:45:31 GMT
server: envoy

I installed Istio with install -f manifests/examples/multicluster/values-istio-multicluster-gateways.yaml and followed all the steps described in the docs for the replicated control plane setup.

Cluster 1 v1.17.12-gke.1501
KubeDNS ConfigMap Cluster1:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"global": ["$CLUSTER_IP_GKE"]}

Cluster 2 v1.17.11-eks-cfdc40
KubeDNS ConfigMap Cluster2:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
  labels:
    eks.amazonaws.com/component: coredns
    k8s-app: kube-dns
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
    global:53 {
        errors
        cache 30
        forward . $CLUSTER_IP_EKS:53
        reload
    }

Thank you

@sands6

sands6 commented Dec 18, 2020

@SRodi Did you have any luck? I am hitting a similar error, but I am seeing 404 errors on the remote cluster gateways:

{"downstream_remote_address":"10.xxx.xxx.xxx:52809","authority":"istio-temp-gateway.service.corporate.domain","path":"/","protocol":"HTTP/1.1","upstream_service_time":"-","upstream_local_address":"-","duration":"0","upstream_transport_failure_reason":"-","route_name":"-","downstream_local_address":"10.xxx.xxx.xxx:80","user_agent":"curl/7.54.0","response_code":"404","response_flags":"NR","start_time":"2020-12-18T18:30:03.304Z","method":"GET","request_id":"a3a1e12f-eb22-4045-955e-e66e776f691c","upstream_host":"-","x_forwarded_for":"10.xxx.xxx.xxx","requested_server_name":"-","bytes_received":"0","istio_policy_status":"-","bytes_sent":"0","upstream_cluster":"-"}

I am not sure whether the TLD is being translated by the rewrite filter. It might not be able to find an upstream cluster for the gateway endpoint istio-temp-gateway.service.corporate.domain, as this is only a dummy endpoint in the ServiceEntry for routing to the remote cluster.
I am trying to generate trace logs on the remote gateways, but the interesting part is that I don't see any logs related to the 404 after a few requests.
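
When triaging entries like the access log line above, it can help to pull out just the response code and the response flags. The entry carries response_flags "NR", which in Envoy's access log format means no route was configured for the request. The Python sketch below assumes the JSON field names shown in that entry; `summarize_access_log` is a hypothetical helper.

```python
import json


def summarize_access_log(line: str) -> dict:
    """Extract the fields most useful for triaging a gateway error."""
    entry = json.loads(line)
    raw_flags = entry.get("response_flags", "-")
    # Envoy prints "-" when no response flag is set.
    flags = [] if raw_flags == "-" else raw_flags.split(",")
    return {
        "response_code": int(entry["response_code"]),
        "authority": entry.get("authority", "-"),
        "upstream_cluster": entry.get("upstream_cluster", "-"),
        "flags": flags,
        # NR is Envoy's "no route configured" response flag.
        "no_route": "NR" in flags,
    }


if __name__ == "__main__":
    sample = '{"authority":"example.test","response_code":"404","response_flags":"NR","upstream_cluster":"-"}'
    print(summarize_access_log(sample))
```

A 404 with NR and an upstream_cluster of "-" suggests the request matched no route on the gateway at all, which is a different failure from the 503 and crash reported at the top of this issue.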
