ingress gateway restarts with tcp_cluster_rewrite #21676

ayj · 2020-02-29T19:14:12Z

Steps to reproduce:

Follow the replicated control plane setup through testing the example service (see https://preliminary.istio.io/docs/setup/install/multicluster/gateways/#configure-the-example-services). The curl request fails with a 503. The destination ingress gateway logs show an exception and backtrace. The readiness checks then fail and the pod is restarted.

The stack trace points to https://github.com/istio/proxy/blob/release-1.5/src/envoy/tcp/tcp_cluster_rewrite/tcp_cluster_rewrite.cc where the *.global suffix is rewritten to *.svc.cluster.local.

config_dump: https://gist.github.com/ayj/bb456945b45fb450c63c1539d27a72e4
logs: https://gist.github.com/ayj/6658f586b028bdea12991a876a2a1dcf
logs w/level=trace: https://gist.github.com/ayj/5c07b20d651bb6be48a08b653f0ff590

ayj · 2020-02-29T19:15:34Z

cc @douglas-reid @PiotrSikora (last two people to touch tcp_cluster_rewrite.cc)

ayj · 2020-03-02T23:06:56Z

@rshriram, the gateway proxy is crashing somewhere near the sni_filter and tcp_cluster_rewrite filters. Any ideas? The latter filter hasn't changed in a while. Could it be a bad interaction with other filters?

FrimIdan · 2020-03-04T16:37:30Z

@ayj did you use the following command to install the multicluster setup?
istioctl manifest apply -f install/kubernetes/operator/examples/multicluster/values-istio-multicluster-gateways.yaml

I failed to find the file. Is this the one: operator/data/examples/multicluster/values-istio-multicluster-gateways.yaml?

costinm · 2020-03-04T18:06:50Z

One thing we can check is whether using a 1.4-based ingress (which is compatible with the 1.5 control plane) is a reasonable workaround until 1.5.1 ships.

hzxuzhonghu · 2020-03-05T02:46:13Z

It seems outbound_.8000_._.httpbin.bar.svc.cluster.local does not exist from the config dump, not sure if this matters

esnible · 2020-03-05T02:55:56Z

Does this not suggest a user could do a denial-of-service on Istio by supplying any .global hostname in a request? Is the single-cluster setup immune?

cc @mbanikazemi

mbanikazemi · 2020-03-05T03:01:50Z

@hzxuzhonghu FWIW, in my setup I do see the outbound_.8000_._.httpbin.bar.svc.cluster.local cluster, and Envoy still crashes, probably in Envoy::Tcp::TcpClusterRewrite::TcpClusterRewriteFilter::onNewConnection()

hzxuzhonghu · 2020-03-05T03:33:30Z

We should check whether disabling the envoy.filters.network.wasm filter fixes this, so that we can narrow the scope.

linsun · 2020-03-05T15:26:09Z

@esnible good question, I'm also curious about the impact of this issue on single-cluster setups.

cc 1.5 release mgrs fyi @fpesce @dgn @johnma14

howardjohn · 2020-03-05T16:57:56Z

> It seems outbound_.8000_._.httpbin.bar.svc.cluster.local does not exist from the config dump, not sure if this matters

There was a PR to disable these by default I think

howardjohn · 2020-03-05T16:58:36Z

> Does this not suggest a user could do a denial-of-service on Istio by supplying any .global hostname in a request? Is the single-cluster setup immune?

This filter is only set if multicluster is enabled (istio/manifests/gateways/istio-ingress/templates/preconfigured.yaml, line 37 in 1896ee0). Single cluster is not impacted.

hzxuzhonghu · 2020-03-06T01:42:05Z

> > It seems outbound_.8000_._.httpbin.bar.svc.cluster.local does not exist from the config dump, not sure if this matters
>
> There was a PR to disable these by default I think

I think you mean the one from @vikaschoudhary16. That PR disables normal clusters; the clusters here are sni-dnat clusters, which are not affected.

howardjohn · 2020-03-06T01:44:58Z

#18431


hzxuzhonghu · 2020-03-06T01:48:44Z

But what was tested here is release-1.5, and I checked my deployment: my gateway has ISTIO_META_ROUTER_MODE: "sni-dnat".

hzxuzhonghu · 2020-03-06T01:50:56Z

istio/manifests/gateways/istio-ingress/values.yaml, line 158 in 51c4b1a:

ISTIO_META_ROUTER_MODE: "sni-dnat"

incfly · 2020-03-17T00:28:12Z

/cc @incfly

sdake · 2020-03-18T14:04:46Z

cc @sdake

sdake · 2020-03-19T18:41:12Z

@ayj et al, for context on why this is needed.

I think I had put this information in a PR somewhere, or it's in a design doc, but I struggle to find it even with Google search, so I'll do the evil deed of repeating it here :)

auto-passthrough when exposed via the ingress gateways as implemented in Istio 1.1-1.4 is described here: add istiod port into gateway #21803
AUTO_PASSTHROUGH was implemented in 1.1 AFAICR; however, it wasn’t documented until 1.2 here: https://archive.istio.io/v1.2/docs/reference/config/networking/v1alpha3/gateway/#Server-TLSOptions-TLSmode
MC uses the automatic passthrough introduced here: https://github.com/istio/istio/blob/release-1.1/install/kubernetes/helm/istio/charts/gateways/templates/preconfigured.yaml#L147-L158
MC triggers an EnvoyFilter of type NETWORK to be inserted after the SNI filter: https://github.com/istio/istio/blob/release-1.1/install/kubernetes/helm/istio/charts/gateways/templates/preconfigured.yaml#L171-L184
MC uses tcp_cluster_rewrite to ask Envoy to rewrite the cluster name from service.svc.cluster.global to service.svc.cluster.local: https://github.com/istio/proxy/blob/master/src/envoy/tcp/tcp_cluster_rewrite/tcp_cluster_rewrite.cc#L57-L58
Mutual TLS is implemented by this DR during installation: https://github.com/istio/istio/blob/release-1.1/install/kubernetes/helm/istio/charts/gateways/templates/preconfigured.yaml#L186-L205

Finally:
Ingress traffic is then forwarded by Envoy to the correct Envoy as dictated by the DR, GW, and EF to Pilot.

- *fixes* ingress gateway restarts with tcp_cluster_rewrite #21676
- *fixes* stackdriver to separate the bucket definitions used for bytes distributions from the definitions used for latency measurements.

mbanikazemi · 2020-03-26T16:46:50Z

In case someone refers to the list of things done by this filter as listed above, note that the filter changes the cluster name from .global (generally to .svc.cluster.local) and not to .global.
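
For illustration, the rewrite can be sketched as a regex substitution on the cluster name. This is a minimal Python sketch, not the filter's C++ code: the pattern `\.global$` and the replacement `.svc.cluster.local` are assumptions based on the behaviour described in this thread, and `rewrite_cluster_name` is a hypothetical helper.

```python
import re

# Assumed values: a ".global" suffix is replaced with ".svc.cluster.local".
CLUSTER_PATTERN = re.compile(r"\.global$")
CLUSTER_REPLACEMENT = ".svc.cluster.local"


def rewrite_cluster_name(name: str) -> str:
    """Return the cluster name with a trailing .global suffix rewritten.

    Names that do not end in .global are returned unchanged.
    """
    return CLUSTER_PATTERN.sub(CLUSTER_REPLACEMENT, name)


if __name__ == "__main__":
    # A cluster name derived from the SNI of a .global request.
    print(rewrite_cluster_name("outbound_.8000_._.httpbin.bar.global"))
```

With these assumed values, a cluster name ending in httpbin.bar.global maps to the httpbin.bar.svc.cluster.local cluster name discussed earlier in the thread, while names without the .global suffix pass through untouched.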

sdake · 2020-03-28T17:37:03Z

@mbanikazemi thanks! I have improved the reasoning with your comments.

Cheers
-steve

SRodi · 2020-10-18T18:06:59Z

Any updates on this one? I am still getting 503 while using Istio version 1.7.3.

kubectl exec --context=$CTX_1 $SLEEP_POD -n foo -c sleep -- curl -I httpbin.bar.global:8000/headers
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0    91    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
HTTP/1.1 503 Service Unavailable
content-length: 91
content-type: text/plain
date: Sun, 18 Oct 2020 17:45:31 GMT
server: envoy

I installed Istio with install -f manifests/examples/multicluster/values-istio-multicluster-gateways.yaml and followed all the steps described in the docs for replicated control planes.

Cluster 1 v1.17.12-gke.1501
KubeDNS ConfigMap Cluster1:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"global": ["$CLUSTER_IP_GKE"]}

Cluster 2 v1.17.11-eks-cfdc40
KubeDNS ConfigMap Cluster2:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
  labels:
    eks.amazonaws.com/component: coredns
    k8s-app: kube-dns
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
    global:53 {
        errors
        cache 30
        forward . $CLUSTER_IP_EKS:53
        reload
    }

Thank you

sands6 · 2020-12-18T20:40:53Z

@SRodi Did you have any luck? I am hitting a similar error, but I am seeing 404 errors on the remote cluster gateways:

{"downstream_remote_address":"10.xxx.xxx.xxx:52809","authority":"istio-temp-gateway.service.corporate.domain","path":"/","protocol":"HTTP/1.1","upstream_service_time":"-","upstream_local_address":"-","duration":"0","upstream_transport_failure_reason":"-","route_name":"-","downstream_local_address":"10.xxx.xxx.xxx:80","user_agent":"curl/7.54.0","response_code":"404","response_flags":"NR","start_time":"2020-12-18T18:30:03.304Z","method":"GET","request_id":"a3a1e12f-eb22-4045-955e-e66e776f691c","upstream_host":"-","x_forwarded_for":"10.xxx.xxx.xxx","requested_server_name":"-","bytes_received":"0","istio_policy_status":"-","bytes_sent":"0","upstream_cluster":"-"}

I am not sure whether the TLD is being translated by the rewrite filter; the gateway might not be able to find an upstream cluster for the gateway endpoint istio-temp-gateway.service.corporate.domain, as this is only a dummy endpoint in the ServiceEntry for routing to the remote cluster.
I am trying to generate trace logs on the remote gateways, but interestingly I don't see any logs related to the 404 after a few requests.

ayj changed the title ~~ingress gateway panics when receiving traffic from~~ ingress gateway restarts with tcp_cluster_rewrite Feb 29, 2020

ayj mentioned this issue Mar 2, 2020

503 with demo services in multicluster (replicated control-planes) #21702

Closed

ayj added this to the 1.5 milestone Mar 2, 2020

ayj mentioned this issue Mar 2, 2020

convert deprecated EnvoyFilter to new syntax #21733

Merged

ayj mentioned this issue Mar 3, 2020

Multicluster with Replicated Control Planes not working #21784

Closed

istio-policy-bot added the lifecycle/needs-triage label Mar 4, 2020

mandarjog added area/environments/multicluster area/extensions and telemetry labels Mar 4, 2020

costinm assigned rshriram Mar 4, 2020

istio-policy-bot removed the lifecycle/needs-triage label Mar 5, 2020

istio-policy-bot added the lifecycle/needs-escalation label Mar 13, 2020

istio-policy-bot removed the lifecycle/needs-escalation label Mar 17, 2020

yxue mentioned this issue Mar 17, 2020

increase life span for data in tcp cluster rewrite istio/proxy#2774

Merged

ayj assigned yxue Mar 18, 2020

istio-testing closed this as completed in istio/proxy#2774 Mar 19, 2020

fpesce mentioned this issue Mar 20, 2020

*update* proxy SHA to include bugfixes. #22351

Merged

irisdingbj added the feature/Multi-cluster issues related with multi-cluster support label Mar 31, 2020
