
Disconnect during CDS push leads to stuck cluster, likely due to XDS proxy #31943

Closed
howardjohn opened this issue Apr 6, 2021 · 17 comments

Labels: area/networking, lifecycle/automatically-closed, lifecycle/stale

@howardjohn
Member

https://prow.istio.io/view/gs/istio-prow/logs/integ-k8s-118_istio_postsubmit/1379192222347431936

Config dump shows:
Applied "2021-04-05T22:13:44Z/51 applied at 2021-04-05T22:13:52.661Z
Warming 2021-04-05T22:13:45Z/52 applied at 2021-04-05T22:13:59.617Z

Pilot logs:

2021-04-05T22:13:44.385322Z     info    ads     XDS: Pushing:2021-04-05T22:13:44Z/51 Services:17 ConnectedEndpoints:9  Version:2021-04-05T22:13:44Z/51
2021-04-05T22:13:44.599942Z     info    ads     LDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:91 size:467.6kB
2021-04-05T22:13:44.840148Z     info    ads     RDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:54 size:59.2kB
2021-04-05T22:13:44.858975Z     info    ads     NDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:1 size:1.0kB
2021-04-05T22:13:44.884959Z     info    ads     CDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:132 size:116.5kB
2021-04-05T22:13:44.885783Z     info    ads     EDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:89 size:19.1kB empty:0 cached:89/89
2021-04-05T22:13:45.070653Z     info    ads     LDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:91 size:467.6kB
2021-04-05T22:13:45.080929Z     info    ads     RDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:54 size:59.6kB
2021-04-05T22:13:45.081156Z     info    ads     NDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:1 size:1.0kB
2021-04-05T22:13:45.244707Z     info    ads     Push debounce stable[74] 1: 101.800309ms since last change, 101.800171ms since last push, full=true
2021-04-05T22:13:45.326469Z     info    ads     XDS: Pushing:2021-04-05T22:13:45Z/52 Services:17 ConnectedEndpoints:9  Version:2021-04-05T22:13:45Z/52
2021-04-05T22:13:45.609298Z     info    ads     CDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:132 size:116.7kB
2021-04-05T22:13:45.672196Z     info    ads     EDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:89 size:19.1kB empty:0 cached:89/89
2021-04-05T22:13:45.756663Z     info    ads     RDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:54 size:59.6kB
2021-04-05T22:13:50.040073Z     info    ads     Push debounce stable[75] 1: 169.413506ms since last change, 169.413105ms since last push, full=false
2021-04-05T22:13:50.040225Z     info    ads     XDS: Incremental Pushing:2021-04-05T22:13:45Z/52 ConnectedEndpoints:8 Version:2021-04-05T22:13:45Z/52
2021-04-05T22:13:52.536339Z     info    ads     ADS: "10.244.0.43:37428" sidecar~10.244.0.43~a-v1-7c7c58df4d-xr9fx.echo-1-9616~echo-1-9616.svc.cluster.local-4 terminated rpc error: code = Canceled desc = context canceled
2021-04-05T22:13:53.914514Z     info    wle     updated auto-registered WorkloadEntry echo-1-9616/vm-10.244.0.50
2021-04-05T22:13:54.065537Z     info    ads     Push debounce stable[76] 1: 105.249877ms since last change, 105.249661ms since last push, full=false
2021-04-05T22:13:54.065656Z     info    ads     XDS: Incremental Pushing:2021-04-05T22:13:45Z/52 ConnectedEndpoints:8 Version:2021-04-05T22:13:45Z/52
2021-04-05T22:13:54.351240Z     info    ads     Push debounce stable[77] 1: 111.339432ms since last change, 111.339215ms since last push, full=false
2021-04-05T22:13:54.351372Z     info    ads     XDS: Incremental Pushing:2021-04-05T22:13:45Z/52 ConnectedEndpoints:8 Version:2021-04-05T22:13:45Z/52
2021-04-05T22:13:55.345979Z     info    ads     ADS: new connection for node:sidecar~10.244.0.43~a-v1-7c7c58df4d-xr9fx.echo-1-9616~echo-1-9616.svc.cluster.local-15
2021-04-05T22:13:55.360685Z     info    ads     EDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:90 size:19.4kB empty:0 cached:77/90
2021-04-05T22:13:55.796505Z     info    ads     LDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:91 size:467.6kB
2021-04-05T22:13:55.849601Z     info    ads     NDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:1 size:1.0kB
2021-04-05T22:13:55.862651Z     info    ads     RDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:54 size:59.6kB
2021-04-05T22:13:56.471660Z     info    ads     CDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:132 size:116.7kB

Proxy logs:

2021-04-05T22:13:52.489682Z     error   xdsproxy        downstream send error: context deadline exceeded
2021-04-05T22:13:52.490308Z     warn    xdsproxy        downstream terminated with unexpected error context deadline exceeded
2021-04-05T22:13:52.700258Z     error   xdsproxy        downstream send error: rpc error: code = Canceled desc = context canceled
2021-04-05T22:13:54.477238Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2021-04-05T22:13:54.666386Z     info    xdsproxy        connected to upstream XDS server: istiod.istio-system.svc:15012

So here is what happens:

  • We push v51 - all good.
  • We start pushing v52 - the xds proxy receives it, but fails to forward it to Envoy. Istiod thinks it has been sent, but Envoy never receives it.
  • Envoy reconnects, and requests EDS then CDS. I am not sure why the order is backwards?
  • At this point I think we send Envoy CDS v52 and it receives it. It's stuck in warming, not sure why? Possibly it sent an EDS request which was treated as an ACK and ignored? (See the sketch below.)
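
To make the "treated as an ACK" hypothesis concrete, the sketch below shows the kind of ACK-detection logic an xDS server typically uses; it is a simplification for illustration, with a hypothetical shouldRespond signature, not istiod's actual code.

```go
package sketch

import (
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// shouldRespond sketches how an xDS server typically decides whether an
// incoming request is just an ACK (no push needed) or a real request that
// needs a response. If a request that actually needs a response happens to
// match the last-sent nonce and version, it is silently skipped, which would
// leave Envoy's new clusters stuck in warming.
func shouldRespond(lastSentNonce, lastSentVersion string, req *discovery.DiscoveryRequest) bool {
	// First request on a stream, or an explicit NACK: always respond.
	if req.ResponseNonce == "" || req.ErrorDetail != nil {
		return true
	}
	// Nonce and version match what we last sent: treat it as an ACK and skip.
	if req.ResponseNonce == lastSentNonce && req.VersionInfo == lastSentVersion {
		return false
	}
	return true
}
```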

cc @ramaraochavali

@ramaraochavali
Contributor

Should we have a timeout for downstream at all? If Envoy is processing large chunks on the main thread, it is known to take time. Also, it looks like in the Istiod -> Envoy direction we do not enforce a timeout on the stream send: https://github.com/istio/istio/blob/master/pilot/pkg/xds/ads.go#L901
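
For reference, the kind of send timeout under discussion looks roughly like the following sketch; sendWithTimeout, its closure argument, and the timeout value are illustrative placeholders, not the actual xds proxy code.

```go
package sketch

import (
	"fmt"
	"time"
)

// sendWithTimeout bounds a blocking gRPC stream send. sendFn would be a
// closure such as func() error { return stream.Send(resp) }. If Envoy is busy
// on its main thread and does not read in time, this reports a timeout even
// though the stream is healthy. Note that the underlying Send keeps blocking
// in its goroutine until the stream itself is torn down.
func sendWithTimeout(sendFn func() error, timeout time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- sendFn() }()
	select {
	case err := <-errCh:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("xds send timed out after %v", timeout)
	}
}
```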

@howardjohn
Member Author

I am starting to think we should not have any timeouts at all in xds proxy. Envoy's send/recv have no timeouts.

On the inverse side, we can also time out on Send to Istiod, since Istiod will only Recv one message at a time and does not batch them up on the server side.

WDYT? We need to make sure that even with no timeout we handle an Istiod pod disconnect, though. I think gRPC does that automatically.

@ramaraochavali
Contributor

I am starting to think we should not have any timeouts at all in xds proxy. Envoy's send/recv have no timeouts.

Yeah. Agree. Let us change it like that.

@ramaraochavali
Contributor

@howardjohn Looking at grpc/grpc-go#1229 - we might need timeouts if I am reading it correctly; otherwise SendMsg might be blocked forever in some cases. So maybe we should increase the timeouts instead?

@dfawley Am I reading that thread correctly? Do you have any suggestions?

@howardjohn
Member Author

If I understand correctly, Send will block forever until someone (envoy/istiod) reads, and the only way to unblock is to tear down the whole stream or to have the other end read. But isn't that fine? We want to block ~forever.

The only issue is if the other side dies. If it's a deadlock, in theory we would want to disconnect and reconnect to potentially unlock it, or connect to a new instance (Istiod only). If the other end crashed/disconnected, I am hoping/expecting we have some lower-level mechanism like keepalives that will detect it and cancel the whole stream?
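
For context, the lower-level mechanism in question would be gRPC client keepalives on the agent -> Istiod connection. A minimal sketch follows; the durations, dial options, and function name are illustrative assumptions, not Istio's actual settings.

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialUpstream shows client-side gRPC keepalives: if the peer stops answering
// pings, gRPC closes the connection and all streams on it fail, so a hung
// Send gets unblocked without any per-message timeout.
func dialUpstream() (*grpc.ClientConn, error) {
	return grpc.Dial("istiod.istio-system.svc:15012",
		grpc.WithTransportCredentials(insecure.NewCredentials()), // placeholder; the real connection uses TLS
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping if the connection is idle this long
			Timeout:             10 * time.Second, // drop the connection if the ping is not acked in time
			PermitWithoutStream: true,             // keep pinging even with no active RPCs
		}),
	)
}
```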

@howardjohn
Member Author

Thinking about it another way, Envoy has no timeout on its connection to Istiod AFAIK, so if that's good enough for them it seems good enough for us?

The gRPC library now also has an xDS client, so we can see what they do; I assume they are experts on gRPC best practices, so it's probably a good reference 🙂

@howardjohn
Member Author

BTW, even if we fix the timeout it seems like there is still a bug in Istiod or Envoy (probably Istiod). I think we are dropping an EDS request (at the end; it's a debug log so we cannot see it).

@ramaraochavali
Contributor

gRPC's xDS client https://github.com/grpc/grpc-go/blob/master/xds/internal/client/v3/client.go#L125 does not enforce timeouts, so I pushed a PR to remove them. PTAL.

Regarding "I think we are dropping an EDS request (at the end; it's a debug log so we cannot see it)" - do you have debug logs that I can take a look at?

@hzxuzhonghu
Member

This is definitely a regression. The network is more complicated than we can imagine, so any resilient system should be careful with network timeouts.

Coming back to the issue, I am not sure I understand how this is related to the timeout.

@ramaraochavali
Contributor

See #31943 (comment) for how it is related to the timeout and the initial error.

@hzxuzhonghu
Member

hzxuzhonghu commented Apr 9, 2021

Envoy reconnects, and requests EDS then CDS. I am not sure why the order is backwards?

@howardjohn Where did you see this?

@howardjohn
Member Author

@hzxuzhonghu the logs in the initial post

@dfawley

dfawley commented Apr 9, 2021

I think #31943 (comment) is right about this. The RPC / operations shouldn't need individual timeouts, but there needs to be a way to determine if the connection is broken, like keepalives.

@hzxuzhonghu
Member

Currently for agent <-> istiod, we have keepalive set. But for envoy<->agent, there is none.
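
A minimal sketch of what keepalives on the envoy <-> agent side could look like: the agent's local XDS proxy is the gRPC server that Envoy connects to, so these would be server-side settings. The function name and durations are illustrative assumptions, not the values Istio uses.

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newDownstreamServer shows server-side gRPC keepalives for the agent's local
// XDS proxy server that Envoy connects to: the server pings an idle client and
// closes the connection if the ping is not acknowledged.
func newDownstreamServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second, // ping the client if the connection is idle this long
			Timeout: 10 * time.Second, // close the connection if the ping is not acked in time
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             15 * time.Second, // tolerate fairly frequent client pings
			PermitWithoutStream: true,
		}),
	)
}
```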

@hzxuzhonghu
Member

@howardjohn I see. I guess the reason we see EDS requested before CDS is that there were two upstream connections at that time.

@ramaraochavali
Contributor

But for envoy<->agent, there is none.

I added keepalives for envoy <-> agent and unified the logic for both Istiod and the xds proxy in #32075. PTAL.

istio-policy-bot added the lifecycle/stale label on Jul 11, 2021
@istio-policy-bot

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-04-12. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

istio-policy-bot added the lifecycle/automatically-closed label on Jul 26, 2021