
Disconnect during CDS push leads to stuck cluster, likely due to XDS proxy #31943

Closed
howardjohn opened this issue Apr 6, 2021 · 17 comments

Labels: area/networking, lifecycle/automatically-closed, lifecycle/stale

@howardjohn
Member

https://prow.istio.io/view/gs/istio-prow/logs/integ-k8s-118_istio_postsubmit/1379192222347431936

Config dump shows:
Applied "2021-04-05T22:13:44Z/51 applied at 2021-04-05T22:13:52.661Z
Warming 2021-04-05T22:13:45Z/52 applied at 2021-04-05T22:13:59.617Z

Pilot logs:

2021-04-05T22:13:44.385322Z     info    ads     XDS: Pushing:2021-04-05T22:13:44Z/51 Services:17 ConnectedEndpoints:9  Version:2021-04-05T22:13:44Z/51
2021-04-05T22:13:44.599942Z     info    ads     LDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:91 size:467.6kB
2021-04-05T22:13:44.840148Z     info    ads     RDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:54 size:59.2kB
2021-04-05T22:13:44.858975Z     info    ads     NDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:1 size:1.0kB
2021-04-05T22:13:44.884959Z     info    ads     CDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:132 size:116.5kB
2021-04-05T22:13:44.885783Z     info    ads     EDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:89 size:19.1kB empty:0 cached:89/89
2021-04-05T22:13:45.070653Z     info    ads     LDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:91 size:467.6kB
2021-04-05T22:13:45.080929Z     info    ads     RDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:54 size:59.6kB
2021-04-05T22:13:45.081156Z     info    ads     NDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:1 size:1.0kB
2021-04-05T22:13:45.244707Z     info    ads     Push debounce stable[74] 1: 101.800309ms since last change, 101.800171ms since last push, full=true
2021-04-05T22:13:45.326469Z     info    ads     XDS: Pushing:2021-04-05T22:13:45Z/52 Services:17 ConnectedEndpoints:9  Version:2021-04-05T22:13:45Z/52
2021-04-05T22:13:45.609298Z     info    ads     CDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:132 size:116.7kB
2021-04-05T22:13:45.672196Z     info    ads     EDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:89 size:19.1kB empty:0 cached:89/89
2021-04-05T22:13:45.756663Z     info    ads     RDS: PUSH for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:54 size:59.6kB
2021-04-05T22:13:50.040073Z     info    ads     Push debounce stable[75] 1: 169.413506ms since last change, 169.413105ms since last push, full=false
2021-04-05T22:13:50.040225Z     info    ads     XDS: Incremental Pushing:2021-04-05T22:13:45Z/52 ConnectedEndpoints:8 Version:2021-04-05T22:13:45Z/52
2021-04-05T22:13:52.536339Z     info    ads     ADS: "10.244.0.43:37428" sidecar~10.244.0.43~a-v1-7c7c58df4d-xr9fx.echo-1-9616~echo-1-9616.svc.cluster.local-4 terminated rpc error: code = Canceled desc = context canceled
2021-04-05T22:13:53.914514Z     info    wle     updated auto-registered WorkloadEntry echo-1-9616/vm-10.244.0.50
2021-04-05T22:13:54.065537Z     info    ads     Push debounce stable[76] 1: 105.249877ms since last change, 105.249661ms since last push, full=false
2021-04-05T22:13:54.065656Z     info    ads     XDS: Incremental Pushing:2021-04-05T22:13:45Z/52 ConnectedEndpoints:8 Version:2021-04-05T22:13:45Z/52
2021-04-05T22:13:54.351240Z     info    ads     Push debounce stable[77] 1: 111.339432ms since last change, 111.339215ms since last push, full=false
2021-04-05T22:13:54.351372Z     info    ads     XDS: Incremental Pushing:2021-04-05T22:13:45Z/52 ConnectedEndpoints:8 Version:2021-04-05T22:13:45Z/52
2021-04-05T22:13:55.345979Z     info    ads     ADS: new connection for node:sidecar~10.244.0.43~a-v1-7c7c58df4d-xr9fx.echo-1-9616~echo-1-9616.svc.cluster.local-15
2021-04-05T22:13:55.360685Z     info    ads     EDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:90 size:19.4kB empty:0 cached:77/90
2021-04-05T22:13:55.796505Z     info    ads     LDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:91 size:467.6kB
2021-04-05T22:13:55.849601Z     info    ads     NDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:1 size:1.0kB
2021-04-05T22:13:55.862651Z     info    ads     RDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:54 size:59.6kB
2021-04-05T22:13:56.471660Z     info    ads     CDS: PUSH request for node:a-v1-7c7c58df4d-xr9fx.echo-1-9616 resources:132 size:116.7kB

Proxy logs:

2021-04-05T22:13:52.489682Z     error   xdsproxy        downstream send error: context deadline exceeded
2021-04-05T22:13:52.490308Z     warn    xdsproxy        downstream terminated with unexpected error context deadline exceeded
2021-04-05T22:13:52.700258Z     error   xdsproxy        downstream send error: rpc error: code = Canceled desc = context canceled
2021-04-05T22:13:54.477238Z     warning envoy config    StreamAggregatedResources gRPC config stream closed: 2, context deadline exceeded
2021-04-05T22:13:54.666386Z     info    xdsproxy        connected to upstream XDS server: istiod.istio-system.svc:15012

So here is what happens:

  • We push v51 - all good.
  • We start pushing v52 - the xds proxy receives it, but fails to forward it to Envoy. Istiod thinks it has been sent, but Envoy never receives it.
  • Envoy reconnects, and requests EDS then CDS. I am not sure why the order is backwards?
  • At this point I think we send Envoy CDS v52 and it receives it. It's stuck in warming, not sure why? Possibly it sent an EDS request which was treated as an ACK and ignored? (See the sketch below.)
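
To make the "treated as an ACK" hypothesis concrete, the sketch below shows the kind of ACK-detection logic an xDS server typically uses; it is a simplification for illustration, with a hypothetical shouldRespond signature, not istiod's actual code.

```go
package sketch

import (
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// shouldRespond sketches how an xDS server typically decides whether an
// incoming request is just an ACK (no push needed) or a real request that
// needs a response. If a request that actually needs a response happens to
// match the last-sent nonce and version, it is silently skipped, which would
// leave Envoy's new clusters stuck in warming.
func shouldRespond(lastSentNonce, lastSentVersion string, req *discovery.DiscoveryRequest) bool {
	// First request on a stream, or an explicit NACK: always respond.
	if req.ResponseNonce == "" || req.ErrorDetail != nil {
		return true
	}
	// Nonce and version match what we last sent: treat it as an ACK and skip.
	if req.ResponseNonce == lastSentNonce && req.VersionInfo == lastSentVersion {
		return false
	}
	return true
}
```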

cc @ramaraochavali

@ramaraochavali
Contributor

Should we have a timeout for downstream at all? If Envoy is processing large chunks on the main thread, it is known to take time. Also, it looks like in the Istiod -> Envoy direction we do not enforce a timeout on the stream send: https://github.com/istio/istio/blob/master/pilot/pkg/xds/ads.go#L901
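
For reference, the kind of send timeout under discussion looks roughly like the following sketch; sendWithTimeout, its closure argument, and the timeout value are illustrative placeholders, not the actual xds proxy code.

```go
package sketch

import (
	"fmt"
	"time"
)

// sendWithTimeout bounds a blocking gRPC stream send. sendFn would be a
// closure such as func() error { return stream.Send(resp) }. If Envoy is busy
// on its main thread and does not read in time, this reports a timeout even
// though the stream is healthy. Note that the underlying Send keeps blocking
// in its goroutine until the stream itself is torn down.
func sendWithTimeout(sendFn func() error, timeout time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- sendFn() }()
	select {
	case err := <-errCh:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("xds send timed out after %v", timeout)
	}
}
```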

@howardjohn
Member Author

I am starting to think we should not have any timeouts at all in xds proxy. Envoy's send/recv have no timeouts.

On the inverse side, we can also time out on Send to Istiod, since Istiod will only Recv one message at a time and does not batch them up on the server side.

WDYT? We need to make sure that even with no timeout we handle an Istiod pod disconnect, though. I think gRPC does that automatically.

@ramaraochavali
Contributor

I am starting to think we should not have any timeouts at all in xds proxy. Envoy's send/recv have no timeouts.

Yeah. Agree. Let us change it like that.

@ramaraochavali
Contributor

@howardjohn Looking at grpc/grpc-go#1229 - we might need timeouts if I am reading it correctly; otherwise SendMsg might be blocked forever in some cases. So maybe we should increase the timeouts instead?

@dfawley Am I reading that thread correctly? Do you have any suggestions?

@howardjohn
Member Author

If I understand correctly, Send will block forever until someone (envoy/istiod) reads, and the only way to unblock is to tear down the whole stream or to have the other end read. But isn't that fine? We want to block ~forever.

The only issue is if the other side dies. If it's a deadlock, in theory we would want to disconnect and reconnect to potentially unlock it, or connect to a new instance (Istiod only). If the other end crashed/disconnected, I am hoping/expecting we have some lower-level mechanism like keepalives that will detect it and cancel the whole stream?
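
For context, the lower-level mechanism in question would be gRPC client keepalives on the agent -> Istiod connection. A minimal sketch follows; the durations, dial options, and function name are illustrative assumptions, not Istio's actual settings.

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialUpstream shows client-side gRPC keepalives: if the peer stops answering
// pings, gRPC closes the connection and all streams on it fail, so a hung
// Send gets unblocked without any per-message timeout.
func dialUpstream() (*grpc.ClientConn, error) {
	return grpc.Dial("istiod.istio-system.svc:15012",
		grpc.WithTransportCredentials(insecure.NewCredentials()), // placeholder; the real connection uses TLS
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping if the connection is idle this long
			Timeout:             10 * time.Second, // drop the connection if the ping is not acked in time
			PermitWithoutStream: true,             // keep pinging even with no active RPCs
		}),
	)
}
```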

@howardjohn
Member Author

Thinking about it another way, Envoy has no timeout on its connection to Istiod AFAIK, so if that's good enough for them it seems good enough for us?

The gRPC library now also has an xDS client, so we can see what they do; I assume they are experts on gRPC best practices, so it's probably a good reference 🙂

@howardjohn
Member Author

BTW, even if we fix the timeout it seems like there is still a bug in Istiod or Envoy (probably Istiod). I think we are dropping an EDS request (at the end; it's a debug log so we cannot see it).

@ramaraochavali
Contributor

gRPC's xDS client https://github.com/grpc/grpc-go/blob/master/xds/internal/client/v3/client.go#L125 does not enforce timeouts, so I pushed a PR to remove them. PTAL.

Regarding "I think we are dropping an EDS request (at the end; it's a debug log so we cannot see it)" - do you have debug logs that I can take a look at?

@hzxuzhonghu
Member

This is definitely a regression. The network is more complicated than we can imagine, so any resilient system should be careful with network timeouts.

Coming back to the issue, I am not sure I understand how this is related to the timeout.

@ramaraochavali
Contributor

See #31943 (comment) for how it is related to the timeout and the initial error.

@hzxuzhonghu
Member

hzxuzhonghu commented Apr 9, 2021

Envoy reconnects, and requests EDS then CDS. I am not sure why the order is backwards?

@howardjohn Where did you see this?

@howardjohn
Member Author

@hzxuzhonghu the logs in the initial post

@dfawley

dfawley commented Apr 9, 2021

I think #31943 (comment) is right about this. The RPC / operations shouldn't need individual timeouts, but there needs to be a way to determine if the connection is broken, like keepalives.

@hzxuzhonghu
Member

Currently for agent <-> istiod, we have keepalive set. But for envoy<->agent, there is none.
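
A minimal sketch of what keepalives on the envoy <-> agent side could look like: the agent's local XDS proxy is the gRPC server that Envoy connects to, so these would be server-side settings. The function name and durations are illustrative assumptions, not the values Istio uses.

```go
package sketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newDownstreamServer shows server-side gRPC keepalives for the agent's local
// XDS proxy server that Envoy connects to: the server pings an idle client and
// closes the connection if the ping is not acknowledged.
func newDownstreamServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second, // ping the client if the connection is idle this long
			Timeout: 10 * time.Second, // close the connection if the ping is not acked in time
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             15 * time.Second, // tolerate fairly frequent client pings
			PermitWithoutStream: true,
		}),
	)
}
```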

@hzxuzhonghu
Member

@howardjohn I see. I guess the reason we see EDS requested before CDS is that there were two upstream connections at that time.

@ramaraochavali
Contributor

But for envoy<->agent, there is none.

I added keepalives for envoy <-> agent and unified the logic for both Istiod and the xds proxy in #32075. PTAL.

istio-policy-bot added the lifecycle/stale label on Jul 11, 2021
@istio-policy-bot

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-04-12. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

istio-policy-bot added the lifecycle/automatically-closed label on Jul 26, 2021