Disconnect during CDS push leads to stuck cluster, likely due to XDS proxy #31943
Comments
Should we have a timeout for downstream? If Envoy is processing large chunks on the main thread, it is known to take time. Also, it looks like we do not enforce a timeout on stream send from Istiod -> Envoy: https://github.com/istio/istio/blob/master/pilot/pkg/xds/ads.go#L901
I am starting to think we should not have any timeouts at all in the xds proxy. Envoy's send/recv have no timeouts. On the inverse side, we can also time out on Send to Istiod, since Istiod will only Recv one message at a time and not batch them up on the server side. WDYT? We do need to make sure that even with no timeout we handle Istiod pod disconnect, though. I think gRPC does it automatically.
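For illustration, here is a minimal Go sketch of the per-Send timeout pattern under discussion (the helper and its names are hypothetical, not the xds proxy code). Since grpc-go's stream Send has no deadline of its own, a wrapper like this only abandons the call; the goroutine stays blocked until the stream is torn down, which is part of the argument for dropping such timeouts entirely.

```go
package example

import (
	"context"
	"fmt"
	"time"
)

// sendWithTimeout races a blocking stream Send against a timer. Note that on
// timeout the underlying Send is still in flight; only cancelling the stream
// context actually unblocks it.
func sendWithTimeout(ctx context.Context, send func() error, d time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- send() }()
	select {
	case err := <-errCh:
		return err
	case <-time.After(d):
		return fmt.Errorf("send timed out after %v", d)
	case <-ctx.Done():
		return ctx.Err()
	}
}
```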
Yeah. Agree. Let us change it like that.
@howardjohn Looking at grpc/grpc-go#1229 - we might need timeouts if I am reading it correctly, otherwise the Send can block indefinitely. @dfawley Am I reading that thread correctly? Do you have any suggestions?
If I understand correctly, Send will block forever until someone (Envoy/Istiod) reads, and the only way to unblock is to tear down the whole stream or to have the other end read. But isn't that fine? We want to block ~forever. The only issue is if the other side dies. If it's a deadlock, in theory we would want to disconnect and reconnect to potentially unblock it, or connect to a new instance (Istiod only). If the other end crashed/disconnected, I am hoping/expecting we have some lower-level mechanism like keepalives that will detect it and cancel the whole stream?
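To make "tear down the whole stream" concrete, here is a small sketch (assumed names, using the go-control-plane ADS client rather than Istio's actual wiring): the only cancellation lever is the context the stream was created with, so unblocking a stuck Send/Recv means cancelling that context and reconnecting.

```go
package example

import (
	"context"

	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	"google.golang.org/grpc"
)

// openStream creates an ADS stream with a cancellable context. Calling the
// returned cancel func later makes any blocked Send/Recv on this stream
// return with an error, after which the caller can reconnect.
func openStream(conn *grpc.ClientConn) (discovery.AggregatedDiscoveryService_StreamAggregatedResourcesClient, context.CancelFunc, error) {
	ctx, cancel := context.WithCancel(context.Background())
	client := discovery.NewAggregatedDiscoveryServiceClient(conn)
	stream, err := client.StreamAggregatedResources(ctx)
	if err != nil {
		cancel()
		return nil, nil, err
	}
	return stream, cancel, nil
}
```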
Thinking about it another way, Envoy has no timeout on the connection to Istiod afaik, so if it's good enough for them it seems good enough for us? The gRPC library now also has an xDS client, so we can see what they do; I assume they are experts on gRPC best practices, so it's probably a good reference 🙂
BTW, even if we fix the timeout it seems like there is still a bug in Istiod or Envoy (probably Istiod). I think we are dropping an EDS request at the end (it's a debug log, so we cannot see it).
gRPC's xDS client https://github.com/grpc/grpc-go/blob/master/xds/internal/client/v3/client.go#L125 does not enforce timeouts, so I pushed a PR to remove them. PTAL. Regarding "I think we are dropping an EDS request at the end (it's a debug log, so we cannot see it)" - do you have debug logs that I can take a look at?
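A minimal sketch of the no-timeout approach being converged on (hypothetical function, not the actual PR): forward requests upstream with a plain, blocking Send and treat any error as a broken stream that the caller tears down and re-establishes, instead of racing each Send against a timer.

```go
package example

import (
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// forwardToIstiod pushes requests from Envoy onto the upstream ADS stream.
// Send blocks until Istiod reads the message; broken connections surface
// here (or via keepalives) as an error, at which point the caller reconnects.
func forwardToIstiod(upstream discovery.AggregatedDiscoveryService_StreamAggregatedResourcesClient, requests <-chan *discovery.DiscoveryRequest) error {
	for req := range requests {
		if err := upstream.Send(req); err != nil {
			return err
		}
	}
	return nil
}
```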
This is definitely a regression. The network is more complicated than we can imagine, so any resilient system should be careful with network timeouts. Coming back to the issue, I am not sure I understand how this is related to the timeout.
See #31943 (comment) for how it is related to the timeout and the initial error.
@howardjohn Where did you see this?
@hzxuzhonghu The logs in the initial post.
I think #31943 (comment) is right about this. The RPC operations shouldn't need individual timeouts, but there needs to be a way to determine if the connection is broken, like keepalives.
Currently, for agent <-> istiod we have keepalives set, but for envoy <-> agent there are none.
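For reference, a rough sketch of what gRPC keepalive options look like on each side; the function names and values below are placeholders, not Istio's actual settings. With no per-operation timeouts, these keepalives are what detect a dead peer even when no discovery traffic is flowing.

```go
package example

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Agent -> Istiod side: client keepalives on the upstream connection.
func istiodDialOptions() []grpc.DialOption {
	return []grpc.DialOption{
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    30 * time.Second, // ping when the connection has been idle this long
			Timeout: 10 * time.Second, // consider it dead if no ack within this window
		}),
	}
}

// Envoy -> agent side: server keepalives on the xds proxy's local listener.
func xdsProxyServerOptions() []grpc.ServerOption {
	return []grpc.ServerOption{
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second,
			Timeout: 10 * time.Second,
		}),
	}
}
```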
@howardjohn IC, I guess the reason we see
I added keepalives for envoy <-> agent and unified the logic for both Istiod and the xds proxy in #32075. PTAL.
🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2021-04-12. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions. Created by the issue and PR lifecycle manager.
https://prow.istio.io/view/gs/istio-prow/logs/integ-k8s-118_istio_postsubmit/1379192222347431936
Config dump shows:
Applied "2021-04-05T22:13:44Z/51 applied at 2021-04-05T22:13:52.661Z
Warming 2021-04-05T22:13:45Z/52 applied at 2021-04-05T22:13:59.617Z
Pilot logs:
Proxy logs:
So here is what happens:
cc @ramaraochavali