NettyServerHandler closes connection when first go_away is ackd regardless of starting stream #5806
Streams 191, 15, and 17 should all complete normally. I am guessing the failed RPC is not one of those; that failure is due to a race between starting new RPCs in the application and the transport notifying the LoadBalancer that the transport is in TRANSIENT_FAILURE. The GOAWAY frames (and PINGs) are correct. The first GOAWAY carries the max stream id to ensure that any in-flight RPCs make it in. The ping-pong ensures that the client has actually seen the first GOAWAY (so it can send RPCs elsewhere). The final GOAWAY carries the id of the last stream that was started, which means no new RPCs should be started on the transport. The grace period is how long the server waits for the ping ack before more aggressively sending the second GOAWAY. I believe the error you are seeing is due to the LoadBalancer/transport race, where it's possible RPCs get assigned a transport after it has begun its shutdown. The fix for this race is for retries to be enabled, which can look at the error code, determine that the RPC headers have not yet been sent, and reschedule the RPC on a READY transport. I'm not sure there is much else to do here, other than increasing the priority of turning retries on.
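For illustration, here is a minimal sketch of that dual-GOAWAY-plus-PING sequence using Netty's HTTP/2 codec. This is not gRPC's actual implementation: the helper class, method names, and the PING payload value are assumptions for this sketch, and the `writePing` signature shown is the `long`-payload variant from recent Netty 4.1 releases.

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.http2.Http2Connection;
import io.netty.handler.codec.http2.Http2ConnectionEncoder;
import io.netty.handler.codec.http2.Http2Error;

// Hypothetical helper mirroring the shutdown sequence described above.
final class GracefulShutdownSketch {
  // Arbitrary marker payload so the ping ack can be recognized later.
  private static final long SHUTDOWN_PING = 0x5AFECAFEL;

  static void begin(ChannelHandlerContext ctx, Http2ConnectionEncoder encoder) {
    // 1. First GOAWAY with the maximum stream id: the connection is
    //    draining, but streams already in flight are still accepted.
    encoder.writeGoAway(ctx, Integer.MAX_VALUE, Http2Error.NO_ERROR.code(),
        Unpooled.EMPTY_BUFFER, ctx.newPromise());
    // 2. PING: the ack proves the client has seen the first GOAWAY and can
    //    start sending new RPCs elsewhere.
    encoder.writePing(ctx, /* ack= */ false, SHUTDOWN_PING, ctx.newPromise());
    ctx.flush();
  }

  static void onPingAck(ChannelHandlerContext ctx, Http2ConnectionEncoder encoder,
      Http2Connection connection) {
    // 3. Second GOAWAY pinned to the last stream the client actually
    //    started: nothing new may begin, and remaining streams get the
    //    grace period to finish before being forcibly cancelled.
    encoder.writeGoAway(ctx, connection.remote().lastStreamCreated(),
        Http2Error.NO_ERROR.code(), Unpooled.EMPTY_BUFFER, ctx.newPromise());
    ctx.flush();
  }
}
```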
Thanks @carl-mastrangelo for the quick reply and for clarifying the grace-period ack behavior. I managed to reproduce something similar in a blank project under high load, but as you said the logs do not tell the whole story (at least to me). That setup is not using a load balancer (and only a single endpoint), while the original one used a client-side load balancer with a round-robin strategy, but also against a single endpoint during testing; I assume the same race condition can still apply? This is disconcerting for anyone using client-side load balancing, as we rely on connection max_age being set: we cannot say that even under perfect network conditions a call will succeed. Could having a configurable delay before sending the GOAWAY ack to the server (while still marking the connection as unavailable) account for any calls coming in during this window, without needing retries?
If that's the case, this issue is a duplicate of #2562.
Re-examined the logs from the original post. Only the server on port 43582 had the GOAWAY, so the logs would be clearer if you only search for 43582. It appears the client ack'd the first GOAWAY right after stream 191 finished, and the UNAVAILABLE error appears after that. This does look like the race described in #2562.
Yes, it applies to the Channel implementation in general.
Per a conversation with @carl-mastrangelo, the client is obligated to ack immediately after receiving the first GO_AWAY, otherwise it would be a violation of the protocol.
Transparent retry would be activated only when we are sure the client has not sent anything onto the network. It is supposed to be invisible to the user, and we think it is the right approach to fix the race: to the user it should appear as if the race didn't happen.
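For reference, a minimal sketch of opting into retries on the client, assuming the experimental `enableRetry()` builder flag available in gRPC-java around this era (1.20); the target address is a placeholder.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Placeholder target; with retries enabled, an RPC whose headers were
// never sent can be transparently rescheduled on a READY transport.
ManagedChannel channel = ManagedChannelBuilder
    .forAddress("example.com", 443)
    .enableRetry()
    .build();
```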
The server does cancel the RPC when the grace period is exceeded; the application handler should receive the cancellation notification. I am closing this issue in favor of #2562. Please feel free to reopen if I misinterpreted it.
Scenario:
What version of gRPC are you using?
1.20.0
What did you expect to see?
I expect the client to be able to finish the last stream it started before it received the first go_away from the server; it should have grace_period time to do so.
However, there is a window in which a client call can start just as the connection is about to close, and that call is dropped without the grace period being taken into account.
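For context, here is a minimal sketch of the server-side settings this scenario relies on, using the `maxConnectionAge` and `maxConnectionAgeGrace` options on `NettyServerBuilder`. The durations are illustrative, not taken from the original configuration; the port is the one seen in the logs below.

```java
import java.util.concurrent.TimeUnit;

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

// Illustrative values only: after 30s the server starts the GOAWAY
// sequence, and in-flight RPCs then have a 10s grace period to complete
// before they are forcibly cancelled.
Server server = NettyServerBuilder.forPort(43582)
    .maxConnectionAge(30, TimeUnit.SECONDS)
    .maxConnectionAgeGrace(10, TimeUnit.SECONDS)
    .build();
```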
Attachments:
Client log:
Server logs (with trace ids):
NettyServerHandler related part: