Rare but persistent blocked or missing stream messages on established connections #3980
Comments
Thanks for reporting this issue. This does sound quite tricky to debug. My first thought is that there's a connection problem between these two hosts.
One thing to try is keepalive (doc, example). This should help rule out those connection problems. For further debugging, I would suggest turning on logs for both gRPC and golang http2.
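For reference, enabling client-side keepalive in gRPC-Go looks roughly like the sketch below. The dial target and the interval values are illustrative placeholders, not recommendations from this thread.

```go
package example

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive opens a ClientConn that pings the server when the
// connection is quiet, so a dead TCP connection is detected and replaced
// instead of silently blocking streams.
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
	kacp := keepalive.ClientParameters{
		Time:                30 * time.Second, // ping after 30s with no activity (illustrative)
		Timeout:             10 * time.Second, // close the connection if the ping ack takes longer than 10s (illustrative)
		PermitWithoutStream: true,             // also ping when no streams are active
	}
	return grpc.Dial(target,
		grpc.WithInsecure(), // placeholder; use real transport credentials in production
		grpc.WithKeepaliveParams(kacp),
	)
}
```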
@menghanl There's no proxy; all connections are direct. I don't think this is a connection-wide issue, in part because other streams on the same connection at the same time work fine. Also, this happens mid-stream, so the client and server successfully exchange several messages (within a few ms) before one suddenly blocks. I'll see what I can do with logging; handling the volume is one thing, and finding the problem case in all those logs is another. Is there a way to map a stream in my own code to an ID used by the logs?
By "the same connection" do you mean the same grpc ClientConn? Or the same TCP connection?
This doesn't rule out the possibility that this is a connection problem, though. If the TCP connection for some reason starts to drop packets (possibly including the FIN), the receiving side will never see the connection problem until the TCP timeout (which can be 30 minutes). The purpose of keepalive is to detect this failure and recover.
Unfortunately, no. The Go http2 logs should include the HTTP/2 stream ID, but gRPC doesn't assign IDs to each stream.
I mean a single ClientConn. I didn't realize it would use multiple TCP connections under the hood. I'll enable keepalives and report back.
@menghanl I haven't seen a recurrence of this problem since enabling keepalive, so that seems to have done the trick. Thanks for the tip. This is an interesting pitfall, since I never saw any errors indicating an underlying connection problem (probably because the context always expired first), and I wouldn't have suspected this solution on my own. I wonder if the defaults should include a more aggressive keepalive, or if the docs might benefit from an explanation like what you wrote above. I don't see anywhere that they explain why keepalive is a good idea, even on a high-traffic, high-reliability network.
Great to know that it works! We didn't want to change the default keepalive configuration because it does have a side effect (it sends pings). Good point that the purpose of keepalive isn't explained well. It's briefly mentioned in the gRFC, but I don't think most people will look at that. I will update the godoc and keepalive.md to explain a bit more.
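On that "it sends pings" side effect: client pings that arrive more often than the server's enforcement policy allows cause the server to close the connection with a GOAWAY, so an aggressive client keepalive needs a matching server setting. A rough sketch of the server-side knobs, with illustrative values:

```go
package example

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServerWithKeepalive builds a server that tolerates frequent client pings
// and also sends its own pings on idle connections.
func newServerWithKeepalive() *grpc.Server {
	kaep := keepalive.EnforcementPolicy{
		MinTime:             20 * time.Second, // allow client pings this frequent (illustrative)
		PermitWithoutStream: true,             // allow pings even when no streams are open
	}
	kasp := keepalive.ServerParameters{
		Time:    60 * time.Second, // server pings an idle connection after 60s (illustrative)
		Timeout: 10 * time.Second, // and closes it if the ack takes longer than 10s (illustrative)
	}
	return grpc.NewServer(
		grpc.KeepaliveEnforcementPolicy(kaep),
		grpc.KeepaliveParams(kasp),
	)
}
```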
Sent #3993 to update the doc.
What version of gRPC are you using?
1.32.0
What version of Go are you using (go version)?
1.15.2
What operating system (Linux, Windows, …) and version?
linux/ARM
What did you do?
My servers establish long-lived connections to each other, and re-use them for concurrent streaming calls. A typical stream involves several small messages in each direction, and usually only lasts for a fraction of a second. In aggregate we're moving ~10K messages/second between ~100 hosts.
All messages are passed using the generated Send(...) and Recv() methods, finished with a CloseSend() on the client side. Generally this is near-100% reliable, except as described below. The grpc connections themselves rarely report a bad state.
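To make the call shape concrete, here is a sketch of the pattern being described; the Exchange stream, the Req/Resp types, and the open callback are hypothetical stand-ins for the generated client code.

```go
package example

import (
	"context"
	"io"
)

// Req and Resp stand in for the real generated message types.
type Req struct{ Payload []byte }
type Resp struct{ Payload []byte }

// ExchangeStream mirrors the surface of a generated bidirectional stream client.
type ExchangeStream interface {
	Send(*Req) error
	Recv() (*Resp, error)
	CloseSend() error
}

// runStream sends a handful of small messages, reads a reply for each,
// closes the send side, and drains the stream until io.EOF.
func runStream(ctx context.Context, open func(context.Context) (ExchangeStream, error), reqs []*Req) error {
	stream, err := open(ctx)
	if err != nil {
		return err
	}
	for _, r := range reqs {
		if err := stream.Send(r); err != nil {
			return err
		}
		if _, err := stream.Recv(); err != nil {
			return err
		}
	}
	if err := stream.CloseSend(); err != nil {
		return err
	}
	for {
		if _, err := stream.Recv(); err != nil {
			if err == io.EOF {
				return nil
			}
			return err
		}
	}
}
```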
What did you expect to see?
After Send() returns with no error, Recv() on the other side returns with either a valid message, or an error.
What did you see instead?
After Send() is called, Recv() sometimes blocks until context expiration, which is about a minute. The pattern of occurrence is fairly specific. Normally it doesn't happen at all and message-passing latency is low. However, seemingly at random, a condition develops between a specific pair of hosts where this blocking begins to occur about 0.1% of the time. (0.1% doesn't seem like much, but the long timeouts cascade into larger systemic unreliability.) The problem is always stream-local: other streams on the same connection, concurrent with the broken one, work fine.
Once it starts, this will continue to happen for that pair of hosts until process restart (the only time, currently, when all grpc connections are rebuilt). This happens a few times per day, often after the cluster has been communicating trouble-free for several hours.
If the problem weren't sticky, I would write it off as a glitch. If it happened consistently across all connections, I would blame my own code or usage pattern, or at least have a hope of debugging it. As it stands, it appears that the problem lies inside grpc, but I can't reproduce it locally, so I don't know how to begin debugging.
I'm attempting to work around the problem by detecting anomalous blocking times - hard to do reliably in a system with plenty of legitimate reasons to block - but if there is any debugging information I can extract when that happens, I can try to do that.
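One way to surface those anomalous blocks is a small watchdog around Recv(); the sketch below only illustrates that idea (the helper name and threshold are made up), and is not code from the reporter.

```go
package example

import (
	"log"
	"time"
)

// recvWithWatchdog runs a blocking receive and logs if it takes longer than
// warnAfter. It does not interrupt the call; the per-stream context deadline
// is still what actually bounds the block.
func recvWithWatchdog(recv func() error, warnAfter time.Duration) error {
	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-done:
		case <-time.After(warnAfter):
			log.Printf("Recv blocked for more than %v; possible stuck stream", warnAfter)
		}
	}()
	return recv()
}
```

It could be invoked as `recvWithWatchdog(func() error { _, e := stream.Recv(); return e }, 5*time.Second)`, and the log branch is a natural place to dump whatever per-stream debugging state is available.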