BDP PINGs are sent much more frequently than necessary #8260
Does the server have any customization? I wasn't able to reproduce the behavior by modifying our manual flow control example.
@voidzcy It is painful to get grpc-java into a compilable state on my local machine, so I composed a grpc-only project where this behavior is trivially reproduced as well. It is a property of the Netty-based grpc-java client; the server only does what is mandated by the HTTP/2 spec: it acks received PINGs. With the above example, NettyServerHandler.onPingRead is called with a frequency comparable to inbound requests, but the connection is not torn down because of the overly aggressive buffer flushing of the grpc-java library (that's why Http2ControlFrameLimitEncoder does not kick in). I think the algorithm needs to be adjusted for high-RPS, low-latency scenarios.
Next time you may want to take a look at COMPILING. There are ways to disable the annoying parts, and you should be able to build most of the project locally with just Java installed. That said, I see no problem with making a repro project like you did.
The algorithm has one PING "in flight" any time it is receiving DATA frames. So yes, the PING rate will increase if you send more data or decrease the RTT. Note that WINDOW_UPDATE frames also increase in frequency as you decrease the RTT.
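The rule described above can be sketched with a minimal, hypothetical simulation (this is not the grpc-java implementation; class and method names are made up for illustration). A new PING is only sent once the previous one has been acked, so on a busy connection the PING rate tracks 1/RTT rather than the DATA frame rate:

```java
// Hypothetical model of the "one BDP PING in flight" rule.
public class BdpPingSimulation {
    boolean pingOutstanding = false;
    int pingsSent = 0;

    // Called for every inbound DATA frame.
    void onDataRead() {
        if (!pingOutstanding) {
            pingOutstanding = true;  // send a BDP PING
            pingsSent++;
        }
    }

    // Called when the PING ack returns, roughly one RTT later.
    void onPingAckRead() {
        pingOutstanding = false;
    }

    public static void main(String[] args) {
        BdpPingSimulation sim = new BdpPingSimulation();
        // Model one second of traffic: 80,000 DATA frames with a 1 ms RTT,
        // i.e. the PING ack arrives after ~80 further DATA frames.
        int framesPerRtt = 80;
        for (int i = 0; i < 80_000; i++) {
            sim.onDataRead();
            if (i % framesPerRtt == framesPerRtt - 1) {
                sim.onPingAckRead();
            }
        }
        // One PING per RTT, not one per DATA frame.
        System.out.println("PINGs sent: " + sim.pingsSent);  // prints "PINGs sent: 1000"
    }
}
```

With 80k frames per second this yields ~1000 PINGs per second, matching the rates reported later in this thread.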
It isn't reproducing on my machine after running for 4 minutes. In that project do you actually see the "Maximum number 10000 of outstanding control frames reached" error, or just "lots of pings"?
Given there is only one outstanding BDP PING at a time, I would expect the bug is on the server, especially if it uses any part of the older HTTP/2 API in Netty. In the old API there appear to be multiple ways of doing the same thing, but generally only one has the correct behavior. But obviously if we reproduce it we can dig in further to find the true cause.
That is sad to hear @ejona86. I suggest you put a breakpoint here in the above grpc-only reproducer client and see literally thousands of outbound PING non-ack frames per second, for ~80k RPS. I could probably provide a reproducer in a form that counts inbound PINGs on the server side, but it looks difficult (impossible?) to get insights at the "wire" (frame) level with the current APIs.

It seems the current behavior, caused by resetting the PING counter on each request, was introduced by this PR, so may I ask what the thinking and rationale behind it were? From an "external observer" point of view, my first guess is that this default behavior of the grpc-java client is tailored for mobile clients doing one request per minute over a long wire (high RTT). If so, I would update the javadoc accordingly so people don't use it for server-to-server communication, and call it a day.
The grpc-java server aggressively flushes the outbound buffer on outbound PINGs, basically hiding the problem under the rug at the cost of lower performance for end users. But for servers willing to utilize their IO efficiently, there is the above error. It indicates that at high RPS the grpc-java client is DDOSing my server with PINGs, and I would prefer to avoid that.
I described the problem I see on a custom server, and if there is still any confusion, I am happy to provide further explanations @ejona86. I am also happy to update the reproducer with metrics on server inbound control frames, as soon as grpc-java offers any suitable APIs for that.
Yeah, I fully expect the PINGs are frequent. But Http2ControlFrameLimitEncoder doesn't care about the frequency; it only cares about how many outbound control frames are buffered. I modified Http2ControlFrameLimitEncoder in gRPC to print out the maximum value seen by outstandingControlFrames, and I didn't see it rise above

In no way would I disagree that "we are sending a lot of pings," and I would be interested in reducing the number of pings at some point, especially if we could demonstrate the CPU/network/performance waste so we can communicate its importance. But that seems a separate issue from the failure you're seeing. This is your issue, so if you want to ignore the Http2ControlFrameLimitEncoder exception and instead focus on the CPU/network waste of PING frames we can do that, but the two are separate discussions.
The https://grpc.io/blog/grpc-go-perf-improvements/#bdp-estimation-and-dynamic-flow-control-window post is probably the best writeup of the BDP stuff.
Actually, we only have the BDP ping in Netty, and we discourage using Netty on Android; OkHttp is for Android users. There's a feature request to add it to grpc-okhttp: #8226. The problem is greatest in data centers and on high-capacity cross-continental links; we did it for beefy servers, not edge clients.
What flushing are you talking about? I don't see any flush for the BDP ping; we rely on the normal flushing. Might you be talking about this flush?

grpc-java/netty/src/main/java/io/grpc/netty/NettyServerHandler.java, lines 894 to 897 in 814655c

That flush is for keepalive and must be present because keepalive is only sent during a period of inactivity. That is a separate code path not used by the BDP pings.
So given that grpc-java lacks frame listener APIs, I ended up with this creative one; it was committed in the provided repro. It eloquently demonstrates the PING flood from the grpc client described above. Just in case, here is also a tcpdump from the same reproducer; it correlates with the ping frame listener data.
1000 PINGs per second corresponds to a sampling rate of 1 ms (millisecond), so again I have a question: why would anyone need that tiny a sampling rate for a server application whose dynamics are usually at least 3 (and most likely 4) orders of magnitude slower (1-10 seconds)?
I encountered the initial issue while doing a custom gRPC server load test: a big Netty outbound buffer plus lots of small messages. Lots of PING acks queued up before the channel flushed (due to being non-writable). The current grpc-java BDP algorithm is not only naive but dangerous, because on the server side I have to select Netty's outbound buffer size, TCP buffer sizes, and the outstanding control frames cap assuming the grpc-java client uses that algorithm (because it is enabled by default), so I must be prepared to handle potentially thousands of PING frames per second. I'd suggest turning it off by default and documenting the potential implications in the javadoc, so users willing to enable it can make an informed decision.
I won't continue talking about two different issues in this issue; they are seriously unrelated. If you want to talk about the inefficiency, then either tell me to stop discussing the control frame queuing issue here or open another issue. I can seriously discuss it, put some of it into context, and consider next steps, but I'm not interested in wasting my time with the two conversations crossing each other.

The queuing issue is potentially serious and we would want to resolve it within a short period of time. The incompatibility is obviously high-impact, and I expect there is simply a small bug in either the sender or receiver (grpc or your custom server).

Resolving the inefficiency would need to rise to a high enough importance, would need a design, and would need planning for when we'll do it and whether we'll do the same work in other languages. So it would be quite a lot of effort for less gain. Before you filed this issue I had been considering ways of making BDP less aggressive as my mind wandered, so I am interested, but you're the first user to notice any problem, and you have only asserted "it is inefficient!" without quantifying the harm. Since we have limited time, we have to prioritize, and this doesn't seem relatively urgent.
Are you sure they are PING acks? How did you verify this? I currently see no evidence that the BDP pings are the source of the accumulation. As I mentioned, there's only one outstanding ping at a time, and the frequency of pings you mention aligns with the RTT, so it seems to be operating as expected.

SETTINGS frames are sent by the BDP code, but the window has exponential growth and their frequency is limited by the rate of PING acks. If your server flushes PING acks less often, then that will simply reduce the rate of the PINGs/SETTINGS overall; they wouldn't accumulate. It seems unlikely to be keepalive pings as well. WINDOW_UPDATEs aren't counted by Http2ControlFrameLimitEncoder.

I was involved with the design of BDP pings and Http2ControlFrameLimitEncoder, and I did consider their mutual compatibility. I could have missed something or there could be a bug, but with the information you've given I have no reason to suspect BDP pings are the source of the issue. Could it potentially be something else, like a RST_STREAM? It might be worthwhile to put a breakpoint in Http2ControlFrameLimitEncoder and see which frames are actually accumulating.
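The distinction being drawn here (frequency vs. backlog) can be illustrated with a small, hypothetical model; this is not Netty's actual Http2ControlFrameLimitEncoder, just a sketch of its accounting: buffered control frames count toward a limit, flushes drain the count, and only a sustained backlog past the limit (10000 in this report) tears down the connection.

```java
import java.util.ArrayDeque;

// Hypothetical model of limit-encoder-style accounting for buffered
// outbound control frames (e.g. PING acks).
public class ControlFrameBacklog {
    static final int LIMIT = 10_000;
    private final ArrayDeque<String> buffered = new ArrayDeque<>();

    // Queue a control frame; returns false once the limit would be
    // exceeded (the real encoder closes the connection at that point).
    boolean queue(String frame) {
        if (buffered.size() + 1 > LIMIT) {
            return false;
        }
        buffered.add(frame);
        return true;
    }

    // A flush drains everything that was buffered.
    void flush() {
        buffered.clear();
    }

    public static void main(String[] args) {
        ControlFrameBacklog ch = new ControlFrameBacklog();
        boolean ok = true;
        // Frequent PING acks are harmless if the channel flushes regularly...
        for (int i = 0; i < 100_000 && ok; i++) {
            ok = ch.queue("PING ack");
            if (i % 100 == 0) {
                ch.flush();
            }
        }
        System.out.println("with flushes: " + ok);       // prints "with flushes: true"
        // ...but a non-writable channel that never flushes hits the limit.
        ch.flush();
        ok = true;
        for (int i = 0; i < 100_000 && ok; i++) {
            ok = ch.queue("PING ack");
        }
        System.out.println("without flushes: " + ok);    // prints "without flushes: false"
    }
}
```

This is why a high PING rate alone does not trip the limit; a channel that stays non-writable while acks keep queuing does.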
There are no two conversations; there is a single problem, as outlined in the title of this issue. To demonstrate it I provided the reproducer. Reading the second part of the previous answer, it seems sending 1k PINGs per second is considered expected behavior.

P.S. FYI, the issue relates to both server and client, as demonstrated by the updated reproducer.
Okay, I will ignore the Http2ControlFrameLimitEncoder error. Can you please demonstrate harm caused by the current behavior? (E.g., show CPU time or network usage as a percentage that is wasted.) I don't see anything here other than "there's an optimization opportunity," which I agree with, but it isn't urgent and hasn't been quantified, which makes it hard to prioritize.
If the RTT is ~1 ms, then yes, 1k PINGs/s is expected if there's lots of RPC activity on the connection.
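The arithmetic behind this claim is simple: with at most one BDP PING in flight, the steady-state PING rate on a busy connection is bounded by 1/RTT. A tiny sketch (the helper name is made up for illustration):

```java
// Upper bound on BDP PING frequency given one PING in flight per RTT.
public class PingRate {
    static double maxPingsPerSecond(double rttMillis) {
        return 1000.0 / rttMillis;
    }

    public static void main(String[] args) {
        System.out.println(maxPingsPerSecond(1.0));   // ~loopback RTT: prints 1000.0
        System.out.println(maxPingsPerSecond(100.0)); // cross-continent RTT: prints 10.0
    }
}
```

So the same algorithm that sends 10 PINGs/s across a continent sends 1000 PINGs/s over loopback.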
I came across this issue when doing some work to improve throughput / reduce cost. In our setup, both incoming and outgoing gRPC requests proxy through a local HTTP/2 sidecar over loopback or a domain socket. As a result I'd expect the RTT to be very small and the max BDP PINGs per second to be very high. I disabled autoTuneFlowControl. Another observation is that disabling it accounted for most of the gain.
What version of gRPC-Java are you using?
1.37.0
What is your environment?
5.4.0-74-generic #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
What did you expect to see?
Client does not flood server with PING frames when autoTuneFlowControl is enabled (default).
What did you see instead?
Connection closed on server with "Maximum number 10000 of outstanding control frames reached".
Steps to reproduce the bug
Client makes request-response calls continuously such that there is a constant number of outstanding requests.
Server is a 3rd-party gRPC implementation based on Netty. It only acks received PING frames and does not send its own PING frames (ack=false). Acked frame content is 1234.
Client and server are on the same host.
Eventually (after several seconds) the connection is closed by the server with "Maximum number 10000 of outstanding control frames reached".
There is a workaround: NettyChannelBuilder.flowControlWindow(int), which happens to disable autoTuneFlowControl.
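A hedged sketch of that workaround, assuming grpc-netty on the classpath; the endpoint and window size are placeholders. Setting any explicit flow-control window opts the channel out of automatic BDP tuning, so no BDP PINGs are sent:

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;

public final class NoBdpChannel {
    static ManagedChannel create() {
        return NettyChannelBuilder
            .forAddress("localhost", 50051)   // placeholder endpoint
            .usePlaintext()
            // Any explicit window disables autoTuneFlowControl;
            // 1 MiB is an arbitrary example value.
            .flowControlWindow(1024 * 1024)
            .build();
    }
}
```

The trade-off, per the discussion above, is that the window no longer adapts to the measured bandwidth-delay product.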