
BDP PINGs are sent much more frequently than necessary #8260

Closed
mostroverkhov opened this issue Jun 15, 2021 · 10 comments · Fixed by #9650

@mostroverkhov

What version of gRPC-Java are you using?

1.37.0

What is your environment?

5.4.0-74-generic #83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

What did you expect to see?

The client does not flood the server with PING frames when autoTuneFlowControl is enabled (the default).

What did you see instead?

The connection is closed on the server with:

    io.netty.handler.codec.http2.Http2Exception: Maximum number 10000 of outstanding control frames reached
	at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:108)
	at io.netty.handler.codec.http2.Http2ControlFrameLimitEncoder.handleOutstandingControlFrames(Http2ControlFrameLimitEncoder.java:96)
	at io.netty.handler.codec.http2.Http2ControlFrameLimitEncoder.writePing(Http2ControlFrameLimitEncoder.java:69)
	at io.netty.handler.codec.http2.Http2FrameCodec.write(Http2FrameCodec.java:333)

Steps to reproduce the bug

The client makes request-response calls continuously such that there is a constant number of outstanding requests.

The server is a third-party gRPC implementation based on Netty.

It only acks received PING frames and does not send its own PING frames (ack=false).
The content of the acked frames is 1234.

Client and server are on the same host.

Eventually (after several seconds) the connection is closed by the server with:

    io.netty.handler.codec.http2.Http2Exception: Maximum number 10000 of outstanding control frames reached
	at io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:108)
	at io.netty.handler.codec.http2.Http2ControlFrameLimitEncoder.handleOutstandingControlFrames(Http2ControlFrameLimitEncoder.java:96)
	at io.netty.handler.codec.http2.Http2ControlFrameLimitEncoder.writePing(Http2ControlFrameLimitEncoder.java:69)
	at io.netty.handler.codec.http2.Http2FrameCodec.write(Http2FrameCodec.java:333)

There is a workaround, NettyChannelBuilder.flowControlWindow(int), which happens to disable autoTuneFlowControl.
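
For reference, a minimal sketch of that workaround (the host, port, and window size below are illustrative, not recommendations):

    import io.grpc.ManagedChannel;
    import io.grpc.netty.NettyChannelBuilder;

    class StaticWindowChannel {
      // Setting an explicit flow-control window disables the BDP-based
      // auto-tuning, at the cost of a static window that must be sized by hand.
      static ManagedChannel create(String host, int port) {
        return NettyChannelBuilder.forAddress(host, port)
            .flowControlWindow(1024 * 1024)  // 1 MiB static window (illustrative value)
            .usePlaintext()                  // assumes a plaintext test setup
            .build();
      }
    }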

@voidzcy
Contributor

voidzcy commented Jun 16, 2021

Does the server have any customization? I wasn't able to reproduce the behavior by modifying our manual flow control example.

@mostroverkhov
Author

mostroverkhov commented Jun 17, 2021

@voidzcy It is painful to get grpc-java into a compilable state on my local machine, so I composed a gRPC-only project where this behavior is trivially reproduced as well. It is a property of the Netty-based grpc-java client, as the server only does what is mandated by the HTTP/2 spec: it acks received PINGs.

With the above example, NettyServerHandler.onPingRead is called with a frequency comparable to that of inbound requests, but the connection is not torn down because of the grpc-java library's overly aggressive buffer flushing (that's why Http2ControlFrameLimitEncoder does not kick in).

With autoTuneFlowControl enabled, it seems the rate of PINGs grows with a) the number of requests [1], [2]; and b) a decrease in PING round-trip time [3].

I think the algorithm needs to be adjusted for the high-RPS, low-latency scenario.

@ejona86
Member

ejona86 commented Jun 17, 2021

It is painful to get grpc-java into a compilable state on my local machine

Next time you may want to take a look at COMPILING. There are ways to disable the annoying parts. You should be able to build most of the project locally with just Java installed. That said, I see no problem with making a repro project like you did.

With autoTuneFlowControl enabled, it seems the rate of PINGs grows with a) the number of requests [1], [2]; and b) a decrease in PING round-trip time [3].

The algorithm has one PING "in flight" any time it is receiving DATA frames. So yes, the rate will increase if you send more data or decrease the RTT. Note that WINDOW_UPDATE frames also increase in frequency as you decrease the RTT.

I composed a gRPC-only project where this behavior is trivially reproduced as well

It isn't reproducing on my machine after running for 4 minutes. In that project do you actually see the "Maximum number 10000 of outstanding control frames reached" error or just "lots of pings"?

The server is a third-party gRPC implementation based on Netty.

Given there is only one outstanding BDP PING at a time, I would expect the bug is on the server, especially if using any part of the older HTTP/2 API in Netty. In the old API it appears there are multiple ways of doing the same thing, but generally only one will have the correct behavior. But obviously if we reproduce it we can dig in further to find the true cause.

@mostroverkhov
Author

mostroverkhov commented Jun 17, 2021

It isn't reproducing on my machine after running for 4 minutes

That is sad to hear @ejona86. I suggest putting a breakpoint here in the above grpc-only reproducer client and seeing literally thousands of outbound non-ack PING frames per second at ~80k RPS.

I could probably provide a reproducer that counts inbound PINGs on the server side, but it looks like it is difficult (impossible?) to get insight at the "wire" (frame) level with the current grpc-java-over-Netty API?

It seems the current behavior, caused by resetting the PING counter on each request, was introduced by this PR, so may I ask what the thinking and rationale behind this was?

From "external observer" point-of-view my first guess is this default behavior of grpc-java client is tailored for mobile clients doing 1 request per minute over long wire (high rtt). If so, I would update javadoc accordingly so people dont use It for server-server communication and call It a day.

In that project do you actually see the "Maximum number 10000 of outstanding control frames reached"

The grpc-java server aggressively flushes the outbound buffer on outbound PINGs, basically sweeping the problem under the rug at the cost of lower performance for end users. But for servers willing to utilize their IO efficiently, there is the above error. It indicates that at high RPS the grpc-java client is DDoSing my server with PINGs, and I would prefer to avoid that.

I would expect the bug is on the server

I described the problem I see on a custom server, and for the pure grpc-java reproducer I provided you with instructions and code references that clearly demonstrate the crux of it. The problem is outlined in the title of the issue: the grpc-java client floods the server with PING frames.

If there is still any confusion, I am happy to provide further explanations @ejona86.

I am also happy to update the reproducer with metrics on server inbound control frames, as soon as grpc-java offers a suitable API for that.

@ejona86
Member

ejona86 commented Jun 17, 2021

That is sad to hear @ejona86. I suggest putting a breakpoint here in the above grpc-only reproducer client and seeing literally thousands of outbound non-ack PING frames per second at ~80k RPS.

Yeah, I fully expect the PINGs are frequent. But Http2ControlFrameLimitEncoder doesn't care about the frequency; it only cares about how many outbound control frames are buffered. I modified Http2ControlFrameLimitEncoder in gRPC to print out the maximum value seen by outstandingControlFrames and I didn't see it rise above 1. So I don't believe the reproducer reproduces behavior that would lead to the Http2ControlFrameLimitEncoder "10000 buffered" exception.

In no way would I disagree that "we are sending a lot of pings," and I would be interested in reducing the number of pings at some point, especially if we could demonstrate the CPU/network/performance waste so we can communicate its importance. But that seems a separate issue from the failure you're seeing. This is your issue, so if you want to ignore the Http2ControlFrameLimitEncoder exception and instead focus on the CPU/network waste of PING frames we can do that, but the two are separate discussions.

It seems the current behavior, caused by resetting the PING counter on each request, was introduced by this PR, so may I ask what the thinking and rationale behind this was?

We have a !isPinging() check that prevents sending a BDP PING if we are still waiting on the ACK. That alone should prevent any accumulation of control frames in Http2ControlFrameLimitEncoder.
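
As an illustration only (this is not the actual grpc-java code; the class and method names are invented for the sketch), the gating described above amounts to something like:

    // At most one BDP PING is in flight, and a new PING is only triggered
    // while DATA frames are arriving.
    class BdpPingGate {
      private boolean pinging;       // true while waiting for the PING ack
      private long dataSincePing;    // DATA bytes seen since the PING was sent

      void onDataRead(int bytes) {
        if (!pinging) {              // the !isPinging() check mentioned above
          pinging = true;
          dataSincePing = 0;
          sendBdpPing();             // hypothetical: writes a PING frame (ack=false)
        }
        dataSincePing += bytes;
      }

      void onPingAck() {
        pinging = false;
        // dataSincePing is now a rough bandwidth-delay-product sample.
      }

      private void sendBdpPing() { /* write the PING via the encoder */ }
    }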

The pingCount is a workaround for one server that had weird PING limit behavior; we basically mirror its behavior in grpc and avoid sending a ping if it would reject us. If we removed the PingCountingFrameWriter the BDP ping stuff should work fine everywhere except with that one server.

https://grpc.io/blog/grpc-go-perf-improvements/#bdp-estimation-and-dynamic-flow-control-window is probably the best writeup of the BDP stuff.
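
Paraphrasing that post as a rough sketch (the threshold and names here are illustrative, not the actual grpc constants):

    class BdpWindowTuner {
      // The bytes received between sending the BDP PING and receiving its ack
      // approximate the bandwidth-delay product. If that sample approaches the
      // current window, the window is limiting throughput, so grow it (bounded
      // by a configured maximum).
      static int maybeGrowWindow(int currentWindow, int bdpSample, int maxWindow) {
        if (bdpSample * 3 >= currentWindow * 2) {    // sample > ~2/3 of the window
          return Math.min(2 * bdpSample, maxWindow); // grow to 2x the sample
        }
        return currentWindow;
      }
    }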

From "external observer" point-of-view my first guess is this default behavior of grpc-java client is tailored for mobile clients doing 1 request per minute over long wire (high rtt). If so, I would update javadoc accordingly so people dont use It for server-server communication and call It a day.

Actually, we only have the BDP ping in Netty and we discourage using Netty on Android; OkHttp is for Android users. There's a feature request to add it to grpc-okhttp: #8226. The problem is greatest in data centers and high-capacity cross-continental links; we did it for beefy servers, not edge clients.

The grpc-java server aggressively flushes the outbound buffer on outbound PINGs

What flushing are you talking about? I don't see any flush for the BDP ping; we rely on the normal flush() in channelReadComplete() in Netty.

Might you be talking about this flush?

    public void ping() {
      ChannelFuture pingFuture = encoder().writePing(
          ctx, false /* isAck */, KEEPALIVE_PING, ctx.newPromise());
      ctx.flush();

That flush is for keepalive and must be present because it is only sent during a period of inactivity. That is a separate code path not used by the BDP pings.
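
For context, the normal read-loop flush mentioned above looks roughly like this in a Netty handler (a sketch of the pattern, not the actual grpc-java handler):

    import io.netty.channel.ChannelDuplexHandler;
    import io.netty.channel.ChannelHandlerContext;

    // Frames written while processing a burst of reads (including a BDP PING
    // triggered by incoming DATA) are flushed once, when Netty signals the end
    // of the read loop, rather than being flushed individually.
    class ReadLoopFlushHandler extends ChannelDuplexHandler {
      @Override
      public void channelReadComplete(ChannelHandlerContext ctx) {
        ctx.flush();
        ctx.fireChannelReadComplete();
      }
    }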

@mostroverkhov
Author

So, given that grpc-java lacks frame listener APIs, I ended up with this creative one; it was committed in the provided repro.

It eloquently demonstrates that the grpc client with autoTuneFlowControl enabled sends ~1K (one thousand) PINGs per second.

Just in case, here is also a tcpdump from the same reproducer; it correlates with the PING frame listener data.

Actually, we only have the BDP ping in Netty and we discourage using Netty on Android

1000 PINGs per second corresponds to a sampling rate of 1 ms (millisecond), so again I have to ask why anyone would need such a tiny sampling interval for a server application whose dynamics are usually at least 3 (and most likely 4) orders of magnitude slower (1-10 seconds).

So I don't believe the reproducer reproduces behavior that would lead to the Http2ControlFrameLimitEncoder "10000 buffered" exception

I encountered the initial issue while doing a custom gRPC server load test: a big Netty outbound buffer plus lots of small messages. Lots of PING acks queued up before the channel flushed (due to it being non-writable).

The current grpc-java BDP algorithm is not only naive but dangerous, because on the server side I have to choose Netty's outbound buffer size, TCP buffer sizes, and the outstanding control frames cap assuming the grpc-java client uses that algorithm (because it is enabled by default), i.e. be prepared to handle potentially thousands of PING frames per second.

I'd suggest turning it off by default and having the javadoc describe the potential implications, so users willing to enable it can make an informed decision.

@ejona86
Member

ejona86 commented Jun 18, 2021

It eloquently demonstrates that the grpc client with autoTuneFlowControl enabled sends ~1K (one thousand) PINGs per second.

I won't continue talking about two different issues in this issue. They are seriously unrelated. If you want to talk about that, then either tell me to stop discussing the control frame queuing issue here or open another issue. I can seriously discuss this, put some of it into context, and consider next steps. But I'm not interested in wasting my time with the two conversations crossing each other.

The queuing issue is potentially serious and we would want to resolve it within a short period of time. The incompatibility is obviously high-impact and I expect there is simply a small bug in either the sender or receiver (grpc or your custom server). Resolving the inefficiency would need to rise to a high enough importance, would need a design, and would need planning for when we'll do it and whether we'll do the same work in other languages. So it will be quite a lot of effort and less gain. Before you filed this issue I had been considering ways of making BDP less aggressive as my mind wandered, so I am interested, but you're the first user to notice any problem and have only asserted "it is inefficient!" without quantifying harm. Since we have limited time we have to prioritize, and this doesn't seem relatively urgent.

Lots of PING acks queued up before the channel flushed (due to it being non-writable).

Are you sure they are PING acks? How did you verify this?

I currently see no evidence that the BDP pings are the source of the accumulation. As I mentioned, there's only one outstanding ping at a time and the frequency of pings you mention aligns with the RTT so it seems to be operating as expected. SETTINGS frames are sent by the BDP code, but the window has exponential growth and their frequency is limited by the rate of PING acks. If your server flushes PING acks less often, then that simply will reduce the rate of the PINGs/SETTINGS overall; they wouldn't accumulate. It seems unlikely to be keepalive pings as well. WINDOW_UPDATES aren't counted by Http2ControlFrameLimitEncoder.

I was involved with the design of BDP pings and Http2ControlFrameLimitEncoder, and I did consider their mutual compatibility. I could have missed something or there could be a bug, but with the information you've given I have no reason to suspect BDP pings are the source of the issue.

Could it potentially be something else, like a RST_STREAM? It might be worthwhile to put a breakpoint at limitReached = true in Http2ControlFrameLimitEncoder and take a look with a debugger at the channel's or SslHandler's queue.

@mostroverkhov
Author

I won't continue talking about two different issues in this issue. They are seriously unrelated.

There are not two conversations; there is a single problem, as outlined in the title of this issue: Client floods server with PING frames if autoTuneFlowControl is enabled.

To demonstrate it I provided a grpc-java-only reproducer of what is mentioned in the title; I am not sure why you keep steering toward bugs in my custom server (which cannot be checked, given that the server source code is not publicly available).
I can redact the custom server stacktrace if it is confusing or appears unrelated.

Reading the second part of the previous answer, it seems sending 1k PINGs per second is working as intended for the current autoTuneFlowControl algorithm; can you confirm this explicitly?
Then I can accommodate the algorithm's behavior internally and have a source to link back to.

P.S. FYI, the issue relates to both server and client, as demonstrated by the updated reproducer.

@ejona86
Member

ejona86 commented Jun 23, 2021

There are not two conversations; there is a single problem, as outlined in the title of this issue: Client floods server with PING frames if autoTuneFlowControl is enabled.

Okay. I will ignore the Http2ControlFrameLimitEncoder error.

Can you please demonstrate harm caused by the current behavior? (e.g., show CPU time or network usage as a percentage that is wasted.) I don't see anything here other than "there's an optimization opportunity," which I agree with, but it isn't urgent and hasn't been quantified, which makes it hard to prioritize.

Reading the second part of the previous answer, it seems sending 1k PINGs per second is working as intended for the current autoTuneFlowControl algorithm; can you confirm this explicitly?

If the RTT is ~1 ms, then yes 1k PINGs/s is expected if there's lots of RPC activity on the connection.

ejona86 changed the title from "Client floods server with PING frames if autoTuneFlowControl is enabled" to "BDP PINGs are sent much more frequently than necessary" on Jun 23, 2021
ejona86 added this to the Unscheduled milestone on Jun 23, 2021
@davidkilliansc

davidkilliansc commented Aug 2, 2022

I came across this issue when doing some work to improve throughput / reduce cost. In our setup, both incoming and outgoing gRPC requests proxy through a local HTTP2 sidecar over loopback or a domain socket. As a result I'd expect the RTT to be very small and the max BDP PINGs per second to be very high.

I disabled autoTuneFlowControl for one service and saw a 3-4% reduction in the number of machines needed to handle the service's throughput, which is a non-trivial cost reduction for us. When I compared production profiles from before and after the change, I observed that the largest reduction in CPU was in reading and writing file descriptors (not surprising). I saw an unexpected increase in CPU spent in Native.eventFDWrite stemming from AbstractStream$TransportState.requestMessagesFromDeframer. I suspect that increase is due to the IO event loop being less likely to already be running at the moment the application thread requests messages (since the IO event loop is now doing less work as a result of reducing the PINGs).

Another observation is that most of the gain from disabling autoTuneFlowControl actually came from reduced CPU usage in the local H2 proxy (the thing responding to all those PING frames), rather than from reduced CPU usage of the grpc-java application process.

larry-safran self-assigned this on Nov 16, 2022
ejona86 modified the milestones: Unscheduled → 1.53.0 on Dec 27, 2022
github-actions bot locked as resolved and limited conversation to collaborators on Mar 28, 2023