netty: Writes can starve reads, causing DEADLINE_EXCEEDED #8912

Open
tommyulfsparre opened this issue Feb 11, 2022 · 6 comments

@tommyulfsparre commented Feb 11, 2022

What version of gRPC-Java are you using?

1.44.0

What is your environment?

A client and server communicating over localhost.

Operating system: Mac OSX
JDK: 11.0.9

What did you expect to see?

The client in this branch should not fail the majority of its RPCs with DEADLINE_EXCEEDED when using Netty for transport.

What did you see instead?

The majority of calls fail with DEADLINE_EXCEEDED when using Netty, whereas with OkHttp they do not.

Steps to reproduce the bug

This (hacky) reproducer might indicate a potential problem where the periodic flushing of the WriteQueue doesn't happen frequently enough. This adds latency, leading to DEADLINE_EXCEEDED during startup. Switching to OkHttp for this particular case does not cause any RPCs to fail.

On the wire, with Netty multiple RPCs are batched and sent together, whereas with OkHttp they are dispatched immediately.
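For reference, a minimal sketch of the pattern the reproducer exercises (not the actual branch; the port, the 100 ms deadline, and the use of the standard health service are assumptions): a burst of concurrent RPCs with a tight deadline issued right after the channel is created.

```java
import io.grpc.ManagedChannel;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthGrpc;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DeadlineRepro {
  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel = NettyChannelBuilder.forAddress("localhost", 50051)
        .usePlaintext()
        .build();
    HealthGrpc.HealthBlockingStub stub = HealthGrpc.newBlockingStub(channel);
    AtomicInteger failed = new AtomicInteger();

    // Fire a burst of RPCs with a tight deadline right after channel creation,
    // before the JIT and the transport have warmed up.
    Thread[] threads = new Thread[50];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        try {
          stub.withDeadlineAfter(100, TimeUnit.MILLISECONDS)
              .check(HealthCheckRequest.getDefaultInstance());
        } catch (StatusRuntimeException e) {
          failed.incrementAndGet();  // typically DEADLINE_EXCEEDED here
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("failed RPCs: " + failed.get());
    channel.shutdownNow();
  }
}
```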

Here is a screenshot from Perfmark using Netty:

[PerfMark screenshot]

Is this a known issue, or am I doing something fundamentally wrong in the reproducer?

@tommyulfsparre tommyulfsparre changed the title WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup? netty: WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup? Feb 11, 2022
@ejona86 (Member) commented Feb 11, 2022

Thank you for the reproduction and good report. PerfMark is so helpful here.

Java is slow to start. I don't think people really understand how slow Java is when starting. It is slower than Python when starting, as it uses a less optimized interpreter.

We'd expect OkHttp to be faster when starting, if only because it has less code. OkHttp is essentially optimized for the cold-start case, as is common on Android. OkHttp has similar write-combining code to Netty via its AsyncSink, so I don't think the delayed flushing itself is the problem.

The WriteQueue will run indefinitely as long as there are events. Changing the code to use batch sizes of 16 and return after "flush0" didn't substantially affect the number of RPCs completing within 100 ms. If I increase the deadline to 200 ms, I see more varied results. So I think Netty here is heavily over-saturated, and we'd expect latency to increase horribly when that happens.

I tried returning earlier in WriteQueue (along with rescheduling) because I think the problem is reads are blocked on the writes. It seems returning early from WriteQueue may not have much effect. Looking at Netty it seems it will immediately run the new Runnable. It will let other scheduled runnables execute first, but it doesn't go back to select() until SingleThreadEventExecutor.runAllTasks() completes. This means that deadlines can be processed much more rapidly than reads.
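Roughly, the shape of that experiment looks like the following. This is an illustrative sketch only, with placeholder names (BatchLimitedWriteQueue, the Runnable commands, the batch size of 16); it is not the actual grpc-netty WriteQueue code, and the real code deduplicates the drain scheduling rather than enqueuing a task per write.

```java
import io.netty.channel.Channel;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Illustrative sketch only; not the actual grpc-netty WriteQueue. */
final class BatchLimitedWriteQueue {
  private static final int BATCH_SIZE = 16;

  private final Channel channel;
  private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();

  BatchLimitedWriteQueue(Channel channel) {
    this.channel = channel;
  }

  void enqueue(Runnable writeCommand) {
    queue.add(writeCommand);
    channel.eventLoop().execute(this::drainBatch);
  }

  private void drainBatch() {
    int written = 0;
    Runnable cmd;
    while (written < BATCH_SIZE && (cmd = queue.poll()) != null) {
      cmd.run();       // issue the write to the channel
      written++;
    }
    if (written > 0) {
      channel.flush(); // flush after each batch instead of after the whole queue
    }
    if (!queue.isEmpty()) {
      // Reschedule rather than draining everything in one go, hoping the event
      // loop interleaves reads; per the observation above, the executor may run
      // this task again before returning to select(), which limits the benefit.
      channel.eventLoop().execute(this::drainBatch);
    }
  }
}
```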

OkHttp has the advantage in this scenario that it uses separate dedicated threads for reading and writing. That means responses aren't delayed waiting for the writing to complete. The difference in latency is huge: I see OkHttp completing RPCs in 0-10 ms. Although there are other things going on here too, because I also see many Netty RPCs completing in 300-400 ms when I set the deadline to 200 ms; that suggests thrashing due to too many threads or something similar.

I think there should likely be some change based on this reproducer, but I suspect it can't be substantially fixed.

@ejona86 (Member) commented Feb 11, 2022

Looking at Netty it seems it will immediately run the new Runnable.

I was accidentally looking in the ioRatio=100 flow. In another flow it runs tasks until X amount of time has passed, where X is based on the time taken for select(). That's good, as it means this should be addressable to some degree, but I'm still unclear on why reads appear to be delayed.
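For anyone wanting to experiment with that flow, the ratio is configurable on the NIO event loop group handed to grpc-netty. This is a tuning sketch, not a suggested fix; the port, thread count, and ratio value are arbitrary.

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class IoRatioExperiment {
  public static void main(String[] args) {
    // A single-threaded NIO event loop group; setIoRatio controls how much of
    // each loop iteration is spent on I/O versus queued tasks. The default is
    // 50; 100 means tasks are run until the queue is drained.
    NioEventLoopGroup group = new NioEventLoopGroup(1);
    group.setIoRatio(80);

    // When supplying a custom event loop group, the channel type must match it.
    ManagedChannel channel = NettyChannelBuilder.forAddress("localhost", 50051)
        .eventLoopGroup(group)
        .channelType(NioSocketChannel.class)
        .usePlaintext()
        .build();
    // ... issue RPCs on the channel as in the reproducer ...
  }
}
```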

@tommyulfsparre (Author) commented Feb 14, 2022

@ejona86 Thanks for looking into this. Feel free to correct the issue title.

Java is slow to start. I don't think people really understand how slow Java is when starting

For some context, which I forgot to add: we "pre-warm" the JVM and the gRPC server/channels by issuing health-check RPCs until they are served within a configurable latency threshold. This is done before marking the service as healthy/serving.

This mitigation has been partially successful in addressing the slow start, but we still see cases during deployment where services fail RPCs with DEADLINE_EXCEEDED. There are, of course, many factors outside of gRPC that can contribute to this, but I wanted to see if I could reproduce it in an environment with fewer moving parts. I'm not saying this is the same issue, but it might be.
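A sketch of that warm-up, assuming the standard gRPC health service is registered on the server; the latency threshold, per-RPC deadline, and retry cap are made-up values.

```java
import io.grpc.ManagedChannel;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthGrpc;
import java.util.concurrent.TimeUnit;

/** Repeats health checks until one is served within the latency threshold. */
final class WarmUp {
  static void untilFast(ManagedChannel channel, long thresholdMillis) {
    HealthGrpc.HealthBlockingStub health = HealthGrpc.newBlockingStub(channel);
    for (int attempt = 0; attempt < 1000; attempt++) {
      long start = System.nanoTime();
      try {
        // Repeated health checks warm the JIT, the channel, and the server path.
        health.withDeadlineAfter(1, TimeUnit.SECONDS)
            .check(HealthCheckRequest.getDefaultInstance());
      } catch (StatusRuntimeException e) {
        continue;  // not warm yet (e.g. DEADLINE_EXCEEDED); try again
      }
      long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      if (elapsed <= thresholdMillis) {
        return;  // served within the threshold: mark the service as serving
      }
    }
    throw new IllegalStateException("service did not warm up within the retry budget");
  }
}
```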

@ejona86 ejona86 changed the title netty: WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup? netty: Writes can starve reads, causing DEADLINE_EXCEEDED Mar 7, 2022
@ejona86 (Member) commented Mar 7, 2022

Thinking about this more, flow control should normally stall large data transfers and force a batch of reads. So this may be predominantly caused by HPACK/writing out headers.

@tommyulfsparre (Author)

@ejona86 Do you think this issue should be raised/moved to the netty project instead?

I tried to reproduce this using apple/servicetalk and got similar behaviour but with a different error, which might point to something that needs to be addressed (if possible) in Netty, as you have already concluded.

@ejona86 (Member) commented Apr 6, 2022

It wouldn't be inappropriate to share with Netty now. I had been wanting to understand why the Netty code isn't working as it would appear it should. Since you see the issue with servicetalk as well, that means it is likely a Netty issue, modulo some config option we're not tweaking.
