netty: Writes can starve reads, causing DEADLINE_EXCEEDED #8912

Open
tommyulfsparre opened this issue Feb 11, 2022 · 6 comments

@tommyulfsparre commented Feb 11, 2022

What version of gRPC-Java are you using?

1.44.0

What is your environment?

A client and server communicating over localhost.

Operating system: Mac OSX
JDK: 11.0.9

What did you expect to see?

The client in this branch should not fail the majority of its RPCs with DEADLINE_EXCEEDED when using Netty for transport.

What did you see instead?

The majority of calls fail with DEADLINE_EXCEEDED when using Netty, whereas with OkHttp they do not.

Steps to reproduce the bug

This (hacky) reproducer might indicate a potential problem where the periodic flushing of the WriteQueue doesn't happen frequently enough. This adds latency, leading to DEADLINE_EXCEEDED during startup. Switching to OkHttp for this particular case does not cause any RPCs to fail.

On the wire, with Netty multiple RPCs are batched and sent together, whereas with OkHttp they are dispatched immediately.
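For reference, a minimal sketch of the pattern the reproducer exercises (not the actual branch; the port, the 100 ms deadline, and the use of the standard health service are assumptions): a burst of concurrent RPCs with a tight deadline issued right after the channel is created.

```java
import io.grpc.ManagedChannel;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthGrpc;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DeadlineRepro {
  public static void main(String[] args) throws InterruptedException {
    ManagedChannel channel = NettyChannelBuilder.forAddress("localhost", 50051)
        .usePlaintext()
        .build();
    HealthGrpc.HealthBlockingStub stub = HealthGrpc.newBlockingStub(channel);
    AtomicInteger failed = new AtomicInteger();

    // Fire a burst of RPCs with a tight deadline right after channel creation,
    // before the JIT and the transport have warmed up.
    Thread[] threads = new Thread[50];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        try {
          stub.withDeadlineAfter(100, TimeUnit.MILLISECONDS)
              .check(HealthCheckRequest.getDefaultInstance());
        } catch (StatusRuntimeException e) {
          failed.incrementAndGet();  // typically DEADLINE_EXCEEDED here
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("failed RPCs: " + failed.get());
    channel.shutdownNow();
  }
}
```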

Here is a screenshot from Perfmark using Netty:

[PerfMark screenshot]

Is this a known issue, or am I doing something fundamentally wrong in the reproducer?

@tommyulfsparre tommyulfsparre changed the title WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup? netty: WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup? Feb 11, 2022
@ejona86 (Member) commented Feb 11, 2022

Thank you for the reproduction and good report. PerfMark is so helpful here.

Java is slow to start. I don't think people really understand how slow Java is when starting. It is slower than Python when starting, as it uses a less optimized interpreter.

We'd expect OkHttp to be faster when starting, if only because it has less code. OkHttp is essentially optimized for the cold-start case, as is common on Android. OkHttp has similar write-combining code to Netty via its AsyncSink, so I don't think the delayed flushing itself is the problem.

The WriteQueue will run indefinitely as long as there are events. Changing the code to use batch sizes of 16 and return after "flush0" didn't substantially affect the number of RPCs completing within 100 ms. If I increase the deadline to 200 ms, I see more varied results. So I think Netty here is heavily over-saturated, and we'd expect latency to increase horribly when that happens.

I tried returning earlier in WriteQueue (along with rescheduling) because I think the problem is reads are blocked on the writes. It seems returning early from WriteQueue may not have much effect. Looking at Netty it seems it will immediately run the new Runnable. It will let other scheduled runnables execute first, but it doesn't go back to select() until SingleThreadEventExecutor.runAllTasks() completes. This means that deadlines can be processed much more rapidly than reads.
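Roughly, the shape of that experiment looks like the following. This is an illustrative sketch only, with placeholder names (BatchLimitedWriteQueue, the Runnable commands, the batch size of 16); it is not the actual grpc-netty WriteQueue code, and the real code deduplicates the drain scheduling rather than enqueuing a task per write.

```java
import io.netty.channel.Channel;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Illustrative sketch only; not the actual grpc-netty WriteQueue. */
final class BatchLimitedWriteQueue {
  private static final int BATCH_SIZE = 16;

  private final Channel channel;
  private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();

  BatchLimitedWriteQueue(Channel channel) {
    this.channel = channel;
  }

  void enqueue(Runnable writeCommand) {
    queue.add(writeCommand);
    channel.eventLoop().execute(this::drainBatch);
  }

  private void drainBatch() {
    int written = 0;
    Runnable cmd;
    while (written < BATCH_SIZE && (cmd = queue.poll()) != null) {
      cmd.run();       // issue the write to the channel
      written++;
    }
    if (written > 0) {
      channel.flush(); // flush after each batch instead of after the whole queue
    }
    if (!queue.isEmpty()) {
      // Reschedule rather than draining everything in one go, hoping the event
      // loop interleaves reads; per the observation above, the executor may run
      // this task again before returning to select(), which limits the benefit.
      channel.eventLoop().execute(this::drainBatch);
    }
  }
}
```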

OkHttp has the advantage in this scenario that it uses separate dedicated threads for reading and writing. That means responses aren't delayed waiting for the writing to complete. The difference in latency is huge: I see OkHttp completing RPCs in 0-10 ms. Although there are other things going on here too, because I also see many Netty RPCs completing in 300-400 ms when I set the deadline to 200 ms; that suggests thrashing due to too many threads or something similar.

I think there should likely be some change based on this reproducer, but I suspect it can't be substantially fixed.

@ejona86 (Member) commented Feb 11, 2022

Looking at Netty it seems it will immediately run the new Runnable.

I was accidentally looking in the ioRatio=100 flow. In another flow it runs tasks until X amount of time has passed, where X is based on the time taken for select(). That's good, as it means this should be addressable to some degree, but I'm still unclear on why reads appear to be delayed.
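For anyone wanting to experiment with that flow, the ratio is configurable on the NIO event loop group handed to grpc-netty. This is a tuning sketch, not a suggested fix; the port, thread count, and ratio value are arbitrary.

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class IoRatioExperiment {
  public static void main(String[] args) {
    // A single-threaded NIO event loop group; setIoRatio controls how much of
    // each loop iteration is spent on I/O versus queued tasks. The default is
    // 50; 100 means tasks are run until the queue is drained.
    NioEventLoopGroup group = new NioEventLoopGroup(1);
    group.setIoRatio(80);

    // When supplying a custom event loop group, the channel type must match it.
    ManagedChannel channel = NettyChannelBuilder.forAddress("localhost", 50051)
        .eventLoopGroup(group)
        .channelType(NioSocketChannel.class)
        .usePlaintext()
        .build();
    // ... issue RPCs on the channel as in the reproducer ...
  }
}
```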

@tommyulfsparre (Author) commented Feb 14, 2022

@ejona86 Thanks for looking into this. Feel free to correct the issue title.

Java is slow to start. I don't think people really understand how slow Java is when starting

For some context, which I forgot to add: we "pre-warm" the JVM and the gRPC server/channels by issuing health-check RPCs until they are served within a configurable latency threshold. This is done before marking the service as healthy/serving.

This mitigation has been partially successful in addressing the slow start, but we still see cases during deployment where services fail RPCs with DEADLINE_EXCEEDED. There are, of course, many factors outside of gRPC that can contribute to this, but I wanted to see if I could reproduce it in an environment with fewer moving parts. I'm not saying this is the same issue, but it might be.
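A sketch of that warm-up, assuming the standard gRPC health service is registered on the server; the latency threshold, per-RPC deadline, and retry cap are made-up values.

```java
import io.grpc.ManagedChannel;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthGrpc;
import java.util.concurrent.TimeUnit;

/** Repeats health checks until one is served within the latency threshold. */
final class WarmUp {
  static void untilFast(ManagedChannel channel, long thresholdMillis) {
    HealthGrpc.HealthBlockingStub health = HealthGrpc.newBlockingStub(channel);
    for (int attempt = 0; attempt < 1000; attempt++) {
      long start = System.nanoTime();
      try {
        // Repeated health checks warm the JIT, the channel, and the server path.
        health.withDeadlineAfter(1, TimeUnit.SECONDS)
            .check(HealthCheckRequest.getDefaultInstance());
      } catch (StatusRuntimeException e) {
        continue;  // not warm yet (e.g. DEADLINE_EXCEEDED); try again
      }
      long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      if (elapsed <= thresholdMillis) {
        return;  // served within the threshold: mark the service as serving
      }
    }
    throw new IllegalStateException("service did not warm up within the retry budget");
  }
}
```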

@ejona86 ejona86 changed the title netty: WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup? netty: Writes can starve reads, causing DEADLINE_EXCEEDED Mar 7, 2022
@ejona86 (Member) commented Mar 7, 2022

Thinking about this more, flow control should normally stall large data transfers and force a batch of reads. So this may be predominantly caused by HPACK/writing out headers.

@tommyulfsparre (Author)

@ejona86 Do you think this issue should be raised/moved to the netty project instead?

I tried to reproduce this using apple/servicetalk and got similar behaviour but with a different error, which might point to something that needs to be addressed (if possible) in Netty, as you have already concluded.

@ejona86 (Member) commented Apr 6, 2022

It wouldn't be inappropriate to share with Netty now. I had been wanting to understand why the Netty code isn't working as it would appear it should. Since you see the issue with servicetalk as well, that means it is likely a Netty issue, modulo some config option we're not tweaking.
