
[FIX INCLUDED] Improper use of the "write" flag causes flush() to be ignored, increasing used resources #4205

Closed
ninja- opened this issue Sep 10, 2015 · 37 comments

@ninja-

ninja- commented Sep 10, 2015

Scenario:
netty 4.1
user calls writeAndFlush(packet)

Expected behaviour:

the packet is ALWAYS written and flushed...
(want a unit test in the future? stress writeAndFlush() + check that outbound buffer is empty)

What can happen:

the explicit flush can be ignored because of improper use of the OP_WRITE/EPOLLOUT flag
(it only gets picked up later by the event loop)
it's a pretty serious bug because it increases resource usage (5 million ChannelOutboundBuffer$Entry instances, anyone?)
please note that channel.isWritable() etc. is NOT involved in this

The flag is not always turned on when there is still work to do; it's the same problem on the epoll and NIO backends.

Example (confirmed working) patch:

Index: transport-native-epoll/src/main/java/io/netty/channel/epoll/AbstractEpollStreamChannel.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- transport-native-epoll/src/main/java/io/netty/channel/epoll/AbstractEpollStreamChannel.java (revision 250a09df635d70853e1576a9e522c846e918938e)
+++ transport-native-epoll/src/main/java/io/netty/channel/epoll/AbstractEpollStreamChannel.java (revision )
@@ -389,6 +389,8 @@
                 // Wrote all messages.
                 clearFlag(Native.EPOLLOUT);
                 break;
+            } else {
+                setFlag(Native.EPOLLOUT);
             }

NIO has the same behaviour, and this line in NioSocketChannel can be blamed:
https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/socket/nio/NioSocketChannel.java#L271
or rather https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/socket/nio/NioSocketChannel.java#L283 should also set OP_WRITE.
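For reference, the NIO-side equivalent of the epoll patch above is simply "keep OP_WRITE registered while data is still pending". A tiny, self-contained illustration in plain java.nio (this is just the idea, not netty's internal helper):

    import java.nio.channels.SelectionKey;

    final class OpWriteHelper {
        private OpWriteHelper() { }

        // Keep the channel registered for writability notifications so the
        // event loop comes back later and flushes whatever is still pending.
        static void ensureOpWrite(SelectionKey key) {
            int interestOps = key.interestOps();
            if ((interestOps & SelectionKey.OP_WRITE) == 0) {
                key.interestOps(interestOps | SelectionKey.OP_WRITE);
            }
        }
    }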

I hope this gives you an idea of what is wrong and you can think of a "better" way to fix it (I will also look for such a way in the meantime).

@ninja-
Author

ninja- commented Sep 10, 2015

I confirmed in production that this brings the number of $Entry instances back to ~1 per thread instead of thousands or millions...

@normanmaurer
Member

@ninja- actually the fix is not correct, as it will cause unnecessary CPU usage. That said, there is a bug and I'm working on a fix as we speak.

@normanmaurer normanmaurer self-assigned this Sep 10, 2015
@normanmaurer normanmaurer added this to the 4.0.32.Final milestone Sep 10, 2015
@ninja-
Author

ninja- commented Sep 10, 2015

💛 remember to bump the 4.1 beta release after you're done
can't live without the recycler_handle.recycle API, so I am permanently on 4.1 :P

@Scottmitch
Member

@ninja- - Thanks for reporting and pointing us in the right direction!

@ninja-
Author

ninja- commented Sep 10, 2015

@Scottmitch :-) I'm punching every bug in my way; I hope this is going to be the last one for some time!
By the way, no luck trying to write a unit test 😞 it would always pass the isEmpty() check even under stress (maybe it would need a remote channel).

@ninja-
Author

ninja- commented Sep 10, 2015

@normanmaurer note that I wasn't able to get flush() to break outside the first few seconds after a player connects; maybe that gives you a tip.
FYI, I evaluated whether always setting EPOLLOUT in doConnect would fix the problem, but it doesn't.

@Scottmitch
Member

@ninja- - @normanmaurer has a PR pending approval. It will be pushed upstream soon. When it is upstream we will ask you to verify the fix :)

@ninja-
Author

ninja- commented Sep 10, 2015

@Scottmitch 😍 it takes me 1 minute to verify the fix after it's committed somewhere.

@ninja-
Author

ninja- commented Sep 10, 2015

Honestly, I would just wrap doWriteSingle and doWriteMultiple in a while(...). It's not like there is a better way - we can't just return from the function before everything is written. No idea what Norman did, because I guess you guys are handling this PR in a private repo or something?

@normanmaurer
Member

@ninja- just have a bit of patience... a PR with the fix will be upstream soon.

@WhiteTrashLord

Does this also affect 4.0? I'm seeing too high CPU usage with 4.0 :(

@ninja-
Author

ninja- commented Sep 11, 2015

@WhiteTrashLord yeah, it's actually a very old bug. I tried fixing it myself, but my fix doubled CPU usage (while fixing the outbound entry count), so I quickly reverted. I doubt this bug increases CPU usage though, so it's probably not your main problem.

@normanmaurer any news, or are you going to finish this after the weekend?

@Scottmitch
Member

@WhiteTrashLord - Yes, it impacts 4.0. This bug is focused on making sure Netty registers for a channel-writable event when the channel is unable to accept all the data we try to write (one case was missed). It would likely manifest itself first as memory issues rather than CPU. If you are able to diagnose other unexpected Netty behavior, feel free to open another issue.

@ninja-
Author

ninja- commented Sep 11, 2015

@Scottmitch so you guys are working on it outside of the public repo or something?

@ninja-
Author

ninja- commented Sep 14, 2015

@normanmaurer @Scottmitch any update after the weekend?

normanmaurer added a commit that referenced this issue Sep 14, 2015
…to write everything

Motivation:

writeBytes(...) failed to set the EPOLLOUT flag when not all bytes were written. This could lead to the EpollEventLoop not trying to flush the remaining bytes once the socket becomes writable again.

Modifications:

- Move the logic for setting the EPOLLOUT flag to one place so we are sure it always happens.
- Move the OP_WRITE flag logic to one place as well.

Result:

Correctly try to write pending data if socket becomes writable again.
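To visualize the "one point" approach the commit message describes, here is a rough, illustrative fragment (not the actual upstream patch), reusing the setFlag/clearFlag helpers already shown in the diff earlier in this thread:

    // Illustrative only: centralize the EPOLLOUT decision so every write path
    // runs the same check after attempting to write.
    private void adjustEpollOut(boolean allDataWritten) {
        if (allDataWritten) {
            clearFlag(Native.EPOLLOUT);  // nothing pending: stop writability notifications
        } else {
            setFlag(Native.EPOLLOUT);    // data left: retry once the socket is writable again
        }
    }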
@normanmaurer
Member

@ninja- #4216 should fix it

@ninja-
Author

ninja- commented Sep 14, 2015

@normanmaurer I'll deploy it tomorrow and check. I tried something similar and CPU usage rocketed... is your fix correct in that regard? Is it sane to expect a MAJOR drop in $Entry count from 5M?

@Scottmitch
Member

@ninja- - I think the expectation is that we should only be registering for the write event from epoll when necessary. I wouldn't expect CPU to spike. Please try and report back.

normanmaurer added a commit that referenced this issue Sep 15, 2015
…to write everything

ninja- referenced this issue Sep 15, 2015
…fied

Motivation:

We need to ensure the actual close of the transport takes place before the promise of the write that triggered it is notified. This is needed because otherwise Channel.isActive(), isOpen() and isWritable() may return true even though the Channel should already be closed.

Modifications:

- Ensure the close takes place first

Result:

ChannelFutureListener will see the correct state of the Channel.
normanmaurer added a commit that referenced this issue Sep 16, 2015
…to write everything

normanmaurer added a commit that referenced this issue Sep 16, 2015
…to write everything

normanmaurer added a commit that referenced this issue Sep 16, 2015
…to write everything

@ninja-
Author

ninja- commented Sep 19, 2015

This is insane... the bug is clearly FIXED for the first case, where entries only grow during the first few seconds of a connection. For that case the fix was clearly testable and easy to verify.
It doesn't fix the second case at all ("long uptime, high number of connections"), where, given a CONSTANT workload, the demand for entries (plus the buffers linked to them) grows to millions with uptime...
(7 million here, while the maximum batch between flushes is 30 packets)
While monitoring the number of entries, the pattern looks like: constant - constant - constant - then BOOM, an increase:

Sat Sep 19 19:10:52 CEST 2015
entries 978619

Sat Sep 19 19:11:33 CEST 2015
entries 978773

Sat Sep 19 19:11:51 CEST 2015
entries 980484

Sat Sep 19 19:14:17 CEST 2015
entries 982609

Sat Sep 19 19:14:21 CEST 2015
entries 991333

Sat Sep 19 19:14:23 CEST 2015
entries 992479

Sat Sep 19 19:14:27 CEST 2015

(this monitor reacts only when the number changes)

@normanmaurer should I investigate again from the start, or do you have a better idea where to look...?
There has to be another problem related to the case where it can't write everything at once; when I proxied 300 local bots there were never more than 200 $Entry instances in total.
To manage read() calls without autoread (which would make things even worse), I depend on the following behaviour:

[some packets waiting in the queue]
flush a batch of packets ->

    handle.write(Unpooled.EMPTY_BUFFER, promise);
    handle.flush();

(where the promise, when completed, calls read() on the other side of the proxied connection)

Is it correct to expect that the promise will only be completed after EVERYTHING written before it has reached the socket? Another place where I could put the .read() is the flush(ChannelHandlerContext) method, but I don't think that's a good place because it's just a "request" to flush.

(Note that when I opened this issue there was no packet batching in the code, so it would also writeAndFlush() every packet. Now it looks like I "just" need a better place to call read() on the downstream after making sure that EVERYTHING has been written?)
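For what it's worth, here is a minimal, self-contained sketch of the gating pattern I describe above (the class and method names are made up for illustration, this is not my actual proxy code):

    import io.netty.buffer.Unpooled;
    import io.netty.channel.Channel;
    import io.netty.channel.ChannelFuture;
    import io.netty.channel.ChannelFutureListener;
    import io.netty.channel.ChannelPromise;

    final class ReadGate {
        private ReadGate() { }

        // Flush whatever has been queued on the outbound channel and only ask
        // the inbound side for more data once all of it has hit the socket.
        static void flushThenRead(final Channel inbound, Channel outbound) {
            ChannelPromise promise = outbound.newPromise();
            promise.addListener(new ChannelFutureListener() {
                @Override
                public void operationComplete(ChannelFuture future) {
                    // Writes are ordered, so completion of this empty write means
                    // everything queued before it has been written (or has failed).
                    inbound.read();
                }
            });
            outbound.write(Unpooled.EMPTY_BUFFER, promise);
            outbound.flush();
        }
    }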

@ninja-
Author

ninja- commented Sep 19, 2015

OK, so in the end...
Point 1: the netty bugfix was only part of what was needed.
Point 2: autoread=false everywhere was needed (the starting point was no autoread=false at all), but my first take at it (manually driving read()) was too optimistic about when read() should be called as a fallback when it isn't scheduled elsewhere (by a write task), so in practice it behaved more like autoread=true.
Point 3: as if that wasn't enough, I still remember the Recycler infinity bug fixed a few months ago, which was also related to $Entry and was giving me hard crashes.

In all probability I can now say goodbye to the millions and get back to thousands, because the new version deployed to production already shows a clear change in behaviour :)

@Scottmitch
Member

> In all probability I can now say goodbye to the millions and get back to thousands, because the new version deployed to production already shows a clear change in behaviour :)

@ninja- SGTM. Thanks again for reporting. Please be sure to report back if you see more bad behavior.

@ninja-
Author

ninja- commented Sep 22, 2015

@Scottmitch over the past few days I audited all of the epoll transport, the Recycler, and the outbound buffer. Their logic is 100% fine. But I found a few things at this point that you may find interesting.

a) A batch of tiny packets (within one read, from ByteToMessageDecoder until channelReadComplete) is a killer for a proxy in terms of the number of outbound entries, which eventually grows to millions.
After sending bots through the proxy to a server with little activity the $Entry count stayed really small, but after sending them to a server with high activity the entries grew to, for example, 200k/2000k, even though read() is not called until the write() to the downstream has fully finished.

b) I tried limiting the number of packets read by the frame decoder in a single batch to, for example, 20/100 etc., writing them to the proxied connection, and waiting until they were fully written to the other side before reading the next batch. That was a disaster, as it increased the time to fully log in and receive the world data from near-instant to ~15s.

c) The part where the proxy waits with read() until the write() to the downstream has finished is, in my opinion, adding a bit of visible latency for the user. Is that supposed to happen, or is it the best that can be done to keep things relatively in sync? Should the autoread rule from the netty examples apply to all proxy applications, even where latency is quite important?

d) Since I found that small packets are a killer, I am thinking of creating a cumulator for them. That is, instead of writing 1000/2000/7000 etc. small entries to the outbound buffer, it would copy the data into a single buffer and write that. The downside is that it then couldn't use IovArray to write them in a batch, so it may or may not be a small performance regression.

The buffers used are PooledUnsafeDirectByteBuf, so that gives me an idea of how to cumulate them without copying, but it would be harder to implement, so maybe removing the copy() is not worth it.

    protected PoolChunk<T> chunk;
    protected long handle;
    protected T memory;
    protected int offset;
    protected int length;

So: create an object with PoolChunk[] + long[] + ByteBuffer[] etc., something like that; recycle the original buffers without deallocating, then create a holder with their memory handles and free them after the write.

But that aside, I think it would be simpler to hijack the latest entry in the outbound buffer and append some data to it, so that's what I am going to try now.

@Scottmitch
Member

@ninja- - Would you mind opening a new issue so it is easy to keep things separated?

@ninja-
Author

ninja- commented Sep 22, 2015

@Scottmitch actually, I finally solved the problem 100%, so I am "out of netty bugs".
The more low-level cumulator wasn't needed. Do you know what FINALLY brought the number of entries from UNLIMITED down to just ~200? Writing a cumulator that, instead of adding a new entry to the outbound buffer for every message, writes it into a cumulation buffer and only writes that cumulated buffer when flush() is triggered. With this enabled, no amount of stressing in my tests could push the number of entries past ~200-300-500.

https://gist.github.com/ninja-/91a13b2630210891d429 - MessageSquasher.java
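For anyone who doesn't want to open the gist, here is a rough sketch of the idea (a plain ChannelDuplexHandler that only squashes ByteBuf writes; this is simplified and is not the actual MessageSquasher code):

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelDuplexHandler;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelPromise;

    // Copy every outbound ByteBuf into one cumulation buffer and hand a single
    // buffer to the transport on flush(), so the ChannelOutboundBuffer sees
    // roughly one entry per flush instead of one entry per small packet.
    public class WriteCumulator extends ChannelDuplexHandler {
        private ByteBuf cumulation;

        @Override
        public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
            if (msg instanceof ByteBuf) {
                ByteBuf buf = (ByteBuf) msg;
                if (cumulation == null) {
                    cumulation = ctx.alloc().buffer();
                }
                cumulation.writeBytes(buf);
                buf.release();
                // Simplification: a real implementation should complete this
                // promise only when the cumulated buffer is actually flushed.
                promise.setSuccess();
            } else {
                ctx.write(msg, promise);
            }
        }

        @Override
        public void flush(ChannelHandlerContext ctx) throws Exception {
            if (cumulation != null && cumulation.isReadable()) {
                ctx.write(cumulation);
                cumulation = null;
            }
            ctx.flush();
        }

        @Override
        public void handlerRemoved(ChannelHandlerContext ctx) throws Exception {
            // Don't leak the cumulation buffer if the handler is removed mid-stream.
            if (cumulation != null) {
                cumulation.release();
                cumulation = null;
            }
        }
    }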

I am telling you this because over the past few months I've dragged the netty team into the fight against outbound entry spam, and among LOTS of other issues and bugs, the small packets were the last silent killer. So I think this battle is finally over, and I don't currently see any performance or latency regressions. As always, time will tell...

You may agree that this finding could be a good opportunity to improve netty itself :)

In that case, I would recommend to the netty team changing the lifecycle of objects between the time they are added to the outbound buffer and the time they are flushed. For example, there could be a hook (actually, the window is already there where the invoker's write() calls outboundBuffer.addEntry, so that could be overridden) for per-transport implementations that could deal with messages when they are added but not yet flushed. Such an implementation could choose to merge them (before a new entry is created) or override the entry.msg with its own magical object containing just the "memory addresses" of the merged buffers, etc. It would leave a nice window for improvements that could benefit everyone.

(What's harder to spot is that the entries take their buffers with them... that means the number of buffers also keeps increasing and new ones need to be allocated, so that's another way it could help "traditional" apps, in my opinion.)

It's up to you guys whether you think that's worth the time, but it's a way of doing this that should benefit more types of netty apps, and as always, microbenchmarks on this would be interesting :)
(and I know that the HTTP team especially loves microbenchmarks...)

@normanmaurer @trustin @Scottmitch @nmittler

If you think this improvement would be worth the time to code, maybe I could open a new issue with an "improvement" tag and we could have some fun implementing some per-transport stuff for it, or maybe make it a ChannelOption so a use case can be chosen?

@Scottmitch
Member

@ninja- - What was your write/flush strategy before the MessageSquasher? Were you using writeAndFlush for every write operation? Were you always using write and then invoking flush at some later point? How did your application's behavior change in this respect when you started using the MessageSquasher?

@ninja-
Author

ninja- commented Sep 23, 2015

@Scottmitch it was flushing in a batch equal to what came out of the ByteToMessageDecoder. That is, if it received 1000 packets (from the frame decoder), it then wrote those 1000 packets to the downstream and flushed them in channelReadComplete. Getting rid of the batching didn't help with the $Entry spam at all.

When using the MessageSquasher, the number of $Entries stopped growing under high traffic and can't go higher than ~200 in TOTAL; before that, it would quickly get out of control, to 200,000 easily and then to maybe millions.

Another change I made was to ByteToMessageDecoder. I changed it from

original:

  • read all messages (maybe 10000 small ones :((( )
  • fire channelRead events for each message in the decoded list
  • fire channelReadComplete

to:

  • loop reading messages
  • if out.size() > some threshold (for example 100), fire the channelRead events for them NOW and clear the list
  • continue decoding
  • fire channelReadComplete

I don't think that calling them instantly after decoding, instead of calling channelRead for the full list at the end, would be any degradation to performance.

https://github.com/netty/netty/blob/master/codec/src/main/java/io/netty/handler/codec/ByteToMessageDecoder.java#L245

This way the buffers can be reused more quickly, because once channelRead is over the message is usually released.
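As an illustration of the threshold idea above (names like MAX_BATCH are my own, nothing in netty):

    import io.netty.channel.ChannelHandlerContext;
    import java.util.List;

    final class BatchFire {
        static final int MAX_BATCH = 100; // assumed threshold, tune as needed

        private BatchFire() { }

        // Called between decode passes: if enough messages have piled up,
        // deliver them to the pipeline right away so their buffers can be
        // released and reused, instead of holding the whole batch until
        // channelReadComplete.
        static void maybeFire(ChannelHandlerContext ctx, List<Object> out) {
            if (out.size() >= MAX_BATCH) {
                for (Object decoded : out) {
                    ctx.fireChannelRead(decoded);
                }
                out.clear();
            }
        }
    }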

@Scottmitch
Member

@ninja- - Any way you can provide a reproducer for the original problem?

> I don't think that calling them instantly after decoding, instead of calling channelRead for the full list at the end, would be any degradation to performance.

What are you referring to when you say "calling them"?

> if out.size() > some threshold (for example 100), fire the channelRead events for them NOW and clear the list
> continue decoding

What was the mechanism you used to do this?

@ninja-
Author

ninja- commented Sep 25, 2015

"calling them" to channelRead() in this meaning. allow the buffers to be reused more quickly AS it's usually released afer handling. But that's something that completes the way MessageSquasher work otherwise there would be no improvement

Well the reproducer....I wouldn't call that a bug. Just a simple proxy + a frame decoder that decodes frames into buffers + a spam of small Packers in one read

@Scottmitch
Member

@ninja- - See #4275

@ninja-
Author

ninja- commented Sep 25, 2015

@Scottmitch hm, I don't think it's related; @normanmaurer found the #4275 problem somewhere else. But if you decide to change ByteToMessageDecoder so it fires channelRead(msg) one by one instead of in a batch, maybe it would also help with #4275.

@Scottmitch
Member

@ninja- - What I was wondering is how you call fireChannelRead after each decoded message? The API just exposes a list to decode into, and users are free to add as many messages as they want. Did you change the interfaces, put a max capacity on the list, or something else? Would you care to submit a PR so it will be easier to visualize?

@ninja-
Author

ninja- commented Sep 25, 2015

@Scottmitch I copied the original ByteToMessageDecoder into the io.netty.channel package under a new name, with that package living inside the application. So

https://github.com/netty/netty/blob/master/codec/src/main/java/io/netty/handler/codec/ByteToMessageDecoder.java#L244

should happen more or less after each callDecode(...). @normanmaurer is evaluating whether this could (negatively) affect performance for some people.

@Scottmitch
Member

> https://github.com/netty/netty/blob/master/codec/src/main/java/io/netty/handler/codec/ByteToMessageDecoder.java#L244
>
> should happen more or less after each callDecode(...).

I'm a bit slow on this one... the line you reference should be called after each callDecode? Like I said, if you submit a PR (if only to clarify the idea) it should help me understand.

@normanmaurer
Member

@Scottmitch I will check and come back to you.

@ninja-
Author

ninja- commented Sep 25, 2015

@Scottmitch yeah, actually I pointed at the wrong place. At the moment I am using a hack that fires it inside decode(), so it's more like this.

Inside the frame decoder:

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) throws Exception {
        ByteBuf frame = decode(ctx, in);
        if (frame != null) {
            ctx.fireChannelRead(frame); // instant fire, reuse buffers quicker :)
            if (out.isEmpty()) {
                out.add(Unpooled.EMPTY_BUFFER); // to get the read() correctly
            }
        }
    }

    // not an @Override - this overload is specific to this decoder
    protected ByteBuf decode(ChannelHandlerContext ctx, ByteBuf in) throws Exception {
        // decode a frame here
        return null; // placeholder, actual frame decoding elided
    }

so I am not changing the original class.

@Scottmitch
Member

@ninja- - Thanks for clarifying. Please open a new issue so we can track.

@Scottmitch
Member

@ninja- @normanmaurer - FYI #4284

pulllock pushed a commit to pulllock/netty that referenced this issue Oct 19, 2023
…able to write everything
