
[FIX INCLUDED] Improper use of the "write" flag causes flush() to be ignored, increasing used resources #4205

Closed
ninja- opened this issue Sep 10, 2015 · 37 comments

@ninja-

ninja- commented Sep 10, 2015

Scenario:
netty 4.1
user calls writeAndFlush(packet)

Expected behaviour:

the packet is ALWAYS written and flushed...
(want a unit test in the future? stress writeAndFlush() + check that outbound buffer is empty)

What can happen:

the explicit flush can be ignored because of improper use of the OP_WRITE/EPOLLOUT flag
(it only gets picked up later by the event loop)
it's a pretty serious bug because it increases resource usage (5 million ChannelOutboundBuffer$Entry instances, anyone?)
please note that channel.isWritable() etc. is NOT involved in this

The flag is not always turned on when there is still work to do; it's the same problem on the epoll and NIO backends.

Example (confirmed working) patch:

Index: transport-native-epoll/src/main/java/io/netty/channel/epoll/AbstractEpollStreamChannel.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- transport-native-epoll/src/main/java/io/netty/channel/epoll/AbstractEpollStreamChannel.java (revision 250a09df635d70853e1576a9e522c846e918938e)
+++ transport-native-epoll/src/main/java/io/netty/channel/epoll/AbstractEpollStreamChannel.java (revision )
@@ -389,6 +389,8 @@
                 // Wrote all messages.
                 clearFlag(Native.EPOLLOUT);
                 break;
+            } else {
+                setFlag(Native.EPOLLOUT);
             }

NIO has the same behaviour, and this line in NioSocketChannel can be blamed:
https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/socket/nio/NioSocketChannel.java#L271
or rather https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/socket/nio/NioSocketChannel.java#L283 should also set OP_WRITE.
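For reference, the NIO-side equivalent of the epoll patch above is simply "keep OP_WRITE registered while data is still pending". A tiny, self-contained illustration in plain java.nio (this is just the idea, not netty's internal helper):

    import java.nio.channels.SelectionKey;

    final class OpWriteHelper {
        private OpWriteHelper() { }

        // Keep the channel registered for writability notifications so the
        // event loop comes back later and flushes whatever is still pending.
        static void ensureOpWrite(SelectionKey key) {
            int interestOps = key.interestOps();
            if ((interestOps & SelectionKey.OP_WRITE) == 0) {
                key.interestOps(interestOps | SelectionKey.OP_WRITE);
            }
        }
    }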

I hope this gives you an idea of what is wrong and you can think of a "better" way to fix it (I will also look for such a way in the meantime).

@ninja-
Author

ninja- commented Sep 10, 2015

I confirmed in production that this brings the number of $Entry instances back to ~1 per thread instead of thousands or millions...

@normanmaurer
Member

@ninja- actually the fix is not correct, as it will cause unnecessary CPU usage. That said, there is a bug and I'm working on a fix as we speak.

@normanmaurer normanmaurer self-assigned this Sep 10, 2015
@normanmaurer normanmaurer added this to the 4.0.32.Final milestone Sep 10, 2015
@ninja-
Author

ninja- commented Sep 10, 2015

💛 remember to bump the 4.1 beta release after you're done
can't live without the recycler_handle.recycle API, so I am permanently on 4.1 :P

@Scottmitch
Member

@ninja- - Thanks for reporting and pointing us in the right direction!

@ninja-
Author

ninja- commented Sep 10, 2015

@Scottmitch :-) I'm punching every bug in my way; I hope this is going to be the last one for some time!
By the way, no luck trying to write a unit test 😞 it would always pass the isEmpty() check even under stress (maybe it would need a remote channel).

@ninja-
Author

ninja- commented Sep 10, 2015

@normanmaurer note that I wasn't able to get flush() to break outside the first few seconds after a player connects; maybe that gives you a tip.
FYI, I evaluated whether always setting EPOLLOUT in doConnect would fix the problem, but it doesn't.

@Scottmitch
Member

@ninja- - @normanmaurer has a PR pending approval. It will be pushed upstream soon. When it is upstream we will ask you to verify the fix :)

@ninja-
Author

ninja- commented Sep 10, 2015

@Scottmitch 😍 it takes me 1 minute to verify the fix after it's committed somewhere.

@ninja-
Author

ninja- commented Sep 10, 2015

Honestly, I would just wrap doWriteSingle and doWriteMultiple in a while(...). It's not like there is a better way - we can't just return from the function before everything is written. No idea what Norman did, because I guess you guys are handling this PR in a private repo or something?

@normanmaurer
Member

@ninja- just have a bit of patience... a PR with the fix will be upstream soon.

@WhiteTrashLord

Does this also affect 4.0? I'm seeing too high CPU usage with 4.0 :(

@ninja-
Author

ninja- commented Sep 11, 2015

@WhiteTrashLord yeah, it's actually a very old bug. I tried fixing it myself, but my fix doubled CPU usage (while fixing the outbound entry count), so I quickly reverted. I doubt this bug increases CPU usage though, so it's probably not your main problem.

@normanmaurer any news, or are you going to finish this after the weekend?

@Scottmitch
Member

@WhiteTrashLord - Yes, it impacts 4.0. This bug is focused on making sure Netty registers for a channel-writable event when the channel is unable to accept all the data we try to write (one case was missed). It would likely manifest itself first as memory issues rather than CPU. If you are able to diagnose other unexpected Netty behavior, feel free to open another issue.

@ninja-
Author

ninja- commented Sep 11, 2015

@Scottmitch so you guys are working on it outside of the public repo or something?

@ninja-
Author

ninja- commented Sep 14, 2015

@normanmaurer @Scottmitch any update after the weekend?

normanmaurer added a commit that referenced this issue Sep 14, 2015
…to write everything

Motivation:

writeBytes(...) failed to set the EPOLLOUT flag when not all bytes were written. This could lead to the EpollEventLoop not trying to flush the remaining bytes once the socket becomes writable again.

Modifications:

- Move the logic for setting the EPOLLOUT flag to one place so we are sure it always happens.
- Move the OP_WRITE flag logic to one place as well.

Result:

Correctly try to write pending data if socket becomes writable again.
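To visualize the "one point" approach the commit message describes, here is a rough, illustrative fragment (not the actual upstream patch), reusing the setFlag/clearFlag helpers already shown in the diff earlier in this thread:

    // Illustrative only: centralize the EPOLLOUT decision so every write path
    // runs the same check after attempting to write.
    private void adjustEpollOut(boolean allDataWritten) {
        if (allDataWritten) {
            clearFlag(Native.EPOLLOUT);  // nothing pending: stop writability notifications
        } else {
            setFlag(Native.EPOLLOUT);    // data left: retry once the socket is writable again
        }
    }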
@normanmaurer
Member

@ninja- #4216 should fix it

@ninja-
Author

ninja- commented Sep 14, 2015

@normanmaurer I'll deploy it tomorrow and check. I tried something similar and CPU usage rocketed... is your fix correct in that regard? Is it sane to expect a MAJOR drop in $Entry count from 5M?

@Scottmitch
Member

@ninja- - I think the expectation is that we should only be registering for the write event from epoll when necessary. I wouldn't expect CPU to spike. Please try and report back.

normanmaurer added a commit that referenced this issue Sep 15, 2015
…to write everything

ninja- referenced this issue Sep 15, 2015
…fied

Motivation:

We need to ensure the actual close of the transport takes place before the promise of the write that triggered it is notified. This is needed because otherwise Channel.isActive(), isOpen() and isWritable() may return true even though the Channel should already be closed.

Modifications:

- Ensure the close takes place first

Result:

ChannelFutureListener will see the correct state of the Channel.
normanmaurer added a commit that referenced this issue Sep 16, 2015
…to write everything

normanmaurer added a commit that referenced this issue Sep 16, 2015
…to write everything

normanmaurer added a commit that referenced this issue Sep 16, 2015
…to write everything

@ninja-
Author

ninja- commented Sep 19, 2015

This is insane... the bug is clearly FIXED for the first case, where entries only grow during the first few seconds of a connection. For that case the fix was clearly testable and easy to verify.
It doesn't fix the second case at all ("long uptime, high number of connections"), where, given a CONSTANT workload, the demand for entries (plus the buffers linked to them) grows to millions with uptime...
(7 million here, while the maximum batch between flushes is 30 packets)
While monitoring the number of entries, the pattern looks like: constant - constant - constant - then BOOM, an increase:

Sat Sep 19 19:10:52 CEST 2015
entries 978619

Sat Sep 19 19:11:33 CEST 2015
entries 978773

Sat Sep 19 19:11:51 CEST 2015
entries 980484

Sat Sep 19 19:14:17 CEST 2015
entries 982609

Sat Sep 19 19:14:21 CEST 2015
entries 991333

Sat Sep 19 19:14:23 CEST 2015
entries 992479

Sat Sep 19 19:14:27 CEST 2015

(this monitor reacts only when the number changes)

@normanmaurer should I investigate again from the start, or do you have a better idea where to look...?
There has to be another problem related to the case where it can't write everything at once; when I proxied 300 local bots there were never more than 200 $Entry instances in total.
To manage read() calls without autoread (which would make things even worse), I depend on the following behaviour:

[some packets waiting in the queue]
flush a batch of packets ->

    handle.write(Unpooled.EMPTY_BUFFER, promise);
    handle.flush();

(where the promise, when completed, calls read() on the other side of the proxied connection)

Is it correct to expect that the promise will only be completed after EVERYTHING written before it has reached the socket? Another place where I could put the .read() is the flush(ChannelHandlerContext) method, but I don't think that's a good place because it's just a "request" to flush.

(Note that when I opened this issue there was no packet batching in the code, so it would also writeAndFlush() every packet. Now it looks like I "just" need a better place to call read() on the downstream after making sure that EVERYTHING has been written?)
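For what it's worth, here is a minimal, self-contained sketch of the gating pattern I describe above (the class and method names are made up for illustration, this is not my actual proxy code):

    import io.netty.buffer.Unpooled;
    import io.netty.channel.Channel;
    import io.netty.channel.ChannelFuture;
    import io.netty.channel.ChannelFutureListener;
    import io.netty.channel.ChannelPromise;

    final class ReadGate {
        private ReadGate() { }

        // Flush whatever has been queued on the outbound channel and only ask
        // the inbound side for more data once all of it has hit the socket.
        static void flushThenRead(final Channel inbound, Channel outbound) {
            ChannelPromise promise = outbound.newPromise();
            promise.addListener(new ChannelFutureListener() {
                @Override
                public void operationComplete(ChannelFuture future) {
                    // Writes are ordered, so completion of this empty write means
                    // everything queued before it has been written (or has failed).
                    inbound.read();
                }
            });
            outbound.write(Unpooled.EMPTY_BUFFER, promise);
            outbound.flush();
        }
    }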

@ninja-
Author

ninja- commented Sep 19, 2015

OK, so in the end...
Point 1: the netty bugfix was only part of what was needed.
Point 2: autoread=false everywhere was needed (the starting point was no autoread=false at all), but my first take at it (manually driving read()) was too optimistic about when read() should be called as a fallback when it isn't scheduled elsewhere (by a write task), so in practice it behaved more like autoread=true.
Point 3: as if that wasn't enough, I still remember the Recycler infinity bug fixed a few months ago, which was also related to $Entry and was giving me hard crashes.

In all probability I can now say goodbye to the millions and get back to thousands, because the new version deployed to production already shows a clear change in behaviour :)

@Scottmitch
Member

> In all probability I can now say goodbye to the millions and get back to thousands, because the new version deployed to production already shows a clear change in behaviour :)

@ninja- SGTM. Thanks again for reporting. Please be sure to report back if you see more bad behavior.

@ninja-
Author

ninja- commented Sep 22, 2015

@Scottmitch over the past few days I audited all of the epoll transport, the Recycler, and the outbound buffer. Their logic is 100% fine. But I found a few things at this point that you may find interesting.

a) A batch of tiny packets (within one read, from ByteToMessageDecoder until channelReadComplete) is a killer for a proxy in terms of the number of outbound entries, which eventually grows to millions.
After sending bots through the proxy to a server with little activity the $Entry count stayed really small, but after sending them to a server with high activity the entries grew to, for example, 200k/2000k, even though read() is not called until the write() to the downstream has fully finished.

b) I tried limiting the number of packets read by the frame decoder in a single batch to, for example, 20/100 etc., writing them to the proxied connection, and waiting until they were fully written to the other side before reading the next batch. That was a disaster, as it increased the time to fully log in and receive the world data from near-instant to ~15s.

c) The part where the proxy waits with read() until the write() to the downstream has finished is, in my opinion, adding a bit of visible latency for the user. Is that supposed to happen, or is it the best that can be done to keep things relatively in sync? Should the autoread rule from the netty examples apply to all proxy applications, even where latency is quite important?

d) Since I found that small packets are a killer, I am thinking of creating a cumulator for them. That is, instead of writing 1000/2000/7000 etc. small entries to the outbound buffer, it would copy the data into a single buffer and write that. The downside is that it then couldn't use IovArray to write them in a batch, so it may or may not be a small performance regression.

The buffers used are PooledUnsafeDirectByteBuf, so that gives me an idea of how to cumulate them without copying, but it would be harder to implement, so maybe removing the copy() is not worth it.

    protected PoolChunk<T> chunk;
    protected long handle;
    protected T memory;
    protected int offset;
    protected int length;

So: create an object with PoolChunk[] + long[] + ByteBuffer[] etc., something like that; recycle the original buffers without deallocating, then create a holder with their memory handles and free them after the write.

But that aside, I think it would be simpler to hijack the latest entry in the outbound buffer and append some data to it, so that's what I am going to try now.

@Scottmitch
Member

@ninja- - Would you mind opening a new issue so it is easy to keep things separated?

@ninja-
Author

ninja- commented Sep 22, 2015

@Scottmitch actually, I finally solved the problem 100%, so I am "out of netty bugs".
The more low-level cumulator wasn't needed. Do you know what FINALLY brought the number of entries from UNLIMITED down to just ~200? Writing a cumulator that, instead of adding a new entry to the outbound buffer for every message, writes it into a cumulation buffer and only writes that cumulated buffer when flush() is triggered. With this enabled, no amount of stressing in my tests could push the number of entries past ~200-300-500.

https://gist.github.com/ninja-/91a13b2630210891d429 - MessageSquasher.java
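For anyone who doesn't want to open the gist, here is a rough sketch of the idea (a plain ChannelDuplexHandler that only squashes ByteBuf writes; this is simplified and is not the actual MessageSquasher code):

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelDuplexHandler;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelPromise;

    // Copy every outbound ByteBuf into one cumulation buffer and hand a single
    // buffer to the transport on flush(), so the ChannelOutboundBuffer sees
    // roughly one entry per flush instead of one entry per small packet.
    public class WriteCumulator extends ChannelDuplexHandler {
        private ByteBuf cumulation;

        @Override
        public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
            if (msg instanceof ByteBuf) {
                ByteBuf buf = (ByteBuf) msg;
                if (cumulation == null) {
                    cumulation = ctx.alloc().buffer();
                }
                cumulation.writeBytes(buf);
                buf.release();
                // Simplification: a real implementation should complete this
                // promise only when the cumulated buffer is actually flushed.
                promise.setSuccess();
            } else {
                ctx.write(msg, promise);
            }
        }

        @Override
        public void flush(ChannelHandlerContext ctx) throws Exception {
            if (cumulation != null && cumulation.isReadable()) {
                ctx.write(cumulation);
                cumulation = null;
            }
            ctx.flush();
        }

        @Override
        public void handlerRemoved(ChannelHandlerContext ctx) throws Exception {
            // Don't leak the cumulation buffer if the handler is removed mid-stream.
            if (cumulation != null) {
                cumulation.release();
                cumulation = null;
            }
        }
    }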

I am telling you this because over the past few months I've dragged the netty team into the fight against outbound entry spam, and among LOTS of other issues and bugs, the small packets were the last silent killer. So I think this battle is finally over, and I don't currently see any performance or latency regressions. As always, time will tell...

You may agree that this finding could be a good opportunity to improve netty itself :)

In that case, I would recommend to the netty team changing the lifecycle of objects between the time they are added to the outbound buffer and the time they are flushed. For example, there could be a hook (actually, the window is already there where the invoker's write() calls outboundBuffer.addEntry, so that could be overridden) for per-transport implementations that could deal with messages when they are added but not yet flushed. Such an implementation could choose to merge them (before a new entry is created) or override the entry.msg with its own magical object containing just the "memory addresses" of the merged buffers, etc. It would leave a nice window for improvements that could benefit everyone.

(What's harder to spot is that the entries take their buffers with them... that means the number of buffers also keeps increasing and new ones need to be allocated, so that's another way it could help "traditional" apps, in my opinion.)

It's up to you guys whether you think that's worth the time, but it's a way of doing this that should benefit more types of netty apps, and as always, microbenchmarks on this would be interesting :)
(and I know that the HTTP team especially loves microbenchmarks...)

@normanmaurer @trustin @Scottmitch @nmittler

If you think this improvement would be worth the time to code, maybe I could open a new issue with an "improvement" tag and we could have some fun implementing some per-transport stuff for it, or maybe make it a ChannelOption so a use case can be chosen?

@Scottmitch
Member

@ninja- - What was your write/flush strategy before the MessageSquasher? Were you using writeAndFlush for every write operation? Were you always using write and then invoking flush at some later point? How did your application's behavior change in this respect when you started using the MessageSquasher?

@ninja-
Author

ninja- commented Sep 23, 2015

@Scottmitch it was flushing in a batch equal to what came out of the ByteToMessageDecoder. That is, if it received 1000 packets (from the frame decoder), it then wrote those 1000 packets to the downstream and flushed them in channelReadComplete. Getting rid of the batching didn't help with the $Entry spam at all.

When using the MessageSquasher, the number of $Entries stopped growing under high traffic and can't go higher than ~200 in TOTAL; before that, it would quickly get out of control, to 200,000 easily and then to maybe millions.

Another change I made was to ByteToMessageDecoder. I changed it from

original:

  • read all messages (maybe 10000 small ones :((( )
  • fire channelRead events for each message in the decoded list
  • fire channelReadComplete

to:

  • loop reading messages
  • if out.size() > some threshold (for example 100), fire the channelRead events for them NOW and clear the list
  • continue decoding
  • fire channelReadComplete

I don't think that calling them instantly after decoding, instead of calling channelRead for the full list at the end, would be any degradation to performance.

https://github.com/netty/netty/blob/master/codec/src/main/java/io/netty/handler/codec/ByteToMessageDecoder.java#L245

This way the buffers can be reused more quickly, because once channelRead is over the message is usually released.
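As an illustration of the threshold idea above (names like MAX_BATCH are my own, nothing in netty):

    import io.netty.channel.ChannelHandlerContext;
    import java.util.List;

    final class BatchFire {
        static final int MAX_BATCH = 100; // assumed threshold, tune as needed

        private BatchFire() { }

        // Called between decode passes: if enough messages have piled up,
        // deliver them to the pipeline right away so their buffers can be
        // released and reused, instead of holding the whole batch until
        // channelReadComplete.
        static void maybeFire(ChannelHandlerContext ctx, List<Object> out) {
            if (out.size() >= MAX_BATCH) {
                for (Object decoded : out) {
                    ctx.fireChannelRead(decoded);
                }
                out.clear();
            }
        }
    }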

@Scottmitch
Member

@ninja- - Any way you can provide a reproducer for the original problem?

> I don't think that calling them instantly after decoding, instead of calling channelRead for the full list at the end, would be any degradation to performance.

What are you referring to when you say "calling them"?

> if out.size() > some threshold (for example 100), fire the channelRead events for them NOW and clear the list
> continue decoding

What was the mechanism you used to do this?

@ninja-
Author

ninja- commented Sep 25, 2015

"calling them" to channelRead() in this meaning. allow the buffers to be reused more quickly AS it's usually released afer handling. But that's something that completes the way MessageSquasher work otherwise there would be no improvement

Well the reproducer....I wouldn't call that a bug. Just a simple proxy + a frame decoder that decodes frames into buffers + a spam of small Packers in one read

@Scottmitch
Member

@ninja- - See #4275

@ninja-
Author

ninja- commented Sep 25, 2015

@Scottmitch hm, I don't think it's related; @normanmaurer found the #4275 problem somewhere else. But if you decide to change ByteToMessageDecoder so it fires channelRead(msg) one by one instead of in a batch, maybe it would also help with #4275.

@Scottmitch
Member

@ninja- - What I was wondering is how you call fireChannelRead after each decoded message? The API just exposes a list to decode into, and users are free to add as many messages as they want. Did you change the interfaces, put a max capacity on the list, or something else? Would you care to submit a PR so it will be easier to visualize?

@ninja-
Author

ninja- commented Sep 25, 2015

@Scottmitch I copied the original ByteToMessageDecoder into the io.netty.channel package under a new name, with that package living inside the application. So

https://github.com/netty/netty/blob/master/codec/src/main/java/io/netty/handler/codec/ByteToMessageDecoder.java#L244

should happen more or less after each callDecode(...). @normanmaurer is evaluating whether this could (negatively) affect performance for some people.

@Scottmitch
Member

> https://github.com/netty/netty/blob/master/codec/src/main/java/io/netty/handler/codec/ByteToMessageDecoder.java#L244
>
> should happen more or less after each callDecode(...).

I'm a bit slow on this one... the line you reference should be called after each callDecode? Like I said, if you submit a PR (if only to clarify the idea) it should help me understand.

@normanmaurer
Member

@Scottmitch I will check and come back to you.

@ninja-
Author

ninja- commented Sep 25, 2015

@Scottmitch yeah, actually I pointed at the wrong place. At the moment I am using a hack that fires it inside decode(), so it's more like this.

Inside the frame decoder:

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) throws Exception {
        ByteBuf frame = decode(ctx, in);
        if (frame != null) {
            ctx.fireChannelRead(frame); // instant fire, reuse buffers quicker :)
            if (out.isEmpty()) {
                out.add(Unpooled.EMPTY_BUFFER); // to get the read() correctly
            }
        }
    }

    // not an @Override - this overload is specific to this decoder
    protected ByteBuf decode(ChannelHandlerContext ctx, ByteBuf in) throws Exception {
        // decode a frame here
        return null; // placeholder, actual frame decoding elided
    }

so I am not changing the original class.

@Scottmitch
Member

@ninja- - Thanks for clarifying. Please open a new issue so we can track.

@Scottmitch
Member

@ninja- @normanmaurer - FYI #4284

pulllock pushed a commit to pulllock/netty that referenced this issue Oct 19, 2023
…able to write everything
