Memory leak in latest netty version. #6221
Comments
@doom369 interesting find, keep us updated.

@doom369 just a guess... can you try to set:
I started 3 different servers with different params: 1 (epoll + no openssl) - For 2 hours all servers have been up and running without any suspicious memory consumption. However, 2 hours is not yet enough to reproduce the previous behavior.

Just now server 3 died (11 hours passed). The funny thing is that it has 4 times less load than server 1 and 2 times less load than server 2. Netty LEAK detection doesn't show anything. Error: I did a quick restart of this server with only:
and it died again within 5 minutes. So I turned off OpenSSL and ran again with
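For context, the "LEAK detection" mentioned above is Netty's built-in ResourceLeakDetector; a minimal sketch of turning it up to its most aggressive level while hunting a suspected buffer leak (standard Netty API, not code from this thread):

```java
import io.netty.util.ResourceLeakDetector;

public final class LeakDetectionSetup {
    public static void main(String[] args) {
        // PARANOID samples every allocated buffer and records touch points;
        // it is expensive, so only enable it while chasing a leak.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

        // Equivalent to starting the JVM with:
        //   -Dio.netty.leakDetectionLevel=paranoid
    }
}
```

Note that the leak detector only reports ByteBufs that were never released; it stays silent when memory is simply consumed faster than expected, which matches what is eventually found later in this thread.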
@doom369 so what exact config does not produce the problem and which one does? I am a bit confused... also, could you please try to upgrade in more incremental steps so we can find out at which version this starts to happen?
@doom369 and a heap dump would be nice

I'll try on the next failure; I hope it will not appear soon :).
Interesting... so whenever you use OpenSSL it blows up? And whenever you use JDK SSL it never does?

This is what it looks like to me too... #6222 (comment)
Correct (latest Netty and Fork25 of netty-tcnative). What I can say for sure is that the situation became worse with the update.
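Since the suspicion at this point is the OpenSSL engine versus JDK SSL, the cheapest A/B test is to switch the provider on the SslContext. A minimal sketch with placeholder certificate and key paths (not the configuration from this issue):

```java
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;

import java.io.File;

public final class SslProviderSwitch {

    // useOpenSsl = true  -> netty-tcnative / OpenSSL engine (the one under suspicion here)
    // useOpenSsl = false -> plain JDK SSLEngine
    public static SslContext serverContext(boolean useOpenSsl) throws Exception {
        // Placeholder paths; replace with the real certificate chain and private key.
        File certChain = new File("/path/to/cert-chain.pem");
        File privateKey = new File("/path/to/private-key.pem");
        return SslContextBuilder.forServer(certChain, privateKey)
                .sslProvider(useOpenSsl ? SslProvider.OPENSSL : SslProvider.JDK)
                .build();
    }
}
```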
So you are also using Netty's SOCKS code? Is it possible to try a test scenario without SSL to rule that out (don't do this in production or with real user data if that doesn't make sense for your scenario)? Providing a reproducer would also help.
@Scottmitch no. #5723 seems very similar to the issue I had. A few more details: I migrated from 4.0.37 to 4.1.4-Final about 6 months ago (epoll + OpenSSL). All was fine (servers were running for weeks). However, ~1 month ago one of the servers went down with an OOM. That is when I found #5723, and it was very similar to what I saw in the logs (I also attached a heap dump screenshot from the problematic instance there). It looks like some new scenarios on my servers pulled some triggers. Servers started to die more and more often. I ran many tests with high-load scenarios in the test environment, but with no luck.
Hm... I just remembered that I did all my tests without OpenSSL when trying to reproduce the prod issue. Let me check again.

@doom369 please report back...
@normanmaurer @Scottmitch So here are my findings so far: I made a simple test that creates 400 users (opens 400 SSL keep-alive connections) and 400 hardware devices (opens 400 plain TCP/IP keep-alive connections), 800 in total. All hardware connections send 1 message in a loop. The pipeline on the server delivers those messages to the corresponding user, so this is like a 1-to-1 chat for 400 users. I ran the same test on a few configurations: epoll + OpenSSL failed with OOM (1-2 minutes). The test creates a very high request rate (although bandwidth is low, ~10 Mbps), so in all tests some of the 800 connections are dropped; on average only 400-500 connections survived. With a low request rate I'm not able to reproduce the OOM. On prod the load is 10 times lower, so the issue is for sure not in the request rate. Test code: I also tried to use
but in both cases I get an NPE. Advanced-mode memory leak detection shows nothing in all cases.
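The actual test code was linked rather than inlined, so the following is only a rough sketch of the kind of client bootstrap such a test uses; the host, port, and handler are placeholders, not taken from the issue:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public final class KeepAliveClientSketch {
    public static void main(String[] args) throws Exception {
        EventLoopGroup group = new NioEventLoopGroup();
        try {
            Bootstrap bootstrap = new Bootstrap()
                    .group(group)
                    .channel(NioSocketChannel.class)
                    .option(ChannelOption.SO_KEEPALIVE, true)
                    .handler(new ChannelInitializer<Channel>() {
                        @Override
                        protected void initChannel(Channel ch) {
                            // A real test would add an SslHandler for the "user" connections,
                            // the protocol codecs, and a handler that writes messages in a loop.
                            ch.pipeline().addLast(new ChannelInboundHandlerAdapter());
                        }
                    });
            // Open 400 keep-alive connections against a placeholder endpoint.
            for (int i = 0; i < 400; i++) {
                bootstrap.connect("127.0.0.1", 8443).sync();
            }
            Thread.sleep(60_000); // keep the connections open while the server is observed
        } finally {
            group.shutdownGracefully();
        }
    }
}
```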
@doom369 super strange... can I have a heap dump now :)? I will also fix the NPE, as this is not expected. In fact, the epoll transport requires sun.misc.Unsafe and should fail with a different exception if Unsafe is not present.
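As background for the Unsafe remark, both preconditions can be checked up front with standard Netty API (PlatformDependent is an internal class, so treat this as a diagnostic only; it is not code from the thread):

```java
import io.netty.channel.epoll.Epoll;
import io.netty.util.internal.PlatformDependent;

public final class NativeTransportCheck {
    public static void main(String[] args) {
        // True only if Netty could get access to sun.misc.Unsafe.
        System.out.println("hasUnsafe: " + PlatformDependent.hasUnsafe());

        // True only if the native epoll transport library loaded successfully.
        System.out.println("epoll available: " + Epoll.isAvailable());
        if (!Epoll.isAvailable()) {
            // Explains why the native transport cannot be used (missing lib, wrong OS, ...).
            Epoll.unavailabilityCause().printStackTrace();
        }
    }
}
```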
Sure :). These are the thread dumps for test case 1 (epoll + OpenSSL): https://www.dropbox.com/s/efytgymi202u81k/start.bin?dl=0 (70 MB). One more thing: all tests were done with Ubuntu 14.04 x64
@doom369 I guess there is no way to provide a reproducer?
@doom369 awesome let me try :)
@normanmaurer - IIRC the kqueue PR does this already

@Scottmitch cool, anyway let us fix this as a separate PR (working on it)
@doom369 what else do I need to install? Seems like at least Redis... anything else?
@normanmaurer no need for Redis. Sorry, my bad. Please replace the string
@doom369 ok... I see a lot of these. Is this something that is expected?
Yeah, that's fine.

@doom369 ok, it's running now... so the OOME should happen on the server, I guess?
Correct.

Interesting, any chance this is related to #6249?
With the help of @doom369 I was able to track down the change that is responsible for this "regression". #6252 should fix it. The commit message of #6252 should give you a better idea of what happened, so I will not repeat it here, but to make it short: it's not a memory leak, just a change in how much memory is used by our custom SSLEngine implementation. @rkapsi @doom369 can you please check the PR and let me know... Thanks again to @doom369 for all the help tracking this down. Without you this would not have been possible, or it would have taken way longer.
@normanmaurer I checked again with your PR from scratch. Everything seems fine; not reproducible anymore. I also ran all of the last tests with
@doom369 thanks a lot!

Let me re-open until the PR is merged.
Great work @normanmaurer! +1 on the big thanks to @doom369 for the debug support!

Fixed by #6252
@normanmaurer I was able to reproduce the issue after a 24-hour run in production. Please reopen the ticket. OpenSSL is enabled, no additional options. This is a screenshot of the heap after the OOM started:
I had no issue on this instance while OpenSSL was disabled and
Can I get a dump again?
@normanmaurer sure, sent the link via email.
Thanks... also, which commit are you on?
@normanmaurer the latest one, 9077269. Yesterday I pulled all changes and did a build and deploy.
@doom369 I did not have time yet to investigate in detail, but it seems the dumps are not up anymore and I missed downloading them in time as I was busy. Could you please re-upload them and make sure they do not "time out"?
@normanmaurer done, see your email.

@doom369 will ping you tomorrow morning and see if we can find out what's wrong.
Ok. Thanks for looking into it.
Increasing the JVM heap from 96M (on a 700M machine) to 96*3MB (2100M) is working for me, at least temporarily.

Jan 25, 2019 9:53:48 PM com.twitter.finagle.netty4.channel.ChannelExceptionHandler exceptionCaught
WARNING: Unhandled exception in connection with /10.1.18.1:55388, shutting down connection
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 1048576 byte(s) of direct memory (used: 94371847, max: 95158272)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:640)
    at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:594)
    at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764)
    at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740)
    at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:214)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
    at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176)
    at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:137)
    at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:147)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
    at java.lang.Thread.run(Thread.java:748)
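For anyone hitting the same io.netty.util.internal.OutOfDirectMemoryError: the "used/max" counters in that message come from Netty's own direct-memory accounting. The ceiling defaults to the JVM's direct-memory limit (which, when -XX:MaxDirectMemorySize is not set, usually follows the maximum heap size, presumably why raising the heap above also raised the cap) and can be overridden with -Dio.netty.maxDirectMemory. A small diagnostic sketch, using an internal Netty class purely for inspection:

```java
import io.netty.util.internal.PlatformDependent;

public final class DirectMemoryLimitCheck {
    public static void main(String[] args) {
        // Netty's estimate of the JVM direct-memory limit (normally -XX:MaxDirectMemorySize).
        // Netty's allocation counter uses this value as its ceiling unless
        // -Dio.netty.maxDirectMemory overrides it.
        System.out.println("max direct memory: " + PlatformDependent.maxDirectMemory() + " bytes");
    }
}
```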

After a recent update to 4.1.7-Final (from 4.1.4-Final) my servers started dying with OOM within a few hours. Before, they were running for weeks with no issues.
Error:
Or:
I restarted and took heap dumps before the abnormal memory consumption and after the first error messages from above:
This screenshot shows the difference between the heap right after server start (17% of the instance's RAM) and at the first OOM in the logs (31% of the instance's RAM). Instance RAM is 2 GB. So it looks like all direct memory was consumed (468 MB), while the heap itself takes less than the direct buffers. Load on the server is pretty low: 900 req/sec with ~600 active connections. CPU consumption is only ~15%.
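When the heap dump itself is inconclusive like this (direct buffers dominate while the heap stays small), dumping the pooled allocator's own statistics can show which arenas are holding the memory. A minimal sketch, assuming the default PooledByteBufAllocator is in use (dumpStats() is available on current 4.1.x releases):

```java
import io.netty.buffer.PooledByteBufAllocator;

public final class AllocatorStatsDump {
    public static void main(String[] args) {
        // Prints per-arena chunk and allocation statistics for the default pooled allocator.
        // In a real server this would typically be logged periodically or exposed on a debug endpoint.
        System.out.println(PooledByteBufAllocator.DEFAULT.dumpStats());
    }
}
```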
I tried to analyze the heap dump, but I don't know Netty well enough to draw any conclusions.
Right now I'm playing with
to find working settings. I'll update the ticket with additional info, if any.
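The exact settings being tried are not shown here; purely as an illustration, these are the kinds of Netty system properties commonly toggled while investigating direct-memory growth (illustrative values, not the reporter's configuration; in practice they are passed as -D flags so they take effect before any Netty class loads):

```java
public final class AllocatorTuningExample {
    public static void main(String[] args) {
        // Illustrative only; not the settings used in this issue.
        System.setProperty("io.netty.allocator.type", "unpooled");     // bypass the pooled allocator
        System.setProperty("io.netty.noPreferDirect", "true");         // prefer heap buffers over direct ones
        System.setProperty("io.netty.leakDetectionLevel", "advanced"); // richer leak reports
    }
}
```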
Unfortunately I wasn't able to reproduce this issue in the QA environment. Please let me know if you need more info.