
Netty4.0.28 mem leak for io.netty.util.Recycler #5563

Closed
nhynew opened this issue Jul 21, 2016 · 33 comments


@nhynew nhynew commented Jul 21, 2016

In Netty 4.0.28 I found that the Java GC cannot reclaim the old generation. I used jmap -histo:live to export the heap info, and there are 985,892 io.netty.util.Recycler$DefaultHandle objects in memory.
[screenshot: jmap -histo:live output]

On the other hand, I use UnpooledByteBufAllocator to send the DefaultFullHttpRequest content.

Would you please give me some ideas to solve this problem? Thanks.
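For context, an unpooled allocator for outbound content is normally configured through ChannelOption.ALLOCATOR. The following is only a minimal sketch of that kind of setup, not the reporter's actual bootstrap; the class name and the HttpClientCodec handler are illustrative:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.buffer.UnpooledByteBufAllocator;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.handler.codec.http.HttpClientCodec;

public final class UnpooledAllocatorSketch {
    public static void main(String[] args) {
        NioEventLoopGroup group = new NioEventLoopGroup(); // default thread count
        try {
            Bootstrap bootstrap = new Bootstrap()
                    .group(group)
                    .channel(NioSocketChannel.class)
                    // Unpooled ByteBuf allocation for the request content, as described above.
                    // Note: this only affects ByteBufs; the Recycler still pools internal
                    // objects such as WriteAndFlushTask and its DefaultHandle.
                    .option(ChannelOption.ALLOCATOR, UnpooledByteBufAllocator.DEFAULT)
                    .handler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            ch.pipeline().addLast(new HttpClientCodec());
                        }
                    });
            // bootstrap.connect(host, port) ... (omitted)
        } finally {
            group.shutdownGracefully();
        }
    }
}
```

Note that switching the allocator changes how ByteBuf memory is managed, but it does not change the Recycler-based pooling of task and handle objects that this issue is about.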


@nhynew nhynew commented Jul 21, 2016

[screenshot: code]

This is my code, and I am sure that the allocating thread is the same as the releasing thread, which is the Netty I/O worker thread.
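The referenced code screenshot is not available here. Purely as a rough illustration (not the reporter's code; the class and method names are invented), the kind of write path being described, where the application builds a DefaultFullHttpRequest and hands it to the channel, would look something like:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.channel.Channel;
import io.netty.handler.codec.http.DefaultFullHttpRequest;
import io.netty.handler.codec.http.HttpHeaders;
import io.netty.handler.codec.http.HttpMethod;
import io.netty.handler.codec.http.HttpVersion;
import io.netty.util.CharsetUtil;

public final class BizThreadWriteSketch {
    public static void send(Channel channel, String uri, String body) {
        ByteBuf content = Unpooled.copiedBuffer(body, CharsetUtil.UTF_8);
        DefaultFullHttpRequest request = new DefaultFullHttpRequest(
                HttpVersion.HTTP_1_1, HttpMethod.POST, uri, content);
        request.headers().set(HttpHeaders.Names.CONTENT_LENGTH, content.readableBytes());
        // If the caller is not the channel's event loop, Netty wraps this call in a
        // WriteAndFlushTask that is executed on the I/O worker thread; that task object
        // is obtained from (and later returned to) the Recycler.
        channel.writeAndFlush(request);
    }
}
```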


@Scottmitch Scottmitch commented Jul 21, 2016

Can you sort the object list by the number of instances strongly retained, and post the image again?


@nhynew nhynew commented Jul 21, 2016

The following picture is sorted by instances; please check it. Thanks.
[screenshot: heap histogram sorted by instance count]


@Scottmitch Scottmitch commented Jul 21, 2016

It looks like there are a bunch of WriteAndFlushTask objects in the heap ... how are you applying back pressure?

Also, can you update to the latest released version, 4.0.39, and confirm whether the issue still exists and whether any behavior changes?


@nhynew nhynew commented Jul 22, 2016

This situation appears under stress testing. After the stress test completed, I performed a full GC manually and found that these objects could not be released. Looking at Recycler.java, each time an object is recycled from a thread other than its owner, a reference is put into a WeakHashMap. This generates a lot of references to the same object, but I do not know where and how these references are released.

My case is as follows:
1. I call ChannelHandlerContext.writeAndFlush(DefaultFullHttpRequest) in a business thread. This creates a WriteAndFlushTask, via the Recycler, whose owner thread is the business thread.
[screenshots: writeAndFlush call site and WriteAndFlushTask allocation]

2. When the task finishes, the handle is recycled into a WeakHashMap because the owning thread is not the same as the current thread. Every invocation does this, so I think it generates a huge number of references to DefaultHandle objects, just like the picture I posted above shows.
[screenshot: cross-thread recycle path]

I'm not sure I understand this correctly. Could you please give me some help with this?

Then I will try it again with version 4.0.39.
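To make the cross-thread recycling described in step 2 concrete, here is a minimal, self-contained sketch against the 4.0.x-style Recycler API (get() plus recycle(object, handle)). It is not Netty's internal code, just an illustration of an object being obtained on one thread and recycled on another, which is the path that parks the handle in the WeakHashMap/WeakOrderQueue structures instead of pushing it back onto the owning thread's Stack:

```java
import io.netty.util.Recycler;

public final class CrossThreadRecycleSketch {

    static final class PooledThing {
        final Recycler.Handle handle;
        PooledThing(Recycler.Handle handle) { this.handle = handle; }
    }

    private static final Recycler<PooledThing> RECYCLER = new Recycler<PooledThing>() {
        @Override
        protected PooledThing newObject(Handle handle) {
            return new PooledThing(handle);
        }
    };

    public static void main(String[] args) throws InterruptedException {
        // Obtained on the "biz" thread (main): this thread owns the Recycler's Stack.
        final PooledThing thing = RECYCLER.get();

        // Recycled on a different thread, roughly what happens when a WriteAndFlushTask
        // created by a biz-thread writeAndFlush() is finished and recycled by the I/O
        // worker thread: the handle goes into a WeakOrderQueue keyed by the recycling
        // thread rather than back onto the owner's Stack.
        Thread ioThread = new Thread(new Runnable() {
            @Override
            public void run() {
                RECYCLER.recycle(thing, thing.handle);
            }
        }, "fake-io-thread");
        ioThread.start();
        ioThread.join();
    }
}
```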


@nhynew nhynew commented Jul 22, 2016

I have tried version 4.0.39 and I still found a huge number of DefaultHandle and WriteAndFlushTask objects. The picture is as follows:
[screenshot: heap histogram on 4.0.39]

I am sure that there are no requests reaching the server and the test is finished. I then performed a full GC manually many times, but these two kinds of objects cannot be reclaimed by the Java GC. Is this correct?
Any help would be appreciated. Thanks.


@Scottmitch Scottmitch commented Jul 22, 2016

Generally when there are a lot of WriteAndFlush tasks still strongly referenced it is because they are waiting until the channel becomes writable so they can be flushed. This may happen if for example you are writing faster than the peer can read the data, or if the peer is no longer reading data. This is why I was asking how you apply back pressure. Is the channel you are writing to still open? Can you provide a heap dump? Using a tool like Eclipse MAT makes it relatively easy to find all strong references to GC roots to confirm this theory.
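For reference, one common way to apply back pressure is to stop writing while Channel.isWritable() returns false and resume from channelWritabilityChanged(). A minimal sketch only; the handler name and the resume hook are made up:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class BackPressureAwareHandler extends ChannelInboundHandlerAdapter {

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
        // Fired when the channel's outbound buffer crosses its high/low water marks.
        if (ctx.channel().isWritable()) {
            resumeWriting(ctx.channel());
        }
        ctx.fireChannelWritabilityChanged();
    }

    private void resumeWriting(Channel channel) {
        // Application-specific: pull pending messages from your own queue and write
        // them, checking channel.isWritable() between writes so the outbound buffer
        // (and the WriteAndFlushTask backlog) stays bounded.
    }
}
```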


@nicmunroe nicmunroe commented Jul 22, 2016

We just updated from Netty 4.0.36.Final to 4.0.39.Final on a service yesterday and ran into a memory leak that looks very close (if not identical) to what is being discussed here. Here's a screenshot:

[screenshot: possible_netty_mem_leak]

We don't have any WriteAndFlush objects in the heap dump, but we do have a bunch of DefaultHandles. This heap dump was performed on a box that had been removed from the load balancer and had not been receiving any traffic for several minutes.

Reverting back to 4.0.36.Final made the problem go away.


@normanmaurer normanmaurer commented Jul 22, 2016

@nicmunroe could this be related to #5569? If so, can you try moving back to 4.0.38.Final or use the 4.0 branch and see if it still reproduces?


@nicmunroe nicmunroe commented Jul 22, 2016

I'm not sure if it's related or not. I can try out 4.0.38.Final and see how it goes. It will probably be several hours (if not tomorrow) before I can scrape together some time to test and get back to you but I'll do my best.

FWIW - I just now logged into a different box that has been out of the load balancer for 4 hours, did some netstat commands to verify nothing was connected to it, and forced some garbage collections with jcmd. The old gen size did not drop after the GCs. So even with an idle box that has nothing connected to it this memory is not reclaimable.


@normanmaurer normanmaurer commented Jul 22, 2016

@nicmunroe I think it could be related... Take your time and let me know once you have tested. Even better would be if you could also test with the current 4.0 branch.


@normanmaurer normanmaurer commented Jul 22, 2016

@nicmunroe Also, looking at your screenshot, is it right that you create and destroy a lot of threads (which also includes EventLoopGroup instances)?


@nicmunroe nicmunroe commented Jul 22, 2016

@normanmaurer we have a cached thread pool for performing some long running stuff that couldn't be on the Netty worker threads - that's the only place where threads should be frequently created/destroyed. We do have a SimpleChannelPool with a Netty client bootstrap for making downstream HTTP calls with pooled connections that we create on the server, but the SimpleChannelPool, client bootstrap, and associated EventLoopGroup are only created once and reused. Not sure if that answered your question.

I'll have info on the different versions as you requested in a minute.
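For reference, a setup along the lines described, where the EventLoopGroup, client Bootstrap and SimpleChannelPool are created once and reused for all downstream calls, might look like the sketch below (illustrative names and pipeline, not the actual service code):

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.pool.AbstractChannelPoolHandler;
import io.netty.channel.pool.SimpleChannelPool;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.handler.codec.http.HttpClientCodec;
import io.netty.util.concurrent.Future;
import io.netty.util.concurrent.FutureListener;

public final class PooledHttpClient {
    // Created once and shared: no EventLoopGroup or pool churn per request.
    private final NioEventLoopGroup group = new NioEventLoopGroup();
    private final SimpleChannelPool pool;

    public PooledHttpClient(String host, int port) {
        Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                .remoteAddress(host, port);
        pool = new SimpleChannelPool(bootstrap, new AbstractChannelPoolHandler() {
            @Override
            public void channelCreated(Channel ch) {
                ch.pipeline().addLast(new HttpClientCodec());
            }
        });
    }

    public void execute() {
        // Acquire a pooled connection, use it, and release it back to the pool.
        pool.acquire().addListener(new FutureListener<Channel>() {
            @Override
            public void operationComplete(Future<Channel> future) {
                if (future.isSuccess()) {
                    Channel ch = future.getNow();
                    // ... write the downstream HTTP request on ch ...
                    pool.release(ch);
                }
            }
        });
    }
}
```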


@nicmunroe nicmunroe commented Jul 22, 2016

@normanmaurer - I created three canaries the same way, each one with a different Netty version, and deployed them. Here are the results:

So the problem we're seeing was introduced between 4.0.38.Final and 4.0.39.Final, and is not fixed in 4.0.40.Final-SNAPSHOT. It seems like this can't be exactly the same issue @nhynew is seeing, since this ticket was opened against 4.0.38.Final and we're not seeing the leak on our service with that version (just 4.0.39.Final and later); however, it also smells related somehow, based on the heap dump similarities.


@johnou johnou commented Jul 22, 2016

@nicmunroe netty-common-4.0.40.Final-20160722.121107-7-sources.jar.md5 (Fri Jul 22 12:11:08 UTC 2016, the latest I see) does not contain the fix from 688e65f. Are you able to build 4.0.40.Final-SNAPSHOT locally from the 4.0 branch?


@nicmunroe nicmunroe commented Jul 22, 2016

Ah, I didn't realize that commit was so new. I can retest when the snapshot rebuilds and contains that commit, or if someone points me at a different maven repo that contains a snapshot build that does have the commit. Unfortunately building Netty from source and jury-rigging my service to use it would take more time and effort than I have to spare right now.

Worst case I can try again tomorrow with a new snapshot build.


@normanmaurer normanmaurer commented Jul 23, 2016

Yep, please test once a new snapshot is online.



@nhynew nhynew commented Jul 23, 2016

@Scottmitch My situation is similar to @nicmunroe's. I have an object pool of Netty HTTP channels for downstream HTTP calls, and I allocated 2000 NioEventLoops for a maximum of 2000 connections. The HTTP calls are normally very fast; when I receive the HTTP response I log it for the test, and I am sure the channel is still open while I am writing data. When I finished the test there was no more load on the server, yet I still found a huge number of WriteAndFlushTask objects in the heap after running a full GC manually.

Is the problem too many NioEventLoop threads? Each thread has a ThreadLocalMap, and while the thread is alive the contents of that ThreadLocalMap cannot be released by the Java GC, right?
Why does each ThreadLocalMap contain so many WeakHashMap$Entry objects that reference WriteAndFlushTask?

I think the normal situation is that once the server is idle after the stress test, these objects should be released by the Java GC, but that is clearly not what happens.
The MAT result is as follows:
[screenshots: MAT views of the retained WriteAndFlushTask references]


@normanmaurer normanmaurer commented Jul 23, 2016

@nhynew not really, as we "pool" these objects by default to reduce GC pressure. If you want to disable object pooling you can do this with -Dio.netty.recycler.maxCapacity=0.

That said, I think using 2000 EventLoops is not what you want. The whole point of using NIO is to avoid one thread per connection. Using the default number of EventLoops when creating a NioEventLoopGroup should be a good start.
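As a sketch of both suggestions (illustrative only): Recycler pooling can be disabled via the io.netty.recycler.maxCapacity system property, and a NioEventLoopGroup created with its default constructor sizes itself from the number of available processors, so a single small group can serve all 2000 connections:

```java
import io.netty.channel.nio.NioEventLoopGroup;

public final class EventLoopSizingSketch {
    public static void main(String[] args) {
        // Equivalent to -Dio.netty.recycler.maxCapacity=0 on the command line; it must be
        // set before the Recycler class is initialized for it to take effect.
        System.setProperty("io.netty.recycler.maxCapacity", "0");

        // Default constructor: Netty picks the thread count (io.netty.eventLoopThreads,
        // which defaults to 2 * the number of available processors). One shared group
        // can multiplex thousands of connections; 2000 threads are not needed.
        NioEventLoopGroup group = new NioEventLoopGroup();

        // ... build the client Bootstrap with this shared group ...
        group.shutdownGracefully();
    }
}
```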


@normanmaurer normanmaurer commented Jul 23, 2016

@nicmunroe any updates yet?


@nicmunroe nicmunroe commented Jul 23, 2016

@normanmaurer Just kicked off the test with the new snapshot. I'll have results in 30-60 minutes.


@normanmaurer normanmaurer commented Jul 23, 2016

@nicmunroe cool... thanks!


@nicmunroe nicmunroe commented Jul 23, 2016

@normanmaurer and @johnou looks good with last night's snapshot. No detectable leak. Sorry for hijacking this ticket - looks like they might be separate issues. But thanks for the help. It's good to know the problem has been fixed!


@normanmaurer normanmaurer commented Jul 23, 2016

Thanks for confirming the fix!



@nhynew nhynew commented Jul 24, 2016

@normanmaurer I will try it again using the default NioEventLoopGroup. Thank you for your suggestion.


@normanmaurer normanmaurer commented Jul 24, 2016

So can we close this?

On 24 Jul 2016 at 03:12, nhynew notifications@github.com wrote:

@normanmaurer @Scottmitch I tried again with the default NioEventLoopGroup. The WriteAndFlushTask count is now only about 50,000, much less than before on 4.0.39.Final, and when the old generation fills up it is released successfully by the Java GC. Thanks for your help!




@nhynew nhynew commented Jul 24, 2016

Yes. I read the code of Recycler again and again and finally understand how the object pool is implemented. Normally the pool keeps a small number of objects and reuses them to reduce the number of allocations, which also reduces GC pressure. Good idea! But when the system comes under high pressure, the pool accumulates many objects in the WeakHashMap, and only when the heap is full will GC release some of the weak references. Do I understand this correctly?

If I understand correctly, I think the number of WeakHashMap entries should be bounded so that it cannot grow indefinitely, right?

Thanks for your help.


@normanmaurer normanmaurer commented Jul 24, 2016

@nhynew the WeakHashMap should have at most one entry per Stack and Thread, so the worst case is the number of Stacks * the number of Threads. The contained WeakOrderQueue is bounded.


@nhynew nhynew commented Jul 25, 2016

@normanmaurer But I have 16 threads, and counting the WeakHashMap.Entry objects I get more than 300 in total.
Could there be cyclic references? Any help would be much appreciated.
The screenshots below show the situation for one of those threads:
[screenshots: MAT view of one thread's WeakHashMap entries]


@normanmaurer normanmaurer commented Jul 25, 2016

@nhynew can you upload the heap dump somewhere? I assume you write from threads other than the I/O threads as well.


@nhynew nhynew commented Jul 26, 2016

@normanmaurer The heap dump is not available, because it contains company information.
But I have found the problem: it is related to a previous bug, #5569.

I have reviewed my code: I invoke channel.writeAndFlush in a business thread, so the write is wrapped into a WriteAndFlushTask in the task queue, and when the task finishes it is recycled by the I/O thread and pushed into a WeakOrderQueue.

Version 4.0.39.Final has the bug from #5569; would you please build a release version containing the fix?


@normanmaurer normanmaurer commented Jul 26, 2016

You can grab the SNAPSHOT for now. I will release a new version with the fix this week.



@nhynew nhynew commented Jul 26, 2016

OK, thanks very much for your help with this problem.
