NonBlockingSocketWriter memory leak when client disconnects #12353
It is the TcpIpConnection holding on to the NonBlockingSocketWriter. So only when the TcpIpConnection gets GC'd does the NonBlockingSocketWriter get GC'd, and eventually the ClientMessage/ByteBuffer.
The key question is: is this temporary retention or permanent retention?
Based on the memory dump, I can see that the retention is caused by the invocation registry. The invocation is registered there, and through the invocation the ClientMessageTask is retained, which in turn holds on to the connection.
Eventually the invocation will complete (one way or another), and therefore it will be removed from the invocation registry and the retention link is broken. So there should not be a permanent memory leak.
So the question is whether this is a bug or the system is behaving as designed.
So can you keep your application running and determine whether the memory is eventually released?
Unfortunately, the heap dump I have contains proprietary information which I am not at liberty to make available publicly.
The retention appears to be permanent. I left the program which hosts the node running overnight, and the program itself has a scheduled job invoking
We seem to have run into the same problem described here in Hazelcast 3.10.1.
We see out of memory exceptions on members when clients are forcibly disconnected from the cluster (in our case we suspect as a result of client machines being powered off without a controlled shutdown). We have several hundred clients connected to a cluster with several members.
When our heap dumps are analysed we see that the ClientMessage/ByteBuffer structures are holding huge amounts of memory (much more than all the ordinary data held in the application maps).
@pveentjer Please could you update us, has there been any progress toward resolving this issue? Do you have any idea when/if it will be fixed?
In the meantime, is there anything we should do to limit, or workaround, the impact of this problem?
To add a little more information, taken from a JHAT analysis of a heap dump from a live member that crashed due to an OOME:
Instance Counts for All Classes (excluding platform)
We have attempted to decode the bytes associated with each ClientMessage. The messages appear to be async pub/sub messages. We suspect that when clients are abruptly disconnected, Hazelcast isn't clearing out the backlog of messages queued for the dropped connection.
Also seeing the same issue (v3.10.1) as the poster, even with a scheduled gc() every few minutes. The instance count of byte arrays keeps rising. We were able to reproduce it in our lab by spinning up lots of clients that register EntryListeners on a "busy" map and randomly call LifecycleService.terminate().
Edit: moving to a 64-bit JVM with G1GC, aggressive GC, and string deduplication turned on improves the memory footprint, but there's no question Hazelcast retains client messages even after the client disconnects. This was observed over days of comparing heap dumps while running the aforementioned torture test.
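For anyone wanting to reproduce this, the torture test described above can be sketched roughly as follows. This is my own minimal sketch against the Hazelcast 3.x client API, not the original reporter's code; the map name "busy", the put loop, and the class name are illustrative, and it assumes a running cluster reachable with default client config.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.EntryAdapter;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import java.util.Random;

public class DisconnectTortureTest {
    public static void main(String[] args) {
        Random random = new Random();
        while (true) {
            // Connect a client and subscribe to entry events on a busy map.
            HazelcastInstance client = HazelcastClient.newHazelcastClient();
            IMap<String, String> map = client.getMap("busy");
            map.addEntryListener(new EntryAdapter<String, String>(), true);

            // Generate traffic so the member queues events for this client.
            for (int i = 0; i < 1000; i++) {
                map.put("key-" + random.nextInt(100), "value-" + i);
            }

            // Kill the client without a graceful shutdown, mimicking an
            // abrupt disconnect such as a machine being powered off.
            client.getLifecycleService().terminate();
        }
    }
}
```

Running many instances of this loop while watching heap dumps on the members should show whether ClientMessage/byte[] counts keep growing after the clients are gone.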
Hi guys, my sincere apologies. Our system was in a major release for the past couple of weeks.
I am happy to confirm that the 3.10.5 release has indeed fixed the memory retention problem caused by abruptly disconnected clients, and memory usage drops back down to baseline level after some period (is there a timeout config that I can set?)
edit: Forgot to say thank you.
Thank you :)
Hi @LoneGumMan !
Happy to hear that. The fix releases the resources as soon as we detect that the client has disconnected; we don't retain the resources any longer than that, and there is no timeout. One other reason I can think of why the memory usage might still keep increasing for a while is that the GC delays collecting unreachable objects until it needs the space. You may try tuning the GC to "kick in" somewhat earlier, but I don't think this is strictly necessary, as the overhead of running GC often might not be worth it.
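For reference, making G1 start its concurrent cycle earlier (as suggested above) is typically done with flags like the following. These are illustrative values of my own, not a recommendation from the Hazelcast team; tune them against your actual workload, and note that string deduplication requires G1.

```shell
# Assumption: Java 8+ with G1GC; the jar name is a placeholder.
# InitiatingHeapOccupancyPercent lowers the heap-occupancy threshold
# at which G1 starts a concurrent marking cycle (default is 45).
java -XX:+UseG1GC \
     -XX:+UseStringDeduplication \
     -XX:InitiatingHeapOccupancyPercent=30 \
     -jar my-hazelcast-member.jar
```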
Closing the issue then as it has been resolved. If you encounter another problem, please open a new issue. Happy hazelcasting!
I have the system hooked up to JVisualVM, and I am manually triggering GC and monitoring the GC log; it takes a few clicks, over the span of about 1 minute, for the heap to shrink to baseline in one big jump, not a gradual release.
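The same before/after observation can be made programmatically with only the standard java.lang.management API; a small sketch (class name and output format are mine). Note that System.gc() is only a hint to the JVM, just like the "Perform GC" button in JVisualVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapCheck {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage before = memory.getHeapMemoryUsage();

        // Hint the JVM to collect; equivalent to JVisualVM's "Perform GC".
        System.gc();

        MemoryUsage after = memory.getHeapMemoryUsage();
        System.out.printf("heap used before: %d MB, after: %d MB%n",
                before.getUsed() >> 20, after.getUsed() >> 20);
    }
}
```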
That is why I think maybe there is a timeout, say a timeout before the disconnect is detected/confirmed and housekeeping happens.
Either way, it's no big deal; more of a curiosity than anything.