Fix Timer thread classloader leak #197
This patch is an alternative to #188. It fixes the ClassLoader leak when the driver is repeatedly loaded/unloaded (e.g. in a containerized environment).
The PR is split into three commits. The first two make no functional changes to the driver; they just rename some variables to clarify their usage and move the TimerTask-related method calls in Statement so that they go through the Connection rather than statically against the Driver.
The third commit adds a reference counting SharedTimer with two methods,
One nice side effect of this patch is that if a cancellation task is never scheduled, then no Timer or associated Timer thread is ever created.
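For readers following along, here is a minimal sketch of what a reference-counted, lazily created Timer holder could look like. It is not the code from this PR: the class name `RefCountedTimer` and the methods `acquire()`, `release()`, and `isAllocated()` are hypothetical names chosen for illustration. Both mutating methods are `synchronized`, so a release is ordered before any later acquire on the same monitor.

```java
import java.util.Timer;

/**
 * Hypothetical sketch of a reference-counted, lazily created Timer holder.
 * Names are illustrative only and are not the driver's actual API.
 */
class RefCountedTimer {
    private int refCount = 0;    // guarded by "this"
    private Timer timer = null;  // guarded by "this"; null while nobody holds a reference

    /** Returns the shared Timer, creating it (and its thread) on first use. */
    synchronized Timer acquire() {
        if (timer == null) {
            // Daemon thread, so an idle timer never blocks JVM shutdown.
            timer = new Timer("shared-cancel-timer", true);
        }
        refCount++;
        return timer;
    }

    /** Drops one reference; the last release cancels the Timer and its thread. */
    synchronized void release() {
        if (refCount == 0) {
            throw new IllegalStateException("release() without matching acquire()");
        }
        if (--refCount == 0) {
            timer.cancel();
            timer = null;
        }
    }

    /** True while a Timer instance exists. */
    synchronized boolean isAllocated() {
        return timer != null;
    }

    public static void main(String[] args) {
        RefCountedTimer shared = new RefCountedTimer();
        Timer first = shared.acquire();
        Timer second = shared.acquire();
        System.out.println("same instance: " + (first == second));
        shared.release();
        shared.release();
        System.out.println("allocated after last release: " + shared.isAllocated());
    }
}
```

If nothing ever calls `acquire()`, no Timer and no Timer thread is created, which matches the side effect described above.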
I added one basic test that creates two connections, two statements, and issues two cancels to test that only a single Timer is created and subsequently deallocated. It and the rest of the driver tests run successfully.
I think some additional tests are needed, particularly one that simulates loading/unloading the driver via a custom ClassLoader. It would need to run separately from the rest of the driver unit tests as it'll need to run without the driver on the classpath (as it'll load it dynamically). If nobody else gets around to this I'll see if I can write one some time next week.
I'd also like it if someone could walk through the Timer allocation/deallocation steps to verify their happens-before and memory semantics. Again, all the tests pass but another pair of eyes would be appreciated.
Oct 5, 2014
On 10/05/2014 11:36 PM, Sehrope Sarkuni wrote:
Thanks for keeping it well structured.
I like this approach a lot more, though I'm still concerned about direct
It'll be important to test how this behaves when an exception is thrown
It may prove easier to use Arquillian to do this against an embedded
I'll try, time permitting. Things are busy.
Craig Ringer http://www.2ndQuadrant.com/
Yes, I don't particularly like manually managing Timers like this, but the Timer API usage was already there. The difference now is that we're cleaning up behind ourselves when we're done.
Looks like you're right on this one: an exception causes the Timer to stop running. See the line "If the timer's task execution thread terminates unexpectedly..." in the Javadocs: http://docs.oracle.com/javase/7/docs/api/java/util/Timer.html
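A small standalone sketch of that behavior, using only `java.util.Timer` (the class name `TimerDeathDemo` and its method are made up for the demo): one task throws, and the Timer then rejects further scheduling with an `IllegalStateException`. The loop is there because the timer thread needs a moment to unwind after the exception.

```java
import java.util.Timer;
import java.util.TimerTask;

/** Demo only: shows that one failing TimerTask makes the whole Timer unusable. */
class TimerDeathDemo {

    /** Returns true once the Timer starts rejecting new tasks after a task has thrown. */
    static boolean rejectsTasksAfterFailure() {
        Timer timer = new Timer("demo-timer", true);
        timer.schedule(new TimerTask() {
            @Override public void run() {
                // The uncaught exception terminates the timer thread
                // (its stack trace is printed to stderr).
                throw new RuntimeException("simulated failure in a cancel task");
            }
        }, 0);
        try {
            // Keep probing until the dead timer refuses a new task.
            for (int attempt = 0; attempt < 100; attempt++) {
                try {
                    timer.schedule(new TimerTask() {
                        @Override public void run() { }
                    }, 0);
                } catch (IllegalStateException timerIsDead) {
                    return true;
                }
                Thread.sleep(50);
            }
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) {
        System.out.println("timer rejects tasks after failure: " + rejectsTasksAfterFailure());
    }
}
```

With a shared Timer this means one misbehaving task would silently disable cancellation for every connection that shares it, unless each task catches its own exceptions.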
Maybe we should use a ScheduledThreadPoolExecutor. It's Java 1.5+ only but I suppose that's okay as at this point even 1.6 is out of support. Anybody that wants to use this new resource cleanup feature better be running something supported.
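As a rough illustration of why the executor is more forgiving (a sketch, not a proposed patch; `ExecutorSurvivesDemo` is a made-up name): an exception thrown by a one-shot task is captured in its future, and the scheduler keeps running later tasks.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Demo only: a failing task does not disable a ScheduledThreadPoolExecutor. */
class ExecutorSurvivesDemo {

    /** Returns true if a task scheduled after a failing task still runs. */
    static boolean laterTaskStillRuns() {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        try {
            Runnable failingTask = () -> {
                throw new RuntimeException("simulated failure in a cancel task");
            };
            ScheduledFuture<?> failing = scheduler.schedule(failingTask, 0, TimeUnit.MILLISECONDS);
            try {
                failing.get(5, TimeUnit.SECONDS);
                return false; // the task was supposed to fail
            } catch (ExecutionException expected) {
                // The exception is stored in the future; the scheduler itself is unaffected.
            }
            Callable<String> laterTask = () -> "cancel sent";
            ScheduledFuture<String> later = scheduler.schedule(laterTask, 10, TimeUnit.MILLISECONDS);
            return "cancel sent".equals(later.get(5, TimeUnit.SECONDS));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            // The pool's threads must be shut down explicitly, just like the Timer thread.
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("later task still runs: " + laterTaskStillRuns());
    }
}
```

The executor still owns threads, so the same allocate/deallocate bookkeeping would be needed to avoid the ClassLoader leak.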
Alternatively (or in addition to that), we could have an Executor for running the actual cancellation tasks so they happen in separate Threads. This goes down the rabbit hole of doing more thread management directly in the driver but I think it may be a better approach.
With a single thread handling cancellations, you can only cancel one query at a time. Cancellations are meant to be quick but if something slows it down (e.g. server down or networking issue) then it'll hang and delay all other query cancellations. In the common case where an app is connecting to a single database this isn't much of an issue; i.e. if one can't connect, the rest probably can't either. However, if you're connecting to separate databases with a single shared Timer thread for cancellation then one bad connection will starve all the rest.
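To make the starvation argument concrete, here is a sketch of the hand-off idea under some assumptions: a single scheduling thread only dispatches, and a separate pool runs the (possibly slow) cancel work. The class `CancelHandoffDemo` and the latch standing in for a hung network call are illustrative, not driver code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Demo only: a blocked cancel on a worker thread does not delay other cancels. */
class CancelHandoffDemo {

    /** Returns true if a fast cancel completes while a slow one is still blocked. */
    static boolean fastCancelNotStarved() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        ExecutorService workers = Executors.newCachedThreadPool();
        CountDownLatch slowServer = new CountDownLatch(1); // stands in for an unreachable server
        CountDownLatch fastDone = new CountDownLatch(1);
        try {
            // First cancel: the scheduler thread only hands off; the worker blocks.
            scheduler.schedule(() -> workers.execute(() -> {
                try {
                    slowServer.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }), 0, TimeUnit.MILLISECONDS);
            // Second cancel: scheduled later, still runs because the scheduler thread is free.
            scheduler.schedule(() -> workers.execute(fastDone::countDown), 10, TimeUnit.MILLISECONDS);
            return fastDone.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            slowServer.countDown();
            scheduler.shutdownNow();
            workers.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("fast cancel not starved: " + fastCancelNotStarved());
    }
}
```

If the slow work ran directly on the single scheduling thread instead, the second cancel would wait behind the first, which is the starvation case described above. The trade-off is one more set of threads for the driver to manage and clean up.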
No rush on this for me. My involvement in helping fix this is because I want it done right, not because I want it done soon.
In one of my previous attempts, I had a simple test that used such a custom classloader. It loaded the driver's classes several times (the custom classloader always loaded classes from org.postgresql itself, regardless of whether the parent classloader had them or not). I then naively tried to create enough churn on the heap to force the GC to collect the drivers, and checked whether they had indeed all disappeared (which would not be the case with the eagerly created Timer thread in the current codebase).
If you find it useful, you can reuse/adapt it to your needs. I resurrected the code here: https://gist.github.com/metlos/59c9cf891482f14d1784
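Independently of the gist, the general shape of such a test can be sketched as follows. Everything here is an assumption made for illustration: `IsolatedReloadDemo`, `ChildFirstLoader`, and the `Payload` class that stands in for the driver classes. The loader defines matching classes itself rather than delegating to its parent, and a `WeakReference` is used to observe whether the reloaded class goes away. `System.gc()` is only a hint, so a `false` result from the GC check does not by itself prove a leak.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.ref.WeakReference;

/** Demo only: reloads a class in an isolated, child-first loader and watches for unloading. */
class IsolatedReloadDemo {

    /** Hypothetical stand-in for the driver classes. */
    public static class Payload { }

    /** Defines classes whose names start with the prefix itself; delegates everything else. */
    static class ChildFirstLoader extends ClassLoader {
        private final String prefix;

        ChildFirstLoader(ClassLoader parent, String prefix) {
            super(parent);
            this.prefix = prefix;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                if (!name.startsWith(prefix)) {
                    return super.loadClass(name, resolve); // normal parent-first path
                }
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        /** Reads the class file through the parent's resources (assumes it is on the classpath). */
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String path = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(path)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads a fresh copy of Payload in its own loader. */
    static Class<?> loadIsolatedCopy() {
        String name = Payload.class.getName();
        ClassLoader parent = IsolatedReloadDemo.class.getClassLoader();
        try {
            return new ChildFirstLoader(parent, name).loadClass(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Best effort only: returns true if the isolated copy was collected after GC hints. */
    static boolean unloadedAfterGc() throws InterruptedException {
        WeakReference<Class<?>> ref = new WeakReference<>(loadIsolatedCopy());
        for (int i = 0; i < 20 && ref.get() != null; i++) {
            System.gc();
            Thread.sleep(50);
        }
        return ref.get() == null;
    }

    public static void main(String[] args) throws Exception {
        Class<?> copy = loadIsolatedCopy();
        System.out.println("distinct class object: " + (copy != Payload.class));
        System.out.println("unloaded after GC hint: " + unloadedAfterGc());
    }
}
```

A class can only be unloaded once its defining loader is unreachable, so anything that pins the loader (for example a live thread whose code came from it) keeps every class it defined alive.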
Yes if everything is marked synchronized then the
I haven't had a chance yet to add better tests for this as it's a bit more complicated than the other tests (because of the external classloader). The small test I have there is better than nothing but something more extensive would be nice.
@metlos: If you're still interested in getting this merged do you think you could come up with something that simulates your use case?
My use case is described quite well by https://gist.github.com/metlos/59c9cf891482f14d1784, as mentioned in one of my previous comments.
We discovered this issue in http://rhq.jboss.org where we have a plugin for monitoring Postgres. This plugin lives in a sort of container with a child-first classloader so that individual plugins are isolated.
On certain occasions the container stops, reloads and restarts all the plugins (without actually stopping the JVM). After a couple of such occasions we noticed increased PermGen usage due to classes not being unloaded, and tracked it down to this issue.
I added another test based upon the one in the linked gist. I changed it a bit to:
... the problem is it doesn't quite work. The Driver instances themselves are getting GCed so I'm fairly certain the new shared timer code is working fine. The problem is the Driver classes themselves don't get GCed.
I tried running the test standalone (rather than with everything else on the classpath) and I get the same result. I think it may be because the JVM doesn't actually GC classes (vs objects)... they stay in PermGen. I've been testing it with v1.7 on OSX and have been fiddling with some JVM -XX flags but no luck.
I haven't updated the PR with the new test as the bulk of it is debug statements from me experimenting with it. You can find it here: https://gist.github.com/sehrope/3b4e11124f6e27d9e680
You can run it via:
(you might need to replace the port or add a host if you're not testing locally)