New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Timer thread classloader leak #197

Merged
merged 3 commits into from Dec 26, 2014

Conversation

Projects
None yet
4 participants
@sehrope
Contributor

sehrope commented Oct 5, 2014

This patch is an alternative to #188. It fixes the ClassLoader leak when the driver is repeatedly loaded/unloaded (e.g. in a containerized environment).

The PR is split into three commits. The first two don't have any functional changes to the driver, they just rename some variables to clarify their usage and move TimerTask related method calls from Statement to go through the Connection rather than statically against the Driver.

The third commit adds a reference counting SharedTimer with two methods, getTimer() and releaseTimer(). The former creates a Timer on demand and the latter calls .cancel() on the underlying Timer if the reference count goes to zero (to allow it to be GCed) before clearing it to null.

One nice side effect of this patch is that if a cancellation task is never scheduled, then no Timer or associated Timer thread is ever created.

I added one basic test that creates two connections, two statements, and issues two cancels to test that only a single Timer is created and subsequently deallocated. It and the the rest of the driver tests run successfully.

I think some additional tests are needed, particularly one that simulates loading/unloading the driver via a custom ClassLoader. It would need to run separately from the rest of the driver unit tests as it'll need to run without the driver on the classpath (as it'll load it dynamically). If nobody else gets around to this I'll see if I can write one some time next week.

I'd also like if someone could walk through the Timer allocation/deallocation steps to verify the happens-before and memory semantics of it. Again, all the tests pass but another pair of eyes would be appreciated.

sehrope added some commits Oct 4, 2014

Rename TimerTask and related methods
Rename cancelTimer to cancelTimerTask to clarify usage.
There is already an actual Timer object named cancelTimer in Driver.
Also rename the killTimer() method in AbstractJdbc2Statement to killTimerTask() to clarify it's usage.
Change TimerTask methods to be called through the Connection
Change all references to addTimerTask(...) and purgeTimerTasks() to go through the Connection object.
This removes any direct references to the TimerTask related methods in Driver from the rest of the driver.
The new addTimerTask(...) and purgeTimerTask() methods in AbstractJdbc2Connection still use the shared static Timer in Driver.
Avoid classloader leak from dangling Timer thread
Previously, when the driver is loaded a static Timer is allocated in the Driver class itself.
That Timer is shared by any TimerTask that gets scheduled (ex: for Statement timeout handling).
If the driver is later unloaded (e.g. in a containerized environment) the Timer thread is never stopped.
Additionally, the Timer thread itself may maintain references to other driver classes that will never get unloaded.

The changed classes now use a SharedTimer that maintains a reference count of requests for Timer.
A Connection requests a Timer via the getTimer() method.
This increments a reference count to keep track of the number of outstanding references.
The Timer reference is marked as returned by calling releaseTimer() in Connection.close().
If there are no outstanding references left, the internal shared Timer object is closed so it's resources can be GCed.

Well behaved JDBC code that calls close() on all Connections should be able to unload the Driver without leaving dangling references.
@ringerc

This comment has been minimized.

Show comment
Hide comment
@ringerc

ringerc Oct 6, 2014

Member

On 10/05/2014 11:36 PM, Sehrope Sarkuni wrote:

This patch is an alternative to #188
#188. It fixes the ClassLoader
leak when the driver is repeatedly loaded/unloaded (/e.g. in a
containerized environment/).

The PR is split into three commits. The first two don't have any
functional changes to the driver, they just rename some variables to
clarify their usage and move TimerTask related method calls from
Statement to go through the Connection rather than statically against
the Driver.

Thanks for keeping it well structured.

The third commit adds a reference counting SharedTimer with two methods,
|getTimer()| and |releaseTimer()|. The former creates a Timer on demand
and the latter calls |.cancel()| on the underlying Timer if the
reference count goes to zero (/to allow it to be GCed/) before clearing
it to null.

I like this approach a lot more, though I'm still concerned about direct
use of the Timer API in PgJDBC.

I added one basic test that creates two connections, two statements, and
issues two cancels to test that only a single Timer is created and
subsequently deallocated. It and the the rest of the driver tests run
successfully.

It'll be important to test how this behaves when an exception is thrown
on the timer thread.

I think some additional tests are needed, particularly one that
simulates loading/unloading the driver via a custom ClassLoader. It
would need to run separately from the rest of the driver unit tests as
it'll need to run without the driver on the classpath (/as it'll load it
dynamically/). If nobody else gets around to this I'll see if I can
write one some time next week.

It may prove easier to use Arquillian to do this against an embedded
Glassfish or similar.

I'd also like if someone could walk through the Timer
allocation/deallocation steps to verify the happens-before and memory
semantics of it. Again, all the tests pass but another pair of eyes
would be appreciated.

I'll try, time permitting. Things are busy.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Member

ringerc commented Oct 6, 2014

On 10/05/2014 11:36 PM, Sehrope Sarkuni wrote:

This patch is an alternative to #188
#188. It fixes the ClassLoader
leak when the driver is repeatedly loaded/unloaded (/e.g. in a
containerized environment/).

The PR is split into three commits. The first two don't have any
functional changes to the driver, they just rename some variables to
clarify their usage and move TimerTask related method calls from
Statement to go through the Connection rather than statically against
the Driver.

Thanks for keeping it well structured.

The third commit adds a reference counting SharedTimer with two methods,
|getTimer()| and |releaseTimer()|. The former creates a Timer on demand
and the latter calls |.cancel()| on the underlying Timer if the
reference count goes to zero (/to allow it to be GCed/) before clearing
it to null.

I like this approach a lot more, though I'm still concerned about direct
use of the Timer API in PgJDBC.

I added one basic test that creates two connections, two statements, and
issues two cancels to test that only a single Timer is created and
subsequently deallocated. It and the the rest of the driver tests run
successfully.

It'll be important to test how this behaves when an exception is thrown
on the timer thread.

I think some additional tests are needed, particularly one that
simulates loading/unloading the driver via a custom ClassLoader. It
would need to run separately from the rest of the driver unit tests as
it'll need to run without the driver on the classpath (/as it'll load it
dynamically/). If nobody else gets around to this I'll see if I can
write one some time next week.

It may prove easier to use Arquillian to do this against an embedded
Glassfish or similar.

I'd also like if someone could walk through the Timer
allocation/deallocation steps to verify the happens-before and memory
semantics of it. Again, all the tests pass but another pair of eyes
would be appreciated.

I'll try, time permitting. Things are busy.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

@sehrope

This comment has been minimized.

Show comment
Hide comment
@sehrope

sehrope Oct 6, 2014

Contributor

I like this approach a lot more, though I'm still concerned about direct
use of the Timer API in PgJDBC.

Yes I don't particularly like manually managing Timers like this but The Timer API usage was already there. The difference now is that we're cleaning up behind ourselves when we're done.

It'll be important to test how this behaves when an exception is thrown
on the timer thread.

Looks like you're right on this one, an exception causes the Timer to stop running. See the line "If f the timer's task execution thread terminates unexpectedly..." in the Javadocs: http://docs.oracle.com/javase/7/docs/api/java/util/Timer.html

Maybe we should use a ScheduledThreadPoolExecutor. It's Java 1.5+ only but I suppose that's okay as at this point even 1.6 is out of support. Anybody that wants to use this new resource cleanup feature better be running something supported.

Alternatively (or in addition to that), we could have an Executor for running the actual cancellation tasks so they happen in separate Threads. This goes down the rabbit hole of doing more thread management directly in the driver but I think it may be a better approach.

With a single thread handling cancellations, you can only cancel one query at a time. Cancellations are meant to be quick but if something slows it down (e.g. server down or networking issue) then it'll hang and delay all other query cancellations. In the common case where an app is connecting to a single database this isn't much of an issue; i.e. if one can't connect, the rest probably can't either. However, if you're connecting to separate databases with a single shared Timer thread for cancellation then one bad connection will starve all the rest.

I'll try, time permitting. Things are busy.

No rush on this for me. My involvement in helping fix this is because I want it done right, not because I want it done soon.

Contributor

sehrope commented Oct 6, 2014

I like this approach a lot more, though I'm still concerned about direct
use of the Timer API in PgJDBC.

Yes I don't particularly like manually managing Timers like this but The Timer API usage was already there. The difference now is that we're cleaning up behind ourselves when we're done.

It'll be important to test how this behaves when an exception is thrown
on the timer thread.

Looks like you're right on this one, an exception causes the Timer to stop running. See the line "If f the timer's task execution thread terminates unexpectedly..." in the Javadocs: http://docs.oracle.com/javase/7/docs/api/java/util/Timer.html

Maybe we should use a ScheduledThreadPoolExecutor. It's Java 1.5+ only but I suppose that's okay as at this point even 1.6 is out of support. Anybody that wants to use this new resource cleanup feature better be running something supported.

Alternatively (or in addition to that), we could have an Executor for running the actual cancellation tasks so they happen in separate Threads. This goes down the rabbit hole of doing more thread management directly in the driver but I think it may be a better approach.

With a single thread handling cancellations, you can only cancel one query at a time. Cancellations are meant to be quick but if something slows it down (e.g. server down or networking issue) then it'll hang and delay all other query cancellations. In the common case where an app is connecting to a single database this isn't much of an issue; i.e. if one can't connect, the rest probably can't either. However, if you're connecting to separate databases with a single shared Timer thread for cancellation then one bad connection will starve all the rest.

I'll try, time permitting. Things are busy.

No rush on this for me. My involvement in helping fix this is because I want it done right, not because I want it done soon.

@metlos

This comment has been minimized.

Show comment
Hide comment
@metlos

metlos Oct 6, 2014

I think some additional tests are needed, particularly one that simulates loading/unloading the driver via a custom ClassLoader. It would need to run separately from the rest of the driver unit tests as it'll need to run without the driver on the classpath (as it'll load it dynamically). If nobody else gets around to this I'll see if I can write one some time next week.

In one of my previous attempts, I had a simple test that had such custom classloader. It tried to load classes of the driver several times (using this custom classloader that always loaded classes from org.postgresql regardless of whether the parent classloader had them or not). I then naively tried to create so much churn on the heap so as to force the GC to collect the drivers and then checked if indeed they all disappeared (which would not be the case with the eagerly created Timer thread as it is in the current codebase).

If you find it useful, you can reuse/adapt it to your needs. I ressurected the code to: https://gist.github.com/metlos/59c9cf891482f14d1784

metlos commented Oct 6, 2014

I think some additional tests are needed, particularly one that simulates loading/unloading the driver via a custom ClassLoader. It would need to run separately from the rest of the driver unit tests as it'll need to run without the driver on the classpath (as it'll load it dynamically). If nobody else gets around to this I'll see if I can write one some time next week.

In one of my previous attempts, I had a simple test that had such custom classloader. It tried to load classes of the driver several times (using this custom classloader that always loaded classes from org.postgresql regardless of whether the parent classloader had them or not). I then naively tried to create so much churn on the heap so as to force the GC to collect the drivers and then checked if indeed they all disappeared (which would not be the case with the eagerly created Timer thread as it is in the current codebase).

If you find it useful, you can reuse/adapt it to your needs. I ressurected the code to: https://gist.github.com/metlos/59c9cf891482f14d1784

@metlos

This comment has been minimized.

Show comment
Hide comment
@metlos

metlos Oct 6, 2014

Maybe we should use a ScheduledThreadPoolExecutor.

+1. The number of threads in that thread pool could be configured using a sys prop with some conservative default.

Is there any consequence of 2 cancellations running in parallel on a single connection?

metlos commented Oct 6, 2014

Maybe we should use a ScheduledThreadPoolExecutor.

+1. The number of threads in that thread pool could be configured using a sys prop with some conservative default.

Is there any consequence of 2 cancellations running in parallel on a single connection?

private static AtomicInteger timerCount = new AtomicInteger(0);
private Logger log;
private volatile Timer timer = null;

This comment has been minimized.

@metlos

metlos Oct 6, 2014

Why is this field volatile? Since it is only ever accessed from synchronized methods, there should be no need for that?

@metlos

metlos Oct 6, 2014

Why is this field volatile? Since it is only ever accessed from synchronized methods, there should be no need for that?

private Logger log;
private volatile Timer timer = null;
private AtomicInteger refCount = new AtomicInteger(0);

This comment has been minimized.

@metlos

metlos Oct 6, 2014

I suggest making the getRefCount() method synchronized instead of using AtomicInteger here.

Since the count is only ever accessed or changed in synchronized methods, there's no need for this to be an atomic counter. The only reason for that would be the "unsynchronized" getRefCount() method but that's only used in test code and hence not performance-critical.

If getRefCount() were synchronized, this field could be an ordinary int.

@metlos

metlos Oct 6, 2014

I suggest making the getRefCount() method synchronized instead of using AtomicInteger here.

Since the count is only ever accessed or changed in synchronized methods, there's no need for this to be an atomic counter. The only reason for that would be the "unsynchronized" getRefCount() method but that's only used in test code and hence not performance-critical.

If getRefCount() were synchronized, this field could be an ordinary int.

@sehrope

This comment has been minimized.

Show comment
Hide comment
@sehrope

sehrope Oct 14, 2014

Contributor

Yes if everything is marked synchronized then the volatile qualifier could be removed. At this point I think it's a habit for me to mark anything that gets updated from multiple threads volatile even if it's wrapped as such. When I get a chance to spend more time on this I'll try out removing it and see how it looks.

Contributor

sehrope commented Oct 14, 2014

Yes if everything is marked synchronized then the volatile qualifier could be removed. At this point I think it's a habit for me to mark anything that gets updated from multiple threads volatile even if it's wrapped as such. When I get a chance to spend more time on this I'll try out removing it and see how it looks.

@ringerc

This comment has been minimized.

Show comment
Hide comment
@ringerc

ringerc Dec 1, 2014

Member

Any further updates on this?

Member

ringerc commented Dec 1, 2014

Any further updates on this?

@sehrope

This comment has been minimized.

Show comment
Hide comment
@sehrope

sehrope Dec 1, 2014

Contributor

I haven't had a chance yet to add betters tests for this as it's a bit more complicated then other tests (because of the external classloader). The small test I have there is better than nothing but something more extensive would be nice.

@metlos: If you're still interested in getting this merged do you think you could come up with something that simulates your use case?

Contributor

sehrope commented Dec 1, 2014

I haven't had a chance yet to add betters tests for this as it's a bit more complicated then other tests (because of the external classloader). The small test I have there is better than nothing but something more extensive would be nice.

@metlos: If you're still interested in getting this merged do you think you could come up with something that simulates your use case?

@metlos

This comment has been minimized.

Show comment
Hide comment
@metlos

metlos Dec 3, 2014

My usecase was quite well described by https://gist.github.com/metlos/59c9cf891482f14d1784 as mentioned in one of my previous comments.

We discovered this issue in http://rhq.jboss.org where we have a plugin for monitoring Postgres. This plugin lives in a sort of container with a child-first classloader so that individual plugins are isolated.

On certain occasions the container stops, reloads and restarts all the plugins (without actually stopping the JVM). After a couple of such occasions we've noticed the increased permgen due to classes not being unloaded and tracked it down to this issue.

metlos commented Dec 3, 2014

My usecase was quite well described by https://gist.github.com/metlos/59c9cf891482f14d1784 as mentioned in one of my previous comments.

We discovered this issue in http://rhq.jboss.org where we have a plugin for monitoring Postgres. This plugin lives in a sort of container with a child-first classloader so that individual plugins are isolated.

On certain occasions the container stops, reloads and restarts all the plugins (without actually stopping the JVM). After a couple of such occasions we've noticed the increased permgen due to classes not being unloaded and tracked it down to this issue.

@sehrope

This comment has been minimized.

Show comment
Hide comment
@sehrope

sehrope Dec 4, 2014

Contributor

I added another test based upon the one in the linked gist. I changed it a bit to:

  • Load the driver class via the custom classloader
  • Instantiate the driver
  • Add a WeakReference to the driver (not the driver class) to a map
  • Create a connection (uses the same defaults as the rest of tests)
  • Execute a basic query and fetch the result, SELECT 1
  • Execute a query that will hang, SELECT pg_sleep(10)
  • Create a thread to cancel the previous statement after 250 ms (executing and cancelling a statement should load the new cancellation code with the shared timer)
  • Close everything
  • Deregister the driver from the driver manager
  • Repeat ten times
  • Verify that the WeakReferences all return back null

... the problem is it doesn't quite work. The Driver instances themselves are getting GCed so I'm fairly certain the new shared timer code is working fine. The problem is the Driver classes themselves don't get GCed.

I tried running the test stand alone (rather than with everything else on the classpath) and I get the same result. I think it may be because the JVM doesn't actually GC classes (vs objects)... they stay in PermGen. I've been testing it with v1.7 on OSX and have been fiddiling with some JVM -XX flags but no luck.

Any ideas?

I haven't updated the PR with the new test as the bulk of it is debug statements from me experimenting with it. You can find it here: https://gist.github.com/sehrope/3b4e11124f6e27d9e680

You can run it via:

$ ant -lib lib compile testjar
% java -cp lib/junit-4.0.jar:jars/postgresql-tests.jar -Dusername=test -Dpassword=test -Dport=10093 -Ddatabase=test -DdriverArtifact=jars/postgresql-9.4-1200.jdbc41.jar org.postgresql.test.leak.ClassLoaderLeakTest

(might need to replace the port or add a host if your're not testing locally)

Contributor

sehrope commented Dec 4, 2014

I added another test based upon the one in the linked gist. I changed it a bit to:

  • Load the driver class via the custom classloader
  • Instantiate the driver
  • Add a WeakReference to the driver (not the driver class) to a map
  • Create a connection (uses the same defaults as the rest of tests)
  • Execute a basic query and fetch the result, SELECT 1
  • Execute a query that will hang, SELECT pg_sleep(10)
  • Create a thread to cancel the previous statement after 250 ms (executing and cancelling a statement should load the new cancellation code with the shared timer)
  • Close everything
  • Deregister the driver from the driver manager
  • Repeat ten times
  • Verify that the WeakReferences all return back null

... the problem is it doesn't quite work. The Driver instances themselves are getting GCed so I'm fairly certain the new shared timer code is working fine. The problem is the Driver classes themselves don't get GCed.

I tried running the test stand alone (rather than with everything else on the classpath) and I get the same result. I think it may be because the JVM doesn't actually GC classes (vs objects)... they stay in PermGen. I've been testing it with v1.7 on OSX and have been fiddiling with some JVM -XX flags but no luck.

Any ideas?

I haven't updated the PR with the new test as the bulk of it is debug statements from me experimenting with it. You can find it here: https://gist.github.com/sehrope/3b4e11124f6e27d9e680

You can run it via:

$ ant -lib lib compile testjar
% java -cp lib/junit-4.0.jar:jars/postgresql-tests.jar -Dusername=test -Dpassword=test -Dport=10093 -Ddatabase=test -DdriverArtifact=jars/postgresql-9.4-1200.jdbc41.jar org.postgresql.test.leak.ClassLoaderLeakTest

(might need to replace the port or add a host if your're not testing locally)

@metlos

This comment has been minimized.

Show comment
Hide comment
@metlos

metlos Dec 4, 2014

I commented on your gist with my theory for why the GC of classes is not working for you.

metlos commented Dec 4, 2014

I commented on your gist with my theory for why the GC of classes is not working for you.

@sehrope

This comment has been minimized.

Show comment
Hide comment
@sehrope

sehrope Dec 4, 2014

Contributor

Creating extra objects doesn't make a difference. I get the same results when I add the for loop.

Contributor

sehrope commented Dec 4, 2014

Creating extra objects doesn't make a difference. I get the same results when I add the for loop.

davecramer added a commit that referenced this pull request Dec 26, 2014

Merge pull request #197 from sehrope/cl-leak
Fix Timer thread classloader leak
Thanks @sehrope !

@davecramer davecramer merged commit 75dc417 into pgjdbc:master Dec 26, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment