MastersListener: Fixed several issues around thread pool usage (v2) #60

Merged
merged 8 commits into mariadb-corporation:master from thread-fix3 on Dec 31, 2015

Conversation

jentfoo
Contributor

@jentfoo jentfoo commented Dec 30, 2015

This replaces my previous pull request #57 (which I will now close). @rusher and I iterated on my initial proposal to arrive at this. It's too much to explain here, but hopefully the previous pull request in combination with the commit messages will give enough clarity. Feel free to ask me questions as well.

As previously, all my contributions here are provided under the BSD-new license.

Mike Jensen and others added 6 commits December 15, 2015 13:27
A bit of a sizable commit; several things are included, though they are all closely related.
The following defects in AbstractMastersListener were fixed:
* Thread leakage which results in thread counts growing without bound until the server runs out of memory
    The issue here is that ScheduledThreadPoolExecutor does not clean up started threads once the pool is dropped for garbage collection (arguably a defect in the Java class library; this is due to ThreadPoolExecutor's inner class "Worker" not being static, so every live worker thread holds a reference to the pool, keeping it from ever being collected).

    The fix here was to use a common pool for the ping and fail loop that has a normal start/shutdown lifecycle (see the first sketch after this list).

* Fixed an issue where the FailLoop may fail to stop, as well as allowing multiple FailLoops to be started
    The condition is possible when:
    Thread A enters 'launchFailLoopIfNotlaunched' and progresses to the point where it has transitioned the atomic boolean, but then one of two things happens:
         Either thread A is de-scheduled by the OS before 'scheduledFailover' is set
         ...OR... 'scheduledFailover' is set, but since it's not volatile the update is not witnessed by thread B (sadly, because of the first condition, this could not be solved by simply making the field volatile)
    Thread B then enters 'stopFailover'; it transitions the AtomicBoolean back to false, but does not see the set 'scheduledFailover', so it does not remove the task.  I suspect this is why you're doing the null check: previously you would have gotten a NullPointerException, now you're just leaking the task.

    This problem is further compounded because it is then possible for another invocation of 'launchFailLoopIfNotlaunched' to schedule a second task.  The null check that existed here was likely a witness of this race condition.

    The solution here was to use an "AtomicReference" to hold an instance of the "FailLoop", which can be set before it is scheduled.  Once set there, we let the FailLoop track its own scheduled future.  It may need to block waiting for the future to be set in a very tight race, but otherwise unscheduling is handled internally by the task (see the second sketch after this list).
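
For illustration, here is a minimal sketch of the leak and the shared-pool fix from the first bullet (hypothetical names; not the actual connector code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Leaky pattern: each listener creates its own pool and drops it without
// shutdown.  ThreadPoolExecutor's non-static "Worker" inner class means
// every live worker thread references the pool, so neither the threads
// nor the pool are ever reclaimed.
class LeakyListener {
  private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
  // ... listener is discarded without ever calling pool.shutdown()
}

// Fixed pattern: all listeners share one pool with an explicit
// start/shutdown lifecycle.
class SharedPingScheduler {
  private static final ScheduledExecutorService POOL = Executors.newScheduledThreadPool(2);

  static void schedulePing(Runnable ping, long periodMillis) {
    POOL.scheduleWithFixedDelay(ping, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
  }

  static void shutdown() {  // invoked once, when the pool is no longer needed
    POOL.shutdownNow();
  }
}
```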

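And a sketch of the AtomicReference approach from the second bullet (again with hypothetical names; the real FailLoop does actual failover work):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class FailoverCoordinator {
  private static final ScheduledExecutorService POOL = Executors.newScheduledThreadPool(1);
  // Replaces the AtomicBoolean + non-volatile 'scheduledFailover' pair.
  private final AtomicReference<FailLoop> failLoop = new AtomicReference<>();

  void launchFailLoopIfNotLaunched() {
    FailLoop loop = new FailLoop();
    // compareAndSet guarantees at most one FailLoop is ever installed
    if (failLoop.compareAndSet(null, loop)) {
      loop.setFuture(POOL.scheduleWithFixedDelay(loop, 0, 250, TimeUnit.MILLISECONDS));
    }
  }

  void stopFailover() {
    FailLoop loop = failLoop.getAndSet(null);
    if (loop != null) {
      loop.unschedule();  // the loop cancels its own future
    }
  }

  static class FailLoop implements Runnable {
    private final CountDownLatch futureSet = new CountDownLatch(1);
    private volatile ScheduledFuture<?> future;

    void setFuture(ScheduledFuture<?> f) {
      future = f;
      futureSet.countDown();
    }

    void unschedule() {
      try {
        futureSet.await();  // may block briefly in the tight race described above
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      future.cancel(false);
    }

    @Override
    public void run() { /* attempt failover / check master state */ }
  }
}
```
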
Other changes:
* Also changed: library users now have the ability to set their own thread pools.  java.util.concurrent's thread pools are deficient in many ways (performance being one of them).
* Threadly was added as a TEST dependency.  I did this because, when writing unit tests for the new 'SchedulerServiceProviderHolder', I was being lazy and did not want to re-implement "TestRunnable"'s trivial blocking/verification.  I also updated a couple of unit tests: one was purely simplified, and in another I showed the use of threadly's "TestableScheduler" for deterministic execution.
… loop

The return statement was missing after the unschedule completed, resulting in an infinite loop.
Because the task should NOT be terminated except through "stopFailover", the failLoop AtomicReference was also made private.  This is critical because calls to "blockTillTerminated" that do not go through "stopFailover" will also result in an infinite loop.
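
A condensed sketch of the flow in question (hypothetical shape; the actual code is in the diff):

```java
class FailLoopTerminationSketch {
  private volatile boolean unscheduled;

  void markUnscheduled() {
    unscheduled = true;
  }

  void blockTillTerminated() throws InterruptedException {
    while (true) {
      if (unscheduled) {
        return;  // the previously missing return: without it the loop spun forever
      }
      Thread.sleep(10);
    }
  }
}
```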
This moves the thread safety logic for the ConnectionValidator thread pool, and the listener tracking, out of "AbstractMastersListener".  This IMO makes it a bit easier to reason about.
AbstractMastersListener now just has a final instance to which it adds and removes listeners.
ConnectionValidator has a final static pool, with a very long thread timeout to deal with the rare condition where the class may never be used again, allowing for garbage cleanup (see the sketch below).  This IMO is easier than removing the pool when it is no longer used and recreating it manually.
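
A minimal sketch of that pool setup, assuming a ScheduledThreadPoolExecutor and an illustrative timeout value:

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// The pool is never explicitly torn down; instead, idle core threads are
// allowed to time out, so if the class stops being used its threads exit
// on their own and everything becomes eligible for garbage cleanup.
class ConnectionValidatorPool {
  static final ScheduledThreadPoolExecutor POOL = new ScheduledThreadPoolExecutor(1);

  static {
    POOL.setKeepAliveTime(5, TimeUnit.MINUTES);  // "very long" idle timeout
    POOL.allowCoreThreadTimeOut(true);           // core threads may exit too
  }
}
```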

This change also fixes two defects:
* There was a possible race condition due to 'connectionValidationLock' being per-instance while trying to guard removal of the static 'connectionValidationLoop'.  We would check whether there were any subscribed listeners and, if not, remove the pool.  But since the lock is not shared, another listener may be adding itself at the same time we are removing the pool.  This was a large reason to change it so the pool is never removed; guarding removal correctly would be hard without locking (and I don't think we want a central lock here).  So in my solution the pool is just never removed, and the thread timeout covers the very unlikely (but possible) case that someone upgraded the library without restarting the VM, or decided at runtime to no longer use MariaDB (also unlikely, but the timeout seems reasonable enough, so why not?).
* Since all listeners share the same static pool and task for ConnectionValidator, it was previously assumed that all timeouts were configured the same (and that value was used to determine the run frequency).  While this is likely correct for 90% of cases, it does not handle differing configuration values.  This was corrected by not using the scheduler to run the task at a fixed rate, but rather letting the task schedule itself again (or not, if there is nothing to do).  It schedules itself at the greatest common divisor of all configured times, so we run at the highest frequency necessary to satisfy every configured timeout, with a minimum delay of 100 millis between checks (see the sketch after this list).
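
A sketch of the self-rescheduling, GCD-based approach from the second bullet (hypothetical names; validation logic elided):

```java
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ValidationLoop implements Runnable {
  private final ScheduledExecutorService pool;
  private final List<Long> configuredTimeoutsMillis;  // one entry per subscribed listener

  ValidationLoop(ScheduledExecutorService pool, List<Long> configuredTimeoutsMillis) {
    this.pool = pool;
    this.configuredTimeoutsMillis = configuredTimeoutsMillis;
  }

  @Override
  public void run() {
    // ... validate connections whose configured timeout has elapsed ...
    if (!configuredTimeoutsMillis.isEmpty()) {  // nothing to do -> don't reschedule
      long delay = Math.max(100, gcd(configuredTimeoutsMillis));
      pool.schedule(this, delay, TimeUnit.MILLISECONDS);
    }
  }

  // Greatest common divisor of all configured timeouts: the highest
  // frequency needed to honor every listener's configuration.
  private static long gcd(List<Long> values) {
    long result = 0;
    for (long v : values) {
      result = gcd(v, result);
    }
    return result;
  }

  private static long gcd(long a, long b) {
    return b == 0 ? a : gcd(b, a % b);
  }
}
```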
…mic scheduler is maintained

This fixes a couple of race conditions in MastersSlavesListener.
* Under some conditions, actions could be attempted on a null dynamically sized scheduler.
* It was possible to start too many FailoverLoop runnables, because the size was checked against an already incremented pool size and there was no locking between that size check and the add.  Several threads may each read a size one less than the total desired, then all of them start fail loops, leaving the count greater than the maximum.

The solution to these issues was:
* The pool is no longer ever unset.  It is just a final static instance; we may scale it to zero (and if we do, it will then be eligible for garbage collection).
* FailoverLoops are increased and decreased from within a single recurring task that runs every 250 millis.  This means there may be up to a 250 millisecond delay before failover loops are incremented after a listener is created (see the sketch after this list).
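
A sketch of that single-adjuster pattern (hypothetical names): because only this one recurring task ever adds or removes loops, the check-then-add race above cannot occur.

```java
import java.util.ArrayList;
import java.util.List;

// Scheduled every 250 millis on the static pool; it alone owns the list.
class FailoverLoopAdjuster implements Runnable {
  private final List<FailoverLoop> running = new ArrayList<>();

  @Override
  public void run() {
    int desired = desiredLoopCount();     // derived from the current listener count
    while (running.size() < desired) {
      running.add(FailoverLoop.start());  // may lag a new listener by up to 250 ms
    }
    while (running.size() > desired) {
      running.remove(running.size() - 1).stop();
    }
  }

  private int desiredLoopCount() {
    return 1;  // placeholder; the real value scales with the listener count
  }

  static class FailoverLoop {
    static FailoverLoop start() { return new FailoverLoop(); }
    void stop() { /* signal the loop to exit */ }
  }
}
```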

I also changed the DynamicSizedSchedulerInterface to be more generic; in the future, schedulers may be wanted in other places.  Only MastersSlavesListener is interested in scaling thread counts based off listener counts, so that math is now done in MastersSlavesListener, and the dynamic scheduler just gets simple pool size requests.
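
The reshaped interface might look roughly like this (a guess at the shape, not the committed signature):

```java
import java.util.concurrent.ScheduledExecutorService;

// The scheduler only honors plain pool-size requests; any math that maps
// listener counts to thread counts stays in MastersSlavesListener.
interface DynamicSizedSchedulerSketch extends ScheduledExecutorService {
  void setPoolSize(int newSize);
}
```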

The SchedulerServiceProviderHolderTest for shutdown behavior was removed.  This unit test is no longer valid, since pools are now expected to be shut down by normal means.
@jentfoo
Contributor Author

jentfoo commented Dec 30, 2015

I made a force push to fix the unit tests; hopefully all is good now.

rusher added a commit that referenced this pull request Dec 31, 2015
MastersListener: Fixed several issues around thread pool usage (v2)
@rusher rusher merged commit ba472fe into mariadb-corporation:master Dec 31, 2015
@jentfoo jentfoo deleted the thread-fix3 branch January 5, 2016 21:29