MastersListener: Fixed several issues around thread pool usage (v2) #60
Merged
This is a fairly sizable commit; several things are included, though they are all closely related.
The following defects in AbstractMastersListener were fixed:
* Thread leakage that causes the thread count to grow without bound until the server runs out of memory.
The issue here is that ScheduledThreadPoolExecutor does not clean up started threads even once the executor is allowed to be garbage collected (arguably a defect in the Java class library: the inner "Worker" class of ThreadPoolExecutor is not static, so each worker holds a reference back to the pool).
The fix here was to use a common pool for the ping and fail loops, one with a normal start/shutdown lifecycle.
* Fixed an issue where the FailLoop could fail to stop, and where multiple FailLoops could be started.
The condition is possible when:
Thread A enters 'launchFailLoopIfNotlaunched' and progresses to the point where it has transitioned the atomic boolean, but then one of two things happens:
either thread A is de-scheduled by the OS before 'scheduledFailover' is set,
...OR... 'scheduledFailover' is set, but since it is not volatile the update is not witnessed by thread B (sadly, because of the first condition, this could not be solved just by making the field volatile).
Thread B then enters 'stopFailover' and transitions the AtomicBoolean back to false, but does not see the set 'scheduledFailover', so it does not remove the task. I suspect this is why you are doing the null check; previously you would have gotten a NullPointerException, now you are just leaking the task.
The problem is further compounded because another invocation of 'launchFailLoopIfNotlaunched' can then schedule a second task. The null check that existed here was likely a symptom of this race condition.
The solution was to use an AtomicReference to hold the FailLoop instance, which can be set before it is scheduled. Once set there, we let the FailLoop track its own ScheduledFuture. In a very tight race it may need to block briefly until the future is set, but otherwise unscheduling is handled inside the task itself.
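Roughly, the shape of the AtomicReference-based fix looks like the sketch below. This is a simplified illustration, not the connector's actual code: the class names, the `isLaunched`/`shutdown` helpers, and the 250 ms interval are all made up for the example. The key points it demonstrates are that only the thread winning the CAS can schedule a task, and that the FailLoop owns its own ScheduledFuture.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch: the FailLoop tracks its own ScheduledFuture, and the
// listener only swaps an AtomicReference, so start/stop cannot race.
class FailLoopSketch implements Runnable {
    private final CountDownLatch futureSet = new CountDownLatch(1);
    private volatile ScheduledFuture<?> self;

    void setFuture(ScheduledFuture<?> f) {
        self = f;
        futureSet.countDown();
    }

    void unschedule() throws InterruptedException {
        futureSet.await();     // tolerate the tight window before the future is set
        self.cancel(false);    // cancellation handled by the task's own future
    }

    @Override public void run() { /* ping masters, attempt failover, etc. */ }
}

class ListenerSketch {
    private final AtomicReference<FailLoopSketch> failLoop = new AtomicReference<>();
    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);

    void launchFailLoopIfNotLaunched() {
        FailLoopSketch loop = new FailLoopSketch();
        // Only the thread that wins the CAS schedules a task, so at most
        // one FailLoop can ever be active at a time.
        if (failLoop.compareAndSet(null, loop)) {
            loop.setFuture(pool.scheduleWithFixedDelay(loop, 0, 250, TimeUnit.MILLISECONDS));
        }
    }

    void stopFailover() throws InterruptedException {
        FailLoopSketch loop = failLoop.getAndSet(null);
        if (loop != null) {
            loop.unschedule();
        }
    }

    boolean isLaunched() {
        return failLoop.get() != null;
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```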
Other changes:
* Also added: the ability for library users to set their own thread pools. java.util.concurrent's thread pools are deficient in many ways (performance being one of them).
* Threadly was added as a TEST dependency. I did this because, when writing unit tests for the new 'SchedulerServiceProviderHolder', I was being lazy and did not want to re-implement "TestRunnable"'s trivial blocking/verification. I also updated a couple of unit tests: one was simply simplified, while another shows the use of threadly's "TestableScheduler" for deterministic execution.
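A pluggable-pool holder along the lines described above might look like the following sketch. The interface and method names here are hypothetical illustrations, not the actual 'SchedulerServiceProviderHolder' API; the sketch only shows the general pattern of a static provider slot with a java.util.concurrent default.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical sketch of a provider holder that lets library users swap in
// their own scheduler implementation (names are illustrative only).
final class SchedulerProviderSketch {
    interface Provider {
        ScheduledExecutorService newScheduler(int minThreads);
    }

    // Default falls back to a plain java.util.concurrent pool.
    private static final Provider DEFAULT = Executors::newScheduledThreadPool;
    private static volatile Provider provider = DEFAULT;

    static void setProvider(Provider p) {
        // Passing null restores the default provider.
        provider = (p == null) ? DEFAULT : p;
    }

    static ScheduledExecutorService getScheduler(int minThreads) {
        return provider.newScheduler(minThreads);
    }
}
```

A user would call `setProvider` once at startup to route all of the library's scheduling through their own pool implementation.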
… loop
The return statement was missing after the unschedule completed, resulting in an infinite loop. Because the task should NOT be terminated except through "stopFailover", the failLoop AtomicReference was also made private. This is critical because calls to "blockTillTerminated" that bypass "stopFailover" will also result in an infinite loop.
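To illustrate the control-flow bug described in this commit, here is a simplified sketch (the class and field names are invented for the example, not the connector's actual code): once unscheduling is observed, the method must return, otherwise the wait loop spins until the timeout (or forever, if there is none).

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the fixed control flow: the early return after a completed
// unschedule is what lets the wait loop terminate.
class TerminationSketch {
    private final AtomicBoolean unscheduled = new AtomicBoolean(false);

    void unschedule() {
        unscheduled.set(true);
    }

    // Returns true once the task has been unscheduled, false on timeout.
    boolean blockTillTerminated(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (unscheduled.get()) {
                return true;  // the missing return: without it this loop never exits
            }
            Thread.sleep(5);
        }
        return false;
    }
}
```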
…riadb-connector-j into fullcontact-thread_pool_fixes
This moves the thread-safety logic for the ConnectionValidator thread pool and listener tracking out of "AbstractMastersListener". This IMO makes it a bit easier to reason about: AbstractMastersListener now just has a final instance to which it adds and removes listeners. ConnectionValidator has a final static pool with a very long thread timeout to deal with the rare condition where the class is never used again, allowing for garbage cleanup; this IMO is easier than removing the pool when it is no longer used and recreating it manually. This also fixes two defects:
* There was a possible race condition because 'connectionValidationLock' is per instance but tries to guard removal of the static 'connectionValidationLoop'. We would check whether there were any subscribed listeners and, if not, remove the pool. But since the lock is not shared, another listener may be adding itself at the same time we are removing the pool. This was a large reason to change the design so that the pool is never removed; fixing it otherwise would be hard without locking (and I don't think we want a central lock here). So in my solution the pool is simply never removed; the thread timeout preserves the very unlikely (but possible) scenario where someone upgraded the library without restarting the VM, or decided at runtime to no longer use MariaDB (also unlikely, but the timeout seems reasonable enough, so why not?).
* Since all listeners share the same static pool and task for ConnectionValidator, it was previously assumed that all timeout values were configured identically (and that assumption was used to determine the run frequency). While this is likely correct for 90% of cases, it does not handle differing configuration values. This was corrected by not using the scheduler to run the task recurringly, but instead letting the task schedule itself again (or not, if there is nothing to do).
It will schedule itself at the greatest common divisor of all configured times (so we run at the highest frequency necessary to satisfy all configured timeouts, with a minimum delay of 100 millis between checks).
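The delay computation described above can be sketched as follows. This is an illustration of the GCD-with-floor idea only; the method names and the array-based input are assumptions, not the connector's actual code.

```java
// Sketch: run at the greatest common divisor of all configured validation
// intervals, floored at 100 ms, so one task satisfies every configuration.
final class ValidationDelaySketch {
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = b;
            b = a % b;
            a = t;
        }
        return a;
    }

    static long nextDelayMillis(long[] configuredIntervals) {
        long result = 0;
        for (long interval : configuredIntervals) {
            result = gcd(result, interval);  // gcd(0, x) == x seeds the fold
        }
        return Math.max(result, 100);  // minimum 100 ms between checks
    }
}
```

For example, listeners configured at 3000 ms and 2000 ms would be served by a task rescheduling itself every 1000 ms.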
…mic scheduler is maintained
This fixes a couple of race conditions in MastersSlavesListener:
* Under some conditions, actions could be attempted on a null dynamically sized scheduler.
* It was possible to start too many FailoverLoop runnables, because we were checking the size against an already-incremented pool size, and because there is no locking between that size check and the add. Several threads may each read a size one less than the total desired, but then all of them start fail loops, leaving the count greater than the maximum.
The solution to these issues was:
* The pool is never unset. It is just a final static instance; we may scale it to zero (and if we do, it then becomes eligible for garbage collection).
* FailoverLoops are increased and decreased from within a single recurring task that runs every 250 millis. This means there may be up to a 250 millisecond delay before failover loops are incremented after a listener is created.
I also changed DynamicSizedSchedulerInterface to be more generic. Schedulers may be wanted in other places in the future, and only MastersSlavesListener is interested in scaling thread counts based on listener counts. So that math is now done in MastersSlavesListener, and the dynamic scheduler just receives simple pool-size requests. The SchedulerServiceProviderHolderTest for shutdown behavior was removed; that unit test is no longer valid since pools are now expected to be shut down by normal means.
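The separation described above might be sketched like this. The class names, the sizing policy in `desiredLoopCount`, and the cap of 8 are all invented for illustration; only the overall shape matches the description: the listener-count math lives in the listener class, and a single recurring task hands the dynamic scheduler a plain pool-size request, avoiding the check-then-act race.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: one recurring task resizes the pool, so concurrent listener
// registrations can never over-start FailoverLoops.
class DynamicSizingSketch {
    private final ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
    private final AtomicInteger listenerCount = new AtomicInteger();

    // Illustrative policy: one loop per two listeners, capped at 8.
    static int desiredLoopCount(int listeners) {
        return Math.min((listeners + 1) / 2, 8);
    }

    void addListener()    { listenerCount.incrementAndGet(); }
    void removeListener() { listenerCount.decrementAndGet(); }

    void start() {
        // A single recurring task adjusts the pool size every 250 ms; there is
        // no locking because no other code path ever resizes the pool.
        scheduler.scheduleWithFixedDelay(
            () -> scheduler.setCorePoolSize(Math.max(1, desiredLoopCount(listenerCount.get()))),
            250, 250, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```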
I made a force push to fix the unit tests, hopefully all good now
rusher added a commit that referenced this pull request on Dec 31, 2015
MastersListener: Fixed several issues around thread pool usage (v2)
This replaces my previous pull request #57 (which I will now close). @rusher and I iterated on my initial proposal to arrive at this. There is too much to explain here, but hopefully the previous pull request, in combination with the commit messages, will give enough clarity. Feel free to ask me questions as well.
As previously, all my contributions here are provided under the BSD-new license.