New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Another round of MSC fixes #51
Conversation
@@ -486,14 +486,11 @@ void transition(final ArrayList<Runnable> tasks) { | |||
} | |||
break; | |||
case LAZY: { | |||
if (state.getState() == State.UP && state != Substate.STOP_REQUESTED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't want to change this; see https://issues.jboss.org/browse/MSC-169?focusedCommentId=13421096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13421096
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All current MSC-169 related commits will be eliminated.
@@ -844,7 +844,7 @@ public void uncaughtException(final Thread t, final Throwable e) { | |||
ServiceLogger.ROOT.uncaughtException(e, t); | |||
} | |||
}; | |||
private static final ThreadPoolExecutor.CallerRunsPolicy POLICY = new ThreadPoolExecutor.CallerRunsPolicy(); | |||
private static final ThreadPoolExecutor.AbortPolicy POLICY = new ThreadPoolExecutor.AbortPolicy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never want this policy unless we can go through and prove that we have 100% correct behavior when execute()
throws an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove this commit too.
There is a wrong test in MSC test case
MultipleRemoveListenerTestCase.testDoneAfterRemoval()
that fails sporadically. This test expects it is possible to
schedule new tasks after container shutdown.
The test will be eliminated.
@@ -325,12 +325,12 @@ private boolean shouldStop() { | |||
* Returns true if controller is in rest state and no async tasks are running, false otherwise. | |||
* @return true if stable rest state, false otherwise | |||
*/ | |||
boolean isStableRestState() { | |||
private boolean isStableRestState() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Package-private is adequate encapsulation. When you have a private method on Java 9 and earlier which is accessed from a nested class, the compiler has to produce an "access" method which is package-private anyway, so any improved encapsulation is an illusion.
} | ||
} | ||
|
||
private class RemoveChildrenTask extends ControllerTask { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nested classes generally shouldn't be private (see earlier comment).
@@ -1972,6 +1954,40 @@ boolean execute() { | |||
} | |||
} | |||
|
|||
private class TransitiveDependencyAvailableTask extends ControllerTask { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Private classes
@@ -1598,6 +1598,19 @@ public final void run() { | |||
abstract boolean execute(); | |||
} | |||
|
|||
private abstract class DependenciesControllerTask extends ControllerTask { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Private class
…o transition that will always happen during the startup process.
There are no tasks added to 'tasks' variable in first synchronized section so there is no need to release both locks then try to execute empty tasks list and finally reacquiring both locks back again.
Removing dead code - newDependent() neither participates in stability detection nor in controller transitioning.
More strict method modifiers to improve encapsulation. These methods are not accessed from outside classes anymore.
ServiceAvailableTask must be executed outside intrinsic lock. When ServiceAvailableTask is scheduled we cannot call transition() method in commitInstallation() method. It will be called later once all scheduled tasks from commitInstallation() method are finished.
…ndencyFailedTask into two tasks: DependencyFailedTask & RemoveChildrenTask.
…Task into two tasks: StopTask & RemoveChildrenTask.
…w TransitiveDependencyUn/Available tasks.
…h these tests: * first they are relying and testing MSC internals we are going to change in next commits * second byteman tests have problems with deeper class hierarchies (we are going to introduce) I discussed second problem with Andrew Dinn and he wrote me (cite): "Byteman doesn't support rules injection into non static nested classes, because these classes are strangely encoded into class file format."
…-> DependencyUn/availableTask.
All tasks notifiying controller dependencies and parent will be subclasses of this class. There will be implemented different locking strategy than for DependentsControllerTasks.
…y of tasks - DependenciesControllerTasks.
All tasks notifiying controller dependents and children will be subclasses of this class. There will be implemented different locking strategy than for DependenciesControllerTasks.
…y of tasks - DependentsControllerTasks.
…rollerImpl.getDependents() method.
…endentsControllerTask.
…lasses must be final.
c356ae9
to
05abd6f
Compare
@@ -571,12 +570,12 @@ private void transition(final ArrayList<Runnable> tasks) { | |||
dependenciesDemanded = false; | |||
} | |||
getListenerTasks(transition, tasks); | |||
tasks.add(new DependencyStoppedTask(getDependents())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that we're no longer using a snapshot of the dependents array is a functional change I think, because now we're iterating dependents outside of the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is really no functional change @dmlloyd :)
The controller dependents are initialized in DependentsControllerTask constructor
and all tasks are always instantiated under intrinsic lock. See previous commit.
private DependentsControllerTask() { | ||
dependents = getDependents(); | ||
dependents = getDependentsByDependencyName(); | ||
children = ServiceControllerImpl.this.children.toScatteredArray(NO_DEPENDENTS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we benchmarked this against the array approach? The array approach was ugly but fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need for benchmarking now because this is work in progress that will be changed in next pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to have a look in advance this is the change I am talking about:
ropalka@0bfdddf#diff-fd8643d01445d7e64971875e1b6a1d26R1684
@@ -1599,6 +1599,23 @@ private DependentsControllerTask() { | |||
dependents = getDependentsByDependencyName(); | |||
children = ServiceControllerImpl.this.children.toScatteredArray(NO_DEPENDENTS); | |||
} | |||
|
|||
final boolean execute() { | |||
for (Map.Entry<ServiceName, Dependent[]> dependentEntry : dependents.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're just iterating the map, then what we really want is a list of pairs, not a map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same applies for this. This is work in progress that will be changed.
I added one more commit to this pull request to fix the issue I introduced earlier. |
Service controller install -> cancel usecase must be synchronous (if installation fails). Only this way installing thread is notified properly when all registry associations have been removed.
ServiceControllerImpl.startInstallation() can fail if and only if there is already some other controller associated either with primary registration or with some alias registration. In such problematic case DuplicateServiceException is being thrown. This source code reordering in startInstallation() method ensures this critical and problematic part of controller installation process is executed first.
These tests have been testing old optional dependency implementation details. In next commit we will introduce new optional dependency implementation.
Is this currently passing with the WildFly full test suite? |
Yes @dmlloyd all these commits (in this penultimate pull request) are |
To be precise I have successfully tested my WIP branch And finally introduce you proposal solution how we will address |
Since now on optional dependencies are guaranteed to work properly. If there is service A installed with N optional dependencies and these optional dependencies are installed (it doesn't matter on the order of installation process) then in worst case scenario such service A will be restarted N times to see all available optional dependencies. But it is guaranteed these optional dependencies will be visible to A. Before this fix there were scenarios possible where DEPENDENT service in UP state depending on optional dependency DEPENDENCY did not see installed DEPENDENCY service referenced by that optional dependency (if DEPENDENT service was caching references to its dependencies during start phase). In other words it ocassionally happened that DEPENDENT's stop phase didn't execute even if missing DEPENDENCY service (referenced via optional dependency) had been installed. Here's concrete example describing one issue with previous optional dependencies implementation: 1) DEPENDENT service that depends on missing DEPENDENCY service (via optional dependency mechanism) is being installed 2) DEPENDENT service ends up in UP state (it is OK optional dependency is missing) 3) DEPENDENCY service is being installed and enters DOWN state (firing DependencyAvailable task) 4) Optional dependency receives 'immediateServiceAvailable' notification from DEPENDENCY's DependencyAvailable task and translates it to 'immediateDependencyDown' notification for DEPENDENT service 5) DEPENDENCY's DependencyAvailable task finished its execution 6) DEPENDENT realizes it has to stop (stoppingDependencies > 0) so it enters STOP_REQUESTED state 7) DEPENDENCY service transitions from DOWN to UP state and fires DependencyStarted task. This task notifies optional dependency with 'immediateDependencyUp' event. This event is propagated to DEPENDENT service (no event transformation) 8) Now when DEPENDENT's last task is executed (in STOP_REQUESTED state) it detects that all its dependencies are satisfied again (stoppingDependencies == 0). 9) DEPENDENT service will transition back from STOP_REQUESTED to UP state. 10) Since now on if DEPENDENT service implementation is using cached optional dependency value it will not see DEPENDENCY service that has just been succesfully installed.
…sm in ServiceRegistrationImpl. The result of this computation must be propagated to ServiceControllerImpl once it is associated with ServiceRegistrationImpl instance field. This problem was related to optional dependencies. Without this counting mechanism in place the following broken usecase may ocassionally happen: 1) DEPENDENCY service is not installed 2) DEPENDENT service is being installed with optional dependency on missing DEPENDENCY service 3) DEPENDENT service moves from DOWN to START_REQUESTED state (it can start because missing optional DEPENDENCY service is ignored) 4) DEPENDENT service moves from START_REQUESTED to START_INITIATING state and fires DependentStartedTask 5) DEPENDENT's DependentStartedTask calls OptionalDependency.dependentStarted() which in turn calls DEPENDENCY ServiceRegistrationImpl.dependentStarted(). Now because DEPENDENCY's ServiceRegistrationImpl 'instance' field is null and because we didn't implement 'dependent started counting' mechanism this notification is lost 6) DEPENDENT ends up in UP state 7) DEPENDENCY service is being installed and enters DOWN state (it was associated with DEPENDENCY's ServiceRegistrationImpl). DEPENDENCY's DependencyAvailableTask is fired 8) DEPENDENCY's DependencyAvailableTask calls DEPENDENT's optional dependency immediateDependencyAvailable() method which is translated into DEPENDENT.immediateDependencyDown() method call 9) DEPENDENT detects it should stop (stoppingDependencies > 0) 10) DEPENDENT transitions from UP to DOWN state and fires DependentStopped task 11) DEPENDENT's DependentStopped task calls optional dependency dependentStopped() method which in turn calls DEPENDENCY's ServiceRegistrationImpl.dependentStopped() method which in turn calls DEPENDENCY's instance.dependentStopped() method. This last call results into runningDependents invariant violation because it is decremented to negative value and invariant says: 'ServiceControllerImpl runningDependents is always non negative number'
are executed separately and last for every controller ongoing transition to another state. In other words we say service controller successfully completed its transition to another state if and only if all its scheduled tasks (except transition notification tasks) for ongoing transition have been completed. This solution is similar to how stability monitoring works. We say controller stability was achieved if and only if all scheduled tasks were completed and controller entered REST state. The purpose of this commit is to align both stability monitors vs. listener transition notifications approaches. Regardless of which one is used the outcome must be the same. Without this explicit separation the following broken usecase (one of many) would ocassionally happen: 1) Controller is in UP state 2) User code associated ServiceListener with controller (via addListener) and is waiting for transition notifications 3) User code scheduled A for removal 4) Controller finished the following transitions UP -> ... -> REMOVING 5) Controller is executing final transition REMOVING -> REMOVED 6) Controller while holding intrinsic lock schedules both RemoveTask & ServiceListener transition task. Once intrinsic lock is released both RemoveTask and ServiceListener transition tasks are scheduled and executed concurrently 7) ServiceListener transition notification task finished its execution while RemoveTask didn't start its execution yet. 8) User received notification from ServiceListener that controller have been removed and so it tries to install new one. But installation of new controller will fail because removal of previous controller didn't happen yet (there is still pending RemoveTask)
…' mechanism. It is not active yet - will be activated in next commit.
Above is proposal how to fix all known MSC race conditions.
|
… mechanism. This is internal change only not visible to MSC API users because ServiceRegistrationImpl references do not escape to user code. In order to address majority of MSC race conditions it was necessary to revisit locking scenarios: * when controllers are installed * when controller tasks are created * when controller tasks are executed * when controllers are removed Most of the races we were observing were because there was dependents snapshot created and cached when controller dependent task was created. And many times this dependents snapshot was out of date at the time dependent task was executed. To address these issues it was necessary to: * distinguish between read and write operations to service registry and * allow read lock to be acquired in one method and release it later in another one The solution addressing aforementioned problem (in this commit) consists of the following changes: [1] ServiceRegistrationImpl intrinsic lock is enhanced to behave like ReadWriteLock [2] Eliminated unnecessary usage of ServiceRegistrationImpl dependents lock. Since now on we're relying on ServiceRegistrationImpl intrinsic lock only to be held while obtaining registration dependents snapshot. [3] ServiceRegistrationImpl write operations (they are modifying internal states) such as: - setInstance() - clearInstance() - addDependent() - removeDependent() - dependentStarted() - dependentStopped() - addDemand() - removeDemand() are protected with write lock [4] We never acquire two different ServiceRegistrationImpl write locks at the same time in the same thread (this is necessary to avoid dead locks) [5] Revisited - ServiceControllerImpl$DependenciesControllerTask.execute() usage of locks - ServiceControllerImpl.startInstallation() usage of locks - ServiceControllerImpl$RemoveTask.run() usage of locks We are acquiring ServiceRegistrationImpl write locks there one after one in the loop (to prevent dead locks, see [4]). [6] We introduced NOOP beforeExecute() and afterExecute() template methods in ControllerTask - beforeExecute() is always executed before task will start its execution - afterExecute() is always executed after task finished its execution These operations remain NOOP for every DependenciesControllerTask. These operations are overriden for every DependentsControllerTask: - we are acquiring multiple ServiceRegistrationImpl read locks at the same time in beforeExecute() method. This will prevent modifications of ServiceControllerImpl dependents set. Later when these dependents will be iterated in execute() method they will be up to date (we're holding the read lock so there is no race) - we are releasing all acquired ServiceRegistrationImpl read locks at the same time in afterExecute() method [7] ServiceControllerImpl.newDependent() was revisited to take controller 'dependents task execution flags' into account. These flags were introduced in previous commit and they allow us to detect if: - dependent task was scheduled - dependent task finished its execution This was the main problem in previous implementation. ServiceControllerImpl.newDependent() method was unable to detect if scheduled dependent tasks were scheduled and if they finished execution and thus it was causing many race conditions in previous implementation.
…ssue. This happened when some ACTIVE service A depended on PASSIVE service P which was parked in WAITING state (e.g. because some dependency PASSIVE service depends on was not installed). In such case dependent service A remained in START_REQUESTED state (because stoppingDependencies > 0) without possibility to move to some other REST state. This issue was causing serious problems for stability monitoring. Stability monitors never reached stability because START_REQUESTED state is not REST state. The fix is to fire ServiceUn/Available events when entering/leaving WAITING state. Once these events are fired service A gets notification that PASSIVE service P is not available (immediateUnavailableDependencies > 0) and service A will transition from START_REQUESTED to PROBLEM state (where it should be).
The problematic usecase was: 1) User called ServiceContainerImpl.shutdown(). This call iterates all services registered in registry and sets REMOVE mode on all of them 2) There is also child service C and parent service P in registry 3) C entered REMOVED state and scheduled RemoveTask 4) P's either in START_FAILED or STOPPED state and P's RemoveChildrenTask was not executed yet 5) C's RemoveTask is being executed and it calls parent.removeChild() 6) There's asyncTasks decrement implemented there that should be called if and only if P's RemoveChildrenTask was executed (before this fix it was not true)
notifications if and only if it was successfully installed and was not removed yet.
that didn't work properly before (because of broken invariants) plus disabling majority of ServiceListener notification methods. This change does not affect MSC public API ABI. Concretely we are disabling support for the following methods: * serviceRemoveRequested() * serviceRemoveRequestCleared() * dependencyFailed() * dependencyFailureCleared() * immediateDependencyAvailable() * immediateDependencyUnavailable() * transitiveDependencyAvailable() * transitiveDependencyUnavailable() Aforementioned listener notification methods are never called since now on. The following ServiceListener notification methods remain functional: * listenerAdded() * transition() These notification methods must be still supported because they are heavily used in WildFly code base. There is no alternative available. Further more this change removes notion of 'transitive notifications'. Since now on controllers care only about their direct neighbourhood. This fix eliminates transitiveDependencyUn/Available notifications and replaces them with immediateDependencyUn/Available notifications. Reasons for eliminating support for above notification methods are: * immediateDependencyUn/Available and transitiveDependencyUn/Available notifications never worked properly in previous releases (it was quite common MSC ServiceControllerImpl had broken invariants) * they degrade MSC state machine performance (more tasks to execute) * ServiceListener have been deprecated (deprecation since EAP 7.0) * these notification methods were never used in WildFly/EAP code base
Changing ServiceControllerImpl.internalSetMode() method signature - eliminating unused taskList parameter.
Changing transition() method signature.
There were two kind of problems with previous solution: [1] shutdownListener was executed in regular listeners chain [2] race condition between shutdown() and install() methods Both aforementioned problems sometimes caused unexpected premature executor shutdown while there were still pending tasks (either scheduled or not) to be executed. This caused many broken scenarios to happen. One example of broken scenario (assuming [1] was the problem): 1) There is only one service A in registry in UP state. It has 1 registered transition listener. 2) Thread {1} called Container.shutdown() and this method completed successfully. It associated shutdownListener with controller for A and scheduled it for removal. There are two listeners now in controller for A and A is transitioning to REMOVED state. 3) A is not in REMOVED state yet 4) User code executing in Thread {2} calls Container.awaitStability(). This method hangs because previous Container.shutdown() asked service A to stop, but A is still transitioning. 5) A is entering REMOVED state and its RemoveTask is finished. 6) Last step is to notify listeners that REMOVING -> REMOVED transition happened. 7) shutdownListener is executed first and because it was registered just with one service it calls executor.shutdown(). Executor is terminated. 8) Problem is there is pending last listener task not executed yet and because it didn't execute yet stability notification was not sent to service container that controller stability was achieved. User code in Thread {2} will hang forever. Similar example of broken scenario can be constructed assuming [2] was the initial problem. This fix: * removes terminateListener from regular controller listeners chain and creates separate and last controller phase intended just for container termination listeners. When controller enters REMOVED state and its stability is achieved (i.e. all its tasks have been executed) terminate listener is being called. This is the only appropriate time to call it. * container instrinsic lock is being held while container.install() is associating controller with primary and alias registrations. Only so it can be guaranteed shutdown() will not see stale registry data (no race condition)
Includes some minor fixes and refactorings.