Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another round of MSC fixes #51

Merged
merged 37 commits into from Oct 30, 2017
Merged

Another round of MSC fixes #51

merged 37 commits into from Oct 30, 2017

Conversation

ropalka
Copy link
Contributor

@ropalka ropalka commented May 24, 2017

Includes some minor fixes and refactorings.

@@ -486,14 +486,11 @@ void transition(final ArrayList<Runnable> tasks) {
}
break;
case LAZY: {
if (state.getState() == State.UP && state != Substate.STOP_REQUESTED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All current MSC-169 related commits will be eliminated.

@@ -844,7 +844,7 @@ public void uncaughtException(final Thread t, final Throwable e) {
ServiceLogger.ROOT.uncaughtException(e, t);
}
};
private static final ThreadPoolExecutor.CallerRunsPolicy POLICY = new ThreadPoolExecutor.CallerRunsPolicy();
private static final ThreadPoolExecutor.AbortPolicy POLICY = new ThreadPoolExecutor.AbortPolicy();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never want this policy unless we can go through and prove that we have 100% correct behavior when execute() throws an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this commit too.
There is a wrong test in MSC test case
MultipleRemoveListenerTestCase.testDoneAfterRemoval()
that fails sporadically. This test expects it is possible to
schedule new tasks after container shutdown.
The test will be eliminated.

@@ -325,12 +325,12 @@ private boolean shouldStop() {
* Returns true if controller is in rest state and no async tasks are running, false otherwise.
* @return true if stable rest state, false otherwise
*/
boolean isStableRestState() {
private boolean isStableRestState() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Package-private is adequate encapsulation. When you have a private method on Java 9 and earlier which is accessed from a nested class, the compiler has to produce an "access" method which is package-private anyway, so any improved encapsulation is an illusion.

}
}

private class RemoveChildrenTask extends ControllerTask {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nested classes generally shouldn't be private (see earlier comment).

@@ -1972,6 +1954,40 @@ boolean execute() {
}
}

private class TransitiveDependencyAvailableTask extends ControllerTask {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Private classes

@@ -1598,6 +1598,19 @@ public final void run() {
abstract boolean execute();
}

private abstract class DependenciesControllerTask extends ControllerTask {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Private class

…o transition that will always happen during the startup process.
There are no tasks added to 'tasks' variable
in first synchronized section
so there is no need to release both locks
then try to execute empty tasks list and finally
reacquiring both locks back again.
Removing dead code - newDependent()
neither participates in stability detection
nor in controller transitioning.
More strict method modifiers to improve encapsulation.
These methods are not accessed from outside classes anymore.
ServiceAvailableTask must be executed outside intrinsic lock.
When ServiceAvailableTask is scheduled we cannot call
transition() method in commitInstallation() method.
It will be called later once all scheduled tasks
from commitInstallation() method are finished.
…ndencyFailedTask into two tasks: DependencyFailedTask & RemoveChildrenTask.
…Task into two tasks: StopTask & RemoveChildrenTask.
…h these tests:

 * first they are relying and testing MSC internals we are going to change in next commits
 * second byteman tests have problems with deeper class hierarchies (we are going to introduce)

I discussed second problem with Andrew Dinn and he wrote me (cite):
"Byteman doesn't support rules injection into non static nested classes,
because these classes are strangely encoded into class file format."
All tasks notifiying controller dependencies and parent will be subclasses of this class.
There will be implemented different locking strategy than for DependentsControllerTasks.
All tasks notifiying controller dependents and children will be subclasses of this class.
There will be implemented different locking strategy than for DependenciesControllerTasks.
@ropalka ropalka force-pushed the 1.2-fixes branch 2 times, most recently from c356ae9 to 05abd6f Compare June 14, 2017 14:30
@@ -571,12 +570,12 @@ private void transition(final ArrayList<Runnable> tasks) {
dependenciesDemanded = false;
}
getListenerTasks(transition, tasks);
tasks.add(new DependencyStoppedTask(getDependents()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that we're no longer using a snapshot of the dependents array is a functional change I think, because now we're iterating dependents outside of the lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is really no functional change @dmlloyd :)
The controller dependents are initialized in DependentsControllerTask constructor
and all tasks are always instantiated under intrinsic lock. See previous commit.

private DependentsControllerTask() {
dependents = getDependents();
dependents = getDependentsByDependencyName();
children = ServiceControllerImpl.this.children.toScatteredArray(NO_DEPENDENTS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we benchmarked this against the array approach? The array approach was ugly but fast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need for benchmarking now because this is work in progress that will be changed in next pull request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to have a look in advance this is the change I am talking about:
ropalka@0bfdddf#diff-fd8643d01445d7e64971875e1b6a1d26R1684

@@ -1599,6 +1599,23 @@ private DependentsControllerTask() {
dependents = getDependentsByDependencyName();
children = ServiceControllerImpl.this.children.toScatteredArray(NO_DEPENDENTS);
}

final boolean execute() {
for (Map.Entry<ServiceName, Dependent[]> dependentEntry : dependents.entrySet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're just iterating the map, then what we really want is a list of pairs, not a map.

Copy link
Contributor Author

@ropalka ropalka Jun 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same applies for this. This is work in progress that will be changed.

@ropalka
Copy link
Contributor Author

ropalka commented Jun 21, 2017

I added one more commit to this pull request to fix the issue I introduced earlier.

Service controller install -> cancel usecase must be synchronous
(if installation fails). Only this way installing thread
is notified properly when all registry associations have been removed.
ServiceControllerImpl.startInstallation() can fail if and only if
there is already some other controller associated either with
primary registration or with some alias registration.
In such problematic case DuplicateServiceException is being thrown.
This source code reordering in startInstallation() method ensures this critical
and problematic part of controller installation process is executed first.
These tests have been testing old optional dependency implementation details.
In next commit we will introduce new optional dependency implementation.
@dmlloyd
Copy link
Contributor

dmlloyd commented Sep 1, 2017

Is this currently passing with the WildFly full test suite?

@ropalka
Copy link
Contributor Author

ropalka commented Sep 1, 2017

Yes @dmlloyd all these commits (in this penultimate pull request) are
successfully tested on both WildFly-Core and WildFly test suite.
But these commits are useless until I'll send final pull request.
All previous pull requests (including this one) were mostly
preparation steps for final pull request I'm working on these days.

@ropalka
Copy link
Contributor Author

ropalka commented Sep 1, 2017

To be precise I have successfully tested my WIP branch
(which I still need to transform into final PR) which was finished two months ago.
These days I returned back to it and again heavily tested it against recent WF & WF-Core.
Now I'm working on WIP branch -> final PR translation.
It's about eliminating all debug messages, creating self
explanatory proposal commits with valuable descriptions
and about separating complex commits into simpler ones.

And finally introduce you proposal solution how we will address
all these race conditions we're observing with vanilla MSC
on WF & WF-Core these days.

Since now on optional dependencies are guaranteed to work properly.
If there is service A installed with N optional dependencies and these
optional dependencies are installed (it doesn't matter on the order
of installation process) then in worst case scenario such service A
will be restarted N times to see all available optional dependencies.
But it is guaranteed these optional dependencies will be visible to A.

Before this fix there were scenarios possible where
DEPENDENT service in UP state depending on optional dependency
DEPENDENCY did not see installed DEPENDENCY service referenced
by that optional dependency (if DEPENDENT service was caching
references to its dependencies during start phase).

In other words it ocassionally happened that DEPENDENT's stop phase
didn't execute even if missing DEPENDENCY service
(referenced via optional dependency) had been installed.

Here's concrete example describing one issue with
previous optional dependencies implementation:

 1) DEPENDENT service that depends on missing DEPENDENCY service
    (via optional dependency mechanism) is being installed
 2) DEPENDENT service ends up in UP state
    (it is OK optional dependency is missing)
 3) DEPENDENCY service is being installed and enters DOWN state
    (firing DependencyAvailable task)
 4) Optional dependency receives 'immediateServiceAvailable' notification
    from DEPENDENCY's DependencyAvailable task and translates it to
    'immediateDependencyDown' notification for DEPENDENT service
 5) DEPENDENCY's DependencyAvailable task finished its execution
 6) DEPENDENT realizes it has to stop (stoppingDependencies > 0)
    so it enters STOP_REQUESTED state
 7) DEPENDENCY service transitions from DOWN to UP state and fires
    DependencyStarted task. This task notifies optional dependency
    with 'immediateDependencyUp' event. This event is propagated
    to DEPENDENT service (no event transformation)
 8) Now when DEPENDENT's last task is executed (in STOP_REQUESTED state)
    it detects that all its dependencies are satisfied again
    (stoppingDependencies == 0).
 9) DEPENDENT service will transition back from STOP_REQUESTED to UP state.
10) Since now on if DEPENDENT service implementation is using cached
    optional dependency value it will not see DEPENDENCY service that
    has just been succesfully installed.
…sm in ServiceRegistrationImpl.

The result of this computation must be propagated to ServiceControllerImpl
once it is associated with ServiceRegistrationImpl instance field.

This problem was related to optional dependencies.
Without this counting mechanism in place the following
broken usecase may ocassionally happen:

 1) DEPENDENCY service is not installed
 2) DEPENDENT service is being installed with optional
    dependency on missing DEPENDENCY service
 3) DEPENDENT service moves from DOWN to START_REQUESTED state
    (it can start because missing optional DEPENDENCY service is ignored)
 4) DEPENDENT service moves from START_REQUESTED to START_INITIATING state
    and fires DependentStartedTask
 5) DEPENDENT's DependentStartedTask calls OptionalDependency.dependentStarted()
    which in turn calls DEPENDENCY ServiceRegistrationImpl.dependentStarted().
    Now because DEPENDENCY's ServiceRegistrationImpl 'instance' field is null
    and because we didn't implement 'dependent started counting' mechanism
    this notification is lost
 6) DEPENDENT ends up in UP state
 7) DEPENDENCY service is being installed and enters DOWN state
    (it was associated with DEPENDENCY's ServiceRegistrationImpl).
    DEPENDENCY's DependencyAvailableTask is fired
 8) DEPENDENCY's DependencyAvailableTask calls DEPENDENT's optional dependency
    immediateDependencyAvailable() method which is translated into
    DEPENDENT.immediateDependencyDown() method call
 9) DEPENDENT detects it should stop (stoppingDependencies > 0)
10) DEPENDENT transitions from UP to DOWN state and fires DependentStopped task
11) DEPENDENT's DependentStopped task calls optional dependency dependentStopped() method
    which in turn calls DEPENDENCY's ServiceRegistrationImpl.dependentStopped() method
    which in turn calls DEPENDENCY's instance.dependentStopped() method.
    This last call results into runningDependents invariant violation because
    it is decremented to negative value and invariant says:
    'ServiceControllerImpl runningDependents is always non negative number'
are executed separately and last for every controller
ongoing transition to another state.

In other words we say service controller successfully
completed its transition to another state if and only if
all its scheduled tasks (except transition notification tasks)
for ongoing transition have been completed.

This solution is similar to how stability monitoring works.
We say controller stability was achieved if and only if
all scheduled tasks were completed and controller entered REST state.

The purpose of this commit is to align both
stability monitors vs. listener transition notifications approaches.
Regardless of which one is used the outcome must be the same.

Without this explicit separation the following
broken usecase (one of many) would ocassionally happen:

 1) Controller is in UP state
 2) User code associated ServiceListener with controller
    (via addListener) and is waiting for transition notifications
 3) User code scheduled A for removal
 4) Controller finished the following transitions UP -> ... -> REMOVING
 5) Controller is executing final transition REMOVING -> REMOVED
 6) Controller while holding intrinsic lock schedules
    both RemoveTask & ServiceListener transition task.
    Once intrinsic lock is released both RemoveTask
    and ServiceListener transition tasks are scheduled
    and executed concurrently
 7) ServiceListener transition notification task finished its execution
    while RemoveTask didn't start its execution yet.
 8) User received notification from ServiceListener
    that controller have been removed and so it tries
    to install new one. But installation of new controller
    will fail because removal of previous controller
    didn't happen yet (there is still pending RemoveTask)
…' mechanism. It is not active yet - will be activated in next commit.
@ropalka
Copy link
Contributor Author

ropalka commented Sep 14, 2017

Above is proposal how to fix all known MSC race conditions.
Just few highlights of this proposal:

  • it fixes broken optional dependencies
  • if fixes broken MSC shutdowns
  • it preserves MSC state machine states
  • it simplifies MSC code base
  • it fixes majority of WildFly & WildFly-Core test suite transient failures (heavily tested)
    • these failures were related to broken optional deps & broken container shutdown

ropalka and others added 11 commits October 30, 2017 21:11
… mechanism.

This is internal change only not visible to MSC API users because
ServiceRegistrationImpl references do not escape to user code.

In order to address majority of MSC race conditions it was necessary to revisit
locking scenarios:
 * when controllers are installed
 * when controller tasks are created
 * when controller tasks are executed
 * when controllers are removed
Most of the races we were observing were because there was dependents snapshot
created and cached when controller dependent task was created. And many times
this dependents snapshot was out of date at the time dependent task was executed.

To address these issues it was necessary to:
 * distinguish between read and write operations to service registry and
 * allow read lock to be acquired in one method and release it later in another one

The solution addressing aforementioned problem (in this commit) consists of the following changes:
[1] ServiceRegistrationImpl intrinsic lock is enhanced to behave like ReadWriteLock
[2] Eliminated unnecessary usage of ServiceRegistrationImpl dependents lock.
    Since now on we're relying on ServiceRegistrationImpl intrinsic lock only
    to be held while obtaining registration dependents snapshot.
[3] ServiceRegistrationImpl write operations (they are modifying internal states) such as:
    - setInstance()
    - clearInstance()
    - addDependent()
    - removeDependent()
    - dependentStarted()
    - dependentStopped()
    - addDemand()
    - removeDemand()
    are protected with write lock
[4] We never acquire two different ServiceRegistrationImpl write locks at the same time
    in the same thread (this is necessary to avoid dead locks)
[5] Revisited
    - ServiceControllerImpl$DependenciesControllerTask.execute() usage of locks
    - ServiceControllerImpl.startInstallation() usage of locks
    - ServiceControllerImpl$RemoveTask.run() usage of locks
    We are acquiring ServiceRegistrationImpl write locks there
    one after one in the loop (to prevent dead locks, see [4]).
[6] We introduced NOOP beforeExecute() and afterExecute() template methods in ControllerTask
    - beforeExecute() is always executed before task will start its execution
    - afterExecute() is always executed after task finished its execution
    These operations remain NOOP for every DependenciesControllerTask.
    These operations are overriden for every DependentsControllerTask:
    - we are acquiring multiple ServiceRegistrationImpl read locks at the same time
      in beforeExecute() method. This will prevent modifications of ServiceControllerImpl dependents set.
      Later when these dependents will be iterated in execute() method they will be up to date
      (we're holding the read lock so there is no race)
    - we are releasing all acquired ServiceRegistrationImpl read locks at the same time
      in afterExecute() method
[7] ServiceControllerImpl.newDependent() was revisited to take controller 'dependents task execution flags' into
    account. These flags were introduced in previous commit and they allow us to detect if:
    - dependent task was scheduled
    - dependent task finished its execution
    This was the main problem in previous implementation. ServiceControllerImpl.newDependent() method
    was unable to detect if scheduled dependent tasks were scheduled and if they finished execution
    and thus it was causing many race conditions in previous implementation.
…ssue.

This happened when some ACTIVE service A depended on PASSIVE service P
which was parked in WAITING state (e.g. because some dependency
PASSIVE service depends on was not installed).
In such case dependent service A remained in START_REQUESTED state
(because stoppingDependencies > 0) without possibility to move
to some other REST state.

This issue was causing serious problems for stability monitoring.
Stability monitors never reached stability because START_REQUESTED state
is not REST state.

The fix is to fire ServiceUn/Available events when entering/leaving
WAITING state. Once these events are fired service A gets notification
that PASSIVE service P is not available (immediateUnavailableDependencies > 0)
and service A will transition from START_REQUESTED to PROBLEM state
(where it should be).
The problematic usecase was:
1) User called ServiceContainerImpl.shutdown().
   This call iterates all services registered in
   registry and sets REMOVE mode on all of them
2) There is also child service C and parent service P in registry
3) C entered REMOVED state and scheduled RemoveTask
4) P's either in START_FAILED or STOPPED state and
   P's RemoveChildrenTask was not executed yet
5) C's RemoveTask is being executed and it calls parent.removeChild()
6) There's asyncTasks decrement implemented there
   that should be called if and only if P's RemoveChildrenTask
   was executed (before this fix it was not true)
notifications if and only if it was successfully installed and was
not removed yet.
that didn't work properly before (because of broken invariants)
plus disabling majority of ServiceListener notification methods.
This change does not affect MSC public API ABI.

Concretely we are disabling support for the following methods:
 * serviceRemoveRequested()
 * serviceRemoveRequestCleared()
 * dependencyFailed()
 * dependencyFailureCleared()
 * immediateDependencyAvailable()
 * immediateDependencyUnavailable()
 * transitiveDependencyAvailable()
 * transitiveDependencyUnavailable()
Aforementioned listener notification methods are never called since now on.
The following ServiceListener notification methods remain functional:
 * listenerAdded()
 * transition()
These notification methods must be still supported because they are
heavily used in WildFly code base. There is no alternative available.

Further more this change removes notion of 'transitive notifications'.
Since now on controllers care only about their direct neighbourhood.
This fix eliminates transitiveDependencyUn/Available notifications and
replaces them with immediateDependencyUn/Available notifications.

Reasons for eliminating support for above notification methods are:
 * immediateDependencyUn/Available and transitiveDependencyUn/Available
   notifications never worked properly in previous releases
   (it was quite common MSC ServiceControllerImpl had broken invariants)
 * they degrade MSC state machine performance (more tasks to execute)
 * ServiceListener have been deprecated (deprecation since EAP 7.0)
 * these notification methods were never used in WildFly/EAP code base
Changing ServiceControllerImpl.internalSetMode() method
signature - eliminating unused taskList parameter.
Changing transition() method signature.
There were two kind of problems with previous solution:
[1] shutdownListener was executed in regular listeners chain
[2] race condition between shutdown() and install() methods

Both aforementioned problems sometimes caused unexpected
premature executor shutdown while there were still
pending tasks (either scheduled or not) to be executed.

This caused many broken scenarios to happen.

One example of broken scenario (assuming [1] was the problem):
1) There is only one service A in registry in UP state.
   It has 1 registered transition listener.
2) Thread {1} called Container.shutdown() and this method
   completed successfully. It associated shutdownListener
   with controller for A and scheduled it for removal.
   There are two listeners now in controller for A
   and A is transitioning to REMOVED state.
3) A is not in REMOVED state yet
4) User code executing in Thread {2} calls
   Container.awaitStability(). This method hangs
   because previous Container.shutdown() asked
   service A to stop, but A is still transitioning.
5) A is entering REMOVED state and its RemoveTask is finished.
6) Last step is to notify listeners that REMOVING -> REMOVED
   transition happened.
7) shutdownListener is executed first and because it
   was registered just with one service it calls
   executor.shutdown(). Executor is terminated.
8) Problem is there is pending last listener task not executed yet
   and because it didn't execute yet stability notification
   was not sent to service container that controller stability
   was achieved. User code in Thread {2} will hang forever.

Similar example of broken scenario can be constructed
assuming [2] was the initial problem.

This fix:
 * removes terminateListener from regular controller
   listeners chain and creates separate and last controller phase
   intended just for container termination listeners.
   When controller enters REMOVED state and its stability
   is achieved (i.e. all its tasks have been executed)
   terminate listener is being called. This is the only
   appropriate time to call it.
 * container instrinsic lock is being held
   while container.install() is associating controller with
   primary and alias registrations. Only so it can be guaranteed
   shutdown() will not see stale registry data (no race condition)
@dmlloyd dmlloyd merged commit 38bbe12 into jboss-msc:1.2 Oct 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants