Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISPN-11385 Convert Remote Command Executor to Non blocking/blocking t… #7997

Conversation

wburns
Copy link
Member

@wburns wburns commented Mar 4, 2020

…hread executor

  • Invoke commands that block on blocking executor
  • Invoke other commands by caller
  • Use non blocking executor instead of remote in other places

https://issues.redhat.com/browse/ISPN-11385
https://issues.redhat.com/browse/ISPN-11473
https://issues.redhat.com/browse/ISPN-11489

@wburns wburns added the Preview label Mar 4, 2020
@wburns
Copy link
Member Author

wburns commented Mar 4, 2020

Preview to confirm other CI modules pass. With this all the regular thread pools are combined into the non blocking and blocking ones !

@wburns
Copy link
Member Author

wburns commented Mar 4, 2020

please run performance tests please

@wburns wburns force-pushed the ISPN-11385_convert_remote_command_executor branch from 75e1aa2 to 0574783 Compare March 5, 2020 17:00
@wburns
Copy link
Member Author

wburns commented Mar 5, 2020

run performance tests please

@wburns
Copy link
Member Author

wburns commented Mar 5, 2020

Only 1 related test failure in the previous run, which should now be fixed (AsyncInvocationTest thread leak fixed).

Rerunning with that fixed and also deprecation of canBlock method.

@wburns wburns removed the Preview label Mar 5, 2020
@ghost
Copy link

ghost commented Mar 5, 2020

Performance tests run successfully. Link to the results here.

Additional info:
Commit: 0574783
Build number: #400
Comment body: run performance tests please
skip ci

@wburns wburns force-pushed the ISPN-11385_convert_remote_command_executor branch from 9c073f6 to 57a1afb Compare March 10, 2020 17:00
@wburns
Copy link
Member Author

wburns commented Mar 11, 2020

@@ -67,6 +73,10 @@ private static void allowTestsToBlock(BlockHound.Builder builder) {
CommonsBlockHoundIntegration.allowPublicMethodsToBlock(builder, NotifierLatch.class);

CommonsBlockHoundIntegration.allowPublicMethodsToBlock(builder, TestBlocking.class);

CommonsBlockHoundIntegration.allowMethodsToBlock(builder, Class.forName(ReplListener.class.getName() + "$ReplListenerInterceptor"), false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this is just for the Thread.sleep() call, I think it would be better to add an executor parameter to TestingUtil.delayed() and to inject the non-blocking executor in ReplListenerInterceptor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually am not sure why I didn't just add TestingUtil#sleepThread to the exception list. Let me try that instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it was because of logCommand that acquires a lock. I think I will leave it as is for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I can also add TestingUtil#sleepThread as okay to block though too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to add TestingUtil#sleepThread

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to work fine without the ReplListenerInterceptor exception now.

@wburns wburns force-pushed the ISPN-11385_convert_remote_command_executor branch 2 times, most recently from b3bd615 to ad99dea Compare March 18, 2020 16:53
@wburns
Copy link
Member Author

wburns commented Mar 18, 2020

Actually now that the persistence checks are back in it found more issues. So I am working on this still.

@wburns
Copy link
Member Author

wburns commented Mar 18, 2020

It looks like it should be fixed now, will let CI sort it out :)

@@ -1139,11 +1138,14 @@ public ClusterExecutor executor() {
if (transport != null) {
long time = configurationManager.getGlobalConfiguration().transport().distributedSyncTimeout();
return ClusterExecutors.allSubmissionExecutor(null, this, transport, time, TimeUnit.MILLISECONDS,
globalComponentRegistry.getComponent(ExecutorService.class, KnownComponentNames.REMOTE_COMMAND_EXECUTOR),
// This can run arbitrary code, including user - such commands can block
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I more put it here because it can block :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be using the non-blocking executor then?

Copy link
Member Author

@wburns wburns Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately until the other JIRA is fixed, we don't have a great solution. And cluster executor isn't that widely used afaik. But we should hopefully get it fixed before people use it like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I think the few users of cluster executor may be doing exactly blocking cache operations, because there's no way to return a value asynchronously.

I'm starting to think that the proper solution is

change PersistenceManagerImpl to detect if it is a blocking thread and run it inline and if non blocking thread to run the command in a blocking thread.

In fact, I would go even further, and change continueOnCPUExecutor to also continue on the caller thread if the caller thread was blocking. Otherwise, for cluster executor tasks doing cache.put(k1, v1), where the put requires 1 store operation to read the previous value and 1 store operation to store the value, the store read would happen on the task's initial blocking thread, but the store write would need another blocking thread. If the size of the blocking thread pool is N and you have N simultaneous tasks like this, there's no free thread to process the store writes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an FYI but the only way currently we have to detect if it is a blocking thread is to check the thread name, which is quite brittle.

Also your put case, I don't see how the read and write would need concurrent blocking threads. The read would be done before then the write would be done afterwards, synchronously.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an FYI but the only way currently we have to detect if it is a blocking thread is to check the thread name, which is quite brittle.

Can't we do !(Thread.currentThread() instanceof ISPNNonBlockingThread)?

Also your put case, I don't see how the read and write would need concurrent blocking threads. The read would be done before then the write would be done afterwards, synchronously.

If the cluster executor task does a blocking cache.put(k, v), it needs a (blocking) thread for the entire duration of the cache operation. The read would run on the same thread, but then continueOnCPUExecutor() would submit a task to the non-blocking executor, and the next PersistenceManagerImpl call would submit a task to the blocking executor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we do !(Thread.currentThread() instanceof ISPNNonBlockingThread)?

No, unfortunately. This would include user threads, jgroups etc.

If the cluster executor task does a blocking cache.put(k, v), it needs a (blocking) thread for the entire duration of the cache operation. The read would run on the same thread, but then continueOnCPUExecutor() would submit a task to the non-blocking executor, and the next PersistenceManagerImpl call would submit a task to the blocking executor.

Oh, okay you were not referring to the read then write. I agree if a blocking operation is invoked on a blocking thread then yes it would use more than 1.

@@ -67,6 +73,10 @@ private static void allowTestsToBlock(BlockHound.Builder builder) {
CommonsBlockHoundIntegration.allowPublicMethodsToBlock(builder, NotifierLatch.class);

CommonsBlockHoundIntegration.allowPublicMethodsToBlock(builder, TestBlocking.class);

CommonsBlockHoundIntegration.allowMethodsToBlock(builder, Class.forName(ReplListener.class.getName() + "$ReplListenerInterceptor"), false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to add TestingUtil#sleepThread

@danberindei
Copy link
Member

There are some startup failures in CI

@wburns
Copy link
Member Author

wburns commented Mar 18, 2020

Yeah I figured there may be and my local run was just a fluke.

@wburns
Copy link
Member Author

wburns commented Mar 19, 2020

So the failure is because we commit the transaction in a blocking thread as the API is blocking. However the store write can touch the cache store, which doesn't want you to run it on the blocking thread. I am not sure how to fix this other than to change PersistenceManagerImpl to detect if it is a blocking thread and run it inline and if non blocking thread to run the command in a blocking thread. I think this is how it should be long term, but sadly this check is finding lots of bugs as it is now :)

@wburns wburns force-pushed the ISPN-11385_convert_remote_command_executor branch from 4730be1 to bf6f39a Compare March 19, 2020 15:50
@wburns
Copy link
Member Author

wburns commented Mar 19, 2020

There are some test failures from the tx changes where I missed something, guessing I am not joining somewhere :)

@wburns wburns force-pushed the ISPN-11385_convert_remote_command_executor branch 2 times, most recently from 21db4df to fdefde6 Compare March 20, 2020 12:42
@wburns
Copy link
Member Author

wburns commented Mar 20, 2020

Updated to fix the 2 blocking test failures.

@@ -1139,11 +1138,14 @@ public ClusterExecutor executor() {
if (transport != null) {
long time = configurationManager.getGlobalConfiguration().transport().distributedSyncTimeout();
return ClusterExecutors.allSubmissionExecutor(null, this, transport, time, TimeUnit.MILLISECONDS,
globalComponentRegistry.getComponent(ExecutorService.class, KnownComponentNames.REMOTE_COMMAND_EXECUTOR),
// This can run arbitrary code, including user - such commands can block
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I think the few users of cluster executor may be doing exactly blocking cache operations, because there's no way to return a value asynchronously.

I'm starting to think that the proper solution is

change PersistenceManagerImpl to detect if it is a blocking thread and run it inline and if non blocking thread to run the command in a blocking thread.

In fact, I would go even further, and change continueOnCPUExecutor to also continue on the caller thread if the caller thread was blocking. Otherwise, for cluster executor tasks doing cache.put(k1, v1), where the put requires 1 store operation to read the previous value and 1 store operation to store the value, the store read would happen on the task's initial blocking thread, but the store write would need another blocking thread. If the size of the blocking thread pool is N and you have N simultaneous tasks like this, there's no free thread to process the store writes.

}
}

private <T> CompletionStage<T> handleRollbackFailure(Throwable t, LocalTransaction localTransaction) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT handleRollbackFailure and handleCommitFailure don't need to return a CompletionStage

Copy link
Member Author

@wburns wburns Mar 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't, but the users of it require it to be a CompletionStage, so less code overall. :)

Copy link
Member

@danberindei danberindei Mar 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you return void and throw the exception directly, the callers can use handle instead of handleAndCompose

@@ -41,6 +43,7 @@ public void test() throws Exception {
cache(2).put(key, "value");

ControlledRpcManager rpcManager = ControlledRpcManager.replaceRpcManager(cache(2));
rpcManager.excludeCommands(StateResponseCommand.class, StateTransferStartCommand.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's so weird that it wasn't a problem before, I have to debug the test to see how it's passing on master :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree. But I wasn't quite sure what was going on.

@wburns wburns force-pushed the ISPN-11385_convert_remote_command_executor branch 3 times, most recently from d028787 to 7010d95 Compare March 20, 2020 17:01

//rollback transaction before throwing the exception as there's no guarantee the TM calls XAResource.rollback
//after prepare failed.
return CompletionStages.handleAndCompose(rollback(localTransaction), (ignore2, rollbackThrowable) -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for handleAndCompose

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, why. We need to handle the case when it was not an error to wrap it with an XAException still. And we want to still catch the rollback exception to supress or rethrow that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but none of those need to return a CompletionStage, so you can use handle instead of handleAndCompose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to keep all the exceptions as bare XAException. If I do the other I would have to use CompletionException wrapping XAException and all callers would have to pay attention to that including TransactionXAAdapter, but I can do that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I don't feel comfortable changing this right now. I can revisit later if we need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to keep all the exceptions as bare XAException. If I do the other I would have to use CompletionException wrapping XAException and all callers would have to pay attention to that including TransactionXAAdapter, but I can do that.

Not sure what you mean. When you do CompletableFuture.join() it will wrap the exception in a CompletionStage anyway, so you have to be prepared to extract the exception with CompletableFutures.extractException.

But I'm ok with revisiting this later.

@wburns wburns force-pushed the ISPN-11385_convert_remote_command_executor branch from 7010d95 to 3173ad9 Compare March 20, 2020 19:09
@wburns
Copy link
Member Author

wburns commented Mar 20, 2020

Fixed latest comments.

@wburns wburns force-pushed the ISPN-11385_convert_remote_command_executor branch 2 times, most recently from 514d861 to 6030cc1 Compare March 20, 2020 19:12
…hread executor

* Invoke commands that block on blocking executor
* Invoke other commands by caller
* Use non blocking executor instead of remote in other places
@wburns wburns force-pushed the ISPN-11385_convert_remote_command_executor branch from 6030cc1 to 0caa5b7 Compare March 20, 2020 20:55
@danberindei danberindei merged commit 2c74173 into infinispan:master Mar 20, 2020
@danberindei
Copy link
Member

Merged, thanks Will!

@wburns
Copy link
Member Author

wburns commented Mar 20, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants