Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISPN-7889 BaseDistributionInterceptor.remoteGet may cause concurrency issues #6501

Merged
merged 3 commits into from Jan 15, 2019

Conversation

@danberindei
Copy link
Member

commented Dec 4, 2018

https://issues.jboss.org/browse/ISPN-7889
https://issues.jboss.org/browse/ISPN-8889

  • Rename remoteGet to remoteGetSingleKey and use it only for single-key
    commands.
  • Rename remoteGetAll to remoteGetMany and use it for all multi-key
    commands.
  • Throw OutdatedTopologyException.RETRY_SAME_TOPOLOGY after receiving
    only UnsureResponses and RETRY_NEXT_TOPOLOGY if there was at least one
    CacheNotFoundResponse (and no valid response).
  • Make remote scattered reads wait for the right topology
  • Improve support for PartitionHandling.ALLOW_READS in scattered caches.
  • Make partition availability checks more readable.
  • Add asynchronous methods in ControlledRpcManager
  • Introduce OutdatedTopologyException.RETRY_SAME_TOPOLOGY

I've separated the OutdatedTopologyException changes into a commit, they touch quite a lot of files. I can do the same for the ControlledRpcManager, I changed some method names there and it affected quite a few tests.

I need your review @rvansa, especially for the scattered mode changes. I tried to add/clarify comments as much as possible, but I might have missed some stuff ;)

@danberindei danberindei requested a review from rvansa Dec 4, 2018

@danberindei danberindei changed the title Ispn 7889 remote get all ISPN-7889 BaseDistributionInterceptor.remoteGet may cause concurrency issues Dec 4, 2018

@danberindei danberindei added this to the 10.0.0.Alpha2 milestone Dec 4, 2018

@galderz galderz removed this from the 10.0.0.Alpha2 milestone Dec 7, 2018

@danberindei danberindei force-pushed the danberindei:ISPN-7889_remoteGetAll branch 3 times, most recently from 26153af to dc643bb Dec 10, 2018

@wburns
Copy link
Member

left a comment

Just some minor stuff, tbh I am not as familiar with the issue that I could be. Hopefully someone else can look otherwise I would be fine discussing it more via zulip etc.

@@ -79,11 +79,11 @@ public Object visitGetKeysInGroupCommand(InvocationContext ctx, GetKeysInGroupCo
return invokeNextThenAccept(ctx, command, (rCtx, rCommand, rv) -> {
GetKeysInGroupCommand cmd = (GetKeysInGroupCommand) rCommand;
final int commandTopologyId = cmd.getTopologyId();
// TODO Dan: We used to throw an exception if the node was still a write owner
// The new check should be better, but maybe we should retry if it's not a primary owner any more?

This comment has been minimized.

Copy link
@wburns

wburns Dec 12, 2018

Member

Just fyi, streams and now publishers both support reading from backup if it changes since the start of the operation. So I would say consistency wise it is probably fine.

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

Ok, I'll remove the comment

*
* <p>Most of the time, read commands can be retried in the same topology, see {@link #RETRY_SAME_TOPOLOGY}.</p>
*
* <p>This exception can be thrown very often, so it has not stack trace information, and using the constants

This comment has been minimized.

Copy link
@wburns

wburns Dec 12, 2018

Member
Suggested change
* <p>This exception can be thrown very often, so it has not stack trace information, and using the constants
* <p>This exception can be thrown very often when node is joining or leaving, so it has not stack trace information, and using the constants
* owners means either one owner had topology T+2 and by now we have at least T+1, or one owner had topology T-1
* and another had T+1, and by now all should have at least T.</p>
*/
public static final OutdatedTopologyException RETRY_SAME_TOPOLOGY =

This comment has been minimized.

Copy link
@wburns

wburns Dec 12, 2018

Member

Should probably add @SuppressWarnings("ThrowableInstanceNeverThrown") like the other

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

I just ran the inspection and it didn't complain, so I'm deleting the annotation from the other instead

// 5. A receives two unsure responses and throws OTE
// However, now we are sure that we can immediately retry the request, because C must have updated its topology
OutdatedTopologyException ote = (OutdatedTopologyException) ce;
if (ote.requestedTopologyId >= 0) {

This comment has been minimized.

Copy link
@wburns

wburns Dec 12, 2018

Member

I wonder should we just not do an equality check? I think that would be a bit clearer than trying to remember what values were what.

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

The trick (was) is that I wanted to allow exceptions with a custom topology id as well, but I've removed those usages so I'll try to simplify the check here

@rvansa
Copy link
Member

left a comment

I am still a bit torn on the 'relativity' of OTEs; I would find absolute requestTopologyId less fragile. Are we that concerned about allocations during topology change? OTE does not have a stack trace and it does not leave current thread, so I would say that the absolute ids are worth the allocations.

}

if (response instanceof UnsureResponse) {
hasUnsureResponse = true;

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

Looks like this field is never read.

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

Yep, removed

CacheEntry entry = value == null ? NullCacheEntry.getInstance() : value.toInternalCacheEntry(key);
wrapRemoteEntry(ctx, key, entry, false);
}
// TODO Dan: transform the remote entries before wrapping them in the context

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

Hmm? handleRemotelyRetrievedKeys looks up the value from context, so we need the actual values there before calling that method.

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

What I was trying to say is that handleRemotelyRetrievedKeys could receive a list of remote values and insert them into the context itself

@@ -313,16 +312,26 @@ public Object visitReadWriteManyEntriesCommand(InvocationContext ctx,
InvocationContext ctx, C command, WriteManyCommandHelper<C, Container, Item> helper, ConsistentHash ch,
MergingCompletableFuture<Object> allFuture, MutableInt offset, IntSet segments) throws Exception {
Container myItems = helper.newContainer();
List<CompletableFuture<?>> retrievals = null;
List<Object> remoteKeys = new ArrayList<>();

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

There won't be any remote keys in the common case; therefore this array should not be allocated ahead.

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

Ok

This comment has been minimized.

Copy link
@danberindei

danberindei Jan 6, 2019

Author Member

I made the list allocation lazy everywhere.

@@ -384,16 +392,25 @@ public Object visitReadWriteManyEntriesCommand(InvocationContext ctx,

private <C extends WriteCommand, Item> Object handleRemoteReadWriteManyCommand(
InvocationContext ctx, C command, WriteManyCommandHelper<C, ?, Item> helper) throws Exception {
List<CompletableFuture<?>> retrievals = null;
List<Object> remoteKeys = new ArrayList<>();

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

same as above

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

Ok

CacheEntry cacheEntry = ctx.lookupEntry(key);
if (cacheEntry == null) {
// this should be a rare situation, so we don't mind being a bit ineffective with the remote gets
if (command.hasAnyFlag(FlagBitSets.SKIP_REMOTE_LOOKUP) || command.hasAnyFlag(

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

replace || with bitwise I

}

protected <C extends VisitableCommand & FlagAffectedCommand & TopologyAffectedCommand, K> Object handleTxWriteManyCommand(
InvocationContext ctx, C command, Collection<K> keys, BiFunction<C, List<K>, C> copyCommand) {
boolean ignorePreviousValue = command.hasAnyFlag(SKIP_REMOTE_FLAGS) || command.loadType() == VisitableCommand.LoadType.DONT_LOAD;
List<K> filtered = new ArrayList<>(keys.size());
List<CompletableFuture<?>> remoteGets = null;
List<Object> remoteKeys = new ArrayList<>();

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

allocate lazily

// a broadcast to all nodes then can end with suspect exception, but we won't get any new topology.
// An example of this situation is when a node sends leave - topology can be installed before the new view.
// To prevent suspect exceptions use SYNCHRONOUS_IGNORE_LEAVERS response mode.
requestedTopologyId = cmd.getTopologyId() + 1;

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

The previous comment suggested that we should retry with the same topology in this case, why are you increasing by 1?

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

This branch should never be used, read commands should never throw a SuspectException.
I'll rename the method to make it clearer that it's only used for read commands and simplify the branch.

case ALLOW_READ_WRITES:
return true;
case ALLOW_READS:
// Never allow bulk reads/writes

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

Why not allowing bulk reads? (please add comment)

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

Good question... I think we can allow bulk reads for replicated caches :)

@@ -96,6 +96,7 @@ public void notifyTransactionDataReceived(int topologyId) {
if (transactionDataTopologyId >= expectedTopologyId) {
return CompletableFutures.completedNull();
} else {
// TODO Dan: Use thenComposeAsync to continue each command in a separate thread?

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

No, the thread in which the execution should continue should be decided by the caller of transactionDataFuture.

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

Good point, so we just need to modify DefaultTopologyRunnable to use thenApplyAsync... or remove DefaultTopologyRunnable altogether :)

This comment has been minimized.

Copy link
@danberindei

danberindei Jan 6, 2019

Author Member

BaseStateTransferInterceptor also uses transactionDataReceived() directly, so I created https://issues.jboss.org/browse/ISPN-9852

@@ -65,7 +66,8 @@ protected void createCacheManagers() throws Throwable {
protected <T> CompletionStage<T> performRequest(Collection<Address> targets, ReplicableCommand command,
ResponseCollector<T> collector,
Function<ResponseCollector<T>,
CompletionStage<T>> invoker) {
CompletionStage<T>> invoker,

This comment has been minimized.

Copy link
@rvansa

rvansa Dec 14, 2018

Member

indentation?

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

That's IntelliJ aligns it for me if it's on a separate line, but it fits on the Function< line so I moved it

invokeNextAndFinally(ctx, localCommand, handler);
} else {
// We must wait until all retrievals finish before proceeding with the local command
CompletableFuture[] ra = retrievals.toArray(new CompletableFuture[retrievals.size()]);
Object result = asyncInvokeNext(ctx, command, CompletableFuture.allOf(ra));
Object result = asyncInvokeNext(ctx, command, retrievals);

This comment has been minimized.

Copy link
@danberindei

danberindei Dec 20, 2018

Author Member

@rvansa Ouch, I read lines 348 and 352 invoking the next interceptor and I almost had a heart attack thinking that we're going to have multiple threads accessing the context :)

CallInterceptor should be pretty fast, wouldn't it simplify things if we only invoked it after we got all the entries in the context?

@danberindei danberindei force-pushed the danberindei:ISPN-7889_remoteGetAll branch from dc643bb to d728595 Dec 20, 2018

@danberindei

This comment has been minimized.

Copy link
Member Author

commented Dec 20, 2018

Updated @rvansa @wburns

danberindei added 2 commits Dec 4, 2018
ISPN-7889 Introduce OutdatedTopologyException.RETRY_SAME_TOPOLOGY
And rename OTE.INSTANCE to OTE.RETRY_NEXT_TOPOLOGY
ISPN-7889 BaseDistributionInterceptor.remoteGet may cause concurrency…
… issues

* Rename remoteGet to remoteGetSingleKey and use it only for single-key
  commands.
* Rename remoteGetAll to remoteGetMany and use it for all multi-key
  commands.
* Throw OutdatedTopologyException.RETRY_SAME_TOPOLOGY after receiving
  only UnsureResponses and RETRY_NEXT_TOPOLOGY if there was at least one
  CacheNotFoundResponse (and no valid response).
* Make remote scattered reads wait for the right topology
* Improve support for PartitionHandling.ALLOW_READS in scattered caches.
* Make partition availability checks more readable.
* Add asynchronous methods in ControlledRpcManager

@danberindei danberindei force-pushed the danberindei:ISPN-7889_remoteGetAll branch from d728595 to 0a0a04e Jan 6, 2019

@danberindei danberindei force-pushed the danberindei:ISPN-7889_remoteGetAll branch from 0a0a04e to ebfb157 Jan 6, 2019

@tristantarrant tristantarrant added this to the 10.0.0.Beta1 milestone Jan 15, 2019

@wburns wburns merged commit bbab254 into infinispan:master Jan 15, 2019

1 check failed

continuous-integration/jenkins/pr-head This commit has test failures
Details
@wburns

This comment has been minimized.

Copy link
Member

commented Jan 15, 2019

Integrated into master, thanks @danberindei !

@danberindei danberindei deleted the danberindei:ISPN-7889_remoteGetAll branch Jan 22, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.