
ISPN-9128 RehashWithSharedStoreTest.testRehashes random failures #5989

Merged
merged 1 commit into infinispan:master from the ISPN-9128 branch on Jul 16, 2018

Conversation

@rvansa commented May 17, 2018

@rvansa requested a review from @danberindei on May 17, 2018 15:11
// This is used to determine nodes that own the entries. In scattered cache (which uses transitory topology)
// the pendingCH is used for reading but the nodes in there are not guaranteed to have the data yet.
// CurrentCH should be safe - the nodes either have the data or the owner is unknown.
return topology.getCurrentCH();
Member:

I meant this method to be a mirror of CacheTopology.getReadConsistentHash(), only without using the unionCH field (which is null on the ClusterTopologyManagerImpl side). Since this is breaking the symmetry, we should change the method name as well.

@@ -503,6 +503,11 @@ private void applyValues(Address address, List<Object> keys, Response response)
for (int i = 0; i < keys.size(); ++i) {
Object key = keys.get(i);
InternalCacheValue icv = values[i];
if (icv == null) {
// The entry got lost in the meantime - this can happen when the cache is stopped concurrently to processing
// the GetAllCommand. We'll just avoid NPEs here: data is lost as > 1 nodes have left.
Member:

Shouldn't we get an exception if the cache is being stopped?

Member Author:

Not when the command has already entered the interceptor stack and the cache is stopped only after that.

Member:

If it can happen for our internal GetAllCommand it can probably happen for the application's GetAllCommands as well, so I'd change the code to throw an IllegalLifecycleStateException if possible.
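Something along these lines, just as a sketch - the component/status lookup is illustrative, not the actual code in this method:

   if (icv == null) {
      // illustrative only: if the cache is shutting down, surface it instead of silently skipping the key
      // ('componentRegistry' is assumed to be reachable from here)
      if (!componentRegistry.getStatus().allowInvocations()) {
         throw new IllegalLifecycleStateException("Cache is stopping, cannot apply value for key " + key);
      }
      continue; // otherwise skip the lost entry and keep processing the remaining keys
   }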

Member Author:

On the stopping receiver this topology check was ignored because the GetAllCommand has SKIP_OWNERSHIP_CHECK during the state transfer - this node was not an owner in the previous topology (so it can't respond according to the CH), but it has responded during the key transfer as having the last version of the key. However, as it's being stopped, the data container has been cleared concurrently and it fetches null data.
I agree that this could be improved by another kind of signalling, but that is out of the scope of this PR.
On the originator, I don't see why we should throw an exception instead of processing the other data. Would you be ok with issuing a WARN log here?

Member:

You're probably right that this is out of the scope of this PR, we could handle ISPN-9095 separately.

> However, as it's being stopped, the data container has been cleared concurrently and it fetches null data.

When a cache shuts down while a remote command is running I would expect BaseBlockingRunnable to replace the command's response with a CacheNotFoundResponse. DataContainer.clear() has priority 999 and PerCacheInboundInvocationHandler.stop() has priority 10, so IMO the GetAllCommand should return either the proper value or a CNFR. BTW, non-scattered caches also rely on CNFR to know that a remote node is no longer an owner and the originator should ask another node, so if the CNFR logic is broken it's even more important to fix it.

> On the originator, I don't see why we should throw an exception instead of processing the other data. Would you be ok with issuing a WARN log here?

I'd fix the remote node to properly reply with a CNFR when it's shutting down, and log an error in ScatteredStateConsumerImpl if a value was invalid. If SSCI can retry that key and fetch the value from another node we could make it a warning instead.

Member Author:

Oh, I was wrong... I thought that the test was already stopping the cache after a test failure, but in fact TestingUtil is clearing the data container directly. So this was a test problem.

In this case I don't understand why the exception happened in https://issues.jboss.org/browse/ISPN-9095

Member:

Ok, then do we still need this change?


@Test(groups = "functional", testName = "scattered.statetransfer.CoordStopTest")
@CleanupAfterMethod
public class CoordStopTest extends MultipleCacheManagersTest {
Member:

We actually call it "coordinator" ;)

Member Author:

Personal question: are you Dan or Daniel on your birth certificate? :)

Member:

Dan 😄

I know some of our test names are ridiculously long, but I don't think CoordinatorStopTest would be one of them ;)

// start other caches
cache(1);
cache(2);
waitForClusterToForm();
Member:

waitForClusterToForm() already starts the default caches.
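I.e. the block above can collapse to just:

   // waitForClusterToForm() starts the default cache on every manager itself,
   // so the explicit cache(1)/cache(2) calls are redundant
   waitForClusterToForm();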

@Override
protected void clearContent() throws Throwable {
super.clearContent();
viewLatch = new CountDownLatch(1);
Member:

Is this really needed?

int stableTopologyId = cache(1).getAdvancedCache().getDistributionManager().getCacheTopology().getTopologyId();
BlockingLocalTopologyManager ltm2 = BlockingLocalTopologyManager.replaceTopologyManager(manager(2), cache(2).getName());
ControlledRpcManager rpcManager2 = ControlledRpcManager.replaceRpcManager(cache(2));
rpcManager2.excludeCommands(cmd -> cmd instanceof StateResponseCommand, CoordStopTest::isCancelStateTransfer);
Member:

IMO it would be simpler to handle the response commands explicitly, then you can guarantee there are no cancel commands.

Member Author:

What do you mean by handling them explicitly? Overriding RpcManager on my own?

Member:

No, I meant calling ControlledRpcManager.expectCommand()

Member Author:

IIUC the ControlledRpcManager requires the commands to be sent in order, and the cancel commands depend on the timing - I would probably need to add additional synchronization to enforce them. Not that it wouldn't be possible, but I simply don't care about them.

Member:

IMO the order of the cancel commands is pretty straightforward: if you intercept the response commands and you wait for their responses, all the segments will be received and there will be no need for a cancel command.
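Roughly like this, reusing the expectCommand() pattern used elsewhere in the test (the exact overload is an assumption on my side):

   // deliver each state response explicitly; once every response has been sent and
   // acknowledged, all segments are in place and no cancel command should follow
   rpcManager2.expectCommand(StateResponseCommand.class).send().receiveAll();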

assertEquals(2, t1.getCacheTopology().getActualMembers().size());
assertEquals(null, t1.getCacheTopology().getPendingCH());
for (Address member : t1.getCacheTopology().getMembers()) {
assertEquals("Topology: " + t1.getCacheTopology(), 1, t1.getCacheTopology().getCurrentCH().getSegmentsForOwner(member).size());
Member:

Since you know the owner indexes, you can assert them directly (maybe even extract this to a method):

  assertEquals(emptyList(), t1.getCacheTopology().getCurrentCH().locateOwnersForSegment(0));
  assertEquals(singletonList(address(1)), t1.getCacheTopology().getCurrentCH().locateOwnersForSegment(1));
  assertEquals(singletonList(address(2)), t1.getCacheTopology().getCurrentCH().locateOwnersForSegment(2));

assertEquals(2, t2.getCacheTopology().getActualMembers().size());
assertNotNull(t2.getCacheTopology().getPendingCH());
for (int segment = 0; segment < t2.getCacheTopology().getPendingCH().getNumSegments(); ++segment) {
assertEquals("Topology: " + t2.getCacheTopology(), 1, t2.getCacheTopology().getPendingCH().locateOwnersForSegment(segment).size());
Member:

Same here, this assert is too generic considering we know who the owners are.
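E.g. something like this, with the expected owner left as a placeholder (it depends on the layout the test sets up):

   // 'expectedOwner' stands for whichever address the test expects to own segment 0
   assertEquals(singletonList(expectedOwner),
         t2.getCacheTopology().getPendingCH().locateOwnersForSegment(0));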

@danberindei:

I got a failure when running the whole test suite:

[ERROR] testCoordLeaves[bias=NEVER](org.infinispan.scattered.statetransfer.CoordStopTest)  Time elapsed: 0.053 s  <<< FAILURE!
java.lang.AssertionError: expected:<10> but was:<11>
	at org.testng.AssertJUnit.fail(AssertJUnit.java:59)
	at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:364)
	at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:80)
	at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:245)
	at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:252)
	at org.infinispan.util.BlockingLocalTopologyManager.expectTopologyUpdate(BlockingLocalTopologyManager.java:102)
	at org.infinispan.scattered.statetransfer.CoordStopTest.testCoordLeaves(CoordStopTest.java:127)

When running in the IDE, I didn't get a failure, but I got some errors in the log:

09:13:51,898 ERROR (testng-test) [BlockingLocalTopologyManager] Stopped blocking topology updates, but there are 1 blocked updates in the queue: [Event{type=REBALANCE_START, topologyId=11, viewId=3}]
09:14:11,875 ERROR (remote-thread-test-NodeF-p18-t6) [BlockingLocalTopologyManager] Failed waiting for test to unblock REBALANCE_START 11 on test-NodeF-15557
java.util.concurrent.TimeoutException: null
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) ~[?:1.8.0_152]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) ~[?:1.8.0_152]
	at org.infinispan.util.BlockingLocalTopologyManager$Event.awaitUnblock(BlockingLocalTopologyManager.java:247) ~[test-classes/:?]
	at org.infinispan.util.BlockingLocalTopologyManager.beforeHandleRebalance(BlockingLocalTopologyManager.java:204) ~[test-classes/:?]
	at org.infinispan.util.AbstractControlledLocalTopologyManager.handleRebalance(AbstractControlledLocalTopologyManager.java:64) ~[classes/:?]

pr_rvansa_ISPN-9128_2018-05-22_CoordStopTest-infinispan-core.log.gz
pr_rvansa_ISPN-9128_2018-05-22_ScatteredSyncStoreSharedTest-infinispan-core.log.gz

@rvansa commented Jun 11, 2018

Okay, the first failure is an oversight in the test (I thought that ControlledRpcManager.stopBlocking would send anything that's in the queue, but it does not).
The second failure looks like an instance of https://issues.jboss.org/browse/ISPN-9130, fixed in #5975.

@rvansa force-pushed the ISPN-9128 branch 2 times, most recently from 74f9801 to 9fdff67 on June 12, 2018 08:06
@@ -60,7 +62,7 @@
private final AtomicInteger count = new AtomicInteger(1);

private volatile boolean stopped = false;
- private final Set<Class<? extends ReplicableCommand>> excludedCommands =
+ private final Set<Predicate<ReplicableCommand>> excludedCommands =
Member:

Please revert, I'd really like to keep the logic on the main thread

waitForClusterToForm();

// segment 0 will be moved to cache(2). Since we've lost coord cache(1) now -> 0 and cache(2) -> 1
chf.setOwnerIndexes(new int[][] { {1}, {0}, {1} });
Member:

Nitpicking, but I'd like this in the test before the node is killed

// Let the rebalance begin
rpcManager2.expectCommand(StateRequestCommand.class,
request -> assertEquals(StateRequestCommand.Type.CONFIRM_REVOKED_SEGMENTS, request.getType()))
.send().receiveAll();
Member:

It's weird that I wrote expectCommand to only allow access to the command in the consumer, but I'm not going to argue with myself now :)

@@ -503,6 +503,11 @@ private void applyValues(Address address, List<Object> keys, Response response)
for (int i = 0; i < keys.size(); ++i) {
Object key = keys.get(i);
InternalCacheValue icv = values[i];
if (icv == null) {
// The entry got lost in the meantime - this can happen when the cache is stopped concurrently to processing
// the GetAllCommand. We'll just avoid NPEs here: data is lost as > 1 nodes have left.
Member:

Ok, then do we still need this change?

@rvansa commented Jun 14, 2018

I've kept the icv == null check, as the test code that clears containers was not fixed in this PR.
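So the guard stays essentially as in the diff above, just skipping the lost entry:

   if (icv == null) {
      // avoid the NPE and keep applying the remaining values
      continue;
   }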


// unblock outdated rebalance
rpcManager2.stopBlocking();
keyTransferRequest.send().receiveAll();
Member:

This is the kind of thing I meant in the chat... the fact that you call stopBlocking() before completing the blocked request makes me think that the test absolutely needs another command to be invoked remotely, but you'd rather not say what command :)

Member Author:

I need to verify that this works (I didn't run the test manually after the last update and there was an oversight I had to fix); I've used a similar construct in another test and it seems it got broken because receiveAll schedules the continuation on the rpcManager's executor.

Member:

I think I understand now. In my mind, the only alternative would have been to stop blocking at the end of the test, and stopping the executor wouldn't matter there because all the commands are already done.

@rvansa commented Jun 15, 2018

@danberindei I know, debugging ATM :-/ One of the issues is that the CH_UPDATE from the old coordinator and the REBALANCE_START from the new coordinator can arrive in arbitrary order, so I need to enforce the order.

Actually, having the leaving coordinator send CH updates is a pretty big pain in many cases; this is one aspect of state transfer that should eventually be revised (I am not sure where we are ATM with the graceful leave).

@danberindei:

Yes, this and many other aspects should be revised, but I haven't made much progress -- too many random failures :(

@danberindei:

I integrated Will's data container changes and now there's a conflict as well :/

@rvansa commented Jun 26, 2018

@danberindei I made the tests as deterministic as I could... let's hope it passes this time.

@danberindei:

Ah, I forgot about this again... @rvansa, the test is still failing in Jenkins, but for some reason it doesn't appear upfront in the list of failed tests:

https://ci.infinispan.org/job/Infinispan/job/PR-5989/18/testReport/

@rvansa commented Jul 13, 2018

No worries in this case; I've fixed some of the issues, but it's not 100% yet, as the CI shows... I have been lucky enough to reproduce this after a couple hundred invocations, so I'm debugging...

@rvansa commented Jul 13, 2018

@danberindei sigh, a one-liner bug needs 200+ lines of test to orchestrate the correct order of events... Now I've run the test 2200 times without error, let's see if CI can give me one green run!

@danberindei:

@rvansa I'd like us to write more unit tests and fewer functional tests for this kind of issue, but we need a lot of work in order to get there... With the huge lists of dependencies our components have I'm afraid a unit test may be even longer.

@rvansa commented Jul 13, 2018

@danberindei Or we fully cover Infinispan in TLA+, verify that and just add a nice Java code generator... :)

@rvansa commented Jul 16, 2018

@danberindei Seems the tests have passed now; only some RemoteCounter failure.

blockAfter(classListToFilter(filter));
}

public void blockAfter(Predicate<ReplicableCommand> filter) {
Member:

Without any further info on what caused the latest failures, I'd say it's this... the more complex the filter gets, the more likely it is to exclude the wrong command and crash the whole Rube Goldberg machine.

True, it's painful to enumerate all the commands that state transfer now needs, like I'm trying to do with MockTransport/MockLocalTopologyManager, but IMO the sequence of commands can easily be extracted into reusable methods, and where the "business logic" is too complex for the test to predict the commands, perhaps the logic should be simplified. I'm integrating the PR, but I trust you'll remember what you did to debug the test by the next time it fails :)

@danberindei merged commit f8531b4 into infinispan:master on Jul 16, 2018
@danberindei:

Integrated, thanks Radim!

@danberindei:

@rvansa should I keep https://issues.jboss.org/browse/ISPN-9095 open?

@rvansa commented Jul 16, 2018

I think you should close it; with the fix in, the server won't emit the NPE and won't get stuck. Maybe there's an underlying problem, but it will manifest differently.
