ISPN-9128 RehashWithSharedStoreTest.testRehashes random failures #5989
Conversation
// This is used to determine nodes that own the entries. In scattered cache (which uses transitory topology)
// the pendingCH is used for reading but the nodes in there are not guaranteed to have the data yet.
// CurrentCH should be safe - the nodes either have the data or the owner is unknown.
return topology.getCurrentCH();
I meant this method to be a mirror of CacheTopology.getReadConsistentHash(), only without using the unionCH field (which is null on the ClusterTopologyManagerImpl side). Since this is breaking the symmetry, we should change the method name as well.
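The asymmetry can be sketched with hypothetical stand-in classes (plain strings stand in for consistent hashes; this is not the real Infinispan API, and getStableReadConsistentHash is a placeholder name, since the thread doesn't settle on one):

```java
// Hypothetical stand-ins illustrating the asymmetry discussed above.
final class Topology {
    final String currentCH;
    final String pendingCH;
    final String unionCH;

    Topology(String currentCH, String pendingCH, String unionCH) {
        this.currentCH = currentCH;
        this.pendingCH = pendingCH;
        this.unionCH = unionCH;
    }

    // Mirrors the spirit of CacheTopology.getReadConsistentHash():
    // prefer the union CH, then the pending CH, then the current CH.
    String getReadConsistentHash() {
        if (unionCH != null) return unionCH;
        return pendingCH != null ? pendingCH : currentCH;
    }

    // The method under review: always the current CH, because unionCH is
    // null on the ClusterTopologyManagerImpl side and pendingCH owners are
    // not guaranteed to have the data yet.
    String getStableReadConsistentHash() {
        return currentCH;
    }
}

public class ReadChDemo {
    public static void main(String[] args) {
        // During a rebalance the two methods diverge.
        Topology rebalancing = new Topology("current", "pending", null);
        System.out.println(rebalancing.getReadConsistentHash());        // prints "pending"
        System.out.println(rebalancing.getStableReadConsistentHash());  // prints "current"
    }
}
```

The point of the rename suggestion is that a reader of the new method should not assume it behaves like getReadConsistentHash() whenever a pending or union CH exists.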
@@ -503,6 +503,11 @@ private void applyValues(Address address, List<Object> keys, Response response)
for (int i = 0; i < keys.size(); ++i) {
   Object key = keys.get(i);
   InternalCacheValue icv = values[i];
   if (icv == null) {
      // The entry got lost in the meantime - this can happen when the cache is stopped concurrently to processing
      // the GetAllCommand. We'll just avoid NPEs here: data is lost as > 1 nodes have left.
Shouldn't we get an exception if the cache is being stopped?
Not when the command has already entered the interceptor stack and the cache is stopped only after that.
If it can happen for our internal GetAllCommand it can probably happen for the application's GetAllCommands as well, so I'd change the code to throw an IllegalLifecycleStateException if possible.
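A minimal, self-contained sketch of that suggestion (the nested exception class is a local stand-in for Infinispan's real IllegalLifecycleStateException, and applyValues is heavily simplified from the actual method):

```java
import java.util.Arrays;
import java.util.List;

public class ApplyValuesSketch {
    // Local stand-in for org.infinispan.IllegalLifecycleStateException.
    static class IllegalLifecycleStateException extends IllegalStateException {
        IllegalLifecycleStateException(String msg) { super(msg); }
    }

    // Simplified version of the loop in the diff above: instead of silently
    // skipping a null value, surface the concurrent cache stop to the caller.
    static void applyValues(List<Object> keys, Object[] values) {
        for (int i = 0; i < keys.size(); ++i) {
            Object icv = values[i];
            if (icv == null) {
                throw new IllegalLifecycleStateException(
                        "Cache stopped while processing GetAllCommand, key=" + keys.get(i));
            }
            // ... apply icv to the data container ...
        }
    }

    public static void main(String[] args) {
        try {
            applyValues(Arrays.asList("k1", "k2"), new Object[] { "v1", null });
        } catch (IllegalLifecycleStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```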
On the stopping receiver this topology check was ignored because the GetAllCommand has SKIP_OWNERSHIP_CHECK during the state transfer - this node was not the owner in the previous topology (so it can't respond according to the CH), but it has responded during the keys transfer as having the last version of the key. However, as it's being stopped, the data container has been cleared concurrently and it fetches null data.
I agree that this could be improved by another kind of signalling, but that is out of scope of this PR.
On the originator, I don't see why you should throw an exception instead of processing the other data. Would you be ok with issuing a WARN log here?
You're probably right that this is out of the scope of this PR, we could handle ISPN-9095 separately.
> However, as it's being stopped, the data container has been cleared concurrently and it fetches null data.

When a cache shuts down while a remote command is running I would expect BaseBlockingRunnable to replace the command's response with a CacheNotFoundResponse. DataContainer.clear() has priority 999 and PerCacheInboundInvocationHandler.stop() has priority 10, so IMO the GetAllCommand should return either the proper value or a CNFR. BTW, non-scattered caches also rely on CNFR to know that a remote node is no longer an owner and the originator should ask another node, so if the CNFR logic is broken it's even more important to fix it.

> On the originator, I don't see why you should throw an exception instead of processing the other data. Would you be ok with issuing a WARN log here?

I'd fix the remote node to properly reply with a CNFR when it's shutting down, and log an error in ScatteredStateConsumerImpl if a value was invalid. If SSCI can retry that key and fetch the value from another node, we could make it a warning instead.
Oh, I was wrong... I thought that the test was already stopping the cache after the test failure, but in fact TestingUtil is clearing the data container directly. So this was a test problem.
In this case I don't understand why the exception happened in https://issues.jboss.org/browse/ISPN-9095
Ok, then do we still need this change?
@Test(groups = "functional", testName = "scattered.statetransfer.CoordStopTest")
@CleanupAfterMethod
public class CoordStopTest extends MultipleCacheManagersTest {
We actually call it "coordinator" ;)
Personal question: are you Dan or Daniel on your birth certificate? :)
Dan 😄
I know some of our test names are ridiculously long, but I don't think CoordinatorStopTest would be one of them ;)
// start other caches
cache(1);
cache(2);
waitForClusterToForm();
waitForClusterToForm() already starts the default caches.
@Override
protected void clearContent() throws Throwable {
   super.clearContent();
   viewLatch = new CountDownLatch(1);
Is this really needed?
int stableTopologyId = cache(1).getAdvancedCache().getDistributionManager().getCacheTopology().getTopologyId();
BlockingLocalTopologyManager ltm2 = BlockingLocalTopologyManager.replaceTopologyManager(manager(2), cache(2).getName());
ControlledRpcManager rpcManager2 = ControlledRpcManager.replaceRpcManager(cache(2));
rpcManager2.excludeCommands(cmd -> cmd instanceof StateResponseCommand, CoordStopTest::isCancelStateTransfer);
IMO it would be simpler to handle the response commands explicitly, then you can guarantee there are no cancel commands.
What do you mean by handling explicitly? Overriding RpcManager on my own?
No, I meant calling ControlledRpcManager.expectCommand()
IIUC the ControlledRpcManager requires the commands to be sent in order, and the cancel commands depend on the timing - I would probably need to add additional synchronization to enforce them. Not that it wouldn't be possible, but I simply don't care about them.
IMO the order of the cancel commands is pretty straightforward: if you intercept the response commands and you wait for their responses, all the segments will be received and there will be no need for a cancel command.
assertEquals(2, t1.getCacheTopology().getActualMembers().size());
assertEquals(null, t1.getCacheTopology().getPendingCH());
for (Address member : t1.getCacheTopology().getMembers()) {
   assertEquals("Topology: " + t1.getCacheTopology(), 1, t1.getCacheTopology().getCurrentCH().getSegmentsForOwner(member).size());
Since you know the owner indexes, you can assert them directly (maybe even extract this to a method):
assertEquals(emptyList(), t1.getCacheTopology().getCurrentCH().locateOwnersForSegment(0));
assertEquals(singletonList(address(1)), t1.getCacheTopology().getCurrentCH().locateOwnersForSegment(1));
assertEquals(singletonList(address(2)), t1.getCacheTopology().getCurrentCH().locateOwnersForSegment(2));
assertEquals(2, t2.getCacheTopology().getActualMembers().size());
assertNotNull(t2.getCacheTopology().getPendingCH());
for (int segment = 0; segment < t2.getCacheTopology().getPendingCH().getNumSegments(); ++segment) {
   assertEquals("Topology: " + t2.getCacheTopology(), 1, t2.getCacheTopology().getPendingCH().locateOwnersForSegment(segment).size());
Same here, this assert is too generic considering we know who the owners are.
I got a failure when running the whole test suite:
When running in the IDE, I didn't get a failure, but I got some errors in the log:
pr_rvansa_ISPN-9128_2018-05-22_CoordStopTest-infinispan-core.log.gz
Okay, the first failure is an oversight in the test (I thought that
Force-pushed from 74f9801 to 9fdff67.
@@ -60,7 +62,7 @@
private final AtomicInteger count = new AtomicInteger(1);
private volatile boolean stopped = false;
private final Set<Class<? extends ReplicableCommand>> excludedCommands =
private final Set<Predicate<ReplicableCommand>> excludedCommands =
Please revert, I'd really like to keep the logic on the main thread
waitForClusterToForm();

// segment 0 will be moved to cache(2). Since we've lost coord cache(1) now -> 0 and cache(2) -> 1
chf.setOwnerIndexes(new int[][] { {1}, {0}, {1} });
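For context, the int[][] maps each segment to the index of its owner among the current members. A hypothetical stand-alone illustration (not the real ControlledConsistentHashFactory) of { {1}, {0}, {1} } after coordinator cache(0) leaves, so member 0 is cache(1) and member 1 is cache(2):

```java
import java.util.Arrays;
import java.util.List;

public class OwnerIndexDemo {
    // Hypothetical stand-in for the factory's owner mapping:
    // ownerIndexes[segment][0] is the index of the owning node.
    static String ownerOf(int segment, int[][] ownerIndexes, List<String> members) {
        return members.get(ownerIndexes[segment][0]);
    }

    public static void main(String[] args) {
        // After the coordinator cache(0) is killed, the remaining members
        // re-index as 0 -> cache(1) and 1 -> cache(2).
        List<String> members = Arrays.asList("cache(1)", "cache(2)");
        int[][] ownerIndexes = { {1}, {0}, {1} };
        for (int segment = 0; segment < ownerIndexes.length; ++segment) {
            System.out.println("segment " + segment + " -> "
                    + ownerOf(segment, ownerIndexes, members));
        }
    }
}
```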
Nitpicking, but I'd like this in the test before the node is killed
// Let the rebalance begin
rpcManager2.expectCommand(StateRequestCommand.class,
      request -> assertEquals(StateRequestCommand.Type.CONFIRM_REVOKED_SEGMENTS, request.getType()))
      .send().receiveAll();
It's weird that I wrote expectCommand to only allow access to the command in the consumer, but I'm not going to argue with myself now :)
@@ -503,6 +503,11 @@ private void applyValues(Address address, List<Object> keys, Response response)
for (int i = 0; i < keys.size(); ++i) {
   Object key = keys.get(i);
   InternalCacheValue icv = values[i];
   if (icv == null) {
      // The entry got lost in the meantime - this can happen when the cache is stopped concurrently to processing
      // the GetAllCommand. We'll just avoid NPEs here: data is lost as > 1 nodes have left.
> Ok, then do we still need this change?

I've kept the
// unblock outdated rebalance
rpcManager2.stopBlocking();
keyTransferRequest.send().receiveAll();
This is the kind of thing I meant in the chat... the fact that you call stopBlocking() before completing the blocked request makes me think that the test absolutely needs another command to be invoked remotely, but you'd rather not say what command :)
I need to verify whether this works (I didn't run the test manually after the last update and there was an oversight I had to fix); I've used a similar construct in another test and it seems it got broken, as receiveAll schedules the continuation on the rpcManager's executor.
I think I understand now. In my mind, the only alternative would have been to stop blocking at the end of the test, and stopping the executor wouldn't matter there because all the commands are already done.
@danberindei I know, debugging ATM :-/ One of the issues is that the CH_UPDATE from the old coord and the REBALANCE_START from the new coord can come in arbitrary order, so I need to set the order. Actually, having the leaving coord send CH updates is a pretty big PITA in many cases; this is one aspect of ST that should eventually be revised (I am not sure where we are ATM with the graceful leave).
Yes, this and many other aspects should be revised, but I haven't made much progress -- too many random failures :(
I integrated Will's data container changes and now there's a conflict as well :/
@danberindei I made the tests as deterministic as I could... let's hope it passes this time.
Ah, I forgot about this again... @rvansa, the test is still failing in Jenkins, but for some reason it doesn't appear upfront in the list of failed tests: https://ci.infinispan.org/job/Infinispan/job/PR-5989/18/testReport/
No worries in this case; I've fixed some of the stuff but it's not 100% as the CI shows... I have been lucky enough to reproduce this after a couple hundred invocations, so debugging...
@danberindei sigh, a one-liner bug needs 200+ lines of test to orchestrate the correct order of events... Now I've run the test 2200 times without error, let's see if CI can give me one green run!
@rvansa I'd like us to write more unit tests and fewer functional tests for this kind of issue, but we need a lot of work in order to get there... With the huge lists of dependencies our components have, I'm afraid a unit test may be even longer.
@danberindei Or we fully cover Infinispan in TLA+, verify that and just add a nice Java code generator... :) |
@danberindei Seems the tests have passed now; only some RemoteCounter failure. |
blockAfter(classListToFilter(filter));
}

public void blockAfter(Predicate<ReplicableCommand> filter) {
Without any further info on what caused the latest failures, I'd say it's this... the more complex the filter gets, the more likely it is to exclude the wrong command and crash the whole Rube Goldberg machine.
True, it's painful to enumerate all the commands that state transfer now needs, like I'm trying to do with MockTransport/MockLocalTopologyManager, but IMO the sequence of commands can easily be extracted into reusable methods, and where the "business logic" is too complex for the test to predict the commands, perhaps the logic should be simplified. I'm integrating the PR, but I trust you'll remember what you did to debug the test by the next time it fails :)
Integrated, thanks Radim!
@rvansa should I keep https://issues.jboss.org/browse/ISPN-9095 open?
I think you should close it; with the fix in, the server won't emit the NPE and won't get stuck. Maybe there's an underlying problem, but it will manifest differently.
https://issues.jboss.org/browse/ISPN-9128
Includes https://issues.jboss.org/browse/ISPN-9095 as well.
The fix is just a one-liner :)