ISPN-9128 RehashWithSharedStoreTest.testRehashes random failures #5989
Conversation
// This is used to determine nodes that own the entries. In scattered cache (which uses transitory topology)
// the pendingCH is used for reading but the nodes in there are not guaranteed to have the data yet.
// CurrentCH should be safe - the nodes either have the data or the owner is unknown.
return topology.getCurrentCH();
I meant this method to be a mirror of CacheTopology.getReadConsistentHash(), only without using the unionCH field (which is null on the ClusterTopologyManagerImpl side). Since this is breaking the symmetry, we should change the method name as well.
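The asymmetry can be sketched with hypothetical stand-in classes (plain strings stand in for consistent hashes; this is not the real Infinispan API, and getStableReadConsistentHash is a placeholder name, since the thread doesn't settle on one):

```java
// Hypothetical stand-ins illustrating the asymmetry discussed above.
final class Topology {
    final String currentCH;
    final String pendingCH;
    final String unionCH;

    Topology(String currentCH, String pendingCH, String unionCH) {
        this.currentCH = currentCH;
        this.pendingCH = pendingCH;
        this.unionCH = unionCH;
    }

    // Mirrors the spirit of CacheTopology.getReadConsistentHash():
    // prefer the union CH, then the pending CH, then the current CH.
    String getReadConsistentHash() {
        if (unionCH != null) return unionCH;
        return pendingCH != null ? pendingCH : currentCH;
    }

    // The method under review: always the current CH, because unionCH is
    // null on the ClusterTopologyManagerImpl side and pendingCH owners are
    // not guaranteed to have the data yet.
    String getStableReadConsistentHash() {
        return currentCH;
    }
}

public class ReadChDemo {
    public static void main(String[] args) {
        // During a rebalance the two methods diverge.
        Topology rebalancing = new Topology("current", "pending", null);
        System.out.println(rebalancing.getReadConsistentHash());        // prints "pending"
        System.out.println(rebalancing.getStableReadConsistentHash());  // prints "current"
    }
}
```

The point of the rename suggestion is that a reader of the new method should not assume it behaves like getReadConsistentHash() whenever a pending or union CH exists.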
@@ -503,6 +503,11 @@ private void applyValues(Address address, List<Object> keys, Response response)
for (int i = 0; i < keys.size(); ++i) {
   Object key = keys.get(i);
   InternalCacheValue icv = values[i];
   if (icv == null) {
      // The entry got lost in the meantime - this can happen when the cache is stopped concurrently to processing
      // the GetAllCommand. We'll just avoid NPEs here: data is lost as > 1 nodes have left.
Shouldn't we get an exception if the cache is being stopped?
Not when the command has already entered the interceptor stack and the cache is stopped only after that.
If it can happen for our internal GetAllCommand it can probably happen for the application's GetAllCommands as well, so I'd change the code to throw an IllegalLifecycleStateException if possible.
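A minimal, self-contained sketch of that suggestion (the nested exception class is a local stand-in for Infinispan's real IllegalLifecycleStateException, and applyValues is heavily simplified from the actual method):

```java
import java.util.Arrays;
import java.util.List;

public class ApplyValuesSketch {
    // Local stand-in for org.infinispan.IllegalLifecycleStateException.
    static class IllegalLifecycleStateException extends IllegalStateException {
        IllegalLifecycleStateException(String msg) { super(msg); }
    }

    // Simplified version of the loop in the diff above: instead of silently
    // skipping a null value, surface the concurrent cache stop to the caller.
    static void applyValues(List<Object> keys, Object[] values) {
        for (int i = 0; i < keys.size(); ++i) {
            Object icv = values[i];
            if (icv == null) {
                throw new IllegalLifecycleStateException(
                        "Cache stopped while processing GetAllCommand, key=" + keys.get(i));
            }
            // ... apply icv to the data container ...
        }
    }

    public static void main(String[] args) {
        try {
            applyValues(Arrays.asList("k1", "k2"), new Object[] { "v1", null });
        } catch (IllegalLifecycleStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```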
On the stopping receiver this topology check was ignored because the GetAllCommand has SKIP_OWNERSHIP_CHECK during the state transfer - this node was not the owner in the previous topology (so it can't respond according to the CH), but it has responded during the keys transfer as having the last version of the key. However, as it's being stopped, the data container has been cleared concurrently and it fetches null data.
I agree that this could be improved by another kind of signalling, but that is out of scope of this PR.
On the originator, I don't see why you should throw an exception instead of processing the other data. Would you be ok with issuing a WARN log here?
You're probably right that this is out of the scope of this PR, we could handle ISPN-9095 separately.
> However, as it's being stopped, the data container has been cleared concurrently and it fetches null data.

When a cache shuts down while a remote command is running I would expect BaseBlockingRunnable to replace the command's response with a CacheNotFoundResponse. DataContainer.clear() has priority 999 and PerCacheInboundInvocationHandler.stop() has priority 10, so IMO the GetAllCommand should return either the proper value or a CNFR. BTW, non-scattered caches also rely on CNFR to know that a remote node is no longer an owner and the originator should ask another node, so if the CNFR logic is broken it's even more important to fix it.

> On the originator, I don't see why you should throw an exception instead of processing the other data. Would you be ok with issuing a WARN log here?

I'd fix the remote node to properly reply with a CNFR when it's shutting down, and log an error in ScatteredStateConsumerImpl if a value was invalid. If SSCI can retry that key and fetch the value from another node, we could make it a warning instead.
Oh, I was wrong... I thought that the test was already stopping the cache after the test failure, but in fact TestingUtil is clearing the data container directly. So this was a test problem.
In this case I don't understand why the exception happened in https://issues.jboss.org/browse/ISPN-9095
Ok, then do we still need this change?
@Test(groups = "functional", testName = "scattered.statetransfer.CoordStopTest")
@CleanupAfterMethod
public class CoordStopTest extends MultipleCacheManagersTest {
We actually call it "coordinator" ;)
Personal question: are you Dan or Daniel on your birth certificate? :)
Dan 😄
I know some of our test names are ridiculously long, but I don't think CoordinatorStopTest would be one of them ;)
// start other caches
cache(1);
cache(2);
waitForClusterToForm();
waitForClusterToForm() already starts the default caches.
@Override
protected void clearContent() throws Throwable {
   super.clearContent();
   viewLatch = new CountDownLatch(1);
Is this really needed?
int stableTopologyId = cache(1).getAdvancedCache().getDistributionManager().getCacheTopology().getTopologyId();
BlockingLocalTopologyManager ltm2 = BlockingLocalTopologyManager.replaceTopologyManager(manager(2), cache(2).getName());
ControlledRpcManager rpcManager2 = ControlledRpcManager.replaceRpcManager(cache(2));
rpcManager2.excludeCommands(cmd -> cmd instanceof StateResponseCommand, CoordStopTest::isCancelStateTransfer);
IMO it would be simpler to handle the response commands explicitly, then you can guarantee there are no cancel commands.
What do you mean by handling explicitly? Overriding RpcManager on my own?
No, I meant calling ControlledRpcManager.expectCommand()
IIUC the ControlledRpcManager requires the commands to be sent in order, and the cancel commands depend on the timing - I would probably need to add additional synchronization to enforce them. Not that it wouldn't be possible, but I simply don't care about them.
IMO the order of the cancel commands is pretty straightforward: if you intercept the response commands and you wait for their responses, all the segments will be received and there will be no need for a cancel command.
assertEquals(2, t1.getCacheTopology().getActualMembers().size());
assertEquals(null, t1.getCacheTopology().getPendingCH());
for (Address member : t1.getCacheTopology().getMembers()) {
   assertEquals("Topology: " + t1.getCacheTopology(), 1, t1.getCacheTopology().getCurrentCH().getSegmentsForOwner(member).size());
Since you know the owner indexes, you can assert them directly (maybe even extract this to a method):
assertEquals(emptyList(), t1.getCacheTopology().getCurrentCH().locateOwnersForSegment(0));
assertEquals(singletonList(address(1)), t1.getCacheTopology().getCurrentCH().locateOwnersForSegment(1));
assertEquals(singletonList(address(2)), t1.getCacheTopology().getCurrentCH().locateOwnersForSegment(2));
assertEquals(2, t2.getCacheTopology().getActualMembers().size());
assertNotNull(t2.getCacheTopology().getPendingCH());
for (int segment = 0; segment < t2.getCacheTopology().getPendingCH().getNumSegments(); ++segment) {
   assertEquals("Topology: " + t2.getCacheTopology(), 1, t2.getCacheTopology().getPendingCH().locateOwnersForSegment(segment).size());
Same here, this assert is too generic considering we know who the owners are.
I got a failure when running the whole test suite:
When running in the IDE, I didn't get a failure, but I got some errors in the log:
pr_rvansa_ISPN-9128_2018-05-22_CoordStopTest-infinispan-core.log.gz
Okay, the first failure is an oversight in the test (I thought that
Force-pushed from 74f9801 to 9fdff67.
@@ -60,7 +62,7 @@
private final AtomicInteger count = new AtomicInteger(1);
private volatile boolean stopped = false;
private final Set<Class<? extends ReplicableCommand>> excludedCommands =
private final Set<Predicate<ReplicableCommand>> excludedCommands =
Please revert, I'd really like to keep the logic on the main thread
waitForClusterToForm();

// segment 0 will be moved to cache(2). Since we've lost coord cache(1) now -> 0 and cache(2) -> 1
chf.setOwnerIndexes(new int[][] { {1}, {0}, {1} });
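For context, the int[][] maps each segment to the index of its owner among the current members. A hypothetical stand-alone illustration (not the real ControlledConsistentHashFactory) of { {1}, {0}, {1} } after coordinator cache(0) leaves, so member 0 is cache(1) and member 1 is cache(2):

```java
import java.util.Arrays;
import java.util.List;

public class OwnerIndexDemo {
    // Hypothetical stand-in for the factory's owner mapping:
    // ownerIndexes[segment][0] is the index of the owning node.
    static String ownerOf(int segment, int[][] ownerIndexes, List<String> members) {
        return members.get(ownerIndexes[segment][0]);
    }

    public static void main(String[] args) {
        // After the coordinator cache(0) is killed, the remaining members
        // re-index as 0 -> cache(1) and 1 -> cache(2).
        List<String> members = Arrays.asList("cache(1)", "cache(2)");
        int[][] ownerIndexes = { {1}, {0}, {1} };
        for (int segment = 0; segment < ownerIndexes.length; ++segment) {
            System.out.println("segment " + segment + " -> "
                    + ownerOf(segment, ownerIndexes, members));
        }
    }
}
```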
Nitpicking, but I'd like this in the test before the node is killed
// Let the rebalance begin
rpcManager2.expectCommand(StateRequestCommand.class,
      request -> assertEquals(StateRequestCommand.Type.CONFIRM_REVOKED_SEGMENTS, request.getType()))
      .send().receiveAll();
It's weird that I wrote expectCommand to only allow access to the command in the consumer, but I'm not going to argue with myself now :)
@@ -503,6 +503,11 @@ private void applyValues(Address address, List<Object> keys, Response response)
for (int i = 0; i < keys.size(); ++i) {
   Object key = keys.get(i);
   InternalCacheValue icv = values[i];
   if (icv == null) {
      // The entry got lost in the meantime - this can happen when the cache is stopped concurrently to processing
      // the GetAllCommand. We'll just avoid NPEs here: data is lost as > 1 nodes have left.
> Ok, then do we still need this change?

I've kept the
// unblock outdated rebalance
rpcManager2.stopBlocking();
keyTransferRequest.send().receiveAll();
This is the kind of thing I meant in the chat... the fact that you call stopBlocking() before completing the blocked request makes me think that the test absolutely needs another command to be invoked remotely, but you'd rather not say what command :)
I need to verify whether this works (I didn't run the test manually after the last update and there was an oversight I had to fix); I've used a similar construct in another test and it seems it got broken, as receiveAll schedules the continuation on the rpcManager's executor.
I think I understand now. In my mind, the only alternative would have been to stop blocking at the end of the test, and stopping the executor wouldn't matter there because all the commands are already done.
@danberindei I know, debugging ATM :-/ One of the issues is that the CH_UPDATE from the old coord and the REBALANCE_START from the new coord can come in arbitrary order, so I need to set the order. Actually, having the leaving coord send CH updates is a pretty big PITA in many cases; this is one aspect of ST that should eventually be revised (I am not sure where we are ATM with the graceful leave).
Yes, this and many other aspects should be revised, but I haven't made much progress -- too many random failures :(
I integrated Will's data container changes and now there's a conflict as well :/
@danberindei I made the tests as deterministic as I could... let's hope it passes this time.
Ah, I forgot about this again... @rvansa, the test is still failing in Jenkins, but for some reason it doesn't appear upfront in the list of failed tests: https://ci.infinispan.org/job/Infinispan/job/PR-5989/18/testReport/
No worries in this case; I've fixed some of the stuff but it's not 100% as the CI shows... I have been lucky enough to reproduce this after a couple hundred invocations, so debugging...
@danberindei sigh, a one-liner bug needs 200+ lines of test to orchestrate the correct order of events... Now I've run the test 2200 times without error, let's see if CI can give me one green run!
@rvansa I'd like us to write more unit tests and fewer functional tests for this kind of issue, but we need a lot of work in order to get there... With the huge lists of dependencies our components have, I'm afraid a unit test may be even longer.
@danberindei Or we fully cover Infinispan in TLA+, verify that and just add a nice Java code generator... :) |
@danberindei Seems the tests have passed now; only some RemoteCounter failure. |
blockAfter(classListToFilter(filter));
}

public void blockAfter(Predicate<ReplicableCommand> filter) {
Without any further info on what caused the latest failures, I'd say it's this... the more complex the filter gets, the more likely it is to exclude the wrong command and crash the whole Rube Goldberg machine.
True, it's painful to enumerate all the commands that state transfer now needs, like I'm trying to do with MockTransport/MockLocalTopologyManager, but IMO the sequence of commands can easily be extracted into reusable methods, and where the "business logic" is too complex for the test to predict the commands, perhaps the logic should be simplified. I'm integrating the PR, but I trust you'll remember what you did to debug the test by the next time it fails :)
Integrated, thanks Radim!
@rvansa should I keep https://issues.jboss.org/browse/ISPN-9095 open?
I think you should close it; with the fix in, the server won't emit the NPE and won't get stuck. Maybe there's an underlying problem, but it will manifest differently.
https://issues.jboss.org/browse/ISPN-9128
Includes https://issues.jboss.org/browse/ISPN-9095 as well.
The fix is just a one-liner :)