
ISPN-13640 ResilienceMaxRetryIT is failing randomly #9814

Closed
wants to merge 2 commits

Conversation

diegolovison (Contributor)

@danberindei (Member) left a comment


I don't understand the reason for some of the changes, Diego.
Also some of the comments are getting outdated and don't make a lot of sense any more.

And the test run time is better than before, but it's still anywhere from 15s to 30s (in CI). By changing the test to use the embedded driver and stop() instead of kill(), I got it down to only 5s.

main...danberindei:pr_diegolovison_ISPN-13640

} else {
remoteCacheManager = createRemoteCacheManager();
}
public <K, V> RemoteCache<K, V> create(int... index) {

Is this change really needed?
To me it seems that the varargs parameter, with its special case when empty, complicates things unnecessarily.
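A minimal sketch of the point about the varargs signature — `CacheFactory` is a hypothetical stand-in, not the actual test helper:

```java
// Hypothetical sketch, not the Infinispan test helper: a varargs parameter
// with an implicit "empty means default" rule hides two distinct behaviors.
class CacheFactory {
    // Varargs version: createVarargs() and createVarargs(0) do different
    // things, but the signature does not say so at the call site.
    String createVarargs(int... index) {
        if (index.length == 0) {
            return "cache on default server";
        }
        return "cache on server " + index[0];
    }

    // Explicit overloads make both behaviors self-documenting.
    String create() {
        return "cache on default server";
    }

    String create(int index) {
        return "cache on server " + index;
    }
}
```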

RemoteCache<String, String> cache = SERVER_TEST.hotrod().withClientConfiguration(builder).withCacheMode(CacheMode.REPL_SYNC).create(0);
// 1 -> configure max-retries="0" and initial_server_list=server0
// 2 -> start 3 servers
RemoteCache<String, String> cache = createCache(0);

The comment above is confusing: this line isn't starting any servers (the servers are running at the start of the test method).
It is configuring the RemoteCacheManager, but in a very roundabout way, so the comment on line 38 could also be made clearer.

Bringing back the method body would probably help, but part of the confusion comes from the HotRodTestClientDriver, which IMO simplifies things too much for this kind of test. It's fine for a test that just wants a RemoteCache it can use immediately, but not for a test like this one, which needs fine control over the RemoteCacheManager; it just adds another layer that obscures the test logic.
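For illustration, configuring the client directly might look roughly like this — a sketch only, the exact Hot Rod client API may vary between Infinispan versions, and the host/port are placeholders:

```java
// Sketch: configure the RemoteCacheManager directly instead of going
// through HotRodTestClientDriver, so the max-retries and initial server
// list settings are visible in the test itself.
org.infinispan.client.hotrod.configuration.ConfigurationBuilder builder =
      new org.infinispan.client.hotrod.configuration.ConfigurationBuilder();
builder.addServer().host("127.0.0.1").port(11222); // initial_server_list = server0
builder.maxRetries(0);                             // max-retries = "0"
RemoteCacheManager remoteCacheManager = new RemoteCacheManager(builder.build());
```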

Comment on lines 52 to 55
// 5 -> the client will receive the topology from the server. if the client hit a server that is not working,
// the client topology won't be updated.
// we keep iterating until the client topology be: server1 and server2 or the eventually timeout occurred
Eventually.eventuallyEquals(2, () -> {

The phrasing looks odd to me, it's like saying "A happens 100% of the time. If B happens, then A does not happen." Perhaps something like this would be better:

5 -> wait for the client to install a topology containing only server1 and server2
When the server executing an operation has a newer topology, it sends a topology update to the client
But if the client sends the operation to server0, there is no response, so multiple operations may be needed

Also, I was wrong to suggest eventually here: the condition is too complicated, so it's hard to read, and waiting between iterations doesn't help.
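The plainer alternative could look like this — a hypothetical helper, not the Infinispan `Eventually` API: each iteration performs a cache operation (which is what can carry back a topology update), so there is nothing to sleep on between iterations.

```java
import java.util.function.IntSupplier;

// Sketch: retry cache operations until the client's topology reaches the
// expected size, with a hard deadline instead of sleeps between iterations.
class TopologyWait {
    static void waitForTopologySize(int expected, Runnable operation,
                                    IntSupplier topologySize, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (topologySize.getAsInt() != expected) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("topology never reached size " + expected);
            }
            operation.run(); // each cache operation may bring a topology update
        }
    }
}
```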


// if the eventually timeout occurred, this step will fail

If eventually() times out, it throws an exception, so this check is no longer needed.


// 3 -> start the client and do some operations
for (int i = 0; i < 100; i++) {
String random = UUID.randomUUID().toString();
cache.put(random, random);
}

// 4 -> kill localhost:port1
// 4 -> kill server0
InetAddress killedServer = SERVERS.getServerDriver().getServerAddress(0);
SERVERS.getServerDriver().kill(0);

We don't need to kill the server. Killing the server (especially with the container driver) just makes the test take longer, because FD_SOCK won't suspect the killed server, so the test needs to wait for at least 10 seconds until FD_ALL suspects the killed server.
There's no reason not to use stop(0) here.
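The suggested change is a one-liner (sketch, using the test's existing `SERVERS` field):

```java
// stop() closes the server's connections, so FD_SOCK suspects the node
// immediately; kill() leaves detection to FD_ALL's heartbeat timeout.
SERVERS.getServerDriver().stop(0);   // instead of SERVERS.getServerDriver().kill(0)
```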

@@ -76,7 +74,7 @@ public void testMaxRetries0() {
}
}

// 6 -> kill port2 and port3, start port1
// 6 -> kill server1 and server2, start server0

Adding the comment here because lines 83-83 haven't changed:

Separating steps 7 and 8 seems very artificial, and the per-step comments don't say anything useful.
This is a repeat of step 5, only with a topology reset instead of a topology update:

7 -> wait for the client to reset the topology to the initial server list
The client resets to the initial server list when all servers are marked as failed.
During each cache operation, the connection fails and the target server is marked as failed,
but we don't control the target server, so we don't know how many operations we need.
When an operation succeeds, we know the client has reset its topology
and has connected to server0.
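Step 7 could then be expressed as a plain retry loop — a hypothetical helper, not code from the PR:

```java
// Sketch: retry an operation until one succeeds. Each failed attempt marks
// its target server as failed; the first success implies the client has
// reset to the initial server list and reconnected to server0.
class TopologyResetWait {
    static void waitForTopologyReset(Runnable cacheOperation, int maxAttempts) {
        RuntimeException last = new IllegalStateException("no attempt made");
        for (int i = 0; i < maxAttempts; i++) {
            try {
                cacheOperation.run(); // e.g. cache.put(k, v)
                return;               // success: the client is back on server0
            } catch (RuntimeException e) {
                last = e;             // target server marked as failed, retry
            }
        }
        throw last;
    }
}
```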

String random = UUID.randomUUID().toString();

UUID.randomUUID() is pretty heavyweight; a simple index or even ThreadLocalRandom.current().nextInt() would work just as well.
Also, a cache.get(k) would be cheaper than a cache.put(k, k).
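A sketch of the cheaper probe — hypothetical helper, with a `Map` standing in for the RemoteCache:

```java
import java.util.Map;

// Sketch: a plain loop index instead of UUID.randomUUID(), and get()
// instead of put() as the probe operation. `cache` is a stand-in Map;
// in the test it would be the RemoteCache.
class KeyProbe {
    static int probe(Map<String, String> cache, int iterations) {
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            String key = "k" + i;          // cheap key, no UUID needed
            if (cache.get(key) != null) {  // a read is cheaper than a write
                hits++;
            }
        }
        return hits;
    }
}
```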

@diegolovison diegolovison marked this pull request as draft January 19, 2022 02:15
@diegolovison (Contributor, Author) commented Jan 19, 2022

Hi @danberindei
I think we should go with your commit. Closing this PR.

I later saw the link to your branch in the first comment.
You implemented the restart.

@danberindei (Member)

Diego, my commit was only to make the test use the embedded driver, on top of your changes here. I didn't change anything else, or at least I tried not to.
