ISPN-13640 ResilienceMaxRetryIT is failing randomly #9814
Conversation
I don't understand the reason for some of the changes, Diego.
Also, some of the comments are getting outdated and no longer make much sense.
And the test run time is better than before, but it's still anywhere from 15s to 30s (in CI). By changing the test to use the embedded driver and stop() instead of kill(), I got it down to only 5s.
} else {
   remoteCacheManager = createRemoteCacheManager();
}
public <K, V> RemoteCache<K, V> create(int... index) {
Is this change really needed?
To me it seems that a varargs parameter with a special case for the empty call complicates things unnecessarily.
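To illustrate the concern with a minimal, self-contained sketch (hypothetical names, not the actual HotRodTestClientDriver code): a varargs parameter whose empty case silently switches behaviour hides two distinct operations behind one signature, whereas two explicit methods make each call site unambiguous.

```java
import java.util.Arrays;

public class VarargsSketch {
    // Varargs version: calling create() with no arguments
    // silently falls back to a different behaviour.
    static String create(int... index) {
        if (index.length == 0) {
            return "cache on default server";
        }
        return "cache on servers " + Arrays.toString(index);
    }

    // Clearer alternative: two explicit methods, no hidden special case.
    static String createDefault() {
        return "cache on default server";
    }

    static String createOn(int index) {
        return "cache on servers [" + index + "]";
    }
}
```

With the explicit variants, a reader never has to check the argument count to know which path runs.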
RemoteCache<String, String> cache = SERVER_TEST.hotrod().withClientConfiguration(builder).withCacheMode(CacheMode.REPL_SYNC).create(0);
// 1 -> configure max-retries="0" and initial_server_list=server0
// 2 -> start 3 servers
RemoteCache<String, String> cache = createCache(0);
The comment above is confusing: this line isn't starting any servers (the servers are already running at the start of the test method).
It is configuring the RemoteCacheManager, but in a very roundabout way, so the comment on line 38 could also be made clearer.
Bringing back the method body would probably help, but part of the confusion comes from the HotRodTestClientDriver, which IMO simplifies things too much for this kind of test. It's fine for a test that just wants a RemoteCache it can use immediately, but not for a test like this one, which needs fine control over the RemoteCacheManager; the driver just adds another layer that obscures the test logic.
// 5 -> the client will receive the topology from the server. if the client hit a server that is not working,
// the client topology won't be updated.
// we keep iterating until the client topology be: server1 and server2 or the eventually timeout occurred
Eventually.eventuallyEquals(2, () -> {
The phrasing looks odd to me; it's like saying "A happens 100% of the time. If B happens, then A does not happen." Perhaps something like this would be better:
5 -> wait for the client to install a topology containing only server1 and server2
When the server executing an operation has a newer topology, it sends a topology update to the client
But if the client sends the operation to server0, there is no response, so multiple operations may be needed
Also, I was wrong to suggest eventually here: the condition is too complicated, so it's hard to read, and waiting between iterations doesn't help.
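A plain retry loop can make the intent easier to read than a complex eventually condition. This is a minimal, self-contained sketch (hypothetical names; the real test would issue cache operations and inspect the client topology instead of calling a supplier), and it deliberately does not sleep between iterations, since each attempt already does useful work:

```java
import java.util.function.IntSupplier;

public class TopologyWait {
    // Repeatedly checks the observed topology size until it matches the
    // expected one or the deadline passes. Returns whether it matched.
    static boolean waitForTopologySize(IntSupplier topologySize, int expected, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (topologySize.getAsInt() == expected) {
                return true;
            }
        }
        return false;
    }
}
```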
// if the eventually timeout occurred, this step will fail
If eventually() times out, it throws an exception, so this check is no longer needed.
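A minimal sketch of why the follow-up check is redundant (hypothetical helper, not Infinispan's actual Eventually implementation): a polling helper that throws on timeout never returns normally in the failure case, so any code after the call can safely assume the condition held.

```java
import java.util.function.BooleanSupplier;

public class EventuallySketch {
    // Polls the condition until it holds; throws AssertionError on timeout.
    // Because of the throw, callers never need a separate "did it time out?" check.
    static void eventually(BooleanSupplier condition, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return;
            }
        }
        throw new AssertionError("Condition not met within " + timeoutMillis + " ms");
    }
}
```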
// 3 -> start the client and do some operations
for (int i = 0; i < 100; i++) {
   String random = UUID.randomUUID().toString();
   cache.put(random, random);
}

// 4 -> kill localhost:port1
// 4 -> kill server0
InetAddress killedServer = SERVERS.getServerDriver().getServerAddress(0);
SERVERS.getServerDriver().kill(0);
We don't need to kill the server. Killing the server (especially with the container driver) just makes the test take longer, because FD_SOCK won't suspect the killed server, so the test needs to wait for at least 10 seconds until FD_ALL suspects the killed server.
There's no reason not to use stop(0) here.
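For context on where the delay comes from, here is an illustrative JGroups failure-detection fragment (example values, not this project's actual configuration): FD_SOCK detects crashed members via closed TCP connections, while FD_ALL only suspects a member after its heartbeats have been missing for the configured timeout. When FD_SOCK doesn't fire, detection waits on FD_ALL.

```xml
<!-- Illustrative JGroups failure-detection settings, not the test's actual config -->
<FD_SOCK/>
<!-- If the socket-based detection doesn't fire, suspicion waits
     for roughly this timeout (here 10s) of missed heartbeats -->
<FD_ALL timeout="10000" interval="2000"/>
```

A clean stop() closes connections gracefully, so the cluster notices the departure immediately instead of waiting out the heartbeat timeout.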
@@ -76,7 +74,7 @@ public void testMaxRetries0() {
    }
}

// 6 -> kill port2 and port3, start port1
// 6 -> kill server1 and server2, start server0
Adding the comment here because line 83 hasn't changed:
Separating steps 7 and 8 seems very artificial, and the per-step comments don't say anything useful.
This is a repeat of step 5, only with a topology reset instead of a topology update:
7 -> wait for the client to reset the topology to the initial server list
The client resets to the initial server list when all servers are marked as failed.
During each cache operation, the connection fails and the target server is marked as failed,
but we don't control the target server, so we don't know how many operations we need.
When an operation succeeds, we know the client has reset its topology
and has connected to server0.
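The reset-and-retry idea above can be sketched with a self-contained loop (hypothetical names; the real test would call cache.put or cache.get and catch transport exceptions): keep issuing operations, treating each failure as "target server marked as failed", until one succeeds, which signals that the client has fallen back to the initial server list.

```java
import java.util.function.BooleanSupplier;

public class RetryUntilReset {
    // Issues operations until one succeeds or the attempt budget runs out.
    // Returns the number of attempts used, or -1 if none succeeded.
    static int operateUntilSuccess(BooleanSupplier operation, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (operation.getAsBoolean()) {
                // Success implies the client reconnected after resetting
                // its topology to the initial server list.
                return attempt;
            }
            // Failure: the client marks the target server as failed; once all
            // servers are failed, it resets to the initial server list.
        }
        return -1;
    }
}
```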
// 5 -> the client will receive the topology from the server. if the client hit a server that is not working,
// the client topology won't be updated.
// we keep iterating until the client topology be: server1 and server2 or the eventually timeout occurred
Eventually.eventuallyEquals(2, () -> {
String random = UUID.randomUUID().toString();
UUID.randomUUID() is pretty heavyweight; a simple index or even ThreadLocalRandom.current().nextInt() would work just as well.
Also, a cache.get(k) would be cheaper than a cache.put(k, k).
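A quick illustration of the suggestion (self-contained; the key format is arbitrary): a simple index or ThreadLocalRandom produces a perfectly usable test key without UUID's cost.

```java
import java.util.concurrent.ThreadLocalRandom;

public class KeyGeneration {
    // Cheap alternatives to UUID.randomUUID().toString() for test keys.

    // Deterministic: derive the key from the loop index.
    static String indexedKey(int i) {
        return "key-" + i;
    }

    // Random but lightweight: ThreadLocalRandom avoids SecureRandom entirely.
    static String randomKey() {
        return "key-" + ThreadLocalRandom.current().nextInt(Integer.MAX_VALUE);
    }
}
```

For a test that only needs distinct keys per iteration, the indexed form is both the cheapest and the easiest to debug.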
Hi @danberindei, I later saw a link to your branch in the first comment.
Diego, my commit was only to make the test use the embedded driver, on top of your changes here. I didn't change anything else, or at least I tried not to.
https://issues.redhat.com/browse/ISPN-13640