
ISPN-13640 ResilienceMaxRetryIT is failing randomly #9814

Closed
wants to merge 2 commits

Conversation

diegolovison (Contributor)

@danberindei (Member) left a comment


I don't understand the reason for some of the changes, Diego.
Also some of the comments are getting outdated and don't make a lot of sense any more.

And the test run time is better than before, but it's still anywhere from 15s to 30s (in CI). By changing the test to use the embedded driver and stop() instead of kill(), I got it down to only 5s.

main...danberindei:pr_diegolovison_ISPN-13640

} else {
remoteCacheManager = createRemoteCacheManager();
}
public <K, V> RemoteCache<K, V> create(int... index) {

Is this change really needed?
To me it seems that the varargs parameter, with its special case when empty, complicates things unnecessarily.
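A minimal sketch of the point about the varargs signature — `CacheFactory` is a hypothetical stand-in, not the actual test helper:

```java
// Hypothetical sketch, not the Infinispan test helper: a varargs parameter
// with an implicit "empty means default" rule hides two distinct behaviors.
class CacheFactory {
    // Varargs version: createVarargs() and createVarargs(0) do different
    // things, but the signature does not say so at the call site.
    String createVarargs(int... index) {
        if (index.length == 0) {
            return "cache on default server";
        }
        return "cache on server " + index[0];
    }

    // Explicit overloads make both behaviors self-documenting.
    String create() {
        return "cache on default server";
    }

    String create(int index) {
        return "cache on server " + index;
    }
}
```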

RemoteCache<String, String> cache = SERVER_TEST.hotrod().withClientConfiguration(builder).withCacheMode(CacheMode.REPL_SYNC).create(0);
// 1 -> configure max-retries="0" and initial_server_list=server0
// 2 -> start 3 servers
RemoteCache<String, String> cache = createCache(0);

The comment above is confusing: this line isn't starting any servers (the servers are running at the start of the test method).
It is configuring the RemoteCacheManager, but in a very roundabout way, so the comment on line 38 could also be made clearer.

Bringing back the method body would probably help, but part of the confusion comes from the HotRodTestClientDriver, which IMO simplifies things too much for this kind of test. It's fine for a test that just wants a RemoteCache it can use immediately, but not for a test like this one, which needs fine control over the RemoteCacheManager; it just adds another layer that obscures the test logic.
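For illustration, configuring the client directly might look roughly like this — a sketch only, the exact Hot Rod client API may vary between Infinispan versions, and the host/port are placeholders:

```java
// Sketch: configure the RemoteCacheManager directly instead of going
// through HotRodTestClientDriver, so the max-retries and initial server
// list settings are visible in the test itself.
org.infinispan.client.hotrod.configuration.ConfigurationBuilder builder =
      new org.infinispan.client.hotrod.configuration.ConfigurationBuilder();
builder.addServer().host("127.0.0.1").port(11222); // initial_server_list = server0
builder.maxRetries(0);                             // max-retries = "0"
RemoteCacheManager remoteCacheManager = new RemoteCacheManager(builder.build());
```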

Comment on lines 52 to 55
// 5 -> the client will receive the topology from the server. if the client hit a server that is not working,
// the client topology won't be updated.
// we keep iterating until the client topology be: server1 and server2 or the eventually timeout occurred
Eventually.eventuallyEquals(2, () -> {

The phrasing looks odd to me, it's like saying "A happens 100% of the time. If B happens, then A does not happen." Perhaps something like this would be better:

5 -> wait for the client to install a topology containing only server1 and server2
When the server executing an operation has a newer topology, it sends a topology update to the client
But if the client sends the operation to server0, there is no response, so multiple operations may be needed

Also, I was wrong to suggest eventually here: the condition is too complicated, so it's hard to read, and waiting between iterations doesn't help.
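The plainer alternative could look like this — a hypothetical helper, not the Infinispan `Eventually` API: each iteration performs a cache operation (which is what can carry back a topology update), so there is nothing to sleep on between iterations.

```java
import java.util.function.IntSupplier;

// Sketch: retry cache operations until the client's topology reaches the
// expected size, with a hard deadline instead of sleeps between iterations.
class TopologyWait {
    static void waitForTopologySize(int expected, Runnable operation,
                                    IntSupplier topologySize, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (topologySize.getAsInt() != expected) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("topology never reached size " + expected);
            }
            operation.run(); // each cache operation may bring a topology update
        }
    }
}
```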


// if the eventually timeout occurred, this step will fail

If eventually() times out, it throws an exception, so this check is no longer needed.


// 3 -> start the client and do some operations
for (int i = 0; i < 100; i++) {
String random = UUID.randomUUID().toString();
cache.put(random, random);
}

// 4 -> kill localhost:port1
// 4 -> kill server0
InetAddress killedServer = SERVERS.getServerDriver().getServerAddress(0);
SERVERS.getServerDriver().kill(0);

We don't need to kill the server. Killing the server (especially with the container driver) just makes the test take longer, because FD_SOCK won't suspect the killed server, so the test needs to wait for at least 10 seconds until FD_ALL suspects the killed server.
There's no reason not to use stop(0) here.
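The suggested change is a one-liner (sketch, using the test's existing `SERVERS` field):

```java
// stop() closes the server's connections, so FD_SOCK suspects the node
// immediately; kill() leaves detection to FD_ALL's heartbeat timeout.
SERVERS.getServerDriver().stop(0);   // instead of SERVERS.getServerDriver().kill(0)
```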

@@ -76,7 +74,7 @@ public void testMaxRetries0() {
}
}

// 6 -> kill port2 and port3, start port1
// 6 -> kill server1 and server2, start server0

Adding the comment here because lines 83-83 haven't changed:

Separating steps 7 and 8 seems very artificial, and the per-step comments don't say anything useful.
This is a repeat of step 5, only with a topology reset instead of a topology update:

7 -> wait for the client to reset the topology to the initial server list
The client resets to the initial server list when all servers are marked as failed.
During each cache operation, the connection fails and the target server is marked as failed,
but we don't control the target server, so we don't know how many operations we need.
When an operation succeeds, we know the client has reset its topology
and has connected to server0.
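Step 7 could then be expressed as a plain retry loop — a hypothetical helper, not code from the PR:

```java
// Sketch: retry an operation until one succeeds. Each failed attempt marks
// its target server as failed; the first success implies the client has
// reset to the initial server list and reconnected to server0.
class TopologyResetWait {
    static void waitForTopologyReset(Runnable cacheOperation, int maxAttempts) {
        RuntimeException last = new IllegalStateException("no attempt made");
        for (int i = 0; i < maxAttempts; i++) {
            try {
                cacheOperation.run(); // e.g. cache.put(k, v)
                return;               // success: the client is back on server0
            } catch (RuntimeException e) {
                last = e;             // target server marked as failed, retry
            }
        }
        throw last;
    }
}
```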

String random = UUID.randomUUID().toString();

UUID.randomUUID() is pretty heavyweight; a simple index or even ThreadLocalRandom.current().nextInt() would work just as well.
Also, a cache.get(k) would be cheaper than a cache.put(k, k).
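A sketch of the cheaper probe — hypothetical helper, with a `Map` standing in for the RemoteCache:

```java
import java.util.Map;

// Sketch: a plain loop index instead of UUID.randomUUID(), and get()
// instead of put() as the probe operation. `cache` is a stand-in Map;
// in the test it would be the RemoteCache.
class KeyProbe {
    static int probe(Map<String, String> cache, int iterations) {
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            String key = "k" + i;          // cheap key, no UUID needed
            if (cache.get(key) != null) {  // a read is cheaper than a write
                hits++;
            }
        }
        return hits;
    }
}
```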

@diegolovison diegolovison marked this pull request as draft January 19, 2022 02:15
@diegolovison (Contributor, Author) commented Jan 19, 2022

Hi @danberindei
I think we should go with your commit. Closing this PR.

I later saw the link to your branch in the first comment.
You implemented the restart.

@danberindei (Member)

Diego, my commit was only to make the test use the embedded driver, on top of your changes here. I didn't change anything else, or at least I tried not to.
