
ISPN-15064 Hot Rod Client flaky tests on CI #11150

Merged: 3 commits into infinispan:main on Aug 3, 2023

Conversation

@jabolina (Member) commented on Jul 21, 2023:

https://issues.redhat.com/browse/ISPN-15064
https://issues.redhat.com/browse/ISPN-15089

  • Allowing retry of an operation on the same channel;
  • Concurrency fixes in some tests;

Opening as a draft to let CI run. There might still be some flaky tests because of the timeout configuration.

@jabolina force-pushed the ISPN-15064 branch 2 times, most recently from 2902001 to 53e9d4c on July 24, 2023 at 17:42
@jabolina (Member Author) commented:

So, at first I thought GracefulShutdownRestartIT was failing because the cache needed to send the join request. I added a loop calling getOrCreate on each server, making sure the cache is ready at the end. But randomly, I get:

09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 12:35:14,526 ERROR [o.i.s.h.BaseRequestProcessor] ISPN005003: Exception reported org.infinispan.commons.CacheConfigurationException: ISPN000436: Cache 'CD85FF8441D91270E3C522DF38D58B61A28927F2' has been requested, but no matching cache configuration exists
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.configuration.ConfigurationManager.getConfiguration(ConfigurationManager.java:61)
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:684)
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:674)
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:563)
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:526)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManagerAdmin.getOrCreateCache(DefaultCacheManagerAdmin.java:49)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.tasks.admin.CacheGetOrCreateTask.execute(CacheGetOrCreateTask.java:47)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.tasks.admin.CacheGetOrCreateTask.execute(CacheGetOrCreateTask.java:25)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.core.admin.AdminServerTask.execute(AdminServerTask.java:57)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.core.admin.AdminOperationsHandler.lambda$runTask$0(AdminOperationsHandler.java:50)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.security.Security.doAs(Security.java:105)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.core.admin.AdminOperationsHandler.lambda$runTask$1(AdminOperationsHandler.java:50)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
09:35:14,531 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
09:35:14,531 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at java.base/java.lang.Thread.run(Thread.java:833)

When this happens, the cache is not registered, and the tests fail with missing members. I'll look further into this.
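
For clarity, a minimal sketch of the readiness loop meant above, assuming one RemoteCacheManager per server; the method name, cache name, and template are illustrative, not the test's actual code.

```java
import java.util.List;
import org.infinispan.client.hotrod.RemoteCacheManager;

class CacheReadinessSketch {
   static void ensureCacheOnAllServers(List<RemoteCacheManager> managers) {
      for (RemoteCacheManager manager : managers) {
         // getOrCreateCache is idempotent, so issuing it against every server ensures the
         // cache definition is known cluster-wide before the test continues.
         manager.administration().getOrCreateCache("test-cache", "org.infinispan.DIST_SYNC");
      }
   }
}
```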

@fax4ever (Contributor) left a comment:

Great to improve it. Just a few questions here.

@@ -158,7 +153,6 @@ public Object visitPutKeyValueCommand(InvocationContext ctx, PutKeyValueCommand
executor.submit(() -> {
try {
barrier.await();
@fax4ever (Contributor):

Are we sure that every interceptor is executed by a different thread?
Maybe a CountDownLatch would be better.
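
A minimal sketch of the CountDownLatch variant suggested here, assuming the task runs on an executor; the names (started, executor) and the timeout are illustrative only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LatchSketch {
   static void runWithLatch() throws InterruptedException {
      CountDownLatch started = new CountDownLatch(1);
      ExecutorService executor = Executors.newSingleThreadExecutor();
      executor.submit(() -> {
         // countDown never blocks, so it is safe even if the interceptor ends up
         // running on the submitting thread (unlike CyclicBarrier#await).
         started.countDown();
         // ... perform the operation the interceptor observes
      });
      // The test thread only waits; it does not need a second party to reach a barrier.
      if (!started.await(10, TimeUnit.SECONDS)) {
         throw new AssertionError("task never started");
      }
      executor.shutdown();
   }
}
```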


private final ChannelFactory channelFactory;
private final Configuration configuration;
private final ClientListenerNotifier listenerNotifier;
// operations may be registered in any thread, and are removed in event loop thread
private final ConcurrentMap<Long, HotRodOperation<?>> incomplete = new ConcurrentHashMap<>();
private final ConcurrentMap<Long, Integer> retries = new ConcurrentHashMap<>();
@fax4ever (Contributor):

Is there a limit to the retries?

@jabolina (Member Author):

Yes, the client has an org.infinispan.client.hotrod.configuration.ConfigurationBuilder#maxRetries(int) option to control this.

But providing a bit more context, I am torn about this change. We usually treat retries as coming from failures in the channel, meaning the channel eventually closes and the retry happens on a brand-new channel. That's the assertion I changed below.

With this change, I am allowing the same operation to retry on the same channel. I assume the channel is healthy in this case; that's why the operation is using it again. Cases like this would happen because of a short timeout configuration, an overloaded server, etc. If the client writes the operation multiple times to the same (healthy) channel, the client receives a response multiple times. This change keeps track of that.

What worries me is:

  • The "healthy channel" assumption not being valid;
  • We retry N times but receive fewer than N responses. We would keep this object in memory without reason, and if we remove it too early, the channel will throw an exception and close.

I plan on updating this logic to instantiate this map only when necessary.
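
For reference, a minimal sketch of the client-side knobs mentioned above; the server address and the concrete values are illustrative only.

```java
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

class RetryConfigSketch {
   static RemoteCacheManager newClient() {
      ConfigurationBuilder builder = new ConfigurationBuilder();
      builder.addServer().host("127.0.0.1").port(11222); // illustrative server address
      builder.maxRetries(3)          // upper bound on retries per operation
             .socketTimeout(2_000);  // a short socket timeout is what makes same-channel retries likely
      return new RemoteCacheManager(builder.build());
   }
}
```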

@@ -43,10 +48,16 @@ public CompletableFuture<?> add(Object providedId, String routingKey, Object ent
return CompletableFutures.completedNull();
}

return delegate.add(convertedValue.typeIdentifier, providedId,
@fax4ever (Contributor):

@jabolina I wouldn't serialize and block the indexing operations. At the moment we can use the I/O thread (non-blocking) to add the indexing operations to the internal blocking queue of Hibernate Search, which is quite efficient (even if not optimal -- Lucene indexing operations are still blocking). I would avoid introducing a synchronized block and switching to a blocking thread.
If the goal here is to stop the test only when there are no ongoing indexing operations, I think there is another way. We could, for instance, introduce an SPI to await on. If you agree I can handle it, but later this week.

@jabolina (Member Author):

I updated RemoteHitCountAccuracyTest by changing the timeout and retries to avoid this scenario, and removed the synchronization stuff. We can look for a better approach for this case later.

@jabolina force-pushed the ISPN-15064 branch 2 times, most recently from 667daac to b0a4950 on August 1, 2023 at 12:41
@jabolina marked this pull request as ready for review on August 1, 2023 at 12:42
CompletionStages.join(getOrCreateCache(name, configuration, adminFlags));
CompletionStages.join(createCacheInternal(name, null, configuration, adminFlags)
.thenCompose(r -> {
if (r instanceof CacheState) {
@jabolina (Member Author) commented on Aug 1, 2023:

This is because of a failure that shows up with the GracefulShutdownRestartIT test. The test does some operations, restarts two nodes, and does some operations again.

At times during the restart, this step of creating the cache from the persisted configuration fails. The putIfAbsent operation fails because the other node already has the configuration in the config cache, so the ConfigurationListener is not notified, the cache does not start, and afterward it always fails because the configuration does not exist locally.

Now we try to create from persistence, and if a configuration already exists remotely, we verify compatibility and use THE REMOTE. This happens during startup, so if it is a pure concurrency case it is OK, as both nodes have the same configuration. But if a node joins later, its persisted configuration could be outdated.

I'll create a JIRA and amend the commit message.
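
To make the intent concrete, a hedged sketch of that fallback; resolveConfiguration, persistedConfiguration, and remoteConfiguration are hypothetical names, not the actual Infinispan code.

```java
import java.util.concurrent.CompletionStage;
import org.infinispan.commons.CacheConfigurationException;
import org.infinispan.configuration.cache.Configuration;

abstract class ConfigurationFallbackSketch {
   abstract Configuration persistedConfiguration(String name);               // hypothetical: locally persisted config
   abstract CompletionStage<Configuration> remoteConfiguration(String name); // hypothetical: lookup in the config cache

   CompletionStage<Configuration> resolveConfiguration(String name) {
      Configuration persisted = persistedConfiguration(name);
      return remoteConfiguration(name).thenApply(remote -> {
         if (remote == null)
            return persisted;            // nobody registered the cache remotely yet, keep the persisted one
         if (!persisted.matches(remote)) // verify compatibility before preferring the remote copy
            throw new CacheConfigurationException("Persisted and remote configuration differ for cache " + name);
         return remote;                  // during startup both are equal; for a late joiner the remote wins
      });
   }
}
```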

@wburns (Member) left a comment:

Looks fine as far as I can tell, just a couple of general questions.

// The server will read the buffer by chunks of 64K, meaning, at least 1,000 reads.
// Thinking that each socket read takes around 100ms, we need to set a timeout of 100s.
// We include some extra time for the server to process the request and send the response.
private static final int TIMEOUT_SEC = 100 + 5;
@wburns (Member):

500s seems awfully excessive... Why would sending a 64K chunk ever take 100 ms for a local test in the first place?

@jabolina (Member Author):

Yeah, for this one I am exaggerating heheheh. I was running the suite single-threaded plus tracing enabled, and this one kept failing. But it is 100s, though. CI would likely do better; should we reduce the time?

@wburns (Member):

Even 100s is a long time for a single test... Should this test not move to the stress category if it really takes this long?

How long does it take without tracing?

@jabolina (Member Author) commented on Aug 2, 2023:

Gave a few runs:

<testcase name="testSearches" classname="org.infinispan.client.hotrod.size.HugeProtobufMessageTest" time="40.828"/>

Pretty much all executions were around 40~50s. Guess we can cut it more or less in half.

@wburns (Member):

This is still way too long. Using a profiler I can see the reason, though: it is because of the single frame decoder that we have in place for tests. I would say we should allow disabling it for tests that are heavy hitters like this one:

https://github.com/infinispan/infinispan/blob/main/server/hotrod/src/test/java/org/infinispan/server/hotrod/transport/TestHandlersChannelInitializer.java#L25

By commenting out that line, the test went from 12s to 1.5s for me locally.
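
As an illustration of that suggestion, a hedged sketch (not the real TestHandlersChannelInitializer): a test initializer with a switch to skip the byte-at-a-time decoder for heavy payload tests. All class and flag names here are made up.

```java
import java.util.List;
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.handler.codec.ByteToMessageDecoder;

// Feeds the pipeline one byte per decode pass, stressing partial reads in tests.
class SingleByteDecoderSketch extends ByteToMessageDecoder {
   @Override
   protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
      if (in.isReadable())
         out.add(in.readRetainedSlice(1));
   }
}

class TestChannelInitializerSketch extends ChannelInitializer<Channel> {
   private final boolean singleByteFrames; // hypothetical switch a heavy test could turn off

   TestChannelInitializerSketch(boolean singleByteFrames) {
      this.singleByteFrames = singleByteFrames;
   }

   @Override
   protected void initChannel(Channel ch) {
      if (singleByteFrames)
         ch.pipeline().addLast(new SingleByteDecoderSketch());
      // ... the remaining Hot Rod test handlers would be added here
   }
}
```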

@@ -27,6 +28,7 @@ public OpenTelemetryClient(SpanExporter spanExporter) {
// but this is a test
SpanProcessor spanProcessor = SimpleSpanProcessor.create(spanExporter);
SdkTracerProviderBuilder builder = SdkTracerProvider.builder()
.setSampler(Sampler.alwaysOn())
@wburns (Member):

Unrelated?

@jabolina (Member Author):

Nope, but maybe excessive. The test using this client verifies that a specific, fixed number of events is exported, so I added this to sample everything. Since the number of events is small, maybe this change is not needed because everything is sampled either way.
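
For reference, a sketch of what the test's tracer setup looks like with that sampler; spanExporter stands in for whatever exporter the test wires in, and the class and method names are illustrative.

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import io.opentelemetry.sdk.trace.samplers.Sampler;

class TracingSketch {
   static SdkTracerProvider tracerProvider(SpanExporter spanExporter) {
      return SdkTracerProvider.builder()
            .addSpanProcessor(SimpleSpanProcessor.create(spanExporter)) // export each span as soon as it ends
            .setSampler(Sampler.alwaysOn())                             // record every span so the expected event count is deterministic
            .build();
   }
}
```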

@fax4ever (Contributor):

It looks good to me to add it.

* Allowing retry of an operation on the same channel;
* Concurrency fixes in some tests;
This includes some fixes for the failover tests
…in some cases

* Use the remote configuration if it is compatible with the persisted one.

* Optionally disable the fixed frame decoder in the server for tests.
@wburns wburns merged commit 6739462 into infinispan:main Aug 3, 2023
3 of 4 checks passed
@jabolina jabolina deleted the ISPN-15064 branch August 3, 2023 14:58