
ISPN-15064 Hot Rod Client flaky tests on CI #11150

Merged: 3 commits into infinispan:main on Aug 3, 2023

Conversation

@jabolina (Member) commented on Jul 21, 2023:

https://issues.redhat.com/browse/ISPN-15064
https://issues.redhat.com/browse/ISPN-15089

  • Allowing retry of an operation on the same channel;
  • Concurrency fixes in some tests;

Opening as a draft to let CI run. There might still be some flaky tests because of the timeout configuration.

@jabolina force-pushed the ISPN-15064 branch 2 times, most recently from 2902001 to 53e9d4c on July 24, 2023 at 17:42
@jabolina (Member Author) commented:

So, at first I thought GracefulShutdownRestartIT was failing because the cache needed to send the join request. I added a loop calling getOrCreate on each server, making sure the cache is ready at the end. But randomly, I get:

09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 12:35:14,526 ERROR [o.i.s.h.BaseRequestProcessor] ISPN005003: Exception reported org.infinispan.commons.CacheConfigurationException: ISPN000436: Cache 'CD85FF8441D91270E3C522DF38D58B61A28927F2' has been requested, but no matching cache configuration exists
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.configuration.ConfigurationManager.getConfiguration(ConfigurationManager.java:61)
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:684)
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:674)
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:563)
09:35:14,528 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:526)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.manager.DefaultCacheManagerAdmin.getOrCreateCache(DefaultCacheManagerAdmin.java:49)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.tasks.admin.CacheGetOrCreateTask.execute(CacheGetOrCreateTask.java:47)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.tasks.admin.CacheGetOrCreateTask.execute(CacheGetOrCreateTask.java:25)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.core.admin.AdminServerTask.execute(AdminServerTask.java:57)
09:35:14,529 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.core.admin.AdminOperationsHandler.lambda$runTask$0(AdminOperationsHandler.java:50)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.security.Security.doAs(Security.java:105)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.infinispan.server.core.admin.AdminOperationsHandler.lambda$runTask$1(AdminOperationsHandler.java:50)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.jboss.threads.ContextHandler$1.runWith(ContextHandler.java:18)
09:35:14,530 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)
09:35:14,531 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)
09:35:14,531 INFO  [o.i.CONTAINER] [o.i.s.r.GracefulShutdownRestartIT#1]STDOUT: 	at java.base/java.lang.Thread.run(Thread.java:833)

When this happens, the cache is not registered, and the tests fail with missing members. I'll look further into this.
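
For clarity, a minimal sketch of the readiness loop meant above, assuming one RemoteCacheManager per server; the method name, cache name, and template are illustrative, not the test's actual code.

```java
import java.util.List;
import org.infinispan.client.hotrod.RemoteCacheManager;

class CacheReadinessSketch {
   static void ensureCacheOnAllServers(List<RemoteCacheManager> managers) {
      for (RemoteCacheManager manager : managers) {
         // getOrCreateCache is idempotent, so issuing it against every server ensures the
         // cache definition is known cluster-wide before the test continues.
         manager.administration().getOrCreateCache("test-cache", "org.infinispan.DIST_SYNC");
      }
   }
}
```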

@fax4ever (Contributor) left a comment:

Great to improve it. Just a few questions here.

@@ -158,7 +153,6 @@ public Object visitPutKeyValueCommand(InvocationContext ctx, PutKeyValueCommand
executor.submit(() -> {
try {
barrier.await();
@fax4ever (Contributor):

Are we sure that every interceptor is executed by a different thread?
Maybe a CountDownLatch would be better.
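
A minimal sketch of the CountDownLatch variant suggested here, assuming the task runs on an executor; the names (started, executor) and the timeout are illustrative only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LatchSketch {
   static void runWithLatch() throws InterruptedException {
      CountDownLatch started = new CountDownLatch(1);
      ExecutorService executor = Executors.newSingleThreadExecutor();
      executor.submit(() -> {
         // countDown never blocks, so it is safe even if the interceptor ends up
         // running on the submitting thread (unlike CyclicBarrier#await).
         started.countDown();
         // ... perform the operation the interceptor observes
      });
      // The test thread only waits; it does not need a second party to reach a barrier.
      if (!started.await(10, TimeUnit.SECONDS)) {
         throw new AssertionError("task never started");
      }
      executor.shutdown();
   }
}
```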


private final ChannelFactory channelFactory;
private final Configuration configuration;
private final ClientListenerNotifier listenerNotifier;
// operations may be registered in any thread, and are removed in event loop thread
private final ConcurrentMap<Long, HotRodOperation<?>> incomplete = new ConcurrentHashMap<>();
private final ConcurrentMap<Long, Integer> retries = new ConcurrentHashMap<>();
@fax4ever (Contributor):

Is there a limit to the retries?

@jabolina (Member Author):

Yes, the client has an org.infinispan.client.hotrod.configuration.ConfigurationBuilder#maxRetries(int) option to control this.

But providing a bit more context, I am torn about this change. We usually treat retries as coming from failures in the channel, meaning the channel eventually closes and the retry happens on a brand-new channel. That's the assertion I changed below.

With this change, I am allowing the same operation to retry on the same channel. I assume the channel is healthy in this case; that's why the operation is using it again. Cases like this would happen because of a short timeout configuration, an overloaded server, etc. If the client writes the operation multiple times to the same (healthy) channel, the client receives a response multiple times. This change keeps track of that.

What worries me is:

  • The "healthy channel" assumption not being valid;
  • We retry N times but receive fewer than N responses. We would keep this object in memory without reason, and if we remove it too early, the channel will throw an exception and close.

I plan on updating this logic to instantiate this map only when necessary.
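
For reference, a minimal sketch of the client-side knobs mentioned above; the server address and the concrete values are illustrative only.

```java
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

class RetryConfigSketch {
   static RemoteCacheManager newClient() {
      ConfigurationBuilder builder = new ConfigurationBuilder();
      builder.addServer().host("127.0.0.1").port(11222); // illustrative server address
      builder.maxRetries(3)          // upper bound on retries per operation
             .socketTimeout(2_000);  // a short socket timeout is what makes same-channel retries likely
      return new RemoteCacheManager(builder.build());
   }
}
```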

@@ -43,10 +48,16 @@ public CompletableFuture<?> add(Object providedId, String routingKey, Object ent
return CompletableFutures.completedNull();
}

return delegate.add(convertedValue.typeIdentifier, providedId,
@fax4ever (Contributor):

@jabolina I wouldn't serialize and block the indexing operations. At the moment we can use the I/O thread (non-blocking) to add the indexing operations to the internal blocking queue of Hibernate Search, which is quite efficient (even if not optimal -- Lucene indexing operations are still blocking). I would avoid introducing a synchronized block and switching to a blocking thread.
If the goal here is to stop the test only when there are no ongoing indexing operations, I think there is another way. We could, for instance, introduce an SPI to await on. If you agree I can handle it, but later this week.

@jabolina (Member Author):

I updated RemoteHitCountAccuracyTest by changing the timeout and retries to avoid this scenario, and removed the synchronization stuff. We can look for a better approach for this case later.

@jabolina force-pushed the ISPN-15064 branch 2 times, most recently from 667daac to b0a4950 on August 1, 2023 at 12:41
@jabolina marked this pull request as ready for review on August 1, 2023 at 12:42
CompletionStages.join(getOrCreateCache(name, configuration, adminFlags));
CompletionStages.join(createCacheInternal(name, null, configuration, adminFlags)
.thenCompose(r -> {
if (r instanceof CacheState) {
@jabolina (Member Author) commented on Aug 1, 2023:

This is because of a failure that shows up with the GracefulShutdownRestartIT test. The test does some operations, restarts two nodes, and does some operations again.

At times during the restart, this step of creating the cache from the persisted configuration fails. The putIfAbsent operation fails because the other node already has the configuration in the config cache, so the ConfigurationListener is not notified, the cache does not start, and afterward it always fails because the configuration does not exist locally.

Now we try to create from persistence, and if a configuration already exists remotely, we verify compatibility and use THE REMOTE. This happens during startup, so if it is a pure concurrency case it is OK, as both nodes have the same configuration. But if a node joins later, its persisted configuration could be outdated.

I'll create a JIRA and amend the commit message.
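
To make the intent concrete, a hedged sketch of that fallback; resolveConfiguration, persistedConfiguration, and remoteConfiguration are hypothetical names, not the actual Infinispan code.

```java
import java.util.concurrent.CompletionStage;
import org.infinispan.commons.CacheConfigurationException;
import org.infinispan.configuration.cache.Configuration;

abstract class ConfigurationFallbackSketch {
   abstract Configuration persistedConfiguration(String name);               // hypothetical: locally persisted config
   abstract CompletionStage<Configuration> remoteConfiguration(String name); // hypothetical: lookup in the config cache

   CompletionStage<Configuration> resolveConfiguration(String name) {
      Configuration persisted = persistedConfiguration(name);
      return remoteConfiguration(name).thenApply(remote -> {
         if (remote == null)
            return persisted;            // nobody registered the cache remotely yet, keep the persisted one
         if (!persisted.matches(remote)) // verify compatibility before preferring the remote copy
            throw new CacheConfigurationException("Persisted and remote configuration differ for cache " + name);
         return remote;                  // during startup both are equal; for a late joiner the remote wins
      });
   }
}
```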

@wburns (Member) left a comment:

Looks fine as far as I can tell, just a couple of general questions.

// The server will read the buffer by chunks of 64K, meaning, at least 1,000 reads.
// Thinking that each socket read takes around 100ms, we need to set a timeout of 100s.
// We include some extra time for the server to process the request and send the response.
private static final int TIMEOUT_SEC = 100 + 5;
@wburns (Member):

500s seems awfully excessive... Why would sending a 64K chunk ever take 100 ms for a local test in the first place?

@jabolina (Member Author):

Yeah, for this one I am exaggerating heheheh. I was running the suite single-threaded plus tracing enabled, and this one kept failing. But it is 100s, though. CI would likely do better; should we reduce the time?

@wburns (Member):

Even 100s is a long time for a single test... Should this test not move to the stress category if it really takes this long?

How long does it take without tracing?

@jabolina (Member Author) commented on Aug 2, 2023:

Gave a few runs:

<testcase name="testSearches" classname="org.infinispan.client.hotrod.size.HugeProtobufMessageTest" time="40.828"/>

Pretty much all executions were around 40~50s. Guess we can cut it more or less in half.

@wburns (Member):

This is still way too long. Using a profiler I can see the reason, though: it is because of the single frame decoder that we have in place for tests. I would say we should allow disabling it for tests that are heavy hitters like this one:

https://github.com/infinispan/infinispan/blob/main/server/hotrod/src/test/java/org/infinispan/server/hotrod/transport/TestHandlersChannelInitializer.java#L25

By commenting out that line, the test went from 12s to 1.5s for me locally.
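
As an illustration of that suggestion, a hedged sketch (not the real TestHandlersChannelInitializer): a test initializer with a switch to skip the byte-at-a-time decoder for heavy payload tests. All class and flag names here are made up.

```java
import java.util.List;
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.handler.codec.ByteToMessageDecoder;

// Feeds the pipeline one byte per decode pass, stressing partial reads in tests.
class SingleByteDecoderSketch extends ByteToMessageDecoder {
   @Override
   protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
      if (in.isReadable())
         out.add(in.readRetainedSlice(1));
   }
}

class TestChannelInitializerSketch extends ChannelInitializer<Channel> {
   private final boolean singleByteFrames; // hypothetical switch a heavy test could turn off

   TestChannelInitializerSketch(boolean singleByteFrames) {
      this.singleByteFrames = singleByteFrames;
   }

   @Override
   protected void initChannel(Channel ch) {
      if (singleByteFrames)
         ch.pipeline().addLast(new SingleByteDecoderSketch());
      // ... the remaining Hot Rod test handlers would be added here
   }
}
```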

@@ -27,6 +28,7 @@ public OpenTelemetryClient(SpanExporter spanExporter) {
// but this is a test
SpanProcessor spanProcessor = SimpleSpanProcessor.create(spanExporter);
SdkTracerProviderBuilder builder = SdkTracerProvider.builder()
.setSampler(Sampler.alwaysOn())
@wburns (Member):

Unrelated?

@jabolina (Member Author):

Nope, but maybe excessive. The test using this client verifies that a specific, fixed number of events is exported, so I added this to sample everything. Since the number of events is small, maybe this change is not needed because everything is sampled either way.
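
For reference, a sketch of what the test's tracer setup looks like with that sampler; spanExporter stands in for whatever exporter the test wires in, and the class and method names are illustrative.

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import io.opentelemetry.sdk.trace.samplers.Sampler;

class TracingSketch {
   static SdkTracerProvider tracerProvider(SpanExporter spanExporter) {
      return SdkTracerProvider.builder()
            .addSpanProcessor(SimpleSpanProcessor.create(spanExporter)) // export each span as soon as it ends
            .setSampler(Sampler.alwaysOn())                             // record every span so the expected event count is deterministic
            .build();
   }
}
```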

@fax4ever (Contributor):

It looks good to me to add it.

* Allowing retry of an operation on the same channel;
* Concurrency fixes in some tests;
This includes some fixes for the failover tests
…in some cases

* Use the remote configuration if it is compatible with the persisted one.

* Optionally disable the fixed frame decoder in the server for tests.
@wburns wburns merged commit 6739462 into infinispan:main Aug 3, 2023
3 of 4 checks passed
@jabolina jabolina deleted the ISPN-15064 branch August 3, 2023 14:58