ISPN-15064 Hot Rod Client flaky tests on CI #11150
Conversation
Force-pushed from 2902001 to 53e9d4c (compare)
So, I thought at first that
When this happens, the cache is not registered, and the tests fail with the missing members. I'll look further into this.
Great to improve it. Just a few questions here.
Resolved review threads (outdated):
- ...client/src/test/java/org/infinispan/client/hotrod/event/ClientClusterFailoverEventsTest.java
- ...nt/hotrod-client/src/test/java/org/infinispan/client/hotrod/test/MultiHotRodServersTest.java
- ...client/src/test/java/org/infinispan/client/hotrod/event/ClientClusterFailoverEventsTest.java
@@ -158,7 +153,6 @@ public Object visitPutKeyValueCommand(InvocationContext ctx, PutKeyValueCommand
executor.submit(() -> {
   try {
      barrier.await();
Are we sure that the interceptor is executed by a different thread?
Maybe a CountDownLatch would be better.
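As a hedged illustration of the reviewer's suggestion (the names here are hypothetical, not the actual Infinispan test code): a CountDownLatch releases a waiter regardless of which thread counts down, whereas a CyclicBarrier blocks until a fixed number of distinct parties arrive, which deadlocks if the "interceptor" callback ends up running on the awaiting thread itself.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a latch lets the test proceed no matter which
// thread executes the callback, unlike a CyclicBarrier that requires a
// fixed number of distinct arriving parties.
public class LatchSketch {
   static boolean runWithLatch() {
      try {
         CountDownLatch latch = new CountDownLatch(1);
         ExecutorService executor = Executors.newSingleThreadExecutor();
         // Counting down works even if this task ran on the caller thread.
         executor.submit(latch::countDown);
         boolean released = latch.await(5, TimeUnit.SECONDS);
         executor.shutdown();
         return released;
      } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
         return false;
      }
   }

   public static void main(String[] args) {
      System.out.println(runWithLatch()); // prints "true"
   }
}
```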
private final ChannelFactory channelFactory;
private final Configuration configuration;
private final ClientListenerNotifier listenerNotifier;
// operations may be registered in any thread, and are removed in event loop thread
private final ConcurrentMap<Long, HotRodOperation<?>> incomplete = new ConcurrentHashMap<>();
private final ConcurrentMap<Long, Integer> retries = new ConcurrentHashMap<>();
Is there a limit to the retries?
Yes, the client has an org.infinispan.client.hotrod.configuration.ConfigurationBuilder#maxRetries(int) option to control this.
But to provide a bit more context: I am torn about this change. We usually treat retries as coming from failures in the channel, meaning that it eventually closes and retries happen on a brand-new channel. That's the assertion I changed below.
With this change, I am allowing the same operation to retry on the same channel. I assume the channel is healthy in this case; that's why the operation uses it again. Cases like this would happen due to a short timeout configuration, an overloaded server, etc. If the client writes the operation multiple times on the same (healthy) channel, the client receives a response multiple times. This change keeps track of that.
What worries me is:
- The "healthy channel" assumption might not be valid;
- We retry N times but receive fewer than N responses. We would keep this object in memory for no reason, and if we remove it too early, the channel will throw an exception and close.
I plan on updating this logic to instantiate this map only when necessary.
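The bookkeeping described above can be sketched as follows. This is a hypothetical stand-in, not the actual client code: it counts attempts per operation id in a ConcurrentMap, caps them at a configured maximum, and removes the entry on completion so entries do not linger when fewer responses than retries arrive.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: track per-operation retry counts and stop
// retrying on the same channel once maxRetries is reached.
public class RetryTracker {
   private final ConcurrentMap<Long, Integer> retries = new ConcurrentHashMap<>();
   private final int maxRetries;

   public RetryTracker(int maxRetries) {
      this.maxRetries = maxRetries;
   }

   // Returns true if the operation identified by id may be attempted again.
   public boolean tryRetry(long id) {
      int attempts = retries.merge(id, 1, Integer::sum); // atomic increment
      return attempts <= maxRetries;
   }

   // Called on any terminal outcome so stale entries do not stay in memory.
   public void complete(long id) {
      retries.remove(id);
   }
}
```

Pairing complete(id) with every terminal outcome is exactly the concern raised in the second bullet above: miss one path and the entry is retained for no reason.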
@@ -43,10 +48,16 @@ public CompletableFuture<?> add(Object providedId, String routingKey, Object ent
   return CompletableFutures.completedNull();
}

return delegate.add(convertedValue.typeIdentifier, providedId,
@jabolina I wouldn't serialize and block the indexing operations. At the moment we can use the I/O thread (non-blocking) to add the indexing operations to the internal blocking queue of Hibernate Search, which is quite efficient (even if not optimal -- Lucene indexing operations are still blocking). I would avoid introducing a synchronized block and switching the thread to a blocking one.
If the goal here is to stop the test only when there are no ongoing indexing operations, I think there is another way. We could, for instance, introduce an SPI to await on. If you agree, I can handle it, but later this week.
I updated the RemoteHitCountAccuracyTest by changing the timeout and retries to avoid this scenario, and removed the synchronization. We can look for a better approach for this case later.
Force-pushed from 667daac to b0a4950 (compare)
CompletionStages.join(getOrCreateCache(name, configuration, adminFlags));
CompletionStages.join(createCacheInternal(name, null, configuration, adminFlags)
      .thenCompose(r -> {
         if (r instanceof CacheState) {
This is because of a failure that shows up with the GracefulShutdownRestartIT test. The test does some operations, restarts two nodes, and does some operations again.
Sometimes during the restart, this step of creating the cache from the persisted configuration fails. The putIfAbsent operation fails because the other node already has the configuration in the config cache; the ConfigurationListener is not notified, the cache does not start, and afterward it always fails because the configuration does not exist locally.
Now we try to create from persistence, and if a configuration already exists remotely, we verify compatibility and use the remote one. This happens during start, so in the concurrent case it is OK, as both nodes have the same configuration. But if a node joins later, its persisted configuration could be outdated.
I'll create a JIRA and amend the commit message.
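The race described above can be sketched with a plain CompletableFuture pipeline. This is a simplified stand-in, not Infinispan's real types: the config cache is a Map, configurations are Strings, and the compatibility check is a placeholder.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "use the remote configuration if putIfAbsent
// loses the race" instead of failing the cache start.
public class ConfigRace {
   private final Map<String, String> configCache = new ConcurrentHashMap<>();

   public CompletableFuture<String> createFromPersistence(String name, String persisted) {
      return CompletableFuture.supplyAsync(() -> configCache.putIfAbsent(name, persisted))
            .thenCompose(existing -> {
               if (existing == null) {
                  // We won the race: the persisted configuration is authoritative.
                  return CompletableFuture.completedFuture(persisted);
               }
               // Another node registered a configuration first: check
               // compatibility and adopt the remote one.
               if (!compatible(existing, persisted)) {
                  return CompletableFuture.failedFuture(
                        new IllegalStateException("incompatible configuration for " + name));
               }
               return CompletableFuture.completedFuture(existing);
            });
   }

   private boolean compatible(String remote, String persisted) {
      // Placeholder check; real code would compare the parsed configurations.
      return remote.equals(persisted);
   }
}
```

The interesting case is the second caller: putIfAbsent returns the existing value, and instead of failing, the pipeline falls back to the remote configuration when it is compatible.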
Looks fine as far as I can tell, just a couple of general questions.
// The server will read the buffer in chunks of 64K, meaning at least 1,000 reads.
// Assuming each socket read takes around 100ms, we need to set a timeout of 100s.
// We include some extra time for the server to process the request and send the response.
private static final int TIMEOUT_SEC = 100 + 5;
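The arithmetic behind the constant can be checked directly. The payload size below is an assumption (the comment only states "at least 1,000 reads"); the point is that 1,000 reads at 100ms each plus a 5-second margin yields the 105-second value.

```java
// Sanity check of the timeout arithmetic: a payload read in fixed-size
// chunks takes ceil(payload / chunk) reads; multiply by the assumed time
// per read and add a processing margin.
public class TimeoutMath {
   static int timeoutSeconds(long payloadBytes, int chunkBytes, int msPerRead, int marginSec) {
      long reads = (payloadBytes + chunkBytes - 1) / chunkBytes; // ceiling division
      return (int) (reads * msPerRead / 1000) + marginSec;
   }

   public static void main(String[] args) {
      // Assumed sizes: 64 MB payload, 64 KB chunks -> 1,000 reads -> 100s + 5s.
      System.out.println(timeoutSeconds(64_000_000L, 64_000, 100, 5)); // prints "105"
   }
}
```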
500s seems awfully excessive... Why would sending a 64K chunk ever take 100 ms for a local test in the first place?
Yeah, for this one I am exaggerating heheheh. I was running the suite single-threaded with tracing enabled, and this one kept failing. But it is 100s, though. CI would likely do better; should we reduce the time?
Even 100s is a long time for a single test... Should this test not move to the stress category if it really takes this long?
How long does it take without tracing?
Gave it a few runs:
<testcase name="testSearches" classname="org.infinispan.client.hotrod.size.HugeProtobufMessageTest" time="40.828"/>
Pretty much all executions were around 40~50s. I guess we can cut it more or less in half.
This is still way too long. Using a profiler I can see the reason, though: it is the single-frame decoder that we have in place for tests. I would say we should allow disabling it for tests that are heavy hitters like this one.
By commenting out that line, the test went from 12s to 1.5s for me locally.
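To illustrate why such a decoder is costly (this is a simplified, hypothetical stand-in in plain Java, not the actual Netty-based test decoder): it buffers incoming chunks until a whole length-prefixed frame is available, so a huge message gets copied and re-scanned on every chunk that arrives.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a frame-accumulating decoder: chunks are
// buffered until a complete length-prefixed frame is present. Note the
// full copy of the buffer on every chunk, which is what hurts for
// multi-megabyte messages.
public class SingleFrameSketch {
   private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

   // Feed one network chunk; returns the complete frames (payloads without
   // the 4-byte big-endian length prefix) that became available.
   public List<byte[]> feed(byte[] chunk) {
      buffer.write(chunk, 0, chunk.length);
      List<byte[]> frames = new ArrayList<>();
      byte[] data = buffer.toByteArray(); // re-copies everything on every chunk
      int offset = 0;
      while (data.length - offset >= 4) {
         int len = ((data[offset] & 0xff) << 24) | ((data[offset + 1] & 0xff) << 16)
               | ((data[offset + 2] & 0xff) << 8) | (data[offset + 3] & 0xff);
         if (data.length - offset - 4 < len) {
            break; // frame not complete yet, keep buffering
         }
         byte[] frame = new byte[len];
         System.arraycopy(data, offset + 4, frame, 0, len);
         frames.add(frame);
         offset += 4 + len;
      }
      buffer.reset();
      buffer.write(data, offset, data.length - offset); // keep the remainder
      return frames;
   }
}
```

With a real Netty pipeline, the equivalent behavior would come from a frame decoder handler; the sketch just shows why disabling it for large-payload tests pays off.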
@@ -27,6 +28,7 @@ public OpenTelemetryClient(SpanExporter spanExporter) {
   // but this is a test
   SpanProcessor spanProcessor = SimpleSpanProcessor.create(spanExporter);
   SdkTracerProviderBuilder builder = SdkTracerProvider.builder()
         .setSampler(Sampler.alwaysOn())
Unrelated?
Nope, but it may be excessive. The test using this client verifies specifically that a fixed number of events is exported, so I added this to sample everything. Since the number of events is small, maybe this change is not needed, because everything is sampled either way.
It is fine to add it, in my opinion.
* Allow retrying an operation on the same channel;
* Concurrency fixes in some tests.
This includes some fixes for the failover tests.
Resolved review thread (outdated):
- ...rod/src/test/java/org/infinispan/server/hotrod/transport/TestHandlersChannelInitializer.java
…in some cases * Use the remote configuration if it is compatible with the persisted one. * Optionally disable the fixed frame decoder in the server for tests.
https://issues.redhat.com/browse/ISPN-15064
https://issues.redhat.com/browse/ISPN-15089
Opening as a draft to let CI run. There might still be some flakiness because of the timeout configuration.