ISPN-15181 Deadlock when creating caches with a zero-capacity node and a single stateful node #11313
Conversation
Resolved review comments on:
- core/src/main/java/org/infinispan/topology/ClusterCacheStatus.java
- core/src/main/java/org/infinispan/globalstate/impl/GlobalConfigurationManagerImpl.java (two threads)
- core/src/test/java/org/infinispan/distribution/ZeroCapacityAdministrationTest.java
String cacheName = "another-cache";
ScopedState ss = new ScopedState(CACHE_SCOPE, cacheName);
while (!DistributionTestHelper.isFirstOwner(node1.getCache(CONFIG_STATE_CACHE_NAME), ss)) {
   if (tries > 50) fail("Exceeded attempts to find configuration mapping to remote");
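As an aside, the bounded-polling pattern the quoted test uses (poll a cluster condition, but fail after a fixed number of attempts instead of spinning forever) can be sketched in isolation. The helper below is a hypothetical stand-in, not Infinispan's `DistributionTestHelper`:

```java
import java.util.function.BooleanSupplier;

public class BoundedPollSketch {
   // Hypothetical helper: poll a condition a bounded number of times and fail
   // loudly instead of looping forever, mirroring the quoted test loop.
   static void awaitOrFail(BooleanSupplier condition, int maxTries, long sleepMillis)
         throws InterruptedException {
      for (int tries = 0; tries <= maxTries; tries++) {
         if (condition.getAsBoolean()) return;
         Thread.sleep(sleepMillis);
      }
      throw new AssertionError("Exceeded attempts waiting for condition");
   }

   public static void main(String[] args) throws InterruptedException {
      int[] calls = {0};
      // The condition becomes true on the third check.
      awaitOrFail(() -> ++calls[0] >= 3, 50, 1);
      System.out.println("condition met after " + calls[0] + " checks");
   }
}
```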
I like the simplicity of this. I had originally thought we would use mocks etc. to ensure that the first owner is node1, but maybe this is good enough.
One thing I have noticed, though, is that if the test fails (due to reverting your fix), it's unable to clean up its state and must be interrupted.
I'll take a look just to make sure.
The reason is that during cleanup, each cache is stopped individually. Since the cache is waiting for the non-zero capacity node to join, the cleanup hangs because the cache future [1] is never completed.
[1]
infinispan/core/src/main/java/org/infinispan/manager/DefaultCacheManager.java
Lines 550 to 553 in eb7e866
if (cacheFuture != null) {
   try {
      return (Cache<K, V>) cacheFuture.join();
   } catch (CompletionException e) {
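To make the failure mode concrete, here is a minimal, self-contained sketch (not Infinispan code) of why joining a never-completed cache future hangs the stopping thread. A bounded `get()` is used so the hang is observable instead of blocking forever:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CacheFutureHangDemo {
   public static void main(String[] args) throws Exception {
      // Stand-in for the cache future on a zero-capacity node that is still
      // waiting for a non-zero capacity node to join: it is never completed.
      CompletableFuture<String> cacheFuture = new CompletableFuture<>();

      // cacheFuture.join() here would block forever, which is exactly the
      // cleanup hang described above. A bounded get() makes it observable:
      try {
         cacheFuture.get(100, TimeUnit.MILLISECONDS);
         throw new AssertionError("unexpected completion");
      } catch (TimeoutException expected) {
         System.out.println("cache future never completed; join() would hang");
      }
   }
}
```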
I've updated the test to avoid that. If we have a regression, it will be easier to identify.
Wouldn't it be simpler to return a completed future in …
That results in the cache configuration not being defined locally:

org.infinispan.commons.CacheConfigurationException: ISPN000436: Cache 'another-cache' has been requested, but no matching cache configuration exists
at org.infinispan.configuration.ConfigurationManager.getConfiguration(ConfigurationManager.java:61)
at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:684)
at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:674)
at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:563)
at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:526)
at org.infinispan.manager.DefaultCacheManagerAdmin.getOrCreateCache(DefaultCacheManagerAdmin.java:49)
at org.infinispan.distribution.ZeroCapacityAdministrationTest.testCreateNewClusteredCacheFromZeroToRemote(ZeroCapacityAdministrationTest.java:78)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
at org.testng.internal.MethodInvocationHelper$1.runTestMethod(MethodInvocationHelper.java:230)
at org.infinispan.commons.test.TestNGLongTestsHook.run(TestNGLongTestsHook.java:24)
at org.testng.internal.MethodInvocationHelper.invokeHookable(MethodInvocationHelper.java:242)
at org.testng.internal.Invoker.invokeMethod(Invoker.java:579)
at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:719)
at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:989)
at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109)
at org.testng.TestRunner.privateRun(TestRunner.java:648)
at org.testng.TestRunner.run(TestRunner.java:505)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:455)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:450)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:415)
at org.testng.SuiteRunner.run(SuiteRunner.java:364)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:84)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1208)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1137)
at org.testng.TestNG.runSuites(TestNG.java:1049)
at org.testng.TestNG.run(TestNG.java:1017)
at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:109)
I think this is a good call, given that I backported the fix to my …
I mean, you still have to invoke …
Reverted the changes in …

diff --git a/core/src/main/java/org/infinispan/globalstate/impl/GlobalConfigurationStateListener.java b/core/src/main/java/org/infinispan/globalstate/impl/GlobalConfigurationStateListener.java
index a09f054563..97419b1e41 100644
--- a/core/src/main/java/org/infinispan/globalstate/impl/GlobalConfigurationStateListener.java
+++ b/core/src/main/java/org/infinispan/globalstate/impl/GlobalConfigurationStateListener.java
@@ -42,9 +42,13 @@ public CompletionStage<Void> handleCreate(CacheEntryCreatedEvent<ScopedState, Ca
String name = event.getKey().getName();
CacheState state = event.getValue();
- return CACHE_SCOPE.equals(scope) ?
- gcm.createCacheLocally(name, state) :
- gcm.createTemplateLocally(name, state);
+ if (CACHE_SCOPE.equals(scope)) {
+      // Zero-capacity nodes have to wait for a non-zero capacity node to start the cache.
+      // Prevent the cache creation from blocking the listener invocation.
+ var f = gcm.createCacheLocally(name, state);
+ return isZeroNode() ? CompletableFutures.completedNull() : f;
+ }
+ return gcm.createTemplateLocally(name, state);
}
@CacheEntryModified
@@ -74,10 +78,15 @@ public CompletionStage<Void> handleRemove(CacheEntryRemovedEvent<ScopedState, Ca
String name = event.getKey().getName();
if (CACHE_SCOPE.equals(scope)) {
CONTAINER.debugf("Stopping cache %s because it was removed from global state", name);
- return gcm.removeCacheLocally(name);
+ var f = gcm.removeCacheLocally(name);
+ return isZeroNode() ? CompletableFutures.completedNull() : f;
} else {
CONTAINER.debugf("Removing template %s because it was removed from global state", name);
return gcm.removeTemplateLocally(name);
}
}
+
+ private boolean isZeroNode() {
+ return gcm.cacheManager.getCacheManagerConfiguration().isZeroCapacityNode();
+ }
}
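In isolation, the pattern in that diff (kick off the local creation, but hand the listener an already-completed stage on zero-capacity nodes) looks roughly like this. All names below are hypothetical stand-ins, not Infinispan's actual `gcm` or `CompletableFutures` API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public class NonBlockingListenerSketch {
   // Stand-in for gcm.createCacheLocally(name, state): on a zero-capacity
   // node this future only completes once a stateful node joins, so in this
   // sketch it simply never completes.
   static CompletableFuture<Void> createCacheLocally(String name) {
      return new CompletableFuture<>();
   }

   static CompletionStage<Void> handleCreate(String name, boolean zeroCapacityNode) {
      CompletableFuture<Void> f = createCacheLocally(name); // creation still starts
      // Zero-capacity nodes must not block the listener on the future.
      return zeroCapacityNode ? CompletableFuture.completedFuture(null) : f;
   }

   public static void main(String[] args) {
      boolean done = handleCreate("another-cache", true).toCompletableFuture().isDone();
      System.out.println("listener stage already done on zero-capacity node: " + done);
   }
}
```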
+1. That's easier to manage than the CFs + Map. Let me update.
Force-pushed from 021c76c to 9b39131.
Updated with Pedro's suggestions. Thanks!
LGTM. Just waiting on another CI run as the last one was interrupted.
CI has quite a few failures, let me check them.
Force-pushed "…d a single stateful node" from 9b39131 to 7292536.
Updated. Needed to use …
Thanks @jabolina |
ISPN-15181 Deadlock when creating caches with a zero-capacity node and a single stateful node
https://issues.redhat.com/browse/ISPN-15181
I changed the code to delay the local cache creation if there is a remote request to create the same cache. We return a completed future, though, so the listener can proceed.
If we had a way to dynamically handle events async or sync, the solution could be better. Also, I didn't want to modify the zero-capacity nodes' join, as that has a wider effect which I am somewhat unfamiliar with.
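A rough sketch of that idea, defer the local creation behind the in-flight remote request while handing the listener an already-completed stage. Every name below is hypothetical and the bookkeeping is simplified; this is not the actual Infinispan implementation:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class DelayedLocalCreateSketch {
   // Hypothetical bookkeeping: caches for which a remote create request is in flight.
   static final Map<String, CompletableFuture<Void>> pendingRemote = new ConcurrentHashMap<>();
   static final StringBuilder log = new StringBuilder();

   static CompletableFuture<Void> createCacheLocally(String name) {
      CompletableFuture<Void> remote = pendingRemote.get(name);
      if (remote != null) {
         // A remote request is already creating this cache: chain the local
         // creation behind it, but return an already-completed future so the
         // listener invocation can proceed immediately.
         remote.thenRun(() -> doLocalCreate(name));
         return CompletableFuture.completedFuture(null);
      }
      doLocalCreate(name);
      return CompletableFuture.completedFuture(null);
   }

   static void doLocalCreate(String name) {
      log.append("created ").append(name).append(';');
   }

   public static void main(String[] args) {
      CompletableFuture<Void> remote = new CompletableFuture<>();
      pendingRemote.put("another-cache", remote);
      createCacheLocally("another-cache"); // listener is not blocked
      System.out.println("before remote completes: " + log);
      remote.complete(null);               // the deferred local creation now runs
      System.out.println("after remote completes:  " + log);
   }
}
```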