Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISPN-15181 Deadlock when creating caches with a zero-capacity node an… #11313

Merged
merged 1 commit into from Sep 27, 2023

Conversation

jabolina
Copy link
Member

…d a single stateful node

https://issues.redhat.com/browse/ISPN-15181

I changed to delay the cache local creation if there is a remote request to create the same cache. We return a completed future, though, so the listener can proceed.

If we had a way to handle events async/sync dynamically, the solution could be better. Also, I didn't want to modify the zero capacity nodes' join as that has a wider effect, which I am somewhat unfamiliar with.

String cacheName = "another-cache";
ScopedState ss = new ScopedState(CACHE_SCOPE, cacheName);
while (!DistributionTestHelper.isFirstOwner(node1.getCache(CONFIG_STATE_CACHE_NAME), ss)) {
if (tries > 50) fail("Exceeded attempts to find configuration mapping to remote");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplicity of this. I had originally thought we would use mocks etc to ensure that the firstOwner is node1, but maybe this is good enough.

One thing I have noticed though, is that if the test fails (due to reverting your fix) it's unable to cleanup it's state and must be interrupted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look just to make sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that during the cleanup, each cache is stopped individually. Since we have the cache waiting for the non-zero capacity node to join, the cleanup hangs as the cache future [1] is never completed.

[1]

if (cacheFuture != null) {
try {
return (Cache<K, V>) cacheFuture.join();
} catch (CompletionException e) {

I've updated the test to avoid that. If we have a regression, it will be easier to identify.

@pruivo
Copy link
Member

pruivo commented Sep 21, 2023

wouldn't be simpler to return a completed future in GlobalConfigurationStateListener if the capacity factor==0 ?

@ryanemerson
Copy link
Contributor

wouldn't be simpler to return a completed future in GlobalConfigurationStateListener if the capacity factor==0 ?

That results in the cache configuration not being defined locally:

org.infinispan.commons.CacheConfigurationException: ISPN000436: Cache 'another-cache' has been requested, but no matching cache configuration exists

	at org.infinispan.configuration.ConfigurationManager.getConfiguration(ConfigurationManager.java:61)
	at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:684)
	at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:674)
	at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:563)
	at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:526)
	at org.infinispan.manager.DefaultCacheManagerAdmin.getOrCreateCache(DefaultCacheManagerAdmin.java:49)
	at org.infinispan.distribution.ZeroCapacityAdministrationTest.testCreateNewClusteredCacheFromZeroToRemote(ZeroCapacityAdministrationTest.java:78)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
	at org.testng.internal.MethodInvocationHelper$1.runTestMethod(MethodInvocationHelper.java:230)
	at org.infinispan.commons.test.TestNGLongTestsHook.run(TestNGLongTestsHook.java:24)
	at org.testng.internal.MethodInvocationHelper.invokeHookable(MethodInvocationHelper.java:242)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:579)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:719)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:989)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109)
	at org.testng.TestRunner.privateRun(TestRunner.java:648)
	at org.testng.TestRunner.run(TestRunner.java:505)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:455)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:450)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:415)
	at org.testng.SuiteRunner.run(SuiteRunner.java:364)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:84)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1208)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1137)
	at org.testng.TestNG.runSuites(TestNG.java:1049)
	at org.testng.TestNG.run(TestNG.java:1017)
	at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
	at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:109)

@ryanemerson
Copy link
Contributor

ryanemerson commented Sep 21, 2023

Also, I didn't want to modify the zero capacity nodes' join as that has a wider effect, which I am somewhat unfamiliar with.

I think this is a good call given that org.infinispan.CONFIG and backup/restore is somewhat of a special case.

I backported the fix to my 14.0.x investigation PR and can confirm it's fixed the issue when deploying with the Operator 🙂

@ryanemerson ryanemerson added the 14.0.x Annotate a PR with this label if you want it to be backported to the 14.0.x branch label Sep 21, 2023
@pruivo
Copy link
Member

pruivo commented Sep 21, 2023

wouldn't be simpler to return a completed future in GlobalConfigurationStateListener if the capacity factor==0 ?

That results in the cache configuration not being defined locally:

org.infinispan.commons.CacheConfigurationException: ISPN000436: Cache 'another-cache' has been requested, but no matching cache configuration exists

	at org.infinispan.configuration.ConfigurationManager.getConfiguration(ConfigurationManager.java:61)
	at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:684)
	at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:674)
	at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:563)
	at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:526)
	at org.infinispan.manager.DefaultCacheManagerAdmin.getOrCreateCache(DefaultCacheManagerAdmin.java:49)
	at org.infinispan.distribution.ZeroCapacityAdministrationTest.testCreateNewClusteredCacheFromZeroToRemote(ZeroCapacityAdministrationTest.java:78)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
	at org.testng.internal.MethodInvocationHelper$1.runTestMethod(MethodInvocationHelper.java:230)
	at org.infinispan.commons.test.TestNGLongTestsHook.run(TestNGLongTestsHook.java:24)
	at org.testng.internal.MethodInvocationHelper.invokeHookable(MethodInvocationHelper.java:242)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:579)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:719)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:989)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109)
	at org.testng.TestRunner.privateRun(TestRunner.java:648)
	at org.testng.TestRunner.run(TestRunner.java:505)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:455)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:450)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:415)
	at org.testng.SuiteRunner.run(SuiteRunner.java:364)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:84)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1208)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1137)
	at org.testng.TestNG.runSuites(TestNG.java:1049)
	at org.testng.TestNG.run(TestNG.java:1017)
	at com.intellij.rt.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:66)
	at com.intellij.rt.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:109)

I mean, you still have to invoke gcm.createCacheLocally but just ignore the return value.

@pruivo
Copy link
Member

pruivo commented Sep 21, 2023

Reverted the changes in GlobalConfigurationManagerImpl and the test is passing with this change. Probably would be easier to read than using maps around.

diff --git a/core/src/main/java/org/infinispan/globalstate/impl/GlobalConfigurationStateListener.java b/core/src/main/java/org/infinispan/globalstate/impl/GlobalConfigurationStateListener.java
index a09f054563..97419b1e41 100644
--- a/core/src/main/java/org/infinispan/globalstate/impl/GlobalConfigurationStateListener.java
+++ b/core/src/main/java/org/infinispan/globalstate/impl/GlobalConfigurationStateListener.java
@@ -42,9 +42,13 @@ public CompletionStage<Void> handleCreate(CacheEntryCreatedEvent<ScopedState, Ca
       String name = event.getKey().getName();
       CacheState state = event.getValue();
 
-      return CACHE_SCOPE.equals(scope) ?
-            gcm.createCacheLocally(name, state) :
-            gcm.createTemplateLocally(name, state);
+      if (CACHE_SCOPE.equals(scope)) {
+         // zero capacity nodes have to wait for a non-zero capacity node to start the cache.
+         // prevent the cache creating to blocking the listener invocation
+         var f = gcm.createCacheLocally(name, state);
+         return isZeroNode() ? CompletableFutures.completedNull() : f;
+      }
+      return gcm.createTemplateLocally(name, state);
    }
 
    @CacheEntryModified
@@ -74,10 +78,15 @@ public CompletionStage<Void> handleRemove(CacheEntryRemovedEvent<ScopedState, Ca
       String name = event.getKey().getName();
       if (CACHE_SCOPE.equals(scope)) {
          CONTAINER.debugf("Stopping cache %s because it was removed from global state", name);
-         return gcm.removeCacheLocally(name);
+         var f = gcm.removeCacheLocally(name);
+         return isZeroNode() ? CompletableFutures.completedNull() : f;
       } else {
          CONTAINER.debugf("Removing template %s because it was removed from global state", name);
          return gcm.removeTemplateLocally(name);
       }
    }
+
+   private boolean isZeroNode() {
+      return gcm.cacheManager.getCacheManagerConfiguration().isZeroCapacityNode();
+   }
 }

@jabolina
Copy link
Member Author

+1. That's easier to manage than the CFs + Map. Let me update.

@jabolina
Copy link
Member Author

Updated with Pedro's suggestions. Thanks!

@ryanemerson ryanemerson self-requested a review September 22, 2023 11:30
Copy link
Contributor

@ryanemerson ryanemerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just waiting on another CI run as the last one was interrupted.

@jabolina
Copy link
Member Author

CI has quite a few failures, let me check them.

@jabolina
Copy link
Member Author

Updated. Needed to use SecurityActions.getGlobalComponentRegistry to retrieve the registry.

@ryanemerson ryanemerson merged commit e2dd9ec into infinispan:main Sep 27, 2023
3 of 4 checks passed
@ryanemerson
Copy link
Contributor

Thanks @jabolina

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
14.0.x Annotate a PR with this label if you want it to be backported to the 14.0.x branch
Projects
None yet
3 participants