ISPN-14329 Availability of caches should be prevented until a cluster… #10479
Conversation
Blocking the clients is now what I would expect. For a better user experience it would be nice to have the WARN message ISPN000681 not only at the coordinator but on each starting node. Each client access adds an ERROR ISPN005003 with a full stack trace; as this is expected, I would lower it to WARN and remove the stack trace, which would prevent a fast-growing logfile at startup if there is a huge number of clients ... Access with the console forces an ISPN012005, is it possible to return the state?
If the is set, the cache is empty after 'shutdown cluster' and the started nodes are all back.
Applied the suggestions to also log the warning message on the client after the join. Another caveat here is that a node only becomes aware of another node when it joins; that is, the list of members does not update, which can make the exception message misleading. I also included a small change in the REST API to handle the initializing case. This way the cache is listed as DEGRADED and the operations (I tested get/put) return the exception message about missing members.
Server-side messages for client access (ISPN005003) still include the stack trace, same for ISPN000682 on the client side.
The WARN message ISPN000681 is now shown for all members during start 👍 |
core/src/main/java/org/infinispan/factories/AbstractComponentRegistry.java
core/src/main/java/org/infinispan/factories/ComponentRegistry.java
core/src/main/java/org/infinispan/topology/LocalTopologyManagerImpl.java
@@ -328,7 +328,7 @@ private CacheInfo getCacheInfo(RestRequest request, EmbeddedCacheManager cacheMa
       if (ignoredCaches.contains(cacheName)) {
          cacheInfo.status = "IGNORED";
       } else {
-         if(cacheHealth != HealthStatus.FAILED) {
+         if(cacheHealth != HealthStatus.FAILED && cacheHealth != HealthStatus.DEGRADED) {
This looks out of place. Is this a different bug?
I ended up adding a new status. The problem here is that the cache was not running, which throws an exception.
if (!started) return false;
ComponentStatus cs = cacheFuture.join().getStatus();
return cs == ComponentStatus.RUNNING || cs == ComponentStatus.INITIALIZING;
I'll revert this and open another JIRA to follow up; this slipped into my commit and was supposed to be for a local test only. The problem here is that when getting the cache for the REST/CLI API, it verifies that the cache is running before applying an operation. Since in our case the cache is still initializing, this causes an exception, returning Unexpected error retrieving data. "ISPN012010: Cache with name 'manual' not found amongst the configured caches"
to the client, which is odd, since we just saw the cache in a list.
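A minimal sketch (hypothetical names, not Infinispan's actual API) of why a recovering cache surfaces as "not found": the lookup only treats RUNNING caches as present, so a configured cache that is still INITIALIZING falls through to the not-found message.

```java
import java.util.Map;

public class CacheLookup {
   enum ComponentStatus { INITIALIZING, RUNNING }

   // Hypothetical guard mirroring the REST/CLI path described above: a cache
   // that exists but is still INITIALIZING is reported as missing, even though
   // it was just listed to the client.
   static String lookup(Map<String, ComponentStatus> caches, String name) {
      ComponentStatus cs = caches.get(name);
      if (cs == null || cs != ComponentStatus.RUNNING) {
         return "ISPN012010: Cache with name '" + name + "' not found amongst the configured caches";
      }
      return "OK";
   }
}
```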
That is fine, I just want to make sure it has its own JIRA to detail what is going on.
Created https://issues.redhat.com/browse/ISPN-14414 to handle this.
I applied all the suggestions/changes now. An additional point of attention is that I added another health status to identify the recovering stage, meaning the console needs to be updated to reflect this, too. I added another exception instead of using the
LGTM
I'll squash the commits. |
Looks good to me. However, are we not missing some way for an administrator to force the cache to RUNNING state if they don't have all the members?
core/src/main/java/org/infinispan/factories/AbstractComponentRegistry.java
LGTM
core/src/main/java/org/infinispan/factories/ComponentRegistry.java
core/src/test/java/org/infinispan/globalstate/ThreeNodeGlobalStatePartialRestartTest.java
Applied the suggestions, thanks.
That's next: https://issues.redhat.com/browse/ISPN-14418
I am going to add a couple of things:
Coordinator now sends the correct members during initializing. Purge works as expected if 'shutdown cluster' is not used but all nodes are stopped and started in sequence. The PR seems to be working fine to me.
@wfink I've integrated the keyset fixes into the branch, let me know if it is working as expected. Also, for ignoring purge after the start, I've updated only the SIFS and SingleFile stores; am I missing something here?
@jabolina let me know what you think about these suggestions.
@@ -212,57 +212,47 @@ public CompletionStage<Void> start(InitializationContext ctx) {
      startIndex();
      final AtomicLong maxSeqId = new AtomicLong(0);

      if (!configuration.purgeOnStartup()) {
TBH, I am not liking this approach to how purgeOnStartup is done. I think there is a lot of value in implementing it directly in the store; otherwise we have to start the store and then clear it, which can be very time consuming (some stores still require this, so having the clear afterwards is still fine). I am thinking it might be better instead to change the configuration provided to the store to have purgeOnStartup in certain circumstances. This way each store is still responsible for purging and can do it in a very efficient manner. For example, SIFS can just delete the data and index directories and doesn't need to load them, and the subsequent clear will be a no-op.
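A toy in-memory model of the trade-off described above (hypothetical, not the actual store SPI): when the store itself honors purgeOnStartup, the load phase sees nothing and a later clear() is a no-op; otherwise every persisted entry is loaded only to be cleared afterwards.

```java
import java.util.ArrayList;
import java.util.List;

public class PurgeSketch {
   // Hypothetical model of store start-up. With purgeOnStartup, persisted
   // entries are dropped before the load phase (e.g. SIFS could just delete
   // its data and index directories), so nothing is read back. Without it,
   // everything is loaded first and a clear() may run afterwards.
   static List<String> start(List<String> persisted, boolean purgeOnStartup) {
      if (purgeOnStartup) {
         persisted.clear();               // purge up front
         return List.of();                // load phase sees an empty store
      }
      return new ArrayList<>(persisted);  // load everything, possibly clear later
   }
}
```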
 * Caches that do not need state transfer or are private do not need to delay the start.
 */
@Override
protected CompletionStage<Void> delayStart() {
While this is okay... I am not liking where this code resides. It puts too much logic in the component registry itself; the registry should be about registering components, not knowing their inner details.
I wonder whether it is possible to move this logic to an appropriate interceptor or component? For example, we could add the clear logic to the persistence manager and the stable-topology handling directly to the LocalTopologyManager/StateTransferManager. Then we would need to disconnect the start method invocation from the actual interceptor initialization and leave the state as INITIALIZING until it is done. I think this would keep the logic more closely aligned with where it should be.
@wburns I'll apply the suggestions. The only thing that worries me is spreading the restoration requirements across different places. Right now everything is under the component registry, although I agree this overloads it too much. I'll try an approach using a new component only for that; if that doesn't work out, I'll put the logic on each individual component.
Creating a new component dedicated only to restoration wasn't nice, so I added the logic to the LocalTopologyManager/LocalCacheStatus. For persistence I made only a small change and could keep the purge in the PersistenceManager itself, but it still calls the clear method after the topology is restored; that is, it constructs the whole store and then purges it. I'll look into that now while CI runs.
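A minimal model of the restoration flow being discussed (hypothetical names, not the real LocalTopologyManager API): the cache stays INITIALIZING until every expected member from the persisted state has joined back, then the delayed start future completes and the status flips to RUNNING.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

public class RestorationSketch {
   enum Status { INITIALIZING, RUNNING }

   final int expectedMembers;                 // member count from the persisted state
   final Set<String> joined = new HashSet<>();
   Status status = Status.INITIALIZING;
   final CompletableFuture<Void> delayedStart = new CompletableFuture<>();

   RestorationSketch(int expectedMembers) {
      this.expectedMembers = expectedMembers;
   }

   // Called as each node rejoins; once all expected members are back, the
   // stable topology is considered restored and the start can proceed.
   void memberJoined(String node) {
      joined.add(node);
      if (joined.size() >= expectedMembers) {
         status = Status.RUNNING;
         delayedStart.complete(null);
      }
   }
}
```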
Pushed a fix, let's see. The failures were all due to blocking calls. I changed the call from
Image pushed for Jenkins build #22:
Image pushed for Jenkins build #23:
Can you expand on this? I don't think we can just change the method we are testing against here. If anything, we may want to do this for both methods, no? Actually, I don't think we can change this at all, as cacheExists doesn't verify whether the cache is actually running.
I think we may need to discuss the blocking issue you ran into before. The getCache method on a non-blocking thread MUST always be preceded by an isRunning check to ensure it won't block. We cannot change this behavior.
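A sketch of why that ordering matters (hypothetical model, not the real cache manager): if the cache start is represented by a future, calling getCache without an isRunning check can block a non-blocking thread on join() while the start is still in flight.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class NonBlockingGuard {
   // Hypothetical guard: on a non-blocking thread, only call getCache once
   // isRunning reports true, because getCache joins the start future and
   // would otherwise block the event loop until the cache finishes starting.
   static boolean isRunning(Map<String, CompletableFuture<String>> caches, String name) {
      CompletableFuture<String> start = caches.get(name);
      return start != null && start.isDone();
   }

   static String getCache(Map<String, CompletableFuture<String>> caches, String name) {
      return caches.get(name).join(); // blocks if the start future is not done yet
   }
}
```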
core/src/main/java/org/infinispan/manager/DefaultCacheManager.java
core/src/test/java/org/infinispan/globalstate/ThreeNodeGlobalStatePartialRestartTest.java
server/hotrod/src/main/java/org/infinispan/server/hotrod/BaseRequestProcessor.java
server/rest/src/main/java/org/infinispan/rest/RestRequestHandler.java
server/rest/src/main/java/org/infinispan/rest/cachemanager/RestCacheManager.java
server/rest/src/main/java/org/infinispan/rest/resources/CacheResourceV2.java
Force-pushed from bda39d4 to 671783a
server/hotrod/src/main/java/org/infinispan/server/hotrod/logging/Log.java
server/rest/src/main/java/org/infinispan/rest/cachemanager/RestCacheManager.java
server/rest/src/main/java/org/infinispan/rest/resources/CacheResourceV2.java
Just minor things left, otherwise LGTM.
@@ -987,7 +987,7 @@ public CompletionStage<Long> sizePublisher(IntSet segments, InvocationContext ct
    */
   private <E> Flowable<E> getValuesFlowable(BiFunction<InnerPublisherSubscription.InnerPublisherSubscriptionBuilder<K, I, R>, Map.Entry<Address, IntSet>, Publisher<E>> subToFlowableFunction) {
      return Flowable.defer(() -> {
-        if (!componentRegistry.getStatus().allowInvocations()) {
+        if (!componentRegistry.getStatus().allowInvocations() && !componentRegistry.getStatus().startingUp()) {
So I assume the topology exception is thrown later in this case?
Yep, once it reaches the InvocationContextInterceptor. I am unsure whether this is one of the places we could fail earlier and avoid creating/sending all the requests.
@Message(value = "Recovering cache '%s' but there are missing members, known members %s of a total of %s", id = 689)
void recoverFromStateMissingMembers(String cacheName, List<Address> members, int total);

MissingMembersException recoverFromStateMissingMembers(String cacheName, List<Address> members, String total);
This needs a @Message on it afaik.
This is a bit of a hack, but I can't find it in the docs now :(
Since this has the same signature as the method above, it will inherit the same message and code. I did this so that the warning and the exception share the same code. It is hacky because in one the total is an int and in the other it is a String.
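Independent of the annotation-inheritance trick, the intent is simply that the WARN log and the thrown exception carry the same ISPN code and text. A plain-Java sketch of that invariant (hypothetical helper, not the generated JBoss Logging code; both `members` and `total` are formatted, so their Java types can differ between the two variants):

```java
import java.util.List;

public class MissingMembersMessages {
   static final int CODE = 689;

   // Hypothetical shared formatter: the log variant passes total as an int,
   // the exception variant as a String, but both render the same message
   // with the same ISPN000689 code.
   static String recoverFromStateMissingMembers(String cacheName, Object members, Object total) {
      return String.format(
            "ISPN%06d: Recovering cache '%s' but there are missing members, known members %s of a total of %s",
            CODE, cacheName, members, total);
   }
}
```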
server/hotrod/src/main/java/org/infinispan/server/hotrod/BaseRequestProcessor.java
server/hotrod/src/main/java/org/infinispan/server/hotrod/logging/Log.java
Added another commit for https://issues.redhat.com/browse/ISPN-14414. With this, we can return the caches on the state
server/rest/src/main/java/org/infinispan/rest/resources/ContainerResource.java
… is complete after "shutdown cluster" After the cluster shutdown, the nodes must wait for the complete recovery before accepting commands. We keep the cache's registry as `INITIALIZING` until all nodes join back and a stable topology is restored. We log and return an exception with the current members. We can't determine the missing members' addresses for sure, as we only have the persistent UUIDs. The coordinator throws the exception with the current member list. Created a dedicated exception to handle missing members and avoid logging the trace. The downside is that the SIFS and single-file stores now create the store from the previous files, even when purge on startup is enabled, possibly clearing all the data afterwards. This could be improved further in follow-up PRs.
Changes look fine, just waiting to make sure the CI results didn't change.
Integrated into main, thanks @jabolina !
… is complete after "shutdown cluster"
https://issues.redhat.com/browse/ISPN-14329
https://issues.redhat.com/browse/ISPN-14414
After the cluster shutdown, the nodes must wait for the complete recovery before accepting commands. We keep the cache's registry as INITIALIZING until all nodes join back and a stable topology is restored. We do this if the cache has a non-shared store without purging on start. We log and return an exception with the current members. We can't determine the missing members' addresses for sure, as we only have the persistent UUIDs.
With the test we added, the log message is:
We still need a way to send commands and force the cache to start, even if that means data loss, and to handle the inconsistent behavior from the reports. That's follow-up work.
But I also wanted to gather some reviews/comments on the approach here (and see how CI behaves). There is also the option of blocking the cache creation until the cluster is back or a manual intervention occurs.