
ISPN-14329 Availability of caches should be prevented until a cluster… #10479

Merged
merged 2 commits into infinispan:main from the ISPN-14329 branch on Mar 23, 2023

Conversation

@jabolina (Member) commented Nov 23, 2022

… is complete after "shutdown cluster"

https://issues.redhat.com/browse/ISPN-14329
https://issues.redhat.com/browse/ISPN-14414

After the cluster shutdown, the nodes must wait for the complete recovery before accepting commands. We keep the cache's registry as INITIALIZING until all nodes join back and a stable topology is restored. We do this if the cache has a non-shared store without purging on start.
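As an illustration of that gate (a minimal sketch with made-up names, not the code in this PR): the persistent UUIDs recorded in the global state before "shutdown cluster" are compared against the current membership, and the cache only leaves INITIALIZING once every previously known member has rejoined.

    import java.util.Set;
    import java.util.UUID;

    // Hypothetical helper tracking the members recorded before the cluster shutdown.
    final class RecoveryGate {
       private final Set<UUID> expectedMembers; // persistent UUIDs from the saved global state

       RecoveryGate(Set<UUID> expectedMembers) {
          this.expectedMembers = Set.copyOf(expectedMembers);
       }

       // The cache stays INITIALIZING until every previously known member has rejoined.
       boolean canBecomeAvailable(Set<UUID> currentMembers) {
          return currentMembers.containsAll(expectedMembers);
       }

       long missingCount(Set<UUID> currentMembers) {
          return expectedMembers.stream().filter(m -> !currentMembers.contains(m)).count();
       }
    }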

We log and return an exception that lists the current members. We can't determine the missing members' addresses for certain, as we only have their persistent UUIDs.

With the test we added, the log message is:

11:19:12,840 WARN  (jgroups-7,ThreeNodeGlobalStatePartialRestartTest-NodeF:[]) [o.i.t.ClusterCacheStatus] ISPN000681: Recovering cache testCache missing members, currently have [ThreeNodeGlobalStatePartialRestartTest-NodeF, ThreeNodeGlobalStatePartialRestartTest-NodeG] of a total of 3

We still need a way to send commands and force the cache to start, even if that means data loss, and to handle the inconsistent behavior from the reports. That's follow-up work.

But I also wanted to gather some reviews/comments on the approach here (and see how CI behaves). There is also the option of blocking the cache creation until the cluster is back or a manual intervention happens.

@wfink (Contributor) commented Dec 7, 2022

Blocking the clients is now what I would expect.

For a better user experience it would be nice to have the WARN message ISPN000681 not only at the coordinator but on each starting node.

Each client access adds an ERROR ISPN005003 with a full stack trace. As this is expected, I would lower it to WARN and remove the stack trace; this would prevent the log file from growing quickly at startup if there is a huge number of clients.

Access from the console forces an ISPN012005. Is it possible to return the state INITIALIZING so the console can display the cache state, and to lower the message to WARN/INFO without a stack trace, for the same reasons?

@wfink (Contributor) commented Dec 7, 2022

If the attribute is set, the cache is empty after 'shutdown cluster' once all the started nodes are back.
If it is set to false, we have the previous behavior: stale entries are added back if one node goes down and comes up again.

@jabolina (Member Author) commented Dec 7, 2022

Applied the suggestions to also log the warning message on the client after the join. Another caveat here is that a node is only aware of the membership at the time it joins; that is, the list of members does not update afterwards. This can make the exception message misleading.

I also included a small change in the REST API to handle the initializing case. This way the cache is listed as DEGRADED and the operations (I tested get/put) return the exception message about the missing members.

@wfink (Contributor) commented Dec 8, 2022

Server-side messages for client access (ISPN005003) still come with a stack trace, same for ISPN000682 on the client side.
The cluster/cache state DEGRADED might be misleading; also, you can set the cache back to AVAILABLE or enable state transfer (enabling state transfer will fail with an IllegalState!!).
I would prefer not being able to set state transfer, only AVAILABLE, and to show the cluster/cache state as INIT.

@wfink (Contributor) commented Dec 8, 2022

The WARN message ISPN000681 is now shown for all members during start 👍

@@ -328,7 +328,7 @@ private CacheInfo getCacheInfo(RestRequest request, EmbeddedCacheManager cacheMa
      if (ignoredCaches.contains(cacheName)) {
         cacheInfo.status = "IGNORED";
      } else {
-        if(cacheHealth != HealthStatus.FAILED) {
+        if(cacheHealth != HealthStatus.FAILED && cacheHealth != HealthStatus.DEGRADED) {
Member:

This looks out of place. Is this a different bug?

Member Author:

I ended up adding a new status. The problem here is that the cache was not running, which caused an exception to be thrown.

Comment on lines 998 to 1000
if (!started) return false;
ComponentStatus cs = cacheFuture.join().getStatus();
return cs == ComponentStatus.RUNNING || cs == ComponentStatus.INITIALIZING;
@jabolina (Member Author) commented Dec 12, 2022:

I'll revert this and open another JIRA to follow up; it slipped into my commit and was supposed to be for a local test only. The problem here is that when retrieving the cache for the REST/CLI API, it verifies that the cache is running before applying an operation. Since in our case the cache is still initializing, this causes an exception, returning "Unexpected error retrieving data. ISPN012010: Cache with name 'manual' not found amongst the configured caches" to the client, which is odd, since we just saw the cache in a list.

Member:

That is fine, I just want to make sure it has its own JIRA to detail what is going on.

@jabolina (Member Author) commented:

I applied all the suggestions/changes now. An additional point of attention is that I added another health status to identify the recovering stage, meaning the console needs to be updated to reflect this, too.

I added another exception instead of using the AvailabilityException so it was easier to log a warning.

@tristantarrant (Member) commented:

LGTM

@jabolina (Member Author) commented Dec 14, 2022

I'll squash the commits.

@wburns (Member) left a review comment:

Looks good to me. However, are we not missing some way for an administrator to force the cache to RUNNING state if they don't have all the members?

@pruivo (Member) left a review comment:

LGTM

@jabolina (Member Author) commented:

Applied the suggestions, thanks.

@jabolina (Member Author) commented:

> Looks good to me. However, are we not missing some way for an administrator to force the cache to RUNNING state if they don't have all the members?

That is the next step: https://issues.redhat.com/browse/ISPN-14418

@jabolina (Member Author) commented:

I am going to add a couple of things:

  • The coordinator will have an up-to-date list of members when throwing an exception. Other members will have a possibly outdated list (as of right now);
  • If the cluster restores after the shutdown, we do not purge the store, even if purge=true.

@wfink (Contributor) commented Dec 21, 2022

The coordinator now sends the correct members while initializing.
Purge does not happen if purge=true.

Purge works as expected if 'shutdown cluster' is not used and all nodes are stopped and started in sequence.

The PR seems to be working fine for me.

@jabolina (Member Author) commented Jan 6, 2023

@wfink I've integrated the keyset fixes into the branch; let me know if it is working as expected.

Also, for ignoring purge after the start, I've updated only the SIFS and SingleFile stores; am I missing something here?

@wburns (Member) left a review comment:

@jabolina let me know what you think about these suggestions.

@@ -212,57 +212,47 @@ public CompletionStage<Void> start(InitializationContext ctx) {
      startIndex();
      final AtomicLong maxSeqId = new AtomicLong(0);

      if (!configuration.purgeOnStartup()) {
Member:

TBH, I am not liking this approach to how purgeOnStartup is handled. I think there is a lot of value in implementing it directly in the store; otherwise we have to start the store and then clear it, which can be very time-consuming (some stores still require this, so keeping the clear afterwards is fine for them). I am thinking it might be better to change the configuration provided to the store so that purgeOnStartup is set in certain circumstances. This way each store is still responsible for purging and can do it very efficiently. For example, SIFS can just delete the data and index directories and doesn't need to load them, and the subsequent clear becomes a no-op.
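A rough sketch of that suggestion, using invented types rather than the real store SPI: the cache manager computes the effective purge flag and hands the store a configuration view with the flag already rewritten, so each store can purge in its own most efficient way (e.g. SIFS deleting its data/index directories) instead of being started and then cleared.

    // Hypothetical configuration view; the real store configuration interfaces differ.
    interface StoreConfigView {
       boolean purgeOnStartup();
    }

    final class EffectivePurgeConfig implements StoreConfigView {
       private final StoreConfigView userConfig;
       private final boolean recoveringFromClusterShutdown;

       EffectivePurgeConfig(StoreConfigView userConfig, boolean recoveringFromClusterShutdown) {
          this.userConfig = userConfig;
          this.recoveringFromClusterShutdown = recoveringFromClusterShutdown;
       }

       @Override
       public boolean purgeOnStartup() {
          // Keep the user's setting, except while recovering from "shutdown cluster",
          // where the persisted data must survive until the stable topology is restored.
          return userConfig.purgeOnStartup() && !recoveringFromClusterShutdown;
       }
    }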

* Caches that do not need state transfer or are private do not need to delay the start.
*/
@Override
protected CompletionStage<Void> delayStart() {
Member:

While this is okay... I am not liking where this code resides. This puts too much logic in the component registry itself. The registry should be about registering components, not knowing their inner details.

I wonder if it is possible to move this logic to an appropriate interceptor or component? For example, we could add the clear logic to the persistence manager and the stable-topology handling directly to the LocalTopologyManager/StateTransferManager. Then we would need to disconnect the start method invocation from the actual interceptor initialization and leave the state as INITIALIZING until it is done. I think this would keep the logic more closely aligned with where it belongs.
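A minimal sketch of that shape, with invented names (DelayedStart, stableTopologyRestored) rather than the real LocalTopologyManager API: the start call returns quickly, and the cache only moves past INITIALIZING once a future tied to the stable topology completes.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.CompletionStage;

    final class DelayedStart {
       private final CompletableFuture<Void> stableTopology = new CompletableFuture<>();

       // Called by the topology code once all expected members have rejoined.
       void stableTopologyRestored() {
          stableTopology.complete(null);
       }

       // Caches that do not need state transfer (or are private) do not delay their start.
       CompletionStage<Void> delayStart(boolean needsStateTransfer) {
          return needsStateTransfer ? stableTopology : CompletableFuture.completedFuture(null);
       }
    }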

@jabolina (Member Author) commented Jan 6, 2023

@wburns I'll apply the suggestions. The only thing that worries me is spreading the restoration requirements across different places. Right now everything is under the component registry, although I agree that this overloads it too much. I'll try an approach using a new component dedicated to that; if that doesn't work out, I'll put the logic in each individual component.

@jabolina (Member Author) commented Jan 9, 2023

Creating a new component dedicated only to restoration wasn't nice, so I added the logic to the LocalTopologyManager/LocalCacheStatus. For persistence, I made only a small change and could keep the purge in the PersistenceManager itself, but it still calls the clear method after the topology is restored; that is, it constructs the whole store and then purges it. I'll look into that now while CI runs.

@jabolina (Member Author) commented Mar 9, 2023

Pushed a fix, let's see. The failures were all due to blocking calls. I changed the call from isRunning to cacheExists.

@infinispanrelease commented:

Image pushed for Jenkins build #22:

quay.io/infinispan-test/server:PR-10479

@infinispanrelease commented:

Image pushed for Jenkins build #23:

quay.io/infinispan-test/server:PR-10479

@wburns wburns removed the "Image Required" label (set this label in order for a server image to be built with the PR changes and pushed to quay.io) on Mar 10, 2023
@wburns (Member) commented Mar 10, 2023

> Pushed a fix, let's see. The failures were all due to blocking calls. I changed the call from isRunning to cacheExists.

Can you expand on this? I don't think we can just change the method we are testing against here. If anything, we may want to do this for both methods, no?

Actually, I don't think we can change this at all, as cacheExists doesn't verify whether the cache is actually running.

@wburns (Member) left a review comment:

I think we may need to discuss the blocking issue you ran into before. On a non-blocking thread, the getCache method MUST always be preceded by an isRunning check to ensure it won't block. We cannot change this behavior.
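For reference, the calling pattern being described looks roughly like this (a sketch of the invariant, not a change proposed in this PR; the helper name is made up):

    import org.infinispan.Cache;
    import org.infinispan.manager.EmbeddedCacheManager;

    final class SafeCacheAccess {
       // On a non-blocking thread, only call getCache() after isRunning() confirms the cache
       // has started, because getCache() may block while the cache is still starting.
       static Cache<?, ?> getIfRunning(EmbeddedCacheManager cacheManager, String cacheName) {
          if (!cacheManager.isRunning(cacheName)) {
             return null; // caller reports the cache as unavailable instead of blocking
          }
          return cacheManager.getCache(cacheName);
       }
    }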

@jabolina jabolina force-pushed the ISPN-14329 branch 4 times, most recently from bda39d4 to 671783a on March 13, 2023 21:15
@wburns (Member) left a review comment:

Just minor things left, otherwise LGTM.

@@ -987,7 +987,7 @@ public CompletionStage<Long> sizePublisher(IntSet segments, InvocationContext ct
    */
   private <E> Flowable<E> getValuesFlowable(BiFunction<InnerPublisherSubscription.InnerPublisherSubscriptionBuilder<K, I, R>, Map.Entry<Address, IntSet>, Publisher<E>> subToFlowableFunction) {
      return Flowable.defer(() -> {
-        if (!componentRegistry.getStatus().allowInvocations()) {
+        if (!componentRegistry.getStatus().allowInvocations() && !componentRegistry.getStatus().startingUp()) {
Member:

So I assume the topology exception is thrown later in this case?

Member Author:

Yep, once it reaches the InvocationContextInterceptor. I am unsure whether this is one of the places where we could fail earlier and avoid creating/sending all the requests.

@Message(value = "Recovering cache '%s' but there are missing members, known members %s of a total of %s", id = 689)
void recoverFromStateMissingMembers(String cacheName, List<Address> members, int total);

MissingMembersException recoverFromStateMissingMembers(String cacheName, List<Address> members, String total);
Member:

This needs a @Message on it, AFAIK.

Member Author:

This is a bit of a hack, but I can't find it in the docs now :(
Since this has the same name as the method above, it inherits the same message and code. I did this so that the warning and the exception have the same code. It is hacky because in one the total is an int and in the other it is a String.
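For anyone reading later, the pattern being discussed looks roughly like this (an illustrative interface, not the real Infinispan Log class; the project code and exception type are placeholders):

    import java.util.List;
    import org.jboss.logging.Logger;
    import org.jboss.logging.annotations.LogMessage;
    import org.jboss.logging.annotations.Message;
    import org.jboss.logging.annotations.MessageLogger;

    @MessageLogger(projectCode = "SKETCH")
    interface RecoveryLog {
       @LogMessage(level = Logger.Level.WARN)
       @Message(value = "Recovering cache '%s' but there are missing members, known members %s of a total of %s", id = 689)
       void recoverFromStateMissingMembers(String cacheName, List<String> members, int total);

       // No @Message here: because the method name matches the one above, the generated
       // logger reuses that message and id, so the warning and the exception share one code.
       IllegalStateException recoverFromStateMissingMembers(String cacheName, List<String> members, String total);
    }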

@jabolina (Member Author) commented Mar 22, 2023

Added another commit for https://issues.redhat.com/browse/ISPN-14414. With this, we can return caches in the INITIALIZING state to the console. Note: as we are returning a new value, the console needs to be updated to reflect this new status.

… is complete after "shutdown cluster"

After the cluster shutdown, the nodes must wait for the complete
recovery before accepting commands. We keep the cache's registry as
`INITIALIZING` until all nodes join back and a stable topology is
restored.

We log and return an exception that lists the current members. We can't
determine the missing members' addresses for certain, as we only have
their persistent UUIDs. The coordinator throws the exception with the
current member list.

Created a dedicated exception to handle missing members, avoiding logging the stack trace.

The downside is that SIFS and the single-file store now build the store
from the previous file even when purge on startup is enabled, and
possibly clear all the data afterwards. This could be improved further
in follow-up PRs.
@wburns (Member) commented Mar 22, 2023

Changes look fine, just waiting to make sure the CI results didn't change.

@wburns wburns merged commit 2bd7b00 into infinispan:main Mar 23, 2023
1 check failed
@wburns (Member) commented Mar 23, 2023

Integrated into main, thanks @jabolina !
