
[14.0.x] ISPN-15357 Unable to enable rebalancing after cluster scale up #11683

Merged 2 commits into infinispan:14.0.x on Jan 25, 2024

Conversation

jabolina (Member)

Utilize the global state manager to persist the global rebalance status.
This allows a disabled rebalance to survive a complete cluster restart
and still avoid rebalancing. Previously, only a rolling upgrade preserved
that setting.
This PR has two sides. First, some terminology: by "stateful" we refer to
nodes whose persistent state holds information about the cache cluster
before the restart, and by "stateless" to nodes without any such
information. A sketch of how a node becomes stateful follows below.
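
For context, this is roughly how a node opts into persistent global state in embedded mode, making it "stateful" in the sense above; a minimal sketch assuming the standard global-state builder API, with an illustrative filesystem path:

```java
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class StatefulNode {
   public static void main(String[] args) throws Exception {
      GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
      // Enable global state so cluster metadata (and, with this change, the
      // global rebalancing-enabled flag) is written to disk and survives a
      // full cluster restart. The location below is illustrative.
      global.globalState()
            .enable()
            .persistentLocation("/var/lib/infinispan/state");
      try (DefaultCacheManager cacheManager = new DefaultCacheManager(global.build())) {
         // Caches started here recover their topology from the persisted
         // state on the next restart, instead of starting from scratch.
      }
   }
}
```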

Originally, a stateful cluster accepts only stateful joiners until the
old cluster is complete. Stateless nodes can join only *after* that, but
they have no way of knowing what stage the recovery is at. If they try
joining too early, they receive an exception, which could cause the node
to fail to start.

With these changes, a stateless node can issue a join request at any
point. It will *not* join the cluster immediately, as the cluster still
needs to recover, but it now receives a specific response, so it can send
the join request again once recovery is complete (see the sketch below).
This is automatic and shouldn't need any manual intervention.
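
A hypothetical sketch of that retry behavior; `CoordinatorClient`, `JoinStatus`, and the one-second delay are illustrative names and values, not Infinispan internals:

```java
import java.util.concurrent.TimeUnit;

final class RetryingJoiner {
   enum JoinStatus { JOINED, RECOVERY_IN_PROGRESS, REJECTED }

   interface CoordinatorClient {
      JoinStatus requestJoin(String nodeName);
   }

   static JoinStatus joinWhenRecovered(CoordinatorClient coordinator, String nodeName)
         throws InterruptedException {
      while (true) {
         JoinStatus status = coordinator.requestJoin(nodeName);
         if (status != JoinStatus.RECOVERY_IN_PROGRESS) {
            // Either JOINED, or REJECTED outright (e.g. a stateful joiner
            // meeting a stateless coordinator, which still fails fast).
            return status;
         }
         // The stateful members have not all rejoined yet; back off and
         // retry instead of surfacing an exception to the starting node,
         // which was the old behavior.
         TimeUnit.SECONDS.sleep(1);
      }
   }
}
```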

The tricky case is starting a stateless coordinator with a stateful
joiner. This most likely indicates a problem during initialization, or a
misconfiguration during an update. In such a case, the stateful node
trying to join will receive an exception and likely fail to start. This
usually points to an issue elsewhere, which the user would need to fix
manually.

If users are manually adding nodes, they need to make sure the new node
starts *after* all the previous nodes are up and running; a sketch of
that check follows below. Any concurrent cluster start, or concurrent
membership change (e.g., shutting down one node while adding another),
could trigger this behavior.
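
As an illustration of that guidance, a hypothetical helper (not Infinispan API) that an operator-style script could use to block until the current node sees the expected membership before starting the next one:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.infinispan.manager.EmbeddedCacheManager;

final class ClusterStartupGuard {
   // Poll the running manager's view until the expected cluster size is
   // reached, or give up after five minutes. The helper name and timeout
   // are assumptions for illustration.
   static void awaitClusterSize(EmbeddedCacheManager manager, int expectedSize)
         throws InterruptedException, TimeoutException {
      long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(5);
      while (manager.getMembers() == null || manager.getMembers().size() < expectedSize) {
         if (System.nanoTime() > deadline) {
            throw new TimeoutException("cluster never reached " + expectedSize + " members");
         }
         TimeUnit.MILLISECONDS.sleep(500);
      }
   }
}
```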
jabolina added this to the 14.0.22.Final milestone on Jan 16, 2024
jabolina (Member, Author)

Operator test failure seems unrelated.

jabolina (Member, Author)

I've seen the InitialClusterSize failure also on main. I'll investigate it, but we don't need to hold this one.

tristantarrant merged commit 628e112 into infinispan:14.0.x on Jan 25, 2024
2 of 4 checks passed
jabolina deleted the ISPN-15357-backport branch on January 25, 2024 at 13:28