
[14.0.x] ISPN-15357 Unable to enable rebalancing after cluster scale up #11683

Merged 2 commits into infinispan:14.0.x on Jan 25, 2024

Conversation

jabolina (Member)

Utilize the global state manager to persist the global rebalance status.
This allows a disabled rebalance to survive a complete cluster restart
and still avoid rebalancing. Previously, only a rolling upgrade preserved
that setting.
This PR has two sides. First, some terminology: by "stateful" we refer to
nodes whose persistent state holds information about the cache cluster
before the restart, and by "stateless" to nodes without any such
information. A sketch of how a node becomes stateful follows below.
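
For context, this is roughly how a node opts into persistent global state in embedded mode, making it "stateful" in the sense above; a minimal sketch assuming the standard global-state builder API, with an illustrative filesystem path:

```java
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class StatefulNode {
   public static void main(String[] args) throws Exception {
      GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
      // Enable global state so cluster metadata (and, with this change, the
      // global rebalancing-enabled flag) is written to disk and survives a
      // full cluster restart. The location below is illustrative.
      global.globalState()
            .enable()
            .persistentLocation("/var/lib/infinispan/state");
      try (DefaultCacheManager cacheManager = new DefaultCacheManager(global.build())) {
         // Caches started here recover their topology from the persisted
         // state on the next restart, instead of starting from scratch.
      }
   }
}
```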

Originally, a stateful cluster accepts only stateful joiners until the
old cluster is complete. Stateless nodes can join only *after* that, but
they have no way of knowing what stage the recovery is at. If they try
joining too early, they receive an exception, which could cause the node
to fail to start.

With these changes, a stateless node can issue a join request at any
point. It will *not* join the cluster immediately, as the cluster still
needs to recover, but it now receives a specific response, so it can send
the join request again once recovery is complete (see the sketch below).
This is automatic and shouldn't need any manual intervention.
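
A hypothetical sketch of that retry behavior; `CoordinatorClient`, `JoinStatus`, and the one-second delay are illustrative names and values, not Infinispan internals:

```java
import java.util.concurrent.TimeUnit;

final class RetryingJoiner {
   enum JoinStatus { JOINED, RECOVERY_IN_PROGRESS, REJECTED }

   interface CoordinatorClient {
      JoinStatus requestJoin(String nodeName);
   }

   static JoinStatus joinWhenRecovered(CoordinatorClient coordinator, String nodeName)
         throws InterruptedException {
      while (true) {
         JoinStatus status = coordinator.requestJoin(nodeName);
         if (status != JoinStatus.RECOVERY_IN_PROGRESS) {
            // Either JOINED, or REJECTED outright (e.g. a stateful joiner
            // meeting a stateless coordinator, which still fails fast).
            return status;
         }
         // The stateful members have not all rejoined yet; back off and
         // retry instead of surfacing an exception to the starting node,
         // which was the old behavior.
         TimeUnit.SECONDS.sleep(1);
      }
   }
}
```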

The tricky case is starting a stateless coordinator with a stateful
joiner. This most likely indicates a problem during initialization, or a
misconfiguration during an update. In such a case, the stateful node
trying to join will receive an exception and likely fail to start. This
usually points to an issue elsewhere, which the user would need to fix
manually.

If users are manually adding nodes, they need to make sure the new node
starts *after* all the previous nodes are up and running; a sketch of
that check follows below. Any concurrent cluster start, or concurrent
membership change (e.g., shutting down one node while adding another),
could trigger this behavior.
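
As an illustration of that guidance, a hypothetical helper (not Infinispan API) that an operator-style script could use to block until the current node sees the expected membership before starting the next one:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.infinispan.manager.EmbeddedCacheManager;

final class ClusterStartupGuard {
   // Poll the running manager's view until the expected cluster size is
   // reached, or give up after five minutes. The helper name and timeout
   // are assumptions for illustration.
   static void awaitClusterSize(EmbeddedCacheManager manager, int expectedSize)
         throws InterruptedException, TimeoutException {
      long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(5);
      while (manager.getMembers() == null || manager.getMembers().size() < expectedSize) {
         if (System.nanoTime() > deadline) {
            throw new TimeoutException("cluster never reached " + expectedSize + " members");
         }
         TimeUnit.MILLISECONDS.sleep(500);
      }
   }
}
```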
jabolina added this to the 14.0.22.Final milestone on Jan 16, 2024
jabolina (Member, Author)

Operator test failure seems unrelated.

jabolina (Member, Author)

I've seen the InitialClusterSize failure also on main. I'll investigate it, but we don't need to hold this one.

tristantarrant merged commit 628e112 into infinispan:14.0.x on Jan 25, 2024
2 of 4 checks passed
jabolina deleted the ISPN-15357-backport branch on January 25, 2024 at 13:28