CP Subsystem/Raft: Instability when restarting members #24897
Comments
CP membership, unlike Hazelcast cluster membership, is not dynamic, as discussed in the CP Subsystem management documentation.
So that's the expected behaviour of the CP subsystem. In the context of your test, adding an explicit call to promote the restarted member back to the CP subsystem should restore its CP membership.
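For reference, a minimal sketch of such an explicit promotion (assuming Hazelcast 5.x and the public CPSubsystemManagementService API; this is not code from the linked test):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class PromoteRestartedMember {
    public static void main(String[] args) {
        // Start (or restart) a member; it joins the Hazelcast cluster but not the CP subsystem.
        HazelcastInstance restarted = Hazelcast.newHazelcastInstance();

        // Explicitly promote it back to a CP member. This blocks until the promotion
        // completes and requires the CP subsystem to still have a majority available.
        restarted.getCPSubsystem()
                 .getCPSubsystemManagementService()
                 .promoteToCPMember()
                 .toCompletableFuture()
                 .join();
    }
}
```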
Hi @lprimak, see hazelcast/hazelcast/src/test/java/com/hazelcast/cp/internal/CPMemberAddRemoveTest.java, lines 397 to 414 (commit 7c3894b).
Thank you!
I have created (with great difficulty) code that works correctly to keep CP going while restarting nodes.
In general, the way I am testing this is by running the CacheTester from https://github.com/flowlogix/hazelcast-issues/, an interactive tool that exercises locking and maps (and Hazelcast in general).
Apparently, I am not the only one who is having this issue. Hopefully the test reproducers can shed some more light on this. |
This behaviour is by design. We shouldn't allow the restarted member to rejoin automatically, because the cluster thinks that the old member is still present in the cluster but has temporarily become unavailable.
We have an option for that: if the missing member has not rejoined the cluster after 4 hours, it will be auto-removed and replaced by an additional member.
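For reference, that window corresponds to the missing CP member auto-removal setting; a minimal configuration sketch (the 600-second value and member count are just examples):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.cp.CPSubsystemConfig;
import com.hazelcast.core.Hazelcast;

public class CpAutoRemovalConfig {
    public static void main(String[] args) {
        Config config = new Config();
        CPSubsystemConfig cpConfig = config.getCPSubsystemConfig();
        cpConfig.setCPMemberCount(3);
        // Default is 4 hours (14400 seconds). A missing CP member that has not rejoined
        // within this window is auto-removed and replaced if a substitute member exists.
        cpConfig.setMissingCPMemberAutoRemovalSeconds(600);
        Hazelcast.newHazelcastInstance(config);
    }
}
```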
Yes, we have a task in our backlog to align k8s pod auto-restarts and scaling up/down with CP member removal and promotion.
Thanks for your feedback @arodionov
My current CP use case is only for CP FencedLock. Therefore, the persistence feature does not bring any value.
The above is incorrect, because nodes that shut down normally do not rejoin the CP cluster upon restart either.
There is an issue with this. See the description (#24897 (comment)) - bullet point 2
Can you link to the issue and/or PR so I can test it out? The main critical issue is still that the cluster becomes unusable and needs to be restarted if CP goes down to one member (see description, bullet point 1; reproducer tests in #24903).
Yes, only in the case of a so-called graceful shutdown will the CP member be automatically removed from the CP subsystem, and in the logs there should then be 2 CP members:
But if a CP member has been terminated or has stopped responding, we can't distinguish whether it is temporarily unavailable (due to network problems) or has crashed. In that case the logs will only contain information about the cluster status, and the CP subsystem will consider that this CP member is temporarily unavailable:
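To make the distinction concrete, a minimal sketch of the two stop paths (hz is an assumed HazelcastInstance; standard lifecycle API):

```java
static void stopMember(HazelcastInstance hz, boolean graceful) {
    if (graceful) {
        // Graceful shutdown: the member leaves the cluster cleanly and is automatically
        // removed from the CP subsystem, so the CP member list shrinks as described above.
        hz.shutdown();
    } else {
        // Termination (equivalent to a crash from the cluster's point of view): the CP
        // subsystem cannot distinguish this from a temporary network problem and keeps
        // the member listed as temporarily unavailable.
        hz.getLifecycleService().terminate();
    }
}
```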
Yes, it will. CP Persistence only persists the CP member's state in the form of a Raft log (meta-information about CP members, FencedLock owners, Raft sessions, etc.).
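For completeness, a minimal sketch of enabling CP Persistence (the base directory is just an example path):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;

import java.io.File;

public class CpPersistenceConfig {
    public static void main(String[] args) {
        Config config = new Config();
        config.getCPSubsystemConfig()
              .setCPMemberCount(3)
              .setPersistenceEnabled(true)                          // persist Raft logs and CP metadata to disk
              .setBaseDir(new File("/var/lib/hazelcast/cp-data"));  // example path, adjust as needed
        Hazelcast.newHazelcastInstance(config);
    }
}
```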
This is also by design - if a cluster loses the majority of its members, it becomes blocked and must be recovered manually: https://docs.hazelcast.com/hazelcast/5.3/cp-subsystem/management#handling-a-lost-majority
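The documented manual recovery essentially amounts to resetting the CP subsystem from a surviving member once enough Hazelcast members are running again; a hedged sketch (destructive: all CP data such as locks and sessions is wiped):

```java
// survivingMember is an assumed HazelcastInstance on a member that is still running.
survivingMember.getCPSubsystem()
               .getCPSubsystemManagementService()
               .reset()                   // wipes CP state and re-bootstraps the CP subsystem
               .toCompletableFuture()
               .join();
```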
Just tested this scenario, and there are no "majority lost" or "availability decreased" events delivered in the case of a normal shutdown of the two other members, just a "member removed" message. I think you are referring to the scenario where the two other members crash. The scenario I am talking about is a normal shutdown of members, not a crash. This should not be any different from the "initial state" of CP, and the subsystem should be able to accept new members. Member removed by proper shutdown (one node left):
However, the CP subsystem is unusable and still tries to find the now-shut-down member:
and the loop continues indefinitely. Maybe it has something to do with this?
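For reference, a sketch of how such events would be observed via the public CP group availability listener API (hz is an assumed HazelcastInstance):

```java
import com.hazelcast.cp.event.CPGroupAvailabilityEvent;
import com.hazelcast.cp.event.CPGroupAvailabilityListener;

hz.getCPSubsystem().addGroupAvailabilityListener(new CPGroupAvailabilityListener() {
    @Override
    public void availabilityDecreased(CPGroupAvailabilityEvent event) {
        // Fired when a CP group loses a member but still has a majority.
        System.out.println("CP availability decreased: " + event.getGroupId());
    }

    @Override
    public void majorityLost(CPGroupAvailabilityEvent event) {
        // Fired when a CP group loses its majority; expected on crashes,
        // but (per the above) apparently not delivered for graceful shutdowns.
        System.err.println("CP majority lost: " + event.getGroupId());
    }
});
```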
A matter of opinion, perhaps, in some applications, but not in mine. Maybe an issue is that the "normal shutdown" events don't result in a clear state of things in the logs. For example, if the CP subsystem becomes "unusable" by design, there should be some ERROR or WARNING message in the logs, but there is none.
Another issue... I have tried to auto-reset the subsystem via code when CP goes down to one member, but I get the exception below... Currently, there is no way to run the CP subsystem without some manual intervention for simple tasks...
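A rough sketch of such an auto-reset approach (an assumed shape, not the actual code that produced the exception; hz is an assumed HazelcastInstance), where the reset() call is presumably what fails:

```java
import com.hazelcast.cp.event.CPMembershipEvent;
import com.hazelcast.cp.event.CPMembershipListener;

hz.getCPSubsystem().addMembershipListener(new CPMembershipListener() {
    @Override
    public void memberAdded(CPMembershipEvent event) {
        // nothing to do on additions
    }

    @Override
    public void memberRemoved(CPMembershipEvent event) {
        var management = hz.getCPSubsystem().getCPSubsystemManagementService();
        int remaining = management.getCPMembers().toCompletableFuture().join().size();
        if (remaining <= 1) {
            // Attempt to re-bootstrap the CP subsystem once it is down to one member;
            // this is roughly where the exception mentioned above would be thrown.
            management.reset().toCompletableFuture().join();
        }
    }
});
```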
Another one: shutdown is not possible in the "unusable" state. What that means in practice is that a CP cluster can't be cleanly and quickly shut down. It gets into the leader election loop again and won't exit. I think this happens when the application has used a CP FencedLock and any CP sessions are open.
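For what it's worth, there is a CP session management API that can inspect and force-close lingering FencedLock sessions before shutdown; a hedged sketch (the default group name and the blocking joins are assumptions, and force-closing a session releases any locks it holds):

```java
import com.hazelcast.cp.CPGroup;
import com.hazelcast.cp.session.CPSession;
import com.hazelcast.cp.session.CPSessionManagementService;

import java.util.Collection;

// hz is an assumed HazelcastInstance. Inspect and force-close any CP sessions still
// open in the default CP group (FencedLock acquisitions keep a session alive),
// then shut the member down.
CPSessionManagementService sessions =
        hz.getCPSubsystem().getCPSessionManagementService();
Collection<CPSession> open =
        sessions.getAllSessions(CPGroup.DEFAULT_GROUP_NAME).toCompletableFuture().join();
for (CPSession session : open) {
    sessions.forceCloseSession(CPGroup.DEFAULT_GROUP_NAME, session.id())
            .toCompletableFuture().join();
}
hz.shutdown();
```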
Another one:
This bug has plagued me for more than a year already, and I'm happy to see it finally documented. I never had time to investigate it in more depth myself; glad someone else did. Deploying apps on k8s is pretty widespread these days, and k8s moves pods around quite frequently for large and active clusters (like we operate - thousands of pods) in order to optimize load on nodes. Under those circumstances, the 4h default for kicking out pods/Hazelcast nodes that no longer exist, and not using info from the k8s API to speed up the cleanup/recovery process after a node from a CP group vanishes, are both less than ideal decisions in Hazelcast's implementation, IMO.
Thank you for listening @arodionov
@arodionov I have added a new test. It is failing and clearly demonstrates the incorrect behavior described here: #24897 (comment)
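For anyone following along, the rough shape of such a test (a sketch only, not the actual test from the PR; the newInstances helper and the JUnit/concurrency imports are assumed):

```java
@Test
public void whenMemberExitsNormally_thenReceiveEvents() throws Exception {
    // Assumed helper that starts a 3-member cluster configured with CP member count = 3.
    HazelcastInstance[] members = newInstances(3);

    CountDownLatch removed = new CountDownLatch(1);
    members[0].getCPSubsystem().addMembershipListener(new CPMembershipListener() {
        @Override
        public void memberAdded(CPMembershipEvent event) {
        }

        @Override
        public void memberRemoved(CPMembershipEvent event) {
            removed.countDown();
        }
    });

    // A graceful shutdown should remove the CP member and publish a memberRemoved event.
    members[2].shutdown();

    assertTrue(removed.await(30, TimeUnit.SECONDS));
}
```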
@arodionov Thanks! Looks great!
After some back-and-forth, it has become clear that whenMemberExitsNormally_thenReceiveEvents is no longer necessary after the PR and is working correctly.
I have encountered the same issue, using version 5.2.4. Has this issue been ultimately resolved? |
No. The first step towards fixing it will be in Hazelcast 5.4, where the
After trying for months to get the CP subsystem behaving properly in the face of pod restarts in kubernetes, I finally gave up and switched to a solution which doesn't use it at all. As far as I can tell, Hazelcast's CP subsystem is not at all a good fit for use in an environment where starting and stopping nodes isn't uncommon.
Describe the bug
When restarting CP members, there are a number of stability issues encountered when members go up/down, restart, freeze, or crash.
The theme is the same: an infinite loop trying to elect a new leader and failing because it keeps looking for long-ago-dropped members.
Expected behavior
The CP cluster remains stable.
**Issues (expanded)**
To Reproduce
PR #24903 has the additional tests
Exception
After the node rejoins, an infinite loop trying to elect a new CP leader is seen:
Additional context
See slack conversations:
https://hazelcastcommunity.slack.com/archives/C015Q2TUBKL/p1687454583340749
https://hazelcastcommunity.slack.com/archives/C015Q2TUBKL/p1687530351955629