Ensure all Raft nodes are terminated before CP member terminates #16022

Merged

metanet commented Nov 14, 2019

A Raft node can leak and remain in the ACTIVE state after a CP member
terminates, because of a race between Hazelcast member shutdown and the
Raft node termination logic. This commit fixes the problem by removing
a Raft node from the internal state of RaftService only after the Raft
node has completed its termination. This way, Hazelcast member
shutdown / termination blocks until all Raft nodes are terminated.
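
To illustrate the ordering described above, here is a minimal sketch, not the actual RaftService code. The names `SketchNode`, `SketchRegistry`, `terminateNode` and `shutdown` are hypothetical and exist only for this example. The point it shows is that a node is removed from the registry in a callback that runs after its termination future completes, so a shutdown that waits on those futures cannot return while a node is still ACTIVE.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a Raft node; this is NOT Hazelcast's RaftNode API.
final class SketchNode {
    enum Status { ACTIVE, TERMINATED }

    private volatile Status status = Status.ACTIVE;
    private final CompletableFuture<Void> terminated = new CompletableFuture<>();

    Status status() {
        return status;
    }

    // Starts asynchronous termination; the returned future completes once it is done.
    CompletableFuture<Void> terminate() {
        CompletableFuture.runAsync(() -> {
            status = Status.TERMINATED; // a real node would close files and stop tasks here
            terminated.complete(null);
        });
        return terminated;
    }
}

// Hypothetical stand-in for the service that tracks nodes by CP group id.
final class SketchRegistry {
    private final Map<String, SketchNode> nodes = new ConcurrentHashMap<>();

    void register(String groupId, SketchNode node) {
        nodes.put(groupId, node);
    }

    int size() {
        return nodes.size();
    }

    // Removes the node from the registry only AFTER its termination has completed.
    CompletableFuture<Void> terminateNode(String groupId) {
        SketchNode node = nodes.get(groupId);
        if (node == null) {
            return CompletableFuture.completedFuture(null);
        }
        return node.terminate().thenRun(() -> nodes.remove(groupId, node));
    }

    // Blocks until every registered node has terminated and been removed.
    void shutdown() {
        CompletableFuture<?>[] pending = nodes.keySet().stream()
                .map(this::terminateNode)
                .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(pending).join();
    }
}
```

If the registry entry were removed before `terminate()` finished, `shutdown()` would see an empty map and return early, which is the kind of leak this PR describes.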

Even though we have been complaining about test failures in our Windows
environment, it turned out that Windows helped us discover another edge
case in the never-ending battle between Hazelcast member shutdowns and
CP group destroys. Windows does not allow a file to be deleted while it
is still open (this behaviour can probably be changed), and if a Raft
node has leaked in ACTIVE status, deleting the CP group directories
fails in the subsequent iterations of the CP group destroy tests.

Thank you Windows for having principles for file deletion.

Fixes hazelcast/hazelcast-enterprise#3219

@metanet metanet added this to the 4.0 milestone Nov 14, 2019
@metanet metanet requested a review from mdogan Nov 14, 2019
@mdogan
mdogan approved these changes Nov 15, 2019
@metanet metanet merged commit 81fed0b into hazelcast:master Nov 15, 2019
1 check passed
default Test PASSed.
2 participants