Fixed missing EventService registrations after cluster members startup #16020

Merged

@petrpleshachkov petrpleshachkov commented Nov 14, 2019

Fixed a race condition between a new cluster member joining and the
post-join operations executed as part of a concurrent member join.

The joining member now sends its post-join operations directly to the
master, which in turn broadcasts them to all other members of the
cluster. This way the master guarantees that all post-join operations
are executed on all members of the cluster.

Fixes: #15950
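
The relay idea in the description can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `PostJoinRelaySketch`, `Node`, `Master` and `relayPostJoin` are hypothetical names that do not come from the Hazelcast code base, and the model assumes the master's member list already contains every joining member when the relay happens.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; none of these names come from the Hazelcast code base.
final class PostJoinRelaySketch {

    /** A cluster member that records the post-join payloads it has executed. */
    static final class Node {
        final String name;
        final List<String> executed = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        void execute(String payload) {
            executed.add(payload);
        }
    }

    /** The master keeps the member list and relays post-join payloads to it. */
    static final class Master {
        final Node self = new Node("master");
        final Map<String, Node> members = new LinkedHashMap<>();

        Master() {
            members.put(self.name, self);
        }

        void addMember(Node node) {
            members.put(node.name, node);
        }

        /** Called by a joining member: run the payload on every member except the sender. */
        void relayPostJoin(Node sender, String payload) {
            for (Node member : members.values()) {
                if (member != sender) {
                    member.execute(payload);
                }
            }
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        Node a = new Node("a");
        Node b = new Node("b");
        // a and b join concurrently; neither needs to know about the other,
        // because only the master's member list decides who receives the payload.
        master.addMember(a);
        master.addMember(b);
        master.relayPostJoin(a, "registrations-of-a");
        master.relayPostJoin(b, "registrations-of-b");
        System.out.println("a=" + a.executed + " b=" + b.executed
                + " master=" + master.self.executed);
    }
}
```

The point of the model is that the joining member never consults its own, possibly stale, member list: delivery depends only on what the master knows.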

@petrpleshachkov petrpleshachkov force-pushed the fix/master/gh-15950 branch 3 times, most recently from b572d56 to a66316a Compare November 14, 2019 13:49
@petrpleshachkov petrpleshachkov changed the title from "Fixed missing EventService registrations after concurrent cluster mem…" to "Petr Pleshachkov Fixed missing EventService registrations after cluster members startup" Nov 14, 2019
@mmedenjak mmedenjak assigned mmedenjak and unassigned mmedenjak Nov 14, 2019
@mmedenjak mmedenjak added this to the 4.0 milestone Nov 14, 2019
@petrpleshachkov petrpleshachkov changed the title from "Petr Pleshachkov Fixed missing EventService registrations after cluster members startup" to "Fixed missing EventService registrations after cluster members startup" Nov 14, 2019
@mmedenjak mmedenjak self-requested a review November 21, 2019 07:56
@mmedenjak

run-lab-run


@mmedenjak mmedenjak left a comment

Keep in mind that this needs to be backported (if possible) and that the current fix is not compatible with RU or patch-level guarantees.

@@ -148,12 +148,10 @@ private void sendPostJoinOperations() {
final OperationService operationService = nodeEngine.getOperationService();
final Collection<Member> members = clusterService.getMembers();

Minor: the members variable is no longer needed.

@mmedenjak

You know, I can smell bugs even with this solution. For instance, a member joins a stable cluster and prepares to send the OnJoinOp operation to the master. At that moment, the master splits away from the cluster. Another member is elected master, and then the old master rejoins the cluster. Now the OnJoinOp arrives at the old master, which doesn't propagate the registrations.

I guess I could conjure up some other scenarios, given enough time. But honestly, I don't think we need to solve this completely as this sounds like the atomic broadcast problem and I don't think it's solvable with our AP-style membership protocol without venturing into CP-land.

You can try finding a solution for the patch release, but if there is none, we can just say that it's an inherent design issue that is unsolvable within minor and patch-level guarantees, that it has been solved in 4.0, and that if it's a problem, users can insert an artificial delay between joining members (as they have already been instructed).

@petrpleshachkov

Regarding 3.12, yes, this fix is not going to work with RU. It may even make things worse if the joining member is upgraded but the master is not: in that case neither the master nor the joining member is going to broadcast the registrations. For this scenario we can keep the old logic in combination with the new one. Yes, we will broadcast more events and there will be duplicates (AFAIU they are already handled properly), but we will have more guarantees, at least when the master is stable. WDYT, guys?

@mmedenjak

Yes, I wanted to suggest sending the operation on multiple occasions (e.g. a blunt version might send the operation again on every member added event) but I was unsure if the operations were idempotent.
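
Why idempotency matters for repeated sends can be shown with a small sketch. It assumes, purely for illustration, that each registration carries a unique id and that the receiving member keys its table by that id; `IdempotentRegistrationsSketch` and `register` are hypothetical names, not Hazelcast's EventService API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a per-member registration table; not Hazelcast's EventService.
final class IdempotentRegistrationsSketch {

    // Keyed by registration id, so applying the same registration twice changes nothing.
    private final Map<String, String> registrationsById = new ConcurrentHashMap<>();

    /** Returns true only the first time a given id is registered. */
    boolean register(String registrationId, String topic) {
        return registrationsById.putIfAbsent(registrationId, topic) == null;
    }

    int size() {
        return registrationsById.size();
    }

    public static void main(String[] args) {
        IdempotentRegistrationsSketch member = new IdempotentRegistrationsSketch();
        // The same registration arrives three times: directly from the joining
        // member, relayed by the master, and again after a later member-added event.
        boolean direct = member.register("reg-1", "orders");
        boolean relayed = member.register("reg-1", "orders");
        boolean resent = member.register("reg-1", "orders");
        System.out.println(direct + " " + relayed + " " + resent + " size=" + member.size());
    }
}
```

With a table like this, sending the operation on every member-added event costs extra traffic but cannot create duplicate registrations.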

@petrpleshachkov petrpleshachkov requested a review from a team as a code owner November 22, 2019 15:47
@hazelcast hazelcast deleted a comment from petrpleshachkov Nov 26, 2019
@hazelcast hazelcast deleted a comment from petrpleshachkov Nov 26, 2019
@petrpleshachkov

Guys, thanks for the review, I am merging the PR.

@petrpleshachkov petrpleshachkov merged commit 76092a4 into hazelcast:master Nov 26, 2019
@mmedenjak mmedenjak added the Source: Internal PR or issue was opened by an employee label Apr 13, 2020
Successfully merging this pull request may close these issues.

Inconsistent EventService registrations after forming cluster