Fixed missing EventService registrations after cluster members startup #16020

Merged

Conversation

@petrpleshachkov
Contributor

petrpleshachkov commented Nov 14, 2019

Fixed a race condition between a new cluster member joining and the
post-join operations executed as part of a concurrent member join.

The joining member now sends its post-join operations directly to the
master, which in turn broadcasts them to all other members of the
cluster. This way, the master guarantees that all post-join operations
are executed on all members of the cluster.

Fixes: #15950
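
The relay described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the code in this PR: `Node`, `localView`, `sendDirect`, and `sendViaMaster` are hypothetical names, and the model assumes the race arises because a joining member's own member list may not yet include a concurrently joining member, while the master's list does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; none of these types are Hazelcast classes.
final class PostJoinRelaySketch {

    /** A member that records the registrations it has applied. */
    static final class Node {
        final String name;
        final List<String> applied = new ArrayList<>();
        // Members this node currently knows about (assumed to lag behind the master).
        final List<Node> localView = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        void apply(String registration) {
            applied.add(registration);
        }

        /** Old approach: send to whatever the sender's own member list contains. */
        void sendDirect(String registration) {
            for (Node member : localView) {
                if (member != this) {
                    member.apply(registration);
                }
            }
        }

        /** New approach: the master fans the operation out using its own member list. */
        void sendViaMaster(Node master, String registration) {
            for (Node member : master.localView) {
                if (member != this) {
                    member.apply(registration);
                }
            }
        }
    }

    /** Members b and c join concurrently; returns what c ends up seeing from b. */
    static List<String> registrationsSeenByC(boolean viaMaster) {
        Node master = new Node("master");
        Node b = new Node("b");
        Node c = new Node("c");
        master.localView.addAll(List.of(master, b, c)); // the master knows everyone
        b.localView.addAll(List.of(master, b));         // b has not learned about c yet
        c.localView.addAll(List.of(master, c));         // c has not learned about b yet
        if (viaMaster) {
            b.sendViaMaster(master, "listener-b");
        } else {
            b.sendDirect("listener-b");
        }
        return c.applied;
    }

    public static void main(String[] args) {
        System.out.println("direct: " + registrationsSeenByC(false));    // direct: []
        System.out.println("via master: " + registrationsSeenByC(true)); // via master: [listener-b]
    }
}
```

In this model, the direct send misses `c` because `b` does not know about it yet, while the relay reaches `c` because the fan-out uses the master's member list.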

@petrpleshachkov petrpleshachkov force-pushed the petrpleshachkov:fix/master/gh-15950 branch 3 times, most recently from b572d56 to a66316a Nov 14, 2019
@petrpleshachkov petrpleshachkov changed the title from "Fixed missing EventService registrations after concurrent cluster mem…" to "Petr Pleshachkov Fixed missing EventService registrations after cluster members startup" Nov 14, 2019
@mmedenjak mmedenjak requested a review from vojtechtoman Nov 14, 2019
@mmedenjak mmedenjak assigned mmedenjak and unassigned mmedenjak Nov 14, 2019
@mmedenjak mmedenjak added this to the 4.0 milestone Nov 14, 2019
@petrpleshachkov petrpleshachkov changed the title from "Petr Pleshachkov Fixed missing EventService registrations after cluster members startup" to "Fixed missing EventService registrations after cluster members startup" Nov 14, 2019
@petrpleshachkov petrpleshachkov requested a review from mdogan Nov 14, 2019
@mdogan
mdogan approved these changes Nov 18, 2019
@mmedenjak mmedenjak self-requested a review Nov 21, 2019
@mmedenjak

Contributor

mmedenjak commented Nov 21, 2019

run-lab-run

Contributor

mmedenjak left a comment

Keep in mind that this needs to be backported (if possible) and that the current fix is not compatible with rolling upgrade (RU) or patch-level guarantees.

@@ -148,12 +148,10 @@ private void sendPostJoinOperations() {
final OperationService operationService = nodeEngine.getOperationService();
final Collection<Member> members = clusterService.getMembers();

@mmedenjak

mmedenjak Nov 21, 2019

Contributor

Minor: `members` is no longer needed.

@mmedenjak

Contributor

mmedenjak commented Nov 21, 2019

You know, I can smell bugs even with this solution. For instance, a member joins a stable cluster and prepares to send the OnJoinOp operation to the master. At that moment, the master splits away from the cluster. Another member is elected master, and then the old master rejoins the cluster. Now the OnJoinOp arrives at the old master, which doesn't propagate the registrations.

I guess I could conjure up some other scenarios, given enough time. But honestly, I don't think we need to solve this completely as this sounds like the atomic broadcast problem and I don't think it's solvable with our AP-style membership protocol without venturing into CP-land.

You can try finding a solution for the patch release, but if there is none, we can just say it's an inherent design issue that is unsolvable under the minor and patch-level guarantees and has been solved in 4.0. If it's a problem, users can insert an artificial delay between joining members (as they have already been instructed).
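
The artificial delay mentioned above could be scripted roughly as follows. This is a minimal sketch, not an official recipe: `StaggeredStartupSketch` and `startAll` are hypothetical names, and the right gap between members is deployment-specific.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper; the member start action is passed in as a Supplier.
final class StaggeredStartupSketch {

    /** Starts members one by one, pausing for `gap` between consecutive starts. */
    static <T> List<T> startAll(List<Supplier<T>> starters, Duration gap) {
        List<T> started = new ArrayList<>();
        for (int i = 0; i < starters.size(); i++) {
            if (i > 0) {
                try {
                    Thread.sleep(gap.toMillis()); // let the previous join settle
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while staggering startup", e);
                }
            }
            started.add(starters.get(i).get());
        }
        return started;
    }

    public static void main(String[] args) {
        List<Supplier<String>> starters = List.of(() -> "member-1", () -> "member-2");
        System.out.println(startAll(starters, Duration.ofMillis(50)));
    }
}
```

With Hazelcast on the classpath, each supplier could presumably be `Hazelcast::newHazelcastInstance`. A delay only makes concurrent joins less likely; it does not remove the race.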

@petrpleshachkov

Contributor Author

petrpleshachkov commented Nov 21, 2019

Regarding 3.12, yes, this fix is not going to work with RU. It may even make things worse if the joining member is upgraded but the master is not: in that case, neither the master nor the joining member will broadcast the registrations. For this scenario we can keep the old logic in combination with the new one. Yes, we will broadcast more events and there will be duplicates (AFAIU they are already handled properly), but we will have stronger guarantees, at least when the master is stable. WDYT, guys?

@mmedenjak

Contributor

mmedenjak commented Nov 21, 2019

Yes, I wanted to suggest sending the operation on multiple occasions (e.g. a blunt version might send the operation again on every member added event) but I was unsure if the operations were idempotent.
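
To make the idempotency question concrete, here is one way duplicate registrations can be made harmless. This is a sketch under assumptions, not how EventService actually stores registrations: `IdempotentRegistrySketch` and its methods are hypothetical, and it assumes every registration carries a unique id.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry; not the Hazelcast EventService implementation.
final class IdempotentRegistrySketch {

    // Registrations keyed by an assumed unique registration id.
    private final Map<String, String> topicsById = new ConcurrentHashMap<>();

    /** Returns true only the first time a given registration id is applied. */
    boolean register(String registrationId, String topic) {
        return topicsById.putIfAbsent(registrationId, topic) == null;
    }

    int size() {
        return topicsById.size();
    }

    public static void main(String[] args) {
        IdempotentRegistrySketch registry = new IdempotentRegistrySketch();
        System.out.println(registry.register("reg-1", "map-events")); // true
        System.out.println(registry.register("reg-1", "map-events")); // false, duplicate ignored
        System.out.println(registry.size());                          // 1
    }
}
```

If registrations behave like this, resending the operation (from both the joining member and the master, or on every member-added event) costs extra traffic but does not change the final state.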

@petrpleshachkov petrpleshachkov requested a review from hazelcast/clients as a code owner Nov 22, 2019
@hazelcast hazelcast deleted a comment from petrpleshachkov Nov 26, 2019
@hazelcast hazelcast deleted a comment from petrpleshachkov Nov 26, 2019
@petrpleshachkov

Contributor Author

petrpleshachkov commented Nov 26, 2019

Guys, thanks for the review, I am merging the PR.

@petrpleshachkov petrpleshachkov merged commit 76092a4 into hazelcast:master Nov 26, 2019
1 check passed
default Test PASSed.
4 participants