Inconsistent EventService registrations after forming cluster #15950

Closed
vojtechtoman opened this issue Nov 6, 2019 · 1 comment · Fixed by #16020

@vojtechtoman vojtechtoman commented Nov 6, 2019

Occasionally, after a cluster has formed, some members are missing EventService registrations for some of the other members. Normally, all members should have the same set of registrations.

After some investigation, I figured out what the problem is:

Suppose there are (at least) three members, A, B and C, joining the cluster in this order. When B joins the cluster, the master (A) generates the FinalizeJoinOp and executes it on B. B then executes the preJoin operations (which include, among other things, a snapshot of A’s event registrations to be propagated to B) and, when done, sends its postJoin operations back to the other cluster members (only A in this case). One purpose of these operations is to propagate B’s event registrations to the rest of the cluster.

Now, suppose C joins the cluster before B’s postJoin operations reach A. In such a scenario (which ClusterJoinManager.startJoin() is not protected against), A’s snapshot of event registrations does not yet include B’s registrations (which include B itself), so the preJoin operations sent to C do not mention B at all. If C is the last member to join the cluster, no other member can fix this later, because no further postJoin operation will be sent to C that would add B to its list of event registrations.
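
The ordering problem described above can be replayed in a small, single-threaded toy model. This is only a sketch and not Hazelcast code: the class `JoinRaceSketch` and its methods are hypothetical names, and it assumes that each member simply holds a set of member names it has event registrations for, that preJoin copies the master's current set, and that postJoin is delivered only to the members the joiner saw when it joined.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy, single-threaded model of the join race. Not Hazelcast code.
final class JoinRaceSketch {

    // member name -> the set of members it holds event registrations for
    private final Map<String, Set<String>> views = new LinkedHashMap<>();
    private final String master;

    JoinRaceSketch(String master) {
        this.master = master;
        views.put(master, new TreeSet<>(Arrays.asList(master)));
    }

    // preJoin: the joiner copies the master's current snapshot and adds itself.
    void preJoin(String joiner) {
        Set<String> view = new TreeSet<>(views.get(master));
        view.add(joiner);
        views.put(joiner, view);
    }

    // postJoin: the joiner's registration arrives at the members it saw when it joined.
    void deliverPostJoin(String joiner, List<String> membersSeenAtJoin) {
        for (String target : membersSeenAtJoin) {
            views.get(target).add(joiner);
        }
    }

    Set<String> viewOf(String member) {
        return views.get(member);
    }

    // Replays the A, B, C scenario; the flag decides whether B's postJoin is late.
    static JoinRaceSketch run(boolean postJoinOfBIsLate) {
        JoinRaceSketch cluster = new JoinRaceSketch("A");
        cluster.preJoin("B");
        if (!postJoinOfBIsLate) {
            cluster.deliverPostJoin("B", Arrays.asList("A"));
        }
        cluster.preJoin("C"); // C's snapshot is taken from A at this point
        if (postJoinOfBIsLate) {
            cluster.deliverPostJoin("B", Arrays.asList("A"));
        }
        cluster.deliverPostJoin("C", Arrays.asList("A", "B"));
        return cluster;
    }

    public static void main(String[] args) {
        System.out.println("in order: C sees " + run(false).viewOf("C")); // [A, B, C]
        System.out.println("raced:    C sees " + run(true).viewOf("C"));  // [A, C]
    }
}
```

In the raced run, A and B end up with all three entries, while C never learns about B, which matches the symptom reported here.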

It is quite possible that the above problem applies not only to EventService, but also to other PreJoinAwareService/PostJoinAwareService services.

I encountered this while working on an unrelated issue on 3.12.x. It (occasionally) occurs in Hot Restart tests where we start all members at once, in parallel. In many real situations, I believe the issue is quite unlikely to occur.

@vojtechtoman vojtechtoman commented Nov 11, 2019

Reproducer (run it repeatedly until the final assertEventRegistrations() fails; this takes about 10 runs on my machine):

package com.hazelcast.spi.hotrestart.cluster;

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.instance.Node;
import com.hazelcast.nio.Address;
import com.hazelcast.spi.EventRegistration;
import com.hazelcast.spi.impl.proxyservice.impl.ProxyServiceImpl;
import com.hazelcast.test.AssertTask;
import com.hazelcast.test.HazelcastSerialClassRunner;
import com.hazelcast.test.annotation.ParallelTest;
import com.hazelcast.test.annotation.QuickTest;
import org.junit.Test;
import org.junit.experimental.categories.Category;
import org.junit.runner.RunWith;

import java.util.Collection;

import static com.hazelcast.cluster.ClusterShutdownTest.assertNodesShutDownEventually;
import static com.hazelcast.cluster.ClusterShutdownTest.getNodes;
import static org.junit.Assert.assertEquals;

@RunWith(HazelcastSerialClassRunner.class)
@Category({QuickTest.class, ParallelTest.class})
public class HotRestartEventServiceTest extends AbstractHotRestartClusterStartTest {

    @Test
    public void test_eventRegistrations_afterHotRestart() {
        HazelcastInstance[] instances = startNewInstances(3);
        instances[0].getMap("map0").put("foo", "bar");

        Address[] addresses = getAddresses(instances);
        Node[] nodes = getNodes(instances);
        instances[0].getCluster().shutdown();
        assertNodesShutDownEventually(nodes);

        assertEventRegistrations(3, restartInstances(addresses));
    }

    private static void assertEventRegistrations(final int expected, final HazelcastInstance... instances) {
        assertTrueEventually(new AssertTask() {
            @Override
            public void run() {
                for (HazelcastInstance instance : instances) {
                    Collection<EventRegistration> regs = getNodeEngineImpl(instance).getEventService().getRegistrations(
                            ProxyServiceImpl.SERVICE_NAME, ProxyServiceImpl.SERVICE_NAME);
                    assertEquals(instance + ": " + regs, expected, regs.size());
                }
            }
        });
    }
}

@petrpleshachkov self-assigned this Nov 12, 2019
petrpleshachkov pushed a commit to petrpleshachkov/hazelcast that referenced this issue Nov 14, 2019
…bers startup

Fixed a race condition between new cluster member join and post join
operations executed as part of concurrent member join.

Send post-join operations directly from the joining member to the master,
which in turn broadcasts them to all other members of the cluster. This way
the master guarantees that all post-join operations are executed on all
members of the cluster.

Fixes: hazelcast#15950
petrpleshachkov pushed a commit to petrpleshachkov/hazelcast that referenced this issue Nov 22, 2019
Fixed a race condition between new cluster member join and post join
operations executed as part of concurrent member join.

In addition to broadcasting post operations from joining member to all other members
(keep this logic to support rolling upgrade),
broadcast the post operations from master as well. This way
master guarantees that all post join operations are executed on all
members of the cluster.

The rolling upgrade scenario depends on whether the master has been upgraded
earlier than the joining member. If so, the guarantees are preserved.

Fixes: hazelcast#15950
petrpleshachkov added a commit that referenced this issue Nov 26, 2019
#16088)

Fixed a race condition between new cluster member join and post join
operations executed as part of concurrent member join.

In addition to broadcasting post operations from joining member to all other members
(keep this logic to support rolling upgrade),
broadcast the post operations from master as well. This way
master guarantees that all post join operations are executed on all
members of the cluster.

The rolling upgrade scenario depends on whether the master has been upgraded
earlier than the joining member. If so, the guarantees are preserved.

Fixes: #15950
petrpleshachkov added a commit that referenced this issue Nov 26, 2019
#16020)

Fixed a race condition between new cluster member join and post join
operations executed as part of concurrent member join.

Send post-join operations directly from the joining member to the master,
which in turn broadcasts them to all other members of the cluster. This way
the master guarantees that all post-join operations are executed on all
members of the cluster.

Fixes: #15950
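
The idea in the commit messages above (the master re-broadcasts post-join operations, optionally alongside the joining member's own broadcast for rolling upgrades) can be illustrated with a toy model. This is a sketch under stated assumptions, not the code from the referenced commits: `MasterRelaySketch` and its methods are hypothetical names, registrations are modelled as plain sets of member names, and applying the same post-join twice is assumed to be harmless because the set ignores duplicates.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the master re-broadcast idea. Not the code from the actual fix.
final class MasterRelaySketch {

    // member name -> the set of members it holds event registrations for
    private final Map<String, Set<String>> views = new LinkedHashMap<>();
    private final String master;

    MasterRelaySketch(String master) {
        this.master = master;
        views.put(master, new TreeSet<>(Arrays.asList(master)));
    }

    // preJoin: the joiner copies the master's current snapshot and adds itself.
    void preJoin(String joiner) {
        Set<String> view = new TreeSet<>(views.get(master));
        view.add(joiner);
        views.put(joiner, view);
    }

    // Old path, kept for rolling upgrades: joiner -> members it saw when it joined.
    void postJoinDirect(String joiner, List<String> membersSeenAtJoin) {
        for (String target : membersSeenAtJoin) {
            views.get(target).add(joiner); // Set.add makes a repeated delivery a no-op
        }
    }

    // New path: the master re-broadcasts to every member it knows about right now,
    // including members that joined after the joiner's snapshot was taken.
    void postJoinViaMaster(String joiner) {
        for (Set<String> view : views.values()) {
            view.add(joiner);
        }
    }

    Set<String> viewOf(String member) {
        return views.get(member);
    }

    // True when every member holds an entry for every member of the cluster.
    boolean allViewsComplete() {
        for (Set<String> view : views.values()) {
            if (!view.equals(views.keySet())) {
                return false;
            }
        }
        return true;
    }

    // Replays the racy A, B, C ordering with the master re-broadcast in place.
    static MasterRelaySketch run() {
        MasterRelaySketch cluster = new MasterRelaySketch("A");
        cluster.preJoin("B");
        cluster.preJoin("C");                            // C joins before B's postJoin lands
        cluster.postJoinDirect("B", Arrays.asList("A")); // reaches A only; C still lacks B
        cluster.postJoinViaMaster("B");                  // the master fills the gap on C
        cluster.postJoinDirect("C", Arrays.asList("A", "B"));
        cluster.postJoinViaMaster("C");
        return cluster;
    }

    public static void main(String[] args) {
        System.out.println("all views complete: " + run().allViewsComplete()); // true
    }
}
```

The model keeps both delivery paths to mirror the rolling-upgrade variant; dropping the `postJoinDirect` calls gives the same end state here, since the master's broadcast alone covers every member it knows about.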