Consider AbstractJoiner's permanent blacklist in split-brain handling [HZ-2065] #24830

Merged (3 commits) on Jun 22, 2023

Commits on Jun 15, 2023

  1. Consider AbstractJoiner's permanent blacklist in split-brain handling

    For context: Hazelcast uses a `Joiner` implementation that handles the
    joining of Nodes to Clusters; all current Joiners extend `AbstractJoiner`
    which features a map of blacklisted addresses, `blacklistedAddresses`.
    The key of the map is the address, and the value is a boolean, which
    shows whether the blacklist is "permanent" (true) or not. A blacklist is
    applied in 2 situations: after failing a connection while not joined to
    a cluster (non-permanent), or during a `ClusterMismatchOp` (permanent).
    The latter is only fired when a Node tries to join a cluster, but is not
    compatible with the cluster, and so is rejected.
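
The blacklist described above can be pictured with a minimal sketch. It is not Hazelcast's actual implementation: the class `BlacklistSketch`, its method names, and the use of plain `String` addresses are assumptions made for illustration. Only the idea of a map from address to a "permanent" flag comes from the description above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the blacklist map; not Hazelcast's actual code.
class BlacklistSketch {
    // Key: member address (a plain String here, for simplicity).
    // Value: true if the entry is permanent, false if it is temporary.
    private final Map<String, Boolean> blacklistedAddresses = new ConcurrentHashMap<>();

    void blacklist(String address, boolean permanent) {
        if (permanent) {
            // A permanent entry overrides any temporary one.
            blacklistedAddresses.put(address, Boolean.TRUE);
        } else {
            // Never downgrade an existing permanent entry to temporary.
            blacklistedAddresses.putIfAbsent(address, Boolean.FALSE);
        }
    }

    // Removes the entry only if it is temporary; returns true if it was removed.
    boolean unblacklist(String address) {
        return blacklistedAddresses.remove(address, Boolean.FALSE);
    }

    boolean isBlacklisted(String address) {
        return blacklistedAddresses.containsKey(address);
    }

    boolean isPermanentlyBlacklisted(String address) {
        return Boolean.TRUE.equals(blacklistedAddresses.get(address));
    }
}
```

In this sketch a failed connection would call `blacklist(address, false)`, while a cluster mismatch would call `blacklist(address, true)`.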
    
    The `TcpIpJoiner` joiner employs the blacklist quite heavily; however, it
    currently does not use it in the `searchForOtherClusters` method. This
    method is only called by the `SplitBrainHandler` task, which runs
    periodically (by default after a 5-minute initial delay, then every
    2 minutes). As a result, when running multiple clusters that share the
    same member list, it is possible to encounter a situation such as this:
    ```
    Member list is defined as "127.0.0.1" for all Nodes.
    
    Node A1 starts and forms cluster "cluster_A" as master
    Node B1 starts and forms cluster "cluster_B" as master
    Node A2 starts and joins cluster "cluster_A"
    
    Node B1, as master, runs `SplitBrainHandler` every 2 minutes.
    This finds A2's address; B1 sends a merge validation op
    A2 is not master, so offloads to ClusterJoinManager#answerWhoisMasterQuestion
    This calls #ensureValidConfiguration, results in ClusterMismatchOp
    ClusterMismatchOp blacklists A2's address on the first encounter
    ```
    In this scenario, even though the first attempt from B1's split brain
    handler blacklists A2's address, because there are no blacklist checks
    within `TcpIpJoiner#searchForOtherClusters`, the attempts continue with
    every execution of the `SplitBrainHandler` - this results in numerous
    unnecessary warnings in the logs for members B1 and A2.
    
    This commit resolves scenarios like this by removing permanently
    blacklisted addresses from the collection of `possibleAddresses` output
    by `TcpIpJoiner#searchForOtherClusters`. A master node will no longer
    attempt to send a `SplitBrainJoinMessage` to a node of another cluster
    once it has blacklisted that node's address on the first encounter.
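
As a rough illustration of this filtering step, the sketch below drops known members and permanently blacklisted addresses from a candidate list. `PossibleAddressFilterSketch`, its `filter` method, and the `String` addresses are hypothetical. The real logic lives in `TcpIpJoiner#searchForOtherClusters` and may differ in detail; in particular, keeping temporarily blacklisted addresses is an assumption of this sketch.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper illustrating the filtering idea; not Hazelcast's actual code.
class PossibleAddressFilterSketch {
    static Set<String> filter(Collection<String> possibleAddresses,
                              Set<String> knownMembers,
                              Map<String, Boolean> blacklistedAddresses) {
        // Copy first so the caller's collection is left untouched; keep order stable.
        Set<String> result = new LinkedHashSet<>(possibleAddresses);
        // Members of our own cluster never need a split-brain probe.
        result.removeAll(knownMembers);
        // Drop addresses whose blacklist entry is permanent (value == true).
        result.removeIf(a -> Boolean.TRUE.equals(blacklistedAddresses.get(a)));
        return result;
    }
}
```

With this shape, an address blacklisted by a cluster mismatch is skipped on every later run of the split-brain check, so the repeated warnings described above would not recur.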
    
    However, introducing this blacklist check for `SplitBrainHandler` adds
    some further considerations, such as what would happen in this scenario:
    ```
    Member list is defined as "127.0.0.1" for all Nodes.
    
    Node A1 starts on port 5701 and forms cluster "cluster_A" as master
    Node A2 starts on port 5702 and joins cluster "cluster_A"
    
    Node B1 starts on port 5703 and forms cluster "cluster_B" as master
    Node B2 starts on port 5704 and joins cluster "cluster_B"
    
    *Some time elapses, both masters (A1/B1) have run SplitBrainHandler tasks*
    
    Node A2 is shut down.
    Node B3 starts on port 5702 and joins cluster "cluster_B"
    
    B1 (master) has B3's address blacklisted from when A2 was on port 5702
    B1 will now never send SplitBrainJoinMessages to B3 due to this
    ```
    In this scenario, B3 is still able to join "cluster_B" successfully,
    because it seeks out the master itself and so B1's blacklist is not
    applied. B1's stale entry for port 5702 would nevertheless remain. This
    commit resolves the situation by removing a Node's address from the
    permanent blacklist within the `TcpIpJoiner#onMemberAdded` method.
    
    As stated above, addresses are only permanently blacklisted in a
    `ClusterMismatchOp`, so if a Node has successfully triggered
    `onMemberAdded`, there has clearly been a topology change and the
    address is no longer incompatible. It should therefore be safe to
    remove it from the permanent blacklist at that point.
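
A minimal sketch of the removal on member addition follows. The class `MemberAddedSketch` and its `String` addresses are hypothetical, and removing only permanent entries is an assumption about the behaviour described above, not a copy of `TcpIpJoiner#onMemberAdded`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of clearing a stale permanent blacklist entry.
class MemberAddedSketch {
    // Key: address, value: true if the blacklist entry is permanent.
    private final Map<String, Boolean> blacklistedAddresses = new ConcurrentHashMap<>();

    // Stands in for what a cluster mismatch would do to the sender's address.
    void blacklistPermanently(String address) {
        blacklistedAddresses.put(address, Boolean.TRUE);
    }

    // Called when a member has successfully joined our cluster.
    void onMemberAdded(String memberAddress) {
        // The address now belongs to a compatible member, so a permanent
        // entry for it is stale. Temporary entries are left alone here.
        blacklistedAddresses.remove(memberAddress, Boolean.TRUE);
    }

    boolean isBlacklisted(String address) {
        return blacklistedAddresses.containsKey(address);
    }
}
```

Applied to the second scenario, B1 would blacklist `127.0.0.1:5702` while A2 occupies that port, and the entry would be cleared once B3 joins "cluster_B" from the same address.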
    
    Fixes https://hazelcast.atlassian.net/browse/HZ-2065
    JamesHazelcast committed Jun 15, 2023
    Commit bd60822

Commits on Jun 21, 2023

  1. Add SplitBrainHandler blacklist check regression test for TcpIpJoiner

    To conduct this test thoroughly, it was necessary to extract part of
    the `TcpIpJoiner#searchForOtherClusters` function into a separate
    function, `TcpIpJoiner#getFilteredPossibleAddresses`. This function
    reduces the results of `AbstractJoiner#getPossibleAddresses` by applying
    filters such as removing known and blacklisted addresses. The extraction
    does not change any functionality; it just exposes this list more easily.
    JamesHazelcast committed Jun 21, 2023
    Commit 9d9dc2a

Commits on Jun 22, 2023

  1. Commit e861df9