Consider AbstractJoiner's permanent blacklist in split-brain handling [HZ-2065] #24830

For context: Hazelcast uses a `Joiner` implementation that handles the joining of Nodes to Clusters; all current Joiners extend `AbstractJoiner` which features a map of blacklisted addresses, `blacklistedAddresses`. The key of the map is the address, and the value is a boolean, which shows whether the blacklist is "permanent" (true) or not. A blacklist is applied in 2 situations: after failing a connection while not joined to a cluster (non-permanent), or during a `ClusterMismatchOp` (permanent). The latter is only fired when a Node tries to join a cluster, but is not compatible with the cluster, and so is rejected. The `TcpIpJoiner` joiner employs the blacklist quite heavily, however it currently does not use it in the `searchForOtherClusters` method. This method is only called by the `SplitBrainHandler` task, which runs on loop (default of 5m delay, 2m interval). Due to this, when running multiple clusters sharing the same member list, it is possible to encounter a situation such as this: ``` Member list is defined as "127.0.0.1" for all Nodes. Node A1 starts and forms cluster "cluster_A" as master Node B1 starts and forms cluster "cluster_B" as master Node A2 starts and joins cluster "cluster_A" Node B1, as master, runs `SplitBrainHandler` every 2 minutes. This finds A2's address; B1 sends a merge validation op A2 is not master, so offloads to ClusterJoinManager#answerWhoisMasterQuestion This calls #ensureValidConfiguration, results in ClusterMismatchOp ClusterMismatchOp blacklists A2's address on the first encounter ``` In this scenario, even though the first attempt from B1's split brain handler blacklists A2's address, because there are no blacklist checks within `TcpIpJoiner#searchForOtherClusters`, the attempts continue with every execution of the `SplitBrainHandler` - this results in numerous unnecessary warnings in the logs for members B1 and A2. This commit resolves scenarios like this by removing permanently blacklisted addresses from the collection of `possibleAddresses` output by `TcpIpJoiner#searchForOtherClusters`. Now master nodes will not attempt to send `SplitBrainJoinMessage`s to other cluster nodes after blacklisting its address on the first encounter. However, introducing this blacklist check for `SplitBrainHandler` adds some further considerations, such as what would happen in this scenario: ``` Member list is defined as "127.0.0.1" for all Nodes. Node A1 starts on port 5701 and forms cluster "cluster_A" as master Node A2 starts on port 5702 and joins cluster "cluster_A" Node B1 starts on port 5703 and forms cluster "cluster_B" as master Node B2 starts on port 5704 and joins cluster "cluster_B" *Some time elapses, both A1/B2 have run SplitBrainHandler tasks* Node A2 is shutdown. Node B3 starts on port 5702 and joins cluster "cluster_B" B1 (master) has B3's address blacklisted from when A2 was on port 5702 B1 will now never send SplitBrainJoinMessages to B3 due to this ``` In this scenario, B3 is able to join "cluster_B" successfully as it seeks the master itself and so the blacklist is not applied. This commit resolves the situation by removing Node addresses from the permanent blacklist within the `TcpIpJoiner#onMemberAdded` method. As stated above, addresses are only permanently blacklisted in a `ClusterMismatchOp`, so if a Node has successfully triggered `onMemberAdded`, there has clearly been a topology change meaning this address is no longer incompatible. Therefore it should be safe to add this permanent blacklist removal functionality. Fixes https://hazelcast.atlassian.net/browse/HZ-2065

To conduct this test thoroughly, it was necessary to extract part of the `TcpIpJoiner#searchForOtherClusters` function into a separate function `TcpIpJoiner#getFilteredPossibleAddresses` - this function reduces the results of `AbstractJoiner#getPossibleAddresses` (applying filters such as removing known and blacklisted addresses), and does not change any functionality, it just exposes this list more easily.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider AbstractJoiner's permanent blacklist in split-brain handling [HZ-2065] #24830

Consider AbstractJoiner's permanent blacklist in split-brain handling [HZ-2065] #24830

Commits on Jun 15, 2023

Commits on Jun 21, 2023

Commits on Jun 22, 2023