For context: Hazelcast uses a `Joiner` implementation that handles the
joining of Nodes to Clusters; all current Joiners extend `AbstractJoiner`
which features a map of blacklisted addresses, `blacklistedAddresses`.
The key of the map is the address, and the value is a boolean, which
shows whether the entry is "permanent" (true) or not. An address is
blacklisted in two situations: after a failed connection attempt while
not joined to a cluster (non-permanent), or during a `ClusterMismatchOp`
(permanent).
The latter is only fired when a Node tries to join a cluster, but is not
compatible with the cluster, and so is rejected.
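As a rough illustration of the bookkeeping described above, here is a minimal sketch of a blacklist map with permanent and non-permanent entries. It is a simplified stand-in, not the Hazelcast class: `BlacklistSketch` and `isPermanentlyBlacklisted` are hypothetical names, addresses are plain strings, and the rules that a permanent entry is never downgraded and that `unblacklist` only removes non-permanent entries are assumptions made for this sketch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical, simplified stand-in for the blacklist bookkeeping.
class BlacklistSketch {
    // key: address, value: true if the entry is permanent
    private final ConcurrentMap<String, Boolean> blacklistedAddresses =
            new ConcurrentHashMap<>();

    void blacklist(String address, boolean permanent) {
        // Assumption: once an entry is permanent it stays permanent.
        blacklistedAddresses.merge(address, permanent, (old, now) -> old || now);
    }

    boolean unblacklist(String address) {
        // Assumption: only non-permanent entries can be removed this way.
        return blacklistedAddresses.remove(address, Boolean.FALSE);
    }

    boolean isBlacklisted(String address) {
        return blacklistedAddresses.containsKey(address);
    }

    boolean isPermanentlyBlacklisted(String address) {
        return Boolean.TRUE.equals(blacklistedAddresses.get(address));
    }
}
```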
The `TcpIpJoiner` employs the blacklist quite heavily, but it currently
does not consult it in the `searchForOtherClusters` method. This method
is only called by the `SplitBrainHandler` task, which runs on a loop
(by default after a 5m initial delay, then at a 2m interval). As a
result, when running multiple clusters that share the same member list,
it is possible to encounter a situation such as this:
```
Member list is defined as "127.0.0.1" for all Nodes.
Node A1 starts and forms cluster "cluster_A" as master
Node B1 starts and forms cluster "cluster_B" as master
Node A2 starts and joins cluster "cluster_A"
Node B1, as master, runs `SplitBrainHandler` every 2 minutes.
This finds A2's address; B1 sends a merge validation op
A2 is not master, so offloads to ClusterJoinManager#answerWhoisMasterQuestion
This calls #ensureValidConfiguration, results in ClusterMismatchOp
ClusterMismatchOp blacklists A2's address on the first encounter
```
In this scenario, the first attempt from B1's split brain handler
blacklists A2's address, but because there are no blacklist checks
within `TcpIpJoiner#searchForOtherClusters`, the attempts are repeated
on every execution of the `SplitBrainHandler`. This results in numerous
unnecessary warnings in the logs of members B1 and A2.
This commit resolves scenarios like this by removing permanently
blacklisted addresses from the collection of `possibleAddresses` output
by `TcpIpJoiner#searchForOtherClusters`. A master node will no longer
attempt to send `SplitBrainJoinMessage`s to a node of another cluster
after blacklisting that node's address on the first encounter.
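The filtering step can be sketched roughly as follows. This is an illustration under stated assumptions rather than the actual change: `PossibleAddressFilterSketch` and `withoutPermanentlyBlacklisted` are hypothetical names, and addresses are modelled as plain strings. Only entries whose map value is `true` (permanent) are dropped; non-permanent entries are left in the candidate set.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper illustrating the filter applied to possibleAddresses.
class PossibleAddressFilterSketch {
    static Set<String> withoutPermanentlyBlacklisted(
            Collection<String> possibleAddresses,
            Map<String, Boolean> blacklistedAddresses) {
        // Copy so the caller's collection is not modified.
        Set<String> result = new LinkedHashSet<>(possibleAddresses);
        // Drop an address only if its blacklist entry is permanent (value == true).
        result.removeIf(address ->
                Boolean.TRUE.equals(blacklistedAddresses.get(address)));
        return result;
    }
}
```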
However, introducing this blacklist check for `SplitBrainHandler` adds
some further considerations, such as what would happen in this scenario:
```
Member list is defined as "127.0.0.1" for all Nodes.
Node A1 starts on port 5701 and forms cluster "cluster_A" as master
Node A2 starts on port 5702 and joins cluster "cluster_A"
Node B1 starts on port 5703 and forms cluster "cluster_B" as master
Node B2 starts on port 5704 and joins cluster "cluster_B"
*Some time elapses, both A1/B1 have run SplitBrainHandler tasks*
Node A2 is shutdown.
Node B3 starts on port 5702 and joins cluster "cluster_B"
B1 (master) has B3's address blacklisted from when A2 was on port 5702
B1 will now never send SplitBrainJoinMessages to B3 due to this
```
In this scenario, B3 is able to join "cluster_B" successfully because it
seeks out the master itself, so the blacklist is not applied to the
join. B1, however, still holds the stale permanent blacklist entry for
B3's address. This commit resolves the situation by removing a Node's
address from the permanent blacklist within the
`TcpIpJoiner#onMemberAdded` method.
As stated above, addresses are only permanently blacklisted in a
`ClusterMismatchOp`, so if a Node has successfully triggered
`onMemberAdded`, there has clearly been a topology change, meaning the
Node at this address is no longer incompatible. It should therefore be
safe to remove the permanent blacklist entry at that point.
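A minimal sketch of the removal on member addition, replaying the port 5702 scenario above. The class `MemberAddedSketch` and the string-based addresses are hypothetical, and the choice to remove the entry only when it is permanent is an assumption made for this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a joiner that clears stale permanent entries.
class MemberAddedSketch {
    // key: address, value: true if the entry is permanent
    final Map<String, Boolean> blacklistedAddresses = new ConcurrentHashMap<>();

    // Called when a member has successfully joined this node's cluster.
    void onMemberAdded(String memberAddress) {
        // A member that joined cannot be incompatible, so a permanent
        // entry for its address is stale and is removed.
        blacklistedAddresses.remove(memberAddress, Boolean.TRUE);
    }
}
```

For example, if B1 holds a permanent entry for `127.0.0.1:5702` from its encounter with A2, calling `onMemberAdded("127.0.0.1:5702")` when B3 joins clears that entry, so later searches can consider the address again.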
Fixes https://hazelcast.atlassian.net/browse/HZ-2065