Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add connection health checking to static discovery strategy for WAN [HZ-2888] #25364

Merged
merged 2 commits into from Sep 18, 2023

Conversation

JamesHazelcast
Copy link
Contributor

@JamesHazelcast JamesHazelcast commented Sep 4, 2023

This commit applies several improvements to our existing static discovery strategy used in WAN replication; for a full overview of the expectations for these changes, please refer to the description of HZ-2888.

At a high level, this commit introduces health checking for connections in WAN, managed through the StaticDiscoveryStrategy, with various properties defined in StaticDiscoveryProperties, passed through the WanBatchPublisherConfig properties mapping.

Connections are marked as "unhealthy" under the circumstances where they would previously have only been removed from the target endpoints list; these "unhealthy" endpoints are then no longer returned during discovery, and instead they are periodically probed for liveliness. If the connection becomes live again, it will be returned to the active target list during the next discovery task.

ConnectionHealthData is maintained for each unhealthy endpoint, which tracks data related to its health checks, and is used for functionality such as "failed probe" removal, and configurable back-off for checks.

Metrics are now recorded for each WAN connection, with a single metric used for all connections alongside a discriminator containing each individual endpoint's address, and the value being 0 for unhealthy endpoints and 1 for healthy ones.

This commit also contains regression tests for all new functionality, contained within StaticDiscoveryEndpointsTest.

EE PR: https://github.com/hazelcast/hazelcast-enterprise/pull/6442
Closes https://hazelcast.atlassian.net/browse/HZ-2888

…HZ-2888]

This commit applies several improvements to our existing static
discovery strategy used in WAN replication; for a full overview of the
expectations for these changes, please refer to the description of HZ-2888.

At a high level, this commit introduces health checking for connections
in WAN, managed through the `StaticDiscoveryStrategy`, with various
properties defined in `StaticDiscoveryProperties`, passed through the
`WanBatchPublisherConfig` properties mapping.

Connections are marked as "unhealthy" under the circumstances where
they would previously have only been removed from the target endpoints
list; these "unhealthy" endpoints are then no longer returned during
discovery, and instead they are periodically probed for liveliness. If
the connection becomes live again, it will be returned to the active
target list during the next discovery task.

`ConnectionHealthData` is maintained for each unhealthy endpoint, which
tracks data related to its health checks, and is used for functionality
such as "failed probe" removal, and configurable back-off for checks.

Metrics are now recorded for each WAN connection, with a single metric
used for all connections alongside a discriminator containing each
individual endpoint's address, and the value being `0` for unhealthy
endpoints and `1` for healthy ones.

This commit also contains regression tests for all new functionality,
contained within `StaticDiscoveryEndpointsTest`.

Closes https://hazelcast.atlassian.net/browse/HZ-2888
public Set<Address> getUnhealthyEndpoints() {
Set<Address> addresses = null;
for (DiscoveryStrategy strategy : discoveryStrategies) {
Set<Address> local = strategy.getUnhealthyEndpoints();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Address" type and "local" as variable name made me think this is the "local member address" and I started wondering "why do we care about local address here?"... maybe my brain is wired the wrong way, but can you think of a better name for this variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @vbekiaris, thanks - I've renamed both variables in this block to clarify their usage: de3c107

* @param address the address to mark as unhealthy
* @since 5.4
*/
default void markEndpointAsUnhealthy(Address address) { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would remove these 2 new methods from public api, if target is only using it for wan discovery.
It can be made an implementation detail for wan discovery for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ahmetmircik, I've marked these methods as @PrivateApi in both DiscoveryStrategy and DiscoveryService for now 👍 de3c107

Refactor new strategy logic into separate derivative class `HealthTrackingStaticDiscoveryStrategy`, clean up variable names, reduce public API access, add support for legacy system usage.
@JamesHazelcast JamesHazelcast merged commit b003371 into hazelcast:master Sep 18, 2023
8 checks passed
@JamesHazelcast JamesHazelcast deleted the add/5.4/hz-2888 branch September 18, 2023 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants