New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add connection health checking to static discovery strategy for WAN [HZ-2888] #25364
Add connection health checking to static discovery strategy for WAN [HZ-2888] #25364
Conversation
…HZ-2888] This commit applies several improvements to our existing static discovery strategy used in WAN replication; for a full overview of the expectations for these changes, please refer to the description of HZ-2888. At a high level, this commit introduces health checking for connections in WAN, managed through the `StaticDiscoveryStrategy`, with various properties defined in `StaticDiscoveryProperties`, passed through the `WanBatchPublisherConfig` properties mapping. Connections are marked as "unhealthy" under the circumstances where they would previously have only been removed from the target endpoints list; these "unhealthy" endpoints are then no longer returned during discovery, and instead they are periodically probed for liveliness. If the connection becomes live again, it will be returned to the active target list during the next discovery task. `ConnectionHealthData` is maintained for each unhealthy endpoint, which tracks data related to its health checks, and is used for functionality such as "failed probe" removal, and configurable back-off for checks. Metrics are now recorded for each WAN connection, with a single metric used for all connections alongside a discriminator containing each individual endpoint's address, and the value being `0` for unhealthy endpoints and `1` for healthy ones. This commit also contains regression tests for all new functionality, contained within `StaticDiscoveryEndpointsTest`. Closes https://hazelcast.atlassian.net/browse/HZ-2888
public Set<Address> getUnhealthyEndpoints() { | ||
Set<Address> addresses = null; | ||
for (DiscoveryStrategy strategy : discoveryStrategies) { | ||
Set<Address> local = strategy.getUnhealthyEndpoints(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "Address" type and "local" as variable name made me think this is the "local member address" and I started wondering "why do we care about local address here?"... maybe my brain is wired the wrong way, but can you think of a better name for this variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch @vbekiaris, thanks - I've renamed both variables in this block to clarify their usage: de3c107
hazelcast/src/main/java/com/hazelcast/spi/discovery/integration/DiscoveryService.java
Show resolved
Hide resolved
* @param address the address to mark as unhealthy | ||
* @since 5.4 | ||
*/ | ||
default void markEndpointAsUnhealthy(Address address) { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would remove these 2 new methods from public api, if target is only using it for wan discovery.
It can be made an implementation detail for wan discovery for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ahmetmircik, I've marked these methods as @PrivateApi
in both DiscoveryStrategy
and DiscoveryService
for now 👍 de3c107
Refactor new strategy logic into separate derivative class `HealthTrackingStaticDiscoveryStrategy`, clean up variable names, reduce public API access, add support for legacy system usage.
This commit applies several improvements to our existing static discovery strategy used in WAN replication; for a full overview of the expectations for these changes, please refer to the description of HZ-2888.
At a high level, this commit introduces health checking for connections in WAN, managed through the
StaticDiscoveryStrategy
, with various properties defined inStaticDiscoveryProperties
, passed through theWanBatchPublisherConfig
properties mapping.Connections are marked as "unhealthy" under the circumstances where they would previously have only been removed from the target endpoints list; these "unhealthy" endpoints are then no longer returned during discovery, and instead they are periodically probed for liveliness. If the connection becomes live again, it will be returned to the active target list during the next discovery task.
ConnectionHealthData
is maintained for each unhealthy endpoint, which tracks data related to its health checks, and is used for functionality such as "failed probe" removal, and configurable back-off for checks.Metrics are now recorded for each WAN connection, with a single metric used for all connections alongside a discriminator containing each individual endpoint's address, and the value being
0
for unhealthy endpoints and1
for healthy ones.This commit also contains regression tests for all new functionality, contained within
StaticDiscoveryEndpointsTest
.EE PR: https://github.com/hazelcast/hazelcast-enterprise/pull/6442
Closes https://hazelcast.atlassian.net/browse/HZ-2888