Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve xDS health check #5785

Merged
merged 16 commits into from
Aug 8, 2024
Merged

Conversation

jrhee17
Copy link
Contributor

@jrhee17 jrhee17 commented Jun 26, 2024

Motivation:

It is recommended to review #5802 prior to this PR

This changeset attempts to solve several problems:

Custom filter logic

xDS considers all endpoints when computing whether a PrioritySet is in panic state. For instance, if the percentage of unhealthy endpoints exceeds a preconfigured panic threshold, the endpoint selection includes all endpoints regardless of the degraded status. While armeria supports an HealthCheckedEndpointGroup out of the box, it filters out healthy endpoints automatically.

In order to resolve this, I propose that a AbstractHealthCheckedEndpointGroupBuilder#healthCheckedEndpointPredicate API is added

Per-cluster health check configuration

Per-cluster member health check is difficult with the current API since a single parameter set is statically defined for an entire health checked endpoint group.

We already have an abstraction AbstractHealthCheckedEndpointGroupBuilder#newCheckerFactory. I propose that this API be used for the purpose of xDS. In order to support the parameters xDS allows configuring, I propose that parameters are passed to HttpHealthChecker via the constructor instead of the HealthCheckerContext. In order to support this change, HttpHealthChecker has been moved to an internal package so the xds module can also access it.

Modifications:

  • Added APIs AbstractHealthCheckedEndpointGroupBuilder#healthCheckedEndpointPredicate and modified HealthCheckedEndpointGroup to filter endpoints based on the predicate.
    • Health checked Endpoints now have attributes HEALTHY and DEGRADED set.
    • HealthCheckedEndpointGroup#setEndpoints is now called on every invocation of updateHealth, not just when a health status changes.
  • xDS sometimes needs to override the health checked Endpoint, so modified to receive health checked parameters from the HttpHealthChecker constructor instead of HealthCheckerContext.
    • In the process, HttpHealthChecker has been moved to an internal package
    • HealthCheckerContext#originalEndpoint has been added to retrieve the original endpoint. The other APIs haven't been deprecated since it is potentially useful information that other HealthCheckers may be using.
  • Introduced a XdsHealthCheckedEndpointGroupBuilder which implements its own checkerFactory.
  • EndpointUtil#coarseHealth has been updated to consider HEALTHY, DEGRADED attributes when determining health.

Result:

  • The behavior of XdsEndpointGroup is more aligned with envoy in terms of health checking
  • Preparation for zone-aware load balancing

Copy link
Member

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits 😄

* in {@param newAttributes} has higher precedence.
*/
@SuppressWarnings("unchecked")
public Endpoint mergeAttrs(Attributes newAttributes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe unrelated, but I noticed that the current withAttrs behavior is somewhat confusing. withAttr(AttributeKey, ...) doesn't replace the entire attributes but withAttrs() does, and yet they both use with prefix. I feel like we need this change first:

  • Change the semantics of withAttrs() so it doesn't replace all attributes.
  • Introduce replaceAttrs() that provides the original behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this, but note that this PR will act as an umbrella PR to let reviewers know the overall direction of the health check changes.
This PR will be split into multiple PRs for more thorough/readable changes as usual

* A dummy implementation to avoid breaking changes in {@link AbstractHealthCheckedEndpointGroupBuilder}.
*/
Function<Endpoint, HealthCheckerParams> paramsFactory() {
return endpoint -> new HealthCheckerParamsAdapter() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about just always using the same HealthCheckerParams implementation class (or factory method) so we don't generate multiple classes? It looks like a simple value class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, did a cast instead of the previous method chaining.

@jrhee17 jrhee17 added this to the 1.30.0 milestone Jul 9, 2024
@jrhee17 jrhee17 marked this pull request as ready for review July 9, 2024 05:44
ikhoon pushed a commit that referenced this pull request Jul 25, 2024
Motivation:

It seems inconsistent that `Endpoint#withAttr` merges the attribute to
the existing attributes, whereas `Endpoint#withAttrs` simply replaces
the previous attributes.
I propose that we modify the behavior of `Endpoint#withAttrs` to be
aligned with `Endpoint#withAttr`. Additionally, in order to support the
previous behavior, I propose a new API `Endpoint#replaceAttrs` be added.

I've confirmed that there are no uses of this API at least internally.

ref: #5785 (comment)

Modifications:

- Modified the behavior of `Endpoint#withAttrs` to merge attributes
- Added a new API `Endpoint#replaceAttrs` to support the previous
behavior

Result:

- More consistent APIs
- Preparation for #5785

<!--
Visit this URL to learn more about how to write a pull request
description:

https://armeria.dev/community/developer-guide#how-to-write-pull-request-description
-->
* }</pre>
*/
@UnstableApi
public SELF healthCheckedEndpointPredicate(Predicate<Endpoint> healthCheckedEndpointPredicate) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this API is used in xDS now, how about adding it after #5741 is merged?
Because I think the health status needs to be exposed explicitly to users to implement this healthCheckedEndpointPredicate.

public SELF healthCheckedEndpointPredicate(BiPredicate<Endpoint, HealthStatus> healthCheckedEndpointPredicate) { ... }

By the way, sorry for the late review for #5741 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this API is used in xDS now, how about adding it after #5741 is merged?

I didn't think #5741 is related to this PR since I didn't think DEGRADED is not really used internally at the moment. I thought the more important benefit of this PR was that UNHEALTHY endpoints are now considered for priority calculation.

Because I think the health status needs to be exposed explicitly to users to implement this healthCheckedEndpointPredicate.

I see. Am I understanding correctly that you prefer this API is not exposed publicly unless HealthCheckedEndpoint is introduced?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the more important benefit of this PR was that UNHEALTHY endpoints are now considered for priority calculation.

I looked over quicky your PR. Now I understood your purpose.

you prefer this API is not exposed publicly unless HealthCheckedEndpoint is introduced?

I thought at least an utility API should be provided to check the health status of an Endpoint if it takes some time to design HealthCheckedEndpoint.

It seems like no one will be using this API right now, so it is okay to add it. But changes for HealthCheckedEndpoint would be needed in a f/u work to make this API useful to normal users.

@@ -89,6 +94,8 @@ final class DefaultHealthCheckerContext
this.clientOptions = clientOptions;
this.retryBackoff = retryBackoff;
this.onUpdateHealth = onUpdateHealth;
endpointAttributes = Attributes.of(HEALTHY_ATTR, false,
DEGRADED_ATTR, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we cache 4 Attributes.of(HEALTHY_ATTR, false|true, DEGRADED_ATTR, false|true) in a static map and reuse it?

Copy link
Contributor Author

@jrhee17 jrhee17 Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

*/
@UnstableApi
public SELF initialStateResolver(
Function<? super List<Endpoint>, ? extends List<Endpoint>> initialStateResolver) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional) What do you think of changing this method to take an Endpoint at once? I think this method resembles a combination of map + filter in a role.

initialStateResolver(endpoint -> endpoint)
initialStateResolver(endpoint -> null)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, since the xDS implementation needs the health check attribute to be retained from filtered initialStateResolver, an AttributeRetainingEndpointGroup has been introduced

Copy link
Contributor Author

@jrhee17 jrhee17 Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spent way too long debugging an issue related to this. I realized that initialStateResolver can have a race condition with the first update of healthy candidates.

I've just removed initialStateResolver.

Copy link
Contributor

@ikhoon ikhoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @jrhee17 🙇‍♂️🙇‍♂️

* }</pre>
*/
@UnstableApi
public SELF healthCheckedEndpointPredicate(Predicate<Endpoint> healthCheckedEndpointPredicate) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the more important benefit of this PR was that UNHEALTHY endpoints are now considered for priority calculation.

I looked over quicky your PR. Now I understood your purpose.

you prefer this API is not exposed publicly unless HealthCheckedEndpoint is introduced?

I thought at least an utility API should be provided to check the health status of an Endpoint if it takes some time to design HealthCheckedEndpoint.

It seems like no one will be using this API right now, so it is okay to add it. But changes for HealthCheckedEndpoint would be needed in a f/u work to make this API useful to normal users.

@jrhee17 jrhee17 force-pushed the feature/xds-health-check branch 2 times, most recently from 78d9d71 to ee7512c Compare August 6, 2024 06:00
@jrhee17
Copy link
Contributor Author

jrhee17 commented Aug 7, 2024

This PR is now back to reliably passing the CI

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fantastic. 👍 👍 👍

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍 👍 👍 👍

@jrhee17 jrhee17 merged commit c7aca10 into line:main Aug 8, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants