Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug where all healthy endpoints are removed as HealthCheckedEndpointGroup is initialized #5343

Merged
merged 2 commits into from Jan 22, 2024

Conversation

ikhoon
Copy link
Contributor

@ikhoon ikhoon commented Dec 12, 2023

Motivation:

If new Endpoints are updated in HealthCheckedEndpointGroup, HealthCheckedEndpointGroup creates a new HealthCheckContextGroup. When the new HealthCheckContextGroup is initialized, the old HealthCheckContextGroup is removed for rolling updates.

The removal logic is added as a fallback to contextGroup.whenInitialized().

// Remove old contexts when the newly created contexts are fully initialized to smoothly transition
// to new endpoints.
contextGroup.whenInitialized().handle((unused, cause) -> {
if (cause != null && !initialized) {
if (logger.isWarnEnabled()) {
logger.warn("The first health check failed for all endpoints. " +
"numCandidates: {} candidates: {}",
candidates.size(), truncate(candidates, 10), cause);
}
}
initialized = true;
destroyOldContexts(contextGroup);
setEndpoints(allHealthyEndpoints());
Since context groups are stored in the order in which they were inserted, if the callback added first is executed first, the old value will always be deleted and the latest value will be maintained.

CompletableFuture uses a stack structure for callbacks, so the last callback will be executed first when it completes. If all contexts see the same future completion event, the old context group could remove the new context group. For example, D context group removes A to C context groups, and then C context group removes groups until it finds itself. But there is no C in the context group chain. As a result, C removes D, the newest and last one.

This situation will rarely occur when:

  • An HealthCheckedEndpointGroup is not initialized yet.
  • The delegate is updated with new endpoints that have the same value as the previous endpoints, but there are duplicate endpoints. The new endpoints create a new HealthCheckContextGroup that shares all HttpHealthCheckers with the previous one.

Modifications:

  • Do not try to remove the old HealthCheckContextGroup with a context group if it was removed before.
  • Fix a bug where the reference count of DefaultHealthCheckerContext is incorrectly counted if there are duplicate endpoints.
  • Use the endpoints selected by HealthCheckStrategy instead of using the original endpoints updated by the delegate.

Result:

You no longer see EndpointSelectionTimeoutException when HealthCheckedEndpointGroup is initialized with duplicate Endpoints.

…dpointGroup` is initialized

Motivation:

If new `Endpoint`s are updated in `HealthCheckedEndpointGroup`,
`HealthCheckedEndpointGroup` creates a new `HealthCheckContextGroup`.
When the new `HealthCheckContextGroup` is initialized, the old
`HealthCheckContextGroup` are removed for rolling updates.

The removal logic is added as a fallback to `contextGroup.whenInitialized()`.
https://github.com/line/armeria/blob/3f54be0ce4370b24977994e247fa3816fde25e29/core/src/main/java/com/linecorp/armeria/client/endpoint/healthcheck/HealthCheckedEndpointGroup.java#L175-L187
Since context groups are stored in the order in which they were inserted,
if the callback added first is executed first, the old value will always
be deleted and the latest value will be maintained.

`CompletableFuture` uses a stack structure for callbacks so the last
callback will be executed first when it completes. If all contexts has
the same endpoints, the old context group could remove the new context
group.

For example, `D` context group removes `A` to `C` context groups and
then `C` context group removes groups until it finds itself. As a
result, `contextGroupChain` becomes empty.

This situation will rarely occur when:
- An `HealthCheckedEndpointGroup` is not initialized yet.
- The delegate is updated new endpoints which have the same value as the
  previous endpoints, but there are duplicate endpoints.
  The different length of endpoints creates a new `HealthCheckContextGroup`
  that shares all `HttpHealthChecker`s with the previous one.

Modifications:

- Do not try to remove old `HealthCheckContextGroup` with a context
  group if it was removed before.
- Fix a bug where the reference count of `DefaultHealthCheckerContext`
  is incorrectly counted if there are duplicate endpoints.
- Use the endpoints selected by `HealthCheckStrategy` instead of using
  the original endpoints updated by the delegate.

Result:

You no longer see `EndpointSelectionTimeoutException` when
`HealthCheckedEndpointGroup` is initialized with duplicate `Endpoint`s.
@ikhoon ikhoon added the defect label Dec 12, 2023
@ikhoon ikhoon added this to the 1.27.0 milestone Dec 12, 2023
Comment on lines +262 to +265
if (!contextGroupChain.contains(contextGroup)) {
// The contextGroup is already removed by another callback of `contextGroup.whenInitialized()`.
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change prevents the old contextGroup from deleting new contextGroups.

Copy link

codecov bot commented Dec 12, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3f54be0) 73.94% compared to head (b001b46) 73.95%.

❗ Current head b001b46 differs from pull request most recent head a8106e1. Consider uploading reports for the commit a8106e1 to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #5343   +/-   ##
=========================================
  Coverage     73.94%   73.95%           
- Complexity    20104    20110    +6     
=========================================
  Files          1730     1730           
  Lines         74161    74165    +4     
  Branches       9465     9467    +2     
=========================================
+ Hits          54841    54849    +8     
+ Misses        14844    14840    -4     
  Partials       4476     4476           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@jrhee17 jrhee17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompletableFuture uses a stack structure for callbacks, so the last callback will be executed first when it completes.

TIL! The changes make sense. Thanks @ikhoon 🙇 👍 🙇

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice finding. Thanks, @ikhoon! 👍

@ikhoon ikhoon merged commit e06e883 into line:main Jan 22, 2024
13 checks passed
@ikhoon ikhoon deleted the healthcheck-group-bugs branch January 22, 2024 02:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants