Provide a way to detect unhealthy connections #5763

Merged
merged 14 commits into from
Aug 7, 2024

Conversation

@ikhoon ikhoon commented Jun 14, 2024

Motivation:

Currently, there is no extension point to detect errors for specific connections and terminate connections that are unhealthy.

Related: #5717 #5751

API design:

OutlierDetector is similar to CircuitBreaker, but its state is simpler and moves in only one direction, and it is optimized for ephemeral resources such as connections and streams.

  • OutlierDetector determines whether the target is an outlier based on onSuccess() and onFailure() events.
  • OutlierDetectingRule is used to decide whether a request fails.
  • OutlierDetectionDecision is the result of OutlierDetectingRule.
    • OutlierDetectionDecision.FATAL is a special result that can immediately mark the target as an outlier.

Example:

OutlierDetectingRule rule =
  OutlierDetectingRule
    .builder()
    .onServerError()
    .onException(IOException.class)
    .onException(WriteTimeoutException.class, OutlierDetectionDecision.FATAL)
    .build();

OutlierDetection outlierDetection =
  OutlierDetection
    .builder(rule)
    .counterSlidingWindow(Duration.ofSeconds(10))
    .counterUpdateInterval(Duration.ofSeconds(1))
    .failureRateThreshold(0.5)
    .build();

ClientFactory
  .builder()
  // Apply the OutlierDetection to detect and close unhealthy connections
  .connectionOutlierDetection(outlierDetection)
  .build();
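
To make the one-directional state concrete, here is a minimal standalone sketch that does not use Armeria at all. The names `Decision` and `SimpleOutlierDetector` are hypothetical, the counters are cumulative rather than a sliding window, and the `>=` comparison against the threshold is an assumption, not necessarily what this PR implements.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a decision type; not Armeria's actual class.
enum Decision { SUCCESS, FAILURE, FATAL, IGNORE }

// One-directional detector: once isOutlier() returns true it never flips back.
final class SimpleOutlierDetector {
    private final double failureRateThreshold;
    private final long minimumRequestThreshold;
    private final AtomicLong successes = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();
    private volatile boolean outlier;

    SimpleOutlierDetector(double failureRateThreshold, long minimumRequestThreshold) {
        this.failureRateThreshold = failureRateThreshold;
        this.minimumRequestThreshold = minimumRequestThreshold;
    }

    void report(Decision decision) {
        switch (decision) {
            case SUCCESS:
                successes.incrementAndGet();
                break;
            case FAILURE:
                failures.incrementAndGet();
                evaluate();
                break;
            case FATAL:
                // A fatal decision marks the target immediately, regardless of counts.
                outlier = true;
                break;
            default:
                // IGNORE: the event does not affect the counters.
                break;
        }
    }

    private void evaluate() {
        long failed = failures.get();
        long total = failed + successes.get();
        // Assumption: the rate is only checked once enough requests have been seen.
        if (total >= minimumRequestThreshold
            && (double) failed / total >= failureRateThreshold) {
            outlier = true;
        }
    }

    boolean isOutlier() {
        return outlier;
    }
}

final class OutlierSketchDemo {
    public static void main(String[] args) {
        SimpleOutlierDetector detector = new SimpleOutlierDetector(0.5, 4);
        detector.report(Decision.FATAL);
        System.out.println("outlier after FATAL: " + detector.isOutlier());
    }
}
```

Unlike a circuit breaker, there is no half-open or closed state to return to; the owner of the resource is expected to discard it once it is marked.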

Modifications:

  • Add OutlierDetector, OutlierDetectingRule and OutlierDetectionDecision and their implementations to the common.outlier package.
  • Move EventCounter, EventCount and SlidingWindowCounter to the common.util package and expose EventCounter and EventCount as public APIs to minimize duplication.
    • SlidingWindowCounter can be created with EventCounter.ofSlidingWindow(...).
  • Unlike CircuitBreakerRule, OutlierDetectingRule returns a decision synchronously and is simplified to look only at headers and causes, because:
    • I did not find the response content necessary to detect an outlier.
    • The response status and exception type are sufficient information to determine the result.
  • Create an OutlierDetector in HttpSessionHandler.
    • A hook applying OutlierDetector is added right after HttpSessionHandler.invoke() is called.
  • (Deprecation) CircuitBreakerListener.onEventCountUpdated(String, .circuitbreaker.EventCount) has been deprecated in favor of CircuitBreakerListener.onEventCountUpdated(String, .util.EventCount).
  • Add ClientFactoryBuilder.connectionOutlierDetection(OutlierDetection) to detect unhealthy connections.
    • This option is disabled by default.
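
As a rough illustration of a rule that decides synchronously from only the response status and the exception type, consider the standalone sketch below. `RuleDecision` and `StatusAndCauseRule` are hypothetical names, and the convention of passing `-1` when no response headers were received is an assumption made for this example only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical decision type for this sketch; not Armeria's actual class.
enum RuleDecision { SUCCESS, FAILURE, FATAL, IGNORE }

final class StatusAndCauseRule {
    // Insertion order matters: the first matching exception type wins.
    private final Map<Class<? extends Throwable>, RuleDecision> exceptionDecisions =
            new LinkedHashMap<>();

    StatusAndCauseRule onException(Class<? extends Throwable> type, RuleDecision decision) {
        exceptionDecisions.put(type, decision);
        return this;
    }

    // statusCode is the HTTP status, or -1 if no headers were received (assumption).
    // cause may be null. No response content is inspected, so no future is needed.
    RuleDecision decide(int statusCode, Throwable cause) {
        if (cause != null) {
            for (Map.Entry<Class<? extends Throwable>, RuleDecision> e
                    : exceptionDecisions.entrySet()) {
                if (e.getKey().isInstance(cause)) {
                    return e.getValue();
                }
            }
        }
        if (statusCode >= 500 && statusCode <= 599) {
            return RuleDecision.FAILURE;
        }
        if (cause == null && statusCode > 0) {
            return RuleDecision.SUCCESS;
        }
        // Unmatched exception without a usable status: do not count it.
        return RuleDecision.IGNORE;
    }
}

final class RuleSketchDemo {
    public static void main(String[] args) {
        StatusAndCauseRule rule = new StatusAndCauseRule()
                .onException(java.io.IOException.class, RuleDecision.FAILURE);
        System.out.println(rule.decide(503, null));
    }
}
```

Because the decision depends only on values that are already available when the response headers or the exception arrive, the method can return a plain value rather than an asynchronous result.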

Result:

@ikhoon ikhoon added this to the 1.30.0 milestone Jun 17, 2024
@ikhoon ikhoon marked this pull request as ready for review June 17, 2024 10:17
Contributor

@jrhee17 jrhee17 left a comment

Looks good in terms of correctness. Left only nit comments 👍 👍 👍

* {@link OutlierDetectingRule} and {@link OutlierDetector}.
*/
@UnstableApi
public interface OutlierDetection {
Contributor

Optional) would OutlierDetectorFactory be a clearer name?

Contributor Author

I also considered the names OutlierConfig and OutlierContext.
OutlierDetectorFactory wasn't chosen because the interface also holds an OutlierDetectingRule.

I don't have a strong opinion on the name. Let's listen to other folks' opinions and choose a better one.

Member

OutlierDetection sounds OK to me, but OutlierDetectingRule could be renamed to OutlierRule?

Member

OutlierDetectorFactory wasn't chosen since it had OutlierDetectingRule.

It seems like OutlierDetectingRule is used alongside OutlierDetector, and both are retrieved and created from this OutlierDetection. What do you think of adding OutlierDetectingRule to OutlierDetector instead?

outlierDetector.rule().decide(...)

outlierDetector.onSuccess();
// The connection was marked as an outlier, but it's back to normal.
if (outlierDetector.isOutlier()) {
markUnacquirable();
Contributor

Note: I would still recommend that users (especially those using HTTP/1) use a RetryingClient, since requests that unluckily acquired this connection could fail because of this call, depending on the timing. Although the code path to markUnacquirable is already exposed to users, I feel the timing of this call is less predictable with this change.

e.g. If canSendRequest = false:

id = session.incrementAndGetNumRequestsSent();

Contributor Author

Good point. What do you think of explicitly executing this callback in the EventLoop to guarantee the execution order?

* A builder for creating an {@link OutlierDetectingRule}.
*/
@UnstableApi
public final class OutlierDetectingRuleBuilder {
Member

Can we extend AbstractRuleBuilder?

Contributor Author

They are similar, but AbstractRuleBuilder is in the client package and uses ClientRequestContext, while OutlierDetectingRuleBuilder uses RequestContext.

Member

@minwoox minwoox left a comment

Looks great! Left a few suggestions. 👍

Contributor Author

ikhoon commented Jul 4, 2024

It seems like OutlierDetectingRule is used alongside OutlierDetector, and both are retrieved and created from this OutlierDetection. What do you think of adding OutlierDetectingRule to OutlierDetector instead?

They are used together, but the roles of OutlierDetector and OutlierDetectingRule are different. OutlierDetector is designed as a utility and may be used alone, so I didn't want to force users who only need OutlierDetector to implement OutlierDetectingRule.

Member

minwoox commented Jul 4, 2024

OutlierDetector is designed as a utility. It may be used alone.

Haven't thought about this case. Then, I'm fine with the current design. 👍

Member

@minwoox minwoox left a comment

I'm not sure that you want to use this for xDS but I found there's a mismatch with the Envoy implementation so I've left a question. PTAL. 😉

private final EventCounter counter;
private final double failureRateThreshold;
private final long minimumRequestThreshold;
private volatile boolean isOutlier;
Member

So once it becomes an outlier, the client will never use it again, unlike Envoy, which brings the host back after the ejection time. Is that correct?
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#ejection-algorithm

Contributor Author

@ikhoon ikhoon Aug 6, 2024

The outlier algorithm of Envoy is designed at the host (endpoint) level. Hosts do not tend to change for a long time. So, after ejection_time, sending requests again like a circuit breaker is necessary.

To provide the same functionality as Envoy, we may create an OutlierDetectingEndpointGroup using OutlierDetector.

  • If a host is determined as an outlier, the host is excluded from healthy endpoints for ejection_time.
  • After ejection_time, if the host is still in the endpoints, a new OutlierDetector is created for it.
  • The ejection_time option will be provided by OutlierDetectingEndpointGroup.
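
The ejection behavior in the bullets above could look roughly like the following standalone sketch. `EjectingHostTracker` is a hypothetical helper, not the proposed OutlierDetectingEndpointGroup; hosts are plain strings and the clock is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper that tracks which hosts are ejected and until when.
final class EjectingHostTracker {
    private final long ejectionTimeMillis;
    private final Map<String, Long> ejectedUntil = new ConcurrentHashMap<>();

    EjectingHostTracker(long ejectionTimeMillis) {
        this.ejectionTimeMillis = ejectionTimeMillis;
    }

    // Called when the per-host detector has marked the host as an outlier.
    void eject(String host, long nowMillis) {
        ejectedUntil.put(host, nowMillis + ejectionTimeMillis);
    }

    // Returns the hosts that are not currently ejected. An expired ejection is
    // cleared, which is the point where a caller would create a fresh detector.
    List<String> healthyHosts(List<String> allHosts, long nowMillis) {
        List<String> healthy = new ArrayList<>();
        for (String host : allHosts) {
            Long until = ejectedUntil.get(host);
            if (until == null) {
                healthy.add(host);
            } else if (nowMillis >= until) {
                ejectedUntil.remove(host);
                healthy.add(host);
            }
        }
        return healthy;
    }
}

final class EjectionSketchDemo {
    public static void main(String[] args) {
        EjectingHostTracker tracker = new EjectingHostTracker(1000);
        tracker.eject("a", 0);
        System.out.println(tracker.healthyHosts(Arrays.asList("a", "b"), 500));
    }
}
```

A host that has been removed from the endpoint list simply never appears in `allHosts` again, so its stale entry has no effect on the result.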

In fact, I implemented a similar EndpointGroup internally for the LINE push server. It would be better to modify the implementation and add it to Armeria.

Member

Understood. 👍

Member

@minwoox minwoox left a comment

Thanks! 👍 👍 👍

@ikhoon ikhoon merged commit dd14024 into line:main Aug 7, 2024
14 of 15 checks passed
@ikhoon ikhoon deleted the outlier-detection branch August 7, 2024 09:27
Development

Successfully merging this pull request may close these issues.

Outlier detection for connections
4 participants