Handle destination gray failures better (Degraded health state) #2071

Open
davidni opened this issue Mar 15, 2023 · 1 comment
Labels
Type: Idea This issue is a high-level idea for discussion.

davidni (Contributor) commented Mar 15, 2023

What should we add or change to make your life better?

The existing health policies (e.g. ConsecutiveFailuresHealthPolicy) can be problematic when destinations are neither entirely up nor entirely down. This happens in practice, and the current policies then produce sub-optimal routing decisions.

Why is this important to you?

This would help maintain high availability for external scenarios despite transient instability of internal services and/or machines.

Proposal

  • Introduce a new Degraded destination health state (see also: Degraded health state #1011)
  • Modify destination selection policy to prefer destinations in Healthy state over those in Degraded state, but still pick Degraded destinations if there are "few" (according to some criteria) Healthy ones
  • Modify active health checks to allow a destination to declare itself to be degraded if it would like to not be a preferred choice for traffic, while allowing YARP to make the final decision whether to use this destination or not based on all signals it has access to
  • Destination health transitions should avoid spurious Healthy and Unhealthy determinations based on single health observations. The diagram below proposes a symmetric decision process where a transition out of the Degraded state requires three consecutive consistent health observations
--- 
title: Destination health state diagram (PROPOSAL)
--- 

stateDiagram-v2 
    [*] --> Unknown 
    Unknown --> Healthy: active probe == "success" 
    Unknown --> Unhealthy: active probe != "success" 
    Degraded --> Healthy: active probe == "success"\n(3rd consecutive) 
    Healthy --> Degraded: proxy result == "degraded" OR\nactive probe == "degraded" OR\nactive probe != "success" in 2 of past 5 evals
    Degraded --> Unhealthy: active probe != "success"\n(3rd consecutive)
    Unhealthy --> Degraded: active probe == "success"
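
To make the proposed transitions easier to reason about, here is a minimal Python sketch of the state diagram above. It is not YARP code: the class `DestinationHealth`, its method names, and the result strings other than "success" and "degraded" are hypothetical. It reads the diagram literally, so a "degraded" probe counts as non-success everywhere except on the Healthy --> Degraded edge, which may or may not be the intent. It also assumes that streaks and the 5-evaluation window reset whenever the state changes, which the diagram does not specify.

```python
from collections import deque
from enum import Enum


class Health(Enum):
    UNKNOWN = "Unknown"
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


class DestinationHealth:
    """Tracks one destination's state per the proposed diagram (illustrative only)."""

    def __init__(self, window=5, failures_in_window=2, streak_needed=3):
        self.state = Health.UNKNOWN
        self._window = deque(maxlen=window)    # recent probe outcomes while Healthy
        self._failures_in_window = failures_in_window
        self._streak_needed = streak_needed
        self._success_streak = 0
        self._failure_streak = 0

    def _enter(self, state):
        # Assumption: all history is discarded on every state change.
        self.state = state
        self._window.clear()
        self._success_streak = 0
        self._failure_streak = 0

    def on_proxy_result(self, result):
        # Passive signal: only the Healthy --> Degraded edge uses it.
        if self.state is Health.HEALTHY and result == "degraded":
            self._enter(Health.DEGRADED)
        return self.state

    def on_active_probe(self, result):
        ok = result == "success"
        if self.state is Health.UNKNOWN:
            self._enter(Health.HEALTHY if ok else Health.UNHEALTHY)
        elif self.state is Health.HEALTHY:
            self._window.append(ok)
            failures = sum(1 for outcome in self._window if not outcome)
            if result == "degraded" or failures >= self._failures_in_window:
                self._enter(Health.DEGRADED)
        elif self.state is Health.DEGRADED:
            if ok:
                self._success_streak += 1
                self._failure_streak = 0
            else:
                self._failure_streak += 1
                self._success_streak = 0
            if self._success_streak >= self._streak_needed:
                self._enter(Health.HEALTHY)
            elif self._failure_streak >= self._streak_needed:
                self._enter(Health.UNHEALTHY)
        elif self.state is Health.UNHEALTHY:
            if ok:
                self._enter(Health.DEGRADED)
        return self.state
```

With these assumptions, a destination that recovers from Unhealthy needs one successful probe to reach Degraded and then three more consecutive successes to reach Healthy.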

NOTE: This is related to #1011 but goes beyond it in scope. Filing as a separate issue seemed appropriate.
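
The selection rule from the proposal (prefer Healthy, fall back to Degraded when Healthy destinations are "few") could look roughly like the Python sketch below. The function `pick_destination` and the `min_healthy` threshold are hypothetical; the issue deliberately leaves the "few" criterion open, and a fixed count is only one possible choice. Treating Unknown destinations as preferred candidates is also an assumption made for this sketch.

```python
import random
from enum import Enum


class Health(Enum):
    UNKNOWN = "Unknown"
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


def pick_destination(health_by_destination, min_healthy=2, rng=random):
    """Pick a destination name, preferring Healthy ones (illustrative only).

    health_by_destination maps a destination name to its Health state.
    Returns None when no usable destination exists.
    """
    # Assumption: Unknown destinations are treated like Healthy ones.
    preferred = [
        name for name, state in health_by_destination.items()
        if state in (Health.HEALTHY, Health.UNKNOWN)
    ]
    degraded = [
        name for name, state in health_by_destination.items()
        if state is Health.DEGRADED
    ]

    if len(preferred) >= min_healthy:
        candidates = preferred               # enough Healthy capacity
    else:
        candidates = preferred + degraded    # "few" Healthy: widen the pool

    if not candidates:
        return None
    return rng.choice(candidates)
```

Unhealthy destinations are never chosen in this sketch. A real policy might instead weight Degraded destinations lower rather than excluding them outright while Healthy capacity is sufficient.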

karelz (Member) commented Nov 28, 2023

We should consider letting Affinity still target Degraded nodes (see #2335 (comment))
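
One way to picture this suggestion is a small Python sketch in which an affinitized request keeps its destination while that destination is Healthy or Degraded, and falls back to normal selection otherwise. The function `resolve_affinity` and its arguments are hypothetical and are not taken from YARP or from #2335.

```python
from enum import Enum


class Health(Enum):
    UNKNOWN = "Unknown"
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNHEALTHY = "Unhealthy"


def resolve_affinity(affinity_target, health_by_destination):
    """Return the affinitized destination if it is still usable (illustrative only).

    Returns None to signal that the caller should fall back to the
    regular destination selection policy.
    """
    state = health_by_destination.get(affinity_target)
    # Assumption: Degraded is still acceptable for affinitized traffic.
    if state in (Health.HEALTHY, Health.DEGRADED):
        return affinity_target
    return None
```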
