docs: add design proposal for L4 HealthCheck #404

Merged: 4 commits merged into master from docs/health-checks-proposal on Dec 11, 2019

@yskopets yskopets commented Nov 2, 2019

Summary

  • add design proposal for L4 HealthCheck

Related issues

#393

@yskopets yskopets force-pushed the docs/health-checks-proposal branch 2 times, most recently from 3e94429 to 96ad7f1 Compare November 2, 2019 18:32
@jakubdyszkiewicz jakubdyszkiewicz left a comment

Nice research!

Just for reference, here is how Consul works. Each Agent health-checks its own services and sends a healthy/unhealthy event to the Master only when there is a change. A service is healthy only when its HC passes and its Agent is up and running. At the same time, Agents are aware of each other through the Gossip Protocol, so when an Agent goes down, nearby Agents detect this and send a message to the Master, and the service becomes unhealthy.
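The "report only on change" part of that model can be sketched as follows. This is a minimal illustration, not Consul's API: the class name `ChangeOnlyReporter`, the `send` callback, and the service name `backend` are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple


class ChangeOnlyReporter:
    """Forwards a health status upstream only when it differs from the last one sent."""

    def __init__(self, send: Callable[[str, bool], None], service: str) -> None:
        self._send = send          # hypothetical callback standing in for "send event to Master"
        self._service = service
        self._last: Optional[bool] = None  # nothing reported yet

    def observe(self, healthy: bool) -> bool:
        """Record one local check result; return True if an event was sent."""
        if healthy == self._last:
            return False           # no change, stay silent
        self._last = healthy
        self._send(self._service, healthy)
        return True


sent: List[Tuple[str, bool]] = []
reporter = ChangeOnlyReporter(lambda svc, ok: sent.append((svc, ok)), "backend")
for result in [True, True, False, False, True]:
    reporter.observe(result)
print(sent)  # [('backend', True), ('backend', False), ('backend', True)]
```

Five local checks produce only three events, which is why this model keeps control-plane traffic low when statuses are stable.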

```yaml
passiveChecks:
  unhealthy_threshold: 3
  penalty_interval: 10s # for how long endpoint should be considered unhealthy
```
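One plausible reading of these two settings, sketched below, is that an endpoint is taken out of rotation after `unhealthy_threshold` consecutive failed requests and stays out for `penalty_interval` seconds. The exact semantics are an assumption here (the snippet does not say whether failures must be consecutive), and the `PassiveCheck` class is hypothetical.

```python
class PassiveCheck:
    """Tracks one endpoint's health from observed request outcomes (no probes)."""

    def __init__(self, unhealthy_threshold: int = 3, penalty_interval: float = 10.0) -> None:
        self.unhealthy_threshold = unhealthy_threshold
        self.penalty_interval = penalty_interval  # seconds
        self.consecutive_failures = 0
        self.ejected_until = 0.0                  # timestamp; 0.0 means never ejected

    def record(self, success: bool, now: float) -> None:
        """Feed in the outcome of one real request at time `now` (seconds)."""
        if success:
            self.consecutive_failures = 0         # assumption: any success resets the count
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.unhealthy_threshold:
            self.ejected_until = now + self.penalty_interval
            self.consecutive_failures = 0

    def is_healthy(self, now: float) -> bool:
        return now >= self.ejected_until


check = PassiveCheck(unhealthy_threshold=3, penalty_interval=10.0)
for t in (0.0, 1.0, 2.0):
    check.record(success=False, now=t)
print(check.is_healthy(5.0), check.is_healthy(12.0))  # False True
```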
Contributor

When I see source and destination, I think of the connection between apps. For TrafficPermissions that makes sense, because we want to secure the connection between apps. The same goes for TrafficLogging. For passive HC it also makes sense, because we are checking the health of the connection.

For active HC I'd say these semantics are confusing. We want to define an active HC for the application, not for the connection between applications. Maybe we should come up with different semantics for this, such as a target instead of sources + destinations.


Conclusions:
* we can already use `Health xDS` for `Envoy -> local app` health checks
* changes to the Envoy will be necessary to use `Health xDS` for `Envoy -> upstream` "health checks" (add support for mTLS)
Contributor

What is the use case for using HDS for `Envoy -> upstream`? Wouldn't it be better to HC only your local app and send the status to the CP, which then updates the list of endpoints for the dataplanes that use this service?

Contributor Author

Success of the `Envoy -> local app` check doesn't give the full picture:

  • it doesn't account for mTLS between client and server
  • it doesn't account for different geographical location (e.g., connectivity to stand-by instances in another datacenter)


## Requirements

1. support `Envoy -> upstream` "health checks"
Contributor

I don't think this scales very well, at least for active health checks.
Let's say we've got an app `backend` and 10 other apps, each with 10 instances, that communicate with it. If an HC is sent every second, we generate 100 rps of health-check traffic to backend.
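The arithmetic behind that figure, as a small sketch. It assumes every client sidecar probes the backend independently at the same rate, which would make the result the load on each backend instance; the function name is hypothetical.

```python
def active_check_rps(client_apps: int, instances_per_app: int,
                     checks_per_second: float = 1.0) -> float:
    """Health-check requests per second arriving at one backend instance.

    Assumes each client instance runs its own sidecar and each sidecar
    sends its own probes, so load grows with the number of clients.
    """
    probing_sidecars = client_apps * instances_per_app
    return probing_sidecars * checks_per_second


per_instance = active_check_rps(client_apps=10, instances_per_app=10)
print(per_instance)  # 100.0
```

Lowering the probe rate helps only linearly: at one check every 10 seconds the same topology still produces about 10 rps of pure health-check traffic per backend instance.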

Contributor Author

It's always a user's choice.

Active health checks make perfect sense when your infrastructure is not big.

Conclusions:
* we can already use `Health xDS` for `Envoy -> local app` health checks
* changes to the Envoy will be necessary to use `Health xDS` for `Envoy -> upstream` "health checks" (add support for mTLS)
* changes to the Envoy will be necessary to send event logs to the Control Plane (instead of logging to a file)
Contributor

Does it make sense to use active HC events when HDS is used?

As for passive HC (outlier detection), I think it is "very local" to the connection A -> B. What would the control plane do with the information that B does not work from A's perspective?

Contributor Author

@yskopets yskopets Nov 4, 2019

The goal of a Control Plane is to be smart and help users in every possible way.

E.g.,

  • make it visible that the problem is local to a single dataplane
  • make it visible that the problem is specific to a certain geo location
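A rough sketch of how a control plane might derive those two signals from passive-check reports. Everything here is hypothetical (the function, the dataplane names, the zone labels); it only illustrates the classification idea and is not part of the proposal.

```python
from typing import Dict, Set


def classify_failures(failing: Set[str], zone_of: Dict[str, str]) -> str:
    """Classify who is reporting a given upstream as unhealthy.

    failing: names of client dataplanes currently reporting failures.
    zone_of: maps every known client dataplane name to its geo location.
    """
    if not failing:
        return "healthy"
    if len(failing) == 1:
        return "local to one dataplane"
    failing_zones = {zone_of[dp] for dp in failing}
    all_zones = set(zone_of.values())
    # Only call it zone-specific if other zones exist and none of them complain.
    if len(failing_zones) == 1 and len(all_zones) > 1:
        return "specific to zone " + next(iter(failing_zones))
    return "widespread"


zone_of = {"web-1": "eu", "web-2": "eu", "api-1": "us", "api-2": "us"}
print(classify_failures({"web-1"}, zone_of))           # local to one dataplane
print(classify_failures({"web-1", "web-2"}, zone_of))  # specific to zone eu
print(classify_failures({"web-1", "api-1"}, zone_of))  # widespread
```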

@yskopets yskopets requested a review from a team December 4, 2019 13:11
@yskopets yskopets merged commit 57e210b into master Dec 11, 2019
@yskopets yskopets deleted the docs/health-checks-proposal branch December 19, 2019 19:41