Health checks and dashboard for connectivity between zones #1907

Open
jakubdyszkiewicz opened this issue Apr 29, 2021 · 16 comments
Labels
area/multizone area/observability kind/design Design doc or related triage/accepted The issue was reviewed and is complete enough to start working on it

Comments

@jakubdyszkiewicz
Contributor

Summary

Right now we have pretty good observability of the connection from Remote to Global. You can easily see in Global whether a Remote is online or not.

The problem is that once every zone is connected to Global, we don't provide any tools to check connectivity from one zone to another.

Observability

I'd like to have a dashboard in the GUI to see connectivity between zones. Something like this:

image

where there is a separate Zone on each end. If the connection is up, the line is green; if not, the line is red.

(It could also be just a simple table with every zone as a column and every zone as a row.)
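
To make the table variant concrete, here is a minimal sketch of how pairwise results could be rendered as a zone-by-zone matrix. The function name, the zone names, and the shape of the `results` mapping are all assumptions made for illustration; none of this is an existing Kuma API.

```python
from typing import Dict, List, Tuple


def render_matrix(zones: List[str], results: Dict[Tuple[str, str], bool]) -> str:
    """Render a text table with source zones as rows and target zones as columns.

    results[(src, dst)] is True when src could reach the ingress of dst
    (hypothetical shape). Unknown pairs render as '?', the diagonal as '-'.
    """
    header = "from/to"
    width = max([len(header)] + [len(z) for z in zones])

    def cell(src: str, dst: str) -> str:
        if src == dst:
            return "-"
        ok = results.get((src, dst))
        if ok is None:
            return "?"
        return "up" if ok else "DOWN"

    rows = [[header, *zones]]
    for src in zones:
        rows.append([src, *(cell(src, dst) for dst in zones)])
    return "\n".join(" | ".join(c.ljust(width) for c in row) for row in rows)


if __name__ == "__main__":
    zones = ["zone-a", "zone-b"]
    print(render_matrix(zones, {("zone-a", "zone-b"): True,
                                ("zone-b", "zone-a"): False}))
```

Note that the matrix is directional: zone-a reaching zone-b does not imply the reverse, so both cells are tracked separately.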

Traffic reliability

If Global <-> Remote communication works but Zone <-> Zone does not, we still include the Ingress endpoint in EDS. Of course, Envoy can exclude this endpoint with Health Checks / Circuit Breaker + Retries, but we should do this beforehand.

If Remote A knows that it cannot connect to the Ingress of Remote B, it should not include that Ingress in EDS.

Overview of potential implementation

  • Build a component that runs on the Remote CP (leader), health-checks the Ingresses of other zones (just by opening a TCP connection), and saves this information in a Config called zone-healthcheck
  • When building the XDS config, check the zone-healthcheck Config and exclude an Ingress if we see that its HC failed
  • Propagate zone-healthcheck from every Remote CP to the Global CP
  • Build an API based on the zone-healthcheck Configs from every zone
  • Build a dashboard based on this API
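
The first two steps above could look roughly like the sketch below: a plain TCP connect as the health check, and a filter that drops failed Ingresses before endpoints are emitted. It is written in Python for brevity, not as a proposal for the actual CP code; the `Address` shape, the function names, and the in-memory dict standing in for the zone-healthcheck Config are assumptions.

```python
import socket
from typing import Dict, Iterable, List, Tuple

Address = Tuple[str, int]  # (host, port) of a zone Ingress; hypothetical shape


def tcp_check(addr: Address, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to addr can be opened within timeout."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timed out
        return False


def check_ingresses(ingresses: Dict[str, List[Address]],
                    timeout: float = 2.0) -> Dict[Address, bool]:
    """Probe every Ingress of every other zone.

    The returned mapping stands in for the zone-healthcheck Config.
    """
    return {addr: tcp_check(addr, timeout)
            for addrs in ingresses.values()
            for addr in addrs}


def healthy_endpoints(endpoints: Iterable[Address],
                      healthcheck: Dict[Address, bool]) -> List[Address]:
    """Drop endpoints whose last check failed; keep ones never checked."""
    return [ep for ep in endpoints if healthcheck.get(ep, True)]
```

Keeping endpoints that were never checked (failing open) is a deliberate choice in this sketch: a missing or stale zone-healthcheck should not by itself remove traffic paths that Envoy could still handle with its own health checks and retries.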

Considerations

  • A zone can have 2 Ingresses; one can fail while the other doesn't. It would probably be good to have a "partially degraded" status, to follow the same pattern as with DP health checks.
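
One possible way to derive such a status from per-Ingress results is sketched below. The enum, its value names, and the decision to treat a zone with no known Ingress as offline are assumptions for illustration only.

```python
from enum import Enum
from typing import Iterable


class ZoneStatus(Enum):
    ONLINE = "online"
    PARTIALLY_DEGRADED = "partially degraded"
    OFFLINE = "offline"


def zone_status(ingress_results: Iterable[bool]) -> ZoneStatus:
    """Collapse per-Ingress check results for one zone into a single status."""
    results = list(ingress_results)
    if not results or not any(results):
        # Assumption: a zone with no reachable (or no known) Ingress is offline.
        return ZoneStatus.OFFLINE
    if all(results):
        return ZoneStatus.ONLINE
    return ZoneStatus.PARTIALLY_DEGRADED
```

With this, a zone with one failing and one passing Ingress maps to the partially degraded state, and the dashboard could show a third color for it.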
@github-actions
Contributor

This issue was inactive for 30 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it promptly or attend the next triage meeting.

@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Nov 23, 2021
@lahabana lahabana added area/multizone area/observability kind/design Design doc or related triage/accepted The issue was reviewed and is complete enough to start working on it and removed triage/stale Inactive for some time. It will be triaged again labels Jan 28, 2022
@lahabana
Contributor

This ticket will need to be split into a milestone; this issue only tracks the design.
The design proposed by @jakubdyszkiewicz probably needs to be adapted to account for the creation of ZoneEgress.

@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Feb 28, 2022
@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Apr 26, 2022
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label May 27, 2022
@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label May 27, 2022
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Jun 27, 2022
@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Jun 27, 2022
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Jul 29, 2022
@michaelbeaumont michaelbeaumont removed the triage/stale Inactive for some time. It will be triaged again label Jul 29, 2022
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Oct 28, 2022
@github-actions
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Oct 28, 2022
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Jan 27, 2023
@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Jan 27, 2023
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Apr 28, 2023
@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Apr 28, 2023
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Jul 28, 2023
@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Jul 28, 2023
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Oct 27, 2023
@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Nov 2, 2023
@mohamedawnallah

Hey @jakubdyszkiewicz @lahabana, I'm excited to take on this issue! Could you kindly provide me with some guidance and details on how to get started? Thanks!

@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Feb 6, 2024
@slonka
Contributor

slonka commented Feb 7, 2024

Hi @mohamedawnallah, I'm moving this back to triage so we can discuss it at the next meeting, and we'll get back to you with more details.

@slonka slonka added triage/needs-information Reviewed and some extra information was asked to the reporter triage/pending This issue will be looked at on the next triage meeting and removed triage/accepted The issue was reviewed and is complete enough to start working on it triage/stale Inactive for some time. It will be triaged again labels Feb 7, 2024
@jakubdyszkiewicz jakubdyszkiewicz added triage/accepted The issue was reviewed and is complete enough to start working on it and removed triage/pending This issue will be looked at on the next triage meeting triage/needs-information Reviewed and some extra information was asked to the reporter labels Feb 19, 2024
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label May 20, 2024
@lukidzi lukidzi removed the triage/stale Inactive for some time. It will be triaged again label May 20, 2024
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Aug 20, 2024
@bartsmykla bartsmykla removed the triage/stale Inactive for some time. It will be triaged again label Aug 26, 2024