Ignore Prometheus metrics for service status when too many are flagged as down #127

Open
nkinkade opened this issue Mar 21, 2023 · 0 comments
Today, as far as I know, if script_exporter has a problem and reports that many, or even all, e2e tests to experiments are failing, Locate will remove those services from possible selection in client queries. This has happened to us before: script_exporter marked all ndt services as down when they were not actually down, which caused a global outage because mlab-ns thought that none of the platform was healthy. To avoid a repeat, we implemented a safety net in mlab-ns: if monitoring says that more than 25% of the platform is down, mlab-ns ignores the monitoring data and continues to use its cached service status data:

https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L371
https://github.com/m-lab/mlab-ns/blob/main/server/mlabns/handlers/update.py#L498

Since problems with script_exporter or GCP networking have happened before, I believe we should implement a similar safety check in Locate: when script_exporter reports more than a certain percentage of the platform as down, Locate ignores the signal from script_exporter until the reported percentage drops back below the threshold. With Locate this should be even safer than with mlab-ns, since Locate still has the heartbeat signal to rely on, whereas mlab-ns had/has only the monitoring signals.

Perhaps Locate could use logic like: if script_exporter thinks that more than 25% of all services are down and heartbeat says they are healthy, then ignore the script_exporter monitoring data?
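
To make the proposed rule concrete, here is a minimal sketch of one possible reading of it. It is written in Python for illustration only and is not Locate's actual code or API. The names `ServiceStatus`, `should_ignore_monitoring`, `selectable_services`, and `MAX_DISPUTED_FRACTION` are all hypothetical, and the choice to compute the percentage over all known services (counting only those that Prometheus flags as down while heartbeat reports them healthy) is an assumption, not something the issue specifies.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold: the issue suggests 25%, mirroring the mlab-ns safety net.
MAX_DISPUTED_FRACTION = 0.25


@dataclass(frozen=True)
class ServiceStatus:
    """Hypothetical per-service view combining both health signals."""
    name: str
    prometheus_up: bool  # result of the script_exporter e2e test
    heartbeat_up: bool   # health as reported by the heartbeat signal


def should_ignore_monitoring(statuses: List[ServiceStatus],
                             threshold: float = MAX_DISPUTED_FRACTION) -> bool:
    """Return True when the script_exporter data looks implausible.

    A service is "disputed" when Prometheus says it is down but heartbeat
    says it is healthy. If disputed services exceed `threshold` of all
    services, the monitoring signal is treated as unreliable.
    """
    if not statuses:
        return False
    disputed = sum(1 for s in statuses
                   if s.heartbeat_up and not s.prometheus_up)
    return disputed / len(statuses) > threshold


def selectable_services(statuses: List[ServiceStatus],
                        threshold: float = MAX_DISPUTED_FRACTION) -> List[str]:
    """Names of services that may be returned to clients.

    Heartbeat health is always required. Prometheus health is required
    only while the monitoring signal is considered trustworthy.
    """
    ignore = should_ignore_monitoring(statuses, threshold)
    return [s.name for s in statuses
            if s.heartbeat_up and (ignore or s.prometheus_up)]
```

With this reading, a single service failing its e2e test is still removed from selection, while a mass failure that heartbeat does not corroborate leaves selection driven by heartbeat alone. Whether the denominator should be all services or only heartbeat-healthy ones is a design choice left open here.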
