mixin: remote-write related alert severity should take HA setup into account #7176

Closed
beorn7 opened this issue Apr 27, 2020 · 4 comments

Comments

@beorn7
Member

beorn7 commented Apr 27, 2020

Currently, the PrometheusRemoteStorageFailures and PrometheusRemoteWriteBehind alerts are critical. However, especially with remote-write setups, many users will run HA pairs (or groups) of Prometheus servers, and the remote-write receiver will have some way of deduplicating the incoming samples. In that case, a single Prometheus replica having trouble with remote-write should only be a warning. The alert should be critical only if all members of the HA group have trouble.
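
To make the proposed rule concrete, here is a minimal sketch of the severity decision for one HA group. It is only an illustration, not how the mixin is implemented: the function name `group_severity`, the replica names, and the idea of passing in a per-replica "in trouble" flag are all assumptions made up for this example.

```python
from typing import Mapping


def group_severity(replica_in_trouble: Mapping[str, bool]) -> str:
    """Return the proposed alert severity for a single HA group.

    `replica_in_trouble` maps a (hypothetical) replica name to True if
    that replica currently has remote-write trouble, i.e. one of the
    remote-write alerts would fire for it.
    """
    total = len(replica_in_trouble)
    troubled = sum(1 for in_trouble in replica_in_trouble.values() if in_trouble)

    if total == 0 or troubled == 0:
        # Nothing to alert on.
        return "none"
    if troubled == total:
        # Every member of the HA group has trouble: data is at risk.
        return "critical"
    # At least one healthy replica is still delivering samples.
    return "warning"
```

With this sketch, a pair where only one replica falls behind yields "warning", while a pair where both fall behind yields "critical". A single, non-HA Prometheus is a group of one, so it stays "critical" as it is today.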

@beorn7
Member Author

beorn7 commented Nov 2, 2020

However, thinking about it, the way Cortex currently handles HA pairs will actually not switch the replica if one falls behind…

@krajorama
Member

Hello from the bug scrub. Is there any progress on this issue, @beorn7? Otherwise we'll close it next time around.

@ArthurSens
Member

I think alert severity is highly debatable, not only here but in several parts of our mixins. Some might say that if one replica is completely down, the HA setup is compromised and someone should be paged as a precaution. Others might say that data is still being ingested and it's safe to keep it like this for some time, no need to page.

What I wanted to highlight here is that alert severity is highly opinionated, and it's hard to find a one-size-fits-all solution 😬

@beorn7
Member Author

beorn7 commented Apr 30, 2024

I've noticed no complaints about the current state in the last 3.5 years. So let's close for now. If anyone feels the need to revisit, they can follow up here or open a new issue, and we'll take it from there.

beorn7 closed this as completed on Apr 30, 2024