Currently, the PrometheusRemoteStorageFailures and PrometheusRemoteWriteBehind alerts are critical. However, especially with remote-write setups, many users will run HA pairs (or groups) of Prometheus servers, and the remote-write receiver will have some way of deduplicating the incoming samples. If that's the case, a single Prometheus replica having trouble with remote-write should only be a warning. The alert should be critical only if all members of the HA group have trouble.
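To make the proposed rule concrete, here is a minimal sketch of the severity decision for one HA group, written in Python rather than as an actual alerting rule. The function name, the replica names, and the 120-second threshold are assumptions for illustration only; they are not taken from the mixin.

```python
from typing import Mapping


def remote_write_severity(
    lag_seconds: Mapping[str, float],
    threshold: float = 120.0,  # assumed threshold, purely illustrative
) -> str:
    """Classify one HA group by how many replicas are behind on remote-write.

    lag_seconds maps a (hypothetical) replica name to its remote-write lag.
    Returns "none", "warning", or "critical".
    """
    if not lag_seconds:
        raise ValueError("an HA group needs at least one replica")

    # Replicas whose lag exceeds the threshold count as "having trouble".
    behind = [name for name, lag in lag_seconds.items() if lag > threshold]

    if not behind:
        return "none"
    if len(behind) == len(lag_seconds):
        # Every member is affected, so deduplication can no longer mask the gap.
        return "critical"
    # At least one healthy replica is still delivering samples.
    return "warning"


if __name__ == "__main__":
    print(remote_write_severity({"prom-0": 5.0, "prom-1": 300.0}))
    print(remote_write_severity({"prom-0": 400.0, "prom-1": 300.0}))
```

Note that under this sketch a group with a single replica goes straight to critical, which matches the current behavior for non-HA setups.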
I think alert severity is highly debatable, not only here but in several parts of our mixins. Some might say that if one replica is completely down, the HA setup is compromised and someone should be paged as a precaution. Others might say that data is still being ingested and it's safe to keep it like this for some time, no need to page.
What I wanted to highlight here is that alert severity is highly opinionated, and it's hard to find a one-size-fits-all solution 😬
I have seen no complaints about the current state in the last 3.5 years, so let's close this for now. If anyone feels the need to revisit, they can follow up here or open a new issue and we'll take it from there.