Add proposal for Prometheus Alerting in Observatorium #453

Open · wants to merge 5 commits into base: main
Conversation

@onprem (Member) commented on Jan 7, 2022:

This proposal describes adding Prometheus alerting capabilities to Observatorium.

Signed-off-by: Prem Saraswat prmsrswt@gmail.com

Signed-off-by: Prem Saraswat <prmsrswt@gmail.com>
@spaparaju left a comment:

Another option is for Rulers to reach the 'Thanos Queriers' through the Observatorium API, to reuse the tenancy semantics at the Observatorium API and to future-proof Rulers to work across multiple instances of Observatorium (possibly in multiple regions).

@matej-g (Contributor) left a comment:

So far it's looking sound 👍 I'd have another look once the current comments and remaining to-dos in the doc are addressed


![ruler-connections-diagram](../assets/alerting-ruler.png)

1. Alerts will be evaluated based on periodic requests to any replica of HTTP Query API of Query-frontend (Question: Or querier?).
Contributor:

I think this could be simplified to simply say 'against the Query API', regardless of which component.

@bill3tt (Contributor) left a comment:

Nice start @onprem - the bulk of the thinking is there and seems sensible 👍

The biggest bits that are missing for me at the moment are around how we construct the routing trees that Alertmanager consumes. Users will submit routing configuration to us, and we will store that somehow, but how do we create the final blob of YAML that we sync into AM?

Also - I would be careful about drawing the line between the Rules API proposal and this Alerting proposal, which lets us keep this proposal much more focused. For example - once the Rules API is implemented, Ruler will naturally start evaluating alerts; all we need to do is configure Ruler to point at an Alertmanager URL.
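For illustration, a minimal sketch of one way to build that "final blob of YAML", assuming a soft-tenancy model where each tenant submits only routes and receivers and the composer nests every tenant's route under a top-level route matched on a hypothetical `tenant_id` label. The struct fields mirror only a small subset of the real Alertmanager configuration schema, and names like `compose` and `tenantConfig` are made up for this sketch:

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Minimal subset of the Alertmanager configuration schema, for illustration only.
type route struct {
	Receiver string            `yaml:"receiver,omitempty"`
	Match    map[string]string `yaml:"match,omitempty"`
	Routes   []route           `yaml:"routes,omitempty"`
}

type receiver struct {
	Name string `yaml:"name"`
	// Real receivers carry webhook/email/PagerDuty configs; omitted here.
}

type amConfig struct {
	Route     route      `yaml:"route"`
	Receivers []receiver `yaml:"receivers"`
}

// tenantConfig is what a tenant would submit via the proposed routing API:
// only routes and receivers, no global options.
type tenantConfig struct {
	Route     route
	Receivers []receiver
}

// compose nests every tenant's route under a top-level route that matches on the
// tenant_id label (assumed to be attached to all of the tenant's alerts), and
// prefixes receiver names to avoid collisions between tenants.
func compose(tenants map[string]tenantConfig) amConfig {
	global := amConfig{Route: route{Receiver: "default"}, Receivers: []receiver{{Name: "default"}}}
	for id, tc := range tenants {
		r := tc.Route
		prefixReceivers(&r, id)
		r.Match = map[string]string{"tenant_id": id}
		global.Route.Routes = append(global.Route.Routes, r)
		for _, rcv := range tc.Receivers {
			rcv.Name = id + "-" + rcv.Name
			global.Receivers = append(global.Receivers, rcv)
		}
	}
	return global
}

func prefixReceivers(r *route, tenant string) {
	if r.Receiver != "" {
		r.Receiver = tenant + "-" + r.Receiver
	}
	for i := range r.Routes {
		prefixReceivers(&r.Routes[i], tenant)
	}
}

func main() {
	tenants := map[string]tenantConfig{
		"team-a": {Route: route{Receiver: "oncall"}, Receivers: []receiver{{Name: "oncall"}}},
	}
	out, _ := yaml.Marshal(compose(tenants))
	fmt.Println(string(out)) // the composed configuration that would be synced into Alertmanager
}
```

Prefixing receiver names per tenant avoids collisions, and keeping global options out of tenant hands sidesteps the "how to merge globals" problem that comes up later in this thread.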

Comment on lines 41 to 42
* Scaling Rulers
* Scaling Alertmanager beyond 3 replicas.
Contributor:

I don't understand why these are relevant / non-goals?

Member:

Because it might be important for the usability of the whole platform (:

@bill3tt (Contributor) commented on Jan 11, 2022:

Just chatting with @onprem - there will likely be some lag in responding to all the comments in here as we're currently focused on shipping the Rules API. That is a hard dependency of this proposal & unblocks other workstreams. Watch this space :)

@onprem (Member, Author) commented on Jan 20, 2022:

> Another option is for Rulers to reach the 'Thanos Queriers' through the Observatorium API, to reuse the tenancy semantics at the Observatorium API and to future-proof Rulers to work across multiple instances of Observatorium (possibly in multiple regions).

This is indeed possible, but I think it might involve more work. Going through the Obs. API means we have to take care of authenticating all requests, which might involve running token-refresher as a sidecar for every Ruler, configuring the tenant secrets, etc.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka (Member) left a comment:

Updated proposal, PTAL 🤗


![ruler-connections-diagram](../assets/alerting-ruler.png)

1. Alerts will be evaluated based on periodic requests to any replica of HTTP Query API of Query-frontend (Question: Or querier?).
Member:

All addressed, except this:

> My understanding is that when the Rules API is active, alerts will already be evaluated in Thanos Ruler? The outcome is that Ruler will create new rules with name = ALERT?

@ianbillett could you elaborate on what you mean? The outcome is that a Recording Rule is evaluated and its output is pushed to the Receiver. Similarly, an Alert is evaluated and, if triggered, it is sent with `__name__ = ALERTS` and the Alert is pushed to AM.
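For context, "pushing the Alert to AM" boils down to a POST against Alertmanager's v2 API; a rough sketch (Thanos Ruler does this internally through its own Alertmanager client - the label values, URL and the `tenant_id` label here are illustrative only):

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// postableAlert mirrors the shape accepted by Alertmanager's v2 API (POST /api/v2/alerts).
type postableAlert struct {
	Labels       map[string]string `json:"labels"`
	Annotations  map[string]string `json:"annotations,omitempty"`
	StartsAt     time.Time         `json:"startsAt,omitempty"`
	GeneratorURL string            `json:"generatorURL,omitempty"`
}

func main() {
	alerts := []postableAlert{{
		// alertname/severity are example labels; tenant_id is the tenancy label
		// assumed by this proposal's discussion.
		Labels:      map[string]string{"alertname": "HighErrorRate", "severity": "critical", "tenant_id": "team-a"},
		Annotations: map[string]string{"summary": "Error rate above 5% for 10m"},
		StartsAt:    time.Now(),
	}}

	body, err := json.Marshal(alerts)
	if err != nil {
		log.Fatal(err)
	}
	// Ruler would be configured with the Alertmanager endpoints; this URL is a placeholder.
	resp, err := http.Post("http://alertmanager:9093/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("Alertmanager responded with", resp.Status)
}
```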

@jessicalins (Contributor) left a comment:

cool! 🅰️ added a few questions and formatting suggestions


### Why is thanos-ruler-syncer not currently sufficient for our use case?

Thanos rule syncer only supports syncing rules; it does not support AM routing configuration. It also requires a separate service, thanos-objstore, to be deployed. This unnecessary microservice architecture will become a problem and a fuss to operate and maintain. We don't need to scale those components separately, so I propose to keep it in a single binary for now and for the expected future.
Contributor:

makes sense 👍

@matej-g (Contributor) left a comment:

Some non-essential comments, but content-wise this sounds good to me 👍


1. Alerts will be evaluated based on periodic requests to any replica of HTTP Query API, potentially through Query-frontend.
2. Ruler evaluates Alerts that can be triggered. Triggered alerts will be pushed to all replicas of the Alertmanager (discussed later) through AM HTTP API.
3. It exposes gRPC Rules API which is consumed by Queriers. Querier then can show this API via its HTTP Rule API.
Contributor:

What is meant here by show? Maybe specifying that it's consumed by queriers is enough?

3. It exposes gRPC Rules API which is consumed by Queriers. Querier then can show this API via its HTTP Rule API.
4. It saves samples to any replica of Receiver Router (or RouterIngestor).

### Rules API (non raw one)
Contributor:

I'm finding this non-raw vs. raw distinction a bit confusing, maybe we could call this one just gRPC Rules API?

Member:

Hmm, both are HTTP in most cases.

One is the raw configuration you can GET, PUT and DELETE; the second is the currently LOADED rules and alerts (+ instances of active alerts) you can GET.
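To make the raw vs. non-raw distinction concrete, a sketch of the two call patterns as a client would see them. The `/rules/raw` path follows the Rules API proposal and `/rules` is the Prometheus-compatible read endpoint; the tenant name, host and rule content are placeholders, and authentication against the Observatorium API is omitted:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"strings"
)

const base = "http://observatorium-api/api/metrics/v1/team-a" // placeholder host and tenant

func main() {
	// 1. The "raw" API: the rule *configuration* a tenant manages (GET/PUT/DELETE).
	rules := `
groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{code=~"5.."}[5m]) > 0.05
    for: 10m
`
	req, _ := http.NewRequest(http.MethodPut, base+"/api/v1/rules/raw", strings.NewReader(rules))
	req.Header.Set("Content-Type", "application/yaml")
	if _, err := http.DefaultClient.Do(req); err != nil {
		log.Fatal(err)
	}

	// 2. The "non-raw" API: the currently LOADED rules and alerts, read-only,
	//    served by Querier via its Prometheus-compatible HTTP Rules API.
	resp, err := http.Get(base + "/api/v1/rules")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	log.Println(string(body)) // rule groups with evaluation state and active alert instances
}
```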

Co-authored-by: Jéssica Lins  <jlins@redhat.com>
Co-authored-by: Matej Gera <38492574+matej-g@users.noreply.github.com>
@bwplotka (Member) left a comment:

Thanks for the review, some updates.

Goals:

* Tenants should be able to read the status of active alerts
* via ([Prometheus Alert API GET](https://prometheus.io/docs/prometheus/latest/querying/api/#alerts))
Member:

TODO: Remove, since Rules already contains the same info!


We propose defining an additional HTTP API for routing on the `/api/metrics/v1/{tenant}/api/v1/routing/raw` path in Observatorium API. The path supports GET, DELETE and PUT.

We propose to base the type on [Prometheus Operator AM Config](https://github.com/prometheus-operator/prometheus-operator/blob/9c0db5656f04e005de6a0413fd8eb8f11ec99757/pkg/apis/monitoring/v1alpha1/alertmanager_config_types.go#L69). We also propose adding this API to upstream Alertmanager (TODO).
Member:

TODO: I will start a discussion on the Alertmanager repo.

TODO: Clarify what the API should look like exactly, before merging.

Contributor:

I think we should clarify what exactly we want here from upstream Alertmanager. The fact is that AM already has an API which would enable us to get the currently loaded configuration; see the /status endpoint in the spec.
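For reference, a sketch of reading the currently loaded configuration through that existing endpoint; the v2 `/status` response carries the original configuration as a string (the Alertmanager URL below is a placeholder):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Alertmanager's existing v2 API exposes the loaded configuration under /status.
	resp, err := http.Get("http://alertmanager:9093/api/v2/status")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var status struct {
		Config struct {
			Original string `json:"original"` // the full loaded configuration as text
		} `json:"config"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		log.Fatal(err)
	}
	fmt.Println(status.Config.Original)
}
```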

* **Other docs:**
* None

> TL;DR: For monitoring use cases, especially tenants of the metric signal of Observatorium, users want to be able to configure Prometheus-based alerts that will be evaluated at a given interval. If triggered, they should notify the desired target (receiver) (e.g. PagerDuty, using Alertmanager configuration). Users should also be able to see the status of all alerts through a Prometheus-like alerts HTTP API. This proposal specifies how such a system can be deployed within Observatorium.
Member:

TODO:

* How often do we reload the AM config?
* How do we trigger a reload?
* Does triggering a reload affect AM?
* Exact config file? (type)
* How is tenancy enabled?

Contributor:

I believe the biggest question mark we have is still around multi-tenancy and Alertmanager configuration. I'll try to summarize the discussions so far.

Main concern:

* We do not yet have a good idea of how multi-tenancy with configuration should be implemented. The fact is that Alertmanager does not recognize tenancy, meaning that the configuration, from a global perspective, represents the configuration of a single user. To make tenancy possible, we need to design a solution which makes it possible to isolate the configurations of multiple users (tenants) while still representing them in the global configuration (on top of this, there are related questions such as how to enable such isolation when it comes to the UI and other APIs not included in this proposal - i.e. silences, inhibitions - the design choices here might have an impact on these as well).

Ideas:

1. Have a separate Alertmanager instance for each tenant (hard tenancy).
   * With this, we would completely side-step the issue of configuration, since the global config = user config. Downsides? The extra overhead of running an Alertmanager instance for each tenant?
2. Combine configurations provided by users into a single global configuration (soft tenancy).
   * This aligns with how we deal with e.g. rules (see proposal). This would mean creating one global file, with each tenant being represented in the global configuration. The main issue with this solution is how to combine the tenants' configurations:
     * Alertmanager has a few global options that are applied if the child routes have no configuration set. If tenants are to provide us with their global parameters, it is not clear how it would be possible to combine global parameters from multiple tenants into one config file.
     * An additional suggestion for how to solve this would be to accept only non-global parameters, i.e. routes and receivers, and combine these into a global config file. The downside is obviously that users cannot leverage global options and would need to specify configuration for all routes explicitly.

In this context it is also worth looking at how Cortex works, where Alertmanager runs as a component within Cortex. This makes it possible to run a 'multi-tenant Alertmanager' as a single component which encompasses multiple Alertmanagers, each dedicated to a single tenant (meaning each tenant can provide their own isolated, global configuration).
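To illustrate idea 1 (hard tenancy), a rough sketch of a Cortex-style front door that routes requests to a dedicated per-tenant Alertmanager, written here as a plain reverse proxy keyed on a tenant segment in the path. The `/am/{tenant}` URL scheme, tenant names and backend addresses are all made up for this sketch:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// One Alertmanager (or HA pair) per tenant; addresses are placeholders.
var tenantAM = map[string]string{
	"team-a": "http://alertmanager-team-a:9093",
	"team-b": "http://alertmanager-team-b:9093",
}

func main() {
	// Expect paths like /am/{tenant}/api/v2/..., strip the prefix and proxy
	// the rest to that tenant's dedicated Alertmanager instance.
	http.HandleFunc("/am/", func(w http.ResponseWriter, r *http.Request) {
		parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/am/"), "/", 2)
		backend, ok := tenantAM[parts[0]]
		if !ok || len(parts) < 2 {
			http.Error(w, "unknown tenant", http.StatusNotFound)
			return
		}
		target, _ := url.Parse(backend)
		proxy := httputil.NewSingleHostReverseProxy(target)
		r.URL.Path = "/" + parts[1] // forward the remainder, e.g. api/v2/alerts
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Each tenant's configuration then maps one-to-one onto its own Alertmanager, which is exactly why idea 1 side-steps the config-merging problem, at the cost of running (and scaling) one instance per tenant.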

@pavolloffay (Member) commented:

The image is weirdly rendered on the website.

[Screenshot: Observatorium Alerting - Observatorium]

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
- Add more alternatives from discussion
- Clarify text in some places
- Disable invalid links for now to make CI pass

Signed-off-by: Matej Gera <matejgera@gmail.com>
@jessicalins mentioned this pull request on Feb 22, 2022