Add proposal for Prometheus Alerting in Observatorium #453

Open · wants to merge 5 commits into base: main
Conversation

@onprem (Member) commented on Jan 7, 2022:

This proposal describes adding Prometheus alerting capabilities to Observatorium.

Signed-off-by: Prem Saraswat prmsrswt@gmail.com

Signed-off-by: Prem Saraswat <prmsrswt@gmail.com>
@spaparaju left a comment:

Another option is for Rulers to reach the 'Thanos Queriers' through the Observatorium API, to reuse the tenancy semantics at the Observatorium API and to future-proof Rulers to work across multiple instances of Observatorium (possibly in multiple regions).

@matej-g (Contributor) left a comment:

So far it's looking sound 👍 I'd have another look once the current comments and remaining to-dos in the doc are addressed


![ruler-connections-diagram](../assets/alerting-ruler.png)

1. Alerts will be evaluated based on periodic requests to any replica of HTTP Query API of Query-frontend (Question: Or querier?).
Contributor:

I think this could be simplified to simply say 'against the Query API', regardless of which component.

@bill3tt (Contributor) left a comment:

Nice start @onprem - the bulk of the thinking is there and seems sensible 👍

The biggest bits that are missing for me at the moment are around how we construct the routing trees that Alertmanager consumes. Users will submit routing configuration to us, and we will store that somehow, but how do we create the final blob of YAML that we sync into AM?

Also - I would be careful about drawing the line between the Rules API proposal and this Alerting proposal, which lets us keep this proposal much more focused. For example - once the Rules API is implemented, Ruler will naturally start evaluating alerts; all we need to do is configure Ruler to point at an Alertmanager URL.
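For illustration, a minimal sketch of one way to build that "final blob of YAML", assuming a soft-tenancy model where each tenant submits only routes and receivers and the composer nests every tenant's route under a top-level route matched on a hypothetical `tenant_id` label. The struct fields mirror only a small subset of the real Alertmanager configuration schema, and names like `compose` and `tenantConfig` are made up for this sketch:

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Minimal subset of the Alertmanager configuration schema, for illustration only.
type route struct {
	Receiver string            `yaml:"receiver,omitempty"`
	Match    map[string]string `yaml:"match,omitempty"`
	Routes   []route           `yaml:"routes,omitempty"`
}

type receiver struct {
	Name string `yaml:"name"`
	// Real receivers carry webhook/email/PagerDuty configs; omitted here.
}

type amConfig struct {
	Route     route      `yaml:"route"`
	Receivers []receiver `yaml:"receivers"`
}

// tenantConfig is what a tenant would submit via the proposed routing API:
// only routes and receivers, no global options.
type tenantConfig struct {
	Route     route
	Receivers []receiver
}

// compose nests every tenant's route under a top-level route that matches on the
// tenant_id label (assumed to be attached to all of the tenant's alerts), and
// prefixes receiver names to avoid collisions between tenants.
func compose(tenants map[string]tenantConfig) amConfig {
	global := amConfig{Route: route{Receiver: "default"}, Receivers: []receiver{{Name: "default"}}}
	for id, tc := range tenants {
		r := tc.Route
		prefixReceivers(&r, id)
		r.Match = map[string]string{"tenant_id": id}
		global.Route.Routes = append(global.Route.Routes, r)
		for _, rcv := range tc.Receivers {
			rcv.Name = id + "-" + rcv.Name
			global.Receivers = append(global.Receivers, rcv)
		}
	}
	return global
}

func prefixReceivers(r *route, tenant string) {
	if r.Receiver != "" {
		r.Receiver = tenant + "-" + r.Receiver
	}
	for i := range r.Routes {
		prefixReceivers(&r.Routes[i], tenant)
	}
}

func main() {
	tenants := map[string]tenantConfig{
		"team-a": {Route: route{Receiver: "oncall"}, Receivers: []receiver{{Name: "oncall"}}},
	}
	out, _ := yaml.Marshal(compose(tenants))
	fmt.Println(string(out)) // the composed configuration that would be synced into Alertmanager
}
```

Prefixing receiver names per tenant avoids collisions, and keeping global options out of tenant hands sidesteps the "how to merge globals" problem that comes up later in this thread.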

Comment on lines 41 to 42
* Scaling Rulers
* Scaling Alertmanager beyond 3 replicas.
Contributor:

I don't understand why these are relevant / non-goals?

Member:

Because it might be important for the usability of the whole platform (:

@bill3tt (Contributor) commented on Jan 11, 2022:

Just chatting with @onprem - there will likely be some lag in responding to all the comments in here as we're currently focused on shipping the Rules API. That is a hard dependency of this proposal & unblocks other workstreams. Watch this space :)

@onprem (Member, Author) commented on Jan 20, 2022:

> Another option is for Rulers to reach the 'Thanos Queriers' through the Observatorium API, to reuse the tenancy semantics at the Observatorium API and to future-proof Rulers to work across multiple instances of Observatorium (possibly in multiple regions).

This is indeed possible, but I think it might involve more work. Going through the Obs. API means we have to take care of authenticating all requests, which might involve running token-refresher as a sidecar for every Ruler, configuring the tenant secrets, etc.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka (Member) left a comment:

Updated proposal, PTAL 🤗


![ruler-connections-diagram](../assets/alerting-ruler.png)

1. Alerts will be evaluated based on periodic requests to any replica of HTTP Query API of Query-frontend (Question: Or querier?).
Member:

All addressed, except this:

> My understanding is that when the Rules API is active, alerts will already be evaluated in Thanos Ruler? The outcome is that Ruler will create new rules with name = ALERT?

@ianbillett could you elaborate on what you mean? The outcome is that a Recording Rule is evaluated and its output is pushed to the Receiver. Similarly, an Alert is evaluated and, if triggered, it is sent with `__name__ = ALERTS` and the Alert is pushed to AM.
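For context, "pushing the Alert to AM" boils down to a POST against Alertmanager's v2 API; a rough sketch (Thanos Ruler does this internally through its own Alertmanager client - the label values, URL and the `tenant_id` label here are illustrative only):

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// postableAlert mirrors the shape accepted by Alertmanager's v2 API (POST /api/v2/alerts).
type postableAlert struct {
	Labels       map[string]string `json:"labels"`
	Annotations  map[string]string `json:"annotations,omitempty"`
	StartsAt     time.Time         `json:"startsAt,omitempty"`
	GeneratorURL string            `json:"generatorURL,omitempty"`
}

func main() {
	alerts := []postableAlert{{
		// alertname/severity are example labels; tenant_id is the tenancy label
		// assumed by this proposal's discussion.
		Labels:      map[string]string{"alertname": "HighErrorRate", "severity": "critical", "tenant_id": "team-a"},
		Annotations: map[string]string{"summary": "Error rate above 5% for 10m"},
		StartsAt:    time.Now(),
	}}

	body, err := json.Marshal(alerts)
	if err != nil {
		log.Fatal(err)
	}
	// Ruler would be configured with the Alertmanager endpoints; this URL is a placeholder.
	resp, err := http.Post("http://alertmanager:9093/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("Alertmanager responded with", resp.Status)
}
```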

@jessicalins (Contributor) left a comment:

cool! 🅰️ added a few questions and formatting suggestions


### Why is thanos-ruler-syncer not currently sufficient for our use case?

Thanos rule syncer only supports syncing rules; it does not support AM routing configuration. It also requires a separate service, thanos-objstore, to be deployed. This unnecessary microservice architecture will become a problem and a fuss to operate and maintain. We don't need to scale those components separately, so I propose to keep it in a single binary for now and for the expected future.
Contributor:

makes sense 👍

@matej-g (Contributor) left a comment:

Some non-essential comments, but content-wise this sounds good to me 👍


1. Alerts will be evaluated based on periodic requests to any replica of HTTP Query API, potentially through Query-frontend.
2. Ruler evaluates Alerts that can be triggered. Triggered alerts will be pushed to all replicas of the Alertmanager (discussed later) through AM HTTP API.
3. It exposes gRPC Rules API which is consumed by Queriers. Querier then can show this API via its HTTP Rule API.
Contributor:

What is meant here by show? Maybe specifying that it's consumed by queriers is enough?

3. It exposes gRPC Rules API which is consumed by Queriers. Querier then can show this API via its HTTP Rule API.
4. It saves samples to any replica of Receiver Router (or RouterIngestor).

### Rules API (non raw one)
Contributor:

I'm finding this non-raw vs. raw distinction a bit confusing, maybe we could call this one just gRPC Rules API?

Member:

Hmm, both are HTTP in most cases.

One is the raw configuration you can GET, PUT and DELETE; the second is the currently LOADED rules and alerts (+ instances of active alerts) you can GET.
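To make the raw vs. non-raw distinction concrete, a sketch of the two call patterns as a client would see them. The `/rules/raw` path follows the Rules API proposal and `/rules` is the Prometheus-compatible read endpoint; the tenant name, host and rule content are placeholders, and authentication against the Observatorium API is omitted:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"strings"
)

const base = "http://observatorium-api/api/metrics/v1/team-a" // placeholder host and tenant

func main() {
	// 1. The "raw" API: the rule *configuration* a tenant manages (GET/PUT/DELETE).
	rules := `
groups:
- name: example
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{code=~"5.."}[5m]) > 0.05
    for: 10m
`
	req, _ := http.NewRequest(http.MethodPut, base+"/api/v1/rules/raw", strings.NewReader(rules))
	req.Header.Set("Content-Type", "application/yaml")
	if _, err := http.DefaultClient.Do(req); err != nil {
		log.Fatal(err)
	}

	// 2. The "non-raw" API: the currently LOADED rules and alerts, read-only,
	//    served by Querier via its Prometheus-compatible HTTP Rules API.
	resp, err := http.Get(base + "/api/v1/rules")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	log.Println(string(body)) // rule groups with evaluation state and active alert instances
}
```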

Co-authored-by: Jéssica Lins  <jlins@redhat.com>
Co-authored-by: Matej Gera <38492574+matej-g@users.noreply.github.com>
@bwplotka (Member) left a comment:

Thanks for the review, some updates.

Goals:

* Tenants should be able to read the status of active alerts
* via ([Prometheus Alert API GET](https://prometheus.io/docs/prometheus/latest/querying/api/#alerts))
Member:

TODO: Remove, since Rules already contains the same info!


We propose defining an additional HTTP API for routing on the `/api/metrics/v1/{tenant}/api/v1/routing/raw` path in Observatorium API. The path supports GET, DELETE and PUT.

We propose to base the type on [Prometheus Operator AM Config](https://github.com/prometheus-operator/prometheus-operator/blob/9c0db5656f04e005de6a0413fd8eb8f11ec99757/pkg/apis/monitoring/v1alpha1/alertmanager_config_types.go#L69). We also propose adding this API to upstream Alertmanager (TODO).
Member:

TODO: I will start a discussion on the Alertmanager repo.

TODO: Clarify what the API should look like exactly, before merging.

Contributor:

I think we should clarify what exactly we want here from upstream Alertmanager. The fact is that AM already has an API which would enable us to get the currently loaded configuration; see the /status endpoint in the spec.
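For reference, a sketch of reading the currently loaded configuration through that existing endpoint; the v2 `/status` response carries the original configuration as a string (the Alertmanager URL below is a placeholder):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Alertmanager's existing v2 API exposes the loaded configuration under /status.
	resp, err := http.Get("http://alertmanager:9093/api/v2/status")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var status struct {
		Config struct {
			Original string `json:"original"` // the full loaded configuration as text
		} `json:"config"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		log.Fatal(err)
	}
	fmt.Println(status.Config.Original)
}
```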

* **Other docs:**
* None

> TL;DR: For monitoring use cases, especially tenants of the metric signal of Observatorium, users want to be able to configure Prometheus-based alerts that will be evaluated at a given interval. If triggered, they should notify the desired target (receiver) (e.g. PagerDuty, using Alertmanager configuration). Users should also be able to see the status of all alerts through a Prometheus-like alerts HTTP API. This proposal specifies how such a system can be deployed within Observatorium.
Member:

TODO:

* How often do we reload the AM config?
* How do we trigger a reload?
* Does triggering a reload affect AM?
* Exact config file? (type)
* How is tenancy enabled?

Contributor:

I believe the biggest question mark we have is still around multi-tenancy and Alertmanager configuration. I'll try to summarize the discussions so far.

Main concern:

* We do not yet have a good idea of how multi-tenancy with configuration should be implemented. The fact is that Alertmanager does not recognize tenancy, meaning that the configuration, from a global perspective, represents the configuration of a single user. To make tenancy possible, we need to design a solution which makes it possible to isolate the configurations of multiple users (tenants) while still representing them in the global configuration (on top of this, there are related questions such as how to enable such isolation when it comes to the UI and other APIs not included in this proposal - i.e. silences, inhibitions - the design choices here might have an impact on these as well).

Ideas:

1. Have a separate Alertmanager instance for each tenant (hard tenancy).
   * With this, we would completely side-step the issue of configuration, since the global config = user config. Downsides? The extra overhead of running an Alertmanager instance for each tenant?
2. Combine configurations provided by users into a single global configuration (soft tenancy).
   * This aligns with how we deal with e.g. rules (see proposal). This would mean creating one global file, with each tenant being represented in the global configuration. The main issue with this solution is how to combine the tenants' configurations:
     * Alertmanager has a few global options that are applied if the child routes have no configuration set. If tenants are to provide us with their global parameters, it is not clear how it would be possible to combine global parameters from multiple tenants into one config file.
     * An additional suggestion for how to solve this would be to accept only non-global parameters, i.e. routes and receivers, and combine these into a global config file. The downside is obviously that users cannot leverage global options and would need to specify configuration for all routes explicitly.

In this context it is also worth looking at how Cortex works, where Alertmanager runs as a component within Cortex. This makes it possible to run a 'multi-tenant Alertmanager' as a single component which encompasses multiple Alertmanagers, each dedicated to a single tenant (meaning each tenant can provide their own isolated, global configuration).
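To illustrate idea 1 (hard tenancy), a rough sketch of a Cortex-style front door that routes requests to a dedicated per-tenant Alertmanager, written here as a plain reverse proxy keyed on a tenant segment in the path. The `/am/{tenant}` URL scheme, tenant names and backend addresses are all made up for this sketch:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// One Alertmanager (or HA pair) per tenant; addresses are placeholders.
var tenantAM = map[string]string{
	"team-a": "http://alertmanager-team-a:9093",
	"team-b": "http://alertmanager-team-b:9093",
}

func main() {
	// Expect paths like /am/{tenant}/api/v2/..., strip the prefix and proxy
	// the rest to that tenant's dedicated Alertmanager instance.
	http.HandleFunc("/am/", func(w http.ResponseWriter, r *http.Request) {
		parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/am/"), "/", 2)
		backend, ok := tenantAM[parts[0]]
		if !ok || len(parts) < 2 {
			http.Error(w, "unknown tenant", http.StatusNotFound)
			return
		}
		target, _ := url.Parse(backend)
		proxy := httputil.NewSingleHostReverseProxy(target)
		r.URL.Path = "/" + parts[1] // forward the remainder, e.g. api/v2/alerts
		proxy.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Each tenant's configuration then maps one-to-one onto its own Alertmanager, which is exactly why idea 1 side-steps the config-merging problem, at the cost of running (and scaling) one instance per tenant.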

@pavolloffay (Member) commented:

The image is weirdly rendered on the website.

[Screenshot: Observatorium Alerting - Observatorium]

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
- Add more alternatives from discussion
- Clarify text in some places
- Disable invalid links for now to make CI pass

Signed-off-by: Matej Gera <matejgera@gmail.com>
@jessicalins mentioned this pull request on Feb 22, 2022