
Alerting: Support per series state change tracking for queries that return multiple series #6041

Closed
bergquist opened this issue Sep 14, 2016 · 48 comments
Labels: area/alerting, area/alerting/evaluation, type/feature-request

Comments

@bergquist
Contributor

bergquist commented Sep 14, 2016

Currently, we don't update the alert message if the alert query returns new/other series than the first one.

Let's say that an alert executes and has an eval match on "serie 1".
The next time the alert executes, "serie 1" is fine but "serie 2" is alerting.
This will go unnoticed in our current implementation.

When an alert is triggered we should create a key based on the alerting series. The next time the same alert is triggered we can compare the stored state with the key created from the current series.

If the data source supports dimensions, we should include those when creating the key.

One suggestion for how to implement such key creation:

{ "serie 1", datapoints: ... },
{ "serie 2", datapoints: ... }

Should become serie 1;serie 2

{ "web-front-01", tags: {"server": "web-front-01", "application": "web"}, datapoints: ... },
{ "web-front-01", tags: {"server": "web-front-02", "application": "web"}, datapoints: ... }

Should become web-front-01{server=web-front-01,application=web};web-front-02{server=web-front-02,application=web}

Keys might seem long, but I don't think that matters much. Perhaps we could find a way of shortening them by just storing a hash of the key or something like that.
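
To make the proposal concrete, here is a minimal Go sketch of that key creation (the Series type and function names are hypothetical, not Grafana internals): build a stable per-series key from the name plus sorted tags, join the alerting series with ";", and hash the result if the full key gets too long.

    // Illustrative sketch of the proposed key creation, not Grafana's actual code.
    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "sort"
        "strings"
    )

    type Series struct {
        Name string
        Tags map[string]string
    }

    // seriesKey builds "name{k=v,...}" with the tags sorted for a stable result.
    func seriesKey(s Series) string {
        if len(s.Tags) == 0 {
            return s.Name
        }
        keys := make([]string, 0, len(s.Tags))
        for k := range s.Tags {
            keys = append(keys, k)
        }
        sort.Strings(keys)
        pairs := make([]string, 0, len(keys))
        for _, k := range keys {
            pairs = append(pairs, k+"="+s.Tags[k])
        }
        return s.Name + "{" + strings.Join(pairs, ",") + "}"
    }

    // alertKey joins the per-series keys with ";" into the full key.
    func alertKey(alerting []Series) string {
        parts := make([]string, 0, len(alerting))
        for _, s := range alerting {
            parts = append(parts, seriesKey(s))
        }
        return strings.Join(parts, ";")
    }

    // shortKey hashes the full key so the stored value stays short no matter
    // how many series are alerting.
    func shortKey(key string) string {
        sum := sha256.Sum256([]byte(key))
        return hex.EncodeToString(sum[:])
    }

    func main() {
        key := alertKey([]Series{
            {Name: "web-front-01", Tags: map[string]string{"server": "web-front-01", "application": "web"}},
            {Name: "web-front-02", Tags: map[string]string{"server": "web-front-02", "application": "web"}},
        })
        fmt.Println(key)           // readable form (tags come out alphabetically sorted)
        fmt.Println(shortKey(key)) // hashed form
    }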

@bergquist bergquist added area/alerting Grafana Alerting area/alerting/evaluation Issues when evaluating alerts labels Sep 14, 2016
@torkelo torkelo changed the title Alerting: Update state and eval_data if alerting eval_match changes. Alerting: Support per series state change tracking for queries that return multiple series Dec 12, 2016
@klausenbusk
Contributor

Just a +1 for this.

Our use case for this feature is that we have ~150 boxes at different locations, and it would be really useful if we could set up an alert like "if memory usage > 75% of total".
With the current implementation that wouldn't work very well, as multiple alerting boxes would only trigger one alert.

@pdf

pdf commented Jan 14, 2017

Implementation details will likely overlap with #6557 I believe.

@ghost

ghost commented Jan 14, 2017

I realize this is a more suitable issue for my comment in #6685 (comment)

I have another slightly different use case for this. I have multiple series in a graph: https://snapshot.raintank.io/dashboard/snapshot/V6cQW9mzHHfnDJHKgNKbyyCfzp4nZSBy

When the first series reaches the alert level, everything goes smoothly, but when the second one reaches the alert level, it doesn't flip the alert state, and thus doesn't trigger anything. It'd be useful though :)

An idea how to go about this: a per-alert "notification id", which can use variables to distinguish between the desired alerts. In my graph above, I'd set the notification id to {{host}}.

@pdf

pdf commented Jan 14, 2017

@lgierth see also #6553 (comment) for how I'd like to see this sort of thing work.

@yannispanousis

Would absolutely love this

@ddhirajkumar

ddhirajkumar commented Mar 14, 2017

I was thinking that if we support multiple alerts per graph, we would have a solution for this issue.

@bergquist
Contributor Author

Updated the description. Feedback on how to implement this would be appreciated.

@pdu

pdu commented Nov 29, 2017

I would like to share the way I got to work around this.

  1. Manually create a dashboard to monitor one EC2 instance. I've created separate panels in the dashboard to monitor CPU, memory, disk, network rx/tx/rx_error/tx_error, etc.
  2. Export the dashboard as a JSON template file.
  3. Write a cron script to automatically create/delete dashboards (a rough sketch of the API calls is shown after this list):
    3.1. get the EC2 list
    3.2. get the dashboard list
    3.3. call the Grafana API to add any missing dashboards based on the JSON template file
    3.4. call the Grafana API to delete a dashboard if its EC2 instance was terminated
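
For anyone looking for a head start on the same workaround, here is a rough Go sketch of the dashboard API calls such a cron job would make (the URL and API key are placeholders, and the EC2 listing and JSON templating are left out; this is not pdu's actual script):

    // Sketch of the Grafana dashboard API calls used by the cron job above.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    const (
        grafanaURL = "http://localhost:3000" // placeholder Grafana URL
        apiKey     = "REPLACE_WITH_API_KEY"  // placeholder API key
    )

    func doRequest(method, path string, body []byte) (*http.Response, error) {
        req, err := http.NewRequest(method, grafanaURL+path, bytes.NewReader(body))
        if err != nil {
            return nil, err
        }
        req.Header.Set("Authorization", "Bearer "+apiKey)
        req.Header.Set("Content-Type", "application/json")
        return http.DefaultClient.Do(req)
    }

    // createDashboard posts a dashboard JSON (rendered from the exported
    // template) to POST /api/dashboards/db; "overwrite" updates an existing one.
    func createDashboard(dashboardJSON []byte) error {
        payload := []byte(`{"overwrite": true, "dashboard": ` + string(dashboardJSON) + `}`)
        resp, err := doRequest(http.MethodPost, "/api/dashboards/db", payload)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("create failed: %s", resp.Status)
        }
        return nil
    }

    // deleteDashboard removes a dashboard by UID via DELETE /api/dashboards/uid/:uid.
    func deleteDashboard(uid string) error {
        resp, err := doRequest(http.MethodDelete, "/api/dashboards/uid/"+uid, nil)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("delete failed: %s", resp.Status)
        }
        return nil
    }

    func main() {
        // In the real cron job: list EC2 instances, compare with GET /api/search,
        // then call createDashboard for missing instances and deleteDashboard for
        // terminated ones.
    }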

@gregorsini

Hi pdu, do you have time to share some of your script or any generic script to add or delete a dashboard? I've contemplated your approach, but would appreciate any head-start you're able to share.

Best, Greg.

@shurshun

shurshun commented Dec 7, 2017

+1

@whidrasl

whidrasl commented Dec 8, 2017

+1

@pdu

pdu commented Dec 13, 2017

@gregorsini sorry I just saw the message, please refer to https://github.com/pdu/grafana_ec2_monitoring

@ashuw018

ashuw018 commented Feb 2, 2018

Hi, just checking if there is any chance of getting this feature in the near future. Due to the lack of template variable support in alerting, I have created a separate dashboard for alerting only, where I don't use any template variables but manage multiple series within a single graph by tagging. So this feature would be very useful in my scenario.

Thanks,

@karimcitoh

+1

@micw

micw commented Sep 21, 2019

@bergquist Is there any progress in this?

Best regards,
Michael.

@manhojviknesh

@bergquist: Kindly update us on the status of this feature.
It would be extremely handy if we had alert triggers for each condition separately instead of alerting on a single condition as a whole.

@fernandobeltranjsc

I do not understand how this functionality is not available 3 years later; it is a problem not being able to have alarms for each series. Once the CPU alarm goes off, it will not trigger again while it is still active, even if another machine develops its own CPU problem. It is not reasonable.

@mgiammarco

I am another one who does not understand why you cannot have alarms for each series.
Over the years I have moved 7 of my customers to the Influx stack, which has this feature. You have lost 7 potential customers. Worse, when I explained to them that Grafana does not support this feature, they replied: "it is not a useful product" (the real sentence was far worse).

@mgiammarco

And please stop suggesting workarounds like repeated notifications. There are already too many notifications; we cannot add even more as a workaround for this missing feature.

@cqcn1991

cqcn1991 commented Apr 30, 2020

+1 for this
Really looking forward to this feature

@erkexzcx

erkexzcx commented Jun 4, 2020

+1

@lmondoux

+1

@vbichov

vbichov commented Jul 8, 2020

Basically, this feature would bring Prometheus (or Cortex) alert engine functionality to Grafana.
I've been waiting for this feature for more than a year now.
Any chance this will be implemented?

@czd890

czd890 commented Jul 15, 2020

+1

@ghost

ghost commented Jul 15, 2020

Grafana Cortex alerting UI +1

@johntdyer
Contributor

we so need this

@janbrunrasmussen

Any chance someone from Grafana could update on the state of this?

@wiardvanrij

+1...

@bbl232

bbl232 commented Jul 31, 2020

bump! we also could really use this :)

@leoowu

leoowu commented Sep 18, 2020

Any further update?

@knmorgan

+1 from me too.

We are currently getting updates on alerts by using "Send reminders" in the notification channel settings, but this is obviously not ideal. This problem combined with #16662 really cripples alerts for us.

@rmccarthy-ellevation

Is there going to be an update on this?

@uklance

uklance commented Jan 6, 2021

My use case for this is that we have multiple microservices running in Kubernetes, and we calculate the replica utilisation for each service using a Prometheus query. We get an alert if the replica utilisation is not 100% for more than 5 minutes (i.e. we get an alert when a microservice in our team's Kubernetes namespace dies and can't restart).

Currently, we only get a single alert when the first microservice dies and then an OK alert when all services are back up. But in between, a second, third, or fourth microservice might have died, and we don't get an additional alert for those. Ideally we would like a fail/OK alert for each microservice (i.e. a fail/OK for each series in the chart).

I could create a separate panel/chart/alert for each series, but we are likely to have 20/50/100 microservices in the future, and this would mean copy/pasting Grafana JSON every time I add a new microservice. Currently, using a single panel/chart/alert, the list of applications is dynamically calculated via a Prometheus query and we don't need to make any changes to the Grafana config when we add more microservices to our Kubernetes namespace.

@danielfariati

danielfariati commented Jan 20, 2021

Example use case:
Consider that you have more than 1,000 different databases.
Consider that you have an alert for RDS burst balance (CloudWatch).
An alert is triggered because one of the databases has a low burst balance.
Even if you fix it right away, it can take a lot of time for the burst balance to go up again.

Then, there are two possibilities in the current Grafana version, as far as I know:

  1. With reminders off: you will not be notified if other databases get a low burst balance in the meantime, as the alert will not be triggered again;
  2. With reminders on: you will be notified several times, even if there are no new alerts, which can lead to information overload (you will stop several times to read the alerts only to discover that nothing has changed).

This functionality to support per-series state changes would solve this in a nice way, at least for several use cases of mine.
We currently use another alerting solution for those cases, but that is also not ideal, as centralizing our graphs / alerts would be much better.

I can't think of any case where alerting per series would cause too many alerts.
If you created an alert, it is because you want to know if something happened.
If you are not interested in knowing that, then it probably means that your alert is not useful or is not configured with the right parameters.
Also, if you grouped / created series for an alert, it is probably because the series matter.
Otherwise, you would create the alert without any sort of grouping.
Even then, nothing that an option to enable per-series support or keep the current implementation wouldn't fix.

We currently use another alerting solution for these use cases, as Grafana can't handle them in a scalable way (creating one graph per series is not scalable in several cases, for example when new series are created/deleted dynamically).
But I would love to see this functionality in Grafana, so we could centralize our alerts.

@danielfariati

@bergquist Can you update us on this topic? You said that you were working on redesigning the alert system and writing up a design doc, but I didn't see any follow-up on that. Did that doc cover this in some way?

@florian-forestier

florian-forestier commented Feb 4, 2021

Bumping this. This feature would be useful for connecting Grafana alerting with other supervision tools and keeping users updated about what's really going on in their infrastructure.

I'm pretty sure this is not a complicated thing to do, because the "Test" button already does the job: even if the alert is already in the alerting state, it will display up-to-date information.

/api/alerts :

    {
        "id": 13,
        "dashboardId": 17,
        "dashboardUid": "___",
        "dashboardSlug": "supervision",
        "panelId": 38,
        "name": "my_alert",
        "state": "alerting",
        "newStateDate": "2021-02-04T08:55:03Z",
        "evalDate": "0001-01-01T00:00:00Z",
        "evalData": {
            "evalMatches": [
                {
                    "metric": "controller-2",
                    "tags": {
                        "__name__": "up",
                        "address_ori": "controller-2:7",
                        "instance": "controller-2",
                        "job": "node-exporter",
                        "type": "controller"
                    },
                    "value": 0
                }
            ]
        },
        "executionError": "",
        "url": "/d/____/supervision"
    },

"Test button" result :

{
  "firing": true,
  "state": "pending",
  "conditionEvals": "true = true",
  "timeMs": "7.959ms",
  "matches": [
    {
      "metric": "controller-0",
      "value": 0
    },
    {
      "metric": "controller-1",
      "value": 0
    },
    {
      "metric": "controller-2",
      "value": 0
    }
  ]
}

One (awful) workaround is to use /api/alerts/:id/pause to suspend and then reactivate the alert, so that evalMatches is updated...

If you don't want to add this as a base functionality, would it be possible to have a /api/alerts/:id/recalculate endpoint, which would update evalMatches on demand?

Edit: digging a little into the code, I think I found where to make the changes: pkg/services/alerting/result_handler.go, line 50; but I presume that line 61 will return an error (because bus.Dispatch seems to be able to return an error when oldState == newState, as described on line 67).
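
For reference, here is a minimal sketch of that pause/unpause workaround against the legacy alerting API (the URL, API key, and alert ID are placeholders):

    // Pause and immediately unpause an alert so it is re-evaluated and
    // evalMatches reflects the currently alerting series. Illustrative only.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    const (
        grafanaURL = "http://localhost:3000" // placeholder Grafana URL
        apiKey     = "REPLACE_WITH_API_KEY"  // placeholder API key
        alertID    = 13                      // placeholder alert ID
    )

    // setPaused calls POST /api/alerts/:id/pause with {"paused": true|false}.
    func setPaused(paused bool) error {
        body := []byte(fmt.Sprintf(`{"paused": %t}`, paused))
        url := fmt.Sprintf("%s/api/alerts/%d/pause", grafanaURL, alertID)
        req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
        if err != nil {
            return err
        }
        req.Header.Set("Authorization", "Bearer "+apiKey)
        req.Header.Set("Content-Type", "application/json")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("pause request failed: %s", resp.Status)
        }
        return nil
    }

    func main() {
        if err := setPaused(true); err != nil {
            panic(err)
        }
        if err := setPaused(false); err != nil {
            panic(err)
        }
    }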

@kylebrandt
Contributor

kylebrandt commented Jun 8, 2021

The new beta version of alerting in Grafana 8 (opt-in with the "ngalert" feature toggle) supports "multi-dimensional" alerting based on labels, often in combination with Server Side Expressions. So one can have multiple alert instances from a single rule. Each instance (based on its set of labels) has its own state.

For example:

[screenshot of an example alert rule]

This would create alert instances per device, instance, and job:

[screenshot of the resulting alert instances]

The exception is the "classic condition" operation within SSE, which is not per series and behaves like the pre-8 dashboard alerting conditions.

Demos etc. regarding the new V8 alerting will be in the GrafanaCONline session (online streaming) on June 16, 2021: https://grafana.com/go/grafanaconline/2021/alerting/
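
Conceptually (this is an illustration, not ngalert's actual code), per-label-set state tracking comes down to keying each instance's state by a fingerprint of its sorted labels, so a change in any one instance can trigger its own notification:

    // Illustrative per-instance state tracking keyed by a label-set fingerprint.
    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // fingerprint builds a stable key from a label set by sorting its keys.
    func fingerprint(labels map[string]string) string {
        keys := make([]string, 0, len(labels))
        for k := range labels {
            keys = append(keys, k)
        }
        sort.Strings(keys)
        parts := make([]string, 0, len(keys))
        for _, k := range keys {
            parts = append(parts, k+"="+labels[k])
        }
        return strings.Join(parts, ",")
    }

    type tracker struct {
        states map[string]string // fingerprint -> "Normal" | "Alerting"
    }

    // observe records the latest state for one instance and reports whether it
    // changed, i.e. whether a notification should fire for that instance.
    func (t *tracker) observe(labels map[string]string, state string) bool {
        key := fingerprint(labels)
        changed := t.states[key] != state
        t.states[key] = state
        return changed
    }

    func main() {
        t := &tracker{states: map[string]string{}}
        web1 := map[string]string{"instance": "web-front-01", "job": "node"}
        web2 := map[string]string{"instance": "web-front-02", "job": "node"}

        fmt.Println(t.observe(web1, "Alerting")) // true: web-front-01 starts alerting
        fmt.Println(t.observe(web2, "Alerting")) // true: web-front-02 alerts independently
        fmt.Println(t.observe(web1, "Alerting")) // false: no change for web-front-01
    }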

@pere3

pere3 commented Aug 30, 2021

I don't really understand how a completely new feature (which is currently in alpha and doesn't really follow the same alerting mechanism we are all used to, since it separates alerts from graphs after creation) closes the issue it started from.

It's also strange how this feature closes issue #11849, where the alerts API is not being updated after the "alerting" state.
