
[ngalert] Missing alert notifications in Grafana 8.1.3 and 8.1.4 #39295

Closed
grobinson-grafana opened this issue Sep 16, 2021 · 19 comments
Labels
needs investigation (for unconfirmed bugs; use type/bug for confirmed bugs, even if they "need" more investigating)

Comments

@grobinson-grafana
Contributor

What happened:

Some users are reporting missing alert notifications on Grafana 8.1.3 and 8.1.4; the notifications return after downgrading to 8.1.2.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Grafana version: 8.1.3/8.1.4
  • Data source type & version:
  • OS Grafana is installed on:
  • User OS & Browser:
  • Grafana plugins:
  • Others:
@CirnoT

CirnoT commented Sep 16, 2021

Confirmed, no email alerts since v8.1.3

[smtp]
enabled = true
host = 127.0.0.1:25
from_address = ***
from_name = ***

[alerting]
enabled = true
execute_alerts = true
error_or_timeout = alerting
nodata_or_nullvalues = no_data
concurrent_render_limit = 5
max_attempts = 3

[feature_toggles]
enable = ngalert

{
  "template_files": {},
  "alertmanager_config": {
    "route": {
      "receiver": "***"
    },
    "templates": null,
    "receivers": [
      {
        "name": "***",
        "grafana_managed_receiver_configs": [
          {
            "uid": "bB1fvweGk",
            "name": "***",
            "type": "email",
            "disableResolveMessage": false,
            "settings": {
              "addresses": "***",
              "singleEmail": false
            },
            "secureFields": {}
          }
        ]
      }
    ]
  }
}

@torkelo
Member

torkelo commented Sep 16, 2021

so 8.1.4 did not fix this? #38983

@torkelo added the area/unified-alerting and needs investigation labels Sep 16, 2021
@CirnoT

CirnoT commented Sep 16, 2021

Considering that I created this configuration for ngalert after v7.2, and that encryption does not cover email receivers, I believe this has absolutely nothing to do with the reported issue.

Others seem to confirm that the issue occurs on 8.1.4 as well (check #39009 (comment))

I also don't see how the linked PR could affect only 8.1.3+ but not 8.1.2.


On an unrelated note, the linked PR fixed the migration; however, for affected users this changes nothing, as the migration has already run.

@chaoyi996

Not notified on 8.1.3 and 8.1.4

@haeiven

haeiven commented Sep 20, 2021

Having the same issue with 8.2.0-beta1.

@ImmoWetzel

No notifications on Grafana 8.1.4 running as a Docker container either.

@staeglis

I can confirm the issue too

@ImmoWetzel

I do see
t=2021-09-20T15:30:52+0000 lvl=dbug msg="Applying default URL parsing for this data source type" logger=datasource type=prometheus url=http://10.160.7.44:9090
t=2021-09-20T15:30:52+0000 lvl=eror msg="Request Completed" logger=context userId=1 orgId=1 uname=admin method=GET path=/api/ruler/1/api/v1/rules status=500 remote_addr=172.20.62.163 time_ms=2 size=96 referer=http://10.160.7.44:3000/alerting/list

in my logs. Does this help?

@jan666

jan666 commented Sep 21, 2021

Same here (8.1.3 on FreeBSD). Email + Pushover do not work (nothing else is configured).

@grobinson-grafana
Contributor Author

I'm afraid I have not been able to reproduce this issue on either 8.1.3 or 8.1.4. It doesn't appear to be related to email notifications, as others have replied that this affects other notification types too, such as Pushover.

This is the configuration I have been using:

{
  "template_files": {},
  "alertmanager_config": {
    "route": {
      "receiver": "test-email"
    },
    "templates": null,
    "receivers": [
      {
        "name": "test-email",
        "grafana_managed_receiver_configs": [
          {
            "uid": "2c_9eVHnzz",
            "name": "test-email",
            "type": "email",
            "disableResolveMessage": false,
            "settings": {
              "addresses": "example@example.com",
              "singleEmail": false
            },
            "secureFields": {}
          }
        ]
      }
    ]
  }
}

and I created an alert rule with the following configuration:

Screenshot 2021-09-21 at 14 28 24

I can see that Grafana is attempting to send an email over SMTP to localhost (where I do not have an SMTP server listening), and I also confirmed that a webhook notification can be sent to an HTTP server I have listening on localhost.

level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="test-email/email[0]: notify retry canceled due to unrecoverable error after 1 attempts: Failed to send notification to email addresses: example@example.com: dial tcp [::1]:25: connect: connection refused"
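
For reference, a throwaway listener along these lines is enough for that webhook check (a minimal sketch in Go, not the exact server used here; the port 9094 and the /webhook path are arbitrary and just need to match the URL configured on the webhook contact point, e.g. http://localhost:9094/webhook):

package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	// Print every request Grafana delivers to the webhook contact point so
	// that notification attempts (or their absence) are visible in the terminal.
	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		defer r.Body.Close()
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("received %s %s: %s", r.Method, r.URL.Path, body)
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":9094", nil))
}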

@staeglis

@grobinson-grafana I assume the issue affects only alerts managed directly by Grafana.

@grobinson-grafana
Contributor Author

@staeglis I'm not sure; @ImmoWetzel appears to be using Prometheus? I also created an alert on a CSV data source, but I'm afraid I am receiving notifications for that alert as well, without issue.

Screenshot 2021-09-21 at 15 14 35

@ImmoWetzel

ImmoWetzel commented Sep 21, 2021 via email

@DasSkelett

I am also suffering from missing notifications for Grafana managed alerts, currently on Grafana v8.1.4.
In my case those notifications should happen via Telegram.

t=2021-09-21T18:03:14+0000 lvl=info msg="recording state cache metrics" logger=ngalert now=2021-09-21T18:03:14+0000
t=2021-09-21T18:03:17+0000 lvl=info msg="level=debug component=dispatcher msg=\"Received alert\" alert=\"Request Duration (logarithmic)[b74ae7b][resolved]\"" logger=alertmanager
t=2021-09-21T18:03:17+0000 lvl=info msg="level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=\"[Request Duration (logarithmic)[b74ae7b][resolved]]\"" logger=alertmanager
t=2021-09-21T18:06:14+0000 lvl=info msg="recording state cache metrics" logger=ngalert now=2021-09-21T18:06:14+0000
t=2021-09-21T18:06:17+0000 lvl=info msg="level=debug component=dispatcher msg=\"Received alert\" alert=\"Request Duration (logarithmic)[e4c5e85][resolved]\"" logger=alertmanager

What confuses me is that they all contain this [resolved] tag. I don't know what it means, but the alert is definitely not resolved, and the UI correctly shows it as "Firing":
UI

Happy to provide other details and logs about my instance.

@vfylyk

vfylyk commented Sep 22, 2021

On version 8.1.4, I have recreated more or less what George did. It still did not send any notifications, and from their first appearance in the logs the alerts already show as "resolved".

I created this contact point:

image

Then this notification policy:

image

And the alert with these settings:

image

image

image

This is what comes up in the logs as the first mention of these test alerts. No notification attempt is ever made.

t=2021-09-22T02:21:18+0000 lvl=info msg="level=debug component=dispatcher msg=\"Received alert\" alert=Test[aee0740][resolved]" logger=alertmanager
t=2021-09-22T02:21:18+0000 lvl=info msg="level=debug component=dispatcher msg=\"Received alert\" alert=Test[3063b92][resolved]" logger=alertmanager
t=2021-09-22T02:21:29+0000 lvl=info msg="recording state cache metrics" logger=ngalert now=2021-09-22T02:21:29+0000
t=2021-09-22T02:21:44+0000 lvl=info msg="recording state cache metrics" logger=ngalert now=2021-09-22T02:21:44+0000
t=2021-09-22T02:21:48+0000 lvl=info msg="level=debug component=dispatcher aggrGroup=\"{}/{alertkey=\\\"test\\\"}:{}\" msg=flushing alerts=\"[Test[aee0740][resolved] Test[3063b92][resolved]]\"" logger=alertmanager
t=2021-09-22T02:21:59+0000 lvl=info msg="recording state cache metrics" logger=ngalert now=2021-09-22T02:21:59+0000
t=2021-09-22T02:22:14+0000 lvl=info msg="recording state cache metrics" logger=ngalert now=2021-09-22T02:22:14+0000

@jan666

jan666 commented Sep 22, 2021

I created a simple test alert:

Alert name = Test
Alert type = Grafana managed Alert
Folder = General Alerting

Two contact points (Pushover and email), which worked before 8.1.3.

Debug output:

t=2021-09-22T07:25:05+0200 lvl=dbug msg="Influxdb request" logger=tsdb.influxdb url="[REDACTED]"
t=2021-09-22T07:25:05+0200 lvl=dbug msg="expression datasource query (seriesSet)" logger=expr query=A
t=2021-09-22T07:25:05+0200 lvl=dbug msg="state manager processing evaluation results" logger=ngalert uid=BlTgAA47k resultCount=1
t=2021-09-22T07:25:05+0200 lvl=dbug msg="setting alert state" logger=ngalert uid=BlTgAA47k
t=2021-09-22T07:25:05+0200 lvl=dbug msg="saving alert states" logger=ngalert count=1
t=2021-09-22T07:25:06+0200 lvl=dbug msg="sending alerts to notifier" logger=ngalert count=0 alerts=[]
t=2021-09-22T07:25:08+0200 lvl=dbug msg="new alert rule version fetched" logger=ngalert title=Test key="{orgID: 1, UID: -kkTyHHnk}" version=2
t=2021-09-22T07:25:08+0200 lvl=dbug msg="Received a query request" logger=tsdb.influxdb numQueries=1
t=2021-09-22T07:25:08+0200 lvl=dbug msg="Making a non-Flux type query" logger=tsdb.influxdb
t=2021-09-22T07:25:08+0200 lvl=dbug msg="Influxdb request" logger=tsdb.influxdb url="[REDACTED]"
t=2021-09-22T07:25:08+0200 lvl=dbug msg="expression datasource query (seriesSet)" logger=expr query=A
t=2021-09-22T07:25:08+0200 lvl=dbug msg="state manager processing evaluation results" logger=ngalert uid=-kkTyHHnk resultCount=1
t=2021-09-22T07:25:08+0200 lvl=dbug msg="setting alert state" logger=ngalert uid=-kkTyHHnk
t=2021-09-22T07:25:08+0200 lvl=dbug msg="saving alert states" logger=ngalert count=1
t=2021-09-22T07:25:08+0200 lvl=dbug msg="alert state changed creating annotation" logger=ngalert alertRuleUID=-kkTyHHnk newState=Alerting
t=2021-09-22T07:25:08+0200 lvl=dbug msg="sending alerts to notifier" logger=ngalert count=1 alerts="[{Annotations:map[__value_string__:[ metric='messung.Sensor1' labels={} value=21.5 ]] EndsAt:2021-09-22T07:25:00.782+02:00 StartsAt:2021-09-22T07:25:00.782+02:00 Alert:{GeneratorURL:[REDACTED] Labels:map[__alert_rule_namespace_uid__:vs37g6g7k __alert_rule_uid__:-kkTyHHnk alertname:Test]}}]"
t=2021-09-22T07:25:08+0200 lvl=info msg="level=debug component=dispatcher msg=\"Received alert\" alert=Test[213cdf7][resolved]" logger=alertmanager
t=2021-09-22T07:25:10+0200 lvl=dbug msg="alert rules fetched" logger=ngalert count=4

I just downgraded to 8.1.2 and changed nothing else. After restarting I immediately received the Pushover notification:

t=2021-09-22T07:43:17+0200 lvl=dbug msg="new alert rule version fetched" logger=ngalert title=Test key="{orgID: 1, UID: -kkTyHHnk}" version=5
t=2021-09-22T07:43:17+0200 lvl=dbug msg="Received a query request" logger=tsdb.influxdb numQueries=1
t=2021-09-22T07:43:17+0200 lvl=dbug msg="Making a non-Flux type query" logger=tsdb.influxdb
t=2021-09-22T07:43:17+0200 lvl=dbug msg="Influxdb request" logger=tsdb.influxdb url="[REDACTED]"
t=2021-09-22T07:43:17+0200 lvl=dbug msg="expression datasource query (seriesSet)" logger=expr query=A
t=2021-09-22T07:43:17+0200 lvl=dbug msg="state manager processing evaluation results" logger=ngalert uid=-kkTyHHnk resultCount=1
t=2021-09-22T07:43:17+0200 lvl=dbug msg="setting alert state" logger=ngalert uid=-kkTyHHnk
t=2021-09-22T07:43:17+0200 lvl=dbug msg="saving alert states" logger=ngalert count=1
t=2021-09-22T07:43:17+0200 lvl=dbug msg="alert state changed creating annotation" logger=ngalert alertRuleUID=-kkTyHHnk newState=Alerting
t=2021-09-22T07:43:17+0200 lvl=dbug msg="sending alerts to notifier" logger=ngalert count=1 alerts="[{Annotations:map[__value_string__:[ metric='messung.Sensor1' labels={} value=21.7 ]] EndsAt:2021-09-22T07:45:09.528+02:00 StartsAt:2021-09-22T07:43:09.528+02:00 Alert:{GeneratorURL:[REDACTED] Labels:map[__alert_rule_namespace_uid__:vs37g6g7k __alert_rule_uid__:-kkTyHHnk alertname:Test]}}]"
t=2021-09-22T07:43:17+0200 lvl=info msg="level=debug component=dispatcher msg=\"Received alert\" alert=Test[213cdf7][active]" logger=alertmanager
t=2021-09-22T07:43:19+0200 lvl=dbug msg="alert rules fetched" logger=ngalert count=4
t=2021-09-22T07:43:24+0200 lvl=info msg="recording state cache metrics" logger=ngalert now=2021-09-22T07:43:24+0200
t=2021-09-22T07:43:29+0200 lvl=dbug msg="alert rules fetched" logger=ngalert count=4
t=2021-09-22T07:43:39+0200 lvl=info msg="recording state cache metrics" logger=ngalert now=2021-09-22T07:43:39+0200
t=2021-09-22T07:43:39+0200 lvl=dbug msg="alert rules fetched" logger=ngalert count=4
t=2021-09-22T07:43:39+0200 lvl=dbug msg="neither config nor template have changed, skipping configuration sync." logger=alertmanager
t=2021-09-22T07:43:47+0200 lvl=info msg="level=debug component=dispatcher aggrGroup={}:{} msg=flushing alerts=[Test[213cdf7][active]]" logger=alertmanager
t=2021-09-22T07:43:47+0200 lvl=dbug msg="Sending webhook" logger=notifications url=https://api.pushover.net/1/messages.json http method=POST
t=2021-09-22T07:43:48+0200 lvl=dbug msg="Webhook succeeded" logger=notifications url=https://api.pushover.net/1/messages.json statuscode="200 OK"
t=2021-09-22T07:43:48+0200 lvl=info msg="level=debug component=dispatcher receiver=pushover-jan integration=pushover[0] msg=\"Notify success\" attempts=1" logger=alertmanager
t=2021-09-22T07:43:49+0200 lvl=dbug msg="alert rules fetched" logger=ngalert count=4
t=2021-09-22T07:43:54+0200 lvl=info msg="recording state cache metrics" logger=ngalert now=2021-09-22T07:43:54+0200
t=2021-09-22T07:43:59+0200 lvl=dbug msg="alert rules fetched" logger=ngalert count=4

Maybe a side note, but the "Notifications" tab in 8.1.3 was empty. In 8.1.2 there is a list of groups, and "No grouping" contains the firing alert.

@staeglis

staeglis commented Sep 22, 2021

Maybe a side note, but the "Notifications" tab in 8.1.3 was empty. In 8.1.2 there is a list of groups, and "No grouping" contains the firing alert.

I've observed this too

@grobinson-grafana
Contributor Author

I have found the issue and identified the change in 8.1.3 that caused it. The bug occurs when an alert rule is configured to evaluate at an interval greater than the interval at which Grafana resends firing alerts to the internal Alertmanager to keep them from resolving, which in the 8.1.3 and 8.1.4 releases is 30 seconds. As a workaround in 8.1.3 and 8.1.4, set the evaluation interval of the affected alert rules to 30 seconds or lower.
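
To illustrate the interaction (a simplified sketch only, not the actual Grafana or Alertmanager code; the way EndsAt is derived here is an assumption made for the example):

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		resendInterval = 30 * time.Second // how often Grafana resends firing alerts in 8.1.3/8.1.4
		evalInterval   = 1 * time.Minute  // the rule only re-evaluates (and re-sends) this often
	)

	evaluatedAt := time.Now()
	// Assumption for illustration: the alert's validity window (EndsAt) is tied
	// to the resend interval rather than to the rule's evaluation interval.
	endsAt := evaluatedAt.Add(resendInterval)

	// By the time the rule fires again, the previous alert instance has already expired.
	nextEvaluation := evaluatedAt.Add(evalInterval)

	// The Alertmanager treats an alert whose EndsAt is not after "now" as
	// resolved, so no "firing" notification is dispatched.
	resolved := !endsAt.After(nextEvaluation)
	fmt.Printf("resolved=%v (EndsAt=%s, next evaluation=%s)\n",
		resolved, endsAt.Format(time.RFC3339), nextEvaluation.Format(time.RFC3339))
	// Prints resolved=true, which matches the "[resolved]" entries in the logs
	// above; with an evaluation interval of 30 seconds or less the alert is
	// refreshed before EndsAt passes, which is why the workaround restores notifications.
}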

@jan666

jan666 commented Sep 22, 2021

Can confirm. After setting "Evaluate every" to "20s" (instead of 1m), it works again (8.1.3).

Thanks!
