
[Bug] NotificationChannels not reconciling after Grafana restart #811

Closed · eduardobaitello opened this issue Aug 5, 2022 · 4 comments · Fixed by #813
Labels: bug (Something isn't working) · triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

eduardobaitello commented Aug 5, 2022
Describe the bug
When the Grafana deployment pod is recreated (either by deletion or eviction), the grafananotificationchannels are not reconciled by the Operator.

Version
v4.5.1

To Reproduce

For easy reproduction, I'm using Minikube + the Bitnami Grafana Operator chart (the behavior is the same in production environments).

The following values.yaml installs Grafana with Legacy Alerting and creates a Pager Duty notification channel for testing:

grafana:
  image:
    tag: 8.5.9-debian-11-r7
  config:
    # Ensure LEGACY alerting
    alerting:
      enabled: true
    unified_alerting:
      enabled: false

extraDeploy:
  - apiVersion: integreatly.org/v1alpha1
    kind: GrafanaNotificationChannel
    metadata:
      name: pager-duty-channel
      labels:
        app.kubernetes.io/instance: grafana-operator
    spec:
      name: pager-duty-channel.json
      json: >
        {
          "uid": "pager-duty-alert-notification",
          "name": "Pager Duty alert notification",
          "type": "pagerduty",
          "isDefault": true,
          "sendReminder": true,
          "frequency": "15m",
          "disableResolveMessage": true,
          "settings": {
            "integrationKey": "put key here",
            "autoResolve": true,
            "uploadImage": true
          }
        }
1. Start Minikube and install the Grafana Operator:
$ minikube start --kubernetes-version='1.24.3'

$ helm repo add bitnami https://charts.bitnami.com/bitnami

$ helm repo update

$ helm install grafana-operator bitnami/grafana-operator \
  --namespace grafana-operator --create-namespace \
  --version='2.6.10' \
  --values=values.yaml
2. In the Operator logs, check that the Notification Channel is successfully submitted:
$ kubectl logs grafana-operator-xxxxxx-xxxxx
[...]
1.6596620760552166e+09	INFO	running periodic notificationchannel resync
1.6596620761186154e+09	INFO	notificationchannel grafana-operator/pager-duty-channel successfully submitted
1.6596620761186965e+09	DEBUG	events	Normal	{"object": {"kind":"GrafanaNotificationChannel","namespace":"grafana-operator","name":"pager-duty-channel","uid":"b14e2e0d-fc16-443e-9f9a-44d078e93731","apiVersion":"integreatly.org/v1alpha1","resourceVersion":"5251"}, "reason": "Success", "message": "notificationchannel grafana-operator/pager-duty-channel successfully submitted"}
3. Force the Grafana Deployment pod to be recreated:
$ kubectl delete po grafana-deployment-xxxxxx-xxxxx
4. Once the pod is recreated, recheck the Operator logs:
1.6596622458995874e+09	INFO	running periodic dashboard resync
1.659662246054992e+09	INFO	running periodic notificationchannel resync
1.6596622482120936e+09	DEBUG	action-runner	(    0)    SUCCESS update admin credentials secret
1.6596622482158809e+09	DEBUG	action-runner	(    1)    SUCCESS update grafana service
1.659662248218911e+09	DEBUG	action-runner	(    2)    SUCCESS update grafana service account
1.6596622482220106e+09	DEBUG	action-runner	(    3)    SUCCESS update grafana config
1.659662248222039e+09	DEBUG	action-runner	(    4)    SUCCESS plugins unchanged
1.6596622482309968e+09	DEBUG	action-runner	(    5)    SUCCESS update grafana deployment
1.6596622482310247e+09	DEBUG	action-runner	(    6)    SUCCESS check deployment readiness
1.6596622482443264e+09	DEBUG	grafana-controller	desired cluster state met
1.6596622558992486e+09	INFO	running periodic dashboard resync
1.6596622560558162e+09	INFO	running periodic notificationchannel resync

Any existing grafanadashboards and grafanadatasources are recreated, but the Notification Channel is not.

5. (Optional) If the grafananotificationchannel object is deleted and recreated, the Operator detects the change and submits it to Grafana again:
$ kubectl get -o json grafananotificationchannels pager-duty-channel | kubectl replace --force -f -
grafananotificationchannel.integreatly.org "pager-duty-channel" deleted
grafananotificationchannel.integreatly.org/pager-duty-channel replaced
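
To confirm on the Grafana side (not only in the UI) that the channel really disappears after step 3, the legacy alerting HTTP API can be queried directly; the service name and credentials below are assumptions for a default operator install:

$ kubectl -n grafana-operator port-forward svc/grafana-service 3000:3000 &

# List legacy notification channels; before the pod recreation this returns
# the Pager Duty channel, afterwards it returns an empty list.
$ curl -s -u admin:<admin-password> http://localhost:3000/api/alert-notifications
[]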

Screenshots:

Right after installation:
[screenshot]

Missing notification channels after Grafana pod recreation:
[screenshot]

Recreating the grafananotificationchannel object restores the state shown in the first screenshot, which is the expected behavior.

Expected behavior
The Operator should resubmit all grafananotificationchannels to the Grafana instance whenever a pod recreation occurs, without the need to recreate the objects.

Suspect component/Location where the bug might be occurring
It may be related to Legacy Alerting.

Runtime:

  • OS: Linux
  • Grafana Operator Version: v4.5.1
  • Environment: Kubernetes / Minikube (but also reproducible in self-managed production k8s)
  • Deployment type: Bitnami Helm Chart
eduardobaitello added the bug and needs triage labels on Aug 5, 2022
weisdd (Collaborator) commented Aug 6, 2022

@eduardobaitello thanks for the comprehensive description!
From what I can see in the code, there's no logic inside the notification channel controller to check whether the channel still exists in the Grafana instance, so the controller takes no action unless the hash of the channel spec changes. Most likely it's easy to fix, as we already have similar logic in the dashboard controller. I'll take a closer look at it within a few days.
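
In rough pseudocode, the difference between the current and the intended control flow looks like this (a minimal sketch with hypothetical types and helper names, not the actual controller source):

// Sketch only: the types, fields, and helpers below are hypothetical and do
// not mirror the grafana-operator source; they illustrate the missing
// existence check described above.
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
)

type ChannelSpec struct {
	UID  string // matches the "uid" field in the channel JSON
	JSON string // raw channel definition
}

type GrafanaNotificationChannel struct {
	Spec   ChannelSpec
	Status struct{ Hash string } // hash of the last submitted spec
}

// GrafanaClient stands in for the operator's Grafana API client.
type GrafanaClient interface {
	ChannelExists(uid string) (bool, error) // e.g. GET /api/alert-notifications/uid/<uid>
	SubmitChannel(spec ChannelSpec) error   // e.g. POST /api/alert-notifications
}

func hashSpec(s ChannelSpec) string {
	sum := sha256.Sum256([]byte(s.JSON))
	return hex.EncodeToString(sum[:])
}

// reconcileChannel decides whether a channel CR must be (re)submitted.
func reconcileChannel(cr *GrafanaNotificationChannel, grafana GrafanaClient) error {
	currentHash := hashSpec(cr.Spec)

	if cr.Status.Hash == currentHash {
		// Current behavior: return nil here unconditionally. After a Grafana
		// pod restart the channel is gone, but the hash still matches, so
		// the controller never resubmits it.
		//
		// Intended behavior: also verify the channel still exists in Grafana.
		exists, err := grafana.ChannelExists(cr.Spec.UID)
		if err != nil {
			return err
		}
		if exists {
			return nil // spec unchanged and channel present: nothing to do
		}
	}

	// Spec changed, or the channel vanished from Grafana: (re)submit it.
	if err := grafana.SubmitChannel(cr.Spec); err != nil {
		return err
	}
	cr.Status.Hash = currentHash
	return nil
}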

(UPD): I have a PoC fix; I just need some time to polish it.

eduardobaitello (Author) commented

@weisdd thanks for the feedback!

If there's anything I can help with, just let me know.

pb82 added the triage/accepted label and removed the needs triage label on Aug 9, 2022
weisdd (Collaborator) commented Aug 12, 2022

@eduardobaitello I've just opened a PR; it's likely to be reviewed next week.

pb82 closed this as completed in #813 on Aug 18, 2022
eduardobaitello (Author) commented

I just tested the v4.6.0 release, and it's working now. Thanks!
