
[Bug] NotificationChannels not reconciling after Grafana restart #811

Closed · eduardobaitello opened this issue Aug 5, 2022 · 4 comments · Fixed by #813
Labels: bug (Something isn't working) · triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

eduardobaitello commented Aug 5, 2022
Describe the bug
When the Grafana deployment pod is recreated (either by deletion or eviction), the grafananotificationchannels are not reconciled by the Operator.

Version
v4.5.1

To Reproduce

For easy reproduction, I'm using Minikube + the Bitnami Grafana Operator chart (the behavior is the same in production environments).

The following values.yaml installs Grafana with Legacy Alerting and creates a Pager Duty notification channel for testing:

grafana:
  image:
    tag: 8.5.9-debian-11-r7
  config:
    # Ensure LEGACY alerting
    alerting:
      enabled: true
    unified_alerting:
      enabled: false

extraDeploy:
  - apiVersion: integreatly.org/v1alpha1
    kind: GrafanaNotificationChannel
    metadata:
      name: pager-duty-channel
      labels:
        app.kubernetes.io/instance: grafana-operator
    spec:
      name: pager-duty-channel.json
      json: >
        {
          "uid": "pager-duty-alert-notification",
          "name": "Pager Duty alert notification",
          "type": "pagerduty",
          "isDefault": true,
          "sendReminder": true,
          "frequency": "15m",
          "disableResolveMessage": true,
          "settings": {
            "integrationKey": "put key here",
            "autoResolve": true,
            "uploadImage": true
          }
        }
1. Start Minikube and install the Grafana Operator:
$ minikube start --kubernetes-version='1.24.3'

$ helm repo add bitnami https://charts.bitnami.com/bitnami

$ helm repo update

$ helm install grafana-operator bitnami/grafana-operator \
  --namespace grafana-operator --create-namespace \
  --version='2.6.10' \
  --values=values.yaml
2. In the Operator logs, check that the Notification Channel is successfully submitted:
$ kubectl logs grafana-operator-xxxxxx-xxxxx
[...]
1.6596620760552166e+09	INFO	running periodic notificationchannel resync
1.6596620761186154e+09	INFO	notificationchannel grafana-operator/pager-duty-channel successfully submitted
1.6596620761186965e+09	DEBUG	events	Normal	{"object": {"kind":"GrafanaNotificationChannel","namespace":"grafana-operator","name":"pager-duty-channel","uid":"b14e2e0d-fc16-443e-9f9a-44d078e93731","apiVersion":"integreatly.org/v1alpha1","resourceVersion":"5251"}, "reason": "Success", "message": "notificationchannel grafana-operator/pager-duty-channel successfully submitted"}
3. Force the Grafana Deployment pod to be recreated:
$ kubectl delete po grafana-deployment-xxxxxx-xxxxx
4. Once the pod is recreated, recheck the Operator logs:
1.6596622458995874e+09	INFO	running periodic dashboard resync
1.659662246054992e+09	INFO	running periodic notificationchannel resync
1.6596622482120936e+09	DEBUG	action-runner	(    0)    SUCCESS update admin credentials secret
1.6596622482158809e+09	DEBUG	action-runner	(    1)    SUCCESS update grafana service
1.659662248218911e+09	DEBUG	action-runner	(    2)    SUCCESS update grafana service account
1.6596622482220106e+09	DEBUG	action-runner	(    3)    SUCCESS update grafana config
1.659662248222039e+09	DEBUG	action-runner	(    4)    SUCCESS plugins unchanged
1.6596622482309968e+09	DEBUG	action-runner	(    5)    SUCCESS update grafana deployment
1.6596622482310247e+09	DEBUG	action-runner	(    6)    SUCCESS check deployment readiness
1.6596622482443264e+09	DEBUG	grafana-controller	desired cluster state met
1.6596622558992486e+09	INFO	running periodic dashboard resync
1.6596622560558162e+09	INFO	running periodic notificationchannel resync

Any existing grafanadashboards and grafanadatasources are recreated, but the Notification Channel is not.

5. (Optional) If the grafananotificationchannel object is deleted and recreated, the Operator detects the change and submits it to Grafana again:
$ kubectl get -o json grafananotificationchannels pager-duty-channel | kubectl replace --force -f -
grafananotificationchannel.integreatly.org "pager-duty-channel" deleted
grafananotificationchannel.integreatly.org/pager-duty-channel replaced
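
To confirm on the Grafana side (not only in the UI) that the channel really disappears after step 3, the legacy alerting HTTP API can be queried directly; the service name and credentials below are assumptions for a default operator install:

$ kubectl -n grafana-operator port-forward svc/grafana-service 3000:3000 &

# List legacy notification channels; before the pod recreation this returns
# the Pager Duty channel, afterwards it returns an empty list.
$ curl -s -u admin:<admin-password> http://localhost:3000/api/alert-notifications
[]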

Screenshots:

Right after installation:
[screenshot]

Missing notification channels after Grafana pod recreation:
[screenshot]

Recreating the grafananotificationchannel object restores the state shown in the first screenshot, which is the expected behavior.

Expected behavior
The Operator should resubmit all grafananotificationchannels to the Grafana instance whenever a pod recreation occurs, without the need to recreate the objects.

Suspect component/Location where the bug might be occurring
It may be related to Legacy Alerting.

Runtime:

  • OS: Linux
  • Grafana Operator Version: v4.5.1
  • Environment: Kubernetes / Minikube (but also reproducible in self-managed production k8s)
  • Deployment type: Bitnami Helm Chart
eduardobaitello added the bug and needs triage labels on Aug 5, 2022
weisdd (Collaborator) commented Aug 6, 2022

@eduardobaitello thanks for the comprehensive description!
From what I can see in the code, there's no logic inside the notification channel controller to check whether the channel still exists in the Grafana instance, so the controller takes no action unless the hash of the channel spec changes. Most likely it's easy to fix, as we already have similar logic in the dashboard controller. I'll take a closer look at it within a few days.
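
In rough pseudocode, the difference between the current and the intended control flow looks like this (a minimal sketch with hypothetical types and helper names, not the actual controller source):

// Sketch only: the types, fields, and helpers below are hypothetical and do
// not mirror the grafana-operator source; they illustrate the missing
// existence check described above.
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
)

type ChannelSpec struct {
	UID  string // matches the "uid" field in the channel JSON
	JSON string // raw channel definition
}

type GrafanaNotificationChannel struct {
	Spec   ChannelSpec
	Status struct{ Hash string } // hash of the last submitted spec
}

// GrafanaClient stands in for the operator's Grafana API client.
type GrafanaClient interface {
	ChannelExists(uid string) (bool, error) // e.g. GET /api/alert-notifications/uid/<uid>
	SubmitChannel(spec ChannelSpec) error   // e.g. POST /api/alert-notifications
}

func hashSpec(s ChannelSpec) string {
	sum := sha256.Sum256([]byte(s.JSON))
	return hex.EncodeToString(sum[:])
}

// reconcileChannel decides whether a channel CR must be (re)submitted.
func reconcileChannel(cr *GrafanaNotificationChannel, grafana GrafanaClient) error {
	currentHash := hashSpec(cr.Spec)

	if cr.Status.Hash == currentHash {
		// Current behavior: return nil here unconditionally. After a Grafana
		// pod restart the channel is gone, but the hash still matches, so
		// the controller never resubmits it.
		//
		// Intended behavior: also verify the channel still exists in Grafana.
		exists, err := grafana.ChannelExists(cr.Spec.UID)
		if err != nil {
			return err
		}
		if exists {
			return nil // spec unchanged and channel present: nothing to do
		}
	}

	// Spec changed, or the channel vanished from Grafana: (re)submit it.
	if err := grafana.SubmitChannel(cr.Spec); err != nil {
		return err
	}
	cr.Status.Hash = currentHash
	return nil
}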

(UPD): I have a PoC fix; I just need some time to polish it.

eduardobaitello (Author) commented

@weisdd thanks for the feedback!

If there's anything I can help with, just let me know.

pb82 added the triage/accepted label and removed the needs triage label on Aug 9, 2022
weisdd (Collaborator) commented Aug 12, 2022

@eduardobaitello I've just opened a PR; it's likely to be reviewed next week.

pb82 closed this as completed in #813 on Aug 18, 2022
eduardobaitello (Author) commented

I just tested the v4.6.0 release, and it's working now. Thanks!
