
[grafana] Alerting: add options and documentation to deploy a HA cluster #865

Merged: 3 commits merged into main on Nov 30, 2021

Conversation

@JohnnyQQQQ (Member) commented Nov 29, 2021

This PR adds the option to create a headless service for normal deployments, and adds documentation on how to set up a cluster for unified alerting.

fixes #747
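A minimal values sketch of how this fits together (assuming the new flag is exposed as headlessService; check the chart's values.yaml for the exact key, and note the rendered headless service name depends on the release name):

replicas: 3

headlessService: true

grafana.ini:
  alerting:
    enabled: false
  unified_alerting:
    enabled: true
    ha_peers: grafana-headless:9094   # adjust to the actual <release>-headless service name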

@JohnnyQQQQ changed the title from "Alerting: add options and documentation to deploy a HA cluster" to "[grafana] Alerting: add options and documentation to deploy a HA cluster" on Nov 29, 2021
@lelithium (Contributor) left a comment

Hi! Very interested in this being merged, as we've been seeing odd behavior with our current HA Grafana deployments using legacy alerts.

Thank you for your work!

Comment on lines +19 to +21
- protocol: TCP
  port: 3000
  targetPort: 3000
@lelithium (Contributor)

Shouldn't this be set to 9094 so that Grafana can query {{ Name }}-headless:9094?

@JohnnyQQQQ (Member, Author)

It doesn't really matter; the port is only there so that a normal deployment binds to the service, which it wouldn't do if no port were defined. But I can change it to 9094 so it better reflects what the service is used for.
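For reference, the ports block with that change applied would look roughly like this:

ports:
- protocol: TCP
  port: 9094        # gossip port used by unified alerting
  targetPort: 9094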

@lelithium (Contributor)

Ah, understood, that makes sense.

I'm still seeing an issue on a local copy of this. I'll comment in the main thread to discuss further.

@lelithium (Contributor)

To test this out locally, I checked out the latest release of this chart (6.17.9) and manually added the headless service from this PR on top of it:

apiVersion: v1
kind: Service
metadata:
  name: grafana-headless
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/name: grafana
    app.kubernetes.io/instance: <my release name here>
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 3000
    targetPort: 3000

We deploy the Grafana chart with HA enabled through an external PostgreSQL database, and run three replicas. I amended the grafana.ini section to match this config:

  grafana.ini:
    alerting:
      enabled: false
    unified_alerting:
      enabled: true
      ha_peers: grafana-headless:9094
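(For completeness, the external database side of the HA setup sits in the same grafana.ini block; roughly, with a placeholder host and the credentials kept out of the values file:)

    database:
      type: postgres
      host: my-postgres.example.internal:5432   # placeholder
      name: grafana
      user: grafana
      # password supplied via GF_DATABASE_PASSWORD or an existing secret, not inline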

I can see the headless service being created and taking on the right pod IPs as its Endpoints. The issue I'm seeing is a race condition between the pod creation/rollout and the headless service picking up the right endpoints.
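(A quick way to confirm the service has picked up the pods once they are Ready:)

kubectl get endpoints grafana-headless   # should list one <pod-ip>:<port> entry per Ready replica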

  1. At initial startup (first deployment, no previous pods), each Grafana pod boots up and runs into grafana-headless not having valid targets (logs cleaned up for readability):
[WARN] memberlist: Failed to resolve grafana-headless:9094: lookup grafana-headless on 172.20.0.10:53: no such host

t=2021-11-30T10:07:54+0000 lvl=info msg="component=cluster level=warn msg=refresh result=failure addr=grafana-headless:9094 err=\"1 error occurred:\\n\\t* Failed to resolve grafana-headless:9094: lookup grafana-headless on 172.20.0.10:53: no such host\\n\\n\"" logger=ngalert.multiorg.alertmanager

After the pods have finished booting up, the headless service picks them up, adds them as valid endpoints, and the unified alerting gossip mechanism starts working as expected.

  2. However, in the case of an upgrade or a rollout restart, the pods boot up, pick up the previous pods' IPs from the headless service, and get stuck in a loop trying to gossip with now-removed pods:
kubectl rollout restart deploy grafana

yields a series of

t=2021-11-30T10:17:42+0000 lvl=info msg="component=cluster level=debug memberlist=\"2021/11/30 10:17:42 [DEBUG] memberlist: Failed to join 10.0.2.200: dial tcp 10.0.2.200:9094: connect: no route to host\\n\"" logger=ngalert.multiorg.alertmanager

Where 10.0.2.200 is the IP of one of the previous pods.

Maybe there's an easy fix, or I just missed something obvious, but I couldn't find anything relevant in the unified_alerting configuration options. (https://grafana.com/docs/grafana/latest/administration/configuration/#unified_alerting)

I'm not sure how to fix this, or whether it should be addressed at the Grafana level or the chart level, though.

Let me know if there's any additional info I can add, or if you think this deserves its own issue, either here or on the Grafana repo.

@JohnnyQQQQ (Member, Author)

Yes, for now there is nothing we can do about this. We are thinking of changing the default timeouts or making them configurable. grafana/grafana#42300
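For reference, these are the gossip-related settings that already exist under unified_alerting (the values shown are roughly the current defaults; whether exposing or tuning them helps with stale peers after a rollout is what that issue is about):

grafana.ini:
  unified_alerting:
    ha_peer_timeout: 15s          # how long to wait for a peer before giving up
    ha_gossip_interval: 200ms     # interval between gossip messages
    ha_push_pull_interval: 60s    # interval for full state sync between peers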

@JohnnyQQQQ (Member, Author) commented Nov 30, 2021

To be clear, this does not affect the system in any way; it works fine. It just emits those messages for a while. This is nothing we can fix at the chart level.

@JohnnyQQQQ JohnnyQQQQ merged commit 7a2610e into main Nov 30, 2021
@JohnnyQQQQ JohnnyQQQQ deleted the ha-grafana-8-alerts branch November 30, 2021 11:57
@lelithium (Contributor)

Appreciate the feedback, thank you!

Successfully merging this pull request may close these issues:

Question: Grafana 8 alerting - HA setup