
[grafana] Alerting: add options and documentation to deploy a HA cluster #865

Merged: 3 commits merged into main on Nov 30, 2021

Conversation

@JohnnyQQQQ (Member) commented Nov 29, 2021

This PR adds the option to create a headless service for normal deployments, and adds documentation on how to set up a cluster for unified alerting.

fixes #747
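A minimal values sketch of how this fits together (assuming the new flag is exposed as headlessService; check the chart's values.yaml for the exact key, and note the rendered headless service name depends on the release name):

replicas: 3

headlessService: true

grafana.ini:
  alerting:
    enabled: false
  unified_alerting:
    enabled: true
    ha_peers: grafana-headless:9094   # adjust to the actual <release>-headless service name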

@JohnnyQQQQ changed the title from "Alerting: add options and documentation to deploy a HA cluster" to "[grafana] Alerting: add options and documentation to deploy a HA cluster" on Nov 29, 2021
@lelithium (Contributor) left a comment

Hi! Very interested in this being merged, as we've been seeing odd behavior with our current HA Grafana deployments using legacy alerts.

Thank you for your work!

Comment on lines +19 to +21
- protocol: TCP
  port: 3000
  targetPort: 3000
@lelithium (Contributor)

Shouldn't this be set to 9094 so that Grafana can query {{ Name }}-headless:9094?

@JohnnyQQQQ (Member, Author)

It doesn't really matter; the port is only there so that a normal deployment binds to the service, which it wouldn't do if no port were defined. But I can change it to 9094 so it better reflects what the service is used for.
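For reference, the ports block with that change applied would look roughly like this:

ports:
- protocol: TCP
  port: 9094        # gossip port used by unified alerting
  targetPort: 9094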

@lelithium (Contributor)

Ah, understood, that makes sense.

I'm still seeing an issue on a local copy of this. I'll comment in the main thread to discuss further.

@lelithium (Contributor)

To test this out locally, I checked out the latest release of this chart (6.17.9) and manually added the headless service from this PR on top of it:

apiVersion: v1
kind: Service
metadata:
  name: grafana-headless
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/name: grafana
    app.kubernetes.io/instance: <my release name here>
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 3000
    targetPort: 3000

We deploy the Grafana chart with HA enabled through an external PostgreSQL database, and run three replicas. I amended the grafana.ini section to match this config:

  grafana.ini:
    alerting:
      enabled: false
    unified_alerting:
      enabled: true
      ha_peers: grafana-headless:9094
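(For completeness, the external database side of the HA setup sits in the same grafana.ini block; roughly, with a placeholder host and the credentials kept out of the values file:)

    database:
      type: postgres
      host: my-postgres.example.internal:5432   # placeholder
      name: grafana
      user: grafana
      # password supplied via GF_DATABASE_PASSWORD or an existing secret, not inline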

I can see the headless service being created and taking on the right pod IPs as its Endpoints. The issue I'm seeing is a race condition between the pod creation/rollout and the headless service picking up the right endpoints.
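(A quick way to confirm the service has picked up the pods once they are Ready:)

kubectl get endpoints grafana-headless   # should list one <pod-ip>:<port> entry per Ready replica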

  1. At initial startup (first deployment, no previous pods), each Grafana pod boots up and runs into grafana-headless not having valid targets (logs cleaned up for readability):
[WARN] memberlist: Failed to resolve grafana-headless:9094: lookup grafana-headless on 172.20.0.10:53: no such host

t=2021-11-30T10:07:54+0000 lvl=info msg="component=cluster level=warn msg=refresh result=failure addr=grafana-headless:9094 err=\"1 error occurred:\\n\\t* Failed to resolve grafana-headless:9094: lookup grafana-headless on 172.20.0.10:53: no such host\\n\\n\"" logger=ngalert.multiorg.alertmanager

After the pods have finished booting up, the headless service picks them up, adds them as valid endpoints, and the unified alerting gossip mechanism starts working as expected.

  2. However, in the case of an upgrade or a rollout restart, the pods boot up, pick up the previous pods' IPs from the headless service, and get stuck in a loop trying to gossip with now-removed pods:
kubectl rollout restart deploy grafana

yields a series of

t=2021-11-30T10:17:42+0000 lvl=info msg="component=cluster level=debug memberlist=\"2021/11/30 10:17:42 [DEBUG] memberlist: Failed to join 10.0.2.200: dial tcp 10.0.2.200:9094: connect: no route to host\\n\"" logger=ngalert.multiorg.alertmanager

Where 10.0.2.200 is the IP of one of the previous pods.

Maybe there's an easy fix, or I just missed something obvious, but I couldn't find anything relevant in the unified_alerting configuration options. (https://grafana.com/docs/grafana/latest/administration/configuration/#unified_alerting)

I'm not sure how to fix this, or whether it should be addressed at the Grafana level or the chart level, though.

Let me know if there's any additional info I can add, or if you think this deserves its own issue, either here or on the Grafana repo.

@JohnnyQQQQ (Member, Author)

Yes, for now there is nothing we can do about this. We are thinking of changing the default timeouts or making them configurable. grafana/grafana#42300
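For reference, these are the gossip-related settings that already exist under unified_alerting (the values shown are roughly the current defaults; whether exposing or tuning them helps with stale peers after a rollout is what that issue is about):

grafana.ini:
  unified_alerting:
    ha_peer_timeout: 15s          # how long to wait for a peer before giving up
    ha_gossip_interval: 200ms     # interval between gossip messages
    ha_push_pull_interval: 60s    # interval for full state sync between peers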

@JohnnyQQQQ (Member, Author) commented Nov 30, 2021

To be clear, this does not affect the system in any way; it works fine. It just emits those messages for a while. This is nothing we can fix at the chart level.

@JohnnyQQQQ JohnnyQQQQ merged commit 7a2610e into main Nov 30, 2021
@JohnnyQQQQ JohnnyQQQQ deleted the ha-grafana-8-alerts branch November 30, 2021 11:57
@lelithium (Contributor)

Appreciate the feedback, thank you!

Successfully merging this pull request may close these issues:

Question: Grafana 8 alerting - HA setup