
Add support for heartbeat #444

Closed · wants to merge 1 commit

Conversation

@ben51
Contributor

ben51 commented Aug 3, 2016

  • Send an HTTP request to specific hosts so they know the current instance is
    running
  • Only implements OpsGenie API


@fabxc
Contributor

fabxc commented Aug 8, 2016

Interesting. I know @grobie and @matthiasr were thinking of something like this. Maybe they want to chime in.

@grobie
Member

grobie commented Aug 8, 2016

We do have a setup for this at SoundCloud. We have one alert which is always firing and use a dead man's switch service to detect when that alert doesn't reach our notification system (PagerDuty). I think @matthiasr contacted PagerDuty and they plan to support that natively.

It's important to know when the Alertmanager is down, so more support for that is good, I think. Having an independent heartbeat routine won't catch a few other issues (e.g. Prometheus servers can't reach the Alertmanager, the Alertmanager can't send out notifications, etc.), but I guess that's a good start.

return &OpsGenie{
	conf:   conf,
	done:   make(chan struct{}),
	ticker: time.NewTicker(time.Duration(conf.Interval)),
Member

This will leak ticker routines, as these don't get stopped.

I don't see the need or advantage of making Tick() part of the interface. I'd just expose the interval and keep the ticker local to the Run() method.

Contributor Author

Indeed, I didn't see this... I made the Tick() method because it is not possible, AFAIK, to expose attributes on an interface. I could create a HeartbeatProvider interface and a Heartbeat struct to implement this, or expose the interval in the HeartbeatRunner on a per-implementation basis (e.g. map[string]struct{ done chan struct{}; interval time.Duration }).

Member

I see. Well, you could stop the ticker in the Stop() method, but I'd really keep that local. What about a func Interval() time.Duration method on the interface?
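
For illustration, a minimal sketch of that shape (the Heartbeater interface and runHeartbeat function below are made-up example names, not the PR's actual code):

package main

import (
	"context"
	"log"
	"time"
)

// Heartbeater exposes only its interval; the ticker stays local to the
// runner loop and is stopped when the loop exits, so nothing leaks.
type Heartbeater interface {
	Interval() time.Duration
	Beat(ctx context.Context) error
}

func runHeartbeat(ctx context.Context, hb Heartbeater) {
	ticker := time.NewTicker(hb.Interval())
	defer ticker.Stop() // no leaked ticker when the loop returns

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := hb.Beat(ctx); err != nil {
				log.Printf("heartbeat failed: %v", err)
			}
		}
	}
}

// logHeartbeat is a stand-in implementation for the example.
type logHeartbeat struct{ interval time.Duration }

func (h logHeartbeat) Interval() time.Duration        { return h.interval }
func (h logHeartbeat) Beat(ctx context.Context) error { log.Println("beat"); return nil }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	runHeartbeat(ctx, logHeartbeat{interval: time.Second})
}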

Contributor Author

Done

- Send an HTTP request to specific hosts so they know the current instance is
  running
- Only implements OpsGenie API

Conf sample:

heartbeats:
  - name: 'opsgenie-hb'
    opsgenie_configs:
    - api_key: 'f5ec0d1f-5978-11e6-881f-64006a5a8984'
      interval: '10m'
      name: 'my-app-heartbeat'

Change-Id: Ied9d45ea630362ed6ed17845506db7b5b21919a4
@ben51
Contributor Author

ben51 commented Sep 6, 2016

Hey, what's the status on this PR? Are you considering merging it, or do you require some more work from me?
Cheers

@fabxc
Contributor

fabxc commented Sep 6, 2016

Hey, thanks for pinging again about this.

I think generally a heartbeat feature is very useful. As grobie said, it does not catch a case where Prometheus servers cannot reach any Alertmanager. But this would only be thoroughly validated if each Prometheus server had its own alert and we checked each of them.
I think the connectivity should generally be addressed by an HA AM setup and meta monitoring.

But we should discuss this a bit further before adding a new feature.

@brian-brazil

@brian-brazil
Contributor

This should be covered by general meta-monitoring; this approach has the usual issues with bottom-up push-based monitoring and is likely to trigger false positives in an HA scenario, as no single AM is a SPOF.

This approach also seems a bit OpsGenie-specific. The generic version of this would be normal Prometheus scraping, for example.

@fabxc
Contributor

fabxc commented Sep 6, 2016

Mh, not sure I understand.

A heartbeat would allow me to get a page when Alertmanager cannot reach the integration. How is that covered by meta monitoring? My meta monitoring will send an alert to my Alertmanager, which cannot send out the notification... and then what?

In an HA setup, you'd have to deduplicate heartbeats by name, I suppose, and it would only trigger if no AM sends anything on that named heartbeat channel anymore. So it wouldn't be coupled to single AM instances.

@brian-brazil
Contributor

What you really need is an end-to-end test, and usually the issue is with one AM, so the other AMs are still working.

@fabxc
Contributor

fabxc commented Sep 6, 2016

I think we are talking about different intents behind this. I can run 10 fully functional AMs in my DC, but if it's cut off from the Internet, there's no meta monitoring that can help me.

FWIW, most users are small-scale, and telling them to run a separate meta-monitoring Prometheus+AM in another DC is not a good story for a fully functional and easy-to-deploy monitoring setup (one of our core features). Especially if integrations provide such heartbeat/dead-man's-switch functionality already.

@brian-brazil
Contributor

Meta-monitoring should catch the DC going down, otherwise you don't have full meta-monitoring.

Meta-monitoring is everything you need to catch your monitoring going down.

I think something more along the lines of @grobie's lightning talk at PromCon is what we should be looking at.

@fabxc
Contributor

fabxc commented Sep 6, 2016

That was @matthiasr's talk IIRC.

Conceptually it wasn't too different – it also was a heartbeat/dead-man's-switch – just more Prometheus native and more end-to-end. But in the end, by your reasoning this seems just as obsolete – with full meta-monitoring you'd just directly send an alert notification that your monitoring is not working rather than going through the entire contraption described.

So not sure I understand what you are leaning towards exactly.

@brian-brazil
Contributor

The heartbeat here as proposed only partially tests that one AM can talk to the notification mechanism.

The lightning talk tested that the monitoring system as a whole can talk to the notification mechanism. This much more accurately represents what you want to monitor.

@fabxc
Contributor

fabxc commented Sep 6, 2016

Yes, but the path Prometheus->Alertmanager can be verified from within the data center via meta monitoring. If that is set up, the only missing piece (when not wanting to require users to set up cross-DC meta monitoring) is the Alertmanager->Internet route.

This gap could be filled by having Prometheus heartbeat to an integration. I agree that the other approach covers more and works just as nicely. More importantly, it doesn't introduce another concept.
In the end, everything behind the AM was just quite a contraption to get the dead man's switch working. So being able to work with a heartbeat API that does what we want would be neat.

Of course people can add their own webhook again that sends some JSON to whatever heartbeat API on an alert notification. But for such trivial stuff, maybe it's worth revisiting whether we should allow assembling custom JSON via templates.
In the case of OpsGenie it's literally:

curl -XPOST 'https://api.opsgenie.com/v1/json/heartbeat/send' -d '
{
     "apiKey": "eb243592-faa2-4ba2-a551q-1afdf565c889",
     "name" : "host1"
}'

And in this case we have named heartbeats. So it does not verify that one AM is working, as the entire AM cluster would send heartbeats and they are not differentiated by source IP or similar.

@brian-brazil
Contributor

Yes, but the path Prometheus->Alertmanager can be verified from within the data center via meta monitoring.

This is an offline-processing system; attempting to monitor each hop individually will miss things. An end-to-end heartbeat is desirable.

(when not wanting to require users to setup cross-DC meta monitoring)

I'm not sure we should have this as a primary goal. Everything we're talking about here should be some form of cross-DC monitoring.

In the end everything behind AM was just quite a contraption to get the dead-man's-switch working.

This is actually one of the simplest solutions I've seen to this general class of problem.

@fabxc
Contributor

fabxc commented Sep 7, 2016

This is actually one of the simplest solutions I've seen to this general class of problem.

Not sure what you mean. If PagerDuty just had an incident handler that is inverted, i.e. pages you if it doesn't receive alerts, this would be super straightforward. The contraption of using 2 or 3 different SaaS providers after leaving Alertmanager has nothing to do with this being a complex problem class – it's simply a gap in the offerings of PD, OpsGenie, etc.

And arguably the OpsGenie heartbeat functionality is exactly that, just with its own tiny API that is unfortunately not equal to the incident API. You skipped the essential part of my last response.

I'm not sure we should have this as a primary goal.

Nobody said it was a primary goal, but it's a use case we have to consider. Not every company allows their sysadmins to run something in the cloud. And when on-site is just a single site, there must be a way to do it.

I'm not sure what we are talking about anymore, to be honest. Your last few responses are giving contradictory signals.

@brian-brazil
Contributor

Not every company allows their sysadmins to run something in the cloud. And when on-site is just a single site, there must be a way to do it.

It's not possible to do meta-monitoring without the use of a provider outside your network, so I don't think this is a reasonable design restriction. This PR presumes a cloud service (OpsGenie), for example.

Your last few responses are giving contradicting signals.

I believe I've been consistent. The OpsGenie heartbeat would offer testing of the AM->OpsGenie path in some cases, but there's more to meta monitoring than testing that one hop of the alerting path, and not everyone uses OpsGenie.

An end-to-end test that starts at Prometheus and ends outside your network is what's needed. The Alertmanager sending heartbeats itself is not the right way to provide this feature, and may lull users into a false sense of security.

@fabxc
Contributor

fabxc commented Sep 7, 2016

This was meant to be a general discussion rather than about accepting exactly this PR. Probably should have moved this into its own issue for clarity. Last response here to finish this.

You are arguing against what I'm saying when I don't even disagree with you.

I've explicitly said that the end-to-end solution is better and proposed that we find a way to integrate our general notifications with such heartbeat APIs, which was ignored.
I also didn't say, or mean to say, that it would replace meta-monitoring in any way.

So one final time to hopefully clarify this:

We want some sort of heartbeat/dead-man's-switch functionality. You do apparently agree with that. The end-to-end variant is exactly that while validating another part of the pipeline along the way, which is great of course.
The lightning talk by @matthiasr has shown that there's a gap in direct support by SaaS integrations. But some (like OpsGenie) do have dedicated heartbeat APIs with the correct paging semantics. The problem is, they need a specific JSON body rather than the standard notification.
So, do we want to provide a way to integrate with that without yet another custom webhook handler or not?

Please open an issue if you want to discuss this further without spinning in circles.

@brian-brazil
Contributor

So, do we want to provide a way to integrate with that without yet another custom webhook handler or not?

I don't think there's enough standardisation in this space to know how to approach this. I've come across about 5 of these from various providers, and I suspect most of them could take a webhook directly as they tend to ignore the body. So for now webhook it is I think.

For this particular PR I'm against as it's approaching the problem at the wrong level. Do you think we should accept this PR?

@fabxc
Contributor

fabxc commented Sep 8, 2016

In the OpsGenie case, the body specifies the name of the heartbeat and cannot be left out.

At this early stage adding such an exposed feature to AM is probably unwise. So no to this PR for now.
I'd like to keep the general problem in mind for the future, though. I'd mostly see it realized by allowing entire request bodies to be templated, so things like this can be implemented directly via the regular notification mechanism.

@brian-brazil
Contributor

I think there was one proposal (from you?) for JSON templating that didn't look too terrible. I'm mainly concerned on complexity and support grounds, as it'd be far more powerful/complex than what our current templating offers (which already confuses users), and we'd have to deal with arbitrarily different HTTP response codes/bodies that need special handling. That'd take a good chunk of software engineering to get sane.

@fabxc
Contributor

fabxc commented Sep 8, 2016

Yes, I mentioned that before and further up in this discussion. There's http://jsonnet.org/, which seems rather sane but lacks Go support.

An alternative would be to specify a request body structure directly in the YAML configuration, where you can use templated strings for values. At least this won't require people to generate valid JSON via Go templates, which also use { as part of their syntax (the delimiters can be overridden, but that would be a nasty inconsistency).
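
Purely as an illustration of that idea (the body field below is hypothetical and does not exist in the actual configuration):

webhook_configs:
- url: 'https://api.opsgenie.com/v1/json/heartbeat/send'
  # hypothetical: the request body is described as a YAML object whose
  # string values go through the normal notification templating
  body:
    apiKey: '<api key>'
    name: '{{ .GroupLabels.alertname }}'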

@brian-brazil
Contributor

The problem with doing it via YAML is that you're restricting the structure to be completely static, so no lists or nested dicts. At what point are you okay with users being forced to use the webhook?

@fabxc
Contributor

fabxc commented Sep 9, 2016

Sure, you can just have a field that allows any YAML object.

@brian-brazil
Contributor

Which probably means you need all fields not to be escaped, leaving escaping up to the user. That's not likely to end well.

@matthiasr

Sorry, I missed the original discussion because I was on vacation.

In general, I believe there is no way to be really sure your alerts are working that doesn't involve a heartbeat.

The more the heartbeat deviates from "normal" alerts, the more potential for false positives and negatives there is. Having a "perfectly normal" alert fire all the time makes sure that alerts work, not just that heartbeats work. The heartbeat interval is configurable already via the normal alertmanager configuration.

I understand that this PR aims to support the OpsGenie specific heartbeat functionality. Would this heartbeat work if it is configured like a normal OpsGenie receiver? In that case, this PR doesn't add anything that could not be done already using a few lines of configuration. I'd prefer to just document how to do it over additional code and configuration directives. That would also solve checking the Prometheus->Alertmanager leg, which the proposed implementation here would not.

And the general approach is transferable to other integrations without having to implement it separately every time. I agree that using Dead Man's Snitch is a crutch caused by lack of support in PagerDuty. The documentation (blog post?) could also describe that. If someone has done something similar with other integrations we can add it; but given these two examples the general approach should be clear enough for everyone to be able to implement it according to their environment.

Of course, a deployment with a single datacenter and a prohibition against relying on anything outside that datacenter will never be able to use any of this, but that seems like a very special case.

@matthiasr

PS: I would be happy to write out the documentation-in-lieu-of-code post.

@davidkarlsen mentioned this pull request Apr 1, 2017
@davidkarlsen

Is this one going anywhere?

@brian-brazil
Contributor

Reading back through everything, the consensus is not to accept this PR.

Instead it's proposed to document how to do this properly, in the form of an end-to-end test, and maybe also allow templating of the webhook payload.

@davidkarlsen

@brian-brazil Should #679 be reopened in that case - and/or be re-phrased according to the new plan?

@brian-brazil
Contributor

This PR is an implementation of #679.

@ben51
Contributor Author

ben51 commented Apr 3, 2017

Sorry I missed the discussion. From what I remember, it wasn't possible to send fake alerts to OpsGenie to implement a heartbeat. Rather, one had to make a dedicated HTTP call to comply with their API.

I do agree that this PR may be overkill if webhooks are extended to be able to send any kind of formatted JSON.

@talset

talset commented Aug 29, 2017

Hi,

Any news on this topic?

I was not able to find the "end-to-end" documentation regarding OpsGenie heartbeat integration.

It seems a normal OpsGenie receiver is "almost" able to do the job with the OpsGenie API v1 (except that the name is not part of the POST).

But with the OpsGenie API v2 it seems not possible at all: https://docs.opsgenie.com/docs/heartbeat-api#pingHeartbeat (headers are used).

Any suggestions?

@jayme-github

@talset maybe what @Nin-0 did here https://github.com/traum-ferienwohnungen/opsgenie-heartbeat-proxy is of help for you.

@talset

talset commented Sep 15, 2017

@jayme-github thanks. In fact we did the same, but if something is wrong with the pod, we could get an alert that we should not have. Anyway, this is the only current workaround.

@dano-o

dano-o commented Sep 26, 2018

@talset and anyone bumping into this thread: you can send heartbeats to OpsGenie; the following authentication methods are supported by both systems:

1.) You can simply use the apiKey in the target URL:

https://api.opsgenie.com/v2/heartbeats/heartbeatname/ping?apiKey=XXXX

2.) You can use the Basic authentication method in AlertManager. OpsGenie will accept the following:

Header Key: Authorization
Header Value: basic base64(:$apiKey)

Basically, just leave the username empty and use the apiKey as the secret.

@pete-leese

What have you got in your prometheus.yml for this configuration?

@yosefy

yosefy commented Oct 10, 2018

It is in alertmanager.yml, not prometheus.yml.

it works like this:

route:
  receiver: default
  group_by:
  - job
  routes:
  - receiver: deadmansswitch
    match:
      alertname: DeadMansSwitch
    repeat_interval: 1m 

receivers:
 - name: deadmansswitch
   webhook_configs:
   - url: 'https://api.opsgenie.com/v2/heartbeats/HEARTBEAT_NAME/ping'
     send_resolved: true
     http_config:
       basic_auth:
         password: OPS-GENIE-API-KEY

@pete-leese

This is great. Thank you for the missing part of the puzzle.

@freeseacher

Got another issue like this.
We are experimenting with Alertmanager templates and changed them like this:

- name: opsgenie
  opsgenie_configs:
  - send_resolved: true
    api_key: <secret>
    api_url: https://api.opsgenie.com/
    message: '{{ template "opsgenie.company.message" . }}'
....

This worked pretty well until one of our PRs changed the path to the templates to:

templates:
- /etc/alertmanager/templates/.*tmpl

That was pretty easy to let pass in the MR because it looks very similar to the good one, and it broke everything. Because OpsGenie was the only destination for alerts, we didn't see any alerts for some time.

It seems the solution from #444 (comment) will not help with this. Any ideas?
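
For reference: the templates field takes filepath glob patterns, not regular expressions, so .*tmpl only matches files whose names start with a literal dot, and the custom templates silently stop loading. A glob like this is probably what was intended:

templates:
- /etc/alertmanager/templates/*.tmpl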

@mfin

mfin commented May 15, 2019

> it works like this: (config quoted above)

Thanks for this! One minor change: you have to define group_interval for the DeadMansSwitch route also, or it will inherit the global one. I've set both group_interval and repeat_interval to 1 minute, so the heartbeat ping is really sent each minute.
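
With both intervals set, the relevant part of the route block from the config above would look roughly like this:

  routes:
  - receiver: deadmansswitch
    match:
      alertname: DeadMansSwitch
    group_interval: 1m
    repeat_interval: 1m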

@theartusz

> it works like this: (config quoted above)

@yosefy what format is the password in?
Is it as per the v1.SecretKeySelector scheme, referencing only the name of the secret, or is the key written in plain text (a security issue)?
I tried to reference it as

password:
  name:
  key:

but it didn't work for me.
