
Add support for heartbeat #444

Closed · wants to merge 1 commit

Conversation

@ben51
Contributor

ben51 commented Aug 3, 2016

  • Send an HTTP request to specific hosts so they know the current instance is
    running
  • Only implements OpsGenie API


@fabxc
Contributor

fabxc commented Aug 8, 2016

Interesting. I know @grobie and @matthiasr were thinking of something like this. Maybe they want to chime in.

@grobie
Member

grobie commented Aug 8, 2016

We do have a setup for this at SoundCloud. We have one alert which is always firing and use a dead man's switch service to detect when that alert doesn't reach our notification system (PagerDuty). I think @matthiasr contacted PagerDuty and they plan to support that natively.

It's important to know when the Alertmanager is down, so more support for that is good, I think. Having an independent heartbeat routine won't catch a few other issues (e.g. Prometheus servers can't reach the Alertmanager, the Alertmanager can't send out notifications, etc.), but I guess that's a good start.

return &OpsGenie{
	conf:   conf,
	done:   make(chan struct{}),
	ticker: time.NewTicker(time.Duration(conf.Interval)),
Member

This will leak ticker routines, as these don't get stopped.

I don't see the need or advantage of making Tick() part of the interface. I'd just expose the interval and keep the ticker local to the Run() method.

Contributor Author

Indeed, I didn't see this... I made the Tick() method because it is not possible, AFAIK, to expose attributes on an interface. I could create a HeartbeatProvider interface and a Heartbeat struct to implement this, or expose the interval in the HeartbeatRunner on a per-implementation basis (e.g. map[string]struct{ done chan struct{}; interval time.Duration }).

Member

I see. Well, you could stop the ticker in the Stop() method, but I'd really keep that local. What about a func Interval() time.Duration method on the interface?
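
For illustration, a minimal sketch of that shape (the Heartbeater interface and runHeartbeat function below are made-up example names, not the PR's actual code):

package main

import (
	"context"
	"log"
	"time"
)

// Heartbeater exposes only its interval; the ticker stays local to the
// runner loop and is stopped when the loop exits, so nothing leaks.
type Heartbeater interface {
	Interval() time.Duration
	Beat(ctx context.Context) error
}

func runHeartbeat(ctx context.Context, hb Heartbeater) {
	ticker := time.NewTicker(hb.Interval())
	defer ticker.Stop() // no leaked ticker when the loop returns

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := hb.Beat(ctx); err != nil {
				log.Printf("heartbeat failed: %v", err)
			}
		}
	}
}

// logHeartbeat is a stand-in implementation for the example.
type logHeartbeat struct{ interval time.Duration }

func (h logHeartbeat) Interval() time.Duration        { return h.interval }
func (h logHeartbeat) Beat(ctx context.Context) error { log.Println("beat"); return nil }

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	runHeartbeat(ctx, logHeartbeat{interval: time.Second})
}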

Contributor Author

Done

- Send an HTTP request to specific hosts so they know the current instance is
  running
- Only implements OpsGenie API

Conf sample:

heartbeats:
  - name: 'opsgenie-hb'
    opsgenie_configs:
    - api_key: 'f5ec0d1f-5978-11e6-881f-64006a5a8984'
      interval: '10m'
      name: 'my-app-heartbeat'

Change-Id: Ied9d45ea630362ed6ed17845506db7b5b21919a4
@ben51
Contributor Author

ben51 commented Sep 6, 2016

Hey, what's the status on this PR? Are you considering merging it, or do you require some more work from me?
Cheers

@fabxc
Contributor

fabxc commented Sep 6, 2016

Hey, thanks for pinging again about this.

I think generally a heartbeat feature is very useful. As grobie said, it does not catch a case where Prometheus servers cannot reach any Alertmanager. But this would only be thoroughly validated if each Prometheus server had its own alert and we checked each of them.
I think the connectivity should generally be addressed by an HA AM setup and meta monitoring.

But we should discuss this a bit further before adding a new feature.

@brian-brazil

@brian-brazil
Contributor

This should be covered by general meta-monitoring; this approach has the usual issues with bottom-up push-based monitoring and is likely to trigger false positives in an HA scenario, as no single AM is a SPOF.

This approach also seems a bit OpsGenie-specific. The generic version of this would be normal Prometheus scraping, for example.

@fabxc
Contributor

fabxc commented Sep 6, 2016

Mh, not sure I understand.

A heartbeat would allow me to get a page when Alertmanager cannot reach the integration. How is that covered by meta monitoring? My meta monitoring will send an alert to my Alertmanager, which cannot send out the notification... and then what?

In an HA setup, you'd have to deduplicate heartbeats by name, I suppose, and it would only trigger if no AM sends anything on that named heartbeat channel anymore. So it wouldn't be coupled to single AM instances.

@brian-brazil
Contributor

What you really need is an end-to-end test, and usually the issue is with one AM, so the other AMs are still working.

@fabxc
Contributor

fabxc commented Sep 6, 2016

I think we are talking about different intents behind this. I can run 10 fully functional AMs in my DC, but if it's cut off from the Internet, there's no meta monitoring that can help me.

FWIW, most users are small-scale, and telling them to run a separate meta-monitoring Prometheus+AM in another DC is not a good story for a fully functional and easy-to-deploy monitoring setup (one of our core features). Especially if integrations provide such heartbeat/dead-man's-switch functionality already.

@brian-brazil
Contributor

Meta-monitoring should catch the DC going down, otherwise you don't have full meta-monitoring.

Meta-monitoring is everything you need to catch your monitoring going down.

I think something more along the lines of @grobie's lightning talk at PromCon is what we should be looking at.

@fabxc
Contributor

fabxc commented Sep 6, 2016

That was @matthiasr's talk IIRC.

Conceptually it wasn't too different – it also was a heartbeat/dead-man's-switch – just more Prometheus native and more end-to-end. But in the end, by your reasoning this seems just as obsolete – with full meta-monitoring you'd just directly send an alert notification that your monitoring is not working rather than going through the entire contraption described.

So not sure I understand what you are leaning towards exactly.

@brian-brazil
Contributor

The heartbeat here as proposed only partially tests that one AM can talk to the notification mechanism.

The lightning talk tested that the monitoring system as a whole can talk to the notification mechanism. This much more accurately represents what you want to monitor.

@fabxc
Contributor

fabxc commented Sep 6, 2016

Yes, but the path Prometheus->Alertmanager can be verified from within the data center via meta monitoring. If that is set up, the only missing piece (when not wanting to require users to set up cross-DC meta monitoring) is the Alertmanager->Internet route.

This gap could be filled by having Prometheus heartbeat to an integration. I agree that the other approach covers more and works just as nicely. More importantly, it doesn't introduce another concept.
In the end, everything behind the AM was just quite a contraption to get the dead man's switch working. So being able to work with a heartbeat API that does what we want would be neat.

Of course people can add their own webhook again that sends some JSON to whatever heartbeat API on an alert notification. But for such trivial stuff, maybe it's worth revisiting whether we should allow assembling custom JSON via templates.
In the case of OpsGenie it's literally:

curl -XPOST 'https://api.opsgenie.com/v1/json/heartbeat/send' -d '
{
     "apiKey": "eb243592-faa2-4ba2-a551q-1afdf565c889",
     "name" : "host1"
}'

And in this case we have named heartbeats. So it does not verify that one AM is working, as the entire AM cluster would send heartbeats and they are not differentiated by source IP or similar.

@brian-brazil
Contributor

Yes, but the path Prometheus->Alertmanager can be verified from within the data center via meta monitoring.

This is an offline-processing system; attempting to monitor each hop individually will miss things. An end-to-end heartbeat is desirable.

(when not wanting to require users to setup cross-DC meta monitoring)

I'm not sure we should have this as a primary goal. Everything we're talking about here should be some form of cross-DC monitoring.

In the end everything behind AM was just quite a contraption to get the dead-man's-switch working.

This is actually one of the simplest solutions I've seen to this general class of problem.

@fabxc
Contributor

fabxc commented Sep 7, 2016

This is actually one of the simplest solutions I've seen to this general class of problem.

Not sure what you mean. If PagerDuty just had an incident handler that is inverted, i.e. pages you if it doesn't receive alerts, this would be super straightforward. The contraption of using 2 or 3 different SaaS providers after leaving Alertmanager has nothing to do with this being a complex problem class – it's simply a gap in the offerings of PD, OpsGenie, etc.

And arguably the OpsGenie heartbeat functionality is exactly that, just with its own tiny API that is unfortunately not equal to the incident API. You skipped the essential part of my last response.

I'm not sure we should have this as a primary goal.

Nobody said it was a primary goal, but it's a use case we have to consider. Not every company allows their sysadmins to run something in the cloud. And when on-site is just a single site, there must be a way to do it.

I'm not sure what we are talking about anymore, to be honest. Your last few responses are giving contradictory signals.

@brian-brazil
Contributor

Not every company allows their sysadmins to run something in the cloud. And when on-site is just a single site, there must be a way to do it.

It's not possible to do meta-monitoring without the use of a provider outside your network, so I don't think this is a reasonable design restriction. This PR presumes a cloud service (OpsGenie), for example.

Your last few responses are giving contradicting signals.

I believe I've been consistent. The OpsGenie heartbeat would offer testing of the AM->OpsGenie path in some cases, but there's more to meta monitoring than testing that one hop of the alerting path, and not everyone uses OpsGenie.

An end-to-end test that starts at Prometheus and ends outside your network is what's needed. The Alertmanager sending heartbeats itself is not the right way to provide this feature, and may lull users into a false sense of security.

@fabxc
Contributor

fabxc commented Sep 7, 2016

This was meant to be a general discussion rather than about accepting exactly this PR. Probably should have moved this into its own issue for clarity. Last response here to finish this.

You are arguing against what I'm saying when I don't even disagree with you.

I've explicitly said that the end-to-end solution is better and proposed that we find a way to integrate our general notifications with such heartbeat APIs, which was ignored.
I also didn't say, or mean to say, that it would replace meta-monitoring in any way.

So one final time to hopefully clarify this:

We want some sort of heartbeat/dead-man's-switch functionality. You do apparently agree with that. The end-to-end variant is exactly that while validating another part of the pipeline along the way, which is great of course.
The lightning talk by @matthiasr has shown that there's a gap in direct support by SaaS integrations. But some (like OpsGenie) do have dedicated heartbeat APIs with the correct paging semantics. The problem is, they need a specific JSON body rather than the standard notification.
So, do we want to provide a way to integrate with that without yet another custom webhook handler or not?

Please open an issue if you want to discuss this further without spinning in circles.

@brian-brazil
Contributor

So, do we want to provide a way to integrate with that without yet another custom webhook handler or not?

I don't think there's enough standardisation in this space to know how to approach this. I've come across about 5 of these from various providers, and I suspect most of them could take a webhook directly as they tend to ignore the body. So for now webhook it is I think.

For this particular PR I'm against as it's approaching the problem at the wrong level. Do you think we should accept this PR?

@fabxc
Contributor

fabxc commented Sep 8, 2016

In the OpsGenie case, the body specifies the name of the heartbeat and cannot be left out.

At this early stage adding such an exposed feature to AM is probably unwise. So no to this PR for now.
I'd like to keep the general problem in mind for the future, though. I'd mostly see it realized by allowing entire request bodies to be templated, so things like this can be implemented directly via the regular notification mechanism.

@brian-brazil
Contributor

I think there was one proposal (from you?) for JSON templating that didn't look too terrible. I'm mainly concerned on complexity and support grounds, as it'd be far more powerful/complex than what our current templating offers (which already confuses users), and we'd have to deal with arbitrarily different HTTP response codes/bodies that need special handling. That'd take a good chunk of software engineering to get sane.

@fabxc
Contributor

fabxc commented Sep 8, 2016

Yes, I mentioned that before and further up in this discussion. There's http://jsonnet.org/, which seems rather sane but lacks Go support.

An alternative would be to specify a request body structure directly in the YAML configuration, where you can use templated strings for values. At least this won't require people to generate valid JSON via Go templates, which also use { as part of their syntax (the delimiters can be overridden, but that would be a nasty inconsistency).
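
Purely as an illustration of that idea (the body field below is hypothetical and does not exist in the actual configuration):

webhook_configs:
- url: 'https://api.opsgenie.com/v1/json/heartbeat/send'
  # hypothetical: the request body is described as a YAML object whose
  # string values go through the normal notification templating
  body:
    apiKey: '<api key>'
    name: '{{ .GroupLabels.alertname }}'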

@brian-brazil
Contributor

The problem with doing it via YAML is that you're restricting the structure to be completely static, so no lists or nested dicts. At what point are you okay with users being forced to use the webhook?

@fabxc
Contributor

fabxc commented Sep 9, 2016

Sure, you can just have a field that allows any YAML object.

@brian-brazil
Contributor

Which probably means you need all fields not to be escaped, leaving escaping up to the user. That's not likely to end well.

@matthiasr

Sorry, I missed the original discussion because I was on vacation.

In general, I believe there is no way to be really sure your alerts are working that doesn't involve a heartbeat.

The more the heartbeat deviates from "normal" alerts, the more potential for false positives and negatives there is. Having a "perfectly normal" alert fire all the time makes sure that alerts work, not just that heartbeats work. The heartbeat interval is configurable already via the normal alertmanager configuration.

I understand that this PR aims to support the OpsGenie specific heartbeat functionality. Would this heartbeat work if it is configured like a normal OpsGenie receiver? In that case, this PR doesn't add anything that could not be done already using a few lines of configuration. I'd prefer to just document how to do it over additional code and configuration directives. That would also solve checking the Prometheus->Alertmanager leg, which the proposed implementation here would not.

And the general approach is transferable to other integrations without having to implement it separately every time. I agree that using Dead Man's Snitch is a crutch caused by lack of support in PagerDuty. The documentation (blog post?) could also describe that. If someone has done something similar with other integrations we can add it; but given these two examples the general approach should be clear enough for everyone to be able to implement it according to their environment.

Of course, a deployment with a single datacenter and a prohibition against relying on anything outside that datacenter will never be able to use any of this, but that seems like a very special case.

@matthiasr

PS: I would be happy to write out the documentation-in-lieu-of-code post.

@davidkarlsen mentioned this pull request Apr 1, 2017
@davidkarlsen

Is this one going anywhere?

@brian-brazil
Contributor

Reading back through everything, the consensus is not to accept this PR.

Instead it's proposed to document how to do this properly, in the form of an end-to-end test, and maybe also allow templating of the webhook payload.

@davidkarlsen

@brian-brazil Should #679 be reopened in that case - and/or be re-phrased according to the new plan?

@brian-brazil
Contributor

This PR is an implementation of #679.

@ben51
Contributor Author

ben51 commented Apr 3, 2017

Sorry I missed the discussion. From what I remember, it wasn't possible to send fake alerts to OpsGenie to implement a heartbeat. Rather, one had to make a dedicated HTTP call to comply with their API.

I do agree that this PR may be overkill if webhooks are extended to be able to send any kind of formatted JSON.

@talset

talset commented Aug 29, 2017

Hi,

Any news on this topic?

I was not able to find the "end-to-end" documentation regarding OpsGenie heartbeat integration.

It seems a normal OpsGenie receiver is "almost" able to do the job with the OpsGenie API v1 (except that the name is not part of the POST).

But with the OpsGenie API v2 it seems not possible at all: https://docs.opsgenie.com/docs/heartbeat-api#pingHeartbeat (headers are used).

Any suggestions?

@jayme-github

@talset maybe what @Nin-0 did here https://github.com/traum-ferienwohnungen/opsgenie-heartbeat-proxy is of help for you.

@talset

talset commented Sep 15, 2017

@jayme-github thanks. In fact we did the same, but if something is wrong with the pod, we could get an alert that we should not have. Anyway, this is the only current workaround.

@dano-o

dano-o commented Sep 26, 2018

@talset and anyone bumping into this thread: you can send heartbeats to OpsGenie; the following authentication methods are supported by both systems:

1.) You can simply use the apiKey in the target URL:

https://api.opsgenie.com/v2/heartbeats/heartbeatname/ping?apiKey=XXXX

2.) You can use the Basic authentication method in AlertManager. OpsGenie will accept the following:

Header Key: Authorization
Header Value: basic base64(:$apiKey)

Basically, just leave the username empty and use the apiKey as the secret.

@pete-leese

What have you got in your prometheus.yml for this configuration?

@yosefy

yosefy commented Oct 10, 2018

It is in alertmanager.yml, not prometheus.yml.

it works like this:

route:
  receiver: default
  group_by:
  - job
  routes:
  - receiver: deadmansswitch
    match:
      alertname: DeadMansSwitch
    repeat_interval: 1m 

receivers:
 - name: deadmansswitch
   webhook_configs:
   - url: 'https://api.opsgenie.com/v2/heartbeats/HEARTBEAT_NAME/ping'
     send_resolved: true
     http_config:
       basic_auth:
         password: OPS-GENIE-API-KEY

@pete-leese

This is great. Thank you for the missing part of the puzzle.

@freeseacher

Got another issue like this.
We are experimenting with Alertmanager templates and changed them like this:

- name: opsgenie
  opsgenie_configs:
  - send_resolved: true
    api_key: <secret>
    api_url: https://api.opsgenie.com/
    message: '{{ template "opsgenie.company.message" . }}'
....

This worked pretty well until one of our PRs changed the path to the templates to:

templates:
- /etc/alertmanager/templates/.*tmpl

That was pretty easy to let pass in the MR because it looks very similar to the good one, and it broke everything. Because OpsGenie was the only destination for alerts, we didn't see any alerts for some time.

It seems the solution from #444 (comment) will not help with this. Any ideas?
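
For reference: the templates field takes filepath glob patterns, not regular expressions, so .*tmpl only matches files whose names start with a literal dot, and the custom templates silently stop loading. A glob like this is probably what was intended:

templates:
- /etc/alertmanager/templates/*.tmpl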

@mfin

mfin commented May 15, 2019

> it works like this: (config quoted above)

Thanks for this! One minor change: you have to define group_interval for the DeadMansSwitch route also, or it will inherit the global one. I've set both group_interval and repeat_interval to 1 minute, so the heartbeat ping is really sent each minute.
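
With both intervals set, the relevant part of the route block from the config above would look roughly like this:

  routes:
  - receiver: deadmansswitch
    match:
      alertname: DeadMansSwitch
    group_interval: 1m
    repeat_interval: 1m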

@theartusz

> it works like this: (config quoted above)

@yosefy what format is the password in?
Is it as per the v1.SecretKeySelector scheme, referencing only the name of the secret, or is the key written in plain text (a security issue)?
I tried to reference it as

password:
  name:
  key:

but it didn't work for me.
