added OpsGenie service for alerting #113

Merged
merged 1 commit into from
Dec 23, 2015

ericiles
Contributor

Implements #71

I am running this code in our production environment, and have been getting OpsGenie alerts since 12/18/2015.

Usage:

stream
    .from().measurement('cpu')
    .where(lambda: "host" == 'serverA')
    .groupBy('host')
    .window()
        .period(10s)
        .every(10s)
    .mapReduce(influxql.count('idle'))
    .alert()
        .id('kapacitor/{{ .Name }}/{{ index .Tags "host" }}')
        .info(lambda: "count" > 6.0)
        .warn(lambda: "count" > 7.0)
        .crit(lambda: "count" > 8.0)
        .opsGenie()
        .ogTeams('test_team,another_team')
        .ogRecipients('test_recipient,another_recipient')

Configuration:

[opsgenie]
    # Configure OpsGenie with your API key and default routing key.
    enabled = false
    # Your OpsGenie API Key.
    api-key = ""
    # Default OpsGenie teams, can be overridden per alert.
    teams = ""
    # Default OpsGenie recipients, can be overridden per alert.
    recipients = ""
    # The OpsGenie API URL should not need to be changed.
    url = "https://api.opsgenie.com/v1/json/alert"
    # The OpsGenie Recovery URL, you can change this
    # based on which behavior you want a recovery to
    # trigger (Add Notes, Close Alert, etc.)
    recovery_url = "https://api.opsgenie.com/v1/json/alert/note"
    # If true then all alerts will be sent to OpsGenie
    # without explicitly marking them in the TICKscript.
    # The routing key can still be overridden.
    global = false
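
As a rough illustration of how a section like this could be represented in Go, here is a minimal sketch. The struct, its field names, and the validation rules are assumptions made for this example and are not taken from the PR's diff; teams and recipients are shown as string lists rather than the empty strings in the sample above.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors the [opsgenie] section shown above.
// Field names are hypothetical; they are not copied from the PR.
type Config struct {
	Enabled     bool
	APIKey      string
	Teams       []string
	Recipients  []string
	URL         string
	RecoveryURL string
	Global      bool
}

// NewConfig returns the default URLs listed in the sample configuration.
func NewConfig() Config {
	return Config{
		URL:         "https://api.opsgenie.com/v1/json/alert",
		RecoveryURL: "https://api.opsgenie.com/v1/json/alert/note",
	}
}

// Validate rejects an enabled service that lacks an API key or a URL.
// A disabled service is always considered valid.
func (c Config) Validate() error {
	if !c.Enabled {
		return nil
	}
	if c.APIKey == "" {
		return errors.New("opsgenie: api-key must be set when enabled")
	}
	if c.URL == "" || c.RecoveryURL == "" {
		return errors.New("opsgenie: url and recovery_url must not be empty")
	}
	return nil
}

func main() {
	c := NewConfig()
	c.Enabled = true
	fmt.Println(c.Validate()) // reports the missing API key
	c.APIKey = "example-key"
	fmt.Println(c.Validate())
}
```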

This is my first run at working with Go, so some changes may be necessary. This is mainly a copy of the VictorOps service, modified to work properly with OpsGenie.

  • CHANGELOG.md updated
  • Rebased/mergeable
  • Tests pass
  • Sign CLA (if not already signed)

recovery_url = "https://api.opsgenie.com/v1/json/alert/note"
# If true then all alerts will be sent to OpsGenie
# without explicitly marking them in the TICKscript.
# The routing key can still be overridden.
Contributor

This should say team and recipients instead of routing key.

@nathanielc
Contributor

@ericiles Thanks for the great work!

Overall this looks great. I added some pointers on using a list of strings instead of splitting comma-separated strings.

Not sure whether I like

.alert()
   ...
   .opsGenie()
     .ogTeams('team1')

or

.alert()
   ...
   .opsGenie()
     .teams('team1')

better. Thoughts?

Right now everything is flattened on the AlertNode, but this will not scale, so I have been thinking of ways to scope it better. Long term I would prefer the second option.
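
To make the second option concrete, here is a minimal Go sketch of a scoped handler: `OpsGenie()` returns a handler object that owns its own options, so a method such as `Teams` cannot collide with another service's `Teams`. The types `AlertNode` and `OpsGenieHandler` below are hypothetical stand-ins for this example, not Kapacitor's actual definitions.

```go
package main

import "fmt"

// OpsGenieHandler holds options that apply only to OpsGenie.
// Hypothetical type for illustration.
type OpsGenieHandler struct {
	teams      []string
	recipients []string
}

// Teams sets the teams and returns the handler so calls can be chained.
func (h *OpsGenieHandler) Teams(teams ...string) *OpsGenieHandler {
	h.teams = teams
	return h
}

// Recipients sets the recipients and returns the handler for chaining.
func (h *OpsGenieHandler) Recipients(recipients ...string) *OpsGenieHandler {
	h.recipients = recipients
	return h
}

// AlertNode keeps a list of handlers instead of flattened per-service fields.
type AlertNode struct {
	opsGenie []*OpsGenieHandler
}

// OpsGenie registers a new OpsGenie handler on the node and returns it.
func (a *AlertNode) OpsGenie() *OpsGenieHandler {
	h := &OpsGenieHandler{}
	a.opsGenie = append(a.opsGenie, h)
	return h
}

func main() {
	a := &AlertNode{}
	a.OpsGenie().Teams("team1").Recipients("oncall")
	fmt.Println(a.opsGenie[0].teams, a.opsGenie[0].recipients)
}
```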

@ericiles
Contributor Author

@nathanielc

I agree that

.alert()
   ...
   .opsGenie()
     .teams('team1')

would be the better way to go, but I saw that doing so could conflict with anything else that wanted to use teams or recipients, which is why I prefixed it with "og".

I am making the suggested changes for the teams and recipients. Also, I have a few things I hard coded in for our use, that I am going to go ahead and make configurable for the end user (the OverwritesQuietHours tag for Critical alerts, for example). I am going to make .ogTags('tag1', 'tag2') an option so that a user can specify whichever tags they wish.
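
The difference between the two input styles can be sketched in Go as follows. With a comma-separated string the service has to split, trim, and discard empty entries itself; with a list of strings no parsing is needed. Both helper names (`splitList`, `tags`) are made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// splitList parses a comma-separated string, trimming whitespace
// and dropping empty entries such as those left by a trailing comma.
func splitList(s string) []string {
	var out []string
	for _, part := range strings.Split(s, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// tags accepts the values directly, so there is nothing to parse.
func tags(values ...string) []string {
	return values
}

func main() {
	fmt.Println(splitList("test_team, another_team,"))
	fmt.Println(tags("tag1", "tag2"))
}
```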

Also, at the moment, on recovery, I am setting the note field to the message. This is fine if the user actually wants to send a note instead of doing something else, such as closing the alert, but I think there should be a better way to let the user specify which parameters/values to send to OpsGenie along with their API call. I am not sure of the best way to structure that portion, though. Possibly something like:

.alert()
   ...
   .opsGenie()
     .ogRecovery('{{ .Level}}: {{ .Name }}/{{ index .Tags "host" }} has normal cpu usage: {{ index .Fields "used" }}')

and then in the kapacitor.conf specifying the parameter that the above text would be sent as, since we already have the recovery_url in the config to specify the api url. Something like:

recovery_url = "https://api.opsgenie.com/v1/json/alert/note"
recovery_param = "note"

Any suggestions?

@nathanielc
Contributor

For now let's remove the og prefix from the calls. I know it could conflict, but I plan to change that soon.

I would like the templates to be global to all alert handlers if possible, and then each handler uses the appropriate template for its needs. This way it is clear what is available and what the purpose of each template is.

Right now there are two templates:

  • ID -- a unique ID for the alert that doesn't change when the alert state changes
  • Message -- a short message that can change as the alert state changes

I plan to add another:

  • Details -- a much longer message, possibly HTML etc., for displaying more information on the alert.

For example the email alert handler just uses Message as the email subject and will use the Details as the body of the email. The email handler just ignores the ID while VictorOps/PagerDuty will probably ignore the Details template.
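
As an illustration of how an ID template and a Message template might be rendered, here is a small sketch using Go's standard `text/template` package. The `AlertData` type and the `renderAlert` helper are assumptions for this example; the ID template string is the one from the usage snippet in the PR description, and the message template is the generic one suggested in this thread.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// AlertData is a hypothetical set of values exposed to the templates.
type AlertData struct {
	ID    string
	Name  string
	Level string
	Tags  map[string]string
}

// render executes a single template string against the data.
func render(tmpl string, data AlertData) (string, error) {
	t, err := template.New("alert").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// renderAlert renders the ID first, then makes it available to the
// message template as .ID.
func renderAlert(idTmpl, msgTmpl string, data AlertData) (id, msg string, err error) {
	if id, err = render(idTmpl, data); err != nil {
		return "", "", err
	}
	data.ID = id
	if msg, err = render(msgTmpl, data); err != nil {
		return "", "", err
	}
	return id, msg, nil
}

func main() {
	data := AlertData{
		Name:  "cpu",
		Level: "CRITICAL",
		Tags:  map[string]string{"host": "serverA"},
	}
	id, msg, err := renderAlert(
		`kapacitor/{{ .Name }}/{{ index .Tags "host" }}`,
		`{{ .ID }} is {{ .Level }}`,
		data,
	)
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(id)  // kapacitor/cpu/serverA
	fmt.Println(msg) // kapacitor/cpu/serverA is CRITICAL
}
```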

We could add a generic RecoveryMessage template that OpsGenie and others could use. I am not familiar with the OpsGenie workflow, why does the RecoveryMessage need to be different than the normal message? Wouldn't a generic {{ .ID }} is {{ .Level }} message work?

@ericiles
Contributor Author

The message field would work fine for us, and is what we currently have it doing. However, other users may wish to use a different workflow for their alerting with OpsGenie. Using the API, on recovery, one could add a comment to the alert, acknowledge the alert, close the alert, add/remove tags to/from the alert, or ignore the recovery altogether and do nothing.

However, in the interest of keeping things uniform and simple, it is probably best that it just pass the message along as the 'note' parameter to whichever recovery_url they specify, as the note parameter is used for adding comments, closing (with a comment), and acknowledging (with a comment). That would cover what the majority of users would likely want.
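
A minimal Go sketch of this approach might look like the following: pick the endpoint from the alert level, and on recovery send the message as the `note` parameter. The JSON field names `apiKey` and `alias`, the level string "OK", and both helper functions are assumptions for this example; only the `note` parameter and the two URLs come from the discussion above. No request is actually sent.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// endpointFor returns the recovery URL for an OK level and the
// normal alert URL otherwise. The level string is an assumption.
func endpointFor(level, url, recoveryURL string) string {
	if level == "OK" {
		return recoveryURL
	}
	return url
}

// recoveryPayload builds the JSON body for a recovery call, passing the
// rendered message as the note. Field names other than "note" are assumed.
func recoveryPayload(apiKey, alertID, message string) ([]byte, error) {
	body := map[string]string{
		"apiKey": apiKey,
		"alias":  alertID,
		"note":   message,
	}
	return json.Marshal(body)
}

func main() {
	url := "https://api.opsgenie.com/v1/json/alert"
	recoveryURL := "https://api.opsgenie.com/v1/json/alert/note"

	body, err := recoveryPayload("example-key", "kapacitor/cpu/serverA", "kapacitor/cpu/serverA is OK")
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(endpointFor("OK", url, recoveryURL))
	fmt.Println(string(body))
}
```

Because only the URL is configurable, pointing recovery_url at a different endpoint that also accepts a note changes the recovery behavior without changing the payload.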

Also, I will go ahead and rename .ogTeams, .ogRecipients, etc. as requested.

@nathanielc
Contributor

OK, in that case let's go with the simple approach that solves the current use case, and if a different one shows up we can address it then. Again, thanks for this great PR.

@ericiles
Contributor Author

The discussed changes have been completed. I ran this in our environment and triggered a few alerts, and all appears to be working as expected.

teams = s.teams
}

if len(teams) > 0 {
Contributor

Should this be an error if no teams were specified? Or does OpsGenie have a sane default when no teams are specified? Same goes for recipients...

Contributor Author

teams and recipients are optional per their API. The reason being, you can configure default teams/recipients at the API level, and I'd imagine worst case, no actual notifications go out, but the alert is simply created in the system.
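
The optional handling described here could be sketched in Go roughly as follows: use the per-alert values when present, fall back to the configured defaults, and leave the field out of the request entirely when neither is set. The `alertRequest` type, its JSON field names, and the helpers are assumptions for this example rather than the PR's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// alertRequest is a hypothetical request body. The omitempty option
// drops teams and recipients from the JSON when none are set.
type alertRequest struct {
	APIKey     string   `json:"apiKey"`
	Message    string   `json:"message"`
	Teams      []string `json:"teams,omitempty"`
	Recipients []string `json:"recipients,omitempty"`
}

// pick returns the per-alert values, or the defaults when none were given.
func pick(perAlert, defaults []string) []string {
	if len(perAlert) > 0 {
		return perAlert
	}
	return defaults
}

// encode marshals the request and returns it as a string.
func encode(r alertRequest) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	defaults := []string{"ops"}

	// Per-alert teams are empty, so the configured default is used.
	withDefault, _ := encode(alertRequest{APIKey: "k", Message: "m", Teams: pick(nil, defaults)})
	fmt.Println(withDefault)

	// Nothing configured anywhere: the teams field is omitted.
	withoutTeams, _ := encode(alertRequest{APIKey: "k", Message: "m", Teams: pick(nil, nil)})
	fmt.Println(withoutTeams)
}
```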

Contributor

OK, a quick rebase/squash and I'll merge this. Thanks!

@nathanielc
Contributor

@ericiles Looks great! Can you rebase/squash? Then this should be ready for merging. Thanks.

Fixed typos in kapacitor.conf comments

Made requested changes to OpsGenie service.
nathanielc pushed a commit that referenced this pull request Dec 23, 2015
added OpsGenie service for alerting
@nathanielc nathanielc merged commit 70f8dcd into influxdata:master Dec 23, 2015
@ericiles ericiles deleted the opsgenie branch December 28, 2015 17:25