Develop plan for Cloudwatch/ Monitoring #21

Closed
8 of 9 tasks
davidschober opened this issue Apr 28, 2017 · 21 comments

@davidschober

davidschober commented Apr 28, 2017

Description

As we move forward with the Avalon on AWS pilot, we need a plan for monitoring. If systems fail, people need to be notified, and a call tree needs to be set up to bring the system back up. OpsGenie should get notifications.

OpsGenie integration with AWS CloudWatch:

https://www.opsgenie.com/docs/integrations/aws-cloudwatch-integration

Done looks like

  • Team feels comfortable with CloudWatch
  • Initial monitoring is set up and documented
  • Monitoring is analogous to current monitoring
  • Errors in OpsGenie ping MBK as on call for the course pilot

Systems needing monitoring:

  • Fedora
  • Avalon Web
  • Avalon Workers
  • RDS (do we need to monitor this?)
  • SOLR

@Toputnal

I'll dig into CloudWatch and its integration with OpsGenie.

@Toputnal

It looks like we can set up SNS topics in CloudWatch so that CloudWatch sends alerts to OpsGenie via OpsGenie's API. I don't have anything to monitor in my test environment yet, but I will set up something simple and test the CloudWatch integration. If all goes well, we'll need to configure CloudWatch for our production Avalon instance with SNS alerts to OpsGenie's web API. @d-venckus @davidschober @mbklein If we do this, we can set up OpsGenie to alert whoever, whenever, via OpsGenie's integrated scheduler.
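
For anyone following along, here is a rough sketch of what subscribing OpsGenie to an SNS topic could look like. It only builds the arguments and does not call AWS. The topic ARN and the endpoint URL are hypothetical placeholders (the real endpoint and API key come from the OpsGenie integration page linked in the description), and passing the result to boto3's `sns.subscribe` is an assumption about tooling, not something decided in this thread.

```python
# Sketch only: builds the arguments for an SNS HTTPS subscription.
# TOPIC_ARN and OPSGENIE_ENDPOINT are hypothetical placeholders, not real values.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:avalon-alerts"
OPSGENIE_ENDPOINT = "https://example.invalid/cloudwatch?apiKey=REPLACE_ME"


def build_subscription(topic_arn, endpoint):
    """Return keyword arguments suitable for boto3's sns.subscribe()."""
    if not endpoint.startswith("https://"):
        raise ValueError("the OpsGenie endpoint should be an HTTPS URL")
    return {"TopicArn": topic_arn, "Protocol": "https", "Endpoint": endpoint}


subscription = build_subscription(TOPIC_ARN, OPSGENIE_ENDPOINT)
print(subscription)

# With AWS credentials configured, this could be sent as:
#   boto3.client("sns").subscribe(**subscription)
```

Any CloudWatch alarm that lists the topic in its alarm actions would then publish to that endpoint.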

@Toputnal

Oh, and I should mention that if an alert clears itself in CloudWatch, it will automatically update and close the alert in OpsGenie, assuming we get the integration set up as I mentioned above.
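
To make the auto-close behavior concrete, here is a toy model of how a receiver might react to a CloudWatch alarm notification. `AlarmName` and `NewStateValue` are fields of the JSON that CloudWatch publishes to SNS; the `opsgenie_action` function and the choice to ignore `INSUFFICIENT_DATA` are assumptions made for illustration, not OpsGenie's actual logic.

```python
import json


def opsgenie_action(sns_message):
    """Map a CloudWatch alarm notification (JSON string) to an alert action."""
    alarm = json.loads(sns_message)
    name = alarm["AlarmName"]
    state = alarm["NewStateValue"]
    if state == "ALARM":
        return ("create", name)   # open (or update) an alert
    if state == "OK":
        return ("close", name)    # the alarm cleared itself, so close the alert
    return ("ignore", name)       # assumption: INSUFFICIENT_DATA does nothing


# "avalon-web-status" is a hypothetical alarm name.
fired = json.dumps({"AlarmName": "avalon-web-status", "NewStateValue": "ALARM"})
cleared = json.dumps({"AlarmName": "avalon-web-status", "NewStateValue": "OK"})
print(opsgenie_action(fired))
print(opsgenie_action(cleared))
```

Note that the `OK` notification is only published if the alarm also lists the topic under its OK actions.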

@d-venckus

d-venckus commented May 16, 2017 via email

@Toputnal

I have created a Topic in our AWS instance so that any messages that go to @mbklein's email also alert OpsGenie, theoretically. We could use some kind of artificial alert to make sure things are working, but I'll leave what that test might be to @mbklein. @davidschober
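
One possible artificial alert, offered as a sketch rather than the agreed test: CloudWatch's SetAlarmState API forces an alarm into a chosen state for testing, and the alarm goes back to its real state at the next evaluation. The code below only builds the arguments; the alarm name is a hypothetical placeholder, and using boto3 is an assumption.

```python
ALLOWED_STATES = {"OK", "ALARM", "INSUFFICIENT_DATA"}


def build_test_alarm_state(alarm_name, state="ALARM"):
    """Return keyword arguments suitable for boto3's cloudwatch.set_alarm_state()."""
    if state not in ALLOWED_STATES:
        raise ValueError("unknown alarm state: " + state)
    return {
        "AlarmName": alarm_name,
        "StateValue": state,
        "StateReason": "Manual test of the OpsGenie integration",
    }


# "avalon-web-status" is a hypothetical alarm name.
test_params = build_test_alarm_state("avalon-web-status")
print(test_params)

# With AWS credentials configured, this could be sent as:
#   boto3.client("cloudwatch").set_alarm_state(**test_params)
```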

@davidschober
Author

@Toputnal Not sure what you mean by the last comment. I think we need to ensure the main systems are up and responding:

  • Fedora
  • Avalon Web
  • Avalon Workers
  • SOLR

Is that what you're looking for?

@Toputnal

Toputnal commented May 17, 2017 via email

@Toputnal

@davidschober ^

@d-venckus

Will OpsGenie also alert MBK a second time, Jim? Is that the idea? To test OpsGenie connectivity fully?

@Toputnal

Eventually, yes, @d-venckus, but for now I just wanna see some Blinkin' lights! :-)

@davidschober
Author

Got it. Can someone document what we're monitoring?

@Toputnal

I'm not certain what the existing monitoring set up in AWS is actually monitoring (that is, which alarms are currently set up to email @mbklein). I'll dig in and see what I can find (without changing anything at first, obviously). If I accidentally break anything, our SOP is to blame @egspoony ;-) (Back me up on this, @d-venckus!)

@Toputnal

I have set up alerts for all the services listed above, minus RDS, which is provided to us as a service rather than being something we run on top of an EC2 instance. Does what I've added look good to you, @d-venckus, @davidschober, @mbklein?
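
For the docs, a sketch of what one alarm per service could look like. The thread does not record which metrics or thresholds were actually used, so everything here is an assumption: the EC2 `StatusCheckFailed` metric, the period and evaluation settings, the instance IDs, and the topic ARN are all placeholders. The code only builds arguments in the shape boto3's `put_metric_alarm` accepts.

```python
# Hypothetical placeholders: instance IDs and topic ARN are not real values.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:avalon-alerts"
SERVICES = {
    "fedora": "i-00000000000000001",
    "avalon-web": "i-00000000000000002",
    "avalon-workers": "i-00000000000000003",
    "solr": "i-00000000000000004",
}


def build_status_alarm(service, instance_id, topic_arn):
    """Return keyword arguments suitable for boto3's cloudwatch.put_metric_alarm()."""
    return {
        "AlarmName": service + "-status-check",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",   # assumed metric, not from the thread
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                        # seconds per datapoint
        "EvaluationPeriods": 2,              # breaching datapoints needed to alarm
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],         # notify on ALARM
        "OKActions": [topic_arn],            # notify on recovery
    }


alarms = [build_status_alarm(name, iid, TOPIC_ARN) for name, iid in SERVICES.items()]
for alarm in alarms:
    print(alarm["AlarmName"])
```

An instance status check only shows that the host is up; whether each service actually responds would need an application-level check on top of this.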

@Toputnal

I have set up a dashboard in CloudWatch called "JRBDashboard" that shows what we are now set up to alert OpsGenie about, so that's probably a good place to look at our current AWS/OpsGenie integration. @d-venckus @mbklein @davidschober

@davidschober
Author

@mbklein I'll let you take a peek and OK it. @Toputnal and @d-venckus, can you write up some brief docs on what we are monitoring, how to set it up, etc.? You can put them at https://github.com/nulib/repodev_planning_and_docs/wiki/AVR-Technical-Documentation#monitoring-via-cloudwatch or you can send me a doc and I can c&p.

@Toputnal

I will update the doc you listed above, @davidschober, as soon as we stabilize the thresholds of the things we are monitoring. Currently, we have quite a bit of "noise" that we are still ironing out.
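
As an illustration of why the thresholds matter, here is a simplified model of how requiring several consecutive breaching datapoints cuts down on noise. It is not CloudWatch's real evaluation code, and the CPU values and the threshold of 90 are made up for the example.

```python
def alarm_states(datapoints, threshold, evaluation_periods):
    """Return "ALARM" or "OK" per datapoint.

    The alarm fires only once `evaluation_periods` consecutive datapoints
    exceed `threshold` (a simplified model of alarm evaluation).
    """
    states = []
    consecutive = 0
    for value in datapoints:
        consecutive = consecutive + 1 if value > threshold else 0
        states.append("ALARM" if consecutive >= evaluation_periods else "OK")
    return states


# Made-up CPU percentages with two short spikes and one sustained one.
cpu = [40, 95, 50, 92, 96, 97, 60]

noisy = alarm_states(cpu, threshold=90, evaluation_periods=1)
quiet = alarm_states(cpu, threshold=90, evaluation_periods=3)
print(noisy.count("ALARM"), quiet.count("ALARM"))
```

With one evaluation period every spike pages someone; with three, only the sustained run does.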

@davidschober
Author

Thanks @Toputnal

@davidschober
Author

@Toputnal I forget: are we closing this and creating a "create final CloudWatch monitoring" issue with the findings? That seems to make sense to me.

@Toputnal

Toputnal commented May 31, 2017 via email

@davidschober
Author

davidschober commented May 31, 2017 via email

@Toputnal

All the alerts I created are still in place after the burn down and rebuild. Yay! Moving to Review.


5 participants