Develop plan for CloudWatch / Monitoring #21
Comments
I'll dig into CloudWatch and its integration with OpsGenie.
It looks like there is a way to set up SNS topics such that CloudWatch will send alerts to OpsGenie via OpsGenie's API. I don't have anything to monitor in my test environment yet, but I will set up something simple and test the CloudWatch integration. If all goes well, we'll need to configure our production Avalon instance's CloudWatch with SNS alerts to OpsGenie's web API. @d-venckus @davidschober @mbklein If we do this, we can then set up OpsGenie to alert whoever, whenever, via OpsGenie's integrated scheduler.
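For reference, a minimal boto3 sketch of that wiring, assuming the HTTPS endpoint is the one OpsGenie hands out when you add its CloudWatch integration on their side (the topic name, region, endpoint URL, and API key below are all placeholders, not our real values):

```python
import boto3

# Region is an assumption -- swap in wherever our Avalon stack actually lives.
sns = boto3.client("sns", region_name="us-east-1")

# Create (or fetch -- create_topic is idempotent by name) the topic that
# CloudWatch alarms will publish to.
topic = sns.create_topic(Name="avalon-alerts")
topic_arn = topic["TopicArn"]

# Subscribe OpsGenie's CloudWatch integration endpoint over HTTPS.
# Placeholder URL: OpsGenie shows the real one (with our API key) when the
# integration is added in their UI.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="https",
    Endpoint="https://api.opsgenie.com/v1/json/cloudwatch?apiKey=OUR-API-KEY",
)
```

OpsGenie should auto-confirm the HTTPS subscription; after that, any CloudWatch alarm whose actions point at this topic ARN should surface as an OpsGenie alert.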
Oh, and I should mention that if an alert clears itself in CloudWatch, that will automatically update and close the alert in OpsGenie, assuming we get the integration set up as I mentioned above.
Unlike with SolarWinds lately, where I've been clearing alerts only to have OpsGenie persist in its endless badgering. Sigh.
I have created a topic in our AWS account so that any messages that go to @mbklein's email also alert OpsGenie, theoretically. We could use some kind of artificial alert to ensure things are working, but I'll leave what that test might be to @mbklein. @davidschober
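If we'd ever rather fan a single existing topic out to both destinations instead of keeping parallel topics, a sketch of that with boto3 (the topic ARN and endpoint below are placeholders, and the region is assumed):

```python
import boto3

sns = boto3.client("sns", region_name="us-east-1")

# Placeholder ARN for the topic that already emails @mbklein.
topic_arn = "arn:aws:sns:us-east-1:123456789012:existing-avalon-topic"

# Sanity-check who is already subscribed (should show the email subscription).
for sub in sns.list_subscriptions_by_topic(TopicArn=topic_arn)["Subscriptions"]:
    print(sub["Protocol"], sub["Endpoint"])

# Add OpsGenie alongside the email -- both then receive every message.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="https",
    Endpoint="https://api.opsgenie.com/v1/json/cloudwatch?apiKey=OUR-API-KEY",
)
```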
@Toputnal Not sure what you mean with the last comment. I think we need to ensure the main systems are up and responding:
* Fedora
* Avalon Web
* Avalon Workers
* SOLR
Is that what you're looking for?
Oh, sorry. When I was poking around in our AWS stuff, I saw an existing alert rule that was set to email MBK, which I copied to create a rule with the same triggers. So, whatever MBK was getting emailed about will now also go to OpsGenie. That's the theory, anyway.
Creating some test/false alert that causes MBK to receive an email alert from AWS should let us know whether the CloudWatch/OpsGenie connection is working correctly, as sketched below.
If all is well, we can then write additional alert rules for whatever is desired, but at this point this is a test of the interconnect between the monitoring/alerting systems.
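One low-effort version of that artificial alert, as a sketch: pick an existing alarm (the name below is a placeholder) and force it into the ALARM state by hand. That fires the alarm's SNS actions, and CloudWatch flips the state back on its next metric evaluation, which also exercises the auto-close path described above.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Force a test firing of an existing alarm (placeholder name). This publishes
# to the alarm's SNS actions, so MBK's email and OpsGenie should both go off.
cloudwatch.set_alarm_state(
    AlarmName="avalon-web-status-check",  # placeholder -- use a real alarm name
    StateValue="ALARM",
    StateReason="Manual test of the CloudWatch -> SNS -> OpsGenie path",
)
# CloudWatch re-evaluates the metric on the next period and restores the real
# state, which should auto-close the OpsGenie alert if the integration works.
```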
Will OpsGenie also alert MBK a second time, Jim? Is that the idea? To test OpsGenie connectivity fully?
Eventually, yes, @d-venckus, but for now I just wanna see some blinkin' lights! :-)
Got it. Can someone document what we're monitoring?
I'm not certain what the existing monitoring set up in AWS is actually monitoring (that is, which alarms are currently set up to email @mbklein). I'll dig in and see what I can find (without changing anything, at first, obviously). If I accidentally break anything, our SOP is to blame @egspoony ;-) (Back me up on this, @d-venckus!)
I have set up alerts for all the services listed above, minus RDS, which is provided to us as a managed service rather than something we run on top of an EC2 instance. Does what I've added look good, @d-venckus @davidschober @mbklein?
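For the record, one of these per-service alerts looks roughly like the sketch below (the alarm name, instance ID, topic ARN, region, and thresholds are all placeholders; the real thresholds are still being tuned, per the noise discussion further down):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Status-check alarm for one service host (e.g. the Fedora EC2 instance).
# Every identifier below is a placeholder.
cloudwatch.put_metric_alarm(
    AlarmName="fedora-status-check-failed",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,  # two bad minutes in a row before we page anyone
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    # Same topic for ALARM and OK so OpsGenie also gets the auto-close signal.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:avalon-alerts"],
    OKActions=["arn:aws:sns:us-east-1:123456789012:avalon-alerts"],
)
```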
I have set up a dashboard in CloudWatch called "JRBDashboard" which shows which things are now set up to alert OpsGenie, so that's probably a good place to look at our current AWS/OpsGenie integration. @d-venckus @mbklein @davidschober
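A quick way to inspect that setup from code rather than clicking through the console, as a sketch (only the dashboard name is real; the region is assumed):

```python
import json

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Pull the dashboard definition -- DashboardBody is a JSON string of widgets.
dashboard = cloudwatch.get_dashboard(DashboardName="JRBDashboard")
print(json.dumps(json.loads(dashboard["DashboardBody"]), indent=2))

# List every alarm and where each one sends its notifications.
for alarm in cloudwatch.describe_alarms()["MetricAlarms"]:
    print(alarm["AlarmName"], "->", alarm.get("AlarmActions", []))
```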
@mbklein I'll let you take a peek and OK it. @Toputnal and @d-venckus, can you write up some brief docs on what we are monitoring, how to set it up, etc.? You can put it at https://github.com/nulib/repodev_planning_and_docs/wiki/AVR-Technical-Documentation#monitoring-via-cloudwatch or you can send me a doc and I can copy and paste it.
I will update the doc you listed above, @davidschober, as soon as we stabilize the thresholds of the things we are monitoring. Currently we have quite a bit of "noise" which we are still ironing out.
Thanks @Toputnal
@Toputnal I forget: are we closing this and creating a "create final CloudWatch monitoring" issue with the findings? That seems to make sense to me.
I need to see if any of the checks I wrote need to be fixed after MBK did the latest burndown/build, and, if so, fix them. I will close this, or move it to Review, once I've verified that. I don't want us to *think* we've got monitoring covered when it *may* be broken. I'll look today.
Sounds good!
All the alerts I created are still in place after the burndown and rebuild. Yay! Moving to Review.
Description
As we move forward with the Avalon on AWS pilot, we need a plan for monitoring. If systems fail, people will need to be notified, and a call tree will need to be set up to bring the system back up. OpsGenie should get notifications.
OpsGenie integration with AWS CloudWatch:
https://www.opsgenie.com/docs/integrations/aws-cloudwatch-integration
Done looks like:
Systems needing monitoring: