This repository has been archived by the owner on Oct 15, 2024. It is now read-only.
@dougoku pointed out in a recent standup that responding with a non-200 code technically indicates a problem with the umpire service itself. Perhaps we should consider something like a status field in the body that could return either OK, WARNING, or CRITICAL, but always with an HTTP code of 200.
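To make the proposal concrete, here is a minimal sketch of what such a response could look like. It is not umpire's actual code: the `classify` and `build_response` helpers, the `value` field, and the `warn_at`/`crit_at` thresholds are all hypothetical names chosen for illustration.

```python
import json
from enum import Enum


class Status(str, Enum):
    OK = "OK"
    WARNING = "WARNING"
    CRITICAL = "CRITICAL"


def classify(value, warn_at, crit_at):
    """Map a metric value to a status using hypothetical upper thresholds."""
    if value >= crit_at:
        return Status.CRITICAL
    if value >= warn_at:
        return Status.WARNING
    return Status.OK


def build_response(value, warn_at, crit_at):
    """Return (http_code, json_body); the HTTP code is always 200.

    The check result travels in the body's "status" field, so a non-200
    code is left to mean that the service itself failed to answer.
    """
    status = classify(value, warn_at, crit_at)
    body = json.dumps({"status": status.value, "value": value})
    return 200, body
```

With this shape, a client would need to parse the body to learn the check result, which is the trade-off against plain "up or down" monitors that only look at the HTTP code.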
Somewhat related: how are we monitoring umpire? I.e., have we simulated an umpire service failure to see what effect it has on its clients? How would we quickly and definitively determine whether an issue is with umpire itself or with a large section of the platform?
Somewhat related: how are we monitoring umpire? I.e., have we simulated an umpire service failure to see what effect it has on its clients? How would we quickly and definitively determine whether an issue is with umpire itself or with a large section of the platform?
Umpire has its own Pingdom monitor ("internal: umpire-production"). If that is red, the issue is with Umpire itself. If a bunch of canaries are red, then the problem is probably HTTP in general. Thoughts on that?
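The rule of thumb above can be sketched as a small triage function. This is only an illustration under assumptions: the `triage` name, the returned labels, and the `canary_threshold` fraction that stands in for "a bunch of canaries" are all hypothetical, and the monitor states are passed in as plain booleans rather than fetched from Pingdom.

```python
def triage(umpire_monitor_up, canaries_up, canary_threshold=0.5):
    """Guess where a problem lies from monitor states.

    umpire_monitor_up: state of the dedicated umpire monitor (True = green).
    canaries_up: dict mapping canary name -> True (green) or False (red).
    canary_threshold: assumed fraction of red canaries that counts as
        "a bunch"; the thread does not define a number.
    """
    # A red umpire monitor points at umpire itself.
    if not umpire_monitor_up:
        return "umpire"
    # Many red canaries while umpire is green point at HTTP in general.
    if canaries_up:
        red = sum(1 for up in canaries_up.values() if not up)
        if red / len(canaries_up) >= canary_threshold:
            return "http-in-general"
    return "no-broad-issue"
```

For example, `triage(True, {"a": False, "b": False, "c": True})` returns `"http-in-general"` because two of the three canaries are red while the umpire monitor is green.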
Having a Pingdom check in place for umpire-production makes sense. I'm not really aware of what is in Pingdom, though. Do we have a shared login for that, or is there some tooling around it that allows self-service access to Pingdom?
Considering that umpire is largely intended to be used by Pingdom or a similar "up or down" service, I think we'll have to stick with the HTTP response codes for this project. I'm going to close this issue, but feel free to re-open.
Ref #4.