Commit 48b8b0f (parent a4f61be): Add alarms to "♻️ that container" PR (#198)

* Add alarms and alarm docs
* Add failedPlacementAlarmPeriods
* Add CloudWatch Alarms snapshots
* Update template jest snapshots
* Add CloudWatch Alarms snapshots
* Add failedworker and failedworkerplacement metric
* Typo r/LogGroup/Logs
* Change metric name
* Metric Filter of worker errors to "[failure]"
* Have current published version instead of undefined
* Jake's Review
* uh update-jest
* Update alarms.md

Showing 4 changed files with 1,678 additions and 1,229 deletions.
# Alarms

This document describes CloudWatch alarms that Watchbot configures. If one of these alarms is tripped, a message will be sent from CloudWatch to the stack's configured `NotificationEmail` or `NotificationTopic`. That message will contain a description field with a URL pointing to this document.

**In all cases**, SQS messages that failed and led to these alarms are put back into SQS to be retried. [See the worker retry documentation](./worker-retry-cycle.md) for more info.
## WorkerErrors

### Why?

More worker containers than the configured threshold failed to process successfully within a 60-second period. The threshold is set when you create your template via `watchbot.template()` through the `options.errorThreshold` value. The default threshold is 10 errors per minute.
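A minimal sketch of setting this threshold when building the template. This is an illustration, not a complete template: the `require` path and every option other than `errorThreshold` are placeholders for your own stack's configuration.

```javascript
// Sketch only: package name and surrounding options are assumptions,
// not a complete, working template definition.
var watchbot = require('@mapbox/watchbot');

var template = watchbot.template({
  // ... your cluster, service, and notification options go here ...
  errorThreshold: 10 // WorkerErrors alarm: errors per minute (default 10)
});
```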
### What to do

These errors represent situations where an SQS message resulted in the launch of a worker container, and that container exited with an exit code of **anything other than** `0` or `4`. See [the readme](./worker-runtime-details.md#worker-completion) for more information about how Watchbot interprets a container's exit codes.
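The exit-code rule above reduces to a simple predicate. The helper below is a hypothetical model for illustration, not part of Watchbot's API:

```javascript
// Hypothetical helper (not Watchbot's API): classifies a worker
// container's exit code the way the WorkerErrors alarm does.
function isWorkerError(exitCode) {
  // Exit codes 0 and 4 do not count as errors (see
  // worker-runtime-details.md); every other code counts
  // toward the errorThreshold.
  return exitCode !== 0 && exitCode !== 4;
}
```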
This likely represents an error in your worker's code, or an edge case that your application is unable to cope with. Most of the time, the cause can be pinpointed by searching through the worker logs in CloudWatch Logs.
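One way to run that search from the command line, assuming you have the AWS CLI configured. The log group name below is a placeholder for your stack's own; the `"[failure]"` pattern matches the term this commit's metric filter uses for worker errors.

```shell
# Placeholder log group name; substitute your stack's log group.
# Searches the last hour of logs for the "[failure]" term.
aws logs filter-log-events \
  --log-group-name my-watchbot-stack-LogGroup \
  --filter-pattern '"[failure]"' \
  --start-time "$(($(date +%s) - 3600))000"
```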
## QueueSize

### Why?

There were more messages in the SQS queue than the configured threshold for some period of time. Both the threshold and the alarm period are set when you create your template via `watchbot.template()` through the `options.alarmThreshold` and `options.alarmPeriod` values. The default threshold is 40 messages, and the default period is 2 hours.
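Note that the alarm fires on a sustained backlog, not a momentary spike. The predicate below models that condition for illustration; it is a hypothetical sketch, not how Watchbot or CloudWatch is implemented:

```javascript
// Hypothetical model of the QueueSize alarm condition: given queue
// depth samples taken across the alarm period, the alarm trips only
// if every sample exceeds the threshold (sustained backlog).
function queueSizeAlarmed(depthSamples, threshold = 40) {
  return depthSamples.length > 0 && depthSamples.every((d) => d > threshold);
}
```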
### What to do

This represents a situation where messages are piling up in SQS faster than they are being processed. You may need to decrease the rate at which messages are being sent to SQS, or investigate whether there is something else preventing workers from processing effectively.
## DeadLetter

### Why?

There are visible messages in the dead letter queue. SQS messages are received by Watchbot's watcher container. If processing a message fails for any reason, the message is sent back to Watchbot's primary queue to be retried. If 10 attempts to process a message all result in failure, the message is sent to the dead letter queue. [See the worker retry documentation](./worker-retry-cycle.md) for more info.
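The retry cycle described above comes down to a routing decision on each failed attempt. The helper below is a hypothetical model for illustration; in practice SQS makes this decision via the queue's redrive policy, not application code:

```javascript
// Hypothetical model (not Watchbot's API): where a failed message
// goes, given how many times it has been received so far.
function routeFailedMessage(receiveCount, maxAttempts = 10) {
  // Through attempt 9, the message returns to the primary queue;
  // after the 10th failure it lands in the dead letter queue.
  return receiveCount >= maxAttempts ? 'dead-letter' : 'retry';
}
```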
### What to do

These messages consistently failed processing attempts. It is possible that they represent an edge case in your worker's processing code. In this case, you should investigate your system's logs to try to determine how the workers failed.

It is also possible that this represents a failure to successfully place workers in your cluster. If that is the case, you will also have seen FailedWorkerPlacement alarms.