Add alarms to "♻️ that container" PR (#198)

* Add alarms and alarm docs
* Add failedPlacementAlarmPeriods
* Add CloudWatch Alarms snapshots
* Update template jest snapshots
* Add CloudWatch Alarms snapshots
* Add failedworker and failedworkerplacement metric
* Typo r/LogGroup/Logs
* Change metric name
* Metric Filter of worker errors to "[failure]"
* Have current published version instead of undefined
* Jake's Review
* uh update-jest
* Update alarms.md
mapbox · Jun 7, 2018 · 48b8b0f
1 parent a4f61be

Showing 4 changed files with 1,678 additions and 1,229 deletions.
diff --git a/docs/alarms.md b/docs/alarms.md
@@ -0,0 +1,40 @@
+# Alarms
+
+This document describes CloudWatch alarms that Watchbot configures. If one of these alarms is tripped, a message will be sent from CloudWatch to the stack's configured `NotificationEmail` or `NotificationTopic`. That message will contain a description field, with a URL pointing to this document.
+
+**In all cases**, SQS messages that failed and led to these alarms are put back into SQS to be retried. [See the worker retry documentation](./worker-retry-cycle.md) for more info.
+
+
+## WorkerErrors
+
+### Why?
+
+More than the configured threshold of worker containers failed to process successfully within a 60-second period. The threshold is set through the `options.errorThreshold` value when you create your template via `watchbot.template()`. The default threshold is 10 errors per minute.
+
+### What to do
+
+These errors represent situations where an SQS message resulted in the launch of a worker container, and that container exited with an exit code of **anything other than** `0` or `4`. See [the worker runtime details](./worker-runtime-details.md#worker-completion) for more information about how Watchbot interprets a container's exit codes.
+
+This likely represents an error in your worker's code, or an edge case that your application is unable to cope with. Most of the time, the cause can be found by searching through the worker logs in CloudWatch Logs.
+
+## QueueSize
+
+### Why?
+
+There were more than a threshold number of messages in the SQS queue for some period of time. Both the threshold and the alarm period are configured when you create your template via `watchbot.template()` through the `options.alarmThreshold` and `options.alarmPeriods` values. The default threshold is 40 messages, and the default period is 24 five-minute periods, or 2 hours.
+
+### What to do
+
+This represents a situation where messages are piling up in SQS faster than they are being processed. You may need to decrease the rate at which messages are being sent to SQS, or investigate whether there is something else preventing workers from processing effectively.
+
+## DeadLetter
+
+### Why?
+
+There are visible messages in the dead letter queue. SQS messages are received by Watchbot's watcher container. If processing the message fails for any reason, the message is sent back to Watchbot's primary queue and will be retried. If 10 attempts to process a message result in a failure, then the message will be sent to the dead letter queue. [See the worker retry documentation](./worker-retry-cycle.md) for more info.
+
+### What to do
+
+These messages consistently failed processing attempts. It is possible that these messages represent an edge case in your worker's processing code. In this case, you should investigate your system's logs to try to determine how the workers failed.
+
+It is also possible that this represents a failure to place workers in your cluster. If this is the case, then you will also have seen alarms on FailedWorkerPlacement.
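
A minimal sketch of how the defaults in this commit translate into alarm evaluation windows, assuming the options are plain numbers (a `{"Ref": "..."}` value is only resolved by CloudFormation at deploy time, so it cannot be computed this way). `alarmWindowSeconds` and `describeAlarmWindows` are hypothetical helpers written for illustration and are not part of Watchbot; the `Period` and `EvaluationPeriods` values are copied from the alarm resources added to `lib/template.js`.

```javascript
'use strict';

// Defaults copied from the options block in lib/template.js in this commit.
const defaults = { errorThreshold: 10, alarmThreshold: 40, alarmPeriods: 24 };

// Hypothetical helper (not part of Watchbot): the time span CloudWatch
// evaluates, i.e. EvaluationPeriods multiplied by Period (in seconds).
function alarmWindowSeconds(evaluationPeriods, periodSeconds) {
  return evaluationPeriods * periodSeconds;
}

// Hypothetical helper: evaluation window, in seconds, for each alarm.
function describeAlarmWindows(options = {}) {
  const merged = Object.assign({}, defaults, options);
  return {
    // WorkerErrorsAlarm: EvaluationPeriods 1, Period 60
    workerErrors: alarmWindowSeconds(1, 60),
    // QueueSizeAlarm: EvaluationPeriods options.alarmPeriods, Period 300
    queueSize: alarmWindowSeconds(merged.alarmPeriods, 300),
    // DeadLetterAlarm: EvaluationPeriods 1, Period 60
    deadLetter: alarmWindowSeconds(1, 60)
  };
}

console.log(describeAlarmWindows());
// { workerErrors: 60, queueSize: 7200, deadLetter: 60 }
```

With the defaults, the QueueSize window is 24 × 300 s = 7200 s, which matches the 2 hours stated in the document. Passing `{ alarmPeriods: 6 }` would shorten it to 30 minutes.
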
diff --git a/lib/template.js b/lib/template.js
@@ -1,28 +1,102 @@
 'use strict';
 
 const cf = require('@mapbox/cloudfriend');
+const path = require('path');
+const pkg = require(path.resolve(__dirname, '..', 'package.json'));
 
 /**
  * Builds Watchbot resources for you to include in a CloudFormation template
  *
  * @param {Object} options - configuration parameters
- * @param {String|ref} options.cluster - the ARN for the ECS cluster that will host Watchbot's containers.
- * @param {String} options.service - the name of your service. This is usually the name of your Github repository. It **must** match the name of the ECR repository where your images are stored.
- * @param {String|ref} options.serviceVersion - the version of you service to deploy. This should reference a specific image in ECR.
- * @param {String} options.command - the shell command that should be executed in order to process a single message.
- * @param {Array} [options.permissions=[]] - permissions that your worker will need in order to complete tasks.
- * @param {Object} [options.env={}] - key-value pairs that will be provided to the worker containers as environment variables. Keys must be strings, and values can either be strings or references to other CloudFormation resources via `{"Ref": "..."}`.
- * @param {String} [options.prefix='Watchbot'] - a prefix that will be applied to the logical names of all the resources Watchbot creates. If you're building a template that includes more than one Watchbot system, you'll need to specify this in order to differentiate the resources.
- * @param {String} [options.family] - the name of the the task definition family that watchbot will create revisions of.
- * @param {Number|ref} [options.workers=1] - the maximum number of worker containers that can be launched to process jobs concurrently. This parameter can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
- * @param {String} [options.mounts=''] - if your worker containers need to mount files or folders from the host EC2 file system, specify those mounts with this parameter. A single persistent mount point can be specified as `{host location}:{container location}`, e.g. /root:/mnt/root. A single ephemeral mount point can be specified as `{container location}`, e.g. /mnt/tmp. Separate multiple mount strings with commas if you need to mount more than one location. You can also specify a mount object with `container` and `host` property arrays, in which the indeces correspond: `{ container: [{container location}], host: [{host location}] }`, e.g. { container: [/mnt/root, /mnt/tmp], host: [/root, ''] }. A blank host entry will create an ephemeral mount point at the corresponding container filepath.
+ * @param {String|ref} options.cluster - the ARN for the ECS cluster that will
+ * host Watchbot's containers.
+ * @param {String} options.service - the name of your service. This is usually
+ * the name of your GitHub repository. It **must** match the name of the ECR
+ * repository where your images are stored.
+ * @param {String|ref} options.serviceVersion - the version of your service to
+ * deploy. This should reference a specific image in ECR.
+ * @param {String} options.command - the shell command that should be executed
+ * in order to process a single message.
+ * @param {Array} [options.permissions=[]] - permissions that your worker will
+ * need in order to complete tasks.
+ * @param {Object} [options.env={}] - key-value pairs that will be provided to
+ * the worker containers as environment variables. Keys must be strings, and
+ * values can either be strings or references to other CloudFormation resources
+ * via `{"Ref": "..."}`.
+ * @param {String|ref} [options.watchbotVersion=current] - the version of Watchbot's
+ * container that will be deployed. Defaults to the installed version.
+ * @param {String|ref} [options.notificationEmail] - an email address to receive
+ * notifications when processing fails. Cannot be combined with `options.notificationTopic`.
+ * @param {String|ref} [options.notificationTopic] - the ARN of an SNS topic to receive
+ * notifications when processing fails. Cannot be combined with `options.notificationEmail`.
+ * @param {String} [options.prefix='Watchbot'] - a prefix that will be applied
+ * to the logical names of all the resources Watchbot creates. If you're
+ * building a template that includes more than one Watchbot system, you'll need
+ * to specify this in order to differentiate the resources.
+ * @param {String} [options.family] - the name of the task definition family
+ * that Watchbot will create revisions of.
+ * @param {Number|ref} [options.workers=1] - the maximum number of worker
+ * containers that can be launched to process jobs concurrently. This parameter
+ * can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
+ * @param {String|Object} [options.mounts=''] - if your worker containers need to mount
+ * files or folders from the host EC2 file system, specify those mounts with this parameter.
+ * A single persistent mount point can be specified as `{host location}:{container location}`,
+ * e.g. `/root:/mnt/root`. A single ephemeral mount point can be specified as `{container location}`,
+ * e.g. `/mnt/tmp`. Separate multiple mount strings with commas if you need to mount
+ * more than one location. You can also specify a mount object with `container`
+ * and `host` property arrays, in which the indices
+ * correspond: `{ container: [{container location}], host: [{host location}] }`,
+ * e.g. `{ container: ['/mnt/root', '/mnt/tmp'], host: ['/root', ''] }`. A blank host
+ * entry will create an ephemeral mount point at the corresponding container filepath.
  * @param {Object} [options.reservation={}] - worker container resource reservations
- * @param {Number|ref} [options.reservation.memory] - the number of MB of RAM to reserve as a hard limit. If your worker container tries to utilize more than this much RAM, it will be shut down. This parameter can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
- * @param {Number|ref} [options.reservation.softMemory] - the number of MB of RAM to reserve as a soft limit. Your worker container will be able to utilize more than this much RAM if it happens to be available on the host. This parameter can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
- * @param {Number|ref} [options.reservation.cpu] - the number of CPU units to reserve for your worker container. This will only impact the placement of your container on an EC2 with sufficient CPU capacity, but will not limit your container's utilization. This parameter can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
- * @param {Boolean} [options.privileged=false] - give the container elevated privileges on the host container instance
- * @param {Number|ref} [options.messageTimeout=600] - once Watchbot pulls a message from SQS and spawns a worker to process it, SQS will wait this many seconds for a response. If the worker has not yet finished processing the message for any reason, SQS will make the message visible again and Watchbot will spawn another worker to process it. This is helpful when containers or processing scripts crash, but make sure that it allows sufficient time for routine processing to occur. If set too low, you will end up processing jobs multiple times. This parameter can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
- * @param {Number|ref} [options.messageRetention=1209600] - the number of seconds that a message will exist in SQS until it is deleted. The default value is the maximum time that SQS allows, 14 days. This parameter can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
+ * @param {Number|ref} [options.reservation.memory] - the number of MB of RAM
+ * to reserve as a hard limit. If your worker container tries to utilize more
+ * than this much RAM, it will be shut down. This parameter can be provided as
+ * either a number or a reference, i.e. `{"Ref": "..."}`.
+ * @param {Number|ref} [options.reservation.softMemory] - the number of MB of
+ * RAM to reserve as a soft limit. Your worker container will be able to utilize
+ * more than this much RAM if it happens to be available on the host. This
+ * parameter can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
+ * @param {Number|ref} [options.reservation.cpu] - the number of CPU units to
+ * reserve for your worker container. This will only impact the placement of
+ * your container on an EC2 with sufficient CPU capacity, but will not limit
+ * your container's utilization. This parameter can be provided as either a
+ * number or a reference, i.e. `{"Ref": "..."}`.
+ * @param {Boolean} [options.privileged=false] - give the container elevated
+ * privileges on the host container instance
+ * @param {Number|ref} [options.messageTimeout=600] - once Watchbot pulls a
+ * message from SQS and spawns a worker to process it, SQS will wait this many
+ * seconds for a response. If the worker has not yet finished processing the
+ * message for any reason, SQS will make the message visible again and Watchbot
+ * will spawn another worker to process it. This is helpful when containers or
+ * processing scripts crash, but make sure that it allows sufficient time for
+ * routine processing to occur. If set too low, you will end up processing jobs
+ * multiple times. This parameter can be provided as either a number or a
+ * reference, i.e. `{"Ref": "..."}`.
+ * @param {Number|ref} [options.messageRetention=1209600] - the number of seconds
+ * that a message will exist in SQS until it is deleted. The default value is
+ * the maximum time that SQS allows, 14 days. This parameter can be provided as
+ * either a number or a reference, i.e. `{"Ref": "..."}`.
+ * @param {Number|ref} [options.errorThreshold=10] - Watchbot creates a
+ * CloudWatch alarm that will fire if there have been more than this number
+ * of failed worker invocations in a 60 second period. This parameter can be provided as
+ * either a number or a reference, i.e. `{"Ref": "..."}`.
+ * @param {Number|ref} [options.alarmThreshold=40] - Watchbot creates a
+ * CloudWatch alarm that will go off when there have been too many messages in
+ * SQS for a certain period of time. Use this parameter to set the number of
+ * messages that triggers the alarm. This parameter can be provided as
+ * either a number or a reference, i.e. `{"Ref": "..."}`.
+ * @param {Number|ref} [options.alarmPeriods=24] - Use this parameter to control
+ * the duration that the SQS queue must be over the message threshold before
+ * triggering an alarm. You specify the number of 5-minute periods before an
+ * alarm is triggered. The default is 24 periods, or 2 hours. This parameter
+ * can be provided as either a number or a reference, i.e. `{"Ref": "..."}`.
+ * @param {Number|ref} [options.failedPlacementAlarmPeriods=1] - Use this
+ * parameter to control how long failed placements must exceed
+ * the threshold of 5 before an alarm is triggered. You specify the number
+ * of 1-minute periods before an alarm is triggered. The default is 1 period, or
+ * 1 minute. This parameter can be provided as either a number or a reference,
+ * i.e. `{"Ref": "..."}`.
  */
 module.exports = (options = {}) => {
   ['service', 'serviceVersion', 'command', 'cluster'].forEach((required) => {
@@ -32,14 +106,19 @@ module.exports = (options = {}) => {
   options = Object.assign(
     {
       prefix: 'Watchbot',
+      watchbotVersion: 'v' + pkg.version,
       reservation: {},
       env: {},
       messageTimeout: 600,
       messageRetention: 1209600,
       workers: 1,
       mounts: '',
       privileged: false,
-      family: options.service
+      family: options.service,
+      errorThreshold: 10,
+      alarmThreshold: 40,
+      alarmPeriods: 24,
+      failedPlacementAlarmPeriods: 1
     },
     options
   );
@@ -96,9 +175,24 @@ module.exports = (options = {}) => {
   };
 
   const mounts = mount(options.mounts);
-
   const Resources = {};
 
+  if (options.notificationTopic && options.notificationEmail) throw new Error('Cannot provide both notificationTopic and notificationEmail.');
+  const notify = options.notificationTopic || cf.ref(prefixed('NotificationTopic'));
+
+  if (options.notificationEmail) Resources[prefixed('NotificationTopic')] = {
+    Type: 'AWS::SNS::Topic',
+    Description: 'Subscribe to this topic to receive emails when tasks fail or retry',
+    Properties: {
+      Subscription: [
+        {
+          Endpoint: options.notificationEmail,
+          Protocol: 'email'
+        }
+      ]
+    }
+  };
+
   Resources[prefixed('DeadLetterQueue')] = {
     Type: 'AWS::SQS::Queue',
     Description: 'List of messages that failed to process 14 times',
@@ -430,5 +524,78 @@ module.exports = (options = {}) => {
     }
   };
 
+  Resources[prefixed('DeadLetterAlarm')] = {
+    Type: 'AWS::CloudWatch::Alarm',
+    Properties: {
+      AlarmName: cf.sub('${AWS::StackName}-dead-letter'),
+      AlarmDescription:
+        'Provides notification when messages are visible in the dead letter queue',
+      EvaluationPeriods: 1,
+      Statistic: 'Minimum',
+      Threshold: 1,
+      Period: '60',
+      ComparisonOperator: 'GreaterThanOrEqualToThreshold',
+      Namespace: 'AWS/SQS',
+      Dimensions: [
+        { Name: 'QueueName', Value: cf.getAtt(prefixed('DeadLetterQueue'), 'QueueName') }
+      ],
+      MetricName: 'ApproximateNumberOfMessagesVisible',
+      AlarmActions: [notify]
+    }
+  };
+
+  Resources[prefixed('WorkerErrorsMetric')] = {
+    Type: 'AWS::Logs::MetricFilter',
+    Properties: {
+      FilterPattern: '"[failure]"',
+      LogGroupName: cf.ref(prefixed('Logs')),
+      MetricTransformations: [{
+        MetricName: cf.join([prefixed('WorkerErrors-'), cf.stackName]),
+        MetricNamespace: 'Mapbox/ecs-watchbot',
+        MetricValue: 1
+      }]
+    }
+  };
+
+  Resources[prefixed('WorkerErrorsAlarm')] = {
+    Type: 'AWS::CloudWatch::Alarm',
+    Properties: {
+      AlarmName: cf.sub('${AWS::StackName}-worker-errors'),
+      AlarmDescription:
+        `https://github.com/mapbox/ecs-watchbot/blob/${options.watchbotVersion}/docs/alarms.md#workererrors`,
+      EvaluationPeriods: 1,
+      Statistic: 'Sum',
+      Threshold: options.errorThreshold,
+      Period: '60',
+      ComparisonOperator: 'GreaterThanThreshold',
+      Namespace: 'Mapbox/ecs-watchbot',
+      MetricName: cf.join([prefixed('WorkerErrors-'), cf.stackName]),
+      AlarmActions: [notify]
+    }
+  };
+
+  Resources[prefixed('QueueSizeAlarm')] = {
+    Type: 'AWS::CloudWatch::Alarm',
+    Properties: {
+      AlarmName: cf.sub('${AWS::StackName}-queue-size'),
+      AlarmDescription:
+        `https://github.com/mapbox/ecs-watchbot/blob/${options.watchbotVersion}/docs/alarms.md#queuesize`,
+      EvaluationPeriods: options.alarmPeriods,
+      Statistic: 'Average',
+      Threshold: options.alarmThreshold,
+      Period: '300',
+      ComparisonOperator: 'GreaterThanThreshold',
+      Namespace: 'AWS/SQS',
+      MetricName: 'ApproximateNumberOfMessagesVisible',
+      Dimensions: [
+        {
+          Name: 'QueueName',
+          Value: cf.getAtt(prefixed('Queue'), 'QueueName')
+        }
+      ],
+      AlarmActions: [notify]
+    }
+  };
+
   return cf.merge({ Resources });
 };
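
The notification wiring in the third hunk can be sketched in isolation as follows. This is an illustrative reconstruction under stated assumptions, not the library's code: `resolveNotification` is a hypothetical function, and the local `prefixed` and `ref` helpers are simplified stand-ins for the `prefixed()` helper and `cf.ref()` used in `lib/template.js`. The example ARN and email address are placeholders.

```javascript
'use strict';

// Simplified stand-ins (assumptions) for prefixed() and cf.ref() in the diff.
const prefixed = (prefix, name) => `${prefix}${name}`;
const ref = (logicalName) => ({ Ref: logicalName });

// Hypothetical function mirroring the hunk: pick the alarm target and,
// when an email address is given, create an SNS topic subscribed to it.
function resolveNotification(options, prefix = 'Watchbot') {
  if (options.notificationTopic && options.notificationEmail) {
    throw new Error('Cannot provide both notificationTopic and notificationEmail.');
  }

  const Resources = {};
  const topicName = prefixed(prefix, 'NotificationTopic');

  // An explicit topic ARN wins; otherwise alarms reference the stack's own topic.
  const notify = options.notificationTopic || ref(topicName);

  if (options.notificationEmail) {
    Resources[topicName] = {
      Type: 'AWS::SNS::Topic',
      Properties: {
        Subscription: [{ Endpoint: options.notificationEmail, Protocol: 'email' }]
      }
    };
  }

  return { notify, Resources };
}

const withEmail = resolveNotification({ notificationEmail: 'dev@example.com' });
console.log(withEmail.notify); // { Ref: 'WatchbotNotificationTopic' }

const withTopic = resolveNotification({
  notificationTopic: 'arn:aws:sns:us-east-1:123456789012:alerts'
});
console.log(Object.keys(withTopic.Resources).length); // 0
```

One thing the sketch makes visible: when neither option is supplied, `notify` still references `WatchbotNotificationTopic` even though no such resource is created. Nothing in the hunks shown here requires one of the two options, so whether code outside these hunks guards against that case is not visible from this diff.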