♻ that container #184

Merged
merged 33 commits into from Jun 16, 2018

Conversation

@rclark (Contributor) commented Feb 8, 2018

This is a work-in-progress PR that demonstrates a watchbot system that, instead of launching a new ECS task for each SQS message, reuses individual containers to process multiple messages.

Primary benefits:

  • Higher work throughput: you don't have to pay the cost of container startup for each message.
  • Stable reservation loads: ECS clusters can scale to meet processing demands more effectively.
  • Uses an ECS service: developers get CPU and memory utilization metrics in CloudWatch.
  • No RunTask API calls: avoids the risk of rate-limiting and of race conditions in the ECS scheduler.
  • Easy ramp-up: the ECS service can scale up gradually in response to a massive spike of work in the queue.

The biggest caveats are:

  • Developers must install watchbot as part of the Dockerfiles that describe their workers.
  • Watchbot code runs as the container's main process; workers run in child processes.
  • Scaling down the ECS service is tricky.

How it works

The developer writes a Dockerfile that describes the environment needed to process an SQS message. This includes the code that does the processing and must also include code for watchbot itself. The processing code must be encapsulated in a shell command that expects certain environment variables to convey the details of the message to be processed.
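For illustration only, a worker entry point might look like the sketch below. The `Message` environment variable name and the overall shape are assumptions, not the PR's actual contract:

```js
#!/usr/bin/env node
'use strict';

// Hypothetical worker script. Watchbot runs this as a child process and
// conveys message details via environment variables; the variable name
// `Message` is an assumption for illustration.
async function processMessage(payload) {
  // ... the developer's actual processing logic goes here ...
  console.log(`processing: ${payload}`);
}

processMessage(process.env.Message)
  .then(() => process.exit(0)) // exit 0: watchbot deletes the message
  .catch((err) => {
    console.error(err);
    process.exit(1); // non-zero: watchbot returns the message to the queue
  });
```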

The developer constructs a CloudFormation template. Just like in previous versions, they use watchbot.template() to create a set of CloudFormation resources to include in the template (SQS queue, roles, task definition, service, scaling policies, and alarms).
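For example, a template module might look roughly like this. The option names passed to watchbot.template() are illustrative assumptions, not the finalized interface (cloudfriend's merge() is real, though):

```js
'use strict';

const watchbot = require('@mapbox/watchbot'); // package name assumed
const cf = require('@mapbox/cloudfriend');

// Option names below are assumptions for illustration.
const watcher = watchbot.template({
  cluster: 'processing',
  service: 'example',
  command: 'process-message.sh',
  maxSize: 50
});

// Merge the watchbot resources with the rest of the application's template.
module.exports = cf.merge(watcher, {
  Resources: {
    /* ... other application resources ... */
  }
});
```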

Launching this stack sets up an ECS service which will launch containers based on the developer's Dockerfile. Scaling policies will increase the service's desired number of containers when there are visible messages in the SQS queue. The maximum number of containers in the ECS service defines the maximum number of SQS messages that can be processed concurrently.

When a container launches, its main command is to run watchbot code. This code is responsible for polling the SQS queue, receiving a message, then processing the message by running the developer's shell command as a child process. Once the child process exits, watchbot code determines whether to delete the message or return it to the queue based on the process exit code, then polls the SQS queue again.
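In rough strokes, the main loop looks like the sketch below. The waitFor() and retry() names appear in the review comments further down; complete(), the body property, and the env var name are assumptions:

```js
'use strict';

const { spawn } = require('child_process');

// Sketch of the watcher's main loop: poll, work, delete or retry, repeat.
// `messages` is a queue poller; `command` is the developer's shell command.
const listen = (messages, command) => {
  const loop = async () => {
    const [message] = await messages.waitFor(); // long-poll SQS for a message

    // Run the developer's shell command, conveying message details via env vars.
    const code = await new Promise((resolve) => {
      const child = spawn(command, {
        env: Object.assign({}, process.env, { Message: message.body }),
        shell: true
      });
      child.on('close', resolve);
    });

    if (code === 0) await message.complete(); // exit 0: delete the message
    else await message.retry(); // non-zero: return it to the queue

    setImmediate(loop); // poll again without growing the call stack
  };

  loop();
};
```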

When there are no messages in flight, meaning nothing is being processed, the ECS service scales itself down to 0 containers. The next time messages arrive in the SQS queue, the service will begin scaling itself up again.

Proof-of-concept

The container-recycling branch of ecs-telephone uses this ♻️ version of watchbot. The PR provides a good overview of what a developer needs to change about their existing watchbot system to work with this new version.

It looked good when I ran it. A single container got to the point where it was processing almost 2 SQS messages per second. Granted, ecs-telephone "processing" takes approximately no time. We should run more proof-of-concept stacks that do more realistic work and face more realistic SQS queue loads.

To-Do

A lot.

This was a rewrite, not a refactor, and the current state of the rewrite is intentionally very, very minimalistic. Some things that are missing that I know need to be added:

  • Alarms for when things go severely wrong
  • Logs. It's always been important that the "watcher" code give some insight into what's going on, and that "worker" logs are appropriately structured for searching after the fact. Neither of these is implemented yet.
  • Track and log worker durations
  • Reduce-mode
  • CLI tools for managing dead-letter queue
  • Notifications. This PR includes no code to send notifications in the event of worker processing failure. We introduced dead-letter queues in v2, and that really altered the way most developers are notified of worker failures. This PR currently includes a dead-letter queue, but no mechanism for alarms or emails on any kind of failure. We should rebuild the notification settings in a way that makes sense in the new dead-letter world.

Other "interesting" things

  • I wanted to learn to write code using async/await so here we are in node.js v8. It won't work in v4 or v6.
  • Since developers are going to need to "install" watchbot in their Dockerfiles, we should make that trivial, and it should bundle node v8 in a way that will allow watchbot to run independently of whatever node version the actual processing code requires.
  • I tried using Jest as a way to do fixture-based testing of the watchbot.template() command. I think this is wise for what .template() is doing, but it is weird to have tape & jest tests.
  • I went heavily object-oriented in the code. I have this dream where that means someday we could introduce derivative "worker" classes that run work on other computing platforms. Maybe Lambda? Maybe straight on ECS with RunTask? The real hope is that this is making the system more maintainable / extensible and future changes maybe don't have to be rewrites.

cc @mapbox/platform

@rclark requested a review from @arunasank February 8, 2018 18:18
@jakepruitt left a comment

SO EXCITED! Tried to give feedback about clarity wherever I could, but I also came in with bias from knowing a bit about watchbot beforehand. Would love to hear feedback from other folks.

lib/watcher.js Outdated
listen() {
  return new Promise((resolve) => {
    const loop = async () => {
      if (this.stop) return resolve();


Could you explain this line a bit more? Maybe add a comment about the usage of this.stop?

@rclark (Author):

Yeah, this is a bummer of a testing caveat. In real life, .stop will always be undefined, and this promise will never resolve.

Tests can set .stop so that the intended infinite recursion will end. See the test here and a similar pattern in the messages.js polling loop. ... come to think of it I may be able to avoid it over there.
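Roughly, the test pattern is (a sketch assuming a stubbed Watcher and options):

```js
// In a test: start listening, then flag the watcher to stop so the
// otherwise-infinite loop resolves its promise.
const watcher = Watcher.create(options);
const listening = watcher.listen();
watcher.stop = true; // next iteration sees .stop and calls resolve()
await listening;
```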

lib/watcher.js Outdated
const loop = async () => {
  if (this.stop) return resolve();

  const messages = await this.messages.waitFor();


How many messages get pulled at once?


I ask here rather than in lib/messages.js because I assumed there would only ever be one message you're waiting for. With the plural messages I feel like there's a few things that could use some explaining at the watcher level - like how many messages get pulled in at a time? Do the workers run concurrently or serially?

@rclark (Author) Feb 8, 2018:

.waitFor(num) is actually the function signature, with a default value of 1. I'm hedging a little bit here between concurrent vs serial processing.

It is so easy to write a Messages class that accommodates receiving 1-10 messages per API call that it almost feels like an oversight not to allow for it.

But for this watchbot iteration, we're looking at running 1 worker at a time per watcher, one after the other. Since this is .waitFor(1), that's what will happen.

What I guess I have done is put in some of the structure that would allow for concurrent workers per watcher, but not all of it. Maybe that's a bad idea.
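For reference, a Messages class that accommodates 1-10 messages per call might be shaped like this (a sketch, not the PR's actual code):

```js
'use strict';

const AWS = require('aws-sdk');

class Messages {
  constructor(options = {}) {
    // Bind the queue URL so each API call doesn't have to repeat it.
    this.sqs = new AWS.SQS({ params: { QueueUrl: options.queueUrl } });
  }

  // Long-poll until at least one message arrives, receiving up to `num`
  // messages per API call (SQS allows 1-10).
  async waitFor(num = 1) {
    let messages = [];
    while (!messages.length) {
      const data = await this.sqs.receiveMessage({
        MaxNumberOfMessages: Math.min(num, 10),
        WaitTimeSeconds: 20,
        AttributeNames: ['All']
      }).promise();
      messages = data.Messages || [];
    }
    return messages;
  }
}

module.exports = Messages;
```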

lib/watcher.js Outdated
const worker = Worker.create(message, this.workerOptions);

worker.on('error', (err) => this.emit('error', err));
message.on('error', (err) => this.emit('error', err));


Does the retry error handling happen deeper in the workers/messages? What distinguishes these errors from retried errors?

@rclark (Author) Feb 8, 2018:

I think of these emitted errors as "watcher-level" errors that don't have anything to do with the developer's processing code encountering a failure. These errors are chained through event emitters, rather than via rejected promises, because the watcher is supposed to keep on chugging along if it...

  • fails to return a message to the queue because its receipt handle isn't valid anymore
  • fails an SQS receiveMessage API call due to an InternalServerError
  • can't launch a child process because of xyz

When a child process exits non-zero, those "errors" are all translated into returning the message to the queue. Since there's currently no logging or notifications or alarms incorporated in this PR, processing errors just mean retry the job and nothing else. Finding the right way to surface those failures is probably the biggest TODO at the moment.

@rclark (Author):

Note that PR #185 added logging, which removed the need to bubble errors up through event emitters. Instead, when an error is encountered it just gets logged.

lib/watcher.js Outdated
    this.emit('error', err);
  }

  setImmediate(loop);


Question about await - does it make this file hang at line 44 until all of the workers complete? Or does it ride through that line and call setImmediate again immediately? Mostly just asking for my own curiosity.

@rclark (Author):

The code hangs on L44 until the promise resolves. If it rejects, the error is thrown, hence the try/catch.
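A minimal demonstration of that behavior:

```js
// A rejected promise is re-thrown at the `await` site, so try/catch
// catches it; otherwise the async function simply pauses until resolution.
const work = () => Promise.reject(new Error('boom'));

(async () => {
  try {
    await work();
  } catch (err) {
    console.log(err.message); // prints 'boom'
  }
})();
```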


static create(options) {
  return new Watcher(options);
}


I appreciate isolating the usage of new, or giving folks a way around it, just because I personally don't like new. Did you do this for the sake of tests?

@rclark (Author):

Yes, did it for tests. There's no easy way to mock constructors with sinon, and I spent way too long looking at other frameworks that would allow for it. Way too long.
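A minimal illustration of why the factory helps (a sketch):

```js
'use strict';

const sinon = require('sinon');

class Worker {
  static create(options) {
    return new Worker(options);
  }
}

// sinon can replace a static factory method, but it cannot easily
// intercept a bare `new Worker()` expression inside the code under test.
const fake = { fail: sinon.stub().resolves() };
sinon.stub(Worker, 'create').returns(fake);

console.log(Worker.create() === fake); // true: callers now get the fake
Worker.create.restore();
```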

Contributor:

I actually really like the usage of the static method to create an instance of the object as opposed to new. It makes the code way more readable, even though it might seem redundant.

lib/worker.js Outdated

async fail(err) {
  if (err) this.emit('error', err);
  return await this.message.retry();


So... will the worker both emit an error and retry the message? Will it only retry if an err is passed?

@rclark (Author):

Another logging TBD here.

  assert.end();
});

test('[bin.watchbot] error handling', async (assert) => {


Does the watcher continue to work if the watcher.listen() errors out? Does the container die and another one start up? Could you add a line confirming the exit code of the watchbot() process in either case?

@rclark (Author):

Note that there's no reject handler on watcher.listen()'s promise. The only way this fails is through JS syntax errors in watchbot code itself, or if the OS decides to shut it down for some reason. Either way would result in an error getting printed and the process exiting. I can't actually simulate this in a test because it is not supposed to happen.

  assert.end();
});

test('[message] retry, too many receieves', async (assert) => {


Is this where the dead letter queue lived before?

@rclark (Author):

The default is for messages to go dead-letter after the 11th receive. That's still set up, but in the past this limit has been configurable (or at least configurability has been considered). So this check is about making sure we don't send an API call with a visibility timeout that's too long, in the event that we decide to make the dead-letter limit configurable.

@@ -0,0 +1,24 @@
'use strict';

const sinon = require('sinon');


Why did you go the route of a local helper instead of the "repeat yourself in tests" philosophy?

@rclark (Author):

Oh the stubber? Because

a) it's a little messy to stub the .create() function AND end up with access to a stub instance of the class so you can spy on instance method calls.

b) all these things are event emitters, and I wanted those emitters to actually work on the stub instances.

c) I thought about this waaaaaaaaaaay too long. At one point this stubber was trying to get things done without having to use .create() factories, but I eventually got to the point where I'd lost most of a day and gained maybe 50 lines of code and a bad headache. This is what I fell back on.

  cluster: 'processing'
});

expect(builtWithDefaults).toMatchSnapshot('defaults');


😱 oh wow, so this is part of jest? Why did you want to go with this over tape?

@rclark (Author):

Pretty much all I know about Jest is that people use this to manage fixture-based tests for another type of complex object: the HTML on a frontend app. I thought I would see how it feels for cloudformation JSON. I could do fixture-based tests in tape for sure, but this was easier to set up.

I would encourage you to kick the tires a little and see if you think it's worth keeping, or if I should write a little fixture-checker in tape (think: readable diffs when tests fail).
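For comparison, a minimal tape fixture-checker might look like this sketch (the paths and the UPDATE environment variable are assumptions):

```js
'use strict';

const test = require('tape');
const fs = require('fs');
const path = require('path');
const watchbot = require('..'); // module path assumed

test('[template] defaults', (assert) => {
  const built = watchbot.template({ cluster: 'processing' });
  const fixturePath = path.join(__dirname, 'fixtures', 'defaults.json');

  // Regenerate the fixture on demand, like jest's --updateSnapshot.
  if (process.env.UPDATE) {
    fs.writeFileSync(fixturePath, JSON.stringify(built, null, 2));
  }

  const expected = JSON.parse(fs.readFileSync(fixturePath, 'utf8'));
  assert.deepEqual(built, expected, 'template matches fixture');
  assert.end();
});
```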

@arunasank (Contributor) commented Feb 9, 2018

Hi @rclark, this is a bad accountabilibuddy week. I spent all my day pushing on https://github.com/mapbox/mbxcli/pull/783, and am a little too exhausted to review this PR in a nice way. Super excited to read and understand this though, and I hope to have lots of questions for you to answer on Monday!

@rclark mentioned this pull request Feb 9, 2018
lib/message.js Outdated
constructor(sqsMessage = {}, options = {}) {
  let valid = ['Body', 'MessageId', 'ReceiptHandle', 'Attributes'].reduce(
    (valid, key) => {
      if (!valid) return valid;
Contributor:

naive q: Why not just return false instead of valid? This wouldn't preserve the truthy value of valid from a previous call, and returning false would be simpler to read?

@rclark (Author):

👍 will do
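For the record, the suggested simplification would read (the per-key test is an assumption, since the original callback body isn't quoted here):

```js
const valid = ['Body', 'MessageId', 'ReceiptHandle', 'Attributes'].reduce(
  (valid, key) => {
    if (!valid) return false; // short-circuit with a literal, as suggested
    return sqsMessage[key] !== undefined; // assumed per-key validity check
  },
  true
);
```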

  params: { QueueUrl: options.queueUrl }
});

this.logger = Logger.create('watcher', this);
Contributor:

naive q 2: How does the value of the message instance percolate down to the logger class, considering we only have the options variable in https://github.com/mapbox/ecs-watchbot/blob/container-recycling/lib/logger.js#L65?

@rclark (Author):

That's a bug, thanks, will test and fix.

Contributor:

Ah, ok. Thought there was magic happening. 🙂 @rclark follow-up: in both cases, we only have the queueUrl property in the options object (both in this class and in the Messages class). Is there a reason why you haven't just used a string? Is it for extensibility, where you see this object containing other kinds of metadata in the future?

@rclark (Author):

Yes, just for extensibility.


const params = {
  ReceiptHandle: this.handle,
  VisibilityTimeout: Math.pow(2, receives)
Contributor:

Why are we setting the VisibilityTimeout using a power of 2?

@rclark (Author):

We want there to be an exponential increase in the number of seconds that a message "waits" before it gets retried again. This is an attempt to protect any underlying resources (maybe s3, dynamo, ecs) that may get throttled by too-rapid retry.
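Concretely, the backoff plus the guard discussed above might be (a sketch; 43200 seconds, 12 hours, is SQS's documented maximum visibility timeout):

```js
// Exponential backoff: 2s, 4s, 8s, ... per receive. SQS rejects
// VisibilityTimeout values above 43200 seconds, so cap the result in
// case the dead-letter threshold ever becomes configurable.
const MAX_VISIBILITY = 43200;

const params = {
  ReceiptHandle: this.handle,
  VisibilityTimeout: Math.min(Math.pow(2, receives), MAX_VISIBILITY)
};

await this.sqs.changeMessageVisibility(params).promise();
```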

Contributor:

Related: #184 (comment)

  data = await this.sqs.receiveMessage(params).promise();
} catch (err) {
  this.logger.queueError(err);
  return setImmediate(poll);
Contributor:

@rclark is there a reason you prefer setImmediate over process.nextTick?

@rclark (Author):

There is some subtle difference between the two that I never remember. setImmediate is my habit.
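For the record, the general Node.js difference: process.nextTick callbacks run before the event loop continues, so a recursive nextTick loop can starve pending I/O, while setImmediate yields to I/O between iterations, which is the friendlier choice for a polling loop.

```js
// nextTick fires before the event loop proceeds; setImmediate fires in
// the "check" phase of the next loop iteration, after pending I/O.
setImmediate(() => console.log('setImmediate'));
process.nextTick(() => console.log('nextTick'));
// prints: nextTick, then setImmediate
```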

@arunasank (Contributor)

Hey @rclark, loved shadowing you on this PR and learned a ton. Wanted to list a bunch of stuff I saw in code for the first time:

  • setImmediate 🤔
  • async, await 😍 loving this - so easy to combine async/sync.

Also, this is the first time I am reviewing code that contains a lot of object-oriented JavaScript, and it makes the various pieces of the code so much clearer. I especially love how the various pieces come together in the watcher and the worker.

rclark and others added 9 commits June 15, 2018 17:48
* adds logging of watcher-level errors, worker receives, and completion status

* prefixed logs from child processes
* change scale-down MetricIntervalLowerBound to MetricIntervalUpperBound
* Add alarms and alarm docs

* Add failedPlacementAlarmPeriods

* Add CloudWatch Alarms snapshots

* Update template jest snapshots

* Add CloudWatch Alarms snapshots

* Add failedworker and failedworkerplacement metric

* Typo r/LogGroup/Logs

* Change metric name

* Metric Filter of worker errors to "[failure]"

* Have current published version instead of undefined

* Jake's Review

* uh update-jest

* Update alarms.md
* Add travis user

* Ensure this fails

* Add validation for notificationEmail or notificationTopic
tapaswenipathak and others added 24 commits June 15, 2018 17:48
…queue threshold, info to doc (#211)

* Closes #208, #207, #206, #182, #149, #72, #15

(cherry picked from commit 8de328df79ccf52b8d612c625891555808c2fa0e)

* Add minSize as option

* update jest tests

* Change MinSize to 0

* update jest

* identation and minSize to 0

* Add deadletterThreshold info in Worker-retry-cycle
* Restrict writes to volumes and clean them after every job

* Try out the `ReadOnlyRootFilesystem` option

* Capitalization

* Add watchbot-log

* use strict

* No need to chmod now
@jakepruitt merged commit 7fc31c2 into master Jun 16, 2018
@arunasank deleted the container-recycling branch August 9, 2018 06:34