Run builds on AWS Batch [PR] #29

Merged
tsibley merged 14 commits into master from aws-batch on Nov 26, 2018

Conversation

tsibley
Member

@tsibley tsibley commented Sep 21, 2018

See my comments on #28 and the commit messages themselves.

It doesn't need to support arbitrary base modules; just return the last
dotted component.  Advantageously, this also avoids surprising behaviour
around subpackages, e.g. nextstrain.cli.runner.foo.bar would return
"bar" instead of "foo.bar" if foo/__init__.py exists.

I believe this complexity was accidentally held over from code which
_did_ supply a different "base_module" but was rewritten before seeing
the light of day.
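For illustration, the simplified behaviour might boil down to something like this (the function name is hypothetical, not the actual CLI internals):

```python
# A minimal sketch of the "last dotted component" behaviour described above.
def runner_name(module: str) -> str:
    """
    Return the last dotted component of a runner module's name, e.g.
    "nextstrain.cli.runner.foo.bar" → "bar" (not "foo.bar").
    """
    return module.rsplit(".", 1)[-1]
```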
This adds an --aws-batch option to the "build" command which launches
the pathogen build as a job on AWS Batch.  It uploads the build
directory to S3, submits the job, monitors the job status, streams the
job logs, and downloads build results.
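A rough boto3 sketch of that lifecycle (log streaming elided; the bucket, queue, and job definition names below are placeholders, not the CLI's actual configuration):

```python
import time
import boto3

s3    = boto3.client("s3")
batch = boto3.client("batch")

# 1. Upload the zipped build directory to S3.
s3.upload_file("build.zip", "nextstrain-jobs", "build.zip")

# 2. Submit the job to AWS Batch.
job = batch.submit_job(
    jobName       = "nextstrain-job",
    jobQueue      = "nextstrain-job-queue",
    jobDefinition = "nextstrain-job",
)

# 3. Poll the job status until it reaches a terminal state.
while True:
    status = batch.describe_jobs(jobs = [job["jobId"]])["jobs"][0]["status"]
    if status in {"SUCCEEDED", "FAILED"}:
        break
    time.sleep(5)

# 4. Download the results the job wrote back to S3.
s3.download_file("nextstrain-jobs", "build.zip", "build.zip")
```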

The UX aims to be very similar to that of local builds (either
containerized or native), so the "build" command stays in the foreground
and result files are written back directly to the local build directory.

AWS resources must be pre-configured and these are checked for (at least
at the surface level) by the "check-setup" command.  Documentation (and
possibly a setup script) for the required resources is still to come.

Two notable, independent improvements which are well-scoped and I think
worthwhile to do in the near future are:

  • better secured environment variable forwarding

  • unattended (background) builds whose results can be fetched at any
    later point

Comments on each of these are embedded in the relevant places in the
code.

The ability to overlay specific versions of Nextstrain components (like
--augur for the Docker runner) is another development-focused
improvement desired for the future.  However, it is highly co-dependent
on first implementing the more general case of remote build directories
(e.g.  `nextstrain build git+https://github.com/…`).

The new nextstrain.cli.hostenv module exists to consolidate local host
ambient environment handling for the Docker and AWS Batch runners.
Many boto3 clients for services require a region at instantiation.
Rather than throw NoRegionErrors at the user, fall back to a default
region.  Any region provided externally (via the AWS_DEFAULT_REGION
environment variable or the ~/.aws/config file) is still respected.
Currently unused, but it will be used by the AWS Batch runner in the
next commit and other modules (Docker runner, deploy command) after
that on subsequent branches.
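A minimal sketch of that fallback logic, assuming a hypothetical helper name and default region:

```python
import boto3

DEFAULT_REGION = "us-east-1"  # assumed default; the real module may differ

def client_with_default_region(service: str):
    """
    Create a boto3 client, respecting any externally provided region
    (AWS_DEFAULT_REGION or ~/.aws/config) but falling back to
    DEFAULT_REGION instead of raising NoRegionError.
    """
    region = boto3.Session().region_name  # None if no region is configured
    return boto3.client(service, region_name = region or DEFAULT_REGION)
```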

The interface is intentionally very simple, which seems like a good
place to start.
This will especially make it easier for folks who aren't us to configure
their S3 bucket name persistently.
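For illustration only, persistent configuration along these lines might look like the INI snippet below; the file path, section, and key names are assumptions, not necessarily what the CLI actually uses.

```ini
; Hypothetical ~/.nextstrain/config
[aws-batch]
s3-bucket = my-nextstrain-jobs
```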
@tsibley tsibley requested a review from trvrb September 21, 2018 23:54
The log stream's already been printed to the screen.  Cleaning it up is
nice because it potentially saves log storage costs (if already over the
free threshold).
I just re-read and digested this material myself and initially had
trouble finding it again.  Useful to link to it here in the outline of
security concerns.
@tsibley tsibley changed the title from "Run builds on AWS Batch" to "Run builds on AWS Batch [PR]" on Oct 3, 2018
mypy didn't catch this, presumably because it doesn't know what type the
"container" value will be when it exists.

Type checking doesn't currently pass with these corrected annotations,
and I'll address that in the next commit.
The AWS Batch runner prints all log entries when a job transitions to a
terminal status and the log watcher hasn't started yet.  That handles
the case of a job failing or succeeding very quickly, before the runner
has time to start the log watcher.

If a job is externally canceled (e.g. via the AWS CLI or web console)
before it starts, however, it will never have a log stream associated
with it and requests to fetch a stream will fail.  Handle this edge case
by returning an empty generator from the JobState.log_entries() method.
This allows the runner to be mostly unaware of the details of the log
stream.

The log_watcher() method currently never triggers this edge case because
it is only invoked after the job has started and is guaranteed to have a
log stream.  An assertion is added to catch future missteps, however.
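A minimal sketch of that edge-case handling; the class structure and helper names here are illustrative, not the actual CLI internals:

```python
def fetch_stream(stream):
    """Hypothetical helper: yield entries from a CloudWatch log stream."""
    yield from ()

class JobState:
    def __init__(self, log_stream = None):
        self.log_stream = log_stream  # None until the job starts running

    def log_entries(self):
        """
        Yield this job's log entries, or nothing at all if the job never
        started and therefore has no log stream (e.g. it was canceled
        externally before running).
        """
        if self.log_stream is None:
            return  # a bare return makes this an empty generator
        yield from fetch_stream(self.log_stream)

    def log_watcher(self):
        # Only invoked after the job has started, so a stream must exist.
        assert self.log_stream is not None, "job has no log stream yet"
```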
This is particularly useful if the job was externally canceled or
terminated, as the reason for doing so will be given.  Without the
reason, the job, confusingly, just seems to stop.
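Surfacing the reason might look roughly like this, using the statusReason field that Batch's DescribeJobs API returns (the job id is a hypothetical placeholder):

```python
import boto3

batch = boto3.client("batch")
job = batch.describe_jobs(jobs = ["example-job-id"])["jobs"][0]

# Print the terminal status, plus the reason if Batch recorded one
# (e.g. "stopped by user" for an external cancellation).
reason = job.get("statusReason")
print("Job %s%s" % (job["status"], " (%s)" % reason if reason else ""))
```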
Attempts to do so are a programming error and shouldn't silently do
nothing.
Exits immediately if a second ^C is received.  This convenience feature
is nice because it means jobs won't be orphaned to run for potentially a
long time (until the configured timeout in Batch).

The log watcher thread is daemonized so that the whole program exits
when an exception is raised in the main thread (e.g. the second
KeyboardInterrupt is re-raised).

Diff best viewed with whitespace ignored, as most of the job monitoring
code is indented further but otherwise unchanged.
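A minimal sketch of that control flow, with hypothetical helper names standing in for the real job-monitoring code:

```python
import threading, time

def watch_logs():           # hypothetical: stream log entries to the screen
    ...

def wait_until_stopped():   # hypothetical: poll Batch until a terminal status
    while True:
        time.sleep(5)

def terminate_job():        # hypothetical: cancel the Batch job
    ...

# Daemonized so the whole program exits when the main thread dies
# (e.g. when the second KeyboardInterrupt is re-raised).
watcher = threading.Thread(target = watch_logs, daemon = True)
watcher.start()

try:
    wait_until_stopped()
except KeyboardInterrupt:
    print("Canceling job…")
    terminate_job()
    print("Waiting for job to stop…")
    print("(Press Ctrl-C again if you don't want to wait.)")
    try:
        wait_until_stopped()     # first ^C: still wait for the job to stop
    except KeyboardInterrupt:
        raise                    # second ^C: exit immediately
```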
@trvrb
Member

trvrb commented Oct 30, 2018

@tsibley ---

I'm trying to pick through this now (finally). One thing as I work:

  • I'm sure this is in the commit messages somewhere, but I need a more obvious entry point. I figured out from issue #28 (Run builds on AWS Batch) that I could run this via nextstrain build --aws-batch ., but I still don't know what additional options I can provide. I don't see anything about --aws-batch in nextstrain build --help. Is there some other place I should be looking?

@trvrb
Member

trvrb commented Oct 30, 2018

I'm also getting the following error repeatedly, which I will dig into further but wanted to bring it up now:

Cleaning up logs
Traceback (most recent call last):
  File "/Users/trvrb/.pyenv/versions/3.6.5/bin/nextstrain", line 11, in <module>
    load_entry_point('nextstrain-cli==1.5.0', 'console_scripts', 'nextstrain')()
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/__main__.py", line 10, in main
    return cli.run( argv[1:] )
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/__init__.py", line 51, in run
    return opts.__command__.run(opts)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/command/build.py", line 67, in run
    return runner.run(opts, working_volume = opts.build)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/runner/__init__.py", line 128, in run
    return opts.__runner__.run(opts, argv, working_volume = working_volume)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/runner/aws_batch/__init__.py", line 184, in run
    logs.delete_stream(job.log_stream)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/runner/aws_batch/logs.py", line 46, in delete_stream
    logStreamName = stream)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the DeleteLogStream operation: User: arn:aws:iam::827581582529:user/trvrb is not authorized to perform: logs:DeleteLogStream on resource: arn:aws:logs:us-east-1:827581582529:log-group:/aws/batch/job:log-stream:nextstrain-job/default/7e4b6628-4c87-4911-91f3-435fee389286

I failed to notice this omission earlier, because after adding log
stream cleanup I didn't test again with my unprivileged AWS user.
@tsibley
Member Author

tsibley commented Oct 30, 2018

Thanks for digging in!

Hmm, the nextstrain build --help output mentions AWS Batch support in several places. If it doesn't for you, then I think maybe you weren't running the code on the branch?

AWS Batch support is also documented at https://github.com/nextstrain/cli/blob/aws-batch/doc/aws-batch.md. I mentioned this in a comment on #28, and it's also referenced from the output of nextstrain build --help-all. For your review, it's definitely worth reading the other comments on #28. They're aimed at you, mostly. :-)

I believe the AccessDeniedException you were seeing is a result of the AWS IAM policy not including the DeleteLogStream action. I've fixed that and updated the docs. Thanks for the catch! (I wasn't seeing it myself because I didn't test again with my unprivileged AWS user after adding the log stream cleanup.)

@jameshadfield
Member

jameshadfield commented Oct 31, 2018

I've tested this functionality and can report that it works for me.

I have some suggestions:

(a) The documentation is good; however, a flow diagram of the different components (cli, s3 bucket, batch, logs...), how they interact with each other, and which permissions are used where would be extremely helpful (to someone like me).

(b) Currently, quitting the nextstrain program while a job is running results in the job being canceled and the data deleted from the bucket.

Job now RUNNABLE
^C
Canceling job…
Waiting for job to stop…
(Press Ctrl-C again if you don't want to wait.)
Job FAILED after 1.1 minutes (stopped by user)
Cleaning up S3
deleted: s3://nextstrain-jobs/ddb25a16-62cd-44eb-9021-3d3096971e23.zip

I imagine scenarios where one wants to stop the nextstrain process but keep the job running. Is this possible? How does this work if the internet connection drops?

(c) There are a number of zip files in the nextstrain-jobs bucket, and a number of logs in the /aws/batch/job group (and starting with nextstrain-job). It's confusing to me why these are there, since my submitted job had its data & logs deleted.

(d) Would it be possible to keep logs (on AWS) for a period of time?

(e) Could jobs / S3 data / logs incorporate the username into their name (id?), such that one could more easily locate things? If this functionality could be achieved by searching, please include it in the docs.

@tsibley
Member Author

tsibley commented Oct 31, 2018

Thanks for testing this! The feedback is very useful.

(a) Ah, yes, that'd be good. I've created #34 for this.

(b) I've previously sketched out ideas for supporting background jobs and disconnections, but they're not part of this initial implementation. The basic idea is an --unattended mode for AWS Batch builds with support for resuming/fetching results later using something like nextstrain build --aws-batch --resume job-id …. This could be wired up automatically if a connection error happens.

(c) Orphaned data files and logs are left around if the local nextstrain client throws an unexpected exception. This is by design, to aid with debugging the client. Both the S3 bucket and CloudWatch log group have retention lifecycles defined so that orphaned files/logs are removed automatically after 30 days if not proactively removed by the client first. (A sketch of such a lifecycle rule follows at the end of this list.)

(d) It would be possible. When would this be useful?

(e) Log streams and job ids cannot; their ids are assigned automatically. The build id (or job name) and zip file on S3 could incorporate a username, however. Would that still be useful? I don't really expect folks to be accessing the AWS console in normal use cases.
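For illustration, a 30-day S3 expiration rule like the one mentioned in (c) could be defined with boto3 as below; the bucket name and rule id are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket = "nextstrain-jobs",                  # hypothetical bucket name
    LifecycleConfiguration = {
        "Rules": [{
            "ID": "expire-orphaned-job-files",   # hypothetical rule id
            "Filter": {"Prefix": ""},            # apply to every object
            "Status": "Enabled",
            "Expiration": {"Days": 30},          # delete after 30 days
        }],
    },
)
```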

@jameshadfield
Member

I may not be the target audience for this tool, so my suggestions may be off track. I'd use this to submit jobs to AWS in the "--unattended" mode, perhaps dozens at a time, so I'd quickly lose track of job IDs if I had to record them manually. Something like job-status (which lists submitted jobs and their progress, including those that have finished) and retrieve-finished (which downloads the data) would be useful. Keeping logs around on AWS for ~30 days (and being able to subset by user) would help me see what I'd run, similar to job-status.

@tsibley
Member Author

tsibley commented Oct 31, 2018

Nod. What's currently there are the fundamental bits, but my intent with the --unattended mode would be to tackle your more specific use case, albeit incrementally. Saving job ids and using them automatically later is possible and definitely desirable, but I postponed it for now because there are some technical hurdles to doing it robustly, especially around fetching the results of multiple runs for the same build.

@tsibley tsibley merged commit 05fece6 into master Nov 26, 2018
@tsibley tsibley deleted the aws-batch branch November 26, 2018 20:58