Run builds on AWS Batch [PR] #29

Merged
tsibley merged 14 commits into master from aws-batch on Nov 26, 2018

Conversation

tsibley
Member

@tsibley tsibley commented Sep 21, 2018

See my comments on #28 and the commit messages themselves.

It doesn't need to support arbitrary base modules; just return the last
dotted component.  Advantageously, this also avoids surprising behaviour
around subpackages, e.g. nextstrain.cli.runner.foo.bar would return
"bar" instead of "foo.bar" if foo/__init__.py exists.

I believe this complexity was accidentally held over from code which
_did_ supply a different "base_module" but was rewritten before seeing
the light of day.
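For illustration, the simplified behaviour might boil down to something like this (the function name is hypothetical, not the actual CLI internals):

```python
# A minimal sketch of the "last dotted component" behaviour described above.
def runner_name(module: str) -> str:
    """
    Return the last dotted component of a runner module's name, e.g.
    "nextstrain.cli.runner.foo.bar" → "bar" (not "foo.bar").
    """
    return module.rsplit(".", 1)[-1]
```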
This adds an --aws-batch option to the "build" command which launches
the pathogen build as a job on AWS Batch.  It uploads the build
directory to S3, submits the job, monitors the job status, streams the
job logs, and downloads build results.
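A rough boto3 sketch of that lifecycle (log streaming elided; the bucket, queue, and job definition names below are placeholders, not the CLI's actual configuration):

```python
import time
import boto3

s3    = boto3.client("s3")
batch = boto3.client("batch")

# 1. Upload the zipped build directory to S3.
s3.upload_file("build.zip", "nextstrain-jobs", "build.zip")

# 2. Submit the job to AWS Batch.
job = batch.submit_job(
    jobName       = "nextstrain-job",
    jobQueue      = "nextstrain-job-queue",
    jobDefinition = "nextstrain-job",
)

# 3. Poll the job status until it reaches a terminal state.
while True:
    status = batch.describe_jobs(jobs = [job["jobId"]])["jobs"][0]["status"]
    if status in {"SUCCEEDED", "FAILED"}:
        break
    time.sleep(5)

# 4. Download the results the job wrote back to S3.
s3.download_file("nextstrain-jobs", "build.zip", "build.zip")
```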

The UX aims to be very similar to that of local builds (either
containerized or native), so the "build" command stays in the foreground
and result files are written back directly to the local build directory.

AWS resources must be pre-configured and these are checked for (at least
at the surface level) by the "check-setup" command.  Documentation (and
possibly a setup script) for the required resources is still to come.

Two notable, independent improvements which are well-scoped and I think
worthwhile to do in the near future are:

  • better secured environment variable forwarding

  • unattended (background) builds whose results can be fetched at any
    later point

Comments on each of these are embedded in the relevant places in the
code.

The ability to overlay specific versions of Nextstrain components (like
--augur for the Docker runner) is another development-focused
improvement desired for the future.  However, it is highly co-dependent
on first implementing the more general case of remote build directories
(e.g.  `nextstrain build git+https://github.com/…`).

The new nextstrain.cli.hostenv module exists to consolidate local host
ambient environment handling for the Docker and AWS Batch runners.
Many boto3 clients for services require a region at instantiation.
Rather than throw NoRegionErrors at the user, fall back to a default
region.  Any region provided externally (via the AWS_DEFAULT_REGION
environment variable or the ~/.aws/config file) is still respected.
Currently unused, but it will be used by the AWS Batch runner in the
next commit and other modules (Docker runner, deploy command) after
that on subsequent branches.
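A minimal sketch of that fallback logic, assuming a hypothetical helper name and default region:

```python
import boto3

DEFAULT_REGION = "us-east-1"  # assumed default; the real module may differ

def client_with_default_region(service: str):
    """
    Create a boto3 client, respecting any externally provided region
    (AWS_DEFAULT_REGION or ~/.aws/config) but falling back to
    DEFAULT_REGION instead of raising NoRegionError.
    """
    region = boto3.Session().region_name  # None if no region is configured
    return boto3.client(service, region_name = region or DEFAULT_REGION)
```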

The interface is intentionally very simple, which seems like a good
place to start.
This will especially make it easier for folks who aren't us to configure
their S3 bucket name persistently.
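For illustration only, persistent configuration along these lines might look like the INI snippet below; the file path, section, and key names are assumptions, not necessarily what the CLI actually uses.

```ini
; Hypothetical ~/.nextstrain/config
[aws-batch]
s3-bucket = my-nextstrain-jobs
```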
@tsibley tsibley requested a review from trvrb September 21, 2018 23:54
The log stream's already been printed to the screen.  Cleaning it up is
nice because it potentially saves log storage costs (if already over the
free threshold).
I just re-read and digested this material myself and initially had
trouble finding it again.  Useful to link to it here in the outline of
security concerns.
@tsibley tsibley changed the title from "Run builds on AWS Batch" to "Run builds on AWS Batch [PR]" on Oct 3, 2018
mypy didn't catch this, presumably because it doesn't know what type the
"container" value will be when it exists.

Type checking doesn't currently pass with these corrected annotations,
and I'll address that in the next commit.
The AWS Batch runner prints all log entries when a job transitions to a
terminal status and the log watcher hasn't started yet.  That handles
the case of a job failing or succeeding very quickly, before the runner
has time to start the log watcher.

If a job is externally canceled (e.g. via the AWS CLI or web console)
before it starts, however, it will never have a log stream associated
with it and requests to fetch a stream will fail.  Handle this edge case
by returning an empty generator from the JobState.log_entries() method.
This allows the runner to be mostly unaware of the details of the log
stream.

The log_watcher() method currently never triggers this edge case because
it is only invoked after the job has started and is guaranteed to have a
log stream.  An assertion is added to catch future missteps, however.
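A minimal sketch of that edge-case handling; the class structure and helper names here are illustrative, not the actual CLI internals:

```python
def fetch_stream(stream):
    """Hypothetical helper: yield entries from a CloudWatch log stream."""
    yield from ()

class JobState:
    def __init__(self, log_stream = None):
        self.log_stream = log_stream  # None until the job starts running

    def log_entries(self):
        """
        Yield this job's log entries, or nothing at all if the job never
        started and therefore has no log stream (e.g. it was canceled
        externally before running).
        """
        if self.log_stream is None:
            return  # a bare return makes this an empty generator
        yield from fetch_stream(self.log_stream)

    def log_watcher(self):
        # Only invoked after the job has started, so a stream must exist.
        assert self.log_stream is not None, "job has no log stream yet"
```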
This is particularly useful if the job was externally canceled or
terminated, as the reason for doing so will be given.  Without the
reason, the job, confusingly, just seems to stop.
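Surfacing the reason might look roughly like this, using the statusReason field that Batch's DescribeJobs API returns (the job id is a hypothetical placeholder):

```python
import boto3

batch = boto3.client("batch")
job = batch.describe_jobs(jobs = ["example-job-id"])["jobs"][0]

# Print the terminal status, plus the reason if Batch recorded one
# (e.g. "stopped by user" for an external cancellation).
reason = job.get("statusReason")
print("Job %s%s" % (job["status"], " (%s)" % reason if reason else ""))
```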
Attempts to do so are a programming error and shouldn't silently do
nothing.
Exits immediately if a second ^C is received.  This convenience feature
is nice because it means jobs won't be orphaned to run for potentially a
long time (until the configured timeout in Batch).

The log watcher thread is daemonized so that the whole program exits
when an exception is raised in the main thread (e.g. the second
KeyboardInterrupt is re-raised).

Diff best viewed with whitespace ignored, as most of the job monitoring
code is indented further but otherwise unchanged.
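A minimal sketch of that control flow, with hypothetical helper names standing in for the real job-monitoring code:

```python
import threading, time

def watch_logs():           # hypothetical: stream log entries to the screen
    ...

def wait_until_stopped():   # hypothetical: poll Batch until a terminal status
    while True:
        time.sleep(5)

def terminate_job():        # hypothetical: cancel the Batch job
    ...

# Daemonized so the whole program exits when the main thread dies
# (e.g. when the second KeyboardInterrupt is re-raised).
watcher = threading.Thread(target = watch_logs, daemon = True)
watcher.start()

try:
    wait_until_stopped()
except KeyboardInterrupt:
    print("Canceling job…")
    terminate_job()
    print("Waiting for job to stop…")
    print("(Press Ctrl-C again if you don't want to wait.)")
    try:
        wait_until_stopped()     # first ^C: still wait for the job to stop
    except KeyboardInterrupt:
        raise                    # second ^C: exit immediately
```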
@trvrb
Member

trvrb commented Oct 30, 2018

@tsibley ---

I'm trying to pick through this now (finally). One thing as I work:

  • I'm sure this is in the commit messages somewhere, but I need a more obvious entry point. I figured out from issue #28 (Run builds on AWS Batch) that I could run this via nextstrain build --aws-batch ., but I still don't know what additional options I can provide. I don't see anything about --aws-batch in nextstrain build --help. Is there some other place I should be looking?

@trvrb
Member

trvrb commented Oct 30, 2018

I'm also getting the following error repeatedly, which I will dig into further but wanted to bring it up now:

Cleaning up logs
Traceback (most recent call last):
  File "/Users/trvrb/.pyenv/versions/3.6.5/bin/nextstrain", line 11, in <module>
    load_entry_point('nextstrain-cli==1.5.0', 'console_scripts', 'nextstrain')()
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/__main__.py", line 10, in main
    return cli.run( argv[1:] )
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/__init__.py", line 51, in run
    return opts.__command__.run(opts)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/command/build.py", line 67, in run
    return runner.run(opts, working_volume = opts.build)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/runner/__init__.py", line 128, in run
    return opts.__runner__.run(opts, argv, working_volume = working_volume)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/runner/aws_batch/__init__.py", line 184, in run
    logs.delete_stream(job.log_stream)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/nextstrain/cli/runner/aws_batch/logs.py", line 46, in delete_stream
    logStreamName = stream)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/trvrb/.pyenv/versions/3.6.5/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the DeleteLogStream operation: User: arn:aws:iam::827581582529:user/trvrb is not authorized to perform: logs:DeleteLogStream on resource: arn:aws:logs:us-east-1:827581582529:log-group:/aws/batch/job:log-stream:nextstrain-job/default/7e4b6628-4c87-4911-91f3-435fee389286

I failed to notice this omission earlier, because after adding log
stream cleanup I didn't test again with my unprivileged AWS user.
@tsibley
Member Author

tsibley commented Oct 30, 2018

Thanks for digging in!

Hmm, the nextstrain build --help output mentions AWS Batch support in several places. If it doesn't for you, then I think maybe you weren't running the code on the branch?

AWS Batch support is also documented at https://github.com/nextstrain/cli/blob/aws-batch/doc/aws-batch.md. I mentioned this in a comment on #28, and it's also referenced from the output of nextstrain build --help-all. For your review, it's definitely worth reading the other comments on #28. They're aimed at you, mostly. :-)

I believe the AccessDeniedException you were seeing is a result of the AWS IAM policy not including the DeleteLogStream action. I've fixed that and updated the docs. Thanks for the catch! (I wasn't seeing it myself because I didn't test again with my unprivileged AWS user after adding the log stream cleanup.)

@jameshadfield
Member

jameshadfield commented Oct 31, 2018

I've tested this functionality and can report that it works for me.

I have some suggestions:

(a) The documentation is good; however, a flow diagram of the different components (cli, s3 bucket, batch, logs...), how they interact with each other, and which permissions are used where would be extremely helpful (to someone like me).

(b) Currently, quitting the nextstrain program while a job is running results in the job being canceled and the data deleted from the bucket.

Job now RUNNABLE
^C
Canceling job…
Waiting for job to stop…
(Press Ctrl-C again if you don't want to wait.)
Job FAILED after 1.1 minutes (stopped by user)
Cleaning up S3
deleted: s3://nextstrain-jobs/ddb25a16-62cd-44eb-9021-3d3096971e23.zip

I imagine scenarios where one wants to stop the nextstrain process but keep the job running. Is this possible? How does this work if the internet connection drops?

(c) There are a number of zip files in the nextstrain-jobs bucket, and a number of logs in the /aws/batch/job group (and starting with nextstrain-job). It's confusing to me why these are there, since my submitted job had its data & logs deleted.

(d) Would it be possible to keep logs (on AWS) for a period of time?

(e) Could jobs / S3 data / logs incorporate the username into their name (id?), such that one could more easily locate things? If this functionality could be achieved by searching, please include it in the docs.

@tsibley
Member Author

tsibley commented Oct 31, 2018

Thanks for testing this! The feedback is very useful.

(a) Ah, yes, that'd be good. I've created #34 for this.

(b) I've previously sketched out ideas for supporting background jobs and disconnections, but they're not part of this initial implementation. The basic idea is an --unattended mode for AWS Batch builds with support for resuming/fetching results later using something like nextstrain build --aws-batch --resume job-id …. This could be wired up automatically if a connection error happens.

(c) Orphaned data files and logs are left around if the local nextstrain client throws an unexpected exception. This is by design, to aid with debugging the client. Both the S3 bucket and CloudWatch log group have retention lifecycles defined so that orphaned files/logs are removed automatically after 30 days if not proactively removed by the client first. (A sketch of such a lifecycle rule follows at the end of this list.)

(d) It would be possible. When would this be useful?

(e) Log streams and job ids cannot; their ids are assigned automatically. The build id (or job name) and zip file on S3 could incorporate a username, however. Would that still be useful? I don't really expect folks to be accessing the AWS console in normal use cases.
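For illustration, a 30-day S3 expiration rule like the one mentioned in (c) could be defined with boto3 as below; the bucket name and rule id are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket = "nextstrain-jobs",                  # hypothetical bucket name
    LifecycleConfiguration = {
        "Rules": [{
            "ID": "expire-orphaned-job-files",   # hypothetical rule id
            "Filter": {"Prefix": ""},            # apply to every object
            "Status": "Enabled",
            "Expiration": {"Days": 30},          # delete after 30 days
        }],
    },
)
```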

@jameshadfield
Member

I may not be the target audience for this tool, so my suggestions may be off track. I'd use this to submit jobs to AWS in the "--unattended" mode, perhaps dozens at a time, so I'd quickly lose track of job IDs if I had to record them manually. Something like job-status (which lists submitted jobs and their progress, including those that have finished) and retrieve-finished (which downloads the data) would be useful. Keeping logs around on AWS for ~30 days (and being able to subset by user) would help me see what I'd run, similar to job-status.

@tsibley
Member Author

tsibley commented Oct 31, 2018

Nod. What's currently there are the fundamental bits, but my intent with the --unattended mode would be to tackle your more specific use case, albeit incrementally. Saving job ids and using them automatically later is possible and definitely desirable, but I postponed it for now because there are some technical hurdles to doing it robustly, especially around fetching the results of multiple runs for the same build.

@tsibley tsibley merged commit 05fece6 into master Nov 26, 2018
@tsibley tsibley deleted the aws-batch branch November 26, 2018 20:58