Run builds on AWS Batch [PR] #29
Conversation
It doesn't need to support arbitrary base modules; just return the last dotted component. Advantageously, this also avoids surprising behaviour around subpackages, e.g. nextstrain.cli.runner.foo.bar would return "bar" instead of "foo.bar" if foo/__init__.py exists. I believe this complexity was accidentally held over from code which _did_ supply a different "base_module" but was rewritten before seeing the light of day.
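The simplification can be sketched in a couple of lines (the function name here is hypothetical, not necessarily what the commit uses):

```python
def module_basename(module_name: str) -> str:
    """Return the last dotted component of a module name."""
    return module_name.rsplit(".", 1)[-1]

# The subpackage case mentioned above: "bar", never "foo.bar".
assert module_basename("nextstrain.cli.runner.foo.bar") == "bar"
assert module_basename("toplevel") == "toplevel"
```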
This adds an --aws-batch option to the "build" command which launches the pathogen build as a job on AWS Batch. It uploads the build directory to S3, submits the job, monitors the job status, streams the job logs, and downloads the build results.

The UX aims to be very similar to that of local builds (either containerized or native), so the "build" command stays in the foreground and result files are written back directly to the local build directory.

AWS resources must be pre-configured and are checked for (at least at a surface level) by the "check-setup" command. Documentation (and possibly a setup script) for the required resources is still to come.

Two notable, independent improvements which are well-scoped and, I think, worthwhile to do in the near future are:

• better-secured forwarding of environment variables
• unattended (background) builds whose results can be fetched at any later point

Comments on each of these are embedded in the relevant places in the code.

The ability to overlay specific versions of Nextstrain components (like --augur for the Docker runner) is another development-focused improvement desired for the future. However, it is highly co-dependent on first implementing the more general case of remote build directories (e.g. `nextstrain build git+https://github.com/…`).

The new nextstrain.cli.hostenv module exists to consolidate handling of the local host's ambient environment for the Docker and AWS Batch runners.
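The job lifecycle described above can be sketched as a simple orchestration, with each step injected as a callable. All names here are hypothetical stand-ins for illustration, not the actual implementation:

```python
def run_batch_build(package, upload, submit, watch, download):
    """Run a pathogen build on AWS Batch, mirroring the local-build UX:
    the caller stays in the foreground until results are downloaded."""
    archive = package()        # 1. zip up the local build directory
    remote = upload(archive)   # 2. upload the archive to S3
    job = submit(remote)       # 3. submit the job to AWS Batch
    watch(job)                 # 4. monitor status and stream the logs
    download(job)              # 5. write results back to the build dir
    return job
```

The injected callables make the ordering explicit: nothing is submitted before the upload completes, and results are only fetched after the job reaches a terminal state.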
Many boto3 clients for services require a region at instantiation. Rather than throw NoRegionErrors at the user, fall back to a default region. Any region provided externally (via the AWS_DEFAULT_REGION environment variable or the ~/.aws/config file) is still respected.
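The fallback logic can be sketched roughly like this. The default region value and the helper name are assumptions for illustration, and this sketch omits the ~/.aws/config lookup that boto3 itself performs:

```python
import os

DEFAULT_REGION = "us-east-1"  # assumed fallback; the actual commit may choose differently

def effective_region(configured: str = None) -> str:
    """Resolve the AWS region: an explicitly configured region wins,
    then AWS_DEFAULT_REGION from the environment, then the fallback,
    so boto3 clients never raise NoRegionError at instantiation."""
    return configured or os.environ.get("AWS_DEFAULT_REGION") or DEFAULT_REGION

# e.g. boto3.client("batch", region_name=effective_region())
```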
Currently unused, but it will be used by the AWS Batch runner in the next commit and other modules (Docker runner, deploy command) after that on subsequent branches. The interface is intentionally very simple, which seems like a good place to start.
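A minimal interface along those lines might look like the following; the variable list and the function name are illustrative guesses, not the module's actual contents:

```python
import os

# Ambient variables worth forwarding to a runner; illustrative only.
FORWARDED_NAMES = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
]

def forwarded_environment() -> dict:
    """Return the subset of the local host's environment that should be
    passed along to a build runner, skipping unset variables."""
    return {name: os.environ[name] for name in FORWARDED_NAMES if name in os.environ}
```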
This will especially make it easier for folks who aren't us to configure their S3 bucket name persistently.
The log stream has already been printed to the screen. Cleaning it up is nice because it potentially saves log storage costs (if already over the free threshold).
I just re-read and digested this material myself and initially had trouble finding it again. Useful to link to it here in the outline of security concerns.
mypy didn't catch this, presumably because it doesn't know what type the "container" value will be when it exists. Type checking doesn't currently pass with these corrected annotations, and I'll address that in the next commit.
The AWS Batch runner prints all log entries when a job transitions to a terminal status and the log watcher hasn't started yet. That handles the case of a job failing or succeeding very quickly, before the runner has time to start the log watcher. If a job is externally canceled (e.g. via the AWS CLI or web console) before it starts, however, it will never have a log stream associated with it and requests to fetch a stream will fail. Handle this edge case by returning an empty generator from the JobState.log_entries() method. This allows the runner to be mostly unaware of the details of the log stream. The log_watcher() method currently never triggers this edge case because it is only invoked after the job has started and is guaranteed to have a log stream. An assertion is added to catch future missteps, however.
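The edge case reduces to an early return inside a generator. This is a simplified sketch: the CloudWatch Logs fetching is elided and the attribute names are assumptions:

```python
class JobState:
    """Simplified stand-in for the runner's job-state wrapper."""

    def __init__(self, log_stream=None):
        # log_stream is None when the job was canceled before starting.
        self.log_stream = log_stream

    def log_entries(self):
        if self.log_stream is None:
            # No log stream was ever created; yield nothing instead of
            # letting a fetch request fail.
            return
        # Real code would page through CloudWatch Logs events here.
        yield from self.log_stream

    def log_watcher(self):
        # Only invoked after the job has started, so a stream must exist;
        # the assertion catches future missteps.
        assert self.log_stream is not None
        yield from self.log_entries()
```

Because `log_entries()` contains a `yield`, the `return` in the `None` branch produces an empty generator rather than `None`, so the runner can iterate it unconditionally.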
This is particularly useful if the job was externally canceled or terminated, as the reason for doing so will be given. Without the reason, the job, confusingly, just seems to stop.
Attempts to do so are a programming error and shouldn't silently do nothing.
Exits immediately if a second ^C is received. This convenience feature is nice because it means jobs won't be orphaned to run for potentially a long time (until the configured timeout in Batch). The log watcher thread is daemonized so that the whole program exits when an exception is raised in the main thread (e.g. the second KeyboardInterrupt is re-raised). Diff best viewed with whitespace ignored, as most of the job monitoring code is indented further but otherwise unchanged.
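The interrupt handling can be sketched like this, with the AWS interactions replaced by hypothetical injected callables:

```python
import threading

def monitor(wait_for_terminal, cancel_job, stream_logs):
    """Wait for a Batch job. A first ^C cancels the job but keeps
    waiting for it to stop; a second ^C propagates and exits the
    program immediately. The log watcher is a daemon thread, so it
    won't keep the process alive once the main thread dies."""
    watcher = threading.Thread(target=stream_logs, daemon=True)
    watcher.start()
    try:
        return wait_for_terminal()
    except KeyboardInterrupt:
        print("Canceling job; press Ctrl-C again to exit immediately.")
        cancel_job()
        # A second KeyboardInterrupt raised here is deliberately not
        # caught: it unwinds the main thread, and the daemonized
        # watcher dies with the process.
        return wait_for_terminal()
```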
@tsibley --- I'm trying to pick through this now (finally). One thing as I work:
I'm also getting the following error repeatedly, which I will dig into further but wanted to bring it up now:
I failed to notice this omission earlier, because after adding log stream cleanup I didn't test again with my unprivileged AWS user.
Thanks for digging in! Hmm, AWS Batch support is also documented at https://github.com/nextstrain/cli/blob/aws-batch/doc/aws-batch.md. I mentioned this in a comment on #28, and it's also referenced from the output of […].

I believe the AccessDeniedException you were seeing is a result of the AWS IAM policy not including the DeleteLogStream action. I've fixed that and fixed the docs. Thanks for the catch! (I wasn't seeing it because I didn't test again with my non-privileged AWS user after adding the log stream cleanup.)
I've tested this functionality and can report that it works for me. I have some suggestions:

(a) The documentation is good; however, a flow diagram of the different components (cli, S3 bucket, Batch, logs, …), how they interact with each other, which permissions are used where, etcetera, would be extremely helpful (to someone like me).

(b) Currently quitting the […]
I imagine scenarios where one wants to stop the […]

(c) There are a number of zip files in the […]

(d) Would it be possible to keep logs (on AWS) for a period of time?

(e) Could jobs / S3 data / logs incorporate the username into their name (id?), such that one could more easily locate things? If this functionality could be achieved by searching, please include it in the docs.
Thanks for testing this! The feedback is very useful.

(a) Ah, yes, that'd be good. I've created #34 for this.

(b) I've previously sketched out ideas for supporting background jobs and disconnections, but they're not part of this initial implementation. The basic idea is an […]

(c) Orphaned data files and logs are left around if the local […]

(d) It would be possible. When would this be useful?

(e) Log streams and job ids cannot; their ids are assigned automatically. The build id (or job name) and zip file on S3 could incorporate a username, however. Would that still be useful? I don't really expect folks to be accessing the AWS console in normal use cases.
I may not be the target audience for this tool, so my suggestions may be off track. I'd use this to submit jobs to AWS in the "--unattended" mode, perhaps dozens at a time, so I'd lose track of job IDs if I had to keep track of them manually. Something like […]
Nod. What's currently there are the fundamental bits, but my intent with the […]
See my comments on #28 and the commit messages themselves.