Run builds on AWS Batch [issue] #28

Closed
16 tasks done
tsibley opened this issue Sep 11, 2018 · 7 comments
Member

tsibley commented Sep 11, 2018

This issue tracks a work-in-progress feature that I've been developing recently and am currently polishing up.

Even with the mechanics working, several external items need consideration before this can be called a shippable feature:

  • CLI: Default region? (Region is required by boto3.client("batch").)
  • IAM: policies allowing normal user roles to submit jobs
  • IAM: role/policies for Batch service (is it limited enough?)
  • IAM: role/policies for Batch jobs (is it limited enough?)
  • Compute environment: Put limits on resources used / costs incurred by Batch jobs to prevent runaways
  • Compute environment: Increase available CPU / memory resources
  • S3: Create build context bucket
  • S3: Add retention policy to S3 bucket
  • S3: Bucket ACLs - https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-access-control.html
  • CloudWatch Logs: Add retention policy for Batch jobs
  • CLI: Delete log stream when done
  • Documentation: Creation of Batch compute environment, queue, job definition, S3 bucket, S3 retention policies, CloudWatch logs retention policies, and IAM roles/policies (+ automated tooling for doing this?)
  • Documentation: Describe security environment and assumptions
  • docker-base: Merge aws-batch branch
  • Batch: Switch job definition to use nextstrain/base:latest instead of nextstrain/base:branch-aws-batch
  • CLI: Terminate remote jobs on ^C (or make issue for this)

(The list above is as much for me as anyone else.)
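On the default-region item, here is a minimal sketch of one possible fallback order. It is not what the CLI does; `resolve_region`, `batch_client`, and the `DEFAULT_REGION` value are hypothetical names chosen for illustration. The sketch only looks at an explicit argument and the `AWS_DEFAULT_REGION` environment variable, and it does not consult the AWS config file the way boto3 itself would.

```python
import os

# Hypothetical fallback value; the real default (if any) is an open question above.
DEFAULT_REGION = "us-east-1"


def resolve_region(explicit=None, environ=None):
    """Pick a region: explicit argument, then AWS_DEFAULT_REGION, then a fallback."""
    environ = os.environ if environ is None else environ
    return explicit or environ.get("AWS_DEFAULT_REGION") or DEFAULT_REGION


def batch_client(explicit_region=None):
    """Create a Batch client with a region that is always set."""
    import boto3  # imported lazily so resolve_region() can be used without boto3

    return boto3.client("batch", region_name=resolve_region(explicit_region))
```

Passing `region_name` explicitly avoids the error boto3 raises when no region can be found, at the cost of silently choosing one for the user.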

@tsibley tsibley added the enhancement New feature or request label Sep 11, 2018
@tsibley tsibley self-assigned this Sep 11, 2018
Member Author

tsibley commented Sep 18, 2018

I've just pushed the CLI changes adding support for nextstrain build --aws-batch: https://github.com/nextstrain/cli/compare/aws-batch.

I would appreciate any review of the code and especially of the user interaction/experience. Until I complete some of the ancillary items on the todo list above, it will only work if you have admin access to the lab's AWS account (this means @trvrb and maybe @jameshadfield for now). I will leave another comment when I've arranged wider access so more folks can test if they want, hopefully tomorrow or later this week.

Note that the remote jobs are currently using the nextstrain/base:branch-aws-batch image from the aws-batch branch of docker-base: https://github.com/nextstrain/docker-base/compare/aws-batch. Relevant entrypoint code is there.

Member Author

tsibley commented Sep 19, 2018

Folks in our lab should now be able to try this out more widely.

Member Author

tsibley commented Sep 19, 2018

Re: compute limits: In our current AWS Batch configuration, each job defaults to 2 vCPUs with 4GB of memory and will be terminated if it does not complete in 4 hours. These are adjustable on a per-job basis, but the CLI itself does not change the defaults (though an authorized user could). The Batch compute environment (i.e. the managed pool of EC2 instances) is limited to 256 combined vCPUs. Instances are provisioned automatically, scaling down to zero running instances (no cost) when there are no jobs in the queue.
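For anyone curious what a per-job adjustment could look like, here is a rough sketch of the override arguments that Batch's SubmitJob call accepts. This is not code from the CLI; `job_overrides` is a hypothetical helper, and treating "4GB" as 4096 MiB is my assumption.

```python
def job_overrides(vcpus=2, memory_mib=4096, timeout_hours=4):
    """Build per-job override arguments for AWS Batch's SubmitJob call.

    The defaults mirror the job defaults described above; 4096 MiB is an
    assumed reading of "4GB".
    """
    return {
        "containerOverrides": {
            # Batch expects resource values as strings; MEMORY is in MiB.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
        # The job attempt is terminated once this many seconds have elapsed.
        "timeout": {"attemptDurationSeconds": int(timeout_hours * 3600)},
    }


# Hypothetical usage (job name, queue, and definition are placeholders):
#   boto3.client("batch").submit_job(
#       jobName="example", jobQueue="example-queue",
#       jobDefinition="example-definition", **job_overrides(vcpus=4))
```

Keeping the overrides in one small function makes it easy to see which limits a submission departs from.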

We should keep an eye on Batch usage and Batch-driven costs to make sure this is functioning as we expect. Only @trvrb (or someone else with access to Billing details) can do this.

If we start submitting large jobs, we should consider increasing the default job resources.

Member Author

tsibley commented Sep 19, 2018

Member Author

tsibley commented Sep 20, 2018

Documentation is now at https://github.com/nextstrain/cli/blob/aws-batch/doc/aws-batch.md.

That URL (well, the URL for master not aws-batch) is referenced from the output of nextstrain build --help-all under the AWS Batch section.

Member Author

tsibley commented Sep 25, 2018

I've bumped the default job resources to 8 vCPUs and (just under) 16GiB of memory, which should cost about 34¢/hour on a c5.2xlarge. Combined with my augur PR to auto-scale alignment and tree-building parallelism, this should make larger builds run much quicker.
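A back-of-the-envelope check of what these numbers imply, as a sketch only. It assumes one job per c5.2xlarge at the 34¢/hour figure above, and that the 256 vCPU limit and 4-hour timeout mentioned earlier in this thread still apply after the bump (not confirmed here). Billing granularity and other charges are ignored, and the function names are hypothetical.

```python
def max_concurrent_jobs(pool_vcpus, job_vcpus):
    """How many jobs fit in the compute environment at once."""
    return pool_vcpus // job_vcpus


def cost_cents(jobs, hours, cents_per_instance_hour=34):
    """Instance cost in cents, assuming one job per instance."""
    return jobs * hours * cents_per_instance_hour


jobs = max_concurrent_jobs(256, 8)
print(jobs)                 # 32 jobs at 8 vCPUs each
print(cost_cents(1, 4))     # 136 cents for one job that runs to the 4-hour timeout
print(cost_cents(jobs, 4))  # 4352 cents if all 32 jobs run to the timeout
```

Under those assumptions, a completely full queue where every job hits the timeout costs about $43.52 per 4-hour round.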

@tsibley tsibley changed the title from "Run builds on AWS Batch" to "Run builds on AWS Batch [issue]" Oct 3, 2018
Member Author

tsibley commented Nov 26, 2018

Merged and released as 1.7.0.

@tsibley tsibley closed this as completed Nov 26, 2018