Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS Batch runner support runtime-configurable Docker images. #73

Closed
wants to merge 3 commits into from

Conversation

jstoja
Copy link

@jstoja jstoja commented Apr 7, 2020

Description of proposed changes

To support runtime-configurable Docker image in AWS Batch,
the creation of a Batch Job also created a Job Definition with the
configured image.

Related issue(s)

Fixes #57

Testing

I have to test on my AWS account first, then when I'll confirm it's working, I'd be happy to have someone from the core team to test this in your setup to confirm its working as expected.

As a side note, maybe it'd be helpful to add doctest (https://docs.python.org/3/library/doctest.html), which helps testing small functions, but also documents in a very practical manner the inputs and outputs.

Please let me know everything you think should be changed in this code.

Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jstoja! Thanks so much for working on this issue and submitting this PR! I've identified some changes that I think need to be made, detailed individually below, but this is already great progress. 😁 Let me know if you have questions or alternate suggestions, etc. I'm happy to chat more, and I'm hoping I'll be able to be more timely with feedback going forward.

I'm excited to have this functionality. ✨ It'll be super helpful when testing out new images or using custom images for certain builds.

dest = "aws_batch_image",
help = "Docker image to use for the AWS Job Definition",
metavar = "<name>",
default = docker.DEFAULT_IMAGE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to use the existing --image argument here, so that an invocation of the default Docker runner like:

nextstrain build --image <name> .

can simply be switched to AWS Batch like so:

nextstrain build --image <name> --aws-batch .

This would mean promoting the --image argument from runner.docker.register_arguments() into runner.register_arguments() so it can be shared by all runners that support it, similar to how --exec works. I imagine that runner.native would ignore --image and warn if it was used, but a future runner.singularity might be able to make use of it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, please let me know if the error handling in runner.native is OK for you.

Comment on lines 144 to 147
docker_image_tag = docker_image.split(':')
name = re.sub('[^0-9a-zA-Z-]+', '-', definition_name)
image = re.sub('[^0-9a-zA-Z-]+', '-', docker_image_tag[0])
tag = re.sub('[^0-9a-zA-Z-]+', '-', docker_image_tag[1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docker_image isn't guaranteed to have a colon in it, and thus this could throw a ValueError.

More robust image name splitting logic is implemented in runner.docker.split_image_name() and that would be good to re-use here. That function may want to be moved into nextstrain.cli.utils so that this module doesn't have to import runner.docker.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that's much better indeed!

Comment on lines 168 to 181
batch.register_job_definition(
jobDefinitionName = definition_name,
type = "container",
containerProperties = {
"image": image,
"command": [],
},
retryStrategy = {
"attempts": 1,
},
timeout = {
"attemptDurationSeconds": 14400,
},
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The job definition created here should be based on the existing definition (which is specified by --aws-batch-job, the environment, or config), with just containerProperties.image modified.

The code can fetch the existing job description by calling batch.describe_job_definitions().

How does this code behave if a job definition with the name already exists? Does it throw an error? Does it silently do nothing? Does it create a duplicate job definition?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a TODO for this one and I'll come back later to it. I'll read the documentation thoroughly and do some tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today the job definition nextstrain-job exists. If tomorrow nextstrain-job2 is specified, should it raise an error, try to create one based on this snippet (taken from the documentation), or use the default job nextstrain-job as base?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming a base job definition (one that isn't parameterized with the image name) is present, otherwise I raise an Exception mentioning this.

)
except Exception as error:
warn(error)
raise Exception("Creation of job definition (%s) failed" % definition_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should catch a more specific exception class, likely something from botocore.exceptions. I think there are examples of this in other places of the code.

Error reporting will also be nicer if this except block raises a subclass of NextstrainCliError from the nextstrain.cli.errors module. The existing UserError subclass probably isn't appropriate, so a new subclass like AWSError might be warranted. It would also be ok to use NextstrainCliError directly for now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've caught the most common error as per documentation, if it's something else, it probably shouldn't be caught and go upper as something would be off.
I've also added the new exception, please let me know if you feel something else should be added here.

Julien BORDELLIER added 2 commits May 10, 2020 21:46
To support runtime-configurable Docker image in AWS Batch,
the creation of a Batch Job also created a Job Definition with the
configured image.

Fixes nextstrain#57
* Change --aws-batch-image with --image by promoting the argument to all runners.
* Improve docker image parsing by using split_image_name from cli.utils.
* Added a new cli.errors.AWSError to represent botocore.exceptions that is now caught.
* TODO: Handle job definition creation/duplication.
@jstoja
Copy link
Author

jstoja commented May 10, 2020

Hello @tsibley
Thanks for the awesome feedback, I've reworked the PR as per most of your comments, although, I haven't taken care of the one about the job_definition. I've let a TODO and I'll work on it when I have some time available.
Please let me know if the changes are OK for you.
Thanks

If the base job definition doesn't exist, an Error is thrown to alert the user.
tsibley added a commit that referenced this pull request May 22, 2020
Job definitions are dynamically created as needed based on the base
definition specified by --aws-batch-job.  This is necessary to allow
runtime-configurable images since, unlike CPUs, memory, env vars, etc, a
submitted job may not directly override the container image set in the
job definition.

Resolves #57.  Implementation originally started as PR #73.

Co-authored-by: Julien BORDELLIER <git@julienbordellier.com>
@tsibley
Copy link
Member

tsibley commented May 22, 2020

@jstoja Thanks for reworking this!

I dug into it today and first ran into a couple bugs that prevented it from working (for example, register_job_definition takes keyword args, not a dict). I fixed those, but then ran into several edge cases we hadn't considered around the automatic creation of job definitions.

I ended up rewriting quite a bit and working out all the edge cases using this PR as a starting point. It's pushed up to the aws-batch-images branch and is PR #82. If you wanted to review it, I'd appreciate the extra eyes. :-) I'll likely merge it next week. I marked you as a co-author on the primary commit since it was based on your work. Thank you for your contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWS Batch runner does not support runtime-configurable Docker images
2 participants