Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose Docker build context hash for image tag interpolation. #13959

Merged
merged 7 commits into from Jan 4, 2022

Conversation

kaos
Copy link
Member

@kaos kaos commented Dec 22, 2021

This allows users to tag their Docker images with a stable hash value based on build args, env and input sources.

docker_image(
  image_tags=["1.2.3-{pants.hash}"],
)

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@kaos
Copy link
Member Author

kaos commented Dec 22, 2021

Motivation from at least one user ;) #13934 (comment)

@kaos kaos requested a review from stuhood December 23, 2021 13:59
version_context["PANTS"] = {
# Present hash for all inputs that can be used for image tagging.
"HASH": str(hash((build_args, build_env, snapshot.digest))).replace("-", "0"),
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure if lowercase keys would be better, or lower.UPPER, or not nesting, so it would be "...{pants_hash}..", etc.. 🤷🏽

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opted for lowercasing, so it is {pants.hash}.

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. How do you think this should be documented?

@kaos
Copy link
Member Author

kaos commented Dec 28, 2021

With a growing number of sources for the interpolation context, it's becoming increasingly motivated to maybe create a dedicated section for that, detailing what goes into the context, including how the pants.hash value is calculated. Also as the interpolation context may be used in several fields/scenarios, repeating the context information is useful only in so much detail.

@Eric-Arellano
Copy link
Contributor

Oh wait actually please wait to merge: what do you think about stability guarantees for the hash value? Is this going to slow us done from making changes to this part of the codebase?

@kaos
Copy link
Member Author

kaos commented Dec 29, 2021

Good point. Let's examine:

  • Algorithm. Likely stable. May potentially change between versions of Python (as we rely on the builtin hash function).
  • Input data: build args, env and snapshot digest (i.e. sources). Whatever we do, if any of those change, we want the hash to change.

If we change the implementation so that it affects any of the input data, it will also affect the output image (potentially, at least), and a change of hash value is desirable.

A change of hash value due to change of Python version is not desirable, but I think we can live with that risk (I think it is rather low).

There's also a hash value embedded in one of the test cases, alerting us of unexpected changes to the calculated hash value.

wdyt?

@kaos
Copy link
Member Author

kaos commented Dec 29, 2021

Also, I think this use case is more in the form of optimization.. so it will not be critical to get a hash value "miss", and rebuild because of it.

@kaos
Copy link
Member Author

kaos commented Dec 29, 2021

Perhaps @vputz could fills us in on how this would affect them, in case of unexpected hash value changes?

@Eric-Arellano
Copy link
Contributor

Algorithm. Likely stable. May potentially change between versions of Python (as we rely on the builtin hash function).

Actually, I realize we should be using hashlib rather than hash: https://stackoverflow.com/questions/5583907/is-the-builtin-hash-method-of-python2-6-stable-across-architectures.

If we change the implementation so that it affects any of the input data, it will also affect the output image (potentially, at least), and a change of hash value is desirable.

Okay. I can't really comment to how useful this is, good idea to ask @vputz. I think it will be important to document the stability guarantees so users can decide if they want to rely on this or not.

[ci skip-rust]

[ci skip-build-wheels]
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@@ -0,0 +1,38 @@
# Copyright 2021 Pants project contributors (see CONTRIBUTORS.md).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I recommend having this be in docker/util_rules. It seems unlikely that we'll want this code to be used in other parts of the codebase without careful thought. If we do want to use it in other places, we can move it to this more generic place. For now, avoid "premature generalization".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about in the docker/utils.py file, as it's rather small?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I tend to want to only put files with rules in them in docker/util_rules/.., otherwise I find the name misleading.. ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works! And yeah...we put interpreter_constraints in backend/python/util_rules and it's a lie - there are no rules for it. But at this point it would be disruptive to change.

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, but please wait for feedback from @vputz on stability guarantees.

Also did you have a chance to try this in the real world? Specifically, to make sure the JSON encoder works there and not only in tests.

Thanks for iterating on this!

@kaos
Copy link
Member Author

kaos commented Dec 30, 2021

The encoder is taken from the peek goal. Also, it does always a str as last resort for any unknown data, so "can't" fail ;)
But no, haven't tried it outside tests yet.. however it sits in a well exercised code path.

@kaos
Copy link
Member Author

kaos commented Dec 31, 2021

I'm thinking we may want to state that the hash may change between stable versions of Pants, to not put too restrictive boundaries on what we can do in the future. In cases where we may add to the build args/env in ways we don't do now, for instance.

@Eric-Arellano
Copy link
Contributor

That makes sense to me to retain flexibility.

@vputz
Copy link

vputz commented Jan 3, 2022

Thanks for all the work! This is part of an ongoing question about how to integrate monorepo workflow into CI/CD, with the general question being both "how do I deploy the minimum real changes to a configuration" and "how can I independently version deployable artifacts".

We have so far been doing the "easy but expensive" route of tagging all our deployable containers with the git hash. This works well, but if for example a source file for a single microservice got changed, this would force redeployment of all microservices. Semver is one option, but relies on human vigilance (which is necessarily fallible).

So that's the motivation for a "hash for the entire build context" of a container. And I agree with what I'm seeing on the above: "build args, environment, and [source files]", ie the full docker context once it's built (although this may have uses beyond the docker targets of course).

As noted above, we've just been redeploying EVERYTHING for any git hash change once that branch is marked as approved. So from my perspective and these goals, if the hash changes rarely without that (such as from a python version change) that's really not a problem--we'd still say "the system under test behaves normally, deploy all the changed containers" and just do a few more spurious deployments than are strictly necessary since the container itself may not have changed.

So in short, this sounds pretty good to me; it would let us say "okay, a bunch of things changed in the repo, but it's obvious that only these container artifacts really are different and worth deploying" and that could avoid a lot of unnecessary deployments (although updating our CI/CD system to use this would take a while).

I really like the idea and look forward to trying it!

@Eric-Arellano
Copy link
Contributor

Great, thank you @vputz! I'm comfortable landing this as long as we communicate the limitations to users, like that there are no guarantees for a stable hash across Pants versions.

@kaos kaos deleted the docker_build_hash branch January 4, 2022 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants