Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix docker invalidation bug. #16419

Merged
merged 6 commits into from Aug 5, 2022
Merged

Conversation

benjyw
Copy link
Sponsor Contributor

@benjyw benjyw commented Aug 4, 2022

Imagine a "child" image that depends on a "parent" image
(via a FROM command).

Previously, the child image building Process had no mention
of the parent image's unique ID. So, even though we would re-run
the parent building process every time (because it was set
to PER_SESSION scope in #13464), Pants would not know that the
child building process depended on the parent building process
completing first (and creating the new parent image as a side effect).

Now we write a dummy file containing the ids of all upstream images
into the context in which a child image build is executed. This
causes the child building process to have the correct causal chain.

Note that we still need the PER_SESSION scope because we have
to re-run all image build processes even when there were no changes at
all, in case the user did a docker image rm.

This highlights the complications and dangers of interacting with
a stateful external daemon like docker.

Fixes #16101

[ci skip-rust]

[ci skip-build-wheels]

Assume a "child" image that depends on a "parent" image
(via a FROM command).

Previously, the child image building Process had no mention
of the parent image's unique ID. So, even though we would re-run
the parent building process every time (because it was set
to PER_SESSION scope in pantsbuild#13464), Pants would not know that the
child building process depended on the parent building process
completing first (and creating the new parent image as a side effect).

Now we write a dummy file containing the ids of all upstream images
into the context in which a child image build is executed. This
causes the child building process to invalidate correctly.

Note that we still need the PER_SESSION scope because we have
to re-run all image build processes even when there were no changes at
all, in case the user did a `docker image rm`.

This highlights the complications and dangers of interacting with
a stateful external daemon like docker.

[ci skip-rust]

[ci skip-build-wheels]
@benjyw benjyw added the category:bugfix Bug fixes for released features label Aug 4, 2022
@benjyw benjyw requested review from stuhood and kaos August 4, 2022 22:19
@benjyw benjyw added this to the 2.13.x milestone Aug 4, 2022
@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Aug 4, 2022

In the long run it would be nice if Process had a field for "strings that should invalidate the process", i.e., that participate in the cache key but don't affect process execution.

For now I emulate this by writing a dummy file into the sandbox. The downside of this is that this file could potentially be copied into the image, if there was a COPY that copied the entire context (not that there's harm in this, it's just weird).

An alternative hack might be to put this in an env var, although that is less obvious when debugging in the sandbox.

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for figuring this out!

@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Aug 4, 2022

May fix #16356

@thejcannon
Copy link
Member

thejcannon commented Aug 4, 2022

If there's a COPY copying the entire context they are also copying the dockerfile, among other things 🙈

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@stuhood
Copy link
Sponsor Member

stuhood commented Aug 5, 2022

Thanks a lot!

An alternative hack might be to put this in an env var, although that is less obvious when debugging in the sandbox.

Yea would recommend the env var. Since nothing consumes it, it shouldn't impact the efficacy of debugging.

@benjyw benjyw requested a review from thejcannon August 5, 2022 13:47
@thejcannon
Copy link
Member

I'm off tower for a few days, so don't wait on me. I trust that you've tested the use cases with the code, and the approach is sound to me.

@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Aug 5, 2022

Opened #16422 for the general issue of invalidating Processes based on out-of-band data.

@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Aug 5, 2022

Thanks a lot!

An alternative hack might be to put this in an env var, although that is less obvious when debugging in the sandbox.

Yea would recommend the env var. Since nothing consumes it, it shouldn't impact the efficacy of debugging.

Well, my thinking was that it would be more obvious that the sandbox has been "tampered with" with the file. E.g., when trying to debug why something was invalidated.

@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Aug 5, 2022

But then if we implement #16422 it will be even less obvious, and yet that would still be better than this hack, so I guess this is fine. Will switch to env vars. Theoretically env var values are capped in length, but that length is 128K on linux and unknown but large on macOS (and 32K even on Windows) so not a problem in practice.

[ci skip-rust]

[ci skip-build-wheels]
@thejcannon
Copy link
Member

Not that env var length is an issue (as you say) but we could always sha256 hash the input (further making it like a poor man's solution to #16422)

[ci skip-rust]

[ci skip-build-wheels]
@Eric-Arellano
Copy link
Contributor

Eric-Arellano commented Aug 5, 2022

Note that we use env var for Go cache invalidation like this. It seems preferable to use an env var rather than having to hit LMDB store to create a digest & merge it?

Update: I see you changed to env var. Cool.

@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Aug 5, 2022

Switched to env var. PTAL.

@benjyw benjyw merged commit 565bf90 into pantsbuild:main Aug 5, 2022
@benjyw benjyw deleted the fix_docker_invalidation branch August 5, 2022 21:56
benjyw added a commit to benjyw/pants that referenced this pull request Aug 6, 2022
Assume a "child" image that depends on a "parent" image
(via a FROM command).

Previously, the child image building Process had no mention
of the parent image's unique ID. So, even though we would re-run
the parent building process every time (because it was set
to PER_SESSION scope in pantsbuild#13464), Pants would not know that the
child building process depended on the parent building process
completing first (and creating the new parent image as a side effect).

Now we write a dummy file containing the ids of all upstream images
into the context in which a child image build is executed. This
causes the child building process to invalidate correctly.

Note that we still need the PER_SESSION scope because we have
to re-run all image build processes even when there were no changes at
all, in case the user did a `docker image rm`.

This highlights the complications and dangers of interacting with
a stateful external daemon like docker.

[ci skip-rust]

[ci skip-build-wheels]
benjyw added a commit that referenced this pull request Aug 8, 2022
Assume a "child" image that depends on a "parent" image
(via a FROM command).

Previously, the child image building Process had no mention
of the parent image's unique ID. So, even though we would re-run
the parent building process every time (because it was set
to PER_SESSION scope in #13464), Pants would not know that the
child building process depended on the parent building process
completing first (and creating the new parent image as a side effect).

Now we write a dummy file containing the ids of all upstream images
into the context in which a child image build is executed. This
causes the child building process to invalidate correctly.

Note that we still need the PER_SESSION scope because we have
to re-run all image build processes even when there were no changes at
all, in case the user did a `docker image rm`.

This highlights the complications and dangers of interacting with
a stateful external daemon like docker.

[ci skip-rust]

[ci skip-build-wheels]
benjyw added a commit to benjyw/pants that referenced this pull request Aug 11, 2022
Assume a "child" image that depends on a "parent" image
(via a FROM command).

Previously, the child image building Process had no mention
of the parent image's unique ID. So, even though we would re-run
the parent building process every time (because it was set
to PER_SESSION scope in pantsbuild#13464), Pants would not know that the
child building process depended on the parent building process
completing first (and creating the new parent image as a side effect).

Now we write a dummy file containing the ids of all upstream images
into the context in which a child image build is executed. This
causes the child building process to invalidate correctly.

Note that we still need the PER_SESSION scope because we have
to re-run all image build processes even when there were no changes at
all, in case the user did a `docker image rm`.

This highlights the complications and dangers of interacting with
a stateful external daemon like docker.

[ci skip-rust]

[ci skip-build-wheels]
benjyw added a commit that referenced this pull request Aug 11, 2022
Assume a "child" image that depends on a "parent" image
(via a FROM command).

Previously, the child image building Process had no mention
of the parent image's unique ID. So, even though we would re-run
the parent building process every time (because it was set
to PER_SESSION scope in #13464), Pants would not know that the
child building process depended on the parent building process
completing first (and creating the new parent image as a side effect).

Now we write a dummy file containing the ids of all upstream images
into the context in which a child image build is executed. This
causes the child building process to invalidate correctly.

Note that we still need the PER_SESSION scope because we have
to re-run all image build processes even when there were no changes at
all, in case the user did a `docker image rm`.

This highlights the complications and dangers of interacting with
a stateful external daemon like docker.

[ci skip-rust]

[ci skip-build-wheels]
@stuhood
Copy link
Sponsor Member

stuhood commented Aug 15, 2022

Previously, the child image building Process had no mention
of the parent image's unique ID. So, even though we would re-run
the parent building process every time (because it was set
to PER_SESSION scope in #13464), Pants would not know that the
child building process depended on the parent building process
completing first (and creating the new parent image as a side effect).

AFAICT, this is not about the cache key at all: rather, about ordering. Process 1 "depends on" process 2, but that fact is completely out of band. The process is already marked PER_SESSION, so it is not necessary to actually invalidate its cache key.

It seems like the effect of this change might instead be that we now finish running process 1 to get its id before we begin running process 2. But I can't quite tell from the change how the ids are propagated between images. Note that another way to do this is recursively: building image 1 recursively depends on building image 2, etc.

@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Aug 16, 2022

Yes, the paragraph you quote doesn't mention the cache key. This is exactly about ordering. I think that paragraph pretty much states the same as yours?

Possibly recursing is a better way to do this though.

@stuhood
Copy link
Sponsor Member

stuhood commented Aug 16, 2022

Yes, the paragraph you quote doesn't mention the cache key. This is exactly about ordering. I think that paragraph pretty much states the same as yours?

Yea, sorry... I didn't quote a relevant part. But the discussion of the cache key on here and in #16422 is slightly confusing. The Process itself doesn't need to actually have any reference to the previous process, because it's not actually necessary for this information to be in the cache key. It's only a new data dependency that needs to be introduced.

My comment has more to do with not seeing how you introduced a new data dependency here: I don't see anything that would cause the process to wait for the upstream_image_ids to have been produced. And we agree that putting them as strings in the cache key isn't sufficient.

@benjyw
Copy link
Sponsor Contributor Author

benjyw commented Aug 16, 2022

On closer inspection, to remind myself of the moving parts here, this is already implemented via recursion (and was before). See here. But that alone is not enough because of speculation - we speculate on the parent image and the child image concurrently, and there was nothing to cause the speculative child image build to be canceled and re-run.

The data dependency on the parent image's id is introduced via the DockerBuildContext. When the parent image completes, the child image's inputs will have changed, and that will cancel the speculative child image build and re-run it. That causes I had added some detail in the issue.

At least, that is my understanding of all this...

@stuhood
Copy link
Sponsor Member

stuhood commented Aug 16, 2022

But that alone is not enough because of speculation - we speculate on the parent image and the child image concurrently, and there was nothing to cause the speculative child image build to be canceled and re-run.

Aha. Yea, that makes more sense.

I don't have the entire picture in my mind (maybe 85%?), but DockerBuildContext needing its identity to change when upstream_image_ids does makes sense.

Copy link
Member

@kaos kaos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, speculative process execution. Yea I didn't consider that, hah. Thx for addressing it!

cczona pushed a commit to cczona/pants that referenced this pull request Sep 1, 2022
Assume a "child" image that depends on a "parent" image
(via a FROM command).

Previously, the child image building Process had no mention
of the parent image's unique ID. So, even though we would re-run
the parent building process every time (because it was set
to PER_SESSION scope in pantsbuild#13464), Pants would not know that the
child building process depended on the parent building process
completing first (and creating the new parent image as a side effect).

Now we write a dummy file containing the ids of all upstream images
into the context in which a child image build is executed. This
causes the child building process to invalidate correctly.

Note that we still need the PER_SESSION scope because we have
to re-run all image build processes even when there were no changes at
all, in case the user did a `docker image rm`.

This highlights the complications and dangers of interacting with
a stateful external daemon like docker.

[ci skip-rust]

[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category:bugfix Bug fixes for released features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

./pants package Rebuilding Issue
5 participants