Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache chart layers without timestamp #8216

Closed

Conversation

pmengelbert
Copy link
Contributor

@pmengelbert pmengelbert commented May 28, 2020

What this PR does / why we need it:
This commit allows for the creation of a tar.gz archive of a chart
without a timestamp. Doing so ensures that charts with identical content
will have identical digests, which is important for storage in a
registry. Effort was made to ensure the functionality of helm package
was not affected.

Additionally, a unit test for the registry cache's StoreReference
function, to help track this bug.

closes #8212

Special notes for your reviewer:

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

@helm-bot helm-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 28, 2020
@pmengelbert pmengelbert force-pushed the fix-digest-helm-chart-save branch 3 times, most recently from d94a092 to 191fb64 Compare May 28, 2020 15:17
pmengelbert added a commit to bloodorangeio/helm that referenced this pull request Jun 3, 2020
Currently, whenever the chart is printed, the digest of the .tar.gz
content layer is printed as the digest. The manifest digest is important
for OCI purposes, particularly in pushing to a registry.

These changes build on top of PR helm#8216, which is more critical and
should be merged first.

Resolves helm#8248.

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@pmengelbert pmengelbert mentioned this pull request Jun 3, 2020
3 tasks
@bacongobbler bacongobbler added this to the 3.3.0 milestone Jun 5, 2020
This commit allows for the creation of a tar.gz archive of a chart
without a timestamp. Doing so ensures that charts with identical content
will have identical digests, which is important for storage in a
registry. Effort was made to ensure the functionality of `helm package`
was not affected.

Additionally, a unit test for the registry cache's StoreReference
function, to help track this bug.

Resolves helm#8212

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 5, 2020
@pmengelbert
Copy link
Contributor Author

@bacongobbler This PR has been updated to reflect the changes you requested in #8249 . Please let me know if anything further is needed here.

pmengelbert added a commit to bloodorangeio/helm that referenced this pull request Jun 5, 2020
Currently, whenever the chart is printed, the digest of the .tar.gz
content layer is printed as the digest. The manifest digest is important
for OCI purposes, particularly in pushing to a registry.

These changes build on top of PR helm#8216, which is more critical and
should be merged first.

Resolves helm#8248.

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
pmengelbert added a commit to bloodorangeio/helm that referenced this pull request Jun 24, 2020
Currently, whenever the chart is printed, the digest of the .tar.gz
content layer is printed as the digest. The manifest digest is important
for OCI purposes, particularly in pushing to a registry.

These changes build on top of PR helm#8216, which is more critical and
should be merged first.

Resolves helm#8248.

Signed-off-by: Peter Engelbert <pmengelbert@gmail.com>
@pmengelbert
Copy link
Contributor Author

pmengelbert commented Jun 24, 2020

@bacongobbler again updated based on comments made on PR #8249

@@ -288,7 +288,7 @@ func (cache *Cache) saveChartConfig(ch *chart.Chart) (*ocispec.Descriptor, bool,
func (cache *Cache) saveChartContentLayer(ch *chart.Chart) (*ocispec.Descriptor, bool, error) {
destDir := filepath.Join(cache.rootDir, ".build")
os.MkdirAll(destDir, 0755)
tmpFile, err := chartutil.Save(ch, destDir)
tmpFile, err := chartutil.SaveWithOpts(ch, destDir, chartutil.WithoutTimestamp())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be changed for both helm package and helm dep up as well? Why should this be a special case just for helm chart save?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name, err := chartutil.Save(ch, dest)

_, err = chartutil.Save(ch, destPath)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bacongobbler I can't speak for helm dep up, but as for helm package, the timestamp was deliberately introduced in Helm v3.0.3 (specifically PR #4161) to resolve issue #4158.

Let me know your thoughts on this -- I did not want to break someone else's fix, but am happy to do as you ask.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, so coming back to the original question... Why should we make a special case for helm chart save? If users expect that their files are timestamped correctly within the package, why should helm chart save be any different?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I ask this is because we have been talking about using OCI as the standard protocol for distributing charts. The changes we make here are likely going to be used by community members in future Helm releases.

If #4158 was opened for helm package and helm dep up, why wouldn't someone from the community open a similar issue against helm chart save?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I ask this is because we have been talking about using OCI as the standard protocol for distributing charts. The changes we make here are likely going to be used by community members in future Helm releases.

If #4158 was opened for helm package and helm dep up, why wouldn't someone from the community open a similar issue against helm chart save?

@bacongobbler The OCI specification is such that the digest of each layer (or blob) is determined by its content. The digests of the individual layers "bubble up" to affect the digest of the manifest, as well. This is used for registry-side deduplication of both blobs and manifests, as well as identification. If two charts contain the exact same content but a different digest (due to the timestamp), it causes OCI deduplication and identification efforts to fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that only an issue if you're rebuilding the chart from source via helm chart save? If you're fetching and pushing the same content across multiple registries via helm chart pull and helm chart push, the digest remains the same.

The same argument could be made for helm package and chart provenance as well. If you rebuild the package from source, the chart's signature will fail verification because the digest changed. I would argue that's intentional behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UX of helm chart save is designed to mimic that of docker build.

Here's an example Dockerfile and index.html:

# Dockerfile
FROM scratch
ADD index.html /index.html
<!-- index.html -->
<html><h1>Hello World</h1></html>

Now we run docker build:

$ docker build .
Sending build context to Docker daemon  3.072kB
Step 1/2 : FROM scratch
 --->
Step 2/2 : ADD index.html /index.html
 ---> 49c8907c89fa
Successfully built 49c8907c89fa

Now wait 1 minute, and run the command again without modifying either file:

$ docker build .
Sending build context to Docker daemon  3.072kB
Step 1/2 : FROM scratch
 --->
Step 2/2 : ADD index.html /index.html
 ---> Using cache
 ---> 49c8907c89fa
Successfully built 49c8907c89fa

Notice that the resulting image built has the same digest (49c8907c89fa).

Now in the context of Helm - why, if we have not made any modification to any files within a chart directory, should new digests be generated dependent on the time which they are saved?

If you're fetching and pushing the same content across multiple registries via helm chart pull and helm chart push, the digest remains the same

This is correct, however it fails if I decide to republish/mirror these charts into my own private registry. There needs to be some indication that 2 charts across separate registries are indeed identical, and the digest is what we are using to determine that.

There may be a way around this if we decide to decouple helm chart save from the tar-ing process (i.e. first run helm package mychart/ then helm chart save mychart-1.0.0.tgz, using the raw .tgz for the layer), but I do not think this is necessarily a user-friendly approach.

I dont know the correct answer, but I do see this PR as fixing a registry-related regression introduced in 3.0.3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why, if we have not made any modification to any files within a chart directory, should new digests be generated dependent on the time which they are saved?

I'm not arguing whether the digests are generated based on the time that helm package is called. I am arguing whether the digests are generated based on the time the files have been modified. There's a slight nuance in behaviour here.

#4158 was opened because the user expected the timestamps of their files to be preserved during a helm package similar to tar -c.

In regards to docker build's behaviour... That's another matter entirely. They cache the underlying build context. If the files did not change in your local directory since the last build, docker build re-uses the same build context.

Taking your same example Dockerfile here, but instead let's add the --no-cache flag to demonstrate:

><> cat Dockerfile 
# Dockerfile
FROM scratch
ADD index.html /index.html
><> cat index.html 
<!-- index.html -->
<html><h1>Hello World</h1></html>
><> docker build .
Sending build context to Docker daemon  3.072kB
Step 1/2 : FROM scratch
 ---> 
Step 2/2 : ADD index.html /index.html
 ---> Using cache
 ---> 1eb4e4f55090
Successfully built 1eb4e4f55090
><> docker build --no-cache .
Sending build context to Docker daemon  3.072kB
Step 1/2 : FROM scratch
 ---> 
Step 2/2 : ADD index.html /index.html
 ---> 6b92fac2d0a8
Successfully built 6b92fac2d0a8
><>

Notice that the resulting digest differed from the first call to docker build. This is because docker build checks the mtime of the build context to compute the build context, and Docker re-computed the package digest based on that timestamp.

In other words, if you invalidate the cache (the build context which docker build runs from) and call docker build using the exact same files, it will result in a new digest. Because the mtime changed.

If docker didn't care about the mtime of the build context, the resulting digest would've been the same for me as it would be for you.

That is why I'm against this change - every other packaging solution handles this in the same manner. tar does this. docker build does this. helm package does this. mtime is preserved from the files into the package. Why can't helm chart save do the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to have helm package produce bitwise-deterministic packages, then I'd suggest attempting to fix #3612. Stripping timedata from the underlying files isn't what's causing this issue.

As reported in that issue's first comment, helm package will produce the same SHA 70% of the time. The other 30% are due to files being in another order within the tar package.

This was also clearly pointed out earlier in #3612 (comment).

Copy link
Member

@bacongobbler bacongobbler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested this with the test case provided in #3612: https://github.com/cmaher/nondeterministic-charts

$ export HELM_EXPERIMENTAL_OCI=1
$ ./helm version
version.BuildInfo{Version:"v3.2+unreleased", GitCommit:"d862cd9843663fc9cd621fcbb2ad8cdef90e2b1e", GitTreeState:"clean", GoVersion:"go1.14.4"}
$ cd parent
$ helm dep up
$ for i in {1..1000}; do ../helm chart save . helm-test-8216:$i; done

Here's the result.

$ helm chart list | awk '{ print $4 }' | sort | uniq
28be032
71c690a
95b90db
DIGEST

Please have a look at the assertions being made in #3612. Stripping the timestamp alone will not produce bitwise-deterministic package checksums.

This PR does not solve the problem this PR is attempting to fix. Please try out the suggestions made in #3612 to see if it will result in bitwise-deterministic packages. Based on some of the comments, I believe that fixing it in this way will get you the results you're trying to achieve.

@bacongobbler
Copy link
Member

removing from the milestone.

@bacongobbler bacongobbler removed this from the 3.3.0 milestone Jul 2, 2020
@bacongobbler
Copy link
Member

closing as we're planning to remove the OCI cache, working around the need for this as users will be pushing packages built with helm package. If there are further issues to address, please bring those up in #3612. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Same chart saved at different times has different digest
4 participants