Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix gzip decoding of HTTP sources. #3788

Merged
merged 1 commit into from
Apr 13, 2023

Conversation

sipsma
Copy link
Collaborator

@sipsma sipsma commented Apr 11, 2023

The go http library will normally implicitly set an Accept-Encoding of gzip to requests and then transparently decompress the response body. However, if the Accept-Encoding header is explicitly set by the caller it will no longer transparently decompress the body. This was causing HTTP LLB sources to have unexpectedly compressed contents.

The fix here just unsets the Accept-Encoding header after the HEAD request. Another possible fix would be to do our own gzip decompression after reading the body if it was gzipped, but this approach seemed slightly simpler.

cc @tonistiigi this fixes some ephemeral test failures we are getting in Dagger

The go http library will normally implicitly set an Accept-Encoding of
gzip to requests and then transparently decompress the response body.
However, if the Accept-Encoding header is explicitly set by the caller
it will no longer transparently decompress the body. This was causing
HTTP LLB sources to have unexpectedly compressed contents.

The fix here just unsets the Accept-Encoding header after the HEAD
request. Another possible fix would be to do our own gzip decompression
after reading the body if it was gzipped, but this approach seemed
slightly simpler.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma requested a review from tonistiigi April 11, 2023 01:12
sipsma added a commit to sipsma/dagger that referenced this pull request Apr 11, 2023
Buildkit internally stores some cache metadata of etags and http
checksums using an id based on this name, so setting it to the URL
maximizes our chances of following more optimized cache codepaths.

The codepaths in Buildkit are here:
1. A hash is used to lookup any possible etag/url-hash metadata from
   previous http sources:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134

2. That hash is based in part on this getFileName function:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91

3. getFileName will default to just using the llb.Filename if explicitly
   set:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421

In theory this behavior should have just been unoptimal before this
commit, the behavior was still correct. However, this did end up
triggering a different buildkit bug where the GET request made after
an unsuccessful attempt at matching based on previous etag metadata was
accidentally not uncompressing gzip responses, fixed here:
moby/buildkit#3788

Signed-off-by: Erik Sipsma <erik@dagger.io>
sipsma added a commit to sipsma/dagger that referenced this pull request Apr 11, 2023
Buildkit internally stores some cache metadata of etags and http
checksums using an id based on this name, so setting it to the URL
maximizes our chances of following more optimized cache codepaths.

The codepaths in Buildkit are here:
1. A hash is used to lookup any possible etag/url-hash metadata from
   previous http sources:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134

2. That hash is based in part on this getFileName function:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91

3. getFileName will default to just using the llb.Filename if explicitly
   set:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421

In theory this behavior should have just been unoptimal before this
commit, the behavior was still correct. However, this did end up
triggering a different buildkit bug where the GET request made after
an unsuccessful attempt at matching based on previous etag metadata was
accidentally not uncompressing gzip responses, fixed here:
moby/buildkit#3788

Signed-off-by: Erik Sipsma <erik@dagger.io>
sipsma added a commit to jlongtine/dagger that referenced this pull request Apr 11, 2023
Buildkit internally stores some cache metadata of etags and http
checksums using an id based on this name, so setting it to the URL
maximizes our chances of following more optimized cache codepaths.

The codepaths in Buildkit are here:
1. A hash is used to lookup any possible etag/url-hash metadata from
   previous http sources:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134

2. That hash is based in part on this getFileName function:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91

3. getFileName will default to just using the llb.Filename if explicitly
   set:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421

In theory this behavior should have just been unoptimal before this
commit, the behavior was still correct. However, this did end up
triggering a different buildkit bug where the GET request made after
an unsuccessful attempt at matching based on previous etag metadata was
accidentally not uncompressing gzip responses, fixed here:
moby/buildkit#3788

Signed-off-by: Erik Sipsma <erik@dagger.io>
sipsma added a commit to sipsma/dagger that referenced this pull request Apr 11, 2023
Buildkit internally stores some cache metadata of etags and http
checksums using an id based on this name, so setting it to the URL
maximizes our chances of following more optimized cache codepaths.

The codepaths in Buildkit are here:
1. A hash is used to lookup any possible etag/url-hash metadata from
   previous http sources:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134

2. That hash is based in part on this getFileName function:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91

3. getFileName will default to just using the llb.Filename if explicitly
   set:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421

In theory this behavior should have just been unoptimal before this
commit, the behavior was still correct. However, this did end up
triggering a different buildkit bug where the GET request made after
an unsuccessful attempt at matching based on previous etag metadata was
accidentally not uncompressing gzip responses, fixed here:
moby/buildkit#3788

Signed-off-by: Erik Sipsma <erik@dagger.io>
sipsma added a commit to jlongtine/dagger that referenced this pull request Apr 11, 2023
Buildkit internally stores some cache metadata of etags and http
checksums using an id based on this name, so setting it to the URL
maximizes our chances of following more optimized cache codepaths.

The codepaths in Buildkit are here:
1. A hash is used to lookup any possible etag/url-hash metadata from
   previous http sources:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134

2. That hash is based in part on this getFileName function:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91

3. getFileName will default to just using the llb.Filename if explicitly
   set:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421

In theory this behavior should have just been unoptimal before this
commit, the behavior was still correct. However, this did end up
triggering a different buildkit bug where the GET request made after
an unsuccessful attempt at matching based on previous etag metadata was
accidentally not uncompressing gzip responses, fixed here:
moby/buildkit#3788

Signed-off-by: Erik Sipsma <erik@dagger.io>
@@ -2196,6 +2197,49 @@ func testBuildHTTPSource(t *testing.T, sb integration.Sandbox) {
require.Equal(t, http.MethodHead, allReqs[1].Method)
require.Equal(t, "gzip", allReqs[1].Header.Get("Accept-Encoding"))

require.NoError(t, os.RemoveAll(filepath.Join(tmpdir, "foo")))

// update the content at the url to be gzipped now, the final output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be tested with llb.HTTP(, llb.Checksum()). Was this the case that failed, or is this different? I'm surprised that there isn't a case already that checks this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem happens when there is a cache metadata match on the URL hash but then it turns out the actual URL (as checked with HEAD) does not have the same etag as what was in cache. That triggers the codepath where HEAD gets the explicitly set Accept-Encoding (fine by itself) which is then retained for the subsequent GET and causes the file to not be transparently decompressed by the go stdlib (the actual bug fixed here)

Wrote the test like this because it seemed like a realistic scenario (URL changes contents), but the reason we hit it in Dagger is slightly different. We were always setting llb.Filename to the same value for every llb.HTTP source, (due to the fact that it always gets mounted/copied to a different name once utilized by users, so the name was internal-only and didn't really matter).

We didn't realize this meant that unless the user set non-default uid/gid/perms, the url hash would always be the same because the filename is used there instead of the url.

So, essentially every single one of our HTTP sources ended up with the same url hash, which meant they all got the same match here. This meant that after the first HTTP source was used, we'd almost always trigger the codepath where we call HEAD, it doesn't match the cached etag, we call GET, that request doesn't get decompressed if it's gzipped.

I'm also fixing Dagger to use a hash of the URL as the filename instead so we can have more optimal behavior here, but that's independent of the bug w/ not decompressing the response, which can also happen in fairly straightforward scenarios like the test case here where the URL just changes contents.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

causes the file to not be transparently decompressed by the go stdlib (the actual bug fixed here)

But doesn't this mean that the actual content for llb.HTTP() is different? And the best way to check for different content would be to add an expected checksum with llb.Checksum(). I'm not against your current test as well, but I think checksum should be checked as well so tests better protect against this class of issues.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looked into this, the problem is that if you set Checksum then most of the codepaths relevant here are skipped:

if hs.src.Checksum != "" {
hs.cacheKey = hs.src.Checksum
return hs.formatCacheKey(getFileName(hs.src.URL, hs.src.Filename, nil), hs.src.Checksum, "").String(), hs.src.Checksum.String(), nil, true, nil
}

The existing test asserts on the content by exporting the result and looking at the contents of the file there.

sipsma added a commit to dagger/dagger that referenced this pull request Apr 11, 2023
* Set HTTP source filename to be URL.

Buildkit internally stores some cache metadata of etags and http
checksums using an id based on this name, so setting it to the URL
maximizes our chances of following more optimized cache codepaths.

The codepaths in Buildkit are here:
1. A hash is used to lookup any possible etag/url-hash metadata from
   previous http sources:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134

2. That hash is based in part on this getFileName function:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91

3. getFileName will default to just using the llb.Filename if explicitly
   set:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421

In theory this behavior should have just been unoptimal before this
commit, the behavior was still correct. However, this did end up
triggering a different buildkit bug where the GET request made after
an unsuccessful attempt at matching based on previous etag metadata was
accidentally not uncompressing gzip responses, fixed here:
moby/buildkit#3788

Signed-off-by: Erik Sipsma <erik@dagger.io>

* Use hash instead of base64 to prevent hitting max filename limits.

Signed-off-by: Erik Sipsma <erik@dagger.io>

---------

Signed-off-by: Erik Sipsma <erik@dagger.io>
@tonistiigi tonistiigi merged commit 045c781 into moby:master Apr 13, 2023
55 checks passed
jlongtine pushed a commit to jlongtine/dagger that referenced this pull request Apr 13, 2023
Buildkit internally stores some cache metadata of etags and http
checksums using an id based on this name, so setting it to the URL
maximizes our chances of following more optimized cache codepaths.

The codepaths in Buildkit are here:
1. A hash is used to lookup any possible etag/url-hash metadata from
   previous http sources:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134

2. That hash is based in part on this getFileName function:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91

3. getFileName will default to just using the llb.Filename if explicitly
   set:
   https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421

In theory this behavior should have just been unoptimal before this
commit, the behavior was still correct. However, this did end up
triggering a different buildkit bug where the GET request made after
an unsuccessful attempt at matching based on previous etag metadata was
accidentally not uncompressing gzip responses, fixed here:
moby/buildkit#3788

Signed-off-by: Erik Sipsma <erik@dagger.io>
Signed-off-by: Joel Longtine <joel@dagger.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants