Fix gzip decoding of HTTP sources. #3788
Conversation
The Go http library will normally implicitly set an Accept-Encoding of gzip on requests and then transparently decompress the response body. However, if the Accept-Encoding header is explicitly set by the caller, it will no longer transparently decompress the body. This was causing HTTP LLB sources to have unexpectedly compressed contents.

The fix here just unsets the Accept-Encoding header after the HEAD request. Another possible fix would be to do our own gzip decompression after reading the body if it was gzipped, but this approach seemed slightly simpler.

cc @tonistiigi this fixes some ephemeral test failures we are getting in Dagger

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
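For readers unfamiliar with this net/http quirk, here is a minimal standalone sketch of the behavior the description refers to (the URL is a placeholder, and this is illustrative code, not the buildkit change itself):

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// With no explicit Accept-Encoding, net/http's transport adds "gzip"
	// itself and transparently decompresses the response body; in that case
	// resp.Uncompressed reports true (when the server actually sent gzip).
	req, err := http.NewRequest(http.MethodGet, "https://example.com/", nil)
	if err != nil {
		panic(err)
	}

	// Explicitly setting the header disables that transparent handling:
	// the body would then arrive still gzip-compressed.
	req.Header.Set("Accept-Encoding", "gzip")

	// The fix in this PR amounts to dropping the explicit header again
	// before the follow-up GET, restoring the transport's automatic
	// decompression.
	req.Header.Del("Accept-Encoding")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("transparently decompressed:", resp.Uncompressed)
}
```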
Buildkit internally stores some cache metadata of etags and http checksums using an id based on this name, so setting it to the URL maximizes our chances of following more optimized cache codepaths. The codepaths in Buildkit are here:

1. A hash is used to look up any possible etag/url-hash metadata from previous http sources: https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134
2. That hash is based in part on this getFileName function: https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91
3. getFileName will default to just using the llb.Filename if explicitly set: https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421

In theory this behavior should just have been suboptimal before this commit; the behavior was still correct. However, it did end up triggering a different buildkit bug where the GET request made after an unsuccessful attempt at matching based on previous etag metadata was accidentally not uncompressing gzip responses, fixed here: moby/buildkit#3788

Signed-off-by: Erik Sipsma <erik@dagger.io>
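A rough paraphrase of the linked getFileName behavior, to show why a shared llb.Filename collapses distinct URLs onto one metadata id (the URL-derived fallback below is an assumption for illustration, not a copy of buildkit's code):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// getFileName sketch: an explicitly set llb.Filename wins, so two sources
// with different URLs but the same Filename share one metadata id.
func getFileName(rawURL, filename string) string {
	if filename != "" {
		return filename // explicit name short-circuits the URL
	}
	// Assumed fallback for illustration: derive a name from the URL path.
	if u, err := url.Parse(rawURL); err == nil && path.Base(u.Path) != "/" {
		return path.Base(u.Path)
	}
	return "download"
}

func main() {
	// Same explicit filename => same id, even though the URLs differ.
	fmt.Println(getFileName("https://example.com/a.tar.gz", "file"))
	fmt.Println(getFileName("https://example.com/b.tar.gz", "file"))
}
```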
```go
@@ -2196,6 +2197,49 @@ func testBuildHTTPSource(t *testing.T, sb integration.Sandbox) {
	require.Equal(t, http.MethodHead, allReqs[1].Method)
	require.Equal(t, "gzip", allReqs[1].Header.Get("Accept-Encoding"))

	require.NoError(t, os.RemoveAll(filepath.Join(tmpdir, "foo")))

	// update the content at the url to be gzipped now, the final output
```
Shouldn't this be tested with llb.HTTP(..., llb.Checksum(...))? Was this the case that failed, or is this different? I'm surprised that there isn't a case already that checks this.
The problem happens when there is a cache metadata match on the URL hash, but then it turns out the actual URL (as checked with HEAD) does not have the same etag as what was in cache. That triggers the codepath where HEAD gets the explicitly set Accept-Encoding (fine by itself), which is then retained for the subsequent GET and causes the file to not be transparently decompressed by the Go stdlib (the actual bug fixed here).
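Spelled out as a hypothetical condensation of that flow (made-up names and a made-up cachedETag parameter; not buildkit's actual code):

```go
package main

import (
	"fmt"
	"net/http"
)

// fetchIfChanged sketches the HEAD-then-GET sequence described above.
func fetchIfChanged(client *http.Client, url, cachedETag string) (*http.Response, error) {
	head, err := http.NewRequest(http.MethodHead, url, nil)
	if err != nil {
		return nil, err
	}
	// Harmless on the HEAD itself: only headers are inspected.
	head.Header.Set("Accept-Encoding", "gzip")
	resp, err := client.Do(head)
	if err != nil {
		return nil, err
	}
	resp.Body.Close()
	if resp.Header.Get("ETag") == cachedETag {
		return nil, nil // etag matched; cached content can be reused
	}
	// The bug was effectively carrying the explicit Accept-Encoding over to
	// this GET; issuing a fresh request without it lets net/http
	// transparently decompress a gzipped body again.
	get, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(get)
}

func main() {
	resp, err := fetchIfChanged(http.DefaultClient, "https://example.com/", "")
	if err != nil {
		panic(err)
	}
	if resp != nil {
		defer resp.Body.Close()
		fmt.Println("uncompressed by transport:", resp.Uncompressed)
	}
}
```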
I wrote the test like this because it seemed like a realistic scenario (a URL's contents change), but the reason we hit it in Dagger is slightly different. We were always setting llb.Filename to the same value for every llb.HTTP source (since the file always gets mounted/copied to a different name once users actually use it, the name was internal-only and didn't really matter). We didn't realize this meant that unless the user set non-default uid/gid/perms, the URL hash would always be the same, because the filename is used there instead of the URL.

So essentially every single one of our HTTP sources ended up with the same URL hash, which meant they all got the same match here. After the first HTTP source was used, we'd almost always trigger the codepath where we call HEAD, it doesn't match the cached etag, we call GET, and that request doesn't get decompressed if it's gzipped.

I'm also fixing Dagger to use a hash of the URL as the filename instead so we can have more optimal behavior here (a sketch of that follows), but that's independent of the bug with not decompressing the response, which can also happen in fairly straightforward scenarios like the test case here where the URL just changes contents.
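A sketch of that Dagger-side change. The thread says a hash of the URL is used (rather than base64, to avoid filename length limits) but does not specify which hash, so sha256 here is an assumption:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// urlToFilename derives a stable per-URL name so each distinct URL gets its
// own cache metadata id, with a fixed length well under filename limits.
// Hypothetical helper; the exact hash Dagger uses is not stated here.
func urlToFilename(url string) string {
	sum := sha256.Sum256([]byte(url))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(urlToFilename("https://example.com/archive.tar.gz"))
}
```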
> causes the file to not be transparently decompressed by the go stdlib (the actual bug fixed here)

But doesn't this mean that the actual content for llb.HTTP() is different? And the best way to check for different content would be to add an expected checksum with llb.Checksum(). I'm not against your current test, but I think the checksum should be checked as well so tests better protect against this class of issues.
Just looked into this; the problem is that if you set Checksum then most of the codepaths relevant here are skipped:

buildkit/source/http/httpsource.go, lines 123 to 126 in 333ee91:

```go
if hs.src.Checksum != "" {
	hs.cacheKey = hs.src.Checksum
	return hs.formatCacheKey(getFileName(hs.src.URL, hs.src.Filename, nil), hs.src.Checksum, "").String(), hs.src.Checksum.String(), nil, true, nil
}
```
The existing test asserts on the content by exporting the result and looking at the contents of the file there.
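For contrast, a client-side sketch of the Checksum variant being discussed (the digest value is a placeholder; llb.HTTP, llb.Filename, and llb.Checksum are the real buildkit client API):

```go
package main

import (
	"github.com/moby/buildkit/client/llb"
	digest "github.com/opencontainers/go-digest"
)

func main() {
	// Placeholder digest value for illustration only.
	dgst := digest.Digest("sha256:0000000000000000000000000000000000000000000000000000000000000000")

	// With Checksum set, the snippet above shows the cache key becomes the
	// checksum itself, so the HEAD/etag codepath that triggered this bug is
	// skipped entirely.
	st := llb.HTTP("https://example.com/file.tar.gz",
		llb.Filename("file.tar.gz"),
		llb.Checksum(dgst),
	)
	_ = st
}
```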
* Set HTTP source filename to be URL. (Full commit message quoted earlier in this thread.)
* Use hash instead of base64 to prevent hitting max filename limits.

Signed-off-by: Erik Sipsma <erik@dagger.io>