Fix gzip decoding of HTTP sources. #3788

sipsma · 2023-04-11T01:12:07Z

The go http library will normally implicitly set an Accept-Encoding of gzip to requests and then transparently decompress the response body. However, if the Accept-Encoding header is explicitly set by the caller it will no longer transparently decompress the body. This was causing HTTP LLB sources to have unexpectedly compressed contents.

The fix here just unsets the Accept-Encoding header after the HEAD request. Another possible fix would be to do our own gzip decompression after reading the body if it was gzipped, but this approach seemed slightly simpler.

cc @tonistiigi this fixes some ephemeral test failures we are getting in Dagger

The go http library will normally implicitly set an Accept-Encoding of gzip to requests and then transparently decompress the response body. However, if the Accept-Encoding header is explicitly set by the caller it will no longer transparently decompress the body. This was causing HTTP LLB sources to have unexpectedly compressed contents. The fix here just unsets the Accept-Encoding header after the HEAD request. Another possible fix would be to do our own gzip decompression after reading the body if it was gzipped, but this approach seemed slightly simpler. Signed-off-by: Erik Sipsma <erik@sipsma.dev>

Buildkit internally stores some cache metadata of etags and http checksums using an id based on this name, so setting it to the URL maximizes our chances of following more optimized cache codepaths. The codepaths in Buildkit are here: 1. A hash is used to lookup any possible etag/url-hash metadata from previous http sources: https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134 2. That hash is based in part on this getFileName function: https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91 3. getFileName will default to just using the llb.Filename if explicitly set: https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421 In theory this behavior should have just been unoptimal before this commit, the behavior was still correct. However, this did end up triggering a different buildkit bug where the GET request made after an unsuccessful attempt at matching based on previous etag metadata was accidentally not uncompressing gzip responses, fixed here: moby/buildkit#3788 Signed-off-by: Erik Sipsma <erik@dagger.io>

tonistiigi · 2023-04-11T05:52:18Z

client/client_test.go

@@ -2196,6 +2197,49 @@ func testBuildHTTPSource(t *testing.T, sb integration.Sandbox) {
 	require.Equal(t, http.MethodHead, allReqs[1].Method)
 	require.Equal(t, "gzip", allReqs[1].Header.Get("Accept-Encoding"))

+	require.NoError(t, os.RemoveAll(filepath.Join(tmpdir, "foo")))
+
+	// update the content at the url to be gzipped now, the final output


Shouldn't this be tested with llb.HTTP(, llb.Checksum()). Was this the case that failed, or is this different? I'm surprised that there isn't a case already that checks this.

The problem happens when there is a cache metadata match on the URL hash but then it turns out the actual URL (as checked with HEAD) does not have the same etag as what was in cache. That triggers the codepath where HEAD gets the explicitly set Accept-Encoding (fine by itself) which is then retained for the subsequent GET and causes the file to not be transparently decompressed by the go stdlib (the actual bug fixed here)

Wrote the test like this because it seemed like a realistic scenario (URL changes contents), but the reason we hit it in Dagger is slightly different. We were always setting llb.Filename to the same value for every llb.HTTP source, (due to the fact that it always gets mounted/copied to a different name once utilized by users, so the name was internal-only and didn't really matter).

We didn't realize this meant that unless the user set non-default uid/gid/perms, the url hash would always be the same because the filename is used there instead of the url.

So, essentially every single one of our HTTP sources ended up with the same url hash, which meant they all got the same match here. This meant that after the first HTTP source was used, we'd almost always trigger the codepath where we call HEAD, it doesn't match the cached etag, we call GET, that request doesn't get decompressed if it's gzipped.

I'm also fixing Dagger to use a hash of the URL as the filename instead so we can have more optimal behavior here, but that's independent of the bug w/ not decompressing the response, which can also happen in fairly straightforward scenarios like the test case here where the URL just changes contents.

causes the file to not be transparently decompressed by the go stdlib (the actual bug fixed here)

But doesn't this mean that the actual content for llb.HTTP() is different? And the best way to check for different content would be to add an expected checksum with llb.Checksum(). I'm not against your current test as well, but I think checksum should be checked as well so tests better protect against this class of issues.

Just looked into this, the problem is that if you set Checksum then most of the codepaths relevant here are skipped:

buildkit/source/http/httpsource.go

Lines 123 to 126 in 333ee91

if hs.src.Checksum != "" {

hs.cacheKey = hs.src.Checksum

return hs.formatCacheKey(getFileName(hs.src.URL, hs.src.Filename, nil), hs.src.Checksum, "").String(), hs.src.Checksum.String(), nil, true, nil

}

The existing test asserts on the content by exporting the result and looking at the contents of the file there.

* Set HTTP source filename to be URL. Buildkit internally stores some cache metadata of etags and http checksums using an id based on this name, so setting it to the URL maximizes our chances of following more optimized cache codepaths. The codepaths in Buildkit are here: 1. A hash is used to lookup any possible etag/url-hash metadata from previous http sources: https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L128-L134 2. That hash is based in part on this getFileName function: https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L91 3. getFileName will default to just using the llb.Filename if explicitly set: https://github.com/sipsma/buildkit/blob/cf2698c0e4b708127c3aa86c49d51532feee6b82/source/http/httpsource.go#L419-L421 In theory this behavior should have just been unoptimal before this commit, the behavior was still correct. However, this did end up triggering a different buildkit bug where the GET request made after an unsuccessful attempt at matching based on previous etag metadata was accidentally not uncompressing gzip responses, fixed here: moby/buildkit#3788 Signed-off-by: Erik Sipsma <erik@dagger.io> * Use hash instead of base64 to prevent hitting max filename limits. Signed-off-by: Erik Sipsma <erik@dagger.io> --------- Signed-off-by: Erik Sipsma <erik@dagger.io>

sipsma requested a review from tonistiigi April 11, 2023 01:12

sipsma mentioned this pull request Apr 11, 2023

Set HTTP source filename to be URL. dagger/dagger#4927

Merged

tonistiigi reviewed Apr 11, 2023

View reviewed changes

tonistiigi approved these changes Apr 13, 2023

View reviewed changes

tonistiigi merged commit 045c781 into moby:master Apr 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix gzip decoding of HTTP sources. #3788

Fix gzip decoding of HTTP sources. #3788

sipsma commented Apr 11, 2023

tonistiigi Apr 11, 2023

sipsma Apr 11, 2023

tonistiigi Apr 11, 2023

sipsma Apr 13, 2023

	if hs.src.Checksum != "" {
	hs.cacheKey = hs.src.Checksum
	return hs.formatCacheKey(getFileName(hs.src.URL, hs.src.Filename, nil), hs.src.Checksum, "").String(), hs.src.Checksum.String(), nil, true, nil
	}

Fix gzip decoding of HTTP sources. #3788

Fix gzip decoding of HTTP sources. #3788

Conversation

sipsma commented Apr 11, 2023

tonistiigi Apr 11, 2023

Choose a reason for hiding this comment

sipsma Apr 11, 2023

Choose a reason for hiding this comment

tonistiigi Apr 11, 2023

Choose a reason for hiding this comment

sipsma Apr 13, 2023

Choose a reason for hiding this comment