
archive/zip: Performance regression with reading ZIP files with many entries #48374

Open
stanhu opened this issue Sep 14, 2021 · 2 comments
@stanhu commented Sep 14, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/stanhu/.cache/go-build"
GOENV="/home/stanhu/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/stanhu/.gvm/pkgsets/go1.17.1/global/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/stanhu/.gvm/pkgsets/go1.17.1/global"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/stanhu/.gvm/gos/go1.17.1"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/stanhu/.gvm/gos/go1.17.1/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build4184767189=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We use HTTP Range Requests to seek around an externally stored ZIP file to extract the directory listing. For example:

package main

import (
	"archive/zip"
	"fmt"
	"net/http"

	"github.com/jfbus/httprs"
)

func main() {
	url := "https://stanhu-s3-test-bucket.s3.us-west-1.amazonaws.com/test.zip"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}

	httpClient := &http.Client{}

	resp, err := httpClient.Do(req)
	if err != nil {
		panic(err)
	}
	if resp.StatusCode != http.StatusOK {
		panic(fmt.Sprintf("unexpected status: %s", resp.Status))
	}

	rs := httprs.NewHttpReadSeeker(resp, httpClient)
	defer resp.Body.Close()
	defer rs.Close()

	archive, err := zip.NewReader(rs, resp.ContentLength)
	if err != nil {
		panic(err)
	}

	for _, entry := range archive.File {
		fmt.Println(entry.Name)
	}

	fmt.Printf("Range Requests: %d\n", rs.Requests)
}

What did you expect to see?

With Go v1.16.4:

# time ./main
<snip>
tmp/tests/frontend/fixtures-ee/projects_json/pipelines_empty.json
tmp/tests/frontend/fixtures-ee/pipeline_schedules/edit_with_variables.html
tmp/tests/frontend/fixtures-ee/projects_json/files.json
tmp/tests/frontend/fixtures-ee/projects/overview.html
tmp/tests/frontend/fixtures-ee/search/show.html
tmp/tests/frontend/fixtures-ee/search/blob_search_result.html
Range Requests: 2

real	0m0.277s
user	0m0.060s
sys	0m0.028s

What did you see instead?

# time ./main
<snip>
tmp/tests/frontend/fixtures-ee/pipeline_schedules/edit_with_variables.html
tmp/tests/frontend/fixtures-ee/projects_json/files.json
tmp/tests/frontend/fixtures-ee/projects/overview.html
tmp/tests/frontend/fixtures-ee/search/show.html
tmp/tests/frontend/fixtures-ee/search/blob_search_result.html
Range Requests: 78

real	0m9.547s
user	0m0.239s
sys	0m0.154s

It appears the call to f.readDataDescriptor is causing these extra 76 HTTP Range Requests:

f.readDataDescriptor()

This was added in ddb648f for #34974.
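The per-entry read amplification can be observed locally, without any network, by wrapping an `io.ReaderAt` that counts `ReadAt` calls (on a remote backend, each call roughly corresponds to one Range Request). This is a hypothetical sketch, not code from the issue; `countingReaderAt`, `buildZip`, and `countReads` are names invented here. On an affected Go release (1.17.x), the count grows with the number of entries because `zip.Writer` streams its output and therefore emits a data descriptor for every entry:

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
)

// countingReaderAt wraps a bytes.Reader and counts ReadAt calls; with a
// remote backend each call would roughly correspond to a Range Request.
type countingReaderAt struct {
	r     *bytes.Reader
	calls int
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	c.calls++
	return c.r.ReadAt(p, off)
}

// buildZip creates an in-memory archive with n small entries. zip.Writer
// streams its output, so each entry carries a data descriptor.
func buildZip(n int) []byte {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for i := 0; i < n; i++ {
		f, err := w.Create(fmt.Sprintf("file-%d.txt", i))
		if err != nil {
			panic(err)
		}
		if _, err := f.Write([]byte("hello")); err != nil {
			panic(err)
		}
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// countReads reports how many ReadAt calls zip.NewReader makes on data.
func countReads(data []byte) int {
	cr := &countingReaderAt{r: bytes.NewReader(data)}
	if _, err := zip.NewReader(cr, int64(len(data))); err != nil {
		panic(err)
	}
	return cr.calls
}

func main() {
	fmt.Printf("5 entries:  %d ReadAt calls\n", countReads(buildZip(5)))
	fmt.Printf("50 entries: %d ReadAt calls\n", countReads(buildZip(50)))
}
```

The exact counts depend on the Go version; the point is only that the number of random-access reads scales with the entry count when data descriptors are parsed eagerly.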

Granted, not many people may be using the library with HTTP Range Requests, but this usage is compatible with the interface. The problem caused a production incident: our Google Cloud Storage request rate jumped from 2000/s to over 30,000/s (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5521).

The additional reads also slow down opening ZIP files, especially on a networked filesystem. The OpenRaw() interface doesn't really work as a substitute because the HTTP client uses io.Reader, so we'd need to buffer the output to a file, which we're trying to avoid.

@escholtz @ianlancetaylor I realize this parsing of the data descriptors might have been done to support the raw methods and ZIP64, but is there a way we can skip it, since we're primarily interested in extracting the central directory header?

@escholtz (Contributor) commented Sep 14, 2021

Yikes! I'm sorry this caused a production incident.

You're correct that the readDataDescriptor call in Reader.init is causing many additional random-access reads, and there is no way to bypass it with the current public methods. This placement was proposed by @rsc in #34974 (comment). We'll have to take a closer look at whether it can be conditionally skipped or moved elsewhere. If I remember correctly, we were trying to balance the constraints of the existing package interface/behavior, the needs of the new methods, and edge cases in the zip format.

Would it be possible for your use case to read the file from GCS first and then open it from memory? Or does that violate your bandwidth and local storage requirements?

@WarheadsSE commented Sep 14, 2021

Would it be possible for your use case to read the file from GCS first and then open it from memory? Or does that violate your bandwidth and local storage requirements?

We really need to operate over HTTP Range Requests for the functionality impacted here: it serves our CI artifacts, as uploaded by GitLab Runner instances. Many projects generate rather large artifacts on public and private instances, and it would be best not to force a full transit round trip when we really just need the metadata.
