Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compress/flate: decompression performance #23154

Open
flx42 opened this issue Dec 16, 2017 · 12 comments
Open

compress/flate: decompression performance #23154

flx42 opened this issue Dec 16, 2017 · 12 comments

Comments

@flx42
Copy link

@flx42 flx42 commented Dec 16, 2017

What version of Go are you using (go version)?

go version go1.9.2 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/fabecassis/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build767750319=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

I noticed that the performance of "compress/gzip" is close to 2x slower than the gunzip(1) command-line tool. I've tested multiple methods of performing the decompression in this simple repository:
https://github.com/flx42/go-gunzip-bench
In summary, on a 333 MB gzipped file, for the read-decompress-write process:

  • The idiomatic go implementation takes 7.8s.
  • Piping to gunzip(1) takes 4.4s.
  • Using cgzip, a cgo wrapper on top on zlib, it takes 3.4s

Is there any room for improvement here for the standard implementation? The use case is decompression of downloaded Docker layers, in golang.

@flx42

This comment has been minimized.

Copy link
Author

@flx42 flx42 commented Dec 16, 2017

@agnivade

This comment has been minimized.

Copy link
Contributor

@agnivade agnivade commented Dec 16, 2017

/cc @dsnet

@dongweigogo

This comment has been minimized.

Copy link

@dongweigogo dongweigogo commented Dec 16, 2017

I see some complaints about it elsewhere.

@dsnet

This comment has been minimized.

Copy link
Member

@dsnet dsnet commented Dec 16, 2017

There's a large amount of area for improvement. I re-implemented the decompressor in github.com/dsnet/compress/flate. I have yet to merge my changes into the standard library, but it's about 1.6x to 2.3x faster. You can use that package if you want.

@dsnet

This comment has been minimized.

Copy link
Member

@dsnet dsnet commented Dec 16, 2017

The biggest bang for buck is to get around the io.ByteReader interface that the flate.Reader promises to uphold. This is problematic since it limits the input bandwidth to 200MB/s in the best case.

The approach I took was to special-case the fact that most io.Reader don't satisfy io.ByteReader and so will be automatically wrapped as a bufio.Reader. Thus, I type-assert the input reader to check whether it has the Peek and Discard methods and use those to process as much of the known input as possible. This allows you to take advantage of the new math/bits package to parallelize certain operations across multiple bytes. My implementation was written before math/bits, so I think it's possible to get yet more speed benefits than its current state.

Unfortunately, bytes.Buffer, bytes.Reader, and strings.Reader also implement io.ByteReader, so will also need some wrapping to be performant.

Some of the relevant performance changes:

@flx42

This comment has been minimized.

Copy link
Author

@flx42 flx42 commented Dec 16, 2017

Thanks @dsnet, do you have any ETA yet on when you will start to integrate your optimizations into the standard library?

I know that more optimized packages exist, but projects might be reluctant to add an external dependency for this task.

@dsnet dsnet changed the title compress/gzip: decompression performance compress/flate: decompression performance Dec 16, 2017
@dsnet

This comment has been minimized.

Copy link
Member

@dsnet dsnet commented Dec 16, 2017

The biggest blocker is time on myself and a reviewer who's also familiar with RFC1951. It's possible to get a new implementation in for Go1.11, but who knows. Compression needs to be reviewed carefully.

\cc @mdempsky @nigeltao @klauspost. Any takers as reviewers?

@klauspost

This comment has been minimized.

Copy link
Contributor

@klauspost klauspost commented Dec 17, 2017

I have added a few minor comments to the commits listed above.

You could consider supporting the io.WriterTo interface. It avoids a copy in the Read function.

I assume you have already fuzzed it :)

@nigeltao

This comment has been minimized.

Copy link
Contributor

@nigeltao nigeltao commented Dec 21, 2017

It's not production-ready yet, or any time soon, but I have been working on https://github.com/google/wuffs, which is a new, memory-safe programming language that generates C code now, and could generate Go code (using package unsafe) in the future. The kicker is that despite being memory safe (e.g. all array accesses are bounds-checked), its DEFLATE implementation benchmarks more or less as fast as the C zlib library (see https://github.com/google/wuffs/blob/master/doc/benchmarks.md).

A rough estimate (different measurement frameworks, different test data) is that it is therefore about 4x faster at decoding than Go's compress/flate. On my desktop: Wuffs is 300 MB/s compared to "go test -test.bench=. compress/flate"'s 75 MB/s. There are a couple further optimization techniques (madler/zlib#292 that I proposed for C zlib but had to abort because of API backwards compatibility constraints) that I'd expect to give an extra 1.5x improvement to Wuffs' throughput.

It's still a work in progress, and won't help Go programs any time soon. But that's where I'm spending my limited time these days.

@nigeltao

This comment has been minimized.

Copy link
Contributor

@nigeltao nigeltao commented Dec 21, 2017

As for APIs (io.ByteReader, io.WriterTo, etc), Wuffs' API is similar to Go's golang.org/x/text/transform.Transformer. You wouldn't be able to just drop that interface (and its specific errors) into the stdlib per se, as the stdlib can't depend on x, but there may be some ideas in there worth considering.

@flx42

This comment has been minimized.

Copy link
Author

@flx42 flx42 commented Dec 21, 2017

@nigeltao This project looks amazing. Even if it probably won't happen soon, having a new implementation under golang.org/x/ would already be a big step up to convince projects to adopt it, compared to non-official projects on GitHub.

@flx42

This comment has been minimized.

Copy link
Author

@flx42 flx42 commented Aug 2, 2018

FWIW, I'm seeing a nice performance improvement with Go 1.11 beta2. Down to 6.8s, compared to 8s originally with 1.9.2.

Not sure if this is an optimization in the deflate code, or general compiler optimizations.

cyphar added a commit to cyphar/umoci that referenced this issue Aug 2, 2018
One of the largest resource hogs in umoci is compression-related, and it
turns out[1] that Go's ordinary gzip implementation is nowhere near as
fast as other modern gzip implementations. In order to help our users
get more efficient builds, switch to a "pure-Go" implementation which
has x64-specific optimisations. We cannot use zlib wrappers like [2]
because of issues with "os/user" and static compilation (so we wouldn't
be able to release binaries).

This very simplified benchmark shows the positive difference when
switching libraries (using "tensorflow/tensorflow:latest" as the image
base):

  % time umoci unpack --image tensorflow:latest bundle # before
  39.03user 7.58system 0:45.16elapsed
  % time umoci unpack --image tensorflow:latest bundle # after
  40.89user 7.99system 0:34.89elapsed

And repacking is similarly fast (in this benchmark the changes were
installing FireFox and doing a repo refresh):

  % time umoci repack --image tensorflow:latest bundle # before
  78.54user 13.71system 1:26.66elapsed
  % time umoci repack --image tensorflow:latest bundle # after
  53.11user 4.87system 0:36.75elapsed

[1]: golang/go#23154
[2]: https://github.com/vitessio/vitess/tree/v2.2/go/cgzip

Signed-off-by: Aleksa Sarai <asarai@suse.de>
cyphar added a commit to cyphar/umoci that referenced this issue Aug 2, 2018
One of the largest resource hogs in umoci is compression-related, and it
turns out[1] that Go's ordinary gzip implementation is nowhere near as
fast as other modern gzip implementations. In order to help our users
get more efficient builds, switch to a "pure-Go" implementation which
has x64-specific optimisations. We cannot use zlib wrappers like [2]
because of issues with "os/user" and static compilation (so we wouldn't
be able to release binaries).

This very simplified benchmark shows the positive difference when
switching libraries (using "tensorflow/tensorflow:latest" as the image
base):

  % time umoci unpack --image tensorflow:latest bundle # before
  39.03user 7.58system 0:45.16elapsed
  % time umoci unpack --image tensorflow:latest bundle # after
  40.89user 7.99system 0:34.89elapsed

But the real benefit is when it comes to compression and repacking (in
this benchmark the changes were installing FireFox and doing a repo
refresh):

  % time umoci repack --image tensorflow:latest bundle # before
  78.54user 13.71system 1:26.66elapsed
  % time umoci repack --image tensorflow:latest bundle # after
  53.11user 4.87system 0:36.75elapsed

That's almost 3x faster, just by having a more optimised compression
library!

[1]: golang/go#23154
[2]: https://github.com/vitessio/vitess/tree/v2.2/go/cgzip

Signed-off-by: Aleksa Sarai <asarai@suse.de>
cyphar added a commit to cyphar/umoci that referenced this issue Aug 3, 2018
One of the largest resource hogs in umoci is compression-related, and it
turns out[1] that Go's ordinary gzip implementation is nowhere near as
fast as other modern gzip implementations. In order to help our users
get more efficient builds, switch to a "pure-Go" implementation which
has x64-specific optimisations. We cannot use zlib wrappers like [2]
because of issues with "os/user" and static compilation (so we wouldn't
be able to release binaries).

This very simplified benchmark shows the positive difference when
switching libraries (using "tensorflow/tensorflow:latest" as the image
base):

  % time umoci unpack --image tensorflow:latest bundle # before
  39.03user 7.58system 0:45.16elapsed
  % time umoci unpack --image tensorflow:latest bundle # after
  40.89user 7.99system 0:34.89elapsed

But the real benefit is when it comes to compression and repacking (in
this benchmark the changes were installing FireFox and doing a repo
refresh):

  % time umoci repack --image tensorflow:latest bundle # before
  78.54user 13.71system 1:26.66elapsed
  % time umoci repack --image tensorflow:latest bundle # after
  51.14user 3.25system 0:30.30elapsed

That's almost 3x faster, just by having a more optimised compression
library!

[1]: golang/go#23154
[2]: https://github.com/vitessio/vitess/tree/v2.2/go/cgzip

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.