Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compress/flate: very memory intensive #32371

Open
nhooyr opened this issue May 31, 2019 · 8 comments

Comments

@nhooyr
Copy link
Contributor

commented May 31, 2019

func BenchmarkCompress(b *testing.B) {
	b.Run("writer", func(b *testing.B) {
		b.Run("flate", func(b *testing.B) {
			b.ReportAllocs()

			for i := 0; i < b.N; i++ {
				flate.NewWriter(nil, flate.BestSpeed)
			}
		})
		b.Run("gzip", func(b *testing.B) {
			b.ReportAllocs()

			for i := 0; i < b.N; i++ {
				gzip.NewWriterLevel(nil, zlib.BestSpeed)
			}
		})
	})
	b.Run("reader", func(b *testing.B) {
		b.Run("flate", func(b *testing.B) {
			b.ReportAllocs()

			for i := 0; i < b.N; i++ {
				flate.NewReader(nil)
			}
		})
		b.Run("gzip", func(b *testing.B) {
			b.ReportAllocs()

			bb := &bytes.Buffer{}
			gzip.NewWriter(bb).Write(nil)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				gzip.NewReader(bb)
			}
		})
	})
}

If you run that, you'll get:

$ go test -bench=Compress -run=^$
goos: darwin
goarch: amd64
pkg: nhooyr.io/websocket
BenchmarkCompress/writer/flate-8   	   10000	    135813 ns/op	 1200010 B/op	      15 allocs/op
BenchmarkCompress/writer/gzip-8    	20000000	        73.1 ns/op	     176 B/op	       1 allocs/op
BenchmarkCompress/reader/flate-8   	  300000	      3785 ns/op	   44672 B/op	       6 allocs/op
BenchmarkCompress/reader/gzip-8    	10000000	       230 ns/op	     704 B/op	       1 allocs/op
PASS
ok  	nhooyr.io/websocket	6.674s

1.2 MB per writer and 45 KB per reader is a lot, especially for usage with WebSockets where most messages are rather small, often on average 512 bytes. Why is compress/flate allocating so much and is there a way to reduce it?

gzip (along with zlib though I didn't include it in the benchmark) use much less memory.

Related:

  1. #3155
  2. gorilla/websocket#203
@josharian

This comment has been minimized.

Copy link
Contributor

commented Jun 1, 2019

On my phone, but most gzip (and maybe flate?) types have a Reset method that you can call to allow their re-use. That should help considerably. If there’s a reason that you can’t use Reset, or a critical type is missing a Reset option, please detail that.

@nhooyr

This comment has been minimized.

Copy link
Contributor Author

commented Jun 1, 2019

On my phone, but most gzip (and maybe flate?) types have a Reset method that you can call to allow their re-use. That should help considerably. If there’s a reason that you can’t use Reset, or a critical type is missing a Reset option, please detail that.

Yes they both do have a Reset method and it does help.

However, the amount of memory used by flate is still very excessive compared to compress/gzip or compress/zlib.

@klauspost

This comment has been minimized.

Copy link
Contributor

commented Jun 1, 2019

However, the amount of memory used by flate is still very excessive compared to compress/gzip or compress/zlib.

Since both gzip and zlib is using deflate and store the *flate.Writer, your benchmark is misleading. gzip and zlib doesn't allocate the compressor until data is written to the stream, so if you do a single write, you will see the same numbers. And for practical use it will be the same.

But, let's put this into context. The deflate compression does do a lot of upfront allocations. The allocations are done so standard operation can be done without additional allocations, and why 'Reset' is available to reuse.

As I wrote on the gorilla ticket: For compression level 1 (fastest), this also means a lot of un-needed allocations. I made an experiment a couple of years ago to be more selective about allocations. This is mainly for use when using Reset isn't used, but it will have the side effect of less allocations.
Unfinished PR here: klauspost/compress#70 - note that "level 2" is equivalent to stdlib level 1.

A simpler optimization could be: For level 1, 0 and -2 the following arrays in the compressor are not needed: hashHead (512KB) hashPrev (128KB). It should be rather simple do add these to a struct that is only allocated when needed. hashMatch (1KB) is also not needed, but now we are getting to the small things.

@nhooyr

This comment has been minimized.

Copy link
Contributor Author

commented Jun 1, 2019

Since both gzip and zlib is using deflate and store the *flate.Writer, your benchmark is misleading. gzip and zlib doesn't allocate the compressor until data is written to the stream, so if you do a single write, you will see the same numbers. And for practical use it will be the same.

Makes sense.

But, let's put this into context. The deflate compression does do a lot of upfront allocations. The allocations are done so standard operation can be done without additional allocations, and why 'Reset' is available to reuse.

For writing such small messages, I think 1.2 MB is a very large price to pay. I'm not an expert in compression algorithms but can't the buffers just be adjusted to grow as needed dynamically instead of always allocating so much?

A simpler optimization could be: For level 1, 0 and -2 the following arrays in the compressor are not needed: hashHead (512KB) hashPrev (128KB). It should be rather simple do add these to a struct that is only allocated when needed. hashMatch (1KB) is also not needed, but now we are getting to the small things.

That would make a massive difference but still leaves 560 KB per writer. That still seems very excessive to me for the WebSocket use case.

@klauspost

This comment has been minimized.

Copy link
Contributor

commented Jun 1, 2019

Here is a simpler version of the PR above: klauspost/compress#107

It is fairly risk-free (compared to the other), so it should be feasible for a merge soon.

@klauspost

This comment has been minimized.

Copy link
Contributor

commented Jun 1, 2019

can't the buffers just be adjusted to grow as needed dynamically

That would come at a massive performance cost. The big allocations a hash table and chain table. The hash table is sort of a map[uint16]int32 lookup, but it is sparsely populated, since this allows very fast lookups with no bounds checks since the compiler knows its size.

In the stdlib "level 1" uses its own (smaller, 128KB) hash table, so the allocations made for the more expensive levels are not used.

That would make a massive difference but still leaves 560 KB per writer. That still seems very excessive to me for the WebSocket use case.

Let's break down the rest:

64KB is allocated in d.window. This is to accumulate input until we have enough for a block. This could be allocating less up front, but would mean allocations as content gets written.
64KB is allocated for 'tokens', the output of the compression stage. This could be less, but would also result in allocations during compression.
Huffman trees and histograms are needed no matter how big your input is. Only "level -2" (HuffmanOnly) could use slightly less space.

Finally the last thing I can think of is the output buffer which is 256 bytes. Not much to gain there.

Let me see if I can fix up your benchmark to get some real numbers.

@klauspost

This comment has been minimized.

Copy link
Contributor

commented Jun 1, 2019

The io.Writer interface kind of makes more detailed optimizations hard. Even if you only send a few bytes, we have no way of knowing that you will not be sending more.

A reasonable addition would be a Encode(src, dst []byte) []byte that allows you to send the entire content you want to have compressed. That would allow the compressor to chose a suitable compression scheme and also share compressors.

@klauspost

This comment has been minimized.

Copy link
Contributor

commented Jun 1, 2019

I have updated klauspost/compress#107 with the actual numbers remaining and here is a gist to a realistic benchmark: https://gist.github.com/klauspost/f5df3a3522ac4bcb3bcde448872dffe6

Most of the remaining allocations are for the Huffman table generators, which is pretty unavoidable no matter your input size. Again, note that "level 2" in my lib is the "level 1" in stdlib. So yes, the baseline for stdlib is about 540K. If you switch to my lib, it is about 340KB for level 1.

@dmitshur dmitshur added this to the Go1.14 milestone Jun 6, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.