-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Change zstd compression algorithm #1069
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ A review job has been created and sent to the PullRequest network.
@jarifibrahim you can click here to see the review status or cancel the code review job.
|
Benchmarks are surely better for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that the goztsd can panic, and does not return a result.
Reviewed with ❤️ by PullRequest
| @@ -347,7 +347,7 @@ func (b *Builder) compressData(data []byte) ([]byte, error) { | |||
| case options.Snappy: | |||
| return snappy.Encode(nil, data), nil | |||
| case options.ZSTD: | |||
| return zstd.Compress(nil, data) | |||
| return gozstd.Compress(nil, data), nil | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although gozstd does not return an error, it will panic instead. If you want to keep a consistent behavior, and not kill the server when something goes wrong, catch the panic in a recover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say stick with consistency, and keep the current behavior.
Furthermore, I think it's better to keep issues isolated, and not have to depend on the error handling/panic handling behavior of the caller.
| @@ -125,7 +125,7 @@ func BenchmarkBuilder(b *testing.B) { | |||
|
|
|||
| keysCount := 1300000 // This number of entries consumes ~64MB of memory. | |||
| for i := 0; i < b.N; i++ { | |||
| opts := Options{BlockSize: 4 * 1024, BloomFalsePositive: 0.01} | |||
| opts := Options{Compression: options.ZSTD, BlockSize: 4 * 1024, BloomFalsePositive: 0.01} | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering why this option wasn't required before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashish-goswami gozstd is used by https://github.com/VictoriaMetrics/VictoriaMetrics
Reviewable status: 0 of 5 files reviewed, 2 unresolved discussions (waiting on @ashish-goswami, @manishrjain, and @pullrequest[bot])
table/builder.go, line 350 at r1 (raw file):
Previously, pullrequest[bot] wrote…
although gozstd does not return an error, it will panic instead. If you want to keep a consistent behavior, and not kill the server when something goes wrong, catch the panic in a
recover.
I could try to recover the panic here but the finishBlock function that calls this function will crash the process if compress returns an error. So the question being, do we want to crash the process by a panic or by returning an error.
I'd prefer to keep it this way.
table/builder_test.go, line 128 at r1 (raw file):
Previously, pullrequest[bot] wrote…
Just wondering why this option wasn't required before.
This option isn't really needed. I added it because I realized it was missing from the benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 2 unresolved discussions (waiting on @ashish-goswami, @manishrjain, and @pullrequest[bot])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think panic should be recovered, and keep consistent behaviors, but it isn't a major blocker for me.
Reviewed with ❤️ by PullRequest
|
Benchmarks show LZ4 is faster than zstd and so we don't need this PR anymore. https://gist.github.com/jarifibrahim/50b596c233aa9d82706aa0df65a546ec used for benchmarking. |
|
I know this is closed but there's another reason to switch zstd library, a serious memory corruption bug in DataDog/zstd, that can either lead to segfaults, lower compression ratio, or nothing depending on your usage of the library. Please see DataDog/zstd#22 |
NOTE: The following benchmarks are invalid. The default compression level in both the implementations is different and that leads to the performance difference. On same compression levels, both the implementations perform similarly in terms of compression speed but there's a difference in decompression speed.
Currently, we use https://github.com/DataDog/zstd implementation for ZSTD compression and https://github.com/valyala/gozstd seems to be faster than https://github.com/DataDog/zstd . This PR
proposes use of
valyala/gozstdinstead ofDataDog/zstd.The following script was used for benchmarking
The following is the benchmark of time taken to build a table in badger (https://github.com/dgraph-io/badger/blob/f50343ff404d8198df6dc83755ec2eab863d5ff2/table/builder_test.go#L116)
This change is