zstd: Custom Dictionary Compression Support #140

Closed
richardartoul opened this issue Jul 15, 2019 · 13 comments · Fixed by #281
Comments

@richardartoul

First off, thanks for writing a pure Go implementation! My team has wanted to use zstd in our project for a long time now, but we have been trying to avoid any cgo dependencies.

We have one use-case in particular that would really benefit from the ability to train and use custom dictionaries on the fly.

Is that feature on your roadmap anytime soon? If not, how challenging do you think it would be for me to try to upstream it? I'm happy to contribute some engineering work.

Cheers,
Richie

@klauspost
Owner

klauspost commented Jul 17, 2019

Support for using dictionaries is on the horizon, but not at the top of the list. It doesn't seem like a huge task; it is mainly a question of initializing the encoder/decoder.

Decoder:

// TODO: Init to dictionary

// TODO: Init to dictionary

Encoder:

func (b *blockEnc) initNewEncode() {

The encoder models would also need to have a function added that indexes a blob of bytes as history. Should be fairly trivial.

I don't have plans for creating dictionaries. I wouldn't expect this to be a trivial task.

@richardartoul
Author

@klauspost Ah ok. Creating dictionaries is what I need. Maybe I'll try and find the time to investigate that task and see how hard it would be. Thanks!

@klauspost klauspost changed the title Custom Dictionary Support zstd: Custom Dictionary Support Feb 18, 2020
@klauspost klauspost changed the title zstd: Custom Dictionary Support zstd: Custom Dictionary Compression Support Jun 1, 2020
@klauspost
Owner

Decompression dictionary support has been added: https://github.com/klauspost/compress/tree/master/zstd#dictionaries

@rs

rs commented Aug 11, 2020

Any timeline for encoding support?

@klauspost
Owner

@rs No, no concrete timeline.

@klauspost
Owner

@richardartoul @rs You can test out #281 - fuzz tests are now stable, but of course any testing helps.

@rs

rs commented Sep 2, 2020

Performed a quick test and it works. Performance is bad compared to a simple deflate tho, I'm not sure why.

I'm working on small payloads. I tried different compression levels and smaller dictionaries. The compression ratio stays better than deflate, but compression time is two orders of magnitude slower, even with the fastest compression level and a tiny dictionary.

@klauspost
Owner

klauspost commented Sep 2, 2020

@rs Yes, it will be slower since more state needs to be initialized. What is the actual difference you are seeing?

I do believe initialization is currently being done twice for small blocks, so some can be clawed back.

Other stuff is unavoidable. Since small blocks now suddenly have a (potentially big) history, we cannot take some of the shortcuts we do for standalone blocks.

Edit: Of course your actual code would also help.

@klauspost
Owner

@rs Fixed most of the things

BenchmarkEncodeAllDict/length-19-level-fastest-dict-1
BenchmarkEncodeAllDict/length-19-level-fastest-dict-1-32         	  141180	      8351 ns/op	   2.28 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-19-level-default-dict-1
BenchmarkEncodeAllDict/length-19-level-default-dict-1-32         	   19639	     60085 ns/op	   0.32 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-19-level-better-dict-1
BenchmarkEncodeAllDict/length-19-level-better-dict-1-32          	    5714	    196535 ns/op	   0.10 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-12131-level-fastest-dict-1
BenchmarkEncodeAllDict/length-12131-level-fastest-dict-1-32      	   19077	     59129 ns/op	 205.16 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-12131-level-default-dict-1
BenchmarkEncodeAllDict/length-12131-level-default-dict-1-32      	    3999	    301825 ns/op	  40.19 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-12131-level-better-dict-1
BenchmarkEncodeAllDict/length-12131-level-better-dict-1-32       	    2352	    495323 ns/op	  24.49 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-210569-level-fastest-dict-1
BenchmarkEncodeAllDict/length-210569-level-fastest-dict-1-32     	    1052	   1115971 ns/op	 188.69 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-210569-level-default-dict-1
BenchmarkEncodeAllDict/length-210569-level-default-dict-1-32     	     673	   1763746 ns/op	 119.39 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-210569-level-better-dict-1
BenchmarkEncodeAllDict/length-210569-level-better-dict-1-32      	     423	   2820325 ns/op	  74.66 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-102605-level-fastest-dict-1
BenchmarkEncodeAllDict/length-102605-level-fastest-dict-1-32     	    2352	    494047 ns/op	 207.68 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-102605-level-default-dict-1
BenchmarkEncodeAllDict/length-102605-level-default-dict-1-32     	    1142	   1049912 ns/op	  97.73 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-102605-level-better-dict-1
BenchmarkEncodeAllDict/length-102605-level-better-dict-1-32      	     662	   1791540 ns/op	  57.27 MB/s	       0 B/op	       0 allocs/op

Same with no dictionaries:

BenchmarkEncodeAllDict/length-19-level-fastest-dict-1
BenchmarkEncodeAllDict/length-19-level-fastest-dict-1-32         	  413812	      2711 ns/op	   7.01 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-19-level-default-dict-1
BenchmarkEncodeAllDict/length-19-level-default-dict-1-32         	 3166226	       379 ns/op	  50.13 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-19-level-better-dict-1
BenchmarkEncodeAllDict/length-19-level-better-dict-1-32          	 3007521	       402 ns/op	  47.23 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-12131-level-fastest-dict-1
BenchmarkEncodeAllDict/length-12131-level-fastest-dict-1-32      	   22346	     52135 ns/op	 232.69 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-12131-level-default-dict-1
BenchmarkEncodeAllDict/length-12131-level-default-dict-1-32      	   13558	     88066 ns/op	 137.75 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-12131-level-better-dict-1
BenchmarkEncodeAllDict/length-12131-level-better-dict-1-32       	    9999	    112111 ns/op	 108.21 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-210569-level-fastest-dict-1
BenchmarkEncodeAllDict/length-210569-level-fastest-dict-1-32     	    1060	   1113207 ns/op	 189.16 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-210569-level-default-dict-1
BenchmarkEncodeAllDict/length-210569-level-default-dict-1-32     	     901	   1327413 ns/op	 158.63 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-210569-level-better-dict-1
BenchmarkEncodeAllDict/length-210569-level-better-dict-1-32      	     783	   1499363 ns/op	 140.44 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-102605-level-fastest-dict-1
BenchmarkEncodeAllDict/length-102605-level-fastest-dict-1-32     	    2181	    549748 ns/op	 186.64 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-102605-level-default-dict-1
BenchmarkEncodeAllDict/length-102605-level-default-dict-1-32     	    1642	    724117 ns/op	 141.70 MB/s	       0 B/op	       0 allocs/op
BenchmarkEncodeAllDict/length-102605-level-better-dict-1
BenchmarkEncodeAllDict/length-102605-level-better-dict-1-32      	    1363	    817300 ns/op	 125.54 MB/s	       0 B/op	       0 allocs/op
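
As a reading aid for the two listings: the MB/s column can be reproduced from the ns/op column, assuming the `length-N` part of each benchmark name is the payload size in bytes and that 1 MB means 10^6 bytes, which is how Go's `testing` package reports throughput. The helper name `throughputMBps` is hypothetical.

```go
package main

import "fmt"

// throughputMBps converts a payload size in bytes and a time per operation
// in nanoseconds into MB/s, with 1 MB = 1e6 bytes.
// bytes per nanosecond equals 1e9 bytes per second, i.e. 1e3 MB/s.
func throughputMBps(payloadBytes int, nsPerOp float64) float64 {
	return float64(payloadBytes) / nsPerOp * 1e3
}

func main() {
	// Rows copied from the "with dictionary" listing, level-fastest.
	fmt.Printf("length 19:    %.2f MB/s\n", throughputMBps(19, 8351))
	fmt.Printf("length 12131: %.2f MB/s\n", throughputMBps(12131, 59129))
}
```

This is why the 19-byte rows show single-digit MB/s: the fixed cost per call dwarfs the time spent on the payload itself.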

@klauspost
Owner

So the typical setup "price" is around 0.01 ms/operation. An interesting side point is that length-19-level-fastest is relatively slow without the dictionary. Let me check that.
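
One way to read the setup price from the listings is to subtract the no-dictionary ns/op from the dictionary ns/op for the same payload length. The sketch below does this for the `level-fastest` rows only, with numbers copied from the two listings above. It is one possible reading, not necessarily how the figure in the comment was derived; the `result` type and `overheadMs` helper are hypothetical names.

```go
package main

import "fmt"

// result holds ns/op for one payload length at the "fastest" level,
// with and without a dictionary.
type result struct {
	length   int
	withDict float64
	noDict   float64
}

// overheadMs returns the extra time per operation attributable to the
// dictionary, in milliseconds. Negative means the dictionary run was faster.
func overheadMs(r result) float64 {
	return (r.withDict - r.noDict) / 1e6
}

func main() {
	fastest := []result{
		{19, 8351, 2711},
		{12131, 59129, 52135},
		{210569, 1115971, 1113207},
		{102605, 494047, 549748},
	}
	for _, r := range fastest {
		fmt.Printf("length %6d: %+.4f ms/op\n", r.length, overheadMs(r))
	}
}
```

For the first three lengths the difference comes out at roughly 0.003 to 0.007 ms per operation. The 102605-byte row is negative, meaning the dictionary run was faster there, which suggests some run-to-run noise in these measurements.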

@brancz

brancz commented Oct 12, 2022

Apologies for commenting on an old issue, but I wanted to ask here before opening a new one: are there any plans to add the ability to train a zstd dict in Go, or has anyone already written it? I have a use case I'd love to try zstd dicts for, but I would prefer to avoid calling out to the binary.

@klauspost
Owner

Follow up in #682

@brancz

brancz commented Oct 12, 2022

Thanks so much for the super fast reply! I think I can validate whether dicts would work for my use case without that, but if they do and I happen to look into implementing them, I’ll make sure to communicate it in the discussion!
