Multithreading support #159
For doing what, exactly?
Some codecs may lend themselves to parallel processing of different chunks of data in the same file. If those chunks could be handled simultaneously by different cores, overall processing time could be reduced. Alternatively, different files in the same archive could be handled in parallel, or other variations. For example, when creating an archive from 8 files, compressing 4 or more of them at the same time on different CPU cores could cut the overall compression time considerably. Of course, it will also depend on other factors, such as disk I/O, available RAM, etc. Parallelizing multiple chunks of the same file could increase the complexity of the code by a rather large amount, but handling multiple files at once may be trivial (possibly as simple as using a separate goroutine for each).
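The "separate goroutine for each file" idea could be sketched roughly as below. This is a minimal illustration, not archiver's actual code: `compressAll` is a hypothetical helper, and it compresses in-memory buffers with the standard library's `compress/gzip` purely to keep the example self-contained (a real archiver would stream from disk and write into one archive, which needs coordination this sketch skips).

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"sync"
)

// compressAll gzips each input in its own goroutine — the
// "one goroutine per file" approach from the discussion.
// Each goroutine writes to its own slot in out, so no locking is needed.
func compressAll(inputs [][]byte) [][]byte {
	out := make([][]byte, len(inputs))
	var wg sync.WaitGroup
	for i, data := range inputs {
		wg.Add(1)
		go func(i int, data []byte) {
			defer wg.Done()
			var buf bytes.Buffer
			zw := gzip.NewWriter(&buf)
			zw.Write(data) // error handling omitted in this sketch
			zw.Close()
			out[i] = buf.Bytes()
		}(i, data)
	}
	wg.Wait()
	return out
}

func main() {
	inputs := [][]byte{
		bytes.Repeat([]byte("a"), 1024),
		bytes.Repeat([]byte("b"), 1024),
	}
	compressed := compressAll(inputs)
	fmt.Println(len(compressed)) // → 2
}
```

Whether this actually helps depends on the bottleneck: if disk I/O dominates, extra cores buy little.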
Sure, pull requests welcomed!
https://github.com/klauspost/pgzip should be a drop-in replacement for compress/gzip.
@klauspost That's really cool! I am willing to replace gzip with pgzip, but I notice that it should only be used when working with > 1 MB. The way archiver is designed makes it hard to toggle which package to use between the time we can know the file size (and we only know that sometimes, since we're working with streams most of the time) and the time that we start the encoding/decoding work. Is it really inadvisable to just use pgzip across the board? I imagine it wouldn't be too bad, since < 1 MB will still be fast no matter what, right?
If there's a large enough difference in speed vs file size, there's no reason you couldn't just check the file size and use whichever one fits better.
@JalonSolov Unfortunately it's not that simple, since the file size is not exposed to the part of the code that is concerned with setting up the compressor/decompressor. And sometimes we don't know the file size at all, because it's just a stream.
In terms of compression there is no (significant) drawback, and in terms of speed there is a "startup fee", but I can't see it affecting real-world use. So yeah, unless you are doing many thousands of small files, it shouldn't be a problem. Do note that the compression levels are rebalanced compared to stdlib. This makes the "level" adjustments much more natural and useful, but reduces compression at "default" for better balance. See more here: https://blog.klauspost.com/rebalancing-deflate-compression-levels/ I also have a ~25% speed increase brewing for the lowest compression modes, but that still needs some testing.
@klauspost Alright, great! I'll make pgzip the default, with an option to manually disable multithreading if a user is using the package in a tight loop for small files. I tested the difference on a 45 GB file. Before (compress/gzip):
After (klauspost/pgzip):
That's a 90% reduction! Going to commit this and do a release. Thanks @klauspost!
@klauspost helped add multithreading support in pierrec/lz4#55 too!
Closing, as we have some parallel compression implemented now thanks to @klauspost. And ultimately I think the multicore stuff would have to be implemented in other repos anyway, rather than this one. If there's a specific request to parallelize something in this repo, we can either reopen this issue or make a new issue for it. Thanks!
I noticed that archiver uses only one CPU core. That may be a problem when processing large data. Do you have ideas about how to add multithreading support?