Multithreading support #159

Closed · AlexAkulov opened this issue Apr 5, 2019 · 11 comments

@AlexAkulov

I noticed that archiver uses only one CPU core. That can be a problem when processing large amounts of data. Do you have any ideas about how to add multithreading support?

@mholt (Owner) commented Apr 5, 2019

For doing what, exactly?

@FerdieBerfle

Some codecs may lend themselves to parallel processing of different chunks of data in the same file. If those chunks could be handled simultaneously by different cores, overall processing time could be reduced. Perhaps just handling different files in the same archive, or other variations.

As an example, when creating an archive from 8 files, compressing 4 or more of them at the same time on different CPU cores could cut the overall compression time considerably.

Of course, it will also depend on other factors, such as disk I/O, available RAM, etc.

It could also increase the complexity of the code by a rather large amount if processing multiple chunks of the same file, but it may be trivial to handle multiple files at once (possibly as simple as using a separate goroutine for each).
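
To illustrate the multiple-files idea, here is a minimal standalone sketch (not archiver's actual code; the file names and the one-.gz-per-file output scheme are assumptions for the example), compressing each input in its own goroutine:

```go
package main

import (
	"compress/gzip"
	"io"
	"log"
	"os"
	"sync"
)

// compressFile gzips one input file to <path>.gz.
func compressFile(path string) error {
	in, err := os.Open(path)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(path + ".gz")
	if err != nil {
		return err
	}
	defer out.Close()

	zw := gzip.NewWriter(out)
	if _, err := io.Copy(zw, in); err != nil {
		return err
	}
	return zw.Close() // flushes any buffered compressed data
}

func main() {
	files := []string{"a.bin", "b.bin", "c.bin", "d.bin"} // hypothetical inputs

	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go func(name string) { // one goroutine per file
			defer wg.Done()
			if err := compressFile(name); err != nil {
				log.Printf("compress %s: %v", name, err)
			}
		}(f)
	}
	wg.Wait()
}
```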

@mholt (Owner) commented Apr 25, 2019

Sure, pull requests welcomed!

@klauspost (Contributor)

https://github.com/klauspost/pgzip should be a drop-in replacement for gzip :)
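
Since pgzip mirrors the compress/gzip API, swapping it in is usually just an import change. A minimal sketch (file names are illustrative):

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/pgzip" // swap for "compress/gzip"
)

func main() {
	in, err := os.Open("myfile")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	out, err := os.Create("myfile.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	zw := pgzip.NewWriter(out) // same signature as gzip.NewWriter
	if _, err := io.Copy(zw, in); err != nil {
		log.Fatal(err)
	}
	if err := zw.Close(); err != nil { // Close flushes; don't skip its error
		log.Fatal(err)
	}
}
```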

@mholt (Owner) commented Jun 11, 2019

@klauspost That's really cool! I am willing to replace gzip with pgzip, but I notice that it should only be used when working with > 1 MB. The way archiver is designed makes it hard to toggle which package to use between the time we can know the file size (and we only know that sometimes, since we're working with streams most of the time) and the time that we start the encoding/decoding work.

Is it really inadvisable to just use pgzip across the board? I imagine it wouldn't be too bad since < 1 MB will still be fast no matter what, right?

@JalonSolov

If there's a large enough difference in speed vs file size, there's no reason you couldn't just check the file size and use whichever one fits better.
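
A sketch of that suggestion, assuming a hypothetical helper newGzipWriter, a 1 MB cutoff taken from the guidance above, and a negative size meaning "unknown, e.g. a stream":

```go
package compressutil

import (
	"compress/gzip"
	"io"

	"github.com/klauspost/pgzip"
)

const parallelCutoff = 1 << 20 // 1 MB, per the guidance above

// newGzipWriter is a hypothetical selector: stdlib gzip for known-small
// inputs, pgzip for large or unknown-size (size < 0) inputs.
func newGzipWriter(w io.Writer, size int64) io.WriteCloser {
	if size >= 0 && size < parallelCutoff {
		return gzip.NewWriter(w) // small file: skip goroutine startup cost
	}
	return pgzip.NewWriter(w) // large or unknown: parallel compression pays off
}
```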

@mholt (Owner) commented Jun 22, 2019

@JalonSolov unfortunately it's not that simple, since the file size is not exposed to the part of the code that is concerned with setting up the compressor/decompressor. And sometimes we don't know the file size at all because it's just a stream.

@klauspost (Contributor)

> Is it really inadvisable to just use pgzip across the board? I imagine it wouldn't be too bad since < 1 MB will still be fast no matter what, right?

In terms of compression there is no (significant) drawback, and in terms of speed there is a "startup fee", but I can't see it influencing real-world use. So yeah, unless you are doing many thousands of small files, it shouldn't be a problem.

Do note that the compression levels are rebalanced compared to the stdlib. This makes the "level" adjustments much more natural and useful, but reduces compression at "default" for better balance. See more here: https://blog.klauspost.com/rebalancing-deflate-compression-levels/

I also have a ~25% speed increase brewing for the lowest compression modes, but that still needs some testing.
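
Because the level semantics are rebalanced, callers who want a specific trade-off can pass a level explicitly instead of relying on the default. A minimal sketch using pgzip's gzip-compatible level constants (file name is illustrative):

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/pgzip"
)

func main() {
	out, err := os.Create("myfile.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Ask for a level explicitly; DefaultCompression's meaning is
	// rebalanced relative to compress/gzip, as described above.
	zw, err := pgzip.NewWriterLevel(out, pgzip.BestCompression)
	if err != nil {
		log.Fatal(err)
	}
	defer zw.Close()

	if _, err := io.Copy(zw, os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```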

@mholt (Owner) commented Jun 23, 2019

@klauspost Alright, great! I'll make pgzip the default, with an option to manually disable multithreading if a user is using the package in a tight loop for small files.

I tested the difference on a 45 GB file.

Before (compress/gzip):

```
$ time ./arc compress myfile gz
real    21m7.828s
user    18m7.737s
sys     4m10.169s
```

After (klauspost/pgzip):

```
$ time ./arc compress myfile gz
real    2m33.008s
user    14m4.392s
sys     1m17.194s
```

That's nearly a 90% reduction in real time!

Going to commit this and do a release. Thanks @klauspost!
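
That "disable multithreading" option could be built on pgzip's SetConcurrency knob. A minimal sketch (hypothetical helper; the 1 MB block size is illustrative, not archiver's actual default):

```go
package compressutil

import (
	"io"

	"github.com/klauspost/pgzip"
)

// newSerialGzipWriter is a hypothetical constructor for the "multithreading
// disabled" case: blocks == 1 keeps pgzip from fanning out across cores,
// which suits tight loops over many small files.
func newSerialGzipWriter(w io.Writer) (*pgzip.Writer, error) {
	zw := pgzip.NewWriter(w)
	// SetConcurrency(blockSize, blocks) controls how much data is
	// compressed in parallel; the 1 MB block size here is illustrative.
	if err := zw.SetConcurrency(1<<20, 1); err != nil {
		return nil, err
	}
	return zw, nil
}
```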

@AlexAkulov (Author) commented Nov 18, 2019

@klauspost helped add multithreading support to pierrec/lz4 too (pierrec/lz4#55)!
Please update github.com/pierrec/lz4 in go.mod to v3.2.0 to activate this feature.
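
In go.mod terms the bump would look something like the line below (a sketch: because the v3 tags keep the original module path, the toolchain would record the version with an +incompatible suffix, assuming those tags don't ship their own go.mod):

```
require github.com/pierrec/lz4 v3.2.0+incompatible
```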

@mholt (Owner) commented Jan 2, 2022

Closing, as we have some parallel compression implemented now thanks to @klauspost. And ultimately I think the multicore stuff would have to be implemented in other repos anyway, rather than this one. If there's a specific request to parallelize something in this repo, we can either reopen this issue or make a new issue for it. Thanks!

@mholt closed this as completed Jan 2, 2022