Multithreading support #159
For doing what, exactly?
Some codecs may lend themselves to parallel processing of different chunks of data in the same file. If those chunks could be handled simultaneously by different cores, overall processing time could be reduced. Alternatively, different files in the same archive could be handled in parallel, or other variations. For example, when creating an archive from 8 files, compressing 4 or more of them at the same time on different CPU cores could cut the overall compression time considerably. Of course, it will also depend on other factors, such as disk I/O, available RAM, etc. Parallelizing multiple chunks of the same file could increase the complexity of the code by a rather large amount, but handling multiple files at once may be trivial (possibly as simple as using a separate goroutine for each).
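The "separate goroutine for each file" idea could be sketched roughly as below. This is a minimal illustration, not archiver's actual code: `compressAll` is a hypothetical helper, and it compresses in-memory buffers with the standard library's `compress/gzip` purely to keep the example self-contained (a real archiver would stream from disk and write into one archive, which needs coordination this sketch skips).

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"sync"
)

// compressAll gzips each input in its own goroutine — the
// "one goroutine per file" approach from the discussion.
// Each goroutine writes to its own slot in out, so no locking is needed.
func compressAll(inputs [][]byte) [][]byte {
	out := make([][]byte, len(inputs))
	var wg sync.WaitGroup
	for i, data := range inputs {
		wg.Add(1)
		go func(i int, data []byte) {
			defer wg.Done()
			var buf bytes.Buffer
			zw := gzip.NewWriter(&buf)
			zw.Write(data) // error handling omitted in this sketch
			zw.Close()
			out[i] = buf.Bytes()
		}(i, data)
	}
	wg.Wait()
	return out
}

func main() {
	inputs := [][]byte{
		bytes.Repeat([]byte("a"), 1024),
		bytes.Repeat([]byte("b"), 1024),
	}
	compressed := compressAll(inputs)
	fmt.Println(len(compressed)) // → 2
}
```

Whether this actually helps depends on the bottleneck: if disk I/O dominates, extra cores buy little.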
Sure, pull requests welcomed!
https://github.com/klauspost/pgzip should be a drop-in replacement for compress/gzip.
@klauspost That's really cool! I am willing to replace gzip with pgzip, but I notice that it should only be used when working with > 1 MB. The way archiver is designed makes it hard to toggle which package to use between the time we can know the file size (and we only know that sometimes, since we're working with streams most of the time) and the time that we start the encoding/decoding work. Is it really inadvisable to just use pgzip across the board? I imagine it wouldn't be too bad, since < 1 MB will still be fast no matter what, right?
If there's a large enough difference in speed vs file size, there's no reason you couldn't just check the file size and use whichever one fits better.
@JalonSolov Unfortunately it's not that simple, since the file size is not exposed to the part of the code that is concerned with setting up the compressor/decompressor. And sometimes we don't know the file size at all, because it's just a stream.
In terms of compression there is no (significant) drawback, and in terms of speed there is a "startup fee", but I can't see it affecting real-world use. So yeah, unless you are doing many thousands of small files, it shouldn't be a problem. Do note that the compression levels are rebalanced compared to stdlib. This makes the "level" adjustments much more natural and useful, but reduces compression at "default" for better balance. See more here: https://blog.klauspost.com/rebalancing-deflate-compression-levels/ I also have a ~25% speed increase brewing for the lowest compression modes, but that still needs some testing.
@klauspost Alright, great! I'll make pgzip the default, with an option to manually disable multithreading if a user is using the package in a tight loop for small files. I tested the difference on a 45 GB file. Before (compress/gzip):
After (klauspost/pgzip):
That's a 90% reduction! Going to commit this and do a release. Thanks @klauspost!
@klauspost helped add multithreading support in pierrec/lz4#55 too!
Closing, as we have some parallel compression implemented now thanks to @klauspost. And ultimately I think the multicore stuff would have to be implemented in other repos anyway, rather than this one. If there's a specific request to parallelize something in this repo, we can either reopen this issue or make a new issue for it. Thanks!
I noticed that archiver uses only one CPU core. That may be a problem when processing large data. Do you have ideas about how to add multithreading support?