Parallelized-Compression-Algorithms

Compression algorithms are normally not parallel.
In the modern era of multi-core CPUs, this is a problem.

The goal of this repository is to determine the best algorithm and parameters for a given file type and structure. For example, your application generates a binary file in some format with some data; this library should help you find the algorithm, with parameters, best suited to that file structure.

The repository includes test results for a 2 GB binary file of an ECG signal.

This library contains 3 projects:

  • src/ParallelizeCompression - contains the ICompressor interface, the ParallelWrapper that wraps another compressor and parallelizes its compression and decompression, and a few implementations of compression algorithms (a sketch of the interface follows this list)
  • test/ParallelizeCompression.Tests.Unit - unit tests, mainly for ParallelProcessor, plus some tests for ParallelProcessor wrapping DeflateCompressor
  • test/ParallelizeCompression.Benchmark - a benchmark that compares the speed of the non-parallel and parallel compressors. On my CPU there is mostly a 2-5x speedup in compression and approximately the same decompression speed
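
For orientation, here is a minimal sketch of what the ICompressor abstraction and a Deflate-backed implementation could look like. The ICompressor and DeflateCompressor names come from this README; the exact interface shape in src/ParallelizeCompression is an assumption.

```csharp
// Minimal sketch; the real interface in src/ParallelizeCompression may differ.
using System.IO;
using System.IO.Compression;

public interface ICompressor
{
    void Compress(Stream input, Stream output);
    void Decompress(Stream input, Stream output);
}

// Deflate-backed implementation on top of System.IO.Compression.
public sealed class DeflateCompressor : ICompressor
{
    private readonly CompressionLevel _level;

    public DeflateCompressor(CompressionLevel level = CompressionLevel.Optimal)
        => _level = level;

    public void Compress(Stream input, Stream output)
    {
        // DeflateStream compresses everything copied into it.
        using var deflate = new DeflateStream(output, _level, leaveOpen: true);
        input.CopyTo(deflate);
    }

    public void Decompress(Stream input, Stream output)
    {
        using var deflate = new DeflateStream(input, CompressionMode.Decompress, leaveOpen: true);
        deflate.CopyTo(output);
    }
}
```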

Compression algorithms implemented as Compressors:

  • Deflate (System.IO.Compression from Microsoft)
  • GZip (System.IO.Compression from Microsoft)
  • Deflate (Ionic.Zlib nuget)
  • GZip (Ionic.Zlib nuget)
  • LZ4 (K4os.Compression.LZ4.Streams nuget)
  • Brotli (BrotliSharpLib nuget)
  • Zstandard (Zstandard.Net nuget)

Parameters (a usage sketch follows this list):

  • ChunkSize - size of one chunk in bytes. (Optimal values are single-digit megabytes, e.g. 1 MB, 3 MB, 5 MB.)
  • DegreeOfParallelization - how large the blocking collection is, and therefore how many tasks can run at once. (Optimal is Environment.ProcessorCount.)
  • CompressionLevel - depends on the algorithm used; controls how hard the algorithm tries to make the file smaller at the cost of time.
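
To make these parameters concrete, a hypothetical configuration could look like this. The ParallelWrapper and DeflateCompressor type names come from this README, but the constructor signature and file names are assumptions:

```csharp
// Hypothetical usage -- only the type and parameter names come from this
// README; the constructor signature and file names are assumptions.
using System;
using System.IO;

ICompressor compressor = new ParallelWrapper(
    inner: new DeflateCompressor(),                        // algorithm; CompressionLevel is set inside
    chunkSize: 3 * 1024 * 1024,                            // ChunkSize: 3 MB
    degreeOfParallelization: Environment.ProcessorCount);  // one task slot per core

using var input = File.OpenRead("signal.bin");
using var output = File.Create("signal.bin.cmp");
compressor.Compress(input, output);
```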

How parallelization of compression algorithms works:
Compression (see the sketch after this list):

  1. Read the input stream sequentially, create chunks, and add them to the BlockingCollection.
  2. Start a task for each chunk to compress that chunk.
  3. Take chunks from the BlockingCollection and store them in the output stream, BUT before writing a chunk's output data, first write its length.
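
Here is a minimal, self-contained sketch of this pipeline, assuming Deflate as the inner algorithm. The 4-byte length framing matches the scheme described here; the class and helper names are illustrative, not the library's actual API:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class ParallelCompressSketch
{
    public static void Compress(Stream input, Stream output,
        int chunkSize = 1 << 20, int degreeOfParallelization = 0)
    {
        if (degreeOfParallelization <= 0)
            degreeOfParallelization = Environment.ProcessorCount;

        // Bounded capacity throttles memory: the reader blocks when
        // compression falls behind (see the F.A.Q. below).
        using var pending = new BlockingCollection<Task<byte[]>>(degreeOfParallelization);

        // 1. Read the input sequentially, chunk it, queue one task per chunk.
        var reader = Task.Run(() =>
        {
            var buffer = new byte[chunkSize];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                var chunk = new byte[read];
                Array.Copy(buffer, chunk, read);                   // copy: buffer is reused
                pending.Add(Task.Run(() => CompressChunk(chunk))); // 2. compress in parallel
            }
            pending.CompleteAdding();
        });

        // 3. Consume tasks in queue order; write the 4-byte length prefix
        // before each chunk's compressed data.
        foreach (var task in pending.GetConsumingEnumerable())
        {
            var compressed = task.Result;
            var lengthPrefix = BitConverter.GetBytes(compressed.Length);
            output.Write(lengthPrefix, 0, lengthPrefix.Length);
            output.Write(compressed, 0, compressed.Length);
        }
        reader.Wait();
    }

    private static byte[] CompressChunk(byte[] chunk)
    {
        using var ms = new MemoryStream();
        using (var deflate = new DeflateStream(ms, CompressionLevel.Optimal, leaveOpen: true))
            deflate.Write(chunk, 0, chunk.Length);
        return ms.ToArray();
    }
}
```

Because tasks are added to and taken from the collection in the same order, chunks reach the output in their original order even though they are compressed in parallel.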

Decompression (see the sketch after this list):

  1. Read the input stream sequentially (the first 4 bytes are the length of the first chunk), create chunks, and add them to the BlockingCollection.
  2. Start a task for each chunk to decompress that chunk.
  3. Take chunks from the BlockingCollection and store them in the output stream.
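
A matching decompression sketch, under the same assumptions as above:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

public static class ParallelDecompressSketch
{
    public static void Decompress(Stream input, Stream output,
        int degreeOfParallelization = 0)
    {
        if (degreeOfParallelization <= 0)
            degreeOfParallelization = Environment.ProcessorCount;

        using var pending = new BlockingCollection<Task<byte[]>>(degreeOfParallelization);

        // 1. Read the 4-byte length, then that many bytes, per chunk.
        var reader = Task.Run(() =>
        {
            var lengthBuffer = new byte[4];
            while (TryReadExactly(input, lengthBuffer, 4))
            {
                int length = BitConverter.ToInt32(lengthBuffer, 0);
                var chunk = new byte[length];
                if (!TryReadExactly(input, chunk, length))
                    throw new EndOfStreamException("Truncated chunk.");
                pending.Add(Task.Run(() => DecompressChunk(chunk))); // 2. decompress in parallel
            }
            pending.CompleteAdding();
        });

        // 3. Write decompressed chunks to the output in their original order.
        foreach (var task in pending.GetConsumingEnumerable())
        {
            var data = task.Result;
            output.Write(data, 0, data.Length);
        }
        reader.Wait();
    }

    // Reads exactly count bytes; returns false only on a clean end of stream.
    private static bool TryReadExactly(Stream stream, byte[] buffer, int count)
    {
        int total = 0;
        while (total < count)
        {
            int read = stream.Read(buffer, total, count - total);
            if (read == 0)
            {
                if (total == 0) return false;
                throw new EndOfStreamException("Truncated length prefix or chunk.");
            }
            total += read;
        }
        return true;
    }

    private static byte[] DecompressChunk(byte[] chunk)
    {
        using var compressed = new MemoryStream(chunk);
        using var deflate = new DeflateStream(compressed, CompressionMode.Decompress);
        using var result = new MemoryStream();
        deflate.CopyTo(result);
        return result.ToArray();
    }
}
```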

This way, we add a little overhead by storing a 4-byte length for each chunk, but we gain a lot of speed by parallelizing the compression workload, and we do not lose any decompression speed. We can lower the overhead of those lengths by increasing the chunk size. For the fastest speed and smallest overhead, use:
ChunkSize = FileSize / CPUCount
(e.g., a 2 GB file on an 8-core CPU gives 256 MB chunks)

F.A.Q.:
Q: Why use a BlockingCollection?
A: A bounded BlockingCollection throttles memory usage. Without it, we could easily run out of memory when compressing really large files (3 GB+). In practice it rarely throttles anything, because compressing and decompressing are much slower than file reads/writes, so the real bottleneck is the tasks doing the compression/decompression.

Q: This is useless. Compressed files from this library are not readable by any other software! Why?
A: Because we bend the compression algorithms by compressing the original file in independent chunks. So this library is only usable in projects that do both the compression and the decompression themselves. You can't compress with it in your software and have someone else decompress the output elsewhere with standard tools.
