
Decompression performance #89

Closed
mxmlnkn opened this issue Jan 15, 2022 · 6 comments
mxmlnkn commented Jan 15, 2022

Hello,

I noticed that indexed_gzip seems to be 10 times slower than archivemount and fuse-archive when decoding large gzipped TARs with ratarmount. I would have thought that all of them use zlib, which makes this surprising to me. Are you compiling with -O3? If not, would that help? Or are there other bottlenecks, like storing the windows? I just think that closing this 10x gap would be really worthwhile. I'm pretty sure that it isn't ratarmount that is at fault here, but some benchmarks using only indexed_gzip, or even the C backend directly, might still be helpful for performance "debugging".

[Figure: archivemount comparison benchmark plot]

mxmlnkn commented Jan 15, 2022

Archivemount does not use zlib directly but libarchive, hence the name. But libarchive itself seems to use zlib, which makes some of the timing differences even weirder.

Here are some more low-level timings for decompressing 397 MiB of gzipped data to 512 MiB of base64-encoded random data (so basically pure Huffman coding with only a few LZ77 back-references):

base64 /dev/urandom | head -c $(( 512*1024*1024 )) > small
gzip -k small
tar -cf small.tar small
gzip small.tar

Running my self-written benchmark inside the indexed_bzip2 repository:

Decompressing with zlib: min: 2.19763s max: 2.20379 s,  min: 189.079 MB/s max: 189.608 MB/s
Decompressing with libarchive: min: 1.97612s max: 1.98285 s,  min: 210.147 MB/s max: 210.863 MB/s
Decompressing with custom decoder (pragzip): min: 8.78324s max: 9.04052 s,  min: 46.0913 MB/s max: 47.4414 MB/s

Some benchmarks via the command line:

python3 -c 'import indexed_gzip as igz; import time; t0 = time.time(); igz.open("small.gz").read(); print(f"Decompression took {time.time() - t0:.3f} s")'
    Decompression took 4.594 s
python3 -c 'import gzip; import time; t0 = time.time(); gzip.open("small.gz").read(); print(f"Decompression took {time.time() - t0:.3f} s")'
    Decompression took 3.069 s
time python3 -m pgzip -o - small.gz | wc -c
    real	0m3.289s
time archivemount small.tar.gz mountpoint/
    real	0m1.657s
time ( fuse-archive small.tar.gz mountpoint && stat mountpoint &>/dev/null; )
    real	0m1.697s
time gzip -d -k -c small.gz | wc -c
    real	0m3.510s
time pigz -d -k -c small.gz | wc -c
    real	0m2.225s
time bgzip -d -c small.gz | wc -c
    real	0m2.548s
time igzip -d -c small.gz | wc -c
    real	0m1.523s
time minigzip-zlib-ng -d -k -c small.gz | wc -c
    real	0m1.903s
time gunzip-libdeflate -d -k -c small.gz | wc -c
    real	0m1.841s

time pugz small.gz | wc -c
    real	0m2.372s
time pugz -t 4 small.gz | wc -c
    using 4 threads for decompression (experimental)
    real	0m0.975s
time pugz -t $( nproc ) small.gz | wc -c
    using 24 threads for decompression (experimental)
    real	0m0.985s

Some observations:

  • Pugz looks very promising but does not work for binary files, and the speedup shown here on a 12-core/24-thread machine is a bit underwhelming; I think there is more potential in speeding up the deflate block detection.
  • igzip is amazingly fast: even faster than the weird results with archivemount, and only ~50% slower than pugz using 24 logical cores.
  • The timings for fuse-archive and archivemount look weird. They are even 15% faster than my manual usage of libarchive.
  • Gzip 1.10 is surprisingly slow. When using zlib and libarchive from C++, both are similarly fast, within about 10% of each other.
  • indexed_gzip is still more than twice as slow as everything except gzip itself and my prototype custom gzip decoder.
  • The slowness of my custom decoder is kinda disappointing as I already optimized it a bit, although I didn't implement a Huffman decoding lookup table yet. But there sure are a lot of intricate optimizations for gzip...

Repeating the benchmarks with the tarred and gzipped Silesia corpus, as used by lzbench:

make benchmarkGzip && tests/benchmarkGzip silesia.tar.gz
    Decompressing 68238807 B to 211957760 B with zlib: min: 0.643727s max: 0.652675 s,  min: 104.553 MB/s max: 106.006 MB/s
    Decompressing 68238807 B to 211957760 B with libarchive: min: 0.545091s max: 0.554049 s,  min: 123.164 MB/s max: 125.188 MB/s
    Decompressing 68238807 B to 211957760 B with custom decoder (pragzip): min: 0.127799s max: 0.129439 s,  min: 527.189 MB/s max: 533.956 MB/s

python3 -c 'import indexed_gzip as igz; import time; t0 = time.time(); igz.open("silesia.tar.gz").read(); print(f"Decompression took {time.time() - t0:.3f} s")'
    Decompression took 1.249 s
python3 -c 'import gzip; import time; t0 = time.time(); gzip.open("silesia.tar.gz").read(); print(f"Decompression took {time.time() - t0:.3f} s")'
    Decompression took 0.908 s
time python3 -m pgzip -o - silesia.tar.gz | wc -c
    real	0m1.837s
time archivemount small.tar.gz mountpoint/
    real	0m1.663s
time ( fuse-archive small.tar.gz mountpoint && stat mountpoint &>/dev/null; )
    real	0m1.629s
time gzip -d -k -c silesia.tar.gz | wc -c
    real	0m1.000s
time pigz -d -k -c silesia.tar.gz | wc -c
    real	0m0.622s
time bgzip -d -c silesia.tar.gz | wc -c
    real	0m0.700s
time igzip -d -c silesia.tar.gz | wc -c
    real	0m0.357s
time pugz silesia.tar.gz | wc -c
    terminate called after throwing an instance of 'gzip_error'
      what():  INVALID_LITERAL
    0

    real	0m0.078s
    user	0m0.005s
    sys	0m0.006s

time minigzip-zlib-ng -d -k -c silesia.tar.gz | wc -c
    real	0m0.443s
time gunzip-libdeflate -d -k -c silesia.tar.gz | wc -c
    real	0m0.403s
time crc32 silesia.tar
    78e42bf0
    real	0m0.236s

Some observations on Silesia decompression:

  • The Silesia corpus has a much higher compression ratio of ~3.1 instead of the ~1.29 of the base64 test. This probably means that Huffman decoding is no longer the only bottleneck; LZ77 back-references also play a role.
  • igzip is still the fastest, which is awe-inspiring, although the difference to libdeflate and zlib-ng isn't that large anymore.
  • gzip is still pretty slow compared to the others.
  • zlib is almost twice as slow as igzip.
  • indexed_gzip is still twice as slow as zlib.
  • My custom decoder is three times faster than igzip, which looks like a bug, but since it got the decompressed size right, it seems to be working correctly. I assume it is faster because it skips the CRC32 calculation, which would mean that the CRC32 calculation is actually the bottleneck for all the others. It also skips writing out the results and only works on the 32 KiB decoding window, which might improve caching behavior. I don't really think that copying the data out of that window would slow things down much, though: memcpy reaches speeds in the range of 10 GB/s, i.e., it would take ~0.02 s for the uncompressed ~203 MiB of silesia.tar, which is almost negligible. Writing to stdout and piping to wc might be slower, but note that the zlib and libarchive benchmarks run inside benchmarkGzip also do not write their output to stdout, so those should be a fairer comparison, and even compared to those it is 5x faster. Very weird, especially as it was 8x slower in the base64 (Huffman-decoding-bound) benchmark.
  • The CRC32 calculation of the decoded archive takes 0.24 s compared to igzip taking 0.36 s, which indeed seems to indicate that CRC32 calculation is the bottleneck for the Silesia benchmark.
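To sanity-check whether CRC32 computation alone can account for that kind of runtime, one can time zlib.crc32 over a buffer of comparable size. A minimal sketch; the buffer here is synthetic and its size illustrative, not the actual silesia.tar:

```python
import time
import zlib

# Synthetic stand-in for the decompressed archive (not the real silesia.tar).
data = b"\x5a" * (16 * 1024 * 1024)  # 16 MiB of constant bytes

t0 = time.time()
checksum = zlib.crc32(data)
elapsed = time.time() - t0

mib = len(data) / 1024**2
print(f"CRC32 over {mib:.0f} MiB took {elapsed:.3f} s "
      f"({mib / max(elapsed, 1e-9):.0f} MiB/s)")
```

Dividing the archive size by the measured MiB/s gives a rough lower bound on the time any CRC-checking decoder must spend.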

pauldmccarthy commented Jan 17, 2022

Hi @mxmlnkn, I haven't spent any time optimising the indexed_gzip codebase, so I'm sure performance could be improved. The core of indexed_gzip is essentially an adaptation of the original zran.c example that ships with the zlib source code. It bears little resemblance to the original version, but it does share the same fundamental design for reading/inflation.

The conda-forge packages are built with -O2. I think GHA has deleted the logs for the pypi builds, so I'm not sure about them.

However, using indexed_gzip purely for a one-off decompression isn't its intended usage, so I'm not sure these comparisons make a lot of sense. The point of indexed_gzip is improved random seek time after an index has been built.

(edited - added example code)

If you are interested in "raw" performance, you could adjust your benchmark code to disable the CRC32 check, increase the read buffer sizes, and use a seek point spacing larger than the size of the uncompressed data, which effectively disables creation of the index, e.g.:

import indexed_gzip as igzip
igzip.IndexedGzipFile('file.gz', spacing=2**30, readbuf_size=2**30, buffer_size=2**30).read()

This should cause the read to be performed via a single call to the zlib inflate function.
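For reference, a whole gzip stream can indeed be decoded through zlib in one shot. A minimal Python sketch of what such a single-inflate read amounts to (the payload is made up for illustration):

```python
import gzip
import zlib

# wbits=31 (= zlib.MAX_WBITS | 16) tells zlib to expect a full gzip
# stream, header and trailer included, so one call decodes everything.
payload = b"some example payload " * 1000
compressed = gzip.compress(payload)

decompressed = zlib.decompress(compressed, wbits=zlib.MAX_WBITS | 16)
assert decompressed == payload
```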

mxmlnkn commented Jan 17, 2022

I get some very weird results when testing these parameters. When I increase the buffers, or even just the spacing, it takes a lot more time. This might explain the 10x slower speed I observed with ratarmount! I think I changed the default spacing there to 16 MiB.

python3 -m pip install --user pgzip indexed_gzip
python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.open("small.gz").read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 4.666 s

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.IndexedGzipFile("small.gz", spacing=2**30, readbuf_size=2**30, buffer_size=2**30).read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Floating point exception

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.open("small.gz", spacing=int(1*1024**2), readbuf_size=int(1*1024**2), buffer_size=int(1*1024**2)).read(); 
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 4.286 s

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.open("small.gz", spacing=int(32*1024**2), readbuf_size=int(1*1024**2), buffer_size=int(1*1024**2)).read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 7.407 s

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.open("small.gz", spacing=int(128*1024**2), readbuf_size=int(1024**2), buffer_size=int(128*1024**2)).read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 13.290 s

python3 -c '
import indexed_gzip as igz; import time; t0 = time.time();
igz.IndexedGzipFile("small.gz", spacing=16*1024**2).read();
print(f"Decompression took {time.time() - t0:.3f} s")'
#    Decompression took 6.403 s

Note also the weird floating point exception when I try the large buffer/spacing sizes you recommended.

Also, I think there might be a problem with the buffers, because even 512 MiB buffers should not slow a program down that much. Except maybe if they are being cleared to zeros without using memset, or something like that...

pauldmccarthy commented
That does sound strange - I'll look into it when I get a minute...

mxmlnkn commented Feb 19, 2022

So I did take a look at the code and ran some benchmarks, and I arrived at the conclusion that the underlying problem is that zran_read always seeks from an index point, even if no seek was actually requested since the last call to zran_read. But I don't feel up to the task of fixing it, because that would mean checking when seeking is actually necessary and also storing the zlib state of the last call. I'm not sure whether this information is already available or needs to be added somewhere.

As a quick workaround, zran_read should only be called to decode buffers much larger than the index spacing, which should be easily doable in Python by wrapping IndexedGzipFile in an io.BufferedReader with a buffer size of roughly 2-10x the index spacing.
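A minimal sketch of that wrapping; gzip.GzipFile stands in for indexed_gzip.IndexedGzipFile so the snippet is self-contained, and the spacing value is hypothetical — the io.BufferedReader wrapping is identical for either class:

```python
import gzip
import io

spacing = 4 * 1024 * 1024  # hypothetical index spacing (4 MiB)

# Compressed test data held in memory; a real use case would open a file.
raw = io.BytesIO(gzip.compress(b"x" * (1024 * 1024)))
fileobj = gzip.GzipFile(fileobj=raw)  # stand-in for IndexedGzipFile

# The BufferedReader turns many small reads by the caller into few large
# reads against the underlying decompressor, avoiding repeated re-seeks.
buffered = io.BufferedReader(fileobj, buffer_size=4 * spacing)

data = buffered.read()
assert data == b"x" * (1024 * 1024)
```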

The above explanation still does not explain why runtimes get worse for larger readbuf_size. But I saw a lot of heuristics like data_size = 4 * spacing in zran_expand_index, which shouldn't even be necessary; a hardcoded data_size = 128 * 1024 should also work and be faster and more memory-efficient. Similarly, there were a lot of usages of calloc where I think malloc would have sufficed. I'm not sure what the performance difference between the two is, but calloc sounds slower.

Here are some benchmarks I did for varying configurations. It can be seen that 4 KiB reads even slow down Python's GzipFile, but anything from 32 KiB upward is a sufficiently large read for optimal speed. This is in stark contrast to IndexedGzipFile, which might lead to performance bugs like the one I observed with ratarmount. With large enough reads (in the limit, reading the whole sample file at once), the speeds are good, though.

1.086  s Read 4      KiB chunks from GzipFile                                                                            
0.817  s Read 32     KiB chunks from GzipFile                                                                            
0.774  s Read 128    KiB chunks from GzipFile                                                                            
0.760  s Read 1024   KiB chunks from GzipFile                                                                            
0.811  s Read 32768  KiB chunks from GzipFile                                                                            
0.811  s Read 131072 KiB chunks from GzipFile                                                                            
         
1.722  s Read 4      KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
1.487  s Read 32     KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
1.497  s Read 128    KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
1.470  s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
1.032  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
0.739  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)                                       
         
1.788  s Read 4      KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
1.467  s Read 32     KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
1.454  s Read 128    KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
1.450  s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
1.036  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
0.739  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)                                      
         
9.044  s Read 4      KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
8.957  s Read 32     KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
8.902  s Read 128    KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
8.854  s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
1.413  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
0.733  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)                                  
         
9.056  s Read 4      KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
8.824  s Read 32     KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
8.834  s Read 128    KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
8.886  s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
1.415  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 
0.736  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)                                 

13.596 s Read 4      KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
13.536 s Read 32     KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
13.491 s Read 128    KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
13.500 s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
1.697  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
0.787  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)  
         
13.745 s Read 4      KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
13.518 s Read 32     KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
13.451 s Read 128    KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
13.468 s Read 1024   KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
1.669  s Read 32768  KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
0.770  s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size) 
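The chunked-read benchmark above can be sketched roughly as follows. gzip.GzipFile again stands in for IndexedGzipFile so the snippet is self-contained; the test data and chunk sizes are illustrative, and only the constructor arguments would differ for the real class:

```python
import gzip
import io
import time

def read_in_chunks(fileobj, chunk_size):
    """Read the whole file in fixed-size chunks; return (seconds, total bytes)."""
    t0 = time.time()
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return time.time() - t0, total

compressed = gzip.compress(b"ab" * (512 * 1024))  # 1 MiB of test data
for chunk_size in (4 * 1024, 32 * 1024, 1024 * 1024):
    gz = gzip.GzipFile(fileobj=io.BytesIO(compressed))  # stand-in for IndexedGzipFile
    elapsed, total = read_in_chunks(gz, chunk_size)
    print(f"{elapsed:.3f} s for {chunk_size // 1024:6d} KiB chunks ({total} B)")
```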

mxmlnkn commented Feb 19, 2022

I saw that you are already using a BufferedReader and that there even is a buffer_size argument for IndexedGzipFile. In #90 I added a better default for when it has not been specified, which increases performance by ~10x. Here are the results for buffer_size = 4 * spacing.


1.157 s Read      4 KiB chunks from GzipFile
0.816 s Read     32 KiB chunks from GzipFile
0.781 s Read    128 KiB chunks from GzipFile
0.773 s Read   1024 KiB chunks from GzipFile
0.811 s Read  32768 KiB chunks from GzipFile
0.812 s Read 131072 KiB chunks from GzipFile

1.369 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
1.181 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
1.162 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
1.167 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
1.026 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)
0.728 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=0)

1.321 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
1.169 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
1.150 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
1.157 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
1.040 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)
0.723 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=0)

1.028 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.761 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.741 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.736 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.783 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)
0.734 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB)

1.044 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.766 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.743 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.735 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.780 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)
0.732 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB)

1.090 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.807 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.783 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.781 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.814 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)
0.758 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=True, spacing=32 MiB, readbuf_size=uncompressed_size)

1.056 s Read      4 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.803 s Read     32 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.775 s Read    128 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.773 s Read   1024 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.825 s Read  32768 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)
0.775 s Read 131072 KiB chunks from IndexedGzipFile(drop_handles=False, spacing=32 MiB, readbuf_size=uncompressed_size)

mxmlnkn closed this as completed Jun 2, 2022