decompressing individual .gz files in a directory (compression without tar envelope) #5

pspacek · 2022-02-23T14:49:30Z

Hello,

This is more of a question than an issue. First, let me describe my use-case:

I have a directory full of .gz files with vastly differing sizes - from 159 bytes to 4.4 GiB per file:

$ ls -lSh /data/czds/zonefiles/
-rw-r--r-- 1 pspacek pspacek  4.4G Feb 23 14:04 com.txt.gz
...
-rw-r--r-- 1 pspacek pspacek   159 Feb 23 14:26 xn--cg4bki.txt.gz

The use-case would be to "mount" the source directory and then transparently decompress the .gz files in it, so I can run a utility on it which requires seeking in the files (and thus cannot simply use gzip output piped in via stdin).

This is not supported (I did not expect it to work, but was curious about error handling :-)):

$ ./my-fuse-archive /data/czds/zonefiles /tmp
fuse-archive: could not open /data/czds/zonefiles: fuse-archive: could not read archive file

To my surprise it somewhat works with a single .gz file:

$ ./my-fuse-archive /data/czds/zonefiles/com.txt.gz /tmp/mount
$ ls -l /tmp/mount
total 0
-r--r--r-- 1 pspacek pspacek 24301854277 Feb 23 02:33 com.zone.46894
$ file /tmp/mount/com.zone.46894 
/tmp/mount/com.zone.46894: ASCII text, with very long lines (302)

File sizes & also md5 sums are all okay:

$ pigz -c -d /data/czds/zonefiles/com.txt.gz | wc -c
24301854277

$ time pigz -c -d /mnt/experiment/czds/new/zonefiles/com.txt.gz | md5sum
44461e319488dc9eca92444f68dd0019  -

real	0m47.204s
user	1m39.729s
sys	0m14.170s

$ time md5sum /tmp/mount/com.zone.46894
44461e319488dc9eca92444f68dd0019  /tmp/mount/com.zone.46894

real	0m51.924s
user	0m29.740s
sys	0m5.372s

Okay cool, so at first glance it seems I should use your software as it is - just loop through the list of files and mount each file separately. A bunch of symlinks would then solve naming etc.
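
The per-file approach described above could be sketched roughly as follows. This is only an illustration, not something taken from fuse-archive: it assumes the binary is invoked as `fuse-archive ARCHIVE MOUNTPOINT` (as in the transcript), that each mount exposes exactly one file, and that the helper names `plan_mounts` and `mount_all` and the directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def plan_mounts(src_dir, mount_root, link_dir):
    """Return (archive, mountpoint, link) triples, one per .gz file.

    Only computes paths; nothing is mounted or created here.
    """
    plan = []
    for archive in sorted(Path(src_dir).glob("*.gz")):
        stem = archive.name[: -len(".gz")]  # com.txt.gz -> com.txt
        plan.append((archive, Path(mount_root) / stem, Path(link_dir) / stem))
    return plan


def mount_all(plan, fuse_archive="fuse-archive"):
    """Mount each archive and symlink its single entry under a stable name.

    Assumption: `fuse_archive ARCHIVE MOUNTPOINT` behaves as in the
    transcript above and each mount contains exactly one file.
    """
    for archive, mountpoint, link in plan:
        mountpoint.mkdir(parents=True, exist_ok=True)
        link.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run([fuse_archive, str(archive), str(mountpoint)], check=True)
        entries = list(mountpoint.iterdir())  # may be slow for large files
        if len(entries) != 1:
            raise RuntimeError(f"expected one entry in {mountpoint}, got {len(entries)}")
        if link.is_symlink():
            link.unlink()
        link.symlink_to(entries[0])  # e.g. links/com.txt -> mnt/com.txt/com.zone.46894
```

Calling `mount_all(plan_mounts("/data/czds/zonefiles", "/tmp/mnt", "/tmp/links"))` would then leave one symlink per archive in `/tmp/links`, named after the archive without its `.gz` suffix.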

Nevertheless, I have a couple of questions for you:

Where does the name in the mount point come from? It does not seem to be in the archive (or maybe gzip just does not output it?):

$ gzip -l com.txt.gz 
         compressed        uncompressed  ratio uncompressed_name
         4699561414          2827017797 -66.2% com.txt

Another randomly selected .gz file contains a file named data, which is unrelated to the original file name net.txt.gz.

The first ls operation on the mount takes ages - time comparable to decompressing the whole archive, presumably because it looks for the end of the first gzip stream to see if another gzip stream might follow. Is this intentional? Would you be willing to add an option to treat .gz files as one-item archives and thus make the initial listing fast?

$ time ls /tmp/mount
com.zone.46894

real	0m34.677s
user	0m0.001s
sys	0m0.000s

Would you be interested in an option which uses the original file name without the .gz suffix for names in the mount?
And finally, assuming the answers above were mostly "yes", are you interested in a more complete feature request describing how to mount directories and transparently decompress the files in them? I could write it down if it is not a waste of time.

Thank you very much for your work on this project!

BTW you did impressive work on decompression speed (or selection of decompressor): this mount-hack decompresses the file like 4x faster than stock gzip and pigz!


pspacek · 2022-02-23T15:05:23Z

Clarification: The ultimate feature would be something like the venerable https://www.freshports.org/sysutils/fusefs-gunzip/ (which disappeared from the face of the Earth).

nigeltao · 2022-03-16T05:08:20Z

> Where does the name in the mount point come from? It does not seem to be in the archive (or maybe gzip just does not output it?):

The gzip file format can optionally contain the original file name (look for "FNAME" in the https://datatracker.ietf.org/doc/html/rfc1952 gzip specification). If a .gz file has it, fuse-archive will use it. Otherwise libarchive will fall back to "data" as a default: https://github.com/libarchive/libarchive/blob/6d56dfd6ef13625561da83c605d2a12cb146088c/libarchive/archive_read_support_format_raw.c#L119

Maybe try gzip -l com.txt.gz with the --name or --no-name flags. Maybe it will show a difference.
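
To check directly whether a given .gz file carries the optional name, the header can be parsed by hand. The sketch below follows the header layout in RFC 1952 (10 fixed bytes, optional FEXTRA block, then a zero-terminated Latin-1 FNAME); the function name `gzip_stored_name` is made up for this example and is not part of fuse-archive or libarchive.

```python
import gzip
import io
import struct

FEXTRA = 0x04  # FLG bit: an extra field follows the fixed header
FNAME = 0x08   # FLG bit: a zero-terminated original file name is present


def gzip_stored_name(stream):
    """Return the FNAME field of the first gzip member, or None if absent."""
    header = stream.read(10)
    if len(header) < 10 or header[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = header[3]
    if flags & FEXTRA:
        (xlen,) = struct.unpack("<H", stream.read(2))
        stream.read(xlen)  # skip the extra field
    if not flags & FNAME:
        return None
    name = bytearray()
    while (byte := stream.read(1)) not in (b"", b"\x00"):
        name += byte
    return name.decode("latin-1")


# Usage on an in-memory archive written with an explicit stored name.
buf = io.BytesIO()
with gzip.GzipFile(filename="com.zone.46894", mode="wb", fileobj=buf) as f:
    f.write(b"example\n")
buf.seek(0)
print(gzip_stored_name(buf))
```

For a file on disk, something like `gzip_stored_name(open("net.txt.gz", "rb"))` would return `None` when no name is stored, which is the case where the "data" fallback mentioned above applies.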

> The first ls operation on the mount takes ages - time comparable to decompressing the whole archive, presumably because it looks for the end of the first gzip stream to see if another gzip stream might follow. Is this intentional? Would you be willing to add an option to treat .gz files as one-item archives and thus make the initial listing fast?

It is intentional to decompress the whole thing. Not to see if another gzip stream might follow, but to determine the decompressed size (which fuse-archive would need to show if we did an 'ls -l' on the mount point).
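
As background (this comes from the gzip format, not from the reply above): RFC 1952 does define a trailing ISIZE field, but it stores the uncompressed size modulo 2^32, so it cannot give the true size of a file larger than 4 GiB. The small sketch below, with a hypothetical helper `isize_field`, shows that the figure printed by `gzip -l` in the original post is consistent with that wraparound.

```python
import gzip
import struct


def isize_field(gz_bytes):
    """Return the ISIZE trailer of the last gzip member (size mod 2**32)."""
    (isize,) = struct.unpack("<I", gz_bytes[-4:])
    return isize


# For a small payload, ISIZE equals the real uncompressed size.
payload = b"x" * 1000
assert isize_field(gzip.compress(payload)) == len(payload)

# Size reported in this thread for the decompressed com.txt.gz (`wc -c`).
actual_size = 24301854277
print(actual_size % 2**32)  # 2827017797, the value `gzip -l` printed
```

This is why a reliable size for large files requires decompressing the stream rather than reading the trailer.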

> Would you be interested in an option which uses the original file name without the .gz suffix for names in the mount?

I'm not super-excited about adding that option. For your original use case, I'd probably use symlinks.

> And finally, assuming the answers above were mostly "yes", are you interested in a more complete feature request describing how to mount directories and transparently decompress the files in them? I could write it down if it is not a waste of time.

Sorry, but the answers are mostly "no".

pspacek · 2022-03-16T08:12:03Z

Thank you for your time, it all makes sense. Have a great day!

0bi-w6n-K3nobi · 2022-05-26T11:58:07Z

Hi @pspacek

Maybe you can use fuse-overlayfs to join several mounted directories into a single target directory.

I hope this helps.

pspacek closed this as completed Mar 16, 2022

wlritchi mentioned this issue Jul 30, 2022

Feature request: mount directories containing archive files #12

Closed
