decompressing individual .gz files in a directory (compression without tar envelope) #5

Closed
pspacek opened this issue Feb 23, 2022 · 4 comments


pspacek commented Feb 23, 2022

Hello,

This is more of a question than an issue. First, let me describe my use-case:

I have a directory full of .gz files with vastly differing sizes - from 159 bytes to 4.4 GiB per file:

$ ls -lSh /data/czds/zonefiles/
-rw-r--r-- 1 pspacek pspacek  4.4G Feb 23 14:04 com.txt.gz
...
-rw-r--r-- 1 pspacek pspacek   159 Feb 23 14:26 xn--cg4bki.txt.gz

The use-case would be to "mount" the source directory and then transparently decompress the .gz files in it, so I can run a utility on them which requires seeking in the files (and thus cannot simply use gzip output piped in via stdin).

This is not supported (I did not expect it to work, but was curious about error handling :-)):

$ ./my-fuse-archive /data/czds/zonefiles /tmp
fuse-archive: could not open /data/czds/zonefiles: fuse-archive: could not read archive file

To my surprise it somewhat works with a single .gz file:

$ ./my-fuse-archive /data/czds/zonefiles/com.txt.gz /tmp/mount
$ ls -l /tmp/mount
total 0
-r--r--r-- 1 pspacek pspacek 24301854277 Feb 23 02:33 com.zone.46894
$ file /tmp/mount/com.zone.46894 
/tmp/mount/com.zone.46894: ASCII text, with very long lines (302)

File sizes & also md5 sums are all okay:

$ pigz -c -d /data/czds/zonefiles/com.txt.gz | wc -c
24301854277

$ time pigz -c -d /mnt/experiment/czds/new/zonefiles/com.txt.gz | md5sum
44461e319488dc9eca92444f68dd0019  -

real	0m47.204s
user	1m39.729s
sys	0m14.170s

$ time md5sum /tmp/mount/com.zone.46894
44461e319488dc9eca92444f68dd0019  /tmp/mount/com.zone.46894

real	0m51.924s
user	0m29.740s
sys	0m5.372s

Okay cool, so at first glance it seems I should use your software as it is: just loop through the list of files and mount each file separately. A bunch of symlinks would then solve naming etc.
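
For illustration, the loop-and-symlink idea could look roughly like the sketch below. It assumes each mounted .gz exposes exactly one entry (as com.txt.gz did above) and that absolute paths are passed in. The names `mount_each` and `FUSE_ARCHIVE` are hypothetical and exist only for this example; the `fuse-archive ARCHIVE MOUNTPOINT` invocation is the one already shown.

```shell
# Mount every .gz file in SRC under MNT/<name>/ and expose its single
# decompressed member as LINKS/<name without .gz>.
# FUSE_ARCHIVE is a hypothetical override for the binary to run.
# Pass absolute paths, otherwise the symlink targets will not resolve.
mount_each() {
    src=$1
    mnt=$2
    links=$3
    mkdir -p "$mnt" "$links"
    for gz in "$src"/*.gz; do
        [ -e "$gz" ] || continue            # no match: the glob stays literal
        base=$(basename "$gz" .gz)
        mkdir -p "$mnt/$base"
        "${FUSE_ARCHIVE:-fuse-archive}" "$gz" "$mnt/$base" || continue
        # Assumption: the mount holds one entry, whatever its name is.
        for member in "$mnt/$base"/*; do
            if [ -e "$member" ]; then ln -sfn "$member" "$links/$base"; fi
            break
        done
    done
}
```

Something like `mount_each /data/czds/zonefiles /tmp/mounts /tmp/zones` would then leave `/tmp/zones/com.txt` pointing into the com.txt.gz mount. Each mount still pays the full decompression cost up front.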

Nevertheless, I have a couple of questions for you:

  • Where does the name in the mount point come from? It does not seem to be in the archive (or maybe gzip just does not output it?):
$ gzip -l com.txt.gz 
         compressed        uncompressed  ratio uncompressed_name
         4699561414          2827017797 -66.2% com.txt

Another randomly selected .gz file contains a file named data, which is unrelated to the original file name net.txt.gz.

  • First ls operation on the mount takes ages - time comparable to decompressing the whole archive, presumably because it looks for end of first gzip stream to see if another gzip stream might follow. Is this intentional? Would you be willing to add an option to treat gz files as one-item archives and thus make initial listing fast?
$ time ls /tmp/mount
com.zone.46894

real	0m34.677s
user	0m0.001s
sys	0m0.000s
  • Would you be interested in an option which uses the original file name without .gz suffix for names in the mount?

  • And finally, assuming the answers above were mostly "yes", are you interested in a more complete feature request description for mounting directories and transparently decompressing the files in them? I could write it down if it is not a waste of time.

Thank you very much for your work on this project!

BTW you did impressive work on decompression speed (or selection of decompressor): This mount-hack decompresses the file like 4x faster than stock gzip and pigz!

@pspacek
Author

pspacek commented Feb 23, 2022

Clarification: The ultimate feature would be something like the venerable https://www.freshports.org/sysutils/fusefs-gunzip/ (which disappeared from the face of Earth).

@nigeltao
Collaborator

Where does the name in the mount point come from? It does not seem to be in the archive (or maybe gzip just does not output it?):

The gzip file format can optionally contain the original file name (look for "FNAME" in the gzip specification, RFC 1952: https://datatracker.ietf.org/doc/html/rfc1952). If a .gz file has it, fuse-archive will use it. Otherwise libarchive will fall back to "data" as a default: https://github.com/libarchive/libarchive/blob/6d56dfd6ef13625561da83c605d2a12cb146088c/libarchive/archive_read_support_format_raw.c#L119
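
To check directly whether a given .gz file carries a stored name, the header can be inspected by hand. The sketch below follows the RFC 1952 layout (FLG is the fourth byte, FNAME is bit 3, and the NUL-terminated name starts at offset 10 when no FEXTRA field is present). `gz_stored_name` is a hypothetical helper, not part of fuse-archive or libarchive, and it deliberately gives up on files with an FEXTRA field.

```shell
# Print the FNAME field of a gzip file, if it has one.
# Returns 1 when the FNAME bit is clear, 2 when FEXTRA is set (unhandled).
gz_stored_name() {
    # FLG is the 4th header byte; od prints it as an unsigned decimal.
    flg=$(od -An -tu1 -j3 -N1 "$1" | tr -d '[:space:]')
    [ $(( flg & 8 )) -ne 0 ] || return 1
    [ $(( flg & 4 )) -eq 0 ] || return 2
    # The fixed header is 10 bytes; the name follows, NUL-terminated.
    # Only the first 255 bytes after the header are examined.
    dd if="$1" bs=1 skip=10 count=255 2>/dev/null | tr '\0' '\n' | head -n 1
}
```

A return status of 1 corresponds to the case where libarchive falls back to "data".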

Maybe try gzip -l com.txt.gz with the --name or --no-name flags. Maybe it will show a difference.

First ls operation on the mount takes ages - time comparable to decompressing the whole archive, presumably because it looks for end of first gzip stream to see if another gzip stream might follow. Is this intentional? Would you be willing to add an option to treat gz files as one-item archives and thus make initial listing fast?

It is intentional to decompress the whole thing. Not to see if another gzip stream might follow, but to determine the decompressed size (which fuse-archive would need to show if we did an 'ls -l' on the mount point).
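
As a side note on why the size cannot simply be read from the file: per RFC 1952 the gzip trailer stores ISIZE, the uncompressed length modulo 2^32, in its last four bytes (little-endian). That is consistent with the `gzip -l` output quoted in the question, where 24301854277 bytes showed up as 2827017797. The sketch below reads that field; `gz_isize` is a hypothetical helper name, and this is not a description of what fuse-archive does internally.

```shell
# Print the ISIZE trailer field (uncompressed size modulo 2^32) of a gzip file.
gz_isize() {
    # od prints the four trailer bytes as unsigned decimals, lowest byte first.
    set -- $(tail -c 4 "$1" | od -An -tu1)
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# The real size from the issue, reduced modulo 2^32:
echo $(( 24301854277 % 4294967296 ))    # prints 2827017797
```

For files below 4 GiB with a single gzip member the trailer value is the true size; beyond that, or with concatenated members, only full decompression gives the answer.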

Would you be interested in an option which uses the original file name without .gz suffix for names in the mount?

I'm not super-excited about adding that option. For your original use case, I'd probably use symlinks.

And finally, assuming the answers above were mostly "yes", are you interested in a more complete feature request description for mounting directories and transparently decompressing the files in them? I could write it down if it is not a waste of time.

Sorry, but the answers are mostly "no".

@pspacek
Author

pspacek commented Mar 16, 2022

Thank you for your time, it all makes sense. Have a great day!

@pspacek pspacek closed this as completed Mar 16, 2022
@0bi-w6n-K3nobi

Hi @pspacek

Maybe you can use fuse-overlayfs to join several mounted directories under a single target directory.

I hope this helps.
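
A rough sketch of that idea: collect the per-file mount directories into the colon-separated `lowerdir` list that fuse-overlayfs takes. `lowerdirs` is a hypothetical helper, the paths are placeholders, and mounting with only `lowerdir` (no `upperdir`/`workdir`) for a read-only merged view is an assumption worth checking against the fuse-overlayfs man page.

```shell
# Build a colon-separated list of every directory directly under $1.
# Directory names containing ':' are not supported by this sketch.
lowerdirs() {
    list=
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        list=${list:+$list:}${d%/}
    done
    printf '%s\n' "$list"
}

# Intended use (not executed here; paths are placeholders):
#   fuse-overlayfs -o lowerdir="$(lowerdirs /tmp/mounts)" /tmp/merged
```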


3 participants