Index Cache not created for Zip File #98

Closed

arakis opened this issue Nov 20, 2022 · 5 comments
Labels
enhancement New feature or request performance Something is slower than it could be

Comments

@arakis

arakis commented Nov 20, 2022

If you have an archive.tar, a cache file is created. But when using an archive.zip, it can be mounted, but no index cache is created.

I know that, by definition, ZIP has a "Global Directory" at the end of the archive. But this is still horribly slow when your GB-sized ZIP file with 10k entries is stored on remote cloud storage.

So, can you enable the index cache for ZIP files as well?

@mxmlnkn
Owner

mxmlnkn commented Nov 20, 2022

It might be possible and reasonably easy but not trivial.

Zip support has been added more as a proof of concept until now, but this would certainly be an exciting reason to make it faster.

At the point where I can create and use an index cache for TAR, I skip the TAR parser completely and simply read the binary data from the raw file. Obviously, this is very format-specific. A similar trick might work for non-compressed zip files. Maybe it could even be extended to deflate- or bzip2-compressed files, but I would have to delve deeper into the zip specification and either get the raw data offsets from an existing zip parser or begin writing my own zip parser.
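For the uncompressed case, a minimal sketch of this trick in Python could look like the following. It only uses the standard zipfile and struct modules; the function name is made up for illustration:

import struct
import zipfile

def stored_entry_span(archive_path, member_name):
    # Look up the member via the central directory, which zipfile has
    # already parsed for us, then decode the local file header to find
    # where the raw (stored, i.e., uncompressed) data begins.
    with zipfile.ZipFile(archive_path) as archive:
        info = archive.getinfo(member_name)
        if info.compress_type != zipfile.ZIP_STORED:
            raise ValueError("Only works for uncompressed (stored) members")

    with open(archive_path, 'rb') as file:
        file.seek(info.header_offset)
        header = file.read(30)  # fixed-size part of the local file header
        if header[:4] != b'PK\x03\x04':
            raise ValueError("Not a local file header")
        name_length, extra_length = struct.unpack('<HH', header[26:30])
        data_offset = info.header_offset + 30 + name_length + extra_length
        return data_offset, info.file_size

# Afterwards, file contents can be served with a plain seek-and-read on
# the raw archive, completely bypassing the zip parser:
#   offset, size = stored_entry_span('archive.zip', '95/995')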

I was able to reproduce some slowness by repacking my test TAR, which contains 100 folders each holding 1000 64 B files, into a zip. With this zip, ls -la is unexpectedly slow:

time ls -la mounted/95/

[...]
-r-xr-xr-x 1 user 64 Oct  3  2021 994
-r-xr-xr-x 1 user 64 Oct  3  2021 995
-r-xr-xr-x 1 user 64 Oct  3  2021 996
-r-xr-xr-x 1 user 64 Oct  3  2021 997
-r-xr-xr-x 1 user 64 Oct  3  2021 998
-r-xr-xr-x 1 user 64 Oct  3  2021 999

real	0m19.613s
user	0m0.011s
sys	0m0.033s

Simply accessing a file was as fast as expected though:

time cat mounted/95/995

SzroE6UcYwk0pPFNo+y39tYcGHvDwik2cacVblON3umtMcLugOv28EQwFObrGJA2
real	0m0.056s
user	0m0.000s
sys	0m0.003s

I don't understand though why it is so slow if there exists a Global Directory in the first place. Maybe because it has to be searched linearly? Or maybe because this index lies in the cloud storage? But the index cache ratarmount creates would also lie in the cloud storage, so I'm not sure whether it would fix that problem.

But this is only a very small file. I strongly assume that accessing very large (>16 MiB for bzip2 and >100 MiB for gzip) compressed files will also become slow because, just as for gzip and bzip2, decompressors with seek capabilities will be necessary.
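For illustration, this is the kind of seekable decompression that already exists for gzip via the indexed_gzip package; the seek-point index it builds and exports is exactly the kind of data an index cache would have to persist. A sketch, assuming its documented API:

import indexed_gzip

# Random access into a gzip stream without decompressing everything
# before the target offset.
with indexed_gzip.IndexedGzipFile('archive.tar.gz') as file:
    file.build_full_index()               # one-time linear pass over the stream
    file.export_index('archive.gzindex')  # persist seek points for later mounts
    file.seek(123 * 1024 * 1024)          # now fast thanks to the seek points
    data = file.read(64)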

Does this accurately describe the behavior observed in your case (slow metadata access but reasonably fast single-file access)?

@mxmlnkn mxmlnkn added enhancement New feature or request performance Something is slower than it could be labels Nov 20, 2022
@arakis
Author

arakis commented Nov 21, 2022

Just check for "Central Directory" in the PKZIP Specification.

Regardless of the parsing method (linear scan versus Central Directory), I would love to see an index cache file for zip archives as well. In the meantime, I need to stick with tar as long as this feature is not available.
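For reference, locating the Central Directory only takes two small reads near the end of the archive. A rough sketch (it ignores ZIP64 extensions, and the function name is made up):

import struct

def central_directory_span(path):
    # Locate the End Of Central Directory record (signature PK\x05\x06),
    # which sits in the last 22 to 22+65535 bytes of the file, then read
    # the Central Directory's size and offset out of it.
    with open(path, 'rb') as file:
        file.seek(0, 2)
        file_size = file.tell()
        file.seek(max(0, file_size - 22 - 65535))
        tail = file.read()
    position = tail.rfind(b'PK\x05\x06')
    if position < 0:
        raise ValueError('No End Of Central Directory record found')
    size, offset = struct.unpack('<II', tail[position + 12:position + 20])
    return offset, size

# One more read of `size` bytes at `offset` then lists every entry:
#   offset, size = central_directory_span('archive.zip')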

@mxmlnkn
Owner

mxmlnkn commented Nov 22, 2022

Is the index file really that important, or would it work to improve performance in Python's zipfile module instead? I wouldn't know whether it is possible to fix the problem there, nor how long it would take to land upstream, though. The index file also has the benefit of being on disk, thereby capping memory usage no matter how many files there are in the archive.
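To illustrate the memory argument, a hypothetical on-disk index (not ratarmount's actual schema) might look like this; lookups go through SQLite's on-disk B-tree instead of an in-memory dictionary:

import sqlite3

connection = sqlite3.connect('archive.zip.index.sqlite')
connection.execute(
    'CREATE TABLE IF NOT EXISTS files ('
    '    path TEXT PRIMARY KEY, offset INTEGER, size INTEGER, mtime INTEGER'
    ')'
)
# Lookups use the primary-key index on disk, so memory usage stays
# flat even for archives with millions of members.
row = connection.execute(
    'SELECT offset, size FROM files WHERE path = ?', ('95/995',)
).fetchone()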

I took a look at the PKZIP specification. It is eerie how similar it looks to a TAR with an index file, like the one ratarmount creates, appended to it. But I assume there aren't that many ways to store a hierarchy of files. In the end, the same kind of data and metadata has to be written somehow.

My takeaway is that it indeed looks doable. Maybe I'll find some time next week to code up a simple prototype. The cumbersome part would be to factor out the SQLite index behind an interface so that I can use it for both zip and tar, but I could also do that in a second step.
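Such an interface might look roughly like this (a hypothetical sketch, not the actual ratarmount API):

import abc

class ArchiveIndexSource(abc.ABC):
    """Hypothetical backend interface: the SQLite index code would talk
    to archive formats only through these methods, so the TAR and ZIP
    parsers become interchangeable."""

    @abc.abstractmethod
    def list_dir(self, path: str) -> list:
        ...

    @abc.abstractmethod
    def file_info(self, path: str) -> dict:
        ...

    @abc.abstractmethod
    def read(self, path: str, size: int, offset: int) -> bytes:
        ...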

@arakis
Author

arakis commented Nov 27, 2022

Because I mount my archives on a (costly) Google Cloud Storage (GCS) bucket, I need to be careful.

On GCS, the costs are:

  • The storage itself (per GB)
  • Network transfer fees (per GB)
  • Every operation/API call, which is like an I/O call (per call: list, read, seek, ...)

My primary restrictions are:

  • remounts/secondary mounts should avoid unnecessary reads as much as possible
  • as few read accesses as possible when listing a directory, say, in a file manager

With tar, both goals are met (I tested this myself). Additional mounts are fast, and listing directories is nearly instant.

I think it may also make a difference whether there's a small cached index file versus a part (the directory part of the ZIP) within one extremely large file. And I'm not sure, but maybe you get more I/O delay when doing large seeks within the cloud-stored file (just a thought).

So, I still suggest implementing this. But yes, efficiently reading the ZIP directory on the initial mount would make sense too, of course; in that case, the first mount would be faster.

@mxmlnkn
Owner

mxmlnkn commented Dec 4, 2022

Thank you for your detailed explanation! I was not aware that ratarmount could be used like that (to reduce cloud costs).

I implemented it now on the develop branch. There might still be some bugs, but if you want to test it before the next release, which might take a bit, you can do so with:

python3 -m pip install --user --force-reinstall 'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmountcore&subdirectory=core'
python3 -m pip install --user --force-reinstall 'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmount'
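A quick way to try it out (the index file name and location may vary depending on options):

ratarmount archive.zip mounted   # first mount parses the central directory and writes the index
ls -la mounted                   # listing should now be fast
fusermount -u mounted            # unmount
ratarmount archive.zip mounted   # remounting should reuse the cached index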
