Index Cache not created for Zip File #98
Comments
It might be possible and reasonably easy, but not trivial. Zip support has been added more as a proof of concept until now, but this would certainly be an exciting reason to add faster zip file support. At the point where I can create and use an index cache for TAR, I skip the TAR parser completely and simply read the binary data from the raw file. Obviously, this is very format-specific. A similar trick might work for non-compressed zip files. Maybe it could even be extended to deflate- or bzip2-compressed files, but I would have to delve deeper into the zip specification and either get the raw data offsets from an existing zip parser or start writing my own zip parser.

I was able to reproduce some slowness by repacking my test TAR, with 100 folders each containing 1000 files of 64 B, into a zip:

time ls -la mounted/95/
[...]
-r-xr-xr-x 1 user 64 Oct 3 2021 994
-r-xr-xr-x 1 user 64 Oct 3 2021 995
-r-xr-xr-x 1 user 64 Oct 3 2021 996
-r-xr-xr-x 1 user 64 Oct 3 2021 997
-r-xr-xr-x 1 user 64 Oct 3 2021 998
-r-xr-xr-x 1 user 64 Oct 3 2021 999
real 0m19.613s
user 0m0.011s
sys 0m0.033s

Simply accessing a file was as fast as expected, though:

time cat mounted/95/995
SzroE6UcYwk0pPFNo+y39tYcGHvDwik2cacVblON3umtMcLugOv28EQwFObrGJA2
real 0m0.056s
user 0m0.000s
sys 0m0.003s

I don't understand, though, why it is so slow if there is a Global Directory in the first place. Maybe because it has to search it linearly? Or maybe because this index lies in the cloud storage? But the index cache ratarmount creates would also lie in the cloud storage, so I'm not sure whether it would fix that problem. But this is only a very small file. I strongly assume that accessing very large compressed files (>16 MiB for bzip2 and >100 MiB for gzip) will also become slow because, just as for gzip and bzip2, decompressors with seek capabilities will be necessary. Does this accurately describe the behavior observed in your case (slow metadata access but reasonably fast single-file access)?
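To make the "skip the parser and read the raw bytes" idea described above concrete, here is a minimal sketch of how an on-disk index with per-file offsets could serve reads without touching the archive parser at all. The table and column names are hypothetical and do not reflect ratarmount's actual index schema:

import sqlite3

# Hypothetical index layout: files(path TEXT, offset INTEGER, size INTEGER),
# where "offset" is the position of the member's data inside the raw archive.
def read_member(archive_path, index_path, member_path):
    connection = sqlite3.connect(index_path)
    row = connection.execute(
        "SELECT offset, size FROM files WHERE path = ?", (member_path,)
    ).fetchone()
    connection.close()
    if row is None:
        raise FileNotFoundError(member_path)
    offset, size = row
    with open(archive_path, 'rb') as archive:
        archive.seek(offset)       # jump straight to the stored data offset
        return archive.read(size)  # no TAR (or zip) parsing at read time

For uncompressed zip members, the same lookup would work once the data offsets are stored in the index; for deflate- or bzip2-compressed members, a seekable decompressor would additionally be needed, as noted above.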
Just check for "Central Directory" in the PKZIP specification. Regardless of the parsing method (linear scan versus Central Directory), I would love to see an index cache file for zip archives as well. In the meantime, I need to stick with tar as long as this feature is not available.
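For reference, the Central Directory can be located by scanning backwards for the End of Central Directory record described in the PKZIP appnote. The following is only a rough sketch (it ignores ZIP64 and multi-disk archives):

import struct

EOCD_SIGNATURE = b'PK\x05\x06'  # End of Central Directory record signature

def locate_central_directory(path):
    with open(path, 'rb') as archive:
        archive.seek(0, 2)
        file_size = archive.tell()
        # The EOCD record is 22 bytes plus an optional comment of up to 65535 bytes.
        archive.seek(max(0, file_size - 22 - 65535))
        tail = archive.read()
    position = tail.rfind(EOCD_SIGNATURE)
    if position < 0:
        raise ValueError("No End of Central Directory record found")
    total_entries, cd_size, cd_offset = struct.unpack(
        '<HII', tail[position + 10 : position + 20]
    )
    # Offset and size of the Central Directory inside the archive.
    return total_entries, cd_offset, cd_size

Knowing the offset and size of the Central Directory makes the initial scan a single contiguous read, but every mount still has to parse all of its entries, which is exactly what an on-disk index cache would avoid on subsequent mounts.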
Is the index file really that important, or would it work to improve performance in Python's zipfile module? I wouldn't know whether it is possible to fix the problem there, though, nor how long it would take to land upstream. The index file also has the benefit of being on disk, thereby capping memory usage no matter how many files there are in the archive. I took a look at the PKZIP specification. It is eerie how similar it looks to a TAR with an index file, like the one ratarmount creates, appended to it. But I assume there aren't that many ways to store a hierarchy of files; in the end, the same kind of data and metadata has to be written somehow. My takeaway is that it indeed looks doable. Maybe I'll find some time next week to code up a simple prototype. The cumbersome part would be to factor out the SQLite index behind an interface so that I can use it for both zip and tar, but I could also do that in a second step.
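As a starting point for such a prototype, the raw offsets could come from Python's zipfile module itself rather than a hand-written parser. A minimal, hypothetical sketch of the metadata-gathering half (persisting it into the SQLite index would be the second step):

import zipfile

def collect_member_metadata(zip_path):
    """Gather the per-member metadata that an index cache could persist."""
    entries = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():  # parsed from the Central Directory
            entries.append({
                'path': info.filename,
                'size': info.file_size,
                'compressed_size': info.compress_size,
                'compression': info.compress_type,  # 0 = stored, 8 = deflated, ...
                # Offset of the local file header; the actual data starts after
                # that variable-length header, so it still has to be skipped once.
                'header_offset': info.header_offset,
            })
    return entries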
Because I mount my archives on a (costly) Google Cloud Storage (GCS) bucket, I need to be careful. On GCS, the costs are-
My primary restriction is
With tar, both goals are achieved (I tested it myself). Additional mounts are fast, and listing directories is nearly instant. I think it may also make a difference whether there is a small cached index file versus a part (the directory part of the ZIP) within one extremely large file. Anyway, I'm not sure; maybe you also get more I/O delay when doing large seeks within the cloud-stored file (just a thought). So, I still suggest implementing this. But yes, efficient reading of the ZIP directory itself would of course also make sense; in that case, the first mount would be faster as well.
Thank you for your detailed explanation! I was not aware that ratarmount could be used like that (to reduce cloud costs). I have now implemented it on the develop branch. There might still be some bugs, but if you want to test it before the next release, which might take a while, you can do so with:

python3 -m pip install --user --force-reinstall 'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmountcore&subdirectory=core'
python3 -m pip install --user --force-reinstall 'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmount'
If you have an archive.tar, an index cache file is created. But when using an archive.zip, it can be mounted, yet no index cache is created.
I know that, by definition, a ZIP has a "Global Directory" at its end. But this is still horribly slow when your GB-sized zip file with 10k entries is stored on remote cloud storage.
So, could you enable the index cache for ZIP files as well?