Index Cache not created for Zip File #98

Closed

arakis opened this issue Nov 20, 2022 · 5 comments
Labels
enhancement New feature or request performance Something is slower than it could be

Comments

@arakis

arakis commented Nov 20, 2022

If you have an archive.tar, a cache file is created. But when using an archive.zip, it can be mounted, but no index cache is created.

I know that, by definition, ZIP has a "Global Directory" at the end of the archive. But this is still horribly slow when your GB-sized ZIP file with 10k entries is stored on remote cloud storage.

So, can you enable the index cache for ZIP files as well?

@mxmlnkn
Owner

mxmlnkn commented Nov 20, 2022

It might be possible and reasonably easy but not trivial.

Zip support has been added more as a proof of concept until now, but this would certainly be an exciting reason to make it faster.

At the point where I can create and use an index cache for TAR, I skip the TAR parser completely and simply read the binary data from the raw file. Obviously, this is very format-specific. A similar trick might work for non-compressed zip files. Maybe it could even be extended to deflate- or bzip2-compressed files, but I would have to delve deeper into the zip specification and either get the raw data offsets from an existing zip parser or begin writing my own zip parser.
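For the uncompressed case, a minimal sketch of this trick in Python could look like the following. It only uses the standard zipfile and struct modules; the function name is made up for illustration:

import struct
import zipfile

def stored_entry_span(archive_path, member_name):
    # Look up the member via the central directory, which zipfile has
    # already parsed for us, then decode the local file header to find
    # where the raw (stored, i.e., uncompressed) data begins.
    with zipfile.ZipFile(archive_path) as archive:
        info = archive.getinfo(member_name)
        if info.compress_type != zipfile.ZIP_STORED:
            raise ValueError("Only works for uncompressed (stored) members")

    with open(archive_path, 'rb') as file:
        file.seek(info.header_offset)
        header = file.read(30)  # fixed-size part of the local file header
        if header[:4] != b'PK\x03\x04':
            raise ValueError("Not a local file header")
        name_length, extra_length = struct.unpack('<HH', header[26:30])
        data_offset = info.header_offset + 30 + name_length + extra_length
        return data_offset, info.file_size

# Afterwards, file contents can be served with a plain seek-and-read on
# the raw archive, completely bypassing the zip parser:
#   offset, size = stored_entry_span('archive.zip', '95/995')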

I was able to reproduce some slowness by repacking my test TAR, which contains 100 folders each holding 1000 64 B files, into a zip. With this zip, ls -la is unexpectedly slow:

time ls -la mounted/95/

[...]
-r-xr-xr-x 1 user 64 Oct  3  2021 994
-r-xr-xr-x 1 user 64 Oct  3  2021 995
-r-xr-xr-x 1 user 64 Oct  3  2021 996
-r-xr-xr-x 1 user 64 Oct  3  2021 997
-r-xr-xr-x 1 user 64 Oct  3  2021 998
-r-xr-xr-x 1 user 64 Oct  3  2021 999

real	0m19.613s
user	0m0.011s
sys	0m0.033s

Simply accessing a file was as fast as expected though:

time cat mounted/95/995

SzroE6UcYwk0pPFNo+y39tYcGHvDwik2cacVblON3umtMcLugOv28EQwFObrGJA2
real	0m0.056s
user	0m0.000s
sys	0m0.003s

I don't understand though why it is so slow if there exists a Global Directory in the first place. Maybe because it has to be searched linearly? Or maybe because this index lies in the cloud storage? But the index cache ratarmount creates would also lie in the cloud storage, so I'm not sure whether it would fix that problem.

But this is only a very small file. I strongly assume that accessing very large (>16 MiB for bzip2 and >100 MiB for gzip) compressed files will also become slow because, just as for gzip and bzip2, decompressors with seek capabilities will be necessary.
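For illustration, this is the kind of seekable decompression that already exists for gzip via the indexed_gzip package; the seek-point index it builds and exports is exactly the kind of data an index cache would have to persist. A sketch, assuming its documented API:

import indexed_gzip

# Random access into a gzip stream without decompressing everything
# before the target offset.
with indexed_gzip.IndexedGzipFile('archive.tar.gz') as file:
    file.build_full_index()               # one-time linear pass over the stream
    file.export_index('archive.gzindex')  # persist seek points for later mounts
    file.seek(123 * 1024 * 1024)          # now fast thanks to the seek points
    data = file.read(64)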

Does this accurately describe the behavior observed in your case (slow metadata access but reasonably fast single-file access)?

@mxmlnkn mxmlnkn added enhancement New feature or request performance Something is slower than it could be labels Nov 20, 2022
@arakis
Author

arakis commented Nov 21, 2022

Just check for "Central Directory" in the PKZIP Specification.

Regardless of the parsing method (linear scan versus Central Directory), I would love to see an index cache file for zip archives as well. In the meantime, I need to stick with tar as long as this feature is not available.
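For reference, locating the Central Directory only takes two small reads near the end of the archive. A rough sketch (it ignores ZIP64 extensions, and the function name is made up):

import struct

def central_directory_span(path):
    # Locate the End Of Central Directory record (signature PK\x05\x06),
    # which sits in the last 22 to 22+65535 bytes of the file, then read
    # the Central Directory's size and offset out of it.
    with open(path, 'rb') as file:
        file.seek(0, 2)
        file_size = file.tell()
        file.seek(max(0, file_size - 22 - 65535))
        tail = file.read()
    position = tail.rfind(b'PK\x05\x06')
    if position < 0:
        raise ValueError('No End Of Central Directory record found')
    size, offset = struct.unpack('<II', tail[position + 12:position + 20])
    return offset, size

# One more read of `size` bytes at `offset` then lists every entry:
#   offset, size = central_directory_span('archive.zip')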

@mxmlnkn
Owner

mxmlnkn commented Nov 22, 2022

Is the index file really that important, or would it work to improve performance in Python's zipfile module instead? I wouldn't know whether it is possible to fix the problem there, nor how long it would take to land upstream, though. The index file also has the benefit of being on disk, thereby capping memory usage no matter how many files there are in the archive.
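To illustrate the memory argument, a hypothetical on-disk index (not ratarmount's actual schema) might look like this; lookups go through SQLite's on-disk B-tree instead of an in-memory dictionary:

import sqlite3

connection = sqlite3.connect('archive.zip.index.sqlite')
connection.execute(
    'CREATE TABLE IF NOT EXISTS files ('
    '    path TEXT PRIMARY KEY, offset INTEGER, size INTEGER, mtime INTEGER'
    ')'
)
# Lookups use the primary-key index on disk, so memory usage stays
# flat even for archives with millions of members.
row = connection.execute(
    'SELECT offset, size FROM files WHERE path = ?', ('95/995',)
).fetchone()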

I took a look at the PKZIP specification. It is eerie how similar it looks to a TAR with an index file, like the one ratarmount creates, appended to it. But I assume there aren't that many ways to store a hierarchy of files. In the end, the same kind of data and metadata has to be written somehow.

My takeaway is that it indeed looks doable. Maybe I'll find some time next week to code up a simple prototype. The cumbersome part would be to factor out the SQLite index behind an interface so that I can use it for both zip and tar, but I could also do that in a second step.
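Such an interface might look roughly like this (a hypothetical sketch, not the actual ratarmount API):

import abc

class ArchiveIndexSource(abc.ABC):
    """Hypothetical backend interface: the SQLite index code would talk
    to archive formats only through these methods, so the TAR and ZIP
    parsers become interchangeable."""

    @abc.abstractmethod
    def list_dir(self, path: str) -> list:
        ...

    @abc.abstractmethod
    def file_info(self, path: str) -> dict:
        ...

    @abc.abstractmethod
    def read(self, path: str, size: int, offset: int) -> bytes:
        ...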

@arakis
Author

arakis commented Nov 27, 2022

Because I mount my archives on a (costly) Google Cloud Storage (GCS) bucket, I need to be careful.

On GCS, the costs are:

  • The storage itself (per GB)
  • Network transfer fees (per GB)
  • Every operation/API call, which is like an I/O call (per call: list, read, seek, ...)

My primary restrictions are:

  • remounts/secondary mounts should avoid unnecessary reads as much as possible
  • as few read accesses as possible when listing a directory, say, in a file manager

With tar, both goals are met (I tested this myself). Additional mounts are fast, and listing directories is nearly instant.

I think it may also make a difference whether there's a small cached index file versus a part (the directory part of the ZIP) within one extremely large file. And I'm not sure, but maybe you get more I/O delay when doing large seeks within the cloud-stored file (just a thought).

So, I still suggest implementing this. But yes, efficiently reading the ZIP directory on the initial mount would make sense too, of course; in that case, the first mount would be faster.

@mxmlnkn
Owner

mxmlnkn commented Dec 4, 2022

Thank you for your detailed explanation! I was not aware that ratarmount could be used like that (to reduce cloud costs).

I implemented it now on the develop branch. There might still be some bugs, but if you want to test it before the next release, which might take a bit, you can do so with:

python3 -m pip install --user --force-reinstall 'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmountcore&subdirectory=core'
python3 -m pip install --user --force-reinstall 'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmount'
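A quick way to try it out (the index file name and location may vary depending on options):

ratarmount archive.zip mounted   # first mount parses the central directory and writes the index
ls -la mounted                   # listing should now be fast
fusermount -u mounted            # unmount
ratarmount archive.zip mounted   # remounting should reuse the cached index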
