Faster entry listing for ZIP #889
This should be possible; seeking to the file contents should not be required until/unless you actually try to read the data. I think some of the logic for this already exists within
Libarchive is not doing that directly. We do of course use some C library time functions, but I'm not sure which one would be doing that. What OS are you using? What libc version? |
The streamable version seems to try much harder to not read data if it can avoid it, but I'm really not sure I'm reading this right.
Looks like tons of calls to mktime(). But probably nothing to worry about too much in terms of performance. Pretty sure that's heavily cached as it is ;) But the stat() calls are a pretty good indicator of the calls to zip_time(), which is called when the headers are populated and read. An early zip_time() call:
And this is a late call:
Looks like it's decoding the whole header, getting info such as the time of creation of all those entries, and throwing it away. This is the simple test app I used.
|
I also found this when trying to figure out why Using
In comparison,
over 1 million times.
|
That sounds like a bug in your libc. |
I think it's because of
https://pubs.opengroup.org/onlinepubs/009695399/functions/mktime.html says:
glibc man pages say:
My (gdb) bt
#0 0x00007ffff787af7d in __open_nocancel () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#1 0x00007ffff7808955 in _IO_file_open () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#2 0x00007ffff7808a23 in __GI__IO_file_fopen () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#3 0x00007ffff77fce19 in __fopen_internal () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#4 0x00007ffff7844732 in __tzfile_read () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#5 0x00007ffff7844308 in tzset_internal () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#6 0x00007ffff7844426 in tzset () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#7 0x00007ffff78431b9 in timelocal () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#8 0x00007ffff7f5c5d7 in zip_time () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#9 0x00007ffff7f5de17 in zip_read_local_file_header () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#10 0x00007ffff7f5ec02 in archive_read_format_zip_seekable_read_header () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#11 0x00007ffff7f2444a in _archive_read_next_header2 () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#12 0x00007ffff7f2454f in _archive_read_next_header () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#13 0x0000000000406e86 in read_archive ()
#14 0x00000000004075f7 in tar_mode_x ()
#15 0x000000000040579f in main () |
Sure, that's the intention. But libc should still remember the TZ setting and not look it up over and over again. |
It looks like it's controversial whether or not the libc should cache
The trade-off of musl's caching approach is that the program has to do a dance with changing and resetting the
I found that on glibc, a workaround is setting the environment variable
Remains to check what
In general, it seems best to query the system for the timezone once, and then re-use this info for formatting of times without re-querying it.
Another question, though, is that https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure says
which makes me wonder whether bothering with time zones is worth it at all with regard to the mtimes of the archive contents. |
That doesn't seem to make a difference to the overall speed of listing the contents of a ZIP file, even if it would be nicer to avoid all those extra calls. |
Have you tested what would happen if you forced If forcing |
Without time zone information, I think the only reasonable behavior is to assume the time stamps are using the local time zone, which is exactly what we do here. |
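For readers following along, here is a minimal sketch of what "assume the time stamps are using the local time zone" can look like in C. It is not libarchive's actual zip_time(); the helper names dos_to_tm and dos_to_time_t are hypothetical. It decodes the MS-DOS time/date words stored in Zip headers and hands the result to mktime(), which is the call that, on some libcs, may re-check the time zone configuration each time.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: decode an MS-DOS time/date pair (as stored in
 * Zip headers) into a broken-down struct tm. No time zone lookup
 * happens at this stage. */
static void dos_to_tm(uint16_t dos_time, uint16_t dos_date, struct tm *out)
{
    memset(out, 0, sizeof(*out));
    out->tm_year = ((dos_date >> 9) & 0x7f) + 80; /* years since 1900; DOS counts from 1980 */
    out->tm_mon  = ((dos_date >> 5) & 0x0f) - 1;  /* DOS months are 1..12, tm_mon is 0..11 */
    out->tm_mday = dos_date & 0x1f;
    out->tm_hour = (dos_time >> 11) & 0x1f;
    out->tm_min  = (dos_time >> 5) & 0x3f;
    out->tm_sec  = (dos_time & 0x1f) * 2;         /* stored with 2-second resolution */
    out->tm_isdst = -1;                           /* let mktime() work out DST */
}

/* Hypothetical helper: interpret the decoded fields as local time.
 * mktime() is where the libc may consult the time zone data. */
static time_t dos_to_time_t(uint16_t dos_time, uint16_t dos_date)
{
    struct tm tm;
    dos_to_tm(dos_time, dos_date, &tm);
    return mktime(&tm);
}
```

Splitting the decode step from the mktime() call makes it easier to see that only the second step depends on the time zone, so that is the only part worth caching or skipping.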
We should also double-check how the Zip code here is working with UT extensions. The UT extension stores mtime following POSIX conventions; if a UT extension is being used, we should be able to skip converting the old-style Zip timestamp information. |
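A rough sketch of the UT idea, assuming the Info-ZIP "extended timestamp" extra field layout (header ID 0x5455, a 2-byte size, a flags byte, then a 4-byte little-endian modification time when flag bit 0 is set). The function find_ut_mtime is a hypothetical name, not a libarchive API. If it finds an mtime, the caller would have no need to convert the old-style DOS timestamp at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Read little-endian integers from a byte buffer. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: scan a Zip extra-field area for the "UT"
 * (0x5455) record. Returns 1 and stores the modification time
 * (seconds since the Unix epoch, UTC) if the record is present and its
 * mtime flag bit is set; returns 0 otherwise. */
static int find_ut_mtime(const unsigned char *extra, size_t len, int64_t *mtime)
{
    size_t off = 0;

    while (len - off >= 4) {
        uint16_t id = le16(extra + off);
        uint16_t size = le16(extra + off + 2);
        off += 4;
        if (size > len - off)
            return 0; /* truncated record: give up */
        if (id == 0x5455 && size >= 5 && (extra[off] & 0x01)) {
            /* Assumption: the value is a signed 32-bit Unix time,
             * as described in the Info-ZIP notes. */
            *mtime = (int32_t)le32(extra + off + 1);
            return 1;
        }
        off += size; /* skip unrelated records */
    }
    return 0;
}
```

A caller could try this first and fall back to the DOS timestamp conversion only when it returns 0.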
Listing files from a remote CBZ file (250 megs, nearly 600 pages).
I realise that libarchive might be doing much more work than "unzip", but listing files inside an archive is the first thing to be done when handling CBZ or ePub files. Is it the integrity checking that ends up reading more data? Is it possible to get the list of files from the header instead of seeking within the archive?
PS: I should also note that bsdtar or libarchive pounds /etc/localtime. Stracing the above call to bsdtar leads to 3k system calls, a third of which are stat'ing /etc/localtime, while unzip makes a single one.
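On the question of listing files from the header: the Zip format keeps a central directory at the end of the file, located via the End of Central Directory (EOCD) record. The sketch below is only an illustration under stated assumptions, not how libarchive or unzip are implemented. It assumes the caller has already fetched the tail of the archive into memory, it ignores Zip64, and the names eocd_info and find_eocd are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit integer from a byte buffer. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical summary of the fields needed to find the entry list. */
struct eocd_info {
    uint16_t total_entries; /* number of central directory records */
    uint32_t cd_size;       /* size of the central directory in bytes */
    uint32_t cd_offset;     /* offset of the central directory in the file */
};

/* Hypothetical helper: search a buffer holding the tail of a Zip file
 * for the EOCD signature "PK\5\6", scanning backwards because the
 * record sits at the very end (possibly followed by a comment).
 * Returns 1 on success, 0 if no record is found. Zip64 is not handled. */
static int find_eocd(const unsigned char *tail, size_t len, struct eocd_info *out)
{
    if (len < 22)
        return 0; /* the fixed part of the record is 22 bytes */

    for (size_t i = len - 21; i-- > 0; ) {
        const unsigned char *p = tail + i;
        if (p[0] == 'P' && p[1] == 'K' && p[2] == 5 && p[3] == 6) {
            out->total_entries = (uint16_t)(p[10] | (p[11] << 8));
            out->cd_size = le32(p + 12);
            out->cd_offset = le32(p + 16);
            return 1;
        }
    }
    return 0;
}
```

With cd_offset and cd_size in hand, a reader could fetch just the central directory in one contiguous read and list every entry name from it, which matters most for remote files where each seek is a round trip.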