Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster entry listing for ZIP #889

Open
hadess opened this issue Mar 21, 2017 · 11 comments
Open

Faster entry listing for ZIP #889

hadess opened this issue Mar 21, 2017 · 11 comments

Comments

@hadess
Copy link

hadess commented Mar 21, 2017

Listing files from a remote CBZ file (250 megs, nearly 600 pages).

$ time unzip -l '/run/user/1000/gvfs/smb-share:server=test.local,share=home/e-Books/Humble Bundle #4/FromHell.cbz' > /dev/null 

real	0m0.034s
user	0m0.003s
sys	0m0.001s

$ time ./bin/bsdtar -tf '/run/user/1000/gvfs/smb-share:server=test.local,share=home/e-Books/Humble Bundle #4/FromHell.cbz' > /dev/null

real	0m1.843s
user	0m0.011s
sys	0m0.022s

I realise that libarchive might be doing much more work than "unzip", but listing files inside an archive is the first thing to be done when handling CBZ or ePub files. Is it the integrity checking that ends up reading more data? Is it possible to get the list of files from the header instead of seeking within the archive?

PS: I should also note that bsdtar or libarchive pounds /etc/localtime. Stracing the above call to bsdtar leads to 3k system calls, a third of that is stat'ing /etc/localtime, unzip makes a single one.

@kientzle
Copy link
Contributor

Is it possible to get the list of files from the header instead of seeking within the archive?

This should be possible; seeking to the file contents should not be required until/unless you actually try to read the data. I think some of the logic for this already exists within archive_read_support_format_zip, but it may be incomplete.

Stracing the above call to bsdtar leads to 3k system calls, a third of that is stat'ing /etc/localtime

Libarchive is not doing that directly. We do of course use some C library time functions, but I'm not sure which one would be doing that. What OS are you using? What libc version?

@hadess
Copy link
Author

hadess commented Mar 27, 2017

This should be possible; seeking to the file contents should not be required until/unless you actually try to read the data. I think some of the logic for this already exists within archive_read_support_format_zip, but it may be incomplete.

The streamable version seems to try much harder to not read data if it can avoid it, but I'm really not sure I'm reading this right.

Libarchive is not doing that directly. We do of course use some C library time functions, but I'm not sure which one would be doing that. What OS are you using? What libc version?

Looks like tons of call to mktime(). But probably nothing to worry about too much in terms of performance. Pretty sure that's heavily cached as it is ;)

But the stat() calls are a pretty good indicator of the calls to zip_time(), which is called when the headers are populated, and read. An early zip_time() call:

#0  0x00007ffff7851d05 in __GI___xstat (vers=vers@entry=1, name=name@entry=0x7ffff78e836b "/etc/localtime", buf=buf@entry=0x7fffffffda60) at ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c:35
#1  0x00007ffff7816cbd in __tzfile_read (file=file@entry=0x7ffff78e836b "/etc/localtime", extra=extra@entry=0, extrap=extrap@entry=0x0) at tzfile.c:161
#2  0x00007ffff7816718 in tzset_internal (explicit=1, always=1) at tzset.c:441
#3  __tzset () at tzset.c:589
#4  0x00007ffff7814e09 in __GI_mktime (tp=tp@entry=0x7fffffffdb70) at mktime.c:588
#5  0x00007ffff7b76fda in zip_time (p=p@entry=0x6033f2 "\264\222J?/\340oChe\021") at libarchive/archive_read_support_format_zip.c:413
#6  0x00007ffff7b78ca3 in slurp_central_directory (zip=0x6030a0, a=0x602010) at libarchive/archive_read_support_format_zip.c:2660
#7  archive_read_format_zip_seekable_read_header (a=0x602010, entry=0x602a10) at libarchive/archive_read_support_format_zip.c:2967
#8  0x00007ffff7b44928 in _archive_read_next_header2 (_a=_a@entry=0x602010, entry=0x602a10) at libarchive/archive_read.c:649
#9  0x00007ffff7b44a1f in _archive_read_next_header (_a=0x602010, entryp=0x7fffffffdce0) at libarchive/archive_read.c:687
#10 0x0000000000400945 in main (argc=2, argv=0x7fffffffdde8) at list.c:20

And this is a late call:

#0  0x00007ffff7851d05 in __GI___xstat (vers=vers@entry=1, name=name@entry=0x7ffff78e836b "/etc/localtime", buf=buf@entry=0x7fffffffd9d0) at ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c:35
#1  0x00007ffff7816cbd in __tzfile_read (file=file@entry=0x7ffff78e836b "/etc/localtime", extra=extra@entry=0, extrap=extrap@entry=0x0) at tzfile.c:161
#2  0x00007ffff7816718 in tzset_internal (explicit=1, always=1) at tzset.c:441
#3  __tzset () at tzset.c:589
#4  0x00007ffff7814e09 in __GI_mktime (tp=tp@entry=0x7fffffffdae0) at mktime.c:588
#5  0x00007ffff7b76fda in zip_time (p=p@entry=0x6033ba "\211\242<?\314\004\250-\237Y\b") at libarchive/archive_read_support_format_zip.c:413
#6  0x00007ffff7b77b3f in zip_read_local_file_header (a=a@entry=0x602010, entry=entry@entry=0x602a10, zip=zip@entry=0x6030a0) at libarchive/archive_read_support_format_zip.c:798
#7  0x00007ffff7b78767 in archive_read_format_zip_seekable_read_header (a=<optimized out>, entry=0x602a10) at libarchive/archive_read_support_format_zip.c:3008
#8  0x00007ffff7b44928 in _archive_read_next_header2 (_a=_a@entry=0x602010, entry=0x602a10) at libarchive/archive_read.c:649
#9  0x00007ffff7b44a1f in _archive_read_next_header (_a=0x602010, entryp=0x7fffffffdce0) at libarchive/archive_read.c:687
#10 0x0000000000400945 in main (argc=2, argv=0x7fffffffdde8) at list.c:20

Looks like it's decoding the whole header, getting info such as the time of creation of all those entries, and throwing it away. This is the simple test app I used.

#include <archive.h>
#include <archive_entry.h>

int
main(int argc, const char **argv)
{
        struct archive *a; 
        struct archive_entry *entry;
        int r;

        a = archive_read_new();
        archive_read_support_format_zip(a);
        if ((r = archive_read_open_filename(a, argv[1], 10240))) {
                printf ("archive_read_open_filename(): %s\n", archive_error_string(a));
                return 1;
        }
        for (;;) {
                r = archive_read_next_header(a, &entry);
                if (r == ARCHIVE_EOF)
                        break;
                if (r != ARCHIVE_OK) {
                        printf ("archive_read_next_header(): %s\n", archive_error_string(a));
                        return 1;
                }
                printf ("%s\n", archive_entry_pathname(entry));
        }
        return 0;
}

@nh2
Copy link

nh2 commented Apr 2, 2022

I also found this when trying to figure out why bsdtar xf is 12x slower than unzip when unpacking a 9 KB file from a 33 GB zip file.

Using strace -fc on bsdtar:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 52.04    7.010728           5   1230981   1230961 openat
 30.99    4.174871          16    253082           read
 16.92    2.279220           4    502798           lseek

In comparison, unzip:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.82    0.006976           0      9264           read

bsdtar runs

openat(AT_FDCWD</path/to/pwd/dir>, "/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

over 1 million times.

unzip takes 0.37 s elapsed, bsdtar takes 4.73 s elapsed.

@jsonn
Copy link
Contributor

jsonn commented Apr 2, 2022

That sounds like a bug in your libc.

@nh2
Copy link

nh2 commented Apr 2, 2022

I think it's because of tm_isdst = -1:

https://pubs.opengroup.org/onlinepubs/009695399/functions/mktime.html says:

A negative value for tm_isdst shall cause mktime() to attempt to determine whether Daylight Savings Time is in effect for the specified time.

glibc man pages say:

and a negative value means that mktime() should (use timezone information and system databases to) attempt to determine whether DST is in effect at the specified time.

My gdb stack trace:

(gdb) bt
#0  0x00007ffff787af7d in __open_nocancel () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#1  0x00007ffff7808955 in _IO_file_open () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#2  0x00007ffff7808a23 in __GI__IO_file_fopen () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#3  0x00007ffff77fce19 in __fopen_internal () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#4  0x00007ffff7844732 in __tzfile_read () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#5  0x00007ffff7844308 in tzset_internal () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#6  0x00007ffff7844426 in tzset () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#7  0x00007ffff78431b9 in timelocal () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#8  0x00007ffff7f5c5d7 in zip_time () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#9  0x00007ffff7f5de17 in zip_read_local_file_header () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#10 0x00007ffff7f5ec02 in archive_read_format_zip_seekable_read_header () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#11 0x00007ffff7f2444a in _archive_read_next_header2 () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#12 0x00007ffff7f2454f in _archive_read_next_header () from /nix/store/cq53l1fj1sr0yl9hmxr69cayl1kygs1d-libarchive-3.5.2-lib/lib/libarchive.so.13
#13 0x0000000000406e86 in read_archive ()
#14 0x00000000004075f7 in tar_mode_x ()
#15 0x000000000040579f in main ()

@jsonn
Copy link
Contributor

jsonn commented Apr 2, 2022

Sure, that's the intention. But libc should still remember the TZ setting and not look it up over and over again.

@nh2
Copy link

nh2 commented Apr 3, 2022

It looks like it's controversial whether or not the libc should cache /etc/localtime contents; the user might change duing the runtime of a long-running program.

musl caches it, glibc doesn't. There's some musl discussion of the pros and cons of the behaviour here:

The trade-off of musl's caching approach is that the program has to do a dance with changing and resetting the TZ variable if it wants to force a re-read of /etc/localtime at runtime. In general the consensus appears to be that the C time functions are badly designed, so neither implementation is perfect.

I found that on glibc, a workaround is setting the environment variable TZ=:/etc/localtime. This will result in a single openat() at the beginning for bsdtar.

Remains to check what unzip does (it doesn't suffer the problem with glibc), and if libarchive could do the same.

In general, it seems best to query the system for the timezone once, and then re-use this info for formatting of times without re-querying it.

Onther question though is that https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure says

The ZIP format has no notion of time zone, so timestamps are only meaningful if it is known what time zone they were created in.

which makes me wonder whether bothering with time zones is worth it at all regarding to mtimes of the archive contents.

@hadess
Copy link
Author

hadess commented Apr 3, 2022

I found that on glibc, a workaround is setting the environment variable TZ=:/etc/localtime. This will result in a single openat() at the beginning for bsdtar.

That doesn't seem to make a difference to the overall speed of listing the contents of a ZIP file, even if it would be nicer to avoid all those extra calls.

@kientzle
Copy link
Contributor

kientzle commented Apr 3, 2022

Have you tested what would happen if you forced ts.tm_isdst to zero instead of -1?

If forcing ts.tm_isdst here really does make a difference, that suggests that we should consider other ways to handle times here. A full solution will not be simple: In particular, we cannot cache ts.tm_isdst because we're calling this function with many different times on different dates, and cannot know which of those may be subject to DST rules. But we could store the timestamp in a different form in archive_entry and only convert lazily. What that different form should be would require some research; I suspect some of our other format code may have a similar issue and that may impact how we handle this.

@kientzle
Copy link
Contributor

kientzle commented Apr 3, 2022

The ZIP format has no notion of time zone, so timestamps are only meaningful if it is known what time zone they were created in.

which makes me wonder whether bothering with time zones is worth it at all regarding to mtimes of the archive contents

Without time zone information, I think the only reasonable behavior is to assume the time stamps are using the local time zone, which is exactly what we do here.

@kientzle
Copy link
Contributor

kientzle commented Apr 3, 2022

We should also double-check how the Zip code here is working with UT extensions. The UT extension stores mtime following POSIX conventions; if a UT extension is being used, we should be able to skip converting the old-style Zip timestamp information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants