can't extract large files from pax-format tar files #880

Closed
zholos opened this issue Mar 2, 2017 · 2 comments

zholos commented Mar 2, 2017

When reading pax-format tar files created by GNU tar, bsdtar shows zero size for regular files larger than 8 GB.

$ bsdtar --version
bsdtar 3.3.2dev - libarchive 3.3.2dev zlib/1.2.8
$ tar --version
tar (GNU tar) 1.27.1
$ dd if=/dev/zero of=big bs=1M count=10K # 10 GB, not sparse
$ tar cf gnutar.tar --format=pax big
$ tar tvf gnutar.tar
-rw-r--r-- user/user 10737418240 2017-03-02 15:01 big
$ bsdtar tvf gnutar.tar
-rw-r--r--  0 user   user        0 Mar  2 15:01 big

So GNU tar shows the correct size in the tar file it created, but bsdtar shows (and extracts) zero size.

The same happens with pax-format files created by Python's tarfile module, so I believe the pax format is correct and the bug is in bsdtar.

In pax mode, both GNU tar and Python's tarfile module write 0 to the size field of the ustar header because the real size doesn't fit in 11 octal digits (the 12-byte field is supposed to be space- or NUL-terminated). The correct size is written to the pax header, but bsdtar appears to ignore it.
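
The header layout described above can be inspected without creating a 10 GB file. This is a minimal sketch using Python's tarfile module: `TarInfo.tobuf()` returns only the header blocks, so the entry size can be set by hand. The offsets (size field at byte 124, 12 bytes long) come from the ustar header layout; the assumption that the ustar header is the last 512-byte block holds here because the only extended record is the size.

```python
import tarfile

# Largest value that fits in 11 octal digits (the 12th byte is the terminator).
USTAR_MAX_SIZE = 8**11 - 1

info = tarfile.TarInfo(name="big")
info.size = 10 * 1024**3  # same size as the test file above

# tobuf() builds the header blocks only; no payload is read or written.
buf = info.tobuf(format=tarfile.PAX_FORMAT)

ustar_header = buf[-512:]           # ustar header follows the pax extended header
size_field = ustar_header[124:136]  # 12-byte size field at offset 124

too_big_for_ustar = info.size > USTAR_MAX_SIZE
has_pax_size = b"size=10737418240\n" in buf

print(too_big_for_ustar, size_field, has_pax_size)
```

A reader that trusts only `size_field` sees a zero-length entry; the real size is available only from the `size` record in the extended header.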

bsdtar itself still writes the large size to the ustar header in an extension format (12 octal digits in this case), in addition to writing it to the pax header. According to a comment in archive_write_set_format_pax.c, this is for maximum compatibility. So bsdtar can read its own pax-format output:

$ bsdtar cf bsdtar.tar --format=pax big
$ bsdtar tvf bsdtar.tar
-rw-r--r--  0 user   user 10737418240 Mar  2 15:01 big
$ tar tvf bsdtar.tar
tar: Ignoring unknown extended header keyword 'SCHILY.fflags'
-rw-r--r-- user/user 10737418240 2017-03-02 15:01 big
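
The "12 octal digits" remark can be checked with a little arithmetic. The sketch below only shows that a 10 GB size needs exactly 12 octal digits and therefore fills the whole 12-byte field with no room for a terminator; it does not reproduce the bytes libarchive actually writes, which are not shown in this thread.

```python
size = 10 * 1024**3  # size of the test file

# Standard ustar: 11 octal digits plus one terminator byte.
fits_standard = size <= 8**11 - 1

# Extension described above: all 12 bytes of the field hold octal digits.
field = format(size, "012o")
fits_extended = len(field) == 12

print(fits_standard, field, fits_extended)
```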

mmatuska (Member) commented Mar 5, 2017

I think I have found the cause. When we process the "size" pax header, we apply it only if tar->realsize is negative (we use a value of -1 for unset). That does not work, because in header_common() we always set tar->realsize to header->size. That means the condition in https://github.com/libarchive/libarchive/blob/master/libarchive/archive_read_support_format_tar.c#L2065-L2070 will always evaluate as false:

			if (tar->realsize < 0) {
				archive_entry_set_size(entry,
				    tar->entry_bytes_remaining);
				tar->realsize
				    = tar->entry_bytes_remaining;
			}

@kientzle we have to fix this. I see two ways: the uglier one is to use "tar->realsize <= 0", and the nicer one is to introduce a new marker to test against, e.g. tar->realsize_set, and set it with GNU.sparse.size, GNU.sparse.realsize, SCHILY.realsize and the gnutar header realsize to indicate that these take precedence over "size".
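
The control flow of the two checks can be modelled in a few lines. This is a toy Python model, not libarchive code: `TarState` and the function names are hypothetical stand-ins, and it assumes (as the comment above implies) that the ustar header is processed before the pax "size" keyword is applied. It also ignores entry_bytes_remaining and tracks only realsize.

```python
from dataclasses import dataclass


@dataclass
class TarState:
    """Toy stand-in for the reader state; not libarchive's real struct."""
    realsize: int = -1          # -1 means "unset"
    realsize_set: bool = False  # hypothetical marker proposed above


def header_common(tar: TarState, header_size: int) -> None:
    # Described behaviour: the ustar size always lands in realsize.
    tar.realsize = header_size


def apply_realsize_keyword(tar: TarState, value: int) -> None:
    # Models GNU.sparse.size, GNU.sparse.realsize, SCHILY.realsize, etc.
    tar.realsize = value
    tar.realsize_set = True


def apply_pax_size_current(tar: TarState, pax_size: int) -> None:
    # Current check: never true once header_common() has run.
    if tar.realsize < 0:
        tar.realsize = pax_size


def apply_pax_size_with_marker(tar: TarState, pax_size: int) -> None:
    # Proposed check: "size" applies unless a realsize keyword claimed the value.
    if not tar.realsize_set:
        tar.realsize = pax_size


PAX_SIZE = 10 * 1024**3

current = TarState()
header_common(current, 0)          # GNU tar wrote 0 in the ustar header
apply_pax_size_current(current, PAX_SIZE)

fixed = TarState()
header_common(fixed, 0)
apply_pax_size_with_marker(fixed, PAX_SIZE)

sparse = TarState()
header_common(sparse, 0)
apply_realsize_keyword(sparse, 4096)
apply_pax_size_with_marker(sparse, PAX_SIZE)

print(current.realsize, fixed.realsize, sparse.realsize)
```

In this model the current check leaves the size at 0, the marker lets the pax "size" through, and an explicit realsize keyword still takes precedence over "size".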

jsonn (Contributor) commented Mar 5, 2017

I'd either do the overwriting unconditionally here or use the indicator.
