tar: bogus GNUSparseFile.0 directory extracting http://www.qemu-advent-calendar.org/2023/download/day01.tar.gz #469

Open
loreb opened this issue Dec 5, 2023 · 13 comments

Comments

@loreb

loreb commented Dec 5, 2023

Sorry for the URL, I was unable to create a smaller example.

$ tar zxpf day01.tar.gz # GNU tar
tar: Ignoring unknown extended header keyword 'SCHILY.acl.ace'
tar: Ignoring unknown extended header keyword 'SCHILY.acl.ace'
tar: Ignoring unknown extended header keyword 'SCHILY.acl.ace'
tar: Ignoring unknown extended header keyword 'SCHILY.acl.ace'
# day01/TinyCore-current.iso is ok

$ busybox tar zxpf day01.tar.gz
$ file day01/GNUSparseFile.0/TinyCore-current.iso
day01/GNUSparseFile.0/TinyCore-current.iso: data

$ toybox tar zxpf day01.tar.gz
$ file day01/GNUSparseFile.0/TinyCore-current.iso
day01/GNUSparseFile.0/TinyCore-current.iso: data

$ star zxpf day01.tar.gz # Schilly tar
star: Unknown extended header keyword 'GNU.sparse.major' ignored at 4.
star: Unknown extended header keyword 'GNU.sparse.minor' ignored at 4.
star: Unknown extended header keyword 'GNU.sparse.name' ignored at 4.
star: Unknown extended header keyword 'GNU.sparse.realsize' ignored at 4.
star: 2228 blocks + 1024 bytes (total of 22815744 bytes = 22281.00k).
$ file day01/GNUSparseFile.0/TinyCore-current.iso
day01/GNUSparseFile.0/TinyCore-current.iso: data
@landley
Owner

landley commented Dec 6, 2023

I fetched and extracted it successfully. Looks like they fixed it upstream?

@loreb
Author

loreb commented Dec 6, 2023

Not sure if there was a miscommunication or a deeper mystery...
The problem is that (at least on my machine)

$ wget http://www.qemu-advent-calendar.org/2023/download/day01.tar.gz
$ sha256sum day01.tar.gz
66f0b7dc82ce30ed44e74b6b87e68ada2c38cf02fc18dbd283293a2e829834c3  day01.tar.gz
$ tar zxpvf day01.tar.gz
tar: Ignoring unknown extended header keyword 'SCHILY.acl.ace'
day01/
tar: Ignoring unknown extended header keyword 'SCHILY.acl.ace'
day01/TinyCore-current.iso
tar: Ignoring unknown extended header keyword 'SCHILY.acl.ace'
day01/adv-cal.txt
tar: Ignoring unknown extended header keyword 'SCHILY.acl.ace'
day01/run.sh
$ file day01/TinyCore-current.iso
day01/TinyCore-current.iso: ISO 9660 CD-ROM filesystem data (DOS/MBR boot sector) 'TinyCore' (bootable)
$ sha256sum day01/TinyCore-current.iso
62e78d715dfa86d7d486e3286b0215383dbeb99966bf0ceef7efb18f88caea21  day01/TinyCore-current.iso
$ rm -fr day01
$ sha256sum day01.tar.gz
66f0b7dc82ce30ed44e74b6b87e68ada2c38cf02fc18dbd283293a2e829834c3  day01.tar.gz
$ toybox tar zxpvf day01.tar.gz
day01/
day01/GNUSparseFile.0/TinyCore-current.iso
day01/adv-cal.txt
day01/run.sh
$ file day01/GNUSparseFile.0/TinyCore-current.iso
day01/GNUSparseFile.0/TinyCore-current.iso: data
$ sha256sum day01/GNUSparseFile.0/TinyCore-current.iso
285a4f610e70b90c73404216f0477b6b45885b44f47b59f2ca0b21a70ea42144  day01/GNUSparseFile.0/TinyCore-current.iso

I tried checking out the repo and using toybox compiled from that and that's what I got (and btw the GNUSparseFile version is 2MB smaller, 22MB instead of 24), identical results downloading with a browser / using busybox, star and toybox from my distro; if the hashes don't match for you please let me know and I'll try to figure out whatever is wrong on my end.

@landley
Owner

landley commented Dec 7, 2023

I just wget it again and the sha1sum here was 8a872bf0436360a9ca23555977f1dfd3155673e6 which was a tarball that extracted fine for me. Possibly that website is using something in the http headers to identify machine type and produce different tarballs for different systems?

Jorg Schilling's "star" utility is nonstandard. He made up a bunch of his own headers (with his name in them for some reason) because he hated Linux and wanted to be intentionally incompatible with it (as in this came to a head 18 years ago, when Linus Torvalds explicitly advised everyone to just ignore him in https://lkml.iu.edu/hypermail/linux/kernel/0408.1/0007.html and note that 39 of the replies to that message are from Jorg, mostly insulting Linux). Eventually the distros agreed ala https://slashdot.org/story/06/09/04/1335226/debian-kicks-jrg-schilling and a few years later he tried to strongarm Red Hat into shipping his package again by flexing licensing and posting on the fedora-legal list ala https://listman.redhat.com/archives/fedora-legal-list/2009-July/msg00000.html (which, as usual, did not go well for him).

You yourself used "star" on the tarball (why do you even have it?) and it complained that it doesn't understand the gnu extensions introduced back in the 1990s. The bug is that star produces broken tarballs. I am not adding support for an intentionally broken tool, and can't even provide workarounds without a test case.

@loreb
Author

loreb commented Dec 7, 2023

The sha1 matches, I still believe we are misunderstanding each other.

I have star because gnu tar says tar: Ignoring unknown extended header keyword 'SCHILY.acl.ace' => I figured star would be able to grok that file, but it doesn't.

The problem is not with that SCHILY thing. The problem is that they archived a sparse file using I'm-not-sure-what, and the result is a weird archive that gnu tar and apparently bsdtar understand, while busybox/openbsd/star/toybox corrupt TinyCore-current.iso: they extract it into the wrong directory but believe it extracted ok. Do the comparison yourself.

An incomplete test case (it is missing the SCHILY headers):

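#! /bin/sh
# Make a sparse file: seek past 1024 blocks (512 bytes each), then write one block of random data.
f=crap.dat
dd of="$f" seek=$((1<<10)) if=/dev/urandom count=1
t=q.tar
rm -f "$t"
# GNU tar: store the file as a sparse member in pax format.
tar cpf "$t" --sparse --posix "$f"
# Extract with toybox; the member lands under a GNUSparseFile.* directory.
toybox tar xpvf "$t"
# Compare the extracted copy against the original.
cmp G*/"$f" "$f"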

I had to use gnu tar above because with star cpf -sparse toybox correctly complains that the result is not a tar file.

In fact, try bsdtar cpf b.tar crap.dat && toybox.static tar tf b.tar: the result is the same, and the directory is consistently "GNUSparseFile.0" instead of some random number like with gnu tar.

I have no idea if handling that is easy, nightmarish or something you have ruled out already, I just hope that detecting it and aborting or something should be feasible.

@landley
Owner

landley commented Dec 7, 2023

Toybox tar used the filename the record said to:

00000a00 64 61 79 30 31 2f 47 4e 55 53 70 61 72 73 65 46 |day01/GNUSparseF|
00000a10 69 6c 65 2e 30 2f 54 69 6e 79 43 6f 72 65 2d 63 |ile.0/TinyCore-c|
00000a20 75 72 72 65 6e 74 2e 69 73 6f 00 00 00 00 00 00 |urrent.iso......|

If busybox and openbsd can't extract it either, this looks like a fairly widely unsupported format? I'll try to ping a freebsd expert to ask if this is documented anywhere.

Does posix even HAVE the concept of sparse files anywhere in the standard? Hmmm, the word shows up in two pages of the 2013 edition: V4_xsh_chap02 says "the values in uid_t, gid_t, and pid_t will be numbers generally, and potentially both large in magnitude and sparse" which is unrelated to filesystems, and the du utility does mention that some filesystems create sparse files with lseek and "It is up to the implementation to define exactly how accurate its methods are" in describing the space used by such files.

Posix yanked tar long before then, and neither https://pubs.opengroup.org/onlinepubs/007908799/xcu/tar.html nor https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html contain the word "sparse", so combining --posix with --sparse means what exactly?

@loreb
Author

loreb commented Dec 8, 2023

Toybox tar used the filename the record said to:

00000a00 64 61 79 30 31 2f 47 4e 55 53 70 61 72 73 65 46 |day01/GNUSparseF|
00000a10 69 6c 65 2e 30 2f 54 69 6e 79 43 6f 72 65 2d 63 |ile.0/TinyCore-c|
00000a20 75 72 72 65 6e 74 2e 69 73 6f 00 00 00 00 00 00 |urrent.iso......|

I'm useless here, the only thing I know about the tar format is that it had all sorts of limits and was extended in all sorts of incompatible ways, kinda like make.
At a very quick glance (https://github.com/libarchive/libarchive/blob/master/libarchive/archive_write_set_format_pax.c#L1252) it seems to me that they set some "this is a sparse" flag before storing the file with a made-up name; their tar(5) looks fairly detailed, but again, at a quick glance.
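For illustration, here is a minimal Python sketch of how an extractor could spot that flag. It assumes the standard pax extended header layout (records of the form `<length> <key>=<value>\n`, where the length counts the whole record) and the `GNU.sparse.*` keywords that star printed above. The function names are made up for this example, and the `524800` value is just the size of the `crap.dat` test file (1024 skipped blocks plus one written block of 512 bytes), not something read from the qemu tarball.

```python
def parse_pax_records(payload: bytes) -> dict[str, str]:
    """Parse pax extended header records: b'<len> <key>=<value>\\n' repeated."""
    records = {}
    pos = 0
    # The payload is NUL-padded to a 512-byte boundary inside the archive.
    while pos < len(payload) and payload[pos:pos + 1] != b"\0":
        space = payload.index(b" ", pos)
        length = int(payload[pos:space])            # counts digits, space, body, newline
        record = payload[space + 1:pos + length]    # b"key=value\n"
        key, _, value = record[:-1].partition(b"=")  # drop the trailing newline
        records[key.decode()] = value.decode()
        pos += length
    return records


def make_record(key: str, value: str) -> bytes:
    """Build one record; the length prefix has to include its own digits."""
    body = f" {key}={value}\n".encode()
    length = len(body)
    while len(str(length)) + len(body) != length:
        length = len(str(length)) + len(body)
    return str(length).encode() + body


def is_gnu_sparse_1_0(records: dict[str, str]) -> bool:
    """True if the member is flagged as GNU sparse format 1.0 (hypothetical helper)."""
    return records.get("GNU.sparse.major") == "1" and records.get("GNU.sparse.minor") == "0"


if __name__ == "__main__":
    payload = b"".join([
        make_record("GNU.sparse.major", "1"),
        make_record("GNU.sparse.minor", "0"),
        make_record("GNU.sparse.name", "crap.dat"),
        make_record("GNU.sparse.realsize", "524800"),
    ])
    recs = parse_pax_records(payload)
    if is_gnu_sparse_1_0(recs):
        print("sparse member, real name:", recs["GNU.sparse.name"])
```

If the flag is present, the name in the ustar header (the `GNUSparseFile.0` one) is only a placeholder, and `GNU.sparse.name` carries the name the file should be extracted to.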

If busybox and openbsd can't extract it either, this looks like a fairly widely unsupported format? I'll try to ping a freebsd expert to ask if this is documented anywhere.

Great, thank you!
Just for the record, bsdtar is used by freebsd and netbsd (and judging by the manpage macs) and recent versions of windows(!).

Does posix even HAVE the concept of sparse files anywhere in the standard? Hmmm, the word shows up in two pages of the 2013 edition: V4_xsh_chap02 says "the values in uid_t, gid_t, and pid_t will be numbers generally, and potentially both large in magnitude and sparse" which is unrelated to filesystems, and the du utility does mention that some filesystems create sparse files with lseek and "It is up to the implementation to define exactly how accurate its methods are" in describing the space used by such files.

Posix yanked tar long before then, and neither https://pubs.opengroup.org/onlinepubs/007908799/xcu/tar.html nor https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html contain the word "sparse", so combining --posix with --sparse means what exactly?

Err, those flags are my fault, I tried to recreate that abomination with gnu tar by trial and error, and only later I realized that bsdtar creates them by default, as in dd of=crap.dat seek=$((1<<10)) if=/dev/urandom count=1 && bsdtar cpf - crap.dat |toybox tar tf -

@enh-google
Collaborator

and judging by the manpage macs

yes, macOS uses bsdtar. it'll explicitly admit to it if you do tar --version. looks like bsdtar 3.5.3 is current.

@landley
Owner

landley commented Dec 9, 2023

Sigh. It's probably one of the zillions in https://www.gnu.org/software/tar/manual/tar.html#Sparse-Formats so I can fish it out and implement it if necessary, but... the fact neither busybox nor openbsd support this kind of implies that either consuming tarballs produced by MacOS doesn't come up much, or the use of homebrew is fairly widespread...

Ah, it looks like they even made a tool called xsparse to convert such files after extracting them: https://www.gnu.org/software/tar/manual/tar.html#Sparse-Recovery
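A rough Python sketch of the kind of recovery such a tool would have to do, assuming the archive uses what the GNU tar manual calls PAX format 1.0: as I read that description, the stored data starts with a newline-delimited decimal map (entry count, then offset/size pairs), NUL-padded to a 512-byte boundary, followed by the data segments back to back. `expand_sparse_1_0` is a name invented here, and `realsize` is expected to come from the `GNU.sparse.realsize` keyword.

```python
BLOCK = 512


def expand_sparse_1_0(stored: bytes, realsize: int) -> bytes:
    """Rebuild the original file from a member stored with a leading sparse map."""
    pos = 0

    def next_number() -> int:
        nonlocal pos
        end = stored.index(b"\n", pos)
        value = int(stored[pos:end])
        pos = end + 1
        return value

    count = next_number()
    # Each map entry is an (offset, size) pair describing one run of real data.
    segments = [(next_number(), next_number()) for _ in range(count)]

    # The map is padded with NULs up to the next block boundary.
    pos = -(-pos // BLOCK) * BLOCK

    out = bytearray(realsize)  # holes stay zero-filled
    for offset, size in segments:
        chunk = stored[pos:pos + size]
        if len(chunk) != size or offset + size > realsize:
            raise ValueError("sparse map does not match the stored data")
        out[offset:offset + size] = chunk
        pos += size
    return bytes(out)


if __name__ == "__main__":
    # Same shape as the crap.dat test case: 1024 empty blocks, then one data block.
    sparse_map = b"1\n524288\n512\n"
    stored = sparse_map.ljust(BLOCK, b"\0") + b"\xab" * 512
    restored = expand_sparse_1_0(stored, 524800)
    print(len(restored))
```

This builds the whole file in memory to keep the sketch short; a real extractor would seek over the holes instead of writing zeros.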

@landley
Owner

landley commented Dec 9, 2023

The next question is, if we create normal linux tarballs and send them to macos, will their tar be able to extract them? Because "add another extract codepath to handle incoming crazy" is doable, but opens the door to somebody later complaining "you are sending me bad tarballs" and having expectations that because we added a workaround over HERE, we need to do it on the compression side too, and of course a mechanism to specify when to trigger it, and THEN the question is should that be the automatic behavior when building on macos and freebsd, so that tar behaves differently based on the OS it's built on..

I could also add an "xsparse" command to toybox.

@loreb
Author

loreb commented Dec 9, 2023

Unless they are shipping some ancient buggy version there should be no problem, bsdtar can read the output of toybox just fine; I see no basis for a complaint, and if extracting them turns out to be unreasonable it's fine to just recognize it's a sparse file and warn about it (just please remember to recap at the end in case of a verbose extract).

TIL about xsparse, cool!

@enh
Contributor

enh commented Dec 9, 2023

"hello, i'll be your macOS test monkey for today..."

i think no surprises with the original tar file, because we were expecting this to untar fine on macOS:

/tmp$ uname -a
Darwin m2 23.1.0 Darwin Kernel Version 23.1.0: Mon Oct  9 21:28:31 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T8112 arm64
/tmp$ curl --output day01.tar.gz http://www.qemu-advent-calendar.org/2023/download/day01.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21.4M  100 21.4M    0     0  1661k      0  0:00:13  0:00:13 --:--:-- 2334k
/tmp$ tar ztpf day01.tar.gz 
day01/
day01/TinyCore-current.iso
day01/adv-cal.txt
day01/run.sh
/tmp$ 

there is no --sparse on macOS tar:

/tmp$ tar cpf $t --sparse --posix crap.dat
tar: Option --sparse is not supported
Usage:
  List:    tar -tf <archive-filename>
  Extract: tar -xf <archive-filename>
  Create:  tar -cf <archive-filename> [filenames...]
  Help:    tar --help
/tmp$ 

and it's a fairly up to date version:

/tmp$ tar --version
bsdtar 3.5.3 - libarchive 3.5.3 zlib/1.2.12 liblzma/5.0.5 bz2lib/1.0.8 
/tmp$ 

libarchive seems to have stopped sending out release announcements before this version, so i'm not sure when it's from, other than "later than September 2021, which is when the last version they mention in their release notes shipped".

(the example dd doesn't result in a sparse file on macOS, afaict. certainly all of ls/du/stat report the same size, and Finder's "on disk" size is the same too. i have never understood macOS' rules for sparse files...)
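The ls/du/stat comparison can be scripted. A small Python sketch, assuming `st_blocks` is reported in 512-byte units (which Python documents for Unix; the field is not available on Windows); both function names are made up for this example:

```python
import os
import sys


def looks_sparse(size: int, blocks: int) -> bool:
    """True if fewer bytes are allocated on disk than the file's logical size."""
    return blocks * 512 < size


def file_looks_sparse(path: str) -> bool:
    st = os.stat(path)
    return looks_sparse(st.st_size, st.st_blocks)


if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(name, "sparse" if file_looks_sparse(name) else "not sparse")
```

This is only a heuristic: filesystems with transparent compression can also report fewer allocated blocks than the logical size.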

@emaste
Contributor

emaste commented Dec 11, 2023

In libarchive GNUSparseFile.0 comes from build_gnu_sparse_name https://github.com/libarchive/libarchive/blob/0f410f8cfb512f1c7e36f01bd141f2e929281c26/libarchive/archive_write_set_format_pax.c#L1741

This happens when

		/* We use GNU-tar-compatible sparse attributes. */
		if (sparse_count > 0) {
...
			 * PAX Format 1.0 requires */
			archive_entry_set_pathname(entry_main,
			    build_gnu_sparse_name(gnu_sparse_name,
			        entry_name.s));

From bsdtar's man page:

     --read-sparse
             (c, r, u modes only) Read sparse file information from disk.
             This is the reverse of --no-read-sparse and the default behavior.

The next question is, if we create normal linux tarballs and send them to macos, will their tar be able to extract them?

bsdtar (via libarchive) should be able to extract just about anything.

... THEN the question is should that be the automatic behavior when building on macos and freebsd, so that tar behaves differently based on the OS it's built on..

To this I'd say definitely not.

As for what to do, maybe bsdtar should disable sparse by default under --posix

     --posix
             (c, r, u mode only) Synonym for --format pax

ISTM toybox should at least warn about these archives, but adding support seems reasonable.

As to why this hasn't been encountered more commonly, my guess is that there are so many opportunities to avoid it:

  • many archives are created by GNU tar which does not do sparse detection by default (IIUC)
  • GNU tar and bsdtar/libarchive are the most common implementations used to extract an archive, and both handle it transparently
  • sparse files are less common (actually sparse, not just blocks of zeros)
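As a sketch of the "at least warn" option: an extractor could flag member names that contain a `GNUSparseFile.<number>` directory component, which is the placeholder pattern seen in this thread (`.0` from bsdtar, some other number from GNU tar). The Python below is only a heuristic with a made-up function name; the authoritative real name would be the `GNU.sparse.name` keyword in the pax header.

```python
import re

# Placeholder directory inserted just before the final path component.
_SPARSE_DIR = re.compile(r"(?:^|/)GNUSparseFile\.\d+/(?=[^/]+$)")


def sparse_placeholder_target(name: str):
    """Return the likely real path if name uses the placeholder, else None."""
    match = _SPARSE_DIR.search(name)
    if match is None:
        return None
    prefix = name[:match.start()]
    base = name[match.end():]
    return f"{prefix}/{base}" if prefix else base


if __name__ == "__main__":
    for member in ("day01/GNUSparseFile.0/TinyCore-current.iso", "day01/run.sh"):
        target = sparse_placeholder_target(member)
        if target is not None:
            print(f"warning: {member} looks like a GNU sparse member for {target}")
```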

@landley
Owner

landley commented Dec 12, 2023

Add read support, but keep writing the existing one. Got it.

Also I have an existing todo item here: add --hole-detection=raw and move most of the sparse tests over to it, with one test using the seek method but calling du or something to skip if the filesystem doesn't support it. (The reason it's still on the todo list is granularity: I think aligned 512 byte runs of zeroes are a win to count as holes? One of them in the whole file is a wash because it's extra 512 byte header to save 512 bytes data, but more than one is a win for tarball size even if fragmented. How that works out with compression I couldn't tell you. I dunno what the underlying filesystem will do for different sizes of seeks, that's why I need to add "raw" in the first place...)
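To make the granularity question concrete, here is a Python sketch of "raw" hole detection as described above: treat every aligned 512-byte block of zeroes as a hole and report the remaining data runs as (offset, size) pairs. This is one reading of the idea, not toybox's implementation, and `data_segments` is a name invented for the example.

```python
def data_segments(data: bytes, block: int = 512) -> list[tuple[int, int]]:
    """Return (offset, size) runs of data, skipping aligned all-zero blocks."""
    zero = bytes(block)
    segments: list[tuple[int, int]] = []
    for off in range(0, len(data), block):
        chunk = data[off:off + block]
        if chunk == zero[:len(chunk)]:
            continue  # aligned run of zeroes: count it as a hole
        if segments and segments[-1][0] + segments[-1][1] == off:
            # Adjacent to the previous run: extend it instead of starting a new one.
            segments[-1] = (segments[-1][0], segments[-1][1] + len(chunk))
        else:
            segments.append((off, len(chunk)))
    return segments


if __name__ == "__main__":
    data = b"A" * 512 + bytes(1024) + b"B" * 512 + bytes(512)
    runs = data_segments(data)
    skipped = (len(data) - sum(size for _, size in runs)) // 512
    print(runs, "zero blocks skipped:", skipped)
```

With three zero blocks skipped in this input, the accounting above (one extra 512-byte header spent to describe the map) would come out ahead by two blocks before compression.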
