tar: bogus GNUSparseFile.0 directory extracting http://www.qemu-advent-calendar.org/2023/download/day01.tar.gz #469
Comments
I fetched and extracted it successfully. Looks like they fixed it upstream?
Not sure if there was a miscommunication or a deeper mystery...
I tried checking out the repo and using toybox compiled from that, and that's what I got (and btw the GNUSparseFile version is 2MB smaller, 22MB instead of 24). I get identical results downloading with a browser, and using busybox, star and toybox from my distro. If the hashes don't match for you, please let me know and I'll try to figure out whatever is wrong on my end.
I just fetched it again with wget and the sha1sum here was 8a872bf0436360a9ca23555977f1dfd3155673e6, which was a tarball that extracted fine for me. Possibly that website is using something in the HTTP headers to identify machine type and produce different tarballs for different systems?

Jorg Schilling's "star" utility is nonstandard. He made up a bunch of his own headers (with his name in them for some reason) because he hated Linux and wanted to be intentionally incompatible with it. This came to a head 18 years ago, when Linus Torvalds explicitly advised everyone to just ignore him in https://lkml.iu.edu/hypermail/linux/kernel/0408.1/0007.html (note that 39 of the replies to that message are from Jorg, mostly insulting Linux). Eventually the distros agreed, ala https://slashdot.org/story/06/09/04/1335226/debian-kicks-jrg-schilling and a few years later he tried to strongarm Red Hat into shipping his package again by flexing licensing and posting on the fedora-legal list, ala https://listman.redhat.com/archives/fedora-legal-list/2009-July/msg00000.html which, as usual, did not go well for him.

You yourself used "star" on the tarball (why do you even have it?) and it complained that it doesn't understand the GNU extensions introduced back in the 1990s. The bug is that star produces broken tarballs. I am not adding support for an intentionally broken tool, and can't even provide workarounds without a test case.
The sha1 matches, I still believe we are misunderstanding each other. I have star because gnu tar says The problem is not with that SCHILY thing, the problem is that they archived a sparse file using Im-not-sure-what and it results in a weird archive that is understood by gnu tar and apparently bsdtar, while busybox/openbsd/star/toybox corrupt TinyCore-current.iso attempting to extract it in the wrong directory but they believe it's extracted ok - do the comparison yourself. A non complete (missing the schilly) test case:
I had to use gnu tar above because with In fact try I have no idea if handling that is easy, nightmarish or something you have ruled out already, I just hope that detecting it and aborting or something should be feasible.
Toybox tar used the filename the record said to:

    00000a00 64 61 79 30 31 2f 47 4e 55 53 70 61 72 73 65 46 |day01/GNUSparseF|

If busybox and openbsd can't extract it either, this looks like a fairly widely unsupported format? I'll try to ping a freebsd expert to ask if this is documented anywhere.

Does posix even HAVE the concept of sparse files anywhere in the standard? Hmmm, the word shows up in two pages of the 2013 edition: V4_xsh_chap02 says "the values in uid_t, gid_t, and pid_t will be numbers generally, and potentially both large in magnitude and sparse" which is unrelated to filesystems, and the du utility does mention that some filesystems create sparse files with lseek and "It is up to the implementation to define exactly how accurate its methods are" in describing the space used by such files. Posix yanked tar long before then, and neither https://pubs.opengroup.org/onlinepubs/007908799/xcu/tar.html nor https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html contain the word "sparse", so combining --posix with --sparse means what exactly?
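For readers following along, here is a minimal sketch of where that filename comes from: the name is simply the first 100 bytes of the 512-byte header block, NUL-terminated. The full path used in the example is an assumption pieced together from this thread (only the first 16 bytes appear in the hexdump), and the ustar prefix field is ignored for brevity.

```python
def header_name(block: bytes) -> str:
    """Return the name stored in the first 100 bytes of a 512-byte tar header."""
    if len(block) != 512:
        raise ValueError("a tar header block is exactly 512 bytes")
    # The name field is NUL-terminated unless it fills all 100 bytes.
    raw = block[:100].split(b"\0", 1)[0]
    return raw.decode("utf-8", errors="replace")


# Hypothetical header: only the name field is filled in, everything else is NUL.
# The path is assumed from the thread, not copied from the real archive.
example = b"day01/GNUSparseFile.0/TinyCore-current.iso".ljust(512, b"\0")
print(header_name(example))
```

An extractor that trusts this field alone will create the GNUSparseFile.0 directory, which matches the behaviour reported here.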
I'm useless here, the only thing I know about the tar format is that it had all sorts of limits and was extended in all sorts of incompatible ways, kinda like make.
Great, thank you!
Err, those flags are my fault, I tried to recreate that abomination with gnu tar by trial and error, and only later I realized that bsdtar creates them by default, as in
yes, macOS uses bsdtar. it'll explicitly admit to it if you do
Sigh. It's probably one of the zillions in https://www.gnu.org/software/tar/manual/tar.html#Sparse-Formats so I can fish it out and implement it if necessary, but... the fact neither busybox nor openbsd support this kind of implies that either consuming tarballs produced by macOS doesn't come up much, or the use of homebrew is fairly widespread... Ah, it looks like they even made a tool called xsparse to convert such files after extracting them: https://www.gnu.org/software/tar/manual/tar.html#Sparse-Recovery
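As a rough illustration of what such a recovery step involves, here is a sketch that assumes the archive uses the PAX 1.0 sparse layout from the GNU tar manual (the thread has not confirmed which variant this is). In that layout the member body starts with a newline-delimited decimal map (entry count, then offset/size pairs), NUL-padded to a 512-byte boundary, followed by the data runs back to back. The `realsize` argument is assumed to come from the `GNU.sparse.realsize` pax record, and the function name is made up for this example.

```python
def expand_sparse_1_0(condensed: bytes, realsize: int) -> bytes:
    """Rebuild original contents from a (presumed) GNU sparse 1.0 member body."""
    pos = 0

    def next_number() -> int:
        # Each map value is a decimal number terminated by a newline.
        nonlocal pos
        end = condensed.index(b"\n", pos)
        value = int(condensed[pos:end])
        pos = end + 1
        return value

    count = next_number()
    entries = [(next_number(), next_number()) for _ in range(count)]
    # The map is padded with NULs up to the next 512-byte boundary.
    pos = (pos + 511) // 512 * 512

    out = bytearray(realsize)  # holes stay zero-filled
    for offset, size in entries:
        chunk = condensed[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("member body is shorter than its sparse map claims")
        out[offset:offset + size] = chunk
        pos += size
    return bytes(out)


# Tiny made-up body: two 4-byte data runs at offsets 0 and 1024.
body = b"2\n0\n4\n1024\n4\n".ljust(512, b"\0") + b"headtail"
restored = expand_sparse_1_0(body, realsize=1028)
```

This builds the whole file in memory, which is fine for a demonstration but not for a 24MB ISO; a real tool would seek in the output file instead.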
The next question is, if we create normal linux tarballs and send them to macos, will their tar be able to extract them? Because "add another extract codepath to handle incoming crazy" is doable, but opens the door to somebody later complaining "you are sending me bad tarballs" and having expectations that because we added a workaround over HERE, we need to do it on the compression side too, and of course a mechanism to specify when to trigger it, and THEN the question is should that be the automatic behavior when building on macos and freebsd, so that tar behaves differently based on the OS it's built on. I could also add an "xsparse" command to toybox.
Unless they are shipping some ancient buggy version there should be no problem, bsdtar can read the output of toybox just fine; I see no basis for a complaint, and if extracting them turns out to be unreasonable it's fine to just recognize it's a sparse file and warn about it (just please remember to recap at the end in case of a verbose extract). TIL about xsparse, cool!
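A minimal sketch of the "recognize and warn, then recap" idea, assuming detection is done purely on the member name. The pattern relies on the `GNUSparseFile.<number>` path component seen in this archive; the function names and warning text are invented for illustration and are not toybox behaviour.

```python
import re

# Matches a "GNUSparseFile.<digits>" directory component anywhere in the path.
_PLACEHOLDER = re.compile(r"(^|/)GNUSparseFile\.\d+/")


def is_sparse_placeholder(name: str) -> bool:
    """True if the member name looks like a GNU/pax sparse placeholder path."""
    return _PLACEHOLDER.search(name) is not None


def warn_and_recap(names, warn=print):
    """Warn per suspicious member, then repeat a summary once at the end."""
    flagged = [n for n in names if is_sparse_placeholder(n)]
    for n in flagged:
        warn(f"warning: {n}: sparse member in an unsupported format")
    if flagged:
        # The recap keeps the warning visible after a long verbose listing.
        warn(f"warning: {len(flagged)} sparse member(s) may be corrupt")
    return flagged


messages = []
flagged = warn_and_recap(
    ["day01/README", "day01/GNUSparseFile.0/TinyCore-current.iso"],
    warn=messages.append,
)
```

Checking the pax `GNU.sparse.*` records would be more robust than matching names, since a legitimate directory could in principle carry this name.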
"hello, i'll be your macOS test monkey for today..." i think no surprises with the original tar file, because we were expecting this to untar fine on macOS:
there is no
and it's a fairly up to date version:
libarchive seems to have stopped sending out release announcements before this version, so i'm not sure when it's from, other than "later than September 2021, which is when the last version they mention in their release notes shipped". (the example |
In libarchive GNUSparseFile.0 comes from This happens when
From bsdtar's man page:
bsdtar (via libarchive) should be able to extract just about anything.
To this I'd say definitely not. As for what to do, maybe bsdtar should disable sparse by default under
ISTM toybox should at least warn about these archives, but adding support seems reasonable. As to why this hasn't been encountered more commonly, my guess is that there are so many opportunities to avoid it:
Add read support, but keep writing the existing one. Got it. Also I have an existing todo item here: add --hole-detection=raw and move most of the sparse tests over to it, with one test using the seek method but calling du or something to skip if the filesystem doesn't support it. (The reason it's still on the todo list is granularity: I think aligned 512 byte runs of zeroes are a win to count as holes? One of them in the whole file is a wash because it's extra 512 byte header to save 512 bytes data, but more than one is a win for tarball size even if fragmented. How that works out with compression I couldn't tell you. I dunno what the underlying filesystem will do for different sizes of seeks, that's why I need to add "raw" in the first place...)
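To make the granularity question concrete, here is a sketch of content-based hole detection at 512-byte alignment: it scans the data itself instead of asking the filesystem, and treats every aligned all-zero block as a hole. This is only an illustration of the idea under that assumption, not how toybox or GNU tar implement it; the names are hypothetical.

```python
BLOCK = 512


def find_data_runs(data: bytes):
    """Return (offset, length) data runs; aligned all-zero blocks are holes."""
    zero = bytes(BLOCK)
    runs = []
    start = None  # offset where the current data run began, if any
    for off in range(0, len(data), BLOCK):
        # A short final block never equals `zero`, so it is kept as data.
        is_hole = data[off:off + BLOCK] == zero
        if is_hole:
            if start is not None:
                runs.append((start, off - start))
                start = None
        elif start is None:
            start = off
    if start is not None:
        runs.append((start, len(data) - start))
    return runs


# 512 bytes of data, a 1024-byte hole, then a 100-byte tail.
sample = b"A" * 512 + bytes(1024) + b"B" * 100
runs = find_data_runs(sample)
saved = len(sample) - sum(length for _, length in runs)
```

Here `saved` is the number of zero bytes that would not need to be stored, which can then be weighed against whatever header overhead the chosen sparse format adds.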
Sorry for the URL, I was unable to create a smaller example.