Skip to content

Parse tar headers incrementally#2127

Merged
kientzle merged 7 commits intolibarchive:masterfrom
kientzle:kientzle-pax-header-streaming
Jun 16, 2024
Merged

Parse tar headers incrementally#2127
kientzle merged 7 commits intolibarchive:masterfrom
kientzle:kientzle-pax-header-streaming

Conversation

@kientzle
Copy link
Copy Markdown
Contributor

@kientzle kientzle commented Apr 14, 2024

This rebuilds the tar reader to parse all header data incrementally as it appears in the stream.

This definitively fixes a longstanding issue with unsupported pax attributes. Libarchive must limit the amount of data that it reads into memory, and this has caused problems with large unknown attributes. By scanning iteratively, we can instead identify an attribute by name and then decide whether to read it into memory or whether to skip it without reading.

This design also allows us to vary our sanity limits for different pax attributes (e.g., an attribute that is a single number can be limited to a few dozen bytes while an attribute holding an ACL is allowed to be a few hundred kilobytes). This allows us to be a little more resistant to malicious archives that might try to force allocation of very large amounts of memory, though there is still work to be done here.

This includes a number of changes to archive_entry processing to allow us to consistently keep the first appearance of any given value instead of the original architecture that recursively cached data in memory in order to effectively process all the data from back-to-front.

Resolves #1855
Resolves #1939

This rebuilds the tar reader to parse all header
data incrementally as it appears in the stream.

This definitively fixes a longstanding issue with unsupported
pax attributes.  Libarchive must limit the amount of data that
it reads into memory, and this has caused problems with large
unknown attributes.  By scanning iteratively, we can instead
identify an attribute and decide whether to read it into memory or
whether to skip it without reading.

This design also allows us to vary our sanity limits for
different pax attributes (e.g., an attribute that is a single number
can be limited to a few dozen bytes while an attribute holding an ACL is
allowed to be a few megabytes).  This allows us to be a little more
resistent to malicious archives that might try to force allocation of
very large amounts of memory, though there is still work to be done
here.

This builds on earlier work to make tar header parsing be purely
iterative by keeping the _first_ appearance of any given value instead
of the original architecture that recursively cached data in memory
in order to effectively process all the data from back-to-front.
@kientzle kientzle requested a review from mmatuska April 14, 2024 18:37
@kientzle
Copy link
Copy Markdown
Contributor Author

CC: @jvreelanda

@kientzle
Copy link
Copy Markdown
Contributor Author

Hmmm.... Looks like I need to look closely at the symlink handling to see why this test is failing on mingw-gcc. All other platforms are testing clean:

173/744 Test #173: libarchive_test_pax_filename_encoding ....................................***Failed    0.02 sec

172: test_pax_filename_encoding
D:\a\libarchive\libarchive\libarchive\test\test_pax_filename_encoding.c:167: filename != archive_entry_hardlink(entry)
                           filename = "abc\xCC\x8Cmno\xFCxyz" (length 12)
      archive_entry_hardlink(entry) = "abc\xE2\x95\xA0\xC3\xAEmno\xE2\x81\xBFxyz" (length 17)
D:\a\libarchive\libarchive\libarchive\test\test_pax_filename_encoding.c:173: filename != archive_entry_symlink(entry)
                          filename = "abc\xCC\x8Cmno\xFCxyz" (length 12)
      archive_entry_symlink(entry) = "abc\xE2\x95\xA0\xC3\xAEmno\xE2\x81\xBFxyz" (length 17)

@kientzle
Copy link
Copy Markdown
Contributor Author

I think the mingw-gcc test failure is pointing out a subtle issue with the new code: Pax archives store information with differing charset conventions. In particular, filenames and link names are stored using unspecified character encoding in the legacy ustar header but in UTF-8 in the pax header. So I've managed to inadvertently break link name conversions, since I now accumulate the raw version and only do a single conversion at the end, at which point I've lost track of which header (and hence which character encoding) the name came from. The solution is probably to accumulate the converted version so I can ensure the correct conversion. (This will be a problem someday when I tackle redesigning the API to eliminate most character-set conversions. Oh, well.)

Instead of keeping a local copy of the linkname, push
it into the entry directly when we see it.  This requires
modifying archive_entry to store a single "linkname" value
and separately mark whether that name is a symlink or hardlink.
This should not break any existing use, since no entry should
ever have both a symlink and a hardlink value.

For tar, this lets us correctly handle PAX link names, which
are stored early in the archive and only later do we find out
whether that name is a symlink or hardlink.  (We did this before
by storing the link name and putting it into the entry only
after we knew whether this was a hard or sym link.)

Nice side effect:  This slightly reduces the size of the
archive_entry struct.

Still TODO:  I need to correctly handle character set
for link names.
@kientzle
Copy link
Copy Markdown
Contributor Author

Reworking the linkname character-set handling seems to have addressed the mingw-gcc regression.

@kientzle kientzle merged commit 2d8a576 into libarchive:master Jun 16, 2024
@kientzle kientzle deleted the kientzle-pax-header-streaming branch July 6, 2024 22:45
mmatuska pushed a commit that referenced this pull request Jul 7, 2024
…harset (#2248)

It would seem as though #2127 conflicted with my change #2228.

I previously thought that the writer was putting info into the archive
that strings were encoded in UTF-8, but I'm not so sure of that
anymore... In any case, explicitly setting `hdrcharset` on the reader as
well is a reasonable alternative and something we do already.
kientzle added a commit to kientzle/libarchive that referenced this pull request Jul 8, 2024
Pax introduced new headers that appear _before_ the legacy
headers.  So pax archives require earlier properties to
override later ones.

Originally, libarchive handled this by storing the early
headers in memory so that it could do the actual parsing
from back to front.  With this scheme, properties from
early headers were parsed last and simply overwrote
properties from later headers.

PR libarchive#2127 reduced memory usage by parsing headers in the
order they appear in the file, which requires later headers
to avoid overwriting already-set properties.  Apparently,
when I made this change, I did not fully consider how charset
translations get handled on Windows, so failed to consistently
recognize when the path or linkname properties were in fact
actually set.  This PR corrects this bug by adding additional
tests to see if the wide character path or linkname properties
are set.

Related:  This bug was exposed by a new test added in libarchive#2228
which does a write/read validation to ensure round-trip filename
handling.  This was modified in libarchive#2248 to avoid tickling the bug above.
I've reverted the change from libarchive#2248 since it's no longer necessary.
I have also added some additional validation to this test to
help ensure that the intermediate archive actually is a pax
format that includes the expected path and linkname properties
in the expected places.
mmatuska pushed a commit that referenced this pull request Jul 9, 2024
Pax introduced new headers that appear _before_ the legacy
headers.  So pax archives require earlier properties to
override later ones.

Originally, libarchive handled this by storing the early
headers in memory so that it could do the actual parsing
from back to front.  With this scheme, properties from
early headers were parsed last and simply overwrote
properties from later headers.

PR #2127 reduced memory usage by parsing headers in the
order they appear in the file, which requires later headers
to avoid overwriting already-set properties.  Apparently,
when I made this change, I did not fully consider how charset
translations get handled on Windows, so failed to consistently
recognize when the path or linkname properties were in fact
actually set.  As a result, the legacy path/link values (which have
no charset information) overwrote the pax path/link values (which
are known to be UTF-8), leading to the behavior observed in
#2248.  This PR corrects this bug by adding additional
tests to see if the wide character path or linkname properties
are set.
    
Related:  This bug was exposed by a new test added in #2228
which does a write/read validation to ensure round-trip filename
handling. This was modified in #2248 to avoid tickling the bug above.
I've reverted the change from #2248 since it's no longer necessary.
I have also added some additional validation to this test to
help ensure that the intermediate archive actually is a pax
format that includes the expected path and linkname properties
in the expected places.
mmatuska pushed a commit that referenced this pull request Sep 20, 2024
…#2338)

Fix memory leaks introduced by #2127:

* `struct tar` member `entry_linkpath` was moved at the same time as
   other members were removed, but its cleanup was accidentally removed
   with the others.

* `header_pax_extension` local variable `attr_name` was not cleaned up.

Resolves #2336
mmatuska pushed a commit to mmatuska/libarchive that referenced this pull request Sep 20, 2024
Fix memory leaks introduced by libarchive#2127:

* `struct tar` member `entry_linkpath` was moved at the same time as
   other members were removed, but its cleanup was accidentally removed
   with the others.

* `header_pax_extension` local variable `attr_name` was not cleaned up.

Resolves libarchive#2336

(cherry picked from commit 7c39803)
kientzle added a commit to kientzle/libarchive that referenced this pull request Sep 21, 2024
PR libarchive#2127 failed to clean up the linkpath storage between entries.
As a result, after the first hard/symlink entry in a pax format
archive, all subsequent entries would get the same link information.

I'm really unsure how this bug failed to trip CI. I'll do some
digging in the test suite before I merge this.

Resolves libarchive#2338, libarchive#2331

P.S.  Thanks to Brad King for noting that the linkpath wasn't
being managed correctly, which was a big hint for me.
mmatuska added a commit that referenced this pull request Sep 22, 2024
…#2341)

Fix memory leaks introduced by #2127:

* `struct tar` member `entry_linkpath` was moved at the same time as
other members were removed, but its cleanup was accidentally removed
with the others.

* `header_pax_extension` local variable `attr_name` was not cleaned up.

Resolves #2336

(cherry picked from commit 7c39803)

Co-authored-by: Brad King <brad.king@kitware.com>
mmatuska pushed a commit that referenced this pull request Sep 22, 2024
PR #2127 failed to clean up the linkpath storage between entries. As a
result, after the first hard/symlink entry in a pax format archive, all
subsequent entries would get the same link information.

I'm really unsure how this bug failed to trip CI. I'll do some digging
in the test suite before I merge this.

Resolves #2331 , #2337

P.S. Thanks to Brad King for noting that the linkpath wasn't being
managed correctly, which was a big hint for me.
mmatuska pushed a commit that referenced this pull request Sep 22, 2024
PR #2127 failed to clean up the linkpath storage between entries. As a
result, after the first hard/symlink entry in a pax format archive, all
subsequent entries would get the same link information.

I'm really unsure how this bug failed to trip CI. I'll do some digging
in the test suite before I merge this.

Resolves #2331 , #2337

P.S. Thanks to Brad King for noting that the linkpath wasn't being
managed correctly, which was a big hint for me.

(cherry picked from commit 75cdc59)
kientzle added a commit to kientzle/libarchive that referenced this pull request Oct 6, 2024
The tar header parsing overhaul in libarchive#2127 introduced a systematic
mishandling of truncated files:  the code incorrectly checks for
whether a given read operation failed, and ends up dereferencing
a NULL pointer in this case.  I've gone back and double-checked
how `__archive_read_ahead` actually works (it returns NULL precisely
when it was unable to satisfy the read request) and reworked the
error handling for each call to this function in archive_read_support_format_tar.c

Resolves libarchive#2353
Resolves https://issues.oss-fuzz.com/issues/42537231
mmatuska pushed a commit that referenced this pull request Oct 11, 2024
The tar header parsing overhaul in #2127 introduced a systematic
mishandling of truncated files: the code incorrectly checks for whether
a given read operation failed, and ends up dereferencing a NULL pointer
in this case. I've gone back and double-checked how
`__archive_read_ahead` actually works (it returns NULL precisely when it
was unable to satisfy the read request) and reworked the error handling
for each call to this function in archive_read_support_format_tar.c

Resolves #2353
Resolves https://issues.oss-fuzz.com/issues/42537231
mmatuska pushed a commit that referenced this pull request Oct 12, 2024
The tar header parsing overhaul in #2127 introduced a systematic
mishandling of truncated files: the code incorrectly checks for whether
a given read operation failed, and ends up dereferencing a NULL pointer
in this case. I've gone back and double-checked how
`__archive_read_ahead` actually works (it returns NULL precisely when it
was unable to satisfy the read request) and reworked the error handling
for each call to this function in archive_read_support_format_tar.c

Resolves #2353
Resolves https://issues.oss-fuzz.com/issues/42537231

(cherry picked from commit 565b5ae)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"Special header too large" error when reading an incremental GNU tar pax archive which adds lots of files Schilly star compatibility

2 participants