kientzle edited this page Mar 29, 2012 · 3 revisions

Ideas that would be nice to implement someday.

If you would like to see any of these implemented, let us know by either posting a comment to this page or sending a note to the libarchive-discuss mailing list. If you'd like to take a stab at implementing any of these, definitely let us know. Also see ReleaseNotes. Please remember that libarchive is entirely developed by volunteers; new features get implemented whenever someone steps up to do the work. If you would like to try your hand at this, read LibarchiveInternals for a guided tour of how libarchive works and ask on the libarchive-discuss@googlegroups.com mailing list if you have any questions.

Table of Contents

Better Build Documentation

The existing build documentation is a little hard to navigate. It would be nice to break it out into separate pages by platform and provide more in the way of screenshots for the graphical tools and command examples for the command-line tools.

Man Page Updates

The man pages are a little out-of-date now; some of the newer changes need to be added.

Better documentation conversions

The existing conversion scripts don't do a bad job but could definitely be improved.

CMake-based install target

The CMake-based build has basic install support, but could use improvement.

CMake-based distribution building

Currently, the distribution can only be built from the autoconf build system. This is one of the few capabilities of the autoconf-based build that is not implemented in the CMake-based build.

Command-line parsing for compress_program

The `archive_read_support_compress_program()` and `archive_write_set_compress_program()` use external programs to compress or decompress data. Right now, those take their string argument as the name of a program to run, which provides no way to run programs with arguments. This is why libarchive currently uses "gunzip", for example, as the fallback for decompressing gzip streams instead of the more portable "gzip -d". The parsing requirements are not difficult, so this should be a relatively easy project for someone.

Extended Attribute interoperability

libarchive's handling of POSIX.1e-style extended attributes in pax files should be interoperable across systems, but there has not been a lot of testing of this. There also needs to be some work to implement the system-dependent portions for more operating systems. (Currently, only Linux and FreeBSD are fully supported; it should be easy to add Mac OS and Solaris and possibly others.)

Linux `FILE_OFFSET_BITS` intelligence

Libarchive is always compiled with `FILE_OFFSET_BITS=64` on Linux. Of course, this is not the default on 32-bit Linux systems, which leads to a common point of confusion among people who use libarchive on Linux.

Some of these problems have disappeared since libarchive 3.0 switched to using `int64_t` for file sizes and offsets instead of `off_t`. But there is still an issue with `struct stat`.

Through careful macro trickery, we could define several versions of `archive_entry_copy_stat()` so that:

  • `archive_entry_copy_stat32()` accepts a `struct stat32`
  • `archive_entry_copy_stat64()` accepts a `struct stat64`
  • `archive_entry_copy_stat()` refers to whichever of the above is appropriate for the current compilation environment.
If done correctly, programs that use `FILE_OFFSET_BITS=32` or `FILE_OFFSET_BITS=64` on Linux could compile and link correctly against the same libarchive library without issues.

Pax front-end

It should be feasible to build a POSIX-compliant pax on top of libarchive. Michihiro has been working on this for a while; he would certainly appreciate assistance.

cpio as archive translation

Essentially all of cpio's operation can be viewed as archive translation (where "archive" means "description of a series of filesystem objects"):

  • cpio -o reads a list of filenames and writes a cpio or tar archive
  • cpio -p reads a list of filenames and writes filesystem objects to disk
  • cpio -i reads an archive and writes filesystem objects to disk
libarchive's mtree reader demonstrates that a list of filenames can be viewed as an archive, and `archive_write_disk()` shows that a series of filesystem objects can be written to a disk directory using the same interface that is used to write them to a tar or cpio archive. By implementing a "list of filenames" reader (similar to the existing mtree reader), it should be possible to dramatically simplify the cpio implementation. Some tweaks to libarchive APIs might be necessary to make this really clean, of course. The end result should be a very simple cpio implementation that does everything the standard requires with a few nice new features:
  • cpio -o can do arbitrary archive translation; if it can read a list of filenames, there's no reason it can't read a tar archive as well.
  • The common `find|cpio` idiom can use mtree format. In particular, using mtree format between find and cpio allows the use of sed, grep, and awk to modify metadata (e.g., create a ustar archive where every file is owned by root).
  • A more modular libarchive interface will be more useful to other applications; this exercise will help in identifying some of the remaining rough edges. (In particular, implementing *cpio -pl* efficiently will require some careful thought about how `sourcepath` should be handled.)

Virtualizing archive_entry

The username, group name, uid, and gid lookups are awkward. These lookups are expensive to do so we'd like to avoid them when we can. Pushing them into `archive_write_disk()` and `archive_read_disk()` works well enough for tar and cpio because they operate in a particular fashion. Other uses won't necessarily operate in the same way.

One way to solve this is to allow archive readers to push lookup functions into the entry instead of direct values. This would allow the username/group name/gid/uid lookup to occur only when the value was requested. If it was never requested, the lookup would never happen. This will likely require some rethinking of libarchive APIs. In particular, the existing "set_lookup" capabilities may need to be reworked as a generic facility that can be configured for any archive object, not just read_disk/write_disk objects.

Similar concerns apply to ACLs and extended attributes; it would be most efficient to entirely skip such lookups if the entry is ultimately being written to an archive format that can't utilize such information. (For example, using archive_read_disk() to get entries to feed to a cpio writer currently pulls ACLs, extended attributes, and user/group names that are never actually used.)

Filtering directory traversals in archive_read_disk

Now that archive_read_disk can handle directory traversals, there's a tricky problem around file filtering that will need some creative thought.

Explicit read pipelines

Today, read filters are configured and then read pipelines are assembled automatically. There are a few cases where clients would like to create the read pipeline manually. This would allow them to handle partially-damaged input, for example, by skipping the bidding process. This probably requires rephrasing the existing read filters so that there is an `archive_read_add_filter_XXXX()` as well as an `archive_read_support_filter_XXXX()`. In this form, the _support_ form would register the bidder and the _add_ function. The bid process would automatically invoke the _add_ function for the winning bidder. Clients who wanted explicit configuration would just call the _add_ version directly to add that filter directly to the pipeline.

Of course, all of the capability here already exists; the work here is to cleanly expose things, develop tests, and make sure that the final API is clean and easy to understand (and backwards compatible with existing clients if that's possible).

Encrypted backup support

This should be a straightforward use of OpenSSL wrapped up in matching read and write filters. There should also be a standalone pair of encrypt/decrypt command-line tools. I have a rough design of how to implement this; ask for details.

Multivolume Writing

This requires two changes to the internal write filter API:

  • A "bytes remaining" call that the format writer can make against the write filters. Most write filters will just pass it downstream and possibly reduce the return value to account for data they have currently buffered (and any end-of-file overhead). Compression filters shouldn't try to adjust for compression ratio; underestimates are okay here but overestimates can be fatal.
  • A "volume change" call that the format writer can make to prompt each write filter to flush, invoke it's downstream, then reinitialize. (Maybe just call close() and then open() callbacks?)
The interesting part will be working through the format writers to actually support this. For tar and cpio, it might be as simple as invoking a volume change before any entry that is larger than the remaining bytes (plus some overhead). (GNU tar-like entry splitting is not necessary in the first version.) It is likely possible to implement multi-volume support for other formats as well. Here is an interesting use case: Two libarchive instances, the first writing a compressed tar stream, the second creating ISO images, with volumes split so that each ISO image has a single compressed tar file that fills the ISO. The volume change from the tar writer closes the ISO writer, invokes an ISO burner program to burn a disk, then reopens the ISO writer. (Bonus points, of course, if the burner program runs while the next ISO is being created.)

Multivolume Reading

Andres Mejia has recently completed some work in this area to support multivolume RAR archives. More work is needed to extend this to other formats.

Proper multivolume reading requires that each layer appropriately detect end-of-input and invoke close()/open() on the next layer up. The first step is to add multivolume support to archive_read_open_filename(); with no other changes, you should be able to read archives that have been broken into pieces with the `split` program. The next step is to change the way the tar reader handles end-of-archive: It should invoke close(), then open() and return end-of-archive only if the open() returns end-of-archive. Similar changes can be made to the gzip decompressor, for instance. If properly done, libarchive will be able to transparently handle all of the following scenarios:

  • A tar.gz archive that has been processed by `split`. In this case, the I/O functions will reassemble the file without any intervention by the decompression or format handler.
  • A tar archive that has been `split` and then each piece separately gzip compressed. In this case, the gzip decompressor will invoke close()/open() on the I/O functions to advance to the next file but the tar reader will see nothing.
  • A GNU-style multivolume tar archive with each volume separately compressed. In this case, the tar format handler will see the end-of-archive markers and invoke close() that will ripple through the gzip decompressor and the I/O handler. Of course, some additional work will be required for the tar reader to handle entries that have been split across volumes.

RMT support

RMT hasn't been a priority simply because noone has ever asked for it. However, it is one of the few features that is both reasonably standard across other tar implementations and already has a hook for adding it to libarchive (just add a new `archive_read_open_rmt()` module). NetBSD has a BSD-licensed rmt library that we might be able to crib from, though the rmt protocol is simple enough that you could also just implement it from scratch. This will require careful compatibility testing with a couple of different rmt servers.

This would be a good project for someone who is trying to learn network programming.

Seeking input files

This is partially supported now, but could be improved further. In particular, the current seek handling tends to generate a lot of non-aligned writes, which are very bad for performance. The I/O routines could be adjusted to handle this more gracefully.

The Zip and 7-Zip readers can take advantage of seeking already. There are likely several opportunities for performance improvements (such as sorting the file entries by offset within the archive so that there is less seeking).

The other reader that could really benefit is the ISO9660 reader. It would be best if this could be done in a way that continues to support streaming effectively for almost all ISO images. Doing this well will require finding test images that require seeking and ensuring the tests cover both seeking and streaming environments (see the Zip tests for an example of how to handle the latter issue).

Seek in archives

A few people have asked for the ability to efficiently "re-read" particular archive entries. This is a tricky subject. For many formats, the performance gains from this would be very modest. For example, with a little performance work, the seeking Zip reader could support very fast re-reading from the beginning since it only involves re-parsing the central directory. The cases where there would be real gains (e.g., tar.gz) are going to be very difficult to handle. The most likely implementation would be some form of checkpointing so that clients can explicitly ask for a checkpoint object and then restore back to that checkpoint. The checkpoint object could be complex if you have a series of stacked read filters plus state in the format handler itself.

fadvise() and O_DIRECT

Libarchive's streaming operation should allow the kernel to optimize I/O very cleanly. However, the current bsdtar and bsdcpio clients don't make any attempt to inform the kernel about that pattern: suitable fadvise() calls could have a significant improvement in overall performance.

The O_DIRECT flag could also be used to provide better I/O performance. There has been some work recently to align the copy buffers used within bsdtar so that O_DIRECT can be used; there is probably more that could be done.

File I/O strategy development

The `archive_read_open_filename` module now has a basic notion of "strategy" for reading files from different sources. This has been used to optimize reading archives from disk, but could be further extended to optimize reading archives from tapes, pipes, or other sources as well.

It would be even better to share the strategy layer between `archive_read_open_filename` and `archive_read_open_fd`. The easiest approach is probably to convert `archive_read_open_filename` into a simple wrapper around `archive_read_open_fd`, but there are a couple of small mismatches in the current code that would need to be resolved.

MMap and async I/O performance experiments

It should be possible to read archives using mmap() or async I/O and it might be significantly faster than the current approach. However, this requires some careful performance testing before we can be confident that it really is an improvement. (If it doesn't help performance, we shouldn't do it.)

Using mmap() should provide real advantages when reading archives from disk, especially uncompressed archives. This essentially lets libarchive work with very large blocks by pushing the I/O concerns back onto the kernel.

Async I/O would seem to improve performance with streaming tape drives. Joerg Schilling has had good results with _star_ using two processes and a shared-memory buffer to smooth out data flow when talking to tape drives. I think async I/O could provide comparable performance without forking. (Clients can easily get confused when callbacks get invoked in different processes, so I'm reluctant to fork within libarchive. However, forking or threading would be okay if it were contained entirely within a module such as `archive_read_open_filename`. Reading in a background thread into a shared-memory buffer while libarchive processes the previous buffer is certainly worth investigating.)

Note: In Jan 2010, I did some experiments using async I/O to begin reading the next block of the archive while the just-read block was being processed. I found no performance benefit from this. My experiments included reading from tape, reading from the same disk as the files were being extracted to, and reading from a separate disk.

I've not yet experimented to see if I can find any benefits from async I/O when writing archives to tape.

BSD-style long filename support for GNU ld and other ld implementations

The GNU/SysV ar format is ugly to write because you need to collect a filename table in advance. This complicates programs that write ar format. The BSD ar format avoids this problem but GNU ld doesn't support it. If GNU ld could read the BSD ar format, then it would be easier to create library-management tools on top of libarchive.

Single-pass ar writer

Another option is for libarchive's ar writer to simply store the entire archive in memory (.a files are rarely more than a few megabytes, so this is entirely feasible) while collecting information for the filename table. That would simplify use of libarchive's ar writer module considerably.

Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.