This is paxutils.info, produced by makeinfo version 3.12i from
paxutils.texi.
START-INFO-DIR-ENTRY
* pax utilities: (paxutils). pax and other archiving utilities.
* cpio: (paxutils)cpio invocation. Handling cpio archives.
* pax: (paxutils)pax invocation. The POSIX archiver.
* tar: (paxutils)tar invocation. Making tape (or disk) archives.
* mt: (paxutils)mt invocation. Basic tape positioning.
* rmt: (paxutils)rmt invocation. The remote tape facility.
END-INFO-DIR-ENTRY
This file documents `paxutils' 2.4i.
Copyright (C) 1992, 1994, 1995, 1996, 1997, 1998 Free Software
Foundation, Inc.
Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.
Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.
Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.

File: paxutils.info, Node: posix, Next: Checksumming, Prev: old-archive, Up: Portability
`tar' and POSIX `tar'
---------------------
`tar' was based on an early draft of the POSIX 1003.1 `ustar'
standard. Extensions to this `tar', such as the support for file names
longer than 100 characters, use portions of the `tar' header record
which were specified in that POSIX draft as unused. Subsequent changes
in POSIX have allocated the same parts of the header record for other
purposes. As a result, `tar' is incompatible with the current POSIX
spec, and with `tar' programs that follow it.
We plan to reimplement these extensions in a new way which is upward
compatible with the latest POSIX `tar' format, but we don't know when
this will be done.
In the meantime, there is simply no way of telling what might
happen if you read a `tar' archive that uses the extensions using some
other `tar' program. So if you want to read the archive with another
`tar' program, be sure to write it using the `--old-archive' (`-o')
option.
Traditionally, old `tar's limit file names to 100 characters. `tar'
attempted two different approaches to overcoming this limit, using and
extending a format specified by a draft of P1003.1. The first way was
not that successful, and involved `@MaNgLeD@' file names, or such;
while a second approach used `././@LongLink' and other tricks, yielding
better success. In theory, `tar' should be able to handle file names
of practically unlimited length. So, if `tar' fails to dump and
retrieve files whose names have more than 100 characters, then there is
a bug in `tar'.
But, for strict conformity to POSIX, the limit was still 100
characters. For various other purposes, `tar' used areas left
unassigned in the POSIX draft. POSIX later revised the P1003.1 `ustar'
format by assigning previously unused header fields in such a way that
the upper limit for file name length was raised to 256 characters.
However, the actual POSIX limit oscillates between 100 and 256,
depending on the precise location of slashes in full file name (this is
rather ugly). Since `tar' uses the same fields for quite other
purposes, it became incompatible with the latest POSIX standards.
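The oscillation between 100 and 256 follows from how the revised
`ustar' header stores a name: a 100-byte name field plus a 155-byte
prefix field, joined by an implied `/'. The following Python sketch
illustrates the splitting rule. The function name `split_ustar_name'
is hypothetical, and this is not code taken from `tar' itself.

```python
NAME_MAX = 100     # bytes available in the name field
PREFIX_MAX = 155   # bytes available in the prefix field

def split_ustar_name(path):
    """Return (prefix, name) as bytes, or None if the path cannot fit."""
    raw = path.encode("utf-8")
    if len(raw) <= NAME_MAX:
        return b"", raw
    # Look for the rightmost slash that still leaves a legal prefix.
    cut = raw.rfind(b"/", 0, PREFIX_MAX + 1)
    if cut <= 0:
        return None              # no usable slash to split on
    prefix, name = raw[:cut], raw[cut + 1:]
    if not name or len(name) > NAME_MAX:
        return None              # what follows the slash does not fit
    return prefix, name

fits = split_ustar_name("a" * 155 + "/" + "b" * 100)       # 256 characters
too_late = split_ustar_name("a" * 156 + "/" + "b" * 99)    # also 256 characters
```

Here `fits' is a valid split, while `too_late' is `None': both names
are 256 characters long, but in the second one the slash falls one
position too far to the right.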
For longer or non-fitting file names, we plan to use yet another set
of extensions, but this time, complying with the provisions POSIX
offers for extending the format, rather than conflicting with it.
Whenever an archive uses old `tar' extension format or POSIX
extensions, whether for very long file names or for other special cases,
this archive becomes non-portable to other `tar' implementations. In
fact, anything can happen. The most forgiving `tar's will merely
unpack the file using a wrong name, and maybe create another file named
something like `@LongName', with the true file name in it. `tar's that
do not protect themselves may crash with a segmentation violation!
Compatibility concerns make all of these things more difficult, as we
will have to support _all_ these things together, for a while. `tar'
should be able to produce and read true POSIX format files, while being
able to detect old `tar' formats, including old V7 format, and process
them conveniently. It will take years before this whole area
stabilizes ...
There are plans to raise this 100-character limit to 256, and yet produce POSIX
conformant archives. Past 256, we do not know yet if `tar' will go
non-POSIX again, or merely refuse to archive the file.
There are plans for `tar' to support the latest POSIX format more
fully, while being able to read old V7 format, old GNU (semi-POSIX plus
extensions), and full POSIX. One may ask if there is part of the POSIX
format that we still cannot support. This simple question has a
complex answer. Maybe, on closer inspection, some strong limitations
will pop up, but up to now, nothing looks too difficult (but see
below). We only have these few pages of POSIX telling about `Extended
tar format' (P1003.1-1990-section 10.1.1), and there are references to
other parts of the standard we do not have, which should normally
enforce limitations on stored file names (we suspect things like fixing
what `/' and `<NUL>' mean). There are also some points which the
standard does not make clear. Existing practice will then drive what
we should do.
POSIX mandates that when a file name cannot fit within 100 to 256
characters (the variance comes from the fact a `/' is ideally needed as
the 156th character) or a link name cannot fit within 100 characters, a
warning should be issued and the file _not_ be stored. Unless the
`--posix' option is given (or `POSIXLY_CORRECT' is set), we believe
that `tar' should disobey this specification, and automatically switch
to using extensions to overcome file name or link name length
limitations.
There is a problem, however, which we have not intimately studied
yet. Given a truly POSIX archive with names having more than 100
characters, we guess that `tar' up to 1.11.8 will process it as if it
were an old V7 archive, and be fooled by some fields which are coded
differently. So, the question is to decide if the next generation of
`tar' should produce POSIX format by default, whenever possible,
producing archives that older versions of `tar' might not be able to
read correctly. We fear that we will have to suffer such a choice one
of these days, if we want `tar' to go closer to POSIX. We might choose
to do that right away. Another possibility is to produce the current
`tar' format by default for a few years, but have `tar' versions from
some 1.POSIX and up able to recognize all three formats, and let older
`tar' fade out slowly. Then, we could switch to producing POSIX format
by default, with not much harm to those still having (very old at that
time) `tar' versions prior to 1.POSIX.
POSIX format cannot represent very long names, volume headers,
splitting of files in multi-volumes, sparse files, and incremental
dumps; these would be all disallowed if `--posix' is given or
`POSIXLY_CORRECT' is set. Otherwise, if `tar' is given long names, or
`-[VMSgG]', then it should automatically go non-POSIX. We think this
is easily granted without much discussion.
Another point is that only `mtime' is stored in POSIX archives,
while `tar' currently also stores `atime' and `ctime'. If we want
`tar' to go closer to POSIX, my choice would be to drop `atime' and
`ctime' support on average. On the other hand, we perceive that full
dumps or incremental dumps need `atime' and `ctime' support, so for
those special applications, POSIX has to be avoided altogether.
A few users requested that `--sparse' (`-S') be always active by
default. We think that before replying to them, we have to decide if
we want `tar' to go closer to POSIX on average, while producing files.
My choice would be to go closer to POSIX in the long run. Besides
possible double reading, we do not see any point in not trying to save
files as sparse when creating archives which are neither POSIX nor
old-V7, so the actual `--sparse' (`-S') would become selected by
default when producing such archives, whatever the reason is. So,
`--sparse' (`-S') alone might be redefined to force extended format
archives, and recover its previous meaning from this fact.
Extended format as it exists now can easily fool other POSIX `tar's,
as it uses fields which POSIX considers to be part of the file name
prefix. We wonder if it would not be a good idea, in the long run, to
try changing extended format so that any added field (like `ctime',
`atime', file offset in subsequent volumes, or sparse file
descriptions) would be wholly and always pushed into an extension block,
instead of using space in the POSIX header block. We could manage to
do that portably between future `tar's. So other POSIX `tar's might at
least be able to provide roughly correct listings for the archives
produced by `tar', if not to process them otherwise.
Using these projected extensions might induce older `tar's to fail.
We would use the same approach as for POSIX. We'll put out a `tar'
capable of reading POSIXier, yet extended archives, but will not produce
this format by default, when not in POSIX mode. In a few years, when
newer `tar's will have flooded out `tar' 1.11.X and earlier, we could
switch to producing POSIXier extended archives, with no real harm to
users, as almost all existing `tar's will be ready to read POSIXier
format. In fact, we'll do both changes at the same time, in a few
years, and just prepare `tar' for both changes, without effecting them,
from 1.POSIX. (Both changes: 1--using POSIX conventions for getting
over 100 characters; 2--avoiding mangling POSIX headers for extensions,
using only POSIX mandated extension techniques).
So, a future `tar' will have a `--posix' flag forcing the usage of
truly POSIX headers, and so, producing archives that previous `tar's
will not be able to read. So, _once_ pretest announces that feature,
it would be particularly useful for users to test how exchangeable
archives will be between `tar' with `--posix' and other POSIX `tar's.
In a few years, when `tar' will produce POSIX headers by default,
`--posix' will have a strong meaning and will disallow extensions. But
in the meantime, for a long while, `--posix' in `tar' will not disallow
extensions like `--label=ARCHIVE-LABEL' (`-V ARCHIVE-LABEL'),
`--multi-volume' (`-M'), `--sparse' (`-S'), or very long file or link
names. However, `--posix' with extensions will use POSIX headers with
reserved-for-users extensions to headers, and we will be curious to
know how well or badly POSIX `tar's will react to these.
`tar' prior to 1.POSIX, and after 1.POSIX without `--posix',
generates and checks `ustar ', with two suffixed spaces. This is
sufficient for older `tar' not to recognize POSIX archives, and
consequently, wrongly decide those archives are in old V7 format. It
is a useful bug for me, because `tar' has other POSIX
incompatibilities, and we need to segregate `tar' semi-POSIX archives
from truly POSIX archives, for `tar' should be somewhat compatible with
itself, while migrating closer to latest POSIX standards. So, we'll be
very careful about how and when we will do the correction.

File: paxutils.info, Node: Checksumming, Prev: posix, Up: Portability
Checksumming problems
---------------------
SunOS and HP-UX `tar' fail to accept archives created using `tar'
and containing non-ASCII file names, that is, file names having
characters with the eighth bit set, because they use signed checksums,
while `tar' uses unsigned checksums when creating archives, as per
POSIX standards. On reading, `tar' computes both checksums and accepts
either. It is somewhat worrying that a lot of people may go around
doing backup of their files using faulty (or at least non-standard)
software, not learning about it until it's time to restore their
missing files with an incompatible file extractor, or vice versa.
`tar' computes checksums both ways, and accepts either on read, so
`tar' can read Sun tapes even with their wrong checksums. `tar'
produces the standard checksum, however, raising incompatibilities with
Sun. That is to say, `tar' has not been modified to _produce_
incorrect archives to be read by buggy `tar's. We've been told that
more recent Sun `tar' now reads standard archives, so maybe Sun did a
similar patch, after all?
The story seems to be that when Sun first imported `tar' sources on
their system, they recompiled it without realizing that the checksums
were computed differently, because of a change in the default signing
of `char's in their compiler. So they started computing checksums
wrongly. When they later realized their mistake, they merely decided
to stay compatible with it, and with themselves afterwards.
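The difference between the two checksums can be sketched in Python.
The header layout assumed here (a 512-byte block whose 8-byte checksum
field starts at offset 148 and counts as spaces while summing) is that
of the `ustar' format. The function names are hypothetical, and this
only illustrates the accept-either policy; it is not the code used by
`tar'.

```python
BLOCK_SIZE = 512
CHKSUM = slice(148, 156)   # the 8-byte checksum field of a header block

def header_checksums(block):
    """Return the (unsigned, signed) sums of a header block."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("a tar header block is 512 bytes")
    # The checksum field itself is summed as if it held eight spaces.
    filled = block[:148] + b" " * 8 + block[156:]
    unsigned = sum(filled)
    # A compiler with signed chars sees bytes above 127 as negative.
    signed = sum(b - 256 if b > 127 else b for b in filled)
    return unsigned, signed

def checksum_ok(block):
    """Accept a header whose stored checksum matches either sum."""
    field = block[CHKSUM].rstrip(b"\0 ")
    try:
        stored = int(field, 8)     # the field holds octal digits
    except ValueError:
        return False
    return stored in header_checksums(block)
```

The two sums differ by 256 for every byte with the eighth bit set, so
they agree on pure ASCII headers and diverge only for names such as
those described above.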
Presumably, but we do not really know, HP-UX has chosen to make their
`tar' archives compatible with Sun's. The current standards do
not favor Sun `tar' format. In any case, it now falls on the shoulders
of SunOS and HP-UX users to get a `tar' able to read the good archives
they receive.

File: paxutils.info, Node: Forced fields, Next: Compression, Prev: Portability, Up: Formats
Options to preset file attributes
=================================
* Menu:
* mode:: Presetting permissions
* owner:: Forcing a given owner
* group:: Forcing a given group
* numeric-owner:: Using numeric owner and group

File: paxutils.info, Node: mode, Next: owner, Prev: Forced fields, Up: Forced fields
Presetting permissions
----------------------
Given the `--mode=PERMISSIONS' option, `tar' will use PERMISSIONS for
archive members when adding files to an archive, rather than the
permissions from the files. The program `chmod' and this `tar' option
share the same syntax for what PERMISSIONS might be.
*Note Permissions: (fileutils)File permissions.
This reference also has useful information for those not especially
familiar with the Unix permission system.
Of course, PERMISSIONS might be plainly specified as an octal number.
However, using generic symbolic modifications to mode bits allows more
flexibility. For example, the value `a+rw' adds read and write
permissions for everybody, while retaining executable bits on
directories or on any other file already marked as executable.
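As a small illustration of that example, assuming permissions are held
as the usual octal mode bits, `a+rw' amounts to OR-ing in the read and
write bits for user, group and others. The helper name below is
hypothetical.

```python
import stat

# Read and write bits for user, group and others (octal 666).
A_RW = (stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IWGRP
        | stat.S_IROTH | stat.S_IWOTH)

def add_rw_for_all(mode):
    """Apply the symbolic change a+rw to a set of mode bits."""
    return mode | A_RW
```

Because bits are only added, a mode of 755 becomes 777 and keeps its
executable bits, while 640 becomes 666 and gains none.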

File: paxutils.info, Node: owner, Next: group, Prev: mode, Up: Forced fields
Forcing a given owner
---------------------
The `--owner=USER' option specifies that `tar' should use USER as
the owner of members when creating archives, instead of the user
associated with the source file. USER is first decoded as a user
symbolic name, but if this interpretation fails, it has to be a decimal
numeric user ID.
There is no value indicating a missing number, and `0' usually means
`root'. Some people like to force `0' as the value to offer in their
distributions for the owner of files, because the `root' user is
anonymous anyway, so that might as well be the owner of anonymous
archives.
`tar' on MS-DOS/MS-Windows allows USER to be _any_ string. These
systems don't support file ownership, so `tar' allows them to give away
files to anybody. If USER includes only digits, it is treated as a
numeric UID; otherwise, it is treated as a user name.

File: paxutils.info, Node: group, Next: numeric-owner, Prev: owner, Up: Forced fields
Forcing a given group
---------------------
Given the `--group=GROUP' option, files added to the `tar' archive
will have a group ID of GROUP, rather than the group from the source
file. GROUP is first decoded as a group symbolic name, but if this
interpretation fails, it has to be a decimal numeric group ID. `tar'
on MS-DOS/MS-Windows allows GROUP to be _any_ string. These systems
don't support group IDs, so `tar' allows them to give away files to
anybody. If GROUP consists only of digits, it is treated as a numeric
GID; otherwise it is treated as a group name.
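For comparison, here is a sketch of the same idea using Python's
standard `tarfile' module rather than `tar' itself. The function names
are hypothetical, and the choice of ID 0 with the name `root' for both
owner and group is only an assumption made for the example.

```python
import io
import tarfile

def force_owner(info, uid=0, gid=0, uname="root", gname="root"):
    """Overwrite the ownership recorded for one archive member."""
    info.uid, info.gid = uid, gid
    info.uname, info.gname = uname, gname
    return info

def make_archive(members):
    """Build an in-memory archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w",
                      format=tarfile.USTAR_FORMAT) as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            # Every member gets the forced owner and group.
            tar.addfile(force_owner(info), io.BytesIO(data))
    return buf.getvalue()

raw = make_archive({"README": b"hello\n"})
```

Reading `raw' back shows every member owned by the forced user and
group, whatever the ownership of the source data was.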

File: paxutils.info, Node: numeric-owner, Prev: group, Up: Forced fields
Using numeric owner and group
-----------------------------
The `--numeric-owner' option allows (ANSI) archives to be written
without user/group name information, or allows such information to be
ignored when extracting. It effectively disables the generation and/or
use of user/group name information. This option forces extraction
using the numeric IDs from the archive, ignoring the names.
This is useful in certain circumstances, when restoring a backup from
an emergency floppy with different passwd/group files for example. It
is otherwise impossible to extract files with the right ownerships if
the password file in use during the extraction does not match the one
belonging to the filesystem(s) being extracted. This occurs, for
example, if you are restoring your files after a major crash and had
booted from an emergency floppy with no password file or put your disk
into another machine to do the restore.
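The choice made at extraction time can be sketched as follows. This is
only an illustration, under the assumption that names, when present and
known locally, take precedence over the stored numeric IDs. The
function and the `passwd' mapping are hypothetical stand-ins for the
system's password file.

```python
def owner_for_extraction(uid, uname, passwd, numeric_owner=False):
    """Pick the numeric user ID to give an extracted file.

    uid and uname come from the archive; passwd maps local user
    names to local numeric IDs.
    """
    if not numeric_owner and uname and uname in passwd:
        return passwd[uname]   # trust the name, use the local ID
    return uid                 # fall back on the ID stored in the archive

# An archive member recorded as uid 500, name "alice".
on_other_machine = owner_for_extraction(500, "alice", {"alice": 1001})
forced_numeric = owner_for_extraction(500, "alice", {"alice": 1001},
                                      numeric_owner=True)
```

With a password file that does not match the one in use when the
archive was made, the name lookup yields a different ID (1001 here),
while the numeric form keeps the ID stored in the archive (500).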
The numeric IDs are _always_ saved into `tar' archives. The
identifying names are added at create time when provided by the system,
unless `--old-archive' (`-o') is used. Numeric IDs could be used when
moving archives between a collection of machines using a centralized
management for attribution of numeric IDs to users and groups. This is
often done via the NIS capabilities.
When making a `tar' file for distribution to other sites, it is
sometimes cleaner to use a single owner for all files in the
distribution, and nicer to specify the write permission bits of the
members as stored in the archive independently of their actual value on
the file system. The way to prepare a clean distribution is usually to
have some makefile rule creating a directory, copying all needed files
in that directory, then setting ownership and permissions as wanted
(there are a lot of possible schemes), and only then making a `tar'
archive out of this directory, before cleaning everything out. Of
course, we could add a lot of options to `tar' for fine-tuning
permissions and ownership. This is not the best approach, we think.
`tar' is already crowded with options, and the approach just explained
gives you a great deal of control already.

File: paxutils.info, Node: Compression, Next: Other formats, Prev: Forced fields, Up: Formats
Using less space through compression
====================================
`tar' has options built in to let you compress and uncompress
archives or individual members on the command line, at the same time
that you create or extract them. Compressing an archive causes it to
take up less space in the system, but also might introduce a few
difficulties.
You can compress your archives using several different methods and
programs. When compressing the whole archive, you can also create the
archive without using one of those options and pipe (`|') the archive
through a compression program such as `gzip'. Likewise, instead of
using options for compressing individual members, one might first have
the files compressed right on disk, before the archive is later made
without using related options; the serious drawback is that files are
left altered (compressed) by this two step process, which users do not
want in general.
There are `tar' limitations with compressed archives. You cannot
modify whole compressed archives (with `--append' (`-r'), `--update'
(`-u') or `--delete', for example). You may also not use them in
conjunction with `--multi-volume' (`-M') related options.
* Menu:
* Archive compression:: Compressing the whole archive
* Member compression:: Compressing individual members

File: paxutils.info, Node: Archive compression, Next: Member compression, Prev: Compression, Up: Compression
Compressing the whole archive
-----------------------------
The `--gzip' (`-z'), `--compress' (`-Z') and
`--use-compress-program=PROGRAM' options allow you to compress or
uncompress a `tar' archive at the same time as you create or extract
the archive. These options are useful in saving time over networks, or
saving space in pipes, and when storage space is at a premium.
`--gzip' (`-z') runs the `gzip' utility; `--compress' (`-Z') runs
`compress'; and `--use-compress-program=PROGRAM' allows you to choose a
compression program that you prefer. If the selected compression
utility is not available, `tar' will report an error.
When any of the above options is specified, `tar' will compress (when
writing an archive), or uncompress (when reading an archive). These
options are used in conjunction with the `--create' (`-c'), `--extract'
(`--get', `-x'), `--list' (`-t'), and `--compare' (`--diff', `-d')
subcommands. However, these options will not work in conjunction with
the `--append' (`-r'), `--update' (`-u'), `--concatenate'
(`--catenate', `-A'), and `--delete' subcommands. *Note Subcommands::,
for more information on all these.
In general, compression algorithms work by substituting a string of
characters for some longer string of characters in the original file.
The compression program keeps track of which strings in the compressed
file will substitute for individual longer bits of the file. If you
modify a compressed file, the compression program will hopelessly lose
track of the data that was originally mapped to the compressed form of
the data, and the file will be corrupted and useless after the
modification point.
Similarly, you should not compress a file that will be used as a
backup on a tape. If your backup archive is compressed on a tape and
even a small portion of the tape is damaged in some way, you will be
unable to recover the contents of the archive following the damage, due
to the way compression algorithms work (described above). If a tape
containing an uncompressed backup archive gets damaged, you can
probably still recover the data from the rest of the tape; there is no
corrupted matching algorithm to work around.
For the `tar' and `gzip' tandem, as in the command `tar tfz
archive.tar.gz', you need to decompress the full archive to see its
contents. However, this may be done without needing disk space, by
using pipes internally. (`tar' on MS-DOS and MS-Windows also supports
this on-the-fly compression with `gzip', but since pipes are simulated
with disk files on MS-DOS, you _do_ need disk space to store the
uncompressed copy while `tar' runs. In particular, make sure the disk
with the directory which is the value of the environment variable
`TMPDIR' has enough free space. Many DOS users tend to point it to a
RAM disk.)
You can use archive compression options on physical devices (tape
drives and so forth) and remote files as well as on normal files; data
to or from such devices or remote files is reblocked by a forked copy
of the `tar' program to enforce the specified (or default) record size.
It is also useful to be able to call the compression option from
within `tar', instead of using external pipes, because compression
utilities by themselves cannot access remote tape drives.
It has been reported that if one writes compressed data (through the
`--gzip' (`-z') or `--compress' (`-Z') options) to a DLT and tries to
use the DLT compression mode, the data will actually get bigger and one
will end up with less space on the tape.
Why does `tar' refuse to compress archives with `--multi-volume'
(`-M')? Here is a sort of explanation. Each tape of a multi-volume
set should have a volume header entry at its beginning. When
compressing, another process (like `gzip', say) takes care of the
compression. `gzip' is not able to detect the end of tape and inform
`tar' of what is going on, so that `tar' could produce the new volume
header. Even then, `tar' would not be able to withdraw already
produced bytes up to the exact point where the tape would be full after
compression (further, `gzip' itself produces write-ahead bytes in its
own buffers). On the other hand, `tar' often forks itself _after_ `gzip'
to reprocess its output, so `gzip' is sort of transparently sandwiched
between two copies of `tar'. Maybe some later `tar' version will
execute the multi-volume spanning code right in the post-processing
`tar'?
On the other hand, I wonder if compressed multi-volumes are such a
good idea. If one has a huge archive spanning many volumes, one
loses all tapes following the first one having an error, and all hope
with them ...
* Menu:
* gzip:: Using `gzip' compression
* compress:: Using `compress' compression
* use-compress-program:: Using other compression programs

File: paxutils.info, Node: gzip, Next: compress, Prev: Archive compression, Up: Archive compression
Using `gzip' compression
........................
You can have archives compressed by using the `--gzip' (`-z') option.
This will arrange for `tar' to use the `gzip' program to compress or
uncompress the whole archive when writing or reading it.
To perform compression and uncompression on the archive, `tar' runs
the `gzip' utility. `tar' uses the default compression parameters; if
you need to override them, avoid the `--gzip' (`-z') option and run the
`gzip' utility explicitly. (Or set the `GZIP' environment variable.)
The following commands, for creating a compressed archive, are
equivalent:
$ tar cfz archive.tar.gz subdir
$ tar cf - subdir | gzip > archive.tar.gz
They both save all of `subdir' into a `gzip'-ed archive. Later you can
do either of:
$ tar xfz archive.tar.gz
$ gunzip < archive.tar.gz | tar xf -
to explode and unpack.
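The same equivalence can be observed outside the shell. The sketch
below uses Python's standard `tarfile' and `gzip' modules as stand-ins
for the two programs; the member name and contents are made up for the
example.

```python
import gzip
import io
import tarfile

def plain_tar(members):
    """Return an uncompressed in-memory archive of {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def list_names(raw, mode):
    """List member names; mode is "r:" (plain) or "r:gz" (gzip)."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode=mode) as tar:
        return tar.getnames()

archive = plain_tar({"subdir/file": b"data\n"})
compressed = gzip.compress(archive)            # like: tar cf - | gzip

# Reading in one step, or uncompressing first and then reading,
# yields the same member list.
one_step = list_names(compressed, "r:gz")
two_steps = list_names(gzip.decompress(compressed), "r:")
```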
Although `tar' and `gzip' can be run together with a single call, it
is not precisely correct to say that `tar' works in concert with `gzip'
in a way similar to `zip', say. *Note zip::, for more information.

File: paxutils.info, Node: compress, Next: use-compress-program, Prev: gzip, Up: Archive compression
Using `compress' compression
............................
*Please note:* The algorithm used by the `compress' program is
covered by a patent. You could be sued for patent infringement
merely for running `compress'. Even if the current patent holder
apparently tolerates such infringements, the safest attitude for
everybody is to just avoid becoming dependent on this program.
So, we recommend that you stop using it.
The `--compress' (`-Z') option gets the archive to be filtered
through the `compress' program. Otherwise, it pretty much behaves like
`--gzip' (`-z'). *Note gzip::.
The `compress' option is older than `gzip', and is now obsolescent.
However, there are still a lot of older `tar' files which have been
compressed by `compress' in their time, and because of that, it is
still useful to offer an option in `tar' to read them easily.

File: paxutils.info, Node: use-compress-program, Prev: compress, Up: Archive compression
Using other compression programs
................................
The `--use-compress-program=PROGRAM' option asks for the archive to
be filtered through PROGRAM. For example, option `--gzip' (`-z') is
pretty much like writing `--use=gzip', and option `--compress' (`-Z')
is like writing `--use=compress'. With this option, you might use any
program of your choice, for doing either compression, encryption, or
cyclic redundancy check processing, say, provided that the said program
acts as a filter (that is, it reads its standard input and produces
results on its standard output) and that it accepts a `-d' option.
This option is used by `tar' when calling PROGRAM in contexts where
decompression would normally be done (as when listing or extracting the
archive); it is not used in contexts where compression would normally
be done (as when creating the archive).
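A filter suitable for this option could look like the following Python
sketch. It assumes `gzip'-style compression from Python's standard
library; the function name is hypothetical, and the in-memory streams
below merely stand in for the standard input and output of a real
script.

```python
import gzip
import io

def run_filter(argv, stdin, stdout):
    """Compress stdin to stdout, or decompress when `-d' is given."""
    data = stdin.read()
    if "-d" in argv[1:]:
        stdout.write(gzip.decompress(data))
    else:
        stdout.write(gzip.compress(data))
    return 0

# Round trip through the filter, using in-memory streams.
packed = io.BytesIO()
run_filter(["filter"], io.BytesIO(b"payload"), packed)
unpacked = io.BytesIO()
run_filter(["filter", "-d"], io.BytesIO(packed.getvalue()), unpacked)
```

In an actual script, the function would be called with the process's
argument list and its binary standard streams (`sys.stdin.buffer' and
`sys.stdout.buffer' in Python).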
To combine many features at once, like compression and redundancy
checking, for example, one can provide a single shell script for
PROGRAM. When the `-d' option is not given to the script, it
compresses its standard input, and pipes the result into a program
computing and adding redundancy at regular intervals.(1) If the `-d'
option is given, the script does the reverse operation (in reverse
order as well), that is, it checks its standard input for redundancy
and possibly recovers lost data while removing the redundancy
information, piping the result into the appropriate decompression
program.
---------- Footnotes ----------
(1) It is so sad that `ecc' had to be withdrawn because of a
serious, unrepairable algorithmic flaw.

File: paxutils.info, Node: Member compression, Prev: Archive compression, Up: Compression
Compressing individual members
------------------------------
There are pending suggestions for having a per-volume or per-file
compression in `tar', and these suggestions will be addressed. This
will allow for viewing the contents without decompression, and for
resynchronizing decompression at every volume or file, in case of
corrupted archives. Doing so, we might sacrifice maximum compression
to some extent, but in case of partial tape loss, recovery might become
possible, which would be a great advantage.
In the meantime, the only compression-like technique available for
individual archive members is related to sparse file processing, which
only takes care of big strings of zero bytes in certain contexts, thus
making compression very slight in the average case.
* Menu:
* sparse:: Archiving sparse files

File: paxutils.info, Node: sparse, Prev: Member compression, Up: Member compression
Archiving sparse files
......................
Files in the filesystem occasionally have "holes". A hole in a file
is a section of the file's contents which was never written. The
contents of a hole read as all zeros. On many operating systems,
actual disk storage is not allocated for holes, but they are counted in
the length of the file. If you archive such a file, `tar' could create
an archive longer than the original. To have `tar' attempt to
recognize the holes in a file, use `--sparse' (`-S'). When you use the
`--sparse' (`-S') option, then, for any file using less disk space than
would be expected from its length, `tar' searches the file for
consecutive stretches of zeros. It then records in the archive for the
file where the consecutive stretches of zeros are, and only archives
the "real contents" of the file. On extraction (using `--sparse'
(`-S') is not needed on extraction) any such files have holes created
wherever the continuous stretches of zeros were found. Thus, if you
use `--sparse' (`-S'), `tar' archives won't take more space than the
original.
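The scan described above can be sketched in Python. This is a
simplified illustration, not the algorithm in `tar': it works on an
in-memory byte string, uses an assumed block size of 512 bytes, and the
function names are hypothetical. The `st_blocks' test also assumes the
common convention of 512-byte units.

```python
BLOCK = 512

def looks_sparse(st_size, st_blocks):
    """True when a file uses less disk space than its length suggests."""
    return st_blocks * 512 < st_size

def data_segments(data, block=BLOCK):
    """Return (offset, length) pairs for stretches that are not all zeros."""
    segments = []
    start = None
    for off in range(0, len(data), block):
        chunk = data[off:off + block]
        if chunk.count(0) == len(chunk):       # an all-zero block
            if start is not None:
                segments.append((start, off - start))
                start = None
        elif start is None:
            start = off                        # a data stretch begins
    if start is not None:
        segments.append((start, len(data) - start))
    return segments

def restore(size, segments, payload):
    """Rebuild the file contents from the stored data segments."""
    out = bytearray(size)                      # zeros stand in for the holes
    pos = 0
    for offset, length in segments:
        out[offset:offset + length] = payload[pos:pos + length]
        pos += length
    return bytes(out)
```

Only the bytes covered by the segments need to be stored, together with
the segment list and the original length; everything else is
regenerated as zeros on extraction.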
For example, the `--sparse' (`-S') option is useful when many `dbm'
files are being backed up. Using this option dramatically decreases the
amount of space needed to store such a file.
*Please note:* Always use `--sparse' (`-S') when performing file
system backups, to avoid archiving the expanded forms of files
stored sparsely in the system.
Even if your system has no sparse files currently, some may be
created in the future. If you use `--sparse' (`-S') while making
file system backups as a matter of course, you can be assured the
archive will never take more space on the media than the files
take on disk (otherwise, archiving a disk filled with sparse files
might take hundreds of tapes). *Note incremental
listed-incremental::.
Programs like `dump' do not have to read the entire file; by
examining the file system directly, they can determine in advance
exactly where the holes are and thus avoid reading through them. The
only data they need read are the actual allocated data blocks. `tar'
uses a more portable and straightforward archiving approach; it would
be fairly difficult for it to do otherwise. On 1990-12-10, Elizabeth
Zwicky wrote(1) to `comp.unix.internals':
What I did say is that you cannot tell the difference between a
hole and an equivalent number of `NUL's without reading raw blocks.
`st_blocks' at best tells you how many holes there are; it doesn't
tell you _where_. Just as programs may, conceivably, care what
`st_blocks' is (care to name one that does?), they may also care
where the holes are (I have no examples of this one either, but
it's equally imaginable).
I conclude from this that good archivers are not portable. One can
arguably conclude that if you want a portable program, you can in
good conscience restore files with as many holes as possible,
since you can't get it right.
Users should be well aware that at archive creation time, `tar'
still has to read the whole disk file to locate the "holes", and so,
even if sparse files use little space on disk and in the archive, they
may sometimes require an inordinate amount of time for reading and
examining all-zero blocks of a file. Although it works, it's painfully
slow for a large (sparse) file, even though the resulting `tar' archive
may be small. (One user reports that dumping a `core' file of over 400
megabytes, but with only about 3 megabytes of actual data, took about 9
minutes on a Sun SPARCstation ELC, with full CPU utilisation.) This
reading is required in all cases, even if the `--sparse' (`-S') option
is not used.(2)
In some later `tar' version, the `--sparse' (`-S') option might be
removed as such, and the testing and treatment of sparse files may be
done automatically with any special option calling for _any_ extension.
The matter is not fully decided yet.
---------- Footnotes ----------
(1) This quote comes from a reply from Elizabeth, after someone
_falsely_ attributed to her the sentence: `One has to be pretty
intimate with the disk, to know where the holes are ...'
(2) Well! When `--sparse' (`-S') is selected while creating an
archive, the current `tar' algorithm requires sparse files to be read
twice, not once. We hope to develop a new archive format for saving
sparse files in which one pass will be sufficient.

File: paxutils.info, Node: Other formats, Prev: Compression, Up: Formats
Other non-`tar' formats
=======================
* Menu:
* cpio:: Comparison of `tar' and `cpio'
* zip:: Comparison of `tar' and `zip'

File: paxutils.info, Node: cpio, Next: zip, Prev: Other formats, Up: Other formats
Comparison of `tar' and `cpio'
------------------------------
The `cpio' archive formats, like `tar', have maximum pathname
lengths. The binary and old ASCII formats have a max path length of
256, and the new ASCII and CRC ASCII formats have a max path length of
1024. This `cpio' can read and write archives with arbitrary pathname
lengths, but other `cpio' implementations may crash inexplicably
when trying to read them.
`tar' handles symbolic links in the form in which it comes in BSD;
`cpio' doesn't handle symbolic links in the form in which it comes in
System V prior to SVR4, and some vendors may have added symlinks to
their system without enhancing `cpio' to know about them. Others may
have enhanced it in a way other than the way we did it at Sun, and
which was adopted by AT&T (and which is, we think, also present in the
`cpio' that Berkeley picked up from AT&T and put into a later BSD
release--we think we gave them our changes).
(SVR4 does some funny stuff with `tar'; basically, its `cpio' can
handle `tar' format input, and write it on output, and it probably
handles symbolic links. They may not have bothered doing anything to
enhance `tar' as a result.)
`cpio' handles special files; traditional `tar' doesn't.
`tar' comes with V7, System III, System V, and BSD source; `cpio'
comes only with System III, System V, and later BSD (4.3-tahoe and
later).
`tar''s way of handling multiple hard links to a file can handle
file systems that support 32-bit inumbers (e.g., the BSD file system);
`cpio''s way requires you to play some games (in its "binary" format,
i-numbers are only 16 bits, and in its "portable ASCII" format, they're
18 bits--it would have to play games with the "file system ID" field of
the header to make sure that the file system ID/i-number pairs of
different files were always different), and we don't know which
`cpio's, if any, play those games. Those that don't might get confused
and think two files are the same file when they're not, and make hard
links between them.
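The confusion described above can be illustrated with a small sketch.
It assumes a reader that identifies hard links purely by the i-number
stored in the header, with no use of the file system ID field; the
helper names `truncate_inode' and `would_be_linked' are hypothetical.

```python
def truncate_inode(ino, bits):
    """Keep only the low-order bits that fit in the header field."""
    return ino & ((1 << bits) - 1)

def would_be_linked(ino_a, ino_b, bits):
    """True if two different files become indistinguishable once truncated."""
    return ino_a != ino_b and truncate_inode(ino_a, bits) == truncate_inode(ino_b, bits)

# Two distinct 32-bit i-numbers that agree in their low 18 bits.
first = 5
second = 5 + (1 << 18)

print(would_be_linked(first, second, 18))  # portable ASCII format: 18 bits
print(would_be_linked(first, second, 16))  # binary format: 16 bits
print(would_be_linked(first, second, 32))  # a full 32-bit field keeps them apart
```

The 18-bit figure follows from the six octal digits that the portable
ASCII header reserves for the i-number.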
`tar''s way of handling multiple hard links to a file places only
one copy of the link on the tape, but the name attached to that copy is
the _only_ one you can use to retrieve the file; `cpio''s way puts one
copy for every link, but you can retrieve it using any of the names.
What type of checksum (if any) is used, and how is this calculated?
See the attached manual pages for `tar' and `cpio' format. `tar'
uses a checksum which is the sum of all the bytes in the `tar' header
for a file; `cpio' uses no checksum.
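As a minimal sketch of the `tar' header checksum just described, the
following recomputes it for a header produced by Python's standard
`tarfile' module. One detail is not spelled out above: while summing,
the eight bytes of the checksum field itself (offsets 148 to 155) are
counted as spaces. The function names and the member name `hello.txt'
are made up for the example.

```python
import io
import tarfile

def header_checksum(block):
    """Sum of the 512 header bytes, counting the checksum field as spaces."""
    if len(block) != 512:
        raise ValueError("a tar header is exactly 512 bytes")
    return sum(block[:148]) + 8 * ord(" ") + sum(block[156:])

def stored_checksum(block):
    """Decode the octal number kept in the header's checksum field."""
    field = block[148:156].strip(b"\0 ")
    return int(field.decode("ascii"), 8)

# Build a one-member archive in memory and look at its first header.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
    payload = b"hello\n"
    info = tarfile.TarInfo("hello.txt")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))

header = buf.getvalue()[:512]
print(header_checksum(header) == stored_checksum(header))
```

Because the checksum covers only the header, it detects a damaged
header but says nothing about damage to the file contents that follow.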
Does anyone know why `cpio' was made when `tar' was present at the
Unix scene?
It wasn't. `cpio' first showed up in PWB/Unix 1.0; no
generally-available version of Unix had `tar' at the time. We don't
know whether any version that was generally available _within AT&T_ had
`tar', or, if so, whether the people within AT&T who did `cpio' knew
about it.
On restore, if there is a corruption on a tape, `tar' will stop at
that point, while `cpio' will skip over it and try to restore the rest
of the files.
The main difference is just in the command syntax and header format.
`tar' is a little more tape-oriented in that everything is blocked
to start on a record boundary.
Are there any differences in the ability to recover crashed
archives between the two of them? (Is there any chance of
recovering crashed archives at all?)
Theoretically, it should be easier under `tar' since the blocking
lets you find a header with some variation of `dd skip=NN'. However,
modern `cpio's and variants have an option to just search for the
next file header after an error with a reasonable chance of re-syncing.
Note that lots of tape driver software won't allow you to continue
past a media error, which should be the only reason for getting out of
sync unless a file changed sizes while you were writing the archive.
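The idea of using the 512-byte blocking to find the next header after
damage can be sketched as below. This is an illustration under stated
assumptions, not how any particular `tar' recovers: it treats a block
as a header when its stored checksum matches the recomputed one, which
can occasionally be fooled by file data that happens to look like a
header. The names `is_plausible_header' and `find_next_header' are
hypothetical.

```python
BLOCK = 512

def is_plausible_header(block):
    """True if the block's checksum field matches the sum of its bytes."""
    if len(block) != BLOCK:
        return False
    field = block[148:156].strip(b"\0 ")
    try:
        stored = int(field.decode("ascii"), 8)
    except ValueError:
        # Empty, non-ASCII, or non-octal field: not a header.
        return False
    # The checksum field itself is counted as eight spaces.
    computed = sum(block[:148]) + 8 * 0x20 + sum(block[156:])
    return stored == computed

def find_next_header(archive, start=0):
    """Return the offset of the first plausible header at or after start."""
    offset = -(-start // BLOCK) * BLOCK  # round up to a block boundary
    while offset + BLOCK <= len(archive):
        if is_plausible_header(archive[offset:offset + BLOCK]):
            return offset
        offset += BLOCK
    return None
```

Dividing the returned offset by 512 gives the number of blocks to
skip, in the spirit of `dd skip=NN' with a 512-byte block size.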
If anyone knows why `cpio' was made when `tar' was present at the
Unix scene, please tell me about this too.
Probably because it is more media efficient (by not blocking
everything and using only the space needed for the headers where `tar'
always uses 512 bytes per file header) and it knows how to archive
special files.
You might want to look at the freely available alternatives. The
major ones are `afio', `tar', and `pax', each of which has its own
extensions with some backwards compatibility.
Sparse files were `tar'red as sparse files (which you can easily
test, because the resulting archive gets smaller, and `cpio' can no
longer read it).

File: paxutils.info, Node: zip, Prev: cpio, Up: Other formats
Comparison of `tar' and `zip'
-----------------------------
On 1993-01-26, Jean-loup Gailly published on `bug-gnu-utils' a
useful comparison between the `tar' and `gzip' combination, and `zip'.
Here are the points of his letter.
* `tar -z' (that is, `tar' with `gzip') compresses a tar file into a
single stream. To extract one specific member (with `tar xfz
foo.tar.z member'), `gunzip' decompresses the whole `tar.z' file
and passes that to `tar'. This method improves compression since
`gzip' can take advantage of redundancy between files.
* `zip' compresses file members independently. `unzip' is then able
to seek directly to the proper location for extraction of a single
member. This method degrades compression but enables recovery in
case of damage to a portion of the `zip' file. If a `tar.z' file
is damaged, all data after the error is lost.
* The current version of `zip' does not store UID and GID, and
compresses hard links several times. `tar' works correctly.
* `unzip' has many tricks to convert file names from one system to
another, restore special file attributes (for VMS and OS/2), and so
forth ... `gzip' is only a data compression program, which should
be kept simple.
Jean-loup adds that he is thinking of adding an optional block size
parameter to `gzip' to improve error recovery, and refers to the `TODO'
file in the `gzip' 0.8.1 distribution.
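The compression trade-off in the first two points can be illustrated
with a small experiment. This sketch uses Python's `zlib' module as a
stand-in for the deflate compression shared by `gzip' and `zip', and
it ignores archive headers entirely; the member contents are
artificial (identical, individually incompressible data) so that the
redundancy between members is the only thing left to exploit.

```python
import hashlib
import zlib

def pseudo_random_bytes(seed, n):
    """Deterministic, hard-to-compress bytes derived from a seed."""
    out = bytearray()
    state = seed
    while len(out) < n:
        state = hashlib.sha256(state).digest()
        out += state
    return bytes(out[:n])

# Twenty members with identical contents.
members = [pseudo_random_bytes(b"same content", 4096) for _ in range(20)]

# `tar' plus `gzip' style: one stream over the concatenated members.
solid = len(zlib.compress(b"".join(members)))

# `zip' style: each member compressed on its own.
independent = sum(len(zlib.compress(m)) for m in members)

print(solid, independent)
```

The single stream comes out much smaller here because later members
are encoded as references to earlier ones; the price is that nothing
after a damaged spot in that stream can be decoded.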

File: paxutils.info, Node: Media, Next: Backups, Prev: Formats, Up: Top
Tapes and other archive media
*****************************
Archives are usually written on removable media--tape cartridges, mag
tapes, or floppy disks.
The amount of data a tape or disk can hold depends not only on its
size, but also on how it is formatted. A 2400 foot long reel of mag
tape holds 40 megabytes of data when formatted at 1600 bits per inch.
The physically smaller EXABYTE tape cartridge holds 2.3 gigabytes.
Magnetic media are re-usable--once the archive on a tape is no longer
needed, the archive can be erased and the tape or disk used over.
Media quality does deteriorate with use, however. Most tapes or disks
should be discarded when they begin to produce data errors.
Magnetic media are written and erased using magnetic fields, and
should be protected from such fields to avoid damage to stored data.
Sticking a floppy disk to a filing cabinet using a magnet is probably
not a good idea.
Format related parameters specify how an archive is written on the
archive media. The best choice of format parameters will vary
depending on the type and number of files being archived, and on the
media used to store the archive.
To specify format parameters when accessing or creating an archive,
you can use the options described in the following sections. If you do
not specify any format parameters, `tar' uses default parameters. You
cannot modify a compressed archive. If you create an archive with the
`--blocking-factor=BLOCKS' (`-b BLOCKS') option specified (*note
blocking-factor::.), you should specify that blocking factor when
operating on the archive. *Note Formats::, for other examples of
format parameter considerations.
When you access a previously created `tar' archive using `tar', you
should specify certain format parameters. These parameters were
specified when the archive was created, and `tar' is not always able to
determine some of the parameters for itself. The safest procedure is to
specify them again in order for `tar' to properly read and/or modify
the contents of the archive.
* Menu:
* Blocking:: Blocking
* Many on one:: Many archives on one tape
* One on many:: Using multiple tapes
* Being careful:: Being even more careful
* Other tape considerations:: Other tape considerations

File: paxutils.info, Node: Blocking, Next: Many on one, Prev: Media, Up: Media
Blocking
========
"Block" and "record" terminology is rather confused, and it is even
confusing to the expert reader. Currently, `tar' uses the POSIX
terminology, in which the terms are exchanged with regard to the IBM
terminology. In 1995-06, John Gilmore (the writer of the original
program which evolved into this current `tar') wrote:
The nomenclature of tape drives comes from IBM, where we believe
they were invented for the IBM 650 or so. On IBM mainframes, what
is recorded on tape are tape blocks. The logical organization of
data is into records. There are various ways of putting records
into blocks, including `F' (fixed sized records), `V' (variable
sized records), `FB' (fixed blocked: fixed size records, N to a
block), `VB' (variable size records, N to a block), `VSB'
(variable spanned blocked: variable sized records that can occupy
more than one block), etc. The `JCL' `DD RECFORM=' parameter
specified this to the operating system.
The Unix man page on `tar' was totally confused about this. When I
wrote `PD TAR', I used the historically correct terminology (`tar'
writes data records, which are grouped into blocks). It appears
that the bogus terminology made it into POSIX (no surprise here),
and now François has migrated that terminology back into the
source code too.
* Menu:
* Blocks and records:: Blocks and records
* blocking-factor:: Setting the blocking factor
* record-size:: Setting a record size
* Media types:: Per-Media blocking considerations
* Reblocking:: Automatic reblocking

File: paxutils.info, Node: Blocks and records, Next: blocking-factor, Prev: Blocking, Up: Blocking
Blocks and records
------------------
The term "physical block" means the basic transfer chunk from or to a
device, after which reading or writing may stop without anything being
lost. In this manual, the term "block" usually refers to a physical
disk block, _assuming_ that each disk block is 512 bytes in length. It
is true that some disk devices have different physical blocks, but `tar'
ignores these differences in its own format, which is meant to be
portable, so a `tar' block is always 512 bytes in length, and "block"
always means a `tar' block.(1) The term "physical record" is another
way of speaking about a physical block; the two terms are somewhat
interchangeable.
Unlike disks, tapes start and stop often. An "inter-record
gap" (IRG), or "gap" for short, is a small landing area on the tape with
no information on it, used for decelerating the tape to a full stop, and
for later regaining reading or writing speed. When the tape driver
starts reading a record, the record has to be read in its entirety
without stopping, as a gap is needed to stop the tape motion without
losing information. Many such gaps must be managed at regular
intervals, so the tape soon finds a place to stop when this is needed.
The recorded data is split into chunks that we call physical tape
blocks, each of which is sandwiched between two successive gaps. In
POSIX `tar' terminology, each physical tape block is called a "record".
Although each record might have its own length, it is customary to set
some maximum length for all records on a given tape. `tar' archives
may be put on disk or used with pipes, instead of being written to
tape. Nevertheless, `tar' tries to read and write the archive one
record at a time, whatever the medium in use.
For `tar', a record(2) is made up of an integral number of blocks,
and this operation of putting many disk blocks into a single record is
called "blocking". `tar' blocks are all fixed size (512 bytes), and
its scheme for putting them into records is to put a whole number of
them (one or more) into each record. In a single archive, all `tar'
records are the same size; at the end of the file there's a block
containing all zeros, which is how you tell that the remainder of the
last record(s) is (are) garbage. The usual number of disk blocks that
go into a single record is called the "blocking factor" for that tape.
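The arithmetic behind blocking is simple enough to write down. In
this sketch the function names are hypothetical; the only facts taken
from the text are the 512-byte block and the rule that a record holds
a whole number of blocks.

```python
BLOCK_SIZE = 512

def record_size(blocking_factor):
    """Size in bytes of one record holding blocking_factor blocks."""
    return blocking_factor * BLOCK_SIZE

def padded_archive_size(total_blocks, blocking_factor):
    """Archive length once total_blocks are rounded up to whole records.

    total_blocks counts every 512-byte block written: headers, file
    data, and the end-of-archive marker.
    """
    records = -(-total_blocks // blocking_factor)  # ceiling division
    return records * record_size(blocking_factor)

print(record_size(20))              # 20 blocks of 512 bytes each
print(padded_archive_size(4, 20))   # a tiny archive still fills one record
print(padded_archive_size(21, 20))  # one block too many needs a second record
```

This is why even a very small archive written with a blocking factor
of 20 occupies 10240 bytes.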
Using higher blocking (putting more disk blocks per record) will use
the tape more efficiently, as there will be fewer gaps. But reading
such tapes may be more difficult for the system, as more memory will be
required to receive the whole record at once. Further, if there is a
reading error on a huge record, it is less likely that the system will
succeed in recovering the information. So blocking should be neither
too low nor too high. By default, `tar' uses a blocking factor of 20
for historical reasons, and the choice does not really matter when
reading or writing to disk. Current tape technology would easily
accommodate higher blockings.
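A rough estimate shows how much the gaps matter. The sketch below
rests on assumptions that are not stated in this manual: a reel tape
storing 1600 bytes per inch along its length, and an inter-record gap
of 0.6 inch. The function name is hypothetical.

```python
def tape_capacity_bytes(length_feet, bytes_per_inch, blocking_factor,
                        gap_inches=0.6):
    """Estimate usable capacity when every record is followed by a gap."""
    record_bytes = blocking_factor * 512
    record_inches = record_bytes / bytes_per_inch
    # Whole records that fit on the tape, each with its trailing gap.
    records = int(length_feet * 12 / (record_inches + gap_inches))
    return records * record_bytes

for factor in (1, 20, 126):
    print(factor, tape_capacity_bytes(2400, 1600, factor))
```

Under these assumptions a 2400 foot reel holds about 42 million bytes
at a blocking factor of 20, but only about 16 million at a blocking
factor of 1, where most of the tape is spent on gaps.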
---------- Footnotes ----------
(1) The term "logical block" often represents the basic chunk of
allocation of many disk blocks as a single entity, which the operating
system treats somewhat atomically; this concept is used seldom if at
all in `tar'.
(2) The term "logical record" refers to the logical organization of
many characters into something meaningful to the application. The term
"unit record" describes a small set of characters which are transmitted
whole to or by the application, and often refers to a line of text.
Those two last terms are unrelated to what we call a "record" in `tar'.