authors | state |
---|---|
Alex Wilson <alex.wilson@joyent.com> | draft |
The ZFS send stream format is a convenient and efficient way to synchronize and
replicate ZFS filesystems, whether to another live system or to backup media.
However, the stream format and the code that parses it in `zfs receive` were
never designed to allow the processing of arbitrary untrusted data.
The stream format is a wrapper around a series of DMU-layer records which are
only validated in very simple ways before being written directly into the
receiving ZFS pool. This means it is possible for streams to contain DMU records
whose data violates invariants assumed by the upper layers of ZFS (such as the
ZPL). Such data would be blindly written into the pool by `zfs receive`, making
it potentially un-importable (and possibly resulting in a kernel panic or,
worse, allowing the corruption of other data in the pool).
This is particularly unfortunate for those offering illumos zones or Linux containers with ZFS access to external entities who are only partly trusted and should have no ability to break their isolation from other zones or containers.
Previous work in this area has mainly been limited to separating the privilege
to use ZFS from the privilege to use `zfs receive` -- such as the
`PRIV_SYS_FS_IMPORT` privilege in illumos. In this way, the partly trusted
tenant zones can be denied entirely the ability to receive ZFS streams, the
so-called "nuclear option".
It might seem that this is merely a matter of improving the validation
performed by the DMU record parser, but to think of this problem iteratively
is to underestimate it: not only must `zfs receive` be impervious to streams
that would result in a corrupt on-disk structure, it must also reject streams
that would result in denial-of-service of a host that attempts to import the
pool. This is a dauntingly high bar, and when multiple tenants share a single
zpool, the failure mode is potentially devastating: a kind of crown fire that
can leap the fire line from one tenant to another. In short, the problem of
making `zfs receive` robust with respect to byzantine streams is brutally
tricky and detail-oriented -- and the consequences of anything less than a
perfect solution are dire.
One shorter-term way to mitigate the problem somewhat is to enable the system to
make trust decisions based on the origin of a given ZFS send stream. If the
system can authenticate that the stream was originally produced by a legitimate
`zfs send` on a known, trusted machine, then this limits the potential bad
behavior that could be induced by the stream to match that which is already
present on the trusted source machine. Any code which has its invariants broken
by the stream's contents would also have them broken on the source machine,
which is a far narrower scope for problems.
In this document, we propose an enhancement to the stream format to allow the use of cryptographic signatures to verify the origin of a stream, both to provide this short-term mitigation, and also as a generally useful mechanism for enhanced integrity protection on ZFS send streams.
The ZFS send stream format consists of a wrapper (generated in userland) around a series of "DMU replay records", which are generated by the kernel.
DMU replay records consist of a header (a `dmu_replay_record_t`) followed by a
payload. The generic structure of a record looks like:
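
As a rough illustration of that layout, the header can be modelled with Python's `ctypes`. This is only a sketch under stated assumptions: the 272-byte size of `drr_u` follows the figure quoted in this document, the checksum is taken to be four 64-bit words (32 bytes), and the class name `DmuReplayRecord` is made up for the example. The authoritative definition is the C structure in the ZFS headers.

```python
import ctypes


class DmuReplayRecord(ctypes.Structure):
    """Illustrative model of dmu_replay_record_t; not the kernel definition."""

    _fields_ = [
        ("drr_type", ctypes.c_uint32),          # record type tag
        ("drr_payloadlen", ctypes.c_uint32),    # length of the trailing payload
        ("drr_u", ctypes.c_uint8 * 272),        # per-type fields (a union in C)
        ("drr_checksum", ctypes.c_uint64 * 4),  # running Fletcher-4 checksum
    ]


HEADER_SIZE = ctypes.sizeof(DmuReplayRecord)
CHECKSUM_OFFSET = DmuReplayRecord.drr_checksum.offset

if __name__ == "__main__":
    print("header size:", HEADER_SIZE)
    print("drr_u offset:", DmuReplayRecord.drr_u.offset)
    print("drr_checksum offset:", CHECKSUM_OFFSET)
```

Modelling `drr_u` as an opaque byte array keeps the sketch short; each record type would overlay its own fields on the start of that region.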
The `drr_u` member is a union containing the specific fields for each type of
record (types include `drr_object`, `drr_write`, `drr_spill` and some others).
All of these fit in the 272 available bytes between `drr_payloadlen` and
`drr_checksum`.
Size of used space in various types' `drr_u` values:
Type | Size (bytes) |
---|---|
drr_freeobjects | 24 |
drr_free | 32 |
drr_object | 40 |
drr_write_embedded | 48 |
drr_spill | 56 |
drr_write | 88 |
drr_write_byref | 104 |
At the beginning and end of a stream are two special records, the `drr_begin`
and `drr_end`. These records are considered special because neither of them has
the usual `drr_checksum` member (`drr_end` does have one, but not at the usual
offset). Also, `drr_end` must have `drr_payloadlen` set to zero, and no trailing
payload.
The `drr_checksum` members of ordinary records contain the 4th-order Fletcher checksum (also known as the "ZFS Fletcher-4" or "fletcher4") of all the bytes in the stream so far up to that `drr_checksum`. The `drr_end`'s checksum member contains the checksum of all bytes up until the beginning of the `drr_end` record.
This system of incremental checksums has many desirable properties -- it allows for each record to be checksummed on the fly, while also checksumming the total contents and order of the stream.
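
The running-checksum idea can be sketched in a few lines of Python. This is a simplified illustration rather than the kernel code: it assumes little-endian 32-bit words (a real stream uses the sender's byte order), the helper names `fletcher4_update` and `checksum_records` are invented for the example, and it treats each chunk as an opaque run of stream bytes instead of modelling exactly where each record's checksum field begins.

```python
import struct

MASK64 = (1 << 64) - 1


def fletcher4_update(state, data):
    """Fold `data` (a whole number of 32-bit words) into a Fletcher-4 state."""
    if len(data) % 4:
        raise ValueError("Fletcher-4 operates on whole 32-bit words")
    a, b, c, d = state
    # Assumption: little-endian words; a real stream uses the sender's order.
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)


def checksum_records(chunks):
    """Yield the running checksum after each chunk of stream bytes."""
    state = (0, 0, 0, 0)
    for chunk in chunks:
        state = fletcher4_update(state, chunk)
        yield state
```

Because the state is carried forward from chunk to chunk, each yielded value covers every byte seen so far, so reordering or dropping an earlier chunk changes all later checksums.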
Digital cryptographic signatures are commonly used to verify both the integrity and the authenticity of some array of bytes. The way in which they do this takes advantage of public-key cryptography. While a full introduction to public-key cryptography is beyond the scope of this document, public-key algorithms allow the creation and use of a key that is in two parts: a "public" key and a "private" key. Data that has been encrypted with the "public" key can only be decrypted with the "private" key, and vice versa.
In a public-key digital signature scheme, conceptually, you keep the "private" key to yourself while sharing the "public" key with all of your users. Then, you can take a secure hash of some content and encrypt it with your private key. Your users can easily decrypt the hash with your public key that you have shared with them, and be assured that if the data's hash as they compute it matches the decrypted hash, then the data originated from you (since nobody else has the private key).
While this is not quite the way in which real signature schemes work, it does suffice to illustrate the broad concept. The signature consists of information derived from the content and your private key, and can be easily verified with your public key, but cannot be easily forged without having your private key at hand.
Generally, both keys and signatures are a small collection of large numbers (256 to 4096 or more bits in length, depending on algorithm). The following table shows typical sizes of some signatures for common public-key algorithms:
Algorithm | Signature size |
---|---|
RSA (4096 bit key) | 512 bytes (1x key-sized number) |
RSA (2048 bit key) | 256 bytes (1x key-sized number) |
DSA (1024 bit key, 160 bit subgroup) | 40 bytes (2x subgroup-sized numbers) |
ECDSA on NIST P-256 (256 bit key) | 64 bytes (2x key-sized numbers) |
ECDSA on NIST P-384 (384 bit key) | 96 bytes (2x key-sized numbers) |
EdDSA on Curve25519 (256 bit key) | 64 bytes (1x 2x-key-sized number) |
Notice that, based on the above table and the structure of the DMU stream record
headers, there is enough space left in the existing `dmu_replay_record_t` (168
bytes, even for the largest record type) to accommodate either an ECDSA or EdDSA
signature. In fact, after adding a 64-byte signature there would still be 104
padding bytes remaining in the largest of the record types (`drr_write_byref`).
It is desirable to keep the overall size of the `dmu_replay_record_t` unchanged,
as this allows for much simpler compatibility with both previous and future
versions of the stream format.
To allow easy identification of the keypair that was used to sign the records,
we should also include a "key fingerprint" (a hash of the public key) and an
identifier for the algorithms used, in the stream. Rather than including these
on each record, though, they will be kept in the nvlist following the
`drr_begin` record, as other forms of extensible metadata already are (such as
that belonging to resumable send streams).
The proposed modified `dmu_replay_record_t` structure (for EdDSA on Curve25519, ECDSA on a 256-bit curve, etc.):
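
One way to picture the modified header is the `ctypes` sketch below. It is a hypothetical layout, not the definitive one: the position of `drr_signature` (carved out of the unused tail of `drr_u`, immediately before `drr_checksum`) is an assumption, as are the class and helper names. What it does show is that the overall header size is unchanged and how much padding is left for the largest record type.

```python
import ctypes

SIGNATURE_SIZE = 64   # Ed25519 or ECDSA on a 256-bit curve; 96 for a 384-bit curve
LARGEST_DRR_U = 104   # drr_write_byref, the largest record type


class SignedDmuReplayRecord(ctypes.Structure):
    """Hypothetical layout: the signature takes over unused drr_u space."""

    _fields_ = [
        ("drr_type", ctypes.c_uint32),
        ("drr_payloadlen", ctypes.c_uint32),
        ("drr_u", ctypes.c_uint8 * (272 - SIGNATURE_SIZE)),
        ("drr_signature", ctypes.c_uint8 * SIGNATURE_SIZE),  # assumed position
        ("drr_checksum", ctypes.c_uint64 * 4),
    ]


def spare_bytes(signature_size, used=LARGEST_DRR_U):
    """Padding left in the 272-byte region after record fields and signature."""
    return 272 - used - signature_size
```

With these numbers `spare_bytes(64)` gives the 104 bytes mentioned earlier, and a 96-byte signature would leave 72.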
And the new fields to be added to the `drr_begin` nvlist:
- `signed` (`boolean`) -- true if the stream is signed
- `signature` (`nvlist`)
  - `alg` (`string`) -- public-key algorithm, one of `eddsa` or `ecdsa`
  - `curve` (`string`) -- OID or name of named elliptic curve
- `key_fp` (`nvlist`)
  - `alg` (`string`) -- name of hash algorithm (`sha256`, `sha384` etc)
  - `hash` (`byte array`) -- hash of the public key
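
To make the shape of this metadata concrete, here is a small Python sketch that builds it as a plain dictionary standing in for an nvlist. Several things are assumptions made for illustration: the function name `begin_nvlist_fields`, the placement of `key_fp` alongside `signature` rather than inside it, and the exact encoding of the public key that gets hashed, which this document does not specify.

```python
import hashlib


def begin_nvlist_fields(alg, curve, public_key_bytes, fp_alg="sha256"):
    """Build the proposed signing metadata as a dict (an nvlist stand-in)."""
    # Assumption: the fingerprint is a hash over some canonical encoding of
    # the public key; here we simply hash whatever bytes the caller provides.
    digest = hashlib.new(fp_alg, public_key_bytes).digest()
    return {
        "signed": True,
        "signature": {"alg": alg, "curve": curve},
        "key_fp": {"alg": fp_alg, "hash": digest},
    }


if __name__ == "__main__":
    placeholder_key = bytes(32)  # stand-in for a real 32-byte Ed25519 public key
    print(begin_nvlist_fields("eddsa", "curve25519", placeholder_key))
```

A receiver would read `key_fp` first to look up the matching trusted public key, then use `signature` to decide how to interpret the per-record signature bytes.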
The acceptable combinations of `alg` and `curve` in `signature`:
alg | curve aliases | sizeof(drr_signature) |
---|---|---|
'ecdsa' | 'nistp256' 'prime256v1' '1.2.840.10045.3.1.7' | 64 |
'ecdsa' | 'nistp384' 'secp384r1' '1.3.132.0.34' | 96 |
'eddsa' | 'curve25519' | 64 |
Note that the selection of `alg` and `curve` together determines the size that
is used for the `drr_signature` member within all records in that stream.
Only 256 and 384-bit curves are permitted, in order to leave space for expansion
of the existing record header structures as needed. If 521-bit curves were
allowed, the resulting 132-byte signature would reduce the available space
remaining after a `drr_write_byref` to just 36 bytes (fewer than five 64-bit
words).
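
The table of permitted combinations lends itself to a simple lookup. The following Python sketch is one possible way a receiver might map `alg` and `curve` to a signature size and check the remaining header space; the names `SIGNATURE_SIZES`, `signature_size` and `headroom` are invented for the example, and the 132-byte figure for a 521-bit curve assumes two 66-byte integers.

```python
SIGNATURE_SIZES = {
    ("ecdsa", "nistp256"): 64,
    ("ecdsa", "prime256v1"): 64,
    ("ecdsa", "1.2.840.10045.3.1.7"): 64,
    ("ecdsa", "nistp384"): 96,
    ("ecdsa", "secp384r1"): 96,
    ("ecdsa", "1.3.132.0.34"): 96,
    ("eddsa", "curve25519"): 64,
}

SPARE_IN_LARGEST_RECORD = 272 - 104  # bytes free after drr_write_byref
P521_SIGNATURE_SIZE = 2 * 66         # assumed: two 521-bit integers, rounded up


def signature_size(alg, curve):
    """Return sizeof(drr_signature) for a permitted alg/curve pair."""
    try:
        return SIGNATURE_SIZES[(alg, curve)]
    except KeyError:
        raise ValueError(f"unsupported signature type: {alg}/{curve}") from None


def headroom(alg, curve):
    """Padding bytes left in the largest record type after the signature."""
    return SPARE_IN_LARGEST_RECORD - signature_size(alg, curve)
```

Rejecting unknown pairs outright, rather than guessing a size, keeps the parser from reading a signature of the wrong length out of the header.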
As the signature is used to protect the data produced by the kernel, the keys for signing will have to be kept in a kernel keyring.
On illumos, the proposal is to augment the kernel crypto framework (KCF) with some keyring management functions and better support for serializing and deserializing keys. The KCF already contains support for ECDSA with most common curves, and can easily be enhanced to support EdDSA.
The kernel keyring will be loaded with two types of keys: key pairs used for signing; and trusted public keys that will be accepted for receive operations.
Three new privileges will be added to the illumos privilege model:
`PRIV_KERN_KEY_MGMT` (representing full control over the kernel keyring
framework), `PRIV_SYS_FS_IMPORT_KEY_MGMT` (only the ZFS signed send keyring),
and `PRIV_KERN_KEY_LIST` (to only list the `keyIds` or fingerprints of all
available keys). Non-global zones will not have any of these privileges by
default.
Also, the existing `PRIV_SYS_FS_IMPORT` privilege will be split into
sub-privileges `PRIV_SYS_FS_IMPORT_SIGNED` and `PRIV_SYS_FS_IMPORT_UNSIGNED`.
The intention is to only give non-global zones with delegated datasets the
`PRIV_SYS_FS_IMPORT_SIGNED` right in multi-tenant or high-security operations.
The `zfs send` command will be augmented to accept a `-s keyId` option to
produce a signed send stream. The `keyId` can be given as `default` to use the
first available private key on the ZFS keyring. If the user running `zfs send`
has only the `PRIV_SYS_FS_IMPORT_SIGNED` privilege, `-s default` will always be
assumed, unless `-s none` is provided. This is so that existing scripts
using `zfs send | zfs recv` can continue to work in an environment that allows
only signed streams.
The `zfs receive` command will also be augmented, with a `-s` option (no
argument), which imports a signed stream while verifying it against any matching
public key in the ZFS keyring. If `-s` is supplied and the stream is not signed,
an error will be produced. Errors are also produced (and the import aborted) if
the signature does not verify successfully or the key fingerprint used cannot be
found in the ZFS keyring.
A second option to `zfs receive`, `-k`, can be combined with `-s` to allow the
importation of streams that are unsigned or where the signing key cannot be
found on the ZFS keyring. If `-k` is given and the stream is found to be signed
with a known key, an error will still be produced if that signature is invalid.
Finally, if `zfs receive` is run by a user only possessing the
`PRIV_SYS_FS_IMPORT_SIGNED` privilege, the `-s` option will be implied by
default, and it will be an error to use `-k`.
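
The receive-side rules described above can be summarised as a small decision function. The Python sketch below is one reading of those rules, not the proposed implementation: the function name `receive_policy` and its parameters are hypothetical, and the behaviour when neither `-s` nor the signed-only privilege applies (accepting the stream as `zfs receive` does today) is an assumption, since the text does not spell it out.

```python
def receive_policy(signed, key_known, sig_valid,
                   opt_s=False, opt_k=False, signed_only_priv=False):
    """Return "accept" or "reject: <reason>" for one incoming stream."""
    if signed_only_priv:
        if opt_k:
            return "reject: -k is not permitted with only PRIV_SYS_FS_IMPORT_SIGNED"
        opt_s = True  # -s is implied for users with the signed-only privilege
    if not opt_s:
        # Assumption: without -s, receive behaves as it does today.
        return "accept"
    if not signed:
        return "accept" if opt_k else "reject: stream is not signed"
    if not key_known:
        return "accept" if opt_k else "reject: signing key not in keyring"
    if not sig_valid:
        # Even with -k, a bad signature from a known key is always an error.
        return "reject: signature does not verify"
    return "accept"
```

Note that `-k` only relaxes the cases where no verification is possible; it never turns a failed verification into a success.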
- All the supported signature algorithms are considered "128-bit secure", i.e. they have a security target >= 2^128 with the best known methods today.
- The use of public-key algorithms for signing allows the fault domains for
key exposure to be more tightly contained than a symmetric key
infrastructure.
  - In this case, though, it does not necessarily improve overall system security compared to using a symmetric HMAC key in the use case of multi-tenant non-global zones. The weakest link here is the isolation of the kernel keyring from the users.
- As long as no data is ever written to the pool before a signature covering it
has been validated, the only code that must be robust against malicious
input is the minimum parsing code necessary to find and validate each
signature.
  - This, however, does include the nvlist parser. The `drr_begin` nvlist payload must be opened to find the algorithm details and key to be used for validation. As a result, the nvlist parser must be safe against malicious input.
- The code parsing the additional structure around the DMU stream in userland
should be isolated using privilege separation techniques as much as possible.
  - Bugs here will not enable kernel privilege escalation, unlike bugs within the DMU stream proper, so not signing the whole package still provides worthwhile benefits.
While performance will obviously depend on the implementation, Ed25519 is reported to achieve about 71000 (batched) verifications per second, or 109000 signatures per second, on a quad-core 2.4GHz Intel CPU of the Nehalem/Westmere generation. For all but the smallest records, the time spent on the signature operation itself is much smaller than the time spent hashing the data being signed.
Continuing with the Ed25519 example, which uses a SHA-512 hash, we arrive at the following performance ceilings on a Nehalem-era Intel CPU (a performance ceiling is the maximum rate at which we could produce a ZFS send stream if signing were the dominant cost and all other code were negligible):
Assumed average record size | Estimated throughput ceiling |
---|---|
8KB | 290 MB/s |
32KB | 479 MB/s |
128KB | 572 MB/s |
16MB | 612 MB/s |
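
The figures in this table are consistent with a simple cost model in which every record pays for one hashing pass over its bytes plus one fixed-cost signature operation. The Python sketch below reproduces the table to within a couple of MB/s under two assumptions that are inferred from the numbers rather than stated in this document: a signature rate of about 71000 operations per second, and a SHA-512 hashing speed of roughly 612.5 MiB/s (with "MB" read as 2^20 bytes).

```python
SIGN_OPS_PER_SEC = 71_000             # assumed: rate that appears to fit the table
HASH_BYTES_PER_SEC = 612.5 * 2**20    # assumed: SHA-512 speed inferred from the 16MB row


def throughput_ceiling(record_size):
    """Bytes per second when each record costs one hash pass plus one signature."""
    seconds_per_record = record_size / HASH_BYTES_PER_SEC + 1 / SIGN_OPS_PER_SEC
    return record_size / seconds_per_record


def ceiling_mib(record_size):
    """The same ceiling expressed in units of 2**20 bytes per second."""
    return throughput_ceiling(record_size) / 2**20


if __name__ == "__main__":
    for size in (8 * 1024, 32 * 1024, 128 * 1024, 16 * 2**20):
        print(f"{size:>10} bytes/record: {ceiling_mib(size):.0f} MB/s")
```

The model also shows why large records approach the raw hashing speed: the fixed signature cost is amortised over more bytes, so it matters mainly for small records.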
For extremely large amounts of data on systems with very high disk bandwidth (and network bandwidth, if sending across a network), signing could plausibly become the limiting factor in ZFS send performance. On more typical, smaller systems, however, disk read bandwidth is likely to remain the dominant cost.
If this performance is found to be a problem, one option is to amend this proposal to use a faster hash algorithm on the actual data and sign only the concatenated hashes of all the records in the stream so far.
Approaches based on making each signature cover more data (e.g. by only
putting `drr_signature` in every Nth record) will, however, not improve
performance significantly except with very small records, as the
performance ceiling is otherwise always dominated by the cost of hashing the
data and not by the signature calculation itself.