authors	state
Alex Wilson <alex.wilson@joyent.com>	draft

RFD 14 Signed ZFS Send

Introduction

The ZFS send stream format is a convenient and efficient way to synchronize and replicate ZFS filesystems, whether to another live system or to backup media. However, the stream format and the code that parses it in zfs receive were never designed to allow the processing of arbitrary untrusted data.

The stream format is a wrapper around a series of DMU-layer records which are only validated in very simple ways before being written directly into the receiving ZFS pool. This means it is possible for streams to contain DMU records whose data violates invariants assumed by the upper layers of ZFS (such as the ZPL). Such data would be blindly written into the pool by zfs receive, potentially making the pool un-importable (and possibly resulting in a kernel panic or, worse, allowing the corruption of other data in the pool).

This is particularly unfortunate for those offering illumos zones or Linux containers with ZFS access to external entities who are only partly trusted and should have no ability to break their isolation from other zones or containers.

Previous work in this area has mainly been limited to separating the privilege to use ZFS from the privilege to use zfs receive -- such as the PRIV_SYS_FS_IMPORT privilege in illumos. In this way, partly trusted tenant zones can be denied the ability to receive ZFS streams entirely: the so-called "nuclear option".

It might feel that this is merely a matter of improving the validation performed by the DMU record parser, but to think of this problem iteratively is to underestimate it: not only must zfs receive be impervious to streams that would result in a corrupt on-disk structure, it must also reject streams that would result in denial-of-service of a host that attempts to import the pool. This is a dauntingly high bar, and when multiple tenants share a single zpool, the failure mode is potentially devastating: a kind of crown fire that can leap the fire line from one tenant to another. In short, the problem of making zfs receive robust with respect to byzantine streams is brutally tricky and detail-oriented -- and the consequences of anything less than a perfect solution are dire.

One shorter-term way to mitigate the problem somewhat is to enable the system to make trust decisions based on the origin of a given ZFS send stream. If the system can authenticate that the stream was originally produced by a legitimate zfs send on a known, trusted machine, then this limits the potential bad behavior that could be induced by the stream to match that which is already present on the trusted source machine. Any code which has its invariants broken by the stream's contents would also have them broken on the source machine, which is a far narrower scope for problems.

In this document, we propose an enhancement to the stream format to allow the use of cryptographic signatures to verify the origin of a stream, both to provide this short-term mitigation, and also as a generally useful mechanism for enhanced integrity protection on ZFS send streams.

Send stream record structure

The ZFS send stream format consists of a wrapper (generated in userland) around a series of "DMU replay records", which are generated by the kernel.

DMU replay records consist of a header (a dmu_replay_record_t) followed by a payload. The generic structure of a record looks like:
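
A minimal sketch of that layout, modelled with Python's ctypes. Only the names drr_payloadlen, drr_u and drr_checksum and the 272-byte figure come from this document; the drr_type field, the 32-bit field widths and the 32-byte checksum (four 64-bit words) are assumptions made for the sketch, and the class name is hypothetical.

```python
import ctypes


class DmuReplayRecordSketch(ctypes.Structure):
    """Hypothetical model of a record header, not the real definition."""

    _fields_ = [
        ("drr_type", ctypes.c_uint32),          # assumed: record type tag
        ("drr_payloadlen", ctypes.c_uint32),    # length of the payload that follows
        ("drr_u", ctypes.c_uint8 * 272),        # space for the type-specific fields
        ("drr_checksum", ctypes.c_uint64 * 4),  # assumed: four 64-bit checksum words
    ]


header_size = ctypes.sizeof(DmuReplayRecordSketch)
union_offset = DmuReplayRecordSketch.drr_u.offset
checksum_offset = DmuReplayRecordSketch.drr_checksum.offset

print("header size:", header_size)
print("bytes available to drr_u:", checksum_offset - union_offset)
```

Under these assumptions the header has a fixed size regardless of record type, which is what lets a signature later be placed in the unused part of drr_u without changing the overall layout.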

The drr_u member is a union containing the specific fields for each type of record (types include drr_object, drr_write, drr_spill and some others). All of these fit in the 272 available bytes between drr_payloadlen and drr_checksum.

Size of used space in various types' drr_u values:

Type	Size (bytes)
`drr_freeobjects`	24
`drr_free`	32
`drr_object`	40
`drr_write_embedded`	48
`drr_spill`	56
`drr_write`	88
`drr_write_byref`	104

At the beginning and end of a stream are two special records, the drr_begin and drr_end. These records are considered special because neither of them has the usual drr_checksum member (drr_end does carry a checksum, but not at the usual offset). In addition, drr_end must have drr_payloadlen set to zero and no trailing payload.

The drr_checksum members of ordinary records contain the 4th-order Fletcher checksum (also known as the "ZFS Fletcher-4" or "fletcher4") of all the bytes in the stream so far up to that drr_checksum. The drr_end's checksum member contains the checksum of all bytes up until the beginning of the drr_end record.

This system of incremental checksums has many desirable properties -- it allows for each record to be checksummed on the fly, while also checksumming the total contents and order of the stream.
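
The incremental scheme can be sketched as a single running Fletcher-4 state that is sampled once per record. This is a simplified illustration, not the kernel code: it assumes little-endian 32-bit words, treats each element of the records list as the bytes of one record that precede its checksum field, and ignores how the stored checksum bytes themselves feed into later records. The class and function names are hypothetical.

```python
import struct

MASK64 = (1 << 64) - 1


class Fletcher4:
    """Running Fletcher-4 state: four 64-bit accumulators over 32-bit words."""

    def __init__(self):
        self.a = self.b = self.c = self.d = 0

    def update(self, data):
        # Assumption: input is a whole number of little-endian 32-bit words.
        if len(data) % 4:
            raise ValueError("data must be a multiple of 4 bytes")
        for (word,) in struct.iter_unpack("<I", data):
            self.a = (self.a + word) & MASK64
            self.b = (self.b + self.a) & MASK64
            self.c = (self.c + self.b) & MASK64
            self.d = (self.d + self.c) & MASK64

    def value(self):
        return (self.a, self.b, self.c, self.d)


def checksum_records(records):
    """Return the running checksum as it stands after each record's bytes."""
    state = Fletcher4()
    snapshots = []
    for rec in records:
        state.update(rec)
        snapshots.append(state.value())
    return snapshots


records = [struct.pack("<2I", 1, 2), struct.pack("<I", 3)]
snapshots = checksum_records(records)
```

Because the state is never reset, each snapshot depends on every earlier byte, so reordering or dropping a record changes all later checksums.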

Cryptographic signatures

Digital cryptographic signatures are commonly used to verify both the integrity and the authenticity of some array of bytes. The way in which they do this takes advantage of public-key cryptography. While a full introduction to public-key cryptography is beyond the scope of this document, public-key algorithms allow the creation and use of a key that is in two parts: a "public" key and a "private" key. Data that has been encrypted with the "public" key can only be decrypted with the "private" key, and vice versa.

In a public-key digital signature scheme, conceptually, you keep the "private" key to yourself while sharing the "public" key with all of your users. Then, you can take a secure hash of some content and encrypt it with your private key. Your users can easily decrypt the hash with your public key that you have shared with them, and be assured that if the data's hash as they compute it matches the decrypted hash, then the data originated from you (since nobody else has the private key).

While this is not quite the way in which real signature schemes work, it does suffice to illustrate the broad concept. The signature consists of information derived from the content and your private key, and can be easily verified with your public key, but cannot be easily forged without having your private key at hand.

Generally, both keys and signatures are a small collection of large numbers (256 to 4096 or more bits in length, depending on algorithm). The following table shows typical sizes of some signatures for common public-key algorithms:

Algorithm	Signature size
RSA (4096 bit key)	512 bytes (1x key-sized number)
RSA (2048 bit key)	256 bytes (1x key-sized number)
DSA (1024 bit key)	40 bytes (2x 160-bit numbers)
ECDSA on NIST P-256 (256 bit key)	64 bytes (2x key-sized numbers)
ECDSA on NIST P-384 (384 bit key)	96 bytes (2x key-sized numbers)
EdDSA on Curve25519 (256 bit key)	64 bytes (1x 2x-key-sized number)

Design proposal

Notice that, based on the above table and the structure of the DMU stream record headers, there is enough space left in the existing dmu_replay_record_t (168 bytes) to accommodate either an ECDSA or EdDSA signature. In fact, after adding a 64-byte signature there would still be 104 remaining padding bytes in the largest of the record types (drr_write_byref).
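
The space arithmetic can be checked with a few lines of Python. The sizes are the ones listed in the table of drr_u usage earlier in this document; the function name free_bytes is hypothetical.

```python
UNION_SPACE = 272  # bytes between drr_payloadlen and drr_checksum

# Used space in drr_u per record type, as listed in this document.
DRR_U_USED = {
    "drr_freeobjects": 24,
    "drr_free": 32,
    "drr_object": 40,
    "drr_write_embedded": 48,
    "drr_spill": 56,
    "drr_write": 88,
    "drr_write_byref": 104,
}


def free_bytes(record_type, signature_len=0):
    """Padding left in drr_u after the type's fields and an optional signature."""
    return UNION_SPACE - DRR_U_USED[record_type] - signature_len


largest = max(DRR_U_USED, key=DRR_U_USED.get)
for name in DRR_U_USED:
    print(name, free_bytes(name), free_bytes(name, 64))
```

Every other record type has more room than drr_write_byref, so a signature that fits there fits everywhere.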

It is desirable to keep the overall size of the dmu_replay_record_t unchanged, as this allows for much simpler compatibility with both previous and future versions of the stream format.

To allow easy identification of the keypair that was used to sign the records, we should also include a "key fingerprint" (a hash of the public key) and an identifier for the algorithms used, in the stream. Rather than including these on each record, though, they will be kept in the nvlist following the drr_begin record, as other forms of extensible metadata already are (such as that belonging to resumable send streams).

The proposed modified dmu_replay_record_t structure (for EdDSA on 25519, ECDSA 256-bit curve, etc):

And the new fields to be added to the drr_begin nvlist:

- signed (boolean) -- true if the stream is signed
- signature (nvlist)
  - alg (string) -- public-key algorithm, one of eddsa or ecdsa
  - curve (string) -- OID or name of named elliptic curve
- key_fp (nvlist)
  - alg (string) -- name of hash algorithm (sha256, sha384 etc)
  - hash (byte array) -- hash of the public key

The acceptable combinations of alg and curve in signature:

`alg`	`curve` aliases	`sizeof(drr_signature)`
`'ecdsa'`	`'nistp256'` `'prime256v1'` `'1.2.840.10045.3.1.7'`	64
`'ecdsa'`	`'nistp384'` `'secp384r1'` `'1.3.132.0.34'`	96
`'eddsa'`	`'curve25519'`	64

Note that the selection of alg and curve together determines the size that is used for the drr_signature member within all records in that stream.
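
A receiver could resolve the signature size from the nvlist fields with a simple lookup, sketched below. The alias and size values are taken from the table above; the choice of nistp256 and nistp384 as canonical names, the function name drr_signature_size, and the use of a plain dict to stand in for the nvlist are all assumptions for illustration.

```python
# Map each alias to a canonical curve name (canonical choice is an assumption).
CURVE_ALIASES = {
    "prime256v1": "nistp256",
    "1.2.840.10045.3.1.7": "nistp256",
    "secp384r1": "nistp384",
    "1.3.132.0.34": "nistp384",
}

# Acceptable (alg, curve) pairs and the resulting sizeof(drr_signature).
SIGNATURE_SIZES = {
    ("ecdsa", "nistp256"): 64,
    ("ecdsa", "nistp384"): 96,
    ("eddsa", "curve25519"): 64,
}


def drr_signature_size(alg, curve):
    """Return the per-record signature size, or raise for unsupported pairs."""
    canonical = CURVE_ALIASES.get(curve, curve)
    try:
        return SIGNATURE_SIZES[(alg, canonical)]
    except KeyError:
        raise ValueError(f"unsupported combination: {alg!r} / {curve!r}") from None


# A plain dict standing in for the drr_begin nvlist of a signed stream.
begin_nvlist = {
    "signed": True,
    "signature": {"alg": "ecdsa", "curve": "secp384r1"},
    "key_fp": {"alg": "sha256", "hash": bytes(32)},  # placeholder fingerprint
}

sig = begin_nvlist["signature"]
print(drr_signature_size(sig["alg"], sig["curve"]))
```

Rejecting unknown pairs up front means the record parser never has to guess how many bytes of each header belong to the signature.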

Only 256 and 384-bit curves are permitted, in order to leave space for expansion of the existing record header structures as needed. If 521-bit curves were allowed, their 132-byte signatures would reduce the space remaining after a drr_write_byref to 36 bytes, barely more than four 64-bit words.
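
The effect of curve size on the remaining space can be worked through as follows. The sketch assumes a raw ECDSA signature encoding (two integers, each padded to a whole number of bytes, with no DER wrapping); the function name is hypothetical.

```python
WRITE_BYREF_FREE = 272 - 104  # free bytes in drr_u for drr_write_byref


def ecdsa_signature_len(field_bits):
    """Raw ECDSA signature length: two integers rounded up to whole bytes."""
    return 2 * ((field_bits + 7) // 8)


remaining = {
    bits: WRITE_BYREF_FREE - ecdsa_signature_len(bits)
    for bits in (256, 384, 521)
}

for bits, left in remaining.items():
    print(f"{bits}-bit curve: {ecdsa_signature_len(bits)} byte signature, "
          f"{left} bytes left")
```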

User interface

As the signature is used to protect the data produced by the kernel, the keys for signing will have to be kept in a kernel keyring.

On illumos, the proposal is to augment the kernel crypto framework (KCF) with some keyring management functions and better support for serializing and deserializing keys. The KCF already contains support for ECDSA with most common curves, and can easily be enhanced to support EdDSA.

The kernel keyring will be loaded with two types of keys: key pairs used for signing; and trusted public keys that will be accepted for receive operations.

Three new privileges will be added to the illumos privilege model: PRIV_KERN_KEY_MGMT (representing full control over the kernel keyring framework), PRIV_SYS_FS_IMPORT_KEY_MGMT (only the ZFS signed send keyring), and PRIV_KERN_KEY_LIST (to only list the keyIds or fingerprints of all available keys). Non-global zones will not have any of these privileges by default.

Also, the existing PRIV_SYS_FS_IMPORT privilege will be split into the sub-privileges PRIV_SYS_FS_IMPORT_SIGNED and PRIV_SYS_FS_IMPORT_UNSIGNED. The intention is to give non-global zones with delegated datasets only the PRIV_SYS_FS_IMPORT_SIGNED privilege in multi-tenant or high-security operations.

The zfs send command will be augmented to accept a -s keyId option to produce a signed send stream. The keyId can be given as default to use the first available private key on the ZFS keyring. If the user running zfs send has only the PRIV_SYS_FS_IMPORT_SIGNED privilege, -s default will be assumed always, unless -s none is provided. This is so that existing scripts using zfs send | zfs recv can continue to work in an environment that allows only signed streams.
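
The defaulting rule for zfs send can be sketched as a small function. This is an illustration of the rule as described, not proposed code: the function name and parameters are hypothetical, and the treatment of -s none for a fully privileged user (an unsigned stream) is an assumption.

```python
def effective_sign_key(s_option, signed_only):
    """
    Decide which key, if any, zfs send signs with.

    s_option    -- the keyId passed to -s, or None when -s was not given
    signed_only -- True when the caller holds only PRIV_SYS_FS_IMPORT_SIGNED
    Returns a keyId ("default" means the first available private key),
    or None for an unsigned stream.
    """
    if s_option == "none":
        return None  # signing explicitly disabled
    if s_option is None:
        # No -s given: signed-only callers get "-s default" implicitly.
        return "default" if signed_only else None
    return s_option


for args in [(None, True), (None, False), ("none", True), ("backup-key", False)]:
    print(args, "->", effective_sign_key(*args))
```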

The zfs receive command will also be augmented, with a -s option (no argument), which imports a signed stream while verifying it against any matching public key in the ZFS keyring. If -s is supplied and the stream is not signed, an error will be produced. Errors are also produced (and the import aborted) if the signature does not verify successfully or the key fingerprint used cannot be found in the ZFS keyring.

A second option to zfs receive, -k, can be combined with -s to allow the import of streams that are unsigned or whose signing key cannot be found on the ZFS keyring. If -k is given and the stream is found to be signed with a known key, an error will still be produced if that signature is invalid.

Finally, if zfs receive is run by a user only possessing the PRIV_SYS_FS_IMPORT_SIGNED privilege, the -s option will be implied by default, and it will be an error to use -k.
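
Taken together, the zfs receive rules above amount to a small decision procedure, sketched below. The function and its parameters are hypothetical. Two behaviours are assumptions not stated in this document: that -k without -s is a usage error, and that a receive without -s by a user allowed unsigned imports proceeds with no verification, as it does today.

```python
def receive_decision(opt_s, opt_k, signed_only, stream_signed, key_known, sig_valid):
    """Return "accept" or a string starting with "reject:"."""
    if signed_only:
        if opt_k:
            return "reject: -k not allowed with only PRIV_SYS_FS_IMPORT_SIGNED"
        opt_s = True  # -s is implied for signed-only users

    if not opt_s:
        if opt_k:
            return "reject: -k requires -s"  # assumption
        return "accept"  # assumption: unverified receive, as today

    if not stream_signed:
        return "accept" if opt_k else "reject: stream is not signed"
    if not key_known:
        return "accept" if opt_k else "reject: signing key not in keyring"
    if not sig_valid:
        # A known key with a bad signature is always an error, even with -k.
        return "reject: signature does not verify"
    return "accept"


print(receive_decision(True, False, False, True, True, True))
print(receive_decision(True, True, False, True, True, False))
```

Note that sig_valid is only consulted when the stream is signed and the key is known, which matches the rule that -k never excuses an invalid signature from a known key.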

Analysis

Security

- All the supported signature algorithms are considered "128-bit secure", i.e. they have a security target >= 2¹²⁸ against the best known attacks today.
- The use of public-key algorithms for signing allows the fault domains for key exposure to be more tightly contained than with a symmetric key infrastructure.
  - In this case, though, it does not necessarily improve overall system security compared to using a symmetric HMAC key in the use case of multi-tenant non-global zones. The weakest link here is the isolation of the kernel keyring from the users.
- As long as no data is ever written to the pool before a signature covering it has been validated, the only code that must be robust against malicious input is the minimum parsing code necessary to find and validate each signature.
  - This, however, does include the nvlist parser. The drr_begin nvlist payload must be opened to find the algorithm details and key to be used for validation. As a result, the nvlist parser must be safe against malicious input.
- The code parsing the additional structure around the DMU stream in userland should be isolated using privilege separation techniques as much as possible.
  - Bugs here will not enable kernel privilege escalation, unlike bugs within the DMU stream proper, so not signing the whole package still provides worthwhile benefits.

Performance

While performance will obviously depend on the implementation, Ed25519 is known to achieve 71000 verifications per second, or 109000 signatures per second, per core on a Nehalem-era Intel CPU. At these rates, the cost of the signature operation itself is much smaller than the cost of hashing the data being signed.

Proceeding with the Ed25519 example, which uses a SHA2-512 hash, we arrive at the following performance ceilings on a Nehalem-era Intel CPU (a performance ceiling is the maximum rate at which we could produce a ZFS send stream if we assume signing is the dominant cost and all other code is negligible):

Assumed average record size	Estimated throughput ceiling
8KB	290 MB/s
32KB	479 MB/s
128KB	572 MB/s
16MB	612 MB/s
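
The shape of these ceilings follows from a simple cost model: each record pays a fixed signature cost plus a hashing cost proportional to its size. The sketch below illustrates that model only. The exact inputs behind the table are not given in this document, so the hash throughput used here is an assumed placeholder and the output will not match the table's figures exactly.

```python
def throughput_ceiling(record_bytes, sig_ops_per_sec, hash_bytes_per_sec):
    """Bytes per second when each record costs one signature plus its hashing."""
    seconds_per_record = 1.0 / sig_ops_per_sec + record_bytes / hash_bytes_per_sec
    return record_bytes / seconds_per_record


SIGN_RATE = 109_000  # Ed25519 signatures per second, the figure quoted earlier
HASH_RATE = 600e6    # ASSUMED SHA-512 throughput in bytes/s, not from this document

sizes = [8 * 1024, 32 * 1024, 128 * 1024, 16 * 1024 * 1024]
ceilings = [throughput_ceiling(s, SIGN_RATE, HASH_RATE) for s in sizes]

for size, rate in zip(sizes, ceilings):
    print(f"{size:>10} byte records: {rate / 1e6:.0f} MB/s")
```

As the record size grows, the fixed signature cost is amortised and the ceiling approaches the hash throughput, which is the same trend the table shows.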

For extremely large amounts of data on systems with very high disk bandwidth (and network bandwidth, if sending across a network), signing could plausibly become the limiting factor in ZFS send performance. However, for more ordinary, smaller systems it is likely that disk read bandwidth will still dominate.

One option to amend this proposal, if this performance is found to be a problem, is to use a faster hash algorithm on the actual data and sign only the concatenated hashes of all the records in the stream so far.

Approaches based around making each signature cover more data (e.g. by only putting drr_signature in every Nth record) will, however, not improve performance significantly except with very small records, as the performance ceiling is otherwise always dominated by the cost of hashing the data and not by the signature calculation itself.
