Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZFS Encryption #5769

Merged
merged 1 commit into from
Aug 14, 2017
Merged

ZFS Encryption #5769

merged 1 commit into from
Aug 14, 2017

Conversation

tcaputi
Copy link
Contributor

@tcaputi tcaputi commented Feb 9, 2017

Native encryption in zfsonlinux (See issue #494)

The change incorporates 2 major pieces:

The first feature is a keystore that manages wrapping and encryption keys for encrypted datasets. The commands are similar to that of Solaris but with a few key enhancements to make it more predictable, more consistent, and require less manual maintenance. It is fully integrated with the existing zfs create functions and zfs clone functions. It also exposes a new set of commands via zfs key for managing the keystore. For more info on the issues with the Solaris implementation see my comments here and here. The keystore operates on a few rules.

  • All wrapping keys are 32 bytes (256 bits), even for 128 and 192 bit encryption types.
  • Encryption must be specified at dataset creation time.
  • Specifying a keylocation or keyformat while creating a dataset causes the dataset to become the root of an encryption tree.
  • All members of an encryption tree share the same wrapping key.
  • Each dataset can have up to 1 keychain (if it is encrypted) that is not shared with anybody.

The second feature is the actual data and metadata encryption. All user data in an encrypted dataset is stored encrypted on-disk. User-provided metadata is also encrypted, but metadata structures have been left plain so that scrubbing and resilvering still works without the keys loaded. The design was originallly inspired by this article but has been changed fairly significantly since.

Implementation details that should be looked at

  • Encrypting data going to disk requires creating a key_mapping_t during dsl_dataset_tryown(). I added a flag to this function for code that wishes to own the dataset, but that does not require encrypted data, such as the scrub functions. I did my best to confirm that all owners set this flag correctly, but someone should confirm them, just to be sure.
  • zfs send and zfs recv do not currently do anything special with regards to encryption. The format of the send file has not changed and zfs send requires the keys to be loaded in order to work. At some point there should probably be a way to do raw sends.
  • I altered the prototype of lzc_create() and lzc_clone() to support hidden arguments. I understand that the purpose of libzfs_core is to have a stable api interacting with the ZFS ioctls. However, these functions need to accept wrapping keys separately from the rest of their parameters because they need to use the (new) hidden_args framework to support hiding arguments from the logs. Without this, the wrapping keys would get printed to the zpool history.

EDIT 5/4/16: Updated to reflect the current state of the PR
EDIT 1/3/17: Updated to reflect the current state of the PR
EDIT 2/9/17: reopened under a new PR due to long Github load times (previous PR was #4329)

@mention-bot
Copy link

@tcaputi, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @grwilson and @ahrens to be potential reviewers.

@tcaputi tcaputi mentioned this pull request Feb 9, 2017
@tcaputi
Copy link
Contributor Author

tcaputi commented Feb 9, 2017

Several updates have been made to the code since it was moved (thanks to @ahrens for the review thus far). These are the highlights:

  • the keysource property has been split into keylocation and keyformat
  • the keylocation property can now be set with zfs set (with some restrictions)
  • the zfs key sub-command has been split into zfs load-key, zfs unload-key and zfs change-key.
  • zfs load-key and zfs unload-key now support -r and -a for recursive key loading / unloading
  • zfs load-key now supports -n (noop) for checking that a key is correct without loading it
  • zfs load-key now supports -L (location) for loading a key from an alternate keylocation
  • zfs change-key now supports -i (inherit) for making a dataset inherit its parent's keys
  • the default number of pbkdf2 iterations has been bumped up to 350000 (minimum is now 100000)
  • the man pages have been updated and some clarifications have been made
  • the encryption feature now uses the ZFEATURE_FLAG_PER_DATASET framework for refcounting
  • various smaller code cleanups and fixes

These changes do constitute an on-disk format change, so anyone updating from an earlier version of the PR will have to recreate their pool.

@tcaputi tcaputi force-pushed the master branch 2 times, most recently from 6cf566f to ae18e6d Compare February 9, 2017 23:36
@grahamperrin
Copy link
Contributor

grahamperrin commented Feb 10, 2017

Thanks to everyone so far 👍


… User-provided metadata is also encrypted, but metadata structures have been left plain …

For clarity (for any newcomer to this PR), from https://github.com/tcaputi/zfs/blob/ae18e6d0a8c1b32991094594d1b751646bc5b4ca/man/man8/zfs.8 with added emphasis:

will encrypt all user data including file and zvol data, file attributes, ACLs, permission bits, directory listings, and FUID mappings.

will not encrypt metadata related to the pool structure, including dataset names, dataset hierarchy, file size, file holes, and dedup tables. …

(The first – a definition of user data – includes some things that might otherwise be viewed as metadata.)


Also from the opening of this PR:

Specifying a keysource …

– could be updated to reflect the split that is highlighted in the first (human) comment.

Copy link
Contributor

@grahamperrin grahamperrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation: first pass …

accessing the data allow for it. Deduplication with encryption will leak
information about which blocks are equivalent in a dataset and will incur an
extra CPU cost per block written.
This feature enables the creation and management of natively encrypted datasets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As details of the feature are more in the manual page for zfs(8) than in the page for zpool(8), so the SEE ALSO part of this zpool-features(5) page should be expanded to include zfs(8).

man/man8/zfs.8 Outdated
Selecting \fBencryption\fR=\fBon\fR when creating a dataset indicates that the
default encryption suite will be selected, which is currently \fBaes-256-ccm\fR.
In order to provide consistent data protection, encryption must be specified at
dataset creation time and it cannot be changed afterwards.
.sp
For more details and caveats about encryption see \fBzpool-features\fR.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SEE ALSO part of this zfs(8) manual page should be expanded to include zpool-features(5).

accessing the data allow for it. Deduplication with encryption will leak
information about which blocks are equivalent in a dataset and will incur an
extra CPU cost per block written.
This feature enables the creation and management of natively encrypted datasets.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basics.

Can creation of a pool comprise a single encrypted dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This functions the same way as any other zfs property (compression, deduplication, etc)

@tcaputi tcaputi force-pushed the master branch 2 times, most recently from c4daa03 to 8594e08 Compare February 11, 2017 10:19
@niieani
Copy link

niieani commented Feb 11, 2017

Fantastic work 👍 . I'm curious, how far would you say are you from publishing the changes from this PR? A rough estimate would do, say: weeks, months, Q2, Q3?

@tcaputi
Copy link
Contributor Author

tcaputi commented Feb 11, 2017

@niieani It's tough to say. That largely depends on how the review goes. At this point @ahrens has reviewed the DSL changes and I've made adjustments (which still need a bit of stabilizing) to the code to account for his notes, but that is all that has happened so far.

@@ -245,26 +247,35 @@ dsl_dataset_phys(dsl_dataset_t *ds)
#define DS_UNIQUE_IS_ACCURATE(ds) \
((dsl_dataset_phys(ds)->ds_flags & DS_FLAG_UNIQUE_ACCURATE) != 0)

/* flags for holding the dataset */
#define DS_HOLD_FLAG_DECRYPT (1 << 0) /* needs access encrypted data */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use an enum for this, so that it's more clear what values the function argument can have. (Unfortunately there's a bad example set by some other flags - cleanup has been on my to-do list for a while now.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know. will fix

if (dd == NULL) {
dsl_dir_t *winner;

dd = kmem_zalloc(sizeof (dsl_dir_t), KM_SLEEP);
dd->dd_object = ddobj;
dd->dd_dbuf = dbuf;
dd->dd_pool = dp;

if (doi.doi_type == DMU_OTN_ZAP_METADATA &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use dsl_dir_is_zapified(), instead. (at a cost of maybe doing dmu_object_info twice - shouldn't be a big deal)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix


/*
* encrypted datasets can only be moved if they are
* an encryption root (locally set keyformat).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we allow that? Seems like a common use case would be "encrypt all datasets below this the same way". "whoops, named a child dataset wrong, rename it". As long as the new location is under the same encryption root, I think we can allow the rename. Would that be hard to implement?

I thought there was a restriction that an unencrypted dataset couldn't be under an encrypted dataset. Doesn't that need to be enforced here?

I didn't see either of these restrictions in the manpages. Seems like they should at least be in the zfs rename section, and maybe also the Encryption section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we allow that? Seems like a common use case would be "encrypt all datasets below this the same way". "whoops, named a child dataset wrong, rename it"

To clarify, renaming an encryption child is ok. Moving to a new parent is not currently.

As long as the new location is under the same encryption root, I think we can allow the rename. Would that be hard to implement?

I think I can implement it like that. Shouldn't be a big deal.

I thought there was a restriction that an unencrypted dataset couldn't be under an encrypted dataset. Doesn't that need to be enforced here?

Whoops. It looks like I must've rebased that away at some point. I will put it back.

I didn't see either of these restrictions in the manpages. Seems like they should at least be in the zfs rename section, and maybe also the Encryption section.

I will make sure this is documented in both places (I think there is a sentence about it in the encryption section).

man/man8/zfs.8 Outdated
locally, the dataset is an encryption root. Encryption roots share their keys
with all datasets that inherit this property. This means that when a key is
loaded for the encryption root, all children that inherit the \fBkeylocation\fR
property are automatically loaded as well.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about the key for all children ... is automatically loaded as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

man/man8/zfs.8 Outdated
.sp
Even though the encryption suite cannot be changed after dataset creation, the
keylocation can be with \fBzfs change-key\fR. If the dataset is an encryption
root this property may also be changed with \fBzfs set\fR. If \fBprompt\fR is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not an encryption root, can the key be changed with zfs change-key? The current wording implies it can, by calling out the restriction only for zfs set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. non-encryption roots can use zfs change-key, which will make them an encryption root. I will add a sentence to the man page to reflect this.

if (ret != 0)
goto error;

return (0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this (success case) would work right if it fell through to the error label.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change it. I personally like separate error labels that don't share code.

module/zfs/zil.c Outdated
* the keys may not be loaded.
*/
if (wbuf == NULL && BP_IS_ENCRYPTED(bp))
zio_flags |= ZIO_FLAG_RAW;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we change this to pass RAW even if not encrypted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why not. I just didn't want to impact existing functionality. I'll run some tests to make sure that works and make the change.

module/zfs/zio.c Outdated
BP_SET_ENCRYPTED(bp, B_TRUE);
}

/* Perform the encryption. This should not fail */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a more precise word than "should"? You're gracefully handling failure just below, so presumably it does fail in some cases... Maybe This will fail for block types that are partially encrypted (dnode, zil) if no encryption is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change the "should". I am not gracefully handling errors. The only acceptable return codes here are 0 and ZIO_NO_ENCRYPTION_NEEDED (see the assertion below).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I wasn't sure if ZIO_NO_ENCRYPTION_NEEDED was considered an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, its a special return code to indicate that the block didn't need to be encrypted for some reason (for instance a dnode block with no encryptable bonus buffers).

* evenly across the bits of the checksum. Therefore,
* when truncating a weak checksum we XOR the first
* 2 words with the last 2 so that we don't "lose" any
* entropy unnecessarily.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any downside to doing this for all checksum types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of any, but I know all secure checksums support being truncated without issue whereas there may be weaknesses for XOR folding.

uint32_t *iv2 = (uint32_t *)(iv + sizeof (uint64_t));

ASSERT(BP_IS_ENCRYPTED(bp));
bp->blk_dva[2].dva_word[0] = LE_64(*((uint64_t *)salt));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This trick doesn't work on processors which don't support unaligned memory access. You'll need to do it byte at a time (if it's unaligned, at least - but if you implement it both ways, make sure we can test hitting both code paths).

Same goes for the following functions.

See zap_leaf_array_read() for an example (the la_array is not aligned).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix.

@sempervictus
Copy link
Contributor

Thanks for updating the PR, will build out a new test mule this week.
I do have a potential concern/question however regarding operating logic in terms of resource consumption:
If users are allowed to attempt their own key loads, and use high iteration PBKDF derivations, whats to prevent them from exhausting resources on the host system by just pummeling it with passphrase load attempts until its blue in the face? Do the internal implementation or interface provide any load request/expensive operation throttling capability?

@tcaputi
Copy link
Contributor Author

tcaputi commented Feb 13, 2017

If users are allowed to attempt their own key loads, and use high iteration PBKDF derivations, whats to prevent them from exhausting resources on the host system by just pummeling it with passphrase load attempts until its blue in the face? Do the internal implementation or interface provide any load request/expensive operation throttling capability?

There are no restrictions, but I would argue that there is also nothing that would prevent them from running a while(1); program as well. The pbkdf2 iterations are done in userspace so they are killable.

@tcaputi tcaputi force-pushed the master branch 4 times, most recently from afee592 to 2284451 Compare February 19, 2017 16:24
@tcaputi tcaputi force-pushed the master branch 2 times, most recently from 84eeb12 to 5b21838 Compare February 23, 2017 00:17
module/zfs/arc.c Outdated
*
* The L1ARC has a slightly different system for storing encrypted data.
* Raw (encrypted + possibly compressed) data has a few subtle differences from
* data that is just compressed. The biggest difference is that it is not always
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can delete always, because it is not possible to decrypt ... if keys aren't loaded (right?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

* may have both an encrypted version and a decrypted version of its data at
* once. When a caller needs a raw arc_buf_t, it is allocated and the data is
* copied out of this header. To avoid complications with b_pabd, raw buffers
* cannot be shared.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where is the best place to explain this in the code, but at least for my own edification:

When would we have both encrypted and unencrypted data in the ARC? If we scrub when the keys are not loaded, we'll have encrypted dnode blocks, I think? And once we have encrypted send, that could cache encrypted user data in the ARC? And if we later load the keys and do an unencrypted read, it would decrypt from b_rabd to b_abd.

Once we have both b_rabd and b_abd, do we ever evict one of them, to reduce memory usage? Or are we stuck with 2x memory usage until we happen to evict the whole hdr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would we have both encrypted and unencrypted data in the ARC? If we scrub when the keys are not loaded, we'll have encrypted dnode blocks, I think? And once we have encrypted send, that could cache encrypted user data in the ARC? And if we later load the keys and do an unencrypted read, it would decrypt from b_rabd to b_abd.

All correct.

Once we have both b_rabd and b_abd, do we ever evict one of them, to reduce memory usage? Or are we stuck with 2x memory usage until we happen to evict the whole hdr?

If we have both we evict the encrypted one as soon as all encrypted buffers are destroyed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where is the best place to explain this in the code...

There is actually already a comment for this in arc_buf_destroy_impl() (where the header freeing takes place).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK cool, I haven't got to reading that part of the code yet, it's a lot to fit in my brain at once :-)

module/zfs/arc.c Outdated
@@ -817,6 +835,8 @@ static arc_state_t *arc_l2c_only;
#define arc_need_free ARCSTAT(arcstat_need_free) /* bytes to be freed */
#define arc_sys_free ARCSTAT(arcstat_sys_free) /* target system free bytes */

/* encrypted + compressed size of entire arc */
#define arc_raw_size ARCSTAT(arcstat_raw_size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encryption doesn't change the size of the data, so how is this different from arc_compressed_size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this just gives an extra stat for how much raw encrypted data is stored in the ARC. I thought it would be useful seeing as how this memory operates separately from compressed memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's the sum of the size of the b_rabds? Could you update the comment to say that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

@@ -112,20 +123,22 @@ typedef enum arc_flags
ARC_FLAG_L2_WRITING = 1 << 11, /* write in progress */
ARC_FLAG_L2_EVICTED = 1 << 12, /* evicted during I/O */
ARC_FLAG_L2_WRITE_HEAD = 1 << 13, /* head of write list */
/* encrypted on disk (may or may not be encrypted in memory) */
ARC_FLAG_ENCRYPT = 1 << 14,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about ARC_FLAG_ENCRYPTED (and HDR_ENCRYPTED()), since this is not being used as a verb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. will fix.

/* encryption parameters */
uint8_t b_salt[DATA_SALT_LEN];
uint8_t b_iv[DATA_IV_LEN];
uint8_t b_mac[DATA_MAC_LEN];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b_mac seems different from the other encryption parameters, because it's used not when writing to L2ARC, but rather when decrypting an encrypted hdr. In that case I think the caller could pass in the BP and we could get the mac there. It might be worth explaining that we keep the mac here for convenience in this sake, so the caller doesn't have to pass in the BP again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point. it looks like the only place this is really used is in arc_hdr_decrypt(). I will add that comment. I could probably also assert that the mac in l2arc_apply_transforms() matches the one in the header.

* pool. When this is the case, we must first compress it if it is
* compressed on the main pool before we can validate the checksum.
*/
if (!HDR_COMPRESSION_ENABLED(hdr) && compress != ZIO_COMPRESS_OFF) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this code moved somewhere else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, see comment below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which comment specifically?

I think this code is now assuming that we write blocks to the L2ARC exactly as they are in the main pool, including compressed, unencrypted blocks, even if the b_pabd is uncompressed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I'm sorry I thought this was a different piece of code. You are correct, this change allows the L2ARC to write out data as it is in the main pool, removing the need for this step.

@@ -1652,15 +1701,14 @@ arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
*/
if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_OFF);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an intentional change to now set the hdr's compression algorithm even if it is not compressed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Previously, disabled arc compression worked by simply not storing the compression at all. This change allows us to keep track of the compression algorithm for raw blocks even if arc compression is disabled.

module/zfs/arc.c Outdated
* This function will take a header that only has raw encrypted data in
* b_crypt_hdr.b_rabd and decrypts it into a new buffer which is stored in
* b_l1hdr.b_pabd. If designated on the header, this function will also
* decompress the data as well.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove as well (already has also)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

module/zfs/arc.c Outdated
}

if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&
!HDR_COMPRESSION_ENABLED(hdr)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use a comment explaining this case. I think it's saying that the data on disk is compressed, and therefore the raw data in b_rabd is compressed, but we don't want this hdr to be stored compressed (i.e. b_pabd should not be compressed), so we need to decompress it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the correct interpretation. I will add a comment.

tmp = abd_borrow_buf(cabd, arc_hdr_size(hdr));

ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),
hdr->b_l1hdr.b_pabd, tmp, HDR_GET_PSIZE(hdr),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's possible this is used in a performance-sensitive code path, it would be good to avoid the copy from b_pabd to linearize it for decompression. This could be avoided by allocating b_pabd as linear, before decrypting into it. (And then replacing b_pabd with the decompressed scatter ABD, as you're already doing here.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only issue with that is that then we will not be honoring zfs_abd_scatter_enabled. I do not think this is an incredibly performance-sensitive path either, since this will realistically only be run in 2 circumstances: decrypting dnode blocks and reading data that is already in the ARC due to a scrub.

Let me know what you think and I can change it if you still feel it is appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could still honor zfs_abd_scatter_enabled because the data stored in b_pabd would be scatter (or not). It's only the intermediary, short-lived buffer (decrypted but not yet uncompressed) that would be linear.

But it does seem like not a super common code path so I could go either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean now... This might be a bit messy to do since arc_hdr_alloc_abd() cannot currently decide how it would like its abd to look. Unfortunately the code for that is a few functions up the stack. I'm inclined to leave it for now unless it becomes a problem.

@lundman
Copy link
Contributor

lundman commented Mar 4, 2017

Oh I should mention, in the IllumOS patches, before I included your commits yesterday, when we did a zfs send -c | zfs recv -vu we would trigger:

Mar  1 19:46:39 omnios ^Mpanic[cpu1]/thread=ffffff0261a72800:
Mar  1 19:46:39 omnios genunix: [ID 403854 kern.notice] assertion failed: (encrypted) implies (compressed), file: ../../common/fs/zfs/arc.c, line: 2675
Mar  1 19:46:39 omnios unix: [ID 100000 kern.notice]
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d675a0 genunix:process_type+17f230 ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67630 zfs:arc_buf_alloc_impl+2ea ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d676c0 zfs:arc_alloc_compressed_buf+eb ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67700 zfs:arc_loan_compressed_buf+38 ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d677a0 zfs:receive_read_record+e0 ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67990 zfs:dmu_recv_stream+399 ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67bd0 zfs:zfs_ioc_recv+20e ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67c70 zfs:zfsdev_ioctl+50f ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67cb0 genunix:cdev_ioctl+39 ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67d00 specfs:spec_ioctl+60 ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67d90 genunix:fop_ioctl+55 ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67eb0 genunix:ioctl+9b ()
Mar  1 19:46:39 omnios genunix: [ID 655072 kern.notice] ffffff0008d67f00 unix:brand_sys_sysenter+2b7 ()

But it could be an old bug now, but if it is trivial for you to set lz4, and test zfs send -c to rule it out...

@rottegift
Copy link

Further to @lundman 's comment, we have openzfsonosx/zfs#557

@tcaputi
Copy link
Contributor Author

tcaputi commented Mar 4, 2017

@lundman I dont know what version you were on before, but the problem seems to be fixed now.

@lundman
Copy link
Contributor

lundman commented Mar 4, 2017

Woo \o/

@tcaputi
Copy link
Contributor Author

tcaputi commented Mar 28, 2018

@sempervictus
ZFS currently has a generic implementation, an accelerated x86-64 implementation, and an even further accelerated AESNI implementation. Normally even AMD chips would use the x86-64 accelerated version of the code (which still seems to be compatible). It is only the generic implementation that is causing issues which should only matter for other architectures and (as I believe @teknoman117 is doing here) for testing purposes.

@tcaputi
Copy link
Contributor Author

tcaputi commented Mar 29, 2018

@teknoman117 The issue does not seem to reproduce on Illumos, so we must have changed something at some point. I hope that I can get to the bottom of it tomorrow.

@tcaputi
Copy link
Contributor Author

tcaputi commented Mar 30, 2018

@teknoman117 I figured it out. The problem is actually just with your patch. You need this in the patch as well:

@@ -109,12 +111,14 @@ static int intel_aes_instructions_present(void);
 #define	rijndael_key_setup_enc_raw	rijndael_key_setup_enc
 #endif	/* __amd64 */
 
-#if defined(_LITTLE_ENDIAN) && !defined(__amd64)
+//#if defined(_LITTLE_ENDIAN) && !defined(__amd64)
+#if 1
 #define	AES_BYTE_SWAP
 #endif
 

Let me know if I'm missing something or if you need anything else.

@teknoman117
Copy link
Contributor

teknoman117 commented Apr 1, 2018

@tcaputi

My assumption to leave that guard in was based on this comment in aes_impl.c
/* * For _LITTLE_ENDIAN machines (except AMD64), reverse every * 4 bytes in the key. On _BIG_ENDIAN and AMD64, copy the key * without reversing bytes. * For AMD64, do not byte swap for aes_setupkeys(). * * SPARCv8/v9 uses a key schedule array with 64-bit elements. * X86/AMD64 uses a key schedule array with 32-bit elements. */

That being said, this does indeed seem to let the generic c implementation load a dataset encrypted using the the aesni or x86_64 optimized method. However, removing the guard completely breaks the accelerated methods. I suppose this means that for x86_64 the byte swap only is only needed for the generic implementation.

@jbrodriguez
Copy link

My question is .. why is this not stable yet ? I need to run zfs-linux-git on Arch Linux ... sort of unofficially .. is it testing what's keeping this from stable ?

@grantwwu
Copy link

grantwwu commented Apr 1, 2018

@jbrodriguez It's not stable yet because it's a Hard Problem, and because the developers have run into some issues, most notably #6845.

I don't think you really ought to complain unless you're paying the OpenZFS developers... and if you are you probably have internal channels for that :)

@jbrodriguez
Copy link

Thanks, I don't know how much you officially speak for @grantwwu, I was just asking a question. Nevertheless, thanks for pointing out the issue, I'll subscribe to it.

@tcaputi
Copy link
Contributor Author

tcaputi commented Apr 1, 2018

@teknoman117

I suppose this means that for x86_64 the byte swap only is only needed for the generic implementation.

Yeah. I think that comment was meant to be taken in the context of the entire file. In other words, I think the #define was written assuming that the accelerated implementations wouldn't be forcefully removed from under it.

@jbrodriguez

My question is .. why is this not stable yet ? I need to run zfs-linux-git on Arch Linux ... sort of unofficially .. is it testing what's keeping this from stable ?

The 3 big reasons:

  1. As @grantwwu alluded, this patch was 26000 lines of code (~15000 of which were completely new and not ported from elsewhere) and required modifications to virtually all parts of ZFS (except possibly the vdev and metaslab layers). We have hit issues with this patch since it was merged, most notably problems with raw sends, which is to be expected for a patch this large. One of these problems even required an on-disk format update to fix. That said, I have not heard of any new problems in the past 3 weeks or so, so I'm hoping that's a good sign.
  2. We haven't released a major version of ZFS (0.8.0) since encryption was merged. That is still pending on a few other features which are slated to be released in the new major version. ZFS encryption will be included in the 0.8.0 release.
  3. I have another patch coming (hopefully as a PR early next week if I can resolve the one last issue) which will add support for zfs recv -o / -x with encryption properties, which is sorely needed to allow people to convert their existing pools easily. Right now you can't do a zfs send -R | zfs recv to convert your pool to encrypted because zfs send -R includes all properties and one of them happens to be encryption=off. Since you can't tell zfs to ignore encryption properties at the moment, you can't use this to transition your datasets.

@jbrodriguez
Copy link

jbrodriguez commented Apr 1, 2018

Thanks @tcaputi, the major version comment is spot on.

I'm looking to start from scratch, so I may have overlooked some considerations.

Can't praise enough the work you've done here, kudos.

FransUrbo pushed a commit to FransUrbo/zfs that referenced this pull request Apr 28, 2019
This change incorporates three major pieces:

The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.

The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.

The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes openzfs#494 
Closes openzfs#5769
@taoeffect
Copy link

This looks awesome, congrats! Is there a guide / tutorial anywhere on how to set this up?

Can be integrated with an existing LUKS encrypted partition setup? (How?) Thanks for helping out a newb!

@sempervictus
Copy link
Contributor

Manpages show usage, luks is just the header for most dm-crypt setups, and dm-crypt CAN be used under it to protect its unencrypted metadata at the overhead of a whole encrypt/decrypt round under the zfs impl.
@tcaputi: am I correct in my recollection that dm-crypt is using coproc/simd accelerated crypto implementations in kernel while zfs does not?

@taoeffect
Copy link

taoeffect commented May 30, 2019

Which manpage, for LUKS or ZFS? (And if ZFS which command?).

I might not have been super clear. It's my understanding that "ZFS on root" (i.e. ZFS managing the entire drive) doesn't work well on Linux, so what people do is they create a ZFS partition.

That's what this guide does and that's what I've been following.

In that guide a partition is created for /boot, the main OS under / (LUKS encrypted), and then a third partition is created for ZFS to manage, which is also LUKS encrypted. The key for the third partition is stored on the main / partition.

But now that zfsonlinux has encryption, what's the best thing to do, and how?

@benley
Copy link

benley commented May 30, 2019

I'm not sure if this is the best thing to do, but on my laptop I've been using a small fat32 partition for /boot and the UEFI ESP, and / is an encrypted ZFS dataset without any LUKS header.

@rlaager
Copy link
Member

rlaager commented May 30, 2019

I have an experimental version of the HOWTO for Debian that builds 0.8.0 from experimental:
https://github.com/zfsonlinux/zfs/wiki/Debian-Buster-Encrypted-Root-on-ZFS

I haven't updated that since the rc4 package. The final 0.8.0 will likely be in experimental soon, at which point I'll update that again. I'm not sure how much I want to publicize this, though, as a hand-built backport isn't necessarily the best idea.

@sempervictus
Copy link
Contributor

LUKS is a Linux component, zfs is not, they share zero code, and one is a header for full block encryption of a device while the other is data encryption inside a filesystem. They're not related by any direct linkage.
Zfs on root works fine, on the 0.7.x branches anyway and 0.6.x when we used those. That goes for with or without dm-crypt, the zfs crypto bits won't play with grub and I am not aware of efi bootloaders supporting it either. Though an efi-loaded kernel with built-in zfs having crypto might be worth trying.
The efi partition is always fat, nothing we can do about that unfortunately. Lowest common denominators or something

@taoeffect
Copy link

@rlaager wrote:

I have an experimental version of the HOWTO for Debian that builds 0.8.0 from experimental:
https://github.com/zfsonlinux/zfs/wiki/Debian-Buster-Encrypted-Root-on-ZFS

Wow, thanks for that. It seems like it could be useful as a reference, but also seems quite complicated.

I'm wondering if there's a simpler way? Currently exploring the following approach (with Fedora) for a VPS setup:

  • Create two partitions (/boot and a 15GB /) as XFS, install via GUI installer
  • Boot the machine and install ZFS
  • Create a 3rd partition in remaining space and tell ZFS to manage it
  • Use rsync to copy all data in / to the new ZFS partition
  • Somehow make the new ZFS partition the new root partition and boot from it
  • Format the old root partition and add its device to the root pool to use up all the space available

Is that doable? Would it be simpler? (anyone know of any guides for that?)

@taoeffect
Copy link

taoeffect commented May 31, 2019

I have an experimental version of the HOWTO for Debian that builds 0.8.0 from experimental: https://github.com/zfsonlinux/zfs/wiki/Debian-Buster-Encrypted-Root-on-ZFS

I should have mentioned in my previous comments, one of the reasons I can't follow that approach is that, as mentioned in that link, you don't always have the required RAM available to use a Live CD (as with a VPS), but also my web host does not offer a Fedora Live CD, so even if I had the RAM, I'd be forced to use the "anaconda" troublshooting environment which doesn't even have a working package manager and is a nightmare to use.

@tcaputi
Copy link
Contributor Author

tcaputi commented May 31, 2019

@tcaputi: am I correct in my recollection that dm-crypt is using coproc/simd accelerated crypto implementations in kernel while zfs does not?

We both use accelerated encryption implementations, but due to licensing restructions they are separate implementations.

@sempervictus
Copy link
Contributor

Is this still true now that SIMD support is cut upstream due to GPL-only export of FPU functions? 4.14+ have that commit applied nowadays since it got back ported to 4.14.120

@teknoman117
Copy link
Contributor

teknoman117 commented Jun 1, 2019

@taoeffect I run ZFS as the root file system on my Linux machine. There's a small fat32 partition at the beginning which uses an efistub kernel and the zfs "pool" is the second partition which contains datasets for /, /home, and some others.

@tcaputi
Copy link
Contributor Author

tcaputi commented Jun 1, 2019

@sempervictus

Is this still true now that SIMD support is cut upstream due to GPL-only export of FPU functions? 4.14+ have that commit applied nowadays since it got back ported to 4.14.120

If you have that commit then yes SIMD support is disabled. We aer still working on figuring out a workaround there.

@sempervictus
Copy link
Contributor

We build kernels in-house with zfs as a built-in to the main kernel image (avoids certain instrumentation passes for modules at runtime), which permits us the opportunity to simply revert that commit. Most of our clients cannot do this, and folks are talking about moving to 0.8 this year already due to crypto being available. Im a bit worried that more will stop using zfs for being too slow - we've had a few drop in the last year or so because modern storage hw is much faster than the zfs io pipeline, kpti smashed even more bs into the execution of the "slow" fs code paths, and with no SIMD this is just going to get worse and worse.
It looks to me from the exchange @kpande had with Greg that the Linux team is actively trying to cripple us in order to favor filesystems under their preferred license model. SPL wrappers to re-expose FPU functions seem to me the most reasonable way to do this. What's Linus going to do, sue for infringement? VMware is full of such things and worse, and so far, they're winning their case

@pashford
Copy link
Contributor

pashford commented Jun 1, 2019

@sempervictus,

the Linux team is actively trying to cripple us in order to favor filesystems under their preferred license model

Based on what little I saw, I must, unfortunately. agree. You wouldn't happen to have a link to the issue in question, would you?

At this time, it looks to me like the Linux Kernel project is effectively controlled by zealots. I see this as the first step in a "whack-a-mole" process (with aggressive back-porting) to cripple ZoL in favor of the broken BTRFS. As with any religious war, the only thing that matters is the religion; facts are irrelevant. I don't believe this is a war that ZoL can win.

With that in mind, I'm in the process of creating a message to all my ZoL customers, suggesting that they consider migrating their file servers from Linux/ZoL to FreeBSD/OpenZFS.

@jbrodriguez
Copy link

If need be, as the five and dime shop we are, we'll figure out how to push our own kernel build (+patches) into our archlinux setup.

We used to use Oracle Solaris 11 11/11 for its ZFS encryption capabilities.

Switched to dm-crypt and jumped in blindfolded into linux zfs encryption as soon as tom caputi and team made it available.

We're not going back. Only forward.

@taoeffect
Copy link

@teknoman117 wrote:

@taoeffect I run ZFS as the root file system on my Linux machine. There's a small fat32 partition at the beginning which uses an efistub kernel and the zfs "pool" is the second partition which contains datasets for /, /home, and some others.

You and @benley both seem to be doing this. However, every tutorial I have seen for ZFS on root is insanely complicated, and also does not seem to always work on VPS setups with only 2GB of RAM because of the LiveCD requirements.

I wish deeply that ZFS were the default on Linux, and that it were simple to setup. That unfortunately is not the reality.

Also, I have not seen any comparisons between the latest ZFS encryption stuff and ZFS on LUKS. It would be useful to know what the tradeoffs are in terms of both ease-of-use, security, and performance.

@tcaputi
Copy link
Contributor Author

tcaputi commented Jun 2, 2019

With that in mind, I'm in the process of creating a message to all my ZoL customers, suggesting that they consider migrating their file servers from Linux/ZoL to FreeBSD/OpenZFS.

I'm not really one to get into the politics of this whole situation, but I will say that despite the statement from some of the Linux maintainers, nothing has really changed in this regard. The Linux kernel has always had a set of functions and symbols that it exports in one release and stops exporting in the next and its internal APIs are always changing. As a "third party" module, we have always had to adapt our software to these changes and we must always write code that can support past, present, and future versions as best as possible. Every time a new kernel comes out, our testing builders fail and within a few days / weeks someone takes a look and fixes any new incompatibilities that have arisen. This particular issue is a bit trickier to work around, but I don't believe we are fundamentally incapable of dealing with it as we have in the past.

@sempervictus
Copy link
Contributor

@pashford: I wouldn't send that email out yet, likely not to work. ZoL has the accounting feature which, iirc, prevents use on non-Linux systems. This is why we started making all new pools with that feature disabled, then patched zfs to make disabled the default.

@taoeffect: livecd is squashfs mapped to ram, not much to do with zfs. Far as dm-crypt vs ANY filesystem crypto, you probably want to get a firm understanding of what block data and file data are relative to each other, because you keep asking questions indicative of a thought process which does not account for the differences in the logical tiers of storage required to provide a user with access to a file (encrypted or otherwise). Far as performance goes, see above, Linux is actively trying to degrade our performance for by making the relevant interface GPL-only. So performance where? On 4.14.119 or today? What are you trying to use zfs for from a business-logic perspective, and why not pay the hosting company for the proper amount of resources?

@taoeffect
Copy link

Per @kpande's request, I've opened up two discussion threads on the mailing list:

(BTW, this is amazing mailing list software! Never seen it before!)

But to quickly answer your questions @sempervictus:

So performance where? On 4.14.119 or today?

I mean Linux in general. Whatever the latest kernel is. I doubt people will stay on old kernels for long due to the constant security problems that are being created, found, fixed, and re-created.

What are you trying to use zfs for from a business-logic perspective, and why not pay the hosting company for the proper amount of resources?

A 2GB VPS seems like a reasonable target for ZFS, no?

Feel free to post any replies via the mailing list!

@sempervictus
Copy link
Contributor

So far as 2G goes, really depends on the workload. ARC usually takes half your ram, so if the kernel and userspace are under a gig, you're OK till you run software. If you have a large pool with a bunch of dedup blocks, your DDTs will take up even more. Swap and ARC/DDTs don't mix well...
For your benchmark question, the test setup would be ZFS atop a dm-crypt volume vs zfs on disk using its own crypto. On the newest kernel, zfs doesn't get SIMD, so even if the algos are the same, it'll be slower. They're not as Tom said, ZFS can't touch Linux crypto for license reasons and it wouldn't be portable. So if you're looking for performance, and have a decent chip backing the vps, the native Linux crypto will run faster pound for pound as its using the "fastpath" of SIMD instructions. It also protects all of the metadata, so more coverage, but less flexibility and granularity.
Lastly, this entire FPU thing is casting a shadow of instability on ZoL again as they may GPL-only anything and break ZoL entirely at their whim. There's no guarantee that Linux will "permit" ZoLs continued existence. There's always grsec... :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature Feature request or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.