
Change defaults: Btrfs should use 'dup' metadata on encrypted devices #319

Closed
Forza-tng opened this issue Dec 5, 2020 · 10 comments
Labels
defaults Changes in default settings mkfs Changes in mkfs.btrfs

@Forza-tng (Contributor)

According to the man page (https://btrfs.wiki.kernel.org/index.php/Manpage/mkfs.btrfs#OPTIONS), the default for non-rotational devices is the single metadata profile.

I think that in cases where users run Btrfs on top of LUKS/dm-crypt, Btrfs should default to dup mode. One of the reasons to choose single is that modern SSD/flash media can deduplicate internally, which defeats the purpose of dup mode. With an encrypted layer in between, this is no longer possible, so we should use dup by default.
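As a concrete sketch of the setup being discussed (device and mapper names are illustrative; assumes cryptsetup and btrfs-progs are installed), explicitly requesting dup metadata overrides the single-profile default that mkfs.btrfs would otherwise pick for a non-rotational device:

```shell
# Hypothetical device names; adjust for your system.
# Open the LUKS container, then create btrfs on the decrypted mapping.
cryptsetup luksOpen /dev/sda2 cryptroot

# -m (--metadata) sets the metadata profile; request dup explicitly
# rather than relying on the rotational/non-rotational default.
mkfs.btrfs -m dup /dev/mapper/cryptroot
```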

The man pages (https://btrfs.wiki.kernel.org/index.php/Manpage/mkfs.btrfs#DUP_PROFILES_ON_A_SINGLE_DEVICE) should expand on the reasoning for choosing single or dup on encrypted devices to help users make more informed choices.

A lot of users run Btrfs on their laptops (perhaps more so now that Fedora 33 uses Btrfs by default), and using DUP could help save users' filesystems from failing on smaller errors.

SD cards, on the other hand, are another type of flash media where the added writes from dup are perhaps not so good. Then again, if the cards break, they often break so badly that dup would not help anyway? I have no evidence other than word of mouth here.

Anecdotal evidence from my own NAS box: it has a Samsung SSD 830 Series 256GB as its system disk. I had a corrupt block; luckily it was on my /boot partition, so I could easily restore a backup. Had I used dup, the filesystem might have survived better. I had earlier been convinced by arguments on the #btrfs IRC channel that dup wasn't the best choice.

@Forza-tng Forza-tng changed the title Changed defaults: Btrfs should use 'dup' metadata on encrypted devices Change defaults: Btrfs should use 'dup' metadata on encrypted devices Dec 5, 2020
@Zygo

Zygo commented Dec 5, 2020

There is quite a range of SSD firmware behavior out there. There's basically only one case where single metadata isn't a "please destroy my data" flag. While it's not a rare case (there are a lot of SSDs that work), it's still a good idea to bet against it in the absence of specific information about the SSD (there are also a lot of SSDs that don't work).

A brief summary of SSD behaviors

At the high end, high endurance SSDs do deduplication in firmware, and also compression, and a bag of proprietary firmware tricks to reduce the data size written to the flash media. The drives get their high endurance rating in part because they do deduplication in firmware, and they will underperform their ratings if the data is encrypted. The bag of tricks is advertised under various trademarked names, so users can pay extra to buy drives that have the features. The feature sets include data redundancy and error recovery provided by SSD firmware. The SSD firmware will dedupe unencrypted dup metadata, then use an ECC scheme which effectively reduplicates it. The drive will make every effort to self-repair medium errors, but when that fails, the drive will report error information (and usually flip the entire drive read-only, because there is no longer sufficient redundant media in the drive to correct any more failures). When these drives work, btrfs can use single metadata and even disable csums, as the drive will do these tasks adequately.

At the low end, low endurance SSDs are optimized for cost. Low-end SSDs often lack software and hardware for deduplication in firmware. At the lowest extreme, the most basic error detection capabilities in firmware are not working or not implemented. Drives in this category will report no errors and pass all SMART self-tests even as data read/write tests (e.g. badblocks -w) clearly indicate media failure. It falls to the filesystem to determine when flash cells start to go bad. These drives routinely fail data csum checks as they age, and they randomly become unmountable without dup metadata.
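The kind of media test alluded to above (badblocks -w) can be sketched as follows; the device name is a placeholder, and note that the write-mode test destroys all data on the device:

```shell
# DESTRUCTIVE: -w overwrites the entire device with test patterns.
# Only run on a disposable or empty drive.
badblocks -wsv /dev/sdX

# Non-destructive read-only scan:
badblocks -sv /dev/sdX

# SMART self-test; as noted above, low-end drives may pass this
# even while failing the actual read/write test.
smartctl -t long /dev/sdX
```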

Most consumer drives fall somewhere on a spectrum between these extremes. A lot of firmware tries to do smart things, but the implementation has bugs. Some combinations of firmware features can be very bad, e.g. deduping writes but not providing redundant media bits for ECC would make the drive strictly more vulnerable to media failure and data loss unless btrfs dup metadata was used with encryption to defeat the SSD firmware dedupe. We can hope that drive vendors will not choose stupid combinations of firmware features, but we cannot rely on it. Severe firmware bugs are not common but not rare either.

All SSDs are made of hardware, and all hardware can fail, so there's a risk of single-block data losses even on high-end drives. There's a marginal event probability sandwiched between "no failures at all" and "total drive failure" where btrfs dup metadata can still be useful on such drives. It will be a tiny portion of all failure cases--but not zero.

OEM drives tend to fall toward the low end of the spectrum, and that's the kind of drive most users will have in their single-drive machines. If you count drive models, most are good. If you count drives users use, most are bad. The good ones are expensive so they aren't popular. The bad ones get sold at a discount, so your laptop vendor probably bought one.

Metadata writes tend to be the biggest component of lifetime writes to a SSD, so there is some value in reducing them; however, only users who consume 50 to 100% of the SSD's actual lifetime writes need to be concerned by the write amplification of btrfs dup metadata. Most users will be far below 50% of the actual lifetime, or will write the drive to death and discover how many writes 100% of the actual lifetime was. SSD firmware often adds its own write multipliers that can be arbitrary and unpredictable and dependent on application behavior, and these will typically have far greater effect on SSD lifespan than btrfs dup metadata. It's more or less impossible to predict when a SSD will run out of lifetime writes to within a factor of two, so it's hard to justify wear reduction as a benefit.

If a user has a cloud of single-SSD machines and always writes them to death, and they want to reduce the rate at which they consume drives, then wear reduction is a benefit--but they probably also fall into case 1 below, since their drives are continually failing.

When does single metadata make sense?

There are only two cases where single metadata is a predictably good idea:

  1. where persistence is strictly less important than performance or endurance (e.g. /tmp, database temporary table storage, file caches, application-replicated data, or the cloud-of-single-SSD-machines example above). In this case we can expect the filesystem may be destroyed by a single-bit error in metadata, but we do not expect this event to inconvenience the user.
  2. where an encrypted filesystem is used on a single high-endurance SSD with confirmed working firmware. This is the only case where the drive firmware can provide persistence and performance equal to or better than the equivalent btrfs dup metadata feature.

Case 1 is always a user choice. Most filesystem users prioritize persistence over performance, especially in default configurations, but there are exceptions and btrfs provides for them (see also nobarrier).

Case 2 is an intersection of edge cases:

  • user values persistence over performance (otherwise, would trivially meet criteria for case 1)
  • user rationally trusts the SSD firmware to provide adequate protection against medium errors instead of btrfs dup metadata (by definition of case 2)
  • user is using encryption (otherwise, drive would optimize away dup metadata writes, making dup and single nearly equivalent, and eliminating the need to choose one over the other)
  • user is constrained to a single-drive configuration (otherwise, a rational user would add a second drive and use raid1 metadata for significantly higher reliability and performance with low marginal cost)

A user in case 2 is micro-optimizing for small marginal gains against an expensive hardware constraint (e.g. laptop or embedded/appliance device with only one drive bay) and they know that they are doing this (i.e. they have done the homework to vet the SSD firmware).
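For the "add a second drive" alternative in the last bullet, the conversion is straightforward (paths are illustrative; assumes the filesystem is mounted at /mnt):

```shell
# Add a second device to the existing filesystem.
btrfs device add /dev/sdb /mnt

# Convert metadata chunks to raid1, mirroring across both devices.
btrfs balance start -mconvert=raid1 /mnt
```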

If the user has not vetted the drive firmware, then (as of December 2020) we cannot reasonably expect a random drive model to have working ECC or media redundancy, and the filesystem should expect bit- and block-level metadata errors.

If the user hasn't properly vetted the SSD firmware then the only valid use case for single metadata is case 1. This is the problem with the current situation. There's no practical way for mkfs.btrfs to validate SSD firmware or assess a user's risk tolerances. mkfs.btrfs can't evaluate the criteria for either single metadata case, yet it sometimes chooses single metadata without user intervention.

Given the above, dup metadata should be the default. single metadata should always require explicit user action (as balance requires --force when reducing metadata redundancy).
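The balance analogy in practice (mount point is illustrative): converting metadata from dup to single is treated as a redundancy reduction and is refused unless the user forces it, which is the kind of explicit action proposed here for mkfs as well.

```shell
# Refused by btrfs-progs, since it reduces metadata redundancy:
btrfs balance start -mconvert=single /mnt

# Explicit user action required to proceed:
btrfs balance start -f -mconvert=single /mnt
```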

@Forza-tng (Contributor, Author)

Thanks for the very detailed explanation. It is much appreciated.

@tom-seewald

What Zygo wrote makes a pretty compelling case for enabling duplicated metadata by default on SSDs, and I think filesystems in general should err on the side of fault tolerance whenever reasonable. @kdave, what are your thoughts on this proposal?

@kdave kdave added defaults Changes in default settings mkfs Changes in mkfs.btrfs labels Mar 16, 2021
@kdave (Owner)

kdave commented Jun 21, 2021

The question regarding DUP has been brought up on IRC again and seems to be a recurring topic, with opinions, to my knowledge, favoring DUP even for SSD devices. What Zygo wrote also speaks for that, and my personal opinion is to do the switch as well.

The original change DUP->single was done in 2013 in commit 124053b with the argument

SSD's do not gain anything by having metadata DUP turned on.  The underlying
file system that is a part of all SSD's could easily map duplicate metadata
blocks into the same erase block which effectively eliminates the benefit of
duplicating the metadata on disk.

IMHO this is addressed in more detail in Zygo's post, and there is no clear yes/no answer. Regarding the trade-offs, favoring reliability over performance can be considered a good approach to choosing defaults. The cases where single makes sense for an SSD are more like an exception.

As changing defaults affects everyone and the current state has been around for some years, I'd like to get some agreement that it's worth the change but also counter arguments.

The change should happen at a major release, but 5.13 is too close, so it would be 5.14 at the earliest. This should be enough time to gather feedback.

(Related to defaults but not this issue: multi-device mkfs creates raid0 by default, this should perhaps be single as well)

kdave added a commit that referenced this issue Jul 2, 2021
Make it a new chapter with sections. The SSD and firmware parts were
inspired by a more detailed Zygo's writeup at
#319 (comment)

Signed-off-by: David Sterba <dsterba@suse.com>
kdave added a commit that referenced this issue Jul 2, 2021
@tom-seewald

As changing defaults affects everyone and the current state has been around for some years, I'd like to get some agreement that it's worth the change but also counter arguments.

The change should happen at a major release, but 5.13 is too close, so it would be 5.14 at the earliest. This should be enough time to gather feedback.

Just checking, is this proposed change still on track for being implemented in a future release or have there been compelling counter arguments made?

@adam900710 (Collaborator)

I 100% agree with using DUP as the default for metadata, no matter what the underlying disks are.

The implementation of the disks should not affect how we utilize them, in most cases.

One important thing: we already have a report on the mailing list of a csum error corrupting a tree block on an SSD.
To me, that's already enough for us to do the switch.

Finally, whether such an exception applies is arguable; it would take more solid statistics on the SSDs in the wild to determine.
We are already nearly 10 years late in arguing about that, but it's still not too late to revert it.

@kdave kdave added this to the v5.15 milestone Sep 27, 2021
@kdave (Owner)

kdave commented Sep 30, 2021

DUP by default for metadata is scheduled for 5.15.

kdave added a commit that referenced this issue Sep 30, 2021
The original idea of not doing DUP on SSD was that the duplicate blocks
get deduplicated again by the drive firmware. This was in 2013, years
ago. It was speculative then, and even nowadays we don't have much
reliable information from vendors about what optimizations are done at
the drive level.

Over the years, enough information has been gathered by the user
community, and there's no simple answer. Expensive drives are more
reliable but less common; for cheap consumer drives it's vice versa. The
characteristics are described in more detail in the manual page btrfs(5)
in section "SOLID STATE DRIVES (SSD)".

The reasoning is based on numerous reports on IRC and the technical
difficulty on the mkfs side of making the right decision. The default is
chosen to be the safe option, and it is up to the user to change it
based on an informed decision.

Issue: #319
Signed-off-by: David Sterba <dsterba@suse.com>
kdave added a commit that referenced this issue Oct 5, 2021
kdave added a commit that referenced this issue Oct 6, 2021
kdave added a commit that referenced this issue Oct 8, 2021
@kdave (Owner)

kdave commented Nov 4, 2021

No remaining issues regarding DUP by default, closing.

@kdave kdave closed this as completed Nov 4, 2021
lansuse pushed a commit to lansuse/btrfs-progs that referenced this issue Dec 1, 2021
lansuse pushed a commit to lansuse/btrfs-progs that referenced this issue Dec 1, 2021
@jowagner

Further points for weighing the pros and cons of dup+encryption vs investing in a second SSD and using proper raid1:

  • The SSD may store the two dup copies close together simply because they are both written in a short time window. The block addresses are irrelevant when the drive uses an internal mapping to assign internal storage locations.
  • Stream compression is not necessarily switched off for incompressible or poorly compressible data. Avalanche effects may still damage both copies in the dup+encryption setup. Yes, it would make sense for the SSD to switch off compression to save power when reading the same data multiple times in the future, but this also makes the logic more complicated.

@kdave (Owner)

kdave commented Mar 22, 2022

We, as users of SSDs, don't have any reliable specification of the devices' internal behaviour and optimizations; most of it is speculative, experimentally observed, or anecdotal, and it also differs across vendors and firmware versions. Zygo has replied in issue #455 in that regard too.

On the filesystem side we took a general approach instead of trying to be too clever and likely wrong. There are options and features available to be turned on or off in case the user has enough information to make the right decision. This is partially covered in https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#solid-state-drives-ssd .
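For users who want to check which profiles an existing filesystem actually uses (mount point is illustrative):

```shell
# Shows allocation per profile, e.g. "Metadata, DUP" vs "Metadata, single".
btrfs filesystem df /mnt

# More detailed view, including per-device allocation.
btrfs filesystem usage /mnt
```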
