Change defaults: Btrfs should use 'dup' metadata on encrypted devices #319
There is quite a range of SSD firmware behavior out there. There's basically only one case where single metadata isn't a "please destroy my data" flag. While it's not a rare case (there are a lot of SSDs that work), it's still a good idea to bet against it in the absence of specific information about the SSD (there are also a lot of SSDs that don't work).

A brief summary of SSD behaviors

At the high end, high-endurance SSDs do deduplication in firmware, and also compression, and a bag of proprietary firmware tricks to reduce the data size written to the flash media. The drives get their high endurance rating in part because they do deduplication in firmware, and they will underperform their ratings if the data is encrypted. The bag of tricks is advertised under various trademarked names, so users can pay extra to buy drives that have the features. The feature sets include data redundancy and error recovery provided by SSD firmware. The SSD firmware will dedupe unencrypted dup metadata, then use an ECC scheme which effectively reduplicates it. The drive will make every effort to self-repair medium errors, but when that fails, the drive will report error information (and usually flip the entire drive read-only, because there is no longer sufficient redundant media in the drive to correct any more failures). When these drives work, btrfs can use single metadata and even disable csums, as the drive will do these tasks adequately.

At the low end, low-endurance SSDs are optimized for cost. Low-end SSDs often lack software and hardware for deduplication in firmware. At the lowest extreme, the most basic error detection capabilities in firmware are not working or not implemented. Drives in this category will report no errors and pass all SMART self-tests even as data read/write tests (e.g.

Most consumer drives fall somewhere on a spectrum between these extremes. A lot of firmware tries to do smart things, but the implementation has bugs.
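The interaction described above, where firmware dedup collapses unencrypted dup metadata but cannot collapse encrypted copies, can be sketched with a toy model. The `ToyFTL` class and the XOR "cipher" below are illustrative assumptions, not how any real flash translation layer or dm-crypt works:

```python
import hashlib
import os

class ToyFTL:
    """Toy flash translation layer that dedupes writes by content hash."""
    def __init__(self):
        self.physical = {}   # content hash -> stored block (one copy per hash)
        self.mapping = {}    # logical address -> content hash

    def write(self, lba, block):
        key = hashlib.sha256(block).hexdigest()
        self.physical[key] = block      # identical blocks share one physical copy
        self.mapping[lba] = key

    def physical_copies(self):
        return len(self.physical)

meta = os.urandom(4096)                 # one btrfs metadata block

# Unencrypted dup: both copies are bit-identical, so the firmware keeps ONE.
plain = ToyFTL()
plain.write(0, meta)
plain.write(1, meta)
assert plain.physical_copies() == 1

# Under encryption each copy is keyed by its sector number, so the two
# ciphertexts differ and the firmware cannot collapse them.
def encrypt(block, lba):
    # stand-in for per-sector-IV encryption: XOR with an LBA-derived keystream
    stream = hashlib.sha256(lba.to_bytes(8, 'little')).digest() * (len(block) // 32)
    return bytes(a ^ b for a, b in zip(block, stream))

crypt = ToyFTL()
crypt.write(0, encrypt(meta, 0))
crypt.write(1, encrypt(meta, 1))
assert crypt.physical_copies() == 2     # dup redundancy survives on the media
```

The same reasoning explains why dup metadata on a deduplicating drive may buy nothing without encryption, and real redundancy with it.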
Some combinations of firmware features can be very bad, e.g. deduping writes but not providing redundant media bits for ECC would make the drive strictly more vulnerable to media failure and data loss unless btrfs dup metadata was used with encryption to defeat the SSD firmware dedupe. We can hope that drive vendors will not choose stupid combinations of firmware features, but we cannot rely on it. Severe firmware bugs are not common but not rare either.

All SSDs are made of hardware, and all hardware can fail, so there's a risk of single-block data losses even on high-end drives. There's a marginal event probability sandwiched between "no failures at all" and "total drive failure" where btrfs dup metadata can still be useful on such drives. It will be a tiny portion of all failure cases--but not zero.

OEM drives tend to fall toward the low end of the spectrum, and that's the kind of drive most users will have in their single-drive machines. If you count drive models, most are good. If you count drives users use, most are bad. The good ones are expensive so they aren't popular. The bad ones get sold at a discount, so your laptop vendor probably bought one.

Metadata writes tend to be the biggest component of lifetime writes to a SSD, so there is some value in reducing them; however, only users who consume 50 to 100% of the SSD's actual lifetime writes need to be concerned by the write amplification of btrfs dup metadata. Most users will be far below 50% of the actual lifetime, or will write the drive to death and discover how many writes 100% of the actual lifetime was. SSD firmware often adds its own write multipliers that can be arbitrary and unpredictable and dependent on application behavior, and these will typically have far greater effect on SSD lifespan than btrfs dup metadata. It's more or less impossible to predict when a SSD will run out of lifetime writes to within a factor of two, so it's hard to justify wear reduction as a benefit.
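The 50-100% lifetime argument above is simple arithmetic. A back-of-envelope sketch, where the 600 TBW endurance rating, the 20 GB/day write rate, and the 25% metadata share are made-up but plausible numbers, not measurements:

```python
# Hypothetical 1 TB consumer SSD rated for 600 terabytes written (TBW).
tbw_rating = 600

# Suppose a desktop workload writes 20 GB/day, of which a quarter is metadata.
data_per_day_tb = 0.020
metadata_share = 0.25
years = 8

total_tb = data_per_day_tb * 365 * years                    # lifetime writes, single
extra_tb = data_per_day_tb * metadata_share * 365 * years   # dup doubles the metadata part

print(f"single: {total_tb:.0f} TB of {tbw_rating} TBW "
      f"({100 * total_tb / tbw_rating:.0f}% of rated life)")
print(f"dup:    {total_tb + extra_tb:.0f} TB "
      f"({100 * (total_tb + extra_tb) / tbw_rating:.0f}% of rated life)")
```

With these numbers, eight years of use consumes roughly 10% of the rated endurance with single metadata and roughly 12% with dup: both far below the 50% threshold where write amplification starts to matter, which is the point being made.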
If a user has a cloud of single-SSD machines and always writes them to death, and they want to reduce the rate at which they consume drives, then wear reduction is a benefit--but they probably also fall into case 1 below, since their drives are continually failing.

When does single metadata make sense?

There are only two cases where single metadata is a predictably good idea:
Case 1 is always a user choice. Most filesystem users prioritize persistence over performance, especially in default configurations, but there are exceptions and btrfs provides for them (see also

Case 2 is an intersection of edge cases:
A user in case 2 is micro-optimizing for small marginal gains against an expensive hardware constraint (e.g. laptop or embedded/appliance device with only one drive bay) and they know that they are doing this (i.e. they have done the homework to vet the SSD firmware).

If the user has not vetted the drive firmware, then (as of December 2020) we cannot reasonably expect a random drive model to have working ECC or media redundancy, and the filesystem should expect bit- and block-level metadata errors. If the user hasn't properly vetted the SSD firmware then the only valid use case for single metadata is case 1. This is the problem with the current situation. There's no practical way for

Given the above, dup metadata should be the default. single metadata should always require explicit user action (as balance requires
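In command-line terms, the explicit-action model argued for above would look roughly like the following. The device path and mount point are placeholders; the `--metadata` and `-mconvert` options exist in current btrfs-progs:

```shell
# Proposed default: mkfs picks dup metadata even on non-rotational devices.
mkfs.btrfs /dev/sdX

# Users who have vetted their SSD firmware opt out explicitly at mkfs time...
mkfs.btrfs --metadata single /dev/sdX

# ...or later, by converting the metadata profile with a balance:
btrfs balance start -mconvert=single /mountpoint
```

The balance conversion is the "explicit user action" referenced in the text: it requires the same deliberate step for single that users today need for dup.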
Thanks for the very detailed explanation. It is much appreciated.
What Zygo wrote seems to be a pretty compelling case for enabling duplicated metadata by default on SSDs, and I think filesystems in general should err on the side of fault tolerance whenever reasonable. @kdave what are your thoughts on this proposal?
The question regarding DUP has been brought up on IRC again and it seems to be a recurring topic, to my knowledge favoring DUP even for SSD devices. What Zygo wrote also speaks for that, and my personal opinion is to also do the switch. The original change DUP->single was done in 2013 in commit 124053b with the argument

IMHO this is addressed in more detail in Zygo's post and there is not a clear yes/no answer. Regarding the trade-offs, favoring reliability over performance is something that can be considered a good approach to choosing defaults. The number of cases where single makes sense for SSD is more like an exception.

As changing defaults affects everyone and the current state has been around for some years, I'd like to get some agreement that it's worth the change, but also counter arguments. The change should happen at a major release time, but 5.13 is too close, so it would be 5.14 at the earliest. This should be enough time to gather feedback.

(Related to defaults but not this issue: multi-device mkfs creates raid0 by default; this should perhaps be single as well)
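For the multi-device aside, spelling the profiles out at mkfs time avoids depending on either the current raid0 default or a changed one. Device names are placeholders; `--data` and `--metadata` are real mkfs.btrfs options:

```shell
# Explicit profile selection on a two-device filesystem, rather than
# relying on the multi-device defaults (currently raid0 data):
mkfs.btrfs --data single --metadata raid1 /dev/sdX /dev/sdY
```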
Make it a new chapter with sections. The SSD and firmware parts were inspired by a more detailed writeup by Zygo at #319 (comment). Signed-off-by: David Sterba <dsterba@suse.com>
Just checking, is this proposed change still on track for being implemented in a future release, or have there been compelling counterarguments made?
I'm 100% in agreement on using DUP as the default for metadata no matter what the underlying disks are. The implementation of disks should not affect how we utilize them in most cases. One important thing: we already have a report on the mailing list of a csum error that corrupted a tree block on an SSD. Finally, such an exception is already an arguable thing; we need more solid statistics on the SSDs in the wild before it can be determined.
Dup by default for metadata scheduled for 5.15. |
The original idea of not doing DUP on SSD was that the duplicate blocks get deduplicated again by the drive firmware. This was in 2013, years ago. Then it was speculative, and even nowadays we don't have much reliable information from vendors about what optimizations are done at the drive level. Over the years enough information has been gathered by the user community, and there's no simple answer. Expensive drives are more reliable but less common; for cheap consumer drives it's vice versa. The characteristics are described in more detail in the manual page btrfs(5) in section "SOLID STATE DRIVES (SSD)". The reasoning is based on numerous reports on IRC and the technical difficulty on the mkfs side of making the right decision. The default is chosen to be the safe option, and it is up to the user to change that based on an informed decision. Issue: #319 Signed-off-by: David Sterba <dsterba@suse.com>
No remaining issues regarding DUP by default, closing. |
Further points for weighing the pros and cons of
We as users of the SSDs don't have any reliable specification of the internal behaviour of the devices and their optimizations; most of it is speculative, experimentally observed, or anecdotal, and it also differs across vendors and firmware versions. Zygo has replied in issue #455 in that regard too. On the filesystem side we took a general approach instead of trying to be too clever and likely wrong. There are options and features available to be turned on or off in case the user has enough information to make the right decision. Partially covered in https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#solid-state-drives-ssd .
According to the man page https://btrfs.wiki.kernel.org/index.php/Manpage/mkfs.btrfs#OPTIONS the default for non-rotational devices is to use the single metadata profile. I think that in cases where users run Btrfs on top of LUKS/dm-crypt, Btrfs should default to dup mode. One of the reasons to choose single is that modern SSD/flash media can do deduplication, which defeats the purpose of dup mode. With an encryption layer in between this is no longer possible, so we should use dup by default.

The man pages (https://btrfs.wiki.kernel.org/index.php/Manpage/mkfs.btrfs#DUP_PROFILES_ON_A_SINGLE_DEVICE) should expand on the reasoning for choosing single or dup on encrypted devices to help users make more informed choices. A lot of users do use Btrfs on their laptops (perhaps more so now that Fedora 33 uses Btrfs by default), and using dup could help save users' filesystems from crashing on smaller errors.

SD cards, on the other hand, are another type of flash media where perhaps the added writes with dup are not so good. Then again, if the cards break, they often break badly, so dup would not help anyway? I have no evidence other than word of mouth here.

Anecdotal evidence from my own NAS box: it has a Samsung SSD 830 Series 256GB as the system disk. I had a corrupt block; luckily it was on my /boot partition, so I could easily restore a backup. Had I used dup, the filesystem could have survived better. Earlier I had been convinced by arguments on the #btrfs IRC channel that dup wasn't the best choice.
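The LUKS-plus-dup stack this issue asks for can be set up explicitly today. Device and mapper names below are placeholders; `cryptsetup luksFormat`, `cryptsetup open`, and `mkfs.btrfs --metadata` are the standard commands:

```shell
# Encrypt the device; the dm-crypt layer makes every written sector unique,
# so SSD firmware dedup can no longer collapse btrfs's two metadata copies.
cryptsetup luksFormat /dev/sdX
cryptsetup open /dev/sdX cryptroot

# dup metadata on top of the encrypted mapping, as this issue proposes
# should become the default on such devices:
mkfs.btrfs --metadata dup /dev/mapper/cryptroot
```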