Reallocate on read error #1256
Comments
The devil is always in the details with these things, but it's certainly an idea worth exploring. If you want to dig into this I'd start with
You will want to set TLER/CCTL on your disks to force them to report errors to the OS sooner. I like to set them to 1 second at boot with
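For reference, on Linux this kind of setting is commonly applied with smartctl's SCT Error Recovery Control interface. A sketch, assuming the device names are placeholders and the drives actually support SCT ERC (many consumer drives don't, and the setting resets on power cycle, hence running it at boot):

```shell
# Set SCT ERC to 1 second for reads and writes (values are in 100 ms units).
# Run at boot, e.g. from a systemd unit or rc.local; /dev/sd{a..d} are placeholders.
for dev in /dev/sd{a,b,c,d}; do
    smartctl -l scterc,10,10 "$dev"
done

# Verify the current setting on one drive:
smartctl -l scterc /dev/sda
```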
I'm not sure that fixes the problem - the issue is that the drive firmware reports a read error if it detects a bad sector, and doesn't return bad data (which, depending on your perspective, is somewhat sensible). The problem seems to be that ZFS doesn't treat a read error like a checksum failure, so although it'll rebuild from parity and continue, it assumes the sector might read successfully at some point in the future. Until a write happens, the disk won't reallocate the sector and ZFS will keep on rebuilding around it. I have a couple of Seagate ST2000's in my array (which, unlike WD, don't lie about reallocated sector counts). One has 278 sectors pending reallocation despite regular scrubs, but 0 reallocated sectors. I'll have a dig around and see how ZFS should be handling read errors like this vs. how the Linux kernel reports them, because this seems kind of important. Though I guess what I'll need is some tool that'll reliably emulate a bad hard disk.
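One present-day option for emulating a bad disk is the dm-dust device-mapper target (mainline since kernel 5.3), which fails reads on configured bad blocks until they are rewritten - much like a real pending sector. A sketch based on the kernel's dm-dust documentation; /dev/sdX is a placeholder for a scratch device, and this needs root:

```shell
# Wrap a scratch block device in a "dust" target (512-byte block size).
SECTORS=$(blockdev --getsz /dev/sdX)
dmsetup create dust0 --table "0 $SECTORS dust /dev/sdX 0 512"

# Mark a block bad, then enable bad-block emulation:
dmsetup message dust0 0 addbadblock 12345
dmsetup message dust0 0 enable

# Reads of that block on /dev/mapper/dust0 now fail with an I/O error;
# a write to it clears the bad block, mimicking a drive-internal reallocation.
```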
@wrouesnel I can assure you it is working as intended. Which isn't to say there isn't room for improvement because it was designed with the Solaris SCSI layer behavior in mind and not Linux's. They differ significantly in some regards. There are at least three legitimate cases to consider.
Right now ZFS treats all read errors as soft read errors. The data will get reconstructed on the fly but not rewritten. For Linux that's probably not ideal, because the SCSI layer is much more aggressive about retrying I/Os than its Solaris counterpart. By the time we see the read error in ZFS, Linux has already retried the I/O three times at the lower layers. This isn't true for Solaris, which will fail the I/O without any retry. In theory this is what the Linux FASTFAIL I/O flags are for, but in practice they don't work reliably for a variety of reasons. So it might make sense on Linux to treat all read errors as hard read errors and rewrite the sector in place.
Check out the
@behlendorf Up until now I was under the impression that ZFS will automatically reconstruct damaged on-disk data the moment it encounters it (given the data is available from somewhere else), with scrub being a way to force ZFS to visit the whole tree to find errors in cold data. Thinking about the three possible cases you mentioned, I wonder: why isn't bad data rewritten instantly? What is the reasoning behind keeping known bad data on-disk?
For the record, I checked and zfs-fuse 0.6.9 behaves exactly the same way, i.e., it doesn't rewrite blocks on read failure. This is perhaps to be expected, as vdev_mirror_io_done() in zfs-fuse is almost exactly the same as in ZoL (only 2 different lines, and apparently mostly cosmetic).
Seems like a failure (somewhat) of the "zfs data safety - it's better than hardware RAID" promise. Say it's RAID-Z2 and two disks go down: "officially" we were "allowed to lose any two disks", but in reality we have now lost data, since bad sectors on the remaining disks were left by ZFS to sit around wherever they appear, when they ought to have been rebuilt/reallocated/rewritten elsewhere. When the "public/stated contract" of ZFS is not met in this way, end users might be considered undermined - so I suggest the priority of this be upped from "feature" to something more significant. And also, until this "bug" is fixed, the spec/notes/documentation ought to be updated so that end users (if they ever get bitten by this) don't consider themselves undermined by broken promises.
For a clear example of this misunderstanding of the current ZFSoL contract, see: In particular this text: Evidently that last statement is a false understanding by the writer. |
Is this still the case? I've encountered a few unreallocatable bad sectors on 1 of my 3 drives in a RAIDZ1 configuration. I'm wondering how safe I am, and how urgent it is to replace the drive. I'm on:
Agreed - this is concerning, and makes me wonder about the safety of data on systems that don't immediately rewrite. Looking at the logic of vdev_mirror.c, it appears that the "issue" is:
Would it be enough to change it to
Skipping the other two checks? Similarly in vdev_raidz: zio_checksum_verified(zio);
Could become
What would be the downside?
@pdemarino we want to make sure we only perform the rewrite in the case of unexpected_errors or resilver/scrub. Otherwise we'll make a lot of unnecessary iterations around that for loop for every read. So the key question then is why isn't unexpected_errors non-zero. |
@behlendorf Understood - but I would like to challenge that. Is it really appropriate to be concerned about performance in the rare case when errors are reported by the disk stack? Wouldn't it be more reasonable to think safety first, in the rare occurrence when any disk is reporting a failure? I feel that the current policy is essentially assuming that most errors will be transient; and that scrubs will be run often enough that non-transient errors will be caught. While I wholeheartedly agree that scrubs should be run often, I agree with @zenaan and @SkyWriter that this is a departure from what is understood reading the docs - it sure is a departure from my personal expectations. My recommendation would be to force the rewrites every time an error is caught, scrub or no scrub. Unless I misunderstand the algorithm, this should have no impact whatsoever on performance on a healthy disk, and it may safeguard data on an ailing one, albeit with a small performance cost (again, which would only be paid by a failing system). |
@behlendorf - Feedback understood re: my recommendation. That was asinine on my part, I misread the code pretty bad. Now, however, I'm really confused. Given that the opening of vdev_mirror_io_done counts as unexpected errors any children with an error that weren't skipped, how is it possible that a read error slips by? They should all trigger the code on the bottom. In other words - how can a read not set the error flag, yet return no data? I've been following the code all the way down to the BIO submission, but couldn't find the point where that choice is made. |
@pdemarino yes exactly. That's the mystery, I'm glad we're on the same page. I've only skimmed the code but it's not obvious to me either how |
@behlendorf Yet it's definitely happening. I have a RAIDZ-2 array solely dedicated to ZFS, and two drives report through SMART significant numbers of pending reallocations - which means read errors not followed by rewrite attempts. Now, I see (at least) three possible explanations.
Would it make sense to instrument the RAIDZ vdev code to try and tell these three cases apart? I have a slowly failing disk that I can use to try and map what's going on. Alternatively, could an analysis of dmesg and/or zpool history help? Happy to share anything that could help. |
I just want to add that I have a couple of drives with pending reallocations that don't disappear even after filling the whole drive with random data multiple times (using dd), so I don't know if that's always proof of the sectors not being rewritten. I literally can't do anything to make those pending sectors go away. Reading the whole drive also does nothing to them. They don't report any errors, but the smart values stay the same, with pending sectors. I'm not saying this is what's happening here, but I wanted to let you know this is possible. If it's because buggy firmware or something, I don't know. These disks are old (500GB and 160GB), and it's worked fine the other times I've tried it in different hard drives. |
isn't that exactly how marking a sector for reallocation should work (transparent from the os)? note there is a difference between marking for allocation and offline uncorrectable - the latter fails the read while the former may still return valid (albeit possibly corrupted) data. if the read would have returned corrupted data, then zfs checksumming would have kicked in. if that did not happen, then the sector read returned valid uncorrupted data, BUT the controller tried hard to read that (which you see in the smart log). that makes me think this issue is a feature request: integrated smartd polling for configurable early drive faulting when encountering pending or reallocated sectors. |
Just wanted to point out a few things:
All of this fully lines up with some of the experiences described here if errors were either corrected by a HDD or if they were found in an empty part of the media. However, the (almost 4-year old) statement by @behlendorf regarding the soft errors makes me think that may be ZFS is really not what I want to use for my data. On a positive side, I was able to obtain a collection of 6 SATA HDDs which produce "Pending Sectors" pretty much on demand, while still perfectly operational otherwise. I'll try setting up something to see if RAIDZ can really lose data due to gradual bit rot. References:
|
At work I ran a set of 10 SATA drives that had an unusually high URE rate like that for a few years in raidz3 but never had any bit rot. Although I did weekly scrubs which each week there were repairs made. |
On Tue, Dec 20, 2016 at 09:12:35PM -0800, GoofHub wrote:
All of this fully lines up with some of the experiences described
here if errors were either corrected by a HDD or if they were found
in an empty part of the media. However, the (almost 4-year old)
statement by @behlendorf regarding the soft errors makes me think
that may be ZFS is really not what I want to use for my data.
Until the bug is fixed, yes, it is very disconcerting to use zfs on
linux if you value your data.
On a positive side, I was able to obtain a collection of 6 SATA
HDDs which produce "Pending Sectors" pretty much on demand, while
still perfectly operational otherwise. I'll try setting up
something to see if RAIDZ can really lose data due to gradual bit
rot.
This is a very positive note. Conceptually, the problem should be
easy to repeat artificially - create a zfs "module" to flip a bit (or
an entire sector) in a RAIDed file, then read that file.
Then verify that the file is not reallocated from its RAID parity
bits from other drive(s).
Once a simple/straightforward set of steps to reproduce is
established, then a programmer has an easy target for fixing.
|
On Wed, Dec 21, 2016 at 06:40:27AM -0800, John M. Drescher wrote:
At work I ran a set of 10 SATA drives that had an unusually high
URE rate like that for a few years in raidz3 but never had any bit
rot. Although I did weekly scrubs which each week there were
repairs made.
ZFS on Linux (just making sure)?
|
Yes |
So, currently, ZFS could lead to data-loss (or punctured arrays) on mostly-read raids ? As example:
As ZFS is COW, probably, Now disk1 fails totally and a resilvering is made. Result: data loss due to COW and missing forced rewrite by ZFS. This is something that mostly all HW RAID controller does automatically during their "patrol read" or "consistency check" scans. In case of URE, all raid controllers rewrite to the same sector forcing HDD to reallocate it to a new location with correct data coming from the other survived disks. |
This is why you scrub. |
Yes, but scrubs doesn't trigger any rewrites, or this issue would be totally non-sense, right ? And even by scrubbing, the unreadable sector won't be reassigned. Will ZFS "move" |
They do if the data read does not match. |
In other words, this issue is wrong and should be closed or am I missing something? |
I am pretty sure that was not talking about scrubs or resilvers. |
scrubs and resilvers in this failure mode still don't rebuild, because it shows up as a read-error, not a checksum error AFAIK. With modern storage hardware you almost would like to promote all read errors to be defacto checksum errrors to force a rewrite attempt - since disks start dropping for me once I get write failures in ZFS (which is generally the point at which that disk is "done"). |
The problem is that read error, as seen by ZFS, is not the same as a medium error. Rewrite will only attempt to repair a medium error. |
Wow this has been around for 4 years now? Just found this now, and I think I've already experienced data loss last year caused by this issue. Hoping for improvements soon, upvote ^^ |
I think this should be placed on a top priority list |
On Thu, Jul 06, 2017 at 12:05:29AM -0700, GUEST.it wrote:
I think this should be placed on a top priority list
ZFS is famous for data integrity and without this issue fixed, data loss could happen easily
Agreed - it seems possible that the problem really only arises on
Linux where the SCSI/disk subsystem works a bit differently to
Solaris (no real surprise there).
As we know this is a volunteer project, so let's be really grateful
to those devs who do put in their time - without them, we would not
have ZOL in the first place :)
|
In Solaris this isn't possible? Why? |
On Thu, Jul 06, 2017 at 04:00:45AM -0700, GUEST.it wrote:
In Solaris this isn't possible? Why?
I don't know that's a fact, but by my reading of those above who have
generously added their insights, Solaris at the disk/SCSI layer,
provides perhaps different "response codes" to the upper layer(s)
e.g. the zfs driver/fs.
ZFS was written on, and for, Solaris, and only later modified to
support/run on, the Linux kernel.
We can reverse the question - why would the Linux kernel, and the
Solaris kernel, have identical semantics --at the disk driver layer--
?
There is no reason that they should be identical, and indeed that
would be extraordinarily surprising if they were. The various SYSV/
BSD, and etc Unix standards were created to standardize the higher
layers in Unix - as far as I know there is no "Unix driver ABI or
etc" standard :)
|
Has the patch proposed by @pdemarino been tested by anyone? I have a pool which experiences UREs on one disk which do not get fixed (even when scrubbing). |
Is possible to know if this (critical for data reliability and safety) issue is being worked by someone? ZFS is famous for data reliability, but this issue describe a potential data loss |
On Fri, Jul 14, 2017 at 12:11:21AM -0700, GUEST.it wrote:
Is possible to know if this (critical for data reliability and safety) issue is being worked by someone?
Only if a) someone spends the time to look at it, and
b) posts a note here that they're doing so.
ZFS is famous for data reliability, but this issue describe a potential data loss
Indeed.
If anyone is relying on this commercially, perhaps a Patreon or
similar bounty could inspire someone.
Other than that, this is a volunteer project, and where it's at
already is a great contribution to the GNU/Linux etc ecosystem.
In the meantime, until a fix is implemented, the community should put a big red warning on the first page of ZFS on Linux website to explain this potential issue. Is Oracle still funding ZFS on Solaris? With Solaris slowly dying, I guess Oracle is not interested in funding ZFS anymore; where would the revenues come from? BTRFS failed me (even with RAID1) and I was planing to got with ZFS on Linux, but I feel lucky to have found this issue before I did. The community should be transparent and advertise this problem. My only option left for a RAID software solution is MDADM and they broadcast on their first page a warning about a potential issue. That is the way to do it and ZFS on Linux should do the same. |
this probably belongs to FAQ link in the front page of zol website which points to https://github.com/zfsonlinux/zfs/wiki/FAQ |
I believe in my testing with drives with high URE rates in raidz3 that scrubs did fix this problem. I did get checksum errors for the drives and the sectors appeared to be reallocated in the SMART data. The system is powered off since I retired it a few months ago. However next week I could possibly power it on an do some testing. |
BTW, I had the system running like this for a few years with actual data on it (mostly downloads and isos for our software several hundred GB of data) and never lost any data. When I retired it I did a send and receive to a new zfs server and there was no issue with that. |
yes, but as wrote above, scrubs are run at scheduled intervals. If you scrub weekly, you can react to URE only weekly. If you hit an URE after the last scrub and then you loose a disk before another scrub, you loose data. I think that is much safer to force a reallocation/rewrite as soon as an URE is detected. |
On Fri, Jul 14, 2017 at 03:11:29PM +0000, GUEST.it wrote:
yes, but as wrote above, scrubs are run at scheduled intervals. If
you scrub weekly, you can react to URE only weekly. If you hit an
URE after the last scrub and then you loose a disk before another
scrub, you loose data.
I think that is much safer to force a reallocation/rewrite as soon
as an URE is detected.
Of course that's safer - the point is, this is not a voting
competition, and not even a "compel the volunteers" exercise - it's
just the way ZFS should (and is supposed to) work - there is no
disagreement that this is a bug.
The only competition is who will be the first to either volunteer and
fix this, or get paid and fix this. Sooner or later, someone will do
so :)
|
Used to test if data is really re-written on zio read & checksum errors. This came about because of a discussion in: openzfs#1256
I've created a branch (
To run: Clone and build my zfs branch. Within that workspace, run Results: If a scrub read IO has a checksum error, it reconstructs the data from the good drive, and re-writes the sectors where the bad data was. This increments the checksum counter in zpool status. If a regular read IO has a checksum error, it also reconstructs the data from the good drive, and re-writes the sectors where the bad data was. This does not increment the checksum counter in If the data could not be re-constructed, or if the re-write was not successful, then you would see the error counters increment. Furthermore, consider that when you read a file from a mirrored pool, zfs will quasi-round-robin the reads between the disks to get better performance. So if you have a 1MB file that's totally corrupted on one disk in a mirrored pair, and you cat the file, then roughly half the data you read will be corrupted. That means that each time you read the file completely, you're going to reconstruct half of the remaining bad data. You see this in TL;DR I think it's up for debate whether these self-healing IOs should be included in the |
On Wed, Jul 26, 2017 at 12:44:25AM +0000, Tony Hutter wrote:
I've created a branch (`tonyhutter:testing-read-failures`) with a test script that does the following:
...
TL;DR
Your data is safe. It is getting re-written on every checksum and read error. ZFS just doesn't report the errors if it fixed them without issue.
Very nice report!
I think it's up for debate whether these self-healing IOs should be included in the `zpool status` read/checksum counts. There may be good historical reasons why they're not included. We may also want to consider retooling the self-healing routines to be even more hardcore. For example, @behlendorf suggested to me that it may be possible do a mini-scrub on just the file that had the errors. That's a discussion for another day though (outside of this bug).
Great stuff, will certainly set folks' minds at ease :)
|
@tonyhutter Thank you very much for testing and confirming that it works as intended! |
@tonyhutter Thanks for the detailed answer.
One question: So for e.g. an hdd sector gets unwriteable/damaged, zfs reads the data, (determine checksum error), and it tries to re-write the data to the same bad sector? (I thought zfs is also capable like chkdsk or badblocks to mark these sectors as unusable and skip them in future. Or is that one layer below zfs? Sorry for the dump question). |
@jumbi77 we assume the disk will internally remap the bad sector to another physical location on disk. See https://kb.acronis.com/content/9105. |
ZFS can't relocate a sector in that regard. That's supposed to be the drive's responsibility to have spare sectors and work around physical defects while providing the illusion of a pristine surface. When it can't, you should replace the drive anyway. |
I think it would be wise for 24/7 production systems to not rely on zfs doing this healing and replace a disk as soon as you get read or write errors. Does this PR implements a way to inform user that a healing has taken place? Otherwise, it will make things worse, delaying the moment the ill drive should be replaced. |
This isn't a PR, just an observation that self-healing doesn't happen in all situations when there was an expectation it did. |
I'm going to close this bug because I think we've confirmed that ZFS is in fact re-writing the sector on read errors. I did send an email to the zfs-devel mailing list on whether we should be including the self-healed IOs in |
To build on @tonyhutter's previous comment I wanted to expand on why failed speculative IOs are not counted in the For performance reasons these IOs bypass the normal interlocks in the ZFS pipeline. This means that's it possible they can fail with a checksum error, for example if a file is removed and the blocks are reused in a new file before the speculative IO is serviced. This is a false positive which shouldn't be reported. The design is very careful to only report real hardware errors which have percolated up. We want people to be able to trust that when ZFS reports an error, it is a real error. There are a couple ways we can weed out the false positives and only report the real errors. That's something we're going to look in to. |
I've noticed an issue when using ZFS with commodity hard disks: when these drives develop a bad sector on read, they don't reallocate it right away. Unfortunately, neither does ZFS it seems - which regards such errors as temporary read errors and rebuilds from parity data.
As a result, these types of sector errors persist for a very long time due to the copy-on-write semantics and RAIDZ/mirror vdev levels, which is undesirable behavior because you end up relying on parity data more often even if the error is transient/manufacturing related.
There are ways to force a reallocation, but they involve trawling logs and figuring out sector locations, which is a pain and not the type of automatic that's really desirable. What would be much better would be if ZFS could be convinced to treat a sector read error as being similar to a checksum failure, and force a rewrite of that sector - which would trigger sector reallocation on the underlying drive.
This would help identify actual failing drives more quickly, and help keep incidental read errors down. I'm thinking the exact logic would be a flag to a zpool to treat the first read error as a checksum failure, and the 2nd as a real error.
The text was updated successfully, but these errors were encountered: