Reallocate on read error #1256
Comments
The devil is always in the details with these things, but it's certainly an idea worth exploring. If you want to dig into this I'd start with
You will want to set TLER/CCTL on your disks to force them to report errors to the OS sooner. I like to set them to 1 second at boot with
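For reference, on Linux this kind of setting is commonly applied with smartctl's SCT Error Recovery Control interface. A sketch, assuming the device names are placeholders and the drives actually support SCT ERC (many consumer drives don't, and the setting resets on power cycle, hence running it at boot):

```shell
# Set SCT ERC to 1 second for reads and writes (values are in 100 ms units).
# Run at boot, e.g. from a systemd unit or rc.local; /dev/sd{a..d} are placeholders.
for dev in /dev/sd{a,b,c,d}; do
    smartctl -l scterc,10,10 "$dev"
done

# Verify the current setting on one drive:
smartctl -l scterc /dev/sda
```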
I'm not sure that fixes the problem - the issue is that the drive firmware reports a read error if it detects a bad sector, and doesn't return bad data (which, depending on your perspective, is somewhat sensible). The problem seems to be that ZFS doesn't treat a read error like a checksum failure, so although it'll rebuild from parity and continue, it assumes the sector might read successfully at some point in the future. Until a write happens, the disk won't reallocate the sector and ZFS will keep on rebuilding around it. I have a couple of Seagate ST2000's in my array (which, unlike WD, don't lie about reallocated sector counts). One has 278 sectors pending reallocation despite regular scrubs, but 0 reallocated sectors. I'll have a dig around and see how ZFS should be handling read errors like this vs. how the Linux kernel reports them, because this seems kind of important. Though I guess what I'll need is some tool that'll reliably emulate a bad hard disk.
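One present-day option for emulating a bad disk is the dm-dust device-mapper target (mainline since kernel 5.3), which fails reads on configured bad blocks until they are rewritten - much like a real pending sector. A sketch based on the kernel's dm-dust documentation; /dev/sdX is a placeholder for a scratch device, and this needs root:

```shell
# Wrap a scratch block device in a "dust" target (512-byte block size).
SECTORS=$(blockdev --getsz /dev/sdX)
dmsetup create dust0 --table "0 $SECTORS dust /dev/sdX 0 512"

# Mark a block bad, then enable bad-block emulation:
dmsetup message dust0 0 addbadblock 12345
dmsetup message dust0 0 enable

# Reads of that block on /dev/mapper/dust0 now fail with an I/O error;
# a write to it clears the bad block, mimicking a drive-internal reallocation.
```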
@wrouesnel I can assure you it is working as intended. Which isn't to say there isn't room for improvement because it was designed with the Solaris SCSI layer behavior in mind and not Linux's. They differ significantly in some regards. There are at least three legitimate cases to consider.
Right now ZFS treats all read errors as soft read errors. The data will get reconstructed on the fly but not rewritten. For Linux that's probably not ideal, because the SCSI layer is much more aggressive about retrying I/Os than its Solaris counterpart. By the time we see the read error in ZFS, Linux has already retried the I/O three times at the lower layers. This isn't true for Solaris, which will fail the I/O without any retry. In theory this is what the Linux FASTFAIL I/O flags are for, but in practice they don't work reliably for a variety of reasons. So it might make sense on Linux to treat all read errors as hard read errors and rewrite the sector in place.
Check out the
@behlendorf Up until now I was under the impression that ZFS will automatically reconstruct damaged on-disk data the moment it encounters it (given the data is available from somewhere else), with scrub being a way to force ZFS to visit the whole tree to find errors in cold data. Thinking about the three possible cases you mentioned, I wonder: why isn't bad data rewritten instantly? What is the reasoning behind keeping known bad data on-disk?
For the record, I checked and zfs-fuse 0.6.9 behaves exactly the same way, i.e., it doesn't rewrite blocks on read failure. This is perhaps to be expected, as vdev_mirror_io_done() in zfs-fuse is almost exactly the same as in ZoL (only 2 different lines, and apparently mostly cosmetic).
Seems like a failure (somewhat) of the "zfs data safety - it's better than hardware RAID" promise. Say it's RAID-Z2 and two disks go down: "officially" we were "allowed to lose any two disks", but in reality we have now lost data, since bad sectors on the remaining disks were left by ZFS to sit around wherever they appear, when they ought to have been rebuilt/reallocated/rewritten elsewhere. When the "public/stated contract" of ZFS is not met in this way, end users might be considered undermined - so I suggest the priority of this be upped from "feature" to something more significant. And also, until this "bug" is fixed, the spec/notes/documentation ought to be updated so that end users (if they ever get bitten by this) don't consider themselves undermined by broken promises.
For a clear example of this misunderstanding of the current ZFSoL contract, see: In particular this text: Evidently that last statement is a false understanding by the writer. |
Is this still the case? I've encountered a few unreallocatable bad sectors on 1 of my 3 drives in a RAIDZ1 configuration. I'm wondering how safe I am, and how urgent it is to replace the drive. I'm on:
Agreed - this is concerning, and makes me wonder about the safety of data on systems that don't immediately rewrite. Looking at the logic of vdev_mirror.c, it appears that the "issue" is:
Would it be enough to change it to
Skipping the other two checks? Similarly in vdev_raidz: zio_checksum_verified(zio);
Could become
What would be the downside?
@pdemarino we want to make sure we only perform the rewrite in the case of unexpected_errors or resilver/scrub. Otherwise we'll make a lot of unnecessary iterations around that for loop for every read. So the key question then is why isn't unexpected_errors non-zero. |
@behlendorf Understood - but I would like to challenge that. Is it really appropriate to be concerned about performance in the rare case when errors are reported by the disk stack? Wouldn't it be more reasonable to think safety first, in the rare occurrence when any disk is reporting a failure? I feel that the current policy is essentially assuming that most errors will be transient; and that scrubs will be run often enough that non-transient errors will be caught. While I wholeheartedly agree that scrubs should be run often, I agree with @zenaan and @SkyWriter that this is a departure from what is understood reading the docs - it sure is a departure from my personal expectations. My recommendation would be to force the rewrites every time an error is caught, scrub or no scrub. Unless I misunderstand the algorithm, this should have no impact whatsoever on performance on a healthy disk, and it may safeguard data on an ailing one, albeit with a small performance cost (again, which would only be paid by a failing system). |
@behlendorf - Feedback understood re: my recommendation. That was asinine on my part, I misread the code pretty bad. Now, however, I'm really confused. Given that the opening of vdev_mirror_io_done counts as unexpected errors any children with an error that weren't skipped, how is it possible that a read error slips by? They should all trigger the code on the bottom. In other words - how can a read not set the error flag, yet return no data? I've been following the code all the way down to the BIO submission, but couldn't find the point where that choice is made. |
@pdemarino yes exactly. That's the mystery, I'm glad we're on the same page. I've only skimmed the code but it's not obvious to me either how |
@behlendorf Yet it's definitely happening. I have a RAIDZ-2 array solely dedicated to ZFS, and two drives report through SMART significant numbers of pending reallocations - which means read errors not followed by rewrite attempts. Now, I see (at least) three possible explanations.
Would it make sense to instrument the RAIDZ vdev code to try and tell these three cases apart? I have a slowly failing disk that I can use to try and map what's going on. Alternatively, could an analysis of dmesg and/or zpool history help? Happy to share anything that could help. |
I just want to add that I have a couple of drives with pending reallocations that don't disappear even after filling the whole drive with random data multiple times (using dd), so I don't know if that's always proof of the sectors not being rewritten. I literally can't do anything to make those pending sectors go away. Reading the whole drive also does nothing to them. They don't report any errors, but the smart values stay the same, with pending sectors. I'm not saying this is what's happening here, but I wanted to let you know this is possible. If it's because buggy firmware or something, I don't know. These disks are old (500GB and 160GB), and it's worked fine the other times I've tried it in different hard drives. |
isn't that exactly how marking a sector for reallocation should work (transparent from the os)? note there is a difference between marking for allocation and offline uncorrectable - the latter fails the read while the former may still return valid (albeit possibly corrupted) data. if the read would have returned corrupted data, then zfs checksumming would have kicked in. if that did not happen, then the sector read returned valid uncorrupted data, BUT the controller tried hard to read that (which you see in the smart log). that makes me think this issue is a feature request: integrated smartd polling for configurable early drive faulting when encountering pending or reallocated sectors. |
Just wanted to point out a few things:
All of this fully lines up with some of the experiences described here if errors were either corrected by a HDD or if they were found in an empty part of the media. However, the (almost 4-year old) statement by @behlendorf regarding the soft errors makes me think that may be ZFS is really not what I want to use for my data. On a positive side, I was able to obtain a collection of 6 SATA HDDs which produce "Pending Sectors" pretty much on demand, while still perfectly operational otherwise. I'll try setting up something to see if RAIDZ can really lose data due to gradual bit rot. References:
|
At work I ran a set of 10 SATA drives that had an unusually high URE rate like that for a few years in raidz3 but never had any bit rot. Although I did weekly scrubs which each week there were repairs made. |
On Tue, Dec 20, 2016 at 09:12:35PM -0800, GoofHub wrote:
All of this fully lines up with some of the experiences described
here if errors were either corrected by a HDD or if they were found
in an empty part of the media. However, the (almost 4-year old)
statement by @behlendorf regarding the soft errors makes me think
that may be ZFS is really not what I want to use for my data.
Until the bug is fixed, yes, it is very disconcerting to use zfs on
linux if you value your data.
On a positive side, I was able to obtain a collection of 6 SATA
HDDs which produce "Pending Sectors" pretty much on demand, while
still perfectly operational otherwise. I'll try setting up
something to see if RAIDZ can really lose data due to gradual bit
rot.
This is a very positive note. Conceptually, the problem should be
easy to repeat artificially - create a zfs "module" to flip a bit (or
an entire sector) in a RAIDed file, then read that file.
Then verify that the file is not reallocated from its RAID parity
bits from other drive(s).
Once a simple/straightforward set of steps to reproduce is
established, then a programmer has an easy target for fixing.
|
On Wed, Dec 21, 2016 at 06:40:27AM -0800, John M. Drescher wrote:
At work I ran a set of 10 SATA drives that had an unusually high
URE rate like that for a few years in raidz3 but never had any bit
rot. Although I did weekly scrubs which each week there were
repairs made.
ZFS on Linux (just making sure)?
|
Yes |
So, currently, ZFS could lead to data-loss (or punctured arrays) on mostly-read raids ? As example:
As ZFS is COW, probably, Now disk1 fails totally and a resilvering is made. Result: data loss due to COW and missing forced rewrite by ZFS. This is something that mostly all HW RAID controller does automatically during their "patrol read" or "consistency check" scans. In case of URE, all raid controllers rewrite to the same sector forcing HDD to reallocate it to a new location with correct data coming from the other survived disks. |
This is why you scrub. |
Yes, but scrubs doesn't trigger any rewrites, or this issue would be totally non-sense, right ? And even by scrubbing, the unreadable sector won't be reassigned. Will ZFS "move" |
They do if the data read does not match. |
In other words, this issue is wrong and should be closed or am I missing something? |
I am pretty sure that was not talking about scrubs or resilvers. |
scrubs and resilvers in this failure mode still don't rebuild, because it shows up as a read-error, not a checksum error AFAIK. With modern storage hardware you almost would like to promote all read errors to be defacto checksum errrors to force a rewrite attempt - since disks start dropping for me once I get write failures in ZFS (which is generally the point at which that disk is "done"). |
The problem is that read error, as seen by ZFS, is not the same as a medium error. Rewrite will only attempt to repair a medium error. |
Wow this has been around for 4 years now? Just found this now, and I think I've already experienced data loss last year caused by this issue. Hoping for improvements soon, upvote ^^ |
I think this should be placed on a top priority list |
On Thu, Jul 06, 2017 at 12:05:29AM -0700, GUEST.it wrote:
I think this should be placed on a top priority list
ZFS is famous for data integrity and without this issue fixed, data loss could happen easily
Agreed - it seems possible that the problem really only arises on
Linux where the SCSI/disk subsystem works a bit differently to
Solaris (no real surprise there).
As we know this is a volunteer project, so let's be really grateful
to those devs who do put in their time - without them, we would not
have ZOL in the first place :)
|
In Solaris this isn't possible? Why? |
On Thu, Jul 06, 2017 at 04:00:45AM -0700, GUEST.it wrote:
In Solaris this isn't possible? Why?
I don't know that's a fact, but by my reading of those above who have
generously added their insights, Solaris at the disk/SCSI layer,
provides perhaps different "response codes" to the upper layer(s)
e.g. the zfs driver/fs.
ZFS was written on, and for, Solaris, and only later modified to
support/run on, the Linux kernel.
We can reverse the question - why would the Linux kernel, and the
Solaris kernel, have identical semantics --at the disk driver layer--
?
There is no reason that they should be identical, and indeed that
would be extraordinarily surprising if they were. The various SYSV/
BSD, and etc Unix standards were created to standardize the higher
layers in Unix - as far as I know there is no "Unix driver ABI or
etc" standard :)
|
Has the patch proposed by @pdemarino been tested by anyone? I have a pool which experiences UREs on one disk which do not get fixed (even when scrubbing). |
Is possible to know if this (critical for data reliability and safety) issue is being worked by someone? ZFS is famous for data reliability, but this issue describe a potential data loss |
On Fri, Jul 14, 2017 at 12:11:21AM -0700, GUEST.it wrote:
Is possible to know if this (critical for data reliability and safety) issue is being worked by someone?
Only if a) someone spends the time to look at it, and
b) posts a note here that they're doing so.
ZFS is famous for data reliability, but this issue describe a potential data loss
Indeed.
If anyone is relying on this commercially, perhaps a Patreon or
similar bounty could inspire someone.
Other than that, this is a volunteer project, and where it's at
already is a great contribution to the GNU/Linux etc ecosystem.
In the meantime, until a fix is implemented, the community should put a big red warning on the first page of ZFS on Linux website to explain this potential issue. Is Oracle still funding ZFS on Solaris? With Solaris slowly dying, I guess Oracle is not interested in funding ZFS anymore; where would the revenues come from? BTRFS failed me (even with RAID1) and I was planing to got with ZFS on Linux, but I feel lucky to have found this issue before I did. The community should be transparent and advertise this problem. My only option left for a RAID software solution is MDADM and they broadcast on their first page a warning about a potential issue. That is the way to do it and ZFS on Linux should do the same. |
this probably belongs to FAQ link in the front page of zol website which points to https://github.com/zfsonlinux/zfs/wiki/FAQ |
I believe in my testing with drives with high URE rates in raidz3 that scrubs did fix this problem. I did get checksum errors for the drives and the sectors appeared to be reallocated in the SMART data. The system is powered off since I retired it a few months ago. However next week I could possibly power it on an do some testing. |
BTW, I had the system running like this for a few years with actual data on it (mostly downloads and isos for our software several hundred GB of data) and never lost any data. When I retired it I did a send and receive to a new zfs server and there was no issue with that. |
yes, but as wrote above, scrubs are run at scheduled intervals. If you scrub weekly, you can react to URE only weekly. If you hit an URE after the last scrub and then you loose a disk before another scrub, you loose data. I think that is much safer to force a reallocation/rewrite as soon as an URE is detected. |
On Fri, Jul 14, 2017 at 03:11:29PM +0000, GUEST.it wrote:
yes, but as wrote above, scrubs are run at scheduled intervals. If
you scrub weekly, you can react to URE only weekly. If you hit an
URE after the last scrub and then you loose a disk before another
scrub, you loose data.
I think that is much safer to force a reallocation/rewrite as soon
as an URE is detected.
Of course that's safer - the point is, this is not a voting
competition, and not even a "compel the volunteers" exercise - it's
just the way ZFS should (and is supposed to) work - there is no
disagreement that this is a bug.
The only competition is who will be the first to either volunteer and
fix this, or get paid and fix this. Sooner or later, someone will do
so :)
|
Used to test if data is really re-written on zio read & checksum errors. This came about because of a discussion in: openzfs#1256
I've created a branch (
To run: Clone and build my zfs branch. Within that workspace, run Results: If a scrub read IO has a checksum error, it reconstructs the data from the good drive, and re-writes the sectors where the bad data was. This increments the checksum counter in zpool status. If a regular read IO has a checksum error, it also reconstructs the data from the good drive, and re-writes the sectors where the bad data was. This does not increment the checksum counter in If the data could not be re-constructed, or if the re-write was not successful, then you would see the error counters increment. Furthermore, consider that when you read a file from a mirrored pool, zfs will quasi-round-robin the reads between the disks to get better performance. So if you have a 1MB file that's totally corrupted on one disk in a mirrored pair, and you cat the file, then roughly half the data you read will be corrupted. That means that each time you read the file completely, you're going to reconstruct half of the remaining bad data. You see this in TL;DR I think it's up for debate whether these self-healing IOs should be included in the |
On Wed, Jul 26, 2017 at 12:44:25AM +0000, Tony Hutter wrote:
I've created a branch (`tonyhutter:testing-read-failures`) with a test script that does the following:
...
TL;DR
Your data is safe. It is getting re-written on every checksum and read error. ZFS just doesn't report the errors if it fixed them without issue.
Very nice report!
I think it's up for debate whether these self-healing IOs should be included in the `zpool status` read/checksum counts. There may be good historical reasons why they're not included. We may also want to consider retooling the self-healing routines to be even more hardcore. For example, @behlendorf suggested to me that it may be possible do a mini-scrub on just the file that had the errors. That's a discussion for another day though (outside of this bug).
Great stuff, will certainly set folks' minds at ease :)
|
@tonyhutter Thank you very much for testing and confirming that it works as intended! |
@tonyhutter Thanks for the detailed answer.
One question: So for e.g. an hdd sector gets unwriteable/damaged, zfs reads the data, (determine checksum error), and it tries to re-write the data to the same bad sector? (I thought zfs is also capable like chkdsk or badblocks to mark these sectors as unusable and skip them in future. Or is that one layer below zfs? Sorry for the dump question). |
@jumbi77 we assume the disk will internally remap the bad sector to another physical location on disk. See https://kb.acronis.com/content/9105. |
ZFS can't relocate a sector in that regard. That's supposed to be the drive's responsibility to have spare sectors and work around physical defects while providing the illusion of a pristine surface. When it can't, you should replace the drive anyway. |
I think it would be wise for 24/7 production systems to not rely on zfs doing this healing and replace a disk as soon as you get read or write errors. Does this PR implements a way to inform user that a healing has taken place? Otherwise, it will make things worse, delaying the moment the ill drive should be replaced. |
This isn't a PR, just an observation that self-healing doesn't happen in all situations when there was an expectation it did. |
I'm going to close this bug because I think we've confirmed that ZFS is in fact re-writing the sector on read errors. I did send an email to the zfs-devel mailing list on whether we should be including the self-healed IOs in |
To build on @tonyhutter's previous comment I wanted to expand on why failed speculative IOs are not counted in the For performance reasons these IOs bypass the normal interlocks in the ZFS pipeline. This means that's it possible they can fail with a checksum error, for example if a file is removed and the blocks are reused in a new file before the speculative IO is serviced. This is a false positive which shouldn't be reported. The design is very careful to only report real hardware errors which have percolated up. We want people to be able to trust that when ZFS reports an error, it is a real error. There are a couple ways we can weed out the false positives and only report the real errors. That's something we're going to look in to. |
I've noticed an issue when using ZFS with commodity hard disks: when these drives develop a bad sector on read, they don't reallocate it right away. Unfortunately, neither does ZFS it seems - which regards such errors as temporary read errors and rebuilds from parity data.
As a result, these types of sector errors persist for a very long time due to the copy-on-write semantics and RAIDZ/mirror vdev levels, which is undesirable behavior because you end up relying on parity data more often even if the error is transient/manufacturing related.
There are ways to force a reallocation, but they involve trawling logs and figuring out sector locations, which is a pain and not the type of automatic that's really desirable. What would be much better would be if ZFS could be convinced to treat a sector read error as being similar to a checksum failure, and force a rewrite of that sector - which would trigger sector reallocation on the underlying drive.
This would help identify actual failing drives more quickly, and help keep incidental read errors down. I'm thinking the exact logic would be a flag to a zpool to treat the first read error as a checksum failure, and the 2nd as a real error.
The text was updated successfully, but these errors were encountered: