OpenZFS for Linux interaction problem with NCQ - potential data loss #15270

Open
meyergru opened this issue Sep 13, 2023 · 8 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@meyergru

meyergru commented Sep 13, 2023

System information

Linux x64 Box

Distribution | Kernel | Architecture | OpenZFS | Drives
--- | --- | --- | --- | ---
Proxmox 8.04 | 6.2.16-12-pve | x64 | zfs-2.1.12-pve1 | 8x WDC connected through SATA

Describe the problem you're observing

There is an old issue that partly relates to this, but I think it was never classified as a bug, even though it is one and, what is worse, one that leads to data destruction.

Just to reiterate what I wrote about this in #10094 (comment): I have a Linux box with 8 WDC 18 TByte SATA drives, 4 of which are connected through the mainboard controllers (AMD FCH variants) and 4 through an ASMEDIA ASM1166. They form a raidz2 running under Proxmox with a 6.2 kernel. During my nightly backups, the drives would regularly fail (sometimes "degraded" and sometimes "failed") and errors showed up in the system log, more often than not "unaligned write" errors.

First thing to note is that one poster in the thread mentioned that the "Unaligned write" is a bug in libata, in that "other" errors are mapped to this one in the scsi translation code (https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/). Thus, the actual error message is meaningless.

In the old issue, several possible remedies were offered, such as:

  1. Faulty SATA cables (I replaced them all, no change, but I admit this could be the problem in some cases)
  2. Faulty disks (Mine were known to be good, and also, errors were randomly distributed among them)
  3. Power saving in the SATA link or the PCI bus (disabling this did not help)
  4. Problematic controllers (Both the FCH and the ASM1166 chips as well as a JMB585 showed the same behaviour)
  5. Limiting SATA speed to SATA 3.0 Gbps or even to 1.5 Gbps (3.0 Gbps did not help, and was not even possible with the ASM1166 as the speed was always reset to 6.0 Gbps, but I could check with FCH and JMB585 controllers)
  6. Disabling NCQ (guess what, this helped!)
  7. Replacing the SATA controllers with an LSI 9211-8i (I guess this would have helped, as others have reported, because it probably does not use NCQ)
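For remedy 6, a minimal sketch of how one might check which drives currently run with NCQ effectively off. It relies on the `queue_depth` sysfs attribute that is also used in the script further down this thread; the function name `report_queue_depth` and its optional directory argument are made up for illustration, and this has not been run on the system described here.

```shell
# report_queue_depth [SYSFS_BLOCK_DIR]
# Prints the queue depth of every sd* device. A depth of 1 means the drive
# is given one command at a time, i.e. NCQ is effectively disabled.
# (Hypothetical helper; the directory argument exists only to ease testing.)
report_queue_depth() {
  local root=${1:-/sys/block} f dev depth
  for f in "$root"/sd*/device/queue_depth; do
    # Skip the unexpanded glob when no sd* device exists.
    [ -e "$f" ] || continue
    dev=${f#"$root"/}
    dev=${dev%%/*}
    depth=$(cat "$f")
    if [ "$depth" = "1" ]; then
      echo "$dev queue_depth=$depth (NCQ effectively off)"
    else
      echo "$dev queue_depth=$depth"
    fi
  done
}
```

Calling `report_queue_depth` without arguments reads `/sys/block`. Writing `1` to a drive's `queue_depth` file as root turns queuing off for that drive until the next reboot.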

I am 99% sure that it boils down to a bad interaction between OpenZFS and libata with NCQ enabled and I have a theory why this is so:
When you look at how NCQ works, it is a queue of up to 32 (or to be exact 31 for implementation reasons) tasks that can be given to the disk drive. Those tasks can be handled in any order by the drive hardware, e.g. in order to minimize seek times. Thus, when you give the drive 3 tasks, like "read sectors 1, 42 and 2", the drive might decide to reorder them and read sector 42 last, thus saving one seek operation in the process.

Now imagine a time of high I/O pressure, like when I do my nightly backups. OpenZFS has some queues of its own which are then given to the drives and for each task started, OpenZFS expects a result (but in no particular order). However, when a task returns, it opens up a slot in the NCQ queue, which is immediately filled with another task because of the high I/O pressure. That means that the sector 42 could potentially never be read at all, provided that other tasks are prioritized higher by the drive hardware.
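The starvation scenario described above can be sketched as a toy model in bash. Everything here is an assumption made for illustration: the "nearest sector first" policy, the always-refilled queue, the sector numbers and the function name `simulate_ncq`. It is not a model of any real drive firmware, and it includes an optional deadline only to show how a firmware-side age limit would change the outcome.

```shell
# simulate_ncq DEPTH STEPS DEADLINE   (bash; toy model, all names hypothetical)
# One "far" request competes with a steady stream of "near" requests.
# Prints the step at which the far request is served, or "starved".
simulate_ncq() {
  local depth=$1 steps=$2 deadline=$3
  local -a sector age
  local head=0 next=1 i step best dist d
  # Slot 0 holds the far request; the other slots hold requests near the head.
  sector[0]=100000; age[0]=0
  for ((i = 1; i < depth; i++)); do
    sector[i]=$((next++)); age[i]=0
  done
  for ((step = 1; step <= steps; step++)); do
    best=-1
    # Optional deadline: a request that has waited too long is served first.
    if ((deadline > 0)); then
      for ((i = 0; i < depth; i++)); do
        if ((age[i] >= deadline)); then best=$i; break; fi
      done
    fi
    # Otherwise serve the request with the shortest seek distance.
    if ((best < 0)); then
      dist=-1
      for ((i = 0; i < depth; i++)); do
        d=$((sector[i] - head))
        if ((d < 0)); then d=$((-d)); fi
        if ((dist < 0 || d < dist)); then dist=$d; best=$i; fi
      done
    fi
    if ((best == 0)); then echo "$step"; return 0; fi
    head=${sector[best]}
    # The host refills the freed slot at once with another nearby request.
    sector[best]=$((next++)); age[best]=0
    for ((i = 0; i < depth; i++)); do
      if ((i != best)); then age[i]=$((age[i] + 1)); fi
    done
  done
  echo "starved"
}
```

In this model, `simulate_ncq 4 50 0` prints `starved`, `simulate_ncq 4 50 10` prints `11` (the deadline rescues the far request), and `simulate_ncq 1 50 0` prints `1` (with a queue depth of 1 there is nothing to reorder).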

I believe this is exactly what is happening: if one task result is not received within the expected time frame, a timeout or an unspecific error occurs, which is then reported as "unaligned write".

IMHO, this is the result of putting one (or more) queues within OpenZFS in front of a smaller hardware queue (i.e. NCQ).

It explains why both solutions 6 and probably 7 from my list above cure the problem: Without NCQ, every task must first be finished before the next one can be started. It also explains why this problem is not as evident with other filesystems - were this a general problem with libata, it would have been fixed long ago.

I would even guess reducing SATA speed to 1.5 Gbps would help (one guy reported this) - I bet this is simply because the resulting speed of ~150 MByte/s is somewhat lower than what modern hard disks can deliver, such that the disk can always finish tasks before the next one is started, whereas 3 Gbps is still faster than modern spinning rust.
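The ~150 MByte/s figure follows from SATA's 8b/10b line encoding, where every 10 bits on the wire carry 8 bits of payload. A small sketch of the arithmetic (the function name `sata_payload_mbytes` is made up for this example, and protocol overhead beyond the line encoding is ignored):

```shell
# sata_payload_mbytes LINE_RATE_MBIT
# Converts a SATA line rate in Mbit/s into the payload rate in MByte/s.
sata_payload_mbytes() {
  local line_rate=$1
  # 8 of every 10 line bits are payload (8b/10b); 8 payload bits make a byte.
  echo $(( line_rate * 8 / 10 / 8 ))
}
```

`sata_payload_mbytes 1500` gives 150, `sata_payload_mbytes 3000` gives 300 and `sata_payload_mbytes 6000` gives 600.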

If I am right, two things should be considered:

a. The problem should be analysed and fixed in a better way than just disabling NCQ, like throttling the libata NCQ queue if pressure gets too high, just before errors are thrown. This would give the drive time to finish existing tasks.
b. There should be a warning or some kind of automatism to disable NCQ for OpenZFS for the time being.

I also think that the performance impact of disabling NCQ with OpenZFS is probably negligible, because OpenZFS has prioritized queues for different operations anyway.
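To see how many I/Os OpenZFS itself is willing to keep in flight per vdev, one could list the `zfs_vdev_*max_active` module parameters. This is only a sketch: the helper name `show_vdev_queue_limits` is invented here, and the parameter directory is assumed to be the usual `/sys/module/zfs/parameters` on Linux.

```shell
# show_vdev_queue_limits [PARAM_DIR]
# Prints name=value for every zfs_vdev_*max_active module parameter.
# (Hypothetical helper; the directory argument exists only to ease testing.)
show_vdev_queue_limits() {
  local dir=${1:-/sys/module/zfs/parameters} f
  for f in "$dir"/zfs_vdev_*max_active; do
    # Skip the unexpanded glob when the zfs module is not loaded.
    [ -r "$f" ] || continue
    printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
  done
}
```

Comparing these limits with the drive's `queue_depth` shows how much larger the software queues are than the 31 or 32 NCQ slots behind them.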

Describe how to reproduce the problem

Create a raidz2, copy a large number of files to it, preferably from a fast source like an NVMe disk.

Include any warning/errors/backtraces from the system logs

Irrelevant because of another bug in the libata/scsi abstraction layer, see: https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/

@meyergru added the "Type: Defect Incorrect behavior (e.g. crash, hang)" label Sep 13, 2023
@meyergru changed the title from "OpenZFS for Linux interaction problem with libata NCQ - potential data loss" to "OpenZFS for Linux interaction problem with NCQ - potential data loss" Sep 13, 2023
@mabod

mabod commented Sep 14, 2023

I am wondering how this is related to the IO scheduler. Have you tested this with mq-deadline, kyber or bfq?
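For anyone who wants to check this on their own system: the kernel lists the available schedulers in `/sys/block/<dev>/queue/scheduler` and marks the active one with square brackets. A minimal sketch for extracting it (the function name `active_scheduler` is made up for this example):

```shell
# active_scheduler SCHEDULER_FILE
# Prints the scheduler shown in [brackets], e.g. for a file containing
# "none [mq-deadline] kyber bfq" it prints "mq-deadline".
active_scheduler() {
  local file=${1:?usage: active_scheduler SCHEDULER_FILE}
  sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$file"
}
```

Usage would be along the lines of `active_scheduler /sys/block/sda/queue/scheduler`. Writing a scheduler name to the same file as root switches it at runtime.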

@amotin
Member

amotin commented Sep 14, 2023

Do you have any evidence of command timeouts in your tests?

Last year I was specifically testing different HDDs for behavior under a mix of sequential and random reads. I saw that disks indeed prioritize sequential reads to stay more efficient and reduce the number of head seeks. But on all HDDs I tested I also saw a hard deadline between 1 and 4 seconds, depending on the model, where the firmware broke the linear I/O pattern and went on to execute the random I/Os. So there should be no timeouts from that as long as the HDD firmware is sane.

As a result of that investigation, I actually made an improvement to the ZFS I/O scheduler to explicitly delay low-priority I/Os if high-priority ones are not completing for too long: #11166. Obviously it works only in one direction, but it should still reduce the chances of the starvation scenarios you are describing.

On top of that, at least the FreeBSD ATA/SCSI stack explicitly injects non-queued commands every half command timeout interval. That forces a disk queue flush in case the drive does not do it right. Supposedly this happened on some old SCSI disks. I am not sure it is really needed these days, but it does not do much harm, so it is still there. I don't know if Linux has any similar mechanism, but it could.

In any case, the command timeouts are not a ZFS problem but one of the disk driver that implements them. ZFS itself would "happily" wait forever if it just has no other choice. And disabling NCQ is a bad idea, since only the HDD's firmware can schedule multiple I/Os more efficiently by knowing the disk's internal physical characteristics.
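One way to look for evidence of command timeouts is to filter the kernel log for libata error lines. The sketch below is a plain text filter; the function name `ata_error_lines` is invented, and the match patterns are assumptions about typical libata wording rather than an exhaustive or authoritative list.

```shell
# ata_error_lines
# Reads kernel log text on stdin and prints lines that look like libata
# errors for an ataN or ataN.MM device. The patterns are assumed, not
# exhaustive. Returns non-zero when nothing matches (grep's exit status).
ata_error_lines() {
  grep -E 'ata[0-9]+(\.[0-9]+)?: .*(timeout|failed command|hard resetting link|exception Emask)'
}
```

It could be fed from `dmesg | ata_error_lines` or `journalctl -k | ata_error_lines`. As noted earlier in this thread, the "unaligned write" text itself may not be a reliable indicator of the real cause.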

@meyergru
Author

meyergru commented Sep 14, 2023

> I am wondering how this is related to the IO scheduler. Have you tested this with mq-deadline, kyber or bfq?

No, that is a production system, so I am glad to have it working again by disabling NCQ.

> Do you have any evidence of command timeouts in your tests?

No, as I wrote, the error messages are unspecific in that they are in an "else" branch which catches whatever is not handled specifically.

> On top of that, at least the FreeBSD ATA/SCSI stack explicitly injects non-queued commands every half command timeout interval. That forces a disk queue flush in case the drive does not do it right. Supposedly this happened on some old SCSI disks. I am not sure it is really needed these days, but it does not do much harm, so it is still there. I don't know if Linux has any similar mechanism, but it could.

I do not know if that exists, but I agree that it should. And I admit it could be that the drive firmware does not set a hard deadline. I cannot investigate because I only have one type of drive.

> In any case, the command timeouts are not a ZFS problem but one of the disk driver that implements them. ZFS itself would "happily" wait forever if it just has no other choice. And disabling NCQ is a bad idea, since only the HDD's firmware can schedule multiple I/Os more efficiently by knowing the disk's internal physical characteristics.

Probably. However, I would argue that the behaviour of the underlying drivers is just as it is, and OpenZFS is potentially making assumptions about how the driver "should" behave - which it probably does for FreeBSD (which it was originally designed for), as you say, but probably not for Linux. That is why I titled the defect to reflect the interaction between OpenZFS and NCQ on Linux.

And as for the "bad idea": I would rather have a reliable array than an optimized one for the time being. But you are correct, the way to go is to fix the problem even with NCQ turned on.

@mabod

mabod commented Sep 14, 2023

Back to my question: Is this somehow influenced by the IO scheduler? I assume you are using "none". Would it make any difference if you use mq-deadline or bfq?

@meyergru
Author

meyergru commented Sep 14, 2023

> Back to my question: Is this somehow influenced by the IO scheduler? I assume you are using "none". Would it make any difference if you use mq-deadline or bfq?

I do not know if changing it would help (it is mq-deadline now), and as I said: this being a production system with over 60 TByte worth of data, I am not going to experiment on it. Every time those errors occur, I have to scrub the whole array for > 24 hours and hope that no files are corrupted after this (been there - done that). The experiments took me the last three weeks until I found that disabling NCQ would have been the fix in the first place, and I bought two now useless SATA controllers on the way.

@spixx

spixx commented Oct 1, 2023

I would like to point out that I am experiencing a similar issue. I do not run this in production (homelab), so I might be able to assist. When setting libata.force=noncq in my kernel boot line, it works "flawlessly" (running on Proxmox with an ASMedia controller).
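To confirm that the `libata.force=noncq` option actually reached the running kernel, one could check the kernel command line. A minimal sketch, with the function name `noncq_forced` made up for this example:

```shell
# noncq_forced [CMDLINE_FILE]
# Succeeds if the kernel command line contains a libata.force= option whose
# comma-separated values include "noncq", optionally prefixed by a port ID
# such as "1.00:". Similar options like "noncqtrim" do not count.
noncq_forced() {
  local cmdline_file=${1:-/proc/cmdline}
  grep -Eq '(^|[[:space:]])libata\.force=([^[:space:]]*[,:])?noncq(,|[[:space:]]|$)' "$cmdline_file"
}
```

For example: `noncq_forced && echo "NCQ disabled at boot"`. How the option gets onto the boot line depends on the bootloader setup, which is not covered here.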

@ashleyw-gh

Just a comment: I've been running OpenZFS for years, but we recently switched to Linux RAID (using md), and after disabling NCQ our throughput went up between 5- and 10-fold on a Veeam Active Full job (this is a 22-disk RAID 10 of 8 TB Toshiba drives, spinning rust). So I don't believe this issue is specific to OpenZFS; it looks like a more general one. In our case we have a cron job with an @reboot task that runs this script at boot time to make sure NCQ is disabled for all our drives.
Sadly, I don't currently have access to spare hardware to reproduce the issue on OpenZFS.

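# Disable NCQ by setting the queue depth to 1 on each drive (run as root).
for drive in sd{b..x}; do
  depth_file="/sys/block/$drive/device/queue_depth"
  # Skip drive letters that do not exist on this system.
  [ -e "$depth_file" ] || continue
  queue_depth=$(cat "$depth_file")
  if [ "$queue_depth" != "1" ]; then
    echo "disabling NCQ for $drive"
    echo 1 > "$depth_file"
  else
    echo "NCQ already disabled for $drive"
  fi
done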

@richardelling
Contributor

FWIW, NCQ has a long, sordid history of breakage, so it is not surprising that we continue to find more. Clearly there are other integration points in the Linux stack that cause problems. However, it is safe to change the sysfs queue_depth on the fly. You might consider using a udev rule instead of a systemd solution, because it would also handle the hot-plug case and you can restrict it to ATA drives. Yes, I do mean to imply that native SCSI is better than ATA; NCQ is just one area where ATA sucks rocks.
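A udev rule along those lines might look like the sketch below, wrapped in a shell function so that it can be written to a file. The function name, the rule file name and the match keys (`ENV{DEVTYPE}`, `ENV{ID_BUS}`) are assumptions for illustration and have not been tested on the hardware discussed in this thread.

```shell
# print_ncq_udev_rule
# Prints a udev rule that sets queue_depth to 1 for whole ATA disks when
# they are added or changed, which also covers hot-plugged drives.
# (Hypothetical helper; match keys are assumed, so verify them with
# "udevadm info" on a real drive before relying on this.)
print_ncq_udev_rule() {
  cat <<'EOF'
# Limit ATA disks to one outstanding command (NCQ effectively off).
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd*", ENV{ID_BUS}=="ata", ATTR{device/queue_depth}="1"
EOF
}
```

A possible way to install it, as root, would be `print_ncq_udev_rule > /etc/udev/rules.d/99-disable-ncq.rules`, followed by `udevadm control --reload` and `udevadm trigger --subsystem-match=block`.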
