the return of "Unaligned write command" errors #10094
Some details. OS is CentOS 7, kernel 3.10.0-1062.12.1.el7.x86_64, zfs-0.8.3-1.

Problem disks:

```
# ./smart-status.perl
Disk      model                   serial        temperature  realloc  pending  uncorr  CRC err  RRER
/dev/sda  WDC WDS100T2B0A-00SM50  195004A00B9C  26           .        ?        ?       .        ?
/dev/sdg  WDC WDS100T2B0A-00SM50  195008A008F8  26           .        ?        ?       .        ?
# more /sys/class/block/sda/queue/physical_block_size
512
# more /sys/class/block/sdg/queue/physical_block_size
4096
# zpool status zssd1tb
  pool: zssd1tb
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:05:55 with 0 errors on Mon Mar  2 13:24:15 2020
config:

        NAME                                         STATE     READ WRITE CKSUM
        zssd1tb                                      ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            ata-WDC_WDS100T2B0A-00SM50_195004A00B9C  ONLINE       0     0     0
            ata-WDC_WDS100T2B0A-00SM50_195008A008F8  ONLINE       0     0     0

errors: No known data errors
```

Typical error from dmesg:

```
[1974990.399004] ata8.00: exception Emask 0x0 SAct 0x8000000 SErr 0x0 action 0x6 frozen
[1974990.399009] ata8.00: failed command: WRITE FPDMA QUEUED
[1974990.399013] ata8.00: cmd 61/08:d8:e0:27:70/00:00:74:00:00/40 tag 27 ncq 4096 out
                          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[1974990.399015] ata8.00: status: { DRDY }
[1974990.399018] ata8: hard resetting link
[1974990.707014] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[1974990.710637] ata8.00: configured for UDMA/133
[1974990.710690] sd 7:0:0:0: [sdg] tag#27 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1974990.710693] sd 7:0:0:0: [sdg] tag#27 Sense Key : Illegal Request [current] [descriptor]
[1974990.710695] sd 7:0:0:0: [sdg] tag#27 Add. Sense: Unaligned write command
[1974990.710698] sd 7:0:0:0: [sdg] tag#27 CDB: Write(10) 2a 00 74 70 27 e0 00 00 08 00
[1974990.710699] blk_update_request: I/O error, dev sdg, sector 1953507296
[1974990.710703] zio pool=zssd1tb vdev=/dev/disk/by-id/ata-WDC_WDS100T2B0A-00SM50_195008A008F8-part1 error=5 type=2 offset=1000194686976 size=4096 flags=180ac0
[1974990.710708] ata8: EH complete
```
Such errors almost always denote a failing SSD/HDD. The system even tried to reset the SATA link, and the write still failed.
shodashok - failing HDD: no (this is an SSD). Failing SSD: unlikely: (a) the SSD is brand new, (b) failing devices usually throw read/write I/O errors, not "illegal request - unaligned write command", (c) bum SSD firmware (hello, Intel!): unlikely, both SSDs have the same firmware and only one throws the errors, (d) overheated SSD malfunction: no, the SSD is at room temperature and cool to the touch. K.O.
Additional information: I moved the two problem SSDs to different SATA ports; now both report a physical sector size of 512 (one was reporting 4096). Now let's see if the errors go away. K.O.
@dd1dd1 I had an identical problem on a Crucial MX500 SSD. After some days, the SSD controller failed catastrophically (with the SSD no longer detected on the BIOS screen). So I really doubt it is a ZFS problem. The fact that simply swapping SATA ports changed the reported sector size is really suspicious.
I'm experiencing the same issues. These errors appeared out of nowhere. I highly doubt that all 4 disks are dying at exactly the same time.
The errors appear randomly for the four disks, but only one is really failing badly within ZFS:
Without ZFS, the disk sdd (slot4crypt) works without any issues. SMART shows no issues at all.
Uhm. Interesting. #4873 (comment) seems to work for me, too. But WHY?!
Finally, this turned out to be a system firmware issue. See https://review.coreboot.org/c/coreboot/+/40877

You can try this:

```
echo maximum_performance | sudo tee /sys/class/scsi_host/host*/link_power_management_policy
```

If that solves the problem, it's pretty likely that you have the same problem. Try to get a BIOS/UEFI update from your vendor.
That's interesting. My "unaligned write" errors have gone away after I moved the disks from an oldish socket 1366 mobo to a brand new socket 1151 mobo. It could have been the old BIOS on the old mobo not initializing the SSD SATA link correctly... K.O.
Had the same problems with brand new 14 TB drives on a Skylake motherboard. Setting GRUB_CMDLINE_LINUX="libata.force=3.0" fixed this issue for me. This forces the SATA controller to run at 3 Gbit/s - for a rusty drive still enough. I have not tested the maximum_performance switch for the ports. Waiting for a firmware/BIOS update... let's see.
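For reference, that change can be scripted. A minimal sketch, assuming a Debian-style /etc/default/grub; the helper name is hypothetical, and the file path is a parameter so the edit can be rehearsed on a copy first. Afterwards run update-grub (or grub2-mkconfig on RPM-based systems) and reboot. Note that the kernel's parameter documentation spells the SATA speed limits as 1.5Gbps and 3.0Gbps, so 3.0Gbps may be the safer value to use:

```shell
# Hypothetical helper: append libata.force=3.0 to GRUB_CMDLINE_LINUX in the
# given grub defaults file. Takes the file path as an argument so it can be
# tried on a copy before touching the real /etc/default/grub.
add_libata_force() {
  sed -i 's/^GRUB_CMDLINE_LINUX="\([^"]*\)"/GRUB_CMDLINE_LINUX="\1 libata.force=3.0"/' "$1"
}
```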
I have the same issues on an AMD X399 Threadripper board with two 16 TB Seagate Exos SATA drives. I get errors mostly on one of them. I checked them recently with the Seagate Windows tool and both seem fine. Currently I suspect the HDD enclosure and cabling cannot handle full SATA 6 Gbit/s speeds in a sustained manner; I'll rework the case this weekend, move disks around, and see if it helps. link_power_management_policy was already set to maximum_performance.
I'm getting this too on four separate machines now. Swapping drives, cables, drive bays, etc. didn't work, but putting all the drives on an HBA solved it. Supermicro H11DSi motherboard on all 4 machines; it seems to happen only when using the SATA controller on the motherboard. BIOS already upgraded to the latest. Using ZFS via latest Proxmox 6.2 (ZFS root/boot). A scrub can trigger it every time, though some of the machines can go a while without issues if they're not stressed. I've reached out to Supermicro, but they don't have anything helpful yet. I'm primarily seeing this on HGST 10 TB SATA drives, though I did see it a few times on an Intel SSD.
I switched cabling and used a new SATA drive cage; so far my problems are gone.
Yeah, that's what's weird about this one. I've resolved similar problems that way in the past, but it's not resolving it this time, on multiple different systems.
Can confirm too that this happens on at least three similar hosts, all with an ASRock TRX40D8-2N2T mainboard (SATA III via TRX14, and/or via ASM1061) and varying hard disks and SATA cables. I can't see the same happening with FreeNAS/TrueNAS running on the same setup, nor when the setup is altered so that the disks are attached via an add-in LSI HBA card instead of the mainboard connectors. Also confirming that the failure reproducibly happens only after many gigabytes of data have been sent over the SATA wires. The failures seem to suddenly affect multiple ATA links concurrently. My guess at this point is a kernel bug with some HBAs.

To reproduce: boot a Debian Buster (or Proxmox) from an additional disk, build a raidz2 over four disks attached to the TRX40 or the ASM1061, then concurrently run

Found with Proxmox kernel 5.4.73-1-pve, and still present in 5.4.114-1-pve.
I can reproduce this with a simple zpool scrub. It's true that it only happens after many GBs of data transfer, but I can trigger it every time. I was also able to eliminate the problem with an add-in LSI 9300-4i HBA, but of course that's not a true fix. However, I had to get our servers stable, so I bought HBAs for all of them, and I haven't seen a single problem since.

Edit: I also had to update the HBA firmware to the latest in order to avoid hard reboots under the same type of load. I would assume that's unrelated (it was NOT the same "unaligned write" error in dmesg; it was a hard freeze/reboot), but I thought I should mention it. Anyway, an HBA with fully updated firmware does seem to have eliminated the problem as a workaround across the several servers I was having this issue with.
Update: Kernel 5.10¹ from Debian Buster Backports and kernel 5.11 from the Proxmox pvetest repo are stable on the TRX14 SATA links. They both continue to fail on the ASMedia ASM1062², but less spectacularly so in dmesg.

¹ Debian Backports 5.10.0-0.bpo.5-amd64
My problems returned in a different way on my personal host, where I pass through PCIe devices incl. GPU. This is the host where I originally had the problems with the two 16 TB SATA Exos drives, where the issues seemed to go away after I replaced the disk cage and cabling. But it seems the problem just moved on to the NVMe port. I had one Samsung 970 EVO with 2 TB in that port, had not used it yet, and had that disk in a ZFS pool. Almost every time I scrubbed that NVMe I got the following errors on random sectors
It somehow got worse and was no longer limited to scrub problems. My Win10 guest, which passes through multiple PCIe devices from the Proxmox host, could not initialize hardware when booting up and locked the host up solid while the Windows circle rotates and Windows sorts out hardware. Sometimes it got a screen further and locked up on "applying configuration"; more rarely it got to the desktop and then locked up; and in very rare cases it locked up hours, days, or even a week later. I fiddled around for a long time and got a few hints in the console that it seemed to be a hardware initialization issue. Sometimes it wrote something into dmesg just before locking up - sometimes about USB3 adapters that were passed through, but I also noticed the NVMe device that was not passed through at all. This Samsung was counting up SMART integrity errors every time I ran a zfs scrub, and I am right now far over

In the end this was the single issue that was plaguing me for the last few months, where I needed to hope and gamble to get my guest up and running. I first thought it was related to the kernel version of Proxmox and the ZFS module, as there were some known issues that locked up the host in the latest kernel.
This started happening to me pretty regularly on a Supermicro virtualization server (X10SDV-8C-TLN4F). I have 4 500 GB Crucial MX500s in a 2x2 mirrored pool with one hot spare. Every couple of weeks I'm seeing the hot spare kick in. The drive that dies always dies with the error mentioned at the top of this issue (unaligned write). Soft resetting the host bus doesn't work. Unplugging the SATA cable doesn't work. You have to unplug power to the drive and re-plug it to get it back up. All 5 drives are brand new (I replaced the old SSDs when this problem cropped up, thinking they were failing... guess they weren't). It's not always the same drive that this happens with (it's happened across 3 different ones). The old drives were Samsung EVO 860s and had the same problem, so it's not manufacturer specific. It's still running the original BIOS, so I may try updating that to see if it makes a difference. This smells like a kernel bug, though, rather than a ZFS bug, given that the failure is at the ATA level - unless ZFS can write something that can result in an unaligned write.
I do have TRIM enabled on the pool. Perhaps it could be related to this: #8552, though these aren't Samsung drives. I can try the noncq setting on the next reboot, when I can take some downtime.
Also, you can disable NCQ without a reboot using this technique: https://unix.stackexchange.com/a/611496 |
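For reference, the linked technique boils down to writing 1 to each disk's queue_depth attribute. A minimal sketch, assuming the standard sysfs layout; the function name is hypothetical, and the sysfs root is a parameter purely so it can be tried against a fake tree. On a real system run it as root with /sys/block, and note the change does not survive a reboot:

```shell
# Write 1 to every sd* disk's queue_depth, which effectively disables NCQ.
# The sysfs root is an argument (use /sys/block on a live system).
disable_ncq() {
  local root="$1" f
  for f in "$root"/sd*/device/queue_depth; do
    [ -e "$f" ] && echo 1 > "$f"
  done
}
```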
Disabling NCQ with the above technique (the technique that doesn't require rebooting) was not effective for me. I'm going to try disabling autotrim and just leave it to the monthly manual trim (which I believe should run in a couple of hours since the default is the first Sunday of each month). |
Nope. That didn't work either. :-P I'm starting to suspect failing hardware for me now too (or a kernel bug in a recent update). To my knowledge, the D-1541 Xeons are an SoC, so all the SATA controllers are actually in the CPU. It seems weird that it would fail, but there must be some components on the MB. I'm moving the drives to the backplane to see if that resolves the issue. I only had them on the internal ports for speed, since they're individual SATA III ports vs. the shared SATA II backplane.
@livelace thanks for the writeup. In my case though the issue predates zstd, for whatever that's worth. |
Yep, thanks @livelace. Doesn't apply in my case either though since the current problem I'm having is with Crucial MX500's (5 of them, all brand new, 2x2 mirrored pairs and a spare). Also using lz4 rather than zstd for me. |
I do take a bit of a performance hit, but so far everything is working on the backplane, whereas every day for the past week I was getting at least one drive failure. Prior to the switch yesterday, I had 4 fail at once, which required me to reboot everything. I guess it's possible that I have failing SATA ports on the motherboard, but I don't think it's likely. Also, this started happening after applying both a kernel update and a ZFS update. I had been holding off on moving from 0.8.4 to the latest, and on the kernel update, because I wanted the same ZFS version as my znapzend container. I finally got around to making a znapzend container based on the Proxmox repos, so I made the ZFS and kernel jump just before this started happening. I'm not sure how best to narrow this down to a ZFS or kernel bug, but it seems like it's one of those.
Huh. These are the same symptoms from years ago on the exact same MB I'm using, though with Intel SSDs instead of Crucial. In that case it was the drive firmware that was updated to fix it. Crucial doesn't appear to have new firmware for the MX500s I'm using, though:

Guess I'm just stuck with the backplane for the time being (or until I upgrade the machine?). Meh, it makes drive maintenance easier, I suppose.
I had a very similar issue, that is failed writes with the dmesg message: My Hardware configuration introduces something different from the other cases:
I was having the issue with scrub on both pools. The way I solved it, or at least hope to have solved it, was by backing up all data, recreating the pools, and restoring the data. There is something that probably needs to be confirmed: I have no doubt that the ZFS team verifies that a pool survives a ZFS upgrade. The question is whether there could be a corner case where strange issues happen. I mean, ZFS is a hugely complicated filesystem, and it's conceivable that something could have been introduced. Thank you.
And after a month of monitoring, I am now sure. I also confirmed this issue comes back when I turn it back on. Just for your information.
Thanks for that information, I'm setting that right now! |
Also encountered this. I set

I had already ruled out the SATA controller itself as the issue, as the same hard drive encountered this same issue no matter whether it was plugged into the motherboard's SATA ports or the HBA card's ports.
The latest comment in that bug mentions that updating the drive firmware fixes the issue (for Samsung EVO 860/870 drives). Is it safe to do the drive firmware update without taking the whole machine down by just offlining one drive at a time with |
Very informative read, a bit long, but worth the time. I was struggling to isolate some errors I was seeing. Given that smartmontools' smartctl was reporting ICRC ABRT errors for multiple drives, I took the test effort back to the lowest common level and started testing. The interesting thing was, it seemed like a power-level issue or a backplane board issue in my storage frame, but as it turned out it was one of the internal cables from the backplane board (1 of 2) to the eSATA port transition out of the storage frame case, which is an 8-bay eSATA tower.

For those that might be interested, below is the test methodology I used... WARNING: I DID NOT NEED TO MAINTAIN THE DATA IN THE GIVEN STORAGE FRAME. The following steps will overwrite existing data, be careful.

I added a multiplex PCIe card/adapter; that did not solve the issue. So the mainboard eSATA ports and the PCIe card/adapter both seemed fine, but I was still getting random read/write errors. So I disabled NCQ - still errors. Then I replaced the external cables from server to storage frame - still errors. I crossed the internal SATA cables of the backplane, since it was a split design with two boards, each supporting 4 devices. The issue at first seemed to move from backplane 1 to backplane 2, but as I did a bit more testing, I started getting reports of bad sectors on drives I believed fine, which were passing various SMART tests via smartmontools.

I pulled some additional drives from my spare parts and swapped 4 drives on backplane 1; now errors were reported on the drives just swapped in. So I swapped in more drives - still errors on backplane 2 as well. I did more SMART tests; the drives seemed fine, and when used on a different system there were no errors. So I replaced the internal SATA cables as well, and things seemed to stabilize. Then I used 'dd' to do an exhaustive random write to all sectors, first on just the drives in backplane 1 - no errors. About 2 hours later, still OK. That is a good sign.

Then I did the same test with the same set of now-believed-good drives and the new internal SATA cables (once connected to backplane 1, then connected to backplane 2) - even better, no errors. So now I knew the external cables and internal cables seemed good, and the backplane boards seemed good. After 2 hours of constant random writes to all 8 drives, still no errors. I will let the test continue for a couple more hours, but when errors occurred it only took minutes to about an hour to get 10s to 100s of errors across all 8 drives. This also pushes the power supply of the storage frame, since all 8 drives are racing to slam data to sectors exhaustively.

Oh, minor, but I did confirm that the write cache is off during the dd tests. You might want to make sure you set the write cache on or off depending on your use/test case; in my case I need data saved, not performance, so the write cache stays off. I still need to re-enable NCQ to confirm everything is completely legit, just to do that last step of validation.

Of course, setting up an mdadm RAID set (say one RAID 5 set per backplane), or setting up a ZFS pool per backplane or across all 8 drives, would also work as a test scenario, but using 'dd' was easy, and I wanted to make sure the power supply was stable. Even FIO would be applicable, now that I think about it.

Why the errors? It seems the internal SATA cables from the backplanes to the external eSATA port sockets just aged badly. The internal case temperature gets pretty warm, even with fans and venting, and the airflow from the power supply and drives - you guessed it - goes right through the internal SATA cables to escape. The case has rear fan exhaust but no top exhaust; if it were possible I would add a vent or, even better, a top exhaust fan.

Hope those that find this find it helpful.
I only started seeing this error after upgrading from Ubuntu 18.04 to 22.04 and rebuilding the pool. I've replaced SATA cables and some disks, and the issue persists. I suspect it has something to do with how non-Solaris ZFS currently handles NCQ. I stumbled across this thread, which quotes this article from 2009, where:
I am not sure how to verify whether what he says would be true for my install, how to change that parameter in Ubuntu/Linux, or what the optimal queue depth would be, if anyone has ideas. The OpenZFS docs on command queuing mention that ZFS checks for command queue support on Linux with:
The output for my disks looks like:
I am a bit more convinced that the issue is somehow related to the above, since @bjquinn was able to solve the problem by switching to SAS-capable HBAs which presumably use a SAS-to-SATA adapter in his case, and makes me wonder how command queuing is handled with his controllers. The author of the article also mentioned that his performance issues were mostly attributed to TLER being disabled by default, and the OpenZFS docs on error recovery control mention this as well and recommend writing a script that enables it on every boot with a low value. With the Hitachi/HGST drives that I have, none of them accept values lower than 6.5 seconds, which might be some capability requiring enterprise software from the manufacturer. I would think the capability exists with hdparm (can't find it, if so), but setting ERC with smartctl is done in units of deciseconds, if anyone else needs or wants to set this:
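A sketch of the usual smartctl syntax for this, with the timeout given in deciseconds (70 = 7.0 s, i.e. the 6.5 s minimum mentioned above rounded up); /dev/sdX is a placeholder, and in this sketch the command is only printed rather than executed:

```shell
# Build (and print, rather than run) the smartctl command that sets the SCT
# error recovery control timeout for both reads and writes, in deciseconds.
erc_cmd() {
  printf 'smartctl -l scterc,%d,%d %s\n' "$1" "$1" "$2"
}
erc_cmd 70 /dev/sdX   # prints: smartctl -l scterc,70,70 /dev/sdX
```

Since the setting typically does not survive a power cycle, the printed command is the sort of thing to put in the boot-time script the OpenZFS docs recommend.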
I'm also starting to suspect the issue may have something to do with enterprise hardware being sold to consumers, where some features come disabled by default given a certain hardware setup, thinking of the choice of controller in @bjquinn's case. I also have some SED drives that require TCG Enterprise to access most of the features (and probably better TLER times), and the SED features are only partially available using hdparm and smartctl. The Wikipedia page on NCQ is also informative:
For the moment I've set my onboard controllers to IDE mode from AHCI, and configured TLER as mentioned above to see if this helps at all. |
None of that was helpful. I ended up getting an HBA card as others mentioned here previously, and the errors have been gone now for more than a month. I don't mind the investment too much, but I'd love to know technically why this has become a necessity for ZFS. I never had these issues several LTS releases ago.
@bjquinn @stuckj Can either of you tell me what you used to flash your HBAs? I'm running Ubuntu 22.04 and I'm having trouble figuring out what utility to use. I have an LSI 9305-16i which has been running great until some recent kernel update bricked it, and I'm hoping a flash will fix it (HBA card currently fails to load its BIOS). |
Yikes. I flashed mine years ago. I don't recall offhand, but I believe I followed a thread on the TrueNAS forums (I was using ESXi + FreeNAS at the time). It may have been this one? https://www.truenas.com/community/threads/detailed-newcomers-guide-to-crossflashing-lsi-9211-9300-9305-9311-9400-94xx-hba-and-variants.55096/page-3 Best of luck. |
This is from my notes, though like @stuckj I haven't tried this in quite some time.
|
This happened to me on a Coreboot ThinkPad T430 with 2x Crucial MX500 (one is in the Ultrabay) on Debian 11. Setting

This next paragraph is more speculation, as my setup is a bit exotic - a ZFS mirror on a laptop with one of these drives in the Ultrabay: I ran Crucial MX300 disks in this very machine in the same configuration before and didn't have these issues. Also, I noticed that the errors always happened on the MX500 with firmware

I really don't like experimenting with SSDs, and if MX300s had been available at the time I was buying these, I'd have gone with them instead. Every one of the recently bought SSDs I owned had weird issues that required firmware updates: the MX500 has this bug, and the Intel 660p had PCIe passthrough broken until a firmware update. I never had any issue with the MX300 or MX200.
When I ran into the "unaligned write command" issue, I also had an MX500 with firmware M3CR033 (in a ThinkPad W530). I've set
On Ubuntu 20.04.2 I was unable to achieve stable ZFS operation due to "unaligned" errors and had to move to a 9207-8i LSI HBA. This initially did not work well at all; I think the first such card that I bought had issues - it was apparently a cheap Asian knockoff. I found a reputable seller who sold me a surplus HP card that is working very well. Then, ZFS being ZFS, it found some bad sectors and a bad hotswap tray - which is exactly why I was interested in ZFS in the first place - but everything is now fixed and my "unaligned" issues are over.
|
Just another "me too" here. Three spinning rust drives connected to a consumer-grade motherboard's SATA links:
Motherboard:
CPU:
Kernel:
All drives consistently pass SMART tests, but a
Setting the SATA links to |
Same for me: suffering from this issue for years now with various Samsung EVO SSDs (latest FW).
P.S.: I found this information linking the "Unaligned Write Command" to "Zoned Block Devices": Since I'm not an expert in this domain, can someone in this forum comment on this? Is there a possibility that Samsung EVO SSDs exhibit this zoned-block-device behavior? How does ZFS deal with zoned block devices? From the past, I have in mind that e.g. SMR devices are not suited for ZFS... is this still true? Nevertheless, it's only SSDs in my case...
I've definitely run into this issue myself, but I would make sure you don't have a bad HDD first, since that can also cause this problem. It's suspicious if it's only happening for one drive. Passing the SMART self-checks doesn't mean the drive is good. If you're always getting errors on the WD drive, I would check the SMART attributes for any critical attributes that have bad values. See, e.g., https://www.thomas-krenn.com/en/wiki/Analyzing_a_Faulty_Hard_Disk_using_Smartctl for an example of how to diagnose a bad HDD. Also, look at the power-on hours in the SMART attributes. Some NAS drives can last 5-7 years, but it's not a guarantee. Cheaper drives generally have a shorter lifetime (though not always). I believe WD Blue is on the cheaper side.
Have had the same problem. I have KNOWN GOOD drives that would have a failing SATA link, causing unaligned write errors. The problem was very noticeable when the drives were put in a ZFS pool; non-ZFS write loads didn't seem to kill the link, weirdly. Using the built-in SATA controller on both my B450 motherboards caused the same errors. None of the fixes in this thread worked, not even the link power management settings. Maybe for Samsung drives there are fixes, but this seems to be its own issue.

FIX:

Motherboards: Gigabyte B450 AORUS M, MSI B450M PRO-M2
To be clear, I've still not seen any errors since setting the link rate. SMART attributes look good:
|
Oh yeah, that looks like a relatively new drive. |
Just FYI, the "Unaligned write" error is a bug in Linux's libata: https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/ Basically, libata does not implement the SAT (SCSI-to-ATA) translation correctly and ends up converting ATA errors it shouldn't, producing nonsense. There is still a real error underneath (probably a link reset or similar), but the SCSI conversion is broken. This only applies if SAT is done by Linux rather than by a controller/storage system.
@lorenz Interesting, that might explain why I've gotten the unaligned write error too. I've tried everything from updating SSD firmware, maximum link power settings, and no TRIM, to upgrading the Linux kernel and downgrading the port speeds to 3 Gbps. Nothing worked; slower speed just meant it took longer for the problem to resurface. What has likely fixed it now is a cable replacement. I noticed that only 2 drives were failing, and they had a different brand of cables. The new cables are thicker too, so they probably have much better shielding. Definitely try replacing your SATA cable if you're seeing this error!

Side note: ZFS didn't handle random writes erroring out gracefully; the pool was broken beyond repair and I had to rebuild from backups.
I am in the same boat here. The system is an X570-based mainboard with FCH controllers plus an ASMedia ASM1166. I had ZFS errors galore and tried:
Also, I see those errors turning up only after a few hours of time (or lots of data) have passed. Using the JMB585 could still be an option, even if drives on the motherboard controller now show these errors, because I can probably limit SATA speed with that controller, which was impossible with the ASM1166. I will try that as a last-but-one resort if limiting link power does not resolve this. I hate the thought of having to use an HBA adapter consuming more power.

P.S.: The JMB585 can be limited to 3 Gbps. Otherwise, no change: I still get errors on random disks. I have ordered an LSI 9211-8i now. However, this points to a real problem in the interaction between libata and ZFS.

P.P.S.: I disabled NCQ and the problem is gone. I did not bother to try the LSI controller. Will follow up with some insights.
**OpenZFS for Linux problem with libata - root cause identified?**

Just to reiterate what I wrote about this here: I have a Linux box with 8 WDC 18 TByte SATA drives, 4 of which are connected through the mainboard controllers (AMD FCH variants) and 4 through an ASMedia ASM1166. They form a raidz2 running under Proxmox with a 6.2 kernel. During my nightly backups, the drives would regularly fail, and errors showed up in the logs - more often than not "unaligned write errors". First thing to note: one poster in the thread mentioned that the "unaligned write" is a bug in libata, in that "other" errors are mapped to this one in the SCSI translation code (https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/). Thus, the error itself is meaningless. In the thread, several possible remedies were offered, such as:
I am 99% sure that it boils down to a bad interaction between OpenZFS and libata with NCQ enabled, and I have a theory why this is so: imagine a time of high I/O pressure, like when I do my nightly backups. OpenZFS has some queues of its own, whose tasks are then given to the drives, and for each task started, OpenZFS expects a result (but in no particular order). However, when a task returns, it opens up a slot in the NCQ queue, which is immediately filled with another task because of the high I/O pressure. That means that sector 42 could potentially never be read at all, provided that other tasks are prioritized higher by the drive hardware. I believe this is exactly what is happening, and if one task result is not received within the expected time frame, a timeout with an unspecific error occurs. This is the result of putting one (or more) quite large queues within OpenZFS in front of a smaller hardware queue (NCQ). It explains why both solutions 6 and probably 7 from my list above cure the problem: without NCQ, every task must first be finished before the next one can be started. It also explains why this problem is not as evident with other filesystems - were this a general problem with libata, it would have been fixed long ago. I would even guess that reducing SATA speed to 1.5 Gbps would help (one guy reported this) - I bet this is simply because the resulting speed of ~150 MByte/s is somewhat lower than what modern hard disks can sustain, such that the disk can always finish tasks before the next one is started, whereas 3 Gbps is still faster than modern spinning rust. If I am right, two things should be considered: a. The problem should be analysed and fixed in a better way, like throttling the libata NCQ queue if pressure gets too high, just before timeouts are thrown. This would give the drive time to finish existing tasks.
I also think that the performance impact of disabling NCQ with OpenZFS is probably negligible, because OpenZFS has prioritized queues for different operations anyway.
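As a quick cross-check of the above, whether NCQ is currently in effect can be read from the same queue_depth attribute used to disable it (a depth of 1 effectively turns NCQ off). A sketch with a hypothetical helper name; the sysfs root is a parameter so it can be exercised against a fake tree, and on a live system you would pass /sys/block:

```shell
# Report per-disk NCQ status based on queue_depth (1 = effectively disabled).
ncq_status() {
  local root="$1" f depth dev
  for f in "$root"/sd*/device/queue_depth; do
    [ -e "$f" ] || continue
    depth=$(cat "$f")
    dev=$(basename "$(dirname "$(dirname "$f")")")
    if [ "$depth" -gt 1 ]; then
      echo "$dev: NCQ active (queue_depth=$depth)"
    else
      echo "$dev: NCQ disabled (queue_depth=$depth)"
    fi
  done
}
```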
(I am the OP; I have some experience with Linux kernel drivers and embedded firmware development.) I like how you wrote it all up, but I doubt you can bring closure to this problem. IMO, the list of "remedies" is basically snake oil; if any of these remedies were "a solution", this bug would have been closed a long time ago. I think "NCQ timeout" does not explain this problem:

I think we now have to wait for the libata bug fix to make it into production kernels. Then we will see what the actual error is. "Unaligned write command" never made sense to me, and now we know it is most likely bogus. K.O.
I did not imply that NCQ in itself allows a command to be left unfinished indefinitely. It can only be postponed by the hardware, in that it may reorder the commands in any way it likes. This is just how NCQ works. Thus, indefinite postponing can only occur if someone "pressures" the queue consistently - actually, the drive is free to reorder new incoming commands and intersperse them with previous ones; as a matter of fact, there is no difference between issuing 32 commands in short succession and issuing a few more only after some have finished. Call that behaviour a design flaw, but I think it exists, and the problem in question surfaces only when some other conditions are met. And I strongly believe that OpenZFS can cause exactly that situation, especially with the write patterns of raidz under high I/O pressure. I doubt that this bug would occur with other filesystems, where no such complex patterns from several internal queues ever happen.

As to why the "fixes" worked sometimes (or seemed to have worked): as I said, #6 and #7 both disable NCQ. Reducing the speed to 1.5 Gbps will most likely reduce the I/O pressure enough to make the problem go away, and other solutions may help people who really have hardware problems. Also, I have read of nobody so far who disabled NCQ without doing something else alongside (e.g. reducing speed as well). I refrained from disabling NCQ first only because I thought it would hurt performance - which it did not. Thus, my experiments ruled out one potential cause after another, leaving only the disabling of NCQ as the effective cure. I admit that I probably should wait a few more nights before jumping to conclusions; however, these problems were consistent with every setup I tried so far. (P.S.: It has been three days in a row now with no problems.) Nothing written here, nor anything I have tried so far, refutes my theory.

I agree there is a slight chance of my WDC drives having a problem with NCQ in the first place - I have seen comments on some Samsung SSDs having that problem with certain firmware revisions. But that would not have gone unnoticed, I bet.
Unfortunately, this patch was never applied and the issue got no further attention after a short discussion. There also seems to be no other cleanup having been done on this topic, at least I couldn't find anything related in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/drivers/ata/libata-scsi.c
I think you are correct on this. I'm also seeing the error on a system where I put a new 6.0 Gbps hard disk with 512-byte sectors in an older cartridge with old cabling (going to replace this). For this, zpool status lists 164 write errors after writing about 100 GB, and a degraded state. The hard disk's UDMA_CRC_Error_Count SMART raw value increased from 0 to 3, but otherwise it has no problems. The dmesg info also indicates a prior interface/bus error that is then decoded as an unaligned write on the tagged command:

[18701828.321386] ata4.00: exception Emask 0x10 SAct 0x3c00002 SErr 0x400100 action 0x6 frozen
The reason it never got applied is mostly that, as it turns out, this is a deeper architectural issue with libata: there is no valid SCSI error code here. Sadly, I'm not familiar enough with the Linux SCSI midlayer to implement the necessary changes.

CRC errors are not the only type of link error. You are probably losing the SATA link, which causes a reset/retraining - one of the known things libata doesn't handle correctly.
Reporting an unusual situation. I have a ZFS mirror array across two 1 TB SSDs. It regularly spews "Unaligned write command" errors. From reading reports here and elsewhere, this problem used to exist and was fixed years ago; it is not supposed to happen today. So, a puzzle.
It turns out that the two SSDs report different physical sector size, one reports 512 bytes, one reports 4096 bytes. Same vendor, same model, same firmware. (WTH?!?)
zpool reports the default ashift of 0 (autodetect).
zdb reports ashift 12 (correct for 4096-byte sectors).
So everything seems to be correct, but the errors are there.
The "unaligned write command" errors only come from the "4096-byte" SSD. After these write errors, "zpool scrub" runs without errors ("repaired 0B"). Two other ZFS mirror pools on the same machine run without errors (2x10TB and 2x6TB disks, all reporting 4096-byte physical sectors).
K.O.
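The mismatched sector-size reports above can be collected in one go with a small sketch like this (the helper name is hypothetical; the sysfs root is a parameter so the function can be tried against a fake tree, and on the live system you would pass /sys/class/block and compare the result with the ashift that zdb reports for the pool):

```shell
# Print each sd* disk's logical and physical block size as seen by the kernel;
# a 512/4096 mismatch within one mirror is exactly the symptom described above.
sector_sizes() {
  local root="$1" d
  for d in "$root"/sd*; do
    [ -f "$d/queue/physical_block_size" ] || continue
    printf '%s: logical=%s physical=%s\n' "$(basename "$d")" \
      "$(cat "$d/queue/logical_block_size")" \
      "$(cat "$d/queue/physical_block_size")"
  done
}
# on the live system, additionally: zdb -C zssd1tb | grep ashift
```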