PANIC in zio_data_buf_alloc locked up the file system irrecoverably #11531
Comments
|
@ericonr Thanks for reporting this. Has this happened more than once? |
|
Fortunately for me, and unfortunately for the bug report, no. I have tried forcing similar circumstances, but no such bug was triggered again. |
|
Got the same panic with zfs 2.0.3 (zfs-kmod-2.0.3-1~bpo10+1 from Debian buster backports). The system has 3 pools (2 on mirrored NVMe drives and one on ordinary HDDs with a special-class mirror device on SATA SSDs). The pools were scrubbed 2 days before (there were checksum errors on both NVMe mirror devices, which were fixed successfully). I needed to reboot the system after the panic and scrubbed all pools again without errors. The system has ECC memory and 2 of the pools are encrypted: |
Panic is coming from: …

The … I.e. something's calling …

The stack trace indicates we're passing through …

Further back in the stack we have: …

Presumably that …

Sorry, I've no idea where to go from there; it requires someone with a better understanding of this stuff to track down how this … |
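As an aside for anyone following along: the arithmetic that makes a zero size fatal in a size-class allocator is easy to see in isolation. The sketch below is a simplified stand-in (the constants and the exact check are assumptions, not copied from zio.c): the cache index is derived from `(size - 1) >> shift`, so a size of zero wraps around and produces an absurd index that the allocator's bounds check then rejects.

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative constants only -- not the OpenZFS definitions. */
#define MINBLOCKSHIFT	9			/* 512-byte granularity */
#define MAXBLOCKSIZE	(16ULL << 20)		/* 16 MiB */

/* Simplified stand-in for the size-class lookup a zio buffer alloc does. */
static size_t
size_class_index(size_t size)
{
	return ((size - 1) >> MINBLOCKSHIFT);
}

int
main(void)
{
	size_t limit = (size_t)(MAXBLOCKSIZE >> MINBLOCKSHIFT);
	size_t ok = size_class_index(131072);	/* a 128 KiB record */
	size_t bad = size_class_index(0);	/* the failure mode seen here */

	printf("index(128K) = %zu, limit = %zu\n", ok, limit);
	printf("index(0)    = %zu  <- wrapped, fails the bounds check\n", bad);
	return (0);
}
```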
|
@mattmacy - can you take a look at this? I suspect your "Consolidate arc_buf allocation checks" 13fac09 commit (dated 2020-02-27, first seen in zfs-2.0.0-rc1) introduced the problem.

The PANIC occurs in …

Commit 13fac09 changed an … I.e. from: …

to: …

I don't know the best way to fix this - perhaps just reverting the 13fac09 commit, which was nominally a code cleanup / deduplication rather than a functional change. |
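To make the suspicion concrete, here is a purely hypothetical sketch (made-up types and names, not the actual 13fac09 diff) of how this kind of consolidation can change behaviour even when it reads as a no-op: if the shared helper ends up asserting on a size cached in the header rather than on the size the caller passes in, a caller that arrives with a zero size no longer trips anything at the call site and only blows up much later, inside the allocator.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header type, for illustration only. */
typedef struct buf_hdr {
	uint64_t	lsize;	/* logical size recorded in the header */
} buf_hdr_t;

/* "Before": each caller asserted on the size it was about to allocate. */
static void
alloc_buf_old(buf_hdr_t *hdr, uint64_t size)
{
	(void) hdr;
	assert(size != 0);	/* a bad caller is caught right here */
	/* ... allocate 'size' bytes ... */
}

/*
 * "After": a consolidated helper asserts on the header field instead.
 * If the header looks sane but the caller-supplied size is zero, nothing
 * fires here and the zero travels on into the low-level allocator.
 */
static void
alloc_buf_new(buf_hdr_t *hdr, uint64_t size)
{
	(void) size;
	assert(hdr->lsize != 0);
	/* ... allocate 'size' bytes ... */
}

int
main(void)
{
	buf_hdr_t hdr = { .lsize = 4096 };

	alloc_buf_new(&hdr, 0);		/* slips through silently */
	alloc_buf_old(&hdr, 0);		/* would have aborted immediately */
	return (0);
}
```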
|
Huh. I've been working off @phreaker0's stack trace, but I just noticed that the original stack trace from @ericonr does NOT include the …

@ericonr - are you sure you were running zfs v2.0.1-1 at the time of the problem, per your original report? I guess the … |
|
@chrisrd directly from the dmesg from which I picked those backtraces:
I don't think we use particularly aggressive optimization flags, only whatever DKMS exports by default. |
|
@ericonr Assuming you still have the zfs.ko from when your stack dump was generated, let's see what's in your |
|
I don't think I have it any more, I didn't know it would be useful here... Do you think reproducing it by reinstalling a 5.10.9 kernel and ZFS 2.0.1 would be worth it? |
|
@ericonr Apologies for the delay in responding, stuff keeps getting in the way. It would be interesting, but perhaps not critical, to see the code from the actual version of the module where you saw the error. If you get around to reproducing the 2.0.1 module, perhaps copy the entire …

Do you have encryption enabled, and have you tried scrubbing your data? |
|
@chrisrd no worries :)
Yes to both. I use |
|
@behlendorf I'm no longer convinced commit 13fac09, which introduced …

As noted previously, the PANICs are coming about because we have something calling …

The last few calls from our call trace are: …

Following the call trace down from the top, in … The last parameter is … I.e. …

We've seen … And that looks like: …

I.e. we are guaranteed that …

Back to …, which calls …

From the call trace we see we call …

Tracking back, we see that …

So what on earth is happening here???

I further note that we see a very similar panic in issue #7603, with a call trace that ends in the same sequence: …

Compared to the call traces in this issue, which look like: …

Whilst the …

#7603 also mentions #8099, in which we see @linvinus had added some debugging to …

Something very weird is going on here. At this point I'm stumped - I don't think I can take this any further without someone else coming up with some bright ideas or showing me where I've stuffed up this analysis. |
|
Nice analysis. It seems to me it would be possible for …

The question is how do we end up with a zero lsize in the … |
|
I've been able to trigger this pretty reliably by running a specific workload, so if anyone wants to link me to a patch to try with some debugging instructions, I'm happy to dump more output to help track it down. |
|
I've seen this error on 2 machines which are running the same setup with Linux kernel 5.4.89, ZFS v2.0.1-1 and an encrypted ZFS mirror over 2 NVMe devices. Here are the logs.

Machine 1: …

Machine 2: … |
|
Just hit it again on a different machine with the same hardware setup: |
|
Hello, this happened to me this morning. Below are my logs. After the crash only the SSD pool was not accessible. The SSD zpool configuration is also posted below; this is an encrypted stripe of four mirrors. The operating system is Proxmox, updated and restarted 26 days ago and then updated again without a restart yesterday.
|
I believe I just hit this bug too. I am also using aes-256-gcm encryption. Same issue with tasks subsequently locking up, but after a reset everything appeared fine and a scrub completed with no errors. |
|
We hit this ourselves today on 2.0.4, 2x12 raidz2, zstd + encryption enabled. One of our programmers has been trying to follow the trace back through the source code, and says:
Meanwhile, we're looking at further debugging we can add. Switching the …

As relative ZFS greenhorns, it's not clear to us whether zero is allowed here. We thought to try and log when we see one, and include the block id, flags, etc., so maybe we could use …

I actually have the machine up right now, with ZFS stopped after the panic. I'm trying to get us set up for live kernel memory debugging without upsetting anything, so if there's anything that would be useful and you can tell me how to get it, let me know (in the next 24hrs - I can't keep this machine offline forever). |
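On the logging idea, a userspace mock-up of the shape it could take is below. Every field and function name here is a placeholder rather than an OpenZFS symbol, and an in-kernel version would go through the existing ZFS debug/log facilities instead of stderr; the point is just to record the identifying state of the buffer the first time a zero size is seen, so the producer can be chased down after the fact.

```c
#include <stdint.h>
#include <stdio.h>

/* Placeholder structure -- not a real dbuf/arc header. */
typedef struct fake_buf {
	uint64_t objset;
	uint64_t object;
	uint64_t blkid;		/* block id, as suggested above */
	uint32_t flags;
	uint64_t alloc_size;	/* size about to be handed to the allocator */
} fake_buf_t;

static int zero_size_seen;

/* Log the first zero-sized request instead of letting the allocator panic. */
static void
check_alloc_size(const fake_buf_t *b)
{
	if (b->alloc_size != 0 || zero_size_seen++ != 0)
		return;
	fprintf(stderr,
	    "zero-size alloc: objset=%llu object=%llu blkid=%llu flags=0x%x\n",
	    (unsigned long long)b->objset, (unsigned long long)b->object,
	    (unsigned long long)b->blkid, (unsigned)b->flags);
}

int
main(void)
{
	fake_buf_t b = { .objset = 1, .object = 42, .blkid = 7, .flags = 0x10 };

	check_alloc_size(&b);	/* alloc_size is 0, so this logs once */
	return (0);
}
```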
|
I also don't know if zero should be allowed in this code. Given the asserts, I guess not - but it's possible other things have changed in the meantime which would make a zero size OK. I also don't have any particular knowledge of this code, I've just been reading through it and trying to analyse what's going on. I was hoping my prior analysis might prompt whoever had recently touched this area to get involved, but it seems they're also not in a position to do so. I'm only able to look at this stuff when I have spare time, and unfortunately I'm not going to have any significant spare time in the near (and maybe medium 😞) future.

If I had the time I'd be trying to trace where the zero size is coming from, per @behlendorf's comment at #11531 (comment). Perhaps your programmer could start there? |
|
If anyone is feeling particularly daring, they can try this branch of mine which reverts the arc_buf consolidation in question. It's not a simple revert of that commit because 10b3c7f clashes with it -- so the patch set tries to be as transparent as possible by partially reverting the zstd support commit, reverting the commit in question, and then reverting the revert of the zstd commit. I ran this through the ZTS back when I had re-based it on top of 2.0.4, but it's now re-based on top of tonyhutter's zfs-2.0.5 staging branch. I'm currently running the ZTS on it again (on a machine of my own, in addition to the official test runners).
Data gathering to debug openzfs#11531.
|
We've reintroduced the server that failed to production (as a hot spare for our application only at this point) with fastmailops/zfs@946ff14 applied, which enables … (changes to …).

I'll be running in this mode until at least next week, and if it doesn't turn up anything, I'll consider reintroducing the server to active service. I'm wary of it, as it's an awkward failure to clean up from; but also, this machine has been running for weeks in various modes and hasn't shown an issue, so I don't have a lot of confidence that it'll just magically reappear.

If anyone can reproduce this more reliably, it'd be great if they could try a similar patch. If anyone knows how to induce a failure, do share! |
|
@aerusso Good-looking patch series. I'm holding it in my back pocket for the moment because I'd like to see if we can understand the problem first. There's definitely a future where I give it a try, if we start to reproduce the crash more often and don't have an easy answer. I'm glad to have it available, thank you! |
|
I spent some hours today working backward from the crash site to try and piece together how we got there. I don't have concrete answers, but I think maybe I have a place where an expert could start looking.

As best as I can tell, …

Given that, there's only one other way a zero-sized header can be obtained, which is through …

So, I was led to look at the caller, …

Now, I don't know anything about the dbuf system (I got tired!), but there's enough evidence from assignments, checks and assertions that the expectation is that …

In the previous version, if the "old" arcbuf in …

(Much longer analysis notes at https://gist.github.com/robn/42c0f77a30f666136e4477a3dc3ae827.)

I'd love it if someone could check my working and tell me if this gets us closer to the truth or just shows how confused I am. Thanks! |
|
@chrisrd sorry, now that I've had a chance to absorb all this I realised that you'd already come to the same conclusion! Still, I don't feel bad having arrived at it from first principles.

Tomorrow I'll try to make a test case. I don't know if I can, but I've got a couple of ideas on how to start (mostly turning off encryption and compression and trying to make some zero extents in an effort to force these code paths into play).
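For the zero-extent idea, a minimal userspace sketch for manufacturing sparse files is below (the file name and sizes are arbitrary, and whether writes like this actually drive the suspect code path on an encrypted/compressed dataset is exactly what the test would have to show):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "sparse-test.bin";
	char buf[1 << 16];
	int fd;

	memset(buf, 0xab, sizeof (buf));

	fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* Data, then a hole made by seeking well past EOF, then more data. */
	if (write(fd, buf, sizeof (buf)) < 0 ||
	    lseek(fd, 8 << 20, SEEK_CUR) < 0 ||
	    write(fd, buf, sizeof (buf)) < 0) {
		perror("write/lseek");
		return (1);
	}

	/* Punch an extra hole back inside the first written region. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
	    4096, 32768) < 0)
		perror("fallocate(PUNCH_HOLE)");	/* non-fatal if unsupported */

	fsync(fd);
	close(fd);
	printf("wrote %s with holes\n", path);
	return (0);
}
```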
This reverts commit 13fac09. Per the discussion in openzfs#11531, the reverted commit---which was intended only to be a cleanup commit---introduced a subtle, unintended change in behavior.

Suggested-by: @chrisrd
Suggested-by: robn@despairlabs.com
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
This reverts commit 13fac09. Per the discussion in #11531, the reverted commit---which was intended only to be a cleanup commit---introduced a subtle, unintended change in behavior. Care was taken to partially revert and then reapply 10b3c7f, which would otherwise have caused a conflict. These changes were squashed into this commit.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Suggested-by: @chrisrd
Suggested-by: robn@despairlabs.com
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #11531
Closes #12227
|
Hi there, I am currently being haunted by this problem. It appears to be triggered by a big pg_dump backup file, but weirdly it does not happen every time (I couldn't track it down further so far and am not sure what the issue depends on in my case). I have installed ZFS from backports in Debian as suggested by the wiki. It is at 2.0.3. Should I change the source to get a newer version? I didn't see these referenced commits mentioned in the release notes, which is why I'm asking like this.

Edit: oops, looks like I had forgotten to update that … |
see openzfs/zfs#11531 Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
|
I'm experiencing something similar with …

Once the panic occurs, software on the host doing disk IO will hang, and rebooting/shutting down the system seems to hang as well.

Here's a recent trace from dmesg: … |
System information
Describe the problem you're observing
In normal usage (watching a movie, with a browser and other applications in the background, including a torrent client potentially reading from the same file) on a simple layout (a single zpool backed by a single NVMe drive), I got a panic in the ZFS kernel module (shown below).
Looks a bit like #2932, and for what it's worth I also have `xattr=sa` and `acltype=posixacl` set.
It seems to have mostly locked up the device, since after this I got multiple timeout warnings in dmesg (also shown below).
With my normal shell (fish) I was unable to even launch commands (it failed to lock its history file, for example), since it relies on filesystem access quite a bit, and apparently some specific accesses were failing. I eventually launched dash, but it got to a point where `ls` in a directory backed by ZFS, such as `~/` and `/`, hung (though I could still interrupt it with Ctrl-C). Calling `sync` also hung. The interface for some applications simply locked up (thunderbird and qutebrowser), and others that were still working couldn't actually be quit.

As a last ditch effort, I tried SysRq. After SysRq S and U (sync and remount read-only), I tried E and finally I (SIGTERM and SIGKILL all processes). Since I use `runit`, SIGTERM'ing all processes already tried to shut down the system, but that simply hung completely. I had to resort to SysRq B (unconditional reboot) to bring the system down.

The dmesg was dumped onto my EFI partition, which could be accessed without any issues at all.
Describe how to reproduce the problem
I don't know for sure what caused it. qbittorrent might have been accessing a file that mpv was accessing as well.
Include any warning/errors/backtraces from the system logs
Panic
Timeouts