PANIC at zfs_vfsops.c:426:zfs_space_delta_cb() #6332
This seems similar to #2025, at least insofar as the second log above references […]
I deleted the snapshots of this filesystem that were created after the initial panic, just in case the data causing the problem was created at the first panic. But that didn't help; I'm still running into this problem with older snapshots. Testing against […], I haven't narrowed it down to a precise file, as I'm not entirely sure of the best way to go about that. Given @avg-I's theory about why this may occur, I'm not even sure that it would be a particular file causing trouble, especially since the timestamp would correspond to the removal and not some stale corrupt data from the past. So maybe it's just a race condition happening at some point when many files are being removed very quickly on this filesystem? My old pool, that these filesystems were sent from, is destroyed. I did save the […]
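For anyone wanting to try narrowing it down to specific files: since the bogus sa_magic values appear to be UNIX timestamps, one possible (untested) approach is to look for files whose mtime matches the leaked value. A sketch, assuming GNU date/find and a hypothetical leaked value of 1499102380:

```sh
# Convert the hypothetical leaked sa_magic value to a human-readable date:
date -d @1499102380

# List files whose modification time falls within that one-second window
# (the mount point is a placeholder for the affected filesystem):
find /poolB/backups -newermt "@1499102380" ! -newermt "@1499102381" -print
```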
I'm trying to figure out a way to get poolB up and running and fully functional without panics on my production machine. I'm walking through the code to see if there's a workaround that I could compile into a custom build until the underlying problem is solved. If @avg-I's theory is correct, I'm thinking it might be safe to disable the `zfs_sa_upgrade` callback.

Also, if @avg-I's theory is correct, it's probably safe to assume that the filesystems in question on my original pool were not version 5, as I was running the same version of the module with that pool and didn't run into any of these panics until switching to the new pool with the old filesystems sent over. Though that would mean that receive doesn't retain the original filesystem version on the destination pool as part of a replication, the way it retains other filesystem properties.

If anyone has any suggestions for how to fix the original problem, or how I might work around it, or whether my idea for a workaround might work, please let me know!
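If it helps anyone checking the same thing: the filesystem version that actually landed on the receive side can be inspected with `zfs get`. A sketch, using the poolB name from this thread:

```sh
# Show the on-disk filesystem (ZPL) version of every dataset in the pool;
# version 5 is the first to use SA-based attributes, which the theory implicates.
zfs get -r version poolB
```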
I just hit this as well. I'm testing ZFS and Lustre to see if I can upgrade, with this pointing to 'no'.

ZFS: 0.7.0-1

Test: mounted Lustre on a single remote client. Ran both 'fio' and 'mdtest' concurrently in 2 loops:

```sh
fio --size=200G --rw=rw --bs=128k --numjobs=16 --iodepth=16 --ioengine=libaio --direct=0 --invalidate=1 --fsync_on_close=1 --norandommap --name test1 --filename=blah --group_reporting

mpirun --allow-run-as-root -n 8 ./mdtest -z10 -n1250 -i5 -R13 -N1
```
Looks like it cut my pool information: ZFS […]
Rebooted and got the same error with just the mdtest loop; I do not get the error with the 'ior' loops.
@YFLOPS You may be running into a different issue than me; although the panic is occurring in the same place, the contents of sa.sa_magic in my case are clearly UNIX timestamps, which points to a possible SA upgrade issue, but I don't think that's necessarily true for your case. You may have more luck getting help if you create a new issue for it, especially if you note that you're using 0.7.0, as all eyes are on the new version.
@behlendorf Not sure if you've seen this report yet (ignore the three posts by @YFLOPS, as that was covered in its own issue #6474). Tagging you in case you haven't, as you're the person who closed #1303 and #2025 with a note to create a new issue if it reappeared in recent versions. Given @avg-I's theory in #2025 (referenced above in my comments) about this being a race condition related to SA upgrades, if it is correct then this bug has been around for years and is still in the latest versions, as none of the relevant code seems to have been changed since the version I'm running (0.6.5.9).
@behlendorf I'm not sure that this is a duplicate of #7117, as the value in sa.sa_magic that's causing the panic is clearly a timestamp, despite the panic otherwise looking similar. Have you or anyone else checked whether @avg-I's theory in #2025 (a race condition relating to SA upgrades to fs version 5+) might be what's causing this? I've looked at the code myself and it seems to be a compelling explanation, but I don't have the experience with the codebase that you guys do. It's my intuition that I could just disable the SA upgrading (by commenting out the zfs_sa_upgrade callback) and fix this problem in practice, but I would worry about that ruining my data now, or about future changes to ZFS expecting the upgrade to have been completed. (I realize this ticket only became more confused as another user piggybacked on it with yet another non-timestamp-in-sa_magic panic that I believe to be unrelated.)
We've started hitting this issue too, in just the past few days. The pool is on a relatively new server, but the filesystems were zfs sent from the old server to the new, then upgraded to ZFS version 5 (from 4) as part of getting everything up to date. Trying to rsync from the live filesystem seems to tickle the bug most quickly. One thing we found that has reduced the frequency of crashes was to `zfs set atime=off` on all filesystems. It hasn't crashed on rsync since I did that, but that was this morning and we won't know for sure until more time has passed. If anyone is hitting this often, try `zfs set atime=off`. I'll be watching this bug closely for updates as this is affecting a production server.

EDIT: General info: CentOS Linux release 7.4.1708 (Core)
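Since atime is inherited, it only needs to be set once at the pool's root dataset. A sketch, with the pool name as a placeholder:

```sh
# Disable atime updates pool-wide; child datasets inherit the setting:
zfs set atime=off poolB

# List any datasets that override it with a locally set value:
zfs get -r -s local atime poolB
```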
For what it's worth, I've been experiencing this all along with rsync as my primary test, though `rm -r` also causes it, and that's with atime=off.
@jspiros Well, that's disconcerting. Haven't crashed yet and have been rsyncing nearly constantly since this morning, though (moving data to a ZFS v4 partition to try to get away from this bug, since it only seems to affect ZFS v5). Haven't tried any `rm -r` yet; there isn't a large directory tree that we can delete on the affected filesystems right now. Still, I'll report back if we crash again with atime=off.
@flynnjFIU Given the theory on why this is happening, datasets of hundreds of thousands of tiny files would sooner exhibit the problem than datasets of fewer, larger files. My case is more the former. Indeed, I'd even considered as a solution a script that would replicate […]
@jspiros In the interest of being thorough, do you also have relatime=off? relatime=on could also cause atime updates on access, though they wouldn't be as frequent as with atime=on. Given the timestamp erroneously appearing in the field, it feels like this bug definitely has to do with a race condition where a timestamp update is ending up in the wrong memory location.
Welp, the problem just recurred. So although atime=off seems to reduce the likelihood of it, it still happens. I shall continue migrating this filesystem to ZFS v4, along with the others on the host.
@flynnjFIU In the interest of completeness, yes, I have relatime=off as well.
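For reference, both access-time properties can be checked in one command; the dataset name below is a placeholder:

```sh
# Show atime and relatime, plus where each value comes from (local vs. inherited):
zfs get atime,relatime poolB/backups
```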
I have a similar problem on Ubuntu 18.04 bionic with zfsonlinux from standard package sources. Importing the pool works fine, but mounting the filesystems causes the panic.
The machine still does "something", but is completely unresponsive. It has to be reset; it doesn't shut down cleanly. I disabled the auto import of the pool, rebooted, then […]. This is a bummer; how can some bad data effectively kill the entire machine?
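One possible way to get at the data without triggering the mount-time panic (a sketch; pool and dataset names are placeholders): import without mounting, then mount datasets selectively:

```sh
# Import the pool without mounting any of its filesystems:
zpool import -N tank

# Mount only the datasets that don't trigger the panic, one at a time:
zfs mount tank/known-good-dataset
```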
Similar here, with 0.8.2 on NixOS; apparently it happens when mounting a FS while other operations are running.
But after reboot everything seems fine again.
I've just started noticing this problem with Ubuntu 20.04 on a Raspberry Pi 4, since doing a Linux dist-upgrade, which I suspected of triggering the issue, though from what I'm reading here maybe it started earlier this year (not sure exactly when), when I upgraded my ZFS filesystems to version 5 (I think). I don't have any log files from before then (long story), so I can't see if I was ever getting panics before that. I guess there's no way to switch back from 5 to 4, is there?

Symptom: when running rsync or tree (haven't tried much else) on much of the filesystem, it will pretty routinely throw a panic report in syslog/kern.log, and the rsync/tree operation will later hang solidly, such that even ^C doesn't break back to the shell, and any further file operations will likely hang and lock up their shells. Often requires a power cycle rather than just shutdown -r to reboot.

syslog/kern.log:

```
Feb  2 16:52:43 donny kernel: [  517.355790] VERIFY3(sa.sa_magic == SA_MAGIC) failed (value == 3100762)
```

(the value varies, e.g. 1612270437, 1612270437, 1612283815)

I've filed a bug report with Ubuntu Launchpad.
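On the "switch back from 5 to 4" question: as far as I know, `zfs upgrade` is one-way for an existing filesystem, but versions can be checked, and a new dataset can be created at the older version and the data copied over. A sketch (dataset names are placeholders; check your platform's zfs(8) before relying on this):

```sh
# Check the filesystem (ZPL) version of each dataset:
zfs get -r version rpool

# Upgrading is one-way; the usual route back to v4 is a fresh dataset
# created at the old version, with the data copied into it:
zfs create -o version=4 rpool/backups-v4
```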
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
This problem still exists and it's being reported from time to time in various forums.
NOTE: I have some thoughts about what may be causing this problem, and I'm not sure that my logs in this first post are immediately useful. You may want to skip down to my updates after this first post for more context and thoughts on what might be happening.
System information
Describe the problem you're observing
I have experienced two panics in a short period of time, seemingly related to rsync and/or send/recv. I was running with two pools, poolA and poolB, and I was in the process of moving everything from poolA to poolB using send/recv so that I could run with poolB only. One of the filesystem hierarchies being moved is the destination of some rsync cronjobs used for backup purposes. When I was send/recv'ing that hierarchy (using `zfs send -vR poolA/backups@snap | zfs receive -uv poolB/backups`), before it completed, one of the cronjobs started to run (with poolA as the effective destination). With both the send/recv and the rsync running at the same time, I got a panic which brought everything related to ZFS or any ZFS filesystems to a halt, necessitating a reboot. This is the panic from my kern.log:

[…]

Before I rebooted, I confirmed that both the rsync and the zfs commands were still running. I was logging the output of the zfs commands, and it did log the final "receiving incremental stream" line relating to the last snapshot of the last filesystem that was to be sent, per the initial estimate when the commands first started. But the processes did not exit.
After reboot, it looked like the final snapshot of the final filesystem was indeed on poolB, though `zpool history` for poolA or poolB did not show the send/recv commands.

Eventually I did the final move by creating new snapshots with everything unmounted, and did an incremental update with `zfs send -vRI @snap poolA/backups@final-snap | zfs receive -uFv poolB/backups`. After this completed, I again confirmed (using `zfs list`) that everything was showing up on the poolB side, before exporting poolA, and reconfiguring mounts and rebooting the system to run only with poolB.

After the system was running with poolB only for a while, one of the rsync backup cronjobs ran, with a destination in the same hierarchy of filesystems that I ran into the first panic with (but this time with poolB as the actual pool involved), and I got another panic:

[…]

This time, there was no send/recv taking place at the same time.
There are other processes appearing in this log, like dropbox, youtube-dl, etc.; those are all processes that were interacting with other ZFS filesystems at the time. But those same processes have been running with no issue otherwise, so I suspect it's related to rsync and the filesystems I had issues with during the send/recv process.
This is a production system. Since the backups aren't as essential as the other processes that depend on other filesystems on this system, and I hoped that the issue was with something in the filesystems that were send/recv'ing during the first panic, I decided to try to avoid further panics by simply unmounting the entire suspect filesystem hierarchy (/backups) and disabling the cronjobs that expected it to be mounted. The system has now been running for another two days (after a reboot following the last panic) without any further panics.
I still have poolA, though attached to a different system, and all of the data that was sent to poolB is still on poolA, if that might make it possible to run tests without taking my production system down repeatedly.
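As a starting point for such tests, comparing the snapshot lists on the two copies is straightforward; a sketch using the paths from this report:

```sh
# Enumerate every snapshot under the hierarchy on each pool and compare:
zfs list -r -t snapshot -o name,used poolA/backups
zfs list -r -t snapshot -o name,used poolB/backups
```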
This seems possibly related to #3968, #1303, and more. Any ideas?