Kernel panic when deleting send/recv replicated clones #7279
I wrote this script and was able to reproduce the behavior. The Populate script creates the initial dataset, clones it to a new dataset, copies in some data, then clones again, and so on. The script performs 3 clones, then replicates to a remote host, creating a total of 72 datasets all cloned off of each other. The source pool is assumed to be "tank" and the remote pool is assumed to be "remote". Populate script: https://pastebin.com/Bd2EAuzS Then I use the zfs_delete_old.sh script to search for datasets on the remote side that are older than X seconds and attempt to delete them. Here is a sample command to search for datasets (with the -t). Remove the "-t" to actually delete. When you do delete, your kernel should crash.
zfs_delete_old.sh script: https://pastebin.com/LcJaSFsD Thank you!
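For reference, the age test that delete script performs can be sketched as a small helper (hypothetical function name and structure — the real script is at the pastebin link above). On a real system the creation epoch would come from `zfs get -Hpo value creation <dataset>`:

```shell
#!/bin/sh
# Hypothetical sketch of zfs_delete_old.sh's age check: a dataset is a
# deletion candidate when its creation time is more than max_age seconds
# in the past. The creation epoch for a real dataset would come from:
#   zfs get -Hpo value creation <dataset>
is_older_than() {
    creation_epoch="$1"       # dataset creation time, seconds since epoch
    max_age="$2"              # the script's "X seconds" threshold
    now="${3:-$(date +%s)}"   # injectable clock, so the logic is testable
    [ $(( now - creation_epoch )) -gt "$max_age" ]
}
```

The datasets that pass this check are the ones the script then attempts to destroy on the remote side.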
@kpande @behlendorf I see that this has been added to the 0.8.0 milestone, thank you! I am just checking to see if there might be a temporary workaround for the issue or a configuration change that could be made in the meantime?
Destroy operations on deeply nested chains of clones can overflow the stack:

    Depth    Size   Location    (221 entries)
    -----    ----   --------
      0)    15664      48   mutex_lock+0x5/0x30
      1)    15616       8   mutex_lock+0x5/0x30
    ...
     26)    13576      72   dsl_dataset_remove_clones_key.isra.4+0x124/0x1e0 [zfs]
     27)    13504      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
     28)    13432      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
    ...
    185)     2128      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
    186)     2056      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
    187)     1984      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
    188)     1912     136   dsl_destroy_snapshot_sync_impl+0x4e0/0x1090 [zfs]
    189)     1776      16   dsl_destroy_snapshot_check+0x0/0x90 [zfs]
    ...
    218)      304     128   kthread+0xdf/0x100
    219)      176      48   ret_from_fork+0x22/0x40
    220)      128     128   kthread+0x0/0x100

Fix this issue by converting dsl_dataset_remove_clones_key() from recursive to iterative.

Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7279
Closes #7810
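The actual fix is in the C function dsl_dataset_remove_clones_key(), but the general pattern it adopts — replacing recursion over the clone tree with an explicit worklist, so stack usage stays constant regardless of clone depth — can be sketched in shell. This is purely illustrative, with hypothetical visit_clone/children_of helpers; the traversal order differs from the recursive version, but each clone is still processed exactly once:

```shell
#!/bin/sh
# Illustrative worklist traversal (not the actual ZFS code): instead of
# recursing into each child clone, keep the datasets still to be processed
# in an explicit list, so stack depth no longer grows with clone depth.
process_clones() {
    worklist="$1"                 # start with the root of the clone chain
    while :; do
        set -- $worklist          # split the list into positional params
        [ "$#" -eq 0 ] && break   # nothing left to process
        head="$1"; shift
        visit_clone "$head"                    # per-dataset work goes here
        worklist="$* $(children_of "$head")"   # append this clone's children
    done
}
```

With chains 200+ levels deep, as in the stack trace above, the recursive version consumed 72 bytes of kernel stack per level; the worklist version's stack cost is constant.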
I can confirm that this commit resolved my issue! Thank you!
System information SOURCE
System information TARGET
Describe the problem you're observing
The SOURCE system has a series of datasets cloned from each other every 10 minutes:
tank/data0110@snap <- tank/data0120
tank/data0120@snap <- tank/data0130
tank/data0130@snap <- tank/dataN
You get the idea.
The SOURCE system does a send/recv to the TARGET system every 30 minutes. The SOURCE system has a cleanup job that runs every day at 12 noon and removes all but the last 6 datasets: it promotes the second-to-oldest dataset, then deletes the oldest, and repeats this process until only 6 datasets are left.
The cleanup job accomplishes this by doing the following:
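A sketch of one such promote-then-delete pass (dataset names are hypothetical, and ZFS defaults to `echo zfs`, so this is a dry run that only prints the commands it would issue; the `@snap-purge` snapshot is the one mentioned below as triggering the panic on the TARGET):

```shell
#!/bin/sh
# One cleanup pass: promote the second-to-oldest dataset, then destroy the
# oldest one and the stale snapshot left behind on the promoted clone.
# ZFS defaults to "echo zfs" so nothing is actually destroyed here.
ZFS="${ZFS:-echo zfs}"

cleanup_pass() {
    oldest="$1"; second="$2"
    $ZFS promote "$second"              # second-oldest takes over the snapshots
    $ZFS destroy -r "$oldest"           # the oldest dataset can now be removed
    $ZFS destroy "$second@snap-purge"   # the step that panics on the TARGET
}
```

The job runs this pass repeatedly, walking up the chain, until only the desired number of datasets remains.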
Meanwhile the TARGET keeps all of the datasets intact. The TARGET also has a cleanup job, which runs every Monday and deletes any datasets older than 1 week using the same process as the SOURCE cleanup job: promote the second-to-oldest, then delete the oldest. Wash, rinse, repeat until all datasets older than a week are purged.
The issue I am experiencing is with the TARGET cleanup job. When it attempts to cleanup the oldest dataset, it successfully promotes the second oldest dataset but when I try to remove the snap-purge snapshot from the promoted dataset, I get a kernel panic and the machine locks up or reboots.
It seems like the issue is that the root dataset of the chain gets deleted on the SOURCE but not the TARGET, and subsequent send/recvs still happen. When the TARGET then attempts to delete the root dataset some time later, the panic occurs.
I have tried this on multiple versions of Proxmox and Ubuntu with the distribution versions of ZFS, as well as many compiled versions including the most recent ZFS.
Describe how to reproduce the problem
To reproduce the problem you need to clone datasets off of each other, replicate them, delete the source datasets frequently and the target datasets less frequently.
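One round of that cycle can be outlined as follows (a dry-run sketch using the pool names assumed in this report, "tank" and "remote"; the `@snap`/`@xfer` snapshot names and the `cycle` helper are hypothetical, and RUN defaults to echo so the commands are printed, not executed):

```shell
#!/bin/sh
# Dry-run outline of one reproduction round: snapshot the previous dataset,
# clone the next dataset off that snapshot, then replicate. On a real
# system the send would be piped into "zfs recv -F" on the target pool.
RUN="${RUN:-echo}"

cycle() {
    prev="$1"; next="$2"
    $RUN zfs snapshot "$prev@snap"
    $RUN zfs clone "$prev@snap" "$next"
    $RUN zfs snapshot "$next@xfer"
    $RUN zfs send -R "$next@xfer"   # | ssh target zfs recv -F remote/...
}
```

Repeating this cycle builds the deeply nested clone chain; destroying the oldest datasets on the source frequently and on the target less frequently then sets up the panic described above.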
Include any warning/errors/backtraces from the system logs
I do not have any errors or logs. I have tried to set up a kernel crash dump but I am not sure how to read it. I am willing to provide whatever is necessary to help track this down.