
Kernel panic when deleting send/recv replicated clones #7279

Closed
valeech opened this issue Mar 7, 2018 · 5 comments · Fixed by #7810


valeech commented Mar 7, 2018

System information SOURCE

Type Version/Name
Distribution Name Proxmox
Distribution Version 5.1-43/bdb08029
Linux Kernel 4.13.13-5-pve
Architecture x86_64
ZFS Version v0.7.4-1
SPL Version v0.7.4-1

System information TARGET

Type Version/Name
Distribution Name Ubuntu
Distribution Version 16.04
Linux Kernel 4.4.0-116-generic
Architecture x86_64
ZFS Version v0.7.6-1_e3b28e1
SPL Version v0.7.6-1_3cc0ea2

Describe the problem you're observing

The SOURCE system has a series of datasets, each cloned from the previous one every 10 minutes:

tank/data0110@snap <- tank/data0120
tank/data0120@snap <- tank/data0130
tank/data0130@snap <- tank/dataN

You get the idea.
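That chain can be sketched as a small POSIX shell helper. This is a hypothetical dry run with the issue's example dataset names: it only prints the snapshot/clone commands each step would run, so it never touches a real pool.

```shell
# Hypothetical sketch: each new dataset is a clone of a snapshot of the
# previous one. Commands are echoed (dry run); pipe to sh to execute.
chain_clone() {
    pool=$1; prev=$2; shift 2
    for next in "$@"; do
        echo "zfs snapshot ${pool}/${prev}@snap"
        echo "zfs clone ${pool}/${prev}@snap ${pool}/${next}"
        prev=$next
    done
}

chain_clone tank data0110 data0120 data0130
```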

The SOURCE system performs a send/recv to the TARGET system every 30 minutes. The SOURCE system also has a cleanup job that runs every day at 12 noon and removes all but the last 6 datasets: it promotes the second-oldest dataset and then deletes the oldest one, repeating this process until only 6 datasets are left.

The cleanup job accomplishes this by doing the following:

1. zfs rename tank/dataOLDEST@snap tank/dataOLDEST@snap-purge
2. zfs promote tank/dataSECONDOLDEST
3. zfs destroy -r tank/dataOLDEST
4. zfs destroy tank/dataSECONDOLDEST@snap-purge
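The four steps above can be sketched as a dry-run shell function (dataset names are placeholders from the issue; the commands are echoed rather than executed):

```shell
# Dry-run sketch of the cleanup sequence described above:
# rename, promote, destroy the old chain head, destroy the purge snapshot.
cleanup_oldest() {
    pool=$1; oldest=$2; second=$3
    echo "zfs rename ${pool}/${oldest}@snap ${pool}/${oldest}@snap-purge"
    echo "zfs promote ${pool}/${second}"
    echo "zfs destroy -r ${pool}/${oldest}"
    echo "zfs destroy ${pool}/${second}@snap-purge"
}

cleanup_oldest tank data0110 data0120
```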

Meanwhile the TARGET retains all of the datasets. The TARGET also has a cleanup job, which runs every Monday and deletes any datasets older than 1 week using the same process as the SOURCE cleanup job: promote the second-oldest, then delete the oldest. Wash, rinse, repeat until all datasets older than a week are purged.

The issue I am experiencing is with the TARGET cleanup job. When it attempts to clean up the oldest dataset, it successfully promotes the second-oldest dataset, but when I try to remove the snap-purge snapshot from the promoted dataset, I get a kernel panic and the machine locks up or reboots.

It seems the root dataset of the chain gets deleted on the SOURCE but not on the TARGET, and subsequent send/recv operations continue. Then, when the TARGET attempts to delete that root dataset sometime later, something bad happens.

I have tried this on multiple versions of Proxmox and Ubuntu with the distribution-provided version of ZFS, as well as many compiled versions, including the most recent ZFS.

Describe how to reproduce the problem

To reproduce the problem, clone datasets off of each other, replicate them, and delete the source datasets frequently and the target datasets less frequently.
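One step of that replication loop can be sketched as follows. The host name, pool names, and snapshot naming are assumptions, and the commands are echoed (dry run). Since each dataset is a clone of the previous one's snapshot, an incremental send from the origin snapshot is enough to transfer the new clone:

```shell
# Hypothetical dry-run sketch of one replication step: snapshot the newest
# clone, then incrementally send it from its origin to the TARGET host.
replicate_step() {
    srcpool=$1; dstpool=$2; prev=$3; new=$4
    echo "zfs snapshot ${srcpool}/${new}@snap"
    echo "zfs send -i ${srcpool}/${prev}@snap ${srcpool}/${new}@snap | ssh target zfs recv ${dstpool}/${new}"
}

replicate_step tank remote data0120 data0130
```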

Include any warning/errors/backtraces from the system logs

I do not have any errors or logs. I have tried to set up a kernel crash dump, but I am not sure how to read it. I am willing to provide whatever is necessary to help track this down.


valeech commented Mar 9, 2018

I wrote this script and was able to successfully reproduce the behavior.

The Populate script creates the initial dataset, clones it to a new dataset, copies in some data, clones again, and so on. The script performs 3 clones, then replicates to a remote host. It creates a total of 72 datasets, all cloned off of each other. The source pool is assumed to be "tank" and the remote pool "remote".

Populate script: https://pastebin.com/Bd2EAuzS

Then I use the zfs_delete_old.sh script to search for datasets on the remote side that are older than X seconds and attempt to delete them. Here is a sample command to search for datasets (the -t flag makes it a dry run). Remove the -t to actually delete; when you do, your kernel should crash.

zfs_delete_old.sh -s 240 -p 'remote/testclone/.*' -t

zfs_delete_old.sh script: https://pastebin.com/LcJaSFsD

Thank you!

@behlendorf behlendorf added this to the 0.8.0 milestone Mar 9, 2018

valeech commented Mar 16, 2018

@kpande @behlendorf I see that this has been added to the 0.8.0 milestone, thank you!! I am just checking to see if there might be a temporary workaround or a configuration change that could be made in the meantime?


loli10K commented Mar 18, 2018

I am just checking to see if there might be a temporary work around to the issue or a configuration change that could be made in the meantime?

@valeech you can lower the nesting level of your clones.

Duplicate of #3959.
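To apply that workaround, it helps to know how deep a clone chain already is. A sketch that walks each dataset's origin property (the ZFS_GET indirection is only there so the loop can be exercised without a real pool):

```shell
# Count how many origins sit above a dataset by following the
# "origin" property until it reads "-" (no origin).
# ZFS_GET can be overridden, e.g. for testing without a pool.
: "${ZFS_GET:=zfs get -H -o value origin}"

clone_depth() {
    ds=$1
    depth=0
    while :; do
        origin=$($ZFS_GET "$ds") || return 1
        [ "$origin" = "-" ] && break
        ds=${origin%@*}          # strip the @snapshot part of the origin
        depth=$((depth + 1))
    done
    echo "$depth"
}
```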

behlendorf pushed a commit that referenced this issue Aug 22, 2018
Destroy operations on deeply nested chains of clones can overflow
the stack:

        Depth    Size   Location    (221 entries)
        -----    ----   --------
  0)    15664      48   mutex_lock+0x5/0x30
  1)    15616       8   mutex_lock+0x5/0x30
...
 26)    13576      72   dsl_dataset_remove_clones_key.isra.4+0x124/0x1e0 [zfs]
 27)    13504      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
 28)    13432      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
...
185)     2128      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
186)     2056      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
187)     1984      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
188)     1912     136   dsl_destroy_snapshot_sync_impl+0x4e0/0x1090 [zfs]
189)     1776      16   dsl_destroy_snapshot_check+0x0/0x90 [zfs]
...
218)      304     128   kthread+0xdf/0x100
219)      176      48   ret_from_fork+0x22/0x40
220)      128     128   kthread+0x0/0x100

Fix this issue by converting dsl_dataset_remove_clones_key() from
recursive to iterative.

Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7279 
Closes #7810

valeech commented Oct 26, 2018

I can confirm that this commit resolved my issue! Thank you!
