
Kernel panic when deleting send/recv replicated clones #7279

Closed
valeech opened this issue Mar 7, 2018 · 5 comments · Fixed by #7810


valeech commented Mar 7, 2018

System information SOURCE

Type Version/Name
Distribution Name Proxmox
Distribution Version 5.1-43/bdb08029
Linux Kernel 4.13.13-5-pve
Architecture x86_64
ZFS Version v0.7.4-1
SPL Version v0.7.4-1

System information TARGET

Type Version/Name
Distribution Name Ubuntu
Distribution Version 16.04
Linux Kernel 4.4.0-116-generic
Architecture x86_64
ZFS Version v0.7.6-1_e3b28e1
SPL Version v0.7.6-1_3cc0ea2

Describe the problem you're observing

The SOURCE system has a series of datasets, each cloned from the previous one every 10 minutes:

tank/data0110@snap <- tank/data0120
tank/data0120@snap <- tank/data0130
tank/data0130@snap <- tank/dataN

You get the idea.
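That chain can be sketched as a small POSIX shell helper. This is a hypothetical dry run with the issue's example dataset names: it only prints the snapshot/clone commands each step would run, so it never touches a real pool.

```shell
# Hypothetical sketch: each new dataset is a clone of a snapshot of the
# previous one. Commands are echoed (dry run); pipe to sh to execute.
chain_clone() {
    pool=$1; prev=$2; shift 2
    for next in "$@"; do
        echo "zfs snapshot ${pool}/${prev}@snap"
        echo "zfs clone ${pool}/${prev}@snap ${pool}/${next}"
        prev=$next
    done
}

chain_clone tank data0110 data0120 data0130
```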

The SOURCE system performs a send/recv to the TARGET system every 30 minutes. The SOURCE system also has a cleanup job that runs every day at 12 noon and removes all but the last 6 datasets: it promotes the second-oldest dataset and then deletes the oldest one, repeating this process until only 6 datasets are left.

The cleanup job accomplishes this by doing the following:

1. zfs rename tank/dataOLDEST@snap tank/dataOLDEST@snap-purge
2. zfs promote tank/dataSECONDOLDEST
3. zfs destroy -r tank/dataOLDEST
4. zfs destroy tank/dataSECONDOLDEST@snap-purge
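The four steps above can be sketched as a dry-run shell function (dataset names are placeholders from the issue; the commands are echoed rather than executed):

```shell
# Dry-run sketch of the cleanup sequence described above:
# rename, promote, destroy the old chain head, destroy the purge snapshot.
cleanup_oldest() {
    pool=$1; oldest=$2; second=$3
    echo "zfs rename ${pool}/${oldest}@snap ${pool}/${oldest}@snap-purge"
    echo "zfs promote ${pool}/${second}"
    echo "zfs destroy -r ${pool}/${oldest}"
    echo "zfs destroy ${pool}/${second}@snap-purge"
}

cleanup_oldest tank data0110 data0120
```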

Meanwhile the TARGET retains all of the datasets. The TARGET also has a cleanup job, which runs every Monday and deletes any datasets older than 1 week using the same process as the SOURCE cleanup job: promote the second-oldest, then delete the oldest. Wash, rinse, repeat until all datasets older than a week are purged.

The issue I am experiencing is with the TARGET cleanup job. When it attempts to clean up the oldest dataset, it successfully promotes the second-oldest dataset, but when I try to remove the snap-purge snapshot from the promoted dataset, I get a kernel panic and the machine locks up or reboots.

It seems the root dataset of the chain gets deleted on the SOURCE but not on the TARGET, and subsequent send/recv operations continue. Then, when the TARGET attempts to delete that root dataset sometime later, something bad happens.

I have tried this on multiple versions of Proxmox and Ubuntu with the distribution-provided version of ZFS, as well as many compiled versions, including the most recent ZFS.

Describe how to reproduce the problem

To reproduce the problem, clone datasets off of each other, replicate them, and delete the source datasets frequently and the target datasets less frequently.
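One step of that replication loop can be sketched as follows. The host name, pool names, and snapshot naming are assumptions, and the commands are echoed (dry run). Since each dataset is a clone of the previous one's snapshot, an incremental send from the origin snapshot is enough to transfer the new clone:

```shell
# Hypothetical dry-run sketch of one replication step: snapshot the newest
# clone, then incrementally send it from its origin to the TARGET host.
replicate_step() {
    srcpool=$1; dstpool=$2; prev=$3; new=$4
    echo "zfs snapshot ${srcpool}/${new}@snap"
    echo "zfs send -i ${srcpool}/${prev}@snap ${srcpool}/${new}@snap | ssh target zfs recv ${dstpool}/${new}"
}

replicate_step tank remote data0120 data0130
```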

Include any warning/errors/backtraces from the system logs

I do not have any errors or logs. I have tried to set up a kernel crash dump, but I am not sure how to read it. I am willing to provide whatever is necessary to help track this down.


valeech commented Mar 9, 2018

I wrote this script and was able to successfully reproduce the behavior.

The Populate script creates the initial dataset, clones it to a new dataset, copies in some data, clones again, and so on. The script performs 3 clones, then replicates to a remote host. It creates a total of 72 datasets, all cloned off of each other. The source pool is assumed to be "tank" and the remote pool "remote".

Populate script: https://pastebin.com/Bd2EAuzS

Then I use the zfs_delete_old.sh script to search for datasets on the remote side that are older than X seconds and attempt to delete them. Here is a sample command to search for datasets (the -t flag makes it a dry run). Remove the -t to actually delete; when you do, your kernel should crash.

zfs_delete_old.sh -s 240 -p 'remote/testclone/.*' -t

zfs_delete_old.sh script: https://pastebin.com/LcJaSFsD

Thank you!

@behlendorf behlendorf added this to the 0.8.0 milestone Mar 9, 2018

valeech commented Mar 16, 2018

@kpande @behlendorf I see that this has been added to the 0.8.0 milestone, thank you!! I am just checking to see if there might be a temporary workaround or a configuration change that could be made in the meantime?


loli10K commented Mar 18, 2018

I am just checking to see if there might be a temporary work around to the issue or a configuration change that could be made in the meantime?

@valeech you can lower the nesting level of your clones.

Duplicate of #3959.
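To apply that workaround, it helps to know how deep a clone chain already is. A sketch that walks each dataset's origin property (the ZFS_GET indirection is only there so the loop can be exercised without a real pool):

```shell
# Count how many origins sit above a dataset by following the
# "origin" property until it reads "-" (no origin).
# ZFS_GET can be overridden, e.g. for testing without a pool.
: "${ZFS_GET:=zfs get -H -o value origin}"

clone_depth() {
    ds=$1
    depth=0
    while :; do
        origin=$($ZFS_GET "$ds") || return 1
        [ "$origin" = "-" ] && break
        ds=${origin%@*}          # strip the @snapshot part of the origin
        depth=$((depth + 1))
    done
    echo "$depth"
}
```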

behlendorf pushed a commit that referenced this issue Aug 22, 2018
Destroy operations on deeply nested chains of clones can overflow
the stack:

        Depth    Size   Location    (221 entries)
        -----    ----   --------
  0)    15664      48   mutex_lock+0x5/0x30
  1)    15616       8   mutex_lock+0x5/0x30
...
 26)    13576      72   dsl_dataset_remove_clones_key.isra.4+0x124/0x1e0 [zfs]
 27)    13504      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
 28)    13432      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
...
185)     2128      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
186)     2056      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
187)     1984      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
188)     1912     136   dsl_destroy_snapshot_sync_impl+0x4e0/0x1090 [zfs]
189)     1776      16   dsl_destroy_snapshot_check+0x0/0x90 [zfs]
...
218)      304     128   kthread+0xdf/0x100
219)      176      48   ret_from_fork+0x22/0x40
220)      128     128   kthread+0x0/0x100

Fix this issue by converting dsl_dataset_remove_clones_key() from
recursive to iterative.

Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7279 
Closes #7810

valeech commented Oct 26, 2018

I can confirm that this commit resolved my issue! Thank you!
