ZTS test slog_015_neg.ksh can trigger zfs deadlock #14775

Open
mmaybee opened this issue Apr 20, 2023 · 6 comments
Labels: Type: Defect (Incorrect behavior, e.g. crash, hang)

@mmaybee
Contributor

mmaybee commented Apr 20, 2023

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 20.04 |
| Kernel Version | 5.4.0-146 |
| Architecture | x86 |
| OpenZFS Version | 2.1.99 |
| Commit | 4c856fb |

Describe the problem you're observing

The ZTS test slog_015_neg.ksh can trigger a deadlock in the zfs module. This manifests as a test hang during a zpool offline request. It appears to be a result of the changes in PR #14514, which introduced the zl_suspend_lock.

Describe how to reproduce the problem

This reproduces very reliably on our config when running slog_015_neg.ksh.

Include any warning/errors/backtraces from the system logs

Here is an example of the three deadlocking threads:
Thread 1 is a writer to the pool. It is holding the zl_suspend_lock as READER (from zil_commit()) and waiting for the txg to sync:

1006     UNINTERRUPTABLE                    1
         context_switch+0x12c
         __schedule+0x2de
         schedule+0x42
         io_schedule+0x16
         cv_wait_common+0xcb [spl]
         __cv_wait_io+0x18 [spl]
         txg_wait_synced_impl+0x9b [zfs]
         txg_wait_synced+0x10 [zfs]
         zil_create+0x286 [zfs]
         zil_process_commit_list+0x1cc [zfs]
         zil_commit_writer+0xb3 [zfs]
         zil_commit_impl+0x62 [zfs]
         zil_commit+0x82 [zfs]
         zfs_write+0xae5 [zfs]
         zpl_iter_write+0xe6 [zfs]
         call_write_iter+0x15
         new_sync_write+0x125
         __vfs_write+0x29
         vfs_write+0x37
         vfs_write+0xb9
         ksys_pwrite64+0x6d
         __do_sys_pwrite64+0x18
         __se_sys_pwrite64+0x18
         __x64_sys_pwrite64+0x1e
         do_syscall_64+0x57
         entry_SYSCALL_64

Thread 2 is the zpool command trying to offline a log device. It is holding the dp_config_lock as READER and trying to get the zl_suspend_lock as WRITER:

1007     UNINTERRUPTABLE                    1
         context_switch+0x12c
         __schedule+0x2de
         schedule+0x42
         rwsem_down_write_slowpath+0x244
         __down_write+0x2e
         down_write+0x41
         zil_suspend+0x7f [zfs]
         zil_reset+0x14 [zfs]
         dmu_objset_find_impl+0x106 [zfs]
         dmu_objset_find+0x57 [zfs]
         spa_reset_logs+0x2c [zfs]
         vdev_offline_locked+0x13c [zfs]
         vdev_offline+0x3e [zfs]
         zfs_ioc_vdev_set_state+0xf6 [zfs]
         zfsdev_ioctl_common+0x5df [zfs]
         zfsdev_ioctl+0x57 [zfs]
         vfs_ioctl+0x30d
         file_ioctl+0x357
         do_vfs_ioctl+0x407
         ksys_ioctl+0x67
         __do_sys_ioctl+0x14
         __se_sys_ioctl+0x14
         __x64_sys_ioctl+0x1a
         do_syscall_64+0x57
         entry_SYSCALL_64

Thread 3 is the pool sync thread. It is waiting for the dp_config_lock as WRITER:

996      UNINTERRUPTABLE                    1
         context_switch+0x12c
         __schedule+0x2de
         schedule+0x42
         cv_wait_common+0x11e [spl]
         __cv_wait+0x15 [spl]
         rrw_enter_write+0x4e [zfs]
         rrw_enter+0x13 [zfs]
         spa_sync_upgrades+0x79 [zfs]
         spa_sync_iterate_to_convergence+0x156 [zfs]
         spa_sync+0x327 [zfs]
         txg_sync_thread+0x22d [zfs]
         thread_generic_wrapper+0x83 [spl]
         kthread+0x104
         ret_from_fork
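
The three stacks above form a circular wait. As a rough illustration only (not code from OpenZFS), the dependency can be written as a small wait-for graph and checked for a cycle. The thread labels and the `WAITS_FOR` and `find_cycle` names are hypothetical; the edges restate what the report says each thread holds and waits for.

```python
# Wait-for graph for the three threads in the report.
# An edge A -> B means "A cannot make progress until B does".
# Labels are illustrative; the numbers are the thread IDs from the stacks.
WAITS_FOR = {
    # Holds zl_suspend_lock as READER, waits in txg_wait_synced() for the sync thread.
    "writer (1006)": "txg_sync (996)",
    # Wants dp_config_lock as WRITER, which the zpool command holds as READER.
    "txg_sync (996)": "zpool offline (1007)",
    # Wants zl_suspend_lock as WRITER, which the writer holds as READER.
    "zpool offline (1007)": "writer (1006)",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    position = {}  # node -> index in path where it was first seen
    node = start
    while node is not None:
        if node in position:
            # We came back to a node already on the path: that suffix is the cycle.
            return path[position[node]:]
        position[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    return None


cycle = find_cycle(WAITS_FOR, "writer (1006)")
if cycle:
    print(" -> ".join(cycle + cycle[:1]))
```

Removing any one edge breaks the cycle, for example if the sync did not need dp_config_lock as WRITER, or if the writer did not hold zl_suspend_lock while waiting for the txg to sync.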
@mmaybee mmaybee added the Type: Defect Incorrect behavior (e.g. crash, hang) label Apr 20, 2023
@mmaybee
Contributor Author

mmaybee commented Apr 20, 2023

@ryao Please take a look at this to determine if this is indeed the result of your changes from PR #14514.

@youzhongyang
Contributor

I've seen a similar hang in the following two tests too:

cli_root/zfs_copies/zfs_copies_003_pos.ksh
cli_root/zpool_add/zpool_add_004_pos.ksh

@behlendorf
Contributor

I've opened #14790, which reverts the suspected change to try to verify that it is responsible. I've observed the deadlock described above multiple times in the CI, although I never ran it down to the particular test(s) or commit. Assuming this is the cause, we can revert the change for now and then follow up with an alternate fix for the original issue.

@ryao
Contributor

ryao commented Apr 25, 2023

> @ryao Please take a look at this to determine if this is indeed the result of your changes from PR #14514.

That appears to be the case.

freebsd-git pushed a commit to freebsd/freebsd-src that referenced this issue Apr 25, 2023
This reverts commit 4c856fb.

To quote a pending upstream PR:
This reverts commit 4c856fb to resolve a newly introduced deadlock which
in practice is more disruptive that the issue this commit intended to
address.

Causes deadlocks described in openzfs/zfs#14775

Sponsored by:	Rubicon Communications, LLC ("Netgate")
behlendorf added a commit that referenced this issue Apr 25, 2023
This reverts commit 4c856fb to
resolve a newly introduced deadlock which in practice in more
disruptive that the issue this commit intended to address.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14775
Closes #14790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this issue May 1, 2023
This reverts commit 4c856fb to
resolve a newly introduced deadlock which in practice in more
disruptive that the issue this commit intended to address.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#14775
Closes openzfs#14790
bsdjhb pushed a commit to bsdjhb/cheribsd that referenced this issue Jun 21, 2023
This reverts commit 4c856fb.

To quote a pending upstream PR:
This reverts commit 4c856fb to resolve a newly introduced deadlock which
in practice is more disruptive that the issue this commit intended to
address.

Causes deadlocks described in openzfs/zfs#14775

Sponsored by:	Rubicon Communications, LLC ("Netgate")
@tuxoko
Contributor

tuxoko commented Jul 20, 2023

Just want to note that this deadlock is still there.
The only difference is that it now involves zl_issuer_lock instead of zl_suspend_lock.

@behlendorf
Contributor

PR #15103 was merged today to hopefully address this.
