
Deadlock or missed wakeup with heavy I/O, file deletion, and snapshot creation on Debian Linux/ppc64le #11527

Closed
nwf opened this issue Jan 26, 2021 · 5 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@nwf
Contributor

nwf commented Jan 26, 2021

System information

Type Version/Name
Distribution Name Debian
Distribution Version 10.7
Linux Kernel 4.19.0-13-powerpc64le
Architecture powerpc64le, dual socket, 22 cores per socket
ZFS Version 0.8.6-1~bpo10+1
SPL Version same as ZFS

There are two zpools in this machine: one is a single NVMe device and holds the root filesystem, while the other consists of five spinning rust drives and an SSD l2arc. The spinning rust pool is the one of concern; its drives are fully given over to dmcrypt and thence to ZFS, while the l2arc device has no dmcrypt intermediation.

Describe the problem you're observing

When doing a large, ccache-enabled build that generates lots of I/O to the spinning rust zpool, zfs-auto-snapshot's zfs snapshot commands, also targeting that pool, appear to be able to bring the system to a standstill. It seems that txg_quiesce goes to sleep and stays there (in D state); on what, I cannot say. txg_sync is asleep but in S state.

Please find attached a list of most of the threads involved in zfs or spl at the time of deadlock; for each, I have captured /proc/$PID/status, /proc/$PID/cmdline (with NULs replaced by spaces via tr '\000' ' '), and /proc/$PID/stack. I have removed all threads that appeared to be unrelated or were merely idle task queues awaiting work. If there's additional information I could provide, please do not hesitate to ask; if I do not have it at present, I shall attempt to capture it the next time the issue rears its ugly head.
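For reference, the capture amounts to dumping three procfs files per thread; a minimal standalone helper for that (hypothetical, not part of any ZFS tooling) might look like:

/* dumpthread.c - hypothetical helper: print /proc/<pid>/{status,cmdline,stack}
 * for one PID, roughly matching the capture described above. Reading
 * /proc/<pid>/stack requires root. */
#include <stdio.h>

static void
dump(const char *pid, const char *file, int nul_to_space)
{
	char path[64];
	FILE *f;
	int c;

	snprintf(path, sizeof (path), "/proc/%s/%s", pid, file);
	if ((f = fopen(path, "r")) == NULL)
		return;
	printf("==> %s <==\n", path);
	while ((c = fgetc(f)) != EOF)
		putchar(nul_to_space && c == '\0' ? ' ' : c);
	putchar('\n');
	fclose(f);
}

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return (1);
	}
	dump(argv[1], "status", 0);
	dump(argv[1], "cmdline", 1);	/* like tr '\000' ' ' */
	dump(argv[1], "stack", 0);
	return (0);
}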

Describe how to reproduce the problem

I'm unsure how anyone else could reproduce the problem, but the recipe above triggers it on this machine roughly once per month on average (though it has happened twice this week, so that's exciting!). I've told zfs-auto-snapshot not to take hourly snapshots, in hopes that whatever's being tickled is tickled less often.

FWIW, this happened as well with 0.7.5 before I moved up to the 0.8 series.

@nwf nwf added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Jan 26, 2021
@pcd1193182
Contributor

I took an initial look at this, and it looks like the quiesce thread is actually in txg_quiesce (which is inlined in the particular module being run), and is currently in the "Quiesce the transaction group by waiting for everyone to txg_exit()" loop. This would suggest that some tx_commit somewhere was missed, or that there is a bug with the tc_count accounting. One thing that would help in verifying this assumption is the contents of /proc/spl/kstat/zfs/dozer/txgs; if this analysis is correct so far, there should be two txgs in open context (one accepting new tx's, and the other still attempting to enter the quiesced state, with otime set).
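For context, the wait in question looks roughly like this (a simplified sketch following txg.c; not the verbatim 0.8.6 source). If a tx_commit is lost, or tc_count is otherwise over-counted, the counter for the quiescing txg never reaches zero and the cv_wait never returns:

/*
 * Simplified sketch of the wait in txg_quiesce(); names follow txg.c,
 * but this is not the verbatim source.
 */
static void
txg_quiesce(dsl_pool_t *dp, uint64_t txg)
{
	tx_state_t *tx = &dp->dp_tx;
	int g = txg & TXG_MASK;
	int c;

	/* ... tx_open_txg has already been advanced past 'txg' ... */

	/*
	 * Quiesce the transaction group by waiting for everyone
	 * to txg_exit().
	 */
	for (c = 0; c < max_ncpus; c++) {
		tx_cpu_t *tc = &tx->tx_cpu[c];

		mutex_enter(&tc->tc_lock);
		while (tc->tc_count[g] != 0)	/* a missed tx_commit keeps this nonzero */
			cv_wait(&tc->tc_cv[g], &tc->tc_lock);
		mutex_exit(&tc->tc_lock);
	}
}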

@pcd1193182
Contributor

The stack for pid 1437 reveals the culprit.

[<0>] 0xc00000188ff92980
[<0>] __switch_to+0x1dc/0x2d0
[<0>] cv_wait_common+0x110/0x200 [spl]
[<0>] txg_wait_open+0x12c/0x190 [zfs]
[<0>] dmu_free_long_range+0x514/0x630 [zfs]
[<0>] zfs_rmnode+0x2bc/0x410 [zfs]
[<0>] zfs_zinactive+0x164/0x1c0 [zfs]
[<0>] zfs_inactive+0x88/0x2e0 [zfs]
[<0>] zpl_evict_inode+0x68/0xb0 [zfs]
[<0>] evict+0x108/0x240
[<0>] zfs_iput_async+0x2c/0xf0 [zfs]
[<0>] zfs_get_done+0x58/0x90 [zfs]
[<0>] dmu_sync_done+0x138/0x3e0 [zfs]
[<0>] arc_write_done+0x94/0x510 [zfs]
[<0>] zio_done+0x328/0x11b8 [zfs]
[<0>] zio_execute+0xe4/0x190 [zfs]
[<0>] taskq_thread+0x354/0x600 [spl]
[<0>] kthread+0x1ac/0x1c0
[<0>] ret_from_kernel_thread+0x5c/0x6c

We're calling iput from zfs_iput_async because we think the i_count is greater than 0, but if we race with another iput call, we can end up in the evict logic, which can easily deadlock. The solution here is to do the decrement ourselves, unless it would make the count 0; in that case, we dispatch to a taskq and let it do the eviction (sketched below). I will file a PR once I've tested the code.
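A sketch of that approach (illustrative only; names follow the newer zfs_zrele_async code path, and the taskq accessor here is assumed rather than taken from the final patch):

/*
 * Illustrative sketch, not the actual patch: only drop the reference
 * inline if we cannot be the final holder; otherwise hand the iput off
 * to a taskq so inode eviction never runs in this (sync) context.
 */
static void
zfs_zrele_async(znode_t *zp)
{
	struct inode *ip = ZTOI(zp);
	objset_t *os = ITOZSB(ip)->z_os;

	ASSERT(atomic_read(&ip->i_count) > 0);

	/*
	 * atomic_add_unless(&ip->i_count, -1, 1) decrements i_count unless
	 * it is currently 1, and returns nonzero iff it decremented. If it
	 * refused, we would have been the last holder, so dispatch iput to
	 * the pool's taskq instead of evicting here.
	 */
	if (!atomic_add_unless(&ip->i_count, -1, 1)) {
		VERIFY(taskq_dispatch(dsl_pool_zrele_taskq(dmu_objset_pool(os)),
		    (task_func_t *)iput, ip, TQ_SLEEP) != TASKQID_INVALID);
	}
}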

@behlendorf behlendorf removed the Status: Triage Needed New issue which needs to be triaged label Jan 27, 2021
@nwf
Contributor Author

nwf commented Jan 27, 2021

Closing in favor of #11530; will reopen if it crops up in a different guise. Thanks again, @pcd1193182.

@nwf nwf closed this as completed Jan 27, 2021
behlendorf pushed a commit that referenced this issue Jan 28, 2021
There is a race condition in zfs_zrele_async when we are checking if 
we would be the one to evict an inode. This can lead to a txg sync 
deadlock.

Instead of calling into iput directly, we attempt to perform the atomic 
decrement ourselves, unless that would set the i_count value to zero. 
In that case, we dispatch a call to iput to run later, to prevent a 
deadlock from occurring.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11527 
Closes #11530
@nwf
Contributor Author

nwf commented Feb 11, 2021

I think I have correctly applied the fix from 43eaef6 to my local dkms tree, and yet I am still seeing hangs. :(

The pool's txgs log ends with a quiescing txg:

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
10499902 226742099057673  C     126515712    6803456      201682944    526      15607    6960401391   116951       8336272869   8658410713  
10499903 226749059459064  C     135820800    16384        122187776    4        13068    8336450721   90400        8658312965   7551185072  
10499904 226757395909785  Q     0            0            0            0        0        8658810027   0            0            0           
10499905 226766054719812  O     0            0            0            0        0        0            0            0            0      

Here's the kstack of every thread in the system parked in ZFS: zfs-hang.txt

/proc/spl/kstat/zfs/dbgmsg ends with:

1613056624   spa_history.c:306:spa_history_log_sync(): command: zfs snapshot -o com.sun:auto-snapshot-desc=- -r dozer/nwf@zfs-auto-snap_hourly-2021-02-11-1517
1613056624   zcp.c:656:zcp_debug(): txg 10499445 ZCP: snap: dozer/nwf@zfs-auto-snap_hourly-2021-02-10-1517 errno: 0
1613056624   spa_history.c:314:spa_history_log_sync(): txg 10499445 destroy dozer/nwf@zfs-auto-snap_hourly-2021-02-10-1517 (id 4159) 
1613056624   dsl_scan.c:3325:dsl_process_async_destroys(): freed 434 blocks in 1ms from free_bpobj/bptree txg 10499445; err=0
1613056625   spa_history.c:339:spa_history_log_sync(): ioctl destroy_snaps
1613056625   spa_history.c:306:spa_history_log_sync(): command: zfs destroy -d -r dozer/nwf@zfs-auto-snap_hourly-2021-02-10-1517
1613056625   spa_history.c:314:spa_history_log_sync(): txg 10499446 snapshot dozer/openwrt@zfs-auto-snap_hourly-2021-02-11-1517 (id 2871) 
1613056625   spa_history.c:314:spa_history_log_sync(): txg 10499446 set dozer/openwrt@zfs-auto-snap_hourly-2021-02-11-1517 (id 2871) com.sun:auto-snapshot-desc=-
1613056626   spa_history.c:339:spa_history_log_sync(): ioctl snapshot
1613056626   spa_history.c:306:spa_history_log_sync(): command: zfs snapshot -o com.sun:auto-snapshot-desc=- -r dozer/openwrt@zfs-auto-snap_hourly-2021-02-11-1517
1613056626   zcp.c:656:zcp_debug(): txg 10499447 ZCP: snap: dozer/openwrt@zfs-auto-snap_hourly-2021-02-10-1517 errno: 0
1613056626   spa_history.c:314:spa_history_log_sync(): txg 10499447 destroy dozer/openwrt@zfs-auto-snap_hourly-2021-02-10-1517 (id 21376) 
1613056631   spa_history.c:339:spa_history_log_sync(): ioctl destroy_snaps
1613056631   spa_history.c:306:spa_history_log_sync(): command: zfs destroy -d -r dozer/openwrt@zfs-auto-snap_hourly-2021-02-10-1517
1613058036   metaslab.c:2742:metaslab_condense(): condensing: txg 10499701, msp[71] c000001e0fb77800, vdev id 0, spa dozer, smp size 4077392, segments 254201, forcing condense=FALSE
1613058753   metaslab.c:2742:metaslab_condense(): condensing: txg 10499819, msp[60] c000001e0fb7f000, vdev id 0, spa dozer, smp size 2333840, segments 145362, forcing condense=FALSE
1613058788   metaslab.c:2742:metaslab_condense(): condensing: txg 10499822, msp[61] c000001e0fb7c800, vdev id 0, spa dozer, smp size 3156624, segments 196415, forcing condense=FALSE

Note that there is, indeed, a snapshot destroy operation in flight, though, as @pcd1193182 has suggested, that may just be increasing the frequency of whatever race is going on rather than being the actual cause.

@nwf nwf reopened this Feb 11, 2021
@nwf
Contributor Author

nwf commented Feb 11, 2021

Nope, my fault: I had rebuilt the module and rebooted, but had failed to update the initramfs, so the old module was still being loaded. Sorry for the noise!

@nwf nwf closed this as completed Feb 11, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this issue May 31, 2021