
Deadlock or missed wakeup with heavy I/O, file deletion, and snapshot creation on Debian Linux/ppc64le #11527

Closed
nwf opened this issue Jan 26, 2021 · 5 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@nwf
Contributor

nwf commented Jan 26, 2021

System information

Type Version/Name
Distribution Name Debian
Distribution Version 10.7
Linux Kernel 4.19.0-13-powerpc64le
Architecture powerpc64le, dual socket, 22 cores per socket
ZFS Version 0.8.6-1~bpo10+1
SPL Version same as ZFS

There are two zpools in this machine: one is a single NVMe device and holds the root filesystem, while the other consists of five spinning rust drives and an SSD l2arc. The spinning rust pool is the one of concern; its drives are fully given over to dmcrypt and thence to ZFS, while the l2arc device has no dmcrypt intermediation.

Describe the problem you're observing

When doing a large, ccache-enabled build that generates lots of I/O to the spinning rust zpool, zfs-auto-snapshot's zfs snapshot commands, also targeting that pool, appear to be able to bring the system to a standstill. It seems that txg_quiesce goes to sleep and stays there (in D state); on what, I cannot say. txg_sync is asleep but in S state.

Please find attached a list of most of the threads involved in zfs or spl at the time of deadlock; for each, I have captured /proc/$PID/status, /proc/$PID/cmdline (with NULs replaced by spaces via tr '\000' ' '), and /proc/$PID/stack. I have removed all threads that appeared to be unrelated or were merely idle task queues awaiting work. If there's additional information I could provide, please do not hesitate to ask; if I do not have it at present, I shall attempt to capture it the next time the issue rears its ugly head.
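For reference, the capture amounts to dumping three procfs files per thread; a minimal standalone helper for that (hypothetical, not part of any ZFS tooling) might look like:

/* dumpthread.c - hypothetical helper: print /proc/<pid>/{status,cmdline,stack}
 * for one PID, roughly matching the capture described above. Reading
 * /proc/<pid>/stack requires root. */
#include <stdio.h>

static void
dump(const char *pid, const char *file, int nul_to_space)
{
	char path[64];
	FILE *f;
	int c;

	snprintf(path, sizeof (path), "/proc/%s/%s", pid, file);
	if ((f = fopen(path, "r")) == NULL)
		return;
	printf("==> %s <==\n", path);
	while ((c = fgetc(f)) != EOF)
		putchar(nul_to_space && c == '\0' ? ' ' : c);
	putchar('\n');
	fclose(f);
}

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return (1);
	}
	dump(argv[1], "status", 0);
	dump(argv[1], "cmdline", 1);	/* like tr '\000' ' ' */
	dump(argv[1], "stack", 0);
	return (0);
}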

Describe how to reproduce the problem

I'm unsure how anyone else could reproduce the problem, but the recipe above triggers it on this machine roughly once per month on average (though it has happened twice this week, so that's exciting!). I've told zfs-auto-snapshot not to take hourly snapshots, in hopes that whatever's being tickled is tickled less often.

FWIW, this happened as well with 0.7.5 before I moved up to the 0.8 series.

@nwf nwf added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Jan 26, 2021
@pcd1193182
Contributor

I took an initial look at this, and it looks like the quiesce thread is actually in txg_quiesce (which is inlined in the particular module being run), and is currently in the "Quiesce the transaction group by waiting for everyone to txg_exit()" loop. This would suggest that some tx_commit somewhere was missed, or that there is a bug with the tc_count accounting. One thing that would help in verifying this assumption is the contents of /proc/spl/kstat/zfs/dozer/txgs; if this analysis is correct so far, there should be two txgs in open context (one accepting new tx's, and the other still attempting to enter the quiesced state, with otime set).
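For context, the wait in question looks roughly like this (a simplified sketch following txg.c; not the verbatim 0.8.6 source). If a tx_commit is lost, or tc_count is otherwise over-counted, the counter for the quiescing txg never reaches zero and the cv_wait never returns:

/*
 * Simplified sketch of the wait in txg_quiesce(); names follow txg.c,
 * but this is not the verbatim source.
 */
static void
txg_quiesce(dsl_pool_t *dp, uint64_t txg)
{
	tx_state_t *tx = &dp->dp_tx;
	int g = txg & TXG_MASK;
	int c;

	/* ... tx_open_txg has already been advanced past 'txg' ... */

	/*
	 * Quiesce the transaction group by waiting for everyone
	 * to txg_exit().
	 */
	for (c = 0; c < max_ncpus; c++) {
		tx_cpu_t *tc = &tx->tx_cpu[c];

		mutex_enter(&tc->tc_lock);
		while (tc->tc_count[g] != 0)	/* a missed tx_commit keeps this nonzero */
			cv_wait(&tc->tc_cv[g], &tc->tc_lock);
		mutex_exit(&tc->tc_lock);
	}
}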

@pcd1193182
Contributor

The stack for pid 1437 reveals the culprit.

[<0>] 0xc00000188ff92980
[<0>] __switch_to+0x1dc/0x2d0
[<0>] cv_wait_common+0x110/0x200 [spl]
[<0>] txg_wait_open+0x12c/0x190 [zfs]
[<0>] dmu_free_long_range+0x514/0x630 [zfs]
[<0>] zfs_rmnode+0x2bc/0x410 [zfs]
[<0>] zfs_zinactive+0x164/0x1c0 [zfs]
[<0>] zfs_inactive+0x88/0x2e0 [zfs]
[<0>] zpl_evict_inode+0x68/0xb0 [zfs]
[<0>] evict+0x108/0x240
[<0>] zfs_iput_async+0x2c/0xf0 [zfs]
[<0>] zfs_get_done+0x58/0x90 [zfs]
[<0>] dmu_sync_done+0x138/0x3e0 [zfs]
[<0>] arc_write_done+0x94/0x510 [zfs]
[<0>] zio_done+0x328/0x11b8 [zfs]
[<0>] zio_execute+0xe4/0x190 [zfs]
[<0>] taskq_thread+0x354/0x600 [spl]
[<0>] kthread+0x1ac/0x1c0
[<0>] ret_from_kernel_thread+0x5c/0x6c

We're calling iput from zfs_iput_async because we think the i_count is greater than 0, but if we race with another iput call, we can end up in the evict logic, which can easily deadlock. The solution here is to do the decrement ourselves, unless it would make the count 0; in that case, we dispatch to a taskq and let it do the eviction (sketched below). I will file a PR once I've tested the code.
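A sketch of that approach (illustrative only; names follow the newer zfs_zrele_async code path, and the taskq accessor here is assumed rather than taken from the final patch):

/*
 * Illustrative sketch, not the actual patch: only drop the reference
 * inline if we cannot be the final holder; otherwise hand the iput off
 * to a taskq so inode eviction never runs in this (sync) context.
 */
static void
zfs_zrele_async(znode_t *zp)
{
	struct inode *ip = ZTOI(zp);
	objset_t *os = ITOZSB(ip)->z_os;

	ASSERT(atomic_read(&ip->i_count) > 0);

	/*
	 * atomic_add_unless(&ip->i_count, -1, 1) decrements i_count unless
	 * it is currently 1, and returns nonzero iff it decremented. If it
	 * refused, we would have been the last holder, so dispatch iput to
	 * the pool's taskq instead of evicting here.
	 */
	if (!atomic_add_unless(&ip->i_count, -1, 1)) {
		VERIFY(taskq_dispatch(dsl_pool_zrele_taskq(dmu_objset_pool(os)),
		    (task_func_t *)iput, ip, TQ_SLEEP) != TASKQID_INVALID);
	}
}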

@behlendorf behlendorf removed the Status: Triage Needed New issue which needs to be triaged label Jan 27, 2021
@nwf
Contributor Author

nwf commented Jan 27, 2021

Closing in favor of #11530; will reopen if it crops up in a different guise. Thanks again, @pcd1193182.

@nwf nwf closed this as completed Jan 27, 2021
behlendorf pushed a commit that referenced this issue Jan 28, 2021
There is a race condition in zfs_zrele_async when we are checking if 
we would be the one to evict an inode. This can lead to a txg sync 
deadlock.

Instead of calling into iput directly, we attempt to perform the atomic 
decrement ourselves, unless that would set the i_count value to zero. 
In that case, we dispatch a call to iput to run later, to prevent a 
deadlock from occurring.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #11527 
Closes #11530
@nwf
Contributor Author

nwf commented Feb 11, 2021

I think I have correctly applied the fix from 43eaef6 to my local dkms tree, and yet I am still seeing hangs. :(

The pool's txgs log ends with a quiescing txg:

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
10499902 226742099057673  C     126515712    6803456      201682944    526      15607    6960401391   116951       8336272869   8658410713  
10499903 226749059459064  C     135820800    16384        122187776    4        13068    8336450721   90400        8658312965   7551185072  
10499904 226757395909785  Q     0            0            0            0        0        8658810027   0            0            0           
10499905 226766054719812  O     0            0            0            0        0        0            0            0            0      

Here's the kstack of every thread in the system parked in ZFS: zfs-hang.txt

/proc/spl/kstat/zfs/dbgmsg ends with:

1613056624   spa_history.c:306:spa_history_log_sync(): command: zfs snapshot -o com.sun:auto-snapshot-desc=- -r dozer/nwf@zfs-auto-snap_hourly-2021-02-11-1517
1613056624   zcp.c:656:zcp_debug(): txg 10499445 ZCP: snap: dozer/nwf@zfs-auto-snap_hourly-2021-02-10-1517 errno: 0
1613056624   spa_history.c:314:spa_history_log_sync(): txg 10499445 destroy dozer/nwf@zfs-auto-snap_hourly-2021-02-10-1517 (id 4159) 
1613056624   dsl_scan.c:3325:dsl_process_async_destroys(): freed 434 blocks in 1ms from free_bpobj/bptree txg 10499445; err=0
1613056625   spa_history.c:339:spa_history_log_sync(): ioctl destroy_snaps
1613056625   spa_history.c:306:spa_history_log_sync(): command: zfs destroy -d -r dozer/nwf@zfs-auto-snap_hourly-2021-02-10-1517
1613056625   spa_history.c:314:spa_history_log_sync(): txg 10499446 snapshot dozer/openwrt@zfs-auto-snap_hourly-2021-02-11-1517 (id 2871) 
1613056625   spa_history.c:314:spa_history_log_sync(): txg 10499446 set dozer/openwrt@zfs-auto-snap_hourly-2021-02-11-1517 (id 2871) com.sun:auto-snapshot-desc=-
1613056626   spa_history.c:339:spa_history_log_sync(): ioctl snapshot
1613056626   spa_history.c:306:spa_history_log_sync(): command: zfs snapshot -o com.sun:auto-snapshot-desc=- -r dozer/openwrt@zfs-auto-snap_hourly-2021-02-11-1517
1613056626   zcp.c:656:zcp_debug(): txg 10499447 ZCP: snap: dozer/openwrt@zfs-auto-snap_hourly-2021-02-10-1517 errno: 0
1613056626   spa_history.c:314:spa_history_log_sync(): txg 10499447 destroy dozer/openwrt@zfs-auto-snap_hourly-2021-02-10-1517 (id 21376) 
1613056631   spa_history.c:339:spa_history_log_sync(): ioctl destroy_snaps
1613056631   spa_history.c:306:spa_history_log_sync(): command: zfs destroy -d -r dozer/openwrt@zfs-auto-snap_hourly-2021-02-10-1517
1613058036   metaslab.c:2742:metaslab_condense(): condensing: txg 10499701, msp[71] c000001e0fb77800, vdev id 0, spa dozer, smp size 4077392, segments 254201, forcing condense=FALSE
1613058753   metaslab.c:2742:metaslab_condense(): condensing: txg 10499819, msp[60] c000001e0fb7f000, vdev id 0, spa dozer, smp size 2333840, segments 145362, forcing condense=FALSE
1613058788   metaslab.c:2742:metaslab_condense(): condensing: txg 10499822, msp[61] c000001e0fb7c800, vdev id 0, spa dozer, smp size 3156624, segments 196415, forcing condense=FALSE

Note that there is, indeed, a snapshot destroy operation in flight, though, as @pcd1193182 has suggested, that may just be increasing the frequency of whatever race is going on rather than being the actual cause.

@nwf nwf reopened this Feb 11, 2021
@nwf
Contributor Author

nwf commented Feb 11, 2021

Nope, my fault: I had rebuilt the module and rebooted, but had failed to update the initramfs, so the old module was still being loaded. Sorry for the noise!

@nwf nwf closed this as completed Feb 11, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this issue May 31, 2021