write IO hangs when uiomove triggers page fault in zfs_write context. #7512

Closed
YanyunGao opened this issue May 7, 2018 · 9 comments
Labels
Status: Stale No recent activity for issue

Comments

@YanyunGao
Contributor

YanyunGao commented May 7, 2018

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.1
Linux Kernel 3.10.0-514.26.2.el7.x86_64
Architecture
ZFS Version 0.6.5.6
SPL Version 0.6.5.6

Describe the problem you're observing

Write I/O hangs on a pool, with many backtraces in the system log. One write request is stuck in a page_fault context, which prevents the quiesce thread from ever completing. As a result, the following txg can never be synced.

Describe how to reproduce the problem

We hit the problem twice in one week, but have no reliable way to reproduce it yet.
The workload mixes mmap and regular read/write operations. We noticed that the problem seems easier to reproduce when there is not much free memory left.

Include any warning/errors/backtraces from the system logs

May 7 14:39:28 xt4 kernel: ffff88074283b918 0000000000000086 ffff880814623ec0 ffff88074283bfd8
May 7 14:39:28 xt4 kernel: ffff88074283bfd8 ffff88074283bfd8 ffff880814623ec0 ffff880814623ec0
May 7 14:39:28 xt4 kernel: ffff88046d70de38 ffffffffffffffff ffff88046d70de40 ffff88046d70ddc0
May 7 14:39:28 xt4 kernel: Call Trace:
May 7 14:39:28 xt4 kernel: [] schedule+0x29/0x70
May 7 14:39:28 xt4 kernel: [] rwsem_down_read_failed+0xf5/0x170
May 7 14:39:28 xt4 kernel: [] call_rwsem_down_read_failed+0x18/0x30
May 7 14:39:28 xt4 kernel: [] ? spl_kmem_zalloc+0xc0/0x170 [spl]
May 7 14:39:28 xt4 kernel: [] down_read+0x20/0x30
May 7 14:39:28 xt4 kernel: [] __do_page_fault+0x380/0x450
May 7 14:39:28 xt4 kernel: [] do_page_fault+0x35/0x90
May 7 14:39:28 xt4 kernel: [] page_fault+0x28/0x30
May 7 14:39:28 xt4 kernel: [] ? spl_kmem_zalloc+0xc0/0x170 [spl]
May 7 14:39:28 xt4 kernel: [] ? copy_user_enhanced_fast_string+0x9/0x20
May 7 14:39:28 xt4 kernel: [] ? uiomove+0x155/0x2b0 [zcommon]
May 7 14:39:28 xt4 kernel: [] dmu_write_uio_dnode+0x90/0x150 [zfs]
May 7 14:39:28 xt4 kernel: [] dmu_write_uio_dbuf+0x4f/0x70 [zfs]
May 7 14:39:28 xt4 kernel: [] zfs_write+0xb31/0xc40 [zfs]
May 7 14:39:28 xt4 kernel: [] ? nvlist_lookup_common.part.71+0xa2/0xb0 [znvpair]
May 7 14:39:28 xt4 kernel: [] ? nvlist_lookup_byte_array+0x26/0x30 [znvpair]
May 7 14:39:28 xt4 kernel: [] ? strfree+0xe/0x10 [spl]
May 7 14:39:28 xt4 kernel: [] ? vfs_getxattr+0x88/0xb0
May 7 14:39:28 xt4 kernel: [] zpl_write_common_iovec.constprop.8+0x95/0x100 [zfs]
May 7 14:39:28 xt4 kernel: [] zpl_write+0x7e/0xb0 [zfs]
May 7 14:39:28 xt4 kernel: [] vfs_write+0xbd/0x1e0
May 7 14:39:28 xt4 kernel: [] SyS_pwrite64+0x92/0xc0
May 7 14:39:28 xt4 kernel: [] system_call_fastpath+0x16/0x1b

May 7 14:39:28 xt4 kernel: glusterfsd D 0000000000000000 0 28771 29196 0x00000080
May 7 14:39:28 xt4 kernel: ffff880561ffb9b8 0000000000000086 ffff8808652b5e20 ffff880561ffbfd8
May 7 14:39:28 xt4 kernel: ffff880561ffbfd8 ffff880561ffbfd8 ffff8808652b5e20 ffff88046f406b68
May 7 14:39:28 xt4 kernel: ffff88046f406a20 ffff88046f406b70 ffff88046f406a48 0000000000000000
May 7 14:39:28 xt4 kernel: Call Trace:
May 7 14:39:28 xt4 kernel: [] schedule+0x29/0x70
May 7 14:39:28 xt4 kernel: [] cv_wait_common+0x125/0x150 [spl]
May 7 14:39:28 xt4 kernel: [] ? wake_up_atomic_t+0x30/0x30
May 7 14:39:28 xt4 kernel: [] __cv_wait+0x15/0x20 [spl]
May 7 14:39:28 xt4 kernel: [] txg_wait_open+0xc3/0x110 [zfs]
May 7 14:39:28 xt4 kernel: [] dmu_tx_wait+0x3a8/0x3c0 [zfs]
May 7 14:39:28 xt4 kernel: [] dmu_tx_assign+0x9a/0x510 [zfs]
May 7 14:39:28 xt4 kernel: [] zfs_dirty_inode+0xf7/0x320 [zfs]
May 7 14:39:28 xt4 kernel: [] ? dequeue_entity+0x11c/0x5d0
May 7 14:39:28 xt4 kernel: [] ? __d_lookup+0x120/0x160
May 7 14:39:28 xt4 kernel: [] ? dequeue_task_fair+0x41e/0x660
May 7 14:39:28 xt4 kernel: [] ? sched_clock_cpu+0x85/0xc0
May 7 14:39:28 xt4 kernel: [] ? __switch_to+0x15c/0x4c0
May 7 14:39:28 xt4 kernel: [] zpl_dirty_inode+0x2c/0x40 [zfs]
May 7 14:39:28 xt4 kernel: [] __mark_inode_dirty+0xca/0x290
May 7 14:39:28 xt4 kernel: [] update_time+0x81/0xd0
May 7 14:39:28 xt4 kernel: [] ? __sb_start_write+0x58/0x110
May 7 14:39:28 xt4 kernel: [] file_update_time+0xa0/0xf0
May 7 14:39:28 xt4 kernel: [] filemap_page_mkwrite+0x40/0xb0
May 7 14:39:28 xt4 kernel: [] do_page_mkwrite+0x54/0xa0
May 7 14:39:28 xt4 kernel: [] do_shared_fault.isra.46+0x77/0x1e0
May 7 14:39:28 xt4 kernel: [] handle_mm_fault+0x61e/0x1000
May 7 14:39:28 xt4 kernel: [] ? do_futex+0x122/0x5b0
May 7 14:39:28 xt4 kernel: [] __do_page_fault+0x154/0x450
May 7 14:39:28 xt4 kernel: [] do_page_fault+0x35/0x90
May 7 14:39:28 xt4 kernel: [] page_fault+0x28/0x30

@gmelikov
Member

gmelikov commented May 7, 2018

@YanyunGao please try the latest stable version, 0.7.8, first.

@javenwu
Contributor

javenwu commented May 7, 2018

@gmelikov Thanks for the quick reply. Could you hint at which potential fixes are relevant to this issue?
From our basic analysis, it seems dmu_write_uio_dnode() can still trigger the page fault in the newest code. As long as the page fault is triggered in the dmu_write_uio_dnode()->uiomove() context, I assume the issue is still there. It's really not easy to reproduce the problem, and we have no way to trigger it consistently. Two weeks ago we hit the problem twice, but we have not triggered it in subsequent testing up to today.

@YanyunGao
Contributor Author

YanyunGao commented May 7, 2018

I noticed that there is a comment in dmu.c, in dmu_write_uio_dnode():

     /*
      * XXX uiomove could block forever (eg.nfs-backed
      * pages).  There needs to be a uiolockdown() function
      * to lock the pages in memory, so that uiomove won't
      * block.
      */
     err = uiomove((char *)db->db_data + bufoff, tocpy,
         UIO_WRITE, uio);

It seems we hit exactly that case: uiomove blocks forever. The issue report lists two backtraces, one from the page_fault context and one from the write syscall context. The page_fault context holds the semaphore but is stuck in txg_wait_open (waiting for the quiescing txg to complete; unfortunately, that txg never completes because of the other backtrace). The write syscall context has increased tc_count of the quiescing txg but fails to down-read the semaphore, which in turn blocks the quiesce thread. It looks like a deadlock.
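To make the circular wait easier to see, here is a minimal toy model in plain pthreads (not ZFS code; the mmap semaphore is simplified to a mutex and the txg to a condition flag). Thread B stands in for the page_fault context and thread A for the write syscall context; running it hangs on the same dependency cycle.

---%<---
/* Toy model of the reported cycle: NOT ZFS code, only the lock/wait shape. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t mm_lock  = PTHREAD_MUTEX_INITIALIZER; /* ~ mm_sem        */
static pthread_mutex_t txg_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  txg_done = PTHREAD_COND_INITIALIZER;  /* ~ txg sync done */
static int txg_complete = 0;

/* ~ page_fault context: holds mm_sem, then waits for the txg to complete. */
static void *fault_ctx(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mm_lock);             /* fault handler took mm_sem */
    pthread_mutex_lock(&txg_lock);
    while (!txg_complete)                     /* zfs_dirty_inode -> txg_wait_open */
        pthread_cond_wait(&txg_done, &txg_lock);
    pthread_mutex_unlock(&txg_lock);
    pthread_mutex_unlock(&mm_lock);
    return NULL;
}

/* ~ write syscall context: holds the txg open, then a page fault needs mm_sem. */
static void *write_ctx(void *arg)
{
    (void)arg;
    sleep(1);                                 /* let fault_ctx grab mm_lock first */
    pthread_mutex_lock(&mm_lock);             /* uiomove page fault: blocks here  */
    pthread_mutex_unlock(&mm_lock);

    pthread_mutex_lock(&txg_lock);            /* never reached: the txg never syncs */
    txg_complete = 1;
    pthread_cond_signal(&txg_done);
    pthread_mutex_unlock(&txg_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&b, NULL, fault_ctx, NULL);
    pthread_create(&a, NULL, write_ctx, NULL);
    pthread_join(a, NULL);                    /* hangs forever */
    pthread_join(b, NULL);
    printf("never printed\n");
    return 0;
}
---%<---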

@gmelikov
Member

gmelikov commented May 7, 2018

@javenwu

We noticed that the problem could be easy to reproduce when there is no too much memory left.

The 0.7 branch has a large set of changes, and one of them is ABD (7657def). If the problem is easier to reproduce when there is not much free memory left, then this patch is a must-have.

And, secondly, I think it's good practice to try the fresh stable version first.

You can look at #7339 too, it looks similar.

@YanyunGao
Contributor Author

@gmelikov I noticed that there is a comment in #7339; what it fixed is that calling uiomove() in mappedread() under the page lock can result in a deadlock if the user-space page needs to be faulted in.

@YanyunGao
Contributor Author

@trisk @behlendorf

@javenwu
Contributor

javenwu commented May 8, 2018

Thanks for pointing out #7339; it's relevant, but it won't fix our problem. The interesting thing is that we may be hitting the same symptoms that removing the prefault code would cause, which @trisk mentioned in the comments of #7339. In our case prefault is definitely called, so I think the page is being evicted after the prefault. I saw there is a proposal to create a separate issue to track this kind of problem.

@javenwu
Contributor

javenwu commented Aug 17, 2018

After diving into the details of the problem, we now understand the issue more clearly. This is a deadlock case. The deadlock is on the mm_sem shared by two thread contexts of one process. The application has two threads: one is doing a write syscall, the other is manipulating a memory address that is mapped to a file.
Let's call the write syscall thread context #1, and the page fault context triggered by accessing the file-mapped memory address context #2.
The time sequence is as follows:

  1. Context #1 enters zfs_write() and gets txg "n" assigned.
  2. Context #2, with mm_sem held, fails to get a txg assigned in zfs_dirty_inode and waits for the previous txg "n" to complete.
  3. In context #1, a prefaulted page is faulted again in uiomove. The page fault handler is suspended on mm_sem (the mm_sem of the process is already held by context #2), so txg "n" cannot complete.
  4. Context #2 will never get txg "n+1" assigned because txg "n" will never complete.

So context #1 and context #2 fall into the deadlock.

If my understanding is correct, the prefaulted page being faulted again in step 3 is the culprit of the deadlock.
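Assuming that sequence is right, a workload of the following shape in a single process should at least open the window. This is only an illustrative sketch, not a confirmed reproducer; the file paths are placeholders for files on the affected dataset.

---%<---
/* Sketch only: two threads in one process, mirroring context #1 and #2. */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUFLEN (1 << 20)

/* ~ context #1: pwrite() from a buffer whose pages may not be resident. */
static void *writer(void *arg)
{
    int fd = open("/tank/fs/plainfile", O_WRONLY | O_CREAT, 0644);
    char *buf = mmap(NULL, BUFLEN, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    (void)arg;
    for (;;) {
        pwrite(fd, buf, BUFLEN, 0);          /* zfs_write -> uiomove */
        madvise(buf, BUFLEN, MADV_DONTNEED); /* drop the buffer pages so the
                                                next write must fault them in */
    }
    return NULL;
}

/* ~ context #2: dirty a shared file mapping, entering page_mkwrite. */
static void *mmap_dirtier(void *arg)
{
    int fd = open("/tank/fs/mappedfile", O_RDWR | O_CREAT, 0644);
    char *map;
    size_t off;

    (void)arg;
    ftruncate(fd, BUFLEN);
    map = mmap(NULL, BUFLEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    for (;;)
        for (off = 0; off < BUFLEN; off += 4096)
            map[off]++;                      /* page_mkwrite -> zfs_dirty_inode */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, mmap_dirtier, NULL);
    pthread_join(t1, NULL);
    return 0;
}
---%<---

In our real workload the pressure on the write buffer's pages comes from the rest of the system rather than madvise(), but the effect should be the same: the pages prefaulted at the start of zfs_write() may no longer be resident by the time uiomove copies from them.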
I noticed that there is a function __probe_kernel_read() in the kernel:
---%<---
long __probe_kernel_read(void *dst, const void *src, size_t size)
{
    long ret;
    mm_segment_t old_fs = get_fs();

    set_fs(KERNEL_DS);
    pagefault_disable();
    ret = __copy_from_user_inatomic(dst,
            (__force const void __user *)src, size);
    pagefault_enable();
    set_fs(old_fs);

    return ret ? -EFAULT : 0;
}
---%<---
I was thinking that if we modified the uiomove implementation a little bit, doing the copy_x_user inside a pagefault-disabled section and returning EFAULT if the page is not resident, then when EFAULT is returned in the zfs_write() context we could unassign txg "n", invoke the prefault again, and then reassign and retry. In theory, all pages used by uiomove would then already have been faulted in by the prefault.
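Roughly, the retry path I have in mind in zfs_write() would look like the fragment below. This is only a sketch of the idea, not a patch: the uio_fault_disable flag and the uio_prefaultpages() call are assumed names for "copy without faulting" and "fault the user pages in", and error handling is simplified.

---%<---
/* Sketch only: the copy step inside zfs_write()'s per-chunk loop.   */
/* At this point txg "n" is held via dmu_tx_assign(tx, TXG_WAIT).    */

uio->uio_fault_disable = B_TRUE;   /* assumed flag: uiomove must not fault */
error = dmu_write_uio_dbuf(sa_get_db(zp->z_sa_hdl), uio, nbytes, tx);
uio->uio_fault_disable = B_FALSE;

if (error == EFAULT) {
    /*
     * A user page was not resident and must not be faulted in while
     * txg "n" is held.  Drop the txg hold first so the quiesce thread
     * is not blocked, fault the pages in with no txg held, then retry
     * this chunk.  (A real patch would also need to bail out on an
     * address that can never be faulted in.)
     */
    dmu_tx_commit(tx);
    uio_prefaultpages(MIN(n, max_blksz), uio);  /* assumed helper */
    continue;
}
---%<---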

What do you guys think?

wgqimut pushed a commit to wgqimut/zfs that referenced this issue Sep 21, 2018
The bug time sequence:
1. context openzfs#1, zfs_write assign a txg "n".
2. In a same process, context openzfs#2, mmap page fault (which means the `mm_sem`
   is hold) occurred, `zfs_dirty_inode` open a txg failed, and wait previous
   txg "n" completed.
3. context openzfs#1 call `uiomove` to write, however page fault is occurred in `uiomove`,
   which means it need `mm_sem`, but `mm_sem` is hold by context openzfs#2, so it
   stuck and can't complete,  then txg "n" will not complete.

So context openzfs#1 and context openzfs#2 trap into the "dead lock".

Signed-off-by: Grady Wong <grady.w@xtaotech.com>
Closes openzfs#7512
@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 25, 2020
@stale stale bot closed this as completed Nov 23, 2020