Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zfs diff lockup #2989

Closed
cwedgwood opened this issue Dec 24, 2014 · 3 comments
Closed

zfs diff lockup #2989

cwedgwood opened this issue Dec 24, 2014 · 3 comments

Comments

@cwedgwood
Copy link
Contributor

Using latest code (as of Dec 21, 2014) and 3.17.7' :

zfs dfif foo@bar for some foo@bar will cause the following.

Dec 21 19:00:09 kernel: [  202.937213] zpool (3175) used greatest stack depth: 11136 bytes left
Dec 21 19:13:17 kernel: [  991.696105] INFO: task txg_sync:3588 blocked for more than 45 seconds.
Dec 21 19:13:17 kernel: [  991.702701]       Tainted: P           O   3.17.7.cw0 #20
Dec 21 19:13:17 kernel: [  991.708383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 21 19:13:17 kernel: [  991.716683] txg_sync        D ffff880a3fc12600 12960  3588      2 0x00000000
Dec 21 19:13:17 kernel: [  991.724233]  ffff880a17ef3c20 0000000000000046 0000000000009000 ffff880a18450000
Dec 21 19:13:17 kernel: [  991.732135]  ffff880a17ef3fd8 ffff880a2818b240 ffff880a18450000 ffffffff816e9e48
Dec 21 19:13:17 kernel: [  991.740037]  0000000000000202 ffff880527b2fdd8 ffff880527b2fda8 ffff880527b2fde0
Dec 21 19:13:17 kernel: [  991.747922] Call Trace:
Dec 21 19:13:17 kernel: [  991.750506]  [<ffffffff816e9e48>] ? _raw_spin_lock_irqsave+0x18/0x60
Dec 21 19:13:17 kernel: [  991.757278]  [<ffffffff816e5a39>] schedule+0x29/0x70
Dec 21 19:13:17 kernel: [  991.762606]  [<ffffffffa0c777bd>] cv_wait_common+0x11d/0x280 [spl]
Dec 21 19:13:17 kernel: [  991.769109]  [<ffffffff810b3430>] ? __wake_up_sync+0x20/0x20
Dec 21 19:13:17 kernel: [  991.775117]  [<ffffffffa0c77935>] __cv_wait+0x15/0x20 [spl]
Dec 21 19:13:17 kernel: [  991.781106]  [<ffffffffa0d82305>] rrw_enter_write+0x75/0x130 [zfs]
Dec 21 19:13:17 kernel: [  991.787733]  [<ffffffffa0d823d3>] rrw_enter+0x13/0x30 [zfs]
Dec 21 19:13:17 kernel: [  991.793623]  [<ffffffffa0d8d32f>] spa_sync+0x88f/0xd00 [zfs]
Dec 21 19:13:17 kernel: [  991.799625]  [<ffffffff810a0222>] ? default_wake_function+0x12/0x20
Dec 21 19:13:17 kernel: [  991.806263]  [<ffffffff810b3442>] ? autoremove_wake_function+0x12/0x40
Dec 21 19:13:17 kernel: [  991.813182]  [<ffffffffa0da2569>] txg_sync_thread+0x329/0x610 [zfs]
Dec 21 19:13:17 kernel: [  991.819858]  [<ffffffffa0da2240>] ? txg_thread_wait+0xf0/0xf0 [zfs]
Dec 21 19:13:17 kernel: [  991.826556]  [<ffffffffa0c71373>] thread_generic_wrapper+0x83/0xc0 [spl]
Dec 21 19:13:17 kernel: [  991.833596]  [<ffffffffa0c712f0>] ? __thread_exit+0x20/0x20 [spl]
Dec 21 19:13:17 kernel: [  991.840078]  [<ffffffff81093eeb>] kthread+0xdb/0x100
Dec 21 19:13:17 kernel: [  991.845334]  [<ffffffff81093e10>] ? kthread_create_on_node+0x180/0x180
Dec 21 19:13:17 kernel: [  991.852247]  [<ffffffff816ea4ac>] ret_from_fork+0x7c/0xb0
Dec 21 19:13:17 kernel: [  991.858025]  [<ffffffff81093e10>] ? kthread_create_on_node+0x180/0x180

It seems quite repeatable (though sometimes with a little effort) so suggestions on how to debug this would be most welcome.

@cwedgwood
Copy link
Contributor Author

@behlendorf when this occurred I did sysrq-t and saw 15 or 16 CPUs were in cpuidle, the only one that isn't was:

Dec 21 19:14:01 kernel: [ 1035.560668] NMI backtrace for cpu 1
Dec 21 19:14:01 kernel: [ 1035.560678] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 193.351 msecs
Dec 21 19:14:01 kernel: [ 1035.574612] CPU: 1 PID: 3821 Comm: bash Tainted: P           O   3.17.7.cw0 #20
Dec 21 19:14:01 kernel: [ 1035.582325] Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1b       02/01/2013
Dec 21 19:14:01 kernel: [ 1035.589877] task: ffff8805254c0000 ti: ffff880527594000 task.ti: ffff880527594000
Dec 21 19:14:01 kernel: [ 1035.597829] RIP: 0010:[<ffffffff81373338>]  [<ffffffff81373338>] delay_tsc+0x28/0x60
Dec 21 19:14:01 kernel: [ 1035.606139] RSP: 0018:ffff880527597de0  EFLAGS: 00000803
Dec 21 19:14:01 kernel: [ 1035.611773] RAX: 000000008289d65a RBX: 0000000000001000 RCX: 000000008289d630
Dec 21 19:14:01 kernel: [ 1035.619348] RDX: 0000000000000263 RSI: 0000000000000001 RDI: 000000000003a97a
Dec 21 19:14:01 kernel: [ 1035.626877] RBP: ffff880527597de0 R08: ffffffff81a9d208 R09: 000000000000047c
Dec 21 19:14:01 kernel: [ 1035.634467] R10: ffff8805274b21d4 R11: 000000000000047c R12: 00000000000003e9
Dec 21 19:14:01 kernel: [ 1035.642075] R13: 0000000000000400 R14: 0000000000000002 R15: 0000000000000002
Dec 21 19:14:01 kernel: [ 1035.649587] FS:  00007ff10b6cc700(0000) GS:ffff88052bc20000(0000) knlGS:0000000000000000
Dec 21 19:14:01 kernel: [ 1035.658154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 21 19:14:01 kernel: [ 1035.664201] CR2: 0000000000725850 CR3: 00000005127ed000 CR4: 00000000000007e0
Dec 21 19:14:01 kernel: [ 1035.671755] Stack:
Dec 21 19:14:01 kernel: [ 1035.673872]  ffff880527597df0 ffffffff813732a9 ffff880527597e10 ffffffff8103220a
Dec 21 19:14:01 kernel: [ 1035.681777]  000000000000a032 ffffffff81a9d208 ffff880527597e58 ffffffff81033bf0
Dec 21 19:14:01 kernel: [ 1035.689638]  0000001227597e78 0000000000000282 000000000000006c 0000000000000008
Dec 21 19:14:01 kernel: [ 1035.697617] Call Trace:
Dec 21 19:14:01 kernel: [ 1035.700200]  [<ffffffff813732a9>] __const_udelay+0x29/0x30
Dec 21 19:14:01 kernel: [ 1035.706025]  [<ffffffff8103220a>] native_safe_apic_wait_icr_idle+0x2a/0x60
Dec 21 19:14:01 kernel: [ 1035.713245]  [<ffffffff81033bf0>] default_send_IPI_mask_sequence_phys+0xd0/0x120
Dec 21 19:14:01 kernel: [ 1035.721082]  [<ffffffff81037fae>] physflat_send_IPI_mask+0xe/0x10
Dec 21 19:14:01 kernel: [ 1035.727528]  [<ffffffff81033ec5>] arch_trigger_all_cpu_backtrace+0xc5/0xe0
Dec 21 19:14:01 kernel: [ 1035.734678]  [<ffffffff8140c6f3>] sysrq_handle_showallcpus+0x13/0x20
Dec 21 19:14:01 kernel: [ 1035.741066]  [<ffffffff8140cd36>] __handle_sysrq+0x106/0x170
Dec 21 19:14:01 kernel: [ 1035.746755]  [<ffffffff8140d1df>] write_sysrq_trigger+0x2f/0x40
Dec 21 19:14:01 kernel: [ 1035.752856]  [<ffffffff811febb8>] proc_reg_write+0x48/0x70
Dec 21 19:14:01 kernel: [ 1035.758710]  [<ffffffff811a2fd7>] vfs_write+0xb7/0x1f0
Dec 21 19:14:01 kernel: [ 1035.764131]  [<ffffffff811a3b86>] SyS_write+0x46/0xb0
Dec 21 19:14:01 kernel: [ 1035.769480]  [<ffffffff816ea552>] system_call_fastpath+0x16/0x1b
Dec 21 19:14:01 kernel: [ 1035.775821] Code: 00 5d c3 66 66 66 66 90 55 48 89 e5 65 8b 34 25 2c a0 00 00 0f 1f 00 0f ae e8 0f 31 89 c1 0f 1f 44 00 00 0f 1f 00 0f ae e8 0f 31 <48> c1 e2 20 89 c0 48 09 c2 89 d0 29 ca 39 fa 73 17 f3 90 65 8b 

This seems at-odds with the call trace above though.

@behlendorf
Copy link
Contributor

The process holding the lock will most likely be sleeping so don't limit yourself to running processes. I'd check any that have interesting stacks in zfs. The 'zfs diff' process here is probably the most likely. As for the stack above it's from the process that handled the sysrq.

@gmelikov
Copy link
Member

Close as stale.

If it's actual - feel free to reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants