
rsync and txg_sync blocked for more than 120 seconds #2611

Closed
freakout42 opened this issue Aug 18, 2014 · 47 comments

@freakout42

SPL: Loaded module v0.6.3-1
ZFS: Loaded module v0.6.3-1, ZFS pool version 5000, ZFS filesystem version 5

Doing nothing but some basic rsyncs of moderate size (4-10 GB) always results in roughly 1 MB/s throughput (very slow) on an up-to-date HP server with 16 GB RAM, running CentOS 6 with OpenVZ and self-compiled ZFS modules. dmesg:

INFO: task rsync:6517 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-042stab092.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rsync         D ffff8804029bafb0     0  6517   6516    0 0x00000080
 ffff88001e55da18 0000000000000086 0000000000000003 00000003d2dad320
 000000000001ec80 0000000000000001 ffff88001e55dab8 0000000000000082
 ffffc900105a3428 ffff8803fe837d60 ffff8804029bb578 000000000001ec80
Call Trace:
 [<ffffffff810a1dfe>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa019fb35>] cv_wait_common+0x105/0x1c0 [spl]
 [<ffffffff810a1bb0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa019fc45>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa02ae1fb>] txg_wait_open+0x8b/0x110 [zfs]
 [<ffffffffa027194e>] dmu_tx_wait+0x29e/0x2b0 [zfs]
 [<ffffffff81530bfe>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0271a41>] dmu_tx_assign+0x91/0x490 [zfs]
 [<ffffffffa027fab7>] ? dsl_dataset_block_freeable+0x27/0x60 [zfs]
 [<ffffffffa02e8d3e>] zfs_write+0x43e/0xcf0 [zfs]
 [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff811c5e5c>] ? core_sys_select+0x1ec/0x2d0
 [<ffffffffa02fd354>] zpl_write_common+0x54/0xd0 [zfs]
 [<ffffffffa02fd438>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff811ac798>] vfs_write+0xb8/0x1a0
 [<ffffffff811ad091>] sys_write+0x51/0x90
 [<ffffffff810f4dee>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b

INFO: task txg_sync:876 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-042stab092.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync      D ffff880405d28640     0   876      2    0 0x00000000
 ffff8803fe839b70 0000000000000046 0000000000000001 ffff880405101430
 0000000000000000 0000000000000000 ffff8803fe839af0 ffffffff81060032
 ffff8803fe839b40 ffffffff81054939 ffff880405d28c08 000000000001ec80
Call Trace:
 [<ffffffff81060032>] ? default_wake_function+0x12/0x20
 [<ffffffff81054939>] ? __wake_up_common+0x59/0x90
 [<ffffffff8152f833>] io_schedule+0x73/0xc0
 [<ffffffffa019fadc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffffa02f53e0>] ? zio_execute+0x0/0x140 [zfs]
 [<ffffffff810a1bb0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa019fc08>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02f561b>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa02867e3>] dsl_pool_sync+0xb3/0x440 [zfs]
 [<ffffffffa029a67b>] spa_sync+0x40b/0xae0 [zfs]
 [<ffffffffa02aebb4>] txg_sync_thread+0x384/0x5e0 [zfs]
 [<ffffffff8105b309>] ? set_user_nice+0xc9/0x130
 [<ffffffffa02ae830>] ? txg_sync_thread+0x0/0x5e0 [zfs]
 [<ffffffffa01978e8>] thread_generic_wrapper+0x68/0x80 [spl]
 [<ffffffffa0197880>] ? thread_generic_wrapper+0x0/0x80 [spl]
 [<ffffffff810a1596>] kthread+0x96/0xa0
 [<ffffffff8100c34a>] child_rip+0xa/0x20
 [<ffffffff810a1500>] ? kthread+0x0/0xa0
 [<ffffffff8100c340>] ? child_rip+0x0/0x20
@wankdanker

I hit something similar but maybe different last night. I believe it occurred while KVM was copying disk blocks from another server to this one. This ended up causing actual corruption on at least one of the zvols (as seen by the VM).

[381360.908047] INFO: task zvol/0:331 blocked for more than 120 seconds.
[381360.908147]       Tainted: PF          O 3.13.0-32-generic #57-Ubuntu
[381360.908240] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[381360.908352] zvol/0          D ffff88062fc54440     0   331      2 0x00000000
[381360.908357]  ffff88060796dbe0 0000000000000002 ffff880609aa17f0 ffff88060796dfd8
[381360.908360]  0000000000014440 0000000000014440 ffff880609aa17f0 ffff880609a5eb30
[381360.908362]  ffff880609a5e9f8 ffff880609a5eb38 0000000000000000 0000000000000002
[381360.908365] Call Trace:
[381360.908374]  [<ffffffff817200d9>] schedule+0x29/0x70
[381360.908396]  [<ffffffffa00c0485>] cv_wait_common+0x105/0x1a0 [spl]
[381360.908402]  [<ffffffff810aaf00>] ? prepare_to_wait_event+0x100/0x100
[381360.908408]  [<ffffffffa00c0535>] __cv_wait+0x15/0x20 [spl]
[381360.908459]  [<ffffffffa0209abb>] txg_wait_open+0x8b/0x110 [zfs]
[381360.908476]  [<ffffffffa01cd83b>] dmu_tx_wait+0x29b/0x2a0 [zfs]
[381360.908492]  [<ffffffffa01cd8cc>] dmu_tx_assign+0x8c/0x460 [zfs]
[381360.908520]  [<ffffffffa025a8c7>] zvol_write+0xa7/0x480 [zfs]
[381360.908527]  [<ffffffffa00bab27>] taskq_thread+0x237/0x4b0 [spl]
[381360.908530]  [<ffffffff81097508>] ? finish_task_switch+0x128/0x170
[381360.908534]  [<ffffffff8109a800>] ? wake_up_state+0x20/0x20
[381360.908539]  [<ffffffffa00ba8f0>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[381360.908543]  [<ffffffff8108b3d2>] kthread+0xd2/0xf0
[381360.908545]  [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
[381360.908548]  [<ffffffff8172c5bc>] ret_from_fork+0x7c/0xb0
[381360.908550]  [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0

This was repeated for zvol/0 through zvol/9. There are 31 zvols on this system.

[    4.995804] SPL: Loaded module v0.6.3-1~precise
[    5.130074] ZFS: Loaded module v0.6.3-2~precise, ZFS pool version 5000, ZFS filesystem version 5

I tried migrating a VM again today and all hell broke loose, but I did not get these errors. The system load was in the hundreds on an 8-core system, with major I/O wait time. I killed the migration but still ended up with at least 3 corrupt zvols.

@freakout42
Author

I do not have any corruption or problems. The system is stable and running, and the rsync tasks complete 100% correctly, but write performance is very slow (1 MB/s), and while they run the whole system responds sluggishly. Still, there are no errors and no corruption. These "blocked for more than 120 seconds" dmesg messages come in pairs, one for rsync and one for txg_sync, about once a day.

@kernelOfTruth
Contributor

Not the same, but a related message:

Aug 29 05:37:06 morpheus kernel: [46185.239554] ata6.00: configured for UDMA/133
Aug 29 05:37:06 morpheus kernel: [46185.239562] ata6: EH complete
Aug 29 05:53:40 morpheus kernel: [47179.890587] INFO: task txg_sync:1462 blocked for more than 180 seconds.
Aug 29 05:53:40 morpheus kernel: [47179.890589] Tainted: P O 3.16.0_ck1-smtnice6_BFQ_integra_intel #1
Aug 29 05:53:40 morpheus kernel: [47179.890590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 29 05:53:40 morpheus kernel: [47179.890591] txg_sync D 0000000000000006 0 1462 2 0x00000000
Aug 29 05:53:40 morpheus kernel: [47179.890594] ffff8805fbda9e40 0000000000000046 ffff88020931a9b0 ffff88065b610000
Aug 29 05:53:40 morpheus kernel: [47179.890596] 0000000000000066 ffff8807fb2a0f20 000000000000b020 0000000000013100
Aug 29 05:53:40 morpheus kernel: [47179.890597] ffff88065b6103a0 00002992b81d47dd ffff880617373fd8 ffff88065b610000
Aug 29 05:53:40 morpheus kernel: [47179.890599] Call Trace:
Aug 29 05:53:40 morpheus kernel: [47179.890605] [] ? io_schedule+0x88/0xd0
Aug 29 05:53:40 morpheus kernel: [47179.890613] [] ? __cv_timedwait+0x96/0x110 [spl]
Aug 29 05:53:40 morpheus kernel: [47179.890616] [] ? finish_wait+0x90/0x90
Aug 29 05:53:40 morpheus kernel: [47179.890623] [] ? zio_wait+0xeb/0x1a0 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890631] [] ? dsl_pool_sync+0xaa/0x450 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890639] [] ? spa_sync+0x483/0xb20 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890642] [] ? default_wake_function+0xd/0x20
Aug 29 05:53:40 morpheus kernel: [47179.890644] [] ? ktime_get_ts+0x3d/0xe0
Aug 29 05:53:40 morpheus kernel: [47179.890652] [] ? txg_sync_start+0x6ea/0x900 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890655] [] ? __switch_to+0x2a/0x560
Aug 29 05:53:40 morpheus kernel: [47179.890662] [] ? txg_sync_start+0x3c0/0x900 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890664] [] ? spl_kmem_fini+0xa5/0xc0 [spl]
Aug 29 05:53:40 morpheus kernel: [47179.890667] [] ? spl_kmem_fini+0x30/0xc0 [spl]
Aug 29 05:53:40 morpheus kernel: [47179.890669] [] ? kthread+0xbc/0xe0
Aug 29 05:53:40 morpheus kernel: [47179.890670] [] ? __ww_mutex_lock_slowpath+0x8c/0x2cc
Aug 29 05:53:40 morpheus kernel: [47179.890672] [] ? flush_kthread_worker+0x80/0x80
Aug 29 05:53:40 morpheus kernel: [47179.890674] [] ? ret_from_fork+0x7c/0xb0
Aug 29 05:53:40 morpheus kernel: [47179.890675] [] ? flush_kthread_worker+0x80/0x80
Aug 29 06:28:01 morpheus kernel: [49241.775276] usb 1-1.1: USB disconnect, device number 3

This seems to occur from time to time with a rather slow USB 3.0-powered 4 TB HDD (Touro Desk 3.0, HGST5K4000) in an external case, during rsync transfers of large files (several GiB).

@kernelOfTruth
Contributor

Clarification for the above:

I had this happen with the above-mentioned drive, and I still don't know what causes it (I suspect it could be related either to the chipset in the external hard-drive enclosure the drive sits in, or to power-saving features of the XHCI driver and hardware, which I have had issues with in the past).

Another drive, a Seagate ST3000DM001, showed the same behavior. It had likely undergone a head crash and reallocated several sectors (<10). It had been placed in an external enclosure by Fantec (a db-f8u3e, whose chipset is incompatible with smartctl) that had shown in the past to have a life and mind of its own: it would occasionally turn off during transfers, causing trouble with other filesystems; on ZFS, however, the files so far seemed fine. The day before yesterday I placed the drive in another external enclosure and it worked well during backups (only transferring several hundred MiB of data per incremental rsync backup job) until I decided to run a scrub and check everything: after several hours the drive again made the audible noises of a head crash and/or sector reallocation (I have had those in the past), and the drive could no longer be accessed.

That is where the message posted above occurred again. So at least in the second case (the ST3000DM001), ZFS indirectly showed that something was wrong with the hardware/harddrive.

So when encountering this message, make sure to double- or triple-check that it is not a hardware issue rather than a "software" (ZFS-related) problem.
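
For example, a quick hardware sanity check with smartmontools might look like this (a sketch; /dev/sdX is a placeholder for the suspect drive):

smartctl -H /dev/sdX                               # overall SMART health self-assessment
smartctl -A /dev/sdX | grep -iE 'realloc|pending'  # reallocated / pending sector counts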

@ryao
Contributor

ryao commented Sep 13, 2014

@wankdanker I think that your issue is separate. It might have been caused by the zvol processing occurring inside an interrupt context. Pull request #2484 might resolve it.

@ryao
Contributor

ryao commented Sep 13, 2014

@freakout42 Would you tell us more about your pool configuration? Also, do you have data deduplication enabled on this pool?

@freakout42
Author

[root@blood ~]# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 0h53m with 0 errors on Tue Sep  9 12:45:13 2014
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda4    ONLINE       0     0     0
        sdb4    ONLINE       0     0     0

errors: No known data errors

[root@blood ~]# zpool get all
NAME  PROPERTY               VALUE                  SOURCE
tank  size                   824G                   -
tank  capacity               34%                    -
tank  altroot                -                      default
tank  health                 ONLINE                 -
tank  guid                   3198719639486948540    default
tank  version                -                      default
tank  bootfs                 -                      default
tank  delegation             on                     default
tank  autoreplace            off                    default
tank  cachefile              -                      default
tank  failmode               wait                   default
tank  listsnapshots          off                    default
tank  autoexpand             off                    default
tank  dedupditto             0                      default
tank  dedupratio             1.00x                  -
tank  free                   538G                   -
tank  allocated              286G                   -
tank  readonly               off                    -
tank  ashift                 0                      default
tank  comment                -                      default
tank  expandsize             0                      -
tank  freeing                0                      default
tank  feature@async_destroy  enabled                local
tank  feature@empty_bpobj    active                 local
tank  feature@lz4_compress   enabled                local

@ColdCanuck
Contributor

I had similar failures, which occurred during heavy rsync pulls from a remote machine.

I was able to make it happen very quickly by starting up the remote pull; I did this three times in a row and triggered the fault every time. The symptom was that any userland zfs/zpool command would hang, but the machine was still responsive to other commands.

I set the parameter spl_kmem_cache_slab_limit=0 (it had been spl_kmem_cache_slab_limit=16384), and the problem seems to be gone, or at least not easily triggered.
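
For reference, a sketch of how to inspect and persist that setting, assuming the parameter is exposed under /sys/module/spl/parameters/ as in SPL 0.6.x:

cat /sys/module/spl/parameters/spl_kmem_cache_slab_limit                    # current value
echo "options spl spl_kmem_cache_slab_limit=0" >> /etc/modprobe.d/spl.conf  # takes effect on next module load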

The process that triggers this includes snapshot renaming, but no zvols are involved, although the pool has some.

The pool is a raidz1 pool, and there are no hardware issues on the server.

Sep 23 16:17:51 zephyr kernel: [   27.497382] SPL: Loaded module v0.6.3-1~precise
Sep 23 16:17:51 zephyr kernel: [   27.515752] ZFS: Loaded module v0.6.3-2~precise, ZFS pool version 5000, ZFS filesystem version 5

Sep 26 00:11:23 zephyr kernel: [200583.071397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 26 00:11:23 zephyr kernel: [200583.093284] txg_sync        D ffffffff8180cbe0     0  4178      2 0x00000000
Sep 26 00:11:23 zephyr kernel: [200583.093291]  ffff8807d9fd1aa0 0000000000000046 00000000000000ff ffffffffa026c280
Sep 26 00:11:23 zephyr kernel: [200583.093297]  ffff8807d9fd1fd8 ffff8807d9fd1fd8 ffff8807d9fd1fd8 00000000000139c0
Sep 26 00:11:23 zephyr kernel: [200583.093303]  ffff8807f3d6ae20 ffff8807e0189710 00000000ffffffff ffffffffa026c280
Sep 26 00:11:23 zephyr kernel: [200583.093309] Call Trace:
Sep 26 00:11:23 zephyr kernel: [200583.093333]  [<ffffffff8169f099>] schedule+0x29/0x70
Sep 26 00:11:23 zephyr kernel: [200583.093338]  [<ffffffff8169f35e>] schedule_preempt_disabled+0xe/0x10
Sep 26 00:11:23 zephyr kernel: [200583.093343]  [<ffffffff8169df77>] __mutex_lock_slowpath+0xd7/0x150
Sep 26 00:11:23 zephyr kernel: [200583.093350]  [<ffffffff8169db8a>] mutex_lock+0x2a/0x50
Sep 26 00:11:23 zephyr kernel: [200583.093395]  [<ffffffffa0230f19>] zvol_rename_minors+0x79/0x180 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093432]  [<ffffffffa01ad349>] dsl_dataset_rename_snapshot_sync_impl+0x189/0x2c0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093460]  [<ffffffffa01ad52f>] dsl_dataset_rename_snapshot_sync+0xaf/0x190 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093491]  [<ffffffffa01bde52>] dsl_sync_task_sync+0xf2/0x100 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093521]  [<ffffffffa01b63b3>] dsl_pool_sync+0x2f3/0x420 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093554]  [<ffffffffa01cc4f4>] spa_sync+0x414/0xb20 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093590]  [<ffffffffa01db021>] ? spa_txg_history_set+0x21/0xd0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093625]  [<ffffffffa01de335>] txg_sync_thread+0x385/0x5d0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093659]  [<ffffffffa01ddfb0>] ? txg_init+0x260/0x260 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093672]  [<ffffffffa00d3fc8>] thread_generic_wrapper+0x78/0x90 [spl]
Sep 26 00:11:23 zephyr kernel: [200583.093682]  [<ffffffffa00d3f50>] ? __thread_create+0x300/0x300 [spl]
Sep 26 00:11:23 zephyr kernel: [200583.093688]  [<ffffffff81077ee3>] kthread+0x93/0xa0
Sep 26 00:11:23 zephyr kernel: [200583.093694]  [<ffffffff816a9b24>] kernel_thread_helper+0x4/0x10
Sep 26 00:11:23 zephyr kernel: [200583.093700]  [<ffffffff81077e50>] ? flush_kthread_worker+0xb0/0xb0
Sep 26 00:11:23 zephyr kernel: [200583.093704]  [<ffffffff816a9b20>] ? gs_change+0x13/0x13
Sep 26 00:11:23 zephyr kernel: [200583.093737] INFO: task zfs:22581 blocked for more than 120 seconds.
Sep 26 00:11:23 zephyr kernel: [200583.104867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 26 00:11:23 zephyr kernel: [200583.126712] zfs             D ffffffff8180cbe0     0 22581  22253 0x00000000
Sep 26 00:11:23 zephyr kernel: [200583.126717]  ffff880182f2f958 0000000000000082 ffff8807eee9c530 ffff8807deda2550
Sep 26 00:11:23 zephyr kernel: [200583.126722]  ffff880182f2ffd8 ffff880182f2ffd8 ffff880182f2ffd8 00000000000139c0
Sep 26 00:11:23 zephyr kernel: [200583.126728]  ffff8807f3f04530 ffff8807eee9c530 0000000000000292 ffff8807deda2550
Sep 26 00:11:23 zephyr kernel: [200583.126733] Call Trace:
Sep 26 00:11:23 zephyr kernel: [200583.126740]  [<ffffffff8169f099>] schedule+0x29/0x70
Sep 26 00:11:23 zephyr kernel: [200583.126753]  [<ffffffffa00dbd8d>] cv_wait_common+0xfd/0x1b0 [spl]
Sep 26 00:11:23 zephyr kernel: [200583.126768]  [<ffffffff810787c0>] ? add_wait_queue+0x60/0x60
Sep 26 00:11:23 zephyr kernel: [200583.126779]  [<ffffffffa00dbe95>] __cv_wait+0x15/0x20 [spl]
Sep 26 00:11:23 zephyr kernel: [200583.126813]  [<ffffffffa01c4aec>] rrw_enter_read+0x3c/0x130 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126849]  [<ffffffffa01c4c70>] rrw_enter+0x20/0x30 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126878]  [<ffffffffa01b51cd>] dsl_pool_config_enter+0x1d/0x20 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126915]  [<ffffffffa01b709a>] dsl_pool_hold+0x4a/0x60 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126927]  [<ffffffffa019957b>] dmu_objset_hold+0x2b/0xb0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126929]  [<ffffffff816a85d6>] ? ftrace_call+0x5/0x2b
Sep 26 00:11:23 zephyr kernel: [200583.126942]  [<ffffffffa01b8a6f>] dsl_prop_get+0x3f/0x90 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126944]  [<ffffffff816a85d6>] ? ftrace_call+0x5/0x2b
Sep 26 00:11:23 zephyr kernel: [200583.126956]  [<ffffffffa01b8ade>] dsl_prop_get_integer+0x1e/0x20 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126970]  [<ffffffffa022ef6f>] __zvol_create_minor+0xbf/0x630 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126985]  [<ffffffffa0230bc7>] zvol_create_minor+0x27/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126998]  [<ffffffffa0230bee>] zvol_create_minors_cb+0xe/0x20 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127010]  [<ffffffffa0197e6e>] dmu_objset_find_impl+0x37e/0x3f0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127023]  [<ffffffffa0230be0>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127034]  [<ffffffffa0197cae>] dmu_objset_find_impl+0x1be/0x3f0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127048]  [<ffffffffa0230be0>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127061]  [<ffffffffa0230be0>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127072]  [<ffffffffa0197f32>] dmu_objset_find+0x52/0x80 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127074]  [<ffffffff816a85d6>] ? ftrace_call+0x5/0x2b
Sep 26 00:11:23 zephyr kernel: [200583.127088]  [<ffffffffa0230d53>] zvol_create_minors+0x33/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127102]  [<ffffffffa0202a69>] zfs_ioc_snapshot+0x259/0x2a0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127116]  [<ffffffffa0206530>] zfsdev_ioctl+0x180/0x500 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127118]  [<ffffffff816a85d6>] ? ftrace_call+0x5/0x2b
Sep 26 00:11:23 zephyr kernel: [200583.127131]  [<ffffffff8119ae9a>] do_vfs_ioctl+0x8a/0x340
Sep 26 00:11:23 zephyr kernel: [200583.127135]  [<ffffffff8119b1e1>] sys_ioctl+0x91/0xa0
Sep 26 00:11:23 zephyr kernel: [200583.127141]  [<ffffffff816a8829>] system_call_fastpath+0x16/0x1b

@dweeezil
Contributor

@ColdCanuck Your comments regarding spl_kmem_cache_slab_limit piqued my interest in this issue, which until now I've not had time to follow. In fact, a quick skim of this whole issue makes me wonder how related the various problem reports are and whether there's too much issue creep.

Back to the point at hand: I'm posting this follow-up because there have been a disturbing number of otherwise seemingly unrelated problems sporadically caused by using the Linux slab. Although I've not been able to spend the time on it I've wanted to, I've been knee-deep investigating the series of issues related to POSIX ACLs and SA xattrs, and have seen at least one report (#2701) and, more interestingly, #2725, which makes me think there may be a tie-in to our use of the Linux slab for <= 16KiB objects. I don't have any other brilliant observations to offer at the moment other than to raise the concern that there may be problems related to using the Linux slab, and to ask @behlendorf, @ryao et al. what your thoughts are on this (particularly given the last few comments in #2725).

@kernelOfTruth
Contributor

Just posting what comes to mind:

Could scheduling a regular cron job that compacts memory via

echo 1 > /proc/sys/vm/compact_memory

change things (provided the slab issues and timeouts are related to memory fragmentation)?
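
e.g. a hypothetical /etc/cron.d entry (assuming root and a standard cron daemon):

# /etc/cron.d/compact-memory
*/15 * * * * root /bin/sh -c 'echo 1 > /proc/sys/vm/compact_memory'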

@ioquatix

I'm systematically hitting this issue when trying to rsync, using the latest ZFS from Arch: zfs-git 0.6.3_r170_gd958324f_3.18.2_2-1

[46181.967521] perf interrupt took too long (2506 > 2495), lowering kernel.perf_event_max_sample_rate to 50100
[79468.027144] INFO: task txg_sync:583 blocked for more than 120 seconds.
[79468.027287]       Tainted: P           O   3.18.2-2-ARCH #1
[79468.027363] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[79468.027468] txg_sync        D 0000000000000000     0   583      2 0x00000000
[79468.027476]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[79468.027483]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[79468.027489]  ffff8801591955b8 ffffffff00000000 ffff880159195570 00000000025e92cf
[79468.027494] Call Trace:
[79468.027509]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[79468.027518]  [<ffffffff81550b59>] schedule+0x29/0x70
[79468.027524]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[79468.027547]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[79468.027554]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[79468.027565]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[79468.027588]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[79468.027616]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[79468.027625]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[79468.027656]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[79468.027663]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[79468.027694]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[79468.027724]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[79468.027734]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[79468.027742]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[79468.027749]  [<ffffffff81090e0a>] kthread+0xea/0x100
[79468.027755]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[79468.027761]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[79468.027767]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[82589.120097] INFO: task txg_sync:583 blocked for more than 120 seconds.
[82589.120252]       Tainted: P           O   3.18.2-2-ARCH #1
[82589.120347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[82589.120479] txg_sync        D 0000000000000001     0   583      2 0x00000000
[82589.120489]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[82589.120497]  ffff88021ce03fd8 0000000000013640 ffff8802240ceeb0 ffff88021cd5eeb0
[82589.120505]  ffff88022363f7b0 ffff88022363f798 0000000000000000 0000000000000003
[82589.120512] Call Trace:
[82589.120529]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[82589.120540]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[82589.120549]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[82589.120558]  [<ffffffff81550b59>] schedule+0x29/0x70
[82589.120564]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[82589.120591]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[82589.120599]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[82589.120613]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[82589.120641]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[82589.120676]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[82589.120689]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[82589.120727]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[82589.120735]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[82589.120774]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[82589.120811]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[82589.120823]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[82589.120834]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[82589.120842]  [<ffffffff81090e0a>] kthread+0xea/0x100
[82589.120849]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[82589.120858]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[82589.120864]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89311.468460] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89311.468644]       Tainted: P           O   3.18.2-2-ARCH #1
[89311.468741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89311.468872] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89311.468882]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89311.468890]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89311.468897]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89311.468905] Call Trace:
[89311.468922]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89311.468933]  [<ffffffff81550b59>] schedule+0x29/0x70
[89311.468940]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89311.468968]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89311.468976]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89311.468989]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89311.469017]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89311.469052]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89311.469065]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89311.469103]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89311.469112]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89311.469150]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89311.469187]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89311.469199]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89311.469211]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89311.469218]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89311.469226]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89311.469234]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89311.469241]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
hinoki% free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        4.3G        533M        1.1M        3.0G        702M
Swap:          4.0G          0B        4.0G

@ioquatix

Still trying to run rsync:

[89431.510429] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89431.510571]       Tainted: P           O   3.18.2-2-ARCH #1
[89431.510647] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89431.510753] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89431.510762]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89431.510769]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89431.510774]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89431.510780] Call Trace:
[89431.510796]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89431.510805]  [<ffffffff81550b59>] schedule+0x29/0x70
[89431.510811]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89431.510836]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89431.510842]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89431.510853]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89431.510877]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89431.510906]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89431.510915]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89431.510946]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89431.510953]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89431.510984]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89431.511014]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89431.511024]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89431.511033]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89431.511039]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89431.511045]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89431.511052]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89431.511058]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89551.552404] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89551.552565]       Tainted: P           O   3.18.2-2-ARCH #1
[89551.552660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89551.552792] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89551.552803]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89551.552812]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89551.552819]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89551.552826] Call Trace:
[89551.552844]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89551.552854]  [<ffffffff81550b59>] schedule+0x29/0x70
[89551.552862]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89551.552890]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89551.552898]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89551.552911]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89551.552940]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89551.552975]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89551.552987]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89551.553025]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89551.553034]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89551.553073]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89551.553110]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89551.553122]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89551.553133]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89551.553140]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89551.553148]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89551.553156]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89551.553163]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89671.594371] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89671.594507]       Tainted: P           O   3.18.2-2-ARCH #1
[89671.594605] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89671.594711] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89671.594720]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89671.594727]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89671.594733]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89671.594738] Call Trace:
[89671.594753]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89671.594762]  [<ffffffff81550b59>] schedule+0x29/0x70
[89671.594768]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89671.594792]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89671.594798]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89671.594809]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89671.594832]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89671.594860]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89671.594870]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89671.594901]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89671.594908]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89671.594939]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89671.594969]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89671.594979]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89671.594988]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89671.594994]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89671.595000]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89671.595007]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89671.595013]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89791.636364] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89791.636507]       Tainted: P           O   3.18.2-2-ARCH #1
[89791.636583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89791.636688] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89791.636697]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89791.636703]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89791.636709]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89791.636715] Call Trace:
[89791.636729]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89791.636738]  [<ffffffff81550b59>] schedule+0x29/0x70
[89791.636744]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89791.636767]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89791.636774]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89791.636785]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89791.636808]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89791.636836]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89791.636852]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89791.636883]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89791.636890]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89791.636921]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89791.636951]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89791.636960]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89791.636969]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89791.636975]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89791.636981]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89791.636989]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89791.636994]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90105.367103] systemd[1]: systemd-journald.service stop-sigabrt timed out. Terminating.
[90105.490377] systemd[1]: Listening on Journal Audit Socket.
[90105.490436] systemd[1]: Starting Journal Service...
[90151.762415] INFO: task kswapd0:31 blocked for more than 120 seconds.
[90151.762569]       Tainted: P           O   3.18.2-2-ARCH #1
[90151.762663] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[90151.762795] kswapd0         D 0000000000000000     0    31      2 0x00000000
[90151.762805]  ffff8802235a3908 0000000000000046 ffff880224225a90 0000000000013640
[90151.762814]  ffff8802235a3fd8 0000000000013640 ffffffff81818540 ffff880224225a90
[90151.762821]  ffff88022fc93640 ffff88021cd5e4a0 ffff88021cd5ec02 0000000000000000
[90151.762828] Call Trace:
[90151.762845]  [<ffffffff8109e6e7>] ? try_to_wake_up+0x1e7/0x380
[90151.762856]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[90151.762865]  [<ffffffff81550b59>] schedule+0x29/0x70
[90151.762893]  [<ffffffffa01cbb8d>] __cv_broadcast+0x12d/0x160 [spl]
[90151.762902]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[90151.762915]  [<ffffffffa01cbbd5>] __cv_wait+0x15/0x20 [spl]
[90151.762956]  [<ffffffffa02b6d03>] txg_wait_open+0x73/0xb0 [zfs]
[90151.762984]  [<ffffffffa027514a>] dmu_tx_wait+0x33a/0x350 [zfs]
[90151.763011]  [<ffffffffa02751f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[90151.763040]  [<ffffffffa02f2a73>] zfs_inactive+0x163/0x200 [zfs]
[90151.763049]  [<ffffffff810b2e70>] ? autoremove_wake_function+0x40/0x40
[90151.763075]  [<ffffffffa030af18>] zpl_vap_init+0x838/0xa10 [zfs]
[90151.763083]  [<ffffffff811eab68>] evict+0xb8/0x1b0
[90151.763090]  [<ffffffff811eaca1>] dispose_list+0x41/0x50
[90151.763097]  [<ffffffff811ebce6>] prune_icache_sb+0x56/0x80
[90151.763106]  [<ffffffff811d2b25>] super_cache_scan+0x115/0x180
[90151.763115]  [<ffffffff81169e89>] shrink_slab_node+0x129/0x2f0
[90151.763123]  [<ffffffff8116abcb>] shrink_slab+0x8b/0x160
[90151.763131]  [<ffffffff8116e0e9>] kswapd_shrink_zone+0x129/0x1d0
[90151.763138]  [<ffffffff8116eb7a>] kswapd+0x54a/0x8f0
[90151.763147]  [<ffffffff8116e630>] ? mem_cgroup_shrink_node_zone+0x1c0/0x1c0
[90151.763155]  [<ffffffff81090e0a>] kthread+0xea/0x100
[90151.763162]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90151.763171]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[90151.763178]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90151.763190] INFO: task systemd-journal:138 blocked for more than 120 seconds.
[90151.763340]       Tainted: P           O   3.18.2-2-ARCH #1
[90151.763433] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[90151.763563] systemd-journal D 0000000000000000     0   138      1 0x00000004
[90151.763571]  ffff88022310b128 0000000000000082 ffff880222c25080 0000000000013640
[90151.763577]  ffff88022310bfd8 0000000000013640 ffff88021cd5e4a0 ffff880222c25080
[90151.763583]  ffff88022fc93640 ffff88021cd5e4a0 ffff88021cd5ec02 0000000000000000
[90151.763590] Call Trace:
[90151.763599]  [<ffffffff8109e6e7>] ? try_to_wake_up+0x1e7/0x380
[90151.763607]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[90151.763614]  [<ffffffff81550b59>] schedule+0x29/0x70
[90151.763627]  [<ffffffffa01cbb8d>] __cv_broadcast+0x12d/0x160 [spl]
[90151.763635]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[90151.763647]  [<ffffffffa01cbbd5>] __cv_wait+0x15/0x20 [spl]
[90151.763683]  [<ffffffffa02b6d03>] txg_wait_open+0x73/0xb0 [zfs]
[90151.763710]  [<ffffffffa027514a>] dmu_tx_wait+0x33a/0x350 [zfs]
[90151.763737]  [<ffffffffa02751f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[90151.763765]  [<ffffffffa02f2a73>] zfs_inactive+0x163/0x200 [zfs]
[90151.763774]  [<ffffffff810b2e70>] ? autoremove_wake_function+0x40/0x40
[90151.763799]  [<ffffffffa030af18>] zpl_vap_init+0x838/0xa10 [zfs]
[90151.763805]  [<ffffffff811eab68>] evict+0xb8/0x1b0
[90151.763812]  [<ffffffff811eaca1>] dispose_list+0x41/0x50
[90151.763819]  [<ffffffff811ebce6>] prune_icache_sb+0x56/0x80
[90151.763827]  [<ffffffff811d2b25>] super_cache_scan+0x115/0x180
[90151.763834]  [<ffffffff81169e89>] shrink_slab_node+0x129/0x2f0
[90151.763842]  [<ffffffff811c1523>] ? mem_cgroup_iter+0x2f3/0x4d0
[90151.763850]  [<ffffffff8116abcb>] shrink_slab+0x8b/0x160
[90151.763858]  [<ffffffff8116da45>] do_try_to_free_pages+0x365/0x4e0
[90151.763866]  [<ffffffff8116dc71>] try_to_free_pages+0xb1/0x1a0
[90151.763873]  [<ffffffff81160ca7>] __alloc_pages_nodemask+0x697/0xb50
[90151.763884]  [<ffffffff811a6e9c>] alloc_pages_current+0x9c/0x120
[90151.763891]  [<ffffffff811afba5>] new_slab+0x305/0x370
[90151.763899]  [<ffffffff811b2175>] __slab_alloc.isra.51+0x545/0x650
[90151.763907]  [<ffffffff81445af9>] ? __alloc_skb+0x89/0x210
[90151.763914]  [<ffffffff814380f4>] ? raw_pci_write+0x24/0x50
[90151.763923]  [<ffffffff812e8be6>] ? pci_bus_write_config_word+0x66/0x80
[90151.763949]  [<ffffffffa004ed7f>] ? ata_bmdma_start+0x2f/0x40 [libata]
[90151.763960]  [<ffffffffa01121ad>] ? atiixp_bmdma_start+0x9d/0xe0 [pata_atiixp]
[90151.763969]  [<ffffffff811b5445>] __kmalloc_node_track_caller+0xa5/0x240
[90151.763976]  [<ffffffff81445af9>] ? __alloc_skb+0x89/0x210
[90151.763984]  [<ffffffff81445a11>] __kmalloc_reserve.isra.38+0x31/0x90
[90151.763990]  [<ffffffff81445acb>] ? __alloc_skb+0x5b/0x210
[90151.763997]  [<ffffffff81445af9>] __alloc_skb+0x89/0x210
[90151.764004]  [<ffffffff81445de4>] alloc_skb_with_frags+0x64/0x1e0
[90151.764011]  [<ffffffff8143efb9>] sock_alloc_send_pskb+0x219/0x290
[90151.764020]  [<ffffffff810136fb>] ? __switch_to+0x1fb/0x600
[90151.764029]  [<ffffffff814f886d>] unix_dgram_sendmsg+0x18d/0x690
[90151.764037]  [<ffffffff8143ac39>] sock_sendmsg+0x79/0xb0
[90151.764045]  [<ffffffff810986da>] ? finish_task_switch+0x4a/0xf0
[90151.764051]  [<ffffffff815504c8>] ? __schedule+0x3e8/0xa50
[90151.764059]  [<ffffffff8143c76c>] ? move_addr_to_kernel+0x2c/0x50
[90151.764066]  [<ffffffff8144a3c7>] ? verify_iovec+0x47/0xd0
[90151.764074]  [<ffffffff8143b928>] ___sys_sendmsg+0x408/0x420
[90151.764083]  [<ffffffff812149d0>] ? ep_read_events_proc+0xe0/0xe0
[90151.764089]  [<ffffffff81440030>] ? sk_prot_alloc.isra.33+0x30/0x130
[90151.764097]  [<ffffffff811b291a>] ? kmem_cache_alloc+0x16a/0x170
[90151.764104]  [<ffffffff811d100c>] ? get_empty_filp+0x5c/0x1c0
[90151.764112]  [<ffffffff8126e496>] ? security_file_alloc+0x16/0x20
[90151.764118]  [<ffffffff811d1084>] ? get_empty_filp+0xd4/0x1c0
[90151.764126]  [<ffffffff811d118f>] ? alloc_file+0x1f/0xb0
[90151.764134]  [<ffffffff8143d651>] __sys_sendmsg+0x51/0x90
[90151.764142]  [<ffffffff8143d6a2>] SyS_sendmsg+0x12/0x20
[90151.764149]  [<ffffffff81554ca9>] system_call_fastpath+0x12/0x17
[90151.764186] INFO: task txg_sync:583 blocked for more than 120 seconds.
[90151.764325]       Tainted: P           O   3.18.2-2-ARCH #1
[90151.764418] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[90151.764547] txg_sync        D 0000000000000000     0   583      2 0x00000000
[90151.764554]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[90151.764560]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[90151.764566]  ffff88018de384b8 ffffffff00000000 ffff88018de38490 00000000025e92cf
[90151.764572] Call Trace:
[90151.764581]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[90151.764588]  [<ffffffff81550b59>] schedule+0x29/0x70
[90151.764594]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[90151.764607]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[90151.764615]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[90151.764627]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[90151.764651]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[90151.764686]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[90151.764698]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[90151.764735]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[90151.764744]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[90151.764781]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[90151.764818]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[90151.764830]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[90151.764841]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[90151.764847]  [<ffffffff81090e0a>] kthread+0xea/0x100
[90151.764855]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90151.764862]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[90151.764869]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90195.648689] systemd[1]: systemd-journald.service stop-sigterm timed out. Killing.
[90195.649942] systemd[1]: Starting Journal Service...

@behlendorf
Contributor

This might be caused by #2523. Can you verify you have the fix to the SPL applied: openzfs/spl@a3c1eb7?
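
For a source build, one way to check is against the git history (a sketch, assuming a git checkout of openzfs/spl):

cd spl
git merge-base --is-ancestor a3c1eb7 HEAD && echo "fix applied" || echo "fix missing"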

@ioquatix

More of the same:

[ 1320.575152] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1320.575263]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1320.575314] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.575384] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1320.575390]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1320.575395]  ffff88021ba43fd8 0000000000013640 ffff88021dee2840 ffff8800cbfc4670
[ 1320.575399]  ffff880131dcc138 ffffffff00000000 ffff880131dcc130 00000000b3f1aac7
[ 1320.575402] Call Trace:
[ 1320.575414]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1320.575420]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1320.575424]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1320.575442]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1320.575447]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1320.575453]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1320.575470]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1320.575490]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1320.575496]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1320.575517]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1320.575522]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1320.575543]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1320.575563]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1320.575569]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1320.575575]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1320.575579]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1320.575583]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1320.575588]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1320.575592]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1560.661841] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1560.661995]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1560.662090] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1560.662221] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1560.662232]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1560.662240]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 1560.662247]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 1560.662255] Call Trace:
[ 1560.662272]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 1560.662283]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 1560.662292]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1560.662301]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1560.662307]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1560.662335]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1560.662343]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1560.662355]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1560.662383]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1560.662418]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1560.662430]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1560.662468]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1560.662477]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1560.662516]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1560.662552]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1560.662564]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1560.662575]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1560.662583]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1560.662590]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1560.662598]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1560.662605]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1560.662617] INFO: task java:1218 blocked for more than 120 seconds.
[ 1560.662753]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1560.662846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1560.662975] java            D 0000000000000001     0  1218    655 0x00000000
[ 1560.662983]  ffff8800c17f7ac8 0000000000000086 ffff8800c33e1420 0000000000013640
[ 1560.662990]  ffff8800c17f7fd8 0000000000013640 ffff8802240ceeb0 ffff8800c33e1420
[ 1560.662997]  0000008000000000 ffff88002041b7a0 0000000000000000 ffff88002041b778
[ 1560.663003] Call Trace:
[ 1560.663027]  [<ffffffffa0260398>] ? dbuf_rele_and_unlock+0x2c8/0x4d0 [zfs]
[ 1560.663049]  [<ffffffffa0261eca>] ? dbuf_read+0x8da/0xf20 [zfs]
[ 1560.663061]  [<ffffffffa01c8901>] ? kmem_asprintf+0x51/0x80 [spl]
[ 1560.663068]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1560.663080]  [<ffffffffa01cfb8d>] __cv_broadcast+0x12d/0x160 [spl]
[ 1560.663088]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1560.663099]  [<ffffffffa01cfbd5>] __cv_wait+0x15/0x20 [spl]
[ 1560.663126]  [<ffffffffa0278eab>] dmu_tx_wait+0x9b/0x350 [zfs]
[ 1560.663153]  [<ffffffffa02791f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[ 1560.663161]  [<ffffffff810e1006>] ? getrawmonotonic+0x36/0xd0
[ 1560.663186]  [<ffffffffa0268e1c>] dmu_free_long_range+0x1ac/0x280 [zfs]
[ 1560.663217]  [<ffffffffa02daf1c>] zfs_rmnode+0x6c/0x340 [zfs]
[ 1560.663244]  [<ffffffffa02fe141>] zfs_zinactive+0xc1/0x1d0 [zfs]
[ 1560.663273]  [<ffffffffa02f6974>] zfs_inactive+0x64/0x200 [zfs]
[ 1560.663281]  [<ffffffff810b2e70>] ? autoremove_wake_function+0x40/0x40
[ 1560.663307]  [<ffffffffa030ef18>] zpl_vap_init+0x838/0xa10 [zfs]
[ 1560.663315]  [<ffffffff811eab68>] evict+0xb8/0x1b0
[ 1560.663322]  [<ffffffff811eb405>] iput+0xf5/0x1a0
[ 1560.663330]  [<ffffffff811df7f2>] do_unlinkat+0x1e2/0x350
[ 1560.663337]  [<ffffffff811d4839>] ? SyS_newstat+0x39/0x60
[ 1560.663345]  [<ffffffff811e02c6>] SyS_unlink+0x16/0x20
[ 1560.663353]  [<ffffffff81554ca9>] system_call_fastpath+0x12/0x17
[ 1560.663371] INFO: task sshd:1381 blocked for more than 120 seconds.
[ 1560.663508]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1560.663601] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1560.663791] sshd            D 0000000000000001     0  1381    764 0x00000004
[ 1560.663798]  ffff880113ad7b38 0000000000000086 ffff8800a7356eb0 0000000000013640
[ 1560.663804]  ffff880113ad7fd8 0000000000013640 ffff8802240ceeb0 ffff8800a7356eb0
[ 1560.663810]  ffff8800c9ccabb8 ffff8800c9ccabb8 0000000000000280 0000000000000001
[ 1560.663816] Call Trace:
[ 1560.663825]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1560.663839]  [<ffffffffa01cfb8d>] __cv_broadcast+0x12d/0x160 [spl]
[ 1560.663847]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1560.663858]  [<ffffffffa01cfbd5>] __cv_wait+0x15/0x20 [spl]
[ 1560.663885]  [<ffffffffa0278eab>] dmu_tx_wait+0x9b/0x350 [zfs]
[ 1560.663912]  [<ffffffffa02791f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[ 1560.663940]  [<ffffffffa02f66d7>] zfs_dirty_inode+0xf7/0x330 [zfs]
[ 1560.663951]  [<ffffffffa01c8956>] ? kmem_free_debug+0x16/0x20 [spl]
[ 1560.663962]  [<ffffffffa01d0b10>] ? crfree+0x170/0x180 [spl]
[ 1560.663972]  [<ffffffffa01d123d>] ? tsd_exit+0x19d/0x1b0 [spl]
[ 1560.663998]  [<ffffffffa02fe288>] ? zfs_tstamp_update_setup+0x38/0x1c0 [zfs]
[ 1560.664026]  [<ffffffffa02f0cfe>] ? zfs_read+0x39e/0x460 [zfs]
[ 1560.664049]  [<ffffffffa030ef2e>] zpl_vap_init+0x84e/0xa10 [zfs]
[ 1560.664056]  [<ffffffff811fb0f8>] __mark_inode_dirty+0x38/0x2d0
[ 1560.664082]  [<ffffffffa02fb3cd>] zfs_mark_inode_dirty+0x4d/0x60 [zfs]
[ 1560.664106]  [<ffffffffa030d626>] zpl_putpage+0x576/0xd50 [zfs]
[ 1560.664114]  [<ffffffff811d0d3c>] __fput+0x9c/0x200
[ 1560.664122]  [<ffffffff811d0eee>] ____fput+0xe/0x10
[ 1560.664128]  [<ffffffff8108f33f>] task_work_run+0x9f/0xe0
[ 1560.664137]  [<ffffffff81014e75>] do_notify_resume+0x95/0xa0
[ 1560.664145]  [<ffffffff81554f20>] int_signal+0x12/0x17
[ 1680.702704] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1680.702864]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1680.702958] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1680.703089] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1680.703100]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1680.703108]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 1680.703116]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 1680.703123] Call Trace:
[ 1680.703141]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 1680.703152]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 1680.703161]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1680.703169]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1680.703176]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1680.703203]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1680.703211]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1680.703224]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1680.703252]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1680.703287]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1680.703299]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1680.703338]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1680.703346]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1680.703385]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1680.703422]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1680.703433]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1680.703444]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1680.703452]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1680.703459]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1680.703467]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1680.703474]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1800.742712] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1800.742869]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1800.742964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1800.743095] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1800.743106]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1800.743114]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 1800.743121]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 1800.743129] Call Trace:
[ 1800.743146]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 1800.743157]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 1800.743165]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1800.743174]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1800.743180]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1800.743206]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1800.743214]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1800.743227]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1800.743256]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1800.743290]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1800.743302]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1800.743341]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1800.743349]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1800.743388]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1800.743425]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1800.743437]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1800.743448]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1800.743455]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1800.743462]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1800.743471]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1800.743477]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1920.782195] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1920.782354]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1920.782449] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.782580] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1920.782590]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1920.782599]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 1920.782606]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 1920.782614] Call Trace:
[ 1920.782631]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 1920.782641]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 1920.782650]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1920.782658]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1920.782665]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1920.782693]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1920.782701]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1920.782714]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1920.782742]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1920.782777]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1920.782789]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1920.782827]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1920.782836]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1920.782874]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1920.782911]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1920.782923]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1920.782934]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1920.782942]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1920.782949]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1920.782957]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1920.782964]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1920.782976] INFO: task java:1218 blocked for more than 120 seconds.
[ 1920.783136]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1920.783236] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.783366] java            D 0000000000000000     0  1218    655 0x00000000
[ 1920.783373]  ffff8800c17f7ab8 0000000000000086 ffff8800c33e1420 0000000000013640
[ 1920.783381]  ffff8800c17f7fd8 0000000000013640 ffffffff81818540 ffff8800c33e1420
[ 1920.783387]  ffff8800c17f7a08 ffffffffa025a872 0000000000000000 ffff8800235be4a0
[ 1920.783393] Call Trace:
[ 1920.783416]  [<ffffffffa025a872>] ? arc_buf_eviction_needed+0x82/0xc0 [zfs]
[ 1920.783439]  [<ffffffffa0260398>] ? dbuf_rele_and_unlock+0x2c8/0x4d0 [zfs]
[ 1920.783476]  [<ffffffffa02b634d>] ? bp_get_dsize+0xad/0xf0 [zfs]
[ 1920.783503]  [<ffffffffa02770e4>] ? dmu_tx_callback_register+0x324/0xab0 [zfs]
[ 1920.783512]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1920.783524]  [<ffffffffa01cfb8d>] __cv_broadcast+0x12d/0x160 [spl]
[ 1920.783532]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1920.783544]  [<ffffffffa01cfbd5>] __cv_wait+0x15/0x20 [spl]
[ 1920.783570]  [<ffffffffa0278eab>] dmu_tx_wait+0x9b/0x350 [zfs]
[ 1920.783597]  [<ffffffffa02791f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[ 1920.783627]  [<ffffffffa02810e5>] ? dsl_dataset_block_freeable+0x45/0x1d0 [zfs]
[ 1920.783653]  [<ffffffffa0276fb9>] ? dmu_tx_callback_register+0x1f9/0xab0 [zfs]
[ 1920.783681]  [<ffffffffa02f74a0>] zfs_write+0x3c0/0xbf0 [zfs]
[ 1920.783688]  [<ffffffff810a9dee>] ? enqueue_entity+0x24e/0xaa0
[ 1920.783696]  [<ffffffff8109a1f0>] ? resched_curr+0xd0/0xe0
[ 1920.783705]  [<ffffffff810ec497>] ? wake_futex+0x67/0x90
[ 1920.783711]  [<ffffffff810ef856>] ? do_futex+0x8f6/0xae0
[ 1920.783736]  [<ffffffffa030d2cb>] zpl_putpage+0x21b/0xd50 [zfs]
[ 1920.783744]  [<ffffffff811cf1d7>] vfs_write+0xb7/0x200
[ 1920.783752]  [<ffffffff811cfd29>] SyS_write+0x59/0xd0
[ 1920.783760]  [<ffffffff81554ca9>] system_call_fastpath+0x12/0x17
[ 1920.783778] INFO: task imap:1389 blocked for more than 120 seconds.
[ 1920.783916]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1920.784009] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.784138] imap            D 0000000000000000     0  1389    770 0x00000000
[ 1920.784145]  ffff88000f97fb38 0000000000000082 ffff880222d3bc60 0000000000013640
[ 1920.784151]  ffff88000f97ffd8 0000000000013640 ffff88009d290a10 ffff880222d3bc60
[ 1920.784161]  ffff8800c9ccabb8 ffff8800c9ccabb8 0000000000000fd8 0000000000000001
[ 1920.784168] Call Trace:
[ 1920.784176]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1920.784190]  [<ffffffffa01cfb8d>] __cv_broadcast+0x12d/0x160 [spl]
[ 1920.784197]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1920.784209]  [<ffffffffa01cfbd5>] __cv_wait+0x15/0x20 [spl]
[ 1920.784236]  [<ffffffffa0278eab>] dmu_tx_wait+0x9b/0x350 [zfs]
[ 1920.784263]  [<ffffffffa02791f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[ 1920.784292]  [<ffffffffa02f66d7>] zfs_dirty_inode+0xf7/0x330 [zfs]
[ 1920.784303]  [<ffffffffa01d0b10>] ? crfree+0x170/0x180 [spl]
[ 1920.784314]  [<ffffffffa01d123d>] ? tsd_exit+0x19d/0x1b0 [spl]
[ 1920.784338]  [<ffffffffa030ef2e>] zpl_vap_init+0x84e/0xa10 [zfs]
[ 1920.784345]  [<ffffffff811fb0f8>] __mark_inode_dirty+0x38/0x2d0
[ 1920.784372]  [<ffffffffa02fb3cd>] zfs_mark_inode_dirty+0x4d/0x60 [zfs]
[ 1920.784395]  [<ffffffffa030d626>] zpl_putpage+0x576/0xd50 [zfs]
[ 1920.784403]  [<ffffffff811d0d3c>] __fput+0x9c/0x200
[ 1920.784411]  [<ffffffff811d0eee>] ____fput+0xe/0x10
[ 1920.784417]  [<ffffffff8108f33f>] task_work_run+0x9f/0xe0
[ 1920.784426]  [<ffffffff81014e75>] do_notify_resume+0x95/0xa0
[ 1920.784434]  [<ffffffff81554f20>] int_signal+0x12/0x17
[ 2280.901736] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 2280.901897]       Tainted: P           O   3.18.2-2-ARCH #1
[ 2280.901992] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2280.902124] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 2280.902134]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 2280.902143]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 2280.902150]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 2280.902157] Call Trace:
[ 2280.902175]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 2280.902186]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 2280.902194]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 2280.902203]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 2280.902210]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 2280.902237]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 2280.902245]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 2280.902258]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 2280.902286]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 2280.902321]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 2280.902333]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 2280.902371]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 2280.902380]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 2280.902419]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 2280.902455]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 2280.902467]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 2280.902478]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 2280.902486]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 2280.902493]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2280.902502]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 2280.902508]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[38275.738573] perf interrupt took too long (2506 > 2495), lowering kernel.perf_event_max_sample_rate to 50100

Running spl versions from Arch:

hinoki% pacman -Q | grep spl
spl-git 0.6.3_r54_g03a7835_3.18.2_2-1
spl-utils-git 0.6.3_r54_g03a7835_3.18.2_2-1

@ioquatix

Free memory:

hinoki% free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        3.6G        691M        1.0M        3.6G        872M
Swap:          4.0G          0B        4.0G

@TonyHoyle

I'm getting this every couple of days now since upgrading to the latest Debian build (which, according to the linked bug, has the fix in it). The oops messages are more or less the same as those already posted. The zvols lock up and the load average climbs into the hundreds. I never had an issue prior to this.
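
For anyone hitting this, a sketch of how to capture more detail the next time it wedges (assuming the magic SysRq interface is enabled):

echo w > /proc/sysrq-trigger                    # dump stacks of all blocked (D-state) tasks to the kernel log
dmesg | tail -n 200                             # collect the dump
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'   # list uninterruptible tasks and their wait channel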

@StevenWolfe

I'm seeing similar symptoms after running a zfs rollback on an SMB-shared dataset. After seeing the load rise quickly, I stopped the SMB service and will give the rollback command some time to see if it recovers (a way to watch whether it is progressing is sketched after the trace below).

From dmesg, on Ubuntu Server 14.04.2 LTS, ZoL 0.6.3-5 from the Ubuntu PPA:

[3764100.795021] smbd            D ffff88010b433480     0 34303  60686 0x00000000
[3764100.795025]  ffff8800ae491d68 0000000000000082 ffff8800cf951800 ffff8800ae491fd8
[3764100.795029]  0000000000013480 0000000000013480 ffff8800cf951800 ffff8800e48dc3d0
[3764100.795032]  ffff8800e48dc3a0 ffff8800e48dc3d8 ffff8800e48dc3c8 0000000000000000
[3764100.795037] Call Trace:
[3764100.795052]  [<ffffffff817251a9>] schedule+0x29/0x70
[3764100.795067]  [<ffffffffa006b7b5>] cv_wait_common+0x125/0x1c0 [spl]
[3764100.795073]  [<ffffffff810ab0b0>] ? prepare_to_wait_event+0x100/0x100
[3764100.795091]  [<ffffffffa006b865>] __cv_wait+0x15/0x20 [spl]
[3764100.795130]  [<ffffffffa017019b>] rrw_enter_read+0x3b/0x150 [zfs]
[3764100.795181]  [<ffffffffa01bf65d>] zfs_getattr_fast+0x3d/0x180 [zfs]
[3764100.795230]  [<ffffffffa01d81fd>] zpl_getattr+0x2d/0x50 [zfs]
[3764100.795234]  [<ffffffff811c2829>] vfs_getattr_nosec+0x29/0x40
[3764100.795237]  [<ffffffff811c28fd>] vfs_getattr+0x2d/0x40
[3764100.795240]  [<ffffffff811c29d2>] vfs_fstatat+0x62/0xa0
[3764100.795244]  [<ffffffff811c2e5f>] SYSC_newstat+0x1f/0x40
[3764100.795248]  [<ffffffff811cdc99>] ? putname+0x29/0x40
[3764100.795252]  [<ffffffff811bcfe8>] ? do_sys_open+0x1b8/0x280
[3764100.795256]  [<ffffffff811c30ae>] SyS_newstat+0xe/0x10
[3764100.795260]  [<ffffffff8173186d>] system_call_fastpath+0x1a/0x1f
[3764100.795263] INFO: task smbd:34313 blocked for more than 120 seconds.
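
One way to watch whether the rollback is still progressing rather than fully wedged (a sketch; "tank" is a placeholder pool name, and txg history must be enabled for the kstat to fill):

echo 100 > /sys/module/zfs/parameters/zfs_txg_history   # retain recent txg records
watch -n 5 'tail -n 3 /proc/spl/kstat/zfs/tank/txgs'    # txg ids should keep advancing
zpool status -v tank                                    # and the pool should stay ONLINE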

@stevleibelt

I can trigger the same error by running rsync between two different pools. The rsync process hangs and cannot be killed.

If I don't stop the rsync from the CLI, the target pool ends up in a faulted state (a way to check and try to clear the suspension is sketched at the end of this comment). The fault status is gone after a reboot.

May 07 22:43:27 <my host> kernel: INFO: task txg_sync:1408 blocked for more than 120 seconds.
May 07 22:43:27 <my host> kernel:       Tainted: P           O    4.0.1-1-ARCH #1
May 07 22:43:27 <my host> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 07 22:43:27 <my host> kernel: txg_sync        D ffff880028b63a28     0  1408      2 0x00000000
May 07 22:43:27 <my host> kernel:  ffff880028b63a28 ffff88018c5f3cc0 ffff880070a16540 0000000000000000
May 07 22:43:27 <my host> kernel:  ffff880028b63fd8 ffff88021fc93e00 7fffffffffffffff ffff8801b1ccae08
May 07 22:43:27 <my host> kernel:  0000000000000001 ffff880028b63a48 ffffffff8156fa87 ffff88008e7bdcb0
May 07 22:43:27 <my host> kernel: Call Trace:
May 07 22:43:27 <my host> kernel:  [<ffffffff8156fa87>] schedule+0x37/0x90
May 07 22:43:27 <my host> kernel:  [<ffffffff8157246c>] schedule_timeout+0x1bc/0x250
May 07 22:43:27 <my host> kernel:  [<ffffffff8101f599>] ? read_tsc+0x9/0x10
May 07 22:43:27 <my host> kernel:  [<ffffffff810e6757>] ? ktime_get+0x37/0xb0
May 07 22:43:27 <my host> kernel:  [<ffffffff8156ef9a>] io_schedule_timeout+0xaa/0x130
May 07 22:43:27 <my host> kernel:  [<ffffffffa034daa0>] ? zio_taskq_member.isra.6+0x80/0x80 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa0192248>] cv_wait_common+0xb8/0x140 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffff810b6b20>] ? wake_atomic_t_function+0x60/0x60
May 07 22:43:27 <my host> kernel:  [<ffffffffa0192328>] __cv_wait_io+0x18/0x20 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffffa034f943>] zio_wait+0x123/0x210 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02d6be1>] dsl_pool_sync+0xc1/0x480 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02f1f80>] spa_sync+0x480/0xbf0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffff810b6b36>] ? autoremove_wake_function+0x16/0x40
May 07 22:43:27 <my host> kernel:  [<ffffffffa0303e06>] txg_sync_thread+0x386/0x630 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffff8156fd62>] ? preempt_schedule_common+0x22/0x40
May 07 22:43:27 <my host> kernel:  [<ffffffffa0303a80>] ? txg_quiesce_thread+0x3a0/0x3a0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa018d561>] thread_generic_wrapper+0x71/0x80 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffffa018d4f0>] ? __thread_exit+0x20/0x20 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffff81093338>] kthread+0xd8/0xf0
May 07 22:43:27 <my host> kernel:  [<ffffffff81093260>] ? kthread_worker_fn+0x170/0x170
May 07 22:43:27 <my host> kernel:  [<ffffffff81573718>] ret_from_fork+0x58/0x90
May 07 22:43:27 <my host> kernel:  [<ffffffff81093260>] ? kthread_worker_fn+0x170/0x170
May 07 22:43:27 <my host> kernel: INFO: task rsync:10064 blocked for more than 120 seconds.
May 07 22:43:27 <my host> kernel:       Tainted: P           O    4.0.1-1-ARCH #1
May 07 22:43:27 <my host> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 07 22:43:27 <my host> kernel: rsync           D ffff8801a4633a58     0 10064      1 0x00000004
May 07 22:43:27 <my host> kernel:  ffff8801a4633a58 ffff880216186f60 ffff880070a56540 ffff8801a4633a68
May 07 22:43:27 <my host> kernel:  ffff8801a4633fd8 ffff8800da2a4a20 ffff8800da2a4ae0 ffff8800da2a4a48
May 07 22:43:27 <my host> kernel:  0000000000000000 ffff8801a4633a78 ffffffff8156fa87 ffff8800da2a4a20
May 07 22:43:27 <my host> kernel: Call Trace:
May 07 22:43:27 <my host> kernel:  [<ffffffff8156fa87>] schedule+0x37/0x90
May 07 22:43:27 <my host> kernel:  [<ffffffffa019229d>] cv_wait_common+0x10d/0x140 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffff810b6b20>] ? wake_atomic_t_function+0x60/0x60
May 07 22:43:27 <my host> kernel:  [<ffffffffa01922e5>] __cv_wait+0x15/0x20 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffffa030327b>] txg_wait_synced+0x8b/0xd0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02c078c>] dmu_tx_wait+0x25c/0x3a0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02c096e>] dmu_tx_assign+0x9e/0x520 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02c8e10>] ? dsl_dataset_block_freeable+0x20/0x70 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02be639>] ? dmu_tx_count_dnode+0x59/0xb0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa033f5de>] zfs_write+0x3ce/0xc50 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffff81515821>] ? unix_stream_recvmsg+0x701/0x7e0
May 07 22:43:27 <my host> kernel:  [<ffffffffa03558cd>] zpl_write+0xbd/0x130 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffff811d56b3>] vfs_write+0xb3/0x200
May 07 22:43:27 <my host> kernel:  [<ffffffff811d624c>] ? vfs_read+0x11c/0x140
May 07 22:43:27 <my host> kernel:  [<ffffffff811d6399>] SyS_write+0x59/0xd0
May 07 22:43:27 <my host> kernel:  [<ffffffff815737c9>] system_call_fastpath+0x12/0x17
May 07 22:44:27 <my host> kernel: WARNING: Pool '<my target pool>' has encountered an uncorrectable I/O failure and has been suspended.
May 07 22:44:27 <my host> zed[12566]: eid=2302 class=data pool=<my target pool>
May 07 22:44:27 <my host> zed[12569]: eid=2303 class=io_failure pool=<my target pool>
# zpool status -v
  pool: <my target pool>
 state: ONLINE
  scan: scrub repaired 0 in 9h15m with 0 errors on Thu May  7 09:01:37 2015
config:

        NAME                    STATE     READ WRITE CKSUM
        <my target pool>          ONLINE       0     0     0
          <my target pool>-crypt  ONLINE       0     0     0

errors: No known data errors

  pool: <my source pool>
 state: ONLINE
  scan: scrub repaired 0 in 6h3m with 0 errors on Thu May  7 05:49:33 2015
config:

        NAME                       STATE     READ WRITE CKSUM
        <my source pool>               ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            crypt-<my source pool>-00  ONLINE       0     0     0
            crypt-<my source pool>-01  ONLINE       0     0     0

errors: No known data errors
#modinfo zfs | head
filename:       /lib/modules/4.0.1-1-ARCH/extra/zfs/zfs.ko.gz
version:        0.6.4.1-1
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     8324F6AEA2A06B2B6F0A0F5
depends:        spl,znvpair,zunicode,zcommon,zavl
vermagic:       4.0.1-1-ARCH SMP preempt mod_unload modversions 
#modinfo spl | head
filename:       /lib/modules/4.0.1-1-ARCH/extra/spl/spl.ko.gz
version:        0.6.4.1-1
license:        GPL
author:         OpenZFS on Linux
description:    Solaris Porting Layer
srcversion:     8907748310B8940C9D0DCD2
depends:        
vermagic:       4.0.1-1-ARCH SMP preempt mod_unload modversions 
#uname -a
Linux <my host> 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29 12:00:26 CEST 2015 x86_64 GNU/Linux

Fingers crossed I've provided good information. I'm running Arch Linux with the demz repo.
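
As mentioned above, a sketch of how to check and try to clear the suspension without a full reboot (pool name placeholder as in the log):

zpool status -v <my target pool>   # will report the suspension and any errors
zpool clear <my target pool>       # attempts to resume I/O on a suspended pool

Whether clear succeeds depends on the underlying device actually being reachable again; a reboot may still be the only way out.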

@melkor217

I got the same issue with high-loaded mongodb on ZFS.

free -m
             total       used       free     shared    buffers     cached
Mem:          7928       7422        506         76        192       1024
-/+ buffers/cache:       6205       1723
Swap:            0          0          0
# modinfo spl|head
filename:       /lib/modules/4.0.0-1-amd64/updates/dkms/spl.ko
version:        0.6.4-2-62e2eb
license:        GPL
author:         OpenZFS on Linux
description:    Solaris Porting Layer
srcversion:     3F1EAF06925B312A0B3F767
depends:        
vermagic:       4.0.0-1-amd64 SMP mod_unload modversions 
parm:           spl_hostid:The system hostid. (ulong)
parm:           spl_hostid_path:The system hostid file (/etc/hostid) (charp)
# modinfo zfs|head
filename:       /lib/modules/4.0.0-1-amd64/updates/dkms/zfs.ko
version:        0.6.4-16-544f71
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     29BA21B62706579B75D5974
depends:        spl,znvpair,zunicode,zcommon,zavl
vermagic:       4.0.0-1-amd64 SMP mod_unload modversions 
parm:           zvol_inhibit_dev:Do not create zvol device nodes (uint)
parm:           zvol_major:Major number for zvol device (uint)
# uname -a
Linux dan-desktop 4.0.0-1-amd64 #1 SMP Debian 4.0.2-1 (2015-05-11) x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux unstable (sid)
Release:        unstable
Codename:       sid

@andreas-p

Same for me on Debian Jessie with Linux 3.16.0-4-amd64 and zfs 0.6.5.2-2.
Happened on a postgresql server, no load at that time.

Jan 22 07:04:42 db04 kernel: [5056080.684110] INFO: task txg_sync:378 blocked for more than 120 seconds.
Jan 22 07:04:43 db04 kernel: [5056080.684128] Tainted: P O 3.16.0-4-amd64 #1
Jan 22 07:04:43 db04 kernel: [5056080.684134] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 22 07:04:43 db04 kernel: [5056080.684142] txg_sync D ffff8800f96bc7e8 0 378 2 0x00000000
Jan 22 07:04:43 db04 kernel: [5056080.684153] ffff8800f96bc390 0000000000000246 0000000000012f00 ffff8800f707ffd8
Jan 22 07:04:43 db04 kernel: [5056080.684165] 0000000000012f00 ffff8800f96bc390 ffff8800ff3137b0 ffff880048db7570
Jan 22 07:04:43 db04 kernel: [5056080.684176] ffff880048db75b0 0000000000000001 ffff8800f8eb9000 0000000000000000
Jan 22 07:04:43 db04 kernel: [5056080.684187] Call Trace:
Jan 22 07:04:43 db04 kernel: [5056080.684201] [] ? io_schedule+0x99/0x120
Jan 22 07:04:43 db04 kernel: [5056080.684220] [] ? cv_wait_common+0x90/0x100 [spl]
Jan 22 07:04:43 db04 kernel: [5056080.684232] [] ? prepare_to_wait_event+0xf0/0xf0
Jan 22 07:04:43 db04 kernel: [5056080.684264] [] ? zio_wait+0x10b/0x1e0 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684287] [] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684313] [] ? spa_sync+0x366/0xb30 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684339] [] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684363] [] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684374] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jan 22 07:04:43 db04 kernel: [5056080.684388] [] ? __thread_exit+0x20/0x20 [spl]
Jan 22 07:04:43 db04 kernel: [5056080.684398] [] ? kthread+0xbd/0xe0
Jan 22 07:04:43 db04 kernel: [5056080.684406] [] ? kthread_create_on_node+0x180/0x180
Jan 22 07:04:43 db04 kernel: [5056080.684417] [] ? ret_from_fork+0x58/0x90
Jan 22 07:04:43 db04 kernel: [5056080.684428] [] ? kthread_create_on_node+0x180/0x180

@kernelOfTruth
Contributor

@andreas-p please update to 0.6.5.4* if available

Alternatively, you can build your own latest zfsonlinux packages:

http://zfsonlinux.org/generic-deb.html
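
A rough sketch of the generic-deb procedure from that page (version numbers are examples only; take the current tarballs from the download page, and build/install spl before zfs so its headers are available):

wget http://archive.zfsonlinux.org/downloads/zfsonlinux/spl/spl-0.6.5.4.tar.gz
wget http://archive.zfsonlinux.org/downloads/zfsonlinux/zfs/zfs-0.6.5.4.tar.gz
tar xzf spl-0.6.5.4.tar.gz && cd spl-0.6.5.4
./configure && make deb && sudo dpkg -i *.deb && cd ..
tar xzf zfs-0.6.5.4.tar.gz && cd zfs-0.6.5.4
./configure && make deb && sudo dpkg -i *.deb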

@andreas-p

Unfortunately, the Debian 8 DKMS package is still on 0.6.5.2, with no update in the last three months. Any clue when this gets resolved?

@kernelOfTruth
Contributor

@andreas-p sorry, no idea,

but there's always the option of building the packages on your own. It's some effort, but you'll know you can trust them instead of having to rely on third-party repositories, etc.

@andreas-p

Got the very same problem with 0.6.5.4 on a different machine with Debian 8, zfs built from source. The stack trace shows exactly the same positions as the trace from Jan 18.
Virtual machine: 6GB RAM, 1.5GB swap, 4 AMD CPU cores.

12:54 Started an rsync from a 2.5TB XFS to a 4TB ZFS partition; memory use rose from 2GB to 5GB within 5 minutes.
13:09 CPU load/1m stepped up from 1 to 10, CPU utilization around 50%, mostly system.
13:11 First txg_sync hung-task timeout appeared; CPU load/1m stepped to 16, committed memory dropped to 0, CPU utilization 85% system.
13:15 CPU utilization 95% system, load 16, reboot needed.
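
To pin down whether that 2GB-to-5GB climb is the ARC, the ARC counters can be sampled alongside the rsync (a sketch):

while sleep 10; do
  date +%T
  awk '$1=="size" || $1=="c" || $1=="arc_meta_used" {print "  "$1, $3}' /proc/spl/kstat/zfs/arcstats
done

If size tracks the growth, the ARC is the consumer; if not, the memory pressure is coming from somewhere else.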

@hostingnuggets

Same problem here with Debian 8, ZoL 0.6.5.2 from the official package repository.

@andreas-p

Exactly the same block with 0.6.5.5:
Mar 24 11:13:06 db04 kernel: [4746960.372343] [] ? io_schedule+0x99/0x120
Mar 24 11:13:06 db04 kernel: [4746960.372358] [] ? cv_wait_common+0x92/0x110 [spl]
Mar 24 11:13:06 db04 kernel: [4746960.372370] [] ? prepare_to_wait_event+0xf0/0xf0
Mar 24 11:13:06 db04 kernel: [4746960.372400] [] ? zio_wait+0x10b/0x1e0 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372423] [] ? dsl_pool_sync+0xaa/0x460 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372447] [] ? spa_sync+0x366/0xb30 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372472] [] ? txg_sync_thread+0x3d1/0x680 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372496] [] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372507] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
Mar 24 11:13:06 db04 kernel: [4746960.372518] [] ? __thread_exit+0x20/0x20 [spl]
Mar 24 11:13:06 db04 kernel: [4746960.372529] [] ? kthread+0xbd/0xe0
Mar 24 11:13:06 db04 kernel: [4746960.372547] [] ? kthread_create_on_node+0x180/0x180
Mar 24 11:13:06 db04 kernel: [4746960.372558] [] ? ret_from_fork+0x58/0x90
Mar 24 11:13:06 db04 kernel: [4746960.372567] [] ? kthread_create_on_node+0x180/0x180

I got two consecutive "txg_sync blocked for more than 120 seconds" messages, then it went back to normal.

@narun4sk

narun4sk commented Jun 15, 2016

I'm suffering from a similar problem: rsync reads/writes are extremely slow, ~3MB/s. After struggling like this for ~24h (it's a multi-million-file dataset) the machine gets bricked.

Please advise.


# uname -a
Linux ip-172-30-0-118 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.5 (jessie)
Release:        8.5
Codename:       jessie
# dpkg -l '*zfs*' | grep ii
ii  libzfs2linux   0.6.5.7-1    amd64        OpenZFS filesystem library for Linux
ii  zfs-dkms       0.6.5.7-1    all          OpenZFS filesystem kernel modules for Linux
ii  zfs-zed        0.6.5.7-1    amd64        OpenZFS Event Daemon
ii  zfsutils-linux 0.6.5.7-1    amd64        command-line tools to manage OpenZFS filesystems
# dpkg -l '*spl*' | grep ii
ii  spl            0.6.5.7-1    amd64        Solaris Porting Layer user-space utilities for Linux
ii  spl-dkms       0.6.5.7-1    all          Solaris Porting Layer kernel modules for Linux
# zpool status
  pool: zfs-backup
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfs-backup  ONLINE       0     0     0
          xvdf      ONLINE       0     0     0
        logs
          xvdg      ONLINE       0     0     0

errors: No known data errors
# zpool iostat -v
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs-backup   564G   380G    538      0  1.48M  2.42K
  xvdf       564G   380G    538      0  1.48M  2.11K
logs            -      -      -      -      -      -
  xvdg          0  1.98G      0      0    471    312
----------  -----  -----  -----  -----  -----  -----
# zpool get all
NAME        PROPERTY                    VALUE                       SOURCE
zfs-backup  size                        944G                        -
zfs-backup  capacity                    59%                         -
zfs-backup  altroot                     -                           default
zfs-backup  health                      ONLINE                      -
zfs-backup  guid                        2876612074418704500         default
zfs-backup  version                     -                           default
zfs-backup  bootfs                      -                           default
zfs-backup  delegation                  on                          default
zfs-backup  autoreplace                 off                         default
zfs-backup  cachefile                   -                           default
zfs-backup  failmode                    wait                        default
zfs-backup  listsnapshots               off                         default
zfs-backup  autoexpand                  off                         default
zfs-backup  dedupditto                  0                           default
zfs-backup  dedupratio                  1.00x                       -
zfs-backup  free                        380G                        -
zfs-backup  allocated                   564G                        -
zfs-backup  readonly                    off                         -
zfs-backup  ashift                      0                           default
zfs-backup  comment                     -                           default
zfs-backup  expandsize                  -                           -
zfs-backup  freeing                     0                           default
zfs-backup  fragmentation               44%                         -
zfs-backup  leaked                      0                           default
zfs-backup  feature@async_destroy       enabled                     local
zfs-backup  feature@empty_bpobj         active                      local
zfs-backup  feature@lz4_compress        active                      local
zfs-backup  feature@spacemap_histogram  active                      local
zfs-backup  feature@enabled_txg         active                      local
zfs-backup  feature@hole_birth          active                      local
zfs-backup  feature@extensible_dataset  enabled                     local
zfs-backup  feature@embedded_data       active                      local
zfs-backup  feature@bookmarks           enabled                     local
zfs-backup  feature@filesystem_limits   enabled                     local
zfs-backup  feature@large_blocks        enabled                     local
Jun 15 16:55:40 ip-172-30-0-118 kernel: [113640.618167] txg_sync        D ffff880079c259c8     0  1361      2 0x00000000
Jun 15 16:55:41 ip-172-30-0-118 kernel: [113640.621686]  ffff880079c25570 0000000000000046 0000000000012f00 ffff88007aaf3fd8
Jun 15 16:55:43 ip-172-30-0-118 kernel: [113640.625572]  0000000000012f00 ffff880079c25570 ffff88007fc137b0 ffff8800070cf050
Jun 15 16:55:43 ip-172-30-0-118 kernel: [113640.629691]  ffff8800070cf090 0000000000000001 ffff8800790a1000 0000000000000000
Jun 15 16:55:44 ip-172-30-0-118 kernel: [113640.633524] Call Trace:
Jun 15 16:55:45 ip-172-30-0-118 kernel: [113640.634740]  [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jun 15 16:55:47 ip-172-30-0-118 kernel: [113640.637492]  [<ffffffffa0152572>] ? cv_wait_common+0x92/0x110 [spl]
Jun 15 16:55:48 ip-172-30-0-118 kernel: [113640.640409]  [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jun 15 16:55:49 ip-172-30-0-118 kernel: [113640.643375]  [<ffffffffa029d12b>] ? zio_wait+0x10b/0x1e0 [zfs]
Jun 15 16:55:49 ip-172-30-0-118 kernel: [113640.646118]  [<ffffffffa022a77a>] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jun 15 16:55:50 ip-172-30-0-118 kernel: [113640.649130]  [<ffffffffa0244766>] ? spa_sync+0x366/0xb30 [zfs]
Jun 15 16:55:51 ip-172-30-0-118 kernel: [113640.651910]  [<ffffffffa0256231>] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jun 15 16:55:53 ip-172-30-0-118 kernel: [113640.655111]  [<ffffffffa0255e60>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jun 15 16:55:53 ip-172-30-0-118 kernel: [113640.658303]  [<ffffffffa014dcab>] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jun 15 16:55:54 ip-172-30-0-118 kernel: [113640.661609]  [<ffffffffa014dc40>] ? __thread_exit+0x20/0x20 [spl]
Jun 15 16:55:55 ip-172-30-0-118 kernel: [113640.664528]  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jun 15 16:55:56 ip-172-30-0-118 kernel: [113640.667024]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:55:58 ip-172-30-0-118 kernel: [113640.670190]  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jun 15 16:55:58 ip-172-30-0-118 kernel: [113640.672778]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:57:39 ip-172-30-0-118 kernel: [113760.681975] txg_sync        D ffff880079c259c8     0  1361      2 0x00000000
Jun 15 16:57:40 ip-172-30-0-118 kernel: [113760.685521]  ffff880079c25570 0000000000000046 0000000000012f00 ffff88007aaf3fd8
Jun 15 16:57:41 ip-172-30-0-118 kernel: [113760.689295]  0000000000012f00 ffff880079c25570 ffff88007fc137b0 ffff8800070cf050
Jun 15 16:57:42 ip-172-30-0-118 kernel: [113760.693211]  ffff8800070cf090 0000000000000001 ffff8800790a1000 0000000000000000
Jun 15 16:57:43 ip-172-30-0-118 kernel: [113760.697110] Call Trace:
Jun 15 16:57:44 ip-172-30-0-118 kernel: [113760.698425]  [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jun 15 16:57:45 ip-172-30-0-118 kernel: [113760.701106]  [<ffffffffa0152572>] ? cv_wait_common+0x92/0x110 [spl]
Jun 15 16:57:47 ip-172-30-0-118 kernel: [113760.704131]  [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jun 15 16:57:48 ip-172-30-0-118 kernel: [113760.707157]  [<ffffffffa029d12b>] ? zio_wait+0x10b/0x1e0 [zfs]
Jun 15 16:57:49 ip-172-30-0-118 kernel: [113760.709944]  [<ffffffffa022a77a>] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jun 15 16:57:51 ip-172-30-0-118 kernel: [113760.713084]  [<ffffffffa0244766>] ? spa_sync+0x366/0xb30 [zfs]
Jun 15 16:57:52 ip-172-30-0-118 kernel: [113760.715920]  [<ffffffffa0256231>] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jun 15 16:57:54 ip-172-30-0-118 kernel: [113760.719014]  [<ffffffffa0255e60>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jun 15 16:57:55 ip-172-30-0-118 kernel: [113760.722283]  [<ffffffffa014dcab>] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jun 15 16:57:57 ip-172-30-0-118 kernel: [113760.725571]  [<ffffffffa014dc40>] ? __thread_exit+0x20/0x20 [spl]
Jun 15 16:57:57 ip-172-30-0-118 kernel: [113760.728485]  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jun 15 16:57:58 ip-172-30-0-118 kernel: [113760.730974]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:58:00 ip-172-30-0-118 kernel: [113760.734102]  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jun 15 16:58:02 ip-172-30-0-118 kernel: [113760.736819]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:59:38 ip-172-30-0-118 kernel: [113880.747394] txg_sync        D ffff880079c259c8     0  1361      2 0x00000000
Jun 15 16:59:39 ip-172-30-0-118 kernel: [113880.750796]  ffff880079c25570 0000000000000046 0000000000012f00 ffff88007aaf3fd8
Jun 15 16:59:39 ip-172-30-0-118 kernel: [113880.754658]  0000000000012f00 ffff880079c25570 ffff88007fc137b0 ffff8800070cf050
Jun 15 16:59:41 ip-172-30-0-118 kernel: [113880.758412]  ffff8800070cf090 0000000000000001 ffff8800790a1000 0000000000000000
Jun 15 16:59:42 ip-172-30-0-118 kernel: [113880.762234] Call Trace:
Jun 15 16:59:43 ip-172-30-0-118 kernel: [113880.763454]  [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jun 15 16:59:44 ip-172-30-0-118 kernel: [113880.766117]  [<ffffffffa0152572>] ? cv_wait_common+0x92/0x110 [spl]
Jun 15 16:59:45 ip-172-30-0-118 kernel: [113880.769024]  [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jun 15 16:59:45 ip-172-30-0-118 kernel: [113880.772106]  [<ffffffffa029d12b>] ? zio_wait+0x10b/0x1e0 [zfs]
Jun 15 16:59:46 ip-172-30-0-118 kernel: [113880.775310]  [<ffffffffa022a77a>] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jun 15 16:59:48 ip-172-30-0-118 kernel: [113880.778331]  [<ffffffffa0244766>] ? spa_sync+0x366/0xb30 [zfs]
Jun 15 16:59:50 ip-172-30-0-118 kernel: [113880.781198]  [<ffffffffa0256231>] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jun 15 16:59:51 ip-172-30-0-118 kernel: [113880.784324]  [<ffffffffa0255e60>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jun 15 16:59:52 ip-172-30-0-118 kernel: [113880.787500]  [<ffffffffa014dcab>] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jun 15 16:59:54 ip-172-30-0-118 kernel: [113880.790757]  [<ffffffffa014dc40>] ? __thread_exit+0x20/0x20 [spl]
Jun 15 16:59:55 ip-172-30-0-118 kernel: [113880.793707]  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jun 15 16:59:56 ip-172-30-0-118 kernel: [113880.796081]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:59:57 ip-172-30-0-118 kernel: [113880.799104]  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jun 15 16:59:58 ip-172-30-0-118 kernel: [113880.801724]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 17:01:41 ip-172-30-0-118 kernel: [114000.817906] txg_sync        D ffff880079c259c8     0  1361      2 0x00000000
Jun 15 17:01:42 ip-172-30-0-118 kernel: [114000.822256]  ffff880079c25570 0000000000000046 0000000000012f00 ffff88007aaf3fd8
Jun 15 17:01:43 ip-172-30-0-118 kernel: [114000.827952]  0000000000012f00 ffff880079c25570 ffff88007fc137b0 ffff8800070cf050
Jun 15 17:01:45 ip-172-30-0-118 kernel: [114000.833092]  ffff8800070cf090 0000000000000001 ffff8800790a1000 0000000000000000
Jun 15 17:01:45 ip-172-30-0-118 kernel: [114000.837533] Call Trace:
Jun 15 17:01:47 ip-172-30-0-118 kernel: [114000.838994]  [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jun 15 17:01:48 ip-172-30-0-118 kernel: [114000.842069]  [<ffffffffa0152572>] ? cv_wait_common+0x92/0x110 [spl]
Jun 15 17:01:49 ip-172-30-0-118 kernel: [114000.845833]  [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jun 15 17:01:50 ip-172-30-0-118 kernel: [114000.849362]  [<ffffffffa029d12b>] ? zio_wait+0x10b/0x1e0 [zfs]
Jun 15 17:01:51 ip-172-30-0-118 kernel: [114000.852630]  [<ffffffffa022a77a>] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jun 15 17:01:52 ip-172-30-0-118 kernel: [114000.856037]  [<ffffffffa0244766>] ? spa_sync+0x366/0xb30 [zfs]
Jun 15 17:01:53 ip-172-30-0-118 kernel: [114000.859185]  [<ffffffffa0256231>] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jun 15 17:01:54 ip-172-30-0-118 kernel: [114000.862633]  [<ffffffffa0255e60>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jun 15 17:01:55 ip-172-30-0-118 kernel: [114000.866301]  [<ffffffffa014dcab>] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jun 15 17:01:56 ip-172-30-0-118 kernel: [114000.870041]  [<ffffffffa014dc40>] ? __thread_exit+0x20/0x20 [spl]
Jun 15 17:01:57 ip-172-30-0-118 kernel: [114000.873530]  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jun 15 17:01:57 ip-172-30-0-118 kernel: [114000.876425]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 17:01:59 ip-172-30-0-118 kernel: [114000.879915]  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jun 15 17:02:00 ip-172-30-0-118 kernel: [114000.882644]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
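
On a box this small (2GB RAM), one mitigation worth trying (a tuning sketch, not a confirmed fix for this bug) is to cap the ARC well below RAM:

echo 536870912 > /sys/module/zfs/parameters/zfs_arc_max                # 512MB cap, takes effect at runtime
echo "options zfs zfs_arc_max=536870912" >> /etc/modprobe.d/zfs.conf   # persist across reboots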

@kernelOfTruth
Contributor

kernelOfTruth commented Jun 16, 2016

@narunask could you please post output of

cat /proc/spl/kstat/zfs/arcstats

or is access to the box meanwhile impossible ("bricked")?

Also please post some hardware and configuration details (RAM, processor, mainboard, hard drive type, LVM/cryptsetup, etc.).

thanks

@narun4sk

narun4sk commented Jun 16, 2016

The server is on EC2, currently 1 core, 2GB RAM (t2.small); the HDD is the magnetic type, which I believe is documented here.

The HDD I'm copying from is LVM-based, consisting of 3 PVs. The HDDs are not encrypted.

Also FYI: rsync is currently calculating deltas and should start moving data again very soon, in case that shows up in the stats.

# cat /proc/spl/kstat/zfs/arcstats
6 1 0x01 91 4368 35041278654 31678952268315
name                            type data
hits                            4    42347862
misses                          4    10285010
demand_data_hits                4    0
demand_data_misses              4    68
demand_metadata_hits            4    34887103
demand_metadata_misses          4    6345542
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    7460759
prefetch_metadata_misses        4    3939400
mru_hits                        4    20830447
mru_ghost_hits                  4    2991105
mfu_hits                        4    14056656
mfu_ghost_hits                  4    1782606
deleted                         4    2167955
mutex_miss                      4    90369
evict_skip                      4    1781974161
evict_not_enough                4    15390660
evict_l2_cached                 4    0
evict_l2_eligible               4    35359622656
evict_l2_ineligible             4    55754657792
evict_l2_skip                   4    0
hash_elements                   4    5452
hash_elements_max               4    48255
hash_collisions                 4    57676
hash_chains                     4    69
hash_chain_max                  4    4
p                               4    7007439
c                               4    34933248
c_min                           4    33554432
c_max                           4    1053282304
size                            4    105425184
hdr_size                        4    1969640
data_size                       4    0
metadata_size                   4    30842880
other_size                      4    72612664
anon_size                       4    753664
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    27006976
mru_evictable_data              4    0
mru_evictable_metadata          4    3653632
mru_ghost_size                  4    6850560
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    6850560
mfu_size                        4    3082240
mfu_evictable_data              4    0
mfu_evictable_metadata          4    32768
mfu_ghost_size                  4    25972224
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    25972224
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    947
memory_indirect_count           4    259028
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    3960
arc_meta_used                   4    105425184
arc_meta_limit                  4    789961728
arc_meta_max                    4    816526072
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    32911360

# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x25
cpu MHz         : 2394.530
cache size      : 30720 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm xsaveopt fsgsbase bmi1 avx2 smep bmi2 erms invpcid
bogomips        : 4789.06
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
# dmidecode --type memory
# dmidecode 3.0
Scanning /dev/mem for entry point.
SMBIOS 2.4 present.

Handle 0x1000, DMI type 16, 15 bytes
Physical Memory Array
        Location: Other
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 2 GB
        Error Information Handle: Not Provided
        Number Of Devices: 1

Handle 0x1100, DMI type 17, 21 bytes
Memory Device
        Array Handle: 0x1000
        Error Information Handle: 0x0000
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 2048 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: Not Specified
        Type: RAM
        Type Detail: None

Thanks

@andreas-p

I hadn't seen the problem for quite a while (it seemed to have gone away since 0.6.5.6), but this morning the very same hung task came up again with 0.6.5.7 (a single occurrence). Sigh...

@ioquatix

I can confirm that I have not seen this issue for ages on the same hardware; it has not recurred in roughly the last 3-6 months.

@ioquatix

Still happening with heavy rsync backups.

[88234.807775] INFO: task rsync:18699 blocked for more than 120 seconds.
[88234.807929]       Tainted: P           O    4.8.13-1-ARCH #1
[88234.808026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88234.808159] rsync           D ffff880124bcb9d0     0 18699  18698 0x00000000
[88234.808171]  ffff880124bcb9d0 00ffffff81003016 ffff88021cd00d00 ffff880127fa6800
[88234.808179]  ffff88022fc17ff0 ffff880124bcc000 ffff88021d8eaa10 ffff88021d8eab58
[88234.808185]  ffff88021d8eaa38 0000000000000000 ffff880124bcb9e8 ffffffff815f40ec
[88234.808190] Call Trace:
[88234.808203]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[88234.808217]  [<ffffffffa0171e1f>] cv_wait_common+0x10f/0x130 [spl]
[88234.808225]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[88234.808235]  [<ffffffffa0171e55>] __cv_wait+0x15/0x20 [spl]
[88234.808294]  [<ffffffffa03095f8>] txg_wait_open+0xa8/0xe0 [zfs]
[88234.808338]  [<ffffffffa02c73db>] dmu_tx_wait+0x32b/0x340 [zfs]
[88234.808380]  [<ffffffffa02c747b>] dmu_tx_assign+0x8b/0x490 [zfs]
[88234.808422]  [<ffffffffa0342d09>] zfs_write+0x3f9/0xc80 [zfs]
[88234.808461]  [<ffffffffa034f567>] ? zio_destroy+0xb7/0xc0 [zfs]
[88234.808500]  [<ffffffffa0352c78>] ? zio_wait+0x138/0x1d0 [zfs]
[88234.808507]  [<ffffffffa016ab9a>] ? spl_kmem_free+0x2a/0x40 [spl]
[88234.808551]  [<ffffffffa0337510>] ? zfs_range_unlock+0x1a0/0x2c0 [zfs]
[88234.808590]  [<ffffffffa035826c>] zpl_write_common_iovec+0x8c/0xe0 [zfs]
[88234.808627]  [<ffffffffa0358497>] zpl_write+0x87/0xc0 [zfs]
[88234.808635]  [<ffffffff81208797>] __vfs_write+0x37/0x140
[88234.808642]  [<ffffffff810c7be7>] ? percpu_down_read+0x17/0x50
[88234.808648]  [<ffffffff81209566>] vfs_write+0xb6/0x1a0
[88234.808652]  [<ffffffff8120a9e5>] SyS_write+0x55/0xc0
[88234.808658]  [<ffffffff815f8032>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[88234.808664] INFO: task imap:19376 blocked for more than 120 seconds.
[88234.808803]       Tainted: P           O    4.8.13-1-ARCH #1
[88234.808898] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88234.809029] imap            D ffff8801b875fad8     0 19376   1488 0x00000000
[88234.809036]  ffff8801b875fad8 00ff88021d8eab00 ffffffff81a0d500 ffff8801a9f95b00
[88234.809042]  ffff8801b875fab8 ffff8801b8760000 ffff88021d8eaa10 ffff88021d8eab58
[88234.809048]  ffff88021d8eaa38 0000000000000000 ffff8801b875faf0 ffffffff815f40ec
[88234.809053] Call Trace:
[88234.809059]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[88234.809067]  [<ffffffffa0171e1f>] cv_wait_common+0x10f/0x130 [spl]
[88234.809073]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[88234.809081]  [<ffffffffa0171e55>] __cv_wait+0x15/0x20 [spl]
[88234.809131]  [<ffffffffa03095f8>] txg_wait_open+0xa8/0xe0 [zfs]
[88234.809175]  [<ffffffffa02c73db>] dmu_tx_wait+0x32b/0x340 [zfs]
[88234.809217]  [<ffffffffa02c747b>] dmu_tx_assign+0x8b/0x490 [zfs]
[88234.809259]  [<ffffffffa0342299>] zfs_dirty_inode+0xe9/0x300 [zfs]
[88234.809265]  [<ffffffff8122a3b4>] ? mntput+0x24/0x40
[88234.809272]  [<ffffffff81189adb>] ? release_pages+0x2cb/0x380
[88234.809278]  [<ffffffff811c763e>] ? free_pages_and_swap_cache+0x8e/0xa0
[88234.809315]  [<ffffffffa035a2cc>] zpl_dirty_inode+0x2c/0x40 [zfs]
[88234.809322]  [<ffffffff81237785>] __mark_inode_dirty+0x45/0x400
[88234.809362]  [<ffffffffa0346fe8>] zfs_mark_inode_dirty+0x48/0x50 [zfs]
[88234.809399]  [<ffffffffa0358826>] zpl_release+0x46/0x90 [zfs]
[88234.809404]  [<ffffffff8120b42f>] __fput+0x9f/0x1e0
[88234.809408]  [<ffffffff8120b5ae>] ____fput+0xe/0x10
[88234.809414]  [<ffffffff8109a0d0>] task_work_run+0x80/0xa0
[88234.809420]  [<ffffffff8100366a>] exit_to_usermode_loop+0xba/0xc0
[88234.809425]  [<ffffffff81003b2e>] syscall_return_slowpath+0x4e/0x60
[88234.809430]  [<ffffffff815f80ba>] entry_SYSCALL_64_fastpath+0xa2/0xa4   

Most of the time performance is fine.

@ioquatix

So, I replaced the drive that might have been causing issues with a new Samsung 840 PRO SSD, partitioned as 50GB OS, 4GB swap, and the rest available as L2ARC.

I've set the ARC max size to 4GB on a system with only 8GB of memory.
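
For reference, the relevant setup steps look roughly like this (a sketch; pool and partition names are placeholders):

zpool add tank cache /dev/sda3                                          # leftover SSD space as L2ARC
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf   # 4GB ARC cap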

Tonight the same issue occurred: rsync ran and then everything ground to a halt. So it seems unlikely to be a disk issue; nominal read speeds are 100MB/s+, so things are otherwise humming along quite nicely.

@ioquatix

I can reproduce this issue fairly easily, so just let me know what information you'd like me to collect and I'll try to gather it.
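
A starting set that is usually asked for on these reports (a sketch; adjust pool names):

dmesg | grep -iE 'blocked|hung|zfs|spl'   # the hung-task traces
cat /proc/spl/kstat/zfs/arcstats          # ARC state around the hang
zpool events -v                           # recent pool events (I/O errors, delays)
zpool iostat -v 5 6                       # six 5-second samples of per-vdev I/O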

@ioquatix

[78889.098553] INFO: task txg_sync:1372 blocked for more than 120 seconds.
[78889.098667]       Tainted: P           O    4.8.13-1-ARCH #1
[78889.098719] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[78889.098790] txg_sync        D ffff88021c893ab8     0  1372      2 0x00000000
[78889.098798]  ffff88021c893ab8 00ffffff810a5d2f ffff8802212b5b00 ffff88021d3ece00
[78889.098802]  0000000000000046 ffff88021c894000 ffff88022fc17f80 7fffffffffffffff
[78889.098806]  ffff880105123b60 0000000000000001 ffff88021c893ad0 ffffffff815f40ec
[78889.098809] Call Trace:
[78889.098818]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[78889.098821]  [<ffffffff815f6fa3>] schedule_timeout+0x243/0x3d0
[78889.098825]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[78889.098828]  [<ffffffff810bfd9d>] ? __wake_up_common+0x4d/0x80
[78889.098832]  [<ffffffff810f2c51>] ? ktime_get+0x41/0xb0
[78889.098835]  [<ffffffff815f38c4>] io_schedule_timeout+0xa4/0x110
[78889.098843]  [<ffffffffa0203dc1>] cv_wait_common+0xb1/0x130 [spl]
[78889.098846]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[78889.098851]  [<ffffffffa0203e98>] __cv_wait_io+0x18/0x20 [spl]
[78889.098886]  [<ffffffffa0343c3d>] zio_wait+0xfd/0x1d0 [zfs]
[78889.098912]  [<ffffffffa02cf258>] dsl_pool_sync+0xb8/0x480 [zfs]
[78889.098939]  [<ffffffffa02ea38f>] spa_sync+0x37f/0xb30 [zfs]
[78889.098942]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[78889.098969]  [<ffffffffa02fac1a>] txg_sync_thread+0x3ba/0x620 [zfs]
[78889.098997]  [<ffffffffa02fa860>] ? txg_delay+0x160/0x160 [zfs]
[78889.099002]  [<ffffffffa01fef61>] thread_generic_wrapper+0x71/0x80 [spl]
[78889.099006]  [<ffffffffa01feef0>] ? __thread_exit+0x20/0x20 [spl]
[78889.099010]  [<ffffffff8109be38>] kthread+0xd8/0xf0
[78889.099013]  [<ffffffff8102c782>] ? __switch_to+0x2d2/0x630
[78889.099016]  [<ffffffff815f823f>] ret_from_fork+0x1f/0x40
[78889.099019]  [<ffffffff8109bd60>] ? kthread_worker_fn+0x170/0x170
[79380.619090] INFO: task txg_sync:1372 blocked for more than 120 seconds.
[79380.619245]       Tainted: P           O    4.8.13-1-ARCH #1
[79380.619342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[79380.619475] txg_sync        D ffff88021c893ab8     0  1372      2 0x00000000
[79380.619487]  ffff88021c893ab8 00ffffff810a5d2f ffffffff81a0d500 ffff88021d3ece00
[79380.619494]  0000000000000046 ffff88021c894000 ffff88022fc17f80 7fffffffffffffff
[79380.619500]  ffff8801e9c0c850 0000000000000001 ffff88021c893ad0 ffffffff815f40ec
[79380.619506] Call Trace:
[79380.619519]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[79380.619525]  [<ffffffff815f6fa3>] schedule_timeout+0x243/0x3d0
[79380.619531]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[79380.619537]  [<ffffffff810bfd9d>] ? __wake_up_common+0x4d/0x80
[79380.619543]  [<ffffffff810f2c51>] ? ktime_get+0x41/0xb0
[79380.619547]  [<ffffffff815f38c4>] io_schedule_timeout+0xa4/0x110
[79380.619560]  [<ffffffffa0203dc1>] cv_wait_common+0xb1/0x130 [spl]
[79380.619565]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[79380.619574]  [<ffffffffa0203e98>] __cv_wait_io+0x18/0x20 [spl]
[79380.619623]  [<ffffffffa0343c3d>] zio_wait+0xfd/0x1d0 [zfs]
[79380.619672]  [<ffffffffa02cf258>] dsl_pool_sync+0xb8/0x480 [zfs]
[79380.619724]  [<ffffffffa02ea38f>] spa_sync+0x37f/0xb30 [zfs]
[79380.619729]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[79380.619780]  [<ffffffffa02fac1a>] txg_sync_thread+0x3ba/0x620 [zfs]
[79380.619830]  [<ffffffffa02fa860>] ? txg_delay+0x160/0x160 [zfs]
[79380.619839]  [<ffffffffa01fef61>] thread_generic_wrapper+0x71/0x80 [spl]
[79380.619847]  [<ffffffffa01feef0>] ? __thread_exit+0x20/0x20 [spl]
[79380.619854]  [<ffffffff8109be38>] kthread+0xd8/0xf0
[79380.619860]  [<ffffffff8102c782>] ? __switch_to+0x2d2/0x630
[79380.619866]  [<ffffffff815f823f>] ret_from_fork+0x1f/0x40
[79380.619872]  [<ffffffff8109bd60>] ? kthread_worker_fn+0x170/0x170
[81223.826929] INFO: task txg_sync:1372 blocked for more than 120 seconds.
[81223.827091]       Tainted: P           O    4.8.13-1-ARCH #1
[81223.827226] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[81223.827381] txg_sync        D ffff88021c893ab8     0  1372      2 0x00000000
[81223.827393]  ffff88021c893ab8 00ffffff810a5d2f ffff8802157c3400 ffff88021d3ece00
[81223.827400]  0000000000000046 ffff88021c894000 ffff88022fc17f80 7fffffffffffffff
[81223.827407]  ffff88015f37d0f0 0000000000000001 ffff88021c893ad0 ffffffff815f40ec
[81223.827413] Call Trace:
[81223.827426]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[81223.827432]  [<ffffffff815f6fa3>] schedule_timeout+0x243/0x3d0
[81223.827439]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[81223.827444]  [<ffffffff810bfd9d>] ? __wake_up_common+0x4d/0x80
[81223.827451]  [<ffffffff810f2c51>] ? ktime_get+0x41/0xb0
[81223.827455]  [<ffffffff815f38c4>] io_schedule_timeout+0xa4/0x110
[81223.827468]  [<ffffffffa0203dc1>] cv_wait_common+0xb1/0x130 [spl]
[81223.827473]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[81223.827482]  [<ffffffffa0203e98>] __cv_wait_io+0x18/0x20 [spl]
[81223.827531]  [<ffffffffa0343c3d>] zio_wait+0xfd/0x1d0 [zfs]
[81223.827580]  [<ffffffffa02cf258>] dsl_pool_sync+0xb8/0x480 [zfs]
[81223.827632]  [<ffffffffa02ea38f>] spa_sync+0x37f/0xb30 [zfs]
[81223.827636]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[81223.827687]  [<ffffffffa02fac1a>] txg_sync_thread+0x3ba/0x620 [zfs]
[81223.827738]  [<ffffffffa02fa860>] ? txg_delay+0x160/0x160 [zfs]
[81223.827747]  [<ffffffffa01fef61>] thread_generic_wrapper+0x71/0x80 [spl]
[81223.827755]  [<ffffffffa01feef0>] ? __thread_exit+0x20/0x20 [spl]
[81223.827762]  [<ffffffff8109be38>] kthread+0xd8/0xf0
[81223.827767]  [<ffffffff8102c782>] ? __switch_to+0x2d2/0x630
[81223.827773]  [<ffffffff815f823f>] ret_from_fork+0x1f/0x40
[81223.827779]  [<ffffffff8109bd60>] ? kthread_worker_fn+0x170/0x170
[94923.528386] CE: hpet increased min_delta_ns to 20115 nsec
[102727.918958] INFO: task rsync:5577 blocked for more than 120 seconds.
[102727.919092]       Tainted: P           O    4.8.13-1-ARCH #1
[102727.919170] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[102727.919277] rsync           D ffff8801dfe9ba10     0  5577   5576 0x00000000
[102727.919287]  ffff8801dfe9ba10 00000000ffffffff ffffffff81a0d500 ffff8802157c3400
[102727.919293]  ffffffffa0340567 ffff8801dfe9c000 ffff88021ed41148 ffff88021ed41180
[102727.919298]  ffff88021ed41170 0000000000000000 ffff8801dfe9ba28 ffffffff815f40ec
[102727.919303] Call Trace:
[102727.919352]  [<ffffffffa0340567>] ? zio_destroy+0xb7/0xc0 [zfs]
[102727.919358]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[102727.919367]  [<ffffffffa0203e1f>] cv_wait_common+0x10f/0x130 [spl]
[102727.919373]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[102727.919380]  [<ffffffffa0203e55>] __cv_wait+0x15/0x20 [spl]
[102727.919414]  [<ffffffffa02b814b>] dmu_tx_wait+0x9b/0x340 [zfs]
[102727.919448]  [<ffffffffa02b847b>] dmu_tx_assign+0x8b/0x490 [zfs]
[102727.919482]  [<ffffffffa0333d09>] zfs_write+0x3f9/0xc80 [zfs]
[102727.919514]  [<ffffffffa0340567>] ? zio_destroy+0xb7/0xc0 [zfs]
[102727.919544]  [<ffffffffa0343c78>] ? zio_wait+0x138/0x1d0 [zfs]
[102727.919579]  [<ffffffffa0328510>] ? zfs_range_unlock+0x1a0/0x2c0 [zfs]
[102727.919610]  [<ffffffffa034926c>] zpl_write_common_iovec+0x8c/0xe0 [zfs]
[102727.919640]  [<ffffffffa0349497>] zpl_write+0x87/0xc0 [zfs]
[102727.919646]  [<ffffffff81208797>] __vfs_write+0x37/0x140
[102727.919652]  [<ffffffff810c7be7>] ? percpu_down_read+0x17/0x50
[102727.919656]  [<ffffffff81209566>] vfs_write+0xb6/0x1a0
[102727.919660]  [<ffffffff8120a9e5>] SyS_write+0x55/0xc0
[102727.919665]  [<ffffffff815f8032>] entry_SYSCALL_64_fastpath+0x1a/0xa4

Happened last night during backups.

@ioquatix

ioquatix commented Feb 6, 2017

Okay, so I've replaced the drive that had the high await time and also increased the memory to 16GB, but I'm still having issues.
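
For reference, the per-device await numbers came from sampling along these lines (a sketch; column names vary between sysstat versions):

iostat -x 5 3   # three 5-second samples; look at await/w_await and %util per device

The traces from the latest hang: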

[12785.566973] CE: hpet increased min_delta_ns to 20115 nsec
[25560.317294] INFO: task txg_sync:1392 blocked for more than 120 seconds.
[25560.317451]       Tainted: P           O    4.9.6-1-ARCH #1
[25560.317542] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25560.317668] txg_sync        D    0  1392      2 0x00000000
[25560.317678]  ffff88041b855c00 0000000000000000 ffff8804104b2700 ffff88042fc180c0
[25560.317686]  ffffffff81a0e500 ffffc9000828fad8 ffffffff81605cdf ffff880411a4c080
[25560.317692]  00ffffffa0341360 ffff88042fc180c0 0000000000000000 ffff8804104b2700
[25560.317698] Call Trace:
[25560.317713]  [<ffffffff81605cdf>] ? __schedule+0x22f/0x6e0
[25560.317719]  [<ffffffff816061cd>] schedule+0x3d/0x90
[25560.317726]  [<ffffffff81608fd3>] schedule_timeout+0x243/0x3d0
[25560.317781]  [<ffffffffa033fbb1>] ? zio_taskq_dispatch+0x91/0xa0 [zfs]
[25560.317821]  [<ffffffffa033fbd2>] ? zio_issue_async+0x12/0x20 [zfs]
[25560.317859]  [<ffffffffa0343569>] ? zio_nowait+0x79/0x110 [zfs]
[25560.317867]  [<ffffffff810f7b81>] ? ktime_get+0x41/0xb0
[25560.317873]  [<ffffffff81605a44>] io_schedule_timeout+0xa4/0x110
[25560.317884]  [<ffffffffa01eccd1>] cv_wait_common+0xb1/0x130 [spl]
[25560.317891]  [<ffffffff810c4200>] ? wake_atomic_t_function+0x60/0x60
[25560.317900]  [<ffffffffa01ecda8>] __cv_wait_io+0x18/0x20 [spl]
[25560.317938]  [<ffffffffa034337c>] zio_wait+0xac/0x130 [zfs]
[25560.317984]  [<ffffffffa02cf408>] dsl_pool_sync+0xb8/0x480 [zfs]
[25560.318035]  [<ffffffffa02e9e1f>] spa_sync+0x37f/0xb30 [zfs]
[25560.318041]  [<ffffffff810aa4a2>] ? default_wake_function+0x12/0x20
[25560.318091]  [<ffffffffa02fa6aa>] txg_sync_thread+0x3ba/0x620 [zfs]
[25560.318096]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[25560.318145]  [<ffffffffa02fa2f0>] ? txg_delay+0x160/0x160 [zfs]
[25560.318154]  [<ffffffffa01e7f22>] thread_generic_wrapper+0x72/0x80 [spl]
[25560.318161]  [<ffffffffa01e7eb0>] ? __thread_exit+0x20/0x20 [spl]
[25560.318167]  [<ffffffff8109e8f9>] kthread+0xd9/0xf0
[25560.318171]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[25560.318176]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[25560.318180]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[25560.318184]  [<ffffffff8160a995>] ret_from_fork+0x25/0x30
[25683.204571] INFO: task txg_sync:1392 blocked for more than 120 seconds.
[25683.204722]       Tainted: P           O    4.9.6-1-ARCH #1
[25683.204814] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25683.204940] txg_sync        D    0  1392      2 0x00000000
[25683.204950]  ffff88041b855c00 0000000000000000 ffff8804104b2700 ffff88042fc180c0
[25683.204958]  ffffffff81a0e500 ffffc9000828fad8 ffffffff81605cdf ffff880411a4c080
[25683.204965]  00ffffffa0341360 ffff88042fc180c0 0000000000000000 ffff8804104b2700
[25683.204970] Call Trace:
[25683.204985]  [<ffffffff81605cdf>] ? __schedule+0x22f/0x6e0
[25683.204992]  [<ffffffff816061cd>] schedule+0x3d/0x90
[25683.204999]  [<ffffffff81608fd3>] schedule_timeout+0x243/0x3d0
[25683.205055]  [<ffffffffa033fbb1>] ? zio_taskq_dispatch+0x91/0xa0 [zfs]
[25683.205095]  [<ffffffffa033fbd2>] ? zio_issue_async+0x12/0x20 [zfs]
[25683.205134]  [<ffffffffa0343569>] ? zio_nowait+0x79/0x110 [zfs]
[25683.205142]  [<ffffffff810f7b81>] ? ktime_get+0x41/0xb0
[25683.205148]  [<ffffffff81605a44>] io_schedule_timeout+0xa4/0x110
[25683.205159]  [<ffffffffa01eccd1>] cv_wait_common+0xb1/0x130 [spl]
[25683.205166]  [<ffffffff810c4200>] ? wake_atomic_t_function+0x60/0x60
[25683.205175]  [<ffffffffa01ecda8>] __cv_wait_io+0x18/0x20 [spl]
[25683.205213]  [<ffffffffa034337c>] zio_wait+0xac/0x130 [zfs]
[25683.205259]  [<ffffffffa02cf408>] dsl_pool_sync+0xb8/0x480 [zfs]
[25683.205310]  [<ffffffffa02e9e1f>] spa_sync+0x37f/0xb30 [zfs]
[25683.205315]  [<ffffffff810aa4a2>] ? default_wake_function+0x12/0x20
[25683.205365]  [<ffffffffa02fa6aa>] txg_sync_thread+0x3ba/0x620 [zfs]
[25683.205371]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[25683.205419]  [<ffffffffa02fa2f0>] ? txg_delay+0x160/0x160 [zfs]
[25683.205428]  [<ffffffffa01e7f22>] thread_generic_wrapper+0x72/0x80 [spl]
[25683.205436]  [<ffffffffa01e7eb0>] ? __thread_exit+0x20/0x20 [spl]
[25683.205441]  [<ffffffff8109e8f9>] kthread+0xd9/0xf0
[25683.205445]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[25683.205450]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[25683.205454]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[25683.205458]  [<ffffffff8160a995>] ret_from_fork+0x25/0x30
[26051.866268] INFO: task txg_sync:1392 blocked for more than 120 seconds.
[26051.866424]       Tainted: P           O    4.9.6-1-ARCH #1
[26051.866516] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26051.866642] txg_sync        D    0  1392      2 0x00000000
[26051.866652]  ffff8803e6f2c000 0000000000000000 ffff8804104b2700 ffff88042fc180c0
[26051.866660]  ffff88041b50b400 ffffc9000828fad8 ffffffff81605cdf ffff880411a4c080
[26051.866667]  00ffffffa0341360 ffff88042fc180c0 0000000000000000 ffff8804104b2700
[26051.866673] Call Trace:
[26051.866687]  [<ffffffff81605cdf>] ? __schedule+0x22f/0x6e0
[26051.866694]  [<ffffffff816061cd>] schedule+0x3d/0x90
[26051.866700]  [<ffffffff81608fd3>] schedule_timeout+0x243/0x3d0
[26051.866756]  [<ffffffffa033fbb1>] ? zio_taskq_dispatch+0x91/0xa0 [zfs]
[26051.866796]  [<ffffffffa033fbd2>] ? zio_issue_async+0x12/0x20 [zfs]
[26051.866834]  [<ffffffffa0343569>] ? zio_nowait+0x79/0x110 [zfs]
[26051.866842]  [<ffffffff810f7b81>] ? ktime_get+0x41/0xb0
[26051.866847]  [<ffffffff81605a44>] io_schedule_timeout+0xa4/0x110
[26051.866859]  [<ffffffffa01eccd1>] cv_wait_common+0xb1/0x130 [spl]
[26051.866866]  [<ffffffff810c4200>] ? wake_atomic_t_function+0x60/0x60
[26051.866874]  [<ffffffffa01ecda8>] __cv_wait_io+0x18/0x20 [spl]
[26051.866913]  [<ffffffffa034337c>] zio_wait+0xac/0x130 [zfs]
[26051.866959]  [<ffffffffa02cf408>] dsl_pool_sync+0xb8/0x480 [zfs]
[26051.867009]  [<ffffffffa02e9e1f>] spa_sync+0x37f/0xb30 [zfs]
[26051.867015]  [<ffffffff810aa4a2>] ? default_wake_function+0x12/0x20
[26051.867065]  [<ffffffffa02fa6aa>] txg_sync_thread+0x3ba/0x620 [zfs]
[26051.867070]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[26051.867119]  [<ffffffffa02fa2f0>] ? txg_delay+0x160/0x160 [zfs]
[26051.867128]  [<ffffffffa01e7f22>] thread_generic_wrapper+0x72/0x80 [spl]
[26051.867135]  [<ffffffffa01e7eb0>] ? __thread_exit+0x20/0x20 [spl]
[26051.867140]  [<ffffffff8109e8f9>] kthread+0xd9/0xf0
[26051.867145]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[26051.867149]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[26051.867153]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[26051.867157]  [<ffffffff8160a995>] ret_from_fork+0x25/0x30

@ioquatix

ioquatix commented Feb 6, 2017

Running

% pacman -Q | egrep "zfs|spl"
spl-linux 0.6.5.9_4.9.6_1-2
spl-utils-linux 0.6.5.9_4.9.6_1-2
zfs-linux 0.6.5.9_4.9.6_1-2
zfs-utils-linux 0.6.5.9_4.9.6_1-2

@ioquatix

ioquatix commented Feb 6, 2017

Memory available is okay:

% free -h
              total        used        free      shared  buff/cache   available
Mem:            15G         12G        2.7G        692K        630M        3.0G
Swap:          4.0G          0B        4.0G
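
Note that on Linux the ARC is counted under "used" by free rather than under buff/cache, so it's worth confirming how much of that 12G used is actually ARC. A minimal check against the standard ZoL kstats (no root needed):

% awk '$1 == "size" || $1 == "c" || $1 == "c_max"' /proc/spl/kstat/zfs/arcstats

Here size is the current ARC size, c the adaptive target, and c_max the configured ceiling.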

iostat seems okay (sde is a Samsung 840 Pro):

% iostat -mx 10
Linux 4.9.6-1-ARCH (hinoki) 	06/02/17 	_x86_64_	(2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.69    0.00    2.26    3.18    0.00   93.88

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    4.39   13.96     0.19     0.23    47.25     0.06    3.16   11.42    0.57   2.28   4.18
sdb               0.00     0.00    4.39   13.69     0.19     0.22    46.63     0.06    3.13   11.19    0.55   2.26   4.09
sdc               0.00     0.00    4.29   13.58     0.19     0.23    47.97     0.06    3.61   12.91    0.66   2.58   4.62
sdd               0.00     0.00    4.32   13.80     0.18     0.23    46.40     0.06    3.18   11.60    0.55   2.29   4.15
sde               0.00     0.30    1.46   11.25     0.02     0.36    61.82     0.01    0.41    0.78    0.36   0.22   0.28

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.43    0.00   47.68   28.80    0.00   13.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   54.20    0.80     3.27     0.01   122.01     0.31    5.70    5.77    0.88   3.10  17.06
sdb               0.00     0.00   54.60    0.80     3.09     0.01   114.60     0.43    7.70    7.80    0.75   3.62  20.07
sdc               0.00     0.00   54.20    1.00     3.15     0.01   117.16     0.39    6.98    7.11    0.00   3.80  21.00
sdd               0.00     0.00   56.30    1.00     3.08     0.01   110.27     0.43    7.57    7.70    0.30   3.87  22.17
sde               0.00     5.60   27.30  159.50     0.08     7.38    81.70     0.06    0.34    0.21    0.36   0.22   4.20

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.69    0.00   43.65   30.28    0.00   20.38

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   43.30    0.90     2.45     0.01   113.81     0.35    7.86    8.03    0.00   4.38  19.34
sdb               0.00     0.00   40.30    0.90     2.31     0.01   114.99     0.36    8.68    8.87    0.44   4.47  18.40
sdc               0.00     0.00   41.80    0.80     2.39     0.01   115.08     0.34    8.08    8.22    0.38   4.68  19.93
sdd               0.00     0.00   41.20    0.80     2.30     0.01   112.34     0.44   10.39   10.58    0.88   4.56  19.16
sde               0.00     3.10   25.80  136.70     0.07     5.80    74.00     0.05    0.28    0.48    0.25   0.21   3.47

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.42    0.00   38.83   28.72    0.00   29.02

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   26.70    0.90     1.58     0.01   117.86     0.26    9.37    9.63    1.78   5.51  15.20
sdb               0.00     0.00   28.50    0.90     1.53     0.01   107.54     0.25    8.59    8.86    0.00   5.04  14.83
sdc               0.00     0.00   27.40    0.90     1.41     0.01   103.12     0.27    9.34    9.65    0.00   5.49  15.53
sdd               0.00     0.00   28.60    0.90     1.43     0.01   100.18     0.24    8.21    8.47    0.00   4.84  14.27
sde               0.00     0.80   27.30   84.50     0.08     2.67    50.38     0.02    0.21    0.27    0.19   0.19   2.07
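
Since it's txg_sync that is blocking, the per-pool txg kstats may show whether individual transaction groups are taking abnormally long to sync. A rough sketch, assuming a pool named tank (substitute your own pool name); zfs_txg_history must be non-zero before the file is populated:

# echo 50 > /sys/module/zfs/parameters/zfs_txg_history
# cat /proc/spl/kstat/zfs/tank/txgs

Long otime/stime values on individual txgs there would line up with the 120-second hung-task reports above.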

@hostingnuggets

Just wanted to give a short update on my post in this issue from last year: in the meantime I have upgraded to ZFS 0.6.5.8 from Debian's backports, still on Debian 8. Unfortunately, I still get the exact same timeout in the kernel logs.

@gdevenyi
Contributor

gdevenyi commented Feb 6, 2017

@kpande if you're going to assert that, can you please describe what's big enough?

I have resources 4X the average described in this thread and I also see the same issues, in this case with the combination of zfs send and nfsd load.

@ioquatix

ioquatix commented Feb 6, 2017

"severely undersized hardware"

You may be right.

I have 4x 4TB drives and one 256GB SSD for OS/cache, 16GB of memory, and a 2-core CPU (1.5GHz Atom). This system is used almost exclusively for rsync backups. I feel this is pretty reasonable for my needs.

The ARC is 12GB, leaving 4GB for the OS. The L2ARC, if enabled, is about 170GB.

I'll have to check how many files it is, but I'm not sure if it's multi-millions or not.

@hostingnuggets

@kpande you are right, I forgot to give any relevant info about my underlying hardware. Sorry about that; here is hopefully all the relevant information:

My server is a virtualization server running Xen 4.4 with currently 6 virtual machines, all of which have their logical volumes (LVM) stored on a RAIDZ1 volume of 3x 2TB Seagate SATA enterprise disks (ST32000645NS). The Debian 8 OS is independent, located on two internal 16 GB SATA SSDs mirrored with Linux MD RAID1. The CPU is an Intel E5-2620 v3 @ 2.40GHz with 6 cores/12 threads. Of these 6 cores, 4 vCPUs have been pinned to the host/hypervisor/dom0 using the dom0_max_vcpus=4 dom0_vcpus_pin kernel options. The server has 64 GB of TruDDR4 ECC memory, of which 6 GB is reserved for the host/hypervisor/dom0 via the dom0_mem=6G,max:6G kernel option. Finally, I have reserved 2 GB of those 6 GB for the ARC using the zfs_arc_max=2147483648 module option. I have also disabled ZFS prefetch, if that is of any relevance (zfs_prefetch_disable=1).
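
For reference, a minimal sketch of how those two module options would be set persistently (the file name is only a convention; adjust values as needed):

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=2147483648 zfs_prefetch_disable=1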

Below is the output of an actual ARC summary (the server was rebooted 5 days ago):

ZFS Subsystem Report				Tue Feb 07 15:11:32 2017
ARC Summary: (HEALTHY)
	Memory Throttle Count:			0

ARC Misc:
	Deleted:				3.05m
	Mutex Misses:				7
	Evict Skips:				7

ARC Size:				77.69%	1.55	GiB
	Target Size: (Adaptive)		100.00%	2.00	GiB
	Min Size (Hard Limit):		1.56%	32.00	MiB
	Max Size (High Water):		64:1	2.00	GiB

ARC Size Breakdown:
	Recently Used Cache Size:	93.75%	1.88	GiB
	Frequently Used Cache Size:	6.25%	128.00	MiB

ARC Hash Breakdown:
	Elements Max:				171.83k
	Elements Current:		97.97%	168.34k
	Collisions:				68.60m
	Chain Max:				6
	Chains:					12.12k

ARC Total accesses:					102.96m
	Cache Hit Ratio:		48.88%	50.32m
	Cache Miss Ratio:		51.12%	52.63m
	Actual Hit Ratio:		48.88%	50.32m

	Data Demand Efficiency:		0.00%	49.34m

	CACHE HITS BY CACHE LIST:
	  Most Recently Used:		63.44%	31.92m
	  Most Frequently Used:		36.56%	18.40m
	  Most Recently Used Ghost:	0.08%	42.37k
	  Most Frequently Used Ghost:	0.00%	0

	CACHE HITS BY DATA TYPE:
	  Demand Data:			0.00%	167
	  Prefetch Data:		0.00%	0
	  Demand Metadata:		100.00%	50.32m
	  Prefetch Metadata:		0.00%	0

	CACHE MISSES BY DATA TYPE:
	  Demand Data:			93.74%	49.34m
	  Prefetch Data:		0.00%	0
	  Demand Metadata:		6.26%	3.29m
	  Prefetch Metadata:		0.00%	145




ZFS Tunable:
	metaslab_debug_load                               0
	zfs_arc_min_prefetch_lifespan                     0
	zfetch_max_streams                                8
	zfs_nopwrite_enabled                              1
	zfetch_min_sec_reap                               2
	zfs_dbgmsg_enable                                 0
	zfs_dirty_data_max_max_percent                    25
	zfs_arc_p_aggressive_disable                      1
	spa_load_verify_data                              1
	zfs_zevent_cols                                   80
	zfs_dirty_data_max_percent                        10
	zfs_sync_pass_dont_compress                       5
	l2arc_write_max                                   8388608
	zfs_vdev_scrub_max_active                         2
	zfs_vdev_sync_write_min_active                    10
	zvol_prefetch_bytes                               131072
	metaslab_aliquot                                  524288
	zfs_no_scrub_prefetch                             0
	zfs_arc_shrink_shift                              0
	zfetch_block_cap                                  256
	zfs_txg_history                                   0
	zfs_delay_scale                                   500000
	zfs_vdev_async_write_active_min_dirty_percent     30
	metaslab_debug_unload                             0
	zfs_read_history                                  0
	zvol_max_discard_blocks                           16384
	zfs_recover                                       0
	l2arc_headroom                                    2
	zfs_deadman_synctime_ms                           1000000
	zfs_scan_idle                                     50
	zfs_free_min_time_ms                              1000
	zfs_dirty_data_max                                624856268
	zfs_vdev_async_read_min_active                    1
	zfs_mg_noalloc_threshold                          0
	zfs_dedup_prefetch                                0
	zfs_vdev_max_active                               1000
	l2arc_write_boost                                 8388608
	zfs_resilver_min_time_ms                          3000
	zfs_vdev_async_write_max_active                   10
	zil_slog_limit                                    1048576
	zfs_prefetch_disable                              1
	zfs_resilver_delay                                2
	metaslab_lba_weighting_enabled                    1
	zfs_mg_fragmentation_threshold                    85
	l2arc_feed_again                                  1
	zfs_zevent_console                                0
	zfs_immediate_write_sz                            32768
	zfs_dbgmsg_maxsize                                4194304
	zfs_free_leak_on_eio                              0
	zfs_deadman_enabled                               1
	metaslab_bias_enabled                             1
	zfs_arc_p_dampener_disable                        1
	zfs_object_mutex_size                             64
	zfs_metaslab_fragmentation_threshold              70
	zfs_no_scrub_io                                   0
	metaslabs_per_vdev                                200
	zfs_dbuf_state_index                              0
	zfs_vdev_sync_read_min_active                     10
	metaslab_fragmentation_factor_enabled             1
	zvol_inhibit_dev                                  0
	zfs_vdev_async_write_active_max_dirty_percent     60
	zfs_vdev_cache_size                               0
	zfs_vdev_mirror_switch_us                         10000
	zfs_dirty_data_sync                               67108864
	spa_config_path                                   /etc/zfs/zpool.cache
	zfs_dirty_data_max_max                            1562140672
	zfs_arc_lotsfree_percent                          10
	zfs_zevent_len_max                                64
	zfs_scan_min_time_ms                              1000
	zfs_arc_sys_free                                  0
	zfs_arc_meta_strategy                             1
	zfs_vdev_cache_bshift                             16
	zfs_arc_meta_adjust_restarts                      4096
	zfs_max_recordsize                                1048576
	zfs_vdev_scrub_min_active                         1
	zfs_vdev_read_gap_limit                           32768
	zfs_arc_meta_limit                                0
	zfs_vdev_sync_write_max_active                    10
	l2arc_norw                                        0
	zfs_arc_meta_prune                                10000
	metaslab_preload_enabled                          1
	l2arc_nocompress                                  0
	zvol_major                                        230
	zfs_vdev_aggregation_limit                        131072
	zfs_flags                                         0
	spa_asize_inflation                               24
	zfs_admin_snapshot                                0
	l2arc_feed_secs                                   1
	zio_taskq_batch_pct                               75
	zfs_sync_pass_deferred_free                       2
	zfs_disable_dup_eviction                          0
	zfs_arc_grow_retry                                0
	zfs_read_history_hits                             0
	zfs_vdev_async_write_min_active                   1
	zfs_vdev_async_read_max_active                    3
	zfs_scrub_delay                                   4
	zfs_delay_min_dirty_percent                       60
	zfs_free_max_blocks                               100000
	zfs_vdev_cache_max                                16384
	zio_delay_max                                     30000
	zfs_top_maxinflight                               32
	ignore_hole_birth                                 1
	spa_slop_shift                                    5
	zfs_vdev_write_gap_limit                          4096
	spa_load_verify_metadata                          1
	spa_load_verify_maxinflight                       10000
	l2arc_noprefetch                                  1
	zfs_vdev_scheduler                                noop
	zfs_expire_snapshot                               300
	zfs_sync_pass_rewrite                             2
	zil_replay_disable                                0
	zfs_nocacheflush                                  0
	zfs_arc_max                                       2147483648
	zfs_arc_min                                       0
	zfs_read_chunk_size                               1048576
	zfs_txg_timeout                                   5
	zfs_pd_bytes_max                                  52428800
	l2arc_headroom_boost                              200
	zfs_send_corrupt_data                             0
	l2arc_feed_min_ms                                 200
	zfs_arc_meta_min                                  0
	zfs_arc_average_blocksize                         8192
	zfetch_array_rd_sz                                1048576
	zfs_autoimport_disable                            1
	zfs_arc_p_min_shift                               0
	zio_requeue_io_start_cut_in_line                  1
	zfs_vdev_sync_read_max_active                     10
	zfs_mdcomp_disable                                0
	zfs_arc_num_sublists_per_state                    4
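
One tunable in that dump bears directly on the dmu_tx_wait stacks seen throughout this thread: writers get throttled and eventually block in txg_wait_open when dirty data approaches zfs_dirty_data_max (here 624856268 bytes, roughly 596 MiB). As a sketch only, it can be inspected and changed at runtime via the module parameter (raising it trades stall frequency for longer sync times, so treat any change as an experiment):

# cat /sys/module/zfs/parameters/zfs_dirty_data_max
624856268
# echo 1249712536 > /sys/module/zfs/parameters/zfs_dirty_data_max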

Do you need any more information? What do you think about this setup? Is my hardware undersized?

@ioquatix

ioquatix commented Feb 7, 2017

I checked, and the majority of my backups are < 200k files and < 20 GB.

@red-scorp

red-scorp commented Mar 21, 2018

I hope this helps:

user@server ~ $ cat /var/log/syslog | grep zfs
Mar 21 04:01:11 server kernel: [204939.534145] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:01:11 server kernel: [204939.534234] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:01:11 server kernel: [204939.534339] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:01:11 server kernel: [204939.534439] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:01:11 server kernel: [204939.534539] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:01:11 server kernel: [204939.534648] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:01:11 server kernel: [204939.534752] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:03:12 server kernel: [205060.371676] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:03:12 server kernel: [205060.371764] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:03:12 server kernel: [205060.371869] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:03:12 server kernel: [205060.371969] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:03:12 server kernel: [205060.372069] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:03:12 server kernel: [205060.372177] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:03:12 server kernel: [205060.372282] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:05:13 server kernel: [205181.209034] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:05:13 server kernel: [205181.209123] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:05:13 server kernel: [205181.209228] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:05:13 server kernel: [205181.209329] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:05:13 server kernel: [205181.209429] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:05:13 server kernel: [205181.209537] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:05:13 server kernel: [205181.209641] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:07:13 server kernel: [205302.045999] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:07:13 server kernel: [205302.046088] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:07:13 server kernel: [205302.046192] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:07:13 server kernel: [205302.046293] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:07:13 server kernel: [205302.046392] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:07:13 server kernel: [205302.046501] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:07:13 server kernel: [205302.046605] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:09:14 server kernel: [205422.882950] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:09:14 server kernel: [205422.883039] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:09:14 server kernel: [205422.883144] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:09:14 server kernel: [205422.883244] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:09:14 server kernel: [205422.883343] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:09:14 server kernel: [205422.883452] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:09:14 server kernel: [205422.883556] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:11:15 server kernel: [205543.719843] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:11:15 server kernel: [205543.719933] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:11:15 server kernel: [205543.720037] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:11:15 server kernel: [205543.720137] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:11:15 server kernel: [205543.720237] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:11:15 server kernel: [205543.720345] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:11:15 server kernel: [205543.720449] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:13:16 server kernel: [205664.556609] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:13:16 server kernel: [205664.556698] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:13:16 server kernel: [205664.556802] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:13:16 server kernel: [205664.556903] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:13:16 server kernel: [205664.557003] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:13:16 server kernel: [205664.557112] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:13:16 server kernel: [205664.557216] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:15:17 server kernel: [205785.393394] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:15:17 server kernel: [205785.393483] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:15:17 server kernel: [205785.393587] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:15:17 server kernel: [205785.393687] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:15:17 server kernel: [205785.393787] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:15:17 server kernel: [205785.393896] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:15:17 server kernel: [205785.394000] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:17:18 server kernel: [205906.229973] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:17:18 server kernel: [205906.230062] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:17:18 server kernel: [205906.230166] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:17:18 server kernel: [205906.230289] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:17:18 server kernel: [205906.230389] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:17:18 server kernel: [205906.230498] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:17:18 server kernel: [205906.230602] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:19:18 server kernel: [206027.066643] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:19:18 server kernel: [206027.066732] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:19:18 server kernel: [206027.066837] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:19:18 server kernel: [206027.066937] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:19:18 server kernel: [206027.067037] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:19:18 server kernel: [206027.067146] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:19:18 server kernel: [206027.067250] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 07:54:33 server systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Mar 21 07:54:33 server systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.
user@server ~ $ cat /etc/issue
Ubuntu Bionic Beaver (development branch) \n \l

user@server ~ $ uname -a
Linux server 4.15.0-12-generic #13-Ubuntu SMP Thu Mar 8 06:24:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
user@server ~ $ dpkg -l zfs*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==============-============-============-=================================
un zfs (no description available)
un zfs-dkms (no description available)
un zfs-dracut (no description available)
un zfs-fuse (no description available)
un zfs-initramfs (no description available)
un zfs-modules (no description available)
ii zfs-zed 0.7.5-1ubunt amd64 OpenZFS Event Daemon
un zfsutils (no description available)
ii zfsutils-linux 0.7.5-1ubunt amd64 command-line tools to manage Open
user@server ~ $
user@server ~ $ df -h /data
Filesystem Size Used Avail Use% Mounted on
data 14T 11T 3.1T 78% /data

It was an mv process moving data from one ZFS dataset to another in the same pool (it actually behaves like cp, no faster). The process hangs forever (at least a few hours) and cannot be terminated with a signal. zpool reports the pool error-free. I have now started a zpool scrub; let's see what it reports.
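
When it hangs like that, the kernel stack of the stuck process is the most useful thing to attach. A quick sketch for grabbing it (assumes a single mv process and root privileges):

user@server ~ $ sudo cat /proc/$(pgrep -x mv)/stack

Alternatively, echo w > /proc/sysrq-trigger dumps the stacks of all blocked tasks to the kernel log.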

UPDATE:

user@server ~ $ zpool status
pool: data
state: ONLINE
scan: scrub repaired 0B in 18h22m with 0 errors on Thu Mar 22 02:32:52 2018
config:

    NAME        STATE     READ WRITE CKSUM
    data        ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
        sdl     ONLINE       0     0     0
        sdm     ONLINE       0     0     0

errors: No known data errors

Let me know if you need any other info.
With best regards, Ag
