
Swap deadlock in 0.7.9 #7734

Open · runderwo opened this issue Jul 21, 2018 · 122 comments
Labels: Component: Memory Management

Comments

runderwo commented Jul 21, 2018

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  stretch
Linux Kernel          4.16.0-0.bpo.2-amd64
Architecture          amd64
ZFS Version           0.7.9-3~bpo9+1
SPL Version           0.7.9-3~bpo9+1

Describe the problem you're observing

The system deadlocked while forcing a page out to a swap zvol. Unfortunately, I do not have the rest of the backtraces.

Describe how to reproduce the problem

rsync a filesystem with O(1M) files.
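A hypothetical invocation that generates this kind of load (paths are placeholders; any source tree with on the order of a million small files will do):

rsync -a /source/tree/ /rpool/destination/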

Include any warning/errors/backtraces from the system logs

[attached photo imag0272: partial console backtrace from the deadlock]

NAME        PROPERTY               VALUE                  SOURCE
rpool/swap  type                   volume                 -
rpool/swap  creation               Tue Jun 26 17:52 2018  -
rpool/swap  used                   8.50G                  -
rpool/swap  available              8.84T                  -
rpool/swap  referenced             128K                   -
rpool/swap  compressratio          1.09x                  -
rpool/swap  reservation            none                   default
rpool/swap  volsize                8G                     local
rpool/swap  volblocksize           4K                     -
rpool/swap  checksum               on                     default
rpool/swap  compression            zle                    local
rpool/swap  readonly               off                    default
rpool/swap  createtxg              156102                 -
rpool/swap  copies                 1                      default
rpool/swap  refreservation         8.50G                  local
rpool/swap  guid                   11434631848701244988   -
rpool/swap  primarycache           metadata               local
rpool/swap  secondarycache         all                    default
rpool/swap  usedbysnapshots        0B                     -
rpool/swap  usedbydataset          128K                   -
rpool/swap  usedbychildren         0B                     -
rpool/swap  usedbyrefreservation   8.50G                  -
rpool/swap  logbias                throughput             local
rpool/swap  dedup                  off                    default
rpool/swap  mlslabel               none                   default
rpool/swap  sync                   always                 local
rpool/swap  refcompressratio       1.09x                  -
rpool/swap  written                128K                   -
rpool/swap  logicalused            46K                    -
rpool/swap  logicalreferenced      46K                    -
rpool/swap  volmode                default                default
rpool/swap  snapshot_limit         none                   default
rpool/swap  snapshot_count         none                   default
rpool/swap  snapdev                hidden                 default
rpool/swap  context                none                   default
rpool/swap  fscontext              none                   default
rpool/swap  defcontext             none                   default
rpool/swap  rootcontext            none                   default
rpool/swap  redundant_metadata     all                    default
rpool/swap  com.sun:auto-snapshot  false                  local
behlendorf added the Component: Memory Management label on Jul 25, 2018
shartse (Contributor) commented Aug 28, 2018

@jgallag88 and I have been observing this as well. Here's our bug report with a simple reproducer:

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  18.04
Linux Kernel          4.15.0-33-generic
Architecture          amd64
ZFS Version           Master
SPL Version           Master

Describe the problem you're observing

With zvol swap devices, memory-intensive operations hang indefinitely even though plenty of swap space is available. The hang appears to occur in ZIL code while ZFS is performing an allocation for a zio as part of its work to swap out pages.

We also tested the same scenario with an ext4-backed swap device and did not see any hangs.

This is the memory usage on one of our typical VMs:

              total        used        free      shared  buff/cache   available
Mem:           7.3G        3.6G        3.1G         32M        603M        3.4G
Swap:          4.0G          0B        4.0G

Reproducing the problem

Configure a zvol as a swap device

zfs create -V 4G -b "$(getconf PAGESIZE)" -o logbias=throughput -o sync=always  -o primarycache=metadata rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
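
Before running the load, it is worth confirming the zvol is the active swap device, e.g.:

swapon --show   # the zvol shows up under its /dev/zdN name
free -h         # the Swap total should match the volsize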

Run a memory-intensive operation

Run something that will use up all of your memory, for example stress:
stress --vm 10 --vm-keep --vm-bytes 512M
Check that all memory is in use with:
watch free -h
Eventually the system will start to swap, but it will quickly hang and not recover.

Examples

Example of the thread trying to swap out a page:

PID: 1      TASK: ffff8cadf6208000  CPU: 1   COMMAND: "systemd"
 #0 [ffffb10840c671f0] __schedule at ffffffffb0594571
 #1 [ffffb10840c67288] schedule at ffffffffb0594bac
 #2 [ffffb10840c67298] schedule_preempt_disabled at ffffffffb0594e8e
 #3 [ffffb10840c672a8] __mutex_lock at ffffffffb059654c
 #4 [ffffb10840c67338] __mutex_lock_slowpath at ffffffffb05968a3
 #5 [ffffb10840c67350] mutex_lock at ffffffffb05968df
 #6 [ffffb10840c67368] zil_commit_writer at ffffffffc0723d21 [zfs]
 #7 [ffffb10840c673a0] zil_commit_impl at ffffffffc0723e66 [zfs]
 #8 [ffffb10840c673c0] zil_commit at ffffffffc0723ef0 [zfs]
 #9 [ffffb10840c673e8] zvol_write at ffffffffc0746e20 [zfs]
#10 [ffffb10840c67488] zvol_request at ffffffffc0747fba [zfs]
#11 [ffffb10840c674d8] generic_make_request at ffffffffb00534e4
#12 [ffffb10840c67538] submit_bio at ffffffffb0053733
#13 [ffffb10840c67588] __swap_writepage at ffffffffafe26e83
#14 [ffffb10840c67620] swap_writepage at ffffffffafe26f44
#15 [ffffb10840c67648] pageout at ffffffffafde683b
#16 [ffffb10840c676d8] shrink_page_list at ffffffffafde9caa
#17 [ffffb10840c677a0] shrink_inactive_list at ffffffffafdea6f2
#18 [ffffb10840c67858] shrink_node_memcg at ffffffffafdeb1a9
#19 [ffffb10840c67958] shrink_node at ffffffffafdeb6b7
#20 [ffffb10840c679e8] do_try_to_free_pages at ffffffffafdeb989
#21 [ffffb10840c67a40] try_to_free_pages at ffffffffafdebcde
#22 [ffffb10840c67ac8] __alloc_pages_slowpath at ffffffffafdd958e
#23 [ffffb10840c67bd8] __alloc_pages_nodemask at ffffffffafdda203
#24 [ffffb10840c67c40] alloc_pages_current at ffffffffafe385da
#25 [ffffb10840c67c70] __page_cache_alloc at ffffffffafdcc131
#26 [ffffb10840c67c90] filemap_fault at ffffffffafdcf978
#27 [ffffb10840c67d48] __do_fault at ffffffffafe0a864
#28 [ffffb10840c67d70] handle_pte_fault at ffffffffafe0f9e3
#29 [ffffb10840c67dc8] __handle_mm_fault at ffffffffafe10598
#30 [ffffb10840c67e70] handle_mm_fault at ffffffffafe10791
#31 [ffffb10840c67ea8] __do_page_fault at ffffffffafc74990
#32 [ffffb10840c67f20] do_page_fault at ffffffffafc74c3e
#33 [ffffb10840c67f50] page_fault at ffffffffb06015e5
    RIP: 00007fcfe704d69a  RSP: 00007ffe9178ead0  RFLAGS: 00010202
    RAX: 0000000000000001  RBX: 000055ca1d2bd8c0  RCX: 00007fcfe753abb7
    RDX: 0000000000000056  RSI: 00007ffe9178ead0  RDI: 0000000000000000
    RBP: 00007ffe9178efe0   R8: 0000000000000000   R9: 61742e6369736162
    R10: 00000000ffffffff  R11: 0000000000000000  R12: 0000000000000001
    R13: ffffffffffffffff  R14: 00007ffe9178ead0  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

Two places where zios were stuck:

PID: 570    TASK: ffff91c660c32d80  CPU: 0   COMMAND: "z_wr_int"
 #0 [fffffe0000008cc0] machine_kexec at ffffffff896631c3
 #1 [fffffe0000008d20] __crash_kexec at ffffffff8972a479
 #2 [fffffe0000008de8] panic at ffffffff8968c7b0
 #3 [fffffe0000008e70] nmi_panic at ffffffff8968c329
 #4 [fffffe0000008e80] unknown_nmi_error at ffffffff896318e7
 #5 [fffffe0000008ea0] default_do_nmi at ffffffff89631a8e
 #6 [fffffe0000008ec8] do_nmi at ffffffff89631bc9
 #7 [fffffe0000008ef0] end_repeat_nmi at ffffffff8a0019eb
    [exception RIP: putback_inactive_pages+501]
    RIP: ffffffff897ea245  RSP: ffff9fc501e9f658  RFLAGS: 00000003
    RAX: ffff91c677012800  RBX: ffffeb4ac7e52a80  RCX: 0000000000000001
    RDX: ffffeb4ac7380720  RSI: 0000000000000000  RDI: ffff91c677012800
    RBP: ffff9fc501e9f6d8   R8: 0000000000027f20   R9: 0000000000036164
    R10: ffff91c67ffd2000  R11: 00000000000000e5  R12: 0000000000000000
    R13: ffffeb4ac7e52aa0  R14: ffff91c677012800  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #8 [ffff9fc501e9f658] putback_inactive_pages at ffffffff897ea245
 #9 [ffff9fc501e9f6e0] shrink_inactive_list at ffffffff897ea74d
#10 [ffff9fc501e9f798] shrink_node_memcg at ffffffff897eb1a9
#11 [ffff9fc501e9f898] shrink_node at ffffffff897eb6b7
#12 [ffff9fc501e9f930] do_try_to_free_pages at ffffffff897eb989
#13 [ffff9fc501e9f988] try_to_free_pages at ffffffff897ebcde
#14 [ffff9fc501e9fa10] __alloc_pages_slowpath at ffffffff897d958e
#15 [ffff9fc501e9fb20] __alloc_pages_nodemask at ffffffff897da203
#16 [ffff9fc501e9fb88] __alloc_pages at ffffffffc07851db [zfs]
#17 [ffff9fc501e9fb98] __alloc_pages_node at ffffffffc07851f1 [zfs]
#18 [ffff9fc501e9fba8] alloc_pages_node at ffffffffc0785225 [zfs]
#19 [ffff9fc501e9fbc0] abd_alloc_chunk at ffffffffc078523e [zfs]
#20 [ffff9fc501e9fbd0] abd_alloc_pages.constprop.8 at ffffffffc0785e85 [zfs]
#21 [ffff9fc501e9fc58] abd_alloc at ffffffffc07862f5 [zfs]
#22 [ffff9fc501e9fc80] abd_alloc_for_io at ffffffffc078647e [zfs]
#23 [ffff9fc501e9fc90] vdev_queue_aggregate at ffffffffc086187a [zfs]
#24 [ffff9fc501e9fd00] vdev_queue_io_to_issue at ffffffffc0861e11 [zfs]
#25 [ffff9fc501e9fd48] vdev_queue_io_done at ffffffffc0862582 [zfs]
#26 [ffff9fc501e9fd88] zio_vdev_io_done at ffffffffc08c61cf [zfs]
#27 [ffff9fc501e9fdb0] zio_execute at ffffffffc08c6c2c [zfs]
#28 [ffff9fc501e9fdf8] taskq_thread at ffffffffc065254d [spl]
#29 [ffff9fc501e9ff08] kthread at ffffffff896ae531
#30 [ffff9fc501e9ff50] ret_from_fork at ffffffff8a000205

And the next place:

PID: 567    TASK: ffff97afe0e05b00  CPU: 0   COMMAND: "z_wr_iss"
 #0 [fffffe0000008cc0] machine_kexec at ffffffffa64631c3
 #1 [fffffe0000008d20] __crash_kexec at ffffffffa652a479
 #2 [fffffe0000008de8] panic at ffffffffa648c7b0
 #3 [fffffe0000008e70] nmi_panic at ffffffffa648c329
 #4 [fffffe0000008e80] unknown_nmi_error at ffffffffa64318e7
 #5 [fffffe0000008ea0] default_do_nmi at ffffffffa6431a8e
 #6 [fffffe0000008ec8] do_nmi at ffffffffa6431bc9
 #7 [fffffe0000008ef0] end_repeat_nmi at ffffffffa6e019eb
    [exception RIP: page_referenced_one]
    RIP: ffffffffa661e5f0  RSP: ffffa65681e5f3a8  RFLAGS: 00000246
    RAX: ffffffffa661e5f0  RBX: 00000000ffc49000  RCX: ffffa65681e5f418
    RDX: 00000000ffc49000  RSI: ffff97afc231cea0  RDI: ffffdde407197dc0
    RBP: ffffa65681e5f3f8   R8: 0000000000080000   R9: 0000000000000000
    R10: 0000000000000001  R11: 0000000000000004  R12: ffffdde407197dc0
    R13: ffff97afc231cea0  R14: ffffa65681e5f430  R15: ffff97afc104dd80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
 #8 [ffffa65681e5f3a8] page_referenced_one at ffffffffa661e5f0
 #9 [ffffa65681e5f3a8] rmap_walk_anon at ffffffffa661e47a
#10 [ffffa65681e5f400] rmap_walk at ffffffffa66203c8
#11 [ffffa65681e5f410] page_referenced at ffffffffa66204bc
#12 [ffffa65681e5f488] shrink_page_list at ffffffffa65e9960
#13 [ffffa65681e5f550] shrink_inactive_list at ffffffffa65ea6f2
#14 [ffffa65681e5f608] shrink_node_memcg at ffffffffa65eb1a9
#15 [ffffa65681e5f708] shrink_node at ffffffffa65eb6b7
#16 [ffffa65681e5f7a0] do_try_to_free_pages at ffffffffa65eb989
#17 [ffffa65681e5f7f8] try_to_free_pages at ffffffffa65ebcde
#18 [ffffa65681e5f880] __alloc_pages_slowpath at ffffffffa65d958e
#19 [ffffa65681e5f990] __alloc_pages_nodemask at ffffffffa65da203
#20 [ffffa65681e5f9f8] alloc_pages_current at ffffffffa66385da
#21 [ffffa65681e5fa28] new_slab at ffffffffa6645884
#22 [ffffa65681e5fa98] ___slab_alloc at ffffffffa66463b7
#23 [ffffa65681e5fb58] __slab_alloc at ffffffffa6646530
#24 [ffffa65681e5fb80] kmem_cache_alloc at ffffffffa66466cb
#25 [ffffa65681e5fbc0] spl_kmem_cache_alloc at ffffffffc06d6a3b [spl]
#26 [ffffa65681e5fbf8] zio_create at ffffffffc0960601 [zfs]
#27 [ffffa65681e5fc50] zio_vdev_child_io at ffffffffc0961fcc [zfs]
#28 [ffffa65681e5fcf8] vdev_mirror_io_start at ffffffffc08f6b4e [zfs]
#29 [ffffa65681e5fd70] zio_vdev_io_start at ffffffffc0966ac7 [zfs]
#30 [ffffa65681e5fdb0] zio_execute at ffffffffc095cc2c [zfs]
#31 [ffffa65681e5fdf8] taskq_thread at ffffffffc06db54d [spl]
#32 [ffffa65681e5ff08] kthread at ffffffffa64ae531
#33 [ffffa65681e5ff50] ret_from_fork at ffffffffa6e00205

siv0 (Contributor) commented Sep 4, 2018

Can reproduce this as well on:

System information

Type                  Version/Name
Distribution Name     Proxmox VE (based on Debian)
Distribution Version  5.2 / stretch
Linux Kernel          4.15.18-4-pve (based on ubuntu-bionic)
Architecture          amd64
ZFS Version           0.7.9-pve3~bpo9
SPL Version           0.7.9-1

The swap zvol was created as above, with -o compression=zle -o secondarycache=none as additional parameters.
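Presumably, the full creation command was along these lines (reconstructed from the reproducer above plus those flags, not copied verbatim):

zfs create -V 4G -b "$(getconf PAGESIZE)" -o logbias=throughput -o sync=always -o primarycache=metadata -o compression=zle -o secondarycache=none rpool/swap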
The system is a small QEMU machine; using a QEMU disk without ZFS as swap does not lock up the machine.

I could reproduce the issue on plain Debian stretch with kernel 4.9.110-3+deb9u4 and zfs 0.7.9 (from stretch-backports).
The problem does not occur on plain Debian stretch with kernel 4.9.110-3+deb9u4 and zfs 0.6.5.9.

Please let me know if/how I can help hunt this down and fix it.

Console output with hung tasks on 4.15.18-4 (ZFS 0.7.9):

[  846.823830] INFO: task systemd:1 blocked for more than 120 seconds.
[  846.824978]       Tainted: P           O     4.15.18-4-pve #1
[  846.825888] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.826774] systemd         D    0     1      0 0x00000000
[  846.827409] Call Trace:
[  846.827722]  __schedule+0x3e0/0x870
[  846.828145]  schedule+0x36/0x80
[  846.828541]  cv_wait_common+0x11e/0x140 [spl]
[  846.829088]  ? wait_woken+0x80/0x80
[  846.829477]  __cv_wait+0x15/0x20 [spl]
[  846.829968]  txg_wait_open+0xb0/0x100 [zfs]
[  846.830455]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.830992]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.831485]  zvol_write+0x175/0x620 [zfs]
[  846.831994]  zvol_request+0x24a/0x300 [zfs]
[  846.832513]  ? SyS_madvise+0xa20/0xa20
[  846.833124]  generic_make_request+0x123/0x2f0
[  846.833908]  submit_bio+0x73/0x140
[  846.834491]  ? submit_bio+0x73/0x140
[  846.835098]  ? get_swap_bio+0xcf/0x100
[  846.835780]  __swap_writepage+0x345/0x3b0
[  846.836463]  ? __frontswap_store+0x73/0x100
[  846.837190]  swap_writepage+0x34/0x90
[  846.838503]  pageout.isra.53+0x1e5/0x330
[  846.839874]  shrink_page_list+0x955/0xb70
[  846.841326]  shrink_inactive_list+0x256/0x5e0
[  846.842768]  ? next_arg+0x80/0x110
[  846.843851]  shrink_node_memcg+0x365/0x780
[  846.845052]  shrink_node+0xe1/0x310
[  846.846113]  ? shrink_node+0xe1/0x310
[  846.847153]  do_try_to_free_pages+0xef/0x360
[  846.848258]  try_to_free_pages+0xf2/0x1b0
[  846.849392]  __alloc_pages_slowpath+0x401/0xf10
[  846.850513]  ? __page_cache_alloc+0x86/0x90
[  846.851612]  __alloc_pages_nodemask+0x25b/0x280
[  846.852784]  alloc_pages_current+0x6a/0xe0
[  846.853903]  __page_cache_alloc+0x86/0x90
[  846.855011]  filemap_fault+0x369/0x740
[  846.856057]  ? page_add_file_rmap+0xf7/0x150
[  846.857482]  ? filemap_map_pages+0x369/0x380
[  846.858618]  ext4_filemap_fault+0x31/0x44
[  846.859708]  __do_fault+0x24/0xe3
[  846.860751]  __handle_mm_fault+0xcd7/0x11e0
[  846.861842]  ? ep_read_events_proc+0xd0/0xd0
[  846.863129]  handle_mm_fault+0xce/0x1b0
[  846.864175]  __do_page_fault+0x25e/0x500
[  846.865251]  ? wake_up_q+0x80/0x80
[  846.866175]  do_page_fault+0x2e/0xe0
[  846.867108]  ? async_page_fault+0x2f/0x50
[  846.868080]  do_async_page_fault+0x1a/0x80
[  846.869078]  async_page_fault+0x45/0x50
[  846.869985] RIP: 0033:0x7f551d2d30a3
[  846.870874] RSP: 002b:00007ffdbd747e28 EFLAGS: 00010246
[  846.872075] RAX: 0000000000000001 RBX: 0000558ebcbf3120 RCX: 00007f551d2d30a3
[  846.873473] RDX: 0000000000000041 RSI: 00007ffdbd747e30 RDI: 0000000000000004
[  846.874811] RBP: 00007ffdbd748240 R08: 431bde82d7b634db R09: 0000000000000100
[  846.876098] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ffdbd747e30
[  846.877594] R13: 0000000000000001 R14: ffffffffffffffff R15: 0000000000000003
[  846.878882] INFO: task kswapd0:34 blocked for more than 120 seconds.
[  846.880099]       Tainted: P           O     4.15.18-4-pve #1
[  846.881379] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.882873] kswapd0         D    0    34      2 0x80000000
[  846.883989] Call Trace:
[  846.884834]  __schedule+0x3e0/0x870
[  846.885771]  schedule+0x36/0x80
[  846.886722]  cv_wait_common+0x11e/0x140 [spl]
[  846.887853]  ? wait_woken+0x80/0x80
[  846.888889]  __cv_wait+0x15/0x20 [spl]
[  846.889927]  txg_wait_open+0xb0/0x100 [zfs]
[  846.891027]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.892132]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.893185]  zvol_write+0x175/0x620 [zfs]
[  846.894136]  ? avl_add+0x74/0xa0 [zavl]
[  846.895112]  zvol_request+0x24a/0x300 [zfs]
[  846.896124]  ? SyS_madvise+0xa20/0xa20
[  846.897118]  generic_make_request+0x123/0x2f0
[  846.898189]  submit_bio+0x73/0x140
[  846.899171]  ? submit_bio+0x73/0x140
[  846.900115]  ? get_swap_bio+0xcf/0x100
[  846.901183]  __swap_writepage+0x345/0x3b0
[  846.902212]  ? __frontswap_store+0x73/0x100
[  846.903221]  swap_writepage+0x34/0x90
[  846.904185]  pageout.isra.53+0x1e5/0x330
[  846.905207]  shrink_page_list+0x955/0xb70
[  846.906238]  shrink_inactive_list+0x256/0x5e0
[  846.907257]  ? __wake_up+0x13/0x20
[  846.908211]  ? __cv_signal+0x2d/0x40 [spl]
[  846.909286]  ? next_arg+0x80/0x110
[  846.910237]  shrink_node_memcg+0x365/0x780
[  846.911275]  shrink_node+0xe1/0x310
[  846.912253]  ? shrink_node+0xe1/0x310
[  846.913273]  kswapd+0x386/0x770
[  846.914213]  kthread+0x105/0x140
[  846.915168]  ? mem_cgroup_shrink_node+0x180/0x180
[  846.916326]  ? kthread_create_worker_on_cpu+0x70/0x70
[  846.917519]  ret_from_fork+0x35/0x40
[  846.918631] INFO: task jbd2/sda1-8:196 blocked for more than 120 seconds.
[  846.920121]       Tainted: P           O     4.15.18-4-pve #1
[  846.921514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.923138] jbd2/sda1-8     D    0   196      2 0x80000000
[  846.924465] Call Trace:
[  846.925372]  __schedule+0x3e0/0x870
[  846.926425]  schedule+0x36/0x80
[  846.927439]  cv_wait_common+0x11e/0x140 [spl]
[  846.928567]  ? wait_woken+0x80/0x80
[  846.929635]  __cv_wait+0x15/0x20 [spl]
[  846.930714]  txg_wait_open+0xb0/0x100 [zfs]
[  846.931826]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.932937]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.934063]  zvol_write+0x175/0x620 [zfs]
[  846.935090]  ? avl_find+0x5f/0xa0 [zavl]
[  846.936147]  zvol_request+0x24a/0x300 [zfs]
[  846.937292]  ? SyS_madvise+0xa20/0xa20
[  846.938274]  generic_make_request+0x123/0x2f0
[  846.939445]  submit_bio+0x73/0x140
[  846.940395]  ? submit_bio+0x73/0x140
[  846.941384]  ? get_swap_bio+0xcf/0x100
[  846.942427]  __swap_writepage+0x345/0x3b0
[  846.943393]  ? __frontswap_store+0x73/0x100
[  846.944395]  swap_writepage+0x34/0x90
[  846.945419]  pageout.isra.53+0x1e5/0x330
[  846.946424]  shrink_page_list+0x955/0xb70
[  846.947423]  shrink_inactive_list+0x256/0x5e0
[  846.948465]  ? next_arg+0x80/0x110
[  846.949443]  shrink_node_memcg+0x365/0x780
[  846.950518]  ? _cond_resched+0x1a/0x50
[  846.951533]  shrink_node+0xe1/0x310
[  846.952469]  ? shrink_node+0xe1/0x310
[  846.953446]  do_try_to_free_pages+0xef/0x360
[  846.954449]  try_to_free_pages+0xf2/0x1b0
[  846.955670]  __alloc_pages_slowpath+0x401/0xf10
[  846.956738]  ? ext4_map_blocks+0x436/0x5d0
[  846.957745]  ? __switch_to_asm+0x34/0x70
[  846.958753]  __alloc_pages_nodemask+0x25b/0x280
[  846.959807]  alloc_pages_current+0x6a/0xe0
[  846.960860]  __page_cache_alloc+0x86/0x90
[  846.961867]  pagecache_get_page+0xab/0x2b0
[  846.962968]  __getblk_gfp+0x109/0x300
[  846.964009]  jbd2_journal_get_descriptor_buffer+0x5e/0xe0
[  846.965220]  journal_submit_commit_record+0x84/0x200
[  846.966361]  ? wait_woken+0x80/0x80
[  846.967350]  jbd2_journal_commit_transaction+0x1299/0x1720
[  846.968574]  ? __switch_to_asm+0x34/0x70
[  846.969665]  ? __switch_to_asm+0x40/0x70
[  846.970672]  ? finish_task_switch+0x74/0x200
[  846.971787]  kjournald2+0xc8/0x260
[  846.972772]  ? kjournald2+0xc8/0x260
[  846.973770]  ? wait_woken+0x80/0x80
[  846.974696]  kthread+0x105/0x140
[  846.975568]  ? commit_timeout+0x20/0x20
[  846.976555]  ? kthread_create_worker_on_cpu+0x70/0x70
[  846.977751]  ? do_syscall_64+0x73/0x130
[  846.978746]  ? SyS_exit_group+0x14/0x20
[  846.979736]  ret_from_fork+0x35/0x40
[  846.980747] INFO: task systemd-journal:236 blocked for more than 120 seconds.
[  846.982049]       Tainted: P           O     4.15.18-4-pve #1
[  846.983182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  846.984707] systemd-journal D    0   236      1 0x00000120
[  846.986238] Call Trace:
[  846.987172]  __schedule+0x3e0/0x870
[  846.988109]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[  846.989188]  schedule+0x36/0x80
[  846.990073]  cv_wait_common+0x11e/0x140 [spl]
[  846.991083]  ? wait_woken+0x80/0x80
[  846.992044]  __cv_wait+0x15/0x20 [spl]
[  846.993046]  txg_wait_open+0xb0/0x100 [zfs]
[  846.994018]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  846.994985]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  846.996196]  zvol_write+0x175/0x620 [zfs]
[  846.997469]  zvol_request+0x24a/0x300 [zfs]
[  846.998562]  ? SyS_madvise+0xa20/0xa20
[  846.999542]  generic_make_request+0x123/0x2f0
[  847.000637]  submit_bio+0x73/0x140
[  847.001711]  ? submit_bio+0x73/0x140
[  847.002674]  ? get_swap_bio+0xcf/0x100
[  847.003655]  __swap_writepage+0x345/0x3b0
[  847.004788]  ? __frontswap_store+0x73/0x100
[  847.005816]  swap_writepage+0x34/0x90
[  847.006733]  pageout.isra.53+0x1e5/0x330
[  847.007692]  shrink_page_list+0x955/0xb70
[  847.008845]  shrink_inactive_list+0x256/0x5e0
[  847.010100]  ? next_arg+0x80/0x110
[  847.011214]  shrink_node_memcg+0x365/0x780
[  847.012477]  shrink_node+0xe1/0x310
[  847.013530]  ? shrink_node+0xe1/0x310
[  847.014567]  do_try_to_free_pages+0xef/0x360
[  847.015686]  try_to_free_pages+0xf2/0x1b0
[  847.016894]  __alloc_pages_slowpath+0x401/0xf10
[  847.017958]  ? __page_cache_alloc+0x86/0x90
[  847.018954]  __alloc_pages_nodemask+0x25b/0x280
[  847.020080]  alloc_pages_current+0x6a/0xe0
[  847.021240]  __page_cache_alloc+0x86/0x90
[  847.022334]  filemap_fault+0x369/0x740
[  847.023299]  ? page_add_file_rmap+0xf7/0x150
[  847.024324]  ? filemap_map_pages+0x369/0x380
[  847.025381]  ext4_filemap_fault+0x31/0x44
[  847.026388]  __do_fault+0x24/0xe3
[  847.027337]  __handle_mm_fault+0xcd7/0x11e0
[  847.028382]  ? ep_read_events_proc+0xd0/0xd0
[  847.029472]  handle_mm_fault+0xce/0x1b0
[  847.030504]  __do_page_fault+0x25e/0x500
[  847.031518]  ? wake_up_q+0x80/0x80
[  847.032505]  do_page_fault+0x2e/0xe0
[  847.033544]  ? async_page_fault+0x2f/0x50
[  847.034608]  do_async_page_fault+0x1a/0x80
[  847.035740]  async_page_fault+0x45/0x50
[  847.036859] RIP: 0033:0x7f5e14a430c3
[  847.037888] RSP: 002b:00007ffde5fb7fd0 EFLAGS: 00010293
[  847.039163] RAX: 0000000000000001 RBX: 000056164a08f1e0 RCX: 00007f5e14a430c3
[  847.040736] RDX: 000000000000001c RSI: 00007ffde5fb7fe0 RDI: 0000000000000008
[  847.042378] RBP: 00007ffde5fb8230 R08: 431bde82d7b634db R09: 000000f966b2e318
[  847.043991] R10: 00000000ffffffff R11: 0000000000000293 R12: 00007ffde5fb7fe0
[  847.045640] R13: 0000000000000001 R14: ffffffffffffffff R15: 0005750c02d60440
[  847.047266] INFO: task systemd-udevd:255 blocked for more than 120 seconds.
[  847.048893]       Tainted: P           O     4.15.18-4-pve #1
[  847.050532] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.052367] systemd-udevd   D    0   255      1 0x00000120
[  847.053924] Call Trace:
[  847.054897]  __schedule+0x3e0/0x870
[  847.056031]  schedule+0x36/0x80
[  847.057129]  cv_wait_common+0x11e/0x140 [spl]
[  847.058391]  ? wait_woken+0x80/0x80
[  847.059493]  __cv_wait+0x15/0x20 [spl]
[  847.060683]  txg_wait_open+0xb0/0x100 [zfs]
[  847.061979]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.063203]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.064447]  ? zvol_setup_zv+0x1b0/0x1b0 [zfs]
[  847.065775]  dmu_sync_late_arrival+0x53/0x150 [zfs]
[  847.067099]  ? dmu_write_policy+0xca/0x350 [zfs]
[  847.068394]  dmu_sync+0x3b4/0x490 [zfs]
[  847.069756]  ? dbuf_hold+0x33/0x60 [zfs]
[  847.070998]  ? zvol_setup_zv+0x1b0/0x1b0 [zfs]
[  847.072266]  zvol_get_data+0x164/0x180 [zfs]
[  847.073532]  zil_commit.part.14+0x451/0x8b0 [zfs]
[  847.074855]  zil_commit+0x17/0x20 [zfs]
[  847.076040]  zvol_write+0x5a2/0x620 [zfs]
[  847.077285]  zvol_request+0x24a/0x300 [zfs]
[  847.078528]  ? SyS_madvise+0xa20/0xa20
[  847.079677]  generic_make_request+0x123/0x2f0
[  847.080963]  submit_bio+0x73/0x140
[  847.082058]  ? submit_bio+0x73/0x140
[  847.083093]  ? get_swap_bio+0xcf/0x100
[  847.084135]  __swap_writepage+0x345/0x3b0
[  847.085284]  ? __frontswap_store+0x73/0x100
[  847.086424]  swap_writepage+0x34/0x90
[  847.087471]  pageout.isra.53+0x1e5/0x330
[  847.088565]  shrink_page_list+0x955/0xb70
[  847.089688]  shrink_inactive_list+0x256/0x5e0
[  847.090715]  ? next_arg+0x80/0x110
[  847.091670]  shrink_node_memcg+0x365/0x780
[  847.092808]  shrink_node+0xe1/0x310
[  847.093789]  ? shrink_node+0xe1/0x310
[  847.094739]  do_try_to_free_pages+0xef/0x360
[  847.095766]  try_to_free_pages+0xf2/0x1b0
[  847.096834]  __alloc_pages_slowpath+0x401/0xf10
[  847.097863]  ? security_file_open+0x90/0xa0
[  847.098852]  ? terminate_walk+0x91/0xf0
[  847.099919]  __alloc_pages_nodemask+0x25b/0x280
[  847.101087]  alloc_pages_current+0x6a/0xe0
[  847.102101]  __get_free_pages+0xe/0x30
[  847.103044]  pgd_alloc+0x1e/0x170
[  847.103941]  mm_init+0x197/0x280
[  847.104935]  copy_process.part.35+0xa50/0x1ab0
[  847.106029]  ? __seccomp_filter+0x49/0x540
[  847.107024]  ? _raw_spin_unlock_bh+0x1e/0x20
[  847.108080]  _do_fork+0xdf/0x3f0
[  847.109080]  ? __secure_computing+0x3f/0x100
[  847.110242]  ? syscall_trace_enter+0xca/0x2e0
[  847.111348]  SyS_clone+0x19/0x20
[  847.112435]  do_syscall_64+0x73/0x130
[  847.113606]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  847.114968] RIP: 0033:0x7f7ae276a38b
[  847.116137] RSP: 002b:00007ffea1b4b950 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[  847.117728] RAX: ffffffffffffffda RBX: 00007ffea1b4b950 RCX: 00007f7ae276a38b
[  847.119429] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  847.121162] RBP: 00007ffea1b4b9a0 R08: 00007f7ae39148c0 R09: 0000000000000210
[  847.122879] R10: 00007f7ae3914b90 R11: 0000000000000246 R12: 0000000000000000
[  847.124640] R13: 0000000000000020 R14: 0000000000000000 R15: 0000000000000000
[  847.126237] INFO: task systemd-timesyn:2623 blocked for more than 120 seconds.
[  847.127856]       Tainted: P           O     4.15.18-4-pve #1
[  847.129293] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.130993] systemd-timesyn D    0  2623      1 0x00000120
[  847.132390] Call Trace:
[  847.133404]  __schedule+0x3e0/0x870
[  847.134513]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[  847.135783]  schedule+0x36/0x80
[  847.136876]  cv_wait_common+0x11e/0x140 [spl]
[  847.138128]  ? wait_woken+0x80/0x80
[  847.139215]  __cv_wait+0x15/0x20 [spl]
[  847.140503]  txg_wait_open+0xb0/0x100 [zfs]
[  847.141783]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.142980]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.144255]  zvol_write+0x175/0x620 [zfs]
[  847.145531]  ? avl_find+0x5f/0xa0 [zavl]
[  847.146818]  zvol_request+0x24a/0x300 [zfs]
[  847.166365]  ? SyS_madvise+0xa20/0xa20
[  847.167706]  generic_make_request+0x123/0x2f0
[  847.169118]  submit_bio+0x73/0x140
[  847.170263]  ? submit_bio+0x73/0x140
[  847.171385]  ? get_swap_bio+0xcf/0x100
[  847.172468]  __swap_writepage+0x345/0x3b0
[  847.173624]  ? __frontswap_store+0x73/0x100
[  847.174690]  swap_writepage+0x34/0x90
[  847.175697]  pageout.isra.53+0x1e5/0x330
[  847.176840]  shrink_page_list+0x955/0xb70
[  847.177948]  shrink_inactive_list+0x256/0x5e0
[  847.179101]  ? next_arg+0x80/0x110
[  847.180086]  shrink_node_memcg+0x365/0x780
[  847.181168]  shrink_node+0xe1/0x310
[  847.182126]  ? shrink_node+0xe1/0x310
[  847.183114]  do_try_to_free_pages+0xef/0x360
[  847.184216]  try_to_free_pages+0xf2/0x1b0
[  847.185362]  __alloc_pages_slowpath+0x401/0xf10
[  847.186524]  ? dev_queue_xmit+0x10/0x20
[  847.187576]  ? br_dev_queue_push_xmit+0x7a/0x140
[  847.188691]  __alloc_pages_nodemask+0x25b/0x280
[  847.189891]  alloc_pages_vma+0x88/0x1c0
[  847.190975]  __read_swap_cache_async+0x147/0x200
[  847.192166]  read_swap_cache_async+0x2b/0x60
[  847.193337]  swapin_readahead+0x22f/0x2b0
[  847.194369]  ? radix_tree_lookup_slot+0x22/0x50
[  847.195408]  ? find_get_entry+0x1e/0x100
[  847.196396]  ? pagecache_get_page+0x2c/0x2b0
[  847.197482]  do_swap_page+0x52d/0x9b0
[  847.198517]  ? do_swap_page+0x52d/0x9b0
[  847.199527]  ? crypto_shash_update+0x47/0x130
[  847.200539]  __handle_mm_fault+0x88d/0x11e0
[  847.201564]  ? jbd2_journal_dirty_metadata+0x22d/0x290
[  847.202735]  handle_mm_fault+0xce/0x1b0
[  847.203761]  __do_page_fault+0x25e/0x500
[  847.204766]  ? __switch_to_asm+0x34/0x70
[  847.205728]  do_page_fault+0x2e/0xe0
[  847.206648]  do_async_page_fault+0x1a/0x80
[  847.207715]  async_page_fault+0x25/0x50
[  847.208725] RIP: 0010:ep_send_events_proc+0x120/0x1a0
[  847.209764] RSP: 0000:ffffad16810c3d80 EFLAGS: 00010246
[  847.210821] RAX: 0000000000000001 RBX: ffffad16810c3df8 RCX: 00007fffc09eaec0
[  847.212099] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[  847.213504] RBP: ffffad16810c3dd8 R08: 00000000000002c6 R09: ffff9d25fb9f5f98
[  847.214822] R10: ffffad16810c3cc0 R11: 0000000000000038 R12: 0000000000000000
[  847.216193] R13: ffffad16810c3e78 R14: ffff9d25fb9b89c0 R15: ffff9d25fb9f5f98
[  847.217572]  ? ep_read_events_proc+0xd0/0xd0
[  847.218601]  ep_scan_ready_list.constprop.18+0x9e/0x210
[  847.219743]  ep_poll+0x1f8/0x3b0
[  847.220735]  ? wake_up_q+0x80/0x80
[  847.221659]  SyS_epoll_wait+0xce/0xf0
[  847.222683]  do_syscall_64+0x73/0x130
[  847.223681]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  847.224859] RIP: 0033:0x7fe0d40920c3
[  847.225852] RSP: 002b:00007fffc09eaeb0 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[  847.227252] RAX: ffffffffffffffda RBX: 000056492a0138f0 RCX: 00007fe0d40920c3
[  847.228575] RDX: 0000000000000006 RSI: 00007fffc09eaec0 RDI: 0000000000000004
[  847.229952] RBP: 00007fffc09eb010 R08: 431bde82d7b634db R09: 00000000ffffffff
[  847.231397] R10: 00000000ffffffff R11: 0000000000000293 R12: 00007fffc09eaec0
[  847.232823] R13: 0000000000000001 R14: ffffffffffffffff R15: 0000000000000000
[  847.234199] INFO: task rpcbind:2624 blocked for more than 120 seconds.
[  847.235669]       Tainted: P           O     4.15.18-4-pve #1
[  847.237042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.242089] rpcbind         D    0  2624      1 0x00000000
[  847.244049] Call Trace:
[  847.245324]  __schedule+0x3e0/0x870
[  847.246883]  schedule+0x36/0x80
[  847.247957]  cv_wait_common+0x11e/0x140 [spl]
[  847.249063]  ? wait_woken+0x80/0x80
[  847.250169]  __cv_wait+0x15/0x20 [spl]
[  847.251423]  txg_wait_open+0xb0/0x100 [zfs]
[  847.252706]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.254264]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.255518]  zvol_write+0x175/0x620 [zfs]
[  847.256706]  zvol_request+0x24a/0x300 [zfs]
[  847.258092]  ? SyS_madvise+0xa20/0xa20
[  847.259682]  generic_make_request+0x123/0x2f0
[  847.261515]  submit_bio+0x73/0x140
[  847.263119]  ? submit_bio+0x73/0x140
[  847.264676]  ? get_swap_bio+0xcf/0x100
[  847.266047]  __swap_writepage+0x345/0x3b0
[  847.267622]  ? __frontswap_store+0x73/0x100
[  847.269369]  swap_writepage+0x34/0x90
[  847.270991]  pageout.isra.53+0x1e5/0x330
[  847.272710]  shrink_page_list+0x955/0xb70
[  847.274195]  shrink_inactive_list+0x256/0x5e0
[  847.275618]  ? next_arg+0x80/0x110
[  847.276909]  shrink_node_memcg+0x365/0x780
[  847.278354]  shrink_node+0xe1/0x310
[  847.279636]  ? shrink_node+0xe1/0x310
[  847.280971]  do_try_to_free_pages+0xef/0x360
[  847.282448]  try_to_free_pages+0xf2/0x1b0
[  847.283825]  __alloc_pages_slowpath+0x401/0xf10
[  847.285329]  ? __hrtimer_init+0xa0/0xa0
[  847.286574]  ? poll_freewait+0x4a/0xb0
[  847.287577]  __alloc_pages_nodemask+0x25b/0x280
[  847.288654]  alloc_pages_current+0x6a/0xe0
[  847.289718]  __page_cache_alloc+0x86/0x90
[  847.290707]  __do_page_cache_readahead+0x10e/0x2d0
[  847.291834]  ? radix_tree_lookup_slot+0x22/0x50
[  847.293072]  ? __intel_pmu_enable_all.constprop.19+0x4d/0x80
[  847.294357]  ? find_get_entry+0x1e/0x100
[  847.295449]  filemap_fault+0x571/0x740
[  847.296492]  ? filemap_fault+0x571/0x740
[  847.297612]  ? filemap_map_pages+0x180/0x380
[  847.298682]  ext4_filemap_fault+0x31/0x44
[  847.299784]  __do_fault+0x24/0xe3
[  847.300829]  __handle_mm_fault+0xcd7/0x11e0
[  847.301910]  handle_mm_fault+0xce/0x1b0
[  847.302926]  __do_page_fault+0x25e/0x500
[  847.303946]  ? ktime_get_ts64+0x51/0xf0
[  847.305059]  do_page_fault+0x2e/0xe0
[  847.306133]  ? async_page_fault+0x2f/0x50
[  847.307158]  do_async_page_fault+0x1a/0x80
[  847.308292]  async_page_fault+0x45/0x50
[  847.309415] RIP: 0033:0x7fc96419d660
[  847.310441] RSP: 002b:00007fff1212c788 EFLAGS: 00010246
[  847.311917] RAX: 0000000000000000 RBX: 0000000000000007 RCX: 00007fc96419d660
[  847.313598] RDX: 0000000000007530 RSI: 0000000000000007 RDI: 00007fff1212c9d0
[  847.315185] RBP: 0000000000000007 R08: 00000000000000c3 R09: 000000000000000b
[  847.316666] R10: 00007fff1212c700 R11: 0000000000000246 R12: 000000000000000c
[  847.318208] R13: 000000000000000c R14: 0000558af8c4d480 R15: 00007fff1212c810
[  847.319523] INFO: task systemd-logind:2759 blocked for more than 120 seconds.
[  847.320812]       Tainted: P           O     4.15.18-4-pve #1
[  847.322042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.323441] systemd-logind  D    0  2759      1 0x00000120
[  847.324575] Call Trace:
[  847.325493]  __schedule+0x3e0/0x870
[  847.326482]  schedule+0x36/0x80
[  847.327482]  cv_wait_common+0x11e/0x140 [spl]
[  847.328683]  ? wait_woken+0x80/0x80
[  847.329778]  __cv_wait+0x15/0x20 [spl]
[  847.330970]  txg_wait_open+0xb0/0x100 [zfs]
[  847.332121]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.333402]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.334552]  zvol_write+0x175/0x620 [zfs]
[  847.335630]  zvol_request+0x24a/0x300 [zfs]
[  847.336841]  ? SyS_madvise+0xa20/0xa20
[  847.337971]  generic_make_request+0x123/0x2f0
[  847.339185]  submit_bio+0x73/0x140
[  847.340308]  ? submit_bio+0x73/0x140
[  847.341486]  ? get_swap_bio+0xcf/0x100
[  847.342568]  __swap_writepage+0x345/0x3b0
[  847.343733]  ? __frontswap_store+0x73/0x100
[  847.344923]  swap_writepage+0x34/0x90
[  847.345982]  pageout.isra.53+0x1e5/0x330
[  847.347115]  shrink_page_list+0x955/0xb70
[  847.348290]  shrink_inactive_list+0x256/0x5e0
[  847.349570]  ? virtqueue_add_sgs+0x3b0/0x490
[  847.350704]  ? next_arg+0x80/0x110
[  847.351733]  shrink_node_memcg+0x365/0x780
[  847.352902]  shrink_node+0xe1/0x310
[  847.353928]  ? shrink_node+0xe1/0x310
[  847.355200]  do_try_to_free_pages+0xef/0x360
[  847.356538]  try_to_free_pages+0xf2/0x1b0
[  847.357638]  __alloc_pages_slowpath+0x401/0xf10
[  847.358760]  ? __update_load_avg_se.isra.38+0x1bc/0x1d0
[  847.360134]  ? __switch_to_asm+0x40/0x70
[  847.361353]  ? __switch_to_asm+0x34/0x70
[  847.362495]  ? __switch_to_asm+0x34/0x70
[  847.363611]  ? __switch_to_asm+0x40/0x70
[  847.364668]  __alloc_pages_nodemask+0x25b/0x280
[  847.365826]  alloc_pages_vma+0x88/0x1c0
[  847.366881]  __read_swap_cache_async+0x147/0x200
[  847.368331]  read_swap_cache_async+0x2b/0x60
[  847.369577]  swapin_readahead+0x22f/0x2b0
[  847.370702]  ? radix_tree_lookup_slot+0x22/0x50
[  847.371913]  ? find_get_entry+0x1e/0x100
[  847.373118]  ? pagecache_get_page+0x2c/0x2b0
[  847.374463]  do_swap_page+0x52d/0x9b0
[  847.375610]  ? do_swap_page+0x52d/0x9b0
[  847.376802]  ? __wake_up_sync_key+0x1e/0x30
[  847.378066]  __handle_mm_fault+0x88d/0x11e0
[  847.379291]  handle_mm_fault+0xce/0x1b0
[  847.380417]  __do_page_fault+0x25e/0x500
[  847.381577]  ? __switch_to_asm+0x34/0x70
[  847.382513]  do_page_fault+0x2e/0xe0
[  847.383370]  do_async_page_fault+0x1a/0x80
[  847.384302]  async_page_fault+0x25/0x50
[  847.385225] RIP: 0010:ep_send_events_proc+0x120/0x1a0
[  847.386258] RSP: 0000:ffffad168108bd80 EFLAGS: 00010246
[  847.387335] RAX: 0000000000000001 RBX: ffffad168108bdf8 RCX: 00007fffea7798b0
[  847.388657] RDX: 0000000000000000 RSI: 0000000000000246 RDI: 0000000000000246
[  847.390106] RBP: ffffad168108bdd8 R08: 000000000000030a R09: ffff9d25f5dfc218
[  847.391442] R10: ffffad168108bcc0 R11: 0000000000000000 R12: 0000000000000000
[  847.392743] R13: ffffad168108be78 R14: ffff9d25faea20c0 R15: ffff9d25f5dfc218
[  847.394189]  ? ep_read_events_proc+0xd0/0xd0
[  847.395244]  ep_scan_ready_list.constprop.18+0x9e/0x210
[  847.396415]  ep_poll+0x1f8/0x3b0
[  847.397387]  ? wake_up_q+0x80/0x80
[  847.398331]  SyS_epoll_wait+0xce/0xf0
[  847.399313]  do_syscall_64+0x73/0x130
[  847.400232]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  847.401329] RIP: 0033:0x7fc20e6080a3
[  847.402292] RSP: 002b:00007fffea7798a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[  847.403817] RAX: ffffffffffffffda RBX: 0000560a68ae1260 RCX: 00007fc20e6080a3
[  847.405311] RDX: 000000000000000d RSI: 00007fffea7798b0 RDI: 0000000000000004
[  847.406765] RBP: 00007fffea779a50 R08: 0000560a68aed6e0 R09: 0000000000000000
[  847.408354] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007fffea7798b0
[  847.410017] R13: 0000000000000001 R14: ffffffffffffffff R15: 00007fffea779b50
[  847.411383] INFO: task qemu-ga:2842 blocked for more than 120 seconds.
[  847.413009]       Tainted: P           O     4.15.18-4-pve #1
[  847.414513] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.416280] qemu-ga         D    0  2842      1 0x00000000
[  847.417779] Call Trace:
[  847.418769]  __schedule+0x3e0/0x870
[  847.419916]  schedule+0x36/0x80
[  847.421044]  cv_wait_common+0x11e/0x140 [spl]
[  847.422126]  ? wait_woken+0x80/0x80
[  847.423130]  __cv_wait+0x15/0x20 [spl]
[  847.424295]  zil_commit.part.14+0x86/0x8b0 [zfs]
[  847.425726]  ? spl_kmem_free+0x33/0x40 [spl]
[  847.427098]  ? zfs_range_unlock+0x1b3/0x2e0 [zfs]
[  847.428477]  zil_commit+0x17/0x20 [zfs]
[  847.429738]  zvol_write+0x5a2/0x620 [zfs]
[  847.430951]  ? avl_find+0x5f/0xa0 [zavl]
[  847.432198]  zvol_request+0x24a/0x300 [zfs]
[  847.433506]  ? SyS_madvise+0xa20/0xa20
[  847.434707]  generic_make_request+0x123/0x2f0
[  847.435984]  submit_bio+0x73/0x140
[  847.437152]  ? submit_bio+0x73/0x140
[  847.438330]  ? get_swap_bio+0xcf/0x100
[  847.439487]  __swap_writepage+0x345/0x3b0
[  847.440700]  ? __frontswap_store+0x73/0x100
[  847.441983]  swap_writepage+0x34/0x90
[  847.443175]  pageout.isra.53+0x1e5/0x330
[  847.444443]  shrink_page_list+0x955/0xb70
[  847.445809]  shrink_inactive_list+0x256/0x5e0
[  847.447055]  ? next_arg+0x80/0x110
[  847.448177]  shrink_node_memcg+0x365/0x780
[  847.449443]  shrink_node+0xe1/0x310
[  847.450563]  ? shrink_node+0xe1/0x310
[  847.451733]  do_try_to_free_pages+0xef/0x360
[  847.452995]  try_to_free_pages+0xf2/0x1b0
[  847.454159]  __alloc_pages_slowpath+0x401/0xf10
[  847.455441]  ? __page_cache_alloc+0x86/0x90
[  847.456679]  __alloc_pages_nodemask+0x25b/0x280
[  847.457968]  alloc_pages_current+0x6a/0xe0
[  847.459194]  __page_cache_alloc+0x86/0x90
[  847.460374]  filemap_fault+0x369/0x740
[  847.461542]  ? __switch_to_asm+0x40/0x70
[  847.462696]  ? __switch_to_asm+0x34/0x70
[  847.463843]  ? filemap_map_pages+0x180/0x380
[  847.465061]  ? __switch_to_asm+0x34/0x70
[  847.466211]  ext4_filemap_fault+0x31/0x44
[  847.467346]  __do_fault+0x24/0xe3
[  847.468398]  __handle_mm_fault+0xcd7/0x11e0
[  847.469631]  handle_mm_fault+0xce/0x1b0
[  847.470750]  __do_page_fault+0x25e/0x500
[  847.471897]  do_page_fault+0x2e/0xe0
[  847.472997]  ? async_page_fault+0x2f/0x50
[  847.474089]  do_async_page_fault+0x1a/0x80
[  847.475205]  async_page_fault+0x45/0x50
[  847.476285] RIP: 0033:0x55c573934927
[  847.477358] RSP: 002b:00007ffee4db6630 EFLAGS: 00010206
[  847.478611] RAX: 0000000000000000 RBX: 000055c575723e20 RCX: 00007ff976bbc270
[  847.480176] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007ffee4db6610
[  847.481798] RBP: 00007ffee4db6640 R08: 000055c57572a0c0 R09: 0000000000000000
[  847.483337] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c57572a9a0
[  847.484893] R13: 000055c57572a2a0 R14: 000055c57572ad60 R15: 00007ff97766b610
[  847.486492] INFO: task rrdcached:3267 blocked for more than 120 seconds.
[  847.487976]       Tainted: P           O     4.15.18-4-pve #1
[  847.489368] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  847.491033] rrdcached       D    0  3267      1 0x00000000
[  847.492429] Call Trace:
[  847.493452]  __schedule+0x3e0/0x870
[  847.494556]  schedule+0x36/0x80
[  847.495609]  cv_wait_common+0x11e/0x140 [spl]
[  847.496841]  ? wait_woken+0x80/0x80
[  847.497976]  __cv_wait+0x15/0x20 [spl]
[  847.499154]  txg_wait_open+0xb0/0x100 [zfs]
[  847.500355]  dmu_tx_wait+0x389/0x3a0 [zfs]
[  847.501566]  dmu_tx_assign+0x8b/0x4a0 [zfs]
[  847.502758]  zvol_write+0x175/0x620 [zfs]
[  847.503976]  zvol_request+0x24a/0x300 [zfs]
[  847.505245]  ? SyS_madvise+0xa20/0xa20
[  847.506371]  generic_make_request+0x123/0x2f0
[  847.507549]  submit_bio+0x73/0x140
[  847.508608]  ? submit_bio+0x73/0x140
[  847.509801]  ? get_swap_bio+0xcf/0x100
[  847.510921]  __swap_writepage+0x345/0x3b0
[  847.512140]  ? __frontswap_store+0x73/0x100
[  847.513384]  swap_writepage+0x34/0x90
[  847.514525]  pageout.isra.53+0x1e5/0x330
[  847.515632]  shrink_page_list+0x955/0xb70
[  847.516772]  shrink_inactive_list+0x256/0x5e0
[  847.518015]  ? syscall_return_via_sysret+0x1a/0x7a
[  847.519363]  ? next_arg+0x80/0x110
[  847.520513]  shrink_node_memcg+0x365/0x780
[  847.521717]  shrink_node+0xe1/0x310
[  847.522846]  ? shrink_node+0xe1/0x310
[  847.523949]  do_try_to_free_pages+0xef/0x360
[  847.525182]  try_to_free_pages+0xf2/0x1b0
[  847.526436]  __alloc_pages_slowpath+0x401/0xf10
[  847.527735]  ? __page_cache_alloc+0x86/0x90
[  847.528986]  __alloc_pages_nodemask+0x25b/0x280
[  847.530283]  alloc_pages_current+0x6a/0xe0
[  847.531307]  __page_cache_alloc+0x86/0x90
[  847.532306]  filemap_fault+0x369/0x740
[  847.533308]  ? filemap_map_pages+0x27a/0x380
[  847.534323]  ext4_filemap_fault+0x31/0x44
[  847.535333]  __do_fault+0x24/0xe3
[  847.536304]  __handle_mm_fault+0xcd7/0x11e0
[  847.537400]  ? update_cfs_group+0xc4/0xe0
[  847.538423]  handle_mm_fault+0xce/0x1b0
[  847.539484]  __do_page_fault+0x25e/0x500
[  847.540584]  ? ktime_get_ts64+0x51/0xf0
[  847.541690]  do_page_fault+0x2e/0xe0
[  847.542714]  ? async_page_fault+0x2f/0x50
[  847.543785]  do_async_page_fault+0x1a/0x80
[  847.544885]  async_page_fault+0x45/0x50
[  847.545894] RIP: 0033:0x7f40f2d6667d
[  847.546870] RSP: 002b:00007fffcec77460 EFLAGS: 00010293
[  847.548075] RAX: 0000000000000000 RBX: 0000556934dc0c10 RCX: 00007f40f2d6667d
[  847.549599] RDX: 00000000000003e8 RSI: 0000000000000001 RDI: 0000556934dae6e0
[  847.551060] RBP: 0000000000000001 R08: 00007f40e9f24700 R09: 00007f40e9f24700
[  847.552554] R10: 00000000000008ba R11: 0000000000000293 R12: 0000556934c28360
[  847.554038] R13: 0000556934dae6e0 R14: 0000000000000000 R15: 0000000000000001

behlendorf (Contributor) commented Sep 5, 2018

If possible, could you dump the stack from the txg_sync process when this happens?
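For reference, one way to capture that, assuming the console is still responsive, is a per-thread stack dump or a full sysrq task dump:

cat /proc/$(pgrep -x txg_sync)/stack   # just the txg_sync kernel thread's stack
echo t > /proc/sysrq-trigger           # or dump every task's state to the kernel log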

siv0 (Contributor) commented Sep 5, 2018

I managed to get the following by running sleep 60; echo t > /proc/sysrq-trigger before starting stress:

[10981.102235] txg_sync        R  running task        0   606      2 0x80000000
[10981.103613] Call Trace:
[10981.104615]  <IRQ>
[10981.105989]  sched_show_task+0xfe/0x130
[10981.107636]  show_state_filter+0x62/0xd0
[10981.108730]  sysrq_handle_showstate+0x10/0x20
[10981.109813]  __handle_sysrq+0x106/0x170
[10981.110859]  sysrq_filter+0x9b/0x390
[10981.111841]  input_to_handler+0x61/0x100
[10981.112901]  input_pass_values.part.6+0x117/0x130
[10981.114128]  input_handle_event+0x137/0x510
[10981.115171]  input_event+0x54/0x80
[10981.116220]  atkbd_interrupt+0x4b6/0x7a0
[10981.117176]  serio_interrupt+0x4c/0x90
[10981.118211]  i8042_interrupt+0x1f4/0x3b0
[10981.119180]  __handle_irq_event_percpu+0x84/0x1a0
[10981.120324]  handle_irq_event_percpu+0x32/0x80
[10981.121878]  handle_irq_event+0x3b/0x60
[10981.123547]  handle_edge_irq+0x78/0x1a0
[10981.124862]  handle_irq+0x20/0x30
[10981.125899]  do_IRQ+0x4e/0xd0
[10981.126730]  common_interrupt+0x84/0x84
[10981.127697]  </IRQ>
[10981.128422] RIP: 0010:page_vma_mapped_walk+0x49e/0x770
[10981.129488] RSP: 0000:ffffc3310106f188 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffd6
[10981.130916] RAX: ffff9e7c795f6a80 RBX: ffffc3310106f1d8 RCX: ffff9e7c40000a80
[10981.132363] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 0000000000000000
[10981.133810] RBP: ffffc3310106f1b8 R08: 0000000000000067 R09: 0000000000000000
[10981.135262] R10: 0000000000000000 R11: 0000000000000091 R12: ffffe7bd40949440
[10981.136754] R13: ffff9e7c79468cc0 R14: 0000000000000988 R15: ffffc3310106f1d8
[10981.138596]  ? page_vma_mapped_walk+0x204/0x770
[10981.140413]  page_referenced_one+0x91/0x190
[10981.141780]  rmap_walk_anon+0x113/0x270
[10981.142918]  rmap_walk+0x48/0x60
[10981.143998]  page_referenced+0x10d/0x170
[10981.145189]  ? rmap_walk_anon+0x270/0x270
[10981.146331]  ? page_get_anon_vma+0x80/0x80
[10981.147523]  shrink_page_list+0x791/0xb70
[10981.148737]  shrink_inactive_list+0x256/0x5e0
[10981.150007]  shrink_node_memcg+0x365/0x780
[10981.151191]  ? _cond_resched+0x1a/0x50
[10981.152316]  shrink_node+0xe1/0x310
[10981.153384]  ? shrink_node+0xe1/0x310
[10981.154739]  do_try_to_free_pages+0xef/0x360
[10981.155776]  try_to_free_pages+0xf2/0x1b0
[10981.156747]  __alloc_pages_slowpath+0x401/0xf10
[10981.157910]  __alloc_pages_nodemask+0x25b/0x280
[10981.159112]  alloc_pages_current+0x6a/0xe0
[10981.160480]  new_slab+0x317/0x690
[10981.161439]  ? __blk_mq_get_tag+0x21/0x90
[10981.162485]  ___slab_alloc+0x3c1/0x4e0
[10981.163458]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.164520]  ? update_load_avg+0x665/0x700
[10981.165545]  ? update_load_avg+0x665/0x700
[10981.166545]  __slab_alloc+0x20/0x40
[10981.167472]  ? __enqueue_entity+0x5c/0x60
[10981.168405]  ? __slab_alloc+0x20/0x40
[10981.169297]  __kmalloc_node+0xe4/0x2b0
[10981.170362]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.171331]  spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.172380]  dbuf_dirty+0x20c/0x880 [zfs]
[10981.173440]  ? spl_kmem_zalloc+0xa4/0x190 [spl]
[10981.174606]  dbuf_dirty+0x74e/0x880 [zfs]
[10981.175719]  dnode_setdirty+0xbe/0x100 [zfs]
[10981.176841]  dbuf_dirty+0x46c/0x880 [zfs]
[10981.177809]  dmu_buf_will_dirty+0x11c/0x130 [zfs]
[10981.178814]  dsl_dataset_sync+0x26/0x230 [zfs]
[10981.179801]  dsl_pool_sync+0x9f/0x430 [zfs]
[10981.180747]  spa_sync+0x42d/0xd50 [zfs]
[10981.181653]  txg_sync_thread+0x2d4/0x4a0 [zfs]
[10981.182655]  ? finish_task_switch+0x74/0x200
[10981.183603]  ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[10981.184578]  thread_generic_wrapper+0x74/0x90 [spl]
[10981.185537]  kthread+0x105/0x140
[10981.186386]  ? __thread_exit+0x20/0x20 [spl]
[10981.187333]  ? kthread_create_worker_on_cpu+0x70/0x70
[10981.188470]  ? do_syscall_64+0x73/0x130
[10981.189420]  ? kthread_create_worker_on_cpu+0x70/0x70
[10981.190450]  ret_from_fork+0x35/0x40

dweeezil (Contributor) commented Sep 7, 2018

I decided to give this a try and, as advertised, the deadlock is quite easy to reproduce. First, in a 4.14 kernel, it's interesting to note that a bunch of the "page allocation stalls for ..." messages are generated. That particular message, however, was removed in later kernels because it was determined that merely trying to print the message could, itself, cause deadlocks.

In a 4.16 kernel, which does not emit that message, the processes are in quite a few different states. An interesting observation is that the "zvol" tasks are generally in the running (R) state, while most of the regular user processes are in uninterruptible sleep (D). This sshd process is typical ("?" lines elided for brevity):

[ 6212.296211] sshd            D    0  3018   2976 0x00000000
[ 6212.296211] Call Trace:
[ 6212.296211]  schedule+0x32/0x80
[ 6212.296211]  schedule_timeout+0x159/0x330
[ 6212.296211]  io_schedule_timeout+0x19/0x40
[ 6212.296211]  mempool_alloc+0x121/0x150
[ 6212.296211]  bvec_alloc+0x86/0xe0
[ 6212.296211]  bio_alloc_bioset+0x153/0x200
[ 6212.296211]  get_swap_bio+0x40/0xd0
[ 6212.296211]  __swap_writepage+0x292/0x360
[ 6212.296211]  pageout.isra.51+0x1bb/0x300
[ 6212.296211]  shrink_page_list+0x959/0xbd0
[ 6212.296211]  shrink_inactive_list+0x2a8/0x630
[ 6212.296211]  shrink_node_memcg+0x37f/0x7d0
[ 6212.296211]  shrink_node+0xcc/0x300
[ 6212.296211]  do_try_to_free_pages+0xdb/0x330
[ 6212.296211]  try_to_free_pages+0xd5/0x1a0
[ 6212.296211]  __alloc_pages_slowpath+0x37c/0xd80
[ 6212.296211]  __alloc_pages_nodemask+0x227/0x260
[ 6212.296211]  alloc_pages_vma+0x7c/0x1e0
[ 6212.296211]  __read_swap_cache_async+0x149/0x1c0
[ 6212.296211]  do_swap_page_readahead+0xc9/0x1a0
[ 6212.296211]  do_swap_page+0x5a5/0x820
[ 6212.296211]  __handle_mm_fault+0x91a/0x1040
[ 6212.296211]  handle_mm_fault+0xe4/0x1f0
[ 6212.296211]  __do_page_fault+0x242/0x4d0
[ 6212.296211]  do_page_fault+0x2c/0x110
[ 6212.296211]  page_fault+0x25/0x50

Other dormant user processes, such as agetty, all have pretty much identical stacks.

As far as ZFS is concerned, in the current deadlocked system I've got running, the only blocked process is "txg_quiesce" with a rather uninteresting stack of:

[ 5891.836341] txg_quiesce     D    0  2457      2 0x80000000
[ 5891.836341] Call Trace:
[ 5891.836341]  schedule+0x32/0x80
[ 5891.836341]  cv_wait_common+0x129/0x260 [spl]
[ 5891.836341]  txg_quiesce_thread+0x2fc/0x500 [zfs]
[ 5891.836341]  thread_generic_wrapper+0x76/0xb0 [spl]
[ 5891.836341]  kthread+0xf5/0x130
[ 5891.836341]  ret_from_fork+0x35/0x40

which is the cv_wait(&tc->tc_cv[g], &tc->tc_lock) call at the bottom of txg_quiesce().

The most interesting ZFS-related thing is that all the zvol threads are spinning (in the "R" state). They all appear to be in their while (uio.uio_resid > 0 ...) loop calling dmu_write_uio_dnode(), and show this typical stack:

[ 5891.836341] zvol            R  running task        0  3034      2 0x80000000
[ 5891.836341] Call Trace:
[ 5891.836341]  ? page_referenced+0xb9/0x160
[ 5891.836341]  ? shrink_page_list+0x467/0xbd0
[ 5891.836341]  ? shrink_inactive_list+0x2b5/0x630
[ 5891.836341]  ? shrink_node_memcg+0x37f/0x7d0
[ 5891.836341]  ? __list_lru_init+0x170/0x200
[ 5891.836341]  ? shrink_node+0xcc/0x300
[ 5891.836341]  ? shrink_node+0xcc/0x300
[ 5891.836341]  ? do_try_to_free_pages+0xdb/0x330
[ 5891.836341]  ? try_to_free_pages+0xd5/0x1a0
[ 5891.836341]  ? __alloc_pages_slowpath+0x37c/0xd80
[ 5891.836341]  ? get_page_from_freelist+0x145/0x1190
[ 5891.836341]  ? __alloc_pages_nodemask+0x227/0x260
[ 5891.836341]  ? abd_alloc+0x1e4/0x4d0 [zfs]
[ 5891.836341]  ? arc_hdr_alloc_abd+0x5b/0x230 [zfs]
[ 5891.836341]  ? arc_hdr_alloc+0x161/0x390 [zfs]
[ 5891.836341]  ? arc_alloc_buf+0x38/0x110 [zfs]
[ 5891.836341]  ? dmu_buf_will_fill+0x122/0x3b0 [zfs]
[ 5891.836341]  ? dmu_write_uio_dnode+0xe1/0x1a0 [zfs]
[ 5891.836341]  ? zvol_write+0x191/0x6c0 [zfs]

I'm going to try a couple of things: first, set zvol_request_sync=1 to see what difference it makes; second, add some instrumentation to the zvol path to see whether it might actually be making forward progress, albeit very slowly.
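
For anyone following along: zvol_request_sync is a zfs module parameter, so it can be toggled at runtime through sysfs or set persistently at module load, e.g.:

echo 1 > /sys/module/zfs/parameters/zvol_request_sync
# or, persistently, in /etc/modprobe.d/zfs.conf:
#   options zfs zvol_request_sync=1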

cwedgwood (Contributor) commented Sep 7, 2018

Naively, I don't see how swap can be expected to work as things stand; I think we should document/label it as such.

@behlendorf Given the amount of code complexity and the allocations potentially required, is supporting swap even realistic?

behlendorf (Contributor) commented Sep 7, 2018

@cwedgwood It's definitely been a challenging area to get working, and it is not heavily stress tested on a wide range of kernels. It was working reliably with older ZFS releases and older kernels but, as you say, given the code complexity and the surface area we'd need to test, it's not something I'd recommend. I tend to agree we should make this clearer in the documentation.

drescherjm commented Sep 7, 2018

I am using this (swap on zfs) on several production systems (home and work) with recent kernels. Thankfully it has not been an issue for me yet.

inpos commented Sep 17, 2018

Same issue on Gentoo with ZFS 0.8.0-rc1 and native encryption.

ryao (Contributor) commented Sep 17, 2018

@inpos Out of curiosity, how did you get 0.8.0-rc1 on Gentoo? I have not finished my review of it, so I haven't pushed it to the main tree yet. I might just push it unkeyworded if people would otherwise resort to building outside the package manager.

ryao (Contributor) commented Sep 17, 2018

@behlendorf This was never 100% sane, although the return of the zvol threads certainly did not help. I started (but did not finish) writing a patch that could help with this:

https://paste.pound-python.org/show/KWlvrHdBU2mA9ev2odXL/

At this point, it is fairly clear that my offline obligations prevent me from speculating on if/when I will finish it, but other developers should be able to see the idea there. It ought to help if/when finished. The current version will NOT compile, so I would prefer that users not attempt it; trying to compile it would be a waste of their time.

Alternatively, there is a nuclear option for resolving this. If we treat swap the way Illumos treats dump devices, by disabling CoW and checksums, things will definitely work; that would be at the expense of protection against bitrot on the swap device. This requires code changes, because the current codebase cannot support a zvol in that configuration. Quite honestly, if we were going for the nuclear option, I'd prefer to create a new "dataset" type and implement extent-based allocation.

@ahrens Not that I am seriously considering this, but it would be nice to hear your thoughts on the nuclear option for dealing with swap.

inpos commented Sep 17, 2018

@ryao Here is the content of /usr/local/portage/sys-fs/zfs/zfs-0.8.0_rc1.ebuild

And here is the content of /usr/local/portage/sys-fs/zfs-kmod/zfs-kmod-0.8.0_rc1.ebuild

ryao (Contributor) commented Sep 17, 2018

On second thought, given that swap devices are simultaneously dump devices on Linux, it might make sense to implement support for creating/writing to Illumos dump devices. That would probably be a good middle ground here.

prakashsurya (Member) commented Sep 17, 2018

I'm not an expert in this area of the code, but I think that swap on ZVOL is inherently unreliable, because writes to the swap ZVOL have to go through the normal TXG sync and ZIO write paths, which by design can require lots of memory allocations (and those allocations can stall in a low-memory situation). I believe this to be true for swap on ZVOL on illumos as well as Linux, and presumably FreeBSD too (although I have no experience using it on FreeBSD, so I could be wrong).

I think the "proper" way to address this is to mimic the write path for a ZVOL dump device on illumos, for the ZVOL swap device. i.e. preallocate the ZVOL's blocks on disk, then do a "direct write" to the preallocated LBA such that it doesn't go through the normal TXG sync and ZIO write code paths. This would significantly reduce the amount of work (e.g. memory allocations) that's required to write a swap page to disk, thereby increasing the reliability of that write.

The drawback to this approach would be that we won't get the data consistency guarantees we normally get with ZFS, e.g. data checksums. I think this is a reasonable tradeoff (i.e. swap zvol + no hangs + no checksums vs. swap zvol + hangs + checksums), given that other Linux filesystems are no better (right?).

I spent a couple hours reading the ZVOL dump device write codepaths during the OpenZFS dev summit last week, and I think this approach is viable. It'll require porting the code to Linux and reworking it to use the Linux block device layer, since (IIRC) that's what's used to issue the writes to disk (see zvol_dumpio_vdev() in illumos). Additionally, we'll need to determine how/where to do the preallocation of the ZVOL's blocks (see zvol_prealloc() in illumos); for illumos dump devices, this occurs via the DKIOCDUMPINIT ioctl (when zvol_dumpify() is called).
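
To make the shape of the idea concrete, here is a rough sketch; zvol_get_prealloc_dva() and vdev_direct_write() are illustrative names, not functions that exist in OpenZFS today:

/*
 * Rough sketch only: translate a volume offset into the DVA that was
 * fixed at preallocation time, then write straight to the leaf vdev,
 * bypassing the DMU/TXG machinery and its memory allocations.
 * zvol_get_prealloc_dva() and vdev_direct_write() are hypothetical.
 */
static int
zvol_swap_direct_write(zvol_state_t *zv, void *data, uint64_t off,
    uint64_t size)
{
	dva_t dva;
	int error;

	/* Look up the block preallocated for this offset (cf. illumos
	 * zvol_prealloc(), which records the allocated extents). */
	error = zvol_get_prealloc_dva(zv, off, &dva);
	if (error != 0)
		return (error);

	/* Issue the I/O directly against the vdev, analogous to illumos
	 * zvol_dumpio_vdev(). No TXG assignment, no ARC, no new buffers. */
	return (vdev_direct_write(dmu_objset_spa(zv->zv_objset), &dva,
	    data, off, size));
}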

With all that said, I've only spent an hour or two looking into this and I don't have any prior experience with swap, so I may be overlooking something or be flat-out wrong in my analysis so far. Also, I won't have any time in the near future to look into this in detail, but I could potentially help answer questions if somebody else wants to try to prototype the changes I'm proposing.

@behlendorf
Contributor

behlendorf commented Sep 17, 2018

we'll need to determine how/where to do the preallocation of the ZVOL's blocks (see zvol_prealloc() in illumos);

I took a look at mkswap.c and, aside from writing a unique signature to the device, it doesn't do anything that would let ZFS easily identify it as a swap device. One solution would be to add a udev helper which detects ZFS volumes with the swap signature and calls zvol_prealloc() via an ioctl(). This should allow things to "just work" on most Linux distributions, though obviously it's Linux-specific. We should probably do something a bit more generic and reliable and add a new create-time-only property for volumes. This wouldn't even necessarily prevent us from adding the udev hook in the future.
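
As a rough sketch, such a helper only needs to read the first page of the volume and check for the swap magic that mkswap writes; the ZFS-specific part (the preallocation ioctl) is hypothetical:

/*
 * Sketch of a udev helper: check a zvol for the Linux swap signature,
 * which mkswap writes as "SWAPSPACE2" (or "SWAP-SPACE" for old-style
 * swap) in the last 10 bytes of the first page. Assumes 4K pages.
 * The preallocation ioctl is hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ 4096

int
main(int argc, char **argv)
{
	char buf[PAGE_SZ];
	int fd;

	if (argc != 2)
		return (1);
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || read(fd, buf, sizeof (buf)) != sizeof (buf))
		return (1);
	if (memcmp(buf + PAGE_SZ - 10, "SWAPSPACE2", 10) == 0 ||
	    memcmp(buf + PAGE_SZ - 10, "SWAP-SPACE", 10) == 0) {
		printf("%s: swap signature found\n", argv[1]);
		/* hypothetical: ioctl(fd, ZFS_IOC_ZVOL_PREALLOC) */
	}
	close(fd);
	return (0);
}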

@prakashsurya
Member

prakashsurya commented Sep 17, 2018

@behlendorf that's unfortunate; given there's no good entry point to do the preallocation, your two suggestions seem reasonable at first glance.

Using a new ioctl (or create-time property) to preallocate and "convert" a ZVOL to a "direct write" ZVOL might be a neat feature, as it could enable this use case of swap, but also other use cases where the user would like to forego the COW semantics (e.g. a zpool layered on top of ZVOLs).

@davidklaftenegger

davidklaftenegger commented Dec 7, 2018

Maybe this wiki page

https://github.com/zfsonlinux/pkg-zfs/wiki/HOWTO-use-a-zvol-as-a-swap-device

should be edited to warn about this issue?

@MobyGamer

MobyGamer commented Jan 18, 2019

I took the liberty of amending the wiki (I've been affected by this as well). It's quite reproducible in v0.6.5.9-5~bpo8+1 running on 3.16.0-7-amd64 (Debian 3.16.59-1), so I look forward to an eventual fix. I've found no sustainable workaround (other than to stop using zvols for swap).

@mafredri

mafredri commented Jan 18, 2019

Thanks @MobyGamer, I'm running ZFS 0.7.12-1~bpo9+1 on Debian Stretch (kernel 4.19.12-1~bpo9+1) on a system with 2GB of RAM and I experienced this issue. I had followed the Debian Stretch Root on ZFS guide and was led to believe I should use swap on ZFS. Searching for the cause, there was really nothing in the wikis about potential downsides to doing it, and the only argument against it I managed to find was this issue. It's good that there is at least some documentation regarding it now 😄.

@gmelikov
Member

gmelikov commented Jan 21, 2019

@mafredri I updated the wiki. Indeed, we hadn't seen problems with swap on the ZFS 0.7 branch for quite a while.

@didrocks
Contributor

didrocks commented Oct 31, 2019

FYI, this issue is what made us switch from a swap zvol to a dedicated partition at the last minute, after users' feedback and our own experience with our installer on Ubuntu 19.10. (https://bugs.launchpad.net/bugs/1847628)

This is quite an annoying issue: for ext4 we have long been using a swapfile to avoid extra partitioning on users' machines, and we would like our default ZFS experience to be similar.

@scineram

scineram commented Oct 31, 2019

With multi_vdev_crash_dump it should be possible to swap into a preallocated zvol or file, with the writes bypassing ZFS completely, right? But it is not implemented yet.

@prakashsurya
Member

prakashsurya commented Oct 31, 2019

@didrocks I gave some information on a possible approach to address this long term, in my comment above. At this point, I think it's accepted that using a ZVOL for swap will result in instabilities, but we haven't had anybody with the time or motivation to step up and attempt to fix this.

I think the lack of motivation stems, at least partially, from the fact that (I assume) most folks using ZFS on Linux are not using it for the root filesystem (I use it that way personally, and so does my employer, so I don't mean to suggest that nobody runs root on ZFS). Now that Ubuntu officially supports a root-filesystem-on-ZFS configuration, perhaps that will change, both in terms of more users of root on ZFS and in terms of more developers with sufficient motivation to fix this issue.

@RSully

RSully commented Nov 4, 2019

Alternatively, there is a nuclear option for resolving this. If we treat swap like Illumos treats dump devices by disabling CoW and checksums, things will definitely work. That would be at the expense of protection against bitrot on the swap.

and:

The drawback to this approach would be that we won't get the data consistency guarantees that we normally get with ZFS, e.g. data checksums. I think this is a reasonable tradeoff (i.e. swap zvol + no hangs + no checksums vs. swap zvol + hangs + checksums), given that other Linux filesystems are no better (right?).

Not that my 2¢ means much as a non-contributor, but I'd personally really prefer to see a design where checksums, at least, remain functional, so that the possibility of end-to-end integrity is preserved. Can anyone speak to how much complexity/effort that single feature would add to this work?

I know Illumos dump devices were mentioned above. I am rather unfamiliar with Illumos, but I thought dump and swap devices were not the same thing, and that Illumos actually did use zvols for swap. Am I incorrect, or were the above comparisons not quite valid?

@prakashsurya
Member

prakashsurya commented Nov 5, 2019

I'd personally really prefer to see a design where checksums, at least, remain functional

I think we'd all prefer this, but I don't think it's currently possible if we were to write directly to pre-allocated blocks, because the checksums are stored in the indirect blocks, and the indirect blocks are written during preallocation rather than when we write out the swap data.
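
To see why, here is the on-disk block pointer, abbreviated from OpenZFS's include/sys/spa.h (with the checksum type from zio.h): the checksum lives inside the blkptr_t held by the parent indirect block, so directly overwriting a data block would leave a stale checksum in a parent that was written long before the swap data:

#include <stdint.h>

/* Abbreviated from OpenZFS include/sys/spa.h and zio.h. */
typedef struct dva {
	uint64_t	dva_word[2];	/* vdev, offset, asize */
} dva_t;

typedef struct zio_cksum {
	uint64_t	zc_word[4];	/* 256-bit checksum */
} zio_cksum_t;

typedef struct blkptr {
	dva_t		blk_dva[3];	/* data virtual addresses */
	uint64_t	blk_prop;	/* size, compression, type, etc. */
	uint64_t	blk_pad[2];	/* extra space for the future */
	uint64_t	blk_phys_birth;	/* txg when block was allocated */
	uint64_t	blk_birth;	/* transaction group at birth */
	uint64_t	blk_fill;	/* fill count */
	zio_cksum_t	blk_cksum;	/* checksum of the pointed-to data */
} blkptr_t;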

and that Illumos actually did use zvols for swap.

Yes, we used ZVOLs for swap on our illumos-based appliance, and we often saw (likely the same) deadlocks for workloads that tried to actually use the swap space.

@Kodiman

Kodiman commented Apr 25, 2022

Yes, excellent. Compare Linux swap on bcachefs vs. OpenZFS. Hopefully the ZFS devs read this. Thanks.

@pzakha
Contributor

pzakha commented Apr 25, 2022

Note that there are already some ideas for how to implement this, and they are mentioned in this thread. For instance: #7734 (comment) and #7734 (comment) (looks like the doc is set to private now: @ahrens should we put this info somewhere else?). The problem is to find volunteers who'd be willing to implement it and test it :).

@ahrens
Member

ahrens commented Apr 25, 2022

@pzakha I've updated the URL in the previous comment to link to a publicly-readable version: https://docs.google.com/document/d/14KM0A8tlmvGyqF2FOqsaWCS3ACaNBBRQMJcgCkZ9cjM/edit?usp=sharing

While I'm here I'll also point to my previous thoughts on the subject: #7734 (comment)

@mfb

mfb commented Apr 25, 2022

I use redundant zvol swap on FreeBSD 13.0 and haven't seen issues yet, but based on reading threads such as this one, I do try to carefully manage the systems to ensure they don't exhaust memory. (Before using ZFS swap, I had multiple crashes/lockups when init, ssh, virtual consoles, database servers, etc. were all swapped out to standard FreeBSD swap partitions on spinning disks which failed - so far ZFS swap has been more stable thanks to the redundancy.)

@h1z1

h1z1 commented Apr 26, 2022

took a look at mkswap.c and, aside from writing a unique signature to the device, it doesn't do anything that would let ZFS easily identify it as a swap device.

Was curious about this, given that file(1) does identify a swap device:

# file /SWAP00
/SWAP00: Linux/i386 swap file (new style), version 1 (4K pages), size 3071999 pages, no label, UUID=386bec3d-95fc-411a-aaaa-ffffffff
#

https://github.com/file/file/blob/d17d8e9ff8ad8e95fdf66239ccdcc2133d1ce5ce/magic/Magdir/linux#L84

?

@rm-happens

rm-happens commented May 1, 2022

I'd be interested to see how Oracle Solaris handles this. They appear to support swap on ZVOLs on both SPARC and x86. I'm sure the Oracle Solaris ZFS code and OpenZFS code have massively diverged since Solaris was closed back up, but they would've had to deal with the exact same problems.

Also, has anyone noticed whether size seems to affect whether swap on a ZVOL locks up? I haven't had any issues running swap on a ZVOL on my laptop, but I only have 16G RAM, 8G swap, relatively light workloads (memory-wise, anyway), and on the occasions I do swap it's not a huge amount.

@DemiMarie

DemiMarie commented May 1, 2022

I'd be interested to see how Oracle Solaris handles this. They appear to support swap on ZVOLs on both SPARC and x86. I'm sure the Oracle Solaris ZFS code and OpenZFS code have massively diverged since Solaris was closed back up, but they would've had to deal with the exact same problems.

The only solution I can think of is to not require memory allocations during writing.

@rm-happens

rm-happens commented May 2, 2022

I'd be interested to see how Oracle Solaris handles this. They appear to support swap on ZVOLs on both SPARC and x86. I'm sure the Oracle Solaris ZFS code and OpenZFS code have massively diverged since Solaris was closed back up, but they would've had to deal with the exact same problems.

The only solution I can think of is to not require memory allocations during writing.

That would be doable. Oracle Solaris does not differentiate a ZVOL used for swap from a ZVOL used for any other purpose. That could have performance implications. Perhaps having a separate ZVOL type just for swaps, which pre-allocates memory as you suggest, would be the best solution?

@DemiMarie

DemiMarie commented May 2, 2022

I'd be interested to see how Oracle Solaris handles this. They appear to support swap on ZVOLs on both SPARC and x86. I'm sure the Oracle Solaris ZFS code and OpenZFS code have massively diverged since Solaris was closed back up, but they would've had to deal with the exact same problems.

The only solution I can think of is to not require memory allocations during writing.

That would be doable. Oracle Solaris does not differentiate a ZVOL used for swap from a ZVOL used for any other purpose. That could have performance implications. Perhaps having a separate ZVOL type just for swaps, which pre-allocates memory as you suggest, would be the best solution?

It should be possible to differentiate between the two based on how they are used, without requiring any manual work on the sysadmin’s part.

@Kodiman

Kodiman commented May 2, 2022

Thanks to everyone for the good progress in thinking about solutions. Hope the zfs devs see your ideas.

@Kodiman

Kodiman commented May 2, 2022

As was mentioned above, "a separate ZVOL type just for swaps, which pre-allocates memory". Sounds good.

@rm-happens

rm-happens commented May 3, 2022

I'd be interested to see how Oracle Solaris handles this. They appear to support swap on ZVOLs on both SPARC and x86. I'm sure the Oracle Solaris ZFS code and OpenZFS code have massively diverged since Solaris was closed back up, but they would've had to deal with the exact same problems.

The only solution I can think of is to not require memory allocations during writing.

That would be doable. Oracle Solaris does not differentiate a ZVOL used for swap from a ZVOL used for any other purpose. That could have performance implications. Perhaps having a separate ZVOL type just for swaps, which pre-allocates memory as you suggest, would be the best solution?

It should be possible to differentiate between the two based on how they are used, without requiring any manual work on the sysadmin’s part.

Perhaps. I'd rather just have an option toggle, like "-pamem", which tells ZFS to pre-allocate memory on the ZVOL. It wouldn't require any extra work on the part of the SA, other than setting the option when creating the ZVOL.

@Kodiman

Kodiman commented May 3, 2022

Yes, we can have option toggles. Good. Thanks.

@RubenKelevra

RubenKelevra commented May 29, 2022

It would be nice to have a single thread that preallocates some static memory (kept safe from the Linux kernel's overcommit optimizations), plus a flag that marks the ZVOL as the system's swap device.

This way the overall performance isn't hurt by the "workaround", but the device is guaranteed to be able to accept writes without asking for more memory, since the memory is bound to that thread.

Hey @behlendorf, maybe you can have a look at this if you have time? Seems rather important! :)

It doesn't even have to write the data out "properly"; just dumping it into a reserved section of the disk would be fine, I guess. Then Linux can reclaim the memory, and other ZFS threads can read it back and write it out "properly".
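
For what it's worth, the Linux kernel's mempool API already provides roughly this guarantee; here is a sketch of how a reserve could back the swap write path (all zvol_swap_* names are made up for illustration):

/*
 * Sketch of the reserved-memory idea using the kernel's mempool API:
 * a pool of preallocated buffers guarantees forward progress for swap
 * writes even when the page allocator would otherwise have to reclaim.
 * All zvol_swap_* names here are made up for illustration.
 */
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mempool.h>

#define ZVOL_SWAP_RESERVE	64	/* buffers held in reserve */
#define ZVOL_SWAP_BUFSIZE	4096	/* one 4K volblock per buffer */

static mempool_t *zvol_swap_pool;

static int
zvol_swap_reserve_init(void)
{
	zvol_swap_pool = mempool_create_kmalloc_pool(ZVOL_SWAP_RESERVE,
	    ZVOL_SWAP_BUFSIZE);
	return (zvol_swap_pool ? 0 : -ENOMEM);
}

/*
 * Called from the swap write path: with GFP_NOIO, mempool_alloc()
 * falls back to the preallocated reserve instead of recursing into
 * reclaim (which could need this very device to make progress).
 */
static void *
zvol_swap_get_buf(void)
{
	return (mempool_alloc(zvol_swap_pool, GFP_NOIO));
}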

@RubenKelevra

RubenKelevra commented May 29, 2022

Another option may be an emergency mode for the ARC: if there's high write demand on this specific device, the in-memory ARC cache gets evicted and its space used to buffer the swap writes, while a thread starts writing them out properly to disk.

There's a reserved minimum for the ARC anyway, and in a low-memory situation it's better to relieve some pressure right away rather than depend so heavily on the disk I/O being fast.

Not sure which is easier to implement.

@opoloko

opoloko commented Jun 19, 2022

I just installed Ubuntu using ZFS, as it seemed the most modern option, and started experiencing lockups during heavy usage that seem linked to swap (in my case a swap volume on my ZFS filesystem).

Four years after this was opened, it seems unlikely it will ever be fixed.

Considering I cannot resize my ZFS pool (it takes the full disk) to make space for a normal swap partition, and I need swap for some heavy processes, what are my alternatives/workarounds, apart from reinstalling everything to ditch ZFS and adding another voice telling Ubuntu to take it out of the installer?

@mabod

mabod commented Jun 19, 2022

You should look into zram-generator. This is what I am using with excellent results.

https://github.com/systemd/zram-generator

@opoloko

opoloko commented Jun 19, 2022

You should look into zram-generator. This is what I am using with excellent results.

https://github.com/systemd/zram-generator

Could you provide some more detail? I'm not sure I get how to use this to have a swap that bypasses the zvol/zfs swap problem...?

@mabod

mabod commented Jun 19, 2022

zram-generator reserves part of your RAM for swap. This sounds counterintuitive, but in fact it works very well. My machine, for example, has 64 GB of RAM and 12 GB of swap via zram-generator. I do frequent stress tests where 76 GB of RAM are consumed, and it works very well. And the swap is very fast.
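
For reference, a minimal zram-generator configuration for such a setup goes in /etc/systemd/zram-generator.conf; the sizes here are just examples:

# /etc/systemd/zram-generator.conf
[zram0]
# Cap the device at half of RAM, but never more than 12 GiB (in MB).
zram-size = min(ram / 2, 12288)
compression-algorithm = zstd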

@opoloko

opoloko commented Jun 19, 2022

zram-generator reserves part of your RAM for swap. This sounds counterintuitive, but in fact it works very well. My machine, for example, has 64 GB of RAM and 12 GB of swap via zram-generator. I do frequent stress tests where 76 GB of RAM are consumed, and it works very well. And the swap is very fast.

I'm afraid this only applies to configurations with lots of RAM. In my case, on a netbook with 8 GB of RAM, I cannot do that: I need real swap on disk, bigger than RAM, so it cannot stay in RAM.

@mabod

mabod commented Jun 19, 2022

If that is the case, then your only choice is reinstallation with a new partition scheme that has a dedicated swap partition.

@milindpatel63

milindpatel63 commented Jun 19, 2022

I also have the exact same issue, running Ubuntu with ZFS on an old laptop with 8 GB RAM, and can't use zram.
I ended up plugging in a 16 GB USB drive and using it as swap space. It's been working fine.

@gmelikov
Member

gmelikov commented Jun 19, 2022

Off-topic, but:

zram can in fact help with 8 GB of RAM, at the expense of CPU usage. zstd works really well in laptop scenarios: it compresses 4 GB of browser/etc. RAM into 1 GB; you just need to give it enough space (note that this is the maximum swapped-RAM size; in the worst case it is used as-is, in the usual case see below). For 16 GB I gave it a threshold of 75% of RAM, and it usually looks like this (live output from my laptop):

# zramctl
NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram0 zstd         11.3G  4.3G  1.1G  1.2G      16 [SWAP]

So it compressed 4 GB into 1 GB. Typically you get 4x with zstd and 3x with lz4; I've seen 10 GB compressed into 2-3 GB too.

I use the turnkey zram-tools package on Debian; see zramswap.service and the /etc/default/zramswap config.

(Many Android firmwares use zram too, so it works even in 1 GB RAM consumer scenarios.)

@pepa65

pepa65 commented Jun 19, 2022

Re zram: on Ubuntu I just install zram-config and it does everything for you.

Re Ubuntu ZFS: they should make a separate swap partition (not ditch ZFS altogether).

@dm17

dm17 commented Jun 19, 2022

zram-generator reserves part of your RAM for swap. This sounds counterintuitive, but in fact it works very well. My machine, for example, has 64 GB of RAM and 12 GB of swap via zram-generator. I do frequent stress tests where 76 GB of RAM are consumed, and it works very well. And the swap is very fast.

Doesn't it mean there is a flaw in the default configuration of Linux if zram-generator is beating swap and/or no-swap configs? I mean, why would zram-generator beat no swap except in the rare case that you need to swap really often (completely running out of RAM often)?

I've not used swap on most modern systems because there doesn't seem to be a need. The reason I'm in this thread is that I have one of those modern 🙄 ThinkPads that share the 8 GB of system RAM with the integrated graphics, leaving 5882 MiB of RAM for the OS to use! So when RAM filled up I thought swap would help, but so far it doesn't help much; it's easier to just let tabs die and manage memory so as not to go overboard.

@grahamperrin

grahamperrin commented Jun 19, 2022

For focus: maybe take workarounds, and other things that don't directly progress this issue, to discussions –

https://github.com/openzfs/zfs/discussions

Thanks

@MasterJubei

MasterJubei commented Jul 27, 2022

Can this be fixed? Ubuntu on ZFS should not be an option if this isn't fixed.

@pepa65

pepa65 commented Jul 27, 2022

The Ubuntu on ZFS installation should just have a separate (LUKS encrypted!) swap partition, as swap on ZFS apparently won't be fixed rapidly.

@DemiMarie

DemiMarie commented Jul 27, 2022

Could the relevant allocations be wrapped in memalloc_noio_save()/memalloc_noio_restore() on Linux?
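
For illustration, a minimal sketch of what that wrapping might look like, assuming a hypothetical zvol_swap_submit() entry point (not an existing OpenZFS function):

/*
 * Sketch of that wrapping: memalloc_noio_save() makes every nested
 * allocation in the current task implicitly GFP_NOIO, so reclaim
 * cannot recurse back into the block layer (and thus into the zvol
 * itself). zvol_swap_submit() is a made-up name for illustration.
 */
#include <linux/bio.h>
#include <linux/sched/mm.h>

static void
zvol_swap_submit(struct bio *bio)
{
	unsigned int noio_flags;

	noio_flags = memalloc_noio_save();
	/* ... assign the TXG, copy pages, build and issue the zio ... */
	memalloc_noio_restore(noio_flags);
}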
