
PANIC at arc.c:5295:arc_read() when little free memory left #7186

Closed
vozhyk- opened this issue Feb 17, 2018 · 4 comments

vozhyk- commented Feb 17, 2018

### System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Funtoo |
| Distribution Version | |
| Linux Kernel | 4.1.15-aufs |
| Architecture | x86_64 |
| ZFS Version | 0.7.6-2_gc558c93f |
| SPL Version | 0.7.6-r0-gentoo |

### Describe the problem you're observing

With little free RAM left (around 30–50 MB) and no swap, some processes got stuck in the D (uninterruptible sleep) state:

```
2018-02-17 16:39:11 vozhx-thinkpad /home/vozhyk # ps aux | grep '  D'
root      2386  0.0  0.0      0     0 ?        DN   00:09   0:01 [dbuf_evict]
root      2985  0.0  0.0      0     0 ?        D<   00:14   0:15 [z_wr_int_2]
root      3170  0.0  0.0      0     0 ?        D    00:14   0:36 [txg_sync]
portage   9110  0.0  0.0  20384  2684 pts/4    DN   16:10   0:00 git status --porcelain
vozhyk    9415  0.0  0.4 301516 14360 ?        DN   16:10   0:00 konsole -e sudo su
vozhyk   10182  0.0  0.4 301516 14388 ?        DN   16:11   0:00 konsole -e sudo su
vozhyk   10610  0.0  0.5 301516 14788 ?        DN   16:11   0:00 konsole -e sudo su
vozhyk   10645  0.0  0.5 301516 14788 ?        DN   16:12   0:00 konsole -e sudo su
vozhyk   11171  0.0  0.5 301516 14800 ?        DN   16:19   0:00 konsole -e sudo su
root     12847  0.0  0.0  12852  2044 pts/7    SN+  16:39   0:00 grep --color=auto   D
```

I enabled swap after that, but the system didn't start using it (it reported 0M used).
After a while, more processes hung as well and I had to trigger a kernel panic.
Some files written beforehand weren't created/updated after the reboot (weechat logs and a file written with `cat >`).

```
2018-02-17 18:54:18 vozhx-thinkpad /home/vozhyk # zpool status
  pool: sm
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0h0m with 0 errors on Wed Dec  6 03:52:09 2017
config:

        NAME        STATE     READ WRITE CKSUM
        sm          ONLINE       0     0     0
          sda1      ONLINE       0     0     0

errors: No known data errors

  pool: sm-boot
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: none requested
config:

        NAME                                                   STATE     READ WRITE CKSUM
        sm-boot                                                ONLINE       0     0     0
          ata-Samsung_SSD_850_EVO_250GB_S2R6NX0JB14474X-part3  ONLINE       0     0     0

errors: No known data errors
```

(Both pools are on the same disk.)

#5711 (comment) seems similar (has the same kind of disks, the same error message, and the system hangs), and #5711 itself has the same error message but happens when importing a pool.

### Describe how to reproduce the problem

I haven't tried to reproduce it.

### Include any warning/errors/backtraces from the system logs

```
[57648.727319] VERIFY3(0 == arc_buf_alloc_impl(hdr, private, compressed_read, B_TRUE, &buf)) failed (0 == 0)
[57648.727326] PANIC at arc.c:5295:arc_read()
[57648.727329] Showing stack for process 9110
[57648.727334] CPU: 0 PID: 9110 Comm: git Tainted: P           O    4.1.15-aufs #1
[57648.727336] Hardware name: LENOVO        /       , BIOS 7LETC9WW (2.29 ) 03/18/2011
[57648.727339]  00000000000014af ffff8800a2ddf858 ffffffff819625f4 0000000000000007
[57648.727344]  ffffffffa028676c ffff8800a2ddf868 ffffffffa00648c4 ffff8800a2ddf9e8
[57648.727348]  ffffffffa0064986 ffff88006ad137f8 0000000000000030 ffff8800a2ddf9f8
[57648.727352] Call Trace:
[57648.727363]  [<ffffffff819625f4>] dump_stack+0x45/0x57
[57648.727378]  [<ffffffffa00648c4>] spl_dumpstack+0x44/0x50 [spl]
[57648.727384]  [<ffffffffa0064986>] spl_panic+0xb6/0x110 [spl]
[57648.727390]  [<ffffffffa006050d>] ? spl_kmem_cache_alloc+0x10d/0x7f0 [spl]
[57648.727410]  [<ffffffffa01ff94a>] ? zio_decompress_data+0x5a/0x70 [zfs]
[57648.727422]  [<ffffffffa013451e>] ? arc_space_return+0x37e/0x520 [zfs]
[57648.727433]  [<ffffffffa01348ae>] ? arc_decompress+0x1ee/0x980 [zfs]
[57648.727444]  [<ffffffffa0135a35>] arc_read+0x9f5/0x34a0 [zfs]
[57648.727457]  [<ffffffffa013fc39>] ? dbuf_read+0x1249/0x13b0 [zfs]
[57648.727469]  [<ffffffffa01402a0>] ? dbuf_rele_and_unlock+0x4c0/0x5c0 [zfs]
[57648.727482]  [<ffffffffa013f238>] dbuf_read+0x848/0x13b0 [zfs]
[57648.727487]  [<ffffffffa005ee4a>] ? spl_kmem_free+0x2a/0x40 [spl]
[57648.727502]  [<ffffffffa0149de5>] dmu_buf_rele_array+0x135/0x4c0 [zfs]
[57648.727516]  [<ffffffffa014a2a4>] dmu_buf_hold_array_by_bonus+0x134/0x1f0 [zfs]
[57648.727530]  [<ffffffffa014a3a9>] dmu_read+0x49/0x70 [zfs]
[57648.727546]  [<ffffffffa01ea967>] zfs_getpage+0x117/0x1d0 [zfs]
[57648.727559]  [<ffffffffa020209c>] zpl_putpage+0x26c/0xba0 [zfs]
[57648.727564]  [<ffffffff8112503d>] filemap_fault+0x23d/0x3d0
[57648.727569]  [<ffffffff8114abc1>] __do_fault+0x41/0xa0
[57648.727573]  [<ffffffff8114e04f>] handle_mm_fault+0xbff/0x1030
[57648.727577]  [<ffffffff8196772c>] ? __schedule+0x6ac/0xff0
[57648.727582]  [<ffffffff81042f4e>] __do_page_fault+0x15e/0x3d0
[57648.727586]  [<ffffffff810431fc>] do_page_fault+0xc/0x10
[57648.727589]  [<ffffffff8196d632>] page_fault+0x22/0x30
[57758.549708] Adding 5242876k swap on /dev/sda4.  Priority:-1 extents:1 across:5242876k SSFS
[57760.945614] Process accounting resumed
```

The `0 == 0` part is probably due to openzfs/spl#684.

`bt`, `dmesg` and `ps` output from the crash dump taken after more processes hung: https://gist.github.com/vozhyk-/ff5b96cb11c964a3aa2e0baa374162a6

@behlendorf

@vozhyk- by chance is this pool using the new encryption feature?

@behlendorf behlendorf added this to the 0.8.0 milestone Feb 20, 2018

vozhyk- commented Feb 20, 2018

@behlendorf no. The 2 extra commits in ZFS don't touch ZFS code (just add some libraries and a binary).

It was running a version of ZFS with encryption support for some time, but I never enabled or used it.

@behlendorf

Thanks, that narrows this issue down to a failure here decompressing the data, which is almost certainly due to a memory allocation failure. That's good, since it jibes with the stack trace you posted. I'd expect the proposed changes in #7170 to help mitigate this, and of course we'll want to handle this more gracefully, since it clearly can happen.

@behlendorf

Closing. This should have been resolved.
