
PANIC at arc.c:5295:arc_read() when little free memory left #7186

Closed
vozhyk- opened this issue Feb 17, 2018 · 4 comments

vozhyk- commented Feb 17, 2018

### System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Funtoo |
| Distribution Version | |
| Linux Kernel | 4.1.15-aufs |
| Architecture | x86_64 |
| ZFS Version | 0.7.6-2_gc558c93f |
| SPL Version | 0.7.6-r0-gentoo |

### Describe the problem you're observing

With little free RAM left (around 30–50 MB) and no swap, some processes got stuck in the D (uninterruptible sleep) state:

```
2018-02-17 16:39:11 vozhx-thinkpad /home/vozhyk # ps aux | grep '  D'
root      2386  0.0  0.0      0     0 ?        DN   00:09   0:01 [dbuf_evict]
root      2985  0.0  0.0      0     0 ?        D<   00:14   0:15 [z_wr_int_2]
root      3170  0.0  0.0      0     0 ?        D    00:14   0:36 [txg_sync]
portage   9110  0.0  0.0  20384  2684 pts/4    DN   16:10   0:00 git status --porcelain
vozhyk    9415  0.0  0.4 301516 14360 ?        DN   16:10   0:00 konsole -e sudo su
vozhyk   10182  0.0  0.4 301516 14388 ?        DN   16:11   0:00 konsole -e sudo su
vozhyk   10610  0.0  0.5 301516 14788 ?        DN   16:11   0:00 konsole -e sudo su
vozhyk   10645  0.0  0.5 301516 14788 ?        DN   16:12   0:00 konsole -e sudo su
vozhyk   11171  0.0  0.5 301516 14800 ?        DN   16:19   0:00 konsole -e sudo su
root     12847  0.0  0.0  12852  2044 pts/7    SN+  16:39   0:00 grep --color=auto   D
```

I enabled swap after that, but the system didn't start using it (it reported 0M used).
After a while, more processes hung as well and I had to trigger a kernel panic.
Some files written beforehand weren't created/updated after the reboot (weechat logs and a file written with `cat >`).

```
2018-02-17 18:54:18 vozhx-thinkpad /home/vozhyk # zpool status
  pool: sm
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0h0m with 0 errors on Wed Dec  6 03:52:09 2017
config:

        NAME        STATE     READ WRITE CKSUM
        sm          ONLINE       0     0     0
          sda1      ONLINE       0     0     0

errors: No known data errors

  pool: sm-boot
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: none requested
config:

        NAME                                                   STATE     READ WRITE CKSUM
        sm-boot                                                ONLINE       0     0     0
          ata-Samsung_SSD_850_EVO_250GB_S2R6NX0JB14474X-part3  ONLINE       0     0     0

errors: No known data errors
```

(Both pools are on the same disk.)

#5711 (comment) seems similar (has the same kind of disks, the same error message, and the system hangs), and #5711 itself has the same error message but happens when importing a pool.

### Describe how to reproduce the problem

I haven't tried to reproduce it.

### Include any warning/errors/backtraces from the system logs

```
[57648.727319] VERIFY3(0 == arc_buf_alloc_impl(hdr, private, compressed_read, B_TRUE, &buf)) failed (0 == 0)
[57648.727326] PANIC at arc.c:5295:arc_read()
[57648.727329] Showing stack for process 9110
[57648.727334] CPU: 0 PID: 9110 Comm: git Tainted: P           O    4.1.15-aufs #1
[57648.727336] Hardware name: LENOVO        /       , BIOS 7LETC9WW (2.29 ) 03/18/2011
[57648.727339]  00000000000014af ffff8800a2ddf858 ffffffff819625f4 0000000000000007
[57648.727344]  ffffffffa028676c ffff8800a2ddf868 ffffffffa00648c4 ffff8800a2ddf9e8
[57648.727348]  ffffffffa0064986 ffff88006ad137f8 0000000000000030 ffff8800a2ddf9f8
[57648.727352] Call Trace:
[57648.727363]  [<ffffffff819625f4>] dump_stack+0x45/0x57
[57648.727378]  [<ffffffffa00648c4>] spl_dumpstack+0x44/0x50 [spl]
[57648.727384]  [<ffffffffa0064986>] spl_panic+0xb6/0x110 [spl]
[57648.727390]  [<ffffffffa006050d>] ? spl_kmem_cache_alloc+0x10d/0x7f0 [spl]
[57648.727410]  [<ffffffffa01ff94a>] ? zio_decompress_data+0x5a/0x70 [zfs]
[57648.727422]  [<ffffffffa013451e>] ? arc_space_return+0x37e/0x520 [zfs]
[57648.727433]  [<ffffffffa01348ae>] ? arc_decompress+0x1ee/0x980 [zfs]
[57648.727444]  [<ffffffffa0135a35>] arc_read+0x9f5/0x34a0 [zfs]
[57648.727457]  [<ffffffffa013fc39>] ? dbuf_read+0x1249/0x13b0 [zfs]
[57648.727469]  [<ffffffffa01402a0>] ? dbuf_rele_and_unlock+0x4c0/0x5c0 [zfs]
[57648.727482]  [<ffffffffa013f238>] dbuf_read+0x848/0x13b0 [zfs]
[57648.727487]  [<ffffffffa005ee4a>] ? spl_kmem_free+0x2a/0x40 [spl]
[57648.727502]  [<ffffffffa0149de5>] dmu_buf_rele_array+0x135/0x4c0 [zfs]
[57648.727516]  [<ffffffffa014a2a4>] dmu_buf_hold_array_by_bonus+0x134/0x1f0 [zfs]
[57648.727530]  [<ffffffffa014a3a9>] dmu_read+0x49/0x70 [zfs]
[57648.727546]  [<ffffffffa01ea967>] zfs_getpage+0x117/0x1d0 [zfs]
[57648.727559]  [<ffffffffa020209c>] zpl_putpage+0x26c/0xba0 [zfs]
[57648.727564]  [<ffffffff8112503d>] filemap_fault+0x23d/0x3d0
[57648.727569]  [<ffffffff8114abc1>] __do_fault+0x41/0xa0
[57648.727573]  [<ffffffff8114e04f>] handle_mm_fault+0xbff/0x1030
[57648.727577]  [<ffffffff8196772c>] ? __schedule+0x6ac/0xff0
[57648.727582]  [<ffffffff81042f4e>] __do_page_fault+0x15e/0x3d0
[57648.727586]  [<ffffffff810431fc>] do_page_fault+0xc/0x10
[57648.727589]  [<ffffffff8196d632>] page_fault+0x22/0x30
[57758.549708] Adding 5242876k swap on /dev/sda4.  Priority:-1 extents:1 across:5242876k SSFS
[57760.945614] Process accounting resumed
```

The `0 == 0` part is probably due to openzfs/spl#684.

`bt`, `dmesg` and `ps` output from the crash dump taken after more processes hung: https://gist.github.com/vozhyk-/ff5b96cb11c964a3aa2e0baa374162a6

@behlendorf

@vozhyk- by chance is this pool using the new encryption feature?

@behlendorf behlendorf added this to the 0.8.0 milestone Feb 20, 2018

vozhyk- commented Feb 20, 2018

@behlendorf no. The 2 extra commits in ZFS don't touch ZFS code (just add some libraries and a binary).

It was running a version of ZFS with encryption support for some time, but I never enabled or used it.

@behlendorf

Thanks, that narrows this issue down to a failure here decompressing the data, which is almost certainly due to a memory allocation failure. That's good, since it jibes with the stack trace you posted. I'd expect the proposed changes in #7170 to help mitigate this, and of course we'll want to handle this more gracefully, since it clearly can happen.

@behlendorf

Closing. This should have been resolved.
