
l2arc_feed stuck at 100% CPU #3259

Closed
odoucet opened this issue Apr 6, 2015 · 12 comments
Labels
Component: Memory Management, Type: Performance

Comments

@odoucet

odoucet commented Apr 6, 2015

The problem I reported on several pull requests (#3190 (comment) and #3216 (comment)) was a system hang when doing a 100% read pattern with 16 threads. After some digging, I found the problem is located in l2arc_feed, so I think it's time to create a bug report of its own and stop polluting other pull requests / bug reports.

System :

196GB RAM - dual CPU (E5-2430)
11 mirrors (22 drives) + mirrored log (2 SSD drives) + L2ARC (2 drives)
CentOS 6.5 - Kernel 3.10.73 (elrepo)

Zpool :

  • 10 mirrors (20 drives SAS 10K.7)
  • logs 1 mirror (2 SSD 100GB)
  • cache 2 drives (2 SSD 800GB)

ZFS tuning

options zfs zfs_arc_min=161061273600
options zfs zfs_arc_max=161061273600
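
(For reference, not part of the original report: a minimal sanity check, reading the limits back from arcstats to confirm the module picked them up.)

# Read the configured ARC bounds and the current ARC size back from the kstats.
awk '$1 == "c_min" || $1 == "c_max" || $1 == "size" {print $1, $3}' \
    /proc/spl/kstat/zfs/arcstats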

Benchmark

time find /vol -type f -print0 |xargs -0 -n 20 -P16 cat > /dev/null

Results

  1. SPL+ZFS trunk on March 30th
    System hang after ~ 1 hour
  2. SPL+ZFS trunk on April 4th
    System hang after ~ 2 hours
  3. Pull request #3190 (cumulative #2129 + #3115; SPL v0.6.3-81_gfade6b0, ZFS v0.6.3-259_g0f7d2a4)
    System hang after ~ 1 hour
  4. Pull request #3216 (#2129 + new tracepoint multilist #3115)
    System hang after ~ 2 hours
  5. Pull request #3216 (#2129 + new tracepoint multilist #3115) + doubling ZFS_OBJ_MTX_SZ to 512 as suggested by @kernelOfTruth
    System hang after ~ 2 hours

For all these results, the stack trace is the same:

INFO: task *ANY PROCESS NAME BLOCKING HERE*:1473 blocked for more than 180 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
arc_adapt       D 000000000000000e     0  1473      2 0x00000000
ffff8817cfb99bb8 0000000000000046 ffff8817cfb99fd8 0000000000013140
ffff8817cfb98010 0000000000013140 0000000000013140 0000000000013140
ffff8817cfb99fd8 0000000000013140 ffff8818054f0180 ffff8818061ea6c0
Call Trace:
[<ffffffff815eee69>] schedule+0x29/0x70
[<ffffffff815ef17e>] schedule_preempt_disabled+0xe/0x10
[<ffffffff815ed8d4>] __mutex_lock_slowpath+0x194/0x240
[<ffffffff815ed71b>] mutex_lock+0x2b/0x50
[<ffffffffa02d748c>] arc_hdr_realloc+0xbc/0x210 [zfs]
[<ffffffffa02db1ef>] arc_evict_hdr+0x29f/0x340 [zfs]
[<ffffffff815ed70e>] ? mutex_lock+0x1e/0x50
[<ffffffffa02db464>] arc_evict_state_impl+0x1d4/0x2d0 [zfs]
[<ffffffffa02db687>] arc_evict_state+0x127/0x1f0 [zfs]
[<ffffffffa02dbd13>] arc_adjust_impl.clone.0+0x33/0x40 [zfs]
[<ffffffffa02dbee4>] arc_adjust+0x1c4/0x290 [zfs]
[<ffffffffa027a820>] ? __thread_create+0x160/0x160 [spl]
[<ffffffffa02dc10d>] arc_adapt_thread+0x15d/0x220 [zfs]
[<ffffffffa02dbfb0>] ? arc_adjust+0x290/0x290 [zfs]
[<ffffffffa027a898>] thread_generic_wrapper+0x78/0x90 [spl]
[<ffffffffa027a820>] ? __thread_create+0x160/0x160 [spl]
[<ffffffff810821be>] kthread+0xce/0xe0
[<ffffffff810820f0>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff815f93c8>] ret_from_fork+0x58/0x90
[<ffffffff810820f0>] ? kthread_freezable_should_stop+0x70/0x70

All processes accessing the zpool dump this message (seen for arc_adapt, 'cat', 'zfs', etc.).

Various information is available in this gist:
https://gist.github.com/odoucet/6d18e4a91ad1fe90d181
including arcstats, slabinfo and top output.

"Why are you saying this comes from l2arc_feed" ?

The 'top' command running when the system hung showed the l2arc_feed process using 99.9% CPU. So I ran a new test on the same SPL+ZFS version (trunk + pull #3216), but with

zfs set secondarycache=none mypool

and ... the benchmark ran fine (it finished in 4h06m, the first time I succeeded in finishing it):

$ time find /vol -type f -print0 |xargs -0 -n 20 -P16 cat > /dev/null
real    243m37.864s
user    9m5.137s
sys     152m45.585s
@kernelOfTruth
Contributor

@odoucet could you please run your tests with

secondarycache=metadata

That would be the other remaining option to test.

I already suspected from the reports that there must be some underlying issue with L2ARC (which is, btw, also mentioned on the FreeBSD mailing lists: performance or locking behavior is occasionally better without an L2ARC device than with one).

Reducing lock contention and tweaking L2ARC access seems to be the next logical step to tackle (which has already partly been done with recent ZFS changes).

I'm just curious - do you have any form of compression enabled?

If not - how does that change the success rate with/without L2ARC?

@brendangregg

Have any ZFS parameters been tuned? E.g.:

# cat /sys/module/zfs/parameters/l2arc_write_max 
8388608

Some people have tried increasing this in the hope that the L2ARC warms up faster, but increasing it can make the L2ARC feed thread burn a CPU as it tries and fails to find fresh buffers to cache.
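
(For reference, a sketch of inspecting the related L2ARC feed tunables at runtime; the parameter names are standard ZFS on Linux module parameters, and the 8388608 value is just the documented default, not a recommendation.)

# Inspect the current L2ARC feed tunables.
grep . /sys/module/zfs/parameters/l2arc_write_max \
       /sys/module/zfs/parameters/l2arc_write_boost \
       /sys/module/zfs/parameters/l2arc_feed_secs \
       /sys/module/zfs/parameters/l2arc_noprefetch

# Example only: set l2arc_write_max back to its 8 MiB default at runtime.
echo 8388608 > /sys/module/zfs/parameters/l2arc_write_max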

@odoucet
Author

odoucet commented Apr 7, 2015

No parameters tuned except zfs_arc_min / zfs_arc_max.
Compression is lz4. Here are all properties:

NAME     PROPERTY              VALUE                  SOURCE
filerXX  type                  filesystem             -
filerXX  creation              Tue Sep 30 18:56 2014  -
filerXX  used                  3.51T                  -
filerXX  available             8.25T                  -
filerXX  referenced            564K                   -
filerXX  compressratio         1.50x                  -
filerXX  mounted               yes                    -
filerXX  quota                 none                   default
filerXX  reservation           none                   default
filerXX  recordsize            128K                   default
filerXX  mountpoint            /vol                   local
filerXX  sharenfs              off                    default
filerXX  checksum              on                     default
filerXX  compression           lz4                    local
filerXX  atime                 off                    local
filerXX  devices               on                     default
filerXX  exec                  on                     default
filerXX  setuid                on                     default
filerXX  readonly              off                    default
filerXX  zoned                 off                    default
filerXX  snapdir               hidden                 default
filerXX  aclinherit            restricted             default
filerXX  canmount              on                     default
filerXX  xattr                 on                     default
filerXX  copies                1                      default
filerXX  version               5                      -
filerXX  utf8only              off                    -
filerXX  normalization         none                   -
filerXX  casesensitivity       sensitive              -
filerXX  vscan                 off                    default
filerXX  nbmand                off                    default
filerXX  sharesmb              off                    default
filerXX  refquota              none                   default
filerXX  refreservation        none                   default
filerXX  primarycache          all                    default
filerXX  secondarycache        none                   local
filerXX  usedbysnapshots       160K                   -
filerXX  usedbydataset         564K                   -
filerXX  usedbychildren        3.51T                  -
filerXX  usedbyrefreservation  0                      -
filerXX  logbias               latency                default
filerXX  dedup                 off                    default
filerXX  mlslabel              none                   default
filerXX  sync                  standard               default
filerXX  refcompressratio      7.84x                  -
filerXX  written               8K                     -
filerXX  logicalused           4.97T                  -
filerXX  logicalreferenced     1.65M                  -
filerXX  snapdev               hidden                 default
filerXX  acltype               off                    default
filerXX  context               none                   default
filerXX  fscontext             none                   default
filerXX  defcontext            none                   default
filerXX  rootcontext           none                   default
filerXX  relatime              off                    default
filerXX  redundant_metadata    all                    default
filerXX  overlay               off                    default

I'll test with secondarycache=metadata

@odoucet
Author

odoucet commented Apr 7, 2015

@kernelOfTruth the test finished successfully with secondarycache=metadata:

real    246m13.190s
user    9m18.723s
sys     267m34.155s

So only a 1.06% performance difference; I think we can say performance is equal.

There were a few minutes where the system appeared less responsive, but it kept working the whole time (no big stall like with secondarycache=all).

For the last hour of the test, here is the number of times each process appeared using more than 90% of one core (5 s sampling interval; a sketch of one way to gather such a count follows the list):

    428 cat
     91 kswapd0
     75 l2arc_feed
     42 *monitoring task grabing arcstats / zil infos*
     24 khugepaged
     12 z_rd_int/16
      8 xargs
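
(A rough sketch of one way to gather such a count - an assumption about the method, not the exact script used here.)

# Sample every 5 seconds; column positions assume the default top layout
# (%CPU in field 9, command name in field 12) and may need adjusting.
while sleep 5; do
    top -b -n 1 | awk 'NR > 7 && $9 > 90 {print $12}'
done >> cpu_hogs.log

# After the run, count how often each process appeared:
sort cpu_hogs.log | uniq -c | sort -rn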

L2ARC stats:

l2_hits                         4    24066
l2_misses                       4    70100638
l2_feeds                        4    86573
l2_rw_clash                     4    1
l2_read_bytes                   4    165331968
l2_write_bytes                  4    3252802560
l2_writes_sent                  4    11705
l2_writes_done                  4    11705
l2_writes_error                 4    0
l2_writes_lock_retry            4    9
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    2481
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    1459
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    21079414784
l2_asize                        4    3252712448
l2_hdr_size                     4    200630400
l2_compress_successes           4    1252822
l2_compress_zeros               4    0
l2_compress_failures            4    0
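
(A quick, hedged way to turn the two counters above into a hit rate; with the numbers shown it works out to roughly 0.03%, i.e. the L2ARC contributed very little to this run.)

# Compute the L2ARC hit rate from the running counters.
awk '$1 == "l2_hits" {h = $3} $1 == "l2_misses" {m = $3}
     END {printf "L2ARC hit rate: %.2f%%\n", 100 * h / (h + m)}' \
    /proc/spl/kstat/zfs/arcstats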

@odoucet
Author

odoucet commented Apr 7, 2015

Another benchmark with secondarycache=metadata:

find /vol -type f |wc -l

So this time I have a single process doing the hard work, reading just metadata.

I was able to catch what happened when the system hung for a few minutes. The find task is blocked with this stack:

[<ffffffff815eee69>] schedule+0x29/0x70
[<ffffffff815f00a5>] rwsem_down_write_failed+0xf5/0x1b0
[<ffffffff812ab873>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff815edf91>] ? down_write+0x31/0x40
[<ffffffff811628c0>] SyS_brk+0x40/0x180
[<ffffffff815f9479>] system_call_fastpath+0x16/0x1b

My monitoring process is also blocked, with this stack trace:

[<ffffffff8108fb6a>] ttwu_do_activate.clone.0+0x4a/0x60
[<ffffffff81180dee>] kmem_cache_shrink+0x3e/0x70
[<ffffffffa02759dc>] spl_kmem_cache_reap_now+0x13c/0x190 [spl]
[<ffffffffa02d934b>] arc_kmem_reap_now+0x4b/0xd0 [zfs]
[<ffffffffa02d94f0>] __arc_shrinker_func+0x120/0x240 [zfs]
[<ffffffffa02d961e>] arc_shrinker_func+0xe/0x10 [zfs]
[<ffffffff811421fd>] shrink_slab+0x18d/0x2c0
[<ffffffff81144364>] do_try_to_free_pages+0x2d4/0x3d0
[<ffffffff81144762>] try_to_free_pages+0xf2/0x160
[<ffffffff81139994>] __alloc_pages_slowpath+0x3a4/0x7d0
[<ffffffff8113a0cd>] __alloc_pages_nodemask+0x30d/0x330
[<ffffffff8117ff57>] kmem_getpages+0x67/0x1e0
[<ffffffff81181a0b>] fallback_alloc+0x18b/0x270
[<ffffffff811817b5>] ____cache_alloc_node+0x95/0x160
[<ffffffff811827c3>] __kmalloc+0x163/0x2b0
[<ffffffff8120aea3>] stat_open+0x63/0xc0
[<ffffffff81202db0>] proc_reg_open+0x80/0x110
[<ffffffff81199b97>] do_dentry_open+0x177/0x2e0
[<ffffffff81199e15>] finish_open+0x35/0x50
[<ffffffff811a9c99>] do_last+0x419/0x750
[<ffffffff811ac848>] path_openat+0xb8/0x460
[<ffffffff811acd29>] do_filp_open+0x49/0xa0
[<ffffffff8119b128>] do_sys_open+0x108/0x1f0
[<ffffffff8119b24e>] SyS_open+0x1e/0x20
[<ffffffff815f9479>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

arcstats shows that the l2_abort_lowmem counter incremented during the run.
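
(Illustrative only - a small polling loop to watch that counter, together with the ARC size and memory-throttle counter, while the benchmark runs.)

# Poll the low-memory-related counters every 5 seconds.
while sleep 5; do
    date +%T
    grep -E '^(l2_abort_lowmem|memory_throttle_count|size) ' \
        /proc/spl/kstat/zfs/arcstats
done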

@odoucet
Author

odoucet commented Apr 8, 2015

Removing the cache drives leads to other stalls:

[<ffffffff81180dee>] kmem_cache_shrink+0x3e/0x70
[<ffffffffa02779dc>] spl_kmem_cache_reap_now+0x13c/0x190 [spl]
[<ffffffffa02db367>] arc_kmem_reap_now+0x67/0xd0 [zfs]
[<ffffffffa02db4f0>] __arc_shrinker_func+0x120/0x240 [zfs]
[<ffffffffa02db61e>] arc_shrinker_func+0xe/0x10 [zfs]
[<ffffffff811421fd>] shrink_slab+0x18d/0x2c0
[<ffffffff81145548>] balance_pgdat+0x5e8/0x780
[<ffffffff8114584d>] kswapd+0x16d/0x270
[<ffffffff810821be>] kthread+0xce/0xe0
[<ffffffff815f93c8>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

@odoucet
Author

odoucet commented Apr 9, 2015

Lowering the ARC size from 160G to 100G (on a 196GB RAM server) seems to do the trick: no more stalls. Will do some additional testing with L2ARC back on.
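
(For reference, a sketch of making that change at runtime on builds where zfs_arc_max is writable; the 100G value is the one from this comment, and the in-use ARC only shrinks as eviction catches up.)

# Lower the ARC cap to 100G without reloading the module (value in bytes).
echo 107374182400 > /sys/module/zfs/parameters/zfs_arc_max
# Confirm the new cap and watch the current size converge.
grep -E '^(c_max|size) ' /proc/spl/kstat/zfs/arcstats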

@kernelOfTruth
Contributor

It's unfortunate that the ARC can't be that huge - during those loads, is most of the work being spent in kernel/system space instead of user space?

So the benchmark succeeded:

How is the system's responsiveness?

Did it take significantly longer, or about the same time (compared to e.g. a 160G ARC)?

Memory pressure appears significantly higher with an L2ARC, and more so with more than one cache drive (room for improvement?):

There must be a sweet spot for ARC size relative to performance (e.g. 65% of RAM for the ARC - maybe lower with an L2ARC) - so it's not clear whether the default ARC size of 50% of RAM is a sane & safe default or a conservative value to prevent (locking, swapping, latency) issues.

Do you have any stats on whether throughput was negatively affected by the smaller ARC?

Latency and/or responsiveness of the system during those heavy-load scenarios could be improved by raising the values of

[1] /proc/sys/vm/min_free_kbytes
(I, for example, set it to a rather large value of 800 MB on 32 GB RAM)
[2] /proc/sys/vm/mmap_min_addr
(raised to 65536 from 4096)

[1] http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html ,
[2] https://wiki.debian.org/mmap_min_addr
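
(The runtime equivalents of those two suggestions, using the values quoted above; treat them as starting points, not recommendations.)

# ~800 MB reserve, as suggested above (value is in kB).
sysctl -w vm.min_free_kbytes=819200
sysctl -w vm.mmap_min_addr=65536

# To persist across reboots, add the same keys to /etc/sysctl.conf:
#   vm.min_free_kbytes = 819200
#   vm.mmap_min_addr = 65536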

Have you tried raising /sys/module/zfs/parameters/l2arc_write_max with a smaller ARC, to see how that affects performance and stability?
http://serverfault.com/questions/534751/how-to-fetch-current-l2arc-max-write-setting-on-illumos

@odoucet
Author

odoucet commented Apr 9, 2015

during those loads, is most of the work being spent in kernel/system space instead of user space?

95% is iowait, 5% is system, <1% is user. Total CPU usage is ~ 50%

How is the system's responsiveness?

The system stays perfectly responsive with the smaller ARC. But note that the 100G ARC used up to 188GB of memory!

Did it take significantly longer, or about the same time (compared to e.g. a 160G ARC)?

Performance is ... better!

ARC size         Time needed
160G ARC         258 minutes
3G ARC           286 minutes
100G ARC         217 minutes
100G ARC+L2ARC   217 minutes

(All with 0.6.3-trunk)

Do you have any stats on whether throughput was negatively affected by the smaller ARC?

My benchmark emulates reading all files on 16 descriptors, and it ran faster, so I think I can say throughput was better with the smaller ARC.

Oh and yes, I will definitely use the 0.6.4 release for the new tests :)

@behlendorf added the Type: Performance, Component: Memory Management, and Bug - Minor labels on Apr 10, 2015
@kernelOfTruth
Contributor

@odoucet

Did the read results get faster after several passes with an L2ARC?

Thanks for providing those stats, glad responsiveness improved

that ARC overgrowth is disturbing 😲

FYI:

#1012 (comment)

Ok, so currently it makes no sense to add a zil to a ZOL pool, if there are large synchronous writes like iscsi or nfs to zvols?

Basically, yes. If you care about large synchronous writes, then adding a slog might be counter-productive. Note that this is true for all ZFS implementations, including FreeBSD and Illumos.

I assume you sidestep the issue if you set the logbias=throughput ???

Yes, but if you set logbias=throughput then the slog is never used for synchronous writes, even small ones, so that makes the slog useless (unless you have some other dataset that uses it).

@odoucet

Could you please try again with a large ARC (e.g. 160G) - with/without L2ARC - and with the patch provided in

#3283

to see whether anything changes?

I'm sure @dweeezil would appreciate it - since it appears related ;)

Thank you

@odoucet
Author

odoucet commented Apr 13, 2015

Yes, the second pass is faster. For a metadata-only read (time find /vol |wc) it is ~ 50% faster on the second pass.
I'll check #3283

@gmelikov
Member

Closing as stale.

If it's still an issue, feel free to reopen.
