Improve ARC behavior with metadata heavy workloads #1967

Closed
wants to merge 17 commits into from

Conversation

prakashsurya
Member

Please see individual commit messages for details on each change.

behlendorf and others added 6 commits December 12, 2013 14:41
A GPF may occur if an l2arc buffer is evicted before the write
completes for the l2arc buffer.  The l2arc_write_done() function
unconditionally starts walking the list at the write head so we
must verify that l2arc_evict() has not removed it from the list.

This long-standing issue was exposed by allowing arc_adjust() to
evict entries from the l2arc due to memory pressure.  Prior to
this change buffers would only be evicted immediately in front
of the write hand or during l2arc cleanup.  In neither case could
we wrap around and remove the write head as part of l2arc_evict().

PID: 3182   TASK: ffff88020df28080  CPU: 0   COMMAND: "z_null_int/0"
  [exception RIP: l2arc_write_done+128]
 6 [ffff880210e97c58] zio_done at ffffffffa0597e0c [zfs]
 7 [ffff880210e97cd8] zio_done at ffffffffa05982a7 [zfs]
 8 [ffff880210e97d58] zio_done at ffffffffa05982a7 [zfs]
 9 [ffff880210e97dd8] zio_execute at ffffffffa05948a3 [zfs]
10 [ffff880210e97e18] taskq_thread at ffffffffa041d8d8 [spl]
11 [ffff880210e97ee8] kthread at ffffffff81090d26
12 [ffff880210e97f48] kernel_thread at ffffffff8100c14a

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
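For illustration, a minimal userspace sketch of the guard this fix implies (the list type and helper here are hypothetical stand-ins, not the actual ZFS code):

    #include <stddef.h>

    /* Hypothetical stand-in for the kernel's list node type. */
    typedef struct node { struct node *next; } node_t;

    /*
     * Sketch of the idea behind the fix: before l2arc_write_done()
     * walks forward from the write head, verify the head is still
     * linked on the buflist, i.e. l2arc_evict() has not removed it.
     */
    static int head_still_linked(node_t *list, node_t *write_head)
    {
        for (node_t *n = list; n != NULL; n = n->next)
            if (n == write_head)
                return 1;   /* safe to walk from the write head */
        return 0;           /* head evicted; walking it would GPF */
    }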
Decrease the minimum ARC size from 1/32 of total system memory
(or 64MB) to a much smaller 4MB.

1) Large systems with over 1TB of memory are being deployed
   and reserving 1/32 of this memory (32GB) as the minimum
   requirement is overkill.

2) Tiny systems like the Raspberry Pi may only have 256MB of
   memory in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs
the memory for some other purpose.  If you want to ensure the
ARC is never completely reclaimed due to memory pressure you
may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
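To make the numbers concrete, here is a small standalone C program working through the arithmetic from the commit message (the 1/32 divisor, 64MB floor, and 4MB default are taken from the text above):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t allmem   = 1ULL << 40;          /* a 1TB system     */
        uint64_t old_32nd = allmem / 32;         /* 1/32 of memory   */
        uint64_t old_min  = old_32nd > (64ULL << 20)
                          ? old_32nd : (64ULL << 20);
        uint64_t new_min  = 4ULL << 20;          /* new 4MB default  */

        /* old minimum reserves 32GB on this box; new reserves 4MB */
        printf("old arc_c_min = %llu MiB\n",
            (unsigned long long)(old_min >> 20));
        printf("new arc_c_min = %llu MiB\n",
            (unsigned long long)(new_min >> 20));
        return 0;
    }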
In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:

    commit 302f753
    Author: Brian Behlendorf <behlendorf1@llnl.gov>
    Date:   Tue Mar 13 14:29:16 2012 -0700

        Integrate ARC more tightly with Linux

Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel starts calling the arc's shrinkers.

This change attempts to fix this by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.

In addition, the code now attempts to maintain a minimum value of arc_p,
similar to the way a minimum arc_p value is maintained in arc_adapt().

Signed-off-by: Prakash Surya <surya1@llnl.gov>
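As a rough model of the change (simplified userspace C; the names and the arc_p floor are illustrative, not lifted from the patch):

    #include <stdint.h>

    static uint64_t arc_c, arc_c_min;   /* target size and its floor */
    static uint64_t arc_p, arc_p_min;   /* mru target and its floor  */

    /*
     * Decrement arc_p by "bytes" the same way arc_c is decremented,
     * while keeping both values above their respective minimums so
     * that arc_p no longer collapses to near 0 under shrinker calls.
     */
    static void arc_shrink_model(uint64_t bytes)
    {
        arc_c = (arc_c > arc_c_min + bytes) ? arc_c - bytes : arc_c_min;
        arc_p = (arc_p > arc_p_min + bytes) ? arc_p - bytes : arc_p_min;
    }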
Currently there is no mechanism in place to reclaim or limit the number
of L2ARC headers contained in RAM. This means that given a large enough
L2ARC device, the headers can accumulate until they push out all other
data, and can potentially cause an out-of-memory event on the system.

While we'd rather not discard these headers, because it invalidates any
blocks written to the L2ARC device, we also don't want these headers to
push out all other (potentially more useful) data on the system and/or
cause the system to hit an out-of-memory event.

As a first attempt to fix this issue, a new "l2arc_hdr_limit" module
parameter has been introduced. This option allows an administrator to
set a limit on the cumulative size that L2ARC headers can grow to on the
system. By default this limit is 1/8th of the total ARC size.

To do this a l2arc_evict_headers() function has been added. It will
attempt to reclaim N bytes of l2arc headers by repeatedly advancing the
write hand and evicting blocks from the l2arc.

In addition, since we don't prune the L2ARC devices based on
"arc_meta_limit" or even "arc_c_max", these headers have been
completely taken out of the calculation of "arc_meta_used" and
"arc_size".

Signed-off-by: Prakash Surya <surya1@llnl.gov>
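A sketch of the reclaim loop described above (the names and the evict_pass() callback are hypothetical; the real l2arc_evict_headers() operates on the device's buflist):

    #include <stdint.h>

    static uint64_t l2_hdr_size;   /* current bytes of L2ARC headers */

    /*
     * Repeatedly advance the write hand and evict blocks until about
     * "bytes" worth of header space has been reclaimed. evict_pass()
     * stands in for one advance-and-evict step and returns the
     * number of header bytes it freed.
     */
    static void l2arc_evict_headers_model(uint64_t bytes,
        uint64_t (*evict_pass)(void))
    {
        uint64_t reclaimed = 0;

        while (reclaimed < bytes && l2_hdr_size > 0) {
            uint64_t freed = evict_pass();

            if (freed == 0)
                break;                   /* nothing left to evict */
            reclaimed   += freed;
            l2_hdr_size -= (freed < l2_hdr_size) ? freed : l2_hdr_size;
        }
    }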
The meanings of the arc_meta_used and arc_size values are ambiguous at
best. Prior to this change, the best definitions I could come up with
are the following:

    * "arc_meta_used" is the sum of bonus buffers, dmu_buf_impl_t's,
      dnode_t's, arc headers, and arc buffers of type ARC_BUFC_METADATA

    * "arc_size" is the sum of arc_meta_used and arc buffers of type
      ARC_BUFC_DATA

With these two definitions in mind, their limits and associated logic for
eviction and arc tuning don't make a whole lot of sense; and here's why:

    1. First off, in the arc_adapt thread, we call arc_adjust_meta when
       arc_meta_used is greater than arc_meta_limit. The problem is that
       we evict buffers of type ARC_BUFC_METADATA from the arc's mru
       and mfu before we ever try to evict from the upper layers (via
       the prune callbacks). This can potentially lead us to evicting
       "hot" buffers from the mru and/or mfu while leaving "cold"
       dnode_t's, dmu_buf_impl_t's, etc. in the cache (consuming
       precious arc_meta_used space).

    2. Even though arc headers consume space within arc_meta_used, the
       ghost lists are never pruned based on arc_meta_limit. This can
       lead to a *very* unfortunate situation where the arc_meta_used
       space is consumed entirely by "useless" arc headers (any new arc
       buffers of type ARC_BUFC_METADATA would almost immediately be
       evicted and placed on its respective ghost list due to the
       arc_adapt thread calling arc_adjust_meta)

    3. The arc code appears to be littered with the following assumptions:

           a) the size of the mfu is capped by arc_c - arc_p
           b) the sum of the mru and mfu sizes is capped by arc_c
           c) and finally, the sum of the mru and mfu sizes is arc_size

       The problem is, (c) is blatantly untrue (see previous definition
       above), which then invalidates the other assumptions as well.
       Using the proper definition of arc_size, we can arrive at the
       following truths:

           d) the arc_size is capped by arc_c
           e) the mru size is capped by arc_p
           f) the sum of arcstat_[data,other,hdr]_size is arc_size
           g) the sum of the mru and mfu sizes is arcstat_data_size
           h) which leads to, the sum of the mru and mfu sizes being
              capped by arc_c - arcstat_other_size - arcstat_hdr_size

       The unveiling of (h) means that, although unlikely, for certain
       values of arc_c, arcstat_other_size, and arcstat_hdr_size, one
       can completely starve the arc lists.

       The more likely scenario is not that *both* lists are starved
       simultaneously; it's that *one* list is starved for a certain
       time interval. For example, I've seen cases where the mru list
       grows to arc_p, and the mfu is squashed to a negligible size
       because its space (the incorrectly assumed arc_c - arc_p space)
       is actually being consumed by arc header buffers, dnodes, etc.

As an attempt to remedy the situation, the meanings of the arc_meta_used
and arc_size values have been redefined as the following:

    * "arc_meta_used" is the sum of bonus buffers, and dmu_buf_impl_t's
      (i.e. arc_meta_used == arcstat_other_size)

    * "arc_size" is the sum of all arc buffers and arc headers
      (i.e. arc_size == arcstat_data_size + arcstat_hdr_size)

In addition, to directly address issues (1) and (2), we no longer prune
buffers of type ARC_BUFC_METADATA from the arc lists when over
arc_meta_limit. With the new definition of arc_meta_used, that
functionality doesn't make sense (buffers residing in the arc lists no
longer factor into arc_meta_used).

Now, when over arc_meta_limit, we only call into the upper layers via
the prune callbacks. Ideally we would also reap from the kmem caches
for the objects which account for arc_meta_used (e.g. dnode_t,
dmu_buf_impl_t, etc.), but none of them have associated shrinkers.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
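To restate the redefinition in executable form, here is a tiny model that checks the new identities against an arcstats snapshot (field names mirror /proc/spl/kstat/zfs/arcstats; this is an illustration of the definitions above, not ZFS code):

    #include <assert.h>
    #include <stdint.h>

    /*
     * Under the new definitions:
     *   arc_meta_used == other_size (bonus buffers, dmu_buf_impl_t's)
     *   arc_size      == data_size + hdr_size
     */
    static void check_new_definitions(uint64_t data_size,
        uint64_t hdr_size, uint64_t other_size,
        uint64_t arc_meta_used, uint64_t arc_size)
    {
        assert(arc_meta_used == other_size);
        assert(arc_size == data_size + hdr_size);
    }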
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.

Running a workload consisting of two processes:

    * Process 1 is creating many small files
    * Process 2 is tar'ing a directory consisting of many small files

I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.

Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constantly drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).

The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:

    commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
    Author: maybee <none@none>
    Date:   Wed Dec 20 15:46:12 2006 -0800

        6505658 target MRU size (arc.p) needs to be adjusted more aggressively

and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.

As a way to test out how its removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
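A hedged sketch of how such a disable-by-default gate might look (the parameter name here is made up for illustration; check the patch itself for the actual module option):

    #include <stdint.h>

    /* Hypothetical module option; 0 disables the legacy arc_p bump. */
    static int zfs_arc_p_grow_enable = 0;

    /* Gate the aggressive arc_p growth done in arc_get_data_buf(). */
    static void maybe_grow_arc_p(uint64_t *arc_p, uint64_t arc_c,
        uint64_t bytes)
    {
        if (!zfs_arc_p_grow_enable)
            return;                 /* new default: leave arc_p alone */

        if (*arc_p + bytes < arc_c)
            *arc_p += bytes;        /* legacy: drive arc_p upward */
        else
            *arc_p = arc_c;
    }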
@DeHackEd
Contributor

I can't rule out that I've done something wrong, but running with this patch I find the ARC is completely failing to regulate its size.

# cat /proc/spl/kstat/zfs/arcstats 
49 1 0x01 86 4128 448634175465 1434096363469
name                            type data
hits                            4    2509299
misses                          4    493325
demand_data_hits                4    0
demand_data_misses              4    0
demand_metadata_hits            4    2359610
demand_metadata_misses          4    281858
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    149689
prefetch_metadata_misses        4    211467
mru_hits                        4    272077
mru_ghost_hits                  4    12180
mfu_hits                        4    2087533
mfu_ghost_hits                  4    86614
deleted                         4    448
recycle_miss                    4    139179
mutex_miss                      4    106
evict_skip                      4    34701926
evict_l2_cached                 4    1446111232
evict_l2_eligible               4    419158016
evict_l2_ineligible             4    521666560
hash_elements                   4    394086
hash_elements_max               4    394086
hash_collisions                 4    190221
hash_chains                     4    116252
hash_chain_max                  4    10
p                               4    313441056
c                               4    5000000000
c_min                           4    4194304
c_max                           4    5000000000
size                            4    5772108096
hdr_size                        4    138256704
data_size                       4    5633851392
other_size                      4    1761772296
anon_size                       4    16384
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    2909063168
mru_evict_data                  4    0
mru_evict_metadata              4    0
mru_ghost_size                  4    510478848
mru_ghost_evict_data            4    0
mru_ghost_evict_metadata        4    510478848
mfu_size                        4    2724771840
mfu_evict_data                  4    0
mfu_evict_metadata              4    0
mfu_ghost_size                  4    171085312
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    171085312
l2_hits                         4    68774
l2_misses                       4    420817
l2_feeds                        4    782
l2_rw_clash                     4    5
l2_read_bytes                   4    364065792
l2_write_bytes                  4    207549440
l2_writes_sent                  4    555
l2_writes_done                  4    555
l2_writes_error                 4    0
l2_writes_hdr_miss              4    1
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    10
l2_abort_lowmem                 4    0
l2_reclaim_lowmem               4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    545421312
l2_asize                        4    207549440
l2_hdr_size                     4    1542360
l2_hdr_limit                    4    625000000
l2_compress_successes           4    34514
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1761772296
arc_meta_limit                  4    4500000000
arc_meta_max                    4    1761773136

The workload is only multiple instances of a perl script which is effectively du -- the ARC probably contains zero "data" in the previous sense of the term (e.g., output from cat /poolname/somefile). The machine had been idle for a good minute before I took this arcstat dump.

@prakashsurya
Member Author

OK, great. I have a hunch this is being caused by the bugged arc_adjust function. I'll try to reproduce that malformed behavior today, and push a fix soon (assuming the issue is what I think it is).

@DeHackEd
Contributor

Just to update, I've been letting it run for a while and it's started behaving better. Maybe it needed to slam into some reclaim first?

5 1 0x01 86 4128 6280349096 57277913791133
name                            type data
hits                            4    63012692
misses                          4    15231126
demand_data_hits                4    0
demand_data_misses              4    41890
demand_metadata_hits            4    60361704
demand_metadata_misses          4    10874230
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    2650988
prefetch_metadata_misses        4    4315006
mru_hits                        4    5813091
mru_ghost_hits                  4    1518388
mfu_hits                        4    54792986
mfu_ghost_hits                  4    5792847
deleted                         4    8637043
recycle_miss                    4    8210179
mutex_miss                      4    14390
evict_skip                      4    1295462694
evict_l2_cached                 4    100718430208
evict_l2_eligible               4    114773699584
evict_l2_ineligible             4    23804513280
hash_elements                   4    2462832
hash_elements_max               4    2568779
hash_collisions                 4    7703893
hash_chains                     4    261931
hash_chain_max                  4    25
p                               4    917587264
c                               4    5000000000
c_min                           4    4194304
c_max                           4    5000000000
size                            4    5000007408
hdr_size                        4    197165808
data_size                       4    4802841600
other_size                      4    1727879024
anon_size                       4    1147392
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    877194240
mru_evict_data                  4    0
mru_evict_metadata              4    37143552
mru_ghost_size                  4    4124495872
mru_ghost_evict_data            4    923648
mru_ghost_evict_metadata        4    4123572224
mfu_size                        4    3924499968
mfu_evict_data                  4    0
mfu_evict_metadata              4    1916374528
mfu_ghost_size                  4    875671040
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    875671040
l2_hits                         4    2742474
l2_misses                       4    6246871
l2_feeds                        4    11296
l2_rw_clash                     4    180
l2_read_bytes                   4    11603466752
l2_write_bytes                  4    5159491072
l2_writes_sent                  4    9611
l2_writes_done                  4    9611
l2_writes_error                 4    0
l2_writes_hdr_miss              4    40
l2_evict_lock_retry             4    1
l2_evict_reading                4    9
l2_free_on_write                4    15883
l2_abort_lowmem                 4    122
l2_reclaim_lowmem               4    2004
l2_cksum_bad                    4    3
l2_io_error                     4    3
l2_size                         4    36180289536
l2_asize                        4    1794785191863296
l2_hdr_size                     4    623479608
l2_hdr_limit                    4    625000000
l2_compress_successes           4    2627936
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    257
memory_indirect_count           4    45682
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    35992
arc_meta_used                   4    1727879024
arc_meta_limit                  4    2000000000
arc_meta_max                    4    3408673312

I also shrunk arc_meta_limit live, so the displayed max is left over from that time and not necessarily representative of an overload. But I did happen to catch this:

arc_meta_used                   4    2118991096
arc_meta_limit                  4    2000000000

@behlendorf
Contributor

@prakashsurya Notice *_evict_data, *_evict_metadata in the first arcstats. Despite the mru and mfu being substantial nothing is eligible for eviction, once some of those buffers become eligible in the second arcstats output the size is correctly brought under control.

mru_evict_data                  4    0
mru_evict_metadata              4    0
mfu_evict_data                  4    0
mfu_evict_metadata              4    0

@prakashsurya
Member Author

With these changes, I would not be surprised if arc_meta_used exceeds arc_meta_limit on a regular basis, to the point that it's normal behavior. It's a mouthful to explain why that is now the case, and more importantly why it is OK (previously it wasn't), so unless you're interested in the nitty-gritty details (are you?) I'll refrain from explaining all of that.

@prakashsurya
Member Author

@DeHackEd If you're testing these changes out, any chance you can keep a log of the arcstat file as your workload is running? For example:

$ while true; do \
> cat /proc/spl/kstat/zfs/arcstats > ./arcstats-`hostname`-`date +%s`.txt; \
> sleep 15; \
> done

I have scripts to generate plots from logs in that format, which is useful for analyzing how the ARC subsystem behaves over time. The ARC is very dynamic in nature, so a single snapshot in time is helpful only up to a point; it's much more beneficial to be able to visualize how the values change over time.

@DeHackEd
Contributor

                capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
whoopass1    5.32T  4.68T    486      0  1.09M      0
  sdc        5.32T  4.68T    486      0  1.09M      0
cache            -      -      -      -      -      -
  SSD-part1  16.0E  36.4G     47      8   163K   740K
-----------  -----  -----  -----  -----  -----  -----

Surprise cache growth. The partition I gave it is only 16 GB.

I've been playing with the parameters via /sys/module/zfs/parameters/* if that matters.

And sure, I'll keep an arcstats log. Leave it running for an hour or two.

@DeHackEd
Contributor

So far so good... (based on the original 6 commits only)

@prakashsurya
Member Author

@DeHackEd Good to hear. The original 6 is all that I plan to try and get merged for now.

@prakashsurya
Member Author

I ran some tests over the weekend, and this set of patches definitely helped one workload that I was expecting it to help. The workload was a single thread creating a lot of small files. I used fdtree to do this:

fdtree -d 10000 -l 100 -s1 -f1

With that as the only thing stressing the filesystem, I ran "before" and "after" tests on two nodes of essentially the same hardware configuration (the pool config was different, but that doesn't make much of a difference in these tests).

First, here's a graph from the "before" test (zfs-0.6.2-128-gdda12da) when running fdtree:

[graph: zeno7]

As you can see in the top right graph, the hit rate never settles throughout the entirety of the test. Also, MRU and MFU sizes fall to a negligible size (data_size in the top left graph, and mru_size/mfu_size in the bottom left graph), despite the constant hits on each list's respective ghost list (the bottom right graph).

Now, running with this set of patches (zfs-0.6.1-255-ga226ae8), I see the following behavior:

[graph: zeno8]

The hit rate is much more stable, and is also much better than previously. In addition, the sizes for the MRU and MFU are much more as I would expect.

@prakashsurya
Member Author

I also ran a "before" and "after" test with a single thread continually untar-ing the kernel source tree, compiling it, and then removing the directory. Just looking at the ARC values, I don't see much of a difference between the two runs (each receives about a 100% hit rate). But in the interest of full disclosure, I'll post the graphs for these tests as well.

Here's the workload:

while true; do
    tar -xf linux-3.12.tar.gz
    cd linux-3.12
    make defconfig
    make
    cd ..
    rm -rf linux-3.12
done

Graph of ARC values from "before" test (zfs-0.6.2-128-gdda12da):

[graph: zeno5]

Graph of ARC values from "after" test (zfs-0.6.1-255-ga226ae8):

[graph: zeno6]

Really, the only difference is hdr_size (and as a result size) slowly increasing in the "after" test (as it should).

@prakashsurya
Member Author

Another workload that I tested and gathered some "before" and "after" data for is a send and recv of the fdtree data created in the test mentioned above. I did a zfs send of the dataset containing some 200+ million small files from one machine to another machine, which received the dataset with zfs recv. Here are the two commands used:

zfs send tank/fdtree@arc-changes-send | nc $RECV_ADDR $RECV_PORT
 nc -l $RECV_PORT | zfs receive tank/fdtree-recv

Using 4 machines, I was able to have a send-receive pair running master (zfs-0.6.2-128-gdda12da) and a send-receive pair running this branch (zfs-0.6.1-255-ga226ae8).

Looking at the ARC graphs on each of the receivers, it looks like this set of patches helps substantially for this workload. After running for over an hour, it looks like the receiver running master stabilized at about a 40% hit rate:

[graph: zeno5]

Whereas the receiver running this branch (zfs-0.6.1-255-ga226ae8) stabilized at a significantly higher hit rate (looks to be around 95% or more):

[graph: zeno6]

This branch doesn't appear to help the senders much at all, as they seem to be more IOPS-limited, but it doesn't appear to hurt either. Here is the graph of the sender running master (zfs-0.6.2-128-gdda12da):

[graph: zeno7]

And here is the graph of the sender running this branch (zfs-0.6.1-255-ga226ae8):

[graph: zeno8]

Although the hit rate of the senders doesn't appear to be affected much by this branch's changes, the graphs do look much "better" with these changes, IMO.

@prakashsurya
Member Author

@kpande That's still an open question, really. In all of the testing I've done through the ZPL, these changes show that it either helps or doesn't really do anything. But, I've only done some very targeted testing, so I welcome others to try them out on any workload they care about. I don't foresee these changes having any negative drawbacks, but I would not be surprised if I'm overlooking something (which is why I would really like some more people testing this out). Changes to the ARC are very subtle and even seemingly "small" changes can have drastic performance implications.

There are still a couple patches that I intend to push into this branch, but they need a little more internal testing time before I'm fully confident in them. I was hoping I would be able to get them out to the community today, but unfortunately I don't think that'll happen (issues unrelated to the code).

This reverts commit c11a12b.

Out-of-memory events were fixed by reverting that commit.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Work in progress. I will add a more detailed commit message soon.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
@DeHackEd
Contributor

Status report. This system has been running for the last ~10 days, so these last 2 commits don't apply currently.

Still, I applied 5 of the 6 patches (due to the l2arc fields breaking, I left that patch out and just gave it a 20 GB partition to use with compression). I definitely see an improvement in system reliability.

(I do have some of my own patches as well, but they are not of a nature that would affect this outcome)

When the arc is at its size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from overstepping
its bounds (i.e. keep it below the size limitation placed on it).

This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.

For example, consider the following scenario:

    * the size of the arc is capped at 10G
    * the meta_limit is capped at 4G
    * 9G of the arc contains "data"
    * 1G of the arc contains "metadata"

Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.

To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer (unless it's at the
"metadata" limit). In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
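A minimal model of the two-pass behavior described above (simplified; evict_from_lists() stands in for the per-type eviction walk over the arc lists):

    #include <stdint.h>

    typedef enum { BUFC_DATA, BUFC_METADATA } buf_type_t;

    /*
     * Try to evict "target" bytes of the requested type; if that pass
     * falls short, make a second pass over the other type, mirroring
     * how arc_evict_ghost() behaves per the commit message.
     */
    static uint64_t arc_evict_model(uint64_t target, buf_type_t type,
        uint64_t (*evict_from_lists)(buf_type_t, uint64_t))
    {
        uint64_t done = evict_from_lists(type, target);

        if (done < target) {
            buf_type_t other =
                (type == BUFC_DATA) ? BUFC_METADATA : BUFC_DATA;

            done += evict_from_lists(other, target - done);
        }
        return done;
    }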
@prakashsurya
Member Author

Great to hear that!!

I think the l2arc limiting patch still needs to be looked at more closely. I've done limited testing and it seems to be "better" than before (i.e. l2arc headers not causing OOMs), but the limit isn't being strictly enforced by any means. So for the time being, the best course of action is still to use a properly sized l2arc device, as you did.

Those last two patches that were added were only really needed to address some issues when running with Lustre instead of the ZPL, so I would not be surprised if your workload doesn't strictly need them. With that said, they shouldn't hurt (and might improve) the system if you do decide to pull them in.

I'm planning on adding a few more patches to this series, slightly tweaking the current implementation. If you're interested in a "pre-release" view of these changes, I'm using this branch as my temporary working branch as I test the new patches out:

    https://github.com/prakashsurya/zfs/tree/arc-changes-experimental

Although, I make no promises as to whether this branch "works" or not; there's a good chance it'll be completely untested if/when you decide to peek at it. Once I feel confident in the changes I'll bring them over into this pull request.

Currently the "arc_meta_used" value contains all of the arc_buf_t's,
arc_buf_hdr_t's, dnode_t's, dmu_buf_impl_t's, arc "metadata" buffers,
etc. allocated on the system. The value is then used in conjunction with
"arc_meta_limit" to pro-actively throttle arc "metadata" buffers
contained in the arc's regular lists (i.e. non-ghost lists). There are
a couple of problems that can arise as a result:

    1. As the "arc_meta_limit" is reached, new arc "metadata" buffers
       that are added will force an eviction from either the arc's mru
       list or the mfu list. Since *all* arc_buf_hdr_t's are contained
       in "arc_meta_used", even the ghost list entries will take up
       space in "arc_meta_used" and _may_ never be pruned. Given a
       "metadata" intensive workload (e.g. rsync, mkdir, etc.), it is
       possible for the ghost list headers to completely consume
       "arc_meta_used" and leave no room for "useful" metadata;
       causing atrocious arc hit rates (i.e. terrible performance).

    2. Since *all* arc_buf_hdr_t's are accounted for in "arc_meta_used",
       this means that even "data" buffers will take up a small amount
       of space in "arc_meta_used". Given a significant amount of "data"
       buffers (or a small enough "arc_meta_limit"), this can have a
       noticeable impact on the amount of arc "metadata" that can be
       cached. What makes it worse, even if we evict "metadata" from the
       arc (as we do now when over "arc_meta_limit"), these "data" buffer
       headers will never be touched.

This patch is intended to address problem (2) described above. Instead
of using "arc_meta_used" to contain *all* arc_buf_hdr_t's, arc_buf_t's,
etc., now it only contains those headers that accompany arc "metadata".
This way, if *all* of the arc's "metadata" buffers were evicted,
"arc_meta_used" will shrink to zero (even if "data" buffers are still
contained in the arc).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
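In executable form, the accounting change might look like this sketch (hypothetical names; the point is simply that header space only counts toward "arc_meta_used" when the header accompanies a metadata buffer):

    #include <stdint.h>

    typedef enum { BUFC_DATA, BUFC_METADATA } buf_type_t;

    static uint64_t arc_meta_used;

    /*
     * Account for a newly allocated header: under this patch only the
     * headers of "metadata" buffers contribute to arc_meta_used, so
     * evicting all metadata drives arc_meta_used to zero even while
     * "data" buffers (and their headers) remain cached.
     */
    static void account_hdr_alloc(buf_type_t type, uint64_t hdr_bytes)
    {
        if (type == BUFC_METADATA)
            arc_meta_used += hdr_bytes;   /* data headers excluded */
    }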
Using "arc_meta_used" to determine if the arc's mru list is over it's
target value of "arc_p" doesn't seem correct. The size of the mru list
and the value of "arc_meta_used", although related, are completely
independent. Buffers contained in "arc_meta_used" may not even be
contained in the arc's mru list. As such, this patch removes
"arc_meta_used" from the calculation in arc_adjust.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
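The change amounts to dropping one term from the over-target test in arc_adjust; roughly (a model of the described change, not the exact source):

    #include <stdint.h>

    /*
     * Before: anon_size + mru_size + arc_meta_used > arc_p
     * After:  anon_size + mru_size                 > arc_p
     * Returns how many bytes the mru side is over its target.
     */
    static int64_t mru_bytes_over_target(uint64_t anon_size,
        uint64_t mru_size, uint64_t arc_p)
    {
        int64_t over = (int64_t)(anon_size + mru_size) - (int64_t)arc_p;

        return (over > 0) ? over : 0;
    }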
Setting a limit on the minimum value of "arc_p" has been shown to have
detrimental effects on the arc hit rate for certain "metadata" intensive
workloads. Specifically, this has been exhibited with a workload that
constantly dirties new "metadata" but also frequently touches a "small"
amount of mfu data (e.g. mkdir's).

What is seen is that the new anon data throttles the mfu list to a
negligible size (because arc_p > anon + mru in arc_get_data_buf), even
though the mfu ghost list receives a constant stream of hits. To remedy
this, arc_p is now allowed to drop to zero if the algorithm deems it
necessary.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
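For reference, a sketch of the throttling condition mentioned above (a simplification of the arc_get_data_buf recycle decision cited in the commit message):

    #include <stdint.h>

    /*
     * When arc_p > anon + mru, new buffers are recycled out of the
     * mfu's share; a floor on arc_p can pin this condition true and
     * starve the mfu. Allowing arc_p to fall to zero releases it.
     */
    static int mfu_starved_by_arc_p(uint64_t arc_p, uint64_t anon_size,
        uint64_t mru_size)
    {
        return arc_p > anon_size + mru_size;
    }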
It's unclear why adjustments to arc_p need to be dampened as they are in
arc_adjust. With that said, its removal significantly improves the arc's
ability to "warm up" to a given workload. Thus, I'm removing it until
its usefulness is better understood.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Work in progress, needs better commit message.

To try and prevent arc headers from completely consuming
"arc_meta_used", this commit changes arc_adjust_meta to more closely
align with arc_adjust; "arc_meta_limit" is used similarly to "arc_c".
@prakashsurya
Member Author

Here's a graph of the arc parameters, hit rate, and number of reads while running the above-mentioned fdtree workload with the latest code. I was able to tweak the ARC such that we can maintain a limit on the amount of metadata cached, while still achieving a decent hit rate for this specific workload. I would love some outside testing of this code on other workloads, to verify it works as I expect.

Workload: fdtree -d 10000 -l 100 -s1 -f1
ZFS Version: zfs-0.6.2-144-gc2c9535

[graph: arc-experimental-zeno-fdtree]

For comparison purposes, I've included identical graphs for the previous runs that I performed using the same fdtree workload. Please note, the current code hasn't yet run for as long as the previous tests did, so the scales of the x-axes don't match up between the test runs.

ZFS Version: zfs-0.6.2-128-gdda12da (i.e. "master")

[graph: master-zeno-fdtree]

ZFS Version: zfs-0.6.1-255-ga226ae8

[graph: arc-change-zeno-fdtree]

@DeHackEd
Contributor

Since this issue has gotten some attention lately (indirectly) I just wanted to mention I've been running with the first 6 patches (see below) on all my (mostly not-so-high load) systems lately without noticeable issue. I do think things are better for stability on the higher-load system as well. I hadn't been keeping up with the churn of the additional patches in the pull request.

One catch: I explicitly left out the L2ARC patch because I do need the L2ARC capacity and prefer to control it by limiting the size of the device provided.

+1 towards getting it merged.

@prakashsurya
Member Author

Thanks for the update. Leaving out the L2ARC patch and properly sizing the device is the proper way to go about it at the moment, IMO.

I should be back in the office this week, so I really want to finish cleaning up this pull request and get it in a mergeable state. The first 6 patches you mentioned have a few deficiencies (specific to certain workloads) that I want to address before it all lands; hence the follow-up patches.

@spacelama
Copy link

Incidentally, I personally don't get anything out of any metrics I graph - it's far too noisy to work out what's going on. But running "watch -d -n 0.1 cat /proc/spl/kstat/zfs/arcstats" raises a question: why are some of those metrics changing so rapidly? Why are mru_size/mfu_size, mru_evict_metadata, arc_meta_used, etc. fluctuating by 1% per second even with a write-only load (with or without primary and secondary cache set to metadata only), when every cache should be well and truly hot after months of uptime, and when arc_meta_used is well short of arc_meta_max, arc_meta_limit, and arc_max?

@behlendorf
Contributor

@spacelama I think the problem is that the meaning of the values exposed by the ARC isn't at all intuitive. Those values don't mean what you think they mean, and what they actually mean isn't documented anywhere. The result is the graphs aren't too useful unless you're already very familiar with the area. It's certainly an area I think we could improve.

@prakashsurya
Member Author

This has been superseded by #2110, so I'm closing it. Anybody testing this pull request, please move to the new revision in #2110.
