Limit l2arc header size #1420
Comments
It sure looks like we're having trouble allocating memory. Could you post the contents of the /proc/spl/kstat/zfs/dmu_tx and /proc/spl/kstat/zfs/arcstats files?
Sure, here they are: [root@centaurus zfs-0.6.1]$ cat /proc/spl/kstat/zfs/dmu_tx and [root@centaurus zfs-0.6.1]$ cat /proc/spl/kstat/zfs/arcstats
One thing that looks very odd to me (as I stated in the ML discussion) is that my mirrored log devices never see any IO (as far as I can tell, having watched them for a while today). No idea whether it might be related, however...
Oh, one more thing: I cannot remove the mirrored log from the zpool. The "zpool remove" command hangs for a while (like any other zfs/zpool command, but significantly longer), then returns without any error message (return code is 0) but does nothing. Running it under strace, it remains stuck for a while (several minutes, I'd say) on "ioctl(3, 0x5a0c, 0x7fff4012acd0)". Its stack is then: [root@centaurus ~]$ cat /proc/28462/stack
@douardda I believe I see what's going on here. You've stumbled into an L2ARC memory management issue which really should be better documented in the FAQ. What's happening is that virtually all of your 4GB of ARC space is being consumed managing the 125GB of data in the L2ARC. This means there's basically no memory available for anything else, which is why your system is struggling. To explain a little more: when a data buffer gets removed from the primary ARC cache and migrated to the L2ARC, a reference to the L2ARC buffer must be left in memory. Depending on how large your L2ARC device is and what your default block size is, it can take a significant amount of memory to manage these headers. This can get particularly bad for ZVOLs because they have a small 8k block size vs 128k for a filesystem, which means the ARC's memory requirements for L2ARC headers increase by 16x. You can check for this in the l2_hdr_size field of /proc/spl/kstat/zfs/arcstats.
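A rough back-of-the-envelope check of that claim (a sketch only; the exact per-header size varies by ZFS version, roughly 256 bytes per cached buffer is assumed here):

```sh
# 125 GiB of L2ARC data split into 8 KiB zvol blocks, ~256 B of ARC header each (assumption).
l2arc_bytes=$((125 * 1024 * 1024 * 1024))
block_bytes=$((8 * 1024))
hdr_bytes=256
headers=$((l2arc_bytes / block_bytes))
echo "headers:       $headers"                                    # ~16.4 million
echo "header memory: $((headers * hdr_bytes / 1024 / 1024)) MiB"  # ~4 GiB, i.e. the whole ARC
```

With 128k filesystem records the same amount of L2ARC data would need roughly 1/16 of that header memory, which is the 16x factor mentioned above.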
Arguably ZFS should internally limit its L2ARC usage to prevent this pathological behavior, and that's something we'll want to look into. The upstream code also suffers from this issue, but it's somewhat hidden because the vendors will carefully size the ARC and L2ARC to avoid this case.
Oh wow. Sorry for the formatting in the previous comment, it seems like GitHub mangles mail replies. Trying again from the web form: at least for the short term, we've tried 2), so we now have two 10G cache devices. Things seem better as far as performance goes, and so does /proc/spl/kmem/slab. Thanks for the help.
@behlendorf hi, thanks again for pointing us to some solutions. Here is the situation. As @jcristau stated above, we've managed to get back to an acceptable situation by (drastically) reducing the size of the L2ARC devices. We also managed to unload then reload the zfs/spl modules, so we removed the zfs_arc_max kernel parameter. We now have a basically working zfs setup again. But we still have performance issues. When I first built this zfs setup, I ran a few benchmarks. A simple "dbench -s 10" could reach 45MB/s before adding slog and cache, and almost 150MB/s after adding the SSD slogs and the cache devices (on a zfs filesystem, not a zvol). Now (with no other IO on the ZFS pool than the dbench), I get a poor 7 to 10MB/s, and a dd of a fairly big zvol (215G) did complete (hurrah) but at a mean rate of 21.0 MB/s. Once again, I never see any activity on the slog devices; is this "normal"? Can it be a symptom related to my poor performance here?
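(As an aside on the idle slog: a separate log device only receives synchronous writes, so a mostly asynchronous workload will legitimately leave it untouched. A sketch for checking this, where "tank" and "tank/test" are placeholder names:)

```sh
# Watch per-vdev activity, including the log mirror, once per second.
zpool iostat -v tank 1

# Forcing synchronous semantics on a test dataset should make the slog light up.
zfs set sync=always tank/test
dd if=/dev/zero of=/tank/test/slog-probe bs=8k count=10000 oflag=sync
zfs inherit sync tank/test
```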
If you're experiencing much better performance with empty zvols versus filled zvols, you're definitely hitting #361. You'll notice lots of read activity if this is the case; it's a known long-standing issue which hasn't yet been addressed.
Some news on our ZFS setup. Since we greatly reduced the size of the cache devices, the situation is mostly stable and under control. But it remains quite easy to put ZFS under memory pressure. The first way to do so is to export a ZFS filesystem over NFS and put many (15 million) small files in this filesystem (I know, this is not reasonable). Then a simple "find" (on another computer on which the NFS volume is mounted) kills the zfs setup: free memory drops to 0, and arc_adapt and a few more zfs processes such as spl_kmem_cache then spend their time trying to move pages or so; there is then a huge amount of read IOs on the disks, and the whole zfs setup is very sluggish (even a "zpool -h" takes ages to return). The second way to put it under pressure is to run an fio test using the following config file:
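(The original job file was not captured in this thread. Purely as an illustration, a job file of the kind described, sequential read, random read, then a write phase against a zvol, might look like the following sketch; the zvol path, sizes, and runtimes are placeholders:)

```sh
cat > zvol-pressure.fio <<'EOF'
# Hypothetical reconstruction -- the original job file was not posted.
[global]
ioengine=libaio
direct=1
# placeholder zvol path
filename=/dev/zvol/tank/testvol
bs=8k
size=10g
runtime=300
time_based

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall
EOF
fio zvol-pressure.fio
```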
During the first 2 phases of the test (read and randread), everything is fine. Then, during the first "write" test, everything starts OK but the consumed memory starts to increase. As expected, when zfs runs out of "free" memory, performance drops down to 0, arc_adapt etc. When the system is in this "under pressure" state, the only way to make it return to a normal state (besides waiting almost forever) is to remove the cache devices from the pool. I can then reinsert them, and everything is back to normal behaviour. Just to illustrate, the dstat of the zvol used in the fio test:
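(For reference, the recovery step described above, as a sketch; "tank" and the device names are placeholders for the actual pool and cache devices:)

```sh
# Drop the L2ARC and its in-memory headers by removing the cache devices...
zpool remove tank sdx1 sdy1
# ...then add them back once memory pressure has subsided.
zpool add tank cache sdx1 sdy1
```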
Interesting. It sounds like your l2arc headers may be consuming the majority of your arc cache. Right now these will not be released except when removing the l2arc device. Dropping the headers during memory pressure would mean we'd be throwing away the references to some data in the l2arc. You can check the l2_hdr_size field in arcstats to determine how much memory they are using.
Could we implement a max percent on l2arc headers in the main arc, and implement some way to drop those headers under memory pressure?
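(There was no such cap at the time; the closest existing knobs only throttle how quickly the L2ARC, and therefore its header footprint, grows. A sketch using the standard l2arc module parameters, with illustrative values:)

```sh
# Halve the default feed rate (8 MiB per interval) so headers accumulate more slowly.
# This only throttles growth; it does not bound the total l2_hdr_size.
echo $((4 * 1024 * 1024)) > /sys/module/zfs/parameters/l2arc_write_max
echo $((4 * 1024 * 1024)) > /sys/module/zfs/parameters/l2arc_write_boost
```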
@behlendorf thanks, I'll try to check the arcstats next time I reproduce a memory pressure situation. Which numbers in arcstats should I watch, precisely?
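(A minimal sketch of the fields worth watching; the field names come from /proc/spl/kstat/zfs/arcstats, and the readings in the comments follow the explanation above:)

```sh
# l2_hdr_size : ARC memory spent tracking L2ARC buffers (the suspect here)
# hdr_size    : memory used by ordinary ARC headers
# l2_size     : amount of data currently held in the L2ARC
# size, c     : current ARC size and its target size
awk '$1 ~ /^(size|c|hdr_size|l2_size|l2_hdr_size)$/ {printf "%-12s %s\n", $1, $3}' \
    /proc/spl/kstat/zfs/arcstats
```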
Some news. I have upgraded the memory on the machine from 16GB to 32GB. I've reinserted one of the SSDs dedicated to be a cache device (120GB). There is no special zfs_arc_max configured. Over the weekend, I converted some volumes with an 8k volblocksize to 128K ones (I just created a new, properly configured volume, then dd'd from the 8k one to the 128k one; I don't know if there is a better method, see the sketch at the end of this comment). I got very poor performance on these zvol copies (less than 20MB/s). Now the server is doing constant (mostly read) IOs on the zfs disks (while there is almost no activity on the zvols or the zfs filesystems), the load is quite high, and munin reports a constant diskstat_utilization of almost 90% on the disks involved in the raidz vdevs; these IOs are not very high in volume, but quite high in IOPS, given the disks are 7k2 drives organized in raidz1, like:
My zfs munin plugin reports an l2_hdr_size almost constant for hours at 4.55GB (the total l2arc size is around 180GB). For now, every zpool or zfs command takes tens of seconds to respond; a strace looks like:
The system was running with a load level of 16, and tgtd was stuck in the D state. Then I removed the cache device from the pool:
So my conclusion so far is that, on my system, adding cache has a major performance impact. Which is a little bit odd, isn't it? I could go back to setting up a small cache device (using only part of the SSD), but I'm not sure adding a small cache would be of any use. Any clue is welcome, David
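(For the 8k-to-128k conversion mentioned above, a minimal sketch of the approach described; pool and volume names are placeholders, and the new volume must be at least as large as the old one since this just copies raw block-device contents:)

```sh
# Create the replacement zvol with the larger volblocksize...
zfs create -V 215G -o volblocksize=128K tank/vol-128k
# ...then copy the old zvol into it; a large block size keeps the copy sequential.
dd if=/dev/zvol/tank/vol-8k of=/dev/zvol/tank/vol-128k bs=1M conv=fsync
```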
@douardda It may be because the memory requirements to track everything in your L2ARC cache device push other useful data out of the primary ARC. Using a smaller device would resolve this. You can verify this is the problem by checking the l2_hdr_size in arcstats. This is roughly the amount of memory consumed managing the L2ARC.
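(A hedged way to act on that advice without dedicating the whole 120GB SSD; the device name and size below are placeholders: carve out a small partition and add only that partition as cache.)

```sh
# Create a single small partition (here ~20 GiB) on the SSD...
parted -s /dev/sdx mklabel gpt mkpart l2arc 1MiB 20GiB
# ...and use only that partition as the L2ARC device.
zpool add tank cache /dev/sdx1
```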
Thanks @behlendorf. I have added a munin plugin to monitor the l2arc size (total size and headers), then I reinserted the cache devices (for a couple of weeks); the l2arc header size remains quite acceptable (around 2GB right now), so the system behaves mostly fine for now. The problem is that it seems very fragile: it's quite easy to kill the system (allocate and use 4k zvols, or "pathological" IO patterns) by making zfs allocate a huge number of small l2arc blocks that must be tracked in memory (the behaviour you pointed out here). I don't know how, but there should definitely be some quota somewhere (maybe it cannot really be done without rethinking the memory allocation in zfs to make it more Linux-friendly) to prevent such a pattern.
I'd like to note that I'm seeing this same speed problem (multi-second pauses in the same types of calls described above).
I looked but could not find one: is there an upstream illumos ticket/discussion on limiting L2ARC header size somewhere?
@cburroughs Not that I'm aware of, but I'm sure they are aware of the issue.
I'm getting similar strace and CPU behaviour as described in #1420 (comment) (creating a new dataset). Hopefully the root cause is the same and I'm not hijacking this thread. Setup: 1TB mirrored pool at 50% fragmentation and 94% capacity, ~1k datasets each with tens of snapshots, Core 2 Duo, 4GB RAM, 1GB L2ARC on SSD, default spl/zfs parameters. Relevant stats: /proc/spl/kstat/zfs/dmu_tx and /proc/spl/kstat/zfs/arcstats.
Closing. This code has been refactored considerably to reduce the header sizes; in addition, the compressed ARC feature was added to further reduce overhead.
Hi,
I have serious performance problems with my ZFS system. For the record, it's a Debian squeeze+backports system (3.2.41-2~bpo60+1 Debian kernel) running on a Dell PE2950 (2 Xeon L5420 @ 2.5GHz, 16GB), driving a storage bay consisting of 18 spinning drives (SAS, 7200 rpm) and 4 SSDs (2 MLC and 2 SLC, for logs and cache).
The HBA is an LSI SAS9200-8E.
My problem is that after a fresh reboot the system behaves quite normally, but as write operations occur on ZFS volumes, performance degrades to a state which makes it almost unusable (any zfs command takes more than a minute to return, IO performance on zfs filesystems and zvols is near 0, zfs kernel threads spend most of their time waiting for mutexes, etc.).
There is a detailed explanation on the mailing list:
https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/6-2sqov3usM