
ZVOL write IO merging not sufficient #8472

Open
samuelxhu opened this issue Mar 2, 2019 · 67 comments
Labels
Bot: Not Stale (override for the stale bot), Component: ZVOL (ZFS volumes), Status: Understood (the root cause of the issue is known), Type: Documentation (indicates a requested change to the documentation), Type: Performance (performance improvement or performance problem)

Comments

@samuelxhu commented Mar 2, 2019

System information

Type Version/Name
Distribution Name ZFS on Linux
Distribution Version CentOS 7
Linux Kernel 3.10
Architecture x86
ZFS Version 0.6.5.X, 0.7.X, 0.8.X
SPL Version 0.6.5.X, 0.7.X, 0.8.X

Describe the problem you're observing

Before 0.6.5.X (e.g. 0.6.3-1.3 or 0.6.4.2), ZoL used the standard Linux block device layer for ZVOLs, so one could use a scheduler, deadline or others, to merge incoming IO requests. Even with the simplest noop scheduler, contiguous IO requests could still be merged if they were sequential.

Things changed from 0.6.5.X on: ryao rewrote the ZVOL block layer and disabled request merging at the ZVOL layer, on the grounds that the DMU does IO merging. However, it seems that DMU IO merging either does not work properly or is not sufficient from a performance point of view.

The problem is as follows. ZVOL has a volblocksize setting, and in many cases, e.g. for hosting VMs, it is set to 32KB or so. When IO requests are smaller than the volblocksize, read-modify-writes (RMW) occur, leading to performance degradation. A scheduler, such as deadline, is capable of sorting and merging IO requests, thus reducing the chance of RMW.

Describe how to reproduce the problem

Create a not-so-big ZVOL with a volblocksize of 32KB and use FIO to issue a single sequential 4KB write workload. After a while (once the ZVOL is filled with some data), either "iostat -mx 1 10" or "zpool iostat 1 10" will show a lot of read-modify-writes. Note that at the beginning of the writes there will be little or no RMW, because the ZVOL is almost empty and ZFS can intelligently skip reading zeros.

In contrast, if FIO issues a sequential write workload of 32KB, 64KB, or even larger, there is no RMW no matter how long the workload runs.
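For reference, a minimal reproduction along these lines could look as follows (pool and volume names are just placeholders, and the fio options can of course be varied):

# 32KB volblocksize zvol as the test target
zfs create -V 10G -o volblocksize=32k tank/rmwtest

# single sequential 4KB write stream; RMW reads show up once the volume holds data
fio --name=seq4k --filename=/dev/zvol/tank/rmwtest --rw=write --bs=4k \
    --ioengine=libaio --iodepth=1 --direct=1 --time_based --runtime=120

# repeat with --bs=32k (or larger): no RMW reads should appear
# monitor from another terminal with "iostat -mx 1" or "zpool iostat -r 1"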

Apparently the IO merging logic for ZVOLs is not working properly. Either re-enabling the block device scheduler choice (deadline or noop) or fixing the broken IO merging logic in the DMU should fix this performance issue.

Include any warning/errors/backtraces from the system logs

@samuelxhu (Author)

ZVOL currently does not even support noop
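As a quick check (zd0 is just an example device name), the scheduler sysfs attribute can be inspected; on a bio-based zvol it should only show "none":

cat /sys/block/zd0/queue/scheduler
# typically prints: none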

@samuelxhu (Author) commented Mar 2, 2019

The default value of nomerges is 2; I will try setting it to 0, re-test the cases, and report back soon.

Today I can confirm that setting nomerges to 0 has no actual effect.
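For completeness, this is the sysfs knob in question (zd0 as an example device; 0 allows all merges, 1 allows only simple merges, 2 disables merging entirely):

echo 0 > /sys/block/zd0/queue/nomerges
cat /sys/block/zd0/queue/nomerges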

@samuelxhu (Author)

Can somebody who is familiar with the ZFS DMU code investigate the IO merging logic inside the DMU a bit? Perhaps a better solution can be found there.

I just wonder why IO merging in the DMU is not working in this simple case (a single thread of consecutive 4KB writes).

@shodanshok (Contributor) commented Mar 3, 2019

@samuelxhu @kpande As I understand it, the problem is reproducible even without a zvol: if you overwrite a large-recordsize (ie: 128k) file with 4k writes, you will encounter heavy read/modify/write. The problem does not seem related to the aggregator not doing its work; rather, it depends on the fact that on a partial-recordsize write, the entire record must be copied into memory. For example:

  • a 32M sized, 128K recordsize file exists. A sequential 4k workload is generated by issuing something as simple as dd if=/dev/urandom of=<testfile> bs=4k count=1024 conv=notrunc,nocreat;

  • the previous command accumulates writes in memory - nothing is written until txg_sync;

  • by monitoring I/O on another terminal we can see that, while no writes are issued, significant read activity happens. This is because each 4k write that touches a not-yet-cached 128K chunk brings that entire 128K chunk into memory in the ARC data buffer (ABD) structure. In other words: the first 4k hitting the file at offset 0 causes the entire 128K record to be copied into memory before any further 4k writes are issued, and regardless of whether those writes will completely overwrite that record;

  • at transaction flush, the DMU aggregates these individual 4k writes into much fewer 128K ones. This can be checked by running zpool iostat -r 1 on another terminal.

So, the r/m/w behavior really seems intrinsically tied to the ARC/checksumming, rather than to the aggregator not doing its work.

However, in older ZFS versions (<= 0.6.4), zvols were somewhat immune from this problem. This stems from the fact that, unless doing direct I/O, zvols did not bypass the standard Linux pagecache. In the example above, running dd if=/dev/random of=/dev/zd0 bs=4k count=1024 would place all new data into the pagecache, rather than in ZFS's own ARC. It is at this point, before "passing down" the writes to the ARC, that the Linux kernel has a chance to coalesce all these 4k writes into bigger ones (up to 512K by default). If it succeeds, the ARC will only see 128K+ sized requests, which cause no r/m/w. This, however, is not without contraindications: double-caching all data in the pagecache leads to much higher pressure on the ARC, causing lower hit rates and higher CPU load. Bypassing the pagecache with direct I/O will instead cause r/m/w.
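To illustrate the distinction (a rough sketch; the pool/volume names are placeholders and the exact numbers will vary), the same write pattern can be sent to a zvol buffered and with O_DIRECT:

# buffered 4k writes: on a pre-0.6.5 zvol the pagecache could coalesce these
# into larger requests before ZFS saw them
dd if=/dev/urandom of=/dev/zvol/tank/vol bs=4k count=8192

# direct 4k writes: each 4k request reaches the zvol as-is, so anything smaller
# than volblocksize can trigger r/m/w
dd if=/dev/urandom of=/dev/zvol/tank/vol bs=4k count=8192 oflag=direct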

On ZFS >= 0.6.5, the zvol code was changed to skip some of the previous Linux "canned" block layer code, simplifying the I/O stack and bypassing the I/O scheduler entirely (side note: in recent Linux kernels, none is not a noop alias anymore; rather, it really means no scheduler is in use. I also tried setting nomerges to 0, with no changes in I/O speed or behavior). This increased performance for the common case (zvol with direct I/O), but prevented any merging in the pagecache.

For what it is worth, I feel the current behavior is the right one: in my opinion, zvols should not behave too differently from datasets. That said, this precludes a possible optimization (ie: using the pagecache as a sort of "first stage" buffer where merging can be done before sending anything to ZFS).

@samuelxhu (Author) commented Mar 3, 2019 via email

@samuelxhu (Author)

@kpande Just to confirm: setting /sys/devices/virtual/block/zdXXX/queue/nomerges to 0 does not cause contiguous IO requests to merge. It seems all kinds of IO merging are, unfortunately, disabled by the current implementation.

Ryao's original intention was to avoid double merging and let the DMU do IO merging. It is mysterious that the DMU does not do the merging correctly either.

@shodanshok (Contributor) commented Mar 3, 2019

@samuelxhu I think the rationale for the current behavior is that you should avoid double caching by using direct I/O to the zvols; in this case, the additional merging done by the pagecache is skipped anyway, so it is better to also skip any additional processing done by the I/O scheduler. Anyway, @ryao can surely give you a more detailed/correct answer.

The key point is that it is not the DMU failing to merge requests; it actually is doing I/O merging. You are asking for an additional buffer to "pre-merge" multiple write requests before passing them to the "real" ZFS code in order to avoid read amplification. While I understand your request, I think this is currently out of scope, and quite different from how ZFS is expected to work.

@samuelxhu (Author) commented Mar 3, 2019 via email

@shodanshok (Contributor)

@samuelxhu but they are normal block devices; only the scheduler code was bypassed to improve performance in the common case. I have no problem understanding what you say and why, but please be aware you are describing a pretty narrow use case/optimization: contiguous, non-direct 4k writes to a zvol is the only case where pagecache merging will be useful. If random I/O is issued, merging is not useful. If direct I/O is used, merging is again not useful.

So, while I am not against the change you suggest, please be aware of its narrow scope in real world workloads.

@samuelxhu (Author)

@kpande I have over 20 ZFS storage boxes serving as FC/iSCSI backends, which use a 32KB volblocksize. We run different workloads on them and found that a 32KB volblocksize strikes the best balance between IOPS and throughput. I have several friends running ZVOLs for VMware who recommend 32KB as well.
Therefore IO request merging and sorting at the ZVOL layer can effectively reduce RMW.
@shodanshok Adding a scheduler layer to ZVOLs will not cost much memory/CPU, but it will enable stacking ZVOLs with many other Linux block devices, embracing a much broader scope of use.

@samuelxhu (Author) commented Mar 4, 2019

Let me describe another ZVOL use case which requires the normal block device behavior with a valid scheduler: one or multiple application servers use an FC or iSCSI LUN backed by a ZVOL; the servers use a server-side SSD cache, such as Flashcache or bcache, to reduce latency and to accelerate application IO. Either flashcache or bcache will issue small but contiguous 4KB IO requests to the backend, anticipating that the backend block device will sort and merge those contiguous IO requests.

In the above case, any other block device, including HDDs, SSDs, RAID, or virtual block devices, will have no performance issues. But with a zvol in its current implementation, one will see significant performance degradation due to excessive and unnecessary RMWs.
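One way to see this on the storage box (a sketch; zd0 stands for whichever zvol backs the exported LUN) is to watch the request-merge columns while the cache layer flushes:

iostat -mx zd0 1
# rrqm/s and wrqm/s are expected to stay at 0 here, i.e. no requests are ever
# merged at the zvol layer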

@richardelling (Contributor)

In general, it is unlikely that merging will benefit overall performance. However, concurrency is important and has changed during the 0.7 evolution. Unfortunately, AFAIK, there is no comprehensive study on how to tune the concurrency. See https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zvol_threads

Also, there are discussions in #7834 regarding the performance changes over time, especially with the introduction of the write and DVA throttles. If you have data to add, please add it there.

@samuelxhu (Author) commented Mar 5, 2019

Why is using ZVOLs as backend block devices for iSCSI/FC LUNs not a common use case? Don't be narrow-minded, it is very common. This is the typical use case in which a ZVOL should have its own scheduler, for at least two purposes: 1) to stay compatible with the Linux block device model (extremely important for block device stacking), as the applications anticipate that the backend ZVOL will do IO merging and sorting; 2) to reduce the chance of the notorious RMWs, in particular for non-4KB ZVOL volblocksizes.

I do not really understand why a ZVOL should be different from a normal block device. For those who use ZVOLs with a 4KB volblocksize only, setting the scheduler to noop/deadline only costs a few CPU cycles, but IO merging has a big potential to reduce the chance of RMWs for non-4KB ZVOL volblocksizes.

Unfortunately, I run more than a hundred FC/iSCSI ZFS ZVOL storage boxes with a volblocksize of 32KB or even bigger, for sensible reasons; the missing scheduler in 0.7.X causes us pain from excessive RMWs and thus performance degradation, preventing us from upgrading (from 0.6.4.2) to any later version.

We would like to sponsor a fund to support somebody who can make a patch restoring the scheduler feature for ZVOLs in 0.7.X. Anyone who is interested, please contact me at samuel.xhu@gmail.com. The patch may or may not be accepted by the ZFS maintainers, but we would like to pay for the work.

@samuelxhu (Author) commented Mar 5, 2019

@kpande Thanks a lot for pointing out the related previous commits; I will have a careful look at them and try to find a temporary remedy for excessive RMWs.

I notice that previous zvol performance testing focused primarily on 4KB or 8KB ZVOLs; perhaps that is the reason the RMW issue was less visible and thus RMWs were ignored by many eyes.

Let me explain a bit why a larger-blocksize ZVOL still makes sense and should not be ignored: 1) to enable the use of LZ4 compression together with RAIDZ(1/2/3) to gain storage space efficiency; 2) to strike a balance between IOPS and throughput, and 32KB seems to be good for VM workloads since it is not-so-big and not-so-small either; 3) we have server-side flash caches (flashcache, bcache, enhanceIO, etc.) implemented on all application servers, which absorb random 4KB writes and then issue contiguous (semi-sequential) IO requests of 4KB or other small sizes, anticipating that the backend block devices (iSCSI/FC ZVOLs) will do IO merging/sorting.

In my humble opinion, eliminating the scheduler code from ZVOL really causes RMW pain for non-4KB ZVOLs, perhaps not for everyone, but at least for some ZFS fans.

@samuelxhu (Author) commented Mar 5, 2019

@kpande It is interesting to note that some people are complaining about performance degradation due to commit 37f9dac as well, in #4512.

Maybe it is just a coincidence, maybe not.

Commit 37f9dac may perform well for zvols with direct I/O, but there are many other use cases which suffer performance degradation due to the missing scheduler behavior (merging and sorting of IO requests).

@shodanshok (Contributor) commented Mar 5, 2019

It seems #361 basically covers the problem explained here.

Rather than using the pagecache (with its double-caching and increased memory pressure on ARC), I would suggest creating a small (~1M), front "write buffer" to coalesce writes before sending them to ARC.

@behlendorf @ryao any chances to implement something similar?

@samuelxhu (Author) commented Mar 5, 2019

@shodanshok good finding!

Indeed, #361 deals essentially with the same RMW issue as here. It was opened in 2011, at which time ZFS practitioners could at least use the deadline/noop scheduler (before 0.6.5.X) to alleviate the chance of RMWs. In #4512, a few ZFS users complained about significant write amplification right after the scheduler was removed, but for unknown reasons the RMWs did not receive attention.

Given so much evidence, it seems to be the right time to take serious efforts to solve this RMW issue for ZVOLs. We volunteer to take responsibility for testing, and if needed, funding sponsorship of up to 5K USD (from Horeb Data AG, Switzerland) is possible for the code developer (if multiple developers are involved, behlendorf please divide it).

@samuelxhu (Author) commented Mar 6, 2019

@kpande Only for database workloads do we have aligned IO on ZVOLs, and unfortunately I do not observe significant performance improvement after 0.6.5.x. The reason might be that our ZFS boxes universally have high-end CPUs and plenty of DRAM (256GB or above), so saving a few CPU cycles does not have a material impact on IO performance. (The bottleneck is definitely the HDDs, not CPU cycles or memory bandwidth.)

Most of our workloads do not have aligned IO, such as hosting VMs and FC/iSCSI LUNs backed by ZVOLs, where the frontend applications generate mixed workloads of all kinds. Our engineering team currently focuses on fighting RMWs, and I think either #361 or #4512 should already show sufficient evidence of the issue.

Until ZVOLs have an effective IO merging facility, we plan to write a shim-layer block device that sits in front of ZFS to enable IO request sorting and merging and reduce the occurrence of RMWs.

@behlendorf (Contributor)

@samuelxhu one thing I'd suggest trying first is to increase the dbuf cache size. This small cache sits in front of the compressed ARC and contains an LRU of the most recently used uncompressed buffers. By increasing its size you may be able to mitigate some of the RMW penalty you're seeing. You'll need to increase the dbuf_cache_max_bytes module option.
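For anyone trying this, dbuf_cache_max_bytes is an ordinary ZFS module parameter; a sketch of setting it at runtime (the 1 GiB value is only an example, and the default depends on the ZFS version):

echo 1073741824 > /sys/module/zfs/parameters/dbuf_cache_max_bytes

# or persistently, via /etc/modprobe.d/zfs.conf:
#   options zfs dbuf_cache_max_bytes=1073741824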

Until ZVOLs have an effective IO merging facility, we plan to write a shim-layer block device that sits in front of ZFS to enable IO request sorting and merging and reduce the occurrence of RMWs.

You might find you can use one of Linux's many existing dm devices for this layer.
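For example (a sketch with placeholder names), a plain dm-linear device can be stacked on top of a zvol; note, though, that as discussed later in this thread, device-mapper targets are bio-based as well, so this alone may not bring back scheduler-level merging:

ZVOL=/dev/zvol/tank/vol
dmsetup create zvol-shim --table "0 $(blockdev --getsz $ZVOL) linear $ZVOL 0"
# the LUN would then be exported from /dev/mapper/zvol-shim instead of the zvol itself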

Improving the performance of volumes across a wide variety of workloads is something we're interested in, but haven't had the time to work on. If you're interested, rather than implementing your own shim layer I'd be happy to discuss a design for doing the merging in the zvol implementation. As mentioned above, the current code depends on the DMU to do the heavy lifting regarding merging. However, for volumes there's nothing preventing us from doing our own merging. Even just front/back merging, or being aware of the volume's internal alignment, might yield significant gains.

@richardelling (Contributor)

In order to merge you need two queues: active and waiting. With the request-based scheme there is one queue with depth=zvol_threads. In other words, we'd have to pause I/Os before they become active. This is another reason why I believe merging is not the solution to the observed problem.

@shodanshok (Contributor) commented Mar 7, 2019

@richardelling From my tests, it seems that DMU merging at writeout time is working properly.
What kills the performance of smaller-than-recordsize writes (ie: 4k on a 128K recordsize/volblocksize), for both zvols and regular datasets, is the read part of the r/m/w behavior. Basically, when a small write (ie: 4k) is buffered by the ARC, the whole 128K record has to be brought into memory, irrespective of whether later writes overlap (and completely account for) the whole record.

Hence my idea of a "front buffer" which accepts small writes as they are (irrespective of the underlying recordsize) and, after having accumulated/merged some data (say, 1 MB), writes them via the normal ARC buffering/flushing scheme. This would emulate what the pagecache does for regular block devices, without the added memory pressure of a real pagecache (which cannot be limited in any way, if I remember correctly).

I have no idea whether this can be implemented without lowering ZFS's excellent resilience, or how difficult doing it would be, of course.

@samuelxhu (Author)

@behlendorf thanks a lot for the suggestions. It looks like front merging can easily be turned on by reverting commit 5731140, but extensive IO sorting/merging inside the ZVOL/DMU may take more effort. I may not be capable of coding much myself, but I would like to contribute testing, or help in other ways, as much as possible.

@zviratko commented Apr 3, 2019

Just to chime in - we use ZFS heavily with VM workloads, and there is a huge tradeoff between using a 128KiB volblocksize and a smaller one. Higher volblocksizes actually perform much better up to the point where throughput is saturated, while smaller volblocksizes almost always perform worse but don't cause throughput problems. And I found it quite difficult to actually predict/benchmark this behaviour, because it works very differently on new unfragmented pools, new ZVOLs (no overwrites), different layers of caching (I am absolutely certain that the Linux pagecache still does something with ZFS, as I'm seeing misses that never hit the drives), and various caching problems (ZFS doesn't seem to cache everything it should or could in the ARC).

This all makes it very hard to compare the performance of ZFS/ZVOLs to any other block device, it makes it hard to tune, and it makes it extremely hard to compete with "dumb" solutions like mdraid when performance is all over the place.

If there is any possibility to improve merging to avoid throughput saturation, then please investigate it. The other solution (to the problems I am seeing in my environment) is to fix performance issues with smaller volblocksizes, but I guess that will be much more difficult, and I have seen it discussed elsewhere multiple times already (like ZFS not being able to use vdev queues efficiently when those vdevs are fast, like NVMe, where I have rarely seen a queue size >1).
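For what it's worth, a rough way to compare volblocksizes under an identical load might look like the sketch below (pool name, sizes, and fio options are arbitrary), keeping in mind the caveat above that fresh zvols behave differently from aged ones:

for vbs in 8k 16k 32k 64k 128k; do
    zfs create -V 20G -o volblocksize=$vbs tank/bench-$vbs
    fio --name=bench-$vbs --filename=/dev/zvol/tank/bench-$vbs \
        --rw=randwrite --bs=4k --iodepth=16 --ioengine=libaio \
        --direct=1 --time_based --runtime=60
    zfs destroy tank/bench-$vbs
done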

@janetcampbell commented Apr 6, 2019

We did a lot of experimentation with ZVOLs here and I'd like to offer a few suggestions.

  1. RMW can come from above you as well as from within ZFS. Depending on what parameters you're using on your filesystem and what you set for your block device, you can end up with either the VM subsystem or user land thinking that you have a large minimum IO size, and they will try to pull in data from you before they write out.

With zvols, always always always blktrace them as you're setting up to see what is going on. We found that some filesystem options (large XFS allocsize=) could provoke RMW from the pager when things were being flushed out. If you blktrace and see reads for a block coming in before the writes do, you are in this situation.

  2. Proper setup is essential and "proper" is a matter of perspective. Usually it's best to configure a filesystem as though it was on a RAID stripe either the size of the volblocksize, or half that size. The reason you might choose a smaller size is if you are on a pool with no SLOG and you want all FIO writes to the zvol to go to ZIL blocks instead of indirect sync, as large block zvols do with full-block writes. Or, you may want to refactor your data into larger chunks for efficiency or synchronization purposes.

  3. Poor inbound IO merge. It's best to configure a filesystem on a zvol to expose a large preferred IO size to applications, allowing FIO to come through in big chunks.

  4. Always use primarycache=all.

  5. If you use XFS on zvols, use a separate 4K volblocksize ZVOL for XFS filesystem journaling. This can be small, 100MB is more than enough. This keeps the constant flushing that XFS does out of your primary ZVOL, and allows things to aggregate much more effectively.

Here's an example:

zfs create -V 1g -o volblocksize=128k tank/xfs
zfs create -V 100m -o volblocksize=4k tank/xfsjournal

mkfs.xfs -s size=4096 -d sw=1,su=131072 -m crc=0 -l logdev=/dev/zvol/tank/xfsjournal /dev/zvol/tank/xfs
mount -o largeio,discard,noatime,logbsize=256K,logbufs=8,logdev=/dev/zvol/tank/xfsjournal /dev/zvol/tank/xfs /somewhere

largeio + large stripe unit + separate XFS journal has been the winning combination for us.

Hope this helps.

@samuelxhu (Author) commented Apr 6, 2019 via email

@janetcampbell commented Apr 6, 2019

Just to chime in - we use ZFS heavily with VM workloads, and there is a huge tradeoff between using a 128KiB volblocksize and a smaller one. Higher volblocksizes actually perform much better up to the point where throughput is saturated, while smaller volblocksizes almost always perform worse but don't cause throughput problems.

A little gem I came up with that I haven't seen elsewhere...

Large zvols cause more TxG commit activity. The big danger from this is RMW reads, which can stomp on other IO that's going around.

Measure TxG commit speed. Open the ZIO throttle. Then, set zfs_sync_taskq_batch_pct=1 and do a TxG commit. Raise it slowly until TxG commit speed is a little slower than it was before the test. This will rate limit the TxG commit and the RMW reads that come off of it, and also can help I/O aggregation. I came up with this approach when I developed a remote backup system that went to block devices on the far side of a WAN.
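If you want to experiment with this, the knob mentioned above is an ordinary module parameter; a sketch (the value 1 is only the starting point described above, to be raised gradually; zpool sync is available in recent ZFS releases and is one rough way to force and time a commit):

echo 1 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct

# force a txg commit so its duration can be timed
time zpool sync tank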

With this you can run long intervals between commits and carry plenty of dirty data, which helps reduce RMW. Once you set the sync taskq, turn the ZIO throttle on and adjust it to just before where it starts to have an effect. This will match these two parameters to the natural flow of the system. At this point you can usually turn aggregation way up and drop the number of async writers some.

Oh, and make sure your dirty data write throttle is calibrated correctly and has enough room to work. ndirty should stabilize in the middle of its range during high throughput workloads.

We mostly use 128K-256K zvols. They work very well and beat out ZPL mounts for MongoDB performance for us. Performance is more consistent than ZPL mounts provided you're good to them (don't do indirect sync writes with a small to moderate block size zvol unless you don't care about read performance).

@janetcampbell commented Apr 7, 2019

I realized there are a lot of comments here that are coming from the wrong place on RMW reads and how ZFS handles data going into the DMU and such. Unless in the midst of a TxG commit, ZFS will not issue RMW reads for partial blocksize writes unless they are indirect sync writes, and you can't get a partial block indirect sync write on a ZVOL due to how zvol_immediate_write_size is handled. Normally the txg commit handles all RMW reads when necessary at the start of the commit, and none happen between commits.

The RMW reads people are bothered by are actually coming from the Linux kernel, in fs/buffer.c. Here's a long winded explanation of why and how to fix it (easy with ZVOLs):

#8590

With a 4k superblock inode size you can run a ZVOL with a huge volblocksize, txg commit once a minute, and handle tiny writes without problem. Zero RMW if all the pieces of the block show up before TxG commit.

Hope this helps.

@shodanshok (Contributor)

@janetcampbell while I agree that a reasonably sized recordsize is key to extracting good read performance, especially from rotating media, I think you are missing the fact that RMW can and will happen very early in the write process, as early as when the write buffer is accepted into the DMU. Let me give a practical example:

# create a 128K recordsize test pool
[root@singularity ~]# zfs create tank/test
[root@singularity ~]# zfs get recordsize tank/test
NAME       PROPERTY    VALUE    SOURCE
tank/test  recordsize  128K     default

# create a 1GB test file and drop caches
[root@singularity ~]# dd if=/dev/urandom of=/tank/test/test.img bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.94275 s, 217 MB/s
[root@singularity ~]# sync
[root@singularity ~]# echo 3 > /proc/sys/vm/drop_caches

# rewrite some sequential 4k blocks
[root@singularity ~]# dd if=/dev/urandom of=/tank/test/test.img bs=4k count=1024 conv=notrunc,nocreat
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 1.05854 s, 4.0 MB/s

# on another terminal, monitor disk io - rmw happens
[root@singularity ~]# zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        92.0G  1.72T      0      4  2.16K   494K
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      3      0   511K      0
tank        92.0G  1.72T     27      0  3.50M      0
tank        92.0G  1.72T      0    169      0  9.35M
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0

# retry the same *without* dropping the cache
[root@singularity ~]# dd if=/dev/urandom of=/tank/test/test.img bs=4k count=1024 conv=notrunc,nocreat
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0306379 s, 137 MB/s

# no rmw happens
[root@singularity ~]# zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        92.0G  1.72T      0      4  3.07K   489K
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0     61      0  7.63M
tank        92.0G  1.72T      0    104      0  1.78M

Please note how, on the first 4k write test, rmw (with synchronous reads) happens as soon as the write buffers are accepted in the DMU (this is reflected by the very low dd throughput). This happens even if dd, being sequential, completely overwrites the affected zfs records. In other words, we don't really have a merging problem here; rather, we see io amplification due to rmw. Merging at writeout time is working correctly.

The second identical write test, which is done without dropping the cache, avoids the rmw part (especially its synchronous read part) and shows much higher write performance. Again, merging at write time is working correctly.

This is, in my opinion, the key reason why people say ZFS needs tons of memory to have good performance: being so penalizing, reducing the R part of rmw by using a very large ARC can be extremely important. It should be noted that L2ARC works very well in this scenario, and it is the main reason why I often use a cache device even on workloads with a low L2ARC hit rate.

@shodanshok (Contributor)

@DemiMarie based on my tests, no: the overlaying device mapper will not expose any IO scheduler, negating early IO merging. That said, the real performance killer is the synchronous read IOs needed for a partial record update. To somewhat mitigate that, you can use ZVOLs while avoiding O_DIRECT file IO (ie: using the Linux pagecache as an upper, coalescing buffer); however, this means double caching and possibly some bad (performance-wise) interaction between the pagecache and the ARC.

@DemiMarie

@DemiMarie based on my tests, no: the overlaying device mapper will not expose any IO scheduler, negating early IO merging. That said, the real performance killer is the synchronous read IOs needed for a partial record update. To somewhat mitigate that, you can use ZVOLs while avoiding O_DIRECT file IO (ie: using the Linux pagecache as an upper, coalescing buffer); however, this means double caching and possibly some bad (performance-wise) interaction between the pagecache and the ARC.

The use-case I am interested in is using ZFS in QubesOS, which means that the zvols are being exposed over the Xen PV disk protocol. Not sure what the best answer is there. Is @sempervictus’s patch a solution?

@filip-paczynski commented Apr 12, 2021

@DemiMarie I might be entirely wrong about this, but in your use case:

  • Xen by default doesn't use O_DIRECT. One has to add the direct-io-safe parameter to the VBD definition (https://xenbits.xen.org/docs/unstable/man/xl-disk-configuration.5.html#direct-io-safe)
  • If a VM is run on top of a ZVOL, then this VM has its own FS, its own scheduler for the VBD, and also a pagecache. Therefore the VM might merge IOs on its own, but I am not sure whether this actually happens.
  • In my experience block size is a very important parameter for ZVOL performance. Anything smaller than 16k should be avoided. Also, if using anything larger than 4k, one has to be careful to mkfs properly and also to specify some flags in the VM's fstab/rootflags boot param (eg: for XFS this means largeio and swalloc); a rough example follows below.
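A minimal sketch of that last point, assuming a 16k volblocksize and XFS inside the guest (all names and sizes are placeholders):

# tell XFS about the zvol's "stripe" geometry at mkfs time
zfs create -V 20G -o volblocksize=16k tank/vm-disk
mkfs.xfs -d sw=1,su=16384 /dev/zvol/tank/vm-disk

# inside the VM, mount with the flags mentioned above, e.g. in fstab:
#   /dev/xvda1  /data  xfs  defaults,largeio,swalloc  0 0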

@liyimeng

The SpectraLogic DMU rework that Matt Macy is attempting to upstream again could be the solution to this.

Is this done? Where is the PR?

@sempervictus (Contributor)

@liyimeng it's not done - I revived the PR (#12166) a while back but have zero time to do free work right now. Please feel free to complete the merge and validation work on that, though.

@tonyhutter (Contributor)

I would encourage all zvol users to test drive my block multi-queue PR here: #12664. You could see pretty big performance improvements with it, depending on your workload.

@mailinglists35

Lazy person asking: how far is that PR from being merged into main?

@sempervictus (Contributor)

@mailinglists35 - hard to tell; even otherwise complete PRs sometimes hang out in the queue for a while as other things are implemented in master. It's in the testing phase though, so closer to it than otherwise :).

@DemiMarie

Would it be possible to use the kernel’s write IO merging layer?

stale bot commented Nov 9, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

The stale bot added the "Status: Stale" (no recent activity) label on Nov 9, 2022.
@DemiMarie

Still an issue

@behlendorf added the "Bot: Not Stale" label and removed the "Status: Stale" label on Nov 9, 2022.
@tonyhutter (Contributor)

Long story short:

  1. Use O_DIRECT if possible

or

  2. Try enabling block multiqueue: zvol: Support blk-mq for better performance (updated) #13148 (example below)

See my "zvol performance" slides:
https://drive.google.com/file/d/1smEcZULZ6ni6XHkbubmeuI1vZj37naED/view?usp=sharing
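For reference, blk-mq support in that PR is controlled by a module parameter; if I read it correctly the tunable is zvol_use_blk_mq, and it needs to be set before the zvol devices are created (a sketch):

# at module load time, e.g. in /etc/modprobe.d/zfs.conf:
#   options zfs zvol_use_blk_mq=1

# or at runtime, before importing the pool / creating the zvols:
echo 1 > /sys/module/zfs/parameters/zvol_use_blk_mq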

@sempervictus (Contributor)

@tonyhutter - thanks for the slide deck. How do those graphs look after a ZVOL has taken enough writes to have allocated all blocks at least once before freeing? The performance degradation over use is a very unpleasant effect in real world scenarios which benchmarks don't capture.

@MichaelHierweck commented Nov 11, 2022 via email

@tonyhutter (Contributor)

@sempervictus unfortunately I didn't test that. All tests were on relatively new zvols.

@MichaelHierweck that's the plan, but I'm still undecided on whether blk-mq should be enabled by default or not. It's not a clear performance winner in all use cases. Currently blk-mq is not enabled by default in the master branch.

@shodanshok (Contributor)

@tonyhutter thanks for the slides.

However, I am not sure using O_DIRECT would be enough to avoid all performance degradation, such as the one described by @sempervictus and detailed here: #8472 (comment) (where I am not using O_DIRECT but, by writing to a plain ZFS file, I still avoid the Linux pagecache).

In these cases the real performance killer is the r/m/w caused by reading the to-be-overwritten block. I think the solution is the (non-merged) async DMU patch, or am I missing something?

@sempervictus (Contributor)

The async-DMU patch isn't just unmerged, it's rather unfinished. Getting it across the finish line likely requires a fair chunk of dedicated time from one of the senior folks familiar with a wide swath of the code, or a lot of time from a skilled C developer who would need to figure out all of the structures, functions, and locking semantics that effort impacts.

@DemiMarie

How much time do you think it would take, @sempervictus?

@sempervictus (Contributor)

@DemiMarie - a professional C/C++ developer colleague took a cursory pass, and since he's not from ZFS-land, the complexity of handling async dispatch in C was compounded by the complexity of ZFS itself, which anyone from outside the core team would need to learn along the way. He found a few bugs in the async pieces just on that first pass - he's quite good 😄, but that likely indicates that we're not at first down with 20y to go... We figured at least several weeks of professional effort (not cheap), but had no way to get an upper bound for that (the consideration at the time was to hire him to do this, but folks like that aren't exactly dripping with spare time). So I can't answer that question to anyone's satisfaction for budgeting/hiring/etc.
The level of effort to get into that PR is above the amount of free time I have, or will have in the next year, given we just spun up a totally unrelated development venture; the level of effort to take it to completion is unknown. I don't have tens of thousands of dollars lying around right now to put toward such work, so it's on the back burner on my list until I can either hire/delegate a hired resource, contract one, or have the time and headspace to spend on it myself.

@sempervictus (Contributor)

I think the best chance we have at assessing the level of effort for that async DMU piece is to ask one of the heavy-hitting OpenZFS members to slot a review aimed at ascertaining requirements into their workflow over the next X months. I've fallen off somewhat in my awareness of the goings-on around here in the last year or so (not that I don't love y'all, just that infosec is somewhat of an all-consuming function these days); but off the top of my head, I figure that @ryao, @behlendorf, or @tonyhutter have the breadth of code awareness required to be able to partition & delegate efforts for such a review. Personally, I wouldn't be surprised if (given appropriate effort) this becomes the new "ABD patch set" during R&D 😄

@bghira commented Nov 12, 2022

@tuxoko or @dweeezil or @pcd1193182 might be able to take a look? (sorry for volunteering you :) )

@DemiMarie

I wonder if stackless coroutines could help. Those can be implemented in C with some macro abuse. If this were part of the Linux kernel I would suggest using Rust, which supports async programming natively.

@sempervictus (Contributor)

I mostly live in Rust these days, but until the GCC front-end is done, I'm not a big fan of mixing ABIs like that.

@shodanshok (Contributor)

@DemiMarie the main point of the async DMU patch should be to avoid the synchronous read of the to-be-overwritten records, rather than issuing an async callback. In other words, I understand it as a means of deferring some writes to avoid useless reads. Or am I wrong?

@DemiMarie

@DemiMarie the main point of the async DMU patch should be to avoid the synchronous read of the to-be-overwritten records, rather than issuing an async callback. In other words, I understand it as a means of deferring some writes to avoid useless reads. Or am I wrong?

I honestly am not sure what the async DMU patch actually does, but async C code is a known pain-point in general.
