[Performance] Extreme performance penalty, holdups and write amplification when writing to ZVOLs #11407

Closed
Binarus opened this issue Dec 27, 2020 · 64 comments · Fixed by #13148
Labels
Component: ZVOL (ZFS Volumes) · Type: Performance (Performance improvement or performance problem)

Comments

@Binarus

Binarus commented Dec 27, 2020

System information

Type | Version/Name
Linux | Debian
Distribution Name | debian
Distribution Version | buster (10.7)
Linux Kernel | 4.19.0-12-amd64
Architecture | amd64
ZFS Version | OpenZFS 2.0.0
SPL Version | (SPL integrated in OpenZFS 2.0.0)

Describe the problem you're observing

Setup

  • Supermicro X10DRU-i+
  • LSI 9361-8i connected to a 6 / 12 Gbps SATA/SAS backplane
  • 2 x Seagate ST4000NM000A (denoted sda and sdb), connected to the backplane
  • 1 x Seagate ST4000NM0035 (denoted sdc), connected to the backplane
  • 128 GB RAM (ECC, of course)

sda and sdb make up a mirrored ZFS VDEV. The O/S boots from this VDEV. There is only one pool, called rpool. rpool does not contain any other VDEVs besides that mirror. The root file system is mounted to rpool/system.

There is no swap file on that system (yet).

rpool has been created using the following command:

zpool create -o ashift=12 -o altroot=/mnt -O acltype=posixacl -O canmount=off -O checksum=on -O compression=off -O mountpoint=none -O sharesmb=off -O sharenfs=off -O xattr=sa rpool mirror /dev/disk/by-id/ata-ST4000...-part1 /dev/disk/by-id/ata-ST4000...-part1

That is, the pool and the VDEV have ashift=12.

rpool/system has been created using the following command:

zfs create -o aclinherit=passthrough -o acltype=posixacl -o atime=on -o canmount=on -o checksum=on -o compression=off -o mountpoint=/ -o overlay=off -o primarycache=all -o redundant_metadata=all -o relatime=off -o secondarycache=none -o setuid=on -o sharesmb=off -o sharenfs=off -o logbias=latency -o snapdev=hidden -o snapdir=hidden -o sync=standard -o xattr=sa -o casesensitivity=sensitive -o normalization=none -o utf8only=off rpool/system

We further have created a ZVOL using the following command:

zfs create -b 4096 -o checksum=on -o compression=off -o primarycache=metadata -o redundant_metadata=all -o secondarycache=none -o logbias=latency -o snapdev=hidden -o sync=standard -V 100G rpool/zvol-test

That ZVOL carries an ext4 file system which is mounted on /blob.
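
(The exact partitioning and formatting commands are not listed here; a minimal sketch, assuming a single ext4 partition as described in the comments below, could be:)

    parted -s /dev/zvol/rpool/zvol-test mklabel gpt mkpart primary ext4 0% 100%   # partition the ZVOL
    mkfs.ext4 /dev/zvol/rpool/zvol-test-part1                                     # create the ext4 file system
    mount /dev/zvol/rpool/zvol-test-part1 /blob                                   # mount it on /blob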

sdc contains a partition with a normal ext4 file system which is mounted on /mnt. That file system just contains several dozen ISO files (average size about 6 GB).

On that machine, nothing runs other than the standard services the distribution installs. Notably, there is no VM running and nothing else which could produce substantial workload.

In this state, when starting watch -n 1 zpool iostat -ylv 1 1 and watching it for a while, there is indeed nearly no load on the ZFS disks. Every few seconds some kilobytes hit the VDEV, which is expected.

Copying to the dataset (not the ZVOL): No problem

Now we open the iostats in one terminal window (watch -n 1 zpool iostat -ylv 1 1) and start to copy ISO files from sdc onto the ZFS dataset rpool/system in another terminal window (rsync --progress /mnt/*.iso ~/test, where ~/test is part of the root file system and thus is on rpool/system).

While the copy runs, rsync shows a few drops in bandwidth every now and then, but there are no noticeable holdups, and the drops in bandwidth are short. Likewise, zpool iostat shows that the two disks in the VDEV are hit with data rates which could be expected. The changes in disk load reported by zpool iostat are surprisingly high, though (the load constantly jumps between something like 30 MB/s and 300 MB/s), but there are no real holdups either. In summary, the copy on average runs at over 100 MB/s and does not stall for a longer time.

We have interrupted that test after 30 GB or so because we didn't expect anything new from letting it run longer. However, we repeated it several times, each time copying other ISO files, and each time rebooting before. The behavior was the same each time.

Copying to the ZVOL: Problem

When we do exactly the same thing, but copy to the ZVOL instead of the dataset (rsync --progress /mnt/*.iso /blob), the situation changes. rsync initially shows the copy running at roughly 190 MB/s for a few seconds, then it stalls. Thereafter it continues copying for a few seconds at the rate denoted above, then stalls again after a few seconds, and so on.

The problem is that the holdups last for a long time where absolutely nothing happens, up to several minutes (!). However, zpool iostat shows that the two ZFS disks are under heavy load during this time, constantly (more or less) being hit with over 100 MB/s. Even when we interrupt copying by hitting Ctrl-c in the terminal window where rsync runs, this high load lasts for several minutes until everything returns to normal.

There must be extreme write amplification somewhere, the amplification factor being somewhere between 5 and 10. For example, if we copy 40 GB that way, this would normally take about 5 minutes. But actually it takes at least half an hour, although the ZFS disks are under heavy load all that time.

For that reason, ZVOLs are currently just not usable for us, which imposes a major problem. What could be going on there?

Our own thoughts and what we have tried already:

At first, we'd like to stress again that the ZVOL test did not happen within a VM. The problem is definitely not due to QEMU or (para)virtualization of data transfer.

Secondly, I am aware that it might not be the best idea to have ZFS running on disks which are attached to a RAID controller like the LSI 9361-8i, or to have it running on hybrid disks like the ones we have. However, we have configured that controller to JBOD mode, and the O/S sees the disks as individual ones as expected. But the ultimate key point regarding possible hardware problems is that copying large amounts of data to the ZFS dataset (rpool/system) works as expected. If the problems with the ZVOL were due to hardware, we would have the same problems with the dataset; this is not the case, though.

Thirdly, the problem is not due to ZFS versions. Debian buster comes with ZoL 0.7.12, and we originally have noticed the problem there. We desperately need ZVOLs working, so we have installed OpenZFS 2.0.0 on that machine, which did change exactly nothing with respect to that problem.

As a further test, we created the ZVOL with volblocksize=512 and did the tests again. Again, nothing changed. We repeated the process with volblocksizes of 8192, 16384 and 128k. Again, no luck: Maybe it stalled a few seconds earlier or later, longer or shorter in each test compared to the others, but the general situation remained the same. Between the stalls, the copy ran a few seconds with expected speed, then it stalled for a lot of seconds, mostly even a few minutes while iostat was showing a constant data rate of roughly 100 MB/s for each disk, and so on. After interrupting the copy, both ZFS disks continued to be hit with a data rate of 100 MB/s or more for several minutes.

Then we tested the ZVOL with sync=disabled. That didn't change anything. The same goes for primarycache=all (instead of metadata) (but at least this was expected), and for logbias=throughput (instead of latency).

Next, we thought that it may have something to do with the physical sector size of the ZFS disks being 512 bytes, while the pool (and the VDEVs) had ashift=12. Therefore, we destroyed the pool, re-created it with ashift=9, re-created all file systems / datasets as described above, and did all tests again. Once again, this didn't change anything.

We then went back to the original pool with ashift=12 and used it for the further tests. At this point, we were out of ideas what to do next, so we read about the ZFS I/O scheduler and tested a large number of combinations of zfs_dirty_data_max, zfs_delay_scale, zfs_vdev_async_write_max_active, zfs_vdev_async_write_min_active, zfs_vdev_async_write_active_max_dirty_percent, and zfs_vdev_async_write_active_min_dirty_percent.

To our surprise, the last five of these barely influenced the behavior. However, the first one (zfs_dirty_data_max), which originally was set to 4 GB, changed the situation when we set it to a low value, e.g. 512 MB. The improvement was that there were fewer long-lasting holdups: there were even more holdups, but all of them were so short that it became acceptable. However, the average data rate did not increase, because now the transfer rate rsync reported was limited to about 30 MB/s, mostly hanging around 10 MB/s or 20 MB/s. There were no phases with high data rates any more.

So the copying was more "responsive" with low values of zfs_dirty_data_max, but that didn't help because the data rate per se was drastically limited. In summary, changing the I/O scheduler parameters listed above did not lead anywhere.
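
For reference, these tunables live under /sys/module/zfs/parameters and can be changed at runtime; a hedged sketch (the values are examples only, not recommendations):

    echo $((512 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max    # cap dirty data at 512 MB
    cat /sys/module/zfs/parameters/zfs_delay_scale                                  # inspect the current delay scale
    cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent    # inspect the write throttle thresholds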

The last thing we were looking into was zfs_txg_timeout. Setting it to a lower value didn't improve the situation with copying to the ZVOL (but increased the load which hit the ZFS disks when the system was completely idle). Setting it to a higher value didn't improve copying either (but reduced the load on the ZFS disks when the system was idle).

Now we are completely out of ideas. We probably could look into other parameters of the ZFS module (/sys/module/zfs/parameters) or the disk drivers (/sys/block/sdx). But this would be just wild guessing and a waste of time. Therefore, we are hoping that somebody is willing to give us some hints.

What we did not try, and why not

zfs_arc_max is set to 4 GB on that system, and we did not test larger values for the following reasons:

  1. That parameter is about reading, not writing, and the copy source is an ext4 partition of a physical disk, so no ZFS parameter would have any effect on the copy source.
  2. We clearly have a problem with writing here, not with reading (remembering that copying to the normal dataset (not the ZVOL) works normally).
  3. When we began working with ZFS some years ago, the first thing we had to solve was a system which started normally, but then became totally unresponsive and finally totally locked up within minutes. The cause of that problem was that ZFS was eating up all available RAM for its ARC cache until the machine crashed or hung. Since then, we always limit the ARC size (and never ever had any stability issues or crashes with ZFS again).
  4. Our goal is to run a bunch of VMs with ZVOL storage (the tests described above are just, eehm, tests before we put even more effort into switching completely to ZFS). The number of VMs and the memory they will be given is precisely known. It would not make any sense to test larger ARC sizes, because the ARC size at the end of the day couldn't be much larger than 4 GB.

We did not try to use a secondary cache (L2ARC). Again, the copy source is not on ZFS, and therefore this wouldn't make any sense, and furthermore, we have a writing problem here, not a reading problem.

We did not try to use an SLOG. This would not make any sense, because one of our tests was to set sync=disabled on the copy destination ZVOL, and this did not change the slightest bit in the behavior observed. Therefore, we know that our problem is not due to sync writes, and thus, an SLOG wouldn't improve the behavior.

Describe how to reproduce the problem

Install a system similar to the one described above, issue the commands described above, and watch the long-lasting holdups in the terminal window where rsync runs and the heavy disk load zpool iostat shows in the other terminal window, leading to high disk wear and low bandwidth.

Since it is not easy to set up a system like ours, we are willing to give remote access to one of these systems if somebody is interested in investigating the problem. In this case, please leave a comment, stating how we can get into contact.

Include any warning/errors/backtraces from the system logs

If somebody tells us what exactly is needed here, we'll immediately do it :-). We guess zpool iostat or other tools produce output which is more valuable than the log files, but being neither Linux nor ZFS experts, we are a bit lost here. Notably, we don't know how to operate dtrace or strace properly. If somebody tells us what to do, we'll try our best.

@Binarus Binarus added the Status: Triage Needed (New issue which needs to be triaged) and Type: Defect (Incorrect behavior, e.g. crash, hang) labels Dec 27, 2020
@Binarus Binarus changed the title [Performance] Extreme performance penalty, stalling and write amplification when writing to ZVOLs [Performance] Extreme performance penalty, holdups and write amplification when writing to ZVOLs Dec 27, 2020
@IvanVolosyuk

You might have to decide what you value more - consistency or throughput. Decreasing zfs write buffers (zfs_dirty_data_max) will give you more consistency - consistently bad write speed with lots of TXGs.

If you want throughput, you don't want to measure it using minimum write speed during the write operation or seconds without visible progress (effectively what you seem to be doing). Instead I would suggest to pick a smaller test set, but write it fully and measure the time it took for completion.
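
A hedged sketch of such a measurement (the test set path is a placeholder):

    time ( rsync --progress /mnt/testset/*.iso /blob/ && sync )   # total wall-clock time of a bounded copy, including the final flush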

For throughput I would pick larger volblocksize - default 8k should work better, keep default zfs_dirty_data_max, enable compression - it will benefit slow disks, sync=disabled to avoid wasting disk time on zil, primarycache=all to cache ext4 metadata.
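
Expressed as a volume creation command, those suggestions might look roughly like this (a sketch only, reusing the volume name from this issue):

    zfs create -b 8192 -o compression=lz4 -o sync=disabled -o primarycache=all -V 100G rpool/zvol-test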

Because of the in-memory write cache, initial writes will look faster and then stall when the in-memory buffers hit the hard limit while the disks are busy writing back the data to free space in RAM for more dirty data. But if you measure the total copy operation time, this information is irrelevant for the overall throughput number you get.

Also, there should be minimal write amplification when you copy large ISO files.

Personal notes: with qemu you can use raw files, which can give slightly better performance than zvol in some cases. I use:
-drive file=/somefile.img,format=raw,id=disk,if=none,cache=none,aio=threads,discard=unmap

@Binarus
Author

Binarus commented Dec 28, 2020

@IvanVolosyuk Thank you very much for your help.

Your first comment (which you have removed) proposed to use logbias=throughput instead of logbias=latency. We have to apologize that we forgot to mention that we already had tested this setting, too, but it didn't change anything. I'll add this in the description of the issue, i.e. to our first post.

With respect to your other proposals, I guess I'll have to explain some specialties regarding our setup and goals:

You might have to decide what you value more - consistency or throughput.

We understand that there always is a trade-off. But what we are talking about here is a decision between holdups of several minutes and extreme disk wear caused by extreme write amplification on one side and a data rate of 5% to 10% of what the hardware is able to provide on the other side. Trade-offs of such magnitude are in no way acceptable.

If you want throughput, you don't want to measure it using minimum write speed during the write operation or seconds without visible progress (effectively what you seem to be doing).

The problem here is that the holdups are taking so long and are putting the disks under such heavy load that the system cannot be used reasonably during that time. In my first post, I have explained that we did the tests without any VMs running. This is true, because we wanted to rule out any issues related to QEMU, KVM and block device drivers.

However, of course, we also did additional tests with running VMs. It turned out that the bad situation, notably the holdups, could be provoked by just copying large files from the third disk to the VM ZVOL storage in a VM, and that these holdups made the other VMs freak out - they just couldn't write data to their (virtual) storage as needed because they hit timeouts after one or two minutes.

This basically means that you can have only one VM on your server (provided you want it to be on a ZVOL), unless you are willing to put your data at risk.

Instead I would suggest to pick a smaller test set, but write it fully and measure the time it took for completion.

The outcome would be interesting. However, it is not our use case; more precisely, we have important other use cases. One of them (and an important one) is to have multiple VMs running on ZVOLs, where at least one VM will be used to copy large amounts of data scattered across a few files from a third disk to virtual ZVOL storage. It is completely not acceptable to have the other VMs freak out then, putting their data at risk.

Therefore we first have to solve the problems described in the first post before we proceed.

I have mentioned VMs solely to explain our issue in greater depth, and why it definitely is a show stopper in our case. Still, I would like to keep VM related problems out of this discussion, because the issue exists without any VM running, and adding VMs surely won't improve the situation.

For throughput I would pick larger volblocksize

This is one of the tests we did and have described. We have tested 512 bytes, 4096 bytes, 8192 bytes, 16384 bytes and 128k. None of these settings made the behavior change in any way.

keep default zfs_dirty_data_max

Changing it was only for testing. Of course, we started testing with default values.

enable compression

We have compression disabled because (at a later stage) the ZVOL data will be encrypted (from within the VMs). Given that, compression won't do any good.

sync=disabled

We already have tested this setting without noticing any change in behavior (described in the first post I guess).

primarycache=all to cache ext4 metadata

Could you please elaborate? How could a ZFS setting influence reading from the third disk, which is not on ZFS? Did we miss something?

Apart from that, we already had tested primarycache=all, but it didn't change anything either. IMHO, this is expected because primarycache relates to reading data, while we obviously have a problem with writing.

But if you measure the total copy operation time - this information is irrelevant for overall throughput number you get.

Agreed, but the holdups would make VMs freak out (and hence, put data at risk) if we had several VMs running, which will be the case later.

Plus, the extreme write amplification we obviously experience will destroy our disks in no time.

Also, there should be minimal write amplification when you copy large ISO files.

This is exactly what we were convinced of when we began the tests. Imagine our surprise ...

We are seriously thinking of making a video which shows the two terminal windows for 10 minutes or so. Perhaps somebody could make sense of it. We'll first have to look for appropriate screen recording software, though (must be able to record cygwin terminal windows under Windows 10 at a reasonable frame rate (e.g. 10 frames / sec)).

Cheers,

Binarus

@IvanVolosyuk

IvanVolosyuk commented Dec 28, 2020

With primarycache=all for zvol - the filesystem on zvol (ext4 I assumed) will have its metadata cached in ARC.

I reproduced similar behavior with a copy of your settings, and with the set of changes I suggested I got some improvements when copying large files to zvol/ext4.

I think what you missed in your tuning is that the filesystem on the zvol will have a lot of dirty data accumulated, as you have a lot of RAM. It will try to write it back when the Linux kernel decides that it has too many dirty pages. You can tune it down to see if it will help with write consistency, e.g.:
echo $[128 * 1024 * 1024] >/proc/sys/vm/dirty_bytes
This will force aggressive writeback in the filesystem on the zvol.
Try this with the other suggestions I gave before. It made a big difference for write consistency in my setup.

@behlendorf behlendorf added the Component: ZVOL (ZFS Volumes) and Type: Performance (Performance improvement or performance problem) labels and removed the Status: Triage Needed (New issue which needs to be triaged) and Type: Defect (Incorrect behavior, e.g. crash, hang) labels Dec 28, 2020
@devZer0

devZer0 commented Dec 29, 2020

this is not really new to me, I have seen quite a few reports that accessing zvols performs much worse than accessing ordinary files on zfs datasets - and it confirms my own negative experiences with zvols, which also include lockups/stalls etc.

this is the reason why I have completely avoided using zvols on proxmox for quite a while (they are still the default there)

see https://bugzilla.proxmox.com/show_bug.cgi?id=1453 or #10095 for example

@dswartz
Contributor

dswartz commented Dec 29, 2020 via email

@sempervictus
Contributor

ZVOLs are pretty pathologically broken by design. A block device is a range of raw bits, full stop. A zvol is an abstraction presenting such a range but mapping it across noncontiguous space with deterministic logic involved in linearizing it. So architecturally, it inherently will be slower to resolve a request. The fact that request paths elongate with data previously written to the zvol amplifies and exacerbates the poor design.
Atop that, zvols are almost never considered when changes are introduced, with performance regression after regression going into zfs for years and maintainers never having time or interest in the feature. You can find prior issues on this where we've discussed how the entire pipeline isn't even optimized for ssd, much less nvme and the expected-by-design iowait hampers not only the dmu but multiplicatively the zvols atop it. I've got benchmarks from a few years back showing drastic reduction in throughput after a zvol is filled once (write a g to a g-sized disk zvol and then write to it again, will make you sad).
IMO, zvols need a rethink from the ground up to actually be as thin an abstraction atop the bit ranges handled by the DMU as possible, with consistent throughput as a primary design goal and os-native block device interfaces (SG?) to avoid problems like you have today if you map SCST atop a zvol and beat it up with 100 random writers. Problem is, that's a lot of work for a shrinking number of consumers because businesses have moved away from using iscsi zvols for one (iscsi in general but zvols are now known as garbage for business use), and because other tech like ceph actually keeps up with hardware development and optimizes for modern storage busses and media (they dropped zfs as a backing store way back).
Unless the powers that be put their weight behind making zvols a primary member of the ecosystem again, we'll keep seeing issues like this every year. Thanks for filing this one, I'm frankly tired of begging for this to be resolved. Since @ryao left these discussions there's been no real improvement, and in the end the removal of the sg layer may have actually made performance worse (~0.6.4). I tried to force-sync the virtual devices way back to make writes more consistent from db workloads, but really they just need to behave like proper disks with the full range of scheduling parameters and commit/read semantic adjustments from the consumer side and a much thinner/faster underlying implementation. Anyone got a really good storage dev with time on their hands and a hefty budget to fund the work? Semper Victus would be open to joining others from the community to fund a bounty project to un... this mess, if maintainers agree that zvol performance will be a primary consideration when adopting new features or merging commits so the money and effort aren't wasted when draid hits a tag or whatever. If there are any takers, we'd even consider trading an engagement for a completed PR + $0.01 to bind a contract (anyone who contracts red or blue teams knows the cost associated) - feel free to reach out if this sounds appealing and we'll work out a scope.

@sempervictus
Contributor

@Binarus: for use under Qemu/libvirt, we've found that the most consistent throughput is achieved by directly mapping the ZVOL to the VM as a virtio-scsi device a la

      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap' detect_zeroes='off'/>
      <source dev='/dev/zvol/<pool_name>/<path_to_volume>' index='1'/>

This removes a lot of the intermediate buffers/copies, but still works without the ZVOL having proper SG interfaces. It allows the ZFS pipeline to deal with zeroes/compression natively, with AIO (io=native) between the visor and block device.
Still a hack, not a solution, and still experiences the write amp and other issues, but at least thins out the visor interaction so you're not seeing amplification of amplifications.
You may also want to revert 5731140 which i had intended to help with the performance degradation resulting from interactions with the linux-tier write merge and subsequent rewrites of the merged blocks in the ZVOL. If you have a lot of async writes, it should raise the peak performance, but beware of degradation over time and potentially deeper valleys.

@prgwiz

prgwiz commented Dec 30, 2020 via email

@sempervictus
Contributor

By the way, having a SLOG is rather important for both latency and reducing fragmentation over time on sync writes.
I wonder if the new special allocation classes could be used in some way to provide special short codepath areas for ZVOLs...?

@sempervictus
Contributor

@behlendorf @ahrens - with the new special allocation classes targeting specific workloads to VDEVs, could the paradigm be inverted similarly to how small blocks work in order to provide an allocation "arena" on all vdevs to be used for "intended to be constant latency" volume operations? Basically a storage slab with performance-oriented semantics managed by a subset of the DMU dedicated to ZVOLs. It might double as a safer (thinner?) proving ground for strategies and code to be pulled into the full ZPL as well.

@Binarus
Author

Binarus commented Jan 2, 2021

@IvanVolosyuk Thank you very much! Your comment was very helpful.

With primarycache=all for zvol - the filesystem on zvol (ext4 I assumed) will have its metadata cached in ARC.

I see. Thank you. You are right, the ZVOL is formatted with ext4 as well. However, we already had tried this, but to no avail. (I guess I had described it in my first post, which is why I initially didn't get what you meant).

I think what you missed in your tunning is that filesystem on zvol will have a lot of dirty data accumulated as you have a lot of RAM. It will try to write it back when Linux kernel decides that it has too much dirty pages. You can tune it down to see if it will help with the write consistency, e.g.:
echo $[128 * 1024 * 1024] >/proc/sys/vm/dirty_bytes

Now that was a real game changer. Thank you very much for that tip. We will do further research with this and similar settings: It didn't solve the problem yet, because it decreased throughput drastically. But at least, it made the system responsive again when copying large files to the ZVOL. We achieved further improvement by also setting dirty_background_bytes, dirty_writeback_centisecs and a few more to appropriate values.
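
For completeness, a sketch of the page cache writeback knobs mentioned above (the values are illustrative, not tuned recommendations):

    echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes             # hard limit for dirty page cache data
    echo $((32 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes   # start background writeback earlier
    echo 100 > /proc/sys/vm/dirty_writeback_centisecs                  # wake the flusher threads every second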

I hope that we will be able to work out a good balance between throughput and responsiveness, using various parameters.

It didn't occur to us to change these parameters because everybody on the net says that ZFS circumvents the page cache. But I guess that this is true only for reading.

So let's see whether I get this right now:

When an application writes to a file system on a ZVOL (without O_SYNC and O_DIRECT), the data first goes into the normal page cache of Linux, and from there into the ZFS write cache (whose size is set by zfs_dirty_data_max and whose behavior is governed by the ZFS IO scheduler)?

If this is true, it explains the bad writing performance with this constellation, and tuning the six important ZFS IO scheduler parameters doesn't make much sense, because bad behavior will arise from the OS page cache, not the ZFS write cache. Did I get this right?

@Binarus
Author

Binarus commented Jan 2, 2021

@devZer0 @dswartz Thank you very much for your comments.

I already had seen the reports you mentioned. But I had the feeling that OpenZFS 2.0.0 would be a sort of restart or major improvement. I reported the issue again to let the developers know that the new version at least on our systems did not improve ZVOL behavior. In that sense, our report could be considered a confirmation of the issue for OpenZFS 2.0.0.

Furthermore, in one of my next comments, I'll describe another test which we did in the meantime and which is rarely described in other posts.

@Binarus
Author

Binarus commented Jan 2, 2021

@sempervictus Thank you very much for the many valuable comments! A few remarks:

ZVOLs are pretty pathologically broken by design. A block device is a range of raw bits, full stop. A zvol is an abstraction presenting a range but mapping them across noncontiguous space with deterministic logic involved in linearizing it. So architecturally, it inherently will be slower to resolve a request. The fact that request paths elongate with data written prior to the zvol amplifies and exacerbates the poor design.

You are completely right. However, if ZVOLs were usable in any way, we wouldn't care about the flaws in their architecture, and we would have no problem with sacrificing some throughput and responsiveness to use them, because this is technically unavoidable.

But what completely puzzles us is the magnitude of the performance decrease and the fact that writing one large file to one ZVOL stalls the host as well as other ZVOLs and VMs so much that they crash. I'll never get why this happens when writing to ZVOLs, but not when writing to normal datasets. I would have expected that a ZVOL (from ZFS's internal point of view) is just a huge file on the dataset which is presented to the O/S in a special way. Admittedly, there must be some overhead with translating block sizes, dealing with fragmentation etc., but that alone should decrease throughput and responsiveness by let's say 20%, not make the system unusable.

for use under Qemu/libvirt [...]

Thank you very much again - I'll follow that advice.

But currently, we want to keep VM-related issues out of this. Therefore, we are testing with no VMs running. When there is such misbehavior even without a real workload, we really don't need to even think about VMs. I have mentioned VMs only to make clear that this issue is not academic, but can crash VMs (and the host as well, according to other reports), and put their data at risk.

By the way, having a SLOG is rather important for both latency and reducing fragmentation over time on sync writes.

Unless I have missed something, an SLOG wouldn't help us. One of our tests was to set sync=disabled on the ZVOL, but this didn't change anything. Our current tests are very basic; I guess copying large files via rsync produces a very small fraction of sync requests, so we didn't wonder about the outcome of this test.

with the new special allocation classes targeting specific workloads to VDEVs [...]

That sounds interesting. Could you please give us a starting point? Actually, we don't have deep knowledge of ZFS and never heard about that, but we are curious to learn about everything which could help us out.

@sempervictus
Contributor

@Binarus: the performance degradations, or any level of performance quotient, have never really mattered to the project when changes are committed - AFAIK there's no performance testing for ZVOLs in the test suite, and definitely not covering rewrites and long-lived volumes across ZFS revisions. They've been unstable/unusable for production for a long time and appear to have suffered even more in the ZoL -> OpenZFS cycle. My understanding is that the commercial interests behind OpenZFS mostly make their money on database workload optimization, and that they work out of the non-GPL OS' which have their own distinct block-layer semantics (so maybe they don't suck as much on ZVOLs).
If you want to get really upset, see what happens when you make a file in the ZPL, map it to a block device, and observe how much faster a loopback blockdev atop a ZPL file is than a ZVOL.
As far as the SLOG and the comment about using allocation classes for it: ZFS writes transactional blocks every 5s. When sync IOs are issued by consumers in between those 5s, they have to go to disk. So the SLOG absorbs those sync writes and then shunts them out to the storage VDEVs at the 5s flush. With no SLOG, those writes have to be immediately committed to the storage VDEVs, resulting in jitter/stalls.
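
For reference, attaching a dedicated log device is a one-liner (a sketch only; the device path is a placeholder):

    zpool add rpool log /dev/disk/by-id/nvme-EXAMPLE-part1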

@IvanVolosyuk

When an application writes to a file system on a ZVOL (without O_SYNC and O_DIRECT), the data first goes into the normal page cache of Linux, and from there into the ZFS write cache (whose size is set by zfs_dirty_data_max and whose behavior is governed by the ZFS IO scheduler)?

If this is true, it explains the bad writing performance with this constellation, and tuning the six important ZFS IO scheduler parameters doesn't make much sense, because bad behavior will arise from the OS page cache, not the ZFS write cache. Did I get this right?

I would imagine ext4 on top of a zvol should use the page cache even if ZFS doesn't use it. Dirty data in ZFS will grow and shrink, thus changing the amount of RAM available for dirty pages in the page cache. I think this can be the cause of the write speed oscillations we see.

Limiting dirty data in the page cache (/proc/sys/vm/dirty_bytes) should help avoid these oscillations and the uncontrolled growth of dirty pages.

@sempervictus
Contributor

sempervictus commented Jan 3, 2021

Adding a filesystem atop the ZVOL makes this entire mess a lot worse - the ZVOL with that nomerge flag should be bypassing the Linux caches just fine, but the FS atop it will make "decisions" about the block-level interactions and impact how the ZVOL actually absorbs written data and whether it is involved at all in reading data back if the FS atop it has cached the value of the IO request. If you write 2 512b files into the EXT4 layer at the same time/in rapid sequence they will likely be merged into a 1024b iovec to the underlying ZVOL... Considering that many modern FS already have compression and encryption, their use atop ZFS with the same options enabled will therefore double the amount of time spent encrypting, and increase the compression time until the lower-level compressor "gives up" (as LZ4 will with already compressed blocks IIRC), same concept applies to all of the caching and related scheduling.
My suggestion is to either keep the Linux FS' out of this, or separate their testing to a different kernel (like map the block device to a VM or an iSCSI target) so you're not (possibly) looking at Schroedinger's cat dancing on the keyboard.
EDIT: by the way, mapping ZVOLs to SCST's or LIOs loopback interfaces to give them the appearance of an SG device "works" up to a point where the overlaid block device appears to consume more IOs than it can push down to the underlying ZVOL and then "bad things" happen.

@sempervictus
Contributor

In regards to the SLOG metadata class to be reserved on normal VDEVs, could the same sort of allocations be made for the size of the refreservation value? This would allow the offload of snapshot data to normal allocation classes when a snap is created by treating the snapshot action as a type of commit, and keep the rewrite overhead down. You'd have some issues finding free space as you get closer to fill levels on the allocated special spaces, but if the "working set" of slabs is all in that class, you're not having to check blocks for snapped status or prior data as that would all have been offloaded/copied to the "permanent/normal" storage class once a snapshot is made. I think it would also make for a good case to change how the rather weird refreservation prop works - only apply it to the live set, don't reserve "extra space" in snapshots (always thought that to be kinda stupid to be honest, reserving space in snapshots is like padding a squashfs file out to the original disk size), and let it be used to "mean" how much "fast IO" space is reserved.
ping @ahrens re ^^ - is that architecturally feasible? Does it violate tenets of operation I'm not considering? Are there some massive engineering challenges involved in this or just a ton of reasonably straightforward work atop the SLOG allocation class PR?

@ahrens
Member

ahrens commented Jan 4, 2021

I think you're saying that you have created a zvol, put some sort of filesystem on top of it (which? ext4?), and then you copy some files to it using rsync. You see that much more data is written to the ZFS storage (according to iostat or zpool iostat) than is "logically" copied (according to rsync). And it takes much longer than if you copy the files to a zpl filesystem. And you see periodic pauses in writes.

Do I understand the situation properly? If so, it sounds like something has gone wrong between the filesystem (ext4?) on top of the zvol and the zvol layer. That said, I'm surprised that volblocksize=512 plus sync=disabled doesn't fix all of those problems. I guess if ext4 is writing to random offsets, the cost of zfs updating the indirect blocks of the zvol could be substantial. In general I think we would need to understand the pattern of writes to the zvol. You could do a test with dd directly to the zvol (dd of=/dev/zvol/..., no ext4 involved) to see that zvol performance is in general reasonable.

@sempervictus
Contributor

@ahrens - I've filed many issues here before for ZVOLs with no FS on them showing the same issue using pure dd.
Several years ago, the Linux ZVOL implementation was thinned out with the "upper" half removed to reduce complexity and OS-specific implementation, and to hand the IO scheduling down to the ZFS pipeline. However, the internal semantics of ZVOLs do not lend themselves well to constant performance quotient requirements (they slow down, a lot, as they're filled, snapped, etc), and they don't appear to be tested for performance regressions.

@ahrens
Member

ahrens commented Jan 4, 2021

@sempervictus I haven't noticed the problems you mentioned. We are using zvols with the iscsi target with good performance (after #10163).

It doesn't make sense to me that zvol performance would change much if they've been written to or not -- maybe needing to read the indirect block, or read-modify-write if there are partial-block writes? Presence of snapshots should have no impact on performance (same with filesystems). What specifically about the design of zvols needs to be changed to improve performance? Could you point me to the existing issues that describe the problems you're alluding to?

@Binarus
Author

Binarus commented Jan 4, 2021

@sempervictus Thank you very much again for your explanations and ideas.

Far as SLOG, and the comment for using their allocation classes: ZFS writes transactional blocks every 5s. When sync IOs are issued by consumers in between those 5s, they have to go to disk. So the SLOG absorbs those sync writes and then shunts them out to the storage VDEVs when the 5s flush. With no SLOG, those writes have to be immediately committed to the storage VDEVs resulting in jitter/stalls.

Yes, we have understood that. Therefore, we have set sync=disabled on the ZVOL in question. This should rule out any problems, holdups or jitter due to sync writes, shouldn't it? However, it did not change anything.

Adding a filesystem atop the ZVOL makes this entire mess a lot worse [...]

We have come to the same idea. Please see a few paragraphs below regarding new tests we did in the meantime.

@IvanVolosyuk Thanks again.

I would imagine ext4 on top of zvol should use page cache even if ZFS doesn't use it. [...]

We have come to the same conclusion. Please see a few paragraphs below regarding new tests we did in the meantime.

@ahrens Thank you very much for participating here!

Do I understand the situation properly?

Yes, exactly. However, in the meantime, we also thought that putting ext4 on top of a ZVOL is a bad idea for testing, and conducted other tests; please refer to the paragraphs below.

Regarding volblocksize=512: It didn't improve the situation. But even if it did, it would be hard to use, because it blows up a ZVOL by a factor of 1.5. That is, a ZVOL with 1 TB eats up 1.5 TB of disk space. That overhead gets much better with larger volblocksizes.

In the meantime, we did further tests:

The reason that we put ext4 on the ZVOL in question was that we wanted to see whether we could use ZVOLs as a replacement for physical block devices. Therefore, we partitioned it, put an ext4 file system on one of the partitions and conducted the rsync tests. For those tests, we chose large files because it is clear that we can't expect much from spinning disks if we have a lot of small files.

But actually, we are interested in using ZVOLs as VM storage. Once again, we currently don't test with VMs running, or from within VMs, because this would add further variables (QEMU, virtual storage layers and the like). However, several days ago, we realized that we had to employ another test method to simulate the throughput and the latency a ZVOL-backed VM could expect.

Therefore, we now test using dd with oflag=direct and using the ZVOL directly as destination:

dd if=/dev/zero of=/dev/zvol/rpool/zvol-test bs=4K count=2000000 status=progress oflag=direct

and let watch -n 1 zpool iostat -ylv 1 1 run in another terminal window.

This indeed circumvents the page cache (setting /proc/sys/vm/dirty_bytes and its friends to different values does not affect anything) and has improved the situation.

We still experience the disks running under high load for a while after the command above has finished or has been interrupted. We also noticed that the load, while it is high, is still way below what the disks could deliver. Therefore, we are quite sure now that we can further improve the behavior until we are happy with it, by tuning the parameters of the ZFS I/O scheduler.

However, one problem remains: When we start a few VMs (for testing) which are backed by ZVOLs in the same pool, and then run the command shown above, chances are that the VMs freak out. It seems that the load each ZVOL is allowed to put on the disks is not always balanced (which, by the way, leads to a further question which I'll post separately in the discussion forum: how do we tune the I/O scheduler if we have VDEVs with fundamentally different characteristics?). Obviously, putting one ZVOL under load may eventually put the disks under such load that other ZVOLs are starved.

We are currently evaluating whether we could use cgroups to solve this problem.
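
A hedged sketch of what such a cgroup v2 write throttle could look like (the device numbers and the bandwidth cap are placeholders):

    echo "+io" > /sys/fs/cgroup/cgroup.subtree_control              # make sure the io controller is enabled
    mkdir /sys/fs/cgroup/zvol-copy
    echo "230:0 wbps=104857600" > /sys/fs/cgroup/zvol-copy/io.max   # cap writes to the zvol block device at ~100 MB/s
    echo $$ > /sys/fs/cgroup/zvol-copy/cgroup.procs                 # run the copy job from this shell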

To summarize, we have now nearly reached a point where we are happy with the behavior when large files are copied to one ZVOL, circumventing the O/S page cache. There are still holdups which are not nice, but they aren't a show stopper because they are short (usual behavior: throughput first drops to something like 10 MB/s during a few seconds, then copying stalls for a few seconds (but not for minutes), then it continues normally).

Final personal remark regarding ZVOLs vs. normal files as VM storage:

In general, we are not against using regular files instead of block devices as VM storage. In fact, we did that before we switched to ZFS. However, normal ZFS datasets AFAIK still do not honor O_DIRECT, while ZVOLs do. In my personal opinion, it would be a very bad idea to chain VM and host caches, which inevitably happens if O_DIRECT is not honored. Imagine a Windows Server VM with 32 GB RAM hosted on a bare metal server with 128 GB; the writes an application in the VM produces are first accumulated in Windows' internal cache, then go into the host's page cache, then go into the ZFS write cache. That is, you have chained three big caches, each one with its own characteristics and "eigenfrequency", leading to wild oscillations in throughput and latency (and possibly lockups) in the overall system.

You would probably need a PID closed-loop regulator to make this work :-) I am aware that there are people who say "Just test it", and this is a valid point of view. But the problem is that in principle we can't test every eventuality and every combination of circumstances. Hence, we believe that it's better to avoid setups which are theoretically bad from the beginning, even though they might run satisfactorily for a while.

In other words, we'd like to have O_DIRECT supported, because we will run all VMs with QEMU option cache=none (old syntax) or cache.direct=on (new syntax), respectively, and therefore ZVOLs are our only option.
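
For illustration, a hedged sketch of those two QEMU syntaxes (volume path, node names and memory size are placeholders):

    # old-style syntax
    qemu-system-x86_64 -m 2048 -drive file=/dev/zvol/rpool/zvol-test,format=raw,if=virtio,cache=none,aio=native
    # new-style syntax
    qemu-system-x86_64 -m 2048 \
        -blockdev driver=host_device,node-name=disk0,filename=/dev/zvol/rpool/zvol-test,cache.direct=on \
        -blockdev driver=raw,node-name=vmdisk,file=disk0 \
        -device virtio-blk-pci,drive=vmdisk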

@ahrens
Member

ahrens commented Jan 4, 2021

dd if=/dev/zero of=/dev/zvol/rpool/zvol-test bs=4K

Makes sense. Note that this will produce partial-block writes with the default volblocksize=8k, but for the first write (where the zvol is empty) there will be no performance penalty. Subsequent runs on an already-written zvol will do read-modify-writes, so performance will be much worse. Using bs=8k, or a multiple of volblocksize, would avoid that.
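
So the earlier test, adjusted to full-block writes, might look like this (the count is chosen only to keep the total size roughly the same):

    dd if=/dev/zero of=/dev/zvol/rpool/zvol-test bs=8K count=1000000 status=progress oflag=direct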

we'd like to have O_DIRECT supported

For the purpose of controlling (minimizing) what's in the ARC cache, I think that O_DIRECT would have the same effect as primarycache=metadata. Could you use that instead?
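
That would be a per-volume property change, e.g. (using the volume name from this issue):

    zfs set primarycache=metadata rpool/zvol-test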

@sempervictus
Contributor

@ahrens: the RMW amplification is severe, and impacted by other factors (like ashift=12 with a volblocksize<8k on a RAIDZ), presence of SLOG is night|day. The iSCSI use case is where we've had the most issues, except we dont get COMSTAR here, we have SCST and LIO nowadays - not quite the well-organized storage subsystem hierarchy you might envision from the Illumos world :). The block-level interactions after the removal of those SG interfaces from ZVOLs are not pleasant under SCST. LIO seems to handle it better, but you still generally want to stick an LVM between the ZVOL and iSCSI export. All of this is compounded by the severe performance inconsistencies of ZVOLs underpinning the iSCSI transport, making them usable for VM OS disks in a cloud or something, but not for the data disks under Scylla or something like that (where having the TMR blocks deduped would be really nice, but obviously the dedup resolution overhead would make all of this even worse).
We push our iSCSI SAN snaps to a lower tier of backup pools which do have dedup, so we do see the advantages of ZFS on a datacenter level, but the performance impacts of all of these layers makes them unusable in a production sense.

@sempervictus
Contributor

@MichaelHierweck - spawning makes me think you were using the dynamic taskq function, which is a pretty well known performance bottleneck for ZVOLs. All of our iSCSI hosts are spl_taskq_thread_dynamic=0 with a 2:1 ratio for zfs.zvol_threads to host threads for this reason.
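
For reference, a hedged sketch of persisting those module options (the thread count is illustrative and should follow the 2:1 rule mentioned above):

    # /etc/modprobe.d/zfs-zvol.conf  (hypothetical file name)
    options spl spl_taskq_thread_dynamic=0
    options zfs zvol_threads=64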
I'm not sure why mailing lists keep getting mentioned - it's an outmoded mechanism, even GH is pretty sparse when it comes to project management functions (they've gotten better, but it's no Redmine/OpenProject).
There seems to be no appetite to redo ZVOLs - it's a large effort, requiring many hours of a skilled developer with in-kernel performance analysis and data extraction skills, a software architect, reviewer time... which is hard to find and expensive to execute. With NVMEoF/Weka/etc coming online in a major way, the performance expectations of block media are going to go through the roof this year. ZFS block devices are already laughably slow compared to their backing vdev and that gap will only grow. Vestigial status is on the horizon for this function without some commercial entity putting up effort like Datto did with Tom's time for ZFS crypto.
At the very least we would need someone to spend a lot of time building debug kernels which aren't too "debuggy" to hide the bugs with the tracing components, run benchmarks of ZVOL operations under various conditions, compile the data and render flame graphs or the like, and do analysis to identify hotspots so we know where the problems arise (and then figure out why). Alternatively a clean-sheet implementation is needed, which is a metric f-tonne of architecture work and problematic as ZFS is an evolved ecosystem which creates its own constraints.
I'd be willing to entertain a bounty for either a proper performance analysis and streamlining effort or a clean-slate for ZVOLs (since their consumer interfaces are silly-simple right now, this is actually feasible IMO). Probably would ask other commercial entities in the space to help out (we're a rather small shop - tailored security and infra), but might be handy if anyone's reading and has the relevant skillset to quote and execute such an effort.

@beren12
Contributor

beren12 commented Jan 24, 2021

I always give a few GB from each special vdev ssd to a slog, but it would be neat if that was done automatically, or something like that.

@Binarus
Author

Binarus commented Jan 9, 2022

At first, a happy new year, and thanks again for bringing us ZFS on Linux!

Sorry for reviving this old thread, but we're re-visiting the problems described above and are still not happy with the situation. We could improve things a little bit by adding a second VDEV (again consisting of a mirror of 2 x 4 TB spinners), though.

However, it's still hard to find a good write-up of the relationships between ZFS's caches and the page cache. So I hope I may ask some more very basic questions which hopefully are not too stupid (I got the impression that so far mainly contributors and experts with in-depth knowledge have participated in this thread :-)).

Question 1

However, write() system calls to ZFS filesystems don't use the page cache, so O_DIRECT would have no impact on the page cache in that case. Therefore, implementing O_DIRECT for ZFS filesystems (i.e. #10018) wouldn't help.

I guess that not having understood this completely is one of the obstacles which keeps us away from success, so could you please elaborate a little bit?

You wrote that write() calls to ZFS bypass the page cache. But then we would like to understand why dd to a ZVOL was behaving totally differently depending on whether we added oflag=direct or not, and when not, why /proc/sys/vm/dirty_bytes (which is a parameter of the page cache) also changed the behavior drastically. We have documented the tests in this thread in our first post from 2021-01-05.

dd uses write() when writing to a ZVOL device, doesn't it? If this is true, why does it involve the page cache although it shouldn't?

Additionally, what exactly does the new feature described in #10018 implement with respect to writing given that write() bypasses the page cache anyway? Is it about bypassing the ZFS write cache (as opposed to the page cache)?

Question 2

Until now, I was thinking that O_DIRECT already was supported on ZVOLs, but not on datasets. I have come to this opinion from posts like the following: https://forum.proxmox.com/threads/zvol-vs-image-on-top-of-dataset.48022/post-225696

Now I am completely worried due to the issue / PR you linked (#10018). I have just read the whole page (of course without understanding all of it) and couldn't find a statement that it relates (only) to normal datasets. On the other hand, according to our tests (see above), O_DIRECT is honored for ZVOLs even in older versions of ZFS.

So could you please shortly clarify for non-experts what the new feature is about? Is it about supporting O_DIRECT for normal datasets, or is it just that we now can turn it on or off for ZVOLs (where it always was active until now according to our tests)? Does the new feature behave the same for normal datasets and ZVOLs?

Question 3

The next important question of course is which is the first official release which will incorporate that feature. I guess we'll immediately test it, because we eventually can (depending on the answer to the previous question) switch from ZVOLs to file-based VM storage which may drastically improve performance.

Question 4

Finally, at the beginning of our tests (and our learning), we were thinking that O_DIRECT on ZVOLs would bypass the ZFS write cache (which is independent from the O/S write cache) as well. But we were obviously wrong with that. Maybe there is a ZFS tunable to turn off the ZFS write cache, but we couldn't spot it. The reason for bringing this up (again, I may be silly and naive):

When we have an O/S on bare metal, the O/S manages a disk cache (read and write), and there is no further cache layer between the O/S disk cache and the physical storage (let's neglect the fact that most disks have hardware caches for the moment). O/Ss like Windows are optimized for this situation; they assume that they have exclusive access to the storage hardware.

But when running a VM on ZFS (file-based or ZVOL-based), there are at least two caches (at least for writing): The cache which is managed in the VM by the guest O/S, and the ZFS write cache (plus some buffering the VM software probably does, but let's neglect this either).

Isn't it a disadvantage to have two caches chained, and if yes, how to circumvent it? Does the new feature mentioned above have anything to do with it?

Best regards, and thank you very much in advance,

Binarus

@IvanVolosyuk

IvanVolosyuk commented Jan 10, 2022

I'm not an expert in ZFS internals, but my understanding is the following; when writing from a VM you have:

  1. write caching on guest
  2. write caching on linux block layer for zvol
  3. write caching on ZFS
  4. write caching by drive's firmware

I am not an expert and might be wrong, but if you use O_DIRECT in the VM you can bypass (2). If you limit /proc/sys/vm/dirty_bytes you put the limit on (2). It smooths the load on (3) and the bandwidth management in (3) works better. Otherwise (2) can grow a lot and fill the host memory with dirty data. It will look like your writes are very fast at the beginning and bottleneck later when free memory is exhausted. Try monitoring cat /proc/meminfo | grep Dirty.
That means if you back your VM with a plain file instead of a zvol, you should have the same effect as when using a zvol with O_DIRECT. The O_DIRECT flag is not supported by ZFS, but is supported by the Linux block layer on top of the zvol and, as I said, affects (2). For the VM backed by a file you should just have (1), (2) and (3).

Sorry, if I'm saying something obvious, simplistic or not exactly correct ;)

@Codelica

Not that I'm adding much to the conversation here considering the wizard level insight in some of the posts above, but I'll put it out there anyway. Being new to ZFS, zvol write performance was the first snag I ran into.

Basically I was unable to figure out why I was looking at 2x+ write amplification using zvol+ext4 vs ext4 on a raw partition. Not that I was expecting them to be the same, but 2x+ seemed extreme and no combination of volblocksize on the zfs side and blocksize on the ext4 side seemed to help, even when feeding it pretty benign synthetic data. Trying zvol+xfs instead did lower the write amplification considerably, but even then, under heavy write loads both the xfs and ext4 formatted zvols would hit periods where they were almost completely non-responsive. While I imagine there are ways to help reduce(?) that issue, I shy away from things that take too much tweaking, as at the end of the day storage is only one aspect of our system (and we are a small team). So we've been avoiding zvols entirely at this point, even though it would be very nice to use in a few situations.

Considering the rising popularity of ZFS based systems like Proxmox, TrueNAS Scale, etc, I guess I'm a little surprised more people aren't running into zvol performance issues. Perhaps they aren't looking under the hood or pushing things too hard, but it seems like eventually it will need some attention. As the concept of a zvol is really very very attractive IMO.

@devZer0

devZer0 commented Mar 22, 2022

Considering the rising popularity of ZFS based systems like Proxmox, TrueNAS Scale, etc, I guess I'm a little surprised more people aren't running into zvol performance issues

I always wondered about this too. I switched from the Proxmox default to qcow2 on an ordinary ZFS dataset a long time ago and I'm running fine with it.

@behlendorf behlendorf linked a pull request Mar 22, 2022 that will close this issue
@behlendorf
Contributor

There are some improvements for zvol performance being worked on in PR #13148. Any feedback or test results with/without the PR for your target workload would be welcome.
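For anyone who wants to test it, a rough sketch of building the PR branch from source (standard OpenZFS build steps; the local branch name pr-13148 is arbitrary, and the usual build dependencies and kernel headers are assumed):

git clone https://github.com/openzfs/zfs.git
cd zfs
# Fetch the PR head into a local branch and build it
git fetch origin pull/13148/head:pr-13148
git checkout pr-13148
sh autogen.sh
./configure
make -j"$(nproc)"
sudo make install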

@sempervictus
Contributor

#13148 helps with some of the queuing issues around ZVOLs, but unfortunately it does not address the metaslab searches for free blocks once a ZVOL has been filled and erased. The async DMU work was very promising, but it is unfortunately sitting still, with nobody having the time and capability to move it forward (it is highly non-trivial).

@DemiMarie

The async DMU work was very promising, but it is unfortunately sitting still, with nobody having the time and capability to move it forward (it is highly non-trivial).

How big a performance win was it?

@Binarus
Author

Binarus commented Apr 13, 2022

Sorry, if I'm saying something obvious, simplistic or not exactly correct ;)

@IvanVolosyuk Thank you very much for your answer, and please excuse the long delay.

We've been using ZFS for several years, but haven't yet had the time to dig into the details. This is the first time we have actually understood, or rather had somebody confirm, that ZFS's write cache is on top of the Linux block layer write cache.

So I guess we'll have to stick with ZVOLs. We have understood that O_DIRECT turns off the block layer write cache, but not the ZFS write cache in normal ZFS datasets / file-backed VMs. Since I still believe that it is bad to have two concatenated caches (that of the guest O/S in the VM and that of ZFS), ZVOLs are the only reasonable option for VMs.

We did not try to turn off the caches in the guest O/S and use ZFS file-based storage, though. That also would result in only one active cache, but it feels totally wrong at the moment.

Thank you very much again, and best regards.

@Binarus
Author

Binarus commented Apr 13, 2022

There are some improvements for zvol performance being worked on in PR #13148. Any feedback or test results with/without the PR for your target workload would be welcome.

Thank you very much for working on the performance, and for this hint!

We are willing to help, but currently I only have access to two production machines. Since we have solid backups, upgrading to new ZFS versions is a risk that we're ready to take in general. However, I have read PR #13148, and I am unsure. The last post there implies that ztest experienced crashes with that version. I don't know what that means, but at the moment it seems like a bad idea to test that version on production machines.

But if somebody could state that the failing / crashing ztest is harmless, or explain what exactly could fail, we would probably take the risk and test that version nevertheless.

Thank you very much again, and best regards.

@DemiMarie

@Binarus I’m no ZFS expert, but I do recommend using O_DIRECT. That should be faster in general, no matter what storage system you are using.

@zfsbot

zfsbot commented Apr 14, 2022

that ZFS's write cache is on top of the Linux block layer write cache

ZFS doesn't actually have a write cache. This is simply dirty buffers held in the page cache, and if you use DirectIO, this doesn't happen. It's interesting, though, that when you do async write IO to a zpool, the written page is also used to cache immediate reads that happen afterwards.
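To see where data is actually being held, something like the following can be checked on the host (a sketch; rpool is the pool name from the original report, and kstat paths can differ slightly between OpenZFS versions):

# Dirty buffers in the Linux page cache - this is what O_DIRECT on the zvol device bypasses
grep -E '^(Dirty|Writeback):' /proc/meminfo
# Per-txg statistics, including how much dirty data ZFS itself is carrying per transaction group
cat /proc/spl/kstat/zfs/rpool/txgs
# The upper bound ZFS places on its own outstanding dirty data
cat /sys/module/zfs/parameters/zfs_dirty_data_max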

@zfsbot

zfsbot commented Apr 14, 2022

We did not try to turn off the caches in the guest O/S and use ZFS file-based storage, though. That also would result in only one active cache, but it feels totally wrong at the moment.

Typically, users will set primarycache=none or =metadata on their VM ZVOLs and rely on the guest OS page cache instead.
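For example, using the zvol name from the original report:

# Keep only ZFS metadata in the ARC for this zvol and let the guest page cache handle data
zfs set primarycache=metadata rpool/zvol-test
# Verify the current values
zfs get primarycache,secondarycache rpool/zvol-test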

@sempervictus
Contributor

@Binarus - skimming the test failures, it looks like they are in components other than the code paths altered in that PR. It might be a good idea to ask the maintainers whether those are real failures related to this effort or false positives.
Performance-wise, that PR does help under certain conditions and hurts under others. The O_DIRECT bit helps quite a bit, but as long as so much synchronous work is being done to write to a ZVOL, these problems are likely to persist. IMO, the "true way out" is to implement the async DMU code and back ZVOLs with it, permitting concurrent dispatch of the work that needs to be done to complete a write on a well-used ZVOL. That would make them "act" more like an NVMe device than a queued SCSI device, and while the backing stores and overall compute time would still be the limiting factors, we'd at least see much more work being done at the same time instead of waiting for other work to complete.

@DemiMarie

@sempervictus what would it take to implement the async DMU code?

@sempervictus
Contributor

@DemiMarie - many hours from an experienced C dev of tracking/tracing through what was already done in order to get up to speed, and many more to make it stable and consistently testable (async handoff/pickup of tasks can be unpleasant to verify, and it can break when other parts of the code change later, so it has to be very well validated).
#12166 is the last version I touched (I tried to pull it onto 2.1), if you'd like to take a look at the work that smarter people than me did in that branch.

@DemiMarie

@sempervictus I wonder if some sort of coroutine abstraction could help. There are macro tricks for writing coroutines in C, but Rust has them natively in the form of async/await.

@Binarus
Author

Binarus commented Apr 14, 2022

I’m no ZFS expert, but I do recommend using O_DIRECT. That should be faster in general, no matter what storage system you are using.

Thanks for the recommendation. This is what we are doing. However, latency is still catastrophic, so we are trying to understand that stuff.

ZFS doesn't actually have a write cache. This is simply dirty buffers held in the page cache, and if you use DirectIO, this doesn't happen.

Thank you very much. So O_DIRECT prevents Linux from buffering data in the page cache, but does not prevent ZFS from buffering data there (except in case of ZVOLs)?

Typically, users will set primarycache=none or =metadata on their VM ZVOLs and rely on the guest OS page cache instead.

This is what we are doing. However, as far as I know, primarycache relates to read caching (that is, the ARC). But our problem and the subject of this thread is the bad performance / latency when writing to ZVOLs (directly or from within a VM, it doesn't matter). Some people claim that the problems with writing are related to the read cache, though. This is one more thing which isn't clear to us.

@filip-paczynski

@Binarus Hi again. In my experience, introducing a SLOG device (or devices) greatly helps with writes. I've mentioned this before in #11407 (comment)
It worked for me and might not work for you, but why not try it? To implement the SLOG I just bought a PCIe M.2 carrier card and plugged two Intel P* SSDs into it. Yes, there is some cost, but it's not too much. I've not seen write performance problems ever since. However, I still somewhat struggle with random reads.
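For anyone wanting to try the same, a minimal sketch (the device paths are placeholders for your own SSDs):

# Add the two SSDs as a mirrored SLOG (separate log) vdev
zpool add rpool log mirror /dev/disk/by-id/nvme-EXAMPLE_SSD_A /dev/disk/by-id/nvme-EXAMPLE_SSD_B
# Confirm the log vdev shows up
zpool status rpool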

@Binarus
Author

Binarus commented Apr 23, 2022

@filip-paczynski Thank you very much for the tip!

However, as stated in my first post, we did one series of tests with sync=disabled on the ZVOL. This didn't improve the situation. Therefore, we haven't tested a SLOG yet; a SLOG can't affect anything when sync=disabled, correct?

Anyway, two weeks ago we were lucky enough to buy two Intel P3700 800 GB drives for a low price, and we will for sure try your suggestion soon. I would be glad if my statement above turned out to be wrong.
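In the meantime, one way to check whether a workload issues sync writes at all (which is all a SLOG can accelerate) seems to be the standard ZIL statistics - a sketch, with kstat paths that may vary slightly between OpenZFS versions:

# ZIL commit counters; if these barely move during the workload, a SLOG won't help
cat /proc/spl/kstat/zfs/zil
# With a log vdev attached, per-vdev iostat shows whether it is actually being written to
zpool iostat -v rpool 1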

Best regards, and thanks again.

@jittygitty

jittygitty commented May 1, 2022

@sempervictus A bounty was mentioned in this thread, so if you are interested in crowd-funding ZFS, please see #13397

@kyle0r

kyle0r commented Jan 1, 2023

Hey there, I've been a lurker on this issue thread for a relatively long time.

I'm sharing some research in this post which I know is extensive and long-winded, but for anyone who gives it a look, I'd love to hear critique and feedback. Better yet, if someone wants to collaborate, please reach out.

I hope by sharing that it might be a useful source of information and act as a confirmation that:

as of writing, for my hardware zvol is not a viable option.

Perhaps my research will provoke further discussion here.

Sidebar: I would disagree that this issue should be "closed as completed". That feels premature. PR #13148 might help to mitigate zvol performance issues, but I don't think it addresses their root cause. Also, AFAIK the code in #13148 has seen relatively little real production runtime so far?

I'm interested in trying PR #13148 when it hits a release. I was not able to easily determine whether it has made it into a release - it looks like not yet? (A shame that isn't easy to check via the GitHub web interface without cloning the repo.) zvol_use_blk_mq is not present in the release zfs-2.1.6-pve1. Once it's available I'll write back with any feedback from further testing.
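A quick way to check an installed build without cloning the repo is to look for the new module parameter (a sketch; this only tells you whether the blk-mq zvol code is present, not whether it is enabled):

modinfo zfs | grep -i zvol_use_blk_mq
ls /sys/module/zfs/parameters/ | grep zvol_use_blk_mq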

Back on topic. I did a bunch of research over the last few years on zvol performance (with my hardware). I've been capturing it on a private research post. This work has been occupying a space in my head for some time, so to take a positive step for myself I'm hitting publish and hopefully freeing up some space too xD
See here: https://coda.io/@ff0/zvol-performance-issues

The work is not perfect but good enough for now and can be revised in time as needed.
The synthetic tests in the research included ~10 TB of write I/O (~7 million I/O operations) and ~9.5 TB of read I/O (~3.7 million I/O operations).

The research identified that more research areas are still open... and further testing on different hardware would bring further clarity and validation. If anyone wants to collaborate on reproducing some of my test cases - please let me know and I'll commit my scripts to GitHub and complete a TODO :)
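To give an idea of the shape of such a reproduction, a rough sketch (the dataset names, sizes and fio parameters are only examples and are not taken from the research itself):

# Scratch zvol and dataset with matching block/record sizes
zfs create -V 20G -b 16k rpool/bench-zvol
zfs create -o recordsize=16k -o mountpoint=/bench-fs rpool/bench-fs
# Same random-write workload against the zvol device and against a file on the dataset
fio --name=zvol-rw --filename=/dev/zvol/rpool/bench-zvol --rw=randwrite --bs=16k --ioengine=libaio --iodepth=32 --runtime=60 --time_based
fio --name=fs-rw --directory=/bench-fs --size=20G --rw=randwrite --bs=16k --ioengine=libaio --iodepth=32 --runtime=60 --time_based
# Clean up afterwards
zfs destroy rpool/bench-zvol
zfs destroy rpool/bench-fs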

Here are a few sneak peeks at some of the commentary in the research, which demonstrate zvols using much more system resources while at the same time delivering slower performance than ZFS datasets. I provide evidence of this on my hardware for both synthetic and real-world organic benchmarks.

The fastest zvol test in this test type costs 7.9 times more resources than its competitor; that is a 690% increase in resources for ~9% less performance. The 1-minute load average at the end of the test was 9.41 vs. 1.19. This is a horrible cost/penalty for the zvol. All the other zvol tests fall off a performance cliff.

The fastest zfs-fs-read-4k-def run was 6.5 times faster than the fastest baseline-read-4k-raw-disk run, which is a ~550% increase in performance.

