
ZVOLs 3X slower than recordsets at same recordsize/volblocksize for sequential writes #4265

Closed
sempervictus opened this issue Jan 24, 2016 · 41 comments
Labels: Status: Inactive, Status: Stale, Type: Performance

Comments

@sempervictus
Contributor

As we do more work on SSD-only pools we're starting to see consistent slowdowns across the board on ZVOLs vs normal data sets. Our use case is primarily iSCSI from ZVOLs, so this is a bit of a problem for us.

To prove this out, I wrote a simple test script for tiotest which creates zvols and datasets, benchmarks them, and destroys them:

# POOL must be set to the name of the pool under test before running.
MPOINT=$(zfs get mountpoint $POOL | awk '/mountp/{print$3}')

for i in 4 8 64 128 ; do
  # Create a zvol at this volblocksize, benchmark it, then destroy it.
  echo "Testing ZVOL ${i}K blocksize on $POOL"
  zfs create $POOL/testvol-${i}k -o volblocksize=${i}k -V10G
  tiotest -f 256 -R -r 30000 -d /dev/zvol/$POOL/testvol-${i}k
  zfs destroy $POOL/testvol-${i}k
  # Create a dataset at the matching recordsize, benchmark it, then destroy it.
  CMD="zfs create $POOL/testset-${i}k -o recordsize=${i}k"
  if [[ "$MPOINT" == "none" ]]; then
    CMD="$CMD -o mountpoint=/testset-${i}k"
  fi
  sh -c "$CMD"
  LMP=$(zfs get mountpoint $POOL/testset-${i}k | awk '/mountp/{print$3}')
  echo "Testing DS ${i}K blocksize on $POOL"
  tiotest -f 256 -r 3000 -d $LMP
  zfs destroy $POOL/testset-${i}k
done

The results (http://pastebin.com/AfU5VYuG) are odd: a ~3X performance penalty on linear writes to ZVOLs, and somehow the inverse for random writes. In neither case does the underlying disk show full load in iostat. Thoughts?
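The iostat check referred to above looks like this (device names are placeholders for the pool members and the zvol under test):

# Extended device statistics at a one-second interval; %util near 100%
# would mean the physical devices themselves are the bottleneck.
iostat -x sda sdb zd0 1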

@dweeezil
Contributor

I've not done this exact type of performance testing but I'd like to point out that within ZFS, at least, the data component of a ZVOL is no different from a file other than that it has no name in the filesystem namespace. The big difference, however, is that access to a ZVOL goes through the kernel's block IO system. Among other things, unless O_DIRECT is used, the entire Linux block buffering scheme comes into play. Sometimes it helps, sometimes it hurts. In all cases, however, not using direct IO results in some amount of double-buffering because writes are instantly absorbed into the page cache. Furthermore, it generally results in higher memory use, potentially causing unwanted reductions in ZFS ARC size. I'm mentioning all this because I just checked the tiotest program and it does not open block devices in direct IO mode, nor does it support an option to do so.
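As a rough way to see the page-cache effect in isolation, the two variants below can be compared (a sketch only; the pool/zvol names are placeholders and the zvol's contents get overwritten):

# Buffered write: absorbed into the page cache first, then flushed into the
# DMU, i.e. the double-buffering described above.
dd if=/dev/zero of=/dev/zvol/tank/testvol bs=1M count=1024 conv=fdatasync

# Direct write: O_DIRECT bypasses the page cache and submits straight to the
# zvol's block device.
dd if=/dev/zero of=/dev/zvol/tank/testvol bs=1M count=1024 oflag=direct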

@sempervictus
Contributor Author

Tiotest is a quick-and-dirty approach; I'll run some fio tests with direct=1 at iodepth=128 and iodepth=1 to provide more results. But this is a consistent thing we see across real-world workloads: random IO on the iSCSI targets fed by the ZVOLs is much better than linear.
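Roughly along these lines (a sketch; the zvol path, size, and runtime are placeholders):

# Sequential writes, direct IO, deep queue
fio --name=seq-write --filename=/dev/zvol/tank/testvol --direct=1 \
    --ioengine=libaio --rw=write --bs=128k --iodepth=128 \
    --size=8G --runtime=60 --time_based --group_reporting

# Random writes, direct IO, queue depth 1
fio --name=rand-write --filename=/dev/zvol/tank/testvol --direct=1 \
    --ioengine=libaio --rw=randwrite --bs=4k --iodepth=1 \
    --size=8G --runtime=60 --time_based --group_reporting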

@behlendorf
Contributor

My first thought is that the sequential workload might be triggering some unnecessary read-ahead, either in the Linux kernel's block layer or internally in ZFS. I'd try disabling all read-ahead in both the kernel and ZFS and see what impact that has. Also be aware that the ZFS prefetch logic was just significantly reworked in master, as we pulled in Illumos 5987 7f60329.

@sempervictus
Contributor Author

@behlendorf: could you provide any instructions for ensuring no read-ahead occurs in ZoL? Is hdparm -A 0 sufficient for the disk itself?

I'll deploy an ELK logger to consume iostat data for the runs and get some visual representation going.

By the way, during direct=1 testing through a libvirt-scsi disk (not LUN) abstraction, I'm seeing SCSI errors from tcm_loop on zvols which aren't natively 4k. When I pass a 4k volblocksize zvol through as a libvirt-scsi LUN, everything works just fine. Something shifty is going on with volblocksize and how other things at static block sizes stack on top, since this seems eerily similar to the SCST loopback problem I saw previously on 8k zvols. The problem with using 4k is the bloody space loss on ashift=12 RAIDZ3 pools... might need to go to a striped mirror for this, but that's also not the most space-efficient layout in the world.

@behlendorf
Contributor

@sempervictus Set the zfs_prefetch_disable module option to disable read-ahead/prefetching. You can also look at /proc/spl/kstat/zfs/zfetchstats to get an idea of how it's performing.

Regarding the SCSI errors, you might try the patch in #4203 to increase the logical block size. There seem to be issues here when the logical size is 512 and the physical size is larger, even though ZFS is correctly providing the right optimal size.
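At runtime that looks something like this (paths as on a typical ZoL install):

# Disable ZFS-level prefetch
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

# Check how the prefetcher has been doing (hits vs. misses)
cat /proc/spl/kstat/zfs/zfetchstats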

@dswartz
Contributor

dswartz commented Jan 25, 2016

On 2016-01-25 13:20, Richard Yao wrote:

@behlendorf We disable Linux's readahead on datasets, but not on
zvols. I suspect this is the result of double prefetch overhead from
interactions between the Linux readahead and zfetch rather than an
issue in zfetch.

I wrote a patch to address this alongside the work to kill the zvol IO
queue and decided to split it into its own pull request. The patch to
disable Linux readahead was rejected when it turned out that Linux
readahead made us perform better under the IO queue and was forgotten.
Now that the IO queue is gone, we probably should revisit disabling
Linux readahead on zvols. My intuition is that the patch will have the
effect that I had originally expected now that we no longer have the
IO queue.

@sempervictus Try applying the patch from 2487 and rerunning your

Richard, I may give this a try too.

@ryao
Contributor

ryao commented Jan 25, 2016

@dswartz I am juggling enough things at once that while I thought my original comment was correct, I decided to delete it until I can double check things. I will repost it afterward.

@dswartz
Contributor

dswartz commented Jan 25, 2016

On 2016-01-25 13:33, Richard Yao wrote:

@dswartz Don't try it. I just double checked and it turns out that
while that patch will apply, it won't build because ->zv_queue is
gone. I need to rebase it against the modern codebase.

whoops, okay :)

I don't remember if I mentioned this in any earlier posts. I am currently running ESOS with ZFS (non-standard build) and was serving a couple of zvols up to ESXi and a Windows Veeam proxy, and the performance was disappointing. I ended up creating datasets, using 'dd skip=xxx' to create LUN files, and then setting a reservation on the dataset. This also improved SLOG performance, due to the larger block sizes. RMW overhead is not a concern for me, since the appliance has 128GB of RAM and the warm-cache demand hit rate is 99%.

@dswartz
Contributor

dswartz commented Jan 25, 2016

On 2016-01-25 13:35, Richard Yao wrote:

@dswartz I am juggling enough things at once that while I thought
my original comment was correct, I decided to delete it until I can
double check things. I will repost it afterward.

No worries, thanks :)

@ryao
Contributor

ryao commented Jan 25, 2016

@dswartz After double checking, it turns out that my remarks were right the first time. As paradoxical as saying "we no longer have the IO queue" sounds when we have a ->zv_queue pointer that the patch uses, that is indeed the case. My original comment should have been correct. I will repush that commit in a fresh pull request for the buildbot to evaluate.

@behlendorf We disable Linux's readahead on datasets, but not on zvols. I suspect this is the result of double prefetch overhead from interactions between the Linux readahead and zfetch rather than an issue in zfetch.

I wrote a patch to address this alongside the work to kill the zvol IO queue and decided to split it into its own pull request. The patch to disable Linux readahead was rejected when it turned out that Linux readahead made us perform better under the IO queue and was forgotten. Now that the IO queue is gone, we probably should revisit disabling Linux readahead on zvols. My intuition is that the patch will have the effect that I had originally expected now that we no longer have the IO queue.

@sempervictus Try applying the patch from #4272 and rerunning your tests:

https://github.com/zfsonlinux/zfs/pull/4272.patch
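For experimentation, the Linux read-ahead can also be turned down per zvol at runtime, without rebuilding (zd0 is a placeholder for the zvol's device node; this only affects the block-layer read-ahead, not zfetch):

# Show the current block-layer read-ahead, in 512-byte sectors
blockdev --getra /dev/zd0

# Disable block-layer read-ahead for this zvol
blockdev --setra 0 /dev/zd0

# Equivalent sysfs knob, in KiB
echo 0 > /sys/block/zd0/queue/read_ahead_kb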

@ryao
Contributor

ryao commented Jan 25, 2016

I just opened #4272 as an alias of #2487 for the buildbot tests. We probably need some benchmark numbers to confirm my hypothesis though. I decided against updating the commit message until we had benchmark numbers.

@dracwyrm

Is there a way to disable the kernel read-ahead without the patch while the system is booted and the ZFS modules are loaded?

I recently had time to play with ZFS again, and with the latest version I still had the ZVOL issues that I reported in my bug report. I asked for that to be reopened. I was wondering if read-ahead would cause the issue I had there.

Thanks

@dracwyrm

I did some cheap and dirty testing with the Disable Linux Readahead patch.
Host:
Kernel: 4.1.15
ZFS/SPL: 0.6.4.2
ZFS Pool: RAIDZ1 on three 2TB platter drives, with cache and log (L2ARC/SLOG) on separate SSDs in a stripe/mirror configuration.

Guest:
Drive: 750G ZVol
OS: Windows 10 latest updates.

Before Disable patch:

CrystalDiskMark 5.1.1 x64 (C) 2007-2016 hiyohiyo

  • MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

  • KB = 1000 bytes, KiB = 1024 bytes

    Sequential Read (Q= 32,T= 1) : 759.442 MB/s
    Sequential Write (Q= 32,T= 1) : 486.527 MB/s
    Random Read 4KiB (Q= 32,T= 1) : 39.550 MB/s [ 9655.8 IOPS]
    Random Write 4KiB (Q= 32,T= 1) : 31.144 MB/s [ 7603.5 IOPS]
    Sequential Read (T= 1) : 985.114 MB/s
    Sequential Write (T= 1) : 483.610 MB/s
    Random Read 4KiB (Q= 1,T= 1) : 22.380 MB/s [ 5463.9 IOPS]
    Random Write 4KiB (Q= 1,T= 1) : 18.713 MB/s [ 4568.6 IOPS]

    Test : 1024 MiB C: 4.1% (30.5/749.4 GiB) [Interval=5 sec]
    Date : 2016/01/24 12:07:18
    OS : Windows 10 Professional 10.0 Build 10586

After Disable Patch:

CrystalDiskMark 5.1.1 x64 (C) 2007-2016 hiyohiyo

  • MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]

  • KB = 1000 bytes, KiB = 1024 bytes

    Sequential Read (Q= 32,T= 1) : 872.373 MB/s
    Sequential Write (Q= 32,T= 1) : 438.053 MB/s
    Random Read 4KiB (Q= 32,T= 1) : 36.708 MB/s [ 8961.9 IOPS]
    Random Write 4KiB (Q= 32,T= 1) : 29.009 MB/s [ 7082.3 IOPS]
    Sequential Read (T= 1) : 1085.960 MB/s
    Sequential Write (T= 1) : 403.337 MB/s
    Random Read 4KiB (Q= 1,T= 1) : 23.740 MB/s [ 5795.9 IOPS]
    Random Write 4KiB (Q= 1,T= 1) : 22.053 MB/s [ 5384.0 IOPS]

    Test : 1024 MiB C: 4.6% (34.7/749.4 GiB) [Interval=5 sec]
    Date : 2016/01/27 9:20:20
    OS : Windows 10 Professional 10.0 Build 10586

EDIT: Pasted in the correct benchmarks. Sorry.
There is a bit of a difference. I don't know how much of an increase will come from platter-based disks.
These tests were conducted several days apart.

@odoucet

odoucet commented Jan 27, 2016

These are exactly the same numbers. Are you sure you copied/pasted them correctly?

@dracwyrm

@odoucet Oops, I pasted in the wrong ones for the after-patch stats. I've updated the post.

Looking at the post again, it seems that the patch makes reads faster but writes slower. Strange. Is that an issue with the ZFS read-ahead?

@ryao
Contributor

ryao commented Jan 27, 2016

@dracwyrm A problem when evaluating that patch in your setup is that it has triple prefetch: from zfetch, Linux readahead, and Windows readahead. The interactions between the three are not necessarily well behaved. I used to have a semi-similar setup where I ran Windows on physical hardware off an iSCSI-backed zvol. I recall significant variations between runs of CrystalDiskMark in that setup. I suspect that might be the case for you too, which would mean that those numbers are probably not significant.

There are also other potential things in play that can bottleneck or add variation to performance, such as volblocksize being mismatched and NTFS fragmentation. The latter definitely should cause performance to vary between runs unless you start the VM with the same on-zvol state each time. It might be possible to correct for all of those things to get representative numbers, but you need to control everything very tightly, which is hard to do in comparison to the test by @sempervictus. His test also shows differences between datasets and zvols, which is what suggests there is additional performance that we could gain.

@sempervictus
Contributor Author

So I've applied #4272, and here are the test results: http://pastebin.com/1FHMKiSg.
The performance delta has shrunk but is clearly still there, in the same pattern: random IO is faster on ZVOLs, sequential is faster on datasets. Thoughts?

@dweeezil
Contributor

dweeezil commented Feb 1, 2016

I believe I've gotten to the bottom of the differing results when using tiotest as in @sempervictus' initial report. I set up a test rig using a slightly modified version of @sempervictus' script and also hacked tiotest to use O_DIRECT for raw devices in order to get more sensible results. I also only worked with 4K blocksizes since they're the most stressful. The file-based tiotest was in fact about 50% faster using this particular test. Finally, I'll note that I ran with 20 threads (-t 20) in order to increase the workload and used the -k option to select only the sequential write workload.

The issue seems to be that tiotest -f 256 -R -r 30000 -d /dev/zvol... and tiotest -f 256 -r 3000 -d ... aren't quite doing the same thing. In the zvol case, each thread is writing to a different offset of what is effectively a 10GiB file (the size of the zvol; sparse in the case of a freshly-created zvol). In the file case, each thread is writing to its own 256MiB file.

For reasons I haven't yet tracked down, the zvol case results in a lot of contention as dbufs are dirtied. I believe the issue is that the updates to the L1 and L2 blkptrs are highly contended.

I'll try to track this down a bit further but for the moment, I think this mainly shows that concurrent sequential writes to separate objects perform better than strided concurrent sequential writes to the same object.

@dweeezil
Contributor

dweeezil commented Feb 1, 2016

After further testing and instrumentation, I've still not determined exactly what's causing the slowdown. Unfortunately, there's not a huge difference in performance. At the moment, the zvol test is running at 177 MB/s (with O_DIRECT enabled) and the file(s) test is running at 239 MB/s (tiotest 4K sequential writes).

I'll note that the stock tiotest, without O_DIRECT, runs at about 115 MB/s, but I consider the O_DIRECT mode to be a more reasonable test because it's not dependent on the manner in which the page cache is flushed.

@sempervictus
Contributor Author

Thank you, sir; disturbing to hear you can't see the cause - I figured you had a crystal ball. :)
@ryao: thoughts?

@dweeezil
Contributor

dweeezil commented Feb 2, 2016

@sempervictus I've not given up yet :) The problem seems to be that the zvol_request->dmu_write_uio->dmu_write_uio_dnode path is slower than the zfs_write->dmu_write_uio_dbuf->dmu_write_uio_dnode path (even though they both ultimately use dmu_write_uio_dnode()). In my particular test case, the slowdown can be summarized as follows: a call to zfs_write() to write a 4K block takes about 0.214 ms, whereas a call to zvol_request() to write a 4K block takes about 0.312 ms. My test involves 20 streams, each writing 256MB, 4K at a time, to sequential parts of the zvol. This is using the same tiotest parameters as the original poster. The only difference is that I've hacked tiotest to run in O_DIRECT mode for zvols.

I'm following all the various call paths of zvol_request() one at a time. My initial suspicion was that the dmu_tx_count_write() path was the bottleneck, but I've stubbed it out and it didn't help much. I suspect I'll be able to track down the cause of the increased latency, but it's going to take time because there are a lot of call paths and I'm looking for a 0.1 ms difference.
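For the latency hunt, the function-graph tracer can help; for example, via the funcgraph script from perf-tools (illustrative invocations, run while the benchmark is in flight):

# Per-call timing of the zvol write entry point, 10-second capture
perf-tools/bin/funcgraph -d 10 zvol_request

# Same for the filesystem write path, for comparison
perf-tools/bin/funcgraph -d 10 zfs_write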

@dweeezil
Contributor

dweeezil commented Feb 4, 2016

@sempervictus I did a bit of a reset on my test regimen because I simply wasn't seeing anything like a 3:1 ratio for sequential writes between zvols and files. It turns out that adding O_DIRECT helped the zvol case a lot.

I modified the original test script as in https://gist.github.com/dweeezil/3fb4ef7f0ce88c6b2e59 to test zvols with and without O_DIRECT and also only to test sequential writes. It also recreates the pool before each test. The pool I'm using is a set of 10 5G files on a tmpfs filesystem (size and number chosen to give good SSD-like performance). The 4-thread results are in https://gist.github.com/dweeezil/c444f55667216ead022f and show that O_DIRECT causes the gap between zvols and files to shrink as the blocksize increases. There is clearly better performance for this test when using files but I'm not sure where it comes from. Among other things, I noticed that the zvol case has quite a few more CPU cache misses than the files case. There's a chance a non-cacheline-aligned structure or structure member is hurting things a bit. As the threads are increased to 20, the zvol case starts performing better than files (https://gist.github.com/dweeezil/54e217f63a0ac2167eeb).

So there seem to be two issues here: first, O_DIRECT is better for this case; second, there is still a performance hit when using zvols with small block sizes.

The O_DIRECT issue is most likely due to generic_writepages() not flushing the dirty data as efficiently as it could.

As to the overhead involved with small blocksize zvols insofar as this benchmark is concerned, I'm still looking into it.

@tuxoko
Contributor

tuxoko commented Feb 4, 2016

@dweeezil
zfs_write will use dmu_assign_arcbuf() when writing a whole block, without going through dmu_write(), so the path is pretty different. Maybe this contributes to the latency difference?

@ryao
Contributor

ryao commented Feb 6, 2016

@sempervictus I think @dweeezil's approach of using O_DIRECT is the right thing to do on a zvol.

@tuxoko Your thought about the code paths being different was right. In the normal case, zfs_write should call dmu_write_uio_dbuf(), which is documented as being faster than the dmu_write_uio() that zvol_write calls.

A quick trace of tiotest on a dataset with perf-tools shows that we are using dmu_write_uio_dbuf:

desktop ~ # perf-tools/bin/funccount dmu_write*
Tracing "dmu_write*"... Ctrl-C to end.
^C
FUNC                              COUNT
dmu_write                          2584
dmu_write.part.6                   2584
dmu_write_policy                 131902
dmu_write_uio_dbuf               451225
dmu_write_uio_dnode              451226

While a trace of tiotest on a zvol shows:

# perf-tools/bin/funccount dmu_write*
Tracing "dmu_write*"... Ctrl-C to end.
^C
FUNC                              COUNT
dmu_write_uio_dbuf                   29
dmu_write_uio_dnode                  29
dmu_write                          1129
dmu_write.part.6                   1129
dmu_write_policy                 269794
dmu_write_bio                    524288

That would be dmu_write_uio on HEAD. The illumos-gate code is using dmu_write_uio_dbuf, so this is a ZoL-specific regression. I have opened #4316 with a possible fix. I did not test it locally, so I'll let the buildbot verify it.

@ryao
Contributor

ryao commented Feb 6, 2016

@sempervictus It is not clear from your benchmarks if the patch to disable prefetch made a difference because the numbers do not look like they were done on the same pool configuration.

That being said, #4316 has passed the buildbot and should help.

@ryao
Contributor

ryao commented Feb 6, 2016

I noticed that similar treatment is needed for the read case, so I pushed a couple of additional commits. I also see what might be some additional opportunities for improvement on both zvols and regular files in the ZIL, and maybe mmap too, which I will try to implement next week.

@dweeezil
Contributor

dweeezil commented Feb 6, 2016

I finally figured out where much of the file-version's advantage comes from. It involves a set of circumstances rather specific to @sempervictus' original testing script:

  • The benchmark is run immediately after creating a filesystem
  • The tiotest program's threads don't issue their fsync() calls until after all the data are written

It turns out that a freshly-created filesystem does not have a ZIL until the first modification is made to it. The following code in zil_commit_writer()

                if (txg > spa_last_synced_txg(spa) || txg > spa_freeze_txg(spa))
                        lwb = zil_lwb_commit(zilog, itx, lwb);

in the case of this benchmark, due to the conditions listed above, will generally never call zil_lwb_commit() (txg generally is equal to the last synced txg in this initial case). This is where much of the speedup comes from. As soon as there's a ZIL, txg generally is ahead of the last synced txg.

If one simply does something like this:

dd if=/dev/zero of=/tank/fs/blah count=1 conv=fsync
sync

before running the first tiotest, the performance (throughput) is substantially lower than it would be on a freshly-created filesystem.

I am running with @ryao's latest zvol patches, but I'm pretty sure they don't make a big difference in the zvol results. After modifying the script to do the dd and sync as shown above, here are the results (on a completely different testing system than I was using before): https://gist.github.com/dweeezil/b23d372755a6ec7806a1.

@dweeezil
Contributor

dweeezil commented Feb 6, 2016

For completeness, here's the testing script I used: https://gist.github.com/dweeezil/3f233823f5fc7922b276.

I put the extra dd/sync in the zvol case, but it's not needed there since O_DIRECT causes the ZIL to be created right away.

@ryao
Contributor

ryao commented Feb 6, 2016

@dweeezil I think you have satisfactorily explained the difference.

It seems reasonable to think that the dnode_hold() overhead from dmu_read_uio()/dmu_write_uio() probably is not very painful, but we probably should get rid of it anyway. It might explain the additional CPU time spent on zvols in your latest numbers.

@sempervictus
Contributor Author

With the above patch merged in, the results are still pretty abysmal, and in some cases much worse than the original 3X figure.

The actual use case here is a Cinder volume service (LVM/ZVOL until I or someone writes a proper ZFS driver), which I've configured to assign volumes via libvirt to the VMs as direct pass-through via the virtio-scsi LUN configuration - basically no QEMU/KVM middleware between the iSCSI target on the compute node and the VM.
What I'm seeing under these conditions looks like this:

root@svl-u14-voltest:/home/ubuntu# lsblk 
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda      8:0    0   1G  0 disk 
vda    253:0    0  20G  0 disk 
└─vda1 253:1    0  20G  0 part /
root@svl-u14-voltest:/home/ubuntu# fdisk -l /dev/sdaa
root@svl-u14-voltest:/home/ubuntu# fdisk -l /dev/sda

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.

Note: sector size is 4096 (not 512)

Disk /dev/sda: 1073 MB, 1073741824 bytes
34 heads, 61 sectors/track, 126 cylinders, total 262144 sectors
Units = sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 32768 bytes
I/O size (minimum/optimal): 65536 bytes / 1048576 bytes
Disk identifier: 0x00000000

Disk /dev/sda doesn't contain a valid partition table

root@svl-u14-voltest:/home/ubuntu# tiotest  -f 128 -R -d /dev/sda -r 30000 
Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write         512 MBs |    0.8 s | 636.276 MB/s |  15.0 %  | 290.6 % |
| Random Write  469 MBs |    0.3 s | 1522.049 MB/s |  23.0 %  | 342.7 % |
| Read          512 MBs |    0.7 s | 786.360 MB/s |  23.9 %  | 339.9 % |
| Random Read   469 MBs |    4.1 s | 113.599 MB/s |  28.3 %  | 279.9 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.021 ms |       35.287 ms |  0.00000 |   0.00000 |
| Random Write |        0.008 ms |       12.148 ms |  0.00000 |   0.00000 |
| Read         |        0.018 ms |       12.095 ms |  0.00000 |   0.00000 |
| Random Read  |        0.130 ms |        4.163 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.043 ms |       35.287 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

root@svl-u14-voltest:/home/ubuntu# tiotest  -f 128 -R -d /dev/sda -r 30000 
Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write         512 MBs |    2.2 s | 230.876 MB/s |  10.8 %  | 348.6 % |
| Random Write  469 MBs |    0.3 s | 1469.897 MB/s |  50.5 %  | 322.3 % |
| Read          512 MBs |    0.6 s | 841.393 MB/s |  23.9 %  | 323.8 % |
| Random Read   469 MBs |    4.0 s | 118.039 MB/s |  27.6 %  | 238.0 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.065 ms |       36.518 ms |  0.00000 |   0.00000 |
| Random Write |        0.007 ms |       16.027 ms |  0.00000 |   0.00000 |
| Read         |        0.016 ms |        3.494 ms |  0.00000 |   0.00000 |
| Random Read  |        0.123 ms |        3.351 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.052 ms |       36.518 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

root@svl-u14-voltest:/home/ubuntu# tiotest  -f 128 -R -d /dev/sda -r 30000 
Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write         512 MBs |    2.1 s | 247.282 MB/s |   9.8 %  | 326.3 % |
| Random Write  469 MBs |    0.3 s | 1460.886 MB/s |  40.7 %  | 338.7 % |
| Read          512 MBs |    0.7 s | 761.530 MB/s |  40.4 %  | 309.5 % |
| Random Read   469 MBs |    3.3 s | 141.803 MB/s |  24.9 %  | 212.4 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.058 ms |       44.304 ms |  0.00000 |   0.00000 |
| Random Write |        0.007 ms |       12.056 ms |  0.00000 |   0.00000 |
| Read         |        0.018 ms |        7.998 ms |  0.00000 |   0.00000 |
| Random Read  |        0.097 ms |        1.979 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.045 ms |       44.304 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

When the volume is "fresh", the linear write speed is a bit better; once it gets a first pass of writes, though, sequential writes degrade to 1/7th the rate of random writes.

Given the fact that ZoL is now being included in a number of distributions, especially Ubuntu, it's only a matter of time before it becomes the de facto back end for OpenStack storage (both for Cinder volumes and Ceph backing), so this might become a more pressing concern.

As far as using O_DIRECT goes, while I agree that it provides a more consistent baseline for benchmarks, there are plenty of applications which do not set the flag in real-world conditions, so it would likely make sense to test both variations.
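For example, the same sequential write run both ways (a sketch; the zvol path is a placeholder):

# Buffered, as most applications behave
fio --name=buffered-write --filename=/dev/zvol/tank/testvol --direct=0 \
    --ioengine=libaio --rw=write --bs=128k --size=4G

# Direct, as a consistent baseline
fio --name=direct-write --filename=/dev/zvol/tank/testvol --direct=1 \
    --ioengine=libaio --rw=write --bs=128k --size=4G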

@olw2005

olw2005 commented Aug 30, 2016

@sempervictus Just wondering if there's been any headway on this? (Or if you've found a workaround for it.) Reading through the issues list this afternoon, this one looks somewhat related to #4512, which I've been banging my head on for a bit now.

@sempervictus
Contributor Author

Not really, no; I'm still seeing completely abysmal numbers under the default tiotest workload, even on pools built from 1TB 850 Pros. I'm currently working out a new stack on top of #5009 with a 4.7.2 build, plus 1b0c2ab from #4272.

@sempervictus
Contributor Author

blktrace seems to show that all ZVOL operations on one of the systems experiencing this problem are pegged to (almost) a single core:

while true; do blktrace -d /dev/zd32 -w 1 ; done
=== zd32 ===
  CPU  0:                  923 events,       44 KiB data
  CPU  1:                    0 events,        0 KiB data
  CPU  2:                    0 events,        0 KiB data
  CPU  3:                    0 events,        0 KiB data
  CPU  4:                    0 events,        0 KiB data
  CPU  5:                    0 events,        0 KiB data
  CPU  6:                    4 events,        1 KiB data
  CPU  7:                    7 events,        1 KiB data
  Total:                   934 events (dropped 0),       44 KiB data
=== zd32 ===
  CPU  0:                 1140 events,       54 KiB data
  CPU  1:                    0 events,        0 KiB data
  CPU  2:                    6 events,        1 KiB data
  CPU  3:                    0 events,        0 KiB data
  CPU  4:                    0 events,        0 KiB data
  CPU  5:                    0 events,        0 KiB data
  CPU  6:                    1 events,        1 KiB data
  CPU  7:                    0 events,        0 KiB data
  Total:                  1147 events (dropped 0),       54 KiB data
=== zd32 ===
  CPU  0:                 1067 events,       51 KiB data
  CPU  1:                    6 events,        1 KiB data
  CPU  2:                    0 events,        0 KiB data
  CPU  3:                    4 events,        1 KiB data
  CPU  4:                    1 events,        1 KiB data
  CPU  5:                    0 events,        0 KiB data
  CPU  6:                    0 events,        0 KiB data
  CPU  7:                    0 events,        0 KiB data
  Total:                  1078 events (dropped 0),       51 KiB data
=== zd32 ===
  CPU  0:                   85 events,        4 KiB data
  CPU  1:                    0 events,        0 KiB data
  CPU  2:                    0 events,        0 KiB data
  CPU  3:                    0 events,        0 KiB data
  CPU  4:                    0 events,        0 KiB data
  CPU  5:                    1 events,        1 KiB data
  CPU  6:                    0 events,        0 KiB data
  CPU  7:                    0 events,        0 KiB data
  Total:                    86 events (dropped 0),        5 KiB data

That trace is from a VM doing a filesystem repair (NTFS, windows guest) atop an iSCSI export from zd32. This is with:

cat /etc/modprobe.d/spl.conf 
options spl spl_taskq_thread_dynamic=0

However, I don't see the spl_taskq_thread_dynamic parameter in sysctl -a...

sysctl -a 2>&1| grep dynamic ; find /proc/sys| grep thread_dynamic

returns nothing...

@kernelOfTruth
Contributor

kernelOfTruth commented Sep 10, 2016

AFAIK the SPL knobs were never under /proc/sys/; they have always been under /sys:

find /sys/ | grep spl | grep dynamic
/sys/module/spl/parameters/spl_taskq_thread_dynamic

@ryao
Contributor

ryao commented Sep 10, 2016

@sempervictus Nearly all zvol operations being pegged to a single core suggests that IO is being diverted to that core by the kernel. The cause is likely /sys/devices/virtual/block/zd*/queue/rq_affinity, which is set to the default value of 0, rather than 1 as it is for other block devices.

Try setting it to 1 and rerunning your tests. If things work well, we can patch the zvol code to make this the new default.

Also, the dynamic taskqs work by spawning threads when there is work and killing them when there is none, so it is not surprising that you do not see anything.
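Concretely, something like this (zd80 as in the trace above):

# Check the current setting for one zvol
cat /sys/devices/virtual/block/zd80/queue/rq_affinity

# Complete requests on the CPU group that submitted them
echo 1 > /sys/devices/virtual/block/zd80/queue/rq_affinity

# Apply to every zvol on the system
for q in /sys/devices/virtual/block/zd*/queue/rq_affinity; do echo 1 > "$q"; done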

@sempervictus
Contributor Author

@kernelOfTruth @ryao: thanks for the input. Dynamic taskqs are off, verified in /sys/module/spl/parameters/spl_taskq_thread_dynamic. As far as the blktrace results go, I took a few cycles of output before and after setting echo 1 > /sys/devices/virtual/block/zd80/queue/rq_affinity, and the output looks a bit better. However, while it's now spreading events across all cores, the first four have a 10:1 event ratio with the last four.

@sempervictus
Contributor Author

sempervictus commented Oct 27, 2016

So as of today's version of #5135 (3b234ed), this problem persists, with throughput completely unaffected by the addition of a PCIe SLOG device:

# tiotest -o 512 -f 256 -R -d /dev/zvol/1t4sz/test-v -r 30000 ; zpool add 1t4sz log /dev/sdx ; tiotest -f 256 -R -d /dev/zvol/1t4sz/test-v -r 30000 -o 512
Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    5.2 s | 195.275 MB/s | 707.1 %  | 137.5 % |
| Random Write  469 MBs |    0.7 s | 627.016 MB/s | 782.2 %  |  81.2 % |
| Read         1024 MBs |    0.6 s | 1699.216 MB/s |  94.9 %  | 1273.7 % |
| Random Read   469 MBs |    1.0 s | 471.024 MB/s | 471.8 %  | 1100.4 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.004 ms |        0.218 ms |  0.00000 |   0.00000 |
| Random Write |        0.003 ms |        0.057 ms |  0.00000 |   0.00000 |
| Read         |        0.007 ms |        0.389 ms |  0.00000 |   0.00000 |
| Random Read  |        0.030 ms |        0.626 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.009 ms |        0.626 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    5.6 s | 183.996 MB/s | 692.4 %  | 120.3 % |
| Random Write  469 MBs |    0.8 s | 590.083 MB/s | 656.7 %  |  66.8 % |
| Read         1024 MBs |    0.5 s | 1877.700 MB/s |  73.7 %  | 1319.7 % |
| Random Read   469 MBs |    0.9 s | 500.032 MB/s | 560.5 %  | 1064.1 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.004 ms |        0.206 ms |  0.00000 |   0.00000 |
| Random Write |        0.003 ms |        0.051 ms |  0.00000 |   0.00000 |
| Read         |        0.007 ms |        0.464 ms |  0.00000 |   0.00000 |
| Random Read  |        0.028 ms |        0.485 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.008 ms |        0.485 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

# tiotest -f 256 -R -d /dev/zvol/1t4sz/test-v -r 30000 ; zpool remove 1t4sz /dev/sdx && tiotest -f 256 -R -d /dev/zvol/1t4sz/test-v -r 30000
Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    5.1 s | 199.345 MB/s | 706.6 %  | 107.4 % |
| Random Write  469 MBs |    1.1 s | 430.512 MB/s | 670.9 %  | 122.4 % |
| Read         1024 MBs |    0.6 s | 1814.715 MB/s |  68.1 %  | 1310.4 % |
| Random Read   469 MBs |    1.0 s | 492.293 MB/s | 500.2 %  | 1120.3 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.003 ms |        0.206 ms |  0.00000 |   0.00000 |
| Random Write |        0.003 ms |        0.325 ms |  0.00000 |   0.00000 |
| Read         |        0.007 ms |        1.067 ms |  0.00000 |   0.00000 |
| Random Read  |        0.029 ms |        0.464 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.008 ms |        1.067 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    5.1 s | 200.801 MB/s | 692.4 %  | 132.9 % |
| Random Write  469 MBs |    1.1 s | 433.263 MB/s | 654.0 %  | 144.3 % |
| Read         1024 MBs |    0.6 s | 1718.040 MB/s |  56.9 %  | 1048.8 % |
| Random Read   469 MBs |    1.0 s | 480.930 MB/s | 525.5 %  | 1039.4 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.004 ms |        0.441 ms |  0.00000 |   0.00000 |
| Random Write |        0.003 ms |        1.069 ms |  0.00000 |   0.00000 |
| Read         |        0.006 ms |        0.607 ms |  0.00000 |   0.00000 |
| Random Read  |        0.029 ms |        0.470 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.009 ms |        1.069 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

For comparison, here's a dataset in the same pool:

tiotest# tiotest -f 256 -d . -r 30000 && zpool add 1t4sz log /dev/sdx && tiotest -f 256  -d . -r 30000
Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    1.3 s | 810.495 MB/s | 1042.2 %  | 445.1 % |
| Random Write  469 MBs |    1.6 s | 284.820 MB/s | 497.8 %  | 410.4 % |
| Read         1024 MBs |    2.1 s | 477.653 MB/s |  34.1 %  | 1628.6 % |
| Random Read   469 MBs |    1.1 s | 429.245 MB/s |  36.2 %  | 1471.1 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.016 ms |        5.300 ms |  0.00000 |   0.00000 |
| Random Write |        0.027 ms |        5.632 ms |  0.00000 |   0.00000 |
| Read         |        0.031 ms |        0.866 ms |  0.00000 |   0.00000 |
| Random Read  |        0.032 ms |        0.479 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.026 ms |        5.632 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

Tiotest results for 4 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1024 MBs |    1.3 s | 796.644 MB/s | 1048.6 %  | 490.9 % |
| Random Write  469 MBs |    2.1 s | 225.750 MB/s | 469.7 %  | 383.6 % |
| Read         1024 MBs |    2.1 s | 484.837 MB/s |  36.5 %  | 1548.5 % |
| Random Read   469 MBs |    0.9 s | 515.291 MB/s |  17.6 %  | 1668.5 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.018 ms |        5.777 ms |  0.00000 |   0.00000 |
| Random Write |        0.036 ms |        5.114 ms |  0.00000 |   0.00000 |
| Read         |        0.030 ms |        0.265 ms |  0.00000 |   0.00000 |
| Random Read  |        0.029 ms |        0.534 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.027 ms |        5.777 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

The pool itself is SAS SSD in a RAIDZ1, but neither the bare pool nor the pool with a separate log device gets near the dataset's linear write throughput. Interestingly enough, though, random writes are faster.

Something synthetic is going on here..


@janetcampbell

janetcampbell commented Apr 7, 2019

I believe you are being bitten by kernel RMW reads:

#8590

Resolving those should drastically speed up writes. Check with zpool iostat -r to make sure that you have no, or very few, reads outstanding during write tests. It can help to raise the txg timeout and the dirty-data sync/max tunables to space txg commits farther apart.

Our experience is that large-block ZVOLs are the fastest for write-based workloads, since there is less overhead. ZFS only does RMW reads at txg commit time or with sub-blocksize indirect sync writes, and with many workloads you can stretch that out to a longer interval, which decreases RMW even more.

We use 128k-1m ZVOLs for most applications. ZVOLs never experience RMW except during txg commit because they never make sub-blocksize indirect sync writes.
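The knobs referred to above look roughly like this (module parameter names vary between ZFS releases, e.g. zfs_dirty_data_sync was later replaced by zfs_dirty_data_sync_percent, so treat these as illustrative; the pool name is a placeholder):

# Request-size histograms; reads showing up during a pure write test are a
# hint that RMW is happening at txg commit.
zpool iostat -r tank 1

# Space txg commits farther apart (seconds)
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout

# Allow more dirty data to accumulate before a forced sync (bytes)
echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max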

@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
