
zvol: Support blk-mq for better performance #12664

Closed

Conversation

tonyhutter
Contributor

Motivation and Context

Increase multi-threaded performance on zvols (Issue #12483)

Description

Add support for the kernel's block multiqueue (blk-mq) interface in the zvol block driver. blk-mq creates multiple request queues on different CPUs rather than having a single request queue. This can improve zvol performance with multithreaded reads/writes.

This implementation uses the blk-mq interfaces on 4.13 or newer kernels. Building against older kernels will fall back to the
older BIO interfaces.

How Has This Been Tested?

Test case added. Also ran benchmarks (results to follow)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@tonyhutter
Contributor Author

Benchmarks

My benchmarking script is here:

https://gist.github.com/tonyhutter/01cf39c44d967e0b176a1d2eeb7f2460#file-benchmark-zvols-sh

All tests were done with volblocksize=1MB and compression=off. The elapsed/system/user times are in seconds and were taken directly from the wait command. I broke these down by non-O_DIRECT and O_DIRECT reads/writes, since you get so much better performance on zvols using O_DIRECT.

16-CPU node, 8 NVMe drives, 4x 2-disk mirrors pool

WRITE                  MB/s    elapsed   system     user
default, non-O_DIRECT   862       95.0     62.8     62.8
default, O_DIRECT      6080       13.5     31.0     31.0
Blk-mq, non-O_DIRECT   3037       27.0     74.0     74.0
Blk-mq, O_DIRECT       6463       12.7     23.9
Speedup:
  non-O_DIRECT          252%     71.62%  -17.81%  -17.81%
  O_DIRECT                6%      5.75%   22.72%   22.72%

READ                   MB/s    elapsed   system     user
default, non-O_DIRECT  7294       11.2     49.3     49.3
default, O_DIRECT     10889        7.6      4.3      4.3
Blk-mq, non-O_DIRECT   9011        9.1     42.8     42.8
Blk-mq, O_DIRECT      13260        6.2      3.1      3.1
Speedup:
  non-O_DIRECT           24%     19.00%   13.24%   13.24%
  O_DIRECT               22%     18.15%   26.96%   26.96%

32-CPU node, 60 SAS HDDs in JBOD, 6x 10-disk raidz2 pool

WRITE                  MB/s    elapsed   system     user
default, non-O_DIRECT  1222      134.2    163.8    163.8
default, O_DIRECT      3219       50.9     32.6     32.6
Blk-mq, non-O_DIRECT   2837       57.7    201.3    201.3
Blk-mq, O_DIRECT       3196       51.3     26.8     26.8
Speedup:
  non-O_DIRECT          132%     56.97%  -22.94%  -22.94%
  O_DIRECT               -1%     -0.72%   17.73%   17.73%

READ                   MB/s    elapsed   system     user
default, non-O_DIRECT  3180       51.5     95.2     95.2
default, O_DIRECT      3539       46.3      6.7      6.7
Blk-mq, non-O_DIRECT   3627       45.2     70.2     70.2
Blk-mq, O_DIRECT       3779       43.4      5.6      5.6
Speedup:
  non-O_DIRECT           14%     12.25%   26.27%   26.27%
  O_DIRECT                7%      6.34%   15.26%   15.26%

Some of these numbers seem unreasonably high, so I'm curious what kind of performance numbers other people get. A proper fio benchmark would be welcome too.

Note: To benchmark the "old" zvol code against blk-mq, simply comment out #define HAVE_BLK_MQ 1 in zfs_config.h (after you've run ./autogen.sh && ./configure), and run make again.

Be sure not to build with ./configure --enable-debug when benchmarking! It will kill performance. Use plain ./configure instead.

@bghira

bghira commented Oct 21, 2021

would encourage using a real EPYC system with full NUMA zone configuration to test memory locality latency concerns

@tonyhutter
Contributor Author

> would encourage using a real EPYC system with full NUMA zone configuration to test memory locality latency concerns

One of the nice things about blk-mq is that it tries to submit the block IO to the request queue for the same CPU. So if you do a dd on CPU 4, it will try to submit the block IOs to the request queue on CPU 4. I say "try to" as request queues are based off of workqueues, which can move work around to other CPUs if the preferred CPU is too busy.
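
For reference, the per-CPU queues come from how many hardware queues the driver asks for when it sizes its blk-mq tag set. A rough sketch of that kind of setup (names and values here are assumptions, not necessarily this PR's exact tuning):

```c
#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Illustrative per-request private data and ops table (assumed, not the PR's). */
struct zv_request { struct request *zvr_rq; };
static const struct blk_mq_ops zvol_blk_mq_ops;	/* .queue_rq, etc. */

/*
 * Rough sketch: ask blk-mq for one hardware queue per online CPU so
 * submissions stay CPU-local.
 */
static int zvol_alloc_tag_set(struct blk_mq_tag_set *tag_set)
{
	memset(tag_set, 0, sizeof (*tag_set));
	tag_set->ops = &zvol_blk_mq_ops;
	tag_set->nr_hw_queues = num_online_cpus();	/* one queue per CPU */
	tag_set->queue_depth = BLKDEV_MAX_RQ;		/* default depth (128) */
	tag_set->numa_node = NUMA_NO_NODE;
	tag_set->cmd_size = sizeof (struct zv_request);	/* per-request pdu */
	tag_set->flags = BLK_MQ_F_SHOULD_MERGE;

	return (blk_mq_alloc_tag_set(tag_set));
}
```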

@behlendorf added the Component: ZVOL (ZFS Volumes) and Status: Code Review Needed (Ready for review and testing) labels on Oct 21, 2021

@tonynguien (Contributor) left a comment

This is great! I'll make time to do some testing early next week.

@tonynguien self-assigned this Oct 21, 2021
@adamdmoss
Contributor

I've been test-driving this branch for a couple of days.
No problems except the system (at least, the X session) appears to consistently hang after wake-from-suspend. The only suspicious syslog breadcrumbs are things like this:

Oct 25 12:37:48 foxbox kernel: [103815.516588] CPU: 1 PID: 15 Comm: cpuhp/1 Tainted: P           OE     5.8.0-59-lowlatency #66~20.04.1-Ubuntu
Oct 25 12:37:48 foxbox kernel: [103815.516589] Hardware name: Gigabyte Technology Co., Ltd. Z68MA-D2H-B3/Z68MA-D2H-B3, BIOS F10 02/23/2012
Oct 25 12:37:48 foxbox kernel: [103815.516590] Call Trace:
Oct 25 12:37:48 foxbox kernel: [103815.516599]  dump_stack+0x74/0x92
Oct 25 12:37:48 foxbox kernel: [103815.516602]  __schedule_bug.cold+0x4c/0x5d
Oct 25 12:37:48 foxbox kernel: [103815.516606]  __schedule+0x789/0xa70
Oct 25 12:37:48 foxbox kernel: [103815.516609]  schedule+0x6e/0xe0
Oct 25 12:37:48 foxbox kernel: [103815.516611]  schedule_timeout+0x132/0x180
Oct 25 12:37:48 foxbox kernel: [103815.516615]  ? ttwu_queue_wakelist+0x9b/0xd0
Oct 25 12:37:48 foxbox kernel: [103815.516618]  wait_for_completion_killable+0x8a/0x140
Oct 25 12:37:48 foxbox kernel: [103815.516622]  __kthread_create_on_node+0xd6/0x1c0
Oct 25 12:37:48 foxbox kernel: [103815.516630]  ? task_done+0xb0/0xb0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516633]  ? string_nocheck+0x4f/0x60
Oct 25 12:37:48 foxbox kernel: [103815.516639]  ? task_done+0xb0/0xb0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516645]  ? taskq_thread_create+0xf0/0xf0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516647]  kthread_create_on_node+0x49/0x60
Oct 25 12:37:48 foxbox kernel: [103815.516652]  ? task_done+0xb0/0xb0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516658]  spl_kthread_create+0x88/0xd0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516665]  taskq_thread_create+0x6a/0xf0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516671]  spl_taskq_expand+0xc0/0xd0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516675]  cpuhp_invoke_callback+0x248/0x630
Oct 25 12:37:48 foxbox kernel: [103815.516678]  cpuhp_thread_fun+0xb8/0x120
Oct 25 12:37:48 foxbox kernel: [103815.516681]  smpboot_thread_fn+0xfc/0x1f0
Oct 25 12:37:48 foxbox kernel: [103815.516683]  ? sort_range+0x30/0x30
Oct 25 12:37:48 foxbox kernel: [103815.516685]  kthread+0x129/0x170
Oct 25 12:37:48 foxbox kernel: [103815.516686]  ? kthread_park+0x90/0x90
Oct 25 12:37:48 foxbox kernel: [103815.516690]  ret_from_fork+0x1f/0x30
Oct 25 12:37:48 foxbox kernel: [103815.517031] BUG: scheduling while atomic: cpuhp/1/15/0x00000000

... but that sort of stacktrace featuring ZFS was fairly common even without this branch (though the hang wasn't), so it may be a red herring / unrelated.

(I suspect that it's a bug to call any variation of wait_for_completion while atomic, and perhaps outside of a lowlatency kernel the opportunity wouldn't arise.)

@behlendorf
Contributor

@adamdmoss thanks for the additional testing. The issue you reported wasn't introduced by this change; I've gone ahead and opened PR #12696 with your stack trace to track it.

@tonyhutter
Contributor Author

@behlendorf thanks for taking a look, I'll make those changes.

@sempervictus
Contributor

@tonyhutter: Thanks for tagging old threads with this - email notifications are handy.
We build directly into the kernel, and run grsec or linux-hardened kernels, which will hopefully surface some additional misbehavior during testing. Any data dangers here, or is this purely execution semantics and ordering operations (haven't skimmed the code yet)?

The benchmarks show considerable system CPU usage increases in the non-O_DIRECT case. When MQ was first mainlined, we saw that sort of thing as well and ended up tracking it down to scheduler thrash for where things should be handled (related IOs want to be on the same CPU/queue, but it's not guaranteed to be submitted into the same queue IIRC). Is the threadpool fixed in your testing or dynamic? I may not need to do this anymore, but we boot older (like pre v7) xeons using spl_taskq_thread_dynamic=0 to deal with scheduling thrash on varying numbers of threads.

There is also a dm-crypt optimization which came about with faster SSD/NVMe storage which de-queues crypto operations and forces serialized execution in the same context. I think it came in around 5.10; Cloudflare wrote it IIRC. If a similar trick is feasible here, it might help reduce the gaps between O_DIRECT and non-direct, or possibly even help out with the system load if it is in fact being caused by queue scheduling contention.

Also, what's the impact of 100 RMW cycles on the benchmark - does it actually help execute other IOs while the slab allocator seeks a viable hole in the extent, or does the RMW penalty overtake the gains here? Unfortunately ZVOLs wear like race slicks :).

Thanks a ton for tackling this, looking forward to faster ZVOLs. Any thoughts on whether this would play nice with the async DMU idea Matt Macy was working on? Async dispatch across multiple queues sounds offhand like it would have cumulative benefits (so long as the async parts are truly evented instead of having to poll all those queues).

@DemiMarie

Test suite is failing.

@tonyhutter
Contributor Author

@sempervictus

> Any data dangers here, or is this purely execution semantics and ordering operations (haven't skimmed the code yet)?

There shouldn't be... In the end it's just passing BIOs to the zvol layer as before. Famous last words though...

> The benchmarks show considerable system CPU usage increases in the non-O_DIRECT case.

Correct, but in this case it's a good thing. It's showing that the system is utilizing more of the CPU cores to get the work done quicker. For example, in the 16-CPU benchmark it's basically burning ~18% more CPU cycles for 252% more throughput. There may be some scheduler thrash as we do have lots of queueing going on. And by that I mean each request gets queued twice - once in the kernel's block request_queue, and once in the zvol workqueue (see zvol_mq_queue_rq()). That sounds bad, but I don't think it's a big bottleneck since the requests themselves represent pretty big blocks of data.
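
To make that double-queueing concrete, here's a rough sketch of a ->queue_rq() handler that hands the already-dispatched request off to a driver workqueue. The names (zv_request, zvol_wq, zvol_process_rq) are illustrative and the flow is simplified, not the PR's exact code:

```c
#include <linux/blk-mq.h>
#include <linux/workqueue.h>

static struct workqueue_struct *zvol_wq;	/* assumed to be created at module init */

struct zv_request {
	struct work_struct	zvr_work;
	struct request		*zvr_rq;
};

static void zvol_process_rq(struct work_struct *work)
{
	struct zv_request *zvr = container_of(work, struct zv_request, zvr_work);

	/* ...translate the request's BIOs into zvol reads/writes here... */
	blk_mq_end_request(zvr->zvr_rq, BLK_STS_OK);
}

static blk_status_t zvol_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
    const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	struct zv_request *zvr = blk_mq_rq_to_pdu(rq);	/* per-request area */

	zvr->zvr_rq = rq;
	blk_mq_start_request(rq);

	/* Second queueing: the block layer already queued rq once to get here. */
	INIT_WORK(&zvr->zvr_work, zvol_process_rq);
	queue_work(zvol_wq, &zvr->zvr_work);

	return (BLK_STS_OK);
}
```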

> Also, what's the impact of 100 RMW cycles on the benchmark

I don't know, that would be interesting to see. I used parallel dd's in the benchmark since that's what the developer who originally reported the issue used.

> Any thoughts on whether this would play nice with the async DMU idea Matt Macy was working on?

I'm not familiar enough with that code to say one way or another. This PR doesn't change anything lower than the BIO layer, so I'm assuming it would be fine.

@DemiMarie yea for some reason it's not running the zvol_stress test I added. I'll look into that and will hopefully have a fix on my next push.

@samuelxhu

@tonyhutter This is a very interesting performance optimization and i would definitely give it a try in testing. Since versions 0.8.6 and 2.0.6 are widely used and seem quite stable now, could you please produce backport patches for 0.8.6 and 2.0.6?

@tonyhutter
Contributor Author

@samuelxhu this will either go into a future 2.1.x or 2.2.x release, depending on risk/benefit. 2.0.x is only for bugfixes and low-risk patches and 0.8.6 is no longer developed, so no plans to backport it.

@tonynguien
Contributor

It would be nice if the ZTS performance suite had zvol tests. Perhaps a hackathon project ;-)

@sempervictus
Contributor

@tonyhutter: i believe that there is a way for viable schedulers of a block device to be restricted such as to remove the "intelligently queuing" options from the available list. AFAIK, distros tend to default to things like CFQ for generic use-cases, which stack unfavorably with the IO scheduler underneath the ZVOL. I think there's also a way to set the default scheduler, but i haven't dug around in that part of the kernel in some time.

@bghira

bghira commented Nov 7, 2021

> @tonyhutter: i believe that there is a way for viable schedulers of a block device to be restricted such as to remove the "intelligently queuing" options from the available list. AFAIK, distros tend to default to things like CFQ for generic use-cases, which stack unfavorably with the IO scheduler underneath the ZVOL. I think there's also a way to set the default scheduler, but i haven't dug around in that part of the kernel in some time.

setting the default is for all block devices, unfortunately. ZFS used to set the noop scheduler on the vdevs but this mechanism was removed from the Linux kernel in torvalds/linux@85c0a03

@sempervictus
Contributor

> @tonyhutter: i believe that there is a way for viable schedulers of a block device to be restricted such as to remove the "intelligently queuing" options from the available list. AFAIK, distros tend to default to things like CFQ for generic use-cases, which stack unfavorably with the IO scheduler underneath the ZVOL. I think there's also a way to set the default scheduler, but i haven't dug around in that part of the kernel in some time.

> setting the default is for all block devices, unfortunately. ZFS used to set the noop scheduler on the vdevs but this mechanism was removed from the Linux kernel in torvalds/linux@85c0a03

To quote Homer Simpson: "doh!"
Thanks for the clarification @bghira

@tonyhutter
Contributor Author

@behlendorf my last push has all your changes included. I need to look into some zvol_stress errors on FreeBSD though:

01:10:32.14 zvol_stress2: (groupid=0, jobs=1): err=45 (file:engines/posixaio.c:180, func=xfer, error=Operation not supported): pid=98257: Tue Nov 9 01:10:32 2021

@sempervictus
Contributor

sempervictus commented Jan 16, 2022

I may have found an unpleasant bug.
Looks like the scheduling for ZVOLs falls down with this in-play if you are making thin-pools atop the ZVOLs. Commands just hang, kernel times out, bad news bears:

root        9321  0.0  0.0 1232172 11648 ?       Ssl  13:56   0:00              \_ serviced-storage create-thin-pool serviced -s 20G zenoss
root        9335  0.0  0.0  25956  8084 ?        D<   13:56   0:00                  \_ lvcreate -T -L 20G zenoss/serviced-pool

which produces

[  363.658041]  sda: sda1 sda9
[  366.167046] nfsd: last server has exited, flushing export cache
[  615.631009] INFO: task lvcreate:9335 blocked for more than 122 seconds.
[  615.635351]       Tainted: G                T  5.15.14 #1
[  615.637653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  615.641282] task:lvcreate        state:D stack:    0 pid: 9335 ppid:  9321 flags:0x00004000
[  615.641289] Call Trace:
[  615.641294]  <TASK>
[  615.641295]  [<ffffffff823c18b4>] ? __schedule+0xfb4/0x1ac0
[  615.641306]  [<ffffffff823c241a>] ? schedule+0x5a/0x130
[  615.641310]  [<ffffffff823c96fd>] ? schedule_timeout+0x11d/0x190
[  615.641314]  [<ffffffff81159b75>] ? __prepare_to_swait+0x85/0xc0
[  615.641319]  [<ffffffff823c089f>] ? io_schedule_timeout+0x6f/0xd0
[  615.641323]  [<ffffffff823c35fe>] ? wait_for_completion_io+0xbe/0x160
[  615.641329]  [<ffffffff849a9d53>] ? sync_io+0x123/0x190 [dm_mod]
[  615.641354]  [<ffffffff849a9fd5>] ? dm_io+0x205/0x2d0 [dm_mod]
[  615.641361]  [<ffffffff849a9440>] ? rap_hash.km_get_page+0x10/0x10 [dm_mod]
[  615.641367]  [<ffffffff849a8f40>] ? rap_hash.km_next_page+0x10/0x10 [dm_mod]
[  615.641376]  [<ffffffff846db312>] ? dm_bufio_write_dirty_buffers+0x4a2/0x500 [dm_bufio]
[  615.641382]  [<ffffffff846e86b1>] ? dm_bm_flush+0x51/0x60 [dm_persistent_data]
[  615.641402]  [<ffffffff846ef377>] ? dm_tm_pre_commit+0x67/0xa0 [dm_persistent_data]
[  615.641409]  [<ffffffff849d8181>] ? __create_persistent_data_objects+0x681/0xcb0 [dm_thin_pool]
[  615.641417]  [<ffffffff849d9324>] ? dm_pool_metadata_open+0xd4/0x2e0 [dm_thin_pool]
[  615.641424]  [<ffffffff849cf0db>] ? pool_ctr+0x6bb/0xf10 [dm_thin_pool]
[  615.641429]  [<ffffffff8499d3c7>] ? dm_table_add_target+0x197/0x470 [dm_mod]
[  615.641437]  [<ffffffff849a8033>] ? table_load+0x1e3/0x600 [dm_mod]
[  615.641444]  [<ffffffff849a4d61>] ? ctl_ioctl+0x271/0xa00 [dm_mod]
[  615.641451]  [<ffffffff849a7e50>] ? rap_hash.table_load+0x10/0x10 [dm_mod]
[  615.641460]  [<ffffffff849a551d>] ? dm_ctl_ioctl+0x1d/0x50 [dm_mod]
[  615.641466]  [<ffffffff8146e36d>] ? __x64_sys_ioctl+0x13d/0x180
[  615.641473]  [<ffffffff8239c6cc>] ? do_syscall_64+0x7c/0x100
[  615.641477]  [<ffffffff81001293>] ? entry_SYSCALL_64_after_hwframe+0x75/0x138
[  615.641481]  </TASK>

Some of that stack won't be reproducible by others - this is a 5.15 grsec kernel (hence the RAP hash calls), but it's the IO which never completes - not a failure of the call/ret site check.

Since we build ZFS into the kernel, i updated the code to default to using blk-mq for ZVOLs. To ensure that this is in fact the cause (it happened multiple times in a row), i booted with zfs.zvol_use_blk_mq=0, and now Chef is happily building the stack (i'm old school, sue me) long past the point where it got stuck creating those dm-thin pieces.

I think we need tests where we build other block constructs atop ZVOLs to test for scheduling interactions between layered blockdevs in-kernel.

@DemiMarie

Does ZFS do any memory allocations in the I/O path? It is possible for doing so to deadlock.

@sempervictus
Contributor

To add injury to insult, since thin pools are fragile things to begin with, this one got toasted.
The test target is a devops-built, hardened Zenoss Core deployment (it's a sort of PoC to showcase mechanisms for securing systems with access to everything in the IT infrastructure), which leverages ZFS for all of its Docker functionality and ServiceD storage. The iffy thing with serviced is that it insists, profusely, on using really fragile storage - dm-thin pools & legacy FS, basically asking for faceplants because the userspace stack can get very memory intensive (this failure is occurring well before said stack even exists; it's in the pre-provision phase setting up storage). It looks like it might be readily reproducible atop our Xena OpenStack environment, so i should be able to re-test down the line. For now, will have to mark this unsafe and keep it off on my personal host (our prod systems don't have this patch yet).

@DemiMarie

> To add injury to insult, since thin pools are fragile things to begin with, this one got toasted.

I take it that thin_repair wasn’t able to fix the problem? If there was a transaction ID mismatch, that may be fixable by hand-editing the metadata.

@sempervictus
Contributor

@DemiMarie - thin_repair found nothing wrong, but the XFS atop the volume wouldn't mount or even be detected as XFS by xfsfix. Whatever was written-out to the thin pool mangled the FS beyond recognition.

@tonyhutter
Contributor Author

@sempervictus Funny you should mention the lockup - I was just doing an "extract the linux kernel tarball" test on top of an ext4-formatted blk-mq zvol, and also hit lockups:

[1277479.103071] INFO: task ar:3226921 blocked for more than 120 seconds.
[1277479.118387] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1277479.126394] ar              D    0 3226921 3226752 0x00004080
[1277479.132324] Call Trace:
[1277479.134958]  __schedule+0x2c7/0x720
[1277479.138638]  ? wbt_exit+0x30/0x30
[1277479.142134]  ? __wbt_done+0x30/0x30
[1277479.145810]  ? rq_qos_wait+0xfc/0x180
[1277479.149655]  schedule+0x4d/0xc0
[1277479.152974]  io_schedule+0x12/0x40
[1277479.156565]  rq_qos_wait+0x101/0x180
[1277479.160325]  ? karma_partition+0x1e0/0x1e0
[1277479.164605]  ? wbt_exit+0x30/0x30
[1277479.168113]  wbt_wait+0x99/0xe0
[1277479.171445]  __rq_qos_throttle+0x23/0x30
[1277479.175554]  blk_mq_make_request+0x12d/0x5f0
[1277479.180006]  generic_make_request+0x2e9/0x350
[1277479.184548]  submit_bio+0x3c/0x160
[1277479.188155]  ext4_io_submit+0x48/0x60 [ext4]
[1277479.192621]  ext4_bio_write_page+0x17d/0x440 [ext4]
...

Discussion:

There have been essentially four versions (really, variations) of the blkmq code that I've tested over the course of this PR:

Ver 1: blk-mq with 'struct requests' on the workqueue (tonyhutter:blkmq-request)

This was the first iteration I submitted for this PR. It put each struct request (which contains one or more BIOs) onto the workqueue to be processed. So multiple struct requests could be processed in parallel in the background, but the BIOs within each request were processed sequentially.

Ver 2: blk-mq with BIOs on the workqueue (tonyhutter:blkmq)

This is the current branch in the PR. It is also the branch that is presently giving us lockups. It puts each BIO from each struct request into a workqueue. The idea here was to give better parallelism and rectify the bad random-RW performance @tonynguien was seeing. Unfortunately, it didn't fix the random-RW performance, and performed roughly the same as Ver 1. I believe the lockups we're seeing with this branch may be due to the gnarly request-finalizing logic I had to add in to get the last completed BIO in the request to call blk_mq_end_request().

Ver 3: blk-mq with the old-school "zvol threads" (tonyhutter:blkmq-oldschool)

This was an experimental branch that I put together to see if I could rectify the bad random-RW performance by using the existing zvol codepaths with blk-mq. It worked, but gave mostly the same (and sometimes worse) performance as the current, vanilla, non-blkmq zvol code.

Ver 4: blk-mq with 'struct requests' on the workqueue + "BIO merging" (tonyhutter:biomerge)

This is another experimental branch that takes Ver 1, and adds the ability to merge contiguous BIOs within a request into one big BIO. The idea being to reduce dbuf locking overhead and processing for contiguous RW. I've seen it merge requests with 300+ BIOs into one BIO, for example. It has given me the best performance so far for O_DIRECT sequential writes.

I think the way forward is Ver 4. Or more specifically, I will upload Ver 1 for this PR (which I'm hoping will fix the lockup), and then do a follow-on PR with the BIO merging in it to bring it up to Ver 4.
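
For context on the "request finalizing" mentioned under Ver 2: the usual pattern is a per-request in-flight counter where only the last completing BIO ends the blk-mq request. A minimal sketch of that pattern (illustrative names, not the PR's actual code):

```c
#include <linux/blk-mq.h>
#include <linux/atomic.h>

/*
 * Hedged sketch of per-request BIO accounting: each struct request's
 * driver data tracks how many of its BIOs are still in flight, and
 * only the last completion finishes the request. zv_rq_data and
 * zvol_bio_done are illustrative names.
 */
struct zv_rq_data {
	atomic_t	zvd_bios_outstanding;	/* set when BIOs are dispatched */
	blk_status_t	zvd_status;
};

static void zvol_bio_done(struct request *rq, blk_status_t status)
{
	struct zv_rq_data *zvd = blk_mq_rq_to_pdu(rq);

	if (status != BLK_STS_OK)
		zvd->zvd_status = status;	/* remember any error */

	/* Only the final BIO completion ends the blk-mq request. */
	if (atomic_dec_and_test(&zvd->zvd_bios_outstanding))
		blk_mq_end_request(rq, zvd->zvd_status);
}
```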

@sempervictus
Contributor

sempervictus commented Jan 19, 2022

@tonyhutter: i'm having a fair deal of deja vu from when @ryao yanked the old version of these block device mechanisms. One of the resulting follow-ups was to implement 5731140, with the following comment:

    Disable write merging on ZVOLs
    
    The current ZVOL implementation does not explicitly set merge
    options on ZVOL device queues, which results in the default merge
    behavior.
    
    Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the
    ZIO pipeline to do its work.
    
    Initial benchmarks (tiotest with no O_DIRECT) show random write
    performance going up almost 3X on 8K ZVOLs, even after significant
    rewrites of the logical space allocation.

so really, we're talking about reverting that when we use the MQ pipeline and leaving it as hacked together 5y ago when using the existing pipeline.

If the Linux-proper way to do this is to feed BIOs (a la v3) into the queues, is what we're seeing some sort of ordering issue? Is there some analog to blk_mq_end_request in other implementations like block storage servers or something which we could review to correct the implementation? If its actually alright to use struct request allowing iovec merging to happen in the upper (Linux) layer, then that sounds like it will more closely resemble what we had back when ZVOLs were (relatively) fast. Might be useful to find some of the PRs and issues from back then - guessing that code graveyard has some handy info for this effort.

EDIT: sorry about the nuisance nitpicking here - i was around the last time this codepath was significantly altered, and in the years since we have lost so much performance relative to the gains in backing storage that it seems like this is something "we must get right, this time" if ZVOLs are to stay relevant in modern compute environments (backing cloud storage for medical research, for example).
The original objective was to simplify the block IO codepath and leverage ZFS' own innards for dispatch and optimization of such, but i think that giving up the OS-native interfaces ended up being more costly in the end. When we have async dispatch and more hardware offload in those innards to decrease latency/improve throughput, we can reduce these queue depths to 1 to retain the OS interface while relying on ZFS for performance; but for now the IO merging and other "upper layer optimizations" are probably good band-aids to have.

@tonyhutter
Contributor Author

> so really, we're talking about reverting that when we use the MQ pipeline and leaving it as hacked together 5y ago when using the existing pipeline.

Right, we would revert QUEUE_FLAG_NOMERGES for the blkmq codepath only. We need to revert it to get BIO merging to work. zvol_use_blk_mq would basically be yet another knob to tune to your workload (yes, this is annoying). We might promote zvol_use_blk_mq to a per-zvol parameter if it makes sense to do so, but for now I think a module param is fine.

> If the Linux-proper way to do this is to feed BIOs (a la v3) into the queues, is what we're seeing some sort of ordering issue? Is there some analog to blk_mq_end_request in other implementations like block storage servers or something which we could review to correct the implementation?

I don't know if it's the ordering. I suspect the slightly worse performance for V3 may be due to the fact that V3 is basically "do it exactly the same as the old way but put the BIO in a blkmq queue first". So more overhead from queuing.

> If its actually alright to use struct request allowing iovec merging to happen in the upper (Linux) layer, then that sounds like it will more closely resemble what we had back when ZVOLs were (relatively) fast.

Unfortunately, there's no easy way to tell Linux to "merge the iovecs from the BIOs in this request". Maybe it did that in older kernels back in the day, but I don't see where it does it anymore (and I have looked). Note that for the "BIO merging" patch, we actually do merge the iovecs, in that we create a single UIO (which is ZFS's internal representation of a BIO) from multiple BIOs, and then put the BIO's iovecs into the UIO as an array.
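
A rough sketch of what that gathering step can look like on the Linux side (illustrative only - the real patch wraps the result in ZFS's uio rather than returning a bare bio_vec array):

```c
#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/slab.h>

/*
 * Hedged sketch of the "BIO merging" idea: collect the bio_vecs of every
 * BIO in a struct request into one array covering the request's whole
 * byte range, so it can be issued as a single logical I/O instead of one
 * I/O per BIO.
 */
static struct bio_vec *zvol_request_segments(struct request *rq,
    unsigned int *nr_segs)
{
	struct req_iterator iter;
	struct bio_vec bv, *segs, *s;
	unsigned int n = 0;

	/* First pass: count every segment of every BIO in the request. */
	rq_for_each_segment(bv, rq, iter)
		n++;

	segs = kmalloc_array(n, sizeof (*segs), GFP_NOIO);
	if (segs == NULL)
		return (NULL);

	/* Second pass: copy the segments out in submission order. */
	s = segs;
	rq_for_each_segment(bv, rq, iter)
		*s++ = bv;

	/*
	 * 'segs' now describes blk_rq_bytes(rq) bytes starting at byte
	 * offset blk_rq_pos(rq) << 9; a real implementation would hand
	 * this to the DMU as one read or write.
	 */
	*nr_segs = n;
	return (segs);
}
```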

> The original objective was to simplify the block IO codepath and leverage ZFS' own innards for dispatch and optimization of such, but i think that giving up the OS-native interfaces ended up being more costly in the end.

Yea, when all is said and done, we'll basically have:

zvol_use_blk_mq = 0: Use the old zvol code. Best for low-recordsize zvols with mostly small random RW workloads (databases).
zvol_use_blk_mq = 1: Use blkmq with BIO merging. Best for medium-large recordsize zvols with sequential reads/writes (filesystems on top of zvols?).

Alternatively we could get rid of zvol_use_blk_mq and hardcode something like: "if the zvol recordsize >= 16k then use blkmq, else use normal zvol threads". I'm not convinced this is a good idea though.

Also - QUEUE_FLAG_NOMERGES means "you can only have one BIO per request"; I've verified this in testing. There's another flag called QUEUE_FLAG_NOXMERGES for "no extended merges". That basically means "only do simple low-overhead merges, don't go out of your way to do a merge". QUEUE_FLAG_NOXMERGES is worth investigating, both for the blk-mq and non-blk-mq cases. It's dynamically settable in sysfs via nomerges as well (0 = merges enabled, 1 = NOXMERGES, 2 = NOMERGES).
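
For completeness, a tiny sketch of how those flags get toggled on a queue in-kernel, assuming a 4.17+ kernel that provides blk_queue_flag_set()/blk_queue_flag_clear() (older kernels spell this differently); the helper name is illustrative:

```c
#include <linux/blkdev.h>

/*
 * Hedged sketch: pick a merge policy for a zvol's request_queue.
 * QUEUE_FLAG_NOMERGES = one BIO per request; QUEUE_FLAG_NOXMERGES =
 * only cheap contiguous merges.
 */
static void zvol_set_merge_policy(struct request_queue *q, int nomerges)
{
	blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
	blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, q);

	if (nomerges == 2)		/* like 'echo 2 > .../queue/nomerges' */
		blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
	else if (nomerges == 1)		/* like 'echo 1 > .../queue/nomerges' */
		blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
}
```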

@sempervictus
Contributor

I can't quite tell how much of the force-push from 3d ago is this PR and how much is delta in master without a deep dive.
@tonyhutter - any significant changes worth a rebuild/retest cycle on my end?

@tonyhutter
Contributor Author

@sempervictus the latest push changes the PR to "Ver 1: blk-mq with 'struct requests' on the workqueue". Hopefully that will fix the crash we were seeing earlier, and also preps the code for the follow-up "BIO merge" PR.

@sempervictus
Contributor

Thank you sir. Should i wait to throw this into our kernel tree for the follow-up since we saw that the struct approach worked before?

@tonyhutter
Contributor Author

@sempervictus yes, feel free to give it a shot. It is not crashing on me anymore when I do my ext4-on-zvol test. Don't forget to enable blkmq before importing/creating your volumes: echo 1 > /sys/module/zfs/parameters/zvol_use_blk_mq

@tonyhutter
Contributor Author

Reviewers - this PR is passing ZTS and appears to be ready to go. Can you please take another look?

@sempervictus
Contributor

sempervictus commented Jan 31, 2022

I've been running the version from ~5d ago without issue.
I'll do another merge/build and kick off a day-long zloop

EDIT: looks like i am running the correct version, so will get a VM to run the tests for a bit

@behlendorf
Contributor

@tonyhutter can you go ahead and rebase this to resolve the kernel.org build issue.

@sempervictus
Contributor

No crashes yet - two identical VMs on grsec 5.15.17 (with RAP and all the goodies) running, one with MQ on and one with it off. I might need new nvmes at some point with all this abuse 😄, but it looks to be working.
Moving into "functional" testing where i will replicate the BIO-submission crash conditions i saw before - building Zenoss atop our Arch base with ZFS backing the stores.

@sempervictus
Contributor

@tonyhutter - when copying in-tree, but building as a module (normally i build ZFS into the main binary), we hit this little gem:

FATAL: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol 'blk_queue_write_cache'
make[4]: *** [scripts/Makefile.modpost:125: modules-only.symvers] Error 1
make[3]: *** [Makefile:1413: modules] Error 2
make[3]: *** Waiting for unfinished jobs....

^^ 5.10.96

@tonyhutter
Contributor Author

Hmm.. that might be a general build bug then - my PR doesn't do anything with that function. Maybe try re-running ./autogen.sh && ./configure. Interestingly, my zfs_config.h also shows that function as GPL'd but it builds fine as a module:

/* blk_queue_write_cache() exists */
#define HAVE_BLK_QUEUE_WRITE_CACHE 1

/* blk_queue_write_cache() is GPL-only */
#define HAVE_BLK_QUEUE_WRITE_CACHE_GPL_ONLY 1

Add support for the kernel's block multiqueue (blk-mq) interface in
the zvol block driver.  blk-mq creates multiple request queues on
different CPUs rather than having a single request queue.  This can
improve zvol performance with multithreaded reads/writes.

This implementation uses the blk-mq interfaces on 4.13 or newer
kernels.  Building against older kernels will fall back to the
older BIO interfaces.

Note that you must set the `zvol_use_blk_mq` module param to
enable the blk-mq API.  It is disabled by default.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue openzfs#12483
@tonyhutter
Contributor Author

I'm closing this PR in favor of #13148, which has both the blk-mq changes and the "BIO merging" that we've been talking about. I originally thought it would be better to split it into two PRs, but after looking at the new code it doesn't really make sense to do it that way.

@tonyhutter closed this Feb 24, 2022