
zvol: Support blk-mq for better performance #12664

Closed

Conversation

tonyhutter
Contributor

Motivation and Context

Increase multi-threaded performance on zvols (Issue #12483)

Description

Add support for the kernel's block multiqueue (blk-mq) interface in the zvol block driver. blk-mq creates multiple request queues on different CPUs rather than having a single request queue. This can improve zvol performance with multithreaded reads/writes.

This implementation uses the blk-mq interfaces on 4.13 or newer kernels. Building against older kernels will fall back to the
older BIO interfaces.

How Has This Been Tested?

Test case added. Also ran benchmarks (results to follow)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@tonyhutter
Contributor Author

Benchmarks

My benchmarking script is here:

https://gist.github.com/tonyhutter/01cf39c44d967e0b176a1d2eeb7f2460#file-benchmark-zvols-sh

All tests were done with volblocksize=1MB and compression=off. The elapsed/system/user times are in seconds and were taken directly from the wait command. I broke these down by non-O_DIRECT and O_DIRECT reads/writes, since you get so much better performance on zvols using O_DIRECT.

16-CPU node, 8 NVMe drives, 4x 2-disk mirrors pool

WRITE                  MB/s    elapsed   system     user
default, non-O_DIRECT   862       95.0     62.8     62.8
default, O_DIRECT      6080       13.5     31.0     31.0
Blk-mq, non-O_DIRECT   3037       27.0     74.0     74.0
Blk-mq, O_DIRECT       6463       12.7     23.9
Speedup:
  non-O_DIRECT          252%     71.62%  -17.81%  -17.81%
  O_DIRECT                6%      5.75%   22.72%   22.72%

READ                   MB/s    elapsed   system     user
default, non-O_DIRECT  7294       11.2     49.3     49.3
default, O_DIRECT     10889        7.6      4.3      4.3
Blk-mq, non-O_DIRECT   9011        9.1     42.8     42.8
Blk-mq, O_DIRECT      13260        6.2      3.1      3.1
Speedup:
  non-O_DIRECT           24%     19.00%   13.24%   13.24%
  O_DIRECT               22%     18.15%   26.96%   26.96%

32-CPU node, 60 SAS HDDs in JBOD, 6x 10-disk raidz2 pool

WRITE                  MB/s    elapsed   system     user
default, non-O_DIRECT  1222      134.2    163.8    163.8
default, O_DIRECT      3219       50.9     32.6     32.6
Blk-mq, non-O_DIRECT   2837       57.7    201.3    201.3
Blk-mq, O_DIRECT       3196       51.3     26.8     26.8
Speedup:
  non-O_DIRECT          132%     56.97%  -22.94%  -22.94%
  O_DIRECT               -1%     -0.72%   17.73%   17.73%

READ                   MB/s    elapsed   system     user
default, non-O_DIRECT  3180       51.5     95.2     95.2
default, O_DIRECT      3539       46.3      6.7      6.7
Blk-mq, non-O_DIRECT   3627       45.2     70.2     70.2
Blk-mq, O_DIRECT       3779       43.4      5.6      5.6
Speedup:
  non-O_DIRECT           14%     12.25%   26.27%   26.27%
  O_DIRECT                7%      6.34%   15.26%   15.26%

Some of these numbers seem unreasonably high, so I'm curious what kind of performance numbers other people get. A proper fio benchmark would be welcome too.

Note: To benchmark the "old" zvol code against blk-mq, simply comment out #define HAVE_BLK_MQ 1 in zfs_config.h (after you've run ./autogen.sh && ./configure), and run make again.

Be sure not to build with ./configure --enable-debug when benchmarking! It will kill performance. Use plain ./configure instead.

@bghira

bghira commented Oct 21, 2021

would encourage using a real EPYC system with full NUMA zone configuration to test memory locality latency concerns

@tonyhutter
Contributor Author

> would encourage using a real EPYC system with full NUMA zone configuration to test memory locality latency concerns

One of the nice things about blk-mq is that it tries to submit the block IO to the request queue for the same CPU. So if you do a dd on CPU 4, it will try to submit the block IOs to the request queue on CPU 4. I say "try to" as request queues are based off of workqueues, which can move work around to other CPUs if the preferred CPU is too busy.
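
For reference, the per-CPU queues come from how many hardware queues the driver asks for when it sizes its blk-mq tag set. A rough sketch of that kind of setup (names and values here are assumptions, not necessarily this PR's exact tuning):

```c
#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Illustrative per-request private data and ops table (assumed, not the PR's). */
struct zv_request { struct request *zvr_rq; };
static const struct blk_mq_ops zvol_blk_mq_ops;	/* .queue_rq, etc. */

/*
 * Rough sketch: ask blk-mq for one hardware queue per online CPU so
 * submissions stay CPU-local.
 */
static int zvol_alloc_tag_set(struct blk_mq_tag_set *tag_set)
{
	memset(tag_set, 0, sizeof (*tag_set));
	tag_set->ops = &zvol_blk_mq_ops;
	tag_set->nr_hw_queues = num_online_cpus();	/* one queue per CPU */
	tag_set->queue_depth = BLKDEV_MAX_RQ;		/* default depth (128) */
	tag_set->numa_node = NUMA_NO_NODE;
	tag_set->cmd_size = sizeof (struct zv_request);	/* per-request pdu */
	tag_set->flags = BLK_MQ_F_SHOULD_MERGE;

	return (blk_mq_alloc_tag_set(tag_set));
}
```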

@behlendorf added the Component: ZVOL (ZFS Volumes) and Status: Code Review Needed (Ready for review and testing) labels on Oct 21, 2021

@tonynguien (Contributor) left a comment

This is great! I'll make time to do some testing early next week.

@tonynguien self-assigned this Oct 21, 2021
@adamdmoss
Contributor

I've been test-driving this branch for a couple of days.
No problems except the system (at least, the X session) appears to consistently hang after wake-from-suspend. The only suspicious syslog breadcrumbs are things like this:

Oct 25 12:37:48 foxbox kernel: [103815.516588] CPU: 1 PID: 15 Comm: cpuhp/1 Tainted: P           OE     5.8.0-59-lowlatency #66~20.04.1-Ubuntu
Oct 25 12:37:48 foxbox kernel: [103815.516589] Hardware name: Gigabyte Technology Co., Ltd. Z68MA-D2H-B3/Z68MA-D2H-B3, BIOS F10 02/23/2012
Oct 25 12:37:48 foxbox kernel: [103815.516590] Call Trace:
Oct 25 12:37:48 foxbox kernel: [103815.516599]  dump_stack+0x74/0x92
Oct 25 12:37:48 foxbox kernel: [103815.516602]  __schedule_bug.cold+0x4c/0x5d
Oct 25 12:37:48 foxbox kernel: [103815.516606]  __schedule+0x789/0xa70
Oct 25 12:37:48 foxbox kernel: [103815.516609]  schedule+0x6e/0xe0
Oct 25 12:37:48 foxbox kernel: [103815.516611]  schedule_timeout+0x132/0x180
Oct 25 12:37:48 foxbox kernel: [103815.516615]  ? ttwu_queue_wakelist+0x9b/0xd0
Oct 25 12:37:48 foxbox kernel: [103815.516618]  wait_for_completion_killable+0x8a/0x140
Oct 25 12:37:48 foxbox kernel: [103815.516622]  __kthread_create_on_node+0xd6/0x1c0
Oct 25 12:37:48 foxbox kernel: [103815.516630]  ? task_done+0xb0/0xb0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516633]  ? string_nocheck+0x4f/0x60
Oct 25 12:37:48 foxbox kernel: [103815.516639]  ? task_done+0xb0/0xb0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516645]  ? taskq_thread_create+0xf0/0xf0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516647]  kthread_create_on_node+0x49/0x60
Oct 25 12:37:48 foxbox kernel: [103815.516652]  ? task_done+0xb0/0xb0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516658]  spl_kthread_create+0x88/0xd0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516665]  taskq_thread_create+0x6a/0xf0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516671]  spl_taskq_expand+0xc0/0xd0 [spl]
Oct 25 12:37:48 foxbox kernel: [103815.516675]  cpuhp_invoke_callback+0x248/0x630
Oct 25 12:37:48 foxbox kernel: [103815.516678]  cpuhp_thread_fun+0xb8/0x120
Oct 25 12:37:48 foxbox kernel: [103815.516681]  smpboot_thread_fn+0xfc/0x1f0
Oct 25 12:37:48 foxbox kernel: [103815.516683]  ? sort_range+0x30/0x30
Oct 25 12:37:48 foxbox kernel: [103815.516685]  kthread+0x129/0x170
Oct 25 12:37:48 foxbox kernel: [103815.516686]  ? kthread_park+0x90/0x90
Oct 25 12:37:48 foxbox kernel: [103815.516690]  ret_from_fork+0x1f/0x30
Oct 25 12:37:48 foxbox kernel: [103815.517031] BUG: scheduling while atomic: cpuhp/1/15/0x00000000

... but that sort of stacktrace featuring ZFS was fairly common even without this branch (though the hang wasn't), so it may be a red herring / unrelated.

(I suspect that it's a bug to call any variation of wait_for_completion while atomic, and perhaps outside of a lowlatency kernel the opportunity wouldn't arise.)

@behlendorf
Contributor

@adamdmoss thanks for the additional testing. The issue you reported wasn't introduced by this change; I've gone ahead and opened PR #12696 with your stack trace to track it.

@tonyhutter
Contributor Author

@behlendorf thanks for taking a look, I'll make those changes.

@sempervictus
Contributor

@tonyhutter: Thanks for tagging old threads with this - email notifications are handy.
We build directly into the kernel, and run grsec or linux-hardened kernels, which will hopefully surface some additional misbehavior during testing. Any data dangers here, or is this purely execution semantics and ordering operations (haven't skimmed the code yet)?

The benchmarks show considerable system CPU usage increases in the non-O_DIRECT case. When MQ was first mainlined, we saw that sort of thing as well and ended up tracking it down to scheduler thrash for where things should be handled (related IOs want to be on the same CPU/queue, but it's not guaranteed to be submitted into the same queue IIRC). Is the threadpool fixed in your testing or dynamic? I may not need to do this anymore, but we boot older (like pre v7) xeons using spl_taskq_thread_dynamic=0 to deal with scheduling thrash on varying numbers of threads.

There is also a dm-crypt optimization which came about with faster SSD/NVMe storage which de-queues crypto operations and forces serialized execution in the same context. I think it came in around 5.10; Cloudflare wrote it IIRC. If a similar trick is feasible here, it might help reduce the gaps between O_DIRECT and non-direct, or possibly even help out with the system load if it is in fact being caused by queue scheduling contention.

Also, what's the impact of 100 RMW cycles on the benchmark - does it actually help execute other IOs while the slab allocator seeks a viable hole in the extent, or does the RMW penalty overtake the gains here? Unfortunately ZVOLs wear like race slicks :).

Thanks a ton for tackling this, looking forward to faster ZVOLs. Any thoughts on whether this would play nice with the async DMU idea Matt Macy was working on? Async dispatch across multiple queues sounds offhand like it would have cumulative benefits (so long as the async parts are truly evented instead of having to poll all those queues).

@DemiMarie

Test suite is failing.

@tonyhutter
Contributor Author

@sempervictus

> Any data dangers here, or is this purely execution semantics and ordering operations (haven't skimmed the code yet)?

There shouldn't be... In the end it's just passing BIOs to the zvol layer as before. Famous last words though...

> The benchmarks show considerable system CPU usage increases in the non-O_DIRECT case.

Correct, but in this case it's a good thing. It's showing that the system is utilizing more of the CPU cores to get the work done quicker. For example, in the 16-CPU benchmark it's basically burning ~18% more CPU cycles for 252% more throughput. There may be some scheduler thrash as we do have lots of queueing going on. And by that I mean each request gets queued twice - once in the kernel's block request_queue, and once in the zvol workqueue (see zvol_mq_queue_rq()). That sounds bad, but I don't think it's a big bottleneck since the requests themselves represent pretty big blocks of data.
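
To make that double-queueing concrete, here's a rough sketch of a ->queue_rq() handler that hands the already-dispatched request off to a driver workqueue. The names (zv_request, zvol_wq, zvol_process_rq) are illustrative and the flow is simplified, not the PR's exact code:

```c
#include <linux/blk-mq.h>
#include <linux/workqueue.h>

static struct workqueue_struct *zvol_wq;	/* assumed to be created at module init */

struct zv_request {
	struct work_struct	zvr_work;
	struct request		*zvr_rq;
};

static void zvol_process_rq(struct work_struct *work)
{
	struct zv_request *zvr = container_of(work, struct zv_request, zvr_work);

	/* ...translate the request's BIOs into zvol reads/writes here... */
	blk_mq_end_request(zvr->zvr_rq, BLK_STS_OK);
}

static blk_status_t zvol_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
    const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;
	struct zv_request *zvr = blk_mq_rq_to_pdu(rq);	/* per-request area */

	zvr->zvr_rq = rq;
	blk_mq_start_request(rq);

	/* Second queueing: the block layer already queued rq once to get here. */
	INIT_WORK(&zvr->zvr_work, zvol_process_rq);
	queue_work(zvol_wq, &zvr->zvr_work);

	return (BLK_STS_OK);
}
```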

> Also, what's the impact of 100 RMW cycles on the benchmark

I don't know, that would be interesting to see. I used parallel dd's in the benchmark since that's what the developer who originally reported the issue used.

> Any thoughts on whether this would play nice with the async DMU idea Matt Macy was working on?

I'm not familiar enough with that code to say one way or another. This PR doesn't change anything lower than the BIO layer, so I'm assuming it would be fine.

@DemiMarie yea for some reason it's not running the zvol_stress test I added. I'll look into that and will hopefully have a fix on my next push.

@samuelxhu

@tonyhutter This is a very interesting performance optimization and i would definitely give it a try in testing. Since versions 0.8.6 and 2.0.6 are widely used and seem quite stable now, could you please produce backport patches for 0.8.6 and 2.0.6?

@tonyhutter
Contributor Author

@samuelxhu this will either go into a future 2.1.x or 2.2.x release, depending on risk/benefit. 2.0.x is only for bugfixes and low-risk patches and 0.8.6 is no longer developed, so no plans to backport it.

@tonynguien
Contributor

It would be nice if the ZTS performance suite had zvol tests. Perhaps a hackathon project ;-)

@sempervictus
Contributor

@tonyhutter: i believe that there is a way for viable schedulers of a block device to be restricted such as to remove the "intelligently queuing" options from the available list. AFAIK, distros tend to default to things like CFQ for generic use-cases, which stack unfavorably with the IO scheduler underneath the ZVOL. I think there's also a way to set the default scheduler, but i haven't dug around in that part of the kernel in some time.

@bghira

bghira commented Nov 7, 2021

> @tonyhutter: i believe that there is a way for viable schedulers of a block device to be restricted such as to remove the "intelligently queuing" options from the available list. AFAIK, distros tend to default to things like CFQ for generic use-cases, which stack unfavorably with the IO scheduler underneath the ZVOL. I think there's also a way to set the default scheduler, but i haven't dug around in that part of the kernel in some time.

setting the default is for all block devices, unfortunately. ZFS used to set the noop scheduler on the vdevs but this mechanism was removed from the Linux kernel in torvalds/linux@85c0a03

@sempervictus
Contributor

> @tonyhutter: i believe that there is a way for viable schedulers of a block device to be restricted such as to remove the "intelligently queuing" options from the available list. AFAIK, distros tend to default to things like CFQ for generic use-cases, which stack unfavorably with the IO scheduler underneath the ZVOL. I think there's also a way to set the default scheduler, but i haven't dug around in that part of the kernel in some time.

> setting the default is for all block devices, unfortunately. ZFS used to set the noop scheduler on the vdevs but this mechanism was removed from the Linux kernel in torvalds/linux@85c0a03

To quote Homer Simpson: "doh!"
Thanks for the clarification @bghira

@tonyhutter
Contributor Author

@behlendorf my last push has all your changes included. I need to look into some zvol_stress errors on FreeBSD though:

01:10:32.14 zvol_stress2: (groupid=0, jobs=1): err=45 (file:engines/posixaio.c:180, func=xfer, error=Operation not supported): pid=98257: Tue Nov 9 01:10:32 2021

@sempervictus
Contributor

sempervictus commented Jan 16, 2022

I may have found an unpleasant bug.
Looks like the scheduling for ZVOLs falls down with this in-play if you are making thin-pools atop the ZVOLs. Commands just hang, kernel times out, bad news bears:

root        9321  0.0  0.0 1232172 11648 ?       Ssl  13:56   0:00              \_ serviced-storage create-thin-pool serviced -s 20G zenoss
root        9335  0.0  0.0  25956  8084 ?        D<   13:56   0:00                  \_ lvcreate -T -L 20G zenoss/serviced-pool

which produces

[  363.658041]  sda: sda1 sda9
[  366.167046] nfsd: last server has exited, flushing export cache
[  615.631009] INFO: task lvcreate:9335 blocked for more than 122 seconds.
[  615.635351]       Tainted: G                T  5.15.14 #1
[  615.637653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  615.641282] task:lvcreate        state:D stack:    0 pid: 9335 ppid:  9321 flags:0x00004000
[  615.641289] Call Trace:
[  615.641294]  <TASK>
[  615.641295]  [<ffffffff823c18b4>] ? __schedule+0xfb4/0x1ac0
[  615.641306]  [<ffffffff823c241a>] ? schedule+0x5a/0x130
[  615.641310]  [<ffffffff823c96fd>] ? schedule_timeout+0x11d/0x190
[  615.641314]  [<ffffffff81159b75>] ? __prepare_to_swait+0x85/0xc0
[  615.641319]  [<ffffffff823c089f>] ? io_schedule_timeout+0x6f/0xd0
[  615.641323]  [<ffffffff823c35fe>] ? wait_for_completion_io+0xbe/0x160
[  615.641329]  [<ffffffff849a9d53>] ? sync_io+0x123/0x190 [dm_mod]
[  615.641354]  [<ffffffff849a9fd5>] ? dm_io+0x205/0x2d0 [dm_mod]
[  615.641361]  [<ffffffff849a9440>] ? rap_hash.km_get_page+0x10/0x10 [dm_mod]
[  615.641367]  [<ffffffff849a8f40>] ? rap_hash.km_next_page+0x10/0x10 [dm_mod]
[  615.641376]  [<ffffffff846db312>] ? dm_bufio_write_dirty_buffers+0x4a2/0x500 [dm_bufio]
[  615.641382]  [<ffffffff846e86b1>] ? dm_bm_flush+0x51/0x60 [dm_persistent_data]
[  615.641402]  [<ffffffff846ef377>] ? dm_tm_pre_commit+0x67/0xa0 [dm_persistent_data]
[  615.641409]  [<ffffffff849d8181>] ? __create_persistent_data_objects+0x681/0xcb0 [dm_thin_pool]
[  615.641417]  [<ffffffff849d9324>] ? dm_pool_metadata_open+0xd4/0x2e0 [dm_thin_pool]
[  615.641424]  [<ffffffff849cf0db>] ? pool_ctr+0x6bb/0xf10 [dm_thin_pool]
[  615.641429]  [<ffffffff8499d3c7>] ? dm_table_add_target+0x197/0x470 [dm_mod]
[  615.641437]  [<ffffffff849a8033>] ? table_load+0x1e3/0x600 [dm_mod]
[  615.641444]  [<ffffffff849a4d61>] ? ctl_ioctl+0x271/0xa00 [dm_mod]
[  615.641451]  [<ffffffff849a7e50>] ? rap_hash.table_load+0x10/0x10 [dm_mod]
[  615.641460]  [<ffffffff849a551d>] ? dm_ctl_ioctl+0x1d/0x50 [dm_mod]
[  615.641466]  [<ffffffff8146e36d>] ? __x64_sys_ioctl+0x13d/0x180
[  615.641473]  [<ffffffff8239c6cc>] ? do_syscall_64+0x7c/0x100
[  615.641477]  [<ffffffff81001293>] ? entry_SYSCALL_64_after_hwframe+0x75/0x138
[  615.641481]  </TASK>

Some of that stack won't be reproducible by others - this is a 5.15 grsec kernel (hence the RAP hash calls), but it's the IO which never completes - not a failure of the call/ret site check.

Since we build ZFS into the kernel, i updated the code to default to using blk-mq for ZVOLs. To ensure that this is in fact the cause (it happened multiple times in a row), i booted with zfs.zvol_use_blk_mq=0, and now Chef is happily building the stack (i'm old school, sue me) long past the point where it got stuck creating those dm-thin pieces.

I think we need tests where we build other block constructs atop ZVOLs to test for scheduling interactions between layered blockdevs in-kernel.

@DemiMarie

Does ZFS do any memory allocations in the I/O path? It is possible for doing so to deadlock.

@sempervictus
Contributor

To add injury to insult, since thin pools are fragile things to begin with, this one got toasted.
The test target is a devops-built, hardened Zenoss Core deployment (it's a sort of PoC to showcase mechanisms for securing systems with access to everything in the IT infrastructure), which leverages ZFS for all of its Docker functionality and ServiceD storage. The iffy thing with serviced is that it insists, profusely, on using really fragile storage - dm-thin pools & legacy FS, basically asking for faceplants because the userspace stack can get very memory intensive (this failure is occurring well before said stack even exists; it's in the pre-provision phase setting up storage). It looks like it might be readily reproducible atop our Xena OpenStack environment, so i should be able to re-test down the line. For now, will have to mark this unsafe and keep it off on my personal host (our prod systems don't have this patch yet).

@DemiMarie

> To add injury to insult, since thin pools are fragile things to begin with, this one got toasted.

I take it that thin_repair wasn’t able to fix the problem? If there was a transaction ID mismatch, that may be fixable by hand-editing the metadata.

@sempervictus
Contributor

@DemiMarie - thin_repair found nothing wrong, but the XFS atop the volume wouldn't mount or even be detected as XFS by xfsfix. Whatever was written-out to the thin pool mangled the FS beyond recognition.

@tonyhutter
Contributor Author

@sempervictus Funny you should mention the lockup - I was just doing an "extract the linux kernel tarball" test on top of an ext4-formatted blk-mq zvol, and also hit lockups:

[1277479.103071] INFO: task ar:3226921 blocked for more than 120 seconds.
[1277479.118387] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1277479.126394] ar              D    0 3226921 3226752 0x00004080
[1277479.132324] Call Trace:
[1277479.134958]  __schedule+0x2c7/0x720
[1277479.138638]  ? wbt_exit+0x30/0x30
[1277479.142134]  ? __wbt_done+0x30/0x30
[1277479.145810]  ? rq_qos_wait+0xfc/0x180
[1277479.149655]  schedule+0x4d/0xc0
[1277479.152974]  io_schedule+0x12/0x40
[1277479.156565]  rq_qos_wait+0x101/0x180
[1277479.160325]  ? karma_partition+0x1e0/0x1e0
[1277479.164605]  ? wbt_exit+0x30/0x30
[1277479.168113]  wbt_wait+0x99/0xe0
[1277479.171445]  __rq_qos_throttle+0x23/0x30
[1277479.175554]  blk_mq_make_request+0x12d/0x5f0
[1277479.180006]  generic_make_request+0x2e9/0x350
[1277479.184548]  submit_bio+0x3c/0x160
[1277479.188155]  ext4_io_submit+0x48/0x60 [ext4]
[1277479.192621]  ext4_bio_write_page+0x17d/0x440 [ext4]
...

Discussion:

There have been essentially four versions (really, variations) of the blkmq code that I've tested over the course of this PR:

Ver 1: blk-mq with 'struct requests' on the workqueue (tonyhutter:blkmq-request)

This was the first iteration I submitted for this PR. It put each struct request (which contains one or more BIOs) onto the workqueue to be processed. So multiple struct requests could be processed in parallel in the background, but the BIOs within each request were processed sequentially.

Ver 2: blk-mq with BIOs on the workqueue (tonyhutter:blkmq)

This is the current branch in the PR. It is also the branch that is presently giving us lockups. It puts each BIO from each struct request into a workqueue. The idea here was to give better parallelism and rectify the bad random-RW performance @tonynguien was seeing. Unfortunately, it didn't fix the random-RW performance, and performed roughly the same as Ver 1. I believe the lockups we're seeing with this branch may be due to the gnarly request-finalizing logic I had to add in to get the last completed BIO in the request to call blk_mq_end_request().

Ver 3: blk-mq with the old-school "zvol threads" (tonyhutter:blkmq-oldschool)

This was an experimental branch that I put together to see if I could rectify the bad random-RW performance by using the existing zvol codepaths with blk-mq. It worked, but gave mostly the same (and sometimes worse) performance as the current, vanilla, non-blkmq zvol code.

Ver 4: blk-mq with 'struct requests' on the workqueue + "BIO merging" (tonyhutter:biomerge)

This is another experimental branch that takes Ver 1, and adds the ability to merge contiguous BIOs within a request into one big BIO. The idea being to reduce dbuf locking overhead and processing for contiguous RW. I've seen it merge requests with 300+ BIOs into one BIO, for example. It has given me the best performance so far for O_DIRECT sequential writes.

I think the way forward is Ver 4. Or more specifically, I will upload Ver 1 for this PR (which I'm hoping will fix the lockup), and then do a follow-on PR with the BIO merging in it to bring it up to Ver 4.
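
For context on the "request finalizing" mentioned under Ver 2: the usual pattern is a per-request in-flight counter where only the last completing BIO ends the blk-mq request. A minimal sketch of that pattern (illustrative names, not the PR's actual code):

```c
#include <linux/blk-mq.h>
#include <linux/atomic.h>

/*
 * Hedged sketch of per-request BIO accounting: each struct request's
 * driver data tracks how many of its BIOs are still in flight, and
 * only the last completion finishes the request. zv_rq_data and
 * zvol_bio_done are illustrative names.
 */
struct zv_rq_data {
	atomic_t	zvd_bios_outstanding;	/* set when BIOs are dispatched */
	blk_status_t	zvd_status;
};

static void zvol_bio_done(struct request *rq, blk_status_t status)
{
	struct zv_rq_data *zvd = blk_mq_rq_to_pdu(rq);

	if (status != BLK_STS_OK)
		zvd->zvd_status = status;	/* remember any error */

	/* Only the final BIO completion ends the blk-mq request. */
	if (atomic_dec_and_test(&zvd->zvd_bios_outstanding))
		blk_mq_end_request(rq, zvd->zvd_status);
}
```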

@sempervictus
Contributor

sempervictus commented Jan 19, 2022

@tonyhutter: i'm having a fair deal of deja vu from when @ryao yanked the old version of these block device mechanisms. One of the resulting follow-ups was to implement 5731140, with the following comment:

    Disable write merging on ZVOLs
    
    The current ZVOL implementation does not explicitly set merge
    options on ZVOL device queues, which results in the default merge
    behavior.
    
    Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the
    ZIO pipeline to do its work.
    
    Initial benchmarks (tiotest with no O_DIRECT) show random write
    performance going up almost 3X on 8K ZVOLs, even after significant
    rewrites of the logical space allocation.

so really, we're talking about reverting that when we use the MQ pipeline and leaving it as hacked together 5y ago when using the existing pipeline.

If the Linux-proper way to do this is to feed BIOs (a la v3) into the queues, is what we're seeing some sort of ordering issue? Is there some analog to blk_mq_end_request in other implementations like block storage servers or something which we could review to correct the implementation? If its actually alright to use struct request allowing iovec merging to happen in the upper (Linux) layer, then that sounds like it will more closely resemble what we had back when ZVOLs were (relatively) fast. Might be useful to find some of the PRs and issues from back then - guessing that code graveyard has some handy info for this effort.

EDIT: sorry about the nuisance nitpicking here - i was around the last time this codepath was significantly altered, and in the years since we have lost so much performance relative to the gains in backing storage that it seems like this is something "we must get right, this time" if ZVOLs are to stay relevant in modern compute environments (backing cloud storage for medical research, for example).
The original objective was to simplify the block IO codepath and leverage ZFS' own innards for dispatch and optimization of such, but i think that giving up the OS-native interfaces ended up being more costly in the end. When we have async dispatch and more hardware offload in those innards to decrease latency/improve throughput, we can reduce these queue depths to 1 to retain the OS interface while relying on ZFS for performance; but for now the IO merging and other "upper layer optimizations" are probably good band-aids to have.

@tonyhutter
Contributor Author

> so really, we're talking about reverting that when we use the MQ pipeline and leaving it as hacked together 5y ago when using the existing pipeline.

Right, we would revert QUEUE_FLAG_NOMERGES for the blkmq codepath only. We need to revert it to get BIO merging to work. zvol_use_blk_mq would basically be yet another knob to tune to your workload (yes, this is annoying). We might promote zvol_use_blk_mq to a per-zvol parameter if it makes sense to do so, but for now I think a module param is fine.

> If the Linux-proper way to do this is to feed BIOs (a la v3) into the queues, is what we're seeing some sort of ordering issue? Is there some analog to blk_mq_end_request in other implementations like block storage servers or something which we could review to correct the implementation?

I don't know if it's the ordering. I suspect the slightly worse performance for V3 may be due to the fact that V3 is basically "do it exactly the same as the old way but put the BIO in a blkmq queue first". So more overhead from queuing.

> If its actually alright to use struct request allowing iovec merging to happen in the upper (Linux) layer, then that sounds like it will more closely resemble what we had back when ZVOLs were (relatively) fast.

Unfortunately, there's no easy way to tell Linux to "merge the iovecs from the BIOs in this request". Maybe it did that in older kernels back in the day, but I don't see where it does it anymore (and I have looked). Note that for the "BIO merging" patch, we actually do merge the iovecs, in that we create a single UIO (which is ZFS's internal representation of a BIO) from multiple BIOs, and then put the BIO's iovecs into the UIO as an array.
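
A rough sketch of what that gathering step can look like on the Linux side (illustrative only - the real patch wraps the result in ZFS's uio rather than returning a bare bio_vec array):

```c
#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/slab.h>

/*
 * Hedged sketch of the "BIO merging" idea: collect the bio_vecs of every
 * BIO in a struct request into one array covering the request's whole
 * byte range, so it can be issued as a single logical I/O instead of one
 * I/O per BIO.
 */
static struct bio_vec *zvol_request_segments(struct request *rq,
    unsigned int *nr_segs)
{
	struct req_iterator iter;
	struct bio_vec bv, *segs, *s;
	unsigned int n = 0;

	/* First pass: count every segment of every BIO in the request. */
	rq_for_each_segment(bv, rq, iter)
		n++;

	segs = kmalloc_array(n, sizeof (*segs), GFP_NOIO);
	if (segs == NULL)
		return (NULL);

	/* Second pass: copy the segments out in submission order. */
	s = segs;
	rq_for_each_segment(bv, rq, iter)
		*s++ = bv;

	/*
	 * 'segs' now describes blk_rq_bytes(rq) bytes starting at byte
	 * offset blk_rq_pos(rq) << 9; a real implementation would hand
	 * this to the DMU as one read or write.
	 */
	*nr_segs = n;
	return (segs);
}
```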

> The original objective was to simplify the block IO codepath and leverage ZFS' own innards for dispatch and optimization of such, but i think that giving up the OS-native interfaces ended up being more costly in the end.

Yea, when all is said and done, we'll basically have:

zvol_use_blk_mq = 0: Use the old zvol code. Best for low-recordsize zvols with mostly small random RW workloads (databases).
zvol_use_blk_mq = 1: Use blkmq with BIO merging. Best for medium-large recordsize zvols with sequential reads/writes (filesystems on top of zvols?).

Alternatively we could get rid of zvol_use_blk_mq and hardcode something like: "if the zvol recordsize >= 16k then use blkmq, else use normal zvol threads". I'm not convinced this is a good idea though.

Also - QUEUE_FLAG_NOMERGES means "you can only have one BIO per request"; I've verified this in testing. There's another flag called QUEUE_FLAG_NOXMERGES for "no extended merges". That basically means "only do simple low-overhead merges, don't go out of your way to do a merge". QUEUE_FLAG_NOXMERGES is worth investigating, both for the blk-mq and non-blk-mq cases. It's dynamically settable in sysfs via nomerges as well (0 = merges enabled, 1 = NOXMERGES, 2 = NOMERGES).
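
For completeness, a tiny sketch of how those flags get toggled on a queue in-kernel, assuming a 4.17+ kernel that provides blk_queue_flag_set()/blk_queue_flag_clear() (older kernels spell this differently); the helper name is illustrative:

```c
#include <linux/blkdev.h>

/*
 * Hedged sketch: pick a merge policy for a zvol's request_queue.
 * QUEUE_FLAG_NOMERGES = one BIO per request; QUEUE_FLAG_NOXMERGES =
 * only cheap contiguous merges.
 */
static void zvol_set_merge_policy(struct request_queue *q, int nomerges)
{
	blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
	blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, q);

	if (nomerges == 2)		/* like 'echo 2 > .../queue/nomerges' */
		blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
	else if (nomerges == 1)		/* like 'echo 1 > .../queue/nomerges' */
		blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
}
```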

@sempervictus
Contributor

I can't quite tell how much of the force-push from 3d ago is this PR and how much is delta in master without a deep dive.
@tonyhutter - any significant changes worth a rebuild/retest cycle on my end?

@tonyhutter
Contributor Author

@sempervictus the latest push changes the PR to "Ver 1: blk-mq with 'struct requests' on the workqueue". Hopefully that will fix the crash we were seeing earlier, and also preps the code for the follow-up "BIO merge" PR.

@sempervictus
Contributor

Thank you sir. Should i wait to throw this into our kernel tree for the follow-up since we saw that the struct approach worked before?

@tonyhutter
Contributor Author

@sempervictus yes, feel free to give it a shot. It is not crashing on me anymore when I do my ext4-on-zvol test. Don't forget to enable blkmq before importing/creating your volumes: echo 1 > /sys/module/zfs/parameters/zvol_use_blk_mq

@tonyhutter
Contributor Author

Reviewers - this PR is passing ZTS and appears to be ready to go. Can you please take another look?

@sempervictus
Contributor

sempervictus commented Jan 31, 2022

I've been running the version from ~5d ago without issue.
I'll do another merge/build and kick off a day-long zloop

EDIT: looks like i am running the correct version, so will get a VM to run the tests for a bit

@behlendorf
Contributor

@tonyhutter can you go ahead and rebase this to resolve the kernel.org build issue.

@sempervictus
Contributor

No crashes yet - two identical VMs on grsec 5.15.17 (with RAP and all the goodies) running, one with MQ on and one with it off. I might need new nvmes at some point with all this abuse 😄, but it looks to be working.
Moving into "functional" testing where i will replicate the BIO-submission crash conditions i saw before - building Zenoss atop our Arch base with ZFS backing the stores.

@sempervictus
Contributor

@tonyhutter - when copying in-tree, but building as a module (normally i build ZFS into the main binary), we hit this little gem:

FATAL: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol 'blk_queue_write_cache'
make[4]: *** [scripts/Makefile.modpost:125: modules-only.symvers] Error 1
make[3]: *** [Makefile:1413: modules] Error 2
make[3]: *** Waiting for unfinished jobs....

^^ 5.10.96

@tonyhutter
Contributor Author

Hmm.. that might be a general build bug then - my PR doesn't do anything with that function. Maybe try re-running ./autogen.sh && ./configure. Interestingly, my zfs_config.h also shows that function as GPL'd but it builds fine as a module:

/* blk_queue_write_cache() exists */
#define HAVE_BLK_QUEUE_WRITE_CACHE 1

/* blk_queue_write_cache() is GPL-only */
#define HAVE_BLK_QUEUE_WRITE_CACHE_GPL_ONLY 1

Add support for the kernel's block multiqueue (blk-mq) interface in
the zvol block driver.  blk-mq creates multiple request queues on
different CPUs rather than having a single request queue.  This can
improve zvol performance with multithreaded reads/writes.

This implementation uses the blk-mq interfaces on 4.13 or newer
kernels.  Building against older kernels will fall back to the
older BIO interfaces.

Note that you must set the `zvol_use_blk_mq` module param to
enable the blk-mq API.  It is disabled by default.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue openzfs#12483
@tonyhutter
Contributor Author

I'm closing this PR in favor of #13148, which has both the blk-mq changes and the "BIO merging" that we've been talking about. I originally thought it would be better to split it into two PRs, but after looking at the new code it doesn't really make sense to do it that way.

@tonyhutter closed this Feb 24, 2022