
WIP: Forward-port async dmu support to 2.1 #12166

Open
wants to merge 1 commit into zfs-2.1-release

Conversation

sempervictus
Contributor

Motivation and Context

This work was being implemented by @mattmacy prior to his departure for another venture earlier this year. It should breathe new life into ZFS now that the world has been running on NVMe for several years and NVMe-oF appliances are becoming the standard datacenter block-storage paradigm: shallow but wide queue arrangements for parallel IOPS.
This is not a complete forward-port: as noted below, I had to skip 3 commits and am requesting their authors'/contributors' assistance.
This PR should not be merged until those are resolved, as it would otherwise introduce 3 regressions into the codebase.

Description

Rebase 2.1 rc6 atop fbf26c2 (#10377), including updates for:
668115f98f1
e330514ad08
ece24c1
The rebase was executed skipping the following commits to permit
testing while requesting assistance from appropriate contributors:
64e0fe1 - ping @amotin for assistance
e439ee8 - ping @behlendorf for assistance
336bb3662b - ping @amotin for assistance

How Has This Been Tested?

  • Built into 5.10.41-grsec (with grsec 2.1 ZFS patch applied).
  • Zloop execution for 4h with no crashes.
  • FIO and bonnie++ tests in a VM against a zvol, over a loopback file
    inside a qcow2 atop a zpool (running 2.1 without this change) on an NVMe drive.
  • FIO runs against an 8k block size ZVOL on a raidz of 3 RBDs from a
    ~1 GB/s Ceph pool.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@sempervictus
Contributor Author

sempervictus commented May 31, 2021

Sorry, buildbots - this was like throwing screws into the garbage disposal. Until we can get those 3 commits addressed and force-push this entire branch as one commit atop master, I don't think the buildbots will be able to make sense of this (I could turn it into one gross diff off master so that it builds, if that's preferred).

Rebase 2.1 rc6 atop fbf26c2 (openzfs#10377), including updates for:
  668115f98f1
  e330514ad08
  ece24c1
The rebase was executed skipping the following commits to permit
testing while requesting assistance from appropriate contributors:
  64e0fe1 - ping @amotin for assistance
  e439ee8 - ping @behlendorf for assistance
  336bb3662b - ping @amotin for assistance

DO NOT MERGE THIS - IT IS A DIFF OF A REBASE WHICH HAS SKIPPED
COMMITS, the commits above *MUST* be resolved before this can be
applied to a current branch.

Testing:
  Built into 5.10.41-grsec (with grsec 2.1 ZFS patch applied).
  Zloop execution for 4h with no crashes.
  FIO and bonnie++ tests in a VM against zvol over a loopback file
inside a qcow2 atop a zpool (on 2.1 without this) on an nvme drive.
  FIO runs atop 3 ~1GB/s ceph pool's RBDs in a raidz as an 8k
block size ZVOL.
@sempervictus force-pushed the feature/forward-port_async_dmu branch from f44cc93 to dd3f3fc on June 1, 2021 at 18:48
@sempervictus changed the base branch from master to zfs-2.1-release on June 1, 2021 at 18:57
@sempervictus
Contributor Author

Well, thankfully the grsec patches are rather unforgiving and caught this little gem:

[   84.124042][  T221] BUG: unable to handle page fault for address: ffffffffffffff12
[   84.129136][  T221] RIP: 0010:[<ffffffff8169090b>] dmu_tx_commit+0x4b/0x200
[   84.132901][  T221] Code: 8d 77 10 49 39 c6 0f 84 de 00 00 00 48 8b 57 08 48 8b 5f 10 48 29 d3 0f 84 cd 00 00 00 65 48 8b 04 25 00 1d 01 00 48 89 04 24 <4c> 8b 7b 18 4d 85 ff 0f 84 a1 00 00 00 4d 8d af 28 01 00 00 4c 89
[   84.143766][  T221] #PF: supervisor read access in kernel mode
[   84.147007][  T221] #PF: error_code(0x0000) - not-present page
[   84.150184][  T221] PGD ffff888002d3cff8 2f34063 P4D ffff888002d3cff8 2f34063 PUD ffff888002f34ff8 2fb8063 PMD ffff888002fb8ff8 0 
[   84.156472][  T221] Oops: 0000 [#1] SMP
[   84.158559][  T221] CPU: 20 PID: 221 Comm: zvol Tainted: G                T  5.10.41 #1
[   84.162985][  T221] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1.1 04/01/2014
[   84.168127][  T221] RIP: 0010:[<ffffffff8169090b>] dmu_tx_commit+0x4b/0x200
[   84.171863][  T221] Code: 8d 77 10 49 39 c6 0f 84 de 00 00 00 48 8b 57 08 48 8b 5f 10 48 29 d3 0f 84 cd 00 00 00 65 48 8b 04 25 00 1d 01 00 48 89 04 24 <4c> 8b 7b 18 4d 85 ff 0f 84 a1 00 00 00 4d 8d af 28 01 00 00 4c 89
[   84.182559][  T221] RSP: 0000:ffffc900006c7c88 EFLAGS: 00010296
[   84.185846][  T221] RAX: ffff888121e41780 RBX: fffffffffffffefa RCX: 0000000000000000
[   84.190071][  T221] RDX: 0000000000000008 RSI: ffff888120ae5300 RDI: ffff888120ae5300
[   84.194283][  T221] RBP: ffff888120ae5300 R08: 0000000000000000 R09: 0000000000000000
[   84.198483][  T221] R10: ffff888121e41780 R11: ffff8881a8c36000 R12: caa6f536e94204e3
[   84.202681][  T221] R13: ffff888133eb3600 R14: ffff888120ae5310 R15: ffff888180d69800
[   84.206917][  T221] RAX: task_struct+0x0/0xbc0
...
[   84.227644][  T221] R10: task_struct+0x0/0xbc0
[   84.230082][  T221] R11: zio_buf_comb_8192+0x0/0x2000
...
[   84.248985][  T221] FS:  0000000000000000(0000) GS:ffff88842f600000(0000) knlGS:0000000000000000
[   84.253691][  T221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   84.257228][  T221] CR2: ffffffffffffff12 CR3: 0000000002d3c000 CR4: 0000000000160ef0 shadow CR4: 0000000000160ef0
[   84.262999][  T221] Stack:
[   84.264500][  T221]  ffff888121e41780 35590ac968c1a5fe ffff888207850000 35590ac968243f6e
[   84.268797][  T221]  ffff888133eb3600 ffff888120ae5300 ffff888180d69800 ffffffff8183a11d
[   84.273131][  T221]  ffff888133eb3600 ffff888207850000 ffff888133eb3600 35590ac968244e96
[   84.277478][  T221] Call Trace:
[   84.279391][  T221]  [<ffffffff8183a11d>] ? zvol_dmu_buf_set_transfer_write+0x6d/0xf0
[   84.283677][  T221]  [<ffffffff81663b8d>] ? dmu_buf_set_ready+0x18d/0x1e0
[   84.288094][  T221]  [<ffffffff81664a75>] ? dmu_issue+0x305/0x3a0
[   84.291425][  T221]  [<ffffffff81837bc5>] ? zvol_dmu_issue+0x35/0x90
[   84.295070][  T221]  [<ffffffff8188f295>] ? zvol_strategy+0x185/0x330
[   84.298710][  T221]  [<ffffffff8156d54d>] ? taskq_thread+0x4fd/0x9e0
[   84.302679][  T221]  [<ffffffff8115e8d0>] ? wake_up_q+0xd0/0xd0
[   84.305987][  T221]  [<ffffffff8188f110>] ? zvol_strategy_dmu_done+0x1a0/0x1a0
[   84.309875][  T221]  [<ffffffff811421cd>] ? kthread+0x21d/0x290
[   84.313106][  T221]  [<ffffffff8156d050>] ? param_set_taskq_kick+0x210/0x210
[   84.316831][  T221]  [<ffffffff81141fb0>] ? kthread_park+0x120/0x120
[   84.320182][  T221]  [<ffffffff81007cdd>] ? ret_from_fork+0xc5/0xd8
[   84.323552][  T221] Modules linked in: xfs libcrc32c sb_edac edac_core kvm_intel cirrus hid_generic drm_kms_helper kvm mousedev cec usbhid irqbypass syscopyarea sysfillrect crct10dif_pclmul psmouse input_leds sysimgblt evdev ghash_clmulni_intel hid fb_sys_fops i2c_piix4 cirrusfb uio_pdrv_genirq led_class uio tiny_power_button qemu_fw_cfg button sch_fq_codel drm fuse ip_tables x_tables ext4 crc32c_generic crc16 jbd2 mbcache virtio_net virtio_rng net_failover virtio_blk virtio_balloon rng_core failover ata_generic pata_acpi uhci_hcd ehci_pci ehci_hcd atkbd crc32_pclmul libps2 crc32c_intel aesni_intel ata_piix glue_helper usbcore crypto_simd libata cryptd i8042 usb_common virtio_pci floppy serio
[   84.359041][  T221] CR2: ffffffffffffff12 0xffffffffffffff12
[   84.362138][  T221] ---[ end trace 4e1f464a7828dde8 ]---
[   84.365037][  T221] RIP: 0010:[<ffffffff8169090b>] dmu_tx_commit+0x4b/0x200
[   84.368769][  T221] Code: 8d 77 10 49 39 c6 0f 84 de 00 00 00 48 8b 57 08 48 8b 5f 10 48 29 d3 0f 84 cd 00 00 00 65 48 8b 04 25 00 1d 01 00 48 89 04 24 <4c> 8b 7b 18 4d 85 ff 0f 84 a1 00 00 00 4d 8d af 28 01 00 00 4c 89
[   84.379399][  T221] RSP: 0000:ffffc900006c7c88 EFLAGS: 00010296
[   84.382627][  T221] RAX: ffff888121e41780 RBX: fffffffffffffefa RCX: 0000000000000000
[   84.386843][  T221] RDX: 0000000000000008 RSI: ffff888120ae5300 RDI: ffff888120ae5300
[   84.391328][  T221] RBP: ffff888120ae5300 R08: 0000000000000000 R09: 0000000000000000
[   84.395510][  T221] R10: ffff888121e41780 R11: ffff8881a8c36000 R12: caa6f536e94204e3
[   84.399707][  T221] R13: ffff888133eb3600 R14: ffff888120ae5310 R15: ffff888180d69800
[   84.404220][  T221] RAX: task_struct+0x0/0xbc0
[   84.406581][  T221] RSI: ...
[   84.425041][  T221] R10: task_struct+0x0/0xbc0
[   84.427468][  T221] R11: zio_buf_comb_8192+0x0/0x2000
...
[   84.448283][  T221] FS:  0000000000000000(0000) GS:ffff88842f600000(0000) knlGS:0000000000000000
[   84.453062][  T221] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   84.456545][  T221] CR2: ffffffffffffff12 CR3: 0000000002d3c000 CR4: 0000000000160ef0 shadow CR4: 0000000000160ef0
[   84.462303][  T221] Kernel panic - not syncing: grsec: halting the system due to suspicious kernel crash caused by root
[   84.468510][  T221] Kernel Offset: disabled
[   84.468871][  T221] Rebooting in 120 seconds..

which the wizard himself suggested may be caused by

+	/* Readers and writers with a context transaction do not apply. */
+	if ((dmu_ctx->dc_flags & DMU_CTX_FLAG_READ) || dmu_ctx->dc_tx != NULL)
+		return (0);

because dmu_tx_create() can't fail, meaning that *txp = dmu_tx_create(dmu_ctx->dc_os); always sets *txp, and (dbs->dbs_dc->dc_tx ? dbs->dbs_dc->dc_tx : dbs->dbs_tx) won't pass a NULL up either.
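
For debugging, something like the following could at least make the tx resolution explicit and trip a clean assertion instead of letting dmu_tx_commit() chase a bad pointer (a rough sketch only: the dbs_tx/dbs_dc/dc_tx field names are taken from the snippets above and may not match the branch exactly, and a VERIFY on NULL obviously won't catch a garbage non-NULL value like the RBX in the oops above):

	/* Hypothetical helper: resolve the tx this buf set is about to commit. */
	static inline dmu_tx_t *
	dmu_buf_set_resolve_tx(dmu_buf_set_t *dbs)
	{
		dmu_tx_t *tx = (dbs->dbs_dc->dc_tx != NULL) ?
		    dbs->dbs_dc->dc_tx : dbs->dbs_tx;

		/* Fail loudly here rather than faulting inside dmu_tx_commit(). */
		VERIFY3P(tx, !=, NULL);
		return (tx);
	}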

Going to dig into it some more this evening, but really hoping that maintainers take notice and weigh in, given that I'm probably not the brightest bulb in the shed when it comes to this stuff, and bothering smarter people who otherwise spend their time making the world more secure doesn't seem proper either (ping @behlendorf).

@sempervictus
Contributor Author

Is it possible for the dmu_buf_set_t being passed into zvol_dmu_buf_set_transfer_write to have a NULL ->dbs_tx? If dmu_buf_set_tx doesn't see a dc_tx set, it falls back to the dbs_tx; but if that isn't set yet either, then we may be issuing the async transfer and logging the write while being unable to dmu_tx_commit the NULL *tx.
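
If so, a bit of throwaway instrumentation in that path should confirm it quickly (sketch only, reusing the field names above; zfs_dbgmsg is just one convenient place to log it):

	/* Hypothetical debug check: did a write-side buf set arrive with no tx at all? */
	if (dbs->dbs_dc->dc_tx == NULL && dbs->dbs_tx == NULL)
		zfs_dbgmsg("dmu_buf_set %p hit transfer_write with no tx",
		    (void *)dbs);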

@sempervictus
Contributor Author

The passing tests are concerning, but then again, we might not be testing "the right thing."
Reproducer for that crash:

  1. Created a 100G zpool atop a Ceph RBD volume (Pacific, pure-SSD pool on BlueStore) in an OpenStack instance
  2. Created a 100G ZVOL in the zpool with no refreservation
  3. Created an XFS filesystem directly atop the ZVOL and mounted it at /utest
  4. Ran FIO benchmarks from here

Spender says that the NULL pointer deref here should be caught by upstream kernels as well, so others should not need a grsecurity/PaX-patched kernel to reproduce it. It happens pretty consistently for me on this branch, but it did take that ~1 GB/s underlying block device before it started happening more or less reliably, so it's probably best to test against NVMe backing stores or fast distributed flash.

@sempervictus
Contributor Author

@amotin: do you happen to have some time to take a look at this, both the BSD patches it impacts and the DMU prefetch bit commit? I think I have some mismatch with upstream, so the commit hash isn't displaying properly in the description, but that's one of the major ones I had to skip while trying to get this working.

@sempervictus
Contributor Author

Ping @mmaybee - I'm guessing @ahrens tagged you due to your knowledge of the code paths being changed? Any thoughts on that crash?

@mmaybee
Contributor

mmaybee commented Jun 11, 2021

@sempervictus Heh, I was assigned to this PR because I have volunteered to help Brian out. I do have quite a bit of knowledge about the dmu and zvol code paths, but the crash above falls within the new code path introduced in this PR, which I have no experience with. If you can reproduce this issue reliably, I suggest you start adding some debug to try to pinpoint what is happening.

@sempervictus
Contributor Author

@mmaybee - thanks for the clarification. Roger, wilco

@sempervictus
Contributor Author

Actually, @mmaybee, would you be able to take a look at the commits I had to skip for now?

@DemiMarie

I can’t help, but I would really like to see this merged. Hopefully it can get in!

@sempervictus
Contributor Author

There's a bunch of work that needs to happen for this to merge in, and unfortunately I'm still far too heads-down to even parse out the laggards and the merge issues accumulated since then, much less start tackling them in any organized manner. I am however open to bounty payouts to devs who do have cycles and can knock this out.

@scineram

scineram commented Nov 4, 2021

I am however open to bounty payouts to devs who do have cycles and can knock this out.

Have you tried contacting @wca, AFAIK the original author of these commits, to see if he is available?

@sempervictus
Contributor Author

@scineram: my understanding is that the author of the branch from which I pulled this moved to another company and stopped work on it at that point. That said, if @wca is the actual originator, or if anyone can help fill in the provenance of this code so that we could try to find someone familiar enough to finish it, it would be great to have that data memorialized here.

@DemiMarie

@sempervictus my hope is that once the async DMU and CoW changes are in, there will be no in-ZFS blocking for I/O under any normal situation, and if userspace uses io_uring, it will not be blocking either. That is, all ZFS operations should be async all the way to the driver, VFS permitting.

@jittygitty

A bounty was mentioned in this thread, so if you're interested in crowd-funding ZFS, please see #13397.

@DemiMarie

Which regressions would merging this introduce?

@sempervictus
Contributor Author

Which regressions would merging this introduce?

It shouldn't build at all, will probably eat your data and your pets if it does, should be considered tire-fire-ware until a competent party has looked into it and marked-up where and what needs changing.

@DemiMarie

Which regressions would merging this introduce?

It shouldn't build at all, will probably eat your data and your pets if it does, should be considered tire-fire-ware until a competent party has looked into it and marked-up where and what needs changing.

I wonder if iXsystems would be willing/able to hire someone to do the work.

@amotin
Member

amotin commented Jan 4, 2023

Which regressions would merging this introduce?

It shouldn't build at all, will probably eat your data and your pets if it does, should be considered tire-fire-ware until a competent party has looked into it and marked-up where and what needs changing.

I wonder if iXsystems would be willing/able to hire someone to do the work.

We have no particular plans at this time. Considering how much it complicates the ZFS code, I'm personally not sure I'd like to see it in.

@DemiMarie

Which regressions would merging this introduce?

It shouldn't build at all, will probably eat your data and your pets if it does, should be considered tire-fire-ware until a competent party has looked into it and marked-up where and what needs changing.

I wonder if iXsystems would be willing/able to hire someone to do the work.

We have no particular plans at this time. Considering how much it complicates the ZFS code, I'm personally not sure I'd like to see it in.

How big are the potential performance wins? Could enough of it be merged to at least undo the zvol performance regression? (#8472 and #11407, among others).

@amotin
Member

amotin commented Jan 4, 2023

How big are the potential performance wins? Could enough of it be merged to at least undo the zvol performance regression? (#8472 and #11407, among others).

We are going to look closer at Linux ZVOL performance.

@sempervictus
Contributor Author

@amotin - I think it's inevitable: high-performance IO stacks are async these days, with massive parallelism and shallow queues. If we don't adopt something like this (I take your point about complexity; every time I dig into this PR my headache gets worse), we will become the modern version of tape drives - just archival storage (and S3 will win that fight anyway because "it's easy" and inherently distributed in its various formats). We have to work at the speed of modern storage, because people pay a lot for that speed and they won't or can't give up that investment: the performance of large operations dictates the cost of operations rather directly in the cloud, and on-prem as well once the costs are fully broken down. The time ZFS spends blocked and waiting on operations is literally money to cloud users, and reason enough for their consumers to demand something faster.

If we stick to the synchronous approach, we simply can't support real-world workloads at the performance levels their users expect. Take iSCSI backed by ZVOLs, for example: you can get passable performance from a new ZVOL today, but after you fill its extents even once, it degrades horrifically and becomes functionally unusable for many consumers while ZFS figures out whether it can place data into a hole within the ZVOL's extent. If that calculation is done asynchronously, in the background, then the "front end" can service other requests, unless the IO issued to the zvol was intentionally synchronous as part of some serialized set of actions.

@DemiMarie

I take your point about complexity; every time I dig into this PR my headache gets worse

Is the solution to port parts of ZFS to another language? Rust has great async support and is already supported in the Linux kernel. I see no reason why FreeBSD and illumos could not also add support.

Labels
Status: Work in Progress (not yet ready for general review)

7 participants