
add async dmu support #10377

Closed

Conversation

mattmacy
Contributor

@mattmacy mattmacy commented May 27, 2020

This supersedes #10303 and #10317

The async DMU is exposed to user space by way of zvol and vnops, and can be exercised via aio on FreeBSD and io_uring on Linux.

The only user-visible change for existing code is the removal of DMU_READ_NO_PREFETCH and the replacement of DMU_READ_PREFETCH and DMU_READ_NO_DECRYPT with the corresponding dmu_ctx_flag_t flags:

typedef enum {
       DMU_CTX_FLAG_READ       = 1 << 1,
       DMU_CTX_FLAG_UIO        = 1 << 2,
       DMU_CTX_FLAG_PREFETCH   = 1 << 3,
       DMU_CTX_FLAG_NO_HOLD    = 1 << 4,
       DMU_CTX_FLAG_SUN_PAGES  = 1 << 5,
       DMU_CTX_FLAG_NOFILL     = 1 << 6,
       DMU_CTX_FLAG_ASYNC      = 1 << 7,
       DMU_CTX_FLAG_NODECRYPT  = 1 << 8,
       DMU_CTX_WRITER_FLAGS    = DMU_CTX_FLAG_SUN_PAGES,
       DMU_CTX_READER_FLAGS    = DMU_CTX_FLAG_PREFETCH
} dmu_ctx_flag_t;

There are two new data structures:

  • dmu_ctx_t maintains DMU context during operations.
  • dmu_buf_set_t maintains references to dbufs and is used for dbuf read completions.

For most purposes the new functions of interest are:

typedef void (*dmu_ctx_cb_t)(struct dmu_ctx *);
typedef uint64_t (*dmu_buf_transfer_cb_t)(struct dmu_buf_set *, dmu_buf_t *,
    uint64_t, uint64_t);


/* Initialize dmu context */
int dmu_ctx_init(dmu_ctx_t *dc, struct dnode *dn, objset_t *os,
    uint64_t object, uint64_t offset, uint64_t size, void *data_buf, void *tag,
    dmu_ctx_flag_t flags);

/* execute operation */
int dmu_issue(dmu_ctx_t *dc);

/* release local hold on dmu context */
void dmu_ctx_rele(dmu_ctx_t *dc);

/*
 *  Set completion function to be called when _all_ operations have been 
 *  completed.
 */
void dmu_ctx_set_complete_cb(dmu_ctx_t *dc, dmu_ctx_cb_t cb);

/*
 * Set custom data transfer operation. See zvol, spa_checkpoint, dmu_redact,
 * and dmu_read_pages for examples.
 */
void dmu_ctx_set_buf_set_transfer_cb(dmu_ctx_t *dc, dmu_buf_set_cb_t cb);
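
For illustration, here is a minimal sketch of how a caller might drive an asynchronous read through this API. Only the functions and flags listed above come from the patch; the example_* names, the NULL dnode argument, and the assumption that the caller's local hold can be dropped right after dmu_issue() are illustrative.

/*
 * Minimal sketch (not the patch's code).  The dmu_ctx_t is caller-owned and
 * must remain valid until the completion callback runs, so in practice it
 * would be embedded in a longer-lived request structure.
 */
static void
example_read_done(dmu_ctx_t *dc)
{
	/* All dbuf reads for this context have completed; notify the waiter. */
}

static int
example_read_async(dmu_ctx_t *dc, objset_t *os, uint64_t object,
    uint64_t offset, uint64_t size, void *buf, void *tag)
{
	int err;

	/* Passing a NULL dnode and resolving the object via os is assumed. */
	err = dmu_ctx_init(dc, NULL, os, object, offset, size, buf, tag,
	    DMU_CTX_FLAG_READ | DMU_CTX_FLAG_PREFETCH | DMU_CTX_FLAG_ASYNC);
	if (err != 0)
		return (err);

	dmu_ctx_set_complete_cb(dc, example_read_done);

	err = dmu_issue(dc);	/* returns without waiting for the reads */
	dmu_ctx_rele(dc);	/* drop the caller's local hold; the complete
				 * callback fires when all I/O is done */
	return (err);
}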

Two helper functions have been added to simplify the most straightforward case.

/*
 * Takes an uninitialized dmu context and callback. The callback will be
 * called when the read completes.
 */
int dmu_read_async(dmu_ctx_t *dc, objset_t *os, uint64_t object,
    uint64_t offset, uint64_t size, void *buf, uint32_t flags,
    dmu_ctx_cb_t done_cb);

/*
 * Takes an uninitialized dmu context and callback. The callback will be
 * called when the write completes.
 */
int dmu_write_async(dmu_ctx_t *dc, objset_t *os, uint64_t object,
    uint64_t offset, uint64_t size, void *buf, dmu_tx_t *tx,
    dmu_ctx_cb_t done_cb);

These two functions are used to exercise the async completion logic in ztest.
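
For illustration, a caller that needs to block until the async read finishes could embed the dmu_ctx_t in a small state structure and signal a condvar from the completion callback. A hypothetical sketch follows; the names, the flags value, and the hold/release discipline around dmu_read_async() are all assumptions, not taken from the patch.

/*
 * Hypothetical wrapper; srs_dc must be the first member so the callback can
 * recover the wrapper from the dmu_ctx_t pointer it is handed.
 */
typedef struct sync_read_state {
	dmu_ctx_t	srs_dc;
	kmutex_t	srs_lock;
	kcondvar_t	srs_cv;
	boolean_t	srs_done;
} sync_read_state_t;

static void
sync_read_done(dmu_ctx_t *dc)
{
	sync_read_state_t *srs = (sync_read_state_t *)dc;

	mutex_enter(&srs->srs_lock);
	srs->srs_done = B_TRUE;
	cv_broadcast(&srs->srs_cv);
	mutex_exit(&srs->srs_lock);
}

static int
sync_read(objset_t *os, uint64_t object, uint64_t offset, uint64_t size,
    void *buf)
{
	sync_read_state_t srs = { 0 };
	int err;

	mutex_init(&srs.srs_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&srs.srs_cv, NULL, CV_DEFAULT, NULL);

	err = dmu_read_async(&srs.srs_dc, os, object, offset, size, buf,
	    DMU_CTX_FLAG_PREFETCH /* flags value assumed */, sync_read_done);
	if (err == 0) {
		mutex_enter(&srs.srs_lock);
		while (!srs.srs_done)
			cv_wait(&srs.srs_cv, &srs.srs_lock);
		mutex_exit(&srs.srs_lock);
	}

	cv_destroy(&srs.srs_cv);
	mutex_destroy(&srs.srs_lock);
	return (err);
}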

Reads and writes in zvol on both FreeBSD and Linux no longer require a dedicated thread for async semantics. This could later be extended to support delete.

Platforms with VFS interfaces that accept completion routines can use the new interfaces to eliminate the need for a dedicated kernel thread to support POSIX AIO on ZFS.

Some further implementation details:
Internally, dmu_buf_set_t is used for managing state and tracking completions. The dbs_holds field counts the pending read operations that must complete before the data transfer callback can be executed. Completions are handled by dmu_buf_set_rele: when dbs_holds drops to zero, it calls the data transfer callback. The initial setup for this occurs in dbuf_hold_impl. When dbuf_hold_impl is called from dmu_buf_set_setup_buffers, we now also pass it a dmu_buf_set_t. If the dbuf in question is not DB_CACHED, the buf set is added to a list attached to the dbuf so that dmu_buf_set_rele is called at read completion time; otherwise the hold is released immediately.
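
A rough sketch of that accounting (illustrative only; aside from dbs_holds and dmu_buf_set_rele, the field names, the counter type, and the cleanup step are assumptions):

static void
dmu_buf_set_rele_sketch(dmu_buf_set_t *dbs)
{
	/*
	 * dbs_holds is treated here as a plain atomic counter for clarity;
	 * the patch may use zfs_refcount_t and additional locking.
	 */
	if (atomic_dec_64_nv(&dbs->dbs_holds) == 0) {
		/* Every dbuf in the set is DB_CACHED: run the transfer. */
		dbs->dbs_dc->dc_buf_transfer_cb(dbs);	/* assumed names */
		/* ...then drop this buf set's hold on its dmu context. */
	}
}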

This change further extends the SpectraLogic work by generalizing the dbuf_read completion callbacks and deferring their processing until the lock has been dropped to avoid cases of recursive deadlocks.

NB: dmu_ctx_setup_tx still issues synchronous reads and blocks on the write throttle. So only reads are truly asynchronous.

2020.06.09
An async variant of zil_commit, zil_commit_async, allows the async paths in zvol to avoid blocking on sync.

A non-blocking variant of zfs_rangelock_enter, zfs_rangelock_tryenter, allows async code paths in zvol to avoid blocking on rangelock acquisition. The callback passed in will resume the operation when the range becomes available. To avoid asynchronous waiters starving synchronous waiters or vice versa, all deferred handling has been consolidated into a single list, guaranteeing FCFS behavior. All waiters acquire their desired locked range before being signaled or having their callbacks executed (wakeups are simply handled as a callback on a stack-local kcondvar).
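
A sketch of how an async zvol path might use this; the zfs_rangelock_tryenter() signature shown (the usual zfs_rangelock_enter() arguments plus a callback and argument) and its NULL-on-defer return are assumptions based on the description above, and the example_* names are hypothetical.

typedef struct example_zvol_request {
	zvol_state_t	*zr_zv;		/* the zvol being written */
	uint64_t	zr_offset;
	uint64_t	zr_size;
	/* ... bio/uio pointers and async completion state ... */
} example_zvol_request_t;

static void example_zvol_write_issue(example_zvol_request_t *zr);

static void
example_zvol_write_resume(void *arg)
{
	example_zvol_request_t *zr = arg;

	/* The locked range is now held on our behalf; continue the I/O. */
	example_zvol_write_issue(zr);
}

static void
example_zvol_write(example_zvol_request_t *zr)
{
	zfs_locked_range_t *lr;

	lr = zfs_rangelock_tryenter(&zr->zr_zv->zv_rangelock, zr->zr_offset,
	    zr->zr_size, RL_WRITER, example_zvol_write_resume, zr);
	if (lr != NULL)
		example_zvol_write_issue(zr);	/* range granted immediately */
	/* else: example_zvol_write_resume() runs, FCFS, once the range frees */
}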

Co-authored-by: Will Andrews wca@FreeBSD.org
Co-authored-by: Matt Macy mmacy@FreeBSD.org
Signed-off-by: Matt Macy mmacy@FreeBSD.org

Motivation and Context

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@codecov-commenter

codecov-commenter commented May 27, 2020

Codecov Report

Merging #10377 (65b7360) into master (84268b0) will decrease coverage by 0.15%.
The diff coverage is 73.56%.


@@            Coverage Diff             @@
##           master   #10377      +/-   ##
==========================================
- Coverage   76.82%   76.67%   -0.16%     
==========================================
  Files         400      400              
  Lines      128040   129298    +1258     
==========================================
+ Hits        98366    99133     +767     
- Misses      29674    30165     +491     
Flag Coverage Δ
kernel 80.51% <80.95%> (-0.15%) ⬇️
user 47.82% <39.11%> (-0.12%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
cmd/ztest/ztest.c 6.62% <0.00%> (-0.62%) ⬇️
include/os/linux/spl/sys/uio.h 100.00% <ø> (ø)
include/os/linux/spl/sys/vmsystm.h 91.66% <ø> (ø)
include/sys/dbuf.h 100.00% <ø> (ø)
lib/libspl/include/sys/uio.h 0.00% <ø> (ø)
module/os/linux/zfs/zfs_acl.c 57.37% <ø> (+0.19%) ⬆️
module/zfs/arc.c 86.45% <ø> (+0.10%) ⬆️
module/zfs/bpobj.c 90.34% <ø> (+0.53%) ⬆️
module/zfs/bptree.c 88.54% <ø> (ø)
module/zfs/dsl_bookmark.c 89.02% <ø> (+0.60%) ⬆️
... and 107 more


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 84268b0...fbf26c2.

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch from 4a6a99e to 7fbf8ae Compare May 27, 2020 07:24
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label May 27, 2020
@adamdmoss
Contributor

adamdmoss commented May 27, 2020

I'm getting plenty of these quite easily with this branch (7fbf8ae):

[  464.662155] VERIFY((dc->dc_flags & DMU_CTX_FLAG_ASYNC) || zfs_refcount_count(&dc->dc_holds) == 1) failed
[  464.662156] VERIFY((dc->dc_flags & DMU_CTX_FLAG_ASYNC) || zfs_refcount_count(&dc->dc_holds) == 1) failed
[  464.662159] PANIC at dmu.c:928:dmu_issue()
[  464.662160] Showing stack for process 10308
[  464.662163] CPU: 2 PID: 10308 Comm: ccache Tainted: P           OE     5.3.0-51-generic #44~18.04.2-Ubuntu
[  464.662164] PANIC at dmu.c:928:dmu_issue()
[  464.662166] Showing stack for process 10187
[  464.662167] Hardware name: Gigabyte Technology Co., Ltd. Z68MA-D2H-B3/Z68MA-D2H-B3, BIOS F10 02/23/2012
[  464.662167] Call Trace:
[  464.662175]  dump_stack+0x6d/0x95
[  464.662184]  spl_dumpstack+0x29/0x2b [spl]
[  464.662189]  spl_panic+0xd3/0xfb [spl]
[  464.662192]  ? _cond_resched+0x19/0x40
[  464.662193]  ? mutex_lock+0x12/0x40
[  464.662197]  ? cv_wait_common+0xd3/0x130 [spl]
[  464.662200]  ? wait_woken+0x80/0x80
[  464.662204]  ? __cv_wait+0x15/0x20 [spl]
[  464.662253]  ? dmu_buf_set_process_io+0xe8/0x140 [zfs]
[  464.662292]  dmu_issue+0x133/0x150 [zfs]
[  464.662331]  dmu_read_impl+0x5c/0x90 [zfs]
[  464.662370]  ? ddt_zap_create+0x80/0x80 [zfs]
[  464.662408]  ? dmu_buf_write_uio+0x70/0x70 [zfs]
[  464.662446]  ? dmu_buf_write_uio+0x70/0x70 [zfs]
[  464.662485]  dmu_read_uio_dbuf+0x50/0x70 [zfs]
[  464.662542]  zfs_read+0x12d/0x4b0 [zfs]
[  464.662545]  ? path_openat+0x329/0x1700
[  464.662601]  zpl_read_common_iovec+0x97/0xe0 [zfs]
[  464.662657]  zpl_iter_read+0xfd/0x170 [zfs]
[  464.662660]  new_sync_read+0x122/0x1b0
[  464.662662]  __vfs_read+0x29/0x40
[  464.662663]  vfs_read+0x8e/0x130
[  464.662665]  ksys_read+0xa7/0xe0
[  464.662667]  __x64_sys_read+0x1a/0x20
[  464.662669]  do_syscall_64+0x5a/0x130
[  464.662671]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  464.662673] RIP: 0033:0x7fd9b684c34e

@mattmacy
Contributor Author

I'm getting plenty of these quite easily with this branch

@adamdmoss Yup. It appears that in the refactoring I lost a completion path. It shows up on all platforms in the CI so it should be easy to reproduce.

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch from 7fbf8ae to 151e9b3 Compare May 28, 2020 02:42
@mattmacy
Contributor Author

I'm getting plenty of these quite easily with this branch

I moved the dbuf completion callback execution out from underneath the dbuf lock to avoid holding it across copyout and potentially recursing. This created a race because the db_changed cv would get signalled before the refcount was dropped. I added a cv to indicate buf set completion to plug the race where it was hit.

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch from 151e9b3 to 3f85a69 Compare May 28, 2020 06:44
@adamdmoss
Contributor

Cheers, definitely looking much better @ 3f85a69

@adamdmoss
Contributor

Curiously this appears to have significantly raised latency on l2arc hits in my usual bevy of casual tests. I'll verify this properly later.

@adamdmoss
Contributor

adamdmoss commented May 29, 2020

Confirmed, alternating with/without this patchset over 8 reboots I see this strong pattern (l2arc is not a significant factor):

Wallclock times without patchset (trunk):
Operation A: 0:09.71 elapsed (sys:1.71 user:0.05)
Operation B: 0:17.64 elapsed (sys:4.42 user:0.06)

Wallclock times with patchset (projects/async_dmu_baseline-pr):
Operation A: 0:18.92 elapsed (sys:2.13 user:0.08)
Operation B: 0:26.32 elapsed (sys:2.72 user:0.06)

A = reading 15GB set of files from NVME, not backed by l2arc
B = reading 15GB set of files from spinning media backed by entirely warmed l2arc-on-NVME

(FYI it's a set of 523 files with a variety of sizes, dominated by ~12GB of files in the 1-3GB range. The files are spread across 65 directories. The files are read with a script which does
exec /usr/bin/time --format '%E elapsed (sys:%S user:%U)' find -H "$@" -type f -execdir cat {} + >/dev/null which is a utility I typically use to pre-warm the l2arc)

Curiously, this branch's read bandwidth according to iotop is as good as - or better than - trunk's, so I can't explain the massive difference in wall-clock time - this is why I called it 'raised latency' rather than 'lowered bandwidth' but I could be off-mark

Dumb speculation:

  • Serialization of reads? (serialization of meta vs data?)
  • Read-ahead has broken?
  • Accidentally issuing every read twice? 😁

Linux version 5.3.0-51-generic (buildd@lgw01-amd64-018) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #44~18.04.2-Ubuntu SMP Thu Apr 23 14:27:18 UTC 2020

Let me know if I can provide further info.

@mattmacy
Contributor Author

Curiously, this branch's read bandwidth according to iotop is as good as - or better than - trunk's, so I can't explain the massive difference in wall-clock time - this is why I called it 'raised latency' rather than 'lowered bandwidth' but I could be off-mark

Dumb speculation:

* Serialization of reads?  (serialization of meta vs data?)

* Read-ahead has broken?

* Accidentally issuing every read twice? grin

Linux version 5.3.0-51-generic (buildd@lgw01-amd64-018) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #44~18.04.2-Ubuntu SMP Thu Apr 23 14:27:18 UTC 2020

Let me know if I can provide further info.

@adamdmoss My last bug fix introduced a second cv_wait into the synchronous read path. I'm currently testing a change that eliminates the additional wait. It would be helpful if you could trim down the test case in such a way that I could reproduce this in a 12-thread VM with file-backed pools.

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch from 3f85a69 to bcec065 Compare May 30, 2020 22:13
@mattmacy
Contributor Author

@adamdmoss bcec065 limits us to one sleep/wakeup per synchronous read. I see a 1:17 reduction in run time for the pool_checkpoint test sub-suite (18:39 -> 17:22), a 6.9% reduction, for a set of tests that have a large write and CPU-bound user-level component. Could you see how much benefit it confers on your end?

@mattmacy
Contributor Author

@adamdmoss My last bug fix introduced a second cv_wait into the synchronous read path. I'm currently testing a change that eliminates the additional wait. It would be helpful if you could trim down the test case in such a way that I could reproduce this in a 12-thread VM with file-backed pools.

Never mind. I see this even with a 20G pool in virtualbox. I have ZoF and FreeBSD git checkouts on it using 3.1G:

Filesystem      Size    Used   Avail Capacity  Mounted on
/dev/ada0s1a     37G     19G     15G    55%    /
devfs           1.0K    1.0K      0B   100%    /dev
pool1            19G    3.1G     16G    16%    /pool1

I did cd /; doas zpool export pool1; doas kldunload openzfs; doas kldload openzfs; doas zpool import pool1; cd /pool1/matt; time find . -type f | xargs cat >& /dev/null; 3 times for each branch version. My recent change doesn't appear to have improved performance at all. I'll have to figure out what is adding latency to cat.

(cp / find breakdown is separate)

async_dmu_baseline-1
0.01s user 2.82s system 5% cpu 50.533 total
0.03s user 13.87s system 19% cpu 1:11.37 total

0.01s user 2.91s system 5% cpu 49.982 total
0.05s user 12.82s system 19% cpu 1:07.37 total

0.00s user 2.87s system 5% cpu 56.030 total
0.05s user 13.81s system 18% cpu 1:13.57 total

master:
0.01s user 2.98s system 5% cpu 54.535 total
0.02s user 12.49s system 20% cpu 1:02.24 total

0.00s user 2.90s system 5% cpu 53.340 total
0.03s user 12.67s system 20% cpu 1:00.91 total

0.01s user 2.88s system 5% cpu 56.434 total
0.01s user 12.86s system 20% cpu 1:02.96 total

async_dmu_baseline
0.01s user 3.14s system 5% cpu 55.543 total
0.02s user 14.27s system 19% cpu 1:14.79 total

0.00s user 2.83s system 5% cpu 52.013 total
0.05s user 13.86s system 19% cpu 1:11.57 total

0.01s user 2.84s system 5% cpu 54.376 total
0.02s user 14.40s system 18% cpu 1:15.98 total

@mattmacy
Contributor Author

master: milliseconds off-cpu

kernel`mi_switch+0xd7
kernel`sleepq_timedwait+0x2f
kernel`_cv_timedwait_sbt+0x177
openzfs.ko`zio_wait+0x30e
openzfs.ko`dmu_buf_hold_array_by_dnode+0x20b
openzfs.ko`dmu_read_uio_dnode+0x37
openzfs.ko`dmu_read_uio_dbuf+0x3b
openzfs.ko`zfs_freebsd_read+0x52f
kernel`VOP_READ_APV+0x75
kernel`vn_read+0x124
kernel`vn_io_fault_doio+0x43
kernel`vn_io_fault1+0x15c
kernel`vn_io_fault+0x186
kernel`dofileread+0x95
kernel`sys_read+0xc0
kernel`amd64_syscall+0x399
kernel`0xffffffff8108fef0
cat
55572

async_dmu_baseline:

kernel`mi_switch+0xd7
kernel`sleepq_timedwait+0x2f
kernel`_cv_timedwait_sbt+0x177
openzfs.ko`zio_wait+0x30e
openzfs.ko`dmu_buf_set_process_io+0x2f
openzfs.ko`dmu_issue+0x72
openzfs.ko`dmu_read_uio_dnode+0x51
openzfs.ko`dmu_read_uio_dbuf+0x3b
openzfs.ko`zfs_freebsd_read+0x52f
kernel`VOP_READ_APV+0x75
kernel`vn_read+0x124
kernel`vn_io_fault_doio+0x43
kernel`vn_io_fault1+0x15c
kernel`vn_io_fault+0x186
kernel`dofileread+0x95
kernel`sys_read+0xc0
kernel`amd64_syscall+0x399
kernel`0xffffffff8108fef0
cat
64692

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch 2 times, most recently from c1f2c61 to 5d29fed Compare May 31, 2020 02:05
@mattmacy
Contributor Author

I lost the implied prefetch in VFS ops with the API flag conversion.
With prefetching enabled I now see:

kernel`mi_switch+0xd7
kernel`sleepq_timedwait+0x2f
kernel`_cv_timedwait_sbt+0x177
openzfs.ko`zio_wait+0x30e
openzfs.ko`dmu_buf_set_process_io+0x2f
openzfs.ko`dmu_issue+0x72
openzfs.ko`dmu_read_uio_dnode+0x51
openzfs.ko`dmu_read_uio_dbuf+0x3b
openzfs.ko`zfs_freebsd_read+0x52f
kernel`VOP_READ_APV+0x75
kernel`vn_read+0x124
kernel`vn_io_fault_doio+0x43
kernel`vn_io_fault1+0x15c
kernel`vn_io_fault+0x186
kernel`dofileread+0x95
kernel`sys_read+0xc0
kernel`amd64_syscall+0x399
kernel`0xffffffff8108fef0
cat
53771

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch from 5d29fed to 04f6220 Compare May 31, 2020 03:05
@adamdmoss
Contributor

Great - I'll give the latest version a try-out.

@adamdmoss
Contributor

Confirmed - read performance is back within normal ranges. Thanks for investigating!

@adamdmoss
Contributor

FYI module/os/freebsd/zfs/zvol_os.c has considerable conflicts with master now - I couldn't resolve them with shallow knowledge. :)

@ghost

ghost commented Jun 6, 2020

FYI module/os/freebsd/zfs/zvol_os.c has considerable conflicts with master now - I couldn't resolve them with shallow knowledge. :)

In zfsonfreebsd:testing/async_dmu_baseline I've rebased/resolved that file.

@adamdmoss
Contributor

Thanks Ryan. When merging that version to master, it appears conflict-free and it builds okay, but it creates a DKMS module which doesn't build okay:

...
   CC [M]  /var/lib/dkms/zfs/0.8.0/build/module/zfs/../os/linux/zfs/zpl_super.o
   CC [M]  /var/lib/dkms/zfs/0.8.0/build/module/zfs/../os/linux/zfs/zpl_xattr.o
   CC [M]  /var/lib/dkms/zfs/0.8.0/build/module/zfs/../os/linux/zfs/zvol_os.o
   CC [M]  /var/lib/dkms/zfs/0.8.0/build/module/zfs/vdev_raidz_math_ssse3.o
   CC [M]  /var/lib/dkms/zfs/0.8.0/build/module/zfs/vdev_raidz_math_sse2.o
 /var/lib/dkms/zfs/0.8.0/build/module/zfs/../os/linux/zfs/zvol_os.c: In function ‘zvol_strategy_dmu_done’:
 /var/lib/dkms/zfs/0.8.0/build/module/zfs/../os/linux/zfs/zvol_os.c:333:48: error: ‘struct zvol_state_os’ has no member named ‘zvo_kstat’
    dataset_kstats_update_read_kstats(&zv->zv_zso->zvo_kstat, len);
                                                 ^~
 /var/lib/dkms/zfs/0.8.0/build/module/zfs/../os/linux/zfs/zvol_os.c:336:49: error: ‘struct zvol_state_os’ has no member named ‘zvo_kstat’
    dataset_kstats_update_write_kstats(&zv->zv_zso->zvo_kstat, len);
                                                  ^~
 scripts/Makefile.build:288: recipe for target '/var/lib/dkms/zfs/0.8.0/build/module/zfs/../os/linux/zfs/zvol_os.o' failed
 make[5]: *** [/var/lib/dkms/zfs/0.8.0/build/module/zfs/../os/linux/zfs/zvol_os.o] Error 1
 make[5]: *** Waiting for unfinished jobs....
 scripts/Makefile.build:519: recipe for target '/var/lib/dkms/zfs/0.8.0/build/module/zfs' failed
 make[4]: *** [/var/lib/dkms/zfs/0.8.0/build/module/zfs] Error 2
 Makefile:1656: recipe for target '_module_/var/lib/dkms/zfs/0.8.0/build/module' failed
 make[3]: *** [_module_/var/lib/dkms/zfs/0.8.0/build/module] Error 2
 make[3]: Leaving directory '/usr/src/linux-headers-5.3.0-51-generic'
 Makefile:38: recipe for target 'modules-Linux' failed
 make[2]: *** [modules-Linux] Error 2
 make[2]: Leaving directory '/var/lib/dkms/zfs/0.8.0/build/module'
 Makefile:844: recipe for target 'all-recursive' failed
 make[1]: *** [all-recursive] Error 1
 make[1]: Leaving directory '/var/lib/dkms/zfs/0.8.0/build'
 Makefile:705: recipe for target 'all' failed
 make: *** [all] Error 2
DKMSKernelVersion: 5.3.0-51-generic
Date: Sat Jun  6 17:21:16 2020
DuplicateSignature: dkms:zfs-dkms:(not installed):/var/lib/dkms/zfs/0.8.0/build/module/zfs/../os/linux/zfs/zvol_os.c:333:48: error: ‘struct zvol_state_os’ has no member named ‘zvo_kstat’
Package: zfs-dkms (not installed)
PackageVersion: (not installed)
SourcePackage: zfs-linux
Title: zfs-dkms (not installed): zfs kernel module failed to build

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch 2 times, most recently from be33c31 to 262ffb2 Compare June 8, 2020 19:58
@adamdmoss
Contributor

Verified that zfsonfreebsd:projects/async_dmu_baseline-pr now merges w/master and builds cleanly on Linux w/DKMS.
Thanks!

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch from 262ffb2 to 7e92fb0 Compare June 9, 2020 03:04
@mattmacy
Contributor Author

That's a superset of these changes. This code does not have the change that caused the problem mentioned there.

Can you explain which code caused the corruption?

Async dmu adds reference counting for dbufs, with the completion routine being called when the count goes to zero. CFA refactored dbuf_fill_done such that there's a branch where a completion routine must not be called if a dbuf has a pending read. This "nuance" was lost when refactoring the code in that routine for some locking changes that were needed because further enhancements make reads of indirect blocks async.

This is the fix:
zfsonfreebsd@4070d55

@sempervictus
Contributor

So to clarify, this should be safe to use against 2.0.1?

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch from 65b7360 to affe4bd Compare January 25, 2021 02:05
@mattmacy
Contributor Author

So to clarify, this should be safe to use against 2.0.1?

Should, yes.

@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch 3 times, most recently from 8076f48 to 8813d4d Compare January 28, 2021 21:47
- Integrate with zvol and vnops

- Add asynchronous rangelock acquisition.
 o Make rangelock acquisition strictly FCFS while
   eliminating potential recursion when executing
   acquisition callbacks
 o Add debug support for dumping held rangelocks.
- Add async read write to ztest
- add taskq ctor and dtor callbacks
- Add arc_watch support to FreeBSD kmod.

Initially based on old code from SpectraLogic.

Co-authored-by: Will Andrews <wca@FreeBSD.org>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
@mattmacy mattmacy force-pushed the projects/async_dmu_baseline-pr branch from 8813d4d to fbf26c2 Compare February 3, 2021 21:09
@adamdmoss
Contributor

FWIW I've been using this PR (again!) relative to master for the last few weeks day-to-day on my personal/dev linux box. That's not exactly a strenuous test but it appears to have behaved well so far.

@sempervictus
Contributor

@mattmacy - with all the force-pushes, this has become a little bit confusing to merge back atop the 2.0.2 tag. Any chance you have the whole commit history in a branch somewhere so I could rebase into our 2.0.2 branch?

@mattmacy
Contributor Author

mattmacy commented Feb 4, 2021

@mattmacy - with all the force-pushes, this has become a little bit confusing to merge back atop the 2.0.2 tag. Any chance you have the whole commit history in a branch somewhere so I could rebase into our 2.0.2 branch?

@sempervictus - I'm sorry, no I haven't. I've just been folding in changes as I fix issues exposed in the CI. It's very difficult to maintain any amount of history when regularly rebasing against master.

@adamdmoss
Contributor

I've been using this PR for yet another month, though it's becoming increasingly laborious to reconcile with master. Still seems good, but the PR is still marked as Work-In-Progress; is there really more to do or is it ready for review?

@ghost

ghost commented Mar 19, 2021

Closing at the request of the author.

@ghost ghost closed this Mar 19, 2021
@sempervictus
Contributor

@freqlabs: is this superseded or are there issues/concerns with the implementation?

@ghost

ghost commented Mar 19, 2021

@sempervictus neither. @mattmacy has taken a position at a different company and is no longer working on ZFS or FreeBSD. We hope someone will pick up this work and move it across the finish line.

@adamdmoss
Contributor

That presupposes that it's remotely clear what its state was and what it'd take to move it across the finish line; goodness knows I've asked.

@scineram

I think it did ship in TrueNAS, but there were bugs.

@sempervictus
Contributor

If anyone here can give me a quote to wrap this up and get it into a current stable tag (or at least compatible with it once it's in master), I'm willing to consider paying a qualified community member to complete the work... ZFS performance issues are taking it out of the mainstream.

@sempervictus
Contributor

@behlendorf: this looks like this useful functionality about to be lost in the churn. Is there any way to pin this as "useful but currently unmaintained" to avoid losing all of the work and tag it for pickup in the PR queue by contributors?

@jumbi77
Contributor

jumbi77 commented Mar 21, 2021

@behlendorf: this looks like this useful functionality about to be lost in the churn. Is there any way to pin this as "useful but currently unmaintained" to avoid losing all of the work and tag it for pickup in the PR queue by contributors?

In fact that's a nice idea - either pin closed PRs or maybe create a new project or pinned discussion and link some closed/unmerged PRs. That way it's easier for people/devs to pick up some unfinished work. E.g. PR #10308, PR #10943, PR #10377, PR #5095, etc. (just to note some PRs; I have a bigger, complete list).

@behlendorf behlendorf added Status: Inactive Not being actively updated and removed Status: Work in Progress Not yet ready for general review labels Mar 22, 2021
@behlendorf
Contributor

It would be nice to more easily track these PRs which haven't been merged but which we do want to see completed. For the moment I've tagged it as "Status: Inactive", which we've used before for this kind of thing. We could certainly either create a new tag or project if that's preferable and we have a concise list of these kinds of tasks, or even just curate what's currently tagged as inactive a bit more.

@sempervictus
Contributor

Trying to revive this in my branch atop 2.1-rc5 and it's not looking too great. Master has already gone quite far ahead, and there's per-commit digging required to see how some macros and constants changed. Plus it looks like there are new taskq semantics in the dmu.
ZFS performance is god-awful on modern hardware, made further worse by waits in synchronous pipelines up and down the stack, so letting this die on the vine seems a terrible waste.

sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
Rebase 2.1 rc6 atop fbf26c2 (openzfs#10377), including updates for:
  668115f98f1
  e330514ad08
  ece24c1
The rebase was executed skipping the following commits to permit
testing while requesting assistance from appropriate contributors:
  64e0fe1 - ping @amotin for assistance
  e439ee8 - ping @behlendorf for assistance
  336bb3662b - ping @amotin for assistance

Testing:
  Built into 5.10.41-grsec (with grsec 2.1 ZFS patch applied).
  Zloop execution for 4h with no crashes.
  FIO and bonnie++ tests in a VM against zvol over a loopback file
inside a qcow2 atop a zpool (on 2.1 without this) on an nvme drive.
  FIO runs atop 3 ~1GB/s ceph pool's RBDs in a raidz as an 8k
block size ZVOL.
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request Jun 1, 2021
Rebase 2.1 rc6 atop fbf26c2 (openzfs#10377), including updates for:
  668115f98f1
  e330514ad08
  ece24c1
The rebase was executed skipping the following commits to permit
testing while requesting assistance from appropriate contributors:
  64e0fe1 - ping @amotin for assistance
  e439ee8 - ping @behlendorf for assistance
  336bb3662b - ping @amotin for assistance

DO NOT MERGE THIS - IT IS A DIFF OF A REBASE WHICH HAS SKIPPED
COMMITS, the commits above *MUST* be resolved before this can be
applied to a current branch.

Testing:
  Built into 5.10.41-grsec (with grsec 2.1 ZFS patch applied).
  Zloop execution for 4h with no crashes.
  FIO and bonnie++ tests in a VM against zvol over a loopback file
inside a qcow2 atop a zpool (on 2.1 without this) on an nvme drive.
  FIO runs atop 3 ~1GB/s ceph pool's RBDs in a raidz as an 8k
block size ZVOL.
This pull request was closed.
Labels
Status: Inactive Not being actively updated Type: Feature Feature request or new feature

7 participants