Reduce overhead in zvol_write/zvol_read via existing hold #4316

ryao · 2016-02-06T01:51:51Z

Our zvol_write function had severely diverged with illumos-gate in a
manner that caused multiple regressions:

We did dnode_hold() on each IO when we already had a hold.
We would attempt to send writes that exceeded DMU_MAX_ACCESS to the
DMU.
We could call zil_commit() twice. In this case, this is because
Linux uses the ->write function to send flushes and can aggregate the
flush with a write. If a synchronous write occurred with the flush, we
effectively flushed twice when there is no need to do that.

Signed-off-by: Richard Yao ryao@gentoo.org

ryao · 2016-02-06T03:00:04Z

I probably should elaborate on FLUSH+FUA semantics. I suspect the correct interpretation is that FLUSH applies to data that preceded this IO while FUA applies to the IO itself. Since the ZIL records are a linked list, we have an ordering such that we implicitly protect data that preceeded this IO on power failure without allowing this IO to appear. If the transaction group closes while we are still writing a large write, everything prior would be committed to disk while the new write would be partially written, provided that it is being broken into multiple transactions. Since we do not guarantee anything made it to disk at all until we return, we only need to call zil_commit once.

ryao · 2016-02-06T04:14:06Z

I noticed that zvol_read() suffers from the same problems, so I pushed a couple of commits to fix it. There are two commits rather than one because illumos-gate has the same bug where it uses dmu_read_uio() instead of dmu_read_uio_dbuf() too.

dracwyrm · 2016-02-06T09:02:23Z

module/zfs/zvol.c


+		error = dmu_read_uio_dbuf(zv->zv_dbuf, ZVOL_OBJ, &uio, bytes);


Don't need ZVOL_OBJ as per function:
dmu_read_uio_dbuf(dmu_buf_t *zdb, uio_t *uio, uint64_t size)

error = dmu_read_uio_dbuf(zv->zv_dbuf, &uio, bytes);
That should be correct.

Thanks. I corrected that typo and repushed. :)

ryao · 2016-02-08T21:32:39Z

@behlendorf I would like to squash this PR into 1 or 2 patches and rewrite the commit message(s) before any of this is merged. I say 1 or 2 because I am undecided on whether I should break out the Call dmu_read_uio_dbuf() in zvol_read() patch that is applicable to illumos-gate.

behlendorf · 2016-02-17T19:34:22Z

I doubt this will make any significant impact to performance and the bugs addressed aren't harmful but it's a useful bit of cleanup. It nicely builds on the work done by @tuxoko which allowed bio_vec's to be safely passed in a uio_t. @tuxoko this LGTM to be but I'd appreciate it if you could please review this too.

behlendorf · 2016-02-17T19:34:49Z

@ryao let's keep the patches separate as you've done.

ryao · 2016-02-17T19:39:20Z

@behlendorf At this point, I think further improvements in performance will involve micro-optimizations of this nature. There really is no obvious bottleneck for latency or source of CPU usage, although it would be nice to make further reductions in both categories.

dweeezil · 2016-02-17T19:55:53Z

Since I was using this patch while working on #4265, I figured I'd post here that this patch worked perfectly fine for me and seems to be a nice bit of cleanup. I never did measure its performance impact, but as has been pointed out, it's likely to be fairly small.

In illumos-gate, `zvol_read` and `zvol_write` are both passed uio_t rather than bio_t. Since we are translating from bio to uio for both, we might as well unify the logic and have code more similar to its illumos counterpart. At the same time, we can fix some regressions that occurred versus the original code from illumos-gate. We refactor zvol_write to take uio and also correct the following problems: 1. We did `dnode_hold()` on each IO when we already had a hold. 2. We would attempt to send writes that exceeded `DMU_MAX_ACCESS` to the DMU. 3. We could call `zil_commit()` twice. In this case, this is because Linux uses the `->write` function to send flushes and can aggregate the flush with a write. If a synchronous write occurred with the flush, we effectively flushed twice when there is no need to do that. zvol_read also suffers from the first two problems. Other platforms suffer from the first, so we leave that for a second patch so that there is a discrete patch for them to cherry-pick. Signed-off-by: Richard Yao <ryao@gentoo.org>

The difference between `dmu_read_uio()` and `dmu_read_uio_dbuf()` is that the former takes a hold while the latter uses an existing hold. `zfs_read()` in the ZPL will use `dmu_read_uio_dbuf()` while our analogous `zvol_write()` will use `dmu_write_uio_dbuf()`, but for no apparent reason, we inherited a `zvol_read()` function from OpenSolaris that does `dmu_read_uio()`. illumos-gate also still uses `dmu_read_uio()` to this day. Lets switch to `dmu_read_uio_dbuf()`, which is more performant. Signed-off-by: Richard Yao <ryao@gentoo.org>

ryao · 2016-02-17T20:08:55Z

@tuxoko I have updated this to address your comments on the !! operator that you made on b7a908a.

As I said on the comment, your suggestion on breaking at a block boundary is an improvement that I would prefer to leave for another patch because the first one is partly intended to allow improvements that apply to both ZoL and illumos-gate to be easily cherry-picked and your suggestion applies to code specific to illumos-gate. Also, I am not sure how much that matters in practice because we are breaking up IO on a 32MB boundary and IOs of such size are very rare. It is probably an improvement that we can do separately when one of us has some time to spend working it out. We also do not need to break on the block boundary as we only need to break on the record boundary. After doing that, 32MB chunks will fill the buffers every time without any need to perform reads.

tuxoko · 2016-02-17T20:13:57Z

@ryao
Right, 32MB should be very unlikely.

tuxoko · 2016-02-17T20:20:57Z

@behlendorf
This LGTM.

The difference between `dmu_read_uio()` and `dmu_read_uio_dbuf()` is that the former takes a hold while the latter uses an existing hold. `zfs_read()` in the ZPL will use `dmu_read_uio_dbuf()` while our analogous `zvol_write()` will use `dmu_write_uio_dbuf()`, but for no apparent reason, we inherited a `zvol_read()` function from OpenSolaris that does `dmu_read_uio()`. illumos-gate also still uses `dmu_read_uio()` to this day. Lets switch to `dmu_read_uio_dbuf()`, which is more performant. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #4316

behlendorf · 2016-02-18T18:30:15Z

Merged as:

19a47cb Call dmu_read_uio_dbuf() in zvol_read()
a765a34 Clean up zvol request processing to pass uio and fix porting regressions

ryao mentioned this pull request Feb 6, 2016

ZVOLs 3X slower than recordsets at same recordsize/volblocksize for sequential writes #4265

Closed

ryao force-pushed the zvol branch 6 times, most recently from 633e7d7 to 785d8a0 Compare February 6, 2016 02:23

ryao force-pushed the zvol branch from 4cbad68 to 46dbc30 Compare February 6, 2016 04:48

dracwyrm reviewed Feb 6, 2016
View reviewed changes

ryao force-pushed the zvol branch from 46dbc30 to 6758989 Compare February 6, 2016 14:35

ryao changed the title ~~Resynchronize zvol_write with illumos-gate~~ Reduce overhead in zvol_write/zvol_read via existing hold Feb 6, 2016

ryao force-pushed the zvol branch from 5c9298f to 0213a82 Compare February 6, 2016 21:29

dracwyrm mentioned this pull request Feb 8, 2016

for testing only - dbuf hash mutexes to dbuf hash rwlocks #4317

Closed

ryao force-pushed the zvol branch from 0213a82 to 637f7b1 Compare February 8, 2016 20:56

ryao force-pushed the zvol branch 2 times, most recently from 7be5e80 to 4f47945 Compare February 13, 2016 04:23

ryao added 2 commits February 17, 2016 15:05

ryao force-pushed the zvol branch from 4f47945 to 216e3ab Compare February 17, 2016 20:05

behlendorf closed this in a765a34 Feb 18, 2016

kernelOfTruth mentioned this pull request Sep 26, 2017

zfs 0.6.5 and write performance problems #4512

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce overhead in zvol_write/zvol_read via existing hold #4316

Reduce overhead in zvol_write/zvol_read via existing hold #4316

ryao commented Feb 6, 2016

ryao commented Feb 6, 2016

ryao commented Feb 6, 2016

dracwyrm Feb 6, 2016

ryao Feb 6, 2016

ryao commented Feb 8, 2016

behlendorf commented Feb 17, 2016

behlendorf commented Feb 17, 2016

ryao commented Feb 17, 2016

dweeezil commented Feb 17, 2016

ryao commented Feb 17, 2016

tuxoko commented Feb 17, 2016

tuxoko commented Feb 17, 2016

behlendorf commented Feb 18, 2016


		error = dmu_read_uio_dbuf(zv->zv_dbuf, ZVOL_OBJ, &uio, bytes);

Reduce overhead in zvol_write/zvol_read via existing hold #4316

Reduce overhead in zvol_write/zvol_read via existing hold #4316

Conversation

ryao commented Feb 6, 2016

ryao commented Feb 6, 2016

ryao commented Feb 6, 2016

dracwyrm Feb 6, 2016

Choose a reason for hiding this comment

ryao Feb 6, 2016

Choose a reason for hiding this comment

ryao commented Feb 8, 2016

behlendorf commented Feb 17, 2016

behlendorf commented Feb 17, 2016

ryao commented Feb 17, 2016

dweeezil commented Feb 17, 2016

ryao commented Feb 17, 2016

tuxoko commented Feb 17, 2016

tuxoko commented Feb 17, 2016

behlendorf commented Feb 18, 2016