
Poor zvol discard performance. #6728

Closed
ab-oe opened this issue Oct 6, 2017 · 10 comments
Labels
Type: Performance (Performance improvement or performance problem)
Comments

@ab-oe
Contributor

ab-oe commented Oct 6, 2017

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | all |
| Distribution Version | |
| Linux Kernel | 4.4 |
| Architecture | x86_64 |
| ZFS Version | 0.7.2 |
| SPL Version | 0.7.2 |

Describe the problem you're observing

Since commit 539d33c, zvol discard operation time has increased dramatically, no matter how the zfs_per_txg_dirty_frees_percent parameter is set.

Describe how to reproduce the problem

The easiest way to reproduce it is to run mkfs.ext4 on the zvol, which performs a discard by default.

On 0.7.2 with 539d33c reverted, the whole mkfs.ext4 run on a 1 TiB thin-provisioned zvol takes 6 seconds:

time mkfs.ext4 /dev/zd16
mke2fs 1.42.12 (29-Aug-2014)
/dev/zd16 contains a ext4 file system
	created on Fri Oct  6 10:00:28 2017
Proceed anyway? (y,n) y
Discarding device blocks: done                            
Creating filesystem with 268435456 4k blocks and 67108864 inodes
Filesystem UUID: 37e643bc-f80f-4b91-867b-c6b12ef11261
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done     


real	0m6.305s
user	0m0.030s
sys	0m0.140s

On the official release the same operation takes 2 minutes and 19 seconds:

time mkfs.ext4 /dev/zd1
mke2fs 1.42.12 (29-Aug-2014)
/dev/zd16 contains a ext4 file system
	created on Fri Oct  6 10:00:45 2017
Proceed anyway? (y,n) y
Discarding device blocks: done                            
Creating filesystem with 268435456 4k blocks and 67108864 inodes
Filesystem UUID: aa37f90c-34e7-49d4-97b1-57ff9f71ee64
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done     


real	2m19.554s
user	0m0.040s
sys	0m0.130s

For a 10 TiB volume it took almost 20 minutes.

@loli10K added the Type: Performance label Oct 6, 2017
@mailinglists35

mailinglists35 commented Oct 8, 2017

The easiest way is to run mkfs.ext4

blkdiscard is part of util-linux and can report how many bytes were discarded.
It can also discard arbitrary ranges, not just the entire device.
You also don't want to keep the device busy with the background initialization writes that happen after mkfs.ext4, if all you want is a discard.
util-linux is present by default in all modern distros (you cannot even uninstall it on Debian).
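For a discard-only measurement, a blkdiscard invocation along these lines should work (the device name is taken from the report above; the exact verbose output format varies by util-linux version):

```shell
# Time a raw discard of the entire zvol; -v reports how many bytes were
# discarded. Destructive: this wipes the device contents.
time blkdiscard -v /dev/zd16

# Or discard an arbitrary range instead: 1 GiB starting at a 4 GiB offset.
blkdiscard -o 4G -l 1G /dev/zd16
```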

@behlendorf
Contributor

zvol discard operation time greatly fell down no matter how the zfs_per_txg_dirty_frees_percent parameter is set.

@ab-oe ouch that's a pretty dramatic performance hit. I assume that test was run on an otherwise idle system. Have you tried setting zfs_per_txg_dirty_frees_percent=0 to disable the throttle? That should get you the same behavior as reverting the patch.

If that works, we could consider something along the lines of disabling the throttle when there's not a significant amount of contending dirty data. The purpose of the throttle was to prevent frees from starving out writes. Let's get @alek-p's thoughts on this since he implemented the original patch and may have a better idea.

@ab-oe
Contributor Author

ab-oe commented Oct 11, 2017

@behlendorf I must admit that I didn't try setting zfs_per_txg_dirty_frees_percent to 0, but it does indeed resolve this issue. The discard performance is as good as with the patch reverted. Thank you.
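For reference, the workaround can be applied at runtime through the module parameter interface (a sketch; the sysfs path is the standard one for ZFS on Linux module parameters):

```shell
# Disable the dirty-frees throttle at runtime (a workaround, not a tuning
# recommendation; affects newly opened TXGs immediately):
echo 0 > /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent

# Persist the setting across module reloads:
echo "options zfs zfs_per_txg_dirty_frees_percent=0" >> /etc/modprobe.d/zfs.conf
```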

@behlendorf behlendorf added this to the 0.8.0 milestone Oct 11, 2017
@alek-p
Contributor

alek-p commented Oct 11, 2017

disabling the throttle when there's not a significant amount of contending dirty data

The issue with this approach could be that once we disable the throttle, we may not let any more non-freeing dirty data into the TXG until the frees are done, which would starve out writes again. To avoid this we could periodically re-enable the throttle and look at how much non-freeing dirty data gets in then, but I would expect this to lead to choppy performance.

Instead of disabling the throttle, perhaps setting zfs_per_txg_dirty_frees_percent dynamically is the way to go. My initial thought is to track the difference between dp_dirty_pertxg and dp_long_free_dirty_pertxg, and then adjust zfs_per_txg_dirty_frees_percent based on that difference. The key part is always leaving some room for non-freeing dirty data to come in, so that we can react and start to throttle frees.

Also, it now seems to me that we should calculate dirty_frees_threshold differently, perhaps by allowing zfs_per_txg_dirty_frees_percent to go above 100. We need to take into account that we're comparing dirty_frees_threshold (based on 1 TXG) to long_free_dirty_all_txgs (based on all TXGs that haven't completed sync pass 1). This would explain why setting zfs_per_txg_dirty_frees_percent to 100 doesn't give us the same results as disabling the throttle completely.
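To make the mismatch concrete, here is a minimal sketch of the threshold arithmetic being discussed, assuming the threshold is computed as a percentage of zfs_dirty_data_max per open TXG (the values below are illustrative defaults, not read from a live system):

```shell
zfs_dirty_data_max=4294967296        # assumed 4 GiB dirty-data cap
zfs_per_txg_dirty_frees_percent=30   # assumed default throttle percentage

# Per-TXG threshold that frees are checked against...
dirty_frees_threshold=$(( zfs_per_txg_dirty_frees_percent * zfs_dirty_data_max / 100 ))
echo "$dirty_frees_threshold bytes"  # prints "1288490188 bytes", ~1.2 GiB

# ...while long_free_dirty_all_txgs accumulates over all TXGs that have not
# finished sync pass 1, so even percent=100 is not equivalent to disabling
# the throttle entirely.
```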

@behlendorf
Contributor

behlendorf commented Oct 11, 2017

After looking at this a little more, I'm no longer convinced that any specific limit is the real root cause, specifically for a discard/free-only workload like mkfs.ext4.

The throttle here in dmu_free_long_range_impl is implemented by calling txg_wait_open() which blocks until the next txg is available. On the surface this looks entirely reasonable, but it assumes that additional real dirty data will be added to force a txg sync. For a discard/free only workload it looks like the txg will be filled to zfs_per_txg_dirty_frees_percent and then will wait for zfs_txg_timeout since the discards/frees won't dirty any dbufs. That means dp->dp_dirty_total will remain under the zfs_dirty_data_sync limit.

I suspect this is the reason why setting zfs_per_txg_dirty_frees_percent=100 has no real effect. We still end up waiting the full zfs_txg_timeout before even attempting to write out the txg. I suspect that using txg_wait_synced() in dmu_free_long_range_impl would mitigate most of the issue by conveying to txg_sync_thread that there is a waiter and it needs to start writing.

[edit] Using txg_wait_synced() isn't quite right either since we don't actually care about when these frees are written. We just want to make sure the next txg gets written and a new one gets opened.
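One way to observe the stall described above on a live system (a sketch; the kstat path is the ZFS on Linux one, and "tank" is a placeholder pool name) is to watch per-TXG timing while the discard runs:

```shell
# Current TXG timeout in seconds (default 5):
cat /sys/module/zfs/parameters/zfs_txg_timeout

# Per-TXG kstat: for a frees-only workload, the "otime" column should show
# TXGs staying open for roughly zfs_txg_timeout seconds each.
watch -n1 "tail -n 20 /proc/spl/kstat/zfs/tank/txgs"
```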

@alek-p
Contributor

alek-p commented Oct 11, 2017

That makes a lot of sense, so what we need is to trigger a TXG rollover for the frees-only workload, without short-circuiting the TXG mechanics when the load is "normal".

@alek-p
Contributor

alek-p commented Oct 24, 2017

Not sure how viable this is, but perhaps adding frees/discards as a new class in the I/O scheduler makes sense and could be made to satisfy the conditions referenced above.

@tcaputi
Contributor

tcaputi commented Jan 29, 2019

We are currently experiencing this problem at Datto and are working around it by setting zfs_per_txg_dirty_frees_percent=0. @behlendorf, can we prioritize getting #7888 reviewed and merged?

@behlendorf
Contributor

@tcaputi reviewed, the patch looks good. I've just requested a comment be added.

@alek-p
Contributor

alek-p commented Mar 7, 2019

This issue should be addressed by d2d141e, merged in from PR #7888.

@alek-p alek-p closed this as completed Mar 7, 2019