
Poor zvol discard performance. #6728

Closed
ab-oe opened this issue Oct 6, 2017 · 10 comments
Labels
Type: Performance (Performance improvement or performance problem)
Comments

@ab-oe
Contributor

ab-oe commented Oct 6, 2017

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | all |
| Distribution Version | |
| Linux Kernel | 4.4 |
| Architecture | x86_64 |
| ZFS Version | 0.7.2 |
| SPL Version | 0.7.2 |

Describe the problem you're observing

Since commit 539d33c, zvol discard operation time has increased dramatically, no matter how the zfs_per_txg_dirty_frees_percent parameter is set.

Describe how to reproduce the problem

The easiest way to reproduce it is to run mkfs.ext4 on the zvol, which performs a discard by default.

On 0.7.2 with 539d33c reverted, the whole mkfs.ext4 run on a 1 TiB thin-provisioned zvol takes 6 seconds:

time mkfs.ext4 /dev/zd16
mke2fs 1.42.12 (29-Aug-2014)
/dev/zd16 contains a ext4 file system
	created on Fri Oct  6 10:00:28 2017
Proceed anyway? (y,n) y
Discarding device blocks: done                            
Creating filesystem with 268435456 4k blocks and 67108864 inodes
Filesystem UUID: 37e643bc-f80f-4b91-867b-c6b12ef11261
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done     


real	0m6.305s
user	0m0.030s
sys	0m0.140s

On the official release the same operation takes 2 minutes and 19 seconds:

time mkfs.ext4 /dev/zd1
mke2fs 1.42.12 (29-Aug-2014)
/dev/zd16 contains a ext4 file system
	created on Fri Oct  6 10:00:45 2017
Proceed anyway? (y,n) y
Discarding device blocks: done                            
Creating filesystem with 268435456 4k blocks and 67108864 inodes
Filesystem UUID: aa37f90c-34e7-49d4-97b1-57ff9f71ee64
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done     


real	2m19.554s
user	0m0.040s
sys	0m0.130s

For a 10 TiB volume it took almost 20 minutes.

@loli10K added the Type: Performance label Oct 6, 2017
@mailinglists35

mailinglists35 commented Oct 8, 2017

The easiest way is to run mkfs.ext4

blkdiscard is part of util-linux and can report how many bytes were discarded.
It can also discard arbitrary ranges, not just the entire device.
You also don't want to keep the device busy with the background initialization writes that happen after mkfs.ext4, if all you want is a discard.
util-linux is present by default in all modern distros (you cannot even uninstall it on Debian).
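For a discard-only measurement, a blkdiscard invocation along these lines should work (the device name is taken from the report above; the exact verbose output format varies by util-linux version):

```shell
# Time a raw discard of the entire zvol; -v reports how many bytes were
# discarded. Destructive: this wipes the device contents.
time blkdiscard -v /dev/zd16

# Or discard an arbitrary range instead: 1 GiB starting at a 4 GiB offset.
blkdiscard -o 4G -l 1G /dev/zd16
```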

@behlendorf
Contributor

zvol discard operation time greatly fell down no matter how the zfs_per_txg_dirty_frees_percent parameter is set.

@ab-oe ouch that's a pretty dramatic performance hit. I assume that test was run on an otherwise idle system. Have you tried setting zfs_per_txg_dirty_frees_percent=0 to disable the throttle? That should get you the same behavior as reverting the patch.

If that works, we could consider something along the lines of disabling the throttle when there's not a significant amount of contending dirty data. The purpose of the throttle was to prevent frees from starving out writes. Let's get @alek-p's thoughts on this since he implemented the original patch and may have a better idea.

@ab-oe
Contributor Author

ab-oe commented Oct 11, 2017

@behlendorf I must admit that I didn't try setting zfs_per_txg_dirty_frees_percent to 0, but it does indeed resolve this issue. The discard performance is as good as with the patch reverted. Thank you.
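For reference, the workaround can be applied at runtime through the module parameter interface (a sketch; the sysfs path is the standard one for ZFS on Linux module parameters):

```shell
# Disable the dirty-frees throttle at runtime (a workaround, not a tuning
# recommendation; affects newly opened TXGs immediately):
echo 0 > /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent

# Persist the setting across module reloads:
echo "options zfs zfs_per_txg_dirty_frees_percent=0" >> /etc/modprobe.d/zfs.conf
```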

@behlendorf behlendorf added this to the 0.8.0 milestone Oct 11, 2017
@alek-p
Contributor

alek-p commented Oct 11, 2017

disabling the throttle when there's not a significant amount of contending dirty data

The issue with this approach could be that once we disable the throttle, we may not let any more non-freeing dirty data into the TXG until the frees are done, which would starve out writes again. To avoid this we could periodically re-enable the throttle and look at how much non-freeing dirty data gets in then, but I would expect this to lead to choppy performance.

Instead of disabling the throttle, perhaps setting zfs_per_txg_dirty_frees_percent dynamically is the way to go. My initial thought is to track the difference between dp_dirty_pertxg and dp_long_free_dirty_pertxg, and then adjust zfs_per_txg_dirty_frees_percent based on that difference. The key part is always leaving some room for non-freeing dirty data to come in, so that we can react and start to throttle frees.

Also, it now seems to me that we should calculate dirty_frees_threshold differently, perhaps by allowing zfs_per_txg_dirty_frees_percent to go above 100. We need to take into account that we're comparing dirty_frees_threshold (based on 1 TXG) to long_free_dirty_all_txgs (based on all TXGs that haven't completed sync pass 1). This would explain why setting zfs_per_txg_dirty_frees_percent to 100 doesn't give us the same results as disabling the throttle completely.
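To make the mismatch concrete, here is a minimal sketch of the threshold arithmetic being discussed, assuming the threshold is computed as a percentage of zfs_dirty_data_max per open TXG (the values below are illustrative defaults, not read from a live system):

```shell
zfs_dirty_data_max=4294967296        # assumed 4 GiB dirty-data cap
zfs_per_txg_dirty_frees_percent=30   # assumed default throttle percentage

# Per-TXG threshold that frees are checked against...
dirty_frees_threshold=$(( zfs_per_txg_dirty_frees_percent * zfs_dirty_data_max / 100 ))
echo "$dirty_frees_threshold bytes"  # prints "1288490188 bytes", ~1.2 GiB

# ...while long_free_dirty_all_txgs accumulates over all TXGs that have not
# finished sync pass 1, so even percent=100 is not equivalent to disabling
# the throttle entirely.
```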

@behlendorf
Contributor

behlendorf commented Oct 11, 2017

After looking at this a little more, I'm no longer convinced that any specific limit is the real root cause, specifically for a discard/free-only workload like mkfs.ext4.

The throttle here in dmu_free_long_range_impl is implemented by calling txg_wait_open() which blocks until the next txg is available. On the surface this looks entirely reasonable, but it assumes that additional real dirty data will be added to force a txg sync. For a discard/free only workload it looks like the txg will be filled to zfs_per_txg_dirty_frees_percent and then will wait for zfs_txg_timeout since the discards/frees won't dirty any dbufs. That means dp->dp_dirty_total will remain under the zfs_dirty_data_sync limit.

I suspect this is the reason why setting zfs_per_txg_dirty_frees_percent=100 has no real effect. We still end up waiting the full zfs_txg_timeout before even attempting to write out the txg. I suspect that using txg_wait_synced() in dmu_free_long_range_impl would mitigate most of the issue by conveying to txg_sync_thread that there is a waiter and it needs to start writing.

[edit] Using txg_wait_synced() isn't quite right either since we don't actually care about when these frees are written. We just want to make sure the next txg gets written and a new one gets opened.
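One way to observe the stall described above on a live system (a sketch; the kstat path is the ZFS on Linux one, and "tank" is a placeholder pool name) is to watch per-TXG timing while the discard runs:

```shell
# Current TXG timeout in seconds (default 5):
cat /sys/module/zfs/parameters/zfs_txg_timeout

# Per-TXG kstat: for a frees-only workload, the "otime" column should show
# TXGs staying open for roughly zfs_txg_timeout seconds each.
watch -n1 "tail -n 20 /proc/spl/kstat/zfs/tank/txgs"
```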

@alek-p
Contributor

alek-p commented Oct 11, 2017

That makes a lot of sense, so what we need is to trigger a TXG rollover for the frees-only workload, without short-circuiting the TXG mechanics when the load is "normal".

@alek-p
Contributor

alek-p commented Oct 24, 2017

Not sure how viable this is, but perhaps adding frees/discards as a new class in the I/O scheduler makes sense and could be made to satisfy the conditions referenced above.

@tcaputi
Contributor

tcaputi commented Jan 29, 2019

We are currently experiencing this problem at Datto and are working around it by setting zfs_per_txg_dirty_frees_percent=0. @behlendorf, can we prioritize getting #7888 reviewed and merged?

@behlendorf
Contributor

@tcaputi reviewed, the patch looks good. I've just requested a comment be added.

@alek-p
Contributor

alek-p commented Mar 7, 2019

This issue should be addressed by d2d141e, merged in from PR #7888.

@alek-p alek-p closed this as completed Mar 7, 2019