
Backfilling metadnode degrades object create rates #4460

Closed
wants to merge 1 commit into from

Conversation

nedbass
Contributor

@nedbass nedbass commented Mar 26, 2016

Object creation rates may be degraded when dmu_object_alloc() tries
to backfill the metadnode array by restarting its search at offset 0.
The method of searching the dnode space for holes is inefficient and
unreliable, leading to many failed attempts to obtain a dnode hold.
These failed attempts are expensive and limit overall system
throughput. This patch changes the default behavior to disable
backfilling, and it adds a zfs_metadnode_backfill module parameter to
allow the old behavior to be enabled.

The search offset restart happens at most once per call to
dmu_object_alloc() when the previously allocated object number is a
multiple of 4096. If the hold on the requested object fails because
the object is allocated, dmu_object_next() is called to find the next
hole. That function should theoretically identify the next free
object that the next loop iteration can successfully obtain a hold
on. In practice, however, dmu_object_next() may falsely identify a
recently allocated dnode as free because the in-memory copy of the
dnode_phys_t is not up to date. The next hold attempt then fails, and
this process repeats for up to 4096 loop iterations before the search
skips ahead to a sparse region of the metadnode. A similar pathology
occurs if dmu_object_next() returns ESRCH when it fails to find a
hole in the current dnode block. In this case dmu_object_alloc()
simply increments the object number and retries, resulting again in
up to 4096 failed dnode hold attempts.
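
To make the failure mode above concrete, here is a minimal, self-contained C
sketch (not the OpenZFS source; try_hold() and next_hole() are hypothetical
stand-ins for dnode_hold() and dmu_object_next()) showing how a hole search
that only consults the stale on-disk dnode_phys_t burns thousands of failed
hold attempts:

#include <stdbool.h>
#include <stdio.h>

#define NOBJ 4096                     /* one restart window of object numbers */

static bool ondisk_allocated[NOBJ];   /* what the dnode_phys_t buffers claim */
static bool inmem_allocated[NOBJ];    /* what the in-core dnode_t's know */

/* Stand-in for dnode_hold(): fails if the object is really allocated. */
static int try_hold(int obj) { return inmem_allocated[obj] ? -1 : 0; }

/* Stand-in for dmu_object_next(): consults only the stale on-disk copy. */
static int next_hole(int obj)
{
        while (obj < NOBJ && ondisk_allocated[obj])
                obj++;
        return obj;
}

int main(void)
{
        int obj, failed_holds = 0;

        /* Objects 0..2999 were allocated this txg: in memory, not yet on disk. */
        for (obj = 0; obj < 3000; obj++)
                inmem_allocated[obj] = true;

        /* Backfill restarts the search at object 0. */
        for (obj = 0; obj < NOBJ; obj = next_hole(obj + 1)) {
                if (try_hold(obj) == 0) {
                        printf("allocated object %d after %d failed holds\n",
                            obj, failed_holds);
                        return 0;
                }
                failed_holds++;       /* stale on-disk data called this a hole */
        }
        return 1;
}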

We can avoid these pathologies by not attempting to backfill the
metadnode array. This may result in sparse dnode blocks, potentially
costing disk space, memory overhead, and increased disk I/O. These
penalties appear to be outweighed by the performance cost of the
current approach. Future work could implement a more efficient means
to search for holes and allow us to reenable backfilling by default.
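
A minimal sketch of the gating this patch introduces (not the literal diff;
the 4096-object window and the parameter name come from the description
above):

static int zfs_metadnode_backfill = 0;  /* new module parameter, default: off */

/* Decide where dmu_object_alloc()'s search should continue. */
static unsigned long long
next_search_start(unsigned long long last_object)
{
        /* Old behavior: every 4096 objects, restart at 0 to backfill holes. */
        if (zfs_metadnode_backfill && (last_object % 4096) == 0)
                return (0);
        /* New default: keep allocating past the end of the metadnode. */
        return (last_object + 1);
}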

=== Benchmark Results ===

We measured a 46% increase in average file creation rate by
setting zfs_metadnode_backfill=0.

The createmany benchmark used is available at
http://github.com/nedbass/createmany. It used 32 threads to create 16
million files over 16 iterations. The pool was freshly created for each
of the two tests. The test system was a d2.xlarge Amazon AWS virtual
machine with 3 2TB disks in a raidz pool.

zfs_metadnode_backfill Average creates/second
---------------------- ----------------------
                     0                  43879
                     1                  30040

$ zpool create tank raidz /dev/xvd{b,c,d}
$ echo 0 > /sys/module/zfs/parameters/zfs_metadnode_backfill
$ for ((i=0;i<16;i++)) ; do ./createmany -o -t 32 -D $(mktemp
-d /tank/XXXXX) 1000000 ; done
total: 1000000 creates in 21.142829 seconds: 47297.359852 creates/second
total: 1000000 creates in 21.421943 seconds: 46681.108566 creates/second
total: 1000000 creates in 21.996960 seconds: 45460.826977 creates/second
total: 1000000 creates in 22.031947 seconds: 45388.637143 creates/second
total: 1000000 creates in 21.597262 seconds: 46302.165727 creates/second
total: 1000000 creates in 21.194397 seconds: 47182.281302 creates/second
total: 1000000 creates in 23.844561 seconds: 41938.285457 creates/second
total: 1000000 creates in 25.678497 seconds: 38943.089478 creates/second
total: 1000000 creates in 22.400553 seconds: 44641.757449 creates/second
total: 1000000 creates in 22.011262 seconds: 45431.290857 creates/second
total: 1000000 creates in 21.848749 seconds: 45769.211022 creates/second
total: 1000000 creates in 26.574808 seconds: 37629.622928 creates/second
total: 1000000 creates in 22.326124 seconds: 44790.580077 creates/second
total: 1000000 creates in 23.562593 seconds: 42440.152541 creates/second
total: 1000000 creates in 26.825597 seconds: 37277.828270 creates/second
total: 1000000 creates in 22.277026 seconds: 44889.297413 creates/second

$ zpool destroy tank
$ zpool create tank raidz /dev/xvd{b,c,d}
$ echo 1 > /sys/module/zfs/parameters/zfs_metadnode_backfill
$ for ((i=0;i<16;i++)) ; do ./createmany -o -t 32 -D $(mktemp
-d /tank/XXXXX) 1000000 ; done
total: 1000000 creates in 31.947285 seconds: 31301.564265 creates/second
total: 1000000 creates in 31.511260 seconds: 31734.687822 creates/second
total: 1000000 creates in 31.984121 seconds: 31265.515618 creates/second
total: 1000000 creates in 31.960720 seconds: 31288.406458 creates/second
total: 1000000 creates in 32.651408 seconds: 30626.550663 creates/second
total: 1000000 creates in 32.579218 seconds: 30694.414826 creates/second
total: 1000000 creates in 36.163562 seconds: 27652.143474 creates/second
total: 1000000 creates in 33.621352 seconds: 29743.003829 creates/second
total: 1000000 creates in 33.097268 seconds: 30213.974061 creates/second
total: 1000000 creates in 34.419482 seconds: 29053.313476 creates/second
total: 1000000 creates in 34.014244 seconds: 29399.448204 creates/second
total: 1000000 creates in 32.972573 seconds: 30328.236705 creates/second
total: 1000000 creates in 34.757156 seconds: 28771.054526 creates/second
total: 1000000 creates in 32.194859 seconds: 31060.859951 creates/second
total: 1000000 creates in 32.464407 seconds: 30802.966165 creates/second
total: 1000000 creates in 37.443681 seconds: 26706.776650 creates/second

Signed-off-by: Ned Bass <bass6@llnl.gov>

@nedbass
Contributor Author

nedbass commented Mar 26, 2016

@ahrens this might be of interest to your metadata performance work.

It highlights a problem I noticed while working on large dnodes (#3542). When dnode_next_offset_level() searches the dnode space it looks at dnode_phys_t buffers. But if the object was recently created, that buffer is not in sync with the in-core dnode_t. It will be empty, and dnode_next_offset_level() falsely identifies a recently allocated dnode as a hole because the dn_type field appears to be DMU_OT_NONE. In #3542 I had to inspect both dnode_phys_t and the dnode handles to deal with this issue.
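
A hedged sketch of that idea (not the #3542 code; the stub types below are
hypothetical stand-ins for dnode_phys_t and the in-core dnode handle):

#include <stdbool.h>

#define DMU_OT_NONE 0

typedef struct phys_stub { int dn_type; } phys_stub_t;                   /* on-disk view */
typedef struct handle_stub { bool in_use; int dn_type; } handle_stub_t;  /* in-core view */

/* Treat an object as a hole only if both views agree that it is free. */
static bool
object_is_free(const phys_stub_t *dnp, const handle_stub_t *dnh)
{
        if (dnp->dn_type != DMU_OT_NONE)        /* already allocated on disk */
                return (false);
        if (dnh != NULL && (dnh->in_use || dnh->dn_type != DMU_OT_NONE))
                return (false);                 /* allocated in the open txg */
        return (true);
}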

Addressing that doesn't fix this performance issue, however, because dmu_object_alloc() doesn't handle ESRCH errors from dmu_object_next(). It simply increments the object number and retries in that case, so it still may end up iterating through a whole L2 block pointer's worth of allocated dnodes before moving on.
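
One possible shape for that ESRCH handling, purely as a hedged sketch (this
patch does not implement it, and REGION_OBJS is a hypothetical stand-in for
however many objects one search pass covers):

#define REGION_OBJS 4096ULL   /* assumed size of the region dmu_object_next() searched */

/* On ESRCH, skip past the region that was just searched rather than
 * advancing a single object at a time. */
static unsigned long long
advance_after_esrch(unsigned long long object)
{
        return ((object / REGION_OBJS + 1) * REGION_OBJS);
}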

Disabling backfill is a band-aid. In the long term we need a better way to find holes, i.e. spacemaps.

@ahrens
Member

ahrens commented Mar 26, 2016

Thanks @nedbass. I recently discovered this as well. I agree we need to take into account allocated dnodes that have not been written to disk yet. I'll be working on a design for that. I'll try to avoid changing the on disk structure though (i.e. Not add space maps. Range trees could be useful though, for tracking what's allocated but not yet synced. )

@behlendorf
Contributor

@nedbass nice find. This issue of not being able to cheaply determine whether a dnode has just been dirtied but not yet written has come up a few times recently. It clearly has a significant impact on create performance. I think space maps, range trees, or even bitmaps might all be reasonable approaches depending on exactly what use case is being optimized. None of these solutions necessarily requires us to change the on-disk format (and I agree that avoiding that would be a good thing).

@behlendorf behlendorf added the Type: Performance (Performance improvement or performance problem) label Mar 27, 2016
@behlendorf behlendorf added this to the 0.7.0 milestone Mar 27, 2016
@nedbass
Contributor Author

nedbass commented Mar 30, 2016

Related to openzfs/openzfs#82

@@ -31,6 +31,8 @@
#include <sys/zap.h>
#include <sys/zfeature.h>

int zfs_metadnode_backfill = 0;
Contributor

It probably makes sense to add a pool parameter for this instead of just a module option, so that it can be set persistently if users care more about dense allocations than create performance.

Member

You mean to make it a zpool property? I think that would be a bad idea, since this is essentially a workaround for a performance bug.

Contributor Author

Yes, if we can reliably detect holes using openzfs/openzfs#82, and add appropriate handling of ESRCH from dmu_object_next(), then this patch shouldn't be needed.

Member

I don't think that openzfs/openzfs#82 is sufficient to address the performance problem that this patch is working around. The problem is that dnode_next_offset (and dmu_object_next) don't take into account objects allocated in memory but not yet synced to disk. Therefore if we allocate more than an L1's worth (the comment about L2 is inaccurate) of dnodes in one txg, we will end up calling dnode_hold_impl / dmu_object_next on every allocated-in-memory object when we cross the L1 boundary.

Contributor Author

OK, I misunderstood what that patch fixes. The symptoms are similar (dnode_next_offset() can detect fictional holes) but happen under different conditions.

@sempervictus
Contributor

If this patch were applied in a stack, used on a pool, and then removed for a subsequent iteration, would the holes missed due to this workaround be backfilled?
Secondly, what effect, if any, would this have on ZVOLs? We've observed degrading performance on ZVOLs under heavy write load (while the vdevs comprising the pool show ~20% load), and I'm wondering if that could be related, and potentially (temporarily) addressed this way.
Thanks.

@nedbass
Contributor Author

nedbass commented Apr 8, 2016

@sempervictus Yes, the holes would be backfilled. In fact the backfilling behavior would be immediately restored if you dynamically set zfs_metadnode_backfill=1.

This patch will have no effect on ZVOL performance, since a ZVOL is effectively one giant object and doesn't allocate new objects internally.

@behlendorf
Contributor

While this isn't the ideal long-term solution for this problem, it is a small, safe change which does significantly improve metadata performance today. @nedbass let me know if you're happy with this as a final version so it can be merged.

@nedbass
Contributor Author

nedbass commented Apr 12, 2016

@behlendorf I'm fine with it in its current form. We'll probably want to revert this once the underlying issues are fixed.

@adilger
Contributor

adilger commented Apr 13, 2016

AFAIK, the problem with using this patch in production is that it will result in monotonically increasing dnode numbers, and for a workload that is creating and deleting files continuously the metadnode would become very large and sparse, until the filesystem is remounted, correct? For long-running servers this might be unacceptable.

Is there some way to track the number of freed dnodes (an in-memory per-pool counter) and reset the scanning once it hits some threshold (e.g. 4x the average number of dnodes allocated in the past few TXGs)? That would make it worthwhile to go back and re-scan, while ensuring the new dnodes are unlikely to hit recently allocated dnodes. It may have a noticeable performance cost to change from allocating all new dnodes in a block to filling in holes.
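
A hedged sketch of that heuristic (nothing here is implemented by this patch;
the counter, threshold, and names are hypothetical):

#include <stdbool.h>

static unsigned long long freed_since_last_scan;   /* bumped on dnode free */
static unsigned long long recent_allocs_per_txg;   /* maintained elsewhere */

static void
note_dnode_freed(void)
{
        freed_since_last_scan++;
}

/* Only restart the backfill scan once enough dnodes have been freed to
 * make rescanning worthwhile, e.g. ~4x the recent allocation rate. */
static bool
should_backfill(void)
{
        if (recent_allocs_per_txg > 0 &&
            freed_since_last_scan >= 4 * recent_allocs_per_txg) {
                freed_since_last_scan = 0;
                return (true);
        }
        return (false);
}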

@nedbass
Contributor Author

nedbass commented Apr 13, 2016

@adilger That's a good point. I wonder if metadata compression makes it less of an issue though. I think zeroed-out blocks do not actually consume space on disk. And mostly-zero blocks should compress well. There is still a memory overhead problem with a very sparse metadnode if the working set is spread across the entire dnode space.

@adilger
Contributor

adilger commented Apr 13, 2016 via email

@ahrens
Member

ahrens commented Apr 19, 2016

@adilger

the problem with using this patch in production is that it will result in monotonically increasing dnode numbers, and for a workload that is creating and deleting files continuously the metadnode would become very large and sparse, until the filesystem is remounted

Yes

track the number of freed dnodes (an in-memory per-pool counter) and reset the scanning once it hits some threshold

That's a good idea.

@behlendorf
Contributor

the problem with using this patch in production is that it will result in monotonically increasing dnode numbers, and for a workload that is creating and deleting files continuously the metadnode would become very large and sparse, until the filesystem is remounted

Yes, but is having a sparse dnode object actually a real problem? Sure, it's not the ideal long-term fix, but aside from possibly slightly worse memory utilization this seems like it wouldn't cause any issues.

That said, I'm happy to hold off on doing anything here. If we get a little time I agree it would be interesting to take a crack at what @adilger suggested, which is nice and simple.

@adilger
Contributor

adilger commented Apr 22, 2016

I think there are a few potential drawbacks of never trying to backfill in a workload that does both creates and unlinks:

  • depending on the workload, the leaf blocks may not be totally empty, so the whole block will be kept in cache even if only one dnode is left, which might consume a fair amount of memory. It is better to try to keep those blocks more full.
  • the metadnode gets very large and sparse, which increases the depth of the tree needed to address leaf blocks, which (possibly; I don't recall the exact implementation) will increase the transaction size for each dnode update based on the max file size. That is a relatively small effect, if any (it may be that the transaction size is based on the number of indirect blocks needed to rewrite any data block for the maximum possible file size rather than the actual file size).
  • if the metadnode gets too large and sparse, it will take more IOPS to do scrubs and other dnode traversals, since each read would only return a few dnodes per leaf block, while a dense metadnode yields many dnodes from disk for each read.
  • if we wait too long before trying to backfill (e.g. after remount instead of after a few tens of seconds if there are many files being deleted), then the creation of new files will have to read many of these metadnode blocks from disk to fill in space, rather than getting them from the ARC.

@ahrens
Member

ahrens commented Apr 22, 2016

@adilger

  • The metadnode unfortunately has a fixed depth.
  • Sparseness hurts other operations that need to visit all dnodes (in addition to scrub), most notably zfs send and zfs destroy. These could devolve into 1 dnode per block (or even 1 dnode per 2 blocks -- if there's only one object under each L1 indirect block!).
  • You could fill up the metadnode and reach the limit of # files per filesystem (2^48) - though admittedly it would take 45 years at 200,000 creations/second.
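
(For reference, the arithmetic behind that last figure: 2^48 ≈ 2.8 × 10^14 objects; at 200,000 creations/second that is about 1.4 × 10^9 seconds, or roughly 45 years.)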

@bzzz77
Contributor

bzzz77 commented Apr 27, 2016

Is it correct to say that the major issue is dnode_next_offset_level() being incapable of detecting just-allocated dnodes?

@nedbass
Contributor Author

nedbass commented Apr 27, 2016

@bzzz77 yes, that's correct. A secondary issue is that dmu_object_alloc() doesn't handle ESRCH returned from dmu_object_next(). So if the dnode blocks under an L1 block pointer are all full, as will be the case with a create-only workload, it still calls dnode_hold_impl() on every one.

@bzzz77
Contributor

bzzz77 commented Apr 27, 2016

Then what if we had an in-memory structure tracking allocations in the current TXG and consulted that? TXG sync would then release that structure.

@nedbass
Contributor Author

nedbass commented Apr 27, 2016

@bzzz77 I think that's exactly what @ahrens was proposing to do using range trees.

@adilger
Contributor

adilger commented Apr 28, 2016

Even if there is an efficient structure for tracking in-progress allocations, I think there is still a benefit from not doing any scanning of metadnode blocks if there aren't any files being unlinked. For HPC at least, there may be a few hundred thousand file creates in one group, and then a similar number of deletes in a later group, so rescanning the metadnode for holes repeatedly during creation is wasteful unless there is some reason to expect that there are new holes (i.e. some reasonable number of dnodes have been deleted since the last time the metadnode was scanned).

@bzzz77
Contributor

bzzz77 commented Apr 28, 2016

Yes, obviously it makes sense to track deletes in some way as well.

@nedbass
Contributor Author

nedbass commented May 28, 2016

Closing in favor of #4711

@nedbass nedbass closed this May 28, 2016
Labels
Type: Performance (Performance improvement or performance problem)

6 participants