Skip to content

Commit

Permalink
Implement large_dnode pool feature
Browse files Browse the repository at this point in the history
Justification
-------------

This feature adds support for variable length dnodes. Our motivation
is to eliminate the overhead associated with using spill blocks.
Spill blocks are used to store system attribute data (i.e. file
metadata) that does not fit in the dnode's bonus buffer. By allowing
a larger bonus buffer area the use of a spill block can be avoided.
Spill blocks potentially incur an additional read I/O for every dnode
in a dnode block, and this extra I/O would be desirable to avoid. As
a worst case example, reading 32 dnodes from a 16k dnode block and
all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.
Then the worst case number of blocks read is reduced to from 33 to
two--one per dnode block. In practice spill blocks may tend to be
co-located on disk with the dnode blocks so the reduction in I/O
would not be this drastic. In a badly fragmented pool, however, the
improvement could be significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older pools.

Similarly, the in-core dnode structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-core objects. We were free, therefore, to choose a
more natural representation of size in a dnode_t, rather than the
awkward "extra_slots" used by dnode_phys_t for compatibility reasons.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces name are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the minimum supported dnode
size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The dn_slot parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by an large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
The ztest command was updated to periodically set the dnodesize
dataset property to a randomly chosen value.  Independently of the
dnodesize dataset property value, ztest chooses a random dnodesize
for every newly created object. The random distribution is more
heavily weighted toward small dnodes to better simulate real-world
datasets.

Unused bonus buffer space is now filled with non-zero values computed
from the object number, dataset id, offset, and generation number.
This helps ensure that the dnode traversal code properly skips the
interior regions of large dnodes, and that these interior regions are
not overwritten by data belonging to other dnodes. A new test is
added which visits each object in a dataset. It verifies that the
actual dnode size matches what was stored in the ztest block tag when
it was created. It also verifies that the unused bonus buffer space
is filled with the expected data patterns.

Send/Receive
------------

Datasets using large dnodes cannot be sent to pools that don't
support the feature. A send stream with large dnodes sets a
DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an
incompatible receiving pool so that the zfs receive will fail
gracefully.

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------

The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.

Resizing Dnodes
---------------

It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------

The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset.  This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
  • Loading branch information
nedbass committed Jun 3, 2016
1 parent 64b1078 commit ea35843
Show file tree
Hide file tree
Showing 44 changed files with 999 additions and 217 deletions.
17 changes: 9 additions & 8 deletions cmd/zdb/zdb.c
Expand Up @@ -1880,15 +1880,15 @@ dump_object(objset_t *os, uint64_t object, int verbosity, int *print_header)
dnode_t *dn;
void *bonus = NULL;
size_t bsize = 0;
char iblk[32], dblk[32], lsize[32], asize[32], fill[32];
char iblk[32], dblk[32], lsize[32], asize[32], fill[32], dnsize[32];
char bonus_size[32];
char aux[50];
int error;

if (*print_header) {
(void) printf("\n%10s %3s %5s %5s %5s %5s %6s %s\n",
"Object", "lvl", "iblk", "dblk", "dsize", "lsize",
"%full", "type");
(void) printf("\n%10s %3s %5s %5s %5s %6s %5s %6s %s\n",
"Object", "lvl", "iblk", "dblk", "dsize", "dnsize",
"lsize", "%full", "type");
*print_header = 0;
}

Expand All @@ -1910,6 +1910,7 @@ dump_object(objset_t *os, uint64_t object, int verbosity, int *print_header)
zdb_nicenum(doi.doi_max_offset, lsize);
zdb_nicenum(doi.doi_physical_blocks_512 << 9, asize);
zdb_nicenum(doi.doi_bonus_size, bonus_size);
zdb_nicenum(doi.doi_dnodesize, dnsize);
(void) sprintf(fill, "%6.2f", 100.0 * doi.doi_fill_count *
doi.doi_data_block_size / (object == 0 ? DNODES_PER_BLOCK : 1) /
doi.doi_max_offset);
Expand All @@ -1926,13 +1927,13 @@ dump_object(objset_t *os, uint64_t object, int verbosity, int *print_header)
ZDB_COMPRESS_NAME(doi.doi_compress));
}

(void) printf("%10lld %3u %5s %5s %5s %5s %6s %s%s\n",
(void) printf("%10lld %3u %5s %5s %5s %6s %5s %6s %s%s\n",
(u_longlong_t)object, doi.doi_indirection, iblk, dblk,
asize, lsize, fill, ZDB_OT_NAME(doi.doi_type), aux);
asize, dnsize, lsize, fill, ZDB_OT_NAME(doi.doi_type), aux);

if (doi.doi_bonus_type != DMU_OT_NONE && verbosity > 3) {
(void) printf("%10s %3s %5s %5s %5s %5s %6s %s\n",
"", "", "", "", "", bonus_size, "bonus",
(void) printf("%10s %3s %5s %5s %5s %5s %5s %6s %s\n",
"", "", "", "", "", "", bonus_size, "bonus",
ZDB_OT_NAME(doi.doi_bonus_type));
}

Expand Down
6 changes: 4 additions & 2 deletions cmd/zdb/zdb_il.c
Expand Up @@ -80,8 +80,10 @@ zil_prt_rec_create(zilog_t *zilog, int txtype, lr_create_t *lr)
}

(void) printf("%s%s", prefix, ctime(&crtime));
(void) printf("%sdoid %llu, foid %llu, mode %llo\n", prefix,
(u_longlong_t)lr->lr_doid, (u_longlong_t)lr->lr_foid,
(void) printf("%sdoid %llu, foid %llu, slots %llu, mode %llo\n", prefix,
(u_longlong_t)lr->lr_doid,
(u_longlong_t)LR_FOID_GET_OBJ(lr->lr_foid),
(u_longlong_t)LR_FOID_GET_SLOTS(lr->lr_foid),
(longlong_t)lr->lr_mode);
(void) printf("%suid %llu, gid %llu, gen %llu, rdev 0x%llx\n", prefix,
(u_longlong_t)lr->lr_uid, (u_longlong_t)lr->lr_gid,
Expand Down

0 comments on commit ea35843

Please sign in to comment.