This repository has been archived by the owner on Nov 7, 2019. It is now read-only.

9337 zfs get all is slow due to uncached metadata
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching, in the dbuf layer, all the
metadata that they will need. This prevents that data from being evicted,
so that future calls to, e.g., zfs get all rarely have to go to disk.
There are two parts:

1. The dbuf_metadata_cache. We identify what to put into the cache based
   on the object type of each dbuf.

2. Caching the objset properties os_{version,normalization,utf8only,
   casesensitivity} in the objset_t. These needed to be cached because,
   although they are queried frequently, they aren't stored in a dbuf type
   that we can easily recognize and cache in the dbuf layer; instead, we
   have to store them explicitly. There is already infrastructure for
   maintaining cached properties in the objset setup code, so I simply
   used that.

Performance Testing:

 - Disabled kmem_flags
 - Tuned dbuf_cache_max_bytes very low (128K)
 - Tuned zfs_arc_max very low (64M)
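For reference, tunables like these could be set in /etc/system on illumos;
the variable names match the source, but treat the exact module prefixes and
syntax below as an assumption to verify against your platform:

```
* Test tuning: cap the LRU dbuf cache at 128K and the ARC at 64M
set zfs:dbuf_cache_max_bytes = 0x20000
set zfs:zfs_arc_max = 0x4000000
```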

Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.

Results:

    +------------------------+---------------------+---------------------+
    | Test                   | Time (trunk / diff) | I/Os (trunk / diff) |
    +------------------------+---------------------+---------------------+
    | zpool import           |     0:05 / 0:06     |    12.9k / 12.9k    |
    | zfs get all (uncached) |     1:36 / 0:53     |    16.7k / 5.7k     |
    | zfs get all (cached)   |     1:36 / 0:51     |    16.0k / 6.0k     |
    +------------------------+---------------------+---------------------+

Closes #599
ahrens authored and Prakash Surya committed Apr 22, 2018
1 parent cfaba7f commit 7dec52f
Showing 7 changed files with 283 additions and 100 deletions.
182 changes: 146 additions & 36 deletions usr/src/uts/common/fs/zfs/dbuf.c
@@ -49,6 +49,7 @@
#include <sys/abd.h>
#include <sys/vdev.h>
#include <sys/cityhash.h>
#include <sys/spa_impl.h>

uint_t zfs_dbuf_evict_key;

@@ -74,24 +75,58 @@ static kcondvar_t dbuf_evict_cv;
static boolean_t dbuf_evict_thread_exit;

/*
* LRU cache of dbufs. The dbuf cache maintains a list of dbufs that
* are not currently held but have been recently released. These dbufs
* are not eligible for arc eviction until they are aged out of the cache.
* Dbufs are added to the dbuf cache once the last hold is released. If a
* dbuf is later accessed and still exists in the dbuf cache, then it will
* be removed from the cache and later re-added to the head of the cache.
* Dbufs that are aged out of the cache will be immediately destroyed and
* become eligible for arc eviction.
* There are two dbuf caches; each dbuf can only be in one of them at a time.
*
* 1. Cache of metadata dbufs, to help make read-heavy administrative commands
* from /sbin/zfs run faster. The "metadata cache" specifically stores dbufs
* that represent the metadata that describes filesystems/snapshots/
* bookmarks/properties/etc. We only evict from this cache when we export a
* pool, to short-circuit as much I/O as possible for all administrative
* commands that need the metadata. There is no eviction policy for this
* cache, because we try to only include types in it which would occupy a
* very small amount of space per object but create a large impact on the
* performance of these commands. Instead, after it reaches a maximum size
* (which should only happen on very small memory systems with a very large
* number of filesystem objects), we stop taking new dbufs into the
* metadata cache, instead putting them in the normal dbuf cache.
*
* 2. LRU cache of dbufs. The "dbuf cache" maintains a list of dbufs that
* are not currently held but have been recently released. These dbufs
* are not eligible for arc eviction until they are aged out of the cache.
* Dbufs that are aged out of the cache will be immediately destroyed and
* become eligible for arc eviction.
*
* Dbufs are added to these caches once the last hold is released. If a dbuf is
* later accessed and still exists in the dbuf cache, then it will be removed
* from the cache and later re-added to the head of the cache.
*
* If a given dbuf meets the requirements for the metadata cache, it will go
* there, otherwise it will be considered for the generic LRU dbuf cache. The
* caches and the refcounts tracking their sizes are stored in an array indexed
* by those caches' matching enum values (from dbuf_cached_state_t).
*/
static multilist_t *dbuf_cache;
static refcount_t dbuf_cache_size;
uint64_t dbuf_cache_max_bytes = 0;
typedef struct dbuf_cache {
multilist_t *cache;
refcount_t size;
} dbuf_cache_t;
dbuf_cache_t dbuf_caches[DB_CACHE_MAX];

/* Set the default size of the dbuf cache to log2 fraction of arc size. */
/* Size limits for the caches */
uint64_t dbuf_cache_max_bytes = 0;
uint64_t dbuf_metadata_cache_max_bytes = 0;
/* Set the default sizes of the caches to log2 fraction of arc size */
int dbuf_cache_shift = 5;
int dbuf_metadata_cache_shift = 6;

/*
* The dbuf cache uses a three-stage eviction policy:
* For diagnostic purposes, this is incremented whenever we can't add
* something to the metadata cache because it's full, and instead put
* the data in the regular dbuf cache.
*/
uint64_t dbuf_metadata_cache_overflow;

/*
* The LRU dbuf cache uses a three-stage eviction policy:
* - A low water marker designates when the dbuf eviction thread
* should stop evicting from the dbuf cache.
* - When we reach the maximum size (aka mid water mark), we
@@ -393,6 +428,41 @@ dbuf_is_metadata(dmu_buf_impl_t *db)
}
}

/*
* This returns whether this dbuf should be stored in the metadata cache, which
* is based on whether it's from one of the dnode types that store data related
* to traversing dataset hierarchies.
*/
static boolean_t
dbuf_include_in_metadata_cache(dmu_buf_impl_t *db)
{
DB_DNODE_ENTER(db);
dmu_object_type_t type = DB_DNODE(db)->dn_type;
DB_DNODE_EXIT(db);

/* Check if this dbuf is one of the types we care about */
if (DMU_OT_IS_METADATA_CACHED(type)) {
/* If we hit this, then we set something up wrong in dmu_ot */
ASSERT(DMU_OT_IS_METADATA(type));

/*
* Sanity check for small-memory systems: don't allocate too
* much memory for this purpose.
*/
if (refcount_count(&dbuf_caches[DB_DBUF_METADATA_CACHE].size) >
dbuf_metadata_cache_max_bytes) {
dbuf_metadata_cache_overflow++;
DTRACE_PROBE1(dbuf__metadata__cache__overflow,
dmu_buf_impl_t *, db);
return (B_FALSE);
}

return (B_TRUE);
}

return (B_FALSE);
}

/*
* This function *must* return indices evenly distributed between all
* sublists of the multilist. This is needed due to how the dbuf eviction
@@ -428,7 +498,7 @@ dbuf_cache_above_hiwater(void)
uint64_t dbuf_cache_hiwater_bytes =
(dbuf_cache_max_bytes * dbuf_cache_hiwater_pct) / 100;

return (refcount_count(&dbuf_cache_size) >
return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
dbuf_cache_max_bytes + dbuf_cache_hiwater_bytes);
}

@@ -438,7 +508,7 @@ dbuf_cache_above_lowater(void)
uint64_t dbuf_cache_lowater_bytes =
(dbuf_cache_max_bytes * dbuf_cache_lowater_pct) / 100;

return (refcount_count(&dbuf_cache_size) >
return (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
dbuf_cache_max_bytes - dbuf_cache_lowater_bytes);
}

@@ -448,8 +518,9 @@ dbuf_cache_above_lowater(void)
static void
dbuf_evict_one(void)
{
int idx = multilist_get_random_index(dbuf_cache);
multilist_sublist_t *mls = multilist_sublist_lock(dbuf_cache, idx);
int idx = multilist_get_random_index(dbuf_caches[DB_DBUF_CACHE].cache);
multilist_sublist_t *mls = multilist_sublist_lock(
dbuf_caches[DB_DBUF_CACHE].cache, idx);

ASSERT(!MUTEX_HELD(&dbuf_evict_lock));

@@ -472,8 +543,10 @@ dbuf_evict_one(void)
if (db != NULL) {
multilist_sublist_remove(mls, db);
multilist_sublist_unlock(mls);
(void) refcount_remove_many(&dbuf_cache_size,
(void) refcount_remove_many(&dbuf_caches[DB_DBUF_CACHE].size,
db->db.db_size, db);
ASSERT3U(db->db_caching_status, ==, DB_DBUF_CACHE);
db->db_caching_status = DB_NO_CACHE;
dbuf_destroy(db);
} else {
multilist_sublist_unlock(mls);
@@ -560,7 +633,8 @@ dbuf_evict_notify(void)
* because it's OK to occasionally make the wrong decision here,
* and grabbing the lock results in massive lock contention.
*/
if (refcount_count(&dbuf_cache_size) > dbuf_cache_max_bytes) {
if (refcount_count(&dbuf_caches[DB_DBUF_CACHE].size) >
dbuf_cache_max_bytes) {
if (dbuf_cache_above_hiwater())
dbuf_evict_one();
cv_signal(&dbuf_evict_cv);
@@ -600,26 +674,35 @@ dbuf_init(void)
mutex_init(&h->hash_mutexes[i], NULL, MUTEX_DEFAULT, NULL);

/*
* Setup the parameters for the dbuf cache. We set the size of the
* dbuf cache to 1/32nd (default) of the size of the ARC. If the value
* has been set in /etc/system and it's not greater than the size of
* the ARC, then we honor that value.
* Setup the parameters for the dbuf caches. We set the sizes of the
* dbuf cache and the metadata cache to 1/32nd and 1/64th (default)
* of the size of the ARC, respectively. If the values are set in
* /etc/system and they're not greater than the size of the ARC, then
* we honor that value.
*/
if (dbuf_cache_max_bytes == 0 ||
dbuf_cache_max_bytes >= arc_max_bytes()) {
dbuf_cache_max_bytes = arc_max_bytes() >> dbuf_cache_shift;
}
if (dbuf_metadata_cache_max_bytes == 0 ||
dbuf_metadata_cache_max_bytes >= arc_max_bytes()) {
dbuf_metadata_cache_max_bytes =
arc_max_bytes() >> dbuf_metadata_cache_shift;
}

/*
* All entries are queued via taskq_dispatch_ent(), so min/maxalloc
* configuration is not required.
*/
dbu_evict_taskq = taskq_create("dbu_evict", 1, minclsyspri, 0, 0, 0);

dbuf_cache = multilist_create(sizeof (dmu_buf_impl_t),
offsetof(dmu_buf_impl_t, db_cache_link),
dbuf_cache_multilist_index_func);
refcount_create(&dbuf_cache_size);
for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
dbuf_caches[dcs].cache =
multilist_create(sizeof (dmu_buf_impl_t),
offsetof(dmu_buf_impl_t, db_cache_link),
dbuf_cache_multilist_index_func);
refcount_create(&dbuf_caches[dcs].size);
}

tsd_create(&zfs_dbuf_evict_key, NULL);
dbuf_evict_thread_exit = B_FALSE;
@@ -653,8 +736,10 @@ dbuf_fini(void)
mutex_destroy(&dbuf_evict_lock);
cv_destroy(&dbuf_evict_cv);

refcount_destroy(&dbuf_cache_size);
multilist_destroy(dbuf_cache);
for (dbuf_cached_state_t dcs = 0; dcs < DB_CACHE_MAX; dcs++) {
refcount_destroy(&dbuf_caches[dcs].size);
multilist_destroy(dbuf_caches[dcs].cache);
}
}

/*
@@ -2037,9 +2122,15 @@ dbuf_destroy(dmu_buf_impl_t *db)
dbuf_clear_data(db);

if (multilist_link_active(&db->db_cache_link)) {
multilist_remove(dbuf_cache, db);
(void) refcount_remove_many(&dbuf_cache_size,
ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
db->db_caching_status == DB_DBUF_METADATA_CACHE);

multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
(void) refcount_remove_many(
&dbuf_caches[db->db_caching_status].size,
db->db.db_size, db);

db->db_caching_status = DB_NO_CACHE;
}

ASSERT(db->db_state == DB_UNCACHED || db->db_state == DB_NOFILL);
@@ -2093,6 +2184,7 @@ dbuf_destroy(dmu_buf_impl_t *db)
ASSERT(db->db_hash_next == NULL);
ASSERT(db->db_blkptr == NULL);
ASSERT(db->db_data_pending == NULL);
ASSERT3U(db->db_caching_status, ==, DB_NO_CACHE);
ASSERT(!multilist_link_active(&db->db_cache_link));

kmem_cache_free(dbuf_kmem_cache, db);
@@ -2231,6 +2323,7 @@ dbuf_create(dnode_t *dn, uint8_t level, uint64_t blkid,
ASSERT3U(db->db.db_size, >=, dn->dn_bonuslen);
db->db.db_offset = DMU_BONUS_BLKID;
db->db_state = DB_UNCACHED;
db->db_caching_status = DB_NO_CACHE;
/* the bonus dbuf is not placed in the hash table */
arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);
return (db);
@@ -2263,6 +2356,7 @@ dbuf_create(dnode_t *dn, uint8_t level, uint64_t blkid,
avl_add(&dn->dn_dbufs, db);

db->db_state = DB_UNCACHED;
db->db_caching_status = DB_NO_CACHE;
mutex_exit(&dn->dn_dbufs_mtx);
arc_space_consume(sizeof (dmu_buf_impl_t), ARC_SPACE_OTHER);

@@ -2597,9 +2691,15 @@ dbuf_hold_impl(dnode_t *dn, uint8_t level, uint64_t blkid,

if (multilist_link_active(&db->db_cache_link)) {
ASSERT(refcount_is_zero(&db->db_holds));
multilist_remove(dbuf_cache, db);
(void) refcount_remove_many(&dbuf_cache_size,
ASSERT(db->db_caching_status == DB_DBUF_CACHE ||
db->db_caching_status == DB_DBUF_METADATA_CACHE);

multilist_remove(dbuf_caches[db->db_caching_status].cache, db);
(void) refcount_remove_many(
&dbuf_caches[db->db_caching_status].size,
db->db.db_size, db);

db->db_caching_status = DB_NO_CACHE;
}
(void) refcount_add(&db->db_holds, tag);
DBUF_VERIFY(db);
@@ -2816,12 +2916,22 @@ dbuf_rele_and_unlock(dmu_buf_impl_t *db, void *tag)
db->db_pending_evict) {
dbuf_destroy(db);
} else if (!multilist_link_active(&db->db_cache_link)) {
multilist_insert(dbuf_cache, db);
(void) refcount_add_many(&dbuf_cache_size,
ASSERT3U(db->db_caching_status, ==,
DB_NO_CACHE);

dbuf_cached_state_t dcs =
dbuf_include_in_metadata_cache(db) ?
DB_DBUF_METADATA_CACHE : DB_DBUF_CACHE;
db->db_caching_status = dcs;

multilist_insert(dbuf_caches[dcs].cache, db);
(void) refcount_add_many(&dbuf_caches[dcs].size,
db->db.db_size, db);
mutex_exit(&db->db_mtx);

dbuf_evict_notify();
if (db->db_caching_status == DB_DBUF_CACHE) {
dbuf_evict_notify();
}
}

if (do_arc_evict)