9112 Improve allocation performance on high-end systems
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>

Overview
========

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time.  Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group.  There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch.  The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations.  If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.
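
To make the shape of this concrete, the sketch below shows the
per-allocator state in a condensed, standalone form. The field names
follow the patch, but the types are stand-ins and the real selection
and steal logic in metaslab.c is far more involved than shown here.

    /* Condensed sketch; real definitions live in metaslab_impl.h. */
    typedef struct metaslab metaslab_t;

    typedef struct metaslab_group {
        int        mg_allocators;    /* from the spa_allocators tunable */
        metaslab_t **mg_primaries;   /* one active primary per allocator */
        metaslab_t **mg_secondaries; /* one active secondary per allocator */
    } metaslab_group_t;

    /*
     * Each allocating thread maps to a single allocator, so threads no
     * longer serialize on one active metaslab per metaslab group.
     */
    static metaslab_t *
    mg_active_metaslab(metaslab_group_t *mg, int allocator, int want_secondary)
    {
        return (want_secondary ?
            mg->mg_secondaries[allocator] : mg->mg_primaries[allocator]);
    }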

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk.  As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
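
A minimal sketch of that ramp, with hypothetical names and plain types
standing in for the patch's actual zio and metaslab plumbing:

    #include <stdint.h>

    /* Hypothetical stand-in for the patch's per-allocator queue state. */
    typedef struct alloc_queue {
        uint64_t aq_inflight;      /* allocating IOs currently issued */
        uint64_t aq_cur_max_depth; /* ramped cap, reset each txg */
        uint64_t aq_max_depth;     /* hard ceiling on the ramp */
    } alloc_queue_t;

    /* Every txg starts from a modest depth to keep txg times bounded. */
    static void
    alloc_queue_txg_reset(alloc_queue_t *aq, uint64_t base_depth)
    {
        aq->aq_cur_max_depth = base_depth;
    }

    /*
     * Each IO completion raises the cap, so busy allocators earn deeper
     * queues (and better coalescing) as the txg progresses.
     */
    static void
    alloc_queue_io_done(alloc_queue_t *aq)
    {
        aq->aq_inflight--;
        if (aq->aq_cur_max_depth < aq->aq_max_depth)
            aq->aq_cur_max_depth++;
    }

    /* The throttle admits a new allocation only under the current cap. */
    static int
    alloc_queue_may_issue(const alloc_queue_t *aq)
    {
        return (aq->aq_inflight < aq->aq_cur_max_depth);
    }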

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.
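
In sketch form, the spa ends up with one lock/tree pair per allocator,
and each allocating IO is mapped onto one of them. The pthread mutex
and the particular hash below are illustrative stand-ins, not the
kernel code:

    #include <pthread.h>
    #include <stdint.h>

    typedef struct avl_tree avl_tree_t; /* stand-in for the kernel AVL tree */

    typedef struct spa {
        int             spa_alloc_count;  /* number of allocators */
        pthread_mutex_t *spa_alloc_locks; /* one lock per allocator... */
        avl_tree_t      *spa_alloc_trees; /* ...guarding one queue each */
    } spa_t;

    /*
     * Map an allocating IO onto an allocator.  Any stable hash of the
     * block's identity spreads unrelated IOs across the queues; this
     * particular mixing is illustrative only.
     */
    static int
    zio_pick_allocator(const spa_t *spa, uint64_t objset, uint64_t object,
        uint64_t blkid)
    {
        uint64_t h = (objset * 2654435761ULL) ^ (object * 40503ULL) ^ blkid;
        return ((int)(h % (uint64_t)spa->spa_alloc_count));
    }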

Performance Results
===================

Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24-core NUMA system with 256 GB of RAM and eight
128 GB SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized.  This change has been prototyped and showed a
performance improvement, but more work is needed to verify its
stability before it is ready to be upstreamed.

Closes #548
pcd1193182 authored and Prakash Surya committed Apr 4, 2018
1 parent 90a56e6 commit 3f3cc3c
Showing 16 changed files with 628 additions and 218 deletions.
11 changes: 7 additions & 4 deletions usr/src/cmd/mdb/common/modules/zfs/zfs.c
@@ -21,8 +21,8 @@
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
  * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
+ * Copyright (c) 2011, 2018 by Delphix. All rights reserved.
  * Copyright (c) 2017, Joyent, Inc. All rights reserved.
- * Copyright (c) 2011, 2017 by Delphix. All rights reserved.
 */

 /* Portions Copyright 2010 Robert Milkowski */
@@ -1727,6 +1727,7 @@ typedef struct mdb_metaslab_alloc_trace {
 	uint64_t mat_weight;
 	uint64_t mat_offset;
 	uint32_t mat_dva_id;
+	int mat_allocator;
 } mdb_metaslab_alloc_trace_t;

 static void
@@ -1799,8 +1800,9 @@ metaslab_trace(uintptr_t addr, uint_t flags, int argc, const mdb_arg_t *argv)
 	}

 	if (!(flags & DCMD_PIPE_OUT) && DCMD_HDRSPEC(flags)) {
-		mdb_printf("%<u>%6s %6s %8s %11s %18s %18s%</u>\n",
-		    "MSID", "DVA", "ASIZE", "WEIGHT", "RESULT", "VDEV");
+		mdb_printf("%<u>%6s %6s %8s %11s %11s %18s %18s%</u>\n",
+		    "MSID", "DVA", "ASIZE", "ALLOCATOR", "WEIGHT", "RESULT",
+		    "VDEV");
 	}

 	if (mat.mat_msp != NULL) {
@@ -1815,7 +1817,8 @@ metaslab_trace(uintptr_t addr, uint_t flags, int argc, const mdb_arg_t *argv)
 		mdb_printf("%6s ", "-");
 	}

-	mdb_printf("%6d %8llx ", mat.mat_dva_id, mat.mat_size);
+	mdb_printf("%6d %8llx %11llx ", mat.mat_dva_id, mat.mat_size,
+	    mat.mat_allocator);

 	metaslab_print_weight(mat.mat_weight);
@@ -12,7 +12,7 @@
 #

 #
-# Copyright (c) 2017 by Delphix. All rights reserved.
+# Copyright (c) 2016, 2018 by Delphix. All rights reserved.
 #

 . $STF_SUITE/tests/functional/cli_root/zpool_import/zpool_import.kshlib
@@ -49,6 +49,7 @@ function custom_cleanup
 	set_vdev_validate_skip 0
 	cleanup
 	log_must mdb_ctf_set_int vdev_min_ms_count 0t16
+	log_must mdb_ctf_set_int spa_allocators 0t4
 }

 log_onexit custom_cleanup
@@ -207,6 +208,10 @@ increase_device_sizes $(( FILE_SIZE * 4 ))
 # reduce the chance of reusing a metaslab that holds old MOS metadata.
 log_must mdb_ctf_set_int vdev_min_ms_count 0t150

+# Decrease the number of allocators for pools created during this test,
+# to increase the odds that metadata survives from old txgs.
+log_must mdb_ctf_set_int spa_allocators 0t1
+
 # Part of the rewind test is to see how it reacts to path changes
 typeset pathstochange="$VDEV0 $VDEV1 $VDEV2 $VDEV3"
14 changes: 12 additions & 2 deletions usr/src/test/zfs-tests/tests/functional/slog/slog_014_pos.ksh
@@ -26,7 +26,7 @@
 #

 #
-# Copyright (c) 2013, 2016 by Delphix. All rights reserved.
+# Copyright (c) 2013, 2018 by Delphix. All rights reserved.
 #

 . $STF_SUITE/tests/functional/slog/slog.kshlib
@@ -52,9 +52,19 @@ do
 	log_must zpool create $TESTPOOL $type $VDEV $spare $SDEV \
 	    log $LDEV

+	# Create a file to be corrupted
+	dd if=/dev/urandom of=/$TESTPOOL/filler bs=1024k count=50
+
+	#
+	# Ensure the file has been synced out before attempting to
+	# corrupt its contents.
+	#
+	sync
+
+	#
 	# Corrupt a pool device to make the pool DEGRADED
-	dd if=/dev/urandom of=/$TESTPOOL/filler bs=1024k count=50
+	# The oseek value below is to skip past the vdev label.
+	#
 	log_must dd if=/dev/urandom of=$VDIR/a bs=1024k oseek=4 \
 	    conv=notrunc count=50
 	log_must zpool scrub $TESTPOOL
