ZFS Interface for Accelerators (Z.I.A.)

The ZIO pipeline has been modified to allow for external, alternative
implementations of existing operations to be used. The original ZFS
functions remain in the code as a fallback in case the external
implementation fails.

Definitions:
    Accelerator - an entity (usually hardware) that is intended to
                  accelerate operations
    Offloader   - synonym of accelerator; used interchangeably
    Data Processing Unit Services Module (DPUSM)
                - https://github.com/hpc/dpusm
                - defines a "provider API" for accelerator
                  vendors to set up
                - defines a "user API" for accelerator consumers
                  to call
                - maintains list of providers and coordinates
                  interactions between providers and consumers.
    Provider    - a DPUSM wrapper for an accelerator's API
    Offload     - moving data from ZFS/memory to the accelerator
    Onload      - the opposite of offload

In order for Z.I.A. to be extensible, it does not directly communicate
with a fixed accelerator. Rather, Z.I.A. acquires a handle to a DPUSM,
which is then used to acquire handles to providers.

Using ZFS with Z.I.A.:
    1. Build and start the DPUSM
    2. Implement, build, and register a provider with the DPUSM
    3. Reconfigure ZFS with '--with-zia=<DPUSM root>'
    4. Rebuild and start ZFS
    5. Create a zpool
    6. Select the provider
           zpool set zia_provider=<provider name> <zpool>
    7. Select operations to offload
           zpool set zia_<property>=on <zpool>
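
A hypothetical end-to-end run of the steps above might look like this. Paths, the pool layout, and the module/provider names are assumptions; only `--with-zia` and the `zpool set zia_*` properties come from this commit:

```shell
git clone https://github.com/hpc/dpusm
make -C dpusm && sudo insmod dpusm/dpusm.ko    # 1. build and start the DPUSM
sudo insmod my-provider.ko                     # 2. register a provider (vendor-specific)
./configure --with-zia=/path/to/dpusm          # 3. reconfigure ZFS
make -s -j$(nproc) && sudo ./scripts/zfs.sh    # 4. rebuild and start ZFS
sudo zpool create tank raidz1 sdb sdc sdd      # 5. create a zpool
sudo zpool set zia_provider=my-provider tank   # 6. select the provider
sudo zpool set zia_compress=on tank            # 7. offload compression
```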

The operations that have been modified are:
    - compression
        - non-raw-writes only
    - decompression
    - checksum
        - not handling embedded checksums
        - checksum compute and checksum error call the same function
    - raidz
        - generation
        - reconstruction
    - vdev_file
        - open
        - write
        - close
    - vdev_disk
        - open
        - invalidate
        - write
        - flush
        - close

Successful operations do not bring data back into memory after they
complete, allowing subsequent offloader operations to reuse the
data. The result is a single data movement per ZIO: the transfer at
the beginning of the pipeline needed to get data from ZFS to the
accelerator.

When errors occur and the offloaded data is still accessible, the
offloaded data will be onloaded (or dropped if it still matches the
in-memory copy) for that ZIO pipeline stage and processed with
ZFS. This will cause thrashing if a later operation offloads the
data again, but it should not happen often, as constant errors
(and the resulting data movement) are not expected to be the norm.

Unrecoverable errors such as hardware failures will trigger pipeline
restarts (if necessary) in order to complete the original ZIO using
the software path.

The modifications to ZFS can be thought of as changes to two pipelines:
    - The ZIO write pipeline
        - compression, checksum, RAIDZ generation, and write
        - Each stage starts by offloading data that was not previously
          offloaded
            - This allows for ZIOs to be offloaded at any point in
              the pipeline
    - Resilver
        - vdev_raidz_io_done (RAIDZ reconstruction, checksum, and
          RAIDZ generation), and write
        - Because the core of resilver is vdev_raidz_io_done, data is
          only offloaded once at the beginning of vdev_raidz_io_done
            - Errors cause data to be onloaded, but will not
              re-offload in subsequent steps within resilver
            - Write is a separate ZIO pipeline stage, so it will
              attempt to offload data

The zio_decompress function has been modified to allow for offloading
but the ZIO read pipeline as a whole has not, so it is not part of the
above list.

An example provider implementation can be found in module/zia-software-provider
    - The provider's "hardware" is actually software - data is
      "offloaded" to memory not owned by ZFS
    - Calls ZFS functions in order to not reimplement operations
    - Has kernel module parameters that can be used to trigger
      ZIA_ACCELERATOR_DOWN states for testing pipeline restarts.

abd_t, raidz_row_t, and vdev_t have each been given an additional
"void *<prefix>_zia_handle" member. These opaque handles point to data
that is located on an offloader. abds are still allocated, but their
contents are expected to diverge from the offloaded copy as operations
are run.

ARC compression is disabled when Z.I.A. is configured.

Encryption and deduplication are disabled for zpools with
Z.I.A. operations enabled.

Aggregation is disabled for offloaded abds.

RPMs will build with Z.I.A.

Signed-off-by: Jason Lee <jasonlee@lanl.gov>
calccrypto committed Mar 22, 2023
1 parent 9fa007d commit aee2aa6
Showing 46 changed files with 4,998 additions and 42 deletions.
2 changes: 2 additions & 0 deletions Makefile.am
@@ -57,6 +57,8 @@ dist_noinst_DATA += module/os/linux/spl/THIRDPARTYLICENSE.gplv2
dist_noinst_DATA += module/os/linux/spl/THIRDPARTYLICENSE.gplv2.descrip
dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.cityhash
dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.cityhash.descrip
dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.zia
dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.zia.descrip

@CODE_COVERAGE_RULES@

3 changes: 3 additions & 0 deletions cmd/raidz_test/raidz_test.c
@@ -453,6 +453,9 @@ vdev_raidz_map_alloc_expanded(abd_t *abd, uint64_t size, uint64_t offset,
rr->rr_firstdatacol = nparity;
rr->rr_abd_empty = NULL;
rr->rr_nempty = 0;
#ifdef ZIA
rr->rr_zia_handle = NULL;
#endif

for (int c = 0; c < rr->rr_cols; c++, child_id++) {
if (child_id >= row_phys_cols) {
1 change: 1 addition & 0 deletions config/Rules.am
@@ -40,6 +40,7 @@ AM_CPPFLAGS += -DPKGDATADIR=\"$(pkgdatadir)\"
AM_CPPFLAGS += $(DEBUG_CPPFLAGS)
AM_CPPFLAGS += $(CODE_COVERAGE_CPPFLAGS)
AM_CPPFLAGS += -DTEXT_DOMAIN=\"zfs-@ac_system_l@-user\"
AM_CPPFLAGS += $(ZIA_CPPFLAGS)

AM_CPPFLAGS_NOCHECK = -D"strtok(...)=strtok(__VA_ARGS__) __attribute__((deprecated(\"Use strtok_r(3) instead!\")))"
AM_CPPFLAGS_NOCHECK += -D"__xpg_basename(...)=__xpg_basename(__VA_ARGS__) __attribute__((deprecated(\"basename(3) is underspecified. Use zfs_basename() instead!\")))"
9 changes: 8 additions & 1 deletion config/zfs-build.m4
@@ -263,6 +263,8 @@ AC_DEFUN([ZFS_AC_CONFIG], [
AC_SUBST(TEST_JOBS)
])
ZFS_AC_ZIA
ZFS_INIT_SYSV=
ZFS_INIT_SYSTEMD=
ZFS_WANT_MODULES_LOAD_D=
@@ -294,7 +296,8 @@ AC_DEFUN([ZFS_AC_CONFIG], [
[test "x$qatsrc" != x ])
AM_CONDITIONAL([WANT_DEVNAME2DEVID], [test "x$user_libudev" = xyes ])
AM_CONDITIONAL([WANT_MMAP_LIBAIO], [test "x$user_libaio" = xyes ])
AM_CONDITIONAL([PAM_ZFS_ENABLED], [test "x$enable_pam" = xyes])
AM_CONDITIONAL([PAM_ZFS_ENABLED], [test "x$enable_pam" = xyes ])
AM_CONDITIONAL([ZIA_ENABLED], [test "x$enable_zia" = xyes ])
])

dnl #
@@ -342,6 +345,10 @@ AC_DEFUN([ZFS_AC_RPM], [
RPM_DEFINE_COMMON=${RPM_DEFINE_COMMON}' --define "__strip /bin/true"'
])
AS_IF([test "x$enable_zia" = xyes], [
RPM_DEFINE_COMMON=${RPM_DEFINE_COMMON}' --define "$(WITH_ZIA) 1" --define "DPUSM_ROOT $(DPUSM_ROOT)"'
])
RPM_DEFINE_UTIL=' --define "_initconfdir $(initconfdir)"'
dnl # Make the next three RPM_DEFINE_UTIL additions conditional, since
42 changes: 42 additions & 0 deletions config/zia.m4
@@ -0,0 +1,42 @@
dnl # Adds --with-zia=PATH to configuration options
dnl # The path provided should point to the DPUSM
dnl # root and contain Module.symvers.
AC_DEFUN([ZFS_AC_ZIA], [
AC_ARG_WITH([zia],
AS_HELP_STRING([--with-zia=PATH],
[Path to Data Processing Unit Services Module]),
[
DPUSM_ROOT="$withval"
enable_zia=yes
]
)
AS_IF([test "x$enable_zia" = "xyes"],
AS_IF([! test -d "$DPUSM_ROOT"],
[AC_MSG_ERROR([--with-zia=PATH requires the DPUSM root directory])]
)
DPUSM_SYMBOLS="$DPUSM_ROOT/Module.symvers"
AS_IF([test -r $DPUSM_SYMBOLS],
[
AC_MSG_RESULT([$DPUSM_SYMBOLS])
ZIA_CPPFLAGS="-DZIA=1 -I$DPUSM_ROOT/include"
KERNEL_ZIA_CPPFLAGS="-DZIA=1 -I$DPUSM_ROOT/include"
WITH_ZIA="_with_zia"
AC_SUBST(WITH_ZIA)
AC_SUBST(KERNEL_ZIA_CPPFLAGS)
AC_SUBST(ZIA_CPPFLAGS)
AC_SUBST(DPUSM_SYMBOLS)
AC_SUBST(DPUSM_ROOT)
],
[
AC_MSG_ERROR([
*** Failed to find Module.symvers in:
$DPUSM_SYMBOLS
])
]
)
)
])
3 changes: 3 additions & 0 deletions include/Makefile.am
@@ -140,6 +140,9 @@ COMMON_H = \
sys/zfs_vfsops.h \
sys/zfs_vnops.h \
sys/zfs_znode.h \
sys/zia.h \
sys/zia_cddl.h \
sys/zia_private.h \
sys/zil.h \
sys/zil_impl.h \
sys/zio.h \
3 changes: 3 additions & 0 deletions include/sys/abd.h
@@ -75,6 +75,9 @@ typedef struct abd {
list_t abd_gang_chain;
} abd_gang;
} abd_u;
#ifdef ZIA
void *abd_zia_handle;
#endif
} abd_t;

typedef int abd_iter_func_t(void *buf, size_t len, void *priv);
14 changes: 14 additions & 0 deletions include/sys/fs/zfs.h
@@ -256,6 +256,20 @@ typedef enum {
ZPOOL_PROP_BCLONEUSED,
ZPOOL_PROP_BCLONESAVED,
ZPOOL_PROP_BCLONERATIO,
#ifdef ZIA
ZPOOL_PROP_ZIA_PROVIDER,
ZPOOL_PROP_ZIA_COMPRESS,
ZPOOL_PROP_ZIA_DECOMPRESS,
ZPOOL_PROP_ZIA_CHECKSUM,
ZPOOL_PROP_ZIA_RAIDZ1_GEN,
ZPOOL_PROP_ZIA_RAIDZ2_GEN,
ZPOOL_PROP_ZIA_RAIDZ3_GEN,
ZPOOL_PROP_ZIA_RAIDZ1_REC,
ZPOOL_PROP_ZIA_RAIDZ2_REC,
ZPOOL_PROP_ZIA_RAIDZ3_REC,
ZPOOL_PROP_ZIA_FILE_WRITE,
ZPOOL_PROP_ZIA_DISK_WRITE,
#endif
ZPOOL_NUM_PROPS
} zpool_prop_t;

8 changes: 8 additions & 0 deletions include/sys/spa_impl.h
@@ -53,6 +53,10 @@
#include <sys/dsl_deadlist.h>
#include <zfeature_common.h>

#ifdef ZIA
#include <sys/zia.h>
#endif

#ifdef __cplusplus
extern "C" {
#endif
@@ -443,6 +447,10 @@ struct spa {
zfs_refcount_t spa_refcount; /* number of opens */

taskq_t *spa_upgrade_taskq; /* taskq for upgrade jobs */

#ifdef ZIA
zia_props_t spa_zia_props;
#endif
};

extern char *spa_config_path;
7 changes: 7 additions & 0 deletions include/sys/vdev_disk.h
@@ -42,5 +42,12 @@

#ifdef _KERNEL
#include <sys/vdev.h>

#ifdef ZIA
int __vdev_disk_physio(struct block_device *bdev, zio_t *zio,
size_t io_size, uint64_t io_offset, int rw, int flags);
int vdev_disk_io_flush(struct block_device *bdev, zio_t *zio);
void vdev_disk_error(zio_t *zio);
#endif /* ZIA */
#endif /* _KERNEL */
#endif /* _SYS_VDEV_DISK_H */
3 changes: 3 additions & 0 deletions include/sys/vdev_impl.h
@@ -477,6 +477,9 @@ struct vdev {
uint64_t vdev_checksum_t;
uint64_t vdev_io_n;
uint64_t vdev_io_t;
#ifdef ZIA
void *vdev_zia_handle;
#endif
};

#define VDEV_PAD_SIZE (8 << 10)
7 changes: 7 additions & 0 deletions include/sys/vdev_raidz.h
@@ -70,6 +70,13 @@ typedef struct vdev_raidz {
int vd_nparity;
} vdev_raidz_t;

#ifdef ZIA
void vdev_raidz_generate_parity_p(struct raidz_row *);
void vdev_raidz_generate_parity_pq(struct raidz_row *);
void vdev_raidz_generate_parity_pqr(struct raidz_row *);
void vdev_raidz_reconstruct_general(struct raidz_row *, int *, int);
#endif

#ifdef __cplusplus
}
#endif
3 changes: 3 additions & 0 deletions include/sys/vdev_raidz_impl.h
@@ -129,6 +129,9 @@ typedef struct raidz_row {
#ifdef ZFS_DEBUG
uint64_t rr_offset; /* Logical offset for *_io_verify() */
uint64_t rr_size; /* Physical size for *_io_verify() */
#endif
#ifdef ZIA
void *rr_zia_handle;
#endif
raidz_col_t rr_col[0]; /* Flexible array of I/O columns */
} raidz_row_t;
