zdb fails to import pool because asize < vdev_min_asize in draid top-level vdev #11459

@mmaybee

Description

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 18.04
Linux Kernel 5.4.0-42
Architecture x86
ZFS Version 2.0++
Commit 8f158ae

Describe the problem you're observing

A zloop run failed without producing a core file. ztest.out shows that the failure comes from zdb (attempting to verify a pool) returning an error (EINVAL):

Executing zdb -bccsv -G -d -Y -e -y -p /var/tmp/os-ztest/zloop-run ztest
zdb: can't open 'ztest': Invalid argument

% grep -i einval /usr/include/asm-generic/errno-base.h
#define EINVAL          22      /* Invalid argument */

The actual error comes from a call to vdev_open(), which occurs prior to the call to spa_load_failed() that generated the error message above. Within vdev_open(), EINVAL is returned because asize < vdev_min_asize for a top-level vdev:

        /*
         * Make sure the allocatable size hasn't shrunk too much.
         */
        if (asize < vd->vdev_min_asize) {
                vdev_set_state(vd, B_TRUE, VDEV_STATE_CANT_OPEN,
                    VDEV_AUX_BAD_LABEL);
                return (SET_ERROR(EINVAL));
        }

Here are the current values being compared (note that asize comes from osize in this function):

(gdb) print osize
$21 = 4798283776
(gdb) print vd->vdev_min_asize
$22 = 4831838208

The top-level vdev here is of type draid. The asize of a draid vdev is computed by summing the asizes of its children (minus the space reserved for distributed spares). Note that a spare device is currently deployed in this top-level vdev as child vdev 4:

vdev.c:195:vdev_dbgmsg_print_tree():   vdev 0: root, guid: 7397569113689487881, path: N/A, can't open
vdev.c:195:vdev_dbgmsg_print_tree():     vdev 0: draid, guid: 4525148408276004100, path: N/A, can't open
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 0: file, guid: 16492265217803933395, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.0a, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 1: file, guid: 13189481552791461187, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.1a, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 2: file, guid: 1960318212727225725, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.2b, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 3: file, guid: 795303241160842783, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.3a, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 4: spare, guid: 17473580923192177435, path: N/A, healthy
vdev.c:195:vdev_dbgmsg_print_tree():         vdev 0: file, guid: 5222868485229018091, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.4a, healthy
vdev.c:195:vdev_dbgmsg_print_tree():         vdev 1: dspare, guid: 2629688768943859102, path: draid2-0-1, healthy
vdev.c:195:vdev_dbgmsg_print_tree():       vdev 5: file, guid: 17532226906533716578, path: /net/pharos/export/bugs/DLPX-73135/vdev/ztest.5a, healthy
...

Looking at the sizes for this spare and its children we see:

(gdb) print vd->vdev_child[4]->vdev_asize
$17 = 483131392
(gdb) print vd->vdev_child[4]->vdev_children
$18 = 2
(gdb) print vd->vdev_child[4]->vdev_child[0]->vdev_asize
$19 = 532152320
(gdb) print vd->vdev_child[4]->vdev_child[1]->vdev_asize
$20 = 483131392

The asize of the dspare is significantly smaller than the asize of the device it is sparing! The spare parent vdev reports the smaller of the two sizes. All other child vdevs in this draid report the larger size:

(gdb) set $i = 0
(gdb) p vd->vdev_child[$i++]->vdev_asize
$36 = 532152320
(gdb) 
$37 = 532152320
(gdb) 
$38 = 483131392
(gdb) 
$39 = 532152320
(gdb) 
$40 = 483131392
(gdb) 
$41 = 532152320
(gdb) 
$42 = 532152320
(gdb) 
$43 = 532152320
(gdb) 
$44 = 532152320
(gdb) 
$45 = 532152320
(gdb) 
$46 = 532152320
(gdb) 
$47 = 532152320
(gdb) 
$48 = 532152320

This difference in asize likely explains the unexpectedly small asize of the top-level vdev that generated this error. However, more investigation is needed to determine why the dspare is smaller than expected here.

Describe how to reproduce the problem

This is reproducible with zloop.
