zpool reopen should detect expanded devices #7546
Conversation
Looking at the Ubuntu 18.04 tests, I think it's saying that it hung after 70 minutes without any new output. The last thing in the log is: Tip: to view the log, use
Often in cases like this the hang is because an ASSERT/VERIFY was hit, which by default isn't configured to panic the system and only halts the offending process for analysis. When this happens, if the bot was able to dump its console logs you should be able to get a stack trace. For example, in this case the CentOS 7 TEST builder's console log has the reason why.
I've been trying to reproduce these failures locally on an Ubuntu 17.10 VMware VM. So far I haven't been able to get a clean run of the
include/sys/vdev.h
Outdated
 * we will still be 1m aligned. Some devices are sensitive to the
 * partition ending alignment as well.
 */
#define NEW_START_BLOCK 2048
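For reference, the arithmetic behind the "1m aligned" comment, assuming 512-byte sectors (the same assumption behind the << 9 shifts elsewhere in this change): 2048 sectors * 512 bytes/sector = 1,048,576 bytes = 1 MiB, so a data partition starting at sector 2048 stays 1 MiB aligned regardless of where the label ends.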
Nit. Would vdev_disk.h be OK for this disk-specific content?
Seems reasonable
/* Otherwise assume the full device capacity */
return (get_capacity(bdev->bd_disk) << 9);
}
Disks can also be partition-less, typically seen in multi-path configurations. Not sure if we need a third case here?
I think it makes sense to have a third case, at least to match the behavior of the previous implementation. I'm thinking something like this:
if (part)
        if (wholedev)
                // corrected capacity
        else
                // partsize
else
        get_capacity(bdev->bd_disk)
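As a hedged illustration of that three-case structure (the helper name, the wholedisk parameter, and the exact EFI/alignment correction are assumptions here, not the final patch):

static uint64_t
bdev_capacity_sketch(struct block_device *bdev, boolean_t wholedisk)
{
	struct hd_struct *part = bdev->bd_part;
	uint64_t sectors = get_capacity(bdev->bd_disk);

	/* Partition-less device (e.g. multipath): use the full capacity */
	if (part == NULL)
		return (sectors << 9);

	if (wholedisk) {
		/*
		 * Whole-disk vdev: report the device capacity corrected for
		 * the reserved EFI partition and alignment padding, so growth
		 * of the underlying device becomes visible.
		 */
		uint64_t used = NEW_START_BLOCK + EFI_MIN_RESV_SIZE;
		return ((sectors - used) << 9);
	}

	/* Single-partition vdev: only the capacity of this partition */
	return (part->nr_sects << 9);
}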
092e34b to e460842 (force-pushed)
module/zfs/vdev_disk.c
Outdated
	return (available << 9);
} else {
	/* The partition capacity referenced by the block device */
	return (part->nr_sects << 9);
nit: while we're here let's use SECTOR_BITS instead of 9. It's defined by the kernel and will always be 9.
Sounds good
module/zfs/vdev_disk.c
Outdated
/* Sectors available to the vdev */
uint64_t available = sectors - used;
/* Bytes available to the vdev */
return (available << 9);
Even though wholedisk is set we should add an additional sanity check to verify we never expand into an existing partition. You can use bdev->bd_contains to access the full block device (/dev/sda) instead of the partition.
The on-disk partition table also somehow needs to be updated to reflect the new expanded size. Otherwise, there's no visibility from user space that this space is in use and it will be reported by gparted and other tools as free space. Since the kernel does support online resizing of partitions, have you considered instead updating the on-disk partition table and then triggering a rescan with __blkdev_reread_part()? That should result in an updated part->nr_sects which reflects the new expanded size.
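For reference, a minimal sketch of that rescan idea, assuming a 4.x-era kernel where __blkdev_reread_part() is exported and expects the whole-device bdev with bd_mutex held; the wrapper name is hypothetical and, as discussed below, this is not the approach the change ends up taking:

/*
 * Hypothetical rescan after the on-disk GPT has been rewritten, so the
 * kernel's view (part->nr_sects) picks up the new size.
 */
static int
vdev_disk_rescan_parts(struct block_device *bdev)
{
	struct block_device *whole = bdev->bd_contains;
	int error;

	mutex_lock(&whole->bd_mutex);
	error = __blkdev_reread_part(whole);
	mutex_unlock(&whole->bd_mutex);

	return (error);
}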
- Sure - I'll look into that.
- It's my understanding that the partition table should not be updated until zfs has decided to actually use the new space (i.e. the user runs zpool online -e), at which point the partition is relabeled by zpool_relabel_disk in zpool_vdev_online. I verified that parted showed unused, extra space after the new space was detected but showed the new-sized partition (with no unused space) after I ran zpool online -e. I tried this workflow on illumos for comparison and the behavior seems similar: fdisk did not see the new space until it was explicitly onlined. Granted, this was all tested with autoexpand set to off, but I can test autoexpand cases as well.

I did consider triggering a scan from vdev_disk_open with __blkdev_reread_part() but I don't think that makes sense given what I wrote above. Also, initial attempts at that failed because the BLKRRPART ioctl returns EBUSY for open devices, making it unusable for hot-expansion.
zpool online -e
Ahh I see. OK, that's reasonable - I'd forgotten that zpool online -e would do the relabel. In that case I think the only remaining issue is to make sure we don't accidentally stomp on an existing partition (1. above).
Cool. Should that check be something like bdev == bdev->bd_contains - to make sure that this is the "whole" disk? I'm a little unsure if that makes sense, because a zfs wholedisk configured device should actually have two partitions. Do we want to check that there are only two partitions on the vdev, one of which is the EFI partition size? Or is there some other check I'm not thinking of?
I think we'd ideally like to scan the kernel's version of the partition table for conflicting partitions. But it looks like the functions needed to easily do this aren't available. So why don't we go with your suggestion and simply verify bdev != bdev->bd_contains and bdev->bd_contains->bd_part_count == 2: the primary EFI partition and legacy partition 9.
Speaking of which, is partition 9 recreated when zpool online -e is run and moved to the new end of the device? What is the illumos behavior here?
One other thing. With your fix in place we should be able to re-enable the zpool_expand_001_pos and zpool_expand_003_neg.ksh test cases, which are currently being skipped.
The expansion tests currently create pools on zvols, which can occasionally result in a deadlock, so you might need to adapt them to use a scsi_debug or loopback device. There are some helper functions for this in blkdev.shlib. I believe you'll also need to create a zedlet script for the zed to run zpool online -e when the expansion event is received.
I'm not sure of the benefit of this kind of check. We only call zpool_relabel_disk if wholedisk is true. My understanding is that if wholedisk is true, ZFS believes that it is the only user of the device - let me know if this is wrong.
zpool_relabel_disk is the only caller of efi_use_whole_disk, and in efi_use_whole_disk we assume that the reserved partition is the very last partition and that the zfs data partition is the one that comes just before it. That means that if we reach zpool_relabel_disk with a configuration like the following, it would be misinterpreted, with B as reserved and reserved as ZFS data. The only way to tell something was wrong would be to notice that there are more than 2 partitions - i.e. we were wrong about the value of wholedisk.
-------------------------------------------------------------------------------
| A | ZFS Data | reserved | B |
-------------------------------------------------------------------------------
In the following case, where efi_use_whole_disk gets the correct values for ZFS data and reserved, a check which verified that the only partition in the expanded region is the reserved partition would be true by definition, since it's already been defined as the last partition. Should this type of configuration be valid for a ZFS wholedisk setup?
---------------------------------------------------------------------
| A | ZFS Data | reserved |
---------------------------------------------------------------------
My core concern is that we identify partitions based on their relative position to each other and to the end of the partition table, so checking that one only extends into the other doesn't seem meaningful.
Are there other issues that you think we could catch here?
===============
I'll take a look at the zpool expand tests and try to evaluate how much work it would be to get them operational - if it ends up being really involved I might not be able to get to them for a while.
Though apparently efi_nparts is 128 - so we just care about `physically non-zero partitions'.
if wholedisk is true ZFS believes that it is the only user of the device - let me know if this is wrong.
This is definitely what ZFS believes; unfortunately, that doesn't necessarily make it true. The end goal is just to add some logic to verify this is the case before doing the expansion so we do no harm, even if the admin did something like expand the device and manually create a new partition+ext4 filesystem in the B space from your example above.
My core concern is that we identify partitions based on their relative position to each other and to the end of the partition table.
That's definitely not ideal, but it looks like we can improve things in this area. Since the reserved partition is created as part of zpool create, when wholedisk is set, we have a little more information we can use to uniquely identify it. Then we won't have to rely on it being the last partition. We know (see the sketch after this list):
- vtoc->efi_parts[].p_size == EFI_MIN_RESV_SIZE, and
- vtoc->efi_parts[].p_tag == V_RESERVED, and
- it will always be partition number 9 (index 8).
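A minimal userspace sketch of that identification, in the libefi style; the helper name here is an assumption for illustration:

/*
 * Return true only if slot 8 (partition 9) looks like the reserved
 * partition ZFS creates for wholedisk vdevs, rather than relying on it
 * simply being the last entry in the table.
 */
static boolean_t
efi_resv_part_valid(struct dk_gpt *vtoc)
{
	struct dk_part *p = &vtoc->efi_parts[8];

	return (p->p_tag == V_RESERVED &&
	    p->p_size == EFI_MIN_RESV_SIZE);
}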
Thanks for the clarification - I was starting to doubt everything... I agree that we're treating partitions a little more fast and loose than would be ideal.
Ah, ok, cool! So in efi_use_whole_disk we could add checks that the partition we're treating as the reserved partition (the last one we find in the table) is actually the one we created. If so, then we're safe to expand because there's no next partition. If not, we should give up because the wholedisk assumption is no longer valid and we have no idea what we might be expanding into.
da7965b to fd457a0 (force-pushed)
behlendorf
left a comment
Functionally this looks good, it just needs the test cases re-enabled and updated. Let me know if you have any questions about the test suite.
I've started going through the tests and also comparing the workflows to those on Illumos. I agree that I need to make a
In fact, I believe that the changes I've made in this PR wouldn't actually help either. I think my plan will be to improve on or add some manual expand tests that exercise the feature I've added - namely the ability for the user to see newly available space - and leave the autoexpand work for a different PR. That work would include a) adding the zedlet and b) either changing the autoexpand tests to re-import the pools or adding autoexpand events elsewhere than just
@shartse thanks for checking the test results on Illumos. I've always assumed these tests worked on Illumos, but from looking at the source code it was never clear to me why. It's good to know I wasn't confused. In order to get these tests working on Linux I figured we'd leverage the udev block device events generated by the kernel. The ZED receives events from two sources, 1) the zfs kernel modules and 2) the udev monitor via libudev. The udev stream is how we get notified of things like device addition/removal. Since this is the standard mechanism on Linux I assumed there would be a device capacity change event we could use. Unfortunately, now that I'm looking closely at the kernel source it looks like only a console log message gets generated in
Given the circumstances I agree that sounds like the best way forward.
Yeah, it's pretty odd. It's possible that Or it's possible that was expected to work on illumos because these are just
Right - that makes sense. I went down a udev rabbit-hole a while back as well - it seems like change Lots of room for improvement - I hope to straighten out some of the
The It looks like newer linux kernels handle this unit attention and generate a udev event. If
module/zfs/vdev_disk.c
Outdated
struct hd_struct *part = bdev->bd_part;
uint64_t sectors = get_capacity(bdev->bd_disk);
/* If there are no partitions, return the entire device capacity */
if (!part)
NIT: if (part == NULL)
fd457a0 to 5d7a279 (force-pushed)
5d7a279 to 3424397 (force-pushed)
| "$autoexp" | ||
| fi | ||
| typeset prev_size=$(get_pool_prop size $TESTPOOL1) | ||
| typeset zfs_prev_size=$(zfs get -p avail $TESTPOOL1 | tail -1 | \ |
I know it's not your change, but this could be typeset zfs_prev_size=$(get_prop avail $TESTPOOL1)
Sure - just changed it.
3424397 to a33a32f (force-pushed)
@behlendorf It looks like there's a zloop failure, similar to this: #6434
@shartse yes sorry about that, this is a known issue we occasionally see with
@shartse this needs one last rebase on master to resolve the conflict. Then it looks ready to me.
c58e482 to 36049e1 (force-pushed)
ef77366 to 0b97924 (force-pushed)
Update bdev_capacity to have wholedisk vdevs query the size of the underlying block device (correcting for the size of the efi partition and partition alignment) and therefore detect expanded space. Correct vdev_get_stats_ex so that the expandsize is aligned to metaslab size and new space is only reported if it is large enough for a new metaslab.

External-issue: LX-165
Signed-off-by: sara hartse <sara.hartse@delphix.com>
0b97924 to b7e45dd (force-pushed)
Codecov Report
@@ Coverage Diff @@
## master #7546 +/- ##
==========================================
+ Coverage 77.38% 77.72% +0.34%
==========================================
Files 362 362
Lines 110288 110299 +11
==========================================
+ Hits 85342 85727 +385
+ Misses 24946 24572 -374
Continue to review full report at Codecov.
@tonyhutter we should consider this fix for inclusion in 0.7.10.
Update bdev_capacity to have wholedisk vdevs query the size of the underlying block device (correcting for the size of the efi partition and partition alignment) and therefore detect expanded space. Correct vdev_get_stats_ex so that the expandsize is aligned to metaslab size and new space is only reported if it is large enough for a new metaslab.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
External-issue: LX-165
Closes openzfs#7546
Issue openzfs#7582
Update bdev_capacity to have wholedisk vdevs query the size of the underlying block device (correcting for the size of the efi partition and partition alignment) and therefore detect expanded space.

Correct vdev_get_stats_ex so that the expandsize is aligned to metaslab size and new space is only reported if it is large enough for a new metaslab.

Description

vdev_disk_open uses the partition size to determine the device size (even if it's a whole-disk vdev), and therefore doesn't notice if the device has grown. Calling get_capacity on the block device itself gives the total size of the device, expansion and all. This entire capacity isn't actually available to the vdev - we need to correct for the size of the EFI boot label partition and the partition alignment space.

zfs should only be reporting new space if there's enough of it to allocate a new metaslab. This merge 0f676dc from openzfs should have fixed that, but line 2924 was left in place, meaning that esize is overwritten after it was aligned.
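As a hedged sketch of that expandsize correction (field and macro names follow the vdev code, but this is an illustration, not the exact diff):

/*
 * Only report expansion space in whole-metaslab units: round the raw
 * extra capacity down to the metaslab size, so anything smaller than
 * one metaslab is reported as zero.
 */
uint64_t esize = vd->vdev_max_asize - vd->vdev_asize;
vs->vs_esize = P2ALIGN(esize, 1ULL << vd->vdev_ms_shift);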
Motivation and Context
In theory you should be able to resize a device, call zpool reopen, see additional expandsize with zpool list -v and then add that space to the vdev with zpool online -e. On Linux, the new size is not detected. This change corrects this by addressing the issues described in the previous section.

How Has This Been Tested?
These changes were tested by running the device expansion workflows described above on ZoL installations running on VMware, AWS, and Azure.