
Add support for autoexpand property #7629

Merged
merged 1 commit into openzfs:master from auto-expand on Jul 23, 2018

Conversation

behlendorf
Contributor

@behlendorf behlendorf commented Jun 13, 2018

Description

While the autoexpand property may seem like a small feature, it depends on a significant amount of system infrastructure. Enough of that infrastructure is now in place that, with a few modifications for Linux, it can be supported.

Auto-expand works as follows: when a block device is modified (resized, closed after being open read/write, etc.), a change uevent is generated for udev. The ZED, which monitors udev events, passes the change event along to zfs_deliver_dle() if the disk or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs using the vdev_guid which was read from the label by blkid. If a match is found, the ZED reopens the pool vdev. This re-opening is important because it allows the vdev to be briefly closed so the disk partition table can be re-read. Otherwise, it wouldn't be possible to report the maximum possible expansion size.

Finally, if the autoexpand property is set to on, a vdev expansion will be attempted. After performing some sanity checks on the disk to verify that it is safe to expand, the primary partition (-part1) will be expanded and the partition table updated. The partition is then re-opened (again) to detect the updated size, which allows the new capacity to be used.
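
A minimal way to exercise this flow end to end is sketched below with a file-backed loop device (pool and path names are illustrative; depending on the kernel, the loop driver may not emit a change uevent, in which case zpool online -e is still needed):

```
# Create a small pool with autoexpand enabled on a loop device.
truncate -s 1G /var/tmp/autoexpand.img
DEV=$(losetup -f --show /var/tmp/autoexpand.img)
zpool create -o autoexpand=on tank "$DEV"

# Grow the backing file and tell the loop driver to re-read its size;
# this is the change-event path the ZED acts on.
truncate -s 2G /var/tmp/autoexpand.img
losetup -c "$DEV"

# Once the ZED has reopened and expanded the vdev, the additional
# capacity appears in the SIZE column.
zpool list tank
```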

Motivation and Context

Issue #120
Issue #2437
Issue #5771
Issue #7366
Issue #7582

How Has This Been Tested?

Full local run of ZTS on CentOS 7. Overnight run of the zpool_expand ZTS test group in a loop. No failures observed. Ready for additional buildbot testing for other distributions.
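
One way to run such a loop locally (the -T tags option of zfs-tests.sh is assumed here; adjust to however your checkout invokes the test-runner):

```
# Repeat the zpool_expand test group until a failure occurs.
while ./scripts/zfs-tests.sh -T zpool_expand; do :; done
```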

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Jun 13, 2018
@behlendorf behlendorf force-pushed the auto-expand branch 2 times, most recently from 37581be to 5cca586 on June 14, 2018 23:26
@behlendorf behlendorf removed the Status: Work in Progress Not yet ready for general review label Jun 14, 2018
@shartse
Contributor

shartse commented Jun 15, 2018

@behlendorf I downloaded this patch and have been trying to verify it locally without any success yet. The pos_1 test fails and, running manually through the steps, I can't see any sign of autoexpand. I'm now trying to figure out at which stage things are going wrong - should I expect to see a udevadm CHANGE event after running zfs set volsize on my zvol?
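
One way to check for that event directly (the zvol name is illustrative):

```
# Watch block uevents in the background, then grow the volume; a "change"
# event for the zvol's device node should appear if everything is wired up.
udevadm monitor --udev --subsystem-match=block &
zfs set volsize=2G tank/vol
```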

@shartse
Contributor

shartse commented Jun 15, 2018

Never mind! I hadn't reloaded properly and the new version of zed wasn't running. Works for me now.

@behlendorf
Contributor Author

Rebased on master to drop unrelated changes from the patch stack, no other changes included.

@shartse
Contributor

shartse commented Jun 19, 2018

I believe that something for non-zvol vdevs has been broken in this change. When I resize a scsi device with VMware, rescan the scsi bus, and run zpool reopen, I see the new expandsize. However, if I run zpool online -e <pool> <dev>, the new space is not added to the pool nor removed from the expandsize.
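
A rough outline of that reproduction, with illustrative pool and device names:

```
# After growing the virtual disk in VMware, re-read its capacity.
echo 1 > /sys/class/block/sdb/device/rescan

zpool reopen tank
zpool get expandsize tank      # the new space is reported here

zpool online -e tank sdb       # expected to consume the space, but does not
zpool list tank
```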

@behlendorf
Contributor Author

@shartse it sounds like the kernel code may be getting a stale cached version of the partition table. Can you check if reinstating the vdev_disk_rrpart() functionality resolves the issue? It also sounds like it would be a good idea to add a version of the expand tests that uses the scsi_debug device.

@shartse
Contributor

shartse commented Jun 19, 2018

@behlendorf Sure - I'll test that now. And yeah, I agree that we should have some non-zvol tests. I don't know a lot about the scsi_debug devices - I noticed we could power them on and off in blkdev.shlib - is it possible to resize them?

@behlendorf
Contributor Author

behlendorf commented Jun 19, 2018

@shartse we should be able to; they're basically a tiny ramdisk layered below the scsi driver. You can use them to set up tests which might otherwise be difficult to orchestrate, like injecting IO or data errors and making sure they're handled properly all the way up the stack, or, in our case, resizing an actual scsi device without needing real hardware.
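
A sketch of creating and growing one (the module parameters and sysfs paths are the usual scsi_debug ones, but worth double-checking on the target kernel):

```
# Create a 256 MB scsi_debug disk and find its device node.
modprobe scsi_debug dev_size_mb=256
DEV=$(lsscsi | awk '/scsi_debug/ {print $NF}')

# Grow the reported capacity to 1 GiB, then have the device re-read it;
# this also generates the udev change event the ZED listens for.
echo 1 > /sys/module/scsi_debug/parameters/virtual_gb
echo 1 > "/sys/class/block/$(basename "$DEV")/device/rescan"
```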

@shartse
Contributor

shartse commented Jun 19, 2018

@behlendorf Ok, that's good to know. I can start working on a set of zpool_expand tests that use scsi_debug devices.

And adding back in the call to vdev_disk_rrpart() hasn't solved the issue I was seeing.

@behlendorf
Contributor Author

@shartse I was able to reproduce the issue you were seeing with scsi devices and have added a commit which resolves the issue in my testing on Ubuntu 18.04. I still need to verify it works as intended on older kernels, but it's there for your review.

@behlendorf
Contributor Author

Refreshed and ready for review. The updated PR addresses all known issues and includes @shartse's additional test case which uses a real (scsi_debug) device.

@behlendorf
Contributor Author

Rebased on master.

Contributor

@shartse shartse left a comment


I'll keep looking and test this locally as well

uint64_t available = sectors - used;
return (available << SECTOR_BITS);
/*
* Returns the maximum expansion capacity of the block device, When the
Contributor


Can you specify the units of the value returned?

Contributor Author


Will do.

/*
* Returns the maximum expansion capacity of the block device, When the
* vdev has been created as a 'wholedisk' then expansion may be possible.
* Before any expansion is performed the partition layout is verified to
Contributor


The rest of this comment makes it seem like the layout verification and resizing happens in this function. Maybe specify which functions those things happen in?

Contributor Author


Good idea, I'll clarify this.


if (wholedisk && bdev->bd_part != NULL && bdev != bdev->bd_contains) {
available = i_size_read(bdev->bd_contains->bd_inode) -
((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
Contributor


Can you add a comment (here or at the top of the function) explaining why we're doing this calculation?

Contributor Author


Will do.

if (wholedisk && bdev->bd_part != NULL && bdev != bdev->bd_contains) {
available = i_size_read(bdev->bd_contains->bd_inode) -
((EFI_MIN_RESV_SIZE + NEW_START_BLOCK +
PARTITION_END_ALIGNMENT) << SECTOR_BITS);
Contributor


Are we sure this calculation matches the slice_size calculation in zpool_label_disk() in libzfs_pool.c? The current code there aligns the slice size; this does not. The failure mode is that the pool can think it is bigger than slice 0 (aka part0).

Contributor


My understanding when I worked on this was that EFI_MIN_RESV_SIZE + NEW_START_BLOCK + PARTITION_END_ALIGNMENT is a conservative estimate of the non-part0 space on the device.

In zpool_label_disk the size of part0 is effectively computed as P2ALIGN(total_size - (EFI_MIN_RESV_SIZE + NEW_START_BLOCK), PARTITION_END_ALIGNMENT). The size of the padding caused by the alignment can be anything from 0 to PARTITION_END_ALIGNMENT. By assuming we were padded with the full PARTITION_END_ALIGNMENT bytes, we actually report a size for part0 that could be slightly smaller than is really available. However, I don't think this is a big issue: this is just the size reported to the user as expandsize, and once the pool is actually expanded it will use as much space as fits evenly into metaslabs.
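
A toy rendering of that reasoning, in 512-byte sectors, with illustrative stand-in values for EFI_MIN_RESV_SIZE, NEW_START_BLOCK, and PARTITION_END_ALIGNMENT (the failure reported further down suggests the real on-disk layout reserves more than this simple model captures):

```
disk=19532873728                     # ~10 TB disk, in 512-byte sectors
resv=16384; start=2048; align=2048   # stand-in constants, not authoritative

# zpool_label_disk(): usable space rounded down to the alignment boundary.
slice0=$(( (disk - resv - start) / align * align ))

# bdev_capacity() estimate: assume a full alignment's worth of padding.
estimate=$(( disk - resv - start - align ))

# In this model the estimate is never larger than slice0, so expandsize
# errs on the small side.
echo "slice0=$slice0 estimate=$estimate diff=$((slice0 - estimate))"
```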

Contributor


Except when it isn't. Observe the following failure:
[root@host4602 relling]# zpool create xp /dev/sddj

Message from syslogd@host4602 at Jul 5 11:49:36 ...
kernel:VERIFY3(io_offset + io_size <= bdev->bd_inode->i_size) failed (10000820355072 <= 10000747462656)

And the size of part0 is 10000747462656. I believe this is because we're picking the wrong size, though I haven't compiled a new version to test. Meanwhile we work around it by creating the pool with part0 instead of the whole disk.

Contributor


Interesting - can you describe the circumstances of the failure a little more? Was /dev/sddj ever expanded? Is it still showing up with the latest version of this patch?

I'm not sure why this calculation would ever end up determining the size of the partition. My understanding is that efi_use_whole_disk is where the partition size is updated after an expand and that doesn't use what we computed for asize.

Contributor


Quite simply, we cannot create a pool on a 10TB disk using the whole-disk method. Prior to the commit for expansion, we were able to create such a pool. It is unclear to me whether this is noticeable on a non-debug kernel.

Contributor Author


@richardelling can you post the exact size of the device in bytes? I'd like to see if I can reproduce what you're seeing.

Contributor Author


@richardelling I believe I see what's going on, and this commit should in fact resolve the issue accidentally introduced by the first expansion commit. It was possible for bdev_capacity() to slightly over-report the partition size, leading to the ASSERT you're seeing in the wholedisk case. This is no longer possible since bdev_capacity() was split into bdev_capacity() and bdev_max_capacity().

Contributor

@shartse shartse left a comment


Looks good!

@behlendorf
Contributor Author

@richardelling this updated version should address your concerns. If possible it would be great if you could verify the wholedisk case on the system where you ran into problems.

@richardelling
Contributor

Thanks @shartse I agree with your analysis. @behlendorf I'll schedule time on the system to test and report back.

@codecov

codecov bot commented Jul 13, 2018

Codecov Report

Merging #7629 into master will increase coverage by 0.2%.
The diff coverage is 84.61%.


@@            Coverage Diff            @@
##           master    #7629     +/-   ##
=========================================
+ Coverage   78.08%   78.28%   +0.2%     
=========================================
  Files         368      368             
  Lines      111899   111952     +53     
=========================================
+ Hits        87378    87644    +266     
+ Misses      24521    24308    -213
Flag Coverage Δ
#kernel 78.78% <85.18%> (+0.05%) ⬆️
#user 67.36% <84.37%> (+0.44%) ⬆️

Continue to review full report at Codecov.

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e5dc44...b49b09f. Read the comment docs.

While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure.  Enough
of that infrastructure is now in place that, with a few
modifications for Linux, it can be supported.

Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev.  The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid.  If
a match is found the ZED reopens the pool vdev.  This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read.  Otherwise, it wouldn't
be possible to report the maximum possible expansion size.

Finally, if the property autoexpand=on a vdev expansion will be
attempted.  After performing some sanity checks on the disk to
verify that it is safe to expand,  the primary partition (-part1)
will be expanded and the partition table updated.  The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.

In order to make all of the above possible the following changes
were required:

* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
  These tests now create a pool which is layered on a loopback,
  scsi_debug, and file vdev.  This allows for testing of a non-
  partitioned block device (loopback), a partitioned block device
  (scsi_debug), and a file which does not receive udev change
  events.  This provides better test coverage, and by removing
  the layering on ZFS volumes the issues surrounding layering
  one pool on another are avoided.

* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
  This allows for matching by guid rather than path which is a
  more reliable way for the ZED to reference a vdev.

* Fixed zfs_zevent_wait() signal handling which could result
  in the ZED spinning when a signal was not handled.

* Removed vdev_disk_rrpart() functionality which can be abandoned
  in favor of kernel provided blkdev_reread_part() function.

* Added a rwlock which is held as a writer while a disk is being
  reopened.  This is important to prevent errors from occurring
  for any configuration related IOs which bypass the SCL_ZIO lock.
  The zpool_reopen_007_pos.ksh test case was added to verify IO
  errors are never observed when reopening.  This is not expected
  to impact IO performance.

Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality:

* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
  ZFS volumes.  This serves as a unique physical path; although the
  volumes are no longer used in the test cases for other reasons,
  this improvement was still included.

Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#120
Issue openzfs#2437
Issue openzfs#5771
Issue openzfs#7366
Issue openzfs#7582
@richardelling
Contributor

@shartse tests complete and the results are:

  1. From master 2e5dc44, zpool create on the 10T disk fails with:
     kernel:VERIFY3(io_offset + io_size <= bdev->bd_inode->i_size) failed (10000820355072 <= 10000747462656)
  2. With this latest diff, the pool is created successfully (woohoo!) with asize = 10000742744064, which is close enough.

@shartse
Contributor

shartse commented Jul 17, 2018

Great!

@behlendorf behlendorf merged commit d441e85 into openzfs:master Jul 23, 2018
@tonyhutter tonyhutter added this to To do in 0.7.11 Sep 12, 2018
@tonyhutter tonyhutter removed this from To do in 0.7.11 Sep 14, 2018
@tonyhutter tonyhutter added this to To do in 0.7.12 Sep 14, 2018
@tonyhutter tonyhutter removed this from To do in 0.7.12 Oct 11, 2018
@behlendorf behlendorf deleted the auto-expand branch April 19, 2021 19:29
ahrens added a commit to ahrens/zfs that referenced this pull request Dec 10, 2023
`zvol_disk_open()` waits for up to `zfs_vdev_open_timeout_ms` (1 second by
default) (e.g. if the block device does not exist).  While in this loop,
it calls `schedule_timeout()`.

The problem is that `schedule_timeout()` may not actually cause the
thread to go off-CPU.  Per the "documentation" (comment in the source
code):
```
 * The function behavior depends on the current task state:
 * %TASK_RUNNING - the scheduler is called, but the task does not sleep
 * at all. That happens because sched_submit_work() does nothing for
 * tasks in %TASK_RUNNING state.
```

In my experience, `schedule_timeout()` never sleeps from this code path.
This is especially noticeable if `zfs_vdev_open_timeout_ms` has been
increased from its default.

This commit uses `msleep()` to actually sleep.  Note that this is how it
was before openzfs#7629.
@ahrens ahrens mentioned this pull request Dec 10, 2023