raidz expansion feature #15022

Merged 1 commit on Nov 8, 2023

@don-brady (Contributor) commented Jun 29, 2023

Motivation and Context

This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks).

For additional context as well as a design overview, see Matt Ahrens' talk at the 2021 FreeBSD Developer Summit (video) (slides), and a news article from Ars Technica.

Description

Initiating expansion

A new device (disk) can be attached to an existing RAIDZ vdev by running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank raidz2-0 sda`. The new device will become part of the RAIDZ group. A "raidz expansion" will be initiated, and the new device will contribute additional space to the RAIDZ group once the expansion completes.

The `feature@raidz_expansion` on-disk feature flag must be `enabled` to initiate an expansion, and it remains `active` for the life of the pool. In other words, pools with expanded RAIDZ vdevs cannot be imported by older releases of the ZFS software.

During expansion

The expansion entails reading all allocated space from existing disks in the RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including the newly added device).

The expansion progress can be monitored with `zpool status`.

Data redundancy is maintained during (and after) the expansion. If a disk fails while the expansion is in progress, the expansion pauses until the health of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting for reconstruction to complete).

The pool remains accessible during expansion. Following a reboot or export/import, the expansion resumes where it left off.

After expansion

When the expansion completes, the additional space is available for use, and is reflected in the `available` zfs property (as seen in `zfs list`, `df`, etc.).

Expansion does not change the number of failures that can be tolerated without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion).

A RAIDZ vdev can be expanded multiple times.

After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to `zfs list`, `df`, `ls -s`, and similar tools.

Manpage changes

zpool-attach.8:

NAME
     zpool-attach — attach new device to existing ZFS vdev

SYNOPSIS
     zpool attach [-fsw] [-o property=value] pool device new_device

DESCRIPTION
     Attaches new_device to the existing device.  The behavior differs depend‐
     ing on if the existing device is a RAIDZ device, or a mirror/plain
     device.

     If the existing device is a mirror or plain device ...

     If the existing device is a RAIDZ device (e.g. specified as "raidz2-0"),
     the new device will become part of that RAIDZ group.  A "raidz expansion"
     will be initiated, and the new device will contribute additional space to
     the RAIDZ group once the expansion completes.  The expansion entails
     reading all allocated space from existing disks in the RAIDZ group, and
     rewriting it to the new disks in the RAIDZ group (including the newly
     added device).  Its progress can be monitored with zpool status.

     Data redundancy is maintained during and after the expansion.  If a disk
     fails while the expansion is in progress, the expansion pauses until the
     health of the RAIDZ vdev is restored (e.g. by replacing the failed disk
     and waiting for reconstruction to complete).  Expansion does not change
     the number of failures that can be tolerated without data loss (e.g. a
     RAIDZ2 is still a RAIDZ2 even after expansion).  A RAIDZ vdev can be
     expanded multiple times.

     After the expansion completes, old blocks remain with their old data-to-
     parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distrib‐
     uted among the larger set of disks.  New blocks will be written with the
     new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded
     once to 6-wide, has 4 data to 2 parity).  However, the RAIDZ vdev's
     "assumed parity ratio" does not change, so slightly less space than is
     expected may be reported for newly-written blocks, according to zfs list,
     df, ls -s, and similar tools.

Status

Matt Ahrens' original pull request (#12225) has been rebased here onto the current master branch and updated to incorporate recent code cleanups in the OpenZFS codebase. This feature is believed to be complete. However, like all PRs, it is subject to change as part of the code review process. Since this PR includes on-disk changes, it shouldn't be used on production systems before it is integrated into the OpenZFS codebase. Tasks that still need to be done before integration:

  • Additional code cleanup in ztest code
  • zloop changes to drive coverage of this feature
  • Address test failures in ztest runs
  • Document the high-level design in a "big theory statement" comment
  • Remove verbose logging
  • Detection of MBR partitions using reserved boot area (FreeBSD BTX boot loader)
  • Address any performance concerns

Acknowledgments

Thank you to the FreeBSD Foundation for commissioning this work in 2017 and continuing to sponsor it well past the original time estimates!
Thank you to iXsystems for sponsoring the final push to land this feature into OpenZFS.

Thanks also to contributors @FedorUporovVstack, @stuartmaybee, @thorsteneb, and @Fmstrat for portions of the implementation.

Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>

How Has This Been Tested?

Tests added to the ZFS Test Suite (functional/raidz) and ztest, in addition to manual testing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

Pull Request Comments

Please limit comments here to code review/feedback and testing questions/results.

For generic discussions about RAID-Z, or discussions on future enhancements to RAIDZ expansion, please use RAIDZ Expansion feature discussions.

@don-brady added the "Type: Feature" and "Status: Code Review Needed" labels Jun 29, 2023
@ahrens mentioned this pull request Jun 29, 2023 (18 tasks)
@Evernow commented Jun 29, 2023

Thank you for your and @ahrens' work! Hopefully this gets merged soon!

@emaste commented Jun 29, 2023

A small comment, one of the trailer lines is inconsistent (space vs dash):

Sponsored-by: The FreeBSD Foundation
Sponsored by: iXsystems, Inc.

@mmayer commented Jul 4, 2023

Thank you so much for continuing this work. And thank you to iXsystems for sponsoring it.

@EvanCarroll

We are all cheering for you.

May god bless all of those who contribute to this patch and may their children live long and healthy forever, without death. May this patch swiftly be tested and committed so the heavens may finally open and rain bliss down upon us. Amen.

@shivabohemian

Thank you so much. BTW, When will we merge this pr? @behlendorf @ahrens @mmaybee

@abjugard

BTW, When will we merge this pr?

When it's good and ready. You can't rush art.

@EvanCarroll

When this patch is finally done, the pope himself will consent to painting it over that other crap in the Sistine Chapel.

@shivabohemian

Got it. I don't mean to rush and just want to know if it's on the plan~

@KaeTuuN commented Aug 11, 2023

This PR is followed by many people, so please:

STOP FLOODING IT WITH USELESS COMMENTS!

That would be really awesome!
If you have a question about the code or want to help, fine. In every other case: do not post!

Sorry for the harsh wording, but it's really annoying me...

@don-brady (Contributor, Author)

Rebased on latest master branch.

@TurkeyMan

There are a lot of people watching this, so I'm sorry for adding noise, but I'd like to better understand this sentence:

However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks [...]

Does this mean an actual loss of space, or just a reporting quirk?
I understand that the old blocks remain with the original parity arrangement, and new blocks will be written to take advantage of the new disk layout, but is there another disadvantage expressed in that sentence above that I haven't understood?

Also on a minor tangent, after I expand my raid (eagerly waiting!), I would ideally also like the process to rewrite all existing blocks with the new parity structure so there are no 'old blocks'. Is there a simple way to rewrite all data that is stored in the old block structure such that the rewrite expands it to the new structure? Or can this be added as a feature?
I'm not sure how, post-expansion, I am able to identify the data with the old block structure that needs rewriting. It seems like something the FS might need to scan for and do on its own?
What will be the recommended solution here?
In my immediate case, I intend to expand a raidz2 from 4 drives to 6 drives, so there is a huge gain in efficiency to be had by rewriting all old blocks.
Thank you so much for this work, it's oh-so long overdue!

@meyergru commented Aug 31, 2023

As for the "loss of space": There is an excellent article on Ars Technica: https://arstechnica.com/gadgets/2021/06/raidz-expansion-code-lands-in-openzfs-master/ - you will understand what is meant if you look at the graphic. Essentially, the ratio of data-to-parity is not changed in the expansion process. This is different in RAID5/6, where new parity blocks are calculated. With ZFS expansion, blocks are only moved, therefore you gain no data-to-parity ratio by adding disks later on.

Therefore, it is best to start out with the largest number of drives you can afford. However, the ZFS calculator (https://wintelguy.com/zfs-calc.pl) shows that for a 6 drive raidz2, the remaining capacity is 63.93% (which would not change if you expanded it to, say, 8 drives), while for an 8 drive raidz2, it is 68.19%, which is not that much bigger anyway.

@TurkeyMan

Okay, I understand this, and if I were to rewrite those blocks, they would be rewritten in the new layout which would return the 'waste' space.
But as I read that sentence, it seemed like some other effect that wasn't covered by that detail; it's this bit ...so slightly less space than is expected may be reported for newly-written blocks. As I read, that suggests that new blocks will consume more space than I expect them to for some reason, even though they are written out at the new stripe width.

What about the second part of my question; how can I cause a rewrite of all old blocks such that they are expanded to the new raid width to improve efficiency? I think many people will want to do that for archive drives where the data is not rewritten naturally over time.

@meyergru commented Aug 31, 2023

The ZFS expansion as it is defined now does not change the layout of the parity blocks. The initial number of needed parity blocks is determined by the number of disks. If you have raidz1, you need one disk of parity, which means you lose one data disk. The idea of ZFS expansion is that you keep the initial ratio, because you need "at least" one parity disk per row. Adding disks could optimally change that ratio while staying redundant, but from a safety perspective, it does not have to.

So, no, "rewriting blocks" does not change the layout (at least not with normal filesystem access). You probably could devise a raid5-like approach, but then you would have to update parity blocks, more critical metadata, and so on, which would probably break some assumptions and make online expansion harder or infeasible. I can only imagine this to be the reason for not doing it like a (space-wise better) raid5 expansion.

And also, no, as I understand it, there is no "new layout" for added blocks - the (expanded) layout is valid for all blocks, but keeps the (slightly) less optimal, initial data-to-parity ratio - also for new data written to the filesystem.

It's the difference between wish and reality, I guess. I would rather have an option to expand at all than to do it optimally.

@behlendorf (Contributor) left a comment

Looks good. Can you just wrap up the cstyle warning?

@behlendorf added the "Status: Accepted" label and removed the "Status: Code Review Needed" label Oct 25, 2023
@don-brady (Contributor, Author)

Update -- all the code reviews have been completed and the tests are all passing!

We did uncover an issue when manually replacing a healthy RAIDZ child in the middle of an expansion. I hope to have a resolution for that soon.

This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally.  This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).

== Initiating expansion ==

A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`.  The new device will become part of the RAIDZ group.  A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.

The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.

== During expansion ==

The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).

The expansion progress can be monitored with `zpool status`.

Data redundancy is maintained during (and after) the expansion.  If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).

The pool remains accessible during expansion.  Following a reboot or
export/import, the expansion resumes where it left off.

== After expansion ==

When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc).

Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).

A RAIDZ vdev can be expanded multiple times.

After the expansion completes, old blocks remain with their old
data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but
distributed among the larger set of disks.  New blocks will be written
with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
expanded once to 6-wide, has 4 data to 2 parity).  However, the RAIDZ
vdev's "assumed parity ratio" does not change, so slightly less space
than is expected may be reported for newly-written blocks, according to
`zfs list`, `df`, `ls -s`, and similar tools.

Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>

Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
@don-brady (Contributor, Author)

Rebased to latest master. Squashed the commits to one.

@behlendorf merged commit 5caeef0 into openzfs:master Nov 8, 2023 (25 checks passed)
@ShadowJonathan

🚀 🎉 💚

I assume this'll go through a bunch of testing before it'll be offered on stable branches, but what's the release plan for this?

@behlendorf (Contributor)

Merged! Thanks @ahrens @don-brady and everybody else who helped on this major new feature!

This feature will be available in the OpenZFS 2.3 release, which is probably about a year out.

@Sparklingx

so cool.

@justinclift

As a reminder, the existing discussion for this PR is in #15232.

Please direct all discussion type comments there instead.

@darkbasic

We did uncover an issue when manually replacing a healthy RAIDZ child in the middle of an expansion. I hope to have a resolution for that soon.

Did that fix make its way into this PR?

@ahrens (Member) commented Nov 9, 2023

We did uncover an issue when manually replacing a healthy RAIDZ child in the middle of an expansion. I hope to have a resolution for that soon.

Did that fix make its way into this PR?

Yes, @don-brady was able to track that down and fix it.

gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Nov 16, 2023
Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes openzfs#15022
-	(void) snprintf(oldvd->vdev_path, strlen(newvd->vdev_path) + 5,
-	    "%s/%s", newvd->vdev_path, "old");
+	(void) sprintf(oldvd->vdev_path, "%s/old",
+	    newvdpath);
Contributor


Ugh sprintf() - wish I had caught this earlier, no sprintf on modern system :)

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023
gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Jan 3, 2024
gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Feb 24, 2024
This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally.  This feature is especially useful
for small pools (typically with only one RAID-Z group), where there
isn't sufficient hardware to add capacity by adding a whole new RAID-Z
group (typically doubling the number of disks).

== Initiating expansion ==

A new device (disk) can be attached to an existing RAIDZ vdev, by
running `zpool attach POOL raidzP-N NEW_DEVICE`, e.g. `zpool attach tank
raidz2-0 sda`.  The new device will become part of the RAIDZ group.  A
"raidz expansion" will be initiated, and the new device will contribute
additional space to the RAIDZ group once the expansion completes.

The `feature@raidz_expansion` on-disk feature flag must be `enabled` to
initiate an expansion, and it remains `active` for the life of the pool.
In other words, pools with expanded RAIDZ vdevs can not be imported by
older releases of the ZFS software.

== During expansion ==

The expansion entails reading all allocated space from existing disks in
the RAIDZ group, and rewriting it to the new disks in the RAIDZ group
(including the newly added device).

The expansion progress can be monitored with `zpool status`.

Data redundancy is maintained during (and after) the expansion.  If a
disk fails while the expansion is in progress, the expansion pauses
until the health of the RAIDZ vdev is restored (e.g. by replacing the
failed disk and waiting for reconstruction to complete).

The pool remains accessible during expansion.  Following a reboot or
export/import, the expansion resumes where it left off.

== After expansion ==

When the expansion completes, the additional space is available for use,
and is reflected in the `available` zfs property (as seen in `zfs list`,
`df`, etc.).

Expansion does not change the number of failures that can be tolerated
without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after
expansion).

A RAIDZ vdev can be expanded multiple times.

After the expansion completes, old blocks keep their old data-to-parity
ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but are distributed
among the larger set of disks.  New blocks are written with the new
data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once
to 6-wide has 4 data to 2 parity).  However, the RAIDZ vdev's "assumed
parity ratio" does not change, so slightly less space than is expected
may be reported for newly-written blocks, according to `zfs list`, `df`,
`ls -s`, and similar tools.
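
The arithmetic behind the 5-wide to 6-wide example can be worked through
in a short Python sketch.  It is an idealized model, not the accounting
code: it assumes full-width stripes and ignores padding, sector size and
compression, and it assumes that tools scale raw space by the ratio of
the original width.  The function names are hypothetical.

```python
from fractions import Fraction


def data_fraction(width: int, parity: int) -> Fraction:
    """Fraction of a full-width stripe that holds data (idealized)."""
    if not 0 < parity < width:
        raise ValueError("need 0 < parity < width")
    return Fraction(width - parity, width)


def reported_over_expected(old_width: int, new_width: int, parity: int) -> Fraction:
    """Ratio of reported size to logical size for a block written after expansion.

    Raw space consumed is logical / data_fraction(new_width); the reporting
    tools are assumed to scale raw space by data_fraction(old_width).
    """
    return data_fraction(old_width, parity) / data_fraction(new_width, parity)


if __name__ == "__main__":
    old, new, parity = 5, 6, 2
    print(data_fraction(old, parity))                # 3/5
    print(data_fraction(new, parity))                # 2/3
    print(reported_over_expected(old, new, parity))  # 9/10
```

Under these assumptions a newly written block is reported at about 90%
of its logical size, which is the "slightly less space than is expected"
effect described above.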

Sponsored-by: The FreeBSD Foundation
Sponsored-by: iXsystems, Inc.
Sponsored-by: vStack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Authored-by: Matthew Ahrens <mahrens@delphix.com>
Contributions-by: Fedor Uporov <fuporov.vstack@gmail.com>
Contributions-by: Stuart Maybee <stuart.maybee@comcast.net>
Contributions-by: Thorsten Behrens <tbehrens@outlook.com>
Contributions-by: Fmstrat <nospam@nowsci.com>
Contributions-by: Don Brady <dev.fs.zfs@gmail.com>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes openzfs#15022
gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Mar 17, 2024
gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Apr 25, 2024
gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Apr 25, 2024
gamanakis pushed a commit to gamanakis/zfs that referenced this pull request Apr 26, 2024
Labels
Status: Accepted Ready to integrate (reviewed, tested) Type: Feature Feature request or new feature