
RAIDZ Expansion feature #12225

Closed · wants to merge 26 commits into from

Conversation

@ahrens (Member) commented Jun 11, 2021

Motivation and Context

This feature allows disks to be added one at a time to a RAID-Z group,
expanding its capacity incrementally. It is especially useful for small pools
(typically with only one RAID-Z group), where there isn't sufficient hardware
to add capacity by adding a whole new RAID-Z group (which typically doubles
the number of disks).

For additional context as well as a design overview, see my talk at the 2021 FreeBSD Developer Summit (video) (slides), and a news article from Ars Technica.

Description

Initiating expansion

A new device (disk) can be attached to an existing RAIDZ vdev by running
zpool attach POOL raidzP-N NEW_DEVICE, e.g. zpool attach tank raidz2-0 sda.
The new device will become part of the RAIDZ group. A "raidz expansion" will
be initiated, and the new device will contribute additional space to the RAIDZ
group once the expansion completes.
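
For example, a minimal session might look like this (pool and device names are
illustrative, and the exact zpool status wording may differ):

      # zpool status tank                 # existing 4-wide group raidz2-0: sda sdb sdc sdd
      # zpool attach tank raidz2-0 sde    # add a fifth disk to the raidz2 group
      # zpool status tank                 # now reports the in-progress raidz expansion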

The feature@raidz_expansion on-disk feature flag must be enabled to
initiate an expansion, and it remains active for the life of the pool. In
other words, pools with expanded RAIDZ vdevs cannot be imported by older
releases of the ZFS software.
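
If the feature is not already enabled (for example on a pool created with an
older compatibility setting), it can be enabled explicitly before attaching; a
short sketch, using the feature name above:

      # zpool get feature@raidz_expansion tank
      # zpool set feature@raidz_expansion=enabled tank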

During expansion

The expansion entails reading all allocated space from existing disks in the
RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including
the newly added device).

The expansion progress can be monitored with zpool status.

Data redundancy is maintained during (and after) the expansion. If a disk
fails while the expansion is in progress, the expansion pauses until the health
of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting
for reconstruction to complete).

The pool remains accessible during expansion. Following a reboot or
export/import, the expansion resumes where it left off.
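
For instance (illustrative only; the expansion state is stored in the pool, so
no extra steps are needed to resume it):

      # zpool status tank      # shows expansion progress
      # zpool export tank
      # zpool import tank
      # zpool status tank      # the expansion has picked up where it left off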

After expansion

When the expansion completes, the additional space is available for use, and is
reflected in the available zfs property (as seen in zfs list, df, etc).

Expansion does not change the number of failures that can be tolerated without
data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion).

A RAIDZ vdev can be expanded multiple times.

After the expansion completes, old blocks remain with their old data-to-parity
ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but distributed among the
larger set of disks. New blocks will be written with the new data-to-parity
ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide has 4 data
to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not
change, so slightly less space than is expected may be reported for
newly-written blocks, according to zfs list, df, ls -s, and similar tools.
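
As a rough illustration of the parity overhead (using the 5-wide RAIDZ2 example
above; actual space also depends on padding, recordsize, and compression):

      # Old blocks keep 3 data : 2 parity (40% parity); new blocks use
      # 4 data : 2 parity (~33% parity) after expanding to 6-wide.
      $ awk 'BEGIN { printf "old: %.0f%% parity  new: %.1f%% parity\n", 100*2/5, 100*2/6 }'
      old: 40% parity  new: 33.3% parity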

Manpage changes

zpool-attach.8:

NAME
     zpool-attach — attach new device to existing ZFS vdev

SYNOPSIS
     zpool attach [-fsw] [-o property=value] pool device new_device

DESCRIPTION
     Attaches new_device to the existing device.  The behavior differs
     depending on whether the existing device is a RAIDZ device, or a
     mirror/plain device.

     If the existing device is a mirror or plain device ...

     If the existing device is a RAIDZ device (e.g. specified as "raidz2-0"),
     the new device will become part of that RAIDZ group.  A "raidz expansion"
     will be initiated, and the new device will contribute additional space to
     the RAIDZ group once the expansion completes.  The expansion entails
     reading all allocated space from existing disks in the RAIDZ group, and
     rewriting it to the new disks in the RAIDZ group (including the newly
     added device).  Its progress can be monitored with zpool status.

     Data redundancy is maintained during and after the expansion.  If a disk
     fails while the expansion is in progress, the expansion pauses until the
     health of the RAIDZ vdev is restored (e.g. by replacing the failed disk
     and waiting for reconstruction to complete).  Expansion does not change
     the number of failures that can be tolerated without data loss (e.g. a
     RAIDZ2 is still a RAIDZ2 even after expansion).  A RAIDZ vdev can be
     expanded multiple times.

     After the expansion completes, old blocks remain with their old
     data-to-parity ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but
     distributed among the larger set of disks.  New blocks will be written
     with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been
     expanded once to 6-wide has 4 data to 2 parity).  However, the RAIDZ
     vdev's "assumed parity ratio" does not change, so slightly less space
     than is expected may be reported for newly-written blocks, according to
     zfs list, df, ls -s, and similar tools.

Status

This feature is believed to be complete. However, like all PRs, it is subject
to change as part of the code review process. Since this PR includes on-disk
changes, it shouldn't be used on production systems before it is integrated to
the OpenZFS codebase. Tasks that still need to be done before integration:

  • Cleanup ztest code
  • Additional code cleanup (address all XXX comments)
  • Document the high-level design in a "big theory statement" comment
  • Remove/disable verbose logging
  • Fix the last few test failures
  • Remove the first commit (needed to get cleaner test runs)

Acknowledgments

Thank you to the FreeBSD Foundation for
commissioning this work in 2017 and continuing to sponsor it well past our
original time estimates!

Thanks also to contributors @FedorUporovVstack, @stuartmaybee, @thorsteneb, and @Fmstrat for portions
of the implementation.

Sponsored-by: The FreeBSD Foundation
Contributions-by: Stuart Maybee stuart.maybee@comcast.net
Contributions-by: Fedor Uporov fuporov.vstack@gmail.com
Contributions-by: Thorsten Behrens tbehrens@outlook.com
Contributions-by: Fmstrat nospam@nowsci.com

How Has This Been Tested?

Tests added to the ZFS Test Suite, in addition to manual testing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@ahrens mentioned this pull request Jun 11, 2021
@ahrens added the Status: Code Review Needed (ready for review and testing) and Type: Feature (feature request or new feature) labels Jun 11, 2021
@Evernow commented Jun 11, 2021

Congrats on the progress!

@cornim commented Jun 11, 2021

Also congrats on moving this out of alpha.

I have a question regarding the

After the expansion completes, old blocks remain with their old data-to-parity ratio

section and it might help to do a little example here:

If I have an 8x1TB RAIDZ2 and load 2 TB into it, it would be 33% full.

In comparison, if I start with a 4x1TB RAIDZ2 which is full and expand it to 8x1TB RAIDZ2, it would be 50% full.

Is that understanding correct? If so, it would mean that one should always start an expansion with as empty a vdev as possible. As this will not always be an option, is there any possibility (planned) to rewrite the old data into the new parity? Would moving the data off the vdev and then back again do the job?

@GurliGebis

Moving the data off and back on again should do it I think, since it rewrites the data (snapshots might get in the way of it).
If deduplication is disabled, copying the files and removing the old version might do the trick (Or am I missing something?)

@ahrens (Member, Author) commented Jun 11, 2021

@cornim I think that math is right - that's one of the worst cases (you'd be better off starting with mirrors than a 4-wide RAIDZ2; and if you're doubling your number of drives you might be better off adding a new RAIDZ group). To take another example, if you have a 5-wide RAIDZ1, and add a disk, you'll still be using 1/5th (20%) of the space as parity, whereas newly written blocks will use 1/6th (17%) of the space as parity - a difference of 3%.

@GurliGebis Rewriting the blocks would cause them to be updated to the new data:parity ratio (e.g. saving 3% of space in the 5-wide example). Assuming there are no snapshots, copying every file, or dirtying every block (e.g. reading the first byte and then writing the same byte back) would do the trick. If there are snapshots (and you want to preserve their block sharing), you could use zfs send -R to copy the data. It should be possible to add some automation to make it easier to rewrite your blocks in the common cases.
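
For anyone who wants to try the zfs send -R route, a rough sketch (dataset
names are hypothetical; this temporarily needs enough free space for a second
copy of the data, and assumes nothing else is writing to the dataset):

      # zfs snapshot -r tank/data@rewrite
      # zfs send -R tank/data@rewrite | zfs receive tank/data-new
      # zfs destroy -r tank/data
      # zfs rename tank/data-new tank/data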

@stuartthebruce

It might be helpful to state explicitly whether extra space becomes available during expansion or only after expansion is completed.

@ahrens (Member, Author) commented Jun 11, 2021

@stuartthebruce Good point. This is mentioned in the commit message and PR writeup:

When the expansion completes, the additional space is available for use

But I'll add it to the manpage as well. Oh, it is already stated in the manpage:

the new device will contribute
additional space to the RAIDZ group once the expansion completes.

@stuartthebruce

But I'll add it to the manpage as well. Oh, it is already stated in the manpage:

the new device will contribute
additional space to the RAIDZ group once the expansion completes.

If it is not too pedantic, how about "additional space... only after the expansion completes"? The current wording leaves open the possibility that space might become available incrementally during expansion.

@felisucoibi commented Jun 11, 2021

One question: if I add 5 more drives to a 10-drive pool with this system, the 5 new ones have the new parity ratio; and if after that the 10 old ones are replaced step by step with the replace command, at the end, when all 10 old ones have been replaced, will the whole raid have the new parity? When replacing the old ones, is the old parity kept or is the new parity used, so we can recover the extra space?

@ahrens (Member, Author) commented Jun 12, 2021

@felisucoibi I'm not sure I totally understand your question, but here's an example that may be related to yours: If you start with a 10-wide RAIDZ1 vdev, and then do 5 zpool attach operations to add 5 more disks to it, you'll then have a 15-wide RAIDZ1 vdev. If the 5 new disks were bigger than the 10 old disks, and you then zpool replace each of the 10 old disks with a new big disk, then the vdev will be able to expand to use the space of the 15 new, large disks. In any case, old blocks will continue to use the existing data:parity ratio of 9:1 (10% parity), and newly written blocks will use the new data:parity ratio of 14:1 (6.7% parity). So the difference in space used by parity is only 3.3%.

@felisucoibi

@felisucoibi I'm not sure I totally understand your question, but here's an example that may be related to yours: If you start with a 10-wide RAIDZ1 vdev, and then do 5 zpool attach operations to add 5 more disks to it, you'll then have a 15-wide RAIDZ1 vdev. If the 5 new disks were bigger than the 10 old disks, and you then zpool replace each of the 10 old disks with a new big disk, then the vdev will be able to expand to use the space of the 15 new, large disks. In any case, old blocks will continue to use the existing data:parity ratio of 9:1 (10% parity), and newly written blocks will use the new data:parity ratio of 14:1 (6.7% parity). So the difference in space used by parity is only 3.3%.

Thanks for the answer. So the only way to recalculate the old blocks is to rewrite the data like you suggested.

@louwrentius (Contributor) commented Jun 12, 2021

After the expansion completes, old blocks remain with their old data-to-parity ratio

First of all, this is an awesome feature, thank you. If I may ask: why aren't the old blocks rewritten to reclaim some extra space? I can imagine that redistributing data only affects a smaller portion of all data and thus is faster, but the user then still has to rewrite data to reclaim storage space. It would be nice if this could be done as part of the expansion process, as an extra option, for people willing to accept the extra time required. For what it's worth.

@yorickdowne

ZFS has a philosophy of “don’t mess with what’s already on disk if you can avoid it. If need be go to extremes to not mess with what’s been written (memory mapping removed disks in pools of mirrors for example)”. Someone who wants old data rewritten can make that choice and send/recv, which is an easy operation. I like the way this works now.

@teambvd commented Jun 12, 2021

@louwrentius I'm of the same mind - coming from the other direction though, and given the code complexity involved, I was wondering if the data redistribution component was perhaps going to be a subsequent PR? I'd looked for an existing one in case it was already out there, but it could be something that's already thought of/planned and I just wasn't able to locate it.

Given the number of components involved and the complexity of the operations that'd be necessary, especially as it'd pertain to memory and snapshots, I could see it making sense to split the tasks up. I'm imagining something like -

  • expansion completes
  • existing stripe read - how do we ensure the data read is written back to the same vdev? If within the intent log, we'd need a method to direct those writes back to a specific vdev and bypass the current vdev allocation method
  • stripe re-written to 'new' stripe - Assuming there's snapshotted data on the vdev, how is that snapshot's metadata updated to reflect the new block locations? Metadata on other vdevs may (likely does) point to LBAs housed on this vdev, and updating a snapshot could be... problematic. Do we make it a requirement that no snapshots can exist in the pool for this operation to take place?
  • existing stripe freed

To me at least, the more I think about this, the more sense it'd make to have a 'pool level' rebalance/redistribution, as all existing data within a pool's vdevs is typically reliant upon one another. It'd certainly seem to simplify things compared to what I'm describing above I'd think. It also helps to solve other issues which've been longstanding, especially as it relates to performance of long lived pools, which may've had multiple vdevs added over time as the existing pool became full.

Anyway, I don't want to ramble too much - I could just see how it'd make sense at least at some level to have data redistribution be another PR.

@mufunyo commented Jun 12, 2021

Assuming there are no snapshots, copying every file, or dirtying every block (e.g. reading the first byte and then writing the same byte back) would do the trick. If there are snapshots (and you want to preserve their block sharing), you could use zfs send -R to copy the data. It should be possible to add some automation to make it easier to rewrite your blocks in the common cases.

Having an easily accessible command to rewrite all the old blocks, or preferably an option to do so as part of the expansion process would be greatly appreciated.

@ahrens (Member, Author) commented Jun 12, 2021

@louwrentius @teambvd @yorickdowne @mufunyo I think y'all are getting at a few different questions:

  1. What would be involved in re-allocating the existing blocks such that they use less space? Doing this properly - online, working with other existing features (snapshots, clones, dedup), without requiring tons of extra space - would require incrementally changing snapshots, which is a project of similar scale to RAIDZ Expansion. There are workarounds available that accomplish the same end result in restricted use cases (no snapshots? touch all blocks. plenty of space? zfs send -R).
  2. How much benefit would it be? I gave a few examples above, but it's typically a few percent (e.g. 5-wide -> 6-wide, you get at least 5/6th (83%) of a drive of additional usable space, and if you reallocated the existing blocks you could get a whole drive of usable space, i.e. an additional 17% of a drive, or ~3% of the whole pool). Wider RAIDZs will see less impact (9-wide -> 10-wide, you get at least 90% of a drive of additional usable space; you're missing out on 1% of the whole pool).
  3. Why didn't I do this yet? Because it's an incredible amount of work for little benefit, and I believe that RAIDZ Expansion as designed and implemented is useful for a lot of people.

All that said, I'd be happy to be proven wrong about the difficulty of this! Such a facility could be used for many other tasks, e.g. recompressing existing data to save more space (lz4 -> zstd). If anyone has ideas on how this could be implemented, maybe we can discuss them on the discussion forum or in a new feature request. Another area that folks could help with is automating the rewrite in restricted use cases (by touching all blocks or zfs send -R).
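
For the no-snapshots workaround ("touch all blocks"), a naive per-file rewrite
could look like the following sketch (hypothetical; it assumes no snapshots or
clones, no hard links, and enough free space for a temporary copy of each file):

      # Re-copy every file so its blocks are reallocated at the new width.
      find /tank/data -type f -print0 |
      while IFS= read -r -d '' f; do
          cp -p -- "$f" "$f.tmp" && mv -- "$f.tmp" "$f"
      done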

@Jerkysan

@louwrentius @teambvd @yorickdowne @mufunyo I think y'all are getting at a few different questions: [...]

I'd just like to state that I greatly, GREATLY appreciate the work you're doing. Frankly, this is one of the things holding lots of people back from using ZFS, and having the ability to grow a ZFS pool without having to add vdevs will be literally magical. I will be able to easily switch back to ZFS after this is complete. Again, THANK YOU VERY MUCH!!

@kellerkindt commented Jun 12, 2021

@ahrens As someone silently following the progress since the original PR, I also want to note that I really appreciate all the effort and commitment you have put and are putting into this feature! I believe once this lands, it'll be a really valuable addition to ZFS :)

Thank you

@ahrens (Member, Author) commented Jun 12, 2021

@kellerkindt @Jerkysan @Evernow Thanks for the kind words! It really makes my day to know that this work will be useful and appreciated! ❤️ 😄

@DayBlur commented Jun 13, 2021

Thanks for your work on this feature. It's exciting to finally see some progress in this area, and it will be useful for many people once released.

Do these changes lay any groundwork for future support for adding a parity disk (instead of a data disk - i.e., increasing the RAID-Z level)? Meaningfully growing the number of disks in an existing array would likely trigger a desire to increase the fault tolerance level as well.

Since the existing data is just redistributed, I understand that the old data would not have the increased redundancy unless rewritten. But I am still curious if your work that allows supporting old/new data+parity layouts simultaneously in a pool could also apply to increasing the number of parity disks (and algorithm) for future writes.

@rickatnight11

I'm also incredibly stoked for this issue, and understand the decision to separate reallocation of existing data into another FR. Thanks so much for all of the hard work that went into this. I can't wait to take advantage of it.

After performing a raidz expansion is there at least an accurate mechanism to determine which objects (files, snapshots, datasets....I admit I haven't fully wrapped my head around how this impacts things, so apologies for not using the correct terms) map to blocks with the "old" data-to-parity ratio and possibly calculate the resulting space increase? I imagine many administrators desire a balance between maximizing space, getting all of the benefits of ZFS (checksumming, deduplication, etc), and the flexibility of expanding storage (yes, we want to eat all the cakes) and will naturally compare this feature to other technologies, such as md raid, where growing a raid array triggers a recalculation of all parity. As such, these administrators will want to be able to plan out how to do the same on a zpool with an expanded raidz vdev without just blindly rewriting all files.

@ahrens (Member, Author) commented Jun 14, 2021

@DayBlur

Do these changes lay any groundwork for future support for adding a parity disk (instead of a data disk - i.e., increasing the RAID-Z level)? ... I am still curious if your work that allows supporting old/new data+parity layouts simultaneously in a pool could also apply to increasing the number of parity disks (and algorithm) for future writes.

Yes, that's right. This work allows adding a disk, and future work could be to increase the parity. As you mentioned, the variable, time-based geometry scheme of RAIDZ Expansion could be leveraged to know that old blocks have the old amount of parity, and new blocks have the new amount of parity. That work would be pretty straightforward.

However, the fact that the old blocks remain with the old amount of failure tolerance means that overall you would not be able to tolerate an increased number of failures, until all the old blocks have been freed. So I think that practically, it would be important to have a mechanism to at least observe the amount of old blocks, and probably also to reallocate the old blocks. Otherwise you won't actually be able to tolerate any more failures without losing [the old] data. This would be challenging to implement in the general case but as mentioned above there are some OK solutions for special cases.

@ahrens (Member, Author) commented Jun 14, 2021

@rickatnight11

After performing a raidz expansion is there at least an accurate mechanism to determine which objects (files, snapshots, datasets....I admit I haven't fully wrapped my head around how this impacts things, so apologies for not using the correct terms) map to blocks with the "old" data-to-parity ratio and possibly calculate the resulting space increase?

There isn't an easy way to do this, but you could get this information out of zdb. Basically you are looking for blocks whose birth time is before the expansion completion time. I think it would be relatively straightforward to make some tools to help with this problem. A starting point might be to report which snapshots were created before the parity-expansion completed, and therefore would need to be destroyed to release that space.
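
As a starting point along those lines (an editor's sketch, not an existing
tool): compare snapshot creation times against when zpool status says the
expansion finished, since snapshots created before that point hold references
to blocks written at the old ratio:

      # zpool status tank | grep -i expand     # shows when the expansion completed
      # zfs list -H -t snapshot -o name,creation -s creation -r tank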

@shodanshok (Contributor)

@ahrens I also wish to thank you for the awesome work!

I have some questions:

  • do DVAs need to change after a disk is attached to a RAIDZ vdev?
  • if so, how are DVAs rewritten? Are you using a "placeholder" pointing to the new DVA?
  • if DVAs do not change, how does the relocation actually work?

Thanks.

@bghira commented Jun 16, 2021

Frankly, this is one of the things holding lots of people back from using ZFS

curious what filesystem or pooling / RAID setup these imaginary people went with instead; presumably it also checks most of the boxes that ZFS does? volume manager, encryption provider, compressing filesystem, with snapshots etc..

@Jerkysan

Frankly, this is one of the things holding lots of people back from using ZFS

curious what filesystem or pooling / RAID setup these imaginary people went with instead; presumably it also checks most of the boxes that ZFS does? volume manager, encryption provider, compressing filesystem, with snapshots etc..

The answer is "we settled"... I settled for unraid though until that point I had run ZFS for years and years. I could no longer afford to buy enough drives all at once to build a ZFS box outright. I had life obligations that I had to meet while wanting to continue to my hobby that seemingly never stops expanding in cost frankly. I needed something that could grow with me. It just simply doesn't give me the compressing file system, snapshots, and yada yada.

Basically, I had to make sacrifices to continue my hobby without literally breaking the bank. This functionality will allow me to go back to ZFS. I don't care if I have to move all the files to get them to use the new drives I'm adding and such. It's not "production critical" but I do want all the "nice things" that ZFS offers. This project will literally "give me my cake and let me eat it to". I've been waiting since it was first announced years ago and hovering over the searches waiting to see a new update. I'm already trying to figure out how I'm going to slide into this.

@justinclift commented Mar 4, 2023

Its causing data loss for me (zfs panic)

@kocoman1 Is that on Linux, or did you get it to compile on macOS? 😄

@AT-StephenDetomasi

Has this feature been merged or not?!

To answer this question directly - no. More work and testing is required. I'd say we're at least 12 months out. Come back in March 2024 and see how things are going.

@AlterX76 commented Mar 4, 2023

How hard is it to jump into the project for new devs?

@kocoman1 commented Mar 4, 2023

Its causing data loss for me (zfs panic)

@kocoman1 Is that on Linux, or did you get it to compile on macOS? 😄

I didn't try any more on OS X (it was on Linux) after I could not even get an ls of which files I lost. The SAS2008 does work with an external kext on Ventura OK. I added extfs/apfs but it sometimes reboots during load; not sure where the bug is. Plus the unmount takes so long that it panics when I shut the machine down on OS X.

Plus I have a bunch of STDM3000 and 2.5 4000 series SMR drives that are dying.

@rwkeane commented Mar 11, 2023

So I would love RaidZ expansion for a personal project, and am hoping to try it. I pulled the PR branch and tried to build from source, but the branch was too out-of-date to build against my current Ubuntu version (most recent public release).

So instead I am considering the following:

  1. Installing an old version of Linux on a spare hard drive and building ZFS from source there
  2. Exporting my ZFS array from my current OS and importing it on the new one
  3. Expanding the array on the new OS using my from-source build
  4. Porting the post-expansion array back to my current Linux OS.

But I am missing the context to know if this is a good idea. So a few questions:

  • What level of risk am I looking at? Has there been any change to the exported array schema I should worry about? Is there any reason I shouldn't do this?
  • What version of Linux would you recommend I go with / is most stable in this branch for my experiment?
  • Are there any extra flags I should add or debug features I should turn on when building / running to help provide a useful bug report if it fails?

I don't have the spare time to pick up a new codebase to help with rebasing the branch, so I might as well make myself a guinea pig next time I've got some free time

@HammyHavoc

  • What level of risk am I looking at? Has there been any change to the exported array schema I should worry about? Is there any reason I shouldn't do this?

The level of risk is directly proportionate to your ability to restore the data from a backup.

@Fmulder007

I am a novice user of FreeBSD and not so long ago I started using ZFS for my experiments. I am looking forward to the appearance of this functionality! Many thanks for your work!

@kocoman1

So I would love RaidZ expansion for a personal project, and am hoping to try it. [...] I might as well make myself a guinea pig next time I've got some free time

I had data loss after expanding it about 5 times (expand, copy some files from the drives to be erased and expanded onto, etc.). Then when I read the data back I get a ZFS panic and any ls just hangs. I was able to mount read-only, but ls in some directories just results in an error, and copying data back gets checksum and I/O abort errors. So I don't think it's ready, unless they have fixed something.

@bachp commented Mar 29, 2023

@kocoman1 Could you provide some more detailed error description? This might help pinpoint the issue.

At minimum:

  • What steps did you take exactly? This can help reproduce the issue.
  • What are the exact error messages you see?

@kocoman1 commented Mar 29, 2023

@kocoman1 Could you provide some more detailed error description? This might help pinpoint the issue.

At minimum:

  • What steps did you take exactly? This can help reproduce the issue.
  • What are the exact error messages you see?

It's hidden in "kocoman1 commented on Sep 29, 2021":
#12225 (comment)

VERIFY3(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed (36028797018963967 < 32768)
[Wed Sep 29 22:13:28 2021] PANIC at zio.c:322:zio_buf_alloc()
 

[Wed Sep 29 22:13:28 2021] Call Trace: 

[Wed Sep 29 22:13:28 2021] show_stack+0x52/0x58  
[Wed Sep 29 22:13:28 2021] dump_stack+0x70/0x8b
[Wed Sep 29 22:13:28 2021] spl_dumpstack+0x29/0x2b [spl]
[Wed Sep 29 22:13:28 2021] spl_panic+0xd4/0xfc [spl]  
[Wed Sep 29 22:13:28 2021] ? _cond_resched+0x1a/0x50  
[Wed Sep 29 22:13:28 2021] ? _cond_resched+0x1a/0x50
[Wed Sep 29 22:13:28 2021] ? __kmalloc_node+0x144/0x2b0  
[Wed Sep 29 22:13:28 2021] ? spl_kvmalloc+0x82/0xb0 [spl]  
[Wed Sep 29 22:13:28 2021] zio_buf_alloc+0x5e/0x60 [zfs]                                                  
[Wed Sep 29 22:13:28 2021] abd_borrow_buf_copy+0x6e/0xa0 [zfs]                                                  
[Wed Sep 29 22:13:28 2021] annotate_ecksum+0x85/0x4d0 [zfs]                                                  
[Wed Sep 29 22:13:28 2021] ? abd_cmp_cb+0x10/0x10 [zfs]                                                  
[Wed Sep 29 22:13:28 2021] zfs_ereport_finish_checksum+0x2c/0xb0 [zfs]                                                  
[Wed Sep 29 22:13:28 2021] zio_vsd_default_cksum_finish+0x14/0x20 [zfs]                                                  
[Wed Sep 29 22:13:28 2021] zio_done+0x2fb/0x11b0 [zfs]                                                  
[Wed Sep 29 22:13:28 2021] zio_execute+0x8b/0x130 [zfs]                                                  
[Wed Sep 29 22:13:28 2021] taskq_thread+0x2b7/0x500 [spl]                                                  
[Wed Sep 29 22:13:28 2021] ? wake_up_q+0xa0/0xa0                                                  
[Wed Sep 29 22:13:28 2021] ? zio_gang_tree_free+0x70/0x70 [zfs]                                                  
[Wed Sep 29 22:13:28 2021] kthread+0x11f/0x140                                                  
[Wed Sep 29 22:13:28 2021] ? taskq_thread_spawn+0x60/0x60 [spl]                                                  
[Wed Sep 29 22:13:28 2021] ? set_kthread_struct+0x50/0x50                                                  
[Wed Sep 29 22:13:28 2021] ret_from_fork+0x22/0x30                                                  

@mmayer commented Apr 7, 2023

Is there anything specific volunteers could do to aid in testing this feature? What would help the most?

Also, would it be possible to rebase this branch on master? There is a great number of merge conflicts right now, which makes it a lot more difficult to test this branch. I also suspect that it will make any tests being conducted less meaningful, since features already present upstream won't be tested together with RAIDZ expansion. In other words, the more outdated this branch becomes, the less sense it makes to use it or to rely on it.

@kocoman1 commented Apr 8, 2023

Is there anything specific volunteers could do to aid in testing this feature? What would help the most?

Also, would it be possible to rebase this branch on master? There is a great number of merge conflicts right now, which makes it a lot more difficult to test this branch. I also suspect that it will make any tests being conducted less meaningful, since features already present upstream won't be tested together with RAIDZ expansion. In other words, the more outdated this branch becomes, the less sense it makes to use it or to rely on it.

You can try what I did that caused the error:

Have a bunch of data that can be lost.
Prepare about 10 HDDs with some data (maybe small ones so it's faster?).

Initially create a raidz-expansion-capable pool with raidz2 (in my case) with the minimum 3 (?) drives, then add another drive, copy more data in until it's full (most important), and rinse and repeat until you reach 10 drives to see whether any error occurs during the process. You can also scrub after each expansion.

@justinclift

Just to point out for anyone that would like to test this, but doesn't have much spare hardware... testing this in virtual machines is a workable idea. eg virtual machine with virtual disks. Use them to simulate adding more disks, pulling the plug on some while it's doing stuff etc.

My problem is lack of time currently. 👼

@owlshrimp

Just to point out for anyone that would like to test this, but doesn't have much spare hardware... testing this in virtual machines is a workable idea. eg virtual machine with virtual disks. Use them to simulate adding more disks, pulling the plug on some while it's doing stuff etc.

My problem is lack of time currently. angel

One of the great powers of ZFS has been the ability to supply fake file-backed disks, used in testing. Might still bring down the kernel though if not in a VM.
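
For example, a throwaway pool on sparse files (for experiments only, on a build
of this branch; never for real data) could exercise the expansion like this:

      # truncate -s 1G /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4 /var/tmp/d5
      # zpool create testpool raidz2 /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4
      # zpool attach testpool raidz2-0 /var/tmp/d5    # expand to 5-wide
      # zpool scrub testpool && zpool status -v testpool
      # zpool destroy testpool && rm /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4 /var/tmp/d5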

@darkbasic

Realistically I don't think the bottleneck is the lack of testing but rather the lack of code reviews (AFAIK, except for a couple of PRs there isn't any other kind of feedback to address) and mostly @ahrens himself being busy with other stuff (the previously mentioned PRs have been pending for more than a year). So I suggest either waiting until his priorities align with ours or being prepared to contribute in some meaningful way. This feature was teased as possibly being addressed in the next major version of ZFS (I don't remember which company was interested in it) but honestly I'm starting to highly doubt it.

@kocoman1

Just to point out for anyone that would like to test this, but doesn't have much spare hardware... testing this in virtual machines is a workable idea. eg virtual machine with virtual disks. Use them to simulate adding more disks, pulling the plug on some while it's doing stuff etc.
My problem is lack of time currently. angel

One of the great powers of ZFS has been the ability to supply fake file-backed disks, used in testing. Might still bring down the kernel though if not in a VM.

Is there a tutorial for that? I want to combine a 1 TB and a 2 TB drive to make 3 TB to replace a failing 3 TB drive.
Thanks

@owlshrimp

Just to point out for anyone that would like to test this, but doesn't have much spare hardware... testing this in virtual machines is a workable idea. eg virtual machine with virtual disks. Use them to simulate adding more disks, pulling the plug on some while it's doing stuff etc.
My problem is lack of time currently. angel

One of the great powers of ZFS has been the ability to supply fake file-backed disks, used in testing. Might still bring down the kernel though if not in a VM.

Is there a tutorial for that? I want to combine a 1 TB and a 2 TB drive to make 3 TB to replace a failing 3 TB drive. Thanks

You can supply two drives as top-level vdevs and ZFS will essentially stripe across them. If either drive (1 TB or 2 TB) fails you will lose all the data though, and any errors will be uncorrectable in this configuration.

Please ask support questions unrelated to this feature on a public forum like https://www.reddit.com/r/openzfs/
There are a lot of people subscribed to this issue and it is a waste of everyone's time to post about things which don't help the work being done.

@felisucoibi

I've been waiting for this feature to be merged since the beginning. It seems the FreeBSD folks announced this as finished (one year ago):
https://freebsdfoundation.org/blog/raid-z-expansion-feature-for-zfs/
But at the same time it seems the PRs here are getting outdated... what is happening?

@owlshrimp commented Apr 20, 2023

@felisucoibi

Please read the thread before replying. As per the comment three comments before yours:

Realistically I don't think the bottleneck is the lack of testing but rather the lack of code reviews (AFAIK, except for a couple of PRs there isn't any other kind of feedback to address) and mostly @/ahrens himself being busy with other stuff (the previously mentioned PRs have been pending for more than a year). So I suggest either waiting until his priorities align with ours or being prepared to contribute in some meaningful way. This feature was teased as possibly being addressed in the next major version of ZFS (I don't remember which company was interested in it) but honestly I'm starting to highly doubt it.

This work is functional but requires both cleanup and code reviews. Other people have also experienced bugs during recent testing. In order for this to be done there has to be time for @/ahrens and others to work on it, but they are very busy with other priorities.

There are a lot of people subscribed to this issue and it is a waste of everyone's time to post about things which have already been answered.

@barrelful commented Apr 20, 2023

I try to believe in open-source projects with all my will, but situations like in this PR create great demotivation. Many projects and sometimes important features depend on one specific person who does not continue what they started, creating false hope for users. I feel very frustrated and anxious (maybe not only me). The author last committed code on 25 Feb 2022, more than one year ago; then we got some changes from @fuporovvStack, and the last activity was on 16 Nov 2022, almost 6 months ago. Also the maintainers and developers do not communicate anything about intentions or status. It looks to me like there is no intention to continue it.

I feel sad that this project does not seem sustainable, committed, and respectful, even with hundreds of people interested in it. For me this feature is stale/abandoned. I will be better off investing my time in finding an alternative FS that fulfills my requirements; if someone else feels the same way as me, we can try to find a better solution together, since I don't have enough knowledge of the internals of this project to fix the ZFS code (and it seems the people who do don't really want to).

@Evernow commented Apr 20, 2023

Also the maintainers and developers do not communicate anything about intentions or status. It looks to me like there is no intention to continue it.

May I suggest adding this to the agenda then on the leadership meeting on the 25th of April (5 days from now)? Details here: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit

@ThinkChaos

Could someone try and find a solution to limit spam in this PR? Most of the comments are just noise. (including mine, sorry!)
Maybe locking it so only contributors can comment and having a separate discussion for testing and feedback from third parties?

@kwinz commented May 1, 2023

Also the maintainers and developers do not communicate anything about intentions or status. It looks to me like there is no intention to continue it.

May I suggest adding this to the agenda then on the leadership meeting on the 25th of April (5 days from now)? Details here: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit

Recording of said leadership meeting: https://youtu.be/sZJMFvjqXvE?t=1490

@hendrik42

Thanks for giving us an update in that meeting @behlendorf

Here's the transcript from youtube:

Here on the list and I don't have much to say about it either is raid Z expansion.
I'm not quite sure what's going on with this one - maybe others have some more insight - but I know this is still work that Matt did that needs [..] more code review, needs more testing, it needs to be rebased so there's still a bunch of work there to do time to get it finalized.

I know people have done some testing with it but it's still not quite wrapped up but it would be great to push this work forward and get it rebased and tested because I know this is a feature a lot of people are keen on. It just hasn't quite come together yet, but it's a big lift

@chrisjsimpson mentioned this pull request May 8, 2023
@kyesil commented Jun 11, 2023

Is it happening?
Here:
https://www.qnap.com/en-us/operating-system/quts-hero/5.1.0
QuTS Hero

@kmoore134

Pleased to announce that iXsystems is sponsoring the efforts by @don-brady to get this finalized and merged. Thanks to @don-brady and @ahrens for discussing this at the OpenZFS leadership meeting today. Looking forward to an updated PR soon.

https://www.youtube.com/watch?v=2p32m-7FNpM

@don-brady mentioned this pull request Jun 29, 2023
@ahrens (Member, Author) commented Jun 29, 2023

@don-brady has taken over this work and opened #15022. Thanks Don, and thanks iXsystems for sponsoring his work! See the June 2023 OpenZFS Leadership Meeting for a brief discussion of the transition.

@ahrens closed this Jun 29, 2023