
Add upper bound for slop space calculation #11023

Merged: 1 commit merged into openzfs:master on Feb 24, 2021

Conversation

@prakashsurya (Member) commented Oct 7, 2020

This change modifies the behavior of how we determine how much slop
space to use in the pool, such that it now has an upper limit. The
default upper limit is 128G, but it is configurable via a new tunable.
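For illustration, here is a minimal C sketch of the intended calculation (not the in-tree spa_get_slop_space(); the tunable names mirror the ones discussed in this PR, but the defaults shown, in particular spa_min_slop, are assumptions):

```c
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Assumed defaults for the sketch; the real values live in the module code. */
static int spa_slop_shift = 5;                        /* reserve ~1/32 of the pool */
static uint64_t spa_min_slop = 128ULL * 1024 * 1024;  /* assumed 128M floor */
static uint64_t spa_max_slop = 128ULL << 30;          /* new 128G ceiling */

/*
 * Simplified model: 1/2^spa_slop_shift of the pool, raised to spa_min_slop
 * on tiny pools (but never more than half the pool) and, with this change,
 * capped at spa_max_slop.
 */
static uint64_t
slop_space(uint64_t space)
{
	return (MIN(spa_max_slop,
	    MAX(space >> spa_slop_shift, MIN(space >> 1, spa_min_slop))));
}
```

Under these assumed values, a 1TB pool still reserves 32G, while anything at or above 4TB is pinned to the 128G cap.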


@behlendorf added the "Status: Code Review Needed" label Oct 8, 2020
- return (MAX(space >> spa_slop_shift, MIN(space >> 1, spa_min_slop)));
+ return (MIN(spa_max_slop,
+     MAX(space >> spa_slop_shift,
+     MIN(space >> 1, spa_min_slop))));
Contributor:

It looks like the comment above this function needs to be updated too.

@prakashsurya (Member Author):

Yeah, there are a few comments that'll need to be updated. I didn't worry about that yet, until we agree this is something we can actually move forward with.

Contributor:

I think we can and should. 128GiB feels plenty conservative.

@prakashsurya (Member Author):

@ahrens do you agree with @behlendorf?

I was going to try to gather some data to verify that this is a reasonable upper bound, but I'm not really sure how to do that, and I didn't make much progress toward that goal during the hackathon yesterday.

@beren12 (Contributor) commented Dec 31, 2020

This would be a welcome change, especially for those of us with drastically different pool sizes. It seems like a small change; could we get this added soon?

@behlendorf (Contributor):

@prakashsurya if you get a moment, would you mind rebasing this on the latest code from the master branch?

@prakashsurya (Member Author):

@behlendorf done.

FWIW, we've determined this change isn't strictly necessary for our product, so it's not high on my priority list anymore. Further, I was a little unsure how to guarantee the correctness of this change for all the different pool configurations and workloads that are possible, which is why I haven't tried to push it forward since initially opening the PR.

If we all collectively agree this is a safe change to make as-is, I can try to push it forward; but if we think it'll take a significant amount of effort, and/or a significant number of changes, to guarantee it's safe to land, I doubt I'll have the time to spend on it.

@behlendorf (Contributor):

@prakashsurya thanks! Agreed, I think the open question is whether we can determine the worst-case upper bound for a pool. Or, more specifically, how much space must we always leave free in order to accommodate any additional unaccounted-for changes which may be required when syncing a transaction group. This includes MOS updates, ZIL metaslabs, administrative operations, etc.

Let me try to make a convincing argument. While I don't think we can calculate an exact value, I would suggest that zfs_dirty_data_max_max (max 4G) does give us a decent first-order approximation. To exhaust the slop space with this as a maximum, we'd need to end up writing twice as much data as expected. That doesn't seem impossible to me, particularly when the pool is full and fragmented, or there are complex administrative operations to handle. Moreover, to get good performance on very fast pools it's pretty common to increase zfs_dirty_data_max considerably, so I think it would be safer to assume 128G here. Then, to be on the safe side, I'd like to leave ourselves at least another factor of 2x, which brings us to 256G, right in line with what you were suggesting.

This would effectively mean that for any pool smaller than 8TB there wouldn't be any change to the slop space (8TB * 3.2% = 256G). That seems reasonable to me, though I could be talked into picking an even more conservative value (512G or even 1TB).

The other key concern would be the need to leave enough headroom to avoid pathologically bad allocation times when there isn't much free space. However, all of the recent allocator work has really improved things in this area, so it's perhaps somewhat less of a concern. And frankly, 256G of free space is a considerable amount by most measures!

@ahrens what do you think? Are there other concerns, or have I overlooked something important? My particular interest in this is for very large pools (>1PB), where we wind up unnecessarily reserving multiple terabytes of capacity.
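As a quick sanity check of the 8TB crossover figure above, here is a small sketch (the shift of 5 and the 256G cap are the values from this discussion, not necessarily the shipped defaults):

```c
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* Values from the discussion above: shift of 5 (1/32) and a 256G cap. */
	int slop_shift = 5;
	uint64_t max_slop = 256ULL << 30;

	/*
	 * The cap only changes anything once (pool_size >> slop_shift)
	 * exceeds max_slop, i.e. for pools larger than max_slop << slop_shift.
	 */
	uint64_t crossover = max_slop << slop_shift;

	/* 256G * 32 = 8T, matching the 8TB figure quoted above. */
	printf("cap takes effect above %llu TiB\n",
	    (unsigned long long)(crossover >> 40));
	return (0);
}
```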

@ahrens (Member) commented Jan 28, 2021

@behlendorf

> how much space must we always leave free in order to accommodate any additional unaccounted-for changes which may be required when syncing a transaction group. This includes MOS updates, ZIL metaslabs, administrative operations, etc.

I agree. It's really the MOS updates that we need to worry about. The ZIL metaslabs are a known quantity, and "administrative operations" basically means changes to the MOS.

> I would suggest that zfs_dirty_data_max_max (max 4G) does give us a decent first-order approximation

I don't see how the amount of dirty data is related to the amount of MOS updates, which kind of invalidates this line of reasoning.

Let me try a few other lines of reasoning that end with a similar conclusion:

  1. Let's say you have an 8TB pool. I think this is large enough to do an essentially unlimited amount of MOS operations; i.e., if you want to create/delete a ton of filesystems/snapshots, or set a ton of properties, you will not run into pool-size limits. So you don't need more slop space than an 8TB pool has [256GB].
  2. Essentially all the MOS modifications for one TXG will have to be in memory at the same time, so you certainly don't need more free space than you have RAM. 1TB is currently an enormous amount of RAM, and by default the ARC will use up to half of it, so it's very unlikely that you could write more than 512GB to the MOS in one TXG. (But this assessment could look dated in a few years as RAM sizes increase. At a bare minimum, we could say that the slop space needn't be more than the RAM of the running system, at least for reasonably-large-memory systems.)
  3. Trying to think of single operations that would write a lot to the MOS and would be allowed even when there's barely any space available, metaslab condensing comes to mind. A metaslab's spacemap could in the worst case have metaslab_size / (1 << ashift) / 2 entries, each of which is typically 8 bytes. So it could be metaslab_size / 128 bytes (with ashift=9). On large vdevs (up to 2PB), metaslab_size is 16GB, so a worst-case spacemap is 128MB (worked through numerically below).

My takeaway is that it's hard to imagine ever needing more than hundreds of GB of slop space.
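To make point 3 above concrete, here is a small worked example (a sketch only; the 16GB metaslab size, ashift=9, and 8-byte entry size are the assumptions stated in that point):

```c
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* Assumptions from point 3: 16GB metaslab, ashift=9, 8-byte entries. */
	uint64_t metaslab_size = 16ULL << 30;
	int ashift = 9;
	uint64_t entry_size = 8;

	/* Worst case: every other block allocated, so one entry per pair of blocks. */
	uint64_t entries = metaslab_size / (1ULL << ashift) / 2;
	uint64_t spacemap_bytes = entries * entry_size;  /* == metaslab_size / 128 */

	/* Prints 128 MiB for these inputs. */
	printf("worst-case spacemap: %llu MiB\n",
	    (unsigned long long)(spacemap_bytes >> 20));
	return (0);
}
```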

@beren12 (Contributor) commented Jan 28, 2021

A tunable for those of us with small systems would be amazing as well. I have 128GB of RAM on my 150TB (raw) server, so there's no need to even reserve the maximum value here. If only it could be tuned per pool: I have a 256GB root pool plus 150TB and 48TB storage pools.

@behlendorf (Contributor):

> Let me try a few other lines of reasoning that end with a similar conclusion:

All good points! It sounds like we're in agreement this change as it stands is safe and very unlikely to cause any problems.

> I don't see how the amount of dirty data is related to the amount of MOS updates

It's not; I didn't convey what I meant very well. Let me try again briefly. Back in commit 3ec3bc2 we relaxed the tx_space_* accounting and in the process significantly simplified it. My concern was that with a heavily fragmented, full pool we could significantly underestimate how much needs to be written to disk (due to zap updates, indirect blocks, gang blocks, etc.) and allow more dirty data than we should have into the txg. This would all be entirely independent of the MOS updates.

> A tunable for those of us with small systems would be amazing as well.

This won't change the behavior at all for small pools. We still do want to reserve ~3.2% (1/2^5) of the total pool capacity; for your 256G pool that's going to be 8G. That seems pretty reasonable to me.

@beren12 (Contributor) commented Jan 28, 2021

Yeah, I guess I was thinking that if the max limit is sized for 1TB of RAM and 1PB of storage, it could still safely be much less, say 1/10th of that limit, on a smaller pool.

@ahrens (Member) commented Jan 28, 2021

@behlendorf I see. In practice, thanks to spa_asize_inflation=24, it's very very unlikely that we could underestimate the space usage, even on a highly fragmented system.

@behlendorf marked this pull request as ready for review February 2, 2021 17:44
@behlendorf (Contributor):

@prakashsurya @ahrens I've marked this PR ready for review and approved it. Based on the discussion above I suggest we go ahead and merge this change as is.

@prakashsurya (Member Author):

@behlendorf OK, let me go back and fix up some of the comments that I haven't gotten to yet.

@prakashsurya changed the title from "WIP: Add upper bound for slop space calculation" to "Add upper bound for slop space calculation" Feb 2, 2021
@prakashsurya (Member Author):

@behlendorf I've updated some of the comments and rebased onto master. If you and @ahrens are good with this, I'm good with landing it. Thanks for all the help.

@behlendorf (Contributor) left a comment:

Looks good to me, thanks!

@behlendorf added the "Status: Accepted" label and removed the "Status: Code Review Needed" label Feb 8, 2021
/*
 * Additionally, slop space should never exceed spa_max_slop.
 */
slop = MIN(slop, spa_max_slop);
Member:

To be extra precise, on medium to large pools, I think that the space that is unavailable today is the sum of:

  • the 4MB labels/boot region per vdev (also not included in the zpool SIZE property)
  • the metaslab rounding (up to 16GB on large vdevs) per vdev (also not included in the zpool SIZE property)
  • 1/32nd of the remainder (which is the sum of the embedded log metaslab and the return value of spa_get_slop_space()) (included in zpool SIZE but not zfs USED+AVAIL)

Considering the interaction with the embedded log class, I think that the proposed change is that for large pools (vdev size >2TB, total pool size >4TB), the space that's unavailable will be the sum of:

  • the 4MB labels/boot region per vdev (also not included in the zpool SIZE property)
  • the metaslab rounding (up to 16GB per vdev) (also not included in the zpool SIZE property)
  • the embedded log metaslab (16GB per vdev) (included in zpool SIZE but not zfs USED+AVAIL)
  • 128GB (included in zpool SIZE but not zfs USED+AVAIL)

The embedded log metaslab change kept the user-visible slop space (the difference between zpool SIZE and zfs USED+AVAIL) the same. The proposed change exposes this, so the user-visible slop space will depend on the number of vdevs and the value of zfs_embedded_slog_min_ms.

We might want to keep the user-visible slop space constant regardless of the embedded log (on most configurations). For example we could set the default spa_max_slop to 256GB, but have spa_get_slop_space() return MAX(128GB, 256GB - embedded_log_space). That way with up to 8 top-level vdevs, the spa_max_slop exactly equals the user-visible slop space. To implement this, we would just move the new code before the embedded_log adjustment.
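A minimal sketch of that suggestion, for illustration only (the names capped_slop, raw_slop, and embedded_log_space are hypothetical; only spa_max_slop and the 128G/256G values come from the discussion above):

```c
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Hypothetical variant: raise the default cap to 256G, let the embedded log
 * space eat into it, but never let the cap drop below 128G. This mirrors
 * MAX(128GB, 256GB - embedded_log_space) while avoiding unsigned underflow.
 */
static uint64_t
capped_slop(uint64_t raw_slop, uint64_t embedded_log_space)
{
	const uint64_t spa_max_slop = 256ULL << 30;  /* assumed 256G default */
	const uint64_t slop_floor = 128ULL << 30;    /* 128G lower bound on the cap */

	uint64_t cap = (embedded_log_space < spa_max_slop - slop_floor) ?
	    spa_max_slop - embedded_log_space : slop_floor;

	return (MIN(raw_slop, cap));
}
```

Under this model, with 8 top-level vdevs (128G of embedded log) the cap bottoms out at 128G, so on a large pool the user-visible total of returned slop plus embedded log stays at 256G, matching the observation above.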

@ahrens added the "Status: Code Review Needed" label and removed the "Status: Accepted" label Feb 11, 2021
This change modifies the behavior of how we determine how much slop
space to use in the pool, such that now it has an upper limit. The
default upper limit is 128G, but is configurable via a tunable.

Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
@prakashsurya (Member Author):

@ahrens let me know if my latest revision is what you had in mind.

@ahrens (Member) commented Feb 22, 2021

@prakashsurya Looks great to me.

@behlendorf removed the "Status: Code Review Needed" label and added the "Status: Accepted" label Feb 23, 2021
@ahrens mentioned this pull request Feb 23, 2021
@behlendorf merged commit f01eaed into openzfs:master Feb 24, 2021
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
This change modifies the behavior of how we determine how much slop
space to use in the pool, such that now it has an upper limit. The
default upper limit is 128G, but is configurable via a tunable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes openzfs#11023
@jwittlincohen (Contributor):

Is there any plan to add this minor patch to the 2.0 or 2.1 branches? I have a 192TB, a 120TB, and a 1TB pool on my main system. With the default spa_slop_shift setting (~3.2%), that's over 6TB set aside for the 192TB pool. I've changed the setting to 9 (~375GB slop), but that is certainly not optimal for the 1TB pool (2GB).

sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021 (same commit message as above).

rincebrain pushed a commit to rincebrain/zfs that referenced this pull request Aug 26, 2021 (same commit message, plus a backporting note: "Snipped out the embedded_log portion of the changes.").

behlendorf pushed a commit that referenced this pull request Aug 30, 2021 (same backported commit message).

tonyhutter pushed a commit that referenced this pull request Sep 22, 2021 (same backported commit message).