This repository has been archived by the owner on Nov 7, 2019. It is now read-only.

9486 reduce memory used by device removal on fragmented pools #627

Closed
wants to merge 1 commit

Conversation

@ahrens (Member) commented Apr 17, 2018

Device removal allocates a new location for each allocated segment on
the disk that's being removed. Each allocation results in one entry in
the mapping table, which maps from old location + length to new
location. When a fragmented disk is removed, this can result in a large
number of mapping entries, and thus a large amount of memory consumed by
the mapping table. In the worst real-world cases, we've seen around 1GB
of RAM per 1TB of storage removed.
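
Conceptually, each entry just records where a contiguous region of the removed device now lives. A minimal illustrative sketch follows; the field names are made up here and do not match the actual vdev_indirect_mapping layout, but the information content is the same, which is why memory use scales with the number of entries:

/*
 * Illustrative only: one entry per copied region.  A highly fragmented
 * device (many small allocated segments) produces many such entries,
 * which is the worst case for mapping-table memory.
 */
typedef struct removal_mapping_entry {
	uint64_t rme_src_offset;	/* offset on the device being removed */
	uint64_t rme_length;		/* length of the mapped region */
	uint64_t rme_dst_vdev;		/* destination vdev id */
	uint64_t rme_dst_offset;	/* offset on the destination vdev */
} removal_mapping_entry_t;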

We can improve on this situation by allocating larger segments, which
span across both allocated and free regions of the device being removed.
By including free regions in the allocation (and thus mapping), we
reduce the number of mapping entries. For example, if we have a 4K
allocation followed by 1K free and then 4K allocated, we would allocate
4+1+4 = 9KB, and then move the entire region (including allocated and
free parts). In this case we used one mapping where previously we would
have used two, but often the ratio is much higher (up to 20:1 in
real-world use). We then need to mark the regions that were free on the
removing device as free in the new locations, and also obsolete in the
mapping entry.
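
To make the example concrete, here is a small self-contained toy program (not ZFS code) that counts how many mapping entries are needed with and without coalescing across small free gaps; with a 1K gap allowance the two 4K segments above collapse into a single mapping:

#include <stdio.h>
#include <stdint.h>

/* One allocated segment on the device being removed (illustrative). */
typedef struct seg {
	uint64_t start;
	uint64_t length;
} seg_t;

/*
 * Count the mapping entries needed if segments separated by a free gap
 * of at most max_gap bytes are coalesced into one mapped region.
 * max_gap == 0 models the old behavior: one entry per allocated segment.
 */
static int
count_mappings(const seg_t *segs, int nsegs, uint64_t max_gap)
{
	int entries = 0;
	uint64_t prev_end = 0;

	for (int i = 0; i < nsegs; i++) {
		if (entries == 0 || segs[i].start - prev_end > max_gap)
			entries++;	/* start a new mapped region */
		prev_end = segs[i].start + segs[i].length;
	}
	return (entries);
}

int
main(void)
{
	/* The example above: 4K allocated, 1K free, 4K allocated. */
	seg_t segs[] = { { 0, 4096 }, { 5120, 4096 } };

	printf("old behavior: %d mappings\n", count_mappings(segs, 2, 0));
	printf("spanning 1K gaps: %d mappings\n", count_mappings(segs, 2, 1024));
	return (0);
}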

This method preserves the fragmentation of the removing device, rather
than consolidating its allocated space into a small number of chunks
where possible. But it results in a drastic reduction of memory used by
the mapping table - around 20x in the most-fragmented cases.

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

External-issue: DLPX-57962

@ahrens requested a review from behlendorf April 17, 2018 22:02
@ahrens (Member Author) commented Apr 17, 2018

@dweeezil you might want to check this out

@dweeezil left a comment

Note regarding WIP port to ZoL.

	range_tree_t *segs = range_tree_create(NULL, NULL);
	for (;;) {
		range_seg_t *rs = avl_first(&svr->svr_allocd_segs->rt_root);

Testing my WIP port to ZoL, I'm seeing "rs == NULL". I've put an ASSERT below the range_tree_is_empty() check above to make sure avl_first() is not NULL when the tree is non-empty, but somehow, as part of the loop, avl_first() is still able to return NULL. I'm trying to audit all the uses of svr_allocd_segs to make sure they're all being done under the svr_lock.

Mentioning @behlendorf so he knows I'm working on porting this.
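
For readers following along, the kind of debugging assertion being described might look roughly like this; the placement and exact form are guesses, not the actual WIP port:

	/* Guess at the described check: below the emptiness test, the first
	 * segment of svr_allocd_segs should never be NULL. */
	ASSERT(!range_tree_is_empty(svr->svr_allocd_segs));
	ASSERT3P(avl_first(&svr->svr_allocd_segs->rt_root), !=, NULL);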

@dweeezil left a comment

WIP port to ZoL.


		range_tree_add(segs, rs->rs_start, seg_length);
		range_tree_remove(svr->svr_allocd_segs,
		    rs->rs_start, seg_length);

This can empty svr_allocd_segs so we also need to break out of the loop above when rs == NULL.

@ahrens (Member Author)

OK that makes sense. I'll fix this - I actually had it like you are suggesting before, but "simplified" the code a bit too much it seems :-)
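
For readers following the thread, a sketch of the loop shape being discussed, with the missing NULL check added. Identifiers are taken from the hunks quoted above; the surrounding context is abbreviated, and the seg_length computation is simplified (the real code caps how much is taken at once):

	range_tree_t *segs = range_tree_create(NULL, NULL);
	for (;;) {
		range_seg_t *rs = avl_first(&svr->svr_allocd_segs->rt_root);
		if (rs == NULL)
			break;	/* everything has been moved into segs */

		/* Simplified: take the whole segment at once. */
		uint64_t seg_length = rs->rs_end - rs->rs_start;

		range_tree_add(segs, rs->rs_start, seg_length);
		range_tree_remove(svr->svr_allocd_segs,
		    rs->rs_start, seg_length);
	}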

dweeezil pushed a commit to dweeezil/zfs that referenced this pull request May 12, 2018
removal on fragmented pools

Authored by: Matthew Ahrens <mahrens@delphix.com>
OpenZFS-issue: https://illumos.org/issues/9486
OpenZFS-commit: ahrens/illumos@07152e1
OpenZFS-PR: openzfs/openzfs#627
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
dweeezil pushed a commit to dweeezil/zfs that referenced this pull request May 17, 2018
removal on fragmented pools

dweeezil pushed a commit to dweeezil/zfs that referenced this pull request May 20, 2018
removal on fragmented pools

Porting notes:

    Add the following as module parameters:
        zfs_condense_indirect_vdevs_enable
        zfs_condense_max_obsolete_bytes

    Document the following module parameters:
        zfs_condense_indirect_vdevs_enable
        zfs_condense_max_obsolete_bytes
        zfs_condense_min_mapping_bytes

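As a rough illustration of what the first porting note implies on the Linux side, tunables like these are typically exposed as module parameters along the following lines; the default values and description strings here are placeholders, not the actual ones in the port:

int zfs_condense_indirect_vdevs_enable = 1;
unsigned long zfs_condense_max_obsolete_bytes = 1024 * 1024 * 1024;

/* Expose the tunables via /sys/module/zfs/parameters/ */
module_param(zfs_condense_indirect_vdevs_enable, int, 0644);
MODULE_PARM_DESC(zfs_condense_indirect_vdevs_enable,
	"Whether to attempt condensing indirect vdev mappings");

module_param(zfs_condense_max_obsolete_bytes, ulong, 0644);
MODULE_PARM_DESC(zfs_condense_max_obsolete_bytes,
	"Obsolete bytes allowed to accumulate before condensing a mapping");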
dweeezil pushed a commit to dweeezil/zfs that referenced this pull request May 20, 2018
removal on fragmented pools

dweeezil pushed a commit to dweeezil/zfs that referenced this pull request May 21, 2018
removal on fragmented pools

dweeezil pushed a commit to dweeezil/zfs that referenced this pull request May 21, 2018
removal on fragmented pools


@dweeezil left a comment

This patch stack LGTM.

dweeezil pushed a commit to dweeezil/zfs that referenced this pull request May 23, 2018
removal on fragmented pools

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>

Closes #627