runtime: whole span reclaimer is expensive #11979

Open
aclements opened this Issue Aug 1, 2015 · 2 comments


aclements commented Aug 1, 2015

Objects larger than 32K are handled by a large object allocator (largeAlloc) because they fall outside the size classes for which there are span caches. Currently, in order to control heap growth, the large object allocator sweeps until it reclaims the number of pages being allocated before performing the allocation (mHeap_Alloc_m -> mHeap_ReclaimList). This path is also taken when growing the heap.
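The reclaim-before-allocate policy described above can be sketched roughly as follows. This is an illustrative model only, not the runtime's actual code: the `heap`, `span`, and `reclaim` names are hypothetical stand-ins for `mheap`, `mspan`, and `mHeap_ReclaimList`, and sweeping is reduced to checking a `marked` flag.

```go
package main

import "fmt"

// span is a hypothetical stand-in for the runtime's mspan: a run of
// pages that sweeping can free whole if no object in it is marked.
type span struct {
	npages int
	marked bool // true if any object in the span survived the last GC
}

// heap is a hypothetical stand-in for mheap.
type heap struct {
	spans []*span
}

// reclaim sweeps spans until at least npages pages have been freed and
// returns the number of pages actually reclaimed. Note that it frees
// whole spans, so it can overshoot the request -- and in the real
// runtime that surplus credit is not remembered for later allocations.
func (h *heap) reclaim(npages int) int {
	freed := 0
	for _, s := range h.spans {
		if freed >= npages {
			break
		}
		if !s.marked { // sweep found a fully dead span; free it whole
			freed += s.npages
		}
	}
	return freed
}

func main() {
	h := &heap{spans: []*span{
		{npages: 2, marked: true},
		{npages: 4, marked: false},
		{npages: 8, marked: false},
	}}
	// Before allocating 5 pages, the allocator must reclaim at least 5.
	fmt.Println(h.reclaim(5)) // prints 12: whole spans are freed, overshooting the request
}
```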

However, this process can be quite expensive. It first tries sweeping large object spans. If this succeeds, the process is fairly efficient; it may require looking at a large number of spans that don't have a free object, but it's not clear if it's possible to eliminate this linear constant. However, empirically, this often fails. This may simply be because earlier large allocations swept spans that were larger than they were allocating and they didn't keep track of this credit. In this case, it falls back to sweeping small object spans until it's able to free enough (potentially non-contiguous) pages that way. As a result of this, in the x/benchmarks garbage benchmark, the large object sweeper winds up sweeping ~30 times more bytes than it's trying to allocate.

I believe we can eliminate this entire mechanism if, during marking, the garbage collector also keeps a per-span "super mark" that is set if any objects in the span are marked. At the point where we would set this, we already have the span and, assuming we're willing to dedicate a byte per span for this, it can be set with a blind (possibly even non-temporal) write. At the end of GC, after mark termination, we can immediately free these spans (both large object spans and small object spans with no reachable objects). It takes roughly 1ms per heap GB to walk the span list, so even assuming a little more overhead for adding free spans to span lists, it seems very likely this would take less time than we currently spend sweeping these spans. This is also trivially parallelizable, and we probably have to do this walk anyway (#11484). Additionally, this will reduce load on concurrent sweep and coalesce neighboring spans much earlier, making larger regions of memory available for large object allocation immediately.
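A minimal sketch of the proposed per-span "super mark", under the assumptions stated above (one dedicated byte per span, set with a blind write while the span is already in hand, then a single post-mark-termination walk of the span list). All names and the data layout here are illustrative, not the runtime's real structures:

```go
package main

import "fmt"

// span is a hypothetical stand-in for mspan, extended with the proposed
// per-span mark byte.
type span struct {
	npages    int
	superMark byte // 1 if any object in the span was marked this cycle
}

// markObject models the point in marking where an object in s is
// marked. The span pointer is already in hand here, so setting the
// super mark is a blind, unconditional store -- no read-compare is
// needed for correctness, only (possibly) for performance.
func markObject(s *span) {
	s.superMark = 1
}

// reclaimUnmarked walks the span list after mark termination, frees
// (here: counts the pages of) every span that saw no marks, and clears
// the marks for the next cycle. This walk is trivially parallelizable.
func reclaimUnmarked(spans []*span) (freedPages int) {
	for _, s := range spans {
		if s.superMark == 0 {
			freedPages += s.npages
		}
		s.superMark = 0
	}
	return freedPages
}

func main() {
	spans := []*span{{npages: 2}, {npages: 4}, {npages: 8}}
	markObject(spans[1]) // only the middle span has a live object
	fmt.Println(reclaimUnmarked(spans)) // prints 10: spans 0 and 2 are freed whole
}
```

The point of the sketch is that the cost of finding dead spans moves from the allocation path (sweeping up to 30x the allocated bytes) to one linear, parallelizable walk per GC cycle.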

This idea is based on various (mostly pairwise) discussions between myself, @RLH, @rsc, and @dvyukov.

@aclements aclements added this to the Go1.6Early milestone Aug 1, 2015

RLH commented Aug 4, 2015

The downside is that every mark must also touch the span mark, which admittedly will probably be in the cache. I would have to see numbers before I'd buy into the claim that a blind mark is faster than a read-compare mark.
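For concreteness, the two store strategies under discussion can be sketched as below. This is a hypothetical micro-example (the `superMarks` array and function names are illustrative): a blind store writes unconditionally, while a read-compare store loads the byte first and writes only if it is clear, trading an extra load and branch for avoiding redundant writes. Which is faster inside the mark loop is exactly the empirical question raised here.

```go
package main

import "fmt"

// blindMark sets the span's mark byte unconditionally: one store, no
// load, no branch.
func blindMark(superMarks []byte, spanIdx int) {
	superMarks[spanIdx] = 1
}

// readCompareMark checks first and stores at most once per span: one
// load and a branch on every call, but no redundant stores to a
// possibly shared cache line.
func readCompareMark(superMarks []byte, spanIdx int) {
	if superMarks[spanIdx] == 0 {
		superMarks[spanIdx] = 1
	}
}

func main() {
	marks := make([]byte, 4)
	blindMark(marks, 1)
	readCompareMark(marks, 1) // no-op: already set
	readCompareMark(marks, 3)
	fmt.Println(marks) // prints [0 1 0 1]
}
```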


aclements commented Aug 11, 2015

I think the performance of the blind write doesn't matter much (unless it adds significant per-object cost, which I highly doubt). The win is on the sweeping side.

There's even some literature on this approach, it turns out: "Effective Prefetch for Mark-Sweep Garbage Collection" by Garner et al., 2007 uses exactly this trick. Plus, I think this will be necessary if we want to move to sweep-free allocation, because finding and freeing unmarked spans is one of the tasks of the sweeper that's hard to do efficiently otherwise.

@rsc rsc modified the milestones: Unplanned, Go1.6Early Nov 4, 2015
