runtime: green tea garbage collector #73581
Change https://go.dev/cl/658036 mentions this issue.
How to try it out
Sending feedback
Please focus your attention on whole programs. Microbenchmarks tend to be poor representatives of garbage collection behavior in real programs. Large changes in the foundations such as these can move the needle for all sorts of reasons, unrelated to (or even inversely related to) how efficient the garbage collector actually is. If you encounter a situation where Green Tea is or isn't working out for you, we'd appreciate it if you could share some details with us; that would be really helpful moving forward.
Feel free to post to this GitHub issue, or you can email me directly. Thank you!
Our current parallel mark algorithm suffers from frequent stalls on memory since its access pattern is essentially random. Small objects are the worst offenders: each one forces pulling in at least one full cache line even when the amount to be scanned is far smaller than that, and each object also requires an independent access to per-object metadata. The purpose of this change is to improve garbage collector performance by scanning small objects in batches, obtaining better cache locality than our current approach.

The core idea behind this change is to defer marking and scanning small objects, and then scan them in batches localized to a span. This change adds scanned bits to each small object (<=512 bytes) span in addition to mark bits. The scanned bits indicate that the object has been scanned. (One way to think of them is "grey" bits and "black" bits in the tri-color mark-sweep abstraction.) Each of these spans is always 8 KiB, and if they contain pointers, the pointer/scalar data is already packed together at the end of the span, allowing us to further optimize the mark algorithm for this specific case.

When the GC encounters a pointer, it first checks whether it points into a small object span. If so, the object is first marked in the mark bits and then queued on a work-stealing P-local queue. This queue entry represents the whole span, and we ensure that a span can appear at most once in any queue by maintaining an atomic ownership bit for each span. Later, when the span is dequeued, we scan every object whose mark bit is set but whose scanned bit is not. If it turns out that only one object was marked since the last time we scanned the span, we scan just that object directly, essentially falling back to the existing algorithm. noscan objects have no scan work, so they are never queued.

Each span's mark and scanned bits are co-located together at the end of the span. Since the span is always 8 KiB in size, it can be found with simple pointer arithmetic. Next to the marks and scans we also store the size class, eliminating the need to access the span's mspan altogether.

The work-stealing P-local queue is a new source of GC work. If this queue gets full, half of it is dumped to a global linked list of spans to scan. The regular scan queues are always prioritized over this queue to allow time for marks to accumulate. Stealing work from other Ps is a last resort.

This change also adds a new debug mode under GODEBUG=gctrace=2 that dumps whole-span scanning statistics by size class on every GC cycle.

A future extension to this CL is to use SIMD-accelerated scanning kernels for scanning spans with high mark bit density.

For #19112.
For #73581.

Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/658036
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Change https://go.dev/cl/669655 mentions this issue.
Now that the greenteagc GOEXPERIMENT is no longer a no-op, let's add some builders for it.

For golang/go#73581.

Change-Id: I50613100ec24eb9813ca8d64c474d6efe9bc791c
Reviewed-on: https://go-review.googlesource.com/c/build/+/669655
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
CPU: AMD Ryzen 9 3950X 16-Core Processor
The attached files contain the CPU profiles and the gctrace output. I also have execution traces, but they span the whole execution and are ~50 MB each when compressed, which GitHub won't let me attach. greentea.tar.gz
Thanks for sharing! I think there is a win here, though it's not a massive improvement. I think there are some effects that might be masking a bigger win. At least according to the gctrace output, each GC cycle is shorter. Also, here are the distributions of wall time spent in the mark phase:
Simultaneously, because each GC cycle is shorter, there's less floating garbage, leading to a smaller live heap. This means that for the same amount of allocation work, we're going to have more GC cycles if GOGC is unchanged. Here are the heap size distributions:
As for why this isn't improving overall end-to-end time much, I suspect that:
(I could probably analyze this more efficiently if I just made a tool to parse the gctrace output.)
The actual analysis phase of Staticcheck tries to use GOMAXPROCS goroutines at once, parallelizing on packages as well as individual analyzers per package. I've updated my original comment with a proper comparison of before and after.
The increase in system time might be related to the use of a lock-protected global list of spans to scan. As an aside, there may be a more clever lock-free data structure we could use that would allow us to take many spans off the global list at once, but it's probably going to be quite complicated. I also checked out the new scan stats I added, and based on those, Green Tea's core hypothesis does seem to apply to Staticcheck.
Green Tea 🍵 Garbage Collector
Authors: Michael Knyszek, Austin Clements
Updated: 2 May 2025
This issue tracks the design and implementation of the Green Tea garbage collector. As of the last update to this issue, development of Green Tea is still active. We'll produce a more detailed design document once we're ready to commit to a design. For now, Green Tea is available as an experiment at tip-of-tree and is planned to be available as an opt-in experiment in Go 1.25, once it releases. We encourage teams to try it out.
Introduction
Memory latency and bandwidth are becoming increasingly constrained as CPU clocks outpace DRAM clocks, increasing core counts offer more load on physically limited memory buses, and speed-of-light constraints necessitate increasingly non-uniform memory topologies. As a result, spatial locality, temporal locality, and topology-awareness are becoming ever more critical to high-performance systems.
Unfortunately, all of these trends are at odds with most of today’s garbage collection algorithms. Go’s garbage collector implements a classic tri-color parallel marking algorithm. This is, at its core, just a graph flood, where heap objects are nodes in the graph, and pointers are edges. However, this graph flood affords no consideration to the memory location of the objects that are being processed. As a result, it exhibits extremely poor spatial locality—jumping between completely different parts of memory—poor temporal locality—blithely spreading repeated accesses to the same memory across the GC cycle—and no concern for topology.
As a result, on average 85% of the garbage collector's time is spent in the core loop of this graph flood—the scan loop—and >35% of CPU cycles in the scan loop are spent solely stalled on memory accesses, excluding any knock-on effects. This problem is expected to only get worse as the industry trends toward many-core systems and non-uniform memory architectures.
In this document, we present Green Tea: a parallel marking algorithm that, if not memory-centric,1 is at least memory-aware, in that it endeavors to process objects close to one another together.
This new algorithm has an implementation that is ready for developers to trial on their workloads, and in this document we also present the results from evaluating this implementation against our benchmark suite. Overall, the algorithm shows a significant reduction in GC CPU costs on GC-heavy workloads.
Finally, this new marking algorithm unlocks new opportunities for future optimization, such as SIMD acceleration, which we discuss with other possible avenues of future work.
Design
The core idea behind the new parallel marking algorithm is simple. Instead of scanning individual objects, the garbage collector scans memory in much larger, contiguous blocks. The shared work queue tracks these coarse blocks instead of individual objects, and the individual objects waiting to be scanned in a block are tracked in that block itself. The core hypothesis is that while a block waits on the queue to be scanned, it will accumulate more objects to be scanned within that block, such that when a block does get dequeued, it’s likely that scanning will be able to scan more than one object in that block. This, in turn, improves locality of memory access, in addition to better amortizing per-scan costs.
Prototype implementation
In the prototype implementation of this new algorithm, the memory blocks we track are called spans. A span is always some multiple of 8 KiB, always aligned to 8 KiB, and consists entirely of objects of one size. Our prototype focuses exclusively on “small object spans”, which are exactly 8 KiB and contain objects up to 512 bytes.
A span is also the basic unit of storing heap metadata. In the prototype, each span stores two bits for each object: a gray bit and a black bit. These correspond to the tri-color abstraction: an object is black if it has been scanned, gray if it is in the queue to be scanned, and white if it has not been reached at all. In the prototype, white objects have neither bit set, gray objects have the gray bit set, and black objects have both bits set.
When scanning finds a pointer to a small object, it sets that object’s gray bit to indicate the object needs to be scanned. If the gray bit was not already set and the object’s span is not already enqueued for scanning, it enqueues the span. A per-span flag indicates whether the span is currently enqueued so it will only be enqueued once at a time. When the scan loop dequeues a span, it computes the difference between the gray bits and the black bits to identify objects to scan, copies the gray bits to the black bits, and scans any objects that had their gray bit set but not their black bit.
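The gray/black bit manipulation above reduces to a few word-wide bit operations. The sketch below is illustrative only (a single uint64 stands in for the per-span bitmaps, and the names are not the runtime's): marking sets a gray bit, and scanning takes the difference between the gray and black words to find the objects that still need work.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Hypothetical per-span metadata: one gray bit and one black bit per
// object slot. A real 8 KiB span holds up to 8192/elemSize objects;
// one uint64 word is enough to sketch the idea.
type spanBits struct {
	gray  uint64 // object is queued to be scanned
	black uint64 // object has been scanned
}

// markObject sets the gray bit for object index i and reports whether
// the object was newly grayed (in the real design, a separate per-span
// flag decides whether the span itself must be enqueued).
func (s *spanBits) markObject(i uint) (newlyGray bool) {
	bit := uint64(1) << i
	if s.gray&bit != 0 {
		return false
	}
	s.gray |= bit
	return true
}

// scanSpan computes the objects that are gray but not yet black,
// copies the gray bits into the black bits, and returns the indices
// of the objects to scan.
func (s *spanBits) scanSpan() []int {
	toScan := s.gray &^ s.black // gray bit set, black bit clear
	s.black |= s.gray
	var objs []int
	for toScan != 0 {
		objs = append(objs, bits.TrailingZeros64(toScan))
		toScan &= toScan - 1 // clear lowest set bit
	}
	return objs
}

func main() {
	var s spanBits
	s.markObject(3)
	s.markObject(10)
	fmt.Println(s.scanSpan()) // [3 10]
	s.markObject(10)          // already gray: no-op
	s.markObject(42)
	fmt.Println(s.scanSpan()) // [42] — only the newly grayed object
}
```

Note how the second scan touches only object 42: objects 3 and 10 are already black, which is exactly the gray-minus-black difference the prototype computes on dequeue.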
Limiting scope to small objects
The prototype focuses on small objects because we derive the most benefit from them. The per-scan overhead of small objects is much harder to amortize because the garbage collector spends so little time scanning each individual object. Larger objects continue to use the old algorithm.
The choice of which algorithm to use is made when scanning encounters a pointer. The span allocator maintains a bitmap with one bit for each 8 KiB page in the heap to indicate whether that page is backed by a small object span. The footprint of this fits easily into cache even for very large heaps, and contention is extremely low.
Since small object spans are always 8 KiB large and 8 KiB aligned, once the scanner knows the target of a pointer is in a small object span, it can use simple address arithmetic to find the object’s metadata within the span, thus avoiding indirections and dependent loads that seriously harm performance.
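Because the span size is a power of two, "simple address arithmetic" here means masking and division. A minimal sketch (the element size would actually come from the size class stored next to the mark bits; here it is passed in for illustration):

```go
package main

import "fmt"

// Small object spans are always 8 KiB large and 8 KiB aligned, so a
// pointer's span base and object index follow from bit masking and a
// division by the element size — no mspan lookup required.
const spanSize = 8 << 10 // 8 KiB

// spanBase clears the low 13 bits of the address, yielding the start
// of the enclosing span.
func spanBase(addr uintptr) uintptr {
	return addr &^ (spanSize - 1)
}

// objIndex converts an interior pointer into the index of the object
// it points into, given the span's element size.
func objIndex(addr, elemSize uintptr) uintptr {
	return (addr - spanBase(addr)) / elemSize
}

func main() {
	// A pointer 1000 bytes into an (assumed) span of 48-byte objects:
	addr := uintptr(0x40002000 + 1000)
	fmt.Printf("span base: %#x\n", spanBase(addr)) // span base: 0x40002000
	fmt.Println("object index:", objIndex(addr, 48)) // object index: 20
}
```

With the per-span metadata packed at the end of the span, both the mark/scan bits and the size class are then reachable at a fixed offset from that base, avoiding the dependent loads the text describes.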
Work distribution
Go’s current garbage collector distributes work across scanners by having each scanner maintain a local fixed-sized stack of object pointers. However, in order to ensure parallelism, each scanner aggressively checks and populates global lists. This frequent mutation of the global lists is a significant source of contention in Go programs on many-core systems.
The prototype implementation has a separate queue dedicated to spans and based on the distributed work-stealing runqueues used by the goroutine scheduler. The opportunity for stealing work directly from other workers means less contention on global lists. Furthermore, by queuing spans instead of individual objects, there are far fewer items to queue and thus inherently lower contention on the queues.
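The overflow policy can be sketched in a few lines. This is illustrative only: the runtime's queue is a lock-free ring shared with work stealing and its global list is a linked list of spans, whereas here plain slices stand in for both.

```go
package main

import "fmt"

// spanQueue models a fixed-size P-local queue of spans. When it
// fills up, the older half is dumped to a shared global list, which
// other workers fall back to only after stealing fails.
type spanQueue struct {
	local    []int  // queued span IDs
	capacity int    // fixed size of the local queue
	global   *[]int // shared overflow list
}

func (q *spanQueue) push(span int) {
	if len(q.local) == q.capacity {
		// Queue is full: move the older half to the global list.
		half := q.capacity / 2
		*q.global = append(*q.global, q.local[:half]...)
		q.local = append(q.local[:0:0], q.local[half:]...)
	}
	q.local = append(q.local, span)
}

func main() {
	var global []int
	q := &spanQueue{capacity: 4, global: &global}
	for span := 1; span <= 5; span++ {
		q.push(span)
	}
	fmt.Println("local:", q.local) // local: [3 4 5]
	fmt.Println("global:", global) // global: [1 2]
}
```

Dumping half, rather than a single item, amortizes the cost of touching the shared list, which is the same trade-off the goroutine scheduler's runqueues make.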
Span work may be ordered several different ways. We explored several policies, including FIFO, LIFO, sparsest-first, densest-first, random, and address-ordered. FIFO turned out to accumulate the highest average density of objects to scan on a span by the time it was dequeued for scanning.
Single-object scan optimization
If a span has only a single object to scan when it is dequeued, the new algorithm will have done more work to handle that single object than the current, object-centric algorithm does.
To bring the performance of the single-object-per-span case more in line with the current marking algorithm, we apply two tricks. First, we track the object that was marked when the span was enqueued. This object becomes the span's representative until the span is scanned. Next, we add a hit flag to the span that indicates an object was marked while the span was queued; that is, that at least two objects are marked. When scanning a span, if the hit flag is not set, then the garbage collector can directly scan the span’s representative, instead of processing the entire span.
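The representative-plus-hit-flag logic can be sketched as follows (names are illustrative, not the runtime's):

```go
package main

import "fmt"

// queuedSpan remembers the object that caused the span to be
// enqueued (its representative) and sets a hit flag if any further
// object is marked while the span waits in the queue.
type queuedSpan struct {
	rep int  // index of the object marked at enqueue time
	hit bool // true once a second object is marked while queued
}

func (s *queuedSpan) markWhileQueued() { s.hit = true }

// scan reports which strategy the scanner takes for this span.
func (s *queuedSpan) scan() string {
	if !s.hit {
		// Only the representative is marked: scan it directly,
		// matching the cost of the object-centric algorithm.
		return fmt.Sprintf("scan single object %d", s.rep)
	}
	// At least two objects are marked: process the whole span.
	return "scan whole span via gray/black bits"
}

func main() {
	a := &queuedSpan{rep: 7}
	fmt.Println(a.scan()) // scan single object 7

	b := &queuedSpan{rep: 2}
	b.markWhileQueued() // another object was marked while b waited
	fmt.Println(b.scan()) // scan whole span via gray/black bits
}
```

When the flag is clear, the span-level machinery is bypassed entirely, which is what keeps the sparse case competitive with the current marking algorithm.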
Prototype evaluation
We evaluated the prototype implementation across a variety of benchmarks, on low-CPU-count, high-CPU-count, amd64, and arm64 Linux virtual machines. The rest of this section summarizes the results, focusing primarily on the differences in garbage collection CPU cost.
In select GC-heavy microbenchmarks (garbage, from golang.org/x/benchmarks, and binary-trees Go #2, from the Computer Language Benchmarks Game), depending on core count, we observed anywhere from a 10–50% reduction in GC CPU costs compared to the existing Go GC. The improvement generally rose with core count, indicating that the prototype scales better than the existing implementation. Furthermore, the number of L1 and L2 cache misses was reduced by half in these benchmarks.
On our bent and sweet benchmark suites, the results were more varied.
The results are positive overall, but include a mix of improvements and regressions.
Most benchmarks were either unaffected by the changes to the garbage collector, or regressed or improved solely due to changes that had little to do with the garbage collector, such as code alignment changes. Some benchmarks regressed even though less CPU time is spent in the garbage collector. One reason is that the garbage collector's mark phase is active for less time, leading to less floating garbage, which acts as ballast in some benchmarks. Another reason is that less time spent in the garbage collector means more time spent in other scalability bottlenecks, either in the runtime or in user code, leading to a net apparent regression.
The Go compiler benchmarks appear to inconsistently show a very slight regression (0.5%). Given the magnitude and inconsistency of the regression, these benchmarks appear to be rather insensitive to this change. One hypothesis is that the occasional regression may be due to an out-of-date PGO profile, but this remains to be investigated.
The tile38 benchmark shows substantial improvements across throughput, latency and memory use. In addition, it showed a 35% reduction in garbage collection overheads. This benchmark queries a local instance of a Tile38 in-memory geospatial database pre-seeded with data. Most of the heap consists of a high-fanout tree, so Green Tea is able to quickly generate large amounts of work and high densities.
The bleve-index benchmark has a heap topology that is quite difficult for Green Tea, though performance overall is a wash. Most of the heap consists of a low-fanout binary tree that is rapidly mutated by the benchmark. Green Tea struggles to generate locality in this benchmark, and half of all span scans only scan a single object. Our current hypothesis is that because of frequent tree rotations, the tree structure itself becomes shuffled across a large heap (100+ MiB). In contrast, the binary-trees benchmark does not perform rotations, so the tree layout retains the good locality of its initial allocations. This suggests that Green Tea has good locality when the application itself has good locality, unlike the current Go GC; but Green Tea, unsurprisingly, can't create locality out of nothing. Both Linux perf and CPU profiles in the 16-core amd64 environment indicate a small ~2% regression in garbage collection overhead. Overall, the single-object scan optimization was integral to making this benchmark perform well in the 16-core amd64 environment. In the 72- and 88-core environments we see a significant improvement, due to the design's improved many-core scalability. There remains a small overall regression in the 16-core arm64 environment that still needs to be investigated.
Future work
SIMD-accelerated scanning kernels
Scanning memory in larger blocks of memory unlocks the ability to apply SIMD to small objects in the garbage collector. The core idea is to generate a unique scanning kernel for each size class and use SIMD bit manipulation and permutation instructions to load, mask, swizzle, pack, and enqueue pointers. The regularity of the layout of objects in a single span and Go’s packed representation of pointer/scalar metadata for small objects both play a major role in making this feasible.
Austin Clements developed prototype AVX512-based scanning kernels that reduce garbage collection overheads by another 15–20% in the benchmarks where we already saw improvements. The prototype implementation does not currently use these kernels because they only apply to a small subset of objects at this point in time.
These SIMD kernels tend to require a higher density of objects in order to outperform sparsely scanning objects within a scan. These kernels are still being developed, so the prototype does not use them by default, and when they are enabled, it only uses SIMD scanning when a minimum density threshold is reached.
Concentrator network
Austin's original design for Green Tea used a sorting network called the concentrator network to achieve the even higher pointer density required by SIMD-based scanning, and to generate locality even for metadata operations like setting gray bits. The network was carefully designed to minimize queuing costs, so individual pointers could still be processed efficiently when sufficient scan density was unavailable.
There are two main reasons we did not pursue this direction in the short term. First, we found that even very low density can produce good results, especially with the single-object scan optimization. Second, the concentrator network is more complex to implement, as it is a greater departure from the existing algorithm. However, this design is still an avenue we plan to explore, since it is far more general and tunable.
Acknowledgements
Credit to Austin Clements for the inception of the algorithm and initial prototyping. (They wrote down the key idea in 2018!)
Credit to Yves Vandriessche from Intel for providing many microarchitectural insights that were vital to making this design viable. Many of his suggestions were applied to the core scan loop, including proper prefetching, batching subsequent pointer loads to hide their memory latency, and simpler iteration over object pointers.
Footnotes
"The Garbage Collection Handbook" by Richard Jones, Anthony Hosking, Eliot Moss, a canonical source for garbage collection techniques, divides parallel garbage collection algorithms into two categories, "processor-centric" and "memory-centric." Loosely, the former category consists of algorithms that aggressively balance work across processors to maximize parallelism, while the latter consists of algorithms that may not parallelize as well, but process contiguous blocks of the heap. The current marking algorithm in the Go garbage collector is firmly processor-centric. Unfortunately for us, the handbook's wisdom ends there, as it goes on to state that "[a]ll known parallel marking algorithms are processor-centric." (page 279) ↩