runtime: green tea garbage collector #73581
Change https://go.dev/cl/658036 mentions this issue.
How to try it out
Sending feedback
Please focus your attention on whole programs. Microbenchmarks tend to be poor representatives of garbage collection behavior in real programs. Large changes in the foundations such as these can move the needle for all sorts of reasons, unrelated to (or even inversely related to) how efficient the garbage collector actually is. If you encounter a situation where Green Tea is or isn't working out for you, we'd appreciate it if you could share some details with us; that would be really helpful moving forward.
Feel free to post to this GitHub issue, or you can email me directly. Thank you!
Our current parallel mark algorithm suffers from frequent stalls on memory since its access pattern is essentially random. Small objects are the worst offenders: each one forces pulling in at least one full cache line even when the amount to be scanned is far smaller than that, and each object also requires an independent access to per-object metadata. The purpose of this change is to improve garbage collector performance by scanning small objects in batches, obtaining better cache locality than our current approach.

The core idea behind this change is to defer marking and scanning small objects, and then scan them in batches localized to a span. This change adds scanned bits to each small object (<=512 bytes) span in addition to mark bits. The scanned bits indicate that the object has been scanned. (One way to think of them is "grey" bits and "black" bits in the tri-color mark-sweep abstraction.) Each of these spans is always 8 KiB, and if they contain pointers, the pointer/scalar data is already packed together at the end of the span, allowing us to further optimize the mark algorithm for this specific case.

When the GC encounters a pointer, it first checks whether it points into a small object span. If so, the object is first marked in the mark bits and then queued on a work-stealing P-local queue. This queue entry represents the whole span, and we ensure that a span can appear at most once in any queue by maintaining an atomic ownership bit for each span. Later, when the span is dequeued, we scan every object whose mark bit is set but whose scanned bit is not. If it turns out that only one object was marked since the last time we scanned the span, we scan just that object directly, essentially falling back to the existing algorithm. noscan objects have no scan work, so they are never queued.

Each span's mark and scanned bits are co-located together at the end of the span. Since the span is always 8 KiB in size, it can be found with simple pointer arithmetic. Next to the marks and scans we also store the size class, eliminating the need to access the span's mspan altogether.

The work-stealing P-local queue is a new source of GC work. If this queue gets full, half of it is dumped to a global linked list of spans to scan. The regular scan queues are always prioritized over this queue to allow time for marks to accumulate. Stealing work from other Ps is a last resort.

This change also adds a new debug mode under GODEBUG=gctrace=2 that dumps whole-span scanning statistics by size class on every GC cycle.

A future extension to this CL is to use SIMD-accelerated scanning kernels for scanning spans with high mark bit density.

For #19112.
For #73581.

Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/658036
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Change https://go.dev/cl/669655 mentions this issue.
Now that the greenteagc GOEXPERIMENT is no longer a no-op, let's add some builders for it.

For golang/go#73581.

Change-Id: I50613100ec24eb9813ca8d64c474d6efe9bc791c
Reviewed-on: https://go-review.googlesource.com/c/build/+/669655
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
CPU: AMD Ryzen 9 3950X 16-Core Processor
The attached files contain the CPU profiles and the gctrace output. I also have execution traces, but they span the whole execution and are ~50 MB each when compressed, which GitHub won't let me attach. greentea.tar.gz
Thanks for sharing! I think there is a win here, though it's not a massive improvement. I think there are some effects that might be masking a bigger win. At least according to the gctrace output, each GC cycle is shorter. Also, here are the distributions of wall time spent in the mark phase:
Simultaneously, because each GC cycle is shorter, there's less floating garbage, leading to a smaller live heap. This means that for the same amount of allocation work, we're going to have more GC cycles if GOGC is unchanged. Here are the heap size distributions:
As for why this isn't improving overall end-to-end time much, I suspect that:
(I could probably analyze this more efficiently if I just made a tool to parse the gctrace output.)
The actual analysis phase of Staticcheck tries to use GOMAXPROCS goroutines at once, parallelizing on packages as well as individual analyzers per package. I've updated my original comment with a proper comparison of before and after.
The increase in system time might be related to the use of a lock-protected global list of spans to scan. As an aside, there may be a more clever lock-free data structure we could use that would allow us to take many spans off the global list at once, but it's probably going to be quite complicated. I also checked out the new scan stats I added, and based on those, Green Tea's core hypothesis does seem to apply to Staticcheck.
Green Tea 🍵 Garbage Collector
Authors: Michael Knyszek, Austin Clements
Updated: 2 May 2025
This issue tracks the design and implementation of the Green Tea garbage collector. As of the last update to this issue, development of Green Tea is still active. We'll produce a more detailed design document once we're ready to commit to a design. For now, Green Tea is available as an experiment at tip-of-tree and is planned to be available as an opt-in experiment in Go 1.25, once it releases. We encourage teams to try it out.
Introduction
Memory latency and bandwidth are becoming increasingly constrained as CPU clocks outpace DRAM clocks, increasing core counts offer more load on physically limited memory buses, and speed-of-light constraints necessitate increasingly non-uniform memory topologies. As a result, spatial locality, temporal locality, and topology-awareness are becoming ever more critical to high-performance systems.
Unfortunately, all of these trends are at odds with most of today’s garbage collection algorithms. Go’s garbage collector implements a classic tri-color parallel marking algorithm. This is, at its core, just a graph flood, where heap objects are nodes in the graph, and pointers are edges. However, this graph flood affords no consideration to the memory location of the objects that are being processed. As a result, it exhibits extremely poor spatial locality—jumping between completely different parts of memory—poor temporal locality—blithely spreading repeated accesses to the same memory across the GC cycle—and no concern for topology.
As a result, on average 85% of the garbage collector's time is spent in the core loop of this graph flood—the scan loop—and >35% of CPU cycles in the scan loop are spent solely stalled on memory accesses, excluding any knock-on effects. This problem is expected to only get worse as the industry trends toward many-core systems and non-uniform memory architectures.
In this document, we present Green Tea: a parallel marking algorithm that, if not memory-centric,1 is at least memory-aware, in that it endeavors to process objects close to one another together.
This new algorithm has an implementation that is ready for developers to trial on their workloads, and in this document we also present the results from evaluating this implementation against our benchmark suite. Overall, the algorithm shows a significant reduction in GC CPU costs on GC-heavy workloads.
Finally, this new marking algorithm unlocks new opportunities for future optimization, such as SIMD acceleration, which we discuss with other possible avenues of future work.
Design
The core idea behind the new parallel marking algorithm is simple. Instead of scanning individual objects, the garbage collector scans memory in much larger, contiguous blocks. The shared work queue tracks these coarse blocks instead of individual objects, and the individual objects waiting to be scanned in a block are tracked in that block itself. The core hypothesis is that while a block waits on the queue to be scanned, it will accumulate more objects to be scanned within that block, such that when a block does get dequeued, it’s likely that scanning will be able to scan more than one object in that block. This, in turn, improves locality of memory access, in addition to better amortizing per-scan costs.
Prototype implementation
In the prototype implementation of this new algorithm, the memory blocks we track are called spans. A span is always some multiple of 8 KiB, always aligned to 8 KiB, and consists entirely of objects of one size. Our prototype focuses exclusively on “small object spans”, which are exactly 8 KiB and contain objects up to 512 bytes.
A span is also the basic unit of storing heap metadata. In the prototype, each span stores two bits for each object: a gray bit and a black bit. These correspond to the tri-color abstraction: an object is black if it has been scanned, gray if it is in the queue to be scanned, and white if it has not been reached at all. In the prototype, white objects have neither bit set, gray objects have the gray bit set, and black objects have both bits set.
When scanning finds a pointer to a small object, it sets that object’s gray bit to indicate the object needs to be scanned. If the gray bit was not already set and the object’s span is not already enqueued for scanning, it enqueues the span. A per-span flag indicates whether the span is currently enqueued so it will only be enqueued once at a time. When the scan loop dequeues a span, it computes the difference between the gray bits and the black bits to identify objects to scan, copies the gray bits to the black bits, and scans any objects that had their gray bit set but not their black bit.
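The gray/black bit manipulation above reduces to a few word-wide bit operations. The sketch below is illustrative only (a single uint64 stands in for the per-span bitmaps, and the names are not the runtime's): marking sets a gray bit, and scanning takes the difference between the gray and black words to find the objects that still need work.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Hypothetical per-span metadata: one gray bit and one black bit per
// object slot. A real 8 KiB span holds up to 8192/elemSize objects;
// one uint64 word is enough to sketch the idea.
type spanBits struct {
	gray  uint64 // object is queued to be scanned
	black uint64 // object has been scanned
}

// markObject sets the gray bit for object index i and reports whether
// the object was newly grayed (in the real design, a separate per-span
// flag decides whether the span itself must be enqueued).
func (s *spanBits) markObject(i uint) (newlyGray bool) {
	bit := uint64(1) << i
	if s.gray&bit != 0 {
		return false
	}
	s.gray |= bit
	return true
}

// scanSpan computes the objects that are gray but not yet black,
// copies the gray bits into the black bits, and returns the indices
// of the objects to scan.
func (s *spanBits) scanSpan() []int {
	toScan := s.gray &^ s.black // gray bit set, black bit clear
	s.black |= s.gray
	var objs []int
	for toScan != 0 {
		objs = append(objs, bits.TrailingZeros64(toScan))
		toScan &= toScan - 1 // clear lowest set bit
	}
	return objs
}

func main() {
	var s spanBits
	s.markObject(3)
	s.markObject(10)
	fmt.Println(s.scanSpan()) // [3 10]
	s.markObject(10)          // already gray: no-op
	s.markObject(42)
	fmt.Println(s.scanSpan()) // [42] — only the newly grayed object
}
```

Note how the second scan touches only object 42: objects 3 and 10 are already black, which is exactly the gray-minus-black difference the prototype computes on dequeue.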
Limiting scope to small objects
The prototype focuses on small objects because we derive the most benefit from them. The per-scan overhead of small objects is much harder to amortize because the garbage collector spends so little time scanning each individual object. Larger objects continue to use the old algorithm.
The choice of which algorithm to use is made when scanning encounters a pointer. The span allocator maintains a bitmap with one bit for each 8 KiB page in the heap to indicate whether that page is backed by a small object span. The footprint of this fits easily into cache even for very large heaps, and contention is extremely low.
Since small object spans are always 8 KiB large and 8 KiB aligned, once the scanner knows the target of a pointer is in a small object span, it can use simple address arithmetic to find the object’s metadata within the span, thus avoiding indirections and dependent loads that seriously harm performance.
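Because the span size is a power of two, "simple address arithmetic" here means masking and division. A minimal sketch (the element size would actually come from the size class stored next to the mark bits; here it is passed in for illustration):

```go
package main

import "fmt"

// Small object spans are always 8 KiB large and 8 KiB aligned, so a
// pointer's span base and object index follow from bit masking and a
// division by the element size — no mspan lookup required.
const spanSize = 8 << 10 // 8 KiB

// spanBase clears the low 13 bits of the address, yielding the start
// of the enclosing span.
func spanBase(addr uintptr) uintptr {
	return addr &^ (spanSize - 1)
}

// objIndex converts an interior pointer into the index of the object
// it points into, given the span's element size.
func objIndex(addr, elemSize uintptr) uintptr {
	return (addr - spanBase(addr)) / elemSize
}

func main() {
	// A pointer 1000 bytes into an (assumed) span of 48-byte objects:
	addr := uintptr(0x40002000 + 1000)
	fmt.Printf("span base: %#x\n", spanBase(addr)) // span base: 0x40002000
	fmt.Println("object index:", objIndex(addr, 48)) // object index: 20
}
```

With the per-span metadata packed at the end of the span, both the mark/scan bits and the size class are then reachable at a fixed offset from that base, avoiding the dependent loads the text describes.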
Work distribution
Go’s current garbage collector distributes work across scanners by having each scanner maintain a local fixed-sized stack of object pointers. However, in order to ensure parallelism, each scanner aggressively checks and populates global lists. This frequent mutation of the global lists is a significant source of contention in Go programs on many-core systems.
The prototype implementation has a separate queue dedicated to spans and based on the distributed work-stealing runqueues used by the goroutine scheduler. The opportunity for stealing work directly from other workers means less contention on global lists. Furthermore, by queuing spans instead of individual objects, there are far fewer items to queue and thus inherently lower contention on the queues.
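The overflow policy can be sketched in a few lines. This is illustrative only: the runtime's queue is a lock-free ring shared with work stealing and its global list is a linked list of spans, whereas here plain slices stand in for both.

```go
package main

import "fmt"

// spanQueue models a fixed-size P-local queue of spans. When it
// fills up, the older half is dumped to a shared global list, which
// other workers fall back to only after stealing fails.
type spanQueue struct {
	local    []int  // queued span IDs
	capacity int    // fixed size of the local queue
	global   *[]int // shared overflow list
}

func (q *spanQueue) push(span int) {
	if len(q.local) == q.capacity {
		// Queue is full: move the older half to the global list.
		half := q.capacity / 2
		*q.global = append(*q.global, q.local[:half]...)
		q.local = append(q.local[:0:0], q.local[half:]...)
	}
	q.local = append(q.local, span)
}

func main() {
	var global []int
	q := &spanQueue{capacity: 4, global: &global}
	for span := 1; span <= 5; span++ {
		q.push(span)
	}
	fmt.Println("local:", q.local) // local: [3 4 5]
	fmt.Println("global:", global) // global: [1 2]
}
```

Dumping half, rather than a single item, amortizes the cost of touching the shared list, which is the same trade-off the goroutine scheduler's runqueues make.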
Span work may be ordered several different ways. We explored several policies, including FIFO, LIFO, sparsest-first, densest-first, random, and address-ordered. FIFO turned out to accumulate the highest average density of objects to scan on a span by the time it was dequeued for scanning.
Single-object scan optimization
If a span has only a single object to scan when it is dequeued, the new algorithm will have done more work to handle that single object than the current, object-centric algorithm does.
To bring the performance of the single-object-per-span case more in line with the current marking algorithm, we apply two tricks. First, we track the object that was marked when the span was enqueued. This object becomes the span's representative until the span is scanned. Next, we add a hit flag to the span that indicates an object was marked while the span was queued; that is, that at least two objects are marked. When scanning a span, if the hit flag is not set, then the garbage collector can directly scan the span’s representative, instead of processing the entire span.
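The representative-plus-hit-flag logic can be sketched as follows (names are illustrative, not the runtime's):

```go
package main

import "fmt"

// queuedSpan remembers the object that caused the span to be
// enqueued (its representative) and sets a hit flag if any further
// object is marked while the span waits in the queue.
type queuedSpan struct {
	rep int  // index of the object marked at enqueue time
	hit bool // true once a second object is marked while queued
}

func (s *queuedSpan) markWhileQueued() { s.hit = true }

// scan reports which strategy the scanner takes for this span.
func (s *queuedSpan) scan() string {
	if !s.hit {
		// Only the representative is marked: scan it directly,
		// matching the cost of the object-centric algorithm.
		return fmt.Sprintf("scan single object %d", s.rep)
	}
	// At least two objects are marked: process the whole span.
	return "scan whole span via gray/black bits"
}

func main() {
	a := &queuedSpan{rep: 7}
	fmt.Println(a.scan()) // scan single object 7

	b := &queuedSpan{rep: 2}
	b.markWhileQueued() // another object was marked while b waited
	fmt.Println(b.scan()) // scan whole span via gray/black bits
}
```

When the flag is clear, the span-level machinery is bypassed entirely, which is what keeps the sparse case competitive with the current marking algorithm.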
Prototype evaluation
We evaluated the prototype implementation across a variety of benchmarks, on low-CPU-count, high-CPU-count, amd64, and arm64 Linux virtual machines. The rest of this section summarizes the results, focusing primarily on the differences in garbage collection CPU cost.
In select GC-heavy microbenchmarks (garbage, from golang.org/x/benchmarks, and binary-trees Go #2, from the Computer Language Benchmarks Game), depending on core count, we observed anywhere from a 10–50% reduction in GC CPU costs compared to the existing Go GC. The improvement generally rose with core count, indicating that the prototype scales better than the existing implementation. Furthermore, the number of L1 and L2 cache misses was reduced by half in these benchmarks.
On our bent and sweet benchmark suites, the results were more varied.
The results are positive overall, but include a mix of improvements and regressions.
Most benchmarks were either unaffected by the changes to the garbage collector, or regressed or improved solely due to changes that had little to do with the garbage collector, such as code alignment changes. Some benchmarks regressed even though less CPU time is spent in the garbage collector. One reason is that the garbage collector's mark phase is active for less time, leading to less floating garbage, which acts as ballast in some benchmarks. Another reason is that less time spent in the garbage collector means more time spent in other scalability bottlenecks, either in the runtime or in user code, leading to a net apparent regression.
The Go compiler benchmarks appear to inconsistently show a very slight regression (0.5%). Given the magnitude and inconsistency of the regression, these benchmarks appear to be rather insensitive to this change. One hypothesis is that the occasional regression may be due to an out-of-date PGO profile, but this remains to be investigated.
The tile38 benchmark shows substantial improvements across throughput, latency and memory use. In addition, it showed a 35% reduction in garbage collection overheads. This benchmark queries a local instance of a Tile38 in-memory geospatial database pre-seeded with data. Most of the heap consists of a high-fanout tree, so Green Tea is able to quickly generate large amounts of work and high densities.
The bleve-index benchmark has a heap topology that is quite difficult for Green Tea, though performance overall is a wash. Most of the heap consists of a low-fanout binary tree that is rapidly mutated by the benchmark. Green Tea struggles to generate locality in this benchmark, and half of all span scans only scan a single object. Our current hypothesis is that because of frequent tree rotations, the tree structure itself becomes shuffled across a large heap (100+ MiB). In contrast, the binary-trees benchmark does not perform rotations, so the tree layout retains the good locality of its initial allocations. This suggests that Green Tea has good locality when the application itself has good locality, unlike the current Go GC; but Green Tea, unsurprisingly, can't create locality out of nothing. Both Linux perf and CPU profiles in the 16-core amd64 environment indicate a small ~2% regression in garbage collection overhead. Overall, the single-object scan optimization was integral to making this benchmark perform well in the 16-core amd64 environment. In the 72- and 88-core environments we see a significant improvement, due to the design's improved many-core scalability. There remains a small overall regression in the 16-core arm64 environment that still needs to be investigated.
Future work
SIMD-accelerated scanning kernels
Scanning memory in larger blocks of memory unlocks the ability to apply SIMD to small objects in the garbage collector. The core idea is to generate a unique scanning kernel for each size class and use SIMD bit manipulation and permutation instructions to load, mask, swizzle, pack, and enqueue pointers. The regularity of the layout of objects in a single span and Go’s packed representation of pointer/scalar metadata for small objects both play a major role in making this feasible.
Austin Clements developed prototype AVX512-based scanning kernels that reduce garbage collection overheads by another 15–20% in the benchmarks where we already saw improvements. The prototype implementation does not currently use these kernels because they only apply to a small subset of objects at this point in time.
These SIMD kernels tend to require a higher density of objects in order to outperform sparsely scanning objects within a scan. These kernels are still being developed, so the prototype does not use them by default, and when they are enabled, it only uses SIMD scanning when a minimum density threshold is reached.
Concentrator network
Austin's original design for Green Tea used a sorting network called the concentrator network to achieve the even higher pointer density required by SIMD-based scanning, and to generate locality even for metadata operations like setting gray bits. The network was carefully designed to minimize queuing costs, so individual pointers could still be processed efficiently when sufficient scan density was unavailable.
There are two main reasons we did not pursue this direction in the short term. First, we found that even very low density can produce good results, especially with the single-object scan optimization. Second, the concentrator network is more complex to implement, as it is a greater departure from the existing algorithm. However, this design is still an avenue we plan to explore, since it is far more general and tunable.
Acknowledgements
Credit to Austin Clements for the inception of the algorithm and initial prototyping. (They wrote down the key idea in 2018!)
Credit to Yves Vandriessche from Intel for providing many microarchitectural insights that were vital to making this design viable. Many of his suggestions were applied to the core scan loop, including proper prefetching, batching subsequent pointer loads to hide their memory latency, and simpler iteration over object pointers.
Footnotes
"The Garbage Collection Handbook" by Richard Jones, Anthony Hosking, Eliot Moss, a canonical source for garbage collection techniques, divides parallel garbage collection algorithms into two categories, "processor-centric" and "memory-centric." Loosely, the former category consists of algorithms that aggressively balance work across processors to maximize parallelism, while the latter consists of algorithms that may not parallelize as well, but process contiguous blocks of the heap. The current marking algorithm in the Go garbage collector is firmly processor-centric. Unfortunately for us, the handbook's wisdom ends there, as it goes on to state that "[a]ll known parallel marking algorithms are processor-centric." (page 279) ↩