runtime: green tea garbage collector #73581

Open · mknyszek opened this issue May 2, 2025 · 7 comments
Assignee: mknyszek · Labels: compiler/runtime, NeedsInvestigation · Milestone: Go1.26

mknyszek (Contributor) commented May 2, 2025

Green Tea 🍵 Garbage Collector

Authors: Michael Knyszek, Austin Clements
Updated: 2 May 2025

This issue tracks the design and implementation of the Green Tea garbage collector. As of the last update to this issue, development of Green Tea is still active. We'll produce a more detailed design document once we're ready to commit to a design. For now, Green Tea is available as an experiment at tip-of-tree and is planned to be available as an opt-in experiment in Go 1.25 once it releases. We encourage teams to try it out.

Introduction

Memory latency and bandwidth are becoming increasingly constrained as CPU clocks outpace DRAM clocks, increasing core counts place more load on physically limited memory buses, and speed-of-light constraints necessitate increasingly non-uniform memory topologies. As a result, spatial locality, temporal locality, and topology-awareness are becoming ever more critical to high-performance systems.

Unfortunately, all of these trends are at odds with most of today’s garbage collection algorithms. Go’s garbage collector implements a classic tri-color parallel marking algorithm. This is, at its core, just a graph flood, where heap objects are nodes in the graph, and pointers are edges. However, this graph flood affords no consideration to the memory location of the objects that are being processed. As a result, it exhibits extremely poor spatial locality—jumping between completely different parts of memory—poor temporal locality—blithely spreading repeated accesses to the same memory across the GC cycle—and no concern for topology.

As a result, on average 85% of the garbage collector's time is spent in the core loop of this graph flood—the scan loop—and >35% of CPU cycles in the scan loop are spent solely stalled on memory accesses, excluding any knock-on effects. This problem is expected to only get worse as the industry trends toward many-core systems and non-uniform memory architectures.

In this document, we present Green Tea: a parallel marking algorithm that, if not memory-centric,1 is at least memory-aware, in that it endeavors to process objects close to one another together.

This new algorithm has an implementation that is ready for developers to trial on their workloads, and in this document we also present the results from evaluating this implementation against our benchmark suite. Overall, the algorithm shows a significant reduction in GC CPU costs on GC-heavy workloads.

Finally, this new marking algorithm unlocks new opportunities for future optimization, such as SIMD acceleration, which we discuss with other possible avenues of future work.

Design

The core idea behind the new parallel marking algorithm is simple. Instead of scanning individual objects, the garbage collector scans memory in much larger, contiguous blocks. The shared work queue tracks these coarse blocks instead of individual objects, and the individual objects waiting to be scanned in a block are tracked in that block itself. The core hypothesis is that while a block waits on the queue to be scanned, it will accumulate more objects to be scanned within that block, such that when a block does get dequeued, it’s likely that scanning will be able to scan more than one object in that block. This, in turn, improves locality of memory access, in addition to better amortizing per-scan costs.
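To make the core idea concrete, here is a minimal sketch in Go, using hypothetical types rather than the runtime's real data structures. While a block sits on the queue, marks pile up in it, so one dequeue scans many objects that are close together in memory:

```go
package sketch

type block struct {
	pending []int // object slots waiting to be scanned in this block
	queued  bool  // whether the block is already on the shared queue
}

// markObject records an object to scan, enqueueing its block at most once.
func markObject(q *[]*block, b *block, obj int) {
	b.pending = append(b.pending, obj)
	if !b.queued {
		b.queued = true
		*q = append(*q, b)
	}
}

// scanBlock processes everything that accumulated while the block waited.
// (Single-threaded simplification; the real collector is concurrent.)
func scanBlock(b *block, scan func(int)) {
	b.queued = false
	pending := b.pending
	b.pending = nil
	for _, obj := range pending {
		scan(obj) // all within one contiguous block: good spatial locality
	}
}
```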

Prototype implementation

In the prototype implementation of this new algorithm, the memory blocks we track are called spans. A span is always some multiple of 8 KiB, always aligned to 8 KiB, and consists entirely of objects of one size. Our prototype focuses exclusively on “small object spans”, which are exactly 8 KiB and contain objects up to 512 bytes.

A span is also the basic unit of storing heap metadata. In the prototype, each span stores two bits for each object: a gray bit and a black bit. These correspond to the tri-color abstraction: an object is black if it has been scanned, gray if it is in the queue to be scanned, and white if it has not been reached at all. In the prototype, white objects have neither bit set, gray objects have the gray bit set, and black objects have both bits set.

When scanning finds a pointer to a small object, it sets that object’s gray bit to indicate the object needs to be scanned. If the gray bit was not already set and the object’s span is not already enqueued for scanning, it enqueues the span. A per-span flag indicates whether the span is currently enqueued so it will only be enqueued once at a time. When the scan loop dequeues a span, it computes the difference between the gray bits and the black bits to identify objects to scan, copies the gray bits to the black bits, and scans any objects that had their gray bit set but not their black bit.
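Sketched in Go with illustrative names (the real implementation sets these bits atomically; most synchronization is elided here):

```go
package sketch

import (
	"math/bits"
	"sync/atomic"
)

const spanWords = 8 // bitmap words per span (up to 512 objects)

type span struct {
	gray     [spanWords]uint64 // object is queued to be scanned
	black    [spanWords]uint64 // object has been scanned
	enqueued atomic.Bool       // span appears on at most one queue
}

// markInSpan sets the object's gray bit and enqueues the span if it
// isn't already queued.
func markInSpan(s *span, idx uint, enqueue func(*span)) {
	w, b := idx/64, idx%64
	if s.gray[w]&(1<<b) != 0 {
		return // already gray (or black)
	}
	s.gray[w] |= 1 << b
	if s.enqueued.CompareAndSwap(false, true) {
		enqueue(s)
	}
}

// scanSpan scans every object that is gray but not yet black, then
// copies the gray bits over the black bits.
func scanSpan(s *span, scanObject func(idx uint)) {
	s.enqueued.Store(false)
	for w := 0; w < spanWords; w++ {
		toScan := s.gray[w] &^ s.black[w] // gray bits minus black bits
		s.black[w] = s.gray[w]
		for toScan != 0 {
			scanObject(uint(w*64) + uint(bits.TrailingZeros64(toScan)))
			toScan &= toScan - 1 // clear lowest set bit
		}
	}
}
```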

Limiting scope to small objects

The prototype focuses on small objects because we derive the most benefit from them. The per-scan overhead of small objects is much harder to amortize because the garbage collector spends so little time scanning each individual object. Larger objects continue to use the old algorithm.

The choice of which algorithm to use is made when scanning encounters a pointer. The span allocator maintains a bitmap with one bit for each 8 KiB page in the heap to indicate whether that page is backed by a small object span. The footprint of this fits easily into cache even for very large heaps, and contention is extremely low.

Since small object spans are always 8 KiB large and 8 KiB aligned, once the scanner knows the target of a pointer is in a small object span, it can use simple address arithmetic to find the object’s metadata within the span, thus avoiding indirections and dependent loads that seriously harm performance.
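As a sketch, assuming a hypothetical bitmap layout, the fast-path check and the address arithmetic look roughly like this; the key point is that everything follows from the pointer via shifts and masks, with no dependent loads into span metadata:

```go
package sketch

const (
	pageSize  = 8 << 10 // 8 KiB
	pageShift = 13
)

// smallSpanPages has one bit per 8 KiB heap page: set if that page is
// backed by a small object span. (Hypothetical global; the real bitmap
// is indexed relative to the heap's base address.)
var smallSpanPages []uint64

func inSmallObjectSpan(p uintptr) bool {
	page := p >> pageShift
	return smallSpanPages[page/64]&(1<<(page%64)) != 0
}

// spanBaseAndIndex locates the object's span and slot with pure
// arithmetic, relying on the 8 KiB size and alignment.
func spanBaseAndIndex(p, elemSize uintptr) (base, idx uintptr) {
	base = p &^ (pageSize - 1)  // round down to the 8 KiB boundary
	idx = (p - base) / elemSize // object slot within the span
	return
}
```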

Work distribution

Go’s current garbage collector distributes work across scanners by having each scanner maintain a local fixed-sized stack of object pointers. However, in order to ensure parallelism, each scanner aggressively checks and populates global lists. This frequent mutation of the global lists is a significant source of contention in Go programs on many-core systems.

The prototype implementation uses a separate queue dedicated to spans, modeled on the distributed work-stealing runqueues used by the goroutine scheduler. The ability to steal work directly from other workers means less contention on global lists. Furthermore, by queuing spans instead of individual objects, there are far fewer items to queue, and thus inherently lower contention on the queues.
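A loose sketch of such a per-worker queue, modeled on (but much simpler than) the scheduler's runqueues; the real queue uses atomic head/tail indices so other workers can steal safely. As the next paragraph notes, FIFO ordering worked best, so pop takes from the head. (span is the type from the bitmap sketch above.)

```go
package sketch

type spanQueue struct {
	ring       [256]*span
	head, tail uint32 // head: dequeue/steal end; tail: local push end
}

// push enqueues a span locally; if the ring is full, half of it is
// flushed to a global list in one batch, amortizing contention.
func (q *spanQueue) push(s *span, flushGlobal func([]*span)) {
	n := uint32(len(q.ring))
	if q.tail-q.head == n {
		batch := make([]*span, 0, n/2)
		for i := uint32(0); i < n/2; i++ {
			batch = append(batch, q.ring[(q.head+i)%n])
		}
		q.head += n / 2
		flushGlobal(batch)
	}
	q.ring[q.tail%n] = s
	q.tail++
}

// pop dequeues in FIFO order; an empty queue would fall back to
// stealing from another worker or the global list (not shown).
func (q *spanQueue) pop() (*span, bool) {
	if q.tail == q.head {
		return nil, false
	}
	s := q.ring[q.head%uint32(len(q.ring))]
	q.head++
	return s, true
}
```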

Span work may be ordered in several different ways. We explored several policies, including FIFO, LIFO, sparsest-first, densest-first, random, and address-ordered. FIFO turned out to accumulate the highest average density of objects to scan per span by the time the span was dequeued for scanning.

Single-object scan optimization

If a span has only a single object to scan when it is dequeued, the new algorithm will have done more work to handle that single object than the current, object-centric algorithm does.

To bring the performance of the single-object-per-span case more in line with the current marking algorithm, we apply two tricks. First, we track the object that was marked when the span was enqueued. This object becomes the span's representative until the span is scanned. Next, we add a hit flag to the span that indicates an object was marked while the span was queued; that is, that at least two objects are marked. When scanning a span, if the hit flag is not set, then the garbage collector can directly scan the span’s representative, instead of processing the entire span.
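Sketched with hypothetical field names, the dequeue-time decision is roughly the following (span as in the sketches above):

```go
package sketch

type spanExtra struct {
	rep uint // object index marked when the span was enqueued
	hit bool // another object was marked while the span was queued
}

// scanDequeued picks between the single-object fast path and a
// whole-span scan.
func scanDequeued(s *span, x *spanExtra, scanOne func(*span, uint), scanAll func(*span)) {
	if !x.hit {
		scanOne(s, x.rep) // exactly one object: skip the bitmap walk
		return
	}
	scanAll(s) // two or more objects: process the whole span
}
```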

Prototype evaluation

We evaluated the prototype implementation across a variety of benchmarks, on low-CPU-count, high-CPU-count, amd64, and arm64 Linux virtual machines. The rest of this section summarizes the results, focusing primarily on the differences in garbage collection CPU cost.

In select GC-heavy microbenchmarks (garbage, from golang.org/x/benchmarks, and binary-trees Go #2, from the Computer Language Benchmarks Game), depending on core count, we observed anywhere from a 10–50% reduction in GC CPU costs compared to the existing Go GC. The improvement generally rose with core count, indicating that the prototype scales better than the existing implementation. Furthermore, the number of L1 and L2 cache misses was reduced by half in these benchmarks.

On our bent and sweet benchmark suites, the results were more varied: positive overall, but a mix of improvements and regressions.

Most benchmarks were either unaffected by the changes to the garbage collector, or regressed or improved solely due to changes that had little to do with the garbage collector, such as code alignment changes. Some benchmarks regressed even though less CPU time is spent in the garbage collector. One reason is that the garbage collector's mark phase is active for less time, producing less floating garbage, and that floating garbage acts as a ballast in some benchmarks. Another is that less time spent in the garbage collector means more time spent in other scalability bottlenecks, either in the runtime or in user code, leading to a net apparent regression.

The Go compiler benchmarks inconsistently show a very slight regression (0.5%). Given its magnitude and inconsistency, these benchmarks appear to be rather insensitive to this change. One hypothesis is that the occasional regression is due to an out-of-date PGO profile, but this remains to be investigated.

The tile38 benchmark shows substantial improvements across throughput, latency and memory use. In addition, it showed a 35% reduction in garbage collection overheads. This benchmark queries a local instance of a Tile38 in-memory geospatial database pre-seeded with data. Most of the heap consists of a high-fanout tree, so Green Tea is able to quickly generate large amounts of work and high densities.

The bleve-index benchmark has a heap topology that is quite difficult for Green Tea, though performance overall is a wash. Most of the heap consists of a low-fanout binary tree that is rapidly mutated by the benchmark. Green Tea struggles to generate locality here: half of all span scans scan only a single object. Our current hypothesis is that frequent tree rotations shuffle the tree structure across a large heap (100+ MiB). In contrast, the binary-trees benchmark performs no rotations, so the tree layout retains the good locality of its initial allocations. This suggests that Green Tea has good locality when the application itself has good locality, unlike the current Go GC; but Green Tea, unsurprisingly, can't create locality out of nothing. Both Linux perf and CPU profiles in the 16-core amd64 environment indicate a small ~2% regression in garbage collection overhead; the single-object scan optimization was integral to making this benchmark perform well there. In the 72- and 88-core environments we see a significant improvement, due to the design's improved many-core scalability. A small overall regression on the 16-core arm64 environment remains to be investigated.

Future work

SIMD-accelerated scanning kernels

Scanning memory in larger blocks unlocks the ability to apply SIMD to small objects in the garbage collector. The core idea is to generate a unique scanning kernel for each size class and use SIMD bit manipulation and permutation instructions to load, mask, swizzle, pack, and enqueue pointers. The regularity of the layout of objects in a single span and Go's packed representation of pointer/scalar metadata for small objects both play a major role in making this feasible.
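As a rough illustration, here is a scalar Go stand-in for one such kernel (the real prototypes are AVX512-based; all names here are hypothetical). Note how everything is fixed strides and bit manipulation, with no per-object metadata loads:

```go
package sketch

import (
	"math/bits"
	"unsafe"
)

// scanKernel64 scans marked objects in a span of 64-byte objects; a
// separate kernel would be generated per size class. toScan covers one
// 64-object chunk of the span's mark bits; ptrMask has one bit per
// 8-byte slot that holds a pointer (shared by every object in the span).
func scanKernel64(spanBase uintptr, toScan uint64, ptrMask uint8, enqueue func(uintptr)) {
	const elemSize = 64
	for toScan != 0 {
		obj := spanBase + uintptr(bits.TrailingZeros64(toScan))*elemSize
		for slot := uintptr(0); slot < elemSize/8; slot++ {
			if ptrMask&(1<<slot) != 0 {
				enqueue(*(*uintptr)(unsafe.Pointer(obj + slot*8))) // load pointer field
			}
		}
		toScan &= toScan - 1 // advance to the next marked object
	}
}
```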

Austin Clements developed prototype AVX512-based scanning kernels that reduce garbage collection overheads by another 15–20% in the benchmarks where we already saw improvements. The prototype implementation does not currently use these kernels because they only apply to a small subset of objects at this point in time.

These SIMD kernels require a higher density of objects to scan in order to outperform scanning objects one at a time. They are still being developed, so the prototype does not use them by default, and even when they are enabled, it uses SIMD scanning only when a minimum density threshold is reached.

Concentrator network

Austin's original design for Green Tea used a sorting network called the concentrator network to achieve the even higher pointer density required by SIMD-based scanning, and to generate locality even for metadata operations like setting gray bits. The network was carefully designed to minimize queuing costs, so individual pointers could still be processed efficiently when sufficient scan density was unavailable.

There are two main reasons we did not pursue this direction in the short term. First, we found that even very low density can produce good results, especially with the single-object scan optimization. Second, the concentrator network is more complex to implement, as it is a greater departure from the existing algorithm. However, this design is still an avenue we plan to explore, since it is far more general and tunable.

Acknowledgements

Credit to Austin Clements for the inception of the algorithm and initial prototyping. (They wrote down the key idea in 2018!)

Credit to Yves Vandriessche from Intel for providing many microarchitectural insights that were vital to making this design viable. Many of his suggestions were applied to the core scan loop, including proper prefetching, batching subsequent pointer loads to hide their memory latency, and simpler iteration over object pointers.

Footnotes

  1. "The Garbage Collection Handbook" by Richard Jones, Anthony Hosking, Eliot Moss, a canonical source for garbage collection techniques, divides parallel garbage collection algorithms into two categories, "processor-centric" and "memory-centric." Loosely, the former category consists of algorithms that aggressively balance work across processors to maximize parallelism, while the latter consists of algorithms that may not parallelize as well, but process contiguous blocks of the heap. The current marking algorithm in the Go garbage collector is firmly processor-centric. Unfortunately for us, the handbook's wisdom ends there, as it goes on to state that "[a]ll known parallel marking algorithms are processor-centric." (page 279)

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 2, 2025
@mknyszek mknyszek added this to the Go1.25 milestone May 2, 2025
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 2, 2025
gopherbot (Contributor) commented:

Change https://go.dev/cl/658036 mentions this issue: runtime: mark and scan small objects in whole spans [green tea]

mknyszek (Contributor, Author) commented May 2, 2025

How to try it out

  1. Install the gotip tool (https://pkg.go.dev/golang.org/dl/gotip):
     $ go install golang.org/dl/gotip@latest
     $ gotip download
  2. Build your program with gotip, with GOEXPERIMENT=greenteagc set.
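For example, building everything in the current module with Green Tea enabled (a hypothetical invocation; substitute your own build command):

$ GOEXPERIMENT=greenteagc gotip build ./...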

Sending feedback

Please focus your attention on whole programs. Microbenchmarks tend to be poor representatives of garbage collection behavior in real programs. Foundational changes such as this one can move the needle for all sorts of reasons, unrelated to (or even inversely related to) how efficient the garbage collector actually is.

If you encounter a situation where Green Tea is or isn't working out for you, we'd appreciate it if you could share some details with us; the following would be really helpful moving forward:

  1. The platform you're running on.
  2. The CPU model (or if in the cloud, the VM type) you're running your code on.
  3. Excerpts from stderr of your program running with GODEBUG=gctrace=2, both with and without Green Tea.
  4. A CPU profile of your program running with and without Green Tea.
  5. An execution trace capturing a few complete GC cycles, with and without Green Tea (a few seconds is usually enough).

Note: you can use GOEXPERIMENT=nogreenteagc to explicitly disable Green Tea at build time.

Feel free to post to this GitHub issue, or you can email me directly at mknyszek(at)golang.org.

Thank you!

gopherbot pushed a commit that referenced this issue May 2, 2025
Our current parallel mark algorithm suffers from frequent stalls on
memory since its access pattern is essentially random. Small objects
are the worst offenders, since each one forces pulling in at least one
full cache line to access even when the amount to be scanned is far
smaller than that. Each object also requires an independent access to
per-object metadata.

The purpose of this change is to improve garbage collector performance
by scanning small objects in batches to obtain better cache locality
than our current approach. The core idea behind this change is to defer
marking and scanning small objects, and then scan them in batches
localized to a span.

This change adds scanned bits to each small object (<=512 bytes) span in
addition to mark bits. The scanned bits indicate that the object has
been scanned. (One way to think of them is "grey" bits and "black" bits
in the tri-color mark-sweep abstraction.) Each of these spans is always
8 KiB and if they contain pointers, the pointer/scalar data is already
packed together at the end of the span, allowing us to further optimize
the mark algorithm for this specific case.

When the GC encounters a pointer, it first checks if it points into a
small object span. If so, it is first marked in the mark bits, and then
the object is queued on a work-stealing P-local queue. This object
represents the whole span, and we ensure that a span can only appear at
most once in any queue by maintaining an atomic ownership bit for each
span. Later, when the pointer is dequeued, we scan every object with a
set mark that doesn't have a corresponding scanned bit. If it turns out
that was the only object in the mark bits since the last time we scanned
the span, we scan just that object directly, essentially falling back to
the existing algorithm. noscan objects have no scan work, so they are
never queued.

Each span's mark and scanned bits are co-located together at the end of
the span. Since the span is always 8 KiB in size, it can be found with
simple pointer arithmetic. Next to the marks and scans we also store the
size class, eliminating the need to access the span's mspan altogether.

The work-stealing P-local queue is a new source of GC work. If this
queue gets full, half of it is dumped to a global linked list of spans
to scan. The regular scan queues are always prioritized over this queue
to allow time for darts to accumulate. Stealing work from other Ps is a
last resort.

This change also adds a new debug mode under GODEBUG=gctrace=2 that
dumps whole-span scanning statistics by size class on every GC cycle.

A future extension to this CL is to use SIMD-accelerated scanning
kernels for scanning spans with high mark bit density.

For #19112. (Deadlock averted in GOEXPERIMENT.)
For #73581.

Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/658036
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@mknyszek mknyszek modified the milestones: Go1.25, Go1.26 May 2, 2025
gopherbot (Contributor) commented:

Change https://go.dev/cl/669655 mentions this issue: main.star: add greenteagc builders

gopherbot pushed a commit to golang/build that referenced this issue May 3, 2025
Now that the greenteagc GOEXPERIMENT is no longer a no-op, let's add
some builders for it.

For golang/go#73581.

Change-Id: I50613100ec24eb9813ca8d64c474d6efe9bc791c
Reviewed-on: https://go-review.googlesource.com/c/build/+/669655
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
@mknyszek mknyszek self-assigned this May 12, 2025
@mknyszek mknyszek moved this to In Progress in Go Compiler / Runtime May 12, 2025
dominikh (Member) commented May 15, 2025

~~No discernible change~~ Minor improvement in CPU time for Staticcheck.

CPU: AMD Ryzen 9 3950X 16-Core Processor

> ./staticcheck -debug.version | head -1
staticcheck (devel, v0.7.0-0.dev.0.20250511162810-63832850a84e)

Vanilla:
> GOEXPERIMENT=nogreenteagc go build
> go version ./staticcheck
./staticcheck: go1.25-devel_fac2ccbed3 Thu May 15 08:27:04 2025 -0700
> rm -rf ~/.cache/staticcheck/
> perf stat ./staticcheck -debug.cpuprofile=vanilla.cpu.pprof std

 Performance counter stats for './staticcheck -debug.cpuprofile=vanilla.cpu.pprof std':

        197,121.11 msec task-clock:u                     #   18.932 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           741,678      page-faults:u                    #    3.763 K/sec                     
   738,518,295,442      cycles:u                         #    3.747 GHz                       
   124,817,958,966      stalled-cycles-frontend:u        #   16.90% frontend cycles idle      
   785,988,849,431      instructions:u                   #    1.06  insn per cycle            
                                                  #    0.16  stalled cycles per insn   
   163,480,505,449      branches:u                       #  829.340 M/sec                     
     2,197,680,754      branch-misses:u                  #    1.34% of all branches           

      10.411895545 seconds time elapsed

     182.959316000 seconds user
      10.483651000 seconds sys
> rm -rf ~/.cache/staticcheck
> ./staticcheck -debug.trace=vanilla.trace std # trace collected separately
> rm -rf ~/.cache/staticcheck
> GODEBUG=gctrace=2 ./staticcheck std 2>vanilla.gctrace # GC trace collected separately


Green tea:
> GOEXPERIMENT=greenteagc go build
> go version ./staticcheck
./staticcheck: go1.25-devel_fac2ccbed3 Thu May 15 08:27:04 2025 -0700 X:greenteagc
> rm -rf ~/.cache/staticcheck/
> perf stat ./staticcheck -debug.cpuprofile=greentea.cpu.pprof std

 Performance counter stats for './staticcheck -debug.cpuprofile=greentea.cpu.pprof std':

        193,681.28 msec task-clock:u                     #   18.553 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
           782,480      page-faults:u                    #    4.040 K/sec                     
   723,000,663,331      cycles:u                         #    3.733 GHz                       
   130,424,001,536      stalled-cycles-frontend:u        #   18.04% frontend cycles idle      
   788,115,346,312      instructions:u                   #    1.09  insn per cycle            
                                                  #    0.17  stalled cycles per insn   
   158,001,376,147      branches:u                       #  815.780 M/sec                     
     2,432,170,668      branch-misses:u                  #    1.54% of all branches           

      10.439577978 seconds time elapsed

     178.367421000 seconds user
      11.257257000 seconds sys
> rm -rf ~/.cache/staticcheck
> ./staticcheck -debug.trace=greentea.trace std # trace collected separately
> rm -rf ~/.cache/staticcheck
> GODEBUG=gctrace=2 ./staticcheck std 2>greentea.gctrace # GC trace collected separately

The attached files contain the CPU profiles and the gctrace output. I also have execution traces, but they span the whole execution and are ~50 MB each when compressed, which GitHub won't let me attach.

greentea.tar.gz
vanilla.tar.gz

perf stat before/after courtesy of normalize-benchfmt:

             │ vanilla.bench │          greentea.bench           │
             │      sec      │    sec      vs base               │
Task-clock:u      199.3 ± 0%   194.4 ± 1%  -2.50% (p=0.000 n=10)
Wall-time         10.11 ± 0%   10.18 ± 1%  +0.74% (p=0.005 n=10)
User-time         184.9 ± 0%   178.9 ± 1%  -3.26% (p=0.000 n=10)
System-time       10.66 ± 2%   11.33 ± 3%  +6.32% (p=0.000 n=10)
geomean           44.63        44.75       +0.25%

                          │ vanilla.bench │            greentea.bench             │
                          │      val      │     val      vs base                  │
Context-switches:u           0.000 ± 0%      0.000 ± 0%        ~ (p=1.000 n=10) ¹
Cpu-migrations:u             0.000 ± 0%      0.000 ± 0%        ~ (p=1.000 n=10) ¹
Page-faults:u               796.2k ± 4%     804.4k ± 4%        ~ (p=0.579 n=10)
Cycles:u                    739.5G ± 0%     719.6G ± 1%   -2.69% (p=0.000 n=10)
Stalled-cycles-frontend:u   125.9G ± 0%     131.1G ± 0%   +4.07% (p=0.000 n=10)
Instructions:u              786.6G ± 0%     789.0G ± 0%   +0.30% (p=0.000 n=10)
Branches:u                  163.6G ± 0%     158.1G ± 0%   -3.36% (p=0.000 n=10)
Branch-misses:u             2.205G ± 0%     2.436G ± 0%  +10.48% (p=0.000 n=10)
geomean                                 ²                 +1.15%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

mknyszek (Contributor, Author) commented May 15, 2025

Thanks for sharing! I think there is a win here, though it's not a massive improvement. I think there are some effects that might be masking a bigger win.

At least according to the gctrace, I see less overall CPU time spent in any individual GC cycle: on average, around 10–15% less.

Also, here are the distributions of wall time spent in the mark phase:

$ cat greentea.gctrace | grep ^gc | cut -d ' ' -f 5 | cut -d '+' -f 2 | dist
N 152  sum 3269.07  mean 21.507  gmean 12.3254  std dev 19.7445  variance 389.844

     min 0.28
   1%ile 0.365667
   5%ile 0.9985
  25%ile 4.94167
  median 16.5
  75%ile 29.5833
  95%ile 59.1
  99%ile 82.1533
     max 101

⠀⠀⠀⠀⠀⠀⠀⠀⢀⡴⠒⠒⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.036
⠀⠀⠀⠀⠀⠀⠀⡰⠋⠀⠀⠀⠀⠈⠓⠒⠒⠤⠒⠒⠒⠒⠒⠦⢤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠞⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠙⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠋⠉⠉⠙⠒⠒⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠁
        0                          50                          100
$ cat vanilla.gctrace | grep ^gc | cut -d ' ' -f 5 | cut -d '+' -f 2 | dist
N 139  sum 3502.72  mean 25.1994  gmean 15.1408  std dev 20.7032  variance 428.621

     min 0.28
   1%ile 0.3018
   5%ile 1.13
  25%ile 6.28333
  median 21
  75%ile 36.6667
  95%ile 66.4
  99%ile 80.2733
     max 81

⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠒⠒⠒⠲⢤⣀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.028
⠀⠀⠀⠀⠀⠀⠀⣠⠔⠁⠀⠀⠀⠀⠀⠀⠈⠓⠒⠒⠊⠉⠉⠉⠁⠀⠈⠉⠙⠒⠦⢤⣀⡀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠚⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠁⠈⠉⠉⠉⠉⠉⠉⠉⠉⠒⠒⠒⠒⠒⠒⠢⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
         0

Simultaneously, because each GC cycle is shorter, there's less floating garbage, leading to a smaller live heap. This means that for the same amount of allocation work, we get more GC cycles if GOGC is constant (which it is here). The proof is that Green Tea runs 152 GC cycles where the original GC executes only 139. Despite this increase in the number of cycles, the CPU profile suggests that around 10% less time is spent in the garbage collector.

Here are the heap size distributions:

$ cat greentea.gctrace | grep ^gc | cut -d ' ' -f 11 | sed 's/->/\./g' | cut -d '.' -f 2 | dist
N 152  sum 78714  mean 517.855  gmean 295.01  std dev 441.656  variance 195060

     min 3
   1%ile 4.71333
   5%ile 23.85
  25%ile 118.167
  median 425
  75%ile 746.583
  95%ile 1422.05
  99%ile 1576.32
     max 1644

⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⠒⠒⠒⠲⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.001
⠀⠀⠀⠀⠀⠀⠀⣠⠔⠁⠀⠀⠀⠀⠀⠀⠙⠒⠦⠤⠤⠖⠒⠋⠉⠉⠉⠓⠒⠒⠒⠲⠤⠤⠤⣄⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠴⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠙⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠚⠉⠉⠉⠉⠉⠉⠙⠒⠒⠢⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
         0                              1000
$ cat vanilla.gctrace | grep ^gc | cut -d ' ' -f 11 | sed 's/->/\./g' | cut -d '.' -f 2 | dist
N 139  sum 77043  mean 554.266  gmean 314.515  std dev 450.016  variance 202514

     min 3
   1%ile 4.45333
   5%ile 22
  25%ile 124.333
  median 504
  75%ile 820.833
  95%ile 1492.1
  99%ile 1602.14
     max 1661

⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⠴⠒⠒⠒⠦⣄⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.001
⠀⠀⠀⠀⠀⠀⠀⣠⠞⠁⠀⠀⠀⠀⠀⠈⠑⠦⠤⠤⠤⠖⠋⠉⠀⠀⠀⠈⠉⠉⠙⠒⠦⠤⠤⢤⣀⣀⣀⡀⠀⣀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠴⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠁⠀⠀⠀⠈⠉⠉⠉⠒⠒⠒⠒⠒⠚⠉⠉⠉⠉⠙⠒⠒⠒⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁

As for why this isn't improving overall end-to-end time much, I suspect that:

  1. The garbage collector is just not on the critical path. Does staticcheck run in parallel or is it mostly a single goroutine?
  2. The increase in the number of GC cycles from a reduction in memory is eroding some of the CPU win.

(I could probably analyze this more efficiently if I just made a tool to parse gctraces... I've been doing quite a lot of this lately. 😅)

dominikh (Member) commented:

> The garbage collector is just not on the critical path. Does staticcheck run in parallel or is it mostly a single goroutine?

The actual analysis phase of Staticcheck tries to use GOMAXPROCS goroutines at once, parallelizing on packages as well as individual analyzers per package.

I've updated my original comment with a proper comparison of before and after perf stat. Maybe that can help shed some more light. The increase in system time surprises me.

mknyszek (Contributor, Author) commented May 15, 2025

The increase in system time might be related to the use of a runtime.mutex for the global span list. I tried to swap that out with a lock-free stack (and a lock-free stack of blocks of spans, not coalesced) and it actually scaled significantly worse. With a lock, we can amortize the lock's use over several spans (we try to almost never take just one off the list once we get the lock) whereas a lock-free stack requires an atomic operation for each span we want to take off the list. I'm not quite sure how to confirm this. (It's possibly also a second-order effect from (apparently) having a little less GC pressure. Maybe there's some other system resource that staticcheck relies on that's now hammered slightly harder. Like a kernel lock or something.)
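For illustration, a minimal sketch of that batched-drain pattern (hypothetical names; span as in the design sketches above). One lock acquisition pays for many spans, whereas a lock-free stack pays roughly one atomic operation per span popped:

```go
package sketch

import "sync"

type globalSpanList struct {
	mu    sync.Mutex
	spans []*span
}

// takeBatch removes up to max spans under a single lock acquisition,
// amortizing the lock's cost across the whole batch.
func (l *globalSpanList) takeBatch(max int) []*span {
	l.mu.Lock()
	defer l.mu.Unlock()
	n := max
	if n > len(l.spans) {
		n = len(l.spans)
	}
	batch := l.spans[len(l.spans)-n:]
	l.spans = l.spans[:len(l.spans)-n]
	return batch
}
```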

As an aside, there may be a more clever lock-free data structure we could use that would allow us to take many spans off the global list at once, but it's probably going to be quite complicated.

I also checked out the new scan stats I added and based on that, Green Tea's core hypothesis does seem to apply to staticcheck, which is good. A dozen or two objects are scanned on every span scan, which is not bad.
