runtime: go.uber.org/zap/zapcore BenchmarkZapJSON performance regression in 1.19 #54076
Given the big jumps in this benchmark's results over time, it seems to be rather sensitive to changes.
This reproduces easily on my local machine (I get ~500ns/op on Go 1.18 vs ~650ns/op on Go 1.19).
The immediate cause of the regression appears to be additional GC cycles for the same amount of work.
End of GC traces:
I'm not the best at interpreting these, but it seems that 1.19 does manage a lower GC CPU percentage per cycle, and fewer assists, which is good; the extra cycles, however, hurt.
Another interesting point is that this benchmark has heavy contention on the heap lock (in both versions; tiny bit higher in 1.19):
I wonder if that contention may be messing with pacing a bit, since adding more concurrency is not helping as much as it should.
The overall 1.18 vs 1.19 results for zapcore on my machine look like this:
I suspect that we have overly sensitive benchmarks here and there isn't much for us to do (except perhaps investigate the heap lock contention), but I'd like to hear @mknyszek's thoughts.
The additional GC cycles come from a difference in overshoot vs. undershoot. Go 1.18 overshoots on average, Go 1.19 undershoots on average, but has more variance in general. That might be worth investigating, but I'm not sure it makes this a critical regression (it's a microbenchmark after all).
In the image below, the first one is Go 1.18, the second is Go 1.19.
Here's the raw text (I added the image because the distribution is difficult to see with the Braille characters on GitHub):
As for the reason for the overshoot vs. undershoot: Go 1.18 thinks the cons/mark ratio is lower than Go 1.19 does. Note that the variance in Go 1.19 is also higher, which directly leads to the overshoot/undershoot variance.
Again, an image:
I think the increase in cons/mark (and variance) is a result of:
This would explain everything, assuming the allocation rate is roughly the same: there are more GC cycles because the GC expects the costs to be offset by shorter cycles. But the cycles aren't actually shorter, so to the pacer it looks like the GC is simply slower.
As a side note: I think this benchmark also has some kind of memory leak (or is actually storing log data somewhere). I noticed that in every run-of-10 I did, the timings went up slightly with each run. When I look at the
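One way to check for that kind of retention across runs: force a full GC, record the live heap, run the workload repeatedly, and see whether the post-GC live heap keeps growing. A sketch (the leaky `workload` below is a hypothetical placeholder for one run of the benchmark, not zap's actual behavior):

```go
package main

import (
	"fmt"
	"runtime"
)

// retained simulates a logger (or benchmark harness) accidentally
// holding on to log data across runs.
var retained [][]byte

// workload stands in for one run of the benchmark; here it
// deliberately retains 1 MiB per run to demonstrate the signal.
func workload() {
	retained = append(retained, make([]byte, 1<<20))
}

// liveHeap forces a full collection and returns the post-GC heap size,
// i.e. approximately the live data.
func liveHeap() uint64 {
	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.HeapAlloc
}

func main() {
	base := liveHeap()
	for run := 1; run <= 5; run++ {
		workload()
		fmt.Printf("run %d: live heap grew by %d bytes\n", run, liveHeap()-base)
	}
	// A genuine leak shows up as monotonic growth in the post-GC heap;
	// a clean benchmark should stay roughly flat.
}
```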