Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: go.uber.org/zap/zapcore BenchmarkZapJSON performance regression in 1.19 #54076

Closed
prattmic opened this issue Jul 26, 2022 · 6 comments
Closed
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance
Milestone

Comments

@prattmic
Copy link
Member

prattmic commented Jul 26, 2022

BenchmarkZapJSON is currently our largest regression for 1.19 on the performance dashboard.

Given the big jumps on the benchmark, this benchmark seems to be rather sensitive to changes.

This reproduces easily locally (I get ~500ns/op vs ~650ns/op on my local machine).

The immediate cause of the regression appears to be additional GC cycles. For the same amount of work (-test.count=3 -test.benchtime=2000000x), Go 1.19 runs nearly 25% more GC cycles.

End of GC traces:

1.18:

pacer: assist ratio=+1.171021e+001 (scan 1 MB in 4->4 MB) workers=3++0.000000e+000
pacer: 32% CPU (25 exp.) for 413560+94208+250800 B work (767440 B exp.) in 4194304 B -> 4535208 B (∆goal 198664, cons/mark +1.286002e-001)
gc 2314 @4.953s 10%: 0.090+0.32+0.038 ms clock, 1.0+0.36/0.42/0+0.45 ms cpu, 4->4->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+1.173108e+001 (scan 1 MB in 4->4 MB) workers=3++0.000000e+000
pacer: 32% CPU (25 exp.) for 412280+94208+250800 B work (768808 B exp.) in 4194304 B -> 4373576 B (∆goal 36216, cons/mark +2.252150e-001)
gc 2315 @4.954s 10%: 0.092+0.34+0.039 ms clock, 1.1+0.39/0.48/0+0.47 ms cpu, 4->4->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+1.171155e+001 (scan 1 MB in 4->4 MB) workers=3++0.000000e+000
pacer: 28% CPU (25 exp.) for 411872+94208+250800 B work (767528 B exp.) in 4194304 B -> 4424104 B (∆goal 94712, cons/mark +6.726164e-002)
gc 2316 @4.956s 10%: 0.091+0.30+0.047 ms clock, 1.1+0.17/0.49/0+0.56 ms cpu, 4->4->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+1.170532e+001 (scan 1 MB in 4->4 MB) workers=3++0.000000e+000
pacer: 36% CPU (25 exp.) for 412632+94208+250800 B work (767120 B exp.) in 4194304 B -> 4400664 B (∆goal 53608, cons/mark +1.921140e-001)
gc 2317 @4.957s 10%: 0.093+0.23+0.047 ms clock, 1.1+0.46/0.42/0+0.56 ms cpu, 4->4->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+1.171692e+001 (scan 1 MB in 4->4 MB) workers=3++0.000000e+000
pacer: 30% CPU (25 exp.) for 411960+94208+250800 B work (767880 B exp.) in 4194304 B -> 4402600 B (∆goal 72680, cons/mark +1.241341e-001)
gc 2318 @4.959s 10%: 0.097+0.20+0.040 ms clock, 1.1+0.21/0.45/0+0.48 ms cpu, 4->4->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+1.170667e+001 (scan 1 MB in 4->4 MB) workers=3++0.000000e+000
pacer: 32% CPU (25 exp.) for 412584+94208+250800 B work (767208 B exp.) in 4194304 B -> 4422664 B (∆goal 83832, cons/mark +1.466648e-001)
gc 2319 @4.960s 10%: 0.065+0.46+0.041 ms clock, 0.78+0.46/0.50/0+0.49 ms cpu, 4->4->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
 2000000               499.2 ns/op
PASS

1.19:

pacer: assist ratio=+7.841786e-001 (scan 0 MB in 3->4 MB) workers=3++0.000000e+000
pacer: 29% CPU (25 exp.) for 401504+23000+251184 B work (677248 B exp.) in 3243422 B -> 4372552 B (∆goal 178248, cons/mark +5.395069e-001)
gc 2831 @5.804s 11%: 0.094+0.56+0.076 ms clock, 1.1+0.32/0.51/0.056+0.92 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+8.873352e-001 (scan 0 MB in 3->4 MB) workers=3++0.000000e+000
pacer: 28% CPU (25 exp.) for 401976+22368+251184 B work (675688 B exp.) in 3364591 B -> 4436664 B (∆goal 242360, cons/mark +7.316868e-001)
gc 2832 @5.806s 11%: 0.10+0.60+0.078 ms clock, 1.2+0.27/0.53/0.007+0.93 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+9.275663e-001 (scan 0 MB in 3->4 MB) workers=3++0.000000e+000
pacer: 26% CPU (25 exp.) for 402936+24536+251184 B work (675528 B exp.) in 3372020 B -> 4994456 B (∆goal 800152, cons/mark +5.260932e-001)
gc 2833 @5.807s 11%: 0.084+0.84+0.073 ms clock, 1.0+0.18/0.60/0+0.88 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+1.142488e+000 (scan 0 MB in 3->4 MB) workers=3++0.000000e+000
pacer: 28% CPU (25 exp.) for 402528+23264+251184 B work (678656 B exp.) in 3536821 B -> 4632408 B (∆goal 430976, cons/mark +9.060522e-001)
gc 2834 @5.809s 11%: 0.10+0.56+0.049 ms clock, 1.2+0.26/0.46/0+0.59 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+9.351634e-001 (scan 0 MB in 3->4 MB) workers=3++0.000000e+000
pacer: 27% CPU (25 exp.) for 401536+23640+251184 B work (676976 B exp.) in 3398354 B -> 4250728 B (∆goal 56424, cons/mark +4.320917e-001)
gc 2835 @5.810s 11%: 0.10+0.52+0.095 ms clock, 1.2+0.22/0.56/0+1.1 ms cpu, 3->4->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
pacer: assist ratio=+8.235036e-001 (scan 0 MB in 3->4 MB) workers=3++0.000000e+000
pacer: 28% CPU (25 exp.) for 401680+23944+251184 B work (676360 B exp.) in 3304339 B -> 4110056 B (∆goal -84248, cons/mark +6.644673e-001)
gc 2836 @5.811s 11%: 0.10+0.42+0.088 ms clock, 1.2+0.23/0.61/0.004+1.0 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 12 P
 2000000               662.0 ns/op
PASS

I'm not the best at interpreting these, but it seems that 1.19 does manage to use a lower GC CPU percentage for each cycle, and fewer assists, which is good, but the extra cycles hurt.

Another interesting point is that this benchmark has heavy contention on the heap lock (in both versions; tiny bit higher in 1.19):

image

I wonder if that contention may be messing with pacing a bit, since adding more concurrency is not helping as much as it should.

The overall 1.18 vs 1.19 results for zapcore on my machine look like this:

name                                           old time/op  new time/op   delta
BufferedWriteSyncer/write_file_with_buffer-12   235ns ±17%    235ns ± 9%     ~     (p=0.739 n=10+10)
MultiWriteSyncer/2_discarder-12                14.1ns ± 1%   13.7ns ± 2%   -2.65%  (p=0.000 n=8+10)
MultiWriteSyncer/4_discarder-12                10.2ns ± 1%   10.2ns ± 2%     ~     (p=0.926 n=10+10)
MultiWriteSyncer/4_discarder_with_buffer-12     200ns ± 3%    198ns ± 6%     ~     (p=0.447 n=10+10)
WriteSyncer/write_file_with_no_buffer-12       4.77µs ± 1%   4.74µs ± 3%     ~     (p=0.247 n=10+10)
ZapConsole-12                                   664ns ± 1%    797ns ± 0%  +20.01%  (p=0.000 n=10+8)
JSONLogMarshalerFunc-12                        1.18µs ± 2%   1.18µs ± 2%     ~     (p=0.425 n=10+10)
ZapJSONFloat32AndComplex64-12                   418ns ± 1%    597ns ± 1%  +42.73%  (p=0.000 n=9+10)
ZapJSON-12                                      481ns ± 1%    650ns ± 1%  +35.30%  (p=0.000 n=10+10)
StandardJSON-12                                 852ns ± 1%    897ns ± 1%   +5.22%  (p=0.000 n=9+10)
Sampler_Check/7_keys-12                        10.7ns ± 1%   10.8ns ± 1%   +0.76%  (p=0.015 n=8+8)
Sampler_Check/50_keys-12                       11.2ns ± 1%   11.2ns ± 0%     ~     (p=0.067 n=10+9)
Sampler_Check/100_keys-12                      11.1ns ± 4%   10.9ns ± 0%     ~     (p=0.266 n=9+9)
Sampler_CheckWithHook/7_keys-12                20.7ns ± 4%   20.9ns ± 0%     ~     (p=0.165 n=10+8)
Sampler_CheckWithHook/50_keys-12               21.0ns ± 2%   21.7ns ± 2%   +3.33%  (p=0.000 n=8+10)
Sampler_CheckWithHook/100_keys-12              21.8ns ± 5%   21.1ns ± 2%   -3.21%  (p=0.012 n=10+10)
TeeCheck-12                                     143ns ± 3%    138ns ± 4%   -3.32%  (p=0.003 n=10+10)
[Geo mean]                                     98.5ns       103.4ns        +5.00%

I suspect that we have overly sensitive benchmarks here and there isn't much for us to do (except perhaps investigate the heap lock contention), but I'd like to hear @mknyszek's thoughts.

@prattmic prattmic added this to the Backlog milestone Jul 26, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 26, 2022
@mknyszek
Copy link
Contributor

mknyszek commented Jul 27, 2022

Ah, sorry, I actually looked into this one in the memory regressions issue. IIRC, yes, the pacer is just doing its job a little better (in the tail of your trace, note the difference in assist CPU %).

I'm inclined to close this.

@mknyszek
Copy link
Contributor

mknyszek commented Jul 27, 2022

Ah, you did note that. :) I should learn to read. The contention is not great, but with fewer assists, contention is going to increase because the allocator is hit harder.

@mknyszek
Copy link
Contributor

mknyszek commented Jul 27, 2022

The additional GC cycles come from a difference in overshoot vs. undershoot. Go 1.18 overshoots on average, Go 1.19 undershoots on average, but has more variance in general. That might be worth investigating, but I'm not sure it makes this a critical regression (it's a microbenchmark after all).

In the image below, the first one is Go 1.18, the second is go1.19.

Screen Shot 2022-07-27 at 6 20 58 AM

Here's the raw text (I added the image because the distribution is difficult to see with the Braille characters on GitHub):

zapcore $ grep "∆goal" go1.18.out | sed 's/^[^∆]*∆goal \(-\?[0-9][0-9]*\).*$/\1/' | dist
N 10661  sum 7.12519e+08  mean 66834.1  std dev 52150.8  variance 2.7197e+09

     min -37512
   1%ile -9800
   5%ile -1608
  25%ile 28088
  median 59128
  75%ile 96568
  95%ile 161704
  99%ile 220272
     max 556936

⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⣠⠤⠔⠒⠒⠒⠒⠒⠒⠒⠲⠤⠤⢤⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.000
⠀⠀⠀⠀⠀⠀⠀⡰⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠙⠒⠲⠢⠤⠤⢄⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠉⠙⠒⠒⠒⠒⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
         0                  100000               200000
zapcore $ grep "∆goal" go1.19.out | sed 's/^[^∆]*∆goal \(-\?[0-9][0-9]*\).*$/\1/' | dist
N 12041  sum -1.50437e+09  mean -124937  std dev 259902  variance 6.75491e+10

     min -1.67618e+06
   1%ile -622537
   5%ile -486064
  25%ile -290363
  median -139368
  75%ile 21373.3
  95%ile 263290
  99%ile 456041
     max 5.95953e+06

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⠤⠤⠴⠒⠒⠒⠒⠒⠒⠒⠒⠒⠤⠤⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.000
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⠤⠤⠖⠊⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠑⠒⠦⠤⠤⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠤⠤⠒⠒⠒⠚⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠓⠒⠒⠒⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
          -500000                   0                   500000

@mknyszek
Copy link
Contributor

mknyszek commented Jul 27, 2022

As for the reason for the overshoot vs. undershoot, it looks like Go 1.18 thinks the cons/mark ratio is lower than Go 1.19 thinks it is. Note that the variance is also higher, which directly leads to the overshoot/undershoot variance.

Again, an image:

Screen Shot 2022-07-27 at 6 26 42 AM

Raw text:

zapcore $ grep "∆goal" go1.18.out | sed 's/^.*cons\/mark \(+\?-\?[0-9]\.[0-9]*e[+-][0-9][0-9]*\).*$/\1/' | dist
N 10661  sum 1718.16  mean 0.161163  std dev 0.0653517  variance 0.00427084

     min -0.077172
   1%ile 0.0228243
   5%ile 0.0624429
  25%ile 0.116103
  median 0.157738
  75%ile 0.202597
  95%ile 0.273059
  99%ile 0.332474
     max 0.52172

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⡤⠤⠤⠒⠒⠒⠒⠒⠒⠒⠦⢤⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 6.511
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣠⠤⠴⠒⠚⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠓⠲⠤⢤⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠤⠤⠖⠒⠒⠒⠒⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠉⠒⠒⠒⠒⠢⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
    -0
zapcore $ grep "∆goal" go1.19.out | sed 's/^.*cons\/mark \(+\?-\?[0-9]\.[0-9]*e[+-][0-9][0-9]*\).*$/\1/' | dist
N 12041  sum 5859.47  mean 0.486627  std dev 0.224311  variance 0.0503153

     min -3.6061
   1%ile 0.10322
   5%ile 0.244553
  25%ile 0.385596
  median 0.480679
  75%ile 0.574186
  95%ile 0.737443
  99%ile 0.970071
     max 6.55697

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠤⠖⠒⠒⠦⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 3.013
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡠⠖⠋⠀⠀⠀⠀⠀⠀⠀⠉⠓⠤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠖⠒⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠓⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
          0                                   1

@mknyszek
Copy link
Contributor

mknyszek commented Jul 27, 2022

I think the increase in cons/mark (and variance) is a result of:

  1. Fewer bytes of work in Go 1.19 (the stack scanning work is more accurately tracked thanks to Keith's changes) which leads to some undershoot, which leads to less floating garbage, which means more undershoot (notice that the scan work each cycle is a good deal less; around 30 KiB (out of 300 KiB) less in heap, and 60 KiB less in stack).
  2. Not proportionally less work in CPU time, AFAICT. I suspect GC fixed costs dominate in this benchmark, so some of the assumptions in the pacer are probably breaking down. Perhaps we should try to model that. Small changes in small heaps like this are going to have a larger effect.

This would explain everything assuming the allocation rate is roughly the same. There are more GC cycles because the GC expects the costs to be offset by shorter cycles. But the cycles aren't really shorter, so it appears to the GC like the GC is just slower.

As a side-note: I think this benchmark also has some kind of memory leak (or is actually storing log data somewhere). I noticed that every run-of-10 I did, the timings went up slightly with each run. When I look at the gcpacertrace, I think I know why: the scan work is actually increasing over time!

@prattmic
Copy link
Member Author

prattmic commented Jul 27, 2022

Thanks for the detailed analysis!

@prattmic prattmic closed this as not planned Won't fix, can't repro, duplicate, stale Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. Performance
Projects
Status: Done
Development

No branches or pull requests

3 participants