runtime: parallel json benchmark leads to high CPU usage when go version > 1.15 on ARM64 #60043

Open
botieking98 opened this issue May 8, 2023 · 7 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone: Backlog

Comments

@botieking98

botieking98 commented May 8, 2023

I found a CPU performance degradation on the ARM64 platform. The parallel JSON benchmark built with the latest Go version shows a 50%+ increase in CPU usage compared to the old version 1.15.

This can be easily reproduced on ARM64 (a command sketch follows the steps):

  1. Install the Go toolchain.
  2. Download go-json-bench: https://github.com/kpango/go-json-bench
  3. Run cd go-json-bench && make bench and observe CPU usage.
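
For reference, a minimal command sketch of the steps above, saving the benchmark output so the two toolchains can be compared later; /usr/local/go1.15 and /usr/local/go-latest are hypothetical install paths for the two toolchains being compared:

# Steps 2 and 3, keeping the raw benchmark output for a later benchstat comparison.
git clone https://github.com/kpango/go-json-bench
cd go-json-bench
/usr/local/go1.15/bin/go test -count=10 -run=NONE -bench . -benchmem | tee go-1.15.7.txt
/usr/local/go-latest/bin/go test -count=10 -run=NONE -bench . -benchmem | tee go-latest.txt
# While the parallel benchmarks run, CPU usage of the test binary can be
# observed with top or htop in another terminal.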

My machine configuration:

Architecture:           aarch64
  CPU op-mode(s):       64-bit
  Byte Order:           Little Endian
CPU(s):                 128
  On-line CPU(s) list:  0-127
Vendor ID:              HiSilicon
  BIOS Vendor ID:       HiSilicon
  Model name:           Kunpeng-920
    BIOS Model name:    Kunpeng 920-6426
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 64
    Socket(s):          2
    Stepping:           0x1
    Frequency boost:    disabled
    CPU max MHz:        2600.0000
    CPU min MHz:        200.0000
    BogoMIPS:           200.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Caches (sum of all):
  L1d:                  8 MiB (128 instances)
  L1i:                  8 MiB (128 instances)
  L2:                   64 MiB (128 instances)
  L3:                   128 MiB (4 instances)
NUMA:
  NUMA node(s):         4
  NUMA node0 CPU(s):    0-31
  NUMA node1 CPU(s):    32-63
  NUMA node2 CPU(s):    64-95
  NUMA node3 CPU(s):    96-127

The benchmark results are below:

CPU usage is about 2000% for go 1.15.7:

[screenshot]

CPU usage is about 3000% for the latest go version:

[screenshot]

Finally, through a git bisect search, I found that fe7ff71185cf was the first commit that led to the performance degradation, but it is not clear to me why this commit has such a big impact on CPU usage.
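
For context, a rough sketch of how such a bisection over the Go toolchain can be done; the endpoints and paths below are illustrative, and the toolchain has to be rebuilt at every step:

# From a clone of the Go repository (https://go.googlesource.com/go):
git bisect start
git bisect bad master            # tip, showing the higher CPU usage
git bisect good go1.15.7         # known-good baseline
# At each commit bisect selects, rebuild the toolchain, rerun the benchmark
# with the freshly built go binary, and watch CPU usage (e.g. in top):
cd src && ./make.bash && cd ..
GOBISECT=$PWD/bin/go
(cd /path/to/go-json-bench && "$GOBISECT" test -run=NONE -bench . -benchmem)
git bisect good                  # or "git bisect bad", depending on the result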

@cherrymui changed the title from "[ARM64]runtime: parallel json benchmark leads high CPU usage when go version > 1.15" to "runtime: parallel json benchmark leads high CPU usage when go version > 1.15 on ARM64" on May 8, 2023
@gopherbot added the compiler/runtime label on May 8, 2023
@cherrymui
Member

Could you show the benchmark results using benchstat format? Does the benchmark execution time change? Thanks.

@cherrymui added the WaitingForInfo label on May 8, 2023
@botieking98
Author

botieking98 commented May 9, 2023

> Could you show the benchmark results using benchstat format? Does the benchmark execution time change? Thanks.

The benchmark results for go 1.15.7:

go test -count=10 -run=NONE -bench . -benchmem
goos: linux
goarch: arm64
pkg: github.com/kpango/go-json-bench
BenchmarkJSONUnmarshalParallel-128       1281096               833 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1486346               778 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1454241               719 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1371498               882 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1004 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128        726432              1708 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1335273               923 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1290433               938 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1275134               944 ns/op             896 B/op          9 allocs/op
BenchmarkJSONUnmarshalParallel-128       1136780               909 ns/op             896 B/op          9 allocs/op
BenchmarkJSONDecodeParallel-128           988424              1171 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1118308              1120 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1030258              1236 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1040060              1128 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128           994029              1089 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1099704              1137 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1095230              1121 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1045125              1048 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1054767              1144 ns/op            1080 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1021688              1063 ns/op            1080 B/op         10 allocs/op

The benchmark results for the latest go version:

go test -count=10 -run=NONE -bench . -benchmem
goos: linux
goarch: arm64
pkg: github.com/kpango/go-json-bench
BenchmarkJSONUnmarshalParallel-128        814312              3953 ns/op             913 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1026 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1221180               993.7 ns/op           912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1024 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1220218               986.9 ns/op           912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1021 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1000000              1006 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1205456              1022 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1232067              1000 ns/op             912 B/op         10 allocs/op
BenchmarkJSONUnmarshalParallel-128       1204812               958.3 ns/op           912 B/op         10 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1261 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1210 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1220 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1235 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1271 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1278 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1265 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1260 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1247 ns/op            1096 B/op         11 allocs/op
BenchmarkJSONDecodeParallel-128          1000000              1262 ns/op            1096 B/op         11 allocs/op

The benchmark execution time change, compared using benchstat (the invocation is sketched after the tables):

goos: linux
goarch: arm64
pkg: github.com/kpango/go-json-bench
                          │ go-1.15.7.txt │            go-latest.txt             │
                          │    sec/op     │    sec/op     vs base                │
JSONUnmarshalParallel-128    916.0n ± 15%   1013.5n ± 3%  +10.64% (p=0.004 n=10)
JSONDecodeParallel-128       1.125µ ±  5%    1.261µ ± 3%  +12.09% (p=0.000 n=10)
geomean                      1.015µ          1.130µ       +11.37%

                          │ go-1.15.7.txt │            go-latest.txt            │
                          │     B/op      │     B/op      vs base               │
JSONUnmarshalParallel-128      896.0 ± 0%     912.0 ± 0%  +1.79% (p=0.000 n=10)
JSONDecodeParallel-128       1.055Ki ± 0%   1.070Ki ± 0%  +1.48% (p=0.000 n=10)
geomean                        983.7          999.8       +1.63%

                          │ go-1.15.7.txt │            go-latest.txt            │
                          │   allocs/op   │  allocs/op   vs base                │
JSONUnmarshalParallel-128      9.000 ± 0%   10.000 ± 0%  +11.11% (p=0.000 n=10)
JSONDecodeParallel-128         10.00 ± 0%    11.00 ± 0%  +10.00% (p=0.000 n=10)
geomean                        9.487         10.49       +10.55%
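
For anyone reproducing this comparison: assuming the two benchmark outputs were saved as go-1.15.7.txt and go-latest.txt (as in the commands above), the tables can be generated roughly like this:

# Install benchstat if it is not already available:
go install golang.org/x/perf/cmd/benchstat@latest
# Compare the two saved benchmark outputs:
benchstat go-1.15.7.txt go-latest.txt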

@botieking98
Author

Can someone help explain this issue?

@cherrymui added the NeedsInvestigation label and removed the WaitingForInfo label on May 11, 2023
@cherrymui added this to the Backlog milestone on May 11, 2023
@cherrymui
Member

cc @golang/arm @golang/compiler

@erifan

erifan commented May 15, 2023

My test results are slightly different from yours.

  1. Whether these benchmarks are run with go1.15 or tip go, the CPU utilization is very high. The CPU utilization of tip go is about 50% higher than that of go1.15, which is consistent with what you observed.
    go1.15: 1697%
    tip go: 2294%

  2. The performance of tip go is better than go1.15.

# benchstat go115 tip
goos: linux
goarch: arm64
                         │    go115    │                 tip                  │
                         │   sec/op    │    sec/op      vs base               │
JSONUnmarshal-96           4.810µ ± 3%   3.613µ ±   4%  -24.89% (p=0.002 n=6)
JSONDecode-96              4.265µ ± 1%   3.894µ ±   4%   -8.70% (p=0.002 n=6)
JSONUnmarshalParallel-96   2.294µ ± 5%   1.118µ ± 130%        ~ (p=0.065 n=6)
JSONDecodeParallel-96      1.268µ ± 4%   1.281µ ±   4%        ~ (p=0.310 n=6)
geomean                    2.779µ        2.118µ         -23.79%

                         │    go115     │                 tip                 │
                         │     B/op     │     B/op      vs base               │
JSONUnmarshal-96            2432.0 ± 0%     912.0 ± 0%  -62.50% (p=0.002 n=6)
JSONDecode-96              1.055Ki ± 0%   1.070Ki ± 0%   +1.48% (p=0.002 n=6)
JSONUnmarshalParallel-96    2432.0 ± 0%     912.0 ± 0%  -62.50% (p=0.002 n=6)
JSONDecodeParallel-96      1.055Ki ± 0%   1.070Ki ± 0%   +1.48% (p=0.002 n=6)
geomean                    1.583Ki          999.8       -38.31%

                         │   go115    │                 tip                 │
                         │ allocs/op  │ allocs/op   vs base                 │
JSONUnmarshal-96           10.00 ± 0%   10.00 ± 0%        ~ (p=1.000 n=6) ¹
JSONDecode-96              10.00 ± 0%   11.00 ± 0%  +10.00% (p=0.002 n=6)
JSONUnmarshalParallel-96   10.00 ± 0%   10.00 ± 0%        ~ (p=1.000 n=6) ¹
JSONDecodeParallel-96      10.00 ± 0%   11.00 ± 0%  +10.00% (p=0.002 n=6)
geomean                    10.00        10.49        +4.88%
¹ all samples are equal

I didn't dig into why the CPU utilization is so high, but the test data looks reasonable, doesn't it?

My test machine is a bit different from yours.

Architecture:           aarch64
  CPU op-mode(s):       64-bit
  Byte Order:           Little Endian
CPU(s):                 96
  On-line CPU(s) list:  0-95
Vendor ID:              HiSilicon
  Model name:           Kunpeng-920
    Model:              0
    Thread(s) per core: 1
    Core(s) per socket: 48
    Socket(s):          2
    Stepping:           0x1
    BogoMIPS:           200.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Caches (sum of all):
  L1d:                  6 MiB (96 instances)
  L1i:                  6 MiB (96 instances)
  L2:                   48 MiB (96 instances)
  L3:                   96 MiB (4 instances)
NUMA:
  NUMA node(s):         4
  NUMA node0 CPU(s):    0-23
  NUMA node1 CPU(s):    24-47
  NUMA node2 CPU(s):    48-71
  NUMA node3 CPU(s):    72-95
Vulnerabilities:
  Itlb multihit:        Not affected
  L1tf:                 Not affected
  Mds:                  Not affected
  Meltdown:             Not affected
  Mmio stale data:      Not affected
  Retbleed:             Not affected
  Spec store bypass:    Not affected
  Spectre v1:           Mitigation; __user pointer sanitization
  Spectre v2:           Not affected
  Srbds:                Not affected
  Tsx async abort:      Not affected

@botieking98
Author

> My test results are slightly different from yours. […]

Maybe you can try testing again with a git checkout of the commit just before "runtime: add consistent heap statistics". The result is that CPU utilization is lower, but performance is almost the same as tip go.
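
A sketch of that check, assuming the commit referred to here is the one found by the bisect above (fe7ff71185cf) and that the Go repository and go-json-bench are checked out side by side:

# Build the toolchain at the parent of the suspect commit:
cd go
git checkout fe7ff71185cf~1
cd src && ./make.bash && cd ../..
# Rerun the benchmark with the freshly built toolchain and compare CPU usage:
cd go-json-bench
../go/bin/go test -count=10 -run=NONE -bench . -benchmem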

@botieking98
Author

ping @erifan
