
runtime: CL 501976 performance and jitter regression #65647

Open
JetSetIlly opened this issue Feb 10, 2024 · 38 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@JetSetIlly

JetSetIlly commented Feb 10, 2024

Go version

go version go1.22.0 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/steve/.cache/go-build'
GOENV='/home/steve/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/steve/Go/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/steve/Go/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/steve/Go/go1.22.0/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/home/steve/Go/go1.22.0/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.0'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build502182501=/tmp/go-build -gno-record-gcc-switches'

What did you do?

I am comparing the performance of my project when compiled with Go 1.22.0 and when compiled with Go 1.21.7.

The full instructions for my testing procedure, the resulting CPU profiles, and the test program have been placed in a specially created GitHub repository.

What did you see happen?

The performance of my project has dropped significantly with v1.22.0. Using my project's performance metric, Go v1.21.7 achieves 146.28fps while Go v1.22.0 achieves 133.48fps. This is a drop of 8.75%.

What did you expect to see?

I expected either no significant change or an increase in performance. This method of measurement has been used for many years and has shown increases as expected when new Go versions have been released. For example, the introduction of profile-guided optimisation showed the expected increase in performance.

@Jorropo
Member

Jorropo commented Feb 10, 2024

@randall77
Contributor

I need some more specific instructions to reproduce.
So far:

git clone https://github.com/JetSetIlly/Gopher2600
cd Gopher2600
git checkout d1482c139ab9c78b18902426e87367a58cf8b101
~/go1.22.0/bin/go build

That last build produces some strange results:

go: downloading github.com/pkg/term v1.1.0
go: downloading github.com/go-gl/gl v0.0.0-20211210172815-726fda9656d6
go: downloading github.com/inkyblackness/imgui-go/v4 v4.7.0
go: downloading golang.org/x/sys v0.8.0
go: downloading github.com/sahilm/fuzzy v0.1.0
go: downloading github.com/veandco/go-sdl2 v0.4.35
go: downloading golang.org/x/image v0.13.0
go: downloading github.com/go-audio/wav v1.1.0
go: downloading github.com/hajimehoshi/go-mp3 v0.3.4
go: downloading github.com/go-audio/audio v1.0.0
go: downloading github.com/go-audio/riff v1.0.0
# github.com/go-gl/gl/v3.2-core/gl
# [pkg-config --cflags  -- gl gl]
Package gl was not found in the pkg-config search path.
Perhaps you should add the directory containing `gl.pc'
to the PKG_CONFIG_PATH environment variable
Package 'gl', required by 'virtual:world', not found
Package 'gl', required by 'virtual:world', not found
# github.com/veandco/go-sdl2/sdl
# [pkg-config --cflags  -- sdl2]
Package sdl2 was not found in the pkg-config search path.
Perhaps you should add the directory containing `sdl2.pc'
to the PKG_CONFIG_PATH environment variable
Package 'sdl2', required by 'virtual:world', not found
# github.com/inkyblackness/imgui-go/v4
In file included from wrapper.cpp:7:
../../../go/pkg/mod/github.com/inkyblackness/imgui-go/v4@v4.7.0/imgui/imgui_draw.cpp: In function ‘void ImFontAtlasBuildPackCustomRects(ImFontAtlas*, void*)’:
../../../go/pkg/mod/github.com/inkyblackness/imgui-go/v4@v4.7.0/imgui/imgui_draw.cpp:2623:11: warning: ‘void* memset(void*, int, size_t)’ specified size between 18446744071562067968 and 18446744073709551615 exceeds maximum object size 9223372036854775807 [-Wstringop-overflow=]
 2623 |     memset(pack_rects.Data, 0, (size_t)pack_rects.size_in_bytes());
      |     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Is there some PKG_CONFIG_PATH thing I need to set?
Is there some specific C compiler I need?
I also don't see a Cosmcark.bin file anywhere; do I need that to run it?

@JetSetIlly
Author

JetSetIlly commented Feb 10, 2024

I need some more specific instructions to reproduce. So far:

git clone https://github.com/JetSetIlly/Gopher2600
cd Gopher2600
git checkout d1482c139ab9c78b18902426e87367a58cf8b101
~/go1.22.0/bin/go build

That last build produces some strange results:

The full project uses third-party C libraries. They aren't used when running the performance measurement but are needed for compilation. (The libraries are used for the GUI but we don't need a GUI when doing measurement)

I'll prepare a minimal project tree that only does the performance measurement so as to make it easier to compile and test.

Is there some PKG_CONFIG_PATH thing I need to set? Is there some specific C compiler I need? I also don't see a Cosmcark.bin file anywhere; do I need that to run it?

The Cosmcark.bin file is in https://github.com/JetSetIlly/gopher2600_performance_profiles

@Jorropo
Member

Jorropo commented Feb 10, 2024

@JetSetIlly since you already have everything set up, can you benchmark GOEXPERIMENT=loopvar with go1.21.7 please?

@JetSetIlly
Author

JetSetIlly commented Feb 10, 2024

@randall77 If you clone this project now and run 'go build .' it should compile without requiring any external C libraries (i.e. no SDL or OpenGL). There's also no need to worry about the Cosmcark.bin file.

https://github.com/JetSetIlly/gopher2600_performance_profiles

The measurements are slightly different to the previous results, but the point is that the measurement for 1.22.0 is consistently and significantly lower than for 1.21.7.

@Jorropo I tried that and the results are the same/similar to 1.21.7.

@Jorropo

This comment was marked as outdated.

@JetSetIlly

This comment was marked as outdated.

@ianlancetaylor ianlancetaylor changed the title Go v1.22.0 Performance Impact runtime: Go v1.22.0 Performance Impact Feb 11, 2024
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 11, 2024
@randall77
Contributor

I am not seeing any significant performance differences between 1.21.6 and 1.22.0 (and tip, for that matter).
There's quite a bit of noise in this benchmark, so it is hard to be sure.

1.21.6

238.56 fps (71567 frames in 300.00 seconds) 397.6%
228.48 fps (68544 frames in 300.00 seconds) 380.8%
236.91 fps (71072 frames in 300.00 seconds) 394.8%

1.22.0

232.57 fps (69771 frames in 300.00 seconds) 387.6%
224.25 fps (67274 frames in 300.00 seconds) 373.7%
234.01 fps (70203 frames in 300.00 seconds) 390.0%

model name : 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

@JetSetIlly
Author

@randall77 Thanks for looking at this.

If we take an average of the runs for 1.21.6 and 1.22.0, there's a drop of 1.86% in performance. I would say that's significant; at least, I've never seen drops in performance like this between Go versions before. I would agree that it's hard to get consistent measurements, but the drop between the two versions is consistent.

On my hardware the drop is a lot larger (Intel(R) Core(TM) i3-3225 CPU @ 3.30GHz) as previously stated.

The test runs for 5 minutes but it doesn't need to run that long. On line 22 of main.go, there's a string specifying the duration. It's currently "5m". You can always lower that if it makes it easier.

If there's anything I can do to reduce the benchmark noise let me know.

@randall77
Contributor

If we take an average of the runs for 1.21.6 and 1.22.0 then there's a drop of 1.86% in performance.

Sure, but the run-to-run variance can be as much as 10%. It's not clear to me that the 1.86% drop is real. We'd need a lot more runs to have any confidence that the 1.86% drop isn't just noise.

On my hardware the drop is a lot larger (Intel(R) Core(TM) i3-3225 CPU @ 3.30GHz) as previously stated.

This is a possible cause. That chip is quite a bit older than the one I'm using. (~12 years vs ~4 years old.)
One possibility, though I think unlikely, is that your chip is missing some instruction (POPCNT?) that we started using more in 1.22. What does /proc/cpuinfo list for your flags?
Maybe there is something that got a bit larger and no longer fits in the small cache of that chip? Again, probably unlikely.
It would help if you could try on a more recent chip to see if that illuminates anything.

If there's anything I can do to reduce the benchmark noise let me know.

It would help a lot with investigations if we could get the benchmark variance down. Can you make any progress on that front? The benchmark appears single-threaded and doesn't allocate much, so I'm kind of confused as to where the variance might be coming from. (It is even possible the source of variance and the source of slowdown are related somehow.)

@JetSetIlly
Author

If we take an average of the runs for 1.21.6 and 1.22.0 then there's a drop of 1.86% in performance.

Sure, but the run-to-run variance can be as much as 10%. It's not clear to me that the 1.86% drop is real. We'd need a lot more runs to have any confidence that the 1.86% drop isn't just noise.

I understand, but I'm confident the drop is real. My view of the code is very different to yours, of course, so I'm perhaps more sensitive to the variation. I'll do some more runs here to assure myself that there is a real effect.

On my hardware the drop is a lot larger (Intel(R) Core(TM) i3-3225 CPU @ 3.30GHz) as previously stated.

This is a possible cause. That chip is quite a bit older than the one I'm using. (~12 years vs ~4 years old.) One possibility, though I think unlikely, is that your chip is missing some instruction (POPCNT?) that we started using more in 1.22. What does /proc/cpuinfo list for your flags? Maybe there is something that got a bit larger and no longer fits in the small cache of that chip? Again, probably unlikely. It would help if you could try on a more recent chip to see if that illuminates anything.

This CPU does have POPCNT

It would help a lot with investigations if we could get the benchmark variance down. Can you make any progress on that front? The benchmark appears single-threaded and doesn't allocate much, so I'm kind of confused as to where the variance might be coming from. (It is even possible the source of variance and the source of slowdown are related somehow.)

This is interesting. If I run with GOMAXPROCS=1 then the variation between the versions goes away.

I'll do a lot more runs/tests here, but with a small sample I can see very clearly that with v1.22.0, GOMAXPROCS=1 performs the same as 1.21.7, while with GOMAXPROCS unset the performance is lower by an average of 6%.

With 1.21.7, the performance is the same regardless of the GOMAXPROCS value.

@JetSetIlly
Author

I've updated the test program to output a rating every second, running for a total of one minute. This gives a clearer picture of what's happening I think.

An example run of the four variations shows that 1.21.7 is consistent regardless of GOMAXPROCS, but that with 1.22.0 there is a significant difference.

Table of results
Version 1.22.0 1.22.0 1.21.7 1.21.7
GOMAXPROCS n 1 n 1
  145.88 146.46 148.85 146.91
  139.97 147.01 147.00 145.41
  147.05 145.42 148.03 146.32
  145.04 143.63 147.07 145.69
  140.05 144.52 146.92 145.16
  139.92 144.27 148.08 144.53
  140.06 144.93 147.94 145.33
  132.00 144.81 147.08 145.67
  137.95 144.80 147.99 144.99
  130.95 145.36 145.00 146.34
  139.11 144.51 145.91 145.73
  133.88 145.58 146.10 146.44
  137.07 145.16 145.87 146.41
  138.03 144.43 147.05 145.58
  129.89 144.84 146.05 146.00
  139.03 145.61 144.93 146.09
  137.08 144.71 146.07 144.74
  136.98 145.19 144.92 145.42
  137.90 143.95 146.05 146.17
  122.04 145.56 146.07 145.94
  130.03 144.45 144.91 143.91
  132.99 143.73 145.07 142.98
  143.96 143.31 144.94 145.41
  138.06 144.21 145.96 145.16
  137.95 144.24 147.05 144.81
  134.96 143.93 146.04 146.34
  143.07 145.27 147.01 146.04
  138.05 145.33 145.01 145.51
  133.92 143.37 144.94 145.98
  143.03 143.73 145.04 145.94
  144.08 145.12 146.01 145.25
  144.91 145.42 145.95 146.55
  130.02 144.66 145.98 143.76
  144.06 141.42 144.04 144.31
  132.89 144.09 145.91 145.56
  141.04 143.29 144.07 143.15
  140.00 145.02 147.05 145.38
  141.07 144.05 143.91 145.87
  140.90 145.04 145.04 145.07
  139.04 143.96 144.95 139.06
  140.06 145.04 146.07 147.02
  138.90 145.05 145.90 148.02
  147.06 143.95 145.08 147.06
  146.92 141.06 146.03 147.01
  144.07 145.04 146.02 147.06
  133.06 143.94 144.92 147.05
  138.00 144.39 146.05 148.02
  137.95 144.96 146.03 147.00
  138.90 145.06 145.94 147.04
  133.02 144.05 145.93 147.01
  138.11 144.98 145.00 147.99
  146.94 145.00 145.04 144.05
  140.05 145.06 144.07 147.05
  137.99 143.95 145.01 147.01
  144.00 143.08 145.91 147.03
  138.93 145.05 146.06 148.05
  146.05 144.05 146.00 148.02
  146.93 143.95 144.92 147.05
  145.06 144.04 145.12 147.05
  145.03 144.05 144.95 147.02
  137.91 143.96 145.94 147.06
  128.04 145.05 140.11 145.99
Average 138.98 144.50 145.77 145.91
Min/Max Difference 25.02 5.95 8.74 8.99

@qiulaidongfeng
Contributor

Old test program
go1.21.7

418.02 fps (125405 frames in 300.00 seconds) 696.7%

go1.22.0

408.97 fps (122692 frames in 300.00 seconds) 681.6%

My machine is R7 7840hs and GOAMD64=v3.

@qiulaidongfeng
Contributor

qiulaidongfeng commented Feb 11, 2024

@JetSetIlly
I tried the new test program and, after getting the output, manually rewrote it into the format accepted by benchstat.

go122.txt
go121.txt

I got this

benchstat go121.txt go122.txt
     │ go121.txt  │             go122.txt             │
     │   fps/op   │   fps/op    vs base               │
Test   417.8 ± 0%   410.8 ± 1%  -1.67% (p=0.000 n=10)

@JetSetIlly
Author

JetSetIlly commented Feb 11, 2024

@qiulaidongfeng Nice! I didn't know about benchstat. These are my results

     │ benchstat_1.27.1_maxgoproc_n.txt │ benchstat_1.22.0_maxgoproc_n.txt  │
     │              fps/op              │   fps/op    vs base               │
Test                         145.9 ± 1%   139.0 ± 1%  -4.77% (p=0.000 n=62)

So, at first blush, it looks like the variation is more pronounced with a less powerful CPU.

And when GOMAXPROCS is set to 1 (on the same hardware):

     │ benchstat_1.27.1_maxgoproc_1.txt │ benchstat_1.22.0_maxgoproc_1.txt  │
     │              fps/op              │   fps/op    vs base               │
Test                         146.0 ± 0%   144.6 ± 0%  -0.96% (p=0.000 n=62)

@ouvaa

ouvaa commented Feb 11, 2024

i want to say thanks for golang 1.22, i see some significant performance improvement, notably in memory usage. though CPU seemed to be a little up but memory management wise seemed a great improvement. to me, the consistent memory is more important than the cpu for now. this is super good, making the software extremely stable. thank you all for the hard work.

by the way, please open up a discussion for people to praise golang especially when it comes to some optimization as good as this. the open source community can really have a place to show our appreciation to the hard work you guys have done.

it's impressive, it's great and software dev has never been this nice since golang's development. the trajectory is extremely future optimistic. i wish it can be done on all devices (e.g. mobile / iot) and ai; and this will be the only goto language for everyone.

the founders who thought on setting the foundation for this language is truly amazing. i've never done a language more productive, manageable, scalable and maintainable than golang. been a user since 1.6 and great to see it becoming more set with the module dev management and stuff.

@JetSetIlly
Author

JetSetIlly commented Feb 11, 2024

@ouvaa You've made me very sad. In case it wasn't obvious, I'm not complaining about the development of Go. I hope I'm helping to improve it by identifying an area that has impacted my use case. If it's impacted me then it is likely impacting other people.

edit: I also hope my reply there didn't come across as too confrontational. But I'm feeling a bit sensitive because the future of a project that I've spent a lot of time on is now uncertain. Love and Peace ❤️

(and needless to say I fully agree with your recognition of the Go team's effort)

@qiulaidongfeng
Contributor

I tried GOEXPERIMENT=noallocheaders: no change in performance.
I tried GOEXPERIMENT=newinliner: performance (413.28 fps) is slightly improved, but fps is still less than go1.21.7 (419 fps).

@qiulaidongfeng
Contributor

qiulaidongfeng commented Feb 12, 2024

@JetSetIlly Can you have your test program output the format that benchstat accepts, and then use benchstat and git bisect (good=go1.21.0, bad=go1.22.0)?
Run the test at least 10 times with go1.21 to get baseline results.
Start git bisect, run $GOROOT/src/make.bash on each new commit, and then use the resulting toolchain to run the tests at least 10 times.
Use benchstat to determine whether performance has decreased compared to go1.21. Run git bisect bad if it has, or git bisect good if it hasn't, and you should be able to find when the performance started to drop and the commit that caused the drop.

@JetSetIlly
Author

@JetSetIlly Can you have your test program output the format that benchstat accepts, and then use benchstat and git bisect (good=go1.21.0 bad=go1.22.0) ? Run the test at least 10 times with go1.21 to get the results. Start with git bisect, run $GOROOT/src/make.bash on each new commit, and then use the obtained toolchain to run the tests at least 10 times. Use benchstat to determine whether performance has decreased compared to go1.21. Run git bisect bad if it goes down, or git bisect good if it doesn't, and you might be able to find when the performance started to drop and the commit that caused the drop.

I can do that. Just to clarify the number of tests: if I run each test for one minute and output data every second, that's 60 data points. Each of those data points represents millions of iterations of the emulation. Not every code path will be hit millions of times, but certainly in the hundreds of thousands.

So each benchstat file will contain 60 datapoints.

I usually use all.bash to compile the Go toolchain, but for these tests I'll use make.bash because it's quicker. I'm sure it won't make a difference, but I'll note that I'll be using Go 1.20.5 to compile each toolchain.

@qiulaidongfeng
Contributor

qiulaidongfeng commented Feb 12, 2024

See https://pkg.go.dev/testing#hdr-Benchmarks for the format benchstat accepts. The result usually represents the average time per loop for executing the same piece of code (time, or another comparable unit such as fps, is acceptable).

@JetSetIlly
Author

JetSetIlly commented Feb 12, 2024

See https://pkg.go.dev/testing#hdr-Benchmarks for the format benchstat accepts. The result usually represents the average time per loop for executing the same piece of code (time, or another comparable unit such as fps, is acceptable).

I understand. I've turned the data around so that the iterator column shows the number of frames generated in a second and the value column shows the number of milliseconds per frame.

I've run the bisect and found ad94306 to be the first "bad" commit.

benchstat_devel go1.22-ad943066f6 Thu Jul 20 21:39:57 2023 +0000.txt

The last "good" commit is dd5db4d

benchstat_devel go1.22-dd5db4df56 Thu Jul 20 21:39:53 2023 +0000.txt

To double check, I compared 1.21.7 with the last "good" commit dd5db4d and it shows no variation.

benchstat_go1.21.7.txt

Comparing 1.21.7 with ad94306 (the first "bad" commit) does show a variation.

@JetSetIlly
Author

JetSetIlly commented Feb 12, 2024

ad94306 was a simple enough change, so I've removed that addition from gotip; performance is now close enough to go1.21.7 to be unnoticeable. For certain, the "jitter" in the measurements has gone, which was the cause of the perceived drop in performance.

@randall77
Contributor

@prattmic

@prattmic
Member

@JetSetIlly Are you seeing differences without PGO? I've run your reproducer on my machine (thanks for making it easy to use!), and this is what I see:

     │ go1.21.7.nopgo.txt │       go1.21.7.pgo.txt       │       go1.22.0.nopgo.txt        │       go1.22.0.pgo.txt       │     go1.22.0.newpgo.txt      │
     │       ms/op        │   ms/op     vs base          │   ms/op     vs base             │   ms/op     vs base          │   ms/op     vs base          │
Test           4.930 ± 1%   4.500 ± 0%  -8.72% (n=61+62)   4.950 ± 1%  ~ (p=0.612 n=61+62)   4.650 ± 1%  -5.68% (n=61+62)   4.460 ± 1%  -9.53% (n=61+62)

i.e., with PGO disabled (-pgo=off), there is no change in performance. With PGO enabled (the default) using the checked-in default.pgo, I see a 3% regression. But if I collect a new profile with the 1.22.0 binary (newpgo), the performance recovers (it's slightly better, actually).

Here is the same comparison compared to the 1.21.7 with PGO baseline:

     │ go1.21.7.pgo.txt │         go1.22.0.pgo.txt          │        go1.22.0.newpgo.txt        │
     │      ms/op       │   ms/op     vs base               │   ms/op     vs base               │
Test         4.500 ± 0%   4.650 ± 1%  +3.33% (p=0.000 n=62)   4.460 ± 1%  -0.89% (p=0.000 n=62)

This seems in line with https://go.dev/doc/pgo#source-stability: when upgrading from 1.21 to 1.22, a lot of code in the standard library has changed, so some portions of the PGO profile likely no longer apply, and we see a slight degradation in performance that recovers once we collect a fresh profile.

If you are seeing differences without PGO, then there may be more to investigate.

@JetSetIlly
Author

@prattmic Yes. I am seeing differences without PGO

     │ benchstat_go1.21.7_nopgo.txt │   benchstat_go1.22.0_nopgo.txt    │
     │            ms/op             │   ms/op     vs base               │
Test                     7.300 ± 0%   7.690 ± 3%  +5.34% (p=0.000 n=62)

The +5.34% is important, but the ±3% figure is perhaps more important: it shows the large variation between data points. This means that the time it takes to generate a frame is less predictable.

This is the result of comparing the two versions with PGO enabled (new PGO file generated with v1.22.0)

     │ benchstat_go1.21.7.txt │      benchstat_go1.22.0.txt       │
     │         ms/op          │   ms/op     vs base               │
Test               6.760 ± 0%   7.040 ± 4%  +4.14% (p=0.000 n=62)

@thanm thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 12, 2024
@JetSetIlly
Author

JetSetIlly commented Feb 12, 2024

A note on my comment about the time taken to generate a frame being "less predictable": this isn't a problem when I impose a realistic FPS cap (i.e. if I cap generation at 60fps then the benchmarks are equally good for 1.21.7 and 1.22.0). The differences only appear when the program is putting the CPU under intense load.

@prattmic
Member

Interesting, that is quite different from what I am seeing. Do the regression and variance disappear when you revert https://go.dev/cl/501976?

Could you share runtime execution traces from these runs, both with and without https://go.dev/cl/501976?

@JetSetIlly
Author

Interesting, that is quite different from what I am seeing. Do the regression and variance disappear when you revert https://go.dev/cl/501976?

Without PGO:

     │ benchstat_go1.22.0.txt │ benchstat_go1.22.0_reversion.txt  │
     │         ms/op          │   ms/op     vs base               │
Test               7.630 ± 4%   7.300 ± 0%  -4.33% (p=0.000 n=62)

Running with (newly created) PGO:

     │ benchstat_go1.22.0.txt │ benchstat_go1.22.0_reversion.txt  │
     │         ms/op          │   ms/op     vs base               │
Test               6.940 ± 3%   6.710 ± 0%  -3.31% (p=0.000 n=62)

The combination of a new PGO file with the commit reversion gives excellent performance.

Could you share runtime execution traces from these runs, both with and without https://go.dev/cl/501976?

nopgo_profiles.zip

pgo_profiles.zip

@prattmic
Member

Ah, by execution traces, I meant traces from runtime/trace, not CPU profiles. Sorry for the confusion.

@JetSetIlly
Author

Sorry. My mistake.

pgo_traces.zip

nopgo_traces.zip

@JetSetIlly
Author

Sometimes I think it's better to view data visually. This chart shows the difference very clearly, I think.

1.22.0 has lots of jitter.
1.21.7 and 1.22.0_reversion are smoother, with 1.22.0_reversion seeming the smoothest.
1.22.0_reversion has a higher overall fps.

https://docs.google.com/spreadsheets/d/1DUTT7dHXPRFo5nf4-bhiNk2aHkd7MP4gtx41jWPtAUA/edit?usp=sharing

@prattmic prattmic changed the title runtime: Go v1.22.0 Performance Impact runtime: CL 501976 performance and jitter regression Feb 13, 2024
@prattmic
Member

Thanks for the traces. Here is what a 5s region from each looks like.

Without https://go.dev/cl/501976: [execution trace screenshot]

With https://go.dev/cl/501976: [execution trace screenshot]

In the first case, we can see that work tended to stay on the same thread for almost exactly 1s at a time (I'm actually not sure why it is quite so consistent). With the CL applied, it is much less consistent, and the average time spent in one place is noticeably lower.

These migrations are described in #65694. Gosched puts the G on the global run queue, so it makes sense that it is more likely to move to a different P/M after Gosched. An explicit wakep in Gosched will ensure that there is another M awake, which will be racing to run the G.

This is particularly unfortunate in your case, since there are idle Ps and thus no need to preempt at all. Last year, @felixge proposed https://go.dev/cl/460541 to avoid unnecessary preemption when there are idle Ps. If you can apply that CL (it might not apply cleanly anymore), I'd be curious to hear the impact. I suspect it would eliminate your jitter and regression.

@JetSetIlly
Author

@prattmic Thanks for the feedback. I'll take a look more closely later this evening. In the meantime, I've updated the testing tool so it's easier to use (command line arguments etc.)

I've also run the three versions of interest for 5 minutes (as opposed to 1 minute). We're probably past this high level analysis by now but I think it's interesting.

Benchstat results

     │ benchstat_go1.21.7.txt │       benchstat_go1.22.0.txt       │ benchstat_go1.22.0_reversion.txt │
     │         ms/op          │   ms/op     vs base                │     ms/op      vs base           │
Test               6.750 ± 0%   7.090 ± 1%  +5.04% (p=0.000 n=302)      6.760 ± 0%  ~ (p=0.167 n=302)

The large average drop is visible but the size of the jitter isn't obvious from these figures (at least to me) so I've prepared another graph on Google Sheets.

https://docs.google.com/spreadsheets/d/1DUTT7dHXPRFo5nf4-bhiNk2aHkd7MP4gtx41jWPtAUA/edit#gid=731389959

Cheers

@JetSetIlly
Author

JetSetIlly commented Feb 13, 2024

P.S. If you can apply that CL (it might not apply cleanly anymore), I'd be curious to hear the impact. I suspect it would eliminate your jitter and regression.

I forked from v1.22.0 and applied 460541. However, there is no change from v1.22.0

     │ benchstat_go1.21.7.txt │       benchstat_go1.22.0.txt       │   benchstat_go1.22.0_460541.txt    │
     │         ms/op          │   ms/op     vs base                │   ms/op     vs base                │
Test               6.750 ± 0%   7.090 ± 1%  +5.04% (p=0.000 n=302)   7.090 ± 1%  +5.04% (p=0.000 n=302)

I've run it twice and double-checked the patching/compilation process but definitely no effect.

This is the trace file
go1.22.0_460541_trace.zip

I also applied 460541 to a fork with the previously applied reversion. In that case there's no difference from the fork with the reversion only.

@JetSetIlly
Author

This change to 1.22.0 produces good results:

index 061673150f..2b0994b2c1 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -4058,7 +4058,7 @@ func goschedImpl(gp *g, preempted bool) {
        globrunqput(gp)
        unlock(&sched.lock)
 
-       if mainStarted {
+       if mainStarted && !preempted {
                wakep()
        }
 

@mknyszek mknyszek added this to the Backlog milestone Feb 14, 2024
@prattmic
Member

prattmic commented Feb 14, 2024

Huh, interesting that https://go.dev/cl/460541 doesn't resolve the regression. It does resolve the more frequent thread migrations, with the trace once again looking like the "good" case:

[execution trace screenshot]

That leaves me a bit confused about where the regression is coming from. wakep wakes another thread, but if that thread doesn't steal our work, it should have a pretty minimal impact.

@JetSetIlly
Author

@prattmic 460541 definitely helps in some way. With that patch the drops in performance are no longer as severe as they are in v1.22.0. That said, there is still a general overall drop.

No branches or pull requests

10 participants