runtime: significant performance regression #67585

Closed
aclements opened this issue May 22, 2024 · 26 comments
Labels: compiler/runtime, release-blocker
Milestone: Go1.23

@aclements (Member) commented May 22, 2024

According to the performance dashboard, there was a significant regression across many benchmarks somewhere between d68d485 (good) and 647870b (bad). This was almost certainly from one of the commits in this CL stack, but I'm going to start up a bisect to pin it down. The regression is noticeable across more benchmarks on the 88 core machine than the 16 core machine, suggesting it may be a scalability issue.

cc @mknyszek @rhysh

@aclements aclements added this to the Go1.23 milestone May 22, 2024
@aclements aclements self-assigned this May 22, 2024
@gopherbot added the compiler/runtime label May 22, 2024
@rhysh (Contributor) commented May 22, 2024

This may have been fixed in https://go.dev/cl/586796 — it looks like the benchmarks haven't run on a commit that includes that change.

Starting with https://go.dev/cl/585638 and ending with https://go.dev/cl/586796, unlock2 included an unconditional call to nanotime (regardless of whether the mutex was contended, and at a 100% sample rate), which caused enough of a performance regression on the android-386-emu builder to trigger a test timeout.
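
To make the cost concrete, here is a minimal sketch of the two patterns (stand-in names and a stubbed clock; not the actual runtime code):

package mutexsketch

import (
	"math/rand"
	"sync/atomic"
	"time"
)

// nanotime is a stand-in for the runtime's monotonic clock.
func nanotime() int64 { return time.Now().UnixNano() }

// key models mutex.key: 0 = unlocked, 1 = locked, >1 = locked with waiters.
type mutexSketch struct{ key atomic.Int64 }

// unlockAlwaysTimed models the regressed window (CL 585638 through
// CL 586796): the clock is read on every unlock, contended or not.
func (l *mutexSketch) unlockAlwaysTimed() {
	_ = nanotime() // unconditional: a fixed extra cost on every unlock
	l.key.Store(0)
}

// unlockTimedIfContended models the fix: read the clock only when the
// mutex actually has waiters, and (with sampling) only for a fraction
// of those events.
func (l *mutexSketch) unlockTimedIfContended(sampleRate int64) {
	if l.key.Load() > 1 && sampleRate > 0 && rand.Int63n(sampleRate) == 0 {
		_ = nanotime()
	}
	l.key.Store(0)
}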

@aclements (Member Author) commented:

Indeed, I accidentally ran tip (6861b2e) as the "bad" commit initially when I was bisecting and was very confused because it didn't seem different from the "good" commit. I'll leave this issue open until we can confirm it on the perf dashboard, but I'm glad it's probably already fixed. :)

@mknyszek (Contributor) commented:

Part of the problem here is that I reduced the resources going to running the perf benchmarks, so the data is much sparser than it used to be. I'll piggyback on this issue to fix that and turn them up again.

@prattmic (Member) commented:

Looks like there is a run at d0edd9a now, which is slightly newer than your CL (8ab131f), @rhysh.

Performance is improved, but still significantly regressed. A lot of these are microbenchmarks, but https://perf.golang.org/dashboard/?benchmark=Tile38QueryLoad-88&unit=sec/op is not. I'm surprised to see such a huge regression there. That benchmark is part of sweet (https://cs.opensource.google/go/x/benchmarks/+/master:sweet/).

@prattmic (Member) commented:

Ah, Tile38QueryLoad-88 regressed, but Tile38QueryLoad-16 did not (at least not noticeably).

So this is a scalability regression, or perhaps a regression that appears only under contention.

@rhysh (Contributor) commented May 22, 2024

I'm having trouble getting a local version of the tile38 benchmarks.

Starting from the suggestions in https://cs.opensource.google/go/x/benchmarks (GOBIN=$PWD go get golang.org/x/benchmarks/... and "To run a benchmark, simply execute its binary."), I run:

$ ~/sdk/go1.20.6/bin/go version
go version go1.20.14 linux/amd64
$ (GOBIN=$PWD ~/sdk/go1.20.6/bin/go get golang.org/x/benchmarks/...)
$ (GOBIN=$PWD ~/sdk/go1.20.6/bin/go install golang.org/x/benchmarks/...)
$ ./tile38
starting server: failed to start server: exec: no command
$ ./sweet get
[sweet] Checking cache: /home/rhysh/.cache/go-sweet
[sweet] Downloading assets archive for version v0.3.0 to /home/rhysh/.cache/go-sweet/assets-v0.3.0.zip
[sweet] error: hash for version v0.3.0 not found
$ ./sweet get
[sweet] Checking cache: /home/rhysh/.cache/go-sweet
$ ./tile38
starting server: failed to start server: exec: no command

@prattmic (Member) commented:

Sorry, the top-level instructions in that repo are apparently out of date. You want the instructions from the sweet directory: https://cs.opensource.google/go/x/benchmarks/+/master:sweet/

I double-checked that those instructions work for me:

$ cd sweet
$ go build ./cmd/sweet
$ cat config.toml
[[config]]
  name = "experiment"
  goroot = "/home/prattmic/src/go"
$ ./sweet get
[sweet] Checking cache: /usr/local/google/home/mpratt/.cache/go-sweet
$ ./sweet run -run tile38 config.toml
[sweet] Work directory: /tmp/gosweet101673043
[sweet] Benchmarks: tile38 (10 runs)
[sweet] Setting up benchmark: tile38
[sweet] Running benchmark tile38 for experiment: run 1
...

benchstat results will be in the results directory. You can add multiple configs to compare (e.g., one with envexec = ["GOMAXPROCS=88"] and one with 16), though that is probably unnecessary. Add diagnostics = ["cpuprofile"] for profiles (also placed in results). See sweet help run for other options.

It is possible to run the benchmark in a more isolated way, but it is a bit complicated for tile38, since it is client/server. See how sweet builds and runs the benchmark.

@rhysh (Contributor) commented May 22, 2024

That's working, thanks!

@rhysh (Contributor) commented May 22, 2024

Comparing before the mutex profile changes through to a version that includes this morning's sampling fix (https://go.dev/cl/586796) doesn't show much change on my "20 thread" linux/amd64 test machine — "sec/op" is "+0.75%".

It's an i7-13700H, with a mix of E- and P-cores, running kernel 6.5. Here's the banner from ./all.bash:

##### Test execution environment.
# GOARCH: amd64
# CPU: 13th Gen Intel(R) Core(TM) i7-13700H
# GOOS: linux
# OS Version: Linux 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64
$ /home/rhysh/sdk/go-before/bin/go version
go version devel go1.23-d68d485405 Tue May 21 16:24:41 2024 +0000 linux/amd64
$ /home/rhysh/sdk/go-after/bin/go version
go version devel go1.23-ac6dea7aa1 Wed May 22 20:34:59 2024 +0000 linux/amd64
$ benchstat /tmp/results/{before,after}.results
                   │ /tmp/results/before.results │     /tmp/results/after.results     │
                   │           sec/op            │   sec/op     vs base               │
Tile38QueryLoad-20                   370.0µ ± 1%   372.8µ ± 0%  +0.75% (p=0.035 n=10)

                   │ /tmp/results/before.results │        /tmp/results/after.results        │
                   │      average-RSS-bytes      │ average-RSS-bytes  vs base               │
Tile38QueryLoad-20                  5.997Gi ± 1%        5.862Gi ± 1%  -2.25% (p=0.004 n=10)

                   │ /tmp/results/before.results │      /tmp/results/after.results       │
                   │       peak-RSS-bytes        │ peak-RSS-bytes  vs base               │
Tile38QueryLoad-20                  6.384Gi ± 2%     6.162Gi ± 4%  -3.48% (p=0.011 n=10)

                   │ /tmp/results/before.results │      /tmp/results/after.results      │
                   │        peak-VM-bytes        │ peak-VM-bytes  vs base               │
Tile38QueryLoad-20                  7.532Gi ± 1%    7.311Gi ± 3%  -2.93% (p=0.007 n=10)

                   │ /tmp/results/before.results │       /tmp/results/after.results       │
                   │       p50-latency-sec       │ p50-latency-sec  vs base               │
Tile38QueryLoad-20                   133.7µ ± 1%       132.5µ ± 1%  -0.91% (p=0.003 n=10)

                   │ /tmp/results/before.results │    /tmp/results/after.results     │
                   │       p90-latency-sec       │ p90-latency-sec  vs base          │
Tile38QueryLoad-20                   555.4µ ± 1%       551.2µ ± 1%  ~ (p=0.123 n=10)

                   │ /tmp/results/before.results │       /tmp/results/after.results       │
                   │       p99-latency-sec       │ p99-latency-sec  vs base               │
Tile38QueryLoad-20                   3.780m ± 1%       3.826m ± 1%  +1.23% (p=0.000 n=10)

                   │ /tmp/results/before.results │     /tmp/results/after.results     │
                   │            ops/s            │    ops/s     vs base               │
Tile38QueryLoad-20                   40.54k ± 1%   40.24k ± 0%  -0.74% (p=0.030 n=10)

I'm working on a finer-grained set of benchmark runs. Seeing which commit in the series caused the regression should give good hints about how to resolve it ... at least, if this 20-thread machine is big enough to reproduce it.

$ cat config.toml 
[[config]]
  name = "before-d68d485405"
  goroot = "/home/rhysh/sdk/go-before"
[[config]]
  name = "mutex-1-be0b569caa"
  goroot = "/home/rhysh/sdk/go-be0b569caa"
[[config]]
  name = "mutex-2-dfb7073bb8"
  goroot = "/home/rhysh/sdk/go-dfb7073bb8"
[[config]]
  name = "mutex-3-d881ed6384"
  goroot = "/home/rhysh/sdk/go-d881ed6384"
[[config]]
  name = "mutex-4-ba1c5b2c45"
  goroot = "/home/rhysh/sdk/go-ba1c5b2c45"
[[config]]
  name = "mutex-5-87e930f728"
  goroot = "/home/rhysh/sdk/go-87e930f728"
[[config]]
  name = "mutex-6-f9ba2cff22"
  goroot = "/home/rhysh/sdk/go-f9ba2cff22"
[[config]]
  name = "mutex-split-clocks-8ab131fb12"
  goroot = "/home/rhysh/sdk/go-8ab131fb12"

@rhysh (Contributor) commented May 23, 2024

My 20-thread machine doesn't show a clear reproduction of the Tile38QueryLoad-88 regression.

But I'm able to see a slowdown in the runtime's own microbenchmarks, such as with

$ go test runtime -run='^$' -bench=ontended -benchtime=100ms -count=10

Focusing on the ChanContended benchmark, and running with -test.benchtime=10000x to fix the iteration count, I see in the CPU profile that the benchmark spends more on-CPU time in runtime.lock2 when comparing d68d485 (before any mutex-related changes) and 8ab131f (after all of them). The new functions that lock2 calls show up a bit, but most of the on-CPU time still goes to procyield, which on amd64 is the PAUSE instruction.
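
For reference, the shape of that benchmark, reconstructed from the 200-ops-per-iteration accounting below rather than copied verbatim from the runtime's tests: every goroutine hammers one shared buffered channel, so each send and receive takes the channel's internal runtime.mutex.

package chansketch

import (
	"runtime"
	"testing"
)

// A reconstruction of BenchmarkChanContended's shape, not a verbatim
// copy. The buffer is sized per GOMAXPROCS so the symmetric
// send-100/receive-100 phases cannot deadlock.
func BenchmarkChanContendedSketch(b *testing.B) {
	const C = 100
	ch := make(chan int, C*runtime.GOMAXPROCS(0))
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			for i := 0; i < C; i++ {
				ch <- 0
			}
			for i := 0; i < C; i++ {
				<-ch
			}
		}
	})
}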

Looking at the program's syscalls with bpftrace, I see a big increase in the number of hits on syscalls:sys_enter_futex:

$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_futex { @[comm] = count(); }' -c "/tmp/runtime.test-go-8ab131fb12 -test.run='^$' -test.benchtime=10000x -test.count=10 -test.bench=ChanContended"
Attaching 1 probe...
goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
BenchmarkChanContended-20    	   10000	     85724 ns/op
BenchmarkChanContended-20    	   10000	     83153 ns/op
BenchmarkChanContended-20    	   10000	     86807 ns/op
BenchmarkChanContended-20    	   10000	     82407 ns/op
BenchmarkChanContended-20    	   10000	     85249 ns/op
BenchmarkChanContended-20    	   10000	     83407 ns/op
BenchmarkChanContended-20    	   10000	     82418 ns/op
BenchmarkChanContended-20    	   10000	     85117 ns/op
BenchmarkChanContended-20    	   10000	     85275 ns/op
BenchmarkChanContended-20    	   10000	     81587 ns/op
PASS


@[multipathd]: 50
@[runtime.test-go]: 27153822

Each iteration of BenchmarkChanContended involves 200 operations (100 sends, 100 receives) on a buffered channel, so that run would have done 200 * 10 * 10000 = 20,000,000 channel operations. 27,000,000 futex calls would be 13,500,000 sleep/wake pairs (or many failed sleeps, since the value would keep changing as new Ms registered?) — that could be two-thirds of channel ops resulting in a futex sleep. Comparing against the mutex profile shows that not all contention events (in that microbenchmark) led to futexsleep:

$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_futex { @[comm] = count(); }' -c "/tmp/runtime.test-go-8ab131fb12 -test.run='^$' -test.benchtime=10000x -test.count=10 -test.bench=ChanContended -test.mutexprofile=/tmp/mut" | grep 'runtime.test-go'
$ ~/sdk/go-8ab131fb12/bin/go tool pprof -sample_index=contentions -top /tmp/mut | grep unlock
  19795670   100%   100%   19795670   100%  runtime.unlock (inline)
$ ~/sdk/go-8ab131fb12/bin/go tool pprof -sample_index=delay -top /tmp/mut | grep unlock
   123.11s   100%   100%    123.11s   100%  runtime.unlock (inline)

19.8 million contended lock calls taking 123 seconds to resolve (with GOMAXPROCS=20) means an average of 6.22µs each.

It seems to start with dfb7073, which restructured lock_futex.go to be similar to lock_sema.go (though ba1c5b2 gave another increase). That restructuring changed the speculative swap at the start into a CAS.

But it wasn't until d881ed6 that each M started writing a pointer to itself into the mutex.key, which I'd expect would lead to an increase in the number of failed CASes and failed futexsleeps since the value would keep changing as new waiters added themselves at the head of the list.
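
Roughly, the difference between the two fast paths (an illustrative analog with names of my choosing, not the actual lock_futex.go code):

package locksketch

import "sync/atomic"

const (
	unlocked = 0
	locked   = 1
)

// speculativeSwap models the old fast path: a single unconditional
// swap, whose cost doesn't depend on how often the word changes
// underneath us.
func speculativeSwap(key *atomic.Uint32) bool {
	return key.Swap(locked) == unlocked
}

// casLoop models the restructured fast path: the CAS fails and must
// retry whenever the loaded value has gone stale. Once each M writes
// a pointer to itself into the key, the value churns constantly and
// failed CASes (and failed futexsleeps, which recheck the value)
// become much more common.
func casLoop(key *atomic.Uint32) bool {
	for {
		old := key.Load()
		if old != unlocked {
			return false // contended: caller falls into the slow path
		}
		if key.CompareAndSwap(old, locked) {
			return true
		}
		// CAS lost a race; re-read and retry.
	}
}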


My hypothesis on the Tile38QueryLoad-88 regression, and why it appears only on relatively large machines, is that there's a (runtime-internal) mutex that's not quite contended in that app until a certain number of Ms/Ps are active. It could be something like sched.lock; a profile (cpu or mutex) would show it.

@rhysh (Contributor) commented May 23, 2024

But I had my focus in the on-CPU profile backwards: the interesting work isn't how threads spend their time waiting; they're waiting because something else is inside the critical section. The ChanContended benchmark is a measure of the critical section of channel operations, including the portions of lock2 and unlock2 where the other threads are excluded.

And the critical section includes new code, in the fixMutexWaitList, removeMutexWaitList, and claimMutexWaitTime functions.

With each "op" representing 200 channel operations, growing from 43 µs/op to 82 µs/op means the test's critical section grew, in effect, from 220ns to 410ns. An extra 190ns of locking overhead.

Does that sound like the right line of analysis?

goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                 │ d68d485405  │             8ab131fb12              │
                 │   sec/op    │   sec/op     vs base                │
ChanContended-20   42.76µ ± 0%   82.10µ ± 2%  +91.97% (p=0.000 n=10)

@rhysh (Contributor) commented May 23, 2024

I ran the tile38 benchmark on a gotip-linux-amd64_c3h88-perf_vs_release gomote instance, with the Go toolchain at 8ab131f. I tested with GOMAXPROCS of 88, 44, and 22.

$ gomote create -setup gotip-linux-amd64_c3h88-perf_vs_release
$ gomote ssh [...]
$ git clone https://github.com/golang/benchmarks
$ cd ./benchmarks/sweet
$ /home/swarming/.swarming/w/ir/tmp/workdir-swarming-task/go/bin/go build ./cmd/sweet
$ cat > config.toml
[[config]]
  name = "dev"
  goroot = "/home/swarming/.swarming/w/ir/tmp/workdir-swarming-task/go"
  envexec = ["GOMAXPROCS=88"]
  diagnostics = ["cpuprofile"]
[[config]]
  name = "dev-44"
  goroot = "/home/swarming/.swarming/w/ir/tmp/workdir-swarming-task/go"
  envexec = ["GOMAXPROCS=44"]
  diagnostics = ["cpuprofile"]
[[config]]
  name = "dev-22"
  goroot = "/home/swarming/.swarming/w/ir/tmp/workdir-swarming-task/go"
  envexec = ["GOMAXPROCS=22"]
  diagnostics = ["cpuprofile"]
$ ./sweet run -run tile38 -count=1 config.toml

And then

$ gomote gettar -dir benchmarks/sweet/results [...]
$ gomote destroy [...]

The CPU profile shows much more contention when running with GOMAXPROCS=88 than with 44 or 22: it's 35% of on-CPU time, rather than 2.4% or 0.56%. The contention seems focused on sched.lock.

GOMAXPROCS=88:

$ go tool pprof -peek='^runtime\.lock$' tile38/dev.debug/Tile38QueryLoad.cpuprofile*
File: tile38-server
Type: cpu
Time: May 23, 2024 at 12:31pm (PDT)
Duration: 58.14s, Total samples = 1541.12s (2650.55%)
Showing nodes accounting for 1541.12s, 100% of 1541.12s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                           301.36s 55.61% |   runtime.findRunnable (inline)
                                           197.94s 36.52% |   runtime.goschedImpl (inline)
                                            13.59s  2.51% |   runtime.injectglist.func1 (inline)
                                             8.62s  1.59% |   runtime.injectglist (inline)
                                             7.28s  1.34% |   runtime.stopm (inline)
                                             6.83s  1.26% |   runtime.wakep (inline)
                                             2.04s  0.38% |   runtime.checkRunqsNoP (inline)
                                             1.61s   0.3% |   runtime.(*mheap).allocSpan (inline)
                                             1.02s  0.19% |   runtime.gcstopm (inline)
                                             0.74s  0.14% |   runtime.startm (inline)
                                             0.25s 0.046% |   runtime.entersyscall_gcwait (inline)
                                             0.21s 0.039% |   runtime.runSafePointFn (inline)
                                             0.10s 0.018% |   runtime.exitsyscall0 (inline)
                                             0.06s 0.011% |   runtime.(*sweepLocked).sweep.(*mheap).freeSpan.func3
                                             0.06s 0.011% |   runtime.newMarkBits (inline)
                                             0.04s 0.0074% |   runtime.(*sweepLocked).sweep.(*mheap).freeSpan.func4
                                             0.03s 0.0055% |   runtime.(*scavengerState).sleep (inline)
                                             0.03s 0.0055% |   runtime.checkIdleGCNoP (inline)
                                             0.03s 0.0055% |   runtime.mProf_Malloc (inline)
                                             0.02s 0.0037% |   runtime.markrootSpans (inline)
                                             0.02s 0.0037% |   runtime.setprofilebucket (inline)
                                             0.02s 0.0037% |   runtime.stopTheWorldWithSema (inline)
                                             0.01s 0.0018% |   runtime.(*spanSet).push (inline)
                                             0.01s 0.0018% |   runtime.forEachPInternal (inline)
                                             0.01s 0.0018% |   runtime.mcommoninit (inline)
                                             0.01s 0.0018% |   runtime.newArenaMayUnlock (inline)
     0.05s 0.0032% 0.0032%    541.94s 35.17%                | runtime.lock
                                           541.89s   100% |   runtime.lockWithRank (inline)
----------------------------------------------------------+-------------

GOMAXPROCS=44:

$ go tool pprof -peek='^runtime\.lock$' tile38/dev-44.debug/Tile38QueryLoad.cpuprofile*
File: tile38-server
Type: cpu
Time: May 23, 2024 at 12:32pm (PDT)
Duration: 57.63s, Total samples = 809.35s (1404.51%)
Showing nodes accounting for 809.35s, 100% of 809.35s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                             9.38s 48.10% |   runtime.findRunnable (inline)
                                             6.66s 34.15% |   runtime.goschedImpl (inline)
                                             0.77s  3.95% |   runtime.(*mheap).allocSpan (inline)
                                             0.73s  3.74% |   runtime.wakep (inline)
                                             0.49s  2.51% |   runtime.injectglist.func1 (inline)
                                             0.36s  1.85% |   runtime.stopm (inline)
                                             0.27s  1.38% |   runtime.startm (inline)
                                             0.22s  1.13% |   runtime.injectglist (inline)
                                             0.17s  0.87% |   runtime.(*sweepLocked).sweep.(*mheap).freeSpan.func3
                                             0.16s  0.82% |   runtime.gcstopm (inline)
                                             0.10s  0.51% |   runtime.runSafePointFn (inline)
                                             0.07s  0.36% |   runtime.entersyscall_gcwait (inline)
                                             0.03s  0.15% |   runtime.(*sweepLocked).sweep.(*mheap).freeSpan.func4
                                             0.02s   0.1% |   runtime.checkRunqsNoP (inline)
                                             0.02s   0.1% |   runtime.newMarkBits (inline)
                                             0.01s 0.051% |   runtime.(*timers).lock (inline)
                                             0.01s 0.051% |   runtime.freeSpecial (inline)
                                             0.01s 0.051% |   runtime.mProf_Malloc (inline)
                                             0.01s 0.051% |   runtime.setprofilebucket (inline)
                                             0.01s 0.051% |   runtime.stkbucket (inline)
     0.02s 0.0025% 0.0025%     19.50s  2.41%                | runtime.lock
                                            19.48s 99.90% |   runtime.lockWithRank (inline)
----------------------------------------------------------+-------------

GOMAXPROCS=22:

$ go tool pprof -peek='^runtime\.lock$' tile38/dev-22.debug/Tile38QueryLoad.cpuprofile*
File: tile38-server
Type: cpu
Time: May 23, 2024 at 12:33pm (PDT)
Duration: 78.38s, Total samples = 738.99s (942.79%)
Showing nodes accounting for 738.99s, 100% of 738.99s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                             1.84s 44.55% |   runtime.findRunnable (inline)
                                             1.42s 34.38% |   runtime.goschedImpl (inline)
                                             0.21s  5.08% |   runtime.wakep (inline)
                                             0.20s  4.84% |   runtime.(*mheap).allocSpan (inline)
                                             0.11s  2.66% |   runtime.stopm (inline)
                                             0.10s  2.42% |   runtime.(*sweepLocked).sweep.(*mheap).freeSpan.func3
                                             0.06s  1.45% |   runtime.gcstopm (inline)
                                             0.05s  1.21% |   runtime.injectglist.func1 (inline)
                                             0.04s  0.97% |   runtime.newMarkBits (inline)
                                             0.04s  0.97% |   runtime.startm (inline)
                                             0.03s  0.73% |   runtime.entersyscall_gcwait (inline)
                                             0.01s  0.24% |   runtime.injectglist (inline)
                                             0.01s  0.24% |   runtime.markrootSpans (inline)
                                             0.01s  0.24% |   runtime.runSafePointFn (inline)
     0.01s 0.0014% 0.0014%      4.13s  0.56%                | runtime.lock
                                             4.12s 99.76% |   runtime.lockWithRank (inline)
----------------------------------------------------------+-------------

The call stack for runtime.goschedImpl ends at runtime.mcall, but there's enough use of runtime.Gosched that the user's call shows up in the CPU profile itself:

$ go tool pprof -peek=Gosched tile38/dev.debug/Tile38QueryLoad.cpuprofile*
File: tile38-server
Type: cpu
Time: May 23, 2024 at 12:31pm (PDT)
Duration: 58.14s, Total samples = 1541.12s (2650.55%)
Showing nodes accounting for 1541.12s, 100% of 1541.12s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                             0.02s   100% |   github.com/tidwall/tile38/internal/collection.nextStep (inline)
         0     0%     0%      0.02s 0.0013%                | runtime.Gosched
                                             0.02s   100% |   runtime.newstack
----------------------------------------------------------+-------------

Which points to a runtime.Gosched call in the application here: https://github.com/tidwall/tile38/blob/1.29.1/internal/collection/collection.go#L797
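
That matters because goschedImpl parks the goroutine on the global run queue, which is protected by sched.lock. A tiny benchmark (mine, not tile38's code) is enough to put that one mutex under pressure:

package schedsketch

import (
	"runtime"
	"testing"
)

// Every runtime.Gosched call moves the calling goroutine to the global
// run queue under sched.lock, so many parallel callers serialize on
// that single runtime-internal mutex.
func BenchmarkGoschedStorm(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			runtime.Gosched()
		}
	})
}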


My current understanding of the problem:

  1. The changes I submitted recently to fix the semantics of runtime-internal mutex profiles made lock2/unlock2 pairs on contended runtime.mutex values slower, by something like 200 ns.
  2. Some apps are limited by the critical section of particular locks. A slowdown here can create a sort of Amdahl's law problem for them.
  3. But it seems likely that those CLs also changed the ratio of the cost of lock2/unlock2 pairs on uncontended locks versus contended locks. If there's a big penalty for dealing with contended locks, it can be hard for an app to pull itself out of that tar pit once it's fallen in. If that effect is present here, the performance-versus-GOMAXPROCS charts might look like what we saw for sync.Mutex a few years ago (sync: Mutex performance collapses with high concurrency #33747, starting with Go 1.9, fixed in Go 1.13). I don't know for sure whether that's the case here, but I suspect it, and that worries me.
  4. It seems easy to spend 200 ns if you're accessing memory.

Do we have explicit testing for the ratio of contended versus uncontended runtime.mutex performance?
Does a large ratio there also worry you, @golang/runtime?
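
For concreteness, the kind of pairing I mean, with sync.Mutex standing in (a real version would need to live in package runtime to reach runtime.mutex directly); the number to watch is the ratio between the two, not either absolute value:

package ratiosketch

import (
	"sync"
	"testing"
)

// Uncontended: a single goroutine, so Lock/Unlock stay on the fast path.
func BenchmarkMutexUncontendedSketch(b *testing.B) {
	var mu sync.Mutex
	for i := 0; i < b.N; i++ {
		mu.Lock()
		mu.Unlock()
	}
}

// Contended: GOMAXPROCS goroutines fight over one mutex, so most
// acquisitions go through the contended slow path.
func BenchmarkMutexContendedSketch(b *testing.B) {
	var mu sync.Mutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			mu.Unlock()
		}
	})
}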

It's possible that tile38's performance could be improved by reducing its calls to runtime.Gosched. But I don't think it's unusual for apps to operate in the vicinity of unknown performance cliffs caused by nearly-saturated runtime.mutex values. It's probably sched.lock for a bunch of apps, but there'll be other apps built around a single popular chan where they're close to saturating its internal runtime.mutex.


@golang/runtime, do you read the data in the same way, i.e. that lock2/unlock2 pairs on contended runtime.mutex values are about 200 ns slower? And that memory latency is a likely cause?

And if so, that a viable path forward could be somehow preloading that memory while outside of the critical section?

@prattmic (Member) commented:

> Do we have explicit testing for the ratio of contended versus uncontended runtime.mutex performance?

Not that I am aware of. The tar pit is a bit worrying.

> do you read the data in the same way, in that lock2/unlock2 pairs on contended runtime.mutex values are about 200 ns slower?

This sounds correct to me. It seems to be counting the cross-core communication overhead as well, but I suppose that counts as part of the critical section (if other cores can't observe that you have unlocked, it is effectively still locked).

> And that memory latency is a likely cause?

I could believe that, though I'm not certain, I'd need to think about it more.

Stepping back, I think we have three high-level options for next steps:

  1. Decide that this is OK because these programs were on the edge of scalability.
  2. Try to speed up the critical section.
  3. Revert and revisit for 1.24.

Though I hate to revert, I'd say I'm leaning towards (3). If it were the beginning of the dev cycle, I'd be more comfortable getting more production experience to see whether (1) is OK. Similarly with (2): this is somewhat unknown complexity, though if we find a single smoking gun that fixes everything, great.

@rhysh (Contributor) commented May 28, 2024

Thanks Michael. I'm not in a position to support option 1: that sort of performance cliff is one of the most operationally worrisome ways that a Go program can fail, so I don't want to make it worse. It's also bad news for vertically-scaled apps, which may not be able to find machines that are big and fast enough to recover the performance loss.

I've made some progress on option 2, on a path with a pretty simple concept: for futex-based locks, move the sampling decision into lock2 so we don't pay the bookkeeping overhead for contention events that we're not going to record anyway. The only reason futex-based locks would need to track the waiting Ms is in support of this profile (unlike lock_sema.go, which needs that list in order to wake the Ms), so this makes the overhead more proportional to the number of contention events that end up in the mutex profile (especially the common case of "MutexProfileFraction=0, no events at all"). I'm not sure whether that counts as a "single smoking gun". But I haven't hammered out all the details of that (to-be-mailed) patch yet, and the details all need to be right.
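
In outline, with stand-in names (the details are exactly what's still being hammered out):

package locksketch

import "math/rand"

// mutexProfileRate stands in for the runtime's MutexProfileFraction
// setting; 0 disables the profile.
var mutexProfileRate int64

// sampleContention decides, up front, whether this contention event
// will be recorded. Making the decision at the top of lock2's slow
// path means the waiting-M bookkeeping runs only for events that will
// actually land in the profile, and not at all in the common
// mutexProfileRate == 0 case.
func sampleContention() bool {
	r := mutexProfileRate
	return r > 0 && rand.Int63n(r) == 0
}

func lock2Slow( /* l *mutex */ ) {
	if sampleContention() {
		// ...record this M on the mutex's waiter list so unlock2
		// can attribute the contention delay to it...
	}
	// ...futex sleep/wake loop as before...
}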

Option 3 is safe, and would work for my purposes. My goal for this change is to support the project, by increasing the fraction of bug reports we get that include enough detail to find a solution. So there's the question of balancing that benefit against the risk of the change, including work in support of option 2. Maybe this is addressing a sharp pain point for the project and so can pay its way. If not, then a revert is probably right for the overall health.

@gopherbot (Contributor) commented:

Change https://go.dev/cl/588857 mentions this issue: runtime: trim down bookkeeping in lock2/unlock2

@gopherbot (Contributor) commented:

Change https://go.dev/cl/588856 mentions this issue: runtime: sample waiting Ms on futex platforms

@aclements (Member Author) commented:

I'm also in support of reverting and revisiting for 1.24. It's always unfortunate for things to just miss a release, but given that this touches really foundational code, I lean strongly toward the safest solution at this point in the release cycle. I look forward to seeing this re-land in 1.24!

@rhysh, do you have time to prep rollbacks? If not, we can do it.

@gopherbot (Contributor) commented:

Change https://go.dev/cl/589095 mentions this issue: Revert "runtime: split mutex profile clocks"

@gopherbot (Contributor) commented:

Change https://go.dev/cl/589096 mentions this issue: Revert "runtime: improve runtime-internal mutex profile tests"

@gopherbot (Contributor) commented:

Change https://go.dev/cl/589116 mentions this issue: Revert "runtime: double-link list of waiting Ms"

@gopherbot (Contributor) commented:

Change https://go.dev/cl/589098 mentions this issue: Revert "runtime: use semaphore structure for futex locks"

@gopherbot (Contributor) commented:

Change https://go.dev/cl/589097 mentions this issue: Revert "runtime: remove GODEBUG=runtimecontentionstacks"

@gopherbot (Contributor) commented:

Change https://go.dev/cl/589115 mentions this issue: Revert "runtime: profile mutex contention during unlock"

@gopherbot (Contributor) commented:

Change https://go.dev/cl/589117 mentions this issue: Revert "runtime: prepare for extensions to waiting M list"

@rhysh (Contributor) commented May 29, 2024

Thanks @aclements, the reverts are ready. Their test failure (cmd/cgo/internal/swig on linux-386) is also present on the build dashboard.

Beyond touching foundational code, it touches foundational code across several platforms. The current performance builders are only for linux/amd64, which wouldn't catch problems with semaphore-based locks (darwin, windows) or where cputicks is implemented by calling nanotime (often on arm64 and arm).

gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit 8ab131f (CL 586796)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I54711691e86e072081482102019d168292b5150a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589095
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit f9ba2cf (CL 586237)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

This test verifies that the semantics of the mutex profile for
runtime.mutex values matches that of sync.Mutex values. Without the rest
of the patch series, this test would correctly identify that Go 1.22's
semantics are incorrect (issue #66999).

Updates #66999
Updates #67585

Change-Id: Id06ae01d7bc91c94054c80d273e6530cb2d59d10
Reviewed-on: https://go-review.googlesource.com/c/go/+/589096
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit 87e930f (CL 585639)

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I1e286d2a16d16e4af202cd5dc04b2d9c4ee71b32
Reviewed-on: https://go-review.googlesource.com/c/go/+/589097
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit ba1c5b2 (CL 585638).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: Ibeec5d8deb17e87966cf352fefc7efe2267839d6
Reviewed-on: https://go-review.googlesource.com/c/go/+/589115
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit d881ed6 (CL 585637).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I70d8d0b74f73be95c43d664f584e8d98519aba26
Reviewed-on: https://go-review.googlesource.com/c/go/+/589116
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit dfb7073 (CL 585636).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I3483bf0b85ba0b77204032a68b7cbe93f142703e
Reviewed-on: https://go-review.googlesource.com/c/go/+/589098
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
gopherbot pushed a commit that referenced this issue May 30, 2024
This reverts commit be0b569 (CL 585635).

Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.

Updates #66999
Updates #67585

Change-Id: I7843ccaecbd273b7ceacfa0f420dd993b4b15a0a
Reviewed-on: https://go-review.googlesource.com/c/go/+/589117
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@aclements (Member Author) commented:

It looks like afbbc28, plus a little from the following CLs, got everything back to where it was. Notably, Tile38QueryLoad-88 is back to where it was on sec/op and latency. A bunch of microbenchmarks had benefited in mysterious ways from dfb7073, and thus slowed back down to where they were once it was reverted, but that original effect was clearly not intended.

Several benchmarks are still behind where they were as of last release, but I'll open a separate issue to investigate that.
