runtime: unexpectedly large slowdown with runtime.LockOSThread #18023
The golang-nuts thread about this says that using 'perf stat' shows approximately 1000X more thread context switches with LockOSThread than without. Is that enough confirmation? |
Somehow I don't see that, but, yes, there is probably nothing that can be done here. Sorry for the noise. |
Let me chime in a bit. On Linux the context switch can happen, if my reading of futex_wake() is correct (which it probably is not), because e.g. wake_up_q() via calling
for woken process. The Go runtime calls
When LockOSThread is used, an M is dedicated to a G, so when that G blocks, e.g. on a chan send, that M, if I understand correctly, has a high chance of stopping. And if it stops it goes to
With this thinking the following patch:
diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 9d55bd129c..418fe1b845 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -146,7 +157,13 @@ func notesleep(n *note) {
// Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
ns = 10e6
}
- for atomic.Load(key32(&n.key)) == 0 {
+ for spin := 0; atomic.Load(key32(&n.key)) == 0; spin++ {
+ // spin a bit hoping we'll get wakup soon
+ if spin < 10000 {
+ continue
+ }
+
+ // no luck -> go to sleep heavily to kernel
gp.m.blocked = true
futexsleep(key32(&n.key), 0, ns)
if *cgo_yield != nil {
makes BenchmarkLocked much faster on my computer:
I also looked around and found that lockOSThread is used at essentially every CGo call: https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107
With this in mind I modified the benchmark a bit so that LockOSThread is not used explicitly, but the server performs 1 and 10 simple C calls for every request:
which shows the change brings a quite visible speedup. I'm not saying my patch is right, but at least it shows that much can be improved. So I suggest reopening the issue. Thanks beforehand, /cc @dvyukov, @aclements, @bcmills
full benchmark source:
package tmp
import (
"runtime"
"testing"
)
type in struct {
c chan *out
arg int
}
type out struct {
ret int
}
func client(c chan *in, arg int) int {
rc := make(chan *out)
c <- &in{
c: rc,
arg: arg,
}
ret := <-rc
return ret.ret
}
func _server(c chan *in, argadjust func(int) int) {
for r := range c {
r.c <- &out{ret: argadjust(r.arg)}
}
}
func server(c chan *in) {
_server(c, func(arg int) int {
return 3 + arg
})
}
func lockedServer(c chan *in) {
runtime.LockOSThread()
server(c)
runtime.UnlockOSThread()
}
// server with 1 C call per request
func cserver(c chan *in) {
_server(c, cargadjust)
}
// server with 10 C calls per request
func cserver10(c chan *in) {
_server(c, func(arg int) int {
for i := 0; i < 10; i++ {
arg = cargadjust(arg)
}
return arg
})
}
func benchmark(b *testing.B, srv func(chan *in)) {
inc := make(chan *in)
go srv(inc)
for i := 0; i < b.N; i++ {
client(inc, i)
}
close(inc)
}
func BenchmarkUnlocked(b *testing.B) { benchmark(b, server) }
func BenchmarkLocked(b *testing.B) { benchmark(b, lockedServer) }
func BenchmarkCGo(b *testing.B) { benchmark(b, cserver) }
func BenchmarkCGo10(b *testing.B) { benchmark(b, cserver10) }

package tmp
// int argadjust(int arg) { return 3 + arg; }
import "C"
// XXX here because cannot use C in tests directly
func cargadjust(arg int) int {
return int(C.argadjust(C.int(arg)))
} |
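The spin-before-sleep idea in the patch above can be illustrated outside the runtime. The following is only a user-space sketch under stated assumptions, not the runtime's notesleep: the note type, newNote, wakeup, sleep and the spinLimit parameter are hypothetical names chosen for this example, and the slow path blocks on a channel (which parks a goroutine) rather than calling futexsleep (which parks an OS thread). It shows only the control flow: poll an atomic flag a bounded number of times, then fall back to blocking.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// note is a hypothetical one-shot event, loosely modelled on the shape of
// the runtime's note but implemented entirely in user space.
type note struct {
	key  uint32        // 0 = not signalled, 1 = signalled
	wake chan struct{} // closed by wakeup; the slow path blocks on it
}

func newNote() *note { return &note{wake: make(chan struct{})} }

// wakeup signals the note. It must be called at most once per note.
func (n *note) wakeup() {
	atomic.StoreUint32(&n.key, 1)
	close(n.wake)
}

// sleep waits until the note is signalled. It polls the flag up to
// spinLimit times first and reports whether the wakeup was observed
// during that spin phase (true) or only after blocking (false).
func (n *note) sleep(spinLimit int) (spun bool) {
	for spin := 0; spin < spinLimit; spin++ {
		if atomic.LoadUint32(&n.key) != 0 {
			return true
		}
	}
	// No luck while spinning: block until wakeup closes the channel.
	<-n.wake
	return false
}

func main() {
	// Already signalled: the first poll of the spin phase sees it.
	a := newNote()
	a.wakeup()
	fmt.Println("observed while spinning:", a.sleep(10000))

	// Signalled later, with spinning disabled: the waiter has to block.
	b := newNote()
	go func() {
		time.Sleep(10 * time.Millisecond)
		b.wakeup()
	}()
	fmt.Println("observed while spinning:", b.sleep(0))
}
```

The trade-off is the usual one for spin-then-block waits: spinning burns CPU on the waiting side in exchange for skipping the sleep and wake path when the wakeup arrives quickly, so the spin limit (10000 in the patch) is a tuning value rather than a principled constant.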
@navytux This issue is closed. If you want to discuss your patch, please open a new issue or raise a topic on the golang-dev mailing list. Thanks. |
@ianlancetaylor thanks for feedback. I've opened #21827 and posted update to original thread on golang-nuts. |
The benchmark https://play.golang.org/p/nI4LO1wu17 shows a significant slowdown when using runtime.LockOSThread.
On amd64 at current tip I see this output from go test -test.cpu=1,2,4,8 -test.bench=.:
The problem may be entirely due to the extra thread context switches required by runtime.LockOSThread. When the thread is not locked, the channel communication may be done in practice by a simple goroutine switch. When using runtime.LockOSThread a full thread context switch is required.
Still, we should make sure we have the tools to confirm that.
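One way to check the context-switch hypothesis from inside a Go program, without perf, is to read the process-wide context switch counters via getrusage before and after a ping-pong run. The sketch below is an illustration only and assumes a Unix-like system where syscall.Getrusage fills in the Nvcsw and Nivcsw fields (Linux does); the helper names ctxSwitches, pingPong and measure are made up for this example, and the server loop is a simplified stand-in for the linked benchmark rather than a copy of it.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// ctxSwitches returns the voluntary and involuntary context switch counts
// accumulated so far by the whole process (all threads).
func ctxSwitches() (vol, invol int64) {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		panic(err)
	}
	return int64(ru.Nvcsw), int64(ru.Nivcsw)
}

// pingPong performs n request/reply round trips with a server goroutine.
// If lock is true, the server goroutine is locked to its OS thread.
func pingPong(n int, lock bool) {
	req := make(chan int)
	resp := make(chan int)
	go func() {
		if lock {
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
		}
		for v := range req {
			resp <- v + 3
		}
	}()
	for i := 0; i < n; i++ {
		req <- i
		<-resp
	}
	close(req)
}

// measure returns how many voluntary context switches the process made
// while running pingPong(n, lock).
func measure(n int, lock bool) int64 {
	before, _ := ctxSwitches()
	pingPong(n, lock)
	after, _ := ctxSwitches()
	return after - before
}

func main() {
	const n = 100000
	fmt.Println("voluntary switches, unlocked:", measure(n, false))
	fmt.Println("voluntary switches, locked:  ", measure(n, true))
}
```

The counters are process-wide, so background runtime threads add some noise; the comparison is only meaningful as an order-of-magnitude check between the two runs.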