runtime: big performance penalty with runtime.LockOSThread #21827

Open
navytux opened this Issue Sep 10, 2017 · 49 comments

@navytux
Contributor

navytux commented Sep 10, 2017

This issue reopens #18023.

There it was observed that if a server goroutine is locked to an OS thread, that locking imposes a big performance penalty compared to the same server code without the handler locked to an OS thread. The related golang-nuts thread discusses this and notes that when runtime.LockOSThread is used, the number of context switches is 10x (ten times, not 1000x) higher than without OS thread locking. #18023 (comment) observes that the context switch can happen because, e.g., futex_wake() in the kernel can move the woken process to a different CPU.
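For reference, one way to observe such context-switch counts from within a Go program is to sample rusage counters around the benchmark body. A minimal sketch, not from the original report (note that RUSAGE_SELF counts switches process-wide, not per thread):

package main

import (
        "fmt"
        "syscall"
)

// ctxSwitches returns the voluntary/involuntary context-switch counts
// accumulated so far by the whole process.
func ctxSwitches() (voluntary, involuntary int64) {
        var ru syscall.Rusage
        syscall.Getrusage(syscall.RUSAGE_SELF, &ru)
        return ru.Nvcsw, ru.Nivcsw
}

func main() {
        v0, i0 := ctxSwitches()
        // ... run the locked or unlocked client/server benchmark body here ...
        v1, i1 := ctxSwitches()
        fmt.Printf("voluntary: %d, involuntary: %d\n", v1-v0, i1-i0)
}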

Moreover, it was found that lockOSThread is used internally by the Go runtime at essentially every CGo call:

https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107

so even if user code does not use LockOSThread itself but makes CGo calls on the server side, there is reason to expect a similar kind of slowdown.

With the above in mind, #18023 (comment) shows a dirty patch that spins a bit in notesleep() before going to the kernel via futex_wait(). It demonstrates that 1) a large fraction of the performance penalty related to LockOSThread can go away, and 2) the case of CGo calls on the server side can also receive a visible speedup:

name        old time/op  new time/op  delta
Unlocked-4   485ns ± 0%   483ns ± 1%     ~     (p=0.188 n=9+10)
Locked-4    5.22µs ± 1%  1.32µs ± 5%  -74.64%  (p=0.000 n=9+10)
CGo-4        581ns ± 1%   556ns ± 0%   -4.27%  (p=0.000 n=10+10)
CGo10-4     2.20µs ± 6%  1.23µs ± 0%  -44.32%  (p=0.000 n=10+9)

The patch is certainly not completely right (and probably far from right), since spinning unconditionally will sometimes do harm instead of good. But it shows that with proper scheduler tuning it is possible to avoid context switches and perform better.

I attach my original post here for completeness.

Thanks,
Kirill

/cc @rsc, @ianlancetaylor, @dvyukov, @aclements, @bcmills


#18023 (comment):

Let me chime in a bit. On Linux the context switch can happen, if my reading of futex_wake() is correct (which it may well not be), because e.g. wake_up_q(), via wake_up_process() -> try_to_wake_up() -> select_task_rq(), can select another CPU

                cpu = cpumask_any(&p->cpus_allowed);

for woken process.

The Go runtime calls futex_wake() in notewakeup() to wake up an M that was previously stopped via stopm() -> notesleep() (the latter calls futexwait()).

When LockOSThread is used, an M is dedicated to a G, so when that G blocks, e.g. on a chan send, that M, if I understand correctly, has a high chance of stopping. And if it stops, it goes to futexwait, and then a context switch happens when someone wakes it up, e.g. because something was sent to the G via a channel.

With this thinking, the following patch:

diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 9d55bd129c..418fe1b845 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -146,7 +157,13 @@ func notesleep(n *note) {
                // Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
                ns = 10e6
        }
-       for atomic.Load(key32(&n.key)) == 0 {
+       for spin := 0; atomic.Load(key32(&n.key)) == 0; spin++ {
+               // spin a bit hoping we'll get a wakeup soon
+               if spin < 10000 {
+                       continue
+               }
+
+               // no luck -> go to the kernel to sleep
                gp.m.blocked = true
                futexsleep(key32(&n.key), 0, ns)
                if *cgo_yield != nil {

makes BenchmarkLocked much faster on my computer:

name        old time/op  new time/op  delta
Unlocked-4   485ns ± 0%   483ns ± 1%     ~     (p=0.188 n=9+10)
Locked-4    5.22µs ± 1%  1.32µs ± 5%  -74.64%  (p=0.000 n=9+10)

I also looked around and found that lockOSThread is used at essentially every CGo call:

https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107

With this in mind I modified the benchmark a bit so that LockOSThread is not explicitly used, but the server performs 1 and 10 simple C calls per request:

CGo-4        581ns ± 1%   556ns ± 0%   -4.27%  (p=0.000 n=10+10)
CGo10-4     2.20µs ± 6%  1.23µs ± 0%  -44.32%  (p=0.000 n=10+9)

which shows the change brings a quite visible speedup.

I'm not saying my patch is right, but it at least shows that much can be improved. So I suggest reopening the issue.

Thanks beforehand,
Kirill

/cc @dvyukov, @aclements, @bcmills


full benchmark source:

(tmp_test.go)

package tmp

import (
        "runtime"
        "testing"
)

type in struct {
        c   chan *out
        arg int
}

type out struct {
        ret int
}

// client sends one request to the server over c and waits for the reply.
func client(c chan *in, arg int) int {
        rc := make(chan *out)
        c <- &in{
                c:   rc,
                arg: arg,
        }
        ret := <-rc
        return ret.ret
}

// _server serves requests from c, adjusting each argument via argadjust.
func _server(c chan *in, argadjust func(int) int) {
        for r := range c {
                r.c <- &out{ret: argadjust(r.arg)}
        }
}

func server(c chan *in) {
        _server(c, func(arg int) int {
                return 3 + arg
        })
}

func lockedServer(c chan *in) {
        runtime.LockOSThread()
        server(c)
        runtime.UnlockOSThread()
}

// server with 1 C call per request
func cserver(c chan *in) {
        _server(c, cargadjust)
}

// server with 10 C calls per request
func cserver10(c chan *in) {
        _server(c, func(arg int) int {
                for i := 0; i < 10; i++ {
                        arg = cargadjust(arg)
                }
                return arg
        })
}

func benchmark(b *testing.B, srv func(chan *in)) {
        inc := make(chan *in)
        go srv(inc)
        for i := 0; i < b.N; i++ {
                client(inc, i)
        }
        close(inc)
}

func BenchmarkUnlocked(b *testing.B)    { benchmark(b, server) }
func BenchmarkLocked(b *testing.B)      { benchmark(b, lockedServer) }
func BenchmarkCGo(b *testing.B)         { benchmark(b, cserver) }
func BenchmarkCGo10(b *testing.B)       { benchmark(b, cserver10) }

(tmp.go)

package tmp

// int argadjust(int arg) { return 3 + arg; }
import "C"

// XXX defined here because cgo ("C") cannot be used directly in test files
func cargadjust(arg int) int {
        return int(C.argadjust(C.int(arg)))
}
@typeless

typeless commented Sep 13, 2017

The kernel scheduler is invoked when 1. an interrupt handler exits (hardware IRQs or timer ticks), 2. a syscall exits, or 3. the scheduler is explicitly called. This means the worst-case scheduling resolution is equal to the resolution of the timer ticks when the system is idle (no IRQs and no stream of syscalls). So it's doubtful that the 100us-ish timeout would work as expected when the granularity of kernel scheduling is much coarser. And if you check the futex manpage, it says about the timeout that "This interval will be rounded up to the system clock granularity". That probably explains why the spinlock gets woken up faster.
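The granularity effect described above is easy to probe empirically. A minimal illustrative sketch (not from the thread) that measures how much a nominal 100µs sleep overshoots its request:

package main

import (
        "fmt"
        "time"
)

func main() {
        const target = 100 * time.Microsecond
        var worst time.Duration
        for i := 0; i < 1000; i++ {
                t0 := time.Now()
                time.Sleep(target) // ends up waiting in the kernel with a timeout
                if d := time.Since(t0); d > worst {
                        worst = d
                }
        }
        // On a coarse-tick kernel the worst case lands far above the request.
        fmt.Printf("requested %v, worst observed %v\n", target, worst)
}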

@dvyukov
Member

dvyukov commented Sep 13, 2017

@typeless futex explicitly calls scheduler, so that's number 3

@dvyukov
Member

dvyukov commented Sep 13, 2017

I don't think this affects cgo calls, notesleep should not be on fast path there.
It should affect cgo callbacks, though.

@typeless

typeless commented Sep 13, 2017

@dvyukov

futex explicitly calls scheduler, so that's number 3

But for the user process to be able to call the scheduler, doesn't the process have to be invoked by the scheduler first?

@dvyukov
Member

dvyukov commented Sep 13, 2017

If it's not running, it won't be able to set n.key either.

@typeless

typeless commented Sep 13, 2017

@dvyukov

Apologies for my ambiguous wording. By processes I actually mean OS threads.

If it's not running, it won't be able to set n.key either.

Isn't it possible for a running thread to be interrupted and to yield the CPU to others temporarily (unless it can disable the IRQs)? The thread can only get rescheduled again at the next scheduling point, which is beyond the thread's control.

P.S. FWIW, a possible way to occupy the CPU exclusively, other than disabling IRQs, is to use sched_setscheduler with the SCHED_FIFO policy.
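For illustration, a minimal sketch of what that could look like from Go, assuming Linux (raw sched_setscheduler syscall, SCHED_FIFO = 1, and CAP_SYS_NICE); this is only an aside, not part of any proposed fix:

package main

import (
        "runtime"
        "syscall"
        "unsafe"
)

// schedParam mirrors the kernel's struct sched_param.
type schedParam struct{ priority int32 }

// setFIFO switches the calling OS thread to SCHED_FIFO with the given
// priority; pid 0 means "the calling thread".
func setFIFO(prio int32) error {
        p := schedParam{priority: prio}
        _, _, errno := syscall.Syscall(syscall.SYS_SCHED_SETSCHEDULER,
                0, 1 /* SCHED_FIFO */, uintptr(unsafe.Pointer(&p)))
        if errno != 0 {
                return errno // typically EPERM without CAP_SYS_NICE
        }
        return nil
}

func main() {
        runtime.LockOSThread() // the policy is per OS thread, so pin first
        if err := setFIFO(1); err != nil {
                println("sched_setscheduler:", err.Error())
        }
}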

@navytux
Contributor

navytux commented Sep 13, 2017

I don't think this affects cgo calls, notesleep should not be on fast path there.
It should affect cgo callbacks, though.

@dvyukov the original benchmark uses only CGo calls, without CGo callbacks. lockOSThread is used on the cgocall fast path for every CGo call:

https://github.com/golang/go/blob/ab401077/src/runtime/cgocall.go#L107

and somehow, as the benchmark shows, it interacts with the scheduler:

CGo-4        581ns ± 1%   556ns ± 0%   -4.27%  (p=0.000 n=10+10)
CGo10-4     2.20µs ± 6%  1.23µs ± 0%  -44.32%  (p=0.000 n=10+9)
@dvyukov
Member

dvyukov commented Sep 13, 2017

@typeless sorry, I don't see how this is related to the topic.

@dvyukov
Member

dvyukov commented Sep 13, 2017

@navytux

and somehow, as the benchmark shows, it interacts with the scheduler:

What is the result if you comment out lockOSThread/unlockOSThread in cgocall (it's not really needed there)?

@navytux
Contributor

navytux commented Sep 13, 2017

@dvyukov thanks for the question; I too thought about it just after commenting. On today's unmodified tip (go version devel +c2f8ed267b Wed Sep 13 07:19:21 2017 +0000 linux/amd64) it gives:

$ benchstat dv0.txt
name     time/op
CGo-4     576ns ± 1%
CGo10-4  2.22µs ± 2%

After adding

diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index ce4d707e06..decc310088 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -106,7 +106,7 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
 
        // Lock g to m to ensure we stay on the same stack if we do a
        // cgo callback. In case of panic, unwindm calls endcgo.
-       lockOSThread()
+//     lockOSThread()
        mp := getg().m
        mp.ncgocall++
        mp.ncgo++
@@ -159,7 +159,7 @@ func endcgo(mp *m) {
                raceacquire(unsafe.Pointer(&racecgosync))
        }
 
-       unlockOSThread() // invalidates mp
+//     unlockOSThread() // invalidates mp
 }
 
 // Call from C back to Go.

it becomes:

$ benchstat dv0.txt dv1.txt 
name     old time/op  new time/op  delta
CGo-4     576ns ± 1%   558ns ± 1%  -3.18%  (p=0.000 n=10+10)
CGo10-4  2.22µs ± 2%  2.03µs ± 4%  -8.63%  (p=0.000 n=9+8)

With the notesleep spin patch added on top (so both cgocall and notesleep are patched) it becomes:

$ benchstat dv1.txt dv2.txt 
name     old time/op  new time/op  delta
CGo-4     558ns ± 1%   552ns ± 2%   -1.00%  (p=0.021 n=10+10)
CGo10-4  2.03µs ± 4%  1.17µs ± 1%  -42.45%  (p=0.000 n=8+8)

which shows the speedup is not only related to lockOSThread but somehow applies more generally (?) to the scheduler or some other details of CGo calls.

@dvyukov
Member

dvyukov commented Sep 13, 2017

which shows the speedup is not only related to lockOSThread but somehow applies more generally (?) to the scheduler or some other details of CGo calls.

This makes sense now.

I guess it is a general tradeoff between latency and burning CPU.
If we do more aggressive scheduler spinning, it can make sense to do it directly in findrunnable/stoplockedm.
But generally it's unclear to me whether this translates to real-world improvements. Synthetic synchronization tests are usually misleading. If a locked goroutine blocks for more than 10us, then we just burn 10us of CPU time in vain.

@navytux
Contributor

navytux commented Sep 13, 2017

If in cgocall I also disable entersyscall/exitsyscall, even without the notesleep spin patch, so that the whole patch is:

diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index ce4d707e06..70fbe7e7b1 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -106,7 +106,7 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
 
        // Lock g to m to ensure we stay on the same stack if we do a
        // cgo callback. In case of panic, unwindm calls endcgo.
-       lockOSThread()
+//     lockOSThread()
        mp := getg().m
        mp.ncgocall++
        mp.ncgo++
@@ -129,9 +129,9 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
        // "system call", run the Go code (which may grow the stack),
        // and then re-enter the "system call" reusing the PC and SP
        // saved by entersyscall here.
-       entersyscall(0)
+//     entersyscall(0)
        errno := asmcgocall(fn, arg)
-       exitsyscall(0)
+//     exitsyscall(0)
 
        // From the garbage collector's perspective, time can move
        // backwards in the sequence above. If there's a callback into
@@ -159,7 +159,7 @@ func endcgo(mp *m) {
                raceacquire(unsafe.Pointer(&racecgosync))
        }
 
-       unlockOSThread() // invalidates mp
+//     unlockOSThread() // invalidates mp
 }
 
 // Call from C back to Go.

then compared to only lockOSThread commented out it becomes:

$ benchstat dv1.txt dv1nosys.txt 
name     old time/op  new time/op  delta
CGo-4     558ns ± 1%   509ns ± 1%   -8.86%  (p=0.000 n=10+9)
CGo10-4  2.03µs ± 4%  0.77µs ± 3%  -61.97%  (p=0.000 n=8+10)

And if I comment out only entersyscall/exitsyscall, so that the whole patch is:

diff --git a/src/runtime/cgocall.go b/src/runtime/cgocall.go
index ce4d707e06..243688f0af 100644
--- a/src/runtime/cgocall.go
+++ b/src/runtime/cgocall.go
@@ -129,9 +129,9 @@ func cgocall(fn, arg unsafe.Pointer) int32 {
        // "system call", run the Go code (which may grow the stack),
        // and then re-enter the "system call" reusing the PC and SP
        // saved by entersyscall here.
-       entersyscall(0)
+//     entersyscall(0)
        errno := asmcgocall(fn, arg)
-       exitsyscall(0)
+//     exitsyscall(0)
 
        // From the garbage collector's perspective, time can move
        // backwards in the sequence above. If there's a callback into

then compared to unmodified tip it becomes:

$ benchstat dv0.txt dv1nosysonly.txt 
name     old time/op  new time/op  delta
CGo-4     576ns ± 1%   515ns ± 1%  -10.62%  (p=0.000 n=10+10)
CGo10-4  2.22µs ± 2%  0.84µs ± 2%  -62.07%  (p=0.000 n=9+10)

so in the case of CGo calls the notesleep spin patch brings a speedup not because of lockOSThread but due to entersyscall/exitsyscall being there.

@dvyukov
Member

dvyukov commented Sep 13, 2017

Yes, entersyscall/exitsyscall is what interacts with the Go scheduler in the case of cgo calls.

@navytux
Contributor

navytux commented Sep 13, 2017

I guess it is a general tradeoff between latency and burning CPU.
If we do more aggressive scheduler spinning, it can make sense to do it directly in findrunnable/stoplockedm.
But generally it's unclear to me whether this translates to real-world improvements. Synthetic synchronization tests are usually misleading. If a locked goroutine blocks for more than 10us, then we just burn 10us of CPU time in vain.

I generally agree, but some kind of adaptive spinning could be brought to notesleep/notewakeup too,
similar to what you did to runtime.lock/unlock in 4e5086b (runtime: improve Linux mutex):

4e5086b#diff-608e335144c55dc824f257f5a66ac4d3R125
https://github.com/golang/go/blob/c2f8ed26/src/runtime/lock_futex.go#L54

because currently notesleep always goes directly to the kernel via sys_futex.
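To make that escalation concrete, here is an illustrative user-space analogue of such a two-phase wait (active spin, then yield, then sleep); the thresholds are made up, and runtime.Gosched/time.Sleep merely stand in for the runtime's procyield/osyield/futexsleep:

package main

import (
        "runtime"
        "sync/atomic"
        "time"
)

const (
        activeSpin  = 4  // tight spins before yielding
        passiveSpin = 10 // yields before really sleeping
)

// waitFlag blocks until *flag becomes non-zero, escalating from
// busy-waiting to yielding to sleeping, analogous to an adaptive notesleep.
func waitFlag(flag *uint32) {
        for spin := 0; atomic.LoadUint32(flag) == 0; spin++ {
                switch {
                case spin < activeSpin:
                        // busy-wait: cheapest if the wakeup is imminent
                case spin < activeSpin+passiveSpin:
                        runtime.Gosched() // give the CPU up without a heavy sleep
                default:
                        time.Sleep(10 * time.Microsecond) // heavy sleep, may context switch
                }
        }
}

func main() {
        var flag uint32
        go func() {
                time.Sleep(50 * time.Microsecond)
                atomic.StoreUint32(&flag, 1)
        }()
        waitFlag(&flag)
}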

@navytux
Contributor

navytux commented Sep 13, 2017

And by the way, many fast syscalls is not a synthetic-benchmark scenario: they appear in real programs, either reading quickly from cached files or sending/receiving on the network (yes, the network is epolled, but every send/recv still goes through a full, not raw, syscall). In my experience every such event has the potential to shake the scheduler.

@dvyukov
Member

dvyukov commented Sep 13, 2017

notesleep is not meant for sleeping for brief periods; there are callers for which spinning will be plainly harmful. mutex is meant for brief blocking.
Non-blocking network calls should not invoke the scheduler: they don't give up the P for 20us, while a non-blocking read/write takes on the order of 5us.

@navytux
Contributor

navytux commented Sep 13, 2017

Thanks for the feedback. I understand there is a difference between notesleep and mutex, and that notesleep is by definition a heavier sleep. With this in mind, and an adjusted patch:

diff --git a/src/runtime/lock_futex.go b/src/runtime/lock_futex.go
index 9d55bd129c..5648ef66f3 100644
--- a/src/runtime/lock_futex.go
+++ b/src/runtime/lock_futex.go
@@ -146,7 +146,14 @@ func notesleep(n *note) {
                // Sleep for an arbitrary-but-moderate interval to poll libc interceptors.
                ns = 10e6
        }
-       for atomic.Load(key32(&n.key)) == 0 {
+       for spin := 0; atomic.Load(key32(&n.key)) == 0; spin++ {
+               // spin a bit hoping we'll get the wakeup soon, hopefully without a context switch
+               if spin < 10 {
+                       osyield()
+                       continue
+               }
+
+               // no luck -> go to the kernel to sleep; this might result in a context switch
                gp.m.blocked = true
                futexsleep(key32(&n.key), 0, ns)
                if *cgo_yield != nil {

it still works for both the LockOSThread and CGo cases:

$ benchstat dv00.txt dv0+osyield.txt 
name        old time/op  new time/op  delta
Unlocked-4   482ns ± 1%   478ns ± 0%   -0.90%  (p=0.011 n=10+9)
Locked-4    5.08µs ± 1%  1.51µs ± 4%  -70.29%  (p=0.000 n=9+10)
CGo-4        577ns ± 0%   558ns ± 1%   -3.38%  (p=0.000 n=9+10)
CGo10-4     2.22µs ± 3%  1.30µs ± 1%  -41.47%  (p=0.000 n=10+10)

but the spinning now does not wastefully burn CPU; it yields to the OS scheduler via osyield(). This works similarly to the "passive spinning" phase of mutex lock. For mutex, N(passive-spin) = 1, and since notesleep is heavier, having its N(passive-spin) an order of magnitude larger seems logical. osyield will release the CPU to other threads if there is other work to do, and thus it hopefully should not have a negative impact (this has to be verified).

Non-blocking network calls should not invoke the scheduler: they don't give up the P for 20us, while a non-blocking read/write takes on the order of 5us.

Maybe you are right here. I will try to reverify this once I get to a related topic and will hopefully come back with feedback.

@kostix

kostix commented Sep 13, 2017

@dvyukov, could you please clear up the

Non-blocking network calls should not invoke the scheduler: they don't give up the P for 20us, while a non-blocking read/write takes on the order of 5us.

bit for me?

Do I parse it correctly that network syscalls are treated specially, and the P is not removed from under a G which spends more than 20 us in such a syscall, the way it is done for all other syscalls?

Or does this merely happen "all by itself", as a byproduct of such syscalls commonly lasting only about 5 us (thanks to the sockets being non-blocking)?

@dvyukov
Member

dvyukov commented Sep 13, 2017

It happens for all syscalls/cgocalls if they return faster than 20us.

@dvyukov
Member

dvyukov commented Sep 13, 2017

but the spinning now does not wastefully burn CPU; it yields to the OS scheduler via osyield().

This is still wasteful for some callers. Exactly which caller of notesleep is affected by this change? Can we improve the caller instead?

@navytux
Contributor

navytux commented Sep 13, 2017

For the LockOSThread case it seems to be stoplockedm/startlockedm:

locked+osyield
(svg)

However for the CGo case it is less clear, at least from the CPU-profile point of view:

cgo+osyield
(svg)

@navytux
Contributor

navytux commented Sep 13, 2017

Btw, do you have a particular example where osyielding in notesleep would be wasteful? (The Go scheduler is new to me.)

@dvyukov
Member

dvyukov commented Sep 13, 2017

I would try spinning more in findrunnable, before dropping the P; that can yield a better result without impacting other cases.

Re stoplockedm, I am still not sure. We could also spin there before dropping the P, but then this is an optimization for the case when a locked G is unblocked very quickly. It's unclear how often this happens in real life, and even if it does happen it still won't be fast because of the required thread jumps. But this will hurt other goroutines because we are holding the P.
We could spin just in notesleep when called from stoplockedm, but then at least we need to avoid the futexwake as well. And it's still not clear whether that is the right thing to do.

If you care about the performance of locked goroutines, there seems to be a bunch of other missed optimizations. E.g. when a locked G is woken, we wake another M, but that M cannot execute the G, so it wakes the right M and passes the G/P to it. I guess we need to wake the right M right away.

@navytux
Contributor

navytux commented Sep 13, 2017

@dvyukov thanks for your feedback and knowledge sharing - it is appreciated.

My situation is this: I have a case where my server is misperforming, I think because of the Go scheduler (tracing shows many spawned goroutines are queued to the same P and are not stolen, while I need all of them to execute in parallel to significantly reduce latency; other Ps are idle). I do not use LockOSThread or CGo there at all at the moment. I cannot even describe the problem properly yet because it has not been fully analyzed. I spent some time learning how the Go scheduler works to better understand what happens when a goroutine is started, a channel is sent/received, etc. Along the way I wrote a quick tool to profile how often goroutines migrate between Ms:

https://lab.nexedi.com/kirr/neo/blob/851864a9/go/gmigrate.go

because unfortunately this seems to be a frequent event, and changing M probably means changing CPU and thus losing CPU caches.

After studying, I quickly looked around for scheduler bugs here on the issue tracker and found the LockOSThread case. I tried to test whether I understood at least something by trying to fix it, and so we are here.

So if someone else does not fix the LockOSThread and CGo cases before me, I will hopefully give them a fresh look while working on my scheduler-related issues. However, the scheduler bits are currently lower priority for me than proper memory and other tuning, so it will be some time before I can dig into this topic in more detail.

Thanks again for the feedback, and apologies for the throttling,
Kirill
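As an aside to the goroutine-migration profiling mentioned above: a much cruder way to observe migration, without such a tool, is to sample the OS thread id from a goroutine around blocking operations. A Linux-only illustrative sketch:

package main

import (
        "fmt"
        "syscall"
        "time"
)

func main() {
        c := make(chan int)
        go func() {
                for i := 0; i < 5; i++ {
                        c <- i
                }
                close(c)
        }()
        for v := range c {
                // If the printed tid changes between iterations, the goroutine
                // was migrated to another M (and likely another CPU).
                fmt.Printf("v=%d tid=%d\n", v, syscall.Gettid())
                time.Sleep(time.Millisecond)
        }
}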

@dvyukov
Member

dvyukov commented Sep 13, 2017

tracing shows many spawned goroutines are queued to the same P and are not stolen, while I need all of them to execute in parallel to significantly reduce latency; other Ps are idle

Sounds pretty bad, but it does not look like an issue with notesleep.
A repro would be useful (and/or a trace file).

@RLH
Contributor

RLH commented Sep 13, 2017

@navytux
Contributor

navytux commented Sep 13, 2017

It is running on my notebook, which has 4 CPUs (2 physical * 2 HT). There are 2 programs: server + client. The client tries to issue 512 simultaneous networking requests over the same TCP connection to the server and then waits for completion, but from what I recall even initiating them goes mostly serially. GOMAXPROCS is unadjusted and defaults to ncpu=4. No other program significantly uses the processor.

I admit networking over loopback is not the same as networking on a LAN (on a LAN RTT ~= 500μs, while for TCP/loopback and separate processes it is around 5-8μs). We do care about the loopback case too, though.

I promise I'll get back to this once I'm done with other urgent tunings. Probably in a week or two.

@ianlancetaylor
Contributor

ianlancetaylor commented Sep 13, 2017

Signals run on the gsignal stack. As far as I can see they are OK.

SWIG code can cause a Go panic, but that should be fine because it does it by calling back into Go. The panic will then skip over the C stack while unwinding the Go stack, so I don't think there is a problem there.

@valyala
Contributor

valyala commented Sep 13, 2017

I would try spinning more in findrunnable, before dropping the P; that can yield a better result without impacting other cases.

@dvyukov, this sounds great if it reduces G migration between CPU cores, leading to hotter per-CPU caches and better performance on multi-core systems. IMHO, it is better to wait for a few microseconds in the hope of finding a local runnable G before trying to steal a G from other Ps or from the global queue.

@dvyukov
Member

dvyukov commented Sep 13, 2017

@valyala Local g's won't appear while a thread is spinning. Only the thread itself can submit local g's; if it's idle it won't submit any g's.

@valyala
Contributor

valyala commented Sep 14, 2017

@dvyukov, I thought local ready-to-run g's could appear because of completed I/O operations or timers that were issued by local g's.

@ianlancetaylor
Contributor

ianlancetaylor commented Sep 15, 2017

@valyala I can't tell if you still think that or no longer think that. To be clear, I believe that is not the case: local ready-to-run g's only appear when a g currently running on the P takes some action to add them to the current P's queue.

@gopherbot

gopherbot commented Sep 15, 2017

Change https://golang.org/cl/64070 mentions this issue: runtime: don't call lockOSThread for every cgo call

gopherbot pushed a commit that referenced this issue Sep 22, 2017

runtime: don't call lockOSThread for every cgo call
For a trivial benchmark with a do-nothing cgo call:

name    old time/op  new time/op  delta
Call-4  64.5ns ± 7%  63.0ns ± 6%  -2.25%  (p=0.027 n=20+16)

Because Windows uses the cgocall mechanism to make system calls,
and passes arguments in a struct held in the m,
we need to do the lockOSThread/unlockOSThread in that code.

Because deferreturn was getting a nosplit stack overflow error,
change it to avoid calling typedmemmove.

Updates #21827.

Change-Id: I9b1d61434c44faeb29805b46b409c812c9acadc2
Reviewed-on: https://go-review.googlesource.com/64070
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: David Crawshaw <crawshaw@golang.org>

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@ianlancetaylor ianlancetaylor changed the title from runtime: big performance penalty with runtime.LockOSThread (and so potentially at every CGo call) to runtime: big performance penalty with runtime.LockOSThread Nov 22, 2017

@navytux
Contributor

navytux commented Dec 30, 2017

I promise I'll get back to this once I'm done with other urgent tunings. Probably in a week or two.

I have to apologize: I got swamped by other work, and now there is close to zero chance for me to return to this topic in the near future. I'm sorry for that.

The original bug report still stands: there is a performance penalty related to LockOSThread, and it could be improved.

Kirill

navytux added a commit to navytux/go-sqlite3 that referenced this issue Feb 16, 2018

Don't spawn interrupt goroutine if we know that context cannot be canceled

For a Go-only project the following code pattern

	go func() {
		select {
		case <-ctx.Done():
			// call some cancel

		case <-done:
			// work finished ok
		}
	}()

	// do some work
	close(done)

works well and fast - without high scheduling overhead, because the scheduler
usually puts the spawned goroutine into the run queue on the same OS thread,
and so after done is closed, control is passed to the spawned goroutine
without an OS context switch.

However, in the presence of Cgo calls in "do some work" the situation can
become different: Cgo calls are treated by the Go runtime similarly to
system calls, with the effect that goroutines spawned on the original OS
thread tend to be migrated by the scheduler to another OS thread.

This in turn can bring high overhead for communicating on "done", which
ultimately can result in a full context switch: the spawned goroutine had
a chance to run, already saw done and ctx as not ready, and went to sleep
via a futex wait - showing up as something like the following in strace
for one read query (note the futex calls):

	27867 00:38:39.782146 stat(".../neo.sqlite-journal", 0x7f83809c4a20) = -1 ENOENT (No such file or directory)
	27867 00:38:39.782165 pread64(3, "\0\0\0\33\0\0\10\235\0\0\10]\0\0\0\27", 16, 24) = 16
	27871 00:38:39.782179 <... pselect6 resumed> ) = 0 (Timeout)
	27868 00:38:39.782187 <... pselect6 resumed> ) = 0 (Timeout)
	27871 00:38:39.782193 futex(0xc4200f8538, FUTEX_WAIT, 0, NULL <unfinished ...>
	27868 00:38:39.782199 futex(0xc420013138, FUTEX_WAIT, 0, NULL <unfinished ...>
	27867 00:38:39.782205 stat(".../neo.sqlite-wal", 0x7f83809c4a20) = -1 ENOENT (No such file or directory)
	27867 00:38:39.782224 fstat(3, {st_mode=S_IFREG|0644, st_size=9031680, ...}) = 0
	27867 00:38:39.782247 futex(0xc420013138, FUTEX_WAKE, 1 <unfinished ...>
	27868 00:38:39.782259 <... futex resumed> ) = 0
	27867 00:38:39.782265 <... futex resumed> ) = 1
	27868 00:38:39.782270 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
	27867 00:38:39.782279 fcntl(3, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
	27867 00:38:39.782315 fcntl(3, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
	27868 00:38:39.782336 <... pselect6 resumed> ) = 0 (Timeout)
	27867 00:38:39.782342 fcntl(3, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510} <unfinished ...>
	27868 00:38:39.782348 futex(0xc4200f8538, FUTEX_WAKE, 1 <unfinished ...>
	27867 00:38:39.782355 <... fcntl resumed> ) = 0
	27871 00:38:39.782360 <... futex resumed> ) = 0
	27868 00:38:39.782367 <... futex resumed> ) = 1
	27871 00:38:39.782372 futex(0xc4200f8138, FUTEX_WAKE, 1 <unfinished ...>
	27868 00:38:39.782377 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
	27871 00:38:39.782384 <... futex resumed> ) = 1
	27870 00:38:39.782389 <... futex resumed> ) = 0
	27867 00:38:39.782394 fcntl(3, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1} <unfinished ...>
	27870 00:38:39.782400 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
	27867 00:38:39.782408 <... fcntl resumed> ) = 0

The link below shows that the Go scheduler itself might be significantly
improved for cases when several Cgo calls are made per request in a server:

golang/go#21827 (comment)

In particular the CGo-4 case should be closely related to this sqlite3 Go
package, because for one query many CGo calls are made to SQLite.

However, until there are proper scheduler fixes, let's do what can be done
to improve query time:

If we know that the context under which a query is executed can never be
canceled, we can safely skip spawning the interrupt goroutine, and thus
avoid the ping-pong on done between different OS threads.

This brings the following speedup on my notebook with go1.10:

name               old req/s    new req/s    delta
Exec                 254k ± 1%    379k ± 1%  +48.89%  (p=0.000 n=10+10)
Query               90.6k ± 2%   96.4k ± 1%   +6.37%  (p=0.000 n=10+10)
Params              81.5k ± 1%   87.0k ± 1%   +6.83%  (p=0.000 n=10+10)
Stmt                 122k ± 2%    129k ± 1%   +6.07%  (p=0.000 n=10+9)
Rows                2.98k ± 1%   3.06k ± 1%   +2.77%  (p=0.000 n=9+10)
StmtRows            3.10k ± 1%   3.13k ± 1%   +1.12%  (p=0.000 n=9+10)

name               old time/op  new time/op  delta
CustomFunctions-4  10.6µs ± 1%  10.1µs ± 1%   -5.01%  (p=0.000 n=10+10)
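The commit's optimization boils down to checking ctx.Done(), which returns nil exactly for never-canceled contexts such as context.Background(). A hedged sketch of the pattern (names are illustrative, not the actual go-sqlite3 API):

package tmp

import "context"

// query runs do(), arranging for interrupt() to be called if ctx is
// canceled mid-run. When ctx can never be canceled (Done() == nil),
// the watcher goroutine is skipped entirely, avoiding the done
// ping-pong between OS threads described above.
func query(ctx context.Context, do func() error, interrupt func()) error {
        if ctx.Done() == nil {
                return do() // cancellation impossible: no watcher needed
        }
        done := make(chan struct{})
        defer close(done)
        go func() {
                select {
                case <-ctx.Done():
                        interrupt() // e.g. would call sqlite3_interrupt via cgo
                case <-done:
                }
        }()
        return do()
}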
@navytux

This comment has been minimized.

Show comment
Hide comment
@navytux

navytux Feb 16, 2018

Contributor

#21827 (comment):

compared to unmodified tip it becomes:

 $ benchstat dv0.txt dv1nosysonly.txt 
 name     old time/op  new time/op  delta
 CGo-4     576ns ± 1%   515ns ± 1%  -10.62%  (p=0.000 n=10+10)
 CGo10-4  2.22µs ± 2%  0.84µs ± 2%  -62.07%  (p=0.000 n=9+10)

so in case of CGo calls the notesleep spin patch brings speedup not becuase of lockOSThread but due to entersyscall/exitsyscall being there.

For the reference: I've hit the "CGo10" case in practice with SQLite (only Cgo, no LockOSThread) where presense of other goroutines combined with Cgo calls on "main" one show big overhead: github.com/mattn/go-sqlite3 uses CGo and performs several CGo calls inside one Query or Exec. There was also an interrupt goroutine spawned for every Query or Exec to call sqlite3_interrup if context is canceled.

With Go1.10 avoiding to spawn that interrupt goroutine, if we know the context cannot be canceled, brings ~ 1.5x speedup to Exec (mattn/go-sqlite3#530).

However Query was made faster only by 5% because after 2b283ce (/cc @kardianos, @bradfitz) database/sql always spawns a goroutine to close Rows on context or transaction cancel.

( @kardianos it would be nice if somehow we could avoid spawning Rows.awaitDone if original query context cannot be cancelled. Because with the following test patch:

diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index 8f5588ed26..4345aa736a 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2568,6 +2568,7 @@ type Rows struct {
 }
 
 func (rs *Rows) initContextClose(ctx, txctx context.Context) {
+       return
        ctx, rs.cancel = context.WithCancel(ctx)
        go rs.awaitDone(ctx, txctx)
 }

SQLite Query and rest becomes speed up too:

    name               old req/s    new req/s    delta
    Exec                 379k ± 1%    375k ± 3%     ~     (p=0.218 n=10+10)
    Query               96.4k ± 1%  132.2k ± 3%  +37.14%  (p=0.000 n=10+10)
    Params              87.0k ± 1%  117.2k ± 3%  +34.66%  (p=0.000 n=10+10)
    Stmt                 129k ± 1%    178k ± 2%  +37.45%  (p=0.000 n=9+9)
    Rows                3.06k ± 1%   3.45k ± 1%  +12.49%  (p=0.000 n=10+9)
    StmtRows            3.13k ± 1%   3.54k ± 1%  +12.85%  (p=0.000 n=10+9)

    name               old time/op  new time/op  delta
    CustomFunctions-4  10.1µs ± 1%   7.2µs ± 2%  -28.68%  (p=0.000 n=10+10)

/cc @mattn)

The SQLite test benchmark changes reflect the numbers I see in my real project.

Applying my notesleep spinning patches does not help here, but clearly something needs to be done to remove this Cgo + go around large overhead.

Thanks,
Kirill

Contributor

navytux commented Feb 16, 2018

#21827 (comment):

compared to unmodified tip it becomes:

 $ benchstat dv0.txt dv1nosysonly.txt 
 name     old time/op  new time/op  delta
 CGo-4     576ns ± 1%   515ns ± 1%  -10.62%  (p=0.000 n=10+10)
 CGo10-4  2.22µs ± 2%  0.84µs ± 2%  -62.07%  (p=0.000 n=9+10)

so in case of CGo calls the notesleep spin patch brings speedup not becuase of lockOSThread but due to entersyscall/exitsyscall being there.

For the reference: I've hit the "CGo10" case in practice with SQLite (only Cgo, no LockOSThread) where presense of other goroutines combined with Cgo calls on "main" one show big overhead: github.com/mattn/go-sqlite3 uses CGo and performs several CGo calls inside one Query or Exec. There was also an interrupt goroutine spawned for every Query or Exec to call sqlite3_interrup if context is canceled.

With Go1.10 avoiding to spawn that interrupt goroutine, if we know the context cannot be canceled, brings ~ 1.5x speedup to Exec (mattn/go-sqlite3#530).

However Query was made faster only by 5% because after 2b283ce (/cc @kardianos, @bradfitz) database/sql always spawns a goroutine to close Rows on context or transaction cancel.

( @kardianos it would be nice if somehow we could avoid spawning Rows.awaitDone if original query context cannot be cancelled. Because with the following test patch:

diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index 8f5588ed26..4345aa736a 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2568,6 +2568,7 @@ type Rows struct {
 }
 
 func (rs *Rows) initContextClose(ctx, txctx context.Context) {
+       return
        ctx, rs.cancel = context.WithCancel(ctx)
        go rs.awaitDone(ctx, txctx)
 }
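
For illustration, a hedged sketch of what a real version of that check could look like: a context that can never be canceled reports Done() == nil, so the goroutine could be skipped exactly in that case (the surrounding names are from database/sql; the condition itself is my assumption):

    func (rs *Rows) initContextClose(ctx, txctx context.Context) {
        // If neither the query context nor the transaction context
        // can ever be canceled, the watchdog goroutine is not needed.
        if ctx.Done() == nil && (txctx == nil || txctx.Done() == nil) {
            return
        }
        ctx, rs.cancel = context.WithCancel(ctx)
        go rs.awaitDone(ctx, txctx)
    }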

SQLite Query and the rest speed up too:

    name               old req/s    new req/s    delta
    Exec                 379k ± 1%    375k ± 3%     ~     (p=0.218 n=10+10)
    Query               96.4k ± 1%  132.2k ± 3%  +37.14%  (p=0.000 n=10+10)
    Params              87.0k ± 1%  117.2k ± 3%  +34.66%  (p=0.000 n=10+10)
    Stmt                 129k ± 1%    178k ± 2%  +37.45%  (p=0.000 n=9+9)
    Rows                3.06k ± 1%   3.45k ± 1%  +12.49%  (p=0.000 n=10+9)
    StmtRows            3.13k ± 1%   3.54k ± 1%  +12.85%  (p=0.000 n=10+9)

    name               old time/op  new time/op  delta
    CustomFunctions-4  10.1µs ± 1%   7.2µs ± 2%  -28.68%  (p=0.000 n=10+10)

/cc @mattn)

The SQLite test benchmark changes reflect the numbers I see in my real project.

Applying my notesleep spinning patches does not help here, but clearly something needs to be done to remove this large Cgo + goroutine overhead.

Thanks,
Kirill

@ianlancetaylor

Contributor

ianlancetaylor commented Feb 16, 2018

As far as I can tell there is no longer a known performance penalty due to runtime.LockOSThread. Since that is what this issue is about, I'm going to close it.

We know that cgo will run much faster if we change it to not call into the scheduler, but that has the downside that many programs will stop working. And in any case that should be discussed in a different issue that is not about LockOSThread.

@kardianos

Contributor

kardianos commented Feb 16, 2018

The database SQL change should also be discussed in a new issue.

@navytux

Contributor

navytux commented Feb 16, 2018

The first message in this issue states that there is a performance penalty when locked threads are used in servers and that there is room to improve it by making scheduler-related changes. There my spinning patch demonstrated that the penalty could be reduced 4x. Reverifying this with today's tip (go version devel +c941e27e70 Fri Feb 16 19:28:41 2018 +0000 linux/amd64) on Linux 4.14.13 and i5-3427U (a different CPU from the original report) shows:

    name        old time/op  new time/op  delta
    Unlocked-4   695ns ± 0%   682ns ± 1%   -1.78%  (p=0.000 n=9+10)
    Locked-4    10.7µs ± 0%   1.7µs ± 3%  -83.82%  (p=0.000 n=8+10)

in other words, the original problem, that there is a big performance penalty when runtime.LockOSThread is in use, is still there. Thus to me this issue was closed incorrectly and I suggest reopening it.

About Cgo: I was not saying a Cgo call should not call the scheduler, but that the scheduler itself could be improved. By the way, on this machine the same draft spinning patch that helps the LockOSThread case above also helps when only Cgo is used on the server:

CGo-4        779ns ± 1%   769ns ± 0%   -1.34%  (p=0.000 n=10+9)
CGo10-4     2.19µs ± 6%  1.60µs ± 4%  -26.70%  (p=0.000 n=9+10)

Sorry if mentioning the Cgo-related bits here was taken as off-topic; I thought it is all related and overlaps on common scheduling ground.

The database SQL topic was moved to #23879.

Kirill

@ianlancetaylor

Contributor

ianlancetaylor commented Feb 17, 2018

OK, reopening.

Yes, all the cgo stuff is a distraction, please put cgo comments on a separate issue.

navytux added a commit to navytux/go-sqlite3 that referenced this issue Feb 17, 2018

Don't spawn interrupt goroutine if we know that context cannot be canceled

For a Go-only project the following code pattern

	go func() {
		select {
		case <-ctx.Done():
			// call some cancel

		case <-done:
			// work finished ok
		}
	}()

	// do some work
	close(done)

works well and fast, without high scheduling overhead, because the
scheduler usually puts the spawned goroutine into the run queue on the
same OS thread, and so after done is closed control passes to the
spawned goroutine without an OS context switch.

However, in the presence of Cgo calls in "do some work" the situation
can be different: Cgo calls are treated by the Go runtime similarly to
system calls, with the effect that goroutines spawned on the original
OS thread tend to be migrated by the scheduler to another OS thread.

This in turn can bring high overhead for communicating on "done", and
can ultimately result in a full context switch: the spawned goroutine
had a chance to run, already found done and ctx not ready, and went to
sleep waiting on a futex. This shows up like the following in strace
for one read query (note the futex calls):

	27867 00:38:39.782146 stat(".../neo.sqlite-journal", 0x7f83809c4a20) = -1 ENOENT (No such file or directory)
	27867 00:38:39.782165 pread64(3, "\0\0\0\33\0\0\10\235\0\0\10]\0\0\0\27", 16, 24) = 16
	27871 00:38:39.782179 <... pselect6 resumed> ) = 0 (Timeout)
	27868 00:38:39.782187 <... pselect6 resumed> ) = 0 (Timeout)
	27871 00:38:39.782193 futex(0xc4200f8538, FUTEX_WAIT, 0, NULL <unfinished ...>
	27868 00:38:39.782199 futex(0xc420013138, FUTEX_WAIT, 0, NULL <unfinished ...>
	27867 00:38:39.782205 stat(".../neo.sqlite-wal", 0x7f83809c4a20) = -1 ENOENT (No such file or directory)
	27867 00:38:39.782224 fstat(3, {st_mode=S_IFREG|0644, st_size=9031680, ...}) = 0
	27867 00:38:39.782247 futex(0xc420013138, FUTEX_WAKE, 1 <unfinished ...>
	27868 00:38:39.782259 <... futex resumed> ) = 0
	27867 00:38:39.782265 <... futex resumed> ) = 1
	27868 00:38:39.782270 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
	27867 00:38:39.782279 fcntl(3, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0
	27867 00:38:39.782315 fcntl(3, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1}) = 0
	27868 00:38:39.782336 <... pselect6 resumed> ) = 0 (Timeout)
	27867 00:38:39.782342 fcntl(3, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1073741826, l_len=510} <unfinished ...>
	27868 00:38:39.782348 futex(0xc4200f8538, FUTEX_WAKE, 1 <unfinished ...>
	27867 00:38:39.782355 <... fcntl resumed> ) = 0
	27871 00:38:39.782360 <... futex resumed> ) = 0
	27868 00:38:39.782367 <... futex resumed> ) = 1
	27871 00:38:39.782372 futex(0xc4200f8138, FUTEX_WAKE, 1 <unfinished ...>
	27868 00:38:39.782377 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
	27871 00:38:39.782384 <... futex resumed> ) = 1
	27870 00:38:39.782389 <... futex resumed> ) = 0
	27867 00:38:39.782394 fcntl(3, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1073741824, l_len=1} <unfinished ...>
	27870 00:38:39.782400 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
	27867 00:38:39.782408 <... fcntl resumed> ) = 0

The link below shows that the Go scheduler itself might be
significantly improved for cases where several Cgo calls are made per
request in a server:

golang/go#21827 (comment)

in particular the CGo-4 case should be closely related to this sqlite3
Go package, because many CGo calls are made to SQLite for one query.

However, until there are proper scheduler fixes, let's do what can be
done to improve query time:

If we know that the context under which a query is executed will never
be canceled, we can safely skip spawning the interrupt goroutine and
this way avoid the ping-pong on done between different OS threads.
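
A hedged sketch of the shape of the change (simplified; the real code
is in the referenced pull request, and the db handle name plus the
C.sqlite3_interrupt binding are assumptions):

	if ctx.Done() != nil {
		// The context can actually be canceled: keep the watchdog.
		done := make(chan struct{})
		defer close(done)
		go func() {
			select {
			case <-ctx.Done():
				C.sqlite3_interrupt(db) // abort the in-flight query
			case <-done:
			}
		}()
	}
	// ... perform the Cgo calls for the query ...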

This brings the following speedup on my notebook with go1.10:

name               old req/s    new req/s    delta
Exec                 254k ± 1%    379k ± 1%  +48.89%  (p=0.000 n=10+10)
Query               90.6k ± 2%   96.4k ± 1%   +6.37%  (p=0.000 n=10+10)
Params              81.5k ± 1%   87.0k ± 1%   +6.83%  (p=0.000 n=10+10)
Stmt                 122k ± 2%    129k ± 1%   +6.07%  (p=0.000 n=10+9)
Rows                2.98k ± 1%   3.06k ± 1%   +2.77%  (p=0.000 n=9+10)
StmtRows            3.10k ± 1%   3.13k ± 1%   +1.12%  (p=0.000 n=9+10)

name               old time/op  new time/op  delta
CustomFunctions-4  10.6µs ± 1%  10.1µs ± 1%   -5.01%  (p=0.000 n=10+10)
@navytux

Contributor

navytux commented Feb 18, 2018

Thanks for reopening.

I've moved CGo stuff to #19574 (comment).

@ianlancetaylor

Contributor

ianlancetaylor commented Jul 9, 2018

I looked at this a bit. I find that if I add time.Sleep(time.Microsecond) to the loop in the function _server, then the locked variant performs much better:

BenchmarkUnlocked-8   	   20000	     58347 ns/op
BenchmarkLocked-8     	  100000	     17142 ns/op

So I think this may be a misleading benchmark. In the unmodified benchmark, the server races ahead of the clients, and frequently goes to sleep reading from the channel. Then, I suspect, writes to the channel wake up the serving goroutine, but on the wrong M. So it has to then wake up the right M, as @dvyukov describes at #21827 (comment). So if there is something to fix, it should probably focus on that, not on the behavior of notesleep.
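
For concreteness, a hedged reconstruction of the shape of that change (the real _server is in the issue's first message, so the body here is assumed):

    func _server(c chan int, locked bool) {
        if locked {
            runtime.LockOSThread()
        }
        for x := range c {
            time.Sleep(time.Microsecond) // the added sleep
            _ = x // ... handle the request and reply ...
        }
    }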

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Unplanned Jul 9, 2018

@navytux

Contributor

navytux commented Jul 9, 2018

@ianlancetaylor, the 1μs sleep in the server causes some scheduler badness for the unlocked case:

$ gotip test -count=3 -bench Unlocked tmp3_test.go 
goos: linux
goarch: amd64
BenchmarkUnlocked-4        20000             66237 ns/op
BenchmarkUnlocked-4        20000             66948 ns/op
BenchmarkUnlocked-4        20000             66971 ns/op
$ gotip test -count=3 -trace x.trace -bench Unlocked tmp3_test.go 
goos: linux
goarch: amd64
BenchmarkUnlocked-4       100000             14993 ns/op
BenchmarkUnlocked-4       100000             14758 ns/op
BenchmarkUnlocked-4       200000             15068 ns/op

i.e. it runs ~ 4.5x slower when run without tracing enabled.

For reference, the time for the locked case only somewhat increases if tracing is enabled.

$ gotip test -count=3 -bench Locked tmp3_test.go 
goos: linux
goarch: amd64
BenchmarkLocked-4         100000             13924 ns/op
BenchmarkLocked-4         100000             13800 ns/op
BenchmarkLocked-4         100000             14025 ns/op
$ gotip test -count=3 -trace y.trace -bench Locked tmp3_test.go 
goos: linux
goarch: amd64
BenchmarkLocked-4         100000             18203 ns/op
BenchmarkLocked-4         100000             18162 ns/op
BenchmarkLocked-4         100000             18230 ns/op
go version devel +b56e24782f Mon Jul 9 02:18:16 2018 +0000 linux/amd64
@navytux

Contributor

navytux commented Jul 9, 2018

I generally agree that what @dvyukov suggests should be the way to go, not my silly spinning patch.
