Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: significant regression in pthread_cond_timedwait performance on Darwin #26019

Closed
nvanbenschoten opened this issue Jun 23, 2018 · 10 comments

Comments

Projects
None yet
5 participants
@nvanbenschoten
Copy link
Contributor

commented Jun 23, 2018

What version of Go are you using (go version)?

Custom build of go1.11beta1

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/nathan/Library/Caches/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/nathan/Go"
GORACE=""
GOROOT="/Users/nathan/Go/go"
GOTMPDIR=""
GOTOOLDIR="/Users/nathan/Go/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/pj/99_s_zrd5tx0fv6pkqjzstkw0000gn/T/go-build818492431=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Compiled CockroachDB, ran basic kv load generator against it.

What did you expect to see?

Comparable (hopefully better!) performance to go1.10.

What did you see instead?

A large (~30%) throughput reduction. Comparing CPU profiles between 1.10 and go1.11beta1 revealed a significant increase in time spent in runtime.pthread_cond_timedwait:

go1.11beta1

screenshot-2018-6-22 unknown cpu

screenshot-2018-6-22 unknown cpu 1

go1.10

screenshot-2018-6-22 unknown cpu 2

A git blame of runtime.pthread_cond_timedwait indicated that 5b5473c could be the cause of the regression. Re-running my benchmark with a binary compiled with and without that commit confirmed it to be cuprit.

cc. @randall77

@nvanbenschoten

This comment has been minimized.

Copy link
Contributor Author

commented Jun 23, 2018

To clarify, runtime.pthread_cond_timedwait was introduced in 5b5473c and is now dominating the cost of calling runtime.semasleep. On the same workload, runtime.semasleep consumes 0.39% of CPU before 5b5473c and 35.32% after.

@nvanbenschoten

This comment has been minimized.

Copy link
Contributor Author

commented Jun 23, 2018

runtime.pthread_cond_timedwait doesn't do anything suspicious except call straight into libc's pthread_cond_timedwait. We must be seeing some effect in there.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 23, 2018

CC @randall77

We changed the way we handle condition variables to be more compatible with Darwin mechanisms. There may not be much we can do here.

@benesch

This comment has been minimized.

Copy link
Contributor

commented Jun 23, 2018

runtime.pthread_cond_timedwait doesn't do anything suspicious except call straight into libc's pthread_cond_timedwait. We must be seeing some effect in there.

Or perhaps we're not calling pthread_cond_timedwait correctly. Based on a cursory read it seems pthread_cond_timedwait wants an absolute time (i.e., gettimeofday() + ns), but we're passing in ns directly. That seems wrong.

I haven't spent more than about 5m looking at this so it's possible I'm entirely off base.

@benesch

This comment has been minimized.

Copy link
Contributor

commented Jun 23, 2018

And, indeed, replacing pthread_cond_timedwait with pthread_cond_timedwait_relative_np seems to resolve the problem.

diff --git a/src/runtime/os_darwin.go b/src/runtime/os_darwin.go
index 5019b9423c..ff375004a3 100644
--- a/src/runtime/os_darwin.go
+++ b/src/runtime/os_darwin.go
@@ -45,7 +45,7 @@ func semasleep(ns int64) int32 {
 		if ns >= 0 {
 			var t timespec
 			t.set_nsec(ns)
-			err := pthread_cond_timedwait(&mp.cond, &mp.mutex, &t)
+			err := pthread_cond_timedwait_relative_np(&mp.cond, &mp.mutex, &t)
 			if err == _ETIMEDOUT {
 				pthread_mutex_unlock(&mp.mutex)
 				return -1
diff --git a/src/runtime/sys_darwin.go b/src/runtime/sys_darwin.go
index ef5aef1929..f0d0815903 100644
--- a/src/runtime/sys_darwin.go
+++ b/src/runtime/sys_darwin.go
@@ -287,10 +287,10 @@ func pthread_cond_wait_trampoline()
 
 //go:nosplit
 //go:cgo_unsafe_args
-func pthread_cond_timedwait(c *pthreadcond, m *pthreadmutex, t *timespec) int32 {
-	return libcCall(unsafe.Pointer(funcPC(pthread_cond_timedwait_trampoline)), unsafe.Pointer(&c))
+func pthread_cond_timedwait_relative_np(c *pthreadcond, m *pthreadmutex, t *timespec) int32 {
+	return libcCall(unsafe.Pointer(funcPC(pthread_cond_timedwait_relative_np_trampoline)), unsafe.Pointer(&c))
 }
-func pthread_cond_timedwait_trampoline()
+func pthread_cond_timedwait_relative_np_trampoline()
 
 //go:nosplit
 //go:cgo_unsafe_args
@@ -348,7 +348,7 @@ func closeonexec(fd int32) {
 //go:cgo_import_dynamic libc_pthread_mutex_unlock pthread_mutex_unlock "/usr/lib/libSystem.B.dylib"
 //go:cgo_import_dynamic libc_pthread_cond_init pthread_cond_init "/usr/lib/libSystem.B.dylib"
 //go:cgo_import_dynamic libc_pthread_cond_wait pthread_cond_wait "/usr/lib/libSystem.B.dylib"
-//go:cgo_import_dynamic libc_pthread_cond_timedwait pthread_cond_timedwait "/usr/lib/libSystem.B.dylib"
+//go:cgo_import_dynamic libc_pthread_cond_timedwait_relative_np pthread_cond_timedwait_relative_np "/usr/lib/libSystem.B.dylib"
 //go:cgo_import_dynamic libc_pthread_cond_signal pthread_cond_signal "/usr/lib/libSystem.B.dylib"
 
 // Magic incantation to get libSystem actually dynamically linked.
diff --git a/src/runtime/sys_darwin_amd64.s b/src/runtime/sys_darwin_amd64.s
index 5522a86a1f..142933585d 100644
--- a/src/runtime/sys_darwin_amd64.s
+++ b/src/runtime/sys_darwin_amd64.s
@@ -447,13 +447,13 @@ TEXT runtime·pthread_cond_wait_trampoline(SB),NOSPLIT,$0
 	POPQ	BP
 	RET
 
-TEXT runtime·pthread_cond_timedwait_trampoline(SB),NOSPLIT,$0
+TEXT runtime·pthread_cond_timedwait_relative_np_trampoline(SB),NOSPLIT,$0
 	PUSHQ	BP
 	MOVQ	SP, BP
 	MOVQ	8(DI), SI	// arg 2 mutex
 	MOVQ	16(DI), DX	// arg 3 timeout
 	MOVQ	0(DI), DI	// arg 1 cond
-	CALL	libc_pthread_cond_timedwait(SB)
+	CALL	libc_pthread_cond_timedwait_relative_np(SB)
 	POPQ	BP
 	RET

I'm going to verify this patch more scientifically. Assuming it withstands scrutiny, I'll clean it up and submit a CL.

@benesch

This comment has been minimized.

Copy link
Contributor

commented Jun 23, 2018

Results from a slightly more scientific experiment that involved running the kv generator that Nathan mentioned in the original description:

go1.10

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  120.0s        0         564521         4704.1      3.4      2.5      5.0     35.7     88.1

d6a27e8edcd992b36446c5021a3c7560d983e9a6 (current master)

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  120.0s        0         351106         2925.9      5.5      3.4     11.5     60.8    218.1

patch

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  120.0s        0         574159         4784.5      3.3      2.5      4.5     33.6     88.1

I'd say we've found our culprit. Preparing a CL now.

@gopherbot

This comment has been minimized.

Copy link

commented Jun 23, 2018

Change https://golang.org/cl/120635 mentions this issue: runtime: respect timeout in semasleep on Darwin

@nvanbenschoten

This comment has been minimized.

Copy link
Contributor Author

commented Jun 23, 2018

Solid catch! Glad this wasn't an issue in libc itself and that we'll be able to resolve it shortly.

Out of curiosity, how did you find out about pthread_cond_timedwait_relative_np? I can see it in the global symbol table of /usr/lib/system/libsystem_pthread.dylib now that I know what to look for, but that's about it. It's not referenced at all in the pthread documentation. Instead they just have a long (but easy to miss) rationale about why pthread_cond_timedwait doesn't take a relative time measurement.

@benesch

This comment has been minimized.

Copy link
Contributor

commented Jun 23, 2018

@kevinburke

This comment has been minimized.

Copy link
Contributor

commented Jun 25, 2018

Wow, thanks for the investigation and patch!

In case anyone else sees this, I observed ~200% CPU utilization + heavy fan usage on Mac, by starting a HTTP server that opened some Postgres connections. With this patch applied, the CPU usage from my Go binary is at 0%.

@gopherbot gopherbot closed this in 6fdbed0 Jun 25, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.