Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: futex contention caused by memory allocation in goroutines #34231

Open
flxkl opened this issue Sep 11, 2019 · 4 comments

Comments

@flxkl
Copy link

commented Sep 11, 2019

What version of Go are you using (go version)?

Go 1.13

$ go version
go version go1.13 linux/amd64

Does this issue reproduce with the latest release?

I made similar obsverations with older Go versions (go 1.10.x,, go1.11.x, and go1.12.x).

What operating system and processor architecture are you using (go env)?

System: Ubuntu 18.04.3 LTS
Kernel: Linux 4.15.0-60-generic
Processor: Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz (8 cores)

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/felix/.cache/go-build"
GOENV="/home/felix/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/felix/.gvm/pkgsets/go1.13/global:/home/felix/regmatch"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/felix/.gvm/gos/go1.13"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/felix/.gvm/gos/go1.13/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build193330363=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Iteratively solving two tasks in parallel on a multi-core computer in two separate goroutines. The tasks are independent of each other. At the end of an iteration, the tasks' results are combined. Both tasks have similar running times, but they are not identical. Both tasks allocate local memory (this seems to be crucial for the problem!).

Whether using a wait group or channels at the end of an iteration does not seem to matter. The code below uses a wait group. See also
main.txt.

  • The main() function comprises a for loop with the iteration in which the tasks solved in each iteration.
  • The two tasks are performed by the task() function.
  • The main() function has also some flags, e.g., for setting the number of OS threads and for CPU profiling.
package main

import (
	"flag"
	"fmt"
	"math/rand"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	profile := flag.Bool("profile", false, "cpu profile")
	cpu := flag.Int("cpu", runtime.GOMAXPROCS(-1), "number of OS threads")
	iterations := flag.Int("num", 100000, "number of iterations")
	//tasks    := flag.Int("tasks", 2, "number tasks per iteration")
	work := flag.Int("work", 20, "work in each task")
	flag.Parse()

	if *profile {
		var f *os.File
		var err error
		if f, err = os.Create(fmt.Sprintf("cpu%d.prof", *cpu)); err != nil {
			fmt.Printf("Failed to create file for CPU profiling (%s).\n", err)
			return
		}
		if err = pprof.StartCPUProfile(f); err != nil {
			fmt.Printf("Failed to start CPU profile (%s).\n", err)
			return
		}
		defer func() {
			pprof.StopCPUProfile()
			if err = f.Close(); err != nil {
				fmt.Printf("Failed to close file for CPU profiling (%s).\n", err)
			}
		}()
	}

	runtime.GOMAXPROCS(*cpu)
	fmt.Printf("Using %d OS threads...\n", runtime.GOMAXPROCS(-1))

	for i := 0; i < *iterations; i++ {
		n := *work - rand.Intn(8)
		in0, in1 := n+rand.Intn(4), n+rand.Intn(4)
		var out0, out1 int
		var dur0, dur1 time.Duration

		var wg sync.WaitGroup
		wg.Add(2)
		go task(in0, &out0, &dur0, &wg)
		go task(in1, &out1, &dur1, &wg)
		wg.Wait()

		fmt.Printf("%6d: %10s %10s %10s \t %d, %d\n", i, dur0, dur1, dur0-dur1, out0, out1)
	}
}

func task(in int, res *int, dur *time.Duration, wg *sync.WaitGroup) {
	stopwatch := time.Now()
	memory := make([]int, fib(in))
	for j := 0; j < len(memory); j++ {
		memory[j] = j
	}
	*res = memory[len(memory)-1]
	// Allocating some memory in the groroutine seems crucial for futex contention.
	// The bad performance behavior is not observable for *res = fib(n).
	*dur = time.Since(stopwatch)
	wg.Done()
}

func fib(n int) int {
	switch n {
	case 0:
		return 0
	case 1:
		return 1
	default:
		return fib(n-1) + fib(n-2)
	}
}

What did you expect to see?

I expect that the program runs faster on >=2 CPU cores than on a single CPU core, since the tasks are carried out in parallel. I also expect that the running times are similar on 2 and 8 CPU cores. There might be some scheduling overhead when using 8 CPU cores.

Note that -work flag allows you to control the hardness of the problem in the iteration. This parameter will depend on the computer you are using. The default parameters worked well on my computer for the issue.

What did you see instead?

The program runs indead faster with 2 CPU cores than with a single CPU core. However, on 8 CPU cores it runs significantly slower (9 seconds compared to 13 seconds; the system time as output by /usr/bin/time also increases significantly). See the file
output.txt/
for a summary of multiple runs.

Looking at the CPU profiles, one see a significant increase by runtime.futex. It seems that the goroutines are interrupted by a futex syscall, which originates from the memory allocation in one of the goroutines. Afterwards, the Go scheduler then tries to find a runnable goroutine. However, I am not very familiar with futexes and the Go scheduler. So, I might be completely wrong here.

Overall, I would have expected a far smaller scheduling overhead and unused resources (i.e., CPU cores) should not slow down the program.

@ianlancetaylor ianlancetaylor changed the title futex contention caused by memory allocation in goroutines runtime: futex contention caused by memory allocation in goroutines Sep 11, 2019

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 11, 2019

For what it's worth, I cannot recreate this on my 8 core laptop.

> time ./foo -profile -cpu 1 > /tmp/foo.1

real	0m8.427s
user	0m7.901s
sys	0m0.389s
> time ./foo -profile -cpu 2 > /tmp/foo.2

real	0m6.822s
user	0m9.500s
sys	0m1.098s
> time ./foo -profile -cpu 8 > /tmp/foo.8

real	0m7.421s
user	0m11.442s
sys	0m1.802s
@flxkl

This comment has been minimized.

Copy link
Author

commented Sep 11, 2019

I guess you also do not see anything suspicious in your CPU profiles?

I ran the experiments again on my computer. This time with time and not /usr/bin/time. I get the same behavior:

time ./foo -cpu 1 >log1

real 0m9.167s
user 0m8.920s
sys 0m0.248s

time ./foo -cpu 2 >log2

real 0m6.889s
user 0m10.076s
sys 0m0.586s

time ./foo -cpu 8 >log8

real 0m11.681s
user 0m18.345s
sys 0m1.777s

Maybe it depends on the underlying Linux kernel. Here the information from my /proc/version:
Linux version 4.15.0-60-generic (buildd@lgw01-amd64-030) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019
Unfortunately, I do not have another computer with more than 2 CPU cores available at the moment for comparison.

Please let me know when you need more information about my setup or whether I should try something out on my computer.

@ALTree

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

For the record, E3-1230 V2 has 4 physical cores.

@flxkl

This comment has been minimized.

Copy link
Author

commented Sep 11, 2019

@ALTree: Sorry, I missed this detail, Hyperthreading is enabled.

The following information from the CPU profile (with the flag -cpu 8) might be helpful:

(pprof) peek futex
Showing nodes accounting for 21.50s, 100% of 21.50s total
----------------------------------------------------------+-------------
flat flat% sum% cum cum% calls calls% + context
----------------------------------------------------------+-------------
1s 54.64% | runtime.futexsleep
0.83s 45.36% | runtime.futexwakeup
1.83s 8.51% 8.51% 1.83s 8.51% | runtime.futex
----------------------------------------------------------+-------------
0.92s 89.32% | runtime.notesleep
0.10s 9.71% | runtime.notetsleep_internal
0.01s 0.97% | runtime.lock
0.03s 0.14% 8.65% 1.03s 4.79% | runtime.futexsleep
1s 97.09% | runtime.futex
----------------------------------------------------------+-------------
0.82s 98.80% | runtime.notewakeup
0.01s 1.20% | runtime.unlock
0 0% 8.65% 0.83s 3.86% | runtime.futexwakeup
0.83s 100% | runtime.futex
----------------------------------------------------------+-------------

(pprof) list futex
Total: 21.50s
ROUTINE ======================== runtime.futex in /home/felix/.gvm/gos/go1.13/src/runtime/sys_linux_amd64.s
1.83s 1.83s (flat, cum) 8.51% of Total
. . 530: MOVL val+12(FP), DX
. . 531: MOVQ ts+16(FP), R10
. . 532: MOVQ addr2+24(FP), R8
. . 533: MOVL val3+32(FP), R9
. . 534: MOVL $SYS_futex, AX
90ms 90ms 535: SYSCALL
1.71s 1.71s 536: MOVL AX, ret+40(FP)
30ms 30ms 537: RET
. . 538:
. . 539:// int32 clone(int32 flags, void *stk, M *mp, G *gp, void (*fn)(void));
. . 540:TEXT runtime·clone(SB),NOSPLIT,$0
. . 541: MOVL flags+0(FP), DI

I find the 1.71s after the SYSCALL extremely high. But as I said, I have no knowledge about the internals of the Go scheduler. It is only an educated guess that this is related to the running times on my computer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.