
cmd/compile: copy() from C.malloc-ed buffer twice as slow as copying from native go buffer on Linux #49618

Closed
harixxxx opened this issue Nov 16, 2021 · 25 comments
Labels
NeedsInvestigation Performance

Comments

@harixxxx commented Nov 16, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.3 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/u/hariharan/.cache/go-build"
GOENV="/u/hariharan/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/u/hariharan/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/u/hariharan/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.3"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2098663698=/tmp/go-build -gno-record-gcc-switches"

What did you do?

// ===== Program that has the copy() performance issue =====
// Program that copies from C.malloc-ed buffer to go buffer
package main

// #cgo CFLAGS: -g -Wall
// #include <stdlib.h>
import "C"
import (
	"flag"
	"fmt"
)

func main() {
	loop_count_ptr := flag.Int("count", 100000, "number of times to copy to go buf")
	flag.Parse()

	loop_count := *loop_count_ptr

	buf_size_go := 1024 * 1024
	out_buf := make([]byte, buf_size_go)

	buf_size_c := C.int(buf_size_go)
	c_buf := C.malloc(C.ulong(buf_size_c))
	defer C.free(c_buf)
	c_buf_go := C.GoBytes(c_buf, buf_size_c)

	fmt.Println("copying: count=", loop_count)
	for i := 0; i < loop_count; i++ {
		copy(out_buf, c_buf_go)
	}
}
// ===== Program that does not have the copy() performance issue =====
// Program that copies from go buffer to go buffer
package main

import (
	"flag"
	"fmt"
)

func main() {
	loop_count_ptr := flag.Int("count", 100000, "number of times to copy to go buf")
	flag.Parse()

	loop_count := *loop_count_ptr

	buf_size_go := 1024 * 1024
	out_buf := make([]byte, buf_size_go)

	go_buf := make([]byte, buf_size_go)
	fmt.Println("copying: count=", loop_count)
	for i := 0; i < loop_count; i++ {
		copy(out_buf, go_buf)
	}
}

What did you expect to see?

Expected the time taken to copy from the C.malloc-ed buffer to be the same as copying from a native Go buffer

What did you see instead?

Copying from the C.malloc-ed buffer can be more than twice as slow as copying from a native Go buffer

// Copying from C.malloc-ed buffer
$ time ./c
copying: count= 100000

real	0m5.357s
user	0m5.356s
sys	0m0.008s

// Copying from go buffer
$ time ./native
copying: count= 100000

real	0m2.320s
user	0m2.307s
sys	0m0.016s

Note that macOS does not seem to have this issue; it appears to be specific to Linux.

@harixxxx harixxxx changed the title copy() from C.malloc-ed buffer twice as slow as copying from native go buffer copy() from C.malloc-ed buffer twice as slow as copying from native go buffer on Linux Nov 16, 2021
@dr2chase dr2chase added the NeedsInvestigation label Nov 16, 2021
@seankhliao seankhliao changed the title copy() from C.malloc-ed buffer twice as slow as copying from native go buffer on Linux runtime/cgo: copy() from C.malloc-ed buffer twice as slow as copying from native go buffer on Linux Nov 16, 2021
@ianlancetaylor (Contributor) commented Nov 16, 2021

I do not see this on my Linux laptop running Debian rodete.

> time ./c
copying: count= 100000

real	0m4.055s
user	0m4.161s
sys	0m0.028s
> time ./native
copying: count= 100000

real	0m4.192s
user	0m4.199s
sys	0m0.024s

It might be worth trying to express this as a benchmark in a Go test to see if the issue can be recreated that way.

@ianlancetaylor ianlancetaylor changed the title runtime/cgo: copy() from C.malloc-ed buffer twice as slow as copying from native go buffer on Linux cmd/compile: copy() from C.malloc-ed buffer twice as slow as copying from native go buffer on Linux Nov 16, 2021
@harixxxx (Author) commented Nov 16, 2021

Would it help if I added the assembly from gdb for main.go for the above two cases?

@ianlancetaylor (Contributor) commented Nov 16, 2021

Sure, it might help. Thanks.

@randall77 (Contributor) commented Nov 16, 2021

I can reproduce on my desktop, although it is more like 50% slower, not 2x slower.

I suspect the cause is hugepages. Try allocating the C buffer using mmap with huge page support and see if that helps.

Slow code profile is all here:

      50ms       50ms     461a1d: PREFETCHNTA 0x1c0(SI)                   ;runtime.memmove memmove_amd64.s:429
     2.51s      2.51s     461a24: PREFETCHNTA 0x280(SI)                   ;runtime.memmove memmove_amd64.s:430
     2.42s      2.42s     461a2b: VMOVDQU 0(SI), X0                       ;runtime.memmove memmove_amd64.s:435
     150ms      150ms     461a2f: VMOVDQU 0x20(SI), X1                    ;runtime.memmove memmove_amd64.s:436
     390ms      390ms     461a34: VMOVDQU 0x40(SI), X2                    ;runtime.memmove memmove_amd64.s:437
     150ms      150ms     461a39: VMOVDQU 0x60(SI), X3                    ;runtime.memmove memmove_amd64.s:438
     360ms      360ms     461a3e: ADDQ $0x80, SI                          ;runtime.memmove memmove_amd64.s:439
      40ms       40ms     461a45: VMOVNTDQ X0, 0(DI)                      ;runtime.memmove memmove_amd64.s:440
      30ms       30ms     461a49: VMOVNTDQ X1, 0x20(DI)                   ;runtime.memmove memmove_amd64.s:441
     120ms      120ms     461a4e: VMOVNTDQ X2, 0x40(DI)                   ;runtime.memmove memmove_amd64.s:442
     120ms      120ms     461a53: VMOVNTDQ X3, 0x60(DI)                   ;runtime.memmove memmove_amd64.s:443
      60ms       60ms     461a58: ADDQ $0x80, DI                          ;runtime.memmove memmove_amd64.s:444
      20ms       20ms     461a5f: SUBQ $0x80, BX                          ;runtime.memmove memmove_amd64.s:445
         .          .     461a66: JA 0x461a1d                             ;memmove_amd64.s:446

Fast code profile is all here:

         .          .     45f1bd: PREFETCHNTA 0x1c0(SI)                   ;memmove_amd64.s:429
     130ms      130ms     45f1c4: PREFETCHNTA 0x280(SI)                   ;runtime.memmove memmove_amd64.s:430
      90ms       90ms     45f1cb: VMOVDQU 0(SI), X0                       ;runtime.memmove memmove_amd64.s:435
         .          .     45f1cf: VMOVDQU 0x20(SI), X1                    ;memmove_amd64.s:436
         .          .     45f1d4: VMOVDQU 0x40(SI), X2                    ;memmove_amd64.s:437
     100ms      100ms     45f1d9: VMOVDQU 0x60(SI), X3                    ;runtime.memmove memmove_amd64.s:438
     120ms      120ms     45f1de: ADDQ $0x80, SI                          ;runtime.memmove memmove_amd64.s:439
      40ms       40ms     45f1e5: VMOVNTDQ X0, 0(DI)                      ;runtime.memmove memmove_amd64.s:440
     1.88s      1.88s     45f1e9: VMOVNTDQ X1, 0x20(DI)                   ;runtime.memmove memmove_amd64.s:441
     1.89s      1.89s     45f1ee: VMOVNTDQ X2, 0x40(DI)                   ;runtime.memmove memmove_amd64.s:442
     390ms      390ms     45f1f3: VMOVNTDQ X3, 0x60(DI)                   ;runtime.memmove memmove_amd64.s:443
     130ms      130ms     45f1f8: ADDQ $0x80, DI                          ;runtime.memmove memmove_amd64.s:444
         .          .     45f1ff: SUBQ $0x80, BX                          ;memmove_amd64.s:445
         .          .     45f206: JA 0x45f1bd                             ;memmove_amd64.s:446

Note that the former is read-limited (reading from the C buffer), whereas the latter is write-limited.

@randall77 (Contributor) commented Nov 16, 2021

Indeed, bumping the allocation up to 2MB (the size of an x86 huge page) erases the performance delta.

@harixxxx (Author) commented Nov 16, 2021

@randall77 Can you please post the code change you made so I can try it? Thanks.

@harixxxx (Author) commented Nov 16, 2021

Changed the allocated buffer size to 2MB. It made things take much longer, but the problem remains.

C.malloc:  real 0m12.336s
native:    real  0m5.696s

@randall77 (Contributor) commented Nov 17, 2021

Making the buffer bigger was:

buf_size_go := 1024 * 1024 * 2

Getting a profile was:

	f, _ := os.Create("cpu.pprof")
	pprof.StartCPUProfile(f)
	defer pprof.StopCPUProfile()

You can then look at the profile with:

> go tool pprof cpu.pprof
(pprof) top10
(pprof) disasm runtime.memmove

@harixxxx (Author) commented Nov 18, 2021

Thanks for the pprof info.
I ran with profiling enabled just before the copy loop, copying the 1MB buffer 100000 times.
Almost all the time was spent at the same instruction in both cases.
So is the C.malloc case somehow copying a lot more data than the native case?
That's the only explanation I can think of.

Note that I'm running the test on a VM

c:
     4.68s      4.68s     460c58: REP; MOVSQ DS:0(SI), ES:0(DI)           ;runtime.memmove memmove_amd64.s:118

native:
     2.09s      2.09s     4600d8: REP; MOVSQ DS:0(SI), ES:0(DI)           ;runtime.memmove memmove_amd64.s:118

@randall77 (Contributor) commented Nov 18, 2021

I still think this is going to be something different between malloc and our allocator with respect to huge pages.
Try allocating with mmap instead of malloc.
Try turning off huge page support in your VM.

Another thing to check is to print the pointer values in both cases. Possibly there's an alignment issue that makes the copy slower.

It almost certainly isn't a different amount of actual copying; it's just that the copying that is happening is slower.

@harixxxx (Author) commented Nov 20, 2021

Allocating an 8K-aligned buffer did not make a difference; it takes the same time as the malloc-ed buffer.

code change:
        c_buf := C.aligned_alloc(8192, C.ulong(buf_size_c))
        fmt.Println("aligned buffer:", c_buf)

result:
$ time ./aligned
buf_size_c: 1048576
aligned buffer: 0x7f5e0aefc000
buf_size:  1024 KiB; 1048576 bytes
copying: count= 100000

real	0m5.226s
user	0m5.133s
sys	0m0.012s

@harixxxx (Author) commented Nov 20, 2021

Looks like the VM has hugepage support, but currently there are no hugepages allocated.
Can this be taken to mean it's effectively the same as having no hugepage support?

$ grep -i huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

@harixxxx (Author) commented Nov 20, 2021

Using mmap did not make a difference either.

code change:
const (
        mapPrivate   = 0x02
        mapAnonymous = 0x20
        protRead     = 0x1
        protWrite    = 0x2
)
...
        c_buf := C.mmap(nil, C.ulong(buf_size_c), protRead|protWrite, mapPrivate|mapAnonymous, -1, 0)
        fmt.Println("c_buf:", c_buf)


result:
$ time ./mmap
c_buf: 0x7fc0909f0000
buf_size:  1024 KiB; 1048576 bytes
copying: count= 100000

real	0m5.438s
user	0m5.417s
sys	0m0.028s

@harixxxx (Author) commented Nov 20, 2021

Is it possible to pass a native Go buffer allocated with make() to a C function and have that function copy data into it?

@harixxxx (Author) commented Nov 20, 2021

Disabling hugepages didn't help

(py36) dev-hariharan: /sys/kernel/mm/transparent_hugepage
$ cat enabled
always [madvise] never

root@dev-hariharan:/sys/kernel/mm/transparent_hugepage# echo never > enabled
root@dev-hariharan:/sys/kernel/mm/transparent_hugepage#

root@dev-hariharan:/sys/kernel/mm/transparent_hugepage# cat enabled
always madvise [never]

$ time ./c
buf_size_c: 1048576
buf_size:  1024 KiB; 1048576 bytes
copying: count= 100000

real	0m5.525s
user	0m5.340s
sys	0m0.021s

@randall77 (Contributor) commented Nov 21, 2021

Try allocating a 2MB-aligned and 2MB-sized buffer. That's what the Go runtime does.

@harixxxx (Author) commented Nov 21, 2021

A 2MB-aligned and 2MB-sized buffer did not help.

Here's something interesting. With a 32KB buffer, copy performance can be bad even when copying from one native buffer to another. So the C side can be taken out of the picture.

In the first 2 runs below, the buffer pointers are the same, but one run took much longer than the other.
Across multiple runs, the time taken is one or the other of the results below - around 3.7s or 7s.
In some runs, the buffers allocated have better alignment (shown below) - in such cases, performance always seems to be good.
What can be blamed for the slowness? It can't be alignment, because the same alignment produces both good and bad results.

$ time ./native -buf_size 32 -count 10000000
buf_size:  32 KiB; 32768 bytes
in_buf: 0xc0000ca000 out_buf: 0xc0000c2000
copying: count= 10000000

real	0m3.716s
user	0m3.528s
sys	0m0.024s

$ time ./native -buf_size 32 -count 10000000
buf_size:  32 KiB; 32768 bytes
in_buf: 0xc0000ca000 out_buf: 0xc0000c2000
copying: count= 10000000

real	0m7.227s
user	0m7.079s
sys	0m0.020s

Better alignment for both in & out buffers - good result:
$ time ./native -buf_size 32 -count 10000000
buf_size:  32 KiB; 32768 bytes
in_buf: 0xc000150000 out_buf: 0xc000148000
copying: count= 10000000

real	0m3.717s
user	0m3.573s
sys	0m0.000s

@harixxxx (Author) commented Nov 21, 2021

In some runs, the buffers allocated have better alignment (shown below) - in such cases, performance always seems to be good.

I take this back. Even with better alignment, performance can be bad.

$ time ./native -buf_size 32 -count 10000000
buf_size:  32 KiB; 32768 bytes
in_buf: 0xc0000c0000 out_buf: 0xc0000b8000
copying: count= 10000000

real	0m6.748s
user	0m6.761s
sys	0m0.001s

@harixxxx (Author) commented Nov 21, 2021

Native buffer copy code:

package main

import (
        "flag"
        "fmt"
        "os"
        "runtime/pprof"
)

func main() {
        buf_size_kib_ptr := flag.Int("buf_size", 1024, "buf size in KiB")
        loop_count_ptr := flag.Int("count", 100000, "number of times to copy to go buf")
        profile_ptr := flag.Bool("profile", false, "profile the copy loop")
        flag.Parse()

        buf_size_kib := *buf_size_kib_ptr
        loop_count := *loop_count_ptr
        profile := *profile_ptr

        buf_size_go := buf_size_kib * 1024
        out_buf := make([]byte, buf_size_go)

        go_buf := make([]byte, buf_size_go)
        fmt.Println("buf_size: ", buf_size_kib, "KiB;", buf_size_go, "bytes")
        fmt.Printf("in_buf: %p out_buf: %p\n", go_buf, out_buf)
        fmt.Println("copying: count=", loop_count)

        if profile {
                f, _ := os.Create("cpu.pprof")
                pprof.StartCPUProfile(f)
                defer pprof.StopCPUProfile()
        }

        for i := 0; i < loop_count; i++ {
                copy(out_buf, go_buf)
        }
}

@randall77 (Contributor) commented Nov 21, 2021

I'm going to close this issue, as I don't think it is a bug in Go.
There's something going on with the hardware - cache? The VM? I'm not sure.
Even though I'm closing it, please report back if you figure out what is going on. I'm curious.

@harixxxx (Author) commented Nov 21, 2021

Sorry, your response is disappointing.
I would have expected Go developers to take an active interest in an issue like this, even if it may not be a Go problem.

@randall77 (Contributor) commented Nov 21, 2021

I'm curious, but I've got a lot of stuff to do and a limited time budget.

@harixxxx (Author) commented Nov 21, 2021

No problem. Thanks for your time. I'm new to Go, so it was at least a learning experience. If I find something, I'll update this.

Let me ask you one more question.
I tried passing the native Go buffer to the C side and had the C side copy data into it (no C.malloc). That worked fine. The Go buffer was allocated using make(). A Go slice can change its underlying buffer - does that happen only when the capacity of the slice needs to increase, or can it happen at any time, e.g. during garbage collection? If it can happen at any time, then obviously passing the native buffer to C is dangerous.

@randall77 (Contributor) commented Nov 21, 2021

Typically slice backing arrays are only replaced when growing capacity. That's not language-enforced, though, just normal practice. A programmer could reassign the backing array at any time.
The garbage collector will not do so. Slices passed to C are pinned for the duration of the C call. (Issue #46787 is about pinning for deeper data structures.)

@harixxxx (Author) commented Nov 21, 2021

Thank you. For my use case (the Go buffer passed to C is a plain byte array that does not contain any pointers, C or Go), it looks like it's OK to pass the Go buffer pointer to C and avoid C.malloc.
