cmd/compile: performance of slice append is optimizable #36405

Closed
lhonglwl opened this issue Jan 6, 2020 · 5 comments


@lhonglwl lhonglwl commented Jan 6, 2020

What version of Go are you using (go version)?

$ go version
go version go1.13.5 linux/amd64

What did you do?

package main

import "testing"

func BenchmarkAppend(b *testing.B) {
    buf1 := make([]byte, 3, 10)
    buf1[0] = 0
    buf1[1] = 1
    buf1[2] = 2
    buf2 := []byte{'a', 'b', 'c', 'd', 'e', 'f', 'g'}
    for i := 0; i < b.N; i++ {
        buf1 = buf1[:3]
        buf1 = append(buf1, buf2...)
    }
}

func BenchmarkAppend2(b *testing.B) {
    buf1 := make([]byte, 3, 10)
    buf1[0] = 0
    buf1[1] = 1
    buf1[2] = 2
    buf2 := buf1[3:10]
    buf2[0] = 'a'
    buf2[1] = 'b'
    buf2[2] = 'c'
    buf2[3] = 'd'
    buf2[4] = 'e'
    buf2[5] = 'f'
    buf2[6] = 'g'
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        buf1 = buf1[:3]
        buf1 = append(buf1, buf2...)
    }
}

func BenchmarkAppend3(b *testing.B) {
    buf1 := make([]byte, 3, 10)
    buf1[0] = 0
    buf1[1] = 1
    buf1[2] = 2
    for i := 0; i < b.N; i++ {
        buf2 := []byte{'a', 'b', 'c', 'd', 'e', 'f', 'g'}
        buf1 = buf1[:3]
        buf1 = append(buf1, buf2...)
    }
}

func BenchmarkAppend4(b *testing.B) {
    buf1 := make([]byte, 3, 10)
    buf1[0] = 0
    buf1[1] = 1
    buf1[2] = 2
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        buf2 := buf1[3:10]
        buf2[0] = 'a'
        buf2[1] = 'b'
        buf2[2] = 'c'
        buf2[3] = 'd'
        buf2[4] = 'e'
        buf2[5] = 'f'
        buf2[6] = 'g'
        buf1 = buf1[:3]
        buf1 = append(buf1, buf2...)
    }
}

output:

BenchmarkAppend-8       1000000000               0.879 ns/op
BenchmarkAppend2-8      193146717                6.20 ns/op
BenchmarkAppend3-8      172225722                7.04 ns/op
BenchmarkAppend4-8      170996562                6.96 ns/op

What did you expect to see?

I expected BenchmarkAppend2 to be noticeably faster than BenchmarkAppend, since the memory of buf1 and buf2 is contiguous; append should not need an actual memmove in BenchmarkAppend2.

But the benchmark shows BenchmarkAppend2 costs much more time than BenchmarkAppend, as if it allocated memory on every iteration, because its cost is almost equal to BenchmarkAppend3 and BenchmarkAppend4.

I wonder whether there is something wrong, or a bug, in append's memory copy, and whether it will be fixed or optimized in the future.

@lhonglwl lhonglwl changed the title from "performance of slice append is optimizable" to "append: performance of slice append is optimizable" Jan 6, 2020
@randall77 randall77 commented Jan 6, 2020

BenchmarkAppend and BenchmarkAppend2 have essentially the same inner loop.
BenchmarkAppend:

     260ms      260ms    10f8af6: MOVL 0x44(SP), SI                       ;example.com/m.BenchmarkAppend issue36405_test.go:13
         .          .    10f8afa: MOVL 0x41(SP), DI                       ;issue36405_test.go:13
         .          .    10f8afe: MOVL DI, 0x3(DX)
      70ms       70ms    10f8b01: MOVL SI, 0x6(DX)                        ;example.com/m.BenchmarkAppend issue36405_test.go:13
     260ms      260ms    10f8b04: INCQ CX                                 ;example.com/m.BenchmarkAppend issue36405_test.go:11
         .          .    10f8b07: CMPQ CX, 0x108(AX)                      ;issue36405_test.go:11
         .          .    10f8b0e: JLE 0x10f8b63
         .          .    10f8b10: CMPQ $0x3, BX                           ;issue36405_test.go:12
         .          .    10f8b14: JB 0x10f8b6d
         .          .    10f8b16: CMPQ $0xa, BX                           ;issue36405_test.go:13
         .          .    10f8b1a: JAE 0x10f8af6

BenchmarkAppend2:

         .          .    10f8bec: MOVL 0x54(SP), SI                       ;issue36405_test.go:32
         .          .    10f8bf0: MOVL 0x51(SP), DI
     1.12s      1.12s    10f8bf4: MOVL DI, 0x3(DX)                        ;example.com/m.BenchmarkAppend2 issue36405_test.go:32
     190ms      190ms    10f8bf7: MOVL SI, 0x6(DX)
         .          .    10f8bfa: INCQ CX                                 ;issue36405_test.go:30
         .          .    10f8bfd: CMPQ CX, 0x108(AX)
         .          .    10f8c04: JLE 0x10f8c59
         .          .    10f8c06: CMPQ $0x3, BX                           ;issue36405_test.go:31
         .          .    10f8c0a: JB 0x10f8c63
     110ms      110ms    10f8c0c: CMPQ $0xa, BX                           ;example.com/m.BenchmarkAppend2 issue36405_test.go:32
         .          .    10f8c10: JAE 0x10f8bec                           ;issue36405_test.go:32

I think what is happening here is that the source and destination arrays are aliased in version 2. When copying those 7 bytes, we use two 4-byte load/stores. That would cause havoc with the write buffer on the chip, as it has to resolve the writes before doing the reads.

Are you seeing this problem in a real application? Generally one doesn't append portions of a buffer to itself.

@ianlancetaylor ianlancetaylor changed the title from "append: performance of slice append is optimizable" to "cmd/compile: performance of slice append is optimizable" Jan 6, 2020
@lhonglwl lhonglwl commented Jan 7, 2020

I doubt it is necessary to allocate new memory for BenchmarkAppend2; can't we detect this situation and avoid the unnecessary memmove?

I found this issue because serialization/deserialization with the encoding/binary and bytes.Buffer packages performs poorly (costing almost 10 times as much as C/C++). Using pprof and reading the source code, I found that much of the time goes to runtime.mallocgc and the runtime.convT* functions, which correspond to allocating temporary buffers and copying memory. So I decided to write my own replacement for the buffer with fewer temporary buffers and memory moves, reading/writing directly in a []byte buffer.

However, we can't add data to a []byte by index assignment past its length (only with the append method), even when it has enough capacity. So I made a new []byte slice pointing just past the end of the buffer, read/wrote through it, and then appended it back to the source buffer, just like BenchmarkAppend2.

Unfortunately, BenchmarkAppend2 performs far worse than expected, and it should not.

@lhonglwl lhonglwl closed this Jan 7, 2020
@lhonglwl lhonglwl reopened this Jan 7, 2020
@randall77 randall77 commented Jan 7, 2020

I found that much of the time goes to runtime.mallocgc and the runtime.convT* functions, which correspond to allocating temporary buffers and copying memory.

The benchmarks you provided don't call any of these things, at least in the inner loop. So I don't think this issue is directly applicable to your original problem (slowness in encoding/binary and/or bytes.Buffer).

Perhaps you could provide a benchmark for your replacement code?

However, we can't add data to a []byte by index assignment past its length (only with the append method), even when it has enough capacity. So I made a new []byte slice pointing just past the end of the buffer, read/wrote through it, and then appended it back to the source buffer, just like BenchmarkAppend2.

You don't need to use append to write to a []byte. You can reslice past the length (you can reslice up to the capacity) and write directly, using direct assignments, the copy builtin, or encoding/binary.

e.g.:

b := make([]byte, 0, 10)
b = b[:1]
b[0] = 'a'
copy(b[1:4], ...some slice of size 3...)
b = b[:4]
n := binary.PutVarint(b[4:cap(b)], 999)
b = b[:4+n]
@lhonglwl lhonglwl commented Jan 7, 2020

The benchmarks you provided don't call any of these things, at least in the inner loop. So I don't think this issue is directly applicable to your original problem (slowness in encoding/binary and/or bytes.Buffer).

This issue is not the reason for the poor performance of encoding/binary and bytes.Buffer; it's just where I found it.

You don't need to use append to write to a []byte. You can reslice past the length (you can reslice up to the capacity) and write directly, using direct assignments, the copy builtin, or encoding/binary.

Thanks, reslicing does solve my problem. In this situation append should not be used, since it performs poorly until it is optimized (if it ever is).

@randall77 randall77 commented Jan 7, 2020

I don't think there's any optimization we can actually do here.
The performance is constrained by the hardware: it's just not designed to quickly handle writes immediately followed by differently aligned reads. See the Intel optimization manual, section 3.6.5.1.

Go is optimized to do appends fast when the backing store of the slice you're appending to is write-only. Switching between reads and writes on every small append is not a common use case, and is thus not something we optimize for.

I'm going to close this issue as not actionable. Please reopen if you disagree.

@randall77 randall77 closed this Jan 7, 2020