
encoding/csv: do not use bufio.Writer in csv.Writer #33486

Open
yaozongyou opened this issue Aug 6, 2019 · 6 comments
@yaozongyou commented Aug 6, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.6 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build056785781=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Consider this situation: in my program, at the lowest level of the call stack, I need to convert a record to a CSV string. So I create a csv.Writer with csv.NewWriter over a bytes.Buffer, Write my record, Flush, and finally read the CSV string out of the bytes.Buffer. Everything is fine so far. But after benchmarking, I found my program running very slowly, and the bufio.Writer used inside csv.Writer is the performance killer: for each record, a slice of bufio.defaultBufSize bytes is allocated, and this allocation is not easy to avoid.
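For context, the per-record pattern described above looks roughly like this (a sketch; recordToCSV is a hypothetical helper name, and the 4096-byte figure is bufio.defaultBufSize):

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// recordToCSV converts one record to its CSV string form. Every call
// builds a fresh csv.Writer, whose internal bufio.Writer allocates a
// new 4096-byte buffer (bufio.defaultBufSize).
func recordToCSV(record []string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(record); err != nil {
		return "", err
	}
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	s, err := recordToCSV([]string{"Hello", "World"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", s) // "Hello,World\n"
}
```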

What did you expect to see?

Leave the choice of using bufio or not to the user.

In the standard library, write to the raw io.Writer; if the user wants bufio, they can wrap the writer themselves like this:

w := csv.NewWriter(bufio.NewWriter(os.Stdout))
@ALTree ALTree added this to the Go1.14 milestone Aug 6, 2019
@dsnet (Member) commented Aug 6, 2019

Can you post a small reproduction? The allocation of the buffer in bufio.Writer should only occur once per csv.NewWriter. Are you creating lots of small CSV files?

@yaozongyou (Author) commented Aug 7, 2019

No, we are not creating lots of CSV files.
The situation is a RESTful service for running SQL over CSV files, so that users can query only the selected records/columns. For example, given the following CSV:

Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar

running select Make from s3object will return

Ford
Mercury

running select * from s3object where Year = '2000' will return

2000,Mercury,Cougar

The steps are roughly as follows:

  1. iterate over each record of the CSV file using csv.Reader,
  2. test whether the record matches the SQL,
  3. if it matches, encode the record to CSV using csv.Writer,
  4. wrap the CSV in a custom message format and send the message to the user.

The csv.Writer is used in the third step: for each record, we create a new csv.Writer to encode the record to a CSV string, encode that string into a custom message, and write the message to the user.

Why do we not use one csv.Writer per CSV file instead of one per record?

  1. The protocol this service speaks is a custom binary message format on top of HTTP chunked encoding; we need to send each matching record as soon as possible, report progress periodically, and sometimes send keepalive messages.
  2. csv.Writer is used in the deepest part of the call stack, so it is not easy to refactor.

Currently, we use global sync.Pools of bytes.Buffer and bufio.Writer to prevent csv.Writer from allocating a new buffer, something like this:

var bufPool = sync.Pool{
	New: func() interface{} {
		return new(bytes.Buffer)
	},
}

var bufioWriterPool = sync.Pool{
	New: func() interface{} {
		// ioutil.Discard is just used to create the writer. Actual destination
		// writer is set later by Reset() before using it.
		return bufio.NewWriter(ioutil.Discard)
	},
}


bufioWriter := bufioWriterPool.Get().(*bufio.Writer)
buf := bufPool.Get().(*bytes.Buffer)
bufioWriter.Reset(buf)

w := csv.NewWriter(bufioWriter)
w.Write(record)
w.Flush()

str := buf.String()
// encode this str into a custom message and send it to the user
bufioWriterPool.Put(bufioWriter)
bufPool.Put(buf)

Using this approach, we avoid the performance problem, but we still think the Go standard library should write to the raw io.Writer and leave the choice of bufio to the user.
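As an aside, when a single long-lived csv.Writer can stay bound to one bytes.Buffer, resetting the buffer between records also avoids the per-record bufio allocation without pools. A sketch (encodeEach is a hypothetical name), though as noted above this kind of refactor is not easy when the csv.Writer sits deep in the call stack:

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// encodeEach encodes each record to its own CSV string, reusing a
// single csv.Writer (and its one bufio buffer) across all records.
func encodeEach(records [][]string) ([]string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf) // bufio buffer allocated once here

	out := make([]string, 0, len(records))
	for _, rec := range records {
		buf.Reset() // drop the previous record's bytes
		if err := w.Write(rec); err != nil {
			return nil, err
		}
		w.Flush() // push the record through bufio into buf
		if err := w.Error(); err != nil {
			return nil, err
		}
		out = append(out, buf.String())
	}
	return out, nil
}

func main() {
	lines, err := encodeEach([][]string{{"Ford"}, {"Mercury"}})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", lines) // ["Ford\n" "Mercury\n"]
}
```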

@dsnet dsnet removed the WaitingForInfo label Aug 7, 2019
@andybons (Member) commented Aug 12, 2019

@dsnet if this is a change that needs fixing, can you add the NeedsFix label? Thanks.

@dsnet (Member) commented Aug 12, 2019

It's a usage pattern that is certainly not what the encoding/csv package had in mind when it was designed, and thus this pattern is hitting some pathological behavior. It's worth improving the performance of this use case, but it's not clear to me that the solution is simply to "not use bufio.Writer in csv.Writer".

@eliben (Member) commented Aug 30, 2019

@yaozongyou could you please provide a minimal benchmark that demonstrates the slowness here?

@yaozongyou (Author) commented Sep 2, 2019

Here is a minimal benchmark demonstrating the slowness:

package main

import (
        "bufio"
        "bytes"
        "encoding/csv"
        "io/ioutil"
        "sync"
        "testing"
)

func BenchmarkCSV_1(b *testing.B) {
        b.ResetTimer()
        b.ReportAllocs()

        b.RunParallel(func(pb *testing.PB) {
                for pb.Next() {
                        var buf bytes.Buffer
                        w := csv.NewWriter(&buf)
                        w.Write([]string{"Hello", "World"})
                        w.Flush()

                        // buf.String() will be "Hello,World\n"
                }
        })
}

var bufPool = sync.Pool{
        New: func() interface{} {
                return new(bytes.Buffer)
        },
}

var bufioWriterPool = sync.Pool{
        New: func() interface{} {
                // ioutil.Discard is just used to create the writer. Actual destination
                // writer is set later by Reset() before using it.
                return bufio.NewWriter(ioutil.Discard)
        },
}

func BenchmarkCSV_2(b *testing.B) {
        b.ResetTimer()
        b.ReportAllocs()

        b.RunParallel(func(pb *testing.PB) {
                for pb.Next() {
                        buf := bufPool.Get().(*bytes.Buffer)
                        buf.Reset()

                        // Use bufio Writer to prevent csv.Writer from allocating a new buffer.
                        bufioWriter := bufioWriterPool.Get().(*bufio.Writer)
                        bufioWriter.Reset(buf)

                        w := csv.NewWriter(bufioWriter)
                        w.Write([]string{"Hello", "World"})
                        w.Flush()

                        // buf.String() will be "Hello,World\n"

                        bufPool.Put(buf)
                        bufioWriterPool.Put(bufioWriter)
                }
        })
}

The output in my test environment:

$ go test -cpu 10 -benchtime 100000x -bench .
goos: linux
goarch: amd64
BenchmarkCSV_1-10         100000              1311 ns/op            4272 B/op          4 allocs/op
BenchmarkCSV_2-10         100000                90.9 ns/op             0 B/op          0 allocs/op
PASS
ok      _/home/richardyao/xxx   0.146s

From the result, BenchmarkCSV_1 makes 4 allocations per operation, while BenchmarkCSV_2 makes none.

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019