
encoding/csv: do not use bufio.Writer in csv.Writer #33486

Open
yaozongyou opened this issue Aug 6, 2019 · 4 comments


@yaozongyou commented Aug 6, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.6 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/root/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build056785781=/tmp/go-build -gno-record-gcc-switches"

What did you do?

In my program, at the lowest level of the call stack, I need to convert a record to a CSV string. I create a csv.NewWriter around a bytes.Buffer, Write the record, Flush, and finally read the CSV string out of the bytes.Buffer. Functionally this works fine. But after benchmarking, I found the program running very slowly, and the bufio.Writer used inside csv.Writer is the performance killer: for every record, a slice of bufio.defaultBufSize bytes is allocated, and there is no easy way to avoid that allocation.
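For illustration, the per-record pattern looks roughly like this (a minimal sketch; recordToCSV is a hypothetical name for the low-level helper):

package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// recordToCSV sketches the low-level helper: every call creates a fresh
// csv.Writer, and csv.NewWriter allocates a new bufio.Writer buffer
// (bufio.defaultBufSize = 4096 bytes) internally.
func recordToCSV(record []string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf) // allocates a 4096-byte bufio buffer each call
	if err := w.Write(record); err != nil {
		return "", err
	}
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	s, err := recordToCSV([]string{"1997", "Ford", "E350"})
	if err != nil {
		panic(err)
	}
	fmt.Print(s) // prints: 1997,Ford,E350
}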

What did you expect to see?

Leave the choice of using bufio or not to the user.

In the stdlib, write to the raw io.Writer, and if the user wants bufio, they can just wrap it like this:

w := csv.NewWriter(bufio.NewWriter(os.Stdout))
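Under that proposal, a caller writing into memory could skip bufio entirely, while a caller writing to a file or socket keeps explicit buffering (a sketch, assuming the proposed behavior; not how csv.NewWriter works today):

// Buffered destination: the caller wraps it explicitly, as in the line above.
w := csv.NewWriter(bufio.NewWriter(os.Stdout))

// In-memory destination: under the proposal, no extra 4096-byte bufio buffer
// would be allocated per csv.NewWriter; the bytes.Buffer is the only buffer.
var buf bytes.Buffer
w = csv.NewWriter(&buf)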

@ALTree added this to the Go1.14 milestone Aug 6, 2019

@dsnet (Member) commented Aug 6, 2019

Can you post a small reproduction? The allocation of the buffer in bufio.Writer should only occur once per csv.NewWriter. Are you creating lots of small CSV files?
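For reference, a small reproduction could be a benchmark of that per-record pattern (a sketch; BenchmarkPerRecordWriter is a hypothetical name), run with go test -bench . -benchmem:

package csvbench

import (
	"bytes"
	"encoding/csv"
	"testing"
)

// BenchmarkPerRecordWriter sketches the per-record pattern: every iteration
// creates a new csv.Writer, so every iteration also pays for the internal
// bufio.Writer buffer that csv.NewWriter allocates.
func BenchmarkPerRecordWriter(b *testing.B) {
	record := []string{"2000", "Mercury", "Cougar"}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		w := csv.NewWriter(&buf) // new bufio buffer every iteration
		if err := w.Write(record); err != nil {
			b.Fatal(err)
		}
		w.Flush()
		_ = buf.String()
	}
}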

@yaozongyou (Author) commented Aug 7, 2019

No, we are not creating lots of CSV files.
The situation is a RESTful service that runs SQL on CSV files, so that users can query only the selected records/columns. For example, given the following CSV:

Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar

running select Make from s3object will return

Ford
Mercury

running select * from s3object where Year = '2000' will return

2000,Mercury,Cougar

The steps are roughly as follows:

  1. Iterate over each record in the CSV file using csv.Reader,
  2. test whether the record matches the SQL,
  3. if it matches, encode the record back to CSV using csv.Writer,
  4. wrap the CSV in our custom message format and send the message to the user.

csv.Writer is used in the third step: for each record, we create a new csv.Writer to encode the record to a CSV string, encode that string into a custom message, and write the message to the user, roughly as sketched below.
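A rough sketch of that loop (matchesQuery and sendMessage are hypothetical stand-ins for the SQL filter and the custom message framing):

package csvselect

import (
	"bytes"
	"encoding/csv"
	"io"
)

// streamSelected sketches the pipeline: read records, filter them, re-encode
// each matching record to CSV, and hand the string to the message layer.
func streamSelected(in io.Reader, out io.Writer,
	matchesQuery func([]string) bool,
	sendMessage func(io.Writer, string) error) error {
	r := csv.NewReader(in)
	for {
		record, err := r.Read()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if !matchesQuery(record) {
			continue
		}
		// Step 3: a fresh csv.Writer per matching record; this is where the
		// internal bufio.Writer buffer gets allocated every time.
		var buf bytes.Buffer
		w := csv.NewWriter(&buf)
		if err := w.Write(record); err != nil {
			return err
		}
		w.Flush()
		if err := w.Error(); err != nil {
			return err
		}
		// Step 4: wrap the CSV string in the custom message format and send it.
		if err := sendMessage(out, buf.String()); err != nil {
			return err
		}
	}
}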

Why not use one csv.Writer per CSV file instead of one per record?

  1. The protocol this service speaks is a custom binary message format over HTTP chunked encoding; we need to send each matching record as soon as possible, report progress periodically, and sometimes send keepalive messages.
  2. csv.Writer is used at the deepest point in the call stack, so it is not easy to refactor.

Currently, we use global sync.Pools of bytes.Buffer and bufio.Writer to keep csv.Writer from allocating a new buffer each time, something like this:

package csvencode

import (
	"bufio"
	"bytes"
	"encoding/csv"
	"io/ioutil"
	"sync"
)

var bufPool = sync.Pool{
	New: func() interface{} {
		return new(bytes.Buffer)
	},
}

var bufioWriterPool = sync.Pool{
	New: func() interface{} {
		// ioutil.Discard is just used to create the writer. The actual
		// destination writer is set later by Reset() before use.
		return bufio.NewWriter(ioutil.Discard)
	},
}

// encodeRecord encodes one record to a CSV string, reusing pooled buffers.
func encodeRecord(record []string) (string, error) {
	bufioWriter := bufioWriterPool.Get().(*bufio.Writer)
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // the pooled buffer may still hold data from a previous record
	bufioWriter.Reset(buf)

	// csv.NewWriter wraps its argument in bufio.NewWriter; since the argument
	// is already a *bufio.Writer with a large enough buffer, bufio returns it
	// as is and no new buffer is allocated.
	w := csv.NewWriter(bufioWriter)
	if err := w.Write(record); err != nil {
		return "", err
	}
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}

	str := buf.String()
	// encode this str into a custom message and send it to the user
	bufioWriterPool.Put(bufioWriter)
	bufPool.Put(buf)
	return str, nil
}

This way we avoid the performance problem, but we still think the Go stdlib should write to the raw io.Writer and leave the choice of whether to use bufio to the user.

@dsnet removed the WaitingForInfo label Aug 7, 2019

@andybons (Member) commented Aug 12, 2019

@dsnet if this is a change that needs fixing, can you add the NeedsFix label? Thanks.

@dsnet (Member) commented Aug 12, 2019

It's a usage pattern that is certainly not what the encoding/csv package had in mind when designed and thus this pattern is hitting some pathological behavior. It's worth improving the performance of this use case, but it's not clear to me that the solution is simply to "not use bufio.Writer in csv.Writer".
