csv/encoding: UTF-8 Byte Order Mark (BOM) messes up quote handling #33887

Closed
MaerF0x0 opened this issue Aug 27, 2019 · 8 comments

@MaerF0x0

What version of Go are you using (go version)?

$  go version
go version go1.12.7 darwin/amd64

Does this issue reproduce with the latest release?

1.12.9 is the newest version that brew installs; let me know if I need to build 1.13.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/me/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/me/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.12.9/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.12.9/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/z3/089w1c5x2ngbzyfj0wb2lp440000gp/T/go-build807181100=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Attempted to use encoding/csv's NewReader on a file starting with the UTF-8 BOM 0xEF, 0xBB, 0xBF (https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8).

Example go playground
https://play.golang.org/p/op6f2xI5h0X

What did you expect to see?

Leading BOM is ignored.

What did you see instead?

Failure to parse CSV

@MaerF0x0
Author

Roughly speaking, this is a test case for encoding/csv/reader_test.go:

{
	Name:   "UTF-8WithByteOrderMark",
	Input:  "\xef\xbb\xbf\"BOM!\"",
	Output: [][]string{{"BOM!"}},
},

go test ./src/encoding/csv

  --- FAIL: TestRead/UTF-8WithByteOrderMark (0.00s)
        reader_test.go:410: ReadAll() error:
            got  parse error on line 1, column 1: bare " in non-quoted-field
            want <nil>

@av86743

av86743 commented Aug 27, 2019

Per RFC 4180, CSV, which by far predates UTF and everything that comes with it, is composed of a subset of OCTETS.

@MaerF0x0
Author

I agree that RFC 4180 does not specify it. But Excel seems to write a BOM on export, and some other libraries can handle it, so maybe Go should too, in favor of interoperability?

parse(data, {
  bom: true
})

@ianlancetaylor
Contributor

The encoding/csv package implements RFC 4180, as the package docs state. A BOM is meaningless for CSV, and it's straightforward to use a reader that skips a BOM (e.g., a search on godoc.org turned up https://godoc.org/github.com/spkg/bom, though I don't know anything about that package). There is no need for CSV to know anything about BOMs.

@MaerF0x0
Author

KK, thanks for considering. At least this will serve as a reference for future Google searches 😃

@skplunkerin

For future (beginner Golang) Googlers like myself, you can use @MuffinTop's trimFirstRune() suggestion from this SO answer to easily remove the BOM.

@MaerF0x0
Author

@skplunkerin

The Stack Overflow solutions all require loading the entire CSV into memory. This works for small test strings, but may be prohibitive in production code.

A solution is to use a streaming (io.Reader) filter, such as https://github.com/dimchansky/utfbom, roughly like this:

    f, err := os.Open("/tmp/dat.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    sr, enc := utfbom.Skip(f) // skips a leading BOM, if any, and reports the detected encoding
    fmt.Printf("Detected encoding: %s\n", enc)
    myCSV := csv.NewReader(sr)

@skplunkerin

@MaerF0x0 that's a much better solution, thank you so much! 😆
