
encoding/csv: skipping of empty rows leads to loss of data in single-column datasets #39119

Open
kokes opened this issue May 17, 2020 · 12 comments · May be fixed by #39556

@kokes commented May 17, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/ondrej/Library/Caches/go-build"
GOENV="/Users/ondrej/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/ondrej/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.14/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.14/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/hp/q7nph21s1q76nw1hv1hfxv2m0000gn/T/go-build988616937=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I wrote a CSV with a single column and missing data.

What did you expect to see?

I expected to load the data back, intact.

What did you see instead?

I lost the missing values: encoding/csv skipped them, because it skips blank lines. In this case, a blank line actually represents data.


I'm not sure I understand the rationale behind skipping blank lines, either in terms of common practice (why would I have blank lines in my CSVs?) or in terms of standards (the closest we have is RFC 4180, and I couldn't find anything about blank lines in it, so I'm not sure whether Go follows it).

Here's a reproduction of the problem. I wrote a dataset into a file and was unable to read it back.

package main

import (
	"encoding/csv"
	"errors"
	"log"
	"os"
	"reflect"
)

func writeData(filename string, data [][]string) error {
	f, err := os.Create(filename)
	if err != nil {
		return err
	}
	defer f.Close()
	cw := csv.NewWriter(f)
	defer cw.Flush()
	if err := cw.WriteAll(data); err != nil {
		return err
	}
	return nil
}

func readData(filename string) ([][]string, error) {
	f, err := os.Open(filename)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	cr := csv.NewReader(f)
	rows, err := cr.ReadAll()
	if err != nil {
		return nil, err
	}
	return rows, nil
}

func run() error {
	fn := "data/roundtrip.csv" // assumes the data/ directory already exists
	data := [][]string{{"john"}, {"jane"}, {""}, {"jack"}}

	if err := writeData(fn, data); err != nil {
		return err
	}

	returned, err := readData(fn)
	if err != nil {
		return err
	}
	if !reflect.DeepEqual(returned, data) {
		log.Println("expected", data, "got", returned)
		return errors.New("not equal")
	}

	return nil
}

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}

@cagedmantis cagedmantis added this to the Backlog milestone May 20, 2020
@cagedmantis (Contributor) commented May 20, 2020

As mentioned, from a quick look, RFC 4180 does not seem to explicitly describe what the behavior should be for empty lines.

@dsnet (Member) commented May 21, 2020

According to RFC 4180, section 2.1:

Each record is located on a separate line, delimited by a line break (CRLF).

Which suggests that this is a bug in the reader code.

However, I'm not going to be surprised if fixing this breaks lots of users due to Hyrum's law, forcing us to just document the errant behavior. Even though the package tries to follow RFC 4180, CSV files are one of those formats with many strange variants that do not follow any specific grammar.

@dsnet (Member) commented May 21, 2020

Interestingly, this used to work in go1.3 and stopped working in go1.4, not because the reader behavior changed, but because the writer behavior changed.

In go1.3, it would produce:

john
jane
""
jack

but in go1.4, it would produce:

john
jane

jack

The presence of an explicit "" made it clear to the reader that there was a record present.
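The difference is easy to confirm with the current reader. Here's a minimal sketch (feeding both writer outputs through `strings.Reader` rather than files): an explicitly quoted empty field survives the round trip, while a truly blank line is silently dropped.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// countRecords parses s with encoding/csv and returns the number of records.
func countRecords(s string) int {
	rows, err := csv.NewReader(strings.NewReader(s)).ReadAll()
	if err != nil {
		panic(err)
	}
	return len(rows)
}

func main() {
	// go1.3-era writer output: the empty record is an explicitly quoted
	// empty field, so the reader keeps it.
	fmt.Println(countRecords("john\njane\n\"\"\njack\n")) // 4

	// go1.4+ writer output: the empty record is a blank line, which the
	// reader silently skips.
	fmt.Println(countRecords("john\njane\n\njack\n")) // 3
}
```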

@ianlancetaylor (Contributor) commented May 21, 2020

The change in Go 1.4 is documented, but I guess not in user-visible documentation.

// We used to quote empty strings, but we do not anymore (as of Go 1.4).
// The two representations should be equivalent, but Postgres distinguishes
// quoted vs non-quoted empty string during database imports, and it has
// an option to force the quoted behavior for non-quoted CSV but it has
// no option to force the non-quoted behavior for quoted CSV, making
// CSV with quoted empty strings strictly less useful.
// Not quoting the empty string also makes this package match the behavior
// of Microsoft Excel and Google Drive.

See #7586.

@kokes (Author) commented May 22, 2020

Which suggests that this is a bug in the reader code.

However, I'm not going to be surprised if fixing this breaks lots of users due to Hyrum's law, forcing us to just document the errant behavior. Even though the package tries to follow RFC 4180, CSV files are one of those formats with many strange variants that do not follow any specific grammar.

If we're worried about breaking existing code, we could add a boolean flag, which defaults to the current behaviour and then we could discuss flipping it (and potentially removing it) for Go 2.


If I understand the implementation correctly, one cannot just initialise a Reader struct; one has to go through NewReader, because the underlying buffered reader field is unexported. In that case, we can enforce our default in the constructor. (Or we could flip the boolean flag to mean e.g. ParseBlankLines, which has the desired default value.)

Something along the lines of this (haven't tested it, just sketching):

diff --git a/src/encoding/csv/reader.go b/src/encoding/csv/reader.go
index c40aa506b0..cd2b0ccfc1 100644
--- a/src/encoding/csv/reader.go
+++ b/src/encoding/csv/reader.go
@@ -16,8 +16,8 @@
 //
 // Carriage returns before newline characters are silently removed.
 //
-// Blank lines are ignored. A line with only whitespace characters (excluding
-// the ending newline character) is not considered a blank line.
+// Blank lines are ignored by default. A line with only whitespace characters
+// (excluding the ending newline character) is not considered a blank line.
 //
 // Fields which start and stop with the quote character " are called
 // quoted-fields. The beginning and ending quote are not part of the
@@ -142,6 +142,9 @@ type Reader struct {
 	// By default, each call to Read returns newly allocated memory owned by the caller.
 	ReuseRecord bool
 
+	// If SkipBlankLines is true (default), rows with no data are skipped.
+	SkipBlankLines bool
+
 	TrailingComma bool // Deprecated: No longer used.
 
 	r *bufio.Reader
@@ -169,8 +172,9 @@ type Reader struct {
 // NewReader returns a new Reader that reads from r.
 func NewReader(r io.Reader) *Reader {
 	return &Reader{
-		Comma: ',',
-		r:     bufio.NewReader(r),
+		Comma:          ',',
+		SkipBlankLines: true,
+		r:              bufio.NewReader(r),
 	}
 }
 
@@ -268,7 +272,7 @@ func (r *Reader) readRecord(dst []string) ([]string, error) {
 			line = nil
 			continue // Skip comment lines
 		}
-		if errRead == nil && len(line) == lengthNL(line) {
+		if r.SkipBlankLines && errRead == nil && len(line) == lengthNL(line) {
 			line = nil
 			continue // Skip empty lines
 		}

@kokes (Author) commented Jun 12, 2020

Here's a patch, including tests. If this is something that would be acceptable, I can go through the usual Gerrit mechanism. Oh, I see PRs on GitHub are accepted, so I'll do that instead.

Also, if this patch goes through, I'd suggest for Go 2 to have the default reverted - interpret blank lines by default and either add a SkipBlankLines switch or remove this functionality altogether.

@kokes kokes linked a pull request that will close this issue Jun 12, 2020
@gopherbot commented Jun 12, 2020

Change https://golang.org/cl/237658 mentions this issue: encoding/csv: allow for interpretation of empty lines

@ianlancetaylor (Contributor) commented Jun 12, 2020

We've pretty much decided that we are not going to add additional knobs to encoding/csv. There are just too many variations in CSV files out there. Trying to handle all of them would lead to a massive proliferation of knobs and an unusable package API.

@kokes (Author) commented Jun 12, 2020

That's too bad, because two of the existing knobs (LazyQuotes and Comment) actually go against the "standard", while this suggested knob tries to bring Go closer to standard compliance (while keeping backward compatibility).

Anyway, where can I register interest in changing this in Go 2? No knobs, just changing the default - blank lines are perfectly fine and should be interpreted as data.

@ianlancetaylor (Contributor) commented Jun 12, 2020

There isn't going to be a big shift to Go 2. There may never be a Go 2, except perhaps for marketing reasons. For a package like this, there may someday be an encoding/csv/v2. Although personally, I think our experience with this package suggests it would be better to drop it from the standard library and encourage a variety of third-party packages instead.

@uhoh-itsmaciek commented Nov 13, 2020

Should the documentation be updated to indicate that RFC 4180 is not fully supported? I just ran into this myself, and reading the RFC

   record = field *(COMMA field)
   name = field
   field = (escaped / non-escaped)
   escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
   non-escaped = *TEXTDATA
   ...
   TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

suggests strongly that blank lines should correspond to a single empty field.

I realize there's already a note a couple of paragraphs down in the encoding/csv documentation that states blank lines are ignored, but it's not made clear that this behavior contradicts the RFC.
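In the meantime, one workaround on the reading side is to pre-process the input so that blank lines become explicitly quoted empty fields, which the reader does keep. A rough sketch, where `quoteBlankLines` is a hypothetical helper (not part of encoding/csv), with the caveat that a naive line scan breaks on quoted fields that themselves contain newlines:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// quoteBlankLines replaces blank lines with a quoted empty field (`""`) so
// that encoding/csv treats them as single-field records instead of skipping
// them. Caveat: this simple scan cannot distinguish a blank line inside a
// multi-line quoted field, so it assumes the input has none.
func quoteBlankLines(s string) string {
	lines := strings.Split(s, "\n")
	for i, l := range lines {
		if i == len(lines)-1 && l == "" {
			continue // artifact of a trailing newline, not a record
		}
		switch l {
		case "":
			lines[i] = `""`
		case "\r":
			lines[i] = `""` + "\r"
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	in := "john\njane\n\njack\n"
	rows, err := csv.NewReader(strings.NewReader(quoteBlankLines(in))).ReadAll()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rows)) // 4: the blank line survives as a [""] record
}
```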
