Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: encoding/csv: support QUOTE_NONE in Reader #23344

Closed
c0b opened this issue Jan 5, 2018 · 4 comments

Comments

@c0b
Copy link

commented Jan 5, 2018

Similar to Python csv reader's https://docs.python.org/2/library/csv.html#csv.QUOTE_NONE option

Recently I'm in a data engineering project, need to handle some huge numbers of big csv files (some GBs gz compressed) generated from some other probably proprietary software system, it uses '\t' as delimiter and some very inconsistent quoting, using Python csv reader with csv.QUOTE_NONE option it's no problem at all, can handle all of them, however not fast enough; I'm considering rewrite in Go, however does not have a similar option to Python's QUOTE_NONE I have tried set the LazyQuotes = true it helped somewhat, but still failing for some cases, after digging hard, I found the root cause
https://golang.org/pkg/encoding/csv/#Reader

f1value1	f2value1	f3"value1"
f1value2	f2value2	f3val"ue2
f1value3	f2value3	"f3value3
...

the problem is at values like "f3value3 which has beginning " and current Go's csv reader behavior is read till another " with the delimiter as an end boundary; causing this single field take some GB of memory, could be even more, and then being OOM killed.

I have made a local copy of the csv/reader.go file with some changes get the similar behavior of Python QUOTE_NONE working; possibly extend to all four quote option QUOTE_NONNUMERIC QUOTE_MINIMAL and QUOTE_ALL; let me know any one like this approach

BTW, through reading the code and the doc at https://golang.org/pkg/encoding/csv/#Reader I'm still not very clear why need a LazyQuotes option, it is just confusing and complicate the code; any one have comments?

@dsnet dsnet changed the title csv reader: need feature of QUOTE_NONE proposal: encoding/csv: support QUOTE_NONE in Reader Jan 5, 2018

@gopherbot gopherbot added this to the Proposal milestone Jan 5, 2018

@gopherbot gopherbot added the Proposal label Jan 5, 2018

@dsnet

This comment has been minimized.

Copy link
Member

commented Jan 5, 2018

The main complexity of the CSV implementation is needing to deal with multi-line quoted fields. The moment you remove the semantics of quoted strings, it seems like this problem is easier solved using bufio.Scanner to read each line and strings.Split to split across a \t delimiter:

r := strings.NewReader(`f1value1	f2value1	f3"value1"
f1value2	f2value2	f3val"ue2
f1value3	f2value3	"f3value3
`)

scanner := bufio.NewScanner(r)
for scanner.Scan() {
	fields := strings.Split(scanner.Text(), "\t")
	fmt.Printf("Fields: %q\n", fields)
}
if err := scanner.Err(); err != nil {
	log.Fatalf("scanner.Err: %v", err)
}

// Output:
// Fields: ["f1value1" "f2value1" "f3\"value1\""]
// Fields: ["f1value2" "f2value2" "f3val\"ue2"]
// Fields: ["f1value3" "f2value3" "\"f3value3"]

There is a performance detriment to strings.Split since it needs to allocate the output slice every time. One can imagine a (different) proposal to add func SplitInto(dst []string, s, sep string) []string to the strings package to allow reusing the slice.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 5, 2018

There are many many variations of CSV out there. We decided years ago that the standard library's encoding/csv package should stick to RFC 4180. If you have a different format, I recommend that you make your own copy of encoding/csv/reader.go and modify it for your use case. Or do as @dsnet suggests.

The LazyQuotes option, which was introduced in the first version of encoding/csv before we decided to stick closely to RFC 4180, permits parsing CSV files in which unquoted fields contain quote characters that are neither at the beginning nor the end of the field.

I'm going to close this proposal because, following past decisions in this area, we aren't going to do this., Please comment if you disagree, and explain why this situation is different.

@c0b

This comment has been minimized.

Copy link
Author

commented Jan 18, 2018

We decided years ago that ...

I would argue are those decisions made long time ago never revised?

https://golang.org/pkg/encoding/csv/#Reader for example this TrailingComma field is ignored now must be from something revised decision?

    // If LazyQuotes is true, a quote may appear in an unquoted field and a
    // non-doubled quote may appear in a quoted field.
    LazyQuotes    bool
    TrailingComma bool // ignored; here for backwards compatibility

in the same way if this LazyQuotes design is weird, (I've done CSV processing in many other languages and parsing libraries never such a similar LazyQuotes option) causing more confusing than it resolves, why can't this one be marked as ignored as well?

many many variations of CSV

Furthermore, if the builtin pkg "encoding/csv" is barely useful and people have to pull another 3rd party implementation, why not just take another csv implementation to be the new builtin "encoding/csv"

I know backward compatibility is a big concern here, but how many languages / libraries / software were dumped away in the compute related history? would you say 5 years / 10 years is enough time to make some library incompatible changes? or when Go 2 is to be released will allow some ?

after all my solution is to copy the encoding/csv/reader.go file and just made a few lines changes there, maintain this private copy is not a big deal, but in future this standard encoding/csv/reader.go file could have changes and causing merging problems for me, at this moment I'd like to contribute it in a PR; would there be some people interested to see it?

@dsnet

This comment has been minimized.

Copy link
Member

commented Jan 18, 2018

  • The presence of obscure features today does not justify that we should open the gates to allow for a variety of other strange variations on CSV.
    • The TrailingComma feature was marked deprecated because its behavior was deemed an actual bug (see #5892) according to RFC4180 (which is the "canonical" CSV specification).
    • We cannot deprecate LazyQuotes because it is a feature we already added and cannot remove because people may depend on it.
  • I disagree with the statement that "encoding/csv" is "barely useful" as there are thousands of imports of the csv package.
  • There will always be variant forms of CSV and it is impossible to satisfy them all. Especially when the feature you are suggesting is actually easier done without the csv package as I demonstrated above, it seems that overloading the csv package to handle this is a mistake. I understand your use case, but writing the logic yourself with 15 lines is not that bad.

@golang golang locked and limited conversation to collaborators Jan 18, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
4 participants
You can’t perform that action at this time.