
supporting double quotes #4

Closed
ftrotter opened this issue Aug 17, 2015 · 33 comments

Comments

@ftrotter

I love the idea of Miller. It is clearly a needed tool that is missing from the standard unix toolbox.

However, you really cannot say you have a tool that is designed to support CSV without supporting CSV.

CSV is a standard file format, and has an RFC: https://tools.ietf.org/html/rfc4180

Not supporting double quotes is the same thing as saying that you do not support csv, since double quotes are central to the way that the standard handles other characters... comma being just one example. Your tool is young enough that supporting the standard now will make later development much simpler. This will prevent the situation years from now where you have a 'normal mode' and a 'standards mode'. If you make the change now you can just have the one correct mode.
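For reference, the RFC 4180 quoting rules are: a field containing commas, double quotes, or line breaks must be wrapped in double quotes, and an embedded double quote is represented by doubling it. Python's csv module implements these rules, so it makes a handy illustration:

```python
import csv
import io

# RFC 4180: the embedded comma and the doubled quotes ("") are all part of
# one field because the field is wrapped in double quotes.
raw = 'id,note\r\n1,"says ""hi"", then leaves"\r\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['1', 'says "hi", then leaves']
```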

You have an ambitious work-list, but I would suggest taking a pause and thinking about how you will support the RFC version of the file format.

People like me (open data advocates) spend a lot of time trying to ensure that organizations that release csv do so under the standard format, rather than releasing unparsable garbage. Having a library like yours that supported the standard too would be a huge boon. I could say things like:

"See by using the RFC for your data output, all kinds of open tools will work out of the box on your data... like Miller (link)"

Thank you for working on such a clever tool...

Regards,
-FT

@johnkerl
Owner

Nice. Thanks for the RFC link. For me in my daily work -- where Miller is pretty stable featurewise -- CSV is a secondary format (hence its not being the default). But I suspected (and it now seems rightly so) that CSV will be the primary for most folks.

@ve3ied

ve3ied commented Aug 17, 2015

I just found out about Miller over the weekend. It's a Swiss Army knife, and I'm looking forward to getting to know it!

I would suggest that if supporting double quotes comes at a performance cost, you offer both a lightweight CSV option (the current behavior) and full CSV support. 99.9% of the time I don't need double-quote support, and I'm sure many others are in the same boat. I mostly need it for processing numbers from spreadsheet colleagues, and the current behavior is just fine as is. Thanks!!

@vapniks
Contributor

vapniks commented Aug 17, 2015

johnkerl: how easy/difficult would it be to add support for double quotes?
Which bits of the code need to be changed?

@SikhNerd
Contributor

I came here to request the same, adding support for quoted fields would make this tool awesomely usable for a lot of use cases.

@johnkerl
Owner

ftrotter is spot-on -- RFC-compliant CSV is the way to go. Not only on a tool-by-tool basis but because it encourages good data citizenship on the part of producers as well as consumers.

ve3ied is (I think) also right -- I suspect a fully compliant CSV parser will run a bit slower than the handles-simple-cases code I have now (which suffices for my personal use cases). I think the right thing to do is make --csv handle RFC-compliant CSV. Then, unless the performance is absolutely as good as the current csv-lite code (which would surprise me), have a --csvlite option which will invoke the current reader.

That said, RFC-compliant CSV is going to be a bit of a project, while double quotes per se, as a small step toward that goal, is going to be easy. The RFC states that if double quotes are present, they must wrap the entire field. So it'll be pretty simple to (a) add a state to the state machines in the c/input/lrec_reader*csv.c files to ignore FS (e.g. comma) if inside quotes, (b) having delimited the field, if the start of field is a double-quote, then it's an error if the end of field isn't; (c) ++ the start-of-field pointer and null-poke the end-of-field char; (d) wrap with double-quote on output if there's an instance of FS in the field.
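The (a)-(d) steps above amount to a small state machine. Here is a purely illustrative Python sketch of the line-splitting part (Miller's actual reader is C, in the c/input/lrec_reader*csv.c files; this just demonstrates the idea of ignoring FS inside quotes and stripping the wrapping quotes):

```python
def split_csv_line(line, fs=','):
    """Illustrative sketch: split one CSV line, ignoring FS inside double
    quotes, un-doubling embedded quotes, and stripping wrapping quotes."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')  # doubled quote -> one literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote of the field
            else:
                buf.append(c)  # FS inside quotes is ordinary data
        else:
            if c == '"' and not buf:
                in_quotes = True  # opening quote at start of field
            elif c == fs:
                fields.append(''.join(buf))  # end of field
                buf = []
            else:
                buf.append(c)
        i += 1
    fields.append(''.join(buf))
    return fields

print(split_csv_line('1,"2,3",4'))  # ['1', '2,3', '4']
```

A production parser also needs the error check in (b) (a quoted field must end with a quote) and multi-line quoted fields, which this sketch omits.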

It's clear I've led a charmed life by having CSV data which is almost entirely double-quote-free. In light of multiple feedbacks, the minimal double-quote-handling feature sketched above is definitely top-of-list.

@johnkerl
Owner

I've split out #16 for the more general case; I'll close this when double-quote handling is in place.

@johnkerl
Owner

Primary question at this point: how/when to double-quote on output?

  • If file data have double quotes only when necessary (e.g. embedded comma) then it's easy to do the same on output: embed the output field in double quotes only when it has something needing escaping. This is easy to implement.
  • But some data file double-quote everything. Here again I can just double-quote all output fields, e.g. with a --quote-all-on-output flag or something.

I'm OK with both options. Question is if there's any desire for some sort of middle ground, e.g. if someone wants fields which were quoted on input to remain quoted on output. That's a bit more awkward. For transform operations which pass records through, e.g. mlr cat or mlr sort, one could put a was_quoted_on_input bit in each record and pass that through. But it gets messier for mlr put operations, e.g. doing $c = $a . $b when field a was quoted and field b was not. Should c be quoted?

In short I'm happy to implement output-quote-when-needed and output-quote-always modes; I'm not eager to do output-quote-tracked-from-input-quotes unless it's a really important use-case.
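For illustration, the output-quote-when-needed rule comes down to one predicate per field; a minimal Python sketch (the function name is hypothetical, not Miller code):

```python
def quote_when_needed(field, fs=','):
    """Wrap the field in double quotes only if it contains something that
    needs escaping per RFC 4180: the FS, a double quote, or a line break.
    Embedded double quotes are doubled."""
    if any(c in field for c in (fs, '"', '\r', '\n')):
        return '"' + field.replace('"', '""') + '"'
    return field

print(quote_when_needed('2,3'))  # "2,3"
print(quote_when_needed('abc'))  # abc
```

Quote-always mode is the same function with the condition removed, which is why offering both is cheap; it's the input-quote-tracking mode that would require extra per-field state.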

@SikhNerd
Contributor

Perhaps @ftrotter knows of a more proper answer, but I just did a quick survey of some of the more common tools I use and how they handle quoting on output. I hope this adds to the conversation:

Python's CSV Module

Provides four options, and defaults to QUOTE_MINIMAL:

  • QUOTE_ALL
    • All fields get quotes
  • QUOTE_MINIMAL
    • Only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.
  • QUOTE_NONNUMERIC
    • Quote all non-numeric fields
  • QUOTE_NONE
    • Never quote fields

With the defaults $c = $a . $b is quoted if a delimiter exists.
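For concreteness, here is what two of those Python policies produce for a row containing an embedded delimiter (QUOTE_MINIMAL quotes only the field that needs it; QUOTE_ALL quotes everything):

```python
import csv
import io

row = ['1', '2,3', 'abc']

for policy in (csv.QUOTE_MINIMAL, csv.QUOTE_ALL):
    out = io.StringIO()
    csv.writer(out, quoting=policy, lineterminator='\n').writerow(row)
    print(out.getvalue(), end='')
# QUOTE_MINIMAL: 1,"2,3",abc
# QUOTE_ALL:     "1","2,3","abc"
```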

Perl's Text::CSV

Also provides four options and the default behaviour is to quote fields if they need to - e.g. they contain the separator:

  • always_quote
    • All fields quoted
  • quote_space
    • By default spaces trigger quoting, this option is to disable that trigger
  • quote_null
    • By default, null bytes trigger quoting, this option is to disable that trigger
  • quote_binary
    • By default bytes >= 0x7f trigger quoting, this option is to disable that trigger

With the defaults $c = $a . $b is quoted if a delimiter exists.

Csvfix

Provides two options on quoting, and defaults to quoting all fields:

  • smq
    • Turns off quoting of all fields, and quotes only those fields with a separator
  • sqf fields
    • Allows you to provide a list of fields to be quoted, any field not in your list is not quoted.

I did a quick test and the $c = $a . $b case here is quoted by default, quoted with -smq, and you can force it to be fully unquoted with -sqf.

CsvKit

Provides four options for quoting, defaults to quoting only when a separator is present.

  • u=0
    • Quote Minimal
  • u=1
    • Quote All
  • u=2
    • Quote Non-numeric
  • u=3
    • Quote None.

I did not get time to test the $c = $a . $b case here, sorry. I expect it would quote if a delimiter exists based on the documentation.

My personal preferences are exactly the two options you laid out (quote minimally and quote everything) and to have $c quoted only when a separator is present.

Thank you for working on this :)

@ftrotter
Author

Actually @SikhNerd has just pointed out the "real" reason to support double quotes all the time on input, rather than supporting any 'modes'. So many of the standard CSV output systems will default to using no quotes for field exports that contain only numbers, but then automatically include quotes when something "turns alphanumeric" or "turns alphanumeric with commas". For instance a standard input file might go for years with a structure of

id,phone_number
1,1112223333
2,1112223334
....
n,7778889999

but then suddenly someone enters a phone number with an extension, and the identical export process will produce

id,phone_number
1,1112223333
2,"1112223334x509"
....
n,7778889999

or worse
id,phone_number
1,"1112223333"
2,"1112223334x509"
....
n,"7778889999"

In the same way, you might have an export process that works like this:

id,sentence
1,The quick brown fox jumps over the lazy dog
2,"The quick, brown fox jumps over the lazy dog"

In this case the quotes are only exercised when a comma appears in the string. So to use a system like Miller stably, you really need to handle all of those cases smoothly. Otherwise, later changes in data contents or data structure could produce confusing, hard-to-find bugs.
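The point above can be demonstrated with any RFC-compliant reader (Python's csv module here): whether or not the producer happened to quote a field, the parsed value is identical, so a producer drifting into quote-everything mode doesn't change the data a compliant consumer sees:

```python
import csv
import io

# Two exports of the same record: one unquoted, one needlessly quoted.
exports = [
    'id,phone_number\r\n1,1112223333\r\n',
    'id,phone_number\r\n1,"1112223333"\r\n',
]
parsed = [list(csv.reader(io.StringIO(e)))[1] for e in exports]
print(parsed)  # [['1', '1112223333'], ['1', '1112223333']]
assert parsed[0] == parsed[1]
```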

Again thanks for working on such a great tool!!

-FT

@johnkerl
Owner

@ftrotter thank you! you've got a lot more experience with dirty CSV than I do. :)

@johnkerl johnkerl mentioned this issue Aug 19, 2015
@johnkerl
Owner

Rethink; I'm duping #16 to this one. No need for a two-task approach.

@johnkerl
Owner

Had a nice chunk of time this evening to sketch out the new CSV parser. Proof of concept works. Now just a matter of integrating into the rest of the framework, and plenty of unit-test & regression-test cases, & checking against several found-in-the-wild CSV files. There are several in miller issues, I think enough to work with.

@SikhNerd
Contributor

If you can push up the branch you're working on, I'd be very interested in watching what you're doing, and will try and play around with it to help if I find time :)

@johnkerl
Owner

It's all in master -- this evening's pushes. Everything compiles; there are unit tests for some stable things; it's all in backward-compatibility mode, so nothing is reachable from the mlr executable as yet. c/experimental/csv0.c is the proof-of-concept file.

@johnkerl
Owner

Progress point:

$ cat b.tmp
a,b,c
1,"2,3",4

$ mlr --csvex --odkvp --ofs ';' cat b.tmp
a=1;b=2,3;c=4

(--csvex is transitional until --csv becomes --csvlite and --csvex becomes --csv)

@johnkerl
Owner

At present this CSV reader is about 3.5x slower than the CSV-lite reader. Some performance optimizations can be made after the functionality is delivered but I very much doubt it will ever be as fast as CSV-lite.

$ time mlr --csvlite cat ../data/big.csv > /dev/null

real 0m1.456s
user 0m1.407s
sys  0m0.044s

$ time mlr --csvex cat ../data/big.csv > /dev/null

real 0m6.738s
user 0m6.593s
sys  0m0.136s

$ wc -l ../data/big.csv
 1000001 ../data/big.csv

@johnkerl
Owner

OK, iterating. If you use --csvex as of the most recent push you'll get the RFC-CSV reader/writer.

  • Terminators are hard-coded to CRLF. If you have CR-only or LF-only CSV you found somewhere, that needs either separate manipulation, or a wait until I add programmable RS/FS to the RFC-CSV format.
  • After a bit more testing & cleanup, probably tomorrow or next day, I'll rename --csv to --csvlite and --csvex to --csv. Then we'll be able to say that Miller's CSV format is real CSV.
  • Some code neatening & performance work will follow on, but those won't be directly feature-impactful.
  • There are --quote-all, --quote-none, --quote-minimal, and --quote-numeric options. They're not in the online help as of yet, but will be soon.

@jungle-boogie
Contributor

Hi @johnkerl,

So is this an example of your query working, with mlr compiled from master?

mlr --csvlite --opprint count-distinct -f ST,ClosingDate

or

mlr --csvex --opprint count-distinct -f ST,ClosingDate

@johnkerl
Owner

At the moment:

  • --csvlite == --csv == as-is stuff as in the release announcement, i.e. not RFC-CSV.
  • --csvex new experimental RFC-CSV stuff needing some more unit-test cases etc.

In a day or two once I do some code cleanup & add some test cases:

  • --csvlite will be the as-is stuff in the release announcement.
  • --csv will be the new RFC-CSV

@jungle-boogie
Contributor

👍

@ds108j

ds108j commented Aug 27, 2015

Hello.

I thank you for this update. I have CSV files where the field separator is a space, and even though --csvex works with "," as the field separator, it doesn't seem to work when you want to change the FS.

Example :
$ cat test
1,2,3,4
ab,ac,"a,d",ae

$ mlr --csvex --odkvp cat test
1=ab,2=ac,3=a,d,4=ae

This case is OK, but if I change my FS to a space:
$ cat test2
1 2 3 4
ab ac "a d" ae

$mlr --csvex -p --odkvp cat test2
1=1 2=2 3=3 4=4
1=ab 2=ac 3="a 4=d" 5=ae

How can I use the --csvex with space separator ?

Thanks.

@johnkerl
Owner

Right. --csvlite was, and is, the same not-really-CSV as I originally released, with programmable RS/FS (e.g. you can do TSV, or spaces), whereas --csvex (soon to be --csv) will be only RFC-4180 CSV. RS will be hardcoded to CRLF and FS will be hardcoded to comma. This is to deliver, as soon as possible, RFC-compliant CSV since @ftrotter hit that nail on the head. (As I noted above, read performance is significantly slower than CSV-lite.) Using semantic versioning, this will be Miller v2.0.0.

In the next minor, v2.1.0, I'll do a bit more:

  • --csv will still be compliant by default, but RS/FS will be programmable (as in your note above)
  • RS/FS/PS for all formats will now be able to be multi-character, e.g. you'll be able to use CRLF for DKVP format
  • Read-performance for CSV will be optimized
  • Double-quoting will be supported in DKVP as well as in CSV.

@johnkerl
Owner

I'll put the above info in the 2.0.0 release note as well.

@johnkerl
Owner

Please see https://github.com/johnkerl/miller/releases/tag/v2.0.0. If you find any issues, please open a new issue as a bugfix request.

Thank you thank you thank you for the very informative feedback!!! :)

@johnkerl
Owner

@ftrotter and @SikhNerd tagging you here for closure.

@SikhNerd
Contributor

Looks good 💯 I haven't been able to break it yet.

@johnkerl
Owner

:D

@indera

indera commented Mar 18, 2016

What would you recommend to do when the input file has a broken line like line 3 below:
A, B, C
1, 2, 3
"x"","abc", "cde"

Is there a way to fix it using miller commands?

@johnkerl
Owner

Good question.

On @ftrotter's advice I made the CSV parser RFC-compliant. I didn't put any effort into handling non-compliant CSV (although there's a lot of it out there in the world).

I don't know how non-compliant your data is beyond the above 3-line example, so I'm unsure how to point you at a general solution, but a few ways to handle the above 3-line example are:

  1. cat yourfile.csv | sed 's/"//g' | mlr --csv ...
  2. cat yourfile.csv | sed 's/""/"/g' | mlr --csv ...
  3. mlr --csvlite put '$A =~ "\"" { $A=gsub($A,"\"", "") }' yourfile.csv

In the latter case I'm suggesting the csvlite parser, which isn't RFC-compliant and treats quotes as characters like any others.

Also note that for option 3 you'll need Miller at head, since I introduced the necessary "\"" escape, as well as the pattern-action syntax, after the most recent release (3.4.0).

@johnkerl
Owner

4th option: mlr --csvlite put -S '$A=gsub($A,"\"", "")' yourfile.csv
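For what it's worth, the quote-stripping options above all boil down to deleting every double quote from the line; a Python one-liner equivalent to option 1 (assuming, as in this example, that no field legitimately contains a comma or a meaningful quote):

```python
# Equivalent of `sed 's/"//g'`: delete every double quote so the
# non-compliant line parses as plain comma-separated text.
broken = '"x"","abc", "cde"'
print(broken.replace('"', ''))  # x,abc, cde
```

Like the sed approach, this is lossy: it also destroys any quotes that were protecting embedded commas, so it's only safe on data known not to need them.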

@indera

indera commented Mar 22, 2016

@johnkerl this is awesome 👍

--------------
mlr --version
Miller 3.4.0
---------------
mlr --icsv --opprint --rs lf cat ha.csv
mlr: syntax error: unwrapped double quote at line 2.
-----------------
mlr --csvlite put -S '$A=gsub($A,"\"", "")' ha.csv
A, B, C
1, 2, 3
x,"abc", "cde"

@johnkerl
Owner

:)

@ftrotter
Author

Have not visited this thread in a while and I am thrilled to see the progress.

I do think that handling non-standard CSV is a good feature, but I almost think that barfing with an error message is a better feature... that way I can go to CSV producers and say "see... get your shit together"...

Would also like to point out that the performance mentioned here is excellent for standards-compliant CSV processing; just because it is "worse" does not mean it cannot also be "excellent".

It is quite a compliment to have a feature this complex, which I suggested, now implemented and closed. I will do my best to promote this to all the right people. Moving to Twitter now!!

-FT


8 participants