How should Miller behave with double quotes in a tab-delimited file (--tsv)? #238

Closed
ernstki opened this issue Apr 5, 2019 · 3 comments

ernstki commented Apr 5, 2019

I have these five sample input files:

$ cat test_escaped_quotes.tsv
one ⇥ two ⇥ three
value one ⇥ value two ⇥ value three has a ""quoted string""

$ cat test_rfc4180.tsv
one ⇥ two ⇥ three
"value one" ⇥ "value two" ⇥ "value three has a ""quoted string"""

$ cat test.tsv
one ⇥ two ⇥ three
value one ⇥ value two ⇥ value three has a "quoted string"

$ cat test_with_escaped_unpaired_double_quote.tsv
floor ⇥ room ⇥ asset
2nd ⇥ 217 ⇥ 36"" flatscreen LCD

$ cat test_with_unpaired_double_quote.tsv
floor ⇥ room ⇥ asset
2nd ⇥ 217 ⇥ 36" flatscreen LCD

And when I cat them with mlr --tsv, none of them really yield the "expected" (correct?) behavior:

$ for file in test*.tsv; do
>   printf '%s\n' "$file"; mlr --tsv cat "$file"; printf '\n\n'
> done

test_escaped_quotes.tsv
mlr: syntax error: unwrapped double quote at line 1.


test_rfc4180.tsv
one	two	three
value one	value two	"value three has a ""quoted string"""


test.tsv
mlr: syntax error: unwrapped double quote at line 1.


test_with_escaped_unpaired_double_quote.tsv
mlr: syntax error: unwrapped double quote at line 1.


test_with_unpaired_double_quote.tsv
mlr: syntax error: unwrapped double quote at line 1.

I understand, after reading #4, that tab-delimited support with --tsv is basically RFC 4180 CSV support, with tab as the delimiter. But if the TSV support really is just "RFC 4180 with a tab delimiter," surely one of the above files should've satisfied the RFC 4180 requirement for "escaping" double quotes:

(2.5) Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

(2.7) If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

...and printed out just value three has a "quoted string" as one would expect for the third column?
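To make rules 2.5 and 2.7 concrete, here is a rough Python sketch that checks a single raw (still-encoded) field against them. The function name `is_valid_rfc4180_field` is hypothetical and this is not how Miller validates input; it only illustrates the two quoted rules, applied to the third column of each sample file.

```python
def is_valid_rfc4180_field(raw: str) -> bool:
    """Check one raw field against RFC 4180 rules 2.5 and 2.7 (illustrative only)."""
    if '"' not in raw:
        # Rule 2.5: an unquoted field is fine as long as it has no double quotes.
        return True
    if len(raw) < 2 or not (raw.startswith('"') and raw.endswith('"')):
        # A field containing double quotes must be enclosed in double quotes.
        return False
    inner = raw[1:-1]
    # Rule 2.7: every double quote inside the enclosing quotes must be doubled,
    # so removing all doubled pairs must leave no quote behind.
    return '"' not in inner.replace('""', '')


samples = {
    "test_escaped_quotes.tsv": 'value three has a ""quoted string""',
    "test_rfc4180.tsv": '"value three has a ""quoted string"""',
    "test.tsv": 'value three has a "quoted string"',
    "test_with_escaped_unpaired_double_quote.tsv": '36"" flatscreen LCD',
    "test_with_unpaired_double_quote.tsv": '36" flatscreen LCD',
}

for name, field in samples.items():
    print(name, is_valid_rfc4180_field(field))
```

Under these two rules only the field from test_rfc4180.tsv passes, because it is the only one where the embedded quotes are both doubled and wrapped in enclosing quotes.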

If TSV is considered to be its own animal, then why the syntax error: unwrapped double quote at all? Unescaped double quotes should just be allowed everywhere.

My workaround has been to use --tsvlite, but I'm starting to feel like --tsvlite should just be --tsv and --tsv should be called something like --tsv-strict, because it's really hard to make it happy when it comes to double quotes within fields.

I feel pretty willing to try to resolve this problem in a personal fork and open a PR, but what is the correct/expected behavior in this case?

aborruso commented Apr 5, 2019

Using the only RFC 4180 compliant file (test_rfc4180.tsv):

one	two	three
"value one"	"value two"	"value three has a ""quoted string"""

I have no error with mlr --tsv cat input.tsv.

The other files do not work because they are not RFC 4180 compliant. The RFC 4180 compliant way to write 36" is this:

$ cat input.tsv
floor	room	asset
2nd	217	"36"" flatscreen LCD"

Pretty-printing makes the result easier to see. mlr --t2p --barred cat input.tsv gives you:

+-------+------+--------------------+
| floor | room | asset              |
+-------+------+--------------------+
| 2nd   | 217  | 36" flatscreen LCD |
+-------+------+--------------------+

To read TSV that is not RFC 4180 compliant, you should use --tsvlite.
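The difference between the two reading modes can be sketched in Python. This is only an analogy, not Miller's code: Python's `csv` module stands in for quote-aware parsing, and a plain split on tabs stands in for a "lite" reader that treats double quotes as ordinary characters. The helper names `parse_quoted` and `parse_lite` are made up for illustration.

```python
import csv
import io

# The RFC 4180 style data row from above, with tab as the delimiter.
line = '2nd\t217\t"36"" flatscreen LCD"\n'


def parse_quoted(text: str) -> list[str]:
    """Parse one tab-delimited line, honoring enclosing quotes and doubled quotes."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"', doublequote=True)
    return next(reader)


def parse_lite(text: str) -> list[str]:
    """Split on tabs only; double quotes are kept exactly as written."""
    return text.rstrip("\n").split("\t")


strict_fields = parse_quoted(line)
lite_fields = parse_lite(line)

print(strict_fields[2])  # 36" flatscreen LCD
print(lite_fields[2])    # "36"" flatscreen LCD"
```

The quote-aware parse strips the enclosing quotes and collapses the doubled quote, while the tab-only split passes the field through untouched, which is why unquoted fields with stray double quotes are only safe with the second approach.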

ernstki commented Apr 5, 2019

OK, thanks for your explanation, this is slowly starting to make more sense. The double quotes are required to wrap double quotes, and the wrapped quotes have to be "escaped" by doubling them.

And since --tsv implies --otsv, too, the output is RFC 4180 compliant, which happens to remove the unnecessary outer quotes from the first two columns. That was the part that tripped me up, I think.

I would've at least expected all three fields to have been quoted, in order to be compliant with the RFC, but upon further consideration, I guess it isn't really required to quote fields just because they contain spaces.
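The "quote only when needed" output behavior can be reproduced with a small Python sketch. It assumes Python's `csv.writer` with `QUOTE_MINIMAL` as a stand-in for an RFC 4180 style writer; it is not Miller's implementation, and `write_tsv_row` is a hypothetical helper name.

```python
import csv
import io


def write_tsv_row(fields: list[str]) -> str:
    """Encode one row as tab-delimited text, quoting only fields that need it."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter="\t",
        quoting=csv.QUOTE_MINIMAL,  # quote only fields containing a delimiter, quote, or newline
        lineterminator="\n",
    )
    writer.writerow(fields)
    return buf.getvalue()


row = ["value one", "value two", 'value three has a "quoted string"']
encoded = write_tsv_row(row)
print(encoded, end="")
```

The first two fields contain only spaces, so they are written bare; the third contains double quotes, so it is wrapped in quotes and its inner quotes are doubled, matching the shape of the output shown for test_rfc4180.tsv.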

ernstki commented Apr 5, 2019

After @aborruso's clarifications, I see the problem was with my understanding of the RFC, not any fault of Miller's. Thank you, sir! :)

ernstki closed this as completed Apr 5, 2019