Patch issue 70 #121

czaefferer · 2019-02-21T16:53:04Z

This PR contains:

Breaking Changes?

yes
no

Unescaped quotes that are neither preceded nor followed by a separator or line break are handled like any other character.

Please Describe Your Changes

fixes issue #70

czaefferer · 2019-04-29T11:23:08Z

Any news or thoughts about this?

shellscape · 2019-05-13T13:27:13Z

My apologies @czaefferer, I have not been able to find the time to review this properly. Hopefully soon.

shellscape · 2019-07-05T15:04:30Z

test/data/unescaped_quotes.csv

+joe,sam,ja"n
+joe,sam,"ja"n"
+joe,"sa
+"m",jan


this test here should actually fail. each line in a csv file indicates a row, and a newline indicates the start of a new row. the result in the snapshot isn't accurate.

I don't think you're right there. This is from RFC 4180 (https://tools.ietf.org/html/rfc4180):

field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA

As I understand this, a newline within a quoted field does not indicate the start of a new row (I wish it did; that would have saved me so much trouble in the past).
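As a quick cross-check of that reading of the grammar, here is a minimal sketch using Python's standard `csv` module rather than this parser. The input string is an illustrative example, not taken from the test file:

```python
import csv
import io

# Illustrative input: the second field is quoted and spans two physical lines.
data = 'joe,"sa\nm",jan\n'

# newline="" keeps the text layer from translating line endings,
# as the csv module documentation recommends.
rows = list(csv.reader(io.StringIO(data, newline="")))

# The embedded newline stays inside the field, so there is a single row.
print(rows)  # [['joe', 'sa\nm', 'jan']]
```

This only shows how one other implementation treats the `escaped` production; it says nothing about what this module currently does.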

hm you may be right there. I'm giving things a big refresh in order to try and alleviate the pain of working through tokenizing things like this, remove some complication. hang tight and I'll follow up when I have that work done.

Of curious note is this bit:

7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example: "aaa","b""bb","ccc"

That would seem to indicate that joe,sam,"ja"n" is invalid, would it not?
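For comparison, a strictly conforming reader would be expected to reject that line. The sketch below uses Python's standard `csv` module with `strict=True` as a stand-in; the helper name `parse_strict` is made up for illustration and is not part of this module:

```python
import csv

def parse_strict(line):
    # strict=True makes the csv reader raise csv.Error on malformed input
    # instead of guessing what was meant.
    return next(csv.reader([line], strict=True))

# The RFC's own example parses; the doubled quote becomes one literal quote.
valid = parse_strict('"aaa","b""bb","ccc"')
print(valid)  # ['aaa', 'b"bb', 'ccc']

# The line from the test file has a lone quote inside a quoted field.
try:
    parse_strict('joe,sam,"ja"n"')
    rejected = False
except csv.Error as exc:
    rejected = True
    print("rejected:", exc)
```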

Yes, it should be invalid. However, if a single double-quote is found which isn't followed by a separator or a newline, it can be identified as invalid and could then be treated like any other character. This does indeed read the standard a bit more laxly, but then again, so do a lot of other options, like configurable separators and quote and escape characters. To me it would be totally fine if this behaviour had to be enabled by configuration. But considering how often I see this intentionally used in real-world files, I think this is a far better solution than invalidating the whole record (which, given the alternative, would not be acceptable in my use cases).

Right now, however, the parser neither invalidates rows nor treats misplaced double-quotes as regular text. Usually it merges two rows, but I've also seen it merge 8K out of 40K rows due to a single misplaced double-quote.
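The lenient rule described above could be sketched roughly as follows. This is a standalone Python illustration of one reading of the rule, not the code in this PR; `parse_lenient` and its `sep` parameter are hypothetical names, and the input is the content of test/data/unescaped_quotes.csv as quoted earlier in the thread:

```python
def parse_lenient(text, sep=","):
    """Parse CSV text, treating misplaced double-quotes as ordinary characters."""
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        nxt = text[i + 1] if i + 1 < n else ""
        if in_quotes:
            if ch == '"' and nxt == '"':
                field.append('"')  # RFC 4180 escape: "" means one literal quote
                i += 2
                continue
            if ch == '"' and (nxt == "" or nxt in (sep, "\r", "\n")):
                in_quotes = False  # a real closing quote
            else:
                field.append(ch)   # includes lone quotes and embedded newlines
        elif ch == '"' and not field:
            in_quotes = True       # a quote at the start of a field opens it
        elif ch == sep:
            row.append("".join(field))
            field = []
        elif ch == "\r" and nxt == "\n":
            pass                   # the "\n" that follows ends the row
        elif ch == "\n":
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
        else:
            field.append(ch)       # includes quotes in the middle of a bare field
        i += 1
    if field or row:               # last row without a trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows


sample = 'joe,sam,ja"n\njoe,sam,"ja"n"\njoe,"sa\n"m",jan\n'
parsed = parse_lenient(sample)
for r in parsed:
    print(r)
```

Under these assumptions the three records stay separate: no row is merged with its neighbour, and the stray quotes end up in the field text.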

Yeah, it certainly doesn't follow the spec at present. I'm working on a proper tokenizer for this that'll make implementing parts of the spec far easier. I work with the PostCSS folks quite a bit and maintain the postcss-less and postcss-values-parser modules, which both follow the PostCSS tokenizer model, and it's amazing how much easier that is to work with than an ad-hoc parser like the one in the module at present.
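To give a feel for the tokenizer-first approach, here is a rough Python sketch that splits CSV text into a flat token stream a later parsing stage could consume. The token names and the `tokenize` function are assumptions made for illustration; this is neither the PostCSS tokenizer nor the planned rewrite:

```python
import re

# One named alternative per token kind; every character matches exactly one.
TOKEN_RE = re.compile(
    r'(?P<quote>")'
    r'|(?P<sep>,)'
    r'|(?P<newline>\r\n|\n|\r)'
    r'|(?P<text>[^",\r\n]+)'
)

def tokenize(text):
    """Yield (kind, value) pairs without deciding yet what a quote means."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()


tokens = list(tokenize('joe,"sa\nm"'))
for token in tokens:
    print(token)
```

The point of the split is that decisions such as "is this quote opening, closing, escaped, or stray?" move into a separate stage that looks at neighbouring tokens, instead of being interleaved with character scanning.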

shellscape · 2019-07-05T23:44:49Z

FWIW I confirmed that Excel does export as "thing with ""quote" for a cell that has a double quote.
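The same doubling can be reproduced with Python's standard `csv` module, whose default dialect is named `excel`. This is only a sketch of the escaping rule, not a test against Excel itself:

```python
import csv
import io

buf = io.StringIO(newline="")
# The default writer quotes a field that contains the quote character
# and doubles each embedded quote.
csv.writer(buf).writerow(['thing with "quote'])
encoded = buf.getvalue()
print(repr(encoded))  # '"thing with ""quote"\r\n'

# Reading it back restores the single embedded quote.
decoded = next(csv.reader([encoded]))
print(decoded)  # ['thing with "quote']
```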

czaefferer added 2 commits February 18, 2019 11:43

fix: fixes mafintosh#70. unescaped quotes inside of cells

2850721

fix: handle crlf

4d400c9
