multi bug fixes for CSV parsing #5388

sebhrusen · 2021-03-19T21:47:05Z

https://h2oai.atlassian.net/browse/PUBDEV-7996

Fixes a few bugs in the backend setup and parsing logic:

parseSetup logic always considers double quotes as a potential quote, even in single-quote mode.
parseSetup logic ignores the escape character (\) before the quote char when guessing the separator.
parseSetup logic ignores that 2 consecutive quotes inside a quoted string are a way to escape the second quote.
The parsing state machine ignores the escape character \ as a way to escape quotes inside a quoted string.
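
To make the four cases above concrete, here is a minimal sketch of a line splitter that handles them. It is not H2O's Java parser: `split_line` and its parameters are hypothetical names used only for illustration, and the exact escaping rules are assumed from the descriptions in the list.

```python
def split_line(line, sep=",", quote='"', escape="\\"):
    """Split one CSV line into fields (illustrative sketch, not H2O's parser).

    Only `quote` is treated as a quoting character, so in single-quote mode
    a double quote is ordinary data.
    """
    fields, buf = [], []
    in_quotes = False
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == escape and i + 1 < n and line[i + 1] == quote:
                # Escape char followed by the quote char: keep a literal quote.
                buf.append(quote)
                i += 2
                continue
            if ch == quote:
                if i + 1 < n and line[i + 1] == quote:
                    # Two consecutive quotes inside a quoted string: literal quote.
                    buf.append(quote)
                    i += 2
                    continue
                in_quotes = False  # closing quote
            else:
                buf.append(ch)
        elif ch == quote:
            in_quotes = True  # opening quote
        elif ch == sep:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append("".join(buf))
    return fields
```

For example, `split_line("'it\"s, x',2", quote="'")` keeps the double quote as data and does not split on the comma inside the single-quoted field.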

Also, TestUtil hardcoded (without explanation) singleQuotes=true in some parsing methods instead of the default singleQuotes=false. Given that there are currently about 100 different parsing methods there, this made it difficult to know for sure which quote was used in the tests.
==> those parsing utility methods are just a nightmare; all tests should switch to ParseSetupTransformer if they can't use the defaults.

Future improvement/PR: autodetect quoting character in ParseSetup.

michalkurka · 2021-03-21T23:26:46Z

based on the documentation

#' @param quotechar A hint for the parser which character to expect as quoting character. None (default) means autodetection.

and

single_quotes <- if (is.null(quotechar) || quotechar != "'") FALSE else TRUE

single_quotes = FALSE should actually mean autodetect. I don't think the Java code is doing that right now (I might be missing something). What is the expected behavior?

sebhrusen · 2021-03-22T15:32:13Z

> single_quotes = FALSE should actually mean autodetect

According to the doc, maybe that was the intent, but the pseudo-autodetection logic was just systematically considering " as a quote, even with single_quote=True, breaking the quote counting in that mode.
Not to mention that using a boolean to represent 3 values (single quote, double quote, autodetect) is soooooo wrong…
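
As an illustration of the alternative to a boolean, a three-valued type makes each case explicit. This is a hypothetical sketch: `QuoteMode` and `from_quotechar` are invented names and are not part of H2O's API.

```python
from enum import Enum


class QuoteMode(Enum):
    """The three states a single boolean cannot represent."""
    SINGLE = "single"
    DOUBLE = "double"
    AUTO = "auto"


def from_quotechar(quotechar):
    """Map a client-side quotechar hint to an explicit mode (hypothetical helper)."""
    if quotechar is None:
        return QuoteMode.AUTO  # no hint given: autodetect
    if quotechar == "'":
        return QuoteMode.SINGLE
    if quotechar == '"':
        return QuoteMode.DOUBLE
    raise ValueError(f"unsupported quote character: {quotechar!r}")
```

With this shape, "no hint" and "double quotes" are no longer collapsed into the same FALSE value.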

michalkurka · 2021-03-22T15:36:31Z

Maybe the documentation needs to be changed as well then.

If it didn't work anyway, the change of a "documented" behavior doesn't matter. We are free to change it in whichever way that makes sense.

sebhrusen · 2021-03-23T14:54:10Z

@michalkurka tested and (almost) ready to merge.
Currently it only fails on:

the 3 R tests previously posted, which are currently being fixed.
the new Py test checking a real-life csv example with \ escaped quotes: waiting for ops to make this new file available for the builds.

It would be nice to have this \ support available for this release: please tell me if you prefer to have it in the next fix build.

michalkurka · 2021-03-23T15:19:40Z

@sebhrusen I don't see an issue in the modified code; the question is what other code should have been modified as well :)

The change LGTM. Eventually we should add support for autodetecting single quotes vs double quotes. Maybe it is as easy as running determineTokens once for single and once for double and comparing the results.
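
A rough sketch of that idea, under stated assumptions: `count_fields` below is a hypothetical stand-in for determineTokens (whose real signature is not shown in this thread), and the scoring rule used to compare the two runs (prefer the quote character that yields the most lines agreeing on a column count, skipping lines with unbalanced quotes) is an assumption, not H2O's behavior.

```python
from collections import Counter


def count_fields(line, sep, quote, escape="\\"):
    """Count fields in a line, ignoring separators inside quoted strings.

    Returns (field_count, unbalanced), where unbalanced is True if the line
    ends while still inside a quoted string.
    """
    count, in_quotes, i = 1, False, 0
    while i < len(line):
        ch = line[i]
        if in_quotes and ch == escape and line[i + 1:i + 2] == quote:
            i += 2  # skip a backslash-escaped quote
            continue
        if ch == quote:
            if in_quotes and line[i + 1:i + 2] == quote:
                i += 2  # skip a doubled quote
                continue
            in_quotes = not in_quotes
        elif ch == sep and not in_quotes:
            count += 1
        i += 1
    return count, in_quotes


def guess_quote(lines, sep=",", candidates=('"', "'")):
    """Run the count once per candidate quote and keep the most consistent one."""
    best, best_score = candidates[0], -1
    for quote in candidates:
        results = [count_fields(line, sep, quote) for line in lines]
        balanced = [n for n, unbalanced in results if not unbalanced]
        # Score: how many lines share the most common column count.
        score = max(Counter(balanced).values(), default=0)
        if score > best_score:  # ties keep the earlier candidate (double quote)
            best, best_score = quote, score
    return best
```

On a sample such as `["id,text", "1,'a, b'", "2,'c, d'"]`, the single-quote run gives every line the same column count while the double-quote run does not, so the single quote is selected.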

sebhrusen · 2021-03-23T15:28:36Z

Yes, I'll implement the autodetect before the next fix build, and my first idea is the one you suggest as well.

sebhrusen · 2021-03-23T23:54:47Z

@michalkurka the new small data file was added and the new parsing test passed.
This one is now also ready to merge.

sebhrusen requested a review from michalkurka March 19, 2021 21:53

sebhrusen marked this pull request as draft March 19, 2021 23:32

sebhrusen marked this pull request as ready for review March 20, 2021 01:06

sebhrusen force-pushed the seb_pubdev-7996 branch from 33f69ae to 1f332af Compare March 20, 2021 23:44

sebhrusen added the core label Mar 21, 2021

michalkurka closed this Mar 22, 2021

michalkurka reopened this Mar 22, 2021

Sebastien Poirier added 4 commits March 23, 2021 00:01

multi bug fixes for CSV parsing

81b9d49

improve handling of escape char and double quotes in both setup for detection and during parsing

f78ad5f

revert client changes, now in branch seb_pubdev-7995-client

2a22968

ensure that TestUtil parser use double quotes by default (actual default)

6f33739

sebhrusen force-pushed the seb_pubdev-7996 branch from c55ad4d to 6f33739 Compare March 22, 2021 23:04

sebhrusen changed the base branch from master to seb_pubdev-7996-client March 22, 2021 23:04

Sebastien Poirier added 3 commits March 23, 2021 00:06

revert client changes after bad rebase

0d929ab

added test for escape char

e0a2b2d

added unit test for escape char

fd43e46

sebhrusen changed the base branch from seb_pubdev-7996-client to rel-zipf March 23, 2021 14:50

michalkurka approved these changes Mar 23, 2021

michalkurka merged commit 4672057 into rel-zipf Mar 24, 2021

h2o-ops mentioned this pull request May 14, 2023

H2O CSV parser fails handling commas in quoted values #7679

Closed
