Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multi bug fixes for CSV parsing #5388

Merged
merged 7 commits into from
Mar 24, 2021
Merged

multi bug fixes for CSV parsing #5388

merged 7 commits into from
Mar 24, 2021

Conversation

sebhrusen
Copy link
Contributor

@sebhrusen sebhrusen commented Mar 19, 2021

https://h2oai.atlassian.net/browse/PUBDEV-7996

few bugs on backend setup and parsing logic:

  • parseSetup logic always consider double quotes as a potential quote, even in single quote mode.
  • parseSetup logic ignores escape character (\) before the quote char when guessing the separator.
  • parseSetup logic ignores that 2 consecutive quotes inside a quoted string is a way to escape the second quote.
  • parsing state machine ignores escape character \ as a way to escape quotes inside a quoted string.

Also TestUtil hardcoded (without explanation) some parsing methods with singleQuotes=true instead of default singleQuotes=false: given that there are about 100 different parsing methods there currently, it made it difficult to know for sure which quote was used in the tests.
==> those parsing utility methods are just nightmare, all tests should switch to ParseSetupTransformer if they can't use defaults.

Future improvement/PR: autodetect quoting character in ParseSetup.

@sebhrusen sebhrusen marked this pull request as draft March 19, 2021 23:32
@sebhrusen sebhrusen marked this pull request as ready for review March 20, 2021 01:06
@sebhrusen sebhrusen added the core label Mar 21, 2021
@michalkurka
Copy link
Contributor

based on the documentation

#' @param quotechar A hint for the parser which character to expect as quoting character. None (default) means autodetection.

and

single_qouotes <- if (is.null(quotechar) || quotechar != "'") FALSE else TRUE

single_quotes = FALSE should actually mean autodetect. I don't think the java code is doing that right now (I might be missing something). What is the expected behavior?

@sebhrusen
Copy link
Contributor Author

single_quotes = FALSE should actually mean autodetect

according to doc, maybe it was the intent, but the pseudo-autodetection logic was just systematically considering " as a quote, even with single_quote=True, breaking the quote counting in that mode.
Without mentioning that using a boolean to represent 3 values (single quote, double quote, auto detect) is soooooo wrong…

@michalkurka
Copy link
Contributor

Maybe the documentation needs to be changed as well then.

If it didn't work anyway, the change of a "documented" behavior doesn't matter. We are free to change it in whichever way that makes sense.

@michalkurka michalkurka reopened this Mar 22, 2021
@sebhrusen sebhrusen changed the base branch from master to seb_pubdev-7996-client March 22, 2021 23:04
@sebhrusen sebhrusen changed the base branch from seb_pubdev-7996-client to rel-zipf March 23, 2021 14:50
@sebhrusen
Copy link
Contributor Author

@michalkurka tested and (almost) ready to merge.
Currently only fails on

  • the 3 R tests previously posted and currently being fixed.
  • the new Py test checking a real-life csv example with \ escaped quotes: waiting for ops to make this new file available for the builds.

Would be nice to have this \ support available for this release: please tell if you prefer to have it in next fix build.

@michalkurka
Copy link
Contributor

@sebhrusen I don't see a an issue in the modified code, the question is what other code should have been modified as well :)

The change LGTM, eventually we should add support for autodetecting single quotes vs double quotes. Maybe it is as easy as running determineTokens one time for single and one time for double and comparing the results.

@sebhrusen
Copy link
Contributor Author

yes, for the autodetect, I'll implement it before next fix build, and my first idea is the one you suggest as well.

@sebhrusen
Copy link
Contributor Author

@michalkurka the new small data file was added and the new parsing test passed.
This one is now also ready to merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants