Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

option to ignore whitespace stripping within quotes #109

Closed
li1 opened this issue Feb 22, 2022 · 3 comments · Fixed by #112
Closed

option to ignore whitespace stripping within quotes #109

li1 opened this issue Feb 22, 2022 · 3 comments · Fixed by #112

Comments

@li1
Copy link

li1 commented Feb 22, 2022

Hey @quinnj,

I'm rewriting our data loading at the moment, migrating to Parsers.jl.

My request is more or less the opposite of #106: For CSV parsing, it would be great to provide an option that allows us strip whitespaces around unquoted fields, but leave it within quotes.

For example, a CSV

A, B   ,  C,D     
      "hello", "good day"     ,      "   same same     "   ,   whatever         

should Ideally parse into ["A", "B", "C", "D"] for the first line and

["hello", "good day", "   same same     ", "whatever"]`

for the second.

Would it be straightforward to add that as an option?

@quinnj
Copy link
Member

quinnj commented Apr 19, 2022

Sorry for the slow response here. @nickrobinson251, as I consider this, it seems to me that it's actually incorrect in our original stripwhitespace implementation to strip whitespace within quoted strings. Like, the point of quoting a string is you want to preserve what's inside, including whitespace.

I think I perhaps just got too over-eager in that first implementation and didn't realize that we're over-stripping. Does that make sense? Do we think there would still be a use-case for stripping the whitespace inside quotes? I guess there are sometimes cases where a csv writer just quotes everything and you might want to also be able to strip inside quotes, but that feels like an additional option we'd want to be opt-in, since you're electing to ignore stuff inside your quotes.

@nickrobinson251
Copy link
Collaborator

nickrobinson251 commented Apr 19, 2022

Do we think there would still be a use-case for stripping the whitespace inside quotes? I guess there are sometimes cases where a csv writer just quotes everything and you might want to also be able to strip inside quotes, but that feels like an additional option we'd want to be opt-in, since you're electing to ignore stuff inside your quotes.

Yeah, unfortunately i have this use-case, so it does exist 😞

I agree stripping whitespace within quotes should be opt-in and not the default for what stripwhitespace does (unlike currently)

quinnj added a commit that referenced this issue Apr 21, 2022
Fixes #109. As noted in that issue, stripping whitespace *within* quoted
strings, IMO, should be considered a bug, since one of the primary
reasons for quoting strings in various applications is to delineate the
exact characters that make up the string. This PR fixes
`stripwhitespace` to preserve whitepace encountered within strings, and
only strip whitespace for non-quoted strings (leading or trailing) and
leading/trailing around quoted fields.

On the other hand, there are legitimate use-cases for also stripping
whitespace within quoted strings, so we add a new opt-in `stripquoted`
keyword argument that allows the additional precision of also stripping
whitespace inside quotes. Note that passing `stripquoted=true` implies
`stripwhitespace=true`, so it can be considered a "stronger" version of
`stripewhitespace`.
@quinnj
Copy link
Member

quinnj commented Apr 21, 2022

Alright, PR is up (#112) to fix stripwhitespace to preserve whitespace inside quoted string, but also adds stripquoted to allow stripping inside quotes.

quinnj added a commit that referenced this issue Apr 21, 2022
* Add new stripquoted keyword arg and fix stripwhitespace

Fixes #109. As noted in that issue, stripping whitespace *within* quoted
strings, IMO, should be considered a bug, since one of the primary
reasons for quoting strings in various applications is to delineate the
exact characters that make up the string. This PR fixes
`stripwhitespace` to preserve whitepace encountered within strings, and
only strip whitespace for non-quoted strings (leading or trailing) and
leading/trailing around quoted fields.

On the other hand, there are legitimate use-cases for also stripping
whitespace within quoted strings, so we add a new opt-in `stripquoted`
keyword argument that allows the additional precision of also stripping
whitespace inside quotes. Note that passing `stripquoted=true` implies
`stripwhitespace=true`, so it can be considered a "stronger" version of
`stripewhitespace`.

* Update src/Parsers.jl

Co-authored-by: Nick Robinson <npr251@gmail.com>

* Update test/runtests.jl

Co-authored-by: Nick Robinson <npr251@gmail.com>

Co-authored-by: Nick Robinson <npr251@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants