Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

splitCsv does not handle quoted values containing commas correctly #1102

Closed
wflynny opened this issue Apr 5, 2019 · 6 comments
Closed

splitCsv does not handle quoted values containing commas correctly #1102

wflynny opened this issue Apr 5, 2019 · 6 comments
Labels
Milestone

Comments

@wflynny
Copy link

wflynny commented Apr 5, 2019

Bug report

It appears that .splitCsv does not handle quoted tokens containing commas correctly.

Expected behavior and actual behavior

Given a token wrapped in quotes "1,234", I would expect .splitCsv to parse this as a single value [.., '1,234', ...]. However, it breaks this token into two, leaving a singular quote with each half [..., "1, 234", ...].

Steps to reproduce the problem

Using the following test case above as test.csv

value_1,value_2,value_3
100,10.1%,300
"9,600",98.9%,"24,155"

and parsing as

Channel
    .fromPath("test.csv")
    .splitCsv(header: true, quote: '\"')
    .subscribe { println "$it" }

I've tried a variety of values of quote above to no avail (I'm new to Groovy so advice on how to best specify this option would be helpful).

Program output

Expected output

[value_1: 100, value_2: 10.1%, value_3: 300]
[value_1: '9,600', value_2: 98.9%, value_3: '24,155']

Actual output

[value_1: 100, value_2: 10.1%, value_3: 300]
[value_1: "9, value_2: 600", value_3: 98.9%]

Environment

  • Nextflow version: 19.01.0.5050
  • Java version: 1.8.0_73
  • Operating system: CentOS 6.7

Additional context

I think the issue is related to how StringUtils.splitPreserveAllTokens is used in csvSplitter.groovy. The line is tokenized by , first then quotes are removed, but the tokenization by .splitPreserveAllTokens doesn't respect quotes:

import org.apache.commons.lang.StringUtils
line = '"9,600",86.3%,"23,000"'
println StringUtils.splitPreserveAllTokens(line, ",")

yields 5 tokens:

["9, 600", 86.3%, "23, 000"]

Perhaps regex is needed to extract the proper tokens (example), then quotes can be removed from them:

line = '"9,600",86.3%,"23,000"'
println line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*\$)", -1)

yields the desired 3 tokens

["9,600", 86.3%, "23,000"]
@edgano
Copy link

edgano commented Apr 5, 2019

I'm not sure why you want this behaviour...CSV stands for "Comma-separated values"
Maybe could be nice to be able to indicate the splitter character on the .splitCsv
But I'm not sure it makes sense to not split when a comma is found (if any other splitter char is defined)

@wflynny
Copy link
Author

wflynny commented Apr 5, 2019

@edgano Well, for one, the tools I'm using generate CSV files with quoted values. Further, it seems to be a widely accepted interpretation of the CSV specification.

@edgano
Copy link

edgano commented Apr 5, 2019

Yep, the quotes are not the "problem"...
The problem I see, is to avoid splitting a value with comma. But I assume it will be possible to take the element inside the quotes as a literal

@wflynny
Copy link
Author

wflynny commented Apr 5, 2019

Quotation handling should take precedent over splitting on the separator, be it a comma or another sep value passed to .splitCsv, but also possibly over newline handling as well. Typical CSV parse implementations handle the latter differently, but nonetheless this is a more general problem with .splitCSV than just handling commas.

Anyway, comma handling might be solved just with the regex above, then stripping quotes if they exist from the tokens.

@pditommaso pditommaso added the bug label Apr 5, 2019
@pditommaso
Copy link
Member

Yes, you are right. This is a bug.

@wflynny
Copy link
Author

wflynny commented Apr 10, 2019

@pditommaso Thank you for the quick response and fix. I can confirm this is working for my pipelines using

NXF_VER=19.04.0-SNAPSHOT nextflow run ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants