Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Types and values of quoted strings #64

Closed
alyst opened this issue Mar 9, 2017 · 8 comments · Fixed by #65
Closed

Types and values of quoted strings #64

alyst opened this issue Mar 9, 2017 · 8 comments · Fixed by #65

Comments

@alyst
Copy link
Contributor

alyst commented Mar 9, 2017

AFAIU automatic type detection does not take into account whether the extracted string was quoted or not, which leads to incorrect predictions for the column containing quoted numbers as the first N elements.

Checking for quotes should also be done when checking for null strings: quoted string should be automatically regarded as non-null.

Actually, encountered this while trying to convert RDatasets into using DataTables (datasets/attenu.csv.gz). RDataset has a nice test set of .csv files.

@quinnj
Copy link
Member

quinnj commented Mar 9, 2017

Hmmmm.....we should actually be accounting for quoted strings/other types during type detection. Can you post a specific case (file) where this isn't happening? Definitely a bug.

@alyst
Copy link
Contributor Author

alyst commented Mar 9, 2017

It's attenu.csv.gz
I get

CSV.CSVError("error parsing a `Int64` value on column 3, row 170; encountered 'c'")

@alyst
Copy link
Contributor Author

alyst commented Sep 20, 2017

I've reported this issue and fixed it in #65, but in the master the issue is present once again (the "quoted numbers detected as string column" test that I provided was also modified to expect that the 2nd (quoted) column is Int).
Is it because the logic was changed to ignore the quotes when detecting column types?

Somewhat similar issue is that during type detection quoted nulls ("NA") are detected as nulls.

@nalimilan nalimilan reopened this Sep 21, 2017
@quinnj
Copy link
Member

quinnj commented Sep 21, 2017

@alyst, sorry for what's happened here. With the port to Nulls, I took the time to do quite a number of large refactorings, involving initial type detection, parsing, streaming, etc. In the process, there were some two dozen other issues closed, and I tried to make sure all existing issues stayed resolved, but I was worried that something would regress.

In this case, the behavior that has changed, and that I'd like to support is that quoted fields are not automatically treated as Strings. The reasoning here is that I've personally encountered several different csv sources where, for some reason or another, a system chooses to quote all fields, regardless of being a string or not, or containing characters needing escaping or not.

In the case of the attenu.csv.gz file, the correct way to read that file would now be

julia> df = CSV.read(joinpath(dir, "attenu.csv"); null="NA", types=Dict(3=>Union{Null, String}))

i.e. it's pretty easy to manually specify that the 3rd column should be Strings with null values as "NA".

Does that all make sense? Sorry again if this has messed anything up at all.

@alyst
Copy link
Contributor Author

alyst commented Sep 21, 2017

The mode that doesn't automatically treat quoted columns as strings definitely makes sense. But for me it's rather an indication of the problem with the .csv file.
Maybe it's possible to add an option (say, quoted_values=:string/:detect) specifying whether to always treat quoted columns as strings or to ignore the quotes and try to detect the type of the value.

@alyst
Copy link
Contributor Author

alyst commented Sep 21, 2017

Re attenu.csv.gz, the problem is that the column is inferred as non-null, but then null occurs during actual parsing.

@alyst
Copy link
Contributor Author

alyst commented Sep 21, 2017

FWIW in R, when we remove from attenu a few rows with non-digits in the 3rd column, read.csv(stringsAsFactors=FALSE) imports the 3rd column as int and treats quoted NAs as NA.
readr::read_csv() also imports it as int. But, read_csv() has quoted_na option (TRUE by default), which specifies whether to treat "NA" as NA or as a string (however, for attenu.csv.gz this option doesn't seem to have any effect, the 3rd column at 79th row is always NA).

@quinnj
Copy link
Member

quinnj commented Aug 29, 2018

This should be fixed on master with the switch to CSV.File (CSV.read still relies on the old CSV.Source, but there are plans to switch it over.

Note for now, you can get a NamedTuple of Vectors on master by doing using Tables; table = CSV.File(file; kwargs...) |> columntable

@quinnj quinnj closed this as completed Aug 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants