Reading of malformed CSV #54

Closed
bkamins opened this issue Jan 2, 2017 · 2 comments

bkamins commented Jan 2, 2017

Below I describe three behaviors of CSV.read on malformed CSV files that I found unexpected.

I have the following file:

A;B;C
1,1,10
2,0,16

which is malformed because the header mistakenly uses ; as the delimiter instead of ,.

The behavior of three standard utilities for reading such a file in Julia is:

  • readcsv from Base: loads the whole file and replaces the missing column names with empty strings;
  • readtable from DataFrames throws an error;
  • CSV.read reads only a single column of data into a data frame.

Additionally:

  • adding empty lines at the end of a file creates data rows with all nulls (as opposed to readcsv and readtable);
  • if a data row has fewer fields than the header (as in the example file below), null values are silently assigned to the missing entries (readtable throws an error).

A,B,C
1,1,10
6,1

I have not found these behaviors described in the documentation of CSV.read, so I am not sure what the intended functionality is. I would recommend at least giving a warning in those cases.
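To make the cases above concrete, here is a minimal sketch of the kind of check that could back such a warning. It uses only Base Julia (written for Julia 1.x, not the version current when this issue was opened) and is not how CSV.jl works internally. The name check_field_counts and its return format are hypothetical, and the naive split does not handle quoted fields that contain the delimiter.

```julia
# Hypothetical helper, NOT part of CSV.jl: report data rows whose field
# count differs from the header's. Assumption: fields are never quoted,
# so a plain split on the delimiter is good enough for illustration.
function check_field_counts(io::IO; delim::Char=',')
    problems = Tuple{Int,Int,Int}[]   # (line number, expected, found)
    expected = 0
    for (lineno, line) in enumerate(eachline(io))
        found = length(split(line, delim))
        if lineno == 1
            expected = found          # the header defines the expected width
        elseif found != expected
            push!(problems, (lineno, expected, found))
        end
    end
    return problems
end

# Short data row: line 3 has 2 fields where the header promises 3.
short_row = check_field_counts(IOBuffer("A,B,C\n1,1,10\n6,1"))
# returns [(3, 3, 2)]

# Header written with ';': it parses as 1 column, so both data rows mismatch.
wrong_header = check_field_counts(IOBuffer("A;B;C\n1,1,10\n2,0,16"))
# returns [(2, 1, 3), (3, 1, 3)]
```

An empty trailing line counts as a single empty field in this sketch, so it would be reported as well. Returning the list of problems rather than throwing leaves it to the caller to decide whether to warn or to error.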

quinnj commented Jan 11, 2017

When I started developing this package, I kind of implicitly optimized for the "well-formed" CSV file case so as to be able to focus on performance. That's part of the reason there aren't as many "warnings" and such. Happy to accept PRs that improve things without sacrificing performance. In my mind, I'd like to have a single CSV.isvalid(file) that could do some thorough checking and give some great information about what might be wrong, without needing to cram it all into CSV.read.

quinnj added a commit that referenced this issue Sep 12, 2017

quinnj commented Sep 12, 2017

We now have the CSV.validate(file) function, which takes the same arguments as CSV.read but gives informative error messages for improperly formatted CSV data.

For the two example files from this issue:

julia> io = IOBuffer("""A,B,C
       1,1,10
       6,1""")
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=16, maxsize=Inf, ptr=1, mark=-1)

julia> CSV.validate(io)
ERROR: CSV.ExpectedMoreColumnsError("row=2, col=2: expected 3 columns, parsed 2, but parsing encountered unexpected end-of-file (EOF); parsed row: '6,1'")
Stacktrace:
 [1] validate(::CSV.Source{Base.GenericIOBuffer{Array{UInt8,1}},Nulls.Null}) at /Users/jacobquinn/.julia/v0.7/CSV/src/validate.jl:26
 [2] #validate#40(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Type{T} where T) at /Users/jacobquinn/.julia/v0.7/CSV/src/validate.jl:38
 [3] validate(::Base.GenericIOBuffer{Array{UInt8,1}}) at /Users/jacobquinn/.julia/v0.7/CSV/src/validate.jl:38

julia> io = IOBuffer("""A;B;C
       1,1,10
       2,0,16""")
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=19, maxsize=Inf, ptr=1, mark=-1)

julia> CSV.validate(io)
ERROR: CSV.TooManyColumnsError("row=1, col=1: expected 1 columns then a newline or EOF, but parsing encountered another delimiter: ','; parsed row: '1'")
Stacktrace:
 [1] validate(::CSV.Source{Base.GenericIOBuffer{Array{UInt8,1}},Nulls.Null}) at /Users/jacobquinn/.julia/v0.7/CSV/src/validate.jl:30
 [2] #validate#40(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Type{T} where T) at /Users/jacobquinn/.julia/v0.7/CSV/src/validate.jl:38
 [3] validate(::Base.GenericIOBuffer{Array{UInt8,1}}) at /Users/jacobquinn/.julia/v0.7/CSV/src/validate.jl:38

julia> io = IOBuffer("""A;B;C
       1,1,10
       2,0,16""")
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=19, maxsize=Inf, ptr=1, mark=-1)

julia> CSV.validate(io; delim=';')
ERROR: CSV.ExpectedMoreColumnsError("row=1, col=1: expected 3 columns, parsed 1, but parsing encountered unexpected newline; parsed row: '1,1,10'")
Stacktrace:
 [1] validate(::CSV.Source{Base.GenericIOBuffer{Array{UInt8,1}},Nulls.Null}) at /Users/jacobquinn/.julia/v0.7/CSV/src/validate.jl:26
 [2] #validate#40(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Type{T} where T) at /Users/jacobquinn/.julia/v0.7/CSV/src/validate.jl:38
 [3] (::getfield(CSV, Symbol("#kw##validate")))(::Array{Any,1}, ::typeof(CSV.validate), ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Type{T} where T) at ./<missing>:0 (repeats 2 times)
