-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for files with whitespace not comma as delimiter #70
Conversation
src/parsing.jl
Outdated
@@ -368,6 +422,7 @@ end | |||
### | |||
|
|||
function parse_row!(rec::R, bytes, pos, len, options) where {I, R <: MultiTerminalDCLines{I}} | |||
pos = checkdelim!(bytes, pos, len, options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quinnj how does CSV.jl handle leading whitespace at the start of a row (when whitespace is the delim)? Does CSV.jl manage to avoid calling checkdelim!
in every parserow
call somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Parsers.checkcmtemptylines
, we check if a new row starts with a comment character or is "empty" (i.e. is followed immediately by another newline). We call that anywhere right after we parse a newline.
Otherwise, individual type parsers have their own logic for skipping whitespace, so the "first cell" of the row would be able to skip leading whitespace for the row.
If the delim is whitespace, however, then by nature it's significant, right? So we would treat it as empty cells. (i.e. consecutive delimiters like ,,
mean an empty cell.
UNLESS we're talking ignorerepeated=true
, in which case, yes, we'll parse the newline and any following delimiters until we reach a non delimiter which should be the start of our first cell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, sorry should have been clearer, our case here is delim=' '
and ignorerepeated=true
Basically we have space-seperated values, which can be seperated by an arbitrary number of spaces, and the first value might also have arbitrary number of leading spaces
e.g. data with 5 columns can look like:
8 'ABC ' 138.00 1 .000
19 'ABCDEFGH' 69.00 1 .000
In CSV, i could see for the very first row we handle this case in Context
here:
https://github.com/JuliaData/CSV.jl/blob/3ebd2c9d3baec32512a662605cf0fc378fdf1644/src/context.jl#L382-L389
# step 4a: if we're ignoring repeated delimiters, then we ignore any
# that start a row, so we need to check if we need to adjust our headerpos/datapos
if ignorerepeated
if headerpos > 0
headerpos = Parsers.checkdelim!(buf, headerpos, len, options)
end
datapos = Parsers.checkdelim!(buf, datapos, len, options)
end
but i couldn't see how we handle this in all other lines
turns out xparse
handles this itself (using Parsers.checkcmtemptylines
) and the problem in PowerFlowData.jl was that our next_line
function which we use to skip lines between sections (and which is a simplified version of checkcmtemptylines
) wasn't handling repeated delimiters at the start of the next line
thanks for the pointers -- changes needed here are much simpler now!
Codecov Report
@@ Coverage Diff @@
## main #70 +/- ##
==========================================
+ Coverage 94.92% 95.26% +0.33%
==========================================
Files 3 3
Lines 355 359 +4
==========================================
+ Hits 337 342 +5
+ Misses 18 17 -1
Continue to review full report at Codecov.
|
buses = net_space.buses | ||
@test length(buses) == 2 | ||
# https://github.com/JuliaData/Parsers.jl/issues/115 | ||
@test_broken buses.name = ["ABC", "ABCDEFGH"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6a96355
to
22f0e71
Compare
No description provided.