Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for files with whitespace not comma as delimiter #70

Merged
merged 8 commits into from
Apr 26, 2022

Conversation

nickrobinson251
Copy link
Owner

No description provided.

@nickrobinson251 nickrobinson251 marked this pull request as ready for review April 23, 2022 12:11
src/parsing.jl Outdated
@@ -368,6 +422,7 @@ end
###

function parse_row!(rec::R, bytes, pos, len, options) where {I, R <: MultiTerminalDCLines{I}}
pos = checkdelim!(bytes, pos, len, options)
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quinnj how does CSV.jl handle leading whitespace at the start of a row (when whitespace is the delim)? Does CSV.jl manage to avoid calling checkdelim! in every parserow call somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Parsers.checkcmtemptylines, we check if a new row starts with a comment character or is "empty" (i.e. is followed immediately by another newline). We call that anywhere right after we parse a newline.

Otherwise, individual type parsers have their own logic for skipping whitespace, so the "first cell" of the row would be able to skip leading whitespace for the row.

If the delim is whitespace, however, then by nature it's significant, right? So we would treat it as empty cells. (i.e. consecutive delimiters like ,, mean an empty cell.

UNLESS we're talking ignorerepeated=true, in which case, yes, we'll parse the newline and any following delimiters until we reach a non delimiter which should be the start of our first cell.

Copy link
Owner Author

@nickrobinson251 nickrobinson251 Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sorry should have been clearer, our case here is delim=' ' and ignorerepeated=true

Basically we have space-seperated values, which can be seperated by an arbitrary number of spaces, and the first value might also have arbitrary number of leading spaces

e.g. data with 5 columns can look like:

    8 'ABC     ' 138.00 1      .000     
   19 'ABCDEFGH'  69.00 1      .000 

In CSV, i could see for the very first row we handle this case in Context here:
https://github.com/JuliaData/CSV.jl/blob/3ebd2c9d3baec32512a662605cf0fc378fdf1644/src/context.jl#L382-L389

        # step 4a: if we're ignoring repeated delimiters, then we ignore any
        # that start a row, so we need to check if we need to adjust our headerpos/datapos
        if ignorerepeated
            if headerpos > 0
                headerpos = Parsers.checkdelim!(buf, headerpos, len, options)
            end
            datapos = Parsers.checkdelim!(buf, datapos, len, options)
        end

but i couldn't see how we handle this in all other lines

turns out xparse handles this itself (using Parsers.checkcmtemptylines) and the problem in PowerFlowData.jl was that our next_line function which we use to skip lines between sections (and which is a simplified version of checkcmtemptylines) wasn't handling repeated delimiters at the start of the next line

thanks for the pointers -- changes needed here are much simpler now!

@codecov-commenter
Copy link

codecov-commenter commented Apr 23, 2022

Codecov Report

Merging #70 (22f0e71) into main (981f60c) will increase coverage by 0.33%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #70      +/-   ##
==========================================
+ Coverage   94.92%   95.26%   +0.33%     
==========================================
  Files           3        3              
  Lines         355      359       +4     
==========================================
+ Hits          337      342       +5     
+ Misses         18       17       -1     
Impacted Files Coverage Δ
src/parsing.jl 99.15% <100.00%> (+0.45%) ⬆️
src/types.jl 95.45% <0.00%> (-0.09%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 981f60c...22f0e71. Read the comment docs.

docs/src/index.md Outdated Show resolved Hide resolved
buses = net_space.buses
@test length(buses) == 2
# https://github.com/JuliaData/Parsers.jl/issues/115
@test_broken buses.name = ["ABC", "ABCDEFGH"]
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nickrobinson251 nickrobinson251 changed the title WIP: Support for files with whitespace not comma as delimiter Support for files with whitespace not comma as delimiter Apr 26, 2022
@nickrobinson251 nickrobinson251 merged commit bcb9f34 into main Apr 26, 2022
@nickrobinson251 nickrobinson251 deleted the npr/space-as-delim branch May 20, 2022 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants