Support for files with whitespace not comma as delimiter #70

nickrobinson251 · 2022-04-22T13:20:33Z

No description provided.

nickrobinson251 · 2022-04-23T12:12:52Z

src/parsing.jl

@@ -368,6 +422,7 @@ end
 ###

 function parse_row!(rec::R, bytes, pos, len, options) where {I, R <: MultiTerminalDCLines{I}}
+    pos = checkdelim!(bytes, pos, len, options)


@quinnj how does CSV.jl handle leading whitespace at the start of a row (when whitespace is the delim)? Does CSV.jl manage to avoid calling checkdelim! in every parserow call somehow?

quinnj · 2022-04-26

In Parsers.checkcmtemptylines, we check if a new row starts with a comment character or is "empty" (i.e. is followed immediately by another newline). We call that anywhere right after we parse a newline.

Otherwise, individual type parsers have their own logic for skipping whitespace, so the "first cell" of the row would be able to skip leading whitespace for the row.

If the delim is whitespace, however, then by nature it's significant, right? So we would treat repeated whitespace as empty cells (i.e. consecutive delimiters like ,, mean an empty cell).

UNLESS we're talking ignorerepeated=true, in which case, yes, we'll parse the newline and any following delimiters until we reach a non-delimiter, which should be the start of our first cell.
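
As a rough illustration of the two behaviours, here is an analogy using `Base.split` rather than the Parsers.jl code path. Keeping empty strings mirrors treating every delimiter as significant, while `keepempty=false` behaves much like ignorerepeated=true. The sample row is made up for the example.

```julia
# Made-up row with two leading spaces and a doubled space between cells.
row = "  8  138.00 1"

# Every delimiter is significant: leading and repeated spaces give empty cells.
strict = split(row, ' ')

# Empty cells are dropped: roughly what ignorerepeated=true achieves.
collapsed = split(row, ' '; keepempty=false)

println(strict)
println(collapsed)
```

With the strict split the first real value ends up in the third cell, which is why leading whitespace matters when the delimiter is itself whitespace.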

nickrobinson251 · 2022-04-26

yeah, sorry should have been clearer, our case here is delim=' ' and ignorerepeated=true

Basically we have space-separated values, which can be separated by an arbitrary number of spaces, and the first value might also have an arbitrary number of leading spaces

e.g. data with 5 columns can look like:

8 'ABC ' 138.00 1 .000
19 'ABCDEFGH' 69.00 1 .000
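
To make the shape of such a row concrete, here is a minimal sketch of a tokenizer that ignores leading and repeated spaces but keeps spaces inside single quotes. `split_quoted` is a hypothetical helper written for this comment, not a function from Parsers.jl, CSV.jl or PowerFlowData.jl, and the extra spaces in the sample row are added for illustration.

```julia
# Hypothetical helper: split a row on `delim`, collapsing leading/repeated
# delimiters and treating text between `quotechar`s as part of one cell.
function split_quoted(row::AbstractString; delim::Char=' ', quotechar::Char='\'')
    cells = String[]
    buf = IOBuffer()
    inquote = false   # are we between an opening and closing quote?
    incell = false    # have we seen any character of the current cell?
    for c in row
        if c == quotechar
            inquote = !inquote
            incell = true
        elseif c == delim && !inquote
            # A delimiter ends the current cell; repeated ones are skipped.
            if incell
                push!(cells, String(take!(buf)))
                incell = false
            end
        else
            write(buf, c)
            incell = true
        end
    end
    incell && push!(cells, String(take!(buf)))
    return cells
end

cells = split_quoted("   8   'ABC '  138.00 1   .000")
println(cells)
```

The quoted name keeps its trailing space here; stripping it would be a separate step.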

In CSV, i could see for the very first row we handle this case in Context here:
https://github.com/JuliaData/CSV.jl/blob/3ebd2c9d3baec32512a662605cf0fc378fdf1644/src/context.jl#L382-L389

    # step 4a: if we're ignoring repeated delimiters, then we ignore any
    # that start a row, so we need to check if we need to adjust our headerpos/datapos
    if ignorerepeated
        if headerpos > 0
            headerpos = Parsers.checkdelim!(buf, headerpos, len, options)
        end
        datapos = Parsers.checkdelim!(buf, datapos, len, options)
    end

but i couldn't see how we handle this in all other lines

turns out xparse handles this itself (using Parsers.checkcmtemptylines). The problem in PowerFlowData.jl was that our next_line function, which we use to skip lines between sections (and which is a simplified version of checkcmtemptylines), wasn't handling repeated delimiters at the start of the next line

thanks for the pointers -- changes needed here are much simpler now!
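
For anyone reading later, the idea of the fix can be sketched roughly as follows. This is not the actual PowerFlowData.jl `next_line`; `next_line_sketch` and its keyword arguments are hypothetical, and the three leading spaces on the second row are added for illustration.

```julia
# Hypothetical sketch: advance `pos` past the end of the current line and then,
# if `ignorerepeated` is set, past any delimiters that lead the next line.
# Returns the new 1-based position in `bytes`.
function next_line_sketch(bytes::AbstractVector{UInt8}, pos::Integer, len::Integer;
                          delim::UInt8=UInt8(' '), ignorerepeated::Bool=true)
    # Move to the newline that ends the current line (or past the end).
    while pos <= len && bytes[pos] != UInt8('\n')
        pos += 1
    end
    pos += 1  # step over the newline itself
    # Skip delimiters at the start of the next line.
    if ignorerepeated
        while pos <= len && bytes[pos] == delim
            pos += 1
        end
    end
    return pos
end

bytes = Vector{UInt8}("8 'ABC ' 138.00 1 .000\n   19 'ABCDEFGH' 69.00 1 .000")
pos = next_line_sketch(bytes, 1, length(bytes))
println(String(bytes[pos:pos+1]))
```

Without the second loop, `pos` would land on the first of the leading spaces, so the row parser would start on a delimiter instead of on the first value.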

codecov-commenter · 2022-04-23T12:13:44Z

Codecov Report

Merging #70 (22f0e71) into main (981f60c) will increase coverage by 0.33%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #70      +/-   ##
==========================================
+ Coverage   94.92%   95.26%   +0.33%     
==========================================
  Files           3        3              
  Lines         355      359       +4     
==========================================
+ Hits          337      342       +5     
+ Misses         18       17       -1

Impacted Files	Coverage Δ
src/parsing.jl	`99.15% <100.00%> (+0.45%)`	⬆️
src/types.jl	`95.45% <0.00%> (-0.09%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

docs/src/index.md

nickrobinson251 · 2022-04-23T13:30:38Z

test/runtests.jl

+        buses = net_space.buses
+        @test length(buses) == 2
+        # https://github.com/JuliaData/Parsers.jl/issues/115
+        @test_broken buses.name == ["ABC", "ABCDEFGH"]


JuliaData/Parsers.jl#115

nickrobinson251 marked this pull request as ready for review April 23, 2022 12:11


nickrobinson251 added 8 commits April 26, 2022 10:59

Inital support for files with whitespace not comma as delimiter

ed14815

Ignore raw files

603af32

Automatically detect the delimiter

41c55db

Add tests for space delimited files and quoted zero bus numbers

eedb1a3

Bump version

3450c92

Add delim keyword to docs homepage

5b73ca9

Update docs/src/index.md

05a02ed

Add ignoring of repeated delimiters to next_line

22f0e71

nickrobinson251 force-pushed the npr/space-as-delim branch from 6a96355 to 22f0e71 Compare April 26, 2022 09:59

nickrobinson251 changed the title ~~WIP: Support for files with whitespace not comma as delimiter~~ Support for files with whitespace not comma as delimiter Apr 26, 2022

nickrobinson251 merged commit bcb9f34 into main Apr 26, 2022

nickrobinson251 mentioned this pull request Apr 26, 2022

Fix test marked broken when Parsers#115 fixed #71

Closed

nickrobinson251 deleted the npr/space-as-delim branch May 20, 2022 19:09
