Wrongly parsed hash #44

voxik · 2022-10-03T21:16:25Z

The hash parsing the F2022-10.xlsx is wrong:

$ pry
[1] pry(main)> require 'xsv'
=> true
[2] pry(main)> x = Xsv.open 'F2022-10.xlsx', :parse_headers => true
=> #<Xsv::Workbook:300 sheets=1 trim_empty_rows=false>
[3] pry(main)> s = x.sheets.first
=> #<Xsv::Sheet:320 mode=hash>
[4] pry(main)> s[0]
=> {"Interní číslo"=>"2022009",
 nil=>61,
 "Typ faktury"=>"FAKTURA",
 "Název firmy nebo jméno osoby"=>"Foo bar",
 "IČO"=>1234567,
 "DIČ"=>"CZ1234567",
 "Ulice"=>nil,
 "PSČ"=>"Velehradská 1",
 "Město"=>"686 03",
 "Datum vystavení"=>"Staré Město",
 "Datum splatnosti"=>nil,
 "Měna"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
 "Pořadí"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
 "Název"=>nil,
 "Množství"=>"EUR",
 "Cena za MJ"=>nil}

The correct output should be:

--- wrong
+++ correct
@@ -7,18 +7,17 @@
 => #<Xsv::Sheet:320 mode=hash>
 [4] pry(main)> s[0]
 => {"Interní číslo"=>"2022009",
- nil=>61,
  "Typ faktury"=>"FAKTURA",
  "Název firmy nebo jméno osoby"=>"Foo bar",
  "IČO"=>1234567,
  "DIČ"=>"CZ1234567",
- "Ulice"=>nil,
- "PSČ"=>"Velehradská 1",
- "Město"=>"686 03",
- "Datum vystavení"=>"Staré Město",
- "Datum splatnosti"=>nil,
- "Měna"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
- "Pořadí"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
- "Název"=>nil,
- "Množství"=>"EUR",
- "Cena za MJ"=>nil}
+ "Ulice"=>"Velehradská 1",
+ "PSČ"=>"686 03",
+ "Město"=>"Staré Město",
+ "Datum vystavení"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
+ "Datum splatnosti"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
+ "Měna"=>"EUR",
+ "Pořadí"=>1,
+ "Název"=>"Item 1,
+ "Množství"=>2,
+ "Cena za MJ"=>61}

I think that the parser is confused by the empty columns.

The text was updated successfully, but these errors were encountered:

voxik · 2022-10-03T21:18:57Z

Of course this might fall into "and does not deal with most formatting or more advanced functionality." category ...

martijn · 2022-10-04T06:09:59Z

The problem in this file is that there are multiple cells in the header that are empty (B1, G1, K1, N1, P1). Hash keys in Ruby are unique, so there can be only one 'nil' key and that ends up with the value from P2. This is the unpredicted result the README mentiond:

Be aware that hash mode will lead to unpredictable results if the worksheet has multiple columns with the same header.

As it stands, your best option is to use array mode (parse_headers: false) to read this file.

If anybody can come up with a pull request to ignore the columns with empty header cells in hash mode, I'd be open to merging that. I don't have time to work on that myself right now, unfortunately.

voxik · 2022-10-04T07:09:12Z

Right, thx for confirming my understanding.

Not sure if I'll have motivation to dig into this myself ATM, because, as you suggested, I have already went with array mode any parsing headers by hand for the moment (and I could probably live also with just indices).

Nevertheless, if I would have time, what would be your preferred solution?

Document this as expected behavior, because any other solution will likely impact the performance which seems to be key goal of this project.
Just detect the there are nils in headers and fail.
Ignore the columns with nil headers.
Something else?

shkm · 2022-10-04T08:41:56Z

I think it's reasonable to throw an error in cases where there are duplicate headers in hash mode. I could imagine wanting to be alerted of this rather than proceeding with potentially unintended behavior.

My two cents.

martijn · 2022-10-04T09:17:58Z

Right, I guess it's fair to raise an exception in this situation since the resulting data is useless as illustrated by this issue. I'll leave this issue open until we implement at least that.

Alternatively I'd opt for option 3 out of @voxik 's suggestions above: by implementing a col_skip array in SheetRowHandler we could skip the columns with empty headers. Any data in those colums would be lost. Or we replace the empty headers with a generated header name "Column B", "Column G" so data can be salvaged. That is probably easier to implement as well (in Sheet#parse_headers).

martijn added a commit that referenced this issue Oct 16, 2022

Raise error for duplicate headers in hash mode (fixes #44)

c8859f7

martijn mentioned this issue Oct 16, 2022

Raise error for duplicate headers in hash mode (fixes #44) #45

Merged

martijn added a commit that referenced this issue Dec 1, 2022

Raise error for duplicate headers in hash mode (fixes #44)

562039f

martijn closed this as completed in ebcbcf0 Dec 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrongly parsed hash #44

Wrongly parsed hash #44

voxik commented Oct 3, 2022

voxik commented Oct 3, 2022

martijn commented Oct 4, 2022

voxik commented Oct 4, 2022 •

edited

Loading

shkm commented Oct 4, 2022

martijn commented Oct 4, 2022

Wrongly parsed hash #44

Wrongly parsed hash #44

Comments

voxik commented Oct 3, 2022

voxik commented Oct 3, 2022

martijn commented Oct 4, 2022

voxik commented Oct 4, 2022 • edited Loading

shkm commented Oct 4, 2022

martijn commented Oct 4, 2022

voxik commented Oct 4, 2022 •

edited

Loading