Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrongly parsed hash #44

Closed
voxik opened this issue Oct 3, 2022 · 5 comments
Closed

Wrongly parsed hash #44

voxik opened this issue Oct 3, 2022 · 5 comments

Comments

@voxik
Copy link

voxik commented Oct 3, 2022

The hash parsing the F2022-10.xlsx is wrong:

$ pry
[1] pry(main)> require 'xsv'
=> true
[2] pry(main)> x = Xsv.open 'F2022-10.xlsx', :parse_headers => true
=> #<Xsv::Workbook:300 sheets=1 trim_empty_rows=false>
[3] pry(main)> s = x.sheets.first
=> #<Xsv::Sheet:320 mode=hash>
[4] pry(main)> s[0]
=> {"Interní číslo"=>"2022009",
 nil=>61,
 "Typ faktury"=>"FAKTURA",
 "Název firmy nebo jméno osoby"=>"Foo bar",
 "IČO"=>1234567,
 "DIČ"=>"CZ1234567",
 "Ulice"=>nil,
 "PSČ"=>"Velehradská 1",
 "Město"=>"686 03",
 "Datum vystavení"=>"Staré Město",
 "Datum splatnosti"=>nil,
 "Měna"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
 "Pořadí"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
 "Název"=>nil,
 "Množství"=>"EUR",
 "Cena za MJ"=>nil}

The correct output should be:

--- wrong
+++ correct
@@ -7,18 +7,17 @@
 => #<Xsv::Sheet:320 mode=hash>
 [4] pry(main)> s[0]
 => {"Interní číslo"=>"2022009",
- nil=>61,
  "Typ faktury"=>"FAKTURA",
  "Název firmy nebo jméno osoby"=>"Foo bar",
  "IČO"=>1234567,
  "DIČ"=>"CZ1234567",
- "Ulice"=>nil,
- "PSČ"=>"Velehradská 1",
- "Město"=>"686 03",
- "Datum vystavení"=>"Staré Město",
- "Datum splatnosti"=>nil,
- "Měna"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
- "Pořadí"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
- "Název"=>nil,
- "Množství"=>"EUR",
- "Cena za MJ"=>nil}
+ "Ulice"=>"Velehradská 1",
+ "PSČ"=>"686 03",
+ "Město"=>"Staré Město",
+ "Datum vystavení"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
+ "Datum splatnosti"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
+ "Měna"=>"EUR",
+ "Pořadí"=>1,
+ "Název"=>"Item 1,
+ "Množství"=>2,
+ "Cena za MJ"=>61}

I think that the parser is confused by the empty columns.

@voxik
Copy link
Author

voxik commented Oct 3, 2022

Of course this might fall into "and does not deal with most formatting or more advanced functionality." category ...

@martijn
Copy link
Owner

martijn commented Oct 4, 2022

The problem in this file is that there are multiple cells in the header that are empty (B1, G1, K1, N1, P1). Hash keys in Ruby are unique, so there can be only one 'nil' key and that ends up with the value from P2. This is the unpredicted result the README mentiond:

Be aware that hash mode will lead to unpredictable results if the worksheet has multiple columns with the same header.

As it stands, your best option is to use array mode (parse_headers: false) to read this file.

If anybody can come up with a pull request to ignore the columns with empty header cells in hash mode, I'd be open to merging that. I don't have time to work on that myself right now, unfortunately.

@voxik
Copy link
Author

voxik commented Oct 4, 2022

Right, thx for confirming my understanding.

Not sure if I'll have motivation to dig into this myself ATM, because, as you suggested, I have already went with array mode any parsing headers by hand for the moment (and I could probably live also with just indices).

Nevertheless, if I would have time, what would be your preferred solution?

  1. Document this as expected behavior, because any other solution will likely impact the performance which seems to be key goal of this project.
  2. Just detect the there are nils in headers and fail.
  3. Ignore the columns with nil headers.
  4. Something else?

@shkm
Copy link
Collaborator

shkm commented Oct 4, 2022

I think it's reasonable to throw an error in cases where there are duplicate headers in hash mode. I could imagine wanting to be alerted of this rather than proceeding with potentially unintended behavior.

My two cents.

@martijn
Copy link
Owner

martijn commented Oct 4, 2022

Right, I guess it's fair to raise an exception in this situation since the resulting data is useless as illustrated by this issue. I'll leave this issue open until we implement at least that.

Alternatively I'd opt for option 3 out of @voxik 's suggestions above: by implementing a col_skip array in SheetRowHandler we could skip the columns with empty headers. Any data in those colums would be lost. Or we replace the empty headers with a generated header name "Column B", "Column G" so data can be salvaged. That is probably easier to implement as well (in Sheet#parse_headers).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants