On hashCode equal, verify using string data #240

nfrawley93 · 2023-05-30T07:41:02Z

Hey so this is a quick simple fix for everyone who is having issues with cells containing the wrong data.

Overall, while quick, HashCode should not be used as a Key as they are non-unique. To keep some semblance of speed we can continue to have it verify through hash, but if they run into the same data then it would need to verify through the string data itself.

Yes this will slow things down on large excels, but that is when it is needed the most.
Some tests I was doing with my real world data took my decode times from ~5s to ~5.5s and in my larger file from ~9.2s to 9.9s. Again to be clear, the "faster" files were corrupted by this issue so being faster in the case meant nothing.

I think this relates to a few different errors people are getting and likely a lot of errors that are going unnoticed. In my case it was random cells appearing in different places and having everything shifted over by 1 in some places. As it got worse my excel would start looking like a jumbled mess.
#159

FauconSpartiate · 2023-05-30T08:26:34Z

Good job 👍
Could you maybe add some tests with the files that had problems, so we're sure that this problem doesn't show up again?

nfrawley93 · 2023-05-30T08:37:14Z

I should have some time later this week to see what I can put together.
My files that currently have this issue (now resolved) contain a lot of business costs from my workplace, I will see if I can prune it enough to get it on here!

FauconSpartiate · 2023-05-30T13:37:59Z

No worries, take your time to not leak any personal infos...
You can also try it out with the excel files provided in the issue this fixes (#159)

Edit: The error on the example file in #159 is actually not reproducible since 2.0.1, maybe there are more issues than we think 🤔

nfrawley93 · 2023-05-30T19:54:42Z

Honestly I wouldn't be surprised if this was an underlying issue in a lot cases posted under issues since there seems to be a lot of random "For some reason X is replaced with Y" or "I am getting the wrong data from the cell". I am going to see if I can find some excels posted here and check.

I am working on some simple test cases. Which means trying to force HashCode collision 😬. Luckily I have one really clear one. Feel free to use this for other testing. Seems "Wand" and "Vickers" when they are the only data in a node do have the same HashCode.

I was testing with some tried and true print statements and it is definitely what I thought. Here is a case without my change.

flutter: Attempting to add SharedString at Index 15893 with value Wand
flutter: Value Wand is already associated with Index 2937
flutter: Getting Value from index 2937 returns Vickers
flutter: HashCode for SharedString Wand is 526978181
flutter: HashCode for SharedString Vickers 526978181

So straight forward collision there. The cascading effects of that single collision added a bunch of others misplaced data to my excel. I checked every other HashCode comparison and there wasn't a single other collision. This single one alone caused tens of thousands of misplaced cells after.

Here is a super pruned example file. It is drastic how bad a collision causes everything to go sideways. Some columns look shifted, some just move cells. The reason is the entire _list structure and _map structure of the SharedString basically get "off" from one another. Then depending on how often the value for that index shows up or the real value that should show up is displayed it causes more of shift. So there is a chance that this would cause a simple duplicate cell issue, or possibly shift a row.

The 5 files below

input_with_collision.xlsx: This is an example where two distinct nodes will share the same HashCode
input_no_collision.xlsx: Essential just a clean excel file but for appearance sake looks similar to the other
output_with_collision.xlsx: This is the output using 2.1.0 things get messy
output_with_collision_fixed.xlsx: This is the output using my commit where the nodes full string is verified when there is a HashCode collision. It matches input.
output_no_collision.xlsx: This is the output of what is essentially a clean excel using where there isn't any HashCode collision just to show that this is in-fact a collision specific issue.

For some idea, my initial dataset had around 50 columns and 5000 rows, with around ~20,000 distinct Indexes so a small chance but it landed.

FauconSpartiate · 2023-05-31T00:23:00Z

Great work, thank you very much 👍

justkawal

Nice tweak and thanks alot for the in-dept research on this.

justkawal · 2023-07-02T19:34:30Z

@FauconSpartiate Can you please review this whenever you have time....

FauconSpartiate · 2023-07-02T19:38:17Z

I was actually waiting on some tests for this to not break again, but we can keep it like that and do that later (will probably never happen)

justkawal · 2023-07-02T19:40:39Z

At some.toString() == toString() will always be true, although it compromises some speed but willing to prefer correctness over speed....

What's your thought @FauconSpartiate ?

FauconSpartiate · 2023-07-02T19:44:53Z

Well if it needs to be, then yeah...
Speed always goes after bug corrections

On hashCode equal verify using string data

302b2d8

nfrawley93 requested review from FauconSpartiate and justkawal as code owners May 30, 2023 07:41

FauconSpartiate requested a review from amigne May 30, 2023 08:25

justkawal approved these changes Jul 2, 2023

View reviewed changes

Merge branch 'main' into main

0c1f3ed

FauconSpartiate approved these changes Jul 2, 2023

View reviewed changes

FauconSpartiate merged commit a2d64a3 into justkawal:main Jul 2, 2023
1 check passed

nfrawley93 mentioned this pull request Jul 11, 2023

Wrong Data. Lost Data, Null Data #244

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

On hashCode equal, verify using string data #240

On hashCode equal, verify using string data #240

nfrawley93 commented May 30, 2023

FauconSpartiate commented May 30, 2023

nfrawley93 commented May 30, 2023

FauconSpartiate commented May 30, 2023 •

edited

Loading

nfrawley93 commented May 30, 2023 •

edited

Loading

FauconSpartiate commented May 31, 2023

justkawal left a comment

justkawal commented Jul 2, 2023

FauconSpartiate commented Jul 2, 2023

justkawal commented Jul 2, 2023

FauconSpartiate commented Jul 2, 2023

On hashCode equal, verify using string data #240

On hashCode equal, verify using string data #240

Conversation

nfrawley93 commented May 30, 2023

FauconSpartiate commented May 30, 2023

nfrawley93 commented May 30, 2023

FauconSpartiate commented May 30, 2023 • edited Loading

nfrawley93 commented May 30, 2023 • edited Loading

FauconSpartiate commented May 31, 2023

justkawal left a comment

Choose a reason for hiding this comment

justkawal commented Jul 2, 2023

FauconSpartiate commented Jul 2, 2023

justkawal commented Jul 2, 2023

FauconSpartiate commented Jul 2, 2023

FauconSpartiate commented May 30, 2023 •

edited

Loading

nfrawley93 commented May 30, 2023 •

edited

Loading