Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On hashCode equal, verify using string data #240

Merged
merged 2 commits into from
Jul 2, 2023

Conversation

nfrawley93
Copy link
Contributor

Hey so this is a quick simple fix for everyone who is having issues with cells containing the wrong data.

Overall, while quick, HashCode should not be used as a Key as they are non-unique. To keep some semblance of speed we can continue to have it verify through hash, but if they run into the same data then it would need to verify through the string data itself.

Yes this will slow things down on large excels, but that is when it is needed the most.
Some tests I was doing with my real world data took my decode times from ~5s to ~5.5s and in my larger file from ~9.2s to 9.9s. Again to be clear, the "faster" files were corrupted by this issue so being faster in the case meant nothing.

I think this relates to a few different errors people are getting and likely a lot of errors that are going unnoticed. In my case it was random cells appearing in different places and having everything shifted over by 1 in some places. As it got worse my excel would start looking like a jumbled mess.
#159

@FauconSpartiate
Copy link
Collaborator

Good job 👍
Could you maybe add some tests with the files that had problems, so we're sure that this problem doesn't show up again?

@nfrawley93
Copy link
Contributor Author

I should have some time later this week to see what I can put together.
My files that currently have this issue (now resolved) contain a lot of business costs from my workplace, I will see if I can prune it enough to get it on here!

@FauconSpartiate
Copy link
Collaborator

FauconSpartiate commented May 30, 2023

No worries, take your time to not leak any personal infos...
You can also try it out with the excel files provided in the issue this fixes (#159)

Edit: The error on the example file in #159 is actually not reproducible since 2.0.1, maybe there are more issues than we think 🤔

@nfrawley93
Copy link
Contributor Author

nfrawley93 commented May 30, 2023

Honestly I wouldn't be surprised if this was an underlying issue in a lot cases posted under issues since there seems to be a lot of random "For some reason X is replaced with Y" or "I am getting the wrong data from the cell". I am going to see if I can find some excels posted here and check.

I am working on some simple test cases. Which means trying to force HashCode collision 😬. Luckily I have one really clear one. Feel free to use this for other testing. Seems "Wand" and "Vickers" when they are the only data in a node do have the same HashCode.

I was testing with some tried and true print statements and it is definitely what I thought. Here is a case without my change.

flutter: Attempting to add SharedString at Index 15893 with value Wand
flutter: Value Wand is already associated with Index 2937
flutter: Getting Value from index 2937 returns Vickers
flutter: HashCode for SharedString Wand is 526978181
flutter: HashCode for SharedString Vickers 526978181

So straight forward collision there. The cascading effects of that single collision added a bunch of others misplaced data to my excel. I checked every other HashCode comparison and there wasn't a single other collision. This single one alone caused tens of thousands of misplaced cells after.

Here is a super pruned example file. It is drastic how bad a collision causes everything to go sideways. Some columns look shifted, some just move cells. The reason is the entire _list structure and _map structure of the SharedString basically get "off" from one another. Then depending on how often the value for that index shows up or the real value that should show up is displayed it causes more of shift. So there is a chance that this would cause a simple duplicate cell issue, or possibly shift a row.

The 5 files below

For some idea, my initial dataset had around 50 columns and 5000 rows, with around ~20,000 distinct Indexes so a small chance but it landed.
hashcode-collisions

@FauconSpartiate
Copy link
Collaborator

Great work, thank you very much 👍

Copy link
Owner

@justkawal justkawal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice tweak and thanks alot for the in-dept research on this.

@justkawal
Copy link
Owner

@FauconSpartiate Can you please review this whenever you have time....

@FauconSpartiate
Copy link
Collaborator

I was actually waiting on some tests for this to not break again, but we can keep it like that and do that later (will probably never happen)

@justkawal
Copy link
Owner

At some.toString() == toString() will always be true, although it compromises some speed but willing to prefer correctness over speed....

What's your thought @FauconSpartiate ?

@FauconSpartiate
Copy link
Collaborator

Well if it needs to be, then yeah...
Speed always goes after bug corrections

@FauconSpartiate FauconSpartiate merged commit a2d64a3 into justkawal:main Jul 2, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants