-
-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
On hashCode equal, verify using string data #240
Conversation
Good job 👍 |
I should have some time later this week to see what I can put together. |
Honestly I wouldn't be surprised if this was an underlying issue in a lot cases posted under issues since there seems to be a lot of random "For some reason X is replaced with Y" or "I am getting the wrong data from the cell". I am going to see if I can find some excels posted here and check. I am working on some simple test cases. Which means trying to force HashCode collision 😬. Luckily I have one really clear one. Feel free to use this for other testing. Seems "Wand" and "Vickers" when they are the only data in a node do have the same HashCode. I was testing with some tried and true print statements and it is definitely what I thought. Here is a case without my change.
So straight forward collision there. The cascading effects of that single collision added a bunch of others misplaced data to my excel. I checked every other HashCode comparison and there wasn't a single other collision. This single one alone caused tens of thousands of misplaced cells after. Here is a super pruned example file. It is drastic how bad a collision causes everything to go sideways. Some columns look shifted, some just move cells. The reason is the entire _list structure and _map structure of the SharedString basically get "off" from one another. Then depending on how often the value for that index shows up or the real value that should show up is displayed it causes more of shift. So there is a chance that this would cause a simple duplicate cell issue, or possibly shift a row. The 5 files below
For some idea, my initial dataset had around 50 columns and 5000 rows, with around ~20,000 distinct Indexes so a small chance but it landed. |
Great work, thank you very much 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice tweak and thanks alot for the in-dept research on this.
@FauconSpartiate Can you please review this whenever you have time.... |
I was actually waiting on some tests for this to not break again, but we can keep it like that and do that later (will probably never happen) |
At some.toString() == toString() will always be true, although it compromises some speed but willing to prefer correctness over speed.... What's your thought @FauconSpartiate ? |
Well if it needs to be, then yeah... |
Hey so this is a quick simple fix for everyone who is having issues with cells containing the wrong data.
Overall, while quick, HashCode should not be used as a Key as they are non-unique. To keep some semblance of speed we can continue to have it verify through hash, but if they run into the same data then it would need to verify through the string data itself.
Yes this will slow things down on large excels, but that is when it is needed the most.
Some tests I was doing with my real world data took my decode times from ~5s to ~5.5s and in my larger file from ~9.2s to 9.9s. Again to be clear, the "faster" files were corrupted by this issue so being faster in the case meant nothing.
I think this relates to a few different errors people are getting and likely a lot of errors that are going unnoticed. In my case it was random cells appearing in different places and having everything shifted over by 1 in some places. As it got worse my excel would start looking like a jumbled mess.
#159