-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support rehashing in the HashIndexBuilder #2776
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #2776 +/- ##
==========================================
+ Coverage 93.45% 93.52% +0.06%
==========================================
Files 1087 1090 +3
Lines 41546 42055 +509
==========================================
+ Hits 38828 39331 +503
- Misses 2718 2724 +6 ☔ View full report in Codecov by Sentry. |
797caa5
to
ee9ac1f
Compare
Performance is fixed, with a slight improvement from before. Testing on a csv containing 1 to 60 million in a random order, copying this from csv as a primary key takes on average on my 12-core machine (over six runs):
Compared to master:
The change on the int column is mostly within the range of noise. On the other hand, when I'd benchmarked HashIndexBuilder::append by itself, 120 million inserts (with bulk reserve) was taking ~15s, compared to ~20s on master, so other parts of the copy pipeline may be dominating the runtime (strings took ~37s vs 40s; the main difference in parallel is likely due to avoiding re-acquiring the mutex for the shared overflow file). I think string performance could be further improved by using an rwlock instead of a mutex, since access to the overflow file is usually a read-only check to see if a key already exists in the index and could be done concurrently. |
ee9ac1f
to
bccc9a2
Compare
I did a quick test, and with an That should be better than an rwlock, as concurrent reads can be safely done even when writing and appending new pages (since we never delete anything and the only thing changing is the pointer at the end). I'll have to try some larger tests though to see how well it scales, since finding a particular index would be slower, so the fastest in this test may end up being slower with datasets which are much larger. Deques may scale better sometimes, but the MSVC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I love the new iterator design 😄
bccc9a2
to
e5eda4c
Compare
Still needs to be integrated with the CSV reader.
The performance has taken a bit of a hit,
HashIndexBuilder::append
is roughly 60% slower than before, mostly I think due to some unnecessary copying when iterating over slots (it was worse before I fixedInMemDiskArrayBuilder
to directly access the header ingetNumElements
instead of going through the unnecessary locking inherited fromBaseDiskArray
). I'm going to try and restructure things to remove the copies and speed things up.