Support rehashing in the HashIndexBuilder #2776

Merged 1 commit on Feb 7, 2024

benjaminwinger
Collaborator

Still needs to be integrated with the CSV reader.

Performance has taken a hit: HashIndexBuilder::append is roughly 60% slower than before. I think this is mostly due to unnecessary copying when iterating over slots. (It was worse before I fixed InMemDiskArrayBuilder to access the header directly in getNumElements instead of going through the unnecessary locking inherited from BaseDiskArray.) I'm going to try restructuring things to remove the copies and speed things up.
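
To illustrate the kind of getNumElements change described above, here is a minimal sketch, assuming a base array whose size query takes a mutex and an in-memory builder variant that is only touched by one thread at a time. LockedArray, InMemBuilderArray and ArrayHeader are hypothetical names for illustration, not the actual BaseDiskArray or InMemDiskArrayBuilder code.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical header holding the element count.
struct ArrayHeader {
    std::uint64_t numElements = 0;
};

// Hypothetical base: every size query takes a mutex, as an array shared
// between threads might require.
class LockedArray {
public:
    virtual ~LockedArray() = default;

    virtual std::uint64_t getNumElements() {
        std::lock_guard<std::mutex> lock(mtx_);
        return header_.numElements;
    }

    void pushBack() {
        std::lock_guard<std::mutex> lock(mtx_);
        ++header_.numElements;
    }

protected:
    std::mutex mtx_;
    ArrayHeader header_;
};

// Hypothetical builder variant. ASSUMPTION: it is never accessed by two
// threads at once, so the size query can read the header without locking.
class InMemBuilderArray : public LockedArray {
public:
    std::uint64_t getNumElements() override { return header_.numElements; }
};
```

When the size query sits inside a hot loop such as slot iteration, skipping an uncontended lock and unlock per call can add up, which matches the direction of the effect described above.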

codecov bot commented Feb 2, 2024

Codecov Report

Attention: 17 lines in your changes are missing coverage. Please review.

Comparison is base (0b1e307) 93.45% compared to head (e5eda4c) 93.52%.
Report is 49 commits behind head on master.

Files                                             Patch %   Lines
src/include/storage/index/hash_index_builder.h    63.15%    7 Missing
src/storage/index/hash_index.cpp                  90.56%    5 Missing
src/include/storage/index/hash_index.h            87.87%    4 Missing
src/storage/index/hash_index_builder.cpp          98.46%    1 Missing
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2776      +/-   ##
==========================================
+ Coverage   93.45%   93.52%   +0.06%     
==========================================
  Files        1087     1090       +3     
  Lines       41546    42055     +509     
==========================================
+ Hits        38828    39331     +503     
- Misses       2718     2724       +6     

@benjaminwinger
Collaborator Author

Performance is fixed, with a slight improvement from before.

Testing on a CSV file containing the integers 1 to 60 million in random order, copying it from CSV as a primary key takes, on average over six runs on my 12-core machine:

  • 3053ms when read as an int64 column
  • 21023ms when read as a string column

Compared to master:

  • 3075ms when read as an int64 column
  • 26002ms when read as a string column
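
For anyone wanting to reproduce a similar input, the file described above (the integers 1 to n in random order, one per line) could be generated roughly as follows. This is only a sketch: the function names, the seed parameter, and the one-key-per-line layout are assumptions, not the script actually used for these numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <ostream>
#include <random>
#include <sstream>
#include <vector>

// Returns the integers 1..n in a pseudo-random order.
// The seed makes the order reproducible between runs.
std::vector<std::int64_t> shuffledKeys(std::int64_t n, std::uint64_t seed) {
    std::vector<std::int64_t> keys(static_cast<std::size_t>(n));
    std::iota(keys.begin(), keys.end(), std::int64_t{1});
    std::mt19937_64 rng(seed);
    std::shuffle(keys.begin(), keys.end(), rng);
    return keys;
}

// Writes one key per line (ASSUMED layout: single column, no header row).
void writeKeysAsCsv(std::ostream& out, const std::vector<std::int64_t>& keys) {
    for (std::int64_t key : keys) {
        out << key << '\n';
    }
}
```

Passing a std::ofstream writes the real file, while a std::ostringstream is convenient for checking the output on a small n. The same file can then be loaded with the key column declared as either an int64 or a string.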

The change on the int64 column is mostly within the range of noise. On the other hand, when I benchmarked HashIndexBuilder::append by itself, 120 million inserts (with bulk reserve) took ~15s, compared to ~20s on master, so other parts of the copy pipeline may be dominating the runtime. Strings took ~37s vs ~40s; the main difference when running in parallel is likely due to avoiding re-acquiring the mutex for the shared overflow file.

I think string performance could be further improved by using an rwlock instead of a mutex, since access to the overflow file is usually a read-only check to see if a key already exists in the index and could be done concurrently.
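
As a rough sketch of the rwlock idea, assuming C++17 and using std::shared_mutex: read-only existence checks take a shared lock and can run concurrently, while inserts take an exclusive lock. SharedOverflowKeys is a hypothetical stand-in backed by a std::unordered_set, not the real overflow file.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <string_view>
#include <unordered_set>

// Hypothetical stand-in for the shared overflow storage.
class SharedOverflowKeys {
public:
    // Read-only check: any number of threads may hold the shared lock at once.
    bool contains(std::string_view key) const {
        std::shared_lock<std::shared_mutex> lock(mtx_);
        return keys_.find(std::string(key)) != keys_.end();
    }

    // Insert takes the exclusive lock. Returns false if the key already
    // existed, so a contains() followed by insert() stays correct even if
    // another thread inserted the same key in between.
    bool insert(std::string_view key) {
        std::unique_lock<std::shared_mutex> lock(mtx_);
        return keys_.emplace(key).second;
    }

private:
    mutable std::shared_mutex mtx_;
    std::unordered_set<std::string> keys_;
};
```

Whether this beats a plain mutex depends on the read/write ratio and on the cost of the shared lock on a given platform, so it would need to be benchmarked.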

@benjaminwinger benjaminwinger marked this pull request as ready for review February 6, 2024 21:35
@benjaminwinger
Collaborator Author

I did a quick test: with an std::deque that only locks when appending, the end-to-end string benchmark from my previous comment improved to 12s, and an std::list with an append lock took only 9.8s. Presumably lists are simple enough to perform better here, and we already have one level of indirection.

That should be better than an rwlock, since concurrent reads can be done safely even while writing and appending new pages (we never delete anything, and the only thing that changes is the pointer at the end). I'll have to try some larger tests to see how well it scales, though: finding a particular index would be slower, so the fastest option in this test may end up slower on much larger datasets. Deques may scale better in some cases, but the MSVC std::deque implementation uses a block size so small that it's effectively an std::list, and I'm not keen on having to do something custom to get similar performance on all platforms. It would at least be helpful to know whether there is much of a performance gap with larger datasets.
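
A minimal sketch of the append-lock idea, under the assumptions that entries are never erased and never modified after being appended. AppendOnlyStrings is a hypothetical name, and this is not the code that was benchmarked. It relies on the standard guarantee that inserting into a std::list does not invalidate references to existing elements, so a reader holding a pointer returned by append can keep dereferencing it without a lock.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical append-only store; names are illustrative only.
class AppendOnlyStrings {
public:
    // Only appends take the lock. The returned pointer stays valid because
    // std::list never relocates existing elements and nothing is erased.
    const std::string* append(std::string value) {
        std::lock_guard<std::mutex> lock(appendMtx_);
        entries_.push_back(std::move(value));
        return &entries_.back();
    }

    // Positional lookup has to walk the list, so it is O(n). In this sketch
    // it also takes the lock, because traversal reads the node links that a
    // concurrent append modifies. Returns nullptr when idx is out of range.
    const std::string* at(std::size_t idx) {
        std::lock_guard<std::mutex> lock(appendMtx_);
        if (idx >= entries_.size()) {
            return nullptr;
        }
        return &*std::next(entries_.begin(), static_cast<std::ptrdiff_t>(idx));
    }

private:
    std::mutex appendMtx_;
    std::list<std::string> entries_;
};
```

The linear walk in at() is the cost mentioned above: lookups by position get slower as the list grows, which is why the result on a small test may not carry over to much larger datasets.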

Contributor

@ray6080 ray6080 left a comment

Looks good to me! I love the new iterator design 😄

@benjaminwinger benjaminwinger merged commit d60724e into master Feb 7, 2024
15 checks passed
@benjaminwinger benjaminwinger deleted the builder-rehashing branch February 7, 2024 17:20