Support rehashing in the HashIndexBuilder #2776

benjaminwinger · 2024-02-02T20:05:15Z

Still needs to be integrated with the CSV reader.

Performance has taken a bit of a hit: HashIndexBuilder::append is roughly 60% slower than before. I think this is mostly due to some unnecessary copying when iterating over slots. (It was worse before I fixed InMemDiskArrayBuilder to access the header directly in getNumElements instead of going through the unnecessary locking inherited from BaseDiskArray.) I'm going to try restructuring things to remove the copies and speed things up.

codecov · 2024-02-02T20:18:22Z

Codecov Report

Attention: 17 lines in your changes are missing coverage. Please review.

Comparison is base (0b1e307) 93.45% compared to head (e5eda4c) 93.52%.
Report is 49 commits behind head on master.

Files	Patch %	Lines
src/include/storage/index/hash_index_builder.h	63.15%	7 Missing ⚠️
src/storage/index/hash_index.cpp	90.56%	5 Missing ⚠️
src/include/storage/index/hash_index.h	87.87%	4 Missing ⚠️
src/storage/index/hash_index_builder.cpp	98.46%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2776      +/-   ##
==========================================
+ Coverage   93.45%   93.52%   +0.06%     
==========================================
  Files        1087     1090       +3     
  Lines       41546    42055     +509     
==========================================
+ Hits        38828    39331     +503     
- Misses       2718     2724       +6

benjaminwinger · 2024-02-06T21:35:01Z

Performance is fixed, with a slight improvement from before.

Testing on a CSV containing the integers 1 to 60 million in random order, copying it from CSV as a primary key takes, on average over six runs on my 12-core machine:

3053ms when read as an int64 column
21023ms when read as a string column

Compared to master:

3075ms when read as an int64 column
26002ms when read as a string column

The change on the int64 column is mostly within the range of noise. On the other hand, when I benchmarked HashIndexBuilder::append by itself, 120 million inserts (with bulk reserve) took ~15s, compared to ~20s on master, so other parts of the copy pipeline may be dominating the runtime. (Strings took ~37s vs. ~40s; the main difference when running in parallel is likely due to avoiding re-acquiring the mutex for the shared overflow file.)

I think string performance could be further improved by using an rwlock instead of a mutex, since access to the overflow file is usually a read-only check to see if a key already exists in the index and could be done concurrently.
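As a rough illustration of the rwlock idea, here is a minimal C++17 sketch using std::shared_mutex. The class SharedStringStore and its members are hypothetical stand-ins for the shared overflow file, not the actual Kuzu types; the only point is that existence checks take a shared lock and can overlap, while inserts take an exclusive lock.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the shared overflow file; not Kuzu's actual class.
class SharedStringStore {
public:
    // Read-only existence check: any number of threads may hold the shared lock at once.
    std::optional<uint64_t> lookup(const std::string& key) const {
        std::shared_lock<std::shared_mutex> lock{mtx};
        auto it = offsets.find(key);
        if (it == offsets.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Insert takes the exclusive lock; returns false if the key already exists.
    bool insert(const std::string& key, uint64_t value) {
        std::unique_lock<std::shared_mutex> lock{mtx};
        return offsets.emplace(key, value).second;
    }

private:
    mutable std::shared_mutex mtx;
    std::unordered_map<std::string, uint64_t> offsets;
};
```

Whether this helps in practice depends on the read/write ratio and on the cost of the shared lock itself, so it would need to be benchmarked against the plain mutex.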

benjaminwinger · 2024-02-06T22:14:26Z

I did a quick test: with an std::deque that only locks when appending, the end-to-end string benchmark from my previous comment improved to 12s, and an std::list with an append lock took only 9.8s. Presumably lists are simple enough to perform better, and we already have one level of indirection.

That should be better than an rwlock, as concurrent reads can be done safely even while writing and appending new pages (since we never delete anything and the only thing changing is the pointer at the end). I'll have to try some larger tests to see how well it scales, though: finding a particular index would be slower, so the fastest option in this test may end up being slower on much larger datasets. Deques may scale better in some cases, but the MSVC std::deque implementation uses a block size so small that it's effectively an std::list, and I'm not keen on having to do something custom to get similar performance on all platforms. It would at least be helpful to know whether there is much of a performance gap with larger datasets.
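A minimal sketch of the append-lock idea, assuming a hypothetical Page type and AppendOnlyPages container (neither is the real Kuzu code). It relies on the standard guarantee that inserting at the end of a std::deque does not invalidate references to existing elements, so a pointer handed out by an earlier append stays usable without taking the lock. Looking a page up by index while another thread appends is not covered here; in this sketch that would still need the lock or some other synchronization.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical page type, for illustration only.
struct Page {
    std::array<uint8_t, 4096> data{};
};

// Hypothetical append-only page container; pages are never removed.
class AppendOnlyPages {
public:
    // Appending takes the lock. emplace_back on a std::deque invalidates
    // iterators but not references, so earlier Page* values remain valid.
    Page* appendPage() {
        std::lock_guard<std::mutex> lock{appendMutex};
        pages.emplace_back();
        return &pages.back();
    }

    // Reading the size touches the deque's internals, so it also takes the lock.
    size_t size() {
        std::lock_guard<std::mutex> lock{appendMutex};
        return pages.size();
    }

private:
    std::mutex appendMutex;
    std::deque<Page> pages;
};
```

A std::list gives the same reference stability, which is consistent with both containers being usable for this pattern; the difference between them is in lookup cost and allocation behaviour.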

ray6080

Looks good to me! I love the new iterator design 😄

src/include/storage/index/hash_index_builder.h

benjaminwinger force-pushed the builder-rehashing branch from 797caa5 to ee9ac1f, February 6, 2024 21:34

benjaminwinger marked this pull request as ready for review February 6, 2024 21:35

benjaminwinger force-pushed the builder-rehashing branch from ee9ac1f to bccc9a2, February 6, 2024 22:09

ray6080 self-requested a review February 7, 2024 00:41

benjaminwinger mentioned this pull request Feb 7, 2024

Per-index Memfile Locking Contention #2626

Closed

ray6080 approved these changes Feb 7, 2024

Support rehashing in the HashIndexBuilder

e5eda4c

benjaminwinger force-pushed the builder-rehashing branch from bccc9a2 to e5eda4c, February 7, 2024 14:38

benjaminwinger merged commit d60724e into master Feb 7, 2024
15 checks passed

benjaminwinger deleted the builder-rehashing branch February 7, 2024 17:20

benjaminwinger mentioned this pull request Feb 7, 2024

Release v0.2.0 #2662

Closed

24 tasks
