Set default HNSW "ef_construction" to 64 #230
Merged
The "ef_construction" parameter for HNSW controls how many nearest neighbors are considered when inserting a vector into the index. A higher value of ef_construction generally improves recall when querying the index, at the cost of longer index build times.
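To illustrate why a larger ef_construction costs more build time, here is a minimal sketch of the single-layer candidate search that HNSW performs during an insert, following the general shape of the published algorithm. It is not taken from this patch: the function name `search_layer`, the adjacency-dict graph layout, and the use of Euclidean distance are all assumptions made for illustration.

```python
import heapq
import math


def search_layer(vectors, neighbors, query, entry, ef):
    """Best-first search over one graph layer, keeping at most `ef` results.

    Hypothetical helper for illustration only. `vectors` maps node id to a
    coordinate tuple, `neighbors` maps node id to a list of adjacent node ids.
    During an insert, `ef` would be set to ef_construction.
    """
    def dist(i):
        return math.dist(vectors[i], query)

    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap: closest unexplored node first
    results = [(-dist(entry), entry)]     # max-heap via negation: farthest kept node on top

    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest candidate is farther than the worst kept result.
        if len(results) >= ef and d > -results[0][0]:
            break
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the farthest to stay within ef

    # Return (distance, node id) pairs, nearest first.
    return sorted((-d, i) for d, i in results)


# Tiny 1-D chain graph: 0 - 1 - 2 - 3 - 4 - 5
vectors = {i: (float(i),) for i in range(6)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
nearest = search_layer(vectors, neighbors, (4.2,), entry=0, ef=2)
```

The result list never grows past `ef`, so a larger value means more distance computations and more heap work per inserted vector, in exchange for a better set of neighbor candidates.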
This patch sets the default value of ef_construction to 64. This takes into account several different testing methodologies to determine what makes sense for this HNSW implementation, including:
to optimize what will give users a good performance/recall ratio while accounting for the amount of time it takes to build an index.
For example, using the dbpedia-openai-1000k-angular dataset from ANN Benchmarks[1], we can see the impact on recall and index build times of doubling ef_construction (values in parentheses are the percentage change over the preceding value):
Build Times
Recall @ ef_search=10
Recall @ ef_search=20
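As a reading aid for the parenthesized figures, the percentage over the preceding value can be computed as in the sketch below. The helper name `percent_over_previous` is hypothetical, and the input numbers are placeholders chosen for easy arithmetic, not measurements from these benchmarks.

```python
def percent_over_previous(values):
    """Return the percentage change of each value relative to the one before it.

    The first entry has no predecessor, so its change is reported as None.
    """
    changes = [None]
    for prev, cur in zip(values, values[1:]):
        changes.append(round((cur - prev) / prev * 100, 1))
    return changes


# Placeholder series (NOT benchmark data), one value per doubling of ef_construction.
example = [100.0, 150.0, 300.0]
changes = percent_over_previous(example)
```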
The results show that increasing ef_construction does increase build times, and that at higher values the recall gains diminish relative to the extra effort it takes to build the index. At ef_construction=64, however, though we pay a bit more effort up front, we see a measurable improvement in recall that justifies defaulting ef_construction to 64.
For comparison, here is the gist-960-euclidean dataset, which has posed challenges for many other algorithms in achieving good recall at lower search levels[1]. We can draw similar conclusions to those from the previous dataset:
Build Time
Recall @ ef_search=10
Recall @ ef_search=20
Recall @ ef_search=40
To recap, the slight increase in build time at ef_construction=64 yields much higher recall, whereas beyond this value the recall gained for the additional build time diminishes.
The last thing to consider is the behavior when inserting into an empty index, i.e. building the index incrementally as rows are inserted. While this kind of build will generally be slower, we want to ensure that the chosen value of ef_construction has a slowdown proportional to that of building the index in bulk. Again, we use the dbpedia-openai-1000k-angular dataset to look at these differences:
Empty Build
which is in line with the timings from the full build.
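One way to check that the incremental slowdown is "proportional" is to compare the ratio of empty-index build time to bulk build time across ef_construction values. The sketch below assumes a hypothetical helper `slowdown_ratios` and uses placeholder timings, not the measurements from this patch.

```python
def slowdown_ratios(bulk_times, incremental_times):
    """Map each ef_construction value to incremental_time / bulk_time."""
    return {
        ef: round(incremental_times[ef] / bulk_times[ef], 2)
        for ef in bulk_times
    }


# Placeholder timings in seconds (NOT benchmark data), keyed by ef_construction.
bulk = {32: 100.0, 64: 120.0, 128: 200.0}
incremental = {32: 150.0, 64: 180.0, 128: 300.0}
ratios = slowdown_ratios(bulk, incremental)

# If the ratios are roughly equal, the slowdown is proportional to the bulk build.
spread = max(ratios.values()) - min(ratios.values())
```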
These tests were run repeatedly with a variety of the ANN Benchmarks, including[1]:
and additional data can be furnished upon request.
[1] https://github.com/erikbern/ann-benchmarks