Question about R, L parameter setup. #26

Closed · HannanKan opened this issue Oct 21, 2021 · 5 comments

@HannanKan

According to the README, the degree of the graph index is recommended to be set between 60 and 150, and the size of the search list during index building between 75 and 200. What is the reason for giving these reference value ranges?

@ShikharJ
Contributor

@HannanKan These are just empirical figures that we've found to work well for some general datasets. If you want a more specific answer, the theory of graph ANNS dictates that R should be greater than or equal to the intrinsic dimensionality of the dataset. Intrinsic dimensionality is a tricky notion here, because it depends on how you choose to measure dimensionality; it can be through PCA or some other measure of choice. For most human-generated (and some machine-generated) datasets, the intrinsic dimensionality is much lower than the actual dimensionality, and as such, graph ANNS algorithms give much better performance than algorithms which do not make use of this information.
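
A minimal sketch of one way to measure this, assuming a PCA-based notion of intrinsic dimensionality (the toy data and the 95% variance cutoff are illustrative assumptions, not something DiskANN computes for you):

```python
# Illustrative only: estimate intrinsic dimensionality as the number of
# principal components needed to explain ~95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(10_000, 128).astype(np.float32)  # placeholder dataset

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
intrinsic_dim = int(np.searchsorted(cumulative, 0.95) + 1)

# A rough lower bound to keep in mind when picking R for a graph index.
print("estimated intrinsic dimensionality:", intrinsic_dim)
```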

As to what value of L is suitable, we've found through our experiments that a value between 1.5x and 2x the value of R suffices for most datasets. You can always increase L as much as you want, but you'll only see diminishing returns while the build time blows up.
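
As a concrete illustration of that rule of thumb (plain arithmetic, not a library call):

```python
# Rule of thumb from above: build-time L roughly 1.5x to 2x of R.
R = 64
L_low, L_high = int(1.5 * R), 2 * R
print(f"for R = {R}, try L between {L_low} and {L_high}")  # 96 to 128
```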

Hope this helps.

@harsha-simhadri
Contributor

To add to Shikhar's points, since there is an element of randomness to the graph construction, the degree needs to be at least log n (n = number of points in the index), and we find that 2 log n to 4 log n is suitable. So for a billion points, 64 to 128 is reasonable.
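
A small sketch of that calculation, assuming log base 2 (which is what makes the 64 to 128 figure come out for a billion points):

```python
# Degree heuristic from above: R roughly between 2*log2(n) and 4*log2(n).
import math

n = 1_000_000_000                        # number of points in the index
log2_n = math.log2(n)                    # ~29.9
R_low, R_high = 2 * log2_n, 4 * log2_n   # ~60 to ~120
print(f"suggested degree R roughly in [{R_low:.0f}, {R_high:.0f}]")
```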

The search list size needed to hit a given recall is a function of the "hardness" of the dataset, and Shikhar has given some idea of how hardness could be quantified. In practice, we empirically try a few values to see what matches the scenario's requirements. For datasets like BIGANN-1B and DEEP-1B, this parameter range corresponds to reasonable results with a build time of around 4 days.
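
A hedged sketch of that empirical loop; `build_index` and `measure_recall` below are hypothetical stubs standing in for your own build and evaluation harness, not DiskANN APIs:

```python
# Hypothetical parameter sweep over the build-time search list size L.
import time

def build_index(R, L):       # stub: replace with your actual index build
    return {"R": R, "L": L}

def measure_recall(index):   # stub: replace with recall on held-out queries
    return 0.0

R = 64
for L in (96, 112, 128):     # roughly 1.5x to 2x of R
    start = time.time()
    index = build_index(R, L)
    recall = measure_recall(index)
    print(f"L = {L}: recall = {recall:.3f}, build time = {time.time() - start:.1f}s")
```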

@HannanKan
Author

Thanks for your explanation @ShikharJ @harsha-simhadri. Why does increasing L in the building phase lead to better search performance (recall and QPS)?

My understanding is: a large L results in a large visited set V (returned by the GreedySearch function), and a large V increases the probability of introducing long-range edges (in the RobustPrune function).

Is that right?
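
For reference, here is a minimal Python sketch of the GreedySearch loop as I understand it from the paper (the names and brute-force distance computations are illustrative, not the actual implementation):

```python
# Minimal sketch of the Vamana-style GreedySearch; illustrative names only.
import numpy as np

def greedy_search(graph, points, start, query, k, L):
    """graph: node -> list of out-neighbours, points: node -> vector."""
    dist = lambda p: np.linalg.norm(points[p] - query)
    candidates = {start}   # the size-L search list
    visited = set()        # V, which the build phase passes to RobustPrune

    while candidates - visited:
        p_star = min(candidates - visited, key=dist)   # closest unvisited node
        visited.add(p_star)
        candidates |= set(graph[p_star])               # expand its neighbours
        if len(candidates) > L:                        # keep only the L closest
            candidates = set(sorted(candidates, key=dist)[:L])

    top_k = sorted(candidates, key=dist)[:k]
    return top_k, visited   # larger L -> larger visited set V
```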

@ShikharJ
Contributor

@HannanKan Sorry for the late reply. Yes, this is correct.

@ShikharJ
Contributor

Closing for now.
