Question about R, L parameter setup. #26
Comments
@HannanKan These are just empirical figures that we've found to work well on a range of general-purpose datasets. If you want a more principled answer, the theory of graph-based ANNS suggests that R should be greater than or equal to the intrinsic dimensionality of the dataset. Intrinsic dimensionality is a tricky notion here, because it depends on how you choose to measure dimensionality: it can be estimated through PCA or some other measure of choice. For most human-generated (and some machine-generated) datasets, the intrinsic dimensionality is much lower than the ambient dimensionality, which is why graph ANNS algorithms perform much better than algorithms that do not exploit this structure. As for what value of L is suitable, we've found empirically that a value between 1.5x and 2x the value of R suffices for most datasets. You can always increase L further, but you'll only see diminishing returns while the build time blows up. Hope this helps.
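One way to make the "intrinsic dimensionality via PCA" idea above concrete is to count how many principal components are needed to explain most of the variance. This is a hedged sketch, not DiskANN's actual recipe: the 95% variance threshold and the synthetic low-rank data are illustrative assumptions.

```python
import numpy as np

def pca_intrinsic_dim(X, var_threshold=0.95):
    """Smallest number of principal components explaining
    `var_threshold` of the total variance (one proxy for choosing R)."""
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data give the per-component variances.
    s = np.linalg.svd(Xc, full_matrices=False, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, var_threshold) + 1)

# Toy example: 128-dim vectors that actually live in a ~10-dim subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 10))
proj = rng.normal(size=(10, 128))
X = latent @ proj + 0.01 * rng.normal(size=(1000, 128))

d_intrinsic = pca_intrinsic_dim(X)
print(d_intrinsic)  # far below the ambient dimensionality of 128
```

By the heuristic in the comment above, a dataset like this one would tolerate a much smaller R than its 128 ambient dimensions suggest.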
To add to Shikhar's points: since there is an element of randomness in the graph construction, the degree needs to be at least log n (n = #points in the index), and we find that 2 log n to 4 log n is suitable. So for a billion points, 64 to 128 is reasonable. The search list size needed to hit a given recall is a function of the "hardness" of the dataset, and Shikhar has given some idea of how hardness could be quantified. In practice, we empirically try a few values to see what matches the scenario's requirements. For datasets like BIGANN-1B and DEEP-1B, this parameter range corresponds to reasonable results with a build time of around 4 days.
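The two rules of thumb in this thread (R in roughly 2 log n to 4 log n, and L in 1.5x to 2x of R) can be sketched as a tiny calculator. The base-2 log and the rounding are my assumptions read off this thread, not constants fixed anywhere in DiskANN.

```python
import math

def suggest_params(n_points):
    """Return ((R_lo, R_hi), (L_lo, L_hi)) from the thread's heuristics."""
    lg = math.log2(n_points)
    # Degree R between 2*log2(n) and 4*log2(n), per the comment above.
    r_lo, r_hi = round(2 * lg), round(4 * lg)
    # Build-time search list L between 1.5x and 2x of R.
    return (r_lo, r_hi), (round(1.5 * r_lo), round(2 * r_hi))

print(suggest_params(10**9))  # ((60, 120), (90, 240))
```

For a billion points this lands near the 64-to-128 degree range quoted above, and the L range brackets the README's 75-to-200 recommendation.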
Grateful for your explanation @ShikharJ @harsha-simhadri. Why does increasing L in the build phase lead to better search performance (recall and QPS)? My understanding is: a larger L results in a larger visited set V (returned by the GreedySearch function), and a larger V increases the probability of introducing long-range edges (in the RobustPrune function). Is that right?
@HannanKan Sorry for the late reply. Yes, this is correct.
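The mechanism confirmed above can be sketched on a toy graph: a beam-style GreedySearch that keeps the L closest candidates visits more nodes as L grows, which is exactly what hands RobustPrune a richer candidate pool. This is an illustrative sketch on a crude k-NN graph, not the library's real Vamana code; the graph build and all parameters are assumptions.

```python
import heapq
import numpy as np

def greedy_search(points, graph, start, query, L):
    """Toy GreedySearch: beam search keeping the L closest candidates.
    Returns (top-L node ids, visited set V)."""
    dist = lambda i: float(np.linalg.norm(points[i] - query))
    candidates = [(dist(start), start)]   # min-heap ordered by distance
    best = {start: dist(start)}           # the L closest nodes seen so far
    visited = set()
    while candidates:
        d, u = heapq.heappop(candidates)
        # Stop once the next candidate is farther than the current L-th best.
        if len(best) >= L and d > sorted(best.values())[L - 1]:
            break
        if u in visited:
            continue
        visited.add(u)
        for v in graph[u]:
            if v not in best:
                best[v] = dist(v)
                heapq.heappush(candidates, (best[v], v))
        # Trim the beam back to the L closest.
        if len(best) > L:
            best = dict(sorted(best.items(), key=lambda kv: kv[1])[:L])
    return sorted(best, key=best.get)[:L], visited

rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 16))
# Crude k-NN graph (k=8) as a stand-in for a Vamana graph.
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
graph = {i: list(np.argsort(d2[i])[1:9]) for i in range(len(pts))}

q = rng.normal(size=16)
_, v_small = greedy_search(pts, graph, 0, q, L=10)
_, v_large = greedy_search(pts, graph, 0, q, L=50)
print(len(v_small), len(v_large))  # larger L -> larger visited set V
```

The larger visited set is what RobustPrune draws from when selecting a node's out-edges, so a bigger L exposes more distant candidates during construction.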
Closing for now. |
According to the README, the degree of the graph index is recommended to be set between 60 and 150, and the size of the search list during index building between 75 and 200. What is the reasoning behind these reference value ranges?