Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Self-loops in NN graph #12

Open
SamGRosen opened this issue Mar 21, 2024 · 1 comment
Open

Self-loops in NN graph #12

SamGRosen opened this issue Mar 21, 2024 · 1 comment

Comments

@SamGRosen
Copy link

SamGRosen commented Mar 21, 2024

Using R version 4.2.2 and rnndescent 0.1.4, the returned NN graph contains self-loops. Is this by design, for computational reasons, or a bug? Does this affect performance if the returned graphs are used as the init argument in subsequent calls to nnd_knn?
I could always just call the functions with $k+1$ neighbors and remove the first column, but I'm not sure if this will lead to a noticeable difference in speed or recall.

Ex:

test <- matrix(rnorm(200), nrow=20)
brute <- brute_force_knn(test, 4)
approx <- nnd_knn(test, 4)
brute$idx[, 1] #  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
approx$idx[, 1] #  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

I see now that this is the behavior of pynndescent. However, could you elaborate how to best handle this for init arguments, keep the loops or remove them?

@jlmelville
Copy link
Owner

Yes, this is by design. I also found this very confusing initially -- I even opened an issue about it in the UMAP repo (lmcinnes/umap#53) many years ago, but it turns out it's not uncommon to define nearest neighbors this way.

If you need $k$ distinct non-self neighbors, then the best way forward is always to ask for $k + 1$ neighbors. In terms of init, I don't think it really matters because there is nothing in any of the code that relies on self loops not existing, it's assumed that the self neighbor is just as valid an item in a neighbor list as any other item. For e.g. k = 15, the fact that you only have 14 "real" neighbors to initialize with versus 15 probably doesn't affect the behavior of the algorithm that much: most candidates are either already seen or aren't moving the search in the right direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants