Add hnswlib as k nearest neighbour index #148
Conversation
Thank you for this PR, I think it would be a great idea to support hnswlib. However, as it stands, we can't add a hard dependency on hnswlib because of the compatibility issues it could introduce. I would be very happy to add it as an optional dependency though, so if you wanted to use it, you would install it yourself.
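To illustrate the optional-dependency idea being discussed, a minimal sketch of the usual lazy-import pattern (the function name `make_hnsw_index` is illustrative, not openTSNE's actual code):

```python
# Import hnswlib lazily; fail with a helpful message only when the
# user actually asks for the HNSW index.
try:
    import hnswlib  # optional; only needed for the HNSW index
except ImportError:
    hnswlib = None


def make_hnsw_index(dim, space="l2"):
    """Create an hnswlib index, or explain how to enable the feature."""
    if hnswlib is None:
        raise ImportError(
            "hnswlib is required for this index; install it with "
            "`pip install hnswlib`"
        )
    return hnswlib.Index(space=space, dim=dim)
```

This keeps the core package installable everywhere while still letting users opt in to the faster index.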
Thank you for the feedback. I didn't have the compatibility issues in mind, but adding a hard dependency felt wrong somehow anyway. Unfortunately, I won't really have time before Wednesday to make this PR nice (including proper tests and documentation). But I set a reminder for myself, so you can expect an update next week for sure!
That's not a problem at all. I appreciate the PR! I would also be interested in seeing benchmarks between annoy and hnswlib. I was under the impression that annoy was pretty fast.
Hi, I updated the pull request.
Notes
Let me know what you think :-)
openTSNE/nearest_neighbors.py (Outdated)

    self.index = Index(space=hnsw_space, dim=data.shape[1])

    metric_params = {
@pavlin-policar Is this the intended use of metric_params? It does work, but the naming is a little bit confusing: here it is used more as an index_params.
No, you're using metric_params to pass parameters to the index construction. The idea of metric_params is to cover the case where the metric we use takes additional parameters, as in scikit-learn. I don't think I've ever had to use this.
We currently don't have any way to actually alter the index parameters, as you're doing here, which takes away some flexibility, I guess. But that also forces us to have good defaults.
Please change this though, since this is not the intended usage of metric_params.
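For context, scikit-learn's convention is that metric_params parameterizes the distance metric itself, e.g. the inverse covariance matrix for the Mahalanobis distance. A minimal sketch of that convention (plain scikit-learn, not openTSNE code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).randn(50, 3)

# metric_params configures the *metric*, not the index: Mahalanobis
# distance needs the inverse covariance matrix "VI" as a parameter.
VI = np.linalg.inv(np.cov(X.T))
nn = NearestNeighbors(
    n_neighbors=3,
    algorithm="brute",
    metric="mahalanobis",
    metric_params={"VI": VI},
)
nn.fit(X)
dist, idx = nn.kneighbors(X)

# Sanity check: each point's nearest neighbour is itself (distance 0).
assert (idx[:, 0] == np.arange(len(X))).all()
```

Settings like ef_construction or M in hnswlib configure the index data structure, not the metric, which is why they don't belong in metric_params under this convention.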
Fair enough. I removed this part for now; it could stay in as an undocumented hack though :P
There seems to be no easy way to make the defaults configurable from the outside, so I set the defaults from the HNSW examples.
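For reference, index construction with the example defaults looks roughly like this (a sketch assuming the standard hnswlib Python API; the ef_construction/M values are the ones from the hnswlib README examples, not necessarily exactly what the PR ships):

```python
import numpy as np

try:
    import hnswlib  # optional dependency
except ImportError:
    hnswlib = None


def build_index(data, k=15):
    """Build an HNSW index with example defaults and query k neighbours."""
    if hnswlib is None:
        raise ImportError("hnswlib is required: pip install hnswlib")
    n, dim = data.shape
    index = hnswlib.Index(space="l2", dim=dim)
    # ef_construction and M taken from the hnswlib README examples
    index.init_index(max_elements=n, ef_construction=200, M=16)
    index.add_items(data)
    index.set_ef(max(50, k + 1))  # ef must exceed k at query time
    neighbors, distances = index.knn_query(data, k=k)
    return neighbors, distances
```

Higher ef_construction and M trade build time and memory for recall, so hard-coded defaults are a compromise until the index parameters become configurable.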
Sorry it took me so long to get to this. Overall, I think it looks really good. I also ran it locally on a 44k data set, and it took 2 seconds! Annoy took 10 seconds, so that's really quite impressive.
I have a few nitpicks regarding code style, and I'd appreciate it if you could fix them. Mainly, you changed a few lines with no actual changes by introducing whitespace; this will clutter the git history, so please remove that. I also generally use trailing commas, but that's really a nitpick.
If you could change the metric_params and the k check that you mentioned, I think this is fine to merge.
One more general remark: you're creating a PR with your master branch, which makes it difficult to check out locally. Generally, the best practice is to make a new branch on your end and make a PR on that. Just so you know for next time :)
openTSNE/nearest_neighbors.py (Outdated)

@@ -12,7 +12,7 @@ class KNNIndex:
     VALID_METRICS = []

     def __init__(
-        self, metric, metric_params=None, n_jobs=1, random_state=None, verbose=False
+        self, metric, metric_params=None, n_jobs=1, random_state=None, verbose=False
Please remove the added spacing here.
Sorry, it's kind of muscle memory to run a PEP8 check once I've written a few lines, so apparently my IDE reformatted other parts. I fixed that now.
The first commit you added fixed this, the second one re-added this spacing. I know this isn't technically PEP8 compliant, but we follow the black style guide. I think it's cleaner than strict adherence to PEP8.
I'm probably going to have to delay the next release until numba supports py39, which is supposedly going to happen by mid-December. So if you can finalize this PR in the next two weeks, that would be amazing. I'd be really happy to include this in the next release!
Hopefully I captured all your points. Let me know what needs updating :-)
Perfect! Thanks for indulging my nitpicks. I squashed them here, because there were 3 commits called "Clean up code style" :) Thanks for all your help. I haven't done any proper benchmarks, but HNSW seems to be even faster than annoy.
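A rough way to compare build times locally would be a sketch like the following (the timing helper and library usage are illustrative; no benchmark numbers are claimed here):

```python
# Sketch: time index construction for a rough annoy-vs-hnswlib comparison.
import time
import numpy as np


def time_build(build_fn, data):
    """Return (seconds, result) for a single index-construction call."""
    start = time.perf_counter()
    result = build_fn(data)
    return time.perf_counter() - start, result


if __name__ == "__main__":
    rng = np.random.RandomState(42)
    data = rng.randn(1000, 50).astype(np.float32)

    try:
        import hnswlib

        def build_hnsw(x):
            idx = hnswlib.Index(space="l2", dim=x.shape[1])
            idx.init_index(max_elements=x.shape[0], ef_construction=200, M=16)
            idx.add_items(x)
            return idx

        secs, _ = time_build(build_hnsw, data)
        print(f"hnswlib build: {secs:.3f}s")
    except ImportError:
        print("hnswlib not installed; skipping")
```

The same helper can wrap an annoy build for a side-by-side comparison on identical data.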
No worries, it's an honour to contribute :-)
Issue
The kNN algorithms available in openTSNE are not as fast and/or memory efficient as HNSW (https://arxiv.org/abs/1603.09320).
Furthermore, Annoy, BallTree, and pynndescent are not designed for high-dimensional data; for example, Annoy even throws an error above 1000 dimensions.
Description of changes
This pull request ...
- adds build_knn_index for Affinities
- adds a KNNIndex class to interface with hnswlib
Includes ...
If you like the idea of having HNSW as an alternative, I'll add tests and some documentation.