
Add hnswlib as k nearest neighbour index #148

Merged · 5 commits · Nov 27, 2020

Conversation

TimRepke
Contributor

@TimRepke TimRepke commented Oct 25, 2020

Issue

The kNN algorithms available in openTSNE are not as fast and/or memory-efficient as HNSW (https://arxiv.org/abs/1603.09320).
Furthermore, Annoy, BallTree, and pynndescent are not made for high-dimensional data; for example, Annoy even throws an error above 1000 dimensions.

Description of changes

This pull request ...

  • adds hnswlib as a dependency
  • adds an option to choose 'hnswlib' in build_knn_index for Affinities
  • implements additional KNNIndex class to interface with hnswlib
Includes
  • Code changes
  • Tests
  • Documentation

If you like the idea of having HNSW as an alternative, I'll add tests and some documentation.

@pavlin-policar
Owner

Thank you for this PR, I think it would be a great idea to support hnswlib, and it's something I have looked at in the past.

However, as it stands, we can't add a hard dependency on hnswlib. While hnswlib is available on PyPI, there's only a source distribution. Because hnswlib is a C++ library, it would need to be compiled on the user's machine, which is not okay. This is especially problematic on Windows machines, which do not come with a C++ compiler by default. What we would need are precompiled wheels for all major platforms and Python versions. Furthermore, openTSNE is also available on conda-forge, so there would also need to be a conda-forge package of hnswlib, which there currently isn't. So a hard dependency is definitely a no-go.

I would be very happy to add it as an optional dependency though. So if you wanted to use hnswlib with openTSNE, you would just manually install it into your environment. This is what we currently do with pynndescent. The changes to this PR should be fairly minimal, just take a look at what we do for NNDescent. And also, please add unit tests, something like TestAnnoy should be fine.
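A sketch of the optional-dependency pattern described above (the function name and error message are illustrative, not the actual openTSNE code):

```python
def check_hnswlib_installed():
    """Raise a helpful error if the optional hnswlib dependency is missing.

    Mirrors the lazy-import approach used for optional backends: the import
    only happens when the user actually selects the HNSW index.
    """
    try:
        import hnswlib  # noqa: F401
        return True
    except ImportError:
        raise ImportError(
            "`hnswlib` is an optional dependency; install it with "
            "`pip install hnswlib` to use the HNSW nearest-neighbour index."
        )
```

This way nothing is imported (or compiled) unless the user opts in.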

@TimRepke
Contributor Author

Thank you for the feedback. I didn't have the compatibility issues in mind, but adding a hard dependency felt wrong somehow anyway.

Unfortunately, I won't really have time before Wednesday to make this PR nice (including proper tests and documentation). But I set a reminder for myself, so you can expect an update next week for sure!

@pavlin-policar
Owner

Unfortunately, I won't really have time before Wednesday to make this PR nice (including proper tests and documentation). But I set a reminder for myself, so you can expect an update next week for sure!

That's not a problem at all. I appreciate the PR! I would also be interested in seeing benchmarks between annoy and hnswlib. I was under the impression that annoy was pretty fast.

@TimRepke
Contributor Author

TimRepke commented Oct 29, 2020

Hi,

I updated the pull request

  • Mark hnswlib as an optional dependency (also did this for pynndescent)
  • Add hint for hnsw index to docstrings
  • Add sensible import check
  • Add pip install commands to azure setups
  • Update metric aliases for hnsw (originally, ip and l2 were missing)
  • Update metric_param handling (originally, missing arguments might have triggered a failure)

Notes

  • I did not change the auto-index setup (I guess that's fine as is)
  • I added tests for exact nearest-neighbour matches, but those fail (it's an approximate NN index, after all); they are not included here
    • Should I add them back in? There are 25/1500 (1.67%) mismatched elements, so a threshold check could work instead of an exact match.
  • As far as I can tell, hnswlib has no true random initialisation; it falls back to a default random seed if none is provided. Is that an issue?
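The threshold check mentioned above could be a per-row recall measure; a sketch in plain numpy, with made-up index arrays standing in for the approximate and exact results:

```python
import numpy as np

def knn_recall(approx_indices, exact_indices):
    """Fraction of true k-nearest neighbours recovered by the approximate
    index, computed per row as a set intersection (order within a row
    doesn't matter)."""
    hits = sum(
        len(set(approx_row) & set(exact_row))
        for approx_row, exact_row in zip(approx_indices, exact_indices)
    )
    return hits / exact_indices.size

# 25 mismatched entries out of 1500, as in the note above
exact = np.arange(1500).reshape(100, 15)
approx = exact.copy()
approx.flat[:25] = -1  # pretend 25 entries disagree with the exact result
assert knn_recall(approx, exact) > 0.98  # 1475/1500 ≈ 0.983
```

A test could then assert recall above some threshold (e.g. 0.95) rather than exact equality.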

Let me know what you think :-)


self.index = Index(space=hnsw_space, dim=data.shape[1])

metric_params = {
Contributor Author

@pavlin-policar Is this the intended use of metric_params? It does work, but the naming is a little confusing; here it is used more like index_params.

Owner

No, you're using metric_params to pass parameters to the index construction. metric_params is meant for metrics that take additional parameters, following scikit-learn's convention. I don't think I've ever had to use it.

We currently don't have any way to actually alter the index parameters, as you're doing here, which takes away some flexibility, I guess. But that also forces us to have good defaults.

Please change this, though, since this is not the intended usage of metric_params.

Contributor Author

Fair enough. I removed this part for now; it could stay in as an undocumented hack, though :P
There seems to be no easy way to make these defaults configurable from the outside, so I set them to the values from the HNSW examples.


@pavlin-policar pavlin-policar left a comment


Sorry this took me so long to get to. Overall, I think it looks really good. I also ran it locally on a 44k data set, and it took 2 seconds, where Annoy took 10 seconds. That's really quite impressive.

I have a few nitpicks regarding code style, and I'd appreciate it if you could fix them. Mainly, you introduced whitespace-only changes on a few lines where nothing else changed; this will clutter the git history, so please remove them. I also generally use trailing commas, but that's really a nitpick.

If you could change the metric_params and the k check that you mentioned, I think this is fine to merge.

One more general remark: you're creating this PR from your master branch, which makes it difficult to check out locally. Generally, best practice is to create a new branch and open the PR from that. Just so you know for next time :)



@@ -12,7 +12,7 @@ class KNNIndex:
VALID_METRICS = []

def __init__(
self, metric, metric_params=None, n_jobs=1, random_state=None, verbose=False
self, metric, metric_params=None, n_jobs=1, random_state=None, verbose=False
Owner

Please remove the added spacing here.

Contributor Author

Sorry, it's kind of muscle memory to run a PEP8 check once I've written a few lines, so apparently my IDE reformatted other parts. I've fixed that now.

Owner

The first commit you added fixed this, but the second one re-added the spacing. I know this isn't technically PEP8-compliant, but we follow the black style guide; I think it's cleaner than strict adherence to PEP8.

@pavlin-policar
Owner

I'm probably going to have to delay the next release until numba supports py39, which is supposedly going to be by mid-December. So if you can finalize this PR in the next two weeks, that would be amazing. I'd be really happy to include this in the next release!

@TimRepke
Contributor Author

Hopefully I captured all your points. Let me know what needs updating :-)
Do you prefer squashed commits per PR? I'm not sure how well GitHub handles that across repositories, but I could try.

@pavlin-policar pavlin-policar merged commit 95dfe72 into pavlin-policar:master Nov 27, 2020
@pavlin-policar
Owner

Perfect! Thanks for indulging my nitpicks. I squashed the commits here, because there were three commits called "Clean up code style" :) Thanks for all your help. I haven't done any proper benchmarks, but HNSW seems to be even faster than Annoy.

@TimRepke
Contributor Author

No worries, it's an honour to contribute :-)
