
Add hnswlib as k nearest neighbour index #148

Merged · 5 commits · Nov 27, 2020

Conversation

TimRepke
Contributor

@TimRepke TimRepke commented Oct 25, 2020

Issue

The kNN algorithms available in openTSNE are not as fast and/or memory-efficient as HNSW (https://arxiv.org/abs/1603.09320).
Furthermore, Annoy, BallTree, and pynndescent are not made for high-dimensional data; for example, Annoy even throws an error above 1000 dimensions.

Description of changes

This pull request ...

  • adds hnswlib as a dependency
  • adds an option to choose 'hnswlib' in build_knn_index for Affinities
  • implements additional KNNIndex class to interface with hnswlib
Includes
  • Code changes
  • Tests
  • Documentation

If you like the idea of having HNSW as an alternative, I'll add tests and some documentation.

@pavlin-policar
Owner

Thank you for this PR, I think it would be a great idea to support hnswlib, and it's something I have looked at in the past.

However, as it stands, we can't add a hard dependency on hnswlib. While hnswlib is available on PyPI, there's only a source distribution. Because hnswlib is a C++ library, it would need to be compiled on the user's machine, which is not okay. This is especially problematic on Windows machines, which do not come with a C++ compiler by default. What we would need are precompiled wheels for all major platforms and Python versions. Furthermore, openTSNE is also available on conda-forge, so there would also need to be a conda-forge package of hnswlib, which there currently isn't. So a hard dependency is definitely a no-go.

I would be very happy to add it as an optional dependency though. So if you wanted to use hnswlib with openTSNE, you would just manually install it into your environment. This is what we currently do with pynndescent. The changes to this PR should be fairly minimal, just take a look at what we do for NNDescent. And also, please add unit tests, something like TestAnnoy should be fine.
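A sketch of the optional-dependency pattern described above (the function name and error message are illustrative, not the actual openTSNE code):

```python
def check_hnswlib_installed():
    """Raise a helpful error if the optional hnswlib dependency is missing.

    Mirrors the lazy-import approach used for optional backends: the import
    only happens when the user actually selects the HNSW index.
    """
    try:
        import hnswlib  # noqa: F401
        return True
    except ImportError:
        raise ImportError(
            "`hnswlib` is an optional dependency; install it with "
            "`pip install hnswlib` to use the HNSW nearest-neighbour index."
        )
```

This way nothing is imported (or compiled) unless the user opts in.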

@TimRepke
Contributor Author

Thank you for the feedback. I didn't have the compatibility issues in mind, but adding a hard dependency felt wrong somehow anyway.

Unfortunately, I won't really have time before Wednesday to make this PR nice (including proper tests and documentation). But I set a reminder for myself, so you can expect an update next week for sure!

@pavlin-policar
Owner

Unfortunately, I won't really have time before Wednesday to make this PR nice (including proper tests and documentation). But I set a reminder for myself, so you can expect an update next week for sure!

That's not a problem at all. I appreciate the PR! I would also be interested in seeing benchmarks between annoy and hnswlib. I was under the impression that annoy was pretty fast.

@TimRepke
Contributor Author

TimRepke commented Oct 29, 2020

Hi,

I updated the pull request

  • Mark hnswlib as an optional dependency (also did this for pynndescent)
  • Add hint for hnsw index to docstrings
  • Add sensible import check
  • Add pip install commands to azure setups
  • Update metric aliases for hnsw (originally, ip and l2 were missing)
  • Update metric_param handling (originally, missing arguments might have triggered a failure)

Notes

  • I did not change the auto-index setup (I guess that's fine as is)
  • I added tests for exact nearest-neighbour matches, but those fail (it's an approximate NN index, after all); they are not included here
    • Should I add them back in? There are 25/1500 (1.67%) mismatched elements, so a threshold check could work instead of an exact match.
  • As far as I can tell, hnswlib has no true random initialisation; it falls back to a default random seed if none is provided. Is that an issue?
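The threshold check mentioned above could be a per-row recall measure; a sketch in plain numpy, with made-up index arrays standing in for the approximate and exact results:

```python
import numpy as np

def knn_recall(approx_indices, exact_indices):
    """Fraction of true k-nearest neighbours recovered by the approximate
    index, computed per row as a set intersection (order within a row
    doesn't matter)."""
    hits = sum(
        len(set(approx_row) & set(exact_row))
        for approx_row, exact_row in zip(approx_indices, exact_indices)
    )
    return hits / exact_indices.size

# 25 mismatched entries out of 1500, as in the note above
exact = np.arange(1500).reshape(100, 15)
approx = exact.copy()
approx.flat[:25] = -1  # pretend 25 entries disagree with the exact result
assert knn_recall(approx, exact) > 0.98  # 1475/1500 ≈ 0.983
```

A test could then assert recall above some threshold (e.g. 0.95) rather than exact equality.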

Let me know what you think :-)


self.index = Index(space=hnsw_space, dim=data.shape[1])

metric_params = {
Contributor Author

@pavlin-policar Is this the intended use of metric_params? It does work, but the naming is a little confusing; here it is used more like index_params.

Owner

No, you're using metric_params to pass parameters to the index construction. metric_params is meant for metrics that take additional parameters, following scikit-learn's convention. I don't think I've ever had to use it.

We currently don't have any way to actually alter the index parameters, as you're doing here, which takes away some flexibility, I guess. But that also forces us to have good defaults.

Please change this, though, since this is not the intended usage of metric_params.

Contributor Author

Fair enough. I removed this part for now; it could stay in as an undocumented hack, though :P
There seems to be no easy way to make these defaults configurable from the outside, so I set them to the values from the HNSW examples.


@pavlin-policar pavlin-policar left a comment


Sorry this took me so long to get to. Overall, I think it looks really good. I also ran it locally on a 44k data set, and it took 2 seconds, where Annoy took 10 seconds. That's really quite impressive.

I have a few nitpicks regarding code style, and I'd appreciate it if you could fix them. Mainly, you introduced whitespace-only changes on a few lines where nothing else changed; this will clutter the git history, so please remove them. I also generally use trailing commas, but that's really a nitpick.

If you could change the metric_params and the k check that you mentioned, I think this is fine to merge.

One more general remark: you're creating this PR from your master branch, which makes it difficult to check out locally. Generally, best practice is to create a new branch and open the PR from that. Just so you know for next time :)



@@ -12,7 +12,7 @@ class KNNIndex:
VALID_METRICS = []

def __init__(
self, metric, metric_params=None, n_jobs=1, random_state=None, verbose=False
self, metric, metric_params=None, n_jobs=1, random_state=None, verbose=False
Owner

Please remove the added spacing here.

Contributor Author

Sorry, it's kind of muscle memory to run a PEP8 check once I've written a few lines, so apparently my IDE reformatted other parts. I've fixed that now.

Owner

The first commit you added fixed this, but the second one re-added the spacing. I know this isn't technically PEP8-compliant, but we follow the black style guide; I think it's cleaner than strict adherence to PEP8.

@pavlin-policar
Owner

I'm probably going to have to delay the next release until numba supports py39, which is supposedly going to be by mid-December. So if you can finalize this PR in the next two weeks, that would be amazing. I'd be really happy to include this in the next release!

@TimRepke
Contributor Author

Hopefully I captured all your points. Let me know what needs updating :-)
Do you prefer squashed commits per PR? I'm not sure how well GitHub handles that across repositories, but I could try.

@pavlin-policar pavlin-policar merged commit 95dfe72 into pavlin-policar:master Nov 27, 2020
@pavlin-policar
Owner

Perfect! Thanks for indulging my nitpicks. I squashed the commits here, because there were three commits called "Clean up code style" :) Thanks for all your help. I haven't done any proper benchmarks, but HNSW seems to be even faster than Annoy.

@TimRepke
Contributor Author

No worries, it's an honour to contribute :-)
