
[INFO] scipy cKDTree is faster than sklearn NN for single queries #38

Closed

btjanaka opened this issue Jan 6, 2021 · 1 comment

@btjanaka
Member

btjanaka commented Jan 6, 2021

In CVTArchive, we need a k-D tree for nearest-neighbor searches when there are many centroids. There are many k-D tree implementations to choose from; two notable ones are scipy.spatial.cKDTree and sklearn.neighbors.NearestNeighbors. Both are optimized for batched nearest-neighbor queries, but in _get_index, we query the nearest neighbor of a single point. The following code compares the performance of each implementation on single and batch queries.

import time

import numpy as np
from scipy.spatial import cKDTree
from sklearn.neighbors import NearestNeighbors
from tqdm import trange

samples = np.random.uniform(
    np.array([-1, -1]),
    np.array([1, 1]),
    size=(100_000, 2),
)

points = np.random.uniform(
    np.array([-1, -1]),
    np.array([1, 1]),
    size=(100, 2),
)

print("scipy cKDTree (batch)")
nn = cKDTree(points)
start = time.time()
nn.query(samples)
print(time.time() - start)

print("scipy cKDTree (single)")
nn = cKDTree(points)
start = time.time()
for i in trange(len(samples)):
    nn.query(samples[i])
print(time.time() - start)

print("sklearn NN with kd_tree (batch)")
nn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(points)
start = time.time()
nn.kneighbors(samples)
print(time.time() - start)

print("sklearn NN with kd_tree (single)")
nn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(points)
start = time.time()
for i in trange(len(samples)):
    nn.kneighbors(np.expand_dims(samples[i], axis=0))
print(time.time() - start)

I got the following output, which shows that NearestNeighbors is ~10x slower on single queries.

scipy cKDTree (batch)
0.036180734634399414
scipy cKDTree (single)
100%|██████████████████████████████████████████████████████████| 100000/100000 [00:03&lt;00:00, 27718.94it/s]
3.6090946197509766   # cKDTree is fast
sklearn NN with kd_tree (batch)
0.052004098892211914
sklearn NN with kd_tree (single)
100%|███████████████████████████████████████████████████████████| 100000/100000 [00:41&lt;00:00, 2397.46it/s]
41.71117091178894    # NearestNeighbors is slow

In short, we should definitely use cKDTree.
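To illustrate the single-query pattern, here is a minimal sketch of how a _get_index-style lookup could use cKDTree. The centroid values and the get_index helper are hypothetical, for illustration only; they are not the actual CVTArchive implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical centroids -- made up for illustration.
centroids = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [-1.0, 1.0],
])
tree = cKDTree(centroids)

def get_index(behavior_values):
    # cKDTree.query on a single point returns (distance, index).
    _, index = tree.query(behavior_values)
    return int(index)

print(get_index(np.array([0.9, 0.8])))  # nearest centroid is [1, 1] -> 1
```

Since the tree is built once from the centroids, each call only pays the cost of a single query, which the benchmark above shows is roughly 10x cheaper with cKDTree than with NearestNeighbors.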

@btjanaka btjanaka closed this as completed Jan 6, 2021
@btjanaka btjanaka mentioned this issue Jan 6, 2021
@btjanaka
Member Author

Update as of v0.5.0: Most of pyribs now uses batch inputs, making this issue a bit less relevant since we do not have to worry about querying the k-D tree one point at a time. However, it is still useful for functions such as add_single which add one entry at a time. Furthermore, scipy's k-D tree was faster than sklearn's even with batched inputs.
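For the batch case mentioned above, a single cKDTree.query call maps every point to its nearest centroid at once. A minimal sketch (centroid and point values are made up for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical centroids -- made up for illustration.
centroids = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [-1.0, 1.0],
])
tree = cKDTree(centroids)

# Batch query: one call returns the nearest-centroid index for every point.
points = np.array([
    [0.1, -0.1],
    [0.9, 1.1],
])
_, indices = tree.query(points)
print(indices)  # [0 1]
```

This is the pattern the batch benchmark above exercises, and it avoids the per-call overhead entirely.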
