
[INFO] scipy cKDTree is faster than sklearn NN for single queries #38

Closed

btjanaka opened this issue Jan 6, 2021 · 1 comment

@btjanaka
Member

btjanaka commented Jan 6, 2021

In CVTArchive, we need a k-D tree for nearest-neighbor searches when there are many centroids. There are many k-D tree implementations to choose from; two notable ones are scipy.spatial.cKDTree and sklearn.neighbors.NearestNeighbors. Both are optimized for batched nearest-neighbor queries, but in _get_index, we query the nearest neighbor of a single point. The following code compares the performance of each implementation on single and batch queries.

import time

import numpy as np
from scipy.spatial import cKDTree
from sklearn.neighbors import NearestNeighbors
from tqdm import trange

samples = np.random.uniform(
    np.array([-1, -1]),
    np.array([1, 1]),
    size=(100_000, 2),
)

points = np.random.uniform(
    np.array([-1, -1]),
    np.array([1, 1]),
    size=(100, 2),
)

print("scipy cKDTree (batch)")
nn = cKDTree(points)
start = time.time()
nn.query(samples)
print(time.time() - start)

print("scipy cKDTree (single)")
nn = cKDTree(points)
start = time.time()
for i in trange(len(samples)):
    nn.query(samples[i])
print(time.time() - start)

print("sklearn NN with kd_tree (batch)")
nn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(points)
start = time.time()
nn.kneighbors(samples)
print(time.time() - start)

print("sklearn NN with kd_tree (single)")
nn = NearestNeighbors(n_neighbors=1, algorithm="kd_tree").fit(points)
start = time.time()
for i in trange(len(samples)):
    nn.kneighbors(np.expand_dims(samples[i], axis=0))
print(time.time() - start)

I got the following output, which shows that NearestNeighbors is ~10x slower on single queries.

scipy cKDTree (batch)
0.036180734634399414
scipy cKDTree (single)
100%|██████████████████████████████████████████████████████████| 100000/100000 [00:03&lt;00:00, 27718.94it/s]
3.6090946197509766   # cKDTree is fast
sklearn NN with kd_tree (batch)
0.052004098892211914
sklearn NN with kd_tree (single)
100%|███████████████████████████████████████████████████████████| 100000/100000 [00:41&lt;00:00, 2397.46it/s]
41.71117091178894    # NearestNeighbors is slow

In short, we should definitely use cKDTree.
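To illustrate the single-query pattern, here is a minimal sketch of how a _get_index-style lookup could use cKDTree. The centroid values and the get_index helper are hypothetical, for illustration only; they are not the actual CVTArchive implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical centroids -- made up for illustration.
centroids = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [-1.0, 1.0],
])
tree = cKDTree(centroids)

def get_index(behavior_values):
    # cKDTree.query on a single point returns (distance, index).
    _, index = tree.query(behavior_values)
    return int(index)

print(get_index(np.array([0.9, 0.8])))  # nearest centroid is [1, 1] -> 1
```

Since the tree is built once from the centroids, each call only pays the cost of a single query, which the benchmark above shows is roughly 10x cheaper with cKDTree than with NearestNeighbors.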

@btjanaka btjanaka closed this as completed Jan 6, 2021
@btjanaka btjanaka mentioned this issue Jan 6, 2021
@btjanaka
Member Author

Update as of v0.5.0: Most of pyribs now uses batch inputs, making this issue a bit less relevant since we do not have to worry about querying the k-D tree one point at a time. However, it is still useful for functions such as add_single which add one entry at a time. Furthermore, scipy's k-D tree was faster than sklearn's even with batched inputs.
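For the batch case mentioned above, a single cKDTree.query call maps every point to its nearest centroid at once. A minimal sketch (centroid and point values are made up for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical centroids -- made up for illustration.
centroids = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [-1.0, 1.0],
])
tree = cKDTree(centroids)

# Batch query: one call returns the nearest-centroid index for every point.
points = np.array([
    [0.1, -0.1],
    [0.9, 1.1],
])
_, indices = tree.query(points)
print(indices)  # [0 1]
```

This is the pattern the batch benchmark above exercises, and it avoids the per-call overhead entirely.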
