In [2]:
from timeit import timeit

import numpy as np
import pandas as pd
from scipy.spatial import KDTree as scTree
from sklearn.neighbors import KDTree as skTree

First we read the data from the CSV file. Latitude and longitude will be used to build the K-dimensional tree.

In [5]:
df = pd.read_csv('worldcities.csv')
print(df.shape)
print(df.columns)

(42905, 11)
Index(['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3',
       'admin_name', 'capital', 'population', 'id'],
      dtype='object')


We'll use values from "lat" and "lng" columns to build the tree.

In [8]:
coord = df.loc[:, ['lat', 'lng']].values

Next we build a tree with each implementation, Scipy and Scikit-learn.

In [18]:
scp_tree = scTree(coord, leafsize=1)
skl_tree = skTree(coord, leaf_size=1)

To look up a point we need to provide a 2-dimensional array of shape [n_samples, n_attributes].

In [36]:
lookup = np.array([38.65645570177941, -78.03798797140651]).reshape(1, -1)
lookup.shape

(1, 2)

The query method has a similar interface in both libraries: it returns a tuple containing the distance from each query point to its nearest neighbour and the integer index of that neighbour in the original data.

In [37]:
dist, idx = scp_tree.query(lookup)
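
The two libraries do differ in the shape of what they return, which matters when the index is used later. The sketch below illustrates this on a tiny synthetic array (the `points` and `query` values are made up for illustration and are not taken from worldcities.csv).

```python
import numpy as np
from scipy.spatial import KDTree as scTree
from sklearn.neighbors import KDTree as skTree

# Synthetic stand-in for the (lat, lng) array; not the worldcities data.
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [10.0, 10.0]])
query = np.array([[0.9, 1.2]])  # shape (1, 2): one sample, two attributes

# Both calls ask for the single nearest neighbour (k=1).
sc_dist, sc_idx = scTree(points).query(query, k=1)
sk_dist, sk_idx = skTree(points).query(query, k=1)

# Scipy squeezes the k dimension when k == 1; Sklearn keeps it.
print(sc_dist.shape, sc_idx.shape)  # (1,) (1,)
print(sk_dist.shape, sk_idx.shape)  # (1, 1) (1, 1)

# Both point at the same row of the input array.
print(int(sc_idx[0]), int(sk_idx[0, 0]))
```

So a single row index is `idx[0]` for Scipy and `idx[0, 0]` for Sklearn.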

Now we can use the returned index to query the original dataframe and find out what the nearest town was.

In [20]:
df.loc[idx[0], :]

city               Culpeper
city_ascii         Culpeper
lat                 38.4705
lng                -78.0001
country       United States
iso2                     US
iso3                    USA
admin_name         Virginia
capital                 NaN
population          20485.0
id               1840006169
Name: 20475, dtype: object
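
The same pattern extends to several neighbours at once. The following is a minimal sketch, assuming a small hypothetical table (`cities`, with invented names and coordinates) in place of the real dataframe. It asks for the two nearest rows and attaches the distances to them.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

# Hypothetical miniature table; names and coordinates are invented.
cities = pd.DataFrame({
    "city": ["Alpha", "Beta", "Gamma", "Delta"],
    "lat": [38.5, 39.0, 40.0, 35.0],
    "lng": [-78.0, -77.5, -75.0, -80.0],
})

tree = KDTree(cities[["lat", "lng"]].to_numpy())

# With k=2 both returned arrays have shape (n_samples, 2).
dist, idx = tree.query(np.array([[38.6, -78.0]]), k=2)

# The tree returns positions in the input array, so use positional indexing.
# Distances are plain Euclidean distances in the units of the input
# (degrees here), not kilometres.
nearest = cities.iloc[idx[0]].assign(distance=dist[0])
print(nearest)
```

Using `iloc` rather than `loc` keeps the lookup correct even if the dataframe index is not the default 0..n-1 range.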

Let's test time performance:

In [46]:
b1 = timeit("scTree(coord, leafsize=1)", globals=globals(), number=100) / 100
b2 = timeit("skTree(coord, leaf_size=1)", globals=globals(), number=100) / 100
print(f'Building the tree: Scipy: {b1:.4f} seconds, Sklearn: {b2:.4f} seconds')
print(f'Sklearn is {b2/b1:.2f} times slower')

Building the tree: Scipy: 0.0114 seconds, Sklearn: 0.0181 seconds
Sklearn is 1.59 times slower


In [60]:
q1 = timeit("scp_tree.query(lookup)", globals=globals(), number=10000) / 10000
q2 = timeit("skl_tree.query(lookup)", globals=globals(), number=10000) / 10000
print(f'Query: Scipy: {q1:.5f}, Sklearn: {q2:.5f}')
print(f'Sklearn is {q2/q1:.2f} times slower')

Query: Scipy: 0.00002, Sklearn: 0.00004
Sklearn is 1.98 times slower


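For both building the tree and querying it, Scipy is faster in this test: Sklearn took about 1.6 times as long to build and about 2 times as long to answer a single-point query. The difference becomes even greater as the size of the data increases.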
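
One way to check how the gap changes with data size is to repeat the measurement on arrays of different lengths. This is only a sketch: the `benchmark` helper, the sizes and the repeat counts are assumptions chosen for illustration, and it uses random coordinates rather than the city data, so the numbers will differ from those above and from machine to machine.

```python
from timeit import timeit

import numpy as np
from scipy.spatial import KDTree as scTree
from sklearn.neighbors import KDTree as skTree


def benchmark(n_points, repeats=5, seed=0):
    """Return mean build and query times in seconds for both libraries."""
    rng = np.random.default_rng(seed)
    # Random (lat, lng) pairs; a synthetic stand-in for real coordinates.
    pts = np.column_stack([
        rng.uniform(-90, 90, n_points),
        rng.uniform(-180, 180, n_points),
    ])
    query = np.array([[38.66, -78.04]])

    sc_tree = scTree(pts, leafsize=1)
    sk_tree = skTree(pts, leaf_size=1)

    # Queries are much cheaper than builds, so time more of them.
    n_query = repeats * 100
    return {
        "build_scipy": timeit(lambda: scTree(pts, leafsize=1), number=repeats) / repeats,
        "build_sklearn": timeit(lambda: skTree(pts, leaf_size=1), number=repeats) / repeats,
        "query_scipy": timeit(lambda: sc_tree.query(query), number=n_query) / n_query,
        "query_sklearn": timeit(lambda: sk_tree.query(query), number=n_query) / n_query,
    }


for n in (1_000, 10_000, 50_000):
    r = benchmark(n)
    print(
        f"n={n}: build ratio {r['build_sklearn'] / r['build_scipy']:.2f}, "
        f"query ratio {r['query_sklearn'] / r['query_scipy']:.2f}"
    )
```

A ratio above 1 means Sklearn was slower for that size. Comparing the ratios across the rows shows whether the difference grows on a given machine.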