In [1]:
import numpy as np
import pandas as pd

# Loading Data

Let's start by loading a large geographic dataset. In this notebook we will be using the Dublin Bus dataset, as downloaded and prepared in another [repository](https://github.com/joaofig/dublin-bus "Dublin Buses"). You only need to run the very first notebook there to download and prepare the data for use here. Please be patient, as this may take some time. Once you have the data file, copy it to the data folder.

Note: Make sure you have the pyarrow package installed for the code below to work.

In [2]:
columns_to_read = ['Lon', 'Lat']
df = pd.read_parquet("data/sir010113-310113.parquet", columns=columns_to_read)

# Brute-Force
We start by using a brute-force approach to find all the points within a 100-meter radius of arbitrarily selected locations. Two such locations were selected from the Dublin map, courtesy of Google Maps: the University College and the Guinness Storehouse.

The brute-force approach implies calculating the distance between each of these two points and all 44 million points in the Dublin Bus dataset. Once the distances are calculated, we can simply select the ones that are within the 100-meter radius.
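
As a minimal sketch of what a brute-force radius query involves, the NumPy code below computes the haversine distance from one location to every point and keeps the indices that fall inside the radius. The names `haversine_m` and `brute_force_radius` and the toy points are assumptions made for illustration; this is not the implementation behind `vec_haversine` or `GeoBrute`.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (assumed spherical Earth)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; inputs in degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def brute_force_radius(points, center, r):
    """Indices of the rows of `points` (lat, lon in degrees) within `r` meters of `center`."""
    d = haversine_m(points[:, 0], points[:, 1], center[0], center[1])
    return np.flatnonzero(d <= r)


# Toy data (hypothetical): the query location, a point about 15 m north of it,
# and a third point well over a kilometer away.
pts = np.array([
    [53.3428673, -6.2717738],
    [53.3430000, -6.2717738],
    [53.3277162, -6.2672435],
])
idx = brute_force_radius(pts, pts[0], 100.0)
print(idx)
```

The cost is one distance evaluation per stored point for every query, which is why the approaches that follow try to discard most points before computing exact distances.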

In [3]:
from geo.geomath import vec_haversine
from geo.geospoke import GeoBrute

Define the two locations for which we want to query all the sampled points within a 100-meter radius.

In [4]:
uni_col = np.array([[53.3277162, -6.2672435]])

In [5]:
guiness = np.array([[53.3428673, -6.2717738]])

Now we extract all the latitudes and longitudes into a single NumPy array.

In [None]:
positions = df[['Lat', 'Lon']].to_numpy()

Here, we calculate the distance from each of the selected points to every point in the dataset. It is far from slow, but there is certainly a performance penalty.

In [None]:
brute = GeoBrute(positions)

In [None]:
%%timeit
ind = brute.query_radius(uni_col, r=100.0)

In [None]:
ind.shape

In [None]:
%%timeit
ind = brute.query_radius(guiness, r=100.0)

In [None]:
ind.shape

# Triangle Inequality
The triangle inequality query is implemented by the GeoSpoke class. It uses an interface that is quite similar to that of the BallTree class (see below).
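
One way a triangle-inequality filter can work is sketched below, under stated assumptions: distances from every point to a fixed reference location (a pivot) are precomputed and sorted, and a query then only needs exact distances for points whose pivot distance lies within `r` of the query's own pivot distance. The class `PivotIndex`, the helper `haversine_m`, the pivot location, and the random test points are all hypothetical; this is not a description of how GeoSpoke is actually implemented.

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (assumed spherical Earth)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; inputs in degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


class PivotIndex:
    """Hypothetical single-pivot index illustrating triangle-inequality pruning."""

    def __init__(self, points, pivot):
        self.points = np.asarray(points, dtype=float)
        self.pivot = np.asarray(pivot, dtype=float)
        # Precompute and sort the distance from every point to the pivot.
        d = haversine_m(self.points[:, 0], self.points[:, 1],
                        self.pivot[0], self.pivot[1])
        self.order = np.argsort(d)
        self.sorted_d = d[self.order]

    def query_radius(self, center, r):
        # If d(p, q) <= r then |d(p, pivot) - d(q, pivot)| <= r, so only points
        # whose pivot distance lies in [d_q - r, d_q + r] can match.
        d_q = haversine_m(center[0], center[1], self.pivot[0], self.pivot[1])
        lo = np.searchsorted(self.sorted_d, d_q - r, side="left")
        hi = np.searchsorted(self.sorted_d, d_q + r, side="right")
        cand = self.order[lo:hi]
        # Exact distances are computed for the surviving candidates only.
        d = haversine_m(self.points[cand, 0], self.points[cand, 1],
                        center[0], center[1])
        return np.sort(cand[d <= r])


# Random points in a box around Dublin (synthetic data, for illustration only).
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(53.30, 53.40, 2000),
                       rng.uniform(-6.35, -6.20, 2000)])
query = np.array([53.3428673, -6.2717738])

index = PivotIndex(pts, pivot=np.array([53.30, -6.35]))
result = index.query_radius(query, r=500.0)

# Cross-check against a plain brute-force scan over all points.
expected = np.flatnonzero(
    haversine_m(pts[:, 0], pts[:, 1], query[0], query[1]) <= 500.0)
print(np.array_equal(result, expected))
```

The candidate set is a ring around the pivot rather than a disc around the query, so the final exact-distance check is still required; the gain comes from skipping every point outside that ring.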

In [6]:
from geo.geospoke import GeoSpoke

In [7]:
positions = df[['Lat', 'Lon']].to_numpy()

In [10]:
#%%timeit -r1 -n1
geo_query = GeoSpoke(positions)

In [35]:
%%timeit
ind = geo_query.query_radius(guiness, r=100.0)

318 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [34]:
ind.shape

(97867,)

In [18]:
#%%timeit
ind = geo_query.query_radius(uni_col, r=100.0)

In [19]:
ind

array([], dtype=int64)

# Building a BallTree

In this section, we will create a BallTree to perform fast searches on our Dublin geographic data. The queries we will look into are *k-nearest neighbors* and *neighbors within a given radius*. The tree class lives in the scikit-learn *neighbors* namespace, and the haversine metric is selected by name when the tree is built, so only the tree needs to be imported.

In [20]:
from sklearn.neighbors import BallTree

Before we build the tree, we must select the latitude and longitude columns of the data frame. The distance measure for geographic coordinates is the *haversine distance*, and scikit-learn's haversine metric requires the locations as an array of latitude and longitude, in that order, in radians.
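
Because the haversine metric works on a unit sphere, both the query radius and the returned distances are expressed in radians. The small sketch below, built on three hypothetical toy points rather than the bus dataset, shows the conversions in both directions: meters are divided by the Earth radius before a query, and returned distances are multiplied by it afterwards. The constant `EARTH_RADIUS_M` is an assumed mean radius for a spherical Earth.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters

# Toy points as (lat, lon) in degrees; the second is about 15 m north of the first.
pts_deg = np.array([
    [53.3428673, -6.2717738],
    [53.3430000, -6.2717738],
    [53.3277162, -6.2672435],
])

# The tree is built on radians, with latitude first.
tree = BallTree(np.radians(pts_deg), metric="haversine")
query = np.radians(pts_deg[:1])

# Radius query: convert 100 meters to radians before calling query_radius.
ind = tree.query_radius(query, r=100.0 / EARTH_RADIUS_M)

# k-nearest-neighbor query: returned distances are in radians, so scale to meters.
dist, nn = tree.query(query, k=2)
dist_m = dist * EARTH_RADIUS_M

print(sorted(ind[0].tolist()))
print(dist_m[0])
```

Note that `query_radius` returns one index array per query row, which is why the result is read through `ind[0]`.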

In [21]:
positions = np.radians(df[['Lat', 'Lon']].to_numpy())

Now we can create the BallTree using the *positions* array. Please be patient as the next line may take some time to run.

In [22]:
#%%timeit -r1 -n1
tree = BallTree(positions, metric="haversine")

In [26]:
tree

<sklearn.neighbors.ball_tree.BallTree at 0x7ffd56ab2c18>

In [23]:
import math

In [27]:
earth_radius = 6371000.0

In [None]:
# The tree is built on radians, so a k-nearest-neighbors query must use radians too.
# dist, ind = tree.query(np.radians(guiness), k=100)

In [29]:
guiness_rad = np.radians(guiness)
radius = 100.0 / earth_radius

In [37]:
#%%timeit
ind = tree.query_radius(guiness_rad, r=radius) 

In [38]:
ind[0].shape

(97867,)