lbnn - Location based nearest neighbors

This project shows how to use kd-trees with Apache Spark and Scala for the analysis of location based / geometric data.

The example data consists of records of "check-ins" of users with a mobile phone. For each check-in, the user id, the date and time, the location coordinates (latitude and longitude), and a location id are recorded.

[user]   [check-in time]       [latitude]     [longitude]     [location id]
196514   2010-07-24T13:45:06Z  53.3648119     -2.2723465833   145064
196514   2010-07-24T13:44:58Z  53.360511233   -2.276369017    1275991

The example data set is called "loc-Gowalla" and is available at the Stanford Large Network Dataset Collection.
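The records above can be parsed into a small case class. A minimal sketch (the `CheckIn` class and field names here are illustrative, not the project's actual types):

```scala
// Sketch: parsing a whitespace-separated Gowalla check-in line.
// CheckIn and its field names are illustrative, not taken from the project source.
case class CheckIn(userId: Long, time: String, lat: Double, lon: Double, locationId: Long)

object CheckInParser {
  def parse(line: String): CheckIn = {
    val f = line.trim.split("\\s+")
    CheckIn(f(0).toLong, f(1), f(2).toDouble, f(3).toDouble, f(4).toLong)
  }
}
```

Such a parser would typically be applied to each line of the input file, e.g. via `sc.textFile(...).map(CheckInParser.parse)` in Spark.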

With this app we can calculate, for every user at a specified point in time, the number of other users checked in within his/her "neighborhood" (a radius specified in kilometers).
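As a rough illustration of the underlying idea (not the project's actual implementation), a minimal 2-d tree with a radius count query might look like this. For brevity it uses Euclidean distance in coordinate space, whereas the real app measures distances in kilometers:

```scala
// Sketch: a minimal kd-tree over (lat, lon) pairs with a "count points
// within radius r of q" query. Illustrative only.
sealed trait KdTree
case object Empty extends KdTree
case class Node(point: (Double, Double), axis: Int, left: KdTree, right: KdTree) extends KdTree

object KdTree {
  // Build by splitting on the median, alternating the axis per level.
  def build(pts: Seq[(Double, Double)], depth: Int = 0): KdTree =
    if (pts.isEmpty) Empty
    else {
      val axis = depth % 2
      val sorted = pts.sortBy(p => if (axis == 0) p._1 else p._2)
      val m = sorted.length / 2
      Node(sorted(m), axis,
        build(sorted.take(m), depth + 1),
        build(sorted.drop(m + 1), depth + 1))
    }

  // Count points within Euclidean distance r of q; only descend into the
  // far subtree if the splitting plane is closer than r.
  def countWithin(t: KdTree, q: (Double, Double), r: Double): Int = t match {
    case Empty => 0
    case Node(p, axis, left, right) =>
      val d = math.hypot(p._1 - q._1, p._2 - q._2)
      val delta = if (axis == 0) q._1 - p._1 else q._2 - p._2
      val self = if (d <= r) 1 else 0
      val (near, far) = if (delta < 0) (left, right) else (right, left)
      self + countWithin(near, q, r) +
        (if (math.abs(delta) <= r) countWithin(far, q, r) else 0)
  }
}
```

For the neighborhood query of the app, one would run such a range count for every user's position against a tree built from all check-ins at the given point in time.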

Tools used:

  • Apache Hadoop, Apache Spark, Scala, SBT, ScalaTest, Kryo.

Tools needed:

  • Scala, SBT
  • If you edit the SBT build, you can run Spark in local/standalone mode, so a Spark cluster is not needed
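Since Kryo is listed among the tools, Spark is presumably configured to use Kryo serialization. A typical `SparkConf` setup for local mode might look like the following (illustrative only, not the project's actual configuration):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("lbnn")
  .setMaster("local[*]") // local mode, so no Spark cluster is needed
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```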

How to use

Check out the source

git clone

Change into the directory

cd location-based-nearest-neighbours

Create an assembly

$ sbt assembly

Check the configuration file and adapt it to your situation, e.g. if you want to use a local filesystem. You'll also have to change the Spark hostname in the scripts in scripts/ (or in the .cmd scripts if you are a Windows user).

Download the example data, prepare it, and find the neighbors within a 5 km radius for every user on October 6, 2009. Make sure that you use the correct Hadoop user name; otherwise set the environment variable HADOOP_USER_NAME.

$ scripts/ download
$ scripts/ sort-by-user
$ scripts/ neighbors 20091006 5
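Conceptually, the `neighbors 20091006 5` step has to turn the `yyyyMMdd` argument into a date filter on the check-in timestamps. A small sketch of that filtering logic (names are illustrative):

```scala
// Sketch: does a check-in time like "2010-07-24T13:45:06Z" fall on the
// day given as "yyyyMMdd"? Illustrative, not the project's actual code.
import java.time.LocalDate
import java.time.format.DateTimeFormatter

def onDay(checkInTime: String, yyyymmdd: String): Boolean = {
  val day = LocalDate.parse(yyyymmdd, DateTimeFormatter.BASIC_ISO_DATE)
  // Compare only the date part of the ISO timestamp.
  LocalDate.parse(checkInTime.take(10)) == day
}
```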

The output will be a file num_neighbors_20091006_5.0.csv that contains semicolon-separated data like the following:

CustomerId;number of neighbors
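For further analysis, the output lines can be read back into (user id, neighbor count) pairs; a minimal sketch, assuming the semicolon-separated format above:

```scala
// Sketch: parse one line of num_neighbors_*.csv, e.g. "196514;3",
// into a (userId, neighborCount) pair. Illustrative helper.
def parseResultLine(line: String): (Long, Int) = {
  val parts = line.split(";")
  (parts(0).trim.toLong, parts(1).trim.toInt)
}
```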

Current state

The current state is a first proof of concept. The app works, but it is not performance-optimized; in particular, Spark's partitioning is ignored at the moment.

Example analysis

Future work

  • Partitioning in Spark
  • Testing with larger data
    • Writing a random data generator or "multiplying" the current data
  • Use other geometric data structures
    • Range trees


Keywords: kd tree, kd-tree, k-d-tree, range queries, range query, latitude, longitude


Written by Jörn Dinkla. See my homepage for further information.





