We present a distributed and parallel extension and implementation of the Cover Tree data structure for nearest-neighbour search. The data structure was originally presented, and later improved, in:
- Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
- Mike Izbicki and Christian Shelton. "Faster cover trees." Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.
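To give a feel for the idea behind the data structure, below is a highly simplified, hypothetical C++ sketch of a cover-tree-style index. It is a toy illustration, not this repository's implementation: the names, the arbitrary starting level, and the simplified single-pass insert are all our own assumptions. The key property it demonstrates is the covering invariant: children of a node at level l lie within 2^l of it, so every point in a node's subtree lies within roughly 2^(l+1), and an exact nearest-neighbour search can prune whole subtrees.

```cpp
#include <cmath>
#include <limits>
#include <memory>
#include <vector>

// Toy sketch of a cover-tree-style index (NOT this repository's code).
// Each node holds one point and a scale `level`; every point stored in
// the subtree of a node n lies within 2^(n.level + 1) of n.p, which is
// what lets nearest-neighbour search prune whole subtrees.
struct Node {
    std::vector<double> p;
    int level;
    std::vector<std::unique_ptr<Node>> children;
};

static double l2(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

struct CoverTree {
    std::unique_ptr<Node> root;

    void insert(const std::vector<double>& p) {
        if (!root) { root.reset(new Node{p, 10}); return; }  // starting level is arbitrary
        // Grow the root's covering radius until it covers p.
        while (l2(p, root->p) > std::pow(2.0, root->level)) root->level += 1;
        insertRec(*root, p);
    }

    // Simplified insert: descend into any child whose covering ball
    // contains p; otherwise attach p one level below the current node.
    void insertRec(Node& n, const std::vector<double>& p) {
        for (auto& c : n.children)
            if (l2(p, c->p) <= std::pow(2.0, c->level)) { insertRec(*c, p); return; }
        n.children.emplace_back(new Node{p, n.level - 1});
    }

    // Exact nearest neighbour: skip a child c when even the closest
    // conceivable point in c's subtree (at distance d(q,c) - 2^(c.level+1))
    // cannot beat the best distance found so far.
    void search(const Node& n, const std::vector<double>& q,
                double& best, std::vector<double>& bestP) const {
        double d = l2(q, n.p);
        if (d < best) { best = d; bestP = n.p; }
        for (const auto& c : n.children)
            if (l2(q, c->p) - std::pow(2.0, c->level + 1) < best)
                search(*c, q, best, bestP);
    }

    std::vector<double> nearest(const std::vector<double>& q) const {
        std::vector<double> bestP;
        if (!root) return bestP;  // empty tree
        double best = std::numeric_limits<double>::infinity();
        search(*root, q, best, bestP);
        return bestP;
    }
};
```

The real implementation referenced above additionally handles batch construction and concurrent inserts; this sketch only shows why the covering radii make pruning sound.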
Under active development

- All code is under `src`, within the respective folders
- Dependencies are provided under
- For running the cover tree, an example script is provided under
- `data` is a placeholder folder in which to put the data
- The `dist` folder will be created to hold the executables
- gcc >= 4.8.4 or Intel® C++ Compiler 2016 is required for the C++11 features used
How to use
We will show how to run our Cover Tree on a single machine using a synthetic dataset.

First of all, compile by running `make`.

Generate the synthetic dataset.

Run Cover Tree:

`dist/cover_tree data/train_100d_1000k_1000.dat data/test_100d_1000k_10.dat`
The makefile has some useful features:

- if you have the Intel® C++ Compiler, then you can instead run
- or if you want to use the Intel® C++ Compiler's cross-file optimization (ipo), then run
- you can also selectively compile individual modules by specifying
- or clean them individually by
Based on our evaluation, the implementation is scalable and efficient. For example, on an Amazon EC2 c4.8xlarge instance we could insert more than 1 million vectors of 1000 dimensions in Euclidean space (L2 norm) in under 250 seconds. At query time we can process more than 300 queries per second per core.
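For context on the query-throughput figure, a brute-force L2 scan is the natural baseline an index like this is measured against. The sketch below is a hypothetical micro-benchmark, not part of this repository (the function names and data sizes are our own assumptions): it times exact brute-force nearest-neighbour queries over random Gaussian vectors and returns the measured queries per second.

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Index of the L2-nearest training vector to `query` (brute-force scan).
// Squared distances suffice for the argmin, so sqrt is skipped.
std::size_t bruteNearest(const std::vector<double>& train, std::size_t n,
                         const std::vector<double>& query, std::size_t d) {
    std::size_t arg = 0;
    double best = 1e300;
    for (std::size_t j = 0; j < n; ++j) {
        double s = 0;
        for (std::size_t k = 0; k < d; ++k) {
            double diff = query[k] - train[j * d + k];
            s += diff * diff;
        }
        if (s < best) { best = s; arg = j; }
    }
    return arg;
}

// Run q random queries against n random d-dimensional Gaussian vectors
// and return the measured throughput in queries per second.
double bruteForceQPS(std::size_t n, std::size_t d, std::size_t q) {
    std::mt19937 gen(42);
    std::normal_distribution<double> nd(0.0, 1.0);
    std::vector<double> train(n * d), queries(q * d);
    for (auto& x : train) x = nd(gen);
    for (auto& x : queries) x = nd(gen);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < q; ++i) {
        std::vector<double> query(queries.begin() + i * d,
                                  queries.begin() + (i + 1) * d);
        bruteNearest(train, n, query, d);
    }
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return secs > 0 ? q / secs : 0.0;  // guard against a zero-length interval
}
```

On small inputs even brute force is fast; the gap grows with the number of points, which is where the tree's pruning pays off at the 1M-vector, 1000-dimension scale quoted above.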