Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search

Welcome to the Ball-Tree and BC-Tree GitHub!

In this repository, we target at solving the problem of Point-to-Hyperplane Nearest Neighbor Search (P2HNNS). Given a set of data points in a ($d-1$)-dimensional Euclidean space and a hyperplane query (represented as a vector in $d$-dimensional Euclidean space), the problem of P2HNNS aims to find the point whose distance to the hyperplane query is minimized among all the data.

This problem plays a vital role in many research domains. For example, in the applications of pool-based active learning with SVMs, the goal is to request labels for the data points closest (with minimum margin) to the SVM's decision hyperplane to reduce human efforts for annotation. Moreover, motivated by the success of SVM for classification, the maximum margin clustering aims at finding the hyperplane maximizing the minimum margin to the data, which can separate the data from different classes. Such applications require finding the data points that are closest to the hyperplane.

This repository provides the implementations and experiments of our work entitled Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search that has been published in IEEE ICDE 2023. You could also find our arXiv version here. We implement Ball-Tree and BC-Tree for performing P2HNNS in high-dimensional spaces. To make a systematic comparison, we include two state-of-the-art hyperplane hashing schemes NH and FH (published in SIGMOD 2021) as baselines for evaluations, where their implementation codes can be found here.

Data Sets

Data Sets Details

We study the performance of Ball-Tree and BC-Tree on 16 real-world data sets, i.e., Music, GloVe, Sift, UKBench, Tiny, Msong, NUSW, Cifar-10, Sun, LabelMe, Gist, Enron, Trevi, P53, Deep100M, and Sift100M. Their statistics are summarized as follows.

Data Sets	# Data Points	# Dim	# Queries	Data Size	Data Type
Music	1,000,000	100	100	386 MB	Rating
GloVe	1,183,514	100	100	460 MB	Text
Sift	985,462	128	100	485 MB	Image
UKBench	1,097,907	128	100	541 MB	Image
Tiny	1,000,000	384	100	1.5 GB	Image
Msong	992,272	420	100	1.6 GB	Audio
NUSW	268,643	500	100	514 MB	Image
Cifar-10	50,000	512	100	98 MB	Image
Sun	79,106	512	100	155 MB	Image
LabelMe	181,093	512	100	355 MB	Image
Gist	982,694	960	100	3.6 GB	Image
Enron	94,987	1,369	100	497 MB	Text
Trevi	100,900	4,096	100	1.6 GB	Image
P53	31,153	5,408	100	643 MB	Biology
Deep100M	100,000,000	96	100	36.1 GB	Image
Sift100M	99,986,452	128	100	48.0 GB	Image

Hyperplane Query Generation

Different from NH and FH that generate the hyperplane queries uniformly at random from a hypercube, we generate the hyperplane queries uniformly at random from a $d$-ball. We prefer this new way because it is more natural to simulate the hyperplane queries appeared in real-world applications. Please click here for more details about the $d$-ball and its generation algorithm. The data sets and hyperplane queries used in our experiments can be donwload here.

Except for the link to download the data sets and hyperplane queries, we have also enclosed the sourece codes to generate the hyperplane query and the source codes for data sets deduplication, so that users can conduct experiemnts on any data sets of interests. Please refer to the fold pre_process/ for more details.

Data Format

To reduce the I/O cost, all the datasets and queries are stored in a binary format. For a dataset (or query set) with n vectors in d dimensions, the binary file comparises with a float array of n*d length. Please specify the cardinality n and data dimension d in the bash scripts in advance.

Requirements

Ubuntu 18.04 (or higher version)
g++ 8.3.1 (or higher version).
Python 3.7 (or higher version)

Compilation

The source codes are implemented by C++, which requires g++-8 with c++17 (or higher version) for compilation. Thus, please check whether the g++-8 is installed before the compilation. If not, we provide a way to install g++-8 in Ubuntu 18.04 as follows.

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-8
sudo apt-get install gcc-8 (optional)

Users can use the following commands to compile the C++ source codes:

cd methods/
make -j

Experiments

We provide the bash scripts to run all experiments. Once you have downloaded the data sets and queries and completed the compilation, you can reproduce the experiments by simply running the following commands:

cd methods/
bash run.sh

Visualization

Finally, we privode python scripts for visualization. These scripts require python 3.7 (or higher versions) with numpy, scipy, and matplotlib installed. If not, you might need to use anaconda to create a new virtual environment and use pip to install those packages.

After you have conducted all experiments, you can plot all the figures in our paper with the following commands.

cd scripts/
python3 plot.py

or

cd scripts/
python plot.py

Reference

Thank you for being patient in reading the user manual. We will appreciate using the following BibTeX to cite this work when you use these source codes in your paper.

@inproceedings{huang2023lightweight,
  title={Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search},
  author={Huang, Qiang and Tung, Anthony K. H.},
  booktitle={2023 IEEE 39th International Conference on Data Engineering (ICDE)},
  pages={436--449},
  year={2023},
  organization={IEEE}
}

It is welcome to contact me (huangq@comp.nus.edu.sg) if you meet any issue. Thank you for your interest!

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
lccs_bucket		lccs_bucket
methods		methods
pre_process		pre_process
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lccs_bucket

lccs_bucket

methods

methods

pre_process

pre_process

scripts

scripts

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Repository files navigation

Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search

Data Sets

Data Sets Details

Hyperplane Query Generation

Data Format

Requirements

Compilation

Experiments

Visualization

Reference

About

Releases

Packages

Languages

License

HuangQiang/BC-Tree

Folders and files

Latest commit

History

Repository files navigation

Lightweight-Yet-Efficient: Revitalizing Ball-Tree for Point-to-Hyperplane Nearest Neighbor Search

Data Sets

Data Sets Details

Hyperplane Query Generation

Data Format

Requirements

Compilation

Experiments

Visualization

Reference

About

Topics

Resources

License

Stars

Watchers

Forks

Languages