k-Distance Approximation for Memory-Efficient RkNN Retrieval

Repository containing the code for the paper

k-Distance Approximation for Memory-Efficient RkNN Retrieval
Max Berrendorf, Felix Borutta, and Peer Kröger
SISAP'19

If you find this code useful, please consider citing us.

@InProceedings{kDistanceApproximation,
    author="Berrendorf, Max
    and Borutta, Felix
    and Kr{\"o}ger, Peer",
    title="k-Distance Approximation for Memory-Efficient RkNN Retrieval",
    booktitle="Similarity Search and Applications",
    year="2019",
    publisher="Springer International Publishing",
}

Data Preprocessing

Download road networks OL, TG, and SF from here. Then, run preprocessing script to

generate synthetic datasets.
for each dataset, compute and store
- BallTree index for fast range queries in evaluation code, and
- k-distances for building the index.
- MRkNNCoP tree coefficients. You can use the command line argument --index_root to specify a directory where the data is stored (consuming approx. 1.1 GiB), and --model_root to specify a directory to store the MRkNNCoP tree coefficients (consuming approx. 5.0 MiB).

python3 preprocess.py --index_root=./index --model_root=./models

The indices are stored as pickle file named <dataset_name>.pkl, and the k-distances as sparse CSR matrices in a HDF5 file in a custom format (cf. persistence.py:save_csr_to_hdf).

Model Selection

After preprocessing you can perform model selection to analyse the trade-off between model size and candidate set size. To this end, use the following

python3 model_selection.py --index_root=./index --model_root=./models --output_root=./results

which reads data and MRkNNCoP tree coefficients and trains numerous predefined models to predict the MRkNNCoP tree coefficients and thereby the k-distance. As a result, for each of the datasets the following files are produced:

a CSV containing the error in coefficients prediction for all models
a CSV containing the mean candidate set size for every model in the model_size-mae skyline.
a pickle file containing the candidate set sizes for each model for each data point and each value of k between 1 and K_MAX

The output files consume approx. 4.0 GiB.

Evaluation

After training all models, the notebooks/Evaluation.ipynb can be used to evaluate the results.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
notebooks		notebooks
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
eval.py		eval.py
linear_approx.py		linear_approx.py
model_selection.py		model_selection.py
persistence.py		persistence.py
preprocess.py		preprocess.py
requirements.txt		requirements.txt
settings.py		settings.py
test_linear_approx.py		test_linear_approx.py
visualisation.py		visualisation.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

k-Distance Approximation for Memory-Efficient RkNN Retrieval

Data Preprocessing

Model Selection

Evaluation

About

Releases

Packages

Languages

License

mberr/k-distance-prediction

Folders and files

Latest commit

History

Repository files navigation

k-Distance Approximation for Memory-Efficient RkNN Retrieval

Data Preprocessing

Model Selection

Evaluation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages