Skip to content

kampersanda/dyft

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Dynamic Filter Trie (DyFT)

This is an experimental library for data structures described in the following papers:

Build instructions

You can download and compile this library by the following commands:

$ git clone https://github.com/kampersanda/dyft.git
$ cd dyft
$ mkdir build
$ cd build
$ cmake ..
$ make -j

After the commands, the executables will be produced in build/bin directory.

The code is written in C++17, so please install g++ >= 7.0 or clang >= 4.0. The following dependencies have to be installed to compile the library: CMake >= 3.0, for the build system, and Boost >= 1.42.

The library employs the third-party libraries cmd_line_parser and tinyformat, whose header files are contained in this repository.

Data structures

The library contains some implementations of data structures in addition to our DyFT, which were used in the experiments (Section VI of our paper), as follows.

DyFT family

  • mart_index is an implementation of DyFT with MART representation.
  • art_index is an implementation of DyFT with ART representation.
  • array_index is an implementation of DyFT with Array representation.
  • mi_frame<T> is an implementation of DyFT+ (i.e., multi-index variant) for the index which is specified by the template argument T of a DyFT class.

Others

  • hms1v_index is an implementation of HmSearch 1-var (HSV), which is designed for binary sketches.
  • hms1dv_index is an implementation of HmSearch 1-del-var (HSD), which is designed for integer sketches.
  • gv_index is an implementation of multi-index hashing for integer sketches, which follows an idea by Gog and Venturini.

Input data format

Binary sketches

Binary sketches should be stored in binary format, where each sketch is 64 bits of size. The executables provided by the library can indicate the number of dimensions (32 or 64) to evaluate through argument -N, and use the first N dimensions.

You can use src/simhash.cpp to generate such a dataset using Charikar’s simhash algorithm.

Integer sketches

Integer sketches should be stored in TEXMEX's bvecs format. That is, the dimension number and features for each sketch are interleaved, where the number is 4 bytes of size and each feature is 1 byte of size. In the same manner, the executables provided by the library can indicate the dimension number (32 or 64) to evaluate through argument -N. And, argument -B can indicate the number of lowest bits to evaluate for each feature (1 to 8).

You can use consistent_weighted_sampling to generate such a dataset using the GCWS algorithm.

Usage

In the data directory, there are the tiny datasets:

  • SIFT.10K.bin and SIFT.100.bin are binary sketches generated from ANN_SIFT10K using Charikar’s simhash algorithm.
  • mnist.scale.cws.bvecs and mnist.scale.1K.cws.bvecs are 1-byte integer sketches generated from minst using the GCWS algorithm.

By using these datasets, you can test the data structure as follows.

Benchmark for range search

The executables range_search_* are used to analyze the performance of range search time for each data structure. Given a dataset and query files, the executable inserts data sketches one by one and performs range search for the queries when the number of inserted sketches is power of ten. The benchmark results are output as a json file.

Example: DyFT with MART on binary sketches

$ ./bin/range_search_bin_dyft ../data/SIFT.10K.bin ../data/SIFT.100.bin -o results -R 2 -N 32 -A mart -K 1

The command tests range search with radius R=2 for mart_index of single index (K=1) through dataset SIFT.10K.bin and queryset SIFT.100.bin, where the number of dimensions is N=32 (i.e., the first 32 dimensions are used). The benchmark result is output in results directory, where the file name is assigned automatically.

Example: DyFT+ with ART on integer sketches

$ ./bin/range_search_int_dyft ../data/mnist.scale.cws.bvecs ../data/mnist.scale.1K.cws.bvecs -o results -R 4 -N 64 -B 4 -A art -K 3

The command tests range search with radius R=4 for mi_frame<art_index> of K=3 blocks through dataset mnist.scale.cws.bvecs and queryset mnist.scale.1K.cws.bvecs, where the number of dimensions is N=64 and the lowest B=4 bits are used for each feature.

Example: GV on integer sketches

$ ./bin/range_search_int_gv ../data/mnist.scale.cws.bvecs ../data/mnist.scale.1K.cws.bvecs -o results -R 6 -N 64 -B 8

The command tests range search with radius R=6 for gv_index through dataset mnist.scale.cws.bvecs and queryset mnist.scale.1K.cws.bvecs, where the number of dimensions is N=64 and all the B=8 bits are used for each feature.

Benchmark for construction

The executables build_index_* are used to analyze the performance of insertion time and process size for each data structure. Given a dataset, the executable inserts data sketches one by one and reports the statistics when the number of inserted sketches is power of ten. The benchmark results are output as a json file.

Example: DyFT with Array on binary sketches

$ ./bin/build_index_bin_dyft ../data/SIFT.10K.bin -o results -R 2 -N 32 -A array -K 1

The command builds array_index of single index (K=1) on radius R=2 for dataset SIFT.10K.bin, where the number of dimension is N=32.

Example: DyFT+ with MART on integer sketches

$ ./bin/build_index_int_dyft ../data/mnist.scale.cws.bvecs -o results -R 4 -N 64 -B 4 -A mart -K 0

The command builds mi_frame<mart_index> with a reasonable number of blocks on radius R=4 for dataset mnist.scale.cws.bvecs, where the number of dimension is N=64 and the lowest B=4 bits are used for each feature.

When K is set to 0, the number of blocks is set to ⌊R/2⌋+1 following the idea of GV.

Example: HSV on binary sketches

$ ./bin/build_index_bin_hms1v ../data/SIFT.10K.bin -o results -R 6 -N 64

The command builds hms1v_index on radius R=6 for dataset SIFT.10K.bin, where the number of dimension is N=64.

Licensing

This library is free software provided under MIT License.

If you use the library, please cite the following paper:

@inproceedings{kanda2020dynamic,
  author = {Kanda, Shunsuke and Tabei, Yasuo},
  title = {Dynamic Similarity Search on Integer Sketches},
  booktitle = {Proceedings of the 20th IEEE International Conference on Data Mining (ICDM)},
  pages={242-251},
  year = {2020}
}