Naive implementations of some ANN (Approximate Nearest Neighbor) search algorithms, without any optimization or generalization.
The algorithms used in ANN search can broadly be classified as tree based, LSH based, graph based, and IVF based. Graph based and IVF based algorithms are the most popular today, while tree based and LSH based algorithms are no longer commonly used. In addition, quantization can be used to compress vectors and reduce memory usage.
[CVPR20 Tutorial] Billion-scale Approximate Nearest Neighbor Search is a good tutorial for ANN beginners.
ann-benchmarks contains some tools to benchmark various implementations of approximate nearest neighbor (ANN) search for different metrics.
Tree based methods partition the vector space, but they perform poorly in high-dimensional spaces due to the curse of dimensionality (a minimal space-partitioning sketch follows the list below).
- FLANN
  - automatically selects between the "Randomized KD Tree" and "k-means Tree" algorithms
  - takes any given dataset and a desired degree of precision, and uses these to automatically determine the best algorithm and parameter values
  - Paper: Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration
- Annoy
  - is based on a random projection forest and searches multiple trees with a shared priority queue
  - Blog: Nearest neighbors and vector models – part 2 – algorithms and data structures
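
To make the space-partitioning idea concrete, here is a minimal random-projection tree sketch in Python. It is only in the spirit of Annoy's trees, not its actual implementation; the class name, parameters, and toy dataset are all invented for illustration.

```python
import numpy as np

class RPTree:
    """Minimal random-projection tree: split points by a random hyperplane
    at each node until leaves are small, then answer queries from one leaf."""

    def __init__(self, points, leaf_size=16, rng=None):
        self.points = points
        self.rng = rng or np.random.default_rng(0)
        self.root = self._build(np.arange(len(points)), leaf_size)

    def _build(self, idx, leaf_size):
        if len(idx) <= leaf_size:
            return {"leaf": idx}
        # split by a random hyperplane through the median projection
        direction = self.rng.normal(size=self.points.shape[1])
        proj = self.points[idx] @ direction
        thresh = np.median(proj)
        left, right = idx[proj <= thresh], idx[proj > thresh]
        if len(left) == 0 or len(right) == 0:   # degenerate split, stop here
            return {"leaf": idx}
        return {
            "dir": direction, "thresh": thresh,
            "left": self._build(left, leaf_size),
            "right": self._build(right, leaf_size),
        }

    def query(self, q, k=5):
        node = self.root
        while "leaf" not in node:                # descend into a single leaf
            side = "left" if q @ node["dir"] <= node["thresh"] else "right"
            node = node[side]
        cand = node["leaf"]
        dists = np.linalg.norm(self.points[cand] - q, axis=1)
        return cand[np.argsort(dists)[:k]]

data = np.random.default_rng(42).normal(size=(1000, 32))
tree = RPTree(data)
print(tree.query(data[0], k=3))   # approximate neighbors of the first point
```

Annoy builds a whole forest of such trees and queries them together with a shared priority queue, which recovers much of the recall that a single tree loses.
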
Whereas general-purpose hashing tries to avoid collisions, the idea of LSH is to exploit collisions so that nearby points are mapped into the same bucket. It is popular in the theory community, but performs poorly in practice on real-world data.
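
A minimal sketch of the bucketing idea, assuming random-hyperplane (SimHash-style) LSH for cosine similarity; the class name and parameters are illustrative, not taken from any particular library.

```python
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    """Random-hyperplane LSH for cosine similarity: nearby vectors tend to
    fall on the same side of random hyperplanes, so they share bucket keys."""

    def __init__(self, dim, n_bits=12, n_tables=8, seed=0):
        rng = np.random.default_rng(seed)
        # one set of random hyperplanes per hash table
        self.planes = rng.normal(size=(n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.data = []

    def _keys(self, v):
        # sign pattern of the projections -> one bit pattern per table
        return [tuple((p @ v > 0).astype(int)) for p in self.planes]

    def add(self, v):
        i = len(self.data)
        self.data.append(v)
        for table, key in zip(self.tables, self._keys(v)):
            table[key].append(i)

    def query(self, q, k=5):
        # collect colliding candidates from every table, then rank them exactly
        cand = {i for table, key in zip(self.tables, self._keys(q))
                for i in table[key]}
        cand = sorted(cand, key=lambda i: -(q @ self.data[i]) /
                      (np.linalg.norm(q) * np.linalg.norm(self.data[i]) + 1e-12))
        return cand[:k]

rng = np.random.default_rng(1)
index = HyperplaneLSH(dim=32)
for v in rng.normal(size=(1000, 32)):
    index.add(v)
print(index.query(index.data[0]))   # candidates ranked by cosine similarity
```

Using several hash tables increases the chance that true neighbors collide with the query in at least one of them.
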
Graph based methods have become popular in recent years and are mostly built on the idea of a proximity graph. Given a query, the search starts from an entry point (randomly chosen or supplied by a separate algorithm) and greedily moves toward the point closest to the query.
Blog: Proximity Graph-based Approximate Nearest Neighbor Search
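
Below is a minimal sketch of that greedy routing step over a brute-force kNN graph; the helper names and parameters are made up, and real systems do not stop at the first local minimum the way this toy walk does.

```python
import numpy as np

def build_knn_graph(points, k=10):
    """Brute-force kNN graph, fine only for small toy datasets."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return [np.argsort(row)[1:k + 1].tolist() for row in dists]

def greedy_search(points, graph, query, entry=0):
    """Start from an entry point and repeatedly hop to whichever neighbor
    is closer to the query; stop at a local minimum of the distance."""
    current = entry
    current_dist = np.linalg.norm(points[current] - query)
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = np.linalg.norm(points[nb] - query)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:        # no neighbor improves: local minimum
            return current, current_dist
        current, current_dist = best, best_dist

points = np.random.default_rng(7).normal(size=(500, 16))
graph = build_knn_graph(points)
idx, dist = greedy_search(points, graph, points[3] + 0.01)
print(idx, round(dist, 4))
```

NSW, HNSW, and NSG all replace this single greedy walk with a beam search over a candidate priority queue so the search can escape local minima.
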
- Delaunay Graph
  - Delaunay Triangulations, which are proximity graphs, are defined as the dual graph of the Voronoi diagram, and many graph based algorithms (like kNN graphs and NSW graphs) are approximations of the DT
- NSW
  - has small-world navigation properties, which give polylogarithmic time complexity for insertion and search
  - Paper: Approximate nearest neighbor algorithm based on navigable small world graphs
- HNSW
  - incrementally builds a multi-layer structure consisting of a hierarchical set of proximity graphs (layers) for nested subsets of the stored elements
  - can be seen as an extension of the probabilistic skip list, with proximity graphs in place of linked lists (a toy layered sketch follows this list)
  - Paper: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
- DiskANN
  - optimizes for disk I/O and makes the earlier NSG method SSD-friendly
  - Paper: DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
  - Blog: DiskANN: A Disk-based ANNS Solution with High Recall and High QPS on Billion-scale Dataset
- NSG
  - is based on the Navigating Spreading-out Graph and is deployed in production at Taobao
  - Paper: Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity
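
To make the skip-list analogy for HNSW concrete, here is a toy layered sketch: each point draws a random level, layer l keeps every point whose level is at least l, and the search descends from the sparsest layer to the densest, reusing each layer's result as the next layer's entry point. This only illustrates the layering idea; it is not HNSW's actual incremental construction, neighbor-selection heuristics, or beam search, and all names and parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_level(p=0.5, max_level=4):
    """Skip-list-style level assignment: reach level l with probability p**l."""
    level = 0
    while rng.random() < p and level < max_level:
        level += 1
    return level

def knn_graph(points, ids, k=8):
    """Brute-force kNN graph over one layer's subset (toy-scale only)."""
    sub = points[ids]
    sq = (sub ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * sub @ sub.T      # squared distances
    order = np.argsort(d2, axis=1)[:, 1:k + 1]
    return {int(ids[i]): [int(ids[j]) for j in order[i]] for i in range(len(ids))}

def greedy(points, graph, query, entry):
    """Greedy walk to a local minimum within a single layer."""
    cur, cur_d = entry, np.linalg.norm(points[entry] - query)
    improved = True
    while improved:
        improved = False
        for nb in graph[cur]:
            d = np.linalg.norm(points[nb] - query)
            if d < cur_d:
                cur, cur_d, improved = nb, d, True
    return cur

points = rng.normal(size=(2000, 16))
levels = np.array([random_level() for _ in range(len(points))])
max_l = int(levels.max())
# layer l contains every point whose level is >= l, so layers are nested subsets
layers = [np.flatnonzero(levels >= l) for l in range(max_l + 1)]
graphs = [knn_graph(points, ids) for ids in layers]

query = points[42] + 0.01
entry = int(layers[max_l][0])           # arbitrary entry point on the sparsest layer
for l in range(max_l, -1, -1):          # coarse-to-fine descent: one layer's result
    entry = greedy(points, graphs[l], query, entry)   # seeds the next layer
print(entry)                            # usually 42, but a greedy walk can get stuck
```
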
IVF clusters the dataset to partition the space into Voronoi cells; by searching only the cells closest to the query, it reduces the number of data points that need to be scanned.
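
A minimal IVF sketch, assuming a plain k-means coarse quantizer and an invented `n_probe` parameter for the number of cells scanned per query; this illustrates the idea, not how Faiss or any other library implements it.

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's k-means, enough for a toy coarse quantizer."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid, then recompute centroids
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = data[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, labels

class IVFIndex:
    """Inverted file index: one posting list of vector ids per Voronoi cell."""

    def __init__(self, data, n_cells=32):
        self.data = data
        self.centroids, labels = kmeans(data, n_cells)
        self.lists = [np.flatnonzero(labels == c) for c in range(n_cells)]

    def query(self, q, k=5, n_probe=4):
        # scan only the n_probe cells whose centroids are closest to the query
        cell_order = np.argsort(np.linalg.norm(self.centroids - q, axis=1))
        cand = np.concatenate([self.lists[c] for c in cell_order[:n_probe]])
        dists = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[np.argsort(dists)[:k]]

data = np.random.default_rng(3).normal(size=(5000, 32))
index = IVFIndex(data)
print(index.query(data[10], k=3))   # larger n_probe trades speed for recall
```
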
- PQ
  - The essence of PQ is to decompose the high-dimensional vector space into the Cartesian product of lower-dimensional subspaces and then quantize each subspace separately (a toy sketch follows this list)
  - Faiss
    - is based on PQ + IVF
  - Tutorial: https://rutgers-db.github.io/cs541-fall19/slides/notes5.pdf
  - Paper: Product Quantization for Nearest Neighbor Search
  - Survey: A Survey of Product Quantization
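
A toy product-quantization sketch to illustrate the subspace decomposition and the asymmetric distance computation via per-subspace lookup tables; the class, its parameters (`m` subspaces, `nbits` per code), and the tiny k-means trainer are all invented for the example and are far simpler than a real PQ implementation such as the one in Faiss.

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Tiny Lloyd's k-means used to train one codebook per subspace."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = data[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

class ProductQuantizer:
    """Split each vector into m subvectors and quantize each subspace with
    its own codebook of 2**nbits centroids; store only the centroid ids."""

    def __init__(self, m=8, nbits=4):
        self.m, self.k = m, 2 ** nbits

    def fit(self, data):
        self.dsub = data.shape[1] // self.m
        self.codebooks = [kmeans(data[:, i * self.dsub:(i + 1) * self.dsub], self.k)
                          for i in range(self.m)]
        return self

    def encode(self, data):
        codes = np.empty((len(data), self.m), dtype=np.uint8)
        for i, cb in enumerate(self.codebooks):
            sub = data[:, i * self.dsub:(i + 1) * self.dsub]
            codes[:, i] = np.linalg.norm(sub[:, None] - cb[None], axis=-1).argmin(1)
        return codes

    def asymmetric_distances(self, q, codes):
        # precompute query-to-centroid distances per subspace (lookup tables);
        # each stored code's distance is then just a sum of m table lookups
        tables = np.stack([np.linalg.norm(cb - q[i * self.dsub:(i + 1) * self.dsub],
                                          axis=1) ** 2
                           for i, cb in enumerate(self.codebooks)])
        return np.sqrt(sum(tables[i][codes[:, i]] for i in range(self.m)))

data = np.random.default_rng(5).normal(size=(2000, 32))
pq = ProductQuantizer(m=8, nbits=4).fit(data)
codes = pq.encode(data)                 # 8 small codes per vector instead of 32 floats
d = pq.asymmetric_distances(data[0], codes)
print(np.argsort(d)[:5])                # approximate neighbors of data[0]
```

Because each vector is stored as a handful of small codes, distances to the query are evaluated with table lookups instead of full-dimensional arithmetic, which is what makes PQ both memory- and compute-efficient.
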