__Problem setting:__ in retrieval tasks (search, recommendations, RAG) we look for the records in a huge dataset that are most relevant to our query in some sense (e.g. have the smallest $L_2$ distance to it). 

A full scan of the database costs $O(n)$ per query, which is OK for toy examples but not acceptable for large-scale products => we need a faster way. 

Solution = store the records in an index optimized for proximity-based retrieval, so that a query costs less than $O(n)$, usually at the price of approximate results.

# FAISS
---
FAISS = "Facebook AI Similarity Search" by Meta. This library provides several ANN (approximate nearest neighbour) algorithms, plus an exact brute-force baseline.

## IndexFlat (Brute-Force)
Index construction:<br>- store observations as-is in a plain table

Search<br>- compute $L_2$ distances from query to all records and select top-n
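The brute-force search above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not FAISS code: the helper names `l2` and `flat_search` and the random test data are made up for the example.

```python
import heapq
import math
import random

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(records, query, k):
    """Scan every record and return the k closest as (distance, row_id) pairs."""
    scored = ((l2(vec, query), row_id) for row_id, vec in enumerate(records))
    return heapq.nsmallest(k, scored)

random.seed(0)
records = [[random.random() for _ in range(8)] for _ in range(1000)]
query = [random.random() for _ in range(8)]

top5 = flat_search(records, query, k=5)
for dist, row_id in top5:
    print(row_id, round(dist, 4))
```

Every query touches all 1000 records, which is exactly the $O(n)$ cost the other indexes try to avoid.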

## IndexIVF (Inverted File Index)

Index construction:<br>- do K-means clustering and describe each cluster with its centroid<br>- store all members of each cluster in a separate bucket<br>$\,\,\,$(during search those K buckets will act as an inverted index: cluster_id $\rightarrow$ members list).


Search<br>- compare the query vector with all K centroids in $O(K)$ and select the top-N closest clusters<br>- do a linear search among all members of those N clusters

<img src="img/ivf.png" width=500>

Red partitions = the top-2 selected clusters.<br>We look for the top-5 neighbours inside those clusters. Out of the 5 true neighbours, 1 is missed and 1 wrong point is returned in its place.<br>

It is not perfect. If the clusters are not uniform (see the example above), the search can get trapped in nearby but small clusters and overlook many neighbouring points. <br>
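The IVF construction and search steps can be sketched in plain Python as follows. This is a toy illustration under simple assumptions (Lloyd's k-means with a fixed number of iterations, tiny random data); the function names `kmeans`, `build_ivf` and `ivf_search` are invented for the example, and `nprobe` merely borrows the name FAISS uses for "number of clusters to visit".

```python
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(centroids, vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: l2(centroids[c], vec))

def kmeans(data, n_clusters, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns the list of centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(data, n_clusters)]
    for _ in range(n_iter):
        groups = [[] for _ in range(n_clusters)]
        for vec in data:
            groups[nearest(centroids, vec)].append(vec)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(data, n_clusters):
    """Inverted index: cluster_id -> list of member row ids."""
    centroids = kmeans(data, n_clusters)
    buckets = {c: [] for c in range(n_clusters)}
    for row_id, vec in enumerate(data):
        buckets[nearest(centroids, vec)].append(row_id)
    return centroids, buckets

def ivf_search(data, centroids, buckets, query, k, nprobe):
    # Step 1: rank the K centroids and keep the nprobe closest clusters.
    order = sorted(range(len(centroids)), key=lambda c: l2(centroids[c], query))
    # Step 2: exact scan over the members of those clusters only.
    candidates = [row_id for c in order[:nprobe] for row_id in buckets[c]]
    candidates.sort(key=lambda row_id: l2(data[row_id], query))
    return candidates[:k]

random.seed(1)
data = [[random.random() for _ in range(4)] for _ in range(500)]
centroids, buckets = build_ivf(data, n_clusters=16)
query = [random.random() for _ in range(4)]

approx = ivf_search(data, centroids, buckets, query, k=5, nprobe=3)
exact = ivf_search(data, centroids, buckets, query, k=5, nprobe=16)  # all clusters = full scan
print("approx:", approx)
print("exact: ", exact)
```

Comparing the two printed lists shows whether visiting only 3 of the 16 clusters missed any true neighbour for this query; raising `nprobe` trades speed for recall.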

__Example__<br>
Suppose we have $10^6$ document vectors in $\mathbb{R}^{64}$. If we split them into 1024 clusters (about $10^3$ centroids with about $10^3$ members each) and search the top-3 clusters, we get roughly a 250x FLOPs reduction:$$\frac{64 \cdot 10^6}{64 \cdot 10^3 + 3 \cdot 64 \cdot 10^3} = \frac{64 \cdot 10^6}{256 \cdot 10^3} = 250$$
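The same arithmetic can be checked with a small helper. It assumes equally sized clusters, and the function name `ivf_speedup` is made up for this example; the dimension does not appear because it cancels out of the ratio.

```python
def ivf_speedup(n_vectors, n_clusters, nprobe):
    """Ratio of distance computations: full scan vs. IVF search.

    Assumes equally sized clusters. The vector dimension cancels out
    because every distance computation costs the same number of FLOPs.
    """
    centroid_cost = n_clusters                       # compare query with all centroids
    member_cost = nprobe * n_vectors / n_clusters    # scan members of nprobe clusters
    return n_vectors / (centroid_cost + member_cost)

rounded = ivf_speedup(10**6, 1000, 3)   # rounded figures used in the formula
precise = ivf_speedup(10**6, 1024, 3)   # exact cluster count
print(round(rounded), round(precise))   # 250 253
```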



## PQ (Product Quantization)
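Index construction:<br>- split the dimensions into m bands (sub-vectors)<br>- do K-means clustering separately for each band (256 clusters, so that a cluster id fits into 8 bits) and describe each cluster with its centroid<br>- for every record and every band b, find the nearest centroid c of that band<br>- store the record as m cluster ids (8 bits each) instead of the original floats

Search<br>- split the query into the same m bands and compute its distance to every centroid of each band (an $m \times 256$ lookup table)<br>- approximate the distance to each record by summing the m table entries selected by its codes, then select top-n<br>The scan is still linear in the number of records, but each distance costs m table lookups instead of a full-precision computation, and the index is much smaller. PQ is typically combined with IVF (IndexIVFPQ) so that only a few clusters are scanned.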

<img src="img/pq.png" width=500>
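A minimal pure-Python sketch of product quantization is given below. It is an illustration rather than the FAISS implementation: the names `train_pq`, `encode` and `pq_search` are hypothetical, and it uses only 16 centroids per band (instead of the usual 256) to keep the toy k-means fast.

```python
import random

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, n_clusters, n_iter=8, seed=0):
    """Plain Lloyd's k-means; returns the list of centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(data, n_clusters)]
    for _ in range(n_iter):
        groups = [[] for _ in range(n_clusters)]
        for vec in data:
            c = min(range(n_clusters), key=lambda j: sq_l2(centroids[j], vec))
            groups[c].append(vec)
        for c, members in enumerate(groups):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(vec, m):
    """Cut a vector into m equal-width bands."""
    width = len(vec) // m
    return [vec[b * width:(b + 1) * width] for b in range(m)]

def train_pq(data, m, n_centroids):
    """One codebook (list of centroids) per band."""
    return [kmeans([split(v, m)[b] for v in data], n_centroids, seed=b)
            for b in range(m)]

def encode(vec, codebooks):
    """Replace each band with the id of its nearest centroid."""
    return [min(range(len(cb)), key=lambda j: sq_l2(cb[j], sub))
            for cb, sub in zip(codebooks, split(vec, len(codebooks)))]

def pq_search(codes, codebooks, query, k):
    # Lookup table: squared distance from each query band to every centroid.
    table = [[sq_l2(c, sub) for c in cb]
             for cb, sub in zip(codebooks, split(query, len(codebooks)))]
    # Approximate distance to a stored vector = sum of m table lookups.
    scores = [(sum(table[b][code[b]] for b in range(len(code))), row_id)
              for row_id, code in enumerate(codes)]
    return sorted(scores)[:k]

random.seed(2)
data = [[random.random() for _ in range(8)] for _ in range(300)]
codebooks = train_pq(data, m=4, n_centroids=16)
codes = [encode(v, codebooks) for v in data]   # 4 small integers per record
query = [random.random() for _ in range(8)]

top3 = pq_search(codes, codebooks, query, k=3)
print(top3)
```

Note that the search never touches the original vectors: it works only with the codes and the codebooks, which is where the memory saving comes from.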

# Annoy
---
Annoy = "Approximate Nearest Neighbors Oh Yeah" by Spotify

Index construction<br>- Build a binary tree using random space partitioning (each split is a random hyperplane)<br>$\,\,\,\,$close points are likely to end up in the same partition<br>- Repeat to get multiple trees

Search<br>- Traverse every tree from the root to the leaf that contains the query<br>- Take the union of those leaves as the set of candidate points<br>- Evaluate the candidates (linear in the number of candidates, not in $n$) and rank them using $L_2$ distance

<img src="img/annoy.png" width=500>

Because the splits are random rather than optimized, index construction is cheap, and a query only descends a handful of trees before ranking a small candidate set.
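The tree idea can be sketched in plain Python as below. This is a simplified illustration, not the Annoy implementation: it assumes each split is the hyperplane halfway between two randomly sampled points, visits exactly one leaf per tree, and uses made-up names (`build_tree`, `find_leaf`, `search`). The real library explores several branches per tree with a priority queue.

```python
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(ids, data, rng, leaf_size=10):
    """Recursively split ids with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return ids  # leaf: a plain list of row ids
    a, b = (data[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = dot(normal, [(x + y) / 2 for x, y in zip(a, b)])  # plane through the midpoint
    left = [i for i in ids if dot(normal, data[i]) > offset]
    right = [i for i in ids if dot(normal, data[i]) <= offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return ids
    return (normal, offset,
            build_tree(left, data, rng, leaf_size),
            build_tree(right, data, rng, leaf_size))

def find_leaf(node, vec):
    """Follow the splits from the root down to the leaf containing vec."""
    while isinstance(node, tuple):
        normal, offset, left, right = node
        node = left if dot(normal, vec) > offset else right
    return node

def search(trees, data, query, k):
    candidates = set()
    for tree in trees:                      # one leaf per tree
        candidates.update(find_leaf(tree, query))
    return sorted(candidates, key=lambda i: l2(data[i], query))[:k]

rng = random.Random(3)
data = [[rng.random() for _ in range(8)] for _ in range(1000)]
trees = [build_tree(list(range(len(data))), data, rng) for _ in range(5)]
query = [rng.random() for _ in range(8)]

result = search(trees, data, query, k=5)
print(result)
```

More trees give a larger candidate set and therefore better recall, at the cost of more memory and a slower query.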

# ScaNN
---
ScaNN = "Scalable Nearest Neighbors" by Google

### Algorithm: Hybrid Tree + Asymmetric Hashing + Reordering

1. Partitioning (Tree Search):
   - Data clustered into centroids (leaves).
   - Only top-N leaves visited during search.

2. Asymmetric Hashing:
   - Compresses database vectors.
   - Approximates dot/cosine similarity using quantized representations.

3. Reordering:
   - Top-K candidates re-evaluated with full-precision vectors.
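The "score on compressed vectors, then reorder" idea from steps 2 and 3 can be illustrated with a toy sketch. It is not ScaNN's algorithm: plain rounding to a coarse grid (`quantize`) stands in for ScaNN's learned quantization, the partitioning step is omitted, and all names are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quantize(vec, step=0.25):
    """Coarse stand-in for compression: round each coordinate to a grid."""
    return [round(x / step) * step for x in vec]

def search_with_reorder(data, compressed, query, k, n_candidates):
    # Stage 1: cheap approximate scores from the compressed database vectors.
    # The query itself stays in full precision (the "asymmetric" part).
    approx_order = sorted(range(len(data)),
                          key=lambda i: dot(compressed[i], query), reverse=True)
    candidates = approx_order[:n_candidates]
    # Stage 2: re-score only the shortlisted candidates with the original vectors.
    return sorted(candidates, key=lambda i: dot(data[i], query), reverse=True)[:k]

random.seed(4)
data = [[random.random() for _ in range(16)] for _ in range(500)]
compressed = [quantize(v) for v in data]
query = [random.random() for _ in range(16)]

top5 = search_with_reorder(data, compressed, query, k=5, n_candidates=50)
print(top5)
```

The shortlist size (`n_candidates` here) controls the trade-off: a longer shortlist costs more exact dot products but is less likely to drop a true neighbour during the approximate stage.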



# HNSW
---
HNSW = "Hierarchical Navigable Small World", endorsed by Yandex

Index construction
<br>- store the data as a multi-layer graph
<br>- top layers are sparse (few points), bottom layers are dense (all points)
<br>- each node links to its K closest neighbours on the same layer
<br>- each node links to its instance on the lower layer

Search
<br>- start from the top level of the hierarchy and navigate in search for the closest node
<br>- move to the lower layer and repeat the process
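<br>- on the bottom layer, explore the neighbourhood of the entry point (keeping a short list of the best candidates) and return the top-k; the result is approximate, not exact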

In graph theory a "small world" graph is a graph in which the distance (# of edges) between most pairs of nodes is small, even though each node links to only a few others. This property is beneficial for fast navigation and search.
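The layered search can be sketched in plain Python. This is a heavily simplified illustration, not hnswlib: the layers are built by brute force (each node linked to its `m` nearest neighbours on the same layer) instead of incremental insertion, and the names `build_layers`, `greedy_step`, `beam_search` and `hnsw_search` are made up for the example. `ef` plays the role of the candidate-list size.

```python
import heapq
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_layers(data, n_layers=3, m=8, seed=0):
    """layers[0] holds every point; each higher layer keeps about 1/4 of the one below."""
    rng = random.Random(seed)
    members = [list(range(len(data)))]
    for _ in range(1, n_layers):
        below = members[-1]
        members.append(rng.sample(below, max(1, len(below) // 4)))
    layers = []
    for nodes in members:
        graph = {}
        for i in nodes:
            others = (j for j in nodes if j != i)
            graph[i] = heapq.nsmallest(m, others, key=lambda j: l2(data[i], data[j]))
        layers.append(graph)
    return layers

def greedy_step(graph, data, query, start):
    """Walk to ever-closer neighbours until none improves on the current node."""
    current = start
    while True:
        best = min(graph[current], key=lambda j: l2(data[j], query), default=current)
        if l2(data[best], query) >= l2(data[current], query):
            return current
        current = best

def beam_search(graph, data, query, start, ef):
    """Best-first search on one layer that keeps the ef closest nodes seen."""
    d0 = l2(data[start], query)
    visited = {start}
    frontier = [(d0, start)]   # min-heap of nodes still to expand
    best = [(-d0, start)]      # max-heap (negated distances) of current results
    while frontier:
        dist, node = heapq.heappop(frontier)
        if len(best) >= ef and dist > -best[0][0]:
            break              # nothing left that can improve the results
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(data[nb], query)
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(frontier, (d, nb))
                heapq.heappush(best, (-d, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return sorted((-d, i) for d, i in best)

def hnsw_search(layers, data, query, k, ef=20):
    entry = next(iter(layers[-1]))            # any node of the sparse top layer
    for graph in reversed(layers[1:]):        # descend through the upper layers
        entry = greedy_step(graph, data, query, entry)
    results = beam_search(layers[0], data, query, entry, ef)
    return [i for _, i in results[:k]]

random.seed(5)
data = [[random.random() for _ in range(4)] for _ in range(300)]
layers = build_layers(data)
query = [random.random() for _ in range(4)]

found = hnsw_search(layers, data, query, k=5)
print(found)
```

The upper layers only serve to find a good entry point quickly; the quality of the final answer depends mostly on how widely the bottom layer is explored.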




<img src="img/hnsw.png" width=750>

## Summary Comparison

| Library   | Core Algorithm                       | Index Type              | Speed     | Accuracy | Memory Use   | Best Use                          |
|-----------|---------------------------------------|--------------------------|-----------|----------|--------------|-----------------------------------|
| **FAISS** | IVF + PQ / HNSW / Flat                | Quantization / Graph     | ✅✅✅ (GPU) | ✅✅✅    | Medium–Low   | Large-scale, GPU-based search     |
| **Annoy** | Random Projection Trees               | Tree-based               | ✅✅       | ✅✅      | Low          | Lightweight, disk-based retrieval |
| **ScaNN** | Tree + Hashing + Reordering           | Hybrid                   | ✅✅✅      | ✅✅✅    | Medium       | ML-serving, dot/cosine retrieval  |
| **HNSW**  | Multi-layer Navigable Graph           | Graph-based              | ✅✅✅      | ✅✅✅    | Medium–High  | Real-time, high-recall search     |


In [None]:
import faiss
import numpy as np

# Sample item embeddings (100 items, 64-dim)
item_embeddings = np.random.rand(100, 64).astype('float32')

# Build index for Inner Product similarity (cosine if normalized)
index = faiss.IndexFlatIP(64)  # or faiss.IndexFlatL2(64)
index.add(item_embeddings)

# Query: 1 user embedding
query = np.random.rand(1, 64).astype('float32')
scores, ids = index.search(query, k=5)

print("FAISS:", ids, scores)


In [17]:
from annoy import AnnoyIndex
import numpy as np

f = 64  # dimension
index = AnnoyIndex(f, 'angular')

# Add 1000 items to the index
for i in range(1000):
    vec = np.random.rand(f)
    index.add_item(i, vec.tolist())

index.build(10)  # 10 trees

# Query
query = np.random.rand(f)

# `%time` must be on the same line as the statement it measures
%time ids = index.get_nns_by_vector(query.tolist(), 5, include_distances=True)

print("Annoy:", ids)


Annoy: ([28, 598, 407, 114, 266], [0.5223425626754761, 0.5260087251663208, 0.56670081615448, 0.567012369632721, 0.5683410167694092])


In [5]:
!pip3.13 install hnswlib



In [None]:
import numpy as np
import tensorflow as tf
import scann

# Sample data
item_embeddings = np.random.rand(100, 64).astype('float32')
query = np.random.rand(1, 64).astype('float32')

# Build ScaNN index
searcher = scann.scann_ops_pybind.builder(item_embeddings, 5, "dot_product").tree(
    num_leaves=100, num_leaves_to_search=10, training_sample_size=1000).score_ah(
    2, anisotropic_quantization_threshold=0.2).reorder(5).build()

# Query
neighbors, distances = searcher.search_batched(query)

print("ScaNN:", neighbors)


In [16]:
import hnswlib
import numpy as np

dim = 64
num_elements = 1000

# Initialize index
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=100, M=16)

# Add item vectors
data = np.random.rand(num_elements, dim).astype('float32')
index.add_items(data)

# Query
query = np.random.rand(1, dim).astype('float32')

# `%time` must be on the same line as the statement it measures
%time labels, distances = index.knn_query(query, k=5)

print("HNSW:", labels, distances)


HNSW: [[778  55 597 391 866]] [[0.13336885 0.15304852 0.1588034  0.16153485 0.16769516]]


In [2]:
import numpy as np
import faiss
import time

# Configurations
d = 128         # vector dimension
nb = 100_000    # number of database vectors
nq = 100        # number of query vectors
k = 10          # number of nearest neighbors to search

# Generate random vectors
np.random.seed(123)
xb = np.random.random((nb, d)).astype('float32')
xq = np.random.random((nq, d)).astype('float32')

# ----------------------
# 1. Exact: IndexFlatL2
# ----------------------
index_flat = faiss.IndexFlatL2(d)
index_flat.add(xb)

start_flat = time.time()
D_flat, I_flat = index_flat.search(xq, k)
end_flat = time.time()

print(f"IndexFlatL2 search time: {end_flat - start_flat:.4f} seconds")

# -------------------------
# 2. Approximate: IndexIVFPQ
# -------------------------
nlist = 100      # number of clusters
m = 16           # number of PQ subquantizers (d should be divisible by m)

quantizer = faiss.IndexFlatL2(d)  # coarse quantizer: assigns each vector to its nearest cluster centroid
index_ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per subquantizer

# Training (needed for IVFPQ)
index_ivfpq.train(xb)
index_ivfpq.add(xb)

# Set number of clusters to search (tradeoff between recall and speed)
index_ivfpq.nprobe = 10

start_ivfpq = time.time()
D_ivfpq, I_ivfpq = index_ivfpq.search(xq, k)
end_ivfpq = time.time()

print(f"IndexIVFPQ search time: {end_ivfpq - start_ivfpq:.4f} seconds")
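Timing alone does not show what the approximate index gives up. One common way to quantify this is recall@k: the fraction of the exact neighbours that the approximate search also returned. The helper below is a small sketch with a made-up name, `recall_at_k`; it works on any pair of row-aligned id lists or arrays, such as `I_flat` and `I_ivfpq` from the benchmark above.

```python
def recall_at_k(exact_ids, approx_ids):
    """Average fraction of true neighbours that the approximate search returned."""
    hits = sum(len(set(map(int, e)) & set(map(int, a)))
               for e, a in zip(exact_ids, approx_ids))
    total = sum(len(e) for e in exact_ids)
    return hits / total

# Tiny hand-made example: two queries, three neighbours each.
exact = [[1, 2, 3], [4, 5, 6]]
approx = [[1, 2, 9], [4, 7, 8]]
demo_recall = recall_at_k(exact, approx)
print(demo_recall)  # 0.5

# With the benchmark above one would call: recall_at_k(I_flat, I_ivfpq)
```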

