# Comparison of Approximate Nearest Neighbor Libraries

This section compares **FAISS**, **Annoy**, **ScaNN**, and **HNSW (via hnswlib)** based on their underlying algorithms and use cases.

---

## 1. FAISS – Facebook AI Similarity Search

### 🔍 Algorithm
FAISS provides multiple ANN algorithms:

#### a. **IndexFlat (Brute-Force)**
- Computes the exact distance (L2 or inner product) from the query to every stored vector.
- Best for small datasets or as a baseline.
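
The brute-force baseline can be sketched in a few lines of plain Python. This is only an illustration of exhaustive search, not FAISS's implementation; the helper name `exact_knn` and the toy data are assumptions made for the example.

```python
import heapq
import random

def exact_knn(vectors, query, k):
    """Return the k closest (squared L2 distance, index) pairs, nearest first."""
    def sq_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Score every stored vector, then keep the k smallest distances.
    scored = ((sq_l2(v, query), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(200)]
query = vectors[17]  # a stored vector, so its own index should rank first
top = exact_knn(vectors, query, k=3)
```

Because every vector is scored, the result is exact, which makes this a convenient ground truth when measuring the recall of an approximate index.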

#### b. **IndexIVF (Inverted File Index)**
- **Quantization-based** ANN search:
  1. K-means clustering → coarse centroids.
  2. Assign vectors to closest centroid (inverted list).
  3. During search: scan only the inverted lists of the `nprobe` centroids nearest to the query.
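
The three steps above can be sketched in plain Python. This is a toy illustration under simplifying assumptions (a handful of Lloyd iterations, squared L2 distance, no compression); the function names and the `n_probe` argument are made up for the example and are not FAISS APIs.

```python
import random

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda c: sq_l2(centroids[c], v))

def train_centroids(vectors, n_lists, iters=5, seed=0):
    """Step 1: a few k-means (Lloyd) iterations produce the coarse centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    for _ in range(iters):
        groups = [[] for _ in range(n_lists)]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_lists(vectors, centroids):
    """Step 2: each vector id goes into the inverted list of its closest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest(centroids, v)].append(i)
    return lists

def ivf_search(vectors, centroids, lists, query, k, n_probe):
    """Step 3: scan only the n_probe inverted lists closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_l2(centroids[c], query))
    candidates = [i for c in order[:n_probe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_l2(vectors[i], query))[:k]

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(300)]
centroids = train_centroids(vectors, n_lists=10)
lists = build_lists(vectors, centroids)
query = vectors[42]
approx = ivf_search(vectors, centroids, lists, query, k=5, n_probe=3)
full = ivf_search(vectors, centroids, lists, query, k=5, n_probe=10)
```

Probing all 10 lists degenerates to exhaustive search; probing fewer lists trades recall for speed, which is the knob an IVF index exposes.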

#### c. **PQ (Product Quantization)**
- Splits vectors into subvectors and quantizes each.
- Great for memory efficiency when paired with IVF.
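
A minimal sketch of product quantization, assuming a tiny k-means per subspace and squared L2 distance. All names here (`pq_train`, `pq_encode`, `adc_distance`) are hypothetical helpers for illustration, not FAISS functions.

```python
import random

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=5, seed=0):
    """A few Lloyd iterations; returns k centroids (the codewords)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: sq_l2(centroids[j], p))].append(p)
        for j, members in enumerate(groups):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(v, n_sub):
    """Cut v into n_sub equal-length subvectors."""
    d = len(v) // n_sub
    return [v[s * d:(s + 1) * d] for s in range(n_sub)]

def pq_train(vectors, n_sub, n_codes):
    """Learn one codebook per subspace."""
    parts = [split(v, n_sub) for v in vectors]
    return [kmeans([p[s] for p in parts], n_codes, seed=s) for s in range(n_sub)]

def pq_encode(v, codebooks):
    """Replace each subvector by the index of its nearest codeword."""
    return [min(range(len(cb)), key=lambda c: sq_l2(cb[c], sub))
            for sub, cb in zip(split(v, len(codebooks)), codebooks)]

def pq_decode(code, codebooks):
    """Concatenate the selected codewords into an approximate vector."""
    return [x for c, cb in zip(code, codebooks) for x in cb[c]]

def adc_distance(query, code, codebooks):
    """Distance between an uncompressed query and a compressed vector."""
    subs = split(query, len(codebooks))
    # table[s][c]: squared distance from query subvector s to codeword c.
    # A real index builds this table once per query and reuses it for all codes.
    table = [[sq_l2(sub, word) for word in cb] for sub, cb in zip(subs, codebooks)]
    return sum(table[s][c] for s, c in enumerate(code))

random.seed(0)
vectors = [[random.random() for _ in range(16)] for _ in range(200)]
codebooks = pq_train(vectors, n_sub=4, n_codes=16)
code = pq_encode(vectors[0], codebooks)  # 4 small integers instead of 16 floats
query = [random.random() for _ in range(16)]
approx_dist = adc_distance(query, code, codebooks)
```

The memory saving comes from storing only `code` per vector; the lookup table is what keeps distance evaluation cheap despite the compression.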

#### d. **HNSW in FAISS**
- Graph-based index: `IndexHNSWFlat`.
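- Typically offers a better recall/latency trade-off than IVF on CPU, at the cost of more memory, because `IndexHNSWFlat` stores the full vectors plus the graph links.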

> **FAISS excels** in GPU acceleration, batch processing, and large-scale use.

---

## 2. Annoy – Approximate Nearest Neighbors Oh Yeah

### Algorithm: Random Projection Trees

- Builds multiple **binary trees** using random hyperplanes.
- Similar to KD-Trees, but splits are randomized.
- During search:
  - Traverse multiple trees.
  - Aggregate and rank candidates using real distance.
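
The idea can be sketched as a small forest of random-hyperplane trees. This is a simplified illustration, not Annoy's implementation: it visits exactly one leaf per tree, whereas Annoy explores nodes across all trees with a shared priority queue. The function names and `leaf_size` are assumptions for the example.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_tree(vectors, ids, rng, leaf_size=10):
    """Recursively split ids by the hyperplane equidistant from two random points."""
    if len(ids) <= leaf_size:
        return ids  # leaf: a plain list of item ids
    a, b = (vectors[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = dot(normal, [(x + y) / 2 for x, y in zip(a, b)])
    left = [i for i in ids if dot(normal, vectors[i]) < offset]
    right = [i for i in ids if dot(normal, vectors[i]) >= offset]
    if not left or not right:  # degenerate split, stop here
        return ids
    return (normal, offset,
            build_tree(vectors, left, rng, leaf_size),
            build_tree(vectors, right, rng, leaf_size))

def leaf_for(tree, query):
    """Follow the side of each hyperplane the query falls on until a leaf is reached."""
    while isinstance(tree, tuple):
        normal, offset, left, right = tree
        tree = right if dot(normal, query) >= offset else left
    return tree

def forest_search(vectors, trees, query, k):
    """Union the candidate leaves of all trees, then rank by true distance."""
    candidates = set()
    for tree in trees:
        candidates.update(leaf_for(tree, query))
    return sorted(candidates, key=lambda i: sq_l2(vectors[i], query))[:k]

rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(400)]
trees = [build_tree(vectors, list(range(400)), rng) for _ in range(5)]
query = vectors[7]
result = forest_search(vectors, trees, query, k=5)
```

Each tree alone can separate true neighbors that sit on opposite sides of a split; using several independently randomized trees makes it more likely that at least one of them keeps a given neighbor in the query's leaf.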

### Key Features:
- **Disk-backed index** (can be memory-mapped).
- Tree splits are randomized; the seed can be fixed with `set_seed` to make builds repeatable.

> Ideal for **read-heavy**, low-resource recommendation services.

---

## 3. ScaNN – Scalable Nearest Neighbors (by Google)

### 🔍 Algorithm: Hybrid Tree + Asymmetric Hashing + Reordering

1. **Partitioning (Tree Search)**:
   - Data clustered into centroids (leaves).
   - Only top-N leaves visited during search.

2. **Asymmetric Hashing**:
   - Compresses database vectors.
   - Approximates dot/cosine similarity using quantized representations.

3. **Reordering**:
   - Top-K candidates re-evaluated with full-precision vectors.
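
The score-then-reorder part of this pipeline can be sketched as follows. To keep the example short, plain 4-bit scalar quantization stands in for ScaNN's asymmetric hashing (a deliberate substitution, not ScaNN's actual quantizer), and the partitioning step is omitted. All names are hypothetical.

```python
import random

LEVELS = 16  # 4-bit codes per coordinate (assumed for this sketch)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quantize(v, lo=0.0, hi=1.0):
    """Map each coordinate to an integer code in [0, LEVELS - 1]."""
    step = (hi - lo) / (LEVELS - 1)
    return [round((x - lo) / step) for x in v]

def dequantize(code, lo=0.0, hi=1.0):
    step = (hi - lo) / (LEVELS - 1)
    return [lo + c * step for c in code]

def search(vectors, codes, query, k, n_reorder):
    # Asymmetric scoring: full-precision query against compressed database vectors.
    by_approx = sorted(range(len(codes)),
                       key=lambda i: -dot(query, dequantize(codes[i])))
    shortlist = by_approx[:n_reorder]
    # Reordering: rescore only the shortlist with the original vectors.
    return sorted(shortlist, key=lambda i: -dot(query, vectors[i]))[:k]

random.seed(0)
vectors = [[random.random() for _ in range(16)] for _ in range(300)]
codes = [quantize(v) for v in vectors]
query = [random.random() for _ in range(16)]
top = search(vectors, codes, query, k=5, n_reorder=30)
```

The shortlist size controls the trade-off: a larger `n_reorder` recovers more of the neighbors that the compressed scores ranked too low, at the cost of more full-precision dot products.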

### 🧠 Features:
- Optimized for CPUs using SIMD instructions (e.g., AVX2); also usable from TensorFlow via its ops.
- Very high throughput for dot product/cosine tasks.

> 📌 Great for large-scale ML serving (e.g., **recommendation**, **semantic search**).

---

## 4. HNSW – Hierarchical Navigable Small World Graphs (via hnswlib)

### 🔍 Algorithm: Multi-layer Navigable Graph

- Constructs a **multi-layer graph**:
  - Top layers are sparse, bottom layers are dense.
- Indexing:
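  - Each node is assigned a random top layer (higher layers are exponentially less likely) and is linked to a bounded number (`M`) of its nearest neighbors on every layer it belongs to.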

### Search Process:
1. Start at the top layer.
2. Navigate greedily to closer nodes.
3. Descend to the next layer and repeat.
4. Perform best-first local search at bottom.
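
The four search steps can be sketched on a toy layered graph. The construction below is deliberately simplified (brute-force, directed links to the `m` nearest nodes per layer) and is not how hnswlib builds its index; only the search logic is meant to mirror the steps above. All function names are assumptions.

```python
import heapq
import math
import random

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_layers(vectors, m=8, seed=0):
    """Simplified build: random top layer per node, brute-force nearest links."""
    rng = random.Random(seed)
    levels = [int(-math.log(1.0 - rng.random()) / math.log(m)) for _ in vectors]
    layers = []
    for lvl in range(max(levels) + 1):
        nodes = [i for i, top in enumerate(levels) if top >= lvl]
        layers.append({
            i: sorted((j for j in nodes if j != i),
                      key=lambda j: sq_l2(vectors[i], vectors[j]))[:m]
            for i in nodes
        })
    return layers  # layers[0] is the dense bottom layer

def greedy(graph, vectors, query, current):
    """Steps 2-3: move to the closest neighbor until no neighbor is closer."""
    while True:
        best = min([current] + graph[current], key=lambda n: sq_l2(vectors[n], query))
        if best == current:
            return current
        current = best

def search_layer(graph, vectors, query, entry, ef):
    """Step 4: best-first search keeping the ef closest nodes seen so far."""
    d0 = sq_l2(vectors[entry], query)
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap of nodes still to expand
    best = [(-d0, entry)]       # max-heap (negated distances) of kept results
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break  # the closest open candidate is worse than every kept result
        for n in graph[node]:
            if n in visited:
                continue
            visited.add(n)
            dn = sq_l2(vectors[n], query)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, n))
                heapq.heappush(best, (-dn, n))
                if len(best) > ef:
                    heapq.heappop(best)
    return sorted((-d, n) for d, n in best)

def hnsw_search(layers, vectors, query, k, ef=32):
    entry = next(iter(layers[-1]))  # step 1: start at the top layer
    for graph in reversed(layers[1:]):
        entry = greedy(graph, vectors, query, entry)
    return [n for _, n in search_layer(layers[0], vectors, query, entry, ef)[:k]]

random.seed(1)
vectors = [[random.random() for _ in range(8)] for _ in range(300)]
layers = build_layers(vectors)
query = [random.random() for _ in range(8)]
found = hnsw_search(layers, vectors, query, k=5)
```

The sparse upper layers act as long-range shortcuts that bring the search close to the query quickly; the `ef` argument then decides how widely the bottom layer is explored.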

### Features:
- High recall.
- `ef` (the size of the search-time candidate list) tunes recall vs. latency: larger values raise recall and slow down queries.
- Excellent for real-time applications.

> One of the most accurate ANN algorithms for **search and recommendations**.

---

## Summary Comparison

| Library   | Core Algorithm                       | Index Type              | Speed     | Accuracy | Memory Use   | Best Use                          |
|-----------|---------------------------------------|--------------------------|-----------|----------|--------------|-----------------------------------|
| **FAISS** | IVF + PQ / HNSW / Flat                | Quantization / Graph     | ✅✅✅ (GPU) | ✅✅✅    | Medium–Low   | Large-scale, GPU-based search     |
| **Annoy** | Random Projection Trees               | Tree-based               | ✅✅       | ✅✅      | Low          | Lightweight, disk-based retrieval |
| **ScaNN** | Tree + Hashing + Reordering           | Hybrid                   | ✅✅✅      | ✅✅✅    | Medium       | ML-serving, dot/cosine retrieval  |
| **HNSW**  | Multi-layer Navigable Graph           | Graph-based              | ✅✅✅      | ✅✅✅    | Medium–High  | Real-time, high-recall search     |

---



In [None]:
import faiss
import numpy as np

# Sample item embeddings (100 items, 64-dim)
item_embeddings = np.random.rand(100, 64).astype('float32')

# Build an inner-product index; scores equal cosine similarity only if the
# vectors are L2-normalized first (the random vectors here are not)
index = faiss.IndexFlatIP(64)  # or faiss.IndexFlatL2(64)
index.add(item_embeddings)

# Query: 1 user embedding
query = np.random.rand(1, 64).astype('float32')
scores, ids = index.search(query, k=5)

print("FAISS:", ids, scores)


In [17]:
from annoy import AnnoyIndex
import numpy as np

f = 64  # dimension
index = AnnoyIndex(f, 'angular')

# Add 1000 items to the index
for i in range(1000):
    vec = np.random.rand(f)
    index.add_item(i, vec.tolist())

index.build(10)  # 10 trees

# Query
query = np.random.rand(f)

# %time only measures a statement written on the same line
# `ids` holds an (ids, distances) tuple because include_distances=True
%time ids = index.get_nns_by_vector(query.tolist(), 5, include_distances=True)

print("Annoy:", ids)


Annoy: ([28, 598, 407, 114, 266], [0.5223425626754761, 0.5260087251663208, 0.56670081615448, 0.567012369632721, 0.5683410167694092])


In [5]:
!pip3.13 install hnswlib

Collecting hnswlib
  Downloading hnswlib-0.8.0.tar.gz (36 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: hnswlib
  Building wheel for hnswlib (pyproject.toml) ... done
  Created wheel for hnswlib: filename=hnswlib-0.8.0-cp313-cp313-macosx_10_13_universal2.whl size=419349 sha256=a66111578fc206f4eeb6f120684acc6e244ee311893156ed33c504bcb6247095
  Stored in directory: /Users/konstantin/Library/Caches/pip/wheels/35/04/88/b31765a4b9957705e18065db4657e61fc8da54f50e3ef0b67e
Successfully built hnswlib
Installing collected packages: hnswlib
Successfully installed hnswlib-0.8.0


In [None]:
import numpy as np
import tensorflow as tf
import scann

# Sample data
item_embeddings = np.random.rand(100, 64).astype('float32')
query = np.random.rand(1, 64).astype('float32')

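# Build ScaNN index. With only 100 items, keep the number of partitions
# (roughly sqrt(N)) and the training sample no larger than the dataset.
searcher = (
    scann.scann_ops_pybind.builder(item_embeddings, 5, "dot_product")
    .tree(num_leaves=10, num_leaves_to_search=5, training_sample_size=100)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(20)  # rescore 20 candidates exactly, then return the top 5
    .build()
)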

# Query
neighbors, distances = searcher.search_batched(query)

print("ScaNN:", neighbors)


In [16]:
import hnswlib
import numpy as np

dim = 64
num_elements = 1000

# Initialize index
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=100, M=16)

# Add item vectors
data = np.random.rand(num_elements, dim).astype('float32')
index.add_items(data)

# Query
query = np.random.rand(1, dim).astype('float32')

# %time only measures a statement written on the same line
%time labels, distances = index.knn_query(query, k=5)

print("HNSW:", labels, distances)


HNSW: [[778  55 597 391 866]] [[0.13336885 0.15304852 0.1588034  0.16153485 0.16769516]]
