
ntvecdb — A Minimal Vector Database with HNSW

ntvecdb is a minimal, embedded vector database designed to demonstrate how modern vector databases work under the hood. It supports storing high-dimensional embeddings, approximate nearest-neighbor search using HNSW, persistence to disk, concurrent access, and batch queries.

This project was built as part of a systems-focused exercise. The emphasis is on algorithmic understanding, architecture, and explicit trade-offs, rather than production completeness.


Features

  • HNSW index implemented from scratch for scalable ANN search
  • Exact (brute-force) search for correctness validation
  • Thread-safe engine
    • Multiple concurrent readers
    • Serialized writers
  • Batch search API for bulk query workloads
  • Explicit persistence (save/load index to disk)
  • Embedded, in-process design (DuckDB/SQLite-style)
  • Minimal CLI for demonstration
  • No external dependencies

Quick Start

Prerequisites

  • C++17 compatible compiler (clang or gcc)
  • CMake ≥ 3.10

Build

git clone <your-repo-url>
cd ntvecdb
mkdir build && cd build
cmake ..
make

Run Tests

./correctness        # HNSW vs brute-force comparison + persistence
./engine_sanity      # VecDBEngine API validation
./concurrency_test   # Multi-threaded stress test
./benchmark          # Performance comparison

Run CLI

./vecdb              # Shows usage

Architecture Overview

ntvecdb is designed as an embedded, in-process vector database, similar in spirit to DuckDB.

Application / CLI
        |
   VecDBEngine
        |
     HNSWIndex
        |
   In-memory graph

Key design choices

  • The engine owns the index and enforces all concurrency guarantees.
  • The index is kept entirely in memory for fast access.
  • Persistence is explicit and opt-in via save/load operations.
  • There is no server or IPC layer; concurrency is handled within a single process.

This design keeps the system simple, testable, and focused on the core problem: efficient similarity search.


HNSW Index (High Level)

The HNSW (Hierarchical Navigable Small World) index is a multi-layer graph structure that enables sub-linear approximate nearest-neighbor search.

High-level idea:

  • Vectors are stored as nodes in a layered graph
  • Upper layers provide long-range navigation
  • Lower layers refine local neighborhoods
  • Search starts from an entry point at the top layer and greedily descends

This implementation focuses on clarity and correctness rather than aggressive optimization, closely following the ideas from the original HNSW paper.

Reference: Yu. A. Malkov and D. A. Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. [arXiv:1603.09320]
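The greedy descent at the heart of HNSW can be illustrated with a simplified single-layer sketch: start at an entry node and hop to whichever neighbor is closer to the query until no neighbor improves. This is hypothetical toy code (1-D "coordinates" instead of high-dimensional vectors, one layer instead of many), not the project's implementation:

```cpp
#include <cmath>
#include <vector>

// Greedy search on a single graph layer: move to the neighbor closest
// to the query, stop at a local minimum. HNSW runs this on upper layers
// for coarse navigation, then widens the candidate set (efSearch) on
// the bottom layer to refine the result.
int greedy_search(const std::vector<std::vector<int>>& neighbors,
                  const std::vector<float>& coords,  // toy 1-D embedding
                  float query, int entry) {
    int cur = entry;
    for (;;) {
        int best = cur;
        float best_d = std::fabs(coords[cur] - query);
        for (int nb : neighbors[cur]) {
            float d = std::fabs(coords[nb] - query);
            if (d < best_d) { best_d = d; best = nb; }
        }
        if (best == cur) return cur;  // no neighbor is closer: local minimum
        cur = best;
    }
}
```

Because the walk only ever moves closer to the query, each hop prunes the rest of the graph, which is where the sub-linear behavior comes from.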


Data Model

  • Each vector has:
    • A unique integer ID
    • A fixed-dimension float embedding
  • All embeddings in an index share the same dimensionality
  • (Future work) Metadata can be attached to vectors for filtered search
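In code, this data model can be as simple as the following sketch (field and function names here are assumptions for illustration, not the project's actual layout):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One stored vector: a unique integer ID plus a fixed-dimension embedding.
struct VectorRecord {
    std::uint64_t id;
    std::vector<float> embedding;
};

// An index created with dimension `dim` only accepts matching embeddings.
inline bool dims_match(const VectorRecord& r, std::size_t dim) {
    return r.embedding.size() == dim;
}
```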

API Overview

VecDBEngine Methods

Method                                 Description
create_index(dim, M, efConstruction)   Create a new HNSW index
insert(id, embedding)                  Insert a vector
search(query, k, efSearch)             Single-query search
batch_search(queries, k, efSearch)     Bulk query search
save_db(db_name)                       Save index to database root
load_db(db_name)                       Load index from database root
save_index_path(path)                  Save index to absolute path
load_index_path(path)                  Load index from absolute path
set_db_root(root)                      Set database root directory
has_index()                            Check whether an index is loaded

Batch Search

batch_search executes multiple queries under a single shared lock, amortizing synchronization overhead. This is useful for bulk query workloads.
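The lock-amortization idea can be sketched as follows: acquire one shared lock, answer every query in the batch, then release. This is a hypothetical standalone sketch; a linear scan over floats stands in for the real HNSW index:

```cpp
#include <cmath>
#include <cstddef>
#include <shared_mutex>
#include <vector>

struct ToyEngine {
    std::vector<float> points;       // stand-in for the index (assumed non-empty)
    mutable std::shared_mutex mtx;

    // One query: index of the nearest point (placeholder for HNSW search).
    int search_locked(float q) const {
        int best = 0;
        for (std::size_t i = 1; i < points.size(); ++i)
            if (std::fabs(points[i] - q) < std::fabs(points[best] - q))
                best = static_cast<int>(i);
        return best;
    }

    // Batch search: one shared lock covers the whole batch, so N queries
    // pay for one lock acquisition instead of N, and other readers may
    // still run concurrently.
    std::vector<int> batch_search(const std::vector<float>& queries) const {
        std::shared_lock lock(mtx);
        std::vector<int> out;
        for (float q : queries) out.push_back(search_locked(q));
        return out;
    }
};
```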


Concurrency Model

  • Implemented using std::shared_mutex
  • Multiple concurrent readers (search, batch_search, has_index)
  • Exclusive writers (insert, create, load, save)

This guarantees:

  • No data races
  • No partial visibility of index state
  • Deterministic behavior under concurrency

Operation-level Atomicity

  • Each write operation is atomic and isolated
  • Reads observe either the state before or after a write
  • No partial updates are ever visible

Persistence Design

  • The index lives in memory during execution
  • Persistence is explicit via save/load operations
  • Index state is serialized to disk as binary files (meta.bin + nodes.bin)
  • On startup, the engine can load an existing index

This keeps durability simple and predictable while avoiding the complexity of WAL or recovery protocols.
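The save/load scheme can be sketched as a two-file binary layout: a small metadata record (dimension, count) plus the raw vector data. This is illustrative only; the real meta.bin/nodes.bin format also serializes the graph links:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Write dimension and vector count to a meta file, raw floats to a data file.
void save_vectors(const std::string& meta_path, const std::string& data_path,
                  std::uint32_t dim, const std::vector<float>& flat) {
    std::ofstream meta(meta_path, std::ios::binary);
    std::uint64_t count = flat.size() / dim;
    meta.write(reinterpret_cast<const char*>(&dim), sizeof(dim));
    meta.write(reinterpret_cast<const char*>(&count), sizeof(count));
    std::ofstream data(data_path, std::ios::binary);
    data.write(reinterpret_cast<const char*>(flat.data()),
               static_cast<std::streamsize>(flat.size() * sizeof(float)));
}

// Read the metadata back, then exactly count * dim floats.
std::vector<float> load_vectors(const std::string& meta_path,
                                const std::string& data_path,
                                std::uint32_t& dim) {
    std::ifstream meta(meta_path, std::ios::binary);
    std::uint64_t count = 0;
    meta.read(reinterpret_cast<char*>(&dim), sizeof(dim));
    meta.read(reinterpret_cast<char*>(&count), sizeof(count));
    std::vector<float> flat(count * dim);
    std::ifstream data(data_path, std::ios::binary);
    data.read(reinterpret_cast<char*>(flat.data()),
              static_cast<std::streamsize>(flat.size() * sizeof(float)));
    return flat;
}
```

Because save/load is explicit, a crash between saves simply loses the in-memory changes; there is no recovery state to reconcile.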


CLI

A minimal CLI is provided for demonstration and testing.

Example usage:

vecdb create wiki 128
vecdb insert wiki 1 0.1,0.2,0.3,...
vecdb search wiki 0.1,0.2,0.3,... 5 50
vecdb batch-search wiki queries.txt 5 50
vecdb save wiki
vecdb load wiki

The CLI is intentionally lightweight and delegates all logic to the engine.


Testing

  • correctness Compares HNSW search results against brute-force ground truth. Also validates save/load persistence.

  • engine_sanity Tests VecDBEngine API: create, insert, search, save, and load operations.

  • concurrency_test Spawns multiple reader and writer threads to validate thread safety and measure contention overhead.

  • benchmark Measures search latency for brute-force vs HNSW across different dataset sizes.
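The brute-force ground truth these tests compare against can be sketched as a linear scan that sorts all vectors by distance. This is a minimal standalone sketch, not the project's actual implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Squared L2 distance between two equal-length vectors.
inline float l2_sq(const std::vector<float>& a, const std::vector<float>& b) {
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Exact k-NN by linear scan: O(n * dim) per query. An approximate index
// is validated by checking its recall against these results.
std::vector<int> brute_force_knn(const std::vector<std::vector<float>>& data,
                                 const std::vector<float>& query,
                                 std::size_t k) {
    std::vector<std::pair<float, int>> scored;
    for (std::size_t i = 0; i < data.size(); ++i)
        scored.emplace_back(l2_sq(data[i], query), static_cast<int>(i));
    const std::size_t kept = std::min(k, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + kept, scored.end());
    std::vector<int> ids;
    for (std::size_t i = 0; i < kept; ++i)
        ids.push_back(scored[i].second);
    return ids;
}
```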


Benchmark

Performance comparison of brute-force search vs HNSW (dim=128, k=10, 100 queries, efSearch=100):

Vectors    Brute Force (ms)    HNSW (ms)    Speedup
 1,000          43.10             69.37       0.62x
 5,000         215.41            113.35       1.90x
10,000         432.23            125.12       3.45x

At small dataset sizes, the fixed per-query cost of HNSW's graph traversal outweighs its pruning benefit, so the linear scan wins. As the dataset grows, HNSW's sub-linear search delivers clear speedups over brute force.

(Results measured on an Apple M3; your numbers may vary.)


Access Patterns: OLAP vs OLTP

Current Design: Read-Optimized (OLAP-friendly)

The current implementation prioritizes read-heavy workloads:

  • Multiple readers can execute searches in parallel (shared_lock)
  • batch_search amortizes lock acquisition for bulk queries
  • Writers are serialized and block readers during graph updates

This aligns well with typical vector search use cases: embeddings are generated offline (or infrequently), and the system serves many concurrent search requests.

The OLTP Challenge

HNSW is not optimized for high-frequency small writes:

  • Each insert traverses the graph to find optimal neighbors
  • Neighbor connections are updated, potentially affecting multiple nodes
  • Writers hold an exclusive lock, blocking all concurrent readers

For workloads with frequent inserts (e.g., real-time embedding ingestion), this becomes a bottleneck.

Potential Improvement: Dual-Index Architecture

A common solution in production vector databases (e.g., Milvus, Pinecone) is a dual-index approach:

Writes → Append Buffer (fast) → [Background Merge] → HNSW Index
                ↓                                         ↓
            Reads query both (buffer + index)

  • Write path: New vectors append to an in-memory buffer (O(1), no graph update)
  • Read path: Query both the buffer (brute-force scan) and the HNSW index, merge results
  • Background merge: Periodically flush the buffer into the HNSW index

This would provide:

  • Fast writes (no blocking graph traversal)
  • Still-fast reads (small buffer scan + HNSW lookup)
  • Eventually consistent index state

This extension is not implemented but represents a natural evolution for OLTP-style workloads.
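A toy version of the dual-index idea, under stated assumptions (plain floats stand in for embeddings, and a flat vector stands in for the HNSW index; real systems merge candidate sets by distance across both structures):

```cpp
#include <cmath>
#include <vector>

struct DualIndex {
    std::vector<float> main_index;   // stand-in for HNSW (expensive to update)
    std::vector<float> buffer;       // append-only write buffer

    // Fast write path: O(1) append, no graph traversal, no reader blocking.
    void insert(float v) { buffer.push_back(v); }

    // Read path: scan the small buffer AND the main index, keep the best,
    // so freshly inserted vectors are visible before any merge.
    float search(float q) const {
        float best = 1e30f;
        for (float v : main_index)
            if (std::fabs(v - q) < std::fabs(best - q)) best = v;
        for (float v : buffer)
            if (std::fabs(v - q) < std::fabs(best - q)) best = v;
        return best;
    }

    // Background merge: fold buffered vectors into the main index.
    void merge() {
        main_index.insert(main_index.end(), buffer.begin(), buffer.end());
        buffer.clear();
    }
};
```

The trade-off is visible even in the toy: reads pay a small extra scan over the buffer, and the index is only eventually consistent with respect to merges.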


Trade-offs & Future Improvements

Known limitations

  • In-memory only (no memory-mapped storage)
  • No delete or update operations
  • No metadata filtering (yet)
  • No multi-process coordination
  • No WAL or crash recovery

Possible extensions

  • Metadata-aware search
  • Incremental persistence / WAL
  • Memory-mapped index
  • Multi-process server mode (IPC)
  • Distributed indexing / sharding

Summary

ntvecdb focuses on core vector database concepts:

  • scalable ANN search
  • concurrency control
  • persistence
  • clear separation of concerns

The project intentionally avoids overengineering and prioritizes correctness, clarity, and explainability.
