# Scaling Vector Databases for Production

## Key Scaling Considerations

1. **Speed vs. Accuracy** - Understanding the tradeoffs between query performance and result quality
2. **Resource Limitations** - Managing memory, CPU, and storage constraints
3. **Horizontal Scaling** - Distributing the workload across multiple instances

## Part 1: Approximate Nearest Neighbor (ANN) Implementations

ANN algorithms like HNSW (Hierarchical Navigable Small World) allow us to trade some accuracy for significant performance improvements at scale. We'll explore different HNSW configurations and their impact on search performance.

In [1]:
import chromadb
from chromadb.utils import embedding_functions
import time

# Initialize ChromaDB client
client = chromadb.Client()
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

### Creating Collections with Different HNSW Configurations

We'll create three collections with different index settings:

1. **Default** - Uses ChromaDB's default configuration
2. **High Accuracy** - Prioritizes result quality with higher `ef` and `M` values
3. **Fast Search** - Prioritizes speed with lower `ef` and `M` values

**Parameter Explanation:**
- `hnsw:space`: The distance metric used (`l2`, `ip`, or `cosine`; ChromaDB defaults to `l2`)
- `hnsw:construction_ef`: Controls index build quality (higher = better quality, slower build)
- `hnsw:search_ef`: Controls search quality (higher = better quality, slower search)
- `hnsw:M`: Controls the maximum number of connections per node (higher = better quality, more memory)
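
To get a feel for the memory cost of `hnsw:M`, here is a rough back-of-the-envelope sketch. It assumes float32 vectors and a commonly cited rule of thumb for HNSW indexes of about `M * 2 * 4` bytes of graph links per element. The function name `estimate_hnsw_bytes` is hypothetical, and real usage in ChromaDB will be higher because documents, metadata, and allocator overhead are not counted.

```python
def estimate_hnsw_bytes(num_vectors: int, dim: int, m: int) -> int:
    """Rough index size: raw float32 vectors plus base-layer graph links.

    This is an approximation for illustration, not a ChromaDB API.
    """
    vector_bytes = dim * 4      # one float32 per dimension
    link_bytes = m * 2 * 4      # rule of thumb: ~2*M four-byte neighbor ids
    return num_vectors * (vector_bytes + link_bytes)


# all-MiniLM-L6-v2 produces 384-dimensional embeddings
for label, m in [("fast_search", 12), ("high_accuracy", 36)]:
    size_mb = estimate_hnsw_bytes(5000, 384, m) / 1e6
    print(f"{label} (M={m}): ~{size_mb:.2f} MB")
```

At this collection size the vectors dominate; the link overhead from `M` becomes more noticeable with lower-dimensional embeddings or much larger collections.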

In [2]:
# Create collections with different HNSW configurations
collections = {}

# 1. Default settings (the default metric is l2, while the two configurations below use cosine)
collections["default"] = client.create_collection(
    name="default_index",
    embedding_function=embedding_function
)

# 2. High accuracy configuration
collections["high_accuracy"] = client.create_collection(
    name="high_accuracy_index",
    embedding_function=embedding_function,
    metadata={"hnsw:space": "cosine", "hnsw:construction_ef": 500, "hnsw:search_ef": 250, "hnsw:M": 36}
)

# 3. Fast search configuration
collections["fast_search"] = client.create_collection(
    name="fast_search_index",
    embedding_function=embedding_function,
    metadata={"hnsw:space": "cosine", "hnsw:construction_ef": 80, "hnsw:search_ef": 40, "hnsw:M": 12}
)

### Generating Sample Documents

Now let's create some sample documents across different categories to populate our collections.

In [3]:
# Generate sample data
num_docs = 5000
print(f"Generating {num_docs} sample documents...")

# Create documents with some patterns for testing
categories = ["technology", "science", "health", "business", "entertainment"]
documents = []
ids = []

for i in range(num_docs):
    category = categories[i % len(categories)]
    document = f"This is document {i} about {category} with some additional text to make it more unique."
    documents.append(document)
    ids.append(f"doc_{i}")

Generating 5000 sample documents...


### Adding Documents to Collections

Let's add the generated documents to all three collections.

In [4]:
# Add documents to all collections
print("Adding documents to collections with different index configurations...")
for name, collection in collections.items():
    collection.add(
        documents=documents,
        ids=ids
    )
    print(f"  Added {num_docs} documents to {name} collection")

Adding documents to collections with different index configurations...
  Added 5000 documents to default collection
  Added 5000 documents to high_accuracy collection
  Added 5000 documents to fast_search collection


### Benchmark Query Performance

Now let's evaluate how each configuration performs with a set of representative queries.

In [None]:
# Benchmark query performance
print("\nBenchmarking query performance across different configurations...")

# Prepare queries
query_texts = [
    "Latest technology trends in artificial intelligence",
    "Scientific research on climate change",
    "Health benefits of regular exercise",
    "Business strategies for startups",
    "Entertainment news about recent movie releases"
]

# Set up benchmark parameters
results = {}
num_trials = 5

In [None]:
# Run benchmark for each collection
for name, collection in collections.items():
    print(f"\nTesting {name} configuration:")
    times = []
    
    for query in query_texts:
        query_times = []
        
        for _ in range(num_trials):
            start_time = time.perf_counter()  # monotonic, high-resolution timer
            collection.query(
                query_texts=[query],
                n_results=10
            )
            query_time = time.perf_counter() - start_time
            query_times.append(query_time)
        
        avg_time = sum(query_times) / len(query_times)
        times.append(avg_time)
        print(f"  Query: '{query[:30]}...': {avg_time:.4f} seconds")
    
    results[name] = {
        "mean": sum(times) / len(times),
        "min": min(times),
        "max": max(times),
        "times": times
    }

In [None]:
# Print summary of benchmark results
print("\nPerformance Summary:")
for name, metrics in results.items():
    print(f"  {name}: Mean={metrics['mean']:.4f}s, Min={metrics['min']:.4f}s, Max={metrics['max']:.4f}s")
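
The benchmark above measures latency only. To see the accuracy side of the tradeoff, one option is to compare the IDs each configuration returns against exact brute-force neighbors and compute recall@k. The sketch below uses toy 2-D vectors; `exact_top_k` and `recall_at_k` are hypothetical helpers, not ChromaDB functions, and in practice the vectors would come from the embedding function and the approximate IDs from `collection.query(...)["ids"][0]`.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: `vectors` maps id -> vector."""
    ranked = sorted(vectors, key=lambda vid: cosine_distance(query, vectors[vid]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate search returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data standing in for real embeddings
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
exact = exact_top_k([1.0, 0.05], vectors, k=2)
approx = ["a", "c"]  # pretend an ANN index returned these ids
print(exact, recall_at_k(approx, exact))
```

Averaging recall@k over the same query set used for timing gives one number per configuration to set against its mean latency.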

## Part 2: Caching Implementation

Caching is a crucial optimization technique for production systems. It can significantly reduce latency and computational load by storing frequently accessed results.

### Why Implement Caching?

1. **Reduced Latency** - A cache hit skips both the embedding computation and the vector search, so the result is returned almost immediately
2. **Lower Computational Costs** - Fewer embedding calculations mean lower GPU/CPU usage
3. **Better Scalability** - Handle more queries with the same resources

We'll implement a simple LRU (Least Recently Used) cache and measure its performance impact.

In [None]:
# Import necessary libraries
import chromadb
from chromadb.utils import embedding_functions
import time
import random

print("\n=== CACHING IMPLEMENTATION ===")

### LRU Cache Implementation

Let's implement a simple LRU (Least Recently Used) cache. This type of cache tracks how recently each entry was accessed and, when full, evicts the entry that has gone unused the longest.

In [None]:
# Implement a simple LRU cache
class LRUCache:
    def __init__(self, capacity=100):
        self.capacity = capacity  # Maximum number of items the cache can hold
        self.cache = {}           # Dictionary to store cache items
        self.usage_order = []     # List to track access order
    
    def get(self, key):
        if key in self.cache:
            # Update usage order - move to end of list (most recently used)
            self.usage_order.remove(key)
            self.usage_order.append(key)
            return self.cache[key]
        return None  # Cache miss
    
    def put(self, key, value):
        if key in self.cache:
            # Update existing entry
            self.cache[key] = value
            self.usage_order.remove(key)
            self.usage_order.append(key)
        else:
            # Add new entry
            if len(self.cache) >= self.capacity:
                # Evict least recently used item
                lru_key = self.usage_order.pop(0)
                del self.cache[lru_key]
            
            self.cache[key] = value
            self.usage_order.append(key)
    
    def clear(self):
        self.cache = {}
        self.usage_order = []
    
    def __len__(self):
        return len(self.cache)
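
The list-based version above is easy to follow, but `list.remove` and `pop(0)` are O(n) in the number of cached entries. That is fine for 50 entries; for larger caches, a variant built on `collections.OrderedDict` keeps every operation O(1). The class name `OrderedDictLRUCache` is hypothetical, and this is a sketch with the same interface rather than a replacement the rest of the notebook depends on.

```python
from collections import OrderedDict


class OrderedDictLRUCache:
    """LRU cache with O(1) get/put, mirroring the LRUCache interface above."""

    def __init__(self, capacity=100):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None  # cache miss
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used

    def clear(self):
        self._items.clear()

    def __len__(self):
        return len(self._items)


cache = OrderedDictLRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # evicts "b"
print(cache.get("b"), cache.get("a"), cache.get("c"))
```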

### Setting Up the Collection

Now let's create a collection and populate it with sample documents for our caching experiment.

In [None]:
# Initialize Chroma
client = chromadb.Client()
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create a collection
collection = client.create_collection(
    name="cache_test",
    embedding_function=embedding_function
)

In [None]:
# Add sample documents
num_docs = 1000
documents = [f"This is a sample document {i} with various content for testing caching" for i in range(num_docs)]
ids = [f"cache_doc_{i}" for i in range(num_docs)]

# Add documents in batches to avoid overwhelming the system
for i in range(0, num_docs, 100):
    end_idx = min(i + 100, num_docs)
    
    collection.add(
        documents=documents[i:end_idx],
        ids=ids[i:end_idx]
    )

print(f"Added {num_docs} documents to the collection")

### Cached Query Function

Let's implement a function that uses our cache to store and retrieve query results.

In [None]:
# Initialize cache with a capacity of 50 entries
query_cache = LRUCache(capacity=50)

# Function to query with caching
def cached_query(query_text, n_results=10, use_cache=True):
    # Create a unique cache key from the query text and number of results
    cache_key = f"{query_text}:{n_results}"
    
    if use_cache:
        # Check cache first
        cached_result = query_cache.get(cache_key)
        if cached_result is not None:
            return cached_result, True  # Cache hit
    
    # Cache miss or cache disabled, perform actual query
    result = collection.query(
        query_texts=[query_text],
        n_results=n_results
    )
    
    if use_cache:
        # Update cache
        query_cache.put(cache_key, result)
    
    return result, False  # Cache miss
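
One thing the function above does not handle is staleness: once documents are added or deleted, cached results may no longer match what the collection would return. A minimal sketch of one way to deal with this is to include a data version in the cache key and bump it after every write, so old entries are never hit again and simply age out of the LRU cache. `make_cache_key` and `data_version` are hypothetical names introduced for illustration.

```python
def make_cache_key(query_text, n_results, data_version=0):
    """Build a hashable cache key.

    Whitespace is collapsed so trivially different spellings of the same
    query share an entry. Case is left alone because whether case matters
    depends on the embedding model.
    """
    normalized = " ".join(query_text.split())
    return (data_version, normalized, n_results)


data_version = 0
key_before = make_cache_key("  sample   document ", 10, data_version)

# After a write such as collection.add(...), bump the version
data_version += 1
key_after = make_cache_key("sample document", 10, data_version)

print(key_before)
print(key_after)
```

Using a tuple instead of a formatted string also avoids any ambiguity about where the query text ends and `n_results` begins.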

### Preparing Query Mix

To simulate a realistic workload, we'll create a mix of common (frequently repeated) and unique queries.

In [None]:
# Test queries with varying cache hit rates
print("\nTesting query performance with caching:")

# Prepare query mix (some repeated, some unique)
common_queries = [
    "document with content",
    "sample document",
    "testing caching",
    "various content"
]

unique_queries = [f"unique query {i}" for i in range(50)]

# Mix queries with different distributions to test cache performance
mixed_queries = []
for _ in range(20):
    # Add common queries (higher probability)
    mixed_queries.extend(common_queries)
    
    # Add some unique queries
    mixed_queries.extend(random.sample(unique_queries, 5))

# Shuffle to ensure realistic query pattern
random.shuffle(mixed_queries)
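
Before running the benchmark, it helps to know the best hit rate this workload could possibly achieve. With an unbounded cache, every distinct query misses exactly once, so the upper bound is `1 - distinct / total`. The helper name `max_hit_rate` is hypothetical; a small hand-made list is used here so the arithmetic is easy to check.

```python
def max_hit_rate(queries):
    """Upper bound on the hit rate: each distinct query must miss once."""
    if not queries:
        return 0.0
    return 1.0 - len(set(queries)) / len(queries)


example = ["a", "b", "a", "c", "a", "b"]  # 3 distinct queries out of 6
print(f"Upper bound on hit rate: {max_hit_rate(example):.1%}")
```

For the mix above there are 20 x (4 + 5) = 180 queries drawn from at most 54 distinct texts, so the bound is at least 1 - 54/180 = 70%. A cache with a capacity of 50 can fall somewhat short of that bound, because some unique queries may be evicted before they repeat.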

### Benchmark: No Cache vs. With Cache

Now let's measure the performance difference between running queries without a cache versus with a cache.

In [None]:
# Run without cache
print("Running queries without cache...")
start_time = time.perf_counter()  # monotonic, high-resolution timer

for query in mixed_queries:
    _, _ = cached_query(query, use_cache=False)

no_cache_time = time.perf_counter() - start_time

In [None]:
# Run with cache
print("Running queries with cache...")
query_cache.clear()  # Clear the cache

start_time = time.perf_counter()  # monotonic, high-resolution timer
hits = 0

for query in mixed_queries:
    _, is_hit = cached_query(query, use_cache=True)
    if is_hit:
        hits += 1

with_cache_time = time.perf_counter() - start_time
hit_rate = hits / len(mixed_queries)

In [None]:
# Report results
print("\nCache Performance Results:")
print(f"  Without cache: {no_cache_time:.4f} seconds")
print(f"  With cache: {with_cache_time:.4f} seconds")
print(f"  Time saved: {no_cache_time - with_cache_time:.4f} seconds ({(1 - with_cache_time/no_cache_time) * 100:.1f}%)")
print(f"  Cache hit rate: {hit_rate:.1%}")
print(f"  Cache size: {len(query_cache)}")

## Part 3: Advanced Scaling Strategies

### Horizontal Scaling Approaches

As your vector database grows beyond the capacity of a single machine, you'll need to implement horizontal scaling strategies. Here are some common approaches:

1. **Sharding** - Partitioning your vector space across multiple instances
   - **By ID range** - Deterministic but may lead to unbalanced shards
   - **By vector clustering** - Better search performance but more complex

2. **Replication** - Creating copies of your data across multiple instances
   - Improves read throughput and fault tolerance
   - Requires synchronization mechanisms for writes

3. **Hybrid approaches** - Combining sharding and replication
   - Example: ChromaDB cluster with data sharded across nodes and each shard replicated
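
The sharding idea can be sketched without any particular database. The example below routes documents by a hash of their ID (a variant of ID-based sharding that tends to balance shards better than contiguous ranges) and merges per-shard results into a global top-k, which is the scatter-gather step every sharded search needs. `shard_for_id` and `merge_top_k` are hypothetical helpers, and the per-shard hits are made-up values standing in for real query results.

```python
import hashlib
import heapq


def shard_for_id(doc_id: str, num_shards: int) -> int:
    """Deterministically map a document id to a shard index."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def merge_top_k(per_shard_results, k):
    """Merge per-shard (distance, doc_id) hits into the k closest overall."""
    all_hits = (hit for hits in per_shard_results for hit in hits)
    return heapq.nsmallest(k, all_hits)


# Routing: the same id always lands on the same shard
print(shard_for_id("doc_42", num_shards=4))

# Scatter-gather: each shard returns its own local top results
shard_a_hits = [(0.10, "doc_1"), (0.40, "doc_7")]
shard_b_hits = [(0.05, "doc_3"), (0.30, "doc_9")]
merged = merge_top_k([shard_a_hits, shard_b_hits], k=3)
print(merged)
```

Each shard must return at least `k` candidates for the merged result to be correct, since all of the global top-k could live on a single shard.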

### Resource Management Best Practices

1. **Memory Optimization**
   - Use quantization to reduce vector size (e.g., 32-bit floats to 8-bit integers, roughly a 4x reduction at some cost in accuracy)
   - Implement disk-based storage for less frequently accessed vectors

2. **CPU Utilization**
   - Batch similar operations
   - Use asynchronous processing where possible

3. **Network Efficiency**
   - Minimize data transfer between components
   - Compress payloads when possible
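
To make the quantization point concrete, here is a minimal sketch of symmetric scalar quantization in plain Python: each float is mapped to an integer in [-127, 127] using a single per-vector scale. This is one simple scheme chosen for illustration, independent of any particular database; the function names are hypothetical, and production systems typically use more elaborate methods.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


vector = [0.12, -0.50, 0.33, 0.0]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(codes)
print(f"max reconstruction error: {max_error:.5f}")
```

Stored as one byte per dimension instead of four, the codes take about a quarter of the space, and the per-component error is bounded by half the scale.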

### Real-world Implementation Considerations

1. **Monitoring and Observability**
   - Track latency, throughput, and error rates
   - Set up alerts for performance degradation

2. **Failure Handling**
   - Implement graceful degradation strategies
   - Consider fallback search methods

3. **Update Strategies**
   - Batch updates to reduce index rebuilding frequency
   - Consider incremental index updates

4. **Hybrid Search Approaches**
   - Combine vector search with keyword search for better results
   - Filter vectors based on metadata before computing distances
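
As a starting point for the monitoring item, the sketch below records query latencies and reports percentiles, which are usually more informative for alerting than the mean. `LatencyTracker` is a hypothetical helper using the nearest-rank percentile definition; a real deployment would more likely export these measurements to an existing metrics system.

```python
import math


class LatencyTracker:
    """Collects latency samples and reports nearest-rank percentiles."""

    def __init__(self):
        self.samples = []

    def record(self, value):
        self.samples.append(value)

    def percentile(self, pct):
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        # Nearest-rank method: smallest value with at least pct% of samples at or below it
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]


tracker = LatencyTracker()
for latency_ms in range(1, 101):  # pretend latencies of 1..100 ms
    tracker.record(latency_ms)

print(f"p50={tracker.percentile(50)} ms, p95={tracker.percentile(95)} ms")
```

In the benchmarks earlier in this notebook, each measured `query_time` could be passed to `record` and an alert raised when, say, p95 drifts above a chosen threshold.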

## Conclusion and Key Takeaways

In this notebook, we've explored practical approaches to scaling vector databases for production use:

1. **ANN Implementations**
   - Configuring HNSW parameters allows for customized speed-accuracy tradeoffs
   - The right configuration depends on your specific application requirements

2. **Caching**
   - Can significantly reduce latency (70%+ in our example, though the exact figure varies between runs because the query mix is random)
   - Most effective when query patterns show temporal locality

3. **Advanced Scaling Strategies**
   - Horizontal scaling through sharding and replication
   - Resource optimization across memory, CPU, and network
   - Operational considerations for production deployments

### Next Steps

1. Experiment with different HNSW configurations on your specific dataset
2. Implement and tune caching based on your application's query patterns
3. Explore ChromaDB's distributed deployment options for horizontal scaling
4. Set up monitoring and observability for your vector database