# Chroma DB basics

Many AI and ML applications need to store, search, and retrieve high-dimensional vector data efficiently. Traditional databases excel at exact matches over structured data, but they are not designed for semantic similarity over embeddings or for approximate nearest neighbor search, both of which are fundamental to modern AI applications.

Chroma DB addresses this challenge by providing a vector database that specializes in storing document embeddings and performing semantic similarity searches. Whether we are building a recommendation system, implementing retrieval-augmented generation (RAG) for large language models, or creating content discovery platforms, Chroma DB offers a streamlined solution for managing vector data at scale.

This notebook will guide us through Chroma DB's core concepts and practical implementation, starting from basic setup and progressing to advanced features like metadata filtering, custom embedding functions, and production deployment considerations.

In [2]:
# Install ChromaDB
!pip install chromadb

# The import cell below also needs sentence-transformers; uncomment if it is not already installed
# !pip install sentence-transformers

ChromaDB can operate in different modes - from a simple in-memory database perfect for experimentation to a persistent client-server setup suitable for production environments.

In [3]:
import chromadb
from chromadb.config import Settings
from chromadb import Documents, EmbeddingFunction, Embeddings
import numpy as np
import tempfile
import os
from typing import List, Dict, Any, cast
from sentence_transformers import SentenceTransformer

### Create Chroma client (in-memory)
We will create a ChromaDB client that works in memory, meaning the data won't be saved after the session ends. (If we would like to keep the data saved to disk, we can use the `PersistentClient` instead). This step prepares ChromaDB for storing and searching data later.


In [4]:
# Create a ChromaDB client (in-memory by default)
client = chromadb.Client()

# Alternative: Create a persistent client that saves data to disk
# client = chromadb.PersistentClient(path="./chroma_db")

print("ChromaDB client initialized successfully!")
print(f"ChromaDB version: {chromadb.__version__}")

ChromaDB client initialized successfully!
ChromaDB version: 1.1.0


The `Client()` constructor creates an in-memory database instance for testing and development. When we need persistence, `PersistentClient()` creates a local database that survives between sessions. The client serves as the main interface for all database operations and manages the connection to ChromaDB's storage engine.

### Understanding collections
Collections in ChromaDB are analogous to tables in traditional databases, but specifically designed for vector data. Each collection stores documents along with their vector embeddings and optional metadata. Understanding how to create and manage collections is fundamental to working with ChromaDB effectively.

Before creating our first collection, let's understand the key concepts:
- Documents: The actual text content we want to store and search.
- Embeddings: Vector representations of our documents (automatically generated or custom).
- Metadata: Additional structured information about each document.
- IDs: Unique identifiers for each document in the collection.

In [5]:
# Create a new collection
collection = client.create_collection(
    name="my_documents",
    metadata={"description": "A collection of sample documents for learning ChromaDB"}
)

# Alternative: switch `create_collection` to `get_or_create_collection` to avoid creating a new collection every time
# collection = client.get_or_create_collection(name="my_documents")

print(f"Collection created: {collection.name}")
print(f"Collection metadata: {collection.metadata}")
print(f"Collection count: {collection.count()}")  # Should be 0 initially

Collection created: my_documents
Collection metadata: {'description': 'A collection of sample documents for learning ChromaDB'}
Collection count: 0


The `create_collection()` method instantiates a new vector space within ChromaDB. The collection object provides methods for adding, querying, and managing documents. ChromaDB automatically handles the underlying vector indexing and storage optimization. The metadata parameter allows us to store collection-level information that can be useful for organization and debugging.

Let's also explore how to list and manage existing collections:

In [6]:
# List all collections
collections = client.list_collections()
print("Available collections:")
for col in collections:
    print(f"  - {col.name}: {col.count()} documents")

# Delete a collection (be careful with this!)
# client.delete_collection(name="my_documents")

# Get an existing collection
existing_collection = client.get_collection(name="my_documents")
print(f"\nRetrieved collection: {existing_collection.name}")

Available collections:
  - my_documents: 0 documents

Retrieved collection: my_documents


ChromaDB maintains a registry of all collections, which allows for easy discovery and management. The `list_collections()` method returns collection objects, not just names, giving us access to their properties and methods. This design pattern makes it easy to iterate over multiple collections and perform batch operations.

### Adding documents to collections
Now that we have a collection, let's explore how to add documents. ChromaDB provides flexibility in how we handle embeddings - we can let it generate them automatically using default embedding functions, or provide our own custom embeddings.

Understanding the data structure is crucial here. Each call to `add()` takes parallel lists with one entry per document:
- documents: List of text strings
- ids: Unique identifiers (strings)
- metadatas: Optional list of dictionaries with additional information
- embeddings: Optional custom vectors (if not provided, ChromaDB generates them)
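
Because these arguments are parallel lists, a mismatch in length or a duplicated ID is an easy mistake to make. The sketch below is a hypothetical pre-flight check (`validate_batch` is not part of ChromaDB) that we could run before calling `add()`. It assumes only what is described above: one entry per document, unique string IDs, and embeddings of one consistent dimension.

```python
from typing import Any, Optional, Sequence


def validate_batch(
    ids: Sequence[str],
    documents: Optional[Sequence[str]] = None,
    metadatas: Optional[Sequence[dict[str, Any]]] = None,
    embeddings: Optional[Sequence[Sequence[float]]] = None,
) -> None:
    """Raise ValueError if the parallel lists are inconsistent (illustrative helper)."""
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within a batch")
    # Every list that is supplied must have one entry per ID
    for name, values in (("documents", documents),
                         ("metadatas", metadatas),
                         ("embeddings", embeddings)):
        if values is not None and len(values) != len(ids):
            raise ValueError(f"{name} has {len(values)} items, expected {len(ids)}")
    if documents is None and embeddings is None:
        raise ValueError("provide documents, embeddings, or both")
    # All vectors in the batch must share one dimension
    if embeddings is not None and len({len(vec) for vec in embeddings}) > 1:
        raise ValueError("all embeddings must have the same dimension")


# A consistent batch passes silently
validate_batch(
    ids=["doc_1", "doc_2"],
    documents=["first text", "second text"],
    metadatas=[{"category": "database"}, {"category": "ml"}],
)
```

Running a check like this first gives a clearer error message than a failure halfway through ingestion.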

In [7]:
# Sample documents for demonstration
sample_documents = [
    "ChromaDB is a vector database designed for storing and querying embeddings.",
    "Machine learning models can convert text into high-dimensional vectors.",
    "Semantic search allows finding similar content based on meaning rather than keywords.",
    "Vector databases are essential for modern AI applications like RAG systems.",
    "Embeddings capture semantic relationships between different pieces of text."
]

# Corresponding metadata for each document
sample_metadata = [
    {"category": "database", "topic": "vector_db", "difficulty": "beginner"},
    {"category": "ml", "topic": "embeddings", "difficulty": "intermediate"},
    {"category": "search", "topic": "semantic", "difficulty": "intermediate"},
    {"category": "database", "topic": "ai_applications", "difficulty": "advanced"},
    {"category": "ml", "topic": "nlp", "difficulty": "beginner"}
]

# Generate unique IDs for our documents
document_ids = [f"doc_{i+1}" for i in range(len(sample_documents))]

# Add documents to the collection (ChromaDB will generate embeddings automatically)
collection.add(
    documents=sample_documents,
    metadatas=sample_metadata,
    ids=document_ids
)

print(f"Added {len(sample_documents)} documents to the collection.")
print(f"Collection now contains {collection.count()} documents.")

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 62.9MiB/s]


Added 5 documents to the collection.
Collection now contains 5 documents.


When we call `add()` without providing embeddings, ChromaDB uses its default collection-level embedding function (typically a sentence transformer model) to generate vector representations of our documents. The embedding process happens automatically and efficiently, with ChromaDB handling the model loading and inference. The metadata is stored alongside the vectors and can be used for both filtering and result enrichment.

#### Adding documents to collections with custom embeddings
Let's also see how to add documents with custom embeddings:

In [8]:
# Example: Adding documents with custom embeddings (In practice, we would use a proper embedding model)
custom_embeddings = np.random.rand(2, 384).tolist()  # 384-dimensional vectors

custom_documents = [
    "This document has a custom embedding vector.",
    "Another document with manually specified embeddings."
]

custom_ids = ["custom_1", "custom_2"]
custom_metadata = [
    {"source": "custom", "type": "example"},
    {"source": "custom", "type": "demo"}
]

collection.add(
    documents=custom_documents,
    embeddings=custom_embeddings,
    metadatas=custom_metadata,
    ids=custom_ids
)

print(f"Collection now contains {collection.count()} documents total.")

Collection now contains 7 documents total.


When we provide custom embeddings, ChromaDB bypasses its default embedding function and uses our vectors directly. This is useful when we have pre-computed embeddings from specialized models or when we need fine-grained control over the embedding process. The **embedding dimensions must be consistent across all documents in a collection**.

### Querying and semantic search
The real power of ChromaDB lies in its querying capabilities. Unlike traditional databases that rely on exact matches, ChromaDB performs semantic similarity searches, finding documents that are conceptually similar to our query even if they don't share exact keywords.

Understanding query mechanics is essential. ChromaDB converts our query text into an embedding vector, then finds the documents whose embeddings are closest to it under the collection's distance metric (squared L2 by default; cosine and inner product can be configured per collection).

In [9]:
# Basic semantic search query
query_text = "What are vector databases used for?"

# Perform a similarity search
results = collection.query(
    query_texts=[query_text],
    n_results=3  # Return top 3 most similar documents
)

print(f"Query: '{query_text}'")
print("\nTop 3 similar documents:")
for i, (doc, distance, metadata, doc_id) in enumerate(zip(
    results['documents'][0],
    results['distances'][0],
    results['metadatas'][0],
    results['ids'][0]
)):
    print(f"\n{i+1}. Document ID: {doc_id}")
    print(f"   Distance: {distance:.4f}")
    print(f"   Category: {metadata['category']}")
    print(f"   Content: {doc[:100]}...")

Query: 'What are vector databases used for?'

Top 3 similar documents:

1. Document ID: doc_4
   Distance: 0.7018
   Category: database
   Content: Vector databases are essential for modern AI applications like RAG systems....

2. Document ID: doc_1
   Distance: 0.8530
   Category: database
   Content: ChromaDB is a vector database designed for storing and querying embeddings....

3. Document ID: doc_2
   Distance: 1.3882
   Category: ml
   Content: Machine learning models can convert text into high-dimensional vectors....


The `query()` method converts our text into an embedding using the same function used for document ingestion, ensuring consistency. It then performs an approximate nearest neighbor search through the stored vectors, ranking results by similarity score (distance). Lower distance values indicate higher similarity.
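
To make "lower distance means higher similarity" concrete, here is a minimal brute-force sketch in plain Python. It is not how ChromaDB searches internally (ChromaDB uses an approximate index rather than scoring every vector), and the three-dimensional vectors are made up for illustration. The function names `squared_l2`, `cosine_distance`, and `top_k` are our own.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance: smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, vectors, k, distance=squared_l2):
    """Score every stored vector and return the k closest (id, distance) pairs."""
    scored = [(vec_id, distance(query, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 3-dimensional "embeddings", invented for this example
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.2, 0.0]

for vec_id, dist in top_k(query, vectors, k=2):
    print(f"{vec_id}: {dist:.4f}")
```

Passing `distance=cosine_distance` to `top_k` ranks by direction only and ignores vector length, which is why the choice of metric can change the ordering.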

Let's explore more advanced querying options.

#### Metadata filtering
Metadata filtering (`where` parameter) applies constraints before the similarity search, effectively creating a subset of documents to search within. This is computationally efficient as it reduces the search space.

In [10]:
# Query with metadata filtering
filtered_results = collection.query(
    query_texts=["machine learning concepts"],
    n_results=5,
    where={"category": "ml"}  # Only return documents with category = "ml"
)

print("Filtered query results (ML category only):")
for i, (doc, metadata) in enumerate(zip(
    filtered_results['documents'][0],
    filtered_results['metadatas'][0]
)):
    print(f"{i+1}. {metadata['topic']}: {doc[:80]}...")

Filtered query results (ML category only):
1. embeddings: Machine learning models can convert text into high-dimensional vectors....
2. nlp: Embeddings capture semantic relationships between different pieces of text....


Note that only two documents come back even though we asked for `n_results=5`: the `where` condition restricts the candidate set to documents whose `category` is `ml`, and similarity ranking is applied only within that subset.

#### Batch querying
Batch querying sends multiple queries in a single call, which avoids the per-call overhead of issuing them one at a time. The results are nested per query: `batch_results['documents'][i]` holds the matches for the i-th entry of `query_texts`.

In [11]:
# Multiple queries at once (batch processing)
batch_queries = [
    "database technology",
    "text processing",
    "artificial intelligence"
]

batch_results = collection.query(
    query_texts=batch_queries,
    n_results=2
)

print(f"Batch query results for {len(batch_queries)} queries:")
for query_idx, query in enumerate(batch_queries):
    print(f"\nQuery: '{query}'")
    for doc_idx, doc in enumerate(batch_results['documents'][query_idx]):
        print(f"  {doc_idx+1}. {doc[:60]}...")

Batch query results for 3 queries:

Query: 'database technology'
  1. ChromaDB is a vector database designed for storing and query...
  2. Vector databases are essential for modern AI applications li...

Query: 'text processing'
  1. Machine learning models can convert text into high-dimension...
  2. Embeddings capture semantic relationships between different ...

Query: 'artificial intelligence'
  1. Vector databases are essential for modern AI applications li...
  2. Machine learning models can convert text into high-dimension...
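

Since batch results are nested one level per query, it is often convenient to flatten them into rows before further processing. The sketch below assumes only the nested shape used above (`results['ids'][query_idx]` and `results['documents'][query_idx]`). The helper `flatten_results` and the hand-written `mock_results` dictionary are illustrative stand-ins, so the snippet runs without a ChromaDB client.

```python
def flatten_results(query_texts, results):
    """Turn nested per-query results into one flat list of row dictionaries."""
    rows = []
    for query_idx, query in enumerate(query_texts):
        ids = results["ids"][query_idx]
        docs = results["documents"][query_idx]
        for rank, (doc_id, doc) in enumerate(zip(ids, docs), start=1):
            rows.append({"query": query, "rank": rank, "id": doc_id, "document": doc})
    return rows


# Hand-written stand-in with the same nested shape as a batch query result
mock_results = {
    "ids": [["doc_1", "doc_4"], ["doc_2", "doc_5"]],
    "documents": [
        ["ChromaDB is a vector database designed for storing and querying embeddings.",
         "Vector databases are essential for modern AI applications like RAG systems."],
        ["Machine learning models can convert text into high-dimensional vectors.",
         "Embeddings capture semantic relationships between different pieces of text."],
    ],
}

rows = flatten_results(["database technology", "text processing"], mock_results)
for row in rows:
    print(row["query"], row["rank"], row["id"])
```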


### Advanced metadata filtering
ChromaDB's metadata filtering system provides powerful ways to constrain our searches based on structured information. This capability is crucial for building sophisticated applications that need to combine semantic similarity with specific criteria.

Let's explore the various filtering operators and patterns available. ChromaDB supports a MongoDB-like query syntax for metadata filtering, making it intuitive for developers familiar with document databases.

In [12]:
# First, let's add more documents with richer metadata for demonstration
extended_documents = [
    "Python is a versatile programming language used in data science.",
    "JavaScript enables dynamic web applications and user interactions.",
    "Docker containers provide consistent deployment environments.",
    "Kubernetes orchestrates containerized applications at scale.",
    "TensorFlow is a machine learning framework for building neural networks.",
    "React creates interactive user interfaces for web applications."
]

extended_metadata = [
    {"language": "Python", "domain": "data_science", "popularity": 95, "year": 1991},
    {"language": "JavaScript", "domain": "web_development", "popularity": 98, "year": 1995},
    {"language": "Docker", "domain": "devops", "popularity": 87, "year": 2013},
    {"language": "Kubernetes", "domain": "devops", "popularity": 82, "year": 2014},
    {"language": "Python", "domain": "machine_learning", "popularity": 91, "year": 1991},
    {"language": "JavaScript", "domain": "web_development", "popularity": 96, "year": 1995}
]

extended_ids = [f"tech_{i+1}" for i in range(len(extended_documents))]

collection.add(
    documents=extended_documents,
    metadatas=extended_metadata,
    ids=extended_ids
)

print(f"Added {len(extended_documents)} more documents. Total: {collection.count()}")

Added 6 more documents. Total: 13


Here, we are adding documents with more complex metadata structures that include different data types (strings, integers) and multiple fields. This creates a richer dataset for demonstrating advanced filtering capabilities.

Now let's explore various filtering operators.

#### Exact match filtering

In [13]:
# Exact match filtering
python_docs = collection.query(
    query_texts=["programming language"],
    n_results=10,
    where={"language": "Python"}
)

print("Documents about Python:")
for doc, meta in zip(python_docs['documents'][0], python_docs['metadatas'][0]):
    print(f"  - {meta['domain']}: {doc[:50]}...")

Documents about Python:
  - data_science: Python is a versatile programming language used in...
  - machine_learning: TensorFlow is a machine learning framework for bui...


#### Numerical comparisons

In [14]:
# Numerical comparisons
popular_tech = collection.query(
    query_texts=["technology tools"],
    n_results=10,
    where={"popularity": {"$gt": 90}}  # Popularity greater than 90
)

print(f"Popular technologies (>90 popularity):")
for doc, meta in zip(popular_tech['documents'][0], popular_tech['metadatas'][0]):
    print(f"  - {meta['language']} ({meta['popularity']}): {doc[:50]}...")

Popular technologies (>90 popularity):
  - JavaScript (98): JavaScript enables dynamic web applications and us...
  - JavaScript (96): React creates interactive user interfaces for web ...
  - Python (95): Python is a versatile programming language used in...
  - Python (91): TensorFlow is a machine learning framework for bui...


The `$gt`, `$gte`, `$lt`, and `$lte` operators work with numerical values.

#### Multiple conditions

In [15]:
# Multiple conditions with AND logic
recent_popular = collection.query(
    query_texts=["modern technology"],
    n_results=10,
    where={
        "$and": [
            {"popularity": {"$gte": 85}},  # Greater than or equal to 85
            {"year": {"$gt": 2000}}        # Created after 2000
        ]
    }
)

print(f"Recent and popular technologies:")
for doc, meta in zip(recent_popular['documents'][0], recent_popular['metadatas'][0]):
    print(f"  - {meta['language']} ({meta['year']}): {doc[:50]}...")


# Multiple conditions with OR logic
web_or_data_science = collection.query(
    query_texts=["development frameworks"],
    n_results=10,
    where={
        "$or": [
            {"domain": "web_development"},
            {"domain": "data_science"}
        ]
    }
)

print("\nWeb development or data science documents:")
for doc, meta in zip(web_or_data_science['documents'][0], web_or_data_science['metadatas'][0]):
    print(f"  - {meta['domain']}: {doc[:50]}...")

Recent and popular technologies:
  - Docker (2013): Docker containers provide consistent deployment en...

Web development or data science documents:
  - web_development: React creates interactive user interfaces for web ...
  - web_development: JavaScript enables dynamic web applications and us...
  - data_science: Python is a versatile programming language used in...


The `$and` and `$or` operators enable complex logical combinations.

#### Inclusion filtering

In [16]:
# Inclusion filtering with $in operator
specific_languages = collection.query(
    query_texts=["programming tools"],
    n_results=10,
    where={"language": {"$in": ["Python", "JavaScript"]}}
)

print(f"\nPython or JavaScript documents:")
for doc, meta in zip(specific_languages['documents'][0], specific_languages['metadatas'][0]):
    print(f"  - {meta['language']}: {doc[:50]}...")


Python or JavaScript documents:
  - Python: Python is a versatile programming language used in...
  - JavaScript: JavaScript enables dynamic web applications and us...
  - Python: TensorFlow is a machine learning framework for bui...
  - JavaScript: React creates interactive user interfaces for web ...


#### Nested filtering

In [17]:
# Complex nested conditions
complex_filter = collection.query(
    query_texts=["software development"],
    n_results=10,
    where={
        "$and": [
            {
                "$or": [
                    {"domain": "web_development"},
                    {"domain": "devops"}
                ]
            },
            {"popularity": {"$gte": 85}}
        ]
    }
)

print(f"\nComplex filter results:")
for doc, meta in zip(complex_filter['documents'][0], complex_filter['metadatas'][0]):
    print(f"  - {meta['language']} in {meta['domain']}: {doc[:50]}...")


Complex filter results:
  - JavaScript in web_development: JavaScript enables dynamic web applications and us...
  - JavaScript in web_development: React creates interactive user interfaces for web ...
  - Docker in devops: Docker containers provide consistent deployment en...


The nested query structure shows that `$and` and `$or` can be combined to express complex logical conditions. Conceptually, documents that fail the metadata conditions are eliminated, and similarity ranking applies only to those that remain.
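
To see how such a filter is interpreted, here is a small pure-Python evaluator for the operators used in this section. It is a sketch of the semantics only, not ChromaDB's implementation, and the function name `matches` is our own. It also assumes that a document missing the filtered field never matches.

```python
import operator

# Comparison operators that appear in the filters in this section
_COMPARATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
}


def matches(where, metadata):
    """Return True if a metadata dict satisfies a MongoDB-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(sub, metadata) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(sub, metadata) for sub in condition):
                return False
        elif key not in metadata:
            return False  # assumption: a missing field never matches
        elif isinstance(condition, dict):
            value = metadata[key]
            if not all(_COMPARATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif metadata[key] != condition:
            return False  # bare value means exact match
    return True


extended_metadata = [
    {"language": "Python", "domain": "data_science", "popularity": 95, "year": 1991},
    {"language": "JavaScript", "domain": "web_development", "popularity": 98, "year": 1995},
    {"language": "Docker", "domain": "devops", "popularity": 87, "year": 2013},
    {"language": "Kubernetes", "domain": "devops", "popularity": 82, "year": 2014},
    {"language": "Python", "domain": "machine_learning", "popularity": 91, "year": 1991},
    {"language": "JavaScript", "domain": "web_development", "popularity": 96, "year": 1995},
]

nested = {
    "$and": [
        {"$or": [{"domain": "web_development"}, {"domain": "devops"}]},
        {"popularity": {"$gte": 85}},
    ]
}

selected = [meta["language"] for meta in extended_metadata if matches(nested, meta)]
print(selected)
```

The evaluator selects the same three documents as the nested query above (both JavaScript entries and Docker); Kubernetes is excluded because its popularity of 82 is below 85.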

### Custom embedding functions
While ChromaDB's default embedding function works well for general use cases, we might need custom embedding functions for specialized domains, different languages, or specific model requirements. Understanding how to implement and use custom embedding functions gives us complete control over how our text is vectorized.

Custom embedding functions must implement ChromaDB's EmbeddingFunction interface, which defines how text gets converted to vectors. This abstraction allows us to integrate any embedding model while maintaining compatibility with ChromaDB's query system.

#### Example 1: Simple custom embedding function using random vectors


In [19]:
# Simple custom embedding function using random vectors (In practice, we would use a real embedding model like sentence-transformers)
class SimpleEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        """
        Convert documents to embeddings.

        Args:
            input: List of document strings

        Returns:
            List of embedding vectors
        """
        embeddings = []

        for doc in input:
            # Simple hash-based embedding (don't use in production!)
            # Note: Python salts str hashes per process (unless PYTHONHASHSEED is set),
            # so these vectors are stable within a session but differ between sessions
            doc_hash = hash(doc) % 1000000
            np.random.seed(doc_hash)
            embedding = np.random.rand(384).tolist()  # 384-dimensional vector
            embeddings.append(embedding)

        return cast(Embeddings, embeddings)

# Create a collection with custom embedding function
custom_embedding_fn = SimpleEmbeddingFunction()

custom_collection = client.create_collection(
    name="custom_embeddings_demo",
    embedding_function=custom_embedding_fn
)

print("Created collection with custom embedding function")

# Add documents to our simple custom embedding collection
test_docs = [
    "Natural language processing enables computers to understand human language.",
    "Deep learning models can learn complex patterns from large datasets.",
    "Transformers revolutionized the field of artificial intelligence."
]

custom_collection.add(
    documents=test_docs,
    ids=[f"custom_doc_{i}" for i in range(len(test_docs))],
    metadatas=[{"source": "custom_embedding"} for _ in test_docs]
)

print("Added documents using simple custom embedding function")

Created collection with custom embedding function
Added documents using simple custom embedding function



The `EmbeddingFunction` interface ensures consistency between document ingestion and querying. The `__call__` method receives a list of documents and must return a list of embeddings with consistent dimensions. ChromaDB caches the embedding function with the collection, so queries automatically use the same embedding logic as document ingestion.
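
One caveat for toy embeddings like the one above: Python's built-in `hash()` for strings is randomized per process unless `PYTHONHASHSEED` is set, so a query in a new session would not reproduce the vectors stored earlier. The following sketch shows one way to get a stable toy vector by seeding a local generator from a SHA-256 digest. The function name `stable_toy_embedding` is hypothetical, and the vectors are still pseudo-random, so they carry no semantic meaning.

```python
import hashlib
import random


def stable_toy_embedding(text: str, dim: int = 384) -> list[float]:
    """Map text to a pseudo-random vector that is identical in every Python process."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return [rng.random() for _ in range(dim)]


vec_a = stable_toy_embedding("Deep learning models can learn complex patterns.")
vec_b = stable_toy_embedding("Deep learning models can learn complex patterns.")
print(len(vec_a), vec_a == vec_b)  # prints: 384 True
```

The body of `__call__` in a class like `SimpleEmbeddingFunction` could apply a function of this kind to each input document.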

#### Example 2: Using sentence-transformers for custom embeddings
Let's implement a more realistic custom embedding function using sentence-transformers:

In [21]:
# Using sentence-transformers for custom embeddings
class SentenceTransformerEmbedding(EmbeddingFunction):
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize with a specific sentence-transformer model.

        Args:
            model_name: Name of the sentence-transformer model to use
        """
        self.model = SentenceTransformer(model_name)

    def __call__(self, input: Documents) -> Embeddings:
        """Generate embeddings using sentence-transformers."""
        embeddings = self.model.encode(input, convert_to_tensor=False)
        return embeddings.tolist()

# Create collection with sentence-transformer embeddings
st_embedding_fn = SentenceTransformerEmbedding("all-MiniLM-L6-v2")

st_collection = client.get_or_create_collection(
    name="sentence_transformer_demo",
    embedding_function=st_embedding_fn
)

print("Created collection with sentence-transformer embeddings")

# Add some documents to test
test_docs = [
    "Natural language processing enables computers to understand human language.",
    "Deep learning models can learn complex patterns from large datasets.",
    "Transformers revolutionized the field of artificial intelligence."
]

st_collection.add(
    documents=test_docs,
    ids=[f"st_doc_{i}" for i in range(len(test_docs))],
    metadatas=[{"source": "sentence_transformer"} for _ in test_docs]
)

print("Added documents using sentence-transformer embedding function")

Created collection with sentence-transformer embeddings
Added documents using sentence-transformer embedding function


Sentence-transformers models produce embeddings that are well suited to semantic similarity tasks. The custom embedding function encapsulates model loading and inference, making it reusable across collections. Because the function is attached to the collection object, the same model is applied at ingestion and at query time.

#### Query using the custom embedding collection
Now let's query `custom_collection`, which uses the hash-based `SimpleEmbeddingFunction` from Example 1:

In [22]:
# Query using the custom embedding collection
query_result = custom_collection.query(
    query_texts=["machine learning artificial intelligence"],
    n_results=3
)

print("Query results using custom embedding function:")
for i, (doc, distance) in enumerate(zip(
    query_result['documents'][0],
    query_result['distances'][0]
)):
    print(f"{i+1}. Distance: {distance:.4f}")
    print(f"   Document: {doc}")
    print()

Query results using custom embedding function:
1. Distance: 63.3140
   Document: Deep learning models can learn complex patterns from large datasets.

2. Distance: 63.3435
   Document: Natural language processing enables computers to understand human language.

3. Distance: 64.3630
   Document: Transformers revolutionized the field of artificial intelligence.



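When querying, ChromaDB applies the collection's embedding function to the query text, so the query and the documents live in the same vector space. Here, however, `custom_collection` uses the hash-based `SimpleEmbeddingFunction`, whose vectors are pseudo-random, so the ranking carries no semantic meaning: every distance sits near 64, the expected squared L2 distance between two independent uniform random vectors in 384 dimensions (384 × 1/6). Running the same query against `st_collection` would produce distances that reflect semantic similarity.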
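
The distances of roughly 63 to 64 in the output above are what we would expect from random vectors rather than from a real model. As a quick sanity check, the simulation below estimates the mean squared L2 distance between pairs of vectors drawn uniformly from [0, 1) in 384 dimensions. It assumes the collection uses squared L2 distance, which is consistent with the values shown. The per-coordinate expectation is 1/6, so the theoretical mean is 384 / 6 = 64.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_random_distance(dim=384, trials=500, seed=0):
    """Average squared L2 distance between pairs of uniform random vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.random() for _ in range(dim)]
        b = [rng.random() for _ in range(dim)]
        total += squared_l2(a, b)
    return total / trials


estimate = mean_random_distance()
print(f"simulated mean: {estimate:.2f}, theory: {384 / 6:.2f}")
```

A real embedding model would instead give clearly smaller distances for related texts than for unrelated ones.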

### Data management and updates
Real-world applications require robust data management capabilities. ChromaDB provides methods to update, delete, and retrieve documents efficiently. Understanding these operations is crucial for maintaining dynamic datasets and handling evolving content.

Data management in ChromaDB operates on document IDs, making it essential to maintain consistent ID strategies. Let's explore the various data management operations and their implications.

#### Get documents in the collection

In [23]:
# First, let's check what documents we have in our main collection
all_docs = collection.get()
print(f"Current collection contains {len(all_docs['ids'])} documents:")
for doc_id, doc, metadata in zip(all_docs['ids'], all_docs['documents'], all_docs['metadatas']):
    print(f"  {doc_id}: {doc[:50]}... (Category: {metadata.get('category', 'N/A')})")

Current collection contains 13 documents:
  doc_1: ChromaDB is a vector database designed for storing... (Category: database)
  doc_2: Machine learning models can convert text into high... (Category: ml)
  doc_3: Semantic search allows finding similar content bas... (Category: search)
  doc_4: Vector databases are essential for modern AI appli... (Category: database)
  doc_5: Embeddings capture semantic relationships between ... (Category: ml)
  custom_1: This document has a custom embedding vector.... (Category: N/A)
  custom_2: Another document with manually specified embedding... (Category: N/A)
  tech_1: Python is a versatile programming language used in... (Category: N/A)
  tech_2: JavaScript enables dynamic web applications and us... (Category: N/A)
  tech_3: Docker containers provide consistent deployment en... (Category: N/A)
  tech_4: Kubernetes orchestrates containerized applications... (Category: N/A)
  tech_5: TensorFlow is a machine learning framework for bui... (Category: N/A)
  tech_6: React creates interactive user interfaces for web ... (Category: N/A)

The `get()` method without parameters retrieves all documents in the collection. This is useful for inventory management and understanding our dataset's current state. For large collections, we should use filtering or pagination to avoid loading too much data at once.
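
For pagination, `get()` accepts `limit` and `offset` arguments. The sketch below shows one way to walk a collection page by page. So that it runs without a ChromaDB client, it uses a hypothetical stand-in class, `ListBackedCollection`, which mimics only the `get(limit=..., offset=...)` signature. With a real collection, we would pass `collection` to `iter_pages` instead.

```python
class ListBackedCollection:
    """Hypothetical stand-in that mimics only `get(limit=..., offset=...)`."""

    def __init__(self, ids, documents):
        self._ids = list(ids)
        self._documents = list(documents)

    def get(self, limit=None, offset=0):
        end = None if limit is None else offset + limit
        return {"ids": self._ids[offset:end], "documents": self._documents[offset:end]}


def iter_pages(collection, page_size=5):
    """Yield successive pages from `collection.get` until an empty page comes back."""
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset)
        if not page["ids"]:
            break
        yield page
        offset += len(page["ids"])


fake = ListBackedCollection(
    ids=[f"doc_{i}" for i in range(13)],
    documents=[f"text {i}" for i in range(13)],
)
page_sizes = [len(page["ids"]) for page in iter_pages(fake, page_size=5)]
print(page_sizes)  # prints: [5, 5, 3]
```

Processing one page at a time keeps memory usage bounded regardless of collection size.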

#### Update existing documents in the collection
Now let's explore update operations:


In [24]:
# Let's update the first document with new content and metadata
updated_document = "ChromaDB is an advanced vector database optimized for AI applications and semantic search."
updated_metadata = {"category": "database", "topic": "vector_db", "difficulty": "beginner", "updated": True}

collection.update(
    ids=["doc_1"],
    documents=[updated_document],
    metadatas=[updated_metadata]
)

# Verify the update
updated_doc = collection.get(ids=["doc_1"])
print("Updated document:")
print(f"Content: {updated_doc['documents'][0]}")
print(f"Metadata: {updated_doc['metadatas'][0]}")

# Update multiple documents at once
collection.update(
    ids=["doc_2", "doc_3"],
    metadatas=[
        {"category": "ml", "topic": "embeddings", "difficulty": "intermediate", "batch_updated": True},
        {"category": "search", "topic": "semantic", "difficulty": "intermediate", "batch_updated": True}
    ]
    # Note: Here, we are not updating documents content, only metadata
)

print("\nBatch updated documents 2 and 3 metadata")

Updated document:
Content: ChromaDB is an advanced vector database optimized for AI applications and semantic search.
Metadata: {'category': 'database', 'topic': 'vector_db', 'updated': True, 'difficulty': 'beginner'}

Batch updated documents 2 and 3 metadata


The `update()` method modifies existing documents in-place. When we update a document's text content, ChromaDB automatically regenerates its embedding using the collection's embedding function. Updating only metadata is more efficient as it doesn't require re-embedding. The operation is atomic for each document but not across the entire batch.

#### Upsert operation (update if exists, insert if not)
Let's explore upsert operations:

In [25]:
# Upsert operation (update if exists, insert if not)
collection.upsert(
    ids=["doc_new", "doc_1"],  # doc_new doesn't exist, doc_1 exists
    documents=[
        "This is a completely new document added via upsert.",
        "ChromaDB is a powerful vector database for modern AI workflows."  # Updated content for doc_1
    ],
    metadatas=[
        {"category": "example", "topic": "upsert", "difficulty": "beginner"},
        {"category": "database", "topic": "vector_db", "difficulty": "beginner", "upserted": True}
    ]
)

print("Performed upsert operation")
print(f"Collection now has {collection.count()} documents")

# Retrieve specific documents to verify changes
specific_docs = collection.get(ids=["doc_new", "doc_1"])
print("\nDocuments after upsert:")
for doc_id, doc, metadata in zip(specific_docs['ids'], specific_docs['documents'], specific_docs['metadatas']):
    print(f"{doc_id}: {doc[:60]}...")
    print(f"  Metadata keys: {list(metadata.keys())}")

Performed upsert operation
Collection now has 14 documents

Documents after upsert:
doc_1: ChromaDB is a powerful vector database for modern AI workflo...
  Metadata keys: ['topic', 'updated', 'category', 'upserted', 'difficulty']
doc_new: This is a completely new document added via upsert....
  Metadata keys: ['category', 'difficulty', 'topic']


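Upsert operations combine the logic of insert and update, making them ideal for scenarios where we are not sure if a document already exists. This is particularly useful in ETL pipelines or when synchronizing data from external sources. ChromaDB determines whether to insert or update based on ID existence. Notice in the output that `doc_1` still carries the `updated` key from the earlier `update()` call even though the upsert did not supply it: the new metadata keys were merged into the existing metadata rather than replacing it.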
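
The insert-or-update decision can be modeled with a plain dictionary keyed by ID. This is a toy model, not ChromaDB's implementation. The `upsert` function and the `store` layout are our own, and the metadata merge mirrors what the output above shows for `doc_1`, where the earlier `updated` key is still present after the upsert.

```python
def upsert(store, doc_id, document, metadata):
    """Insert a new record or update an existing one, keyed by ID."""
    if doc_id in store:
        record = store[doc_id]
        record["document"] = document
        # Merge new keys into the existing metadata instead of replacing it
        record["metadata"].update(metadata)
        return "updated"
    store[doc_id] = {"document": document, "metadata": dict(metadata)}
    return "inserted"


store = {
    "doc_1": {
        "document": "ChromaDB is an advanced vector database.",
        "metadata": {"category": "database", "updated": True},
    }
}

print(upsert(store, "doc_new", "A new document.", {"category": "example"}))
print(upsert(store, "doc_1", "ChromaDB is a powerful vector database.", {"upserted": True}))
print(sorted(store["doc_1"]["metadata"]))
```

If stale keys must not survive, one option is to delete the document by ID and add it again.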

#### Deletion operation
Now let's look at deletion operations:

In [26]:
# Delete specific documents
collection.delete(ids=["doc_new"])
print("Deleted doc_new")

# Delete documents based on metadata filtering
collection.delete(where={"batch_updated": True})
print("Deleted all documents with batch_updated=True")

print(f"Collection now has {collection.count()} documents")

# Get remaining documents to see what's left
remaining = collection.get()
print("\nRemaining documents:")
for doc_id in remaining['ids']:
    print(f"  - {doc_id}")

Deleted doc_new
Deleted all documents with batch_updated=True
Collection now has 11 documents

Remaining documents:
  - doc_1
  - doc_4
  - doc_5
  - custom_1
  - custom_2
  - tech_1
  - tech_2
  - tech_3
  - tech_4
  - tech_5
  - tech_6


ChromaDB supports both ID-based and metadata-based deletion. ID-based deletion is more efficient as it directly targets specific documents. Metadata-based deletion first filters documents matching the criteria, then removes them. In both cases the deleted documents no longer appear in `get()` or `query()` results; whether the underlying disk space is reclaimed right away depends on the storage backend.
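Because a metadata-based delete can remove more than we intended, it can be useful to see which IDs match a filter before deleting them. The following is a minimal sketch under that assumption; `delete_with_preview` is a hypothetical helper that only relies on the collection's `get(where=...)` and `delete(ids=...)` methods.

```python
from typing import Any, Dict, List


def delete_with_preview(collection: Any, where: Dict[str, Any]) -> List[str]:
    """Look up the IDs matching `where`, delete them by ID, and return them."""
    matching_ids = collection.get(where=where)["ids"]
    # Skip the delete call entirely when nothing matches the filter.
    if matching_ids:
        collection.delete(ids=matching_ids)
    return matching_ids


# Usage with the collection from this notebook (commented out so the block
# runs on its own):
#   removed = delete_with_preview(collection, {"batch_updated": True})
#   print(f"Removed {len(removed)} documents: {removed}")
```

Returning the IDs gives us something to log or to compare against an expected count before the change is considered done.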

### Collection management and persistence
Understanding how ChromaDB persists data and manages the collection lifecycle is crucial for production deployments. ChromaDB supports several persistence modes, each trading off performance, durability, and resource usage differently. Let's explore these options and when to use each.

#### Persistent client
We will create a persistent client that saves data to disk. `PersistentClient` stores its collections in a local directory, so the data survives application restarts. No separate server process is needed: the embedded storage engine handles storage and indexing automatically.

In [27]:
# Create a temporary directory for our persistent database
persistent_path = tempfile.mkdtemp()
print(f"Creating persistent database at: {persistent_path}")

# Initialize persistent client
persistent_client = chromadb.PersistentClient(path=persistent_path)

# Create a collection in the persistent database
persistent_collection = persistent_client.create_collection(
    name="persistent_demo",
    metadata={"persistence": True, "created_at": "2024-01-01"}
)

# Add some data to demonstrate persistence
demo_docs = [
    "Persistent storage ensures data survives application restarts.",
    "ChromaDB can operate in both memory and disk-based modes.",
    "Production applications should use persistent storage for reliability."
]

persistent_collection.add(
    documents=demo_docs,
    ids=[f"persistent_{i}" for i in range(len(demo_docs))],
    metadatas=[{"type": "persistence_demo"} for _ in demo_docs]
)

print(f"Added {len(demo_docs)} documents to persistent collection")
print(f"Data is stored at: {persistent_path}")

Creating persistent database at: /tmp/tmp5smxktwi
Added 3 documents to persistent collection
Data is stored at: /tmp/tmp5smxktwi


Let's demonstrate the persistence by creating a new client instance:

In [28]:
# Simulate application restart by creating a new client pointing to the same path
new_persistent_client = chromadb.PersistentClient(path=persistent_path)

# Retrieve the existing collection
recovered_collection = new_persistent_client.get_collection("persistent_demo")

print(f"Recovered collection: {recovered_collection.name}")
print(f"Document count: {recovered_collection.count()}")
print(f"Collection metadata: {recovered_collection.metadata}")

# Query the recovered data to verify it's intact
recovery_test = recovered_collection.query(
    query_texts=["database storage"],
    n_results=2
)

print("\nRecovered documents:")
for doc in recovery_test['documents'][0]:
    print(f"  - {doc}")

Recovered collection: persistent_demo
Document count: 3
Collection metadata: {'created_at': '2024-01-01', 'persistence': True}

Recovered documents:
  - Persistent storage ensures data survives application restarts.
  - Production applications should use persistent storage for reliability.


When we create a new `PersistentClient` with an existing database path, ChromaDB automatically loads the existing data, indexes, and metadata. Our data survives application restarts without any additional setup or migration steps.
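One practical consequence of persistence is that calling `create_collection` again on restart raises an error, because the collection already exists on disk. A small sketch of an idempotent startup step, assuming the client exposes `get_or_create_collection` as ChromaDB clients do, might look like this; `open_collection` is a name we made up for illustration.

```python
from typing import Any


def open_collection(client: Any, name: str) -> Any:
    """Return the named collection, creating it only if it does not exist yet."""
    return client.get_or_create_collection(name=name)


# Usage with the persistent client from this notebook (commented out so the
# block runs on its own):
#   collection = open_collection(persistent_client, "persistent_demo")
#   print(collection.count())
```

With this pattern the same startup code works on both the first run and every later run against the same path.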

Next, let's create a collection with custom HNSW index parameters and list the collections in the persistent database:

In [29]:
# Collection metadata and configuration
collection_config = {
    "hnsw:space": "cosine",  # Distance metric for similarity calculations
    "hnsw:construction_ef": 200,  # Controls index build quality vs speed
    "hnsw:M": 16  # Controls index connectivity and memory usage
}

# Create a collection with specific configuration
configured_collection = persistent_client.create_collection(
    name="configured_demo",
    metadata={
        "description": "Collection with custom HNSW parameters",
        "optimization": "quality",
        "created_by": "tutorial",
        **collection_config  # Apply the HNSW configuration parameters
    }
)

print("Created collection with custom configuration")

# Inspect collection details
collections_info = persistent_client.list_collections()
print("\nAvailable collections in persistent database:")
for col in collections_info:
    print(f"  - {col.name}: {col.count()} documents")
    if hasattr(col, 'metadata') and col.metadata:
        print(f"    Metadata: {col.metadata}")

Created collection with custom configuration

Available collections in persistent database:
  - configured_demo: 0 documents
    Metadata: {'created_by': 'tutorial', 'hnsw:construction_ef': 200, 'optimization': 'quality', 'description': 'Collection with custom HNSW parameters', 'hnsw:space': 'cosine', 'hnsw:M': 16}
  - persistent_demo: 3 documents
    Metadata: {'created_at': '2024-01-01', 'persistence': True}


ChromaDB uses HNSW (Hierarchical Navigable Small World) indices for efficient approximate nearest neighbor search. The configuration parameters control the trade-off between search accuracy, build time, and memory usage: `hnsw:space` selects the distance metric, while higher `hnsw:construction_ef` and `hnsw:M` values generally improve recall at the cost of longer index builds and more memory.
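The `hnsw:space` setting chooses how the distance between two vectors is measured. As a rough illustration in plain Python (not ChromaDB's implementation), the three commonly documented options, `l2` (squared Euclidean distance), `ip` (one minus the inner product), and `cosine` (one minus cosine similarity), can be computed as follows. The function names are ours.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the dot product of the two vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus cosine similarity; ignores vector length, only direction matters."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [0.0, 1.0]
print(squared_l2(a, b))              # 2.0
print(inner_product_distance(a, b))  # 1.0
print(cosine_distance(a, b))         # 1.0
```

Cosine distance is a common choice for text embeddings because two vectors pointing in the same direction get a distance near zero regardless of their magnitudes.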

#### Export and import collection
Let's also explore collection backup and migration strategies:


In [30]:
# Export collection data for backup or migration
def export_collection(collection, filename):
    """Export collection data to a dictionary that can be saved.

    Note: `filename` is not used yet; the data is returned, not written to disk.
    """
    all_data = collection.get(include=["documents", "metadatas", "embeddings"])

    export_data = {
        "name": collection.name,
        "metadata": collection.metadata,
        "documents": all_data["documents"],
        "metadatas": all_data["metadatas"],
        "embeddings": all_data["embeddings"],
        "ids": all_data["ids"],
        "count": len(all_data["ids"])
    }

    # In a real application, we would serialize this dict (e.g. to JSON or pickle) and write it to `filename`
    print(f"Exported {export_data['count']} documents from collection '{collection.name}'")
    return export_data

def import_collection(client, export_data, new_name=None):
    """Import previously exported collection data."""
    collection_name = new_name or export_data["name"]

    # Create new collection
    imported_collection = client.create_collection(
        name=collection_name,
        metadata=export_data["metadata"]
    )

    # Add all the data back
    if export_data["count"] > 0:
        imported_collection.add(
            documents=export_data["documents"],
            metadatas=export_data["metadatas"],
            embeddings=export_data["embeddings"],
            ids=export_data["ids"]
        )

    print(f"Imported {export_data['count']} documents to collection '{collection_name}'")
    return imported_collection

# Demonstrate export/import
backup_data = export_collection(persistent_collection, "backup.json")

# Import to a new collection
imported_collection = import_collection(
    persistent_client,
    backup_data,
    new_name="imported_demo"
)

print(f"Import successful. New collection has {imported_collection.count()} documents")

Exported 3 documents from collection 'persistent_demo'
Imported 3 documents to collection 'imported_demo'
Import successful. New collection has 3 documents


The export/import functions provide a way to back up collections or migrate data between ChromaDB instances. The `get()` method with `include=["documents", "metadatas", "embeddings"]` retrieves all stored data, including the pre-computed embeddings. Reusing those embeddings on import is more efficient than re-computing them, especially for large collections.
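The export dictionary still has to be written somewhere to serve as a backup. Here is a minimal sketch of a JSON round trip, assuming the dictionary has the shape produced by `export_collection` and that embeddings may arrive as NumPy arrays or as nested lists; `save_export` and `load_export` are hypothetical helpers, and the sample data is made up.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_export(export_data: Dict[str, Any], path: str) -> None:
    """Write an export dictionary to a JSON file."""
    serializable = dict(export_data)
    embeddings = export_data.get("embeddings")
    if embeddings is not None:
        # Convert each vector to a list of plain floats so json can encode it
        # (this works for both NumPy arrays and nested lists).
        serializable["embeddings"] = [[float(x) for x in vector] for vector in embeddings]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(serializable, f)


def load_export(path: str) -> Dict[str, Any]:
    """Read an export dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up sample with the same keys that export_collection returns.
sample = {
    "name": "persistent_demo",
    "metadata": {"persistence": True},
    "documents": ["Persistent storage ensures data survives application restarts."],
    "metadatas": [{"type": "persistence_demo"}],
    "embeddings": [[0.1, 0.2, 0.3]],
    "ids": ["persistent_0"],
    "count": 1,
}

backup_path = os.path.join(tempfile.mkdtemp(), "backup.json")
save_export(sample, backup_path)
restored = load_export(backup_path)
print(restored["ids"], restored["count"])
```

The restored dictionary can then be passed to `import_collection` unchanged, since it keeps the same keys.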