# ChromaDB Semantic Search Tutorial

This notebook demonstrates semantic search with ChromaDB, a vector database that stores documents alongside their embeddings and retrieves them by similarity of meaning rather than by exact keyword match.

## Setup and Initialization

In [1]:
import chromadb
from chromadb.utils import embedding_functions

In [2]:
# Initialize an in-memory Chroma client
print("Initializing Chroma client...")
client = chromadb.Client()

# Create embedding function using SentenceTransformer
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"  # Lightweight, effective model
)

Initializing Chroma client...
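
The embedding function is what turns text into vectors. As a rough illustration of that contract (a list of strings in, one fixed-length vector per string out), the sketch below defines a toy hashing embedder using only the standard library. ToyHashEmbedding and its dim parameter are hypothetical names made up for this example; the class does not subclass anything from chromadb, and it captures none of the semantics that all-MiniLM-L6-v2 learns.

```python
import hashlib
import math
from typing import List


class ToyHashEmbedding:
    """Hypothetical stand-in for an embedding function (not part of chromadb)."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def __call__(self, texts: List[str]) -> List[List[float]]:
        # One vector per input string, all with the same length
        return [self._embed(text) for text in texts]

    def _embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            # Stable hash so the same token always lands in the same bucket
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[digest[0] % self.dim] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        # Scale to unit length; an all-zero vector (empty text) is returned as-is
        return [x / norm for x in vec] if norm else vec


embed = ToyHashEmbedding(dim=16)
vectors = embed(["vector databases", "semantic search"])
print(len(vectors), len(vectors[0]))
```

Unlike this toy, a sentence-transformer model places texts with similar meaning close together even when they share no words, which is what makes the searches in this notebook semantic rather than keyword-based.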


## Creating Collections

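In ChromaDB, documents are organized into collections. A collection stores each document together with its ID, its embedding, and optional metadata, and the embedding function passed at creation is applied both to added documents and to query text.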

In [3]:
# Create a new collection for documents
collection = client.create_collection(
    name="documents",
    embedding_function=embedding_function
)

## Helper Function for Results Display

In [4]:
# Helper function to display search results in a readable format
def display_results(results):
    """Display ChromaDB search results"""
    print("\nResults:")
    for i, (doc, doc_id, metadata, distance) in enumerate(zip(
        results['documents'][0],
        results['ids'][0],
        results['metadatas'][0] if results['metadatas']
        else [None] * len(results['ids'][0]),
        results['distances'][0]
    )):
        print(f"{i+1}. Document: {doc}")
        print(f"   ID: {doc_id}")
        if metadata:
            print(f"   Metadata: {metadata}")
        print(f"   Distance: {distance:.4f}")
        print()
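
The helper indexes every field with [0] because a query result holds one inner list per query text, and this notebook always sends a single query. The sketch below uses a hand-written dictionary in that shape (sample_results, with values taken from the first search in this notebook) and a hypothetical to_records helper that flattens one query's hits into a list of dictionaries. Neither name is part of chromadb.

```python
# Hand-written dictionary mirroring the shape of a query result.
# The values are copied from this notebook's output, not computed here.
sample_results = {
    "ids": [["doc4", "doc8"]],
    "documents": [[
        "Artificial intelligence is transforming the technology landscape",
        "Machine learning algorithms find patterns in data",
    ]],
    "metadatas": [[None, None]],
    "distances": [[0.6473, 1.3642]],
}


def to_records(results, query_index=0):
    """Flatten the hits for one query into a list of dicts (hypothetical helper)."""
    ids = results["ids"][query_index]
    metadatas = results.get("metadatas")
    # Fall back to None placeholders when no metadata was returned
    metas = metadatas[query_index] if metadatas else [None] * len(ids)
    return [
        {"id": doc_id, "document": doc, "metadata": meta, "distance": dist}
        for doc_id, doc, meta, dist in zip(
            ids,
            results["documents"][query_index],
            metas,
            results["distances"][query_index],
        )
    ]


records = to_records(sample_results)
print(records[0]["id"], records[0]["distance"])
```

A flat list of records like this can be easier to sort, filter, or load into a table than the nested lists.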

## Basic Vector Operations

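This section adds documents to the collection and runs a simple semantic search. Because the collection was created with an embedding function, it embeds each document when it is added and embeds the query text at search time.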
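
Conceptually, a similarity query embeds the query text and returns the stored vectors closest to it. A minimal brute-force sketch of that ranking step, assuming hand-made three-dimensional vectors in place of real embeddings, might look like the following. The names toy_index, squared_l2, and top_k are illustrative only; Chroma itself relies on an approximate nearest-neighbor index rather than scanning every vector.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hand-made 3-dimensional "embeddings" (illustrative values only)
toy_index = {
    "doc_ai": [0.9, 0.1, 0.0],
    "doc_weather": [0.0, 0.2, 0.9],
    "doc_ml": [0.7, 0.3, 0.1],
}
query_vector = [0.8, 0.2, 0.0]


def top_k(index, query, k):
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [(doc_id, squared_l2(vec, query)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


hits = top_k(toy_index, query_vector, k=2)
for doc_id, dist in hits:
    print(f"{doc_id}: {dist:.4f}")
```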

In [5]:
# Basic Vector Operations
print("\n=== BASIC VECTOR OPERATIONS ===")

# Example documents covering various topics
documents = [
    "The quick brown fox jumps over the lazy dog",
    "A man is walking his dog in the park",
    "The weather is sunny and warm today",
    "Artificial intelligence is transforming the technology landscape",
    "Vector databases are essential for semantic search applications",
    "Deep learning models require substantial computational resources",
    "The city skyline looks beautiful at sunset",
    "Machine learning algorithms find patterns in data"
]
ids = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8"]

# Add documents to collection
print("Adding documents to collection...")
collection.add(
    documents=documents,
    ids=ids
)

# Get collection count
count = collection.count()
print(f"Collection now contains {count} documents")

# Perform a semantic search
query_text = "AI and technology trends"
print(f"\nPerforming similarity search for: '{query_text}'")

results = collection.query(
    query_texts=[query_text],
    n_results=3  # Return top 3 most similar documents
)

# Display results
display_results(results)


=== BASIC VECTOR OPERATIONS ===
Adding documents to collection...
Collection now contains 8 documents

Performing similarity search for: 'AI and technology trends'

Results:
1. Document: Artificial intelligence is transforming the technology landscape
   ID: doc4
   Distance: 0.6473

2. Document: Machine learning algorithms find patterns in data
   ID: doc8
   Distance: 1.3642

3. Document: Deep learning models require substantial computational resources
   ID: doc6
   Distance: 1.4002
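
The Distance values are easier to read with the metric in mind. Assuming the collection uses Chroma's default squared L2 distance and that the embedding model returns unit-length vectors (both are assumptions about this setup, not something the output itself confirms), distance and cosine similarity are related by d = 2 - 2 * cos. The sketch below checks that identity on two arbitrary vectors and then converts the distances printed above; cosine_from_squared_l2 is a hypothetical helper.

```python
import math


def cosine_from_squared_l2(distance):
    """Convert squared L2 distance to cosine similarity.

    Valid only when both vectors have unit length (an assumption here).
    """
    return 1.0 - distance / 2.0


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Check the identity on two arbitrary vectors after normalizing them
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])
sq_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
assert math.isclose(cosine_from_squared_l2(sq_l2), cosine)

# Distances from the search above, converted under the unit-length assumption
for d in (0.6473, 1.3642, 1.4002):
    print(f"distance {d:.4f} -> cosine similarity {cosine_from_squared_l2(d):.4f}")
```

Under those assumptions a distance of 0 means the vectors point in the same direction and a distance of 2 means they are orthogonal, so lower values indicate closer matches.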



## Working with Metadata and Filtering

ChromaDB can attach a metadata dictionary to each document and restrict a search to the documents whose metadata matches a filter.

In [6]:
print("\n=== METADATA AND FILTERING ===")

# Create a new collection for filtered documents
filtered_docs_collection = client.create_collection(
    name="filtered_documents",
    embedding_function=embedding_function
)

# Metadata for each document
metadatas = [
    {"category": "animal", "length": "short", "year": 2021},
    {"category": "lifestyle", "length": "short", "year": 2022},
    {"category": "weather", "length": "short", "year": 2023},
    {"category": "technology", "length": "medium", "year": 2023},
    {"category": "technology", "length": "medium", "year": 2024},
    {"category": "technology", "length": "long", "year": 2024},
    {"category": "travel", "length": "short", "year": 2023},
    {"category": "technology", "length": "medium", "year": 2024}
]

# Add documents with metadata
print("Adding documents with metadata...")
filtered_docs_collection.add(
    documents=documents,
    ids=ids,
    metadatas=metadatas
)

# Get collection count
count = filtered_docs_collection.count()
print(f"Filtered Docs Collection now contains {count} documents.")


=== METADATA AND FILTERING ===
Adding documents with metadata...
Filtered Docs Collection now contains 8 documents.
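
The documents, ids, and metadatas lists are matched by position, so the first metadata dictionary belongs to doc1, the second to doc2, and so on. A small pre-flight check, sketched here with a hypothetical check_aligned function that is not part of chromadb, can catch a misaligned or duplicated entry before anything is stored.

```python
def check_aligned(documents, ids, metadatas=None):
    """Hypothetical pre-flight check; not a chromadb function."""
    if len(documents) != len(ids):
        raise ValueError(f"{len(documents)} documents but {len(ids)} ids")
    if metadatas is not None and len(metadatas) != len(ids):
        raise ValueError(f"{len(metadatas)} metadatas but {len(ids)} ids")
    # Collect any id that appears more than once
    duplicates = sorted({doc_id for doc_id in ids if ids.count(doc_id) > 1})
    if duplicates:
        raise ValueError(f"duplicate ids: {duplicates}")
    return True


sample_docs = [
    "The weather is sunny and warm today",
    "The city skyline looks beautiful at sunset",
]
sample_ids = ["doc3", "doc7"]
sample_meta = [
    {"category": "weather", "year": 2023},
    {"category": "travel", "year": 2023},
]

print(check_aligned(sample_docs, sample_ids, sample_meta))
```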


## Simple Metadata Filtering

In [7]:
# Simple metadata filtering - find technology documents about AI
print("\nFiltering by category 'technology':")
results = filtered_docs_collection.query(
    query_texts=["AI advancements"],
    n_results=3,
    where={"category": "technology"}  # Only search technology documents
)

display_results(results)


Filtering by category 'technology':

Results:
1. Document: Artificial intelligence is transforming the technology landscape
   ID: doc4
   Metadata: {'category': 'technology', 'length': 'medium', 'year': 2023}
   Distance: 0.8661

2. Document: Machine learning algorithms find patterns in data
   ID: doc8
   Metadata: {'category': 'technology', 'length': 'medium', 'year': 2024}
   Distance: 1.3540

3. Document: Deep learning models require substantial computational resources
   ID: doc6
   Metadata: {'category': 'technology', 'length': 'long', 'year': 2024}
   Distance: 1.3605



## Complex Metadata Filtering

Logical operators such as $and combine several conditions into a single filter, so a document is searched only if it satisfies all of them.

In [8]:
# Complex filtering - technology documents from 2024
print("\nComplex filtering (technology documents from 2024):")
results = filtered_docs_collection.query(
    query_texts=["AI advancements"],
    n_results=3,
    where={"$and": [
        {"category": {"$eq": "technology"}},  # Category must be technology
        {"year": {"$eq": 2024}}               # Year must be 2024
    ]}
)

display_results(results)


Complex filtering (technology documents from 2024):

Results:
1. Document: Machine learning algorithms find patterns in data
   ID: doc8
   Metadata: {'category': 'technology', 'length': 'medium', 'year': 2024}
   Distance: 1.3540

2. Document: Deep learning models require substantial computational resources
   ID: doc6
   Metadata: {'category': 'technology', 'length': 'long', 'year': 2024}
   Distance: 1.3605

3. Document: Vector databases are essential for semantic search applications
   ID: doc5
   Metadata: {'category': 'technology', 'length': 'medium', 'year': 2024}
   Distance: 1.5393
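
To see why doc4 drops out of the filtered results and doc5 takes its place, it can help to evaluate the filters by hand. The following is a toy evaluator for the filter shapes used in this notebook (the field-equals-value shorthand and $and with $eq, plus $or for completeness). It is a sketch of the semantics, not Chroma's implementation, and matches and catalog are made-up names. Chroma also supports other operators such as $ne, $gt, and $in, which this sketch deliberately leaves out.

```python
def matches(metadata, where):
    """Toy evaluator for a subset of the filter syntax (not Chroma's code)."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            for op, expected in condition.items():
                if op != "$eq":
                    raise ValueError(f"operator {op} is not covered by this sketch")
                if metadata.get(key) != expected:
                    return False
        elif metadata.get(key) != condition:  # shorthand: {"field": value}
            return False
    return True


# Three of the tutorial's metadata entries, keyed by document ID
catalog = {
    "doc4": {"category": "technology", "length": "medium", "year": 2023},
    "doc5": {"category": "technology", "length": "medium", "year": 2024},
    "doc7": {"category": "travel", "length": "short", "year": 2023},
}

simple = {"category": "technology"}
compound = {"$and": [{"category": {"$eq": "technology"}}, {"year": {"$eq": 2024}}]}

print([doc_id for doc_id, meta in catalog.items() if matches(meta, simple)])
# → ['doc4', 'doc5']
print([doc_id for doc_id, meta in catalog.items() if matches(meta, compound)])
# → ['doc5']
```

The filter only decides which documents are eligible; ranking among the eligible documents is still done by embedding distance.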



## Content-Based Filtering

In [9]:
# Using where_document to filter by document content
print("\nFiltering documents containing 'Artificial intelligence':")
results = filtered_docs_collection.query(
    query_texts=["AI advancements"],
    n_results=3,
    where_document={"$contains": "Artificial intelligence"}  # Content filter
)
display_results(results)


Filtering documents containing 'Artificial intelligence':

Results:
1. Document: Artificial intelligence is transforming the technology landscape
   ID: doc4
   Metadata: {'category': 'technology', 'length': 'medium', 'year': 2023}
   Distance: 0.8661
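
Only one result comes back even though n_results is 3, because the content filter excludes every document that does not contain the given text. The sketch below mimics that with a plain substring test over three of the tutorial's documents. It assumes the match is case-sensitive, which is how $contains is generally described; verify this against the documentation for your Chroma version. The names corpus and contains are illustrative.

```python
corpus = {
    "doc4": "Artificial intelligence is transforming the technology landscape",
    "doc6": "Deep learning models require substantial computational resources",
    "doc8": "Machine learning algorithms find patterns in data",
}


def contains(text, needle):
    # Plain substring test; case-sensitive, like Python's `in` operator
    return needle in text


print([doc_id for doc_id, text in corpus.items() if contains(text, "Artificial intelligence")])
# → ['doc4']
print([doc_id for doc_id, text in corpus.items() if contains(text, "artificial intelligence")])
# → []
```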



## Document Management

ChromaDB provides methods for retrieving, updating, and deleting documents by ID.

In [10]:
# Get document by ID
print("Getting document by ID: doc1")
result = collection.get(ids=["doc1"])
print(f"Original document: {result['documents'][0]}")

# Update document
print("\nUpdating document...")
collection.update(
    ids=["doc1"],
    documents=["The quick silver fox leaps over the sleepy hound"]
)

# Verify update
result = collection.get(ids=["doc1"])
print(f"Updated document: {result['documents'][0]}")

# Delete document
print("\nDeleting document doc2...")
collection.delete(ids=["doc2"])

# Verify deletion
count = collection.count()
print(f"Collection now has {count} documents")

Getting document by ID: doc1
Original document: The quick brown fox jumps over the lazy dog

Updating document...
Updated document: The quick silver fox leaps over the sleepy hound

Deleting document doc2...
Collection now has 7 documents
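
One detail worth keeping in mind is that updating a document's text also means its embedding has to be refreshed, otherwise searches would still match the old wording. The dictionary-backed ToyCollection below is a hypothetical stand-in (not chromadb code) that illustrates this along with the get, update, delete, and count flow. The stub_embed function is a placeholder that maps text to a one-element vector, and skipping unknown IDs on update is a choice made for this toy only.

```python
class ToyCollection:
    """Dictionary-backed stand-in; illustrates semantics only."""

    def __init__(self, embed):
        self._embed = embed
        self._rows = {}

    def add(self, ids, documents):
        for doc_id, doc in zip(ids, documents):
            self._rows[doc_id] = {"document": doc, "embedding": self._embed(doc)}

    def get(self, doc_id):
        return self._rows[doc_id]

    def update(self, ids, documents):
        for doc_id, doc in zip(ids, documents):
            if doc_id in self._rows:  # toy choice: unknown ids are skipped
                # Re-embed so the stored vector matches the new text
                self._rows[doc_id] = {"document": doc, "embedding": self._embed(doc)}

    def delete(self, ids):
        for doc_id in ids:
            self._rows.pop(doc_id, None)

    def count(self):
        return len(self._rows)


def stub_embed(text):
    # Stub "embedding": a single number derived from the text length
    return [float(len(text))]


toy = ToyCollection(stub_embed)
toy.add(["doc1", "doc2"], ["The quick brown fox", "A man is walking his dog"])
before = toy.get("doc1")["embedding"]
toy.update(["doc1"], ["The quick silver fox leaps"])
after = toy.get("doc1")["embedding"]
toy.delete(["doc2"])
print(before != after, toy.count())
```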
