# ChromaDB Complete Guide: From Basics to Advanced Use Cases

This notebook covers everything you need to know about ChromaDB, including:
- Basic setup and operations
- File structure understanding
- RAG (Retrieval Augmented Generation)
- Other practical use cases

---

## Installation

```bash
pip install chromadb
pip install openai  # If using OpenAI embeddings
pip install python-dotenv  # If loading the API key from a .env file (section 3.2)
```

---

## Part 1: ChromaDB Basics

### 1.1 Simple In-Memory Database

In [24]:
import chromadb

# Create an in-memory client (data disappears when program ends)
client = chromadb.Client()

# Create a collection (like a table)
collection = client.get_or_create_collection(name="my_first_collection")

# Add documents
collection.add(
    documents=[
        "The cat sat on the mat",
        "The dog played in the park",
        "Python is a programming language"
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query for similar documents
results = collection.query(
    query_texts=["Tell me about animals"],
    n_results=2
)

print("Documents found:", results['documents'])
print("Distances:", results['distances'])

Documents found: [['The dog played in the park', 'Python is a programming language']]
Distances: [[1.378471851348877, 1.6124422550201416]]


Note: `n_results` sets how many of the closest documents are returned for each query. If you omit it, ChromaDB returns up to 10 results by default, so set it explicitly to the number of results you actually need. Smaller distances mean closer matches.
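
Query results are nested: each key holds one inner list per query text, which is why the printed output above has double brackets. The following is a minimal sketch of unpacking that shape. The `iter_hits` helper is a hypothetical name, and the dictionary simply copies the values printed above so the snippet runs without a database.

```python
# Same shape as the printed results above: one inner list per query text.
results = {
    "documents": [["The dog played in the park", "Python is a programming language"]],
    "distances": [[1.378471851348877, 1.6124422550201416]],
}

def iter_hits(results, query_index=0):
    """Yield (rank, document, distance) for one query, closest first."""
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    for rank, (doc, dist) in enumerate(zip(docs, dists), start=1):
        yield rank, doc, dist

for rank, doc, dist in iter_hits(results):
    print(f"{rank}. {doc} (distance {dist:.3f})")
```

Passing several strings in `query_texts` would produce several inner lists, selected here with `query_index`.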

### 1.2 Persistent Database (Saves to Disk)

In [25]:
import chromadb

# Create a persistent client (data saved to the ./data/my_chroma_db folder)
client = chromadb.PersistentClient(path="./data/my_chroma_db")

# Get or create collection
collection = client.get_or_create_collection(name="persistent_collection")

# Add data
collection.add(
    documents=["This data will persist across sessions"],
    ids=["persistent_doc1"]
)

print("✅ Data saved to disk!")

✅ Data saved to disk!


### 1.3 Understanding Embeddings

In [27]:
# Let's see what ChromaDB actually stores
collection.add(
    documents=["The cat sat on the mat"],
    ids=["doc_with_embedding"]
)

# Retrieve the embedding
result = collection.get(
    ids=["doc_with_embedding"],
    include=["embeddings", "documents"]
)

print("Original text:", result['documents'][0])
print("\nEmbedding (first 10 numbers):", result['embeddings'][0][:10])
print("Embedding dimension:", len(result['embeddings'][0]))
print("\n💡 ChromaDB converted text into a {}-dimensional vector!".format(len(result['embeddings'][0])))

Original text: The cat sat on the mat

Embedding (first 10 numbers): [ 0.13040181 -0.01187013 -0.02811698  0.05123861 -0.05597446  0.03019161
  0.03016139  0.02469836 -0.01837054  0.05876685]
Embedding dimension: 384

💡 ChromaDB converted text into a 384-dimensional vector!


---

## Part 2: ChromaDB File Structure

```
./my_chroma_db/
├── chroma.sqlite3          # Metadata database
└── <hash-folder>/          # e.g., 4f2a3b1c-...
    ├── data_level0.bin     # Vector data
    ├── header.bin          # Index metadata
    ├── length.bin          # Link-list sizes per element
    └── link_lists.bin      # Graph connections (for HNSW)
```

### What Each Component Does:

#### 1. `chroma.sqlite3` - The Metadata Store
- Collection names and settings
- Document IDs
- Document text (actual strings)
- Metadata
- Configuration (embedding function, distance metric)

**Think of it as:** The catalog/index system

#### 2. Hash Folder - The Vector Store
- **`data_level0.bin`**: Actual embedding vectors
- **`header.bin`**: Index structure metadata
- **`length.bin`**: Per-element link-list sizes (HNSW bookkeeping)
- **`link_lists.bin`**: HNSW graph for fast search

**Think of it as:** The warehouse for numerical data

### Key Concepts Summary:

| What | Where Stored | When Created |
|------|--------------|-------------|
| **Document text** | `chroma.sqlite3` | When you `.add()` |
| **Document vectors** | `data_level0.bin` | When you `.add()` (via embedding function) |
| **Query vector** | Nowhere (temporary) | When you `.query()` (on-the-fly) |
| **Distance** | Nowhere | When you `.query()` (calculated on-the-fly) |

✅ Distance is calculated **on-the-fly** during search  
✅ If a document is never queried, no distance is calculated  
✅ Distance only exists **between two vectors** (query ↔ document)
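
To make the on-the-fly calculation concrete, here is a minimal pure-Python sketch of two common distance functions. It assumes ChromaDB's default `l2` space, which reports squared Euclidean distance. The helper names and the toy 2-D vectors are illustrative only, and the real index uses approximate HNSW search rather than this brute-force loop.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy unit-length vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [0.6, 0.8], "doc_b": [0.0, 1.0]}

# Distances exist only once a query vector is compared with stored vectors.
distances = {doc_id: squared_l2(query_vec, vec) for doc_id, vec in doc_vecs.items()}
ranked = sorted(distances, key=distances.get)
print(ranked, distances)
```

For unit-length vectors the two metrics rank documents identically, because squared L2 distance equals twice the cosine distance.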

---

## Part 3: Using Different Embedding Models

### 3.1 Default Embedding (sentence-transformers)

In [28]:
# Default uses 'all-MiniLM-L6-v2' sentence transformer
client = chromadb.Client()
collection = client.get_or_create_collection(name="default_embeddings")

collection.add(
    documents=["Default embedding model"],
    ids=["default1"]
)

print("Using default embedding model: all-MiniLM-L6-v2")

Using default embedding model: all-MiniLM-L6-v2


### 3.2 OpenAI Embeddings

In [29]:
import chromadb
from chromadb.utils import embedding_functions
import os
from dotenv import load_dotenv

# Set your OpenAI API key
load_dotenv()
openai_api_key = os.environ.get("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY environment variable is not set")

# Create OpenAI embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=openai_api_key,
    model_name="text-embedding-3-small"  # 1536 dimensions
)

# Create collection with OpenAI embeddings
client = chromadb.PersistentClient(path="./data/openai_chroma_db")
collection = client.get_or_create_collection(
    name="openai_collection",
    embedding_function=openai_ef
)

collection.add(
    documents=[
        "The cat sat on the mat",
        "The dog played in the park"
    ],
    ids=["openai_doc1", "openai_doc2"]
)

results = collection.query(
    query_texts=["Tell me about animals"],
    n_results=2
)

print("Results with OpenAI embeddings:", results['documents'])

Results with OpenAI embeddings: [['The dog played in the park', 'The dog played in the park']]


### 3.3 Custom Embedding Function (Advanced)

In [30]:
from chromadb.api.types import Documents, EmbeddingFunction, Embeddings
import numpy as np

class MyCustomEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # This is a dummy example - replace with your actual embedding logic
        embeddings = []
        for doc in input:
            # Example: create a random 128-dim vector (replace with real embeddings)
            embedding = np.random.rand(128).tolist()
            embeddings.append(embedding)
        return embeddings

# Use custom embedding function
custom_ef = MyCustomEmbeddingFunction()
collection = client.create_collection(
    name="custom_embeddings",
    embedding_function=custom_ef
)

print("✅ Using custom embedding function!")

✅ Using custom embedding function!
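
The random vectors in the cell above change on every call, so a stored document and an identical query would not match each other. A real embedding function must be deterministic. The sketch below is a toy stand-in that shows that property with token hashing; `HashingEmbedder` is a hypothetical class that captures word overlap only, not meaning. It is written as a plain callable so it runs without ChromaDB; to plug it in, it would subclass `EmbeddingFunction` as in the cell above.

```python
import hashlib
import math

class HashingEmbedder:
    """Toy deterministic embedder: hashes each token into one of `dim` buckets."""

    def __init__(self, dim=16):
        self.dim = dim

    def __call__(self, input):
        return [self._embed(text) for text in input]

    def _embed(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self.dim] += 1.0
        # Normalize to unit length; fall back to 1.0 for empty text.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

embedder = HashingEmbedder(dim=16)
first = embedder(["The cat sat on the mat"])
second = embedder(["The cat sat on the mat"])
print(first == second)  # True: the same text always maps to the same vector
```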



---

## Part 4: RAG (Retrieval Augmented Generation)

### What is RAG?

RAG combines ChromaDB (retrieval) with LLMs (generation) to answer questions using your custom data.

**The Flow:**
```
User Question
    ↓
1. Convert question to embedding
    ↓
2. Search ChromaDB for similar documents
    ↓
3. Retrieve top relevant documents
    ↓
4. Give LLM: [Question + Retrieved Documents]
    ↓
5. LLM generates answer based on context
```
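
The five steps map onto one small function. The sketch below is an outline under stated assumptions: `retrieve` and `generate` are hypothetical callables standing in for a ChromaDB query and an LLM call, and the stub implementations exist only so the example runs without any external service.

```python
from typing import Callable, List

def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    n_results: int = 3,
) -> str:
    # Steps 1-3: embed the question, search, and return the top documents.
    docs = retrieve(question, n_results)
    # Step 4: combine the question with the retrieved context.
    context = "\n\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    # Step 5: the LLM generates an answer grounded in the context.
    return generate(prompt)

# Stand-in implementations (hypothetical), for demonstration only.
def fake_retrieve(question: str, n_results: int) -> List[str]:
    corpus = ["Vacation Policy: 15 days per year.", "Remote Work Policy: 3 days per week."]
    return corpus[:n_results]

def fake_generate(prompt: str) -> str:
    return f"(LLM would answer using {prompt.count('Policy')} retrieved documents)"

answer = answer_with_rag("How many vacation days do I get?", fake_retrieve, fake_generate, n_results=2)
print(answer)
```

With ChromaDB, `retrieve` could wrap `collection.query(query_texts=[question], n_results=n_results)["documents"][0]`, and `generate` would call whichever LLM client you use.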

### 4.1 RAG Example: Company Knowledge Base

In [31]:
import chromadb

# Setup ChromaDB with company documents
client = chromadb.PersistentClient(path="./data/rag_example_db")
collection = client.get_or_create_collection(name="company_docs")

# Add company documents
company_docs = [
    "Q3 2024 Revenue Report: Total revenue was $5.2M, up 15% from Q2. Major growth in enterprise segment.",
    "Vacation Policy: Employees get 15 days paid vacation per year. Must be requested 2 weeks in advance.",
    "Remote Work Policy: Employees can work remotely up to 3 days per week. Must coordinate with team.",
    "Health Benefits: Company covers 80% of health insurance premiums. Dental and vision included."
]

collection.add(
    documents=company_docs,
    ids=[f"doc_{i}" for i in range(len(company_docs))]
)

print("‚úÖ Company knowledge base loaded!")

‚úÖ Company knowledge base loaded!


In [32]:
# RAG Function
def rag_query(question, collection, n_results=3):
    """
    Retrieval Augmented Generation query
    """
    # Step 1: Search ChromaDB for relevant context
    results = collection.query(
        query_texts=[question],
        n_results=n_results
    )
    
    # Step 2: Get retrieved documents
    retrieved_docs = results['documents'][0]
    context = "\n\n".join(retrieved_docs)
    
    # Step 3: Build prompt for LLM
    prompt = f"""Based on the following information:

{context}

Answer this question: {question}

If the information doesn't contain the answer, say so."""
    
    return {
        'prompt': prompt,
        'retrieved_docs': retrieved_docs,
        'distances': results['distances'][0]
    }

# Example usage
question = "What was our Q3 revenue?"
result = rag_query(question, collection)

print("Question:", question)
print("\nRetrieved Documents:")
for i, doc in enumerate(result['retrieved_docs']):
    print(f"  {i+1}. {doc[:100]}...")
print("\nPrompt to send to LLM:")
print(result['prompt'])

Question: What was our Q3 revenue?

Retrieved Documents:
  1. Q3 2024 Revenue Report: Total revenue was $5.2M, up 15% from Q2. Major growth in enterprise segment....
  2. Health Benefits: Company covers 80% of health insurance premiums. Dental and vision included....
  3. Vacation Policy: Employees get 15 days paid vacation per year. Must be requested 2 weeks in advance....

Prompt to send to LLM:
Based on the following information:

Q3 2024 Revenue Report: Total revenue was $5.2M, up 15% from Q2. Major growth in enterprise segment.

Health Benefits: Company covers 80% of health insurance premiums. Dental and vision included.

Vacation Policy: Employees get 15 days paid vacation per year. Must be requested 2 weeks in advance.

Answer this question: What was our Q3 revenue?

If the information doesn't contain the answer, say so.
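
In the prompt above, two of the three retrieved documents have nothing to do with revenue, because a query always returns its `n_results` nearest documents, however far away they are. One option is to drop weak matches using the `distances` that `rag_query` already returns. This is a minimal sketch: `filter_by_distance` is a hypothetical helper, and the sample distances and the 1.0 cutoff are made up for illustration.

```python
def filter_by_distance(results, max_distance):
    """Keep only (document, distance) pairs at or below max_distance.

    `results` has the shape returned by collection.query for one query text.
    """
    docs = results["documents"][0]
    dists = results["distances"][0]
    return [(doc, dist) for doc, dist in zip(docs, dists) if dist <= max_distance]

# Hypothetical query result; the distances are invented for this example.
sample_results = {
    "documents": [["Q3 2024 Revenue Report", "Health Benefits", "Vacation Policy"]],
    "distances": [[0.62, 1.58, 1.71]],
}

# 1.0 is an assumed cutoff; tune it on your own data and embedding model.
kept = filter_by_distance(sample_results, max_distance=1.0)
print(kept)
```

If the filtered list comes back empty, the application can answer that nothing relevant was found instead of sending unrelated context to the LLM.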


In [33]:
# Try different questions
questions = [
    "How many vacation days do I get?",
    "Can I work from home?",
    "What are the health benefits?",
    "What's our office address?"  # This is NOT in the database
]

for q in questions:
    result = rag_query(q, collection, n_results=1)
    print(f"\n❓ Question: {q}")
    print(f"📄 Retrieved: {result['retrieved_docs'][0][:100]}...")
    print(f"📊 Distance: {result['distances'][0]:.4f}")


❓ Question: How many vacation days do I get?
📄 Retrieved: Vacation Policy: Employees get 15 days paid vacation per year. Must be requested 2 weeks in advance....
📊 Distance: 0.5961

❓ Question: Can I work from home?
📄 Retrieved: Remote Work Policy: Employees can work remotely up to 3 days per week. Must coordinate with team....
📊 Distance: 1.2501

❓ Question: What are the health benefits?
📄 Retrieved: Health Benefits: Company covers 80% of health insurance premiums. Dental and vision included....
📊 Distance: 0.9587

❓ Question: What's our office address?
📄 Retrieved: Remote Work Policy: Employees can work remotely up to 3 days per week. Must coordinate with team....
📊 Distance: 1.7489


---

## Part 5: Other Use Cases (Beyond RAG)

### 5.1 Semantic Search (No LLM Needed)

In [35]:
# E-commerce product search
client = chromadb.Client()
products = client.get_or_create_collection(name="products")

products.add(
    documents=[
        "Wireless Bluetooth headphones with noise cancellation, 30-hour battery",
        "USB-C charging cable 6ft braided nylon, fast charging",
        "Laptop stand adjustable aluminum ergonomic design",
        "Mechanical keyboard RGB backlit gaming switches"
    ],
    ids=["prod1", "prod2", "prod3", "prod4"]
)

# User searches with natural language
search_queries = [
    "cord for phone",  # Finds charging cable
    "something for music",  # Finds headphones
    "typing device"  # Finds keyboard
]

for query in search_queries:
    results = products.query(query_texts=[query], n_results=1)
    print(f"\n🔍 Search: '{query}'")
    print(f"✅ Found: {results['documents'][0][0]}")


🔍 Search: 'cord for phone'
✅ Found: USB-C charging cable 6ft braided nylon, fast charging

🔍 Search: 'something for music'
✅ Found: Wireless Bluetooth headphones with noise cancellation, 30-hour battery

🔍 Search: 'typing device'
✅ Found: Mechanical keyboard RGB backlit gaming switches


### 5.2 Recommendation System

In [36]:
# Movie recommendation
movies = client.get_or_create_collection(name="movies")

movies.add(
    documents=[
        "A thrilling space adventure with aliens and cosmic battles",
        "Romantic comedy set in Paris with a charming love story",
        "Sci-fi thriller about AI taking over the world",
        "Action-packed superhero movie with epic fight scenes",
        "Heartwarming drama about family and redemption"
    ],
    ids=["movie1", "movie2", "movie3", "movie4", "movie5"]
)

# User watched: "A thrilling space adventure with aliens and cosmic battles"
# Find similar movies
user_watched = "A thrilling space adventure with aliens and cosmic battles"
recommendations = movies.query(query_texts=[user_watched], n_results=3)

print("🎬 Because you watched: 'Space Adventure'")
print("\nYou might also like:")
for i, movie in enumerate(recommendations['documents'][0][1:], 1):  # Skip first (itself)
    print(f"  {i}. {movie}")

🎬 Because you watched: 'Space Adventure'

You might also like:
  1. Action-packed superhero movie with epic fight scenes
  2. Sci-fi thriller about AI taking over the world
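
Slicing with `[1:]` relies on the watched movie always ranking first. A slightly more robust option is to exclude it by ID. The sketch below assumes a result shaped like the `movies.query(...)` output; the `recommend` helper is a hypothetical name, and the sample data mirrors the IDs and the ordering printed above.

```python
def recommend(results, exclude_ids, limit):
    """Return up to `limit` (id, document) pairs, skipping ids in exclude_ids."""
    ids = results["ids"][0]
    docs = results["documents"][0]
    picks = [(doc_id, doc) for doc_id, doc in zip(ids, docs) if doc_id not in exclude_ids]
    return picks[:limit]

# Result shaped like the query output; IDs match the cell above.
sample = {
    "ids": [["movie1", "movie4", "movie3"]],
    "documents": [[
        "A thrilling space adventure with aliens and cosmic battles",
        "Action-packed superhero movie with epic fight scenes",
        "Sci-fi thriller about AI taking over the world",
    ]],
}

picks = recommend(sample, exclude_ids={"movie1"}, limit=2)
for rank, (doc_id, doc) in enumerate(picks, start=1):
    print(f"  {rank}. {doc} ({doc_id})")
```

The same helper also works when a user has watched several movies, since `exclude_ids` can hold any number of IDs.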


### 5.3 Duplicate Detection

In [38]:
# Customer support ticket deduplication
tickets = client.get_or_create_collection(name="support_tickets")

tickets.add(
    documents=[
        "App crashes when I try to login",
        "Cannot sign in, app freezes",
        "Login button not working"
    ],
    ids=["ticket1", "ticket2", "ticket3"]
)

# New ticket comes in
new_ticket = "Unable to log into the application"

# Check for duplicates
results = tickets.query(query_texts=[new_ticket], n_results=1)
distance = results['distances'][0][0]

print(f"New ticket: '{new_ticket}'")
print(f"\nMost similar existing ticket: '{results['documents'][0][0]}'")
print(f"Distance: {distance:.4f}")

if distance < 0.5:  # Threshold for "too similar"
    print("\n⚠️ Possible duplicate detected!")
else:
    print("\n✅ New unique ticket")

New ticket: 'Unable to log into the application'

Most similar existing ticket: 'App crashes when I try to login'
Distance: 0.5349

✅ New unique ticket
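
The distance of 0.5349 narrowly misses the 0.5 cutoff. To judge how strict that cutoff is, the distance can be mapped to cosine similarity. This sketch assumes the collection uses squared L2 distance (ChromaDB's default `l2` space) and unit-length embeddings; if either assumption does not hold for your setup, the conversion does not apply. The function names are illustrative.

```python
def l2sq_to_cosine_similarity(distance):
    """For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - distance / 2.0

def cosine_similarity_to_l2sq(similarity):
    """Inverse mapping, under the same unit-length assumption."""
    return 2.0 * (1.0 - similarity)

observed = l2sq_to_cosine_similarity(0.5349)  # distance printed above
threshold = l2sq_to_cosine_similarity(0.5)    # cutoff used in the cell above
print(f"observed similarity: {observed:.3f}, threshold similarity: {threshold:.3f}")
```

Under these assumptions the cutoff demands a cosine similarity of 0.75, and the new ticket falls just short of it, so the right threshold is worth calibrating on labeled examples of real duplicates.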


### 5.4 Content Categorization

In [39]:
# Categorize customer feedback
feedback = client.get_or_create_collection(name="feedback")

feedback.add(
    documents=[
        "App crashes when I try to login",
        "Cannot sign in, keeps failing",
        "Love the new dark mode feature",
        "Great UI updates in latest version",
        "Checkout process is broken",
        "Payment doesn't work"
    ],
    ids=["f1", "f2", "f3", "f4", "f5", "f6"]
)

# Define categories
categories = {
    "Login Issues": "problems with authentication and signing in",
    "UI Feedback": "comments about design and user interface",
    "Payment Problems": "issues with checkout and payment processing"
}

# Categorize each feedback
all_feedback = feedback.get()

for doc_id, doc in zip(all_feedback['ids'], all_feedback['documents']):
    print(f"\n📝 Feedback: '{doc}'")
    
    # Find best matching category
    best_category = None
    best_distance = float('inf')
    
    for category_name, category_desc in categories.items():
        result = feedback.query(query_texts=[category_desc], n_results=10)
        if doc_id in result['ids'][0]:
            idx = result['ids'][0].index(doc_id)
            distance = result['distances'][0][idx]
            if distance < best_distance:
                best_distance = distance
                best_category = category_name
    
    print(f"🏷️  Category: {best_category}")


📝 Feedback: 'App crashes when I try to login'
🏷️  Category: Login Issues

📝 Feedback: 'Cannot sign in, keeps failing'
🏷️  Category: Login Issues

📝 Feedback: 'Love the new dark mode feature'
🏷️  Category: UI Feedback

📝 Feedback: 'Great UI updates in latest version'
🏷️  Category: UI Feedback

📝 Feedback: 'Checkout process is broken'
🏷️  Category: Payment Problems

📝 Feedback: 'Payment doesn't work'
🏷️  Category: Payment Problems


### 5.5 AI Agent Memory

In [40]:
# Give AI agent long-term memory
agent_memory = client.get_or_create_collection(name="agent_memory")

# Store user preferences and past interactions
agent_memory.add(
    documents=[
        "User prefers vegetarian restaurants",
        "User's favorite cuisine is Italian",
        "User is allergic to peanuts",
        "User lives in San Francisco"
    ],
    ids=["pref1", "pref2", "pref3", "pref4"]
)

# Later, when user asks for recommendations
user_query = "recommend a good restaurant for dinner"

# Retrieve relevant memories
relevant_memories = agent_memory.query(
    query_texts=[user_query],
    n_results=3
)

print(f"User asks: '{user_query}'")
print("\n🧠 Agent remembers:")
for memory in relevant_memories['documents'][0]:
    print(f"  - {memory}")
print("\n💡 Agent can now give personalized recommendations!")

User asks: 'recommend a good restaurant for dinner'

🧠 Agent remembers:
  - User prefers vegetarian restaurants
  - User's favorite cuisine is Italian
  - User lives in San Francisco

💡 Agent can now give personalized recommendations!


---

## Part 6: Advanced Features

### 6.1 Using Metadata for Filtering

In [41]:
# Add documents with metadata
advanced_collection = client.get_or_create_collection(name="advanced_features")

advanced_collection.add(
    documents=[
        "Python tutorial for beginners",
        "Advanced Python patterns",
        "JavaScript basics guide",
        "React framework tutorial"
    ],
    metadatas=[
        {"language": "python", "level": "beginner"},
        {"language": "python", "level": "advanced"},
        {"language": "javascript", "level": "beginner"},
        {"language": "javascript", "level": "intermediate"}
    ],
    ids=["doc1", "doc2", "doc3", "doc4"]
)

# Query with metadata filter
results = advanced_collection.query(
    query_texts=["learning programming"],
    n_results=5,
    where={"language": "python"}  # Only Python documents
)

print("Python documents only:")
for doc in results['documents'][0]:
    print(f"  - {doc}")

Python documents only:
  - Python tutorial for beginners - Updated 2024
  - Advanced Python patterns
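
The `where` argument also accepts operator-based filters. The dictionaries below are a small sketch of that syntax, using the `$eq`, `$in`, and `$and` operators with the metadata keys from the cell above; the variable names are illustrative.

```python
# Equality shorthand, as used in the cell above.
python_only = {"language": "python"}

# The same condition written with an explicit operator.
python_only_explicit = {"language": {"$eq": "python"}}

# Combined conditions: beginner-level documents in Python or JavaScript.
beginner_python_or_js = {
    "$and": [
        {"level": {"$eq": "beginner"}},
        {"language": {"$in": ["python", "javascript"]}},
    ]
}

# Each dict would be passed as `where=...` to collection.query() or collection.get().
print(beginner_python_or_js)
```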


### 6.2 Updating and Deleting Documents

In [42]:
# Update a document
advanced_collection.update(
    ids=["doc1"],
    documents=["Python tutorial for beginners - Updated 2024"]
)

# Delete a document
advanced_collection.delete(ids=["doc4"])

# Verify changes
all_docs = advanced_collection.get()
print("Remaining documents:")
for doc_id, doc in zip(all_docs['ids'], all_docs['documents']):
    print(f"  {doc_id}: {doc}")

Remaining documents:
  doc1: Python tutorial for beginners - Updated 2024
  doc2: Advanced Python patterns
  doc3: JavaScript basics guide


### 6.3 Collection Management

In [43]:
# List all collections
collections = client.list_collections()
print("All collections:")
for col in collections:
    print(f"  - {col.name}")

# Get collection info
collection_info = advanced_collection.count()
print(f"\nDocuments in 'advanced_features': {collection_info}")

# Delete a collection (careful!)
# client.delete_collection(name="collection_to_delete")

All collections:
  - my_first_collection
  - default_embeddings
  - advanced_features
  - movies
  - support_tickets
  - feedback
  - products
  - agent_memory

Documents in 'advanced_features': 3


---

## Summary: When to Use ChromaDB

### ✅ Use ChromaDB when you need:

1. **RAG (Retrieval Augmented Generation)**
   - Give LLMs access to your custom data
   - Build chatbots with company knowledge
   - Document Q&A systems

2. **Semantic Search**
   - E-commerce product search
   - Document/knowledge base search
   - Finding similar content

3. **Recommendations**
   - "Customers also liked..."
   - Content recommendations
   - Similar items/products

4. **Duplicate Detection**
   - Find similar tickets/issues
   - Prevent duplicate content
   - Plagiarism detection

5. **Categorization**
   - Auto-categorize feedback
   - Organize documents by topic
   - Route support tickets

6. **AI Agent Memory**
   - Long-term memory for chatbots
   - Personalized AI assistants
   - Context-aware responses

### 🔑 Key Concepts:

- **Embeddings** encode meaning as numerical vectors
- **Distance** measures similarity between vectors
- Similar meanings → close vectors (small distance)
- ChromaDB handles embedding generation automatically
- You can use custom embedding models

### 📁 Remember:

- Use `Client()` for temporary/testing
- Use `PersistentClient()` for production
- Data stored in SQLite (metadata) + bin files (vectors)
- Distance calculated on-the-fly during search

---

## Next Steps

1. **Try building a RAG application** with your own documents
2. **Experiment with different embedding models** (OpenAI, sentence-transformers, custom)
3. **Explore metadata filtering** for more precise searches
4. **Integrate with LLM APIs** (OpenAI, Anthropic Claude, local models)
5. **Scale up** with larger document collections

### Useful Resources:
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [Sentence Transformers](https://www.sbert.net/)
- [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)

Happy coding! 🚀