# Vector Database Comparison: Chroma

This notebook demonstrates how to:
1. Load and prepare article data
2. Generate embeddings using all-MiniLM-L6-v2
3. Store vectors in Chroma Cloud
4. Perform semantic search
5. Apply metadata filtering (category, datetime)

## Is Chroma Free?
- ✅ Free for self-hosted
- ✅ $5 free credits for Chroma Cloud

In [1]:
# Reload
%reload_ext autoreload
%autoreload 2

## 1. Setup and Imports

In [2]:
import sys
sys.path.append('..')  # Add parent directory to path

import os
from dotenv import load_dotenv
from tqdm.auto import tqdm
import time

# Import our utilities
from utils.embeddings import EmbeddingGenerator
from utils.data_loader import load_articles, get_article_metadata

# Load environment variables
load_dotenv()

print("✓ Imports successful")

✓ Imports successful


## 2. Initialize Embedding Model

We'll use **all-MiniLM-L6-v2**, a free sentence-transformers model that produces 384-dimensional embeddings and serves as a good baseline.

In [3]:
# Initialize embedding model
embedding_model = EmbeddingGenerator()

# Test the model
test_text = ("Only if we understand, can we care. Only if we care, will we help. Only if we help, shall all be saved. "
             "- Jane Goodall")
test_embedding = embedding_model.embed_text(test_text)

print(f"  - Embedding dimension: {len(test_embedding)}")
print(f"  - Sample values: {test_embedding[:5]}")

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
✓ Model loaded successfully. Embedding dimension: 384
  - Embedding dimension: 384
  - Sample values: [-0.01758349  0.05359234 -0.00871514 -0.00626488  0.07676981]


## 3. Load Article Data

In [25]:
import json
import random

# Load articles
articles = load_articles("../sample_articles.json")

print(f"\nLoaded {len(articles)} articles")

# Pick a random article and preview it
print("\nRandom Sample article:")
selected_index = random.randint(0, len(articles) - 1)
print(json.dumps(articles[selected_index], indent=2))

Loaded 100 articles from ../sample_articles.json

Loaded 100 articles

Random Sample article:
{
  "id": 15574369,
  "item_source": "SKI_MAG",
  "item_title": "My Secret to Skiing 100 Days Injury-Free",
  "item_subtitle": "Want to maximize your season while minimizing pain and injuries? Yeah, who doesn't? Here's how one writer stays fit and healthy all season long.",
  "body_content": "Newton\u2019s third law of physics states that every action has an equal and opposite reaction. Metaphorically speaking, I\u2019ve found that this law also applies to skiing. I\u2019ve had powder days of such immense joy that life feels infinite. \u201cLive Forever\u201d is still my favorite Oasis song.\nBut then comes the reaction. I\u2019ve seen it take many forms in ski towns, including the loss of early compounding interest retirement savings, a lack of professional work experience, and failure to find a long-term partner, to name a few. However, for me, it took the form of chronic pain.\nDay in and d

## 4. Connect to Chroma

Make sure you've set up your `.env` file with:
```
CHROMA_API_KEY=your_api_key
CHROMA_TENANT=your_tenant_id
CHROMA_DATABASE=articles
```
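
The recorded output of the connection cell prints `Database: None`, which suggests `CHROMA_DATABASE` was not set when it ran. A minimal sketch of a stricter check, assuming all three variables should be required; `require_env` is a hypothetical helper and not part of chromadb or python-dotenv:

```python
import os
from typing import Mapping, Sequence


def require_env(names: Sequence[str], env: Mapping[str, str] = os.environ) -> dict:
    """Return the requested variables, raising if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Example with an explicit mapping instead of the real environment
sample_env = {"CHROMA_API_KEY": "key", "CHROMA_TENANT": "tenant"}
try:
    require_env(["CHROMA_API_KEY", "CHROMA_TENANT", "CHROMA_DATABASE"], sample_env)
except ValueError as exc:
    print(exc)  # names the variable that is not set
```

Passing the mapping explicitly keeps the helper easy to test; in the notebook it would be called after `load_dotenv()` with the default `os.environ`.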

In [6]:
import chromadb

# Get credentials from environment
CHROMA_API_KEY = os.getenv("CHROMA_API_KEY")
CHROMA_TENANT = os.getenv("CHROMA_TENANT")
CHROMA_DATABASE = os.getenv("CHROMA_DATABASE")  # My DB name is 'articles'

if not CHROMA_API_KEY or not CHROMA_TENANT:
    raise ValueError("Please set CHROMA_API_KEY and CHROMA_TENANT in .env file")

# Connect to Chroma Cloud
print("Connecting to Chroma Cloud...")
client = chromadb.CloudClient(
    api_key=CHROMA_API_KEY,
    tenant=CHROMA_TENANT,
    database=CHROMA_DATABASE
)

print("✓ Connected to Chroma Cloud")
print(f"  - Tenant: {CHROMA_TENANT}")
print(f"  - Database: {CHROMA_DATABASE}")

# List existing collections
collections = client.list_collections()
print(f"\nExisting collections: {[c.name for c in collections]}")

Connecting to Chroma Cloud...
✓ Connected to Chroma Cloud
  - Tenant: 3842bba4-7792-4f68-97af-a3be4ab6275b
  - Database: None

Existing collections: ['articles_chroma']


## 5. Create Collection

Chroma is **schema-less**: metadata fields can be added without declaring them in advance.
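
Schema-less does not mean type-less. A minimal sketch of how a raw record could be flattened before upserting, assuming metadata values need to be scalars (`str`, `int`, `float`, `bool`). `flatten_metadata` is a hypothetical helper; it is not the `get_article_metadata` function used in this notebook, whose implementation is not shown:

```python
from typing import Any


def flatten_metadata(raw: dict[str, Any]) -> dict[str, Any]:
    """Coerce a raw record into flat, scalar-valued metadata."""
    flat: dict[str, Any] = {}
    for key, value in raw.items():
        if value is None:
            continue  # skip missing values rather than storing None
        if isinstance(value, (list, tuple)):
            flat[key] = ", ".join(str(v) for v in value)  # e.g. tags -> "a, b"
        elif isinstance(value, (str, int, float, bool)):
            flat[key] = value
        else:
            flat[key] = str(value)  # fallback for nested or custom objects
    return flat


print(flatten_metadata({"title": "Example", "tags": ["Halloween", "Hiking"], "evergreen": True, "subtitle": None}))
```

Joining list values into one string is consistent with how tags appear in the query outputs later in this notebook, though it means tag filters would have to match the whole string.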

In [7]:
# Collection name
COLLECTION_NAME = "articles_chroma"

# Get the collection if it already exists, otherwise create it
collection = client.get_or_create_collection(name=COLLECTION_NAME)
print(f"✓ Using existing collection: {COLLECTION_NAME}")
print(f"  - Current count: {collection.count()} articles")

✓ Using existing collection: articles_chroma
  - Current count: 100 articles


## 6. Generate Embeddings and Upsert Data

We'll process articles in batches for efficiency.

Note: Chroma provides lightweight wrappers around popular embedding providers, but here we use our own embedding
function for flexibility and easy comparison across vector DBs.
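
The slicing pattern used in the next cell can be factored into a small generator. A sketch, where `batched` is a hypothetical helper (Python 3.12 also ships `itertools.batched`, which yields tuples instead of slices):

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


batches = list(batched(list(range(100)), 20))
print(len(batches))  # 100 items in batches of 20 gives 5 batches
```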

In [20]:
# Process in batches
BATCH_SIZE = 20
total_articles = len(articles)


print(f"Processing {total_articles} articles in batches of {BATCH_SIZE}...\n")

start_time = time.time()

for i in tqdm(range(0, total_articles, BATCH_SIZE), desc="Inserting batches"):
    batch = articles[i:i + BATCH_SIZE]
    
    # Generate embeddings for batch
    texts = [
        f"Title: {a['item_title']}\nSubtitle: {a.get('item_subtitle', '')}\nContent: {a['body_content'][:500]}"
        for a in batch
    ]
    embeddings = embedding_model.embed_batch(texts, show_progress=False)
    
    # Prepare metadata
    ids = [f"article_{a['id']}" for a in batch]
    metadatas = [get_article_metadata(a, db_type="chroma") for a in batch]
    
    # Upsert into Chroma (adds new IDs, overwrites existing ones)
    collection.upsert(
        ids=ids,
        embeddings=embeddings.tolist(),
        metadatas=metadatas
    )

elapsed_time = time.time() - start_time

print(f"\n✓ Successfully inserted {total_articles} articles")
print(f"  - Time taken: {elapsed_time:.2f} seconds")
print(f"  - Average: {elapsed_time/total_articles:.2f} seconds per article")

# Verify collection count
count = collection.count()
print(f"  - Collection count: {count}")

Processing 100 articles in batches of 20...




✓ Successfully inserted 100 articles
  - Time taken: 2.34 seconds
  - Average: 0.02 seconds per article
  - Collection count: 100


## 7. Semantic Search - Basic Query

In [8]:
# Test query
query_text = "Most haunted hikes in the US"

print(f"Query: '{query_text}'\n")

# Generate query embedding
query_embedding = embedding_model.embed_text(query_text)

# Search
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5,
    include=["metadatas", "distances"]
)

# Display results
print("Top 5 Results:\n")
for i, (metadata, distance) in enumerate(zip(results['metadatas'][0], results['distances'][0])):
    print(f"{i+1}. {metadata['title']}...")
    print(f"   Category: {metadata['category']} | Source: {metadata['source']}")
    print(f"   Distance: {distance:.4f}")
    print(f"   URL: {metadata['url']}...")

Query: 'Most haunted hikes in the US'

Top 5 Results:

1. 13 of the Most Haunted Hikes in the U.S....
   Category: Destinations | Source: OUTSIDE
   Distance: 0.3884
   URL: https://www.outsideonline.com/adventure-travel/destinations/haunted-hikes/...
2. A Missing Dog Helped a Stranded Hiker Return to Shadow Mountain Trail. Both Were Rescued....
   Category: Hiking | Source: OUTSIDE
   Distance: 1.0845
   URL: https://www.outsideonline.com/outdoor-adventure/hiking-and-backpacking/arizona-lost-hiker-missing-dog-shadow-mountain/...
3. An Inside Look at Outside’s 2025 Winter Editors’ Choice Testing Trip...
   Category: Gear | Source: OUTSIDE
   Distance: 1.2680
   URL: https://www.outsideonline.com/outdoor-gear/winter-editors-choice-trip-maine/...
4. Two Hikers in British Columbia Were Hospitalized After a Grizzly Sow Attack...
   Category: Hiking | Source: OUTSIDE
   Distance: 1.3272
   URL: https://www.outsideonline.com/outdoor-adventure/hiking-and-backpacking/two-hikers-in-british-colu
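
The distances above are easier to read as similarities. A sketch of the conversion, assuming the collection uses Chroma's default `l2` space (squared Euclidean distance) and that the embeddings are unit-normalized; the notebook confirms neither, and `squared_l2_to_cosine` is a hypothetical helper. For unit vectors the squared distance equals 2 - 2·cos, so cos = 1 - d/2:

```python
def squared_l2_to_cosine(distance: float) -> float:
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - distance / 2.0


for d in (0.3884, 1.0845, 1.2680, 1.3272):
    print(f"distance {d:.4f} -> cosine similarity {squared_l2_to_cosine(d):.4f}")
```

Under those assumptions the top hit has a cosine similarity of about 0.81, while the remaining results fall below 0.5, which matches the sharp drop in relevance after the first title.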

## 8. Metadata Filtering - Category

Because metadata needs no predefined schema, filtering only requires passing a `where` clause to `query`. Here we restrict results to a single category.

In [10]:
# Filter by category
query_text = "Women's Ironman World Championship"
target_category = "News"

print(f"Query: '{query_text}'")
print(f"Filter: category = '{target_category}'\n")

query_embedding = embedding_model.embed_text(query_text)

results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5,
    where={"category": target_category},
    include=["metadatas", "distances"]
)

print(f"Top 5 Results (Category: {target_category}):\n")
for i, (metadata, distance) in enumerate(zip(results['metadatas'][0], results['distances'][0])):
    print(f"{i+1}. {metadata['title']}...")
    print(f"   Category: {metadata['category']} | Source: {metadata['source']}")
    print(f"   Created: {metadata['created_at']}")
    print(f"   Distance: {distance:.4f}")
    print()

Query: 'Women's Ironman World Championship'
Filter: category = 'News'

Top 5 Results (Category: News):

1. After Joy of Women's-Only Ironman World Championship, Grief Sets In...
   Category: News | Source: TRIATHLETE
   Created: 1760296873
   Distance: 0.4460

2. What a Race! Here's Where the Ironman Pro Series Stands After the Ironman World Championship Drama...
   Category: News | Source: TRIATHLETE
   Created: 1760352009
   Distance: 0.7072

3. The Fastest Shoes at 2025 Ironman World Championship Kona...
   Category: News | Source: TRIATHLETE
   Created: 1760353908
   Distance: 0.7636

4. The DNF Files: 2025 Ironman World Championship Kona...
   Category: News | Source: TRIATHLETE
   Created: 1760441445
   Distance: 0.8166

5. In Sweltering Conditions, Norway’s Solveig Løvseth Takes 2025 Ironman World Championship Win...
   Category: News | Source: TRIATHLETE
   Created: 1760160735
   Distance: 0.8761



## 9. Metadata Filtering - Date Range

Find recent articles published on or after a specific date. Dates are stored as Unix timestamps, so the filter is a numeric `$gte` comparison.
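
The `date_string_to_timestamp` helper comes from `utils.date_utils` and is not shown here. A minimal sketch of an equivalent conversion, where `to_unix_timestamp` and its `tz` parameter are hypothetical. The timestamp printed in this section's output (1759906800) corresponds to midnight at UTC-7, so the original helper appears to interpret the date in the local timezone; passing the timezone explicitly avoids that dependence:

```python
from datetime import datetime, timedelta, timezone


def to_unix_timestamp(date_str: str, tz: timezone = timezone.utc) -> int:
    """Convert a 'YYYY-MM-DD' string to a Unix timestamp at midnight in `tz`."""
    midnight = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=tz)
    return int(midnight.timestamp())


print(to_unix_timestamp("2025-10-08"))  # midnight UTC
print(to_unix_timestamp("2025-10-08", timezone(timedelta(hours=-7))))  # midnight UTC-7
```

Whichever convention is chosen, the same one should be used when writing `created_at` and when building the cutoff, otherwise articles near the boundary can be included or excluded unexpectedly.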

In [24]:
from utils.date_utils import date_string_to_timestamp, timestamp_to_datetime_string

# Filter by date
query_text = "cycling deals"  # alternative: "simple exercise for my back pain"
cutoff_date = "2025-10-08"
cutoff_timestamp = date_string_to_timestamp(cutoff_date)

print(f"Query: '{query_text}'")
print(f"Filter: created_at >= '{cutoff_date}' (timestamp: {cutoff_timestamp})\n")

query_embedding = embedding_model.embed_text(query_text)

results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5,
    where={
        "created_at": {"$gte": cutoff_timestamp}
    },
    include=["metadatas", "distances"]
)

print(f"Top 5 Recent Results (after {cutoff_date}):\n")
for i, (metadata, distance) in enumerate(zip(results['metadatas'][0], results['distances'][0])):
    created_at = timestamp_to_datetime_string(metadata['created_at'])
    print(f"{i+1}. {metadata['title']}")
    print(f"   Category: {metadata['category']}")
    print(f"   Created: {created_at}")
    print(f"   Tags: {metadata['tags']}")

Query: 'cycling deals'
Filter: created_at >= '2025-10-08' (timestamp: 1759906800)

Top 5 Recent Results (after 2025-10-08):

1. Opinion: Cycling's Soccer-Inspired Relegation System Is a Hot Mess That Solves Nothing
   Category: Road Racing
   Created: 2025-10-15 22:42:10
   Tags: Analysis, ASO, Cofidis, Tour de France, Tour de Hoody
2. Deal: Tailwind Endurance Fuel Is the Cycling Nutrition I Actually Use
   Category: Road Gear
   Created: 2025-10-13 04:30:52
   Tags: Velo Deals
3. Pogačar's Bonuses and Brand Deals Revealed: Inside His $14 Million Pay Check
   Category: Road Racing
   Created: 2025-10-13 20:39:12
   Tags: Alex Carera, Remco Evenepoel, Tadej Pogačar, Transfers, UAE Emirates
4. Shop Evo's Anniversary Sale and Save up to 50% on Ski, Snowboard, and MTB Gear
   Category: Gear News
   Created: 2025-10-14 03:53:27
   Tags: Commerce, Deals
5. Deal: One of the Best Headphones for Cycling Is 50% Off
   Category: Road Gear
   Created: 2025-10-15 05:12:34
   Tags: headphones, Velo 

## 10. Combined Filters - Evergreen AND Date

Chroma supports compound boolean filters through the `$and` and `$or` operators.
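
A sketch of composing such filters programmatically. Chroma generally expects a single top-level key in `where`, so multiple conditions need to be wrapped in `$and` or `$or`; `combine_filters` is a hypothetical helper, not part of chromadb:

```python
from typing import Any, Optional


def combine_filters(*conditions: dict[str, Any], operator: str = "$and") -> Optional[dict[str, Any]]:
    """Combine metadata conditions into one `where` clause."""
    if operator not in ("$and", "$or"):
        raise ValueError("operator must be '$and' or '$or'")
    active = [c for c in conditions if c]  # ignore empty conditions
    if not active:
        return None  # no filter at all
    if len(active) == 1:
        return active[0]  # a lone condition needs no wrapper
    return {operator: active}


where = combine_filters({"evergreen": True}, {"created_at": {"$gte": 1759993200}})
print(where)
```

Returning the lone condition unwrapped keeps optional filters simple: conditions can be appended as the user selects them without special-casing the single-filter query.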

In [26]:
from utils.date_utils import date_string_to_timestamp, timestamp_to_datetime_string

# Combine multiple filters: Boolean (evergreen) + Date range
query_text = "Halloween outdoor activities"
cutoff_date = "2025-10-09"
cutoff_timestamp = date_string_to_timestamp(cutoff_date)

print(f"Query: '{query_text}'")
print("Filters:")
print("  - evergreen = True (timeless content)")
print(f"  - created_at >= '{cutoff_date}' (timestamp: {cutoff_timestamp})\n")

query_embedding = embedding_model.embed_text(query_text)

results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=10,  # Increased to 10 since evergreen articles might be fewer
    where={
        "$and": [
            {"evergreen": True},
            {"created_at": {"$gte": cutoff_timestamp}}
        ]
    },
    include=["metadatas", "distances"]
)

if results['metadatas'][0]:
    print(f"Top Evergreen Results (After {cutoff_date}):\n")
    for i, (metadata, distance) in enumerate(zip(results['metadatas'][0], results['distances'][0])):
        created_at = timestamp_to_datetime_string(metadata['created_at'])

        print(f"{i+1}. {metadata['title']}...")
        print(f"   Category: {metadata['category']} | Evergreen: {metadata['evergreen']}")
        print(f"   Tags: {metadata.get('tags', 'No tags')}")
        print(f"   Created: {created_at}")
    print(f"Total results: {len(results['metadatas'][0])}")
else:
    print("No evergreen articles found after this date.")

Query: 'Halloween outdoor activities'
Filters:
  - evergreen = True (timeless content)
  - created_at >= '2025-10-09' (timestamp: 1759993200)

Top Evergreen Results (After 2025-10-09):

1. 13 of the Most Haunted Hikes in the U.S....
   Category: Destinations | Evergreen: True
   Tags: evergreen, Halloween, Hiking
   Created: 2025-10-16 04:22:41
2. The Thule Outset Hitch-Mounted Tent Turns Your Car Into a Campsite on Wheels...
   Category: Camping | Evergreen: True
   Tags: 2025 Gear Reviews, Car Camping, Car Racks, Commerce, evergreen
   Created: 2025-10-14 03:30:11
3. The Best Daypacks for Every Kind of Hiker (2025)...
   Category: Daypacks | Evergreen: True
   Tags: 2025 Gear Reviews, 2025 Summer Gear Guide, backpack, Commerce, Day Packs
   Created: 2025-10-16 04:31:44
4. Everything You Need To Know Before Skiing Telluride For The First Time...
   Category: Resort Skiing | Evergreen: True
   Tags: evergreen, Telluride Ski Resort
   Created: 2025-10-13 07:39:24
5. He’s Hunted for Elk 

## 11. Performance Summary

Let's benchmark query performance.
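
`benchmark_queries` comes from `utils.benchmark` and is not shown. A rough sketch of how such a timing loop could look, with an optional warm-up call; `time_queries` and its parameters are hypothetical. The first query in the recorded output (642.7ms) is much slower than the rest, which may reflect connection or cache warm-up rather than steady-state latency:

```python
import time
from statistics import mean
from typing import Callable, Sequence


def time_queries(query_fn: Callable[[str], object], queries: Sequence[str], warmup: bool = True) -> dict:
    """Time each query in milliseconds and summarize the results."""
    if not queries:
        raise ValueError("at least one query is required")
    if warmup:
        query_fn(queries[0])  # untimed call to absorb one-time setup cost
    timings_ms = []
    for query in queries:
        start = time.perf_counter()
        query_fn(query)
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "avg_ms": mean(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
        "timings_ms": timings_ms,
    }


# Stand-in query function so the sketch runs without a Chroma connection
summary = time_queries(lambda q: len(q), ["outdoor hiking adventures", "cycling race performance"])
print(f"avg {summary['avg_ms']:.3f}ms over {len(summary['timings_ms'])} queries")
```

Note that `chroma_query_fn` embeds the query inside the timed call, so the reported times include local embedding as well as the round trip to Chroma Cloud.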

In [28]:
from utils.benchmark import benchmark_queries

# Define query function for Chroma
def chroma_query_fn(query_text: str):
    """Query function for Chroma benchmarking."""
    query_embedding = embedding_model.embed_text(query_text)
    return collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=10
    )

# Run standardized benchmark
results = benchmark_queries(chroma_query_fn)

Running performance benchmark...

'outdoor hiking adventures' -> 642.7ms
'cycling race performance' -> 137.5ms
'travel destinations and tips' -> 298.3ms
'fitness training techniques' -> 183.5ms
'gear reviews and recommendations' -> 115.0ms

Performance Summary:
  - Average query time: 275.4ms
  - Min query time: 115.0ms
  - Max query time: 642.7ms


## 12. Key Takeaways: Chroma

### Pros ✅
1. **Schema-less metadata** - Add any fields without pre-planning
2. **Simple API** - Easy to use, minimal boilerplate
3. **Fast setup** - Quickest to get started
4. **Good free tier** - $5 credits, 5GB storage

### Cons ⚠️
1. **Credits run out** - $5 may not last long with heavy usage
2. **Fewer enterprise features** - Compared to Milvus/Weaviate

### Best For
- Prototyping and MVPs
- Projects with evolving schemas

### Metadata Filtering
- Metadata filtering is well supported: equality, range (`$gte`), and boolean (`$and`, `$or`) conditions all worked in this notebook
- Datetime filtering works if dates are stored as Unix timestamps

In [None]:
# Uncomment to delete collection
# client.delete_collection(name=COLLECTION_NAME)
# print(f"Deleted collection: {COLLECTION_NAME}")