# ChromaDB Query Examples for Product Recommendations

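This notebook demonstrates ChromaDB query patterns for our outdoor product recommendation system: semantic search, metadata filters, hybrid queries, similar-product lookup, and pandas-based analysis of the results.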

## Setup

In [1]:
import chromadb
from chromadb.config import Settings
import pandas as pd
from pprint import pprint

# Initialize ChromaDB client
client = chromadb.PersistentClient(path="../chroma_db")
collection = client.get_collection(name="outdoor_products")

print(f"Collection: {collection.name}")
print(f"Total products: {collection.count()}")

Collection: outdoor_products
Total products: 300


## 1. Semantic Search (Vector Search)

Use natural language queries to find semantically similar products.

In [2]:
# Example 1: Search for warm winter jackets
results = collection.query(
    query_texts=["warm insulated jacket for cold winter weather"],
    n_results=5
)

print("=== Warm Winter Jackets ===")
for i, (metadata, distance) in enumerate(zip(results['metadatas'][0], results['distances'][0]), 1):
    print(f"\n{i}. {metadata['product_name']}")
    print(f"   Brand: {metadata['brand']}")
    print(f"   Category: {metadata['subcategory']}")
    print(f"   Price: ${metadata['price_usd']}")
    print(f"   Insulation: {metadata['insulation']}")
    print(f"   Similarity Score: {1 - distance:.3f}")

=== Warm Winter Jackets ===

1. Heatzone StormPro Down Jacket
   Brand: TrailForge
   Category: Down Jackets
   Price: $309.0
   Insulation: Insulated
   Similarity Score: 0.170

2. Heatzone GlacierGrid Down Jacket
   Brand: NorthPeak
   Category: Down Jackets
   Price: $314.0
   Insulation: Insulated
   Similarity Score: 0.161

3. Heatzone CascadeGrid Down Jacket
   Brand: NorthPeak
   Category: Down Jackets
   Price: $236.0
   Insulation: Insulated
   Similarity Score: 0.137

4. Omni-Heat Core StormPro Down Jacket
   Brand: NorthPeak
   Category: Down Jackets
   Price: $425.0
   Insulation: Insulated
   Similarity Score: 0.114

5. Heatzone AtlasGTX Down Jacket
   Brand: TrailForge
   Category: Down Jackets
   Price: $298.0
   Insulation: Insulated
   Similarity Score: 0.107
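
The cell above reports `1 - distance` as a similarity score. Whether that number is a true similarity depends on the distance space the collection was created with: Chroma defaults to squared L2, and `1 - distance` equals cosine similarity only for a cosine collection. The sketch below shows one way to convert, assuming the space is known. The helper name `distance_to_similarity` is hypothetical, and the L2 branch additionally assumes unit-length embeddings.

```python
def distance_to_similarity(distance: float, space: str = "l2") -> float:
    """Convert a Chroma distance into a cosine-style similarity (hypothetical helper).

    Assumptions: `space` is the collection's configured distance space, and for
    "l2" the embeddings are unit-length (common for sentence-embedding models,
    but not guaranteed).
    """
    if space == "cosine":
        # Cosine distance is defined as 1 - cosine similarity.
        return 1.0 - distance
    if space == "ip":
        # Inner-product distance is defined as 1 - dot product.
        return 1.0 - distance
    if space == "l2":
        # Squared L2 between unit vectors: d = 2 - 2*cos, so cos = 1 - d/2.
        return 1.0 - distance / 2.0
    raise ValueError(f"unknown distance space: {space!r}")


# A squared-L2 distance of 0.83 between unit vectors corresponds to a cosine
# similarity of about 0.585, whereas `1 - distance` would report 0.17.
print(round(distance_to_similarity(0.83, "l2"), 3))
```

Either way the ranking is unchanged, since every conversion here is monotonic in the distance; only the absolute score shown to the reader differs.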


In [3]:
# Additional example: hiking boots for rocky terrain
results = collection.query(
    query_texts=["best hiking boots for rocky terrain"],
    n_results=5
)

print("=== Best Hiking Boots ===")
for i, (metadata, distance) in enumerate(zip(results['metadatas'][0], results['distances'][0]), 1):
    print(f"\n{i}. {metadata['product_name']}")
    print(f"   Brand: {metadata['brand']}")
    print(f"   Category: {metadata['subcategory']}")
    print(f"   Price: ${metadata['price_usd']}")
    print(f"   Insulation: {metadata['insulation']}")
    print(f"   Similarity Score: {1 - distance:.3f}")

=== Best Hiking Boots ===

1. PeakFreak SummitX Hiking boots/shoe
   Brand: AlpineCo
   Category: Hiking boots/shoes
   Price: $145.0
   Insulation: nan
   Similarity Score: 0.362

2. PeakFreak CascadeGrid Hiking boots/shoe
   Brand: AlpineCo
   Category: Hiking boots/shoes
   Price: $220.0
   Insulation: nan
   Similarity Score: 0.335

3. PeakFreak PioneerX Hiking boots/shoe
   Brand: TrailForge
   Category: Hiking boots/shoes
   Price: $154.0
   Insulation: nan
   Similarity Score: 0.316

4. Facet Trail CascadePro Hiking boots/shoe
   Brand: AlpineCo
   Category: Hiking boots/shoes
   Price: $206.0
   Insulation: nan
   Similarity Score: 0.302

5. Newton Ridge SierraPrime Hiking boots/shoe
   Brand: AlpineCo
   Category: Hiking boots/shoes
   Price: $251.0
   Insulation: nan
   Similarity Score: 0.277


In [None]:
# Example 2: Search for lightweight travel gear
results = collection.query(
    query_texts=["lightweight packable jacket for travel and city commute"],
    n_results=5
)

print("=== Lightweight Travel Jackets ===")
for i, metadata in enumerate(results['metadatas'][0], 1):
    print(f"{i}. {metadata['product_name']} - ${metadata['price_usd']}")
    print(f"   Purpose: {metadata['primary_purpose']} | Season: {metadata['season']}")

In [4]:
# Example 3: Search for hiking gear
results = collection.query(
    query_texts=["waterproof breathable jacket for mountain hiking"],
    n_results=5
)

print("=== Hiking Jackets ===")
for i, metadata in enumerate(results['metadatas'][0], 1):
    print(f"{i}. {metadata['product_name']}")
    print(f"   Waterproofing: {metadata['waterproofing']} | Material: {metadata['material']}")

=== Hiking Jackets ===
1. Whirlibird AtlasGTX Raincoats/Shell Jacket
   Waterproofing: Waterproof | Material: eVent
2. Whirlibird StormThermo Raincoats/Shell Jacket
   Waterproofing: Waterproof | Material: Gore-Tex
3. Whirlibird AeroShield Raincoats/Shell Jacket
   Waterproofing: Waterproof | Material: Gore-Tex
4. Whirlibird GlacierPro Raincoats/Shell Jacket
   Waterproofing: Waterproof | Material: eVent
5. Whirlibird CascadePro Raincoats/Shell Jacket
   Waterproofing: Waterproof | Material: eVent
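
Every cell in this section indexes into `results['metadatas'][0]` and `results['distances'][0]`, because `collection.query` returns one inner list per query text. A small flattening helper can make that nesting explicit. This is a minimal sketch: `results_to_rows` is a hypothetical name, and the dictionary below is a made-up stand-in for a real query result.

```python
def results_to_rows(results: dict, query_index: int = 0) -> list[dict]:
    """Flatten one query's results into a list of row dicts (hypothetical helper)."""
    ids = results["ids"][query_index]
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]

    rows = []
    for rank, (item_id, metadata, distance) in enumerate(zip(ids, metadatas, distances), 1):
        row = {"rank": rank, "id": item_id, "distance": distance}
        row.update(metadata or {})  # metadata fields become top-level columns
        rows.append(row)
    return rows


# Stand-in for a `collection.query(...)` return value; the values are made up.
fake_results = {
    "ids": [["a", "b"]],
    "metadatas": [[{"product_name": "Jacket A"}, {"product_name": "Jacket B"}]],
    "distances": [[0.4, 0.7]],
}
rows = results_to_rows(fake_results)
for row in rows:
    print(row["rank"], row["product_name"], row["distance"])
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)` if a tabular view is preferred over print loops.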


## 2. Filter-Based Search (Keyword/Metadata Search)

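Use exact metadata filters with `collection.get` to find products matching specific criteria. No embedding or similarity ranking is involved, so results come back unranked.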

In [5]:
# Example 1: Filter by single attribute - Brand
results = collection.get(
    where={"brand": {"$eq": "NorthPeak"}},
    limit=10
)

print(f"=== NorthPeak Products ({len(results['metadatas'])} found) ===")
for i, metadata in enumerate(results['metadatas'][:5], 1):
    print(f"{i}. {metadata['product_name']} - ${metadata['price_usd']}")

=== NorthPeak Products (10 found) ===
1. Whirlibird TerraLite Raincoats/Shell Jacket - $199.0
2. Whirlibird GlacierPro Raincoats/Shell Jacket - $202.0
3. Delta Ridge SummitFlex Down Jacket - $371.0
4. Delta Ridge AeroCore Vest - $93.0
5. Powder Lite SummitPro Bombers/Softshell - $122.0


In [6]:
# Example 2: Filter by multiple attributes - Women's Down Jackets
results = collection.get(
    where={
        "$and": [
            {"gender": {"$eq": "Women"}},
            {"subcategory": {"$eq": "Down Jackets"}}
        ]
    },
    limit=10
)

print(f"=== Women's Down Jackets ({len(results['metadatas'])} found) ===")
for i, metadata in enumerate(results['metadatas'], 1):
    print(f"{i}. {metadata['product_name']}")
    print(f"   Brand: {metadata['brand']} | Price: ${metadata['price_usd']} | Rating: {metadata['rating']}")

=== Women's Down Jackets (7 found) ===
1. Delta Ridge SummitFlex Down Jacket
   Brand: NorthPeak | Price: $371.0 | Rating: 4.5
2. Delta Ridge AuroraFlex Down Jacket
   Brand: TrailForge | Price: $346.0 | Rating: 4.7
3. Heatzone GlacierGrid Down Jacket
   Brand: NorthPeak | Price: $314.0 | Rating: 4.4
4. Heatzone VentureCore Down Jacket
   Brand: AlpineCo | Price: $316.0 | Rating: 4.6
5. Heatzone StormPro Down Jacket
   Brand: TrailForge | Price: $309.0 | Rating: 4.7
6. Omni-Heat Core SierraGTX Down Jacket
   Brand: AlpineCo | Price: $402.0 | Rating: 4.6
7. Omni-Heat Core VenturePro Down Jacket
   Brand: TrailForge | Price: $352.0 | Rating: 4.6


In [None]:
# Example 3: Filter by price range - Budget jackets under $200
results = collection.get(
    where={
        "$and": [
            {"price_usd": {"$lt": 200}},
            {"category": {"$eq": "Outerwear"}}
        ]
    },
    limit=10
)

print(f"=== Budget Outerwear Under $200 ({len(results['metadatas'])} found) ===")
for i, metadata in enumerate(results['metadatas'][:10], 1):
    print(f"{i}. {metadata['product_name']} - ${metadata['price_usd']}")
    print(f"   {metadata['subcategory']} | Rating: {metadata['rating']}")

In [None]:
# Example 4: Filter by season and waterproofing
results = collection.get(
    where={
        "$and": [
            {"season": {"$eq": "All-season"}},
            {"waterproofing": {"$eq": "Waterproof"}}
        ]
    },
    limit=10
)

print(f"=== All-Season Waterproof Products ({len(results['metadatas'])} found) ===")
for i, metadata in enumerate(results['metadatas'][:5], 1):
    print(f"{i}. {metadata['product_name']}")
    print(f"   Material: {metadata['material']} | Purpose: {metadata['primary_purpose']}")
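
The filters in this section follow one pattern: a single condition is passed as-is, while several conditions are wrapped in `$and`. The sketch below builds such equality filters from keyword arguments. `build_where` is a hypothetical helper, not part of ChromaDB, and it relies on the assumption that `$and` needs at least two clauses, so a lone condition is returned unwrapped.

```python
def build_where(**equals):
    """Build an equality-only `where` filter from keyword arguments (hypothetical helper)."""
    conditions = [{field: {"$eq": value}} for field, value in equals.items()]
    if not conditions:
        return None  # no filter at all
    if len(conditions) == 1:
        return conditions[0]  # a single condition is not wrapped in $and
    return {"$and": conditions}


print(build_where(brand="NorthPeak"))
print(build_where(gender="Women", subcategory="Down Jackets"))
```

The returned value is meant to be passed as `where=` to `collection.get` or `collection.query`, for example `collection.get(where=build_where(season="All-season", waterproofing="Waterproof"), limit=10)`.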

## 3. Hybrid Search (Semantic + Filters)

Combine natural language queries with metadata filters for precise results.

In [None]:
# Example 1: Search for warm jackets for women
results = collection.query(
    query_texts=["warm insulated jacket for cold weather"],
    where={"gender": {"$eq": "Women"}},
    n_results=5
)

print("=== Warm Women's Jackets ===")
for i, metadata in enumerate(results['metadatas'][0], 1):
    print(f"{i}. {metadata['product_name']}")
    print(f"   Brand: {metadata['brand']} | Price: ${metadata['price_usd']}")
    print(f"   Insulation: {metadata['insulation']} | Season: {metadata['season']}")

In [None]:
# Example 2: Search for hiking gear from specific brand
results = collection.query(
    query_texts=["jacket for trail hiking in mountains"],
    where={"brand": {"$eq": "TrailForge"}},
    n_results=5
)

print("=== TrailForge Hiking Jackets ===")
for i, metadata in enumerate(results['metadatas'][0], 1):
    print(f"{i}. {metadata['product_name']} - ${metadata['price_usd']}")
    print(f"   Purpose: {metadata['primary_purpose']} | Terrain: {metadata['terrain']}")

In [None]:
# Example 3: Search for affordable waterproof jackets
results = collection.query(
    query_texts=["waterproof rain jacket"],
    where={
        "$and": [
            {"waterproofing": {"$eq": "Waterproof"}},
            {"price_usd": {"$lt": 250}}
        ]
    },
    n_results=5
)

print("=== Affordable Waterproof Jackets (Under $250) ===")
for i, metadata in enumerate(results['metadatas'][0], 1):
    print(f"{i}. {metadata['product_name']} - ${metadata['price_usd']}")
    print(f"   Rating: {metadata['rating']} | Material: {metadata['material']}")

In [None]:
# Example 4: Search for winter gear for specific activity
results = collection.query(
    query_texts=["jacket for skiing and snowboarding"],
    where={
        "$and": [
            {"season": {"$eq": "Winter"}},
            {"insulation": {"$eq": "Insulated"}}
        ]
    },
    n_results=5
)

print("=== Insulated Winter Jackets ===")
for i, metadata in enumerate(results['metadatas'][0], 1):
    print(f"{i}. {metadata['product_name']}")
    print(f"   Weather: {metadata['weather_profile']} | Price: ${metadata['price_usd']}")
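
Conceptually, a hybrid query restricts the candidate set with the metadata filter and ranks only the surviving items by vector similarity. The sketch below models that idea in plain Python with toy two-dimensional vectors. It is an illustration of the concept under simplifying assumptions, not ChromaDB's internal implementation, and `hybrid_search` and `cosine_similarity` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def hybrid_search(items, query_vector, predicate, n_results=5):
    """Keep items whose metadata passes `predicate`, then rank them by similarity."""
    candidates = [item for item in items if predicate(item["metadata"])]
    candidates.sort(
        key=lambda item: cosine_similarity(item["vector"], query_vector),
        reverse=True,
    )
    return candidates[:n_results]


# Toy 2-d "embeddings"; real embeddings come from the collection's embedding function.
items = [
    {"id": "p1", "vector": [1.0, 0.0], "metadata": {"gender": "Women"}},
    {"id": "p2", "vector": [0.9, 0.1], "metadata": {"gender": "Men"}},
    {"id": "p3", "vector": [0.0, 1.0], "metadata": {"gender": "Women"}},
]
hits = hybrid_search(items, [1.0, 0.0], lambda m: m["gender"] == "Women", n_results=2)
print([hit["id"] for hit in hits])
```

Here `p2` is the second-closest vector to the query but never appears, because the filter removes it before ranking. That is why a hybrid query can return fewer, or less similar, results than the same query without a filter.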

## 4. Similar Product Search

Find products similar to a specific item based on vector similarity.

In [7]:
# First, let's get a product to use as reference
reference_results = collection.get(
    where={"subcategory": {"$eq": "Down Jackets"}},
    limit=1
)

reference_product = reference_results['metadatas'][0]
reference_id = reference_results['ids'][0]
reference_doc = reference_results['documents'][0]

print("=== Reference Product ===")
print(f"Name: {reference_product['product_name']}")
print(f"Brand: {reference_product['brand']}")
print(f"Category: {reference_product['subcategory']}")
print(f"Price: ${reference_product['price_usd']}")
print(f"\nID: {reference_id}")

=== Reference Product ===
Name: Delta Ridge SummitFlex Down Jacket
Brand: NorthPeak
Category: Down Jackets
Price: $371.0

ID: PRD-27E167FA


In [9]:
reference_doc

'Delta Ridge SummitFlex Down Jacket\n        Brand: NorthPeak\n        Category: Outerwear - Down Jackets\n        Description: Delta Ridge SummitFlex Down Jacket engineered for travel on airport/city in variable conditions. Built with Down/Pertex and weather resistant protection; insulation: insulated.\n        Gender: Women\n        Material: Down/Pertex\n        Season: Winter\n        Purpose: Travel\n        Weather: Variable\n        Terrain: Airport/City\n        Features: Waterproofing=Weather Resistant, Insulation=Insulated\n        Price: $371\n        Color: Crimson'

In [8]:
# Find similar products using the document text
similar_results = collection.query(
    query_texts=[reference_doc],
    n_results=6  # +1 because it includes the reference product
)

print("=== Similar Products ===")
rank = 0
for metadata, distance in zip(similar_results['metadatas'][0], similar_results['distances'][0]):
    # Skip the reference product itself (its distance will be 0 or very close to it)
    if metadata['product_id'] == reference_product['product_id']:
        continue

    # Number only the products actually shown, wherever the reference appears in the results
    rank += 1
    print(f"\n{rank}. {metadata['product_name']}")
    print(f"   Brand: {metadata['brand']} | Price: ${metadata['price_usd']}")
    print(f"   Category: {metadata['subcategory']}")
    print(f"   Similarity: {1 - distance:.3f}")

=== Similar Products ===

1. Delta Ridge AuroraFlex Down Jacket
   Brand: TrailForge | Price: $346.0
   Category: Down Jackets
   Similarity: 0.750

2. Delta Ridge RidgePrime Down Jacket
   Brand: AlpineCo | Price: $446.0
   Category: Down Jackets
   Similarity: 0.706

3. Delta Ridge AeroCore Vest
   Brand: NorthPeak | Price: $93.0
   Category: Vests
   Similarity: 0.615

4. Heatzone CascadeGrid Down Jacket
   Brand: NorthPeak | Price: $236.0
   Category: Down Jackets
   Similarity: 0.586

5. Heatzone GlacierGrid Down Jacket
   Brand: NorthPeak | Price: $314.0
   Category: Down Jackets
   Similarity: 0.564
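
The similar-product cell requests one extra result and skips the reference item inside the print loop. The same idea can be factored into a helper that excludes by collection id and trims to the requested count. This is a minimal sketch: `top_k_excluding` is a hypothetical name, and the dictionary is a made-up stand-in for a query result.

```python
def top_k_excluding(results: dict, exclude_id: str, k: int = 5) -> list[tuple]:
    """Return up to k (id, metadata, distance) tuples, skipping `exclude_id`."""
    triples = zip(results["ids"][0], results["metadatas"][0], results["distances"][0])
    kept = [triple for triple in triples if triple[0] != exclude_id]
    return kept[:k]


# Stand-in for `collection.query(query_texts=[reference_doc], n_results=4)`.
fake_results = {
    "ids": [["PRD-REF", "PRD-1", "PRD-2", "PRD-3"]],
    "metadatas": [[{"n": "ref"}, {"n": "one"}, {"n": "two"}, {"n": "three"}]],
    "distances": [[0.0, 0.25, 0.3, 0.4]],
}
similar = top_k_excluding(fake_results, exclude_id="PRD-REF", k=2)
for rank, (item_id, metadata, distance) in enumerate(similar, 1):
    print(rank, item_id, metadata["n"], distance)
```

Because the exclusion happens before the slice, the helper returns `k` items whether or not the reference product appears in the results, provided the query asked for at least `k + 1`.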


## 5. Advanced Filters

Use logical operators for complex queries.

In [None]:
# Example 1: OR condition - Either AlpineCo or TrailForge brands
results = collection.get(
    where={
        "$or": [
            {"brand": {"$eq": "AlpineCo"}},
            {"brand": {"$eq": "TrailForge"}}
        ]
    },
    limit=10
)

print(f"=== AlpineCo or TrailForge Products ({len(results['metadatas'])} found) ===")
for i, metadata in enumerate(results['metadatas'][:5], 1):
    print(f"{i}. {metadata['product_name']} - {metadata['brand']}")

In [None]:
# Example 2: Complex nested conditions
results = collection.get(
    where={
        "$and": [
            {
                "$or": [
                    {"gender": {"$eq": "Women"}},
                    {"gender": {"$eq": "Unisex"}}
                ]
            },
            {"price_usd": {"$lte": 300}},
            {"rating": {"$gte": 4.5}}
        ]
    },
    limit=10
)

print(f"=== Women's/Unisex, $300 or Less, High-Rated (4.5+) ({len(results['metadatas'])} found) ===")
for i, metadata in enumerate(results['metadatas'], 1):
    print(f"{i}. {metadata['product_name']}")
    print(f"   Gender: {metadata['gender']} | Price: ${metadata['price_usd']} | Rating: {metadata['rating']}")

In [None]:
# Example 3: Price range filtering
results = collection.get(
    where={
        "$and": [
            {"price_usd": {"$gte": 200}},
            {"price_usd": {"$lte": 350}},
            {"category": {"$eq": "Outerwear"}}
        ]
    },
    limit=10
)

print(f"=== Mid-Range Outerwear ($200-$350) ({len(results['metadatas'])} found) ===")
for i, metadata in enumerate(results['metadatas'][:10], 1):
    print(f"{i}. {metadata['product_name']} - ${metadata['price_usd']}")
    print(f"   {metadata['subcategory']} | {metadata['brand']}")
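
To make the semantics of nested `$and`/`$or` filters concrete, the sketch below evaluates a filter against a single metadata dict in plain Python. It covers only the comparison and logical operators used in this notebook. It is a local model for reasoning and testing, not ChromaDB's implementation, and treating a missing field as a non-match is an assumption.

```python
import operator

COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def matches(metadata: dict, where: dict) -> bool:
    """Evaluate a where-style filter against one metadata dict (local sketch)."""
    key, value = next(iter(where.items()))  # each filter level has one key
    if key == "$and":
        return all(matches(metadata, clause) for clause in value)
    if key == "$or":
        return any(matches(metadata, clause) for clause in value)
    if key not in metadata:
        return False  # assumption: a missing field never matches
    op, operand = next(iter(value.items()))
    return COMPARATORS[op](metadata[key], operand)


where = {
    "$and": [
        {"$or": [{"gender": {"$eq": "Women"}}, {"gender": {"$eq": "Unisex"}}]},
        {"price_usd": {"$lte": 300}},
        {"rating": {"$gte": 4.5}},
    ]
}
print(matches({"gender": "Unisex", "price_usd": 250.0, "rating": 4.6}, where))
print(matches({"gender": "Men", "price_usd": 250.0, "rating": 4.6}, where))
```

A helper like this can be handy for unit-testing filter dictionaries against a few hand-written metadata records before running them against the collection.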

## 6. Aggregation and Analysis

Use pandas to analyze the retrieved results.

In [None]:
# Get all products and convert to DataFrame
all_results = collection.get(
    limit=collection.count()  # Request as many items as the collection holds
)

df = pd.DataFrame(all_results['metadatas'])
print(f"Total products in database: {len(df)}")
print(f"\nDataFrame shape: {df.shape}")
df.head()
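
Fetching everything in one call is fine for a few hundred products. For larger collections, one option is to page through `collection.get` using its `limit` and `offset` parameters. The sketch below assumes the item order is stable between calls. `fetch_all_metadatas` is a hypothetical helper, and `FakeCollection` is a stand-in used only to exercise it without a database.

```python
def fetch_all_metadatas(collection, page_size: int = 100) -> list[dict]:
    """Page through `collection.get` with limit/offset until a short page is returned."""
    metadatas = []
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset, include=["metadatas"])
        batch = page["metadatas"]
        metadatas.extend(batch)
        if len(batch) < page_size:
            break  # a short (or empty) page means there is nothing left
        offset += page_size
    return metadatas


class FakeCollection:
    """Stand-in that mimics only the `get(limit, offset, include)` call used above."""

    def __init__(self, count):
        self._metadatas = [{"product_id": f"PRD-{i}"} for i in range(count)]

    def get(self, limit=None, offset=0, include=None):
        end = None if limit is None else offset + limit
        return {"metadatas": self._metadatas[offset:end]}


print(len(fetch_all_metadatas(FakeCollection(250), page_size=100)))
```

With the real collection, `pd.DataFrame(fetch_all_metadatas(collection))` would produce the same DataFrame as the cell above.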

In [None]:
# Analyze products by category
print("=== Products by Subcategory ===")
print(df['subcategory'].value_counts())

In [None]:
# Price statistics by brand
print("=== Price Statistics by Brand ===")
price_stats = df.groupby('brand')['price_usd'].agg(['count', 'mean', 'min', 'max'])
print(price_stats.round(2))

In [None]:
# Average rating by subcategory
print("=== Average Rating by Subcategory ===")
rating_by_category = df.groupby('subcategory')['rating'].mean().sort_values(ascending=False)
print(rating_by_category.round(2))

In [None]:
# Products by gender distribution
print("=== Products by Gender ===")
print(df['gender'].value_counts())
print("\nPercentage:")
print((df['gender'].value_counts() / len(df) * 100).round(1))

In [None]:
# Top rated products
print("=== Top 10 Highest Rated Products ===")
top_rated = df.nlargest(10, 'rating')[['product_name', 'brand', 'subcategory', 'price_usd', 'rating']]
for i, row in enumerate(top_rated.itertuples(), 1):
    print(f"{i}. {row.product_name}")
    print(f"   {row.brand} | {row.subcategory} | ${row.price_usd} | Rating: {row.rating}")

## 7. Custom Query Helper Functions

Reusable functions for common query patterns.

In [None]:
def search_by_price_range(min_price, max_price, n_results=10):
    """Search products within a price range."""
    results = collection.get(
        where={
            "$and": [
                {"price_usd": {"$gte": min_price}},
                {"price_usd": {"$lte": max_price}}
            ]
        },
        limit=n_results
    )
    return results

# Test the function
results = search_by_price_range(100, 200, n_results=5)
print("=== Products Between $100-$200 ===")
for i, metadata in enumerate(results['metadatas'], 1):
    print(f"{i}. {metadata['product_name']} - ${metadata['price_usd']}")

In [None]:
def recommend_for_activity(activity, n_results=5):
    """Recommend products for a specific activity."""
    results = collection.query(
        query_texts=[f"gear and clothing for {activity}"],
        n_results=n_results
    )
    return results

# Test the function
results = recommend_for_activity("alpine climbing", n_results=5)
print("=== Recommended for Alpine Climbing ===")
for i, (metadata, distance) in enumerate(zip(results['metadatas'][0], results['distances'][0]), 1):
    print(f"{i}. {metadata['product_name']}")
    print(f"   Purpose: {metadata['primary_purpose']} | Similarity: {1-distance:.3f}")

In [None]:
def find_best_value(subcategory, max_price=300):
    """Find highest-rated products in a category under a price limit."""
    results = collection.get(
        where={
            "$and": [
                {"subcategory": {"$eq": subcategory}},
                {"price_usd": {"$lte": max_price}}
            ]
        },
        limit=50
    )
    
    # Sort by rating
    df = pd.DataFrame(results['metadatas'])
    if len(df) > 0:
        return df.nlargest(5, 'rating')
    return df

# Test the function
best_value = find_best_value("Down Jackets", max_price=350)
print("=== Best Value Down Jackets ($350 or Less) ===")
for i, row in enumerate(best_value.itertuples(), 1):
    print(f"{i}. {row.product_name}")
    print(f"   ${row.price_usd} | Rating: {row.rating} | {row.brand}")

## Summary

This notebook demonstrated:

1. **Semantic Search** - Natural language queries using vector embeddings
2. **Filter-Based Search** - Exact metadata filtering with logical operators
3. **Hybrid Search** - Combining semantic search with filters
4. **Similar Products** - Finding related items based on vector similarity
5. **Advanced Filters** - Complex nested conditions with $and/$or
6. **Analysis** - Using pandas to aggregate and analyze results
7. **Helper Functions** - Reusable query patterns

### Key ChromaDB Filter Operators:
- `$eq` - Equal to
- `$ne` - Not equal to
- `$gt` - Greater than
- `$gte` - Greater than or equal
- `$lt` - Less than
- `$lte` - Less than or equal
- `$and` - Logical AND
- `$or` - Logical OR

### Next Steps:
- Experiment with different query combinations
- Tune the similarity thresholds
- Build custom recommendation algorithms
- Add user preference modeling
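
As a starting point for tuning similarity thresholds, the sketch below drops query hits whose distance exceeds a cutoff. `filter_by_max_distance` is a hypothetical helper, the sample distances are made up, and a sensible cutoff depends on the embedding model and the collection's distance space, so it has to be found empirically.

```python
def filter_by_max_distance(results: dict, max_distance: float) -> list[tuple]:
    """Keep (metadata, distance) pairs of the first query with distance <= max_distance."""
    pairs = zip(results["metadatas"][0], results["distances"][0])
    return [(metadata, distance) for metadata, distance in pairs if distance <= max_distance]


# Stand-in for a `collection.query(...)` return value; the values are made up.
fake_results = {
    "metadatas": [[{"product_name": "A"}, {"product_name": "B"}, {"product_name": "C"}]],
    "distances": [[0.62, 0.81, 0.95]],
}
kept = filter_by_max_distance(fake_results, max_distance=0.85)
print([metadata["product_name"] for metadata, _ in kept])
```

Applying a cutoff like this means a query can return fewer than `n_results` items, or none at all, which is often preferable to recommending weak matches.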