# Understanding Embeddings in ChromaDB

This notebook explains how embeddings work in our product recommendation system.

## What are Embeddings?

Embeddings are **numerical vector representations** of text. They convert words, sentences, or documents into arrays of numbers (vectors) that capture semantic meaning.

For example (illustrative values, not real model output):
- "warm winter jacket" → [0.23, -0.45, 0.67, ..., 0.12] (384 numbers)
- "insulated cold weather coat" → [0.25, -0.43, 0.65, ..., 0.14] (384 numbers)

Similar meanings = Similar vectors (close together in vector space)
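
To make "close together" concrete, here is a minimal sketch that measures closeness as cosine similarity. The 4-dimensional vectors are made up for illustration (real embeddings from the model have 384 dimensions), and `cosine_similarity` here is a hand-written helper, not a ChromaDB function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional vectors, invented for this example only.
jacket = [0.23, -0.45, 0.67, 0.12]   # "warm winter jacket"
coat = [0.25, -0.43, 0.65, 0.14]     # "insulated cold weather coat"
shirt = [-0.60, 0.30, 0.05, -0.40]   # "lightweight summer shirt"

sim_jacket_coat = cosine_similarity(jacket, coat)
sim_jacket_shirt = cosine_similarity(jacket, shirt)

print(f"jacket vs coat:  {sim_jacket_coat:.3f}")
print(f"jacket vs shirt: {sim_jacket_shirt:.3f}")
```

The jacket/coat pair points in almost the same direction, so its score is close to 1, while the jacket/shirt pair scores much lower.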

## The Embedding Model Used

ChromaDB uses **`all-MiniLM-L6-v2`** by default:

- **Model**: sentence-transformers/all-MiniLM-L6-v2
- **Vector Dimensions**: 384
- **Model Size**: ~80MB (downloaded on first use)
- **Speed**: Fast, even on CPU (it is a small, 6-layer model)
- **Quality**: Good balance of speed and accuracy
- **Use Case**: General-purpose semantic similarity

This model was automatically downloaded when you first ran `load_products.py`.

In [1]:
import chromadb
from chromadb.utils import embedding_functions

# Initialize ChromaDB client
client = chromadb.PersistentClient(path="../chroma_db")
collection = client.get_collection(name="outdoor_products")

# Inspect the collection: name, number of items, and metadata
print("Collection Metadata:")
print(f"Name: {collection.name}")
print(f"Count: {collection.count()}")
print(f"\nMetadata: {collection.metadata}")

Collection Metadata:
Name: outdoor_products
Count: 300

Metadata: {'description': 'Outdoor apparel and gear products'}


## How the Embedding Process Works

### 1. At Index Time (Loading Products)

When you ran `load_products.py`:

```python
collection.add(
    documents=["NorthPeak Jacket for hiking..."],  # Text description
    metadatas=[{"brand": "NorthPeak", ...}],
    ids=["PRD-123"]
)
```

**What happens:**
1. ChromaDB takes the document text
2. Passes it through the `all-MiniLM-L6-v2` model
3. Gets back a 384-dimensional vector
4. Stores: [text, metadata, vector, id]

### 2. At Query Time (Searching)

When you search:

```python
collection.query(
    query_texts=["warm jacket for skiing"],
    n_results=5
)
```

**What happens:**
1. ChromaDB takes your query text
2. Passes it through the **same** `all-MiniLM-L6-v2` model
3. Gets back a 384-dimensional vector
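4. Searches the stored product vectors for the nearest neighbors (ChromaDB uses an HNSW index, so it does not have to scan every vector)
5. Returns the closest matches by distance (squared L2 by default; cosine if the collection was created with `hnsw:space` set to `cosine`)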
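
The index-time and query-time steps can be sketched with a toy in-memory store. Everything here is hypothetical and only mirrors the shape of the flow: `toy_embed` is a word-count stand-in for the real model, `ToyCollection` is not ChromaDB's implementation, and the search is a plain linear scan rather than an index lookup.

```python
import math
from collections import Counter

# Hypothetical fixed vocabulary; a real model learns dense features instead.
VOCAB = ["warm", "jacket", "skiing", "insulated",
         "summer", "shirt", "hiking", "lightweight"]

def toy_embed(text):
    """Stand-in for the embedding model: a unit-length word-count vector."""
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine_distance(a, b):
    """1 - cosine similarity; inputs are assumed to be unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

class ToyCollection:
    """Minimal store that keeps id, text, metadata, and vector per item."""

    def __init__(self, embed):
        self.embed = embed
        self.records = []

    def add(self, documents, metadatas, ids):
        # Index time: embed each document once and store the vector.
        for doc, meta, doc_id in zip(documents, metadatas, ids):
            self.records.append({
                "id": doc_id,
                "document": doc,
                "metadata": meta,
                "vector": self.embed(doc),
            })

    def query(self, query_text, n_results):
        # Query time: embed the query with the SAME function, then rank.
        query_vector = self.embed(query_text)
        scored = [(cosine_distance(query_vector, r["vector"]), r["id"])
                  for r in self.records]
        return sorted(scored)[:n_results]

toy = ToyCollection(toy_embed)
toy.add(
    documents=["warm insulated jacket", "lightweight summer shirt", "hiking jacket"],
    metadatas=[{"kind": "jacket"}, {"kind": "shirt"}, {"kind": "jacket"}],
    ids=["A", "B", "C"],
)

top = toy.query("warm jacket for skiing", n_results=2)
print(top)  # list of (distance, id) pairs, closest first
```

Unlike the real model, this toy embedder only rewards shared words, so it cannot see that "coat" and "jacket" are related. It is meant to show the data flow, not the quality of the matches.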

## Who Transforms Queries to Embeddings?

**ChromaDB does it automatically!**

You don't have to create embeddings yourself. ChromaDB handles it:

- **At index time**: Documents → Embeddings (stored)
- **At query time**: Query text → Embeddings (computed on-the-fly)
- **Same model used for both** to ensure consistency

In [2]:
# Let's manually see how the embedding function works
from chromadb.utils.embedding_functions import DefaultEmbeddingFunction

# Get the default embedding function (same one ChromaDB uses)
embedding_fn = DefaultEmbeddingFunction()

# Test text
test_texts = [
    "warm winter jacket",
    "insulated coat for cold weather",
    "lightweight summer shirt"
]

# Generate embeddings
embeddings = embedding_fn(test_texts)

print(f"Number of texts: {len(test_texts)}")
print(f"Number of embeddings: {len(embeddings)}")
print(f"\nEmbedding vector dimensions: {len(embeddings[0])}")
print(f"\nFirst embedding (first 10 values):")
print(embeddings[0][:10])

Number of texts: 3
Number of embeddings: 3

Embedding vector dimensions: 384

First embedding (first 10 values):
[-0.09634393  0.11972902  0.00860771  0.09623481  0.06638822  0.04304032
  0.09908793 -0.0490774  -0.04074297  0.00528337]


## Visualizing Semantic Similarity

Let's see how similar embeddings are for similar text.

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity between embeddings
similarity_matrix = cosine_similarity(embeddings)

print("Cosine Similarity Matrix:")
print("(1.0 = same direction, 0.0 = unrelated, negative = opposing)\n")

for i, text1 in enumerate(test_texts):
    print(f"'{text1}':")
    for j, text2 in enumerate(test_texts):
        print(f"  vs '{text2}': {similarity_matrix[i][j]:.4f}")
    print()

## Understanding Vector Distance

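ChromaDB returns **distances** (not similarity scores):
- Smaller distance = More similar
- Distance = 0 = Identical
- To convert, you need to know the collection's distance metric:
  - `cosine` space: `similarity = 1 - distance`
  - `l2` space (the default, squared Euclidean distance), assuming unit-length vectors: `similarity = 1 - distance / 2`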
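
Which conversion applies depends on the distance metric the collection was created with. The sketch below assumes that the `l2` space reports squared Euclidean distance, that the `cosine` space reports `1 - cosine similarity`, and that the vectors are unit length. Under those assumptions, squared L2 distance is exactly twice the cosine distance. The vectors and helper names are made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors, normalized to unit length.
a = normalize([0.23, -0.45, 0.67, 0.12])
b = normalize([0.25, -0.43, 0.65, 0.14])

cos_sim = dot(a, b)
cosine_dist = 1 - cos_sim   # what a cosine-space collection would report
l2_sq = squared_l2(a, b)    # what a default (l2) collection would report

# For unit vectors: |a - b|^2 = |a|^2 + |b|^2 - 2(a.b) = 2 * (1 - cos_sim)
sim_from_cosine = 1 - cosine_dist
sim_from_l2 = 1 - l2_sq / 2

print(f"cosine similarity:     {cos_sim:.6f}")
print(f"from cosine distance:  {sim_from_cosine:.6f}")
print(f"from squared L2:       {sim_from_l2:.6f}")
```

All three printed values agree, which is why `1 - distance` only yields a cosine similarity when the collection uses the cosine space.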

In [None]:
# Example query showing distances
results = collection.query(
    query_texts=["warm insulated jacket for winter"],
    n_results=5
)

print("Query: 'warm insulated jacket for winter'\n")
print("Results with Distance and Similarity Scores:\n")

for i, (metadata, distance) in enumerate(zip(results['metadatas'][0], results['distances'][0]), 1):
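    # Assumes the default squared-L2 metric with unit-length vectors;
    # use `1 - distance` if the collection was created with cosine space.
    similarity = 1 - distance / 2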
    print(f"{i}. {metadata['product_name']}")
    print(f"   Distance: {distance:.4f}")
    print(f"   Similarity: {similarity:.4f}")
    print(f"   Category: {metadata['subcategory']}")
    print()

## The Complete Flow

### Loading Products (One-time)
```
Product CSV → Python Dict → Text Description
                                    ↓
                          all-MiniLM-L6-v2 Model
                                    ↓
                          384-dim Vector [0.23, -0.45, ...]
                                    ↓
                          ChromaDB Storage
                          [text, metadata, vector, id]
```

### Searching (Every query)
```
User Query: "warm jacket"
         ↓
all-MiniLM-L6-v2 Model
         ↓
Query Vector [0.25, -0.43, ...]
         ↓
Nearest-neighbor search over product vectors
(HNSW index; squared L2 distance by default)
         ↓
Return top N closest matches
```

## Testing Different Queries

Let's see how the same model handles different types of queries.

In [None]:
# Test various query types
test_queries = [
    "jacket for cold weather skiing",
    "lightweight breathable hiking gear",
    "waterproof rain protection",
    "casual urban style",
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: '{query}'")
    print('='*60)
    
    results = collection.query(
        query_texts=[query],
        n_results=3
    )
    
    for i, (metadata, distance) in enumerate(zip(results['metadatas'][0], results['distances'][0]), 1):
        print(f"\n{i}. {metadata['product_name']}")
        # Assumes default squared-L2 distance and unit-length vectors
        print(f"   Similarity: {1 - distance / 2:.3f}")
        print(f"   Purpose: {metadata['primary_purpose']}")
        print(f"   Features: {metadata['waterproofing']}, {metadata['insulation']}")

## Embedding Model Comparison

ChromaDB's default `all-MiniLM-L6-v2` is great for most use cases, but you can use different models:

| Model | Dimensions | Size | Speed | Use Case |
|-------|-----------|------|-------|----------|
| all-MiniLM-L6-v2 | 384 | 80MB | ⚡⚡⚡ Fast | General purpose (default) |
| all-mpnet-base-v2 | 768 | 420MB | ⚡⚡ Medium | Higher quality |
| multi-qa-MiniLM-L6-cos-v1 | 384 | 80MB | ⚡⚡⚡ Fast | Q&A/Search optimized |
| paraphrase-multilingual | 384 | 470MB | ⚡⚡ Medium | 50+ languages |

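For a catalog of 300 products, the default model is a sensible choice: it is small and fast, and a collection this size gains little from a heavier model unless search quality turns out to be a problem.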
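
As a rough sense of scale, the sketch below estimates the raw vector storage for 300 products at the two dimension sizes from the table. It assumes each value is stored as a 32-bit float and ignores the index, documents, and metadata, so treat it as a lower bound rather than the actual on-disk size.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats

def raw_vector_bytes(num_items, dimensions):
    """Bytes needed for the vectors alone, without index or metadata."""
    return num_items * dimensions * BYTES_PER_FLOAT32

for model_name, dims in [("all-MiniLM-L6-v2", 384), ("all-mpnet-base-v2", 768)]:
    size_kib = raw_vector_bytes(300, dims) / 1024
    print(f"{model_name}: {size_kib:.0f} KiB of raw vectors for 300 products")
```

Doubling the dimensions doubles the vector storage (450 KiB versus 900 KiB here), which is negligible at this catalog size; the larger model's cost shows up mainly in download size and embedding time.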

## Using a Custom Embedding Model (Optional)

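If you want to use a different model, you can specify it when creating the collection. This requires the `sentence-transformers` package, and an existing collection has to be rebuilt, because vectors produced by different models are not comparable: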

In [None]:
# Example: Using a different embedding model (don't run this - just for reference)
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Create custom embedding function
custom_ef = SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"  # More accurate but slower
)

# When creating collection, specify the embedding function
# collection = client.create_collection(
#     name="products_with_custom_model",
#     embedding_function=custom_ef
# )

## Key Takeaways

1. **Embedding Model**: `all-MiniLM-L6-v2` (384 dimensions)
2. **Who creates embeddings**: ChromaDB automatically
3. **When**: 
   - At load time: Products → Embeddings (stored)
   - At query time: Query → Embeddings (computed)
4. **Same model for both**: Ensures consistency
5. **You rarely need to touch the vectors**: ChromaDB creates and stores them for you
6. **Distance vs Similarity**: Lower distance = Higher similarity

## Behind the Scenes

```python
# What you write:
collection.query(query_texts=["warm jacket"], n_results=5)

# What ChromaDB does, conceptually:
# 1. query_vector = embedding_model("warm jacket")  # [0.25, -0.43, ...]
# 2. for each product_vector in database:
#       distance = distance_fn(query_vector, product_vector)  # squared L2 by default
# 3. sort by distance (ascending)
# 4. return top 5
# (In practice, an HNSW index finds the nearest vectors without scanning them all.)
```

The key point: **the model places "warm jacket" close to "insulated coat" in vector space even though they share no words!**