# Vector Embeddings in FireProx

This notebook demonstrates how to work with vector embeddings in FireProx using the native `google.cloud.firestore_v1.vector.Vector` class.

## Important Limitations

**Firestore Emulator Does NOT Support Vector Embeddings**

- Vector embeddings are a production-only feature
- The Firestore emulator will reject any operations involving vectors
- All examples in this notebook require a real Firestore instance
- See [GitHub Issue #7216](https://github.com/firebase/firebase-tools/issues/7216)

**Vector Constraints**:
- Maximum 2048 dimensions per vector
- Vectors must be top-level document fields; they cannot be nested inside arrays or maps

## What are Vector Embeddings?

Vector embeddings are numerical representations of data (text, images, etc.) that capture semantic meaning. They enable:
- Semantic search (find similar documents)
- Clustering and classification
- Recommendation systems
- Question answering

FireProx uses the native Firestore `Vector` type directly for seamless integration.

## Setup

**Note**: These examples will fail with the emulator. You must use a real Firestore project.

In [None]:
import os
from google.cloud import firestore
from google.cloud.firestore_v1.vector import Vector
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure

from fire_prox import AsyncFireProx, FireProx

# Check if running in CI environment
if os.environ.get('NOTEBOOK_CI'):
    print("⚠️  Running in CI - skipping vector examples (requires production Firestore)")
    import sys
    sys.exit(0)

# Initialize clients (PRODUCTION ONLY - will not work with emulator)
project_id = 'your-project-id'  # Replace with your actual project ID

# Synchronous client
sync_client = firestore.Client(project=project_id)
db = FireProx(sync_client)

# Asynchronous client
async_client = firestore.AsyncClient(project=project_id)
async_db = AsyncFireProx(async_client)

print("✓ Connected to production Firestore")
print("⚠️  Remember: Vector embeddings DO NOT work with the emulator")

## Feature 1: Creating and Storing Vectors (Sync)

Create a native `Vector` from a list of floats and store it in a document.

In [None]:
# Create a collection for documents with embeddings
documents = db.collection('semantic_documents')

# Create a simple 3-dimensional embedding using native Vector
doc1 = documents.new()
doc1.title = "Introduction to Machine Learning"
doc1.content = "Machine learning is a subset of artificial intelligence..."
doc1.embedding = Vector([0.12, 0.45, 0.78])  # Native Vector instance

# Save to Firestore
doc1.save(doc_id='ml_intro')

print(f"✓ Saved document with {len(doc1.embedding.to_map_value()['value'])} dimensions")
print(f"  Title: {doc1.title}")
print(f"  Embedding type: {type(doc1.embedding).__name__}")

## Feature 2: Reading Vectors from Firestore (Sync)

FireProx automatically preserves native Firestore Vectors when reading.

In [None]:
# Read the document back
retrieved = db.doc('semantic_documents/ml_intro')
retrieved.fetch()

# Access the vector - stays as native Vector
print(f"Document: {retrieved.title}")
print(f"Embedding type: {type(retrieved.embedding).__name__}")
print(f"Vector: {retrieved.embedding}")

## Feature 3: Working with Higher-Dimensional Embeddings

Real-world embeddings typically have many more dimensions (e.g., 384, 768, 1536 dimensions).
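
As a rough sizing sketch, the raw payload of an embedding grows linearly with its dimension count. The figure of 8 bytes per component is an assumption (a 64-bit double per value), the helper name `approx_vector_bytes` is hypothetical, and field names and per-document overhead are ignored.

```python
BYTES_PER_COMPONENT = 8  # assumption: each component is stored as a 64-bit double


def approx_vector_bytes(dimensions: int) -> int:
    """Approximate raw payload of one embedding, excluding all overhead."""
    if dimensions < 0:
        raise ValueError("dimensions must be non-negative")
    return dimensions * BYTES_PER_COMPONENT


for dims in (384, 768, 1536, 2048):
    kib = approx_vector_bytes(dims) / 1024
    print(f"{dims:>4} dimensions: about {kib:.1f} KiB")
```

Under these assumptions even a vector at the 2048-dimension limit stays in the tens of kilobytes, so the embedding itself is rarely what makes a document large.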

In [None]:
import random

# Create a document with a realistic 384-dimensional embedding
# (typical for models like sentence-transformers/all-MiniLM-L6-v2)
doc2 = documents.new()
doc2.title = "Deep Learning Fundamentals"
doc2.content = "Deep learning uses neural networks with multiple layers..."

# Generate a random 384-dimensional embedding (in practice, use a real model)
embedding_384d = [random.random() for _ in range(384)]
doc2.embedding = Vector(embedding_384d)

doc2.save(doc_id='dl_fundamentals')

# Extract the raw values once and reuse them for every printout
values = doc2.embedding.to_map_value()['value']
print(f"✓ Saved document with {len(values)}-dimensional embedding")
print(f"  First 5 dimensions: {values[:5]}")
print(f"  Last 5 dimensions: {values[-5:]}")

## Feature 4: Dimension Validation

Firestore enforces a maximum dimension limit of 2048.
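
Because the limit is only enforced when a document is written, it can be convenient to check the length before constructing a `Vector`. The sketch below is one way to do that in plain Python; `check_dimensions` is a hypothetical helper, not part of FireProx or the Firestore client.

```python
from collections.abc import Sequence

MAX_VECTOR_DIMENSIONS = 2048  # limit stated in this notebook


def check_dimensions(values: Sequence[float], limit: int = MAX_VECTOR_DIMENSIONS) -> int:
    """Return the dimension count, or raise ValueError if it exceeds the limit."""
    count = len(values)
    if count > limit:
        raise ValueError(f"embedding has {count} dimensions; the limit is {limit}")
    return count


print(check_dimensions([0.1] * 2048))

try:
    check_dimensions([0.1] * 2049)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Failing early this way avoids a network round trip just to learn that the write would be rejected.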

In [None]:
MAX_DIMENSIONS = 2048

print(f"Firestore maximum dimensions: {MAX_DIMENSIONS}")

# This works - exactly at the limit
max_vector = Vector([0.1] * MAX_DIMENSIONS)
print(f"✓ Created vector with {len(max_vector.to_map_value()['value'])} dimensions (max allowed)")

# The oversized Vector can be built locally; the limit is enforced when you save
try:
    too_large = Vector([0.1] * (MAX_DIMENSIONS + 1))
    doc_test = documents.new()
    doc_test.embedding = too_large
    # doc_test.save()  # This would fail
    print(f"\n⚠️  Created vector with {len(too_large.to_map_value()['value'])} dimensions")
    print("   (This will fail when you try to save to Firestore!)")
except Exception as e:
    print(f"\n✗ Error: {e}")

## Feature 5: Async Operations with Vectors

Vectors work seamlessly with the async API.

In [None]:
# Async version - store and retrieve vectors
async_documents = async_db.collection('semantic_documents')

# Create and save
async_doc = async_documents.new()
async_doc.title = "Neural Network Architectures"
async_doc.content = "Neural networks consist of interconnected layers..."
async_doc.embedding = Vector([0.23, 0.56, 0.89])

await async_doc.save(doc_id='nn_architectures')

dimension_count = len(async_doc.embedding.to_map_value()['value'])
print(f"✓ Saved async document with {dimension_count}D embedding")

# Read back
async_retrieved = async_db.doc('semantic_documents/nn_architectures')
await async_retrieved.fetch()

print(f"\nRetrieved: {async_retrieved.title}")
print(f"Embedding: {async_retrieved.embedding}")

## Feature 6: Real-World Example - Text Embeddings

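Simulate generating embeddings from text using a stand-in for an embedding model. The stand-in is deterministic (the same text always yields the same vector) but carries no semantic meaning, so the similarity rankings in later cells are arbitrary.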

**Note**: This example shows the pattern. In production, you would use a real embedding model like:
- OpenAI's `text-embedding-ada-002` (1536 dimensions)
- Sentence Transformers (384-768 dimensions)
- Google's Vertex AI embeddings (768 dimensions)
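
Whichever model you pick, it can help to L2-normalize embeddings before storing them, since a dot product between unit-length vectors equals their cosine similarity. The sketch below is plain Python; `l2_normalize` is a hypothetical helper, and some models already return unit-length vectors, which is worth checking for the one you use.

```python
import math
from collections.abc import Sequence


def l2_normalize(values: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (Euclidean norm of 1)."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in values]


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

The resulting list can be passed to `Vector(...)` in the same way as any other list of floats.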

In [None]:
def generate_fake_embedding(text: str, dimensions: int = 384) -> list[float]:
    """
    Simulate an embedding model (in production, use a real model).

    Real examples:
    - openai.embeddings.create(input=text, model="text-embedding-ada-002")
    - sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2').encode(text)
    - vertexai.language_models.TextEmbeddingModel.from_pretrained('textembedding-gecko').get_embeddings([text])
    """
    import hashlib
    import random

    # Seed a private generator from the text hash so the same text always
    # yields the same "embedding" without reseeding the global random module
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)

    return [rng.gauss(0, 1) for _ in range(dimensions)]

# Example documents to embed
articles = [
    {
        'title': 'Introduction to Python',
        'content': 'Python is a high-level programming language known for its simplicity and readability.'
    },
    {
        'title': 'JavaScript Basics',
        'content': 'JavaScript is the programming language of the web, enabling interactive websites.'
    },
    {
        'title': 'Database Design Principles',
        'content': 'Good database design ensures data integrity, reduces redundancy, and improves query performance.'
    }
]

# Store articles with embeddings
for i, article in enumerate(articles):
    doc = documents.new()
    doc.title = article['title']
    doc.content = article['content']

    # Generate embedding from content
    embedding = generate_fake_embedding(article['content'])
    doc.embedding = Vector(embedding)

    doc.save(doc_id=f'article_{i}')
    dimension_count = len(doc.embedding.to_map_value()['value'])
    print(f"✓ Saved: {article['title']} ({dimension_count}D)")

print("\n✓ All articles embedded and stored")

## Feature 7: Vector Similarity Search with find_nearest

Use FireProx's `find_nearest()` method to perform vector similarity search and find nearest neighbors.

**Requirements**:
- A vector index must be created on the field (using gcloud CLI or Firebase console)
- Does NOT work with emulator (production only)
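
Conceptually, a nearest-neighbor query ranks stored vectors by their distance to the query vector and returns the closest few. The brute-force sketch below illustrates that idea in plain Python with Euclidean distance. It is not how Firestore or FireProx implements the search (the index exists precisely to avoid scanning everything), and the names `euclidean` and `nearest` are hypothetical.

```python
import math
from collections.abc import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(
    query: Sequence[float],
    corpus: dict[str, Sequence[float]],
    limit: int,
) -> list[tuple[str, float]]:
    """Return up to `limit` (doc_id, distance) pairs, closest first."""
    scored = [(doc_id, euclidean(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:limit]


corpus = {
    "article_0": [0.0, 0.0],
    "article_1": [3.0, 4.0],
    "article_2": [1.0, 0.0],
}
print(nearest([0.0, 0.0], corpus, limit=2))
# [('article_0', 0.0), ('article_2', 1.0)]
```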

In [None]:
# Create a query vector (in practice, this would be an embedding of a search query)
query_text = "programming languages and coding"
query_embedding = generate_fake_embedding(query_text)
query_vector = Vector(query_embedding)

print(f"Searching for documents similar to: '{query_text}'")
print("\nNote: This requires a vector index on the 'embedding' field.")
print("Create index with: gcloud firestore indexes composite create ...\n")

# Find nearest neighbors using EUCLIDEAN distance
try:
    vector_query = documents.find_nearest(
        vector_field="embedding",
        query_vector=query_vector,
        distance_measure=DistanceMeasure.EUCLIDEAN,
        limit=5,
        distance_result_field="distance"  # Optional: store calculated distance
    )

    print("Top 5 nearest neighbors:")
    print("=" * 60)
    
    for doc in vector_query.get():
        print(f"\nTitle: {doc.title}")
        print(f"Content: {doc.content}")
        # Access distance if distance_result_field was specified
        if hasattr(doc, 'distance'):
            print(f"Distance: {doc.distance:.4f}")
        
except Exception as e:
    print(f"⚠️  Vector search failed: {e}")
    print("\nThis is expected if:")
    print("  1. No vector index exists on the 'embedding' field")
    print("  2. Running against emulator (vectors not supported)")
    print("  3. Collection has no documents with embeddings")

## Feature 8: Vector Search with Pre-filtering

Combine `where()` clauses with `find_nearest()` to filter documents before searching.

**Note**: Requires a composite index when combining filters with vector search.
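
The effect of pre-filtering is that documents outside the filter never compete for a place in the result, even if they are closer to the query. The plain-Python sketch below illustrates this with made-up two-dimensional vectors and squared Euclidean distance; the data and the helper names are hypothetical, and the notebook cell itself uses COSINE.

```python
from collections.abc import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance; it ranks neighbors in the same order as the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prefiltered_nearest(
    docs: list[dict],
    category: str,
    query: Sequence[float],
    limit: int,
) -> list[str]:
    """Keep only documents in `category`, then return the titles of the closest `limit`."""
    candidates = [d for d in docs if d["category"] == category]
    candidates.sort(key=lambda d: squared_distance(d["embedding"], query))
    return [d["title"] for d in candidates[:limit]]


docs = [
    {"title": "Introduction to Python", "category": "programming", "embedding": [1.0, 0.0]},
    {"title": "JavaScript Basics", "category": "programming", "embedding": [0.0, 1.0]},
    {"title": "Database Design Principles", "category": "database", "embedding": [0.9, 0.1]},
]

# The database article matches the query exactly, but the filter excludes it
print(prefiltered_nearest(docs, "programming", [0.9, 0.1], limit=2))
# ['Introduction to Python', 'JavaScript Basics']
```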

In [None]:
# First, let's add a category field to our documents
doc_python = db.doc('semantic_documents/article_0')
doc_python.fetch()
doc_python.category = 'programming'
doc_python.save()

doc_js = db.doc('semantic_documents/article_1')
doc_js.fetch()
doc_js.category = 'programming'
doc_js.save()

doc_db = db.doc('semantic_documents/article_2')
doc_db.fetch()
doc_db.category = 'database'
doc_db.save()

print("✓ Added categories to documents")

# Now search with pre-filtering
try:
    # Find nearest neighbors only among 'programming' category
    filtered_query = (
        documents
        .where('category', '==', 'programming')
        .find_nearest(
            vector_field="embedding",
            query_vector=query_vector,
            distance_measure=DistanceMeasure.COSINE,
            limit=3
        )
    )
    
    print("\nFiltered results (category='programming' only):")
    print("=" * 60)
    
    for doc in filtered_query.get():
        print(f"\nTitle: {doc.title}")
        print(f"Category: {doc.category}")
        print(f"Content: {doc.content[:50]}...")
        
except Exception as e:
    print(f"\n⚠️  Filtered vector search failed: {e}")
    print("\nThis requires a composite index with:")
    print("  - category field")
    print("  - embedding vector field")

## Feature 9: Async Vector Search

Vector search works with the async API as well.

In [None]:
# Async vector search
async_documents = async_db.collection('semantic_documents')

try:
    async_vector_query = async_documents.find_nearest(
        vector_field="embedding",
        query_vector=query_vector,
        distance_measure=DistanceMeasure.DOT_PRODUCT,
        limit=3
    )

    print("Async vector search results:")
    print("=" * 60)
    
    async for doc in async_vector_query.stream():
        print(f"\nTitle: {doc.title}")
        print(f"Content: {doc.content[:60]}...")
        
except Exception as e:
    print(f"⚠️  Async vector search failed: {e}")
    print("Requires vector index and production Firestore")

## Distance Measures

Firestore supports three distance measures for vector similarity:

1. **EUCLIDEAN**: Measures straight-line distance between vectors
   - Good for: Spatial data, when magnitude matters
   - Range: 0 to ∞ (lower is more similar)

2. **COSINE**: Measures the angle between vectors (direction only)
   - Good for: Text embeddings, when direction matters more than magnitude
   - Range: 0 to 2 (lower is more similar; cosine distance is 1 minus cosine similarity)

3. **DOT_PRODUCT**: Measures both angle and magnitude
   - Good for: When both direction and magnitude are important
   - Range: -∞ to ∞ (higher is more similar)
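
To make the differences concrete, the sketch below computes all three measures by hand for small vectors. It is plain Python rather than a Firestore call, and it assumes cosine distance is defined as 1 minus cosine similarity.

```python
import math
from collections.abc import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity (assumed definition); ignores magnitude."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the magnitude
c = [0.0, 1.0]  # orthogonal to a

# Same direction: cosine sees no difference, Euclidean and dot product do
print(euclidean_distance(a, b), cosine_distance(a, b), dot(a, b))  # 1.0 0.0 2.0

# Orthogonal: cosine distance is 1 and the dot product drops to 0
print(euclidean_distance(a, c), cosine_distance(a, c), dot(a, c))
```

Note that `a` and `b` are identical under cosine distance but not under the other two measures, which is why the choice matters when embeddings are not normalized.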

In [None]:
# Compare different distance measures
print("Comparing distance measures:")
print("=" * 60)

for measure in [DistanceMeasure.EUCLIDEAN, DistanceMeasure.COSINE, DistanceMeasure.DOT_PRODUCT]:
    print(f"\n{measure.name}:")
    try:
        query = documents.find_nearest(
            vector_field="embedding",
            query_vector=query_vector,
            distance_measure=measure,
            limit=2,
            distance_result_field="distance"
        )
        
        for doc in query.get():
            distance = getattr(doc, 'distance', 'N/A')
            print(f"  - {doc.title}: {distance}")
            
    except Exception as e:
        print(f"  ⚠️  Failed: {str(e)[:60]}...")

## Server-Side Embedding Generation

### Firebase Extension for Automatic Embeddings

Firebase provides extensions that can automatically generate embeddings when documents are created or updated:

**How it works**:
1. Configure which collection and field to monitor
2. When a document is created/updated, the extension triggers
3. It sends the text field to an embedding model (Vertex AI / Gemini)
4. The generated embedding is stored back in the document

**Example workflow**:
```python
# 1. Save document with text content (no embedding yet)
doc = documents.new()
doc.title = "My Article"
doc.content = "This is the text content to embed..."
doc.save()

# 2. Extension automatically triggers:
#    - Reads doc.content
#    - Calls Vertex AI embedding API
#    - Writes result to doc.embedding

# 3. Read back with embedding (after extension completes)
import time
time.sleep(2)  # Wait for extension to process
doc.fetch(force=True)
print(f"Auto-generated embedding: {len(doc.embedding.to_map_value()['value'])}D")
```
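
A fixed two-second sleep may be too short or needlessly long, depending on how quickly the extension runs. One alternative is to poll with a timeout. The sketch below is a generic helper in plain Python; `poll_until` is a hypothetical name, and `fake_read` stands in for "re-fetch the document and return its embedding if one is present".

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    read: Callable[[], Optional[T]],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> T:
    """Call read() until it returns a non-None value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = read()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no value after {timeout} seconds")
        time.sleep(interval)


# Stand-in reader: the "embedding" appears on the third attempt
attempts = {"count": 0}


def fake_read() -> Optional[list[float]]:
    attempts["count"] += 1
    return [0.1, 0.2] if attempts["count"] >= 3 else None


print(poll_until(fake_read, timeout=5.0, interval=0.01))
```

In the workflow above, the reader would refresh the document and return `None` until the `embedding` field shows up.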

**Alternative: Client-Side Embeddings**

For more control, generate embeddings in your application:

```python
# Using OpenAI
import openai

response = openai.embeddings.create(
    input="Your text here",
    model="text-embedding-ada-002"
)
embedding = response.data[0].embedding
doc.embedding = Vector(embedding)
doc.save()

# Using Sentence Transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("Your text here").tolist()
doc.embedding = Vector(embedding)
doc.save()
```

## Cleanup

In [None]:
# Delete test documents
test_docs = [
    'ml_intro',
    'dl_fundamentals',
    'nn_architectures',
    'article_0',
    'article_1',
    'article_2'
]

for doc_id in test_docs:
    try:
        doc = db.doc(f'semantic_documents/{doc_id}')
        doc.delete()
        print(f"✓ Deleted {doc_id}")
    except Exception as e:
        print(f"  (Could not delete {doc_id}: {e})")

print("\n✓ Cleanup complete")

## Summary

### Key Takeaways

1. **Native Vector Support**: FireProx uses native `google.cloud.firestore_v1.vector.Vector` directly
2. **Automatic Handling**: FireProx preserves Vector types seamlessly during read/write operations
3. **Vector Search**: Use `find_nearest()` for similarity search and nearest neighbor queries
4. **Distance Measures**: Choose from EUCLIDEAN, COSINE, or DOT_PRODUCT based on your use case
5. **Pre-filtering**: Combine `where()` with `find_nearest()` for filtered vector search
6. **Sync & Async**: Works with both synchronous and asynchronous APIs
7. **Production Only**: Vectors do NOT work with Firestore emulator

### Limitations to Remember

- ⚠️ Emulator does not support vectors
- ⚠️ Maximum 2048 dimensions
- ⚠️ Maximum 1000 results per query
- ⚠️ Vectors must be top-level document fields (no nesting in arrays or maps)
- ⚠️ Requires vector index for search operations
- ⚠️ No real-time snapshot listeners for vector queries

### Use Cases

- **Semantic Search**: Find documents similar to a query
- **Content Recommendations**: Suggest related articles/products
- **Question Answering**: Match questions to relevant answers
- **Image Search**: Find similar images by embedding
- **Clustering**: Group similar documents together
- **Duplicate Detection**: Find near-duplicate content

### Next Steps

To build a complete semantic search system:
1. Choose an embedding model (OpenAI, Sentence Transformers, Vertex AI)
2. Generate embeddings for your documents
3. Store using native `Vector` type
4. Create vector indexes (using gcloud CLI or Firebase console)
5. Use `find_nearest()` for similarity search
6. Optionally combine with filters using `where()`