Feature/knn vector update mongodb #3971

Draft
ranfysvalle02 wants to merge 4 commits into mem0ai:main from
ranfysvalle02:feature/knnVector-update-mongodb
@ranfysvalle02 ranfysvalle02 commented Feb 3, 2026

MongoDB Vector Store: Migration to vectorSearch

Description

Migrates the MongoDB vector store from the deprecated knnVector field type to the MongoDB Atlas vectorSearch index type, queried through the $vectorSearch aggregation stage. Includes integration tests that verify $vectorSearch end-to-end, plus automatic migration of legacy indexes.

🔍 Integration tests explicitly test $vectorSearch aggregation pipeline using MongoDB Atlas Local containers to ensure the vector search implementation works correctly in production-like environments.

Fixes #3970


Type of Change

  • Refactor (migration from knnVector to vectorSearch)
  • New feature (integration tests + auto-healing legacy indexes)

Summary of Changes

1. Migration: knnVector → vectorSearch Index Type

Before:

VECTOR_TYPE = "knnVector"
definition = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "embedding": {
                "type": "knnVector",
                "dimensions": self.embedding_model_dims,
                "similarity": "cosine",
            }
        }
    }
}

After:

from pymongo.operations import SearchIndexModel

definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": self.embedding_model_dims,
            "similarity": "cosine",
        }
    ]
}
search_index_model = SearchIndexModel(
    name=self.index_name,
    type="vectorSearch",  # Atlas Vector Search index type
    definition=definition,
)

Impact:

  • Uses MongoDB Atlas Search API standard (vectorSearch index type)
  • Aligns with MongoDB's recommended vector search implementation
  • Removes dependency on deprecated knnVector field type
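To make the field mapping between the two formats concrete, here is a minimal sketch of a converter from the legacy definition to the new one. The function name `legacy_to_vector_definition` is hypothetical and not part of this PR, which builds the new definition directly rather than converting the old one.

```python
def legacy_to_vector_definition(legacy: dict) -> dict:
    """Translate a legacy knnVector mapping into a vectorSearch definition.

    Hypothetical helper for illustration only; not part of this PR.
    """
    fields = []
    legacy_fields = legacy.get("mappings", {}).get("fields", {})
    for path, spec in legacy_fields.items():
        if spec.get("type") != "knnVector":
            continue  # Only knnVector fields are translated in this sketch
        fields.append(
            {
                "type": "vector",
                "path": path,  # The mapping key becomes "path"
                "numDimensions": spec["dimensions"],  # "dimensions" is renamed
                "similarity": spec["similarity"],
            }
        )
    return {"fields": fields}
```

The sketch shows the three structural changes: the nested `mappings.fields` object becomes a flat `fields` array, the field name moves into `path`, and `dimensions` becomes `numDimensions`.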

2. Vector Search Improvements

  • Increased accuracy: numCandidates changed from limit to limit * 20 for better HNSW index recall
  • Performance: Removed redundant list_search_indexes() check on every search operation
  • Pipeline optimization: Uses $vectorSearch aggregation stage with proper score extraction

3. Auto-Healing Legacy Indexes (Zero Data Loss)

NEW: Automatic detection and migration of legacy knnVector indexes without data loss.

  • Automatic Detection: On initialization, create_col() inspects existing indexes
  • Legacy Detection: Identifies knnVector type in old mappings structure or outdated index configurations
  • Surgical Migration: Drops only the index (preserving all vector data) and recreates with vectorSearch
  • Asynchronous Handling: Properly waits for index deletion before recreation, and for index readiness after creation
  • No Manual Steps: No manual intervention required; migration runs automatically on the next initialization (vector search is unavailable until the rebuilt index becomes queryable)

Before (Manual Migration Required):

# Users had to manually reset, losing data or requiring backup
mongo_store.reset()  # Drops entire collection!

After (Automatic):

# Simply initialize - auto-healing happens automatically
vector_store = MongoDB(
    db_name="mem0",
    collection_name="my_vectors",
    embedding_model_dims=1536,
    mongo_uri="mongodb://...",
    wait_for_index_ready=True,  # Wait for migration to complete
    index_creation_timeout=300   # Configurable timeout
)
# Legacy index detected → dropped → recreated with vectorSearch
# Your data remains intact!

4. Asynchronous Index Operations

NEW: Robust handling of MongoDB Atlas Search's asynchronous index operations.

  • Polling Logic: _wait_for_index_status() method polls index status until ready or deleted
  • Queryable Check: Verifies queryable=True before using indexes (not just existence)
  • Sequential Operations: Waits for index deletion before recreation (prevents conflicts)
  • Configurable Timeouts: index_creation_timeout parameter (default 300s, adjustable for large datasets)
  • Production Mode: wait_for_index_ready=False option for non-blocking initialization in APIs

Key Features:

  • Prevents "index not ready" errors by waiting for queryable=True
  • Handles race conditions when dropping and recreating indexes
  • Configurable for different use cases (scripts vs. production APIs)

5. Code Quality

  • Improved error handling and logging
  • Better handling of optional _id in insert operations
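The description does not spell out what "optional _id" handling means, so the following is only a sketch of one plausible reading: leave `_id` out of the document when the caller supplies none, so that MongoDB generates one. The helper name `build_documents` and the `payload` field name are assumptions for illustration; only the `embedding` field comes from the index definition above.

```python
def build_documents(vectors, payloads=None, ids=None):
    """Build insertable documents, omitting "_id" when no ID is given.

    Hypothetical helper; the field name "payload" is an assumption.
    """
    payloads = payloads or [{} for _ in vectors]
    ids = ids or [None] * len(vectors)
    documents = []
    for vector, payload, _id in zip(vectors, payloads, ids):
        document = {"embedding": vector, "payload": payload}
        if _id is not None:
            # Only set "_id" explicitly; otherwise MongoDB assigns an ObjectId
            document["_id"] = _id
        documents.append(document)
    return documents
```

Omitting the key matters because inserting `"_id": None` more than once would collide on the unique `_id` index.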

Integration Tests: $vectorSearch Verification

🎯 NEW: Comprehensive Integration Test Suite

Added tests/vector_stores/test_mongodb_integration.py that explicitly tests $vectorSearch aggregation pipeline using real MongoDB Atlas Local containers.

Why MongoDB Atlas Local?

  • Standard MongoDB images do not support $vectorSearch
  • MongoDB Atlas Local is required for $vectorSearch functionality
  • Tests run against production-like environment

Test Coverage

test_vector_lifecycle - Full CRUD with $vectorSearch

# 1. Insert vectors
mongo_store.insert(vectors, payloads, ids)

# 2. Test $vectorSearch aggregation pipeline
results = mongo_store.search(
    query="unused", 
    vectors=[1.0, 0.0, 0.0], 
    limit=1
)
assert results[0].score > 0.99  # Verifies $vectorSearch score calculation

# 3. Test $vectorSearch with $match filtering
results_filtered = mongo_store.search(
    query="unused",
    vectors=[0.0, 1.0, 0.0], 
    limit=5,
    filters={"type": "other"}  # Tests $match stage in pipeline
)

What it verifies:

  • ✅ $vectorSearch aggregation pipeline execution
  • ✅ vectorSearchScore metadata extraction
  • ✅ $match stage filtering with $vectorSearch
  • ✅ Index creation and HNSW indexing

test_list_functionality - List operations with filters

Tests list() method with payload filters.

test_knnvector_to_vectorsearch_migration - NEW: Migration Test

Comprehensive test that verifies automatic migration from legacy knnVector to vectorSearch:

# 1. Create legacy knnVector index using old format
legacy_index = SearchIndexModel(
    name=index_name,
    definition={
        "mappings": {
            "dynamic": False,
            "fields": {
                "embedding": {
                    "type": "knnVector",  # Old format
                    "dimensions": embedding_dims,
                    "similarity": "cosine",
                }
            },
        }
    },
)
collection.create_search_index(legacy_index)

# 2. Insert test data
# ... insert vectors with payloads ...

# 3. Initialize MongoDB class (triggers auto-healing)
store = MongoDB(...)  # Detects legacy index and migrates

# 4. Verify migration
# - Old mappings structure is gone
# - New fields array with vector type exists
# - Index is queryable
# - All data is preserved
# - Search functionality works

What it verifies:

  • ✅ Legacy knnVector index creation (old mappings format)
  • ✅ Automatic detection of legacy index structure
  • ✅ Index migration (drop + recreate) without data loss
  • ✅ Data preservation (all vectors and payloads intact)
  • ✅ Search functionality after migration
  • ✅ Filtered search after migration
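The index-level checks in step 4 can be expressed as a small predicate over the document returned by `list_search_indexes()`. This is a sketch rather than the test's actual code: `is_migrated` is a hypothetical name, and it assumes the index document exposes `latestDefinition` and `queryable` as used elsewhere in this description.

```python
def is_migrated(index_doc: dict) -> bool:
    """Return True if an index document looks like a migrated vectorSearch index.

    Hypothetical helper for illustration; not taken from the test suite.
    """
    definition = index_doc.get("latestDefinition", {})
    # The old format nests fields under "mappings"; it must be gone
    has_no_mappings = "mappings" not in definition
    # The new format has a top-level "fields" array with a vector entry
    has_vector_field = any(
        field.get("type") == "vector" and field.get("path") == "embedding"
        for field in definition.get("fields", [])
    )
    return has_no_mappings and has_vector_field and index_doc.get("queryable") is True
```

Data preservation and search behavior still need separate assertions against the collection itself; this predicate only covers the index structure.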

Running the Tests

# Install dependencies
pip install testcontainers pytest pymongo

# Run integration tests
pytest tests/vector_stores/test_mongodb_integration.py -v -s

Test Results:

tests/vector_stores/test_mongodb_integration.py::test_vector_lifecycle PASSED
tests/vector_stores/test_mongodb_integration.py::test_list_functionality PASSED
tests/vector_stores/test_mongodb_integration.py::test_edge_cases PASSED
tests/vector_stores/test_mongodb_integration.py::test_reset_functionality PASSED
tests/vector_stores/test_mongodb_integration.py::test_knnvector_to_vectorsearch_migration PASSED

5 passed, 1 warning in ~90s

Migration Test Output:

🔄 Testing knnVector → vectorSearch Migration...
📝 Step 1a: Creating legacy knnVector index using old format...
✅ Legacy knnVector index creation initiated
⏳ Step 1b: Waiting for legacy index to be ready...
✅ Verified legacy knnVector index structure (old format confirmed)
📥 Step 2: Inserting test data...
✅ Test data inserted (3 vectors)
🔧 Step 3: Initializing MongoDB class (should trigger auto-healing)...
✅ MongoDB class initialized
🔍 Step 4: Verifying index migration...
✅ Index successfully migrated to vectorSearch format (fields array with vector type)
💾 Step 5: Verifying data preservation...
✅ All data preserved during migration
🔎 Step 6: Verifying search functionality with migrated index...
✅ Search works correctly with migrated index
✅ Filtered search works correctly
✅ Migration test completed successfully!

Test Infrastructure

  • Uses testcontainers library for Docker container management
  • Custom AtlasContainer wrapper for MongoDB Atlas Local
  • Explicit readiness checks (ping-based, not log-based)
  • Proper cleanup and teardown
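A ping-based readiness check can be sketched as a small retry loop around any callable that raises while the server is still starting. The name `wait_until_ready` and its parameters are hypothetical, and this is not the PR's AtlasContainer code.

```python
import time
from typing import Callable


def wait_until_ready(
    ping: Callable[[], object], timeout: float = 60.0, interval: float = 1.0
) -> bool:
    """Call ping() until it stops raising or the timeout expires.

    Hypothetical helper; returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            ping()
            return True
        except Exception:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)  # Back off before the next attempt
```

With PyMongo, the callable could be `lambda: client.admin.command("ping")`. Probing the server directly avoids depending on the exact wording of container log lines.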

How Has This Been Tested?

✅ Integration Tests (NEW)

  • 5 comprehensive integration tests that verify $vectorSearch functionality and migration
  • Tests run against MongoDB Atlas Local containers
  • Full CRUD lifecycle testing
  • Edge case testing (empty results, non-existent IDs, empty filters, etc.)
  • Reset functionality verification
  • Migration test: Verifies automatic knnVector → vectorSearch migration with data preservation

✅ Existing Unit Tests

  • All existing unit tests continue to pass
  • Updated unit tests to match new vectorSearch index format
  • No breaking changes to public API

✅ Manual Testing

  • Verified $vectorSearch pipeline works correctly
  • Tested index creation and management
  • Confirmed automatic migration works with legacy indexes

Migration Guide

For Existing Users

No API changes - The public interface remains identical. Migration is automatic and requires no manual steps:

  1. Automatic Index Migration: Existing collections with legacy knnVector indexes are automatically detected and migrated

    # Simply initialize - auto-healing happens automatically
    vector_store = MongoDB(
        db_name="mem0",
        collection_name="my_vectors",
        embedding_model_dims=1536,
        mongo_uri="mongodb://..."
    )
    # Legacy index detected → dropped → recreated with vectorSearch
    # Your data remains intact!

    What happens:

    • On initialization, create_col() inspects existing indexes
    • If legacy knnVector detected (checks old mappings structure), it drops only the index (preserving data)
    • Waits for index deletion to complete (prevents conflicts)
    • Recreates with modern vectorSearch configuration
    • Waits for new index to become queryable (if wait_for_index_ready=True)
    • No manual intervention needed - all data preserved
  2. MongoDB Version: Requires MongoDB Atlas or MongoDB Atlas Local (standard MongoDB doesn't support $vectorSearch)

  3. Manual Reset (Optional): If you prefer to start fresh, reset() is still available but no longer required for migration


Technical Details

$vectorSearch Pipeline Structure

The search method now uses this aggregation pipeline:

pipeline = [
    {
        "$vectorSearch": {
            "index": self.index_name,
            "limit": limit,
            "numCandidates": limit * 20,  # Improved accuracy
            "queryVector": vectors,
            "path": "embedding",
        }
    },
    {"$set": {"score": {"$meta": "vectorSearchScore"}}},  # Extract similarity score
    {"$project": {"embedding": 0}},  # Exclude vectors from results
]

If filters are provided, a $match stage is inserted after $vectorSearch:

pipeline.insert(1, {"$match": {"$and": filter_conditions}})
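Putting the two fragments together, the pipeline assembly might look like the sketch below. The function name `build_search_pipeline` is hypothetical, the `payload.` prefix on filter keys is an assumption about the document layout, and the cap on numCandidates is an addition that is not in the PR (Atlas documents an upper bound of 10000 for that field).

```python
MAX_NUM_CANDIDATES = 10_000  # Documented Atlas upper bound for numCandidates


def build_search_pipeline(index_name, vectors, limit, filters=None):
    """Assemble the $vectorSearch pipeline, with an optional $match stage.

    Hypothetical helper for illustration; not the PR's search() method.
    """
    pipeline = [
        {
            "$vectorSearch": {
                "index": index_name,
                "limit": limit,
                # Oversample candidates for recall; capped here as an addition
                "numCandidates": min(limit * 20, MAX_NUM_CANDIDATES),
                "queryVector": vectors,
                "path": "embedding",
            }
        },
        {"$set": {"score": {"$meta": "vectorSearchScore"}}},
        {"$project": {"embedding": 0}},
    ]
    if filters:
        # Assumption: payload values live under a "payload" sub-document
        filter_conditions = [
            {f"payload.{key}": value} for key, value in filters.items()
        ]
        pipeline.insert(1, {"$match": {"$and": filter_conditions}})
    return pipeline
```

Because the $match stage runs after $vectorSearch has already applied `limit`, a filtered search can return fewer than `limit` results.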

Auto-Healing Implementation

The create_col() method now includes intelligent legacy index detection and migration with robust asynchronous handling:

# 1. Inspect existing indexes
found_indexes = list(collection.list_search_indexes(name=self.index_name))
is_legacy = False
should_create_index = not found_indexes  # No index yet: create one

# 2. Check for legacy configuration (supports old mappings structure)
if found_indexes:
    existing_index = found_indexes[0]
    definition = existing_index.get("latestDefinition", {})

    # Check old format: mappings.fields.embedding.type == "knnVector"
    legacy_fields = definition.get("mappings", {}).get("fields", {})
    if legacy_fields.get("embedding", {}).get("type") == "knnVector":
        is_legacy = True

    # Also treat any index whose type is not vectorSearch as legacy.
    # The index type is reported on the index document itself,
    # not inside latestDefinition.
    if existing_index.get("type") != "vectorSearch":
        is_legacy = True

    # 3. Surgical migration (preserves data)
    if is_legacy:
        collection.drop_search_index(self.index_name)
        # BLOCKING WAIT: Ensure deletion completes before recreation
        self._wait_for_index_status(collection, self.index_name, "deleted")
        should_create_index = True  # Recreated below with vectorSearch type

# 4. Create new index (if needed)
if should_create_index:
    collection.create_search_index(search_index_model)
    # BLOCKING WAIT: Ensure index is queryable before returning
    if self.wait_for_index_ready:
        self._wait_for_index_status(collection, self.index_name, "ready")

Key Features:

  • Data Preservation: Only drops the index definition, not the collection
  • Automatic: Happens transparently on initialization
  • Asynchronous Handling: Properly waits for index operations to complete
  • Legacy Detection: Supports both old mappings structure and missing type field
  • Idempotent: Safe to call multiple times
  • Error Handling: Gracefully handles race conditions and timeouts
  • Configurable: wait_for_index_ready and index_creation_timeout parameters

Asynchronous Index Operations

MongoDB Atlas Search index operations are asynchronous. The implementation includes:

_wait_for_index_status() Helper Method:

  • Polls index status until target state is reached
  • Checks queryable=True for readiness (not just existence)
  • Enforces configurable timeout to prevent infinite loops
  • Handles both "ready" and "deleted" states

Why This Matters:

  • Index creation/deletion returns immediately, but work happens in background
  • Without waiting, searches may fail with "index not ready" errors
  • Prevents race conditions when dropping and recreating indexes with same name
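The polling behavior described above can be sketched as follows. This is not the PR's `_wait_for_index_status()` implementation: the standalone function name, the `interval` parameter, and the boolean return value are assumptions, and only `list_search_indexes(name=...)` is an actual PyMongo call.

```python
import time


def wait_for_index_status(collection, index_name, target, timeout=300.0, interval=2.0):
    """Poll until the index is "ready" (queryable) or "deleted" (absent).

    Sketch only; returns True when the target state is reached,
    False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        indexes = list(collection.list_search_indexes(name=index_name))
        if target == "deleted" and not indexes:
            return True  # Index no longer listed: deletion finished
        if target == "ready" and indexes and indexes[0].get("queryable") is True:
            return True  # Existence is not enough; it must be queryable
        time.sleep(interval)
    return False
```

Using a monotonic clock for the deadline keeps the timeout stable even if the system clock is adjusted during a long index build.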

Configuration:

MongoDB(
    ...,
    wait_for_index_ready=True,      # Block until index is queryable (default: True)
    index_creation_timeout=300      # Max seconds to wait (default: 300s)
)

Production Considerations:

  • For large datasets (1M+ vectors), increase index_creation_timeout
  • For production APIs, set wait_for_index_ready=False and handle "index not ready" errors gracefully
  • Consider running index creation in separate migration scripts for very large datasets
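For the non-blocking mode, one way to handle "index not ready" errors gracefully is to retry the search with exponential backoff. The sketch below is an assumption about how a caller might do this, not something the PR ships; `search_with_retry` and its parameters are hypothetical, and whether a not-yet-queryable index surfaces as an exception or as an empty result depends on the deployment.

```python
import time


def search_with_retry(run_search, retry_on=(Exception,), attempts=5, base_delay=0.5):
    """Run a search callable, retrying with exponential backoff on failure.

    Hypothetical helper; re-raises the last error once attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return run_search()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
```

With PyMongo, a caller could narrow `retry_on` to `(pymongo.errors.OperationFailure,)` so that unrelated errors are not retried.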

Performance Considerations

  • Removed: Redundant index existence check on every search (was calling list_search_indexes())
  • Improved: numCandidates = limit * 20 for better recall (trade-off: slightly slower but more accurate)
  • Optimized: Direct aggregation pipeline execution
  • Asynchronous: Proper handling of async index operations prevents errors and race conditions

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Breaking Changes

None - The API remains fully backward compatible. Only internal implementation changed.

Migration: Existing knnVector indexes are automatically detected and migrated on initialization. No manual intervention required - your data is preserved during the migration process.


Development

Successfully merging this pull request may close these issues.

MONGODB BUG FIX: knnVector is deprecated
