Feature/knn vector update mongodb by ranfysvalle02 · Pull Request #3971 · mem0ai/mem0

ranfysvalle02 · 2026-02-03T02:40:15Z

MongoDB Vector Store: Migration to vectorSearch

Description

Migrates the MongoDB vector store from the deprecated knnVector field type to the MongoDB Atlas vectorSearch index type, queried through the $vectorSearch aggregation stage. Includes integration tests that exercise $vectorSearch end to end, plus automatic migration of legacy indexes.

🔍 Integration tests explicitly test $vectorSearch aggregation pipeline using MongoDB Atlas Local containers to ensure the vector search implementation works correctly in production-like environments.

Fixes #3970

Type of Change

Refactor (migration from knnVector to vectorSearch)
New feature (integration tests + auto-healing legacy indexes)

Summary of Changes

1. Migration: `knnVector` → `vectorSearch` Index Type

Before:

VECTOR_TYPE = "knnVector"
definition = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "embedding": {
                "type": "knnVector",
                "dimensions": self.embedding_model_dims,
                "similarity": "cosine",
            }
        }
    }
}

After:

from pymongo.operations import SearchIndexModel

definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": self.embedding_model_dims,
            "similarity": "cosine",
        }
    ]
}
search_index_model = SearchIndexModel(
    name=self.index_name,
    type="vectorSearch",  # Atlas Vector Search index type (the type argument needs PyMongo 4.7+)
    definition=definition,
)

Impact:

Uses the Atlas Vector Search index type (vectorSearch) instead of a knnVector field inside an Atlas Search index
Aligns with MongoDB's recommended vector search implementation
Removes dependency on deprecated knnVector field type
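To make the field-by-field change concrete, here is a minimal sketch that translates the legacy mapping shown above into the new definition. The helper name `to_vector_search_definition` is hypothetical and is not part of this PR, which recreates the index from the store's own configuration rather than converting the old definition.

```python
from typing import Any, Dict


def to_vector_search_definition(
    legacy: Dict[str, Any], path: str = "embedding"
) -> Dict[str, Any]:
    """Translate a legacy knnVector mapping into a vectorSearch definition.

    Hypothetical helper for illustration only.
    """
    field = legacy["mappings"]["fields"][path]
    if field.get("type") != "knnVector":
        raise ValueError(f"{path!r} is not a knnVector field")
    return {
        "fields": [
            {
                "type": "vector",  # replaces "knnVector"
                "path": path,  # the field name moves from a dict key to "path"
                "numDimensions": field["dimensions"],  # renamed from "dimensions"
                "similarity": field["similarity"],
            }
        ]
    }


legacy_definition = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "embedding": {
                "type": "knnVector",
                "dimensions": 1536,
                "similarity": "cosine",
            }
        },
    }
}
new_definition = to_vector_search_definition(legacy_definition)
```

The three visible differences are the type name, the `dimensions` to `numDimensions` rename, and the move from a `mappings.fields` dictionary to a flat `fields` list.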

2. Vector Search Improvements

Increased accuracy: numCandidates changed from limit to limit * 20 for better HNSW index recall
Performance: Removed redundant list_search_indexes() check on every search operation
Pipeline optimization: Uses $vectorSearch aggregation stage with proper score extraction
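One caveat with a fixed multiplier: MongoDB documents an upper bound of 10,000 for numCandidates, so `limit * 20` exceeds it once limit goes above 500. The sketch below shows one way to clamp the value. The function name and the choice to raise on an out-of-range limit are assumptions for illustration, not code from this PR.

```python
MAX_NUM_CANDIDATES = 10_000  # documented upper bound for $vectorSearch numCandidates


def num_candidates(limit: int, multiplier: int = 20) -> int:
    """Return limit * multiplier, clamped to the documented upper bound.

    Hypothetical helper; the PR itself uses limit * 20 directly.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    if limit > MAX_NUM_CANDIDATES:
        # $vectorSearch requires limit <= numCandidates, so no valid value exists.
        raise ValueError(f"limit must not exceed {MAX_NUM_CANDIDATES}")
    return min(limit * multiplier, MAX_NUM_CANDIDATES)
```

With this clamp, a limit of 5 still yields 100 candidates, while a limit of 1,000 yields 10,000 instead of an invalid 20,000.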

3. Auto-Healing Legacy Indexes (Zero Data Loss)

NEW: Automatic detection and migration of legacy knnVector indexes without data loss.

Automatic Detection: On initialization, create_col() inspects existing indexes
Legacy Detection: Identifies knnVector type in old mappings structure or outdated index configurations
Surgical Migration: Drops only the index (preserving all vector data) and recreates with vectorSearch
Asynchronous Handling: Properly waits for index deletion before recreation, and for index readiness after creation
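No Manual Steps: Migration runs automatically on the next initialization. Vector search on the collection is unavailable between the index drop and the point at which the new index becomes queryable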

Before (Manual Migration Required):

# Users had to manually reset, losing data or requiring backup
mongo_store.reset()  # Drops entire collection!

After (Automatic):

# Simply initialize - auto-healing happens automatically
vector_store = MongoDB(
    db_name="mem0",
    collection_name="my_vectors",
    embedding_model_dims=1536,
    mongo_uri="mongodb://...",
    wait_for_index_ready=True,  # Wait for migration to complete
    index_creation_timeout=300   # Configurable timeout
)
# Legacy index detected → dropped → recreated with vectorSearch
# Your data remains intact!

4. Asynchronous Index Operations

NEW: Robust handling of MongoDB Atlas Search's asynchronous index operations.

Polling Logic: _wait_for_index_status() method polls index status until ready or deleted
Queryable Check: Verifies queryable=True before using indexes (not just existence)
Sequential Operations: Waits for index deletion before recreation (prevents conflicts)
Configurable Timeouts: index_creation_timeout parameter (default 300s, adjustable for large datasets)
Production Mode: wait_for_index_ready=False option for non-blocking initialization in APIs

Key Features:

Prevents "index not ready" errors by waiting for queryable=True
Handles race conditions when dropping and recreating indexes
Configurable for different use cases (scripts vs. production APIs)

5. Code Quality

Improved error handling and logging
Better handling of optional _id in insert operations

Integration Tests: `$vectorSearch` Verification

🎯 NEW: Comprehensive Integration Test Suite

Added tests/vector_stores/test_mongodb_integration.py that explicitly tests $vectorSearch aggregation pipeline using real MongoDB Atlas Local containers.

Why MongoDB Atlas Local?

Standard MongoDB images do not support $vectorSearch
MongoDB Atlas Local is required for $vectorSearch functionality
Tests run against production-like environment

Test Coverage

`test_vector_lifecycle` - Full CRUD with `$vectorSearch`

# 1. Insert vectors
mongo_store.insert(vectors, payloads, ids)

# 2. Test $vectorSearch aggregation pipeline
results = mongo_store.search(
    query="unused", 
    vectors=[1.0, 0.0, 0.0], 
    limit=1
)
assert results[0].score > 0.99  # Verifies $vectorSearch score calculation

# 3. Test $vectorSearch with $match filtering
results_filtered = mongo_store.search(
    query="unused",
    vectors=[0.0, 1.0, 0.0], 
    limit=5,
    filters={"type": "other"}  # Tests $match stage in pipeline
)

What it verifies:

✅ $vectorSearch aggregation pipeline execution
✅ vectorSearchScore metadata extraction
✅ $match stage filtering with $vectorSearch
✅ Index creation and HNSW indexing
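For context on the `score > 0.99` assertion: MongoDB's documentation describes the cosine vectorSearchScore as a normalized value, `(1 + cosine(v1, v2)) / 2`, so scores fall in the range 0 to 1. The sketch below reproduces that arithmetic locally, assuming the stored vector is identical to the query vector. The function names are illustrative only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def atlas_cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalize cosine similarity into the 0..1 range, as documented for cosine indexes."""
    return (1 + cosine_similarity(a, b)) / 2


identical = atlas_cosine_score([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])   # 1.0
orthogonal = atlas_cosine_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # 0.5
```

Under this normalization an orthogonal vector scores 0.5 rather than 0, which is why a threshold as high as 0.99 is needed to single out the exact match.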

`test_list_functionality` - List operations with filters

Tests list() method with payload filters.

`test_knnvector_to_vectorsearch_migration` - NEW: Migration Test

Comprehensive test that verifies automatic migration from legacy knnVector to vectorSearch:

# 1. Create legacy knnVector index using old format
legacy_index = SearchIndexModel(
    name=index_name,
    definition={
        "mappings": {
            "dynamic": False,
            "fields": {
                "embedding": {
                    "type": "knnVector",  # Old format
                    "dimensions": embedding_dims,
                    "similarity": "cosine",
                }
            },
        }
    },
)
collection.create_search_index(legacy_index)

# 2. Insert test data
# ... insert vectors with payloads ...

# 3. Initialize MongoDB class (triggers auto-healing)
store = MongoDB(...)  # Detects legacy index and migrates

# 4. Verify migration
# - Old mappings structure is gone
# - New fields array with vector type exists
# - Index is queryable
# - All data is preserved
# - Search functionality works

What it verifies:

✅ Legacy knnVector index creation (old mappings format)
✅ Automatic detection of legacy index structure
✅ Index migration (drop + recreate) without data loss
✅ Data preservation (all vectors and payloads intact)
✅ Search functionality after migration
✅ Filtered search after migration

Running the Tests

# Install dependencies
pip install testcontainers pytest pymongo

# Run integration tests
pytest tests/vector_stores/test_mongodb_integration.py -v -s

Test Results:

tests/vector_stores/test_mongodb_integration.py::test_vector_lifecycle PASSED
tests/vector_stores/test_mongodb_integration.py::test_list_functionality PASSED
tests/vector_stores/test_mongodb_integration.py::test_edge_cases PASSED
tests/vector_stores/test_mongodb_integration.py::test_reset_functionality PASSED
tests/vector_stores/test_mongodb_integration.py::test_knnvector_to_vectorsearch_migration PASSED

5 passed, 1 warning in ~90s

Migration Test Output:

🔄 Testing knnVector → vectorSearch Migration...
📝 Step 1a: Creating legacy knnVector index using old format...
✅ Legacy knnVector index creation initiated
⏳ Step 1b: Waiting for legacy index to be ready...
✅ Verified legacy knnVector index structure (old format confirmed)
📥 Step 2: Inserting test data...
✅ Test data inserted (3 vectors)
🔧 Step 3: Initializing MongoDB class (should trigger auto-healing)...
✅ MongoDB class initialized
🔍 Step 4: Verifying index migration...
✅ Index successfully migrated to vectorSearch format (fields array with vector type)
💾 Step 5: Verifying data preservation...
✅ All data preserved during migration
🔎 Step 6: Verifying search functionality with migrated index...
✅ Search works correctly with migrated index
✅ Filtered search works correctly
✅ Migration test completed successfully!

Test Infrastructure

Uses testcontainers library for Docker container management
Custom AtlasContainer wrapper for MongoDB Atlas Local
Explicit readiness checks (ping-based, not log-based)
Proper cleanup and teardown

How Has This Been Tested?

✅ Integration Tests (NEW)

5 comprehensive integration tests that verify $vectorSearch functionality and migration
Tests run against MongoDB Atlas Local containers
Full CRUD lifecycle testing
Edge case testing (empty results, non-existent IDs, empty filters, etc.)
Reset functionality verification
Migration test: Verifies automatic knnVector → vectorSearch migration with data preservation

✅ Existing Unit Tests

All existing unit tests continue to pass
Updated unit tests to match new vectorSearch index format
No breaking changes to public API

✅ Manual Testing

Verified $vectorSearch pipeline works correctly
Tested index creation and management
Confirmed automatic migration works with legacy indexes

Migration Guide

For Existing Users

No API changes - The public interface remains identical. Migration is automatic; vector search on a collection is unavailable only while its index is being rebuilt:

Automatic Index Migration: Existing collections with legacy knnVector indexes are automatically detected and migrated
```
# Simply initialize - auto-healing happens automatically
vector_store = MongoDB(
    db_name="mem0",
    collection_name="my_vectors",
    embedding_model_dims=1536,
    mongo_uri="mongodb://..."
)
# Legacy index detected → dropped → recreated with vectorSearch
# Your data remains intact!
```

What happens:
- On initialization, create_col() inspects existing indexes
- If legacy knnVector detected (checks old mappings structure), it drops only the index (preserving data)
- Waits for index deletion to complete (prevents conflicts)
- Recreates with modern vectorSearch configuration
- Waits for new index to become queryable (if wait_for_index_ready=True)
- No manual intervention needed - all data preserved

MongoDB Version: Requires MongoDB Atlas or MongoDB Atlas Local (standard MongoDB doesn't support $vectorSearch)

Manual Reset (Optional): If you prefer to start fresh, reset() is still available but no longer required for migration

Technical Details

`$vectorSearch` Pipeline Structure

The search method now uses this aggregation pipeline:

pipeline = [
    {
        "$vectorSearch": {
            "index": self.index_name,
            "limit": limit,
            "numCandidates": limit * 20,  # Improved accuracy
            "queryVector": vectors,
            "path": "embedding",
        }
    },
    {"$set": {"score": {"$meta": "vectorSearchScore"}}},  # Extract similarity score
    {"$project": {"embedding": 0}},  # Exclude vectors from results
]

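If filters are provided, a $match stage is inserted after $vectorSearch. The filter therefore runs on the documents $vectorSearch has already returned (post-filtering), so a filtered search can return fewer than limit results: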

pipeline.insert(1, {"$match": {"$and": filter_conditions}})
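Putting the two snippets together, a minimal sketch of the pipeline construction might look like the following. The function name is hypothetical, and the way `filter_conditions` is built (equality checks on keys nested under a `payload` field) is an assumption, since the description does not show it.

```python
from typing import Any, Dict, List, Optional


def build_search_pipeline(
    index_name: str,
    query_vector: List[float],
    limit: int,
    filters: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Assemble the aggregation pipeline described above (illustrative sketch)."""
    pipeline: List[Dict[str, Any]] = [
        {
            "$vectorSearch": {
                "index": index_name,
                "limit": limit,
                "numCandidates": limit * 20,
                "queryVector": query_vector,
                "path": "embedding",
            }
        },
        {"$set": {"score": {"$meta": "vectorSearchScore"}}},
        {"$project": {"embedding": 0}},
    ]
    if filters:
        # Assumption: each filter is an equality check on a key under "payload".
        filter_conditions = [
            {f"payload.{key}": value} for key, value in filters.items()
        ]
        # Index 1 places $match directly after $vectorSearch, which must stay first.
        pipeline.insert(1, {"$match": {"$and": filter_conditions}})
    return pipeline


pipeline = build_search_pipeline(
    "my_index", [0.0, 1.0, 0.0], limit=5, filters={"type": "other"}
)
```

An alternative not used here is the `filter` option of $vectorSearch itself, which filters before the nearest-neighbor search but requires the filtered fields to be declared in the index definition with type `filter`.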

Auto-Healing Implementation

The create_col() method now includes intelligent legacy index detection and migration with robust asynchronous handling:

# 1. Inspect existing indexes
found_indexes = list(collection.list_search_indexes(name=self.index_name))
is_legacy = False
should_create_index = not found_indexes  # No index yet: create one

# 2. Check for legacy configuration (supports old mappings structure)
if found_indexes:
    existing_index = found_indexes[0]
    definition = existing_index.get("latestDefinition", {})

    # Check old format: mappings.fields.embedding.type == "knnVector"
    mappings = definition.get("mappings", {})
    if mappings:
        legacy_fields = mappings.get("fields", {})
        embedding_field = legacy_fields.get("embedding", {})
        if embedding_field.get("type") == "knnVector":
            is_legacy = True

    # Also check if type is not vectorSearch. The index type is reported on
    # the index document itself, not inside latestDefinition.
    if existing_index.get("type") != "vectorSearch":
        is_legacy = True

    # 3. Surgical migration (preserves data)
    if is_legacy:
        collection.drop_search_index(self.index_name)
        # BLOCKING WAIT: Ensure deletion completes before recreation
        self._wait_for_index_status(collection, self.index_name, "deleted")
        should_create_index = True  # Recreated below with vectorSearch type

# 4. Create new index (if needed)
if should_create_index:
    collection.create_search_index(search_index_model)
    # BLOCKING WAIT: Ensure index is queryable before returning
    if self.wait_for_index_ready:
        self._wait_for_index_status(collection, self.index_name, "ready")

Key Features:

Data Preservation: Only drops the index definition, not the collection
Automatic: Happens transparently on initialization
Asynchronous Handling: Properly waits for index operations to complete
Legacy Detection: Supports both old mappings structure and missing type field
Idempotent: Safe to call multiple times
Error Handling: Gracefully handles race conditions and timeouts
Configurable: wait_for_index_ready and index_creation_timeout parameters
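The legacy check can be isolated as a pure function, which makes it easy to unit test without a running cluster. This is a minimal sketch, assuming the index type is reported at the top level of each document returned by list_search_indexes() and the definition under `latestDefinition`. The function name `is_legacy_index` is hypothetical.

```python
from typing import Any, Dict


def is_legacy_index(index_doc: Dict[str, Any], path: str = "embedding") -> bool:
    """Return True if the index should be dropped and recreated as vectorSearch."""
    definition = index_doc.get("latestDefinition", {})
    embedding_field = (
        definition.get("mappings", {}).get("fields", {}).get(path, {})
    )
    # Old format: mappings.fields.<path>.type == "knnVector"
    if embedding_field.get("type") == "knnVector":
        return True
    # Anything that is not explicitly a vectorSearch index is treated as legacy.
    return index_doc.get("type") != "vectorSearch"


legacy_doc = {
    "name": "my_index",
    "type": "search",
    "latestDefinition": {
        "mappings": {
            "dynamic": False,
            "fields": {
                "embedding": {
                    "type": "knnVector",
                    "dimensions": 3,
                    "similarity": "cosine",
                }
            },
        }
    },
}
modern_doc = {
    "name": "my_index",
    "type": "vectorSearch",
    "latestDefinition": {
        "fields": [
            {"type": "vector", "path": "embedding", "numDimensions": 3, "similarity": "cosine"}
        ]
    },
}
```

Because a modern index returns False, calling the check on every initialization is idempotent: only the first run after the upgrade triggers a drop and recreate.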

Asynchronous Index Operations

MongoDB Atlas Search index operations are asynchronous. The implementation includes:

_wait_for_index_status() Helper Method:

Polls index status until target state is reached
Checks queryable=True for readiness (not just existence)
Enforces configurable timeout to prevent infinite loops
Handles both "ready" and "deleted" states

Why This Matters:

Index creation/deletion returns immediately, but work happens in background
Without waiting, searches may fail with "index not ready" errors
Prevents race conditions when dropping and recreating indexes with same name
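The polling behavior described above can be sketched as follows. This is not the PR's `_wait_for_index_status()` implementation: the signature, the `list_indexes` callable, and the boolean return value are assumptions chosen so the example runs without a database.

```python
import time
from typing import Any, Callable, Dict, List


def wait_for_index_status(
    list_indexes: Callable[[], List[Dict[str, Any]]],
    target: str,
    timeout: float = 300.0,
    poll_interval: float = 1.0,
) -> bool:
    """Poll until the index is queryable ("ready") or gone ("deleted").

    Returns True when the target state is reached, False on timeout.
    """
    if target not in ("ready", "deleted"):
        raise ValueError("target must be 'ready' or 'deleted'")
    deadline = time.monotonic() + timeout
    while True:
        indexes = list_indexes()
        if target == "deleted" and not indexes:
            return True
        # Existence is not enough: the index must report queryable=True.
        if target == "ready" and indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With PyMongo, the callable could be `lambda: list(collection.list_search_indexes(name=index_name))`, so the same helper covers both the wait after drop_search_index() and the wait after create_search_index().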

Configuration:

MongoDB(
    ...,
    wait_for_index_ready=True,      # Block until index is queryable (default: True)
    index_creation_timeout=300      # Max seconds to wait (default: 300s)
)

Production Considerations:

For large datasets (1M+ vectors), increase index_creation_timeout
For production APIs, set wait_for_index_ready=False and handle "index not ready" errors gracefully
Consider running index creation in separate migration scripts for very large datasets
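For the non-blocking case, one way to handle "index not ready" errors is a bounded retry with exponential backoff around the search call. The sketch below is generic: the helper name is hypothetical, and which exception signals an unready index depends on the driver and server, so the exception types are passed in by the caller rather than assumed.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 5,
    base_delay: float = 0.5,
) -> T:
    """Run operation, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller
            time.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A caller might wrap a search as `call_with_retry(lambda: store.search(...), retry_on=(SomeDriverError,))`, where `SomeDriverError` stands in for whatever exception the deployment actually raises.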

Performance Considerations

Removed: Redundant index existence check on every search (was calling list_search_indexes())
Improved: numCandidates = limit * 20 for better recall (trade-off: slightly slower but more accurate)
Optimized: Direct aggregation pipeline execution
Asynchronous: Proper handling of async index operations prevents errors and race conditions

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
I have checked my code and corrected any misspellings

Breaking Changes

None - The API remains fully backward compatible. Only internal implementation changed.

Migration: Existing knnVector indexes are automatically detected and migrated on initialization. No manual intervention required - your data is preserved during the migration process.

Maintainer Checklist

Closes #3970 (MONGODB BUG FIX: knnVector is deprecated)
Made sure Checks passed

ranfysvalle02 added 4 commits February 2, 2026 20:54

allowing field-level updates without overwriting the entire payload d…

bb912c2

…ictionary.

main is reverted

555c4ee

update to vectorSearch + testing

08b123a

no more knnVector -- and testing

e783be8

ranfysvalle02 wants to merge 4 commits into mem0ai:main from ranfysvalle02:feature/knnVector-update-mongodb
