# Vector Database Comparison For AI Workloads: Elasticsearch vs MongoDB Atlas Vector Search


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/performance_guidance/vector_database_comparison_mongodb_elastic.ipynb)
-----

While both MongoDB Atlas and Elasticsearch can store vector embeddings for AI applications, they serve fundamentally different purposes:

- **Elasticsearch** is primarily a search engine optimized for information retrieval and analytics. While it can store vector embeddings, it wasn't designed as a primary database for applications that require strong consistency and ACID properties.

- **MongoDB Atlas** is a fully-featured database with built-in vector search capabilities. As a true database, MongoDB provides ACID compliance (Atomicity, Consistency, Isolation, Durability), which is essential for production AI applications that require data reliability.

### Why ACID Compliance Matters for AI Applications:

1. **Atomicity**: Ensures that complex AI operations (like updating a knowledge base with new embeddings) either complete entirely or not at all.

2. **Consistency**: Guarantees that AI systems work with consistent data, preventing issues like training on inconsistent datasets.

3. **Isolation**: Allows multiple AI processes to work concurrently without interfering with each other.

4. **Durability**: Ensures that committed AI data (like trained embeddings or inference results) isn't lost due to system failures.

In this notebook, we'll compare both systems for AI workloads, but keep in mind that while Elasticsearch excels at search-specific tasks, MongoDB provides a complete solution for AI applications that need both vector search and database functionality.
```



## Part 1: Data Setup

In [2]:
import getpass
import os


# Function to securely get and set environment variables
def set_env_securely(var_name, prompt):
    value = getpass.getpass(prompt)
    os.environ[var_name] = value

### Step 1: Install Libraries

All the libraries are installed using pip and facilitate the sourcing of data, embedding generation, and data visualization.

- `datasets`: Hugging Face library for managing and preprocessing datasets across text, image, and audio (https://huggingface.co/datasets)
- `sentence_transformers`: For creating sentence embeddings for tasks like semantic search and clustering. (https://www.sbert.net/)
- `pandas`: A library for data manipulation and analysis with DataFrames and Series (https://pandas.pydata.org/)
- `matplotlib`: A library for creating static, interactive, and animated data visualizations (https://matplotlib.org/)
- `seaborn`: A library for creating statistical data visualizations (https://seaborn.pydata.org/)
- `cohere`: A library for generating embeddings and accessing the Cohere API or models (https://cohere.ai/)

In [None]:
%pip install --upgrade --quiet datasets sentence_transformers pandas matplotlib seaborn cohere

### Step 2: Data Loading

The dataset for the benchmark is sourced from the Hugging Face Cohere Wikipedia dataset.

The [Cohere/wikipedia-22-12-en-embeddings](https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings) dataset on Hugging Face comprises English Wikipedia articles embedded using Cohere's multilingual-22-12 model. Each entry includes the article's title, text, URL, Wikipedia ID, view count, paragraph ID, language codes, and a 768-dimensional embedding vector. This dataset is valuable for tasks like semantic search, information retrieval, and NLP model training.

For this benchmark, we are using 100,000 rows of the dataset and have removed the id, wiki_id, paragraph_id, langs and views columns.

In [116]:
import pandas as pd
from datasets import load_dataset

# Using 100,000 rows for testing, feel free to change this to any number of rows you want to test
# The wikipedia-22-12-en-embeddings dataset has approximately 35,000,000 rows and requires 120GB of memory to load
MAX_ROWS = 100000

dataset = load_dataset(
    "Cohere/wikipedia-22-12-en-embeddings", split="train", streaming=True
)
dataset_segment = dataset.take(MAX_ROWS)

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset_segment)

In [117]:
# Remove the id field, wiki_id, paragraph_id, langs and views from the dataset
# This is to replicate the structure of dataset usually encountered in AI workloads, particularly in RAG systems where metadata is extracted from documents and stored.
dataset_df = dataset_df.drop(
    columns=["id", "wiki_id", "paragraph_id", "langs", "views"]
)

In [118]:
# Change the emb colomn name to embedding
dataset_df = dataset_df.rename(columns={"emb": "embedding"})

In [119]:
dataset_df.head(5)

Unnamed: 0,title,text,url,embedding
0,Deaths in 2022,The following notable deaths occurred in 2022....,https://en.wikipedia.org/wiki?curid=69407798,"[0.2865696847438812, -0.03181683272123337, 0.0..."
1,YouTube,YouTube is a global online video sharing and s...,https://en.wikipedia.org/wiki?curid=3524766,"[-0.09689381718635559, 0.1619211882352829, -0...."
2,YouTube,"In October 2006, YouTube was bought by Google ...",https://en.wikipedia.org/wiki?curid=3524766,"[0.1302049309015274, 0.265736848115921, 0.4018..."
3,YouTube,"Since its purchase by Google, YouTube has expa...",https://en.wikipedia.org/wiki?curid=3524766,"[-0.09791257232427597, 0.13586106896400452, -0..."
4,YouTube,YouTube has had an unprecedented social impact...,https://en.wikipedia.org/wiki?curid=3524766,"[-0.2641527056694031, 0.06968216598033905, -0...."


### Step 3: Embedding Generation

In [9]:
# Set Cohere API key
set_env_securely("COHERE_API_KEY", "Enter your Cohere API key: ")

Using the Cohere API to generate embeddings for the test queries.

Using the `embed-multilingual-v2.0` model. This is the same model used in the Cohere Wikipedia dataset.

Embedding size is 768 dimensions and the precision is float32.

In [10]:
from typing import List, Tuple

import cohere

# Initialize Cohere Client
co = cohere.Client()


def get_cohere_embeddings(
    sentences: List[str],
    model: str = "embed-multilingual-v2.0",
    input_type: str = "search_document",
) -> Tuple[List[float], List[int]]:
    """
    Generates embeddings for the provided sentences using Cohere's embedding model.

    Args:
    sentences (list of str): List of sentences to generate embeddings for.

    Returns:
    Tuple[List[float], List[int]]: A tuple containing two lists of embeddings (float and int8).
    """
    generated_embedding = co.embed(
        texts=sentences,
        model="embed-multilingual-v2.0",
        input_type="search_document",
        embedding_types=["float"],
    ).embeddings

    return generated_embedding.float[0]

Generate embeddings for the query templates used in benchmarking process

Note: Doing this to avoid the overhead of generating embeddings for each query during the benchmark process

Note: Feel free to add more queries to the query_templates list to test the performance of the vector database with a larger number of queries

In [120]:
query_templates = [
    "When was YouTube officially launched, and by whom?",
    "What is YouTube's slogan introduced after Google's acquisition?",
    "How many hours of videos are collectively watched on YouTube daily?",
    "Which was the first video uploaded to YouTube, and when was it uploaded?",
    "What was the acquisition cost of YouTube by Google, and when was the deal finalized?",
    "What was the first YouTube video to reach one million views, and when did it happen?",
    "What are the three separate branches of the United States government?",
    "Which country has the highest documented incarceration rate and prison population?",
    "How many executions have occurred in the United States since 1977, and which countries have more?",
    "What percentage of the global military spending did the United States account for in 2019?",
    "How is the U.S. president elected?",
    "What cooling system innovation was included in the proposed venues for the World Cup in Qatar?",
    "What lawsuit was filed against Google in June 2020, and what was it about?",
    "How much was Google fined by CNIL in January 2022, and for what reason?",
    "When did YouTube join the NSA's PRISM program, according to reports?",
]

# For each query template question, generate an embedding
# NOTE: Doing this to avoid the overhead of generating embeddings for each query during the benchmark process
query_embeddings = [
    get_cohere_embeddings(sentences=[query], input_type="search_query")
    for query in query_templates
]

In [121]:
# Create a dictionary with the query templates and their corresponding embeddings
query_embeddings_dict = {
    query: embedding for query, embedding in zip(query_templates, query_embeddings)
}

## Part 2: Search with Elasticsearch


### Step 1: Install Libraries


In [14]:
%pip install --upgrade --quiet elasticsearch eland


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Step 2: Installing Elasticsearch Locally

**Run start-local**

To set up Elasticsearch and Kibana locally, run the start-local script:

```curl -fsSL https://elastic.co/start-local | sh```

This script creates an elastic-start-local folder containing configuration files and starts both Elasticsearch and Kibana using Docker.

After running the script, you can access Elastic services at the following endpoints:

- Elasticsearch: http://localhost:9200
- Kibana: http://localhost:5601

Elasticsearch is installed using docker.

Find more instructions on installing Elasticsearch with docker [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html)

The Elasticsearch docker image is pulled from the [elasticsearch](https://hub.docker.com/_/elasticsearch) repository.

NOTE: To uninstall Elasticsearch, run the following command:

```./uninstall.sh```


Set the Elastic Cloud ID and API key in the environment variables

In [None]:
# set_env_securely("ELASTIC_CLOUD_ID", "Enter your Elastic Cloud ID: ")

In [15]:
set_env_securely("ELASTIC_API_KEY", "Enter your Elastic API key: ")

In [129]:
from elasticsearch import Elasticsearch, helpers

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
# ELASTIC_CLOUD_ID = os.environ['ELASTIC_CLOUD_ID']

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = os.environ["ELASTIC_API_KEY"]

# Create the client instance
client = Elasticsearch(
    # For local development
    hosts=["http://localhost:9200"],
    # cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)

# Confirm the client has connected
print(client.info())

{'name': '98e3e4446da5', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'IVNLlQWMTBmdUq8OAmhneA', 'version': {'number': '8.17.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'a091390de485bd4b127884f7e565c0cad59b10d2', 'build_date': '2025-02-28T10:07:26.089129809Z', 'build_snapshot': False, 'lucene_version': '9.12.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}


### Step 3: Create Elasticsearch Index

Create an index with the name `wikipedia_data` and the following mapping:
- `title`: The title of the Wikipedia article
- `text`: The text of the Wikipedia article
- `url`: The URL of the Wikipedia article
- `json_data`: The JSON data of the Wikipedia article
- `embedding`: The embedding vector for the Wikipedia article

Create an index in Elasticsearch with the right index mappings to handle vector searches.

One thing to note is that by default elasticsearch quantizes the embeddings to 8 bits. This means that the precison of your vector embeddings are reduced if you don't explicitly set the index_options to hnsw. More information can be found in the elasticsearch documentation.


In [132]:
index_name = "wikipedia_data"

# Using an explicit mapping to handle various search patterns
index_mapping = {
    "properties": {
        "title": {"type": "text"},
        "text": {"type": "text"},
        "url": {"type": "text"},
        "embedding": {
            "type": "dense_vector",
            "dims": 768,
            "index": "true",
            "similarity": "cosine",
            "index_options": {
                "type": "hnsw"
            }
        }
    }
}

# flag to check if index has to be deleted before creating
should_delete_index = True

# check if we want to delete index before creating the index
if should_delete_index:
    if client.indices.exists(index=index_name):
        print(f"Deleting existing {index_name}")
        client.indices.delete(index=index_name, ignore=[400, 404])

print(f"Creating index {index_name}")

index_settings = {}

# Create the index
client.options(ignore_status=[400, 404]).indices.create(
    index=index_name, mappings=index_mapping, settings=index_settings
)

Creating index wikipedia_data


ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'wikipedia_data'})

### Step 4: Define insert function

In [133]:
import time
from elasticsearch.helpers import BulkIndexError


# Define the function to batch the data to bulk actions
def batch_to_bulk_actions(batch):
    for _, record in batch.iterrows():
        action = {
            "_index": "wikipedia_data",
            "_source": {
                "title": record["title"],
                "text": record["text"],
                "url": record["url"],
                "embedding": record["embedding"],
            },
        }
        yield action


def insert_data_to_elastic(dataframe, client, database_type="Elasticsearch"):
    """
    Insert data into Elasticsearch and record benchmark metrics.

    Args:
        dataframe (pandas.DataFrame): The dataframe containing the data to insert.
        client (elasticsearch.Elasticsearch): The Elasticsearch client to use for the insertion.
        database_type (str): The type of database (default: "Elasticsearch").
    """
    start_time = time.time()
    total_rows = len(dataframe)

    try:
        # Convert DataFrame records to Elasticsearch actions
        actions = list(batch_to_bulk_actions(dataframe))

        # Perform bulk insert
        helpers.bulk(client, actions)

        end_time = time.time()
        total_time = end_time - start_time
        rows_per_second = total_rows / total_time

        # print(f"\nElasticsearch Insertion Statistics:")
        # print(f"Total time: {total_time:.2f} seconds")
        # print(f"Average insertion rate: {rows_per_second:.2f} rows/second")
        # print(f"Total rows inserted: {total_rows}")

        return True

    except BulkIndexError as e:
        print(f"Error during bulk insert: {e.errors}")
        return False
    except Exception as e:
        print(f"Error during data ingestion: {e}")
        return False

### Step 5: Insert Data into Elasic

In [134]:
try:
    insert_data_to_elastic(dataset_df, client)
    print("Data ingestion into Elasticsearch complete!")
except BulkIndexError as e:
    print(f"{e.errors}")

Data ingestion into Elasticsearch complete!
{'insert_time': {'total_time': 257.366415977478, 'rows_per_second': 388.55108433709137, 'total_rows': 100000}, 'incremental_insert': {1: {'total_time': 0.09557032585144043, 'rows_per_second': 10.463498906077321, 'batch_size': 1}, 10: {'total_time': 0.04177069664001465, 'rows_per_second': 239.40227969337724, 'batch_size': 10}, 20: {'total_time': 0.15852594375610352, 'rows_per_second': 126.16231467305153, 'batch_size': 20}, 40: {'total_time': 0.08487296104431152, 'rows_per_second': 471.29261790591124, 'batch_size': 40}, 80: {'total_time': 0.21546411514282227, 'rows_per_second': 371.2915254912462, 'batch_size': 80}, 160: {'total_time': 0.44211506843566895, 'rows_per_second': 361.8967355401984, 'batch_size': 160}, 320: {'total_time': 0.8344941139221191, 'rows_per_second': 383.4658563330078, 'batch_size': 320}, 640: {'total_time': 1.7254469394683838, 'rows_per_second': 370.91838952589654, 'batch_size': 640}, 1280: {'total_time': 2.4678359031677246

### Step 6: Define Full Text Search function

In [135]:
def full_text_search_with_elastic(query, client, top_n=5):
    search_body = {
        "query": {
            "match": {
                "text": query
            }
        }
    }

    response = client.search(
        index="wikipedia_data",
        body=search_body,
        size=top_n,
        _source_excludes=["embedding", "id"]  # Exclude unwanted fields
    )

    results = []
    for hit in response["hits"]["hits"]:
        score = hit["_score"]
        title = hit["_source"]["title"]
        text = hit["_source"]["text"]
        url = hit["_source"]["url"]
        result = {
            "_score": score,
            "title": title,
            "text": text,
            "url": url
        }
        results.append(result)
    return results


In [136]:
query_text = "When was YouTube officially launched, and by whom?"

get_knowledge_full_text = full_text_search_with_elastic(query_text, client, top_n=5)

  response = client.search(


In [138]:
pd.DataFrame(get_knowledge_full_text).head()

Unnamed: 0,_score,title,text,url
0,21.650394,YouTube Premium,YouTube Red was officially unveiled on October...,https://en.wikipedia.org/wiki?curid=44382466
1,19.211123,YouTube,Susan Wojcicki was appointed CEO of YouTube in...,https://en.wikipedia.org/wiki?curid=3524766
2,17.225754,YouTube,"Through this period, YouTube tried several new...",https://en.wikipedia.org/wiki?curid=3524766
3,16.050175,Sweden,The licence-funded television service was offi...,https://en.wikipedia.org/wiki?curid=5058739
4,15.866064,YouTube,"In September 2012, YouTube launched its first ...",https://en.wikipedia.org/wiki?curid=3524766


### Step 7: Define semantic search function


In [139]:
def semantic_search_with_elastic(plot_query, client, top_n=5):
    query_embedding = query_embeddings_dict[plot_query]

    knn = {
        "field": "embedding",
        "query_vector": query_embedding,
        "k": top_n,
        "num_candidates": 100,
    }

    response = client.search(
        index="wikipedia_data", 
        knn=knn, 
        size=top_n, 
        _source_excludes=["embedding", "id"]  # Exclude the embedding field from the results)
    )
    results = []
    for hit in response["hits"]["hits"]:
        score = hit["_score"]
        title = hit["_source"]["title"]
        text = hit["_source"]["text"]
        url = hit["_source"]["url"]
        result = {
            "_score": score,
            "title": title,
            "text": text,
            "url": url
        }
        results.append(result)
    return results

In [140]:
query_text = "When was YouTube officially launched, and by whom?"

get_knowledge_semantic = semantic_search_with_elastic(query_text, client, top_n=5)

In [142]:
pd.DataFrame(get_knowledge_semantic).head()

Unnamed: 0,_score,title,text,url
0,0.951713,YouTube,YouTube announced the project in September 201...,https://en.wikipedia.org/wiki?curid=3524766
1,0.948441,YouTube,The mobile version of the site was relaunched ...,https://en.wikipedia.org/wiki?curid=3524766
2,0.94837,YouTube,"In January 2009, YouTube launched ""YouTube for...",https://en.wikipedia.org/wiki?curid=3524766
3,0.947532,YouTube,"Later the same year, ""YouTube Feather"" was int...",https://en.wikipedia.org/wiki?curid=3524766
4,0.946378,Twitch (service),"On May 18, 2014, ""Variety"" first reported that...",https://en.wikipedia.org/wiki?curid=33548254


### Step 8: Define Hybrid Search function

In [143]:
def hybrid_search_with_elastic(query, client, top_n=5):
    """
    Perform hybrid search by combining a full-text search (on the 'text' field)
    with a vector (kNN) search (on the 'embedding' field) using Reciprocal Rank Fusion (RRF).
    
    Args:
        query (str): The search query.
        client: The Elasticsearch client instance.
        top_n (int): Number of top results to return.
        
    Returns:
        List[dict]: A list of search result documents.
    """
    # Get the query embedding (assumes query_embeddings_dict is defined)
    try:
        query_embedding = query_embeddings_dict[query]
    except KeyError:
        raise ValueError("No embedding found for the given query.")

    # Define the full-text search part (matching on the 'text' field)
    full_text_query = {
        "match": {
            "text": query
        }
    }

    # Define the kNN (vector) search part on the 'embedding' field
    knn_query = {
        "field": "embedding",
        "query_vector": query_embedding,
        "k": top_n,
        "num_candidates": 100,
    }

    # Execute the search with both query and knn parts along with RRF ranking
    response = client.search(
        index="wikipedia_data",
        query=full_text_query,
        knn=knn_query,
        rank={"rrf": {}},  # Enables reciprocal rank fusion to combine both result lists
        size=top_n,
        _source_excludes=["embedding", "id"]
    )

    # Parse the results
    results = []
    for hit in response["hits"]["hits"]:
        score = hit["_score"]
        title = hit["_source"].get("title", "")
        text = hit["_source"].get("text", "")
        url = hit["_source"].get("url", "")
        result = {
            "_score": score,
            "title": title,
            "text": text,
            "url": url
        }
        results.append(result)

    return results


In [144]:
query_text = "When was YouTube officially launched, and by whom?"

get_knowledge_hybrid = hybrid_search_with_elastic(query_text, client, top_n=5)

  response = client.search(


In [145]:
pd.DataFrame(get_knowledge_hybrid).head()

Unnamed: 0,_score,title,text,url
0,0.016393,YouTube Premium,YouTube Red was officially unveiled on October...,https://en.wikipedia.org/wiki?curid=44382466
1,0.016393,YouTube,YouTube announced the project in September 201...,https://en.wikipedia.org/wiki?curid=3524766
2,0.016129,YouTube,Susan Wojcicki was appointed CEO of YouTube in...,https://en.wikipedia.org/wiki?curid=3524766
3,0.016129,YouTube,The mobile version of the site was relaunched ...,https://en.wikipedia.org/wiki?curid=3524766
4,0.015873,YouTube,"Through this period, YouTube tried several new...",https://en.wikipedia.org/wiki?curid=3524766


## Part 3: Search with MongoDB Atlas Vector Search

### Step 1: Install Libraries

- `pymongo` (4.10.1): A Python driver for MongoDB (https://pymongo.readthedocs.io/en/stable/)

In [26]:
%pip install --quiet --upgrade pymongo

python(30505) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Step 2: Installing MongoDB via Atlas CLI

The Atlas CLI is a command line interface built specifically for MongoDB Atlas. 
Interact with your Atlas database deployments and Atlas Search from the terminal with short, intuitive commands, so you can accomplish complex database management tasks in seconds.

You can follow the instructions [here](https://www.mongodb.com/docs/atlas/cli/current/install-atlas-cli/#complete-the-prerequisites-3) to install the Atlas CLI using docker(other options are available) and get a local MongoDB database instance running.

Follow the steps [here](https://www.mongodb.com/docs/atlas/cli/current/atlas-cli-docker/#follow-these-steps) to run Altas CLI commands with Docker.

Find more information on the Atlas CLI [here](https://www.mongodb.com/docs/atlas/cli/): 

### Step 3: Connect to MongoDB and Create Database and Collection

After installing the Atlas CLI, you can run the following command to connect to your MongoDB database:
1. ```atlas deployments connect```
2. You will be prompted to specificy "How would you like to connect to local9410"
3. Select connectionString
4. Copy the connection string and paste it into the MONGO_URI environment variable

More information [here](https://www.mongodb.com/docs/atlas/cli/current/atlas-cli-deploy-fts/#connect-to-the-deployment).

In [27]:
# Set MongoDB URI
# Example: mongodb://localhost:52094/?directConnection=true
set_env_securely("MONGO_URI", "Enter your MONGO URI: ")

In the following code blocks below we do the following:
1. Establish a connection to the MongoDB database
2. Create a database and collection if they do not already exist
3. Delete all data in the collection if it already exists


In [149]:
import pymongo


def get_mongo_client(mongo_uri):
    """Establish and validate connection to the MongoDB."""

    client = pymongo.MongoClient(
        mongo_uri, appname="devrel.showcase.mongodb_vs_elasticsearch.python"
    )

    # Validate the connection
    ping_result = client.admin.command("ping")
    if ping_result.get("ok") == 1.0:
        # Connection successful
        print("Connection to MongoDB successful")
        return client
    else:
        print("Connection to MongoDB failed")
    return None


MONGO_URI = os.environ["MONGO_URI"]
if not MONGO_URI:
    print("MONGO_URI not set in environment variables")

In [150]:
from pymongo.errors import CollectionInvalid

mongo_client = get_mongo_client(MONGO_URI)

DB_NAME = "vector_db"
COLLECTION_NAME = "wikipedia_data_test"

# Create or get the database
db = mongo_client[DB_NAME]

# Check if the collection exists
if COLLECTION_NAME not in db.list_collection_names():
    try:
        # Create the collection
        db.create_collection(COLLECTION_NAME)
        print(f"Collection '{COLLECTION_NAME}' created successfully.")
    except CollectionInvalid as e:
        print(f"Error creating collection: {e}")
else:
    print(f"Collection '{COLLECTION_NAME}' already exists.")

# Assign the collection
collection = db[COLLECTION_NAME]

Connection to MongoDB successful
Collection 'wikipedia_data_test' already exists.


In [83]:
collection.delete_many({})

DeleteResult({'n': 100000, 'electionId': ObjectId('7fffffff0000000000000009'), 'opTime': {'ts': Timestamp(1741686592, 614), 't': 9}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1741686592, 614), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1741686592, 614)}, acknowledged=True)

### Step 4: Vector Index Creation

The `setup_vector_search_index` function creates a vector search index for the MongoDB collection.

The `index_name` parameter is the name of the index to create.

The `embedding_field_name` parameter is the name of the field containing the text embeddings on each document within the wikipedia_data collection.


In [151]:
embedding_field_name = "embedding"
vector_search_index_name = "vector_index"

In [154]:
import time

from pymongo.operations import SearchIndexModel


def setup_vector_search_index(collection, index_name="vector_index"):
    """
    Setup a vector search index for a MongoDB collection and wait for 30 seconds.

    Args:
    collection: MongoDB collection object
    index_definition: Dictionary containing the index definition
    index_name: Name of the index (default: "vector_index")
    """
    new_vector_search_index_model = SearchIndexModel(
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "embedding",
                    "numDimensions": 768,
                    "similarity": "cosine"
                },
                {
                    "type": "filter",
                    "path": "title",
                }
            ]
        },
        name=index_name,
        type="vectorSearch",
    )

    # Create the new index
    try:
        result = collection.create_search_index(model=new_vector_search_index_model)
        print(f"Creating index '{index_name}'...")
        return result

    except Exception as e:
        print(f"Error creating new vector search index '{index_name}': {e!s}")
        return None

In [155]:
setup_vector_search_index(collection, vector_search_index_name)

Creating index 'vector_index'...


'vector_index'

### Step 5: Create Search Index

In [157]:

def setup_text_search_index(collection, index_name="text_search_index"):
    """
    Setup a text search index for a MongoDB collection in Atlas.
    Args:
        collection (Collection): MongoDB collection object.
        index_name (str): Name of the index (default: "text_search_index").
    """
    # Define the search index model
    search_index_model = {
        "name": index_name,
        "type": "search",
        "definition": {
            "mappings": {
                "dynamic": False,
                "fields": {
                    "title": {
                        "type": "string"
                    }
                }
            }
        }
    }
    # Create the search index
    try:
        result = collection.create_search_index(search_index_model)
        print(f"Creating index '{index_name}'...")
        return result
    except Exception as e:
        print(f"Error creating text search index '{index_name}': {e}")
        return None

In [158]:
setup_text_search_index(collection)

Creating index 'text_search_index'...


'text_search_index'

### Step 5: Define Insert Data Function

Because of the affinity of MongoDB for JSON data, we don't have to convert the Python Dictionary in the `json_data` attribute to a JSON string using the `json.dumps()` function. Instead, we can directly insert the Python Dictionary into the MongoDB collection.

This reduced the operational overhead of the insertion processes in AI workloads.


In [84]:
def insert_data_to_mongodb(dataframe, collection, database_type="MongoDB"):

    try:
        # Convert DataFrame to list of dictionaries for MongoDB insertion
        documents = dataframe.to_dict("records")

        # Use insert_many for better performance
        result = collection.insert_many(documents)

        return True

    except Exception as e:
        print(f"Error during MongoDB insertion: {e}")
        return False

### Step 6: Insert Data into MongoDB


In [36]:
from bson.binary import Binary, BinaryVectorDtype

def generate_bson_vector(array, data_type):
    return Binary.from_vector(array, BinaryVectorDtype(data_type))

In [38]:
# Make a copy of the dataset to avoid modifying the original
mongodb_df = dataset_df.copy()

# Convert the vector embeddings in the dataset to Binary Vector Data
# This converts the numpy array embeddings to BSON binary vector format
# which is more efficient for storage and retrieval in MongoDB
mongodb_df["embedding"] = mongodb_df["embedding"].apply(
    lambda x: generate_bson_vector(x, BinaryVectorDtype.FLOAT32)
)


In [None]:
%pip install --upgrade pandas

In [85]:
success = insert_data_to_mongodb(mongodb_df, collection)

### Step 7: Define Full Text Search Function

In [165]:
def text_search_with_mongodb(query_text, collection, top_n=5):
    """
    Perform a text search in the MongoDB collection based on the user query.

    Args:
        query_text (str): The user's query string.
        collection (MongoCollection): The MongoDB collection to search.
        top_n (int): The number of top results to return.

    Returns:
    list: A list of matching documents.
    """
    # Define the text search stage
    # The text operator performs a full-text search using the analyzer that you specify in the index configuration. 
    # The text operator below uses the default standard analyzer.
    text_search_stage = {
        "$search": {
            "index": "text_search_index",
            # Search for the query text in the title field
            "text": {
                "query": query_text,
                "path": "title"
            }
        }
    }

    limit_stage = {"$limit": top_n}

    project_stage = {
        "$project": {
            "_id": 0,
            "title": 1,
            "text": 1,
            "url": 1
        }
    }

    # Define the aggregate pipeline with the text search stage
    pipeline = [text_search_stage, limit_stage, project_stage]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)

In [166]:
query_text = "When was YouTube officially launched, and by whom?"

get_knowledge_full_text_mdb = text_search_with_mongodb(query_text, collection)


In [167]:
pd.DataFrame(get_knowledge_full_text_mdb).head()

Unnamed: 0,title,text,url
0,YouTube,YouTube has had an unprecedented social impact...,https://en.wikipedia.org/wiki?curid=3524766
1,YouTube,"The company experienced rapid growth. ""The Dai...",https://en.wikipedia.org/wiki?curid=3524766
2,YouTube,"The company was attacked on April 3, 2018, whe...",https://en.wikipedia.org/wiki?curid=3524766
3,YouTube,YouTube primarily uses the VP9 and H.264/MPEG-...,https://en.wikipedia.org/wiki?curid=3524766
4,YouTube,"From 2008 to 2017, users could add ""annotation...",https://en.wikipedia.org/wiki?curid=3524766


### Step 8: Define Semantic Search Function

The `semantic_search_with_mongodb` function performs a vector search in the MongoDB collection based on the user query.

- `user_query` parameter is the user's query string.
- `collection` parameter is the MongoDB collection to search.
- `top_n` parameter is the number of top results to return.
- `vector_search_index_name` parameter is the name of the vector search index to use for the search.

The `numCandidates` parameter is the number of candidate matches to consider. This is set to 150 to match the number of candidate matches to consider in the Elasticsearch vector search.

Another point to note is the queries in MongoDB are performed using the `aggregate` function enabled by the MongoDB Query Language(MQL).

This allows for more flexibility in the queries and the ability to perform more complex searches. And data processing opreations can be defined as stages in the pipeline. If you are a data engineer, data scientist or ML Engineer, the concept of pipeline processing is a key concept.




In [168]:
def semantic_search_with_mongodb(
    user_query, collection, top_n=5, vector_search_index_name="vector_index"
):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.
    vector_search_index_name (str): The name of the vector search index.

    Returns:
    list: A list of matching documents.
    """

    # Take a query embedding from the query_embeddings_dict
    query_embedding = query_embeddings_dict[user_query]

    if query_embedding is None:
        return "Invalid query or embedding generation failed."

    # Define the vector search stage
    vector_search_stage = {
        "$vectorSearch": {
            "index": vector_search_index_name,  # specifies the index to use for the search
            "queryVector": query_embedding,  # the vector representing the query
            "path": "embedding",  # field in the documents containing the vectors to search against
            "numCandidates": 100,  # number of candidate matches to consider
            "limit": top_n,  # return top n matches
        }
    }

    project_stage = {
        "$project": {
            "_id": 0,  # Exclude the _id field
            "title": 1,
            "text": 1,
            "url": 1,
            "score": {
                "$meta": "vectorSearchScore"  # Include the search score
            },
        }
    }

    # Define the aggregate pipeline with the vector search stage
    pipeline = [vector_search_stage, project_stage]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)

In [170]:
query_text = "When was YouTube officially launched, and by whom?"

get_knowledge_semantic_mdb = semantic_search_with_mongodb(
    query_text, collection, vector_search_index_name=vector_search_index_name
)

In [171]:
pd.DataFrame(get_knowledge_semantic_mdb).head()

Unnamed: 0,title,text,url,score
0,YouTube,YouTube announced the project in September 201...,https://en.wikipedia.org/wiki?curid=3524766,0.951712
1,YouTube,The mobile version of the site was relaunched ...,https://en.wikipedia.org/wiki?curid=3524766,0.948441
2,YouTube,"In January 2009, YouTube launched ""YouTube for...",https://en.wikipedia.org/wiki?curid=3524766,0.94837
3,YouTube,"Later the same year, ""YouTube Feather"" was int...",https://en.wikipedia.org/wiki?curid=3524766,0.947532
4,Twitch (service),"On May 18, 2014, ""Variety"" first reported that...",https://en.wikipedia.org/wiki?curid=33548254,0.946378


### Step 9: Define Hybrid Search Function


The `hybrid_search_with_mongodb` function conducts a hybrid search on a MongoDB Atlas collection that combines a vector search and a full-text search using Atlas Search.

In the MongoDB hybrid search function, there are two weights:

- vector_weight = 0.5: This weight scales the score obtained from the vector search portion.
- full_text_weight = 0.5: This weight scales the score from the full-text search portion.

#### Note: In the MongoDB hybrid search function, two weights:
    - `vector_weight` 
    - `full_text_weight` 

They are used to control the influence of each search component on the final score. 

Here's how they work:

Purpose:
The weights allow you to adjust how much the vector (semantic) search and the full-text search contribute to the overall ranking. 
For example, a higher full_text_weight means that the full-text search results will have a larger impact on the final score, whereas a higher vector_weight would give more importance to the vector similarity score.

Usage in the Pipeline:
Within the aggregation pipeline, after retrieving results from each search type, the function computes a reciprocal ranking score for each result (using an expression like `1/(rank + 60)`). 
This score is then multiplied by the corresponding weight:

**Vector Search:**

```
"vs_score": {
  "$multiply": [ vector_weight, { "$divide": [1.0, { "$add": ["$rank", 60] } ] } ]
}
```


**Full-Text Search:**
```
"fts_score": {
  "$multiply": [ full_text_weight, { "$divide": [1.0, { "$add": ["$rank", 60] } ] } ]
}
```

Finally, these weighted scores are combined (typically by adding them together) to produce a final score that determines the ranking of the documents.

**Impact:**
By adjusting these weights, you can fine-tune the search results to better match your application's needs. For instance, if the full-text component is more reliable for your dataset, you might set full_text_weight higher than vector_weight.

The weights in the MongoDB function allow you to balance the contributions from vector-based and full-text search components, ensuring that the final ranking score reflects the desired importance of each search method.

In [198]:
def hybrid_search_with_mongodb(
    user_query,
    collection,
    vector_search_index_name="vector_index",
    text_search_index_name="text_search_index",
    vector_weight=0.5,
    full_text_weight=0.5,
    top_k=10
):
    """
    Conduct a hybrid search on a MongoDB Atlas collection that combines a vector search 
    and a full-text search using Atlas Search.

    Args:
        user_query (str): The user's query string.
        collection (MongoCollection): The MongoDB collection to search.
        vector_search_index_name (str): The name of the vector search index.
        text_search_index_name (str): The name of the text search index.
        vector_weight (float): The weight of the vector search.
        full_text_weight (float): The weight of the full-text search.

    Returns:
        list: A list of documents (dict) with combined scores.
    """

    collection_name = "wikipedia_data"
    query_vector = query_embeddings_dict[user_query]

    pipeline = [
        {
            "$vectorSearch": {
                "index": vector_search_index_name,
                "path": "embedding",
                "queryVector": query_vector,
                "numCandidates": 100,
                "limit": top_k
            }
        },
        {
            "$group": {
                "_id": None,
                "docs": {"$push": "$$ROOT"}
            }
        },
        {
            "$unwind": {
                "path": "$docs",
                "includeArrayIndex": "rank"
            }
        },
        {
            "$addFields": {
                "vs_score": {
                    "$multiply": [
                        vector_weight,
                        {"$divide": [1.0, {"$add": ["$rank", 60]}]}
                    ]
                }
            }
        },
        {
            "$project": {
                "vs_score": 1,
                "_id": "$docs._id",
                "title": "$docs.title",
                "url": "$docs.url",
                "text": "$docs.text"
            }
        },
        {
            "$unionWith": {
                "coll": collection_name,
                "pipeline": [
                    {
                        "$search": {
                            "index": text_search_index_name,
                            "text": {
                                "query": user_query,
                                "path": "title"
                            }
                        }
                    },
                    {"$limit": top_k},
                    {
                        "$group": {
                            "_id": None,
                            "docs": {"$push": "$$ROOT"}
                        }
                    },
                    {
                        "$unwind": {
                            "path": "$docs",
                            "includeArrayIndex": "rank"
                        }
                    },
                    {
                        "$addFields": {
                            "fts_score": {
                                "$multiply": [
                                    full_text_weight,
                                    {"$divide": [1.0, {"$add": ["$rank", 60]}]}
                                ]
                            }
                        }
                    },
                    {
                        "$project": {
                            "fts_score": 1,
                            "_id": "$docs._id",
                            "title": "$docs.title",
                            "url": "$docs.url",
                            "text": "$docs.text"
                        }
                    }
                ]
            }
        },
        {
            "$group": {
                "_id": "$_id",
                "title": {"$first": "$title"},
                "url": {"$first": "$url"},
                "text": {"$first": "$text"},
                "vs_score": {"$max": "$vs_score"},
                "fts_score": {"$max": "$fts_score"}
            }
        },
        {
            "$project": {
                "_id": 1,
                "title": 1,
                "url": 1,
                "text": 1,
                "vs_score": {"$ifNull": ["$vs_score", 0]},
                "fts_score": {"$ifNull": ["$fts_score", 0]}
            }
        },
        {
            "$project": {
                "score": {"$add": ["$fts_score", "$vs_score"]},
                "_id": 0,
                "title": 1,
                "url": 1,
                "text": 1,
                "vs_score": 1,
                "fts_score": 1
            }
        },
        {"$sort": {"score": -1}},
        {"$limit": top_k}
    ]

    results = list(collection.aggregate(pipeline))
    return results

In [199]:
query_text = "When was YouTube officially launched, and by whom?"

get_knowledge_hybrid_mdb = hybrid_search_with_mongodb(query_text, collection, vector_weight=0.1, full_text_weight=0.9, top_k=10)

In [200]:
pd.DataFrame(get_knowledge_hybrid_mdb).head()

Unnamed: 0,title,url,text,vs_score,fts_score,score
0,YouTube,https://en.wikipedia.org/wiki?curid=3524766,YouTube announced the project in September 201...,0.001667,0,0.001667
1,YouTube,https://en.wikipedia.org/wiki?curid=3524766,The mobile version of the site was relaunched ...,0.001639,0,0.001639
2,YouTube,https://en.wikipedia.org/wiki?curid=3524766,"In January 2009, YouTube launched ""YouTube for...",0.001613,0,0.001613
3,YouTube,https://en.wikipedia.org/wiki?curid=3524766,"Later the same year, ""YouTube Feather"" was int...",0.001587,0,0.001587
4,Twitch (service),https://en.wikipedia.org/wiki?curid=33548254,"On May 18, 2014, ""Variety"" first reported that...",0.001563,0,0.001563


## Part 4: Demonstrating MongoDB's ACID Capabilities for AI Workloads

In AI applications, it's common to need to update both vector embeddings and metadata atomically.
This section demonstrates MongoDB's ACID transactions, which ensure that complex updates
maintain data consistency - a critical feature missing in Elasticsearch.

In [None]:
import datetime

def update_document_with_transaction(collection, document_id, new_text, new_embedding):
    """
    Update both the text and embedding of a document atomically using a transaction.
    This ensures that if the embedding generation fails, the text update won't be committed.
    
    Args:
        collection: MongoDB collection
        document_id: ID of document to update
        new_text: New text content
        new_embedding: New vector embedding
    """
    # Start a session for the transaction
    with mongo_client.start_session() as session:
        # Start a transaction
        session.start_transaction()
        
        try:
            # Update the document's text
            collection.update_one(
                {"_id": document_id},
                {"$set": {"text": new_text}},
                session=session
            )
            
            # Update the document's embedding
            collection.update_one(
                {"_id": document_id},
                {"$set": {"embedding": new_embedding}},
                session=session
            )
            
            # If both operations succeed, commit the transaction
            session.commit_transaction()
            return True
            
        except Exception as e:
            # If any operation fails, abort the transaction
            session.abort_transaction()
            print(f"Transaction aborted: {e}")
            return False

In [None]:
# Demonstrating a transaction with a sample document
sample_doc = collection.find_one({})
if sample_doc:
    # Get a new embedding for demonstration
    new_text = sample_doc["text"] + " [Updated content]"
    new_embedding_vector = get_cohere_embeddings([new_text])[0]
    new_embedding = generate_bson_vector(new_embedding_vector, BinaryVectorDtype.FLOAT32)
    
    # Update with transaction
    success = update_document_with_transaction(
        collection, 
        sample_doc["_id"], 
        new_text, 
        new_embedding
    )
    
    print(f"Transaction {'succeeded' if success else 'failed'}")
    
    # Verify the update
    updated_doc = collection.find_one({"_id": sample_doc["_id"]})
    print(f"Document updated: {updated_doc['text'] == new_text}")

Note: Elasticsearch cannot perform this kind of atomic update across multiple fields.
If a failure occurs during the update process, you could end up with inconsistent
data where the text is updated but the embedding is not, creating a mismatch
between content and vector representation.