# Atlas Vector Search

MongoDB Atlas can store embeddings alongside document metadata and run vector search directly in the database.

## Use Case
Create a movie vector database and perform a vector search to find movies similar to the query plot.

# Step 1: Connect to MongoDB instance

In [1]:
!pip install pymongo
!pip install sentence-transformers

Collecting pymongo
  Obtaining dependency information for pymongo from https://files.pythonhosted.org/packages/de/67/949da6f882723be8ca8ef63678d7f999b4d9c235c656c0376ea8b6c041d6/pymongo-4.5.0-cp311-cp311-macosx_10_9_universal2.whl.metadata
  Using cached pymongo-4.5.0-cp311-cp311-macosx_10_9_universal2.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Obtaining dependency information for dnspython<3.0.0,>=1.16.0 from https://files.pythonhosted.org/packages/f6/b4/0a9bee52c50f226a3cbfb54263d02bb421c7f2adc136520729c2c689c1e5/dnspython-2.4.2-py3-none-any.whl.metadata
  Using cached dnspython-2.4.2-py3-none-any.whl.metadata (4.9 kB)
Using cached pymongo-4.5.0-cp311-cp311-macosx_10_9_universal2.whl (529 kB)
Using cached dnspython-2.4.2-py3-none-any.whl (300 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.4.2 pymongo-4.5.0


In [2]:
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

uri = ""  # paste your Atlas connection string here

# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [None]:
db = client.sample_mflix
collection = db.movies
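The empty `uri` string above has to be filled in with an Atlas connection string before the ping succeeds. A minimal sketch, assuming the string is kept in an environment variable instead of being pasted into the notebook (the variable name `MONGODB_URI` and the helper `load_mongodb_uri` are hypothetical and not part of the original notebook):

```python
import os


def load_mongodb_uri(var_name: str = "MONGODB_URI") -> str:
    """Return the connection string stored in the given environment variable."""
    uri = os.environ.get(var_name, "").strip()
    if not uri:
        # Fail early with a readable message instead of a driver error later on.
        raise RuntimeError(f"Set the {var_name} environment variable to your Atlas connection string")
    return uri
```

With this in place, `uri = load_mongodb_uri()` could replace the hard-coded assignment, and the rest of the connection cell stays the same.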

# Step 2: Set up the embedding creation function

## Embeddings from the Hugging Face API (rate-limited)

In [8]:
import requests

hf_token = ""  # your Hugging Face access token
embedding_url = "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2"

def generate_hf_embedding(text: str) -> list[float]:
    response = requests.post(
        embedding_url,
        headers={"Authorization": f"Bearer {hf_token}"},
        json={"inputs": text})

    if response.status_code != 200:
        raise ValueError(f"Request failed with status code {response.status_code}: {response.text}")

    return response.json()
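Because the hosted API has limits, a request can fail temporarily, and `generate_hf_embedding` then raises `ValueError`. A minimal sketch of one way to cope with that, assuming a retry with exponential backoff is acceptable (the helper `embed_with_retry` and its parameters are hypothetical; it retries on every `ValueError`, including ones that a retry cannot fix, such as a bad token):

```python
import time
from typing import Callable


def embed_with_retry(
    embed: Callable[[str], list[float]],
    text: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[float]:
    """Call embed(text), retrying with exponential backoff when it raises ValueError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return embed(text)
        except ValueError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Usage would look like `embed_with_retry(generate_hf_embedding, "MongoDB is awesome")`. The `sleep` parameter exists only so the waiting can be swapped out in tests.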

## Locally generate embeddings

In [21]:
from sentence_transformers import SentenceTransformer

# https://huggingface.co/obrizum/all-MiniLM-L6-v2
# device='mps' targets Apple Silicon GPUs; use 'cuda' or 'cpu' on other hardware
model = SentenceTransformer('obrizum/all-MiniLM-L6-v2', device='mps')

def generate_embedding(text: str) -> list[float]:
    return model.encode(text).tolist()

In [23]:
len(generate_embedding("MongoDB is awesome"))

384

# Step 3: Create and store embeddings

In [29]:
# Store the newly created vector embeddings in a field called "plot_embedding_hf"
count = 0
for doc in collection.find({'plot':{"$exists": True}}):
    if 'plot_embedding_hf' not in doc:
        embedding = generate_embedding(doc['plot'])
        # $set adds only the new field instead of rewriting the whole document
        collection.update_one({'_id': doc['_id']}, {'$set': {'plot_embedding_hf': embedding}})
    count += 1
    if count % 1000 == 0:
        print(count, "done")

print(f"All {count} finished!")

1000 done
2000 done
3000 done
4000 done
5000 done
6000 done
All 6750 finished!
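The loop above fetches every document that has a plot and checks for the embedding on the client. A sketch of a variant that pushes that check into the query, so a re-run only touches documents still missing the field (the function name `add_missing_embeddings` and its parameters are hypothetical; `collection` is assumed to be a pymongo collection and `embed` any function like `generate_embedding`):

```python
def add_missing_embeddings(collection, embed, field="plot_embedding_hf", text_field="plot"):
    """Embed documents that have text_field but no embedding yet; return how many were updated."""
    # Ask the server for only the documents still missing the embedding,
    # and fetch just the text field rather than the whole document.
    pending = collection.find(
        {text_field: {"$exists": True}, field: {"$exists": False}},
        {text_field: 1},
    )
    updated = 0
    for doc in pending:
        # $set writes only the new field and leaves the rest of the document untouched.
        collection.update_one({"_id": doc["_id"]}, {"$set": {field: embed(doc[text_field])}})
        updated += 1
    return updated
```

Calling `add_missing_embeddings(collection, generate_embedding)` a second time should then return 0, which makes the step safe to resume after an interruption.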


# Step 4: Create a vector search index

Now, head over to Atlas Search and create an index. First, click the “Search” tab on the cluster, then click “Create Search Index.”

1. Select the database and collection on the left. For this tutorial, it should be sample_mflix/movies.
2. Enter the Index Name. For this tutorial, we call it PlotSemanticSearch2.
3. Enter the configuration JSON (given below) into the text editor. The field name must match the name of the embedding field created in Step 3 (plot_embedding_hf in this tutorial), and the dimensions must match those of the chosen model (384 in this tutorial). The chosen value for the "similarity" field, “dotProduct”, is equivalent to cosine similarity only when the embeddings are normalized to unit length, which is assumed to hold for this model.

In [36]:
# {
#     "mappings": {
#         "dynamic": true,
#         "fields": {
#             "plot_embedding_hf": {
#                 "dimensions": 384,
#                 "similarity": "dotProduct",
#                 "type": "knnVector"
#             }
#         }
#     }
# }
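The relationship between “dotProduct” and cosine similarity mentioned in the steps above can be checked with plain Python. This is a small worked example on made-up two-dimensional vectors, not on real embeddings; the helper names are illustrative:

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: the dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]
print(dot(a, b), cosine(a, b))  # raw vectors: the two scores differ

ua, ub = normalize(a), normalize(b)
print(math.isclose(dot(ua, ub), cosine(a, b)))  # unit vectors: the scores agree
```

Whether the model used here already returns unit-length vectors is worth verifying, for example by computing the norm of one `generate_embedding` result. If it does not, `model.encode(text, normalize_embeddings=True)` in sentence-transformers is one way to request normalized output.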

# Step 5: Query the database

In [41]:
query = "imaginary characters from outer space at war"

In [88]:
# $vectorSearch failed on this deployment; see the error below
results = collection.aggregate([
    {
        "$vectorSearch": {
            "queryVector": generate_embedding(query),
            "path": "plot_embedding_hf",
            "numCandidates": 100,
            "limit": 4,
            "index": "PlotSemanticSearch2",
        }
    }
])

OperationFailure: $vectorSearch is not allowed or the syntax is incorrect, see the Atlas documentation for more information, full error: {'ok': 0, 'errmsg': '$vectorSearch is not allowed or the syntax is incorrect, see the Atlas documentation for more information', 'code': 8000, 'codeName': 'AtlasError'}
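The notebook does not diagnose this error. One possible cause, stated here as an assumption rather than a confirmed finding, is that `$vectorSearch` expects a dedicated Atlas Vector Search index instead of the `knnVector` field mapping from Step 4, and that it also needs a sufficiently recent cluster version. A sketch of what such an index definition might look like for this collection (the helper `vector_search_index_definition` is hypothetical; check the field names against the current Atlas documentation before relying on them):

```python
import json


def vector_search_index_definition(path: str, dimensions: int, similarity: str = "dotProduct") -> dict:
    """Build a definition for a vectorSearch-type index (not a knnVector field mapping)."""
    if similarity not in {"euclidean", "cosine", "dotProduct"}:
        raise ValueError(f"unsupported similarity: {similarity}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dimensions,
                "similarity": similarity,
            }
        ]
    }


# Print the JSON that would be pasted into the index editor.
print(json.dumps(vector_search_index_definition("plot_embedding_hf", 384), indent=2))
```

The `$search` / `knnBeta` query in the next cell works with the index created in Step 4, so it does not depend on this definition.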

In [80]:
# query top 10 nearest neighbors
results = collection.aggregate([
    {
        '$search': {
            "index": "PlotSemanticSearch2",
            "knnBeta": {
                "vector": generate_embedding(query),
                "k": 10,
                "path": "plot_embedding_hf"
                # the similarity function (euclidean | cosine | dotProduct)
                # is set in the index definition, not in the query
            }
        }
    }
])
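To see how close each match is, the relevance score can be returned next to the title and plot. A sketch that builds the same `knnBeta` query and adds a `$project` stage exposing the score through `{"$meta": "searchScore"}` (the helper `build_knn_pipeline` is hypothetical; the default index and field names are the ones used in this tutorial):

```python
def build_knn_pipeline(query_vector, k=10, index="PlotSemanticSearch2", path="plot_embedding_hf"):
    """Return a $search/knnBeta pipeline that also reports each hit's relevance score."""
    return [
        {
            "$search": {
                "index": index,
                "knnBeta": {"vector": query_vector, "k": k, "path": path},
            }
        },
        {
            # Keep only the fields printed later, plus the search score.
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                "score": {"$meta": "searchScore"},
            }
        },
    ]
```

It would be used as `results = collection.aggregate(build_knn_pipeline(generate_embedding(query)))`, after which each returned document carries a `score` field in addition to `title` and `plot`.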

In [81]:
# print queried movie titles and plots
for document in results:
    print(f'Movie Name: {document["title"]},\nMovie Plot: {document["plot"]}\n')

Movie Name: The Victors,
Movie Plot: Intelligent, sprawling saga of a squad of American soldiers, following them through Europe during World War II.

Movie Name: Space Odyssey: Voyage to the Planets,
Movie Plot: This two-part science fiction docu-drama examines the possibilities of a dangerous, manned space mission to explore the inner and outer planets of the Solar system.

Movie Name: Edge of Tomorrow,
Movie Plot: A military officer is brought into an alien war against an extraterrestrial enemy who can reset the day and know the future. When this officer is enabled with the same power, he teams up with a Special Forces warrior to try and end the war.

Movie Name: First Squad: The Moment of Truth,
Movie Plot: Set during the opening days of World War II on the Eastern Front. Its main cast are a group of Soviet teenagers with extraordinary abilities; the teenagers have been drafted to form a ...

Movie Name: The Fallen,
Movie Plot: The story of ordinary men during WWII as seen from thre