# MongoDB Vector Search - Create Embeddings - OpenAI - Existing Data

This notebook is a companion to the [Create Embeddings](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/) page. Refer to the page for set-up instructions and detailed explanations.

This notebook shows how to generate embeddings from **existing data in MongoDB** by using OpenAI's `text-embedding-3-small` model.

<a target="_blank" href="https://colab.research.google.com/github/mongodb/docs-notebooks/blob/main/create-embeddings/openai-existing-data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
%pip install --quiet --upgrade openai pymongo

## Use an Embedding Model

In [None]:
import os
from openai import OpenAI

# Specify your OpenAI API key and embedding model
os.environ["OPENAI_API_KEY"] = "<api-key>"
model = "text-embedding-3-small"
openai_client = OpenAI()

# Define a function to generate embeddings
def get_embedding(text):
   """Generates vector embeddings for the given text."""

   embedding = openai_client.embeddings.create(input=[text], model=model).data[0].embedding
   return embedding

# Generate an embedding
embedding = get_embedding("foo")
print(embedding)
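
The function above sends a one-element list per request. The embeddings endpoint also accepts several inputs in one call, so a batching helper can cut the number of round trips. The following is a minimal sketch, not part of the original notebook: `embed_in_batches` and `fake_embed` are hypothetical names, the batch size is arbitrary, and the stand-in embedder lets the cell run without an API key. It assumes the embedding call returns one vector per input, in input order.

```python
from typing import Callable, List, Sequence

def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding call: one toy 2-dimensional vector per text."""
    return [[float(len(text)), 1.0] for text in batch]

# With the client defined above, embed_fn could instead be (assumption, untested here):
#   lambda batch: [d.embedding for d in
#                  openai_client.embeddings.create(input=batch, model=model).data]
vectors = embed_in_batches(["foo", "beach house", "loft"], fake_embed, batch_size=2)
print(vectors)
```

Passing the embedding call in as a parameter keeps the batching logic independent of the client, which also makes it easy to test offline.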

### (Optional) Compress your embeddings

Optionally, run the following code to define a function that converts your embeddings into BSON `binData` vectors for [efficient storage and retrieval](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/#vector-compression).

In [None]:
from bson.binary import Binary, BinaryVectorDtype

# Define a function to generate BSON vectors
def generate_bson_vector(vector, vector_dtype):
   return Binary.from_vector(vector, vector_dtype)

# Generate BSON vector from the sample float32 embedding
bson_float32_embedding = generate_bson_vector(embedding, BinaryVectorDtype.FLOAT32)

# Print the converted embedding
print(f"The converted BSON embedding is: {bson_float32_embedding}")
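
To get a rough feel for why a float32 vector is more compact, the sketch below packs a 1536-dimension vector with Python's standard `struct` module and compares the raw payload sizes. This is an illustration only: it is not how `Binary.from_vector` is implemented, it ignores BSON headers and per-element array overhead, and `pack_float32` and `pack_float64` are hypothetical helper names. It also shows that float32 storage rounds values slightly.

```python
import struct

def pack_float32(vector):
    """Pack floats as little-endian 32-bit values (4 bytes per dimension)."""
    return struct.pack(f"<{len(vector)}f", *vector)

def pack_float64(vector):
    """Pack floats as little-endian 64-bit values (8 bytes per dimension)."""
    return struct.pack(f"<{len(vector)}d", *vector)

# Toy vector with the same dimensionality as text-embedding-3-small output
sample = [0.125 * (i % 8) for i in range(1536)]
size32 = len(pack_float32(sample))
size64 = len(pack_float64(sample))
print(f"float32 payload: {size32} bytes, float64 payload: {size64} bytes")

# float32 cannot represent 0.1 exactly, so a round trip changes the value slightly
roundtrip = struct.unpack("<f", struct.pack("<f", 0.1))[0]
print(f"0.1 stored as float32 reads back as {roundtrip!r}")
```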

## Generate Embeddings

In [None]:
import pymongo

# Connect to your MongoDB cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]

# Define a filter to exclude documents with missing, null, or empty 'summary' fields
# (named query_filter to avoid shadowing Python's built-in filter function)
query_filter = {"summary": {"$exists": True, "$nin": [None, ""]}}

# Get a subset of documents in the collection, returning only _id and summary
documents = collection.find(query_filter, {"_id": 1, "summary": 1}).limit(50)
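
To make the filter's intent concrete, here is a plain-Python sketch of the same condition applied to a few made-up dictionaries. The helper name `matches_summary_filter` and the sample documents are hypothetical; the server evaluates the real query, and this only mirrors its effect for these simple cases.

```python
def matches_summary_filter(doc):
    """Mimic {'summary': {'$exists': True, '$nin': [None, '']}} for a plain dict."""
    return "summary" in doc and doc["summary"] not in (None, "")

# Made-up documents covering each case the filter is meant to handle
sample_docs = [
    {"_id": 1, "summary": "Cozy loft near the beach"},  # kept
    {"_id": 2, "summary": ""},                          # empty string: excluded
    {"_id": 3, "summary": None},                        # null: excluded
    {"_id": 4},                                         # field missing: excluded
]

kept_ids = [doc["_id"] for doc in sample_docs if matches_summary_filter(doc)]
print(kept_ids)
```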

In [None]:
from pymongo import UpdateOne

# Generate the list of bulk write operations
operations = []
for doc in documents:
   summary = doc["summary"]
   # Generate embeddings for this document
   embedding = get_embedding(summary)

   # Uncomment the following line to convert to BSON vectors
   # embedding = generate_bson_vector(embedding, BinaryVectorDtype.FLOAT32)

   # Add the update operation to the list
   operations.append(UpdateOne(
      {"_id": doc["_id"]},
      {"$set": {
         "embedding": embedding
      }}
   ))

# Execute the bulk write operation
updated_doc_count = 0
if operations:
   result = collection.bulk_write(operations)
   updated_doc_count = result.modified_count

print(f"Updated {updated_doc_count} documents.")

## Index and Query Your Embeddings

In [None]:
from pymongo.operations import SearchIndexModel
import time

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
  definition = {
    "fields": [
      {
        "type": "vector",
        "path": "embedding",
        "similarity": "dotProduct",
        "numDimensions": 1536
      }
    ]
  },
  name="vector_index",
  type="vectorSearch"
)
result = collection.create_search_index(model=search_index_model)

# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
while True:
  indices = list(collection.list_search_indexes(result))
  if indices and indices[0].get("queryable") is True:
    break
  time.sleep(5)
print(f"{result} is ready for querying.")
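
The polling loop above runs until the index reports that it is queryable. If you would rather give up after a fixed time, a small helper along these lines is one option. This is a sketch with hypothetical names (`wait_until`, `timeout_seconds`, `interval_seconds`); the demonstration uses a stand-in check so it runs without a cluster.

```python
import time

def wait_until(check, timeout_seconds=120.0, interval_seconds=5.0):
    """Call check() until it returns a truthy value or the timeout elapses.

    Returns True if check() succeeded, False if the timeout was reached first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_seconds)

# Stand-in check that reports "ready" on the third call
states = iter([False, False, True])
ready = wait_until(lambda: next(states), timeout_seconds=1.0, interval_seconds=0.01)
print(ready)

# Against the collection above, the check could be (assumption, untested here):
#   lambda: any(ix.get("queryable") for ix in collection.list_search_indexes("vector_index"))
```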

In [None]:
# Generate embedding for the search query
query_embedding = get_embedding("beach house")

# Sample vector search pipeline
pipeline = [
   {
      "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "embedding",
            "exact": True,
            "limit": 5
      }
   }, 
   {
      "$project": {
         "_id": 0, 
         "summary": 1,
         "score": {
            "$meta": "vectorSearchScore"
         }
      }
   }
]

# Execute the search
results = collection.aggregate(pipeline)

# Print results
for i in results:
   print(i)
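
Because the pipeline sets `"exact": True`, the query compares the query vector against every indexed vector instead of using an approximate search. The sketch below imitates that idea in plain Python on toy 2-dimensional unit vectors. It is an illustration, not MongoDB's implementation: `exact_search` and the sample documents are hypothetical, and the score formula `(1 + dot) / 2` is an assumption about how dot-product scores are normalized into the 0 to 1 range, so check the `$vectorSearch` documentation before relying on it.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def exact_search(query, docs, limit=5):
    """Score every document against the query and return the top `limit` hits."""
    scored = [
        # Assumed normalization of the dot product into [0, 1] for unit vectors
        {"summary": d["summary"], "score": (1 + dot(query, d["embedding"])) / 2}
        for d in docs
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:limit]

# Toy unit-length vectors standing in for real 1536-dimension embeddings
toy_docs = [
    {"summary": "Beachfront cottage", "embedding": [1.0, 0.0]},
    {"summary": "Downtown studio", "embedding": [0.0, 1.0]},
    {"summary": "Seaside villa", "embedding": [0.6, 0.8]},
]

hits = exact_search([1.0, 0.0], toy_docs, limit=2)
for hit in hits:
    print(hit)
```

A full scan like this is simple and gives exact rankings, but its cost grows with the number of documents, which is the trade-off against approximate search.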
