# Atlas Vector Search - Create Embeddings - Open Source - Existing Data

This notebook is a companion to the [Create Embeddings](https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/) page. Refer to the page for set-up instructions and detailed explanations.

This notebook walks you through generating embeddings from **existing data in Atlas** by using the open-source ``nomic-embed-text-v1`` model. It also includes code to convert your embeddings to BSON binData vectors, which store and process vector data more compactly than plain arrays.

<a target="_blank" href="https://colab.research.google.com/github/mongodb/docs-notebooks/blob/main/create-embeddings/open-source-existing-data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
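
To give a rough sense of why the lower-precision vectors are cheaper to store, the following minimal sketch compares raw payload sizes for one 768-dimension vector at each precision used in this notebook. It uses only the standard library and placeholder values; the names `NUM_DIMENSIONS` and `payload_sizes` are illustrative, and the numbers exclude any binData header or document overhead.

```python
import struct

NUM_DIMENSIONS = 768  # same value as numDimensions in this notebook's index definition

# Placeholder values: only the element count matters for a size comparison
float_values = [0.0] * NUM_DIMENSIONS
int_values = [0] * NUM_DIMENSIONS

float32_bytes = struct.pack(f"<{NUM_DIMENSIONS}f", *float_values)  # 4 bytes per dimension
int8_bytes = struct.pack(f"<{NUM_DIMENSIONS}b", *int_values)       # 1 byte per dimension
packed_bit_bytes = bytes(NUM_DIMENSIONS // 8)                      # 1 bit per dimension

payload_sizes = {
    "float32": len(float32_bytes),
    "int8": len(int8_bytes),
    "packed bit": len(packed_bit_bytes),
}
for name, size in payload_sizes.items():
    print(f"{name}: {size} bytes")
```

The smaller payloads trade precision for space, which is why the notebook stores all three variants side by side for comparison.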

In [14]:
pip install --quiet --upgrade sentence-transformers pymongo einops



In [15]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Define a function to generate embeddings in multiple precisions
def get_embedding(data, precision="float32"):
    return model.encode(data, precision=precision)

<All keys matched successfully>


In [16]:
from bson.binary import Binary

# Generate BSON vector using `BinaryVectorDtype`
def generate_bson_vector(vector, vector_dtype):
    return Binary.from_vector(vector, vector_dtype)

In [17]:
# Function to create documents with BSON vector embeddings
def create_docs_with_bson_vector_embeddings(bson_float32, bson_int8, bson_int1, data):
    docs = []
    for i, (bson_f32_emb, bson_int8_emb, bson_int1_emb, text) in enumerate(zip(bson_float32, bson_int8, bson_int1, data)):
        doc = {
            "_id": i,
            "data": text,
            "BSON-Float32-Embedding": bson_f32_emb,
            "BSON-Int8-Embedding": bson_int8_emb,
            "BSON-Int1-Embedding": bson_int1_emb,
        }
        docs.append(doc)
    return docs

In [18]:
# Example generating embeddings for the strings "foo" and "bar"
data = ["foo", "bar"]
float32_embeddings = get_embedding(data, "float32")
int8_embeddings = get_embedding(data, "int8")
int1_embeddings = get_embedding(data, "ubinary")

print("Float32 Embedding:", float32_embeddings)
print("Int8 Embedding:", int8_embeddings)
print("Int1 Embedding (binary representation):", int1_embeddings)

Computing int8 quantization buckets based on 2 embeddings. int8 quantization is more stable with `ranges` calculated from more embeddings or a `calibration_embeddings` that can be used to calculate the buckets.


Float32 Embedding: [[-0.02980826  0.03841477 -0.0256112  ... -0.05328758 -0.0335409
  -0.02591544]
 [-0.0274888   0.03717752 -0.03104551 ...  0.02413219 -0.02402255
   0.02810649]]
Int8 Embedding: [[-128  127  127 ... -128 -128 -128]
 [ 127 -128 -128 ...  127  127  127]]
Int1 Embedding (binary representation): [[ 77  30   4 131  15 123 146 149 138  85 185   5  68 249 163  48 195 102
  163 228 197  90 123 195  11 102 161  28 134 245 101 222 106  58 216  22
  229 196 237  14 135 178 114 221  58 215 143  94 158  77 116 132 191 153
  209 158 230 173 132 249 134 156  35 235 233 148  34 149  22  56  27  18
  234  73 244 237  80 216 214  25 236 245 152 143  20  79  50  24  43 245
  159 142 205  23 119 120]
 [ 79  82 208 180  45  79 209 189 168 198  63 124 109 247  10 245 131 186
  169 199 129  74  49 251  63  86 160 156 140 205 243 198 138 248 252  71
  103 196 251 223 139 246 158  85  90 166 110 122  72 105   6 171 238 140
   73   6 243 240  47  11  96  80 119  64 205 230  34 210 130  60 136
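
The int8 output above contains only -128 and 127, which is what the quantization warning hints at: with just two embeddings, every value is either the minimum or the maximum of its dimension. The following is a simplified pure-Python sketch of per-dimension min/max scalar quantization that reproduces this effect. It is an illustration under that assumption, not the library's implementation, and `quantize_int8` is a hypothetical helper name.

```python
def quantize_int8(embeddings):
    """Map each dimension linearly from its [min, max] range onto [-128, 127]."""
    num_dims = len(embeddings[0])
    mins = [min(vec[d] for vec in embeddings) for d in range(num_dims)]
    maxs = [max(vec[d] for vec in embeddings) for d in range(num_dims)]

    quantized = []
    for vec in embeddings:
        row = []
        for d, value in enumerate(vec):
            value_range = maxs[d] - mins[d]
            if value_range == 0:
                # Zero range (for example, a single vector): no information to
                # scale by, so this sketch arbitrarily picks the lowest level.
                level = 0
            else:
                level = round((value - mins[d]) * 255 / value_range)
            row.append(level - 128)
        quantized.append(row)
    return quantized


# Two vectors: every value is the min or the max of its dimension
two_vectors = [[-0.0298, 0.0384], [-0.0275, 0.0372]]
print(quantize_int8(two_vectors))

# A third vector gives the ranges something to interpolate between
three_vectors = [[0.0, 1.0], [0.25, 0.0], [1.0, 0.25]]
print(quantize_int8(three_vectors))
```

Computing the ranges from a larger, representative set of embeddings spreads values across the full int8 range instead of collapsing them to the two extremes.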

In [19]:
from bson.binary import BinaryVectorDtype

bson_float32_embeddings = []
bson_int8_embeddings = []
bson_int1_embeddings = []

for (f32_emb, int8_emb, int1_emb) in zip(float32_embeddings, int8_embeddings, int1_embeddings):
    bson_float32_embeddings.append(generate_bson_vector(f32_emb, BinaryVectorDtype.FLOAT32))
    bson_int8_embeddings.append(generate_bson_vector(int8_emb, BinaryVectorDtype.INT8))
    bson_int1_embeddings.append(generate_bson_vector(int1_emb, BinaryVectorDtype.PACKED_BIT))

# Print the embeddings
print(f"The converted bson_float32_new_embedding is: {bson_float32_embeddings}")
print(f"The converted bson_int8_new_embedding is: {bson_int8_embeddings}")
print(f"The converted bson_int1_new_embedding is: {bson_int1_embeddings}")

The converted bson_float32_new_embedding is: [Binary(b'\'\x00v0\xf4\xbc\xcdX\x1d=\x92\xce\xd1\xbc\xb0^\x89\xbd\xfee\x1e=~\x0bO;\xb9\xa3K\xbd\x9eI\xc1<\x94\xb7\x8f\xbdt1\xb5\xbd\xc5\xd8e\xbd\x10-R=$3\xe2<\x00\x1b\x15=\xb5~\x94<\xae\x9fR\xbb\xbc\xe7\x89\xbb\x06\xc3\x0e\xbd\xbe\x1a\x80\xbc\x8f\xf66\xbd\x18\x1b\x95\xbb\x10\xe2J=\x9d\xde\x8d\xbc\xe8@\x8f\xbb\xea\xabR>\x17\xc3\xc9\xbc\xaf\xea\xa3\xbcQF<\xbbW\xddW\xbd\x93\xa9\x1a\xbc\x1fW@;#\xb6\xc4<U\xe6\xee\xbb\xb6\xceL\xbd\xe7\x1bJ\xbd}#\xb9\xbd\xa9\xc4s<\xe5\xdb\x1b=\x1b\x07\xe5<\xa6\xf1s<!o\xbf\xbct|\xf3<d\x02\xa2<\xf0\xec\x8e<\xdd\xa5X:\'\x13*\xbd\xc9\xa3\xa1<`8?=rOJ=\xb5v\xeb\xbcA\x1c\x98\xbc\xd3\xad\x9b9\xc3\xc1<\xbb\x81\x05}\xbd\x99F\'<\x81\xe3\xb9\xbb\xe0\x1cS<?\x84\xac\xbd\x1a\xa5\x96\xbc,\xc6\x16;\x88\xff!\xbd\r\xb5\x81<\x91\x85\xea\xbc)\x88\xd0<d\xde$<\xa8\x8a4\xbd"Q\xcb\xbc\\Il\xbc\x0e\xc7\x00<\x1c\x14^\xbdqHP=\xcdfq\xbc\x8a4\xed\xbc\x93\xb0\xb0<\xc0\x93\xf7\xbc\x84\xe5Y<D\x1f\x19\xbb/\x19\xee;1\xf2!\xbd\x08\xac\xe1<TMl<[\x87\xb
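
The `ubinary` embeddings converted with `PACKED_BIT` above hold 96 byte values per 768-dimension vector because each byte carries eight one-bit dimensions. The sketch below shows one way such packing can work, assuming each value is thresholded at zero and bits are packed most-significant-bit first (the default order of NumPy's `packbits`). `pack_bits` is a hypothetical helper written for illustration only.

```python
def pack_bits(values):
    """Threshold values at 0 and pack the resulting bits, 8 per byte, MSB first.

    Returns the packed bytes as a list of ints and the number of padding bits
    appended to fill the last byte.
    """
    bits = [1 if value > 0 else 0 for value in values]
    padding = (-len(bits)) % 8
    bits.extend([0] * padding)

    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed, padding


# Signs + - + + - - + + become bits 10110011
sample = [0.1, -0.2, 0.3, 0.4, -0.5, -0.6, 0.7, 0.8]
packed, padding = pack_bits(sample)
print(packed, padding)

# A 768-dimension vector packs into 96 bytes with no padding
packed_768, padding_768 = pack_bits([1.0] * 768)
print(len(packed_768), padding_768)
```

Because only the sign of each dimension survives, packed-bit vectors are the most compact and the least precise of the three representations.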

In [51]:
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")
db = mongo_client["sample_airbnb"]
collection = db["listingsAndReviews"]

# Define a filter to exclude documents with missing, null, or empty 'summary' fields
summary_filter = {'summary': {'$exists': True, '$nin': [None, '']}}

# Get a subset of documents in the collection
documents = collection.find(summary_filter, {'_id': 1, 'summary': 1}).limit(50)

In [50]:
from pymongo import UpdateOne

# Generate the list of bulk write operations
operations = []
for doc in documents:
    summary = doc["summary"]
    # Generate embeddings for this document
    float32_embeddings = get_embedding(summary, precision="float32")
    int8_embeddings = get_embedding(summary, precision="int8")
    int1_embeddings = get_embedding(summary, precision="ubinary")
    
    # Convert embeddings to BSON vectors
    bson_float32_embeddings = generate_bson_vector(float32_embeddings, BinaryVectorDtype.FLOAT32)
    bson_int8_embeddings = generate_bson_vector(int8_embeddings, BinaryVectorDtype.INT8)
    bson_int1_embeddings = generate_bson_vector(int1_embeddings, BinaryVectorDtype.PACKED_BIT)
    
    # Add the update operation to the list
    operations.append(UpdateOne(
        {"_id": doc["_id"]},
        {"$set": {
            "BSON-Float32-Embedding": bson_float32_embeddings,
            "BSON-Int8-Embedding": bson_int8_embeddings,
            "BSON-Int1-Embedding": bson_int1_embeddings
        }}
    ))

# Execute the bulk write operation
updated_doc_count = 0
if operations:
    result = collection.bulk_write(operations)
    updated_doc_count = result.modified_count

print(f"Updated {updated_doc_count} documents.")

Updated 50 documents.
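
The cell above collects all 50 update operations into a single list before writing. For a larger subset, one option is to split the operations into fixed-size batches and write each batch separately. The following is a minimal sketch of such a helper using only the standard library. `chunked` and the batch size are assumptions for illustration, and the strings stand in for real `UpdateOne` objects.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-ins for UpdateOne operations
pending_operations = [f"op-{i}" for i in range(7)]

batches = list(chunked(pending_operations, 3))
for batch in batches:
    # In the notebook, each batch could be passed to collection.bulk_write(batch)
    print(len(batch), batch)
```

Smaller batches keep memory use bounded and make it easier to retry a failed write without regenerating every embedding.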


In [None]:
from pymongo.operations import SearchIndexModel
import time

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
  definition = {
    "fields": [
      {
        "type": "vector",
        "path": "BSON-Float32-Embedding",
        "similarity": "dotProduct",
        "numDimensions": 768
      },
      {
        "type": "vector",
        "path": "BSON-Int8-Embedding",
        "similarity": "dotProduct",
        "numDimensions": 768
      },
      {
        "type": "vector",
        "path": "BSON-Int1-Embedding",
        "similarity": "euclidean",
        "numDimensions": 768
      }
    ]
  },
  name="vector_index",
  type="vectorSearch",
)
result = collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")

# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
predicate = lambda index: index.get("queryable") is True

while True:
  indices = list(collection.list_search_indexes(result))
  if indices and predicate(indices[0]):
    break
  time.sleep(5)
print(result + " is ready for querying.")
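
The polling loop above runs until the index reports that it is queryable, so it never exits if the index build fails. A bounded variant is sketched below as a generic helper. `wait_until`, its parameters, and the fake readiness check are hypothetical names introduced for illustration, not part of PyMongo.

```python
import time


def wait_until(check, timeout_seconds=120, interval_seconds=5):
    """Call `check` repeatedly until it returns True or the timeout elapses.

    Returns True if `check` succeeded, False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_seconds)


# Stand-in for an index status check that succeeds on the third poll
attempts = {"count": 0}

def fake_index_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3

ready = wait_until(fake_index_ready, timeout_seconds=1, interval_seconds=0.01)
print("ready:", ready, "after", attempts["count"], "polls")
```

In this notebook, `check` could wrap the existing status lookup, for example a function that returns whether any index from `collection.list_search_indexes(result)` has `queryable` set to `True`.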

In [None]:
# Prepare your query
query_text = "beach house"

# Generate embedding for the search query
query_float32_embeddings = get_embedding(query_text, precision="float32")
query_int8_embeddings = get_embedding(query_text, precision="int8")
query_int1_embeddings = get_embedding(query_text, precision="ubinary")

# Convert each embedding to BSON format
query_bson_float32_embeddings = generate_bson_vector(query_float32_embeddings, BinaryVectorDtype.FLOAT32)
query_bson_int8_embeddings = generate_bson_vector(query_int8_embeddings, BinaryVectorDtype.INT8)
query_bson_int1_embeddings = generate_bson_vector(query_int1_embeddings, BinaryVectorDtype.PACKED_BIT)

# Define vector search pipeline for each precision
pipelines = []
for query_embedding, path in zip(
    [query_bson_float32_embeddings, query_bson_int8_embeddings, query_bson_int1_embeddings],
    ["BSON-Float32-Embedding", "BSON-Int8-Embedding", "BSON-Int1-Embedding"]
):
    pipeline = [
       {
          "$vectorSearch": {
                "index": "vector_index",  # Adjust if necessary
                "queryVector": query_embedding,
                "path": path,
                "exact": True,
                "limit": 5
          }
       },
       {
          "$project": {
             "_id": 0,
             "summary": 1,
             "score": {
                "$meta": "vectorSearchScore"
             }
          }
       }
    ]
    pipelines.append(pipeline)

# Execute the search for each precision
for pipeline in pipelines:
    print(f"\nResults for {pipeline[0]['$vectorSearch']['path']}:")
    results = collection.aggregate(pipeline)
    
    # Print results
    for i in results:
        print(i)
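
One way to compare the three result sets is to measure how many of the float32 results also appear in the int8 and int1 results. The sketch below computes that overlap for lists of result identifiers. Because the pipeline above projects out `_id`, it assumes the `summary` text is used as the identifier. `overlap_at_k` and the sample values are hypothetical and do not come from an actual query.

```python
def overlap_at_k(baseline, candidate):
    """Return the fraction of baseline results that also appear in candidate."""
    if not baseline:
        return 0.0
    return len(set(baseline) & set(candidate)) / len(baseline)


# Made-up identifiers standing in for the `summary` values of each result set
float32_results = ["a", "b", "c", "d", "e"]
int8_results = ["a", "b", "c", "e", "f"]
int1_results = ["a", "c", "g", "h", "i"]

print("int8 overlap:", overlap_at_k(float32_results, int8_results))
print("int1 overlap:", overlap_at_k(float32_results, int1_results))
```

A higher overlap suggests the quantized vectors preserve more of the float32 ranking for that query. A single query is only a spot check, not a measure of overall retrieval quality.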