# Optimizing Vector Database Performance: Reducing Retrieval Latency with Quantization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/techniques/automatic_quantization_of_nomic_emebddings_with_mongodb.ipynb)

---



**Summary**

This notebook explores techniques for optimizing vector database performance, focusing on reducing retrieval latency through quantization. We examine the practical application of three embedding types:
- float32_embedding
- int8_embedding
- binary_embedding

We analyze their impact on query precision and retrieval speed.

By leveraging quantization strategies like scalar and binary quantization, we highlight the trade-offs between precision and efficiency.

The notebook also includes a step-by-step demonstration of executing vector searches, measuring retrieval latencies, and visualizing results in a comparative framework.







**Use Case:**

The notebook demonstrates how to optimize vector database performance, specifically focusing on reducing retrieval latency using quantization methods.

**Scenario**:
You have a large dataset of text data (in this case, a book from Gutenberg) and you want to build a system that can efficiently find similar pieces of text based on a user's query.

**Approach**:
- Embeddings: The notebook uses SentenceTransformer to convert text into numerical vectors (embeddings) which capture the semantic meaning of the text.
- Vector Database: MongoDB is used as a vector database to store and search these embeddings efficiently.
- Quantization: To speed up retrieval, the notebook applies quantization techniques (scalar and binary) to the embeddings. This reduces the size of the embeddings, making searches faster but potentially impacting precision.

**Goal**:
By comparing the performance of different embedding types (float32, int8, binary), the notebook aims to show the trade-offs between retrieval speed and accuracy when using quantization. This helps in choosing the best approach for a given use case.
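
To make the size reduction concrete, the following is a minimal, pure-Python sketch of scalar (int8) and binary quantization. It is a simplified illustration under stated assumptions (per-dimension min/max scaling, a sign-based bit per dimension, random stand-in vectors) and is not a description of how MongoDB Atlas implements automatic quantization internally. The helper names `scalar_quantize` and `binary_quantize` are hypothetical.

```python
import random
import struct

def scalar_quantize(vectors):
    """Map each dimension's observed [min, max] range onto int8 levels (-128..127)."""
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    quantized = []
    for v in vectors:
        row = []
        for d in range(dims):
            span = (highs[d] - lows[d]) or 1.0  # avoid division by zero
            row.append(round((v[d] - lows[d]) / span * 255) - 128)
        quantized.append(row)
    return quantized

def binary_quantize(vector):
    """Keep one bit per dimension (1 if the value is positive), packed into bytes."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            packed[i // 8] |= 1 << (7 - i % 8)
    return bytes(packed)

# Random stand-in vectors with the same dimensionality used later in the notebook
random.seed(0)
dims = 768
vectors = [[random.uniform(-1, 1) for _ in range(dims)] for _ in range(4)]

# Bytes needed to store one vector in each representation
float32_bytes = len(struct.pack(f"{dims}f", *vectors[0]))
int8_bytes = len(struct.pack(f"{dims}b", *scalar_quantize(vectors)[0]))
binary_bytes = len(binary_quantize(vectors[0]))
print(float32_bytes, int8_bytes, binary_bytes)  # 3072 768 96
```

For a 768-dimensional vector, int8 storage is a quarter of the float32 size and binary storage is 1/32 of it, which is where the potential speed gain and the potential precision loss both come from.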

## Step 1: Install Libraries

Here's a breakdown of the libraries and their roles:

- **unstructured**: This library is used to process and structure various data formats, including text, enabling efficient analysis and extraction of information.
- **pymongo**: This library provides the tools necessary to interact with MongoDB, allowing for storage and retrieval of data within the project.
- **nomic**: This library is used for vector embedding and other functions related to Nomic AI's models, specifically for generating and working with text embeddings.
- **pandas**: This popular library is used for data manipulation and analysis, providing data structures and functions for efficient data handling and exploration.
- **sentence_transformers**: This library is used for generating embeddings for text data using the SentenceTransformer model.

By installing these packages, the code sets up the tools necessary for data processing, embedding generation, and storage with MongoDB.


In [None]:
%pip install --quiet -U unstructured pymongo nomic pandas sentence_transformers

In [2]:
import os
import getpass

# Function to securely get and set environment variables
def set_env_securely(var_name, prompt):
  value = getpass.getpass(prompt)
  os.environ[var_name] = value

## Step 2: Data Loading and Preparation

**Dataset Information**

The dataset used in this example is "Pushing to the Front," an ebook from Project Gutenberg. This book, focusing on self-improvement and success, is freely available for public use.

The code leverages the ```unstructured``` library to process this raw text data, transforming it into a structured format suitable for semantic analysis and search. By chunking the text based on titles, the code creates meaningful units that can be embedded and stored in a vector database for efficient retrieval. This preprocessing is essential for building a robust and performant semantic search system.



The code below uses the ```requests``` library to fetch the text content of the book "Pushing to the Front" from Project Gutenberg's website. The URL points to the raw text file of the book.

In [37]:
import requests

url = "https://www.gutenberg.org/cache/epub/21291/pg21291.txt"
response = requests.get(url)
response.raise_for_status()
book_text = response.text
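
The raw file also contains Project Gutenberg's license header and footer, which end up as searchable chunks unless they are removed. The notebook itself does not strip them; the sketch below shows one optional way to do so, assuming the file carries the `*** START OF ...` and `*** END OF ...` marker lines that Project Gutenberg plain-text files commonly use. The function name `strip_gutenberg_boilerplate` and the sample text are hypothetical.

```python
import re

def strip_gutenberg_boilerplate(text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    If either marker is missing, the text is returned unchanged.
    """
    start = re.search(r"^\*\*\*\s*START OF.*$", text, flags=re.MULTILINE)
    end = re.search(r"^\*\*\*\s*END OF.*$", text, flags=re.MULTILINE)
    if not start or not end or end.start() < start.end():
        return text
    return text[start.end():end.start()].strip()

# Tiny stand-in for the downloaded book text
sample = (
    "The Project Gutenberg eBook of Pushing to the Front\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK PUSHING TO THE FRONT ***\n"
    "CHAPTER I\nThe man and the opportunity.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK PUSHING TO THE FRONT ***\n"
    "License text follows.\n"
)
body_text = strip_gutenberg_boilerplate(sample)
print(body_text)
```

In the notebook, the same call could be applied to `book_text` before cleaning, if the license text should be excluded from search results.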

Data Cleaning: The ```unstructured``` library is used to clean and structure the raw text. The ```group_broken_paragraphs``` function helps in combining fragmented paragraphs, ensuring better text flow.



In [38]:
from unstructured.partition.text import partition_text
from unstructured.cleaners.core import group_broken_paragraphs

cleaned_text = group_broken_paragraphs(book_text)

parsed_sections = partition_text(text=cleaned_text)
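
To illustrate what "combining fragmented paragraphs" means, here is a simplified pure-Python sketch of the idea: blank lines separate paragraphs, and single line breaks inside a paragraph are replaced by spaces. This is only an approximation for illustration, not the `unstructured` library's implementation; `join_broken_lines` is a hypothetical name.

```python
import re

def join_broken_lines(text):
    """Merge hard-wrapped lines; blank lines are treated as paragraph separators."""
    paragraphs = re.split(r"\n\s*\n", text)
    joined = [
        " ".join(line.strip() for line in paragraph.splitlines() if line.strip())
        for paragraph in paragraphs
    ]
    return "\n\n".join(p for p in joined if p)

wrapped = (
    "The world makes way for\nthe determined man.\n\n"
    "Opportunities come to\nthose who prepare."
)
unwrapped = join_broken_lines(wrapped)
print(unwrapped)
```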

The ```partition_text``` function further processes the cleaned text, dividing it into logical sections. These sections could represent chapters, sub-sections, or other meaningful units within the book.

In [39]:
# Show the first 5 sections
for text in parsed_sections[:5]:
  print(text)
  print("\n")

The Project Gutenberg eBook of Pushing to the Front


This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.


Title: Pushing to the Front


Author: Orison Swett Marden


Release date: May 4, 2007 [eBook #21291]




Chunking by Title: The ```chunk_by_title``` function identifies titles or headings within the parsed sections and uses them to create distinct chunks of text. This step is crucial for organizing the data into manageable units for subsequent embedding generation and semantic search.



In [40]:
from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(parsed_sections)
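
The grouping idea behind title-based chunking can be sketched in a few lines of plain Python. The example below assumes each parsed element has already been labeled as a title or as body text; the `(kind, text)` tuples and the `group_by_title` helper are hypothetical stand-ins. The real `chunk_by_title` function also applies size limits and other rules that this sketch ignores.

```python
def group_by_title(elements):
    """Start a new chunk at every title element; append other text to the current chunk."""
    chunks = []
    for kind, text in elements:
        if kind == "title" or not chunks:
            chunks.append([text])
        else:
            chunks[-1].append(text)
    return ["\n\n".join(parts) for parts in chunks]

# Hypothetical labeled elements standing in for parsed_sections
elements = [
    ("title", "CHAPTER I"),
    ("body", "First paragraph of chapter one."),
    ("body", "Second paragraph of chapter one."),
    ("title", "CHAPTER II"),
    ("body", "First paragraph of chapter two."),
]
grouped = group_by_title(elements)
for chunk in grouped:
    print(repr(chunk))
```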

In [41]:
for chunk in chunks:
  print(chunk)
  break

The Project Gutenberg eBook of Pushing to the Front

This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.


## Step 3: Embeddings Generation

In [42]:
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

embedding_model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)

# Determine the maximum sequence length for the model
max_seq_length = embedding_model.max_seq_length

def chunk_text(text, tokenizer, max_length=8192, overlap=50):
    """
    Split the text into overlapping chunks based on token length.
    """
    tokens = tokenizer.tokenize(text)
    chunks = []
    for i in range(0, len(tokens), max_length - overlap):
        chunk_tokens = tokens[i:i + max_length]
        chunk = tokenizer.convert_tokens_to_string(chunk_tokens)
        chunks.append(chunk)
    return chunks

def get_embedding(text, task_prefix):
    """
    Generate embeddings for a text string with a task-specific prefix.
    """

    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    # Prepend the task instruction prefix to the text
    prefixed_text = f"{task_prefix}: {text}"

    # Get the tokenizer from the model
    tokenizer = embedding_model.tokenizer

    # Split text into chunks if it's too long
    chunks = chunk_text(prefixed_text, tokenizer, max_length=max_seq_length)

    # Embed each chunk
    chunk_embeddings = embedding_model.encode(chunks)

    # Return only the first chunk's embedding as a list; any later chunks are ignored
    return chunk_embeddings[0].tolist()
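
The overlapping windows produced by `chunk_text` are easier to see with small numbers. The sketch below mirrors the `range(0, len(tokens), max_length - overlap)` loop but returns index pairs instead of text; `window_bounds` is a hypothetical helper used only for illustration.

```python
def window_bounds(num_tokens, max_length, overlap):
    """Return (start, end) token indices for each window, mirroring the loop in chunk_text."""
    step = max_length - overlap
    return [(i, min(i + max_length, num_tokens)) for i in range(0, num_tokens, step)]

bounds = window_bounds(num_tokens=20, max_length=8, overlap=2)
print(bounds)  # [(0, 8), (6, 14), (12, 20), (18, 20)]
```

Each window starts `max_length - overlap` tokens after the previous one, so neighbouring windows share `overlap` tokens. Note that `get_embedding` returns only the embedding of the first window, so text beyond the model's maximum sequence length does not contribute to the stored vector.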




Embedding generation might take approximately 20 minutes or longer; the run shown below took about 36 minutes.

In [43]:
from tqdm import tqdm

# Pass chunks into embedding function with a progress bar
embeddings = []
# If you don't want to embed the entire document, simply slice the chunks,
# e.g., for chunk in tqdm(chunks[:20], desc="Generating embeddings")
for chunk in tqdm(chunks, desc="Generating embeddings"):
    embedding = get_embedding(str(chunk), task_prefix="search_document")
    embeddings.append(embedding)
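
The loop above encodes one chunk per call. One possible way to reduce overhead is to pass several texts to the encoder at once. The sketch below shows the batching pattern with a stand-in encoder so that it runs without downloading a model; `batched`, `embed_in_batches`, and `fake_encode` are hypothetical names, and no speed-up has been measured here. With the notebook's model, `encode_fn` could be `embedding_model.encode`, which accepts a list of strings.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, encode_fn, batch_size=32, task_prefix="search_document"):
    """Prefix each text with the task instruction and encode the texts batch by batch."""
    embeddings = []
    for batch in batched(texts, batch_size):
        prefixed = [f"{task_prefix}: {text}" for text in batch]
        embeddings.extend(encode_fn(prefixed))
    return embeddings

def fake_encode(texts):
    """Stand-in encoder: returns a one-dimensional 'embedding' per text."""
    return [[float(len(text))] for text in texts]

batch_vectors = embed_in_batches(["alpha", "beta", "gamma"], fake_encode, batch_size=2)
print(len(batch_vectors))  # 3
```

This sketch skips the empty-text check and the long-text handling in `get_embedding`, so those would need to be added back before using it on the real chunks.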


Generating embeddings: 100%|██████████| 4135/4135 [36:22<00:00,  1.89it/s]


In [44]:
# Store the embedding data alongside the chunk, so a data point is {chunk: "text", embedding: "embedding"}
embedding_data = []
for chunk, embedding in zip(chunks, embeddings):
  embedding_data.append({
      "chunk": chunk.text,
      "float32_embedding": embedding,
      "int8_embedding": embedding,
      "binary_embedding": embedding
  })

In [48]:
# Convert the embedding data to a Pandas dataframe
import pandas as pd

dataset_df = pd.DataFrame(embedding_data)

When you inspect the dataset, you will see that the float32_embedding, int8_embedding, and binary_embedding attributes all hold the same values.

In downstream steps, the int8_embedding and binary_embedding values of each data point are quantized to their respective data types by MongoDB Atlas's automatic quantization feature, as configured in the vector search index definition in Step 6.

In [47]:
dataset_df.head()

| | chunk | float32_embedding | int8_embedding | binary_embedding |
|---|---|---|---|---|
| 0 | The Project Gutenberg eBook of Pushing to the... | [0.6708638072013855, 1.6244568824768066, -3.93... | [0.6708638072013855, 1.6244568824768066, -3.93... | [0.6708638072013855, 1.6244568824768066, -3.93... |
| 1 | Title: Pushing to the Front\n\nAuthor: Orison ... | [0.40157243609428406, 0.9250301718711853, -3.8... | [0.40157243609428406, 0.9250301718711853, -3.8... | [0.40157243609428406, 0.9250301718711853, -3.8... |
| 2 | SAN JOSE\n\nCOPYRIGHT, 1911,\n\nBy ORISON SWET... | [0.8498769402503967, 1.2074620723724365, -4.07... | [0.8498769402503967, 1.2074620723724365, -4.07... | [0.8498769402503967, 1.2074620723724365, -4.07... |
| 3 | It has sent thousands of youths, with renewed ... | [0.21243047714233398, 0.9721435308456421, -3.2... | [0.21243047714233398, 0.9721435308456421, -3.2... | [0.21243047714233398, 0.9721435308456421, -3.2... |
| 4 | The author has received thousands of letters f... | [0.18898999691009521, 1.158793330192566, -3.47... | [0.18898999691009521, 1.158793330192566, -3.47... | [0.18898999691009521, 1.158793330192566, -3.47... |


## Step 4: MongoDB (Operational and Vector Database)

MongoDB acts as both an operational and vector database for the RAG system.
MongoDB Atlas specifically provides a database solution that efficiently stores, queries and retrieves vector embeddings.

Creating a database and collection within MongoDB is made simple with MongoDB Atlas.

1. First, register for a [MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register). For existing users, sign into MongoDB Atlas.
2. [Follow the instructions](https://www.mongodb.com/docs/atlas/tutorial/deploy-free-tier-cluster/). Select Atlas UI as the procedure to deploy your first cluster.

Follow MongoDB’s [steps to get the connection string](https://www.mongodb.com/docs/manual/reference/connection-string/) from the Atlas UI. After setting up the database and obtaining the Atlas cluster connection URI, securely store the URI within your development environment.
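
Before storing the URI, a quick sanity check can catch a mistyped or truncated value. The sketch below uses only the standard library to confirm the scheme and to report the host without printing credentials. The helper name `describe_mongo_uri` and the example URI are hypothetical placeholders, not a real cluster.

```python
from urllib.parse import urlsplit

def describe_mongo_uri(uri):
    """Check the URI scheme and return non-sensitive details about the connection string."""
    parts = urlsplit(uri)
    if parts.scheme not in ("mongodb", "mongodb+srv"):
        raise ValueError("Expected a URI starting with mongodb:// or mongodb+srv://")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "has_credentials": parts.username is not None,
    }

# Placeholder URI for illustration only
example_uri = "mongodb+srv://user:secret@cluster0.example.mongodb.net/?retryWrites=true"
uri_info = describe_mongo_uri(example_uri)
print(uri_info)
```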


In [14]:
# Set MongoDB URI
set_env_securely("MONGO_URI", "Enter your MONGO URI: ")

Enter your MONGO URI: ··········


In [82]:
import pymongo

def get_mongo_client(mongo_uri):
  """Establish and validate connection to the MongoDB."""

  client = pymongo.MongoClient(mongo_uri, appname="devrel.showcase.quantized_embeddings_nomic.python")

  # Validate the connection
  ping_result = client.admin.command('ping')
  if ping_result.get('ok') == 1.0:
    # Connection successful
    print("Connection to MongoDB successful")
    return client
  else:
    print("Connection to MongoDB failed")
  return None

MONGO_URI = os.environ.get('MONGO_URI')  # .get() returns None instead of raising KeyError when unset
if not MONGO_URI:
  print("MONGO_URI not set in environment variables")

In [83]:
from pymongo.errors import CollectionInvalid

mongo_client = get_mongo_client(MONGO_URI)

DB_NAME = "career_coach"
COLLECTION_NAME = "pushing_to_the_front_orison_quantized"

# Create or get the database
db = mongo_client[DB_NAME]

# Check if the collection exists
if COLLECTION_NAME not in db.list_collection_names():
    try:
        # Create the collection
        db.create_collection(COLLECTION_NAME)
        print(f"Collection '{COLLECTION_NAME}' created successfully.")
    except CollectionInvalid as e:
        print(f"Error creating collection: {e}")
else:
    print(f"Collection '{COLLECTION_NAME}' already exists.")

# Assign the collection
collection = db[COLLECTION_NAME]


Connection to MongoDB successful
Collection 'pushing_to_the_front_orison_quantized' created successfully.


## Step 5: Data Ingestion

In [84]:
collection.delete_many({})

DeleteResult({'n': 0, 'electionId': ObjectId('7fffffff0000000000000039'), 'opTime': {'ts': Timestamp(1734322581, 1), 't': 57}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1734322581, 1), 'signature': {'hash': b'\xd3\xaeF\x07\xd0\xf6\xd7\xe9C;\xfe\x88\xed\x13\\\xd1\x17\x05\x96\xcc', 'keyId': 7390008424139849730}}, 'operationTime': Timestamp(1734322581, 1)}, acknowledged=True)

In [85]:
documents = dataset_df.to_dict('records')
collection.insert_many(documents)

print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed
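
Because `get_embedding` returns an empty list for empty text, some documents could carry an embedding that does not match the 768 dimensions declared in the index definition. A small check like the one below, run before `insert_many`, may help spot such records. It is a sketch: `find_invalid_documents` is a hypothetical helper, and the tiny documents in the example use 3 dimensions purely for readability.

```python
def find_invalid_documents(documents, fields, expected_dim=768):
    """Return (index, field) pairs whose embedding is missing, empty, or has the wrong length."""
    problems = []
    for i, doc in enumerate(documents):
        for field in fields:
            vector = doc.get(field)
            if not isinstance(vector, list) or len(vector) != expected_dim:
                problems.append((i, field))
    return problems

# Hypothetical records: the second one has an empty embedding
sample_docs = [
    {"chunk": "a", "float32_embedding": [0.1, 0.2, 0.3]},
    {"chunk": "b", "float32_embedding": []},
]
invalid = find_invalid_documents(sample_docs, ["float32_embedding"], expected_dim=3)
print(invalid)  # [(1, 'float32_embedding')]
```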


## Step 6: Vector Search Index Creation

In [86]:
import time
from pymongo.operations import SearchIndexModel

def setup_vector_search_index(collection, index_definition, index_name="vector_index"):
  """
  Set up a vector search index for a MongoDB collection and wait for 60 seconds.

  Args:
  collection: MongoDB collection object
  index_definition: Dictionary containing the index definition
  index_name: Name of the index (default: "vector_index")
  """
  new_vector_search_index_model = SearchIndexModel(
      definition=index_definition,
      name=index_name,
      type="vectorSearch"
  )

  # Create the new index
  try:
    result = collection.create_search_index(model=new_vector_search_index_model)
    print(f"Creating index '{index_name}'...")

    # Sleep for 60 seconds
    print(f"Waiting for 60 seconds to allow index '{index_name}' to be created...")
    time.sleep(60)

    print(f"60-second wait completed for index '{index_name}'.")
    return result

  except Exception as e:
    print(f"Error creating new vector search index '{index_name}': {str(e)}")
    return None
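
A fixed 60-second sleep may be longer than needed for a small collection or too short for a large one. As an alternative sketch, the function below polls `collection.list_search_indexes(index_name)` until the index reports itself as queryable. It assumes that each returned index document includes a boolean `queryable` field; the function name `wait_until_queryable` and its timeout defaults are hypothetical choices.

```python
import time

def wait_until_queryable(collection, index_name, timeout_s=300, poll_interval_s=5):
    """Poll the search index until it is queryable; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        indexes = list(collection.list_search_indexes(index_name))
        # Assumption: the index document exposes a boolean "queryable" field
        if indexes and indexes[0].get("queryable"):
            return True
        time.sleep(poll_interval_s)
    return False
```

In the notebook, this could be called after `collection.create_search_index(...)` in place of `time.sleep(60)`, for example `wait_until_queryable(collection, "vector_index")`.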

In [98]:
def create_vector_index_definition():
    """
    Create a vector index definition with predefined quantization methods.

    This function defines vector index fields with specific paths, dimensionalities,
    and similarity metrics. It includes support for quantization methods:
    - "scalar" quantization is applied to the "int8_embedding" field.
    - "binary" quantization is applied to the "binary_embedding" field.
    - No quantization is applied to the "float32_embedding" field.

    Returns:
      dict: A dictionary containing the vector index definition, including
      fields with their respective paths, quantization methods, dimensions,
      and similarity measures.
    """

    # Define the field types
    base_fields = [
        {
            "type": "vector",
            "path": "float32_embedding",
            "numDimensions": 768,
            "similarity": "cosine"
        },
        {
            "type": "vector",
            "path": "int8_embedding",
            "quantization": "scalar",
            "numDimensions": 768,
            "similarity": "cosine"
        },
        {
            "type": "vector",
            "path": "binary_embedding",
            "quantization": "binary",
            "numDimensions": 768,
            "similarity": "euclidean"
        }
    ]

    return {
        "fields": base_fields
    }

In [88]:
vector_index_definition = create_vector_index_definition()

In [89]:
print(vector_index_definition)

{'fields': [{'type': 'vector', 'path': 'float32_embedding', 'numDimensions': 768, 'similarity': 'cosine'}, {'type': 'vector', 'path': 'int8_embedding', 'quantization': 'scalar', 'numDimensions': 768, 'similarity': 'cosine'}, {'type': 'vector', 'path': 'binary_embedding', 'quantization': 'binary', 'numDimensions': 768, 'similarity': 'cosine'}]}


In [90]:
setup_vector_search_index(collection, vector_index_definition, "vector_index")

Creating index 'vector_index'...
Waiting for 60 seconds to allow index 'vector_index' to be created...
60-second wait completed for index 'vector_index'.


'vector_index'

## Step 7: Vector Search Operation

In [99]:
def custom_vector_search(user_query, collection, embedding_path, vector_search_index_name="vector_index"):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
        user_query (str): The user's query string.
        collection (MongoCollection): The MongoDB collection to search.
        embedding_path (str): The path of the embedding field in the documents.
        vector_search_index_name (str): The name of the vector search index.

    Returns:
        tuple: A list of matching documents and the execution time in milliseconds.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query, task_prefix="search_query")

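    # get_embedding returns an empty list for empty text, so test for emptiness;
    # raise instead of returning a string so callers can always unpack a tuple
    if not query_embedding:
      raise ValueError("Invalid query or embedding generation failed.")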

    # Define the vector search stage
    vector_search_stage = {
        "$vectorSearch": {
            "index": vector_search_index_name,  # Specifies the index to use for the search
            "queryVector": query_embedding,  # The vector representing the query
            "path": embedding_path,  # Field in the documents containing the vectors to search against
            "numCandidates": 6,  # Number of candidate matches to consider
            "limit": 5  # Return top 5 matches
        }
    }

    project_stage = {
        "$project": {
            "_id": 0,  # Exclude the _id field
            "chunk": 1,
            "score": {
                "$meta": "vectorSearchScore"  # Include the search score
            }
        }
    }

    # Define the aggregate pipeline with the vector search stage and additional stages
    pipeline = [vector_search_stage, project_stage]

    # Execute the explain command
    explain_result = collection.database.command(
      'explain',
      {
        'aggregate': collection.name,
        'pipeline': pipeline,
        'cursor': {}
      },
      verbosity='executionStats'
    )

    # Extract the execution time
    vector_search_explain = explain_result['stages'][0]['$vectorSearch']
    execution_time_ms = vector_search_explain['explain']['query']['stats']['context']['millisElapsed']


    # Execute the actual query
    results = list(collection.aggregate(pipeline))

    return results, execution_time_ms
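
The search above uses `numCandidates: 6` with `limit: 5`, which leaves the approximate search very little room to find good matches, especially on quantized fields. A common approach is to request many more candidates than results. The sketch below derives `numCandidates` from `limit` with an oversampling factor; the factor of 20 and the cap of 10,000 are assumptions based on MongoDB's published guidance and should be checked against the current documentation. `build_vector_search_stage` is a hypothetical helper.

```python
def build_vector_search_stage(index_name, path, query_vector, limit=5, oversample=20):
    """Build a $vectorSearch stage whose numCandidates scales with the requested limit."""
    # Assumption: numCandidates must be at least limit and at most 10,000
    num_candidates = min(max(limit * oversample, limit), 10_000)
    return {
        "$vectorSearch": {
            "index": index_name,
            "queryVector": query_vector,
            "path": path,
            "numCandidates": num_candidates,
            "limit": limit,
        }
    }

stage = build_vector_search_stage("vector_index", "int8_embedding", [0.1, 0.2, 0.3])
print(stage["$vectorSearch"]["numCandidates"])  # 100
```

A larger `numCandidates` usually improves recall at the cost of latency, so it is worth holding it constant across the three embedding types when comparing them.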


In [100]:
def run_vector_search_operations(user_query, collection, vector_search_index_name="vector_index"):
    """
    Run vector search operations for different embedding paths and store results in a DataFrame.

    Args:
        user_query (str): The user's query string.
        collection (MongoCollection): The MongoDB collection to search.
        vector_search_index_name (str): The name of the vector search index.

    Returns:
        pd.DataFrame: A DataFrame containing precision, retrieval latency, and query results.
    """
    embedding_paths = ["float32_embedding", "int8_embedding", "binary_embedding"]
    results_data = []

    for path in embedding_paths:
        # Perform vector search
        try:
            results, execution_time_ms = custom_vector_search(
                user_query=user_query,
                collection=collection,
                embedding_path=path,
                vector_search_index_name=vector_search_index_name
            )

            # Store the results in the data structure
            results_data.append({
                "Precision (Data Type)": path.split("_")[0],
                "Retrieval Latency (ms)": execution_time_ms,
                "Results": results
            })
        except Exception as e:
            # Handle errors
            results_data.append({
                "Precision (Data Type)": path.split("_")[0],
                "Retrieval Latency (ms)": "Error",
                "Results": str(e)
            })

    results_df = pd.DataFrame(results_data)

    return results_df
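
A single latency reading per embedding type can be dominated by noise such as caching or network jitter. One way to get a steadier comparison is to repeat the measurement and report the median. The sketch below assumes a zero-argument callable that returns one latency value in milliseconds; `summarize_latency` and the fake measurement used in the example are hypothetical.

```python
import statistics

def summarize_latency(measure_once, runs=10, warmup=2):
    """Discard warm-up calls, then summarize the latency samples in milliseconds."""
    for _ in range(warmup):
        measure_once()
    samples = [measure_once() for _ in range(runs)]
    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }

# Stand-in measurements: two slow warm-up calls followed by three steady ones
fake_values = iter([9.0, 8.0, 1.0, 3.0, 2.0])
summary = summarize_latency(lambda: next(fake_values), runs=3, warmup=2)
print(summary)  # {'median_ms': 2.0, 'min_ms': 1.0, 'max_ms': 3.0}
```

In the notebook, `measure_once` could be something like `lambda: custom_vector_search(user_query, collection, "int8_embedding")[1]`, which returns the execution time reported by `explain`.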

## Step 8: Retrieving Documents and Analysing Results

In [101]:
# Perform the vector search and visualize the results
user_query = "How do I increase my productivity for maximum output"
results_df = run_vector_search_operations(user_query, collection)

In [102]:
results_df.head()

| | Precision (Data Type) | Retrieval Latency (ms) | Results |
|---|---|---|---|
| 0 | float32 | 0.894696 | [{'chunk': 'With penetrating, almost clairvoya... |
| 1 | int8 | 0.729637 | [{'chunk': 'With penetrating, almost clairvoya... |
| 2 | binary | 0.675838 | [{'chunk': '"Success grows out of struggles to... |
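
The table reports latency, but the effect of quantization on result quality can also be estimated by treating the float32 results as a baseline and measuring how many of them each quantized search returns. The sketch below computes that overlap from lists of result documents shaped like the `Results` column (dictionaries with a `chunk` key). `overlap_at_k` is a hypothetical helper and the sample documents are placeholders, not actual search results.

```python
def overlap_at_k(baseline_results, candidate_results):
    """Fraction of baseline chunks that also appear in the candidate results."""
    baseline = {doc["chunk"] for doc in baseline_results}
    candidate = {doc["chunk"] for doc in candidate_results}
    if not baseline:
        return 0.0
    return len(baseline & candidate) / len(baseline)

# Placeholder result lists for illustration
float32_results = [{"chunk": "A"}, {"chunk": "B"}, {"chunk": "C"}, {"chunk": "D"}]
binary_results = [{"chunk": "A"}, {"chunk": "C"}, {"chunk": "X"}, {"chunk": "Y"}]
overlap = overlap_at_k(float32_results, binary_results)
print(overlap)  # 0.5
```

With the notebook's DataFrame, the inputs could be taken from `results_df["Results"]`, for example row 0 (float32) as the baseline and rows 1 and 2 as candidates.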
