This Python script evaluates the performance of vector search on a MongoDB Atlas database using the FeedbackQARetrieval dataset from the Massive Text Embedding Benchmark (MTEB). It uses the Voyage AI API to generate text embeddings and a custom library (`mdbra`) to orchestrate the evaluation and calculate retrieval metrics.
The script is designed to test multiple embedding dimensionalities, creating a separate MongoDB collection and vector search index for each dimension to compare their performance.
Before you begin, ensure you have the following:
- Python 3.8+
- MongoDB Atlas Cluster: A running MongoDB Atlas cluster. You will need the connection URI. The cluster must be M10 or larger to use MongoDB Atlas Vector Search.
- Voyage AI API Key: An active API key from Voyage AI.
- Clone the Repository: If this script is part of a larger project, clone that project.

  ```bash
  git clone https://github.com/jegentile/ext_retrieval_accuracy
  cd ext_retrieval_accuracy
  ```
- Install Dependencies: Install the necessary Python libraries.

  ```bash
  pip install mteb voyageai pymongo tqdm
  ```

  or

  ```bash
  uv pip install mteb voyageai pymongo tqdm
  ```
- Set Environment Variables: The script requires two environment variables to connect to the necessary services.

  For the Voyage AI API key:

  ```bash
  export VOYAGEAPI="your-voyage-ai-api-key"
  ```

  For the MongoDB Atlas connection:

  ```bash
  export MDBURI="mongodb+srv://<user>:<password>@<cluster-uri>/..."
  ```

  Replace the placeholders with your actual credentials and cluster information.
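For reference, a minimal sketch of how a script like this might read those variables and fail fast if one is missing (the error handling here is illustrative, not the script's actual behavior):

```python
import os

# Read the required environment variables, failing with a clear message
# if either is missing.
try:
    VOYAGE_API_KEY = os.environ["VOYAGEAPI"]
    MDB_URI = os.environ["MDBURI"]
except KeyError as missing:
    raise SystemExit(f"Missing required environment variable: {missing}")
```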
You can customize the script's behavior by modifying the global constants at the top of the file:

- `VOYAGE_MODEL`: The name of the Voyage AI model to use for generating embeddings (e.g., `'voyage-3.5'`).
- `DATABASE_NAME`: The name of the MongoDB database where collections will be created (e.g., `'MRL_Assessment'`).
- `MRL_DIMENSIONS`: The most important setting. A list of integers representing the different embedding dimensions you want to test; the script loops through this list.

  ```python
  MRL_DIMENSIONS = [256, 512, 1024]
  ```

- `MTEB_DATA_NAME`: The name of the MTEB task to use (e.g., `'FeedbackQARetrieval'`).
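Put together, the top of the file would look something like this (the values are the examples above, not necessarily the script's shipped defaults):

```python
VOYAGE_MODEL = "voyage-3.5"             # Voyage AI embedding model
DATABASE_NAME = "MRL_Assessment"        # target Atlas database
MRL_DIMENSIONS = [256, 512, 1024]       # embedding dimensions to test
MTEB_DATA_NAME = "FeedbackQARetrieval"  # MTEB task to evaluate
```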
Once the prerequisites are met and the configuration is set, you can execute the script from your terminal:
```bash
python FeedbackQARetrieval.py
```

The script prints progress updates as it loads data and creates indexes, and finally prints a dictionary of performance metrics for each dimension tested.
The script follows a systematic process to evaluate retrieval performance for each specified dimension.
- The script initializes a connection to your MongoDB Atlas cluster.
- It downloads and loads the `FeedbackQARetrieval` dataset from MTEB, which includes a corpus of documents, a set of queries, and the ground-truth "relevant documents" for each query.
- It initializes the `mdbra` evaluation framework.
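A minimal sketch of this initialization, assuming a recent `mteb` release that provides `mteb.get_task` (the `mdbra` setup is omitted since its interface isn't documented here):

```python
import os

import mteb
from pymongo import MongoClient

# Connect to the Atlas cluster using the URI from the environment.
client = MongoClient(os.environ["MDBURI"])
db = client["MRL_Assessment"]

# Download and load the MTEB retrieval task.
task = mteb.get_task("FeedbackQARetrieval")
task.load_data()

# Retrieval tasks expose the corpus, queries, and ground-truth relevance
# judgments as dicts keyed by split (assumed "test" here).
corpus = task.corpus["test"]                # doc_id -> {"title": ..., "text": ...}
queries = task.queries["test"]              # query_id -> query text
relevant_docs = task.relevant_docs["test"]  # query_id -> {doc_id: score}
```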
For each embedding dimension defined in the `MRL_DIMENSIONS` list, the script performs the following steps:
- 📄 Create Collection: A new, unique MongoDB collection is created for the current dimension (e.g., `FeedbackQARetrieval_256`).
- ✨ Generate & Store Embeddings: The `load_data_into_collection` function iterates through the corpus documents, generates vector embeddings of the specified dimension using the Voyage AI API, and inserts the documents along with their new vectors into the collection (see the embedding sketch after this list).
- 🔎 Create Vector Index: The `create_vector_index` function programmatically creates a MongoDB Atlas Vector Search index on the vector field of the newly populated collection, then pauses for 120 seconds to allow the index to finish building (see the index sketch after this list).
- 🏷️ Load Labels and Queries: The `load_mteb_labels` function prepares the ground-truth data for evaluation. It generates embeddings for each query and stores the queries, their vectors, and the list of known relevant document IDs in the database using the `mdbra` framework (see the query-embedding sketch after this list).
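As a rough illustration of the embedding step, here is a minimal sketch using the Voyage AI Python client (assuming a recent `voyageai` release that supports `output_dimension`); the document field names (`doc_id`, `text`, `embedding`) and the batch size are assumptions, not the script's confirmed internals:

```python
import os

import voyageai
from tqdm import tqdm

vo = voyageai.Client(api_key=os.environ["VOYAGEAPI"])

def load_data_into_collection(collection, corpus, dimension, batch_size=128):
    """Embed the corpus at the given dimension and insert it into Atlas."""
    doc_ids = list(corpus)
    for start in tqdm(range(0, len(doc_ids), batch_size)):
        batch_ids = doc_ids[start:start + batch_size]
        texts = [corpus[doc_id]["text"] for doc_id in batch_ids]
        # voyage-3.5 supports Matryoshka-style truncation, so one model can
        # produce embeddings at each tested dimension via output_dimension.
        result = vo.embed(
            texts,
            model="voyage-3.5",
            input_type="document",
            output_dimension=dimension,
        )
        collection.insert_many(
            {"doc_id": doc_id, "text": text, "embedding": emb}
            for doc_id, text, emb in zip(batch_ids, texts, result.embeddings)
        )
```

Batching keeps the number of API round trips manageable for a corpus of this size.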
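A companion sketch for index creation, using PyMongo's `create_search_index` (available in PyMongo 4.5+ against Atlas); the index name, vector path, and cosine similarity are illustrative assumptions:

```python
import time

from pymongo.operations import SearchIndexModel

def create_vector_index(collection, dimension):
    """Create an Atlas Vector Search index on the embedding field."""
    model = SearchIndexModel(
        name=f"vector_index_{dimension}",
        type="vectorSearch",
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "embedding",
                    "numDimensions": dimension,
                    "similarity": "cosine",
                }
            ]
        },
    )
    collection.create_search_index(model)
    # Mirror the script's fixed wait for the index build to finish.
    time.sleep(120)
```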
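The query side of the label-loading step would use the same client (`vo` from the sketch above) with `input_type="query"`; how `load_mteb_labels` then stores queries and relevant-document IDs goes through `mdbra`, whose interface isn't documented here, so this sketch covers only the embedding call:

```python
def embed_queries(queries, dimension):
    """Embed MTEB queries (query_id -> text) at the given dimension."""
    result = vo.embed(
        list(queries.values()),
        model="voyage-3.5",
        input_type="query",
        output_dimension=dimension,
    )
    return dict(zip(queries.keys(), result.embeddings))
```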
- After processing all specified dimensions, the script queries the `mdbra` framework.
- It calls `CalculateMetrics` for each configured index. This function runs the test queries against the corresponding vector index, retrieves the results, and compares them to the ground-truth labels (see the sketch below).
- Finally, it prints a metrics dictionary (e.g., containing recall) for each dimension, allowing you to compare their retrieval performance.