# Lab 2: Data Ingestion with Voyage-AI Embeddings into MongoDB Atlas

This lab focuses on preparing your data, generating vector embeddings using Voyage-AI, and storing these embeddings along with your text chunks in a MongoDB Atlas collection. This forms the indexing part of your RAG pipeline.

## Objectives
- Define sample text data (or load from a source).
- Use Voyage-AI to generate embeddings for text chunks.
- Store text chunks and their embeddings in MongoDB Atlas.
- Understand the structure of documents for vector search.

## Prerequisites
- Complete Lab 1: MongoDB Atlas Setup (ensuring `MONGODB_URI` is set in `.env`).
- Obtain a Voyage-AI API Key and set it as `VOYAGE_API_KEY` in your `.env` file.
- Python environment set up with `pymongo`, `voyageai`, and `python-dotenv` installed.
  ```bash
  pip install pymongo voyageai python-dotenv
  ```

In [None]:
%pip install pymongo python-dotenv voyageai tqdm

## Step 1: Load Environment Variables and Initialize Clients

In [None]:
from dotenv import load_dotenv
import os
import voyageai
from pymongo import MongoClient

# Load environment variables
load_dotenv()

# Initialize Voyage-AI Client
# The variable name must match the one set in .env (see Prerequisites)
voyageai_api_key = os.environ.get("VOYAGE_API_KEY")
if not voyageai_api_key:
    raise ValueError("VOYAGE_API_KEY not found in .env file or environment variables.")
vo = voyageai.Client(api_key=voyageai_api_key)

# Initialize MongoDB Client
mongodb_uri = os.environ.get("MONGODB_URI")
if not mongodb_uri:
    raise ValueError("MONGODB_URI not found in .env file or environment variables.")
client = MongoClient(mongodb_uri)

# Select your database and collection
# (These will be created if they don't exist upon first insertion)
db = client['rag_db']
collection = db['documents']

print("Clients initialized successfully.")

## Step 2: Prepare Sample Data

For this lab, we'll use a small list of text snippets. In a real-world scenario, you would typically load and chunk data from documents, articles, and other sources.

In [None]:
sample_texts = [
    "The new product features include enhanced security protocols and faster processing.",
    "Our customer support is available 24/7 via live chat and email.",
    "This document outlines the privacy policy regarding user data collection and usage.",
    "Upcoming software updates will introduce a dark mode and custom themes.",
    "Please refer to the user manual for detailed installation instructions.",
    "MongoDB Atlas provides a fully managed cloud database service for modern applications.",
    "Vector search enables semantic similarity queries on unstructured data like text and images.",
    "RAG stands for Retrieval-Augmented Generation, combining search with language model responses.",
    "Voyage-AI generates high-quality embeddings optimized for retrieval and similarity tasks.",
    "Cosine similarity is commonly used to measure the angle between two embedding vectors.",
    "Chunking large documents into smaller pieces improves the precision of vector search results.",
    "An embedding is a dense numerical representation of text that captures semantic meaning.",
    "MongoDB supports flexible document schemas, making it ideal for storing heterogeneous data.",
    "The vector search index must specify the number of dimensions matching the embedding model.",
    "Reranking improves retrieval quality by reordering initial search results using a cross-encoder.",
    "LangChain is a popular framework for building applications powered by language models.",
    "A knowledge base stores curated information that an AI system can retrieve and reference.",
    "Tokenization is the process of breaking text into smaller units called tokens for model input.",
    "The dot product is another similarity metric used for comparing embedding vectors.",
    "Prompt engineering involves crafting effective prompts to guide language model behavior.",
    "Hybrid search combines keyword-based search with vector-based semantic search for better results.",
    "MongoDB Atlas Search integrates full-text search and vector search in a single platform.",
    "Data ingestion pipelines transform raw data into structured formats suitable for storage and retrieval.",
    "The voyage-3-large model produces 1024-dimensional embeddings for high-accuracy retrieval.",
    "Batch processing of embeddings reduces API calls and improves throughput during data ingestion.",
    "Semantic search understands the intent behind a query, not just the exact keywords.",
    "A retrieval pipeline fetches the most relevant documents from a knowledge base given a query.",
    "Context window size determines how much text a language model can process in a single request.",
    "Fine-tuning adjusts a pre-trained model on domain-specific data for improved performance.",
    "Metadata fields like source and timestamp help filter and organize retrieved documents.",
    "The pymongo library provides a Python interface for interacting with MongoDB databases.",
    "Environment variables store sensitive configuration like API keys outside of source code.",
    "BSON is the binary serialization format used by MongoDB to store documents internally.",
    "An aggregation pipeline in MongoDB processes data through a sequence of transformation stages.",
    "The $vectorSearch stage in an aggregation pipeline performs approximate nearest neighbor search.",
    "Approximate nearest neighbor (ANN) algorithms trade perfect accuracy for much faster search speed.",
    "Embedding models convert both queries and documents into the same vector space for comparison.",
    "Atlas Vector Search uses the HNSW algorithm for efficient nearest neighbor lookups.",
    "A well-designed chunk overlap strategy prevents information loss at document boundaries.",
    "Temperature controls the randomness of language model outputs, lower values produce more focused text.",
    "Grounding AI responses in retrieved documents reduces hallucinations and increases factual accuracy.",
    "The input_type parameter in Voyage-AI distinguishes between document and query embeddings.",
    "Index building in MongoDB Atlas may take several minutes depending on the volume of data.",
    "Python's dotenv library loads environment variables from a .env file into the process environment.",
    "Dimensionality reduction techniques like PCA can compress embeddings while preserving semantic information.",
    "Cross-encoder rerankers evaluate query-document pairs jointly for more accurate relevance scoring.",
    "MongoDB's flexible schema allows storing embeddings alongside text and metadata in a single document.",
    "Rate limiting on embedding APIs requires implementing retry logic and batching strategies.",
    "The recall metric measures the proportion of relevant documents successfully retrieved by the system.",
    "Evaluation of RAG systems involves measuring both retrieval quality and generation accuracy.",
]

print(f"Prepared {len(sample_texts)} text chunks.")
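
For longer source material, the chunking step mentioned above could look like the following minimal sketch. It assumes a simple fixed-size character window with overlap; the function name `chunk_text` and the sizes are illustrative and not part of this lab's code. Real pipelines often split on sentences or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that content near a
    boundary appears in both neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical long input, built by repeating one sentence
long_text = "MongoDB Atlas stores text chunks next to their embeddings. " * 5
chunks = chunk_text(long_text, chunk_size=80, overlap=20)
print(f"{len(chunks)} chunks; first chunk: {chunks[0]!r}")
```

The resulting list can be used in place of `sample_texts` in the steps that follow.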


## Step 3: Generate Embeddings with Voyage-AI

We'll use the `vo.embed()` method to convert our text chunks into vector embeddings. Specify `input_type="document"` when embedding documents for your knowledge base; search queries are embedded separately with `input_type="query"`.

In [None]:
print("Generating embeddings with Voyage-AI...")
try:
    response = vo.embed(texts=sample_texts, model="voyage-3-large", input_type="document")
    embeddings = response.embeddings
    print(f"Generated {len(embeddings)} embeddings. Dimension: {len(embeddings[0])}")
except Exception as e:
    print(f"Error generating embeddings: {e}")
    # Re-raise so later cells do not run without embeddings
    # (calling exit() inside a notebook would stop the kernel instead)
    raise
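
The single call above is fine for 50 short snippets. For larger datasets you would probably send the texts in batches. The sketch below shows one way to do that; `embed_in_batches`, `batched`, and the default batch size of 128 are assumptions for illustration, so check the current Voyage-AI limits before relying on a specific value. A stand-in embedder is used so the example runs without an API key.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 128,  # assumed value, not taken from the lab
) -> list[list[float]]:
    """Call embed_fn once per batch and return all embeddings in input order."""
    all_embeddings: list[list[float]] = []
    for batch in batched(texts, batch_size):
        all_embeddings.extend(embed_fn(batch))
    return all_embeddings


def fake_embed(batch: Sequence[str]) -> list[list[float]]:
    """Stand-in embedder: a one-dimensional 'embedding' holding the text length."""
    return [[float(len(text))] for text in batch]


demo = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(demo)
```

With the real client, `embed_fn` could be `lambda batch: vo.embed(texts=list(batch), model="voyage-3-large", input_type="document").embeddings`.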

## Step 4: Store Documents and Embeddings in MongoDB Atlas

Now, we'll create documents for MongoDB, each containing the original text chunk, its embedding, and some optional metadata (like `source`). Then we insert them into our collection.

In [None]:
documents_to_insert = []
for i, text in enumerate(sample_texts):
    documents_to_insert.append({
        "text_chunk": text,
        "embedding": embeddings[i],
        "source": f"sample_doc_{i+1}" # Example metadata
    })

print(f"Preparing to insert {len(documents_to_insert)} documents...")

# Clear existing documents if you want a fresh start each time
# collection.delete_many({})

try:
    if documents_to_insert:
        # Check if collection is empty before inserting to avoid duplicates if run multiple times
        if collection.count_documents({}) == 0:
            insert_result = collection.insert_many(documents_to_insert)
            print(f"Successfully inserted {len(insert_result.inserted_ids)} documents into MongoDB.")
        else:
            print("Collection is not empty. Skipping insertion to avoid duplicates. Clear the collection manually if you want to re-insert.")
    else:
        print("No documents to insert.")
except Exception as e:
    print(f"Error inserting documents: {e}")
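
The loop above pairs texts and embeddings by index. A slightly more defensive variant, sketched below, checks that the two lists line up and that every vector has the expected length before anything is written. The helper name `build_documents` and its parameters are hypothetical; the field names match the documents used in this lab.

```python
def build_documents(texts, embeddings, expected_dim, source_prefix="sample_doc"):
    """Pair each text with its embedding and validate the vector length."""
    if len(texts) != len(embeddings):
        raise ValueError(
            f"Got {len(texts)} texts but {len(embeddings)} embeddings."
        )

    documents = []
    for i, (text, embedding) in enumerate(zip(texts, embeddings), start=1):
        if len(embedding) != expected_dim:
            raise ValueError(
                f"Embedding {i} has {len(embedding)} dimensions, expected {expected_dim}."
            )
        documents.append({
            "text_chunk": text,            # original text
            "embedding": list(embedding),  # vector stored as a plain list of floats
            "source": f"{source_prefix}_{i}",  # example metadata
        })
    return documents


# Toy data with 3-dimensional vectors, just to show the resulting shape
docs = build_documents(
    ["hello", "world"],
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    expected_dim=3,
)
print(docs[0])
```

In this lab the call would be `build_documents(sample_texts, embeddings, expected_dim=1024)`, since `voyage-3-large` returns 1024-dimensional vectors by default.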

## Step 5: Verify Data and Plan Vector Search Index Creation

You can now go to your MongoDB Atlas UI, navigate to your cluster, and browse the `rag_db.documents` collection to see the inserted data.

To enable vector search, you need to create a Vector Search Index on the `embedding` field. This is typically done through the MongoDB Atlas UI.

1.  In MongoDB Atlas, go to **"DATABASE"**.
2.  Click the **"Search & Vector Search"** tab.
3.  Click **"Create Search Index"**.
4.  Select **"Vector Search"**.
5.  Select **"JSON Editor"** as the configuration method.
6.  Enter **Index Name**: **"vector_index"**.
7.  Select **`rag_db.documents`** as database and collection.
8.  Click **"Next"**.
9.  Copy and paste the following index definition:

    ```json
      {
        "fields": [
          {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 1024,
            "similarity": "cosine"
          }
        ]
      }
    ```
    *Remember to adjust `numDimensions` if you use a different Voyage-AI model with a different embedding size.*
10.  Confirm that the index is named `vector_index` (if you chose a different name, remember it for Lab 3).
11.  Click **"Next"**.
12.  Click **"Create Vector Search Index"**.

Wait for the index to build (it might take a few minutes). Once it's ready, you can proceed to Lab 3!
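
If you would rather script the index than click through the UI, the same definition can be built in Python. The sketch below assumes PyMongo 4.7 or later, where `SearchIndexModel` accepts `type="vectorSearch"`; the helper names `vector_index_definition` and `create_vector_index` are hypothetical. Only the definition helper runs here, because creating the index needs a live Atlas connection.

```python
def vector_index_definition(num_dimensions, path="embedding", similarity="cosine"):
    """Return the JSON-style definition shown in the UI steps as a Python dict."""
    allowed = {"cosine", "euclidean", "dotProduct"}
    if similarity not in allowed:
        raise ValueError(f"similarity must be one of {sorted(allowed)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,  # must match the embedding size
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, num_dimensions, name="vector_index"):
    """Create the index on an Atlas collection (assumes PyMongo 4.7+)."""
    # Imported lazily so the definition helper works without pymongo installed
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition(num_dimensions),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)


definition = vector_index_definition(1024)
print(definition)
```

Calling `create_vector_index(collection, len(embeddings[0]))` before closing the client would derive `numDimensions` from the embeddings generated in Step 3. The index still takes some time to build after the call returns.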

In [None]:
# Don't forget to close the MongoDB client connection when done with your script/notebook
client.close()
print("MongoDB client connection closed.")