<a href="https://colab.research.google.com/github/hamzafarooq/multi-agent-course/blob/main/Module_1/Agentic_RAG/Upload_data_to_Qdrant_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Uploading PDF Data to Qdrant with Embeddings

## Introduction

This notebook demonstrates how to process unstructured document data (such as PDF files) and store it in a local vector database using Qdrant. This workflow is useful for building applications like intelligent document search, semantic search engines, or AI-based question-answering systems.

We will start by extracting text content from PDF files, then convert that text into numerical representations called embeddings, and finally upload those embeddings into a Qdrant database for efficient retrieval and future use.

In this example, we will use two types of PDF data:
- **OpenAI documentation** (covering tools, APIs, and usage guidelines).
- **10-K financial filings** (official company reports and financial statements by Uber and Lyft).

## Objectives

- **Extract text from PDF files** using the PyMuPDF library.
- **Generate semantic embeddings** for each chunk of text using a Nomic embedding model loaded from Hugging Face.
- **Store the embeddings in Qdrant**, a vector database running locally.

By the end of this notebook, you'll have a working pipeline that reads documents, encodes them into meaningful vector representations, and persists them in a local database that can be queried later.



## Setup and Dependencies

Before we begin, ensure the necessary libraries are installed and imported.


In [None]:
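# Install the Qdrant client, which lets you connect to and interact with a Qdrant vector database.
# Qdrant is often used for similarity search, such as semantic search or recommendation systems.
!pip install qdrant_client

# Install the Hugging Face Transformers library.
# This library provides pre-trained models for tasks like text embeddings, classification, translation, summarization, and more.
!pip install transformers

# Install einops. Assumption: the remote code of the Nomic embedding model used later imports it;
# Colab usually ships it already, in which case this line is a no-op.
!pip install einops

# Install the PyMuPDF library (imported as 'fitz'), which is used for working with PDF files.
# It allows you to extract text, images, and metadata from PDFs.
!pip install PyMuPDF

# Install the LangChain text splitters package, imported later as 'langchain_text_splitters'.
!pip install langchain-text-splitters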


In [None]:
# Import necessary libraries

# Import the PyMuPDF library, which is installed as 'fitz'
# This library is used for reading and extracting content from PDF documents.
import fitz

# Import the 'os' module for listing the PDF folders and building file paths
import os

## 1. Extract Data from PDF Files

In this step, we will use the PyMuPDF library to extract text from our PDF documents. This will allow us to process the content and prepare it for embedding.
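Before extracting text, it can help to collect the PDF paths in a predictable order. The helper below is a minimal sketch using only the standard library; `list_pdf_paths` is a hypothetical name, not part of this notebook or of PyMuPDF.

```python
from pathlib import Path


def list_pdf_paths(folder):
    """Return the PDF files directly inside `folder`, sorted by path.

    Hypothetical helper for illustration: it matches the '.pdf' extension
    case-insensitively and skips directories.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path could then be passed to `read_text_pymupdf(str(path))`. Sorting makes the order of extracted documents reproducible across runs, which `os.listdir` does not guarantee.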


In [None]:
def read_text_pymupdf(path):
    """
    Extracts text from a PDF file using the PyMuPDF (fitz) library.

    Parameters:
        path (str): The file path to the PDF document.

    Returns:
        str: A single string containing all the text extracted from the PDF.
    """

    # Open the PDF document using the provided file path.
    # The context manager closes the file once extraction is finished.
    with fitz.open(path) as doc:
        # Extract the text of each page and join all pages into a single string
        return "".join(page.get_text() for page in doc)


In [None]:
# Define the path to the folder containing OpenAI-related PDF documents.
# Replace this with your own folder path; the data is available in the course's GitHub repository.
document_path_opnai = "/content/drive/MyDrive/Router RAG docs/openai"

# Define the path to the folder containing 10-K filing PDF documents.
# Replace this with your own folder path; the data is available in the course's GitHub repository.
document_path_10k = "/content/drive/MyDrive/Router RAG docs/10k files"

# Initialize an empty list to hold the extracted text from the OpenAI documents
openai_docs = []

# Initialize an empty list to hold the extracted text from the 10-K documents
docs_10k = []

# Sets to track processed filenames (avoid duplicates)
seen_files_opnai = set()
seen_files_10k = set()

# Loop through each file in the OpenAI documents folder
for _f in os.listdir(document_path_opnai):
    # Process only PDF files and skip duplicate filenames
    if _f.lower().endswith(".pdf") and _f not in seen_files_opnai:
        path = os.path.join(document_path_opnai, _f)
        openai_docs.append(read_text_pymupdf(path))
        seen_files_opnai.add(_f)

# Loop through each file in the 10-K documents folder
for _f in os.listdir(document_path_10k):
    # Process only PDF files and skip duplicate filenames
    if _f.lower().endswith(".pdf") and _f not in seen_files_10k:
        path = os.path.join(document_path_10k, _f)
        docs_10k.append(read_text_pymupdf(path))
        seen_files_10k.add(_f)


In [None]:
docs_10k

## 2. Prepare Document Chunks with Metadata for Vector Storage

Before we store our documents in a vector database like Qdrant, it's important to organize the data in a meaningful way. This section splits each document into chunks and assigns a unique identifier and metadata to each chunk.

### Why this step is important:

- **Chunk-Level Tracking**: Each document is split into smaller text chunks to fit the input size of embedding models. Assigning metadata to each chunk helps trace it back to its original source.
  
- **UUID Generation**: By attaching a universally unique identifier (UUID) to each chunk, we ensure every piece of data can be reliably referenced or retrieved later.

- **Metadata Enrichment**: Adding metadata such as the original document path enables better filtering, searching, and organization within the vector database.

This process ensures that once the chunks are embedded and stored, they remain well-organized and easily searchable in downstream applications such as semantic search or document-based Q&A systems.
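To make the roles of chunk size, overlap, and per-chunk metadata concrete, here is a simplified sketch. It uses fixed-size character windows rather than the separator-aware strategy of `RecursiveCharacterTextSplitter`, and `chunk_with_overlap` and `attach_metadata` are hypothetical helpers written only for illustration.

```python
import uuid


def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split `text` into windows of at most `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def attach_metadata(chunks, source):
    """Pair each chunk with its source and a freshly generated UUID."""
    return [
        {"content": c, "metadata": {"document_info": source, "uuid": str(uuid.uuid4())}}
        for c in chunks
    ]


# Toy input: the file name "example.pdf" is made up for this illustration
records = attach_metadata(
    chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1),
    source="example.pdf",
)
print([r["content"] for r in records])  # ['abcd', 'defg', 'ghij']
```

Note how the last character of each chunk reappears as the first character of the next one; that shared region is what preserves context across chunk boundaries.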


In [None]:
# Import a text splitter that breaks large texts into smaller, overlapping chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create the text splitter with settings for chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2048,          # Max characters per chunk
    chunk_overlap=50,         # Overlap between chunks to preserve context
    length_function=len,      # Use Python's len() to measure text length
    is_separator_regex=False, # Treat separators as plain text, not regex
    separators=[              # Preferred breakpoints for splitting
        "\n\n", "\n", " ", ".", ",",
        "\u200b", "\uff0c", "\u3001", "\uff0e", "\u3002", ""
    ],
)

# Split the extracted OpenAI document text into chunks
opnai_chunks = text_splitter.create_documents(openai_docs)

# Split the 10-K documents into chunks as well
chunks_10k = text_splitter.create_documents(docs_10k)

# View the result (a list of Document objects, one per text chunk)
opnai_chunks


In [None]:
# Import the uuid module to generate unique identifiers
import uuid

# Loop through each chunk of the OpenAI documents
for chunk in opnai_chunks:
    # Add metadata to the chunk:
    # 'document_info' stores the source folder path (where the documents came from)
    # 'uuid' stores a freshly generated unique identifier (helps track and reference the chunk later)
    chunk.metadata['document_info'] = document_path_opnai
    chunk.metadata['uuid'] = str(uuid.uuid4())

# Display the first 5 chunks with their metadata
opnai_chunks[:5]


In [None]:
# Loop through each chunk of the 10-K documents
for chunk in chunks_10k:
    # Assign metadata to each chunk:
    # 'document_info' stores the path of the folder where the 10-K files are located
    # 'uuid' stores a freshly generated unique identifier for the chunk
    chunk.metadata['document_info'] = document_path_10k
    chunk.metadata['uuid'] = str(uuid.uuid4())

# Display the metadata and content of the 6th chunk (index 5) from the 10-K documents
chunks_10k[5]


## 3. Embed Chunks for Vector Database

### Purpose:

Embedding the chunks of text is a crucial step in preparing the data for storage in a **vector database** like Qdrant. In this step, we convert each text chunk into a **numerical representation** (vector) that captures its semantic meaning. This allows us to perform advanced operations like **semantic search**, **similarity comparison**, and **retrieval** based on meaning, rather than just keyword matching.

### Why we embed:

- **Vector Representation**: Embedding transforms text into vectors (lists of numbers) that machine learning models can understand. These vectors represent the **semantic meaning** of the text, allowing similar texts to be grouped together, even if they don’t share exact words.
  
- **Efficient Search**: Storing these vectors in a vector database enables **fast similarity searches**. For example, if you ask a question or search for a document, the vector database can quickly find and return the most relevant results based on the meaning of the text, not just exact matches.
  
- **Contextual Understanding**: The embedding process allows the system to "understand" the context of words, which improves the relevance and accuracy of search results. For example, it helps understand that "machine learning" and "artificial intelligence" are related concepts, even if they don't appear together in the same document.

### What it does:

- **Transforms text** into dense vectors, capturing the **semantic essence** of the content.
- These vectors are then stored in a **vector database** (Qdrant in our case) for fast retrieval and comparison based on similarity, enabling features like document search or answering questions.
  
In the following steps, we will embed the text chunks using a pre-trained Nomic text embedding model (`nomic-ai/nomic-embed-text-v1.5`) and store the resulting vectors in Qdrant for future use.
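A transformer produces one vector per token, so some pooling step is needed to get a single fixed-size vector per chunk. The sketch below shows mean pooling on plain Python lists, with an attention mask so that padding positions are ignored. `mean_pool` is a hypothetical helper and the numbers are toy values, not real model outputs.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding position (toy values)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

The `get_text_embeddings` function in this notebook averages over all token positions without a mask. That gives the same result as long as one text is embedded at a time, because a single input is never padded; the mask only matters when several texts of different lengths are batched together.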


In [None]:
# Import the necessary libraries from the Hugging Face Transformers library
# AutoTokenizer is used to tokenize the input text into a format that the model can process.
# AutoModel loads the pre-trained model for generating embeddings.
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model from Hugging Face
# 'nomic-ai/nomic-embed-text-v1.5' is a pre-trained model designed for text embeddings
# The `trust_remote_code=True` argument allows the use of the model's code from the remote repository
text_tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Import PyTorch so gradient tracking can be disabled during inference
import torch

# Define a function to get embeddings for a given text input
# Note: the model card recommends task prefixes such as "search_document: ";
# this notebook embeds the raw text, so use the same convention for queries later.
def get_text_embeddings(text):
    # Tokenize the text input into a format the model understands
    # `return_tensors="pt"` means the output will be in PyTorch tensor format
    # `padding=True` ensures that inputs of varying lengths are padded to a consistent length
    # `truncation=True` ensures that long inputs are truncated to fit the model's max input length
    inputs = text_tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    # Pass the tokenized inputs into the model without tracking gradients (inference only)
    # The model's output contains the hidden states of the final transformer layer
    with torch.no_grad():
        outputs = text_model(**inputs)

    # Average the last hidden state over the token dimension (dim=1)
    # This gives us a fixed-size vector for the input text
    embeddings = outputs.last_hidden_state.mean(dim=1)

    # Return the embedding of the first (and only) input as a NumPy array
    return embeddings[0].detach().numpy()

# Example usage
text = "This is a test sentence."  # Sample text to embed
embeddings = get_text_embeddings(text)  # Get the embeddings for the text

# Print the first 5 values of the embeddings (just to inspect the result)
print(embeddings[:5])


In [None]:
# Embed the OpenAI document chunks into vectors using the `get_text_embeddings` function
# For each chunk in the OpenAI documents, the text is passed through the embedding model
# `document.page_content` refers to the actual text content of each chunk.
opnai_texts_embeded = [get_text_embeddings(document.page_content) for document in opnai_chunks]

# Embed the 10-K document chunks into vectors in the same way
# For each chunk in the 10-K documents, the `page_content` is passed to the embedding model
texts_embeded_10k = [get_text_embeddings(document.page_content) for document in chunks_10k]


## 4. Initialize Qdrant

### Purpose:

This step initializes **Qdrant**, a vector database, to store the text embeddings generated earlier. Qdrant is designed for high-performance similarity search and retrieval of vector data, making it ideal for tasks like semantic search, recommendation systems, or nearest neighbor searches.

### Why this is important:

- **Qdrant Initialization**: This code sets up the Qdrant database to store vectors efficiently.
- **Creating a Collection**: Collections are like tables in a relational database, and in this case, the collection stores vectors representing text embeddings.
- **Cosine Similarity**: Cosine similarity is the metric the collection uses to compare vectors, so a query returns the vectors (texts) whose direction, and therefore meaning, is closest to that of the query.
  
Once the collection is created and initialized, it will be ready to store the embeddings from the OpenAI documents and 10-K filings, making it possible to perform fast similarity searches later.
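As a rough illustration of the metric, the sketch below computes cosine similarity as a dot product over L2-normalized vectors. Qdrant's documentation describes its cosine distance in the same terms, but this code is only a standard-library approximation of the idea, not Qdrant's implementation, and `l2_normalize` and `cosine_similarity` are hypothetical helper names.

```python
import math


def l2_normalize(vector):
    """Scale `vector` to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Same direction, different length: similarity close to 1
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Perpendicular vectors: similarity close to 0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because only direction matters, two chunks of very different length can still score as highly similar if their embeddings point the same way.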



In [None]:
# Import necessary libraries from the Qdrant client
# QdrantClient is used to interact with the Qdrant database.
# models provides predefined parameters like vector configurations.
from qdrant_client import QdrantClient, models
import os

# Define the path where the Qdrant database will be stored
# This is the directory where Qdrant will save its data on your local machine
qdrant_data_dir = '/content/qdrant_data'

# Create the directory if it doesn't already exist
# `exist_ok=True` ensures no error is raised if the directory already exists
os.makedirs(qdrant_data_dir, exist_ok=True)

# Initialize the Qdrant client, passing the path where the database will be stored
# The Qdrant client will manage operations like adding vectors, creating collections, and querying data
client = QdrantClient(path=qdrant_data_dir)


In [None]:
# Determine the size (dimensionality) of the embeddings by checking the length of the first embedding vector
# The embedding size is the number of values in each vector representation of the text (e.g., 768, 1024, etc.)
text_embeddings_size = len(opnai_texts_embeded[0])

# Check if the collection "opnai_data" already exists in Qdrant
# If it does not exist, the code proceeds to create a new collection
if not client.collection_exists("opnai_data"):
    # Create a new collection in Qdrant to store the OpenAI document embeddings
    client.create_collection(
        # Name of the collection, which helps identify it in the database
        collection_name="opnai_data",

        # Define the vector configuration (size and distance metric) for the collection
        vectors_config=models.VectorParams(
            size=text_embeddings_size,  # Set the vector size to match the embeddings' dimensionality
            distance=models.Distance.COSINE,  # Use Cosine similarity to measure distance between vectors
        ),
    )


In [None]:
# Check if the collection "10k_data" already exists in Qdrant
# If it does not exist, the code proceeds to create a new collection
if not client.collection_exists("10k_data"):
    # Create a new collection in Qdrant to store the 10-K document embeddings
    client.create_collection(
        # Name of the collection, which will hold the 10-K document embeddings
        collection_name="10k_data",

        # Define the vector configuration (size and distance metric) for the collection
        vectors_config=models.VectorParams(
            size=text_embeddings_size,  # Set the vector size to match the embeddings' dimensionality
            distance=models.Distance.COSINE,  # Use Cosine similarity to measure distance between vectors
        ),
    )


## 5. Store Embeddings in Qdrant

### Purpose:

In this step, we insert the embeddings (numerical representations of text) for both the OpenAI documents and the 10-K filings into their respective collections in Qdrant. By storing these embeddings in Qdrant, we can enable efficient semantic search functionality, which is a key part of the **Agentic RAG** (Retrieval-Augmented Generation) project.

Once the embeddings are stored, we can leverage Qdrant's similarity search capabilities to retrieve semantically relevant documents based on a given query.
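Conceptually, retrieval scores every stored vector against the query vector and returns the payloads of the best matches. The brute-force sketch below shows that idea with the standard library only. It is not how Qdrant searches internally, `top_k` is a hypothetical helper, and the vectors and texts are toy values invented for the example.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, points, k=2):
    """Return the k (score, content) pairs with the highest similarity."""
    scored = (
        (cosine(query_vector, p["vector"]), p["payload"]["content"])
        for p in points
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy points shaped like the vector + payload records stored in this notebook
points = [
    {"vector": [1.0, 0.0], "payload": {"content": "pricing for API usage"}},
    {"vector": [0.0, 1.0], "payload": {"content": "annual revenue figures"}},
    {"vector": [0.9, 0.1], "payload": {"content": "rate limits for API calls"}},
]

results = top_k([1.0, 0.05], points, k=2)
for score, content in results:
    print(round(score, 3), content)
```

Storing the chunk text in the payload is what lets a search return readable content directly, without a second lookup against the original PDFs.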


In [None]:
# Define the names of the Qdrant collections we're working with
# These are the collections where document embeddings will be stored
clusters = ["opnai_data", "10k_data"]

# Import numpy to handle vector data as arrays
import numpy as np

# Upload (store) the embeddings of OpenAI documents into the 'opnai_data' collection in Qdrant
client.upload_points(
    collection_name="opnai_data",  # Name of the target collection in Qdrant

    # Create a list of PointStruct objects, one for each embedded document chunk
    points=[
        models.PointStruct(
            id=doc.metadata['uuid'],  # Use the pre-generated UUID as a unique ID for each point (chunk)

            vector=opnai_texts_embeded[idx].tolist(),  # The embedding vector for the chunk, converted to a plain list of floats

            payload={  # Payload stores extra information (metadata and content) along with the vector
                "metadata": doc.metadata,       # Include metadata like document path and UUID
                "content": doc.page_content     # Store the original text content of the chunk
            }
        )
        for idx, doc in enumerate(opnai_chunks)  # Loop through all OpenAI chunks to create PointStructs
    ]
)


In [None]:
# Upload (store) the embeddings of 10-K documents into the '10k_data' collection in Qdrant
client.upload_points(
    collection_name="10k_data",  # Target collection name in Qdrant

    # Create a list of points, one for each chunk in the 10-K documents
    points=[
        models.PointStruct(
            id=doc.metadata['uuid'],  # Unique ID for each vector (from previously assigned UUID)

            vector=texts_embeded_10k[idx].tolist(),  # The embedding vector for the chunk, converted to a plain list of floats

            payload={  # Payload contains additional information for each vector
                "metadata": doc.metadata,       # Include metadata such as source path and UUID
                "content": doc.page_content     # Include the original chunk text for retrieval/display
            }
        )
        for idx, doc in enumerate(chunks_10k)  # Loop through all 10-K chunks to create PointStructs
    ]
)


## Final Notes

At this point, we've successfully completed the full pipeline for preparing and storing document embeddings:

1. Extracted text from PDFs using PyMuPDF.
2. Split the documents into manageable text chunks.
3. Embedded each chunk using the `nomic-ai/nomic-embed-text-v1.5` model.
4. Stored the resulting vectors, along with metadata and original content, into Qdrant collections (`opnai_data` and `10k_data`).

These embeddings are now stored in a local Qdrant vector database located at `/content/qdrant_data`.


You can reuse this vector database in any other project by pointing to the same path. In our case, we will be using it as the retrieval layer in the **Agentic RAG** system, where relevant chunks will be fetched based on user queries and passed to an LLM to generate meaningful, grounded responses.

This setup forms the foundation for building retrieval-augmented applications.
