## Advanced RAG - Data Ingestion Pipeline for PageRAG
### Page-wise Document Processing with Gemini Embeddings and Qdrant

**Learning Objectives:**
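- Extract text from PDFs page by page with Docling
- Derive metadata (company, filing type, fiscal quarter, fiscal year) from the filename
- Store each page in Qdrant with its metadata (page number, file hash, source file)
- Embed pages with Gemini (dense) and BM25 (sparse) vectors for hybrid retrieval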

**Use Cases:**
1. Financial Analysis: Process SEC filings (10-K, 10-Q)
2. Legal: Organize contracts and case documents
3. Research: Index academic papers
4. Enterprise: Searchable document repositories

![image.png](attachment:image.png)

### Setup and Configuration

In [1]:
from dotenv import load_dotenv
load_dotenv()

import hashlib
from pathlib import Path
from typing import List

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_qdrant import QdrantVectorStore, RetrievalMode, FastEmbedSparse
from langchain_core.documents import Document
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, SparseVectorParams, SparseIndexParams

from docling.document_converter import DocumentConverter

In [2]:
# Configuration
DATA_DIR = "data"
# Local on-disk path; not used below, because the vector store connects to a Qdrant server by URL
QDRANT_PATH = "./qdrant_financial_db"
COLLECTION_NAME = "financial_docs"
EMBEDDING_MODEL = "models/gemini-embedding-001"

### Initialize Gemini Embeddings, BM25, and Qdrant

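**Hybrid Retrieval**: Combines dense (semantic) vectors from Gemini with sparse (keyword) BM25 vectors, so a query can match a page by meaning as well as by exact terms such as form names or fiscal periods.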

In [3]:
# Dense embeddings (Gemini)
embeddings = GoogleGenerativeAIEmbeddings(model=EMBEDDING_MODEL)

# Sparse embeddings (BM25)
sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

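# Initialize vector store with hybrid retrieval.
# Passing an empty document list only creates the collection.
# force_recreate=True drops any existing collection with this name,
# so re-running this cell starts again from an empty index.
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    embedding=embeddings,
    sparse_embedding=sparse_embeddings,
    collection_name=COLLECTION_NAME,
    url="http://localhost:6333",
    retrieval_mode=RetrievalMode.HYBRID,
    force_recreate=True,
)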

2025-12-12 14:20:29,225 - INFO - HTTP Request: GET http://localhost:6333 "HTTP/1.1 200 OK"
2025-12-12 14:20:29,233 - INFO - HTTP Request: GET http://localhost:6333/collections/financial_docs/exists "HTTP/1.1 200 OK"
2025-12-12 14:20:29,780 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs "HTTP/1.1 200 OK"
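
To make the idea of combining two retrievers concrete: a hybrid query yields one ranking from the dense vectors and one from the sparse vectors, and the two rankings have to be merged into a single list. A common merging rule is Reciprocal Rank Fusion (RRF). The sketch below is a minimal pure-Python illustration of that rule, not Qdrant's internal code; the function name `reciprocal_rank_fusion`, the constant `k=60`, and the page IDs are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for one query (IDs are made up for illustration).
dense_hits = ["page-12", "page-40", "page-7"]
sparse_hits = ["page-40", "page-3", "page-12"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Pages that appear in both rankings accumulate two contributions, so they move ahead of pages that only one retriever found.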


### Metadata Extraction from Filename

In [4]:
def extract_metadata_from_filename(filename: str) -> dict:
    """
    Extract metadata from filename.
    
    Expected format: {company} {doc_type} [{quarter}] {year}.pdf
    The quarter is optional (annual filings such as 10-K omit it).
    Examples:
    - amazon 10-k 2024.pdf
    - amazon 10-q q1 2024.pdf
    
    Returns:
        dict with company_name, doc_type, fiscal_year, fiscal_quarter
    
    Raises:
        ValueError: if the name does not have 3 or 4 space-separated parts
    """
    name = Path(filename).stem  # drop the extension, whatever its case
    parts = name.split()
    
    if len(parts) not in (3, 4):
        raise ValueError(f"Unexpected filename format: {filename!r}")
    
    metadata = {}
    
    # With 4 parts, the quarter sits between doc_type and year
    metadata['fiscal_quarter'] = parts[2] if len(parts) == 4 else None
    metadata['fiscal_year'] = int(parts[-1])
    metadata['company_name'] = parts[0]
    metadata['doc_type'] = parts[1]
    
    return metadata

In [5]:
extract_metadata_from_filename('amazon 10-k 2023.pdf')

{'fiscal_quarter': None,
 'fiscal_year': 2023,
 'company_name': 'amazon',
 'doc_type': '10-k'}

In [6]:
extract_metadata_from_filename('amazon 10-q q1 2024.pdf')

{'fiscal_quarter': 'q1',
 'fiscal_year': 2024,
 'company_name': 'amazon',
 'doc_type': '10-q'}
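
Splitting on whitespace trusts that every file follows the naming convention. A stricter alternative is to validate the name against a regular expression and return `None` for anything that does not fit. The sketch below is one possible variant under the assumption that quarters are always written `q1` to `q4` and years have four digits; `FILENAME_PATTERN` and `parse_filing_name` are hypothetical names, not part of the notebook.

```python
import re
from typing import Optional

# Assumed naming convention: "<company> <doc_type> [q1-q4] <year>.pdf"
FILENAME_PATTERN = re.compile(
    r"^(?P<company>\S+)\s+(?P<doc_type>\S+)"
    r"(?:\s+(?P<quarter>q[1-4]))?\s+(?P<year>\d{4})\.pdf$",
    re.IGNORECASE,
)


def parse_filing_name(filename: str) -> Optional[dict]:
    """Return filing metadata, or None if the name does not fit the pattern."""
    match = FILENAME_PATTERN.match(filename.strip())
    if match is None:
        return None
    quarter = match.group("quarter")
    return {
        "company_name": match.group("company").lower(),
        "doc_type": match.group("doc_type").lower(),
        "fiscal_quarter": quarter.lower() if quarter else None,
        "fiscal_year": int(match.group("year")),
    }


print(parse_filing_name("amazon 10-q q1 2024.pdf"))
print(parse_filing_name("apple 8-k q4 2023.pdf"))
print(parse_filing_name("notes.pdf"))  # does not match the pattern
```

Returning `None` lets the caller decide whether to skip the file or ingest it without filing metadata.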

### Extract Text from PDF Pages

In [7]:
def extract_pdf_pages(pdf_path: str) -> List[str]:
    """
    Extract text from each page of PDF.
    
    Returns:
        List of page texts
    """
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    
    page_break = "<!-- page break -->"
    markdown_text = result.document.export_to_markdown(page_break_placeholder=page_break)
    
    pages = markdown_text.split(page_break)
    
    return pages

In [9]:
pages = extract_pdf_pages('data/rag-data/amazon/amazon 10-q q1 2024.pdf')
print(f"Total pages: {len(pages)}")

2025-12-12 14:21:55,006 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:21:55,067 - INFO - Going to convert document batch...
2025-12-12 14:21:55,068 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:21:55,094 - INFO - Loading plugin 'docling_defaults'
2025-12-12 14:21:55,107 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-12-12 14:21:55,125 - INFO - Loading plugin 'docling_defaults'
2025-12-12 14:21:55,153 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-12-12 14:21:56,763 - INFO - Accelerator device: 'cuda:0'

Total pages: 52
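
Splitting on the placeholder returns every segment, including ones that contain only whitespace. Whether such blank pages occur depends on the PDF, but if they do, embedding them adds nothing useful. The sketch below shows one way to drop them while keeping the original page numbers; `split_numbered_pages` is a hypothetical helper and the sample text is made up.

```python
from typing import List, Tuple

PAGE_BREAK = "<!-- page break -->"  # same placeholder as in extract_pdf_pages


def split_numbered_pages(
    markdown_text: str, page_break: str = PAGE_BREAK
) -> List[Tuple[int, str]]:
    """Split on the placeholder; keep original page numbers, drop blank pages."""
    numbered = []
    for page_num, text in enumerate(markdown_text.split(page_break), start=1):
        cleaned = text.strip()
        if cleaned:  # blank pages carry no text to embed
            numbered.append((page_num, cleaned))
    return numbered


# Three pages, the second one blank.
sample = f"Cover page\n{PAGE_BREAK}\n\n{PAGE_BREAK}\nItem 1. Financial Statements"
print(split_numbered_pages(sample))
```

Because the page number is assigned before filtering, the third page is still reported as page 3, which keeps citations aligned with the PDF.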


### File Hash for Duplicate Detection

In [11]:
def compute_file_hash(file_path: str) -> str:
    """Compute SHA-256 hash of file content."""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    return sha256_hash.hexdigest()

In [13]:
compute_file_hash('data/rag-data/amazon/amazon 10-q q1 2024.pdf')

'c08079bc14250c896f3ca151f9a72ecc1ddcb9ca8e5b021539e91af10fae5c4b'
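
The reason for hashing content rather than comparing filenames can be shown with a small self-contained experiment: a renamed copy produces the same digest, while a one-byte change produces a different one. The file names and placeholder bytes below are invented for the demonstration, and `sha256_of_file` simply mirrors `compute_file_hash`.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 4096) -> str:
    """Hash a file in fixed-size chunks so large PDFs never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "amazon 10-k 2024.pdf"
    renamed_copy = Path(tmp) / "amzn-annual-report.pdf"
    edited = Path(tmp) / "amazon 10-k 2024 v2.pdf"

    # Placeholder bytes, not a real PDF.
    original.write_bytes(b"%PDF-1.7 placeholder bytes")
    renamed_copy.write_bytes(b"%PDF-1.7 placeholder bytes")
    edited.write_bytes(b"%PDF-1.7 placeholder bytes!")

    same_content = sha256_of_file(original) == sha256_of_file(renamed_copy)
    changed_content = sha256_of_file(original) != sha256_of_file(edited)

print(same_content, changed_content)
```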

### Track Processed Files

In [14]:
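# Get already processed files from Qdrant.
# QdrantVectorStore nests Document.metadata under the "metadata" payload key,
# so the hash lives at payload["metadata"]["file_hash"].
processed_hashes = set()
offset = None
while True:
    # scroll() returns one page of points plus the offset of the next page
    points, offset = vector_store.client.scroll(
        collection_name=COLLECTION_NAME,
        limit=1000,
        offset=offset,
        with_payload=True,
        with_vectors=False,
    )
    for point in points:
        file_hash = ((point.payload or {}).get('metadata') or {}).get('file_hash')
        if file_hash:
            processed_hashes.add(file_hash)
    if offset is None:  # no more pages
        break

print(f"Already processed: {len(processed_hashes)} files")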

2025-12-12 14:23:05,103 - INFO - HTTP Request: POST http://localhost:6333/collections/financial_docs/points/scroll "HTTP/1.1 200 OK"


Already processed: 0 files


### Document Ingestion Pipeline

In [15]:
def ingest_docs_in_vectordb(pdf_path: Path):
    """Process and ingest PDF into Qdrant vector store."""
    print(f"Processing: {pdf_path.name}")
    
    file_hash = compute_file_hash(pdf_path)
    if file_hash in processed_hashes:
        print(f"[SKIP] Already processed: {pdf_path.name}")
        return
    
    pages = extract_pdf_pages(str(pdf_path))
    file_metadata = extract_metadata_from_filename(pdf_path.name)
    
    documents = []
    
    for page_num, page_text in enumerate(pages, start=1):
        metadata = file_metadata.copy()
        metadata['page'] = page_num
        metadata['file_hash'] = file_hash
        metadata['source_file'] = pdf_path.name
        
        doc = Document(page_content=page_text, metadata=metadata)
        documents.append(doc)
    
    vector_store.add_documents(documents=documents)
    processed_hashes.add(file_hash)
    
    print(f"[DONE] Ingested {len(documents)} pages from {pdf_path.name}")
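
The hash check prevents re-ingesting a file within one session, but each page still receives a fresh random point ID. One possible hardening, sketched below, is to derive the ID deterministically from the file hash and page number, so that writing the same page twice targets the same point. `PAGE_ID_NAMESPACE` and `page_point_id` are hypothetical names, and the namespace string is arbitrary.

```python
import uuid

# Hypothetical namespace for this pipeline; any fixed UUID would do.
PAGE_ID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "pagerag/financial_docs")


def page_point_id(file_hash: str, page: int) -> str:
    """Derive a stable UUID string from the file hash and page number."""
    return str(uuid.uuid5(PAGE_ID_NAMESPACE, f"{file_hash}:{page}"))


file_hash = "c08079bc14250c896f3ca151f9a72ecc1ddcb9ca8e5b021539e91af10fae5c4b"
ids = [page_point_id(file_hash, page) for page in range(1, 4)]
print(ids)
```

Assuming the installed `langchain_qdrant` version forwards an `ids` argument, these could be passed as `vector_store.add_documents(documents=documents, ids=ids)`; Qdrant upserts by point ID, so a repeated write should overwrite the page instead of duplicating it.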

### Process All PDFs

In [16]:
data_path = Path(DATA_DIR)
pdf_files = list(data_path.rglob("*.pdf"))
print(f"Found {len(pdf_files)} PDF files")
pdf_files[:3]

Found 19 PDF files


[WindowsPath('data/rag-data/amazon/amazon 10-k 2023.pdf'),
 WindowsPath('data/rag-data/amazon/amazon 10-k 2024.pdf'),
 WindowsPath('data/rag-data/amazon/amazon 10-q q1 2024.pdf')]

In [17]:
for pdf_path in pdf_files:
    ingest_docs_in_vectordb(pdf_path)

2025-12-12 14:23:24,218 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:23:24,220 - INFO - Going to convert document batch...
2025-12-12 14:23:24,221 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:23:24,221 - INFO - Accelerator device: 'cuda:0'

Processing: amazon 10-k 2023.pdf


2025-12-12 14:23:25,460 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:23:25,715 - INFO - Processing document amazon 10-k 2023.pdf
2025-12-12 14:23:56,286 - INFO - Finished converting document amazon 10-k 2023.pdf in 32.07 sec.
2025-12-12 14:24:06,484 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:24:10,156 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:24:10,177 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:24:10,181 - INFO - Going to convert document batch...
2025-12-12 14:24:10,182 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:24:10,184 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 93 pages from amazon 10-k 2023.pdf
Processing: amazon 10-k 2024.pdf


2025-12-12 14:24:10,386 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:24:10,387 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:24:11,781 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:24:12,094 - INFO - Processing document amazon 10-k 2024.pdf
2025-12-12 14:24:44,865 - INFO - Finished converting document amazon 10-k 2024.pdf in 34.69 sec.
2025-12-12 14:24:55,896 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:24:59,186 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:24:59,193 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:24:59,197 - INFO - Going to convert document batch...
2025-12-12 14:24:59,198 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:24:59,200 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 88 pages from amazon 10-k 2024.pdf
Processing: amazon 10-q q1 2024.pdf


2025-12-12 14:24:59,394 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:24:59,394 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:25:00,694 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:25:00,970 - INFO - Processing document amazon 10-q q1 2024.pdf
2025-12-12 14:25:16,255 - INFO - Finished converting document amazon 10-q q1 2024.pdf in 17.06 sec.
2025-12-12 14:25:24,170 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:25:24,180 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:25:24,182 - INFO - Going to convert document batch...
2025-12-12 14:25:24,183 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:25:24,184 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 52 pages from amazon 10-q q1 2024.pdf
Processing: amazon 10-q q1 2025.pdf


2025-12-12 14:25:24,444 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:25:24,444 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:25:25,468 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:25:25,748 - INFO - Processing document amazon 10-q q1 2025.pdf
2025-12-12 14:25:40,906 - INFO - Finished converting document amazon 10-q q1 2025.pdf in 16.73 sec.
2025-12-12 14:25:48,642 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:25:48,653 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:25:48,654 - INFO - Going to convert document batch...
2025-12-12 14:25:48,655 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:25:48,655 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 50 pages from amazon 10-q q1 2025.pdf
Processing: amazon 10-q q2 2024.pdf


2025-12-12 14:25:48,876 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:25:48,877 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:25:49,915 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:25:50,185 - INFO - Processing document amazon 10-q q2 2024.pdf
2025-12-12 14:26:07,586 - INFO - Finished converting document amazon 10-q q2 2024.pdf in 18.93 sec.
2025-12-12 14:26:16,180 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:26:16,202 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:26:16,204 - INFO - Going to convert document batch...
2025-12-12 14:26:16,205 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:26:16,205 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 51 pages from amazon 10-q q2 2024.pdf
Processing: amazon 10-q q2 2025.pdf


2025-12-12 14:26:16,458 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:26:16,460 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:26:17,518 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:26:17,777 - INFO - Processing document amazon 10-q q2 2025.pdf
2025-12-12 14:26:35,115 - INFO - Finished converting document amazon 10-q q2 2025.pdf in 18.91 sec.
2025-12-12 14:26:43,665 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:26:43,679 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:26:43,683 - INFO - Going to convert document batch...
2025-12-12 14:26:43,685 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:26:43,685 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 51 pages from amazon 10-q q2 2025.pdf
Processing: amazon 10-q q3 2024.pdf


2025-12-12 14:26:44,990 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:26:45,311 - INFO - Processing document amazon 10-q q3 2024.pdf
2025-12-12 14:27:11,385 - INFO - Finished converting document amazon 10-q q3 2024.pdf in 27.71 sec.
2025-12-12 14:27:21,794 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:27:29,701 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:27:30,666 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:27:30,682 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:27:30,684 - INFO - Going to convert document batch...
2025-12-12 14:27:30,685 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:27:30,685 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 150 pages from amazon 10-q q3 2024.pdf
Processing: apple 10-k 2023.pdf


2025-12-12 14:27:31,870 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:27:32,130 - INFO - Processing document apple 10-k 2023.pdf
[INFO] 2025-12-12 14:27:38,567 [RapidOCR] download_file.py:68: Initiating download: https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/v3.4.0/resources/fonts/FZYTK.TTF
[ERROR] 2025-12-12 14:27:42,024 [RapidOCR] download_file.py:74: Download failed: https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/v3.4.0/resources/fonts/FZYTK.TTF
2025-12-12 14:27:42,026 - ERROR - Stage ocr failed for run 1: Failed to download https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/v3.4.0/resources/fonts/FZYTK.TTF
Traceback (most recent call last):
  File "c:\Users\laxmi\anaconda3\envs\ml\Lib\site-packages\urllib3\connection.py", line 198, in _new_conn
    sock = connection.create_connection(
        (self._dns_host, self.port),
    ...<2 lines>...
        socket_options=self.socket_options,
    )
  File "c:\Users\laxmi\anaconda3\envs\m

[DONE] Ingested 76 pages from apple 10-k 2023.pdf
Processing: apple 10-k 2024.pdf


2025-12-12 14:28:15,905 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:28:15,906 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:28:16,953 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:28:17,274 - INFO - Processing document apple 10-k 2024.pdf
[INFO] 2025-12-12 14:28:23,754 [RapidOCR] download_file.py:68: Initiating download: https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/v3.4.0/resources/fonts/FZYTK.TTF
[INFO] 2025-12-12 14:28:27,048 [RapidOCR] download_file.py:82: Download size: 3.09MB
[INFO] 2025-12-12 14:28:27,840 [RapidOCR] download_file.py:95: Successfully saved to: C:\Users\laxmi\anaconda3\envs\ml\Lib\site-packages\rapidocr\models\FZYTK.TTF
2025-12-12 14:28:54,471 - INFO - Finished converting document apple 10-k 2024.pdf in 38.79 sec.
2025-12-12 14:29:05,745 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:29:12,275 - INFO - HTTP 

[DONE] Ingested 121 pages from apple 10-k 2024.pdf
Processing: apple 10-q q1 2024.pdf


2025-12-12 14:29:12,519 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:29:12,519 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:29:13,591 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:29:13,878 - INFO - Processing document apple 10-q q1 2024.pdf
2025-12-12 14:29:24,906 - INFO - Finished converting document apple 10-q q1 2024.pdf in 12.62 sec.
2025-12-12 14:29:28,942 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:29:28,960 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:29:28,963 - INFO - Going to convert document batch...
2025-12-12 14:29:28,964 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:29:28,964 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 28 pages from apple 10-q q1 2024.pdf
Processing: apple 10-q q2 2024.pdf


2025-12-12 14:29:30,206 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:29:30,516 - INFO - Processing document apple 10-q q2 2024.pdf
2025-12-12 14:29:41,857 - INFO - Finished converting document apple 10-q q2 2024.pdf in 12.90 sec.
2025-12-12 14:29:45,962 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:29:45,980 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:29:45,981 - INFO - Going to convert document batch...
2025-12-12 14:29:45,982 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:29:45,982 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 28 pages from apple 10-q q2 2024.pdf
Processing: apple 10-q q4 2023.pdf


2025-12-12 14:29:46,232 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:29:46,232 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:29:47,259 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:29:47,532 - INFO - Processing document apple 10-q q4 2023.pdf
2025-12-12 14:29:56,602 - INFO - Finished converting document apple 10-q q4 2023.pdf in 10.62 sec.
2025-12-12 14:30:00,151 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:30:00,169 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:30:00,171 - INFO - Going to convert document batch...
2025-12-12 14:30:00,171 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:30:00,172 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 28 pages from apple 10-q q4 2023.pdf
Processing: apple 8-k q4 2023.pdf


2025-12-12 14:30:01,514 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:30:01,777 - INFO - Processing document apple 8-k q4 2023.pdf
2025-12-12 14:30:04,769 - INFO - Finished converting document apple 8-k q4 2023.pdf in 4.60 sec.
2025-12-12 14:30:06,259 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:30:06,270 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:30:06,274 - INFO - Going to convert document batch...
2025-12-12 14:30:06,275 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:30:06,276 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 9 pages from apple 8-k q4 2023.pdf
Processing: google 10-k 2023.pdf


2025-12-12 14:30:06,472 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:30:07,462 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:30:07,736 - INFO - Processing document google 10-k 2023.pdf
2025-12-12 14:30:44,705 - INFO - Finished converting document google 10-k 2023.pdf in 38.44 sec.
2025-12-12 14:30:54,720 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:31:01,879 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:31:01,890 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:31:01,894 - INFO - Going to convert document batch...
2025-12-12 14:31:01,895 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:31:01,895 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 111 pages from google 10-k 2023.pdf
Processing: google 10-k 2024.pdf


2025-12-12 14:31:03,180 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:31:03,436 - INFO - Processing document google 10-k 2024.pdf
2025-12-12 14:31:38,813 - INFO - Finished converting document google 10-k 2024.pdf in 36.92 sec.
2025-12-12 14:31:49,106 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:31:56,088 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:31:56,108 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:31:56,110 - INFO - Going to convert document batch...
2025-12-12 14:31:56,111 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:31:56,112 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 108 pages from google 10-k 2024.pdf
Processing: google 10-q q1 2025.pdf


2025-12-12 14:31:56,361 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:31:56,362 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:31:57,381 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:31:57,668 - INFO - Processing document google 10-q q1 2025.pdf
2025-12-12 14:32:19,361 - INFO - Finished converting document google 10-q q1 2025.pdf in 23.25 sec.
2025-12-12 14:32:28,432 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:32:28,446 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:32:28,447 - INFO - Going to convert document batch...
2025-12-12 14:32:28,448 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:32:28,449 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 52 pages from google 10-q q1 2025.pdf
Processing: google 10-q q2 2024.pdf


2025-12-12 14:32:28,684 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:32:28,685 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:32:29,732 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:32:30,014 - INFO - Processing document google 10-q q2 2024.pdf
2025-12-12 14:32:55,423 - INFO - Finished converting document google 10-q q2 2024.pdf in 26.98 sec.
2025-12-12 14:33:05,621 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:33:05,645 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:33:05,647 - INFO - Going to convert document batch...
2025-12-12 14:33:05,648 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:33:05,649 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 55 pages from google 10-q q2 2024.pdf
Processing: google 10-q q2 2025.pdf


2025-12-12 14:33:05,895 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:33:05,896 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:33:06,893 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:33:07,147 - INFO - Processing document google 10-q q2 2025.pdf
2025-12-12 14:33:36,131 - INFO - Finished converting document google 10-q q2 2025.pdf in 30.49 sec.
2025-12-12 14:33:47,257 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"
2025-12-12 14:33:47,267 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-12 14:33:47,269 - INFO - Going to convert document batch...
2025-12-12 14:33:47,269 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 44ae89a68fc272bc7889292e9b5a1bad
2025-12-12 14:33:47,271 - INFO - Accelerator device: 'cuda:0'

[DONE] Ingested 59 pages from google 10-q q2 2025.pdf
Processing: google 10-q q3 2024.pdf


2025-12-12 14:33:47,492 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-12 14:33:47,493 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:33:48,552 - INFO - Accelerator device: 'cuda:0'
2025-12-12 14:33:48,806 - INFO - Processing document google 10-q q3 2024.pdf
2025-12-12 14:34:12,517 - INFO - Finished converting document google 10-q q3 2024.pdf in 25.25 sec.
2025-12-12 14:34:24,029 - INFO - HTTP Request: PUT http://localhost:6333/collections/financial_docs/points?wait=true "HTTP/1.1 200 OK"


[DONE] Ingested 56 pages from google 10-q q3 2024.pdf
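
In the run above, the OCR-stage download error for `apple 10-k 2023.pdf` was logged and ingestion carried on. An exception raised out of conversion or the embedding call, however, would stop the plain `for` loop partway through the batch. A minimal sketch of a more forgiving loop follows; `ingest_all` is a hypothetical wrapper, and `fake_ingest` stands in for `ingest_docs_in_vectordb` so the example runs without Qdrant.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def ingest_all(
    pdf_paths: Iterable[Path],
    ingest_fn: Callable[[Path], None],
) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Run ingest_fn on every path; collect failures instead of aborting."""
    succeeded: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    for pdf_path in pdf_paths:
        try:
            ingest_fn(pdf_path)
            succeeded.append(pdf_path)
        except Exception as exc:  # broad on purpose: keep the batch going
            failed.append((pdf_path, f"{type(exc).__name__}: {exc}"))
    return succeeded, failed


# Stand-in for ingest_docs_in_vectordb (hypothetical failure condition).
def fake_ingest(pdf_path: Path) -> None:
    if "broken" in pdf_path.name:
        raise ValueError("could not parse PDF")


ok, errors = ingest_all(
    [Path("amazon 10-k 2023.pdf"), Path("broken file.pdf")], fake_ingest
)
print(len(ok), errors)
```

The returned `failed` list gives a single place to review which files need to be retried once the batch has finished.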


### Verify Ingestion

In [18]:
collection_info = vector_store.client.get_collection(COLLECTION_NAME)
print(f"Total documents in Qdrant: {collection_info.points_count}")

2025-12-12 14:34:24,080 - INFO - HTTP Request: GET http://localhost:6333/collections/financial_docs "HTTP/1.1 200 OK"


Total documents in Qdrant: 1266
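
Since each page becomes one point, the point count should equal the sum of the per-file page counts. The quick cross-check below uses the page counts copied from the `[DONE] Ingested N pages` lines of the run above; `pages_per_file` is just a name chosen for this check.

```python
# Page counts copied from the "[DONE] Ingested N pages" lines of the run above.
pages_per_file = {
    "amazon 10-k 2023.pdf": 93,
    "amazon 10-k 2024.pdf": 88,
    "amazon 10-q q1 2024.pdf": 52,
    "amazon 10-q q1 2025.pdf": 50,
    "amazon 10-q q2 2024.pdf": 51,
    "amazon 10-q q2 2025.pdf": 51,
    "amazon 10-q q3 2024.pdf": 150,
    "apple 10-k 2023.pdf": 76,
    "apple 10-k 2024.pdf": 121,
    "apple 10-q q1 2024.pdf": 28,
    "apple 10-q q2 2024.pdf": 28,
    "apple 10-q q4 2023.pdf": 28,
    "apple 8-k q4 2023.pdf": 9,
    "google 10-k 2023.pdf": 111,
    "google 10-k 2024.pdf": 108,
    "google 10-q q1 2025.pdf": 52,
    "google 10-q q2 2024.pdf": 55,
    "google 10-q q2 2025.pdf": 59,
    "google 10-q q3 2024.pdf": 56,
}

expected_points = sum(pages_per_file.values())
print(len(pages_per_file), expected_points)  # 19 files, 1266 pages
```

The total matches the 1266 points reported by Qdrant, so every extracted page was stored.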


In [20]:
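# Hybrid search example (dense + sparse).
# Note: the corpus ingested above holds only Amazon, Apple, and Google filings,
# so this Tesla query has no true match. Without a metadata filter, the search
# still returns the k nearest pages from other companies.
results = vector_store.similarity_search(
    "What is Tesla's revenue for Q1 2024?",
    k=3
)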

2025-12-12 14:47:46,638 - INFO - HTTP Request: POST http://localhost:6333/collections/financial_docs/points/query "HTTP/1.1 200 OK"


In [21]:
results

[Document(metadata={'fiscal_quarter': 'q3', 'fiscal_year': 2024, 'company_name': 'google', 'doc_type': '10-q', 'page': 40, 'file_hash': '51ab83179bff647b1b2521836748d83939671d40a1d954fad9df93cd720e5784', 'source_file': 'google 10-q q3 2024.pdf', '_id': '0f20a018-982d-4411-84c0-6212a49f5f4e', '_collection_name': 'financial_docs'}, page_content='\n\nThe  following  table  presents  the  foreign  currency  exchange  effect  on  international  revenues  and  total  revenues  (in  millions,  except percentages):\n\n|                                    |                                  |                                  | Three Months Ended September 30, 2024   | Three Months Ended September 30, 2024   | Three Months Ended September 30, 2024   | Three Months Ended September 30, 2024   | Three Months Ended September 30, 2024   | Three Months Ended September 30, 2024   |\n|------------------------------------|----------------------------------|----------------------------------|--------------
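
The result above comes from a Google filing even though the query asks about another company, which shows why the stored metadata matters: restricting the search by company, filing type, or fiscal period narrows the candidates before ranking. The sketch below builds such a restriction in Qdrant's JSON filter form as a plain dictionary. `build_metadata_filter` is a hypothetical helper, and the `metadata.` key prefix assumes the default payload layout in which `langchain_qdrant` nests document metadata under a `metadata` key.

```python
from typing import Any, Dict, List, Optional


def build_metadata_filter(
    company_name: Optional[str] = None,
    doc_type: Optional[str] = None,
    fiscal_year: Optional[int] = None,
    fiscal_quarter: Optional[str] = None,
) -> Dict[str, List[Dict[str, Any]]]:
    """Build a Qdrant-style filter (JSON form) over the ingested metadata fields."""
    wanted = {
        "company_name": company_name,
        "doc_type": doc_type,
        "fiscal_year": fiscal_year,
        "fiscal_quarter": fiscal_quarter,
    }
    # One exact-match condition per field that was actually requested.
    must = [
        {"key": f"metadata.{field}", "match": {"value": value}}
        for field, value in wanted.items()
        if value is not None
    ]
    return {"must": must}


conditions = build_metadata_filter(company_name="google", fiscal_year=2024)
print(conditions)
```

With `qdrant_client` installed, this dictionary should be convertible with `models.Filter(**conditions)` and passed to `vector_store.similarity_search(query, k=3, filter=...)`; that last step is untested here and depends on the installed library versions.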