In this notebook, I use Docling to parse a PDF financial report and export it to Markdown so that complex tables keep their structure. The Markdown is split into nodes with LlamaIndex's MarkdownElementNodeParser, embedded with Gemini (gemini-embedding-001), and persisted in a local ChromaDB collection. I then build two query engines: one that requests hybrid retrieval, and one that retrieves more candidates and reranks them with BAAI/bge-reranker-v2-m3. On a spot check with a specific financial query ('R&D expense in 2025Q3'), both engines returned the same figure, $15,151 million.

In [1]:
from docling.document_converter import DocumentConverter
from pathlib import Path

source = Path("../data/goog_2025Q3.pdf")
converter = DocumentConverter()
result = converter.convert(source)
# print(result.document.export_to_markdown())

2026-02-09 00:01:33,700 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-02-09 00:01:33,838 - INFO - Going to convert document batch...
2026-02-09 00:01:33,840 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2026-02-09 00:01:33,855 - INFO - Loading plugin 'docling_defaults'
2026-02-09 00:01:33,859 - INFO - Registered picture descriptions: ['vlm', 'api']
2026-02-09 00:01:33,872 - INFO - Loading plugin 'docling_defaults'
2026-02-09 00:01:33,881 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2026-02-09 00:01:34,956 - INFO - Accelerator device: 'cpu'
[INFO] 2026-02-09 00:01:34,977 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-02-09 00:01:34,994 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\jyflo\miniforge3\envs\fr_rag_side\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-02-09 00:01:

In [2]:
from llama_index.core import Document

content_md = result.document.export_to_markdown()
file_name = result.input.file.name

meta_data = {
    "file_name": file_name,
    "company": "Alphabet", 
    "year": "2025",
    "quarter": "Q3"
}
doc = Document(text=content_md, metadata=meta_data)
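
The metadata above is typed in by hand, which gets error-prone once more than one report is ingested. As a minimal sketch, assuming every file follows the `<ticker>_<year>Q<quarter>.pdf` pattern seen in `goog_2025Q3.pdf`, the same dictionary could be derived from the file name. The helper `metadata_from_filename` and the lookup table `TICKER_TO_COMPANY` are hypothetical names introduced for illustration.

```python
import re
from pathlib import Path

# Hypothetical ticker-to-company lookup; extend it for other filings.
TICKER_TO_COMPANY = {"goog": "Alphabet"}

# Assumed naming convention: <ticker>_<year>Q<quarter>, e.g. goog_2025Q3
FILENAME_PATTERN = re.compile(
    r"^(?P<ticker>[a-z]+)_(?P<year>\d{4})Q(?P<quarter>[1-4])$",
    re.IGNORECASE,
)


def metadata_from_filename(file_name: str) -> dict:
    """Build the node metadata from a report file name."""
    match = FILENAME_PATTERN.match(Path(file_name).stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name!r}")
    ticker = match["ticker"].lower()
    return {
        "file_name": file_name,
        # Fall back to the upper-cased ticker when the company is unknown.
        "company": TICKER_TO_COMPANY.get(ticker, ticker.upper()),
        "year": match["year"],
        "quarter": f"Q{match['quarter']}",
    }


print(metadata_from_filename("goog_2025Q3.pdf"))
```

Raising on an unexpected name is a deliberate choice here: a silently wrong `year` or `quarter` would be hard to spot later when filtering on metadata.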

In [3]:
# Initialize ChromaDB and wrap the collection as a LlamaIndex vector store
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

db = chromadb.PersistentClient(path="../chroma_db")
chroma_collection = db.get_or_create_collection("financial_reports")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

2026-02-09 00:06:24,930 - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.


In [4]:
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
import os
from dotenv import load_dotenv

load_dotenv(override=True)
llm = GoogleGenAI(
    model="gemini-2.5-flash",
    api_key=os.environ.get("GOOGLE_API_KEY")
)


2026-02-09 00:06:33,542 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash "HTTP/1.1 200 OK"
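
`os.environ.get("GOOGLE_API_KEY")` returns `None` when the variable is missing, so a misconfigured `.env` file only surfaces later as an authentication error. A small guard can fail earlier with a clearer message. This is a sketch only; `require_env` is a hypothetical helper, not part of LlamaIndex or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to .env or export it."
        )
    return value


# Possible usage in the cell above (assumes load_dotenv() has already run):
# llm = GoogleGenAI(model="gemini-2.5-flash", api_key=require_env("GOOGLE_API_KEY"))
```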


In [5]:

# Parse the document into nodes, embed them, and store them in the local ChromaDB collection

embed_model = GoogleGenAIEmbedding(model_name="models/gemini-embedding-001")

node_parser = MarkdownElementNodeParser(
    llm=llm, 
    num_workers=4  
)

nodes = node_parser.get_nodes_from_documents([doc])

index = VectorStoreIndex(
    nodes, 
    embed_model=embed_model,
    storage_context=storage_context,
    show_progress=True
)

print(f"Successfully stored {len(nodes)} of nodes from {file_name} in ChromaDB.")

64it [00:00, ?it/s]
2026-02-09 00:06:59,809 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:06:59,816 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:06:59,819 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:06:59,826 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:07:04,132 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:07:04,176 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:07:05,021 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:07:06,632 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:07:12,448 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:07:12,501 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:07:12,798 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:07:13,861 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:07:17,492 - INFO - AFC is enabled with max remote calls: 10.
2026-

Generating embeddings:   0%|          | 0/202 [00:00<?, ?it/s]

2026-02-09 00:09:03,802 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents "HTTP/1.1 200 OK"
2026-02-09 00:09:04,352 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents "HTTP/1.1 200 OK"
2026-02-09 00:09:04,874 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents "HTTP/1.1 200 OK"
2026-02-09 00:09:05,416 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents "HTTP/1.1 200 OK"
2026-02-09 00:09:05,884 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents "HTTP/1.1 200 OK"
2026-02-09 00:09:06,420 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents "HTTP/1.1 200 OK"
2026

Successfully stored 202 of nodes from goog_2025Q3.pdf in ChromaDB.
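
To make the chunking step less of a black box, here is a rough sketch of the idea behind element-based Markdown parsing: consecutive table rows are kept together as one element instead of being cut at an arbitrary character count. This is a simplified illustration in plain Python, not the actual `MarkdownElementNodeParser` implementation, and the sample text is a toy input rather than an excerpt from the report.

```python
def split_markdown_elements(markdown: str) -> list[tuple[str, str]]:
    """Group consecutive lines into ("table", block) or ("text", block) elements."""
    elements = []
    current_kind = None
    current_lines = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            kind = None  # a blank line closes the current element
        elif stripped.startswith("|"):
            kind = "table"  # Markdown table rows start with a pipe
        else:
            kind = "text"
        # Flush the element collected so far whenever the kind changes.
        if kind != current_kind and current_lines:
            elements.append((current_kind, "\n".join(current_lines)))
            current_lines = []
        current_kind = kind
        if kind is not None:
            current_lines.append(line)
    if current_lines:
        elements.append((current_kind, "\n".join(current_lines)))
    return elements


# Toy input, made up for illustration.
sample = """## Costs and expenses

| Item | Amount |
|------|--------|
| Research and development | 15,151 |

Amounts are in millions."""

for kind, block in split_markdown_elements(sample):
    print(kind, "->", len(block.splitlines()), "line(s)")
```

Keeping a table in one element is what lets a later query about a single row retrieve the header and units along with the value.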


In [6]:
# Check that the embeddings were stored correctly

print(f"List all the collections: {db.list_collections()}")
collection = db.get_collection("financial_reports")
count = collection.count()  # number of records (nodes) in the collection
print(f"The number of records in the collection: {count}")
assert count == len(nodes)

if count > 0:
    sample = collection.get(limit=2, include=['embeddings', 'documents', 'metadatas'])
    print(f"The sample record id: {sample['ids']}")
    print(f"The sample meta data: {sample['metadatas'][0]}")

    if sample['embeddings'] is not None:
        print(f"Embedding checked! The dimension is {len(sample['embeddings'][0])}")


List all the collections: [Collection(name=financial_reports)]
The number of records in the collection: 202
The sample record id: ['11fb01fc-dba6-4f1b-b8dc-f3e90af9dcda', '5f255dc3-cfef-435c-a507-e14eef7d4868']
The sample meta data: {'_node_type': 'TextNode', 'ref_doc_id': 'c1ff7fd0-0323-41cc-9eb2-a84d99baaabb', 'doc_id': 'c1ff7fd0-0323-41cc-9eb2-a84d99baaabb', 'company': 'Alphabet', 'quarter': 'Q3', 'file_name': 'goog_2025Q3.pdf', 'year': '2025', '_node_content': '{"id_": "11fb01fc-dba6-4f1b-b8dc-f3e90af9dcda", "embedding": null, "metadata": {"file_name": "goog_2025Q3.pdf", "company": "Alphabet", "year": "2025", "quarter": "Q3"}, "excluded_embed_metadata_keys": [], "excluded_llm_metadata_keys": [], "relationships": {"1": {"node_id": "c1ff7fd0-0323-41cc-9eb2-a84d99baaabb", "node_type": "4", "metadata": {"file_name": "goog_2025Q3.pdf", "company": "Alphabet", "year": "2025", "quarter": "Q3"}, "hash": "bca5990f068a244c9428202bb7a1eb6b90403664ea53d1ac24f9d099d6849b62", "class_name": "RelatedNodeInfo"}, 
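
The `assert count == len(nodes)` check holds on a fresh collection. Because the collection is persisted on disk and fetched with `get_or_create_collection`, running the ingestion cell a second time would presumably add another copy of every node under new random IDs, and the assertion would then fail. One way to make re-ingestion detectable is to derive IDs from the content. The sketch below uses only `hashlib`; `stable_chunk_id` is a hypothetical helper, and how a given vector store reacts to a repeated ID (reject, ignore, or overwrite) is something to verify before relying on it.

```python
import hashlib


def stable_chunk_id(file_name: str, chunk_text: str) -> str:
    """Return a deterministic ID for a chunk, based on its source file and text."""
    payload = f"{file_name}\n{chunk_text}".encode("utf-8")
    # 32 hex characters keep the ID short while making collisions very unlikely.
    return hashlib.sha256(payload).hexdigest()[:32]


already_ingested = {stable_chunk_id("goog_2025Q3.pdf", "first chunk")}

# A chunk seen before maps to the same ID, so it can be skipped on a re-run.
for text in ["first chunk", "second chunk"]:
    chunk_id = stable_chunk_id("goog_2025Q3.pdf", text)
    status = "skip" if chunk_id in already_ingested else "ingest"
    print(chunk_id, status)
```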

In [7]:
# Run one similarity query directly against the local ChromaDB collection
query_text = "What is the revenue of 2025Q3?"
query_vector = embed_model.get_query_embedding(query_text)

results = collection.query(
    query_embeddings=[query_vector]
)
# results['documents'][0] is the list of matches for the first (and only) query
print("The most relevant documents are:", results['documents'][0])


2026-02-09 00:10:56,546 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents "HTTP/1.1 200 OK"


The most relevant documents are: ['Total constant currency revenues of $101.2 billion for the three months ended September 30, 2025 increased $13.0 billion compared to $88.3 billion in revenues, excluding hedging effect, for the three months ended September 30, 2024. (1)\n\nEMEA revenue growth was favorably affected by changes in foreign currency exchange rates, primarily due to the U.S. dollar weakening relative to the euro and British pound.\n\nAPAC revenue growth was not materially affected by changes in foreign currency exchange rates.\n\nOther Americas revenue growth was unfavorably affected by changes in foreign currency exchange rates, primarily due to the U.S. dollar strengthening relative to the Argentine peso.', 'The following table presents revenues disaggregated by geography, based on the addresses of our customers (in millions):', "This table presents a breakdown of Google's revenues by segment, including Google Search & other, YouTube ads, Google Network, Google advertising
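
Conceptually, the query above embeds the question and ranks the stored vectors by how close they are to it. The sketch below shows that ranking with cosine similarity on made-up three-dimensional vectors. It is an illustration only: the real embeddings are far larger, the record names are invented, and the distance function Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], records: dict[str, list[float]], k: int = 3):
    """Return the k record names most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in records.items()
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy vectors, invented for illustration.
records = {
    "revenue": [0.9, 0.1, 0.0],
    "geography": [0.5, 0.5, 0.0],
    "legal": [0.0, 0.1, 0.9],
}
for name, score in top_k([1.0, 0.0, 0.0], records, k=2):
    print(f"{name}: {score:.3f}")
```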

In [11]:
# Query engine that requests hybrid (dense + sparse) retrieval
from llama_index.core import Settings

Settings.llm = GoogleGenAI(model="models/gemini-2.5-flash")
Settings.embed_model = GoogleGenAIEmbedding(model_name="models/gemini-embedding-001")

index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=Settings.embed_model 
)

fast_hybrid_query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid", 
    similarity_top_k=5,        
    sparse_top_k=5             
)


2026-02-09 00:11:54,433 - INFO - HTTP Request: GET https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash "HTTP/1.1 200 OK"
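
Hybrid retrieval combines a dense (embedding) ranking with a sparse (keyword) ranking. It is worth checking that the vector store in use actually implements the `hybrid` query mode: I have not verified that `ChromaVectorStore` does, and a store that ignores the mode would quietly return dense-only results. As a minimal sketch of what the fusion step can look like, the function below applies reciprocal rank fusion (RRF), which is one common choice and not necessarily what LlamaIndex does here. The document IDs and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["a", "b", "c"]   # hypothetical result of the embedding search
sparse_ranking = ["c", "a", "d"]  # hypothetical result of a keyword search

print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever, which is the behavior a hybrid engine is meant to provide.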


In [12]:
# Hybrid retrieval followed by reranking
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank_postprocessor = SentenceTransformerRerank(
    model="BAAI/bge-reranker-v2-m3", 
    top_n=3  
)

deep_rerank_engine = index.as_query_engine(
    vector_store_query_mode="hybrid",
    similarity_top_k=10,             
    sparse_top_k=10,
    node_postprocessors=[rerank_postprocessor] 
)
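
The second engine follows a two-stage pattern: retrieve a wider candidate set (10 per mode here), then let a more expensive scorer pick the best few (`top_n=3`). The sketch below mirrors that flow in plain Python. The scorer `token_overlap_score` is a deliberately crude stand-in for the reranker model, and the candidate passages are invented, so this shows the control flow rather than the model's behavior.

```python
from typing import Callable


def token_overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-order first-stage candidates by score_fn and keep the best top_n."""
    ordered = sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)
    return ordered[:top_n]


# Invented first-stage candidates, in the order a retriever might return them.
candidates = [
    "Revenues by geography table",
    "Research and development expenses table",
    "Sales and marketing expenses table",
]
print(rerank("research and development expenses", candidates, token_overlap_score, top_n=2))
```

Retrieving more candidates than are finally used gives the reranker room to promote a passage that the first stage ranked low.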

In [13]:
# Generate the answer with both engines for the same question
import nest_asyncio
nest_asyncio.apply()

question = "What is the R&D expense of 2025Q3?"

response_fast = fast_hybrid_query_engine.query(question)
print("--- The response of fast ---")
print(response_fast.response)

response_deep = deep_rerank_engine.query(question)
print("--- The response of deep rerank ---")
print(response_deep.response)

2026-02-09 00:12:01,559 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents "HTTP/1.1 200 OK"
2026-02-09 00:12:01,597 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:12:02,589 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent "HTTP/1.1 200 OK"


--- The response of fast ---
Research and development expenses for the three months ended September 30, 2025, were $15,151 million.


2026-02-09 00:12:02,807 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:batchEmbedContents "HTTP/1.1 200 OK"


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

2026-02-09 00:12:19,382 - INFO - AFC is enabled with max remote calls: 10.
2026-02-09 00:12:20,390 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent "HTTP/1.1 200 OK"


--- The response of deep rerank ---
Research and development expenses for the three months ended September 30, 2025, were $15,151 million.
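
Comparing the two answers by eye works for one question, but it does not scale to a list of queries. As a rough sketch, the helper below pulls the first dollar amount out of each answer and checks that the engines agree. The names `extract_amount_millions` and `answers_agree` are hypothetical, and treating a missing unit as millions is an assumption that fits the answers shown here.

```python
import re

# Matches amounts such as "$15,151 million" or "$101.2 billion".
AMOUNT_PATTERN = re.compile(
    r"\$\s?(\d[\d,]*(?:\.\d+)?)\s*(million|billion)?", re.IGNORECASE
)


def extract_amount_millions(answer: str):
    """Return the first dollar amount in the answer, in millions, or None."""
    match = AMOUNT_PATTERN.search(answer)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    unit = (match.group(2) or "million").lower()  # assumption: default to millions
    return value * 1000 if unit == "billion" else value


def answers_agree(answers: list[str], tolerance: float = 0.5) -> bool:
    """True if every answer contains an amount and all amounts match within tolerance."""
    amounts = [extract_amount_millions(a) for a in answers]
    if not amounts or any(a is None for a in amounts):
        return False
    return max(amounts) - min(amounts) <= tolerance


fast_answer = "Research and development expenses for the three months ended September 30, 2025, were $15,151 million."
deep_answer = "Research and development expenses for the three months ended September 30, 2025, were $15,151 million."
print(extract_amount_millions(fast_answer), answers_agree([fast_answer, deep_answer]))
```

Agreement between the two engines is not the same as correctness, so a fuller evaluation would also compare each extracted amount with a value read directly from the filing.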
