# General Tips
## Using virtual environments
**Step 1:** CD to desired directory and Create a Virtual Environment `python3 -m venv myenv`. (Run `py -3.13 -m venv myenv` for a specific version of python)

Check your python installed versions with `py -0` on Windows (`python3 --version` on Linux)

**Step 2:** Activate the Environment `source myenv/bin/activate` (on Linux) and `myenv\Scripts\activate` (on Windows).

**Step 3:** Install Any Needed Packages. e.g: `pip install requests pandas`. Or better to use `requirements.txt` file (`pip install -r requirements.txt`)

**Step 4:** List All Installed Packages using `pip list`

## Connecting the Jupyter Notebook to the vistual env
1. Make sure that myenv is activate (`myenv\Scripts\activate`)
2. Run this inside the virtual environment: `pip install ipykernel`
3. Still inside the environment: `python -m ipykernel install --user --name=myenv --display-name "Whatever Python Kernel Name"`
   
   --name=myenv: internal identifier for the kernel
   
   --display-name: name that shows up in VS Code kernel picker
4. Open VS Code and select the kernel

   At the top-right, click "Select Kernel".
   Look for “Whatever Python Kernel Name” — pick that.
5. If you don’t see it right away, try: Reloading VS Code, Or running Reload Window from Command Palette (Ctrl+Shift+P)

## Useful Commands
1. Use `py -0` to check which python installation we have on Windows

In [3]:
# ============================================================================
# Step 1: Setup and Imports
# ============================================================================

# %% [markdown]
# # FinanceBench RAG Pipeline - Clean Modular Approach
# 
# This notebook processes financial documents and creates vector embeddings
# for retrieval-augmented generation (RAG).

# %% [markdown]
# ## 1.1 Install Requirements
# 
# Make sure you have installed:
# ```bash
# pip install -r requirements.txt
# ```

# %% [markdown]
# ## 1.2 Imports

# %%
import os
import shutil
import time
from pathlib import Path
from typing import List, Dict, Optional, Tuple

# Environment
from dotenv import load_dotenv

# Progress
from tqdm.auto import tqdm

# Data
import pandas as pd
from datasets import load_dataset

# Document processing
from llama_index.core.schema import Document, BaseNode
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PyMuPDFReader

# Vector stores
from langchain.docstore.document import Document as LCDocument
from langchain.vectorstores import Chroma

print("✓ All imports successful")

# %% [markdown]
# ## 1.3 Load Environment Variables

# %%
# Load .env file
load_dotenv()

# Check environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
VOYAGE_API_KEY = os.getenv("VOYAGE_API_KEY")

if OPENAI_API_KEY:
    print("✓ OpenAI API key loaded")
else:
    print("⚠ OpenAI API key not found (only needed if using OpenAI embeddings)")

print(f"✓ Ollama URL: {OLLAMA_BASE_URL}")

if VOYAGE_API_KEY:
    print("✓ VoyageAI API key loaded")
else:
    print("⚠ VoyageAI API key not found (only needed if using VoyageAI embeddings)")

# %% [markdown]
# ## 1.4 Configuration Variables

# %%
# Paths
PDF_DIR = "../../financebench/documents"
VECTOR_DB_DIR = "../../vector_databases"

# Dataset
DATASET_NAME = "PatronusAI/financebench"
DATASET_SPLIT = "train"

# Processing
COLLECTION_PREFIX = "financebench_docs_chunk_"
CHUNK_OVERLAP_PERCENTAGE = 15

print("✓ Configuration set")
print(f"  PDF Directory: {PDF_DIR}")
print(f"  Vector DB Directory: {VECTOR_DB_DIR}")

✓ All imports successful
✓ OpenAI API key loaded
✓ Ollama URL: http://localhost:11434
✓ VoyageAI API key loaded
✓ Configuration set
  PDF Directory: ../../financebench/documents
  Vector DB Directory: ../../vector_databases


In [4]:
# ============================================================================
# Step 2: Load Dataset and Documents
# ============================================================================

# %% [markdown]
# ## 2.1 Load FinanceBench Dataset

# %%
def load_financebench_dataset(dataset_name: str, split: str):
    """Load the FinanceBench dataset from HuggingFace."""
    print(f"Loading dataset: {dataset_name}")
    ds = load_dataset(dataset_name, split=split)
    print(f"✓ Loaded {len(ds)} records")
    return ds

# %%
# Load dataset
dataset = load_financebench_dataset(DATASET_NAME, DATASET_SPLIT)

# Show sample
print("\nSample record keys:")
for key in dataset[0].keys():
    print(f"  - {key}")

# %% [markdown]
# ## 2.2 Extract Required PDFs

# %%
def get_required_pdfs(dataset) -> set:
    """Extract unique PDF filenames needed."""
    unique_pdfs = set()
    for record in tqdm(dataset, desc="Scanning for PDFs"):
        pdf_filename = record["doc_name"] + ".pdf"
        unique_pdfs.add(pdf_filename)
    print(f"✓ Found {len(unique_pdfs)} unique PDFs required")
    return unique_pdfs

# %%
required_pdfs = get_required_pdfs(dataset)

# %% [markdown]
# ## 2.3 Verify PDF Availability

# %%
def verify_pdfs(pdf_dir: str, required_pdfs: set) -> Tuple[List[str], List[str]]:
    """Check which PDFs are available."""
    available = []
    missing = []
    
    for pdf in tqdm(required_pdfs, desc="Verifying PDFs"):
        path = os.path.join(pdf_dir, pdf)
        if os.path.isfile(path):
            available.append(pdf)
        else:
            missing.append(pdf)
    
    print(f"\n✓ Available: {len(available)} PDFs")
    if missing:
        print(f"✗ Missing: {len(missing)} PDFs")
        for f in missing[:5]:
            print(f"  - {f}")
        if len(missing) > 5:
            print(f"  ... and {len(missing)-5} more")
    
    return available, missing

# %%
available_pdfs, missing_pdfs = verify_pdfs(PDF_DIR, required_pdfs)

# %%
# Check if we can proceed
if missing_pdfs:
    print("\n⚠ Some PDFs are missing")
    proceed = input("Continue with available PDFs only? (y/n): ").lower().strip()
    if proceed != 'y':
        raise SystemExit("Stopped by user")

# %% [markdown]
# ## 2.4 Load PDF Documents

# %%
def load_pdf_documents(pdf_dir: str, pdf_files: List[str]) -> List[Document]:
    """Load PDFs using PyMuPDF."""
    reader = PyMuPDFReader()
    documents = []
    failed = []
    
    for pdf in tqdm(pdf_files, desc="Loading PDFs"):
        path = os.path.join(pdf_dir, pdf)
        try:
            docs = reader.load(path)
            documents.extend(docs)
        except Exception as e:
            failed.append((pdf, str(e)))
            print(f"\n✗ Failed: {pdf}: {e}")
    
    print(f"\n✓ Loaded {len(documents)} pages from {len(pdf_files)-len(failed)} PDFs")
    if failed:
        print(f"✗ Failed to load {len(failed)} PDFs")
    
    return documents

# %%
documents = load_pdf_documents(PDF_DIR, available_pdfs)

# %% [markdown]
# ## 2.5 Analyze Documents

# %%
def analyze_documents(documents: List[Document]) -> Dict:
    """Analyze loaded documents."""
    total_pages = len(documents)
    total_chars = sum(len(doc.text) for doc in documents)
    estimated_tokens = total_chars // 4
    
    char_counts = [len(doc.text) for doc in documents]
    avg_chars = total_chars / total_pages if total_pages > 0 else 0
    
    stats = {
        'total_pages': total_pages,
        'total_characters': total_chars,
        'estimated_tokens': estimated_tokens,
        'avg_chars_per_page': avg_chars,
        'min_chars': min(char_counts) if char_counts else 0,
        'max_chars': max(char_counts) if char_counts else 0
    }
    
    print("\n" + "="*60)
    print("DOCUMENT STATISTICS")
    print("="*60)
    print(f"Total Pages:           {stats['total_pages']:,}")
    print(f"Total Characters:      {stats['total_characters']:,}")
    print(f"Estimated Tokens:      {stats['estimated_tokens']:,}")
    print(f"\nPer-Page Statistics:")
    print(f"  Average:             {stats['avg_chars_per_page']:,.0f} chars")
    print(f"  Min:                 {stats['min_chars']:,} chars")
    print(f"  Max:                 {stats['max_chars']:,} chars")
    print("="*60)
    
    return stats

# %%
doc_stats = analyze_documents(documents)

# %%
print("\n✓ Step 2 complete!")
print(f"  Dataset records: {len(dataset)}")
print(f"  PDFs loaded: {len(available_pdfs)}")
print(f"  Document pages: {len(documents)}")

Loading dataset: PatronusAI/financebench
✓ Loaded 150 records

Sample record keys:
  - financebench_id
  - company
  - doc_name
  - question_type
  - question_reasoning
  - domain_question_num
  - question
  - answer
  - justification
  - dataset_subset_label
  - evidence
  - gics_sector
  - doc_type
  - doc_period
  - doc_link


Scanning for PDFs:   0%|          | 0/150 [00:00<?, ?it/s]

✓ Found 84 unique PDFs required


Verifying PDFs:   0%|          | 0/84 [00:00<?, ?it/s]


✓ Available: 84 PDFs


Loading PDFs:   0%|          | 0/84 [00:00<?, ?it/s]


✓ Loaded 12013 pages from 84 PDFs

DOCUMENT STATISTICS
Total Pages:           12,013
Total Characters:      40,649,449
Estimated Tokens:      10,162,362

Per-Page Statistics:
  Average:             3,384 chars
  Min:                 0 chars
  Max:                 10,738 chars

✓ Step 2 complete!
  Dataset records: 150
  PDFs loaded: 84
  Document pages: 12013


In [5]:
# ============================================================================
# Step 3: Process Documents into Chunks
# ============================================================================

# %% [markdown]
# ## 3.1 Generate Nodes (Chunks)

# %%
def generate_nodes(
    documents: List[Document],
    chunk_size: int,
    chunk_overlap: int
) -> List[BaseNode]:
    """Generate nodes from documents using SentenceSplitter."""
    parser = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    print(f"Generating nodes (size={chunk_size}, overlap={chunk_overlap})...")
    nodes = parser.get_nodes_from_documents(documents, show_progress=True)
    print(f"✓ Created {len(nodes):,} nodes")
    
    return nodes

# %% [markdown]
# ## 3.2 Convert to LangChain Documents

# %%
def nodes_to_langchain_docs(
    nodes: List[BaseNode],
    chunk_size: int
) -> List[LCDocument]:
    """Convert LlamaIndex nodes to LangChain documents."""
    lc_docs = []
    
    for node in tqdm(nodes, desc="Converting to LangChain"):
        metadata = {"chunk_size": chunk_size}
        
        # Add original metadata
        if hasattr(node, 'metadata'):
            metadata.update(node.metadata)
        
        doc = LCDocument(
            page_content=node.get_content(),
            metadata=metadata
        )
        lc_docs.append(doc)
    
    print(f"✓ Converted {len(lc_docs):,} documents")
    return lc_docs

# %% [markdown]
# ## 3.3 Analyze Chunks

# %%
def analyze_chunks(lc_docs: List[LCDocument], chunk_size: int) -> Dict:
    """Analyze generated chunks."""
    chunk_lengths = [len(doc.page_content) for doc in lc_docs]
    total_chunks = len(lc_docs)
    total_chars = sum(chunk_lengths)
    
    stats = {
        'total_chunks': total_chunks,
        'total_characters': total_chars,
        'estimated_tokens': total_chars // 4,
        'avg_length': total_chars / total_chunks if total_chunks > 0 else 0,
        'min_length': min(chunk_lengths) if chunk_lengths else 0,
        'max_length': max(chunk_lengths) if chunk_lengths else 0
    }
    
    print("\n" + "="*60)
    print(f"CHUNK STATISTICS (Size: {chunk_size})")
    print("="*60)
    print(f"Total Chunks:          {stats['total_chunks']:,}")
    print(f"Total Characters:      {stats['total_characters']:,}")
    print(f"Estimated Tokens:      {stats['estimated_tokens']:,}")
    print(f"\nPer-Chunk Statistics:")
    print(f"  Average:             {stats['avg_length']:,.0f} chars")
    print(f"  Min:                 {stats['min_length']:,} chars")
    print(f"  Max:                 {stats['max_length']:,} chars")
    print("="*60)
    
    return stats

# %% [markdown]
# ## 3.4 Process Single Chunk Size

# %%
def process_chunk_size(
    documents: List[Document],
    chunk_size: int,
    overlap_percentage: int = 15
) -> Dict:
    """Process documents for a single chunk size."""
    # Calculate overlap
    chunk_overlap = int(chunk_size * (overlap_percentage / 100))
    
    print(f"\n{'='*60}")
    print(f"PROCESSING CHUNK SIZE: {chunk_size}")
    print(f"{'='*60}")
    print(f"Overlap: {chunk_overlap} chars ({overlap_percentage}%)")
    
    # Generate nodes
    nodes = generate_nodes(documents, chunk_size, chunk_overlap)
    
    # Convert to LangChain docs
    lc_docs = nodes_to_langchain_docs(nodes, chunk_size)
    
    # Analyze
    stats = analyze_chunks(lc_docs, chunk_size)
    
    return {
        'chunk_size': chunk_size,
        'chunk_overlap': chunk_overlap,
        'nodes': nodes,
        'lc_docs': lc_docs,
        'stats': stats
    }

# %% [markdown]
# ## 3.5 Process Multiple Chunk Sizes

# %%
def process_multiple_chunk_sizes(
    documents: List[Document],
    chunk_sizes: List[int],
    overlap_percentage: int = 15
) -> Dict[int, Dict]:
    """Process documents for multiple chunk sizes."""
    print(f"\nProcessing {len(chunk_sizes)} chunk size(s)...")
    
    processed_data = {}
    
    for chunk_size in chunk_sizes:
        data = process_chunk_size(documents, chunk_size, overlap_percentage)
        processed_data[chunk_size] = data
    
    # Summary
    print("\n" + "="*60)
    print("PROCESSING SUMMARY")
    print("="*60)
    for cs, data in processed_data.items():
        stats = data['stats']
        print(f"Chunk {cs}: {stats['total_chunks']:,} chunks, "
              f"~{stats['estimated_tokens']:,} tokens")
    print("="*60)
    
    return processed_data

# %% [markdown]
# ## 3.6 Execute Processing

# %%
# Define which chunk sizes you want to process
CHUNK_SIZES = [256, 512, 1024, 2048, 4096]  # Add more as needed: [256, 512, 1024, 2048]

# %%
# Process all chunk sizes
processed_data = process_multiple_chunk_sizes(
    documents=documents,
    chunk_sizes=CHUNK_SIZES,
    overlap_percentage=CHUNK_OVERLAP_PERCENTAGE
)

# %%
print("\n✓ Step 3 complete!")
print(f"Processed chunk sizes: {list(processed_data.keys())}")
print(f"Total chunks across all sizes: {sum(d['stats']['total_chunks'] for d in processed_data.values()):,}")


Processing 5 chunk size(s)...

PROCESSING CHUNK SIZE: 256
Overlap: 38 chars (15%)
Generating nodes (size=256, overlap=38)...


Parsing nodes:   0%|          | 0/12013 [00:00<?, ?it/s]

✓ Created 57,903 nodes


Converting to LangChain:   0%|          | 0/57903 [00:00<?, ?it/s]

✓ Converted 57,903 documents

CHUNK STATISTICS (Size: 256)
Total Chunks:          57,903
Total Characters:      43,436,009
Estimated Tokens:      10,859,002

Per-Chunk Statistics:
  Average:             750 chars
  Min:                 0 chars
  Max:                 2,121 chars

PROCESSING CHUNK SIZE: 512
Overlap: 76 chars (15%)
Generating nodes (size=512, overlap=76)...


Parsing nodes:   0%|          | 0/12013 [00:00<?, ?it/s]

✓ Created 28,657 nodes


Converting to LangChain:   0%|          | 0/28657 [00:00<?, ?it/s]

✓ Converted 28,657 documents

CHUNK STATISTICS (Size: 512)
Total Chunks:          28,657
Total Characters:      43,783,209
Estimated Tokens:      10,945,802

Per-Chunk Statistics:
  Average:             1,528 chars
  Min:                 0 chars
  Max:                 4,103 chars

PROCESSING CHUNK SIZE: 1024
Overlap: 153 chars (15%)
Generating nodes (size=1024, overlap=153)...


Parsing nodes:   0%|          | 0/12013 [00:00<?, ?it/s]

✓ Created 15,787 nodes


Converting to LangChain:   0%|          | 0/15787 [00:00<?, ?it/s]

✓ Converted 15,787 documents

CHUNK STATISTICS (Size: 1024)
Total Chunks:          15,787
Total Characters:      42,395,002
Estimated Tokens:      10,598,750

Per-Chunk Statistics:
  Average:             2,685 chars
  Min:                 0 chars
  Max:                 7,205 chars

PROCESSING CHUNK SIZE: 2048
Overlap: 307 chars (15%)
Generating nodes (size=2048, overlap=307)...


Parsing nodes:   0%|          | 0/12013 [00:00<?, ?it/s]

✓ Created 12,099 nodes


Converting to LangChain:   0%|          | 0/12099 [00:00<?, ?it/s]

✓ Converted 12,099 documents

CHUNK STATISTICS (Size: 2048)
Total Chunks:          12,099
Total Characters:      40,711,884
Estimated Tokens:      10,177,971

Per-Chunk Statistics:
  Average:             3,365 chars
  Min:                 0 chars
  Max:                 10,737 chars

PROCESSING CHUNK SIZE: 4096
Overlap: 614 chars (15%)
Generating nodes (size=4096, overlap=614)...


Parsing nodes:   0%|          | 0/12013 [00:00<?, ?it/s]

✓ Created 11,970 nodes


Converting to LangChain:   0%|          | 0/11970 [00:00<?, ?it/s]

✓ Converted 11,970 documents

CHUNK STATISTICS (Size: 4096)
Total Chunks:          11,970
Total Characters:      40,623,247
Estimated Tokens:      10,155,811

Per-Chunk Statistics:
  Average:             3,394 chars
  Min:                 0 chars
  Max:                 10,737 chars

PROCESSING SUMMARY
Chunk 256: 57,903 chunks, ~10,859,002 tokens
Chunk 512: 28,657 chunks, ~10,945,802 tokens
Chunk 1024: 15,787 chunks, ~10,598,750 tokens
Chunk 2048: 12,099 chunks, ~10,177,971 tokens
Chunk 4096: 11,970 chunks, ~10,155,811 tokens

✓ Step 3 complete!
Processed chunk sizes: [256, 512, 1024, 2048, 4096]
Total chunks across all sizes: 126,416


In [7]:
# ============================================================================
# Step 4: Inspect Existing Databases
# ============================================================================

# %% [markdown]
# ## 4.1 Scan All Databases

# %%
def inspect_all_databases(base_dir: str = "../../vector_databases") -> Dict:
    """Scan and inspect all embedding databases."""
    if not os.path.exists(base_dir):
        print(f"No databases found at: {base_dir}")
        return {}
    
    print(f"\n{'='*60}")
    print("SCANNING DATABASES")
    print(f"{'='*60}")
    print(f"Location: {base_dir}\n")
    
    all_dbs = {}
    
    for item in os.listdir(base_dir):
        item_path = os.path.join(base_dir, item)
        if not os.path.isdir(item_path):
            continue
        
        # Parse provider_model format
        if '_' not in item:
            continue
        
        parts = item.split('_', 1)
        provider = parts[0]
        model = parts[1]
        
        print(f"Database: {item}")
        print(f"  Provider: {provider}")
        print(f"  Model: {model}")
        
        # Check for ChromaDB
        if not os.path.exists(os.path.join(item_path, "chroma.sqlite3")):
            print(f"  Status: Not a valid ChromaDB\n")
            continue
        
        # Inspect collections
        collections = {}
        try:
            # Import appropriate embedding
            if provider == "ollama":
                from langchain_ollama import OllamaEmbeddings
                emb = OllamaEmbeddings(model=model)
            elif provider == "openai":
                from langchain_openai import OpenAIEmbeddings
                emb = OpenAIEmbeddings(model=model)
            elif provider == "voyage":
                from langchain_voyageai import VoyageAIEmbeddings
                voyage_api_key = os.getenv("VOYAGE_API_KEY")
                return VoyageAIEmbeddings(
                    model=model,
                    voyage_api_key=voyage_api_key
                )
            else:
                print(f"  Status: Unknown provider\n")
                continue
            
            # Check common chunk sizes
            for cs in [128, 256, 512, 1024, 2048]:
                coll_name = f"{COLLECTION_PREFIX}{cs}"
                try:
                    vs = Chroma(
                        collection_name=coll_name,
                        embedding_function=emb,
                        persist_directory=item_path
                    )
                    count = vs._collection.count()
                    if count > 0:
                        collections[cs] = count
                        print(f"    • Chunk {cs}: {count:,} documents")
                except Exception:
                    pass
            
            if collections:
                all_dbs[item] = {
                    'provider': provider,
                    'model': model,
                    'path': item_path,
                    'collections': collections,
                    'total_docs': sum(collections.values())
                }
                print(f"  Total: {sum(collections.values()):,} documents\n")
            else:
                print(f"  Status: No collections found\n")
                
        except Exception as e:
            print(f"  Error: {e}\n")
    
    return all_dbs

# %% [markdown]
# ## 4.2 Display Summary

# %%
def display_summary(databases: Dict):
    """Display summary of all databases."""
    if not databases:
        print("\n❌ No databases found")
        return
    
    print(f"\n{'='*60}")
    print("DATABASE SUMMARY")
    print(f"{'='*60}")
    
    total_colls = 0
    total_docs = 0
    
    for db_name, info in databases.items():
        print(f"\n{db_name}")
        print(f"  Provider: {info['provider']}")
        print(f"  Model: {info['model']}")
        print(f"  Collections: {len(info['collections'])}")
        print(f"  Documents: {info['total_docs']:,}")
        
        total_colls += len(info['collections'])
        total_docs += info['total_docs']
    
    print(f"\n{'='*60}")
    print(f"Total: {len(databases)} database(s), {total_colls} collection(s), {total_docs:,} documents")
    print(f"{'='*60}")

# %% [markdown]
# ## 4.3 Execute Inspection

# %%
# Scan all databases
all_databases = inspect_all_databases(VECTOR_DB_DIR)

# %%
# Display summary
#display_summary(all_databases)

# %%
print("\n✓ Step 4 complete!")


SCANNING DATABASES
Location: ../../vector_databases

Database: voyage_voyage-3-large
  Provider: voyage
  Model: voyage-3-large

✓ Step 4 complete!


In [8]:
# ============================================================================
# Step 5: Add Embeddings Flexibly
# ============================================================================

# %% [markdown]
# ## 5.1 Helper Functions

# %%
def get_embedding_function(provider: str, model: str):
    """Get embedding function for a provider/model."""
    if provider == "ollama":
        from langchain_ollama import OllamaEmbeddings
        return OllamaEmbeddings(model=model, base_url=OLLAMA_BASE_URL)
    elif provider == "openai":
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings(model=model, openai_api_key=OPENAI_API_KEY)
    elif provider == "voyage":
        from langchain_voyageai import VoyageAIEmbeddings
        return VoyageAIEmbeddings(model=model, voyage_api_key=VOYAGE_API_KEY)
    else:
        raise ValueError(f"Unknown provider: {provider}")


def get_db_path(base_dir: str, provider: str, model: str) -> str:
    """Get database path for embedding."""
    model_id = f"{provider}_{model.replace('/', '_')}"
    return os.path.join(base_dir, model_id)


def check_collection_exists(db_path: str, collection_name: str, embedding_fn) -> Tuple[bool, int]:
    """Check if collection exists and get count."""
    if not os.path.exists(db_path):
        return False, 0
    
    try:
        vs = Chroma(
            collection_name=collection_name,
            embedding_function=embedding_fn,
            persist_directory=db_path
        )
        count = vs._collection.count()
        return count > 0, count
    except Exception:
        return False, 0

# %% [markdown]
# ## 5.2 Add Single Chunk Size

def add_chunk_size_to_embedding(
    processed_data: Dict[int, Dict],
    chunk_size: int,
    embedding_provider: str,
    embedding_model: str,
    base_db_dir: str = "../../vector_databases",
    collection_prefix: str = "financebench_docs_chunk_",
    batch_size: int = 100,
    max_tokens_per_batch: int = None,
    skip_if_exists: bool = True
) -> Dict:
    """
    Add a single chunk size to an embedding database.
    
    Args:
        processed_data: Output from Step 3
        chunk_size: Which chunk size (must exist in processed_data)
        embedding_provider: "ollama" or "openai"
        embedding_model: Model name
        base_db_dir: Base database directory
        collection_prefix: Collection name prefix
        batch_size: Maximum documents per batch (fallback if max_tokens_per_batch not set)
        max_tokens_per_batch: Maximum tokens per batch (overrides batch_size for smart batching)
        skip_if_exists: Skip if collection already exists
        
    Returns:
        Statistics dictionary
    """
    # Validate
    if chunk_size not in processed_data:
        raise ValueError(f"Chunk size {chunk_size} not in processed_data. "
                        f"Available: {list(processed_data.keys())}")
    
    # Setup
    db_path = get_db_path(base_db_dir, embedding_provider, embedding_model)
    collection_name = f"{collection_prefix}{chunk_size}"
    
    print(f"\n{'='*60}")
    print(f"ADDING CHUNK SIZE {chunk_size}")
    print(f"{'='*60}")
    print(f"Provider: {embedding_provider}")
    print(f"Model: {embedding_model}")
    print(f"Database: {db_path}")
    print(f"Collection: {collection_name}")
    
    # Set default token limits for OpenAI
    if max_tokens_per_batch is None and embedding_provider == "openai":
        # Use 250k to leave safety margin below 300k limit
        max_tokens_per_batch = 250000
        print(f"Using OpenAI token limit: {max_tokens_per_batch:,} tokens per batch")
    
    # Get embedding function
    emb_fn = get_embedding_function(embedding_provider, embedding_model)
    
    # Check if exists
    exists, count = check_collection_exists(db_path, collection_name, emb_fn)
    if exists and skip_if_exists:
        print(f"\n✓ Already exists with {count:,} documents - SKIPPING")
        return {
            'status': 'skipped',
            'chunk_size': chunk_size,
            'collection_name': collection_name,
            'document_count': count
        }
    
    # Get documents
    lc_docs = processed_data[chunk_size]['lc_docs']
    print(f"Documents: {len(lc_docs):,}")
    
    # Create database directory
    os.makedirs(db_path, exist_ok=True)
    
    # Initialize vectorstore
    vectorstore = Chroma(
        collection_name=collection_name,
        embedding_function=emb_fn,
        persist_directory=db_path
    )
    
    # Create batches (smart token-aware batching for OpenAI)
    if max_tokens_per_batch:
        print(f"\nUsing token-aware batching (max {max_tokens_per_batch:,} tokens/batch)")
        batches = create_token_aware_batches(lc_docs, max_tokens_per_batch)
    else:
        print(f"\nUsing document-count batching ({batch_size} docs/batch)")
        total = len(lc_docs)
        num_batches = (total + batch_size - 1) // batch_size
        batches = [lc_docs[i*batch_size:(i+1)*batch_size] for i in range(num_batches)]
    
    print(f"Created {len(batches)} batch(es)")
    
    # Add in batches
    added = 0
    failed_batches = []
    
    with tqdm(total=len(lc_docs), desc="Progress") as pbar:
        for i, batch in enumerate(batches):
            try:
                vectorstore.add_documents(batch)
                added += len(batch)
                pbar.update(len(batch))
            except Exception as e:
                print(f"\nBatch {i+1} failed: {e}")
                failed_batches.append((i+1, len(batch), str(e)))
                pbar.update(len(batch))
    
    # Persist
    vectorstore.persist()
    final_count = vectorstore._collection.count()
    
    # Report
    print(f"\n✓ Complete: {added:,}/{len(lc_docs):,} added, {final_count:,} final")
    
    if failed_batches:
        print(f"\n⚠ {len(failed_batches)} batch(es) failed:")
        for batch_num, batch_size, error in failed_batches[:3]:
            print(f"  Batch {batch_num} ({batch_size} docs): {error[:100]}")
        if len(failed_batches) > 3:
            print(f"  ... and {len(failed_batches)-3} more")
    
    return {
        'status': 'completed' if not failed_batches else 'completed_with_errors',
        'chunk_size': chunk_size,
        'collection_name': collection_name,
        'added': added,
        'failed': len(failed_batches),
        'final_count': final_count
    }


def create_token_aware_batches(
    documents: List[LCDocument],
    max_tokens: int,
    chars_per_token: float = 4.0
) -> List[List[LCDocument]]:
    """
    Create batches that respect token limits.
    
    Args:
        documents: List of documents to batch
        max_tokens: Maximum tokens per batch
        chars_per_token: Estimated characters per token (default 4.0)
        
    Returns:
        List of document batches
    """
    batches = []
    current_batch = []
    current_tokens = 0
    
    for doc in documents:
        # Estimate tokens for this document
        doc_tokens = len(doc.page_content) / chars_per_token
        
        # If adding this doc would exceed limit, start new batch
        if current_batch and (current_tokens + doc_tokens > max_tokens):
            batches.append(current_batch)
            current_batch = [doc]
            current_tokens = doc_tokens
        else:
            current_batch.append(doc)
            current_tokens += doc_tokens
    
    # Add final batch
    if current_batch:
        batches.append(current_batch)
    
    return batches

def delete_collection(
    embedding_provider: str,
    embedding_model: str,
    chunk_size: int,
    base_db_dir: str = "../../vector_databases",
    collection_prefix: str = "financebench_docs_chunk_"
) -> bool:
    """
    Delete a specific collection from a database.
    
    Args:
        embedding_provider: "ollama" or "openai"
        embedding_model: Model name
        chunk_size: Chunk size of collection to delete
        base_db_dir: Base database directory
        collection_prefix: Collection name prefix
        
    Returns:
        True if deleted successfully, False otherwise
    """
    import chromadb
    
    # Build paths
    model_id = f"{embedding_provider}_{embedding_model.replace('/', '_')}"
    db_path = os.path.join(base_db_dir, model_id)
    collection_name = f"{collection_prefix}{chunk_size}"
    
    print(f"\n{'='*60}")
    print(f"DELETING COLLECTION")
    print(f"{'='*60}")
    print(f"Provider: {embedding_provider}")
    print(f"Model: {embedding_model}")
    print(f"Database: {db_path}")
    print(f"Collection: {collection_name}")
    
    if not os.path.exists(db_path):
        print(f"\n✗ Database does not exist")
        return False
    
    try:
        client = chromadb.PersistentClient(path=db_path)
        
        # Check if collection exists
        existing_collections = [col.name for col in client.list_collections()]
        
        if collection_name not in existing_collections:
            print(f"\n✗ Collection does not exist")
            return False
        
        # Get count before deletion
        collection = client.get_collection(name=collection_name)
        count = collection.count()
        
        # Delete
        client.delete_collection(name=collection_name)
        print(f"\n✓ Deleted collection with {count:,} documents")
        return True
        
    except Exception as e:
        print(f"\n✗ Error: {e}")
        return False

In [None]:
# Delete collection example

# delete_collection(
#     embedding_provider="voyage",
#     embedding_model="voyage-3-large",
#     chunk_size=512
# )


DELETING COLLECTION
Provider: voyage
Model: voyage-3-large
Database: ../../vector_databases/voyage_voyage-3-large
Collection: financebench_docs_chunk_512

✓ Deleted collection with 2,041 documents


True

In [9]:
# %% [markdown]
# ## 5.3 Example Usage

# %% [markdown]
# ### Example 1: Add chunk 512 to Ollama

# %%
# Uncomment to run:
# stats = add_chunk_size_to_embedding(
#     processed_data=processed_data,
#     chunk_size=512,
#     embedding_provider="ollama",
#     embedding_model="nomic-embed-text",
#     skip_if_exists=True
# )

# %% [markdown]
# ### Example 2: Add chunk 1024 to same Ollama database

# %%
# Uncomment to run:
# stats = add_chunk_size_to_embedding(
#     processed_data=processed_data,
#     chunk_size=1024,
#     embedding_provider="ollama",
#     embedding_model="nomic-embed-text",
#     skip_if_exists=True
# )

# %% [markdown]
# ### Example 3: Add chunk 512 to OpenAI (different database)

# %%
# Uncomment to run:
# stats = add_chunk_size_to_embedding(
#     processed_data=processed_data,
#     chunk_size=512,
#     embedding_provider="openai",
#     embedding_model="text-embedding-3-small",
#     skip_if_exists=True
# )

# %% [markdown]
# ## 5.4 Your Turn: Add Your Embeddings

# %%
# Define what you want to add
# Modify these parameters:

# EMBED_PROVIDER = "ollama"  # or "openai" pr "voyage"
EMBED_PROVIDER = "openai"
EMBED_MODEL = "text-embedding-3-large"  # or "text-embedding-3-small"
CHUNK_TO_ADD = 2048  # Must exist in processed_data


stats = add_chunk_size_to_embedding(
    processed_data=processed_data,
    chunk_size=CHUNK_TO_ADD,
    embedding_provider=EMBED_PROVIDER,
    embedding_model=EMBED_MODEL,
    skip_if_exists=True,
    max_tokens_per_batch=200000,  # Only for OpenAI
    batch_size=100
)

# %%
print("\n✓ Step 5 complete!")
print(f"Status: {stats['status']}")


ADDING CHUNK SIZE 2048
Provider: openai
Model: text-embedding-3-large
Database: ../../vector_databases/openai_text-embedding-3-large
Collection: financebench_docs_chunk_2048


  vs = Chroma(
2025-11-02 13:03:19,031 - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.


Documents: 12,099

Using token-aware batching (max 200,000 tokens/batch)
Created 52 batch(es)


Progress:   0%|          | 0/12099 [00:00<?, ?it/s]

2025-11-02 13:03:22,229 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:03:28,232 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:03:32,551 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:03:36,807 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:03:40,958 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:03:45,127 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:03:49,459 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:03:53,763 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:03:55,673 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 429 Too Many Requests"
2025-11-02 13:03:55,67


Batch 35 failed: Error code: 400 - {'error': {'message': 'Requested 302468 tokens, max 300000 tokens per request', 'type': 'max_tokens_per_request', 'param': None, 'code': 'max_tokens_per_request'}}


2025-11-02 13:08:49,997 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 429 Too Many Requests"
2025-11-02 13:08:50,000 - INFO - Retrying request to /embeddings in 8.805000 seconds
2025-11-02 13:09:03,067 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:09:05,176 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 429 Too Many Requests"
2025-11-02 13:09:05,178 - INFO - Retrying request to /embeddings in 3.881000 seconds
2025-11-02 13:09:11,645 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:09:13,238 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 429 Too Many Requests"
2025-11-02 13:09:13,240 - INFO - Retrying request to /embeddings in 6.486000 seconds
2025-11-02 13:09:23,139 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-02 13:09:25,705 - INFO - HTTP Request: POST https:


✓ Complete: 11,857/12,099 added, 11,857 final

⚠ 1 batch(es) failed:
  Batch 35 (242 docs): Error code: 400 - {'error': {'message': 'Requested 302468 tokens, max 300000 tokens per request', 't

✓ Step 5 complete!
Status: completed_with_errors


  vectorstore.persist()


In [None]:
# %%
# Scan all databases
print(f"Vector DB Directory: {VECTOR_DB_DIR}")
all_databases = inspect_all_databases(VECTOR_DB_DIR)

# %%
# Display summary
display_summary(all_databases)

Vector DB Directory: ../../vector_databases

SCANNING DATABASES
Location: ../../vector_databases

Database: voyage_voyage-3-large
  Provider: voyage
  Model: voyage-3-large


In [42]:
# ============================================================================
# Step 6: Query and Test
# ============================================================================

# %% [markdown]
# ## 6.1 Load Vector Store

# %%
def load_vector_store(
    embedding_provider: str,
    embedding_model: str,
    chunk_size: int,
    base_db_dir: str = "../../vector_databases",
    collection_prefix: str = "financebench_docs_chunk_"
) -> Chroma:
    """
    Load a vector store for querying.
    
    Args:
        embedding_provider: "ollama" or "openai"
        embedding_model: Model name
        chunk_size: Which chunk size collection to load
        base_db_dir: Base database directory
        collection_prefix: Collection name prefix
        
    Returns:
        Chroma vectorstore instance
    """
    # Get paths
    db_path = get_db_path(base_db_dir, embedding_provider, embedding_model)
    collection_name = f"{collection_prefix}{chunk_size}"
    
    print(f"Loading vector store:")
    print(f"  Provider: {embedding_provider}")
    print(f"  Model: {embedding_model}")
    print(f"  Database: {db_path}")
    print(f"  Collection: {collection_name}")
    
    # Get embedding function
    emb_fn = get_embedding_function(embedding_provider, embedding_model)
    
    # Load
    vectorstore = Chroma(
        collection_name=collection_name,
        embedding_function=emb_fn,
        persist_directory=db_path
    )
    
    count = vectorstore._collection.count()
    print(f"  Documents: {count:,}")
    print("✓ Loaded")
    
    return vectorstore

# %% [markdown]
# ## 6.2 Simple Search

# %%
def search(
    vectorstore: Chroma,
    query: str,
    k: int = 5
) -> List:
    """Perform similarity search."""
    print(f"\n{'='*60}")
    print(f"SEARCH")
    print(f"{'='*60}")
    print(f"Query: {query}")
    print(f"Top-{k} results:\n")
    
    docs = vectorstore.similarity_search(query, k=k)
    
    for i, doc in enumerate(docs, 1):
        print(f"[{i}] {doc.page_content[:200]}...")
        if 'file_name' in doc.metadata:
            print(f"    Source: {doc.metadata['file_name']}")
        print()
    
    return docs

# %% [markdown]
# ## 6.3 Search with Scores

# %%
def search_with_scores(
    vectorstore: Chroma,
    query: str,
    k: int = 5
) -> List[Tuple]:
    """Perform similarity search with relevance scores."""
    print(f"\n{'='*60}")
    print(f"SEARCH WITH SCORES")
    print(f"{'='*60}")
    print(f"Query: {query}")
    print(f"Top-{k} results:\n")
    
    results = vectorstore.similarity_search_with_score(query, k=k)
    
    for i, (doc, score) in enumerate(results, 1):
        print(f"[{i}] Score: {score:.4f}")
        print(f"{doc.page_content[:200]}...")
        if 'file_name' in doc.metadata:
            print(f"Source: {doc.metadata['file_name']}")
        if 'page_label' in doc.metadata:
            print(f"Page: {doc.metadata['page_label']}")
        print()
    
    return results

# %% [markdown]
# ## 6.4 Compare Chunk Sizes

# %%
def compare_chunk_sizes(
    embedding_provider: str,
    embedding_model: str,
    query: str,
    chunk_sizes: List[int],
    k: int = 3,
    base_db_dir: str = "../../vector_databases"
):
    """Compare retrieval across different chunk sizes."""
    print(f"\n{'='*60}")
    print(f"COMPARING CHUNK SIZES")
    print(f"{'='*60}")
    print(f"Query: {query}\n")
    
    for chunk_size in chunk_sizes:
        print(f"--- Chunk Size: {chunk_size} ---")
        try:
            vs = load_vector_store(
                embedding_provider, 
                embedding_model, 
                chunk_size,
                base_db_dir
            )
            
            results = vs.similarity_search_with_score(query, k=k)
            if results:
                doc, score = results[0]
                print(f"Top result (Score: {score:.4f}):")
                print(f"{doc.page_content[:250]}...")
            print()
        except Exception as e:
            print(f"Error: {e}\n")

# %% [markdown]
# ## 6.5 Test with FinanceBench Questions

# %%
def test_with_dataset_questions(
    vectorstore: Chroma,
    dataset,
    num_questions: int = 3,
    k: int = 3
):
    """Test with actual FinanceBench questions."""
    print(f"\n{'='*60}")
    print(f"TESTING WITH FINANCEBENCH QUESTIONS")
    print(f"{'='*60}\n")
    
    import random
    indices = random.sample(range(len(dataset)), num_questions)
    
    for idx in indices:
        record = dataset[idx]
        question = record['question']
        answer = record['answer']
        company = record['company']
        
        print(f"{'='*60}")
        print(f"Company: {company}")
        print(f"Question: {question}")
        print(f"Expected Answer: {answer}")
        print(f"{'='*60}\n")
        
        # Retrieve
        docs = vectorstore.similarity_search(question, k=k)
        
        print(f"Retrieved {len(docs)} documents:\n")
        for i, doc in enumerate(docs, 1):
            print(f"[{i}] {doc.page_content[:150]}...")
            if 'file_name' in doc.metadata:
                print(f"    Source: {doc.metadata['file_name']}")
        
        print("\n" + "-"*60 + "\n")



In [43]:
# %% [markdown]
# ## 6.6 Execute Tests

# %% [markdown]
# ### Load Vector Store

# %%
# Configure which embedding to test
TEST_PROVIDER = "ollama"  # or "openai"
TEST_MODEL = "nomic-embed-text"  # or "text-embedding-3-small"
TEST_CHUNK_SIZE = 512

# %%
# Load vector store
vectorstore = load_vector_store(
    embedding_provider=TEST_PROVIDER,
    embedding_model=TEST_MODEL,
    chunk_size=TEST_CHUNK_SIZE
)

# %% [markdown]
# ### Test 1: Simple Search

# %%
query = "What was the capital expenditure in 2018?"
docs = search(vectorstore, query, k=3)

# %% [markdown]
# ### Test 2: Search with Scores

# %%
query = "What is the total revenue for fiscal year 2022?"
results = search_with_scores(vectorstore, query, k=5)

# %% [markdown]
# ### Test 3: Compare Chunk Sizes

# %%
# Only works if you have multiple chunk sizes for the same embedding
query = "What were the operating expenses in 2021?"
compare_chunk_sizes(
    embedding_provider=TEST_PROVIDER,
    embedding_model=TEST_MODEL,
    query=query,
    chunk_sizes=[512, 1024],  # Adjust based on what you have
    k=3
)

# %% [markdown]
# ### Test 4: Test with Real Questions

# %%
test_with_dataset_questions(
    vectorstore=vectorstore,
    dataset=dataset,
    num_questions=3,
    k=3
)

# %% [markdown]
# ## 6.7 Quick Query Function

# %%
def quick_query(query: str, provider: str = "openai", model: str = "nomic-embed-text", chunk_size: int = 512, k: int = 5):
    """Quick helper for ad-hoc queries."""
    vs = load_vector_store(provider, model, chunk_size)
    return search_with_scores(vs, query, k)

# %% [markdown]
# ### Your Custom Queries

# %%
# Try your own queries here
# results = quick_query(
#     query="Your question here",
#     provider="ollama",
#     model="nomic-embed-text",
#     chunk_size=512,
#     k=5
# )

# %%
print("\n✓ Step 6 complete!")
print("You can now query your RAG system!")

2025-10-06 10:57:38,824 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-10-06 10:57:38,873 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"


Loading vector store:
  Provider: ollama
  Model: nomic-embed-text
  Database: ../../vector_databases/ollama_nomic-embed-text
  Collection: financebench_docs_chunk_512
  Documents: 28,657
✓ Loaded

SEARCH
Query: What was the capital expenditure in 2018?
Top-3 results:

[1] Capital Spending
 
Capital spending was $1.6 billion in 2021, an increase of $260 million when compared to 2020. We expect our 2022 capital expenditures to be consistent with 2021.
Cash Flows
 
Summar...

[2] Total U.S. capital expenditures decreased $478 million for fiscal 2018 , when compared to the previous fiscal year. Capital expenditures related to new stores and
clubs, including expansions and reloc...

[3] Year Ended December 31,
 
 
 
2020
  
2019
  
2018
 
Capital expenditures:
 
(In thousands)
 
Las Vegas Strip Resorts
 
$
87,511   
$
285,863   
$
501,044 
Regional Operations
 
 
41,456   
 
187,489 ...


SEARCH WITH SCORES
Query: What is the total revenue for fiscal year 2022?
Top-5 results:

[1] Score: 0

2025-10-06 10:57:38,952 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-10-06 10:57:39,042 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"


  Documents: 28,657
✓ Loaded
Top result (Score: 0.4891):
Table of Contents
Operating Expenses
Information about operating expenses is as follows (in millions):
 
 
Year Ended December 31,
  
2015
2016
2017
Operating expenses:
Cost of sales
$
71,651
$
88,265
$
111,934
Fulfillment
13,410
17,619
25,249
Market...

--- Chunk Size: 1024 ---
Loading vector store:
  Provider: ollama
  Model: nomic-embed-text
  Database: ../../vector_databases/ollama_nomic-embed-text
  Collection: financebench_docs_chunk_1024
  Documents: 0
✓ Loaded


TESTING WITH FINANCEBENCH QUESTIONS

Company: Corning
Question: Does Corning have positive working capital based on FY2022 data? If working capital is not a useful or relevant metric for this company, then please state that and explain why.
Expected Answer: Yes. Corning had a positive working capital amount of $831 million by FY 2022 close. This answer considers only operating current assets and current liabilities that were clearly shown in the balance sheet.



2025-10-06 10:57:39,128 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-10-06 10:57:39,168 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-10-06 10:57:39,209 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"


Retrieved 3 documents:

[1] Effective January 1, 2019, Corning began using constant-currency reporting for our Environmental Technologies 
and Life Sciences segments. The Company...
[2] Despite the pandemic and resulting global disruptions, Corning adapted rapidly and remained resilient. We acted quickly to preserve our financial stre...
[3] Our probability of success increases as we invest in our world-class capabilities.  Corning is concentrating approximately 80% of its research, develo...

------------------------------------------------------------

Company: Boeing
Question: Who are the primary customers of Boeing as of FY2022?
Expected Answer: Boeing's primary customers as of FY2022 are a limited number of commercial airlines and the US government. The US government accounted for 40% of Boeing's total revenues in FY2022.

Retrieved 3 documents:

[1] We address employee concerns and take appropriate
actions that uphold our Boeing values.
Competition
The commercial jet aircraft mar

In [None]:
# ============================================================================
# Test: Check Metadata in ChromaDB Collections
# ============================================================================

# %%
import os
from langchain.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain_openai import OpenAIEmbeddings
from langchain_voyageai import VoyageAIEmbeddings

# %% [markdown]
# ## Function to Inspect Metadata

# %%
def inspect_collection_metadata(
    embedding_provider: str,
    embedding_model: str,
    chunk_size: int,
    base_db_dir: str = "../../vector_databases",
    collection_prefix: str = "financebench_docs_chunk_",
    num_samples: int = 5
):
    """
    Inspect metadata from a specific collection.
    
    Args:
        embedding_provider: "ollama" or "openai"
        embedding_model: Model name
        chunk_size: Chunk size of collection
        base_db_dir: Base directory
        collection_prefix: Collection prefix
        num_samples: Number of sample documents to inspect
    """
    # Build paths
    model_id = f"{embedding_provider}_{embedding_model.replace('/', '_')}"
    db_path = os.path.join(base_db_dir, model_id)
    collection_name = f"{collection_prefix}{chunk_size}"
    
    print(f"\n{'='*60}")
    print(f"INSPECTING METADATA")
    print(f"{'='*60}")
    print(f"Provider: {embedding_provider}")
    print(f"Model: {embedding_model}")
    print(f"Database: {db_path}")
    print(f"Collection: {collection_name}")
    
    if not os.path.exists(db_path):
        print(f"\n✗ Database does not exist")
        return
    
    # Get embedding function
    if embedding_provider == "ollama":
        emb_fn = OllamaEmbeddings(model=embedding_model)
    elif embedding_provider == "openai":
        emb_fn = OpenAIEmbeddings(model=embedding_model)
    elif embedding_provider == "voyage":
        emb_fn = VoyageAIEmbeddings(model=embedding_model, voyage_api_key=VOYAGE_API_KEY)   
    else:
        print(f"\n✗ Unknown provider")
        return
    
    # Load vectorstore
    try:
        vectorstore = Chroma(
            collection_name=collection_name,
            embedding_function=emb_fn,
            persist_directory=db_path
        )
        
        total_docs = vectorstore._collection.count()
        print(f"Total documents: {total_docs:,}")
        
        # Get sample documents
        print(f"\nFetching {num_samples} sample documents...")
        results = vectorstore.similarity_search("sample query", k=num_samples)
        
        print(f"\n{'='*60}")
        print(f"METADATA ANALYSIS")
        print(f"{'='*60}")
        
        # Collect all unique metadata keys
        all_keys = set()
        for doc in results:
            all_keys.update(doc.metadata.keys())
        
        print(f"\nAll metadata keys found: {sorted(all_keys)}")
        
        # Show detailed samples
        print(f"\n{'='*60}")
        print(f"SAMPLE DOCUMENTS")
        print(f"{'='*60}")
        
        for i, doc in enumerate(results, 1):
            print(f"\n[Sample {i}]")
            print(f"Content preview: {doc.page_content[:150]}...")
            print(f"\nMetadata:")
            for key, value in sorted(doc.metadata.items()):
                # Truncate long values
                if isinstance(value, str) and len(value) > 100:
                    value = value[:100] + "..."
                print(f"  {key}: {value}")
            print("-" * 60)
        
        # Check for required fields
        print(f"\n{'='*60}")
        print(f"REQUIRED FIELDS CHECK")
        print(f"{'='*60}")
        
        required_fields = {
            'file_name': 'Document name (e.g., 3M_2018_10K)',
            'page_label': 'Page number or label',
            'page_number': 'Page number (alternative field)'
        }
        
        for field, description in required_fields.items():
            has_field = field in all_keys
            symbol = "✓" if has_field else "✗"
            print(f"{symbol} {field}: {description}")
        
        # Summary
        print(f"\n{'='*60}")
        print(f"SUMMARY")
        print(f"{'='*60}")
        
        has_doc_name = 'file_name' in all_keys
        has_page_info = 'page_label' in all_keys or 'page_number' in all_keys
        
        if has_doc_name and has_page_info:
            print("✓ Collection has required metadata for evaluation")
            print("  - Document name: file_name")
            page_field = 'page_label' if 'page_label' in all_keys else 'page_number'
            print(f"  - Page info: {page_field}")
        else:
            print("✗ Collection is missing required metadata")
            if not has_doc_name:
                print("  Missing: file_name (document name)")
            if not has_page_info:
                print("  Missing: page_label or page_number")
        
    except Exception as e:
        print(f"\n✗ Error: {e}")

# %% [markdown]
# ## Test Your Collections

# %%
# Test collection 1: Ollama nomic-embed-text, chunk 512
inspect_collection_metadata(
    embedding_provider="ollama",
    embedding_model="nomic-embed-text",
    chunk_size=512,
    num_samples=5
)

# %%
# Test collection 2: Add more as needed
# inspect_collection_metadata(
#     embedding_provider="ollama",
#     embedding_model="nomic-embed-text",
#     chunk_size=1024,
#     num_samples=5
# )

# %%
# Test collection 3: OpenAI example
# inspect_collection_metadata(
#     embedding_provider="openai",
#     embedding_model="text-embedding-3-small",
#     chunk_size=512,
#     num_samples=5
# )


INSPECTING METADATA
Provider: ollama
Model: nomic-embed-text
Database: ../../vector_databases/ollama_nomic-embed-text
Collection: financebench_docs_chunk_512


  vectorstore = Chroma(
2025-10-06 19:05:07,954 - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.


Total documents: 28,657

Fetching 5 sample documents...


2025-10-06 19:05:09,050 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"



METADATA ANALYSIS

All metadata keys found: ['chunk_size', 'file_path', 'source', 'total_pages']

SAMPLE DOCUMENTS

[Sample 1]
Content preview: Table of Contents
The table below presents the estimated maximum potential VAR arising from a one-day loss in fair value for our interest rate, foreig...

Metadata:
  chunk_size: 512
  file_path: ../../financebench/documents/GENERALMILLS_2019_10K.pdf
  source: 49
  total_pages: 140
------------------------------------------------------------

[Sample 2]
Content preview: The VaR model 
results across all portfolios are aggregated at the Firm 
level.
As VaR is based on historical data, it is an imperfect 
measure of mar...

Metadata:
  chunk_size: 512
  file_path: ../../financebench/documents/JPMORGAN_2022_10K.pdf
  source: 135
  total_pages: 382
------------------------------------------------------------

[Sample 3]
Content preview: ___
 
Indicate by check mark whether the Registrant is a shell company (as defined in Rule 12b-2 of the Act):   