# General Tips
## Using virtual environments
**Step 1:** `cd` to the desired directory and create a virtual environment: `python3 -m venv myenv`. (On Windows, run `py -3.13 -m venv myenv` to use a specific Python version.)

List the installed Python versions with `py -0` on Windows. On Linux, `python3 --version` shows the version of the default interpreter.

**Step 2:** Activate the environment: `source myenv/bin/activate` (on Linux) or `myenv\Scripts\activate` (on Windows).

**Step 3:** Install any needed packages, e.g. `pip install requests pandas`. Better still, list them in a `requirements.txt` file and run `pip install -r requirements.txt`.

**Step 4:** List all installed packages with `pip list`.
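
To confirm from Python that an environment is actually active, a minimal check is sketched below. It relies on the standard-library convention that `sys.prefix` differs from `sys.base_prefix` inside an environment created by `venv`; the helper name `in_virtualenv` is only illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Outside an environment both prefixes point at the base installation
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```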

## Connecting the Jupyter Notebook to the virtual env
1. Make sure that `myenv` is activated (`myenv\Scripts\activate` on Windows, `source myenv/bin/activate` on Linux)
2. Run this inside the virtual environment: `pip install ipykernel`
3. Still inside the environment: `python -m ipykernel install --user --name=myenv --display-name "Whatever Python Kernel Name"`
   
   `--name=myenv`: internal identifier for the kernel
   
   `--display-name`: name that shows up in the VS Code kernel picker
4. Open VS Code and select the kernel

   At the top-right, click "Select Kernel".
   Look for "Whatever Python Kernel Name" and pick it.
5. If you don't see it right away, try reloading VS Code or running "Reload Window" from the Command Palette (Ctrl+Shift+P)
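
Once the kernel is selected, a quick sanity check is to run a cell that reports which interpreter the notebook is using. The sketch below assumes the environment directory is called `myenv`, as in the steps above; `kernel_uses_env` is a hypothetical helper, not part of Jupyter or VS Code.

```python
import sys
from typing import Optional


def kernel_uses_env(env_name: str, executable: Optional[str] = None) -> bool:
    """Return True if the interpreter path contains a directory named env_name."""
    path = executable if executable is not None else sys.executable
    # Normalise Windows separators so the same check works on both platforms
    parts = path.replace("\\", "/").split("/")
    return env_name in parts


print("Kernel interpreter:", sys.executable)
print("Running from myenv:", kernel_uses_env("myenv"))
```

If the second line prints `False`, the notebook is probably still attached to a different kernel.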

## Useful Commands
1. Use `py -0` to list the Python installations available on Windows

In [9]:
# ============================================================================
# FinanceBench RAG Pipeline - Clean Implementation
# Step 1: Configuration and Imports
# ============================================================================

# %% [markdown]
# # FinanceBench RAG Pipeline
# 
# This notebook provides a clean, modular approach to building a RAG system with:
# - Multiple embedding providers (Ollama, OpenAI)
# - Configurable chunk sizes and overlaps
# - Progress tracking
# - Proper error handling

# %% [markdown]
# ## 1.1 Imports

# %%
import os
import shutil
import time
from pathlib import Path
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass

# Environment and progress
from dotenv import load_dotenv
from tqdm.auto import tqdm

# Data processing
import pandas as pd
from datasets import load_dataset

# Document processing
from llama_index.core.schema import Document, BaseNode
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PyMuPDFReader

# Embeddings and vector stores
from langchain.docstore.document import Document as LCDocument
from langchain.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain_openai import OpenAIEmbeddings

# %% [markdown]
# ## 1.2 Load Environment Variables

# %%
# Load environment variables from .env file
load_dotenv()

# Verify critical environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")

if not OPENAI_API_KEY:
    print("⚠️  Warning: OPENAI_API_KEY not found in .env file")
else:
    print("✓ OpenAI API key loaded")

print(f"✓ Ollama base URL: {OLLAMA_BASE_URL}")

# %% [markdown]
# ## 1.3 Configuration Dataclass

# %%
@dataclass
class RAGConfig:
    """Configuration for RAG pipeline."""
    
    # Dataset
    dataset_name: str = "PatronusAI/financebench"
    dataset_split: str = "train"
    
    # Paths
    pdf_dir: str = "../../financebench/documents"
    vector_db_dir: str = "../../vector_databases"
    
    # Chunking
    chunk_sizes: Optional[List[int]] = None  # None means "use the defaults from __post_init__"
    chunk_overlap_percentage: int = 15  # Percentage overlap between chunks
    
    # Embeddings
    embedding_provider: str = "ollama"  # "ollama" or "openai"
    ollama_model: str = "nomic-embed-text"
    openai_model: str = "text-embedding-3-small"
    
    # Vector store
    collection_name_prefix: str = "financebench_docs_chunk_"
    
    # Processing
    batch_size: int = 500
    clear_existing_db: bool = False
    keep_node_metadata: bool = True
    
    def __post_init__(self):
        """Set defaults for mutable attributes."""
        if self.chunk_sizes is None:
            self.chunk_sizes = [128, 256, 512, 1024]
    
    def get_model_identifier(self) -> str:
        """Get a filesystem-safe identifier for the current model."""
        if self.embedding_provider == "ollama":
            return f"ollama_{self.ollama_model.replace('/', '_')}"
        elif self.embedding_provider == "openai":
            return f"openai_{self.openai_model.replace('/', '_')}"
        else:
            raise ValueError(f"Unknown embedding provider: {self.embedding_provider}")
    
    def get_vector_db_dir(self) -> str:
        """Get the vector database directory for the current embedding model."""
        model_id = self.get_model_identifier()
        return os.path.join(self.vector_db_dir, model_id)
    
    def get_embedding_function(self):
        """Get the appropriate embedding function based on provider."""
        if self.embedding_provider == "ollama":
            return OllamaEmbeddings(
                model=self.ollama_model,
                base_url=OLLAMA_BASE_URL
            )
        elif self.embedding_provider == "openai":
            return OpenAIEmbeddings(
                model=self.openai_model,
                openai_api_key=OPENAI_API_KEY
            )
        else:
            raise ValueError(f"Unknown embedding provider: {self.embedding_provider}")
    
    def validate(self) -> Tuple[bool, List[str]]:
        """Validate configuration."""
        errors = []
        
        # Check paths
        if not os.path.exists(self.pdf_dir):
            errors.append(f"PDF directory not found: {self.pdf_dir}")
        
        # Check chunk sizes
        if not self.chunk_sizes or not all(isinstance(s, int) and s > 0 for s in self.chunk_sizes):
            errors.append("Chunk sizes must be a list of positive integers")
        
        # Check overlap
        if not (1 <= self.chunk_overlap_percentage <= 99):
            errors.append("Chunk overlap percentage must be between 1 and 99")
        
        # Check embedding provider
        if self.embedding_provider not in ["ollama", "openai"]:
            errors.append("Embedding provider must be 'ollama' or 'openai'")
        
        if self.embedding_provider == "openai" and not OPENAI_API_KEY:
            errors.append("OpenAI provider selected but OPENAI_API_KEY not set")
        
        return len(errors) == 0, errors


# %% [markdown]
# ## 1.4 Create and Validate Configuration

# %%
# Create configuration
config = RAGConfig(
    chunk_sizes=[512],  # Start with one size for testing
    embedding_provider="ollama",  # Change to "openai" if needed
    clear_existing_db=False
)

# Validate configuration
is_valid, errors = config.validate()

if is_valid:
    print("✓ Configuration validated successfully")
    print(f"\nConfiguration:")
    print(f"  - Embedding Provider: {config.embedding_provider}")
    print(f"  - Model: {config.ollama_model if config.embedding_provider == 'ollama' else config.openai_model}")
    print(f"  - Chunk Sizes: {config.chunk_sizes}")
    print(f"  - Chunk Overlap: {config.chunk_overlap_percentage}%")
    print(f"  - PDF Directory: {config.pdf_dir}")
    print(f"  - Vector DB Directory: {config.vector_db_dir}")
else:
    print("❌ Configuration validation failed:")
    for error in errors:
        print(f"  - {error}")

✓ OpenAI API key loaded
✓ Ollama base URL: http://localhost:11434
✓ Configuration validated successfully

Configuration:
  - Embedding Provider: ollama
  - Model: nomic-embed-text
  - Chunk Sizes: [512]
  - Chunk Overlap: 15%
  - PDF Directory: ../../financebench/documents
  - Vector DB Directory: ../../vector_databases
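
`RAGConfig` gives `chunk_sizes` a `None` default and fills it in `__post_init__`, which avoids sharing one list between instances. An alternative is `dataclasses.field(default_factory=...)`. The sketch below shows it on a small stand-in class; `ChunkingConfig` is hypothetical and not part of the notebook.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChunkingConfig:
    """Stand-in for the chunking part of a pipeline configuration."""

    # default_factory builds a fresh list for every instance
    chunk_sizes: List[int] = field(default_factory=lambda: [128, 256, 512, 1024])
    chunk_overlap_percentage: int = 15


a = ChunkingConfig()
b = ChunkingConfig()
a.chunk_sizes.append(2048)  # only changes a's list

print(a.chunk_sizes)
print(b.chunk_sizes)
```

Both approaches give each instance its own list; `default_factory` simply keeps the default next to the field declaration.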


In [10]:
# ============================================================================
# Step 2: Dataset Loading Utilities
# ============================================================================

# %% [markdown]
# ## 2.1 Load FinanceBench Dataset

# %%
def load_financebench_dataset(config: RAGConfig):
    """
    Load the FinanceBench dataset from HuggingFace.
    
    Args:
        config: RAGConfig instance
        
    Returns:
        Dataset object
    """
    print(f"Loading dataset: {config.dataset_name}")
    
    ds = load_dataset(config.dataset_name, split=config.dataset_split)
    
    print(f"✓ Loaded {len(ds)} records from FinanceBench")
    
    # Display sample record structure
    if len(ds) > 0:
        print("\nSample record keys:")
        for key in ds[0].keys():
            print(f"  - {key}")
    
    return ds


# %% [markdown]
# ## 2.2 Extract Unique PDF Requirements

# %%
def get_required_pdfs(dataset) -> set:
    """
    Extract unique PDF filenames required by the dataset.
    
    Args:
        dataset: HuggingFace dataset
        
    Returns:
        Set of PDF filenames
    """
    unique_pdfs = set()
    
    for record in tqdm(dataset, desc="Scanning dataset for PDFs"):
        pdf_filename = record["doc_name"] + ".pdf"
        unique_pdfs.add(pdf_filename)
    
    print(f"\n✓ Found {len(unique_pdfs)} unique PDF files required")
    
    return unique_pdfs


# %% [markdown]
# ## 2.3 Verify PDF Availability

# %%
def verify_pdfs(pdf_dir: str, required_pdfs: set) -> Tuple[List[str], List[str]]:
    """
    Verify which PDFs are available and which are missing.
    
    Args:
        pdf_dir: Directory containing PDFs
        required_pdfs: Set of required PDF filenames
        
    Returns:
        Tuple of (available_pdfs, missing_pdfs)
    """
    available_pdfs = []
    missing_pdfs = []
    
    for pdf_filename in tqdm(required_pdfs, desc="Verifying PDFs"):
        pdf_path = os.path.join(pdf_dir, pdf_filename)
        if os.path.isfile(pdf_path):
            available_pdfs.append(pdf_filename)
        else:
            missing_pdfs.append(pdf_filename)
    
    print(f"\n✓ Available: {len(available_pdfs)} PDFs")
    
    if missing_pdfs:
        print(f"✗ Missing: {len(missing_pdfs)} PDFs")
        print("\nMissing files:")
        for filename in missing_pdfs[:10]:  # Show first 10
            print(f"  - {filename}")
        if len(missing_pdfs) > 10:
            print(f"  ... and {len(missing_pdfs) - 10} more")
    else:
        print("✓ All required PDFs are available")
    
    return available_pdfs, missing_pdfs


# %% [markdown]
# ## 2.4 Load PDF Documents

# %%
def load_pdf_documents(pdf_dir: str, pdf_filenames: List[str]) -> List[Document]:
    """
    Load PDF documents using PyMuPDF via LlamaIndex.
    
    Args:
        pdf_dir: Directory containing PDFs
        pdf_filenames: List of PDF filenames to load
        
    Returns:
        List of LlamaIndex Document objects
    """
    pdf_reader = PyMuPDFReader()
    documents = []
    failed_files = []
    
    for pdf_file in tqdm(pdf_filenames, desc="Loading PDFs"):
        file_path = os.path.join(pdf_dir, pdf_file)
        try:
            doc = pdf_reader.load(file_path)
            documents.extend(doc)
        except Exception as e:
            failed_files.append((pdf_file, str(e)))
            print(f"\n✗ Failed to load {pdf_file}: {e}")
    
    print(f"\n✓ Successfully loaded {len(documents)} document pages")
    
    if failed_files:
        print(f"✗ Failed to load {len(failed_files)} files")
    
    return documents


# %% [markdown]
# ## 2.5 Analyze Document Statistics

# %%
def analyze_documents(documents: List[Document]) -> Dict:
    """
    Analyze loaded documents and provide detailed statistics.
    
    Args:
        documents: List of LlamaIndex Document objects
        
    Returns:
        Dictionary containing document statistics
    """
    print("Analyzing documents...")
    
    # Basic statistics
    total_pages = len(documents)
    total_chars = sum(len(doc.text) for doc in documents)
    
    # Approximate token count (rough estimate: 1 token ≈ 4 characters)
    estimated_tokens = total_chars // 4
    
    # Character statistics per page
    char_counts = [len(doc.text) for doc in documents]
    avg_chars_per_page = total_chars / total_pages if total_pages > 0 else 0
    min_chars = min(char_counts) if char_counts else 0
    max_chars = max(char_counts) if char_counts else 0
    
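    # Estimate chunks for different sizes.
    # NOTE: SentenceSplitter measures chunk_size in tokens, while total_chars is a
    # character count, so this figure overestimates the number of chunks.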
    chunk_estimates = {}
    for chunk_size in config.chunk_sizes:
        overlap = int(chunk_size * (config.chunk_overlap_percentage / 100))
        effective_chunk_size = chunk_size - overlap
        estimated_chunks = (total_chars // effective_chunk_size) + len(documents)
        chunk_estimates[chunk_size] = estimated_chunks
    
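    # Get unique document sources. The recorded run counted 0 documents with the
    # 'file_name' key alone, so also accept 'file_path' (the key PyMuPDFReader is
    # assumed to set here).
    unique_sources = set()
    for doc in documents:
        metadata = getattr(doc, 'metadata', None) or {}
        source = metadata.get('file_name') or metadata.get('file_path')
        if source:
            unique_sources.add(source)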
    
    stats = {
        'total_pages': total_pages,
        'unique_documents': len(unique_sources),
        'total_characters': total_chars,
        'estimated_tokens': estimated_tokens,
        'avg_chars_per_page': avg_chars_per_page,
        'min_chars_per_page': min_chars,
        'max_chars_per_page': max_chars,
        'chunk_estimates': chunk_estimates
    }
    
    # Display statistics
    print("\n" + "="*60)
    print("DOCUMENT STATISTICS")
    print("="*60)
    print(f"Total Pages Loaded:        {stats['total_pages']:,}")
    print(f"Unique PDF Documents:      {stats['unique_documents']:,}")
    print(f"\nContent Size:")
    print(f"  Total Characters:        {stats['total_characters']:,}")
    print(f"  Estimated Tokens:        {stats['estimated_tokens']:,}")
    print(f"\nPer-Page Statistics:")
    print(f"  Average Characters:      {stats['avg_chars_per_page']:,.0f}")
    print(f"  Min Characters:          {stats['min_chars_per_page']:,}")
    print(f"  Max Characters:          {stats['max_chars_per_page']:,}")
    print(f"\nEstimated Chunks (with {config.chunk_overlap_percentage}% overlap):")
    for chunk_size, estimated_chunks in stats['chunk_estimates'].items():
        print(f"  Chunk Size {chunk_size:4d}:        ~{estimated_chunks:,} chunks")
    print("="*60)
    
    return stats


# %% [markdown]
# ## 2.6 Execute Dataset Loading Pipeline

# %%
# Load dataset
dataset = load_financebench_dataset(config)

# %%
# Get required PDFs
required_pdfs = get_required_pdfs(dataset)

# %%
# Verify PDFs
available_pdfs, missing_pdfs = verify_pdfs(config.pdf_dir, required_pdfs)

# %%
# Check if we can proceed
if missing_pdfs:
    print("\n⚠️  Warning: Some PDFs are missing. Proceeding with available PDFs only.")
    proceed = input("Continue? (y/n): ").lower().strip() == 'y'
    if not proceed:
        raise SystemExit("Aborted by user")

# %%
# Load documents
documents = load_pdf_documents(config.pdf_dir, available_pdfs)

# %%
# Analyze documents
doc_stats = analyze_documents(documents)

# %%
# Display summary
print(f"\n✓ Dataset loading complete!")
print(f"  - Dataset records: {len(dataset)}")
print(f"  - PDF files loaded: {len(available_pdfs)}")
print(f"  - Document pages: {len(documents)}")
print(f"  - Estimated tokens: {doc_stats['estimated_tokens']:,}")

Loading dataset: PatronusAI/financebench
✓ Loaded 150 records from FinanceBench

Sample record keys:
  - financebench_id
  - company
  - doc_name
  - question_type
  - question_reasoning
  - domain_question_num
  - question
  - answer
  - justification
  - dataset_subset_label
  - evidence
  - gics_sector
  - doc_type
  - doc_period
  - doc_link


✓ Found 84 unique PDF files required


✓ Available: 84 PDFs
✓ All required PDFs are available


✓ Successfully loaded 12013 document pages
Analyzing documents...

DOCUMENT STATISTICS
Total Pages Loaded:        12,013
Unique PDF Documents:      0

Content Size:
  Total Characters:        40,649,449
  Estimated Tokens:        10,162,362

Per-Page Statistics:
  Average Characters:      3,384
  Min Characters:          0
  Max Characters:          10,738

Estimated Chunks (with 15% overlap):
  Chunk Size  512:        ~105,245 chunks

✓ Dataset loading complete!
  - Dataset records: 150
  - PDF files loaded: 84
  - Document pages: 12013
  - Estimated tokens: 10,162,362
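
The chunk estimate printed above divides a character count by a chunk size, while LlamaIndex's `SentenceSplitter` measures `chunk_size` in tokens. A rough token-based variant, using the same 1 token ≈ 4 characters heuristic as the notebook, might look like the sketch below. `estimate_chunks` is a hypothetical helper, and the result is still only a ballpark figure because sentence boundaries and page breaks affect the real split.

```python
def estimate_chunks(total_chars, num_pages, chunk_size, overlap_percentage, chars_per_token=4):
    """Roughly estimate the chunk count when chunk_size is measured in tokens."""
    estimated_tokens = total_chars // chars_per_token
    overlap = int(chunk_size * (overlap_percentage / 100))
    effective_chunk_size = chunk_size - overlap
    # Add one chunk per page as a crude allowance for partially filled final chunks
    return estimated_tokens // effective_chunk_size + num_pages


# Figures taken from the document statistics reported above
print(estimate_chunks(40_649_449, 12_013, chunk_size=512, overlap_percentage=15))
```

For this run the token-based figure is roughly 35,000 chunks, about a third of the character-based estimate of ~105,245.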


In [11]:
# ============================================================================
# Step 3: Document Processing Functions
# ============================================================================

# %% [markdown]
# ## 3.1 Generate Nodes from Documents

# %%
def generate_nodes(
    documents: List[Document],
    chunk_size: int,
    chunk_overlap: int
) -> List[BaseNode]:
    """
    Generate nodes from documents using LlamaIndex SentenceSplitter.
    
    Args:
        documents: List of LlamaIndex Document objects
        chunk_size: Maximum tokens per chunk (SentenceSplitter counts tokens, not characters)
        chunk_overlap: Overlap between consecutive chunks, in tokens
        
    Returns:
        List of nodes
        
    Raises:
        ValueError: If chunk_size or chunk_overlap is invalid
    """
    # Validation
    if chunk_size <= 0:
        raise ValueError("Chunk size must be positive")
    if chunk_overlap < 0:
        raise ValueError("Chunk overlap cannot be negative")
    if chunk_overlap >= chunk_size:
        raise ValueError("Chunk overlap must be less than chunk size")
    
    # Initialize parser
    parser = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    # Generate nodes with progress bar
    print(f"Generating nodes (chunk_size={chunk_size}, overlap={chunk_overlap})...")
    nodes = parser.get_nodes_from_documents(documents, show_progress=True)
    
    print(f"✓ Created {len(nodes):,} nodes")
    
    return nodes


# %% [markdown]
# ## 3.2 Convert Nodes to LangChain Documents

# %%
def nodes_to_langchain_docs(
    nodes: List[BaseNode],
    chunk_size: int,
    keep_node_metadata: bool = True
) -> List[LCDocument]:
    """
    Convert LlamaIndex nodes to LangChain documents.
    
    Args:
        nodes: List of LlamaIndex nodes
        chunk_size: Chunk size (for metadata)
        keep_node_metadata: If True, include original node metadata
        
    Returns:
        List of LangChain Document objects
    """
    lc_docs = []
    
    for node in tqdm(nodes, desc="Converting to LangChain docs"):
        # Base metadata
        metadata = {"chunk_size": chunk_size}
        
        # Add original metadata if requested
        if keep_node_metadata and hasattr(node, 'metadata'):
            metadata.update(node.metadata)
        
        # Create LangChain document
        doc = LCDocument(
            page_content=node.get_content(),
            metadata=metadata
        )
        lc_docs.append(doc)
    
    print(f"✓ Converted {len(lc_docs):,} nodes to LangChain documents")
    
    return lc_docs


# %% [markdown]
# ## 3.3 Analyze Chunks

# %%
def analyze_chunks(lc_docs: List[LCDocument], chunk_size: int) -> Dict:
    """
    Analyze generated chunks and provide statistics.
    
    Args:
        lc_docs: List of LangChain documents
        chunk_size: Configured chunk size
        
    Returns:
        Dictionary containing chunk statistics
    """
    # Calculate statistics
    chunk_lengths = [len(doc.page_content) for doc in lc_docs]
    total_chunks = len(lc_docs)
    total_chars = sum(chunk_lengths)
    avg_length = total_chars / total_chunks if total_chunks > 0 else 0
    min_length = min(chunk_lengths) if chunk_lengths else 0
    max_length = max(chunk_lengths) if chunk_lengths else 0
    
    # Estimate tokens
    estimated_tokens = total_chars // 4
    avg_tokens_per_chunk = estimated_tokens / total_chunks if total_chunks > 0 else 0
    
    # Calculate distribution
    length_ranges = {
        f"< {chunk_size//2}": sum(1 for l in chunk_lengths if l < chunk_size//2),
        f"{chunk_size//2}-{chunk_size}": sum(1 for l in chunk_lengths if chunk_size//2 <= l < chunk_size),
        f"{chunk_size}+": sum(1 for l in chunk_lengths if l >= chunk_size)
    }
    
    stats = {
        'total_chunks': total_chunks,
        'total_characters': total_chars,
        'estimated_tokens': estimated_tokens,
        'avg_length': avg_length,
        'avg_tokens_per_chunk': avg_tokens_per_chunk,
        'min_length': min_length,
        'max_length': max_length,
        'length_distribution': length_ranges
    }
    
    # Display statistics
    print("\n" + "="*60)
    print(f"CHUNK STATISTICS (Chunk Size: {chunk_size})")
    print("="*60)
    print(f"Total Chunks:              {stats['total_chunks']:,}")
    print(f"\nContent Size:")
    print(f"  Total Characters:        {stats['total_characters']:,}")
    print(f"  Estimated Tokens:        {stats['estimated_tokens']:,}")
    print(f"\nChunk Statistics:")
    print(f"  Average Length:          {stats['avg_length']:,.0f} chars ({stats['avg_tokens_per_chunk']:,.0f} tokens)")
    print(f"  Min Length:              {stats['min_length']:,} chars")
    print(f"  Max Length:              {stats['max_length']:,} chars")
    print(f"\nLength Distribution:")
    for range_label, count in stats['length_distribution'].items():
        percentage = (count / total_chunks * 100) if total_chunks > 0 else 0
        print(f"  {range_label:20s}: {count:6,} chunks ({percentage:5.1f}%)")
    print("="*60)
    
    return stats


# %% [markdown]
# ## 3.4 Process All Chunk Sizes

# %%
def process_all_chunk_sizes(
    documents: List[Document],
    config: RAGConfig
) -> Dict[int, Dict]:
    """
    Process documents for all configured chunk sizes.
    
    Args:
        documents: List of LlamaIndex documents
        config: RAGConfig instance
        
    Returns:
        Dictionary mapping chunk_size to processed data
        {chunk_size: {'nodes': nodes, 'lc_docs': lc_docs, 'stats': stats}}
    """
    processed_data = {}
    
    print(f"\nProcessing {len(config.chunk_sizes)} chunk size(s)...")
    print("="*60)
    
    for chunk_size in config.chunk_sizes:
        print(f"\n>>> Processing chunk size: {chunk_size}")
        
        # Calculate overlap
        chunk_overlap = int(chunk_size * (config.chunk_overlap_percentage / 100))
        print(f"Overlap: {chunk_overlap} chars ({config.chunk_overlap_percentage}%)")
        
        # Generate nodes
        nodes = generate_nodes(
            documents=documents,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        
        # Convert to LangChain documents
        lc_docs = nodes_to_langchain_docs(
            nodes=nodes,
            chunk_size=chunk_size,
            keep_node_metadata=config.keep_node_metadata
        )
        
        # Analyze chunks
        stats = analyze_chunks(lc_docs, chunk_size)
        
        # Store processed data
        processed_data[chunk_size] = {
            'nodes': nodes,
            'lc_docs': lc_docs,
            'stats': stats,
            'chunk_overlap': chunk_overlap
        }
        
        print(f"✓ Chunk size {chunk_size} processing complete")
    
    print("\n" + "="*60)
    print("✓ All chunk sizes processed successfully")
    
    return processed_data


# %% [markdown]
# ## 3.5 Execute Document Processing

# %%
# Process all configured chunk sizes
processed_data = process_all_chunk_sizes(documents, config)

# %%
# Display summary
print("\n" + "="*60)
print("PROCESSING SUMMARY")
print("="*60)
for chunk_size, data in processed_data.items():
    stats = data['stats']
    print(f"\nChunk Size {chunk_size}:")
    print(f"  Chunks:          {stats['total_chunks']:,}")
    print(f"  Est. Tokens:     {stats['estimated_tokens']:,}")
    print(f"  Avg per Chunk:   {stats['avg_tokens_per_chunk']:,.0f} tokens")
print("="*60)


Processing 1 chunk size(s)...

>>> Processing chunk size: 512
Overlap: 76 chars (15%)
Generating nodes (chunk_size=512, overlap=76)...


✓ Created 28,657 nodes


✓ Converted 28,657 nodes to LangChain documents

CHUNK STATISTICS (Chunk Size: 512)
Total Chunks:              28,657

Content Size:
  Total Characters:        43,783,209
  Estimated Tokens:        10,945,802

Chunk Statistics:
  Average Length:          1,528 chars (382 tokens)
  Min Length:              0 chars
  Max Length:              4,103 chars

Length Distribution:
  < 256               :    465 chunks (  1.6%)
  256-512             :  1,212 chunks (  4.2%)
  512+                : 26,980 chunks ( 94.1%)
✓ Chunk size 512 processing complete

✓ All chunk sizes processed successfully

PROCESSING SUMMARY

Chunk Size 512:
  Chunks:          28,657
  Est. Tokens:     10,945,802
  Avg per Chunk:   382 tokens
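
To make the relationship between chunk size and overlap concrete, here is a simplified sliding-window splitter over a token list. It is an illustration only: `SentenceSplitter` also respects sentence boundaries and uses a real tokenizer, and `sliding_chunks` is a hypothetical name.

```python
from typing import List, Sequence


def sliding_chunks(tokens: Sequence, chunk_size: int, chunk_overlap: int) -> List[list]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("Chunk size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("Chunk overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Ten toy "tokens", windows of four, one token shared between neighbours
for chunk in sliding_chunks(list(range(10)), chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, which is why a larger overlap produces more chunks for the same text.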


In [19]:
# ============================================================================
# Step 4: Database Inspection
# ============================================================================

# %% [markdown]
# ## 4.1 Inspect Existing Vector Databases

# %%
def inspect_vector_databases(base_dir: str) -> Dict:
    """
    Inspect all existing vector databases and their collections.
    
    Args:
        base_dir: Base directory containing vector databases
        
    Returns:
        Dictionary with database information
    """
    if not os.path.exists(base_dir):
        print(f"No databases found at: {base_dir}")
        return {}
    
    print(f"Inspecting: {base_dir}")
    print("="*60)
    
    databases = {}
    
    # Scan for model-specific directories
    for item in os.listdir(base_dir):
        item_path = os.path.join(base_dir, item)
        if not os.path.isdir(item_path):
            continue
            
        print(f"\nDatabase: {item}")
        collections = {}
        
        # Check if valid ChromaDB exists
        if not os.path.exists(os.path.join(item_path, "chroma.sqlite3")):
            print(f"  No valid ChromaDB found")
            continue
            
        print(f"  Valid ChromaDB detected")
        
        # Try to find collections by scanning common chunk sizes.
        # A placeholder embedding is enough here because counting documents
        # does not embed anything; it is created once, outside the loop.
        dummy_embedding = OllamaEmbeddings(model="nomic-embed-text")
        
        for chunk_size in [128, 256, 512, 1024, 2048]:
            collection_name = f"financebench_docs_chunk_{chunk_size}"
            
            try:
                vectorstore = Chroma(
                    collection_name=collection_name,
                    embedding_function=dummy_embedding,
                    persist_directory=item_path
                )
                
                # Only report collections that actually contain documents
                count = vectorstore._collection.count()
                if count > 0:
                    collections[collection_name] = {
                        'chunk_size': chunk_size,
                        'document_count': count
                    }
                    print(f"    • {collection_name}: {count:,} documents")
            except Exception:
                pass  # Collection could not be opened; skip it
        
        databases[item] = {
            'path': item_path,
            'collections': collections,
            'total_documents': sum(c['document_count'] for c in collections.values())
        }
    
    return databases


# %% [markdown]
# ## 4.2 Display Summary

# %%
def display_database_summary(databases: Dict):
    """Display a summary of all databases and collections."""
    if not databases:
        print("\nNo existing vector databases found")
        return
    
    print("\n" + "="*60)
    print("VECTOR DATABASE SUMMARY")
    print("="*60)
    
    total_collections = 0
    total_documents = 0
    
    for db_name, db_info in databases.items():
        print(f"\n{db_name}")
        print(f"  Path: {db_info['path']}")
        print(f"  Collections: {len(db_info['collections'])}")
        print(f"  Documents: {db_info['total_documents']:,}")
        
        if db_info['collections']:
            for coll_name, coll_info in db_info['collections'].items():
                print(f"    • Chunk {coll_info['chunk_size']}: {coll_info['document_count']:,} docs")
        
        total_collections += len(db_info['collections'])
        total_documents += db_info['total_documents']
    
    print(f"\n{'='*60}")
    print(f"Total: {len(databases)} database(s), {total_collections} collection(s), {total_documents:,} documents")
    print("="*60)


# %% [markdown]
# ## 4.3 Execute Inspection

# %%
# Inspect existing databases
existing_databases = inspect_vector_databases(config.vector_db_dir)

# %%
# Display summary
display_database_summary(existing_databases)

# %%
# Store for later use
print(f"\nYour current configuration will create:")
print(f"  Database: {config.get_model_identifier()}")
print(f"  Location: {config.get_vector_db_dir()}")
print(f"  Collections: {[f'chunk_{cs}' for cs in config.chunk_sizes]}")

Inspecting: ../../vector_databases

No existing vector databases found

Your current configuration will create:
  Database: ollama_nomic-embed-text
  Location: ../../vector_databases/ollama_nomic-embed-text
  Collections: ['chunk_512']
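
The inspection step treats a subdirectory as a Chroma database when it contains a `chroma.sqlite3` file. That check can be sketched on its own with `pathlib`, without opening any collection; `find_chroma_dirs` is a hypothetical helper and the demo directory names are made up.

```python
import tempfile
from pathlib import Path
from typing import List


def find_chroma_dirs(base_dir: str) -> List[str]:
    """Return the names of subdirectories that contain a chroma.sqlite3 file."""
    base = Path(base_dir)
    if not base.is_dir():
        return []
    return sorted(
        child.name
        for child in base.iterdir()
        if child.is_dir() and (child / "chroma.sqlite3").is_file()
    )


# Demo on a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ollama_nomic-embed-text").mkdir()
    (Path(tmp) / "ollama_nomic-embed-text" / "chroma.sqlite3").touch()
    (Path(tmp) / "not_a_database").mkdir()
    found = find_chroma_dirs(tmp)
    print(found)
```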


In [20]:
# ============================================================================
# Step 5: Vector Store Population
# ============================================================================

# %% [markdown]
# ## 5.1 Helper Functions

# %%
def clear_vector_db(db_path: str, max_attempts: int = 5, delay: float = 1.0) -> bool:
    """Clear existing vector database directory with retry logic."""
    if not os.path.exists(db_path):
        return True
    
    print(f"Clearing: {db_path}")
    
    for attempt in range(max_attempts):
        try:
            # Do not pass ignore_errors=True: it would swallow the PermissionError
            # that the retry logic below depends on
            shutil.rmtree(db_path)
            print(f"✓ Cleared successfully")
            return True
        except PermissionError:
            print(f"Attempt {attempt + 1}/{max_attempts} failed")
            if attempt < max_attempts - 1:
                time.sleep(delay)
        except Exception as e:
            print(f"Error: {e}")
            return False
    
    return False


def check_existing_collections(
    persist_directory: str,
    collection_name: str,
    embedding_function
) -> Tuple[bool, int]:
    """Check if collection exists and get document count."""
    if not os.path.exists(persist_directory):
        return False, 0
    
    try:
        vectorstore = Chroma(
            collection_name=collection_name,
            embedding_function=embedding_function,
            persist_directory=persist_directory
        )
        count = vectorstore._collection.count()
        return count > 0, count
    except Exception:
        return False, 0


# %% [markdown]
# ## 5.2 Populate Single Chunk Size

# %%
def populate_vector_store_for_chunk_size(
    lc_docs: List[LCDocument],
    chunk_size: int,
    config: RAGConfig,
    embedding_function
) -> Dict:
    """Populate vector store for a specific chunk size."""
    collection_name = f"{config.collection_name_prefix}{chunk_size}"
    persist_directory = config.get_vector_db_dir()
    
    print(f"\n{'='*60}")
    print(f"Chunk Size {chunk_size}")
    print(f"{'='*60}")
    print(f"Collection: {collection_name}")
    print(f"Documents: {len(lc_docs):,}")
    
    # Create directory
    os.makedirs(persist_directory, exist_ok=True)
    
    # Initialize vector store
    vectorstore = Chroma(
        collection_name=collection_name,
        embedding_function=embedding_function,
        persist_directory=persist_directory
    )
    
    # Add documents in batches
    total_docs = len(lc_docs)
    batch_size = config.batch_size
    num_batches = (total_docs + batch_size - 1) // batch_size
    
    print(f"Adding in {num_batches} batch(es)...")
    
    added_count = 0
    failed_batches = []
    
    with tqdm(total=total_docs, desc="Progress") as pbar:
        for batch_idx in range(num_batches):
            batch_start = batch_idx * batch_size
            batch_end = min(batch_start + batch_size, total_docs)
            batch = lc_docs[batch_start:batch_end]
            
            try:
                vectorstore.add_documents(batch)
                added_count += len(batch)
                pbar.update(len(batch))
            except Exception as e:
                failed_batches.append((batch_idx, str(e)))
                print(f"\nBatch {batch_idx + 1} failed: {e}")
                pbar.update(len(batch))
    
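    # Persist explicitly where the API still offers it; newer Chroma/LangChain
    # releases write to disk automatically and may not expose persist()
    if hasattr(vectorstore, "persist"):
        vectorstore.persist()
    final_count = vectorstore._collection.count()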
    
    print(f"\n✓ Added: {added_count:,} / {total_docs:,}")
    print(f"✓ Final count: {final_count:,}")
    
    return {
        'chunk_size': chunk_size,
        'collection_name': collection_name,
        'status': 'completed' if not failed_batches else 'partial',
        'total_documents': total_docs,
        'added_documents': added_count,
        'final_count': final_count,
        'failed_batches': len(failed_batches)
    }


# %% [markdown]
# ## 5.3 Smart Populate (Skip Existing)

# %%
def smart_populate_vector_stores(
    processed_data: Dict[int, Dict],
    config: RAGConfig,
    force_recreate: bool = False
) -> Dict[int, Dict]:
    """
    Populate vector stores, skipping existing collections.
    
    Args:
        processed_data: Dictionary from Step 3
        config: RAGConfig instance
        force_recreate: If True, recreate existing collections
        
    Returns:
        Population statistics
    """
    # Initialize embedding
    print(f"Embedding: {config.embedding_provider}")
    if config.embedding_provider == "ollama":
        print(f"Model: {config.ollama_model}")
    else:
        print(f"Model: {config.openai_model}")
    
    embedding_function = config.get_embedding_function()
    print("✓ Initialized")
    
    db_dir = config.get_vector_db_dir()
    print(f"Database: {db_dir}")
    
    # Clear if force recreate
    if force_recreate or config.clear_existing_db:
        print("\n⚠️  Force recreate enabled")
        if not clear_vector_db(db_dir):
            raise RuntimeError("Failed to clear database")
    
    # Check existing collections
    print("\nChecking existing collections...")
    all_stats = {}
    to_process = []
    skipped = []
    
    for chunk_size in config.chunk_sizes:
        collection_name = f"{config.collection_name_prefix}{chunk_size}"
        exists, count = check_existing_collections(
            db_dir, collection_name, embedding_function
        )
        
        if exists and not force_recreate:
            print(f"  ✓ chunk_{chunk_size}: EXISTS ({count:,} docs) - SKIP")
            skipped.append(chunk_size)
            all_stats[chunk_size] = {
                'chunk_size': chunk_size,
                'collection_name': collection_name,
                'status': 'skipped',
                'document_count': count
            }
        else:
            status = "RECREATE" if exists else "CREATE"
            print(f"  → chunk_{chunk_size}: {status}")
            to_process.append(chunk_size)
    
    if not to_process:
        print("\n✓ All collections exist. Nothing to do.")
        return all_stats
    
    print(f"\nProcessing {len(to_process)} chunk size(s)...")
    
    # Process each chunk size
    for chunk_size in to_process:
        if chunk_size not in processed_data:
            print(f"\n✗ No data for chunk size {chunk_size}")
            continue
        
        lc_docs = processed_data[chunk_size]['lc_docs']
        
        stats = populate_vector_store_for_chunk_size(
            lc_docs=lc_docs,
            chunk_size=chunk_size,
            config=config,
            embedding_function=embedding_function
        )
        
        all_stats[chunk_size] = stats
        
        if len(to_process) > 1:
            time.sleep(1)
    
    return all_stats


# %% [markdown]
# ## 5.4 Display Results

# %%
def display_population_summary(stats: Dict[int, Dict], config: RAGConfig):
    """Display summary of population results."""
    print("\n" + "="*60)
    print("POPULATION SUMMARY")
    print("="*60)
    print(f"Database: {config.get_model_identifier()}")
    print(f"Location: {config.get_vector_db_dir()}")
    
    total_docs = 0
    for chunk_size, stat in stats.items():
        status = "✓" if stat['status'] in ['completed', 'skipped'] else "⚠"
        doc_count = stat.get('final_count', stat.get('document_count', 0))
        print(f"\n{status} Chunk {chunk_size}: {stat['status'].upper()}")
        print(f"    Documents: {doc_count:,}")
        total_docs += doc_count
    
    print(f"\nTotal: {total_docs:,} documents")
    print("="*60)


# %% [markdown]
# ## 5.5 Execute Population

# %%
# Populate with smart skip logic
population_stats = smart_populate_vector_stores(
    processed_data=processed_data,
    config=config,
    force_recreate=False  # Set True to recreate all
)

# %%
# Display results
display_population_summary(population_stats, config)

# %%
print("\n✓ Population complete!")

Embedding: ollama
Model: nomic-embed-text
✓ Initialized
Database: ../../vector_databases/ollama_nomic-embed-text

Checking existing collections...
  → chunk_512: CREATE

Processing 1 chunk size(s)...

Chunk Size 512
Collection: financebench_docs_chunk_512
Documents: 28,657
Adding in 58 batch(es)...


Progress:   0%|          | 0/28657 [00:00<?, ?it/s]

2025-10-05 17:52:59,531 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"



Batch 1 failed: Query error: Database error: error returned from database: (code: 1032) attempt to write a readonly database


KeyboardInterrupt: 
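
The `(code: 1032)` in the failed batch above can be decoded, assuming the Chroma backend passes SQLite's extended result code through unchanged. SQLite keeps the primary code in the low byte, so 1032 reads as `SQLITE_READONLY` (8) with the `DBMOVED` qualifier, which SQLite reports when the database file was renamed or removed after the connection opened it. In a notebook this can happen when the database directory is deleted (for example by `clear_vector_db`) while an earlier `Chroma` client in the same kernel still holds the old file; restarting the kernel before repopulating is a common workaround. `describe_sqlite_code` is a hypothetical helper for illustration, not part of Chroma or LangChain:

```python
# Primary SQLite result code for read-only failures
SQLITE_READONLY = 8

# Extended read-only codes: SQLITE_READONLY | (n << 8)
READONLY_EXTENDED = {
    264: "SQLITE_READONLY_RECOVERY",
    520: "SQLITE_READONLY_CANTLOCK",
    776: "SQLITE_READONLY_ROLLBACK",
    1032: "SQLITE_READONLY_DBMOVED",
    1288: "SQLITE_READONLY_CANTINIT",
    1544: "SQLITE_READONLY_DIRECTORY",
}


def describe_sqlite_code(code: int) -> str:
    """Name a SQLite read-only result code; other codes are reported by primary value."""
    primary = code & 0xFF  # the low byte holds the primary result code
    if primary != SQLITE_READONLY:
        return f"primary code {primary} (not a read-only error)"
    return READONLY_EXTENDED.get(code, "SQLITE_READONLY")


print(describe_sqlite_code(1032))
```

If the decoded name is `SQLITE_READONLY_DBMOVED`, file permissions are probably not the cause, so changing them is unlikely to help.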

In [None]:
# ============================================================================
# Step 6: Create Additional Embeddings (Optional)
# ============================================================================

# %% [markdown]
# ## 6.1 Create New Configuration for Different Embedding
# 
# Use this step to create embeddings with a different model
# without reloading or re-chunking the documents from Steps 2-3.

# %%
def create_embedding_config(
    base_config: RAGConfig,
    embedding_provider: str,
    model_name: str,
    chunk_sizes: List[int] = None,
    clear_existing: bool = False
) -> RAGConfig:
    """
    Create configuration for a different embedding model.
    
    Args:
        base_config: Your existing RAGConfig
        embedding_provider: "ollama" or "openai"
        model_name: Model name
        chunk_sizes: Which chunk sizes to process (None = all)
        clear_existing: Whether to clear existing database
        
    Returns:
        New RAGConfig instance
    """
    new_config = RAGConfig(
        # Copy dataset settings
        dataset_name=base_config.dataset_name,
        dataset_split=base_config.dataset_split,
        pdf_dir=base_config.pdf_dir,
        vector_db_dir=base_config.vector_db_dir,
        
        # Chunking settings
        chunk_sizes=chunk_sizes if chunk_sizes else base_config.chunk_sizes,
        chunk_overlap_percentage=base_config.chunk_overlap_percentage,
        
        # NEW embedding settings
        embedding_provider=embedding_provider,
        ollama_model=model_name if embedding_provider == "ollama" else base_config.ollama_model,
        openai_model=model_name if embedding_provider == "openai" else base_config.openai_model,
        
        # Other settings
        collection_name_prefix=base_config.collection_name_prefix,
        batch_size=base_config.batch_size,
        clear_existing_db=clear_existing,
        keep_node_metadata=base_config.keep_node_metadata
    )
    
    print("="*60)
    print("NEW EMBEDDING CONFIGURATION")
    print("="*60)
    print(f"Provider: {new_config.embedding_provider}")
    print(f"Model: {model_name}")
    print(f"Database: {new_config.get_vector_db_dir()}")
    print(f"Chunk sizes: {new_config.chunk_sizes}")
    print("="*60)
    
    return new_config


# %% [markdown]
# ## 6.2 Example 1: Create OpenAI Embeddings

# %%
# Create OpenAI configuration
new_config = create_embedding_config(
    base_config=config,
    embedding_provider="openai",
    model_name="text-embedding-3-small",
    chunk_sizes=[512],  # Or None for all chunk sizes, or [512, 1024] for specific ones
    clear_existing=False
)

# Validate
is_valid, errors = new_config.validate()
if not is_valid:
    print("\n❌ Configuration errors:")
    for error in errors:
        print(f"  - {error}")
else:
    print("\n✓ Configuration valid")

# %% [markdown]
# ## 6.3 Example 2: Create Different Ollama Model

# %%
# Uncomment to use a different Ollama model
# new_config = create_embedding_config(
#     base_config=config,
#     embedding_provider="ollama",
#     model_name="mxbai-embed-large",  # Or any other Ollama model
#     chunk_sizes=[512, 1024],
#     clear_existing=False
# )

# # Validate
# is_valid, errors = new_config.validate()
# if not is_valid:
#     print("\n❌ Configuration errors:")
#     for error in errors:
#         print(f"  - {error}")
# else:
#     print("\n✓ Configuration valid")

# %% [markdown]
# ## 6.4 Verify Chunk Sizes Match

# %%
# Check what chunk sizes are available in processed_data
print("Available chunk sizes in processed_data:", list(processed_data.keys()))
print("Requested chunk sizes in new_config:", new_config.chunk_sizes)

# Make sure they match!
missing = set(new_config.chunk_sizes) - set(processed_data.keys())
if missing:
    print(f"\n⚠️  Warning: These chunk sizes are not in processed_data: {missing}")
    print("You need to either:")
    print("  1. Adjust new_config.chunk_sizes to match available sizes")
    print("  2. Or go back to Step 3 and process those chunk sizes")
else:
    print("\n✓ All requested chunk sizes are available")

# %% [markdown]
# ## 6.5 Execute Population for New Embedding

# %%
# Populate with the new embedding model
new_population_stats = smart_populate_vector_stores(
    processed_data=processed_data,
    config=new_config,
    force_recreate=False
)

# %%
# Display results
display_population_summary(new_population_stats, new_config)

# %% [markdown]
# ## 6.6 Final Overview of All Databases

# %%
# Re-inspect to see all databases
print("\n" + "="*60)
print("ALL DATABASES OVERVIEW")
print("="*60)

all_databases = inspect_vector_databases(config.vector_db_dir)
display_database_summary(all_databases)

# %% [markdown]
# ## 6.7 Quick Reference: Common Embedding Models
# 
# **Ollama Models:**
# - `nomic-embed-text` - Fast, good quality, 768 dimensions
# - `mxbai-embed-large` - Larger, better quality, 1024 dimensions
# - `all-minilm` - Small, fast, 384 dimensions
# 
# **OpenAI Models:**
# - `text-embedding-3-small` - Cost-effective, 1536 dimensions
# - `text-embedding-3-large` - Higher quality, 3072 dimensions
# - `text-embedding-ada-002` - Legacy model, 1536 dimensions

TypeError: RAGConfig.__init__() got an unexpected keyword argument 'vector_db_base_dir'. Did you mean 'vector_db_dir'?
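
The `TypeError` above is raised when the constructor receives a keyword that `RAGConfig` does not define (`vector_db_base_dir` instead of `vector_db_dir`). If `RAGConfig` is a dataclass (an assumption; its definition is not shown in this section), `dataclasses.replace` copies every field and overrides only the named ones, so the copied fields cannot drift out of sync with the class. A minimal sketch, where `DemoConfig` and `derive_config` are hypothetical stand-ins and the default values are placeholders:

```python
from dataclasses import dataclass, field, fields, replace
from typing import List


@dataclass
class DemoConfig:
    """Hypothetical stand-in for RAGConfig with a few of its fields."""
    vector_db_dir: str = "../../vector_databases"
    embedding_provider: str = "ollama"
    ollama_model: str = "nomic-embed-text"
    openai_model: str = "text-embedding-3-small"
    chunk_sizes: List[int] = field(default_factory=lambda: [512])


def derive_config(base: DemoConfig, **overrides) -> DemoConfig:
    """Copy base, overriding only the named fields; reject unknown names early."""
    known = {f.name for f in fields(base)}
    unknown = sorted(set(overrides) - known)
    if unknown:
        raise ValueError(f"Unknown config field(s): {unknown}; known: {sorted(known)}")
    return replace(base, **overrides)


base = DemoConfig()
openai_cfg = derive_config(base, embedding_provider="openai", chunk_sizes=[512])
print(openai_cfg.embedding_provider, openai_cfg.vector_db_dir)
```

`replace` makes a shallow copy, so a list field that is not overridden is shared between the old and new config; pass a fresh list if the two should be edited independently.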

In [25]:
# ============================================================================
# Step 7: Testing and Querying Your Vector Stores
# ============================================================================

# %% [markdown]
# ## 7.1 Load Vector Store for Querying

# %%
def load_vector_store(
    config: RAGConfig,
    chunk_size: int
) -> Chroma:
    """
    Load an existing vector store for querying.
    
    Args:
        config: RAGConfig instance
        chunk_size: Which chunk size to load
        
    Returns:
        Chroma vectorstore instance
    """
    collection_name = f"{config.collection_name_prefix}{chunk_size}"
    persist_directory = config.get_vector_db_dir()
    
    print("Loading vector store:")
    print(f"  Database: {persist_directory}")
    print(f"  Collection: {collection_name}")
    
    # Get embedding function
    embedding_function = config.get_embedding_function()
    
    # Load vector store
    vectorstore = Chroma(
        collection_name=collection_name,
        embedding_function=embedding_function,
        persist_directory=persist_directory
    )
    
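    doc_count = vectorstore._collection.count()
    print(f"  Documents: {doc_count:,}")
    if doc_count == 0:
        # Chroma creates a missing collection on access, so an empty one
        # usually means the population step has not run for this chunk size
        print("⚠️  Collection is empty; run the population step first")
    else:
        print("✓ Loaded successfully")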
    
    return vectorstore


# %% [markdown]
# ## 7.2 Simple Similarity Search

# %%
def test_similarity_search(
    vectorstore: Chroma,
    query: str,
    k: int = 5
) -> List:
    """
    Perform similarity search and display results.
    
    Args:
        vectorstore: Loaded Chroma vectorstore
        query: Search query
        k: Number of results to return
        
    Returns:
        List of documents
    """
    print(f"\n{'='*60}")
    print("SIMILARITY SEARCH")
    print(f"{'='*60}")
    print(f"Query: {query}")
    print(f"Top-{k} results:")
    print(f"{'='*60}")
    
    # Perform search
    docs = vectorstore.similarity_search(query, k=k)
    
    # Display results
    for i, doc in enumerate(docs, 1):
        print(f"\n[Result {i}]")
        print(f"Content: {doc.page_content[:200]}...")
        print(f"Metadata: {doc.metadata}")
        print("-" * 60)
    
    return docs


# %% [markdown]
# ## 7.3 Similarity Search with Scores

# %%
def test_similarity_search_with_scores(
    vectorstore: Chroma,
    query: str,
    k: int = 5
) -> List:
    """
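    Perform similarity search with distance scores.
    
    For Chroma, similarity_search_with_score returns a distance, so a lower
    score means a closer match.
    
    Args:
        vectorstore: Loaded Chroma vectorstore
        query: Search query
        k: Number of results to return
        
    Returns:
        List of (document, distance) tuples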
    """
    print(f"\n{'='*60}")
    print("SIMILARITY SEARCH WITH SCORES")
    print(f"{'='*60}")
    print(f"Query: {query}")
    print(f"Top-{k} results:")
    print(f"{'='*60}")
    
    # Perform search with scores
    results = vectorstore.similarity_search_with_score(query, k=k)
    
    # Display results
    for i, (doc, score) in enumerate(results, 1):
        print(f"\n[Result {i}] Score: {score:.4f}")
        print(f"Content: {doc.page_content[:200]}...")
        
        # Show relevant metadata
        if 'file_name' in doc.metadata:
            print(f"File: {doc.metadata['file_name']}")
        if 'page_label' in doc.metadata:
            print(f"Page: {doc.metadata['page_label']}")
        if 'chunk_size' in doc.metadata:
            print(f"Chunk Size: {doc.metadata['chunk_size']}")
        
        print("-" * 60)
    
    return results


# %% [markdown]
# ## 7.4 Compare Different Chunk Sizes

# %%
def compare_chunk_sizes(
    config: RAGConfig,
    query: str,
    chunk_sizes: List[int] = None,
    k: int = 3
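) -> Dict[int, List]:
    """
    Compare retrieval results across different chunk sizes.
    
    Args:
        config: RAGConfig instance
        query: Search query
        chunk_sizes: List of chunk sizes to compare (None = all)
        k: Number of results per chunk size
        
    Returns:
        Dictionary mapping each chunk size to its (document, score) results;
        chunk sizes that fail to load are omitted
    """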
    if chunk_sizes is None:
        chunk_sizes = config.chunk_sizes
    
    print(f"\n{'='*60}")
    print("COMPARING CHUNK SIZES")
    print(f"{'='*60}")
    print(f"Query: {query}")
    print(f"{'='*60}")
    
    results = {}
    
    for chunk_size in chunk_sizes:
        print(f"\n--- Chunk Size: {chunk_size} ---")
        
        try:
            # Load vector store
            vectorstore = load_vector_store(config, chunk_size)
            
            # Search
            docs_with_scores = vectorstore.similarity_search_with_score(query, k=k)
            
            results[chunk_size] = docs_with_scores
            
            # Display top result
            if docs_with_scores:
                doc, score = docs_with_scores[0]
                print(f"\nTop Result (Score: {score:.4f}):")
                print(f"{doc.page_content[:300]}...")
            
        except Exception as e:
            print(f"Error loading chunk size {chunk_size}: {e}")
    
    return results


# %% [markdown]
# ## 7.5 Test with FinanceBench Questions

# %%
def test_with_financebench_questions(
    vectorstore: Chroma,
    dataset,
    num_questions: int = 5,
    k: int = 3
):
    """
    Test retrieval using actual FinanceBench questions.
    
    Args:
        vectorstore: Loaded Chroma vectorstore
        dataset: FinanceBench dataset
        num_questions: Number of questions to test
        k: Number of documents to retrieve per question
    """
    print(f"\n{'='*60}")
    print("TESTING WITH FINANCEBENCH QUESTIONS")
    print(f"{'='*60}")
    
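    import random
    # Clamp the sample size: random.sample raises ValueError when asked for
    # more items than the dataset contains
    num_questions = min(num_questions, len(dataset))
    sample_indices = random.sample(range(len(dataset)), num_questions)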
    
    for idx in sample_indices:
        record = dataset[idx]
        question = record['question']
        answer = record['answer']
        company = record['company']
        
        print(f"\n{'='*60}")
        print(f"Company: {company}")
        print(f"Question: {question}")
        print(f"Expected Answer: {answer}")
        print(f"{'='*60}")
        
        # Retrieve relevant documents
        docs = vectorstore.similarity_search(question, k=k)
        
        print(f"\nRetrieved {len(docs)} documents:")
        for i, doc in enumerate(docs, 1):
            print(f"\n[Doc {i}]")
            print(f"{doc.page_content[:200]}...")
            if 'file_name' in doc.metadata:
                print(f"Source: {doc.metadata['file_name']}")
        
        print("\n" + "-"*60)


# %% [markdown]
# ## 7.6 Execute Tests

# %%
# Load a vector store (using chunk size 512 as example)
vectorstore = load_vector_store(config, chunk_size=512)

# %% [markdown]
# ### Test 1: Simple Similarity Search

# %%
# Test with a financial question
query = "What was the capital expenditure in 2018?"
docs = test_similarity_search(vectorstore, query, k=3)

# %% [markdown]
# ### Test 2: Similarity Search with Scores

# %%
# Test with scores
query = "What is the total revenue for fiscal year 2022?"
results = test_similarity_search_with_scores(vectorstore, query, k=5)

# %% [markdown]
# ### Test 3: Compare Different Chunk Sizes

# %%
# Compare retrieval across chunk sizes
query = "What were the operating expenses in 2021?"
comparison_results = compare_chunk_sizes(
    config=config,
    query=query,
    chunk_sizes=[512],  # Add more if you have them: [256, 512, 1024]
    k=3
)

# %% [markdown]
# ### Test 4: Test with Real FinanceBench Questions

# %%
# Test with actual questions from the dataset
test_with_financebench_questions(
    vectorstore=vectorstore,
    dataset=dataset,
    num_questions=3,
    k=3
)

# %% [markdown]
# ## 7.7 Custom Query Helper

# %%
def quick_query(query: str, chunk_size: int = 512, k: int = 5):
    """
    Quick helper function for ad-hoc queries.
    
    Args:
        query: Your question
        chunk_size: Which chunk size to use
        k: Number of results
    """
    vs = load_vector_store(config, chunk_size)
    results = test_similarity_search_with_scores(vs, query, k)
    return results

# %% [markdown]
# ## 7.8 Try Your Own Queries
# 
# Use the quick_query function for ad-hoc testing:

# %%
# Example: Your custom query
# results = quick_query(
#     query="What is the revenue growth rate?",
#     chunk_size=512,
#     k=5
# )

Loading vector store:
  Database: ../../vector_databases/ollama_nomic-embed-text
  Collection: financebench_docs_chunk_512
  Documents: 2,000
✓ Loaded successfully

SIMILARITY SEARCH
Query: What was the capital expenditure in 2018?
Top-3 results:


2025-10-05 18:02:44,236 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-10-05 18:02:44,272 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-10-05 18:02:44,327 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-10-05 18:02:44,355 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-10-05 18:02:44,395 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"
2025-10-05 18:02:44,416 - INFO - HTTP Request: POST http://localhost:11434/api/embed "HTTP/1.1 200 OK"



[Result 1]
Content: Other Division Information 
Total assets and capital spending of each division are as follows:
 
Total Assets
Capital Spending
 
2022
2021
2022
2021
2020
FLNA
$
11,042 
$
9,763 
$
1,464 
$
1,411 
$
1,...
Metadata: {'total_pages': 503, 'chunk_size': 512, 'file_path': '../../financebench/documents/PEPSICO_2022_10K.pdf', 'source': '73'}
------------------------------------------------------------

[Result 2]
Content: 2018, respectively
—
 
—
Additional paid-in capital
11,174
 
10,963
Less: Treasury stock, at cost, 428,676,471 shares at December 31, 2019 and December 31, 2018
(5,563)  
(5,563)
Retained earnings
7,8...
Metadata: {'chunk_size': 512, 'source': '69', 'total_pages': 198, 'file_path': '../../financebench/documents/ACTIVISIONBLIZZARD_2019_10K.pdf'}
------------------------------------------------------------

[Result 3]
Content: The deferred revenue is tracked on a per-customer contract-unit basis. As customers take delivery of the committed volumes under the