# General Tips
## Using virtual environments
**Step 1:** `cd` to the desired directory and create a virtual environment with `python3 -m venv myenv`. (On Windows, run `py -3.13 -m venv myenv` to use a specific Python version.)

Check which Python versions are installed with `py -0` on Windows, or check the default version with `python3 --version` on Linux.

**Step 2:** Activate the environment with `source myenv/bin/activate` (Linux) or `myenv\Scripts\activate` (Windows).

**Step 3:** Install any needed packages, e.g. `pip install requests pandas`. Better still, list them in a `requirements.txt` file and run `pip install -r requirements.txt`.

**Step 4:** List all installed packages with `pip list`.
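
If you want the same information as `pip list` from inside Python (for example in a notebook cell), the standard library can enumerate installed distributions. This is a minimal sketch; the helper name `installed_packages` is an illustrative assumption, not part of pip.

```python
from importlib.metadata import distributions


def installed_packages() -> dict:
    """Map each installed distribution name to its version string."""
    packages = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata is incomplete
            packages[name] = dist.metadata.get("Version", "unknown")
    return packages


# Print the packages visible to the current interpreter, sorted by name.
for name, version in sorted(installed_packages().items(), key=lambda kv: kv[0].lower()):
    print(f"{name}=={version}")
```

Run inside an activated environment, this lists only what that environment can import, which is a quick way to confirm Step 3 worked.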

## Connecting the Jupyter Notebook to the virtual env
1. Make sure that `myenv` is activated (`myenv\Scripts\activate` on Windows, `source myenv/bin/activate` on Linux)
2. Run this inside the virtual environment: `pip install ipykernel`
3. Still inside the environment: `python -m ipykernel install --user --name=myenv --display-name "Whatever Python Kernel Name"`
   
   `--name=myenv`: internal identifier for the kernel
   
   `--display-name`: name that shows up in the VS Code kernel picker
4. Open VS Code and select the kernel.

   At the top-right, click "Select Kernel", then look for “Whatever Python Kernel Name” and pick it.
5. If you don’t see it right away, try reloading VS Code or running Reload Window from the Command Palette (Ctrl+Shift+P).
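
Once a kernel is selected, it can be worth checking from a notebook cell that it really runs inside the environment. The sketch below relies only on `sys.prefix` and `sys.base_prefix`, which differ inside a `venv`; the helper names `running_in_venv` and `kernel_uses_env` are hypothetical.

```python
import sys
from pathlib import Path


def running_in_venv() -> bool:
    """True when the interpreter was started from a virtual environment."""
    return sys.prefix != sys.base_prefix


def kernel_uses_env(env_dir: str) -> bool:
    """True when the active environment's root is env_dir."""
    return Path(sys.prefix).resolve() == Path(env_dir).resolve()


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", running_in_venv())
```

Calling `kernel_uses_env` with the path to `myenv` should return `True` when the right kernel is active.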

## Useful Commands
1. Use `py -0` to list the Python installations available on Windows

## Step 0: Setup Global Variables

In [1]:
embedding_model = "nomic-embed-text"
chunk_sizes = [128, 256, 512, 1024]
chunk_overlap_percentage = 20
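
The overlap percentage is converted to a token count per chunk size further down, in `populate_vector_store`, with `int(chunk_size * (chunk_overlap_percentage / 100))`. As a worked example of that arithmetic (the helper name `overlap_tokens` is an assumption for illustration):

```python
chunk_sizes = [128, 256, 512, 1024]
chunk_overlap_percentage = 20


def overlap_tokens(chunk_size: int, percentage: int) -> int:
    """Convert an overlap percentage into a whole number of tokens (rounded down)."""
    return int(chunk_size * (percentage / 100))


overlaps = {size: overlap_tokens(size, chunk_overlap_percentage) for size in chunk_sizes}
print(overlaps)  # {128: 25, 256: 51, 512: 102, 1024: 204}
```

Because the result is truncated, the effective overlap is slightly below 20% for every size listed here.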

## Step 1: Load the Dataset

In [2]:
from datasets import load_dataset

ds = load_dataset("PatronusAI/financebench", split="train")

# Define PDF directory path
pdf_dir = "../pdfs"


In [3]:
print("Records: ", len(ds))
print("Keys: ", ds[0])  # prints the full first record; use list(ds[0].keys()) for the field names only
print("Dataset [0]: ", ds[0])

# print("List of document links:")
# counter = 0
# for doc in ds:
#     counter += 1
#     print(f"{counter}: {doc['doc_link']}")

Records:  150
Keys:  {'financebench_id': 'financebench_id_03029', 'company': '3M', 'doc_name': '3M_2018_10K', 'question_type': 'metrics-generated', 'question_reasoning': 'Information extraction', 'domain_question_num': None, 'question': 'What is the FY2018 capital expenditure amount (in USD millions) for 3M? Give a response to the question by relying on the details shown in the cash flow statement.', 'answer': '$1577.00', 'justification': 'The metric capital expenditures was directly extracted from the company 10K. The line item name, as seen in the 10K, was: Purchases of property, plant and equipment (PP&E).', 'dataset_subset_label': 'OPEN_SOURCE', 'evidence': [{'evidence_text': 'Table of Contents \n3M Company and Subsidiaries\nConsolidated Statement of Cash Flow s\nYears ended December 31\n \n(Millions)\n \n2018\n \n2017\n \n2016\n \nCash Flows from Operating Activities\n \n \n \n \n \n \n \nNet income including noncontrolling interest\n \n$\n5,363 \n$\n4,869 \n$\n5,058 \nAdjustments

## Verify that all of the PDFs are in place

In [4]:
import os

# Track missing and unique PDF filenames
unique_pdfs = set()
missing_pdfs = []

# Collect unique PDF filenames from the dataset
for record in ds:
    pdf_filename = record["doc_name"] + ".pdf"
    unique_pdfs.add(pdf_filename)

# Check for existence of each unique PDF
for pdf_filename in unique_pdfs:
    pdf_path = os.path.join(pdf_dir, pdf_filename)
    if not os.path.isfile(pdf_path):
        missing_pdfs.append(pdf_filename)

# Report
print(f"Total unique PDF files required: {len(unique_pdfs)}")
print(f"Total missing PDF files: {len(missing_pdfs)}")

if missing_pdfs:
    print("Missing PDF files:")
    for missing_file in missing_pdfs:
        print(" -", missing_file)
else:
    print("All required PDF files are present.")


Total unique PDF files required: 84
Total missing PDF files: 0
All required PDF files are present.
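
The same check can be expressed as a set difference, which may be easier to reuse. This is only a sketch: `find_missing_pdfs` is a hypothetical helper, and it assumes, as the cell above does, that each file is named `<doc_name>.pdf`.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_pdfs(doc_names: Iterable[str], pdf_dir: str) -> List[str]:
    """Return the required PDF filenames that are not present in pdf_dir."""
    required = {f"{name}.pdf" for name in doc_names}          # duplicates collapse here
    present = {p.name for p in Path(pdf_dir).glob("*.pdf")}   # files actually on disk
    return sorted(required - present)
```

With the dataset loaded, `find_missing_pdfs((r["doc_name"] for r in ds), pdf_dir)` would return an empty list when nothing is missing.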


## Copy the PDF files required by the dataset to a new folder

In [5]:
import os
import shutil

# Source and target directories
source_dir = "../pdfs"
target_dir = "../financebench_pdfs"
os.makedirs(target_dir, exist_ok=True)

# Track unique doc_names
unique_doc_names = {record["doc_name"] for record in ds}

# Copy only the needed PDFs
copied_count = 0
for doc_name in unique_doc_names:
    filename = doc_name + ".pdf"
    source_path = os.path.join(source_dir, filename)
    target_path = os.path.join(target_dir, filename)

    if os.path.isfile(source_path):
        shutil.copy2(source_path, target_path)
        copied_count += 1
    else:
        print(f"Missing file: {filename}")

print(f"Copied {copied_count} PDF files to '{target_dir}'")


Copied 84 PDF files to '../financebench_pdfs'


## Load documents using LlamaIndex

In [10]:
import os
from llama_index.readers.file import PyMuPDFReader

pdf_dir = "../financebench_pdfs"
pdf_reader = PyMuPDFReader()

# List of PDF files
pdf_files = [f for f in os.listdir(pdf_dir) if f.endswith(".pdf")]

documents = []

for pdf_file in pdf_files:
    file_path = os.path.join(pdf_dir, pdf_file)
    print(f"Processing: {pdf_file}")
    try:
        doc = pdf_reader.load(file_path)
        documents.extend(doc)  # note: `doc` is a list of Document objects
        print(f"Loaded: {pdf_file}")
    except Exception as e:
        print(f"Failed to load {pdf_file}: {e}")

Processing: AMCOR_2020_10K.pdf
Loaded: AMCOR_2020_10K.pdf
Processing: PEPSICO_2023Q1_EARNINGS.pdf
Loaded: PEPSICO_2023Q1_EARNINGS.pdf
Processing: CVSHEALTH_2022_10K.pdf
Loaded: CVSHEALTH_2022_10K.pdf
Processing: AMCOR_2023_10K.pdf
Loaded: AMCOR_2023_10K.pdf
Processing: PEPSICO_2022_10K.pdf
Loaded: PEPSICO_2022_10K.pdf
Processing: NETFLIX_2015_10K.pdf
Loaded: NETFLIX_2015_10K.pdf
Processing: 3M_2023Q2_10Q.pdf
Loaded: 3M_2023Q2_10Q.pdf
Processing: 3M_2017_10K.pdf
Loaded: 3M_2017_10K.pdf
Processing: GENERALMILLS_2019_10K.pdf
Loaded: GENERALMILLS_2019_10K.pdf
Processing: BESTBUY_2017_10K.pdf
Loaded: BESTBUY_2017_10K.pdf
Processing: BOEING_2022_10K.pdf
Loaded: BOEING_2022_10K.pdf
Processing: PEPSICO_2023_8K_dated-2023-05-30.pdf
Loaded: PEPSICO_2023_8K_dated-2023-05-30.pdf
Processing: COCACOLA_2022_10K.pdf
Loaded: COCACOLA_2022_10K.pdf
Processing: AES_2022_10K.pdf
Loaded: AES_2022_10K.pdf
Processing: NIKE_2018_10K.pdf
Loaded: NIKE_2018_10K.pdf
Processing: AMCOR_2023Q2_10Q.pdf
Loaded: AMCOR_2
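
`os.listdir` returns files in arbitrary order, and `f.endswith(".pdf")` skips files with an upper-case extension. If a reproducible load order matters, a sketch like the following could be used instead; `list_pdf_files` is a hypothetical helper, not part of LlamaIndex.

```python
from pathlib import Path
from typing import List


def list_pdf_files(pdf_dir: str) -> List[str]:
    """Return PDF filenames in pdf_dir, sorted, matching the extension case-insensitively."""
    return sorted(
        p.name
        for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```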

## Create the nodes with specific chunk and overlap size

In [11]:
from llama_index.core.node_parser import SentenceSplitter
from typing import List
from llama_index.core.schema import Document

def generate_nodes(
    documents: List[Document],
    chunk_size: int = 512,
    chunk_overlap: int = 512 // 4  # 128 tokens, i.e. 25% of the default chunk_size
) -> List:
    """
    Generate nodes from documents using LlamaIndex SentenceSplitter.

    Args:
        documents: List of LlamaIndex Document objects to process
        chunk_size: Maximum tokens per chunk (default: 512)
        chunk_overlap: Overlap between chunks, in tokens, to preserve context (default: 128, i.e. 25% of the default chunk size)

    Returns:
        List of nodes generated from the documents

    Raises:
        ValueError: If chunk_size or chunk_overlap is invalid
        TypeError: If documents is not a list of Document objects
    """
    # Input validation
    if not isinstance(documents, list):
        raise TypeError("Documents must be provided as a list")
    
    if not all(isinstance(doc, Document) for doc in documents):
        raise TypeError("All items in documents list must be LlamaIndex Document objects")
    
    if chunk_size <= 0:
        raise ValueError("Chunk size must be positive")
        
    if chunk_overlap < 0:
        raise ValueError("Chunk overlap cannot be negative")
        
    if chunk_overlap >= chunk_size:
        raise ValueError("Chunk overlap must be less than chunk size")

    # Initialize SentenceSplitter
    parser = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    # Generate nodes
    nodes = parser.get_nodes_from_documents(documents)
    
    print(f"Created {len(nodes)} chunks with chunk_size={chunk_size} and chunk_overlap={chunk_overlap}")
    
    return nodes

# # Example usage
# # Sample documents
# sample_text = "This is a sample document for testing. " * 50
# documents = [Document(text=sample_text)]

# try:
#     # Generate nodes with default parameters
#     nodes = generate_nodes(documents)
    
#     # Generate nodes with custom parameters
#     custom_nodes = generate_nodes(
#         documents=documents,
#         chunk_size=1000,
#         chunk_overlap=200
#     )
# except (ValueError, TypeError) as e:
#     print(f"Error: {e}")
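
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified illustration of fixed-size windows that share tokens with their neighbours. It is not `SentenceSplitter`'s algorithm, which also tries to respect sentence boundaries; `sliding_chunks` is a hypothetical helper operating on a pre-tokenized list.

```python
from typing import List, Sequence


def sliding_chunks(tokens: Sequence[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in sliding_chunks(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

The same validation rule applies as in `generate_nodes`: the overlap has to be smaller than the chunk size, otherwise the window would never advance.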

In [12]:
from typing import List
from langchain.docstore.document import Document as LCDocument
from llama_index.core.schema import BaseNode

def nodes_to_langchain_docs(
    nodes: List[BaseNode],
    chunk_size: int,
    keep_node_metadata: bool = True
) -> List[LCDocument]:
    """
    Convert LlamaIndex nodes to LangChain documents.

    Args:
        nodes: List of LlamaIndex nodes to convert
        chunk_size: Chunk size used for node creation (for metadata)
        keep_node_metadata: If True, include original node metadata in addition to chunk_size

    Returns:
        List of LangChain Document objects

    Raises:
        TypeError: If nodes is not a list of LlamaIndex BaseNode objects
        ValueError: If chunk_size is invalid
    """
    # Input validation
    if not isinstance(nodes, list):
        raise TypeError("Nodes must be provided as a list")
    
    if not all(isinstance(node, BaseNode) for node in nodes):
        raise TypeError("All items in nodes list must be LlamaIndex BaseNode objects")
    
    if chunk_size <= 0:
        raise ValueError("Chunk size must be positive")

    # Convert nodes to LangChain documents
    lc_docs = []
    for node in nodes:
        # Base metadata with chunk_size
        metadata = {"chunk_size": chunk_size}
        
        # Add original node metadata if keep_node_metadata is True
        if keep_node_metadata:
            metadata.update(node.metadata)
        
        # Create LangChain document
        doc = LCDocument(
            page_content=node.get_content(),
            metadata=metadata
        )
        lc_docs.append(doc)
    
    print(f"Converted {len(lc_docs)} nodes to LangChain documents "
          f"(keep_node_metadata={keep_node_metadata})")
    
    return lc_docs

# # Example usage
# # Create sample nodes
# sample_text = "This is a sample document for testing. " * 50
# doc = Document(text=sample_text)
# doc.metadata = {"source": "sample.pdf", "author": "John Doe", "page": 1}
# parser = SentenceSplitter(chunk_size=500, chunk_overlap=150)
# nodes = parser.get_nodes_from_documents([doc])

# try:
#     # Convert nodes without keeping node metadata
#     lc_docs_without_metadata = nodes_to_langchain_docs(
#         nodes=nodes,
#         chunk_size=500,
#         keep_node_metadata=False
#     )
    
#     # Convert nodes keeping node metadata
#     lc_docs_with_metadata = nodes_to_langchain_docs(
#         nodes=nodes,
#         chunk_size=500,
#         keep_node_metadata=True
#     )
    
#     # Print sample results
#     print("\nSample document without node metadata:")
#     print(lc_docs_without_metadata[0].metadata)
    
#     print("\nSample document with node metadata:")
#     print(lc_docs_with_metadata[0].metadata)
    
# except (TypeError, ValueError) as e:
#     print(f"Error: {e}")
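
One detail of `nodes_to_langchain_docs` worth noting: `chunk_size` is written first and the node metadata is merged afterwards with `dict.update`, so a node that already carries a `chunk_size` key would override the value passed in. The sketch below reproduces just that merge with plain dictionaries; `build_metadata` is a hypothetical stand-in for the loop body.

```python
def build_metadata(node_metadata: dict, chunk_size: int, keep_node_metadata: bool = True) -> dict:
    """Mirror the metadata merge: chunk_size first, then the node's own metadata."""
    metadata = {"chunk_size": chunk_size}
    if keep_node_metadata:
        metadata.update(node_metadata)  # keys from the node win on collision
    return metadata


print(build_metadata({"source": "sample.pdf", "page": 1}, 512))
# {'chunk_size': 512, 'source': 'sample.pdf', 'page': 1}
print(build_metadata({"chunk_size": 999}, 512))
# {'chunk_size': 999}
```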

## Populate documents into the vector store

In [13]:
from typing import List
from langchain.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings
from llama_index.core.schema import Document, BaseNode
from llama_index.core.node_parser import SentenceSplitter
from langchain.docstore.document import Document as LCDocument
import os
import shutil
import uuid
import time

def clear_directory_with_retry(directory: str, max_attempts: int = 5, delay: float = 1.0) -> None:
    """
    Attempt to clear a directory with retries to handle file access issues on Windows.

    Args:
        directory: Path to the directory to clear
        max_attempts: Maximum number of retry attempts
        delay: Delay between attempts in seconds
    """
    if not os.path.exists(directory):
        return

    for attempt in range(max_attempts):
        try:
            shutil.rmtree(directory)  # no ignore_errors: a PermissionError must reach the retry handler below
            print(f"Cleared existing ChromaDB at {directory}")
            return
        except PermissionError as e:
            print(f"Attempt {attempt + 1}/{max_attempts} failed: {e}")
            if attempt < max_attempts - 1:
                time.sleep(delay)
        except Exception as e:
            print(f"Failed to clear directory: {e}")
            raise
    raise PermissionError(f"Could not clear directory {directory} after {max_attempts} attempts")

def populate_vector_store(
    documents: List[Document],
    chunk_sizes: List[int],
    embedding_model: str,
    collection_name_prefix: str,
    persist_directory: str,
    chunk_overlap_percentage: int = 30,
    keep_node_metadata: bool = False,
    clear_old_db: bool = False,
    max_batch_size: int = 5000
) -> None:
    """
    Populate Chroma vector store with embeddings for multiple chunk sizes.

    Args:
        documents: List of LlamaIndex Document objects
        chunk_sizes: List of chunk sizes to process
        embedding_model: Ollama embedding model name
        collection_name_prefix: Prefix for Chroma collection names
        persist_directory: Directory to store ChromaDB
        chunk_overlap_percentage: Overlap percentage (1-99) for chunks (default: 30)
        keep_node_metadata: If True, keep original node metadata
        clear_old_db: If True, remove existing ChromaDB directory
        max_batch_size: Maximum number of documents added per batch (default: 5000)

    Raises:
        ValueError: If inputs are invalid
        TypeError: If documents or chunk_sizes are not lists
        PermissionError: If directory cannot be cleared
    """
    # Input validation
    if not isinstance(documents, list):
        raise TypeError("Documents must be provided as a list")
    
    if not all(isinstance(doc, Document) for doc in documents):
        raise TypeError("All items in documents list must be LlamaIndex Document objects")
    
    if not isinstance(chunk_sizes, list):
        raise TypeError("Chunk sizes must be provided as a list")
    
    if not all(isinstance(size, int) and size > 0 for size in chunk_sizes):
        raise ValueError("All chunk sizes must be positive integers")
    
    if not isinstance(chunk_overlap_percentage, int) or chunk_overlap_percentage < 1 or chunk_overlap_percentage > 99:
        raise ValueError("Chunk overlap percentage must be an integer between 1 and 99")
    
    if not embedding_model:
        raise ValueError("Embedding model name must be provided")

    # Clear existing ChromaDB if requested
    if clear_old_db:
        clear_directory_with_retry(persist_directory)

    # Initialize embeddings
    embedding = OllamaEmbeddings(model=embedding_model)

    # Process each chunk size
    for chunk_size in chunk_sizes:
        # Calculate chunk overlap based on percentage
        chunk_overlap = int(chunk_size * (chunk_overlap_percentage / 100))
        
        if chunk_overlap >= chunk_size:
            raise ValueError(f"Calculated chunk overlap ({chunk_overlap}) must be less than chunk size ({chunk_size})")
        
        collection_name = f"{collection_name_prefix}{chunk_size}"
        
        # Check if collection already exists
        vectorstore = None
        if not clear_old_db and os.path.exists(persist_directory):
            try:
                vectorstore = Chroma(
                    collection_name=collection_name,
                    embedding_function=embedding,
                    persist_directory=persist_directory
                )
                # If collection exists and clear_old_db is False, skip
                if vectorstore._collection.count() > 0:
                    print(f"Skipping chunk size {chunk_size}: Collection already exists")
                    vectorstore = None  # Explicitly release the connection
                    continue
            except Exception:
                pass  # Collection doesn't exist, proceed with creation

        # Generate nodes
        nodes = generate_nodes(
            documents=documents,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )

        print(f"Generated {len(nodes)} nodes for chunk size {chunk_size} ")

        # Convert nodes to LangChain documents
        lc_docs = nodes_to_langchain_docs(
            nodes=nodes,
            chunk_size=chunk_size,
            keep_node_metadata=keep_node_metadata
        )

        print(f"Converted {len(lc_docs)} nodes to LangChain documents for chunk size {chunk_size}")

        os.makedirs(persist_directory, exist_ok=True)
        # Initialize Chroma vector store
        vectorstore = Chroma(
            collection_name=collection_name,
            embedding_function=embedding,
            persist_directory=persist_directory
        )

        vectorstore.persist()

        print(f"Initialized Chroma vector store for chunk size {chunk_size}")

        # Add documents to vector store in batches
        total_docs = len(lc_docs)
        for batch_start in range(0, total_docs, max_batch_size):
            print(f"Adding batch {batch_start//max_batch_size + 1}")
            batch_end = min(batch_start + max_batch_size, total_docs)
            batch = lc_docs[batch_start:batch_end]
            try:
                print(f"Try adding {len(batch)} batche of documents")
                vectorstore.add_documents(batch)
                print(f"Added batch {batch_start//max_batch_size + 1} "
                      f"({len(batch)} documents) for chunk size {chunk_size}")
                vectorstore.persist()
            except Exception as e:
                print(f"Error adding batch {batch_start//max_batch_size + 1} "
                      f"for chunk size {chunk_size}: {str(e)}")
                raise

        vectorstore.persist()

        print(f"Populated vector store for chunk size {chunk_size} with {len(lc_docs)} documents "
              f"(overlap={chunk_overlap_percentage}% -> {chunk_overlap} tokens)")

        # Explicitly release the vectorstore connection
        vectorstore = None

# # Example usage
# # Sample documents
# sample_text = "This is a sample document for testing. " * 50
# documents = [Document(text=sample_text)]

# try:
#     populate_vector_store(
#         documents=documents,
#         chunk_sizes=[128, 256, 512, 1024],
#         embedding_model="all-minilm",
#         collection_name_prefix="rag_docs_chunk_",
#         persist_directory="chroma_db2",
#         chunk_overlap_percentage=20,
#         keep_node_metadata=True,
#         clear_old_db=False,
#         max_batch_size=5000
#     )
# except (ValueError, TypeError) as e:
#     print(f"Error: {e}")

In [14]:
embedding_model = "nomic-embed-text"
# chunk_sizes = [128, 256, 512, 1024]
chunk_sizes = [512]
chunk_overlap_percentage = 20

# Generate the real vector store from one of the datasets
try:
    populate_vector_store(
        documents=documents,
        chunk_sizes=chunk_sizes,
        embedding_model=embedding_model,
        collection_name_prefix="financebench_docs_chunk_",
        persist_directory="./financebench_db4",
        chunk_overlap_percentage=chunk_overlap_percentage,
        keep_node_metadata=True,
        clear_old_db=True,
        max_batch_size=500
    )
except (ValueError, TypeError) as e:
    print(f"Error: {e}")

Cleared existing ChromaDB at ./financebench_db4
Created 30293 chunks with chunk_size=512 and chunk_overlap=102
Generated 30293 nodes for chunk size 512 
Converted 30293 nodes to LangChain documents (keep_node_metadata=True)
Converted 30293 nodes to LangChain documents for chunk size 512




Initialized Chroma vector store for chunk size 512
Adding batch 1
Try adding 500 batche of documents
Added batch 1 (500 documents) for chunk size 512
Adding batch 2
Try adding 500 batche of documents
Added batch 2 (500 documents) for chunk size 512
Adding batch 3
Try adding 500 batche of documents
Added batch 3 (500 documents) for chunk size 512
Adding batch 4
Try adding 500 batche of documents
Added batch 4 (500 documents) for chunk size 512
Adding batch 5
Try adding 500 batche of documents
Added batch 5 (500 documents) for chunk size 512
Adding batch 6
Try adding 500 batche of documents
Added batch 6 (500 documents) for chunk size 512
Adding batch 7
Try adding 500 batche of documents
Added batch 7 (500 documents) for chunk size 512
Adding batch 8
Try adding 500 batche of documents
Added batch 8 (500 documents) for chunk size 512
Adding batch 9
Try adding 500 batche of documents
Added batch 9 (500 documents) for chunk size 512
Adding batch 10
Try adding 500 batche of documents
Added b

KeyboardInterrupt: 
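
For a sense of how long the run above has to go, the batch count follows directly from the number of documents and `max_batch_size`. The sketch below isolates the slicing used in the batching loop; `iter_batches` is a hypothetical helper, and the figures 30293 and 500 are taken from the output above.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


batches = list(iter_batches(list(range(30293)), 500))
print(len(batches), len(batches[-1]))  # 61 293
```

So a full run with these settings needs 61 batches, the last holding 293 documents; the interrupted run above stopped around batch 10.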