# PDF Document Loading and Chunking Example

This notebook demonstrates:
- Loading PDF documents using the RAG pipeline
- Page-based chunking with token size considerations
- Detailed reporting of chunking results
- Using FAST_ANSWERS configuration parameters

We'll use a test PDF document to show the complete process from loading to chunking analysis.

In [None]:
import setup_notebook  # This fixes the path for imports
from rag_pipeline.config.parameter_sets import FAST_ANSWERS
from rag_pipeline.core.embeddings import process_pdf
from rag_pipeline.utils.directory_utils import get_project_root, get_test_data_dir

# Use FAST_ANSWERS configuration for this example
rag_params = FAST_ANSWERS

# Set up paths
root = get_project_root()
test_data = get_test_data_dir()
pdf_path = test_data / "2303.18223v16.pdf"
persist_dir = root / "data" / "test_chunks"

# Ensure persist directory exists
persist_dir.mkdir(parents=True, exist_ok=True)

## Processing the PDF Document

We'll now process the PDF document using the following steps:
1. Load the PDF using our custom processor
2. Apply page-based chunking with the FAST_ANSWERS parameters (rag_params = FAST_ANSWERS):
   - Chunk size: rag_params.chunking.chunk_size tokens
   - Chunk overlap: rag_params.chunking.chunk_overlap tokens
   - For PDF files, a chunk should never span more than one page; a page that exceeds the chunk size is split into multiple chunks
3. Enable deduplication to avoid redundant content processing
   - Deduplication here means not processing the same PDF file twice with the same chunking parameters (we don't expect duplicate chunks within a single PDF)
   - Chunking the same file with the same parameters should be idempotent
   - Suggestions:
      - keep the chunking database separate from the embedding database
      - use the chunking parameters in the database name
      - check the database for any records with the given file name to detect if chunking was already executed
      - apply embeddings on the chunking database records when needed
4. Process all pages to get a complete view of the document structure
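
The page-based chunking in step 2 can be sketched as a sliding window applied to each page independently. This is a minimal illustration, not the implementation behind process_pdf: the helper names chunk_page and chunk_pages are hypothetical, and whitespace-separated words stand in for model tokens.

```python
from typing import Dict, List


def chunk_page(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split one page's tokens into overlapping windows of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the page
        start += step
    return chunks


def chunk_pages(pages: List[str], chunk_size: int, chunk_overlap: int) -> Dict[int, List[str]]:
    """Chunk each page on its own so that no chunk spans two pages."""
    result: Dict[int, List[str]] = {}
    for page_num, text in enumerate(pages, start=1):
        tokens = text.split()  # assumption: whitespace words approximate tokens
        windows = chunk_page(tokens, chunk_size, chunk_overlap)
        result[page_num] = [" ".join(window) for window in windows]
    return result


# Hypothetical two-page document with tiny parameters to keep the output readable
pages = ["one two three four five six seven", "alpha beta"]
page_chunks = chunk_pages(pages, chunk_size=4, chunk_overlap=1)
for page_num, chunks in page_chunks.items():
    print(page_num, chunks)
```

With these toy values the first page is split into two overlapping chunks, while the second page fits into a single chunk that is shorter than the chunk size.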

The process_pdf function will return:
- File name, file size, and number of pages
- Total number of chunks created
- Database: file name, table name, and other relevant attributes.
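
The deduplication suggestions in step 3 could look roughly like the sketch below. It uses the standard-library sqlite3 module instead of the pipeline's actual storage, and the names chunk_db_path, store_chunks_once, and the chunks table are all hypothetical. The chunking parameters are encoded in the database file name, and a file-name lookup decides whether chunking already ran.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import List


def chunk_db_path(base_dir: Path, chunk_size: int, chunk_overlap: int) -> Path:
    """Encode the chunking parameters in the database file name."""
    return base_dir / f"chunks_size{chunk_size}_overlap{chunk_overlap}.sqlite3"


def store_chunks_once(db_path: Path, file_name: str, chunks: List[str]) -> bool:
    """Insert chunks unless this file was already chunked; return True if rows were written."""
    conn = sqlite3.connect(str(db_path))
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS chunks ("
            "file_name TEXT, chunk_index INTEGER, text TEXT, "
            "PRIMARY KEY (file_name, chunk_index))"
        )
        already = conn.execute(
            "SELECT 1 FROM chunks WHERE file_name = ? LIMIT 1", (file_name,)
        ).fetchone()
        if already:
            return False  # same file, same parameters: nothing to do
        conn.executemany(
            "INSERT INTO chunks VALUES (?, ?, ?)",
            [(file_name, i, text) for i, text in enumerate(chunks)],
        )
        conn.commit()
        return True
    finally:
        conn.close()


# Hypothetical parameter values and file name, stored in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    db = chunk_db_path(Path(tmp), chunk_size=512, chunk_overlap=64)
    first = store_chunks_once(db, "example.pdf", ["chunk a", "chunk b"])
    second = store_chunks_once(db, "example.pdf", ["chunk a", "chunk b"])
    print(db.name, first, second)
```

The second call writes nothing, which is the idempotent behavior described above. Because embeddings are not stored here, they can be computed later from the chunk records when needed.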


In [None]:
# Process the PDF with full page processing and deduplication
if not (persist_dir / "chroma.sqlite3").exists():
    chunks, records = await process_pdf(
        pdf_path,
        rag_params.embedding.model_name,
        persist_dir=str(persist_dir),
        chunk_size=rag_params.chunking.chunk_size,
        chunk_overlap=rag_params.chunking.chunk_overlap,
        max_pages=None,  # Process all pages
        deduplicate=True,
    )
    print("PDF Processing Results:")
    print("-----------------------")
    print(f"Total chunks created: {chunks}")
    print(f"Unique records after deduplication: {records}")
    print(f"Deduplication removed: {chunks - records} chunks")
    # Guard against division by zero when the document yields no chunks
    ratio = (chunks - records) / chunks if chunks else 0.0
    print(f"Deduplication ratio: {ratio:.2%}")
else:
    print(f"Embeddings already exist in {persist_dir}, skipping processing.")

In [None]:
from rag_pipeline.core.chunking import get_page_chunks

# Get chunks per page analysis
page_chunks = get_page_chunks(pdf_path)

print("Page-Level Analysis:")
print("--------------------")
print(f"Total pages in document: {len(page_chunks)}")
total_chunks = sum(len(page_chunk_list) for page_chunk_list in page_chunks.values())
# Avoid dividing by zero if no pages were returned
avg_chunks = total_chunks / len(page_chunks) if page_chunks else 0.0
print(f"Average chunks per page: {avg_chunks:.2f}")
print("\nPages with most chunks:")
sorted_pages = sorted(page_chunks.items(), key=lambda x: len(x[1]), reverse=True)
for page_num, page_chunk_list in sorted_pages[:5]:
    print(f"Page {page_num}: {len(page_chunk_list)} chunks")

print("\nPages with fewest chunks:")
for page_num, page_chunk_list in sorted_pages[-5:]:
    print(f"Page {page_num}: {len(page_chunk_list)} chunks")

## Analysis Summary

The above results show:
1. The complete document processing with page-by-page chunking
2. Deduplication effectiveness in removing redundant content
3. Distribution of chunks across pages, highlighting:
   - Pages with dense content (more chunks)
   - Pages with sparse content (fewer chunks)
   - Average chunk distribution

This analysis shows how the document is being split and can guide tuning of the chunk size and chunk overlap if needed.
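
When tuning, it can help to estimate how many chunks a page should produce for a given parameter set before re-running the pipeline. The sketch below assumes a sliding window that advances by chunk_size minus chunk_overlap tokens and never crosses a page boundary. The function name expected_chunks and the 800-token page are hypothetical, and the pipeline's real chunker may count differently.

```python
import math


def expected_chunks(page_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of overlapping windows needed to cover a page of page_tokens tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if page_tokens <= 0:
        return 0
    if page_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    # One full window, plus one more for every step needed to reach the page end
    return 1 + math.ceil((page_tokens - chunk_size) / step)


# Hypothetical page length and candidate parameter sets
for size, overlap in [(256, 32), (512, 64), (1024, 128)]:
    print(size, overlap, expected_chunks(800, size, overlap))
```

Comparing these estimates with the measured chunks per page is a quick way to spot pages where text extraction produced far more or far fewer tokens than expected.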
