# PDF Ingestion Layer - Test Notebook

This notebook tests the PDF Ingestion Layer in isolation.

## Purpose
- Load a real PDF from the project
- Run the ingestion pipeline
- Inspect the output Document
- Validate metadata extraction
- Verify text quality

## What This Tests
✅ PDF loading and validation  
✅ Text extraction quality  
✅ Text normalization  
✅ Metadata extraction  
✅ LlamaIndex Document creation  

## What This Does NOT Test
❌ Chunking (that's Layer 2)  
❌ Embeddings (that's Layer 3)  
❌ Retrieval (that's Layer 3)  
❌ Agents (that's Layer 4)


---
## Setup


In [2]:
import sys
from pathlib import Path

# Add project root to path
project_root = Path().absolute().parent.parent
sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")


Project root: /Users/guyai/Desktop/AI Lecture/FIRST PROJECT/RagAgentv2
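
The `Path().absolute().parent.parent` expression above only finds the project root when the notebook runs from a directory exactly two levels below it. A minimal sketch of a sturdier lookup follows, assuming the root can be recognised by a marker file. The `find_project_root` helper and the marker names are hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_project_root(
    start: Path,
    markers: Iterable[str] = ("requirements.txt", ".git"),  # assumed markers
) -> Optional[Path]:
    """Walk upward from `start`; return the first directory holding a marker."""
    start = start.resolve()
    markers = tuple(markers)
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    return None  # no marker found anywhere above `start`
```

In the notebook this could replace the hard-coded `.parent.parent` with `find_project_root(Path.cwd())`, falling back to the original expression when it returns `None`.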


In [3]:
# Import the ingestion pipeline
from RAG.PDF_Ingestion.pdf_ingestion import create_ingestion_pipeline, PDFIngestionError

print("✅ PDF Ingestion module imported successfully")


✅ PDF Ingestion module imported successfully


---
## Test 1: Basic Ingestion


In [4]:
# Path to test PDF
pdf_path = project_root / "auto_claim_20_forms_FINAL.pdf"

print(f"Testing with PDF: {pdf_path}")
print(f"File exists: {pdf_path.exists()}")
print(f"File size: {pdf_path.stat().st_size / 1024:.1f} KB")


Testing with PDF: /Users/guyai/Desktop/AI Lecture/FIRST PROJECT/RagAgentv2/auto_claim_20_forms_FINAL.pdf
File exists: True
File size: 44.9 KB


In [5]:
# Create ingestion pipeline
pipeline = create_ingestion_pipeline(document_type="insurance_claim_form")

print("✅ Pipeline created")


✅ Pipeline created


In [6]:
# Run ingestion
try:
    document = pipeline.ingest(str(pdf_path))
    print("✅ Ingestion successful!")
    print(f"Document type: {type(document)}")
    print(f"Document ID: {document.doc_id}")
except PDFIngestionError as e:
    print(f"❌ Ingestion failed: {e}")
    raise


✅ Ingestion successful!
Document type: <class 'llama_index.core.schema.Document'>
Document ID: 6e1c9a74673919ad


---
## Test 2: Inspect Metadata


In [7]:
# Display all metadata
import json

print("📋 DOCUMENT METADATA")
print("=" * 60)
print(json.dumps(document.metadata, indent=2))


📋 DOCUMENT METADATA
{
  "document_id": "6e1c9a74673919ad",
  "document_type": "insurance_claim_form",
  "source_file": "auto_claim_20_forms_FINAL.pdf",
  "source_path": "/Users/guyai/Desktop/AI Lecture/FIRST PROJECT/RagAgentv2/auto_claim_20_forms_FINAL.pdf",
  "title": "auto_claim_20_forms_FINAL",
  "language": "en",
  "page_count": 40,
  "total_characters": 25417,
  "total_words": 3641,
  "total_paragraphs": 40,
  "avg_paragraph_length": 91.0,
  "has_headings": false,
  "dates_detected": [
    "2024-06-06",
    "2024-07-08",
    "2024-11-16",
    "2024-06-08",
    "2024-07-30",
    "2024-08-20",
    "2024-08-01",
    "2024-02-14",
    "2024-02-25",
    "2024-02-16"
  ],
  "times_detected": [
    "14:59",
    "15:15",
    "15:33",
    "09:29",
    "09:32",
    "09:31",
    "10:12",
    "14:54",
    "15:20",
    "10:59"
  ],
  "numeric_density": "medium",
  "ingested_at": "2025-12-13T10:56:37.198602Z",
  "ingestion_pipeline_version": "1.0"
}
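
The `dates_detected` and `times_detected` lists above each hold exactly ten unique values, which suggests (but does not confirm) that the pipeline caps them. The sketch below shows one way such fields could be produced with regular expressions. The patterns, the `detect_unique` helper, and the limit of ten are assumptions for illustration, not the pipeline's actual implementation.

```python
import re
from typing import List, Pattern

# Assumed formats: ISO dates (YYYY-MM-DD) and 24-hour times (HH:MM).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
TIME_RE = re.compile(r"\b(?:[01]\d|2[0-3]):[0-5]\d\b")


def detect_unique(pattern: Pattern[str], text: str, limit: int = 10) -> List[str]:
    """Return up to `limit` distinct matches, in order of first appearance."""
    found: List[str] = []
    for match in pattern.findall(text):
        if match not in found:
            found.append(match)
            if len(found) == limit:
                break
    return found
```

Calling `detect_unique(DATE_RE, document.text)` on the ingested text would then give a list comparable to `dates_detected`, which makes it easy to cross-check the metadata against the extracted text.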


In [8]:
# Validate required metadata fields
required_fields = [
    "document_id",
    "document_type",
    "source_file",
    "title",
    "language",
    "total_words",
    "total_characters",
    "total_paragraphs",
    "avg_paragraph_length",
    "has_headings",
    "dates_detected",
    "times_detected",
    "numeric_density",
    "ingested_at",
]

print("\n🔍 METADATA VALIDATION")
print("=" * 60)
missing_fields = []
for field in required_fields:
    if field in document.metadata:
        print(f"✅ {field}")
    else:
        print(f"❌ {field} - MISSING")
        missing_fields.append(field)

if missing_fields:
    print(f"\n⚠️  Missing {len(missing_fields)} required fields")
else:
    print("\n✅ All required metadata fields present")



🔍 METADATA VALIDATION
✅ document_id
✅ document_type
✅ source_file
✅ title
✅ language
✅ total_words
✅ total_characters
✅ total_paragraphs
✅ avg_paragraph_length
✅ has_headings
✅ dates_detected
✅ times_detected
✅ numeric_density
✅ ingested_at

✅ All required metadata fields present
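
The check above confirms only that each key is present. A small extension could also confirm that each value has the expected type. The sketch below is one way to do that. The `EXPECTED_TYPES` mapping is inferred from the printed metadata rather than taken from the pipeline's schema, and `find_type_errors` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional

# Types inferred from the JSON output shown earlier (an assumption).
EXPECTED_TYPES: Dict[str, type] = {
    "document_id": str,
    "document_type": str,
    "page_count": int,
    "total_characters": int,
    "total_words": int,
    "total_paragraphs": int,
    "avg_paragraph_length": float,
    "has_headings": bool,
    "dates_detected": list,
    "times_detected": list,
    "numeric_density": str,
}


def find_type_errors(
    metadata: Dict[str, Any],
    expected: Optional[Dict[str, type]] = None,
) -> List[str]:
    """Return one message per missing or wrongly typed field."""
    expected = EXPECTED_TYPES if expected is None else expected
    errors: List[str] = []
    for field, expected_type in expected.items():
        if field not in metadata:
            errors.append(f"{field}: missing")
            continue
        value = metadata[field]
        # bool is a subclass of int, so reject True/False where a count is expected.
        bool_as_int = expected_type is int and isinstance(value, bool)
        if not isinstance(value, expected_type) or bool_as_int:
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors
```

An empty list from `find_type_errors(document.metadata)` would mean every listed field is present and has the inferred type.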


---
## Test 3: Inspect Text Quality


In [9]:
# Display text statistics
print("📊 TEXT STATISTICS")
print("=" * 60)
print(f"Total characters: {len(document.text):,}")
print(f"Total words: {len(document.text.split()):,}")
print(f"Total lines: {len(document.text.split(chr(10))):,}")
print(f"Total paragraphs: {len([p for p in document.text.split(chr(10)*2) if p.strip()]):,}")


📊 TEXT STATISTICS
Total characters: 25,417
Total words: 3,641
Total lines: 79
Total paragraphs: 40
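
These counts line up with the metadata: 3,641 words over 40 paragraphs is 91.025, which rounds to the reported `avg_paragraph_length` of 91.0. That agreement suggests the field measures words per paragraph, although the notebook does not state the unit. A minimal sketch under that assumption (the `average_paragraph_length` helper is hypothetical):

```python
def average_paragraph_length(text: str) -> float:
    """Mean number of words per non-empty paragraph, rounded to one decimal."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    total_words = sum(len(p.split()) for p in paragraphs)
    return round(total_words / len(paragraphs), 1)
```

Comparing `average_paragraph_length(document.text)` with `document.metadata["avg_paragraph_length"]` would be a quick way to confirm or refute the assumed unit.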


In [10]:
# Display first 500 characters
print("\n📄 FIRST 500 CHARACTERS OF CLEAN TEXT")
print("=" * 60)
print(document.text[:500])
print("\n[... text continues ...]")



📄 FIRST 500 CHARACTERS OF CLEAN TEXT
AUTO CLAIM FORM #1 BlueRiver Mutual SECTION 1 – CLAIMANT INFORMATION Name: Jon Mor Account Number: ACC9900057 Address: 100 Main Street, Sample City, ST 90000 Phone: (555) 100-2000 Email: jon.mor@example.com Date of Incident: 2024-06-06 Location: 10th Ave & 5th St, Sample City Injury: Yes (minor) Police Report: Yes SECTION 2 – CLAIM DETAILS Accident Type: Rear-end collision Severity: Minor Claim Status: Pending court Fraud Risk Score: 4 Internal Tag: TOW-FLAG-3 Assigned Adjuster: Linda Cooper SEC

[... text continues ...]


In [11]:
# Display last 500 characters
print("\n📄 LAST 500 CHARACTERS OF CLEAN TEXT")
print("=" * 60)
print("[... text continues ...]\n")
print(document.text[-500:])



📄 LAST 500 CHARACTERS OF CLEAN TEXT
[... text continues ...]

Witness Statement: Witness reported seeing the other car accelerate abruptly. Repair Estimate 1: $1260 Repair Estimate 2: $1645 Repair Shop Assigned: Horizon Collision Repair Repair Appointment Date: 2025-09-04

Hidden Note: Second witness: **Laura Vance** SECTION 5 – MINI TIMELINE OF EVENTS No timeline available for this claim. SECTION 6 – COURT DATE Court Date: N/A SECTION 7 – DECLARATION I declare that the information provided isaccurate. Signature: __________________________ Date: 2025-08-09


In [None]:
# Show sample paragraphs
paragraphs = [p.strip() for p in document.text.split('\n\n') if p.strip()]

print(f"\n📄 SAMPLE PARAGRAPHS (showing 3 of {len(paragraphs)})")
print("=" * 60)

for i, para in enumerate(paragraphs[:3], 1):
    print(f"\nParagraph {i}:")
    print("-" * 60)
    # Show first 300 chars of paragraph
    if len(para) > 300:
        print(para[:300] + "...")
    else:
        print(para)


---
## Test 4: Text Quality Checks


In [12]:
# Check for common PDF extraction issues
print("🔍 TEXT QUALITY CHECKS")
print("=" * 60)

issues = []

# Check 1: Text is not empty
if not document.text.strip():
    issues.append("❌ Text is empty")
else:
    print("✅ Text is not empty")

# Check 2: Text has reasonable length
if len(document.text) < 100:
    issues.append("❌ Text is suspiciously short")
else:
    print("✅ Text has reasonable length")

# Check 3: No excessive whitespace
if '\n\n\n' in document.text:
    issues.append("⚠️  Text contains excessive newlines (3+)")
else:
    print("✅ No excessive newlines")

# Check 4: No form feed characters
if '\f' in document.text:
    issues.append("❌ Text contains form feed characters")
else:
    print("✅ No form feed characters")

# Check 5: Words are properly formed (no excessive single chars)
words = document.text.split()
single_char_words = sum(1 for w in words if len(w) == 1 and w.isalpha())
single_char_ratio = single_char_words / len(words) if words else 0
if single_char_ratio > 0.1:
    issues.append(f"⚠️  High ratio of single-character words ({single_char_ratio:.1%})")
else:
    print(f"✅ Low ratio of single-character words ({single_char_ratio:.1%})")

# Check 6: Has paragraph structure
if '\n\n' not in document.text:
    issues.append("⚠️  Text has no paragraph breaks")
else:
    print("✅ Text has paragraph structure")

# Summary
if issues:
    print(f"\n⚠️  Found {len(issues)} quality issues:")
    for issue in issues:
        print(f"  {issue}")
else:
    print("\n✅ All quality checks passed!")


🔍 TEXT QUALITY CHECKS
✅ Text is not empty
✅ Text has reasonable length
✅ No excessive newlines
✅ No form feed characters
✅ Low ratio of single-character words (0.6%)
✅ Text has paragraph structure

✅ All quality checks passed!
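
The six checks above could also be wrapped in a function so they can run against several PDFs or from a unit test. The sketch below mirrors the same conditions and thresholds. The `find_quality_issues` name and its message strings are hypothetical and not part of the ingestion module.

```python
from typing import List


def find_quality_issues(
    text: str,
    min_length: int = 100,
    max_single_char_ratio: float = 0.1,
) -> List[str]:
    """Return a description of each failed check; an empty list means all passed."""
    issues: List[str] = []
    if not text.strip():
        issues.append("empty text")
    if len(text) < min_length:
        issues.append("suspiciously short")
    if "\n\n\n" in text:
        issues.append("three or more consecutive newlines")
    if "\f" in text:
        issues.append("form feed characters")

    # Many isolated letters usually mean words were split during extraction.
    words = text.split()
    single_chars = sum(1 for w in words if len(w) == 1 and w.isalpha())
    ratio = single_chars / len(words) if words else 0.0
    if ratio > max_single_char_ratio:
        issues.append(f"high single-character word ratio ({ratio:.1%})")

    if "\n\n" not in text:
        issues.append("no paragraph breaks")
    return issues
```

With this shape, `assert find_quality_issues(document.text) == []` turns the visual checklist into a single pass/fail assertion.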


---
## Test 5: Document Object Validation


In [13]:
# Verify it's a proper LlamaIndex Document
from llama_index.core import Document as LlamaDocument

print("🔍 DOCUMENT OBJECT VALIDATION")
print("=" * 60)

# Check type
if isinstance(document, LlamaDocument):
    print("✅ Document is a LlamaIndex Document instance")
else:
    print(f"❌ Document is not a LlamaIndex Document (type: {type(document)})")

# Check required attributes
if hasattr(document, 'text'):
    print("✅ Document has 'text' attribute")
else:
    print("❌ Document missing 'text' attribute")

if hasattr(document, 'metadata'):
    print("✅ Document has 'metadata' attribute")
else:
    print("❌ Document missing 'metadata' attribute")

if hasattr(document, 'doc_id'):
    print("✅ Document has 'doc_id' attribute")
    print(f"   doc_id: {document.doc_id}")
else:
    print("❌ Document missing 'doc_id' attribute")

# Check doc_id matches metadata
if document.doc_id == document.metadata.get('document_id'):
    print("✅ doc_id matches metadata.document_id")
else:
    print("⚠️  doc_id does not match metadata.document_id")


🔍 DOCUMENT OBJECT VALIDATION
✅ Document is a LlamaIndex Document instance
✅ Document has 'text' attribute
✅ Document has 'metadata' attribute
✅ Document has 'doc_id' attribute
   doc_id: 6e1c9a74673919ad
✅ doc_id matches metadata.document_id


---
## Test 6: Error Handling


In [14]:
# Test with non-existent file
print("🔍 ERROR HANDLING TESTS")
print("=" * 60)

print("\nTest: Non-existent file")
try:
    pipeline.ingest("nonexistent.pdf")
    print("❌ Should have raised PDFIngestionError")
except PDFIngestionError as e:
    print(f"✅ Correctly raised PDFIngestionError: {e}")

print("\nTest: Non-PDF file")
# NOTE: "requirements.txt" is resolved relative to the working directory.
# If it is not found there, this exercises the missing-file check,
# not the non-PDF check.
try:
    pipeline.ingest("requirements.txt")
    print("❌ Should have raised PDFIngestionError")
except PDFIngestionError as e:
    print(f"✅ Correctly raised PDFIngestionError: {e}")


🔍 ERROR HANDLING TESTS

Test: Non-existent file
✅ Correctly raised PDFIngestionError: PDF file does not exist: nonexistent.pdf

Test: Non-PDF file
✅ Correctly raised PDFIngestionError: PDF file does not exist: requirements.txt
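
Both tests above report "PDF file does not exist", so the second one never reaches a file-type check: `requirements.txt` was not found in the working directory. To exercise the non-PDF path, the input has to be a file that exists. The sketch below shows the idea with a temporary file. `IngestionError` and `validate_pdf_path` are hypothetical stand-ins so the example runs without the project, and whether the real pipeline validates the extension is not confirmed by the output above.

```python
import tempfile
from pathlib import Path


class IngestionError(Exception):
    """Stand-in for the project's PDFIngestionError (hypothetical)."""


def validate_pdf_path(path: str) -> Path:
    """Stand-in validator: existence first, then extension (assumed order)."""
    candidate = Path(path)
    if not candidate.is_file():
        raise IngestionError(f"PDF file does not exist: {path}")
    if candidate.suffix.lower() != ".pdf":
        raise IngestionError(f"Not a PDF file: {path}")
    return candidate


def non_pdf_error_message() -> str:
    """Create a real .txt file and return the error raised for it."""
    with tempfile.TemporaryDirectory() as tmp:
        txt_path = Path(tmp) / "notes.txt"
        txt_path.write_text("this is not a PDF")
        try:
            validate_pdf_path(str(txt_path))
        except IngestionError as exc:
            return str(exc)
    return ""  # no error raised: the non-PDF file was accepted
```

In the notebook, the same pattern would call `pipeline.ingest(str(txt_path))` and catch `PDFIngestionError`. A message other than "does not exist" would show that the type check, rather than the existence check, rejected the file.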


---
## Summary

This notebook has tested the PDF Ingestion Layer in isolation.

### What We Verified:
1. ✅ PDF file loading and validation
2. ✅ Text extraction from PDF
3. ✅ Text normalization and cleaning
4. ✅ Metadata extraction (document-level)
5. ✅ LlamaIndex Document object creation
6. ✅ Error handling for invalid inputs

### Next Layer:
**Layer 2: Chunking**
- Will take this Document as input
- Will create hierarchical nodes (Sections → Parent → Child)
- Will enrich metadata with chunk-level information
- Still NO embeddings or vector stores

### Notes:
- This layer is COMPLETE and ISOLATED
- It has NO dependencies on chunking, indexing, or agents
- It produces a clean Document ready for Layer 2
- All metadata is lightweight and document-level only
