# Terry Real Corpus Processing

**Purpose**: Process Terry Real's three books into a ChromaDB collection for RAG-enhanced AI conversations

**Task 2 Requirements**:
- 📚 Extract text from Terry Real PDFs systematically
- 🔪 Implement semantic chunking for relationship concepts
- 🏷️ Preserve metadata (book source, chapter, concept type)
- 🚀 Batch embed all chunks with validated all-MiniLM-L6-v2
- ✅ Validate quality - chunk coherence and embedding coverage
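
One possible reading of "embedding coverage" is that every chunk ends up with exactly one usable embedding of the expected dimension. The sketch below checks that under stated assumptions: `embedding_coverage` and the `embeddings_by_id` dict are hypothetical names used only for illustration, and 384 is the dimension this notebook reports for all-MiniLM-L6-v2.

```python
import math

EXPECTED_DIM = 384  # dimension reported for all-MiniLM-L6-v2 in this notebook


def embedding_coverage(chunk_ids, embeddings_by_id, expected_dim=EXPECTED_DIM):
    """Return (coverage_ratio, problems) for a list of chunk ids.

    `embeddings_by_id` is a hypothetical dict mapping chunk id -> list of floats.
    """
    problems = {}
    for chunk_id in chunk_ids:
        vector = embeddings_by_id.get(chunk_id)
        if vector is None:
            problems[chunk_id] = "missing embedding"
        elif len(vector) != expected_dim:
            problems[chunk_id] = f"wrong dimension: {len(vector)}"
        elif math.sqrt(sum(x * x for x in vector)) == 0.0:
            problems[chunk_id] = "zero vector"
    covered = len(chunk_ids) - len(problems)
    ratio = covered / len(chunk_ids) if chunk_ids else 0.0
    return ratio, problems


# Toy data: one good vector, one degenerate, one wrong size, one missing
ids = ["c0", "c1", "c2", "c3"]
vectors = {
    "c0": [0.1] * 384,
    "c1": [0.0] * 384,
    "c2": [0.2] * 128,
}
ratio, problems = embedding_coverage(ids, vectors)
print(f"coverage: {ratio:.0%}")
for chunk_id, reason in problems.items():
    print(f"  {chunk_id}: {reason}")
```

A check like this would run after batch embedding and before the retrieval tests, so that retrieval problems are not confused with missing or malformed vectors.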

**Technology Stack**: ChromaDB + all-MiniLM-L6-v2 (validated in Task 1)

---

## 📋 Processing Overview

**Source Materials**:
1. `terry-real-how-can-i-get-through-to-you.pdf`
2. `terry-real-new-rules-of-marriage.pdf`
3. `terry-real-us-getting-past-you-and-me.pdf`

**Processing Pipeline**:
1. **Text Extraction** - Extract clean text from PDFs
2. **Content Analysis** - Understand structure and identify chapters
3. **Chunking Strategy** - Semantic chunking for relationship concepts
4. **Metadata Creation** - Preserve book/chapter/concept information
5. **Embedding Generation** - Process with all-MiniLM-L6-v2
6. **Quality Validation** - Test retrieval and coherence
7. **Performance Testing** - Verify query performance for AI conversations
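
Steps 3 and 4 of the pipeline can be sketched as follows. This is a minimal illustration, not the notebook's implementation: it uses a plain fixed-size character window in place of `RecursiveCharacterTextSplitter`, and the helper names (`chunk_with_overlap`, `build_records`), the id scheme, and the metadata keys are assumptions. The "concept type" field is left out because the notebook does not define how concept types are assigned.

```python
def chunk_with_overlap(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def build_records(book, chapter_num, chapter_title, chapter_text, **chunk_kwargs):
    """Return parallel lists of ids, documents and metadatas for one chapter."""
    ids, documents, metadatas = [], [], []
    for i, chunk in enumerate(chunk_with_overlap(chapter_text, **chunk_kwargs)):
        ids.append(f"{book}-ch{chapter_num:02d}-{i:04d}")  # hypothetical id scheme
        documents.append(chunk)
        metadatas.append({
            "book": book,
            "chapter": chapter_num,
            "chapter_title": chapter_title,
            "chunk_index": i,
        })
    return ids, documents, metadatas


# Synthetic 2,500-character "chapter" so the overlap is easy to verify
sample = "".join(str(i % 10) for i in range(2500))
ids, documents, metadatas = build_records(
    "how-can-i-get-through-to-you", 1, "Love on the Ropes", sample
)
print(len(documents), [len(d) for d in documents])
print(ids[0], metadatas[0])
```

The three lists line up with the `ids`, `documents`, and `metadatas` keyword arguments of ChromaDB's `collection.add`, so the embedding step only needs to supply a matching `embeddings` list.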

---

## 1. Dependencies & Environment Setup

In [9]:
# Core dependencies
import os
import re
import time
from pathlib import Path

# PDF processing
from pdfminer.high_level import extract_text

# Text processing and chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

# ChromaDB and embeddings
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

# Data analysis and visualization
import pandas as pd
import numpy as np
from collections import Counter

print("📦 All dependencies imported successfully")
print(f"ChromaDB version: {chromadb.__version__}")

📦 All dependencies imported successfully
ChromaDB version: 1.0.12


In [10]:
# Project configuration
PROJECT_ROOT = Path("../..").resolve()  # From notebooks/ to project root
PDF_DIR = PROJECT_ROOT / "docs" / "Research" / "source-materials" / "pdf books"
CHROMA_DIR = PROJECT_ROOT / "rag_dev" / "chroma_db"
COLLECTION_NAME = "terry_real_corpus"

# Processing parameters (we'll optimize these)
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Validated in Task 1

print(f"📁 PDF Directory: {PDF_DIR}")
print(f"📁 ChromaDB Directory: {CHROMA_DIR}")
print(f"🗂️ Collection Name: {COLLECTION_NAME}")
print(f"🔧 Chunk Size: {CHUNK_SIZE}, Overlap: {CHUNK_OVERLAP}")
print(f"🤖 Embedding Model: {EMBEDDING_MODEL}")

# Verify PDF files exist
pdf_files = list(PDF_DIR.glob("*.pdf"))
print(f"\n📚 Found {len(pdf_files)} PDF files:")
for pdf in pdf_files:
    print(f"   - {pdf.name}")
    
if len(pdf_files) != 3:
    print("⚠️ Expected 3 Terry Real PDFs, please verify file paths")
else:
    print("✅ All Terry Real PDFs found")

📁 PDF Directory: D:\Github\Relational_Life_Practice\docs\Research\source-materials\pdf books
📁 ChromaDB Directory: D:\Github\Relational_Life_Practice\rag_dev\chroma_db
🗂️ Collection Name: terry_real_corpus
🔧 Chunk Size: 1000, Overlap: 200
🤖 Embedding Model: all-MiniLM-L6-v2

📚 Found 3 PDF files:
   - terry-real-how-can-i-get-through-to-you.pdf
   - terry-real-new-rules-of-marriage.pdf
   - terry-real-us-getting-past-you-and-me.pdf
✅ All Terry Real PDFs found


In [11]:
# Initialize ChromaDB client and embedding model
print("🚀 Initializing ChromaDB and embedding model...")

# Create ChromaDB directory if it doesn't exist
CHROMA_DIR.mkdir(parents=True, exist_ok=True)

# Initialize persistent ChromaDB client
client = chromadb.PersistentClient(path=str(CHROMA_DIR))
print(f"✅ ChromaDB client initialized at {CHROMA_DIR}")

# Initialize embedding model (same as Task 1 validation)
embedder = SentenceTransformer(EMBEDDING_MODEL)
print(f"✅ Embedding model '{EMBEDDING_MODEL}' loaded")
print(f"📐 Embedding dimension: {embedder.get_sentence_embedding_dimension()}")

# Verify this matches our Task 1 validation (should be 384)
expected_dim = 384
actual_dim = embedder.get_sentence_embedding_dimension()
if actual_dim == expected_dim:
    print(f"✅ Embedding dimensions match Task 1 validation: {actual_dim}")
else:
    print(f"⚠️ Dimension mismatch! Expected {expected_dim}, got {actual_dim}")

🚀 Initializing ChromaDB and embedding model...
✅ ChromaDB client initialized at D:\Github\Relational_Life_Practice\rag_dev\chroma_db
✅ Embedding model 'all-MiniLM-L6-v2' loaded
📐 Embedding dimension: 384
✅ Embedding dimensions match Task 1 validation: 384


In [13]:
# Clean up any existing collection (for fresh processing)
print(f"🧹 Preparing clean environment for {COLLECTION_NAME}...")

try:
    client.get_collection(COLLECTION_NAME)  # raises if the collection does not exist
    client.delete_collection(COLLECTION_NAME)
    print(f"🗑️ Deleted existing collection '{COLLECTION_NAME}'")
except Exception as e:
    print(f"ℹ️ No existing collection to delete: {e}")

# Create fresh collection
collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={"description": "Terry Real's Relational Life Therapy corpus for AI conversations"}
)
print(f"✅ Fresh collection '{COLLECTION_NAME}' created")
print(f"📊 Collection count: {collection.count()} documents")

print("\n" + "="*60)
print("🎉 ENVIRONMENT SETUP COMPLETE")
print("✅ Dependencies loaded")
print("✅ Paths configured and verified")
print("✅ ChromaDB client initialized")
print("✅ Embedding model ready (384 dimensions)")
print("✅ Fresh collection created")
print("🚀 Ready for PDF text extraction")
print("="*60)

🧹 Preparing clean environment for terry_real_corpus...
🗑️ Deleted existing collection 'terry_real_corpus'
✅ Fresh collection 'terry_real_corpus' created
📊 Collection count: 0 documents

🎉 ENVIRONMENT SETUP COMPLETE
✅ Dependencies loaded
✅ Paths configured and verified
✅ ChromaDB client initialized
✅ Embedding model ready (384 dimensions)
✅ Fresh collection created
🚀 Ready for PDF text extraction


## 2. PDF Text Extraction & Content Analysis

**Objective**: Extract and analyze text from Terry Real PDFs to understand structure and optimize chunking strategy

**Steps**:
1. Test text extraction from one book
2. Analyze content structure and chapter organization  
3. Identify patterns for semantic chunking
4. Validate text quality and readability

### Code Cell 1: Test Single PDF Extraction

In [14]:
# Test extraction from one Terry Real book first
print("🔍 Testing PDF text extraction...")

# Select first PDF for testing
test_pdf = pdf_files[0]
print(f"📖 Testing with: {test_pdf.name}")

# Extract text from PDF
start_time = time.time()
raw_text = extract_text(str(test_pdf))
extraction_time = time.time() - start_time

print(f"⏱️ Extraction time: {extraction_time:.2f} seconds")
print(f"📊 Total characters: {len(raw_text):,}")
print(f"📊 Total lines: {len(raw_text.splitlines()):,}")

# Show first 1000 characters to understand structure
print("\n" + "="*60)
print("📋 FIRST 1000 CHARACTERS:")
print("="*60)
print(raw_text[:1000])
print("="*60)

🔍 Testing PDF text extraction...
📖 Testing with: terry-real-how-can-i-get-through-to-you.pdf
⏱️ Extraction time: 17.15 seconds
📊 Total characters: 579,103
📊 Total lines: 12,212

📋 FIRST 1000 CHARACTERS:
How Can I Get Through to You?: Closing the
Intimacy Gap Between Men and Women

Terrence Real

2003

1

How Can I Get Through to You?

Reconnecting Men and Womeng

Terrence Real

SCRIBNER
New York London Toronto Sydney Singapore

SCRIBNER
1230 Avenue of the Americas
New York, NY 10020
www.SimonandSchuster.com

2

Copyright © 2002 by Terrence Real

All rights reserved, including the right of reproduction in whole or in part in
any form.

SCRIBNER and design are trademarks of Macmillan Library Reference USA,
Inc., used under license by Simon & Schuster, the publisher of this work.

For information about special discounts for bulk purchases, please contact Simon
& Schuster Special Sales: 1-800-465-6798 or business@simonandschuster.com

DESIGNED BY ERICH HOBBING

Text set in Janson

Manufa

### Code Cell 2: Content Structure Analysis

## 📘 Advanced Chapter Detection & Content Analysis
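A debugging cell that checks chapter detection against several heading formats and reports where each chapter marker appears. It was written to track down chapters that went missing while processing the Terry Real corpus. Run the content structure analysis cell (`In [34]`) first: this cell reads `non_empty_lines` from it and uses `chapter_metadata` when it is available.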

### 🔍 Core Features:
- **Multi-Format Pattern Detection**: Automatically detects chapters using diverse formats:
  - Numeric: `"Chapter 1"`, `"CHAPTER 2"`, `"3. Title"`
  - Word-based: `"CHAPTER EIGHT"`, `"Chapter Eleven"`
  - Title patterns: First 3 words of actual chapter titles
- **Intelligent Number-Word Conversion**: Maps 1-20 to `"ONE"`, `"EIGHT"`, `"SEVENTEEN"`, etc.
- **Metadata Integration**: Leverages existing `chapter_metadata` for targeted title searches
- **Content Structure Discovery**: Reveals book organization patterns (TOC, main content, appendices)

### 📊 Advanced Analysis & Reporting:
- **Pattern Effectiveness**: Shows which search strategies work best for each chapter
- **Content Density Mapping**: Identifies heavily referenced vs. sparse chapters
- **Location Distribution**: Reveals duplicate sections, indexes, and reference areas
- **Quality Assurance**: Reports detected vs. missing chapters, with a location count for each chapter

### 🎯 Real-World Problem Solved:
- **Challenge**: "Chapter 8" and "Chapter 11" were missed because the book spells chapter numbers out (`"CHAPTER EIGHT"`) while the search looked for digits (`"CHAPTER 8"`)
- **Solution**: Generate patterns for each chapter that cover both the numeric and the word-based form
- **Result**: All 17 chapters detected (17/17), with match locations reported per chapter
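
The mismatch between `CHAPTER EIGHT` and `CHAPTER 8` can be reproduced in isolation. The sketch below builds one case-insensitive pattern per chapter that accepts either form. The function name `chapter_heading_pattern` is hypothetical, and unlike the debug cell in this notebook the pattern is anchored to the start of the line, which is a deliberate choice here to skip in-text cross-references.

```python
import re

NUMBER_WORDS = {
    1: "ONE", 2: "TWO", 3: "THREE", 4: "FOUR", 5: "FIVE",
    6: "SIX", 7: "SEVEN", 8: "EIGHT", 9: "NINE", 10: "TEN",
    11: "ELEVEN", 12: "TWELVE", 13: "THIRTEEN", 14: "FOURTEEN", 15: "FIFTEEN",
    16: "SIXTEEN", 17: "SEVENTEEN", 18: "EIGHTEEN", 19: "NINETEEN", 20: "TWENTY",
}


def chapter_heading_pattern(chapter_num):
    """Compile a regex matching 'CHAPTER 8' and 'CHAPTER EIGHT' style headings."""
    alternatives = [str(chapter_num)]
    word = NUMBER_WORDS.get(chapter_num)
    if word:
        alternatives.append(word)
    alternation = "|".join(alternatives)
    # \b stops 'EIGHT' from matching inside 'EIGHTEEN' and '8' inside '80'
    return re.compile(rf"^CHAPTER\s+(?:{alternation})\b", re.IGNORECASE)


lines = [
    "CHAPTER EIGHT",
    "Chapter 8",
    "CHAPTER EIGHTEEN",
    "See Chapter 8 for details",
    "CHAPTER 80",
]
pattern = chapter_heading_pattern(8)
hits = [line for line in lines if pattern.search(line)]
print(hits)  # → ['CHAPTER EIGHT', 'Chapter 8']
```

Because the pattern is compiled with `re.IGNORECASE`, separate upper-case and title-case variants are not needed.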

### ✅ Enhanced Output Example:
```
📖 Chapter 8 detection:
   Pattern 'CHAPTER\s+EIGHT\b' → 2 matches:
      Line 3587: CHAPTER EIGHT...
   ✅ Found at 3 unique locations

📊 SUMMARY: 17/17 chapters detected
✅ Chapter 13: 18 locations found (heavily referenced)
✅ Chapter 8: 3 locations found (word-format detection)
```

### 🚀 Use Cases:
- **Book Corpus Processing**: Validate complete chapter coverage before chunking
- **Content Structure Analysis**: Understand document organization patterns  
- **Quality Assurance**: Ensure no missing content in RAG system preparation
- **Format Debugging**: Identify inconsistent chapter formatting across documents

The same approach should carry over to other long-form texts (academic books, technical manuals, therapeutic literature) where a missed chapter would leave a gap in the corpus.

In [35]:
# DEBUG: Comprehensive chapter detection for all chapters
print(f"\n🔍 DEBUG: Searching for ALL chapters with multiple patterns...")

# Helper function to convert numbers to words
def num_to_word_debug(num):
    words = {
        1: "ONE", 2: "TWO", 3: "THREE", 4: "FOUR", 5: "FIVE",
        6: "SIX", 7: "SEVEN", 8: "EIGHT", 9: "NINE", 10: "TEN",
        11: "ELEVEN", 12: "TWELVE", 13: "THIRTEEN", 14: "FOURTEEN", 15: "FIFTEEN",
        16: "SIXTEEN", 17: "SEVENTEEN", 18: "EIGHTEEN", 19: "NINETEEN", 20: "TWENTY"
    }
    return words.get(num, str(num))

# Create comprehensive search patterns for all chapters
all_debug_patterns = {}

for chapter_num in range(1, 18):  # Chapters 1-17
    chapter_word = num_to_word_debug(chapter_num)
    
    # Generate multiple pattern variations for each chapter
    patterns = [
        f"CHAPTER\\s+{chapter_num}\\b",           # "CHAPTER 1"
        f"Chapter\\s+{chapter_num}\\b",           # "Chapter 1"
        f"CHAPTER\\s+{chapter_word}\\b",          # "CHAPTER ONE"
        f"Chapter\\s+{chapter_word}\\b",          # "Chapter One"
        f"^{chapter_num}\\.\\s+",                 # "1. " (start of line)
    ]
    
    # Add chapter-specific title patterns if available
    if 'chapter_metadata' in globals():
        for ch in chapter_metadata:
            if ch['number'] == chapter_num:
                # Add first few words of title
                title_words = ch['title'].split()[:3]  # First 3 words
                title_pattern = "\\s+".join(re.escape(word) for word in title_words)
                patterns.append(title_pattern)
                break
    
    all_debug_patterns[chapter_num] = patterns

# Search for each chapter using all patterns
chapter_detection_summary = {}

for chapter_num, patterns in all_debug_patterns.items():
    print(f"\n📖 Chapter {chapter_num} detection:")
    chapter_matches = []
    
    for pattern in patterns:
        matches = []
        for i, line in enumerate(non_empty_lines):
            if re.search(pattern, line, re.IGNORECASE):
                matches.append((i, line[:80]))
        
        if matches:
            print(f"   Pattern '{pattern}' → {len(matches)} matches:")
            for line_idx, text in matches[:2]:  # Show first 2 per pattern
                print(f"      Line {line_idx:4d}: {text}...")
            chapter_matches.extend(matches)
    
    # Summary for this chapter
    unique_lines = list(set(match[0] for match in chapter_matches))
    chapter_detection_summary[chapter_num] = len(unique_lines)
    
    if len(unique_lines) == 0:
        print(f"   ❌ NO matches found for Chapter {chapter_num}")
    else:
        print(f"   ✅ Found at {len(unique_lines)} unique locations")

# Overall detection summary
print(f"\n" + "="*60)
print(f"📊 COMPREHENSIVE CHAPTER DETECTION SUMMARY")
print(f"="*60)

detected_chapters = [ch for ch, count in chapter_detection_summary.items() if count > 0]
missing_chapters = [ch for ch, count in chapter_detection_summary.items() if count == 0]

print(f"✅ Chapters detected: {len(detected_chapters)}/17")
print(f"❌ Chapters missing: {len(missing_chapters)}/17")

if detected_chapters:
    print(f"\n✅ Successfully detected chapters: {detected_chapters}")

if missing_chapters:
    print(f"\n❌ Missing chapters: {missing_chapters}")
    print(f"💡 These chapters may need additional search patterns")
else:
    print(f"\n🎉 ALL CHAPTERS DETECTED! Perfect coverage achieved!")

print(f"\n📋 Detection details:")
for ch_num in range(1, 18):
    status = "✅" if chapter_detection_summary[ch_num] > 0 else "❌"
    count = chapter_detection_summary[ch_num]
    print(f"   {status} Chapter {ch_num:2d}: {count} locations found")

print(f"="*60)


🔍 DEBUG: Searching for ALL chapters with multiple patterns...

📖 Chapter 1 detection:
   Pattern 'CHAPTER\s+ONE\b' → 2 matches:
      Line  297: CHAPTER ONE...
      Line 7921: CHAPTER ONE...
   Pattern 'Chapter\s+ONE\b' → 2 matches:
      Line  297: CHAPTER ONE...
      Line 7921: CHAPTER ONE...
   Pattern '^1\.\s+' → 5 matches:
      Line   70: 1. Love on the Ropes : Men and Women in Crisis...
      Line 5585: 1. Self-Esteem...
   Pattern 'Love\s+on\s+the' → 2 matches:
      Line   70: 1. Love on the Ropes : Men and Women in Crisis...
      Line  298: Love on the Ropes: Men and Women in Crisis...
   ✅ Found at 8 unique locations

📖 Chapter 2 detection:
   Pattern 'CHAPTER\s+TWO\b' → 2 matches:
      Line  801: CHAPTER TWO...
      Line 7938: CHAPTER TWO...
   Pattern 'Chapter\s+TWO\b' → 2 matches:
      Line  801: CHAPTER TWO...
      Line 7938: CHAPTER TWO...
   Pattern '^2\.\s+' → 5 matches:
      Line   71: 2. Echo Speaks: Empowering the Woman...
      Line 5587: 2. Self-Awarenes

In [34]:
# Enhanced analysis with improved chapter detection and processing logic
print("🔍 Analyzing content structure with enhanced detection...")

lines = raw_text.splitlines()
non_empty_lines = [line.strip() for line in lines if line.strip()]

print(f"📊 Non-empty lines: {len(non_empty_lines):,}")

# Enhanced chapter patterns (including suggested improvements)
chapter_patterns = [
    r"^Chapter\s+\d+",           # "Chapter 1", "Chapter 2"
    r"^CHAPTER\s+\d+",           # "CHAPTER 1", "CHAPTER 2" 
    r"^Chapter\s+\w+",           # "Chapter One", "Chapter Two"
    r"^CHAPTER\s+\w+",           # "CHAPTER ONE", "CHAPTER EIGHT" ✅ NEW
    r"^\d+\s*\.\s+\w+",          # "1. Love on the Ropes", "2. Echo Speaks" (Terry Real's format)
    r"^\d+\.\s+",                # "1. ", "2. " (simpler version)
    r"^[IVXLCDM]+\.",            # "I.", "II.", "III." (Roman numerals)
    r"^Part\s+\w+",              # "Part One", "Part Two"
    r"^PART\s+\w+"               # "PART ONE", "PART TWO"
]

# Find all potential chapters with enhanced detection
potential_chapters = []
for i, line in enumerate(non_empty_lines[:300]):  # Check first 300 lines (expanded)
    for pattern_idx, pattern in enumerate(chapter_patterns):
        if re.match(pattern, line, re.IGNORECASE):
            potential_chapters.append({
                'line_index': i,
                'text': line,
                'pattern_type': pattern_idx,
                'pattern': pattern
            })

print(f"\n📚 Enhanced chapter detection results: {len(potential_chapters)} markers found")

# Deduplicate: several patterns can match the same line,
# so keep only the first marker recorded for each line index
seen_lines = set()
unique_chapters = []
for chapter in potential_chapters:
    line_idx = chapter['line_index']
    if line_idx not in seen_lines:
        seen_lines.add(line_idx)
        unique_chapters.append(chapter)

potential_chapters = unique_chapters
print(f"📚 After deduplication: {len(potential_chapters)} unique markers")

# Group by pattern type for analysis
pattern_counts = Counter([ch['pattern_type'] for ch in potential_chapters])
pattern_names = [
    "Chapter X",        # 0: r"^Chapter\s+\d+"
    "CHAPTER X",        # 1: r"^CHAPTER\s+\d+"
    "Chapter Word",     # 2: r"^Chapter\s+\w+" ✅ NEW
    "CHAPTER WORD",     # 3: r"^CHAPTER\s+\w+" ✅ NEW
    "X. Title",         # 4: r"^\d+\s*\.\s+\w+"
    "X.",               # 5: r"^\d+\.\s+"
    "Roman",            # 6: r"^[IVXLCDM]+\."
    "Part Word",        # 7: r"^Part\s+\w+"
    "PART WORD"         # 8: r"^PART\s+\w+"
]

for pattern_idx, count in pattern_counts.items():
    print(f"   {pattern_names[pattern_idx]}: {count} matches")

# Show enhanced chapter information
print(f"\n📖 Detected chapters with enhanced metadata:")
for i, chapter in enumerate(potential_chapters[:12]):  # Show first 12
    line_idx = chapter['line_index']
    text = chapter['text']
    pattern_type = pattern_names[chapter['pattern_type']]
    print(f"   {i+1:2d}. Line {line_idx:3d} [{pattern_type:8s}]: {text[:70]}...")

# Extract chapter titles and numbers for Terry Real's format (pattern type 4: "X. Title")
terry_real_chapters = [ch for ch in potential_chapters if ch['pattern_type'] == 4]
print(f"\n🎯 Terry Real format chapters (X. Title): {len(terry_real_chapters)}")

chapter_metadata = []
for chapter in terry_real_chapters:
    text = chapter['text']
    # Extract chapter number and title
    match = re.match(r'^(\d+)\s*\.\s+(.+)', text)
    if match:
        chapter_num = int(match.group(1))
        chapter_title = match.group(2).strip()
        chapter_metadata.append({
            'number': chapter_num,
            'title': chapter_title,
            'line_index': chapter['line_index'],
            'full_text': text
        })

# Display structured chapter information
if chapter_metadata:
    print(f"\n📋 Structured chapter metadata extracted:")
    for ch in chapter_metadata:
        print(f"   Chapter {ch['number']:2d}: {ch['title'][:60]}...")
    
    # Helper function to convert numbers to words (for word-based chapter search)
    def num_to_word(num):
        words = {
            1: "ONE", 2: "TWO", 3: "THREE", 4: "FOUR", 5: "FIVE",
            6: "SIX", 7: "SEVEN", 8: "EIGHT", 9: "NINE", 10: "TEN",
            11: "ELEVEN", 12: "TWELVE", 13: "THIRTEEN", 14: "FOURTEEN", 15: "FIFTEEN",
            16: "SIXTEEN", 17: "SEVENTEEN", 18: "EIGHTEEN", 19: "NINETEEN", 20: "TWENTY"
        }
        return words.get(num, str(num))
    
    # NEW: Find actual chapter content locations (beyond TOC) with enhanced search
    print(f"\n🔍 Locating actual chapter content (beyond TOC) with enhanced patterns...")
    
    actual_chapter_locations = []
    toc_end_line = max(ch['line_index'] for ch in chapter_metadata) + 20  # Start searching after TOC
    
    for chapter in chapter_metadata:
        chapter_title_pattern = re.escape(chapter['title'][:30])  # First 30 chars of title
        chapter_num_pattern = f"^{chapter['number']}\\."  # Just the number pattern
        
        # ENHANCED: Add word-based chapter patterns for missing chapters
        chapter_word_patterns = [
            f"CHAPTER\\s+{num_to_word(chapter['number'])}",  # CHAPTER EIGHT
            f"Chapter\\s+{num_to_word(chapter['number'])}"   # Chapter Eight
        ]
        
        # Search for this chapter in the main content (after TOC)
        found = False
        for i, line in enumerate(non_empty_lines[toc_end_line:], start=toc_end_line):
            # Original patterns
            if (re.search(chapter_title_pattern, line, re.IGNORECASE) or 
                re.match(chapter_num_pattern, line)):
                actual_chapter_locations.append({
                    'number': chapter['number'],
                    'title': chapter['title'],
                    'line_index': i,
                    'found_text': line[:100]
                })
                found = True
                break
            
            # NEW: Check word-based patterns for chapters like CHAPTER EIGHT
            for word_pattern in chapter_word_patterns:
                if re.search(word_pattern, line, re.IGNORECASE):
                    actual_chapter_locations.append({
                        'number': chapter['number'],
                        'title': chapter['title'],
                        'line_index': i,
                        'found_text': line[:100]
                    })
                    found = True
                    break
            
            if found:
                break
    
    # CRITICAL FIX: Sort actual chapter locations by line index (not chapter number)
    actual_chapter_locations.sort(key=lambda x: x['line_index'])
    
    print(f"📍 Found {len(actual_chapter_locations)} actual chapter locations (sorted by position):")
    for loc in actual_chapter_locations[:5]:  # Show first 5
        print(f"   Ch {loc['number']:2d}: Line {loc['line_index']:4d} - {loc['found_text'][:60]}...")
    
    # Use actual locations if found, otherwise fall back to TOC
    if len(actual_chapter_locations) >= len(chapter_metadata) * 0.5:  # Found at least half
        print(f"\n✅ Using actual chapter locations for boundaries")
        locations_to_use = actual_chapter_locations
    else:
        print(f"\n⚠️ Using TOC locations (could not find enough actual chapters)")
        locations_to_use = chapter_metadata
    
    # Create chapter boundaries using the best available locations
    chapter_boundaries = []
    for i, ch in enumerate(locations_to_use):
        start_line = ch['line_index']
        end_line = (locations_to_use[i + 1]['line_index'] 
                   if i + 1 < len(locations_to_use) 
                   else len(non_empty_lines))
        chapter_boundaries.append({
            'chapter_num': ch['number'],
            'title': ch['title'],
            'start_line': start_line,
            'end_line': end_line,
            'estimated_lines': end_line - start_line
        })
    
    print(f"\n📐 Chapter boundaries for processing:")
    total_content_lines = 0
    for boundary in chapter_boundaries:  # Show ALL chapters (removed [:8] limit)
        lines_count = boundary['estimated_lines']
        total_content_lines += lines_count
        print(f"   Ch {boundary['chapter_num']:2d}: Lines {boundary['start_line']:4d}-{boundary['end_line']:4d} ({lines_count:4d} lines) - {boundary['title'][:45]}...")

    # Every chapter is listed above, so no truncation notice is needed
    print(f"\n📊 All {len(chapter_boundaries)} chapters displayed above")
        
    print(f"\n📊 Chapter-based processing summary:")
    print(f"   Total chapters identified: {len(chapter_boundaries)}")
    print(f"   Total content lines: {sum(b['estimated_lines'] for b in chapter_boundaries):,}")
    print(f"   Average lines per chapter: {sum(b['estimated_lines'] for b in chapter_boundaries) // len(chapter_boundaries):,}")
    
    # chapter_metadata, chapter_boundaries and actual_chapter_locations are assigned
    # at notebook (module) level, so later cells can use them without globals()
    print(f"   ✅ Chapter boundaries stored for processing pipeline")

else:
    print("⚠️ No Terry Real format chapters detected - will use alternative chunking")

🔍 Analyzing content structure with enhanced detection...
📊 Non-empty lines: 9,025

📚 Enhanced chapter detection results: 38 markers found
📚 After deduplication: 19 unique markers
   X. Title: 17 matches
   Part Word: 1 matches
   Chapter Word: 1 matches

📖 Detected chapters with enhanced metadata:
    1. Line  70 [X. Title]: 1. Love on the Ropes : Men and Women in Crisis...
    2. Line  71 [X. Title]: 2. Echo Speaks: Empowering the Woman...
    3. Line  72 [X. Title]: 3. Bringing Men in from the Cold...
    4. Line  73 [X. Title]: 4. Psychological Patriarchy: The Dance of Contempt...
    5. Line  74 [X. Title]: 5. The Third Ring: A Conspiracy of Silence...
    6. Line  75 [X. Title]: 6. The Unspeakable Pain of Collusion...
    7. Line  76 [X. Title]: 7. Narcissus Resigns: An Unconventional Therapy...
    8. Line  77 [X. Title]: 8. Small Murders : How We Lose Passion...
    9. Line  78 [X. Title]: 9. A New Model of Love...
   10. Line  79 [X. Title]: 10. Recovering Real Passion...
   11

### Code Cell 3: Content Quality Assessment

In [None]:
# Assess text quality and identify any extraction issues
print("🔍 Assessing text extraction quality...")

# Sample paragraphs for readability check
sample_paragraphs = []
current_paragraph = []

for line in non_empty_lines[:500]:  # Check first 500 lines
    if len(line) > 100:  # Likely paragraph content
        current_paragraph.append(line)
    elif current_paragraph:
        if len(current_paragraph) >= 2:  # Multi-line paragraph
            sample_paragraphs.append(" ".join(current_paragraph))
        current_paragraph = []
    
    if len(sample_paragraphs) >= 3:  # Get 3 sample paragraphs
        break

print(f"📖 Sample readable paragraphs found: {len(sample_paragraphs)}")

for i, paragraph in enumerate(sample_paragraphs):
    print(f"\n📖 Sample Paragraph {i+1} ({len(paragraph)} chars):")
    print("-" * 50)
    print(paragraph[:300] + ("..." if len(paragraph) > 300 else ""))
    print("-" * 50)

# Check for common extraction issues
issues = []
if raw_text.count("�") > 0:
    issues.append(f"Encoding issues: {raw_text.count('�')} replacement characters")

if len([line for line in lines if len(line) == 1]) > 100:
    issues.append("Many single-character lines (possible formatting issues)")

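# Count runs of 3+ word characters; a low share relative to whitespace-separated
# tokens suggests that words were split apart during extraction
if len(re.findall(r'\w{3,}', raw_text)) < len(raw_text.split()) * 0.8:
    issues.append("Possible word separation issues")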

if issues:
    print(f"\n⚠️ Potential extraction issues:")
    for issue in issues:
        print(f"   - {issue}")
else:
    print(f"\n✅ Text extraction quality looks good!")

### Code Cell 4: Chunking Strategy Analysis

In [None]:
# Analyze optimal chunking strategy for Terry Real content
print("🔪 Analyzing optimal chunking strategy...")

# Test current chunking parameters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Create initial chunks for analysis (first 50k characters, or less for short texts)
sample_text = raw_text[:50000]
test_chunks = splitter.split_text(sample_text)

print(f"📊 Test chunking results:")
print(f"   Source text: {len(sample_text):,} characters")
print(f"   Generated chunks: {len(test_chunks):,}")
print(f"   Average chunk size: {np.mean([len(chunk) for chunk in test_chunks]):.0f} characters")
print(f"   Min chunk size: {min(len(chunk) for chunk in test_chunks)}")
print(f"   Max chunk size: {max(len(chunk) for chunk in test_chunks)}")

# Show sample chunks
print(f"\n📋 Sample chunks for quality assessment:")
for i, chunk in enumerate(test_chunks[:3]):
    print(f"\n--- Chunk {i+1} ({len(chunk)} chars) ---")
    print(chunk[:200] + ("..." if len(chunk) > 200 else ""))
    print("--- End Chunk ---")

# Analyze chunk coherence
relationship_terms = [
    "relationship", "marriage", "partner", "couple", "intimacy", 
    "communication", "conflict", "emotion", "boundary", "repair",
    "empathy", "connection", "trust", "vulnerability", "healing"
]

chunks_with_terms = []
for chunk in test_chunks[:20]:  # Analyze first 20 chunks
    term_count = sum(1 for term in relationship_terms if term.lower() in chunk.lower())
    chunks_with_terms.append(term_count)

avg_terms_per_chunk = np.mean(chunks_with_terms)
print(f"\n🔍 Relationship content analysis:")
print(f"   Average relationship terms per chunk: {avg_terms_per_chunk:.1f}")
print(f"   Chunks with 3+ relationship terms: {sum(1 for count in chunks_with_terms if count >= 3)}/{len(chunks_with_terms)}")

if avg_terms_per_chunk >= 2:
    print("✅ Good relationship content density in chunks")
else:
    print("⚠️ Consider adjusting chunk size for better content coherence")

### Code Cell 5: Processing Strategy Summary

In [None]:
# Summarize findings and prepare processing strategy
print("📋 PROCESSING STRATEGY SUMMARY")
print("=" * 60)

print(f"📖 Source Material Analysis:")
print(f"   Book: {test_pdf.name}")
print(f"   Total characters: {len(raw_text):,}")
print(f"   Total lines: {len(raw_text.splitlines()):,}")
print(f"   Extraction time: {extraction_time:.2f} seconds")

print(f"\n🏗️ Content Structure:")
print(f"   Potential chapters found: {len(potential_chapters)}")
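# all_caps_lines is not defined in any earlier cell, so compute it here.
# Assumption: an all-caps heading is a short line whose letters are all upper-case.
all_caps_lines = [line for line in non_empty_lines if line.isupper() and len(line) < 80]
print(f"   All-caps headings: {len(all_caps_lines)}")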
print(f"   Text quality: {'Good' if not issues else 'Issues detected'}")

print(f"\n🔪 Chunking Strategy:")
print(f"   Current chunk size: {CHUNK_SIZE}")
print(f"   Current overlap: {CHUNK_OVERLAP}")
print(f"   Average chunk size: {np.mean([len(chunk) for chunk in test_chunks]):.0f} chars")
print(f"   Relationship content density: {avg_terms_per_chunk:.1f} terms/chunk")

# Recommendations
print(f"\n💡 Recommendations:")
if len(potential_chapters) > 0:
    print("   ✅ Chapter structure detected - can preserve in metadata")
else:
    print("   ⚠️ No clear chapter structure - use semantic chunking only")

if avg_terms_per_chunk >= 2:
    print("   ✅ Current chunk size preserves relationship content well")
else:
    print("   📝 Consider increasing chunk size for better content coherence")

print(f"\n🚀 Ready to process all {len(pdf_files)} Terry Real books!")
print("=" * 60)