# Terry Real Corpus Processing

**Purpose**: Process Terry Real's three books into a ChromaDB collection for RAG-enhanced AI conversations

**Task 2 Requirements**:
- 📚 Extract text from Terry Real PDFs systematically
- 🔪 Implement semantic chunking for relationship concepts
- 🏷️ Preserve metadata (book source, chapter, concept type)
- 🚀 Batch-embed all chunks with the validated all-MiniLM-L6-v2 model
- ✅ Validate quality: chunk coherence and embedding coverage

**Technology Stack**: ChromaDB + all-MiniLM-L6-v2 (validated in Task 1)

---

## 📋 Processing Overview

**Source Materials**:
1. `terry-real-how-can-i-get-through-to-you.pdf`
2. `terry-real-new-rules-of-marriage.pdf`
3. `terry-real-us-getting-past-you-and-me.pdf`

**Processing Pipeline**:
1. **Text Extraction** - Extract clean text from PDFs
2. **Content Analysis** - Understand structure and identify chapters
3. **Chunking Strategy** - Semantic chunking for relationship concepts
4. **Metadata Creation** - Preserve book/chapter/concept information
5. **Embedding Generation** - Process with all-MiniLM-L6-v2
6. **Quality Validation** - Test retrieval and coherence
7. **Performance Testing** - Verify query performance for AI conversations
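
Steps 3 and 4 of the pipeline can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's splitter (which imports `RecursiveCharacterTextSplitter`): it uses a fixed-size sliding window over characters, and the function name `chunk_with_metadata` and the metadata keys `book`, `chunk_index`, and `start_char` are hypothetical.

```python
def chunk_with_metadata(text, book, chunk_size=1000, overlap=200):
    """Split text into overlapping character windows and tag each with metadata.

    Hypothetical helper for illustration; names and metadata keys are assumptions.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many characters after the previous one
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "id": f"{book}-{index:05d}",
            "text": text[start:start + chunk_size],
            "metadata": {"book": book, "chunk_index": index, "start_char": start},
        })
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With a chunk size of 1000 and an overlap of 200, consecutive chunks start 800 characters apart, so the last 200 characters of one chunk repeat at the start of the next. Keeping the source book and position alongside each chunk is what later allows retrieved passages to be traced back to their origin.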

---

## 1. Dependencies & Environment Setup

In [4]:
# Core dependencies
import os
import re
import time
from pathlib import Path

# PDF processing
from pdfminer.high_level import extract_text

# Text processing and chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

# ChromaDB and embeddings
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

# Data analysis and visualization
import pandas as pd
import numpy as np
from collections import Counter

print("📦 All dependencies imported successfully")
print(f"ChromaDB version: {chromadb.__version__}")

📦 All dependencies imported successfully
ChromaDB version: 1.0.12


In [12]:
# Project configuration
PROJECT_ROOT = Path("../..").resolve()  # Two levels up from the working directory (the notebook's folder) to the project root
PDF_DIR = PROJECT_ROOT / "docs" / "Research" / "source-materials" / "pdf books"
CHROMA_DIR = PROJECT_ROOT / "rag_dev" / "chroma_db"
COLLECTION_NAME = "terry_real_corpus"

# Processing parameters (we'll optimize these)
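CHUNK_SIZE = 1000     # characters, assuming the splitter's default length function (len)
CHUNK_OVERLAP = 200   # characters shared between consecutive chunks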
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Validated in Task 1

print(f"📁 PDF Directory: {PDF_DIR}")
print(f"📁 ChromaDB Directory: {CHROMA_DIR}")
print(f"🗂️ Collection Name: {COLLECTION_NAME}")
print(f"🔧 Chunk Size: {CHUNK_SIZE}, Overlap: {CHUNK_OVERLAP}")
print(f"🤖 Embedding Model: {EMBEDDING_MODEL}")

# Verify PDF files exist
pdf_files = sorted(PDF_DIR.glob("*.pdf"))  # sorted so the processing order is stable across runs
print(f"\n📚 Found {len(pdf_files)} PDF files:")
for pdf in pdf_files:
    print(f"   - {pdf.name}")

if len(pdf_files) != 3:
    print("⚠️ Expected 3 Terry Real PDFs, please verify file paths")
else:
    print("✅ All Terry Real PDFs found")

📁 PDF Directory: D:\Github\Relational_Life_Practice\docs\Research\source-materials\pdf books
📁 ChromaDB Directory: D:\Github\Relational_Life_Practice\rag_dev\chroma_db
🗂️ Collection Name: terry_real_corpus
🔧 Chunk Size: 1000, Overlap: 200
🤖 Embedding Model: all-MiniLM-L6-v2

📚 Found 3 PDF files:
   - terry-real-how-can-i-get-through-to-you.pdf
   - terry-real-new-rules-of-marriage.pdf
   - terry-real-us-getting-past-you-and-me.pdf
✅ All Terry Real PDFs found
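
The cell above only checks that three PDFs are present. A slightly stricter check compares the directory against the expected file names, so that a stray or misnamed PDF is reported by name. The sketch below is an assumption-laden illustration: the helper `check_pdfs` and the constant `EXPECTED_PDFS` are hypothetical, and the file names are taken from the Source Materials list.

```python
from pathlib import Path

# File names from the Source Materials list; the constant name is hypothetical
EXPECTED_PDFS = {
    "terry-real-how-can-i-get-through-to-you.pdf",
    "terry-real-new-rules-of-marriage.pdf",
    "terry-real-us-getting-past-you-and-me.pdf",
}

def check_pdfs(pdf_dir, expected=EXPECTED_PDFS):
    """Return (missing, unexpected) PDF file names, each as a sorted list."""
    found = {p.name for p in Path(pdf_dir).glob("*.pdf")}
    expected = set(expected)
    return sorted(expected - found), sorted(found - expected)
```

Calling `check_pdfs(PDF_DIR)` would return two empty lists when exactly the three expected books are present.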


In [13]:
# Initialize ChromaDB client and embedding model
print("🚀 Initializing ChromaDB and embedding model...")

# Create ChromaDB directory if it doesn't exist
CHROMA_DIR.mkdir(parents=True, exist_ok=True)

# Initialize persistent ChromaDB client
client = chromadb.PersistentClient(path=str(CHROMA_DIR))
print(f"✅ ChromaDB client initialized at {CHROMA_DIR}")

# Initialize embedding model (same as Task 1 validation)
embedder = SentenceTransformer(EMBEDDING_MODEL)
print(f"✅ Embedding model '{EMBEDDING_MODEL}' loaded")
actual_dim = embedder.get_sentence_embedding_dimension()
print(f"📐 Embedding dimension: {actual_dim}")

# Verify this matches our Task 1 validation (should be 384)
expected_dim = 384
if actual_dim == expected_dim:
    print(f"✅ Embedding dimensions match Task 1 validation: {actual_dim}")
else:
    print(f"⚠️ Dimension mismatch! Expected {expected_dim}, got {actual_dim}")

🚀 Initializing ChromaDB and embedding model...
✅ ChromaDB client initialized at D:\Github\Relational_Life_Practice\rag_dev\chroma_db
✅ Embedding model 'all-MiniLM-L6-v2' loaded
📐 Embedding dimension: 384
✅ Embedding dimensions match Task 1 validation: 384
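
The batch-embedding requirement from Task 2 can be sketched independently of any particular model. The helpers below are hypothetical (`iter_batches`, `embed_in_batches`, and the `embed_fn` parameter are names introduced here for illustration): `embed_fn` is assumed to take a list of strings and return one vector per string, and the default batch size of 64 is an arbitrary choice, not a tuned value.

```python
def iter_batches(items, batch_size=64):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, expected_dim=384, batch_size=64):
    """Embed texts batch by batch, checking vector size and coverage.

    embed_fn is an assumed callable: list of strings in, one vector per string out.
    """
    vectors = []
    for batch in iter_batches(texts, batch_size):
        batch_vectors = [list(vec) for vec in embed_fn(batch)]
        for vec in batch_vectors:
            if len(vec) != expected_dim:
                raise ValueError(f"expected {expected_dim} dimensions, got {len(vec)}")
        vectors.extend(batch_vectors)
    if len(vectors) != len(texts):
        raise ValueError("embedding coverage gap: some texts have no vector")
    return vectors
```

In this notebook, `embedder.encode` could plausibly serve as `embed_fn`, since it accepts a list of strings and returns one 384-dimensional vector per string. The two checks correspond to the "embedding coverage" requirement: every chunk gets a vector, and every vector has the dimension validated in Task 1.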


In [14]:
# Clean up any existing collection (for fresh processing)
print(f"🧹 Preparing clean environment for {COLLECTION_NAME}...")

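try:
    client.get_collection(COLLECTION_NAME)  # raises if the collection does not exist
    client.delete_collection(COLLECTION_NAME)
    print(f"🗑️ Deleted existing collection '{COLLECTION_NAME}'")
except Exception as e:
    # Usually this means the collection was never created; the error is printed in case it is something else
    print(f"ℹ️ No existing collection to delete: {e}")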

# Create fresh collection
collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={"description": "Terry Real's Relational Life Therapy corpus for AI conversations"}
)
print(f"✅ Fresh collection '{COLLECTION_NAME}' created")
print(f"📊 Collection count: {collection.count()} documents")

print("\n" + "="*60)
print("🎉 ENVIRONMENT SETUP COMPLETE")
print("✅ Dependencies loaded")
print("✅ Paths configured and verified")
print("✅ ChromaDB client initialized")
print("✅ Embedding model ready (384 dimensions)")
print("✅ Fresh collection created")
print("🚀 Ready for PDF text extraction")
print("="*60)

🧹 Preparing clean environment for terry_real_corpus...
🗑️ Deleted existing collection 'terry_real_corpus'
✅ Fresh collection 'terry_real_corpus' created
📊 Collection count: 0 documents

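============================================================
🎉 ENVIRONMENT SETUP COMPLETE
✅ Dependencies loaded
✅ Paths configured and verified
✅ ChromaDB client initialized
✅ Embedding model ready (384 dimensions)
✅ Fresh collection created
🚀 Ready for PDF text extraction
============================================================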
