# Terry Real Corpus Processing

**Purpose**: Process Terry Real's three books into a ChromaDB collection for RAG-enhanced AI conversations

**Task 2 Requirements**:
- 📚 Extract text from Terry Real PDFs systematically
- 🔪 Implement semantic chunking for relationship concepts
- 🏷️ Preserve metadata (book source, chapter, concept type)
- 🚀 Batch embed all chunks with validated all-MiniLM-L6-v2
- ✅ Validate quality - chunk coherence and embedding coverage
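
The last requirement names two quality checks without defining them. A minimal sketch of what they could look like follows; the `Chunk` container, both function names, and the sentence-boundary heuristic are assumptions made for illustration, not metrics defined by this notebook.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Chunk:
    """Hypothetical container for one chunk; not part of the notebook."""
    text: str
    embedding: Optional[Sequence[float]] = None


def embedding_coverage(chunks, expected_dim=384):
    """Fraction of chunks that carry an embedding of the expected size."""
    if not chunks:
        return 0.0
    covered = sum(
        1 for c in chunks
        if c.embedding is not None and len(c.embedding) == expected_dim
    )
    return covered / len(chunks)


def sentence_boundary_rate(chunks):
    """Fraction of chunks whose text ends on sentence-final punctuation.

    A crude proxy for coherence (an assumption, not a validated metric):
    chunks cut mid-sentence tend to read worse when retrieved.
    """
    if not chunks:
        return 0.0
    closers = (".", "!", "?", '."', '!"', '?"')
    ended = sum(1 for c in chunks if c.text.rstrip().endswith(closers))
    return ended / len(chunks)


# Placeholder chunks, not book content.
chunks = [
    Chunk("First sample sentence.", [0.0] * 384),
    Chunk("A chunk that stops in the middle of", [0.0] * 384),
    Chunk("This one has no embedding yet.", None),
    Chunk("Does this end cleanly?", [0.0] * 384),
]
print(f"coverage: {embedding_coverage(chunks):.2f}")
print(f"boundary rate: {sentence_boundary_rate(chunks):.2f}")
```

Coverage below 1.0 would point at chunks that were skipped during embedding; a low boundary rate would suggest revisiting chunk size or separators.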

**Technology Stack**: ChromaDB + all-MiniLM-L6-v2 (validated in Task 1)

---

## 📋 Processing Overview

**Source Materials**:
1. `terry-real-how-can-i-get-through-to-you.pdf`
2. `terry-real-new-rules-of-marriage.pdf`
3. `terry-real-us-getting-past-you-and-me.pdf`
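
Checking for these three files by name, rather than only counting PDFs in a folder, makes a missing or misnamed book easier to spot. The sketch below assumes the file names listed above; `EXPECTED_PDFS` and `missing_pdfs` are hypothetical helpers, and the temporary directory only stands in for the real PDF folder.

```python
import tempfile
from pathlib import Path

# File names as listed above; the directory passed in is up to the caller.
EXPECTED_PDFS = (
    "terry-real-how-can-i-get-through-to-you.pdf",
    "terry-real-new-rules-of-marriage.pdf",
    "terry-real-us-getting-past-you-and-me.pdf",
)


def missing_pdfs(pdf_dir, expected=EXPECTED_PDFS):
    """Return the expected file names that are not present in pdf_dir."""
    pdf_dir = Path(pdf_dir)
    return [name for name in expected if not (pdf_dir / name).is_file()]


# Demo against a throwaway directory that holds only the first two files.
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_PDFS[:2]:
        (Path(tmp) / name).touch()
    demo_missing = missing_pdfs(tmp)

print(demo_missing)
```

An empty list means all three books are in place.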

**Processing Pipeline**:
1. **Text Extraction** - Extract clean text from PDFs
2. **Content Analysis** - Understand structure and identify chapters
3. **Chunking Strategy** - Semantic chunking for relationship concepts
4. **Metadata Creation** - Preserve book/chapter/concept information
5. **Embedding Generation** - Process with all-MiniLM-L6-v2
6. **Quality Validation** - Test retrieval and coherence
7. **Performance Testing** - Verify query performance for AI conversations
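
Steps 3 to 5 can be sketched in plain Python to show how chunks, metadata, and batches fit together. This is a simplification under stated assumptions: fixed-size character windows stand in for semantic chunking, the 1000/200 defaults mirror the notebook's initial chunk size and overlap, and `chunk_text`, `build_records`, and `batched` are hypothetical names. The concept-type metadata field is left out because it needs a classification step not shown here.

```python
from typing import Dict, Iterator, List


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def build_records(book: str, chapter: str, text: str) -> List[Dict]:
    """Attach an id and source metadata to every chunk."""
    records = []
    for i, chunk in enumerate(chunk_text(text)):
        records.append({
            "id": f"{book}-{chapter}-{i}",
            "document": chunk,
            "metadata": {"book": book, "chapter": chapter, "chunk_index": i},
        })
    return records


def batched(items: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sample = "x" * 2500  # placeholder text, not book content
records = build_records("sample-book", "ch1", sample)
batch_sizes = [len(b) for b in batched(records, 2)]
print(len(records), batch_sizes)
```

In the real pipeline each batch's documents would presumably be passed to the sentence-transformers model's `encode` method, and the resulting vectors stored through the ChromaDB collection's `add` method together with the ids and metadata.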

---

## 1. Dependencies & Environment Setup

In [None]:
# Core dependencies
import os
import re
import time
from pathlib import Path

# PDF processing
from pdfminer.high_level import extract_text

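# Text processing and chunking
# Newer LangChain releases ship the splitters in the separate
# langchain_text_splitters package; fall back to the legacy path otherwise.
try:
    from langchain_text_splitters import RecursiveCharacterTextSplitter
except ImportError:
    from langchain.text_splitter import RecursiveCharacterTextSplitter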

# ChromaDB and embeddings
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

# Data analysis and visualization
import pandas as pd
import numpy as np
from collections import Counter

print("📦 All dependencies imported successfully")
print(f"ChromaDB version: {chromadb.__version__}")

In [None]:
# Project configuration
PROJECT_ROOT = Path("..").resolve()  # From notebooks/ to project root
PDF_DIR = PROJECT_ROOT / "docs" / "Research" / "source-materials" / "pdf books"
CHROMA_DIR = PROJECT_ROOT / "rag_dev" / "chroma_db"
COLLECTION_NAME = "terry_real_corpus"

# Processing parameters (we'll optimize these)
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Validated in Task 1

print(f"📁 PDF Directory: {PDF_DIR}")
print(f"📁 ChromaDB Directory: {CHROMA_DIR}")
print(f"🗂️ Collection Name: {COLLECTION_NAME}")
print(f"🔧 Chunk Size: {CHUNK_SIZE}, Overlap: {CHUNK_OVERLAP}")
print(f"🤖 Embedding Model: {EMBEDDING_MODEL}")

# Verify PDF files exist
pdf_files = list(PDF_DIR.glob("*.pdf"))
print(f"\n📚 Found {len(pdf_files)} PDF files:")
for pdf in pdf_files:
    print(f"   - {pdf.name}")
    
if len(pdf_files) != 3:
    print("⚠️ Expected 3 Terry Real PDFs, please verify file paths")
else:
    print("✅ All Terry Real PDFs found")

In [None]:
# Initialize ChromaDB client and embedding model
print("🚀 Initializing ChromaDB and embedding model...")

# Create ChromaDB directory if it doesn't exist
CHROMA_DIR.mkdir(parents=True, exist_ok=True)

# Initialize persistent ChromaDB client
client = chromadb.PersistentClient(path=str(CHROMA_DIR))
print(f"✅ ChromaDB client initialized at {CHROMA_DIR}")

# Initialize embedding model (same as Task 1 validation)
embedder = SentenceTransformer(EMBEDDING_MODEL)
print(f"✅ Embedding model '{EMBEDDING_MODEL}' loaded")
print(f"📐 Embedding dimension: {embedder.get_sentence_embedding_dimension()}")

# Verify this matches our Task 1 validation (should be 384)
expected_dim = 384
actual_dim = embedder.get_sentence_embedding_dimension()
if actual_dim == expected_dim:
    print(f"✅ Embedding dimensions match Task 1 validation: {actual_dim}")
else:
    print(f"⚠️ Dimension mismatch! Expected {expected_dim}, got {actual_dim}")

In [None]:
# Clean up any existing collection (for fresh processing)
print(f"🧹 Preparing clean environment for {COLLECTION_NAME}...")

try:
    existing_collection = client.get_collection(COLLECTION_NAME)
    client.delete_collection(COLLECTION_NAME)
    print(f"🗑️ Deleted existing collection '{COLLECTION_NAME}'")
except Exception as e:
    print(f"ℹ️ No existing collection to delete: {e}")

# Create fresh collection
collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={"description": "Terry Real's Relational Life Therapy corpus for AI conversations"}
)
print(f"✅ Fresh collection '{COLLECTION_NAME}' created")
print(f"📊 Collection count: {collection.count()} documents")

print("\n" + "="*60)
print("🎉 ENVIRONMENT SETUP COMPLETE")
print("✅ Dependencies loaded")
print("✅ Paths configured and verified")
print("✅ ChromaDB client initialized")
# Report the dimension actually measured above instead of a hard-coded value
print(f"✅ Embedding model ready ({actual_dim} dimensions)")
print("✅ Fresh collection created")
print("🚀 Ready for PDF text extraction")
print("="*60)