# Terry Real Corpus Processing

**Purpose**: Process Terry Real's three books into a ChromaDB collection for RAG-enhanced AI conversations

**Task 2 Requirements**:
- 📚 Extract text from Terry Real PDFs systematically
- 🔪 Implement semantic chunking for relationship concepts
- 🏷️ Preserve metadata (book source, chapter, concept type)
- 🚀 Batch embed all chunks with validated all-MiniLM-L6-v2
- ✅ Validate quality - chunk coherence and embedding coverage

**Technology Stack**: ChromaDB + all-MiniLM-L6-v2 (validated in Task 1)

---

## 📋 Processing Overview

**Source Materials**:
1. `terry-real-how-can-i-get-through-to-you.pdf`
2. `terry-real-new-rules-of-marriage.pdf`
3. `terry-real-us-getting-past-you-and-me.pdf`

**Processing Pipeline**:
1. **Text Extraction** - Extract clean text from PDFs
2. **Content Analysis** - Understand structure and identify chapters
3. **Chunking Strategy** - Semantic chunking for relationship concepts
4. **Metadata Creation** - Preserve book/chapter/concept information
5. **Embedding Generation** - Process with all-MiniLM-L6-v2
6. **Quality Validation** - Test retrieval and coherence
7. **Performance Testing** - Verify query performance for AI conversations
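Steps 3 and 4 of the pipeline can be illustrated with a minimal sketch. It assumes a plain sliding character window rather than true semantic chunking, and reuses the 1000/200 size and overlap values configured later in this notebook. The `Chunk` class, the `chunk_with_metadata` function, and the metadata keys are hypothetical names chosen for illustration, not part of the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One piece of text plus the metadata that travels with it."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_with_metadata(text, book, chapter, chunk_size=1000, overlap=200):
    """Split `text` into overlapping character windows tagged with their source.

    Assumption: a fixed-size sliding window stands in for semantic chunking.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        window = text[start:start + chunk_size]
        chunks.append(
            Chunk(
                text=window,
                metadata={
                    "book": book,
                    "chapter": chapter,
                    "chunk_index": index,
                    "start_char": start,
                },
            )
        )
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Keeping the metadata values to strings and integers means each dictionary could later be passed to a vector store as-is. A `concept_type` key would need a separate tagging step that this sketch does not attempt.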

---

## 1. Dependencies & Environment Setup

In [2]:
# Core dependencies
import os
import re
import time
from pathlib import Path

# PDF processing
from pdfminer.high_level import extract_text

# Text processing and chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

# ChromaDB and embeddings
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer

# Data analysis and visualization
import pandas as pd
import numpy as np
from collections import Counter

print("📦 All dependencies imported successfully")
print(f"ChromaDB version: {chromadb.__version__}")

📦 All dependencies imported successfully
ChromaDB version: 1.0.12


In [3]:
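# Project configuration
# NOTE: Path("..") is resolved against the current working directory, so this only
# yields the repository root when notebooks/ sits directly under it. If notebooks/
# is nested one level deeper (e.g. rag_dev/notebooks/), PROJECT_ROOT becomes rag_dev/
# and the PDF and ChromaDB paths below point to the wrong place.
PROJECT_ROOT = Path("..").resolve()  # Parent of the working directory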
PDF_DIR = PROJECT_ROOT / "docs" / "Research" / "source-materials" / "pdf books"
CHROMA_DIR = PROJECT_ROOT / "rag_dev" / "chroma_db"
COLLECTION_NAME = "terry_real_corpus"

# Processing parameters (we'll optimize these)
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Validated in Task 1

print(f"📁 PDF Directory: {PDF_DIR}")
print(f"📁 ChromaDB Directory: {CHROMA_DIR}")
print(f"🗂️ Collection Name: {COLLECTION_NAME}")
print(f"🔧 Chunk Size: {CHUNK_SIZE}, Overlap: {CHUNK_OVERLAP}")
print(f"🤖 Embedding Model: {EMBEDDING_MODEL}")

# Verify PDF files exist
pdf_files = list(PDF_DIR.glob("*.pdf"))
print(f"\n📚 Found {len(pdf_files)} PDF files:")
for pdf in pdf_files:
    print(f"   - {pdf.name}")

if len(pdf_files) != 3:
    print("⚠️ Expected 3 Terry Real PDFs, please verify file paths")
else:
    print("✅ All Terry Real PDFs found")

📁 PDF Directory: D:\Github\Relational_Life_Practice\rag_dev\docs\Research\source-materials\pdf books
📁 ChromaDB Directory: D:\Github\Relational_Life_Practice\rag_dev\rag_dev\chroma_db
🗂️ Collection Name: terry_real_corpus
🔧 Chunk Size: 1000, Overlap: 200
🤖 Embedding Model: all-MiniLM-L6-v2

📚 Found 0 PDF files:
⚠️ Expected 3 Terry Real PDFs, please verify file paths
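The output above shows `rag_dev` appearing twice in the ChromaDB path and no PDFs found, which suggests the relative `Path("..")` did not land on the repository root. One way to make the lookup independent of the working directory is to walk upward until a known marker is found. The sketch below is only an illustration: `find_project_root` is a hypothetical helper, and using a `docs` folder as the marker is an assumption about the repository layout.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="docs"):
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Assumption: the repository root is the closest directory holding `marker`.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains '{marker}'")


# Demonstration on a throwaway directory tree that mimics repo/rag_dev/notebooks
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve() / "repo"
    (repo / "docs").mkdir(parents=True)
    notebooks = repo / "rag_dev" / "notebooks"
    notebooks.mkdir(parents=True)
    found_root = find_project_root(notebooks)
    root_matches = found_root == repo
```

In the notebook this might replace the relative lookup with something like `PROJECT_ROOT = find_project_root(Path.cwd())`, provided the marker really identifies the root.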


In [4]:
# Initialize ChromaDB client and embedding model
print("🚀 Initializing ChromaDB and embedding model...")

# Create ChromaDB directory if it doesn't exist
CHROMA_DIR.mkdir(parents=True, exist_ok=True)

# Initialize persistent ChromaDB client
client = chromadb.PersistentClient(path=str(CHROMA_DIR))
print(f"✅ ChromaDB client initialized at {CHROMA_DIR}")

# Initialize embedding model (same as Task 1 validation)
embedder = SentenceTransformer(EMBEDDING_MODEL)
actual_dim = embedder.get_sentence_embedding_dimension()
print(f"✅ Embedding model '{EMBEDDING_MODEL}' loaded")
print(f"📐 Embedding dimension: {actual_dim}")

# Verify this matches our Task 1 validation (should be 384)
expected_dim = 384
if actual_dim == expected_dim:
    print(f"✅ Embedding dimensions match Task 1 validation: {actual_dim}")
else:
    print(f"⚠️ Dimension mismatch! Expected {expected_dim}, got {actual_dim}")

🚀 Initializing ChromaDB and embedding model...
✅ ChromaDB client initialized at D:\Github\Relational_Life_Practice\rag_dev\rag_dev\chroma_db


✅ Embedding model 'all-MiniLM-L6-v2' loaded
📐 Embedding dimension: 384
✅ Embedding dimensions match Task 1 validation: 384
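With the model loaded and its dimension confirmed, the batch-embedding requirement could be organised roughly as follows. This is a sketch under assumptions: `batched`, `embed_in_batches`, and `fake_embed` are hypothetical names, and `embed_fn` is a placeholder for whatever produces vectors (in this notebook that would presumably be `embedder.encode`). A stub embedder is used so the example runs without downloading a model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], Sequence[Sequence[float]]],
    batch_size: int = 64,
    expected_dim: int = 384,
) -> List[List[float]]:
    """Embed `texts` batch by batch, checking every vector's dimension."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        for vector in embed_fn(list(batch)):
            if len(vector) != expected_dim:
                raise ValueError(
                    f"Expected {expected_dim} dimensions, got {len(vector)}"
                )
            vectors.append(list(vector))
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stub embedder: one constant 384-dimensional vector per input text."""
    return [[float(len(text))] * 384 for text in batch]
```

`SentenceTransformer.encode` already accepts a `batch_size` argument, so an outer loop like this is mainly useful when each batch should be written to the collection before the next one is embedded.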


In [5]:
# Clean up any existing collection (for fresh processing)
print(f"🧹 Preparing clean environment for {COLLECTION_NAME}...")

try:
    client.get_collection(COLLECTION_NAME)  # Raises if the collection does not exist
    client.delete_collection(COLLECTION_NAME)
    print(f"🗑️ Deleted existing collection '{COLLECTION_NAME}'")
except Exception as e:
    # Broad catch: the exception raised for a missing collection differs between
    # ChromaDB versions, but this also hides unrelated failures
    print(f"ℹ️ No existing collection to delete: {e}")

# Create fresh collection
collection = client.create_collection(
    name=COLLECTION_NAME,
    metadata={"description": "Terry Real's Relational Life Therapy corpus for AI conversations"}
)
print(f"✅ Fresh collection '{COLLECTION_NAME}' created")
print(f"📊 Collection count: {collection.count()} documents")

print("\n" + "="*60)
print("🎉 ENVIRONMENT SETUP COMPLETE")
print("✅ Dependencies loaded")
print("✅ Paths configured and verified")
print("✅ ChromaDB client initialized")
print("✅ Embedding model ready (384 dimensions)")
print("✅ Fresh collection created")
print("🚀 Ready for PDF text extraction")
print("="*60)

🧹 Preparing clean environment for terry_real_corpus...
ℹ️ No existing collection to delete: Collection [terry_real_corpus] does not exists
✅ Fresh collection 'terry_real_corpus' created
📊 Collection count: 0 documents

🎉 ENVIRONMENT SETUP COMPLETE
✅ Dependencies loaded
✅ Paths configured and verified
✅ ChromaDB client initialized
✅ Embedding model ready (384 dimensions)
✅ Fresh collection created
🚀 Ready for PDF text extraction
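The delete-then-create step above relies on catching a broad exception. An alternative, sketched here under assumptions, is to check which collections exist before deleting. `reset_collection` is a hypothetical helper that uses only the `list_collections`, `delete_collection`, and `create_collection` client methods. `InMemoryClient` is a made-up stand-in so the example runs without ChromaDB installed; it is not part of the library.

```python
def reset_collection(client, name, metadata=None):
    """Delete collection `name` if it exists, then create it fresh.

    Handles clients whose list_collections() returns either names or
    objects with a `.name` attribute.
    """
    existing = {
        item if isinstance(item, str) else item.name
        for item in client.list_collections()
    }
    if name in existing:
        client.delete_collection(name)
    return client.create_collection(name=name, metadata=metadata)


class InMemoryClient:
    """Hypothetical stand-in exposing the three client methods used above."""

    def __init__(self):
        self._collections = {}

    def list_collections(self):
        return list(self._collections)

    def delete_collection(self, name):
        del self._collections[name]

    def create_collection(self, name, metadata=None):
        if name in self._collections:
            raise ValueError(f"Collection {name} already exists")
        self._collections[name] = {"name": name, "metadata": metadata, "documents": []}
        return self._collections[name]
```

With a real client the call might look like `collection = reset_collection(client, COLLECTION_NAME, {"description": "..."})`, so that unrelated errors surface instead of being reported as a missing collection.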
