# Chonkie Advanced Tutorial: Refineries, Porters & Handshakes

## Master Advanced Features for Production RAG Systems

Welcome to the advanced Chonkie tutorial! This notebook covers the powerful features that take your RAG pipelines from prototype to production.

### Prerequisites
- Complete the [Chonkie Chunkers Tutorial](chonkie_complete_tutorial.ipynb) first
- Understanding of chunking strategies
- Google Gemini API key
- Pinecone API key (free tier available)

### What You'll Learn

1. **Refineries** - Enhance chunks with context and embeddings
2. **Porters** - Export chunks to JSON and HuggingFace Datasets
3. **Handshakes** - Integrate with vector databases (ChromaDB, Pinecone)
4. **Production Pipelines** - Build complete end-to-end RAG systems

### Tutorial Structure (about 95 minutes)

1. Introduction & Setup (10 min)
2. Refineries - Enhancing Chunks (25 min)
3. Porters - Exporting Data (15 min)
4. Handshakes - Vector DB Integration (30 min)
5. Best Practices (10 min)
6. Exercises (5 min)

Let's get started!

## Section 1: Introduction & Setup

### 1.1 Understanding the RAG Pipeline

In the previous tutorial, we learned about chunking. Now we'll complete the RAG pipeline:

```
Documents → Chunkers → Refineries → Porters/Handshakes → Vector DB → RAG
```

- **Refineries** add context or embeddings to chunks
- **Porters** export chunks to various formats
- **Handshakes** connect chunks directly to vector databases

### When to Use These Features

- **Overlap Refinery**: Prevent context loss at chunk boundaries
- **Embeddings Refinery**: Pre-compute embeddings for batch operations
- **JSONPorter**: Backup, debugging, version control
- **DatasetsPorter**: Share data, ML workflows, HuggingFace Hub
- **Handshakes**: Production RAG with scalable vector databases

In [None]:
# 1.2 Setup and Imports
import sys
print(f"Python version: {sys.version}")

try:
    import chonkie
    print(f"Chonkie version: {chonkie.__version__}")
except ImportError:
    print("ERROR: Chonkie not installed. Run: pip install -r requirements.txt")

# Import required libraries
from dotenv import load_dotenv
import os
import time
import json
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from typing import List, Dict

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("\nAll base libraries loaded successfully!")

In [None]:
# 1.3 Load API Keys
load_dotenv()

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

# Check Gemini API
if not GEMINI_API_KEY or GEMINI_API_KEY == "your_api_key_here":
    print("WARNING: Gemini API key not configured!")
    print("Set GEMINI_API_KEY in .env file")
else:
    print("✓ Gemini API key loaded")

# Check Pinecone API
if not PINECONE_API_KEY or PINECONE_API_KEY == "your_pinecone_api_key_here":
    print("WARNING: Pinecone API key not configured!")
    print("Set PINECONE_API_KEY in .env file")
    print("Get key from: https://www.pinecone.io/")
else:
    print("✓ Pinecone API key loaded")

In [None]:
# 1.4 Import Chonkie Components
try:
    from chonkie import (
        RecursiveChunker,
        TokenChunker,
        OverlapRefinery,
        EmbeddingsRefinery,
        JSONPorter,
        DatasetsPorter,
        ChromaHandshake,
        PineconeHandshake,
        GeminiEmbeddings,
        Pipeline
    )
    print("✓ All Chonkie components imported successfully!")
except ImportError as e:
    print(f"ERROR importing Chonkie components: {e}")
    print("Install missing dependencies: pip install -r requirements.txt")

In [None]:
# 1.5 Load Sample Data (from chunkers tutorial)
try:
    # Load technical documentation
    with open("../data/sample_technical_doc.txt", "r") as f:
        technical_doc_text = f.read()

    # Load research paper
    with open("../data/sample_research_paper.txt", "r") as f:
        research_paper_text = f.read()

    print(f"✓ Technical Doc: {len(technical_doc_text):,} characters")
    print(f"✓ Research Paper: {len(research_paper_text):,} characters")
    
    print("\n--- Sample Text Preview ---")
    print(technical_doc_text[:250] + "...")
except FileNotFoundError:
    print("ERROR: Sample data files not found!")
    print("Make sure you're running from the notebooks/ directory")

In [None]:
# 1.6 Initialize Gemini Embeddings
try:
    embeddings = GeminiEmbeddings(
        model="gemini-embedding-exp-03-07",
        api_key=GEMINI_API_KEY,
        task_type="SEMANTIC_SIMILARITY"
    )
    
    # Test embeddings
    test_vector = embeddings.embed("Hello, this is a test")
    print(f"✓ Gemini embeddings initialized")
    print(f"  Embedding dimension: {len(test_vector)}")
    print(f"  Sample values: {test_vector[:3]}")
except Exception as e:
    print(f"ERROR: Failed to initialize Gemini embeddings: {e}")
    print("Some features will not work without valid API key")

In [None]:
# 1.7 Create Initial Chunks for Demonstrations
from tokenizers import Tokenizer

# Initialize tokenizer
tokenizer = Tokenizer.from_pretrained("gpt2")

# Create chunks using RecursiveChunker
chunker = RecursiveChunker(
    tokenizer=tokenizer,
    chunk_size=512,
    chunk_overlap=128
)

# Chunk both documents
tech_chunks = chunker.chunk(technical_doc_text)
research_chunks = chunker.chunk(research_paper_text)

print(f"Created {len(tech_chunks)} chunks from technical doc")
print(f"Created {len(research_chunks)} chunks from research paper")
print(f"\nTotal chunks for demonstrations: {len(tech_chunks) + len(research_chunks)}")

## Section 2: Refineries - Enhancing Chunks

### 2.1 Understanding Refineries

**Refineries** post-process chunks to add context or metadata. Think of them as enhancement layers that make chunks more useful for RAG.

**Two Types:**
1. **OverlapRefinery** - Adds overlapping context from adjacent chunks
2. **EmbeddingsRefinery** - Attaches vector embeddings to chunk objects

**Pipeline Position:**
```
Text → Chunker → Refinery → Porter/Handshake
```

**Why Refine?**
- Prevent information loss at chunk boundaries
- Pre-compute embeddings for efficiency
- Enrich chunks with additional metadata

### 2.2 OverlapRefinery - Adding Contextual Overlap

**Purpose:** Add overlapping text from adjacent chunks to preserve context at boundaries

**Use Cases:**
- Narratives or stories where flow matters
- Dialogues with speaker attribution
- Technical docs with cross-references

**How it works:**
- Takes chunks and adds context from previous/next chunks
- Configurable overlap size
- Improves embedding quality at chunk edges
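
The steps above can be illustrated with a minimal sketch in plain Python. It is not Chonkie's implementation: the helper name `add_suffix_overlap` is hypothetical, whitespace-separated words stand in for tokenizer tokens, and only context from the previous chunk is added.

```python
from typing import List


def add_suffix_overlap(chunks: List[str], context_size: int) -> List[str]:
    """Prepend the last `context_size` words of the previous chunk to each chunk.

    Whitespace-separated words are an assumed stand-in for real tokens.
    """
    refined = []
    for i, text in enumerate(chunks):
        if i == 0 or context_size <= 0:
            refined.append(text)  # The first chunk has no predecessor
            continue
        # Take the context from the ORIGINAL previous chunk, not the refined one
        context = " ".join(chunks[i - 1].split()[-context_size:])
        refined.append(f"{context} {text}")
    return refined


chunks = ["alpha beta gamma", "delta epsilon", "zeta"]
refined = add_suffix_overlap(chunks, context_size=2)
print(refined)
```

Each refined chunk after the first now starts with the tail of its predecessor, which is why refined chunks are longer than the originals.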

In [None]:
# Initialize OverlapRefinery
overlap_refinery = OverlapRefinery(
    tokenizer="gpt2",  # Optional: specify tokenizer
    context_size=128   # Overlap size in tokens
)

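# Refine a deep copy so the original chunks stay unchanged for the comparison below
# (a refinery may modify chunk objects in place)
import copy
refined_chunks = overlap_refinery(copy.deepcopy(tech_chunks[:5]))  # Use first 5 for demo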

print(f"Refined {len(refined_chunks)} chunks with overlap context")
print("\n" + "="*80)

In [None]:
# Compare before and after
print("COMPARISON: Original vs Refined Chunks\n")

for i in range(min(2, len(refined_chunks))):
    print(f"\n--- Chunk {i+1} ---")
    print(f"\nOriginal (no context):")
    print(f"Text: {tech_chunks[i].text[:200]}...")
    print(f"Length: {len(tech_chunks[i].text)} chars")
    
    print(f"\nRefined (with overlap context):")
    print(f"Text: {refined_chunks[i].text[:200]}...")
    print(f"Length: {len(refined_chunks[i].text)} chars")
    
    # Check for context attribute
    if hasattr(refined_chunks[i], 'context'):
        print(f"Added context: {refined_chunks[i].context[:150]}...")
    
    print("-" * 80)

In [None]:
# Visualize size increase
original_sizes = [len(chunk.text) for chunk in tech_chunks[:5]]
refined_sizes = [len(chunk.text) for chunk in refined_chunks]

fig, ax = plt.subplots(figsize=(10, 5))
x = np.arange(len(original_sizes))
width = 0.35

ax.bar(x - width/2, original_sizes, width, label='Original', alpha=0.8)
ax.bar(x + width/2, refined_sizes, width, label='With Overlap', alpha=0.8)

ax.set_xlabel('Chunk Index')
ax.set_ylabel('Character Count')
ax.set_title('Chunk Size: Original vs Refined with Overlap')
ax.set_xticks(x)
ax.legend()
plt.tight_layout()
plt.show()

print(f"\nAverage size increase: {np.mean(refined_sizes) - np.mean(original_sizes):.0f} characters")

### 2.3 EmbeddingsRefinery - Pre-Computing Embeddings

**Purpose:** Attach vector embeddings directly to chunk objects

**Use Cases:**
- Batch embedding for multiple vector databases
- Offline embedding computation
- Caching embeddings for reuse

**Benefits:**
- Compute once, use many times
- Separate embedding from storage
- Enable custom embedding workflows
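
To make "compute once, use many times" concrete, here is a small caching sketch. It assumes a hypothetical `EmbeddingCache` class and a `fake_embed` function standing in for a real model call; neither is part of Chonkie.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize embeddings by a hash of the chunk text."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # Number of times the underlying model was called

    def embed(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding model call."""
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
vectors = [cache.embed(t) for t in ["first chunk", "second chunk", "first chunk"]]
print(cache.misses)  # The repeated text is served from the cache
```

With a paid embedding API, each cache hit is one request you do not pay for, which matters when the same chunks feed several vector databases.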

In [None]:
# Initialize EmbeddingsRefinery
embeddings_refinery = EmbeddingsRefinery(
    embedding_model=embeddings  # Use our Gemini embeddings
)

# Refine chunks with embeddings (use subset to save API calls)
sample_chunks = tech_chunks[:3]
embedded_chunks = embeddings_refinery(sample_chunks)

print(f"✓ Added embeddings to {len(embedded_chunks)} chunks")
print("\nNote: This makes API calls to Gemini - processing limited sample")

In [None]:
# Inspect embedded chunks
for i, chunk in enumerate(embedded_chunks[:2]):
    print(f"\n--- Embedded Chunk {i+1} ---")
    print(f"Text: {chunk.text[:150]}...")
    
    if hasattr(chunk, 'embedding'):
        print(f"\n✓ Embedding attached!")
        print(f"  Dimension: {len(chunk.embedding)}")
        print(f"  Sample values: {chunk.embedding[:5]}")
        print(f"  Type: {type(chunk.embedding)}")
    else:
        print("No embedding attribute found")
    
    print("-" * 80)

In [None]:
# Memory footprint analysis
import sys

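# Estimate memory for chunk with embedding
if embedded_chunks and hasattr(embedded_chunks[0], 'embedding'):
    text_size = sys.getsizeof(embedded_chunks[0].text)
    # sys.getsizeof() on a plain list counts only the container, not the floats,
    # so measure the underlying array buffer instead
    embedding_size = np.asarray(embedded_chunks[0].embedding).nbytes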
    
    print("Memory Footprint per Chunk:")
    print(f"  Text: ~{text_size} bytes")
    print(f"  Embedding: ~{embedding_size} bytes")
    print(f"  Total: ~{text_size + embedding_size} bytes")
    print(f"\nFor 1000 chunks: ~{(text_size + embedding_size) * 1000 / 1024 / 1024:.2f} MB")

### 2.4 Chaining Refineries with Pipeline

**The Power of Pipelines:**
Chonkie's `Pipeline` API lets you chain chunkers and refineries elegantly:

```python
Pipeline()
  .chunk_with("recursive", chunk_size=512)
  .refine_with("overlap", context_size=128)
  .refine_with("embeddings", embedding_model=embeddings)
```
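
The chaining style works because every builder method returns the pipeline object itself. The toy `MiniPipeline` class below is a hypothetical illustration of that pattern, not Chonkie's `Pipeline`; it chunks by whitespace words and accepts any function as a refinement step.

```python
from typing import Any, Callable, List


class MiniPipeline:
    """Toy fluent pipeline: each method registers a step and returns self."""

    def __init__(self) -> None:
        self._steps: List[Callable[[Any], Any]] = []

    def chunk_with(self, chunk_size: int) -> "MiniPipeline":
        def chunk(text: str) -> List[str]:
            words = text.split()
            return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
        self._steps.append(chunk)
        return self  # Returning self is what enables method chaining

    def refine_with(self, fn: Callable[[List[str]], List[str]]) -> "MiniPipeline":
        self._steps.append(fn)
        return self

    def run(self, text: str) -> Any:
        data: Any = text
        for step in self._steps:  # Steps run in the order they were registered
            data = step(data)
        return data


result = (
    MiniPipeline()
    .chunk_with(chunk_size=2)
    .refine_with(lambda chunks: [c.upper() for c in chunks])
    .run("one two three four five")
)
print(result)
```

Nothing executes until `run()` is called, so a configured pipeline can be reused across many documents.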

In [None]:
# Create a complete pipeline
pipeline = (
    Pipeline()
    .chunk_with("recursive", tokenizer="gpt2", chunk_size=512)
    .refine_with("overlap", context_size=128)
    .refine_with("embeddings", embedding_model=embeddings)
)

print("✓ Pipeline created with:")
print("  1. RecursiveChunker (512 tokens)")
print("  2. OverlapRefinery (128 token overlap)")
print("  3. EmbeddingsRefinery (Gemini embeddings)")

In [None]:
# Run pipeline on sample text (limited for API costs)
sample_text = research_paper_text[:2000]  # First 2000 chars

print("Running pipeline...")
start_time = time.time()

pipeline_result = pipeline.run(texts=sample_text)

elapsed = time.time() - start_time
print(f"\n✓ Pipeline completed in {elapsed:.2f} seconds")
print(f"✓ Produced {len(pipeline_result)} fully processed chunks")
print("  - Chunked")
print("  - Overlap context added")
print("  - Embeddings computed")

#### Exercise: Build Your Own Pipeline

Try creating a pipeline with different settings:
- Different chunk sizes (256, 1024)
- Different overlap sizes (64, 256)
- With or without embeddings

Compare the results!

In [None]:
# Your custom pipeline here
# Example:
# my_pipeline = (
#     Pipeline()
#     .chunk_with("token", chunk_size=256)
#     .refine_with("overlap", context_size=64)
# )
# result = my_pipeline.run(texts=sample_text)

## Section 3: Porters - Exporting Chunks

### 3.1 Understanding Porters

**Porters** export chunks to various formats and destinations. They're the final step before storage or sharing.

**Available Porters:**
1. **JSONPorter** - Export to JSON files
2. **DatasetsPorter** - Export to HuggingFace Datasets

**When to Use:**
- **JSONPorter**: Backup, debugging, version control, simple integrations
- **DatasetsPorter**: ML workflows, data sharing, HuggingFace Hub uploads

**Common Workflow:**
```
Chunk → Refine → Export (Porter) → Store/Share
```

### 3.2 JSONPorter - Exporting to JSON

**Purpose:** Save chunks as structured JSON files

**Benefits:**
- Human-readable format
- Universal compatibility
- Easy to inspect and debug
- Version control friendly

**Output Structure:**
```json
[
  {
    "text": "chunk content...",
    "token_count": 487,
    "chunk_id": "chunk_0",
    "embedding": [0.123, ...],
    "metadata": {...}
  }
]
```
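
For comparison, a structure like this can be written and read back with only the standard library. The sketch below uses hypothetical records and a temporary file; it does not show what `JSONPorter` does internally.

```python
import json
import os
import tempfile

# Hypothetical chunk records shaped like the structure above
records = [
    {"text": "chunk content", "token_count": 2, "chunk_id": "chunk_0", "embedding": [0.123, 0.456]},
    {"text": "more content", "token_count": 2, "chunk_id": "chunk_1", "embedding": [0.789, 0.012]},
]

path = os.path.join(tempfile.mkdtemp(), "chunks.json")
with open(path, "w", encoding="utf-8") as f:
    # json cannot serialize NumPy arrays; convert them with .tolist() first
    json.dump(records, f, indent=2)

with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == records)
```

Embeddings dominate the file size, so a backup of many embedded chunks grows quickly in this text-based format.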

In [None]:
# Initialize JSONPorter
json_porter = JSONPorter()

# Export embedded chunks to JSON
output_path = "../data/sample_chunks_output.json"

try:
    json_porter.write(
        chunks=embedded_chunks,
        filepath=output_path
    )
    print(f"✓ Exported {len(embedded_chunks)} chunks to {output_path}")
except Exception as e:
    print(f"Error exporting to JSON: {e}")

In [None]:
# Load and inspect exported JSON
try:
    with open(output_path, "r") as f:
        exported_data = json.load(f)
    
    print(f"✓ Loaded {len(exported_data)} chunks from JSON")
    print(f"\nFile size: {os.path.getsize(output_path):,} bytes")
    
    # Inspect structure
    if exported_data:
        print(f"\nChunk structure (keys): {list(exported_data[0].keys())}")
        print(f"\nFirst chunk preview:")
        print(json.dumps(exported_data[0], indent=2)[:500] + "...")
except FileNotFoundError:
    print(f"File not found: {output_path}")

### 3.3 DatasetsPorter - Exporting to HuggingFace Datasets

**Purpose:** Convert chunks to HuggingFace Dataset format

**Benefits:**
- Rich dataset features (filtering, mapping, batching)
- Direct upload to HuggingFace Hub
- Integration with transformers ecosystem
- Efficient storage and loading

**Use Cases:**
- Training embedding models
- Sharing preprocessed data
- Research and collaboration
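
HuggingFace Datasets are column-oriented, so per-chunk records have to be pivoted into one list per field. The sketch below shows that pivot in plain Python with a hypothetical helper `rows_to_columns`; it is an illustration of the data shape, not the porter's actual code.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot a list of per-chunk dicts into one list per field."""
    if not rows:
        return {}
    # Use the keys of the first row as the schema; missing values become None
    return {key: [row.get(key) for row in rows] for key in rows[0]}


rows = [
    {"text": "first chunk", "token_count": 2},
    {"text": "second chunk", "token_count": 2},
]
columns = rows_to_columns(rows)
print(columns)
# With the `datasets` library installed, this mapping could be passed to
# datasets.Dataset.from_dict(columns)
```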

In [None]:
# Initialize DatasetsPorter
try:
    datasets_porter = DatasetsPorter()
    
    # Export to HuggingFace Dataset
    dataset = datasets_porter.write(chunks=embedded_chunks)
    
    print(f"✓ Created HuggingFace Dataset with {len(dataset)} rows")
except Exception as e:
    print(f"Error creating dataset: {e}")
    print("Make sure datasets library is installed: pip install datasets")

In [None]:
# Inspect dataset
if 'dataset' in locals():
    print("Dataset Information:")
    print(f"  Features: {dataset.features}")
    print(f"  Number of rows: {len(dataset)}")
    print(f"  Number of columns: {len(dataset.column_names)}")
    print(f"\nColumn names: {dataset.column_names}")
    
    # Show first row
    print(f"\nFirst row:")
    print(dataset[0])

In [None]:
# Optional: Push to HuggingFace Hub (commented out)
# Uncomment and configure to upload your processed chunks

# from huggingface_hub import login
# login()  # Login to HuggingFace
# 
# dataset.push_to_hub(
#     repo_id="your-username/chonkie-tutorial-chunks",
#     private=True  # Set to False to make public
# )

print("Tip: You can push datasets to HuggingFace Hub for sharing!")
print("See: dataset.push_to_hub('username/dataset-name')")

### 3.4 Porter Comparison

Let's compare both porters:

In [None]:
# Create comparison table
comparison_data = {
    'Feature': ['Format', 'File Size', 'Portability', 'HuggingFace Hub', 'ML Integration', 'Human Readable', 'Best For'],
    'JSONPorter': ['JSON', 'Larger', 'Universal', 'No', 'Basic', 'Yes', 'Debugging, backup'],
    'DatasetsPorter': ['HF Dataset', 'Smaller*', 'HF ecosystem', 'Yes', 'Excellent', 'Partial', 'ML workflows, sharing']
}

comparison_df = pd.DataFrame(comparison_data)
print("\nPorter Comparison:")
print("=" * 80)
print(comparison_df.to_string(index=False))
print("\n*Stored in the binary Arrow format; actual size depends on the data")

#### Exercise: Export and Compare

Export your chunks with both porters and compare:
1. File sizes
2. Loading speed
3. Ease of inspection

Which would you use for your project?

## Section 4: Handshakes - Vector Database Integration

### 4.1 Understanding Handshakes

**Handshakes** provide a unified interface to connect Chonkie chunks with vector databases.

**What They Do:**
- Embed chunks (if not already embedded)
- Store in vector database
- Provide search interface
- Handle batch operations

**Supported Databases:**
- ChromaDB (local/persistent)
- Pinecone (managed cloud)
- Qdrant, Weaviate, Milvus
- MongoDB, Elasticsearch
- Pgvector, Turbopuffer

**Architecture:**
```
Text → Chunker → Refinery → Handshake → Vector DB
                                ↓
                         (Embed + Store)
```
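
The embed-and-store step can be pictured with a toy in-memory store. Everything here is an assumption made for illustration: `ToyHandshake` is hypothetical, the `embed` function just counts a few keywords instead of calling a model, and search is a brute-force cosine ranking.

```python
import math
from typing import Dict, List, Tuple


def embed(text: str) -> List[float]:
    """Hypothetical toy embedding: counts of a few keywords (not a real model)."""
    vocab = ["chunk", "vector", "database", "overlap"]
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyHandshake:
    """In-memory stand-in for a handshake: embeds on write, ranks on search."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, List[float]]] = []

    def write(self, texts: List[str]) -> None:
        self._rows.extend((t, embed(t)) for t in texts)

    def search(self, query: str, limit: int = 3) -> List[Dict[str, object]]:
        q = embed(query)
        scored = [{"text": t, "score": cosine(q, v)} for t, v in self._rows]
        return sorted(scored, key=lambda r: r["score"], reverse=True)[:limit]


store = ToyHandshake()
store.write(["overlap keeps context", "a vector database stores vectors", "chunk by chunk"])
top = store.search("which database holds each vector", limit=1)
print(top[0]["text"])
```

A real vector database replaces the brute-force scan with an approximate nearest-neighbor index, but the write and search roles stay the same.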

### 4.2 ChromaDB Handshake - Local Development

**Purpose:** Fast, local vector database for development and prototyping

**Benefits:**
- No API keys required
- In-memory or persistent
- Easy to set up
- Great for testing

**Limitations:**
- Not for production scale
- Single machine only
- Limited advanced features

In [None]:
# Initialize ChromaDB Handshake
try:
    chroma_handshake = ChromaHandshake(
        collection_name="chonkie_tutorial",
        embedding_model=embeddings,
        persist_directory="./chroma_db"  # Optional: persist to disk
    )
    
    print("✓ ChromaDB Handshake initialized")
    print(f"  Collection: chonkie_tutorial")
    print(f"  Persist dir: ./chroma_db")
except Exception as e:
    print(f"Error initializing ChromaDB: {e}")
    print("Install with: pip install chromadb")

In [None]:
# Write chunks to ChromaDB
if 'chroma_handshake' in locals():
    try:
        # Use our embedded chunks
        chroma_handshake.write(embedded_chunks)
        print(f"✓ Wrote {len(embedded_chunks)} chunks to ChromaDB")
    except Exception as e:
        print(f"Error writing to ChromaDB: {e}")

In [None]:
# Search ChromaDB
if 'chroma_handshake' in locals():
    try:
        query = "What are the best practices for chunking in RAG?"
        results = chroma_handshake.search(
            query=query,
            limit=3
        )
        
        print(f"Query: {query}")
        print(f"\nTop {len(results)} Results:\n")
        print("=" * 80)
        
        for i, result in enumerate(results):
            print(f"\nResult {i+1}:")
            print(f"Score: {result.get('score', 'N/A')}")
            print(f"Text: {result['text'][:200]}...")
            print("-" * 80)
    except Exception as e:
        print(f"Error searching ChromaDB: {e}")

### 4.3 Pinecone Handshake - Production Scale

**Purpose:** Managed, cloud-native vector database for production

**Benefits:**
- Fully managed (no infrastructure)
- Scales to billions of vectors
- Low query latency (varies with index size, region, and plan)
- Advanced features (namespaces, metadata filtering)
- Free tier available

**Perfect For:**
- Production RAG applications
- Large-scale systems
- Multi-tenant applications

In [None]:
# Initialize Pinecone
# Note: pinecone.init() was removed in version 3 of the Pinecone Python client;
# the client-object API below assumes version 3 or later.
try:
    import pinecone
    from pinecone import Pinecone, ServerlessSpec

    # Initialize Pinecone client
    pc = Pinecone(api_key=PINECONE_API_KEY)

    print("✓ Pinecone initialized")
except Exception as e:
    print(f"Error initializing Pinecone: {e}")
    print("Make sure PINECONE_API_KEY is set in .env")

In [None]:
# Create Pinecone index (one-time setup)
if 'pc' in globals():
    try:
        index_name = "chonkie-tutorial"

        # Check if index exists
        if index_name not in pc.list_indexes().names():
            print(f"Creating index '{index_name}'...")
            pc.create_index(
                name=index_name,
                dimension=len(test_vector),  # Must match the embedding dimension measured in step 1.6
                metric="cosine",
                spec=ServerlessSpec(cloud="aws", region="us-east-1")  # Assumed region; change to yours
            )
            print("✓ Index created")
        else:
            print(f"✓ Index '{index_name}' already exists")

        # Get index stats
        index = pc.Index(index_name)
        stats = index.describe_index_stats()
        print(f"\nIndex stats:")
        print(f"  Total vectors: {stats.total_vector_count}")
        print(f"  Dimension: {stats.dimension}")
    except Exception as e:
        print(f"Error with Pinecone index: {e}")

In [None]:
# Initialize Pinecone Handshake
if 'pinecone' in dir() and 'index_name' in locals():
    try:
        pinecone_handshake = PineconeHandshake(
            index_name=index_name,
            embedding_model=embeddings,
            namespace="tutorial_docs"  # Optional namespace
        )
        
        print("✓ Pinecone Handshake initialized")
        print(f"  Index: {index_name}")
        print(f"  Namespace: tutorial_docs")
    except Exception as e:
        print(f"Error initializing Pinecone Handshake: {e}")

In [None]:
# Write chunks to Pinecone
if 'pinecone_handshake' in locals():
    try:
        print("Writing chunks to Pinecone...")
        pinecone_handshake.write(
            chunks=embedded_chunks,
            batch_size=100  # Upload in batches
        )
        print(f"✓ Wrote {len(embedded_chunks)} chunks to Pinecone")
    except Exception as e:
        print(f"Error writing to Pinecone: {e}")

In [None]:
# Search Pinecone with metadata filtering
if 'pinecone_handshake' in locals():
    try:
        query = "semantic chunking strategies"
        
        results = pinecone_handshake.search(
            query=query,
            limit=5,
            # filter={"source": "research_paper"}  # Optional metadata filter
        )
        
        print(f"Query: '{query}'")
        print(f"\nTop {len(results)} Results from Pinecone:\n")
        print("=" * 80)
        
        for i, result in enumerate(results):
            print(f"\nResult {i+1}:")
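            # Formatting the 'N/A' fallback with :.3f would raise a ValueError
            score = result.get('score')
            print(f"Score: {score:.3f}" if isinstance(score, (int, float)) else "Score: N/A")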
            print(f"Text: {result['text'][:200]}...")
            if 'metadata' in result:
                print(f"Metadata: {result['metadata']}")
            print("-" * 80)
    except Exception as e:
        print(f"Error searching Pinecone: {e}")

### 4.4 Complete RAG Workflow Example

Let's build an end-to-end RAG system combining everything we've learned:

In [None]:
# Complete RAG Pipeline Function
def build_rag_system(documents, vector_db="pinecone"):
    """
    Build complete RAG system from documents.
    
    Args:
        documents: List of text documents
        vector_db: "chroma" or "pinecone"
    
    Returns:
        Handshake object for querying
    """
    print("Building RAG System...\n")
    
    # Step 1: Create pipeline
    print("1. Creating processing pipeline...")
    pipeline = (
        Pipeline()
        .chunk_with("recursive", chunk_size=512)
        .refine_with("overlap", context_size=128)
        .refine_with("embeddings", embedding_model=embeddings)
    )
    print("   ✓ Pipeline ready\n")
    
    # Step 2: Process documents
    print("2. Processing documents...")
    all_chunks = []
    for i, doc in enumerate(documents):
        chunks = pipeline.run(texts=doc)
        all_chunks.extend(chunks)
        print(f"   ✓ Document {i+1}: {len(chunks)} chunks")
    print(f"   Total: {len(all_chunks)} chunks\n")
    
    # Step 3: Export backup
    print("3. Exporting backup to JSON...")
    json_porter = JSONPorter()
    json_porter.write(all_chunks, "../data/rag_system_backup.json")
    print("   ✓ Backup saved\n")
    
    # Step 4: Store in vector DB
    print(f"4. Storing in {vector_db}...")
    if vector_db == "chroma":
        handshake = ChromaHandshake(
            collection_name="rag_system",
            embedding_model=embeddings
        )
    else:  # pinecone
        handshake = PineconeHandshake(
            index_name=index_name,
            embedding_model=embeddings,
            namespace="rag_system"
        )
    
    handshake.write(all_chunks)
    print(f"   ✓ {len(all_chunks)} chunks stored\n")
    
    print("✓ RAG System ready!")
    return handshake

print("RAG system builder function ready!")

In [None]:
# Build RAG system (using limited sample to save API costs)
sample_docs = [
    technical_doc_text[:1500],
    research_paper_text[:1500]
]

# Choose vector DB
rag_handshake = build_rag_system(sample_docs, vector_db="chroma")  # Use "pinecone" if preferred

In [None]:
# Query the RAG system
def rag_query(handshake, question, k=3):
    """
    Query the RAG system.
    
    Args:
        handshake: Vector DB handshake
        question: User query
        k: Number of chunks to retrieve
    
    Returns:
        Retrieved context and answer
    """
    print(f"Question: {question}\n")
    
    # Retrieve relevant chunks
    results = handshake.search(query=question, limit=k)
    
    # Combine context
    context = "\n\n".join([r['text'] for r in results])
    
    print(f"Retrieved {len(results)} relevant chunks:\n")
    for i, r in enumerate(results):
        print(f"Chunk {i+1} (score: {r.get('score', 'N/A')})")
        print(f"{r['text'][:150]}...\n")
    
    # In production, you'd send context to LLM for answer generation
    answer = f"Based on {len(results)} retrieved chunks, the answer would be generated by an LLM using this context."
    
    return context, answer

# Test query
context, answer = rag_query(
    rag_handshake,
    "How does recursive chunking work?",
    k=3
)

print("\nNext step: Send context to LLM (GPT-4, Gemini, etc.) for answer generation")

### 4.5 Handshakes Comparison

Here's how different vector databases compare:

In [None]:
# Create comparison table
handshake_comparison = {
    'Database': ['ChromaDB', 'Pinecone', 'Qdrant', 'Weaviate'],
    'Deployment': ['Local/Self-hosted', 'Managed Cloud', 'Both', 'Both'],
    'Best For': ['Development', 'Production', 'Balance', 'GraphQL/Hybrid'],
    'API Required': ['No', 'Yes', 'Optional', 'Optional'],
    'Cost': ['Free', 'Free tier + Paid', 'Free + Paid', 'Free + Paid'],
    'Scale': ['Medium', 'Billions', 'Billions', 'Large'],
    'Setup Difficulty': ['Easy', 'Easy', 'Medium', 'Medium']
}

comparison_df = pd.DataFrame(handshake_comparison)
print("\nVector Database Comparison:")
print("=" * 100)
print(comparison_df.to_string(index=False))

## Section 5: Best Practices & Integration

### 5.1 When to Use Each Component

**Decision Matrix:**

In [None]:
# Create decision guide
decision_guide = {
    'Scenario': [
        'Story/narrative chunking',
        'Batch embedding job',
        'Data backup needed',
        'Share with team',
        'Local development',
        'Production RAG',
        'Self-hosted required',
        'ML model training'
    ],
    'Recommended': [
        'OverlapRefinery',
        'EmbeddingsRefinery',
        'JSONPorter',
        'DatasetsPorter',
        'ChromaDB Handshake',
        'Pinecone Handshake',
        'Qdrant/Weaviate',
        'DatasetsPorter'
    ],
    'Why': [
        'Preserves context at boundaries',
        'Compute once, reuse many times',
        'Human-readable, version control',
        'HuggingFace Hub integration',
        'No setup, no API keys',
        'Scalable, managed, fast',
        'Control, privacy, customization',
        'HF ecosystem integration'
    ]
}

guide_df = pd.DataFrame(decision_guide)
print("\nDecision Guide: What to Use When")
print("=" * 100)
print(guide_df.to_string(index=False))

### 5.2 Production Pipeline Pattern

Here's a robust production pattern:

In [None]:
def production_pipeline(documents, config):
    """
    Production-ready RAG pipeline with error handling.
    
    Args:
        documents: List of text documents
        config: Configuration dict with settings
    
    Returns:
        dict with results and statistics
    """
    results = {
        'chunks_processed': 0,
        'errors': [],
        'timing': {}
    }
    
    try:
        # 1. Chunk with appropriate strategy
        start = time.time()
        chunker = RecursiveChunker(
            chunk_size=config.get('chunk_size', 512),
            chunk_overlap=config.get('overlap', 128)
        )
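        # chunk_batch() returns one list of chunks per document, so flatten it
        chunk_lists = chunker.chunk_batch(documents)
        chunks = [chunk for doc_chunks in chunk_lists for chunk in doc_chunks]
        results['timing']['chunking'] = time.time() - start
        results['chunks_processed'] = len(chunks)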
        
        # 2. Refine for quality (if enabled)
        if config.get('use_overlap_refinery', True):
            start = time.time()
            overlap_refinery = OverlapRefinery(
                context_size=config.get('context_size', 128)
            )
            chunks = overlap_refinery(chunks)
            results['timing']['overlap_refinery'] = time.time() - start
        
        # 3. Add embeddings (wrap this call in retry logic for production use)
        if config.get('use_embeddings', True):
            start = time.time()
            embeddings_refinery = EmbeddingsRefinery(
                embedding_model=config['embedding_model']
            )
            chunks = embeddings_refinery(chunks)
            results['timing']['embeddings'] = time.time() - start
        
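        # 4. Export backup (consider running this asynchronously in production)
        if config.get('export_backup', True):
            start = time.time()
            json_porter = JSONPorter()
            backup_path = config.get('backup_path', 'backup/chunks.json')
            # Create the target directory if it does not exist yet
            os.makedirs(os.path.dirname(backup_path) or '.', exist_ok=True)
            json_porter.write(chunks, backup_path)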
            results['timing']['backup'] = time.time() - start
        
        # 5. Store in vector DB (with batch size limit)
        start = time.time()
        vector_db = config.get('vector_db', 'pinecone')
        
        if vector_db == 'pinecone':
            handshake = PineconeHandshake(
                index_name=config['index_name'],
                embedding_model=config['embedding_model']
            )
        else:  # chroma
            handshake = ChromaHandshake(
                collection_name=config.get('collection_name', 'default'),
                embedding_model=config['embedding_model']
            )
        
        handshake.write(chunks, batch_size=config.get('batch_size', 100))
        results['timing']['vector_db'] = time.time() - start
        
        results['status'] = 'success'
        results['handshake'] = handshake
        
    except Exception as e:
        results['status'] = 'error'
        results['errors'].append(str(e))
    
    return results

print("✓ Production pipeline function ready")
print("\nFeatures:")
print("  - Error handling")
print("  - Configurable components")
print("  - Performance timing")
print("  - Batch operations")
print("  - Backup integration")

### 5.3 Performance Considerations

**Key Optimizations:**

1. **Batch Processing**: Use `chunk_batch()` for multiple documents
2. **Memory Management**: Process large corpora in batches rather than loading everything at once
3. **API Costs**: Optimize chunk sizes to minimize embedding API calls
4. **Database Costs**: Monitor vector DB usage and index size
5. **Caching**: Reuse embeddings when possible
6. **Parallel Processing**: Use async/threading for large batches
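
Batch processing and memory management both come down to handling a fixed number of items at a time. The helper below is a minimal sketch of that idea; the name `batched` and the sample document IDs are illustrative only.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


documents = [f"doc-{i}" for i in range(7)]
batches = list(batched(documents, batch_size=3))
print([len(b) for b in batches])
```

Each batch can then be chunked, embedded, and written before the next one is loaded, which keeps peak memory roughly proportional to the batch size.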

**Cost Estimation:**

In [None]:
def estimate_costs(num_documents, avg_tokens_per_doc, chunk_size=512):
    """
    Estimate embedding and storage costs.
    
    Args:
        num_documents: Number of documents to process
        avg_tokens_per_doc: Average tokens per document
        chunk_size: Chunk size in tokens
    """
    # Calculate chunks (rounded up; ignores any overlap between chunks)
    total_tokens = num_documents * avg_tokens_per_doc
    num_chunks = int(np.ceil(total_tokens / chunk_size))
    
    # Embedding costs (example: Gemini pricing)
    embedding_cost_per_1m = 0.025  # $0.025 per 1M tokens (example)
    embedding_cost = (total_tokens / 1_000_000) * embedding_cost_per_1m
    
    # Storage costs (example: Pinecone pricing)
    vectors_per_pod = 1_000_000
    cost_per_pod = 70  # $70/month (example)
    # Pods are whole units, so round up
    pods_needed = max(1, int(np.ceil(num_chunks / vectors_per_pod)))
    storage_cost_monthly = pods_needed * cost_per_pod
    
    print("Cost Estimation:")
    print("=" * 60)
    print(f"Documents: {num_documents:,}")
    print(f"Total tokens: {total_tokens:,}")
    print(f"Estimated chunks: {num_chunks:,.0f}")
    print(f"\nEmbedding cost (one-time): ${embedding_cost:.2f}")
    print(f"Vector DB cost (monthly): ${storage_cost_monthly:.2f}")
    print(f"Pods needed: {pods_needed:.1f}")

# Example estimation
estimate_costs(
    num_documents=10000,
    avg_tokens_per_doc=2000,
    chunk_size=512
)

## Section 6: Exercises & Next Steps

### Exercises

Try these hands-on exercises to solidify your learning:

1. **Custom Pipeline Builder**
   - Build a pipeline with your preferred chunker, refineries, and porter
   - Process your own documents
   - Compare results with different configurations

2. **Handshake Comparison**
   - Index same data in ChromaDB and Pinecone
   - Query both with identical questions
   - Compare retrieval quality and speed

3. **Cost Optimization**
   - Experiment with different chunk sizes (256, 512, 1024)
   - Calculate embedding costs for each
   - Find optimal balance of quality vs cost

4. **Production Setup**
   - Design complete RAG architecture for a use case
   - Choose chunker, refineries, and vector DB
   - Document rationale for each choice
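
As a starting point for the cost-optimization exercise, the sketch below counts sliding-window chunks when consecutive chunks share tokens. The function `chunks_with_overlap`, the 20,000-token corpus, and the 25% overlap ratio are assumptions chosen for illustration; real chunkers split on text boundaries, so actual counts will differ.

```python
import math


def chunks_with_overlap(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Count sliding-window chunks when consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= chunk_size:
        return 1 if total_tokens > 0 else 0
    stride = chunk_size - overlap  # New tokens contributed by each extra chunk
    return math.ceil((total_tokens - overlap) / stride)


for size in (256, 512, 1024):
    n = chunks_with_overlap(total_tokens=20_000, chunk_size=size, overlap=size // 4)
    # chunk size, chunk count, upper bound on tokens sent to the embedding API
    print(size, n, n * size)
```

Overlap means the embedding API sees more tokens than the corpus contains, so the product of chunk count and chunk size is the figure to plug into a cost estimate.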

In [None]:
# Exercise 1: Your custom pipeline
# Build and test your own pipeline configuration

# my_pipeline = (
#     Pipeline()
#     .chunk_with("...", ...)
#     .refine_with("...", ...)
# )
# my_result = my_pipeline.run(texts=...)

In [None]:
# Exercise 2: Compare handshakes
# Query both ChromaDB and Pinecone, compare results

### Key Takeaways

1. **Refineries enhance chunks** - Add overlap for context, pre-compute embeddings
2. **Porters enable data portability** - JSON for backup/debugging, Datasets for ML
3. **Handshakes unify vector DB access** - Consistent API across 9+ databases
4. **Pipeline composition is powerful** - Chain components for complete workflows
5. **Choose tools based on needs** - Development vs production, cost vs quality

### Resources

- [Chonkie Documentation](https://docs.chonkie.ai/)
- [Chonkie GitHub](https://github.com/chonkie-inc/chonkie)
- [Pinecone Docs](https://docs.pinecone.io/)
- [ChromaDB Docs](https://docs.trychroma.com/)
- [HuggingFace Datasets](https://huggingface.co/docs/datasets/)

### Next Steps

1. **Integrate with your RAG application**
   - Replace existing chunking logic
   - Add refineries for better quality
   - Switch to production vector DB

2. **Experiment with vector databases**
   - Try Qdrant, Weaviate, or others
   - Compare performance and features
   - Choose best fit for your needs

3. **Build evaluation framework**
   - Test retrieval quality
   - Measure end-to-end performance
   - Optimize based on metrics

4. **Scale to production**
   - Implement error handling
   - Add monitoring and logging
   - Optimize for cost and performance
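
For the evaluation step, two common retrieval metrics are hit rate at k and mean reciprocal rank. The sketch below assumes hypothetical labelled data in which each query has exactly one relevant chunk ID; the IDs and rankings are made up for illustration.

```python
from typing import Dict, List


def hit_rate_at_k(retrieved: Dict[str, List[str]], relevant: Dict[str, str], k: int) -> float:
    """Fraction of queries whose relevant chunk ID appears in the top k results."""
    hits = sum(1 for q, ids in retrieved.items() if relevant[q] in ids[:k])
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: Dict[str, List[str]], relevant: Dict[str, str]) -> float:
    """Average of 1/rank of the relevant chunk (0 when it is not retrieved)."""
    total = 0.0
    for q, ids in retrieved.items():
        if relevant[q] in ids:
            total += 1.0 / (ids.index(relevant[q]) + 1)
    return total / len(retrieved)


# Hypothetical labelled data: query -> ranked chunk IDs, and query -> the one relevant ID
retrieved = {"q1": ["c3", "c1", "c7"], "q2": ["c2", "c9", "c4"], "q3": ["c5", "c6", "c8"]}
relevant = {"q1": "c1", "q2": "c2", "q3": "c0"}

print(hit_rate_at_k(retrieved, relevant, k=1))
print(mean_reciprocal_rank(retrieved, relevant))
```

Running the same labelled queries after each change to chunk size or overlap gives a consistent basis for comparing configurations.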

### Thank You!

You've completed the advanced Chonkie tutorial! You now have the knowledge to build production-ready RAG systems with:

- Advanced chunking strategies
- Context enhancement with refineries
- Data portability with porters
- Scalable vector database integration

Happy building!