# Chonkie Pipelines - Complete Guide

This notebook demonstrates the **Pipeline API** in Chonkie - a fluent interface for building text processing workflows.

## What are Pipelines?

Pipelines provide a chainable API following the **CHOMP architecture** (CHOnkie's Multi-step Pipeline):

**Fetcher → Chef → Chunker → Refinery → Porter/Handshake**

## Key Features:
- ✅ Fluent, chainable API for building workflows
- ✅ Automatic component reordering (follows CHOMP)
- ✅ Single file or directory processing
- ✅ Direct text input (no fetcher needed)
- ✅ Multiple refineries can be chained
- ✅ Export and storage options
- ✅ Recipe-based pipelines from Chonkie Hub

## Visual Overview

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#ff6b6b','primaryTextColor':'#fff','primaryBorderColor':'#c92a2a','lineColor':'#339af0','secondaryColor':'#51cf66','tertiaryColor':'#ffd43b','background':'#f8f9fa','mainBkg':'#e3fafc','secondBkg':'#fff3bf','tertiaryBkg':'#ffe3e3','textColor':'#212529','fontSize':'16px'}}}%%

graph LR
    Start([🚀 Pipeline<br/>CHOMP Architecture]):::startClass
    
    Start --> Fetch["1️⃣ Fetcher<br/>(Optional)"]:::fetchClass
    Start --> TextInput["Direct Text Input"]:::inputClass
    
    Fetch --> Chef["2️⃣ Chef<br/>(Optional)"]:::chefClass
    TextInput --> Chef
    
    Chef --> Chunker["3️⃣ Chunker<br/>(Required)"]:::chunkerClass
    TextInput --> Chunker
    
    Chunker --> Refine["4️⃣ Refinery<br/>(Optional, Chainable)"]:::refineClass
    
    Refine --> Output{Output Options}:::decisionClass
    
    Output -->|Export| Porter["5️⃣ Porter<br/>JSON/Datasets"]:::porterClass
    Output -->|Store| Handshake["5️⃣ Handshake<br/>Vector DB"]:::handshakeClass
    Output -->|Return| Document["Document/List[Document]"]:::docClass
    
    Porter --> Final([✨ Complete]):::finalClass
    Handshake --> Final
    Document --> Final
    
    classDef startClass fill:#4c6ef5,stroke:#364fc7,stroke-width:3px,color:#fff
    classDef fetchClass fill:#7950f2,stroke:#5f3dc4,stroke-width:2px,color:#fff
    classDef inputClass fill:#748ffc,stroke:#4c6ef5,stroke-width:2px,color:#fff
    classDef chefClass fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    classDef chunkerClass fill:#fa5252,stroke:#e03131,stroke-width:3px,color:#fff
    classDef refineClass fill:#20c997,stroke:#087f5b,stroke-width:2px,color:#fff
    classDef decisionClass fill:#ffd43b,stroke:#fab005,stroke-width:2px,color:#333
    classDef porterClass fill:#ff922b,stroke:#e8590c,stroke-width:2px,color:#fff
    classDef handshakeClass fill:#cc5de8,stroke:#9c36b5,stroke-width:2px,color:#fff
    classDef docClass fill:#51cf66,stroke:#37b24d,stroke-width:2px,color:#fff
    classDef finalClass fill:#4c6ef5,stroke:#364fc7,stroke-width:3px,color:#fff
```

## Setup

Import Chonkie and create sample data.

In [1]:
from chonkie import Pipeline
import os
import tempfile

# Create temporary directory for demo files
demo_dir = tempfile.mkdtemp()
print(f"✅ Demo directory created: {demo_dir}")

# Create sample text files
sample_texts = {
    "doc1.txt": "Machine learning is transforming industries worldwide. Deep learning models can recognize complex patterns. Natural language processing enables human-computer interaction.",
    "doc2.txt": "Artificial intelligence represents one of the most significant technological advances. Neural networks mimic human brain structure. Computer vision allows machines to understand visual data.",
    "doc3.md": """# AI Overview

## Machine Learning
Machine learning algorithms learn from data without explicit programming.

## Applications
- Image recognition
- Speech processing
- Text analysis
"""
}

for filename, content in sample_texts.items():
    filepath = os.path.join(demo_dir, filename)
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)
    print(f"  Created: {filename}")

print(f"\n✅ Setup complete! Created {len(sample_texts)} sample files")

✅ Demo directory created: C:\Users\PMACHA~1\AppData\Local\Temp\tmpjlgca46s
  Created: doc1.txt
  Created: doc2.txt
  Created: doc3.md

✅ Setup complete! Created 3 sample files


## Installation

Install Chonkie if it is not already available (uncomment the `pip` line below), then confirm that the imports resolve:

In [2]:
# Install chonkie
# !pip install chonkie

from chonkie import Pipeline, Document

print("✅ Chonkie imported successfully!")
print(f"  Pipeline: {Pipeline}")
print(f"  Document: {Document}")

✅ Chonkie imported successfully!
  Pipeline: <class 'chonkie.pipeline.pipeline.Pipeline'>
  Document: <class 'chonkie.types.document.Document'>


---

# Part 1: Basic Pipeline Usage

## 1. Single File Processing

Process a single file with the simplest pipeline.

In [3]:
# Single file processing
doc1_path = os.path.join(demo_dir, "doc1.txt")

doc = (Pipeline()
    .fetch_from("file", path=doc1_path)
    .process_with("text")
    .chunk_with("recursive", chunk_size=50)
    .run())

print("📝 Single File Processing:\n")
print(f"Document type: {type(doc)}")
print(f"Number of chunks: {len(doc.chunks)}")
print(f"\n📊 Chunks:")
for i, chunk in enumerate(doc.chunks, 1):
    print(f"  {i}. {chunk.text[:60]}... ({chunk.token_count} tokens)")

📝 Single File Processing:

Document type: <class 'chonkie.types.document.Document'>
Number of chunks: 6

📊 Chunks:
  1. Machine learning is transforming industries... (44 tokens)
  2.  worldwide. ... (12 tokens)
  3. Deep learning models can recognize complex... (43 tokens)
  4.  patterns. ... (11 tokens)
  5. Natural language processing enables human-... (42 tokens)
  6. computer interaction.... (21 tokens)


## 2. Directory Processing

Process multiple files from a directory at once.

In [4]:
# Directory processing with extension filter
docs = (Pipeline()
    .fetch_from("file", dir=demo_dir, ext=[".txt"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=50)
    .run())

print("📁 Directory Processing:\n")
print(f"Result type: {type(docs)}")
print(f"Number of documents: {len(docs)}")

for i, doc in enumerate(docs, 1):
    print(f"\n📄 Document {i}:")
    print(f"  Chunks: {len(doc.chunks)}")
    print(f"  First chunk: {doc.chunks[0].text[:50]}...")

📁 Directory Processing:

Result type: <class 'list'>
Number of documents: 2

📄 Document 1:
  Chunks: 6
  First chunk: Machine learning is transforming industries...

📄 Document 2:
  Chunks: 5
  First chunk: Artificial intelligence represents one of the...


## 3. Direct Text Input

Skip the fetcher and provide text directly.

In [5]:
# Single text input (no fetcher needed)
doc = (Pipeline()
    .process_with("text")
    .chunk_with("semantic", threshold=0.8)
    .run(texts="Machine learning enables computers to learn from data. Deep learning uses neural networks. AI is transforming industries."))

print("💬 Direct Text Input (Single):\n")
print(f"Document type: {type(doc)}")
print(f"Number of chunks: {len(doc.chunks)}")
print(f"\n📊 Chunks:")
for i, chunk in enumerate(doc.chunks, 1):
    print(f"  {i}. {chunk.text}")

print("\n" + "="*60 + "\n")

# Multiple texts input
texts = [
    "Python is excellent for data science and machine learning.",
    "JavaScript powers modern web applications and servers.",
    "Rust provides memory safety without garbage collection."
]

docs = (Pipeline()
    .chunk_with("recursive", chunk_size=30)
    .run(texts=texts))

print("💬 Direct Text Input (Multiple):\n")
print(f"Result type: {type(docs)}")
print(f"Number of documents: {len(docs)}")

for i, doc in enumerate(docs, 1):
    print(f"\n📄 Document {i}: {len(doc.chunks)} chunks")

💬 Direct Text Input (Single):

Document type: <class 'chonkie.types.document.Document'>
Number of chunks: 1

📊 Chunks:
  1. Machine learning enables computers to learn from data. Deep learning uses neural networks. AI is transforming industries.


💬 Direct Text Input (Multiple):

Result type: <class 'list'>
Number of documents: 3

📄 Document 1: 3 chunks

📄 Document 2: 2 chunks

📄 Document 3: 2 chunks


---

# Part 2: Pipeline Methods

## 4. fetch_from() - Data Sources

Fetch data from various sources.
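
As a rough mental model of what `dir=` combined with `ext=` selects, the sketch below lists the files in a directory and filters them by extension using only the standard library. `list_files` is a hypothetical helper, not Chonkie's fetcher; whether the real fetcher recurses into subdirectories or sorts its results is not shown in this notebook.

```python
import tempfile
from pathlib import Path


def list_files(directory, ext=None):
    """Return files directly inside `directory`, optionally filtered by extension."""
    wanted = {e.lower() for e in ext} if ext else None
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and (wanted is None or p.suffix.lower() in wanted)
    )


# Demo on a throwaway directory that mirrors the notebook's sample files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("doc1.txt", "doc2.txt", "doc3.md"):
        (Path(tmp) / name).write_text("sample", encoding="utf-8")
    txt_only = [p.name for p in list_files(tmp, ext=[".txt"])]
    everything = [p.name for p in list_files(tmp)]

print(txt_only)
print(everything)
```

Passing no `ext` returns every file, which mirrors the "All Files in Directory" case in the cell that follows.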

In [6]:
print("📥 Fetcher Examples:\n")

# Example 1: Single file
print("1️⃣ Single File:")
pipeline1 = Pipeline().fetch_from("file", path=os.path.join(demo_dir, "doc1.txt"))
print(f"   .fetch_from('file', path='doc1.txt')")

# Example 2: Directory with extension filter
print("\n2️⃣ Directory with Filter:")
pipeline2 = Pipeline().fetch_from("file", dir=demo_dir, ext=[".txt", ".md"])
print(f"   .fetch_from('file', dir='{demo_dir}', ext=['.txt', '.md'])")

# Example 3: All files in directory
print("\n3️⃣ All Files in Directory:")
pipeline3 = Pipeline().fetch_from("file", dir=demo_dir)
print(f"   .fetch_from('file', dir='{demo_dir}')")

print("\n✅ Fetcher configurations created (not executed)")
print("   Use .run() to execute the pipeline")

📥 Fetcher Examples:

1️⃣ Single File:
   .fetch_from('file', path='doc1.txt')

2️⃣ Directory with Filter:
   .fetch_from('file', dir='C:\Users\PMACHA~1\AppData\Local\Temp\tmpjlgca46s', ext=['.txt', '.md'])

3️⃣ All Files in Directory:
   .fetch_from('file', dir='C:\Users\PMACHA~1\AppData\Local\Temp\tmpjlgca46s')

✅ Fetcher configurations created (not executed)
   Use .run() to execute the pipeline


## 5. process_with() - Chefs

Process data with different chef types.

In [7]:
print("👨‍🍳 Chef Examples:\n")

# Text Chef
print("1️⃣ Text Chef:")
doc = (Pipeline()
    .fetch_from("file", path=os.path.join(demo_dir, "doc1.txt"))
    .process_with("text")
    .chunk_with("recursive", chunk_size=50)
    .run())
print(f"   Processed {len(doc.chunks)} chunks from text file")

# Markdown Chef
print("\n2️⃣ Markdown Chef:")
doc = (Pipeline()
    .fetch_from("file", path=os.path.join(demo_dir, "doc3.md"))
    .process_with("markdown")
    .chunk_with("recursive", chunk_size=50)
    .run())
print(f"   Processed markdown with {len(doc.chunks)} chunks")
if hasattr(doc, 'tables'):
    print(f"   Found {len(doc.tables)} tables")
if hasattr(doc, 'code'):
    print(f"   Found {len(doc.code)} code blocks")

# Without Chef (direct chunking)
print("\n3️⃣ No Chef (Direct):")
doc = (Pipeline()
    .chunk_with("recursive", chunk_size=30)
    .run(texts="Direct text chunking without preprocessing"))
print(f"   Chunked directly: {len(doc.chunks)} chunks")

print("\n✅ Different chef types demonstrated")

👨‍🍳 Chef Examples:

1️⃣ Text Chef:
   Processed 6 chunks from text file

2️⃣ Markdown Chef:
   Processed markdown with 5 chunks
   Found 0 tables
   Found 0 code blocks

3️⃣ No Chef (Direct):
   Chunked directly: 2 chunks

✅ Different chef types demonstrated


## 6. chunk_with() - Chunkers (Required)

Use different chunking strategies.
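
The recorded Token Chunker output in this section (exactly 40 "tokens" per chunk, with words split mid-way such as `le` / `arn`) suggests that the default tokenizer counts characters. Under that assumption, fixed-size chunking reduces to slicing. `char_chunks` is a hypothetical helper for illustration, not Chonkie's token chunker.

```python
def char_chunks(text, chunk_size):
    """Split text into consecutive slices of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Same sample text as the cell in this section.
sample_text = (
    "Machine learning enables computers to learn from data. "
    "Deep learning uses neural networks with multiple layers. "
    "Natural language processing helps computers understand human language. "
    "Computer vision allows machines to interpret visual information."
)

chunks = char_chunks(sample_text, 40)
for i, chunk in enumerate(chunks, 1):
    print(f"{i}. {chunk!r} ({len(chunk)} chars)")
```

Slicing ignores word and sentence boundaries, which is the trade-off the recursive and semantic chunkers are meant to address.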

In [8]:
sample_text = "Machine learning enables computers to learn from data. Deep learning uses neural networks with multiple layers. Natural language processing helps computers understand human language. Computer vision allows machines to interpret visual information."

print("✂️ Chunker Examples:\n")

# Recursive Chunker
print("1️⃣ Recursive Chunker:")
doc = (Pipeline()
    .chunk_with("recursive", chunk_size=50)
    .run(texts=sample_text))
print(f"   Created {len(doc.chunks)} chunks")
for i, chunk in enumerate(doc.chunks, 1):
    print(f"   {i}. {chunk.text[:50]}... ({chunk.token_count} tokens)")

# Token Chunker
print("\n2️⃣ Token Chunker:")
doc = (Pipeline()
    .chunk_with("token", chunk_size=40)
    .run(texts=sample_text))
print(f"   Created {len(doc.chunks)} chunks")
for i, chunk in enumerate(doc.chunks, 1):
    print(f"   {i}. {chunk.text[:50]}... ({chunk.token_count} tokens)")

# Semantic Chunker
print("\n3️⃣ Semantic Chunker:")
doc = (Pipeline()
    .chunk_with("semantic", threshold=0.7, chunk_size=100)
    .run(texts=sample_text))
print(f"   Created {len(doc.chunks)} semantic chunks")
for i, chunk in enumerate(doc.chunks, 1):
    print(f"   {i}. {chunk.text[:50]}...")

print("\n✅ Different chunking strategies demonstrated")

✂️ Chunker Examples:

1️⃣ Recursive Chunker:
   Created 8 chunks
   1. Machine learning enables computers to learn from... (49 tokens)
   2.  data. ... (7 tokens)
   3. Deep learning uses neural networks with multiple... (49 tokens)
   4.  layers. ... (9 tokens)
   5. Natural language processing helps computers... (44 tokens)
   6.  understand human language. ... (28 tokens)
   7. Computer vision allows machines to interpret... (45 tokens)
   8.  visual information.... (20 tokens)

2️⃣ Token Chunker:
   Created 7 chunks
   1. Machine learning enables computers to le... (40 tokens)
   2. arn from data. Deep learning uses neural... (40 tokens)
   3.  networks with multiple layers. Natural ... (40 tokens)
   4. language processing helps computers unde... (40 tokens)
   5. rstand human language. Computer vision a... (40 tokens)
   6. llows machines to interpret visual infor... (40 tokens)
   7. mation.... (7 tokens)

3️⃣ Semantic Chunker:
   Created 1 semantic chunks
   1. 

## 7. refine_with() - Refineries (Optional, Chainable)

Enhance chunks with refineries. Multiple refineries can be chained!
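
In the recorded output for this section, the first chunk's context is `olutioniz`: nine characters (0.3 × 30) taken from the start of the following chunk. The sketch below reproduces that result under two assumptions: `method="suffix"` attaches the head of the next chunk as context, and a fractional `context_size` is relative to `chunk_size`. `suffix_contexts` is a hypothetical helper, not the overlap refinery's implementation.

```python
def char_chunks(text, chunk_size):
    """Character-level fixed-size chunks (assumed default tokenizer)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def suffix_contexts(chunks, chunk_size, context_size):
    """For each chunk, return the head of the next chunk as its context.

    The last chunk has no successor, so its context is None (an assumption).
    """
    width = max(1, round(context_size * chunk_size))
    contexts = [nxt[:width] for nxt in chunks[1:]]
    if chunks:
        contexts.append(None)
    return contexts


sample_text = (
    "Artificial intelligence is revolutionizing technology. "
    "Machine learning enables computers to learn from data. "
    "Deep learning uses neural networks. "
    "Natural language processing helps understand text."
)

chunks = char_chunks(sample_text, 30)
contexts = suffix_contexts(chunks, chunk_size=30, context_size=0.3)
print(len(chunks), repr(contexts[0]))
```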

In [9]:
from chonkie import OverlapRefinery, EmbeddingsRefinery

sample_text = "Artificial intelligence is revolutionizing technology. Machine learning enables computers to learn from data. Deep learning uses neural networks. Natural language processing helps understand text."

print("🔧 Refinery Examples:\n")

# Overlap Refinery
print("1️⃣ Overlap Refinery:")
doc = (Pipeline()
    .chunk_with("token", chunk_size=30)
    .refine_with("overlap", context_size=0.3, method="suffix")
    .run(texts=sample_text))
print(f"   Created {len(doc.chunks)} chunks with overlap context")
if doc.chunks[0].context:
    print(f"   First chunk context: {doc.chunks[0].context[:50]}...")

# Embeddings Refinery
print("\n2️⃣ Embeddings Refinery:")
doc = (Pipeline()
    .chunk_with("token", chunk_size=30)
    .refine_with("embeddings", embedding_model="minishlab/potion-base-8M")
    .run(texts=sample_text))
print(f"   Created {len(doc.chunks)} chunks with embeddings")
if doc.chunks[0].embedding is not None:
    print(f"   Embedding shape: {doc.chunks[0].embedding.shape}")

# Chain Multiple Refineries
print("\n3️⃣ Chained Refineries (Overlap + Embeddings):")
doc = (Pipeline()
    .chunk_with("token", chunk_size=30)
    .refine_with("overlap", context_size=0.2, method="suffix")
    .refine_with("embeddings", embedding_model="minishlab/potion-base-8M")
    .run(texts=sample_text))
print(f"   Created {len(doc.chunks)} chunks with both refineries")
print(f"   Has context: {doc.chunks[0].context is not None}")
print(f"   Has embedding: {doc.chunks[0].embedding is not None}")

print("\n✅ Refineries can be chained for complex processing")

🔧 Refinery Examples:

1️⃣ Overlap Refinery:
   Created 7 chunks with overlap context
   First chunk context: olutioniz...

2️⃣ Embeddings Refinery:


   Created 7 chunks with embeddings
   Embedding shape: (256,)

3️⃣ Chained Refineries (Overlap + Embeddings):
   Created 7 chunks with both refineries
   Has context: True
   Has embedding: True

✅ Refineries can be chained for complex processing


## 8. export_with() - Porters (Optional)

Export chunks to different formats.
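
Because an exported file may hold either a JSON array or JSON Lines, a small loader that accepts both can be handy when reading chunks back. This is a minimal sketch: `load_records` is a hypothetical helper, and the record fields used in the demo (`text`, `token_count`) are assumptions about what an export might contain.

```python
import json
import os
import tempfile


def load_records(path):
    """Load records from a file holding either a JSON array or JSON Lines."""
    with open(path, "r", encoding="utf-8") as f:
        content = f.read().strip()
    if not content:
        return []
    if content.startswith("["):
        # JSON array: the whole file is one document
        return json.loads(content)
    # JSON Lines: one JSON object per non-empty line
    return [json.loads(line) for line in content.splitlines() if line.strip()]


# Hypothetical records; the real export's field names may differ.
records = [
    {"text": "Machine learning", "token_count": 16},
    {"text": " is a subset of AI.", "token_count": 19},
]

with tempfile.TemporaryDirectory() as tmp:
    array_path = os.path.join(tmp, "array.json")
    lines_path = os.path.join(tmp, "lines.jsonl")
    with open(array_path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    with open(lines_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    from_array = load_records(array_path)
    from_lines = load_records(lines_path)

print(from_array == from_lines == records)
```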

In [10]:
import json

sample_text = "Machine learning is a subset of AI. Deep learning uses neural networks. NLP enables language understanding."

print("📦 Export Examples:\n")

# Export to JSON
print("1️⃣ JSON Export:")
doc = (Pipeline()
    .chunk_with("token", chunk_size=20)
    .export_with("json", file="./pipeline_chunks.json")
    .run(texts=sample_text))

if os.path.exists("./pipeline_chunks.json"):
    with open("./pipeline_chunks.json", "r", encoding="utf-8") as f:
        content = f.read().strip()
    if content.startswith('['):
        # JSON array: one list holding every chunk
        data = json.loads(content)
        print(f"   ✅ Exported {len(data)} chunks to JSON array")
    else:
        # JSON Lines: one chunk per non-empty line
        lines = [line for line in content.splitlines() if line.strip()]
        print(f"   ✅ Exported {len(lines)} chunks to JSON Lines")

# Export to Datasets
print("\n2️⃣ Datasets Export:")
doc = (Pipeline()
    .chunk_with("token", chunk_size=20)
    .export_with("datasets", save_to_disk=True, path="./pipeline_dataset")
    .run(texts=sample_text))

if os.path.exists("./pipeline_dataset"):
    print(f"   ✅ Exported dataset to: ./pipeline_dataset")
    print(f"   Directory exists: {os.path.exists('./pipeline_dataset')}")

print("\n✅ Chunks exported successfully")
print("   Note: Pipeline still returns Document object")

📦 Export Examples:

1️⃣ JSON Export:
   ✅ Exported 6 chunks to JSON Lines

2️⃣ Datasets Export:


   ✅ Exported dataset to: ./pipeline_dataset
   Directory exists: True

✅ Chunks exported successfully
   Note: Pipeline still returns Document object


---

# Part 3: Advanced Examples

## 9. Complete RAG Pipeline

Build a complete RAG ingestion pipeline with all components.

In [11]:
# Create sample knowledge base files
kb_dir = tempfile.mkdtemp()
kb_files = {
    "ml_basics.txt": "Machine learning is a subset of artificial intelligence that enables computers to learn from data without explicit programming. Supervised learning uses labeled data for training. Unsupervised learning finds patterns in unlabeled data.",
    "dl_intro.txt": "Deep learning uses artificial neural networks with multiple layers to process complex data. Convolutional neural networks excel at image recognition. Recurrent neural networks handle sequential data like text and time series.",
    "nlp_guide.txt": "Natural language processing enables computers to understand human language. Named entity recognition identifies important entities in text. Sentiment analysis determines emotional tone. Machine translation breaks down language barriers."
}

for filename, content in kb_files.items():
    with open(os.path.join(kb_dir, filename), 'w', encoding='utf-8') as f:
        f.write(content)

print("🔮 Complete RAG Pipeline:\n")

# Full pipeline with all components
docs = (Pipeline()
    .fetch_from("file", dir=kb_dir, ext=[".txt"])
    .process_with("text")
    .chunk_with("semantic", threshold=0.8, chunk_size=100)
    .refine_with("overlap", context_size=0.2, method="suffix")
    .refine_with("embeddings", embedding_model="minishlab/potion-base-8M")
    .export_with("json", file="./rag_chunks.json")
    .run())

print(f"✅ Processed {len(docs)} documents")
print(f"\n📊 Results:")
total_chunks = sum(len(doc.chunks) for doc in docs)
print(f"  Total chunks: {total_chunks}")

# Check first document
if docs:
    first_doc = docs[0]
    print(f"\n📄 First Document:")
    print(f"  Chunks: {len(first_doc.chunks)}")
    if first_doc.chunks:
        chunk = first_doc.chunks[0]
        print(f"  First chunk: {chunk.text[:60]}...")
        print(f"  Has context: {chunk.context is not None}")
        print(f"  Has embedding: {chunk.embedding is not None}")

# Verify export
if os.path.exists("./rag_chunks.json"):
    print(f"\n✅ Exported to: rag_chunks.json")

print("\n✨ Complete RAG pipeline executed successfully!")

🔮 Complete RAG Pipeline:

✅ Processed 3 documents

📊 Results:
  Total chunks: 3

📄 First Document:
  Chunks: 1
  First chunk: Deep learning uses artificial neural networks with multiple ...
  Has context: False
  Has embedding: True

✅ Exported to: rag_chunks.json

✨ Complete RAG pipeline executed successfully!


## 10. Semantic Search Pipeline

Process documents with embeddings for semantic search.
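
Once chunks carry embeddings, a search step typically ranks them by similarity to a query vector. The sketch below shows cosine-similarity ranking in plain Python. The three-dimensional vectors and the `top_k` helper are hypothetical stand-ins; real embeddings from the model used in this notebook are 256-dimensional, and the query would be embedded with the same model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Rank (text, vector) pairs by cosine similarity to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for chunk embeddings.
indexed = [
    ("attention mechanism", [0.9, 0.1, 0.0]),
    ("bidirectional training", [0.1, 0.9, 0.1]),
    ("autoregressive generation", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]

results = top_k(query, indexed, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```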

In [12]:
research_text = """Transformer architectures have revolutionized natural language processing. The attention mechanism allows models to focus on relevant parts of the input. BERT uses bidirectional training to understand context. GPT models use autoregressive generation for text creation. These models achieve state-of-the-art results across various NLP tasks."""

print("🔍 Semantic Search Pipeline:\n")

doc = (Pipeline()
    .chunk_with("semantic", 
                threshold=0.8, 
                chunk_size=100, 
                similarity_window=3)
    .refine_with("overlap", context_size=0.2)
    .refine_with("embeddings", embedding_model="minishlab/potion-base-8M")
    .run(texts=research_text))

print(f"✅ Processed document with semantic chunking")
print(f"  Total chunks: {len(doc.chunks)}")

print(f"\n📊 Chunk Analysis:")
for i, chunk in enumerate(doc.chunks, 1):
    print(f"\n  Chunk {i}:")
    print(f"    Text: {chunk.text[:60]}...")
    print(f"    Tokens: {chunk.token_count}")
    if chunk.context:
        print(f"    Context: {chunk.context[:40]}...")
    if chunk.embedding is not None:
        print(f"    Embedding shape: {chunk.embedding.shape}")

print("\n✨ All chunks ready for semantic search!")

🔍 Semantic Search Pipeline:

✅ Processed document with semantic chunking
  Total chunks: 1

📊 Chunk Analysis:

  Chunk 1:
    Text: Transformer architectures have revolutionized natural langua...
    Tokens: 59
    Embedding shape: (256,)

✨ All chunks ready for semantic search!


## 11. Code Documentation Pipeline

Process code files with specialized chunking.
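
This notebook does not show how the `code` chunker decides where to split. One simple way to picture structure-aware splitting for Python source is to cut at top-level definitions with the standard `ast` module, as sketched below. This is a substitute technique for illustration, not Chonkie's implementation: `top_level_chunks` is a hypothetical helper, and it does not enforce any `chunk_size`.

```python
import ast


def top_level_chunks(source):
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


source = '''def calculate_sum(a, b):
    """Calculate sum of two numbers"""
    return a + b

def calculate_product(a, b):
    """Calculate product of two numbers"""
    return a * b

class Calculator:
    def __init__(self):
        self.result = 0

    def add(self, value):
        self.result += value
        return self.result
'''

chunks = top_level_chunks(source)
for name, text in chunks:
    print(f"{name}: {len(text.splitlines())} lines")
```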

In [13]:
# Create sample Python files
code_dir = tempfile.mkdtemp()
code_files = {
    "utils.py": """def calculate_sum(a, b):
    '''Calculate sum of two numbers'''
    return a + b

def calculate_product(a, b):
    '''Calculate product of two numbers'''
    return a * b

class Calculator:
    def __init__(self):
        self.result = 0
    
    def add(self, value):
        self.result += value
        return self.result
""",
    "models.py": """class User:
    def __init__(self, name, email):
        self.name = name
        self.email = email
    
    def get_profile(self):
        return {'name': self.name, 'email': self.email}

class Post:
    def __init__(self, title, content):
        self.title = title
        self.content = content
"""
}

for filename, content in code_files.items():
    with open(os.path.join(code_dir, filename), 'w', encoding='utf-8') as f:
        f.write(content)

print("💻 Code Documentation Pipeline:\n")

# Process Python files with code chunker
docs = (Pipeline()
    .fetch_from("file", dir=code_dir, ext=[".py"])
    .chunk_with("code", chunk_size=150)
    .export_with("json", file="./code_chunks.json")
    .run())

print(f"✅ Processed {len(docs)} Python files")

for i, doc in enumerate(docs, 1):
    print(f"\n📄 File {i}:")
    print(f"  Chunks: {len(doc.chunks)}")
    if doc.chunks:
        print(f"  First chunk preview:")
        print(f"    {doc.chunks[0].text[:80]}...")

if os.path.exists("./code_chunks.json"):
    print(f"\n✅ Code chunks exported to: code_chunks.json")

print("\n✨ Code documentation pipeline complete!")

💻 Code Documentation Pipeline:

✅ Processed 2 Python files

📄 File 1:
  Chunks: 3
  First chunk preview:
    class User:
    def __init__(self, name, email):
        self.name = name
      ...

📄 File 2:
  Chunks: 3
  First chunk preview:
    def calculate_sum(a, b):
    '''Calculate sum of two numbers'''
    return a + b...

✅ Code chunks exported to: code_chunks.json

✨ Code documentation pipeline complete!




## 12. Markdown Processing Pipeline

With the markdown chef, tables and code blocks are extracted into `doc.tables` and `doc.code`, alongside the regular chunks.

In [15]:
# Create sample markdown file
md_content = """# Project Documentation

## Introduction
This project demonstrates machine learning concepts.

## Features
- Data preprocessing
- Model training
- Evaluation metrics

## Code Example
```python
def train_model(data):
    model = LinearRegression()
    model.fit(data.X, data.y)
    return model
```

## Results Table
| Model | Accuracy | F1 Score |
|-------|----------|----------|
| LR    | 0.85     | 0.82     |
| RF    | 0.92     | 0.90     |

## Conclusion
Machine learning provides powerful tools for data analysis.
"""

md_file = os.path.join(demo_dir, "project_doc.md")
with open(md_file, 'w', encoding='utf-8') as f:
    f.write(md_content)

print("📝 Markdown Processing Pipeline:\n")

# Process markdown with awareness
doc = (Pipeline()
    .fetch_from("file", path=md_file)
    .process_with("markdown")
    .chunk_with("recursive", chunk_size=100)
    .run())

print(f"✅ Processed markdown document")
print(f"  Total chunks: {len(doc.chunks)}")

# Access markdown metadata
if hasattr(doc, 'tables') and doc.tables:
    print(f"\n📊 Tables found: {len(doc.tables)}")
    for i, table in enumerate(doc.tables, 1):
        # Preview the table via its string representation
        print(f"  Table {i}: {str(table)[:50]}...")

if hasattr(doc, 'code') and doc.code:
    print(f"\n💻 Code blocks found: {len(doc.code)}")
    for i, code in enumerate(doc.code, 1):
        # Preview the code block via its string representation
        print(f"  Code {i}: {str(code)[:40]}...")

print(f"\n📄 Chunks preview:")
for i, chunk in enumerate(doc.chunks[:3], 1):
    print(f"  {i}. {chunk.text[:60]}...")

print("\n✨ Markdown processed with full structure awareness!")


📝 Markdown Processing Pipeline:

✅ Processed markdown document
  Total chunks: 4

📊 Tables found: 1
  Table 1: MarkdownTable(content='| Model | Accuracy | F1 Sco...

💻 Code blocks found: 1
  Code 1: MarkdownCode(content='def train_model(da...

📄 Chunks preview:
  1. # Project Documentation

## Introduction
This project demons...
  2. 
## Features
- Data preprocessing
- Model training
- Evaluat...
  3. 

## Results Table
...

✨ Markdown processed with full structure awareness!


---

# Part 4: Pipeline Validation & Patterns

## 13. Pipeline Validation

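Pipelines validate their configuration before execution. In the run below, a pipeline without a chunker and a pipeline with two chefs each raise `ValueError`.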
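
The rules exercised in this section can be pictured as a few checks over the list of configured steps. The sketch below is a toy model only: `validate`, `check`, the `(stage, name)` step format, and the error messages are all hypothetical, and the real pipeline may enforce more rules than these three.

```python
def validate(steps, has_text_input=False):
    """Raise ValueError if a list of (stage, name) steps breaks a rule."""
    stages = [stage for stage, _ in steps]
    if "chunker" not in stages:
        raise ValueError("pipeline must include a chunker")
    if stages.count("chef") > 1:
        raise ValueError("pipeline can have at most one chef")
    if "fetcher" not in stages and not has_text_input:
        raise ValueError("pipeline needs a fetcher or direct text input")


def check(steps, **kwargs):
    """Return 'ok' or a short description of the validation error."""
    try:
        validate(steps, **kwargs)
        return "ok"
    except ValueError as exc:
        return f"ValueError: {exc}"


valid = check([("fetcher", "file"), ("chunker", "recursive")])
no_chunker = check([("fetcher", "file")])
two_chefs = check(
    [("chef", "text"), ("chef", "markdown"), ("chunker", "recursive")],
    has_text_input=True,
)
no_input = check([("chunker", "recursive")])

for label, result in [("valid", valid), ("no chunker", no_chunker),
                      ("two chefs", two_chefs), ("no input", no_input)]:
    print(f"{label}: {result}")
```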

In [16]:
print("✅ Pipeline Validation Examples:\n")

# Valid pipeline - has chunker and input
print("1️⃣ Valid Pipeline:")
try:
    doc = (Pipeline()
        .fetch_from("file", path=os.path.join(demo_dir, "doc1.txt"))
        .chunk_with("recursive", chunk_size=50)
        .run())
    print("   ✅ Valid: Has chunker and fetcher")
except Exception as e:
    print(f"   ❌ Error: {e}")

# Valid pipeline - text input, no fetcher needed
print("\n2️⃣ Valid Pipeline (No Fetcher):")
try:
    doc = (Pipeline()
        .chunk_with("recursive", chunk_size=50)
        .run(texts="Hello world"))
    print("   ✅ Valid: Has chunker and text input")
except Exception as e:
    print(f"   ❌ Error: {e}")

# Invalid pipeline - no chunker
print("\n3️⃣ Invalid Pipeline (No Chunker):")
try:
    doc = (Pipeline()
        .fetch_from("file", path=os.path.join(demo_dir, "doc1.txt"))
        .run())
    print("   ❌ Should have failed!")
except Exception as e:
    print(f"   ✅ Expected error caught: {type(e).__name__}")

# Invalid pipeline - multiple chefs
print("\n4️⃣ Invalid Pipeline (Multiple Chefs):")
try:
    doc = (Pipeline()
        .process_with("text")
        .process_with("markdown")  # Second chef!
        .chunk_with("recursive", chunk_size=50)
        .run(texts="test"))
    print("   ❌ Should have failed!")
except Exception as e:
    print(f"   ✅ Expected error caught: {type(e).__name__}")

print("\n✅ Pipeline validates configuration before execution")

✅ Pipeline Validation Examples:

1️⃣ Valid Pipeline:
   ✅ Valid: Has chunker and fetcher

2️⃣ Valid Pipeline (No Fetcher):
   ✅ Valid: Has chunker and text input

3️⃣ Invalid Pipeline (No Chunker):
   ✅ Expected error caught: ValueError

4️⃣ Invalid Pipeline (Multiple Chefs):
   ✅ Expected error caught: ValueError

✅ Pipeline validates configuration before execution


## 14. Return Values

`run()` returns a single `Document` for one file or one text, and a `list[Document]` for a directory or a list of texts.
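
The single-versus-list rule can be modelled by normalizing the input to a batch and unwrapping the result only when the input was a single item. In this sketch `run` is a hypothetical stand-in and plain dictionaries stand in for `Document` objects; how the real pipeline treats a list containing one text is not shown in this notebook, so returning a list in that case is an assumption.

```python
def run(texts):
    """Return one stand-in document for a str, a list for a list of str."""
    single = isinstance(texts, str)
    batch = [texts] if single else list(texts)
    docs = [{"text": t, "chunks": [t]} for t in batch]  # stand-in documents
    return docs[0] if single else docs


one = run("Single text input")
many = run(["Text 1", "Text 2", "Text 3"])
print(type(one).__name__, type(many).__name__, len(many))
```

Code that accepts either form can branch on `isinstance(result, list)`, as the cell that follows does.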

In [17]:
print("📦 Return Value Examples:\n")

# Single file → Document
print("1️⃣ Single File → Document:")
result = (Pipeline()
    .fetch_from("file", path=os.path.join(demo_dir, "doc1.txt"))
    .chunk_with("recursive", chunk_size=50)
    .run())
print(f"   Type: {type(result).__name__}")
print(f"   Is Document: {isinstance(result, Document)}")
print(f"   Chunks: {len(result.chunks)}")

# Directory → list[Document]
print("\n2️⃣ Directory → list[Document]:")
result = (Pipeline()
    .fetch_from("file", dir=demo_dir, ext=[".txt"])
    .chunk_with("recursive", chunk_size=50)
    .run())
print(f"   Type: {type(result).__name__}")
print(f"   Is list: {isinstance(result, list)}")
print(f"   Documents: {len(result)}")

# Multiple texts → list[Document]
print("\n3️⃣ Multiple Texts → list[Document]:")
result = (Pipeline()
    .chunk_with("recursive", chunk_size=30)
    .run(texts=["Text 1", "Text 2", "Text 3"]))
print(f"   Type: {type(result).__name__}")
print(f"   Is list: {isinstance(result, list)}")
print(f"   Documents: {len(result)}")

# Single text → Document
print("\n4️⃣ Single Text → Document:")
result = (Pipeline()
    .chunk_with("recursive", chunk_size=30)
    .run(texts="Single text input"))
print(f"   Type: {type(result).__name__}")
print(f"   Is Document: {isinstance(result, Document)}")
print(f"   Chunks: {len(result.chunks)}")

print("\n✅ Return type depends on input: single → Document, multiple → list[Document]")

📦 Return Value Examples:

1️⃣ Single File → Document:
   Type: Document
   Is Document: True
   Chunks: 6

2️⃣ Directory → list[Document]:
   Type: list
   Is list: True
   Documents: 2

3️⃣ Multiple Texts → list[Document]:
   Type: list
   Is list: True
   Documents: 3

4️⃣ Single Text → Document:
   Type: Document
   Is Document: True
   Chunks: 1

✅ Return type depends on input: single → Document, multiple → list[Document]


## 15. Error Handling

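Failures surface as ordinary Python exceptions. In the recorded run below, the missing file and the missing directory both raise `RuntimeError` rather than `FileNotFoundError`, while a pipeline with no input source raises `ValueError` with an explanatory message.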
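
When a library reports a wrapped error type, the underlying cause can sometimes be recovered by walking the exception chain. The sketch below shows the general Python technique. `load` is a hypothetical stand-in for a pipeline step, and whether Chonkie chains the original exception onto its `RuntimeError` is an assumption that this notebook does not confirm.

```python
import os
import tempfile


def exception_chain(exc):
    """Yield exc followed by the exceptions it was raised from."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        yield exc
        exc = exc.__cause__ or exc.__context__


def load(path):
    # Hypothetical step that wraps low-level errors in RuntimeError.
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError as err:
        raise RuntimeError(f"fetch failed for {path!r}") from err


names = []
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "nonexistent.txt")
    try:
        load(missing)
    except RuntimeError as err:
        names = [type(e).__name__ for e in exception_chain(err)]

print(" <- ".join(names))
```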

In [18]:
print("🛡️ Error Handling Examples:\n")

# File not found
print("1️⃣ File Not Found:")
try:
    doc = (Pipeline()
        .fetch_from("file", path="nonexistent.txt")
        .chunk_with("recursive", chunk_size=50)
        .run())
except FileNotFoundError as e:
    print(f"   ✅ Caught FileNotFoundError")
    print(f"   Message: {str(e)[:60]}...")
except Exception as e:
    print(f"   ✅ Caught {type(e).__name__}")

# Configuration error (no input source)
print("\n2️⃣ Configuration Error:")
try:
    doc = (Pipeline()
        .chunk_with("recursive", chunk_size=50)
        .run())  # No texts or fetcher!
except (ValueError, RuntimeError) as e:
    print(f"   ✅ Caught {type(e).__name__}")
    print(f"   Message: {str(e)[:80]}...")
except Exception as e:
    print(f"   ✅ Caught {type(e).__name__}: {str(e)[:80]}...")

# Invalid directory
print("\n3️⃣ Invalid Directory:")
try:
    docs = (Pipeline()
        .fetch_from("file", dir="/nonexistent/path")
        .chunk_with("recursive", chunk_size=50)
        .run())
except (FileNotFoundError, ValueError, OSError) as e:
    print(f"   ‚úÖ Caught {type(e).__name__}")
    print(f"   Message: {str(e)[:60]}...")
except Exception as e:
    print(f"   ‚úÖ Caught {type(e).__name__}")

print("\n‚úÖ Pipelines provide clear, actionable error messages")

üõ°Ô∏è Error Handling Examples:

1Ô∏è‚É£ File Not Found:
   ‚úÖ Caught RuntimeError

2Ô∏è‚É£ Configuration Error:
   ‚úÖ Caught ValueError
   Message: Pipeline must include a fetcher component (use fetch_from()) or provide text inp...

3Ô∏è‚É£ Invalid Directory:
   ‚úÖ Caught RuntimeError

‚úÖ Pipelines provide clear, actionable error messages
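
In the run above, the missing file and the missing directory both surfaced as `RuntimeError` rather than `FileNotFoundError`. The following is a minimal sketch of one way to look for the underlying error, assuming the wrapping exception keeps the original one in `__cause__` or `__context__` (standard Python exception chaining; this notebook does not confirm that Chonkie chains its errors this way). The helper `root_cause`, the function `simulated_pipeline_run`, and its error message are hypothetical and use only the standard library:

```python
def root_cause(exc: BaseException) -> BaseException:
    """Follow the exception chain and return the innermost exception."""
    seen = set()
    current = exc
    while id(current) not in seen:  # guard against cyclic chains
        seen.add(id(current))
        nxt = current.__cause__ or current.__context__
        if nxt is None:
            break
        current = nxt
    return current


def simulated_pipeline_run() -> None:
    """Hypothetical stand-in for a pipeline run that wraps a low-level error."""
    try:
        raise FileNotFoundError("nonexistent.txt")
    except FileNotFoundError as err:
        # Hypothetical wrapper message, not Chonkie's actual wording.
        raise RuntimeError("simulated pipeline failure") from err


try:
    simulated_pipeline_run()
except RuntimeError as e:
    cause = root_cause(e)
    print(f"Caught {type(e).__name__}, root cause: {type(cause).__name__}")
```

If the chain is empty, `root_cause` returns the exception it was given, so the helper is safe to call on any caught error.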


## 16. Automatic Component Reordering

The pipeline automatically reorders components to follow the CHOMP architecture, so the order in which you call the builder methods does not matter.

In [19]:
print("🔄 Automatic Reordering Examples:\n")

sample_text = "Machine learning enables computers to learn from data."

# Components added in "wrong" order
print("1️⃣ Components Added Out of Order:")
doc = (Pipeline()
    .refine_with("overlap", context_size=0.2)  # 4th in CHOMP
    .chunk_with("token", chunk_size=20)        # 3rd in CHOMP
    .process_with("text")                      # 2nd in CHOMP
    .run(texts=sample_text))

print(f"   ✅ Pipeline executed successfully!")
print(f"   Chunks: {len(doc.chunks)}")
print(f"   Pipeline auto-reordered to: Chef → Chunker → Refinery")

# Even more mixed up order
print("\n2️⃣ Highly Mixed Order:")
doc = (Pipeline()
    .export_with("json", file="./reorder_test.json")  # 5th in CHOMP
    .refine_with("overlap", context_size=0.1)         # 4th in CHOMP
    .process_with("text")                             # 2nd in CHOMP
    .chunk_with("token", chunk_size=20)               # 3rd in CHOMP
    .run(texts=sample_text))

print(f"   ✅ Still works perfectly!")
print(f"   Chunks: {len(doc.chunks)}")
print(f"   Auto-reordered to: Chef → Chunker → Refinery → Porter")

print("\n✅ Pipeline automatically follows CHOMP architecture")
print("   You can add components in any order!")

🔄 Automatic Reordering Examples:

1️⃣ Components Added Out of Order:
   ✅ Pipeline executed successfully!
   Chunks: 3
   Pipeline auto-reordered to: Chef → Chunker → Refinery

2️⃣ Highly Mixed Order:
   ✅ Still works perfectly!
   Chunks: 3
   Auto-reordered to: Chef → Chunker → Refinery → Porter

✅ Pipeline automatically follows CHOMP architecture
   You can add components in any order!
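
One way to picture the reordering shown above is as a stable sort on a per-stage rank. This is an illustrative sketch only, not Chonkie's implementation: the `CHOMP_ORDER` table, the `reorder` function, and the `(stage, name)` tuple representation are hypothetical names chosen for this example.

```python
# Stage ranks mirror the CHOMP order described in this guide.
CHOMP_ORDER = {
    "fetcher": 0,
    "chef": 1,
    "chunker": 2,
    "refinery": 3,
    "porter": 4,
    "handshake": 4,
}


def reorder(steps: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort steps by CHOMP stage.

    sorted() is stable, so steps that share a stage (for example several
    refineries) keep the order in which they were added.
    """
    return sorted(steps, key=lambda step: CHOMP_ORDER[step[0]])


# Same call order as the "Highly Mixed Order" example.
added = [
    ("porter", "json"),
    ("refinery", "overlap"),
    ("chef", "text"),
    ("chunker", "token"),
]
print(" -> ".join(stage for stage, _ in reorder(added)))
# prints: chef -> chunker -> refinery -> porter
```

A stable sort matters here because chained refineries are expected to run in the order they were added.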


## Cleanup

Remove the temporary files and directories created during this demonstration.

In [20]:
import os
import shutil

# Files and directories created earlier in this notebook
cleanup_items = [
    demo_dir,
    kb_dir,
    code_dir,
    "./pipeline_chunks.json",
    "./pipeline_dataset",
    "./rag_chunks.json",
    "./code_chunks.json",
    "./reorder_test.json"
]

print("🧹 Cleaning up temporary files...\n")

for item in cleanup_items:
    try:
        if os.path.isfile(item):
            os.remove(item)
            print(f"  ✅ Deleted file: {item}")
        elif os.path.isdir(item):
            shutil.rmtree(item)
            print(f"  ✅ Deleted directory: {os.path.basename(item)}")
    except FileNotFoundError:
        print(f"  ℹ️ Not found: {item}")
    except Exception as e:
        print(f"  ❌ Error deleting {item}: {e}")

print("\n✅ Cleanup complete!")

🧹 Cleaning up temporary files...

  ✅ Deleted directory: tmpjlgca46s
  ✅ Deleted directory: tmppy45esrg
  ✅ Deleted directory: tmpagk7mjst
  ✅ Deleted file: ./pipeline_chunks.json
  ✅ Deleted directory: pipeline_dataset
  ✅ Deleted file: ./rag_chunks.json
  ✅ Deleted file: ./code_chunks.json
  ✅ Deleted file: ./reorder_test.json

✅ Cleanup complete!


---

## Summary: Pipeline API Complete Guide

### CHOMP Architecture

The Pipeline follows this order automatically:

1. **Fetcher** (Optional) - Retrieve data from sources
2. **Chef** (Optional) - Preprocess and transform data
3. **Chunker** (Required) - Split into manageable chunks
4. **Refinery** (Optional, Chainable) - Enhance chunks
5. **Porter/Handshake** (Optional) - Export or store

### Pipeline Methods

| Method | Purpose | Required | Chainable |
|--------|---------|----------|-----------|
| `fetch_from()` | Fetch data from files/APIs | No* | No |
| `process_with()` | Process with chef | No | No |
| `chunk_with()` | Split into chunks | **Yes** | No |
| `refine_with()` | Enhance chunks | No | **Yes** |
| `export_with()` | Export to formats | No | Yes |
| `store_in()` | Store in vector DB | No | Yes |
| `run()` | Execute pipeline | **Yes** | No |

*Required unless providing `texts` to `run()`

### Key Features

✅ **Fluent API**: Chain methods naturally
```python
Pipeline().fetch_from(...).chunk_with(...).run()
```

✅ **Auto-reordering**: Add components in any order
```python
# Works! Auto-reorders to CHOMP
Pipeline().refine_with(...).chunk_with(...).process_with(...)
```

✅ **Multiple refineries**: Chain as many as needed
```python
.refine_with("overlap", ...).refine_with("embeddings", ...)
```

✅ **Flexible input**: File, directory, or direct text
```python
.fetch_from("file", path="doc.txt")  # Single file
.fetch_from("file", dir="./docs")    # Directory
.run(texts="Direct text")             # No fetcher
```

✅ **Smart returns**: Single → Document, Multiple → list[Document]
```python
doc = pipeline.run(texts="one")      # Document
docs = pipeline.run(texts=["1","2"]) # list[Document]
```

### Common Patterns

**1. Simple Text Processing**:
```python
doc = (Pipeline()
    .chunk_with("recursive", chunk_size=512)
    .run(texts="Your text here"))
```

**2. File Processing**:
```python
doc = (Pipeline()
    .fetch_from("file", path="document.txt")
    .process_with("text")
    .chunk_with("semantic", threshold=0.8)
    .run())
```

**3. RAG Pipeline**:
```python
docs = (Pipeline()
    .fetch_from("file", dir="./docs")
    .chunk_with("semantic", chunk_size=1024)
    .refine_with("overlap", context_size=100)
    .refine_with("embeddings", model="potion-base-8M")
    .run())
```

**4. Export Pipeline**:
```python
doc = (Pipeline()
    .chunk_with("recursive", chunk_size=512)
    .export_with("json", file="chunks.json")
    .run(texts="Text to export"))
```

### Best Practices

- ✅ **Always specify chunk_size** - Required for most chunkers
- ✅ **Match chunkers to content** - Use `code` for code, `semantic` for varied content
- ✅ **Use refineries for RAG** - Add overlap and embeddings
- ✅ **Filter extensions** - Use `ext=[".txt", ".md"]` in directory mode
- ✅ **Chain refineries** - Combine overlap, embeddings, etc.
- ✅ **Handle errors** - Use try/except for file operations

### Validation Rules

- ✅ **Must have**: At least one chunker
- ✅ **Must have**: Fetcher OR text input via `run(texts=...)`
- ❌ **Cannot have**: Multiple chefs (only one allowed)
- ❌ **Cannot have**: Multiple chunkers (only one allowed)
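
The rules above can be expressed as a small check over the kinds of steps in a pipeline. This is a hedged sketch, not Chonkie's validation code: the `validate` function, the step-kind strings, and the error messages are all hypothetical and serve only to make the four rules concrete.

```python
from collections import Counter


def validate(step_kinds: list[str], has_text_input: bool = False) -> None:
    """Raise ValueError if the step list breaks one of the four rules."""
    counts = Counter(step_kinds)  # missing kinds count as 0
    if counts["chunker"] == 0:
        raise ValueError("Pipeline must include a chunker")
    if counts["chunker"] > 1:
        raise ValueError("Only one chunker is allowed")
    if counts["chef"] > 1:
        raise ValueError("Only one chef is allowed")
    if counts["fetcher"] == 0 and not has_text_input:
        raise ValueError("Pipeline needs a fetcher or text input")


# Several refineries are fine; text input replaces the fetcher.
validate(["chunker", "refinery", "refinery"], has_text_input=True)

try:
    validate(["chunker"])  # no fetcher and no text input
except ValueError as err:
    print(f"Invalid: {err}")
```

Note that refineries are deliberately not limited, which matches their "Chainable" status in the methods table.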

### Error Messages

Pipelines raise the following exception types:
- `FileNotFoundError` - File or directory not found
- `ValueError` - Invalid configuration or missing required components
- `RuntimeError` - Pipeline execution failed (in section 15, the missing file and the missing directory both surfaced as this type)

### Pipeline Recipes

Load pre-configured pipelines:
```python
# From Chonkie Hub
pipeline = Pipeline.from_recipe("markdown")

# From local file
pipeline = Pipeline.from_recipe("custom", path="./recipe.json")
```

### Return Values

| Input | Return Type | Example |
|-------|-------------|---------|
| Single file | `Document` | `.fetch_from("file", path=...)` |
| Directory | `list[Document]` | `.fetch_from("file", dir=...)` |
| Single text | `Document` | `.run(texts="one")` |
| Multiple texts | `list[Document]` | `.run(texts=["1", "2"])` |
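
Because the return type depends on the input, downstream code that accepts either case may want to normalize the result first. A minimal sketch of that idea follows; the helper `as_list` is hypothetical (it is not part of Chonkie), and plain strings stand in for `Document` objects so the example runs with the standard library alone.

```python
from typing import TypeVar, Union

T = TypeVar("T")


def as_list(result: Union[T, list[T]]) -> list[T]:
    """Wrap a single result in a list; return a list result unchanged."""
    return result if isinstance(result, list) else [result]


# Plain strings stand in for Document objects here.
print(len(as_list("single document")))   # prints: 1
print(len(as_list(["doc 1", "doc 2"])))  # prints: 2
```

With this in place, a loop such as `for doc in as_list(pipeline.run(...))` works for both single and multiple inputs.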

### Component Overview

- **Fetchers**: file, API, database (see [Fetchers](https://docs.chonkie.ai/oss/fetchers/overview))
- **Chefs**: text, markdown, table (see [Chefs](https://docs.chonkie.ai/oss/chefs/overview))
- **Chunkers**: recursive, semantic, token, code (see [Chunkers](https://docs.chonkie.ai/oss/chunkers/overview))
- **Refineries**: overlap, embeddings (see [Refineries](https://docs.chonkie.ai/oss/refinery/overview))
- **Porters**: JSON, Datasets (see [Porters](https://docs.chonkie.ai/oss/porters/overview))
- **Handshakes**: Chroma, Qdrant, Pinecone (see [Handshakes](https://docs.chonkie.ai/oss/handshakes/overview))