# Chonkie Chefs - Complete Guide

This notebook demonstrates all Chef types in Chonkie: **TextChef**, **MarkdownChef**, and **TableChef**.

## What are Chefs?

Chefs are processors that convert raw files into structured Document objects. Each Chef specializes in different file types:

- **TextChef**: Processes plain text files → `Document`
- **MarkdownChef**: Processes Markdown files → `MarkdownDocument` (with tables, code, images)
- **TableChef**: Processes CSV/Excel files and Markdown table strings → a document whose `tables` attribute holds `MarkdownTable` objects

## Key Features
- ✅ Process single files or batch process multiple files
- ✅ Returns structured Document objects ready for chunking
- ✅ UTF-8 encoding support for international text
- ✅ Works seamlessly in pipelines or standalone
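
As a rough illustration of how the file types above map to Chefs, the sketch below picks a Chef name from a file suffix. `CHEF_BY_SUFFIX` and `pick_chef_name` are hypothetical helpers written for this guide, not part of Chonkie; the mapping simply restates the list above.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the Chef named in this guide.
CHEF_BY_SUFFIX = {
    ".txt": "TextChef",
    ".md": "MarkdownChef",
    ".csv": "TableChef",
    ".xlsx": "TableChef",
}


def pick_chef_name(path: str) -> str:
    """Return the name of the Chef this guide uses for the given file type."""
    suffix = Path(path).suffix.lower()
    try:
        return CHEF_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No Chef configured for {suffix!r} files") from None


for name in ["article.txt", "tutorial.md", "products.csv", "inventory.xlsx"]:
    print(f"{name} -> {pick_chef_name(name)}")
```

Raising on unknown suffixes, rather than silently falling back to plain text, keeps an unsupported file from being processed with the wrong Chef.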

## Visual Overview

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#ff6b6b','primaryTextColor':'#fff','primaryBorderColor':'#c92a2a','lineColor':'#339af0','secondaryColor':'#51cf66','tertiaryColor':'#ffd43b','background':'#f8f9fa','mainBkg':'#e3fafc','secondBkg':'#fff3bf','tertiaryBkg':'#ffe3e3','textColor':'#212529','fontSize':'16px'}}}%%

graph TB
    Start([🍳 Chefs<br/>File Processors]):::startClass
    
    Start --> ChefType{Choose Chef Type}:::decisionClass
    
    ChefType -->|Plain Text| TextChef["📄 TextChef<br/>No parameters needed"]:::textClass
    ChefType -->|Markdown| MDChef["📝 MarkdownChef<br/>Optional: tokenizer"]:::mdClass
    ChefType -->|Tables| TableChef["📊 TableChef<br/>Requires pandas"]:::tableClass
    
    TextChef --> TextInput{Input Type}:::decisionClass
    MDChef --> MDInput{Input Type}:::decisionClass
    TableChef --> TableInput{Input Type}:::decisionClass
    
    TextInput -->|Single| TextSingle["process(path)"]:::methodClass
    TextInput -->|Multiple| TextBatch["process_batch(paths)"]:::methodClass
    
    MDInput -->|Single| MDSingle["process(path)"]:::methodClass
    MDInput -->|Multiple| MDBatch["process_batch(paths)"]:::methodClass
    
    TableInput -->|Single| TableSingle["process(path or string)"]:::methodClass
    TableInput -->|Multiple| TableBatch["process_batch(paths)"]:::methodClass
    
    TextSingle --> TextOutput["📦 Document<br/>id, content, metadata"]:::outputClass
    TextBatch --> TextOutput
    
    MDSingle --> MDOutput["📦 MarkdownDocument<br/>+ tables, code, images, chunks"]:::mdOutputClass
    MDBatch --> MDOutput
    
    TableSingle --> TableOutput["📦 Document with tables<br/>list of MarkdownTable, or None"]:::tableOutputClass
    TableBatch --> TableOutput
    
    TextOutput --> Integration{Integration}:::decisionClass
    MDOutput --> Integration
    TableOutput --> Integration
    
    Integration -->|With Chunker| Chunking["⚡ Add Chunking<br/>doc.chunks = chunks"]:::chunkClass
    Integration -->|Pipeline| Pipeline["🔗 Full Pipeline<br/>fetch → process → chunk"]:::pipelineClass
    Integration -->|Standalone| Direct["🔧 Direct Processing<br/>Work with Documents"]:::standaloneClass
    
    classDef startClass fill:#4c6ef5,stroke:#364fc7,stroke-width:3px,color:#fff
    classDef decisionClass fill:#7950f2,stroke:#5f3dc4,stroke-width:2px,color:#fff
    classDef textClass fill:#20c997,stroke:#087f5b,stroke-width:2px,color:#fff
    classDef mdClass fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    classDef tableClass fill:#ffd43b,stroke:#fab005,stroke-width:2px,color:#333
    classDef methodClass fill:#748ffc,stroke:#4c6ef5,stroke-width:2px,color:#fff
    classDef outputClass fill:#51cf66,stroke:#37b24d,stroke-width:2px,color:#fff
    classDef mdOutputClass fill:#ff922b,stroke:#e8590c,stroke-width:2px,color:#fff
    classDef tableOutputClass fill:#ffd43b,stroke:#fab005,stroke-width:2px,color:#333
    classDef chunkClass fill:#69db7c,stroke:#40c057,stroke-width:2px,color:#fff
    classDef pipelineClass fill:#845ef7,stroke:#5f3dc4,stroke-width:2px,color:#fff
    classDef standaloneClass fill:#ff922b,stroke:#e8590c,stroke-width:2px,color:#fff
```

## Setup - Create Mock Files

First, we'll create a set of valid sample files (txt, md, csv, xlsx) to demonstrate each Chef.
**Note:** This step resets the `test_chef_files` directory to ensure a clean state.

In [1]:
import shutil
from pathlib import Path
import pandas as pd

# Create test directory
test_dir = Path("./test_chef_files")
if test_dir.exists():
    shutil.rmtree(test_dir)
test_dir.mkdir(exist_ok=True)

# 1. Create plain text files for TextChef
text_files = {
    "article.txt": """Machine Learning in Modern Applications

Machine learning has revolutionized how we build software applications. From recommendation systems to natural language processing, ML models are everywhere. This article explores the key concepts and practical applications of machine learning in today's technology landscape.

Key areas include supervised learning, unsupervised learning, and reinforcement learning. Each approach has its own strengths and use cases.""",
    
    "notes.txt": """Quick Notes:
- Remember to test the new feature
- Update documentation
- Review pull requests
- Schedule team meeting for next week""",
    
    "data_science.txt": """Data Science Pipeline

The modern data science pipeline consists of several stages:
1. Data Collection
2. Data Cleaning
3. Feature Engineering
4. Model Training
5. Model Evaluation
6. Deployment

Each stage is critical for building robust ML systems."""
}

for filename, content in text_files.items():
    (test_dir / filename).write_text(content, encoding='utf-8')

# 2. Create markdown files for MarkdownChef
markdown_files = {
    "tutorial.md": """# Python Programming Tutorial

## Introduction

Python is a versatile programming language loved by developers worldwide.

### Getting Started

```python
def hello_world():
    print("Hello, World!")
    return True
```

## Data Structures

| Type | Mutable | Example |
|------|---------|---------|
| List | Yes | [1, 2, 3] |
| Tuple | No | (1, 2, 3) |
| Dict | Yes | {"key": "value"} |
| Set | Yes | {1, 2, 3} |

## Resources

![Python Logo](https://python.org/logo.png)

For more information, visit [Python.org](https://python.org).
""",
    
    "readme.md": """# Project README

## Overview

This project demonstrates advanced concepts.

```javascript
const greeting = (name) => {
    return `Hello, ${name}!`;
}
```

## Installation

Run the following command:

```bash
pip install chonkie
```
"""
}

for filename, content in markdown_files.items():
    (test_dir / filename).write_text(content, encoding='utf-8')

# 3. Create CSV files for TableChef
products_data = pd.DataFrame({
    'ProductID': [1, 2, 3, 4, 5],
    'Name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Price': [999.99, 29.99, 79.99, 299.99, 89.99],
    'Stock': [15, 150, 75, 30, 50]
})
products_data.to_csv(test_dir / "products.csv", index=False)

sales_data = pd.DataFrame({
    'Date': ['2026-01-01', '2026-01-02', '2026-01-03'],
    'Region': ['North', 'South', 'East'],
    'Sales': [15000, 22000, 18500],
    'Units': [120, 180, 145]
})
sales_data.to_csv(test_dir / "sales.csv", index=False)

# 4. Create Excel file for TableChef (writing .xlsx requires an Excel engine such as openpyxl)
with pd.ExcelWriter(test_dir / "inventory.xlsx") as writer:
    inventory = pd.DataFrame({
        'Item': ['Widget A', 'Widget B', 'Widget C'],
        'Quantity': [100, 250, 175],
        'Location': ['Warehouse 1', 'Warehouse 2', 'Warehouse 1']
    })
    inventory.to_excel(writer, sheet_name='Inventory', index=False)

print("✅ Created test files:")
print(f"📁 {test_dir}/")
for file in sorted(test_dir.iterdir()):
    size = file.stat().st_size
    print(f"  📄 {file.name} ({size} bytes)")

✅ Created test files:
📁 test_chef_files/
  📄 article.txt (463 bytes)
  📄 data_science.txt (260 bytes)
  📄 inventory.xlsx (5026 bytes)
  📄 notes.txt (135 bytes)
  📄 products.csv (128 bytes)
  📄 readme.md (253 bytes)
  📄 sales.csv (108 bytes)
  📄 tutorial.md (563 bytes)


## Installation

Install Chonkie with table support for TableChef:

In [2]:
# Install chonkie with table support
# !pip install "chonkie[table]"

from chonkie import TextChef, MarkdownChef, TableChef

print("✅ All Chefs imported successfully!")
print(f"  📄 TextChef: {TextChef}")
print(f"  📝 MarkdownChef: {MarkdownChef}")
print(f"  📊 TableChef: {TableChef}")

✅ All Chefs imported successfully!
  📄 TextChef: <class 'chonkie.chef.text.TextChef'>
  📝 MarkdownChef: <class 'chonkie.chef.markdown.MarkdownChef'>
  📊 TableChef: <class 'chonkie.chef.table.TableChef'>


---

# Part 1: TextChef

## 1. TextChef - Single File Processing

Process a single text file into a Document object.

In [3]:
# Initialize TextChef (no parameters needed)
text_chef = TextChef()

# Process a single text file
doc = text_chef.process("./test_chef_files/article.txt")

print("📄 TextChef - Single File Result:")
print(f"  Document ID: {doc.id}")
print(f"  Content Length: {len(doc.content)} characters")
print(f"  Metadata: {doc.metadata}")
print(f"\n📖 Content Preview (first 200 chars):")
print(f"  {doc.content[:200]}...")

📄 TextChef - Single File Result:
  Document ID: doc_b0bf88dd80d7458db59bfdbeda8fb8f9
  Content Length: 459 characters
  Metadata: {}

📖 Content Preview (first 200 chars):
  Machine Learning in Modern Applications

Machine learning has revolutionized how we build software applications. From recommendation systems to natural language processing, ML models are everywhere. T...
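
To make the result above more concrete, here is a minimal stand-in for what a text Chef conceptually does: read a UTF-8 file and wrap the text in an object with `id`, `content`, and `metadata`. `SimpleDocument` and `read_text_document` are hypothetical names, and the `doc_` plus 32 hex characters ID only mimics the shape seen in the output; how Chonkie actually generates IDs is an assumption here.

```python
import tempfile
import uuid
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: id, content, metadata."""

    content: str
    id: str = field(default_factory=lambda: f"doc_{uuid.uuid4().hex}")
    metadata: dict = field(default_factory=dict)


def read_text_document(path) -> SimpleDocument:
    """Read a UTF-8 text file and wrap its content in a SimpleDocument."""
    return SimpleDocument(content=Path(path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "greeting.txt"
    sample.write_text("Grüße, 世界!", encoding="utf-8")
    doc = read_text_document(sample)
    size_on_disk = sample.stat().st_size

# Character count and byte count differ for non-ASCII text.
print(doc.id[:4], len(doc.content), size_on_disk)
```

The same distinction explains why the file listing can report a different byte size (463 bytes for `article.txt`) than the document's character count (459): bytes on disk and decoded characters need not match, for example because of multi-byte characters or platform newline conversion.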


## 2. TextChef - Batch Processing

Process multiple text files at once.

In [4]:
text_chef = TextChef()

# Process multiple text files
file_paths = [
    "./test_chef_files/article.txt",
    "./test_chef_files/notes.txt",
    "./test_chef_files/data_science.txt"
]

docs = text_chef.process_batch(file_paths)

print(f"📄 TextChef - Batch Processing Result:")
print(f"  Processed {len(docs)} documents\n")

for i, doc in enumerate(docs, 1):
    # doc.id is a generated identifier, not a file path
    label = doc.id if hasattr(doc, 'id') else f"Doc {i}"
    print(f"  {i}. {label}")
    print(f"     Content: {len(doc.content)} characters")
    print(f"     Preview: {doc.content[:60]}...")
    print()

📄 TextChef - Batch Processing Result:
  Processed 3 documents

  1. doc_7a8d09a84ccf45ca95d70855cd94d34a
     Content: 459 characters
     Preview: Machine Learning in Modern Applications

Machine learning ha...

  2. doc_db049c911f4d4c6aa283a9dd1695c521
     Content: 131 characters
     Preview: Quick Notes:
- Remember to test the new feature
- Update doc...

  3. doc_4034198e48fc44a6baf73174c1c293b2
     Content: 250 characters
     Preview: Data Science Pipeline

The modern data science pipeline cons...



## 3. TextChef - Integration with Chunkers

Use TextChef output with a chunker to create chunks.

In [7]:
from chonkie import RecursiveChunker

# Step 1: Load text file with TextChef
text_chef = TextChef()
doc = text_chef.process("./test_chef_files/article.txt")

# Step 2: Chunk the content
chunker = RecursiveChunker(chunk_size=100)
chunks = chunker.chunk(doc.content)

# Step 3: Store chunks in document
doc.chunks = chunks

print("⚡ TextChef + Chunker Integration:")
# doc.id is a generated identifier, not a file path
print(f"  Document: {doc.id if hasattr(doc, 'id') else 'N/A'}")
print(f"  Content: {len(doc.content)} characters")
print(f"  Chunks: {len(doc.chunks)}")
print(f"\n📝 Chunk Samples:")
for i, chunk in enumerate(doc.chunks[:3], 1):
    print(f"  Chunk {i}: {chunk.text[:60]}...")
print(f"  ...")

⚡ TextChef + Chunker Integration:
  Document: doc_763f0cf271d1499bb1b773e957a7c853
  Content: 459 characters
  Chunks: 7

📝 Chunk Samples:
  Chunk 1: Machine Learning in Modern Applications
...
  Chunk 2: 
Machine learning has revolutionized how we build software a...
  Chunk 3: From recommendation systems to natural language processing, ...
  ...
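
For intuition about why the chunks above break at blank lines and sentence ends, the following is a simplified recursive splitter. It is a sketch of the general idea only, not Chonkie's `RecursiveChunker`; the separator list and the greedy merge step are assumptions chosen for illustration.

```python
def merge_small(chunks, chunk_size):
    """Greedily join neighbouring chunks while the result stays within chunk_size."""
    merged = []
    for chunk in chunks:
        if merged and len(merged[-1]) + len(chunk) <= chunk_size:
            merged[-1] += chunk
        else:
            merged.append(chunk)
    return merged


def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep in text:
            parts = text.split(sep)
            # Keep each separator attached to the piece before it so nothing is lost.
            pieces = [part + sep for part in parts[:-1]] + [parts[-1]]
            chunks = []
            for piece in pieces:
                chunks.extend(recursive_split(piece, chunk_size, separators[i + 1:]))
            return merge_small(chunks, chunk_size)
    # No separator left: fall back to fixed-width slices.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


text = (
    "Machine Learning in Modern Applications\n\n"
    "Machine learning has revolutionized how we build software applications. "
    "From recommendation systems to natural language processing, ML models are everywhere."
)
chunks = recursive_split(text, chunk_size=100)
for i, chunk in enumerate(chunks, 1):
    print(i, len(chunk), repr(chunk[:40]))
```

Because separators stay attached to the preceding piece, joining the chunks reproduces the original text exactly, which makes the splitter easy to sanity-check.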


---

# Part 2: MarkdownChef

## 4. MarkdownChef - Basic Initialization

Initialize MarkdownChef with different tokenizer options.

In [8]:
# Option 1: Default initialization (character tokenizer)
md_chef_default = MarkdownChef()
print("📝 MarkdownChef Initialization Options:\n")
print(f"  1. Default: {md_chef_default}")

# Option 2: With specific tokenizer
md_chef_gpt2 = MarkdownChef(tokenizer="gpt2")
print(f"  2. GPT-2 Tokenizer: {md_chef_gpt2}")

# Option 3: With character tokenizer (explicit)
md_chef_char = MarkdownChef(tokenizer="character")
print(f"  3. Character Tokenizer: {md_chef_char}")

print("\n✅ All tokenizer options work!")

📝 MarkdownChef Initialization Options:

  1. Default: MarkdownChef()
  2. GPT-2 Tokenizer: MarkdownChef()
  3. Character Tokenizer: MarkdownChef()

✅ All tokenizer options work!


## 5. MarkdownChef - Single File Processing

Process a markdown file and extract all components.

In [9]:
md_chef = MarkdownChef()

# Process markdown file
doc = md_chef.process("./test_chef_files/tutorial.md")

print("📝 MarkdownChef - Single File Result:")
print(f"  Document ID: {doc.id}")
print(f"  Content Length: {len(doc.content)} characters")
print(f"\n📊 Extracted Components:")
print(f"  Tables: {len(doc.tables)}")
print(f"  Code Blocks: {len(doc.code)}")
print(f"  Images: {len(doc.images)}")
print(f"  Text Chunks: {len(doc.chunks)}")

# Show details of extracted components
if doc.tables:
    print(f"\n📋 Table Sample:")
    for i, table in enumerate(doc.tables, 1):
        print(f"    Table {i} (pos {table.start_index}-{table.end_index}):")
        print(f"    {table.content[:100]}...")

if doc.code:
    print(f"\n💻 Code Block Sample:")
    for i, code in enumerate(doc.code, 1):
        lang = code.language or "unknown"
        print(f"    Code {i} [{lang}] (pos {code.start_index}-{code.end_index}):")
        print(f"    {code.content[:80]}...")

if doc.images:
    print(f"\n🖼️ Image Sample:")
    for i, img in enumerate(doc.images, 1):
        print(f"    Image {i}: {img.alias}")
        print(f"    URL: {img.content[:50]}...")

📝 MarkdownChef - Single File Result:
  Document ID: doc_8f2a6fa84db44737a8123d667aaf5a10
  Content Length: 535 characters

📊 Extracted Components:
  Tables: 1
  Code Blocks: 1
  Images: 1
  Text Chunks: 4

📋 Table Sample:
    Table 1 (pos 241-413):
    | Type | Mutable | Example |
|------|---------|---------|
| List | Yes | [1, 2, 3] |
| Tuple | No | ...

💻 Code Block Sample:
    Code 1 [python] (pos 144-219):
    def hello_world():
    print("Hello, World!")
    return True...

🖼️ Image Sample:
    Image 1: Python Logo
    URL: https://python.org/logo.png...


## 6. MarkdownChef - Batch Processing

Process multiple markdown files simultaneously.

In [10]:
md_chef = MarkdownChef()

# Process multiple markdown files
md_files = [
    "./test_chef_files/tutorial.md",
    "./test_chef_files/readme.md"
]

docs = md_chef.process_batch(md_files)

print(f"📝 MarkdownChef - Batch Processing Result:")
print(f"  Processed {len(docs)} markdown documents\n")

for i, doc in enumerate(docs, 1):
    # doc.id is a generated identifier, not a file path
    label = doc.id if hasattr(doc, 'id') else f"Doc {i}"
    print(f"  {i}. {label}")
    print(f"     Content: {len(doc.content)} characters")
    print(f"     Tables: {len(doc.tables)} | Code: {len(doc.code)} | Images: {len(doc.images)} | Chunks: {len(doc.chunks)}")
    print()

📝 MarkdownChef - Batch Processing Result:
  Processed 2 markdown documents

  1. doc_5a9354b97c3c4bb999a97d9d30c6ce59
     Content: 535 characters
     Tables: 1 | Code: 1 | Images: 1 | Chunks: 4

  2. doc_76a92af1f7db4802bdb781a2978ee158
     Content: 234 characters
     Tables: 0 | Code: 2 | Images: 0 | Chunks: 2



## 7. MarkdownChef - Detailed Component Analysis

Explore the structure of extracted markdown components.

In [11]:
md_chef = MarkdownChef()
doc = md_chef.process("./test_chef_files/tutorial.md")

print("🔍 MarkdownDocument Structure Analysis:\n")

# Analyze Tables
print("📋 TABLES:")
for i, table in enumerate(doc.tables, 1):
    print(f"\n  Table {i}:")
    print(f"    Position: chars {table.start_index} to {table.end_index}")
    print(f"    Content:\n{table.content}")

# Analyze Code Blocks
print("\n💻 CODE BLOCKS:")
for i, code in enumerate(doc.code, 1):
    print(f"\n  Code Block {i}:")
    print(f"    Language: {code.language or 'Not specified'}")
    print(f"    Position: chars {code.start_index} to {code.end_index}")
    print(f"    Content Preview:")
    print(f"    {code.content[:100]}")

# Analyze Images
print("\n🖼️ IMAGES:")
for i, img in enumerate(doc.images, 1):
    print(f"\n  Image {i}:")
    print(f"    Alt Text: {img.alias}")
    print(f"    Source: {img.content}")
    print(f"    Position: chars {img.start_index} to {img.end_index}")
    if img.link:
        print(f"    Link: {img.link}")

# Analyze Text Chunks
print(f"\n📝 TEXT CHUNKS: {len(doc.chunks)} chunks")
if doc.chunks:
    print(f"  First chunk: {doc.chunks[0].text[:80]}...")
    print(f"  Last chunk: {doc.chunks[-1].text[:80]}...")

🔍 MarkdownDocument Structure Analysis:

📋 TABLES:

  Table 1:
    Position: chars 241 to 413
    Content:
| Type | Mutable | Example |
|------|---------|---------|
| List | Yes | [1, 2, 3] |
| Tuple | No | (1, 2, 3) |
| Dict | Yes | {"key": "value"} |
| Set | Yes | {1, 2, 3} |


💻 CODE BLOCKS:

  Code Block 1:
    Language: python
    Position: chars 144 to 219
    Content Preview:
    def hello_world():
    print("Hello, World!")
    return True

🖼️ IMAGES:

  Image 1:
    Alt Text: Python Logo
    Source: https://python.org/logo.png
    Position: chars 428 to 471

📝 TEXT CHUNKS: 4 chunks
  First chunk: # Python Programming Tutorial

## Introduction

Python is a versatile programmin...
  Last chunk: 

For more information, visit [Python.org](https://python.org).
...
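
The `start_index` and `end_index` values above read as character offsets into the document content. The sketch below shows the idea with a small regex-based finder: slicing the content with the reported offsets recovers the block. `find_code_blocks` is a hypothetical helper, not Chonkie's parser, and end-exclusive offsets are an assumption. The span 144 to 219 in the output is 75 characters, which is consistent with the Python block including its fences, so this sketch includes the fences as well.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence

CODE_BLOCK = re.compile(re.escape(FENCE) + r"(\w*)\n(.*?)" + re.escape(FENCE), re.DOTALL)


def find_code_blocks(content):
    """Return (language, start_index, end_index) for each fenced block in content."""
    return [
        (match.group(1) or None, match.start(), match.end())
        for match in CODE_BLOCK.finditer(content)
    ]


content = (
    "# Demo\n\n"
    + FENCE + "python\n"
    + "print('hi')\n"
    + FENCE + "\n\n"
    + "Closing text.\n"
)

for language, start, end in find_code_blocks(content):
    # Slicing with the reported offsets recovers the block from the full content.
    print(language, start, end)
    print(content[start:end])
```

Keeping offsets rather than copies lets downstream code relate each extracted component back to its place in the original text.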


---

# Part 3: TableChef

## 8. TableChef - Process CSV File

Extract table data from CSV files.

In [13]:
table_chef = TableChef()

# Process CSV file
doc = table_chef.process("./test_chef_files/products.csv")

print("📊 TableChef - CSV Processing Result:")
if doc and hasattr(doc, 'tables') and doc.tables:
    print(f"  Found {len(doc.tables)} table(s)\n")
    for i, table in enumerate(doc.tables, 1):
        print(f"  Table {i}:")
        print(f"    Content:\n{table.content}")
        print()
else:
    print("  No tables found")

📊 TableChef - CSV Processing Result:
  Found 1 table(s)

  Table 1:
    Content:
|   ProductID | Name     |   Price |   Stock |
|------------:|:---------|--------:|--------:|
|           1 | Laptop   |  999.99 |      15 |
|           2 | Mouse    |   29.99 |     150 |
|           3 | Keyboard |   79.99 |      75 |
|           4 | Monitor  |  299.99 |      30 |
|           5 | Webcam   |   89.99 |      50 |
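
To show what the CSV-to-table conversion amounts to, here is a standard-library sketch that renders CSV text as a pipe-delimited Markdown table. `csv_to_markdown` is a hypothetical helper; the output above has padded, aligned columns, which this simplified version does not reproduce.

```python
import csv
import io


def csv_to_markdown(csv_text):
    """Render CSV text as a simple pipe-delimited Markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, data = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in data]
    return "\n".join(lines)


csv_text = "ProductID,Name,Price\n1,Laptop,999.99\n2,Mouse,29.99\n"
print(csv_to_markdown(csv_text))
```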



## 9. TableChef - Process Excel File

Extract table data from Excel files.

In [14]:
table_chef = TableChef()

# Process Excel file
doc = table_chef.process("./test_chef_files/inventory.xlsx")

print("📊 TableChef - Excel Processing Result:")
if doc and hasattr(doc, 'tables') and doc.tables:
    print(f"  Found {len(doc.tables)} table(s)\n")
    for i, table in enumerate(doc.tables, 1):
        print(f"  Table {i}:")
        print(f"    Content:\n{table.content}")
        print()
else:
    print("  No tables found")

📊 TableChef - Excel Processing Result:
  Found 1 table(s)

  Table 1:
    Content:
| Item     |   Quantity | Location    |
|:---------|-----------:|:------------|
| Widget A |        100 | Warehouse 1 |
| Widget B |        250 | Warehouse 2 |
| Widget C |        175 | Warehouse 1 |



## 10. TableChef - Process Markdown String

Extract tables from markdown text (not just files).

In [15]:
table_chef = TableChef()

# Markdown string with table
markdown_text = """
# Sales Report

Here are the quarterly results:

| Quarter | Revenue | Profit |
|---------|---------|--------|
| Q1      | $100K   | $25K   |
| Q2      | $150K   | $40K   |
| Q3      | $180K   | $55K   |
| Q4      | $200K   | $70K   |

Great progress this year!
"""

# Process markdown string
doc = table_chef.process(markdown_text)

print("📊 TableChef - Markdown String Processing:")
if doc and hasattr(doc, 'tables') and doc.tables:
    print(f"  Found {len(doc.tables)} table(s)\n")
    for i, table in enumerate(doc.tables, 1):
        print(f"  Table {i}:")
        print(f"    Position: chars {table.start_index} to {table.end_index}")
        print(f"    Content:\n{table.content}")
        print()
else:
    print("  No tables found")

📊 TableChef - Markdown String Processing:
  Found 1 table(s)

  Table 1:
    Position: chars 50 to 236
    Content:
| Quarter | Revenue | Profit |
|---------|---------|--------|
| Q1      | $100K   | $25K   |
| Q2      | $150K   | $40K   |
| Q3      | $180K   | $55K   |
| Q4      | $200K   | $70K   |
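
Once a table's `content` string is available, it can be turned into row records with plain string handling. The sketch below is one way to do that; `split_row` and `parse_markdown_table` are hypothetical helpers, and the parsing assumes a well-formed table with a header line, a separator line, and no escaped pipes inside cells.

```python
def split_row(line):
    """Split one Markdown table line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(table_text):
    """Return (header, rows) where rows is a list of dicts keyed by header."""
    lines = [line for line in table_text.strip().splitlines() if line.strip()]
    header = split_row(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    rows = [dict(zip(header, split_row(line))) for line in lines[2:]]
    return header, rows


table_text = """
| Quarter | Revenue | Profit |
|---------|---------|--------|
| Q1      | $100K   | $25K   |
| Q2      | $150K   | $40K   |
"""

header, rows = parse_markdown_table(table_text)
print(header)
for row in rows:
    print(row["Quarter"], row["Revenue"], row["Profit"])
```

Note that only the lines after the separator are data rows; counting every line below the header would overstate the row count by one.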




## 11. TableChef - Batch Processing

Process multiple table sources at once.

In [16]:
table_chef = TableChef()

# Process multiple files
table_sources = [
    "./test_chef_files/products.csv",
    "./test_chef_files/sales.csv",
    "./test_chef_files/inventory.xlsx"
]

docs = table_chef.process_batch(table_sources)

print("📊 TableChef - Batch Processing Result:")
if docs:
    # Count total tables across all documents
    total_tables = sum(len(doc.tables) if hasattr(doc, 'tables') else 0 for doc in docs)
    print(f"  Processed {len(docs)} documents with {total_tables} total tables\n")
    
    table_num = 1
    for doc_idx, doc in enumerate(docs, 1):
        if hasattr(doc, 'tables') and doc.tables:
            for table in doc.tables:
                print(f"  Table {table_num} (from document {doc_idx}):")
                lines = table.content.split('\n')
                header = lines[0] if lines else "N/A"
                print(f"    Header: {header}")
                # Exclude the header and the |---| separator line from the count
                print(f"    Rows: {max(len(lines) - 2, 0)}")
                print()
                table_num += 1
else:
    print("  No documents processed")

📊 TableChef - Batch Processing Result:
  Processed 3 documents with 3 total tables

  Table 1 (from document 1):
    Header: |   ProductID | Name     |   Price |   Stock |
    Rows: 5

  Table 2 (from document 2):
    Header: | Date       | Region   |   Sales |   Units |
    Rows: 3

  Table 3 (from document 3):
    Header: | Item     |   Quantity | Location    |
    Rows: 3



---

# Part 4: Pipeline Integration

## 12. Pipeline - TextChef Integration

Use TextChef in a complete pipeline.

In [17]:
from chonkie.pipeline import Pipeline

# Pipeline: Fetch → Process with TextChef → Chunk
doc = (Pipeline()
    .fetch_from("file", path="./test_chef_files/article.txt")
    .process_with("text")
    .chunk_with("recursive", chunk_size=100)
    .run())

print("🔗 TextChef Pipeline Result:")
print(f"  Document: {Path(doc.source).name if hasattr(doc, 'source') else 'N/A'}")
print(f"  Content: {len(doc.content) if hasattr(doc, 'content') else 'N/A'} characters")
print(f"  Chunks: {len(doc.chunks)}")
print(f"\n📝 First 2 Chunks:")
for i, chunk in enumerate(doc.chunks[:2], 1):
    print(f"  {i}. {chunk.text[:70]}...")

🔗 TextChef Pipeline Result:
  Document: N/A
  Content: 459 characters
  Chunks: 7

📝 First 2 Chunks:
  1. Machine Learning in Modern Applications
...
  2. 
Machine learning has revolutionized how we build software application...


## 13. Pipeline - MarkdownChef Integration

Use MarkdownChef in a complete pipeline.

In [18]:
# Pipeline: Fetch → Process with MarkdownChef → Chunk
doc = (Pipeline()
    .fetch_from("file", path="./test_chef_files/tutorial.md")
    .process_with("markdown", tokenizer="character")
    .chunk_with("recursive", chunk_size=150)
    .run())

print("🔗 MarkdownChef Pipeline Result:")
print(f"  Document: {Path(doc.source).name if hasattr(doc, 'source') else 'N/A'}")
print(f"  Tables: {len(doc.tables)}")
print(f"  Code Blocks: {len(doc.code)}")
print(f"  Images: {len(doc.images)}")
print(f"  Text Chunks: {len(doc.chunks)}")

if doc.code:
    print(f"\n💻 Code Block Languages:")
    for i, code in enumerate(doc.code, 1):
        print(f"  {i}. {code.language or 'unknown'}")

🔗 MarkdownChef Pipeline Result:
  Document: N/A
  Tables: 1
  Code Blocks: 1
  Images: 1
  Text Chunks: 4

💻 Code Block Languages:
  1. python


## 14. Pipeline - Full Batch Integration

Combine fetching multiple valid files with specific processing.

In [19]:
# Pipeline: Fetch directory → Process with TextChef → Chunk
docs = (Pipeline()
    .fetch_from("file", dir="./test_chef_files", ext=[".txt"])
    .process_with("text")
    .chunk_with("recursive", chunk_size=80)
    .run())

print("🔗 Batch Pipeline Result:")
print(f"  Processed {len(docs)} documents\n")

for i, doc in enumerate(docs, 1):
    source = Path(doc.source).name if hasattr(doc, 'source') else f"Doc {i}"
    print(f"  {i}. {source}")
    print(f"     Chunks: {len(doc.chunks)}")
    if doc.chunks:
        print(f"     First chunk: {doc.chunks[0].text[:50]}...")
    print()

🔗 Batch Pipeline Result:
  Processed 3 documents

  1. Doc 1
     Chunks: 10
     First chunk: Machine Learning in Modern Applications
...

  2. Doc 2
     Chunks: 6
     First chunk: Data Science Pipeline

The modern data science pip...

  3. Doc 3
     Chunks: 3
     First chunk: Quick Notes:
- Remember to test the new feature
...
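
The three pipeline stages used above (fetch, process, chunk) can be pictured as plain functions composed in order. The sketch below uses only the standard library; `fetch`, `process`, `chunk`, and `run_pipeline` are hypothetical names, the fixed-width chunker is a deliberate simplification, and none of this reflects how Chonkie's `Pipeline` is implemented internally.

```python
import tempfile
from pathlib import Path


def fetch(directory, ext):
    """Return files in directory whose suffix is in ext, sorted by name."""
    return sorted(p for p in Path(directory).iterdir() if p.suffix in ext)


def process(path):
    """Read one file as UTF-8 text."""
    return path.read_text(encoding="utf-8")


def chunk(text, chunk_size):
    """Cut text into fixed-width slices (a stand-in for a real chunker)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def run_pipeline(directory, ext, chunk_size):
    """fetch -> process -> chunk, returning {filename: chunks}."""
    return {p.name: chunk(process(p), chunk_size) for p in fetch(directory, ext)}


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "notes.txt").write_text("a" * 200, encoding="utf-8")
    (Path(tmp) / "article.txt").write_text("b" * 90, encoding="utf-8")
    (Path(tmp) / "table.csv").write_text("x,y\n1,2\n", encoding="utf-8")
    result = run_pipeline(tmp, ext=[".txt"], chunk_size=80)

for name, chunks in result.items():
    print(name, len(chunks))
```

Filtering by extension at the fetch stage keeps files meant for a different Chef (here the CSV) out of the text-processing path.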



---

## Summary: All Chef Types and Capabilities

### Chef Comparison Table

| Chef | Input Types | Output | Special Features | Use Cases |
|------|------------|--------|------------------|-----------|
| **TextChef** | .txt files | `Document` | Simple, no params | Plain text processing |
| **MarkdownChef** | .md files | `MarkdownDocument` | Extracts tables, code, images | Rich markdown content |
| **TableChef** | .csv, .xlsx, md strings | Document with `tables` (`list[MarkdownTable]`) | Requires pandas | Data extraction |

### Methods Available

All chefs support:
- `process(path)` - Process single file/string
- `process_batch(paths)` - Process multiple files/strings
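
The shared two-method shape can be expressed as a small structural interface. The sketch below is hypothetical: `ChefLike` and `UppercaseChef` are illustrative names, not Chonkie classes, and the batch method simply maps the single-item method over its inputs, which matches the one-result-per-input behaviour seen in the batch outputs earlier but is not a statement about Chonkie's internals.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ChefLike(Protocol):
    """Hypothetical interface: the two methods this guide calls on every Chef."""

    def process(self, path): ...

    def process_batch(self, paths): ...


class UppercaseChef:
    """Toy chef that treats each input as a string and upper-cases it."""

    def process(self, path):
        return str(path).upper()

    def process_batch(self, paths):
        # One result per input, in input order.
        return [self.process(p) for p in paths]


chef = UppercaseChef()
print(isinstance(chef, ChefLike))
print(chef.process_batch(["a.txt", "b.txt"]))
```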

### Return Types

**Document** (TextChef):
```python
# Field summary (a sketch, not the library's class definition)
class Document:
    id: str
    content: str
    metadata: dict
    chunks: "list[Chunk]"  # added after chunking
```

**MarkdownDocument** (MarkdownChef):
```python
# Field summary (a sketch, not the library's class definition)
class MarkdownDocument:
    id: str
    content: str
    tables: "list[MarkdownTable]"
    code: "list[MarkdownCode]"
    images: "list[MarkdownImage]"
    chunks: "list[Chunk]"
    metadata: dict
```

**MarkdownTable** (TableChef):
```python
# Field summary (a sketch, not the library's class definition)
class MarkdownTable:
    content: str
    start_index: int
    end_index: int
```

### Best Practices

- ✅ **TextChef**: Use for simple text files, articles, notes
- ✅ **MarkdownChef**: Use for documentation, technical content with code samples
- ✅ **TableChef**: Use when you need to extract structured data from tables
- ✅ **Pipeline Integration**: Combine chefs with fetchers and chunkers for complete workflows
- ✅ **Batch Processing**: Process multiple files at once for efficiency

## Cleanup

Remove test files created for this demonstration.

In [20]:
# Clean up test files
import shutil

test_dir = Path("./test_chef_files")
if test_dir.exists():
    shutil.rmtree(test_dir)
    print("✅ Test files cleaned up successfully")
else:
    print("ℹ️ Test directory not found")

✅ Test files cleaned up successfully
