# Chonkie Porters - Complete Guide

This notebook demonstrates all Porter types in Chonkie: **JSONPorter** and **DatasetsPorter**.

## What are Porters?

Porters are exporters that save chunks to different formats for storage, sharing, or integration with other tools:

- **JSONPorter**: Exports chunks to JSON files for archiving and interoperability
- **DatasetsPorter**: Exports chunks to Hugging Face Dataset format for ML workflows

## Key Features:
- ✅ Export chunks to standardized formats
- ✅ Save to disk or keep in memory
- ✅ Preserve all chunk metadata
- ✅ Integration with Hugging Face ecosystem
- ✅ Easy data sharing and archiving

## Visual Overview

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#ff6b6b','primaryTextColor':'#fff','primaryBorderColor':'#c92a2a','lineColor':'#339af0','secondaryColor':'#51cf66','tertiaryColor':'#ffd43b','background':'#f8f9fa','mainBkg':'#e3fafc','secondBkg':'#fff3bf','tertiaryBkg':'#ffe3e3','textColor':'#212529','fontSize':'16px'}}}%%

graph TB
    Start([📦 Porters<br/>Export Chunks]):::startClass
    
    Start --> Input["🔧 Input: List of Chunks"]:::inputClass
    
    Input --> PorterChoice{Choose Porter Type}:::decisionClass
    
    PorterChoice -->|JSON Format| JSONPorter["📄 JSONPorter<br/>Export to JSON"]:::jsonClass
    PorterChoice -->|HF Dataset| DatasetsPorter["🤗 DatasetsPorter<br/>Export to Dataset"]:::datasetClass
    
    JSONPorter --> JSONConfig{Configuration}:::decisionClass
    DatasetsPorter --> DatasetConfig{Configuration}:::decisionClass
    
    JSONConfig -->|File Path| JSONPath["file='chunks.json'"]:::paramClass
    JSONConfig -->|Layout| JSONPretty["lines=True/False"]:::paramClass
    
    DatasetConfig -->|Save Option| SaveOption["save_to_disk=True/False"]:::paramClass
    DatasetConfig -->|Directory| DirPath["path='chunks'"]:::paramClass
    DatasetConfig -->|Extra Args| KwArgs["**kwargs for save_to_disk"]:::paramClass
    
    JSONPath --> JSONProcess["Export Process"]:::processClass
    JSONPretty --> JSONProcess
    
    SaveOption --> DatasetProcess["Export Process"]:::processClass
    DirPath --> DatasetProcess
    KwArgs --> DatasetProcess
    
    JSONProcess --> JSONOutput["📁 JSON File<br/>chunks.json"]:::outputClass
    DatasetProcess --> DatasetMemory["💾 Dataset Object<br/>In Memory"]:::outputClass
    DatasetProcess --> DatasetDisk["📁 Dataset on Disk<br/>Arrow format"]:::diskClass
    
    JSONOutput --> UseCases{Use Cases}:::decisionClass
    DatasetMemory --> UseCases
    DatasetDisk --> UseCases
    
    UseCases -->|Archive| Archive["📦 Data Archiving"]:::useClass
    UseCases -->|Share| Share["🔗 Data Sharing"]:::useClass
    UseCases -->|ML| Training["🧠 Model Training"]:::useClass
    UseCases -->|Integration| External["🔌 External Tools"]:::useClass
    
    classDef startClass fill:#4c6ef5,stroke:#364fc7,stroke-width:3px,color:#fff
    classDef inputClass fill:#7950f2,stroke:#5f3dc4,stroke-width:2px,color:#fff
    classDef decisionClass fill:#7950f2,stroke:#5f3dc4,stroke-width:2px,color:#fff
    classDef jsonClass fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    classDef datasetClass fill:#20c997,stroke:#087f5b,stroke-width:2px,color:#fff
    classDef paramClass fill:#748ffc,stroke:#4c6ef5,stroke-width:2px,color:#fff
    classDef processClass fill:#ffd43b,stroke:#fab005,stroke-width:2px,color:#333
    classDef outputClass fill:#51cf66,stroke:#37b24d,stroke-width:2px,color:#fff
    classDef diskClass fill:#69db7c,stroke:#40c057,stroke-width:2px,color:#fff
    classDef useClass fill:#ff922b,stroke:#e8590c,stroke-width:2px,color:#fff
```

## Setup - Create Sample Chunks

First, we'll create sample chunks to demonstrate each Porter.

In [1]:
from chonkie import Chunk

# Create sample chunks with various properties
sample_chunks = [
    Chunk(
        text="Machine learning is transforming industries worldwide.",
        start_index=0,
        end_index=56,
        token_count=8
    ),
    Chunk(
        text="Deep learning models can recognize complex patterns.",
        start_index=57,
        end_index=109,
        token_count=7
    ),
    Chunk(
        text="Natural language processing enables human-computer interaction.",
        start_index=110,
        end_index=174,
        token_count=8
    ),
    Chunk(
        text="Computer vision allows machines to understand visual data.",
        start_index=175,
        end_index=234,
        token_count=9
    )
]

print("✅ Sample chunks created:")
print(f"  Total chunks: {len(sample_chunks)}")
for i, chunk in enumerate(sample_chunks, 1):
    print(f"  {i}. {chunk.text[:50]}... ({chunk.token_count} tokens)")

✅ Sample chunks created:
  Total chunks: 4
  1. Machine learning is transforming industries worldw... (8 tokens)
  2. Deep learning models can recognize complex pattern... (7 tokens)
  3. Natural language processing enables human-computer... (8 tokens)
  4. Computer vision allows machines to understand visu... (9 tokens)


## Installation

Install Chonkie with optional dependencies for porters:

In [2]:
# Install chonkie with datasets support
# !pip install "chonkie[datasets]"

from chonkie import JSONPorter, DatasetsPorter

print("✅ All Porters imported successfully!")
print(f"  📄 JSONPorter: {JSONPorter}")
print(f"  🤗 DatasetsPorter: {DatasetsPorter}")

✅ All Porters imported successfully!
  📄 JSONPorter: <class 'chonkie.porters.json.JSONPorter'>
  🤗 DatasetsPorter: <class 'chonkie.porters.datasets.DatasetsPorter'>


---

# Part 1: JSONPorter

## 1. JSONPorter - Basic Export

Export chunks to a JSON file. Passing `lines=False` writes a single JSON array rather than the default JSON Lines layout.

In [6]:
import os
from pathlib import Path

# Initialize JSONPorter with lines=False for JSON array format
json_porter = JSONPorter(lines=False)

# Export chunks to JSON file
output_path = "./exported_chunks.json"
json_porter.export(sample_chunks, file=output_path)

print("📄 JSONPorter - Basic Export:\n")
print(f"✅ Chunks exported to: {output_path}")
print(f"  File exists: {os.path.exists(output_path)}")
print(f"  File size: {os.path.getsize(output_path)} bytes")
print(f"  Format: JSON array (lines=False)")

📄 JSONPorter - Basic Export:

✅ Chunks exported to: ./exported_chunks.json
  File exists: True
  File size: 1138 bytes
  Format: JSON array (lines=False)


## 2. JSONPorter - View Exported Content

Read and display the exported JSON file.

In [7]:
import json

# Read the exported JSON file
with open("./exported_chunks.json", "r", encoding="utf-8") as f:
    exported_data = json.load(f)

print("📄 Exported JSON Content:\n")
print(f"Number of chunks: {len(exported_data)}\n")

# Display first chunk structure
print("First chunk structure:")
print(json.dumps(exported_data[0], indent=2))

# Display summary
print(f"\n📊 Summary:")
for i, chunk_data in enumerate(exported_data, 1):
    print(f"  Chunk {i}: {chunk_data['text'][:40]}...")
    print(f"    Tokens: {chunk_data['token_count']}")
    print(f"    Range: [{chunk_data['start_index']}, {chunk_data['end_index']}]\n")

📄 Exported JSON Content:

Number of chunks: 4

First chunk structure:
{
  "id": "chnk_e59cdda4ba38472b894ace02436c026c",
  "text": "Machine learning is transforming industries worldwide.",
  "start_index": 0,
  "end_index": 56,
  "token_count": 8,
  "context": null,
  "embedding": null
}

📊 Summary:
  Chunk 1: Machine learning is transforming industr...
    Tokens: 8
    Range: [0, 56]

  Chunk 2: Deep learning models can recognize compl...
    Tokens: 7
    Range: [57, 109]

  Chunk 3: Natural language processing enables huma...
    Tokens: 8
    Range: [110, 174]

  Chunk 4: Computer vision allows machines to under...
    Tokens: 9
    Range: [175, 234]
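
Once the array is loaded back, a quick consistency check on the index ranges can catch export problems early. The sketch below is a hypothetical helper (`check_ranges` is not part of Chonkie). It assumes records carry the `start_index` and `end_index` keys shown above and that chunks are meant to be non-overlapping; chunks produced with deliberate overlap would be flagged.

```python
def check_ranges(records):
    """Return a list of problems found in exported chunk index ranges."""
    problems = []
    previous_end = None
    for i, record in enumerate(records):
        start, end = record["start_index"], record["end_index"]
        # A range must not end before it starts.
        if start > end:
            problems.append(f"record {i}: start_index {start} > end_index {end}")
        # Assumption: consecutive chunks should not overlap.
        if previous_end is not None and start < previous_end:
            problems.append(f"record {i}: overlaps previous range ending at {previous_end}")
        previous_end = end
    return problems


# Index ranges copied from the summary output above.
exported = [
    {"start_index": 0, "end_index": 56},
    {"start_index": 57, "end_index": 109},
    {"start_index": 110, "end_index": 174},
    {"start_index": 175, "end_index": 234},
]
print(check_ranges(exported) or "ranges look consistent")
```

An empty list means no problems were found; each entry otherwise names the offending record.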



## 3. JSONPorter - Using as Callable

Use the porter as a callable (shorthand syntax).

In [9]:
# Create new chunks for this example
new_chunks = [
    Chunk(
        text="Python is excellent for data science.",
        start_index=0,
        end_index=38,
        token_count=6
    ),
    Chunk(
        text="JavaScript powers modern web applications.",
        start_index=39,
        end_index=82,
        token_count=5
    )
]

# Use porter as callable with lines=False for JSON array format
json_porter = JSONPorter(lines=False)
json_porter(new_chunks, file="./callable_export.json")

print("📄 JSONPorter - Callable Usage:\n")
print(f"✅ Exported using callable syntax")
print(f"  File: ./callable_export.json")
print(f"  Chunks exported: {len(new_chunks)}")
print(f"  Format: JSON array (lines=False)")

# Verify content
with open("./callable_export.json", "r") as f:
    data = json.load(f)
    print(f"\n📊 Exported chunks:")
    for i, chunk in enumerate(data, 1):
        print(f"  {i}. {chunk['text']}")

📄 JSONPorter - Callable Usage:

✅ Exported using callable syntax
  File: ./callable_export.json
  Chunks exported: 2
  Format: JSON array (lines=False)

📊 Exported chunks:
  1. Python is excellent for data science.
  2. JavaScript powers modern web applications.


## 4. JSONPorter - With Metadata

Export chunks with custom metadata.

In [11]:
# Create chunks with metadata
chunk1 = Chunk(
    text="Artificial intelligence is revolutionizing technology.",
    start_index=0,
    end_index=55,
    token_count=7
)
chunk1.metadata = {"source": "document1.txt", "category": "AI", "author": "John Doe"}

chunk2 = Chunk(
    text="Machine learning algorithms learn from data patterns.",
    start_index=56,
    end_index=110,
    token_count=8
)
chunk2.metadata = {"source": "document1.txt", "category": "ML", "author": "John Doe"}

chunk3 = Chunk(
    text="Neural networks mimic human brain structure.",
    start_index=111,
    end_index=156,
    token_count=6
)
chunk3.metadata = {"source": "document2.txt", "category": "Deep Learning", "author": "Jane Smith"}

chunks_with_metadata = [chunk1, chunk2, chunk3]

# Export with metadata (use lines=False for JSON array format)
json_porter = JSONPorter(lines=False)
json_porter.export(chunks_with_metadata, file="./chunks_with_metadata.json")

print("📄 JSONPorter - Export with Metadata:\n")
print(f"✅ Exported {len(chunks_with_metadata)} chunks with metadata\n")

# Display exported content
with open("./chunks_with_metadata.json", "r") as f:
    data = json.load(f)
    print("Sample chunk with metadata:")
    print(json.dumps(data[0], indent=2))
    
    print(f"\n📊 All chunks:")
    for i, chunk in enumerate(data, 1):
        print(f"  {i}. [{chunk['metadata']['category']}] {chunk['text'][:40]}...")
        print(f"     Source: {chunk['metadata']['source']}, Author: {chunk['metadata']['author']}")

📄 JSONPorter - Export with Metadata:

✅ Exported 3 chunks with metadata

Sample chunk with metadata:
{
  "id": "chnk_e69a4dab00b94d1e9a48f2a16e6b68e6",
  "text": "Artificial intelligence is revolutionizing technology.",
  "start_index": 0,
  "end_index": 55,
  "token_count": 7,
  "context": null,
  "embedding": null,
  "metadata": {
    "source": "document1.txt",
    "category": "AI",
    "author": "John Doe"
  }
}

📊 All chunks:
  1. [AI] Artificial intelligence is revolutionizi...
     Source: document1.txt, Author: John Doe
  2. [ML] Machine learning algorithms learn from d...
     Source: document1.txt, Author: John Doe
  3. [Deep Learning] Neural networks mimic human brain struct...
     Source: document2.txt, Author: Jane Smith
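
Exported metadata is plain JSON, so downstream tools can group or filter on it without Chonkie installed. A minimal sketch, assuming records shaped like the output above; `group_by_metadata` and the fourth record are hypothetical and only illustrate handling chunks that were exported without a `metadata` key.

```python
from collections import defaultdict

# Hypothetical records shaped like the exported JSON array (fields trimmed for brevity).
records = [
    {"text": "Artificial intelligence is revolutionizing technology.",
     "metadata": {"source": "document1.txt", "category": "AI"}},
    {"text": "Machine learning algorithms learn from data patterns.",
     "metadata": {"source": "document1.txt", "category": "ML"}},
    {"text": "Neural networks mimic human brain structure.",
     "metadata": {"source": "document2.txt", "category": "Deep Learning"}},
    {"text": "A record exported without metadata."},
]


def group_by_metadata(records, key, default="unknown"):
    """Group records by one metadata field, tolerating missing metadata."""
    groups = defaultdict(list)
    for record in records:
        # `or {}` covers both a missing key and an explicit null.
        value = (record.get("metadata") or {}).get(key, default)
        groups[value].append(record)
    return dict(groups)


by_source = group_by_metadata(records, "source")
for source, items in by_source.items():
    print(f"{source}: {len(items)} chunk(s)")
```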


## 5. JSONPorter - Real Chunking Example

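Create chunks from actual text and export them. In the run shown here every chunk is exactly 30 characters long and many break mid-word, which suggests that the default tokenizer in this version counts characters rather than subword tokens.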

In [13]:
from chonkie import TokenChunker

# Sample text
sample_text = """The field of artificial intelligence has grown exponentially over the past decade. 
Machine learning algorithms now power everything from recommendation systems to autonomous vehicles. 
Deep learning, a subset of machine learning, uses neural networks with multiple layers to process complex data. 
Natural language processing enables computers to understand and generate human language. 
Computer vision allows machines to interpret and analyze visual information from the world."""

# Chunk the text
chunker = TokenChunker(chunk_size=30)
chunks = chunker(sample_text)

print(f"📝 Created {len(chunks)} chunks from text\n")

# Export to JSON (use lines=False for JSON array format)
json_porter = JSONPorter(lines=False)
json_porter.export(chunks, file="./real_chunks.json")

print(f"✅ Exported to: ./real_chunks.json")
print(f"  Format: JSON array (lines=False)\n")

# Verify export
with open("./real_chunks.json", "r") as f:
    data = json.load(f)
    print(f"📊 Exported {len(data)} chunks:")
    for i, chunk in enumerate(data, 1):
        print(f"  {i}. {chunk['text'][:60]}... ({chunk['token_count']} tokens)")

📝 Created 16 chunks from text

✅ Exported to: ./real_chunks.json
  Format: JSON array (lines=False)

📊 Exported 16 chunks:
  1. The field of artificial intell... (30 tokens)
  2. igence has grown exponentially... (30 tokens)
  3.  over the past decade. 
Machin... (30 tokens)
  4. e learning algorithms now powe... (30 tokens)
  5. r everything from recommendati... (30 tokens)
  6. on systems to autonomous vehic... (30 tokens)
  7. les. 
Deep learning, a subset ... (30 tokens)
  8. of machine learning, uses neur... (30 tokens)
  9. al networks with multiple laye... (30 tokens)
  10. rs to process complex data. 
N... (30 tokens)
  11. atural language processing ena... (30 tokens)
  12. bles computers to understand a... (30 tokens)
  13. nd generate human language. 
C... (30 tokens)
  14. omputer vision allows machines... (30 tokens)
  15.  to interpret and analyze visu... (30 tokens)
  16. al information from the world.... (30 tokens)
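
Every chunk in the output above is exactly 30 characters long, which is what fixed-size splitting over a character-level tokenizer would produce. The sketch below imitates that behavior with plain Python to make the index arithmetic visible. It is an illustration only, not Chonkie's implementation; `SimpleChunk` and `chunk_by_characters` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk record (not chonkie.Chunk)."""
    text: str
    start_index: int
    end_index: int
    token_count: int


def chunk_by_characters(text: str, chunk_size: int) -> list[SimpleChunk]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for start in range(0, len(text), chunk_size):
        piece = text[start:start + chunk_size]
        # With one "token" per character, token_count equals the piece length.
        chunks.append(SimpleChunk(piece, start, start + len(piece), len(piece)))
    return chunks


demo_text = "The field of artificial intelligence has grown exponentially."
demo_chunks = chunk_by_characters(demo_text, chunk_size=30)
for c in demo_chunks:
    print(f"[{c.start_index:>2}, {c.end_index:>2}) {c.text!r}")
```

Because the split ignores word boundaries, the first piece ends in the middle of "intelligence", just as in the exported chunks above.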


---

# Part 2: DatasetsPorter

## 6. DatasetsPorter - Basic Initialization

Initialize DatasetsPorter and understand its capabilities.

In [14]:
from chonkie import DatasetsPorter

# Initialize DatasetsPorter
datasets_porter = DatasetsPorter()

print("🤗 DatasetsPorter Initialization:\n")
print(f"  Porter: {datasets_porter}")
print(f"  Type: {type(datasets_porter)}")
print("\n✅ DatasetsPorter ready!")
print("\n📋 Features:")
print("  • Export to Hugging Face Dataset format")
print("  • Keep dataset in memory or save to disk")
print("  • Compatible with Hugging Face ecosystem")
print("  • Efficient Arrow format for large datasets")

🤗 DatasetsPorter Initialization:

  Porter: <chonkie.porters.datasets.DatasetsPorter object at 0x0000021EBBF1F3B0>
  Type: <class 'chonkie.porters.datasets.DatasetsPorter'>

✅ DatasetsPorter ready!

📋 Features:
  • Export to Hugging Face Dataset format
  • Keep dataset in memory or save to disk
  • Compatible with Hugging Face ecosystem
  • Efficient Arrow format for large datasets


## 7. DatasetsPorter - Return Dataset Object

Export chunks to an in-memory Dataset object. Because `save_to_disk` defaults to `True`, pass `save_to_disk=False` to skip writing a copy to disk.

In [15]:
# Export to Dataset (in memory only; the default save_to_disk=True would also write to ./chunks)
datasets_porter = DatasetsPorter()
dataset = datasets_porter.export(sample_chunks, save_to_disk=False)

print("🤗 DatasetsPorter - In-Memory Dataset:\n")
print(f"Dataset type: {type(dataset)}")
print(f"\nDataset info:")
print(dataset)

print(f"\n📊 Dataset features:")
for feature_name, feature_type in dataset.features.items():
    print(f"  • {feature_name}: {feature_type}")

print(f"\n📋 Number of rows: {len(dataset)}")

# Access individual examples
print(f"\n🔍 First example:")
first_example = dataset[0]
for key, value in first_example.items():
    if key == 'text':
        print(f"  {key}: {value[:50]}...")
    else:
        print(f"  {key}: {value}")

🤗 DatasetsPorter - In-Memory Dataset:

Dataset type: <class 'datasets.arrow_dataset.Dataset'>

Dataset info:
Dataset({
    features: ['id', 'text', 'start_index', 'end_index', 'token_count', 'context', 'embedding'],
    num_rows: 4
})

📊 Dataset features:
  • id: Value('string')
  • text: Value('string')
  • start_index: Value('int64')
  • end_index: Value('int64')
  • token_count: Value('int64')
  • context: Value('null')
  • embedding: Value('null')

📋 Number of rows: 4

🔍 First example:
  id: chnk_e59cdda4ba38472b894ace02436c026c
  text: Machine learning is transforming industries worldw...
  start_index: 0
  end_index: 56
  token_count: 8
  context: None
  embedding: None


## 8. DatasetsPorter - Save to Disk

Save the dataset to disk for persistence.

In [16]:
import shutil

# Clean up if directory exists
dataset_dir = "./my_exported_chunks"
if os.path.exists(dataset_dir):
    shutil.rmtree(dataset_dir)

# Export and save to disk
datasets_porter = DatasetsPorter()
dataset = datasets_porter.export(
    sample_chunks,
    save_to_disk=True,
    path=dataset_dir
)

print("🤗 DatasetsPorter - Save to Disk:\n")
print(f"✅ Dataset saved to: {dataset_dir}")
print(f"  Directory exists: {os.path.exists(dataset_dir)}")

# List files in directory
if os.path.exists(dataset_dir):
    print(f"\n📁 Files in directory:")
    for file in os.listdir(dataset_dir):
        file_path = os.path.join(dataset_dir, file)
        size = os.path.getsize(file_path) if os.path.isfile(file_path) else 0
        print(f"  • {file} ({size} bytes)")

print(f"\n💾 Dataset object still returned:")
print(f"  Type: {type(dataset)}")
print(f"  Rows: {len(dataset)}")

Saving the dataset (0/1 shards):   0%|          | 0/4 [00:00<?, ? examples/s]

🤗 DatasetsPorter - Save to Disk:

✅ Dataset saved to: ./my_exported_chunks
  Directory exists: True

📁 Files in directory:
  • data-00000-of-00001.arrow (1792 bytes)
  • dataset_info.json (632 bytes)
  • state.json (259 bytes)

💾 Dataset object still returned:
  Type: <class 'datasets.arrow_dataset.Dataset'>
  Rows: 4
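
The listing above reports sizes for top-level files only and shows 0 for anything that is not a regular file. If a saved dataset ever contains subdirectories, a recursive total may be more useful. A minimal sketch using only the standard library; `directory_size` is a hypothetical helper, and the temporary files merely stand in for a dataset directory.

```python
import tempfile
from pathlib import Path


def directory_size(path) -> int:
    """Return the total size in bytes of all regular files under path."""
    root = Path(path)
    # rglob("*") walks the tree recursively; directories themselves are skipped.
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# Build a small throwaway tree so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.bin").write_bytes(b"x" * 10)
    (root / "nested" / "b.bin").write_bytes(b"y" * 5)
    total = directory_size(root)
    print(f"Total size: {total} bytes")
```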


## 9. DatasetsPorter - Load from Disk

Load a previously saved dataset from disk.

In [17]:
from datasets import load_from_disk

# Load the dataset from disk
loaded_dataset = load_from_disk("./my_exported_chunks")

print("🤗 DatasetsPorter - Load from Disk:\n")
print(f"✅ Dataset loaded from: ./my_exported_chunks")
print(f"\nLoaded dataset:")
print(loaded_dataset)

print(f"\n📊 Dataset contents:")
for i in range(len(loaded_dataset)):
    example = loaded_dataset[i]
    print(f"  {i+1}. {example['text'][:50]}... ({example['token_count']} tokens)")

print(f"\n✅ Dataset loaded successfully and ready for use!")

🤗 DatasetsPorter - Load from Disk:

✅ Dataset loaded from: ./my_exported_chunks

Loaded dataset:
Dataset({
    features: ['id', 'text', 'start_index', 'end_index', 'token_count', 'context', 'embedding'],
    num_rows: 4
})

📊 Dataset contents:
  1. Machine learning is transforming industries worldw... (8 tokens)
  2. Deep learning models can recognize complex pattern... (7 tokens)
  3. Natural language processing enables human-computer... (8 tokens)
  4. Computer vision allows machines to understand visu... (9 tokens)

✅ Dataset loaded successfully and ready for use!


## 10. DatasetsPorter - Using as Callable

Use the porter as a callable for convenience.

In [18]:
# Create test chunks
test_chunks = [
    Chunk(text="First test chunk.", start_index=0, end_index=17, token_count=3),
    Chunk(text="Second test chunk.", start_index=18, end_index=36, token_count=3),
    Chunk(text="Third test chunk.", start_index=37, end_index=54, token_count=3)
]

# Use as callable - in memory (save_to_disk=False skips the default write to ./chunks)
datasets_porter = DatasetsPorter()
dataset_memory = datasets_porter(test_chunks, save_to_disk=False)

print("🤗 DatasetsPorter - Callable Usage:\n")
print("Option 1: In-Memory")
print(f"  Dataset: {dataset_memory}")
print(f"  Rows: {len(dataset_memory)}\n")

# Use as callable - save to disk
dataset_dir = "./callable_dataset"
if os.path.exists(dataset_dir):
    shutil.rmtree(dataset_dir)

dataset_disk = datasets_porter(test_chunks, save_to_disk=True, path=dataset_dir)

print("Option 2: Saved to Disk")
print(f"  Path: {dataset_dir}")
print(f"  Exists: {os.path.exists(dataset_dir)}")
print(f"  Dataset object returned: {type(dataset_disk)}")

🤗 DatasetsPorter - Callable Usage:

Option 1: In-Memory
  Dataset: Dataset({
    features: ['id', 'text', 'start_index', 'end_index', 'token_count', 'context', 'embedding'],
    num_rows: 3
})
  Rows: 3



Saving the dataset (0/1 shards):   0%|          | 0/3 [00:00<?, ? examples/s]

Option 2: Saved to Disk
  Path: ./callable_dataset
  Exists: True
  Dataset object returned: <class 'datasets.arrow_dataset.Dataset'>


## 11. DatasetsPorter - Working with Dataset

Perform common operations on the exported dataset.

In [19]:
# Export chunks to an in-memory dataset (save_to_disk=False skips the default write to disk)
datasets_porter = DatasetsPorter()
dataset = datasets_porter(sample_chunks, save_to_disk=False)

print("🤗 Working with Datasets:\n")

# 1. Filtering
print("1️⃣ Filter chunks with more than 7 tokens:")
filtered = dataset.filter(lambda x: x['token_count'] > 7)
print(f"  Original: {len(dataset)} chunks")
print(f"  Filtered: {len(filtered)} chunks\n")

# 2. Mapping
print("2️⃣ Add uppercase text field:")
def add_uppercase(example):
    example['text_upper'] = example['text'].upper()
    return example

mapped = dataset.map(add_uppercase)
print(f"  New features: {list(mapped.features.keys())}")
print(f"  Sample: {mapped[0]['text_upper'][:50]}...\n")

# 3. Selecting columns
print("3️⃣ Select specific columns:")
selected = dataset.select_columns(['text', 'token_count'])
print(f"  Selected features: {list(selected.features.keys())}\n")

# 4. Sorting
print("4️⃣ Sort by token count:")
sorted_dataset = dataset.sort('token_count')
for i in range(len(sorted_dataset)):
    print(f"  {i+1}. Tokens: {sorted_dataset[i]['token_count']} - {sorted_dataset[i]['text'][:40]}...")

🤗 Working with Datasets:

1️⃣ Filter chunks with more than 7 tokens:


Filter:   0%|          | 0/4 [00:00<?, ? examples/s]

  Original: 4 chunks
  Filtered: 3 chunks

2️⃣ Add uppercase text field:


Map:   0%|          | 0/4 [00:00<?, ? examples/s]

  New features: ['id', 'text', 'start_index', 'end_index', 'token_count', 'context', 'embedding', 'text_upper']
  Sample: MACHINE LEARNING IS TRANSFORMING INDUSTRIES WORLDW...

3️⃣ Select specific columns:
  Selected features: ['text', 'token_count']

4️⃣ Sort by token count:
  1. Tokens: 7 - Deep learning models can recognize compl...
  2. Tokens: 8 - Machine learning is transforming industr...
  3. Tokens: 8 - Natural language processing enables huma...
  4. Tokens: 9 - Computer vision allows machines to under...


## 12. DatasetsPorter - Large Scale Export

Export a larger number of chunks efficiently.

In [20]:
from chonkie import TokenChunker

# Create a longer text
long_text = """Artificial intelligence represents one of the most significant technological advances of our time. 
Machine learning algorithms enable computers to learn from data without explicit programming. 
Deep learning, using neural networks with multiple layers, has revolutionized fields like computer vision and natural language processing. 
Convolutional neural networks excel at image recognition tasks. 
Recurrent neural networks are designed for sequential data like text and time series. 
Transformer architectures have become the foundation of modern language models. 
Applications of AI span healthcare, finance, transportation, and entertainment. 
Ethical considerations around AI include bias, privacy, and accountability. 
The future of AI promises even more transformative capabilities.

Natural language processing enables computers to understand human language. 
Named entity recognition identifies important entities in text. 
Sentiment analysis determines emotional tone. 
Machine translation breaks down language barriers. 
Question answering systems provide direct answers to queries. 
Text summarization condenses long documents. 
Speech recognition converts audio to text. 
Text generation creates human-like content. 
These technologies power virtual assistants, search engines, and chatbots."""

# Chunk the text
chunker = TokenChunker(chunk_size=40)
large_chunks = chunker(long_text)

print(f"📝 Created {len(large_chunks)} chunks from long text\n")

# Export to dataset
datasets_porter = DatasetsPorter()
large_dataset = datasets_porter(
    large_chunks,
    save_to_disk=True,
    path="./large_dataset"
)

print("🤗 Large Scale Export:\n")
print(f"✅ Exported {len(large_dataset)} chunks")
print(f"  Dataset: {large_dataset}")
print(f"\n📊 Statistics:")
print(f"  Total chunks: {len(large_dataset)}")
print(f"  Total tokens: {sum(large_dataset['token_count'])}")
print(f"  Avg tokens per chunk: {sum(large_dataset['token_count']) / len(large_dataset):.1f}")
print(f"  Min tokens: {min(large_dataset['token_count'])}")
print(f"  Max tokens: {max(large_dataset['token_count'])}")

📝 Created 33 chunks from long text



Saving the dataset (0/1 shards):   0%|          | 0/33 [00:00<?, ? examples/s]

🤗 Large Scale Export:

✅ Exported 33 chunks
  Dataset: Dataset({
    features: ['id', 'text', 'start_index', 'end_index', 'token_count', 'context', 'embedding'],
    num_rows: 33
})

📊 Statistics:
  Total chunks: 33
  Total tokens: 1305
  Avg tokens per chunk: 39.5
  Min tokens: 25
  Max tokens: 40


---

# Part 3: Comparing Porters

## 13. Side-by-Side Comparison

Compare JSONPorter and DatasetsPorter outputs.

In [21]:
# Create test chunks
comparison_chunks = [
    Chunk(text="AI is transforming technology.", start_index=0, end_index=30, token_count=5),
    Chunk(text="ML algorithms learn from data.", start_index=31, end_index=61, token_count=5),
    Chunk(text="DL uses neural networks.", start_index=62, end_index=86, token_count=4)
]

print("⚖️ Porter Comparison:\n")
print("="*60)

# Export with JSONPorter
print("\n📄 JSONPorter:")
json_porter = JSONPorter()  # default lines=True, so this file holds JSON Lines despite its .json extension
json_porter(comparison_chunks, file="./comparison.json")
json_size = os.path.getsize("./comparison.json")
print(f"  ✅ Exported to: comparison.json")
print(f"  📏 File size: {json_size} bytes")
print(f"  📋 Format: JSON (human-readable)")
print(f"  🔧 Use case: Archiving, sharing, interoperability")

# Export with DatasetsPorter
print("\n🤗 DatasetsPorter:")
datasets_porter = DatasetsPorter()
dataset = datasets_porter(comparison_chunks, save_to_disk=True, path="./comparison_dataset")

# Calculate directory size
dataset_size = sum(
    os.path.getsize(os.path.join("./comparison_dataset", f))
    for f in os.listdir("./comparison_dataset")
    if os.path.isfile(os.path.join("./comparison_dataset", f))
)
print(f"  ✅ Exported to: comparison_dataset/")
print(f"  📏 Total size: {dataset_size} bytes")
print(f"  📋 Format: Arrow (binary, efficient)")
print(f"  🔧 Use case: ML training, HF ecosystem, large datasets")

print("\n" + "="*60)
print("\n💡 Choose JSONPorter for simplicity and portability")
print("💡 Choose DatasetsPorter for ML workflows and efficiency")

⚖️ Porter Comparison:


📄 JSONPorter:
  ✅ Exported to: comparison.json
  📏 File size: 536 bytes
  📋 Format: JSON (human-readable)
  🔧 Use case: Archiving, sharing, interoperability

🤗 DatasetsPorter:


Saving the dataset (0/1 shards):   0%|          | 0/3 [00:00<?, ? examples/s]

  ✅ Exported to: comparison_dataset/
  📏 Total size: 2459 bytes
  📋 Format: Arrow (binary, efficient)
  🔧 Use case: ML training, HF ecosystem, large datasets


💡 Choose JSONPorter for simplicity and portability
💡 Choose DatasetsPorter for ML workflows and efficiency


## 14. Complete Workflow: Chunk → Refine → Export

Full pipeline from text to exported chunks.

In [22]:
from chonkie import TokenChunker, OverlapRefinery, JSONPorter, DatasetsPorter

def complete_pipeline(text, export_format="both"):
    """Complete pipeline: Chunk → Refine → Export"""
    print("🚀 Complete Pipeline\n")
    print(f"Input text length: {len(text)} characters\n")
    
    # Step 1: Chunk
    print("📝 Step 1: Chunking...")
    chunker = TokenChunker(chunk_size=50)
    chunks = chunker(text)
    print(f"  Created {len(chunks)} chunks\n")
    
    # Step 2: Refine with overlap
    print("📊 Step 2: Adding overlap context...")
    refinery = OverlapRefinery(context_size=0.3, method="suffix", merge=True)
    refined_chunks = refinery(chunks)
    print(f"  Added overlap to {len(refined_chunks)} chunks\n")
    
    # Step 3: Export
    print("📦 Step 3: Exporting...")
    
    if export_format in ["json", "both"]:
        json_porter = JSONPorter()  # default lines=True writes JSON Lines
        json_porter(refined_chunks, file="./pipeline_output.json")
        print(f"  ✅ JSON: Exported to pipeline_output.json")
    
    if export_format in ["dataset", "both"]:
        datasets_porter = DatasetsPorter()
        dataset = datasets_porter(refined_chunks, save_to_disk=True, path="./pipeline_dataset")
        print(f"  ✅ Dataset: Exported to pipeline_dataset/")
        print(f"     Features: {list(dataset.features.keys())}")
    
    print("\n✨ Pipeline complete!")
    return refined_chunks

# Run pipeline
sample_text = """Machine learning is a subset of artificial intelligence that enables systems to learn from data. 
It includes supervised learning, unsupervised learning, and reinforcement learning. 
Applications range from image recognition to natural language processing. 
The field continues to evolve with new algorithms and techniques."""

result = complete_pipeline(sample_text, export_format="both")

print(f"\n📊 Final Output:")
print(f"  Processed chunks: {len(result)}")
print(f"  Files created:")
print(f"    • pipeline_output.json")
print(f"    • pipeline_dataset/")

🚀 Complete Pipeline

Input text length: 323 characters

📝 Step 1: Chunking...
  Created 7 chunks

📊 Step 2: Adding overlap context...
  Added overlap to 7 chunks

📦 Step 3: Exporting...
  ✅ JSON: Exported to pipeline_output.json


Saving the dataset (0/1 shards):   0%|          | 0/7 [00:00<?, ? examples/s]

  ✅ Dataset: Exported to pipeline_dataset/
     Features: ['id', 'text', 'start_index', 'end_index', 'token_count', 'context', 'embedding']

✨ Pipeline complete!

📊 Final Output:
  Processed chunks: 7
  Files created:
    • pipeline_output.json
    • pipeline_dataset/


## Cleanup

Remove exported files created during demonstration.

In [23]:
import shutil
import os

# List of files and directories to clean up
cleanup_items = [
    "./exported_chunks.json",
    "./callable_export.json",
    "./chunks_with_metadata.json",
    "./real_chunks.json",
    "./my_exported_chunks",
    "./callable_dataset",
    "./large_dataset",
    "./comparison.json",
    "./comparison_dataset",
    "./pipeline_output.json",
    "./pipeline_dataset"
]

print("🧹 Cleaning up exported files...\n")

for item in cleanup_items:
    try:
        if os.path.isfile(item):
            os.remove(item)
            print(f"  ✅ Deleted file: {item}")
        elif os.path.isdir(item):
            shutil.rmtree(item)
            print(f"  ✅ Deleted directory: {item}")
    except FileNotFoundError:
        print(f"  ℹ️ Not found: {item}")
    except Exception as e:
        print(f"  ❌ Error deleting {item}: {e}")

print("\n✅ Cleanup complete!")

🧹 Cleaning up exported files...

  ✅ Deleted file: ./exported_chunks.json
  ✅ Deleted file: ./callable_export.json
  ✅ Deleted file: ./chunks_with_metadata.json
  ✅ Deleted file: ./real_chunks.json
  ✅ Deleted directory: ./my_exported_chunks
  ✅ Deleted directory: ./callable_dataset
  ✅ Deleted directory: ./large_dataset
  ✅ Deleted file: ./comparison.json
  ✅ Deleted directory: ./comparison_dataset
  ✅ Deleted file: ./pipeline_output.json
  ✅ Deleted directory: ./pipeline_dataset

✅ Cleanup complete!


---

## Summary: All Porter Types and Capabilities

### Porter Comparison Table

| Porter | Format | Parameters | Return Type | Use Cases |
|--------|--------|------------|-------------|-----------|
| **JSONPorter** | JSON/JSONL (text) | `lines` (constructor), `file` (export) | `None` | Archiving, Sharing, Interoperability, Debugging |
| **DatasetsPorter** | Arrow (binary) | `save_to_disk`, `path`, `**kwargs` | `Dataset` object | ML Training, HF Ecosystem, Large Datasets |

### JSONPorter

**Purpose**: Export chunks to JSON or JSON Lines format

**Key Features**:
- Human-readable format
- Works with any JSON-compatible tool
- Simple and portable
- Good for small to medium datasets
- Easy to inspect and debug
- Supports both JSON array (`lines=False`) and JSONL (`lines=True`)

**Usage**:
```python
# JSON Lines format (default)
porter = JSONPorter(lines=True)
porter.export(chunks, file="chunks.jsonl")

# JSON array format
porter = JSONPorter(lines=False)
porter.export(chunks, file="chunks.json")

# Or use as callable
porter(chunks, file="output.json")
```

### DatasetsPorter

**Purpose**: Export chunks to Hugging Face Dataset format

**Parameters**:
- **save_to_disk**: `bool` (default: `True`) - Whether to save to disk
- **path**: `str` (default: `"chunks"`) - Directory path for saving
- **\*\*kwargs**: Additional arguments for `Dataset.save_to_disk()`

**Key Features**:
- Efficient Arrow format
- Integration with Hugging Face ecosystem
- Optimized for large datasets
- Rich data manipulation APIs
- Memory-efficient operations

**Usage**:
```python
porter = DatasetsPorter()

# In-memory only (the default save_to_disk=True would also write to disk)
dataset = porter.export(chunks, save_to_disk=False)

# Save to disk
dataset = porter.export(chunks, save_to_disk=True, path="my_chunks")

# Load from disk
from datasets import load_from_disk
loaded = load_from_disk("my_chunks")
```

### Methods Available

**JSONPorter**:
- `export(chunks, file="chunks.jsonl")` - Export chunks to JSON file
- `__call__(chunks, file="chunks.jsonl")` - Callable interface

**DatasetsPorter**:
- `export(chunks, save_to_disk=True, path="chunks", **kwargs)` - Export chunks to Dataset
- `__call__(chunks, save_to_disk=True, path="chunks", **kwargs)` - Callable interface
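
The callable interface is a thin convenience layer: calling the porter forwards to `export`. The following sketch shows that pattern together with the two JSON layouts. `MiniJSONPorter` is a hypothetical stand-in written for illustration and operates on plain dictionaries; it is not Chonkie's actual implementation, which may differ in details such as indentation.

```python
import json
import tempfile
from pathlib import Path


class MiniJSONPorter:
    """Illustrative exporter; a hypothetical stand-in, not Chonkie's JSONPorter."""

    def __init__(self, lines: bool = True):
        self.lines = lines

    def export(self, records, file="chunks.jsonl"):
        with open(file, "w", encoding="utf-8") as f:
            if self.lines:
                # JSON Lines: one compact JSON object per line.
                for record in records:
                    f.write(json.dumps(record) + "\n")
            else:
                # JSON array: a single, indented document.
                json.dump(list(records), f, indent=2)

    def __call__(self, records, file="chunks.jsonl"):
        # The callable form simply delegates to export().
        return self.export(records, file=file)


records = [{"text": "First test chunk."}, {"text": "Second test chunk."}]
with tempfile.TemporaryDirectory() as tmp:
    array_path = Path(tmp) / "chunks.json"
    lines_path = Path(tmp) / "chunks.jsonl"
    MiniJSONPorter(lines=False)(records, file=array_path)
    MiniJSONPorter(lines=True)(records, file=lines_path)
    from_array = json.loads(array_path.read_text(encoding="utf-8"))
    from_lines = [
        json.loads(line)
        for line in lines_path.read_text(encoding="utf-8").splitlines()
    ]
    print(from_array == from_lines)
```

Both layouts carry the same records; they differ only in how a reader has to parse the file.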

### Best Practices

✅ **JSONPorter**:
- Use for small to medium datasets (< 10,000 chunks)
- Ideal for sharing with non-Python tools
- Perfect for debugging and inspection
- Good for archiving and version control
- Easy to parse in any language
- Use `lines=True` for streaming large files
- Use `lines=False` for better readability
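
To make the streaming tip concrete: a JSON Lines file can be consumed one record at a time, so only the current record needs to be held in memory. A minimal sketch with the standard library only; `iter_jsonl` is a hypothetical helper, and the two records are sample data written to a temporary file so the example is self-contained.

```python
import json
import tempfile
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


sample = [
    {"text": "First test chunk.", "token_count": 3},
    {"text": "Second test chunk.", "token_count": 3},
]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "chunks.jsonl"
    jsonl_path.write_text(
        "\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8"
    )
    # Aggregate lazily: the generator never builds the full list.
    total_tokens = sum(r["token_count"] for r in iter_jsonl(jsonl_path))
    # Materialize only when the whole file is actually needed.
    loaded = list(iter_jsonl(jsonl_path))
    print(f"Total tokens: {total_tokens}")
```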

✅ **DatasetsPorter**:
- Use for large datasets (> 10,000 chunks)
- Essential for ML training pipelines
- Better performance for data operations
- Native integration with Hugging Face models
- Memory-efficient for big data

✅ **General Tips**:
- Always preserve metadata when exporting
- Choose format based on downstream use case
- Consider file size and performance needs
- Use DatasetsPorter for HF ecosystem integration
- Use JSONPorter for maximum portability

### Common Workflows

**1. Data Archiving**:
```python
# Chunk text
chunks = chunker(text)

# Export to JSON for archiving
JSONPorter(lines=False)(chunks, file="archive.json")
```

**2. ML Training Pipeline**:
```python
# Process and refine chunks
chunks = refinery(chunker(text))

# Export to Dataset for training
dataset = DatasetsPorter()(chunks, save_to_disk=True, path="train_data")
```

**3. Data Sharing**:
```python
# Export to both formats
JSONPorter(lines=False)(chunks, file="chunks.json")  # For general use
DatasetsPorter()(chunks, save_to_disk=True, path="chunks_hf")  # For HF users
```