# Split data into multiple rows with iterators

Transform a single document, video, or text into multiple rows for granular processing.

**What's in this recipe:**

- Split documents, videos, and strings using built-in iterator functions
- Build custom iterators for specialized splitting logic

## Problem

You have documents, videos, or text that you need to break into smaller pieces for processing. A PDF needs to be split into chunks for retrieval-augmented generation. A video needs individual frames for analysis. Text needs to be divided into sentences or sliding windows.

You need a way to transform one source row into multiple output rows automatically.

## Solution

You create a view with an iterator function that splits each source row into multiple rows. Built-in iterators handle common cases like documents, videos, and strings. For specialized needs, you define a custom iterator class and expose it through a wrapper function.
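
Conceptually, an iterator view applies a splitter to one column of each source row and emits one output row per piece. The plain-Python sketch below illustrates that one-to-many expansion without Pixeltable; `expand_rows`, `split_on_periods`, and the `pos` field are illustrative names for this sketch only, and Pixeltable's actual view machinery will differ.

```python
from typing import Any, Callable, Iterable, Iterator


def expand_rows(
    rows: Iterable[dict[str, Any]],
    column: str,
    splitter: Callable[[Any], Iterable[dict[str, Any]]],
) -> Iterator[dict[str, Any]]:
    """Yield one output row per piece the splitter produces for each source row."""
    for row in rows:
        for pos, piece in enumerate(splitter(row[column])):
            # Each output row keeps the source columns and adds the piece's fields
            yield {**row, 'pos': pos, **piece}


def split_on_periods(text: str) -> Iterator[dict[str, Any]]:
    """Naive sentence splitter, for illustration only."""
    for part in text.split('.'):
        if part.strip():
            yield {'text': part.strip() + '.'}


source = [{'content': 'First sentence. Second sentence.'}]
expanded = list(expand_rows(source, 'content', split_on_periods))
```

Here one source row becomes two output rows, each carrying the original `content` plus its own `pos` and `text`.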

### Setup

In [None]:
%pip install -qU pixeltable

In [None]:
import pixeltable as pxt

### Split documents into chunks

Use `document_splitter` to break documents (PDF, HTML, Markdown, TXT) into text chunks.

In [None]:
from pixeltable.functions.document import document_splitter

pxt.drop_dir('iterator_demo', force=True)
pxt.create_dir('iterator_demo')

docs = pxt.create_table('iterator_demo.docs', {'doc': pxt.Document})
docs.insert([{'doc': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Jefferson-Amazon.pdf'}])

In [None]:
chunks = pxt.create_view(
    'iterator_demo.doc_chunks',
    docs,
    iterator=document_splitter(docs.doc, separators='sentence,token_limit', limit=300)
)
chunks.select(chunks.text).limit(3).collect()

**Available separators:**

- `heading` — Split on HTML/Markdown headings
- `sentence` — Split on sentence boundaries (requires spaCy)
- `token_limit` — Split by token count (requires tiktoken)
- `char_limit` — Split by character count
- `page` — Split by page (PDF only)
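
To build intuition for combining a boundary separator with a size limit (as in `'sentence,token_limit'` above), here is a rough plain-Python sketch. It is an assumption-laden illustration, not Pixeltable's implementation: it uses a naive regex instead of spaCy, counts characters instead of tokens, and the helper names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence split on ., !, or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def enforce_char_limit(piece: str, limit: int) -> list[str]:
    """Cut a piece into consecutive slices of at most `limit` characters."""
    return [piece[i:i + limit] for i in range(0, len(piece), limit)]


def split_then_limit(text: str, limit: int) -> list[str]:
    """Split on sentence boundaries first, then cap each piece at `limit`."""
    chunks: list[str] = []
    for sentence in split_sentences(text):
        chunks.extend(enforce_char_limit(sentence, limit))
    return chunks


chunks = split_then_limit('Short one. This sentence is longer than the limit.', limit=20)
```

The short sentence passes through unchanged, while the long one is cut into slices that each respect the limit.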

### Extract frames from videos

Use `frame_iterator` to extract frames at specified intervals.

In [None]:
from pixeltable.functions.video import frame_iterator

videos = pxt.create_table('iterator_demo.videos', {'video': pxt.Video})
videos.insert([{'video': 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/bangkok.mp4'}])

In [None]:
frames = pxt.create_view(
    'iterator_demo.frames',
    videos,
    iterator=frame_iterator(videos.video, fps=1.0)
)
frames.select(frames.frame_idx, frames.pos_msec, frames.frame).limit(3).collect()

**frame_iterator options:**

- `fps` — Frames per second to extract
- `num_frames` — Extract an exact number of frames (evenly spaced)
- `keyframes_only` — Extract only keyframes
- `all_frame_attrs` — Include all PyAV frame attributes
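
As a rough mental model for `fps` versus `num_frames`, the sketch below computes sample positions with simple arithmetic. This is only one possible scheme under stated assumptions (a known duration and frame count); the function names are hypothetical, and the frames `frame_iterator` actually selects may differ.

```python
def timestamps_for_fps(duration_sec: float, fps: float) -> list[float]:
    """Timestamps (seconds) spaced 1/fps apart, starting at 0."""
    count = int(duration_sec * fps)
    return [i / fps for i in range(count)]


def evenly_spaced_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick num_frames indices spread across total_frames (one possible scheme)."""
    num_frames = min(num_frames, total_frames)
    return [i * total_frames // num_frames for i in range(num_frames)]


one_per_second = timestamps_for_fps(duration_sec=5.0, fps=1.0)
spread = evenly_spaced_indices(total_frames=100, num_frames=4)
```

With `fps`, the number of rows grows with the video's length; with `num_frames`, it stays fixed regardless of length.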

### Split strings into sentences

Use `string_splitter` to divide text into sentences.

In [None]:
from pixeltable.functions.string import string_splitter

texts = pxt.create_table('iterator_demo.texts', {'content': pxt.String})
texts.insert([{'content': 'AI data infrastructure simplifies ML workflows. Declarative pipelines update incrementally. This makes development faster and more maintainable.'}])

In [None]:
sentences = pxt.create_view(
    'iterator_demo.sentences',
    texts,
    iterator=string_splitter(texts.content, separators='sentence')
)
sentences.select(sentences.text).collect()

### Build a custom iterator

When built-in iterators don't fit your needs, you define a custom one. The pattern is:

1. Create an iterator class that subclasses `ComponentIterator`
2. Wrap it in a function for a clean API

This example builds a sliding-window iterator that splits text into overlapping word windows, which is useful for context-aware text processing.
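
Before wrapping the logic in an iterator class, it can help to see the windowing rule on its own. This standalone sketch mirrors the `range(...)` expression used in this recipe's `__init__`; the function name `sliding_windows` is hypothetical and exists only for this illustration.

```python
def sliding_windows(text: str, window_size: int = 10, step: int = 5) -> list[str]:
    """Return overlapping word windows, advancing `step` words each time."""
    words = text.split()
    # max(1, ...) guarantees one window even when the text is shorter than window_size
    stop = max(1, len(words) - window_size + 1)
    return [' '.join(words[i:i + window_size]) for i in range(0, stop, step)]


sample = 'one two three four five six seven eight nine ten'
windows = sliding_windows(sample, window_size=6, step=3)
short = sliding_windows('a b c', window_size=6, step=3)
```

For the ten-word sample, the windows start at words 0 and 3, so the final word (`ten`) is not covered. With this rule, trailing words are left out whenever `len(words) - window_size` is not a multiple of `step`; a text shorter than one window still yields a single, shorter window.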

**Define the iterator class:**

In [None]:
from pixeltable.iterators import ComponentIterator
import pixeltable.type_system as ts
from typing import Any


class SlidingWindowIterator(ComponentIterator):
    """Split text into overlapping windows of words."""

    def __init__(self, text: str, *, window_size: int = 10, step: int = 5):
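        # Initialize: precompute every window so __next__ only has to index into the list.
        # max(1, ...) yields one (shorter) window when the text has fewer than
        # window_size words. Words after the final window are left out when
        # (len(words) - window_size) is not a multiple of step.
        words = text.split()
        self.windows = []
        for i in range(0, max(1, len(words) - window_size + 1), step):
            self.windows.append(' '.join(words[i:i + window_size]))
        self.pos = 0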

    @classmethod
    def input_schema(cls) -> dict[str, ts.ColumnType]:
        # Define input parameter types (must match __init__ params)
        return {
            'text': ts.StringType(nullable=False),
            'window_size': ts.IntType(),
            'step': ts.IntType(),
        }

    @classmethod
    def output_schema(cls, *args: Any, **kwargs: Any) -> tuple[dict[str, ts.ColumnType], list[str]]:
        # Define output columns and unstored columns (empty list = all stored)
        return {
            'window_idx': ts.IntType(),
            'window_text': ts.StringType(),
        }, []

    def __next__(self) -> dict[str, Any]:
        # Return next row as dict, or raise StopIteration when done
        if self.pos >= len(self.windows):
            raise StopIteration
        result = {'window_idx': self.pos, 'window_text': self.windows[self.pos]}
        self.pos += 1
        return result

    def close(self) -> None:
        pass  # Release any resources (file handles, etc.)


In [None]:
articles = pxt.create_table('iterator_demo.articles', {'content': pxt.String})
articles.insert([{'content': 'Artificial intelligence transforms software development. Machine learning models understand images text and audio. Multimodal AI combines these capabilities into unified systems.'}])


**Create a wrapper function:**


In [None]:
def sliding_window(
    text: Any,
    *,
    window_size: int = 10,
    step: int = 5
) -> tuple[type[ComponentIterator], dict[str, Any]]:
    """Iterator over sliding windows of text.
    
    Args:
        text: Text column to split into windows
        window_size: Number of words per window
        step: Number of words to advance between windows
        
    Examples:
        >>> pxt.create_view('windows', tbl, iterator=sliding_window(tbl.text, window_size=8, step=4))
    """
    kwargs = {}
    if window_size != 10:
        kwargs['window_size'] = window_size
    if step != 5:
        kwargs['step'] = step
    return SlidingWindowIterator._create(text=text, **kwargs)


In [None]:
windows = pxt.create_view(
    'iterator_demo.windows',
    articles,
    iterator=sliding_window(articles.content, window_size=6, step=3)
)
windows.select(windows.window_idx, windows.window_text).collect()


## Explanation

**Custom iterator structure:**

- `__init__` — Receives column values and prepares data to iterate
- `input_schema` — Maps parameter names to Pixeltable types
- `output_schema` — Defines output columns (returns a dict of column types and a list of unstored column names)
- `__next__` — Returns next row as dict or raises `StopIteration`
- `close` — Releases resources (file handles, connections)
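
The `__next__`/`close` contract above follows Python's ordinary iterator protocol. The sketch below shows how a consumer might drive such an object, using a hypothetical `CountdownIterator` and `drain` helper in plain Python; how Pixeltable calls these methods internally is not shown in this recipe and is an assumption here.

```python
from typing import Any


class CountdownIterator:
    """Hypothetical plain-Python iterator with the same __next__/close shape."""

    def __init__(self, start: int):
        self.remaining = start
        self.closed = False

    def __iter__(self) -> 'CountdownIterator':
        return self

    def __next__(self) -> dict[str, Any]:
        # Return the next row as a dict, or signal exhaustion
        if self.remaining <= 0:
            raise StopIteration
        row = {'value': self.remaining}
        self.remaining -= 1
        return row

    def close(self) -> None:
        self.closed = True


def drain(it: CountdownIterator) -> list[dict[str, Any]]:
    """Collect every row, then release resources even if iteration fails."""
    try:
        return list(it)
    finally:
        it.close()


countdown = CountdownIterator(3)
rows = drain(countdown)
```

Calling `close` in a `finally` block is the reason to keep resource cleanup out of `__next__`: it runs whether iteration finishes normally or raises.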

**Common types:** `ts.StringType()`, `ts.IntType()`, `ts.FloatType()`, `ts.ImageType()`, `ts.VideoType()`, `ts.AudioType()`, `ts.DocumentType()`, `ts.JsonType()`

## See also

- [Split documents for RAG](https://docs.pixeltable.com/howto/cookbooks/text/doc-chunk-for-rag)
- [Extract frames from videos](https://docs.pixeltable.com/howto/cookbooks/video/video-extract-frames)
- [Custom aggregates](https://docs.pixeltable.com/howto/cookbooks/core/custom-aggregates-uda)