# Working with Iterators

Iterators expand each row of a table into multiple rows of a view, for example by splitting a document into chunks or a video into frames. This recipe covers:

- Using built-in iterator functions (`document_splitter`, `frame_iterator`, `video_splitter`, `audio_splitter`, `string_splitter`)
- Creating custom iterators for specialized use cases

## Part 1: Built-in Iterator Functions

Pixeltable provides iterator functions for common splitting operations. Each function returns a tuple `(iterator_class, kwargs)` that you pass to `pxt.create_view()` through its `iterator` argument.

### Setup

In [None]:
%pip install -qU pixeltable

In [None]:
import pixeltable as pxt

pxt.drop_dir('iterator_demo', force=True)
pxt.create_dir('iterator_demo')

### Document Splitting

Split documents (HTML, Markdown, PDF, TXT) into chunks using `document_splitter`.

In [None]:
from pixeltable.functions.document import document_splitter

# Create a table with documents
docs = pxt.create_table(
    'iterator_demo.docs',
    {'doc': pxt.Document}
)

# Insert a sample document
docs.insert([{
    'doc': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Jefferson-Amazon.pdf'
}])

In [None]:
# Split on sentence boundaries, capping each chunk at 300 tokens
chunks = pxt.create_view(
    'iterator_demo.doc_chunks',
    docs,
    iterator=document_splitter(
        docs.doc,
        separators='sentence,token_limit',
        limit=300
    )
)

chunks.select(chunks.text).limit(3).collect()

**Available separators:**

- `heading` — Split on HTML/Markdown headings
- `sentence` — Split on sentence boundaries (requires spacy)
- `token_limit` — Split by token count (requires tiktoken)
- `char_limit` — Split by character count
- `page` — Split by page (PDF only)
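
To build intuition for combining a boundary separator with a limit, here is a plain-Python sketch that splits on sentence boundaries and then cuts any sentence that is still too long. It is an illustration only and not Pixeltable's implementation: the regex-based sentence detection, the character-based limit, and the helper name `split_sentences_with_char_limit` are all assumptions made for this example.

```python
import re


def split_sentences_with_char_limit(text: str, limit: int) -> list[str]:
    """Split text into sentences, then cut any sentence longer than `limit` characters."""
    # Naive sentence detection (assumption): break after '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks = []
    for sentence in sentences:
        # Sentences within the limit pass through as a single chunk
        for start in range(0, len(sentence), limit):
            chunks.append(sentence[start:start + limit])
    return chunks


text = 'Short one. This sentence is noticeably longer than the limit.'
chunks = split_sentences_with_char_limit(text, limit=20)
for chunk in chunks:
    print(repr(chunk))
```

The first sentence fits within the limit and stays whole, while the second is cut into pieces of at most 20 characters. A real splitter would normally avoid cutting in the middle of a word.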

### Video Frame Extraction

Extract frames from videos at specified intervals using `frame_iterator`.


In [None]:
from pixeltable.functions.video import frame_iterator

# Create a table with videos
videos = pxt.create_table(
    'iterator_demo.videos',
    {'video': pxt.Video}
)

videos.insert([{
    'video': 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/bangkok.mp4'
}])

In [None]:
# Extract frames at 1 fps
frames = pxt.create_view(
    'iterator_demo.frames',
    videos,
    iterator=frame_iterator(videos.video, fps=1.0)
)

frames.select(frames.frame_idx, frames.pos_msec, frames.frame).limit(3).collect()

**frame_iterator options:**

- `fps` — Frames per second to extract
- `num_frames` — Extract exact number of frames (evenly spaced)
- `keyframes_only` — Extract only keyframes
- `all_frame_attrs` — Include all pyav frame attributes
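
The difference between sampling by `fps` and sampling by `num_frames` can be shown with a little arithmetic. The sketch below is plain Python and does not call Pixeltable; the helper names and the exact rounding are assumptions, and Pixeltable's own frame selection may differ in detail.

```python
def sample_times_msec(duration_sec: float, fps: float) -> list[float]:
    """Timestamps in milliseconds when sampling `fps` times per second, starting at 0."""
    if fps <= 0:
        raise ValueError('fps must be positive')
    interval_msec = 1000.0 / fps
    count = int(duration_sec * fps)
    return [i * interval_msec for i in range(count)]


def evenly_spaced_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` frame indices spread evenly across the video."""
    if num_frames >= total_frames:
        return list(range(total_frames))
    return [i * total_frames // num_frames for i in range(num_frames)]


print(sample_times_msec(3.0, fps=1.0))   # [0.0, 1000.0, 2000.0]
print(evenly_spaced_indices(100, 4))     # [0, 25, 50, 75]
```

With `fps`, the number of extracted frames grows with the length of the video; with `num_frames`, the count is fixed and the spacing adapts instead.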

### String Splitting

Split text strings into sentences using `string_splitter`.


In [None]:
from pixeltable.functions.string import string_splitter

# Create a table with text
texts = pxt.create_table(
    'iterator_demo.texts',
    {'content': pxt.String}
)

texts.insert([{
    'content': 'Pixeltable is a Python library for AI data infrastructure. It provides declarative, incremental data pipelines. Machine learning workflows become simpler and more maintainable.'
}])

In [None]:
# Split text into sentences
sentences = pxt.create_view(
    'iterator_demo.sentences',
    texts,
    iterator=string_splitter(texts.content, separators='sentence')
)

sentences.select(sentences.text).collect()

## Part 2: Building Custom Iterators

When built-in iterators don't fit your needs, create custom ones. The recommended pattern is to:

1. Define an iterator class that subclasses `ComponentIterator`
2. Wrap it in a function for a cleaner API

### Example: Sliding Window Iterator

Let's build an iterator that splits text into overlapping windows of words, so that each window shares some context with its neighbors.

**Step 1: Define the iterator class**

In [None]:
from pixeltable.iterators import ComponentIterator
import pixeltable.type_system as ts
from typing import Any


class SlidingWindowIterator(ComponentIterator):
    """Split text into overlapping windows of words."""

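    def __init__(self, text: str, *, window_size: int = 10, step: int = 5):
        # Reject values that would make range() fail or produce empty windows
        if window_size < 1 or step < 1:
            raise ValueError('window_size and step must be positive integers')
        # Precompute all windows up front; __next__ then walks through them
        words = text.split()
        self.windows: list[str] = []
        # max(1, ...) ensures text shorter than window_size still yields one window
        for i in range(0, max(1, len(words) - window_size + 1), step):
            self.windows.append(' '.join(words[i:i + window_size]))
        self.pos = 0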

    @classmethod
    def input_schema(cls) -> dict[str, ts.ColumnType]:
        # Define input parameter types (must match __init__ params)
        return {
            'text': ts.StringType(nullable=False),
            'window_size': ts.IntType(),
            'step': ts.IntType(),
        }

    @classmethod
    def output_schema(cls, *args: Any, **kwargs: Any) -> tuple[dict[str, ts.ColumnType], list[str]]:
        # Define output columns and unstored columns (empty list = all stored)
        return {
            'window_idx': ts.IntType(),
            'window_text': ts.StringType(),
        }, []

    def __next__(self) -> dict[str, Any]:
        # Return next row as dict, or raise StopIteration when done
        if self.pos >= len(self.windows):
            raise StopIteration
        result = {'window_idx': self.pos, 'window_text': self.windows[self.pos]}
        self.pos += 1
        return result

    def close(self) -> None:
        pass  # Release any resources (file handles, etc.)
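
To see which windows the `range(...)` expression in `__init__` produces, the same logic can be run as a plain function without Pixeltable. This is a minimal sketch; `word_windows` is a hypothetical helper introduced here for illustration and is not part of the iterator class.

```python
def word_windows(text: str, window_size: int = 10, step: int = 5) -> list[str]:
    """Build overlapping word windows the same way SlidingWindowIterator.__init__ does."""
    words = text.split()
    windows = []
    for i in range(0, max(1, len(words) - window_size + 1), step):
        windows.append(' '.join(words[i:i + window_size]))
    return windows


sample = 'one two three four five six seven eight nine ten'
for idx, window in enumerate(word_windows(sample, window_size=4, step=3)):
    print(idx, window)
# 0 one two three four
# 1 four five six seven
# 2 seven eight nine ten
```

Two properties follow from the loop bounds. A text shorter than `window_size` still yields one window, because of the `max(1, ...)` guard. Trailing words can be left out of every window when the text length does not line up with `step`.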


In [None]:
# Create sample data for testing
articles = pxt.create_table(
    'iterator_demo.articles',
    {'title': pxt.String, 'content': pxt.String}
)

articles.insert([{
    'title': 'AI Overview',
    'content': 'Artificial intelligence is transforming how we build software. Machine learning models can now understand images, text, and audio. Multimodal AI combines these capabilities into unified systems.'
}])


**Step 2: Create a wrapper function**

Wrap your iterator in a function for a cleaner API:


In [None]:
def sliding_window(
    text: Any,
    *,
    window_size: int = 10,
    step: int = 5
) -> tuple[type[ComponentIterator], dict[str, Any]]:
    """Iterator over sliding windows of text.
    
    Args:
        text: Text column to split into windows
        window_size: Number of words per window
        step: Number of words to advance between windows
        
    Examples:
        >>> pxt.create_view('windows', tbl, iterator=sliding_window(tbl.text, window_size=8, step=4))
    """
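    # Forward only non-default values; SlidingWindowIterator's own defaults cover the rest
    kwargs: dict[str, Any] = {}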
    if window_size != 10:
        kwargs['window_size'] = window_size
    if step != 5:
        kwargs['step'] = step
    return SlidingWindowIterator._create(text=text, **kwargs)
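
To make the wrapper's return value concrete, the sketch below reproduces the same pattern with a stand-in class, so it runs without Pixeltable. `StandInIterator` and `sliding_window_spec` are hypothetical names used only here, and the sketch assumes that `_create` simply pairs the class with its keyword arguments, as the `(iterator_class, kwargs)` description in Part 1 suggests.

```python
from typing import Any


class StandInIterator:
    """Hypothetical stand-in for an iterator class; not part of Pixeltable."""

    @classmethod
    def _create(cls, **kwargs: Any) -> tuple[type, dict[str, Any]]:
        # Assumption: pair the class with the keyword arguments it should receive later
        return cls, kwargs


def sliding_window_spec(text: Any, *, window_size: int = 10, step: int = 5) -> tuple[type, dict[str, Any]]:
    # Same shape as the wrapper above: forward only non-default values
    kwargs: dict[str, Any] = {}
    if window_size != 10:
        kwargs['window_size'] = window_size
    if step != 5:
        kwargs['step'] = step
    return StandInIterator._create(text=text, **kwargs)


# A plain string stands in for a column reference such as articles.content
iterator_class, iterator_kwargs = sliding_window_spec('articles.content', window_size=6)
print(iterator_class.__name__, iterator_kwargs)
# StandInIterator {'text': 'articles.content', 'window_size': 6}
```

Because `step` was left at its default, it does not appear in the returned kwargs; the iterator class falls back to its own default when it is eventually instantiated.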


In [None]:
# Step 3: Use the wrapper function
windows = pxt.create_view(
    'iterator_demo.windows',
    articles,
    iterator=sliding_window(articles.content, window_size=6, step=3)
)

windows.select(windows.window_idx, windows.window_text).collect()


## Key Points

**Type system:**

- `ts.StringType()` — Text strings
- `ts.IntType()` — Integers
- `ts.FloatType()` — Floating point numbers
- `ts.BoolType()` — Boolean values
- `ts.JsonType()` — JSON objects/arrays
- `ts.ImageType()` — PIL Images
- `ts.VideoType()` — Video files
- `ts.AudioType()` — Audio files
- `ts.DocumentType()` — Document files (PDF, HTML, MD, TXT)

**Schema tips:**

- Add `nullable=False` for required inputs
- Unstored columns (the second element of the tuple returned by `output_schema`) are computed on the fly and not persisted
- Output schema can vary based on input parameters


## See Also

- [Split documents for RAG](https://docs.pixeltable.com/howto/cookbooks/text/doc-chunk-for-rag) — Document chunking strategies
- [Extract frames from videos](https://docs.pixeltable.com/howto/cookbooks/video/video-extract-frames) — Video frame extraction
- [Custom aggregates (UDAs)](https://docs.pixeltable.com/howto/cookbooks/core/custom-aggregates-uda) — Building custom aggregation functions
