# Custom Iterators

An **iterator** in Pixeltable is a function that expands a single input row into multiple output rows. Built-in Pixeltable iterators include [frame_iterator](https://docs.pixeltable.com/sdk/latest/video#iterator-frame-iterator), which iterates over the frames of a video; [tile_iterator](https://docs.pixeltable.com/sdk/latest/image#iterator-tile-iterator), which iterates over tiles of an image; and [document_splitter](https://docs.pixeltable.com/sdk/latest/document#iterator-document_splitter), which iterates over chunks (such as sentences or pages) of a document. These and other examples are discussed in the [Iterators](https://docs.pixeltable.com/platform/iterators) platform tutorial.

As with UDFs, Pixeltable provides a way for users to define their own iterators from arbitrary Python code. Recall that custom UDFs are created by decorating a Python function with the `@pxt.udf` decorator. Similarly, custom iterators are created by decorating a Python generator function with `@pxt.iterator`.

<div class="alert alert-block alert-info">
Custom iterators are a relatively advanced Pixeltable feature. This guide will make the most sense if you're already familiar with Pixeltable's built-in iterators, as well as the <code>pxt.udf</code> decorator. If you haven't encountered those concepts yet, it's recommended to first read the <a href="https://docs.pixeltable.com/platform/iterators">Iterators</a> and <a href="https://docs.pixeltable.com/platform/udfs-in-pixeltable">UDFs</a> tutorial sections.
</div>

In [1]:
import pixeltable as pxt

pxt.create_dir('iterators_demo', if_exists='replace_force')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory 'iterators_demo'.


<pixeltable.catalog.dir.Dir at 0x14739e080>

In this tutorial, we'll be creating an iterator that takes an image as input, and produces multiple images as output. The output images will be variations of the input with different characteristics. To start, we'll create a base table to store our source images.

In [2]:
t = pxt.create_table('iterators_demo/images', {'image': pxt.Image})
images = [
    'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000108.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/release/docs/resources/images/000000000632.jpg',
]
t.insert({'image': image} for image in images)
t.head()

image


Now let's define a custom iterator. Our iterator is going to turn each image into `n` different grayscale images of varying brightness. Creating a functioning iterator is as simple as defining a Python generator function (a function that `yield`s its output) and then decorating it with `@pxt.iterator`.

In [3]:
from typing import Iterator, TypedDict
from PIL.Image import Image
from PIL.ImageEnhance import Brightness


class GrayscaleOutput(TypedDict):
    brightness: float
    grayscale_image: Image


@pxt.iterator
def grayscale_iterator(
    image: Image, *, n: int
) -> Iterator[GrayscaleOutput]:
    grayscale_image = image.convert('L')
    enhancer = Brightness(grayscale_image)
    for brightness in [0.5 * (i + 1) for i in range(n)]:
        enhanced_image = enhancer.enhance(brightness)
        yield {
            'grayscale_image': enhanced_image,
            'brightness': brightness,
        }

Notice that before defining our iterator, we first introduced a `TypedDict` class describing the content of the iterator's output. Unlike UDFs, iterators can (and usually do) return multiple outputs. They will *always* `yield` dictionaries, and you *must* annotate the return type with a suitable `TypedDict`. This is how Pixeltable knows what types to assign to the iterator's output columns.

<div class="alert alert-block alert-info">
Defining a <code>TypedDict</code> for your iterator is not optional. Remember that Pixeltable is a database system, and everything must be typed!
</div>
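To make the role of the return annotation concrete, here is a minimal plain-Python sketch of how field names and types can be read off a generator's `TypedDict` annotation. It is not Pixeltable's actual implementation; `LevelOutput`, `brightness_levels`, and `output_schema` are hypothetical names used only for illustration.

```python
from typing import Iterator, TypedDict, get_args, get_type_hints


class LevelOutput(TypedDict):
    brightness: float
    label: str


def brightness_levels(n: int) -> Iterator[LevelOutput]:
    # A plain generator: yields one dictionary per output row.
    for i in range(n):
        yield {'brightness': 0.5 * (i + 1), 'label': f'level-{i}'}


def output_schema(fn) -> dict[str, type]:
    # Hypothetical helper: pull the TypedDict out of `Iterator[...]`
    # and map its field names to their annotated types.
    return_type = get_type_hints(fn)['return']
    (row_type,) = get_args(return_type)
    return get_type_hints(row_type)


schema = output_schema(brightness_levels)
rows = list(brightness_levels(3))
print(schema)
print(rows)
```

Without the annotation there would be nothing to inspect ahead of time; the column types would only be discoverable by running the generator.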

Now let's see our iterator in action! We'll create a view on top of the `images` table and inspect the first few rows.

In [4]:
v = pxt.create_view(
    'iterators_demo/grayscale',
    t,
    iterator=grayscale_iterator(t.image, n=3),
)
v.head()

pos,brightness,grayscale_image,image
0,0.5,,
1,1.0,,
2,1.5,,
0,0.5,,
1,1.0,,
2,1.5,,


The iterator view has the columns `brightness` and `grayscale_image`, which were defined in `GrayscaleOutput`. In addition, Pixeltable added a third column `pos`. *Every* iterator will automatically output a `pos` column, regardless of what shows up in the iterator's `TypedDict`. The `pos` column simply indicates the integer position of that row in the original iteration order. If we look at the schema of our new view, we can see that `pos` always has type `Int`.

In [5]:
v

view 'iterators_demo/grayscale' (of 'iterators_demo/images')

Column Name,Type,Computed With
pos,Required[Int],
brightness,Required[Float],
grayscale_image,Required[Image],
image,Image,


In addition, a column for the original input image is included for reference. (Of course, the input image is *not* copied `n` times; Pixeltable materializes it in the view by joining against the base table.)
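As a purely conceptual model of the view's rows (not how Pixeltable materializes them), the expansion can be pictured as a nested loop in which `pos` restarts at 0 for every base row. The names `levels` and `expand` below are hypothetical, and strings stand in for images.

```python
from typing import Callable, Iterable, Iterator


def levels(row: dict, n: int) -> Iterator[dict]:
    # Stand-in for an iterator: n outputs per input row.
    for i in range(n):
        yield {'brightness': 0.5 * (i + 1)}


def expand(
    base_rows: Iterable[dict],
    make_iter: Callable[[dict], Iterator[dict]],
) -> Iterator[dict]:
    for base in base_rows:
        # `pos` is the position of each output within one base row's iteration.
        for pos, out in enumerate(make_iter(base)):
            # Each view row combines `pos`, the iterator outputs, and the base columns.
            yield {'pos': pos, **out, **base}


base_rows = [{'image': 'a.jpg'}, {'image': 'b.jpg'}]
view_rows = list(expand(base_rows, lambda row: levels(row, n=3)))
for row in view_rows:
    print(row)
```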

## Parameterizing Iterators

Iterators often contain complex functionality; `document_splitter`, for example, has 10 optional parameters to tune its behavior. Like UDFs, iterators can involve any number of parameters. To illustrate this, let's add an optional `colorize` parameter to our iterator.

In [6]:
from PIL import ImageOps


@pxt.iterator
def grayscale_iterator(
    image: Image, *, n: int, colorize: str | None = None
) -> Iterator[GrayscaleOutput]:
    grayscale_image = image.convert('L')
    if colorize is not None:
        grayscale_image = ImageOps.colorize(
            grayscale_image, black='black', white=colorize
        )
    enhancer = Brightness(grayscale_image)
    for brightness in [0.5 * (i + 1) for i in range(n)]:
        enhanced_image = enhancer.enhance(brightness)
        yield {
            'grayscale_image': enhanced_image,
            'brightness': brightness,
        }

In [7]:
v = pxt.create_view(
    'iterators_demo/grayscale',
    t,
    iterator=grayscale_iterator(t.image, n=3, colorize='red'),
    if_exists='replace',
)
v.head()

pos,brightness,grayscale_image,image
0,0.5,,
1,1.0,,
2,1.5,,
0,0.5,,
1,1.0,,
2,1.5,,


## Validation

Often it's desirable to validate an iterator's inputs as a sanity check. Suppose we want to check that the `colorize` input is a valid PIL color name. That's already being done, in a sense: when `ImageOps.colorize` is called in our iterator code, it will raise an exception if the color name is not valid. The problem is that the iterator code isn't executed until our workflow actually runs. Nothing stops us from *instantiating* `grayscale_iterator` with broken inputs. To appreciate this distinction, let's set up an empty table and define an invalid iterator view on it.

In [8]:
t = pxt.create_table(
    'iterators_demo/images',
    {'image': pxt.Image},
    if_exists='replace_force',
)

Created table 'images'.


In [9]:

v = pxt.create_view(
    'iterators_demo/grayscale',
    t,
    iterator=grayscale_iterator(
        t.image, n=3, colorize='invalid_color_name'
    ),
)

The view gets created without any errors, because nothing has actually run yet! Only when we go to insert data do we see an exception.

In [10]:
t.insert({'image': image} for image in images)

ValueError: unknown color specifier: 'invalid_color_name'
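This deferred failure has a close analogue in plain Python, where calling a generator function runs none of its body until the first value is requested. The sketch below uses a hypothetical `tinted_levels` generator and a made-up `ALLOWED_COLORS` set; it illustrates the general behavior only and makes no claim about Pixeltable's internals.

```python
from typing import Iterator

ALLOWED_COLORS = {'red', 'green', 'blue'}  # made-up set for illustration


def tinted_levels(n: int, color: str) -> Iterator[str]:
    # This check lives in the generator body, so it runs lazily.
    if color not in ALLOWED_COLORS:
        raise ValueError(f'unknown color: {color!r}')
    for i in range(n):
        yield f'{color}-{i}'


# Creating the generator object succeeds even though the argument is bad.
gen = tinted_levels(3, 'not_a_color')
created_without_error = True

# The error only surfaces once we ask for the first value.
try:
    next(gen)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```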

It's more useful to do *fail-fast validation*, in which the arguments get checked at the time the iterator is first instantiated. This can be done in Pixeltable by decorating a validation function with the iterator's `validate` decorator (here, `@grayscale_iterator.validate`).

In [11]:
from PIL import ImageColor


@grayscale_iterator.validate
def _(bound_args: dict):
    color = bound_args.get('colorize')
    if color is not None:
        try:
            ImageColor.getrgb(color)
        except ValueError as exc:
            raise ValueError(f'Invalid color name: {color}') from exc

Now if we try to create an invalid instance, we get an error right away.

In [12]:
t = pxt.create_table(
    'iterators_demo/images',
    {'input': pxt.Image},
    if_exists='replace_force',
)

Created table 'images'.


In [13]:
v = pxt.create_view(
    'iterators_demo/grayscale',
    t,
    iterator=grayscale_iterator(
        t.input, n=3, colorize='invalid_color_name'
    ),
)

ValueError: Invalid color name: invalid_color_name

The input to `validate()`, `bound_args`, is a dictionary that contains all *constant* arguments for a particular instance of the iterator. In the above example, it contains `colorize` (because it's equal to the constant value `'invalid_color_name'`), but not `image` (which depends dynamically on the data in the `t.input` column).

`validate()` is actually called at two points: once when the iterator is instantiated, with only the constant arguments present in `bound_args`; and again each time the iterator is evaluated on a row, this time with *all* arguments present.
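Because the instantiation-time call sees only the constant arguments, a validator is safest when it treats every key as possibly absent. Here is a minimal plain-Python sketch, using a hypothetical `validate_args` function and a check on `n` that is not part of the tutorial's iterator:

```python
def validate_args(bound_args: dict) -> None:
    # Use .get() so arguments missing from a partial call are simply skipped.
    n = bound_args.get('n')
    if n is not None and n <= 0:
        raise ValueError(f'n must be positive, got {n}')


# Shape of an instantiation-time call in which `n` is not a constant:
# only the constant arguments are present.
validate_args({'colorize': 'red'})

# Shape of a per-row call: every argument is present.
validate_args({'image': object(), 'n': 3, 'colorize': 'red'})

# A bad constant is rejected as soon as it is visible.
try:
    validate_args({'n': 0})
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

Indexing with `bound_args['n']` instead would raise `KeyError` whenever `n` is missing from the dictionary.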

## Class-Based Iterators

For complex iterators that need to maintain a lot of state or provide fine-grained control over their iteration mechanism, it can be convenient to define a class rather than a generator function. This can be done by writing a subclass of `PxtIterator` and decorating the class, rather than decorating a function. Here's what `grayscale_iterator` looks like if written as a class; it is functionally identical to the earlier implementation.

In [14]:
@pxt.iterator
class grayscale_iterator(pxt.PxtIterator[GrayscaleOutput]):
    # The parameters of __init__() determine the iterator arguments
    def __init__(
        self, image: Image, *, n: int, colorize: str | None = None
    ):
        self.image = image
        self.n = n
        self.colorize = colorize
        self.idx = 0

        grayscale_image = self.image.convert('L')
        if self.colorize is not None:
            grayscale_image = ImageOps.colorize(
                grayscale_image, black='black', white=self.colorize
            )
        self.enhancer = Brightness(grayscale_image)

    # Every class-based iterator *must* implement a __next__() method
    # whose return type is a `TypedDict`.
    def __next__(self) -> GrayscaleOutput:
        if self.idx >= self.n:
            raise StopIteration

        brightness = 0.5 * (self.idx + 1)
        enhanced_image = self.enhancer.enhance(brightness)
        self.idx += 1
        return {
            'grayscale_image': enhanced_image,
            'brightness': brightness,
        }

    # When defining a class-based iterator, validate() can optionally be specified
    # as a @classmethod rather than a standalone decorated function.
    @classmethod
    def validate(cls, bound_args: dict):
        color = bound_args.get('colorize')
        if color is not None:
            try:
                ImageColor.getrgb(color)
            except ValueError as exc:
                raise ValueError(f'Invalid color name: {color}') from exc
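The correspondence between the two styles is the standard Python iterator protocol. The sketch below is plain Python with no Pixeltable base class, and the names `levels_gen` and `LevelsIter` are hypothetical. It shows a generator and a hand-written iterator class producing the same sequence.

```python
from typing import Iterator


def levels_gen(n: int) -> Iterator[float]:
    # Generator style: state (the loop variable) is kept implicitly.
    for i in range(n):
        yield 0.5 * (i + 1)


class LevelsIter:
    # Class style: state is kept explicitly in attributes.
    def __init__(self, n: int):
        self.n = n
        self.idx = 0

    def __iter__(self) -> 'LevelsIter':
        return self

    def __next__(self) -> float:
        if self.idx >= self.n:
            raise StopIteration
        value = 0.5 * (self.idx + 1)
        self.idx += 1
        return value


from_generator = list(levels_gen(3))
from_class = list(LevelsIter(3))
print(from_generator == from_class)
```

The explicit `idx` attribute is what makes the class style easier to extend with operations that reposition the iteration.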

## Unstored Columns

That's all you need to know to implement fully functional iterators. But sometimes, depending on the nature of the outputs, a little extra work will help make them more performant.

In our example, every input image gets turned into `n` output images. Moreover, recreating those output images doesn't involve a lot of computation: it's just a simple color mask. If we store every output image as a separate file, then when `n` is large we'll be using up a lot of storage without much benefit. Even at `n=3`, the outputs will consume 3x the storage of the inputs (maybe a little less since they're monochrome now, but you get the idea).

Just as with computed columns, Pixeltable provides an option for iterator outputs to be *unstored*, meaning the outputs won't be saved to disk and will instead be dynamically regenerated each time a client queries them. Unstored columns don't provide much benefit for scalar columns (integers or strings, say), where the storage footprint is small; or for expensive computations (such as generative model outputs), where we actually *do* want to persist the output. But for simple image operations, they can be a lifesaver.

<div class="alert alert-block alert-info">
In the Pixeltable library, <code>frame_iterator</code> and <code>tile_iterator</code> both use an unstored column for the output images. In the case of <code>frame_iterator</code>, the stored output would potentially be <em>huge</em>, because compressed video is far smaller than the same frames stored as individual images.
</div>

To mark an iterator output as unstored, use the `unstored_cols` decorator parameter. There are two important requirements:

- If you use unstored columns, you *must* implement your iterator as a class-based iterator; and
- You *must* implement a `seek()` method in your class, as in the example below.

This is to ensure Pixeltable has efficient random access to the iterator outputs, to facilitate downstream queries against the iterator view.

In [15]:
# Mark `grayscale_image` as an unstored column.
@pxt.iterator(unstored_cols=['grayscale_image'])
class grayscale_iterator(pxt.PxtIterator[GrayscaleOutput]):
    def __init__(
        self, image: Image, *, n: int, colorize: str | None = None
    ):
        self.image = image
        self.n = n
        self.colorize = colorize
        self.idx = 0

        grayscale_image = self.image.convert('L')
        if self.colorize is not None:
            grayscale_image = ImageOps.colorize(
                grayscale_image, black='black', white=self.colorize
            )
        self.enhancer = Brightness(grayscale_image)

    def __next__(self) -> GrayscaleOutput:
        if self.idx >= self.n:
            raise StopIteration

        brightness = 0.5 * (self.idx + 1)
        enhanced_image = self.enhancer.enhance(brightness)
        self.idx += 1
        return {
            'grayscale_image': enhanced_image,
            'brightness': brightness,
        }

    # seek() will always receive the `pos` of the row being sought. It
    # will also receive the previously stored values of any *stored*
    # output columns in the target row, as keyword arguments.
    def seek(self, pos: int, **kwargs):
        assert 0 <= pos < self.n
        # 'brightness' is a stored column, so it should always be
        # present. We don't need it to implement seek(), but for
        # purposes of illustration let's check that it's here.
        assert 'brightness' in kwargs

        self.idx = pos  # Reset the iterator to the sought position.

    # When defining a class-based iterator, validate() can optionally
    # be a @classmethod rather than a standalone decorated function.
    @classmethod
    def validate(cls, bound_args: dict):
        color = bound_args.get('colorize')
        if color is not None:
            try:
                ImageColor.getrgb(color)
            except ValueError as exc:
                raise ValueError(f'Invalid color name: {color}') from exc

There it is: a complete, performant implementation of `grayscale_iterator`. Let's check one more time that it all works as expected.

In [16]:
t = pxt.create_table(
    'iterators_demo/images',
    {'image': pxt.Image},
    if_exists='replace_force',
)
t.insert({'image': image} for image in images)

Inserted 2 rows with 0 errors in 0.03 s (75.79 rows/s)


2 rows inserted.

In [17]:
v = pxt.create_view(
    'iterators_demo/grayscale',
    t,
    iterator=grayscale_iterator(t.image, n=3),
)
v.head()

pos,brightness,grayscale_image,image
0,0.5,,
1,1.0,,
2,1.5,,
0,0.5,,
1,1.0,,
2,1.5,,


In [18]:
# Check that we have random access to arbitrary rows in the view.
v.where(v.pos == 2).collect()

pos,brightness,grayscale_image,image
2,1.5,,
2,1.5,,
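One way to picture what a random-access query needs from the iterator is a seek followed by a single `next()`. The sketch below is plain Python under that assumption; the call pattern, the `fetch_at` helper, and the `LevelsIter` class are hypothetical and are not taken from Pixeltable's source.

```python
class LevelsIter:
    def __init__(self, n: int):
        self.n = n
        self.idx = 0

    def __iter__(self) -> 'LevelsIter':
        return self

    def __next__(self) -> dict:
        if self.idx >= self.n:
            raise StopIteration
        brightness = 0.5 * (self.idx + 1)
        self.idx += 1
        return {'brightness': brightness}

    def seek(self, pos: int, **kwargs) -> None:
        # Jump straight to `pos` without producing the earlier outputs.
        if not 0 <= pos < self.n:
            raise IndexError(f'pos out of range: {pos}')
        self.idx = pos


def fetch_at(it: LevelsIter, pos: int, **stored) -> dict:
    # Hypothetical helper: reposition the iterator, then regenerate one row.
    it.seek(pos, **stored)
    return next(it)


it = LevelsIter(3)
last = fetch_at(it, 2, brightness=1.5)
first = fetch_at(it, 0, brightness=0.5)
print(last, first)
```

Without `seek()`, reaching position `pos` would mean regenerating and discarding every earlier output, which is the cost that random access avoids.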
