Skip to content

Pydantic/dataclass models can't flow through a pipeline as columns — unhashable extension type (Bug A) + polars round-trip drops ext metadata in Join (Bug B) #184

Description

@brian-arnold

orcapod-python: two bugs block pydantic/dataclass models as pipeline columns

Repo / commit: walkerlab/orcapod-python @ main (9ed9b72016030ae6e122e8c7b051ef120b0f82bc)
Environment: Python 3.12.10 · orcapod 0.1.0rc2.dev546+g9ed9b7201 · starfix 0.3.1 · pyarrow 24.0.0 · polars 1.41.2 · pydantic 2.13.4

TL;DR

The native pydantic/dataclass extension-type support (PR #183) works for round-tripping a
model to/from Parquet/IPC, but it cannot yet flow a model through a pipeline as a data
column
. Two independent defects fire:

  • Bug A — building a source that carries a pydantic/dataclass column crashes in the
    hasher: the generated pa.ExtensionType subclass is unhashable, and orcapod's
    schema-cleaner does not strip live extension types before handing the schema to starfix.
  • Bug BJoin round-trips the table through polars (pl.DataFrame(t).to_arrow()), and
    the pydantic/dataclass factory registers the Arrow extension type with a
    {"category": ...} metadata blob but the Polars extension type without it. The polars
    round-trip drops the metadata, and the Arrow type's strict __arrow_ext_deserialize__
    then rejects the now-empty metadata.

Both are needed to support the intended end-to-end use case:

# Broadcast a validated pydantic config to every pod as one content-hashed column.
cfg_src = op.sources.DictSource([{"config": cfg}], tag_columns=[],
                                data_schema=Schema({"config": MyConfig}))
@op.function_pod(output_keys=["out"])
def some_pod(x: int, config: MyConfig) -> str: ...
with job:
    some_pod.pod(join(src.select_data_columns(["x"]), cfg_src.select_data_columns("config")))

Built-in extension types (LogicalPath, LogicalUPath, LogicalUUID) are not affected by
either bug — see the per-bug "why built-ins survive" notes, which point straight at the fix.


Bug A — pydantic/dataclass extension-type columns are unhashable

Symptom

TypeError: unhashable type: '_ArrowExt___main___Cfg'

Minimal repro (repro_bug_a_hashing.py)

import pydantic
import orcapod as op
from orcapod.types import Schema
from orcapod.contexts import get_default_context


class Inner(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid", frozen=True)
    a: int


class Cfg(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid", frozen=True)
    name: str
    inner: Inner


cfg = Cfg(name="hello", inner=Inner(a=7))

# In a real pipeline this registration happens automatically when a FunctionPod
# annotated `config: Cfg` is declared; done explicitly here to keep it minimal.
get_default_context().type_converter.register_python_class(Cfg)

op.sources.DictSource([{"config": cfg}], tag_columns=[],
                      data_schema=Schema({"config": Cfg}))

Traceback (tail)

  File ".../orcapod/core/sources/dict_source.py", line 58, in __init__
    result = builder.build(...)
  File ".../orcapod/core/sources/stream_builder.py", line 172, in build
    table_hash = self._data_context.arrow_hasher.hash_table(table)
  File ".../orcapod/hashing/arrow_hashers.py", line 129, in hash_table
    digest = ArrowDigester.hash_table(clean_table, include_metadata=include_meta)
  File ".../starfix/arrow_digester.py", line 774, in __init__
    self._schema_digest = _hash_schema(schema, include_metadata=include_metadata)
  File ".../starfix/arrow_digester.py", line 194, in _serialized_schema
    "data_type": _data_type_to_value(field.type),
  File ".../starfix/arrow_digester.py", line 123, in _primitive_data_type_string
    if dt in _simple:
TypeError: unhashable type: '_ArrowExt___main___Cfg'

Root cause (three contributing factors)

  1. The generated extension type is unhashable.
    make_arrow_extension_type() (src/orcapod/extension_types/registry.py:36-103) synthesizes
    the class via type(name, (pa.ExtensionType,), {...}) defining only __init__,
    __arrow_ext_serialize__, and __arrow_ext_deserialize__. pa.ExtensionType defines
    __eq__, so Python sets __hash__ = None on the subclass → instances are unhashable.
    Verified:

    ext = get_default_context().type_converter.register_python_class(Cfg)
    type(ext).__hash__          # -> None
    hash(ext)                   # -> TypeError: unhashable type

    starfix's _primitive_data_type_string does if dt in _simple: (a set/dict membership
    test) which calls hash(dt) → crash.

  2. The schema-cleaner never strips live extension types.
    StarfixArrowHasher.hash_table (src/orcapod/hashing/arrow_hashers.py:114-130) only cleans
    the schema when has_extension_metadata(schema) is true. But
    has_extension_metadata / clean_schema_for_hashing
    (src/orcapod/hashing/schema_cleaner.py) key exclusively off field-level
    b"ARROW:extension:name" metadata (_has_extension_in_field, line ~133). For a live
    in-memory pa.ExtensionType column built by DictSource, field.metadata is None — the
    extension identity lives in the type object and __arrow_ext_serialize__(), not in field
    metadata. So has_extension_metadata(schema) == False, cleaning is skipped, and the raw
    unhashable type reaches starfix. (Even if forced, _clean_type has no pa.types.is_extension
    branch, so it would return the live extension type unchanged.) The
    perf(hashing): skip schema cleaning when no extension metadata present change hardens this skip.
    Verified:

    schema = pa.schema([pa.field("config", ext)])
    schema.field("config").metadata          # -> None
    ext.__arrow_ext_serialize__()            # -> b'{"category": "orcapod.pydantic"}'
    has_extension_metadata(schema)           # -> False   (so cleaning is skipped)
  3. No semantic hasher is registered for pydantic/dataclass.
    SemanticHashingVisitor.visit_extension (src/orcapod/hashing/visitors.py:196-231) would
    replace an extension column with a pa.large_binary() hash token before starfix — but only
    if a handler is registered for the column's Python type (the has_handler(python_type) guard
    at lines 213-216). The default context (src/orcapod/contexts/data/v0.1.json) registers the
    pydantic/dataclass logical-type factories (arrow conversion) but
    register_builtin_python_type_handlers (src/orcapod/hashing/semantic_hashing/builtin_handlers.py)
    registers semantic handlers only for Path, UPath, UUID, bytes, functions, type,
    pa.Table, etc. — not for pydantic.BaseModel or dataclasses. So pydantic/dataclass
    columns fall through unhashed.

Why built-ins survive

LogicalPath/LogicalUPath/LogicalUUID have registered semantic handlers, so
visit_extension replaces those columns with hash-token large_binary() before starfix
ever sees the (also-unhashable) extension type. pydantic/dataclass have no handler, so the raw
type reaches starfix.

Suggested fixes (any one unblocks; ideally factor 1 + one of 2/3)

  • Make generated extension types hashable — add "__hash__": ... to the type(...) dict in
    make_arrow_extension_type (e.g. hash on (extension_name, storage_type, metadata)), and the
    symmetric make_polars_extension_type. This alone stops the TypeError.
  • Teach the schema-cleaner about live extension types — have has_extension_metadata /
    clean_schema_for_hashing detect pa.types.is_extension(field.type) (not just field metadata)
    and lower it to its storage_type before starfix.
  • Register default semantic handlers for pydantic.BaseModel and dataclasses (mirroring the
    built-in handlers), so visit_extension tokenizes the column. This also defines the content
    identity
    of a model column (e.g. canonical model_dump_json() + qualified class name).

Bug B — pydantic/dataclass extension metadata is dropped across a polars round-trip (breaks Join)

Symptom

ValueError: Arrow extension type '__main__.Cfg':
  expected metadata b'{"category": "orcapod.pydantic"}' but got b''.

Triggered in normal pipelines by Join, which does
pl.DataFrame(table).join(...).to_arrow() (src/orcapod/core/operators/join.py:194-199).

Minimal repro (repro_bug_b_polars.py) — isolates polars as the culprit

import pyarrow as pa
import polars as pl
import pydantic
from orcapod.contexts import get_default_context


class Inner(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid", frozen=True)
    a: int


class Cfg(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra="forbid", frozen=True)
    name: str
    inner: Inner


ext = get_default_context().type_converter.register_python_class(Cfg)
storage = pa.array([{"name": "hi", "inner": {"a": 3}}], type=ext.storage_type)
tbl = pa.table({"config": pa.ExtensionArray.from_storage(ext, storage)})

# (1) pure pyarrow IPC round-trip — PRESERVES the extension type
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, tbl.schema) as w:
    w.write_table(tbl)
rt = pa.ipc.open_stream(sink.getvalue()).read_all()
assert isinstance(rt.schema.field("config").type, pa.ExtensionType)   # OK

# (2) polars round-trip — what Join does — DROPS metadata and raises
pl.DataFrame(tbl).to_arrow()

Output:

orig field type        : extension<__main__.Cfg<_ArrowExt___main___Cfg>>
pyarrow IPC round-trip : extension<__main__.Cfg<_ArrowExt___main___Cfg>> | preserved: True
ValueError: Arrow extension type '__main__.Cfg': expected metadata b'{"category": "orcapod.pydantic"}' but got b''.

Root cause — Arrow/Polars metadata asymmetry in the factory

PydanticLogicalType.__init__ (src/orcapod/extension_types/pydantic_logical_type_factory.py:82-93):

_metadata = json.dumps({"category": PYDANTIC_CATEGORY}).encode("utf-8")
self._arrow_ext_class  = make_arrow_extension_type(logical_name, storage_type, metadata=_metadata)  # WITH metadata
...
self._polars_ext_class = make_polars_extension_type(logical_name, storage_type)                     # WITHOUT metadata

The Arrow extension type is built with {"category": "orcapod.pydantic"}; the Polars
extension type is built without any metadata. On a polars→arrow round-trip, polars hands its
(empty) metadata to the Arrow type's __arrow_ext_deserialize__, whose strict equality check
rejects it (src/orcapod/extension_types/registry.py:53-56):

if serialized != _metadata:
    raise ValueError(
        f"Arrow extension type '{_name}': expected metadata {_metadata!r} but got {serialized!r}.")

DataclassLogicalType has the identical asymmetry
(src/orcapod/extension_types/dataclass_logical_type_factory.py:83-93, category "orcapod.dataclass").

Why built-ins survive

LogicalPath/LogicalUPath/LogicalUUID build their Arrow extension type without category
metadata (_metadata = b""), so after a polars round-trip the strict check compares b"" == b""
and passes. Verified — both survive the same pl.DataFrame(tbl).to_arrow() round-trip with the
extension type intact:

Path: polars round-trip -> extension<orcapod.path<...>>  | preserved-ext: True
UUID: polars round-trip -> extension<orcapod.uuid<...>>  | preserved-ext: True

Suggested fixes (either)

  • Pass the category metadata to the Polars extension type too
    make_polars_extension_type(logical_name, storage_type, metadata=_metadata.decode()) in both
    the pydantic and dataclass factories, so the round-trip preserves it. (Requires confirming
    polars actually round-trips metadata_str through to_arrow.)
  • Relax __arrow_ext_deserialize__ in make_arrow_extension_type to reconstruct from the
    registered category when serialized is empty/missing, instead of hard-raising on inequality.

Combined end-to-end check

With Bug A worked around (registering a pydantic semantic handler) and Bug B worked around
(broadcasting the model as a JSON string instead of an extension column), the
DictSource → Join → FunctionPod pipeline runs to completion — confirming these two defects are
the only blockers on the pydantic-column path, and that the model serialization/validation
machinery itself is sound.

Repro scripts

Both repros are embedded inline above (Bug A and Bug B sections) and are fully self-contained —
save each to a .py file and run with python <file>.py against orcapod main @ 9ed9b72.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions