orcapod-python: two bugs block pydantic/dataclass models as pipeline columns
Repo / commit: walkerlab/orcapod-python @ main (9ed9b72016030ae6e122e8c7b051ef120b0f82bc)
Environment: Python 3.12.10 · orcapod 0.1.0rc2.dev546+g9ed9b7201 · starfix 0.3.1 · pyarrow 24.0.0 · polars 1.41.2 · pydantic 2.13.4
TL;DR
The native pydantic/dataclass extension-type support (PR #183) works for round-tripping a
model to/from Parquet/IPC, but it cannot yet flow a model through a pipeline as a data
column. Two independent defects fire:
- Bug A — building a source that carries a pydantic/dataclass column crashes in the
hasher: the generated pa.ExtensionType subclass is unhashable, and orcapod's
schema-cleaner does not strip live extension types before handing the schema to starfix.
- Bug B —
Join round-trips the table through polars (pl.DataFrame(t).to_arrow()), and
the pydantic/dataclass factory registers the Arrow extension type with a
{"category": ...} metadata blob but the Polars extension type without it. The polars
round-trip drops the metadata, and the Arrow type's strict __arrow_ext_deserialize__
then rejects the now-empty metadata.
Both are needed to support the intended end-to-end use case:
# Broadcast a validated pydantic config to every pod as one content-hashed column.
cfg_src = op.sources.DictSource([{"config": cfg}], tag_columns=[],
data_schema=Schema({"config": MyConfig}))
@op.function_pod(output_keys=["out"])
def some_pod(x: int, config: MyConfig) -> str: ...
with job:
some_pod.pod(join(src.select_data_columns(["x"]), cfg_src.select_data_columns("config")))
Built-in extension types (LogicalPath, LogicalUPath, LogicalUUID) are not affected by
either bug — see the per-bug "why built-ins survive" notes, which point straight at the fix.
Bug A — pydantic/dataclass extension-type columns are unhashable
Symptom
TypeError: unhashable type: '_ArrowExt___main___Cfg'
Minimal repro (repro_bug_a_hashing.py)
import pydantic
import orcapod as op
from orcapod.types import Schema
from orcapod.contexts import get_default_context
class Inner(pydantic.BaseModel):
model_config = pydantic.ConfigDict(extra="forbid", frozen=True)
a: int
class Cfg(pydantic.BaseModel):
model_config = pydantic.ConfigDict(extra="forbid", frozen=True)
name: str
inner: Inner
cfg = Cfg(name="hello", inner=Inner(a=7))
# In a real pipeline this registration happens automatically when a FunctionPod
# annotated `config: Cfg` is declared; done explicitly here to keep it minimal.
get_default_context().type_converter.register_python_class(Cfg)
op.sources.DictSource([{"config": cfg}], tag_columns=[],
data_schema=Schema({"config": Cfg}))
Traceback (tail)
File ".../orcapod/core/sources/dict_source.py", line 58, in __init__
result = builder.build(...)
File ".../orcapod/core/sources/stream_builder.py", line 172, in build
table_hash = self._data_context.arrow_hasher.hash_table(table)
File ".../orcapod/hashing/arrow_hashers.py", line 129, in hash_table
digest = ArrowDigester.hash_table(clean_table, include_metadata=include_meta)
File ".../starfix/arrow_digester.py", line 774, in __init__
self._schema_digest = _hash_schema(schema, include_metadata=include_metadata)
File ".../starfix/arrow_digester.py", line 194, in _serialized_schema
"data_type": _data_type_to_value(field.type),
File ".../starfix/arrow_digester.py", line 123, in _primitive_data_type_string
if dt in _simple:
TypeError: unhashable type: '_ArrowExt___main___Cfg'
Root cause (three contributing factors)
-
The generated extension type is unhashable.
make_arrow_extension_type() (src/orcapod/extension_types/registry.py:36-103) synthesizes
the class via type(name, (pa.ExtensionType,), {...}) defining only __init__,
__arrow_ext_serialize__, and __arrow_ext_deserialize__. pa.ExtensionType defines
__eq__, so Python sets __hash__ = None on the subclass → instances are unhashable.
Verified:
ext = get_default_context().type_converter.register_python_class(Cfg)
type(ext).__hash__ # -> None
hash(ext) # -> TypeError: unhashable type
starfix's _primitive_data_type_string does if dt in _simple: (a set/dict membership
test) which calls hash(dt) → crash.
-
The schema-cleaner never strips live extension types.
StarfixArrowHasher.hash_table (src/orcapod/hashing/arrow_hashers.py:114-130) only cleans
the schema when has_extension_metadata(schema) is true. But
has_extension_metadata / clean_schema_for_hashing
(src/orcapod/hashing/schema_cleaner.py) key exclusively off field-level
b"ARROW:extension:name" metadata (_has_extension_in_field, line ~133). For a live
in-memory pa.ExtensionType column built by DictSource, field.metadata is None — the
extension identity lives in the type object and __arrow_ext_serialize__(), not in field
metadata. So has_extension_metadata(schema) == False, cleaning is skipped, and the raw
unhashable type reaches starfix. (Even if forced, _clean_type has no pa.types.is_extension
branch, so it would return the live extension type unchanged.) The
perf(hashing): skip schema cleaning when no extension metadata present change hardens this skip.
Verified:
schema = pa.schema([pa.field("config", ext)])
schema.field("config").metadata # -> None
ext.__arrow_ext_serialize__() # -> b'{"category": "orcapod.pydantic"}'
has_extension_metadata(schema) # -> False (so cleaning is skipped)
-
No semantic hasher is registered for pydantic/dataclass.
SemanticHashingVisitor.visit_extension (src/orcapod/hashing/visitors.py:196-231) would
replace an extension column with a pa.large_binary() hash token before starfix — but only
if a handler is registered for the column's Python type (the has_handler(python_type) guard
at lines 213-216). The default context (src/orcapod/contexts/data/v0.1.json) registers the
pydantic/dataclass logical-type factories (arrow conversion) but
register_builtin_python_type_handlers (src/orcapod/hashing/semantic_hashing/builtin_handlers.py)
registers semantic handlers only for Path, UPath, UUID, bytes, functions, type,
pa.Table, etc. — not for pydantic.BaseModel or dataclasses. So pydantic/dataclass
columns fall through unhashed.
Why built-ins survive
LogicalPath/LogicalUPath/LogicalUUID have registered semantic handlers, so
visit_extension replaces those columns with hash-token large_binary() before starfix
ever sees the (also-unhashable) extension type. pydantic/dataclass have no handler, so the raw
type reaches starfix.
Suggested fixes (any one unblocks; ideally factor 1 + one of 2/3)
- Make generated extension types hashable — add
"__hash__": ... to the type(...) dict in
make_arrow_extension_type (e.g. hash on (extension_name, storage_type, metadata)), and the
symmetric make_polars_extension_type. This alone stops the TypeError.
- Teach the schema-cleaner about live extension types — have
has_extension_metadata /
clean_schema_for_hashing detect pa.types.is_extension(field.type) (not just field metadata)
and lower it to its storage_type before starfix.
- Register default semantic handlers for
pydantic.BaseModel and dataclasses (mirroring the
built-in handlers), so visit_extension tokenizes the column. This also defines the content
identity of a model column (e.g. canonical model_dump_json() + qualified class name).
Bug B — pydantic/dataclass extension metadata is dropped across a polars round-trip (breaks Join)
Symptom
ValueError: Arrow extension type '__main__.Cfg':
expected metadata b'{"category": "orcapod.pydantic"}' but got b''.
Triggered in normal pipelines by Join, which does
pl.DataFrame(table).join(...).to_arrow() (src/orcapod/core/operators/join.py:194-199).
Minimal repro (repro_bug_b_polars.py) — isolates polars as the culprit
import pyarrow as pa
import polars as pl
import pydantic
from orcapod.contexts import get_default_context
class Inner(pydantic.BaseModel):
model_config = pydantic.ConfigDict(extra="forbid", frozen=True)
a: int
class Cfg(pydantic.BaseModel):
model_config = pydantic.ConfigDict(extra="forbid", frozen=True)
name: str
inner: Inner
ext = get_default_context().type_converter.register_python_class(Cfg)
storage = pa.array([{"name": "hi", "inner": {"a": 3}}], type=ext.storage_type)
tbl = pa.table({"config": pa.ExtensionArray.from_storage(ext, storage)})
# (1) pure pyarrow IPC round-trip — PRESERVES the extension type
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, tbl.schema) as w:
w.write_table(tbl)
rt = pa.ipc.open_stream(sink.getvalue()).read_all()
assert isinstance(rt.schema.field("config").type, pa.ExtensionType) # OK
# (2) polars round-trip — what Join does — DROPS metadata and raises
pl.DataFrame(tbl).to_arrow()
Output:
orig field type : extension<__main__.Cfg<_ArrowExt___main___Cfg>>
pyarrow IPC round-trip : extension<__main__.Cfg<_ArrowExt___main___Cfg>> | preserved: True
ValueError: Arrow extension type '__main__.Cfg': expected metadata b'{"category": "orcapod.pydantic"}' but got b''.
Root cause — Arrow/Polars metadata asymmetry in the factory
PydanticLogicalType.__init__ (src/orcapod/extension_types/pydantic_logical_type_factory.py:82-93):
_metadata = json.dumps({"category": PYDANTIC_CATEGORY}).encode("utf-8")
self._arrow_ext_class = make_arrow_extension_type(logical_name, storage_type, metadata=_metadata) # WITH metadata
...
self._polars_ext_class = make_polars_extension_type(logical_name, storage_type) # WITHOUT metadata
The Arrow extension type is built with {"category": "orcapod.pydantic"}; the Polars
extension type is built without any metadata. On a polars→arrow round-trip, polars hands its
(empty) metadata to the Arrow type's __arrow_ext_deserialize__, whose strict equality check
rejects it (src/orcapod/extension_types/registry.py:53-56):
if serialized != _metadata:
raise ValueError(
f"Arrow extension type '{_name}': expected metadata {_metadata!r} but got {serialized!r}.")
DataclassLogicalType has the identical asymmetry
(src/orcapod/extension_types/dataclass_logical_type_factory.py:83-93, category "orcapod.dataclass").
Why built-ins survive
LogicalPath/LogicalUPath/LogicalUUID build their Arrow extension type without category
metadata (_metadata = b""), so after a polars round-trip the strict check compares b"" == b""
and passes. Verified — both survive the same pl.DataFrame(tbl).to_arrow() round-trip with the
extension type intact:
Path: polars round-trip -> extension<orcapod.path<...>> | preserved-ext: True
UUID: polars round-trip -> extension<orcapod.uuid<...>> | preserved-ext: True
Suggested fixes (either)
- Pass the category metadata to the Polars extension type too —
make_polars_extension_type(logical_name, storage_type, metadata=_metadata.decode()) in both
the pydantic and dataclass factories, so the round-trip preserves it. (Requires confirming
polars actually round-trips metadata_str through to_arrow.)
- Relax
__arrow_ext_deserialize__ in make_arrow_extension_type to reconstruct from the
registered category when serialized is empty/missing, instead of hard-raising on inequality.
Combined end-to-end check
With Bug A worked around (registering a pydantic semantic handler) and Bug B worked around
(broadcasting the model as a JSON string instead of an extension column), the
DictSource → Join → FunctionPod pipeline runs to completion — confirming these two defects are
the only blockers on the pydantic-column path, and that the model serialization/validation
machinery itself is sound.
Repro scripts
Both repros are embedded inline above (Bug A and Bug B sections) and are fully self-contained —
save each to a .py file and run with python <file>.py against orcapod main @ 9ed9b72.
orcapod-python: two bugs block pydantic/dataclass models as pipeline columns
Repo / commit:
walkerlab/orcapod-python@main(9ed9b72016030ae6e122e8c7b051ef120b0f82bc)Environment: Python 3.12.10 · orcapod
0.1.0rc2.dev546+g9ed9b7201· starfix0.3.1· pyarrow24.0.0· polars1.41.2· pydantic2.13.4TL;DR
The native pydantic/dataclass extension-type support (PR #183) works for round-tripping a
model to/from Parquet/IPC, but it cannot yet flow a model through a pipeline as a data
column. Two independent defects fire:
hasher: the generated
pa.ExtensionTypesubclass is unhashable, and orcapod'sschema-cleaner does not strip live extension types before handing the schema to starfix.
Joinround-trips the table through polars (pl.DataFrame(t).to_arrow()), andthe pydantic/dataclass factory registers the Arrow extension type with a
{"category": ...}metadata blob but the Polars extension type without it. The polarsround-trip drops the metadata, and the Arrow type's strict
__arrow_ext_deserialize__then rejects the now-empty metadata.
Both are needed to support the intended end-to-end use case:
Built-in extension types (
LogicalPath,LogicalUPath,LogicalUUID) are not affected byeither bug — see the per-bug "why built-ins survive" notes, which point straight at the fix.
Bug A — pydantic/dataclass extension-type columns are unhashable
Symptom
Minimal repro (
repro_bug_a_hashing.py)Traceback (tail)
Root cause (three contributing factors)
The generated extension type is unhashable.
make_arrow_extension_type()(src/orcapod/extension_types/registry.py:36-103) synthesizesthe class via
type(name, (pa.ExtensionType,), {...})defining only__init__,__arrow_ext_serialize__, and__arrow_ext_deserialize__.pa.ExtensionTypedefines__eq__, so Python sets__hash__ = Noneon the subclass → instances are unhashable.Verified:
starfix's
_primitive_data_type_stringdoesif dt in _simple:(aset/dictmembershiptest) which calls
hash(dt)→ crash.The schema-cleaner never strips live extension types.
StarfixArrowHasher.hash_table(src/orcapod/hashing/arrow_hashers.py:114-130) only cleansthe schema when
has_extension_metadata(schema)is true. Buthas_extension_metadata/clean_schema_for_hashing(
src/orcapod/hashing/schema_cleaner.py) key exclusively off field-levelb"ARROW:extension:name"metadata (_has_extension_in_field, line ~133). For a livein-memory
pa.ExtensionTypecolumn built byDictSource,field.metadata is None— theextension identity lives in the type object and
__arrow_ext_serialize__(), not in fieldmetadata. So
has_extension_metadata(schema) == False, cleaning is skipped, and the rawunhashable type reaches starfix. (Even if forced,
_clean_typehas nopa.types.is_extensionbranch, so it would return the live extension type unchanged.) The
perf(hashing): skip schema cleaning when no extension metadata presentchange hardens this skip.Verified:
No semantic hasher is registered for pydantic/dataclass.
SemanticHashingVisitor.visit_extension(src/orcapod/hashing/visitors.py:196-231) wouldreplace an extension column with a
pa.large_binary()hash token before starfix — but onlyif a handler is registered for the column's Python type (the
has_handler(python_type)guardat lines 213-216). The default context (
src/orcapod/contexts/data/v0.1.json) registers thepydantic/dataclass logical-type factories (arrow conversion) but
register_builtin_python_type_handlers(src/orcapod/hashing/semantic_hashing/builtin_handlers.py)registers semantic handlers only for
Path,UPath,UUID,bytes, functions,type,pa.Table, etc. — not forpydantic.BaseModelor dataclasses. So pydantic/dataclasscolumns fall through unhashed.
Why built-ins survive
LogicalPath/LogicalUPath/LogicalUUIDhave registered semantic handlers, sovisit_extensionreplaces those columns with hash-tokenlarge_binary()before starfixever sees the (also-unhashable) extension type. pydantic/dataclass have no handler, so the raw
type reaches starfix.
Suggested fixes (any one unblocks; ideally factor 1 + one of 2/3)
"__hash__": ...to thetype(...)dict inmake_arrow_extension_type(e.g. hash on(extension_name, storage_type, metadata)), and thesymmetric
make_polars_extension_type. This alone stops theTypeError.has_extension_metadata/clean_schema_for_hashingdetectpa.types.is_extension(field.type)(not just field metadata)and lower it to its
storage_typebefore starfix.pydantic.BaseModeland dataclasses (mirroring thebuilt-in handlers), so
visit_extensiontokenizes the column. This also defines the contentidentity of a model column (e.g. canonical
model_dump_json()+ qualified class name).Bug B — pydantic/dataclass extension metadata is dropped across a polars round-trip (breaks
Join)Symptom
Triggered in normal pipelines by
Join, which doespl.DataFrame(table).join(...).to_arrow()(src/orcapod/core/operators/join.py:194-199).Minimal repro (
repro_bug_b_polars.py) — isolates polars as the culpritOutput:
Root cause — Arrow/Polars metadata asymmetry in the factory
PydanticLogicalType.__init__(src/orcapod/extension_types/pydantic_logical_type_factory.py:82-93):The Arrow extension type is built with
{"category": "orcapod.pydantic"}; the Polarsextension type is built without any metadata. On a polars→arrow round-trip, polars hands its
(empty) metadata to the Arrow type's
__arrow_ext_deserialize__, whose strict equality checkrejects it (
src/orcapod/extension_types/registry.py:53-56):DataclassLogicalTypehas the identical asymmetry(
src/orcapod/extension_types/dataclass_logical_type_factory.py:83-93, category"orcapod.dataclass").Why built-ins survive
LogicalPath/LogicalUPath/LogicalUUIDbuild their Arrow extension type without categorymetadata (
_metadata = b""), so after a polars round-trip the strict check comparesb"" == b""and passes. Verified — both survive the same
pl.DataFrame(tbl).to_arrow()round-trip with theextension type intact:
Suggested fixes (either)
make_polars_extension_type(logical_name, storage_type, metadata=_metadata.decode())in boththe pydantic and dataclass factories, so the round-trip preserves it. (Requires confirming
polars actually round-trips
metadata_strthroughto_arrow.)__arrow_ext_deserialize__inmake_arrow_extension_typeto reconstruct from theregistered category when
serializedis empty/missing, instead of hard-raising on inequality.Combined end-to-end check
With Bug A worked around (registering a pydantic semantic handler) and Bug B worked around
(broadcasting the model as a JSON string instead of an extension column), the
DictSource → Join → FunctionPodpipeline runs to completion — confirming these two defects arethe only blockers on the pydantic-column path, and that the model serialization/validation
machinery itself is sound.
Repro scripts
Both repros are embedded inline above (Bug A and Bug B sections) and are fully self-contained —
save each to a
.pyfile and run withpython <file>.pyagainst orcapodmain@9ed9b72.