
DM-41113: Add dimension record containers. #928

Merged: 14 commits merged into main on Dec 21, 2023
Conversation

@TallJimbo (Member) commented Dec 17, 2023

This adds two containers for DimensionRecord objects: DimensionRecordSet and DimensionRecordTable. The former is backed by a regular Python dict of DimensionRecord, and is probably what users will want to interact with for the most part. DimensionRecordTable is backed by an Arrow table, and is really only better at column access (which might prove to be popular) and (via Arrow) I/O.
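(To make the distinction concrete, here is a rough, generic sketch of the two layouts; the records and names below are made up for illustration and this is not the new lsst.daf.butler API.)

import pyarrow as pa

# Made-up visit-like records, purely for illustration.
records = [
    {"instrument": "HSC", "id": 903342, "physical_filter": "HSC-R"},
    {"instrument": "HSC", "id": 903343, "physical_filter": "HSC-I"},
]

# Set-like, row-oriented: a dict keyed by the identifying values gives cheap
# membership tests and per-record lookup.
by_data_id = {(r["instrument"], r["id"]): r for r in records}
by_data_id[("HSC", 903342)]["physical_filter"]  # fast per-record access

# Table-like, column-oriented: an Arrow table gives cheap whole-column access
# and compact (de)serialization.
table = pa.Table.from_pylist(records)
table.column("physical_filter")  # fast column access, easy I/O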

I'm not sure yet whether we'll want to make DimensionRecordSet serializable with Pydantic, for cases where the number of records is small and embedding them within other models is desirable; I've got code for that back on another branch that I'm holding back for now. I think the alternative is occasionally stuffing a base64-encoded Arrow binary serialization into a Pydantic model. How to serialize dimension records associated with "expanded" sets of DataCoordinate and DatasetRef is the big open question, and while I'm not ready to answer it yet, I've explored enough of the options on DM-41114 to feel pretty confident about what I've put on this branch.
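(For reference, a minimal sketch of the base64-encoded-Arrow alternative mentioned above; SerializedRecords, pack, and unpack are hypothetical names, not code from this branch.)

import base64

import pyarrow as pa
import pydantic


class SerializedRecords(pydantic.BaseModel):
    """Hypothetical wrapper embedding an Arrow table in a Pydantic model."""

    data: str  # base64-encoded Arrow IPC stream


def pack(table: pa.Table) -> SerializedRecords:
    # Write the table to an in-memory Arrow IPC stream, then base64-encode it.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return SerializedRecords(data=base64.b64encode(sink.getvalue().to_pybytes()).decode("ascii"))


def unpack(model: SerializedRecords) -> pa.Table:
    # Decode the string and read the stream back into an Arrow table.
    reader = pa.ipc.open_stream(pa.py_buffer(base64.b64decode(model.data)))
    return reader.read_all()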

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

codecov bot commented Dec 17, 2023

Codecov Report

Attention: 53 lines in your changes are missing coverage. Please review.

Comparison is base (480261b) 87.84% compared to head (8b1cc48) 87.94%.

Files                                                    Patch %   Lines
...sst/daf/butler/timespan_database_representation.py    88.81%    9 Missing and 8 partials ⚠️
python/lsst/daf/butler/column_spec.py                    91.76%    7 Missing ⚠️
python/lsst/daf/butler/dimensions/_schema.py             90.14%    5 Missing and 2 partials ⚠️
python/lsst/daf/butler/dimensions/_record_table.py       91.54%    4 Missing and 2 partials ⚠️
python/lsst/daf/butler/dimensions/_config.py             68.75%    4 Missing and 1 partial ⚠️
python/lsst/daf/butler/_compat.py                         0.00%    3 Missing ⚠️
python/lsst/daf/butler/dimensions/_elements.py           92.50%    2 Missing and 1 partial ⚠️
python/lsst/daf/butler/dimensions/_database.py           94.11%    1 Missing ⚠️
python/lsst/daf/butler/dimensions/_governor.py           92.85%    1 Missing ⚠️
python/lsst/daf/butler/dimensions/_record_set.py         99.03%    0 Missing and 1 partial ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #928      +/-   ##
==========================================
+ Coverage   87.84%   87.94%   +0.10%     
==========================================
  Files         295      301       +6     
  Lines       38429    39199     +770     
  Branches     8118     8276     +158     
==========================================
+ Hits        33757    34474     +717     
- Misses       3476     3524      +48     
- Partials     1196     1201       +5     


Comment on lines 57 to 56
class ColumnSpec(_BaseModelCompat, ABC):
    """Base class for descriptions of table columns."""

Contributor:

You might consider making ColumnSpec private to this file and treating the AnyColumnSpec union type as the exported, public one.

Anything that wants to be serializable with one of these as a property will have to use AnyColumnSpec, because parsing data via ColumnSpec will cause loss of information by only returning the base class.

ColumnSpec isn't assignable to AnyColumnSpec, so it will be awkward to copy data from non-serializable classes using ColumnSpec into serializable ones using AnyColumnSpec.
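(A small, generic Pydantic illustration of the information-loss point; the class names here are placeholders, not the ones in column_spec.py.)

from typing import Literal, Union

import pydantic


class _BaseSpec(pydantic.BaseModel):
    name: str


class IntSpec(_BaseSpec):
    type: Literal["int"] = "int"
    nbytes: int = 4


class StringSpec(_BaseSpec):
    type: Literal["string"] = "string"
    length: int = 32


AnySpec = Union[IntSpec, StringSpec]

data = {"name": "visit", "type": "int", "nbytes": 8}

# Validating through the base class silently drops the subclass fields...
lossy = _BaseSpec.model_validate(data)                        # _BaseSpec(name='visit')

# ...while the union resolves to the concrete subclass and keeps them.
full = pydantic.TypeAdapter(AnySpec).validate_python(data)    # IntSpec(..., nbytes=8)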

Member Author (@TallJimbo):

👍 I've also renamed AnyColumnSpec to just ColumnSpec and renamed the old ColumnSpec to _BaseColumnSpec.

    nullable: bool | None = pydantic.Field(
        default=None,
        description=textwrap.dedent(
            """\
            Whether the column may be ``NULL``.

            The default (represented by `None`) is context-dependent; some
            fields are ``NOT NULL``, while others default to nullable.
            """
        ),
    )
@dhirving (Contributor) commented Dec 20, 2023:

I'm not sure that we'd lose much by making nullable default to True instead of having it be a tri-state. We'd lose the ability to raise an error when a primary key is explicitly marked nullable, but having the application logic silently override that to non-nullable is probably not a big deal.

It would reduce all the Pydantic tomfoolery in _config.py to an easy-to-understand statement

for key in unique_keys:
    # Primary keys may not be nullable
    key.nullable = False

and would save downstream code from needing to fight with mypy about that None.

Member Author (@TallJimbo):

Good idea; done.

Comment on lines 302 to 258
AnyColumnSpec = Union[
    IntColumnSpec,
    StringColumnSpec,
    HashColumnSpec,
    FloatColumnSpec,
    BoolColumnSpec,
    RegionColumnSpec,
    TimespanColumnSpec,
]
Contributor:

It would be a good idea to declare this as a discriminated union by writing this as:

Annotated[Union[...], Field(discriminator='type')]

That will let Pydantic validate this more efficiently and generate more useful error messages when parsing fails. (Otherwise, it would tell you what didn't match for all 8 subclasses instead of just the 1 subclass that should have been validated.)
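(A trimmed sketch of the suggested pattern, with two hypothetical members standing in for the full set:)

from typing import Annotated, Literal, Union

import pydantic


class IntColumnSpec(pydantic.BaseModel):
    type: Literal["int"] = "int"


class StringColumnSpec(pydantic.BaseModel):
    type: Literal["string"] = "string"
    length: int = 32


# The discriminator tells Pydantic to dispatch on the `type` field directly,
# so a bad payload reports errors only for the one matching subclass.
AnyColumnSpec = Annotated[
    Union[IntColumnSpec, StringColumnSpec],
    pydantic.Field(discriminator="type"),
]

adapter = pydantic.TypeAdapter(AnyColumnSpec)
spec = adapter.validate_python({"type": "string", "length": 64})  # -> StringColumnSpec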

Member Author (@TallJimbo):

I've been holding off on doing that on all Union typedefs because I wasn't sure whether unions-of-unions would work properly if the inner ones declared a discriminator, so I'm curious if you know whether that will happen to work. But it's a moot point here since this is meant to be the union of all possible column spec types and hence probably won't ever be used to define some larger union, so 👍.

Contributor:

Not totally clear to me. Per the documentation it definitely works if the discriminators are different between the outer one and the inner one, but I don't think that's what you're asking about: https://docs.pydantic.dev/latest/concepts/unions/#nested-discriminated-unions

I haven't tried it but I don't see any reason something like:

A = Annotated[Union[...], Field(discriminator='type')]
B = Annotated[Union[...], Field(discriminator='type')]
Combined = Union[A, B]

wouldn't work, since in the combined Union it's checking each member separately. You'd get two separate error messages if you tried to include something that wasn't in either union but that's probably fine.

What's less clear (and is possibly documented to NOT work but it's slightly ambiguous):

# ????
Combined = Annotated[Union[A, B], Field(discriminator='type')]
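(One way to settle it empirically; a self-contained sketch, not code from this branch, and the nested case is deliberately wrapped so either outcome is visible.)

from typing import Annotated, Literal, Union

import pydantic


class A1(pydantic.BaseModel):
    type: Literal["a1"] = "a1"


class A2(pydantic.BaseModel):
    type: Literal["a2"] = "a2"


class B1(pydantic.BaseModel):
    type: Literal["b1"] = "b1"


class B2(pydantic.BaseModel):
    type: Literal["b2"] = "b2"


A = Annotated[Union[A1, A2], pydantic.Field(discriminator="type")]
B = Annotated[Union[B1, B2], pydantic.Field(discriminator="type")]

# The plain outer union should work, since each member is tried separately.
print(pydantic.TypeAdapter(Union[A, B]).validate_python({"type": "b1"}))

# Whether the discriminated outer union is accepted is the open question.
try:
    adapter = pydantic.TypeAdapter(Annotated[Union[A, B], pydantic.Field(discriminator="type")])
    print(adapter.validate_python({"type": "b1"}))
except Exception as err:  # schema construction may reject the nested form
    print(f"nested discriminated union rejected: {err}")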

},
)
self._required_value_fields = [pc.field(name) for name in self._element.schema.required.names]
batches = [self._make_batch(records, arrow_schema)]
Contributor:

It might be worth breaking the single batch up into multiple smaller batches -- especially if records is sometimes incrementally generated.

This would reduce peak memory usage by not having a duplicate copy of the full set of data in Python lists, as well as reducing the maximum size of the temporary string columns that will eventually be converted to dictionaries.

Member Author (@TallJimbo):

Good point. I don't have any intuition for what a good default chunk size would be, so I'll make it an optional __init__ parameter that defaults to None, which preserves the current single-batch behavior.
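(A rough sketch of that shape, with hypothetical helper names rather than the actual DimensionRecordTable code:)

from collections.abc import Iterable, Iterator

import pyarrow as pa


def _chunked(rows: Iterable[dict], chunk_size: int | None) -> Iterator[list[dict]]:
    # Yield lists of at most chunk_size rows; a single list when chunk_size is None.
    if chunk_size is None:
        yield list(rows)
        return
    batch: list[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= chunk_size:
            yield batch
            batch = []
    if batch:
        yield batch


def make_table(rows: Iterable[dict], schema: pa.Schema, chunk_size: int | None = None) -> pa.Table:
    # Each chunk becomes its own RecordBatch, so when rows are generated
    # incrementally the full row-oriented copy never exists in memory at once.
    batches = [pa.RecordBatch.from_pylist(chunk, schema=schema) for chunk in _chunked(rows, chunk_size)]
    return pa.Table.from_batches(batches, schema=schema)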

Dimension record representing a table row.
"""
return self._element.RecordClass(
**{k: table.column(j)[index].as_py() for j, k in enumerate(self._element.schema.all.names)}
Contributor:

Not at all necessary right now, but there might be a possibility for future optimization by lazily converting the data from the columns as they are used instead of immediately converting the entire row.
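(A sketch of that lazy-conversion idea; LazyRow is hypothetical and not what this branch does.)

import pyarrow as pa


class LazyRow:
    """Hypothetical read-only view of one Arrow table row.

    Values are converted to Python objects only when a field is accessed,
    instead of materializing every column of the row up front.
    """

    def __init__(self, table: pa.Table, index: int):
        self._table = table
        self._index = index

    def __getattr__(self, name: str):
        try:
            column = self._table.column(name)
        except KeyError:
            raise AttributeError(name) from None
        return column[self._index].as_py()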

Member Author (@TallJimbo):

In this context I'm more concerned with not holding on to the row-based inputs, so they can be garbage collected. But your instincts are good: there are clear analogs of this point elsewhere in the codebase, in terms of us doing work on DimensionRecord fields that are never accessed at all, and I've been mulling over possible changes to the DataCoordinate public interface to maybe help with that, since the alternative is a lot of lazy-loading that extends through many, many layers.

We've been including these in the configuration file for ages but
haven't done anything with them.

This separates things that depend on SQLAlchemy from things that
don't, and it makes it easier to avoid import cycles in other code
that tries to do the same sort of separation.

When we convert the dimensions.yaml config into a DimensionUniverse,
we've been converting the rather limited set of types supported in the
configuration into just a few of the very specific types used by
SQLAlchemy (in the ddl.FieldSpec objects), and that makes it hard to
write a mapping from the Python form of that information to some other
schema system (like Pydantic or Arrow).

The new ColumnSpec objects map more directly to what's in the config
(they're Pydantic models; this is one tiny step towards making
dimensions.yaml Pydantic-validated), and while right now they live
alongside the ddl objects everywhere (there are some duplicative
methods on the DimensionElement classes in particular), eventually I'd
like to replace the ddl stuff entirely with direct usage of SQLAlchemy
Table and Column objects.  Fully removing the ddl stuff will require
an RFC as well, since they leak into the public interface a bit via
RecordClass.fields as well as the DimensionElements.

It should be safe to revert this commit once we require Pydantic v2.

The new documenteer does not support the pipelines.lsst.io theme, and
we can't migrate until the rest of pipelines.lsst.io does.
@TallJimbo TallJimbo merged commit 747d2cb into main Dec 21, 2023
18 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-41113 branch December 21, 2023 18:30