
DM-41113: Add dimension record containers. #928

Merged: 14 commits merged into main on Dec 21, 2023
Conversation

@TallJimbo (Member) commented Dec 17, 2023

This adds two containers for DimensionRecord objects: DimensionRecordSet and DimensionRecordTable. The former is backed by a regular Python dict of DimensionRecord, and is probably what users will want to interact with for the most part. DimensionRecordTable is backed by an Arrow table, and is really only better at column access (which might prove to be popular) and (via Arrow) I/O.
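(To make the distinction concrete, here is a rough, generic sketch of the two layouts; the records and names below are made up for illustration and this is not the new lsst.daf.butler API.)

import pyarrow as pa

# Made-up visit-like records, purely for illustration.
records = [
    {"instrument": "HSC", "id": 903342, "physical_filter": "HSC-R"},
    {"instrument": "HSC", "id": 903343, "physical_filter": "HSC-I"},
]

# Set-like, row-oriented: a dict keyed by the identifying values gives cheap
# membership tests and per-record lookup.
by_data_id = {(r["instrument"], r["id"]): r for r in records}
by_data_id[("HSC", 903342)]["physical_filter"]  # fast per-record access

# Table-like, column-oriented: an Arrow table gives cheap whole-column access
# and compact (de)serialization.
table = pa.Table.from_pylist(records)
table.column("physical_filter")  # fast column access, easy I/O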

I'm not sure yet whether we'll want to make DimensionRecordSet serializable with Pydantic, for cases where the number of records is small and embedding them within other models is desirable; I've got code for that back on another branch that I'm holding back for now. I think the alternative is occasionally stuffing a base64-encoded Arrow binary serialization into a Pydantic model. How to serialize dimension records associated with "expanded" sets of DataCoordinate and DatasetRef is the big open question, and while I'm not ready to answer it yet, I've explored enough of the options on DM-41114 to feel pretty confident about what I've put on this branch.
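(For reference, a minimal sketch of the base64-encoded-Arrow alternative mentioned above; SerializedRecords, pack, and unpack are hypothetical names, not code from this branch.)

import base64

import pyarrow as pa
import pydantic


class SerializedRecords(pydantic.BaseModel):
    """Hypothetical wrapper embedding an Arrow table in a Pydantic model."""

    data: str  # base64-encoded Arrow IPC stream


def pack(table: pa.Table) -> SerializedRecords:
    # Write the table to an in-memory Arrow IPC stream, then base64-encode it.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return SerializedRecords(data=base64.b64encode(sink.getvalue().to_pybytes()).decode("ascii"))


def unpack(model: SerializedRecords) -> pa.Table:
    # Decode the string and read the stream back into an Arrow table.
    reader = pa.ipc.open_stream(pa.py_buffer(base64.b64decode(model.data)))
    return reader.read_all()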

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

codecov bot commented Dec 17, 2023

Codecov Report

Attention: 53 lines in your changes are missing coverage. Please review.

Comparison is base (480261b) 87.84% compared to head (8b1cc48) 87.94%.

Files                                                    Patch %   Lines
...sst/daf/butler/timespan_database_representation.py    88.81%    9 Missing and 8 partials ⚠️
python/lsst/daf/butler/column_spec.py                    91.76%    7 Missing ⚠️
python/lsst/daf/butler/dimensions/_schema.py             90.14%    5 Missing and 2 partials ⚠️
python/lsst/daf/butler/dimensions/_record_table.py       91.54%    4 Missing and 2 partials ⚠️
python/lsst/daf/butler/dimensions/_config.py             68.75%    4 Missing and 1 partial ⚠️
python/lsst/daf/butler/_compat.py                         0.00%    3 Missing ⚠️
python/lsst/daf/butler/dimensions/_elements.py           92.50%    2 Missing and 1 partial ⚠️
python/lsst/daf/butler/dimensions/_database.py           94.11%    1 Missing ⚠️
python/lsst/daf/butler/dimensions/_governor.py           92.85%    1 Missing ⚠️
python/lsst/daf/butler/dimensions/_record_set.py         99.03%    0 Missing and 1 partial ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #928      +/-   ##
==========================================
+ Coverage   87.84%   87.94%   +0.10%     
==========================================
  Files         295      301       +6     
  Lines       38429    39199     +770     
  Branches     8118     8276     +158     
==========================================
+ Hits        33757    34474     +717     
- Misses       3476     3524      +48     
- Partials     1196     1201       +5     


Comment on lines 57 to 56
class ColumnSpec(_BaseModelCompat, ABC):
    """Base class for descriptions of table columns."""

Contributor:

You might consider making ColumnSpec private to this file and treating the AnyColumnSpec union type as the exported, public one.

Anything that wants to be serializable with one of these as a property will have to use AnyColumnSpec, because parsing data via ColumnSpec will cause loss of information by only returning the base class.

ColumnSpec isn't assignable to AnyColumnSpec, so it will be awkward to copy data from non-serializable classes using ColumnSpec into serializable ones using AnyColumnSpec.
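(A small, generic Pydantic illustration of the information-loss point; the class names here are placeholders, not the ones in column_spec.py.)

from typing import Literal, Union

import pydantic


class _BaseSpec(pydantic.BaseModel):
    name: str


class IntSpec(_BaseSpec):
    type: Literal["int"] = "int"
    nbytes: int = 4


class StringSpec(_BaseSpec):
    type: Literal["string"] = "string"
    length: int = 32


AnySpec = Union[IntSpec, StringSpec]

data = {"name": "visit", "type": "int", "nbytes": 8}

# Validating through the base class silently drops the subclass fields...
lossy = _BaseSpec.model_validate(data)                        # _BaseSpec(name='visit')

# ...while the union resolves to the concrete subclass and keeps them.
full = pydantic.TypeAdapter(AnySpec).validate_python(data)    # IntSpec(..., nbytes=8)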

Member Author (@TallJimbo):

👍 I've also renamed AnyColumnSpec to just ColumnSpec and renamed the old ColumnSpec to _BaseColumnSpec.

    nullable: bool | None = pydantic.Field(
        default=None,
        description=textwrap.dedent(
            """\
            Whether the column may be ``NULL``.

            The default (represented by `None`) is context-dependent; some
            fields are ``NOT NULL``, while others default to nullable.
            """
        ),
    )
@dhirving (Contributor) commented Dec 20, 2023:

I'm not sure that we'd lose much by making nullable default to True instead of having it be a tri-state. We'd lose the ability to raise an error when a primary key is explicitly marked nullable, but having the application logic silently override that to non-nullable is probably not a big deal.

It would reduce all the Pydantic tomfoolery in _config.py to an easy-to-understand statement

for key in unique_keys:
    # Primary keys may not be nullable
    key.nullable = False

and would save downstream code from needing to fight with mypy about that None.

Member Author (@TallJimbo):

Good idea; done.

Comment on lines 302 to 258
AnyColumnSpec = Union[
    IntColumnSpec,
    StringColumnSpec,
    HashColumnSpec,
    FloatColumnSpec,
    BoolColumnSpec,
    RegionColumnSpec,
    TimespanColumnSpec,
]
Contributor:

It would be a good idea to declare this as a discriminated union by writing this as:

Annotated[Union[...], Field(discriminator='type')]

That will let Pydantic validate this more efficiently and generate more useful error messages when parsing fails. (Otherwise, it would tell you what didn't match for all 8 subclasses instead of just the 1 subclass that should have been validated.)
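(A trimmed sketch of the suggested pattern, with two hypothetical members standing in for the full set:)

from typing import Annotated, Literal, Union

import pydantic


class IntColumnSpec(pydantic.BaseModel):
    type: Literal["int"] = "int"


class StringColumnSpec(pydantic.BaseModel):
    type: Literal["string"] = "string"
    length: int = 32


# The discriminator tells Pydantic to dispatch on the `type` field directly,
# so a bad payload reports errors only for the one matching subclass.
AnyColumnSpec = Annotated[
    Union[IntColumnSpec, StringColumnSpec],
    pydantic.Field(discriminator="type"),
]

adapter = pydantic.TypeAdapter(AnyColumnSpec)
spec = adapter.validate_python({"type": "string", "length": 64})  # -> StringColumnSpec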

Member Author (@TallJimbo):

I've been holding off on doing that on all Union typedefs because I wasn't sure whether unions-of-unions would work properly if the inner ones declared a discriminator, so I'm curious if you know whether that will happen to work. But it's a moot point here since this is meant to be the union of all possible column spec types and hence probably won't ever be used to define some larger union, so 👍.

Contributor:

Not totally clear to me. Per the documentation it definitely works if the discriminators are different between the outer one and the inner one, but I don't think that's what you're asking about: https://docs.pydantic.dev/latest/concepts/unions/#nested-discriminated-unions

I haven't tried it but I don't see any reason something like:

A = Annotated[Union[...], Field(discriminator='type')]
B = Annotated[Union[...], Field(discriminator='type')]
Combined = Union[A, B]

wouldn't work, since in the combined Union it's checking each member separately. You'd get two separate error messages if you tried to include something that wasn't in either union but that's probably fine.

What's less clear (and is possibly documented to NOT work but it's slightly ambiguous):

# ????
Combined = Annotated[Union[A, B], Field(discriminator='type')]
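(One way to settle it empirically; a self-contained sketch, not code from this branch, and the nested case is deliberately wrapped so either outcome is visible.)

from typing import Annotated, Literal, Union

import pydantic


class A1(pydantic.BaseModel):
    type: Literal["a1"] = "a1"


class A2(pydantic.BaseModel):
    type: Literal["a2"] = "a2"


class B1(pydantic.BaseModel):
    type: Literal["b1"] = "b1"


class B2(pydantic.BaseModel):
    type: Literal["b2"] = "b2"


A = Annotated[Union[A1, A2], pydantic.Field(discriminator="type")]
B = Annotated[Union[B1, B2], pydantic.Field(discriminator="type")]

# The plain outer union should work, since each member is tried separately.
print(pydantic.TypeAdapter(Union[A, B]).validate_python({"type": "b1"}))

# Whether the discriminated outer union is accepted is the open question.
try:
    adapter = pydantic.TypeAdapter(Annotated[Union[A, B], pydantic.Field(discriminator="type")])
    print(adapter.validate_python({"type": "b1"}))
except Exception as err:  # schema construction may reject the nested form
    print(f"nested discriminated union rejected: {err}")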

},
)
self._required_value_fields = [pc.field(name) for name in self._element.schema.required.names]
batches = [self._make_batch(records, arrow_schema)]
Contributor:

It might be worth breaking the single batch up into multiple smaller batches -- especially if records is sometimes incrementally generated.

This would reduce peak memory usage by not having a duplicate copy of the full set of data in Python lists, as well as reducing the maximum size of the temporary string columns that will eventually be converted to dictionaries.

Member Author (@TallJimbo):

Good point. I don't have any intuition for what a good default chunk size would be, so I'll make it an optional __init__ parameter that defaults to None, which preserves the current single-batch behavior.
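(A rough sketch of that shape, with hypothetical helper names rather than the actual DimensionRecordTable code:)

from collections.abc import Iterable, Iterator

import pyarrow as pa


def _chunked(rows: Iterable[dict], chunk_size: int | None) -> Iterator[list[dict]]:
    # Yield lists of at most chunk_size rows; a single list when chunk_size is None.
    if chunk_size is None:
        yield list(rows)
        return
    batch: list[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= chunk_size:
            yield batch
            batch = []
    if batch:
        yield batch


def make_table(rows: Iterable[dict], schema: pa.Schema, chunk_size: int | None = None) -> pa.Table:
    # Each chunk becomes its own RecordBatch, so when rows are generated
    # incrementally the full row-oriented copy never exists in memory at once.
    batches = [pa.RecordBatch.from_pylist(chunk, schema=schema) for chunk in _chunked(rows, chunk_size)]
    return pa.Table.from_batches(batches, schema=schema)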

Dimension record representing a table row.
"""
return self._element.RecordClass(
**{k: table.column(j)[index].as_py() for j, k in enumerate(self._element.schema.all.names)}
Contributor:

Not at all necessary right now, but there might be a possibility for future optimization by lazily converting the data from the columns as they are used instead of immediately converting the entire row.
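(A sketch of that lazy-conversion idea; LazyRow is hypothetical and not what this branch does.)

import pyarrow as pa


class LazyRow:
    """Hypothetical read-only view of one Arrow table row.

    Values are converted to Python objects only when a field is accessed,
    instead of materializing every column of the row up front.
    """

    def __init__(self, table: pa.Table, index: int):
        self._table = table
        self._index = index

    def __getattr__(self, name: str):
        try:
            column = self._table.column(name)
        except KeyError:
            raise AttributeError(name) from None
        return column[self._index].as_py()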

Member Author (@TallJimbo):

In this context I'm more concerned with not holding on to the row-based inputs, so they can be garbage collected. But your instincts are good: there are clear analogs of this point elsewhere in the codebase, in terms of us doing work on DimensionRecord fields that are never accessed at all, and I've been mulling over possible changes to the DataCoordinate public interface to maybe help with that, since the alternative is a lot of lazy-loading that extends through many, many layers.

We've been including these in the configuration file for ages but
haven't done anything with them.

This separates things that depend on SQLAlchemy from things that
don't, and it makes it easier to avoid import cycles in other code
that tries to do the same sort of separation.

When we convert the dimensions.yaml config into a DimensionUniverse,
we've been converting the rather limited set of types supported in the
configuration into just a few of the very specific types used by
SQLAlchemy (in the ddl.FieldSpec objects), and that makes it hard to
write a mapping from the Python form of that information to some other
schema system (like Pydantic or Arrow).

The new ColumnSpec objects map more directly to what's in the config
(they're Pydantic models; this is one tiny step towards making
dimensions.yaml Pydantic-validated), and while right now they live
alongside the ddl objects everywhere (there are some duplicative
methods on the DimensionElement classes in particular), eventually I'd
like to replace the ddl stuff entirely with direct usage of SQLAlchemy
Table and Column objects.  Fully removing the ddl stuff will require
an RFC as well, since they leak into the public interface a bit via
RecordClass.fields as well as the DimensionElements.

It should be safe to revert this commit once we require Pydantic v2.

The new documenteer does not support the pipelines.lsst.io theme, and
we can't migrate until the rest of pipelines.lsst.io does.
@TallJimbo TallJimbo merged commit 747d2cb into main Dec 21, 2023
18 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-41113 branch December 21, 2023 18:30