DM-41113: Add dimension record containers. #928
Conversation
Force-pushed from 1dd8413 to 70dea23.
```python
class ColumnSpec(_BaseModelCompat, ABC):
    """Base class for descriptions of table columns."""
```
You might consider making `ColumnSpec` private to this file and treating the `AnyColumnSpec` union type as the exported, public one. Anything that wants to be serializable with one of these as a property will have to use `AnyColumnSpec`, because parsing data via `ColumnSpec` will cause loss of information by only returning the base class. `ColumnSpec` isn't assignable to `AnyColumnSpec`, so it will be awkward to copy data from non-serializable classes using `ColumnSpec` into serializable ones using `AnyColumnSpec`.
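A minimal sketch of the loss-of-information point (hypothetical models and field names, Pydantic v2 syntax, not the classes on this branch):

```python
from typing import Literal, Union

import pydantic

class _BaseColumnSpec(pydantic.BaseModel):
    name: str

class IntColumnSpec(_BaseColumnSpec):
    type: Literal["int"] = "int"

class StringColumnSpec(_BaseColumnSpec):
    type: Literal["string"] = "string"
    length: int

AnyColumnSpec = Union[IntColumnSpec, StringColumnSpec]

data = {"name": "visit_name", "type": "string", "length": 32}
# Parsing via the base class silently drops the subclass-only fields.
print(_BaseColumnSpec.model_validate(data))  # name='visit_name'
# Parsing via the union recovers the concrete subclass, length included.
print(pydantic.TypeAdapter(AnyColumnSpec).validate_python(data))
```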
👍 I've also renamed `AnyColumnSpec` to just `ColumnSpec` and renamed the old `ColumnSpec` to `_BaseColumnSpec`.
```python
nullable: bool | None = pydantic.Field(
    default=None,
    description=textwrap.dedent(
        """\
        Whether the column may be ``NULL``.

        The default (represented by `None`) is context-dependent; some
        fields are ``NOT NULL``, while others default to nullable.
        """
    ),
)
```
I'm not sure that we'd lose much by making `nullable` default to `True` instead of having it be a tri-state. We'd lose the ability to raise an error for the `True` case for primary keys, but having the application logic silently override that to non-nullable is probably not a big deal. It would reduce all the Pydantic tomfoolery in `_config.py` to an easy-to-understand statement:

```python
for key in unique_keys:
    # Primary keys may not be nullable.
    key.nullable = False
```

and would save downstream code from needing to fight with mypy about that `None`.
Good idea; done.
```python
AnyColumnSpec = Union[
    IntColumnSpec,
    StringColumnSpec,
    HashColumnSpec,
    FloatColumnSpec,
    BoolColumnSpec,
    RegionColumnSpec,
    TimespanColumnSpec,
]
```
It would be a good idea to declare this as a discriminated union by writing it as:

```python
Annotated[Union[...], Field(discriminator='type')]
```

That will let Pydantic validate this more efficiently and generate more useful error messages when parsing fails. (Otherwise, it would tell you what didn't match for all 8 subclasses instead of just the 1 subclass that should have been validated.)
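For instance, a hedged sketch with a hypothetical two-member union (Pydantic v2's `TypeAdapter` stands in for a containing model):

```python
from typing import Annotated, Literal, Union

import pydantic

class IntColumnSpec(pydantic.BaseModel):
    type: Literal["int"] = "int"
    name: str

class StringColumnSpec(pydantic.BaseModel):
    type: Literal["string"] = "string"
    name: str
    length: int

AnyColumnSpec = Annotated[
    Union[IntColumnSpec, StringColumnSpec],
    pydantic.Field(discriminator="type"),
]

adapter = pydantic.TypeAdapter(AnyColumnSpec)
try:
    # Missing `length`: only the StringColumnSpec branch is reported,
    # because the discriminator already picked that one subclass.
    adapter.validate_python({"type": "string", "name": "n"})
except pydantic.ValidationError as err:
    print(err)
```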
I've been holding off on doing that on all `Union` typedefs because I wasn't sure whether unions-of-unions would work properly if the inner ones declared a discriminator, so I'm curious whether you know if that will happen to work. But it's a moot point here, since this is meant to be the union of all possible column spec types and hence probably won't ever be used to define some larger union, so 👍.
Not totally clear to me. Per the documentation it definitely works if the discriminators are different between the outer one and the inner one, but I don't think that's what you're asking about: https://docs.pydantic.dev/latest/concepts/unions/#nested-discriminated-unions

I haven't tried it, but I don't see any reason something like:

```python
A = Annotated[Union[...], Field(discriminator='type')]
B = Annotated[Union[...], Field(discriminator='type')]
Combined = Union[A, B]
```

wouldn't work, since in the combined `Union` it's checking each member separately. You'd get two separate error messages if you tried to include something that wasn't in either union, but that's probably fine.

What's less clear (and is possibly documented to NOT work, but it's slightly ambiguous):

```python
# ????
Combined = Annotated[Union[A, B], Field(discriminator='type')]
```
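One way to settle it empirically (hypothetical models; the outcome depends on the installed Pydantic version, so I won't assert it here):

```python
from typing import Annotated, Literal, Union

import pydantic

class A1(pydantic.BaseModel):
    type: Literal["a1"] = "a1"

class A2(pydantic.BaseModel):
    type: Literal["a2"] = "a2"

class B1(pydantic.BaseModel):
    type: Literal["b1"] = "b1"

class B2(pydantic.BaseModel):
    type: Literal["b2"] = "b2"

A = Annotated[Union[A1, A2], pydantic.Field(discriminator="type")]
B = Annotated[Union[B1, B2], pydantic.Field(discriminator="type")]

# Plain union of discriminated unions: each member is checked separately.
print(pydantic.TypeAdapter(Union[A, B]).validate_python({"type": "b1"}))

# The ambiguous case: an outer discriminator over inner discriminated unions.
try:
    nested = pydantic.TypeAdapter(
        Annotated[Union[A, B], pydantic.Field(discriminator="type")]
    )
    print(nested.validate_python({"type": "b1"}))
except Exception as err:  # may fail at schema-build or validation time
    print(type(err).__name__, err)
```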
```python
    },
)
self._required_value_fields = [pc.field(name) for name in self._element.schema.required.names]
batches = [self._make_batch(records, arrow_schema)]
```
It might be worth breaking the single batch up into multiple smaller batches -- especially if `records` is sometimes incrementally generated. This would reduce peak memory usage by not having a duplicate copy of the full set of data in Python lists, as well as reducing the maximum size of the temporary string columns that will eventually be converted to dictionaries.
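A sketch of the shape this could take (hypothetical helper and schema, not this branch's code), chunking an incrementally generated row stream so the full row-oriented copy never exists at once:

```python
import itertools
from collections.abc import Iterable, Iterator

import pyarrow as pa

def iter_batches(
    rows: Iterable[dict], schema: pa.Schema, chunk_size: int = 10_000
) -> Iterator[pa.RecordBatch]:
    """Yield fixed-size Arrow batches from a (possibly lazy) row stream."""
    iterator = iter(rows)
    while chunk := list(itertools.islice(iterator, chunk_size)):
        yield pa.RecordBatch.from_pylist(chunk, schema=schema)

schema = pa.schema([("instrument", pa.string()), ("visit", pa.int64())])
rows = ({"instrument": "HSC", "visit": n} for n in range(25_000))
table = pa.Table.from_batches(iter_batches(rows, schema), schema=schema)
print(table.num_rows, [batch.num_rows for batch in table.to_batches()])
```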
Good point. I don't have any intuition for what a good default chunk size would be, so I'll make it an optional `__init__` parameter that defaults to `None` and hence preserves the current single-batch behavior.
```python
            Dimension record representing a table row.
        """
        return self._element.RecordClass(
            **{k: table.column(j)[index].as_py() for j, k in enumerate(self._element.schema.all.names)}
```
Not at all necessary right now, but there might be a possibility for future optimization by lazily converting the data from the columns as they are used instead of immediately converting the entire row.
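A hedged sketch of what that lazy conversion could look like (hypothetical wrapper, not the PR's `DimensionRecord` machinery):

```python
import pyarrow as pa

class LazyRow:
    """Row view over a pyarrow.Table that defers per-cell .as_py() calls."""

    def __init__(self, table: pa.Table, index: int):
        self._table = table
        self._index = index

    def __getattr__(self, name: str):
        try:
            column = self._table.column(name)
        except KeyError:
            raise AttributeError(name) from None
        # Only the requested cell is converted to a Python object.
        return column[self._index].as_py()

table = pa.table({"instrument": ["HSC"], "visit": [42]})
row = LazyRow(table, 0)
print(row.visit)  # converts just this one cell
```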
In this context I'm more concerned with not holding on to the row-based inputs, so they can be garbage collected. But your instincts are good: there are clear analogs of this point elsewhere in the codebase, in terms of us doing work on `DimensionRecord` fields that are never accessed at all, and I've been mulling over possible changes to the `DataCoordinate` public interface to maybe help with that, since the alternative is a lot of lazy-loading that extends through many, many layers.
We've been including these in the configuration file for ages but haven't done anything with them.
This separates things that depend on SQLAlchemy from things that don't, and it makes it easier to avoid import cycles in other code that tries to do the same sort of separation.
When we convert the dimensions.yaml config into a DimensionUniverse, we've been converting the rather limited set of types supported in the configuration into just a few of the very specific types used by SQLAlchemy (in the ddl.FieldSpec objects), and that makes it hard to write a mapping from the Python form of that information to some other schema system (like Pydantic or Arrow). The new ColumnSpec objects map more directly to what's in the config (they're Pydantic models; this is one tiny step towards making dimensions.yaml Pydantic-validated), and while right now they live alongside the ddl objects everywhere (there are some duplicative methods on the DimensionElement classes in particular), eventually I'd like to replace the ddl stuff entirely with direct usage of SQLAlchemy Table and Column objects. Fully removing the ddl stuff will require an RFC as well, since they leak into the public interface a bit via RecordClass.fields as well as the DimensionElements.
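For concreteness, a hedged illustration (hypothetical type names and mapping; the real ones come from dimensions.yaml and the new ColumnSpec classes) of the kind of direct config-to-Arrow mapping this enables:

```python
import pyarrow as pa

# Assumed config-level type names; the real set lives in dimensions.yaml.
_ARROW_TYPES = {
    "int": pa.int64(),
    "string": pa.string(),
    "float": pa.float64(),
    "bool": pa.bool_(),
}

def to_arrow_field(name: str, type_name: str, nullable: bool = True) -> pa.Field:
    """Map a config-level column description to an Arrow schema field."""
    return pa.field(name, _ARROW_TYPES[type_name], nullable=nullable)

print(to_arrow_field("visit", "int", nullable=False))
```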
It should be safe to revert this commit once we require Pydantic v2.
The new documenteer does not support the pipelines.lsst.io theme, and we can't migrate until the rest of pipelines.lsst.io does.
Force-pushed from b832a39 to 61812a9.
This adds two containers for `DimensionRecord` objects: `DimensionRecordSet` and `DimensionRecordTable`. The former is backed by a regular Python `dict` of `DimensionRecord`, and is probably what users will want to interact with for the most part. `DimensionRecordTable` is backed by an Arrow table, and is really better only at column access (which might prove to be popular) and (via Arrow) I/O.

I'm not sure yet whether we'll want to make `DimensionRecordSet` serializable with Pydantic, for cases where the number of records is small and embedding them within other models is desirable; I've got code for that back on another branch that I'm holding back for now. I think the alternative is occasionally stuffing a base64-encoded Arrow binary serialization into a Pydantic model. How to serialize dimension records associated with "expanded" sets of `DataCoordinate` and `DatasetRef` is the big open question, and while I'm not ready to answer it yet, I've explored enough of the options on DM-41114 to feel pretty confident about what I've put on this branch.

Checklist

- doc/changes