DM-27147: Fix collection docs and error messages, force UTC ingest timestamps in PostgreSQL #435

Merged 3 commits on Nov 26, 2020
16 changes: 14 additions & 2 deletions doc/lsst.daf.butler/organizing.rst
@@ -76,12 +76,24 @@
Membership in a ``TAGGED`` collection is implemented in the `Registry` database
Tags are thus both extremely lightweight relative to copies or re-ingests of files or other `Datastore` content, and *slightly* more expensive to store and possibly query than the ``RUN`` or ``CHAINED`` collection representations (which have no per-dataset costs).
The latter is rarely important, but higher-level code should avoid automatically creating ``TAGGED`` collections that may not ever be used.

Chained Collection
^^^^^^^^^^^^^^^^^^
Calibration Collections
^^^^^^^^^^^^^^^^^^^^^^^

`CollectionType.CALIBRATION` collections associate each dataset they contain with a temporal validity range.
The usual constraint on dataset type and data ID uniqueness is enforced as a function of time rather than collection-wide: for any particular dataset type and data ID combination, the validity ranges may not overlap, though they may be (and usually are) adjacent.

In other respects, ``CALIBRATION`` collections closely resemble ``TAGGED`` collections: they are also backed by a many-to-many join table (where each row has a timespan as well as a collection identifier and a dataset identifier), and datasets can be associated or disassociated from them similarly freely.
We use slightly different nomenclature for these operations, reflecting the high-level actions they represent: `certifying <Registry.certify>` a dataset adds it to a ``CALIBRATION`` collection with a particular validity range, and `decertifying <Registry.decertify>` a dataset removes some or all of that validity range.

The same dataset can be present in a ``CALIBRATION`` collection multiple times with different validity ranges.
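
The overlap rule can be illustrated with a small standalone sketch (plain Python, not the butler implementation); it models validity ranges as half-open intervals and rejects a new certification only when it genuinely overlaps an existing one:

```python
from typing import List, Tuple

# A validity range as a half-open interval [begin, end).
Timespan = Tuple[int, int]

def can_certify(existing: List[Timespan], new: Timespan) -> bool:
    """Return True if `new` overlaps no existing validity range.

    Adjacent ranges (one ends exactly where another begins) are allowed,
    matching the behavior described in the text above.
    """
    new_begin, new_end = new
    for begin, end in existing:
        # Half-open intervals overlap iff each one starts before the
        # other one ends.
        if new_begin < end and begin < new_end:
            return False
    return True

ranges = [(0, 10), (10, 20)]              # adjacent ranges: fine
assert can_certify(ranges, (20, 30))      # adjacent to (10, 20): accepted
assert not can_certify(ranges, (5, 15))   # overlaps both: rejected
```

Adjacent ranges share an endpoint but do not overlap, which is why the typical calibration timeline is a chain of back-to-back validity ranges.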

Chained Collections
^^^^^^^^^^^^^^^^^^^

A `CollectionType.CHAINED` collection is essentially a multi-collection search path that has been saved in the `Registry` database and associated with a name of its own.
Querying a ``CHAINED`` collection simply queries its child collections in order, and a ``CHAINED`` collection is always (and only) updated when its child collections are.

``CHAINED`` collections may contain other chained collections, as long as they do not contain cycles, and they can also include restrictions on the dataset types to search for within each child collection (see :ref:`daf_butler_collection_expressions`).
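
The no-cycles requirement is ordinary graph acyclicity. A hypothetical depth-first check (illustrative only, not the registry's implementation) over a mapping from chained-collection name to its child collections might look like:

```python
from typing import Dict, List

def has_cycle(children: Dict[str, List[str]]) -> bool:
    """Return True if the chained-collection graph contains a cycle.

    Names absent from `children` are leaf (non-chained) collections.
    """
    visiting, done = set(), set()

    def visit(name: str) -> bool:
        if name in done:
            return False
        if name in visiting:
            return True  # back edge: we reached a collection mid-visit
        visiting.add(name)
        for child in children.get(name, ()):
            if visit(child):
                return True
        visiting.remove(name)
        done.add(name)
        return False

    return any(visit(name) for name in children)

assert not has_cycle({"calib": ["bias", "dark"], "all": ["calib", "raw"]})
assert has_cycle({"a": ["b"], "b": ["a"]})
```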

The usual constraint on dataset type and data ID uniqueness within a collection is only lazily enforced for chained collections: operations that query them either deduplicate results themselves or terminate single-dataset searches after the first match in a child collection is found.
In some methods, like `Registry.queryDatasets`, this behavior is optional: passing ``findFirst=True`` will enforce the constraint, while ``findFirst=False`` will not.
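The find-first behavior can be modeled minimally as an in-order scan of the child collections (a toy sketch; `Registry.queryDatasets` operates on real data IDs and dataset types, not strings):

```python
from typing import Dict, List, Optional

def find_first(chain: List[str],
               collections: Dict[str, Dict[str, str]],
               data_id: str) -> Optional[str]:
    """Search child collections in order; stop at the first match."""
    for name in chain:
        dataset = collections[name].get(data_id)
        if dataset is not None:
            return dataset
    return None

collections = {
    "run1": {"visit=1": "flat/run1"},
    "run2": {"visit=1": "flat/run2", "visit=2": "bias/run2"},
}
# "run1" shadows "run2" for visit=1 because it comes first in the chain.
assert find_first(["run1", "run2"], collections, "visit=1") == "flat/run1"
assert find_first(["run1", "run2"], collections, "visit=2") == "bias/run2"
```

With ``findFirst=False`` the analogous query would instead return every match from every child, deduplicating identical datasets.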
6 changes: 6 additions & 0 deletions python/lsst/daf/butler/registry/databases/postgresql.py
@@ -100,6 +100,12 @@ def transaction(self, *, interrupting: bool = False, savepoint: bool = False,
         if not self.isWriteable():
             with closing(self._connection.connection.cursor()) as cursor:
                 cursor.execute("SET TRANSACTION READ ONLY")
+        else:
+            with closing(self._connection.connection.cursor()) as cursor:
+                # Make timestamps UTC, because we didn't use TIMESTAMPTZ for
+                # the column type.  When we can tolerate a schema change,
+                # we should change that type and remove this line.
+                cursor.execute("SET TIME ZONE 0")
         yield

def _lockTables(self, tables: Iterable[sqlalchemy.schema.Table] = ()) -> None:
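
Because the column was declared as plain ``TIMESTAMP`` (without time zone), the session time zone determines how timestamps are interpreted on ingest, which is why the transaction above pins the session to UTC. The complementary client-side discipline is to normalize to naive UTC before binding a parameter; a hedged sketch (the helper name is hypothetical, not butler API):

```python
from datetime import datetime, timedelta, timezone

def to_naive_utc(ts: datetime) -> datetime:
    """Convert an aware datetime to naive UTC, the only safe form to
    store in a TIMESTAMP WITHOUT TIME ZONE column."""
    if ts.tzinfo is None:
        raise ValueError("refusing to guess the zone of a naive timestamp")
    return ts.astimezone(timezone.utc).replace(tzinfo=None)

# 09:30 at UTC-5 is 14:30 UTC.
aware = datetime(2020, 11, 26, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
assert to_naive_utc(aware) == datetime(2020, 11, 26, 14, 30)
```

Switching the column to ``TIMESTAMPTZ``, as the comment in the diff suggests, would make PostgreSQL perform this normalization itself and render both workarounds unnecessary.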
@@ -182,7 +182,7 @@ def delete(self, datasets: Iterable[DatasetRef]) -> None:
     def associate(self, collection: CollectionRecord, datasets: Iterable[DatasetRef]) -> None:
         # Docstring inherited from DatasetRecordStorage.
         if collection.type is not CollectionType.TAGGED:
-            raise TypeError(f"Cannot associate into collection '{collection}' "
+            raise TypeError(f"Cannot associate into collection '{collection.name}' "
                             f"of type {collection.type.name}; must be TAGGED.")
         protoRow = {
             self._collections.getCollectionForeignKeyName(): collection.key,
@@ -209,7 +209,7 @@ def associate(self, collection: CollectionRecord, datasets: Iterable[DatasetRef]
     def disassociate(self, collection: CollectionRecord, datasets: Iterable[DatasetRef]) -> None:
         # Docstring inherited from DatasetRecordStorage.
         if collection.type is not CollectionType.TAGGED:
-            raise TypeError(f"Cannot disassociate from collection '{collection}' "
+            raise TypeError(f"Cannot disassociate from collection '{collection.name}' "
                             f"of type {collection.type.name}; must be TAGGED.")
         rows = [
             {
@@ -252,7 +252,7 @@ def certify(self, collection: CollectionRecord, datasets: Iterable[DatasetRef],
             raise TypeError(f"Cannot certify datasets of type {self.datasetType.name}, for which "
                             f"DatasetType.isCalibration() is False.")
         if collection.type is not CollectionType.CALIBRATION:
-            raise TypeError(f"Cannot certify into collection '{collection}' "
+            raise TypeError(f"Cannot certify into collection '{collection.name}' "
                             f"of type {collection.type.name}; must be CALIBRATION.")
         tsRepr = self._db.getTimespanRepresentation()
         protoRow = {
@@ -323,7 +323,7 @@ def decertify(self, collection: CollectionRecord, timespan: Timespan, *,
             raise TypeError(f"Cannot decertify datasets of type {self.datasetType.name}, for which "
                             f"DatasetType.isCalibration() is False.")
         if collection.type is not CollectionType.CALIBRATION:
-            raise TypeError(f"Cannot decertify from collection '{collection}' "
+            raise TypeError(f"Cannot decertify from collection '{collection.name}' "
                             f"of type {collection.type.name}; must be CALIBRATION.")
         tsRepr = self._db.getTimespanRepresentation()
         # Construct a SELECT query to find all rows that overlap our inputs.