Add/improve documentation on types of collections in butler.
I'm not sure why the apparently-duplicate py:currentmodule directives
are needed (note that there's one at the top of index.rst, and that
was enough for configuring.rst), but they make the links work, and
without them, the links don't.
TallJimbo committed Mar 4, 2020
1 parent f07595f commit 1bd64fd
Showing 4 changed files with 97 additions and 2 deletions.
2 changes: 2 additions & 0 deletions doc/lsst.daf.butler/dimensions.rst
@@ -1,5 +1,7 @@
.. _lsst.daf.butler-dimensions_overview:

.. py:currentmodule:: lsst.daf.butler

Overview
--------
Dimensions are astronomical concepts that are used to label and organize datasets.
1 change: 1 addition & 0 deletions doc/lsst.daf.butler/index.rst
@@ -26,6 +26,7 @@ Using the Butler
:maxdepth: 1

configuring.rst
organizing.rst
queries.rst

.. _lsst.daf.butler-scripts:
87 changes: 87 additions & 0 deletions doc/lsst.daf.butler/organizing.rst
@@ -0,0 +1,87 @@
.. _daf_butler_organizing_datasets:

Organizing and identifying datasets
===================================

.. py:currentmodule:: lsst.daf.butler

Each dataset in a repository is associated with an opaque unique integer ID, which we currently call its ``dataset_id``, and it's usually seen in Python code as the value of `DatasetRef.id`.
This is the number used as the primary key in most `Registry` tables that refer to datasets, and it's the only way the contents of a `Datastore` are matched to those in a `Registry`.
With that number, the dataset is fully identified, and anything else about it can be unambiguously looked up.
We call a `DatasetRef` whose `~DatasetRef.id` attribute is not `None` a *resolved* `DatasetRef`.
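
A minimal sketch of the distinction, assuming a hypothetical repository at ``/repo`` that contains ``raw`` datasets (the repository path, collection name, and data ID values below are illustrative assumptions):

.. code-block:: python

   from lsst.daf.butler import Butler

   butler = Butler("/repo")
   # findDataset searches the given collections and returns a *resolved*
   # DatasetRef: one whose opaque dataset_id is known.
   ref = butler.registry.findDataset(
       "raw",
       instrument="HSC",
       exposure=903334,
       detector=50,
       collections=["HSC/raw"],
   )
   assert ref.id is not None  # resolved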

Most of the time, however, users identify a dataset using a combination of three other attributes:

- a dataset type;
- a data ID;
- a collection.

Most collections are constrained to contain only one dataset with a particular dataset type and data ID, so this combination is usually enough to resolve a dataset (see :ref:`daf_butler_collections` for exceptions).

A dataset's type and data ID are intrinsic to it: while there may be many datasets with a particular dataset type and/or data ID, the dataset type and data ID associated with a dataset are set and fixed when it is created.
A `DatasetRef` always has both a dataset type attribute and a data ID, though the latter may be empty.
Dataset types are discussed below in :ref:`daf_butler_dataset_types`, while data IDs are one aspect of the larger :ref:`Dimensions <lsst.daf.butler-dimensions_overview>` system and are discussed in :ref:`lsst.daf.butler-dimensions_data_ids`.

In contrast, the relationship between dataset and collections is many-to-many: a collection typically contains many different datasets, and a particular dataset may belong to multiple collections.
As a result, it is common to search for datasets in multiple collections (often in a well-defined order), and interfaces that provide that functionality can accept a collection search path in :ref:`many different forms <daf_butler_collection_expressions>`.
Collections are discussed further below in :ref:`daf_butler_collections`.
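
For illustration, a hedged sketch of such a multi-collection search (the collection names are assumptions); with ``deduplicate=True``, only the first match for each dataset type and data ID is returned:

.. code-block:: python

   # Collections are searched in the order given, so the first collection
   # shadows the second wherever both contain a matching dataset.
   refs = butler.registry.queryDatasets(
       "calexp",
       collections=["u/alice/reprocessing", "shared/defaults"],
       deduplicate=True,
   )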

.. _daf_butler_dataset_types:

Dataset types
-------------

The names "dataset" and "dataset type" (which `lsst.daf.butler` inherits from its `lsst.daf.persistence` predecessor) are intended to evoke the relationship between an instance and its class in object-oriented programming, but this is a metaphor, *not* a relationship that maps to any particular Python objects: we don't have any Python class that fully represents the *dataset* concept (`DatasetRef` is the closest), and the `DatasetType` class is a regular class, not a metaclass.
So a *dataset type* is represented in Python as a `DatasetType` *instance*.

A dataset type defines both the dimensions used in a dataset's data ID (so all data IDs for a particular dataset type have the same keys, at least when put in standard form) and the storage class that corresponds to its in-memory Python type and maps to the file format (or generalization thereof) used by a `Datastore` to store it.
These are associated with an arbitrary string name.
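
For concreteness, a sketch of defining and registering a dataset type; the name, dimensions, and storage class chosen here are illustrative assumptions:

.. code-block:: python

   from lsst.daf.butler import DatasetType

   # A name, a set of dimensions, and a storage class fully define
   # a dataset type.
   calexpType = DatasetType(
       "calexp",
       dimensions=["instrument", "visit", "detector"],
       storageClass="ExposureF",
       universe=butler.registry.dimensions,
   )
   butler.registry.registerDatasetType(calexpType)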

Beyond that definition, what a dataset type *means* isn't really specified by the butler itself; we expect higher-level code that *uses* the butler to make that clear, and one anticipated case is worth calling out here: a dataset type roughly corresponds to the role its datasets play in a processing pipeline.
In other words, a particular pipeline will typically accept particular dataset types as inputs and produce particular dataset types as outputs (and may produce and consume other dataset types as intermediates).
And while the exact dataset types used may be configurable, changing a dataset type will generally involve substituting one dataset type for a very similar one (most of the time with the same dimensions and storage class).

.. _daf_butler_collections:

Collections
-----------

Collections are lightweight groups of datasets defined in the `Registry`.
Groups of self-consistent calibration datasets, the outputs of a processing run, and the set of all raw images for a particular instrument are all examples of collections.
Collections are referred to in code simply as `str` names; various `Registry` methods can be used to manage them and obtain additional information about them when relevant.

There are multiple types of collections, corresponding to the different values of the `CollectionType` enum.
All collection types are usable in the same way in any context where existing datasets are being queried or retrieved, though the actual searches may be implemented quite differently in terms of database queries.
Collection types differ completely in how and when datasets can be added to or removed from them.
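
A brief sketch of managing collections by name, assuming `Registry.registerCollection` and `Registry.getCollectionType` as the relevant methods (the collection name is an assumption):

.. code-block:: python

   from lsst.daf.butler import CollectionType

   # Collections are created, typed, and inspected purely by string name.
   butler.registry.registerCollection("u/alice/best", CollectionType.TAGGED)
   assert butler.registry.getCollectionType("u/alice/best") is CollectionType.TAGGED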

Run Collections
^^^^^^^^^^^^^^^

A dataset is always added to a `CollectionType.RUN` collection when it is inserted into the `Registry`, and can never be removed from it without fully removing the dataset from the `Registry`.
There is no other way to add a dataset to a ``RUN`` collection.
The run collection name *must* appear in any file path templates used by a `Datastore` in order to guarantee uniqueness (other collection types are too flexible to guarantee continued uniqueness over the life of the dataset).

The name "run" reflects the fact that we expect most ``RUN`` collections to be used to store the outputs of processing runs, but they should also be used in any other context in which their lack of flexibility is acceptable, as they are the most efficient type of collection to store and query.

``RUN`` collections that do represent the outputs of processing runs can be associated with a host name string and a timespan, and are expected to be the way in which some provenance is associated with datasets (e.g. a dataset that contains a list of software versions would have the same ``RUN`` as the datasets produced by a processing run that used those versions).

Like most collections, a ``RUN`` can contain at most one dataset with a particular dataset type and data ID.
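
A hypothetical sketch of how a dataset enters a ``RUN`` collection at insertion time (the repository path, run name, and data ID are assumptions):

.. code-block:: python

   # A Butler constructed with a run inserts every put() into that RUN
   # collection; the association lasts for the life of the dataset.
   butler = Butler("/repo", run="u/alice/reprocessing")
   # `exposure` is some in-memory object matching the dataset type's
   # storage class.
   ref = butler.put(exposure, "calexp", instrument="HSC", visit=903334, detector=50)
   assert ref.run == "u/alice/reprocessing"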

Tagged Collections
^^^^^^^^^^^^^^^^^^

`CollectionType.TAGGED` collections are the most flexible type of collection; datasets can be `associated <Registry.associate>` with or `disassociated <Registry.disassociate>` from a ``TAGGED`` collection at any time, as long as the usual constraint on a collection having only one dataset with a particular dataset type and data ID is maintained.
Membership in a ``TAGGED`` collection is implemented in the `Registry` database as a single row in a many-to-many join table (a "tag") and is completely decoupled from the actual storage of the dataset.

Tags are thus both extremely lightweight relative to copies or re-ingests of files or other `Datastore` content, and *slightly* more expensive to store and possibly query than the ``RUN`` or ``CHAINED`` collection representations (which have no per-dataset costs).
The latter is rarely important, but higher-level code should avoid automatically creating ``TAGGED`` collections that may never be used.
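
A sketch of tagging in practice with `~Registry.associate` and `~Registry.disassociate` (collection names are assumptions):

.. code-block:: python

   # Select some existing datasets...
   refs = list(
       butler.registry.queryDatasets("calexp", collections=["u/alice/reprocessing"])
   )
   # ...tag them: one join-table row per dataset, storage untouched...
   butler.registry.associate("u/alice/best", refs)
   # ...and untag just as cheaply.
   butler.registry.disassociate("u/alice/best", refs)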

Chained Collections
^^^^^^^^^^^^^^^^^^^

A `CollectionType.CHAINED` collection is essentially a multi-collection search path that has been saved in the `Registry` database and associated with a name of its own.
Querying a ``CHAINED`` collection simply queries its child collections in order, and a ``CHAINED`` collection is always (and only) updated when its child collections are.

``CHAINED`` collections may contain other chained collections, as long as they do not contain cycles, and they can also include restrictions on the dataset types to search for within each child collection (see :ref:`daf_butler_collection_expressions`).

The usual constraint on dataset type and data ID uniqueness within a collection is only lazily enforced for chained collections: operations that query them either deduplicate results themselves or terminate single-dataset searches after the first match in a child collection is found.
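
A hypothetical sketch of defining and searching a chain, assuming a ``setCollectionChain``-style `Registry` method for saving the child list (all names are illustrative):

.. code-block:: python

   from lsst.daf.butler import CollectionType

   # Save a search path under its own name...
   butler.registry.registerCollection("u/alice/defaults", CollectionType.CHAINED)
   butler.registry.setCollectionChain(
       "u/alice/defaults", ["u/alice/reprocessing", "shared/calib", "HSC/raw"]
   )
   # ...then search it like any single collection; children are searched
   # in the order given above.
   refs = butler.registry.queryDatasets(
       "calexp", collections=["u/alice/defaults"], deduplicate=True
   )
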
9 changes: 7 additions & 2 deletions doc/lsst.daf.butler/queries.rst
@@ -2,6 +2,8 @@
.. _daf_butler_queries:

.. py:currentmodule:: lsst.daf.butler

Querying datasets
=================

@@ -48,8 +50,11 @@ User code will rarely need to interact with these directly, but they can be pass
Ordered collection searches
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- A *ordered* collection expression is required in contexts where we want to search collections only until a dataset with a particular dataset type and data ID is found, such as `~Registry.findDataset` or `~Registy.queryDatasets` (when ``deduplicate`` is `True`).
- In these contexts, regular expressions and `...` are not allowed, because they make it impossible to unambiguously define the order in which to search the matching collections.
+ An *ordered* collection expression is required in contexts where we want to search collections only until a dataset with a particular dataset type and data ID is found.
+ These include all direct `Butler` operations, the definitions of `~CollectionType.CHAINED` collections, `Registry.findDataset`, and the ``deduplicate=True`` mode of `Registry.queryDatasets`.
+ In these contexts, regular expressions and `...` are not allowed for collection names, because they make it impossible to unambiguously define the order in which to search.
+ Dataset type restrictions are allowed in these contexts, and those may be (and usually are) `...`.

Ordered collection searches are processed by the `~registry.wildcards.CollectionSearch` class.

