DM-25691: Some documentation on storage classes and formatters #327

Merged
merged 11 commits into from
Sep 10, 2020
4 changes: 2 additions & 2 deletions doc/lsst.daf.butler/configuring.rst
@@ -8,7 +8,7 @@ The main sections of the YAML file are handled separately by each sub configurat
Each config specialization, registry, schema, storage class, composites, and dimensions knows the name of the key for its own section of the configuration and knows the names of files providing overrides and defaults for the configuration.
Additionally, if the sub configuration contains a ``cls`` key, that class is imported and an additional configuration file name can be provided by asking the class for its defaultConfigFile property.
All the keys within a sub configuration are processed by the class constructed from ``cls``.
The primary source of default values comes from the ``config`` resource (accessed using `pkg_resources` and package ``lsst.daf.butler``) –- this directory contains YAML files matching the names specified in the sub config classes and also can include names specified by the corresponding component class (for example `~lsst.daf.butler.datastores.posixDatastore.PosixDatastore` specifies that configuration should be found in ``datastores/posixDatastore.yaml``.
The primary source of default values is the ``configs`` resource (accessed using `pkg_resources` and package ``lsst.daf.butler``). This directory contains YAML files matching the names specified in the sub config classes and can also include names specified by the corresponding component class (for example, `~lsst.daf.butler.datastores.posixDatastore.PosixDatastore` specifies that configuration should be found in ``datastores/posixDatastore.yaml``).
There are additional search paths that can be included when a config object is constructed:

1. Explicit list of directory paths to search passed into the constructor.
@@ -41,7 +41,7 @@ There is a command available to allow you to see how all these overrides and inc

.. prompt:: bash

dumpButlerConfig.py --subset .registry.db ./repo/butler.yaml
butler config-dump --subset .registry.db ./repo/butler.yaml

Note the leading "``.``" to indicate that you are using a "``.``" delimiter to specify the hierarchy within the configuration.
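As a rough illustration of the delimiter convention (not the actual Butler configuration code), the sketch below treats the first character of the key as the delimiter and walks a nested dictionary.
The function name ``lookup_subset`` and the sample configuration values are hypothetical.

```python
from typing import Any, Mapping


def lookup_subset(config: Mapping[str, Any], key: str) -> Any:
    """Return the part of a nested mapping addressed by a delimited key.

    Assumption for this sketch: the first character of ``key`` is the
    delimiter, so ".registry.db" means delimiter "." and the path
    ["registry", "db"].
    """
    delimiter, path = key[0], key[1:]
    node: Any = config
    for name in path.split(delimiter):
        node = node[name]  # raises KeyError if the hierarchy does not exist
    return node


# Hypothetical configuration contents, for illustration only.
example = {"registry": {"db": "sqlite:///example.sqlite3"}}
print(lookup_subset(example, ".registry.db"))
print(lookup_subset(example, "/registry/db"))
```

Both calls address the same entry; only the delimiter declared by the leading character differs.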

101 changes: 101 additions & 0 deletions doc/lsst.daf.butler/datastores.rst
@@ -0,0 +1,101 @@
.. _daf_butler-datastores:

#######################
Datastore Configuration
#######################

.. py:currentmodule:: lsst.daf.butler

A Butler `Datastore` is configured in the ``datastore`` section of the top-level Butler YAML configuration.
The only mandatory entry in the datastore configuration is the ``cls`` key.
This specifies the fully qualified class name of the Python class implementing the datastore.
The default Butler configuration uses the `~datastores.posixDatastore.PosixDatastore`.
All other keys depend on the specific datastore class that is selected.
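The following is a minimal sketch of how a fully qualified class name in a ``cls`` entry could be turned into a Python class using the standard library.
It is not the Butler implementation: the helper names ``resolve_class`` and ``datastore_class_from_config`` are hypothetical, and a standard library class stands in for a real datastore class so that the example runs anywhere.

```python
import importlib
from typing import Any, Mapping


def resolve_class(fully_qualified_name: str) -> type:
    """Import ``package.module.ClassName`` and return the class object."""
    module_name, _, class_name = fully_qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(f"Not a fully qualified name: {fully_qualified_name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def datastore_class_from_config(datastore_config: Mapping[str, Any]) -> type:
    """Return the class named by the mandatory ``cls`` key."""
    if "cls" not in datastore_config:
        raise KeyError("The datastore configuration must contain a 'cls' key")
    return resolve_class(datastore_config["cls"])


# Stand-in value: a real configuration would name a datastore class instead.
example_config = {"cls": "collections.OrderedDict"}
print(datastore_class_from_config(example_config))
```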


.. note::

    The default configuration values can be inspected at ``$DAF_BUTLER_DIR/python/lsst/daf/butler/configs`` (they can be accessed directly as Python package resources) and current values can be obtained by calling ``butler config-dump`` on a Butler repository.

The supported datastores are:

* :ref:`daf_butler-datastores-file` (local POSIX along with remote datastores such as S3)
* :ref:`daf_butler-datastores-memory`
* :ref:`daf_butler-datastores-chain`

.. _daf_butler-datastores-file:

File-Based Datastores
=====================

The file-based datastores (for example `~datastores.posixDatastore.PosixDatastore` and `~datastores.s3Datastore.S3Datastore`) share a configuration because they all use formatters to serialize datasets to file artifacts, name those artifacts using file name templates, and support disassembly of :ref:`composite storage classes <daf_butler_storage_classes>`.

The supported configurations are:

**root**
    The location of the "root" of the datastore "file system".
    Usually the default value of ``<butlerRoot>/datastore`` can be left unchanged.
    Here ``<butlerRoot>`` is a magic value that is replaced either with the location of the Butler configuration file or the top-level ``root`` as set in that ``butler.yaml`` configuration file.
**records**
    This section defines the name of the registry table that should be used to hold details about datasets stored in the datastore (such as the path within the datastore and the associated formatter).
    This only needs to be set if multiple datastores are to be used simultaneously within one Butler repository, since the table names must not clash.
**create**
    A Boolean to define whether an attempt should be made to initialize the datastore by creating the directory.
    Defaults to `True`, and that default should normally not be changed.
**templates**
    The template to use to construct "files" within the datastore.
    The template uses data dimensions to do this.
    Generally the default setting will be usable, although it can be tuned per `DatasetType`, `StorageClass` or data ID.
    Changes to this template only apply to new datasets, since the datastore remembers the names associated with previous datasets.
    Templates are formatted as `FileTemplate` instances.
**formatters**
    Mapping of `DatasetType`, `StorageClass` or data ID to a specific formatter class that understands the associated Python type and will serialize it to a file artifact.
    The formatters section also supports the definition of write recipes (bulk configurations that can be selected for specific formatters) and write parameters (parameters that control how the dataset is serialized; note that every serialized artifact must be readable by a formatter without knowing which write parameters were used).
    Once a formatter is associated with a particular dataset it is permanently associated with that dataset, even if the configuration is later modified to specify a different formatter.
**constraints**
    Specify the `DatasetType`, `StorageClass` or data ID that will be accepted or rejected by this datastore.
**composites**
    Controls whether composite datasets are disassembled by the datastore.
    By default composites are not disassembled.
    Disassembly can be controlled by `DatasetType`, `StorageClass` or data ID.
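To make the ``root`` and ``templates`` entries more concrete, here is a minimal sketch of the two substitutions they describe.
It is only an illustration under stated assumptions: the helper names, the template string, and the field values are hypothetical, and plain `str.format` is used as a stand-in for the real `FileTemplate` behavior.

```python
BUTLER_ROOT_TAG = "<butlerRoot>"


def resolve_root(root: str, butler_root: str) -> str:
    """Replace the ``<butlerRoot>`` placeholder with a concrete location."""
    return root.replace(BUTLER_ROOT_TAG, butler_root)


def expand_template(template: str, **fields: object) -> str:
    """Fill a file name template from dataset and data ID values.

    Assumption: ordinary ``str.format`` fields are enough for this sketch.
    """
    return template.format(**fields)


# Hypothetical repository location, template, and data ID values.
root = resolve_root("<butlerRoot>/datastore", "/repo")
relative_path = expand_template(
    "{run}/{dataset_type}/{visit:08d}/{dataset_type}_{visit:08d}",
    run="ingest",
    dataset_type="calexp",
    visit=903,
)
print(f"{root}/{relative_path}")
```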

.. _daf_butler-config-lookups:

Name Matching
^^^^^^^^^^^^^

Templates, formatters, constraints, and composites all use a standard lookup priority.
The order is:

#. If there is an ``instrument`` in the data ID, the first lookup will be for a key that matches ``instrument<INSTRUMENT_NAME>``.
   If there is a match, the items within that part of the hierarchy will be matched in preference to those at the top level.
#. The highest priority is then the `DatasetType` name.
#. If the `DatasetType` corresponds to a component of a composite, the composite name will then be checked.
#. If there is still no match, the dimensions will be used.
   Dimensions are specified by the presence of a ``+`` as a separator.
   For example, ``instrument+physical_filter+visit`` would match any `DatasetType` that uses those three dimensions.
#. The final match is against the `StorageClass` name.
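The lookup order above can be sketched as follows.
This is one possible reading rather than the Butler implementation: the function names are hypothetical, a component is assumed to be named ``composite.component``, and the dimensions key is assumed to list the dimensions in the same order as they are passed in.

```python
from typing import Any, List, Mapping, Optional, Sequence


def candidate_keys(
    dataset_type: str, dimensions: Sequence[str], storage_class: str
) -> List[str]:
    """Return lookup keys from highest to lowest priority."""
    keys = [dataset_type]
    if "." in dataset_type:
        # Assumed naming: "composite.component", so check the composite next.
        keys.append(dataset_type.split(".", 1)[0])
    if dimensions:
        keys.append("+".join(dimensions))
    keys.append(storage_class)
    return keys


def find_entry(
    section: Mapping[str, Any],
    dataset_type: str,
    dimensions: Sequence[str],
    storage_class: str,
    data_id: Mapping[str, Any],
) -> Optional[Any]:
    """Search the instrument-specific block first, then the top level."""
    keys = candidate_keys(dataset_type, dimensions, storage_class)
    sections = []
    instrument = data_id.get("instrument")
    if instrument is not None:
        specific = section.get(f"instrument<{instrument}>")
        if isinstance(specific, Mapping):
            sections.append(specific)
    sections.append(section)
    for current in sections:
        for key in keys:
            if key in current:
                return current[key]
    return None


# Hypothetical configuration section; the values are placeholders.
templates = {
    "calexp": "top-level calexp",
    "instrument+physical_filter+visit": "by dimensions",
    "ExposureF": "by storage class",
    "instrument<HSC>": {"calexp": "HSC-specific calexp"},
}
dims = ["instrument", "physical_filter", "visit"]
print(find_entry(templates, "calexp", dims, "ExposureF", {"instrument": "HSC"}))
print(find_entry(templates, "calexp.wcs", dims, "Wcs", {"instrument": "LSST"}))
```

In this reading, every candidate key is tried inside the instrument-specific block before any top-level key is considered.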

.. _daf_butler-datastores-memory:

In-Memory Datastore
===================

The `~datastores.inMemoryDatastore.InMemoryDatastore` currently only supports the ``constraints`` field.
This allows the datastore to accept specific dataset types.

In the future more features will be added to allow some form of cache expiry.

.. _daf_butler-datastores-chain:

Chained Datastores
==================

The `~datastores.chainedDatastore.ChainedDatastore` datastore enables multiple other datastores to be combined into one.
The dataset will be sent to every datastore in the chain and success is reported if any of the datastores accepts the dataset.
When a dataset is retrieved each datastore is asked for the dataset in turn and the first match is sufficient.
This allows an in-memory datastore to be combined with a file-based datastore to enable simple in-memory retrieval for a dataset that has been persisted to disk.
A file-based datastore can be turned into a chained datastore after the fact, for example by adding an in-memory caching datastore.
The only constraint is that all the datasets in the registry are associated with at least one of the datastores in the chain.
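The put and get behavior described above can be sketched with toy classes.
These classes are hypothetical stand-ins written for this illustration, not the Butler datastore API: a toy datastore accepts a dataset based only on its dataset type name, and datasets are identified by a plain string.

```python
from typing import Any, Dict, List, Optional, Set


class ToyDatastore:
    """Hypothetical in-process datastore with an optional accept list."""

    def __init__(self, name: str, accepted_types: Optional[Set[str]] = None):
        self.name = name
        self.accepted_types = accepted_types  # None means accept everything
        self._store: Dict[str, Any] = {}

    def put(self, dataset_type: str, ref: str, obj: Any) -> bool:
        """Store the object if the constraints allow it; report acceptance."""
        if self.accepted_types is not None and dataset_type not in self.accepted_types:
            return False
        self._store[ref] = obj
        return True

    def get(self, ref: str) -> Any:
        return self._store[ref]  # raises KeyError if the dataset is unknown


class ToyChain:
    """Send puts to every child; answer gets from the first child that matches."""

    def __init__(self, datastores: List[ToyDatastore]):
        self.datastores = datastores

    def put(self, dataset_type: str, ref: str, obj: Any) -> List[str]:
        accepted = []
        for datastore in self.datastores:
            if datastore.put(dataset_type, ref, obj):
                accepted.append(datastore.name)
        if not accepted:
            raise RuntimeError(f"No datastore in the chain accepted {ref!r}")
        return accepted

    def get(self, ref: str) -> Any:
        for datastore in self.datastores:
            try:
                return datastore.get(ref)
            except KeyError:
                continue
        raise KeyError(ref)


# An in-memory cache restricted to one dataset type, backed by a catch-all store.
chain = ToyChain([ToyDatastore("memory", {"metadata"}), ToyDatastore("file")])
print(chain.put("metadata", "ref-1", {"a": 1}))
print(chain.put("calexp", "ref-2", "image data"))
print(chain.get("ref-2"))
```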

`~datastores.chainedDatastore.ChainedDatastore` has a ``datastores`` key that contains a list of datastore configurations, each of which can have the same contents as the ``datastore`` section of any other datastore.
A `~datastores.chainedDatastore.ChainedDatastore` also supports ``constraints`` definitions.