Refactor and clean up wildcards and the methods that use them.

This includes a number of changes for consistency and clarity, including:
- replacing Registry's getAll* methods with more flexible query* methods;
- making the types accepted in expressions for collections and dataset types more consistent;
- switching from SQL LIKE patterns to regular expressions (it looks like we mostly want to apply these in Python rather than in the database);
- adding user guide docs for those expressions, integrating it with the existing page on data ID expression strings.
lsst · Mar 4, 2020 · a1abbc1
1 parent 9d1416d
commit a1abbc1
Showing 14 changed files with 867 additions and 260 deletions.
diff --git a/doc/lsst.daf.butler/index.rst b/doc/lsst.daf.butler/index.rst
@@ -26,7 +26,7 @@ Using the Butler
   :maxdepth: 1
 
   configuring.rst
-  exprParser.rst
+  queries.rst
 
 .. _lsst.daf.butler-scripts:
 
@@ -81,6 +81,9 @@ Python API reference
 .. automodapi:: lsst.daf.butler.registry.queries
    :no-main-docstr:
 
+.. automodapi:: lsst.daf.butler.registry.wildcards
+   :no-main-docstr:
+
 Example datastores
 ------------------
 

diff --git a/doc/lsst.daf.butler/exprParser.rst b/doc/lsst.daf.butler/queries.rst
@@ -1,28 +1,81 @@
-.. _daf_butler_expr_parser:
+.. py:currentmodule:: lsst.daf.butler
 
-Butler expression language
-==========================
+.. _daf_butler_queries:
 
-Butler registry supports a user-supplied expression for constraining input,
-output, or intermediate datasets that can appear in the generated
-QuantumGraph. This page describes the structure and syntax of the expression
-language.
+Querying datasets
+=================
 
-The language grammar is defined in ``exprParser.parserYacc`` module, which is
-responsible for transforming string with user expression into a syntax tree
-with nodes represented by various classes defined in ``exprParser.exprTree``
-module. Modules in ``exprParser`` package are considered butler/registry
-implementation details and are not exposed at the butler package level.
+Datasets in a butler-managed data repository are identified by the combination of their *dataset type* and *data ID* within a *collection*.
+The `Registry` class's query methods (`~Registry.queryDatasetTypes`, `~Registry.queryCollections`, `~Registry.queryDimensions`, and `~Registry.queryDatasets`) allow these to be specified either fully or partially in various ways.
 
-The gramma is based on standard SQL, it is a subset of SQL expression language
-that can appear in WHERE clause of standard SELECT statement with some
-extensions, e.g. range support for ``IN`` operator.
+.. _daf_butler_dataset_type_expressions:
+
+DatasetType expressions
+-----------------------
+
+Arguments that specify one or more dataset types can generally take any of the following:
+
+ - `DatasetType` instances;
+ - `str` values (corresponding to `DatasetType.name`);
+ - `re.Pattern` values (matched to `DatasetType.name` strings, via `~re.Pattern.fullmatch`);
+ - iterables of any of the above;
+ - the special value `...`, which matches all dataset types.
+
+Some of these are not allowed in certain contexts (as documented there).
+
+.. _daf_butler_collection_expressions:
+
+Collection expressions
+----------------------
+
+Arguments that specify one or more collections are similar to those for dataset types; they can take:
+
+ - `str` values (the full collection name);
+ - `re.Pattern` values (matched to the collection name, via `~re.Pattern.fullmatch`);
+ - a `tuple` of (`str`, *dataset-type-restriction*) - see below;
+ - iterables of any of the above;
+ - the special value `...`, which matches all collections;
+ - a mapping from `str` to *dataset-type-restriction*.
+
+A *dataset-type-restriction* is a :ref:`DatasetType expression <daf_butler_dataset_type_expressions>` that limits a search for datasets in the associated collection to just the specified dataset types.
+Unlike most other DatasetType expressions, it may not contain regular expressions (but it may be `...`, which is the implied value when no
+restriction is given, as it means "no restriction").
+In contexts where restrictions are meaningless (e.g. `~Registry.queryCollections` when the ``datasetType`` argument is `None`) they are allowed but ignored.
+
+Collection expressions are processed by the `~registry.wildcards.CollectionQuery` and `~registry.wildcards.DatasetTypeRestriction` classes.
+User code will rarely need to interact with these directly, but they can be passed to `Registry` instead of the expression objects themselves, and hence may be useful as a way to transform an expression that may include single-pass iterators into an equivalent form that can be reused.
+
+Ordered collection searches
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An *ordered* collection expression is required in contexts where we want to search collections only until a dataset with a particular dataset type and data ID is found, such as `~Registry.queryDatasets` when ``deduplicate`` is `True`.
+In these contexts, regular expressions and `...` are not allowed, because they make it impossible to unambiguously define the order in which to search the matching collections.
+
+Ordered collection searches are processed by the `~registry.wildcards.CollectionSearch` class.
+
+.. _daf_butler_dimension_expressions:
+
+Dimension expressions
+---------------------
+
+Constraints on the data IDs returned by a query can take two forms:
+
+ - an explicit data ID value can be provided (as a `dict` or `DataCoordinate` instance) to directly constrain the dimensions in the data ID and indirectly constrain any related dimensions (see :ref:`Dimensions Overview <lsst.daf.butler-dimensions_overview>`);
+
+ - a string expression resembling a SQL WHERE clause can be provided to constrain dimension values in a much more general way.
+
+In most cases, the two can be provided together, requiring that returned data IDs match both constraints.
+The rest of this section describes the latter in detail.
+
+The language grammar is defined in the ``exprParser.parserYacc`` module, which is responsible for transforming a string with the user expression into a syntax tree with nodes represented by various classes defined in the ``exprParser.exprTree`` module.
+Modules in the ``exprParser`` package are considered butler/registry implementation details and are not exposed at the butler package level.
+
+The grammar is based on standard SQL; it is a subset of the SQL expression language that can appear in the WHERE clause of a standard SELECT statement, with some extensions, such as range support for the ``IN`` operator.
 
 Expression structure
---------------------
+^^^^^^^^^^^^^^^^^^^^
 
-The expression is passed as a string to a registry methods that build a
-QuantumGraph typically from a command-line application such as ``pipetask``.
+The expression is passed as a string via the ``where`` arguments of `~Registry.queryDimensions` and `~Registry.queryDatasets`.
 The string contains a single boolean expression which evaluates to true or
 false (if it is a valid expression). Expression can contain a bunch of
 standard logical operators, comparisons, literals, and identifiers which are
@@ -43,7 +96,7 @@ registry runs the resulting SQL query.
 Following sections describe each of the parts in detail.
 
 Literals
---------
+^^^^^^^^
 
 The language supports these types of literals:
 
@@ -75,7 +128,7 @@ Examples of range literals:
 * ``-10..-1:2`` -- equivalent to ``-10,-8,-6,-4,-2``
 
 Identifiers
------------
+^^^^^^^^^^^
 
 Identifiers represent values external to a parser, such as values stored in a
 database. The parser itself cannot define identifiers or their values; it is
@@ -95,22 +148,22 @@ for accessing raft name (obviously dotted names need knowledge of database
 schema and how SQL query is built).
 
 Unary arithmetic operators
---------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Two unary operators ``+`` (plus) and ``-`` (minus) can be used in the
 expressions in front of (numeric) literals, identifiers, or other
 expressions which should evaluate to a numeric value.
 
 Binary arithmetic operators
----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Language supports five arithmetic operators - ``+`` (add), ``-`` (subtract),
 ``*`` (multiply), ``/`` (divide), and ``%`` (modulo). Usual precedence rules
 apply to these operators. Operands for them can be anything that evaluates to
 a numeric value.
 
 Comparison operators
---------------------
+^^^^^^^^^^^^^^^^^^^^
 
 Language supports set of regular comparison operators - ``=``, ``!=``, ``<``,
 ``<=``, ``>``, ``>=``. This can be used on operands that evaluate to a numeric
@@ -121,7 +174,7 @@ values, for (in)equality operators operands can also be boolean expressions.
 
 
 IN operator
------------
+^^^^^^^^^^^
 
 The ``IN`` operator (and ``NOT IN``) are an expanded version of a regular SQL
 IN operator. Its general syntax looks like::
@@ -147,7 +200,7 @@ as are these::
     visit Not In (100, 110, 130, 135, 140, 145)
 
 Boolean operators
------------------
+^^^^^^^^^^^^^^^^^
 
 ``NOT`` is the standard unary boolean negation operator.
 
@@ -157,13 +210,13 @@ All boolean operators can work on expressions which return boolean values.
 
 
 Grouping operator
------------------
+^^^^^^^^^^^^^^^^^
 
 Parentheses should be used to change evaluation order (precedence) of
 sub-expressions in the full expression.
 
 Examples
---------
+^^^^^^^^
 
 Few examples of valid expressions using some of the constructs::
 

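Reviewer note: the matching rules described in the new queries.rst (plain strings compared for equality, `re.Pattern` values applied with `fullmatch`, iterables of either, and `...` matching everything) can be illustrated with a small standalone sketch. This is not the code in `lsst.daf.butler.registry.wildcards`; the `matches` helper and the sample names are hypothetical, and `DatasetType` instances are left out so that the example needs only the standard library.

```python
import re
from typing import Any


def matches(expression: Any, name: str) -> bool:
    """Return True if ``name`` satisfies ``expression``.

    Hypothetical helper for illustration only; it is not part of daf_butler.
    """
    if expression is ...:
        # The special value ``...`` matches every name.
        return True
    if isinstance(expression, str):
        # Plain strings must equal the full name.
        return expression == name
    if isinstance(expression, re.Pattern):
        # Patterns must match the whole name, not just a prefix or substring.
        return expression.fullmatch(name) is not None
    # Anything else is treated as an iterable of the forms above.
    return any(matches(item, name) for item in expression)


# Made-up names standing in for dataset type or collection names.
names = ["raw", "calexp", "calexp_background", "src"]
selected = [n for n in names if matches(["raw", re.compile(r"calexp.*")], n)]
print(selected)
```

Because `fullmatch` is used, a pattern such as `re.compile("calexp")` would select only `calexp` here, whereas the equivalent of the old SQL LIKE prefix search has to be spelled `calexp.*`.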
diff --git a/python/lsst/daf/butler/__init__.py b/python/lsst/daf/butler/__init__.py
@@ -6,7 +6,8 @@
 # dependencies.
 
 from .core import *
-from .registry import Registry, CollectionType  # import the registry subpackage directly for other symbols
+# Import the registry subpackage directly for other symbols.
+from .registry import Registry, CollectionType, CollectionSearch, DatasetTypeRestriction
 from ._butlerConfig import *
 from ._deferredDatasetHandle import *
 from ._butler import *

diff --git a/python/lsst/daf/butler/_butler.py b/python/lsst/daf/butler/_butler.py
@@ -1248,7 +1248,7 @@ def validateConfiguration(self, logFailures: bool = False,
         if datasetTypeNames:
             entities = [self.registry.getDatasetType(name) for name in datasetTypeNames]
         else:
-            entities = list(self.registry.getAllDatasetTypes())
+            entities = list(self.registry.queryDatasetTypes())
 
         # filter out anything from the ignore list
         if ignore:

diff --git a/python/lsst/daf/butler/registry/__init__.py b/python/lsst/daf/butler/registry/__init__.py
@@ -25,7 +25,7 @@
 from ._collectionType import *
 
 from . import wildcards
-from .wildcards import Like  # other symbols are mostly internal
+from .wildcards import CollectionSearch, DatasetTypeRestriction
 from . import interfaces
 from .interfaces import MissingCollectionError
 from . import queries