Refactor and clean up wildcards and the methods that use them.
This includes a number of changes for consistency and clarity,
including:

 - replacing Registry's getAll* methods with more flexible query*
   methods;

 - making the types accepted in expressions for collections and
   dataset types more consistent;

 - switching from SQL LIKE patterns to regular expressions (it looks
   like we mostly want to apply these in Python rather than in
   the database);

 - adding user guide docs for those expressions, integrating
   it with the existing page on data ID expression strings.
TallJimbo committed Mar 4, 2020
1 parent 9d1416d commit a1abbc1
Showing 14 changed files with 867 additions and 260 deletions.
5 changes: 4 additions & 1 deletion doc/lsst.daf.butler/index.rst
@@ -26,7 +26,7 @@ Using the Butler
:maxdepth: 1

configuring.rst
exprParser.rst
queries.rst

.. _lsst.daf.butler-scripts:

@@ -81,6 +81,9 @@ Python API reference
.. automodapi:: lsst.daf.butler.registry.queries
:no-main-docstr:

.. automodapi:: lsst.daf.butler.registry.wildcards
:no-main-docstr:

Example datastores
------------------

107 changes: 80 additions & 27 deletions doc/lsst.daf.butler/exprParser.rst → doc/lsst.daf.butler/queries.rst
@@ -1,28 +1,81 @@
.. _daf_butler_expr_parser:
.. py:currentmodule:: lsst.daf.butler
Butler expression language
==========================
.. _daf_butler_queries:

Butler registry supports a user-supplied expression for constraining input,
output, or intermediate datasets that can appear in the generated
QuantumGraph. This page describes the structure and syntax of the expression
language.
Querying datasets
=================

The language grammar is defined in ``exprParser.parserYacc`` module, which is
responsible for transforming string with user expression into a syntax tree
with nodes represented by various classes defined in ``exprParser.exprTree``
module. Modules in ``exprParser`` package are considered butler/registry
implementation details and are not exposed at the butler package level.
Datasets in a butler-managed data repository are identified by the combination of their *dataset type* and *data ID* within a *collection*.
The `Registry` class's query methods (`~Registry.queryDatasetTypes`, `~Registry.queryCollections`, `~Registry.queryDimensions`, and `~Registry.queryDatasets`) allow these to be specified either fully or partially in various ways.

The gramma is based on standard SQL, it is a subset of SQL expression language
that can appear in WHERE clause of standard SELECT statement with some
extensions, e.g. range support for ``IN`` operator.
.. _daf_butler_dataset_type_expressions:

DatasetType expressions
-----------------------

Arguments that specify one or more dataset types can generally take any of the following:

- `DatasetType` instances;
- `str` values (corresponding to `DatasetType.name`);
- `re.Pattern` values (matched to `DatasetType.name` strings, via `~re.Pattern.fullmatch`);
- iterables of any of the above;
- the special value `...`, which matches all dataset types.

Some of these are not allowed in certain contexts (as documented there).
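
The matching rules listed above can be sketched in plain Python. This is only an illustration, not the code used by `Registry`: the helper name ``match_dataset_type_names`` is hypothetical, and it works on bare name strings rather than `DatasetType` instances.

```python
import re
from collections.abc import Iterable


def match_dataset_type_names(expression, known_names):
    """Return the names in ``known_names`` selected by ``expression``.

    Hypothetical helper for illustration only; it is not part of
    lsst.daf.butler and ignores DatasetType instances.
    """
    if expression is ...:
        # The special value `...` matches everything.
        return list(known_names)
    if isinstance(expression, str):
        # Plain strings must equal the full name (checked before the
        # generic iterable case, because str is itself iterable).
        return [name for name in known_names if name == expression]
    if isinstance(expression, re.Pattern):
        # Patterns are applied with fullmatch, so "calexp" does not
        # match "calexp_background".
        return [name for name in known_names if expression.fullmatch(name)]
    if isinstance(expression, Iterable):
        selected = []
        for term in expression:
            for name in match_dataset_type_names(term, known_names):
                if name not in selected:
                    selected.append(name)
        return selected
    raise TypeError(f"Unsupported expression: {expression!r}")


known = ["raw", "calexp", "calexp_background", "src"]
only_calexp = match_dataset_type_names(re.compile("calexp"), known)
calexp_family = match_dataset_type_names(re.compile("calexp.*"), known)
mixed = match_dataset_type_names(["src", re.compile("calexp.*")], known)
everything = match_dataset_type_names(..., known)
```

Because `~re.Pattern.fullmatch` is used, a pattern has to describe the whole name; append ``.*`` explicitly to get prefix matching.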

.. _daf_butler_collection_expressions:

Collection expressions
----------------------

Arguments that specify one or more collections are similar to those for dataset types; they can take:

- `str` values (the full collection name);
- `re.Pattern` values (matched to the collection name, via `~re.Pattern.fullmatch`);
- a `tuple` of (`str`, *dataset-type-restriction*) - see below;
- iterables of any of the above;
- the special value `...`, which matches all collections;
- a mapping from `str` to *dataset-type-restriction*.

A *dataset-type-restriction* is a :ref:`DatasetType expression <daf_butler_dataset_type_expressions>` that limits a search for datasets in the associated collection to just the specified dataset types.
Unlike most other DatasetType expressions, it may not contain regular expressions (but it may be `...`, which is the implied value when no
restriction is given, as it means "no restriction").
In contexts where restrictions are meaningless (e.g. `~Registry.queryCollections` when the ``datasetType`` argument is `None`) they are allowed but ignored.

Collection expressions are processed by the `~registry.wildcards.CollectionQuery` and `~registry.wildcards.DatasetTypeRestriction` classes.
User code will rarely need to interact with these directly, but they can be passed to `Registry` instead of the expression objects themselves, and hence may be useful as a way to transform an expression that may include single-pass iterators into an equivalent form that can be reused.

Ordered collection searches
^^^^^^^^^^^^^^^^^^^^^^^^^^^

An *ordered* collection expression is required in contexts where we want to search collections only until a dataset with a particular dataset type and data ID is found, such as `~Registry.queryDatasets` when ``deduplicate`` is `True`.
In these contexts, regular expressions and `...` are not allowed, because they make it impossible to unambiguously define the order in which to search the matching collections.

Ordered collection searches are processed by the `~registry.wildcards.CollectionSearch` class.
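
The reason the order must be unambiguous can be seen in a small stand-alone sketch. It does not use `~registry.wildcards.CollectionSearch`; the function ``find_first`` and the nested-dictionary ``contents`` layout are hypothetical stand-ins for a real registry.

```python
def find_first(collections, dataset_type, data_id, contents):
    """Search ``collections`` in order and stop at the first match.

    Hypothetical illustration: ``contents`` maps a collection name to a
    dict keyed by (dataset type name, data ID) pairs.
    """
    for name in collections:
        ref = contents.get(name, {}).get((dataset_type, data_id))
        if ref is not None:
            # Later collections are never consulted once a match is found.
            return name, ref
    return None


# A data ID represented as a hashable tuple of (dimension, value) pairs.
data_id = (("detector", 1),)
contents = {
    "calib": {("bias", data_id): "bias-from-calib"},
    "run2": {("bias", data_id): "bias-from-run2"},
}

first = find_first(["run2", "calib"], "bias", data_id, contents)
second = find_first(["calib", "run2"], "bias", data_id, contents)
missing = find_first(["run2", "calib"], "flat", data_id, contents)
```

The same two collections yield different results depending on their order, which is why a regular expression or `...` (neither of which implies an order) cannot be used here.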

.. _daf_butler_dimension_expressions:

Dimension expressions
---------------------

Constraints on the data IDs returned by a query can take two forms:

- an explicit data ID value can be provided (as a `dict` or `DataCoordinate` instance) to directly constrain the dimensions in the data ID and indirectly constrain any related dimensions (see :ref:`Dimensions Overview <lsst.daf.butler-dimensions_overview>`);

- a string expression resembling a SQL WHERE clause can be provided to constrain dimension values in a much more general way.

In most cases, the two can be provided together, requiring that returned data IDs match both constraints.
The rest of this section describes the latter in detail.

The language grammar is defined in the ``exprParser.parserYacc`` module, which is responsible for transforming a string with the user expression into a syntax tree with nodes represented by various classes defined in the ``exprParser.exprTree`` module.
Modules in the ``exprParser`` package are considered butler/registry implementation details and are not exposed at the butler package level.

The grammar is based on standard SQL; it is a subset of the SQL expression language that can appear in the WHERE clause of a standard SELECT statement, with some extensions, such as range support for the ``IN`` operator.

Expression structure
--------------------
^^^^^^^^^^^^^^^^^^^^

The expression is passed as a string to a registry methods that build a
QuantumGraph typically from a command-line application such as ``pipetask``.
The expression is passed as a string via the ``where`` arguments of `~Registry.queryDimensions` and `~Registry.queryDatasets`.
The string contains a single boolean expression which evaluates to true or
false (if it is a valid expression). Expression can contain a bunch of
standard logical operators, comparisons, literals, and identifiers which are
@@ -43,7 +96,7 @@ registry runs the resulting SQL query.
Following sections describe each of the parts in detail.

Literals
--------
^^^^^^^^

The language supports these types of literals:

@@ -75,7 +128,7 @@ Examples of range literals:
* ``-10..-1:2`` -- equivalent to ``-10,-8,-6,-4,-2``
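
As a rough illustration of how a range literal such as the one above expands, the sketch below parses ``start..stop[:step]`` in plain Python. It is not the parser in ``exprParser``; the function name ``expand_range`` is hypothetical, and the inclusive upper bound and default step of 1 are assumptions made for this sketch.

```python
import re

# Hypothetical pattern for "start..stop" with an optional ":step" suffix.
_RANGE = re.compile(r"(-?\d+)\.\.(-?\d+)(?::(\d+))?")


def expand_range(literal):
    """Expand a range literal into the list of integers it stands for.

    Assumes the stop value is inclusive and the step defaults to 1.
    """
    match = _RANGE.fullmatch(literal.strip())
    if match is None:
        raise ValueError(f"Not a range literal: {literal!r}")
    start, stop = int(match.group(1)), int(match.group(2))
    step = int(match.group(3)) if match.group(3) else 1
    if step <= 0:
        raise ValueError("step must be a positive integer")
    # Python's range excludes its stop value, so add 1 to include it.
    return list(range(start, stop + 1, step))


values = expand_range("-10..-1:2")
```

Here ``values`` is ``[-10, -8, -6, -4, -2]``, matching the equivalence stated in the list above.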

Identifiers
-----------
^^^^^^^^^^^

Identifiers represent values external to a parser, such as values stored in a
database. The parser itself cannot define identifiers or their values; it is
@@ -95,22 +148,22 @@ for accessing raft name (obviously dotted names need knowledge of database
schema and how SQL query is built).

Unary arithmetic operators
--------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^

Two unary operators ``+`` (plus) and ``-`` (minus) can be used in the
expressions in front of (numeric) literals, identifiers, or other
expressions which should evaluate to a numeric value.

Binary arithmetic operators
---------------------------
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The language supports five arithmetic operators - ``+`` (add), ``-`` (subtract),
``*`` (multiply), ``/`` (divide), and ``%`` (modulo). Usual precedence rules
apply to these operators. Operands for them can be anything that evaluates to
a numeric value.

Comparison operators
--------------------
^^^^^^^^^^^^^^^^^^^^

The language supports the usual comparison operators - ``=``, ``!=``, ``<``,
``<=``, ``>``, ``>=``. These can be used on operands that evaluate to a numeric
@@ -121,7 +174,7 @@ values, for (in)equality operators operands can also be boolean expressions.
IN operator
-----------
^^^^^^^^^^^

The ``IN`` operator (and ``NOT IN``) are an expanded version of a regular SQL
IN operator. Its general syntax looks like::
@@ -147,7 +200,7 @@ as are these::
visit Not In (100, 110, 130, 135, 140, 145)

Boolean operators
-----------------
^^^^^^^^^^^^^^^^^

``NOT`` is the standard unary boolean negation operator.

@@ -157,13 +210,13 @@ All boolean operators can work on expressions which return boolean values.


Grouping operator
-----------------
^^^^^^^^^^^^^^^^^

Parentheses should be used to change evaluation order (precedence) of
sub-expressions in the full expression.

Examples
--------
^^^^^^^^

A few examples of valid expressions using some of the constructs::

3 changes: 2 additions & 1 deletion python/lsst/daf/butler/__init__.py
@@ -6,7 +6,8 @@
# dependencies.

from .core import *
from .registry import Registry, CollectionType # import the registry subpackage directly for other symbols
# Import the registry subpackage directly for other symbols.
from .registry import Registry, CollectionType, CollectionSearch, DatasetTypeRestriction
from ._butlerConfig import *
from ._deferredDatasetHandle import *
from ._butler import *
2 changes: 1 addition & 1 deletion python/lsst/daf/butler/_butler.py
@@ -1248,7 +1248,7 @@ def validateConfiguration(self, logFailures: bool = False,
if datasetTypeNames:
entities = [self.registry.getDatasetType(name) for name in datasetTypeNames]
else:
entities = list(self.registry.getAllDatasetTypes())
entities = list(self.registry.queryDatasetTypes())

# filter out anything from the ignore list
if ignore:
2 changes: 1 addition & 1 deletion python/lsst/daf/butler/registry/__init__.py
@@ -25,7 +25,7 @@
from ._collectionType import *

from . import wildcards
from .wildcards import Like # other symbols are mostly internal
from .wildcards import CollectionSearch, DatasetTypeRestriction
from . import interfaces
from .interfaces import MissingCollectionError
from . import queries