Merge pull request #397 from lsst/tickets/DM-27033
DM-27033: Development branch for schema changes
TallJimbo committed Nov 4, 2020
2 parents e2e77cb + 659bafd commit b44916f
Showing 85 changed files with 6,404 additions and 2,503 deletions.
116 changes: 90 additions & 26 deletions doc/lsst.daf.butler/queries.rst
@@ -34,17 +34,10 @@ Arguments that specify one or more collections are similar to those for dataset

- `str` values (the full collection name);
- `re.Pattern` values (matched to the collection name, via `~re.Pattern.fullmatch`);
- a `tuple` of (`str`, *dataset-type-restriction*) - see below;
- iterables of any of the above;
- the special value "``...``", which matches all collections;
- a mapping from `str` to *dataset-type-restriction*.

A *dataset-type-restriction* is a :ref:`DatasetType expression <daf_butler_dataset_type_expressions>` that limits a search for datasets in the associated collection to just the specified dataset types.
Unlike most other DatasetType expressions, it may not contain regular expressions (but it may be "``...``", which is the implied value when no
restriction is given, as it means "no restriction").
In contexts where restrictions are meaningless (e.g. `~Registry.queryCollections` when the ``datasetType`` argument is `None`) they are allowed but ignored.

Collection expressions are processed by the `~registry.wildcards.CollectionQuery`, and `~registry.wildcards.DatasetTypeRestriction` classes.
Collection expressions are processed by the `~registry.wildcards.CollectionQuery` class.
User code will rarely need to interact with these directly, but they can be passed to `Registry` instead of the expression objects themselves, and hence may be useful as a way to transform an expression that may include single-pass iterators into an equivalent form that can be reused.
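
As an illustrative sketch (not part of this change; the ``registry`` object and collection names are assumed), such expressions are typically passed to `Registry` query methods:

.. code-block:: python

    import re

    # "registry" is assumed to be an existing lsst.daf.butler.Registry,
    # e.g. Butler("repo").registry; the collection names here are made up.
    refs = registry.queryDatasets(
        "calexp",
        collections=["HSC/runs/RC2", re.compile(r"u/someuser/.+")],
    )

    # The special value ... matches every collection.
    all_collections = list(registry.queryCollections(...))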

Ordered collection searches
@@ -53,8 +46,6 @@ Ordered collection searches
An *ordered* collection expression is required in contexts where we want to search collections only until a dataset with a particular dataset type and data ID is found.
These include all direct `Butler` operations, the definitions of `~CollectionType.CHAINED` collections, `Registry.findDataset`, and the ``findFirst=True`` mode of `Registry.queryDatasets`.
In these contexts, regular expressions and "``...``" are not allowed for collection names, because they make it impossible to unambiguously define the order in which to search.
Dataset type restrictions are allowed in these contexts, and those
may be (and usually are) "``...``".

Ordered collection searches are processed by the `~registry.wildcards.CollectionSearch` class.
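
For example (a hedged sketch; the dataset type, data ID values, and collection names are invented), a direct lookup spells out concrete collection names in search order:

.. code-block:: python

    # Collections are searched left to right until a matching dataset is
    # found; patterns and ... are not allowed in this ordered context.
    ref = registry.findDataset(
        "calexp",
        instrument="HSC", visit=903334, detector=50,
        collections=["u/someuser/rerun", "HSC/runs/RC2"],
    )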

@@ -86,9 +77,9 @@ false (if it is a valid expression). Expression can contain a bunch of
standard logical operators, comparisons, literals, and identifiers which are
references to registry objects.

A few words in expression grammar are reserved: ``AND``, ``OR``, ``NOT`` and
``IN``. Reserved words are not case sensitive and can appear in either upper
or lower case, or a mixture of both.
A few words in expression grammar are reserved: ``AND``, ``OR``, ``NOT``,
``IN``, and ``OVERLAPS``. Reserved words are not case sensitive and can appear
in either upper or lower case, or a mixture of both.

Operator precedence rules are the same as in other languages such as C++ or
Python. When in doubt, use grouping operators (parentheses) for
@@ -194,17 +185,21 @@ IN operator. Its general syntax looks like:

.. code-block:: sql

    <expression> IN ( <literal1>[, <literal2>, ... ])
    <expression> NOT IN ( <literal1>[, <literal2>, ... ])
    <expression> IN ( <item1>[, <item2>, ... ])
    <expression> NOT IN ( <item1>[, <item2>, ... ])

where each item in the right-hand side list is one of the supported literals
or identifiers. Unlike the regular SQL ``IN`` operator, the list cannot contain
expressions, only literals or identifiers. The extension to regular SQL ``IN``
is that literals can be range literals as defined above. The query language
allows mixing different types of literals and ranges, but it may not make
sense to mix them when the expression is translated to SQL.

where each item in the right hand side list is one of the supported literals.
Unlike regular SQL IN operator the list cannot contain expressions, only
literals. The extension to regular SQL IN is that literals can be range
literals as defined above. It can also be a mixture of integer literals and
range literals (language allows mixing of string literals and ranges but it
may not make sense when translated to SQL).
A common use of the ``IN`` operator is checking whether an integer number is
in a set of numbers. In that case the list on the right-hand side can be a
mixture of integer literals, identifiers that represent integers, and range
literals.

For an example of range usage, these two expressions are equivalent:
For an example of this type of usage, these two expressions are equivalent:

.. code-block:: sql
@@ -218,6 +213,62 @@ as are these:
    visit NOT IN (100, 110, 130..145:5)
    visit Not In (100, 110, 130, 135, 140, 145)
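
As a sketch of how such an expression might be used from Python (the visit numbers and instrument name are made up), the string is passed via the ``where`` argument of a registry query:

.. code-block:: python

    # Query data IDs for a handful of visits selected with a range literal.
    data_ids = registry.queryDataIds(
        ["visit", "detector"],
        where="instrument = 'HSC' AND visit IN (100, 110, 130..145:5)",
    )
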
Another use of the ``IN`` operator is checking whether a timestamp or a time
range is contained wholly within another time range. A time range in this case
can be specified as a tuple of two time literals or identifiers, each
representing a timestamp, or as a single identifier representing a time range.
If a single identifier appears on the right-hand side of ``IN``, it has to be
enclosed in parentheses.

Here are a few examples of checking containment in a time range:

.. code-block:: sql

    -- using literals for both timestamp and time range
    T'2020-01-01' IN (T'2019-01-01', T'2020-01-01')
    (T'2020-01-01', T'2020-02-01') NOT IN (T'2019-01-01', T'2020-01-01')
    -- using identifiers for each timestamp in a time range
    T'2020-01-01' IN (interval.begin, interval.end)
    T'2020-01-01' NOT IN (interval_id)
    -- identifier on left side can represent either a timestamp or time range
    timestamp_id IN (interval.begin, interval.end)
    range_id NOT IN (interval_id)
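
A sketch of the same kind of expression used from Python (``exposure.timespan`` is assumed here to be a time-range identifier; it is not defined by this change):

.. code-block:: python

    # Select exposure records whose timespan lies wholly within a literal
    # time range; the instrument value is passed as a data ID keyword.
    records = registry.queryDimensionRecords(
        "exposure",
        where="exposure.timespan IN (T'2020-01-01', T'2020-02-01')",
        instrument="HSC",
    )
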
The same ``IN`` operator can be used to check containment of a point or a
region inside another region. Presently there is no special literal type for
regions, so this can only be done with regions represented by identifiers. A
few examples of region containment:

.. code-block:: sql

    POINT(ra, dec) IN (region1)
    region2 NOT IN (region1)

OVERLAPS operator
^^^^^^^^^^^^^^^^^

The ``OVERLAPS`` operator checks for overlapping time ranges or regions; its
arguments have to have consistent types. As with the ``IN`` operator, time
ranges can be represented by a tuple of two timestamps (literals or
identifiers) or by a single identifier. Regions can only appear as
identifiers. The ``OVERLAPS`` syntax is similar to ``IN``, but it does not
require parentheses on the right-hand side when a single identifier represents
the time range or region.

A few examples of the syntax:

.. code-block:: sql

    (T'2020-01-01', T'2022-01-01') OVERLAPS (T'2019-01-01', T'2021-01-01')
    (interval.begin, interval.end) OVERLAPS interval_2
    interval_1 OVERLAPS interval_2
    NOT (region_1 OVERLAPS region_2)
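
A hedged Python sketch of the same operator (the dataset type, collection name, and the ``exposure.timespan`` identifier are assumptions for illustration):

.. code-block:: python

    # Find raw datasets taken during a given night; the single identifier on
    # the left-hand side needs no extra parentheses.
    refs = registry.queryDatasets(
        "raw",
        collections=["HSC/raw/all"],
        where="instrument = 'HSC' AND "
              "exposure.timespan OVERLAPS (T'2020-03-01', T'2020-03-02')",
    )
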
Boolean operators
^^^^^^^^^^^^^^^^^

@@ -235,6 +286,19 @@ Parentheses should be used to change evaluation order (precedence) of
sub-expressions in the full expression.


Function call
^^^^^^^^^^^^^

Function call syntax is similar to that of other languages: a call expression
consists of an identifier followed by zero or more comma-separated arguments
enclosed in parentheses (e.g. ``func(1, 2, 3)``). An argument to a function
can be any expression.

Presently there is only one construct that uses this syntax: ``POINT(ra, dec)``
is a function that declares (or returns) sky coordinates, similar to ADQL
syntax. The name of the ``POINT`` function is not case-sensitive.
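
As a sketch (the ``patch.region`` identifier, skymap name, and coordinate values are assumptions for illustration), ``POINT`` is typically combined with the containment operator described above:

.. code-block:: python

    # Find tract/patch data IDs whose (assumed) region contains a fixed sky
    # position; ra and dec are assumed to be given in degrees.
    data_ids = registry.queryDataIds(
        ["tract", "patch"],
        where="skymap = 'hsc_rings_v1' AND POINT(320.25, -0.5) IN (patch.region)",
    )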


.. _time-literals-syntax:

Time literals
@@ -246,7 +310,7 @@ supported time formats. For internal time representation Registry uses
`astropy.time.Time`_ class and parser converts time string into an instance
of that class. For string-based time formats such as ISO the conversion
of a time string to an object is done by the ``Time`` constructor. The syntax
of the string could be anything that is suported by ``astropy``, for details
of the string could be anything that is supported by ``astropy``, for details
see `astropy.time`_ reference. For numeric time formats such as MJD the parser
converts string to a floating point number and passes that number to ``Time``
constructor.
@@ -261,9 +325,9 @@ Parser guesses time format from the content of the time string:
"fits" format.
- If string matches ``year:day:time`` format then "yday" is used.

The format can be specified explicitely by prefixing time string with a format
The format can be specified explicitly by prefixing time string with a format
name and slash, e.g. ``T'mjd/58938.515'``. Any of the formats supported by
``astropy`` can be specified explicitely.
``astropy`` can be specified explicitly.

Time scale that parser passes to ``Time`` constructor depends on time format,
by default parser uses:
@@ -272,7 +336,7 @@ by default parser uses:
- "tt" scale for "cxcsec" format,
- "tai" scale for anything else.

Default scale can be overriden by adding a suffix to time string consisting
Default scale can be overridden by adding a suffix to time string consisting
of a slash and time scale name, e.g. ``T'58938.515/tai'``. Any combination of
explicit time format and time scale can be given at the same time, e.g.
``T'58938.515'``, ``T'mjd/58938.515'``, ``T'58938.515/tai'``, and
2 changes: 1 addition & 1 deletion python/lsst/daf/butler/__init__.py
@@ -7,7 +7,7 @@

from .core import *
# Import the registry subpackage directly for other symbols.
from .registry import Registry, CollectionType, CollectionSearch, DatasetTypeRestriction
from .registry import Registry, RegistryConfig, CollectionType, CollectionSearch, DatasetTypeRestriction
from ._butlerConfig import *
from ._deferredDatasetHandle import *
from ._butler import *
88 changes: 81 additions & 7 deletions python/lsst/daf/butler/_butler.py
@@ -62,6 +62,7 @@
    DatasetRef,
    DatasetType,
    Datastore,
    DimensionConfig,
    FileDataset,
    StorageClassFactory,
    Timespan,
@@ -252,10 +253,10 @@ def __init__(self, config: Union[Config, str, None] = None, *,
"""

@staticmethod
def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: bool = False,
createRegistry: bool = True, searchPaths: Optional[List[str]] = None,
forceConfigRoot: bool = True, outfile: Optional[str] = None,
overwrite: bool = False) -> Config:
def makeRepo(root: str, config: Union[Config, str, None] = None,
dimensionConfig: Union[Config, str, None] = None, standalone: bool = False,
searchPaths: Optional[List[str]] = None, forceConfigRoot: bool = True,
outfile: Optional[str] = None, overwrite: bool = False) -> Config:
"""Create an empty data repository by adding a butler.yaml config
to a repository root directory.
@@ -271,6 +272,9 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo
            configuration will be used. Root-dependent config options
            specified in this config are overwritten if ``forceConfigRoot``
            is `True`.
        dimensionConfig : `Config` or `str`, optional
            Configuration for dimensions, used to initialize the registry
            database.
        standalone : `bool`
            If True, write all expanded defaults, not just customized or
            repository-specific settings.
@@ -279,8 +283,6 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo
            may be good or bad, depending on the nature of the changes).
            Future *additions* to the defaults will still be picked up when
            initializing `Butlers` to repos created with ``standalone=True``.
        createRegistry : `bool`, optional
            If `True` create a new Registry.
        searchPaths : `list` of `str`, optional
            Directory paths to search when calculating the full butler
            configuration.
@@ -360,6 +362,13 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo

        if standalone:
            config.merge(full)
        else:
            # Always expand the registry.managers section into the per-repo
            # config, because after the database schema is created, it's not
            # allowed to change anymore. Note that in the standalone=True
            # branch, _everything_ in the config is expanded, so there's no
            # need to special case this.
            Config.updateParameters(RegistryConfig, config, full, toCopy=("managers",), overwrite=False)
        if outfile is not None:
            # When writing to a separate location we must include
            # the root of the butler repo in the config else it won't know
@@ -371,7 +380,10 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo
            config.dumpToUri(configURI, overwrite=overwrite)

        # Create Registry and populate tables
        Registry.fromConfig(config, create=createRegistry, butlerRoot=root)
        registryConfig = RegistryConfig(config.get("registry"))
        dimensionConfig = DimensionConfig(dimensionConfig)
        Registry.createFromConfig(registryConfig, dimensionConfig=dimensionConfig, butlerRoot=root)

        return config

@classmethod
@@ -542,6 +554,68 @@ def _findDatasetRef(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
        else:
            idNumber = None
        timespan: Optional[Timespan] = None

        # Process dimension records that are using record information
        # rather than ids
        newDataId: dict[Any, Any] = {}
        byRecord: dict[Any, dict[str, Any]] = defaultdict(dict)

        # if all the dataId comes from keyword parameters we do not need
        # to do anything here because they can't be of the form
        # exposure.obs_id because a "." is not allowed in a keyword parameter.
        if dataId:
            for k, v in dataId.items():
                # If we have a Dimension we do not need to do anything
                # because it cannot be a compound key.
                if isinstance(k, str) and "." in k:
                    # Someone is using a more human-readable dataId
                    dimension, record = k.split(".", 1)
                    byRecord[dimension][record] = v
                else:
                    newDataId[k] = v

        if byRecord:
            # Some record specifiers were found so we need to convert
            # them to the Id form
            for dimensionName, values in byRecord.items():
                if dimensionName in newDataId:
                    log.warning("DataId specified explicit %s dimension value of %s in addition to"
                                " general record specifiers for it of %s. Ignoring record information.",
                                dimensionName, newDataId[dimensionName], str(values))
                    continue

                # Build up a WHERE expression -- use single quotes
                def quote(s):
                    if isinstance(s, str):
                        return f"'{s}'"
                    else:
                        return s

                where = " AND ".join(f"{dimensionName}.{k} = {quote(v)}"
                                     for k, v in values.items())

                # Hopefully we get a single record that matches
                records = set(self.registry.queryDimensionRecords(dimensionName, dataId=newDataId,
                                                                  where=where, **kwds))

                if len(records) != 1:
                    if len(records) > 1:
                        log.debug("Received %d records from constraints of %s", len(records), str(values))
                        for r in records:
                            log.debug("- %s", str(r))
                        raise RuntimeError(f"DataId specification for dimension {dimensionName} is not"
                                           f" uniquely constrained to a single dataset by {values}."
                                           f" Got {len(records)} results.")
                    raise RuntimeError(f"DataId specification for dimension {dimensionName} matched no"
                                       f" records when constrained by {values}")

                # Get the primary key from the real dimension object
                dimension = self.registry.dimensions[dimensionName]
                newDataId[dimensionName] = getattr(records.pop(), dimension.primaryKey.name)

            # We have modified the dataId so need to switch to it
            dataId = newDataId

        if datasetType.isCalibration():
            # Because this is a calibration dataset, first try to make a
            # standardize the data ID without restricting the dimensions to
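
Taken together, a hedged sketch of how the new ``dimensionConfig`` argument and the record-based data ID handling above might be exercised (the paths, dataset type, collection name, and record values are invented for illustration):

.. code-block:: python

    from lsst.daf.butler import Butler

    # Create a repository; dimensionConfig may be a Config, a path to a YAML
    # file, or None to fall back to the packaged default dimension config.
    Butler.makeRepo("/tmp/example_repo", dimensionConfig="my_dimensions.yaml")

    butler = Butler("/tmp/example_repo", collections="HSC/raw/all")

    # With the _findDatasetRef changes above, a data ID key of the form
    # "dimension.record" (e.g. exposure.obs_id) is resolved to the dimension
    # primary key by querying dimension records.
    raw = butler.get(
        "raw",
        {"instrument": "HSC", "detector": 50, "exposure.obs_id": "HSCA90333100"},
    )
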
4 changes: 1 addition & 3 deletions python/lsst/daf/butler/_butlerConfig.py
@@ -36,15 +36,13 @@
    ButlerURI,
    Config,
    DatastoreConfig,
    DimensionConfig,
    StorageClassConfig,
)
from .registry import RegistryConfig
from .transfers import RepoTransferFormatConfig

CONFIG_COMPONENT_CLASSES = (RegistryConfig, StorageClassConfig,
                            DatastoreConfig, DimensionConfig,
                            RepoTransferFormatConfig)
                            DatastoreConfig, RepoTransferFormatConfig)


class ButlerConfig(Config):
1 change: 1 addition & 0 deletions python/lsst/daf/butler/cli/cmd/commands.py
@@ -75,6 +75,7 @@ def butler_import(*args, **kwargs):
@click.command()
@repo_argument(required=True, help=willCreateRepoHelp)
@click.option("--seed-config", help="Path to an existing YAML config file to apply (on top of defaults).")
@click.option("--dimension-config", help="Path to an existing YAML config file with dimension configuration.")
@click.option("--standalone", is_flag=True, help="Include all defaults in the config file in the repo, "
"insulating the repo from changes in package defaults.")
@click.option("--override", is_flag=True, help="Allow values in the supplied config to override all "
