Merge pull request #397 from lsst/tickets/DM-27033
DM-27033: Development branch for schema changes
TallJimbo committed Nov 4, 2020
2 parents e2e77cb + 659bafd commit b44916f
Showing 85 changed files with 6,404 additions and 2,503 deletions.
116 changes: 90 additions & 26 deletions doc/lsst.daf.butler/queries.rst
@@ -34,17 +34,10 @@ Arguments that specify one or more collections are similar to those for dataset

- `str` values (the full collection name);
- `re.Pattern` values (matched to the collection name, via `~re.Pattern.fullmatch`);
- a `tuple` of (`str`, *dataset-type-restriction*) - see below;
- iterables of any of the above;
- the special value "``...``", which matches all collections;
- a mapping from `str` to *dataset-type-restriction*.

A *dataset-type-restriction* is a :ref:`DatasetType expression <daf_butler_dataset_type_expressions>` that limits a search for datasets in the associated collection to just the specified dataset types.
Unlike most other DatasetType expressions, it may not contain regular expressions (but it may be "``...``", which is the implied value when no
restriction is given, as it means "no restriction").
In contexts where restrictions are meaningless (e.g. `~Registry.queryCollections` when the ``datasetType`` argument is `None`) they are allowed but ignored.

Collection expressions are processed by the `~registry.wildcards.CollectionQuery`, and `~registry.wildcards.DatasetTypeRestriction` classes.
Collection expressions are processed by the `~registry.wildcards.CollectionQuery` class.
User code will rarely need to interact with these directly, but they can be passed to `Registry` instead of the expression objects themselves, and hence may be useful as a way to transform an expression that may include single-pass iterators into an equivalent form that can be reused.
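
As an illustrative sketch (not part of this change; the ``registry`` object and collection names are assumed), such expressions are typically passed to `Registry` query methods:

.. code-block:: python

    import re

    # "registry" is assumed to be an existing lsst.daf.butler.Registry,
    # e.g. Butler("repo").registry; the collection names here are made up.
    refs = registry.queryDatasets(
        "calexp",
        collections=["HSC/runs/RC2", re.compile(r"u/someuser/.+")],
    )

    # The special value ... matches every collection.
    all_collections = list(registry.queryCollections(...))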

Ordered collection searches
@@ -53,8 +46,6 @@ Ordered collection searches
An *ordered* collection expression is required in contexts where we want to search collections only until a dataset with a particular dataset type and data ID is found.
These include all direct `Butler` operations, the definitions of `~CollectionType.CHAINED` collections, `Registry.findDataset`, and the ``findFirst=True`` mode of `Registry.queryDatasets`.
In these contexts, regular expressions and "``...``" are not allowed for collection names, because they make it impossible to unambiguously define the order in which to search.
Dataset type restrictions are allowed in these contexts, and those
may be (and usually are) "``...``".

Ordered collection searches are processed by the `~registry.wildcards.CollectionSearch` class.
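
For example (a hedged sketch; the dataset type, data ID values, and collection names are invented), a direct lookup spells out concrete collection names in search order:

.. code-block:: python

    # Collections are searched left to right until a matching dataset is
    # found; patterns and ... are not allowed in this ordered context.
    ref = registry.findDataset(
        "calexp",
        instrument="HSC", visit=903334, detector=50,
        collections=["u/someuser/rerun", "HSC/runs/RC2"],
    )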

@@ -86,9 +77,9 @@ false (if it is a valid expression). Expression can contain a bunch of
standard logical operators, comparisons, literals, and identifiers which are
references to registry objects.

A few words in expression grammar are reserved: ``AND``, ``OR``, ``NOT`` and
``IN``. Reserved words are not case sensitive and can appear in either upper
or lower case, or a mixture of both.
A few words in expression grammar are reserved: ``AND``, ``OR``, ``NOT``,
``IN``, and ``OVERLAPS``. Reserved words are not case sensitive and can appear
in either upper or lower case, or a mixture of both.

Operator precedence rules are the same as in other languages such as C++ or
Python. When in doubt, use grouping operators (parentheses) for
@@ -194,17 +185,21 @@ IN operator. Its general syntax looks like:

.. code-block:: sql

    <expression> IN ( <literal1>[, <literal2>, ... ])
    <expression> NOT IN ( <literal1>[, <literal2>, ... ])
    <expression> IN ( <item1>[, <item2>, ... ])
    <expression> NOT IN ( <item1>[, <item2>, ... ])

where each item in the right-hand side list is one of the supported literals
or identifiers. Unlike the regular SQL ``IN`` operator, the list cannot contain
expressions, only literals or identifiers. The extension to regular SQL ``IN``
is that literals can be range literals as defined above. The query language
allows mixing different types of literals and ranges, but it may not make
sense to mix them when the expression is translated to SQL.

where each item in the right hand side list is one of the supported literals.
Unlike regular SQL IN operator the list cannot contain expressions, only
literals. The extension to regular SQL IN is that literals can be range
literals as defined above. It can also be a mixture of integer literals and
range literals (language allows mixing of string literals and ranges but it
may not make sense when translated to SQL).
A common use of the ``IN`` operator is checking whether an integer number is
in a set of numbers. In that case the list on the right-hand side can be a
mixture of integer literals, identifiers that represent integers, and range
literals.

For an example of range usage, these two expressions are equivalent:
For an example of this type of usage, these two expressions are equivalent:

.. code-block:: sql
@@ -218,6 +213,62 @@ as are these:
    visit NOT IN (100, 110, 130..145:5)
    visit Not In (100, 110, 130, 135, 140, 145)
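
As a sketch of how such an expression might be used from Python (the visit numbers and instrument name are made up), the string is passed via the ``where`` argument of a registry query:

.. code-block:: python

    # Query data IDs for a handful of visits selected with a range literal.
    data_ids = registry.queryDataIds(
        ["visit", "detector"],
        where="instrument = 'HSC' AND visit IN (100, 110, 130..145:5)",
    )
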
Another use of the ``IN`` operator is checking whether a timestamp or a time
range is contained wholly within another time range. A time range in this case
can be specified as a tuple of two time literals or identifiers, each
representing a timestamp, or as a single identifier representing a time range.
If a single identifier appears on the right-hand side of ``IN``, it has to be
enclosed in parentheses.

Here are a few examples of checking containment in a time range:

.. code-block:: sql

    -- using literals for both timestamp and time range
    T'2020-01-01' IN (T'2019-01-01', T'2020-01-01')
    (T'2020-01-01', T'2020-02-01') NOT IN (T'2019-01-01', T'2020-01-01')
    -- using identifiers for each timestamp in a time range
    T'2020-01-01' IN (interval.begin, interval.end)
    T'2020-01-01' NOT IN (interval_id)
    -- identifier on left side can represent either a timestamp or time range
    timestamp_id IN (interval.begin, interval.end)
    range_id NOT IN (interval_id)
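
A sketch of the same kind of expression used from Python (``exposure.timespan`` is assumed here to be a time-range identifier; it is not defined by this change):

.. code-block:: python

    # Select exposure records whose timespan lies wholly within a literal
    # time range; the instrument value is passed as a data ID keyword.
    records = registry.queryDimensionRecords(
        "exposure",
        where="exposure.timespan IN (T'2020-01-01', T'2020-02-01')",
        instrument="HSC",
    )
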
The same ``IN`` operator can be used to check containment of a point or a
region inside another region. Presently there is no special literal type for
regions, so this can only be done with regions represented by identifiers. A
few examples of region containment:

.. code-block:: sql

    POINT(ra, dec) IN (region1)
    region2 NOT IN (region1)

OVERLAPS operator
^^^^^^^^^^^^^^^^^

The ``OVERLAPS`` operator checks for overlapping time ranges or regions; its
arguments have to have consistent types. As with the ``IN`` operator, time
ranges can be represented by a tuple of two timestamps (literals or
identifiers) or by a single identifier. Regions can only appear as
identifiers. The ``OVERLAPS`` syntax is similar to ``IN``, but it does not
require parentheses on the right-hand side when a single identifier represents
the time range or region.

A few examples of the syntax:

.. code-block:: sql

    (T'2020-01-01', T'2022-01-01') OVERLAPS (T'2019-01-01', T'2021-01-01')
    (interval.begin, interval.end) OVERLAPS interval_2
    interval_1 OVERLAPS interval_2
    NOT (region_1 OVERLAPS region_2)
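
A hedged Python sketch of the same operator (the dataset type, collection name, and the ``exposure.timespan`` identifier are assumptions for illustration):

.. code-block:: python

    # Find raw datasets taken during a given night; the single identifier on
    # the left-hand side needs no extra parentheses.
    refs = registry.queryDatasets(
        "raw",
        collections=["HSC/raw/all"],
        where="instrument = 'HSC' AND "
              "exposure.timespan OVERLAPS (T'2020-03-01', T'2020-03-02')",
    )
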
Boolean operators
^^^^^^^^^^^^^^^^^

@@ -235,6 +286,19 @@ Parentheses should be used to change evaluation order (precedence) of
sub-expressions in the full expression.


Function call
^^^^^^^^^^^^^

Function call syntax is similar to that of other languages: a call expression
consists of an identifier followed by zero or more comma-separated arguments
enclosed in parentheses (e.g. ``func(1, 2, 3)``). An argument to a function
can be any expression.

Presently there is only one construct that uses this syntax: ``POINT(ra, dec)``
is a function that declares (or returns) sky coordinates, similar to ADQL
syntax. The name of the ``POINT`` function is not case-sensitive.
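
As a sketch (the ``patch.region`` identifier, skymap name, and coordinate values are assumptions for illustration), ``POINT`` is typically combined with the containment operator described above:

.. code-block:: python

    # Find tract/patch data IDs whose (assumed) region contains a fixed sky
    # position; ra and dec are assumed to be given in degrees.
    data_ids = registry.queryDataIds(
        ["tract", "patch"],
        where="skymap = 'hsc_rings_v1' AND POINT(320.25, -0.5) IN (patch.region)",
    )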


.. _time-literals-syntax:

Time literals
@@ -246,7 +310,7 @@ supported time formats. For internal time representation Registry uses
`astropy.time.Time`_ class and parser converts time string into an instance
of that class. For string-based time formats such as ISO the conversion
of a time string to an object is done by the ``Time`` constructor. The syntax
of the string could be anything that is suported by ``astropy``, for details
of the string could be anything that is supported by ``astropy``, for details
see `astropy.time`_ reference. For numeric time formats such as MJD the parser
converts string to a floating point number and passes that number to ``Time``
constructor.
@@ -261,9 +325,9 @@ Parser guesses time format from the content of the time string:
"fits" format.
- If string matches ``year:day:time`` format then "yday" is used.

The format can be specified explicitely by prefixing time string with a format
The format can be specified explicitly by prefixing time string with a format
name and slash, e.g. ``T'mjd/58938.515'``. Any of the formats supported by
``astropy`` can be specified explicitely.
``astropy`` can be specified explicitly.

Time scale that parser passes to ``Time`` constructor depends on time format,
by default parser uses:
@@ -272,7 +336,7 @@ by default parser uses:
- "tt" scale for "cxcsec" format,
- "tai" scale for anything else.

Default scale can be overriden by adding a suffix to time string consisting
Default scale can be overridden by adding a suffix to time string consisting
of a slash and time scale name, e.g. ``T'58938.515/tai'``. Any combination of
explicit time format and time scale can be given at the same time, e.g.
``T'58938.515'``, ``T'mjd/58938.515'``, ``T'58938.515/tai'``, and
2 changes: 1 addition & 1 deletion python/lsst/daf/butler/__init__.py
@@ -7,7 +7,7 @@

from .core import *
# Import the registry subpackage directly for other symbols.
from .registry import Registry, CollectionType, CollectionSearch, DatasetTypeRestriction
from .registry import Registry, RegistryConfig, CollectionType, CollectionSearch, DatasetTypeRestriction
from ._butlerConfig import *
from ._deferredDatasetHandle import *
from ._butler import *
88 changes: 81 additions & 7 deletions python/lsst/daf/butler/_butler.py
@@ -62,6 +62,7 @@
    DatasetRef,
    DatasetType,
    Datastore,
    DimensionConfig,
    FileDataset,
    StorageClassFactory,
    Timespan,
@@ -252,10 +253,10 @@ def __init__(self, config: Union[Config, str, None] = None, *,
"""

@staticmethod
def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: bool = False,
createRegistry: bool = True, searchPaths: Optional[List[str]] = None,
forceConfigRoot: bool = True, outfile: Optional[str] = None,
overwrite: bool = False) -> Config:
def makeRepo(root: str, config: Union[Config, str, None] = None,
dimensionConfig: Union[Config, str, None] = None, standalone: bool = False,
searchPaths: Optional[List[str]] = None, forceConfigRoot: bool = True,
outfile: Optional[str] = None, overwrite: bool = False) -> Config:
"""Create an empty data repository by adding a butler.yaml config
to a repository root directory.
@@ -271,6 +272,9 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo
            configuration will be used. Root-dependent config options
            specified in this config are overwritten if ``forceConfigRoot``
            is `True`.
        dimensionConfig : `Config` or `str`, optional
            Configuration for dimensions, used to initialize the registry
            database.
        standalone : `bool`
            If True, write all expanded defaults, not just customized or
            repository-specific settings.
@@ -279,8 +283,6 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo
            may be good or bad, depending on the nature of the changes).
            Future *additions* to the defaults will still be picked up when
            initializing `Butlers` to repos created with ``standalone=True``.
        createRegistry : `bool`, optional
            If `True` create a new Registry.
        searchPaths : `list` of `str`, optional
            Directory paths to search when calculating the full butler
            configuration.
@@ -360,6 +362,13 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo

        if standalone:
            config.merge(full)
        else:
            # Always expand the registry.managers section into the per-repo
            # config, because after the database schema is created, it's not
            # allowed to change anymore. Note that in the standalone=True
            # branch, _everything_ in the config is expanded, so there's no
            # need to special case this.
            Config.updateParameters(RegistryConfig, config, full, toCopy=("managers",), overwrite=False)
        if outfile is not None:
            # When writing to a separate location we must include
            # the root of the butler repo in the config else it won't know
@@ -371,7 +380,10 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo
            config.dumpToUri(configURI, overwrite=overwrite)

        # Create Registry and populate tables
        Registry.fromConfig(config, create=createRegistry, butlerRoot=root)
        registryConfig = RegistryConfig(config.get("registry"))
        dimensionConfig = DimensionConfig(dimensionConfig)
        Registry.createFromConfig(registryConfig, dimensionConfig=dimensionConfig, butlerRoot=root)

        return config

@classmethod
@@ -542,6 +554,68 @@ def _findDatasetRef(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
        else:
            idNumber = None
        timespan: Optional[Timespan] = None

        # Process dimension records that are using record information
        # rather than ids
        newDataId: dict[Any, Any] = {}
        byRecord: dict[Any, dict[str, Any]] = defaultdict(dict)

        # if all the dataId comes from keyword parameters we do not need
        # to do anything here because they can't be of the form
        # exposure.obs_id because a "." is not allowed in a keyword parameter.
        if dataId:
            for k, v in dataId.items():
                # If we have a Dimension we do not need to do anything
                # because it cannot be a compound key.
                if isinstance(k, str) and "." in k:
                    # Someone is using a more human-readable dataId
                    dimension, record = k.split(".", 1)
                    byRecord[dimension][record] = v
                else:
                    newDataId[k] = v

        if byRecord:
            # Some record specifiers were found so we need to convert
            # them to the Id form
            for dimensionName, values in byRecord.items():
                if dimensionName in newDataId:
                    log.warning("DataId specified explicit %s dimension value of %s in addition to"
                                " general record specifiers for it of %s. Ignoring record information.",
                                dimensionName, newDataId[dimensionName], str(values))
                    continue

                # Build up a WHERE expression -- use single quotes
                def quote(s):
                    if isinstance(s, str):
                        return f"'{s}'"
                    else:
                        return s

                where = " AND ".join(f"{dimensionName}.{k} = {quote(v)}"
                                     for k, v in values.items())

                # Hopefully we get a single record that matches
                records = set(self.registry.queryDimensionRecords(dimensionName, dataId=newDataId,
                                                                  where=where, **kwds))

                if len(records) != 1:
                    if len(records) > 1:
                        log.debug("Received %d records from constraints of %s", len(records), str(values))
                        for r in records:
                            log.debug("- %s", str(r))
                        raise RuntimeError(f"DataId specification for dimension {dimensionName} is not"
                                           f" uniquely constrained to a single dataset by {values}."
                                           f" Got {len(records)} results.")
                    raise RuntimeError(f"DataId specification for dimension {dimensionName} matched no"
                                       f" records when constrained by {values}")

                # Get the primary key from the real dimension object
                dimension = self.registry.dimensions[dimensionName]
                newDataId[dimensionName] = getattr(records.pop(), dimension.primaryKey.name)

            # We have modified the dataId so need to switch to it
            dataId = newDataId

        if datasetType.isCalibration():
            # Because this is a calibration dataset, first try to make a
            # standardize the data ID without restricting the dimensions to
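
Taken together, a hedged sketch of how the new ``dimensionConfig`` argument and the record-based data ID handling above might be exercised (the paths, dataset type, collection name, and record values are invented for illustration):

.. code-block:: python

    from lsst.daf.butler import Butler

    # Create a repository; dimensionConfig may be a Config, a path to a YAML
    # file, or None to fall back to the packaged default dimension config.
    Butler.makeRepo("/tmp/example_repo", dimensionConfig="my_dimensions.yaml")

    butler = Butler("/tmp/example_repo", collections="HSC/raw/all")

    # With the _findDatasetRef changes above, a data ID key of the form
    # "dimension.record" (e.g. exposure.obs_id) is resolved to the dimension
    # primary key by querying dimension records.
    raw = butler.get(
        "raw",
        {"instrument": "HSC", "detector": 50, "exposure.obs_id": "HSCA90333100"},
    )
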
4 changes: 1 addition & 3 deletions python/lsst/daf/butler/_butlerConfig.py
@@ -36,15 +36,13 @@
    ButlerURI,
    Config,
    DatastoreConfig,
    DimensionConfig,
    StorageClassConfig,
)
from .registry import RegistryConfig
from .transfers import RepoTransferFormatConfig

CONFIG_COMPONENT_CLASSES = (RegistryConfig, StorageClassConfig,
                            DatastoreConfig, DimensionConfig,
                            RepoTransferFormatConfig)
                            DatastoreConfig, RepoTransferFormatConfig)


class ButlerConfig(Config):
1 change: 1 addition & 0 deletions python/lsst/daf/butler/cli/cmd/commands.py
@@ -75,6 +75,7 @@ def butler_import(*args, **kwargs):
@click.command()
@repo_argument(required=True, help=willCreateRepoHelp)
@click.option("--seed-config", help="Path to an existing YAML config file to apply (on top of defaults).")
@click.option("--dimension-config", help="Path to an existing YAML config file with dimension configuration.")
@click.option("--standalone", is_flag=True, help="Include all defaults in the config file in the repo, "
"insulating the repo from changes in package defaults.")
@click.option("--override", is_flag=True, help="Allow values in the supplied config to override all "
