DM-27033: Development branch for schema changes #397

Merged: 85 commits, Nov 4, 2020
0ad25b3
Make DimensionRecord immutable.
TallJimbo Oct 2, 2020
3122f18
Move _topology module out of dimensions subpackage.
TallJimbo Oct 2, 2020
d35db8b
Clean up immutable decorator and its usage.
TallJimbo Sep 29, 2020
341b520
Add missing symbol exports for named sets.
TallJimbo Oct 4, 2020
f8a671e
Store dimensions configuration in registry database (DM-26407)
andy-slac Oct 14, 2020
3efd054
Add cached_getter decorator to stand in for 3.8's cached_property.
TallJimbo Oct 2, 2020
27f3f34
Rename to TimespanDatabaseRepresentation.
TallJimbo Oct 2, 2020
3c22a92
Separate DimensionConfig from other configs.
andy-slac Oct 15, 2020
10f9832
Fix missing char in yaml data dump.
TallJimbo Oct 1, 2020
4207705
Add base class to TimespanDatabaseRepresention.
TallJimbo Oct 2, 2020
b3819d1
Fix postgres unit test
andy-slac Oct 16, 2020
b70873e
Rename dimensions package modules with underscores.
TallJimbo Oct 2, 2020
66f8cf6
Add forgotten DimensionCombination symbol export.
TallJimbo Oct 2, 2020
cf58d86
Ignore open files check for registry sqlite database
andy-slac Oct 16, 2020
240f4b6
Overhaul dimension construction and class relationships.
TallJimbo Oct 2, 2020
d5c25b7
Make NamedKeyMapping.keys() return NamedValueAbstractSet.
TallJimbo Oct 7, 2020
abda21e
Drop `createRegistry` argument from `makeRepo`
andy-slac Oct 19, 2020
8f3a774
Merge branch 'tickets/DM-27034' into tickets/DM-27033
timj Oct 20, 2020
c2e4b2b
Doc fix/improvement for Dimension.alternateKeys.
TallJimbo Oct 17, 2020
125fd20
Add timestamp field to dataset table for automatically recording time…
timj Oct 16, 2020
4d9a1af
Merge branch 'tickets/DM-26407' into tickets/DM-27033
timj Oct 20, 2020
a846ed1
Switch format for dimensions in database to JSON
timj Oct 22, 2020
7b57a11
Add __str__ to DimensionElementFields.
TallJimbo Oct 15, 2020
0bd7f94
Merge pull request #394 from lsst/tickets/DM-25180
timj Oct 21, 2020
94ec2b4
Rename exposure.name to exposure.obs_id
timj Oct 26, 2020
eed5619
Rewrite check to workaround mypy confusion
timj Oct 22, 2020
1e8930c
Merge pull request #398 from lsst/tickets/DM-27035
TallJimbo Oct 22, 2020
67006e1
Add day_obs and seq_num to exposure/visit definitions
timj Oct 26, 2020
322f24a
Merge pull request #401 from lsst/tickets/DM-27266
timj Oct 22, 2020
406769e
Add special GovernorDimension concept (for instrument and skymap).
TallJimbo Oct 16, 2020
58105b9
Prohibit dataset type names that clash with governor dimension names.
TallJimbo Oct 21, 2020
e6014f5
Pass DimensionUniverse argument to several collection manager methods
TallJimbo Oct 21, 2020
f3ecfb8
Minor doc fixes.
TallJimbo Oct 21, 2020
5d8bc41
Add ways to get SkyPix objects from a universe.
TallJimbo Oct 22, 2020
a15d641
Add governor dimension restrictions to collection chains.
TallJimbo Oct 21, 2020
4185d19
Add support for double-dot identifiers (DM-27293)
andy-slac Oct 27, 2020
e1a0bcd
Add method to get just the database dimension elements from a universe.
TallJimbo Oct 28, 2020
a31885a
Export DataCoordinate key/value type aliases.
TallJimbo Oct 24, 2020
c2b169a
Add support for POINT() function.
andy-slac Oct 27, 2020
a4bdb44
Rework dimension record storage construction further.
TallJimbo Oct 29, 2020
32ba9a7
Refresh dimension record storage manager at Registry construction.
TallJimbo Oct 28, 2020
6086879
Implement simple identifier substitution
andy-slac Oct 27, 2020
707fca0
Add system for notifying dependent dimensions about governor inserts.
TallJimbo Oct 29, 2020
8e51875
Add special record storage classes for GovernorDimensions.
TallJimbo Oct 28, 2020
c77d7aa
Add support for 2-tuple expressions
andy-slac Oct 28, 2020
7eb8dd1
Remove defunct, broken test method.
TallJimbo Oct 29, 2020
d2463c0
Rename "standard" dimensions to "database" dimensions.
TallJimbo Oct 28, 2020
ad5b2fa
Enable identifiers in IN rhs list
andy-slac Oct 28, 2020
5e66cb7
Rewrite tests for spatial overlap queries.
TallJimbo Oct 29, 2020
3356071
Add DatabaseDimensionElement intermediate base class.
TallJimbo Oct 28, 2020
2ed5e2d
Add OVERLAPS operator (and reserved word)
andy-slac Oct 28, 2020
c2ed155
Add support for saving DimensionGraph definitions to database.
TallJimbo Oct 31, 2020
29eb182
Use more general overlap tables to implement commonSkyPix system.
TallJimbo Oct 29, 2020
9e8e572
Make name an (abstract) property rather than a base-class attribute.
TallJimbo Oct 28, 2020
ab0c220
Update documentation for newest parser additions
andy-slac Oct 27, 2020
da0358e
Merge branch 'tickets/DM-27321' into tickets/DM-27033
timj Oct 29, 2020
9058b71
Pass DimensionRecordStorage manager to DatasetRecordStorageManager.
TallJimbo Oct 31, 2020
da55553
Add stubs for materializing overlaps between non-skypix tables.
TallJimbo Oct 30, 2020
7d07747
Delegate common attributes to StandardDimensionElement.
TallJimbo Oct 28, 2020
63dd003
Improve error message for missing butler_attributes table (DM-27373)
andy-slac Oct 29, 2020
70f8e36
Merge branch 'tickets/DM-27293' into tickets/DM-27033
timj Oct 29, 2020
e6d7251
Switch to using save/loadDimensionGraph instead of encode/decode.
TallJimbo Oct 31, 2020
5245623
Fix bad copy/paste in docs.
TallJimbo Oct 31, 2020
6bdb01a
Add intermediate dimension storage ABCs and rework construction.
TallJimbo Oct 28, 2020
f693ff5
Merge pull request #411 from lsst/tickets/DM-27373
andy-slac Oct 29, 2020
3c6e847
Remove DimensionGraph.{encode, decode}.
TallJimbo Oct 31, 2020
0d2497e
Simplify type-annotation syntax.
TallJimbo Oct 31, 2020
dd85752
Merge pull request #406 from lsst/tickets/DM-27251
TallJimbo Oct 30, 2020
350a27e
Always expand registry.managers into per-repo configuration.
TallJimbo Nov 3, 2020
9f42918
Pass dimensions at construction, not repeatedly to later methods
TallJimbo Oct 31, 2020
e05c736
Merge pull request #414 from lsst/tickets/DM-27253
TallJimbo Nov 2, 2020
f36e222
Add version numbers to YAML export files and check them when reading.
TallJimbo Nov 3, 2020
3144165
Merge pull request #415 from lsst/tickets/DM-27390
TallJimbo Nov 3, 2020
b6cddcc
Tiny doc fix.
TallJimbo Nov 1, 2020
a93a6db
Remove seeing from visit record
timj Nov 3, 2020
d902f5e
Merge pull request #420 from lsst/tickets/DM-24660
TallJimbo Nov 3, 2020
ce83724
Remove content restrictions from collection search expressions.
TallJimbo Nov 2, 2020
0c03f7c
Merge pull request #422 from lsst/tickets/DM-27409
timj Nov 3, 2020
e08779c
Record which dataset types and governor values are in collections.
TallJimbo Nov 1, 2020
295deb3
Merge pull request #421 from lsst/tickets/DM-27397
TallJimbo Nov 3, 2020
ce37120
Support exposure.obs_id or detector.full_name in dataId
timj Nov 2, 2020
ca762e6
Merge pull request #419 from lsst/tickets/DM-24939
TallJimbo Nov 3, 2020
5626ec8
Merge branch 'tickets/DM-27151' into tickets/DM-27033
TallJimbo Nov 4, 2020
78c35bd
Bump all schema versions.
TallJimbo Nov 3, 2020
659bafd
Belated fix for DM-27397 in certify script.
TallJimbo Nov 4, 2020
116 changes: 90 additions & 26 deletions doc/lsst.daf.butler/queries.rst
@@ -34,17 +34,10 @@ Arguments that specify one or more collections are similar to those for dataset

- `str` values (the full collection name);
- `re.Pattern` values (matched to the collection name, via `~re.Pattern.fullmatch`);
- a `tuple` of (`str`, *dataset-type-restriction*) - see below;
- iterables of any of the above;
- the special value "``...``", which matches all collections;
- a mapping from `str` to *dataset-type-restriction*.

A *dataset-type-restriction* is a :ref:`DatasetType expression <daf_butler_dataset_type_expressions>` that limits a search for datasets in the associated collection to just the specified dataset types.
Unlike most other DatasetType expressions, it may not contain regular expressions (but it may be "``...``", which is the implied value when no
restriction is given, as it means "no restriction").
In contexts where restrictions are meaningless (e.g. `~Registry.queryCollections` when the ``datasetType`` argument is `None`) they are allowed but ignored.

Collection expressions are processed by the `~registry.wildcards.CollectionQuery`, and `~registry.wildcards.DatasetTypeRestriction` classes.
Collection expressions are processed by the `~registry.wildcards.CollectionQuery` class.
User code will rarely need to interact with these directly, but they can be passed to `Registry` instead of the expression objects themselves, and hence may be useful as a way to transform an expression that may include single-pass iterators into an equivalent form that can be reused.
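The matching behaviour described above can be sketched with a toy model (a simplified illustration only, not the actual ``CollectionQuery`` implementation, which also handles deferred iteration and query construction):

```python
import re


def match_collections(expression, all_collections):
    """Toy model of collection-expression matching.

    ``expression`` may be "..." (Ellipsis), a str (full collection name),
    a compiled re.Pattern (matched via fullmatch), or an iterable of those.
    """
    if expression is ...:  # "..." matches every collection
        return list(all_collections)
    if isinstance(expression, (str, re.Pattern)):
        expression = [expression]
    matched = []
    for item in expression:
        for name in all_collections:
            if name in matched:
                continue
            if isinstance(item, str):
                if item == name:  # exact full collection name
                    matched.append(name)
            elif item.fullmatch(name):  # regex matched via fullmatch
                matched.append(name)
    return matched
```

For example, ``match_collections([re.compile(r"run/\d+"), "calib"], names)`` selects all numbered run collections followed by the literal ``calib`` collection.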

Ordered collection searches
@@ -53,8 +46,6 @@ Ordered collection searches
An *ordered* collection expression is required in contexts where we want to search collections only until a dataset with a particular dataset type and data ID is found.
These include all direct `Butler` operations, the definitions of `~CollectionType.CHAINED` collections, `Registry.findDataset`, and the ``findFirst=True`` mode of `Registry.queryDatasets`.
In these contexts, regular expressions and "``...``" are not allowed for collection names, because they make it impossible to unambiguously define the order in which to search.
Dataset type restrictions are allowed in these contexts, and those
may be (and usually are) "``...``".

Ordered collection searches are processed by the `~registry.wildcards.CollectionSearch` class.

@@ -86,9 +77,9 @@ false (if it is a valid expression). Expression can contain a bunch of
standard logical operators, comparisons, literals, and identifiers which are
references to registry objects.

A few words in expression grammar are reserved: ``AND``, ``OR``, ``NOT`` and
``IN``. Reserved words are not case sensitive and can appear in either upper
or lower case, or a mixture of both.
A few words in expression grammar are reserved: ``AND``, ``OR``, ``NOT``,
``IN``, and ``OVERLAPS``. Reserved words are not case sensitive and can appear
in either upper or lower case, or a mixture of both.

Language operator precedence rules are the same as in other languages
like C++ or Python. When in doubt, use grouping operators (parentheses) for
@@ -194,17 +185,21 @@ IN operator. Its general syntax looks like:

.. code-block:: sql

<expression> IN ( <literal1>[, <literal2>, ... ])
<expression> NOT IN ( <literal1>[, <literal2>, ... ])
<expression> IN ( <item1>[, <item2>, ... ])
<expression> NOT IN ( <item1>[, <item2>, ... ])

where each item in the right hand side list is one of the supported literals
or identifiers. Unlike regular SQL IN operator the list cannot contain
expressions, only literals or identifiers. The extension to regular SQL IN is
that literals can be range literals as defined above. The query language
allows mixing of different types of literals and ranges but it may not make
sense to mix them when expressions is translated to SQL.

where each item in the right-hand side list is one of the supported literals.
Unlike the regular SQL IN operator, the list cannot contain expressions, only
literals. The extension to the regular SQL IN is that literals can be range
literals as defined above. The list can also be a mixture of integer literals
and range literals (the language allows mixing string literals and ranges,
but it may not make sense when translated to SQL).
The typical use of the ``IN`` operator is checking whether an integer is in a
set of numbers. In that case the list on the right-hand side can be a mixture
of integer literals, identifiers that represent integers, and range literals.
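The equivalence between a range literal and an explicit integer list can be sketched with a small helper (illustrative only; the ``N..M:S`` form is taken as inclusive on both ends, with an optional stride):

```python
def expand_range(literal: str) -> list:
    """Expand a range literal "start..stop[:stride]" into a list of ints."""
    body, _, stride = literal.partition(":")  # optional ":stride" suffix
    start, _, stop = body.partition("..")
    step = int(stride) if stride else 1
    # both ends are inclusive, hence the +1 on the stop value
    return list(range(int(start), int(stop) + 1, step))
```

Under these assumptions ``130..145:5`` expands to ``130, 135, 140, 145``, matching the equivalent-expression examples below.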

For an example of range usage, these two expressions are equivalent:
For an example of this type of usage, these two expressions are equivalent:

.. code-block:: sql

@@ -218,6 +213,62 @@ as are these:
visit NOT IN (100, 110, 130..145:5)
visit Not In (100, 110, 130, 135, 140, 145)

The ``IN`` operator can also be used to check whether a timestamp or a time
range is wholly contained in another time range. The time range in this case
can be specified as a tuple of two time literals or identifiers, each
representing a timestamp, or as a single identifier representing a time range.
When a single identifier appears on the right side of ``IN`` it has to be
enclosed in parentheses.

Here are a few examples of checking containment in a time range:

.. code-block:: sql

-- using literals for both timestamp and time range
T'2020-01-01' IN (T'2019-01-01', T'2020-01-01')
(T'2020-01-01', T'2020-02-01') NOT IN (T'2019-01-01', T'2020-01-01')

-- using identifiers for each timestamp in a time range
T'2020-01-01' IN (interval.begin, interval.end)
T'2020-01-01' NOT IN (interval_id)

-- identifier on left side can represent either a timestamp or time range
timestamp_id IN (interval.begin, interval.end)
range_id NOT IN (interval_id)

The same ``IN`` operator can be used to check containment of a point or
region inside another region. Presently there is no special literal type for
regions, so this can only be done with regions represented by identifiers. A
few examples of region containment:

.. code-block:: sql

POINT(ra, dec) IN (region1)
region2 NOT IN (region1)


OVERLAPS operator
^^^^^^^^^^^^^^^^^

The ``OVERLAPS`` operator checks for overlapping time ranges or regions; its
arguments have to have consistent types. As with the ``IN`` operator, time
ranges can be represented by a tuple of two timestamps (literals or
identifiers) or by a single identifier, while regions can only appear as
identifiers. ``OVERLAPS`` syntax is similar to ``IN``, but it does not require
parentheses on the right-hand side when there is a single identifier
representing a time range or a region.

A few examples of the syntax:

.. code-block:: sql

(T'2020-01-01', T'2022-01-01') OVERLAPS (T'2019-01-01', T'2021-01-01')
(interval.begin, interval.end) OVERLAPS interval_2
interval_1 OVERLAPS interval_2

NOT (region_1 OVERLAPS region_2)


Boolean operators
^^^^^^^^^^^^^^^^^

@@ -235,6 +286,19 @@ Parentheses should be used to change evaluation order (precedence) of
sub-expressions in the full expression.
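A couple of hedged illustrations (``visit``, ``detector``, and ``instrument`` are typical dimension identifiers; the values are made up):

.. code-block:: sql

    -- parentheses force OR to be evaluated before AND
    visit > 100 AND (detector = 1 OR detector = 2)

    -- NOT applies to the whole parenthesized sub-expression
    NOT (instrument = 'HSC' AND visit IN (100..145:5))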


Function call
^^^^^^^^^^^^^

Function call syntax is similar to that of other languages: a call expression
consists of an identifier followed by zero or more comma-separated arguments
enclosed in parentheses (e.g. ``func(1, 2, 3)``). An argument to a function
can be any expression.

Presently there is only one construct that uses this syntax:
``POINT(ra, dec)`` is a function which declares (or returns) sky coordinates,
similar to ADQL syntax. The name of the ``POINT`` function is not
case-sensitive.
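For instance (identifiers are illustrative; ``patch.region`` is assumed to exist in the repository's dimension universe):

.. code-block:: sql

    -- point built from literals, tested against a region identifier
    POINT(53.16, -27.78) IN (patch.region)

    -- function name is case-insensitive; arguments may be identifiers
    point(ra, dec) IN (region1)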


.. _time-literals-syntax:

Time literals
@@ -246,7 +310,7 @@ supported time formats. For internal time representation Registry uses
`astropy.time.Time`_ class and parser converts time string into an instance
of that class. For string-based time formats such as ISO the conversion
of a time string to an object is done by the ``Time`` constructor. The syntax
of the string could be anything that is suported by ``astropy``, for details
of the string could be anything that is supported by ``astropy``, for details
see `astropy.time`_ reference. For numeric time formats such as MJD the parser
converts string to a floating point number and passes that number to ``Time``
constructor.
@@ -261,9 +325,9 @@ Parser guesses time format from the content of the time string:
"fits" format.
- If string matches ``year:day:time`` format then "yday" is used.

The format can be specified explicitely by prefixing time string with a format
The format can be specified explicitly by prefixing time string with a format
name and slash, e.g. ``T'mjd/58938.515'``. Any of the formats supported by
``astropy`` can be specified explicitely.
``astropy`` can be specified explicitly.

Time scale that parser passes to ``Time`` constructor depends on time format,
by default parser uses:
@@ -272,7 +336,7 @@ by default parser uses:
- "tt" scale for "cxcsec" format,
- "tai" scale for anything else.

Default scale can be overriden by adding a suffix to time string consisting
Default scale can be overridden by adding a suffix to time string consisting
of a slash and time scale name, e.g. ``T'58938.515/tai'``. Any combination of
explicit time format and time scale can be given at the same time, e.g.
``T'58938.515'``, ``T'mjd/58938.515'``, ``T'58938.515/tai'``, and
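How such a literal might be decomposed can be sketched as follows (a toy model of the ``T'...'`` syntax described above, not the actual parser):

```python
def split_time_literal(text: str):
    """Split a T'...' literal into (format, value, scale).

    Either the format or the scale part may be absent (None).
    """
    assert text.startswith("T'") and text.endswith("'")
    body = text[2:-1]
    fmt = scale = None
    head, sep, rest = body.partition("/")
    if sep and head.isalpha():  # leading "format/" prefix, e.g. "mjd/"
        fmt, body = head, rest
    value, sep, tail = body.rpartition("/")
    if sep and tail.isalpha():  # trailing "/scale" suffix, e.g. "/tai"
        scale, body = tail, value
    return fmt, body, scale
```

So ``T'mjd/58938.515'`` yields format ``mjd``, and ``T'58938.515/tai'`` yields scale ``tai``.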
2 changes: 1 addition & 1 deletion python/lsst/daf/butler/__init__.py
@@ -7,7 +7,7 @@

from .core import *
# Import the registry subpackage directly for other symbols.
from .registry import Registry, CollectionType, CollectionSearch, DatasetTypeRestriction
from .registry import Registry, RegistryConfig, CollectionType, CollectionSearch, DatasetTypeRestriction
from ._butlerConfig import *
from ._deferredDatasetHandle import *
from ._butler import *
88 changes: 81 additions & 7 deletions python/lsst/daf/butler/_butler.py
@@ -62,6 +62,7 @@
DatasetRef,
DatasetType,
Datastore,
DimensionConfig,
FileDataset,
StorageClassFactory,
Timespan,
@@ -252,10 +253,10 @@ def __init__(self, config: Union[Config, str, None] = None, *,
"""

@staticmethod
def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: bool = False,
createRegistry: bool = True, searchPaths: Optional[List[str]] = None,
forceConfigRoot: bool = True, outfile: Optional[str] = None,
overwrite: bool = False) -> Config:
def makeRepo(root: str, config: Union[Config, str, None] = None,
dimensionConfig: Union[Config, str, None] = None, standalone: bool = False,
searchPaths: Optional[List[str]] = None, forceConfigRoot: bool = True,
outfile: Optional[str] = None, overwrite: bool = False) -> Config:
"""Create an empty data repository by adding a butler.yaml config
to a repository root directory.

@@ -271,6 +272,9 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo
configuration will be used. Root-dependent config options
specified in this config are overwritten if ``forceConfigRoot``
is `True`.
dimensionConfig : `Config` or `str`, optional
Configuration for dimensions; will be used to initialize the
registry database.
standalone : `bool`
If True, write all expanded defaults, not just customized or
repository-specific settings.
@@ -279,8 +283,6 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo
may be good or bad, depending on the nature of the changes).
Future *additions* to the defaults will still be picked up when
initializing `Butlers` to repos created with ``standalone=True``.
createRegistry : `bool`, optional
If `True` create a new Registry.
searchPaths : `list` of `str`, optional
Directory paths to search when calculating the full butler
configuration.
@@ -360,6 +362,13 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo

if standalone:
config.merge(full)
else:
# Always expand the registry.managers section into the per-repo
# config, because after the database schema is created, it's not
# allowed to change anymore. Note that in the standalone=True
# branch, _everything_ in the config is expanded, so there's no
# need to special case this.
Config.updateParameters(RegistryConfig, config, full, toCopy=("managers",), overwrite=False)
if outfile is not None:
# When writing to a separate location we must include
# the root of the butler repo in the config else it won't know
@@ -371,7 +380,10 @@ def makeRepo(root: str, config: Union[Config, str, None] = None, standalone: boo
config.dumpToUri(configURI, overwrite=overwrite)

# Create Registry and populate tables
Registry.fromConfig(config, create=createRegistry, butlerRoot=root)
registryConfig = RegistryConfig(config.get("registry"))
dimensionConfig = DimensionConfig(dimensionConfig)
Registry.createFromConfig(registryConfig, dimensionConfig=dimensionConfig, butlerRoot=root)

return config

@classmethod
@@ -542,6 +554,68 @@ def _findDatasetRef(self, datasetRefOrType: Union[DatasetRef, DatasetType, str],
else:
idNumber = None
timespan: Optional[Timespan] = None

# Process dimension records that are using record information
# rather than ids
newDataId: dict[Any, Any] = {}
byRecord: dict[Any, dict[str, Any]] = defaultdict(dict)

# If the dataId comes entirely from keyword parameters we do not
# need to do anything here, since a key of the form exposure.obs_id
# is impossible: "." is not allowed in a keyword parameter name.
if dataId:
for k, v in dataId.items():
# If we have a Dimension we do not need to do anything
# because it cannot be a compound key.
if isinstance(k, str) and "." in k:
# Someone is using a more human-readable dataId
dimension, record = k.split(".", 1)
byRecord[dimension][record] = v
else:
newDataId[k] = v

if byRecord:
# Some record specifiers were found so we need to convert
# them to the Id form
for dimensionName, values in byRecord.items():
if dimensionName in newDataId:
log.warning("DataId specified explicit %s dimension value of %s in addition to"
" general record specifiers for it of %s. Ignoring record information.",
dimensionName, newDataId[dimensionName], str(values))
continue

# Build up a WHERE expression -- use single quotes
def quote(s):
if isinstance(s, str):
return f"'{s}'"
else:
return s

where = " AND ".join(f"{dimensionName}.{k} = {quote(v)}"
for k, v in values.items())

# Hopefully we get a single record that matches
records = set(self.registry.queryDimensionRecords(dimensionName, dataId=newDataId,
where=where, **kwds))

if len(records) != 1:
if len(records) > 1:
log.debug("Received %d records from constraints of %s", len(records), str(values))
for r in records:
log.debug("- %s", str(r))
raise RuntimeError(f"DataId specification for dimension {dimensionName} is not"
f" uniquely constrained to a single dataset by {values}."
f" Got {len(records)} results.")
raise RuntimeError(f"DataId specification for dimension {dimensionName} matched no"
f" records when constrained by {values}")

# Get the primary key from the real dimension object
dimension = self.registry.dimensions[dimensionName]
newDataId[dimensionName] = getattr(records.pop(), dimension.primaryKey.name)

# We have modified the dataId so need to switch to it
dataId = newDataId

if datasetType.isCalibration():
# Because this is a calibration dataset, first try to
# standardize the data ID without restricting the dimensions to
4 changes: 1 addition & 3 deletions python/lsst/daf/butler/_butlerConfig.py
@@ -36,15 +36,13 @@
ButlerURI,
Config,
DatastoreConfig,
DimensionConfig,
StorageClassConfig,
)
from .registry import RegistryConfig
from .transfers import RepoTransferFormatConfig

CONFIG_COMPONENT_CLASSES = (RegistryConfig, StorageClassConfig,
DatastoreConfig, DimensionConfig,
RepoTransferFormatConfig)
DatastoreConfig, RepoTransferFormatConfig)


class ButlerConfig(Config):
1 change: 1 addition & 0 deletions python/lsst/daf/butler/cli/cmd/commands.py
@@ -75,6 +75,7 @@ def butler_import(*args, **kwargs):
@click.command()
@repo_argument(required=True, help=willCreateRepoHelp)
@click.option("--seed-config", help="Path to an existing YAML config file to apply (on top of defaults).")
@click.option("--dimension-config", help="Path to an existing YAML config file with dimension configuration.")
@click.option("--standalone", is_flag=True, help="Include all defaults in the config file in the repo, "
"insulating the repo from changes in package defaults.")
@click.option("--override", is_flag=True, help="Allow values in the supplied config to override all "