DM-31725: rewrite queries subpackage, via new dependency on daf_relation #759

Merged: 25 commits, Jan 6, 2023

Commits
e4bccc4
Add column tag classes and implement daf_relation interfaces.
TallJimbo Jul 22, 2022
c6635b5
Add DimensionElementFields.columns property.
TallJimbo Aug 28, 2022
3b4b04e
Add ColumnTypeInfo.make_relation_table_spec.
TallJimbo Aug 18, 2022
6bb0266
Add QueryContext and QueryBackend objects.
TallJimbo Aug 3, 2022
3ba1ed3
Add row readers for data IDs, DatasetRefs, and dimension records.
TallJimbo Jul 23, 2022
6458a73
Add factories for common predicates.
TallJimbo Jul 23, 2022
8a6c2df
Add require_ordered kwarg to CollectionWildcard.from_expression.
TallJimbo Jan 5, 2023
a5f4836
Integrate relations into dataset subqueries and storage managers.
TallJimbo Aug 18, 2022
0af81fb
Rewrite query WHERE-clause handling via relation predicates.
TallJimbo Aug 18, 2022
797b8c4
Temporarily drop exception message tests for record order-by.
TallJimbo Aug 27, 2022
6dbb19e
Integrate relations with QueryBuilder and dimension managers.
TallJimbo Apr 25, 2022
b2f7b94
Pass view target storage instance to QueryDimensionRecordStorage.
TallJimbo Aug 25, 2022
26acfaa
Use relations to rewrite Query and QueryResults.
TallJimbo Aug 19, 2022
9f5f590
Revert "Temporarily drop exception message tests for record order-by."
TallJimbo Sep 2, 2022
b735ede
Drop DimensionRecordStorage.fetch.
TallJimbo Sep 22, 2022
f85a194
Add make_data_id_relation to QueryContext.
TallJimbo Sep 23, 2022
5c7cb3e
Use relation in certify/decertify implementations.
TallJimbo Sep 23, 2022
52bbc90
Drop DataCoordinateIterable.constrain.
TallJimbo Sep 22, 2022
5778424
Drop SimpleQuery.
TallJimbo Sep 23, 2022
81ffe43
Drop QueryColumns and DatasetQueryColumns.
TallJimbo Oct 10, 2022
c828c42
Prohibit caches of views of dimension records.
TallJimbo Dec 2, 2022
db801e8
Note that findDataset now respects a given storage class.
TallJimbo Dec 3, 2022
9f5372c
Add changelog entry.
TallJimbo Dec 6, 2022
432b91c
Add daf_relation dependency to pyproject.toml.
TallJimbo Dec 6, 2022
100a12c
Fix type annotation for digestTables.
TallJimbo Jan 5, 2023
9 changes: 9 additions & 0 deletions doc/changes/DM-31725.misc.md
@@ -0,0 +1,9 @@
Rewrite registry query system, using the new ``daf_relation`` package.

This change should be mostly invisible to users, but there are some subtle behavior changes:

- `Registry.findDatasets` now respects the given storage class when passed a full `DatasetType` instance, instead of replacing it with the storage class registered with that dataset type. This also causes storage class overrides in `PipelineTask` input connections to be respected in more contexts; in at least some cases these were previously being incorrectly ignored.
- `Registry.findDatasets` now utilizes cached summaries of which dataset types and governor dimension values are present in each collection. This should result in fewer and simpler database calls, but it does make the results vulnerable to stale caches (which, like `Registry` methods more generally, must be addressed manually via calls to `Registry.refresh`).
- The diagnostics provided by the `explain_no_results` methods on query result objects (used prominently in the reporting on empty quantum graph builds) have been significantly improved, though they now use ``daf_relation`` terminology that may be unfamiliar to users.
- `Registry` is now more consistent about raising `DataIdValueError` when given invalid governor dimension values, while not raising (but providing `explain_no_results` diagnostics) for all other invalid dimension values, as per RFC-878.
- `Registry` methods that take a `where` argument are now typed to expect a `str` that is not `None`, with the default no-op value now an empty string (previously either an empty `str` or `None` could be passed, and they meant the same thing). This should only affect downstream type checking, as the runtime code still just checks whether the argument evaluates as `False` in a boolean context.
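The last bullet can be illustrated with a minimal sketch, assuming a hypothetical helper `needs_where_clause` (the real `Registry` methods are not reproduced here): the default is now an empty string rather than `None`, and the runtime check is plain truthiness.

```python
def needs_where_clause(where: str = "") -> bool:
    # Hypothetical stand-in for the check the changelog describes: any
    # falsy value (now just the empty string) means no user expression
    # was supplied.
    return bool(where)
```

Callers that previously passed `where=None` should now pass nothing or an empty string; after this change, type checkers will flag `None`.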
4 changes: 4 additions & 0 deletions mypy.ini
@@ -58,6 +58,10 @@ ignore_errors = True
ignore_missing_imports = False
ignore_errors = True

[mypy-lsst.daf_relation.*]
ignore_missing_imports = False
ignore_errors = True

# Check all of daf.butler...

[mypy-lsst.daf.butler.*]
1 change: 1 addition & 0 deletions pyproject.toml
@@ -27,6 +27,7 @@ dependencies = [
"lsst-sphgeom",
"lsst-utils",
"lsst-resources",
"lsst-daf-relation",
"deprecated >= 1.2",
"pydantic",
]
4 changes: 3 additions & 1 deletion python/lsst/daf/butler/cli/opt/options.py
@@ -256,7 +256,9 @@ def _config_split(*args: Any) -> dict[str, str]:
verbose_option = MWOptionDecorator("-v", "--verbose", help="Increase verbosity.", is_flag=True)


where_option = MWOptionDecorator("--where", help="A string expression similar to a SQL WHERE clause.")
where_option = MWOptionDecorator(
    "--where", default="", help="A string expression similar to a SQL WHERE clause."
)


order_by_option = MWOptionDecorator(
3 changes: 2 additions & 1 deletion python/lsst/daf/butler/core/__init__.py
@@ -5,6 +5,8 @@
from . import progress # most symbols are only used by handler implementors
from . import ddl, time_utils
from ._butlerUri import *
from ._column_categorization import *
from ._column_tags import *
from ._column_type_info import *
from ._topology import *
from .composites import *
@@ -30,7 +32,6 @@
from .named import *
from .progress import Progress
from .quantum import *
from .simpleQuery import *
from .storageClass import *
from .storageClassDelegate import *
from .storedFileInfo import *
77 changes: 77 additions & 0 deletions python/lsst/daf/butler/core/_column_categorization.py
@@ -0,0 +1,77 @@
# This file is part of daf_butler.
#
# Developed for the LSST Data Management System.
# This product includes software developed by the LSST Project
# (http://www.lsst.org).
# See the COPYRIGHT file at the top-level directory of this distribution
# for details of code ownership.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from __future__ import annotations

__all__ = ("ColumnCategorization",)

import dataclasses
from collections import defaultdict
from collections.abc import Iterable, Iterator
from typing import Any

from ._column_tags import DatasetColumnTag, DimensionKeyColumnTag, DimensionRecordColumnTag
from .dimensions import DimensionUniverse, GovernorDimension, SkyPixDimension


@dataclasses.dataclass
class ColumnCategorization:
    dimension_keys: set[str] = dataclasses.field(default_factory=set)
    dimension_records: defaultdict[str, set[str]] = dataclasses.field(
        default_factory=lambda: defaultdict(set)
    )
    datasets: defaultdict[str, set[str]] = dataclasses.field(default_factory=lambda: defaultdict(set))

    @classmethod
    def from_iterable(cls, iterable: Iterable[Any]) -> ColumnCategorization:
        result = cls()
        for tag in iterable:
            match tag:
                case DimensionKeyColumnTag(dimension=dimension):
                    result.dimension_keys.add(dimension)
                case DimensionRecordColumnTag(element=element, column=column):
                    result.dimension_records[element].add(column)
                case DatasetColumnTag(dataset_type=dataset_type, column=column):
                    result.datasets[dataset_type].add(column)
        return result

    def filter_skypix(self, universe: DimensionUniverse) -> Iterator[SkyPixDimension]:
        return (
            dimension
            for name in self.dimension_keys
            if isinstance(dimension := universe[name], SkyPixDimension)
        )

    def filter_governors(self, universe: DimensionUniverse) -> Iterator[GovernorDimension]:
        return (
            dimension
            for name in self.dimension_keys
            if isinstance(dimension := universe[name], GovernorDimension)
        )

    def filter_timespan_dataset_types(self) -> Iterator[str]:
        return (dataset_type for dataset_type, columns in self.datasets.items() if "timespan" in columns)

    def filter_timespan_dimension_elements(self) -> Iterator[str]:
        return (element for element, columns in self.dimension_records.items() if "timespan" in columns)

    def filter_spatial_region_dimension_elements(self) -> Iterator[str]:
        return (element for element, columns in self.dimension_records.items() if "region" in columns)
205 changes: 205 additions & 0 deletions python/lsst/daf/butler/core/_column_tags.py
@@ -0,0 +1,205 @@
# This file is part of daf_butler.
#
# Developed for the LSST Data Management System.
# This product includes software developed by the LSST Project
# (http://www.lsst.org).
# See the COPYRIGHT file at the top-level directory of this distribution
# for details of code ownership.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.

from __future__ import annotations

__all__ = (
"DatasetColumnTag",
"DimensionKeyColumnTag",
"DimensionRecordColumnTag",
"is_timespan_column",
)

import dataclasses
from collections.abc import Iterable
from typing import TYPE_CHECKING, Any, TypeVar, final

_S = TypeVar("_S")

if TYPE_CHECKING:
from lsst.daf.relation import ColumnTag


class _BaseColumnTag:

    __slots__ = ()

    @classmethod
    def filter_from(cls: type[_S], tags: Iterable[Any]) -> set[_S]:
        return {tag for tag in tags if type(tag) is cls}


@final
@dataclasses.dataclass(frozen=True, slots=True)
class DimensionKeyColumnTag(_BaseColumnTag):
    """An identifier for `~lsst.daf.relation.Relation` columns that represent
    a dimension primary key value.
    """

    dimension: str
    """Name of the dimension (`str`)."""

    def __str__(self) -> str:
        return self.dimension

    @property
    def qualified_name(self) -> str:
        return self.dimension

    @property
    def is_key(self) -> bool:
        return True

    @classmethod
    def generate(cls, dimensions: Iterable[str]) -> list[DimensionKeyColumnTag]:
        """Return a list of column tags from an iterable of dimension
        names.

        Parameters
        ----------
        dimensions : `Iterable` [ `str` ]
            Dimension names.

        Returns
        -------
        tags : `list` [ `DimensionKeyColumnTag` ]
            List of column tags.
        """
        return [cls(d) for d in dimensions]


@final
@dataclasses.dataclass(frozen=True, slots=True)
class DimensionRecordColumnTag(_BaseColumnTag):
    """An identifier for `~lsst.daf.relation.Relation` columns that represent
    non-key columns in a dimension or dimension element record.
    """

    element: str
    """Name of the dimension element (`str`)."""

    column: str
    """Name of the column (`str`)."""

    def __str__(self) -> str:
        return f"{self.element}.{self.column}"

    @property
    def qualified_name(self) -> str:
        return f"n!{self.element}:{self.column}"

    @property
    def is_key(self) -> bool:
        return False

    @classmethod
    def generate(cls, element: str, columns: Iterable[str]) -> list[DimensionRecordColumnTag]:
        """Return a list of column tags from an iterable of column names
        for a single dimension element.

        Parameters
        ----------
        element : `str`
            Name of the dimension element.
        columns : `Iterable` [ `str` ]
            Column names.

        Returns
        -------
        tags : `list` [ `DimensionRecordColumnTag` ]
            List of column tags.
        """
        return [cls(element, column) for column in columns]


@final
@dataclasses.dataclass(frozen=True, slots=True)
class DatasetColumnTag(_BaseColumnTag):
    """An identifier for `~lsst.daf.relation.Relation` columns that represent
    columns from a dataset query or subquery.
    """

    dataset_type: str
    """Name of the dataset type (`str`)."""

    column: str
    """Name of the column (`str`).

    Allowed values are:

    - "dataset_id" (autoincrement or UUID primary key)
    - "run" (collection primary key, not collection name)
    - "ingest_date"
    - "timespan" (validity range, or NULL for non-calibration collections)
    - "rank" (collection position in ordered search)
    """

    def __str__(self) -> str:
        return f"{self.dataset_type}.{self.column}"

    @property
    def qualified_name(self) -> str:
        return f"t!{self.dataset_type}:{self.column}"

    @property
    def is_key(self) -> bool:
        return self.column == "dataset_id" or self.column == "run"

    @classmethod
    def generate(cls, dataset_type: str, columns: Iterable[str]) -> list[DatasetColumnTag]:
        """Return a list of column tags from an iterable of column names
        for a single dataset type.

        Parameters
        ----------
        dataset_type : `str`
            Name of the dataset type.
        columns : `Iterable` [ `str` ]
            Column names.

        Returns
        -------
        tags : `list` [ `DatasetColumnTag` ]
            List of column tags.
        """
        return [cls(dataset_type, column) for column in columns]


def is_timespan_column(tag: ColumnTag) -> bool:
    """Test whether a column tag is a timespan.

    Parameters
    ----------
    tag : `ColumnTag`
        Column tag to test.

    Returns
    -------
    is_timespan : `bool`
        Whether the given column is a timespan.
    """
    match tag:
        case DimensionRecordColumnTag(column="timespan"):
            return True
        case DatasetColumnTag(column="timespan"):
            return True
    return False