Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-36111: miscellaneous improvements to registry support classes. #731

Merged
merged 23 commits into from Sep 21, 2022

Conversation

TallJimbo
Copy link
Member

@TallJimbo TallJimbo commented Sep 6, 2022

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@TallJimbo TallJimbo force-pushed the tickets/DM-36111 branch 3 times, most recently from bb7ba89 to 07a6120 Compare September 6, 2022 17:03
@codecov
Copy link

codecov bot commented Sep 6, 2022

Codecov Report

Base: 84.68% // Head: 84.73% // Increases project coverage by +0.05% 🎉

Coverage data is based on head (fd68878) compared to base (4ca570b).
Patch coverage: 86.77% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #731      +/-   ##
==========================================
+ Coverage   84.68%   84.73%   +0.05%     
==========================================
  Files         244      244              
  Lines       32156    32095      -61     
  Branches     6040     6040              
==========================================
- Hits        27230    27196      -34     
+ Misses       3752     3724      -28     
- Partials     1174     1175       +1     
Impacted Files Coverage Δ
python/lsst/daf/butler/registries/remote.py 0.00% <0.00%> (ø)
python/lsst/daf/butler/registries/sql.py 81.08% <ø> (-0.04%) ⬇️
...hon/lsst/daf/butler/registry/dimensions/caching.py 94.73% <ø> (ø)
...on/lsst/daf/butler/registry/dimensions/governor.py 93.82% <ø> (ø)
...ython/lsst/daf/butler/registry/dimensions/query.py 74.24% <ø> (ø)
...thon/lsst/daf/butler/registry/dimensions/skypix.py 76.66% <ø> (ø)
...n/lsst/daf/butler/registry/interfaces/_datasets.py 75.89% <0.00%> (ø)
...sst/daf/butler/registry/interfaces/_collections.py 77.61% <53.33%> (-3.07%) ⬇️
...n/lsst/daf/butler/registry/databases/postgresql.py 79.67% <78.57%> (+0.49%) ⬆️
python/lsst/daf/butler/core/timespan.py 81.62% <79.24%> (-0.47%) ⬇️
... and 29 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@TallJimbo TallJimbo force-pushed the tickets/DM-36111 branch 4 times, most recently from c09da8f to aadd939 Compare September 9, 2022 17:48
@TallJimbo TallJimbo marked this pull request as ready for review September 9, 2022 21:04
Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, few minor comments.

imposed by the string expression or data ID
(`GovernorDimensionRestriction`).
imposed by the string expression and/or data ID
(`Mapping` [ ``set`, `~collections.abc.Set` ] ]).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double backtick, and set should be str?

Comment on lines +299 to +303
summary.dataset_types.add(datasetType)
for dimension in self._tables.dimensions:
value = row[dimension.name]
if value is not None:
summary.dimensions.add(dimension, value)
summary.governors.setdefault(dimension.name, set()).add(value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be this using summary.add_data_ids() instead of modifying data members directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I left the data member as public (with a public annotation as a mutable type) because I wasn't bothered by external code doing its own updates to it, and trying to make this block produce data IDs just so they can be unpacked into a different structure didn't seem worthwhile; I prefer the transparency of seeing exactly what's going on over encapsulation in this case.

governor = branch.dataIdKey
value = summary.governors.setdefault(governor, branch.dataIdValue)
if value != branch.dataIdValue:
# Test whether this branch has a form like '<dimension>=<value' (or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Angle bracket missing after <value.


Notes
-----
This is used by `Session.upload` to decide when `constant_rows` can be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see Session.upload, is it in one of your other branches?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's something I ended up removing, because I found that calling code always wanted to manually inspect this value to decide how whether to make a temporary table or use constant_rows itself (since one involves a resource whose lifetime should be managed, and the other does not). I've dropped this first sentence from the docs and reworded the next accordingly.


Returns
-------
timespan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add : Timespan, optional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually want it to be "Timespan or None" in this case, but yes.

@@ -396,6 +397,14 @@ def select(
# create a dummy subquery that we know will fail.
tags_query = SimpleQuery()
tags_query.join(self._tags, **kwargs)
# If the timespan is requested or contrained, simulate a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: contrained

@@ -265,6 +265,31 @@ def dropTemporaryTable(self, table: sqlalchemy.schema.Table) -> None:
else:
raise TypeError(f"Table {table.key} was not created by makeTemporaryTable.")

@contextmanager
def temporary_table(self, spec: ddl.TableSpec, name: Optional[str]) -> Iterator[sqlalchemy.schema.Table]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this used anywhere, I guess it's also for one of the other branches?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. An upcoming branch will use it. I suppose I could also use it to simplify the existing [Something]QueryResult.materialize context managers, but I'll be replacing them completely in one of those upcoming branches, too.

try:
yield table
finally:
self.dropTemporaryTable(table)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Coverity tries to say that there is no unit test for the whole temporary_table but somehow it only points to a single line.

@TallJimbo TallJimbo force-pushed the tickets/DM-36111 branch 2 times, most recently from dac47f1 to 03d4779 Compare September 19, 2022 16:46
It took me a while to figure out why the set-like
GovernorDimensionRestriction class kept needing new set-op method
variants for ~every context in which it was being used, but the answer
is that it was really a mapping of sets (one for each governor
dimension), and hence when instances have different dimensions
(mapping keys) dimensions, their relationship is not set-like.

The best solution seems to be to remove that class in favor of the
kind of mapping-of-sets that backed it, and move its convenience
functionality that's important for CollectionSummary to
CollectionSummary itself.

I changed some attribute names and the module name for collecton
summary to get them consistent with now-preferred style; it made sense
to do this at the same time because I wanted to find all code using
CollectionSummary in order to make sure it was updated appropriately,
and renames made MyPy very good at that.  I also switched from
Dimension instance keys to `str` dimension name keys, since I think
we'll be doing more of that with RFC-834 adopted.

Finally, I've changed the logic in the user-expression checking code
to track all dimensions, and then filter down to just the governor
dimensions at the boundary with the rest of the query system.  This
just seemed the best for overall code simplicity.
This is useful for adding types to literals, which don't need to be
sized the same way that columns in tables do.
This was previously blocking column names with non-standard SQL
characters.
This is enabled by new sphgeom functionality added on DM-34721.
This removes the analogous expression class for spatial regions, as
well as the base class they shared.  We need an abstraction layer
above SQLAlchemy's for timespans, since those can have a
representation that involves multiple columns in some databases, and
SQLAlchemy's ColumnElement can't handle that.  But our regions are
always just an opaque blob column, so that abstraction layer wasn't
carrying its weight, despite the nice symmetry between regions and
timespans in other respects.  A new classmethod constructor has been
added to FieldSpec to take care of the only important thing this class
did - holding the defaults for how region columns should be defined.

That results in a number of abstract methods now being first defined
in TimespanDatabaseRepresentation.  For the most part, these are just
transfers from the old base class with slightly adjusted docs, but
there are a few small updates:

- fromLiteral now accepts None as a way to support NULL expressions;

- fromSelectable has been replaced from_columns, since SQLAlchemy is
  moving towards being much more strict about the difference between
  the columns of a SELECT statement and the columns of a FromClause,
  such as a table or an aliased subquery, and in the future we won't
  be able to access a ".columns" across the board (e.g.  on Select
  objects, it'll be "exported_columns" instead).

- we now use tuples instead of generators for a few methods, since
  generators for 1- or 2-element results involves a lot more overhead
  with no real gain.
(and TimespanDatabaseRepresentation.overlaps).

This makes the Python interface more consistent with the string
expression language, which uses OVERLAPS for timespan-time arguments,
and it will sidestep a problem with function name resolution in the new
daf_relation expression system, since normally that looks in the
operator module first and a method second, and "contains" (but not
"overlaps") is a member of the operator module.
These allowed the query's internal region to be overridden, but they
were a solution in search of a problem.
@TallJimbo TallJimbo force-pushed the tickets/DM-36111 branch 2 times, most recently from 49fc32f to 3212de8 Compare September 20, 2022 15:58
The timespan for a dataset subquery for a non-CALIBRATION collection
is hereby defined to be a fully unbounded timespan, which will let us
UNION those subqueries against CALIBRATION-collection queries for the
same dataset before doing a temporal join against a dimension like
exposure or visit.

It would make sense to change the behavior of queryDatasetAssociations
to associate non-CALIBRATION collections with a full timespan instead
of None at the same time, but that's an API change that I don't want
to tackle right now.  Eventually I'd like to replace
queryDatasetAssocations entirely with a method-chaining interface to
get associations from queryDatasets, and I think that RFC wil be the
time to tackle this.
This case-based approach is much closer to Andy Salnikov's initial
implementation of find-first dataset queries; I think it's likely that
changing it was at most unnecessary performance-wise (I was looking
for ways to simplify a query the planner was having trouble with, but
this construct - despite being hard for humans to parse when reading
the SQL - probably wasn't playing a role in that).  And some other
query system changes have made it preferable from a code-organization
standpoint, providing it's no worse from a performance standpoint.

To that end, in some quick benchmarks (on PostgreSQL, querying 16
collections with the same dataset types and a spatial constraint to
make the overall query nontrivial), this was consistently faster than
the old, UNION-based rank expression, but only barely; I'm by no means
certain that difference is significant, but I'm at least pretty
confident that this is no worse for performance.
This is a very thin wrapper around existing methods to create and drop
temporary tables, but it's a convenient one, especially when using
contextlib.ExitStack.
We already seem to require SQLAlchemy 1.4, which includes the fix we
submitted upstream.
This branch also includes a number of behind-the-scenes changes that
I don't think merit a changelog entry.
@TallJimbo
Copy link
Member Author

@andy-slac , I'm sending this back to you for another look because Jenkins revealed a problem that led to a non-trivial change, and then I decided to throw in a couple more miscellaneous fixes. In particular:

  • 465ebeb has been modified to define the timespan for a non-CALIBRATION collection to a fully-unbounded one, instead of NULL. That's logically the behavior we want in the query system anyway, and it avoids the need for special IS NULL handling in queries to address that downstream problem Jenkins discovered.

  • That change led to d6df482, fixing a preexisting bug in how SQLAlchemy converts literals in queries (see also Literals in SELECT columns incorrectly converted to strings sqlalchemy/sqlalchemy#8540).

  • 5aa83fa is a fix for a bug reported on Slack a couple of weeks ago that I was going to put on this branch but forgot about as I disappeared for vacation.

  • 9cb3aec, 38c22a6, and 3909cdf are tiny fixed for things I noticed while I was working through review comments.

Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, no new comments.

@TallJimbo TallJimbo merged commit 1a6169c into main Sep 21, 2022
@TallJimbo TallJimbo deleted the tickets/DM-36111 branch September 21, 2022 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants