DM-25919: custom classes and new functionality for query results #330

TallJimbo · 2020-07-19T01:45:49Z

No description provided.

natelust

I found a few points worth thinking about, but nothing too major. Overall it looks good, but I offer no promises that I didn't miss some small logic bug related to calls into functionality that already exists :-)

natelust · 2020-07-24T14:34:27Z

python/lsst/daf/butler/registry/queries/_query.py

@@ -142,44 +282,334 @@ def extractDataId(self, row: RowProxy, *, graph: Optional[DimensionGraph] = None
        graph : `DimensionGraph`, optional
            The dimensions the returned data ID should identify.  If not
            provided, this will be all dimensions in `QuerySummary.requested`.
+        records : `Mapping` [ `str`, `Mapping` [ `tuple`, `DimensionRecord` ] ]
+            Records to use to return an `ExpandedDataCoordinate`.  If provided,


This wording is awkward

natelust · 2020-07-24T15:22:38Z

python/lsst/daf/butler/registry/_registry.py

+            element = self.dimensions[element]
+        dataIds = self.queryDataIds(element.graph, dataId=dataId, datasets=datasets, collections=collections,
+                                    where=where, components=components, **kwargs)
+        """Query for dimension information matching user-provided criteria.


Why is the docstring after some code?
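
For context on why the placement matters, here is a minimal standalone sketch (the function names are made up for illustration and are not from daf_butler). Python only treats a string literal as a docstring when it is the first statement in the function body; anywhere else it is an expression with no effect.

```python
def documented(x):
    """Return x unchanged."""
    return x


def misplaced(x):
    y = x  # any statement before the string literal...
    """Return x unchanged."""  # ...makes it a plain expression, not a docstring
    return y


print(documented.__doc__)  # Return x unchanged.
print(misplaced.__doc__)   # None
```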

natelust · 2020-07-28T16:19:34Z

python/lsst/daf/butler/registry/interfaces/_database.py

+                                 f"Database) already exists.")
+        for foreignKeySpec in spec.foreignKeys:
+            table.append_constraint(self._convertForeignKeySpec(name, foreignKeySpec, self._metadata))
+        table.create(self._connection)


Random question: can this allow a temp table to be created that shadows a non-temp table? I am not sure what each database will do. If it does, then depending on how databases handle this, dropTemporaryTable may end up dropping the non-temp table. I find this behavior unlikely, but it's worth verifying.

I'm using uuid to create a random name, so clashes should be practically (if not quite theoretically) impossible.
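
A rough sketch of the kind of name generation described here, assuming a hypothetical helper `make_temporary_table_name` (the actual code in `_database.py` may look different):

```python
import uuid


def make_temporary_table_name(prefix: str = "tmp") -> str:
    """Return a table name that is unique for all practical purposes.

    Hypothetical helper for illustration only.
    """
    # uuid4() carries 122 random bits; .hex is 32 hex digits with no dashes,
    # so the result is safe to use as part of an SQL identifier.
    return f"{prefix}_{uuid.uuid4().hex}"


print(make_temporary_table_name())
```

With a random name like this, shadowing an existing table would require a collision with a name that someone else chose, which is why it is practically impossible rather than strictly impossible.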

natelust · 2020-08-05T18:43:20Z

python/lsst/daf/butler/registry/queries/_query.py

@@ -389,6 +426,38 @@ def _makeSubsetQueryColumns(self, *, graph: Optional[DimensionGraph] = None,
            columns.datasets = self.getDatasetColumns()
        return graph, columns

+    @contextmanager
+    def materialize(self, db: Database) -> Iterator[MaterializedQuery]:


I think the return type is a context manager

I wish it was; that'd be clearer to read. But MyPy wants me to annotate the undecorated method, which it then transforms by knowing what the decorator does. I suppose that's the only behavior that would actually make sense for decorators in general, but it is particularly unfortunate for this one, where the return type is so different.
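
To illustrate the annotation convention under discussion, here is a small self-contained sketch (`open_resource` is a made-up name, not part of this PR). The annotation on a `@contextmanager` function describes the undecorated generator, so it says `Iterator[...]` even though callers use the result in a `with` statement.

```python
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def open_resource(log: List[str]) -> Iterator[str]:
    # The annotation describes the generator body: it yields exactly one str.
    log.append("enter")
    try:
        yield "resource"
    finally:
        log.append("exit")


log: List[str] = []
with open_resource(log) as value:  # callers see a context manager yielding a str
    log.append(f"using {value}")

print(log)  # ['enter', 'using resource', 'exit']
```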

natelust · 2020-08-05T19:11:58Z

python/lsst/daf/butler/registry/queries/_query.py

+        self._datasetType = datasetType
+        self._isUnique = isUnique
+
+    def isUnique(self) -> bool:


Why is this a function? If it's to future-proof things, it should be fine to just expose isUnique directly as an attribute, and it can be changed later if need be. If it is supposed to keep the value from being set from the outside, I think you probably meant to make this a property?

It's a naming convention thing. Because we almost always name methods starting with verbs, I don't like to start non-methods with verbs unless it's completely unambiguous that they aren't methods. I don't know about everyone else, but this helps me remember whether or not I should add a () to any particular interface.

In that case, does it need to be called isUnique, or would unique as an attribute (or property) do the same thing? Does "is" add anything?

I think it's a bit clearer. Just "unique" in only a slightly different context could be an attribute that is a container of things that are unique, rather than a bool, while "isUnique" is very clearly a bool, and I think a set of parens is a reasonable price to pay for that.

sounds reasonable
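
For comparison, the two spellings discussed in this thread could look like the sketch below. Both class names are hypothetical and neither is the actual class from `_query.py`; the point is only that both keep the flag read-only from the outside, and they differ in whether the caller writes parentheses.

```python
class MethodStyle:
    """Verb-prefixed method, as in this PR: callers write obj.isUnique()."""

    def __init__(self, isUnique: bool):
        self._isUnique = isUnique

    def isUnique(self) -> bool:
        return self._isUnique


class PropertyStyle:
    """Read-only property alternative: callers write obj.unique."""

    def __init__(self, unique: bool):
        self._unique = unique

    @property
    def unique(self) -> bool:
        return self._unique


print(MethodStyle(True).isUnique())  # True
print(PropertyStyle(False).unique)   # False
```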

natelust · 2020-08-05T19:29:02Z

python/lsst/daf/butler/registry/queries/_results.py

@@ -235,3 +249,193 @@ def constrain(self, query: SimpleQuery, columns: Callable[[str], sqlalchemy.sql.
                for dimension in self.graph.required
            ])
        )
+
+
+class DatasetQueryResults(Iterable[DatasetRef]):


I don't think I had seen typing used for inheritance in this way.
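
A minimal sketch of the pattern, using a made-up `Countdown` class rather than anything from this PR: inheriting from `Iterable[int]` tells type checkers what iteration yields, and at runtime the class still derives from the `collections.abc.Iterable` abstract base class, so `__iter__` must be implemented.

```python
from typing import Iterable, Iterator


class Countdown(Iterable[int]):
    """Iterable over start, start - 1, ..., 1."""

    def __init__(self, start: int):
        self._start = start

    def __iter__(self) -> Iterator[int]:
        # A fresh generator per call, so the object can be iterated repeatedly.
        yield from range(self._start, 0, -1)


countdown = Countdown(3)
print(list(countdown))                  # [3, 2, 1]
print(isinstance(countdown, Iterable))  # True
```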

natelust · 2020-08-05T19:36:20Z

python/lsst/daf/butler/registry/queries/_results.py

+
+    __slots__ = ("_db", "_query", "_dimensions", "_components", "_records")
+
+    def __iter__(self) -> Iterator[DatasetRef]:


This is just open discussion, but would it be better to use typing.Generator here? I personally think it helps identify in a signature that the results will be single-use. I don't know whether analysis tools are smart enough to work that out automatically, though.

I don't know about the tooling, but I think Iterator is also single-pass (all you can do with it is call next(), after all). It's Iterable that's unfortunately ambiguous, and I don't have a good solution for that.
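
The distinction can be seen in a short standalone sketch (function names are illustrative only). A function accepting `Iterable[int]` cannot tell from the annotation whether its argument can be traversed more than once:

```python
from typing import Iterable, Iterator, List, Tuple


def squares(n: int) -> Iterator[int]:
    # A generator is an Iterator: it can only be consumed once.
    for i in range(n):
        yield i * i


def consume_twice(values: Iterable[int]) -> Tuple[List[int], List[int]]:
    # Nothing in the Iterable annotation says whether a second pass works.
    return list(values), list(values)


first, second = consume_twice([0, 1, 4])
gen_first, gen_second = consume_twice(squares(3))

print(first, second)          # [0, 1, 4] [0, 1, 4]
print(gen_first, gen_second)  # [0, 1, 4] []
```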

natelust · 2020-08-05T19:36:34Z

python/lsst/daf/butler/registry/queries/_results.py

+                else:
+                    yield parentRef.makeComponentRef(component)
+
+    def byParentDatasetType(self) -> Iterator[ParentDatasetQueryResults]:


Same as __iter__.

natelust · 2020-08-06T17:36:33Z

python/lsst/daf/butler/core/dimensions/records.py

+
+    def __hash__(self) -> int:
+        return hash(self.dataId)
+


Being pedantic, I think this should be 2 commits. I don't know if you are doing any squashing before merging, but just keep it in mind if you do any rebases. It is not high enough priority that I would insist on it on its own.

natelust · 2020-08-06T20:55:25Z

python/lsst/daf/butler/registry/queries/_builder.py

        return True

-    def joinTable(self, table: FromClause, dimensions: NamedValueSet[Dimension]) -> None:
+    def joinTable(self, table: sqlalchemy.sql.FromClause, dimensions: NamedValueSet[Dimension], *,


I think this method needs to verify that the dimensions passed in are contained in self.dimensions, and either reject them if they are not, or somehow update the query so that self.dimensions includes them and the join can happen.

I took a closer look at this, and it seems I need to add checks to both this method and Query.makeBuilder to make sure we only attempt to build the queries we can build correctly.
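
As a rough illustration of the "reject early" option, here is a sketch that uses plain sets of dimension names. The function `check_join_dimensions` is hypothetical, and the real checks in `joinTable` and `Query.makeBuilder` operate on daf_butler's own dimension types rather than strings.

```python
from typing import AbstractSet


def check_join_dimensions(requested: AbstractSet[str], available: AbstractSet[str]) -> None:
    """Raise if a join is requested on dimensions the query does not include.

    Hypothetical stand-in for the validation discussed above.
    """
    missing = requested - available
    if missing:
        raise ValueError(
            f"Cannot join on dimensions not already in the query: {sorted(missing)}."
        )


check_join_dimensions({"visit"}, {"instrument", "visit"})  # passes silently
```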

We once used a somewhat different subquery to find datasets in the first matching collection, but moved that deduplication to Python postprocessing of query result rows, in order to simplify the resulting enormous query. This commit reverses that, with an eye towards moving even more computation to the database in order to reduce single-row queries. But it's also quite different from the previous incarnation in some (important) respects:

- The original enormous query involved a search for all required and optional datasets along with all dimension joins at once, so it really was enormous. We now query just for dimensions while joining in required datasets only for existence (we don't try to return dataset_id or run_id values in this query, and hence don't have to deduplicate), and then later perform follow-up queries for one dataset type at a time. And in the near future, we'll be putting the first query's results into a temporary table and using that for those follow-up queries.
- The new ordered-search subquery uses UNION ALL clauses to combine collections and window functions to find the first match for a data ID, instead of the CASE statements and string manipulation we had before. I _think_ that should be a little friendlier to the query optimizer, but it's also pretty localized so we can still try other approaches later if needed.
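
The UNION ALL plus window function approach can be sketched with plain SQL run through Python's `sqlite3` module. The table and column names (`run_a`, `run_b`, `data_id`, `dataset_id`, `search_rank`) are invented for this example and are not the daf_butler schema; the actual subquery is built with SQLAlchemy and will differ. Window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE run_a (data_id INTEGER, dataset_id INTEGER);
    CREATE TABLE run_b (data_id INTEGER, dataset_id INTEGER);
    INSERT INTO run_a VALUES (1, 101);
    INSERT INTO run_b VALUES (1, 201), (2, 202);
""")

# Collections are searched in order: run_a first (rank 0), then run_b (rank 1).
query = """
    SELECT data_id, dataset_id FROM (
        SELECT data_id, dataset_id,
               ROW_NUMBER() OVER (PARTITION BY data_id ORDER BY search_rank) AS rownum
        FROM (
            SELECT data_id, dataset_id, 0 AS search_rank FROM run_a
            UNION ALL
            SELECT data_id, dataset_id, 1 AS search_rank FROM run_b
        )
    )
    WHERE rownum = 1
    ORDER BY data_id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 101), (2, 202)]
```

Data ID 1 exists in both collections, so only the match from the first collection is kept; data ID 2 falls through to the second.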

This changes the relationship between QueryBuilder and Query slightly - defining the columns that are actually selected is now the latter's responsibility. And both now use a SimpleQuery internally to transfer that state. Right now, there's only one Query subclass, DirectQuery, which does exactly what the old one did. But there will be at least one more in the future.

I had originally wanted to allow lists and other mutable sequences here, too, while demanding extra care from users. But realizing that this broke the simple implementation of __eq__ (lists do not compare equal to tuples with the same elements) swung me the other way.
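
The underlying Python behavior, shown standalone:

```python
as_list = [1, 2, 3]
as_tuple = (1, 2, 3)

# Sequences of different types never compare equal, even with equal elements.
print(as_list == as_tuple)         # False

# Normalizing to a tuple up front restores the simple comparison.
print(tuple(as_list) == as_tuple)  # True
```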

TallJimbo · 2020-08-07T18:40:54Z

The second line does not need to be an f-string; the same goes for later strings in the makeBuilder functions.

For posterity, since we discussed this OOB: I prefer this style for readability, and as a way to avoid forgetting the "f" when re-wrapping. @natelust notes that it does come with a performance penalty, but in formatting an exception message that's not really a concern (especially these, which represent logic errors, not something someone would catch).
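
For readers unfamiliar with the style in question, a small sketch (the message text is made up, not taken from this PR). Adjacent string literals are concatenated at compile time, so the `f` prefix on a line with no placeholders changes nothing about the result:

```python
name = "visit"

# Style used in this PR: every continuation line carries the f prefix.
message = (
    f"Dimension {name!r} is not supported here; "
    f"see the method documentation for the supported set."
)

# Equivalent: the prefix dropped on the line that has nothing to format.
plain = (
    f"Dimension {name!r} is not supported here; "
    "see the method documentation for the supported set."
)

print(message == plain)  # True
```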

timj · 2020-08-07T18:43:02Z

When we update flake8 you will get a warning if your f-string has no variables to format.

TallJimbo · 2020-08-07T18:58:13Z

> When we update flake8 you will get a warning if your f-string has no variables to format.

Even if it's whitespace-concatenated with an f-string that does have variables to format? If so, I think we should disable that check.

timj · 2020-08-07T19:15:30Z

Yes, even if it's a continuation of something that did have something to format. I disagree with you on whether it should be disabled.

TallJimbo · 2020-08-07T19:35:43Z

> Yes, even if it's a continuation of something that did have something to format. I disagree with you on whether it should be disabled.

Ok, not worth me fighting this, then. I'm going to leave this branch as it is because it's already through Jenkins and this is far from the only case where this occurs in daf_butler, and we'll cross that bridge when we get to it.

natelust approved these changes Aug 6, 2020

TallJimbo added 28 commits August 6, 2020 22:06

Doc fix that escaped DM-25776.

f03635f

Type referenced here previously was an experiment that was ultimately squashed away on that branch.

Fix misleading docs in Registry.findDataset.

8896d73

Some of these examples are valid in other contexts, but not in those that (like this one) need the collections to be ordered.

Add DatasetRecordStorageManager.__getitem__ for convenience.

52995ff

This just collects a lot of "if result is None: raise" logic in one place.

Import queries subpackage directly in registry, not its symbol.

4bb2ae2

We'll soon be importing many more symbols from this subpackage but only using them each once or twice, so the new style will be clearly preferable.

Allow a spatial region to be passed to queries without a data ID.

e3554a3

Allow joins of unrelated tables in SimpleQuery.

a32902c

Add QueryResults for DataCoordinate queries.

d1967eb

Add Registry.queryDataIds method.

7f5699b

This is almost a clone of queryDimensions (which will be removed in a later commit), but it returns one of our new QueryResults objects, and has no `expand` argument (one would call `expanded()` on the result instead).

Replace queryDimensions with queryDataIds in tests and docs.

8b7eb46

Remove Registry.queryDimensions.

4df3d8d

Add Registry method for querying DimensionRecords directly.

535a6f0

Drop support for Oracle.

e0b6eb7

DM is no longer targeting Oracle, and I'm about to add functionality (temporary tables) that would require an Oracle-specific implementation if we were to keep it.

Add temporary table create and drop to Database.

f225246

Fix return type for Database.ensureTableExists.

540709b

Add INSERT INTO ... SELECT support to Database.insert.

3963875

Add option to not create a constraint for some foreign keys.

1a4db3c

This was already an option for foreign key fields referencing datasets; this adds it for dimensions and collections.

Add temporary-table materialization to Query.

ac45e4a

Add temporary-table materialization to DataCoordinateQueryResults.

8ff9104

Add QueryResults classes for datasets.

43d7254

Add DataCoordinateQueryResults.findDatasets.

cdb608c

Use DatasetQueryResults in queryDatasets.

81f428c

Add tests for new query system.

bcf7f35

Add equality comparison and hash to DimensionRecord.

bd036a5

Add missing docs to DimensionRecord.

eaa54ef

Handle query-with-no-columns edge case.

505ca97

TallJimbo force-pushed the tickets/DM-25919 branch from 7cde6e3 to 505ca97 Compare August 7, 2020 03:47

Enforce limitations on dimensions in query system.

ad5a949

I plan to remove these limitations on a future ticket, but for now it's better to fail early rather than produce some confusing error message or (more likely and worse) unexpected query later.

TallJimbo force-pushed the tickets/DM-25919 branch from 3197537 to ad5a949 Compare August 7, 2020 14:09

TallJimbo merged commit 27159a1 into master Aug 7, 2020

TallJimbo deleted the tickets/DM-25919 branch August 7, 2020 19:40
