New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-25919: custom classes and new functionality for query results #330
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there were a few points I found to think about, but nothing too major. Overall it looks good, but I over no promises that I didn't miss some small logic bug related to functionality calls that already exist :-)
@@ -142,44 +282,334 @@ def extractDataId(self, row: RowProxy, *, graph: Optional[DimensionGraph] = None | |||
graph : `DimensionGraph`, optional | |||
The dimensions the returned data ID should identify. If not | |||
provided, this will be all dimensions in `QuerySummary.requested`. | |||
records : `Mapping` [ `str`, `Mapping` [ `tuple`, `DimensionRecord` ] ] | |||
Records to use to return an `ExpandedDataCoordinate`. If provided, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wording is awkward
element = self.dimensions[element] | ||
dataIds = self.queryDataIds(element.graph, dataId=dataId, datasets=datasets, collections=collections, | ||
where=where, components=components, **kwargs) | ||
"""Query for dimension information matching user-provided criteria. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the doc-string after some code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦
f"Database) already exists.") | ||
for foreignKeySpec in spec.foreignKeys: | ||
table.append_constraint(self._convertForeignKeySpec(name, foreignKeySpec, self._metadata)) | ||
table.create(self._connection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random question can this allow a temp table to be created that shadows a non temp table? I am not sure what each db will do. If It does then depending on how databases handle this, the dropTemporaryTable may cause a dropped table. I find this unlikely behavior, but its worth verifying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using uuid
to create a random name, so clashes should be practically (if not quite theoretically) impossible.
@@ -389,6 +426,38 @@ def _makeSubsetQueryColumns(self, *, graph: Optional[DimensionGraph] = None, | |||
columns.datasets = self.getDatasetColumns() | |||
return graph, columns | |||
|
|||
@contextmanager | |||
def materialize(self, db: Database) -> Iterator[MaterializedQuery]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the return type is a context manager
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish it was; that'd be clearer to read. But MyPy wants me to annotate the undecorated method, which it then transforms by knowing what the decorator does. I suppose that's the only behavior that would actually make sense for decorators in general, but it is particularly unfortunate for this one, where the return type is so different.
self._datasetType = datasetType | ||
self._isUnique = isUnique | ||
|
||
def isUnique(self) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a function? If its to future proof things, it should be fine to just expose isUnique directly as an attribute to access, and it can be changed later if need be. If it is supposed to keep the value from being set from the outside, I think you probably meant to make this a property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a naming convention thing. Because we almost always name methods starting with verbs, I don't like to start non-methods with verbs unless it's completely ambiguous that they aren't methods. I don't know about everyone else, but this helps me remember whether or not I should add a ()
to any particular interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case, does it need to be called isUnique
or would unique
as an attribute (or property) do the same thing? Does is
add anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bit clearer. Just "unique" in only a slightly different context could be an attribute that is a container of things that are unique, rather than a bool, while "isUnique" is very clearly a bool, and I think a set of parens is a reasonable price to pay for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds reasonable
@@ -235,3 +249,193 @@ def constrain(self, query: SimpleQuery, columns: Callable[[str], sqlalchemy.sql. | |||
for dimension in self.graph.required | |||
]) | |||
) | |||
|
|||
|
|||
class DatasetQueryResults(Iterable[DatasetRef]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think I had seen typing used as inheritance in this way
|
||
__slots__ = ("_db", "_query", "_dimensions", "_components", "_records") | ||
|
||
def __iter__(self) -> Iterator[DatasetRef]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just open discussion, but would it be better to use typing.Generator here? I personally think it helps identify in a signature that the results will be single use. I dont know if analysis tools are or are not smart enough for that to happen automatically though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know about the tooling, but I think Iterator
is also single-pass (all you can do with it is call next()
, after all). It's Iterable
that's unfortunately ambiguous, and I don't have a good solution for that.
else: | ||
yield parentRef.makeComponentRef(component) | ||
|
||
def byParentDatasetType(self) -> Iterator[ParentDatasetQueryResults]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as iter
|
||
def __hash__(self) -> int: | ||
return hash(self.dataId) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being pedantic I think this should be 2 commits. I dont know if you are doing any squashing before merging, but just keep it in mind if you do any rebases. It is not high enough priority that I would insist on it on its own.
return True | ||
|
||
def joinTable(self, table: FromClause, dimensions: NamedValueSet[Dimension]) -> None: | ||
def joinTable(self, table: sqlalchemy.sql.FromClause, dimensions: NamedValueSet[Dimension], *, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method needs to verify the dimensions passed in, can contained inside self.dimensions, and either reject if they are not, or somehow do the query to update self.dimensions so that the join can happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a closer look at this, and it seems I need to add checks to both this method and Query.makeBuilder
to make sure we only attempt to build the queries we can build correctly.
Type referenced here previously was an experiment that was ultimately squashed away on that branch.
Some of these examples are valid in other contexts, but not in those that (like this one) need the collections to be ordered.
This just collects a lot of "if result is None: raise" logic in one place.
We'll soon be importing many more symbols from this subpackage but only using them each once or twice, so the new style will be clearly preferable.
We once used a somewhat different subquery to find datasets in the first matching collection, but moved that deduplication to to Python postprocessing of query result rows, in order to simplify the, resulting enormous query. This commit reverses that, with an eye towards moving even more computation to the database in order to reduce single-row queries. But it's also quite different from the previous incarnation in some (important) respects: - The original enormous query involved a search for all required and optional datasets along with all dimension joins at once, so it really was enormous. We now query just for dimensions while joining in required datasets only for existence (we don't try to return dataset_id or run_id values in this query, and hence don't have to deduplicate), and then later perform follow-up queries for one dataset type at a time. And in the near future, we'll be putting the first query's results into a temporary table and using that for those follow-up queries. - The new ordered-search subquery uses UNION ALL clauses to combine collections and window functions to find the first match for a data ID, instead of the CASE statements and string manipulation we had before. I _think_ that should be a little friendlier to the query optimizer, but it's also pretty localized so we can still try other approaches later if needed.
This changes the relationship between QueryBuilder and Query slightly - defining the columns that are actually selected is now the latter's responsibility. And both now use a SimpleQuery internally to transfer that state. Right now, there's only one Query subclass, DirectQuery, which does exactly what the old one did. But there will be at least one more in the future.
This is almost a clone of queryDimensions (which will be removed in a later commit), but it returned one of our new QueryResults objects, and has no `expand` argument (one would call `expanded()` on the result instead).
DM is no longer targeting Oracle, and I'm about to add functionality (temporary tables) that would require an Oracle-specific implementation if we were to keep it.
This was already an option for foreign key fields referencing datasets; this adds it for dimensions and collections.
I had originally wanted to allow lists and other mutable sequences here, too, while demanding extra care from users. But realizing that this broke the simple implementation of __eq__ (lists do not compare equal to tuples with the same elements) swung me the other way.
7cde6e3
to
505ca97
Compare
I plan to remove these limitations on a future ticket, but for now it's better to fail early rather than produce some confusing error message or (more likely and worse) unexpected query later.
3197537
to
ad5a949
Compare
For posterity, since we discussed this OOB: I prefer this style for readability, and as a way to avoid forgetting the "f" when re-wrapping. @natelust notes that it does come with a performance penalty, but in formatting an exception message that's not really a concern (especially these, which represent logic errors, not something someone would catch). |
When we update flake8 you will get a warning if your f-string has no variables to format. |
Even if it's whitespace-concatenated with an |
Yes, even if it's a continuation of something that did have something to format. I disagree with you on whether it should be disabled. |
Ok, not worth me fighting this, then. I'm going to leave this branch as it is because it's already through Jenkins and this is far from the only case where this occurs in daf_butler, and we'll cross that bridge when we get to it. |
No description provided.