Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-25919: custom classes and new functionality for query results #330

Merged
merged 29 commits into from Aug 7, 2020

Conversation

TallJimbo
Copy link
Member

No description provided.

Copy link
Contributor

@natelust natelust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there were a few points I found to think about, but nothing too major. Overall it looks good, but I over no promises that I didn't miss some small logic bug related to functionality calls that already exist :-)

@@ -142,44 +282,334 @@ def extractDataId(self, row: RowProxy, *, graph: Optional[DimensionGraph] = None
graph : `DimensionGraph`, optional
The dimensions the returned data ID should identify. If not
provided, this will be all dimensions in `QuerySummary.requested`.
records : `Mapping` [ `str`, `Mapping` [ `tuple`, `DimensionRecord` ] ]
Records to use to return an `ExpandedDataCoordinate`. If provided,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wording is awkward

element = self.dimensions[element]
dataIds = self.queryDataIds(element.graph, dataId=dataId, datasets=datasets, collections=collections,
where=where, components=components, **kwargs)
"""Query for dimension information matching user-provided criteria.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the doc-string after some code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦

f"Database) already exists.")
for foreignKeySpec in spec.foreignKeys:
table.append_constraint(self._convertForeignKeySpec(name, foreignKeySpec, self._metadata))
table.create(self._connection)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random question can this allow a temp table to be created that shadows a non temp table? I am not sure what each db will do. If It does then depending on how databases handle this, the dropTemporaryTable may cause a dropped table. I find this unlikely behavior, but its worth verifying

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using uuid to create a random name, so clashes should be practically (if not quite theoretically) impossible.

@@ -389,6 +426,38 @@ def _makeSubsetQueryColumns(self, *, graph: Optional[DimensionGraph] = None,
columns.datasets = self.getDatasetColumns()
return graph, columns

@contextmanager
def materialize(self, db: Database) -> Iterator[MaterializedQuery]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the return type is a context manager

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish it was; that'd be clearer to read. But MyPy wants me to annotate the undecorated method, which it then transforms by knowing what the decorator does. I suppose that's the only behavior that would actually make sense for decorators in general, but it is particularly unfortunate for this one, where the return type is so different.

self._datasetType = datasetType
self._isUnique = isUnique

def isUnique(self) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a function? If its to future proof things, it should be fine to just expose isUnique directly as an attribute to access, and it can be changed later if need be. If it is supposed to keep the value from being set from the outside, I think you probably meant to make this a property?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a naming convention thing. Because we almost always name methods starting with verbs, I don't like to start non-methods with verbs unless it's completely ambiguous that they aren't methods. I don't know about everyone else, but this helps me remember whether or not I should add a () to any particular interface.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case, does it need to be called isUnique or would unique as an attribute (or property) do the same thing? Does is add anything?

Copy link
Member Author

@TallJimbo TallJimbo Aug 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bit clearer. Just "unique" in only a slightly different context could be an attribute that is a container of things that are unique, rather than a bool, while "isUnique" is very clearly a bool, and I think a set of parens is a reasonable price to pay for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds reasonable

@@ -235,3 +249,193 @@ def constrain(self, query: SimpleQuery, columns: Callable[[str], sqlalchemy.sql.
for dimension in self.graph.required
])
)


class DatasetQueryResults(Iterable[DatasetRef]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think I had seen typing used as inheritance in this way


__slots__ = ("_db", "_query", "_dimensions", "_components", "_records")

def __iter__(self) -> Iterator[DatasetRef]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just open discussion, but would it be better to use typing.Generator here? I personally think it helps identify in a signature that the results will be single use. I dont know if analysis tools are or are not smart enough for that to happen automatically though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about the tooling, but I think Iterator is also single-pass (all you can do with it is call next(), after all). It's Iterable that's unfortunately ambiguous, and I don't have a good solution for that.

else:
yield parentRef.makeComponentRef(component)

def byParentDatasetType(self) -> Iterator[ParentDatasetQueryResults]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as iter


def __hash__(self) -> int:
return hash(self.dataId)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being pedantic I think this should be 2 commits. I dont know if you are doing any squashing before merging, but just keep it in mind if you do any rebases. It is not high enough priority that I would insist on it on its own.

return True

def joinTable(self, table: FromClause, dimensions: NamedValueSet[Dimension]) -> None:
def joinTable(self, table: sqlalchemy.sql.FromClause, dimensions: NamedValueSet[Dimension], *,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method needs to verify the dimensions passed in, can contained inside self.dimensions, and either reject if they are not, or somehow do the query to update self.dimensions so that the join can happen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a closer look at this, and it seems I need to add checks to both this method and Query.makeBuilder to make sure we only attempt to build the queries we can build correctly.

Type referenced here previously was an experiment that was ultimately
squashed away on that branch.
Some of these examples are valid in other contexts, but not in those
that (like this one) need the collections to be ordered.
This just collects a lot of "if result is None: raise" logic in one
place.
We'll soon be importing many more symbols from this subpackage but
only using them each once or twice, so the new style will be clearly
preferable.
We once used a somewhat different subquery to find datasets in the
first matching collection, but moved that deduplication to to Python
postprocessing of query result rows, in order to simplify the,
resulting enormous query.

This commit reverses that, with an eye towards moving even more
computation to the database in order to reduce single-row queries.  But
it's also quite different from the previous incarnation in some
(important) respects:

 - The original enormous query involved a search for all required and
 optional datasets along with all dimension joins at once, so it
 really was enormous.  We now query just for dimensions while joining
 in required datasets only for existence (we don't try to return
 dataset_id or run_id values in this query, and hence don't have to
 deduplicate), and then later perform follow-up queries for one
 dataset type at a time.  And in the near future, we'll be putting the
 first query's results into a temporary table and using that for those
 follow-up queries.

 - The new ordered-search subquery uses UNION ALL clauses to combine
 collections and window functions to find the first match for a data
 ID, instead of the CASE statements and string manipulation we had
 before.  I _think_ that should be a little friendlier to the query
 optimizer, but it's also pretty localized so we can still try other
 approaches later if needed.
This changes the relationship between QueryBuilder and Query slightly
 - defining the columns that are actually selected is now the latter's
responsibility.  And both now use a SimpleQuery internally to transfer
that state.

Right now, there's only one Query subclass, DirectQuery, which does
exactly what the old one did.  But there will be at least one more in
the future.
This is almost a clone of queryDimensions (which will be removed in a
later commit), but it returned one of our new QueryResults objects, and
has no `expand` argument (one would call `expanded()` on the result
instead).
DM is no longer targeting Oracle, and I'm about to add functionality
(temporary tables) that would require an Oracle-specific
implementation if we were to keep it.
This was already an option for foreign key fields referencing
datasets; this adds it for dimensions and collections.
I had originally wanted to allow lists and other mutable sequences
here, too, while demanding extra care from users.  But realizing that
this broke the simple implementation of __eq__ (lists do not compare
equal to tuples with the same elements) swung me the other way.
I plan to remove these limitations on a future ticket, but for now it's
better to fail early rather than produce some confusing error message
or (more likely and worse) unexpected query later.
@TallJimbo
Copy link
Member Author

Second line does not need to be an f string, same in future strings in the makeBuilder functions

For posterity, since we discussed this OOB: I prefer this style for readability, and as a way to avoid forgetting the "f" when re-wrapping. @natelust notes that it does come with a performance penalty, but in formatting an exception message that's not really a concern (especially these, which represent logic errors, not something someone would catch).

@timj
Copy link
Member

timj commented Aug 7, 2020

When we update flake8 you will get a warning if your f-string has no variables to format.

@TallJimbo
Copy link
Member Author

When we update flake8 you will get a warning if your f-string has no variables to format.

Even if it's whitespace-concatenated with an f-string that does has variables to format? If so, I think we should disable that check.

@timj
Copy link
Member

timj commented Aug 7, 2020

Yes, even if it's a continuation of something that did have something to format. I disagree with you on whether it should be disabled.

@TallJimbo
Copy link
Member Author

Yes, even if it's a continuation of something that did have something to format. I disagree with you on whether it should be disabled.

Ok, not worth me fighting this, then. I'm going to leave this branch as it is because it's already through Jenkins and this is far from the only case where this occurs in daf_butler, and we'll cross that bridge when we get to it.

@TallJimbo TallJimbo merged commit 27159a1 into master Aug 7, 2020
@TallJimbo TallJimbo deleted the tickets/DM-25919 branch August 7, 2020 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants