New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-16539: overhaul calibration dimensions and refactor selectDimensions implementations #135
Conversation
c7095e8
to
16f0b2a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job untangling preflight mess! I did a quick scrolling through it and I need to look at it more but here is a small set of comments that I have so far.
name: valid_last | ||
type: datetime | ||
doc: > | ||
TAI timestamp of last exposure included in the range (inclusive). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think usual case for intervals/ranges is to make second timepoint exclusive, it makes some operations on ranges simpler (e.g. easier to do [t0, t1), [t1, t2)
than [t0, t1-something], [t1, t2]
). Though if you are only interested in exposure timepoints (i.e. you have reasonable/predictable gaps) inclusive can work fine too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this ticket I'd like to keep it inclusive to retain the previous logic (or, at least, the old as-documented logic - thinking more about the changes to the unit tests I had to make, I suppose this may actually be a change to the behavior to bring it in line with the documentation).
Anyhow, I think I did have some concrete use-case reasons for why I thought an inclusive upper bound would be better here, and I think it did involve the fact that there are always gap between exposures. But I'm not sure I can reconstruct all of that thinking right now, and I'm happy to revisit it in the future.
src: instrument | ||
tgt: Instrument.instrument | ||
|
||
ExposureCalibrationLabelJoin: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this view used by anything, I can't find it on this PR except in unit test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it definitely is, and having it just appear in the unit tests makes sense; this should now behave pretty much just like any of the spatial join views (which also should appear by name only in the tests), but is computed from temporal overlap rather than spatial overlap.
transformation from ``neededDatasetTypes`` to ``futureDatasetTypes``, | ||
restricted by existing data and filter expression. | ||
transformation from ``requiredDatasetTypes`` to ``optional | ||
DatasetTypes``, restricted by existing data and filter expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
``required`` to ``optional`` DatasetTypes
?
be included in the returned column set. Datasets of these types | ||
may or may not existin the registry. | ||
prerequisite : iterable of `DatasetType` or `str` | ||
`DatasetType`s that should not constrain the query results, but |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`DatasetType`s -> `DatasetTypes <DatasetType>`?
or just drop backticks entirely.
must be present for all result rows. These are included with | ||
a LEFT OUTER JOIN, but the results are checked for NULL. Unlike | ||
regular inputs, prerequisite inputs lookups may be deferred | ||
(see the documentaiton ``defer`` argument). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in documentaiton
.
Which defer
argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed; I had copied this from the docs for MultipleDatasetQueryBuilder
, but the defer
option there is controlled by configuration in SqlRegistry
, not passed directly by the user, so I've instead just said that deferral is something that Registry
implementations may or may not do.
|
||
Parameters | ||
---------- | ||
managed : ResultColumnsManager.ManagedRow` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
backtick is missing
already in the selected columns, this method does nothing. | ||
datasetType : `DatasetType`, optional | ||
If not `None`, this link corresponds to a "per-DatasetType" | ||
dimension in a querym for this `DatasetType`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
querym
return | ||
self._indicesForDimensionLinks[link] = len(self._columns) | ||
else: | ||
if link in self._indicesForPerDatasetTypeDimensionLinks: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
link in self._indicesForPerDatasetTypeDimensionLinks[datasetType]
?
Parameters | ||
---------- | ||
datasetType : `DatasetType`, optional | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra blank line
---------- | ||
managed : `ResultsColumnsManager.ManagedRow` | ||
Intermediate result row object to convert. | ||
expandDataId : `True` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK (though non-trivial for sure). Few minor comments.
tables : `dict` | ||
Mapping of table names to `sqlalchemy.Table` instances. | ||
dimensions : `DimensionGraph` | ||
All Dimensions included in the query. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parameters section needs an update
if selectable is None: | ||
raise ValueError(f"No table or clause providing link '{column}' in this query.") | ||
try: | ||
return selectable.c[column] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have to blame mostly myself for this, but after reading our code I strongly prefer columns
to single-letter c
. No need to change it now, maybe just try to use it more in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've switched to columns
for all of the code in the new sql
package. I'll leave the rest of the SqlRegistry
code as-is for now.
query = query.where(whereSql) | ||
_LOG.debug("building query: %s", | ||
query.compile(bind=self.registry._connection.engine, | ||
compile_kwargs={"literal_binds": True})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe wrap it into if _LOG.isEnabledFor(logging.DEBUG):
to avoid calling query.compile
if it's not needed.
Notes | ||
----- | ||
A query row that include disjoint regions is filtered out, resulting | ||
in `None` being returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is filtered by region overlap then I would assume that query is defined in a way to return just one "true" row (with regions overlapping) but there may be other "false" rows (without overlap). With that assumption I think you cannot do fetchone
but instead it should iterate and return first row that passes filter.
|
||
@classmethod | ||
def fromCollections(cls, registry, datasetType, collections, addResultColumns=True): | ||
"""Construct a builder that searches a single collection for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single collection -> multiple collections
DS.linkN = DSG.linkN) | ||
""" | ||
if len(collections) == 1: | ||
return cls.fromSingleCollection(registry, datasetType, collections[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to pass addResultColumns
argument too?
|
||
A multiple-dataset query returns a sequence of `MultipleDatasetQueryRow` | ||
instances; each instance will have a unique `DataId`, but the | ||
`DatasetRef`s in `datasetRefs` are not necessarily unique. For example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double backticks around datasetRefs
.
selectable = data.builder.findSelectableForLink(link) | ||
column = selectable.c[link] | ||
expr.append(column == dataId[link]) | ||
ref = data.builder.executeOne(whereSql=and_(*expr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possible optimization that I was thinking about when doing deferred queries is to reduce number of queries by writing queries to return multiple datasetRefs (e.g. all datasets in output collection if that is not too large) and doing in-memory join (or lookup). I have a feeling that doing individual queries here will be the main bottleneck for large graphs, if that is true we may consider re-implementing this piece of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other code path that does include these queries in the big one instead of deferring them is still alive, and should Just Work for databases other than SQLite (where we need to defer to avoid the 64-term join limit), and I'm hoping we can avoid scaling problems with SQLite by just not using SQLite at scale.
Right now we also always do other follow-up queries to obtain component datasets for composites, of course, but that's a separate issue.
@@ -41,7 +41,8 @@ | |||
from ..core.config import Config | |||
from ..core.dimensions import DataId | |||
from .sqlRegistryDatabaseDict import SqlRegistryDatabaseDict | |||
from .sqlPreFlight import SqlPreFlight | |||
from ..sql import MultipleDatasetQueryBuilder | |||
|
|||
|
|||
__all__ = ("SqlRegistryConfig", "SqlRegistry") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be moved above all imports
@@ -183,3 +184,35 @@ def updateVisitEntryFromObsInfo(dataId, obsInfo): | |||
exposure_time=obsInfo.exposure_time.to_value("s"), | |||
) | |||
return dataId | |||
|
|||
|
|||
def addUnboundedCalibrationLabel(registry, instrumentName): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you want it added to __all__
?
This somewhat-generalizes the case of VisitDetectorRegion, which should only be included in a query when the region it provides is actually needed. The current query logic identifies this case based on it being the only DimensionJoin that is a region holder, but I'd like to start being explicit about that.
Old doc was a copy/paste error.
This separates the big preflight query into several QueryBuilder classes that delegate to each other in various ways. This makes the logic a bit easier to follow, at least at the highest level, by better encapsulating the lower-level bits. It also (I hope) is a start towards adding simpler Registry APIs for simpler queries, though that will need to wait for another ticket This does change the behavior of selectDimensions slightly: the full data IDs generated are now expanded to include implied dependencies. As those data IDs aren't actually used outside unit tests, this shouldn't break anything, and it's arguably better behavior anyway.
Unlike ExposureRange (but like all other dimensions) CalibrationLabel has a single link column (not counting dependencies), and is related to other Dimensions via a comparatively normal DimensionJoin view.
The major new functionality here is the "prerequisite" DatasetType category for selectDimensions (now selectMultipleDatasetTypes), but this also restructures a lot of the QueryBuilder code and tests, in some cases to try to anticipate future Registry methods that query for single DatasetTypes.
16f0b2a
to
2223847
Compare
No description provided.