Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-16539: overhaul calibration dimensions and refactor selectDimensions implementations #135

Merged
merged 14 commits into from Mar 6, 2019

Conversation

TallJimbo
Copy link
Member

No description provided.

@TallJimbo TallJimbo force-pushed the tickets/DM-16539 branch 13 times, most recently from c7095e8 to 16f0b2a Compare February 27, 2019 01:46
Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job untangling preflight mess! I did a quick scrolling through it and I need to look at it more but here is a small set of comments that I have so far.

name: valid_last
type: datetime
doc: >
TAI timestamp of last exposure included in the range (inclusive).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think usual case for intervals/ranges is to make second timepoint exclusive, it makes some operations on ranges simpler (e.g. easier to do [t0, t1), [t1, t2) than [t0, t1-something], [t1, t2]). Though if you are only interested in exposure timepoints (i.e. you have reasonable/predictable gaps) inclusive can work fine too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this ticket I'd like to keep it inclusive to retain the previous logic (or, at least, the old as-documented logic - thinking more about the changes to the unit tests I had to make, I suppose this may actually be a change to the behavior to bring it in line with the documentation).

Anyhow, I think I did have some concrete use-case reasons for why I thought an inclusive upper bound would be better here, and I think it did involve the fact that there are always gap between exposures. But I'm not sure I can reconstruct all of that thinking right now, and I'm happy to revisit it in the future.

src: instrument
tgt: Instrument.instrument

ExposureCalibrationLabelJoin:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this view used by anything, I can't find it on this PR except in unit test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it definitely is, and having it just appear in the unit tests makes sense; this should now behave pretty much just like any of the spatial join views (which also should appear by name only in the tests), but is computed from temporal overlap rather than spatial overlap.

transformation from ``neededDatasetTypes`` to ``futureDatasetTypes``,
restricted by existing data and filter expression.
transformation from ``requiredDatasetTypes`` to ``optional
DatasetTypes``, restricted by existing data and filter expression.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

``required`` to ``optional`` DatasetTypes?

be included in the returned column set. Datasets of these types
may or may not existin the registry.
prerequisite : iterable of `DatasetType` or `str`
`DatasetType`s that should not constrain the query results, but
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`DatasetType`s -> `DatasetTypes <DatasetType>`?

or just drop backticks entirely.

must be present for all result rows. These are included with
a LEFT OUTER JOIN, but the results are checked for NULL. Unlike
regular inputs, prerequisite inputs lookups may be deferred
(see the documentaiton ``defer`` argument).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in documentaiton.
Which defer argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed; I had copied this from the docs for MultipleDatasetQueryBuilder, but the defer option there is controlled by configuration in SqlRegistry, not passed directly by the user, so I've instead just said that deferral is something that Registry implementations may or may not do.


Parameters
----------
managed : ResultColumnsManager.ManagedRow`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backtick is missing

already in the selected columns, this method does nothing.
datasetType : `DatasetType`, optional
If not `None`, this link corresponds to a "per-DatasetType"
dimension in a querym for this `DatasetType`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

querym

return
self._indicesForDimensionLinks[link] = len(self._columns)
else:
if link in self._indicesForPerDatasetTypeDimensionLinks:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link in self._indicesForPerDatasetTypeDimensionLinks[datasetType]?

Parameters
----------
datasetType : `DatasetType`, optional

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra blank line

----------
managed : `ResultsColumnsManager.ManagedRow`
Intermediate result row object to convert.
expandDataId : `True`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool?

Copy link
Contributor

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK (though non-trivial for sure). Few minor comments.

tables : `dict`
Mapping of table names to `sqlalchemy.Table` instances.
dimensions : `DimensionGraph`
All Dimensions included in the query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameters section needs an update

if selectable is None:
raise ValueError(f"No table or clause providing link '{column}' in this query.")
try:
return selectable.c[column]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have to blame mostly myself for this, but after reading our code I strongly prefer columns to single-letter c. No need to change it now, maybe just try to use it more in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched to columns for all of the code in the new sql package. I'll leave the rest of the SqlRegistry code as-is for now.

query = query.where(whereSql)
_LOG.debug("building query: %s",
query.compile(bind=self.registry._connection.engine,
compile_kwargs={"literal_binds": True}))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe wrap it into if _LOG.isEnabledFor(logging.DEBUG): to avoid calling query.compile if it's not needed.

Notes
-----
A query row that include disjoint regions is filtered out, resulting
in `None` being returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is filtered by region overlap then I would assume that query is defined in a way to return just one "true" row (with regions overlapping) but there may be other "false" rows (without overlap). With that assumption I think you cannot do fetchone but instead it should iterate and return first row that passes filter.


@classmethod
def fromCollections(cls, registry, datasetType, collections, addResultColumns=True):
"""Construct a builder that searches a single collection for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single collection -> multiple collections

DS.linkN = DSG.linkN)
"""
if len(collections) == 1:
return cls.fromSingleCollection(registry, datasetType, collections[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to pass addResultColumns argument too?


A multiple-dataset query returns a sequence of `MultipleDatasetQueryRow`
instances; each instance will have a unique `DataId`, but the
`DatasetRef`s in `datasetRefs` are not necessarily unique. For example,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double backticks around datasetRefs.

selectable = data.builder.findSelectableForLink(link)
column = selectable.c[link]
expr.append(column == dataId[link])
ref = data.builder.executeOne(whereSql=and_(*expr))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possible optimization that I was thinking about when doing deferred queries is to reduce number of queries by writing queries to return multiple datasetRefs (e.g. all datasets in output collection if that is not too large) and doing in-memory join (or lookup). I have a feeling that doing individual queries here will be the main bottleneck for large graphs, if that is true we may consider re-implementing this piece of code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other code path that does include these queries in the big one instead of deferring them is still alive, and should Just Work for databases other than SQLite (where we need to defer to avoid the 64-term join limit), and I'm hoping we can avoid scaling problems with SQLite by just not using SQLite at scale.

Right now we also always do other follow-up queries to obtain component datasets for composites, of course, but that's a separate issue.

@@ -41,7 +41,8 @@
from ..core.config import Config
from ..core.dimensions import DataId
from .sqlRegistryDatabaseDict import SqlRegistryDatabaseDict
from .sqlPreFlight import SqlPreFlight
from ..sql import MultipleDatasetQueryBuilder


__all__ = ("SqlRegistryConfig", "SqlRegistry")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be moved above all imports

@@ -183,3 +184,35 @@ def updateVisitEntryFromObsInfo(dataId, obsInfo):
exposure_time=obsInfo.exposure_time.to_value("s"),
)
return dataId


def addUnboundedCalibrationLabel(registry, instrumentName):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want it added to __all__?

This somewhat-generalizes the case of VisitDetectorRegion, which
should only be included in a query when the region it provides is
actually needed.  The current query logic identifies this case based
on it being the only DimensionJoin that is a region holder, but I'd
like to start being explicit about that.
Old doc was a copy/paste error.
This separates the big preflight query into several QueryBuilder
classes that delegate to each other in various ways.  This makes
the logic a bit easier to follow, at least at the highest level,
by better encapsulating the lower-level bits.  It also (I hope)
is a start towards adding simpler Registry APIs for simpler
queries, though that will need to wait for another ticket

This does change the behavior of selectDimensions slightly:
the full data IDs generated are now expanded to include implied
dependencies.  As those data IDs aren't actually used outside
unit tests, this shouldn't break anything, and it's arguably
better behavior anyway.
Unlike ExposureRange (but like all other dimensions) CalibrationLabel
has a single link column (not counting dependencies), and is related
to other Dimensions via a comparatively normal DimensionJoin view.
The major new functionality here is the "prerequisite" DatasetType
category for selectDimensions (now selectMultipleDatasetTypes), but
this also restructures a lot of the QueryBuilder code and tests,
in some cases to try to anticipate future Registry methods that
query for single DatasetTypes.
@TallJimbo TallJimbo merged commit 8481d27 into master Mar 6, 2019
@TallJimbo TallJimbo deleted the tickets/DM-16539 branch March 6, 2019 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants