DM-19988: rewrite of QuantumGraph generation #95
Conversation
I've only made a token (i.e. untested) effort to update GraphBuilder, but that code will be replaced later on this branch anyway.
These were already being skipped, and are now sufficiently out-of-date that they're not worth keeping. Test coverage for this logic will have to live in ci_hsc_gen3, because we just don't have the test data we need here.
Looks fine, a few minor comments.
python/lsst/pipe/base/graph.py
Outdated
self.dependencies)
dependencies: FrozenSet(int)
"""Possibly empty set of indices of dependencies for this Quantum.
Dependnecies include other nodes in the graph; they do not reflect data
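A minimal sketch of the node structure the docstring above describes: a graph node whose dependencies are recorded as a frozen set of indices of other nodes. The class and field names here are illustrative stand-ins, not the actual pipe_base API.

```python
# Hypothetical sketch only; names are not the real pipe_base classes.
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class QuantumNode:
    """A node in a quantum graph, identified by its index in the graph.

    ``dependencies`` holds indices of other nodes in the graph; it does
    not reflect data dependencies directly.
    """
    index: int
    dependencies: FrozenSet[int] = frozenset()


# Node 2 depends on nodes 0 and 1 (by index, not by quantum ID).
node = QuantumNode(index=2, dependencies=frozenset({0, 1}))
```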
"Dependnecies" — typo for "Dependencies".
python/lsst/pipe/base/pipeline.py
Outdated
in this Pipeline.

This does not include dataset types that are produced when constructing
other Tasks in the Pipeline (these are classified as (`initIntermediates`).
one extra opening parenthesis
python/lsst/pipe/base/pipeline.py
Outdated
This does not include dataset types that are also used as inputs when
constructing other Tasks in the Pipeline (these are classified as
(`initIntermediates`).
Here too.
Parameters
----------
taskDef : `TaskDef`
    Data structure that identifies that task class and its config.
that task class?
Also below.
returns related tuples of all dimensions used to identify any regular
input, output, and intermediate datasets (not prerequisites). We then
iterate over these tuples of related dimensions, identifying the subsets
that correspond to distinct data IDs each task and dataset type.
for each task?
""" | ||
# Initialization datasets always have empty data IDs. | ||
emptyDataId = DataId(dimensions=registry.dimensions.empty) | ||
for datasetType, scaffolding in itertools.chain(self.initInputs.items(), |
datasetType is not used by the loop; you can probably iterate over values().
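The suggestion above in a minimal form: when the key of a mapping is unpacked but never used, iterating over `values()` avoids the unused variable. The dictionaries here are illustrative stand-ins for the scaffolding mappings in the loop under review.

```python
import itertools

# Illustrative stand-ins for the mappings chained together in the loop.
initInputs = {"calexp": "scaffolding-a"}
initIntermediates = {"srcSchema": "scaffolding-b"}

# The DatasetType key is not needed, so iterate over values() directly
# instead of unpacking (datasetType, scaffolding) pairs from items().
seen = []
for scaffolding in itertools.chain(initInputs.values(),
                                   initIntermediates.values()):
    seen.append(scaffolding)
```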
originInfo : `lsst.daf.butler.DatasetOriginInfo`
    Object holding the input and output collections for each
    `DatasetType`.
skipExisting : `bool`
optional?
ALLOWED_QUANTUM_DIMENSIONS = frozenset(["instrument", "visit", "exposure", "detector",
                                        "physical_filter", "abstract_filter",
                                        "skymap", "tract", "patch"])
Should something like this come from configuration instead? Is this used by anything at all? I looked at this package and ctrl_mpexec and could not find any references.
Good catch; looks like a relic of an earlier draft. I'll just remove it.
This currently duplicates but will soon supersede similar code in GraphBuilder.
This includes:

- using dataclasses for QuantumGraph helper classes;
- removing QuantumGraph.getDatasetTypes (in favor of the new PipelineDatasetTypes class);
- adding task-level initInputs and initOutputs;
- using "index" consistently instead of "index" and "quantumId" inconsistently in graph traversal. "index" seems better as it avoids confusion with "quantum.id", which is quite different.

This breaks GraphBuilder, but we're about to replace that entirely.
QuantumGraph creation now centers on a "scaffolding" data structure that evolves the Pipeline into a QuantumGraph in a few steps. Algorithmically, the big changes are:

- We now perform all dataset_id lookups in deferred follow-up queries. This costs some performance at the moment, but I believe it will make future optimization much easier by splitting the problem up into more understandable pieces. Most importantly, it lets us look up and construct each DatasetRef exactly once naturally (instead of by relying on caching), by doing so after data ID duplicates in the Big Join Query results have been removed.
- Prerequisite datasets are now looked up using the data IDs of the quanta they are attached to, rather than the full data ID of the Big Join Query's result row. This is the change that actually fixes the problem of DM-19988, because it means the skypix for the ref_cat is constrained only by its overlap with the visit+detector region of its associated Quantum, not its overlap with any tract/patch regions that might be part of the full-pipeline query.
- Per-DatasetType dimensions are no longer explicitly provided by PipelineTasks; instead, these are any dimensions that are used only by prerequisite dataset types. It turns out we already required this to be true, and were essentially demanding that PipelineTasks provide consistent duplicate information in their declarations.
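The "construct each DatasetRef exactly once" point above can be sketched as follows. This is illustrative only, assuming plain dicts as stand-ins for query result rows and DatasetRefs; the real pipe_base code works with butler query results, not these toy structures.

```python
# Hypothetical stand-ins for Big Join Query result rows, which may
# repeat the same data ID across many rows.
rows = [
    {"instrument": "HSC", "visit": 1, "detector": 10},
    {"instrument": "HSC", "visit": 1, "detector": 10},  # duplicate row
    {"instrument": "HSC", "visit": 1, "detector": 11},
]

# Step 1: remove data ID duplicates first.  Dicts are unhashable, so key
# the dedup mapping on a sorted tuple of their items.
unique = {tuple(sorted(row.items())): row for row in rows}

# Step 2: in a deferred follow-up step, construct each "ref" exactly
# once per unique data ID, with no caching needed.
refs = list(unique.values())
```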
ef7916e to be51f06
d0925f6 to aff803b
Looks good.