
DM-19988: rewrite of QuantumGraph generation #95

Merged 7 commits into master from tickets/DM-19988 on Jul 12, 2019

Conversation

TallJimbo (Member):

No description provided.

I've only made a token (i.e. untested) effort to update GraphBuilder,
but that code will be replaced later on this branch anyway.

These were already being skipped, and are now sufficiently out-of-date
that they're not worth keeping.  Test coverage for this logic will
have to live in ci_hsc_gen3, because we just don't have the test data
we need here.
andy-slac (Contributor) left a comment:

Looks fine, a few minor comments.

self.dependencies)
dependencies: FrozenSet(int)
"""Possibly empty set of indices of dependencies for this Quantum.
Dependnecies include other nodes in the graph; they do not reflect data
andy-slac (Contributor): Typo: "Dependnecies".

in this Pipeline.

This does not include dataset types that are produced when constructing
other Tasks in the Pipeline (these are classified as (`initIntermediates`).
andy-slac (Contributor): One extra opening parenthesis.


This does not include dataset types that are also used as inputs when
constructing other Tasks in the Pipeline (these are classified as
(`initIntermediates`).
andy-slac (Contributor): Here too.

Parameters
----------
taskDef : `TaskDef`
    Data structure that identifies that task class and its config.
andy-slac (Contributor): "that task class"? Also below.

returns related tuples of all dimensions used to identify any regular
input, output, and intermediate datasets (not prerequisites). We then
iterate over these tuples of related dimensions, identifying the subsets
that correspond to distinct data IDs each task and dataset type.
andy-slac (Contributor): for each task?

"""
# Initialization datasets always have empty data IDs.
emptyDataId = DataId(dimensions=registry.dimensions.empty)
for datasetType, scaffolding in itertools.chain(self.initInputs.items(),
andy-slac (Contributor): datasetType is not used by the loop; you can probably iterate over values().
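
A minimal, self-contained sketch of the reviewer's suggestion, with placeholder
dicts standing in for the real DatasetType -> scaffolding mappings from the PR
(only `self.initInputs` appears in the quoted diff; the second mapping here is
purely illustrative):

```python
import itertools

# Placeholder mappings; in the PR these are attributes such as self.initInputs
# that map DatasetType objects to scaffolding helpers.
initInputs = {"calexp": "scaffolding for calexp"}
initIntermediates = {"srcSchema": "scaffolding for srcSchema"}

# Chaining .values() avoids the unused loop variable that .items() would need
# when only the scaffolding objects matter.
for scaffolding in itertools.chain(initInputs.values(), initIntermediates.values()):
    print(scaffolding)
```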

originInfo : `lsst.daf.butler.DatasetOriginInfo`
    Object holding the input and output collections for each
    `DatasetType`.
skipExisting : `bool`
andy-slac (Contributor): optional?


ALLOWED_QUANTUM_DIMENSIONS = frozenset(["instrument", "visit", "exposure", "detector",
                                        "physical_filter", "abstract_filter",
                                        "skymap", "tract", "patch"])
andy-slac (Contributor): Should something like this come from configuration instead? Is this used by anything at all? I looked at this package and ctrl_mpexec and could not find any references.

TallJimbo (Member, author): Good catch; looks like a relic of an earlier draft. I'll just remove it.

This currently duplicates but will soon supersede similar code
in GraphBuilder.
This includes:

 - using dataclasses for QuantumGraph helper classes;

 - removing QuantumGraph.getDatasetTypes (in favor of the new
   PipelineDatasetTypes class);

 - adding task-level initInputs and initOutputs;

 - using "index" consistently instead of "index" and "quantumId"
   inconsistently in graph traversal.  "index" seems better as
   it avoids confusion with "quantum.id", which is quite different.

This breaks GraphBuilder, but we're about to replace that entirely.
QuantumGraph creation now centers on a "scaffolding" data structure
that evolves the Pipeline into a QuantumGraph in a few steps.
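
As a rough illustration of the dataclass-based helper classes and per-node
bookkeeping described above: the field names follow the quoted diff (`index`,
`dependencies`), but the class itself is a simplified, hypothetical stand-in,
not the actual scaffolding code.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass
class _QuantumNode:
    """Simplified stand-in for a per-quantum helper in the scaffolding.

    ``dependencies`` holds indices of other nodes in the graph, not data
    dependencies outside it.
    """
    index: int
    dependencies: FrozenSet[int] = frozenset()


# Toy graph: node 2 depends on nodes 0 and 1.
nodes = [
    _QuantumNode(index=0),
    _QuantumNode(index=1),
    _QuantumNode(index=2, dependencies=frozenset({0, 1})),
]
```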

Algorithmically, the big changes are:

 - We now perform all dataset_id lookups in deferred follow-up queries.
   This comes at the expense of some performance at the moment, but I
   believe it will make future optimization much easier by splitting
   the problem up into more understandable pieces.  The most important
   part of this is that it lets us look up and construct each
   DatasetRef exactly once naturally (instead of by relying on
   caching) by doing it after data ID duplicates in the Big Join Query
   results have been removed (a rough sketch of this step follows the list).

 - Prerequisite datasets are now looked up on the query IDs of the
   quanta they are attached to, rather than the full data ID of the
   Big Join Query's result row.  This is the change that actually
   fixes the problem of DM-19988, because it means the skypix
   for the ref_cat is constrained only by its overlap with the
   visit+detector region of its associated Quantum, not its overlap
   with any tract/patch regions that might be part of the full-pipeline
   query.

 - Per-DatasetType dimensions are no longer explicitly provided by
   PipelineTasks - instead, these are any dimensions that are used
   only by prerequisite dataset types.  It turns out we already
   required this to be true, and essentially demanded that PipelineTasks
   provide consistent duplicate information in the declarations.
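
A minimal sketch of the deduplication idea from the first bullet above, using
plain dictionaries rather than the real registry/query APIs; the row contents
and names are hypothetical:

```python
from collections import defaultdict

# Hypothetical rows from a "Big Join Query": each row pairs a quantum data ID
# with one of its dataset data IDs; the same dataset can appear in many rows.
rows = [
    {"quantum": ("visit", 1), "dataset": ("detector", 10)},
    {"quantum": ("visit", 1), "dataset": ("detector", 11)},
    {"quantum": ("visit", 2), "dataset": ("detector", 10)},  # duplicate dataset data ID
]

# Step 1: walk the rows, only recording which data IDs exist and how they
# relate; no refs are constructed yet, so duplicates collapse for free.
datasetDataIds = set()
quanta = defaultdict(set)  # quantum data ID -> dataset data IDs it touches
for row in rows:
    datasetDataIds.add(row["dataset"])
    quanta[row["quantum"]].add(row["dataset"])

# Step 2 (the deferred follow-up stage): construct one placeholder "ref" per
# unique data ID, exactly once, instead of caching inside the row loop.
refs = {dataId: f"DatasetRef{dataId}" for dataId in sorted(datasetDataIds)}

print(len(refs), "unique refs for", len(quanta), "quanta")
```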

andy-slac (Contributor) left a comment:

Looks good.

TallJimbo merged commit 0ee56d7 into master on Jul 12, 2019.
TallJimbo deleted the tickets/DM-19988 branch on July 12, 2019 at 15:17.