DM-19988: rewrite of QuantumGraph generation #95
Conversation
I've only made a token (i.e. untested) effort to update GraphBuilder, but that code will be replaced later on this branch anyway.
These were already being skipped, and are now sufficiently out-of-date that they're not worth keeping. Test coverage for this logic will have to live in ci_hsc_gen3, because we just don't have the test data we need here.
Looks fine, a few minor comments.
python/lsst/pipe/base/graph.py
Outdated
self.dependencies)
dependencies: FrozenSet(int)
"""Possibly empty set of indices of dependencies for this Quantum.
Dependnecies include other nodes in the graph; they do not reflect data
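A minimal sketch of the node structure the docstring above describes: a graph node whose dependencies are recorded as a frozen set of indices of other nodes. The class and field names here are illustrative stand-ins, not the actual pipe_base API.

```python
# Hypothetical sketch only; names are not the real pipe_base classes.
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class QuantumNode:
    """A node in a quantum graph, identified by its index in the graph.

    ``dependencies`` holds indices of other nodes in the graph; it does
    not reflect data dependencies directly.
    """
    index: int
    dependencies: FrozenSet[int] = frozenset()


# Node 2 depends on nodes 0 and 1 (by index, not by quantum ID).
node = QuantumNode(index=2, dependencies=frozenset({0, 1}))
```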
"Dependnecies" — typo for "Dependencies".
python/lsst/pipe/base/pipeline.py
Outdated
in this Pipeline.

This does not include dataset types that are produced when constructing
other Tasks in the Pipeline (these are classified as (`initIntermediates`).
one extra opening parenthesis
python/lsst/pipe/base/pipeline.py
Outdated
This does not include dataset types that are also used as inputs when
constructing other Tasks in the Pipeline (these are classified as
(`initIntermediates`).
Here too.
Parameters
----------
taskDef : `TaskDef`
    Data structure that identifies that task class and its config.
that task class?
Also below.
returns related tuples of all dimensions used to identify any regular
input, output, and intermediate datasets (not prerequisites). We then
iterate over these tuples of related dimensions, identifying the subsets
that correspond to distinct data IDs each task and dataset type.
for each task?
""" | ||
# Initialization datasets always have empty data IDs. | ||
emptyDataId = DataId(dimensions=registry.dimensions.empty) | ||
for datasetType, scaffolding in itertools.chain(self.initInputs.items(), |
datasetType is not used by the loop; you can probably iterate over values().
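The suggestion above in a minimal form: when the key of a mapping is unpacked but never used, iterating over `values()` avoids the unused variable. The dictionaries here are illustrative stand-ins for the scaffolding mappings in the loop under review.

```python
import itertools

# Illustrative stand-ins for the mappings chained together in the loop.
initInputs = {"calexp": "scaffolding-a"}
initIntermediates = {"srcSchema": "scaffolding-b"}

# The DatasetType key is not needed, so iterate over values() directly
# instead of unpacking (datasetType, scaffolding) pairs from items().
seen = []
for scaffolding in itertools.chain(initInputs.values(),
                                   initIntermediates.values()):
    seen.append(scaffolding)
```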
originInfo : `lsst.daf.butler.DatasetOriginInfo`
    Object holding the input and output collections for each
    `DatasetType`.
skipExisting : `bool`
optional?
ALLOWED_QUANTUM_DIMENSIONS = frozenset(["instrument", "visit", "exposure", "detector",
                                        "physical_filter", "abstract_filter",
                                        "skymap", "tract", "patch"])
Should something like this come from configuration instead? Is this used by anything at all? I looked at this package and ctrl_mpexec and could not find any references.
Good catch; looks like a relic of an earlier draft. I'll just remove it.
This currently duplicates but will soon supersede similar code in GraphBuilder.
This includes:

- using dataclasses for QuantumGraph helper classes;
- removing QuantumGraph.getDatasetTypes (in favor of the new PipelineDatasetTypes class);
- adding task-level initInputs and initOutputs;
- using "index" consistently instead of "index" and "quantumId" inconsistently in graph traversal. "index" seems better as it avoids confusion with "quantum.id", which is quite different.

This breaks GraphBuilder, but we're about to replace that entirely.
QuantumGraph creation now centers on a "scaffolding" data structure that evolves the Pipeline into a QuantumGraph in a few steps. Algorithmically, the big changes are:

- We now perform all dataset_id lookups in deferred follow-up queries. This costs some performance at the moment, but I believe it will make future optimization much easier by splitting the problem up into more understandable pieces. Most importantly, it lets us look up and construct each DatasetRef exactly once naturally (instead of by relying on caching), by doing so after data ID duplicates in the Big Join Query results have been removed.
- Prerequisite datasets are now looked up using the data IDs of the quanta they are attached to, rather than the full data ID of the Big Join Query's result row. This is the change that actually fixes the problem of DM-19988, because it means the skypix for the ref_cat is constrained only by its overlap with the visit+detector region of its associated Quantum, not its overlap with any tract/patch regions that might be part of the full-pipeline query.
- Per-DatasetType dimensions are no longer explicitly provided by PipelineTasks; instead, these are any dimensions that are used only by prerequisite dataset types. It turns out we already required this to be true, and were essentially demanding that PipelineTasks provide consistent duplicate information in their declarations.
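The "construct each DatasetRef exactly once" point above can be sketched as follows. This is illustrative only, assuming plain dicts as stand-ins for query result rows and DatasetRefs; the real pipe_base code works with butler query results, not these toy structures.

```python
# Hypothetical stand-ins for Big Join Query result rows, which may
# repeat the same data ID across many rows.
rows = [
    {"instrument": "HSC", "visit": 1, "detector": 10},
    {"instrument": "HSC", "visit": 1, "detector": 10},  # duplicate row
    {"instrument": "HSC", "visit": 1, "detector": 11},
]

# Step 1: remove data ID duplicates first.  Dicts are unhashable, so key
# the dedup mapping on a sorted tuple of their items.
unique = {tuple(sorted(row.items())): row for row in rows}

# Step 2: in a deferred follow-up step, construct each "ref" exactly
# once per unique data ID, with no caching needed.
refs = list(unique.values())
```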
ef7916e to be51f06
d0925f6 to aff803b
Looks good.