
DM-24734: Rework QuantumGraph generation to avoid O(N^2) scaling. #128

Merged
merged 4 commits into master from tickets/DM-24734 on May 29, 2020

Conversation

TallJimbo
Member

No description provided.

@TallJimbo TallJimbo force-pushed the tickets/DM-24734 branch 2 times, most recently from 6f8fa1c to 6013858 Compare May 16, 2020 19:56
Contributor

@natelust natelust left a comment

Mostly looks fine; just a few code comments, but the algorithm seems OK as best as I followed it.

@@ -440,7 +440,7 @@ def to_primitives(self):
accumulate = {"description": self.description}
if self.instrument is not None:
accumulate['instrument'] = self.instrument
accumulate['tasks'] = {l: t.to_primitives() for l, t in self.tasks.items()}
accumulate['tasks'] = {m: t.to_primitives() for m, t in self.tasks.items()}
Contributor

Just curious, what didn't it like here?

Member

l is frowned upon as a variable name because in some fonts it is easily confused with 1.

dataIds : `Iterable` [ `DataCoordinate` ]
Data IDs to match.

Yields
Contributor

This is not correct. It returns a concrete thing. Yields is only appropriate if you did something like:

def extract(...) -> Generator[DatasetRef, None, None]:
    ...
    yield from (refs[dataId] for dataId in dataIds)

Member Author

Agreed, in the sense that this is what the Python community expects, but I'm curious as to whether you know if this actually matters. AFAICT, whether an iterator-returning function is implemented as a generator or not is completely transparent to the caller, so distinguishing between them is always just exposing an implementation detail. Is that wrong in some edge cases (or maybe coroutines)?

Contributor

A few thoughts:

  • Pedantically, the function does not yield anything; it returns something that yields something. It's like the difference between an iterable and an iterator.
  • If someone really wanted to know the call stack (or something like that), this function will never be re-entered; the thing it returns will be.
  • It does make a difference when coroutines are involved (insofar as you lose the richer type information provided to any checkers).

I think in day-to-day practice you can use this as is, but your terminology does not really match what is happening, and if it were used like this in other places (coroutines) it would be wrong. Code in a generator function is not executed until the first use (everything up to the first yield, which is why it must be "primed" with a next), but in this case there is code execution each time before the generator is returned. If someone really cared about that for some reason, the info you wrote would lead them astray. In practice it probably doesn't matter much.
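A minimal sketch (not from the PR) of the distinction being discussed: a generator function's body does not run until the first next(), while a plain function that returns an iterator executes its setup code eagerly at call time.

```python
log = []

def lazy():
    # Generator function: nothing here runs until the first next().
    log.append("lazy: body entered")
    yield 1

def eager():
    # Ordinary function returning an iterator: this runs at call time.
    log.append("eager: body entered")
    return iter([1])

it_lazy = lazy()
it_eager = eager()
assert log == ["eager: body entered"]  # lazy()'s body has not executed yet

assert next(it_lazy) == 1              # now the generator body runs
assert next(it_eager) == 1
assert log == ["eager: body entered", "lazy: body entered"]
```

From the caller's perspective both return an iterator of the same values, which is TallJimbo's point; the observable difference is only in when the setup code executes.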

dataId : `DataCoordinate`
Data ID for this quantum.
"""
def __init__(self, task: _TaskScaffolding, dataId: DataCoordinate):
Contributor

remove your __init__ method

Member Author

@TallJimbo TallJimbo May 28, 2020

I think taking that approach (along with the changes in your next few comments) also requires adding field(init=False) or defaults (and hence dropping __slots__) for inputs, outputs, and prerequisites, and that task and dataId shouldn't actually be InitVar in that scenario since we do want them as attributes. Overall it seems simpler to just define my own __init__ than go to so much work to make the dataclass-injected one usable for me.

Maybe the better question is whether I should drop @dataclass - I'm already overriding __repr__ and __init__, so all I'm getting from that is equality comparison, and I don't actually want that, I suppose. Unless I hear otherwise I'll just drop the decorator.
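A hypothetical sketch (class and field names invented, not the PR's actual scaffolding classes) of the dataclass machinery being weighed here: fields excluded from the generated __init__ via field(init=False), with extra setup in __post_init__.

```python
from dataclasses import dataclass, field

@dataclass
class Scaffolding:
    # These two are parameters of the generated __init__.
    task: str
    dataId: int
    # Excluded from __init__; needs a default_factory, which is the kind of
    # extra bookkeeping TallJimbo notes (and it conflicts with __slots__).
    inputs: dict = field(init=False, default_factory=dict)

    def __post_init__(self):
        # Runs after the generated __init__ has assigned the fields above.
        self.inputs["seeded"] = True

s = Scaffolding(task="makeWarp", dataId=42)
assert s.task == "makeWarp"
assert s.inputs == {"seeded": True}
```

This shows why a hand-written __init__ can be simpler once several fields need init=False: the decorator's remaining contribution (generated __eq__/__repr__) may not be wanted at all, which is the argument for dropping @dataclass.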

"""
def __init__(self, task: _TaskScaffolding, dataId: DataCoordinate):
self.task = task
self.dataId = dataId
Contributor

add def __post_init__(self, task, dataId):

# because of back-references.
return f"_QuantumScaffolding(taskDef={self.taskDef}, dataId={self.dataId}, ...)"

task: _TaskScaffolding
Contributor

make this task: typing.InitVar[_TaskScaffolding]

quantum = task.quanta.get(quantumDataId)
if quantum is None:
quantum = _QuantumScaffolding(task=task, dataId=quantumDataId)
task.quanta[quantumDataId] = quantum
Contributor

why not just do task.quanta.setdefault(quantumDataId, _QuantumScaffolding(task=task, dataId=quantumDataId))

Member Author

Same reasoning as above.

# be associated.
# Many of these associates will be duplicates (because another
# query row that differed from this one only in irrelevant
# dimensions already added them), and we use sets to skips.
Contributor

skips -> skip

for datasetType, refs in itertools.chain(self.inputs.items(), self.intermediates.items(),
self.outputs.items()):
datasetDataId = commonDataId.subset(datasetType.dimensions)
ref = refs.get(datasetDataId)
Contributor

why not do refs.setdefault(datasetDataId, DatasetRef(datasetType, datasetDataId))

Member Author

I figured the cost of an extra hash lookup was probably lower than the cost of constructing the temporary if it's usually not needed (as I think is the case here). No quantitative analysis of that, though. Mostly I really wish Python (and every language I've ever used) had a better interface for conditional mapping insertions that avoided that choice.
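A small sketch (independent of the PR code) of the tradeoff described above: dict.setdefault evaluates its default argument unconditionally, even when the key is already present, whereas get-then-insert constructs the value only on a miss.

```python
constructions = 0

class Expensive:
    """Stand-in for a costly-to-construct value like a DatasetRef."""
    def __init__(self):
        global constructions
        constructions += 1

d = {"k": Expensive()}
assert constructions == 1

# setdefault: the Expensive() argument is built and then thrown away,
# because "k" is already present.
d.setdefault("k", Expensive())
assert constructions == 2

# get-then-insert: pays an extra hash lookup, but constructs on miss only.
value = d.get("j")
if value is None:
    value = Expensive()
    d["j"] = value
assert constructions == 3
```

Which pattern wins depends on the hit rate and on how expensive the temporary is; the get-then-insert form used in the PR favors the case where the key usually exists.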

_LOG.debug("Resolving %d datasets for input dataset %s.", len(refs), datasetType.name)
for dataId in refs:
refs[dataId] = registry.findDataset(datasetType, dataId=dataId, collections=collections)
assert not any(ref is None for ref in refs.values())
Contributor

Bare asserts can be hard to debug. It's up to you to decide whether it is likely enough to happen to make a richer message worth it.

Member Author

Yeah, this one is probably worth changing to a RuntimeError raise, as it's guarding against both daf_butler logic (which is probably too "far away" for this assert), and that these datasets haven't been deleted by some other Registry client since the previous query.
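A sketch (function name invented) of the suggested change: replacing the bare assert with an explicit RuntimeError whose message names the unresolved data IDs, so a failure points at what went missing rather than just tripping an assertion.

```python
def checkResolved(refs):
    # refs maps data ID -> resolved reference (or None if lookup failed).
    missing = [dataId for dataId, ref in refs.items() if ref is None]
    if missing:
        raise RuntimeError(
            f"Datasets with data ID(s) {missing} were not resolved; they may "
            "have been deleted by another Registry client since the previous "
            "query."
        )

checkResolved({"visit=1": object()})  # all resolved: passes silently

try:
    checkResolved({"visit=2": None})
except RuntimeError as err:
    assert "visit=2" in str(err)
```

Unlike an assert, this check also survives running Python with -O, which matters when it guards against concurrent deletions rather than internal logic errors.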

elif not skipExisting:
raise OutputExistsError(f"Output dataset {datasetType.name} already exists in "
f"output RUN collection '{run}' with data ID {dataId}.")
refs[dataId] = ref
Contributor

It seems weird to pull out the ref and then reassign it in the case it is not resolved. I think these branches can be reworked a bit.

Original, per-data ID messages are now at TRACE level, with coarser
(per-task) status messages at DEBUG.
@TallJimbo
Member Author

I've addressed or responded to review comments and rebased, and am kicking off a new Jenkins run now since it's been a while. Will merge when that's green unless I see more comments first.

These now reflect how adjustQuantum is actually being called.  I suspect
the original types reflect a reasonable aspiration: PipelineTask
subclasses ideally would be operating on a mapping that uses their internal
collection names instead of dataset type names, but the Connections
infrastructure doesn't provide a good way to do that translation (!),
so changing that here is both out-of-scope _and_ a lot of work.
The previous implementation tested all DatasetRefs (of the right type)
for compatibility with all quanta, which doesn't scale when the graph
is large.

This removes the fillQuanta step from _PipelineScaffolding, moving the
association of datasets with quanta into fillDataIds (now
connectDataIds) and prerequisite lookup into fillDatasetRefs (now
resolveDatasetRefs).

To do that, I added a new _QuantumScaffolding object to represent an
under-construction Quantum, and removed _DatasetScaffolding as it
didn't end up providing more than a simple dict would;
_DatasetScaffoldingDict was accordingly renamed and adjusted to
_DatasetDict.
@TallJimbo TallJimbo merged commit 878c510 into master May 29, 2020
@TallJimbo TallJimbo deleted the tickets/DM-24734 branch May 29, 2020 12:59