New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-38498: rewrite QuantumGraph generation #370
Conversation
c5ffc7e
to
f2093ce
Compare
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #370 +/- ##
==========================================
- Coverage 83.52% 83.33% -0.19%
==========================================
Files 77 81 +4
Lines 9212 9464 +252
Branches 1782 1772 -10
==========================================
+ Hits 7694 7887 +193
- Misses 1227 1289 +62
+ Partials 291 288 -3
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only looked at the first four commits and I have to stop for today. Here is a small number of comments, will continue tomorrow.
# starts with the output run collection, as an optimization to avoid | ||
# queries later. | ||
if self.skip_existing_in and self.output_run_exists: | ||
first, *_ = self.butler.registry.queryCollections(self.skip_existing_in, flattenChains=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it happen that queryCollections
does not find any collections (returns empty list)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch; it'd definitely be a caller error, but it is possible, and we should re-raise with a better error message.
---------- | ||
task_key : `QuantumKey` or `TaskInitKey` | ||
Identifier for the quantum node. | ||
dataset_key : `DatasetKey` or `PrerequisiteKey` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prerequisites are inputs-only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've addressed the first round of review comments (lots of fixup commits that will be squashed later) and squashed a bug discovered by Jenkins (mostly missing exception-handling around registry queries).
# starts with the output run collection, as an optimization to avoid | ||
# queries later. | ||
if self.skip_existing_in and self.output_run_exists: | ||
first, *_ = self.butler.registry.queryCollections(self.skip_existing_in, flattenChains=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch; it'd definitely be a caller error, but it is possible, and we should re-raise with a better error message.
def update(self, other: QuantumGraphSkeleton) -> None: | ||
"""Copy all nodes from ``other`` to ``self``.""" | ||
for task_label, (_, quanta) in other._tasks.items(): | ||
self._tasks[task_label][1].update(quanta) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That's fine for the only use case right now and that requirement keeps the implementation simple, so I'll include it in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked all remaining commits, looks great, a couple of minor comments.
@@ -169,6 +167,7 @@ def __init__( | |||
skip_existing_in: Sequence[str] = (), | |||
clobber: bool = False, | |||
): | |||
self.log = getLogger(__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment - I usually try to avoid having Logger
attributes, it makes instances un-picklable, but I think it should not be an issue in this case. Context where it could matter is multiprocessing
, we'll probably never use that for graph building.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the heads-up; I did not know about the issue with logs and pickling/multiprocessing.
I think I will leave this as it is - as you said, multiprocessing is unlikely to be an issue, and I need both self.log
and self.metadata
to allow the timeMethod
decorator to work.
dataset_id_bytes: bytes | ||
"""Dataset ID (UUID) as raw bytes.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find how dataset_id_bytes
is used in the code. Would it be more efficient to store it as int
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's never actually accessed directly, but it is used as part of the (generated) namedtuple
__eq__
and __hash__
.
I guess int
might be more efficient if that can be boiled down to a native uint128
on platforms that support it, but I am pretty certain this is not at all a bottleneck - converting the DataCoordinate
attributes to tuples
on this commit was huge for performance (several orders of magnitude), but the hotspot where the set operations on those occur is in the loop over query results before any prerequisites are added.
5f0c15a
to
72e9272
Compare
These mostly use the same algorithm as the one in graphBuilder.py, but they split the pipeline up into disconnected subgraphs first, which will provide a nice performance boost for some pipelines. The other algorithmic difference is that pruning via adjustQuantum now happens directly in QG generation, making the stuff for pruning in QuantumGraph construction unnecessary, and allowing adjustQuantum to raise NoWorkFound even during QG generation. By building on PipelineGraph's more careful handling of storage class overrides, this should make it much, much harder for those problems to creep back in. And finally, the split here into an ABC and implementation should make it much easier to handle special QG generation cases, like the ones for gathering resource usage and HiPS generation, as they can now delegate the stuff they don't want to customize to the base class. This may be useful for generating simple QGs in Prompt Processing, too (but I'm not convinced it'd gain us much) or as a starting point for a more advanced general-purpose QG generation algorithm (though it's unlikely the base class could stay _exactly_ as is for that.
It's easier for a user who wants to adjust verbosity to only deal with one logger for all of QG generation.
We're always comparing data IDs with the same dimensions, so we can do it much faster by just comparing value tuples.
Make use of the butler serialization caching mechanisms to make sure object are effectively cached instead of reconstructing objects needlessly. Also lower the compression ratio of LZMA. This results in slightly larger graph sizes, but is offset by a large runtime gain.
72e9272
to
f12fe70
Compare
QuantumGraph doesn't need to be able to prune itself anymore, since that's now handled earlier, inside QuantumGraphBuilder (where we can do it more efficiently). And without that pruning code, it's easy to replace QuantumGraph's _datasetRefDict attribute with a temporary networkx graph, and this sidesteps a problem in which different DatasetRefs don't compare as equal if they have different storage classes, which was causing QGs to lack some edges they should have had with the QuantumGraphBuilder. I believe this was because the old QG generation algorithm passed DatasetRefs with incorrect storage classes to QuantumGraph, and that also sidestepped the bug.
This class was previously used for both task <-> dataset type graphs and quantum <-> dataset graphs, and now it's just used for the former. So it doesn't need to be generic, and it doesn't need to support node removal anymore. Eventually it should be replaced entirely by PipelineGraph, but that's out of scope for now (it's part of DM-40442).
With the switch to tuples instead of DataCoordinates as keys, adding nodes implicitly when edges are added gets trickier; we can only support it for input datasets, because something else (registry queries or upstream tasks) are responsible for making DatasetRefs for those.
We want to allow users to include at least the output run here without it actually existing yet, and let that be a no-op, instead of complaining.
f12fe70
to
6457335
Compare
Checklist
doc/changes