DM-40392: Re-implement QuantumGraph.updateRun method #369
Codecov Report
Patch coverage:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #369      +/-   ##
==========================================
+ Coverage   83.44%   83.46%   +0.02%
==========================================
  Files          77       77
  Lines        9173     9212      +39
  Branches     1768     1782      +14
==========================================
+ Hits         7654     7689      +35
- Misses       1231     1233       +2
- Partials      288      290       +2
Force-pushed from 3d77341 to 1db8a18.
@@ -1229,32 +1236,44 @@ def updateRun(self, run: str, *, metadata_key: str | None = None, update_graph_i
        update_graph_id : `bool`, optional
            If `True` then also update graph ID with a new unique value.
I've been meaning to ask in what scenario you want to update all the dataset refs but not update the graph ID...
I have no idea. For now there is an option for pipetask update-graph-run, and I'm not sure whether it is used or not.
python/lsst/pipe/base/graph/graph.py
Outdated
def _update_refs_in_place(refs: list[DatasetRef], run: str) -> None:
    """Update list of `~lsst.daf.butler.DatasetRef` with new run and
    dataset IDs.
def _update_output_ref(ref: DatasetRef, run: str) -> DatasetRef:
Is it more efficient in Python to have this take Iterable[DatasetRef] and return a list (like it did before) rather than have the function called N times? It's always called in a list comprehension. A quick benchmark seems to show me that it takes half the time if you don't call the function repeatedly, even if calling list.append.
I'll change that. I thought there was a context where it was called on a single ref, but it is indeed only called on lists of refs here.
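For anyone who wants to repeat the kind of quick benchmark mentioned above, here is a minimal sketch. It is not the reviewer's actual benchmark: FakeRef, update_one, and update_many are hypothetical stand-ins, and the real DatasetRef is not used. The measured ratio depends heavily on how much work each call does, so treat the numbers as indicative only.

```python
import timeit
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeRef:
    """Hypothetical stand-in for DatasetRef; not the real class."""

    id: int
    run: str


def update_one(ref: FakeRef, run: str) -> FakeRef:
    """Per-item variant: called once per ref from a list comprehension."""
    return replace(ref, run=run)


def update_many(refs, run: str) -> list:
    """Batched variant: a single call that loops and appends internally."""
    result = []
    for ref in refs:
        result.append(replace(ref, run=run))
    return result


refs = [FakeRef(id=i, run="old") for i in range(2_000)]

# Both variants must produce the same refs; only the call pattern differs.
per_item = [update_one(ref, "new") for ref in refs]
batched = update_many(refs, "new")

# Time each call pattern a few times over the same input.
t_per_item = timeit.timeit(lambda: [update_one(r, "new") for r in refs], number=5)
t_batched = timeit.timeit(lambda: update_many(refs, "new"), number=5)
print(f"per-item calls: {t_per_item:.4f} s, batched call: {t_batched:.4f} s")
```

The batched form saves one Python function call per ref, which matters most when the per-ref work is cheap.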
def _update_input_ref(ref: DatasetRef, run: str) -> DatasetRef:
    """Update `~lsst.daf.butler.DatasetRef` with new run and dataset
    ID.
    """
Maybe add a comment explaining that it only returns an updated ref if the ref is listed as an output elsewhere in the graph?
python/lsst/pipe/base/graph/graph.py
Outdated
    ID.
    """
    if dataset_id := dataset_id_map.get(ref.id):
        ref = ref.replace(run=run, id=dataset_id)
Shouldn't this ref be identical to the other ref in the other part of the graph? Why do we need to create a new ref? They are immutable. Can't dataset_id_map point to the ref itself, or is there a memory concern and we are trying to minimize that so we don't have to carry around all the outputs twice?
I think the reason was that refs may have different storage classes so they are not exactly identical.
Yes, different storage classes, and also sometimes some of them are components.
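To make the point of this thread concrete, here is a rough sketch of the ID-map approach under stated assumptions. FakeRef, update_output_refs, update_input_refs, and the storage class names are all hypothetical; the real code operates on DatasetRef, and generating new IDs with uuid4 is an assumption made only for illustration. The map stores old ID to new ID rather than whole refs, so an input ref keeps its own storage class while picking up the new run and ID.

```python
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeRef:
    """Hypothetical stand-in for DatasetRef; not the real class."""

    id: uuid.UUID
    run: str
    storage_class: str


def update_output_refs(refs, run, dataset_id_map):
    """Give each output a new run and a fresh ID, recording old -> new."""
    updated = []
    for ref in refs:
        new_id = uuid.uuid4()  # assumption: any new unique ID will do here
        dataset_id_map[ref.id] = new_id
        updated.append(replace(ref, run=run, id=new_id))
    return updated


def update_input_refs(refs, run, dataset_id_map):
    """Update only inputs that are outputs elsewhere in the graph.

    Refs whose ID is not in the map (pre-existing inputs) pass through
    unchanged.
    """
    updated = []
    for ref in refs:
        new_id = dataset_id_map.get(ref.id)
        if new_id is not None:
            ref = replace(ref, run=run, id=new_id)
        updated.append(ref)
    return updated


old_id = uuid.uuid4()
output = FakeRef(old_id, "run1", "ExposureF")
# Same dataset read downstream, but with a different storage class.
intermediate_input = FakeRef(old_id, "run1", "Image")
# An input that no quantum in the graph produces.
pre_existing = FakeRef(uuid.uuid4(), "raw/all", "Exposure")

id_map = {}
new_outputs = update_output_refs([output], "run2", id_map)
new_inputs = update_input_refs([intermediate_input, pre_existing], "run2", id_map)
```

Sharing the output ref object itself would have overwritten the input's storage class, which is why the sketch copies each ref with only run and ID replaced.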
python/lsst/pipe/base/graph/graph.py
Outdated
for refs in self._initOutputRefs.values():
    _update_refs_in_place(refs, run)
# Loop through all outputs and update their dataset refs.
This loop isn't updating the dataset refs, is it?
I'll rephrase that.
Force-pushed from 1db8a18 to 30a45de.
This commit fixes a bug in updateRun, which did not update the dataset IDs of the references after changing their run collection.
Force-pushed from 967862d to bfe6f61.
Depends on lsst/daf_butler#882
Checklist
doc/changes