DM-33963: Add resource-stats gathering task. #15
Conversation
Force-pushed from 07d00db to 4d2302b.
Force-pushed from 41fd792 to d2f13e5.
Force-pushed from 6de0042 to be06824.
Consider adding type annotations, at least where convenient. Even if nothing checks them now, each annotation added now potentially makes things easier in the future.
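The kind of annotation being suggested might look like the following sketch. Note that `Config` and `PipelineTask` here are hypothetical stand-ins for the real lsst.pipe.base classes, not their actual signatures; only the annotation style is the point.

```python
from __future__ import annotations


# Stand-in placeholders for the lsst.pipe.base types (hypothetical).
class Config:
    pass


class PipelineTask:
    def __init__(self, *, config: Config | None = None) -> None:
        self.config = config


class GatherResourceStatisticsTask(PipelineTask):
    # The annotation documents (and lets a future checker verify) that
    # `config` is required and must be a Config instance.
    def __init__(self, *, config: Config) -> None:
        super().__init__(config=config)


task = GatherResourceStatisticsTask(config=Config())
```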
class GatherResourceStatisticsTask(PipelineTask):
    """A `PipelineTask` that gathers resource usage statistics from task
Should this be qualified to pipe_base, so that it does not try to link to the local import? I know there was a similar problem at some point, but I am not good enough at Sphinx to know.
I think if there is a `from lsst.pipe.base import PipelineTask`, then just `PipelineTask` will work, but if you import it another way you need to qualify it. Another reason to like that import style!
Force-pushed from 56b8b3c to 02bb26e.
I thought about it when I started, but I did not want to impose my preferences on a package where most of the work is done by other developers, and after my experience in middleware I've grown wary of adding them without checking, because it's easy to get them wrong, and a bad annotation is often worse than no annotation.
On the type annotations, I see what you are saying, and I am happy for you to leave them out if you prefer. However, I am not quite sure I buy that argument: there is no more danger than of the docstring being wrong, which is also something that is not checked or really enforced. At least with type annotations an author is more likely to update them if an interface changes, by virtue of their being right next to the argument, and they are useful when cross-checking the docstring.
Force-pushed from 33c142e to 8bcd0f3.
)

def __init__(self, *, config):
    super().__init__(config=config)
I think you should validate that the names were actually overridden here; otherwise PLACEHOLDER gets registered into butler repos if people run --register-dataset-types, which I think runs before the graph builder to ensure the types being registered are known to the butler.
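A sketch of the validation being suggested, with invented names throughout (`PLACEHOLDER`, `check_names_overridden`, the config keys); the real config objects differ, but the idea is to refuse to construct the task while any configured dataset type name is still the placeholder, so PLACEHOLDER can never reach the butler.

```python
# Invented names: this only illustrates failing fast on unoverridden names.
PLACEHOLDER = "PLACEHOLDER"


def check_names_overridden(dataset_type_names: dict) -> None:
    """Raise if any configured dataset type name is still the placeholder."""
    bad = sorted(k for k, v in dataset_type_names.items() if v == PLACEHOLDER)
    if bad:
        raise ValueError(f"Dataset type name(s) never overridden: {bad}")


check_names_overridden({"output_table": "resource_statistics"})  # passes
try:
    check_names_overridden({"output_table": PLACEHOLDER})
except ValueError as err:
    caught = err  # construction would be refused here
```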
👍 Dataset type registration always runs after GraphBuilder, but worth doing anyway.
Is that true? I thought there was a situation the graph builder could get into if the types didn't exist yet (I know we fixed some bug related to that, but I thought there was additional undefined behavior in some cases from not registering first).
There have certainly been lots of bugs and tricky behavior in that area, but we have always been consistent about not modifying the data repository at all until after QuantumGraph generation; I think we may even use a read-only butler at first to guarantee this.
        outputs=outputs,
    )
    quanta_by_task_def[task_def] = {quantum}
return QuantumGraph(quanta_by_task_def)
Do you think you want to check for cycles here, or subset to connected components, and verify you have the expected number of independent chains? (Possibly not; I think you have just a bunch of independent quanta, but I wanted to raise it so it would be on your mind.)
Should always be a bunch of independent quanta.
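Since the expectation is that every quantum is independent, a cheap sanity check is possible. This sketch uses a plain dict mapping quanta to their prerequisites, standing in for the real QuantumGraph API (which differs): if no quantum has any prerequisites, the graph has no edges, and a graph with no edges trivially has no cycles.

```python
# Hypothetical sketch: `deps` maps each quantum to the quanta it depends on.
def all_independent(deps: dict) -> bool:
    """Return True if no quantum lists any prerequisite quanta."""
    return all(not prerequisites for prerequisites in deps.values())


# Invented quantum labels for illustration only.
deps = {"gather_stats/visit=1": set(), "gather_stats/visit=2": set()}
independent = all_independent(deps)
```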
n_rows = len(handles_by_data_id)
# Create a dict of empty column arrays that we'll ultimately make into
# a table.
columns = {
Why use this dict of numpy arrays vs. a recarray?
pandas.DataFrame is a transpose of a recarray, but pretty much identical to a dict of numpy arrays under the hood. It may still do a copy when I convert it to a DataFrame, but at least it won't need to do a transpose.
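The layout point can be seen directly in a small example; the column names here (`task`, `max_rss`) are invented for illustration, not the task's actual columns.

```python
import numpy as np
import pandas as pd

# A dict of per-column numpy arrays is already column-oriented, matching how
# a DataFrame stores its data, so no transpose is needed on conversion.
# A recarray, by contrast, is row-oriented.
n_rows = 3
columns = {
    "task": np.array(["isr", "isr", "calibrate"], dtype=object),
    "max_rss": np.array([1.0, 2.0, 3.0]),
}
table = pd.DataFrame(columns)  # column-wise construction, no transpose
```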
handle : `lsst.daf.butler.DeferredDatasetHandle`
    Butler handle for the metadata dataset; used to identify the
    metadata in diagnostic messages only.
warned_about_metadata_version : `bool`
For some reason I thought warnings themselves could be configured to be emitted only once.
If so I'm not familiar with it, and the message does include the data ID so the messages wouldn't be identical anyway.
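For the record, the stdlib `warnings` module does have this: the `"once"` filter action suppresses later occurrences of the same warning. But it deduplicates on the message text, so, as noted, messages that embed the data ID would each still appear once.

```python
import warnings

# The "once" action deduplicates on (message text, category): identical
# messages are emitted only once, while distinct messages all get through.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("once")
    warnings.warn("metadata version mismatch")               # emitted
    warnings.warn("metadata version mismatch")               # suppressed
    warnings.warn("metadata version mismatch for visit=42")  # emitted
```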
The tricky part here is the dynamic aspect of the connections. This looks like it'll work, but hasn't been tested.
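The "dynamic aspect" can be illustrated in miniature; everything here (the class, the `<label>_metadata` naming convention) is invented, not the real connections machinery. The point is that the attribute names are not known statically, because one metadata input must be created per task label found in the configuration.

```python
# Invented sketch of dynamically generated connection attributes.
class DynamicConnections:
    def __init__(self, labels: list) -> None:
        for label in labels:
            # One metadata input connection per configured task label.
            setattr(self, f"{label}_metadata", f"{label}_metadata")


conns = DynamicConnections(["isr", "calibrate"])
```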
The actual CLI script will be going into drp_pipe, because the intent is to only run it inside the SCons scripts there.
This drops the pipeline-building stuff; the pipelines it generated were just too fragile, because they embedded the labels of other tasks in this task's config.
It now calls three private methods.
Force-pushed from 4a31f52 to 8627753.