New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-27100: Add PanDA support #17
Conversation
Please rebase to get rid of that bad merge commit you have there. |
20c4779
to
70ace2c
Compare
70ace2c
to
b701a24
Compare
done. Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a quick look and I have many comments. I'm not willing to approve this just yet because there is so much missing in terms of docstrings and code comments that it's hard to see what's really going on.
I'm also a bit concerned that there is LSST code referenced here that is being developed somewhere else that we can't see (DomaLSSTWork) so I'd like to have some insight into that (unless the "LSST" in the name is some fluke).
bucket: "s3://ci_hsc_w_2020_48" | ||
s3_endpoint_url: "https://storage.googleapis.com" | ||
payload_folder: payload | ||
singulatiry_prefix: '"cd /tmp;export HOME=/tmp;export S3_ENDPOINT_URL={s3_endpoint_url};export AWS_ACCESS_KEY_ID={aws_access_key}; export AWS_SECRET_ACCESS_KEY={aws_secret_access_key}; . /opt/lsst/software/stack/loadLSST.bash; setup lsst_distrib -t w_2020_48;' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem correct. "singulatiry"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming of the DomaLSSTWork class in iDDS will be changed in the next PIP release of that module. I'll update the BPS plugin accordingly once it appears.
@@ -0,0 +1,8 @@ | |||
#!/bin/bash | |||
echo "I am in container" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to put some commentary in here explaining why it exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added description
@@ -0,0 +1,6 @@ | |||
#!/usr/bin/python | |||
import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring to let people know what this command is for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -0,0 +1,126 @@ | |||
class LSSTTask(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never inherit from object
in python3. Include a docstring. Convention requires this be LsstTask
but I imagine RubinTask
would also work.
Can this be a dataclass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reimplemented as a dataclass
|
||
# We take the commandline only from first job because PanDA uses late binding and | ||
# command line for each job in task is equal to each other in exception to the processing | ||
# file name which is substitutes by PanDA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
substituted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
task.step = task_step | ||
task.name = self.define_task_name(task.step) | ||
task.queue = self.computing_queue_himem if task_step in self.himem_tasks else self.computing_queue | ||
task.lfns = list(jobs.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if jobs
is dict-like then list(jobs)
is shorter and results in the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed.
def pick_non_init_cmdline(self): | ||
for node_name in self.bps_workflow.nodes: | ||
if node_name != 'pipetaskInit': | ||
return self.bps_workflow.nodes[node_name]['job'].cmdline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This picks the first command line that isn't called pipeTaskInit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is an over-optimized workaround for something ctrl_bps is currently doing. We should make a ctrl_bps ticket explaining exactly what PanDA needs/does to make sure those of us unfamiliar with PanDA don't make wrong assumptions. Currently this works. But there will be more jobs that do not share the same command line. And users currently are allowed to change the command based upon which PipelineTask is being executed, but so far no one has needed this. For example, turn on running with DEBUG logging only for assembleCoadd. Coming soon will be ability to use single full QuantumGraph for every job, but it means having different Quantum NodIds on each command line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add explicit return of None if can't find a command line?
def define_execution_command(self): | ||
exec_str = "" | ||
if self.bps_config.get("computing_queue") == 'docker': | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the code meant to be doing? Should the if
be deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code cleaned.
Extra message for report command to print. This could be pointers to documentation or | ||
to WMS specific commands. | ||
""" | ||
message = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably document that this method does nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering why having it return None is better than the NotImplementedError? With None, bps report
will always behave like there are no runs. I would rather have folks know that "bps report" doesn't work with PanDA yet as opposed to saying there's a bug in "bps report".
DM-28480: Update action to py3.8 and ignore W503
…f PanDA server as authorization proxy.
Update doc/conf.py to new version Add a :no-inherited-members: option to the automodapi directive in ctrl_bps/doc/lsst.ctrl.bps/index.rst to workaround bug. Since having to test building the docs, included minor text changes: Update pipeline syntax from ':' to '#' to match stack changes. Fix one missed capitalization change (qgraph_file to qgraphFile).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran a pipeline using HTCondor and it still worked fine (as expected because of where the changes are). After rebasing latest changes on master, there is nothing that says this shouldn't be merged. There are various LSST guideline comments and a few other smaller changes that could be done before merging if time allows. There are a couple places we should follow up on for ctrl_bps tickets and some other longer term questions and comments.
@@ -0,0 +1,43 @@ | |||
operator: jdoe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just comment for future: When we're past the prototyping period, if the wms plugins stay in ctrl_bps, I think this example should go in the ctrl_bps docs.
|
||
computing_queue: DOMA_LSST_GOOGLE_TEST | ||
computing_queue_himem: DOMA_LSST_GOOGLE_TEST_HIMEM | ||
himem_steps: ['makeWarp', 'assembleCoadd', 'deblend', 'measure', 'pipetaskInit'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't there a discussion about using requestMemory to determine which queue it should go to?
def pick_non_init_cmdline(self): | ||
for node_name in self.bps_workflow.nodes: | ||
if node_name != 'pipetaskInit': | ||
return self.bps_workflow.nodes[node_name]['job'].cmdline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is an over-optimized workaround for something ctrl_bps is currently doing. We should make a ctrl_bps ticket explaining exactly what PanDA needs/does to make sure those of us unfamiliar with PanDA don't make wrong assumptions. Currently this works. But there will be more jobs that do not share the same command line. And users currently are allowed to change the command based upon which PipelineTask is being executed, but so far no one has needed this. For example, turn on running with DEBUG logging only for assembleCoadd. Coming soon will be ability to use single full QuantumGraph for every job, but it means having different Quantum NodIds on each command line.
Tasks filled with parameters provided in workflow configuration and generated pipeline. | ||
""" | ||
tasks = [] | ||
raw_dependency_map = self.create_raw_jobs_dependency_map() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raw is an overloaded term in the LSST project. I was wondering why the code was assuming that the pipelines always started with raw images. If there's an equivalent term, it would be helpful to use it instead.
class IDDSWorkflowGenerator: | ||
""" | ||
Class generates a iDDS workflow to be submitted into PanDA. Workflow includes definition of each task and | ||
definition of dependencies for each task input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is task PanDA's name for a compute job?
# | ||
# You should have received a copy of the GNU General Public License | ||
# along with this program. If not, see <https://www.gnu.org/licenses/>. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing module docstring
workflow.write(out_prefix) | ||
return workflow | ||
|
||
def convert_exec_string_to_hex(self, cmdline): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why passing in self? And then why is this a class method?
nodes_from_edges = set(list(dependency_map.keys())) | ||
extra_nodes = [node for node in all_nodes if node not in nodes_from_edges] | ||
for node in extra_nodes: | ||
dependency_map.setdefault(node, []) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a setdefault()?
tasks_dependency_map.setdefault(self.get_task_by_job_name(job), {})[file_name] = \ | ||
self.split_dependencies_by_tasks(dependency) | ||
self.tasks_inputs.setdefault(self.define_task_name( | ||
self.get_task_by_job_name(job)), []).append(file_local_src) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines would be shorter and easier to read if self.get_task_by_job_name(job) would get pulled out and run once. Think this is possible, but having problems understanding these two setdefault lines of code.
target = ButlerURI(cloud_prefix + '/' + os.path.basename(src_path)) | ||
target.transfer_from(src, transfer="copy") | ||
|
||
def write(self, out_prefix): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not implementing, should this just be removed (after removing call from prepare)?
…f PanDA server as authorization proxy.
…rl_bps into tickets/DM-27100
No description provided.