
DM-27100: Add PanDA support #17

Merged

merged 23 commits into master from tickets/DM-27100 on Mar 1, 2021

Conversation

SergeyPod
Contributor

No description provided.

@timj
Member

timj commented Dec 14, 2020

Please rebase to get rid of that bad merge commit you have there.

@SergeyPod
Contributor Author

> Please rebase to get rid of that bad merge commit you have there.

done. Thank you.

@timj timj changed the title Tickets/dm 27100 DM-27100: Add PanDA support Jan 4, 2021
@SergeyPod SergeyPod requested a review from timj January 4, 2021 22:47
Member

@timj timj left a comment

I've had a quick look and I have many comments. I'm not willing to approve this just yet because there is so much missing in terms of docstrings and code comments that it's hard to see what's really going on.

I'm also a bit concerned that there is LSST code referenced here that is being developed somewhere else that we can't see (DomaLSSTWork) so I'd like to have some insight into that (unless the "LSST" in the name is some fluke).

bucket: "s3://ci_hsc_w_2020_48"
s3_endpoint_url: "https://storage.googleapis.com"
payload_folder: payload
singulatiry_prefix: '"cd /tmp;export HOME=/tmp;export S3_ENDPOINT_URL={s3_endpoint_url};export AWS_ACCESS_KEY_ID={aws_access_key}; export AWS_SECRET_ACCESS_KEY={aws_secret_access_key}; . /opt/lsst/software/stack/loadLSST.bash; setup lsst_distrib -t w_2020_48;'
Member

This doesn't seem correct. "singulatiry"

Contributor Author

Thanks. Fixed.

Contributor Author

The naming of the DomaLSSTWork class in iDDS will be changed in the next pip release of that module. I'll update the BPS plugin accordingly once it appears.

@@ -0,0 +1,8 @@
#!/bin/bash
echo "I am in container"
Member

You need to put some commentary in here explaining why it exists.

Contributor Author

Added description

@@ -0,0 +1,6 @@
#!/usr/bin/python
import sys
Member

Please add a docstring to let people know what this command is for.

Contributor Author

Done

@@ -0,0 +1,126 @@
class LSSTTask(object):
Member

Never inherit from object in Python 3. Include a docstring. Convention requires this be LsstTask, but I imagine RubinTask would also work.

Can this be a dataclass?
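A dataclass version could look like the following sketch (the field names here are illustrative placeholders, not taken from the PR):

```python
from dataclasses import dataclass, field


@dataclass
class RubinTask:
    """Parameters of a single PanDA task.

    Field names are placeholders for illustration; the real class in
    this PR defines its own attributes.
    """
    name: str = ""
    step: str = ""
    queue: str = ""
    lfns: list = field(default_factory=list)
```

The decorator generates `__init__`, `__repr__`, and `__eq__` automatically, which removes most of the boilerplate a hand-written attribute-holder class carries.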

Contributor Author

Reimplemented as a dataclass

python/lsst/ctrl/bps/wms/panda/idds_tasks.py (resolved)

# We take the commandline only from first job because PanDA uses late binding and
# command line for each job in task is equal to each other in exception to the processing
# file name which is substitutes by PanDA
Member

substituted?

Contributor Author

Fixed

task.step = task_step
task.name = self.define_task_name(task.step)
task.queue = self.computing_queue_himem if task_step in self.himem_tasks else self.computing_queue
task.lfns = list(jobs.keys())
Member

if jobs is dict-like then list(jobs) is shorter and results in the same thing.
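Iterating a mapping yields its keys, so the two spellings are interchangeable; a quick illustration with a stand-in dict:

```python
# Stand-in for the jobs mapping (LFN -> job definition).
jobs = {"lfn_a": "cmd_a", "lfn_b": "cmd_b"}

# list() on a dict iterates its keys, exactly like .keys().
assert list(jobs) == list(jobs.keys())
```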

Contributor Author

Thanks. Fixed.

def pick_non_init_cmdline(self):
for node_name in self.bps_workflow.nodes:
if node_name != 'pipetaskInit':
return self.bps_workflow.nodes[node_name]['job'].cmdline
Member

This picks the first command line that isn't called pipeTaskInit?

Collaborator

I believe this is an over-optimized workaround for something ctrl_bps is currently doing. We should make a ctrl_bps ticket explaining exactly what PanDA needs/does to make sure those of us unfamiliar with PanDA don't make wrong assumptions. Currently this works, but there will be more jobs that do not share the same command line. Users are also currently allowed to change the command based upon which PipelineTask is being executed, though so far no one has needed this (for example, turning on DEBUG logging only for assembleCoadd). Coming soon will be the ability to use a single full QuantumGraph for every job, but that means having different Quantum NodeIds on each command line.

Collaborator

Perhaps add an explicit return of None if it can't find a command line?
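A standalone sketch of the method with the suggested explicit fallback, rewritten here as a free function over a plain dict so it runs on its own:

```python
def pick_non_init_cmdline(nodes):
    """Return the cmdline of the first node that is not pipetaskInit.

    Sketch of the method under review, with ``nodes`` simplified to a
    dict of node name -> {'cmdline': ...}. Returns None explicitly
    when no such node exists, instead of falling off the end of the
    loop and returning None implicitly.
    """
    for node_name in nodes:
        if node_name != 'pipetaskInit':
            return nodes[node_name]['cmdline']
    return None
```

The behavior is unchanged, but the explicit `return None` signals that the "no matching node" case was considered rather than overlooked.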

def define_execution_command(self):
exec_str = ""
if self.bps_config.get("computing_queue") == 'docker':
pass
Member

What's the code meant to be doing? Should the if be deleted?

Contributor Author

Code cleaned.

Extra message for report command to print. This could be pointers to documentation or
to WMS specific commands.
"""
message = ""
Member

Should probably document that this method does nothing.

Contributor Author

Done

Collaborator

Wondering why having it return None is better than the NotImplementedError? With None, bps report will always behave like there are no runs. I would rather have folks know that "bps report" doesn't work with PanDA yet as opposed to saying there's a bug in "bps report".

timj and others added 14 commits January 26, 2021 15:27
DM-28480: Update action to py3.8 and ignore W503
Update doc/conf.py to new version
Add a :no-inherited-members: option to the automodapi directive in
    ctrl_bps/doc/lsst.ctrl.bps/index.rst to workaround bug.
Since having to test building the docs, included minor text changes:
    Update pipeline syntax from ':' to '#' to match stack changes.
    Fix one missed capitalization change (qgraph_file to qgraphFile).
Collaborator

@MichelleGower MichelleGower left a comment

I ran a pipeline using HTCondor and it still worked fine (as expected because of where the changes are). After rebasing latest changes on master, there is nothing that says this shouldn't be merged. There are various LSST guideline comments and a few other smaller changes that could be done before merging if time allows. There are a couple places we should follow up on for ctrl_bps tickets and some other longer term questions and comments.

@@ -0,0 +1,43 @@
operator: jdoe
Collaborator

Just comment for future: When we're past the prototyping period, if the wms plugins stay in ctrl_bps, I think this example should go in the ctrl_bps docs.


computing_queue: DOMA_LSST_GOOGLE_TEST
computing_queue_himem: DOMA_LSST_GOOGLE_TEST_HIMEM
himem_steps: ['makeWarp', 'assembleCoadd', 'deblend', 'measure', 'pipetaskInit']
Collaborator

Wasn't there a discussion about using requestMemory to determine which queue it should go to?
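One way the requestMemory idea could work, as a hedged sketch: the threshold, function name, and signature below are hypothetical, not from the PR or its discussion.

```python
HIMEM_THRESHOLD_MB = 16384  # hypothetical cutoff, not from the PR


def pick_queue(request_memory_mb, queue, queue_himem):
    """Route a job to the himem queue when its requestMemory exceeds
    a threshold, instead of matching against a hardcoded list of task
    labels like himem_steps.
    """
    if request_memory_mb > HIMEM_THRESHOLD_MB:
        return queue_himem
    return queue
```

Driving the choice from per-job memory requests would let new high-memory pipeline steps route correctly without editing the config.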

Tasks filled with parameters provided in workflow configuration and generated pipeline.
"""
tasks = []
raw_dependency_map = self.create_raw_jobs_dependency_map()
Collaborator

raw is an overloaded term in the LSST project. I was wondering why the code was assuming that the pipelines always started with raw images. If there's an equivalent term, it would be helpful to use it instead.

class IDDSWorkflowGenerator:
"""
Class generates a iDDS workflow to be submitted into PanDA. Workflow includes definition of each task and
definition of dependencies for each task input.
Collaborator

Is task PanDA's name for a compute job?

#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.

Collaborator

Missing module docstring

workflow.write(out_prefix)
return workflow

def convert_exec_string_to_hex(self, cmdline):
Collaborator

Why passing in self? And then why is this a class method?
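Since the method apparently uses no instance state, it could be a module-level function or a staticmethod. A standalone sketch — note the body is an assumption, since the diff doesn't show it; plain UTF-8 `bytes.hex()` encoding is used here purely for illustration:

```python
def convert_exec_string_to_hex(cmdline):
    """Hex-encode a command-line string.

    Sketch only: the real method's encoding isn't visible in the diff,
    so UTF-8 bytes followed by .hex() is assumed for illustration.
    """
    return cmdline.encode("utf-8").hex()
```

Inside the class it could equally be declared with `@staticmethod`, which documents that it never touches `self`.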

nodes_from_edges = set(list(dependency_map.keys()))
extra_nodes = [node for node in all_nodes if node not in nodes_from_edges]
for node in extra_nodes:
dependency_map.setdefault(node, [])
Collaborator

Does this need to be a setdefault()?
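Since extra_nodes is already filtered to names absent from the map, plain assignment behaves identically to setdefault() here; a small self-contained check with stand-in data:

```python
# Stand-in data; real node names come from the BPS workflow graph.
dependency_map = {"job_a": ["job_b"]}
all_nodes = ["job_a", "job_c"]

nodes_from_edges = set(dependency_map)  # set(list(...)) is redundant
extra_nodes = [node for node in all_nodes if node not in nodes_from_edges]

# Every extra node is absent from the map by construction, so plain
# assignment does the same thing as setdefault().
for node in extra_nodes:
    dependency_map[node] = []

assert dependency_map == {"job_a": ["job_b"], "job_c": []}
```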

tasks_dependency_map.setdefault(self.get_task_by_job_name(job), {})[file_name] = \
self.split_dependencies_by_tasks(dependency)
self.tasks_inputs.setdefault(self.define_task_name(
self.get_task_by_job_name(job)), []).append(file_local_src)
Collaborator

Lines would be shorter and easier to read if self.get_task_by_job_name(job) were pulled out and run once. I think this is possible, but I'm having problems understanding these two setdefault lines of code.
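A sketch of the suggested refactor with the repeated call hoisted into a local variable; the helper bodies below are stand-ins invented so the snippet runs on its own and do not reflect the real implementations:

```python
# Stand-ins for the real methods, just so the snippet is runnable.
def get_task_by_job_name(job):
    return job.rsplit("_", 1)[0]


def define_task_name(step):
    return f"task_{step}"


def split_dependencies_by_tasks(dependency):
    return dict(dependency)


tasks_dependency_map = {}
tasks_inputs = {}
job, file_name, file_local_src, dependency = (
    "calibrate_0042", "quantum.pickle", "/tmp/quantum.pickle", {})

# Hoisting the repeated get_task_by_job_name(job) call shortens both
# setdefault lines and makes clear they operate on the same task.
task = get_task_by_job_name(job)
tasks_dependency_map.setdefault(task, {})[file_name] = split_dependencies_by_tasks(dependency)
tasks_inputs.setdefault(define_task_name(task), []).append(file_local_src)
```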

target = ButlerURI(cloud_prefix + '/' + os.path.basename(src_path))
target.transfer_from(src, transfer="copy")

def write(self, out_prefix):
Collaborator

If not implementing, should this just be removed (after removing call from prepare)?

@SergeyPod SergeyPod merged commit f221100 into master Mar 1, 2021
@SergeyPod SergeyPod deleted the tickets/DM-27100 branch March 1, 2021 19:20
3 participants