DM-31900: Compute site changes for PanDA #47

Merged: 3 commits from tickets/DM-31900 into master on Sep 29, 2021

Conversation

SergeyPod (Contributor)

implemented ticket requirements, performed cleanup of code and configuration file

#command3: "${DAF_BUTLER_DIR}/bin/butler collection-chain {butlerConfig} {output} --mode=prepend {outCollection}"
mergePreCmdOpts: "{defaultPreCmdOpts}"
command1: "${DAF_BUTLER_DIR}/bin/butler {mergePreCmdOpts} transfer-datasets {executionButlerDir} {butlerConfig} --collections {outCollection}"
command2: "${DAF_BUTLER_DIR}/bin/butler {mergePreCmdOpts} collection-chain {butlerConfig} {output} --flatten --mode=extend {inCollection}"
Member:

I thought we were going to use an include file to pick up the execution butler configuration so that we didn't have to write down the commands separately for PanDA and for HTCondor? Maybe just overriding defaultPreCmdOpts?

Collaborator:

All of this executionButler section will become bps defaults in the next release (unless someone reports a problem that can't be fixed by then). It should just be an include, and if everyone on the IDF should be using it, the include should go in the common IDF submission YAML (hmm, I should go test whether "nested" includeConfigs work).

Contributor Author:

Needs further discussion.
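
For reference, a minimal sketch of the include-based approach suggested above. The file name execution_butler_defaults.yaml and its location are assumptions for illustration only; the executionButler command templates from the diff above would live in that shared file instead of in each submit yaml.

# Hypothetical include carrying the executionButler section (not a real file)
includeConfigs:
  - ${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/conf_example/execution_butler_defaults.yaml

# The submit yaml would then only override what actually differs, e.g.
executionButler:
  mergePreCmdOpts: "{defaultPreCmdOpts}"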

@MichelleGower (Collaborator) left a comment:

I think there needs to be a discussion about a shared IDF site yaml that has the defaults for the IDF. Under the assumption that the yaml works, the changes could be merged. However, I don't know that the spirit of the "config cleanup" was completely met.
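
For illustration, a rough sketch of what such a shared IDF site yaml might contain, gathering the PanDA/IDF settings that appear in this PR into one include (the file name idf_panda_site.yaml is made up):

# idf_panda_site.yaml (hypothetical): shared IDF/PanDA defaults, values taken
# from this PR rather than repeated in every user submit yaml
wmsServiceClass: lsst.ctrl.bps.wms.panda.panda_service.PanDAService
idds_server: "https://aipanda015.cern.ch:443/idds"
placeholderParams: ['qgraphNodeId', 'qgraphId']
fileDistributionEndPoint: "s3://butler-us-central1-panda-dev/hsc/{payload_folder}/{uniqProcName}/"
s3_endpoint_url: "https://storage.googleapis.com"
sw_image: "lsstsqre/centos:7-stack-lsst_distrib-w_2021_39"

# A user submit yaml would then reduce to something like:
# includeConfigs:
#   - idf_panda_site.yaml
# (this shared file could itself include the executionButler defaults,
# if nested includeConfigs turn out to work)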

@@ -1,13 +1,13 @@
includeConfigs:
- ${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/conf_example/pipelines_check_idf.yaml
- pipelines_check_idf.yaml
Collaborator:

Does this work? (Or are you assuming folks copy the files out of this directory into the same directory before submitting?)

Contributor Author:

I am assuming the files are copied for customization.

@@ -1,13 +1,13 @@
includeConfigs:
- ${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/conf_example/pipelines_check_idf.yaml
- pipelines_check_idf.yaml


#PANDA plugin specific settings:
idds_server: "https://aipanda015.cern.ch:443/idds"
placeholderParams: ['qgraphNodeId', 'qgraphId']
Collaborator:

Both of the above won't change, so they probably should not be in normal user-level submit yaml.



pipetask:
pipetaskInit:
# Notes: Declaring and chaining now happen within execution butler
# steps. So, this command no longer needs -o and must have
# --extend-run.
runQuantumCommand: "${CTRL_MPEXEC_DIR}/bin/pipetask --long-log run -b {butlerConfig} -i {inCollection} --output-run {outCollection} --init-only --register-dataset-types --qgraph {fileDistributionEndPoint}/{qgraphFile} --extend-run --clobber-outputs --no-versions"
runQuantumCommand: "${CTRL_MPEXEC_DIR}/bin/pipetask {initPreCmdOpts} run -b {butlerConfig} -i {inCollection} -o {output} --output-run {outCollection} --qgraph {fileDistributionEndPoint}/{qgraphFile} --qgraph-id {qgraphId} --qgraph-node-id {qgraphNodeId} --clobber-outputs --init-only --extend-run {extraInitOptions}"
Collaborator:

How did the discussion end about using the same runQuantumCommand as the default bps setting?

Contributor Author:

The {initPreCmdOpts} is included; the {fileDistributionEndPoint} will be removed in the future, and at that point the command becomes universal.
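
For clarity, a sketch of what the pipetaskInit command might reduce to once {fileDistributionEndPoint} goes away; this is an assumption based on the reply above, not a current default:

runQuantumCommand: "${CTRL_MPEXEC_DIR}/bin/pipetask {initPreCmdOpts} run -b {butlerConfig} -i {inCollection} -o {output} --output-run {outCollection} --qgraph {qgraphFile} --qgraph-id {qgraphId} --qgraph-node-id {qgraphNodeId} --clobber-outputs --init-only --extend-run {extraInitOptions}"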

requestCpus: 1

#this is a series of setup commands preceding the actual core SW execution
runner_command: 'docker run --network host --privileged --env AWS_ACCESS_KEY_ID=$(</credentials/AWS_ACCESS_KEY_ID) --env AWS_SECRET_ACCESS_KEY=$(</credentials/AWS_SECRET_ACCESS_KEY) --env PGPASSWORD=$(</credentials/PGPASSWORD) --env S3_ENDPOINT_URL=${S3_ENDPOINT_URL} {sw_image} /bin/bash -c "source /opt/lsst/software/stack/loadLSST.bash;cd /tmp;ls -a;setup lsst_distrib;pwd;python3 \${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/edgenode/cmd_line_decoder.py _cmd_line_ " >&2;'
wmsServiceClass: lsst.ctrl.bps.wms.panda.panda_service.PanDAService
Collaborator:

Again, these are PanDA- or IDF-specific and could go in a shared yaml.

@@ -160,12 +159,12 @@ def define_tasks(self):
== self.tasks_steps[task_name],
self.bps_workflow))
bps_node = self.bps_workflow.get_job(picked_job_name)
task.queue = bps_node.compute_site
task.queue = bps_node.queue
task.cloud = bps_node.compute_site
task.jobs_pseudo_inputs = list(jobs)
task.maxattempt = self.maxattempt
task.maxwalltime = self.maxwalltime
Collaborator:

BTW, GenericWorkflowJob has number_of_retries and request_walltime (those should have been fixed along with the queue change, but someone should check).

@@ -160,12 +159,12 @@ def define_tasks(self):
== self.tasks_steps[task_name],
self.bps_workflow))
bps_node = self.bps_workflow.get_job(picked_job_name)
task.queue = bps_node.compute_site
task.queue = bps_node.queue
Collaborator:

Just a note that this isn't using GenericWorkflowJob.concurrency_limit. The project should decide whether to use concurrency_limit and let the plugin decide whether that limit is implemented as a queue or some other mechanism, or whether we should switch to just queues and then let the plugin decide that a queue sometimes isn't a queue. Or, since this is a limited case, just let the plugins be different.

Contributor Author:

The plugin cannot easily propagate the concurrency limit into the queue configuration. Another question: the concurrency limit is specified at the submission level, yet it is expected to bind the whole payload active in the system, so should this setting be propagated into the queue configuration?

Collaborator:

The concurrency limit is a per GenericWorkflowJob setting. Realistically, it is currently set for all jobs with a particular label (or final job). The yaml allows it to be set for the entire workflow, but that doesn't really make sense. This concurrency limit is supposed to throttle those jobs across all workflows.

The current design assumes that if a WMS or site implements this concurrency limit via queues, the plugin would internally do the conversion. The plugin is allowed to have configuration that explains the conversion. For example, there could be yaml:

pandaConcurrencyLimits:
    db_limit: <panda queue name>   # underscore in key db_limit because that's what the current value is in the job.

The yaml and code can be more complex where needed (e.g., different queues for different compute sites, or maybe it isn't a direct translation to a queue name but must also use request_memory, etc.).

If folks want to change the design to either only allow queue-based implementations or have plugins translate queues to the other concurrency limit implementations, that's another direction we could go.
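
To make the pandaConcurrencyLimits fragment above concrete, a slightly fuller sketch of the kind of conversion yaml the comment describes; the nested structure, queue names, and site keys are invented for illustration:

# Hypothetical conversion from GenericWorkflowJob.concurrency_limit values to
# PanDA queues; every name below is a placeholder, not a real queue or site
pandaConcurrencyLimits:
    db_limit:
        queue: SOME_THROTTLED_PANDA_QUEUE
        # per-compute-site overrides, for the case where a single queue name
        # is not a direct translation
        computeSiteOverrides:
            SOME_SITE: SOME_OTHER_PANDA_QUEUE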


sw_image: "spodolsky/centos:7-stack-lsst_distrib-d_2021_09_06"
sw_image: "lsstsqre/centos:7-stack-lsst_distrib-w_2021_39"
fileDistributionEndPoint: "s3://butler-us-central1-panda-dev/hsc/{payload_folder}/{uniqProcName}/"
s3_endpoint_url: "https://storage.googleapis.com"
Collaborator:

Do sw_image, fileDistributionEndPoint, and s3_endpoint_url change often? If not, they could go in the central config as defaults.

Contributor Author:

fileDistributionEndPoint and s3_endpoint_url have stayed the same for a while; the sw_image could probably be expected to be updated together with software releases.

Collaborator:

Can the sw_image be defaulted by the plugin to one matching the submit version (still allowing the user to override it, but only if they need to)?

Contributor Author:

I think that's a very good idea.
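
A minimal sketch of the idea, assuming the plugin exposed some stack-tag variable to build a version-matched default ({stackTag} is hypothetical, not a real bps or PanDA plugin variable):

# Hypothetical plugin-side default: derive the image tag from the submit-side
# stack version instead of hard-coding it in every submit yaml
sw_image: "lsstsqre/centos:7-stack-lsst_distrib-{stackTag}"

# Users would then only set sw_image explicitly when they really need a
# different image, e.g.
# sw_image: "lsstsqre/centos:7-stack-lsst_distrib-w_2021_39"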

@@ -2,15 +2,12 @@ pipelineYaml: "${OBS_SUBARU_DIR}/pipelines/DRP.yaml#processCcd"

payload:
payloadName: pipelines_check
runInit: true
output: "u/{operator}/{payload_name}"
outCollection: "{output}/{timestamp}"
butlerConfig: s3://butler-us-central1-panda-dev/hsc/butler.yaml
inCollection: HSC/calib,HSC/raw/all,refcats
dataQuery: "tract = 9615 and patch=30 and detector IN (10..11) and instrument='HSC' and skymap='hsc_rings_v1' and band in ('r')"
Collaborator:

If we really wanted a minimal example, we could include the pipelines_check.yaml and only override the butlerConfig (if it doesn't go into a shared yaml).
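
For illustration, the kind of minimal submit yaml the comment describes, assuming a generic pipelines_check.yaml include is available and that butlerConfig has not already moved into a shared yaml:

# Hypothetical minimal submit yaml: pull in the generic pipelines_check
# settings and only override the IDF/PanDA-specific butler location
includeConfigs:
  - pipelines_check.yaml

payload:
  butlerConfig: s3://butler-us-central1-panda-dev/hsc/butler.yaml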

@timj changed the title from "implemented ticket requirements, performed cleanup of code and configuration file" to "DM-31900: Compute site changes for PanDA" on Sep 27, 2021

Removed debug artefact in configuration file
@SergeyPod merged commit 679a4d4 into master on Sep 29, 2021
@SergeyPod deleted the tickets/DM-31900 branch on September 29, 2021 01:40