DM-31900: Compute site changes for PanDA #47
Conversation
#command3: "${DAF_BUTLER_DIR}/bin/butler collection-chain {butlerConfig} {output} --mode=prepend {outCollection}"
mergePreCmdOpts: "{defaultPreCmdOpts}"
command1: "${DAF_BUTLER_DIR}/bin/butler {mergePreCmdOpts} transfer-datasets {executionButlerDir} {butlerConfig} --collections {outCollection}"
command2: "${DAF_BUTLER_DIR}/bin/butler {mergePreCmdOpts} collection-chain {butlerConfig} {output} --flatten --mode=extend {inCollection}"
I thought we were going to use an include file to pick up the execution butler configuration so that we didn't have to write down the commands for panda and separately for condor? Maybe just over-riding defaultPreCmdOpts?
All of this executionButler section will become bps defaults in this next release (unless someone reports a problem that can't be fixed by then). It should just be an include (and if everyone on the IDF should be using it the include should go in the common IDF submission YAML - hmmm, I should go test whether "nested" includeConfigs work)
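A sketch of what that include layout might look like (file names here are hypothetical, and nested includeConfigs still needs testing): the execution butler commands live in one shared file, a common IDF submission yaml includes it, and user submit yamls include only the common file:

```yaml
# execution_butler_defaults.yaml (hypothetical shared file with the
# mergePreCmdOpts/command1/command2 settings quoted above)

# idf_common.yaml (hypothetical common IDF submission yaml)
includeConfigs:
  - execution_butler_defaults.yaml   # nested include; needs testing

# user submit yaml
includeConfigs:
  - idf_common.yaml
```

This way the commands are written down once instead of separately for the panda and condor configs.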
needs further discussion
I think there needs to be a discussion about a shared IDF site yaml that holds the defaults for the IDF. Under the assumption that the yaml works, the changes could be merged. However, I don't know that the spirit of the "config cleanup" was completely met.
@@ -1,13 +1,13 @@
includeConfigs:
-  - ${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/conf_example/pipelines_check_idf.yaml
+  - pipelines_check_idf.yaml
Does this work? (Or are you assuming folks copy the files out of this directory into the same directory before submitting?)
I am assuming files are copied for customization.
#PANDA plugin specific settings:
idds_server: "https://aipanda015.cern.ch:443/idds"
placeholderParams: ['qgraphNodeId', 'qgraphId']
Neither of the above will change, so they probably should not be in a normal user-level submit yaml.
pipetask:
  pipetaskInit:
    # Notes: Declaring and chaining now happen within execution butler
    # steps. So, this command no longer needs -o and must have
    # --extend-run.
-    runQuantumCommand: "${CTRL_MPEXEC_DIR}/bin/pipetask --long-log run -b {butlerConfig} -i {inCollection} --output-run {outCollection} --init-only --register-dataset-types --qgraph {fileDistributionEndPoint}/{qgraphFile} --extend-run --clobber-outputs --no-versions"
+    runQuantumCommand: "${CTRL_MPEXEC_DIR}/bin/pipetask {initPreCmdOpts} run -b {butlerConfig} -i {inCollection} -o {output} --output-run {outCollection} --qgraph {fileDistributionEndPoint}/{qgraphFile} --qgraph-id {qgraphId} --qgraph-node-id {qgraphNodeId} --clobber-outputs --init-only --extend-run {extraInitOptions}"
How did the discussion end about using the same runQuantumCommand as the default bps setting?
The {initPreCmdOpts} is included; the {fileDistributionEndPoint} will be removed in the future, and at that point the command becomes universal.
requestCpus: 1
#this is a series of setup commands preceding the actual core SW execution
runner_command: 'docker run --network host --privileged --env AWS_ACCESS_KEY_ID=$(</credentials/AWS_ACCESS_KEY_ID) --env AWS_SECRET_ACCESS_KEY=$(</credentials/AWS_SECRET_ACCESS_KEY) --env PGPASSWORD=$(</credentials/PGPASSWORD) --env S3_ENDPOINT_URL=${S3_ENDPOINT_URL} {sw_image} /bin/bash -c "source /opt/lsst/software/stack/loadLSST.bash;cd /tmp;ls -a;setup lsst_distrib;pwd;python3 \${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/edgenode/cmd_line_decoder.py _cmd_line_ " >&2;'
wmsServiceClass: lsst.ctrl.bps.wms.panda.panda_service.PanDAService
Again these are PanDA or IDF specific and could be in a shared yaml.
@@ -160,12 +159,12 @@ def define_tasks(self):
                 == self.tasks_steps[task_name],
                 self.bps_workflow))
             bps_node = self.bps_workflow.get_job(picked_job_name)
-            task.queue = bps_node.compute_site
+            task.queue = bps_node.queue
+            task.cloud = bps_node.compute_site
             task.jobs_pseudo_inputs = list(jobs)
             task.maxattempt = self.maxattempt
             task.maxwalltime = self.maxwalltime
BTW, GenericWorkflowJob has number_of_retries and request_walltime (that should have been fixed with queue but someone should check).
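A minimal sketch of what that check might look like: carry the per-job settings over from the GenericWorkflowJob instead of the plugin-wide values, falling back to defaults when the job does not set them. The attribute names `number_of_retries` and `request_walltime` come from the comment above; the `task` object and default values here are hypothetical.

```python
from types import SimpleNamespace


def fill_task_limits(task, bps_node, default_attempts=3, default_walltime=90000):
    # Prefer the per-job settings from the GenericWorkflowJob; fall back
    # to plugin-wide defaults when the job leaves them unset (None).
    task.queue = bps_node.queue
    task.cloud = bps_node.compute_site
    task.maxattempt = bps_node.number_of_retries or default_attempts
    task.maxwalltime = bps_node.request_walltime or default_walltime
    return task


# Hypothetical job: retries set per-job, walltime left to the default.
node = SimpleNamespace(queue="DOMA_LSST_GOOGLE_TEST", compute_site="LSST_GOOGLE",
                       number_of_retries=5, request_walltime=None)
t = fill_task_limits(SimpleNamespace(), node)
```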
Just a note that this isn't using the GenericWorkflowJob.concurrency_limit. The project should decide whether to use concurrency_limit and let the plugin decide if that limit is implemented as a queue or some other mechanism or if we should switch to just queues and then the plugin decide that a queue sometimes isn't a queue. Or since this is a limited case just let the plugins be different.
The plugin cannot easily propagate the concurrency limit into the queue configuration. Another question: is the concurrency limit specified per submission but expected to bind the whole payload active in the system, so that this setting should be propagated into the queue configuration?
The concurrency limit is a per GenericWorkflowJob setting. Realistically, it is currently set for all jobs with a particular label (or final job). The yaml allows it to be set for the entire workflow, but that doesn't really make sense. This concurrency limit is supposed to throttle those jobs across all workflows.
The current design assumes that if a WMS or site is implementing this concurrency limit via queues that the plugin would internally do the conversion. The plugin is allowed to have configuration that explains the conversion. For example, there could be yaml:
pandaConcurrencyLimits:
  db_limit: <panda queue name>  # underscore in key db_limit because that's what the current value is in the job
The yaml and code can be more complex where needed (e.g., different queues for different compute sites; maybe it isn't a direct translation to a queue name but must also use request_memory, etc.)
If folks want to change the design to either only allow queue-based implementations or have plugins translate queues to the other concurrency limit implementations, that's another direction we could go.
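Under the current design, the plugin-side conversion could be as simple as a lookup from the job's concurrency-limit labels into the plugin's configured queue names. A sketch of that idea (the queue name and error handling are hypothetical, not actual ctrl_bps code):

```python
def queues_for_job(job_label, concurrency_limits, panda_limit_map):
    """Translate WMS-agnostic concurrency-limit labels into PanDA queue names.

    concurrency_limits: limit labels on the GenericWorkflowJob, e.g. ["db_limit"]
    panda_limit_map: plugin config such as the pandaConcurrencyLimits yaml above
    """
    queues = []
    for limit in concurrency_limits:
        queue = panda_limit_map.get(limit)
        if queue is None:
            # The limit was requested but the plugin has no queue implementing it.
            raise ValueError(f"no PanDA queue configured for limit {limit!r} "
                             f"(job {job_label})")
        queues.append(queue)
    return queues


# Hypothetical plugin config mapping the db_limit label to a throttled queue.
limits_config = {"db_limit": "PANDA_DB_LIMITED_QUEUE"}
```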
-sw_image: "spodolsky/centos:7-stack-lsst_distrib-d_2021_09_06"
+sw_image: "lsstsqre/centos:7-stack-lsst_distrib-w_2021_39"
fileDistributionEndPoint: "s3://butler-us-central1-panda-dev/hsc/{payload_folder}/{uniqProcName}/"
s3_endpoint_url: "https://storage.googleapis.com"
Do sw_image, fileDistributionEndPoint, and s3_endpoint_url change often? If not, they could go in the central config as defaults.
fileDistributionEndPoint and s3_endpoint_url have been the same for a while; the sw_image could be expected to be updated together with software releases.
Can the sw_image be defaulted by the plugin to one matching the submit version (still allow user to override it but only if they need to)?
I think that's a very good idea.
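One way the plugin might implement that default (a sketch under assumptions: the function name and config shape are hypothetical, and the stack version string, e.g. "w_2021_39", would come from the submit-side runtime environment):

```python
def resolve_sw_image(config, stack_version):
    # Honor an explicit sw_image from the submit yaml; otherwise default to
    # the container image matching the stack version used at submission,
    # following the naming pattern of the images in this config.
    return (config.get("sw_image")
            or f"lsstsqre/centos:7-stack-lsst_distrib-{stack_version}")
```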
@@ -2,15 +2,12 @@ pipelineYaml: "${OBS_SUBARU_DIR}/pipelines/DRP.yaml#processCcd"

payload:
  payloadName: pipelines_check
  runInit: true
  output: "u/{operator}/{payload_name}"
  outCollection: "{output}/{timestamp}"
  butlerConfig: s3://butler-us-central1-panda-dev/hsc/butler.yaml
  inCollection: HSC/calib,HSC/raw/all,refcats
  dataQuery: "tract = 9615 and patch=30 and detector IN (10..11) and instrument='HSC' and skymap='hsc_rings_v1' and band in ('r')"
If we really wanted a minimal example, it could include pipelines_check.yaml and only override the butlerConfig (if that doesn't go into a shared yaml).
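That minimal submit yaml might look like the sketch below, assuming pipelines_check.yaml carries the rest of the payload settings:

```yaml
includeConfigs:
  - pipelines_check.yaml

payload:
  butlerConfig: s3://butler-us-central1-panda-dev/hsc/butler.yaml
```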