DM-31900: Compute site changes for PanDA #47

Merged: 3 commits from tickets/DM-31900 into master on Sep 29, 2021

Conversation

SergeyPod (Contributor)

implemented ticket requirements, performed cleanup of code and configuration file

#command3: "${DAF_BUTLER_DIR}/bin/butler collection-chain {butlerConfig} {output} --mode=prepend {outCollection}"
mergePreCmdOpts: "{defaultPreCmdOpts}"
command1: "${DAF_BUTLER_DIR}/bin/butler {mergePreCmdOpts} transfer-datasets {executionButlerDir} {butlerConfig} --collections {outCollection}"
command2: "${DAF_BUTLER_DIR}/bin/butler {mergePreCmdOpts} collection-chain {butlerConfig} {output} --flatten --mode=extend {inCollection}"
Member:

I thought we were going to use an include file to pick up the execution butler configuration so that we didn't have to write down the commands separately for PanDA and for HTCondor? Maybe just overriding defaultPreCmdOpts?

Collaborator:

All of this executionButler section will become bps defaults in the next release (unless someone reports a problem that can't be fixed by then). It should just be an include, and if everyone on the IDF should be using it, the include should go in the common IDF submission YAML (hmm, I should go test whether "nested" includeConfigs work).

Contributor Author:

Needs further discussion.
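
For reference, a minimal sketch of the include-based approach suggested above. The file name execution_butler_defaults.yaml and its location are assumptions for illustration only; the executionButler command templates from the diff above would live in that shared file instead of in each submit yaml.

# Hypothetical include carrying the executionButler section (not a real file)
includeConfigs:
  - ${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/conf_example/execution_butler_defaults.yaml

# The submit yaml would then only override what actually differs, e.g.
executionButler:
  mergePreCmdOpts: "{defaultPreCmdOpts}"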

@MichelleGower (Collaborator) left a comment:

I think there needs to be a discussion about a shared IDF site yaml that has the defaults for the IDF. Under the assumption that the yaml works, the changes could be merged. However, I don't know that the spirit of the "config cleanup" was completely met.
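
For illustration, a rough sketch of what such a shared IDF site yaml might contain, gathering the PanDA/IDF settings that appear in this PR into one include (the file name idf_panda_site.yaml is made up):

# idf_panda_site.yaml (hypothetical): shared IDF/PanDA defaults, values taken
# from this PR rather than repeated in every user submit yaml
wmsServiceClass: lsst.ctrl.bps.wms.panda.panda_service.PanDAService
idds_server: "https://aipanda015.cern.ch:443/idds"
placeholderParams: ['qgraphNodeId', 'qgraphId']
fileDistributionEndPoint: "s3://butler-us-central1-panda-dev/hsc/{payload_folder}/{uniqProcName}/"
s3_endpoint_url: "https://storage.googleapis.com"
sw_image: "lsstsqre/centos:7-stack-lsst_distrib-w_2021_39"

# A user submit yaml would then reduce to something like:
# includeConfigs:
#   - idf_panda_site.yaml
# (this shared file could itself include the executionButler defaults,
# if nested includeConfigs turn out to work)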

@@ -1,13 +1,13 @@
includeConfigs:
- ${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/conf_example/pipelines_check_idf.yaml
- pipelines_check_idf.yaml
Collaborator:

Does this work? (Or are you assuming folks copy the files out of this directory into the same directory before submitting?)

Contributor Author:

I am assuming the files are copied for customization.

@@ -1,13 +1,13 @@
includeConfigs:
- ${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/conf_example/pipelines_check_idf.yaml
- pipelines_check_idf.yaml


#PANDA plugin specific settings:
idds_server: "https://aipanda015.cern.ch:443/idds"
placeholderParams: ['qgraphNodeId', 'qgraphId']
Collaborator:

Both of the above won't change, so they probably should not be in normal user-level submit yaml.



pipetask:
pipetaskInit:
# Notes: Declaring and chaining now happen within execution butler
# steps. So, this command no longer needs -o and must have
# --extend-run.
runQuantumCommand: "${CTRL_MPEXEC_DIR}/bin/pipetask --long-log run -b {butlerConfig} -i {inCollection} --output-run {outCollection} --init-only --register-dataset-types --qgraph {fileDistributionEndPoint}/{qgraphFile} --extend-run --clobber-outputs --no-versions"
runQuantumCommand: "${CTRL_MPEXEC_DIR}/bin/pipetask {initPreCmdOpts} run -b {butlerConfig} -i {inCollection} -o {output} --output-run {outCollection} --qgraph {fileDistributionEndPoint}/{qgraphFile} --qgraph-id {qgraphId} --qgraph-node-id {qgraphNodeId} --clobber-outputs --init-only --extend-run {extraInitOptions}"
Collaborator:

How did the discussion end about using the same runQuantumCommand as the default bps setting?

Contributor Author:

The {initPreCmdOpts} is included; the {fileDistributionEndPoint} will be removed in the future, and at that point the command becomes universal.
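
For clarity, a sketch of what the pipetaskInit command might reduce to once {fileDistributionEndPoint} goes away; this is an assumption based on the reply above, not a current default:

runQuantumCommand: "${CTRL_MPEXEC_DIR}/bin/pipetask {initPreCmdOpts} run -b {butlerConfig} -i {inCollection} -o {output} --output-run {outCollection} --qgraph {qgraphFile} --qgraph-id {qgraphId} --qgraph-node-id {qgraphNodeId} --clobber-outputs --init-only --extend-run {extraInitOptions}"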

requestCpus: 1

#this is a series of setup commands preceding the actual core SW execution
runner_command: 'docker run --network host --privileged --env AWS_ACCESS_KEY_ID=$(</credentials/AWS_ACCESS_KEY_ID) --env AWS_SECRET_ACCESS_KEY=$(</credentials/AWS_SECRET_ACCESS_KEY) --env PGPASSWORD=$(</credentials/PGPASSWORD) --env S3_ENDPOINT_URL=${S3_ENDPOINT_URL} {sw_image} /bin/bash -c "source /opt/lsst/software/stack/loadLSST.bash;cd /tmp;ls -a;setup lsst_distrib;pwd;python3 \${CTRL_BPS_DIR}/python/lsst/ctrl/bps/wms/panda/edgenode/cmd_line_decoder.py _cmd_line_ " >&2;'
wmsServiceClass: lsst.ctrl.bps.wms.panda.panda_service.PanDAService
Collaborator:

Again, these are PanDA- or IDF-specific and could go in a shared yaml.

@@ -160,12 +159,12 @@ def define_tasks(self):
== self.tasks_steps[task_name],
self.bps_workflow))
bps_node = self.bps_workflow.get_job(picked_job_name)
task.queue = bps_node.compute_site
task.queue = bps_node.queue
task.cloud = bps_node.compute_site
task.jobs_pseudo_inputs = list(jobs)
task.maxattempt = self.maxattempt
task.maxwalltime = self.maxwalltime
Collaborator:

BTW, GenericWorkflowJob has number_of_retries and request_walltime (those should have been fixed along with the queue change, but someone should check).

@@ -160,12 +159,12 @@ def define_tasks(self):
== self.tasks_steps[task_name],
self.bps_workflow))
bps_node = self.bps_workflow.get_job(picked_job_name)
task.queue = bps_node.compute_site
task.queue = bps_node.queue
Collaborator:

Just a note that this isn't using GenericWorkflowJob.concurrency_limit. The project should decide whether to use concurrency_limit and let the plugin decide whether that limit is implemented as a queue or some other mechanism, or whether we should switch to just queues and then let the plugin decide that a queue sometimes isn't a queue. Or, since this is a limited case, just let the plugins be different.

Contributor Author:

The plugin cannot easily propagate the concurrency limit into the queue configuration. Another question: the concurrency limit is specified at the submission level, yet it is expected to bind the whole payload active in the system, so should this setting be propagated into the queue configuration?

Collaborator:

The concurrency limit is a per GenericWorkflowJob setting. Realistically, it is currently set for all jobs with a particular label (or final job). The yaml allows it to be set for the entire workflow, but that doesn't really make sense. This concurrency limit is supposed to throttle those jobs across all workflows.

The current design assumes that if a WMS or site implements this concurrency limit via queues, the plugin would internally do the conversion. The plugin is allowed to have configuration that explains the conversion. For example, there could be yaml:

pandaConcurrencyLimits:
    db_limit: <panda queue name>   # underscore in key db_limit because that's what the current value is in the job.

The yaml and code can be more complex where needed (e.g., different queues for different compute sites, or maybe it isn't a direct translation to a queue name but must also use request_memory, etc.).

If folks want to change the design to either only allow queue-based implementations or have plugins translate queues to the other concurrency limit implementations, that's another direction we could go.
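
To make the pandaConcurrencyLimits fragment above concrete, a slightly fuller sketch of the kind of conversion yaml the comment describes; the nested structure, queue names, and site keys are invented for illustration:

# Hypothetical conversion from GenericWorkflowJob.concurrency_limit values to
# PanDA queues; every name below is a placeholder, not a real queue or site
pandaConcurrencyLimits:
    db_limit:
        queue: SOME_THROTTLED_PANDA_QUEUE
        # per-compute-site overrides, for the case where a single queue name
        # is not a direct translation
        computeSiteOverrides:
            SOME_SITE: SOME_OTHER_PANDA_QUEUE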


sw_image: "spodolsky/centos:7-stack-lsst_distrib-d_2021_09_06"
sw_image: "lsstsqre/centos:7-stack-lsst_distrib-w_2021_39"
fileDistributionEndPoint: "s3://butler-us-central1-panda-dev/hsc/{payload_folder}/{uniqProcName}/"
s3_endpoint_url: "https://storage.googleapis.com"
Collaborator:

Do sw_image, fileDistributionEndPoint, and s3_endpoint_url change often? If not, they could go in the central config as defaults.

Contributor Author:

fileDistributionEndPoint and s3_endpoint_url have stayed the same for a while; the sw_image could probably be expected to be updated together with software releases.

Collaborator:

Can the sw_image be defaulted by the plugin to one matching the submit version (still allowing the user to override it, but only if they need to)?

Contributor Author:

I think that's a very good idea.
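
A minimal sketch of the idea, assuming the plugin exposed some stack-tag variable to build a version-matched default ({stackTag} is hypothetical, not a real bps or PanDA plugin variable):

# Hypothetical plugin-side default: derive the image tag from the submit-side
# stack version instead of hard-coding it in every submit yaml
sw_image: "lsstsqre/centos:7-stack-lsst_distrib-{stackTag}"

# Users would then only set sw_image explicitly when they really need a
# different image, e.g.
# sw_image: "lsstsqre/centos:7-stack-lsst_distrib-w_2021_39"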

@@ -2,15 +2,12 @@ pipelineYaml: "${OBS_SUBARU_DIR}/pipelines/DRP.yaml#processCcd"

payload:
payloadName: pipelines_check
runInit: true
output: "u/{operator}/{payload_name}"
outCollection: "{output}/{timestamp}"
butlerConfig: s3://butler-us-central1-panda-dev/hsc/butler.yaml
inCollection: HSC/calib,HSC/raw/all,refcats
dataQuery: "tract = 9615 and patch=30 and detector IN (10..11) and instrument='HSC' and skymap='hsc_rings_v1' and band in ('r')"
Collaborator:

If we really wanted a minimal example, we could include the pipelines_check.yaml and only override the butlerConfig (if it doesn't go into a shared yaml).
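
For illustration, the kind of minimal submit yaml the comment describes, assuming a generic pipelines_check.yaml include is available and that butlerConfig has not already moved into a shared yaml:

# Hypothetical minimal submit yaml: pull in the generic pipelines_check
# settings and only override the IDF/PanDA-specific butler location
includeConfigs:
  - pipelines_check.yaml

payload:
  butlerConfig: s3://butler-us-central1-panda-dev/hsc/butler.yaml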

@timj changed the title from "implemented ticket requirements, performed cleanup of code and configuration file" to "DM-31900: Compute site changes for PanDA" on Sep 27, 2021

Removed debug artefact in configuration file
@SergeyPod merged commit 679a4d4 into master on Sep 29, 2021
@SergeyPod deleted the tickets/DM-31900 branch on September 29, 2021 01:40