Conversation

@TallJimbo (Member) commented Oct 7, 2025

Open questions:

  • Is there a way to set the number of CPUs/processes in one place? (See the sketch after this list.)
  • Did I even set requestCpus for finalJob in the right place?
  • Should the defaults be appropriate for a massive LSSTCam job, or something smaller?
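
A minimal sketch of one approach (the pattern suggested later in this thread): define a single user variable at the root level and reference it both for the WMS resource request and for the command's -j option. The variable name and values below are illustrative, not the merged defaults.

finalJobNumProcesses: 8

finalJob:
  requestCpus: "{finalJobNumProcesses}"
  command1: >-
    ${DAF_BUTLER_DIR}/bin/butler {finalPreCmdOpts} aggregate-graph
    {fileDistributionEndPoint}{qgraphFile}
    {butlerConfig}
    -j {finalJobNumProcesses}
    --register-dataset-types
    {updateOutputChain}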

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

codecov bot commented Oct 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.80%. Comparing base (c4e9aae) to head (7898789).
⚠️ Report is 4 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #214   +/-   ##
=======================================
  Coverage   89.80%   89.80%           
=======================================
  Files          51       51           
  Lines        5465     5465           
  Branches      539      539           
=======================================
  Hits         4908     4908           
  Misses        458      458           
  Partials       99       99           

concurrencyLimit: db_limit
finalPreCmdOpts: "{defaultPreCmdOpts}"
requestCpus: 32
requestMemory: 65536

@TallJimbo (Member, Author) commented:

These numbers are guesses based on my 1.6m-quanta stage1 tests; I used 32 cores and peak memory usage was about 47GB (so I rounded up to 64GB).
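
Assuming requestMemory here is in megabytes (which the quoted value suggests), rounding the ~47 GB peak up to 64 GB corresponds to

  64 \times 1024 = 65536\ \mathrm{MB},

which matches the requestMemory default above.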

Collaborator commented:

Because compute resources seem to be in high demand, is 1.6M quanta the average for a non-production user? I suspect it will be easier for CM Service to increase these defaults than to get development runs to decrease them.

Collaborator commented:

Later, if helpful, we could think about some way to use information about the QuantumGraph to figure out these resource values (either in ctrl_bps itself, or by having ctrl_bps call some function if it fits better elsewhere).

@TallJimbo (Member, Author) commented:

I've reset the defaults to 8 cores and 16G.
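
In config terms that presumably corresponds to the following, assuming the same megabyte convention as above (values inferred, not quoted from the diff):

requestCpus: 8
requestMemory: 16384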

@MichelleGower (Collaborator) left a comment:

I will run more tests after the fixes are pushed.

-g {fileDistributionEndPoint}{qgraphFile}
-j 48
--register-dataset-types
{updateOutputChain}
Collaborator commented:

The arguments don't match the current implementation. Also, the value for -j doesn't match the value given for requestCpus. Should it?

I would add a couple more variables to make changes easier and to simplify running with mocks. If different names make more sense, please change them.

finalJobNumProcesses: 32 
extraAggregateOptions: ""
finalJob:
  retryUnlessExit: 2
  updateOutputChain: "--update-output-chain"
  whenSetup: "NEVER"
  whenRun: "ALWAYS"
  # Added for future flexibility, e.g., if prefer workflow instead of shell
  # script.
  implementation: JOB
  concurrencyLimit: db_limit
  finalPreCmdOpts: "{defaultPreCmdOpts}"
  requestCpus: "{finalJobNumProcesses}"
  requestMemory: 65536
  command1: >-
    ${DAF_BUTLER_DIR}/bin/butler {finalPreCmdOpts} aggregate-graph
    {fileDistributionEndPoint}{qgraphFile}
    {butlerConfig} 
    -j {finalJobNumProcesses} 
    --register-dataset-types
    {updateOutputChain}
    {extraAggregateOptions}

So on my smaller workstation running a ci_middleware workflow with mocked failures, I needed to add the following to my submit yaml:

finalJobNumProcesses: 10
extraAggregateOptions: "--mock-storage-classes"

P.S. I wouldn't object to these going inside the finalJob section like finalPreCmdOpts. But looking forward to having a side aggregate job, setting the values once at the root level, so they could be used by both the aggregate and final jobs, would be ideal if they would generally match.

@TallJimbo (Member, Author) commented:

I've adopted your recommended configuration pretty much as-is (I just revised the default resource usage lower, per the other thread). I didn't realize one could define a new BPS variable implicitly like this.

As for using the finalJob section for the new variables, at this point I'm not really anticipating adding a side aggregate, but I don't think we'll know for sure until CM starts using this in production. I am still hoping to support running it manually on the side, much as one can already do with the transfer-from-graph finalJob. Should we just move the new variables into that section, then?
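
For concreteness, a sketch of what moving them into the section might look like (illustrative only; whether the nested {finalJobNumProcesses} reference resolves the same way from inside the finalJob section would need checking):

finalJob:
  finalJobNumProcesses: 8
  extraAggregateOptions: ""
  requestCpus: "{finalJobNumProcesses}"
  # command1 as in the proposal above, still using -j {finalJobNumProcesses}
  # and {extraAggregateOptions}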

{butlerConfig}
-g {fileDistributionEndPoint}{qgraphFile}
--register-dataset-types
{updateOutputChain}
Collaborator commented:

See discussion below with the bps_defaults.yaml.

@MichelleGower (Collaborator) commented:

I think this should have a doc/changes file.

@TallJimbo force-pushed the tickets/DM-52360 branch 2 times, most recently from d5e6df2 to 78eead1, on October 17, 2025 at 15:28.
@TallJimbo marked this pull request as ready for review on October 17, 2025 at 17:20.
@MichelleGower (Collaborator) left a comment:

I ran a PanDA test and an HTCondor test at SLAC with finalJobNumProcesses replaced by just requestCpus, to avoid the ctrl_bps bug of passing request_cpus as a string to the WMS plugins when nested variables are used. Merge approved.
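
One possible reading of that workaround, sketched with illustrative values (the merged defaults may differ):

finalJob:
  # A literal integer here avoids the nested-variable expansion that
  # stringifies request_cpus; -j reuses the same setting.
  requestCpus: 8
  command1: >-
    ${DAF_BUTLER_DIR}/bin/butler {finalPreCmdOpts} aggregate-graph
    {fileDistributionEndPoint}{qgraphFile}
    {butlerConfig}
    -j {requestCpus}
    --register-dataset-types
    {updateOutputChain}
    {extraAggregateOptions}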

@TallJimbo merged commit 4123807 into main on Oct 22, 2025; 21 checks passed.
@TallJimbo deleted the tickets/DM-52360 branch on October 22, 2025 at 00:55.