Conversation

@TallJimbo (Member) commented Oct 7, 2025

Open questions:

  • Is there a way to set the number of CPUs/processes in one place? (See the sketch after this list.)
  • Did I even set requestCpus for finalJob in the right place?
  • Should the defaults be appropriate for a massive LSSTCam job, or something smaller?
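
A minimal sketch of one approach (the pattern suggested later in this thread): define a single user variable at the root level and reference it both for the WMS resource request and for the command's -j option. The variable name and values below are illustrative, not the merged defaults.

finalJobNumProcesses: 8

finalJob:
  requestCpus: "{finalJobNumProcesses}"
  command1: >-
    ${DAF_BUTLER_DIR}/bin/butler {finalPreCmdOpts} aggregate-graph
    {fileDistributionEndPoint}{qgraphFile}
    {butlerConfig}
    -j {finalJobNumProcesses}
    --register-dataset-types
    {updateOutputChain}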

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

codecov bot commented Oct 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.80%. Comparing base (c4e9aae) to head (7898789).
⚠️ Report is 4 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #214   +/-   ##
=======================================
  Coverage   89.80%   89.80%           
=======================================
  Files          51       51           
  Lines        5465     5465           
  Branches      539      539           
=======================================
  Hits         4908     4908           
  Misses        458      458           
  Partials       99       99           

concurrencyLimit: db_limit
finalPreCmdOpts: "{defaultPreCmdOpts}"
requestCpus: 32
requestMemory: 65536

@TallJimbo (Member, Author) commented:

These numbers are guesses based on my 1.6m-quanta stage1 tests; I used 32 cores and peak memory usage was about 47GB (so I rounded up to 64GB).
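
Assuming requestMemory here is in megabytes (which the quoted value suggests), rounding the ~47 GB peak up to 64 GB corresponds to

  64 \times 1024 = 65536\ \mathrm{MB},

which matches the requestMemory default above.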

Collaborator commented:

Because compute resources seem to be in high demand, is 1.6M quanta the average for a non-production user? I suspect it will be easier for CM Service to increase these defaults than to get development runs to decrease them.

Collaborator commented:

Later, if helpful, we could think about some way to use information about the QuantumGraph to figure out these resource values (either in ctrl_bps itself, or by having ctrl_bps call some function if it fits better elsewhere).

@TallJimbo (Member, Author) commented:

I've reset the defaults to 8 cores and 16G.
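
In config terms that presumably corresponds to the following, assuming the same megabyte convention as above (values inferred, not quoted from the diff):

requestCpus: 8
requestMemory: 16384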

@MichelleGower (Collaborator) left a comment:

I will run more tests after the fixes are pushed.

-g {fileDistributionEndPoint}{qgraphFile}
-j 48
--register-dataset-types
{updateOutputChain}
Collaborator commented:

The arguments don't match the current implementation. Also, the value for -j doesn't match the value given for requestCpus. Should it?

I would add a couple more variables to make changes easier and to simplify running with mocks. If different names make more sense, please change them.

finalJobNumProcesses: 32 
extraAggregateOptions: ""
finalJob:
  retryUnlessExit: 2
  updateOutputChain: "--update-output-chain"
  whenSetup: "NEVER"
  whenRun: "ALWAYS"
  # Added for future flexibility, e.g., if prefer workflow instead of shell
  # script.
  implementation: JOB
  concurrencyLimit: db_limit
  finalPreCmdOpts: "{defaultPreCmdOpts}"
  requestCpus: "{finalJobNumProcesses}"
  requestMemory: 65536
  command1: >-
    ${DAF_BUTLER_DIR}/bin/butler {finalPreCmdOpts} aggregate-graph
    {fileDistributionEndPoint}{qgraphFile}
    {butlerConfig} 
    -j {finalJobNumProcesses} 
    --register-dataset-types
    {updateOutputChain}
    {extraAggregateOptions}

So on my smaller workstation running a ci_middleware workflow with mocked failures, I needed to add the following to my submit yaml:

finalJobNumProcesses: 10
extraAggregateOptions: "--mock-storage-classes"

P.S. I wouldn't object to these going inside the finalJob section like finalPreCmdOpts. But looking forward to having a side aggregate job, setting the values once at the root level, so they could be used by both the aggregate and final jobs, would be ideal if they would generally match.

@TallJimbo (Member, Author) commented:

I've adopted your recommended configuration pretty much as-is (I just revised the default resource usage lower, per the other thread). I didn't realize one could define a new BPS variable implicitly like this.

As for using the finalJob section for the new variables, at this point I'm not really anticipating adding a side aggregate, but I don't think we'll know for sure until CM starts using this in production. I am still hoping to support running it manually on the side, much as one can already do with the transfer-from-graph finalJob. Should we just move the new variables into that section, then?
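
For concreteness, a sketch of what moving them into the section might look like (illustrative only; whether the nested {finalJobNumProcesses} reference resolves the same way from inside the finalJob section would need checking):

finalJob:
  finalJobNumProcesses: 8
  extraAggregateOptions: ""
  requestCpus: "{finalJobNumProcesses}"
  # command1 as in the proposal above, still using -j {finalJobNumProcesses}
  # and {extraAggregateOptions}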

{butlerConfig}
-g {fileDistributionEndPoint}{qgraphFile}
--register-dataset-types
{updateOutputChain}
Collaborator commented:

See discussion below with the bps_defaults.yaml.

@MichelleGower (Collaborator) commented:

I think this should have a doc/changes file.

@TallJimbo force-pushed the tickets/DM-52360 branch 2 times, most recently from d5e6df2 to 78eead1, on October 17, 2025 at 15:28.
@TallJimbo marked this pull request as ready for review on October 17, 2025 at 17:20.
@MichelleGower (Collaborator) left a comment:

I ran a PanDA test and an HTCondor test at SLAC with finalJobNumProcesses replaced by just requestCpus, to avoid the ctrl_bps bug of passing request_cpus as a string to the WMS plugins when nested variables are used. Merge approved.
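
One possible reading of that workaround, sketched with illustrative values (the merged defaults may differ):

finalJob:
  # A literal integer here avoids the nested-variable expansion that
  # stringifies request_cpus; -j reuses the same setting.
  requestCpus: 8
  command1: >-
    ${DAF_BUTLER_DIR}/bin/butler {finalPreCmdOpts} aggregate-graph
    {fileDistributionEndPoint}{qgraphFile}
    {butlerConfig}
    -j {requestCpus}
    --register-dataset-types
    {updateOutputChain}
    {extraAggregateOptions}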

@TallJimbo merged commit 4123807 into main on Oct 22, 2025; 21 checks passed.
@TallJimbo deleted the tickets/DM-52360 branch on October 22, 2025 at 00:55.