DM-52360: Switch the default finalJob to aggregate-graph. #214
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main     #214   +/-   ##
=======================================
  Coverage   89.80%   89.80%
=======================================
  Files          51       51
  Lines        5465     5465
  Branches      539      539
=======================================
  Hits         4908     4908
  Misses        458      458
  Partials       99       99
Force-pushed from 6e594b5 to 05a0c45.
concurrencyLimit: db_limit
finalPreCmdOpts: "{defaultPreCmdOpts}"
requestCpus: 32
requestMemory: 65536
These numbers are guesses based on my 1.6m-quanta stage1 tests; I used 32 cores and peak memory usage was about 47GB (so I rounded up to 64GB).
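(For scale, a hedged example of how a submitter could raise these per run in their own submit YAML for an even larger graph; the numbers are illustrative only, and requestMemory is in MB, matching 65536 = 64 GB above.)

finalJob:
  requestCpus: 32
  requestMemory: 98304  # illustrative ~96 GB, i.e. extra headroom over the ~47 GB peak seen in the stage1 test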
Because compute resources seem to be in high demand: is 1.6M quanta typical for a non-production user? I suspect it will be easier for CM Service to increase these values than to get development runs to decrease them.
Later if helpful, we could think about some way to use information about the QuantumGraph in figuring out these resource values (in ctrl_bps or have ctrl_bps call some function if it better fits elsewhere).
I've reset the defaults to 8 cores and 16G.
MichelleGower left a comment
I will run more tests after the fixes are pushed.
-g {fileDistributionEndPoint}{qgraphFile}
-j 48
--register-dataset-types
{updateOutputChain}
The arguments don't match the current implementation. Also, the value for -j doesn't match the value given for requestCpus. Should it?
I would add a couple more variables to make changes and running with mocks easier. If different names make more sense, please change:
finalJobNumProcesses: 32
extraAggregateOptions: ""
finalJob:
  retryUnlessExit: 2
  updateOutputChain: "--update-output-chain"
  whenSetup: "NEVER"
  whenRun: "ALWAYS"
  # Added for future flexibility, e.g., if prefer workflow instead of shell
  # script.
  implementation: JOB
  concurrencyLimit: db_limit
  finalPreCmdOpts: "{defaultPreCmdOpts}"
  requestCpus: "{finalJobNumProcesses}"
  requestMemory: 65536
  command1: >-
    ${DAF_BUTLER_DIR}/bin/butler {finalPreCmdOpts} aggregate-graph
    {fileDistributionEndPoint}{qgraphFile}
    {butlerConfig}
    -j {finalJobNumProcesses}
    --register-dataset-types
    {updateOutputChain}
    {extraAggregateOptions}
So on my smaller workstation running a ci_middleware workflow with mocked failures, I needed to add the following to my submit yaml:
finalJobNumProcesses: 10
extraAggregateOptions: "--mock-storage-classes"
P.S. I wouldn't object to these going inside the final section like finalPreCmdOpts. But looking ahead to having a side aggregate job, setting the values once at the root level, where both the aggregate and final jobs would use them, would be ideal if they generally match.
I've adopted your recommended configuration pretty much as-is (I just revised the default resource usage downward, per the other thread). I didn't realize one could define a new BPS variable implicitly like this.
As for using the finalJob section for the new variables, at this point I'm not really anticipating adding a side aggregate, but I don't think we'll know for sure until CM starts using this in production. I am still hoping to support running it manually on the side, much as one can already do with the transfer-from-graph finalJob. Should we just move the new variables into that section, then?
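(For comparison, a hedged sketch of what moving the new variables into the finalJob section might look like; whether variables defined inside the section expand the same way in command1 is an assumption here, and the 8-core/16G values are the reduced defaults from the other thread.)

finalJob:
  finalJobNumProcesses: 8
  extraAggregateOptions: ""
  requestCpus: "{finalJobNumProcesses}"
  requestMemory: 16384
  # command1 as in the suggestion above, still using
  # -j {finalJobNumProcesses} and {extraAggregateOptions}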
{butlerConfig}
-g {fileDistributionEndPoint}{qgraphFile}
--register-dataset-types
{updateOutputChain}
See discussion below with the bps_defaults.yaml.
I think this should have a doc/changes file.
Force-pushed from d5e6df2 to 78eead1.
Force-pushed from 78eead1 to 4291798.
MichelleGower left a comment
I ran a PanDA test and an HTCondor-at-SLAC test with finalJobNumProcesses replaced by just requestCpus, to avoid the ctrl_bps bug (request_cpus being passed as a string to the WMS plugins when the nested variables are used). Merge approved.
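(A hedged sketch of that workaround as described: requestCpus holds a plain integer, so the WMS plugins receive a number, and the same value is substituted into -j; the 8/16384 figures are the reduced defaults from the earlier thread, not necessarily what was run in these tests.)

finalJob:
  requestCpus: 8
  requestMemory: 16384
  command1: >-
    ${DAF_BUTLER_DIR}/bin/butler {finalPreCmdOpts} aggregate-graph
    {fileDistributionEndPoint}{qgraphFile}
    {butlerConfig}
    -j {requestCpus}
    --register-dataset-types
    {updateOutputChain}
    {extraAggregateOptions}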
Open questions:

Checklist:
- doc/changes