DM-29756: Automatic retry with larger memory #43

Merged
merged 9 commits into from Sep 1, 2021


@mxk62 mxk62 commented Aug 27, 2021

No description provided.


@MichelleGower MichelleGower left a comment


A few minor recommendations, but there are also a couple of spots where I am not sure the code is correct/complete. The code can be merged after checking/correcting those things.

doc/changes/DM-29756.feature.rst (outdated, resolved)
doc/lsst.ctrl.bps/quickstart.rst (outdated, resolved)
python/lsst/ctrl/bps/defaults.py (outdated, resolved)
python/lsst/ctrl/bps/wms/htcondor/lssthtc.py (outdated, resolved)
python/lsst/ctrl/bps/wms/htcondor/htcondor_service.py (outdated, resolved)
python/lsst/ctrl/bps/wms/htcondor/htcondor_service.py (outdated, resolved)
python/lsst/ctrl/bps/transform.py (outdated, resolved)
python/lsst/ctrl/bps/wms/htcondor/htcondor_service.py (outdated, resolved)
python/lsst/ctrl/bps/wms/htcondor/lssthtc.py (outdated, resolved)
@mxk62 mxk62 force-pushed the tickets/DM-29756 branch 2 times, most recently from 1b49c8a to e9cc708 Compare September 1, 2021 15:54
There are two default values for the number of retries in ctrl_bps: None
(no retries) and 5 (when memory auto scaling is enabled).  However, they
were not set consistently: the first was set during creation of a
generic workflow job, the other in the HTCondor plugin.  As the second
default value is also supposed to be used by all plugins, I changed
ctrl_bps to set both in the same place, when creating a generic workflow
job.
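
A minimal sketch of what "setting both defaults in the same place" could look like. The function name, the constant name, and the rule that memory auto scaling counts as enabled when a multiplier greater than 1 is given are all assumptions made for illustration; this is not the code from the PR.

```python
from typing import Optional

# Hypothetical constant: 5 is the default the commit message quotes for
# the case when memory auto scaling is enabled.
DEFAULT_RETRIES_WITH_MEM_SCALING = 5


def resolve_number_of_retries(
    requested: Optional[int], memory_multiplier: Optional[float]
) -> Optional[int]:
    """Pick the retry count once, while building a generic workflow job.

    Assumption: memory auto scaling is treated as enabled when a
    multiplier greater than 1 is supplied.
    """
    if requested is not None:
        # An explicit user setting always wins over either default.
        return requested
    if memory_multiplier is not None and memory_multiplier > 1:
        return DEFAULT_RETRIES_WITH_MEM_SCALING
    # No retries by default.
    return None
```

With a single resolver like this, a WMS plugin would only read the value stored on the job and would not need a default of its own.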
The ClassAd expression for determining whether a job was put on hold due
to exceeding memory was not covering all observed cases.  Fixed it.
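
As an illustration of covering more than one way a job can end up held for memory, the sketch below builds a ClassAd expression string that accepts two alternatives. The function name is hypothetical, and the numeric hold reason codes are assumptions that should be checked against the HTCondor manual for the version in use; the expression in the PR may differ.

```python
def memory_hold_expr() -> str:
    """Build a ClassAd expression that is true for jobs held over memory.

    The hold reason codes below are assumptions, not values taken from
    the PR. `=?=` is the ClassAd "is identical to" operator, which
    yields false instead of UNDEFINED when the attribute is missing
    (as it is for jobs that are not held).
    """
    # Case 1 (assumed): the system put the job on hold for memory usage.
    by_system = "(HoldReasonCode =?= 34 && HoldReasonSubCode =?= 0)"
    # Case 2 (assumed): a periodic hold policy fired and used subcode 34.
    by_policy = "(HoldReasonCode =?= 3 && HoldReasonSubCode =?= 34)"
    # JobStatus 5 means the job is in the Held state.
    return f"JobStatus == 5 && ({by_system} || {by_policy})"
```

Keeping each case in its own parenthesized clause makes it straightforward to add another one if a new hold pattern is observed.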
Besides being set at the job level (in the job submission file), the
number of retries was also set in the DAGMan submission file.  Fixed
it.
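
A short sketch of setting retries in only one place. It assumes the job-level setting is the one that was kept, which the commit message suggests but does not state; the helper names are hypothetical. `max_retries` is an HTCondor submit command and `RETRY` is a DAGMan command.

```python
from typing import List, Optional


def submit_lines(executable: str, retries: Optional[int]) -> List[str]:
    """Render a tiny job submit description with job-level retries."""
    lines = [f"executable = {executable}"]
    if retries is not None:
        lines.append(f"max_retries = {retries}")
    lines.append("queue")
    return lines


def dag_lines(name: str, submit_file: str) -> List[str]:
    """Render the DAG entry for the job.

    Deliberately emits no "RETRY <name> <n>" line, so the retry count
    is not specified a second time at the DAGMan level.
    """
    return [f"JOB {name} {submit_file}"]
```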
@mxk62 mxk62 merged commit e4174e2 into master Sep 1, 2021
@mxk62 mxk62 deleted the tickets/DM-29756 branch September 1, 2021 23:57
2 participants