Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-29614: Have bps report show info from multiple submit nodes #61

Merged
merged 23 commits into from Oct 25, 2021

Conversation

mxk62
Copy link
Contributor

@mxk62 mxk62 commented Oct 21, 2021

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

Some of the functions responsible for extracting job information from
DAGman files where rising StopIteration exception when a file was
missing.  Modified them to raise the FileNotFoundError to make things
less confusing.
Copy link
Collaborator

@MichelleGower MichelleGower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions regarding correctness of code along with some docstring/comment suggestions. Can be merged after addressing.

python/lsst/ctrl/bps/report.py Outdated Show resolved Hide resolved
python/lsst/ctrl/bps/wms/htcondor/htcondor_service.py Outdated Show resolved Hide resolved
python/lsst/ctrl/bps/wms/htcondor/lssthtc.py Show resolved Hide resolved
python/lsst/ctrl/bps/wms/htcondor/lssthtc.py Outdated Show resolved Hide resolved
python/lsst/ctrl/bps/wms/htcondor/lssthtc.py Outdated Show resolved Hide resolved
python/lsst/ctrl/bps/wms/htcondor/htcondor_service.py Outdated Show resolved Hide resolved
python/lsst/ctrl/bps/wms/htcondor/lssthtc.py Show resolved Hide resolved
python/lsst/ctrl/bps/wms/htcondor/lssthtc.py Outdated Show resolved Hide resolved
python/lsst/ctrl/bps/wms/htcondor/lssthtc.py Outdated Show resolved Hide resolved
python/lsst/ctrl/bps/wms/htcondor/lssthtc.py Outdated Show resolved Hide resolved
Some WMS (e.g. HTCondor) uses distributed job queues. I added a new
flag, --global, to bps report allowing one to control whether it should
query for information all available job queues or only local one (the
default).
When a user provides submission directory as a run id to bps report, it
retrieves the required information from various HTCondor files in the
submission directory.  However, none of this files contains the global
job id of the DAGMan job (at least in ver. 8.8 and 9.0 of HTCondor).
Hence I added an additional step to HTCondor submission which persists
the global id in a JSON file for future reference.
Function htc_version() was returning a HTCondor version where major,
minor, and revision numbers were padded with zeros, i.e.,
000X.000Y.000Z.  However, in other parts of the code this string was
compared against non-padded (X.Y.Z) form.  Removed the padding and used
a third-party utility (included in the LSST stack) packaging.version()
to make this comparisons works as expected.
Names of some local variables varied somewhat strongly between functions
though these variables were associated with objects of the same type.
While there was nothing wrong with the code because of it, I changed
names of some variables to keep them consistent and hopefully make the
code easier to follow.
Probably due to some earlier git rebase gone wrong the arugments of
cancel() were messed up.  Fixed them.
Some WMS (e.g. HTCondor) uses distributed job queues. I added a new
flag, --global, to bps cancel allowing one to control whether it should
attempt to remove jobs matching the search criteria from all available
job queues or only local one (the default).
PanDA plugin overrides only report() method (though it doesn't look like
it uses it anyway).  I updated it's signature to reflect adding
`--global option to bps report command.
Made bps report to show global job id when printing a detailed report
for a given job.
The additional information displayed by report() in case of any errors
was shown before showing error messages reported by the plugin which was
slightly confusing.  Switched the order in these messages are being
displayed.
According to HTCondor docs for ver. 8.8 says one can use 'location_ad'
as the name argument when initializing a Schedd (an object representing
condor_scheduler).  However, HTCondor's throws a fit when it is actually
used so I removed it (it works fine with ver. 9.0 though).
When using the submission path as an id, bps report was not displaying
the correct status for deleted jobs.  Fixed that.
Some of the earlier changes might have stopped Pegasus plugin from
working.  I made adjustments to prevent that from happening.
@mxk62 mxk62 force-pushed the tickets/DM-29614 branch 2 times, most recently from 65e71cf to cfd1fee Compare October 25, 2021 18:00
@mxk62 mxk62 merged commit 1e3c6f9 into master Oct 25, 2021
@mxk62 mxk62 deleted the tickets/DM-29614 branch October 25, 2021 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants