Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-43299: changes to evaluate RequestMemory expressions #26

Merged
merged 4 commits into from Apr 10, 2024

Conversation

daues
Copy link
Contributor

@daues daues commented Mar 27, 2024

No description provided.

print("Auto: No Large BPS Jobs.")
return
except Exception as exc:
raise type(exc)("Problem querying condor schedd for jobs") from None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move lines 219-244 out of the try-except statement. I don't see how anything in these lines could go wrong, i.e., raise an exception. In other words, it looks like only condor_q() call is worth checking for possible exceptions.

condorq_large = []
condorq_small = []
schedd_name = list(condorq_data.keys())[0]
condorq_full = condorq_data[schedd_name]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 259 and 260 could be replaced by a single statement:

schedd_name, condor_full = condorq_data.popitem()

condorq_full = condorq_data[schedd_name]

print("Auto: Search for Large htcondor jobs.")
for jid in list(condorq_full.keys()):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to create a list (and use keys()) when iterating over dictionary keys in a for loop.

for jid in condorq_full:
    ...

should work just fine and will be more memory efficient. Though in this particular case

for jid, ajob in condorq_full.items():
    ...

will work even better for assigning values to jid and ajob variables.

this_list.append(ajob)

for job_label in unique_labels:
print(f"Making an evaluation {thisEvalMemory}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we can't always add the RequestMemoryEval (i.e. the attribute representing numerical value of the requested memory) to the job's ClassAd regardless whether the RequestMemory is an integer or a ClassAd expression? It looks like doing so would allow us to avoid rechecking what RequestMemory really is in other places (lines 289-293 and 342-346).

return

def countIdleSlurmJobs(self, jobname):
"""Check Slurm queue for Idle Glideins"""
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it looks like a newly added method please include Parameters and Returns sections in its docstring. The same goes for countRunningSlurmJobs() below.

Also, it looks like these two methods could be static as neither of them is accessing/modifying instance members.

Finally, countIdleSlurmJobs() and countRunningSlurmJobs() are practically identical. Can't we replace them by a single, more generic one by introducing additional parameter, say jobstates, and rewriting the Slurm query as f"squeue --noheader --states={jobstates} --name={jobname} | wc -l"?

Even if you prefer to have "convenience" methods for counting Slurm jobs at the given state, they could be just appropriate calls of such a generic function. For example

@staticmethod
def countIdleSlurmJobs(jobname):
    print(f"Checking if idle Slurm job {jobname} exists:")
    return self.countSlurmJob(jobname, jobstates="PD")
    
@staticmethod
def countRunningSlurmJobs(jobname):
    print(f"Checking if running Slurm job {jobname} exists:")
    return self.countSlurmJob(jobnames)
    
@staticmethod
def countSlurmJobs(jobname, jobstates="R"):
    batcmd = f"squeue --noheader --states={jobstates} --name={jobname} | wc -l"
    print(f"The squeue command is {batcmd}")
    time.sleep(3)
    try:
        resultPD = subprocess.check_output(batcmd, shell=True)
    except subprocess.CalledProcessError as e:
        print(e.output)
    numberOfJobs = int(resultPD.decode("UTF-8"))
    return numberOfJobs

if verbose:
print(f"{job_label} reduced {numberOfGlideinsReduced}")

print(f"jobname {jobname}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a "leftover" after some debugging session. If so, please remove. Otherwise, I'd only print it if verbose is True as this piece of information may have little use for a regular user.

if maxNumberOfGlideins > maxAllowedNumberOfGlideins:
maxNumberOfGlideins = maxAllowedNumberOfGlideins
print("Reducing Small Glidein limit due to threshold.")
# initialize counter
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, instead of "singling out" this fairly obvious fact, a comment briefly describing what is the purpose of this section of the code (lines 339-360) would be much more useful here.

Also, I think the readability of the entire section could be improved by small adjustments of the variables names. For example, I'd use:

  • requested(Cpus|Memory) (or similar) instead of this(Cpus|Memory),
  • neededCpus (or similar) instead of thisRatio.

print("smallGlideins: Reducing due to threshold.")
print(
f"smallGlideins: Number of Glideins to submit is {numberOfGlideinsReduced}"
)
Copy link

@mxk62 mxk62 Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be wrong, but Judging by earlier code, it feels like all these messages (lines 377 - 399) should be printed out only if verbose is True.

@daues daues merged commit 2e90341 into main Apr 10, 2024
3 checks passed
@daues daues deleted the tickets/DM-43299 branch April 10, 2024 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants