Condor stuff from andy? #23

Closed
icaruso21 opened this issue May 2, 2020 · 0 comments

Comments

@icaruso21
Collaborator

Thanks for that.

FYI I'm including below a reply from Andy, which adds a bunch of additional info.

Some of the formatting may not be great, but Moodle does unfortunate things (HTML escaping) with some characters in markdown literal mode, so I'm just formatting this as plain text. If you want some of the output to line up better, paste it into something that uses fixed-width fonts.

Isaac, your runs get singled out here because those are the ones that are currently in the pipeline; don't take it personally!

-Lee

--- from Andy:

Hold usually means Condor can’t start the job for some reason. Idle may mean the job is waiting for nodes, which can happen because it has been running for a while and has been “evicted”. At the moment I see that almost all of the nodes are busy:

$ condor_status -sched
        TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
Total                 86            317              1

So there’s a lot of competition for nodes. It may also take a while to get running because the available nodes don’t have the necessary resources, e.g. not enough memory is available. Many of Isaac’s jobs are 10–12 GB in size, when on average there is only 8 GB per job available (nodes 1–3) or 4 GB per job available (nodes 4–6).
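To make the memory mismatch concrete, here is a minimal sketch that compares a job's size against the average per-job share quoted above. The figures (10 GB, 8 GB, 4 GB) come from the text; the function name `exceeds_share` is made up for illustration and is not part of Condor.

```shell
#!/bin/sh
# Sketch only: all numbers are in GB and taken from the message above.
# exceeds_share is a hypothetical helper, not a Condor command.
exceeds_share() {
    job_gb="$1"      # memory the job needs
    share_gb="$2"    # average memory available per job on a node group
    [ "$job_gb" -gt "$share_gb" ]
}

# 8 GB is the average share on nodes 1-3, 4 GB on nodes 4-6.
for share in 8 4; do
    if exceeds_share 10 "$share"; then
        echo "a 10 GB job exceeds the ${share} GB average share"
    fi
done
```

Even the smallest of the 10–12 GB jobs is above the average share on both groups of nodes, which is consistent with those jobs waiting longer for a suitable slot.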

He currently has one held job, which has been sitting there for a day and a half. The reason is:

$ condor_q -held 725.3

-- Schedd: csc3-desktop.amherst.edu : <148.85.78.197:9618?... @ 04/30/20 02:26:42
 ID     OWNER      HELD_SINCE  HOLD_REASON
 725.3  icaruso21  4/28 18:31  Failed to initialize user log to /home/icaruso21/ring-of-fire/prod-results-v2s/timestep1/3/log

but the actual directory name is /home/icaruso21/ring-of-fire/prod-results-v2/timestep1/3 (“v2”, not “v2s”).
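A mistyped log directory like this can be caught before submitting. The following is a rough sketch, assuming the log path is known up front; `check_log_dir` is a hypothetical helper written for this note, not a Condor command.

```shell
#!/bin/sh
# Sketch only: check_log_dir is a hypothetical helper, not part of Condor.
# It reports whether the directory that would contain a job's user log
# exists, since a missing directory is what produced the hold above.
check_log_dir() {
    log_path="$1"
    log_dir=$(dirname "$log_path")
    if [ -d "$log_dir" ]; then
        echo "ok: $log_dir"
    else
        echo "missing: $log_dir" >&2
        return 1
    fi
}
```

Run against the path from the hold reason, e.g. `check_log_dir /home/icaruso21/ring-of-fire/prod-results-v2s/timestep1/3/log`, it would report the "v2s" directory as missing, as long as that directory really does not exist.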

Important: please remind the students to put their projects into cluster-archive. Their home directory has relatively limited space available.

— Andy
