FYI I'm including below a reply from Andy, which adds a bunch of additional info.
Some of the formatting may not be great, but Moodle does unfortunate things (html escaping) with some characters in markdown literal mode, so I'm just formatting this as plain text... if you want some of the output to line up better, paste it into something that uses fixed-width fonts.
Isaac, your runs get singled out here because those are the ones that are currently in the pipeline; don't take it personally!
-Lee
--- from Andy:
Hold usually means the scheduler can’t start the job for some reason. Idle may mean the job is waiting for nodes, possibly because it had been running for a while and was “evicted”. At the moment I see that almost all of the nodes are busy:
$ condor_status -sched
        TotalRunningJobs   TotalIdleJobs   TotalHeldJobs
Total                 86             317               1
So there’s a lot of competition for nodes. It may also take a while to get running because the available nodes don’t have the necessary resources, e.g. not enough memory is available. Many of Isaac’s jobs are 10–12 GB in size, when on average there is only 8 GB per job available (nodes 1-3) or 4 GB per job available (nodes 4–6).
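To make the memory mismatch concrete, here is a minimal sketch that compares a job's size against the per-job averages quoted above. The helper name fits_node is hypothetical (it is not an HTCondor command), and the numbers are only the rough averages from this message, not values read from the cluster.

```shell
# Illustrative only: fits_node is a hypothetical helper, not an HTCondor tool.
# Arguments: job size in GB, average memory available per job in GB.
fits_node() {
    job_gb="$1"
    avail_gb="$2"
    if [ "$job_gb" -le "$avail_gb" ]; then
        echo "fits"
    else
        echo "too large"
    fi
}

fits_node 10 8   # 10 GB job vs. nodes 1-3 average: too large
fits_node 12 4   # 12 GB job vs. nodes 4-6 average: too large
fits_node 6 8    # a smaller 6 GB job would fit on nodes 1-3
```

Jobs in the 10–12 GB range exceed both averages, which is consistent with them sitting idle until a node happens to have more memory free than usual.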
He currently has one held job, which has been sitting there for a day and a half. The reason is:
$ condor_q -held 725.3
-- Schedd: csc3-desktop.amherst.edu : <148.85.78.197:9618?... @ 04/30/20 02:26:42
ID OWNER HELD_SINCE HOLD_REASON
725.3 icaruso21 4/28 18:31 Failed to initialize user log to /home/icaruso21/ring-of-fire/prod-results-v2s/timestep1/3/log
But the actual directory name is /home/icaruso21/ring-of-fire/prod-results-v2/timestep1/3 (“v2”, not “v2s”).
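A hold like this one can be caught before submitting by checking that the directory that will contain the log file exists. The following is a minimal sketch, assuming the log path is known ahead of time; check_log_dir is a hypothetical helper name, not part of HTCondor.

```shell
# Hypothetical pre-submit check: check_log_dir is an illustrative name,
# not an HTCondor command. It verifies that the directory which will
# hold the job's log file already exists.
check_log_dir() {
    log_path="$1"
    log_dir=$(dirname "$log_path")
    if [ -d "$log_dir" ]; then
        echo "ok: $log_dir"
        return 0
    else
        echo "missing: $log_dir"
        return 1
    fi
}
```

For example, running check_log_dir /home/icaruso21/ring-of-fire/prod-results-v2s/timestep1/3/log before condor_submit would presumably have reported the "v2s" directory as missing, since only the "v2" directory exists.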
Important: please remind the students to put their projects into cluster-archive. Their home directory has relatively limited space available.
— Andy
Thanks for that.