Job hangs - “waiting for job to start” on a PBS Cluster #70

snirgaz · 2017-05-16T23:22:38Z

I'm trying to use ClusterManagers on a PBS cluster (interactively, for example):

```
julia> using ClusterManagers

julia> addprocs_pbs(2, queue="default")
job id is 135963, waiting for job to start ................................................................
```

The job seems to hang even though qstat shows it as running:

| Job id | Name | User | Time Use | S | Queue |
|---|---|---|---|---|---|
| 135963[].pippen | julia-26303 | snirgaz | 0 | R | default |

Any thoughts?

affans · 2017-05-19T02:38:12Z

The same was happening to me. I still can't get PBS to work with my cluster, but I was able to solve this problem. Line 98 of qsub.jl looks for the file that PBS generates with the node information. However, at least on the cluster I am using, the filename it was trying to match (the variable fname) was different from the actual filename.

The function that builds this filename is defined on line 92 of qsub.jl. I had to change mine to:

`filename(i) = isPBS ? "$home/julia-$(getpid()).o$id-$i" : "$home/julia-$(getpid()).o$id.$i"`
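
To find out which naming scheme a given scheduler actually produces, one option is to build both candidate names and check which file exists. The sketch below is only an illustration and is not the code in qsub.jl: `candidate_filenames` and `find_output_file` are hypothetical helpers, and `home`, `pid`, `id`, and `i` stand in for the values used in the line above.

```julia
# Hypothetical helpers, not part of ClusterManagers.

# Build the two output-file patterns quoted above for array-job index `i`.
function candidate_filenames(home::AbstractString, pid::Integer, id::AbstractString, i::Integer)
    return ["$(home)/julia-$(pid).o$(id)-$(i)",   # dash before the index (the isPBS branch above)
            "$(home)/julia-$(pid).o$(id).$(i)"]   # dot before the index (the other branch)
end

# Return the first candidate that exists on disk, or `nothing` if neither does.
function find_output_file(home, pid, id, i)
    for f in candidate_filenames(home, pid, id, i)
        isfile(f) && return f
    end
    return nothing
end
```

If `find_output_file` returns `nothing` while the job is running, listing the directory by hand shows the name the scheduler really used, which can then be compared against the two patterns.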

juliohm · 2017-12-18T03:26:39Z

@affans could you please confirm whether this is fixed in master?

Should this issue be closed?

bjarthur · 2017-12-18T13:15:52Z

using file systems for interprocess communication is generally a bad idea. try addprocs_qrsh instead.

juliohm · 2017-12-18T18:57:23Z

Hi @bjarthur, I have just tried addprocs_qrsh as you suggested and got the following error:

Error launching workers

could not spawn `qrsh -q class -V -N julia-41243 -now n cd /home/juliohm '&&' /usr/loca/julia/julia-903644385b/bin/julia --worker UHNgiG1KIXPRm6A4`: no such file or directory (ENOENT)

Do you have an idea of what may be happening?

bjarthur · 2017-12-27T13:13:42Z

have you tried cutting and pasting the command in the error message directly into a bash terminal? breaking it down into parts would help isolate the problem:

`cd /home/juliohm`,
`/usr/loca/julia/julia-903644385b/bin/julia --worker UHNgiG1KIXPRm6A4`, and then
`qrsh -q class -V -N julia-41243 -now n ls`

i don't have access to a PBS cluster anymore, so am not able to provide much more help.
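
An ENOENT from a failed spawn often means that the program being launched could not be found, so a quick check is whether each executable is visible from the Julia session. Note that qrsh is a Grid Engine command and may simply not be installed on a PBS cluster. A minimal sketch, assuming Julia 1.0 or later (where `Sys.which` is available); `missing_executables` is a hypothetical helper and the names passed to it are only examples.

```julia
# Hypothetical helper: return the entries of `programs` that cannot be
# resolved to an executable from the current Julia session.
function missing_executables(programs)
    return [p for p in programs if Sys.which(p) === nothing]
end

# Example: check the scheduler commands mentioned in this thread.
for p in missing_executables(["qstat", "qrsh"])
    println("not found: ", p)
end
```

If a name is reported as not found, the problem lies with the environment (PATH or a missing scheduler client) rather than with the worker command itself.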

juliohm · 2018-01-03T19:57:01Z

Hi @bjarthur, sorry for the delay in replying. I tried running the commands on the root node, but the first command hangs indefinitely. Do you have any other suggestions on how to debug this?

snirgaz · 2018-05-21T18:36:07Z

Any progress on that matter? I tried again on a different PBS cluster. Still the same issue.

juliohm · 2018-05-21T21:35:24Z

I think the best hope in this case is to use MPI.jl; it now defines an MPIManager that should work in theory. I will try it as soon as I find some time.

juliohm · 2020-10-06T19:35:50Z

We are trying to revive the package. Please review the latest stable release (released today) and report any issues. PRs are more than welcome!

juliohm closed this as completed Oct 6, 2020
