Job hangs - “waiting for job to start” on a PBS Cluster #70

Closed
snirgaz opened this issue May 16, 2017 · 9 comments

Comments

@snirgaz

snirgaz commented May 16, 2017

I'm trying to use ClusterManagers on a PBS cluster interactively, e.g.:

```
julia> using ClusterManagers

julia> addprocs_pbs(2, queue="default")
job id is 135963, waiting for job to start ................................................................
```

The job seems to hang, even though qstat shows it as running (state R):

Job id           Name         User     Time Use  S  Queue
135963[].pippen  julia-26303  snirgaz  0         R  default

Any thoughts?

@affans

affans commented May 19, 2017

The same was happening to me. I still can't get PBS to work with my cluster, but I was able to solve this problem. On line 98 of qsub.jl, the code looks for the file that PBS generates with the node information. However, at least on the cluster I am using, the name it was trying to match (variable fname) differed from the actual filename.

The function is defined on line 92 of qsub.jl. I had to change mine to
filename(i) = isPBS ? "$home/julia-$(getpid()).o$id-$i" : "$home/julia-$(getpid()).o$id.$i"
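
To see which naming pattern a given cluster uses, one option is to compare the expected name against what actually lands in the home directory. The following is only a rough sketch: `output_name` and `candidates` are hypothetical helpers, not part of ClusterManagers, and the `dash` keyword simply switches between the two separators from the line above.

```julia
# Hypothetical helper: builds the output file name the manager would wait for.
# dash=true  -> "<id>-<i>" (the variant that worked on my cluster)
# dash=false -> "<id>.<i>" (the other variant in the line above)
function output_name(home, pid, id, i; dash=true)
    sep = dash ? "-" : "."
    return "$(home)/julia-$(pid).o$(id)$(sep)$(i)"
end

# Hypothetical helper: lists the files in `dir` that belong to this job,
# so the real names can be compared with the expected ones.
function candidates(dir, pid, id)
    prefix = "julia-$(pid).o$(id)"
    return filter(f -> startswith(f, prefix), readdir(dir))
end

# Example values taken from this thread; replace them with your own.
home, pid, id = homedir(), 26303, "135963"
for i in 1:2
    name = output_name(home, pid, id, i)
    println(name, " exists: ", isfile(name))
end
println("files actually written: ", candidates(home, pid, id))
```

If the list of files actually written shows a different separator or suffix than the expected names, that mismatch would explain why the manager keeps waiting.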

@juliohm
Collaborator

juliohm commented Dec 18, 2017

@affans could you please confirm whether this is fixed in master?

Should this issue be closed?

@bjarthur
Collaborator

using file systems for interprocess communication is generally a bad idea. try addprocs_qrsh instead.

@juliohm
Collaborator

juliohm commented Dec 18, 2017

Hi @bjarthur, I have just tried addprocs_qrsh as you suggested and got the following error:

Error launching workers

could not spawn `qrsh -q class -V -N julia-41243 -now n cd /home/juliohm '&&' /usr/loca/julia/julia-903644385b/bin/julia --worker UHNgiG1KIXPRm6A4`: no such file or directory (ENOENT)

Do you have an idea of what may be happening?

@bjarthur
Collaborator

bjarthur commented Dec 27, 2017

have you tried cutting and pasting the command in the error message directly into a bash terminal? breaking it down into parts would help isolate the problem:

  1. cd /home/juliohm
  2. /usr/loca/julia/julia-903644385b/bin/julia --worker UHNgiG1KIXPRm6A4, and then,
  3. qrsh -q class -V -N julia-41243 -now n ls

i don't have access to a PBS cluster anymore, so am not able to provide much more help.
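
Some of the checks above can also be done from within Julia before launching anything. This is a minimal sketch, assuming Julia 1.x (for `Sys.which` and `Sys.BINDIR`); `describe_executable` is a hypothetical helper, not part of ClusterManagers. An ENOENT from a failed spawn usually means the program being started could not be found, so checking whether `qrsh` is on the PATH may be a useful first step.

```julia
# Hypothetical helper: report whether a program can be found on PATH.
function describe_executable(name::AbstractString)
    path = Sys.which(name)   # returns nothing when the program is not found
    return path === nothing ? "$(name): not found on PATH" : "$(name): $(path)"
end

# Schedulers' command-line tools used by the different managers.
println(describe_executable("qrsh"))
println(describe_executable("qsub"))

# The julia binary of the current session.
julia_bin = joinpath(Sys.BINDIR, Base.julia_exename())
println("julia binary exists: ", isfile(julia_bin))
```

This only inspects the node the session runs on; it says nothing about whether the same paths exist on the compute nodes.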

@juliohm
Collaborator

juliohm commented Jan 3, 2018

Hi @bjarthur, sorry for the delay in replying. I tried running the commands on the root node, but the first command hangs indefinitely. Do you have any other suggestions on how to debug this?

@snirgaz
Author

snirgaz commented May 21, 2018

Any progress on that matter? I tried again on a different PBS cluster. Still the same issue.

@juliohm
Collaborator

juliohm commented May 21, 2018

I think the best hope in this case is to use MPI.jl; it now defines an MPIManager that should work in theory. I will try it as soon as I find some time.

@juliohm
Collaborator

juliohm commented Oct 6, 2020

We are trying to revive the package. Please review the latest stable release (released today) and report any issues. PRs are more than welcome!

juliohm closed this as completed Oct 6, 2020