Nimbus allows qdels to fail in Pilot #87

Open
oldpatricka opened this Issue Feb 15, 2012 · 0 comments


oldpatricka commented Feb 15, 2012

From a bug report by Sharon Goliath:

On a Nimbus installation running version 2.8, the services.log file contains a few instances of the following error:

/usr/local/nimbus/var/services.log.litai02.20120118.000000:2012-01-18 22:17:39,092 INFO  workspace.WorkspaceUtil     [ServiceThread-164,runCommand:154] [NIMBUS-EVENT][id-25]: /opt/bin/qdel 4562168.moab01.**.**.**
/usr/local/nimbus/var/services.log.litai02.20120118.000000:qdel: Server could not connect to MOM 4562168.moab01.**.**.**
/usr/local/nimbus/var/services.log.litai02.20120118.000000:2012-01-18 22:17:39,107 ERROR pilot.PilotSlotManagement     [ServiceThread-164,releaseSpaceImpl:1077] Problem calling Torque qdel: return code = 222, stderr = 'qdel: Server could not connect to MOM 4562168.moab01.**.**.**', no stdout

The workspace service then removes its record of the VM, even though the pilot job has not been successfully terminated.

I think Nimbus should retry the qdel a number of times rather than simply logging an error. The current behaviour can leave zombie jobs in the PBS queue.
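
To make the suggestion concrete, here is a rough sketch of what a bounded retry could look like. It is written in Python purely for illustration and is not Nimbus code: the function names, the attempt count, and the delay are all assumptions, and only the `/opt/bin/qdel` path and the non-zero return code come from the log above.

```python
import subprocess
import time


def run_qdel(job_id, qdel_path="/opt/bin/qdel"):
    """Run qdel once and return (return_code, stderr). Hypothetical helper."""
    proc = subprocess.run([qdel_path, job_id], capture_output=True, text=True)
    return proc.returncode, proc.stderr.strip()


def qdel_with_retry(job_id, attempts=5, delay_seconds=10.0,
                    run=run_qdel, sleep=time.sleep):
    """Return True once qdel exits 0, or False if every attempt fails.

    attempts and delay_seconds are illustrative defaults, not tuned values.
    run and sleep are injectable so the loop can be exercised without Torque.
    """
    for attempt in range(1, attempts + 1):
        code, stderr = run(job_id)
        if code == 0:
            return True
        print(f"qdel attempt {attempt}/{attempts} failed: "
              f"return code = {code}, stderr = {stderr!r}")
        # Wait between attempts, but not after the final one.
        if attempt < attempts:
            sleep(delay_seconds)
    return False
```

With something along these lines, the caller could hold on to its record of the VM whenever the function returns False, so that a job that is still in the queue remains visible instead of being forgotten.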
