Incorrect in_use in nimbus-nodes -l #46

Closed
oldpatricka opened this Issue May 03, 2011 · 2 comments

Patrick Armstrong
Collaborator

Occasionally, nimbus will report nodes as in_use: true when it is impossible for them to be in use. For example, no VMs are booted, yet five nodes are reported as in use.

This looks like:

[nimbus@calliopex ~]$ nimbus-nodes -l
hostname :  muse01.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

hostname :  muse02.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

hostname :  muse03.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

hostname :  muse04.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

hostname :  muse05.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

[nimbus@calliopex ~]$ cat /usr/local/nimbus/.../current-reservations.txt
[nimbus@calliopex ~]$ 

Here is a gist of the last few days of my nimbus log:

https://gist.github.com/954464

labisso referenced this issue from a commit May 17, 2011
Fixed problem backing out VMM memory allocation
Related to #46 but may not solve it entirely. The `in_use` check just compares memory available to max memory, and there could be other situations which cause a leak of VMM memory use.

It happens when an exception is raised during NIC binding (such as a NIC requesting a nonexistent network). No attempt was made to back out the scheduled reservation. Solved by doing NIC binding before scheduling, but I'm not totally clear on any side effects this might have.
cbcde32
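
As background on the in_use check mentioned in that commit message: below is a minimal sketch, in Java, of how such a flag could be derived purely from memory accounting. The class and field names (VmmNode, maxMemory, availMemory) are hypothetical rather than the actual Nimbus classes; the point is that any allocation that is never released leaves availMemory below maxMemory, so the node keeps reporting in_use: true even with no VMs running.

// Hypothetical sketch (not the actual Nimbus source): if in_use is derived
// solely from memory accounting, a leaked allocation makes a node look busy
// even when no VMs are running on it.
public class VmmNode {
    private final String hostname;
    private final int maxMemory;   // MB configured for this node
    private int availMemory;       // MB currently unreserved

    public VmmNode(String hostname, int maxMemory) {
        this.hostname = hostname;
        this.maxMemory = maxMemory;
        this.availMemory = maxMemory;
    }

    public String getHostname() {
        return hostname;
    }

    // Reserve memory when a VM is scheduled onto this node.
    public void reserve(int memoryMB) {
        this.availMemory -= memoryMB;
    }

    // Return memory to the pool; if this step is ever skipped on an error
    // path, the node is reported as in_use indefinitely.
    public void release(int memoryMB) {
        this.availMemory += memoryMB;
    }

    // in_use means "some memory is reserved", i.e. availMemory < maxMemory.
    public boolean isInUse() {
        return this.availMemory < this.maxMemory;
    }
}
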
labisso referenced this issue from a commit May 17, 2011
Another effort at issue #46. Better VM backout.
Order of NIC binding and scheduling apparently matters. Reworked backout logic to pass down required values (memory, cores, etc.) through scheduler and slot management since they cannot be resolved from WorkspaceHome at the time of backout.

In general a pretty ugly solution.
f25f9b0
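
A rough illustration of the backout approach those two commits describe, using assumed names (Reservation, Scheduler, NetworkBinder) rather than the actual Nimbus classes: the reservation records the requested memory and cores explicitly, so a NIC binding failure can release exactly what was allocated without consulting workspace state that may no longer resolve at backout time.

// Hypothetical sketch of scheduling with an explicit backout path; the names
// are illustrative only.
public final class CreateRequestFlow {

    // Everything needed to undo a reservation, captured at scheduling time.
    public static final class Reservation {
        final String vmmHostname;
        final int memoryMB;
        final int cores;

        Reservation(String vmmHostname, int memoryMB, int cores) {
            this.vmmHostname = vmmHostname;
            this.memoryMB = memoryMB;
            this.cores = cores;
        }
    }

    public interface Scheduler {
        Reservation reserve(int memoryMB, int cores) throws Exception;
        // Backout takes the reservation itself, not a workspace handle that
        // may no longer resolve by the time the error is handled.
        void backOut(Reservation reservation);
    }

    public interface NetworkBinder {
        void bindNics(String network) throws Exception;
    }

    private final Scheduler scheduler;
    private final NetworkBinder networks;

    public CreateRequestFlow(Scheduler scheduler, NetworkBinder networks) {
        this.scheduler = scheduler;
        this.networks = networks;
    }

    public Reservation create(String network, int memoryMB, int cores) throws Exception {
        final Reservation reservation = scheduler.reserve(memoryMB, cores);
        try {
            networks.bindNics(network);
        } catch (Exception e) {
            // Without this backout the memory stays allocated and the node
            // shows in_use: true with no VM running (the reported bug).
            scheduler.backOut(reservation);
            throw e;
        }
        return reservation;
    }
}
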
David LaBissoniere
Collaborator
labisso commented May 27, 2011

I'm fairly confident this bug has been fixed. The problem was that in certain error scenarios, the scheduler was not returning VMM memory allocations back to the pool. Specifically, this happened if a request failed because of a network binding error (not enough available IP addresses, perhaps).

The fix ended up being somewhat hairy and I'd like @timf to review if possible. It is also important that the pilot is tested before a release.

Patrick Armstrong
Collaborator

This seemed to work fine with pilot. qdel got called, which maybe didn't happen before?

2011-05-27 15:28:50,035 INFO  defaults.CreationManagerImpl [ServiceThread-26,create:362] [NIMBUS-EVENT]: Create request for instance from '/C=CA/O=Grid/OU=phys.uvic.ca/CN=Patrick Armstrong'
2011-05-27 15:28:50,040 INFO  groupauthz.Group [ServiceThread-26,decide:290] 

Considering caller: '/C=CA/O=Grid/OU=phys.uvic.ca/CN=Patrick Armstrong'.
Current elapsed minutes: 54.
Current reserved minutes: 500.
Number of VMs in request: 1.
Charge ratio for request: 1.0.
Number of VMs caller is already currently running: 1.
Rights:
GroupRights for group 'TESTING':  {maxReservedMinutes=0, maxElapsedReservedMinutes=0, maxWorkspaceNumber=5, maxWorkspacesInGroup=1, imageNodeHostname='example.com', imageBaseDirectory='/cloud', dirHashMode=true, maxCPUs=2}

Duration request: 500


2011-05-27 15:28:50,060 INFO  pilot.PilotSlotManagement [ServiceThread-26,reserveSpaceImpl:804] pilot command = /opt/nimbus/bin/workspacepilot.py -t --reserveslot -m 512 -d 30002 -g 8 -i ffbe4aaf-3bb5-47d4-8001-03e0696cb4d6 -c http://calliopex.phys.uvic.ca:41999/pilot_notification/v01/
2011-05-27 15:28:50,060 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:151] [NIMBUS-EVENT]: qsub -j oe -r n -m n -l nodes=1:ppn=1 -l walltime=08:20:02 -l mem=512mb -o /usr/local/nimbus/services/var/nimbus/pilot-logs/ffbe4aaf-3bb5-47d4-8001-03e0696cb4d6 
2011-05-27 15:28:50,082 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:228] [NIMBUS-EVENT]: Return code is 0
2011-05-27 15:28:50,083 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:270] [NIMBUS-EVENT]: 
STDOUT:
6831.calliopex.phys.uvic.ca
2011-05-27 15:28:50,091 ERROR defaults.Util [ServiceThread-26,getNextEntry:88] network 'public' is not currently available
2011-05-27 15:28:50,097 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:151] [NIMBUS-EVENT][id-25]: qdel 6831.calliopex.phys.uvic.ca 
2011-05-27 15:28:50,118 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:228] [NIMBUS-EVENT][id-25]: Return code is 0
2011-05-27 15:28:50,126 ERROR factory.FactoryService [ServiceThread-26,create:109] Error creating workspace(s): network 'public' is not currently available
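
For the pilot path shown in this log, the key behavior is that the batch slot reserved via qsub is cancelled with qdel as soon as a later step fails, so the slot (and the VMM memory it represents) is not leaked. Below is a minimal sketch of that submit-then-cancel-on-failure pattern with hypothetical helper names; it is not the actual PilotSlotManagement code.

// Hypothetical sketch of "reserve a pilot slot, cancel it if a later step
// fails"; the names are illustrative, not the actual Nimbus classes.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public final class PilotSlotSketch {

    // Submit the pilot job and return the batch job id
    // (e.g. "6831.calliopex.phys.uvic.ca" in the log above).
    static String submit(String... qsubCommand) throws Exception {
        final Process p = new ProcessBuilder(qsubCommand).start();
        final String jobId;
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            jobId = out.readLine();
        }
        if (p.waitFor() != 0 || jobId == null) {
            throw new Exception("qsub failed");
        }
        return jobId.trim();
    }

    // Cancel the pilot job so the reserved slot is released.
    static void cancel(String jobId) throws Exception {
        new ProcessBuilder("qdel", jobId).start().waitFor();
    }

    // Reserve a slot, then run the rest of the create request; on failure
    // (e.g. "network 'public' is not currently available"), back out by
    // cancelling the batch job, which is the qdel seen in the log.
    static String reserveSlot(Runnable restOfCreateRequest,
                              String... qsubCommand) throws Exception {
        final String jobId = submit(qsubCommand);
        try {
            restOfCreateRequest.run();
        } catch (RuntimeException e) {
            cancel(jobId);
            throw e;
        }
        return jobId;
    }
}
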
labisso closed this June 09, 2011