Incorrect in_use in nimbus-nodes -l #46

Closed
oldpatricka opened this Issue · 2 comments

2 participants

@oldpatricka
Collaborator

Occasionally, nimbus will report nodes as `in_use: true` when it is impossible for them to be in use. For example, no VMs are booted, yet five nodes are reported as in use.

This looks like:

[nimbus@calliopex ~]$ nimbus-nodes -l
hostname :  muse01.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

hostname :  muse02.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

hostname :  muse03.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

hostname :  muse04.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

hostname :  muse05.phys.uvic.ca
pool     :  default
memory   :  7000
networks :  public
in_use   :  true
active   :  true

[nimbus@calliopex ~]$ cat /usr/local/nimbus/.../current-reservations.txt
[nimbus@calliopex ~]$ 

Here is a gist of the last few days of my nimbus log:

https://gist.github.com/954464

@labisso referenced this issue from a commit
@labisso Fixed problem backing out VMM memory allocation
Related to #46 but may not solve it entirely. The `in_use` check just compares memory available to max memory, and there could be other situations which cause a leak of VMM memory use.

It happens when an exception is raised during NIC binding (such as a NIC requesting a nonexistent network). No attempt was made to back out the scheduled reservation. Solved by doing NIC binding before scheduling but I'm not totally clear on any side effects this might have.
cbcde32
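
Roughly, the cbcde32 approach looks like this; a minimal sketch with invented names, illustrating the commit message rather than the actual Nimbus service API:

```java
// Sketch of the cbcde32 approach (hypothetical names): bind NICs before
// reserving VMM memory, so a bad NIC request fails before there is
// anything to back out.
interface Networks {
    String[] bind(String[] requested) throws Exception; // fails on a nonexistent network
    void release(String[] nics);
}

interface Scheduler {
    void reserve(int memoryMB, int cores) throws Exception;
}

class CreationSketch {
    private final Networks networks;
    private final Scheduler scheduler;

    CreationSketch(Networks n, Scheduler s) { networks = n; scheduler = s; }

    String[] create(String[] nicRequests, int memoryMB, int cores) throws Exception {
        // Bind first: a failure here leaves no scheduled reservation behind.
        String[] nics = networks.bind(nicRequests);
        try {
            scheduler.reserve(memoryMB, cores); // reserve VMM memory second
        } catch (Exception e) {
            networks.release(nics);             // undo the only side effect so far
            throw e;
        }
        return nics;
    }
}
```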
@labisso referenced this issue from a commit
@labisso Another effort at issue #46. Better VM backout.
Order of NIC binding and scheduling apparently matters. Reworked backout logic to pass down required values (memory, cores, etc.) through scheduler and slot management since they cannot be resolved from WorkspaceHome at the time of backout.

In general a pretty ugly solution.
f25f9b0
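
A minimal sketch of what f25f9b0 describes, with invented names: the backout path takes the reserved amounts as arguments instead of resolving them from WorkspaceHome, which may no longer hold the workspace by the time backout runs:

```java
// Sketch of the f25f9b0 approach (hypothetical names): backOut receives
// the resources to release explicitly, because the workspace entry may
// already be gone and its memory/cores cannot be looked up anymore.
class SlotManagementSketch {
    private final int memMaxMB;
    private int memAvailableMB;

    SlotManagementSketch(int memMaxMB) {
        this.memMaxMB = memMaxMB;
        this.memAvailableMB = memMaxMB;
    }

    void reserve(int memoryMB) throws Exception {
        if (memoryMB > memAvailableMB) throw new Exception("not enough memory");
        memAvailableMB -= memoryMB;
    }

    // The caller passes down what was reserved (cores would travel the
    // same way); nothing is resolved from workspace state here.
    void backOut(int memoryMB) {
        memAvailableMB = Math.min(memMaxMB, memAvailableMB + memoryMB);
    }
}
```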
@labisso
Collaborator

I'm fairly confident this bug has been fixed. The problem was that in certain error scenarios the scheduler was not returning VMM memory allocations back to the pool; specifically, when a request failed because of a network binding error (not enough available IP addresses, perhaps).

The fix ended up being somewhat hairy and I'd like @timf to review if possible. It is also important that the pilot is tested before a release.
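
For context, the first commit message above notes that `in_use` just compares available memory to max memory, which is why a leaked allocation shows up exactly as reported. A toy illustration, with invented names:

```java
// Toy illustration (invented names) of why a leaked reservation shows up
// as in_use: true. Per the commit message above, in_use is just a
// comparison of available memory against max memory.
class NodeSketch {
    static final int MEM_MAX_MB = 7000;
    int memAvailableMB = MEM_MAX_MB;

    boolean inUse() { return memAvailableMB < MEM_MAX_MB; }

    public static void main(String[] args) {
        NodeSketch node = new NodeSketch();
        node.memAvailableMB -= 512;       // scheduler reserves memory...
        // ...the request then dies during NIC binding, and before the fix
        // nothing returned the 512 MB, so with zero VMs running:
        System.out.println(node.inUse()); // prints "true"
    }
}
```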

@oldpatricka
Collaborator

This seemed to work fine with pilot. qdel got called, which maybe didn't happen before?

2011-05-27 15:28:50,035 INFO  defaults.CreationManagerImpl [ServiceThread-26,create:362] [NIMBUS-EVENT]: Create request for instance from '/C=CA/O=Grid/OU=phys.uvic.ca/CN=Patrick Armstrong'
2011-05-27 15:28:50,040 INFO  groupauthz.Group [ServiceThread-26,decide:290] 

Considering caller: '/C=CA/O=Grid/OU=phys.uvic.ca/CN=Patrick Armstrong'.
Current elapsed minutes: 54.
Current reserved minutes: 500.
Number of VMs in request: 1.
Charge ratio for request: 1.0.
Number of VMs caller is already currently running: 1.
Rights:
GroupRights for group 'TESTING':  {maxReservedMinutes=0, maxElapsedReservedMinutes=0, maxWorkspaceNumber=5, maxWorkspacesInGroup=1, imageNodeHostname='example.com', imageBaseDirectory='/cloud', dirHashMode=true, maxCPUs=2}

Duration request: 500


2011-05-27 15:28:50,060 INFO  pilot.PilotSlotManagement [ServiceThread-26,reserveSpaceImpl:804] pilot command = /opt/nimbus/bin/workspacepilot.py -t --reserveslot -m 512 -d 30002 -g 8 -i ffbe4aaf-3bb5-47d4-8001-03e0696cb4d6 -c http://calliopex.phys.uvic.ca:41999/pilot_notification/v01/
2011-05-27 15:28:50,060 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:151] [NIMBUS-EVENT]: qsub -j oe -r n -m n -l nodes=1:ppn=1 -l walltime=08:20:02 -l mem=512mb -o /usr/local/nimbus/services/var/nimbus/pilot-logs/ffbe4aaf-3bb5-47d4-8001-03e0696cb4d6 
2011-05-27 15:28:50,082 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:228] [NIMBUS-EVENT]: Return code is 0
2011-05-27 15:28:50,083 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:270] [NIMBUS-EVENT]: 
STDOUT:
6831.calliopex.phys.uvic.ca
2011-05-27 15:28:50,091 ERROR defaults.Util [ServiceThread-26,getNextEntry:88] network 'public' is not currently available
2011-05-27 15:28:50,097 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:151] [NIMBUS-EVENT][id-25]: qdel 6831.calliopex.phys.uvic.ca 
2011-05-27 15:28:50,118 INFO  workspace.WorkspaceUtil [ServiceThread-26,runCommand:228] [NIMBUS-EVENT][id-25]: Return code is 0
2011-05-27 15:28:50,126 ERROR factory.FactoryService [ServiceThread-26,create:109] Error creating workspace(s): network 'public' is not currently available
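
The log shows the sequence that matters: qsub reserves a batch slot, network binding fails, and the backout now issues qdel for the just-submitted job. A minimal sketch of that flow, with invented names (the real class is PilotSlotManagement, but these signatures are made up):

```java
// Sketch of the pilot backout visible in the log above: qsub reserves a
// batch slot, binding fails, and the backout qdels the just-submitted
// job so the slot is not leaked. All names here are illustrative.
class PilotBackoutSketch {

    String reserveSlot() throws Exception {
        String jobId = qsub();  // "6831.calliopex.phys.uvic.ca" in the log
        try {
            bindNetworks();     // throws: network 'public' unavailable
        } catch (Exception e) {
            qdel(jobId);        // the backout step that now happens
            throw e;
        }
        return jobId;
    }

    String qsub() { return "6831.calliopex.phys.uvic.ca"; }

    void qdel(String jobId) { System.out.println("qdel " + jobId); }

    void bindNetworks() throws Exception {
        throw new Exception("network 'public' is not currently available");
    }

    public static void main(String[] args) {
        try {
            new PilotBackoutSketch().reserveSlot();
        } catch (Exception e) {
            System.out.println("Error creating workspace(s): " + e.getMessage());
        }
    }
}
```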
@labisso closed this