
Load balancer error Starcluster 0.95.3 when using spot instances #377

Closed
johnbot1 opened this Issue Mar 5, 2014 · 3 comments


johnbot1 commented Mar 5, 2014

I'm having an issue with load balancing using StarCluster 0.95.2, which throws the error 'InstanceDoesNotExist: node 'node002' does not exist'. The nodes listed in the load balancer's error output show up (including duplicates) when running listclusters, and I've also verified the spot instances (including duplicates) appear in the AWS console and are labeled. What's interesting is that of the added nodes only node003 shows up when running qhost on the master, even after adding instances 001-004 (plus dupes). The cluster was started with an m1.small head node and (1) m2.2xlarge spot instance. I'm running the newest release with Ubuntu 13.x images.

Thanks

---------------------------
j1 (security group: @sc-j1)
---------------------------
Launch time: 2014-03-05 00:56:02
Uptime: 0 days, 00:41:46
VPC: vpc-bfa822d7
Subnet: subnet-b2a822da
Zone: us-west-2b
Keypair: guttman-west-2
EBS volumes: N/A
Cluster nodes:
     master running i-748f147d ec2-54-186-37-101.us-west-2.compute.amazonaws.com
    node001 running i-9d891294 ec2-54-186-28-42.us-west-2.compute.amazonaws.com (spot sir-d9db365e)
    node002 running i-c2861dcb ec2-54-186-12-235.us-west-2.compute.amazonaws.com (spot sir-8f31223d)
    node002 running i-02861d0b ec2-54-186-21-243.us-west-2.compute.amazonaws.com (spot sir-8c3eb65e)
    node003 running i-5a861d53 ec2-54-186-22-251.us-west-2.compute.amazonaws.com (spot sir-86d2243d)
    node003 running i-07851e0e ec2-54-186-19-183.us-west-2.compute.amazonaws.com (spot sir-d26cbc5e)
    node004 running i-77851e7e ec2-54-186-9-159.us-west-2.compute.amazonaws.com (spot sir-d6f75056)
Total nodes: 7
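
For reference, duplicate aliases like the node002/node003 pairs above can be cross-checked directly against EC2. Here's a minimal boto 2 sketch (not StarCluster code; it assumes AWS credentials are configured and that the node alias is stored in the instance's 'Name' tag, which is my assumption about StarCluster's tagging):

import boto.ec2

# Connect to the region the cluster runs in (us-west-2 per the listing above).
conn = boto.ec2.connect_to_region('us-west-2')

# Show each spot request, its state, and the instance it was fulfilled with.
for req in conn.get_all_spot_instance_requests():
    print req.id, req.state, req.instance_id

# Group running instances by alias to surface duplicates.
aliases = {}
for res in conn.get_all_instances(filters={'instance-state-name': 'running'}):
    for inst in res.instances:
        aliases.setdefault(inst.tags.get('Name'), []).append(inst.id)
for alias, ids in aliases.items():
    if len(ids) > 1:
        print 'duplicate alias %s: %s' % (alias, ids)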

Here's the qhost output after adding multiple nodes (node002, node002, node003, node003, and node004):

root@master:~# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
node001                 linux-x64       1  0.01    1.6G  124.1M  896.0M     0.0
node003                 linux-x64       1  0.34    1.6G  123.1M  896.0M     0.0
root@master:~#

Here's the load balancer command that was run (-n sets the minimum number of nodes, -m the maximum):

starcluster loadbalance -n 2 -m 7 j1

Here's some of the load balancer output:

>>> Loading full job history
>>> No jobs have completed yet!
Execution hosts: 1
Queued jobs: 19
Oldest queued job: 2014-03-05 01:09:38+00:00
Avg job duration: 0 secs
Avg job wait time: 0 secs
Last cluster modification time: 2014-03-05 01:08:41
>>> Queued jobs need more slots (19) than available (0)
>>> No queued jobs older than 900 seconds
>>> Sleeping...(looping again in 60 secs)

>>> Loading full job history
>>> No jobs have completed yet!
Execution hosts: 1
Queued jobs: 19
Oldest queued job: 2014-03-05 01:09:38+00:00
Avg job duration: 0 secs
Avg job wait time: 0 secs
Last cluster modification time: 2014-03-05 01:08:41
>>> Queued jobs need more slots (19) than available (0)
>>> A job has been waiting for 923 seconds longer than max: 900
*** WARNING - Adding 1 nodes at 2014-03-05 01:25:02.044713+00:00
>>> Launching node(s): node002
SpotInstanceRequest:sir-8f31223d
>>> Waiting for spot requests to propagate... 
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Waiting for SSH to come up on all nodes...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Waiting for cluster to come up took 0.023 mins
!!! ERROR - Failed to add new host
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/starcluster/balancers/sge/__init__.py", line 710, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "/usr/lib/python2.7/site-packages/starcluster/cluster.py", line 1040, in add_nodes
    node = self.get_node(alias)
  File "/usr/lib/python2.7/site-packages/starcluster/cluster.py", line 810, in get_node
    raise exception.InstanceDoesNotExist(identifier, label='node')
InstanceDoesNotExist: node 'node002' does not exist
>>> Sleeping...(looping again in 60 secs)

>>> Loading full job history
>>> No jobs have completed yet!
Execution hosts: 1
Queued jobs: 19
Oldest queued job: 2014-03-05 01:09:38+00:00
Avg job duration: 0 secs
Avg job wait time: 0 secs
Last cluster modification time: 2014-03-05 01:08:41
>>> Queued jobs need more slots (19) than available (0)
>>> A job has been waiting for 1019 seconds longer than max: 900
*** WARNING - Adding 1 nodes at 2014-03-05 01:26:37.355967+00:00
>>> Launching node(s): node002
SpotInstanceRequest:sir-8c3eb65e
>>> Waiting for spot requests to propagate... 
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Waiting for SSH to come up on all nodes...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Waiting for cluster to come up took 0.021 mins
!!! ERROR - Failed to add new host
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/starcluster/balancers/sge/__init__.py", line 710, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "/usr/lib/python2.7/site-packages/starcluster/cluster.py", line 1040, in add_nodes
    node = self.get_node(alias)
  File "/usr/lib/python2.7/site-packages/starcluster/cluster.py", line 810, in get_node
    raise exception.InstanceDoesNotExist(identifier, label='node')
InstanceDoesNotExist: node 'node002' does not exist
>>> Sleeping...(looping again in 60 secs)
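
Based on the traceback, the launch itself succeeds (the spot request is created and the node reaches 'running'), and the failure happens in the alias lookup afterwards. A paraphrased sketch of what cluster.get_node appears to do (my reading of the traceback, not the actual StarCluster source):

class InstanceDoesNotExist(Exception):
    pass

def get_node(nodes, alias):
    # Scan the nodes EC2 currently reports for this cluster. If the
    # freshly fulfilled spot instance hasn't propagated to
    # DescribeInstances yet, its alias is missing here even though the
    # launch succeeded -- consistent with the eventual-consistency
    # explanation in the fix referenced below.
    for node in nodes:
        if node.alias == alias:
            return node
    raise InstanceDoesNotExist("node %r does not exist" % alias)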

johnbot1 commented Mar 5, 2014

Looks like enabling spot may be the problem. In this run (with spot enabled and the default SGE plugin) the load balancer failed to add node002 but succeeded in adding node003. I checked node003 and it shows up in qhost and is running jobs. I feel like I've seen this in the past, where it would add some nodes and skip others due to failures when using spot, but I can't recall the exact cause.

root@master:~# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
node001                 linux-x64       1  0.01  589.6M  116.2M     0.0     0.0
node003                 linux-x64       1  0.13  589.6M  115.3M     0.0     0.0
>>> Loading full job history
>>> No jobs have completed yet!
Execution hosts: 1
Queued jobs: 39
Oldest queued job: 2014-03-05 18:05:15+00:00
Avg job duration: 0 secs
Avg job wait time: 0 secs
Last cluster modification time: 2014-03-05 18:04:15
>>> Queued jobs need more slots (39) than available (0)
>>> No queued jobs older than 300 seconds
>>> Sleeping...(looping again in 60 secs)

>>> Loading full job history
>>> No jobs have completed yet!
Execution hosts: 1
Queued jobs: 39
Oldest queued job: 2014-03-05 18:05:15+00:00
Avg job duration: 0 secs
Avg job wait time: 0 secs
Last cluster modification time: 2014-03-05 18:04:15
>>> Queued jobs need more slots (39) than available (0)
>>> A job has been waiting for 308 seconds longer than max: 300
*** WARNING - Adding 1 nodes at 2014-03-05 18:10:23.673676+00:00
>>> Launching node(s): node002
SpotInstanceRequest:sir-0af30e3d
>>> Waiting for spot requests to propagate... 
>>> Waiting for node(s) to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Waiting for SSH to come up on all nodes...
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Waiting for cluster to come up took 0.022 mins
!!! ERROR - Failed to add new host
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/starcluster/balancers/sge/__init__.py", line 710, in _eval_add_node
    self._cluster.add_nodes(need_to_add)
  File "/usr/lib/python2.7/site-packages/starcluster/cluster.py", line 1040, in add_nodes
    node = self.get_node(alias)
  File "/usr/lib/python2.7/site-packages/starcluster/cluster.py", line 810, in get_node
    raise exception.InstanceDoesNotExist(identifier, label='node')
InstanceDoesNotExist: node 'node002' does not exist
>>> Sleeping...(looping again in 60 secs)

johnbot1 commented Mar 5, 2014

I've confirmed the problem only occurs when spot is enabled in the config. Running without spot, the load balancer added all of my instances properly.

root@master:~# qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
node001                 linux-x64       1  0.01  589.6M  116.9M     0.0     0.0
node002                 linux-x64       1  0.01  589.6M  117.1M     0.0     0.0
node003                 linux-x64       1  0.01  589.6M  116.9M     0.0     0.0
node004                 linux-x64       1  0.01  589.6M  117.1M     0.0     0.0
node005                 linux-x64       1  0.04  589.6M  116.7M     0.0     0.0
node006                 linux-x64       1  0.01  589.6M  116.9M     0.0     0.0
node007                 linux-x64       1  0.01  589.6M  117.2M     0.0     0.0
node008                 linux-x64       1  0.01  589.6M  117.0M     0.0     0.0
node009                 linux-x64       1  0.01  589.6M  116.7M     0.0     0.0
node010                 linux-x64       1  0.01  589.6M  117.1M     0.0     0.0
node011                 linux-x64       1  0.02  589.6M  116.7M     0.0     0.0
node012                 linux-x64       1  0.10  589.6M  117.8M     0.0     0.0

@johnbot1 johnbot1 changed the title from Load balancer error Starcluster 95.2 - using spot to Load balancer error Starcluster 95.3 when using spot instances Mar 20, 2014

@johnbot1 johnbot1 changed the title from Load balancer error Starcluster 95.3 when using spot instances to Load balancer error Starcluster 0.95.3 when using spot instances Mar 20, 2014

jtriley added a commit that referenced this issue Apr 1, 2014

cluster: improve wait_for_active_spots
Require active spots to also have an instance_id allocated before
exiting the wait loop. This ensures that the spot requests are 'active'
and also have an instance ID that we can use in the
'wait_for_propagation' call later on. This *might* fix
InstanceDoesNotExist errors related to eventual consistency when using
spot nodes (gh-377).
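
In other words, the fix tightens the wait loop so a spot request doesn't count as ready until EC2 has attached an instance ID to it. A sketch of that logic in boto 2 (my reading of the commit message above, not the literal patch):

import time
import boto.ec2

def wait_for_active_spots(conn, request_ids, poll_interval=30):
    # Poll until every spot request is 'active' AND has an instance_id
    # allocated; only then does the later propagation wait have real
    # instance IDs to look for.
    while True:
        reqs = conn.get_all_spot_instance_requests(request_ids=request_ids)
        if all(r.state == 'active' and r.instance_id for r in reqs):
            return [r.instance_id for r in reqs]
        time.sleep(poll_interval)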

@jtriley jtriley closed this in 2f27bc1 Apr 5, 2014

johnbot1 commented Apr 5, 2014

Adding spot nodes now works great, but removing them produces the following error:
Remove 1 nodes from spot-sge-fix (y/n)? y

Running plugin starcluster.plugins.sge.SGEPlugin
Removing node001 from SGE
!!! ERROR - Error occured while running plugin 'starcluster.plugins.sge.SGEPlugin':
!!! ERROR - remote command 'source /etc/profile && qconf -de node001'
!!! ERROR - failed with status 1:
!!! ERROR - Host object "node001" is still referenced in cluster queue
!!! ERROR - "all.q".
