Unable to start new VPC cluster (0.95.1) #372

Closed
nitecoder opened this Issue Feb 25, 2014 · 5 comments

I'm experiencing an issue on 0.95.1 (also happened on 0.95). This does not occur on 0.94.3.
Starting a new cluster hangs on the SSH step. Please see below. How can I troubleshoot this further?

StarCluster - (http://star.mit.edu/cluster) (v. 0.95.1)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Using default cluster template: titles1-prod-us-west-1a
>>> Validating cluster template settings...
>>> Cluster template settings are valid
>>> Starting cluster...
>>> Launching a 1-node cluster...
>>> Launching master node (ami: ami-02674b47, type: m1.medium)...
>>> Creating security group @sc-titles1-prod-west-1a...
>>> Opening tcp port range 80-80 for CIDR 0.0.0.0/0
Reservation:r-89XXXXXX
>>> Waiting for cluster to come up... (updating every 30s)
>>> Waiting for all nodes to be in a 'running' state...
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Waiting for SSH to come up on all nodes...
0/1 |                                                                  |   0%  

Then it just hangs in that state for a long time.
I am attempting to run in VPC. Perhaps it's something to do with that? The same cluster config works fine on a 0.94.3 snapshot from some time ago.
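For context, the "Waiting for SSH to come up" step is essentially a TCP reachability poll against port 22 on each node, so a hang there means the connection attempts never succeed. A minimal sketch of such a poll (illustrative only; the function name and intervals are mine, not StarCluster's actual code):

```python
import socket
import time

def wait_for_ssh(host, port=22, timeout=300, interval=30):
    """Poll until a TCP connection to host:port succeeds.

    Returns True once the port accepts a connection, False on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            # If the SSH daemon is up and reachable, this connect succeeds.
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(interval)  # not reachable yet; retry
    return False
```

If the node never gets a reachable address, a loop like this spins until its timeout, which matches the hang described above.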

After I ^C out of this, listclusters shows the cluster as started, with only the master node up and no slave nodes.
I was able to get logs from this server via the Amazon console. I have the full log if you want it, but here's what looks like the relevant portion:

Skipping profile in /etc/apparmor.d/disable: usr.sbin.rsyslogd
Generating locales...
  en_US.UTF-8...
 * Starting AppArmor profiles [ OK ]
landscape-client is not configured, please run landscape-config.
 * Stopping System V initialisation compatibility [ OK ]
 * Starting System V runlevel compatibility [ OK ]
 * Starting automatic crash report generation [ OK ]
 * Starting save kernel messages [ OK ]
 * Starting ACPI daemon [ OK ]
 * Starting regular background program processing daemon [ OK ]
 * Starting deferred execution scheduler [ OK ]
 * Stopping CPU interrupts balancing daemon [ OK ]
 * Stopping save kernel messages [ OK ]
done
Generation complete.
 * Starting crash report submission daemon [ OK ]
apache2: apr_sockaddr_info_get() failed for ip-172-24-1-206
apache2: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
 * Starting web server apache2 [ OK ]
 * Stopping System V runlevel compatibility [ OK ]
failed: /var/lib/cloud/instance/scripts/_sc_aliases.txt [1]
failed: /var/lib/cloud/instance/scripts/_sc_plugins.txt [1]
failed: /var/lib/cloud/instance/scripts/_sc_volumes.txt [1]
2014-02-25 01:36:12,855 - cc_scripts_user.py[WARNING]: failed to run-parts in /var/lib/cloud/instance/scripts
2014-02-25 01:36:12,864 - __init__.py[WARNING]: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/CloudConfig/__init__.py", line 117, in run_cc_modules
    cc.handle(name, run_args, freq=freq)
  File "/usr/lib/python2.7/dist-packages/cloudinit/CloudConfig/__init__.py", line 78, in handle
    [name, self.cfg, self.cloud, cloudinit.log, args])
  File "/usr/lib/python2.7/dist-packages/cloudinit/__init__.py", line 326, in sem_and_run
    func(*args)
  File "/usr/lib/python2.7/dist-packages/cloudinit/CloudConfig/cc_scripts_user.py", line 31, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 229, in runparts
    raise RuntimeError('runparts: %i failures' % failed)
RuntimeError: runparts: 3 failures

2014-02-25 01:36:12,864 - __init__.py[ERROR]: config handling of scripts-user, None, [] failed

ec2: 
ec2: #############################################################
ec2: -----BEGIN SSH HOST KEY FINGERPRINTS-----
ec2: 1024 6e:f1:dc:4b:1e:10:9a:ba:48:55:9b:dd:49:3d:e8:42  root@ip-172-24-1-206 (DSA)
ec2: 256 7e:70:14:fd:3a:4e:8f:39:bd:86:55:f9:24:99:e3:7b  root@ip-172-24-1-206 (ECDSA)
ec2: 2048 e4:75:60:dc:1b:e7:52:e2:44:e6:23:7f:e1:48:68:fd  root@ip-172-24-1-206 (RSA)
ec2: -----END SSH HOST KEY FINGERPRINTS-----
ec2: #############################################################
-----BEGIN SSH HOST KEY KEYS-----
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIKmXFNTSmExLmzx1yHbFHyCDkTq96f8BOD8jnXzKRVeQKOPBCJ1FXM0Zr3KFo225/DB8Q+rwC1co+TOoUqwLis= root@ip-172-24-1-206
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDeP82k3MuIYzeaTzZPdspwppNhRWREBW6U+nDZu5wQirCGG+FSxirJzZIbRvbE3SRsP+EHAQ00HEGQkVqGqdi73JKnMH4JWcUWcsEh6Lz0EK5S9TsXoWAQkc5uJs3IlQGvbSvybPP2fY1XbZRTjuS1SwgQ4XZGEEacnw8G+RApjPdCq/fNLpOhVHJSZlRR1zzvkX01x+b8/Bjx8cWqVVW67fr7Gz3o14JBi9IKZjJRyLo2TkbC04Bp5zx5cnK+JlyUo8VvPfLCGPyGM4WXKnmmxsH4bNhETIosMYkL7SRxj0wFjr4jd4XizQgW+mTXXRnBARei97Vya1lTVlSWnCzZ root@ip-172-24-1-206
-----END SSH HOST KEY KEYS-----
cloud-init boot finished at Tue, 25 Feb 2014 01:36:12 +0000. Up 34.38 seconds
2014-02-25 01:36:12,931 - cloud-init-cfg[ERROR]: errors running cloud_config [final]: ['scripts-user']
errors running cloud_config [final]: ['scripts-user']

Any ideas?

jtriley (Owner) commented Feb 25, 2014

@nitecoder Can you please paste the output of $ starcluster listclusters after this happens, and also the cluster config you're using? Have you tried connecting to the instances manually via SSH when this happens?

nitecoder commented Feb 25, 2014

Hi Justin,
Trying now with 0.95.2.

Here's the config.

{'__name__': 'cluster titles1-prod-us-west-1a',
 'availability_zone': None,
 'cluster_shell': 'bash',
 'cluster_size': 1,
 'cluster_user': 'sgeadmin',
 'disable_cloudinit': False,
 'disable_queue': False,
 'dns_prefix': False,
 'extends': 'spot-prod-us-west-1',
 'force_spot_master': False,
 'key_location': 'keys/starcluster-us-west-1.rsa',
 'keyname': 'starcluster-us-west-1',
 'master_image_id': 'ami-02674b47',
 'master_instance_type': 'm1.medium',
 'node_image_id': 'ami-02674b47',
 'node_instance_type': 'm3.xlarge',
 'node_instance_types': [],
 'permissions': {'http': {'__name__': 'http',
   'cidr_ip': '0.0.0.0/0',
   'from_port': 80,
   'ip_protocol': 'tcp',
   'to_port': 80}},
 'plugins': [{'__name__': 'plugin createusers',
   'setup_class': 'starcluster.plugins.users.CreateUsers',
   'usernames': 'dmitry'}],
 'public_ips': True,
 'spot_bid': 0.5,
 'subnet_id': 'subnet-3fXXXXXX',
 'userdata_scripts': [],
 'volumes': []}

Listclusters produces:

StarCluster - (http://star.mit.edu/cluster) (v. 0.95.2)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu
... other clusters started with another version here ...
---------------------------------------------------------------
titles2-prod-west-1a (security group: @sc-titles2-prod-west-1a)
---------------------------------------------------------------
Launch time: 2014-02-25 10:26:15
Uptime: 0 days, 00:05:55
VPC: vpc-273d534e
Subnet: subnet-3fXXXXXX
Zone: us-west-1a
Keypair: starcluster-us-west-1
EBS volumes: N/A
Cluster nodes:
     master running i-73ceb52c 54.193.11.191
Total nodes: 1

Attempting sshmaster also hangs, and eventually comes back with:

ssh: connect to host 54.193.11.191 port 22: Operation timed out

System Log from AWS Console again shows the following:

failed: /var/lib/cloud/instance/scripts/_sc_aliases.txt [1]
failed: /var/lib/cloud/instance/scripts/_sc_plugins.txt [1]
failed: /var/lib/cloud/instance/scripts/_sc_volumes.txt [1]
2014-02-25 18:27:03,435 - cc_scripts_user.py[WARNING]: failed to run-parts in /var/lib/cloud/instance/scripts
2014-02-25 18:27:03,441 - __init__.py[WARNING]: Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cloudinit/CloudConfig/__init__.py", line 117, in run_cc_modules
    cc.handle(name, run_args, freq=freq)
  File "/usr/lib/python2.7/dist-packages/cloudinit/CloudConfig/__init__.py", line 78, in handle
    [name, self.cfg, self.cloud, cloudinit.log, args])
  File "/usr/lib/python2.7/dist-packages/cloudinit/__init__.py", line 326, in sem_and_run
    func(*args)
  File "/usr/lib/python2.7/dist-packages/cloudinit/CloudConfig/cc_scripts_user.py", line 31, in handle
    util.runparts(runparts_path)
  File "/usr/lib/python2.7/dist-packages/cloudinit/util.py", line 229, in runparts
    raise RuntimeError('runparts: %i failures' % failed)
RuntimeError: runparts: 3 failures

2014-02-25 18:27:03,441 - __init__.py[ERROR]: config handling of scripts-user, None, [] failed
...
cloud-init boot finished at Tue, 25 Feb 2014 18:27:03 +0000. Up 28.33 seconds
2014-02-25 18:27:03,506 - cloud-init-cfg[ERROR]: errors running cloud_config [final]: ['scripts-user']
errors running cloud_config [final]: ['scripts-user']

Is there anything else I can do to help figure this out?
I am actually in the IRC channel as dserebren right now.

Thanks!
Dmitry


nitecoder commented Feb 25, 2014

Update: I also tried changing the image to ami-02674b47 (Ubuntu 13.04 instead of the 12.04 I've been using). There is no change in behavior: it still hangs at "Waiting for SSH to come up on all nodes... 0/1".

nitecoder commented Feb 25, 2014

Quick update after troubleshooting with Justin on IRC, for others who might hit this issue:
It appears to be caused by differences between Amazon's VPC flavors in how they behave when StarCluster attempts to auto-assign a public IP to your instances (which it does even when you are using a VPC).

One workaround is to stop assigning the public IP. The instance will then be reachable from within the VPC but not from outside. To do that, add PUBLIC_IPS=False to the config (there is a command-line equivalent too).
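In config-file terms, the workaround looks like the following sketch (the section name is the template from this thread; placement of the setting follows StarCluster's cluster-template syntax):

```ini
[cluster titles1-prod-us-west-1a]
# ... existing template settings ...
# Don't auto-assign a public IP to VPC instances; nodes are then
# reachable only from inside the VPC (e.g. via another host in it).
PUBLIC_IPS = False
```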

jtriley (Owner) commented Feb 25, 2014

See http://star.mit.edu/cluster/docs/latest/manual/configuration.html#using-the-virtual-private-cloud-vpc for details on VPC support, including StarCluster's defaults and command-line options.

@jtriley jtriley closed this in 69a98f4 Mar 19, 2014
