Running into Errors when using starcluster to setup my Ubuntu AMI #376

Open
dclark87 opened this Issue Mar 3, 2014 · 12 comments


dclark87 commented Mar 3, 2014

I have an Ubuntu 12.04 LTS Server AMI with some software installed that I'd like to clusterize with StarCluster. After pointing my StarCluster config file at my AMI ID and running 'starcluster start -c mediumcluster testcluster' (mediumcluster is the cluster template name), I eventually get this error:

Starting NFS server on master
!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':
!!! ERROR - remote command 'source /etc/profile && /etc/init.d/nfs
!!! ERROR - start' failed with status 127:
!!! ERROR - bash: /etc/init.d/nfs: No such file or directory

Is there any info regarding why this happened? Thanks.

Contributor

FinchPowers commented Mar 3, 2014

If you log in to your machine and run

file /etc/init.d/nfs

what does it return?

dclark87 commented Mar 3, 2014

Well, the file isn't there to begin with (I just checked, and the last line of the error says the same thing).

Contributor

FinchPowers commented Mar 3, 2014

StarCluster's NFS support relies on that file. I'm not sure whether an empty file will work (possibly). Otherwise, consult the one provided with the default AMIs.
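
For anyone hitting the same error, here is a minimal sketch of one way to provide that path. It assumes the NFS server package is already installed and ships its init script as /etc/init.d/nfs-kernel-server (the usual name on Ubuntu), and it simply symlinks that script to the path StarCluster calls. The function name and the optional root prefix are hypothetical, added only so the snippet can be tried against a scratch directory; compare the result with a stock StarCluster AMI before relying on it.

```shell
# Sketch only: expose the Ubuntu NFS init script under the name StarCluster runs.
# Assumption: the script lives at /etc/init.d/nfs-kernel-server.
link_nfs_init_script() {
    root="${1:-}"                                  # optional prefix for dry testing
    target="${root}/etc/init.d/nfs-kernel-server"  # script shipped by the package
    link="${root}/etc/init.d/nfs"                  # path from the error message

    if [ ! -e "$target" ]; then
        echo "not found: $target" >&2
        return 1
    fi

    # Leave an existing file or link alone; otherwise create the symlink.
    [ -e "$link" ] || ln -s "$target" "$link"
}
```

On the instance being imaged, this would be run as root with no argument (`link_nfs_init_script`), before creating the AMI.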

dclark87 commented Mar 3, 2014

Ok, I'll try adding that file to my AMI instance and re-launch a cluster to see what happens

dclark87 commented Mar 4, 2014

Ok, so I used the nfs file from the starcluster AMI and incorporated it into my AMI.

I then saw another error during cluster setup saying it couldn't find the "exportfs" command. I fixed this with sudo apt-get install nfs-kernel-server.

I retried with that installed and got an error saying SGE could not be found on my instance, so I installed SGE via sudo apt-get install gridengine-client gridengine-common gridengine-master gridengine-qmon.

After all of this, I got the following error when trying to set up a cluster with the above installed on my AMI:

Creating cluster user: sgeadmin (uid: 1001, gid: 1001)
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':
!!! ERROR - error occurred in job (id=master): remote command 'source /etc/profile && groupadd -o -g 1001 sgeadmin' failed with status 9:
groupadd: group 'sgeadmin' already exists
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/threadpool.py", line 48, in run
job.run()
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/threadpool.py", line 75, in run
r = self.method(*self.args, **self.kwargs)
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/clustersetup.py", line 210, in _add_user_to_node
node.add_user(self._user, uid, gid, self._user_shell)
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/node.py", line 472, in add_user
self.ssh.execute('groupadd -o -g %s %s' % (gid, name))
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/sshutils.py", line 578, in execute
msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && groupadd -o -g 1001 sgeadmin' failed with status 9:
groupadd: group 'sgeadmin' already exists

error occurred in job (id=node001): remote command 'source /etc/profile && groupadd -o -g 1001 sgeadmin' failed with status 9:
groupadd: group 'sgeadmin' already exists
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/threadpool.py", line 48, in run
job.run()
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/threadpool.py", line 75, in run
r = self.method(*self.args, **self.kwargs)
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/clustersetup.py", line 210, in _add_user_to_node
node.add_user(self._user, uid, gid, self._user_shell)
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/node.py", line 472, in add_user
self.ssh.execute('groupadd -o -g %s %s' % (gid, name))
File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.95.2-py2.7.egg/starcluster/sshutils.py", line 578, in execute
msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && groupadd -o -g 1001 sgeadmin' failed with status 9:
groupadd: group 'sgeadmin' already exists

!!! ERROR - Oops! Looks like you've found a bug in StarCluster
!!! ERROR - Crash report written to: /home/dclark/.starcluster/logs/crash-report-8050.txt
!!! ERROR - Please remove any sensitive data from the crash report
!!! ERROR - and submit it to starcluster@mit.edu

Contributor

FinchPowers commented Mar 4, 2014

Remove your sgeadmin group prior to creating your AMI.
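
A minimal sketch of that cleanup, assuming the sgeadmin account was created by the gridengine packages and that nothing else on the image depends on it. The function name and the dry-run argument are hypothetical. The user is removed first because groupdel refuses to delete a group that is still some user's primary group.

```shell
# Sketch only: remove a pre-existing cluster account before imaging the instance.
remove_cluster_account() {
    name="$1"
    run="${2:-sudo}"   # pass "echo" as the second argument for a dry run

    # Delete the user first, if present.
    if getent passwd "$name" >/dev/null; then
        $run userdel "$name"
    fi

    # userdel may already have removed a same-named group, so re-check.
    if getent group "$name" >/dev/null; then
        $run groupdel "$name"
    fi
}
```

Running `remove_cluster_account sgeadmin echo` prints the commands that would be executed; dropping the second argument runs them through sudo.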

dclark87 commented Mar 4, 2014

Ok, I removed the sgeadmin group and created a new AMI to configure with starcluster. Everything goes smoothly until:

Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%
Setting up NFS took 0.040 mins
Configuring passwordless ssh for root
Configuring passwordless ssh for sgeadmin
Running plugin starcluster.plugins.sge.SGEPlugin
!!! ERROR - SGE is not installed on this AMI, skipping...
Configuring cluster took 0.146 mins
Starting cluster took 1.856 mins

It reports that SGE is not installed on this AMI, even though qsub, qstat, etc. are available on the master node.

Contributor

FinchPowers commented Mar 4, 2014

Because it's not installed in the way StarCluster expects.
If you have an official StarCluster AMI, copy its /opt/sge6-fresh directory to your custom AMI. When starting a cluster, StarCluster takes SGE from there and performs the install.
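
The requirements that have come up in this thread can be collected into a rough pre-flight check to run on an instance before imaging it. This is only a sketch: the list covers just the failures reported above, not everything StarCluster may need, and the function name, the optional root prefix, and the /usr/sbin/exportfs location are assumptions.

```shell
# Sketch only: check a would-be AMI for the problems reported in this thread.
check_starcluster_ami() {
    root="${1:-}"   # optional filesystem prefix, e.g. a mounted image
    rc=0

    # StarCluster runs "/etc/init.d/nfs start" on the master.
    if [ ! -e "${root}/etc/init.d/nfs" ]; then
        echo "missing: /etc/init.d/nfs"
        rc=1
    fi

    # exportfs is needed for NFS setup (path assumed for Ubuntu).
    if [ ! -x "${root}/usr/sbin/exportfs" ]; then
        echo "missing: /usr/sbin/exportfs"
        rc=1
    fi

    # StarCluster creates the sgeadmin group itself, so it must not pre-exist.
    if [ -f "${root}/etc/group" ] && grep -q '^sgeadmin:' "${root}/etc/group"; then
        echo "conflict: group sgeadmin already exists"
        rc=1
    fi

    # The SGE plugin expects the install tree under /opt/sge6-fresh.
    if [ ! -d "${root}/opt/sge6-fresh" ]; then
        echo "missing: /opt/sge6-fresh"
        rc=1
    fi

    return "$rc"
}
```

Called with no argument it inspects the running system, for example `check_starcluster_ami && echo "no known blockers found"`.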

dclark87 commented Mar 5, 2014

So I just used one of StarCluster's AMIs as the base for my custom AMI, and the cluster now sets up successfully.

You can see how (S)GE is installed for new Ubuntu builds: https://github.com/jtriley/StarCluster/blob/develop/utils/scimage_13_04.py#L203

Basically, GE is installed to /opt/sge6-fresh.

Contributor

FinchPowers commented Aug 27, 2014

I think this can be closed.

There are still existing issues with using StarCluster with newer versions of Ubuntu. A separate ticket should track those.
