
NFS mounts not properly cleaned up after spot request is outbid/terminated #304

Closed
nkrumm opened this issue Oct 9, 2013 · 3 comments


nkrumm commented Oct 9, 2013

If spot request nodes in a cluster are terminated (when being outbid, for example), the NFS mounts are not properly cleaned up. I can reproduce this as follows:

  1. Start a cluster, and use spot requests for 3 additional nodes
  2. Wait for nodes to be terminated (in my case I think it was due to being outbid, since the bid was low).
  3. Attempt to run starcluster addnode mycluster:
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':
!!! ERROR - error occurred in job (id=node001): remote command 'source /etc/profile && mount /home' failed with status 32:
mount.nfs: access denied by server while mounting master:/home
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 48, in run
    job.run()
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 75, in run
    r = self.method(*self.args, **self.kwargs)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 730, in mount_nfs_shares
    self.ssh.execute('mount %s' % path)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/sshutils/__init__.py", line 555, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && mount /home' failed with status 32:
mount.nfs: access denied by server while mounting master:/home

However, if I specify the name of the node to be created, the node is created just fine:

sc addnode -a mynode mycluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Launching node(s): mynode
[ deleted for brevity ]
>>> Configuring NFS exports path(s):
/opt/sge6
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Setting up NFS took 0.064 mins
>>> Updating SGE parallel environment 'orte'
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Adding parallel environment 'orte' to queue 'all.q'

And I can get it to fail by specifying a previously terminated node name:

sc addnode -a node002 mycluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Launching node(s): node002
SpotInstanceRequest:sir-fb122632
[ deleted for brevity ]
>>> Configuring NFS exports path(s):
/home /data
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':
!!! ERROR - error occurred in job (id=node002): remote command 'source /etc/profile && mount /home' failed with status 32:
mount.nfs: access denied by server while mounting master:/home
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 48, in run
    job.run()
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 75, in run
    r = self.method(*self.args, **self.kwargs)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 730, in mount_nfs_shares
    self.ssh.execute('mount %s' % path)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/sshutils/__init__.py", line 555, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && mount /home' failed with status 32:
mount.nfs: access denied by server while mounting master:/home

Happy to provide more logs. Note that this is a more detailed duplicate of the bug report I submitted to starcluster@mit.edu. Thanks!


jtriley commented Oct 15, 2013

@nkrumm Thanks for the detailed report. I've seen this before but was never able to reproduce it reliably, so I chalked it up to random failure. I'll give this a shot myself and come up with a patch that fixes this for good (hopefully ;)


nkrumm commented Oct 17, 2013

It turns out the /etc/exports file still has the old (terminated) hosts in it. Could this be the issue?


nkrumm commented Oct 17, 2013

Confirmed. Clearing the entries for the terminated nodes from /etc/hosts and /etc/exports, and restarting the NFS daemon (which may not be necessary), allows the new node(s) to come up with the same names as the terminated nodes.
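The manual cleanup above can be sketched in Python (a hypothetical helper, not StarCluster code; the node names and export options are examples):

```python
def clean_hosts(text, dead):
    """Drop /etc/hosts lines whose hostname columns match a dead node."""
    keep = []
    for line in text.splitlines():
        names = line.split()[1:]  # columns after the IP address
        if not any(n in dead for n in names):
            keep.append(line)
    return "\n".join(keep) + "\n"

def clean_exports(text, dead):
    """Remove only the dead nodes' host clauses from /etc/exports lines,
    dropping a line entirely if no hosts remain for that path."""
    keep = []
    for line in text.splitlines():
        fields = line.split()
        hosts = [f for f in fields[1:] if f.split("(")[0] not in dead]
        if hosts:
            keep.append(" ".join(fields[:1] + hosts))
    return "\n".join(keep) + "\n"
```

After rewriting /etc/exports on the master, `exportfs -ra` makes the NFS server re-read it, which is why a full daemon restart may not be needed.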

Presumably, this is happening because an AWS-forced termination of the nodes (whether from being outbid or via "Terminate" in the AWS console) does not trigger the standard teardown functions (such as Node.stop_exporting_fs_to_nodes and Node.remove_from_etc_hosts).

@jtriley, is there a mechanism already in place that monitors AWS-triggered terminations? One option would be to put a shutdown script on each node that notifies the master that the node is going down. Another might be to periodically poll for changes. Perhaps you already have something in mind here?
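The polling option could look roughly like this (a hypothetical sketch: `list_live` stands in for an EC2 describe-instances call, and `on_terminated` would run the master-side /etc/exports and /etc/hosts cleanup):

```python
import time

def poll_for_terminations(list_live, on_terminated, interval=60, max_polls=None):
    """Diff successive snapshots of live node names and call
    on_terminated(name) once for each node that has vanished."""
    known = set(list_live())
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        live = set(list_live())
        for name in sorted(known - live):
            on_terminated(name)
        known = live
        polls += 1
```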

nkrumm added a commit to nkrumm/StarCluster that referenced this issue Oct 17, 2013
jtriley added a commit that referenced this issue Nov 19, 2013
The previous fix simply blasted all /etc/exports entries for each node
before exporting paths to nodes. This can potentially kill other export
paths not being exported by export_fs_to_nodes given that they will not
be redefined. Added `paths` kwarg to Node.stop_exporting_fs_to_nodes
that causes only the specified export paths for each node to be removed
from /etc/exports instead of all entries for each node.

Thanks to @realoptimal for the catch:

f7c4967#commitcomment-4640990
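The narrower behaviour the commit message describes can be sketched as follows (a hypothetical illustration, not the actual StarCluster patch; export options are examples):

```python
def remove_node_exports(exports_text, node, paths):
    """Remove only the given export paths for `node` from /etc/exports
    text, leaving the node's other entries (and other nodes' entries)
    intact. A path line is dropped once no hosts remain on it."""
    out = []
    for line in exports_text.splitlines():
        fields = line.split()
        if fields and fields[0] in paths:
            hosts = [f for f in fields[1:] if f.split("(")[0] != node]
            if hosts:
                out.append(" ".join(fields[:1] + hosts))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Keying the removal on both the path and the node avoids the problem @realoptimal caught: paths exported outside of export_fs_to_nodes are never touched, so they don't need to be redefined afterwards.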