
NFS mounts not properly cleaned up after spot request is outbid/terminated #304

Closed
nkrumm opened this issue Oct 9, 2013 · 3 comments


nkrumm commented Oct 9, 2013

If spot request nodes in a cluster are terminated (when being outbid, for example), the NFS mounts are not properly cleaned up. I can reproduce this as follows:

  1. Start a cluster, and use spot requests for 3 additional nodes
  2. Wait for nodes to be terminated (in my case I think it was due to being outbid, since the bid was low).
  3. Attempt to run starcluster addnode mycluster:
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':
!!! ERROR - error occurred in job (id=node001): remote command 'source /etc/profile && mount /home' failed with status 32:
mount.nfs: access denied by server while mounting master:/home
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 48, in run
    job.run()
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 75, in run
    r = self.method(*self.args, **self.kwargs)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 730, in mount_nfs_shares
    self.ssh.execute('mount %s' % path)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/sshutils/__init__.py", line 555, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && mount /home' failed with status 32:
mount.nfs: access denied by server while mounting master:/home

However, if I specify the name of the node to be created, the node is created just fine:

sc addnode -a mynode mycluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Launching node(s): mynode
[ deleted for brevity ]
>>> Configuring NFS exports path(s):
/opt/sge6
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Setting up NFS took 0.064 mins
>>> Updating SGE parallel environment 'orte'
2/2 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
>>> Adding parallel environment 'orte' to queue 'all.q'

And I can get it to fail by specifying a previously terminated node name:

sc addnode -a node002 mycluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Launching node(s): node002
SpotInstanceRequest:sir-fb122632
[ deleted for brevity ]
>>> Configuring NFS exports path(s):
/home /data
>>> Mounting all NFS export path(s) on 1 worker node(s)
1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100%  
!!! ERROR - Error occured while running plugin 'starcluster.clustersetup.DefaultClusterSetup':
!!! ERROR - error occurred in job (id=node002): remote command 'source /etc/profile && mount /home' failed with status 32:
mount.nfs: access denied by server while mounting master:/home
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 48, in run
    job.run()
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/threadpool.py", line 75, in run
    r = self.method(*self.args, **self.kwargs)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/node.py", line 730, in mount_nfs_shares
    self.ssh.execute('mount %s' % path)
  File "/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/StarCluster-0.9999-py2.7.egg/starcluster/sshutils/__init__.py", line 555, in execute
    msg, command, exit_status, out_str)
RemoteCommandFailed: remote command 'source /etc/profile && mount /home' failed with status 32:
mount.nfs: access denied by server while mounting master:/home

Happy to provide more logs. Note that this is a more detailed duplicate of the bug report I submitted to starcluster@mit.edu. Thanks!


jtriley commented Oct 15, 2013

@nkrumm Thanks for the detailed report. I've seen this before but was never able to reproduce it reliably, so I chalked it up to random failure. I'll give this a shot myself and come up with a patch that fixes this for good (hopefully ;)


nkrumm commented Oct 17, 2013

It turns out the /etc/exports file still has the old (terminated) hosts in it. Could this be the issue?


nkrumm commented Oct 17, 2013

Confirmed. Clearing the entries for the terminated nodes from /etc/hosts and /etc/exports, and restarting the NFS daemon (which may not be necessary), allows the new node(s) to come up with the same names as the terminated nodes.
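The manual cleanup above can be sketched in Python (a hypothetical helper, not StarCluster code; the node names and export options are examples):

```python
def clean_hosts(text, dead):
    """Drop /etc/hosts lines whose hostname columns match a dead node."""
    keep = []
    for line in text.splitlines():
        names = line.split()[1:]  # columns after the IP address
        if not any(n in dead for n in names):
            keep.append(line)
    return "\n".join(keep) + "\n"

def clean_exports(text, dead):
    """Remove only the dead nodes' host clauses from /etc/exports lines,
    dropping a line entirely if no hosts remain for that path."""
    keep = []
    for line in text.splitlines():
        fields = line.split()
        hosts = [f for f in fields[1:] if f.split("(")[0] not in dead]
        if hosts:
            keep.append(" ".join(fields[:1] + hosts))
    return "\n".join(keep) + "\n"
```

After rewriting /etc/exports on the master, `exportfs -ra` makes the NFS server re-read it, which is why a full daemon restart may not be needed.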

Presumably, this is happening because an AWS-forced termination of the nodes (whether from being outbid or via "Terminate" in the AWS console) does not trigger the standard teardown functions (such as Node.stop_exporting_fs_to_nodes and Node.remove_from_etc_hosts).

@jtriley, is there a mechanism already in place that monitors AWS-triggered terminations? One option would be to put a shutdown script on each node that notifies the master that the node is going down. Another might be to periodically poll for changes. Perhaps you already have something in mind here?
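The polling option could look roughly like this (a hypothetical sketch: `list_live` stands in for an EC2 describe-instances call, and `on_terminated` would run the master-side /etc/exports and /etc/hosts cleanup):

```python
import time

def poll_for_terminations(list_live, on_terminated, interval=60, max_polls=None):
    """Diff successive snapshots of live node names and call
    on_terminated(name) once for each node that has vanished."""
    known = set(list_live())
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        live = set(list_live())
        for name in sorted(known - live):
            on_terminated(name)
        known = live
        polls += 1
```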

nkrumm added a commit to nkrumm/StarCluster that referenced this issue Oct 17, 2013
jtriley added a commit that referenced this issue Nov 19, 2013
The previous fix simply blasted all /etc/exports entries for each node
before exporting paths to nodes. This can potentially kill other export
paths not being exported by export_fs_to_nodes given that they will not
be redefined. Added `paths` kwarg to Node.stop_exporting_fs_to_nodes
that causes only the specified export paths for each node to be removed
from /etc/exports instead of all entries for each node.

Thanks to @realoptimal for the catch:

f7c4967#commitcomment-4640990
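The narrower behaviour the commit message describes can be sketched as follows (a hypothetical illustration, not the actual StarCluster patch; export options are examples):

```python
def remove_node_exports(exports_text, node, paths):
    """Remove only the given export paths for `node` from /etc/exports
    text, leaving the node's other entries (and other nodes' entries)
    intact. A path line is dropped once no hosts remain on it."""
    out = []
    for line in exports_text.splitlines():
        fields = line.split()
        if fields and fields[0] in paths:
            hosts = [f for f in fields[1:] if f.split("(")[0] != node]
            if hosts:
                out.append(" ".join(fields[:1] + hosts))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

Keying the removal on both the path and the node avoids the problem @realoptimal caught: paths exported outside of export_fs_to_nodes are never touched, so they don't need to be redefined afterwards.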