NFS mounts not properly cleaned up after spot request is outbid/terminated #304
Comments
@nkrumm Thanks for the detailed report. I've seen this before but was never able to reproduce it reliably, so I chalked it up to random failure. I'll give this a shot myself and come up with a patch that fixes this for good (hopefully ;)
It turns out the /etc/exports file still has the old (terminated) hosts in it. Could this be the issue?
Confirmed. Clearing the entries for the terminated nodes in /etc/hosts and /etc/exports, and restarting the NFS daemon (which may not be necessary), allows the new node(s) to come up with the same names as the terminated nodes. Presumably this is happening because the AWS-forced termination of the nodes (whether from being outbid or from a "terminate" in the AWS console) does not trigger the standard terminate functions. @jtriley, is there a mechanism already in place that monitors AWS-triggered terminations? One option is to put a shutdown script on each node which notifies the master that the node is going down. Another might be to periodically poll for changes. Perhaps you already have something in mind here?
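For reference, the manual cleanup described above can be sketched roughly as follows. This is only an illustration, not StarCluster code: `node001` is a hypothetical alias for the terminated node, and the NFS service name varies by distribution. It would need to run as root on the master node.

```python
import re
import subprocess

def remove_node_entries(path, node):
    """Drop any line in `path` that references `node`, either as a
    hostname (/etc/hosts) or as an NFS export client such as
    'node001(rw,async)' (/etc/exports)."""
    pattern = re.compile(r'(^|\s)%s(\s|\(|$)' % re.escape(node))
    with open(path) as f:
        lines = f.readlines()
    with open(path, 'w') as f:
        f.writelines(l for l in lines if not pattern.search(l))

def cleanup_terminated_node(node):
    # 'node' is the cluster alias of the terminated instance, e.g. 'node001'
    remove_node_entries('/etc/hosts', node)
    remove_node_entries('/etc/exports', node)
    subprocess.call(['exportfs', '-ra'])            # re-sync kernel export table
    subprocess.call(['service', 'nfs', 'restart'])  # may not be necessary;
                                                    # service name varies by distro
```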
The previous fix simply blasted all /etc/exports entries for each node before exporting paths to the nodes. This can potentially kill other export paths not being exported by export_fs_to_nodes, given that they will not be redefined. Added a `paths` kwarg to Node.stop_exporting_fs_to_nodes that causes only the specified export paths for each node to be removed from /etc/exports, instead of all entries for each node. Thanks to @realoptimal for the catch: f7c4967#commitcomment-4640990
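Roughly, the difference can be sketched as follows. This is a simplified stand-in for the behavior described, not StarCluster's actual implementation; it assumes /etc/exports lines of the usual form `/path host1(opts) host2(opts)`:

```python
def filter_exports(lines, aliases, paths=None):
    """Remove export entries for hosts in `aliases` from /etc/exports lines.

    With paths=None, every entry for those hosts is removed (the old
    behavior). If `paths` is given, only lines whose export path is in
    `paths` are touched, so unrelated exports survive.
    (Simplified sketch, not StarCluster's actual code.)
    """
    out = []
    for line in lines:
        fields = line.split()
        if fields and not line.startswith('#') and \
                (paths is None or fields[0] in paths):
            # keep only client entries that do not belong to the given hosts
            clients = [c for c in fields[1:] if c.split('(')[0] not in aliases]
            if not clients:
                continue  # no clients left for this path: drop the line
            line = ' '.join([fields[0]] + clients) + '\n'
        out.append(line)
    return out
```

Passing the export paths used by export_fs_to_nodes removes exactly those entries and leaves any other exports for the node intact, which is the catch the commit comment addresses.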
If spot request nodes in a cluster are terminated (when being outbid, for example), the NFS mounts are not properly cleaned up. I can test this in the following way:
```
starcluster addnode mycluster
```
However, if I specify the name of the node to be created, the node is created just fine:
And I can get it to fail by specifying a previously terminated node name:
Happy to provide more logs. Note that this is a duplicate (but more detailed) bug report of the one I submitted to starcluster@mit.edu. Thanks!