Should delete drive mappings when finishing job #65
Comments
(3) - Checked, it indeed works. Here is a hideous example. Consider an R script, Q:\clusternet\sleep_test.R, which each job runs via the batch file Q:\clusternet\run.bat, with the whole lot launched together as concurrent jobs (an illustrative sketch of this kind of harness is given below).
These are run on an empty node and all run concurrently; all the csv files are correctly written, so the 1-second job closing the connection does not disrupt the progress of the 100-second run. Not quite sure why the '/y' is necessary on the /delete line - %SystemDrive% moves us into C: (or wherever Windows lives) first, so it must be some sort of network file caching lag... Also tested that this removes the entry from the server end - and it does! (whereas without the net use /delete, it doesn't, and the zombie connections persist on the server)
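For concreteness, an illustrative guess at the shape of run.bat - not the real file - assuming sleep_test.R sleeps for a requested number of seconds and then writes a csv back to the mapped drive:

```bat
@echo off
rem Illustrative stand-in for Q:\clusternet\run.bat, not the actual file.
rem The sleep length (in seconds) is passed as the first argument; the R
rem script is assumed to sleep for that long and then write a csv to Q:.
Rscript Q:\clusternet\sleep_test.R %1
```

Launching the lot would then be a submission loop (for example, HPC Pack's job submit command inside a for /l loop with sleep lengths from 1 to 100 seconds), though the exact launch command used in the test is not shown here.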
(2) is correctly handled and tested, by remembering the ERRORLEVEL before and restoring it afterwards. (1) also seems to work nicely, confirmed on the server.
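Putting the pieces together, a minimal sketch of the cleanup being described - an illustration only, not the actual run.bat. The JOB_EXIT variable name is made up, the Rscript line is the assumed job step from the sketch above, and cd /d %SystemDrive%\ simply stands in for however the working directory gets moved off the mapped drive:

```bat
@echo off
rem The job itself runs first (the assumed Rscript call from the sketch above).
Rscript Q:\clusternet\sleep_test.R %1

rem (2) Remember the job's exit code before the cleanup commands overwrite ERRORLEVEL.
set JOB_EXIT=%ERRORLEVEL%

rem Move off the mapped drive, then unmap it. /y answers the "open files"
rem prompt that otherwise appears - presumably the network file caching lag
rem mentioned above.
cd /d %SystemDrive%\
net use Q: /delete /y

rem (2) Restore the remembered exit code so HPC sees the job's real status.
exit /b %JOB_EXIT%
```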
It appears that HPC "logging out" a user after running a job does not automatically do a "net use ... /delete" to unmap a drive letter that was previously mapped.
Once there are 16,384 such leftovers between a particular cluster node and a particular network server (identified by IP address, not by share), the server refuses to allow future mappings, with the error "Not enough server storage is available to process this command."
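As an aside - this is an assumption about how one might check for the problem, not something described in the thread - the server-side leftovers should be visible with a standard command:

```bat
rem Run on the file server (needs administrative rights): lists the sessions
rem clients currently hold open against this server. Connections left behind
rem by cluster nodes that never issued "net use ... /delete" should show up
rem here, and disappear after a reboot of either machine.
net session
```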
A fix should:
But also note that rebooting either the cluster node or the server clears all the leftovers. Historically, rebooting for updates etc. has been frequent enough that this issue was never triggered - it needs an accumulation of 16,384 jobs on a single node mapping to a single server, with no reboot in between. Job patterns, and the availability of other nodes, have contributed to this being triggered on multiple nodes recently.
Another helpful workaround is to encourage worker use instead of large numbers of individual jobs.