Should delete drive mappings when finishing job #65

Closed
weshinsley opened this issue Apr 26, 2019 · 3 comments


weshinsley commented Apr 26, 2019

It appears that the HPC scheduler, when it "logs out" a user after running a job, does not automatically do a "net use ... /delete" to unmap a drive letter that was previously mapped.

Once there are 16,384 such leftover connections between a particular cluster node and a particular network server (identified by IP address, not by share), the server refuses any further mappings from that node, with the error "Not enough server storage is available to process this command."
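
For reference, the cleanup that is not happening can be done by hand; a minimal example, assuming P: is the stale mapping (the drive letter is a placeholder):

rem List the mappings visible to the current session
net use
rem Drop one of them; /y suppresses the confirmation prompt
net use P: /delete /y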

A fix should:

  1. Try to unmap the drive letters at the end of a job.
  2. Treat a failure to unmap the drive as something to be ignored, rather than reported as a job failure (a sketch of this pattern follows the list).
  3. Test that unmapping at the end of one job does not interfere with other concurrent jobs for the same user on the same node that have similar mappings.
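
A minimal sketch of the pattern in (1) and (2), assuming the job runs through a cmd batch wrapper; the drive letter, share and script names here are placeholders, not the actual template used by the fix:

rem Map the drive and do the real work
net use X: \\some-server\some-share /y
X:
Rscript my_script.R

rem Remember the job's exit code before the cleanup commands overwrite it
set Err=%ErrorLevel%

rem (1) move off the mapped drive and unmap it at the end of the job;
rem (2) discard any output or error from the unmap rather than failing the job on it
%SystemDrive%
net use X: /delete /y >nul 2>&1

rem Report the exit code of the real work, not of the cleanup
exit /b %Err%

(The tested run.bat in the comment below does the same thing, but restores the ErrorLevel variable directly rather than using exit /b.)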

Note also that rebooting either the cluster node or the server clears all the leftover connections. Historically, rebooting for updates and so on has been frequent enough that this issue was never triggered - it needs an accumulation of 16,384 jobs on a single node mapping to a single server, with no reboot in between. Job patterns and the availability of other nodes have recently contributed to this being triggered on multiple nodes.

Another helpful workaround is to encourage the use of workers, instead of large numbers of individual jobs.

@weshinsley weshinsley self-assigned this Apr 26, 2019

weshinsley commented Apr 26, 2019

(3) - Checked; it does indeed work. Here is a hideous example.

Consider this R code in Q:\clusternet\sleep_test.R

# Sleep for the number of seconds given on the command line,
# then write a small csv to the mapped P: drive
args <- commandArgs(trailingOnly = TRUE)
sleep_time <- as.numeric(args[1])
Sys.sleep(sleep_time)
df <- data.frame(time=sleep_time)
write.csv(df, file=paste0("P:/output", sleep_time, ".csv"))

Each job is run with this batch, Q:\clusternet\run.bat

rem Map P: to the network share; /y answers any prompt
net use P: \\fi--san03.dide.ic.ac.uk\homes\wrh1\clusternet /y
rem Set up 64-bit R 3.5.1
call setr64_3_5_1
rem Run the script from the mapped drive
P:
Rscript sleep_test.R %1
rem Remember the script's exit code before the cleanup commands overwrite it
set Err=%ErrorLevel%
rem Move back to the system drive so P: is no longer in use, then unmap it
%SystemDrive%
net use P: /delete /y
rem Restore the saved exit code
set ErrorLevel=%Err%

and the lot are launched with:

set SUB=job submit /scheduler:fi--didemrchnb /numcores:1 /jobtemplate:GeneralNodes /requestednodes:fi--didemrc09 \\fi--san03.dide.ic.ac.uk\homes\wrh1\clusternet\run.bat
%SUB% 100
%SUB% 50
%SUB% 20
%SUB% 10
%SUB% 5
%SUB% 3
%SUB% 2
%SUB% 1

These are run on an empty node and all run concurrently; all the csv files are correctly written, so the 1-second job closing the connection does not disrupt the progress of the 100-second run.

Not quite sure why the '/y' is necessary on the /delete line - %SystemDrive% moves us into C: (or wherever Windows lives), so it must be some sort of network file caching lag...

Also tested that this removes the entry from the server end - and it does! (whereas without the net use /delete, it doesn't, and the zombie connections persist on the server)
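
For anyone repeating that check, one way to watch the entries from the server end is to list the incoming sessions and filter for the node - an assumed method using standard Windows commands, not necessarily the exact check used above, and the node may appear by IP address rather than by name:

rem Run on the file server: list incoming sessions, filtered for the test node
net session | find /i "didemrc09"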

@weshinsley

(2) is correctly handled and tested, by remembering the ERRORLEVEL before the unmap and restoring it afterwards.

(1) also seems to work nicely, confirmed on the server.

@weshinsley

#66
