Should delete drive mappings when finishing job #65

Closed
weshinsley opened this issue Apr 26, 2019 · 3 comments


weshinsley commented Apr 26, 2019

It appears that the HPC scheduler, when it "logs out" a user after running a job, does not automatically do a "net use ... /delete" to unmap a drive letter that was previously mapped.

Once there are 16,384 such leftover connections between a particular cluster node and a particular network server (identified by IP address, not by share), the server refuses any further mappings from that node, with the error "Not enough server storage is available to process this command."
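
For reference, the cleanup that is not happening can be done by hand; a minimal example, assuming P: is the stale mapping (the drive letter is a placeholder):

rem List the mappings visible to the current session
net use
rem Drop one of them; /y suppresses the confirmation prompt
net use P: /delete /y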

A fix should:

  1. Try to unmap the drive letters at the end of a job.
  2. Treat a failure to unmap the drive as something to be ignored, rather than reported as a job failure (a sketch of this pattern follows the list).
  3. Test that unmapping at the end of one job does not interfere with other concurrent jobs for the same user on the same node that have similar mappings.
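
A minimal sketch of the pattern in (1) and (2), assuming the job runs through a cmd batch wrapper; the drive letter, share and script names here are placeholders, not the actual template used by the fix:

rem Map the drive and do the real work
net use X: \\some-server\some-share /y
X:
Rscript my_script.R

rem Remember the job's exit code before the cleanup commands overwrite it
set Err=%ErrorLevel%

rem (1) move off the mapped drive and unmap it at the end of the job;
rem (2) discard any output or error from the unmap rather than failing the job on it
%SystemDrive%
net use X: /delete /y >nul 2>&1

rem Report the exit code of the real work, not of the cleanup
exit /b %Err%

(The tested run.bat in the comment below does the same thing, but restores the ErrorLevel variable directly rather than using exit /b.)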

Note also that rebooting either the cluster node or the server clears all the leftover connections. Historically, rebooting for updates and so on has been frequent enough that this issue was never triggered - it needs an accumulation of 16,384 jobs on a single node mapping to a single server, with no reboot in between. Job patterns and the availability of other nodes have recently contributed to this being triggered on multiple nodes.

Another helpful workaround is to encourage the use of workers, instead of large numbers of individual jobs.

@weshinsley weshinsley self-assigned this Apr 26, 2019

weshinsley commented Apr 26, 2019

(3) - Checked; it does indeed work. Here is a hideous example.

Consider this R code in Q:\clusternet\sleep_test.R

# Sleep for the number of seconds given on the command line,
# then write a small csv to the mapped P: drive
args <- commandArgs(trailingOnly = TRUE)
sleep_time <- as.numeric(args[1])
Sys.sleep(sleep_time)
df <- data.frame(time=sleep_time)
write.csv(df, file=paste0("P:/output", sleep_time, ".csv"))

Each job is run with this batch, Q:\clusternet\run.bat

rem Map P: to the network share; /y answers any prompt
net use P: \\fi--san03.dide.ic.ac.uk\homes\wrh1\clusternet /y
rem Set up 64-bit R 3.5.1
call setr64_3_5_1
rem Run the script from the mapped drive
P:
Rscript sleep_test.R %1
rem Remember the script's exit code before the cleanup commands overwrite it
set Err=%ErrorLevel%
rem Move back to the system drive so P: is no longer in use, then unmap it
%SystemDrive%
net use P: /delete /y
rem Restore the saved exit code
set ErrorLevel=%Err%

and the lot are launched with:

set SUB=job submit /scheduler:fi--didemrchnb /numcores:1 /jobtemplate:GeneralNodes /requestednodes:fi--didemrc09 \\fi--san03.dide.ic.ac.uk\homes\wrh1\clusternet\run.bat
%SUB% 100
%SUB% 50
%SUB% 20
%SUB% 10
%SUB% 5
%SUB% 3
%SUB% 2
%SUB% 1

These are run on an empty node and all run concurrently; all the csv files are correctly written, so the 1-second job closing the connection does not disrupt the progress of the 100-second run.

Not quite sure why the '/y' is necessary on the /delete line - %SystemDrive% moves us into C: (or wherever Windows lives), so it must be some sort of network file caching lag...

Also tested that this removes the entry from the server end - and it does! (whereas without the net use /delete, it doesn't, and the zombie connections persist on the server)
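
For anyone repeating that check, one way to watch the entries from the server end is to list the incoming sessions and filter for the node - an assumed method using standard Windows commands, not necessarily the exact check used above, and the node may appear by IP address rather than by name:

rem Run on the file server: list incoming sessions, filtered for the test node
net session | find /i "didemrc09"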

@weshinsley

(2) is correctly handled and tested, by remembering the ERRORLEVEL before the unmap and restoring it afterwards.

(1) also seems to work nicely, confirmed on the server.

@weshinsley

#66
