Description
In MPI parallel jobs, where every rank (or every n ranks) may initialize a statepoint on the fly, a race condition crashes the job when no workspace directory is present. The cause is that this directory is created on the fly simultaneously by the job.init() calls of all members of the same ensemble. A detailed look identifies the mkdir -p (line 337 of signac/contrib/job.py at 2c0d9e4) as the culprit. The mkdir should not include the -p option, but operate only on the target directory; instead, job workspaces should be created by default upon signac init.
Optionally, when they are deleted (for whatever reason), they could be recreated by an explicit init call, properly guarded by MPI barriers. For now, the workaround is to create the workspace directory manually.
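The race can be reproduced without MPI: when several concurrent workers call a bare os.makedirs on the same missing directory, exactly one wins and the rest raise FileExistsError. A minimal thread-based sketch (paths, worker count, and the naive helper are illustrative, not signac's actual code):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a workspace/job directory path.
workspace = os.path.join(tempfile.mkdtemp(), "workspace", "job_id")

def worker(_):
    try:
        # A bare os.makedirs fails if the leaf directory already exists,
        # which is exactly what happens to every rank that loses the race.
        os.makedirs(workspace)
        return "created"
    except FileExistsError:
        return "exists"

# Threads stand in for MPI ranks racing to initialize the same job.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(worker, range(8)))

print(results.count("created"), results.count("exists"))  # 1 7
```

Only one worker ever creates the directory; the other seven fail, regardless of timing, because the leaf directory exists by the time they call makedirs.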
To reproduce
```python
sp = {'just': 'a statepoint'}
project = signac.get_project()

# When initializing jobs on the fly in MPI setting:
job = project.open_job(sp)
device.comm.barrier_all()  # HOOMD next
if device.comm.partition == 0 and device.comm.rank == 0:
    job.init()  # this is the problematic line
device.comm.barrier_all()
```
This would be the suggested way to do it (pending changes to signac):
```python
sp = {'just': 'a statepoint'}
project = signac.get_project()

device.comm.barrier_all()
if device.comm.partition == 0 and device.comm.rank == 0:
    project.init()
device.comm.barrier_all()

# When initializing jobs on the fly in MPI setting:
job = project.open_job(sp)
device.comm.barrier_all()
if device.comm.partition == 0 and device.comm.rank == 0:
    job.init()  # this is the problematic line
device.comm.barrier_all()
```
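The barrier-guarded pattern above generalizes beyond HOOMD's device.comm. A generic sketch, with threads standing in for MPI ranks and a plain callable standing in for barrier_all (all names here are hypothetical, not part of signac or HOOMD):

```python
import threading

def guarded_init(rank, barrier, init_fn):
    """Run init_fn on rank 0 only, fenced by barriers on both sides.

    `barrier` is any callable that blocks until all ranks reach it
    (a stand-in for device.comm.barrier_all or an MPI barrier).
    """
    barrier()        # no rank proceeds until all have arrived
    if rank == 0:
        init_fn()    # exactly one rank performs the initialization
    barrier()        # the others wait until the result is visible

# Demonstration: four "ranks" as threads sharing a reusable barrier.
n_ranks = 4
barrier = threading.Barrier(n_ranks)
created = []

def rank_main(rank):
    guarded_init(rank, barrier.wait, lambda: created.append(rank))

threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(n_ranks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(created)  # [0]
```

Only rank 0 runs the initializer, and the trailing barrier guarantees no rank touches the workspace before it exists.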
Error output
If possible, copy any terminal outputs or attach screenshots that provide additional information on the problem.
```
Error occured while trying to create workspace directory for job '42e1468f85f2b689659f23bae00de995'.
State point manifest file of job '42e1468f85f2b689659f23bae00de995' appears to be corrupted.
Traceback (most recent call last):
  File "1brf_umbrella.py", line 125, in <module>
    job.init()
  File "/ccs/home/glaser/.conda/envs/myenv/lib/python3.7/site-packages/signac/contrib/job.py", line 427, in init
    self._init(force=force)
  File "/ccs/home/glaser/.conda/envs/myenv/lib/python3.7/site-packages/signac/contrib/job.py", line 354, in _init
    _mkdir_p(self._wd)
  File "/ccs/home/glaser/.conda/envs/myenv/lib/python3.7/site-packages/signac/contrib/utility.py", line 171, in _mkdir_p
    os.makedirs(path)
  File "/ccs/home/glaser/.conda/envs/myenv/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
...
```
System configuration
Please complete the following information:
Operating System [e.g. macOS]: Linux login2 4.14.0-115.8.1.el7a.ppc64le #1 SMP Thu May 9 14:45:13 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux (really, any OS)
Version of Python [e.g. 3.7]: Python 3.7.4 (really, any python3 version)
Version of signac [e.g. 1.0]: signac 1.0.0 (but looking at the code the problem may also be present in newer versions)
I agree with @csadorf that project initialization should create an empty workspace directory, and that we should leave the current behavior for creating job directories as mkdir -p (actually, its Python equivalent). When initializing a new job there is then no race to create the workspace, since it already exists; it only needs to be recreated if the user has deleted it.
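The mkdir -p semantics the comment describes correspond to Python's os.makedirs(..., exist_ok=True), which tolerates losing the race: a caller that finds the directory already created simply succeeds. A sketch of such a race-tolerant helper (the name mkdir_p and the paths are illustrative, not signac's internals):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def mkdir_p(path):
    # Equivalent of `mkdir -p`: succeeds whether or not the directory
    # already exists, so concurrent callers cannot crash each other.
    os.makedirs(path, exist_ok=True)

# Hypothetical workspace path; eight concurrent callers, none raises.
workspace = os.path.join(tempfile.mkdtemp(), "workspace", "job_id")
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda _: mkdir_p(workspace), range(8)))

print(os.path.isdir(workspace))  # True
```

With exist_ok=True, makedirs catches the FileExistsError internally as long as the final path is a directory, so every rank returns successfully no matter who created it first.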