Fix race condition when workspace is created #267

Closed
jglaser opened this issue Jan 2, 2020 · 3 comments
Labels
bug Something isn't working good first issue Good for newcomers
Comments

jglaser commented Jan 2, 2020

Description

In MPI-parallel jobs where every rank (or every n ranks) may initialize a state point on the fly, a race condition crashes the job when no workspace directory exists yet. The cause is that every job.init() call in the same ensemble tries to create this directory at the same time.

A detailed look identifies the mkdir -p in

_mkdir_p(self._wd)

as the culprit. The mkdir should not use the -p option and should instead operate only on the target directory. Job workspaces should be created by default upon signac init.

Optionally, when they are deleted (for whatever reason), they could be recreated by an explicit init call, which would be properly guarded by MPI barriers. For now, the workaround is to create the workspace dir manually.
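
A minimal sketch of the non-recursive variant proposed above, using only the Python standard library. The helper names `create_workspace` and `create_job_dir` and the shortened job id are hypothetical and are not part of signac's API. The point is only that `os.mkdir` creates the target directory alone and raises `FileNotFoundError` when the parent workspace is missing, instead of silently creating it.

```python
import os
import tempfile


def create_workspace(project_root):
    """Hypothetical helper: create the workspace once, at project init time."""
    workspace = os.path.join(project_root, "workspace")
    os.makedirs(workspace, exist_ok=True)
    return workspace


def create_job_dir(workspace, job_id):
    """Hypothetical helper: create only the job directory (no `-p` behavior)."""
    job_dir = os.path.join(workspace, job_id)
    # os.mkdir never creates parents; a missing workspace raises FileNotFoundError.
    os.mkdir(job_dir)
    return job_dir


with tempfile.TemporaryDirectory() as root:
    # Without a workspace, job creation fails loudly instead of racing to create it.
    try:
        create_job_dir(os.path.join(root, "workspace"), "42e1468f")
        missing_workspace_detected = False
    except FileNotFoundError:
        missing_workspace_detected = True

    # Once the workspace exists, creating the job directory succeeds.
    workspace = create_workspace(root)
    job_dir = create_job_dir(workspace, "42e1468f")
    job_dir_created = os.path.isdir(job_dir)
```

With this split, only the one-time workspace creation would need to be guarded by barriers in an MPI setting.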

To reproduce

import signac

sp = {'just': 'a statepoint'}

# `device` is assumed to be an already-created HOOMD device object.
project = signac.get_project()

# When initializing jobs on the fly in MPI setting:
job = project.open_job(sp)
device.comm.barrier_all() # HOOMD next
if device.comm.partition == 0 and device.comm.rank == 0:
    job.init() # this is the problematic line
device.comm.barrier_all()

This would be the suggested way to do it (pending changes to signac):

sp = {'just': 'a statepoint'}

project = signac.get_project()
device.comm.barrier_all()
if device.comm.partition == 0 and device.comm.rank == 0:
    project.init()
device.comm.barrier_all()

# When initializing jobs on the fly in MPI setting:
job = project.open_job(sp)
device.comm.barrier_all()
if device.comm.partition == 0 and device.comm.rank == 0:
    job.init() # this is the problematic line
device.comm.barrier_all()
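
The guard pattern above can be tried without HOOMD or MPI. The following is a sketch that simulates ranks with Python threads, with `threading.Barrier` standing in for `device.comm.barrier_all()`. The rank count, the directory names, and the `run_rank` function are made up for illustration, and threads on a local filesystem will not reproduce the behavior of a parallel filesystem.

```python
import os
import tempfile
import threading

N_RANKS = 8  # assumed rank count, for illustration only


def run_rank(rank, workspace, barrier, errors):
    """Simulated rank: rank 0 creates the workspace, then every rank creates its job dir."""
    try:
        barrier.wait()
        if rank == 0:
            os.mkdir(workspace)  # done by exactly one rank
        barrier.wait()  # no rank proceeds until the workspace exists
        os.mkdir(os.path.join(workspace, f"job_{rank}"))
    except Exception as error:
        # Record the failure and release the other ranks instead of deadlocking.
        errors.append(error)
        barrier.abort()


with tempfile.TemporaryDirectory() as root:
    workspace = os.path.join(root, "workspace")
    barrier = threading.Barrier(N_RANKS)
    errors = []
    threads = [
        threading.Thread(target=run_rank, args=(rank, workspace, barrier, errors))
        for rank in range(N_RANKS)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    created = sorted(os.listdir(workspace))
```

Because the second barrier separates workspace creation from job directory creation, each simulated rank only ever creates its own target directory.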

Error output

Error occured while trying to create workspace directory for job '42e1468f85f2b689659f23bae00de995'.
State point manifest file of job '42e1468f85f2b689659f23bae00de995' appears to be corrupted.
Traceback (most recent call last):
  File "1brf_umbrella.py", line 125, in <module>
    job.init()
  File "/ccs/home/glaser/.conda/envs/myenv/lib/python3.7/site-packages/signac/contrib/job.py", line 427, in init
    self._init(force=force)
  File "/ccs/home/glaser/.conda/envs/myenv/lib/python3.7/site-packages/signac/contrib/job.py", line 354, in _init
    _mkdir_p(self._wd)
  File "/ccs/home/glaser/.conda/envs/myenv/lib/python3.7/site-packages/signac/contrib/utility.py", line 171, in _mkdir_p
    os.makedirs(path)
  File "/ccs/home/glaser/.conda/envs/myenv/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
...

System configuration

Please complete the following information:

  • Operating System [e.g. macOS]: Linux login2 4.14.0-115.8.1.el7a.ppc64le #1 SMP Thu May 9 14:45:13 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux (really, any OS)
  • Version of Python [e.g. 3.7]: Python 3.7.4 (really, any python3 version)
  • Version of signac [e.g. 1.0]: signac 1.0.0 (but looking at the code the problem may also be present in newer versions)

csadorf (Contributor) commented Jan 6, 2020

I would be in favor of creating the workspace directory with project.init().

bdice added the "bug" and "good first issue" labels on Jan 7, 2020
jglaser (Author) commented Jan 8, 2020

Note that an easy solution could be to error out when the workspace does not exist, i.e., to put the burden on the user to create it.

bdice (Member) commented Jan 13, 2020

I agree with @csadorf that project initialization should create an empty workspace directory, and we leave the current behavior for creating job directories as mkdir -p (actually its Python equivalent). When initializing a new job, there won't be a race condition to create workspace if it already exists, and it will only need to create workspace if the directory was deleted by the user.
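
For reference, the Python equivalent of `mkdir -p` mentioned here is `os.makedirs`. A minimal sketch, assuming job directories are created with `exist_ok=True` (this is an illustration, not signac's actual `_mkdir_p`): several threads create the same job directory under a workspace that already exists, and no call is expected to raise. The thread count and the `init_job_dir` helper are hypothetical.

```python
import os
import tempfile
import threading

N_WORKERS = 16  # assumed number of concurrent initializers, for illustration only


def init_job_dir(job_dir, errors):
    """Hypothetical helper: create the job directory, tolerating concurrent creation."""
    try:
        # exist_ok=True makes "already exists as a directory" a no-op rather than an error.
        os.makedirs(job_dir, exist_ok=True)
    except OSError as error:
        errors.append(error)


with tempfile.TemporaryDirectory() as root:
    # The workspace is created up front, as project initialization would do.
    workspace = os.path.join(root, "workspace")
    os.mkdir(workspace)

    job_dir = os.path.join(workspace, "42e1468f")
    errors = []
    threads = [
        threading.Thread(target=init_job_dir, args=(job_dir, errors))
        for _ in range(N_WORKERS)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    job_dir_exists = os.path.isdir(job_dir)
```

If the workspace were deleted by the user, the same call would recreate it along with the job directory, which is the fallback behavior described above.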
