Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with dask worker restarting/continuing a job #258

Closed
jeff231li opened this issue Jul 10, 2020 · 7 comments
Closed

Issue with dask worker restarting/continuing a job #258

jeff231li opened this issue Jul 10, 2020 · 7 comments

Comments

@jeff231li
Copy link
Collaborator

I've just noticed an issue with Evaluator restarting/picking up a job that gave similar error to issue #224 about appending to an empty binary file but may be caused by something else. The log file for the worker for the failed job shows a 7:56 timestamp

07:55:53.738 INFO     Running window 1 out of 59
07:55:53.889 INFO     Executing npt_equilibration
07:55:54.199 INFO     Setting up an openmm platform on GPU 0
07:56:00.527 INFO     No checkpoint files were found.
07:56:30.478 INFO     Protocol failed to execute: npt_equilibration
07:56:30.526 INFO     Protocol failed to execute: 45ab11237d9b4db899cb5418c9d5a5be|host_guest_free_energy_0
07:56:38.933 INFO     Executing 1307506edaf9477882944919aa4e8f65|host_guest_free_energy_1

Tracing back to the previous worker that dealt with this job indicates that it quit at 6:13 because it exceeded the max wall limit.

06:13:49.829 INFO     Executing npt_equilibration
06:13:49.831 INFO     Setting up an openmm platform on GPU 0
06:13:51.863 INFO     No checkpoint files were found.

and the files in the directory are

-rw-r--r-- 1 jsetiadi gibbs-group 548306 Jul 10 06:13 input.pdb
-rw-r--r-- 1 jsetiadi gibbs-group      0 Jul 10 06:13 openmm_statistics.csv 
-rw-r--r-- 1 jsetiadi gibbs-group      0 Jul 10 06:13 trajectory.dcd 

It looks like the worker dies before the DCDReporter/StateDataReporter could write anything to file. So then the next worker that picks up this job will definitely complain about appending to an empty trajectory.dcd.

As a potential fix, what do you think of checking for not only if the file exists but also the file size of trajectory.dcd? So changing

append_trajectory = os.path.isfile(self._local_trajectory_path)
dcd_reporter = app.DCDReporter(
self._local_trajectory_path, 0, append_trajectory
)
to

append_trajectory = os.path.isfile(self._local_trajectory_path) and os.path.getsize(self._local_trajectory_path) != 0
dcd_reporter = app.DCDReporter (
    self._local_trajectory_path, 0, append_trajectory
)

Could potentially apply the same logic to checkpoint_state.xml and checkpoint.json, so if the checkpoint file is empty then just restart that particular window from the beginning.

@dotsdl
Copy link
Member

dotsdl commented Jul 15, 2020

Hey @jeff231li, taking a close look at this now. Any additional information since you posted this Friday?

@jeff231li
Copy link
Collaborator Author

@dotsdl no, that's pretty much it. Just thought it might be more robust for the program to check the filesize on top of checking the existence of the trajectory.dcd.

@dotsdl
Copy link
Member

dotsdl commented Jul 15, 2020

Definitely agree with this approach @jeff231li! I'll prepare a PR against master with an additional utility wrapping os.path.isfile and the check for the size as well, and we'll use this in place of raw usages of os.path.isfile in protocols/openmm.py.

Once merged, we can then pull in changes from master into the paprika_integration branch so we have these benefits as we continue hammering on it.

@SimonBoothroyd
Copy link
Contributor

@dotsdl - before you go ahead with this can you prepare a MCVE that this is indeed the problem?

@SimonBoothroyd
Copy link
Contributor

SimonBoothroyd commented Jul 15, 2020

If this is the case then this also seems like behaviour which should be caught and handled within the OpenMM DCDFile class itself.

@dotsdl
Copy link
Member

dotsdl commented Jul 15, 2020

No problem @SimonBoothroyd, I already put together an implementation. PR Is at #259.

I've added as items needed yet:

  1. an MCVE
  2. advice from you on where this logic should actually live

@SimonBoothroyd
Copy link
Contributor

This should have been resolved by #259. @jeff231li feel free to re-open this if the issue persists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants