Issue with dask worker restarting/continuing a job #258
Comments
Hey @jeff231li, taking a close look at this now. Any additional information since you posted this Friday?
@dotsdl no, that's pretty much it. Just thought it might be more robust for the program to check the file size on top of checking the existence of the `trajectory.dcd` file.
Definitely agree with this approach @jeff231li! I'll prepare a PR against master with an additional utility wrapping `…`. Once merged, we can then pull in changes from master into the `…` branch.
@dotsdl - before you go ahead with this, can you prepare an MCVE showing that this is indeed the problem?
If this is the case, then this also seems like behaviour which should be caught and handled within the OpenMM `…`.
No problem @SimonBoothroyd, I already put together an implementation. The PR is at #259. I've added the items still needed as to-dos there: …
This should have been resolved by #259. @jeff231li feel free to re-open this if the issue persists.
I've just noticed an issue with Evaluator restarting/picking up a job that gave a similar error to issue #224 about appending to an empty binary file, but it may be caused by something else. The log file for the worker that failed on the job shows a 7:56 timestamp. Tracing back to the previous worker that dealt with this job indicates that it quit at 6:13 because it exceeded the max wall limit, and the files in the directory are: …

It looks like the worker died before the DCDReporter/StateDataReporter could write anything to file, so the next worker that picks up this job will definitely complain about appending to an empty `trajectory.dcd`.

As a potential fix, what do you think of checking not only whether `trajectory.dcd` exists but also its file size? That would mean changing openff-evaluator/openff/evaluator/protocols/openmm.py, lines 602 to 605 in 7898857.
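The existence-plus-size check proposed above could be sketched roughly as follows. This is a hypothetical helper, not the actual openff-evaluator code; the function name and usage are illustrative:

```python
import os


def can_append(path):
    """Return True only if the file exists AND is non-empty.

    A zero-byte trajectory means the previous worker died before the
    DCDReporter flushed anything, so appending to it would fail.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0


# Decide whether to append to an existing trajectory or start fresh.
append_trajectory = can_append("trajectory.dcd")
```

The point of the size check is that `os.path.isfile` alone passes for the zero-byte file left behind by a worker killed at the wall limit, which is exactly the case that triggers the append error.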
Could potentially apply the same logic to `checkpoint_state.xml` and `checkpoint.json`, so if the checkpoint file is empty then just restart that particular window from the beginning.