I/O issues when running paprika protocol on NFS, Lustre mount #224
Comments
Agree :)
Just to note that both the server and workers need access to a shared file system - tasks in a given workflow may be distributed across multiple workers, and hence one worker may (and likely will) need to access files produced by another worker. If workers cannot access the local storage of other workers (which I imagine will be the case here), then this solution will not work in general. Consider also the case where a worker hits the wall clock limit before it has completed its task: the worker will be killed and its local storage deleted. A new worker will then attempt to pick up the task but will not find any checkpoint files to restart from, and will have to begin from scratch. This may lead to the task never completing and the task graph getting stuck in a loop until the maximum number of retries is hit (which is currently quite high). In general we are not writing to disk overly frequently, and definitely not more than a well-configured storage system should be able to handle. (Also just to note that the first error may not be related to disk issues.)
Hmm. Do the workers have some method of accessing their total allowed wall time? Would it be possible to write to node-local storage during the process and then copy the results back to the shared filesystem before the wall clock limit is reached?
Not in general, unfortunately. You would likely need to create a custom...

@dotsdl and @jeff231li - could you maybe elaborate a bit more on why you think this is file system related? What steps have you taken to reach this conclusion? Unfortunately...
@jeff231li has submitted a job to TSCC with the possible workaround I posted; did this produce the same set of errors? Simon is right that we should try to rule out other possible failure modes before we pursue a solution further on the assumption that it is due to the network filesystem mounts. That first-stab hypothesis of mine is admittedly based on past experience encountering fun I/O errors when many processes attempt to read/write from such volumes.
@dotsdl the job is currently still in the queue; I will update you when it starts running. My guess is that what @SimonBoothroyd mentioned previously will occur, with dask-workers needing to share information.
@SimonBoothroyd I suspected that this is a file system issue from the error below. Basically, a dask-worker running OpenMM opens the DCD trajectory file for appending and reads back its header, which the snippet below reproduces:

```python
import struct

# Create an empty file
f = open('test.dcd', 'wb')
f.close()

# Re-open it for appending and try to read the first four bytes of the header
f = open('test.dcd', 'r+b')
struct.unpack('<i', f.read(4))[0]  # raises struct.error if fewer than 4 bytes are visible
```

I tried adding...
@jeff231li great presentation today! I wanted to follow up and ask if this was resolved for you?
@dotsdl thank you. Unfortunately no, transferring the files to the local node storage does not work because the Evaluator server and client require access to the same files. I'll take this up with TSCC and see if they have a solution.
@SimonBoothroyd: How difficult would it be to remove this requirement by isolating the storage layer behind a level of abstraction? I wonder if there's some sort of simple, well-supported distributed key-value store that would make this problem go away by simply ensuring that each tasklet grabs the file(s) it needs from the distributed store, operates on them locally, and then resubmits them later. The store could handle caching and migration of data transparently, and we would suddenly gain the ability to run across distributed systems that do not share a common filesystem.
The dask project already provides PartD, a concurrent, appendable key-value store that may work for this purpose.
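For concreteness, here is a minimal sketch of PartD's appendable key-value interface; the path and key names are purely illustrative and this is not how Evaluator stores files today:

```python
import partd

# An on-disk, appendable key-value store; values are raw byte strings.
store = partd.File('/path/to/store')

store.append({'trajectory-0': b'first chunk of frames'})
store.append({'trajectory-0': b'second chunk of frames'})

# Reads return the concatenation of everything appended under that key.
store.get('trajectory-0')
```

Whether the appendable semantics map cleanly onto Evaluator's working and storage directories would still need to be worked out.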
In discussion with @jeff231li, this appears resolved for the time being. TSCC staff identified the issue (some users overloading the filesystem) and have addressed it with policy and monitoring. Jeff hasn't observed the problem for some time since then. I will close this for now; we will revisit with retry layers if the situation changes. Thanks everyone!
Can we have a real-time chat about how difficult it would be to replace the shared filesystem requirement with a simple key-value binary blob storage scheme that could live in the cloud? There might be a very simple solution that would allow us to use multiple clusters.
Sure - we could possibly do this next week. @jchodera it may be best if you or Ellen propose a time on the calendar as you probably have the busiest schedule!
@SimonBoothroyd @j-wags: I've shared my calendar with you both; could you send an invite for a free timeslot (7:00 AM PT or later) anytime Thu 9 Jul or beyond?
I've created this issue as an anchor for an ongoing troubleshooting session. Any solution(s) to this issue will be documented here.
From @jeff231li:
We met today in a live session to troubleshoot. Some details:
- Using a `DaskPBSBackend` on TSCC for compute.
- `working_directory` and `storage_directory` set to paths on the network filesystem mounts (NFS, Lustre).

Seeing issues such as the following.
Running on NFS mount
Gives:
Running on Lustre filesystem
Gives:
Possible workarounds
It may make sense in this case to create a scratch directory on each compute node's local storage at `$TMPDIR` for both the `PropertyEstimatorServer` and the dask-workers in `setup_script_commands`, then set the `working_directory` to point to that. This may avoid issues with rapid writes/reads on mounted network filesystems. More details in the TSCC docs.

Substitutions in the scripts above like the following may work well:
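As a rough illustration only (the original snippet is not preserved here, and the scratch directory name and exact configuration below are assumptions, not taken from the issue), the substitution could look something like this:

```python
import os

# Commands run on each compute node before its dask-worker starts; the
# "evaluator_working" directory name is illustrative.
setup_script_commands = [
    "mkdir -p $TMPDIR/evaluator_working",
    "cd $TMPDIR/evaluator_working",
]

# Point the server-side working directory at node-local scratch rather than
# the NFS/Lustre mount (assuming $TMPDIR is also set where the server runs).
working_directory = os.path.join(os.environ.get("TMPDIR", "/tmp"), "evaluator_working")
```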
@jeff231li, can you give the above a shot and let us know here if this addresses the errors you are seeing?