mhap Exception in thread "main" java.io.FileNotFoundException #533
This step is not particularly I/O intensive: all jobs stream through the same set of block files linked within the queries folder. Canu retries any failed jobs at least once, so this implies the same job failed consistently. Is it possible a subset of your nodes have FS connection issues? Were there any issues on the nodes the failed jobs ran on (e.g. did they all run on one node)? Was anything else running on the same node that could have caused issues with the FS? There isn't much we can do within Canu (other than the retries it already does) if the FS fails to open files that exist. You could try increasing the number of retries (canuIterationMax).
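For reference, retry limits in Canu are set as key=value options on the command line. A hedged sketch of such an invocation (the assembly name, directory, genome size, and read file here are placeholders, not from this issue):

```shell
# Hypothetical canu invocation; canuIterationMax raises the per-stage
# retry limit from its default. Replace the placeholders with your own
# project name, output directory, genome size, and reads.
canu -p asm -d asm-run \
     genomeSize=1.2g \
     canuIterationMax=5 \
     -pacbio-raw reads.fastq
```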
I don't think there is much that can be done. It does look like there is a collision causing a file lock when two jobs try to run against the same set of files. I'm not too familiar with Java, so I'm not sure if the file is being opened for reading in the optimal way (i.e. read-only vs. read-write).
Those files are only opened for read by MHAP, so there should be no issue opening the same file from multiple processes (line 92 in that file). I don't think this is Java/MHAP related but OS/FS related; that is, the same file cannot be accessed by too many processes at once. If your nodes have local scratch space, you could modify the mhap.sh script to first copy the appropriate query folder to local scratch, then run, to see if this eliminates the issue. You could also try limiting the number of jobs running on your grid at a time (either by submitting only a subset of the mhap.sh jobs or using whatever mechanism is available on your grid to limit concurrent cores).
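The local-scratch idea can be sketched as a small shell fragment. This is illustrative only: the `queries/000001` path and block file are fabricated here so the sketch is self-contained, and the real edit would go inside the generated mhap.sh.

```shell
# Fabricated setup so the sketch runs standalone; in a real run the
# queries folder and its block files already exist on the shared FS.
mkdir -p queries/000001
echo "block data" > queries/000001/block.dat

# Stage the query folder on node-local scratch before running mhap.
scratch="${TMPDIR:-/tmp}/mhap.$$"
mkdir -p "$scratch"
cp -r queries/000001 "$scratch/"   # local copy avoids shared-FS contention

# ...run mhap against "$scratch/000001" instead of queries/000001...

rm -rf "$scratch"                  # clean up local scratch afterwards
```

The trade-off, as noted below, is that local scratch adds its own failure modes on busy, shared nodes.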
It looked like there was a collision between two running jobs that couldn't get a lock on the same file. I don't think there is much that can be done apart from the suggestions you have made. Apparently we are looking into enabling locking on Lustre; Canu is not the only application with issues. I have tried scratch before and decided not to use it, as it creates another point of failure across so many nodes (other users' poor scripts, lack of cleanup, drive issues, etc.). I do find the problem worse when using the general pool; when I use the much more limited pool I see fewer issues, but it takes twice as long. Thanks anyway, it was worth looking into.
I can't think of anything useful or simple to try other than increasing canuIterationMax. A not-so-easy thing to try would be to edit src/pipelines/canu/OverlapMhap.pm (around line 478) to make it run mhap a second time, after a short delay, if the first attempt fails. The mhap call looks like:
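(The original snippet isn't reproduced here; a hedged sketch of the pattern, with a stand-in for the real java invocation and illustrative file names, not Canu's actual code:)

```shell
# Stand-in for the real (long) java/mhap command, which writes $qry.mhap.
run_mhap() {
    echo "overlap records" > "$1"
}

qry="000001"   # illustrative job name

# Skip the work if the output already exists (e.g. on a retried job).
if [ ! -e "$qry.mhap" ]; then
    run_mhap "$qry.mhap"
fi
```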
Right after this, duplicate the call but insert a sleep before starting mhap:
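(Again sketched with stand-in names rather than Canu's actual code; the real script runs a long java command where `run_mhap` appears:)

```shell
# Stand-in for the real java/mhap command, which writes $qry.mhap.
run_mhap() {
    echo "overlap records" > "$1"
}

qry="000001"   # illustrative job name

# First attempt, guarded by an existence check on the output.
if [ ! -e "$qry.mhap" ]; then
    run_mhap "$qry.mhap"
fi

# Duplicated call: only reached if the first attempt left no output.
# The sleep gives whatever else is holding the file time to finish.
if [ ! -e "$qry.mhap" ]; then
    sleep 10
    run_mhap "$qry.mhap"
fi
```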
If mhap runs successfully the first time, $qry.mhap will exist and the second call will be skipped. If it fails, it'll try again after ten seconds. Is ten seconds enough? Who knows; two minutes should be more than enough to let whatever other mhap is 'blocking' the file finish reading. Not sure what else to do, so I'm closing. If a real cause and solution is discovered, I wouldn't mind hearing about it.
I'll give it a go. I have already modified the code to include a sleep at the end of each job, to allow time for the buffers to empty before the job is terminated. That has helped when the cluster is busy.
An interesting error during mhap; I suspect it is due to an I/O blocking issue on our Lustre file system. This results in only a couple of jobs failing, but that's enough to break things. Running this step manually works, so there really isn't a permissions problem.