
mhap Exception in thread "main" java.io.FileNotFoundException #533

Closed
hodgett opened this issue Jun 20, 2017 · 6 comments

hodgett commented Jun 20, 2017

An interesting error during mhap. I suspect it is due to an I/O blocking issue on our Lustre file system. Only a couple of jobs fail, but that's enough to break things. Running this step manually works, so there really isn't a permissions problem.

Exception in thread "main" java.io.FileNotFoundException: /lustre/scratch/team/CANU_MH1.5-20170613/correction/1-overlapper/queries/000065/000171.dat (Permission denied)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at edu.umd.marbl.mhap.impl.SequenceSketchStreamer.<init>(SequenceSketchStreamer.java:92)
        at edu.umd.marbl.mhap.main.MhapMain.getSequenceHashStreamer(MhapMain.java:564)
        at edu.umd.marbl.mhap.main.MhapMain.computeMain(MhapMain.java:527)
        at edu.umd.marbl.mhap.main.MhapMain.main(MhapMain.java:315)

skoren commented Jun 20, 2017

This step is not particularly I/O intensive; all jobs stream through the same set of block files linked within the queries folder. Canu retries any failed job at least once, so this implies the same job failed consistently. Is it possible a subset of your nodes have FS connection issues? Were there any issues on the node(s) the failed jobs ran on (e.g. did they all run on one node)? Was anything else running on the same node that could have caused issues with the FS?

There isn't much we can do within Canu (other than the retry it already does) if the FS fails to open files that exist. You could try increasing the number of retries (canuIterationMax, default 2), or you could increase the memory for mhap to create fewer partitions, and thus fewer files, in case the FS has issues with too many files being accessed concurrently.
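For reference, both knobs can be supplied on the canu command line when resuming a run. This is a sketch, not a recommendation for specific values; the assembly prefix, directory, and genomeSize below are placeholders for your own run:

```shell
# Resume the existing run with more retries per failed job and more memory
# per mhap task (larger partitions => fewer .dat files open concurrently).
# -p/-d/genomeSize are placeholders; canuIterationMax and mhapMemory are the
# options discussed above.
canu -p asm -d /lustre/scratch/team/CANU_MH1.5-20170613 \
     genomeSize=1g \
     canuIterationMax=4 \
     mhapMemory=32g
```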


hodgett commented Jun 21, 2017

I don't think there is much that can be done. It does look like there is a collision causing a file lock when two jobs try to run against the same set of files. I'm not too familiar with Java, so I'm not sure whether the file is being opened for reading in the optimal way (i.e. read-only vs. read-write).
I think the error is being thrown in the block starting at line 74 of https://github.com/marbl/MHAP/blob/master/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java, but I don't have the skills to work on it.


skoren commented Jun 21, 2017

Those files are only opened for reading by MHAP (line 92 in that file), so there should be no issue opening the same file from multiple processes. I don't think this is Java/MHAP related but OS/FS related; that is, the same file cannot be accessed by too many processes at once.

If your nodes have local scratch space, you could modify the mhap.sh script to first copy the appropriate query folder to local scratch and then run, to see if this eliminates the issue. You could also try limiting the number of jobs running on your grid at a time (either by submitting only a subset of the mhap.sh jobs or by using whatever mechanism your grid provides to limit concurrent cores).
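A minimal sketch of that staging idea in shell. All directory names here are stand-ins (the real edit would go near the top of the generated mhap.sh and use the actual correction/1-overlapper/queries layout and the node's scratch path):

```shell
#!/bin/sh
# Sketch: stage the query partition onto node-local scratch before running
# mhap, so reads hit local disk instead of the shared Lustre mount.
set -e

# Stand-ins for the shared queries folder and node-local scratch.
SHARED=$(mktemp -d)   # stands in for .../1-overlapper/queries/000065
LOCAL=$(mktemp -d)    # stands in for $TMPDIR on the compute node
echo "sketch data" > "$SHARED/000171.dat"

# Stage the partition locally, then point the job at the local copy.
cp -r "$SHARED/." "$LOCAL/"
QUERY_DIR=$LOCAL      # mhap would be invoked against this directory

ls "$QUERY_DIR"       # prints 000171.dat

# Clean up local scratch (and the demo dirs) when the job finishes.
rm -rf "$LOCAL" "$SHARED"
```

The trade-off hodgett raises below is real: local scratch removes Lustre contention but adds per-node failure modes, so it is worth trying only if the shared FS is the confirmed bottleneck.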


hodgett commented Jun 22, 2017

It looked like there was a collision between two running jobs that couldn't get a lock on the same file. I don't think there is much that can be done apart from the suggestions you have made. Apparently we are looking into enabling locking on Lustre; Canu is not the only application with issues. I have tried scratch before and decided not to use it, since it creates another point of failure across so many nodes (other users' poor scripts, lack of cleanup, drive issues, etc.). I do find the problem is worse when using the general pool; when I use the much more limited pool I see fewer issues, but it takes twice as long.

Thanks anyway, it was worth looking into.

brianwalenz commented

I can't think of anything useful or simple to try, other than increasing canuIterationMax.

Not-so-easy to try would be to edit src/pipelines/canu/OverlapMhap.pm (around line 478) to make it run mhap a second time after a short delay if the first one fails.

The mhap call looks like:

    print F "if [ ! -e ./results/\$qry.mhap ] ; then\n";
    print F "  $javaPath -d64 -server -Xmx", $javaMemory, "m \\\n";
    .....
    print F "  mv -f ./results/\$qry.mhap.WORKING ./results/\$qry.mhap\n";
    print F "fi\n";

Right after this, duplicate the call but insert a sleep before starting mhap:

    print F "if [ ! -e ./results/\$qry.mhap ] ; then\n";
    print F "  sleep 10\n";
    print F "  $javaPath -d64 -server -Xmx", $javaMemory, "m \\\n";
    .....
    print F "  mv -f ./results/\$qry.mhap.WORKING ./results/\$qry.mhap\n";
    print F "fi\n";

If mhap runs successfully the first time, $qry.mhap will exist and the second call will be skipped. If it fails, it'll try again after ten seconds. Is ten seconds enough? Who knows; two minutes should be more than enough to let whatever other mhap is 'blocking' the file finish reading.
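The same idea can be written as a small shell retry helper instead of duplicating the block. This is a sketch, not part of Canu; `run_with_retry` and `flaky` are made-up names for illustration:

```shell
#!/bin/sh
# Sketch: retry a command up to N times with a delay between attempts,
# mirroring the duplicated if-block above.

run_with_retry() {
    tries=$1; delay=$2; shift 2
    n=1
    while true; do
        if "$@"; then
            return 0          # command succeeded
        fi
        if [ "$n" -ge "$tries" ]; then
            return 1          # out of attempts
        fi
        n=$((n + 1))
        sleep "$delay"        # wait before the next attempt
    done
}

# Demo: a command that fails on its first call and succeeds afterwards,
# standing in for an mhap invocation hitting a transient FS error.
marker=$(mktemp -u)
flaky() {
    if [ -e "$marker" ]; then
        return 0
    fi
    touch "$marker"           # first call fails but leaves the marker behind
    return 1
}

run_with_retry 3 1 flaky && echo "succeeded after retry"
rm -f "$marker"
```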

Not sure what else to do, so I'm closing. If a real cause and solution is discovered, I wouldn't mind hearing about it.


hodgett commented Jun 25, 2017

I'll give it a go. I have already modified the code to include a sleep at the end of each job, to allow time for the buffers to empty before the job terminates. That has helped when the cluster is busy.
