mhap Exception in thread "main" java.io.FileNotFoundException #533
This step is not particularly I/O intensive: all jobs stream through the same set of block files linked within the queries folder. Canu retries any failed jobs at least once, so this implies the same job failed consistently. Is it possible a subset of your nodes have FS connection issues? Were there any issues on the nodes the failed jobs ran on (e.g. did they all run on one node)? Was anything else running on the same node that could have caused issues with the FS? There isn't much we can do within Canu (other than the retries it already does) if the FS fails to open files that exist. You could try increasing the number of retries (canuIterationMax).
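For reference, retry limits in Canu are set as key=value options on the command line. A hedged sketch of such an invocation (the assembly name, directory, genome size, and read file here are placeholders, not from this issue):

```shell
# Hypothetical canu invocation; canuIterationMax raises the per-stage
# retry limit from its default. Replace the placeholders with your own
# project name, output directory, genome size, and reads.
canu -p asm -d asm-run \
     genomeSize=1.2g \
     canuIterationMax=5 \
     -pacbio-raw reads.fastq
```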
I don't think there is much that can be done. It does look like there is a collision causing a file lock when two jobs try to run against the same set of files. I'm not too familiar with Java, so I'm not sure if the file is being opened for reading in the optimal way (i.e. read-only vs. read-write).
Those files are only opened for read by MHAP, so there should be no issue opening the same file from multiple processes (line 92 in that file). I don't think this is Java/MHAP related but OS/FS related; that is, the same file cannot be accessed by too many processes at once. If your nodes have local scratch space, you could modify the mhap.sh script to first copy the appropriate query folder to local scratch, then run, to see if this eliminates the issue. You could also try limiting the number of jobs running on your grid at a time (either by submitting only a subset of the mhap.sh jobs or using whatever mechanism is available on your grid to limit concurrent cores).
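The local-scratch idea can be sketched as a small shell fragment. This is illustrative only: the `queries/000001` path and block file are fabricated here so the sketch is self-contained, and the real edit would go inside the generated mhap.sh.

```shell
# Fabricated setup so the sketch runs standalone; in a real run the
# queries folder and its block files already exist on the shared FS.
mkdir -p queries/000001
echo "block data" > queries/000001/block.dat

# Stage the query folder on node-local scratch before running mhap.
scratch="${TMPDIR:-/tmp}/mhap.$$"
mkdir -p "$scratch"
cp -r queries/000001 "$scratch/"   # local copy avoids shared-FS contention

# ...run mhap against "$scratch/000001" instead of queries/000001...

rm -rf "$scratch"                  # clean up local scratch afterwards
```

The trade-off, as noted below, is that local scratch adds its own failure modes on busy, shared nodes.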
It looked like there was a collision between two running jobs that couldn't get a lock on the same file. I don't think there is much that can be done apart from the suggestions you have made. Apparently we are looking into enabling locking on Lustre; Canu is not the only application with issues. I have tried scratch before and decided not to use it, as it creates another point of failure across so many nodes (other users' poor scripts, lack of cleanup, drive issues, etc.). I do find the problem worse when using the general pool; when I use the much more limited pool I see fewer issues, but it takes twice as long. Thanks anyway, it was worth looking into.
I can't think of anything useful or simple to try other than increasing canuIterationMax. A not-so-easy thing to try would be to edit src/pipelines/canu/OverlapMhap.pm (around line 478) to make it run mhap a second time, after a short delay, if the first attempt fails. The mhap call looks like:
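(The original snippet isn't reproduced here; a hedged sketch of the pattern, with a stand-in for the real java invocation and illustrative file names, not Canu's actual code:)

```shell
# Stand-in for the real (long) java/mhap command, which writes $qry.mhap.
run_mhap() {
    echo "overlap records" > "$1"
}

qry="000001"   # illustrative job name

# Skip the work if the output already exists (e.g. on a retried job).
if [ ! -e "$qry.mhap" ]; then
    run_mhap "$qry.mhap"
fi
```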
Right after this, duplicate the call but insert a sleep before starting mhap:
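(Again sketched with stand-in names rather than Canu's actual code; the real script runs a long java command where `run_mhap` appears:)

```shell
# Stand-in for the real java/mhap command, which writes $qry.mhap.
run_mhap() {
    echo "overlap records" > "$1"
}

qry="000001"   # illustrative job name

# First attempt, guarded by an existence check on the output.
if [ ! -e "$qry.mhap" ]; then
    run_mhap "$qry.mhap"
fi

# Duplicated call: only reached if the first attempt left no output.
# The sleep gives whatever else is holding the file time to finish.
if [ ! -e "$qry.mhap" ]; then
    sleep 10
    run_mhap "$qry.mhap"
fi
```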
If mhap runs successfully the first time, $qry.mhap will exist and the second call will be skipped. If it fails, it'll try again after ten seconds. Is ten seconds enough? Who knows; two minutes should be more than enough to let whatever other mhap is 'blocking' the file finish reading. Not sure what else to do, so I'm closing. If a real cause and solution is discovered, I wouldn't mind hearing about it.
I'll give it a go. I have already modified the code to include a sleep at the end of each job, to allow time for the buffers to empty before the job is terminated. That has helped when the cluster is busy.
An interesting error during mhap; I suspect it is due to an I/O blocking issue on our Lustre file system. This results in only a couple of jobs failing, but that's enough to break things. Running this step manually works, so there really isn't a permissions problem.