Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Erroneous input parsing with paired-reads + singletons #92

Closed
unode opened this issue Jan 15, 2018 · 7 comments
Closed

Erroneous input parsing with paired-reads + singletons #92

unode opened this issue Jan 15, 2018 · 7 comments

Comments

@unode
Copy link

unode commented Jan 15, 2018

While running:

minimap2 -ax sr db.mmi read1.fq.gz read2.fq.gz singletons.fq.gz

and monitoring I/O using pipe viewer (pv -d pid) I noticed a somewhat strange behavior.

db.mmi is read in chunks and for every chunk read1.fq.gz, read2.fq.gz and single.fq.gz are re-read. This is a bit wasteful since it has to decompress the inputs several times but not necessarily a problem.

However, single.fq.gz is much smaller than the other 2 files. minimap2 seems to stop reading all inputs as soon as it reaches the end of one of them. Because of this single.fq.gz is read in its entirety while only the first 5-10% ofread1.fq.gz and read2.fq.gz seem to be read at which point minimap2 moves on to the next index chunk.

The documentation doesn't explicitly mention support for single.fq but minimap2 didn't complain either and a result is still produced.

@lh3
Copy link
Owner

lh3 commented Jan 16, 2018

With that command line, minimap2 assumes each fragment has three reads. The correct way to achieve what you want is:

(seqtk mergepe read1.fq.gz read2.fq.gz; cat single.fq.gz) | minimap2 -ax sr db.mmi -

Anyway, I agree that minimap2 should throw an warning if one file has fewer reads than others. I will implement it at some point.

@unode
Copy link
Author

unode commented Jan 16, 2018

When reading from a pipe and mapping against a chunked index, does minimap2 still expect to read the input more than once or does it buffer the content in memory?

@unode
Copy link
Author

unode commented Jan 16, 2018

Indeed I can confirm that using a chunked index (32GB) and piped streams results in undefined behavior.

With:

(seqtk mergepe read1.fq.gz read2.fq.gz; cat single.fq.gz) | minimap2 -ax sr db.mmi -

minimap2 segfaults once it reaches the second chunk of db.mmi.

@lh3
Copy link
Owner

lh3 commented Jan 16, 2018

I didn't notice you were using a multi-part index. For such a index, you can't feed data through a pipe. Minimap2 is designed this way to reduce peak memory. It is often impractical to hold either all target or all query sequences in memory. If you have enough memory, build a uni-part index by increasing -I.

@unode
Copy link
Author

unode commented Jan 24, 2018

Redirecting the output of seqtk to a file and using that as input is a workable solution.
The segfault was also now always seen when reading from a pipe. I don't have a perfectly reproducible example.

@lh3
Copy link
Owner

lh3 commented Jan 28, 2018

master now reports an error if you specify three query files. It also warns if one file contains fewer records than others.

@lh3 lh3 closed this as completed Jan 28, 2018
@unode
Copy link
Author

unode commented Feb 1, 2018

It might also be useful to abort early if the database is a split one and minimap2 is being executed with piped input.

Thanks for the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants