Erroneous input parsing with paired-reads + singletons #92

unode · 2018-01-15T17:42:05Z

While running:

minimap2 -ax sr db.mmi read1.fq.gz read2.fq.gz singletons.fq.gz

and monitoring I/O using pipe viewer (pv -d pid) I noticed a somewhat strange behavior.

db.mmi is read in chunks and for every chunk read1.fq.gz, read2.fq.gz and single.fq.gz are re-read. This is a bit wasteful since it has to decompress the inputs several times but not necessarily a problem.

However, single.fq.gz is much smaller than the other 2 files. minimap2 seems to stop reading all inputs as soon as it reaches the end of one of them. Because of this single.fq.gz is read in its entirety while only the first 5-10% ofread1.fq.gz and read2.fq.gz seem to be read at which point minimap2 moves on to the next index chunk.

The documentation doesn't explicitly mention support for single.fq but minimap2 didn't complain either and a result is still produced.

The text was updated successfully, but these errors were encountered:

lh3 · 2018-01-16T03:08:11Z

With that command line, minimap2 assumes each fragment has three reads. The correct way to achieve what you want is:

(seqtk mergepe read1.fq.gz read2.fq.gz; cat single.fq.gz) | minimap2 -ax sr db.mmi -

Anyway, I agree that minimap2 should throw an warning if one file has fewer reads than others. I will implement it at some point.

unode · 2018-01-16T09:31:24Z

When reading from a pipe and mapping against a chunked index, does minimap2 still expect to read the input more than once or does it buffer the content in memory?

unode · 2018-01-16T11:07:32Z

Indeed I can confirm that using a chunked index (32GB) and piped streams results in undefined behavior.

With:

(seqtk mergepe read1.fq.gz read2.fq.gz; cat single.fq.gz) | minimap2 -ax sr db.mmi -

minimap2 segfaults once it reaches the second chunk of db.mmi.

lh3 · 2018-01-16T12:58:58Z

I didn't notice you were using a multi-part index. For such a index, you can't feed data through a pipe. Minimap2 is designed this way to reduce peak memory. It is often impractical to hold either all target or all query sequences in memory. If you have enough memory, build a uni-part index by increasing -I.

unode · 2018-01-24T10:35:24Z

Redirecting the output of seqtk to a file and using that as input is a workable solution.
The segfault was also now always seen when reading from a pipe. I don't have a perfectly reproducible example.

lh3 · 2018-01-28T22:37:02Z

master now reports an error if you specify three query files. It also warns if one file contains fewer records than others.

unode · 2018-02-01T11:17:37Z

It might also be useful to abort early if the database is a split one and minimap2 is being executed with piped input.

Thanks for the fix.

lh3 added question enhancement labels Jan 16, 2018

lh3 added a commit that referenced this issue Jan 28, 2018

r666: report if >=3 query files in SR mode #92

72b9b0e

lh3 added a commit that referenced this issue Jan 28, 2018

r667: warn if one query file has fewer records #92

43bfa61

lh3 closed this as completed Jan 28, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Erroneous input parsing with paired-reads + singletons #92

Erroneous input parsing with paired-reads + singletons #92

unode commented Jan 15, 2018

lh3 commented Jan 16, 2018 •

edited

unode commented Jan 16, 2018

unode commented Jan 16, 2018

lh3 commented Jan 16, 2018

unode commented Jan 24, 2018

lh3 commented Jan 28, 2018

unode commented Feb 1, 2018

Erroneous input parsing with paired-reads + singletons #92

Erroneous input parsing with paired-reads + singletons #92

Comments

unode commented Jan 15, 2018

lh3 commented Jan 16, 2018 • edited

unode commented Jan 16, 2018

unode commented Jan 16, 2018

lh3 commented Jan 16, 2018

unode commented Jan 24, 2018

lh3 commented Jan 28, 2018

unode commented Feb 1, 2018

lh3 commented Jan 16, 2018 •

edited