SAM header interspersed through file #15

mbhall88 · 2017-08-22T08:19:10Z

Hi Heng,

Just came across a bizarre issue which I have been debugging all day to try and get at the root cause.

To cut a long story short, it looks like in some circumstances the SAM header gets spread out in chunks throughout the SAM file - almost like there are multiple SAM files concatenated together.

I only see this when aligning to a reference that is effectively a database of fasta files concatenated together.

For example:

minimap2 -a DB.fasta my.fastq > my.sam

I have a file DB.fasta which has 16498 entries in it. When I align my fastq file to this reference the resulting SAM file has 4 separate header locations spread throughout the file. Each section is unique.

To check my sanity I ran the exact same thing with bwa mem and there are no issues.
Also, if I just align the sample to a single reference (with minimap2) such as GRCh38, the resulting SAM is fine.

When trying to view the resulting SAM file in samtools view I get an error saying the file is truncated. Therefore I was only able to see this in vim.

My current install of minimap2 is 2.0-r191-dirty

Let me know if you want some more information.

The text was updated successfully, but these errors were encountered:

lh3 · 2017-08-22T08:37:44Z

2.0-r191 is almost the first version when this repo was made public. There have been lots of meticulous improvements and bugfixes since then. You should use v2.0-r275 from the release page.

Please let me know if you still have the issue.

lh3 · 2017-08-22T08:39:25Z

Also, if I just align the sample to a single reference (with minimap2) such as GRCh38, the resulting SAM is fine.

What do you mean by a "single reference". GRCh38 contains many sequences, too.

mbhall88 · 2017-08-22T09:40:46Z

Yes, sorry I meant single species.

I have updated to the latest version but am still getting the same problem.

lh3 · 2017-08-22T09:55:02Z

Ah, I see the issue. Unfortunately, it is very difficult to fix it. When you have a huge reference database, minimap2 only holds part of it in memory, aligns all reads against this part and then move to the next part of the reference. Because it is unable to see all the reference sequences before doing alignment, it can't write the right SAM header.

There are two solutions to this. 1) you can increase option -I such that minimap2 can load all references into memory:

minimap2 -a -I100g ref.fa reads.fa

You will need a machine with huge RAM for this. 2) you can filter out all SAM header lines and then add them when you convert SAM to BAM:

minimap2 -a ref.fa reads.fa | grep -v ^@ | samtools view -bt ref.fa.fai - > unsrt.bam

mbhall88 · 2017-08-22T10:00:18Z

Ahhh, yes I should have seen that before. Default is 4Gb and my database is 13Gb.

Great, those two options should be able to sort out my problem.

Thanks for the speedy response.

mbhall88 · 2017-08-25T07:15:41Z

Just thought I would also mention (in case other people stumble on this) this rule also applies to indexing files.
For example: I have indexed a 12GB fasta file with default arguments

minimap2 -d db.mmi db.fa

When running minimap2 later with this index file, changing the -I flag has no effect on how the index is loaded in:

minimap2 -aI 64g db.mmi in.fa

Will still load the index into memory in 4GB chunks (default).

So to avoid this I found I had to index like so

minimap2 -I 64g -d db.mmi db.fa
minimap2 -aI 64g db.mmi in.fa

The above will now avoid the 'concatenated' SAM file effect.

NB @lh3 example above using grep to filter out SAM headers until later works a treat incase you dont have the memory to do the above (or can't be bothered).

lh3 · 2017-08-25T09:11:03Z

-I is an indexing option. Once you create the index, you can't change the parameter (see also the man page). Minimap2 gives a warning if build-time -k, -w and -H are different from map-time values. However, it is hard to check the consistency between the build-time -I and the map-time -I because different -I values may still yield the same index. I will see how to improve this at some point. Thanks.

kehey · 2017-10-03T14:16:25Z

@mbhall88 @lh3 Just to comment on the message that using the grep solution is the same as uisng the larger index options: I don't think it is since minimap2 will have an optimal alignment (bitwise flag 0) per chunk, rather than 1 optimal alignment across the whole index, no?

lh3 · 2017-10-03T14:45:17Z

I don't think it is since minimap2 will have an optimal alignment (bitwise flag 0) per chunk, rather than 1 optimal alignment across the whole index

That is correct. As such, a larger -I is always preferred as long as your machine has enough memory to hold the index.

kehey · 2017-10-03T14:45:54Z

Thanks for the quick confirmation!

kehey · 2017-10-03T14:49:24Z

Quick follow up if I may: is there a way to confirm that the -I setting was large enough to hold the index when building the index, or can this only be seen when the SAM file is interspersed with chunk headers?

lh3 · 2017-10-03T15:05:32Z

You always get a uni-part index if -I is larger than the total lengths of all sequences. You can set it to something very large like -I 10000G (if this doesn't work, let me know). The total memory will be determined by the total length, not by the actual -I value.

If you have a prebuilt index, you can use the following command line to see how many parts it contains:

grep -obUaP "MMI\x2" index.mmi

Note that this only works with GNU grep. Mac uses BSD grep by default, which does not work well with binary files.

mbhall88 changed the title ~~SAM Header interspersed through file~~ SAM header interspersed through file Aug 22, 2017

mbhall88 closed this as completed Aug 22, 2017

lh3 added the bug label Oct 12, 2017

jblachly mentioned this issue Oct 20, 2017

SAM header interspersed through file (not a duplicate of #15) #39

Closed

zeqianli mentioned this issue Mar 24, 2021

Recommended workflow breaks for large dataset (no SQ lines in the header when converting to bam) RasmussenLab/vamb#63

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SAM header interspersed through file #15

SAM header interspersed through file #15

mbhall88 commented Aug 22, 2017

lh3 commented Aug 22, 2017

lh3 commented Aug 22, 2017

mbhall88 commented Aug 22, 2017

lh3 commented Aug 22, 2017

mbhall88 commented Aug 22, 2017

mbhall88 commented Aug 25, 2017 •

edited

lh3 commented Aug 25, 2017

kehey commented Oct 3, 2017

lh3 commented Oct 3, 2017

kehey commented Oct 3, 2017

kehey commented Oct 3, 2017 •

edited

lh3 commented Oct 3, 2017

SAM header interspersed through file #15

SAM header interspersed through file #15

Comments

mbhall88 commented Aug 22, 2017

lh3 commented Aug 22, 2017

lh3 commented Aug 22, 2017

mbhall88 commented Aug 22, 2017

lh3 commented Aug 22, 2017

mbhall88 commented Aug 22, 2017

mbhall88 commented Aug 25, 2017 • edited

lh3 commented Aug 25, 2017

kehey commented Oct 3, 2017

lh3 commented Oct 3, 2017

kehey commented Oct 3, 2017

kehey commented Oct 3, 2017 • edited

lh3 commented Oct 3, 2017

mbhall88 commented Aug 25, 2017 •

edited

kehey commented Oct 3, 2017 •

edited