Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAM header interspersed through file #15

Closed
mbhall88 opened this issue Aug 22, 2017 · 12 comments
Closed

SAM header interspersed through file #15

mbhall88 opened this issue Aug 22, 2017 · 12 comments
Labels

Comments

@mbhall88
Copy link

Hi Heng,

Just came across a bizarre issue which I have been debugging all day to try and get at the root cause.

To cut a long story short, it looks like in some circumstances the SAM header gets spread out in chunks throughout the SAM file - almost like there are multiple SAM files concatenated together.

I only see this when aligning to a reference that is effectively a database of fasta files concatenated together.

For example:

minimap2 -a DB.fasta my.fastq > my.sam

I have a file DB.fasta which has 16498 entries in it. When I align my fastq file to this reference the resulting SAM file has 4 separate header locations spread throughout the file. Each section is unique.

To check my sanity I ran the exact same thing with bwa mem and there are no issues.
Also, if I just align the sample to a single reference (with minimap2) such as GRCh38, the resulting SAM is fine.

When trying to view the resulting SAM file in samtools view I get an error saying the file is truncated. Therefore I was only able to see this in vim.

My current install of minimap2 is 2.0-r191-dirty

Let me know if you want some more information.

@mbhall88 mbhall88 changed the title SAM Header interspersed through file SAM header interspersed through file Aug 22, 2017
@lh3
Copy link
Owner

lh3 commented Aug 22, 2017

2.0-r191 is almost the first version when this repo was made public. There have been lots of meticulous improvements and bugfixes since then. You should use v2.0-r275 from the release page.

Please let me know if you still have the issue.

@lh3
Copy link
Owner

lh3 commented Aug 22, 2017

Also, if I just align the sample to a single reference (with minimap2) such as GRCh38, the resulting SAM is fine.

What do you mean by a "single reference". GRCh38 contains many sequences, too.

@mbhall88
Copy link
Author

Yes, sorry I meant single species.

I have updated to the latest version but am still getting the same problem.

@lh3
Copy link
Owner

lh3 commented Aug 22, 2017

Ah, I see the issue. Unfortunately, it is very difficult to fix it. When you have a huge reference database, minimap2 only holds part of it in memory, aligns all reads against this part and then move to the next part of the reference. Because it is unable to see all the reference sequences before doing alignment, it can't write the right SAM header.

There are two solutions to this. 1) you can increase option -I such that minimap2 can load all references into memory:

minimap2 -a -I100g ref.fa reads.fa

You will need a machine with huge RAM for this. 2) you can filter out all SAM header lines and then add them when you convert SAM to BAM:

minimap2 -a ref.fa reads.fa | grep -v ^@ | samtools view -bt ref.fa.fai - > unsrt.bam

@mbhall88
Copy link
Author

Ahhh, yes I should have seen that before. Default is 4Gb and my database is 13Gb.

Great, those two options should be able to sort out my problem.

Thanks for the speedy response.

@mbhall88
Copy link
Author

mbhall88 commented Aug 25, 2017

Just thought I would also mention (in case other people stumble on this) this rule also applies to indexing files.
For example: I have indexed a 12GB fasta file with default arguments

minimap2 -d db.mmi db.fa

When running minimap2 later with this index file, changing the -I flag has no effect on how the index is loaded in:

minimap2 -aI 64g db.mmi in.fa 

Will still load the index into memory in 4GB chunks (default).

So to avoid this I found I had to index like so

minimap2 -I 64g -d db.mmi db.fa
minimap2 -aI 64g db.mmi in.fa 

The above will now avoid the 'concatenated' SAM file effect.

NB @lh3 example above using grep to filter out SAM headers until later works a treat incase you dont have the memory to do the above (or can't be bothered).

@lh3
Copy link
Owner

lh3 commented Aug 25, 2017

-I is an indexing option. Once you create the index, you can't change the parameter (see also the man page). Minimap2 gives a warning if build-time -k, -w and -H are different from map-time values. However, it is hard to check the consistency between the build-time -I and the map-time -I because different -I values may still yield the same index. I will see how to improve this at some point. Thanks.

@kehey
Copy link

kehey commented Oct 3, 2017

@mbhall88 @lh3 Just to comment on the message that using the grep solution is the same as uisng the larger index options: I don't think it is since minimap2 will have an optimal alignment (bitwise flag 0) per chunk, rather than 1 optimal alignment across the whole index, no?

@lh3
Copy link
Owner

lh3 commented Oct 3, 2017

I don't think it is since minimap2 will have an optimal alignment (bitwise flag 0) per chunk, rather than 1 optimal alignment across the whole index

That is correct. As such, a larger -I is always preferred as long as your machine has enough memory to hold the index.

@kehey
Copy link

kehey commented Oct 3, 2017

Thanks for the quick confirmation!

@kehey
Copy link

kehey commented Oct 3, 2017

Quick follow up if I may: is there a way to confirm that the -I setting was large enough to hold the index when building the index, or can this only be seen when the SAM file is interspersed with chunk headers?

@lh3
Copy link
Owner

lh3 commented Oct 3, 2017

You always get a uni-part index if -I is larger than the total lengths of all sequences. You can set it to something very large like -I 10000G (if this doesn't work, let me know). The total memory will be determined by the total length, not by the actual -I value.

If you have a prebuilt index, you can use the following command line to see how many parts it contains:

grep -obUaP "MMI\x2" index.mmi

Note that this only works with GNU grep. Mac uses BSD grep by default, which does not work well with binary files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants