SAM header interspersed through file #15
Comments
2.0-r191 is nearly the first version from when this repo was made public. There have been lots of meticulous improvements and bugfixes since then. You should use v2.0-r275 from the release page. Please let me know if you still have the issue. |
What do you mean by a "single reference"? GRCh38 contains many sequences, too. |
Yes, sorry, I meant a single species. I have updated to the latest version but am still getting the same problem. |
Ah, I see the issue. Unfortunately, it is very difficult to fix. When you have a huge reference database, minimap2 only holds part of it in memory, aligns all reads against this part, and then moves on to the next part of the reference. Because it is unable to see all the reference sequences before doing alignment, it can't write the right SAM header. There are two solutions to this.
1) You can increase option -I: minimap2 -a -I100g ref.fa reads.fa
You will need a machine with huge RAM for this.
2) You can filter out all SAM header lines and then add them back when you convert SAM to BAM: minimap2 -a ref.fa reads.fa | grep -v ^@ | samtools view -bt ref.fa.fai - > unsrt.bam |
Ahhh, yes I should have seen that before. Default is 4Gb and my database is 13Gb. Great, those two options should be able to sort out my problem. Thanks for the speedy response. |
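For anyone wanting to check up front whether their reference is likely to be split, here is a minimal Python sketch that sums the sequence lengths in a FASTA file and compares the total against the batch size. It assumes that -I is measured in reference bases and that "4G" means 4,000,000,000, which matches the default quoted in this thread but is not verified against the minimap2 source. The function names `total_bases` and `fits_in_one_batch` are made up for this example.

```python
from typing import Iterable


def total_bases(fasta_lines: Iterable[str]) -> int:
    """Sum the lengths of all sequence lines, skipping '>' name lines."""
    return sum(len(line.strip()) for line in fasta_lines if not line.startswith(">"))


def fits_in_one_batch(n_bases: int, batch_bases: int = 4_000_000_000) -> bool:
    """True if the whole reference fits in a single batch.

    batch_bases defaults to the 4G mentioned in this thread (an assumption).
    """
    return n_bases <= batch_bases


# Tiny in-memory demo; for a real file, pass an open text handle instead.
demo = [">seq1", "ACGTACGT", "ACGT", ">seq2", "GGCC"]
n = total_bases(demo)
print(n, fits_in_one_batch(n))
```

If the check returns False, the reference will presumably be processed in several parts, and the SAM output would be expected to show the repeated header blocks described in this issue.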
Just thought I would also mention (in case other people stumble on this) this rule also applies to indexing files.
When running
Will still load the index into memory in 4GB chunks (default). So to avoid this I found I had to index like so
The above will now avoid the 'concatenated' SAM file effect. NB @lh3 example above using |
That is correct. As such, a larger |
Thanks for the quick confirmation! |
Quick follow up if I may: is there a way to confirm that the -I setting was large enough to hold the index when building the index, or can this only be seen when the SAM file is interspersed with chunk headers? |
You always get a uni-part index if If you have a prebuilt index, you can use the following command line to see how many parts it contains: grep -obUaP "MMI\x2" index.mmi Note that this only works with GNU grep. Mac uses BSD grep by default, which does not work well with binary files. |
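For machines without GNU grep, the same check can be sketched in Python. This simply counts occurrences of the byte pattern from the grep command above (`MMI\x02`) and treats each one as the start of an index part. That is the same assumption the grep makes, so a chance occurrence of those four bytes inside the index data would inflate the count. `count_index_parts` is a hypothetical helper name, not part of minimap2.

```python
def count_index_parts(path: str, magic: bytes = b"MMI\x02", chunk_size: int = 1 << 20) -> int:
    """Count occurrences of `magic` in a binary file, reading it in chunks."""
    count = 0
    tail = b""  # last len(magic) - 1 bytes of the previous buffer
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            data = tail + chunk
            count += data.count(magic)
            # Keep just enough bytes to catch a match spanning two chunks.
            # The tail is shorter than the pattern, so nothing is counted twice.
            tail = data[-(len(magic) - 1):]
    return count
```

Calling `count_index_parts("index.mmi")` should then give the number of parts under that assumption. A result above 1 would suggest the index was split.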
Hi Heng,
Just came across a bizarre issue which I have been debugging all day to try and get at the root cause.
To cut a long story short, it looks like in some circumstances the SAM header gets spread out in chunks throughout the SAM file - almost like there are multiple SAM files concatenated together.
I only see this when aligning to a reference that is effectively a database of fasta files concatenated together.
For example:
I have a file DB.fasta which has 16498 entries in it. When I align my fastq file to this reference the resulting SAM file has 4 separate header locations spread throughout the file. Each section is unique.
To check my sanity I ran the exact same thing with bwa mem and there are no issues.
Also, if I just align the sample to a single reference (with minimap2) such as GRCh38, the resulting SAM is fine.
When trying to view the resulting SAM file with samtools view I get an error saying the file is truncated. Therefore I was only able to see this in vim.
My current install of minimap2 is 2.0-r191-dirty
Let me know if you want some more information.
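The symptom described here (header blocks reappearing partway through the SAM file) can be confirmed without opening the file in an editor. Below is a minimal Python sketch that reports the line number at which each run of `@` header lines begins. A well-formed SAM file should give at most one block, starting at line 1. The function name `header_block_starts` is made up for this example, and it assumes any line starting with `@` is a header line.

```python
from typing import Iterable, List


def header_block_starts(lines: Iterable[str]) -> List[int]:
    """Return 1-based line numbers where a run of '@' header lines begins."""
    starts: List[int] = []
    in_header = False
    for lineno, line in enumerate(lines, start=1):
        is_header = line.startswith("@")
        if is_header and not in_header:
            starts.append(lineno)  # a new header block opens here
        in_header = is_header
    return starts


# In-memory demo mimicking two concatenated chunks of SAM output.
demo = [
    "@SQ\tSN:ref1\tLN:100",
    "read1\t0\tref1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "@SQ\tSN:ref2\tLN:200",
    "read1\t0\tref2\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(header_block_starts(demo))
```

For a real file, pass an open text handle, e.g. `header_block_starts(open("out.sam"))`. More than one entry in the result indicates interspersed headers.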