I used bamtool split function and have two comments:
1) It exits with an error message (see below) that can be avoided if I pre-filter the BAM file to remove unmapped reads:
terminate called after throwing an instance of 'std::out_of_range'
I guess it could be faster if the 'bamtools filters -IsMapped true' can be avoided in the processing step. May I recommend to tolerate unmapped read in the split function and output the unmapped reads to a separated file as stub.unmapped.bam, if possible.
2) the default suffix after stub string is .REF_refname such as .REF_chr1. Although I can modify the code, it is useful for me to have an option to change the 'REF' to any string such as 'c'. This might not be useful for other users though.
Thanks a lot.
1) The current implementation does a (naive) look-up for the alignment's destination file by using its RefID as a vector index. So if it's unmapped (RefID == -1) then yes, that would definitely cause the vector's access exception. Sorry about that oversight; I'll add in a fix for this case with the next round of updates.
2) This seems useful enough (& pretty straightforward to implement). I'll add this feature when I push the fix for (1).
And thanks for your feedback, always appreciated.
Thanks again for the quick reply. Sorry that I forgot one more suggestion for spliting BAM file. So far, I have two reasons to split by chromosome: parallel processing and reducing sorting time for big BAM file. However, we may still have hugh number of reads per chromosome due to 1) high depth per sample and 2) pooling reads from multiple samples together to run GATK SNP calling, which is better than per-sample SNP calling. So, we may need to further split a big BAM file of one chromosome to multiple BAM files for different regions in that chromosome. Samtools has such function but it requires sorted and indexed BAM as input, I believe (please tell me if I am wrong here). For the pure spliting purpose, I do not think it is necessary to sort and index the input BAM file, expecially one reason for spliting is to avoid long sorting time. So, could you please consider the possibility of spliting by region as: chr1:1000000-2000000 in future releases? One caveat here is how to deal with paired-end reads near the border region. For example, I may want to keep the paired reads in the same splited file as their mates if they are between 2000000 and 2001000, so that the de-duplication can be run properly. Later, I can run GATK with option to call SNPs only in the 1000000-2000000 region to ignore the reads in the extended region. Just my two cents. BTW, I really like the simple and logical interface of bamtools as compared with other software suite.
Pushed the fix for the first 2 requests. Should work with unmapped reads, and there is a new '-refPrefix' option available when splitting by reference:
$ bamtools split -reference -refPrefix FOO- -in data.bam
should yield: data.FOO-chr1.bam data.FOO-chr2.bam etc...
Let me think over the region splitting some more; paired-end reads would definitely make this non-trivial to get right.
Thanks again for the suggestions and compliments.