New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

explanation of insert sizes from BWA-MEM #113

Open
roblanf opened this Issue Mar 7, 2017 · 10 comments

Comments

Projects
None yet
4 participants
@roblanf

roblanf commented Mar 7, 2017

I have noticed that insert sizes from BWA-MEM seem to start high at the beginning of a chromosome, and drop down towards the end. This isn't really an issue (feel free to close and ignore) but I wanted to put it out here to see if anyone has a good explanation.

I'm mapping 100bp PE Illumina data to a reference genome from another species. The reference (Eucalyptus grandis) is ~5% diverged from the study species. It has 11 good chromosomes, and a bunch of rubbish scaffolds. Gene order etc. is well conserved between the two.

I'm running BWA-MEM and looking at the output with Qualimap like this:

bwa mem $ref $R1 $R2 -t $threads > out.sam
qualimap bamqc -bam out.bam -outdir $outBWAMEM"qualimap_all/" -nt $threads -c

What I don't understand is the plot of insert sizes. My expectation is that most insert sizes will be ~200bp. But what I observe from BWA-MEM looks very different.

genome_insert_size_across_reference

nevertheless, the insert-size histogram looks fine, and looking at the mapping in IGV looks fine too.

genome_insert_size_histogram

So, it looks like the start of chromosomes must contain a small number of pairs with enormous insert sizes. This result is not shared by other mappers like NextGenMap, bbmap, or Stampy (all run with default params). They all look more like the plot below (this from NextGenMap):
genome_insert_size_across_reference

I'm struggling to understand the difference.

@lh3

This comment has been minimized.

Show comment
Hide comment
@lh3

lh3 Mar 7, 2017

Owner

I always run htsbox bamshuf or samtools collate if the input reads come from a coordinate-sorted BAM. Everyone should do this, or provide the insert size distribution to bwa-mem.

Owner

lh3 commented Mar 7, 2017

I always run htsbox bamshuf or samtools collate if the input reads come from a coordinate-sorted BAM. Everyone should do this, or provide the insert size distribution to bwa-mem.

@lh3 lh3 closed this Mar 7, 2017

@roblanf

This comment has been minimized.

Show comment
Hide comment
@roblanf

roblanf Mar 7, 2017

Any chance you could explain how this would affect the insert size distribution, and why you think the insert size distribution looks so odd before doing this?

My understanding is that both of these things move reads around in the BAM file, but that neither should affect the distribution of insert sizes, which are calculated from the paired reads themselves.

I tried samtools collate but Qualimap insists on a sorted bam (which I just do with samtools sort usually).

roblanf commented Mar 7, 2017

Any chance you could explain how this would affect the insert size distribution, and why you think the insert size distribution looks so odd before doing this?

My understanding is that both of these things move reads around in the BAM file, but that neither should affect the distribution of insert sizes, which are calculated from the paired reads themselves.

I tried samtools collate but Qualimap insists on a sorted bam (which I just do with samtools sort usually).

@lh3

This comment has been minimized.

Show comment
Hide comment
@lh3

lh3 Mar 7, 2017

Owner

I wasn't clear: when you generate fastq, you should not generate from a sorted bam (actually don't use Picard tools to generate fastq!). samtools collate should be run before generating fastq.

Owner

lh3 commented Mar 7, 2017

I wasn't clear: when you generate fastq, you should not generate from a sorted bam (actually don't use Picard tools to generate fastq!). samtools collate should be run before generating fastq.

@maarten-k

This comment has been minimized.

Show comment
Hide comment
@maarten-k

maarten-k Mar 7, 2017

(actually don't use Picard tools to generate fastq!).

Excuse my ignorance, but why is picard bam2fastq a bad idea? I use this (after picard RevertSam SORT_ORDER=queryname|picard MarkIlluminaAdapters) as well as the Broad WDL realingment pipeline do use

maarten-k commented Mar 7, 2017

(actually don't use Picard tools to generate fastq!).

Excuse my ignorance, but why is picard bam2fastq a bad idea? I use this (after picard RevertSam SORT_ORDER=queryname|picard MarkIlluminaAdapters) as well as the Broad WDL realingment pipeline do use

@lh3

This comment has been minimized.

Show comment
Hide comment
@lh3

lh3 Mar 7, 2017

Owner

why is picard bam2fastq a bad idea

Because 1) many naively use it to generate fastq from sorted bam, which is impropriate for bwa; 2) if you sort by query name first, it is slow. Samtools collate+fastq is times faster and the preferred choice.

Owner

lh3 commented Mar 7, 2017

why is picard bam2fastq a bad idea

Because 1) many naively use it to generate fastq from sorted bam, which is impropriate for bwa; 2) if you sort by query name first, it is slow. Samtools collate+fastq is times faster and the preferred choice.

@ekg

This comment has been minimized.

Show comment
Hide comment
@ekg

ekg Mar 7, 2017

ekg commented Mar 7, 2017

@roblanf

This comment has been minimized.

Show comment
Hide comment
@roblanf

roblanf Mar 7, 2017

OK, now I'm totally lost! The fastq's I'm using here come straight from a hiseq, with no pre-processing at all.

All I'm doing is this:

  1. Map forward and reverse fastq files with BWA-MEM
  2. SAM to BAM with samtools
  3. Sort and index BAM with samtools
  4. Qualimap to generate plots of insert sizes

Maybe there's something fundamental that I'm missing here, but my impression is that Qualimap is just reading off the positions of the two reads from the BAM produced by BWA-MEM.

roblanf commented Mar 7, 2017

OK, now I'm totally lost! The fastq's I'm using here come straight from a hiseq, with no pre-processing at all.

All I'm doing is this:

  1. Map forward and reverse fastq files with BWA-MEM
  2. SAM to BAM with samtools
  3. Sort and index BAM with samtools
  4. Qualimap to generate plots of insert sizes

Maybe there's something fundamental that I'm missing here, but my impression is that Qualimap is just reading off the positions of the two reads from the BAM produced by BWA-MEM.

@lh3

This comment has been minimized.

Show comment
Hide comment
@lh3

lh3 Mar 8, 2017

Owner

Ok. My previous explanation was wrong then. Do you have the bwa-mem stderr output?

Owner

lh3 commented Mar 8, 2017

Ok. My previous explanation was wrong then. Do you have the bwa-mem stderr output?

@lh3

This comment has been minimized.

Show comment
Hide comment
@lh3

lh3 Mar 8, 2017

Owner

A wild guess. Once two reads in a pair are too far apart (around 500-600 according to your plot, but you should check the bwa-mem stderr log for sure), bwa-mem will treat them as independent reads - no read pair constraints are applied at all. If one of the two reads is a repeat, it could get mapped everywhere in the genome. Other mappers may have different behaviors. For example, they could try to place the two reads as close as possible. Or if you use a large max insert size, e.g. 1000, in these other mappers, the practical effect will be similar.

Owner

lh3 commented Mar 8, 2017

A wild guess. Once two reads in a pair are too far apart (around 500-600 according to your plot, but you should check the bwa-mem stderr log for sure), bwa-mem will treat them as independent reads - no read pair constraints are applied at all. If one of the two reads is a repeat, it could get mapped everywhere in the genome. Other mappers may have different behaviors. For example, they could try to place the two reads as close as possible. Or if you use a large max insert size, e.g. 1000, in these other mappers, the practical effect will be similar.

@lh3 lh3 reopened this Mar 8, 2017

@roblanf

This comment has been minimized.

Show comment
Hide comment
@roblanf

roblanf Mar 8, 2017

This sounds pretty reasonable. My guess so far (I'll need to do some digging to check) is that these things are happening:

  1. My reference has more repeats at the start of the chromosomes
  2. BWA is mapping as you say
  3. Qualimap (which I'm using to make the plots) is perhaps ignoring flags in the BAM that BWA puts in to tell it which reads were mapped without constraints

If these things are all happening, I think it would explain it. If Qualimap is treating flags from other mappers properly (i.e. not using insert sizes on reads that were mapped without constraints), then that would explain the apparent differences.

Looking at the data in IGV, I can't see anything particularly different about the mapping from BWA.

roblanf commented Mar 8, 2017

This sounds pretty reasonable. My guess so far (I'll need to do some digging to check) is that these things are happening:

  1. My reference has more repeats at the start of the chromosomes
  2. BWA is mapping as you say
  3. Qualimap (which I'm using to make the plots) is perhaps ignoring flags in the BAM that BWA puts in to tell it which reads were mapped without constraints

If these things are all happening, I think it would explain it. If Qualimap is treating flags from other mappers properly (i.e. not using insert sizes on reads that were mapped without constraints), then that would explain the apparent differences.

Looking at the data in IGV, I can't see anything particularly different about the mapping from BWA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment