We run the main analysis script which imports all the necessary libraries and provides all our functions

In [96]:
%run src/analyse.py 

We parse the reference genome and parse it to obtain the masked (N) regions. These regions will be used when we are sampling random mutations for our null model. 

In [None]:
TAIR10 = 'reference/TAIR10-masked.fa'

chromosomes =  { ch.name:ch.seq  for ch in SeqIO.parse(TAIR10, 'fasta', generic_dna) }

In [72]:
for ch in chromosomes :
    print ch, len(chromosomes[ch])

Chr5 26975502
Chr4 18585056
Chr3 23459830
Chr2 19698289
Chr1 30427671
mitochondria 366924
chloroplast 154478


In [None]:
maskedRanges = { ch:get_masked_ranges(chromosomes[ch]) for ch in chromosomes }

In [44]:
for ch in chromosomes :
    ranges = maskedRanges[ch]
    rangeSize = map(lambda x : x[1]-x[0], ranges)
    totalMasked = sum(rangeSize)
    print '{} has {} \tmasked regions covering {:.2f} of it.'.format(ch, len(ranges), float(totalMasked)/len(chromosomes[ch]))

Chr5 has 14204 	masked regions covering 0.17 of it.
Chr4 has 9929 	masked regions covering 0.20 of it.
Chr3 has 12682 	masked regions covering 0.18 of it.
Chr2 has 11107 	masked regions covering 0.21 of it.
Chr1 has 15902 	masked regions covering 0.15 of it.
mitochondria has 108 	masked regions covering 0.05 of it.
chloroplast has 87 	masked regions covering 0.04 of it.


We then read the vcf file to obtain al the heterozygous variants. 

`get_hetero_variants()` will extract heterozygous variant that satisfy :
  * minGQ specified for SNPs (default 50)
  * minGQ specified for indels (default 30)
  * minAD specified (default 5)
  
Where GQ is genotype quality and AD is alelle depth as defined by GATK. 
The defaults reflects values reported in the original publication. 

`get_hetero_variants()` returns sample counts for each genotype in each variant position. It does not return per-sample information.


In [97]:
vcfFile = 'calls/genotype.vcf'

hetSNPs, hetIndels = get_hetero_variants(vcfFile)

In [98]:
for ch in hetSNPs :
    print '{} has {} hetero SNPs and {} hetero indels.'.format \
    (ch, len(hetSNPs[ch]), len(hetIndels[ch]))

Chr5 has 133333 hetero SNPs and 33662 hetero indels.
Chr4 has 89621 hetero SNPs and 22595 hetero indels.
Chr3 has 102039 hetero SNPs and 24587 hetero indels.
Chr2 has 79041 hetero SNPs and 19656 hetero indels.
Chr1 has 141016 hetero SNPs and 35974 hetero indels.
mitochondria has 192 hetero SNPs and 12 hetero indels.
chloroplast has 16 hetero SNPs and 2 hetero indels.


`process_hetero_variants()` will split heretozygous variants into **inherited** and **denovo** partitions. 

If we observe a heterozygous variant in more than *minInherit* samples (default 10) we claim this was inherited, therefore **f1 was heterozygous in these positions** 

If we observe a heterozygous variant in a maximum of *maxDenovo* samples (default 1), we claim **a sample in f2 has a denovo mutation in this position** 

In [99]:
variants = dict()
for ch in hetSNPs :
    inheritSNPs, denovoSNPs = process_hetero_variants(hetSNPs[ch])
    inheritIndels, denovoIndels = process_hetero_variants(hetIndels[ch])
    
    thisChromosome = dict()
    
    thisChromosome['inheritSNPs'] = inheritSNPs
    thisChromosome['inheritIndels'] = inheritIndels
    thisChromosome['denovoSNPs'] = denovoSNPs
    thisChromosome['denovoIndels'] = denovoIndels
    
    variants[ch]=thisChromosome
    
    print '{}: inherited {} snps, {} indels, denovo {} snps,{} indels '.format \
    (ch, len(inheritSNPs), len(inheritIndels), len(denovoSNPs), len(denovoIndels) )
    

Chr5: inherited 119098 snps, 28825 indels, denovo 3877 snps,1867 indels 
Chr4: inherited 80184 snps, 19333 indels, denovo 2728 snps,1248 indels 
Chr3: inherited 88765 snps, 20375 indels, denovo 3763 snps,1643 indels 
Chr2: inherited 70455 snps, 16571 indels, denovo 2329 snps,1281 indels 
Chr1: inherited 126255 snps, 30515 indels, denovo 4174 snps,2054 indels 
mitochondria: inherited 43 snps, 2 indels, denovo 39 snps,3 indels 
chloroplast: inherited 8 snps, 1 indels, denovo 2 snps,0 indels 


`get_hetero_distances()` will calculate the average distance of a denovo mutation to the closest heterozygous mutation. 

We can now calculate this value per chromosome, but for the number to make sense we need to compare it to a **null model**. 

Our null model consists of sampling the same number of mutations positions randomly from the chromosome and measuring the average distance to the closest heterozygous point.  We repeat the experiment `nullsamples` times and report mean and std.  

Let's do this for SNPs first : 

In [104]:
for chno in range(1,6) :
    ch = 'Chr'+str(chno)
    # measuring the observed value
    denovo = variants[ch]['denovoSNPs']
    hetero = variants[ch]['inheritSNPs']
    dist = get_hetero_distances(denovo, hetero)
    
    # sampling the null model 
    nullSamples = 1000
    null = list()
    for i in range(nullSamples) :
        null_denovo = sample_mutations(len(denovo),maskedRanges[ch],chrSize=len(chromosomes[ch])) 
        null_dist = get_hetero_distances(null_denovo, hetero)
        null.append(mean(null_dist))
    
    
    print ch, mean(dist), mean(null), std(null)

Chr1 260.135122185 2991.33622688 188.095848259
Chr2 296.996135681 1343.35874281 71.0516160267
Chr3 258.634334308 1292.51587749 51.6397180415
Chr4 506.761363636 1829.77528116 103.466338013
Chr5 211.622646376 2013.6351024 125.236340515


In [93]:
hist(dist, bins=range(0,1000,10), label='true', normed=True)
hist(null_dist, bins=range(0,1000,10), label='null', normed=True)
legend()
print max(dist), max(null_dist)

44031 144493


In [90]:
allChromosome = list()
here = 0 
for r in maskedRanges['Chr1'] : 
    allChromosome += range(here, r[0])
    here = r[1]
print len(allChromosome), float(len(allChromosome))/len(chromosomes['Chr1'])
allChromosome = array(sorted(allChromosome), dtype=int)

all_dist = get_hetero_distances(allChromosome, hetero)

25839186 0.849200255912


In [95]:
hist(all_dist, bins=range(0,1000,10), label='full', normed=True)
print mean(all_dist), std(all_dist)

1120.86098331 4026.48978733
