## Chromosome 2q11.1 - analyse 20 kb segment

Analyse a 20 kb segment (hg19 chr2:96985244-97005244) as 10 x 2 kb intervals.
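
As a quick illustration of the windowing, the sketch below tiles the region into consecutive 2 kb intervals. The coordinates are the hg19 values used in the analysis cell further down; the helper name `segment_bounds` is hypothetical and is not part of `selectiontest`.

```python
def segment_bounds(start, end, interval):
    """Return (seg_start, seg_end) pairs tiling [start, end) in steps of `interval`."""
    num_segs = (end - start) // interval   # integer division drops any partial final window
    return [(start + k * interval, start + (k + 1) * interval)
            for k in range(num_segs)]


# hg19 coordinates taken from the analysis cell below
bounds = segment_bounds(96985244, 97005244, 2000)
for seg_start, seg_end in bounds:
    print(seg_start, seg_end)
```

Each window starts where the previous one ends, so the ten intervals cover the 20 kb region without gaps or overlap.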

See Sabeti et al. 2006, *Positive Natural Selection in the Human Lineage*, p. 1619, and Altshuler and Donnelly 2005, *A haplotype map of the human genome*, Supplementary Table 4 (96.25 - 96.75 Mb).

Note that the HapMap paper cited above uses human reference genome assembly version 34.3 (see the Supplementary Information referenced above), which is equivalent to UCSC version hg16 (https://en.wikipedia.org/wiki/Reference_genome#Human_reference_genome). The coordinates used in the code below are hg19.

In [1]:
import os
import numpy as np
import pandas as pd
from selectiontest import selectiontest
import pysam
from vcf import Reader        # https://pypi.org/project/PyVCF/


# Work from the home directory so that the relative data paths below resolve.
path = "/Users/helmutsimon/"
if os.getcwd() != path:
    os.chdir(path)

Examine selection by location along genome.
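
Each segment is summarised by its site frequency spectrum (SFS), from which Tajima's D is reported. For orientation, here is a minimal sketch of the textbook statistic (Tajima 1989), assuming `sfs[i-1]` holds the number of segregating sites at which the variant allele is carried by `i` of the `n` sampled chromosomes. The function name `tajimas_d` is hypothetical, and the sketch is not claimed to match the internals of `selectiontest.calculate_D`.

```python
from math import sqrt


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS of length n - 1 (assumes n >= 4)."""
    n = len(sfs) + 1
    if n < 4:
        raise ValueError("the variance term is zero for n < 4")
    S = sum(sfs)                        # number of segregating sites
    if S == 0:
        return float("nan")             # D is undefined without variation
    # Mean number of pairwise differences
    pairs = n * (n - 1) / 2
    pi = sum(count * i * (n - i) for i, count in enumerate(sfs, start=1)) / pairs
    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))


print(tajimas_d([1, 0, 0]))   # a single singleton in n = 4: negative D
print(tajimas_d([0, 1, 0]))   # a single intermediate-frequency site: positive D
```

Negative values, as seen in every segment in the output below, indicate an excess of low-frequency variants relative to the neutral expectation.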

In [2]:
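chrom = 2
n = 661                  # sample size passed to the samplers; also sizes variates0
start_hg19 = 96985244
end_hg19   = 97005244
interval = 2000          # segment length in bp
reps = 100000            # number of variates drawn from each distribution
# Sample both sets of variates once and reuse them for every segment.
variates0 = np.empty((reps, n - 1), dtype=float)
for i, q in enumerate(selectiontest.sample_wf_distribution(n, reps)):
    variates0[i] = q
variates1 = selectiontest.sample_uniform_distribution(n, reps)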
fname = 'Data sets/1KG variants full/integrated_call_samples_v3.20130502.ALL.panel'
panel_all = pd.read_csv(fname, sep=None, engine='python', skipinitialspace=True, index_col=0)
panel = panel_all[panel_all['super_pop'].isin(['AFR'])]   
vcf_filename = 'Data sets/1KG variants full/ALL.chr' + str(chrom) \
                + '.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'
vcf_file = Reader(filename=vcf_filename, compressed=True, encoding='utf-8')
rows = list()
num_segs = (end_hg19 - start_hg19) // interval   # integer division; 10 segments here
for segment in range(num_segs):
    seg_start = start_hg19 + segment * interval
    print('Segment start            =', seg_start)
    seg_end = seg_start + interval
    sfs, n, non_seg_snps = selectiontest.vcf2sfs(vcf_file, panel, chrom, seg_start, seg_end)
        
    tajd = selectiontest.calculate_D(sfs)
    print('Tajimas D                =', tajd)
    rho = selectiontest.test_neutrality(sfs, variates0, variates1)
    print('\u03C1                        =', rho)
    if len(non_seg_snps) > 0:
        print(non_seg_snps)
    row = [seg_start, tajd, rho]
    rows.append(row)
results = pd.DataFrame(rows, columns=['segstart', 'tajd', 'rho'])
results

Segment start            = 96985244
Tajimas D                = -1.7949661869444085
ρ                        = 1.4569639183694505
Segment start            = 96987244
Tajimas D                = -1.9153654950603742
ρ                        = 2.836555310694118
Segment start            = 96989244
Tajimas D                = -1.7130119378194018
ρ                        = 1.861159517452594
Segment start            = 96991244
Tajimas D                = -1.8593750761567915
ρ                        = 0.4600390148736979
Segment start            = 96993244
Tajimas D                = -1.6296754226830776
ρ                        = 1.923795935712299
Segment start            = 96995244
Tajimas D                = -1.700811445853704
ρ                        = 2.4831818672182386
Segment start            = 96997244
Tajimas D                = -1.6169913089647274
ρ                        = -0.026704552928929814
Segment start            = 96999244
Tajimas D                = -1.907625016303753
ρ               

|   | segstart | tajd | rho |
|---|----------|------|-----|
| 0 | 96985244 | -1.794966 | 1.456964 |
| 1 | 96987244 | -1.915365 | 2.836555 |
| 2 | 96989244 | -1.713012 | 1.861160 |
| 3 | 96991244 | -1.859375 | 0.460039 |
| 4 | 96993244 | -1.629675 | 1.923796 |
| 5 | 96995244 | -1.700811 | 2.483182 |
| 6 | 96997244 | -1.616991 | -0.026705 |
| 7 | 96999244 | -1.907625 | 1.325720 |
| 8 | 97001244 | -1.717938 | 0.249672 |
| 9 | 97003244 | -1.812711 | 2.938565 |


In [3]:
results.to_csv('Google Drive/Genetics/Bayes SFS/Neutrality test/ch2q11_2kinterval.csv')