## Chromosome 2q11.1 - analyse 20 kb segment

Analyse a 20 kb segment of chromosome 2 as ten consecutive 2 kb intervals, computing Tajima's D and ρ for each interval.
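
As a minimal sketch of how the segment divides into intervals (the helper name `interval_bounds` is hypothetical and not part of the notebook; the coordinates are the hg19 values used in the analysis cell), the window boundaries can be listed as follows:

```python
def interval_bounds(start, end, width):
    """Return (window_start, window_end) pairs covering [start, end) in steps of width."""
    return [(s, min(s + width, end)) for s in range(start, end, width)]


# hg19 coordinates and interval width taken from the analysis cell.
windows = interval_bounds(96985244, 97005244, 2000)
print(len(windows))              # number of 2 kb windows
print(windows[0], windows[-1])   # first and last window
```

Each window start corresponds to one `Segment start` value in the printed output.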

See Sabeti et al. 2006 *Positive Natural Selection in the Human Lineage*, p. 1619, and Altshuler and Donnelly 2005 *A haplotype map of the human genome*, Supplementary Table 4 (96.25 - 96.75).

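Note that the HapMap project cited uses reference human genome assembly version 34.3 (see the Supplementary Information referenced above), which is equivalent to UCSC version hg16 (https://en.wikipedia.org/wiki/Reference_genome#Human_reference_genome). The 1000 Genomes Phase 3 data analysed below use hg19 coordinates, so positions differ from those quoted in the HapMap paper.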

In [1]:
import os
import numpy as np
import pickle, gzip
from selectiontest import selectiontest
import pysam
from vcf import Reader        # https://pypi.org/project/PyVCF/
import pandas as pd


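# Change to the repository directory so that the local module vcf_1KG can be imported.
projdir = "/Users/helmutsimon/repos/NeutralityTest"
if os.getcwd() != projdir:
    os.chdir(projdir)
import vcf_1KG


# Change to the home directory; the data paths below are relative to it.
path = "/Users/helmutsimon/"
if os.getcwd() != path:
    os.chdir(path)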

Examine selection by location along genome.

In [2]:
chrom = 2
start_hg19 = 96985244   # segment start (hg19 coordinates)
end_hg19   = 97005244   # segment end, 20 kb downstream
interval = 2000         # interval width in bp
reps = 10000
# Restrict the 1000 Genomes sample panel to the African (AFR) super-population.
fname = 'Data sets/1KG variants full/integrated_call_samples_v3.20130502.ALL.panel'
panel_all = pd.read_csv(fname, sep=None, engine='python', skipinitialspace=True, index_col=0)
panel = panel_all[panel_all['super_pop'].isin(['AFR'])]
vcf_filename = 'Data sets/1KG variants full/ALL.chr' + str(chrom) \
                + '.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'
vcf_file = Reader(filename=vcf_filename, compressed=True, encoding='utf-8')
rows = list()
num_segs = (end_hg19 - start_hg19) // interval   # number of whole intervals in the segment
for segment in range(num_segs):
    seg_start = start_hg19 + segment * interval
    print('Segment start            =', seg_start)
    seg_end = seg_start + interval
    sfs, n, non_seg_snps = vcf_1KG.get_sfs(vcf_file, panel, chrom, seg_start, seg_end)
        
    tajd = selectiontest.calculate_D(sfs)
    print('Tajimas D                =', tajd)
    rho = selectiontest.test_neutrality(sfs)
    print('\u03C1                        =', rho)
    if len(non_seg_snps) > 0:
        print(non_seg_snps)
    row = [seg_start, tajd, rho]
    rows.append(row)
results = pd.DataFrame(rows, columns=['segstart', 'tajd', 'rho'])
results

Segment start            = 96985244
Tajimas D                = -1.7949661869444085
ρ                        = 1.445238911468624
Segment start            = 96987244
Tajimas D                = -1.9153654950603742
ρ                        = 2.9974925525195983
Segment start            = 96989244
Tajimas D                = -1.7130119378194018
ρ                        = 1.8462139199512948
Segment start            = 96991244
Tajimas D                = -1.8593750761567915
ρ                        = 0.9480941482975869
Segment start            = 96993244
Tajimas D                = -1.6296754226830776
ρ                        = 1.911886467329321
Segment start            = 96995244
Tajimas D                = -1.700811445853704
ρ                        = 2.538884001079249
Segment start            = 96997244
Tajimas D                = -1.6169913089647274
ρ                        = -0.13847295405748028
Segment start            = 96999244
Tajimas D                = -1.907625016303753
ρ                

| | segstart | tajd | rho |
|---|---|---|---|
| 0 | 96985244 | -1.794966 | 1.445239 |
| 1 | 96987244 | -1.915365 | 2.997493 |
| 2 | 96989244 | -1.713012 | 1.846214 |
| 3 | 96991244 | -1.859375 | 0.948094 |
| 4 | 96993244 | -1.629675 | 1.911886 |
| 5 | 96995244 | -1.700811 | 2.538884 |
| 6 | 96997244 | -1.616991 | -0.138473 |
| 7 | 96999244 | -1.907625 | 1.20806 |
| 8 | 97001244 | -1.717938 | 0.113942 |
| 9 | 97003244 | -1.812711 | 3.108838 |
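
For reference, Tajima's D can be computed directly from a site frequency spectrum. The sketch below uses the standard formula from Tajima (1989). The function name `tajimas_d` is hypothetical, the input is assumed to be an unfolded SFS whose entry `i - 1` counts the sites carrying `i` copies of the derived allele, and the sketch is not claimed to reproduce the internals of `selectiontest.calculate_D`.

```python
import numpy as np


def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS (assumed layout: sfs[i-1] = sites with i derived alleles)."""
    sfs = np.asarray(sfs, dtype=float)
    n = len(sfs) + 1                      # sample size
    S = sfs.sum()                         # number of segregating sites
    if S == 0:
        return float('nan')               # D is undefined without segregating sites
    i = np.arange(1, n)
    # Mean number of pairwise differences.
    pi = np.sum(sfs * i * (n - i)) / (n * (n - 1) / 2)
    # Constants from Tajima (1989).
    a1 = np.sum(1.0 / i)
    a2 = np.sum(1.0 / i ** 2)
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / np.sqrt(e1 * S + e2 * S * (S - 1))


# Illustrative inputs only (not data from this analysis), for a sample of size 20.
neutral_sfs = 10.0 / np.arange(1, 20)     # expected neutral shape, proportional to 1/i
singleton_sfs = [12] + [0] * 18           # excess of rare variants
print(tajimas_d(neutral_sfs))             # close to 0
print(tajimas_d(singleton_sfs))           # negative
```

An excess of low-frequency variants pulls D below zero, which is the direction seen in every interval of the table above.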


In [3]:
results.to_csv('Google Drive/Genetics/Bayes SFS/Neutrality test/ch2q11_2kinterval.csv')