# Filtering .vcf file from ipyrad

This notebook details the secondary filtering I apply to a .vcf file taken directly from ipyrad. I use this notebook to:  
1) remove weird individuals based on a PCA,  
2) filter out individuals with greater than a certain threshold of missing data,  
3) filter out loci missing in a certain percentage of samples (note: ipyrad does this on a locus basis, but with indels and Ns there could still be sites that are missing in a lot of samples),  
4) filter for a certain minor allele frequency,  
5) filter out loci with excess heterozygosity within populations based on Hardy-Weinberg equilibrium, \*  
6) filter out loci significantly out of H-W equilibrium within populations,  
7) filter for only biallelic SNPs,  
8) use Python code to select 1 SNP per GBS locus.  


\*Note: I set the *max_shared_Hs_locus* parameter in ipyrad to 1.0 so it does not filter for excess heterozygotes across samples. 

In [7]:
%%sh
date "+%D"

06/28/17


In [9]:
%cd /home/ksilliman/Projects/Phylo_Ostrea/c80-denovo/Making_Files/

/home/ksilliman/Projects/Phylo_Ostrea/c80-denovo/Making_Files


Use *bcftools* to rename samples whose barcode file entries were incorrect

In [10]:
%cat OR_CAfix.txt

CA2_8 OR3_1
OR3_1 CA2_8
OR7_5 CA7_5


In [2]:
%%sh
bcftools reheader -s OR_CAfix.txt -o cp-OL-s7filter-s67-c80-66.vcf ../OL-s7filter-s67-c80-66_outfiles/OL-s7filter-s67-c80-66.vcf
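
The rename file lists `old_name new_name` pairs, so its first two lines swap CA2_8 and OR3_1 while the third renames OR7_5. A minimal Python sketch of that lookup, assuming each name in the header is mapped once against the original names (the helpers `read_rename_pairs` and `apply_renames` are hypothetical illustrations, not part of bcftools):

```python
def read_rename_pairs(lines):
    """Parse whitespace-separated 'old_name new_name' lines into a dict."""
    pairs = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            pairs[fields[0]] = fields[1]
    return pairs


def apply_renames(samples, pairs):
    """Look each sample up once in the original names, so swaps do not chain."""
    return [pairs.get(name, name) for name in samples]


# The three lines of OR_CAfix.txt shown above
pairs = read_rename_pairs(["CA2_8 OR3_1", "OR3_1 CA2_8", "OR7_5 CA7_5"])
# A short, illustrative sample list (not the full VCF header)
renamed = apply_renames(["BC1_1", "CA2_8", "OR3_1", "OR7_5"], pairs)
print(renamed)
```

Samples that do not appear in the file keep their names.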

### Filtering out individuals

Removing weird individuals

In [3]:
%%sh
vcftools --vcf cp-OL-s7filter-s67-c80-66.vcf --recode --recode-INFO-all --remove-indv \
CA5_15b --remove-indv WA10_11 --remove-indv OR2_11 --min-alleles 2 --out OL-c80-66-s67


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf cp-OL-s7filter-s67-c80-66.vcf
	--recode-INFO-all
	--min-alleles 2
	--out OL-c80-66-s67
	--recode
	--remove-indv CA5_15b
	--remove-indv OR2_11
	--remove-indv WA10_11

Excluding individuals in 'exclude' list
After filtering, kept 190 out of 193 Individuals
Outputting VCF file...
After filtering, kept 102749 out of a possible 102749 Sites
Run Time = 15.00 seconds


Making a file with the percent missingness for each individual.

In [4]:
%%sh
vcftools --vcf OL-c80-66-s67.recode.vcf --missing-indv --out OL-c80-66-s67


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf OL-c80-66-s67.recode.vcf
	--missing-indv
	--out OL-c80-66-s67

After filtering, kept 190 out of 190 Individuals
Outputting Individual Missingness
After filtering, kept 102749 out of a possible 102749 Sites
Run Time = 2.00 seconds


In [12]:
%%sh
head OL-c80-66-s67.imiss

INDV	N_DATA	N_GENOTYPES_FILTERED	N_MISS	F_MISS
BC1_1	102749	0	57941	0.563908
BC1_10	102749	0	9766	0.0950472
BC1_11	102749	0	60349	0.587344
BC1_12	102749	0	50174	0.488316
BC1_19	102749	0	66593	0.648113
BC1_2	102749	0	78917	0.768056
BC1_20	102749	0	13687	0.133208
BC1_22	102749	0	5617	0.0546672
BC1_4	102749	0	15596	0.151787
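
The `F_MISS` column is simply `N_MISS / N_DATA`. A small sketch that recomputes it for the first two rows of the table above (the function name `fraction_missing` is my own, for illustration):

```python
def fraction_missing(n_miss, n_data):
    """Proportion of sites with no genotype call for one individual."""
    return float(n_miss) / n_data


# (INDV, N_DATA, N_MISS) copied from the .imiss output above
rows = [("BC1_1", 102749, 57941), ("BC1_10", 102749, 9766)]
for indv, n_data, n_miss in rows:
    print("{}\t{:.6f}".format(indv, fraction_missing(n_miss, n_data)))
```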


In [14]:
import numpy as np
import matplotlib.pyplot as plt
import plotly.plotly as py
import plotly.graph_objs as go
py.sign_in('ksil91', 'ycvvzZQxVMU8Sg9wVQBH')
import pandas

In [15]:
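# Read the vcftools .imiss table into a structured array (one named field per column)
imiss = np.genfromtxt('OL-c80-66-s67.imiss', names=True,dtype=None)
data = [
    go.Histogram(
        # Pass the F_MISS column as a plain 1-D array of floats
        x=imiss["F_MISS"],autobinx = False,
        xbins=dict(
            start=0,
            end=1,
            size=0.05
        )
    )
]
layout = go.Layout(
    title='Proportion of missing data per individual',bargap=0.1)
fig = go.Figure(data=data,layout=layout)
py.iplot(fig)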

Individuals missing data at fewer than 62% of sites

In [16]:
imissDF = pandas.DataFrame(imiss)
imissDF[imissDF.F_MISS < 0.62].INDV.values

array(['BC1_1', 'BC1_10', 'BC1_11', 'BC1_12', 'BC1_20', 'BC1_22', 'BC1_4',
       'BC1_7', 'BC1_8', 'BC1_9', 'BC2_10', 'BC2_11', 'BC2_12', 'BC2_13',
       'BC2_16', 'BC2_17', 'BC2_18', 'BC2_3', 'BC2_6', 'BC2_7', 'BC2_9',
       'BC3_1', 'BC3_16', 'BC3_17', 'BC3_18', 'BC3_20', 'BC3_3', 'BC3_9',
       'BC4_12', 'BC4_13', 'BC4_15', 'BC4_19', 'BC4_2', 'BC4_3', 'BC4_6',
       'BC4_7', 'BC4_9', 'CA1_15', 'CA1_16', 'CA1_19', 'CA1_2', 'CA1_22',
       'CA1_4', 'CA1_5', 'CA1_9', 'CA2_10', 'CA2_12', 'OR3_1', 'CA3_4',
       'CA3_6', 'CA3_7', 'CA3_8', 'CA4_1', 'CA4_16', 'CA4_2', 'CA4_20',
       'CA4_9', 'CA5_10', 'CA5_11', 'CA5_13', 'CA5_14', 'CA5_15a', 'CA5_7',
       'CA6_10', 'CA6_11', 'CA6_12', 'CA6_13', 'CA6_14', 'CA6_16',
       'CA6_18', 'CA6_2', 'CA7_10', 'CA7_11', 'CA7_12', 'CA7_13', 'CA7_15',
       'CA7_2', 'CA7_7', 'CA7_8', 'CA7_9', 'OR1_1', 'OR1_11', 'OR1_2',
       'OR1_4', 'OR1_5', 'OR1_6', 'OR1_7', 'OR2_1', 'OR2_10', 'OR2_2',
       'OR2_20', 'OR2_9', 'CA2_8', 'OR3_15', 'OR3_1

In [8]:
len(imissDF[imissDF.F_MISS < 0.62].INDV.values)

137

Write file of individuals with missing data at greater than 62% of sites

In [17]:
imissDF[imissDF.F_MISS > 0.62].INDV.to_csv("OL-c80-66-s67_imiss62.txt",sep=" ",index=False)
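
Note that the two selections above use `<` and `>`, so an individual sitting at exactly 0.62 would land in neither list (none does here, since 137 are kept out of 190). A minimal standard-library sketch of a complementary split, assuming a tab-delimited `.imiss` table with the header shown earlier; the function name `split_by_missingness` is hypothetical:

```python
import csv
import io


def split_by_missingness(handle, threshold):
    """Return (keep, drop) lists of individuals; every row lands in exactly one."""
    keep, drop = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["F_MISS"]) > threshold:
            drop.append(row["INDV"])
        else:
            keep.append(row["INDV"])
    return keep, drop


# Three rows copied from the .imiss output above, held in memory for the example
example = io.StringIO(
    "INDV\tN_DATA\tN_GENOTYPES_FILTERED\tN_MISS\tF_MISS\n"
    "BC1_1\t102749\t0\t57941\t0.563908\n"
    "BC1_19\t102749\t0\t66593\t0.648113\n"
    "BC1_2\t102749\t0\t78917\t0.768056\n"
)
keep, drop = split_by_missingness(example, 0.62)
print(keep, drop)
```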

Use vcftools to remove individuals with greater than 62% missingness. Also keep only polymorphic loci, loci genotyped in at least 50% of the remaining individuals (`--max-missing 0.5`), and loci with a minor allele frequency of at least 2.5%.

In [10]:
%%sh
vcftools --vcf OL-c80-66-s67.recode.vcf --recode --recode-INFO-all --remove \
OL-c80-66-s67_imiss62.txt --min-alleles 2 --max-missing 0.5 --maf 0.025 --out OL-c80-66-s67-m50x62-maf025


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf OL-c80-66-s67.recode.vcf
	--remove OL-c80-66-s67_imiss62.txt
	--recode-INFO-all
	--maf 0.025
	--min-alleles 2
	--max-missing 0.5
	--out OL-c80-66-s67-m50x62-maf025
	--recode

Excluding individuals in 'exclude' list
After filtering, kept 137 out of 190 Individuals
Outputting VCF file...
After filtering, kept 34565 out of a possible 102749 Sites
Run Time = 6.00 seconds
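
The direction of `--max-missing` is easy to misread: the value is the minimum proportion of individuals that must have a genotype call, so 0.5 keeps sites called in at least half of the samples. The sketch below illustrates the two site filters for a biallelic site. It is a simplification under my own assumptions (any genotype containing `.` is treated as missing, and both thresholds are inclusive), not a reimplementation of vcftools:

```python
def site_stats(genotypes):
    """Return (call rate, minor allele frequency) for one biallelic site.

    genotypes: list of strings such as '0/0', '0/1', '1/1' or './.'
    """
    called = [g for g in genotypes if "." not in g]
    call_rate = len(called) / float(len(genotypes))
    alleles = [a for g in called for a in g.replace("|", "/").split("/")]
    alt_freq = alleles.count("1") / float(len(alleles)) if alleles else 0.0
    return call_rate, min(alt_freq, 1.0 - alt_freq)


def passes(genotypes, min_call_rate=0.5, min_maf=0.025):
    """True if the site has enough called genotypes and a common enough minor allele."""
    call_rate, maf = site_stats(genotypes)
    return call_rate >= min_call_rate and maf >= min_maf


# Made-up site: 4 of 6 individuals called, 3 of 8 called alleles are the alternate
example = ["0/0", "0/1", "./.", "./.", "1/1", "0/0"]
print(site_stats(example))
print(passes(example, min_call_rate=0.5), passes(example, min_call_rate=0.7))
```

With these made-up genotypes the site survives the 0.5 threshold but not a stricter 0.7 one.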


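Repeat the same filtering with a stricter missing-data threshold: remove individuals with greater than 62% missingness, and keep only polymorphic loci, loci genotyped in at least 70% of the remaining individuals (`--max-missing 0.70`), and loci with a minor allele frequency of at least 2.5%.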

In [12]:
%%sh
vcftools --vcf OL-c80-66-s67.recode.vcf --recode --recode-INFO-all --remove \
OL-c80-66-s67_imiss62.txt --min-alleles 2 --max-missing 0.70 --maf 0.025 --out OL-c80-66-s67-m70x62-maf025


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf OL-c80-66-s67.recode.vcf
	--remove OL-c80-66-s67_imiss62.txt
	--recode-INFO-all
	--maf 0.025
	--min-alleles 2
	--max-missing 0.7
	--out OL-c80-66-s67-m70x62-maf025
	--recode

Excluding individuals in 'exclude' list
After filtering, kept 137 out of 190 Individuals
Outputting VCF file...
After filtering, kept 20910 out of a possible 102749 Sites
Run Time = 5.00 seconds


Write files with the sample name and either population (.pop) or sampling location (.loc). These are used for the heterozygosity filtering downstream and to convert the .vcf file to a .str file in PGD Spider. This depends on each sample name starting with its population code, followed by an underscore.

In [13]:
# Sampling locations that are pooled into a single population
pop_dict = {'CA2':'CA23','CA3':'CA23','WA1':'WA19','WA9':'WA19'}

# .pop file: sample name and (possibly pooled) population
with open("OL-c80-66-s67.imiss","r") as IN, open("OL-c80-66-s67.pop","w") as OUT:
    next(IN)  # skip the header line
    for line in IN:
        name = line.split()[0]
        pop = name.split("_")[0]
        # Use the pooled name if there is one, otherwise keep the location code
        pop = pop_dict.get(pop, pop)
        OUT.write(name+"\t"+pop+"\n")

# .loc file: sample name and original sampling location
with open("OL-c80-66-s67.imiss","r") as IN, open("OL-c80-66-s67.loc","w") as OUT:
    next(IN)  # skip the header line
    for line in IN:
        name = line.split()[0]
        loc = name.split("_")[0]
        OUT.write(name+"\t"+loc+"\n")

In [19]:
%%sh
head -n 15 OL-c80-66-s67.pop

BC1_1	BC1
BC1_10	BC1
BC1_11	BC1
BC1_12	BC1
BC1_19	BC1
BC1_2	BC1
BC1_20	BC1
BC1_22	BC1
BC1_4	BC1
BC1_7	BC1
BC1_8	BC1
BC1_9	BC1
BC2_1	BC2
BC2_10	BC2
BC2_11	BC2
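
The population lookup matches the whole prefix before the underscore, so `WA1` in the merge table does not accidentally capture `WA10` to `WA13`. A small sketch of the same rule as a standalone function (the name `population_of` is hypothetical; the merge table is the `pop_dict` used above):

```python
def population_of(sample, merge=None):
    """Return the population for a sample named '<location>_<id>'."""
    merge = merge or {}
    location = sample.split("_")[0]
    # Pooled locations map to their merged name; all others keep their own code
    return merge.get(location, location)


pop_dict = {'CA2': 'CA23', 'CA3': 'CA23', 'WA1': 'WA19', 'WA9': 'WA19'}
for sample in ["BC1_1", "CA2_10", "CA3_4", "WA10_11"]:
    print(sample + "\t" + population_of(sample, pop_dict))
```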


### Filtering loci by excess heterozygosity and departures from Hardy-Weinberg

Here I filter out loci with excess heterozygosity in at least 2 populations based on Hardy-Weinberg equilibrium and a p-value cutoff of 0.1. The script takes a .vcf file and the .pop file just created as input. It is a slightly modified version of a script from [Jon Puritz's GitHub](https://github.com/jpuritz/dDocent/blob/master/scripts/filter_hwe_by_pop.pl), written by Chris Hollenbeck. My modified script is in my GitHub.

In [14]:
%%sh
mkdir het_m50x62_maf025
#Filtering out loci that have excess heterozygosity in at least 2 regions with a p-value cutoff of 0.1
../../Methods/Scripts/filter_hetexc_by_pop.pl -v OL-c80-66-s67-m50x62-maf025.recode.vcf -p OL-c80-66-s67.pop -h 0.1 -c 0.12 -o OL-c80-66-s67-m50x62-maf025-hetP
mv *.hwe het_m50x62_maf025/
mv *.hetexc het_m50x62_maf025/
rm *.inds

Processing population: BC1 (12 inds)
Processing population: BC2 (13 inds)
Processing population: BC3 (10 inds)
Processing population: BC4 (12 inds)
Processing population: CA1 (11 inds)
Processing population: CA23 (12 inds)
Processing population: CA4 (9 inds)
Processing population: CA5 (8 inds)
Processing population: CA6 (9 inds)
Processing population: CA7 (10 inds)
Processing population: OR1 (12 inds)
Processing population: OR2 (10 inds)
Processing population: OR3 (10 inds)
Processing population: WA10 (10 inds)
Processing population: WA11 (9 inds)
Processing population: WA12 (9 inds)
Processing population: WA13 (10 inds)
Processing population: WA19 (14 inds)
Outputting results of HWE test for filtered loci to 'filtered.hetexc'
Kept 34540 of a possible 34565 loci (filtered 25 loci)


In [16]:
%%sh
mkdir het_m70x62_maf025
#Filtering out loci that have excess heterozygosity in at least 2 regions with a p-value cutoff of 0.1
../../Methods/Scripts/filter_hetexc_by_pop.pl -v OL-c80-66-s67-m70x62-maf025.recode.vcf -p OL-c80-66-s67.pop \
-h 0.1 -c 0.12 -o OL-c80-66-s67-m70x62-maf025-hetP
mv *.hwe het_m70x62_maf025/
mv *.hetexc het_m70x62_maf025/
rm *.inds

Processing population: BC1 (12 inds)
Processing population: BC2 (13 inds)
Processing population: BC3 (10 inds)
Processing population: BC4 (12 inds)
Processing population: CA1 (11 inds)
Processing population: CA23 (12 inds)
Processing population: CA4 (9 inds)
Processing population: CA5 (8 inds)
Processing population: CA6 (9 inds)
Processing population: CA7 (10 inds)
Processing population: OR1 (12 inds)
Processing population: OR2 (10 inds)
Processing population: OR3 (10 inds)
Processing population: WA10 (10 inds)
Processing population: WA11 (9 inds)
Processing population: WA12 (9 inds)
Processing population: WA13 (10 inds)
Processing population: WA19 (14 inds)
Outputting results of HWE test for filtered loci to 'filtered.hetexc'
Kept 20892 of a possible 20910 loci (filtered 18 loci)
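
The decision rule described above ("p-value below the cutoff in at least 2 populations") can be sketched as follows. This is my own illustration of the stated criterion, not the logic of the Perl script, whose `-c` option is a proportion of populations and may be applied differently. The locus names and p-values are made up:

```python
def flag_loci(pvalues_by_locus, p_cutoff=0.1, min_pops=2):
    """Return loci whose test p-value is below p_cutoff in at least min_pops populations.

    pvalues_by_locus: {locus: {population: p-value}}
    """
    flagged = []
    for locus, pvals in sorted(pvalues_by_locus.items()):
        n_significant = sum(1 for p in pvals.values() if p < p_cutoff)
        if n_significant >= min_pops:
            flagged.append(locus)
    return flagged


# Hypothetical per-population p-values for excess heterozygosity
example = {
    "locus_A": {"BC1": 0.03, "BC2": 0.40, "CA1": 0.08},  # significant in 2 populations
    "locus_B": {"BC1": 0.50, "BC2": 0.09, "CA1": 0.75},  # significant in 1 population
    "locus_C": {"BC1": 0.90, "BC2": 0.60, "CA1": 0.33},  # never significant
}
print(flag_loci(example))
```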


The script only filters on a site-by-site basis. In order to throw out any loci that had a SNP with excess heterozygosity (as these may be paralogs), I make a file with the locus ids to then submit to vcftools.

In [17]:
#Make files of bad loci (that have at least one site with excess heterozygotes) to copy/paste in vcftools. 
IN = open('het_m50x62_maf025/exclude.hetexc', "r")
OUT = open('het_m50x62_maf025/badchrom.txt', "w")
exset = set()
for line in IN:
    chrom = line.split()[0]
    if chrom not in exset:
        exset.add(chrom)
        OUT.write(" --not-chr "+str(chrom))
OUT.close()
IN.close()
print "Maf 0.025, m50-x62: "+str(len(exset))

Maf 0.025, m50-x62: 8


In [23]:
#Make files of bad loci (that have at least one site with excess heterozygotes) to copy/paste in vcftools. 
IN = open('het_m70x62_maf025/exclude.hetexc', "r")
OUT = open('het_m70x62_maf025/badchrom.txt', "w")
exset = set()
for line in IN:
    chrom = line.split()[0]
    if chrom not in exset:
        exset.add(chrom)
        OUT.write(" --not-chr "+str(chrom))
OUT.close()
IN.close()
print "Maf 0.025, m70-x62: "+str(len(exset))

Maf 0.025, m70-x62: 5


Vcftools won't take a file of locus names to remove (which is annoying), so I copy and paste the locus names with a --not-chr call into vcftools.

In [22]:
%cat het_m50x62_maf025/badchrom.txt

 --not-chr locus_286414 --not-chr locus_22966 --not-chr locus_290641 --not-chr locus_371902 --not-chr locus_163220 --not-chr locus_343612 --not-chr locus_20481 --not-chr locus_296426
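
As an alternative to copying and pasting, the argument list could be assembled in Python and handed to `subprocess`. A minimal sketch, assuming the same vcftools flags used in this notebook; the helper names `not_chr_args` and `build_command` are hypothetical, and the command is only printed here rather than run:

```python
def not_chr_args(locus_ids):
    """Turn locus ids into a flat ['--not-chr', id, ...] list, skipping duplicates."""
    args = []
    seen = set()
    for locus in locus_ids:
        if locus not in seen:
            seen.add(locus)
            args.extend(["--not-chr", locus])
    return args


def build_command(vcf, out_prefix, locus_ids):
    """Assemble a vcftools call that drops the given loci and keeps biallelic sites."""
    return (["vcftools", "--vcf", vcf, "--recode", "--recode-INFO-all"]
            + not_chr_args(locus_ids)
            + ["--max-alleles", "2", "--min-alleles", "2", "--out", out_prefix])


cmd = build_command(
    "OL-c80-66-s67-m50x62-maf025-hetP.recode.vcf",
    "OL-c80-66-s67-m50x62-maf025-hetPbi",
    ["locus_286414", "locus_22966", "locus_286414"],
)
print(" ".join(cmd))
# To run it: import subprocess; subprocess.check_call(cmd)
```

Passing the command as a list avoids shell quoting problems when the list of loci gets long.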

In [23]:
%%sh
vcftools --vcf OL-c80-66-s67-m50x62-maf025-hetP.recode.vcf --recode --recode-INFO-all  --not-chr locus_290641 --not-chr locus_286414 --not-chr locus_343612 --not-chr locus_371902 --not-chr locus_20481 --not-chr locus_22966 --not-chr locus_163220 --not-chr locus_296426 \
--max-alleles 2 --min-alleles 2 --out OL-c80-66-s67-m50x62-maf025-hetPbi


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf OL-c80-66-s67-m50x62-maf025-hetP.recode.vcf
	--not-chr locus_163220
	--not-chr locus_20481
	--not-chr locus_22966
	--not-chr locus_286414
	--not-chr locus_290641
	--not-chr locus_296426
	--not-chr locus_343612
	--not-chr locus_371902
	--recode-INFO-all
	--max-alleles 2
	--min-alleles 2
	--out OL-c80-66-s67-m50x62-maf025-hetPbi
	--recode

After filtering, kept 137 out of 137 Individuals
Outputting VCF file...
After filtering, kept 34117 out of a possible 34540 Sites
Run Time = 4.00 seconds


In [24]:
%cat het_m70x62_maf025/badchrom.txt

 --not-chr locus_371902 --not-chr locus_20481 --not-chr locus_286414 --not-chr locus_343612 --not-chr locus_163220

In [25]:
%%sh
vcftools --vcf OL-c80-66-s67-m70x62-maf025-hetP.recode.vcf --recode --recode-INFO-all  --not-chr locus_371902 --not-chr locus_20481 --not-chr locus_286414 --not-chr locus_343612 --not-chr locus_163220 \
--max-alleles 2 --min-alleles 2 --out OL-c80-66-s67-m70x62-maf025-hetPbi


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf OL-c80-66-s67-m70x62-maf025-hetP.recode.vcf
	--not-chr locus_163220
	--not-chr locus_20481
	--not-chr locus_286414
	--not-chr locus_343612
	--not-chr locus_371902
	--recode-INFO-all
	--max-alleles 2
	--min-alleles 2
	--out OL-c80-66-s67-m70x62-maf025-hetPbi
	--recode

After filtering, kept 137 out of 137 Individuals
Outputting VCF file...
After filtering, kept 20660 out of a possible 20892 Sites
Run Time = 3.00 seconds


Filter out loci that depart from Hardy-Weinberg equilibrium in at least 2 populations, with a p-value cutoff of 0.025

In [27]:
%%sh
mkdir hwe_m50x62_maf025
#Filtering out loci that depart from HWE in at least 2 populations with a p-value cutoff of 0.025
../../Methods/Scripts/filter_hwe_by_pop.pl -v OL-c80-66-s67-m50x62-maf025-hetPbi.recode.vcf -p OL-c80-66-s67.pop -h 0.025 -c 0.12 -o OL-c80-66-s67-m50x62-maf025-hetPhwPbi
mv *.hwe hwe_m50x62_maf025/
rm *.inds

Processing population: BC1 (12 inds)
Processing population: BC2 (13 inds)
Processing population: BC3 (10 inds)
Processing population: BC4 (12 inds)
Processing population: CA1 (11 inds)
Processing population: CA23 (12 inds)
Processing population: CA4 (9 inds)
Processing population: CA5 (8 inds)
Processing population: CA6 (9 inds)
Processing population: CA7 (10 inds)
Processing population: OR1 (12 inds)
Processing population: OR2 (10 inds)
Processing population: OR3 (10 inds)
Processing population: WA10 (10 inds)
Processing population: WA11 (9 inds)
Processing population: WA12 (9 inds)
Processing population: WA13 (10 inds)
Processing population: WA19 (14 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 34072 of a possible 34117 loci (filtered 45 loci)


In [29]:
%%sh
mkdir hwe_m70x62_maf025
#Filtering out loci that depart from HWE in at least 2 populations with a p-value cutoff of 0.025
../../Methods/Scripts/filter_hwe_by_pop.pl -v OL-c80-66-s67-m70x62-maf025-hetPbi.recode.vcf -p OL-c80-66-s67.pop -h 0.025 -c 0.12 -o OL-c80-66-s67-m70x62-maf025-hetPhwPbi
mv *.hwe hwe_m70x62_maf025/
rm *.inds

Processing population: BC1 (12 inds)
Processing population: BC2 (13 inds)
Processing population: BC3 (10 inds)
Processing population: BC4 (12 inds)
Processing population: CA1 (11 inds)
Processing population: CA23 (12 inds)
Processing population: CA4 (9 inds)
Processing population: CA5 (8 inds)
Processing population: CA6 (9 inds)
Processing population: CA7 (10 inds)
Processing population: OR1 (12 inds)
Processing population: OR2 (10 inds)
Processing population: OR3 (10 inds)
Processing population: WA10 (10 inds)
Processing population: WA11 (9 inds)
Processing population: WA12 (9 inds)
Processing population: WA13 (10 inds)
Processing population: WA19 (14 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 20627 of a possible 20660 loci (filtered 33 loci)


### Subset one SNP per GBS locus

In [30]:
## Code to subset one SNP per GBS locus from a VCF file. Chooses the SNP
## with the highest sample coverage (NS). If there is a tie, chooses the 1st SNP in the locus (may change to random).
## May be specific to the VCF format output by ipyrad.
## This is also in script format in Github as subsetSNPs.py

import linecache

def subsetSNPs(inputfile,outputfile):
    # locus name -> [highest NS seen so far, line number of that SNP]
    locidict = {}
    IN = open(inputfile, "r")
    OUT = open(outputfile, "w")

    # n is the 1-based line number of the current line, header lines included
    n = 1
    for line in IN:
        if not line.startswith("#"):
            linelist = line.split()
            loci = linelist[0]
            # Index 7 is the INFO column; this assumes NS is its first field
            NS = float(linelist[7].split(";")[0].split("=")[1])
            # Strict '<' keeps the first SNP in the locus when NS values tie
            if loci not in locidict or locidict[loci][0] < NS:
                locidict[loci] = [NS,n]
        else:
            # Copy header lines through unchanged
            OUT.write(line)
        n += 1
    IN.close()
    # NOTE: n counts every line in the file (header included) and starts at 1, so
    # "Total SNPS" overstates the SNP count by the number of header lines plus one.
    print("Total SNPS: "+str(n)+"\nUnlinked SNPs: "+str(len(locidict)))

    # Write the chosen SNP for each locus, in sorted order of locus name
    for locus in sorted(locidict):
        line = linecache.getline(inputfile, locidict[locus][1])
        OUT.write(line)
    OUT.close()


In [32]:
infile = "OL-c80-66-s67-m50x62-maf025-hetPhwPbi.recode.vcf"
outfile = "Inputs/OL-c80-66-s67-m50x62-maf025-u.vcf"
subsetSNPs(infile,outfile)

Total SNPS: 34084
Unlinked SNPs: 13834


In [33]:
infile = "OL-c80-66-s67-m70x62-maf025-hetPhwPbi.recode.vcf"
outfile = "Inputs/OL-c80-66-s67-m70x62-maf025-u.vcf"
subsetSNPs(infile,outfile)

Total SNPS: 20639
Unlinked SNPs: 9170
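
The function above assumes `NS` is the first field of the INFO column. A slightly more defensive sketch looks the value up by key and applies the same "highest NS, first SNP wins ties" rule to in-memory records. The helper names and the example records are hypothetical, for illustration only:

```python
def info_value(info, key):
    """Return the value of `key` in a VCF INFO string such as 'NS=120;DP=610'."""
    for field in info.split(";"):
        if field.startswith(key + "="):
            return field.split("=", 1)[1]
    return None


def best_snp_per_locus(records):
    """records: iterable of (locus, position, info). Returns {locus: position}."""
    best = {}
    for locus, pos, info in records:
        ns = float(info_value(info, "NS"))
        # Strictly greater, so the first SNP is kept when NS values tie
        if locus not in best or ns > best[locus][0]:
            best[locus] = (ns, pos)
    return {locus: best[locus][1] for locus in best}


# Made-up records: locus_1 has a tie at NS=120, locus_2 lists DP before NS
records = [
    ("locus_1", 10, "NS=100;DP=500"),
    ("locus_1", 25, "NS=120;DP=610"),
    ("locus_1", 40, "NS=120;DP=600"),
    ("locus_2", 5, "DP=30;NS=12"),
]
print(best_snp_per_locus(records))
```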


These .vcf files are then converted to Structure format files using PGD Spider so that they can be loaded into Adegenet in R.