# Filtering .vcf file from ipyrad

This notebook details the secondary filtering I do on a .vcf file that comes directly from ipyrad. I use this notebook to:  
1) filter out individuals with more than a certain threshold of missing data,  
2) filter out loci missing in a certain percentage of samples (note: ipyrad does this on a locus basis, but with indels and Ns there could still be sites that are missing in a lot of samples),  
3) filter for a certain minor allele frequency,  
4) filter out loci with excess heterozygosity within populations based on Hardy-Weinberg equilibrium \*  
5) filter out loci significantly out of H-W equilibrium within populations,  
6) filter for only biallelic SNPs,  
7) use Python code to select 1 SNP per GBS locus for analyses like PCA that require an "unlinked" dataset


\*Note: I set the *max_shared_Hs_locus* parameter in ipyrad to 1.0 so it does not filter for excess heterozygotes across samples. 

In [4]:
%%sh
date "+%D"

09/06/18


In [5]:
pwd

u'/home/ksilliman/Projects/Phylo_Ostrea/Analysis'

## x45m75

Use VCFtools to keep only biallelic (polymorphic) sites that are genotyped in at least 75% of individuals (--max-missing 0.75). This is the full SNP dataset, before filtering for minor allele frequency and Hardy-Weinberg equilibrium.

In [11]:
%%sh
suffix=Making_Files/OL-c85-t10-x45m75
vcftools --vcf Assembly/OL-s7filt45-c85-t10-pops_outfiles/OL-s7filt45-c85-t10-pops.vcf \
--recode --recode-INFO-all --min-alleles 2 --max-alleles 2 --max-missing 0.75 \
--out ${suffix}


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf Assembly/OL-s7filt45-c85-t10-pops_outfiles/OL-s7filt45-c85-t10-pops.vcf
	--recode-INFO-all
	--max-alleles 2
	--min-alleles 2
	--max-missing 0.75
	--out Making_Files/OL-c85-t10-x45m75
	--recode

After filtering, kept 117 out of 117 Individuals
Outputting VCF file...
After filtering, kept 42081 out of a possible 58814 Sites
Run Time = 4.00 seconds


In [12]:
%%sh
suffix=Making_Files/OL-c85-t10-x45m75
vcftools --vcf $suffix.recode.vcf --missing-indv --out $suffix


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf Making_Files/OL-c85-t10-x45m75.recode.vcf
	--missing-indv
	--out Making_Files/OL-c85-t10-x45m75

After filtering, kept 117 out of 117 Individuals
Outputting Individual Missingness
After filtering, kept 42081 out of a possible 42081 Sites
Run Time = 0.00 seconds
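
The .imiss file written by `--missing-indv` is a tab-separated table with a header line and one row per individual, with the fraction of missing genotypes in the `F_MISS` column. As a minimal sketch (Python 3; the function name, the toy sample names, and the 0.45 threshold are placeholders I made up for illustration, not values taken from this analysis), individuals above a missing-data threshold could be listed like this:

```python
import csv
import io

def high_missing_individuals(imiss_handle, max_fmiss):
    """Return the sample names whose F_MISS value exceeds max_fmiss."""
    reader = csv.DictReader(imiss_handle, delimiter="\t")
    return [row["INDV"] for row in reader if float(row["F_MISS"]) > max_fmiss]

# Tiny made-up table in the .imiss layout, for illustration only
example = io.StringIO(
    "INDV\tN_DATA\tN_GENOTYPES_FILTERED\tN_MISS\tF_MISS\n"
    "popA_1\t100\t0\t10\t0.10\n"
    "popA_2\t100\t0\t60\t0.60\n"
)
print(high_missing_individuals(example, 0.45))  # ['popA_2']
```

Written one name per line, such a list could then be handed to the VCFtools `--remove` option to drop those individuals.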


Write files with the sample name and either population (.pop) or all strata info (.strata). These are used for the heterozygosity filtering downstream and to convert the .vcf file to a .str file in PGD Spider. This step depends on each sample name starting with its population, followed by an underscore. It uses the .imiss file created earlier, so it still lists some samples that have been filtered out. This is not a problem for downstream analyses.
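
The cell that writes these files is not shown here, so the following is only a rough sketch of the .pop step (Python 3). It assumes the population is everything before the first underscore in the sample name and that the .pop file is a two-column, tab-separated list of sample and population. The function name and the toy sample names are hypothetical, and the real naming scheme may need a different split.

```python
import io

def write_popmap(imiss_handle, pop_handle):
    """Write 'sample<TAB>population' lines; return the number of samples written."""
    next(imiss_handle)  # skip the .imiss header line (INDV, N_DATA, ...)
    n = 0
    for line in imiss_handle:
        if not line.strip():
            continue
        sample = line.split()[0]
        # Assumption: the text before the first underscore is the population
        pop = sample.split("_")[0]
        pop_handle.write(sample + "\t" + pop + "\n")
        n += 1
    return n

# Made-up .imiss content, for illustration only
imiss = io.StringIO(
    "INDV\tN_DATA\tN_GENOTYPES_FILTERED\tN_MISS\tF_MISS\n"
    "Coos_1\t100\t0\t10\t0.10\n"
    "Tomales_3\t100\t0\t20\t0.20\n"
)
out = io.StringIO()
write_popmap(imiss, out)
print(out.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` handles on the .imiss and .pop paths.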

### Filtering loci by departures from Hardy-Weinberg

Here I filter out loci with excess heterozygosity in at least 2 populations, based on Hardy-Weinberg equilibrium and a p-value cutoff of 0.05. The script takes a .vcf file and the .pop file just created as input. It is a slightly modified version of a script from [Jon Puritz's GitHub](https://github.com/jpuritz/dDocent/blob/master/scripts/filter_hwe_by_pop.pl), written by Chris Hollenbeck. My modified script is in my GitHub.
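
To make the "at least 2 populations" rule concrete, here is a small sketch of how I understand the two thresholds passed to the script in this notebook (`-h 0.05 -c 0.09`). It assumes `-h` is the per-population p-value cutoff and `-c` is the largest fraction of populations allowed to fall below it. I have not reproduced the Perl script here, so treat the exact comparison as an assumption. With the 20 populations in this dataset, 0.09 x 20 = 1.8, so a site is dropped once 2 or more populations fail.

```python
def fails_hwe_filter(pvalues, p_cutoff=0.05, pop_fraction=0.09):
    """pvalues holds one HWE p-value per population for a single site.

    Returns True when the fraction of populations with p < p_cutoff
    is larger than pop_fraction (assumed meaning of -h and -c).
    """
    n_fail = sum(1 for p in pvalues if p < p_cutoff)
    return n_fail / float(len(pvalues)) > pop_fraction

# Made-up p-values for 20 populations
none_bad = [0.5] * 20
one_bad = [0.01] + [0.5] * 19
two_bad = [0.01, 0.02] + [0.5] * 18
print(fails_hwe_filter(none_bad), fails_hwe_filter(one_bad), fails_hwe_filter(two_bad))
# False False True
```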

In [12]:
%%sh
suffix=Making_Files/OL-c85-t10-x45m75
#Filtering out loci that depart HWE in at least 2 populations with a p-value cutoff of 0.05
../Methods/Scripts/filter_hwe_by_pop.pl -v $suffix.recode.vcf \
-p Making_Files/OL-c85-t10-x45.pop -h 0.05 -c 0.09 -o $suffix-hwPbi
rm *.inds
mv exclude.hwe Making_Files/
#Remove these if you don't want to inspect HWE results
rm *.hwe

Processing population: Barkeley_BC (5 inds)
Processing population: Coos_OR (6 inds)
Processing population: Discovery_WA (7 inds)
Processing population: Elkhorn_CA (6 inds)
Processing population: Humboldt_CA (6 inds)
Processing population: Klaskino_BC (8 inds)
Processing population: Ladysmith_BC (5 inds)
Processing population: Liberty_WA (6 inds)
Processing population: MuguLagoon_CA (9 inds)
Processing population: Netarts_OR (7 inds)
Processing population: NorthBay_WA (6 inds)
Processing population: NorthSanFran_CA (5 inds)
Processing population: NorthWillapa_WA (3 inds)
Processing population: SanDiego_CA (7 inds)
Processing population: SouthSanFran_CA (4 inds)
Processing population: SouthWillapa_WA (2 inds)
Processing population: Tomales_CA (6 inds)
Processing population: TritonCove_WA (6 inds)
Processing population: Victoria_BC (7 inds)
Processing population: Yaquina_OR (6 inds)
Outputting results of HWE test for filtered loci to 'filtered.hwe'
Kept 41895 of a possible 42081 loci (fil

The script only filters on a site-by-site basis. In order to throw out any loci that had a SNP with excess heterozygosity (as these may be paralogs), I make a file with the locus ids to then submit to vcftools.

In [13]:
#Make a file of bad loci (those with at least one site with excess heterozygotes) as --not-chr arguments for vcftools.
exset = set()
with open('Making_Files/exclude.hwe', "r") as IN, open('Making_Files/badchrom.txt', "w") as OUT:
    for line in IN:
        #The first whitespace-separated field is used as the locus (CHROM) name
        chrom = line.split()[0]
        if chrom not in exset:
            exset.add(chrom)
            OUT.write(" --not-chr "+str(chrom))
print("x45m75: "+str(len(exset)))

x45m75: 144


VCFtools won't take a file of locus names to remove (which is annoying), so I use cat to paste the --not-chr arguments from the file into the VCFtools command.
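
The same command line could also be assembled in Python instead of with cat. This is only a sketch: the helper name, the toy locus names, and `in.vcf` are hypothetical, and the command is printed rather than run.

```python
def not_chr_args(loci):
    """Return a flat ['--not-chr', name, ...] list with each locus listed once."""
    args = []
    seen = set()
    for locus in loci:
        if locus not in seen:
            seen.add(locus)
            args.extend(["--not-chr", locus])
    return args

# Hypothetical input file and locus names
cmd = ["vcftools", "--vcf", "in.vcf", "--recode"] + not_chr_args(["locus_1", "locus_2", "locus_1"])
print(" ".join(cmd))
# vcftools --vcf in.vcf --recode --not-chr locus_1 --not-chr locus_2
```

The resulting list could be passed to `subprocess.call(cmd)` if running VCFtools from Python is preferred.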

In [14]:
%%sh
#-filt are all SNPs, not filtering for minor allele frequency
value=`cat Making_Files/badchrom.txt`
suffix=OL-c85-t10-x45m75

vcftools --vcf Making_Files/$suffix-hwPbi.recode.vcf --recode \
--recode-INFO-all $value --max-alleles 2 --min-alleles 2 --out Inputs/$suffix-filt


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf Making_Files/OL-c85-t10-x45m75-hwPbi.recode.vcf
	--not-chr locus_101729
	--not-chr locus_105574
	--not-chr locus_105872
	--not-chr locus_108364
	--not-chr locus_11229
	--not-chr locus_11344
	--not-chr locus_122277
	--not-chr locus_123093
	--not-chr locus_124336
	--not-chr locus_126680
	--not-chr locus_13078
	--not-chr locus_135077
	--not-chr locus_139738
	--not-chr locus_145304
	--not-chr locus_147286
	--not-chr locus_152226
	--not-chr locus_156530
	--not-chr locus_156836
	--not-chr locus_157656
	--not-chr locus_164593
	--not-chr locus_168485
	--not-chr locus_17009
	--not-chr locus_17236
	--not-chr locus_173735
	--not-chr locus_17649
	--not-chr locus_183021
	--not-chr locus_183515
	--not-chr locus_183542
	--not-chr locus_184001
	--not-chr locus_185388
	--not-chr locus_185910
	--not-chr locus_186505
	--not-chr locus_186710
	--not-chr locus_19446
	--not-chr locus_194493
	--not-chr locus_194596

In [15]:
%%sh
#-maf025 are SNPs after filtering for minor allele frequency of 2.5%
value=`cat Making_Files/badchrom.txt`
suffix=OL-c85-t10-x45m75

vcftools --vcf Making_Files/$suffix-hwPbi.recode.vcf --recode --recode-INFO-all \
$value --max-alleles 2 --min-alleles 2 --maf 0.025 --out Inputs/$suffix-maf025


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf Making_Files/OL-c85-t10-x45m75-hwPbi.recode.vcf
	--not-chr locus_101729
	--not-chr locus_105574
	--not-chr locus_105872
	--not-chr locus_108364
	--not-chr locus_11229
	--not-chr locus_11344
	--not-chr locus_122277
	--not-chr locus_123093
	--not-chr locus_124336
	--not-chr locus_126680
	--not-chr locus_13078
	--not-chr locus_135077
	--not-chr locus_139738
	--not-chr locus_145304
	--not-chr locus_147286
	--not-chr locus_152226
	--not-chr locus_156530
	--not-chr locus_156836
	--not-chr locus_157656
	--not-chr locus_164593
	--not-chr locus_168485
	--not-chr locus_17009
	--not-chr locus_17236
	--not-chr locus_173735
	--not-chr locus_17649
	--not-chr locus_183021
	--not-chr locus_183515
	--not-chr locus_183542
	--not-chr locus_184001
	--not-chr locus_185388
	--not-chr locus_185910
	--not-chr locus_186505
	--not-chr locus_186710
	--not-chr locus_19446
	--not-chr locus_194493
	--not-chr locus_194596

## Subset one SNP per GBS locus

In [5]:
## Code to subset one SNP per GBS locus from a VCF file. Chooses the SNP
## with the highest sample coverage. If there is a tie, chooses the 1st SNP in the locus. (may change to random)
## May be specific to VCF format output from ipyrad.
## This is also in script format in Github as subsetSNPs.py

import linecache

def subsetSNPs(inputfile, outputfile):
    # locus name -> [NS of the best SNP so far, line number of that SNP]
    locidict = {}
    n = 1
    with open(inputfile, "r") as IN, open(outputfile, "w") as OUT:
        for line in IN:
            if line.startswith("#"):
                # Header lines are copied straight to the output
                OUT.write(line)
            else:
                linelist = line.split()
                loci = linelist[0]
                # Index 7 is the INFO column; assumes its first field is NS=<samples with data>
                NS = float(linelist[7].split(";")[0].split("=")[1])
                # Strict < keeps the first SNP in the locus when NS is tied
                if loci not in locidict or locidict[loci][0] < NS:
                    locidict[loci] = [NS, n]
            n += 1
        # NOTE: n is a line counter that starts at 1 and includes header lines, so
        # "Total SNPS" overstates the SNP count by the number of header lines plus 1.
        print("Total SNPS: "+str(n)+"\nUnlinked SNPs: "+str(len(locidict)))

        for locus in sorted(locidict):
            OUT.write(linecache.getline(inputfile, locidict[locus][1]))


In [3]:
#Total loci
infile = "Making_Files/OL-c85-t10-x45m75.recode.vcf"
outfile = "Making_Files/test.vcf"
subsetSNPs(infile,outfile)

Total SNPS: 42093
Unlinked SNPs: 9836


In [17]:
#No maf filtering
infile = "Inputs/OL-c85-t10-x45m80-filt.recode.vcf"
outfile = "Inputs/OL-c85t10-x45m80-u.vcf"
subsetSNPs(infile,outfile)
#Maf of 2.5%
infile = "Inputs/OL-c85-t10-x45m80-maf025.recode.vcf"
outfile = "Inputs/OL-c85t10-x45m80-maf025-u.vcf"
subsetSNPs(infile,outfile)

Total SNPS: 30374
Unlinked SNPs: 7203
Total SNPS: 9454
Unlinked SNPs: 4487


In [18]:
#No maf filtering
infile = "Inputs/OL-c85-t10-x45m75-filt.recode.vcf"
outfile = "Inputs/OL-c85t10-x45m75-u.vcf"
subsetSNPs(infile,outfile)
#Maf of 2.5%
infile = "Inputs/OL-c85-t10-x45m75-maf025.recode.vcf"
outfile = "Inputs/OL-c85t10-x45m75-maf025-u.vcf"
subsetSNPs(infile,outfile)

Total SNPS: 41181
Unlinked SNPs: 9692
Total SNPS: 13497
Unlinked SNPs: 6207
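
For checking the selection rule on a small example, here is an in-memory sketch of the same idea (Python 3). It is not the `subsetSNPs` function itself: the function name and the toy VCF lines are made up, and it assumes, like the code above, that the INFO column starts with `NS=`. Unlike the line counter `n` in `subsetSNPs`, the total here counts data lines only.

```python
def pick_best_snps(vcf_lines):
    """Return (number of SNPs, {locus: chosen data line}) using NS from INFO."""
    best = {}
    n_snps = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        n_snps += 1
        fields = line.split()
        locus = fields[0]
        # Assumes the INFO column (index 7) begins with NS=<samples with data>
        ns = float(fields[7].split(";")[0].split("=")[1])
        # Strict < keeps the first SNP in the locus on ties
        if locus not in best or best[locus][0] < ns:
            best[locus] = (ns, line)
    return n_snps, {k: v[1] for k, v in best.items()}

# Made-up VCF lines, for illustration only
toy = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "locus_1\t5\t.\tA\tG\t13\tPASS\tNS=90;DP=500",
    "locus_1\t9\t.\tC\tT\t13\tPASS\tNS=95;DP=510",
    "locus_2\t3\t.\tG\tA\t13\tPASS\tNS=80;DP=400",
    "locus_2\t7\t.\tT\tC\t13\tPASS\tNS=80;DP=410",
]
n, chosen = pick_best_snps(toy)
print(n, sorted((k, v.split()[1]) for k, v in chosen.items()))
# 4 [('locus_1', '9'), ('locus_2', '3')]
```

locus_1 keeps position 9 because it has the higher NS, and locus_2 keeps position 3 because the tie goes to the first SNP.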


## Making an outlier only .vcf
Once I have identified outlier loci, I use VCFtools to filter those SNPs.

### Intersect 2

In [2]:
#Make files of bad loci (that have at least one site with an outlier) to feed to vcftools. 
#IN = open('Outlier/x45m75maf025filt-pcaQ_OF_BS-isect2.snp', "r")
IN = open('Outlier/x45m75maf025filt-pcaQ_OF_BS-isectUnion.snp',"r")
INL = open('Outlier/x45m75maf025filt-pcaQ_OF_BS-isect2.loci',"r")
OUT = open('Making_Files/x45m75maf025filt-pcaQ_OF_BS-isect2.badchrom', "w")
#OUTg = open('Making_Files/x45m75maf025filt-pcaQ_OF_BS-isect2.goodchrom', "w")
OUTg2 = open('Making_Files/x45m75maf025filt-pcaQ_OF_BS-isect2.goodchrom2', "w")
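exset = set()
#Loci with at least one outlier SNP: write them as --not-chr arguments
for line in INL:
    chrom = line.strip()
    if chrom not in exset:
        exset.add(chrom)
        OUT.write(" --not-chr locus_"+str(chrom))
print(len(exset))
x = 0
#Outlier SNPs: write locus name and position, tab-separated, for --positions
for line in IN:
    chrom = line.strip().split("_")[1]
    snp = line.strip().split("_")[3]
    OUTg2.write("locus_"+chrom+"\t"+snp+"\n")
    x += 1
print(x)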
OUT.close()
#OUTg.close()
OUTg2.close()
IN.close()
INL.close()

97
168


In [4]:
%%sh
suffix=Inputs/OL-c85-t10-x45m75-maf025
value=`cat Making_Files/x45m75maf025filt-pcaQ_OF_BS-isect2.badchrom`
vcftools --vcf $suffix.recode.vcf --recode --recode-INFO-all $value \
--max-alleles 2 --min-alleles 2 --out $suffix-neutI2


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf Inputs/OL-c85-t10-x45m75-maf025.recode.vcf
	--not-chr locus_101035
	--not-chr locus_104611
	--not-chr locus_104742
	--not-chr locus_104832
	--not-chr locus_10670
	--not-chr locus_110252
	--not-chr locus_11196
	--not-chr locus_113647
	--not-chr locus_115948
	--not-chr locus_11905
	--not-chr locus_120275
	--not-chr locus_121489
	--not-chr locus_121492
	--not-chr locus_12991
	--not-chr locus_131325
	--not-chr locus_134388
	--not-chr locus_144194
	--not-chr locus_145020
	--not-chr locus_153852
	--not-chr locus_153863
	--not-chr locus_162959
	--not-chr locus_170867
	--not-chr locus_171395
	--not-chr locus_172232
	--not-chr locus_17308
	--not-chr locus_17359
	--not-chr locus_17888
	--not-chr locus_18220
	--not-chr locus_18437
	--not-chr locus_18554
	--not-chr locus_187013
	--not-chr locus_194092
	--not-chr locus_194810
	--not-chr locus_196263
	--not-chr locus_199936
	--not-chr locus_200167
	--not-

This filters out 253 SNPs and keeps 13,232 SNPs.

In [5]:
%%sh
suffix=Inputs/OL-c85-t10-x45m75-maf025
vcftools --vcf $suffix.recode.vcf --recode --recode-INFO-all --positions Making_Files/x45m75maf025filt-pcaQ_OF_BS-isect2.goodchrom \
--max-alleles 2 --min-alleles 2 --out $suffix-outI2


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf Inputs/OL-c85-t10-x45m75-maf025.recode.vcf
	--recode-INFO-all
	--max-alleles 2
	--min-alleles 2
	--out Inputs/OL-c85-t10-x45m75-maf025-outI2
	--positions Making_Files/x45m75maf025filt-pcaQ_OF_BS-isect2.goodchrom
	--recode

After filtering, kept 117 out of 117 Individuals
Outputting VCF file...
After filtering, kept 129 out of a possible 13485 Sites
Run Time = 0.00 seconds


In [3]:
%%sh
suffix=Inputs/OL-c85-t10-x45m75-maf025
vcftools --vcf $suffix.recode.vcf --recode --recode-INFO-all --positions Making_Files/x45m75maf025filt-pcaQ_OF_BS-isect2.goodchrom2 \
--max-alleles 2 --min-alleles 2 --out $suffix-outI2Union


VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf Inputs/OL-c85-t10-x45m75-maf025.recode.vcf
	--recode-INFO-all
	--max-alleles 2
	--min-alleles 2
	--out Inputs/OL-c85-t10-x45m75-maf025-outI2Union
	--positions Making_Files/x45m75maf025filt-pcaQ_OF_BS-isect2.goodchrom2
	--recode

After filtering, kept 117 out of 117 Individuals
Outputting VCF file...
After filtering, kept 168 out of a possible 13485 Sites
Run Time = 0.00 seconds


In [8]:
infile = "Inputs/OL-c85-t10-x45m75-maf025-neutI2.recode.vcf"
outfile = "Inputs/OL-c85t10-x45m75-maf025-neutI2-u.vcf"
subsetSNPs(infile,outfile)

Total SNPS: 13244
Unlinked SNPs: 6110


In [6]:
infile = "Inputs/OL-c85-t10-x45m75-maf025-outI2Union.recode.vcf"
outfile = "Inputs/OL-c85t10-x45m75-maf025-outI2Union-u.vcf"
subsetSNPs(infile,outfile)

Total SNPS: 180
Unlinked SNPs: 97
