# Genomic analysis of a parasite invasion: colonization of the New World by the blood fluke, Schistosoma mansoni 

Roy Nelson Platt II*, Frédéric D. Chevalier*, Winka Le Clec'h, Marina McDew-White, Philip T. LoVerde, Rafael Ramiro de Assis, Guilherme Oliveira, Safari Kinunghi, Anouk Gouvras, Bonnie Webster, Joanne Webster, Aidan Emery, David Rollinson, Timothy J. Anderson

# Genotype and filter SNVs

Use the `sch_man_nwinvasion-gatk` conda environment.

In [5]:
import os
import shutil
import vcf
import re
import gzip
import pandas as pd

from IPython.display import Image
from Bio import SeqIO


import rpy2.ipython


%load_ext rpy2.ipython

os.chdir("/master/nplatt/sch_man_nwinvasion")



### Genotyping

In [None]:
%%bash

#run snakemake
snakemake \
    --printshellcmds \
    --use-conda \
    --cluster 'qsub -V -cwd -S /bin/bash -pe smp {threads} -o {log}.log -j y' \
    --jobs 1000 \
    --latency-wait 200 \
    --keep-going

### Filtering

Multiple rounds of filtering were used to remove low-quality sites and poorly genotyped sites and individuals.

First, genotypes with low depth (DP < 12) or low genotype quality (GQ < 25) were set to missing, and only bi-allelic sites were retained.

In [9]:
%%bash

vcftools \
    --vcf results/variant_filtration/hard_filtered.vcf \
    --minDP 12 \
    --minGQ 25 \
    --min-alleles 2 \
    --max-alleles 2 \
    --recode \
    --recode-INFO-all \
    --stdout \
    >results/variant_filtration/gq_dp.vcf


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf results/variant_filtration/hard_filtered.vcf
	--recode-INFO-all
	--max-alleles 2
	--min-alleles 2
	--minDP 12
	--minGQ 25
	--recode
	--stdout

After filtering, kept 172 out of 172 Individuals
Outputting VCF file...
After filtering, kept 671303 out of a possible 703368 Sites
Run Time = 1100.00 seconds
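
To make the logic of this first filter concrete, here is a minimal Python sketch (not the VCFtools implementation). The function names `mask_genotype` and `is_biallelic` and the example calls are hypothetical, and treating a missing DP or GQ as a failed genotype is an assumption. The thresholds are the ones passed to `--minDP` and `--minGQ` above.

```python
def mask_genotype(gt, dp, gq, min_dp=12, min_gq=25):
    """Return the genotype, or './.' if depth or quality is below threshold."""
    # Assumption: a genotype with no DP or GQ value is treated as failing.
    if dp is None or gq is None or dp < min_dp or gq < min_gq:
        return "./."
    return gt


def is_biallelic(alt_field):
    """A site is bi-allelic when ALT lists exactly one allele (REF + 1 ALT)."""
    return alt_field != "." and len(alt_field.split(",")) == 1


# Made-up (GT, DP, GQ) calls for illustration only
calls = [("0/1", 30, 99), ("1/1", 8, 40), ("0/0", 20, 10)]
masked = [mask_genotype(gt, dp, gq) for gt, dp, gq in calls]
print(masked)
print(is_biallelic("T"), is_biallelic("T,G"))
```

Note that the depth and quality thresholds act on individual genotypes, while the allele-count check acts on whole sites.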


Next, sites with missing genotypes in 50% or more of individuals were removed.

In [14]:
%%bash

#list sites with lt 50% missing data (F_MISS, column 6); these are kept
vcftools \
    --vcf results/variant_filtration/gq_dp.vcf \
    --missing-site \
    --stdout \
    >results/variant_filtration/gt_rate_per_site.tbl

awk '{if ($6<0.5) print $1"\t"$2}' \
    results/variant_filtration/gt_rate_per_site.tbl \
    >results/variant_filtration/gt_rate_ge_50p.list

vcftools \
    --vcf results/variant_filtration/gq_dp.vcf \
    --positions results/variant_filtration/gt_rate_ge_50p.list \
    --recode \
    --recode-INFO-all \
    --stdout \
    >results/variant_filtration/gt_rate_ge_50p.vcf


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf results/variant_filtration/gq_dp.vcf
	--recode-INFO-all
	--positions results/variant_filtration/gt_rate_ge_50p.list
	--recode
	--stdout

After filtering, kept 172 out of 172 Individuals
Outputting VCF file...
After filtering, kept 631588 out of a possible 671303 Sites
Run Time = 857.00 seconds
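
The `awk` step above can be sketched in Python as follows. This is only an illustration: the table rows are made up, the function name `sites_to_keep` is hypothetical, and the column layout is assumed to match the `--missing-site` output (with `F_MISS` as the sixth column, as the `awk` command implies).

```python
import csv
import io

# Made-up excerpt in the column layout assumed for `vcftools --missing-site`
table = (
    "CHR\tPOS\tN_DATA\tN_GENOTYPE_FILTERED\tN_MISS\tF_MISS\n"
    "SM_V7_1\t1001\t344\t0\t20\t0.0581395\n"
    "SM_V7_1\t2002\t344\t0\t172\t0.5\n"
    "SM_V7_2\t3003\t344\t0\t200\t0.581395\n"
)


def sites_to_keep(handle, max_missing=0.5):
    """Return (chrom, pos) for sites whose missing fraction is below max_missing."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["CHR"], int(row["POS"]))
        for row in reader
        if float(row["F_MISS"]) < max_missing
    ]


kept = sites_to_keep(io.StringIO(table))
print(kept)
```

The comparison is strict, so a site with exactly 50% missing data is dropped, matching `$6<0.5`.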


Individuals that were genotyped at less than 50% of sites were removed.

In [15]:
%%bash

#find indivs with gt 50% missing data (F_MISS, column 5)
vcftools \
    --vcf results/variant_filtration/gt_rate_ge_50p.vcf \
    --missing-indv \
    --stdout \
    >results/variant_filtration/indiv_gt_rate.tbl

cat results/variant_filtration/indiv_gt_rate.tbl \
    | awk '$5>0.50 {print $1}' \
    | sed 1d \
    >results/variant_filtration/indiv_gt_rate_lt_50p.list

vcftools \
    --vcf results/variant_filtration/gt_rate_ge_50p.vcf \
    --remove results/variant_filtration/indiv_gt_rate_lt_50p.list \
    --recode \
    --recode-INFO-all \
    --stdout \
    >results/variant_filtration/50p_site_50p_indiv_filtered.vcf


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf results/variant_filtration/gt_rate_ge_50p.vcf
	--missing-indv
	--stdout

After filtering, kept 172 out of 172 Individuals
Outputting Individual Missingness
After filtering, kept 631588 out of a possible 631588 Sites
Run Time = 180.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf results/variant_filtration/gt_rate_ge_50p.vcf
	--remove results/variant_filtration/indiv_gt_rate_lt_50p.list
	--recode-INFO-all
	--recode
	--stdout

Excluding individuals in 'exclude' list
After filtering, kept 156 out of 172 Individuals
Outputting VCF file...
After filtering, kept 631588 out of a possible 631588 Sites
Run Time = 834.00 seconds
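
As a rough illustration of the per-individual filter, the sketch below computes the fraction of missing genotypes for each sample and flags those above 50%. The sample names, genotypes, and the function name `missing_fraction` are hypothetical; the real pipeline reads `F_MISS` from the `--missing-indv` table instead.

```python
MISSING = {"./.", ".|.", "."}


def missing_fraction(genotypes):
    """Fraction of genotype calls that are missing for one individual."""
    return sum(1 for gt in genotypes if gt in MISSING) / len(genotypes)


# Made-up genotypes at four sites for three hypothetical samples
samples = {
    "sample_A": ["0/1", "0/0", "./.", "1/1"],  # 25% missing
    "sample_B": ["./.", "./.", "./.", "0/1"],  # 75% missing
    "sample_C": ["./.", "./.", "0/0", "0/1"],  # exactly 50% missing
}

# Same rule as the awk filter: remove only when the fraction exceeds 0.50
to_remove = [name for name, gts in samples.items() if missing_fraction(gts) > 0.50]
print(to_remove)
```

In the run above, this rule removed 16 of the 172 individuals, leaving 156.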


An ID of the form `chrom:pos` was added to each site that lacked one.

In [16]:
%%bash

bcftools annotate \
    --set-id +'%CHROM\:%POS' \
    results/variant_filtration/50p_site_50p_indiv_filtered.vcf \
    >results/variant_filtration/50p_site_50p_indiv_filtered_annotated.vcf
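
The effect of the `+` prefix in `--set-id` (only fill in IDs that are missing) can be sketched as below. The function name `set_id_if_missing` and the example values are hypothetical.

```python
def set_id_if_missing(chrom, pos, current_id):
    """Build a 'chrom:pos' ID, but leave any existing ID untouched."""
    if current_id in (".", ""):
        return f"{chrom}:{pos}"
    return current_id


print(set_id_if_missing("SM_V7_1", 1001, "."))
print(set_id_if_missing("SM_V7_1", 1001, "existing_id"))
```

These IDs matter later because the LD pruning step refers to sites by ID.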

To make processing easier, I reorganized the individuals into "geographic" order rather than the alphabetical order of sample names (the `re-organized_header.vcf` file was generated manually).

In [17]:
%%bash

grep "#" results/variant_filtration/50p_site_50p_indiv_filtered_annotated.vcf >results/variant_filtration/header.vcf

In [18]:
%%bash

#manually re-arrange samples into desired order and save as results/variant_filtration/re-organized_header.vcf
gzip results/variant_filtration/50p_site_50p_indiv_filtered_annotated.vcf
gzip results/variant_filtration/re-organized_header.vcf

vcf-shuffle-cols -t results/variant_filtration/re-organized_header.vcf.gz \
    results/variant_filtration/50p_site_50p_indiv_filtered_annotated.vcf.gz \
    >results/variant_filtration/smv7_ex_snps.vcf
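
Conceptually, reordering sample columns against a template header works like the sketch below. This is a simplified illustration, not the `vcf-shuffle-cols` code; the sample names, the record, and the function name `reorder_samples` are hypothetical.

```python
FIXED_COLUMNS = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def reorder_samples(header_fields, record_fields, template_order):
    """Reorder the sample columns of one VCF record to follow template_order."""
    index = {
        name: i
        for i, name in enumerate(header_fields[FIXED_COLUMNS:], start=FIXED_COLUMNS)
    }
    reordered = [record_fields[index[name]] for name in template_order]
    return record_fields[:FIXED_COLUMNS] + reordered


header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
          "s_c", "s_a", "s_b"]
record = ["SM_V7_1", "1001", "SM_V7_1:1001", "A", "T", ".", "PASS", ".", "GT",
          "0/0", "0/1", "1/1"]

new_record = reorder_samples(header, record, ["s_a", "s_b", "s_c"])
print(new_record[FIXED_COLUMNS:])
```

Looking samples up by name rather than by position means the genotypes stay attached to the right individual regardless of the original column order.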

In [None]:
%%bash

#get gt rate of final filtered dataset
vcftools \
    --vcf results/variant_filtration/smv7_ex_snps.vcf \
    --missing-indv \
    --stdout \
    >results/variant_filtration/gt_rate_per_indiv_at_filtered_sites.tbl
    
vcftools \
    --vcf results/variant_filtration/smv7_ex_snps.vcf \
    --missing-site \
    --stdout \
    >results/variant_filtration/gt_rate_per_site_at_filtered_sites.tbl

The filtered file was split into autosomal, sex chromosome (ZW), mitochondrial, and other (all remaining contigs) variants.

In [23]:
%%bash

vcftools \
    --vcf results/variant_filtration/smv7_ex_snps.vcf \
    --chr SM_V7_1 \
    --chr SM_V7_2 \
    --chr SM_V7_3 \
    --chr SM_V7_4 \
    --chr SM_V7_5 \
    --chr SM_V7_6 \
    --chr SM_V7_7 \
    --recode \
    --recode-INFO-all \
    --stdout \
    >results/variant_filtration/smv7_ex_autosomes.vcf
    
vcftools \
    --vcf results/variant_filtration/smv7_ex_snps.vcf \
    --chr SM_V7_ZW \
    --recode \
    --recode-INFO-all \
    --stdout \
    >results/variant_filtration/smv7_ex_zw.vcf
    
vcftools \
    --vcf results/variant_filtration/smv7_ex_snps.vcf \
    --chr SM_V7_MITO \
    --recode \
    --recode-INFO-all \
    --stdout \
    >results/variant_filtration/smv7_ex_mito.vcf
    
vcftools \
    --vcf results/variant_filtration/smv7_ex_snps.vcf \
    --not-chr SM_V7_1 \
    --not-chr SM_V7_2 \
    --not-chr SM_V7_3 \
    --not-chr SM_V7_4 \
    --not-chr SM_V7_5 \
    --not-chr SM_V7_6 \
    --not-chr SM_V7_7 \
    --not-chr SM_V7_MITO \
    --not-chr SM_V7_ZW \
    --recode \
    --recode-INFO-all \
    --stdout \
    >results/variant_filtration/smv7_ex_other.vcf


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf results/variant_filtration/smv7_ex_snps.vcf
	--chr SM_V7_1
	--chr SM_V7_2
	--chr SM_V7_3
	--chr SM_V7_4
	--chr SM_V7_5
	--chr SM_V7_6
	--chr SM_V7_7
	--recode-INFO-all
	--recode
	--stdout

After filtering, kept 156 out of 156 Individuals
Outputting VCF file...
After filtering, kept 475081 out of a possible 631588 Sites
Run Time = 657.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf results/variant_filtration/smv7_ex_snps.vcf
	--chr SM_V7_ZW
	--recode-INFO-all
	--recode
	--stdout

After filtering, kept 156 out of 156 Individuals
Outputting VCF file...
After filtering, kept 155410 out of a possible 631588 Sites
Run Time = 205.00 seconds

VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf results/variant_filtration/smv7_ex_snps.vcf
	--chr SM_V7_MITO
	--recode-INFO-all
	--recode
	--stdout

After
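
The four-way split can be thought of as assigning each contig to exactly one category. A minimal sketch follows, assuming the contig names used in the commands above; the function name `classify` and the scaffold name `unplaced_scaffold_1` are hypothetical.

```python
AUTOSOMES = {f"SM_V7_{i}" for i in range(1, 8)}  # SM_V7_1 through SM_V7_7


def classify(chrom):
    """Assign a contig name to one of the four output files."""
    if chrom in AUTOSOMES:
        return "autosomes"
    if chrom == "SM_V7_ZW":
        return "zw"
    if chrom == "SM_V7_MITO":
        return "mito"
    return "other"


for name in ["SM_V7_3", "SM_V7_ZW", "SM_V7_MITO", "unplaced_scaffold_1"]:
    print(name, classify(name))
```

Because every contig falls into exactly one category, the site counts of the four files should sum to the 631588 sites in the input.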

Autosomal sites were then filtered based on linkage disequilibrium (LD).

In [24]:
%%bash

#identify SNPs in LD: 250 kb window, step of 1 variant, r^2 threshold of 0.20
plink \
    --vcf results/variant_filtration/smv7_ex_autosomes.vcf \
    --double-id \
    --allow-extra-chr \
    --indep-pairwise 250kb 1 0.20 \
    --out results/variant_filtration/smv7_ex_autosomes_ld

#remove the pruned SNPs (prune.out), keeping SNPs in approximate linkage equilibrium
vcftools \
    --vcf results/variant_filtration/smv7_ex_autosomes.vcf \
    --exclude results/variant_filtration/smv7_ex_autosomes_ld.prune.out \
    --recode \
    --recode-INFO-all \
    --stdout \
    >results/variant_filtration/smv7_ex_autosomes_ld.vcf

PLINK v1.90b4 64-bit (20 Mar 2017)             www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to results/variant_filtration/smv7_ex_autosomes_ld.log.
Options in effect:
  --allow-extra-chr
  --double-id
  --indep-pairwise 250kb 1 0.20
  --out results/variant_filtration/smv7_ex_autosomes_ld
  --vcf results/variant_filtration/smv7_ex_autosomes.vcf

24158 MB RAM detected; reserving 12079 MB for main workspace.
--vcf: results/variant_filtration/smv7_ex_autosomes_ld-temporary.bed +
results/variant_filtration/smv7_ex_autosomes_ld-temporary.bim +
results/variant_filtration/smv7_ex_autosomes_ld-temporary.fam written.
475081 variants loaded from .bim file.
156 people (0 males, 0 females, 156 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
results/variant_filtration/smv7_ex_autosomes_ld.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 156 founders and 0 nonfounders presen


VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf results/variant_filtration/smv7_ex_autosomes.vcf
	--recode-INFO-all
	--recode
	--exclude results/variant_filtration/smv7_ex_autosomes_ld.prune.out
	--stdout

After filtering, kept 156 out of 156 Individuals
Outputting VCF file...
After filtering, kept 38197 out of a possible 475081 Sites
Run Time = 43.00 seconds
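
The idea behind pairwise LD pruning can be illustrated with a simplified greedy sketch. This is not PLINK's algorithm (PLINK slides a window along the chromosome and applies its own rule for which SNP of a correlated pair to drop). The function names, positions, and genotype dosages below are hypothetical; only the 250 kb window and the r^2 threshold of 0.20 come from the command above.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic site: correlation is undefined, treat as 0
    return cov * cov / (var_x * var_y)


def greedy_prune(snps, window_bp=250_000, max_r2=0.20):
    """Keep a SNP only if no already-kept SNP within the window exceeds max_r2.

    snps: list of (position, dosages) sorted by position on one chromosome.
    """
    kept = []
    for pos, dosages in snps:
        in_ld = any(
            pos - kept_pos <= window_bp and r_squared(dosages, kept_dosages) > max_r2
            for kept_pos, kept_dosages in kept
        )
        if not in_ld:
            kept.append((pos, dosages))
    return [pos for pos, _ in kept]


# Made-up dosages (0, 1, or 2 alternate alleles) for six individuals
snps = [
    (1_000, [0, 1, 2, 0, 1, 2]),
    (5_000, [0, 1, 2, 0, 1, 2]),    # identical to the first SNP: pruned
    (9_000, [1, 0, 1, 2, 0, 2]),    # uncorrelated with the first SNP: kept
    (400_000, [0, 1, 2, 0, 1, 2]),  # correlated, but outside the window: kept
]
print(greedy_prune(snps))
```

In the actual run, pruning reduced the autosomal set from 475081 to 38197 sites.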
