In a Python 2 environment, follow the [TWISST](https://github.com/simonhmartin/twisst) documentation.

Try the most up-to-date versions of TWISST and Simon H. Martin's [General Genomics Tools](https://github.com/simonhmartin/genomics_general/) first, as they may provide performance improvements. At the time of creating this notebook, however, I had difficulty getting the most recent versions of the two repos to run together. To solve this, I used [this version](https://github.com/simonhmartin/twisst/tree/9d9634b9fc5e2763423daac01512b78be3244679) of TWISST and [this version](https://github.com/simonhmartin/genomics_general/tree/fd66be9658b65399215e96dd8243070ce21d2b5d) of the General Genomics Tools. To make the `raxml_sliding_windows.py` script run, I replaced the `import` section of a newer version of the script. You can do that yourself, or just replace the old version of the script in the cloned repo with [this one](https://raw.githubusercontent.com/mistergroot/gatktutorial/master/scripts/raxml_sliding_windows.py).

All of these commands could be combined into one long line and run in a single loop, but I prefer to work in steps, so a lot of intermediate files are generated.
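One practical benefit of keeping intermediate files is that an interrupted run can pick up where it left off. The sketch below illustrates that idea only; the helper name `run_step` and the toy file are hypothetical and are not part of the workflow in this notebook.

```shell
# Hypothetical helper: run a command and capture its stdout in an
# intermediate file, but skip the work if a non-empty copy already exists.
run_step() {
    out="$1"; shift
    if [ -s "$out" ]; then
        echo "skipping $out (already exists)" >&2
        return 0
    fi
    # Write under a temporary name first, so a failed command never leaves
    # a half-written file under the final name.
    "$@" > "$out.tmp" && mv "$out.tmp" "$out"
}

# Toy usage: the second call is skipped because step1.txt is already there.
workdir=$(mktemp -d)
run_step "$workdir/step1.txt" echo "first run"
run_step "$workdir/step1.txt" echo "second run"
cat "$workdir/step1.txt"
```

Each loop body in the cells below could be wrapped this way, at the cost of having to delete an output by hand when you want to regenerate it.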

In [None]:
##conda install openjdk
##conda install -c r r

Preparing VCF files for TWISST:

In [None]:
##I work across multiple conda envs on a cluster, which can get messy, so I reload my bashrc and activate
##the appropriate env in each cell. You may not need to do this.
##First, mask low-quality genotypes (DP<5 or GQ<30) and exclude indels, multiallelic sites and sites with MAC<2:

In [None]:
%%bash
source ~/.bashrc
conda activate py2
export LC_ALL="en_US.UTF-8"
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    bcftools filter -e 'FORMAT/DP<5 | FORMAT/GQ<30' --set-GTs . /moto/eaton/projects/macaques/calls/raw/Chr$i.raw.vcf -O u | \
    bcftools view -m2 -M2 --exclude-types indels -U -i 'TYPE=="snp" & MAC >= 2' -O z \
        > /moto/eaton/projects/macaques/twisst/Chr$i.biallelic.snponly.qual.mac.vcf.gz
done
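To see what the genotype mask in the first `bcftools filter` call amounts to, here is a toy awk version of just that rule, applied to one made-up record. It is an illustration rather than a replacement: it assumes every sample field is exactly `GT:DP:GQ` with numeric values, and it writes a bare `.` for masked genotypes, which may not match what your bcftools version emits. Real data should go through bcftools.

```shell
# Toy re-implementation of the per-sample mask (DP<5 or GQ<30 -> missing GT).
# Assumption: FORMAT is exactly GT:DP:GQ and DP/GQ are numeric.
mask_low_quality() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }                 # pass header lines through
         {
             for (c = 10; c <= NF; c++) {     # sample columns start at field 10
                 split($c, f, ":")
                 if ((f[2] + 0) < 5 || (f[3] + 0) < 30)
                     $c = ".:" f[2] ":" f[3]
             }
             print
         }'
}

# Made-up record: sample 2 fails on depth, sample 3 fails on genotype quality.
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP:GQ\t0/1:12:99\t1/1:3:40\t0/0:8:20\n' \
    | mask_low_quality
```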

In [None]:
##rewrite masked genotypes (a bare .) as ./. so that they do not cause issues later

In [2]:
%%bash
source ~/.bashrc
conda activate py2
export LC_ALL="en_US.UTF-8"
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    zcat /moto/eaton/projects/macaques/twisst/Chr$i.biallelic.snponly.qual.mac.vcf.gz | perl -pe "s/\s\.:/\t.\/.:/g" \
        | bgzip -c >/moto/eaton/projects/macaques/twisst/Chr$i.phaseready3.vcf.gz
done

Download the BEAGLE 4 jar (https://faculty.washington.edu/browning/beagle/beagle.r1399.jar) to phase the VCFs. The example commands in the TWISST repo do not run with BEAGLE 5 (most recent version of BEAGLE):

In [None]:
%%bash
source ~/.bashrc
conda activate py2
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    java -Xmx50g -jar /moto/eaton/projects/macaques/scripts/beagle.r1399.jar \
        gt=/moto/eaton/projects/macaques/twisst/Chr$i.phaseready3.vcf.gz \
        out=/moto/eaton/projects/macaques/twisst/Chr$i.phased \
        impute=true nthreads=12 window=10000 overlap=1000 gprobs=false
done

In [1]:
%%bash
source ~/.bashrc
conda activate py2
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    gunzip -c /moto/eaton/projects/macaques/twisst/Chr$i.phaseready3.vcf.gz \
        | grep -v "^#" > /moto/eaton/projects/macaques/twisst/Chr$i.phaseready3.vcf
done

In [1]:
%%bash
source ~/.bashrc
conda activate py2
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    gunzip -c /moto/eaton/projects/macaques/twisst/Chr$i.phased.vcf.gz \
        | grep -v "^#" > /moto/eaton/projects/macaques/twisst/Chr$i.phased.vcf
done

In [None]:
%%bash
source ~/.bashrc
conda activate py2
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    ruby /moto/eaton/projects/macaques/scripts/mask_imputed_gts.rb \
        /moto/eaton/projects/macaques/twisst/Chr$i.phaseready3.vcf \
        /moto/eaton/projects/macaques/twisst/Chr$i.phased.vcf \
        /moto/eaton/projects/macaques/twisst/Chr$i.masked.vcf
done
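The idea behind this step, as I understand it, is that any genotype that was missing before phasing and was filled in by BEAGLE gets set back to missing. The awk sketch below illustrates that idea on toy data. It is not the Ruby script: it assumes both header-less files list the same records in the same order, it marks masked genotypes as `.|.` (an assumption), and it holds the whole first file in memory.

```shell
# Sketch only: re-mask genotypes that were "./." before phasing.
# $1 = header-less VCF before phasing, $2 = header-less phased VCF.
mask_imputed() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR { before[FNR] = $0; next }   # first file: remember each record
         {
             split(before[FNR], b, "\t")
             for (c = 10; c <= NF; c++)         # sample columns start at field 10
                 if (b[c] ~ /^\.\/\./) $c = ".|."
             print
         }' "$1" "$2"
}

# Made-up records: sample 2 was missing before phasing and imputed as 1|1.
workdir=$(mktemp -d)
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t./.:3\n' > "$workdir/before.vcf"
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n' > "$workdir/phased.vcf"
mask_imputed "$workdir/before.vcf" "$workdir/phased.vcf"
```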

In [3]:
%%bash
source ~/.bashrc
conda activate py2
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    gunzip -c /moto/eaton/projects/macaques/twisst/Chr$i.phaseready3.vcf.gz | grep "^#" \
        > /moto/eaton/projects/macaques/twisst/Chr$i.header.vcf
done

In [4]:
%%bash
source ~/.bashrc
conda activate py2
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    cat /moto/eaton/projects/macaques/twisst/Chr$i.header.vcf \
        /moto/eaton/projects/macaques/twisst/Chr$i.masked.vcf \
        | gzip > /moto/eaton/projects/macaques/twisst/Chr$i.masked.with.header.vcf.gz
done
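The cells that split each VCF into header and body, and then glue a header back on, rely on one convention: header lines start with `#`. A toy round trip with made-up records shows the pattern. The sketch anchors the match with `^#` so that only lines beginning with `#` count as header; the made-up ID `rs#1` is there to show a record that an unanchored pattern would misclassify.

```shell
# Print only the header lines of a gzipped VCF.
vcf_header() {
    gunzip -c "$1" | grep "^#"
}

# Print only the records (everything that is not a header line).
vcf_body() {
    gunzip -c "$1" | grep -v "^#"
}

workdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\nchr1\t100\trs#1\n' \
    | gzip > "$workdir/toy.vcf.gz"

vcf_header "$workdir/toy.vcf.gz" > "$workdir/toy.header.vcf"
vcf_body   "$workdir/toy.vcf.gz" > "$workdir/toy.body.vcf"

# Reattach the header, as in the cell above.
cat "$workdir/toy.header.vcf" "$workdir/toy.body.vcf" \
    | gzip > "$workdir/toy.with.header.vcf.gz"
gunzip -c "$workdir/toy.with.header.vcf.gz"
```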

`parseVCF.py` (below) is from the General Genomics Tools repo mentioned above.

In [5]:
%%bash
source ~/.bashrc
conda activate py2
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    python /moto/eaton/projects/macaques/scripts/parseVCF.py \
        -i /moto/eaton/projects/macaques/twisst/Chr$i.masked.with.header.vcf.gz \
        | gzip > /moto/eaton/projects/macaques/twisst/Chr$i.geno.gz
done


Keeping just the 4 taxa we want to look at (P. anubis, M. mulatta, M. fuscata, M. fascicularis):

In [41]:
%%bash
source ~/.bashrc
conda activate py2
cd /moto/eaton/projects/macaques/scripts/genomics_general/
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    python filterGenotypes.py -s SRR4454026,SRR4454020,fasso,fasno \
        -i /moto/eaton/projects/macaques/twisst/Chr$i.geno.gz | grep -v "N|N" \
        > /moto/eaton/projects/macaques/twisst/Chr$i.pops.geno
done

230000 lines read, 22 pods queued, 21 pods filtered, 21 pods sorted, 21 pods written, 185544 good lines written.
440000 lines read, 43 pods queued, 42 pods filtered, 42 pods sorted, 42 pods written, 371446 good lines written.
660000 lines read, 65 pods queued, 64 pods filtered, 64 pods sorted, 64 pods written, 565736 good lines written.
880000 lines read, 87 pods queued, 86 pods filtered, 86 pods sorted, 86 pods written, 762106 good lines written.
1090000 lines read, 108 pods queued, 107 pods filtered, 107 pods sorted, 107 pods written, 954902 good lines written.
1308912 lines read, 130 pods queued, 129 pods filtered, 129 pods sorted, 128 pods written, 1155440 good lines written.
1520000 lines read, 151 pods queued, 150 pods filtered, 150 pods sorted, 150 pods written, 1352983 good lines written.
1730000 lines read, 172 pods queued, 171 pods filtered, 171 pods sorted, 171 pods written, 1545031 good lines written.
1950000 lines read, 194 pods queued, 193 pods filtered, 193 pods sorted, 
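A note on the `grep -v "N|N"` in the filtering cell: plain `grep` uses basic regular expressions, where `|` is a literal character, so the filter drops every line that contains the literal string `N|N`. With `grep -E` the same pattern would instead mean "N or N" and match far more lines. The toy `.geno`-style lines below are made up, and `-F` is added only to make the literal match explicit.

```shell
# Drop every line containing the literal string "N|N".
drop_missing_sites() {
    grep -vF 'N|N'
}

# Made-up input: the site at position 100 has one missing genotype.
printf '#CHROM\tPOS\tind1\tind2\nchr1\t100\tA|G\tN|N\nchr1\t200\tA|A\tG|G\n' \
    | drop_missing_sites
```

Only the header line and the site at position 200 survive.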

In [None]:
%%bash
source ~/.bashrc
conda activate py2
cd /moto/eaton/projects/macaques/scripts/genomics_general/phylo/
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    python raxml_sliding_windows.py -T 12 \
        -g /moto/eaton/projects/macaques/twisst/Chr$i.pops.geno \
        -w 50 --windType sites --prefix /moto/eaton/projects/macaques/twisst/Chr$i.small.raxml \
        --raxml /moto/eaton/users/nsl2119/miniconda3/envs/py2/bin/raxmlHPC-PTHREADS-SSE3
done

In [None]:
%%bash
source ~/.bashrc
conda activate py2
cd /moto/eaton/projects/macaques/scripts/twisst/
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    python twisst.py \
        -t /moto/eaton/projects/macaques/twisst/Chr$i.small.raxml.trees.gz \
        -w /moto/eaton/projects/macaques/twisst/Chr$i.small.raxml.trees.csv.gz \
        --outputTopos /moto/eaton/projects/macaques/twisst/Chr$i.small.topologies.trees \
        -g mulsolo -g mulno -g fasN -g fasS \
        --groupsFile /moto/eaton/projects/macaques/twisst/groups.tsv --method complete
done
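As far as I understand the TWISST documentation, `--groupsFile` expects a tab-separated file with one tree tip name per line and its group in the second column. The actual `groups.tsv` is not shown in this notebook, so the tip names below are placeholders, and the `_A`/`_B` suffixes (one tip per haplotype) are an assumption. The hypothetical helper `check_groups` only verifies that each group passed with `-g` has at least one member in the file.

```shell
# Hypothetical sanity check: $1 = groups file, remaining args = -g group names.
check_groups() {
    file="$1"; shift
    for g in "$@"; do
        n=$(awk -F '\t' -v g="$g" '$2 == g { n++ } END { print n + 0 }' "$file")
        if [ "$n" -eq 0 ]; then
            echo "group $g has no members in $file" >&2
            return 1
        fi
        echo "$g: $n tips"
    done
}

# Placeholder groups file; tip names are invented for the example.
workdir=$(mktemp -d)
printf 'tipA_A\tmulsolo\ntipA_B\tmulsolo\ntipB_A\tmulno\ntipB_B\tmulno\n'  > "$workdir/groups.tsv"
printf 'tipC_A\tfasN\ntipC_B\tfasN\ntipD_A\tfasS\ntipD_B\tfasS\n'         >> "$workdir/groups.tsv"
check_groups "$workdir/groups.tsv" mulsolo mulno fasN fasS
```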

In [None]:
##alternative groups for the twisst.py call above: -g assamensis -g arctoides -g fasN -g anubis

In [None]:
%%bash
source ~/.bashrc
conda activate r
export LC_ALL="en_US.UTF-8"
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    CHR=$(tail /moto/eaton/projects/macaques/twisst/Chr$i.small.raxml.data.tsv | awk '{x=$1}END{print x}')
    LENGTH=$(tail /moto/eaton/projects/macaques/twisst/Chr$i.small.raxml.data.tsv | awk '{x=$3}END{print x}')
    Rscript /moto/eaton/projects/macaques/scripts/plot_twisst_per_lg.r \
        /moto/eaton/projects/macaques/twisst/Chr$i.small.raxml.trees.csv.gz \
        /moto/eaton/projects/macaques/twisst/Chr$i.small.raxml.data.tsv $CHR $LENGTH \
        /moto/eaton/projects/macaques/twisst/Chr$i.small.rect.pdf /moto/eaton/projects/macaques/twisst/Chr$i.small.smooth.pdf
done
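The two `tail | awk` lines take the final row of the window data file and pull out column 1 (used as the chromosome name) and column 3 (used as the length). A toy file with made-up values shows the same extraction. The column names here are placeholders inferred from how the cell uses them, `last_field` is a hypothetical helper, and `tail -n 1` is just a more direct way to get the last row.

```shell
# Hypothetical helper: $1 = file, $2 = 1-based column; prints that column
# of the last row.
last_field() {
    tail -n 1 "$1" | awk -v col="$2" '{ print $col }'
}

# Made-up window table: the last window ends at 9800.
workdir=$(mktemp -d)
printf 'scaffold\tstart\tend\nChr1\t1\t5000\nChr1\t5001\t9800\n' > "$workdir/toy.data.tsv"

CHR=$(last_field "$workdir/toy.data.tsv" 1)
LENGTH=$(last_field "$workdir/toy.data.tsv" 3)
echo "$CHR $LENGTH"
```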

In [None]:
%%bash
source ~/.bashrc
conda activate py2
cd /moto/eaton/projects/macaques/scripts/genomics_general/phylo/
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    python raxml_sliding_windows.py -T 12 \
        -g /moto/eaton/projects/macaques/twisst/Chr$i.pops.geno \
        -w 100 --overlap 50 --windType sites --prefix /moto/eaton/projects/macaques/twisst/Chr$i.overlap.raxml \
        --raxml /moto/eaton/users/nsl2119/miniconda3/envs/py2/bin/raxmlHPC-PTHREADS-SSE3
done
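The difference between the two window settings used in this notebook (`-w 50` versus `-w 100 --overlap 50`) is easier to see on a toy example. The awk sketch below is my own illustration of windows defined by a fixed number of sites, stepping forward by window size minus overlap. It is not the script's code, `sites_windows` is a hypothetical name, and it simply drops a trailing partial window, which may not be what the real script does.

```shell
# Sketch: read one site position per line on stdin and print
# "first_pos last_pos n_sites" for each window of w sites with overlap o.
sites_windows() {
    awk -v w="$1" -v o="$2" '
        { pos[NR] = $1 }
        END {
            step = w - o
            if (step < 1) exit 1              # overlap must be smaller than w
            for (s = 1; s + w - 1 <= NR; s += step)
                print pos[s], pos[s + w - 1], w
        }'
}

# Ten made-up sites at positions 10, 20, ..., 100.
echo "windows of 4 sites, no overlap:"
seq 10 10 100 | sites_windows 4 0
echo "windows of 4 sites, overlap of 2:"
seq 10 10 100 | sites_windows 4 2
```

With overlap, each site contributes to more than one tree, so neighbouring windows are not independent.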

In [None]:
%%bash
source ~/.bashrc
conda activate py2
cd /moto/eaton/projects/macaques/scripts/twisst/
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    python twisst.py \
        -t /moto/eaton/projects/macaques/twisst/Chr$i.overlap.raxml.trees.gz \
        -w /moto/eaton/projects/macaques/twisst/Chr$i.overlap.raxml.trees.csv.gz \
        --outputTopos /moto/eaton/projects/macaques/twisst/Chr$i.overlap.topologies.trees \
        -g mulsolo -g mulno -g fasN -g fasS \
        --groupsFile /moto/eaton/projects/macaques/twisst/groups.tsv --method complete
done

In [None]:
%%bash
source ~/.bashrc
conda activate r
export LC_ALL="en_US.UTF-8"
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do
    CHR=$(tail /moto/eaton/projects/macaques/twisst/Chr$i.overlap.raxml.data.tsv | awk '{x=$1}END{print x}')
    LENGTH=$(tail /moto/eaton/projects/macaques/twisst/Chr$i.overlap.raxml.data.tsv | awk '{x=$3}END{print x}')
    Rscript /moto/eaton/projects/macaques/scripts/plot_twisst_per_lg.r \
        /moto/eaton/projects/macaques/twisst/Chr$i.overlap.raxml.trees.csv.gz \
        /moto/eaton/projects/macaques/twisst/Chr$i.overlap.raxml.data.tsv $CHR $LENGTH \
        /moto/eaton/projects/macaques/twisst/Chr$i.overlap.rect.pdf /moto/eaton/projects/macaques/twisst/Chr$i.overlap.smooth.pdf
done

In [None]:
!scancel -u nsl2119