# Lecture 13 Phylogenetics on the command line

---

## 1. Environments

In [None]:
%%bash

# environment for BI analyses
conda create -n beast beagle -c bioconda
conda activate beast
conda install -c anaconda java-1.8.0-openjdk-devel-cos7-s390x
conda install -c bioconda astral-tree
# in your working folder:
wget https://github.com/beast-dev/beast-mcmc/releases/download/v1.10.4/BEASTv1.10.4.tgz
tar xvzf BEASTv1.10.4.tgz
wget https://github.com/beast-dev/tracer/releases/download/v1.7.2/Tracer_v1.7.2.tgz
tar xvzf Tracer_v1.7.2.tgz

# environment for ML analyses
conda create -n phylo iqtree muscle mafft trimal openjdk=11
# activate the environment
conda activate phylo
# install the rest of the packages we need
conda install -c bioconda clipkit
conda install -c bioconda cd-hit
conda install -c bioconda seqkit
conda install -c bioconda raxml-ng
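
Before moving on, it can help to confirm that the tools an environment is supposed to provide are actually on the `PATH`. The sketch below is one way to do that. `check_tools` is a hypothetical helper name, not part of conda, and the command names are assumed to match the package names installed above.

```bash
# check_tools: hypothetical helper that reports which commands are missing from PATH
check_tools() {
    local missing=0
    for tool in "$@"; do
        if command -v "$tool" > /dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # the return code is the number of missing tools (0 means all were found)
    return "$missing"
}

# after 'conda activate phylo', the tools installed above should all be found
check_tools iqtree muscle mafft trimal clipkit cd-hit seqkit raxml-ng \
    || echo "some tools are missing; re-run the conda install lines"
```

A non-zero return code makes the helper easy to use as a guard at the top of a longer cell.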

---

## 2. Preparing the sequence data from BUSCO score results

These data are from the Takifugu assemblies we did a few weeks ago. Make sure the paths to the results match your folder structure.

In [None]:
%%bash

conda deactivate; conda activate phylo

# check how many of the same genes are found across samples
for i in *faa; do echo ${i%%.*} >> seqs_recovered.txt; ls *scaffolds_busco/run*/busco*/single*/${i%%.*}* | wc -l >> seqs_recovered.txt; done

# append sample name to header of single gene sequences
# run from inside the lecture 8 folder
for i in *_scaffolds_busco; do SAMPLE=${i%%_*}; echo $SAMPLE; sed -i -E "s/>.+/>${SAMPLE}/g" ${SAMPLE}_*_busco/run_*/busco*/single_*/*.faa; done
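
To see what the renaming loop does without touching the real BUSCO output, the same idea can be tried on a throwaway folder. This is a minimal sketch: the sample names (`fuguA`, `fuguB`), the flat folder layout, and the file `gene1.faa` are made up for illustration, and it writes to a new file instead of using `sed -i`.

```bash
# toy reproduction of the header-renaming step on a throwaway folder
workdir=$(mktemp -d)
mkdir -p "$workdir/fuguA_scaffolds_busco" "$workdir/fuguB_scaffolds_busco"
printf '>NODE_1_length_500:10-90\nMKT\n' > "$workdir/fuguA_scaffolds_busco/gene1.faa"
printf '>NODE_7_length_300:5-60\nMKS\n'  > "$workdir/fuguB_scaffolds_busco/gene1.faa"

for dir in "$workdir"/*_scaffolds_busco; do
    name=$(basename "$dir")
    SAMPLE=${name%%_*}          # strip everything from the first underscore on
    # write to a new file instead of editing in place, so the original is kept
    sed -E "s/^>.+/>${SAMPLE}/" "$dir/gene1.faa" > "$dir/gene1.renamed.faa"
done

# each header should now be just the sample name
cat "$workdir"/*_scaffolds_busco/gene1.renamed.faa
```

Having identical headers across genes is what later allows the per-gene alignments to be concatenated by sample.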

In [None]:
%%bash

conda deactivate; conda activate phylo

# select the genes that have all samples represented and cat each gene into a single file
# running within lecture13/data
for i in 103348at7898 112660at7898 114667at7898 119924at7898 129688at7898 136410at7898 39654at7898; do cat *scaffolds_busco/run*/busco*/single*/$i* >> "${i}"_all.faa; done

# align them all
for i in *faa; do muscle -align ${i} -output "${i%%.*}".aln; done

# add a line break at the end of each file for the concatenation to work
for i in *aln; do echo >> ${i}; done

# make alignment files single liners
for i in *aln; do awk '/^>/ { if(NR>1) print "";  printf("%s\n",$0); next; } { printf("%s",$0);}  END {printf("\n");}' < ${i} > "${i%%.*}".SL.aln; done

# run the concatenating tool from seqkit
# "if your initial files are having unequal number of species than the output files will have less number of species"
# "but if you have equal number of species to begin with, then the output file will have same number of species"
seqkit concat *_all.SL.aln > all_concat.SL.aln
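
A quick sanity check on the single-line files can catch problems before concatenating. The sketch below runs the same awk program on a made-up, wrapped toy alignment (`toy.aln`, with hypothetical sample names) and then checks that every sequence has the same length, which any valid alignment should satisfy.

```bash
workdir=$(mktemp -d)
# toy wrapped alignment: two sequences, each split over two lines
printf '>fuguA\nMK-T\nAA\n>fuguB\nMKST\nA-\n' > "$workdir/toy.aln"

# same awk program as in the cell above: one header line, then one sequence line
awk '/^>/ { if (NR > 1) print ""; printf("%s\n", $0); next } { printf("%s", $0) } END { printf("\n") }' \
    "$workdir/toy.aln" > "$workdir/toy.SL.aln"

cat "$workdir/toy.SL.aln"

# count how many distinct sequence lengths the file contains; an alignment should have exactly one
n_lengths=$(awk '!/^>/ { print length($0) }' "$workdir/toy.SL.aln" | sort -u | wc -l | tr -d ' ')
echo "distinct sequence lengths: $n_lengths"
```

Running the length check on each `*_all.SL.aln` file, and again on `all_concat.SL.aln`, is a cheap way to spot a gene in which a sample is missing or truncated.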

---

## 3. Maximum Likelihood with IQ-TREE (basics)


In [None]:
%%bash

conda deactivate; conda activate phylo

# Calculating one ML tree for each gene
for i in *_all.aln; do iqtree -s $i --prefix "${i%%.*}"; done

# Calculating one ML tree for the concatenated alignment
iqtree -s all_concat.SL.aln --prefix all_concat.iqtree

---

## 4. Maximum Likelihood with RAxML-NG (basics)

In [None]:
%%bash

conda deactivate; conda activate phylo

# Calculating one ML tree for each gene
# in classic RAxML (not NG) the '-f a' flag runs the bootstrap and the best-tree search in a single step; in RAxML-NG, '--all' plays that role
# "By default, RAxML-NG will perform MRE-based bootstopping test after every 50 replicates"
# "and terminate once the diagnostic statistics drops below the cutoff value."
# raxml-ng --all --msa <DATFILE> --model <MODEL> --prefix T15 --seed 2 --threads 2 --bs-metric fbp,tbe
for i in *_all.SL.aln; do raxml-ng --all --msa $i --prefix "${i%%.*}".raxml --model BLOSUM62 --seed 666; done

# Calculating one ML tree for the concatenated alignment
raxml-ng --all --msa all_concat.SL.aln --prefix all_concat.raxml --model BLOSUM62 --seed 666
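
The `"${i%%.*}"` expansion decides how the output files are named, which matters when the trees are collected later. The dry run below only prints the commands and the best-tree file each one would be expected to write. It does not call RAxML-NG, and the two file names are just examples taken from the gene list used earlier.

```bash
# dry run: print the commands and the tree file each one is expected to write
for i in 103348at7898_all.SL.aln 39654at7898_all.SL.aln; do
    prefix="${i%%.*}"           # drops everything from the first dot on, e.g. 103348at7898_all
    echo "raxml-ng --all --msa $i --prefix ${prefix}.raxml --model BLOSUM62 --seed 666"
    # RAxML-NG appends .raxml.bestTree to whatever prefix it is given
    echo "  -> ${prefix}.raxml.raxml.bestTree"
done
```

Because the prefix already ends in `.raxml`, the best-tree files end in `.raxml.raxml.bestTree`.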

---

## 5. Summarizing gene trees into species trees with ASTRAL

In [None]:
%%bash

conda deactivate; conda activate beast

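# gather only the per-gene trees; the *_all pattern keeps the concatenated-alignment trees out of the gene-tree sets
cat *_all.raxml.raxml.bestTree > all_raxml.bestrees
cat *_all.treefile > all_iqtrees.bestrees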

astral -i all_raxml.bestrees -o all_raxml_spptree.tree -t 1
astral -i all_iqtrees.bestrees -o all_iqtree_spptree.tree -t 1
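
It is worth checking that the combined file holds exactly one tree per gene before handing it to the species-tree step. The sketch below does this on made-up Newick trees (taxa `A` to `D` and the `gene*.treefile` names are hypothetical) and assumes one tree per line.

```bash
workdir=$(mktemp -d)
# three toy gene trees in Newick format, one file per gene
printf '((A,B),(C,D));\n' > "$workdir/gene1.treefile"
printf '((A,C),(B,D));\n' > "$workdir/gene2.treefile"
printf '((A,B),(C,D));\n' > "$workdir/gene3.treefile"

cat "$workdir"/*.treefile > "$workdir/all.bestrees"

# each line holding a tree ends with a semicolon, so counting those lines counts the trees
n_trees=$(grep -c ';' "$workdir/all.bestrees")
n_files=$(ls "$workdir"/*.treefile | wc -l | tr -d ' ')
echo "trees: $n_trees, files: $n_files"
```

If the two numbers differ on the real data, a tree file is probably empty or an extra tree was picked up by the glob.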

---

## 6. Gene congruence on the Astral species trees with IQ-TREE

In [None]:
%%bash

# IQ-TREE was installed into the phylo environment
conda deactivate; conda activate phylo

iqtree2 -t all_raxml_spptree.tree --gcf all_raxml.bestrees --prefix concord_raxml

iqtree2 -t all_iqtree_spptree.tree --gcf all_iqtrees.bestrees --prefix concord_iqtree

---

## 7. Bayesian Inference with BEAST

Please refer to the slides for instructions on how to set up the models. The cell below only lists the commands that we need to run from the command line.

In [None]:
%%bash

conda deactivate; conda activate beast

# BEAST analysis with unlinked trees and linked substitution and clock models
./beast -java -overwrite 103348at7898_all_1.xml
./beast -java -overwrite 103348at7898_all_2.xml

# BEAST analysis with a joint (linked) tree: split substitution models, linked tree and clock models
./beast -java -overwrite linked_1.xml
./beast -java -overwrite linked_2.xml

# evaluate convergence
./tracer

# combine trees for each gene
for i in 103348at7898 112660at7898 114667at7898 119924at7898 129688at7898 136410at7898 39654at7898; do 
     ./logcombiner -trees -burnin 100 "103348at7898_all_1.${i}_all.SL.aln.(time).trees" \
     "103348at7898_all_2.${i}_all.SL.aln.(time).trees" \
     "${i}"_all_combined_time.trees;
done

# treeannotator to get a consensus tree from the combined files for each gene
# get the maximum clade credibility tree using "common ancestor" as the method for summarizing heights along the tree
for i in 103348at7898 112660at7898 114667at7898 119924at7898 129688at7898 136410at7898 39654at7898; do
     ./treeannotator -heights ca "${i}"_all_combined_time.trees \
     "${i}"_all_combined_time.mcct;
done

# combine the tree files for the linked-tree analyses
./logcombiner -trees -burnin 100 "linked_1.(time).trees" "linked_2.(time).trees" linked_combined_time.trees

# get the maximum clade credibility tree using "common ancestor" as the method for summarizing heights along the tree
./treeannotator -heights ca linked_combined_time.trees linked_combined_time.mcct
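
When choosing a burn-in, it helps to know how many trees each run sampled and up to which state. The sketch below counts them in a toy stand-in for a tree log. It assumes that each sampled tree sits on a line starting with `tree STATE_`, as in BEAST 1.x tree logs, and the file `toy.trees` and its contents are made up.

```bash
workdir=$(mktemp -d)
# toy stand-in for a BEAST tree log; real files hold full NEXUS blocks
cat > "$workdir/toy.trees" <<'EOF'
#NEXUS
Begin trees;
tree STATE_0 = [&R] ((A:1,B:1):1,C:2);
tree STATE_1000 = [&R] ((A:1,B:1):1,C:2);
tree STATE_2000 = [&R] ((A:1,C:1):1,B:2);
End;
EOF

# number of sampled trees, and the state number of the last one
n_sampled=$(grep -c '^tree STATE_' "$workdir/toy.trees")
last_state=$(grep '^tree STATE_' "$workdir/toy.trees" | tail -n 1 | sed -E 's/^tree STATE_([0-9]+).*/\1/')
echo "sampled trees: $n_sampled, last state: $last_state"
```

Comparing these numbers between the two runs of each analysis is a quick check that both finished and logged at the same frequency before they are combined.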

---

## 8. Tree visualisation using FigTree

In [None]:
%%bash

wget https://github.com/rambaut/figtree/releases/download/v1.4.4/FigTree_v1.4.4.tgz
tar xvzf FigTree_v1.4.4.tgz

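# launch from inside the extracted FigTree folder so that lib/figtree.jar resolves
# $HOME is used because "~" is not expanded inside double quotes
java -Djava.library.path="$HOME/anaconda3/envs/beast/lib" -jar "lib/figtree.jar"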

---

## 9. References and resources

Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ & Rambaut A (2018) Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution 4, vey016. DOI:10.1093/ve/vey016  
https://beast.community/  
https://github.com/beast-dev/beast-mcmc/releases  
https://artic.network/how-to-read-a-tree.html  

Kozlov, A.M., Darriba, D., Flouri, T., Morel, B. and Stamatakis, A., 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics, 35(21), pp.4453-4455.  
https://github.com/amkozlov/raxml-ng  

B.Q. Minh, H.A. Schmidt, O. Chernomor, D. Schrempf, M.D. Woodhams, A. von Haeseler, R. Lanfear (2020) IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol., 37:1530-1534. https://doi.org/10.1093/molbev/msaa015  
http://www.iqtree.org/doc/Tutorial  

W Shen, S Le, Y Li*, F Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.  https://bioinf.shenwei.me/seqkit/  

