Reconstructing the phylogenetic relationships among chordates using genomic data

We use data from the public RefSeq-database of the National Center for Biotechnology Information (NCBI) to reconstruct the phylogenetic tree displayed in the glass cabinet of the NHM, which depicts the relationship among more than 3,000 chordate species. This database is rapidly growing due to recent advances in sequencing technology and it contains more than 100,000 curated genomes of diverse organisms by 2022. The RefSeq-database not only provides access to DNA sequences, but also to amino acid sequences of transcribed genes. Amino acids are the building blocks of proteins which are encoded by DNA codons (triplets of consecutive nucleotides in the DNA sequence). The sequence of amino acids in the RefSeq was generated by bioinformatic tools that translate the DNA sequence of known genes into the corresponding amino acid sequences. Since amino acid sequences are usually more conserved than the underlying DNA sequences, these data are well suited to reconstruct the phylogenetic relationships of even distantly related taxa. In our case, we focus on the genomes of mitochondria, which are cell organelles that act as the cellular powerplants of eukaryotic organisms.

Powerful computers with sufficient RAM memory (>100GB) and multicore processors (>100 cores) are required to carry out these computationally demanding analyses. Additionally, the operation system should be UNIX-based, which allows to transmit tasks to the computer on the command line. If you are interested in learning more about UNIX and the basics of bioinformatic analyses, you can find useful tutorials and other resources on the internet, such as here and here.

Below, we present the basic analysis steps underlying the phylogenetic reconstruction of chordates (as shown in the glass cabinet). Since the number of taxa in the RefSeq database is constantly growing, the resulting tree may differ from our original tree.

(1) Downloading the sequencing data

At first, we download the dataset of amino acid and DNA sequences of mitochondrial genomes for all available taxa. Based on the metainfomation from individual DNA sequences, we generate a database of names for all organisms in the dataset together with their names at different taxonomic levels (e.g., species, genus, family, order, etc.). In the following we talk about "taxa" (Singular "taxon").

Additionally required programs:

NCBI-edirect

## make directory
mkdir ~/PhylogenyDIY/data

## go to directory
cd ~/PhylogenyDIY/data

## download aminoacid sequence dataset
wget https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.protein.faa.gz

## download DNA sequence dataset
wget https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.1.genomic.fna.gz

## simplify the header of the DNA FASTA file and only retain the name of the Species
gunzip -c mitochondrion.1.1.genomic.fna.gz \
  | awk '{if ($1~"^>") {print $1"_"$2"_"$3} else {print}}' \
  > mitochondrion.1.1.genomic_fixed.fasta

gunzip -c mitochondrion.1.1.genomic.fna.gz \
  | awk '{if ($1~"^>") {print substr($1,2)}}' \
  > mitochondrion.1.1.genomic_fixed.list

### now get the taxonomy table for each RefSeq entry
module load Tools/NCBIedirect
while read -r line
do
  #echo $line
  ID=`esearch -db nucleotide -query ${line} < /dev/null |esummary | xtract -pattern TaxId  -element TaxId `
  ID2=`efetch -db taxonomy -id ${ID} -format xml | xtract -pattern Taxon -tab "," -first TaxId ScientificName \
    -group Taxon -KING "(-)" -PHYL "(-)" -CLSS "(-)" -ORDR "(-)" -FMLY "(-)" -GNUS "(-)" \
    -block "*/Taxon" -match "Rank:kingdom" -KING ScientificName \
    -block "*/Taxon" -match "Rank:phylum" -PHYL ScientificName \
    -block "*/Taxon" -match "Rank:class" -CLSS ScientificName \
    -block "*/Taxon" -match "Rank:order" -ORDR ScientificName \
    -block "*/Taxon" -match "Rank:family" -FMLY ScientificName \
    -block "*/Taxon" -match "Rank:genus" -GNUS ScientificName \
    -group Taxon -tab "," -element "&KING" "&PHYL" "&CLSS" "&ORDR" "&FMLY" "&GNUS"`
  echo ${line}","$ID2
done < mitochondrion.1.1.genomic_fixed.list > mitochondrion.1.1.genomic_fixed_taxon.list

## convert the table to rename the tips of the trees with species names
awk -F "," '{split($1,a,"."); print a[1]".:"$3}' mitochondrion.1.1.genomic_fixed_taxon.list \
  > mitochondrion.1.1.genomic_fixed_taxon.txt

## isolate the class taxonomic level to color code the tree
awk -F "," '$1!~/""/{split($3,a," "); print a[1]"_"a[2]"\t"$6}' mitochondrion.1.1.genomic_fixed_taxon.list \
  | grep -v "^_"  > mitochondrion.1.1.genomic_fixed_taxon.colors

(2) Filtering the dataset

The downloaded dataset mitochondrion.1.protein.faa.gz contains the amino acid sequences of mitochondrial genomes for all available taxa. Using a script (proteins2genome.py) written in the programming language Python, we isolate sequencing data of chordates only. Additionally, we reduce the dataset to genes, which are present in at least 90% of all taxa in the dataset and concatenate the amino acid sequences of all genes in the same order for each taxon.

## make new directory
mkdir -p ~/PhylogenyDIY/results

cd ~/PhylogenyDIY

## here, we reduce the FASTA file to contain only genes that are present in 90% of all taxa that belong to the Chordates and concatenate the sequence of all retained genes per taxon
python ~/PhylogenyDIY/scripts/proteins2genome.py \
  --TaxList data/mitochondrion.1.1.genomic_fixed_taxon.list \
  --Tax Chordata \
  --FreqTH 0.90 \
  --input data/mitochondrion.1.protein.faa.gz  \
  > results/mitochondrion.1.protein_Chordata.fasta

(3) Alignment of the sequencing data

The sequence comparison is accomplished by a so-called alignment, in which the sequences of the individual taxa are written one below the other. Changes in the amino acid sequence (= an exchange of individual amino acids) thus become visible. Now, mutations that lead to different lengths have to be taken into account. Insertions or deletions of amino acids would lead to a shift in the relative positions of amino acids of the same origin in the sequence comparison. Positions with "gaps" due to deletions are filled with a - symbol (this is called a "gap"). However, mutations that lead to one or more additional amino acids (insertions) must also be compensated for in the alignment by introducing "gaps". This critical step is necessary in order to be able to compare traits of the same origin, in our case amino acids at the same sequence position, across taxa and thus be able to identify genetic differences. The underlying algorithms are computationally intensive, which is why we divide the computing load over 200 processor cores.

MAFFT is modifying the metainformation of the original data, which causes problems when assigning taxon names to the branch tips of the final phylogenetic tree. Thus, we reconstitute the original metadata using a custom script (fixIDAfterMafft.py). In addition, we delete positions in the alignment which contain gaps in more than 50% of all taxa (reduceAln2FASTA.py) to avoid biased results due to an excess of non-informative positions.

Additionally required programs:

MAFFT

conda activate mafft-7.487

cd ~/PhylogenyDIY

## carry out the alignment with MAFFT
mafft \
  --thread 200 \
  --auto \
  results/mitochondrion.1.protein_Chordata.fasta \
  > results/mitochondrion.1.protein_Chordata_aln_full.fasta

## fix ID's after MAFFT alignment
python  ~/PhylogenyDIY/scripts/fixIDAfterMafft.py \
  --Alignment ~/PhylogenyDIY/results/mitochondrion.1.protein_Chordata_aln_full.fasta \
  --input ~/PhylogenyDIY/results/mitochondrion.1.protein_Chordata.fasta \
  > ~/PhylogenyDIY/results/mitochondrion.1.protein_Chordata_aln_full_fixed.fasta

## only retain Position where less than 50% of all taxa have gaps
python ~/PhylogenyDIY/scripts/reduceAln2FASTA.py \
  --input results/mitochondrion.1.protein_Chordata_aln_full_fixed.fasta  \
  --threshold 0.5 \
  > results/mitochondrion.1.protein_Chordata_aln.fasta

## replace ambiguous AA with gaps
sed -i '/^>/! s/[BJZX]/\-/g' ~/PhylogenyDIY/results/mitochondrion.1.protein_Chordata_aln.fasta

The image in the glass cabinet and the image below are examples of such an alignment based on DNA sequencing data.

(4) Reconstructing the phylogenetic tree using Maximum Likelihood

The polished alignment file can now be used to estimate the relationships among taxa based on sequence differences. We employ a Maximum-Likelihood approach which compares the probabilities of different tree topologies that describe the relationships among all taxa. The goal of this approach is to identity the tree topology that best fits the matrix of sequence data while making specific assumptions about evolutionary change. These assumptions include, for example, the probability that a given amino acid is replaced by another due to mutations. Such calculations are computationally demanding. Thus, we again use 200 CPU cores to distribute the computational load.

Additionally required programs:

RAxML

module load Phylogeny/RAxML-2.8.10

## make new directory
mkdir ~/PhylogenyDIY/results/raxml
cd ~/PhylogenyDIY/results/raxml

## run ML tree reconstruction
raxmlHPC-PTHREADS-SSE3 \
  -m PROTGAMMAWAG  \
  -N 20 \
  -p 772374015 \
  -n Chordata_const \
  -s ../mitochondrion.1.protein_Chordata_aln.fasta \
  -# 3 \
  -T 200

In the best-fitting phylogenetic tree we finally replace the original taxon-IDs with correct species names using a custom (RenameTreeLeaves_new.py). Additionally, we obtain the species names of taxa belonging to the classes Hyperoartia, Ascidiacea and Leptocardii using yet another script (MakeOutgroup.py) and define these as the outgroup for the visualization of the phylogenetic tree.

python ~/PhylogenyDIY/scripts/RenameTreeLeaves_new.py \
  --input ~/PhylogenyDIY/results/raxml/RAxML_bestTree.Chordata_const \
  > ~/PhylogenyDIY/results/raxml/RAxML_bestTree_renamed.Chordata_const

outgroup=`python ~/PhylogenyDIY/scripts/MakeOutgroup.py --tree ~/PhylogenyDIY/results/iqtree_const/mitochondrion.1.protein_Chordata_aln.fasta_renamed.parstree --taxa ~/PhylogenyDIY/data/mitochondrion.1.1.genomic_fixed_taxon.list --list Hyperoartia,Ascidiacea,Leptocardii`

At last, we visualize the best-fitting phylogentic tree identified by RAxML using the programming language R.

# load necessary R libraries
library('ggtree')
library('gridExtra')
library('ggrepel')
library('ape')
library('ggplot2')
library('phangorn')
library('ggimage')
library('dplyr')
library('plotly')

## load tree file and root with outgroup taxa
tree<-read.tree('~/PhylogenyDIY/results/raxml/RAxML_bestTree_renamed.Chordata_const')
tree<-root(tree,outgroup=c($outgroup))

## load color information for highlighting different orders in different colors
Col=read.table('~/PhylogenyDIY/data/mitochondrion.1.1.genomic_fixed_taxon.colors',
header=F)
colnames(Col)<-c('tip','cat')

## plot tree
phylo.tree <- ggtree(tree,
layout='roundrect',
lwd=.1,
branch.length='none')+
theme_tree2()+
theme_bw()+
xlab('av. subst./site') +
theme(axis.title.y=element_blank(),
  axis.text.y=element_blank(),
axis.ticks.y=element_blank())+
theme(legend.position='bottom') +
scale_colour_discrete('Orders')+
theme(legend.title = element_text(size=10))+
theme(legend.text = element_text(size=8))+
guides(color = guide_legend(override.aes = list(size = 3)))

phylo.tree <- phylo.tree  %<+% Col+
geom_tiplab(aes(color=cat), size = 0.2)

## export tree
ggsave(filename='~/PhylogenyDIY/results/raxml/tree_rect.pdf',
  phylo.tree,
  width=10,
  height=30,limitsize=F)

The tree below and in the glass cabinet was additionally edited with other graphics programs. For example, we removed the original legend at the bottom of the figure and added vertical German names for all taxonomic classes. In addition, we rotated several taxon-groups along their x-axis. This does not modify the tree topology, but improves the readability of the tree.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README_en.md

README_en.md

Reconstructing the phylogenetic relationships among chordates using genomic data

(1) Downloading the sequencing data

(2) Filtering the dataset

(3) Alignment of the sequencing data

(4) Reconstructing the phylogenetic tree using Maximum Likelihood

Files

README_en.md

Latest commit

History

README_en.md

File metadata and controls

Reconstructing the phylogenetic relationships among chordates using genomic data

(1) Downloading the sequencing data

(2) Filtering the dataset

(3) Alignment of the sequencing data

(4) Reconstructing the phylogenetic tree using Maximum Likelihood