# Phylogenetics course

### practice instructions

## Distance methods

In this assignment we will construct phylogenies with distance-based methods. Work through the exercises in the order given. You will practice the basic tasks of a general phylogenetic analysis: importing data, calculating phylogenies with alternative options, and generating tree plots.

The software used in this practice is the R statistical environment with the following supporting libraries: **seqinr**, **ape**, and **phangorn**. All of these are free and open source, so they remain available to you later, and you can apply the skills and approaches learned here in your own research or study projects.

### Files you need:
**IL6_protein.aln** – This file contains protein sequences of interleukin 6 from six mammal species and one bird species. The sequences were aligned using Clustal Omega.

**IL6_mRNA.aln** – This file contains coding cDNA sequences of the interleukin 6 gene from the genomes of six mammals and one bird. The sequences were aligned using Clustal Omega.

## Exercises

### 1. Set up the environment

**1.1.** R is a general-purpose environment for the statistical analysis of large-scale data. Supporting libraries, such as seqinr here, provide task- and topic-specific functionality. Libraries have to be loaded before you can use the functions inside them. We will load the seqinr and ape libraries here:

In [None]:
install.packages(c("seqinr", "ape", "phangorn"))

library(seqinr)
library(ape)


Remember, if you restart R, you have to load the libraries again to repeat or continue your tasks.
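
If you are unsure whether the libraries are already installed, a check along the following lines may help. This is a minimal sketch in base R; the variable names `needed`, `is_installed`, and `not_installed` are illustrative and not part of the practice.

```r
# Packages this practice relies on
needed <- c("seqinr", "ape", "phangorn")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Report the packages that still have to be installed
not_installed <- needed[!is_installed]
if (length(not_installed) > 0) {
  message("Still to install: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are available.")
}
```

Only the names left in `not_installed` would then need to be passed to `install.packages()`.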

**1.2.** We have to tell R which directory we want to work in. Usually this directory contains the input files, and we save the result files there as well. When R starts, it has a configuration-specific working directory, which we can easily check:

In [None]:
getwd()
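
Changing the working directory could look like the sketch below. It uses a temporary folder so that it runs anywhere; the folder name `phylo_practice` is a made-up example, and in practice you would use the folder that holds your own files.

```r
# Remember where we started so we can go back afterwards
old_dir <- getwd()

# Create a scratch folder (hypothetical name) inside R's temporary directory
work_dir <- file.path(tempdir(), "phylo_practice")
dir.create(work_dir, showWarnings = FALSE)

# Switch to the new folder and confirm the change
setwd(work_dir)
getwd()

# List the files R can now see without a path (empty for a fresh folder)
list.files()

# Return to the original working directory
setwd(old_dir)
```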

### 2. Load data files

**2.1.** R reads data from files using **read**-type functions. Many data types and file types have specialized functions. We will use the **read.alignment()** function from the seqinr package to access the sequences aligned with Clustal Omega.

In [None]:
ali.prot <- read.alignment("https://raw.githubusercontent.com/nbrg-ppcu/Introduction_to_bioinfo/refs/heads/main/data/phlyogenetics/IL6_protein.aln", format="fasta")
ali.rna <- read.alignment("https://raw.githubusercontent.com/nbrg-ppcu/Introduction_to_bioinfo/refs/heads/main/data/phlyogenetics/IL6_mRNA.aln", format="fasta")


**2.2.** These commands create complex representations of the alignments in the **ali.prot** and **ali.rna** variables. We can get a good overview of their content by simply typing the variable names or by using the **str()** function.

In [None]:
ali.prot
str(ali.prot)

**2.3.** Some functions in the ape library require the alignment in a specific format, **DNAbin**. We have to convert the alignment object accordingly:

In [None]:
ali.rna.b<-as.DNAbin(ali.rna)

### 3. Calculate distances

**3.1.** As you have learned from the lecture, there are many ways to calculate distances between aligned sequences. For protein sequences we can call the **dist.alignment()** function, which calculates the distances either taking the mutational similarity of codons into account or not.

In [None]:
d.prot<-dist.alignment(ali.prot,matrix="similarity")
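
To give a feeling for the numbers involved, the sketch below computes an identity-based distance by hand for two toy protein fragments. It assumes, following my reading of the seqinr documentation, that the `matrix="identity"` variant of **dist.alignment()** reports the square root of one minus the fraction of identical positions. The sequences and variable names are made up, and the `"similarity"` variant used above weights substitutions differently.

```r
# Two toy, already aligned protein fragments (hypothetical sequences)
p1 <- strsplit("MNSFSTSAFG", "")[[1]]
p2 <- strsplit("MNSLSTSAFS", "")[[1]]

# Fraction of aligned positions carrying the same residue
frac_identical <- mean(p1 == p2)
frac_identical

# Distance on the square-root scale (assumed form of the identity distance)
d_identity <- sqrt(1 - frac_identical)
d_identity
```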

**3.2.** For the cDNA sequences we will use the **dist.dna()** function from the ape library, which offers many evolutionary and mathematical methods via its **model** parameter. For example, we can simply use the number of differing sites as a (not very good) distance measure:

In [None]:
d.rna.n<-dist.dna(ali.rna.b,model="N")

Alternatively, we can count only the transitions or only the transversions in the same way, or we can divide the number of differences by the length of the sequences.

In [None]:
d.rna.ts<-dist.dna(ali.rna.b,model="TS")
d.rna.tv<-dist.dna(ali.rna.b,model="TV")
d.rna.raw<-dist.dna(ali.rna.b,model="raw")
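
What these four measures count can be checked by hand on a pair of short sequences. The following base R sketch uses two made-up sequences; it illustrates the definitions only and is not how **dist.dna()** is implemented.

```r
# Two toy aligned DNA sequences (hypothetical), split into single characters
s1 <- strsplit("ACGTACGTAC", "")[[1]]
s2 <- strsplit("ACATACGCAG", "")[[1]]

purines <- c("A", "G")

# Sites where the two sequences differ
differs <- s1 != s2

# A transition keeps the base class (purine <-> purine, pyrimidine <-> pyrimidine)
is_transition <- differs & ((s1 %in% purines) == (s2 %in% purines))

n_diff <- sum(differs)          # corresponds to model = "N"
n_ts   <- sum(is_transition)    # corresponds to model = "TS"
n_tv   <- n_diff - n_ts         # corresponds to model = "TV"
p_dist <- n_diff / length(s1)   # corresponds to model = "raw"

c(N = n_diff, TS = n_ts, TV = n_tv, raw = p_dist)
```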

**3.3.** Many of the nucleotide substitution models that were mentioned in the lecture (and even more that were not) can also be used here:

In [None]:
d.rna.JC69<-dist.dna(ali.rna.b,model="JC69")
d.rna.K80<-dist.dna(ali.rna.b,model="K80")
d.rna.F84<-dist.dna(ali.rna.b,model="F84")
d.rna.TN93<-dist.dna(ali.rna.b,model="TN93")
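
The two simplest models have closed-form corrections, which can be written out directly. The sketch below applies the textbook Jukes-Cantor (JC69) and Kimura two-parameter (K80) formulas to assumed proportions of differing sites; the input values are invented for illustration.

```r
# Assumed observed proportions between two sequences (illustrative values)
P <- 0.2      # proportion of sites showing a transition
Q <- 0.1      # proportion of sites showing a transversion
p <- P + Q    # overall proportion of differing sites (the "raw" distance)

# Jukes-Cantor: corrects for multiple hits, all substitutions equally likely
d_jc69 <- -3 / 4 * log(1 - 4 / 3 * p)

# Kimura two-parameter: separate rates for transitions and transversions
d_k80 <- -1 / 2 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))

c(raw = p, JC69 = d_jc69, K80 = d_k80)
```

Both corrected distances come out larger than the raw proportion, because the models account for substitutions that hit the same site more than once.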

### 4. Creating trees

**4.1.** Now we have many distance matrices, but how do we create phylogenetic trees from them? In this practice we will use the **Neighbour-Joining algorithm**, which is implemented in the **nj()** function of the ape package. Note that **Minimum Evolution** (**fastme.bal()**) and **BIONJ** (**bionj()**) are also available and can be used in a very similar manner.

In [None]:
t.prot<-nj(d.prot)
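
To see what Neighbour-Joining does internally, the sketch below carries out only its first step by hand: computing the Q criterion for every pair of taxa and picking the pair to join first. The five-taxon distance matrix is a made-up toy example, and the code is a base R illustration, not the ape implementation.

```r
# Toy symmetric distance matrix for five hypothetical taxa
taxa <- c("a", "b", "c", "d", "e")
d <- matrix(c(0,  5,  9,  9, 8,
              5,  0, 10, 10, 9,
              9, 10,  0,  8, 7,
              9, 10,  8,  0, 3,
              8,  9,  7,  3, 0),
            nrow = 5, byrow = TRUE, dimnames = list(taxa, taxa))

n <- nrow(d)
r <- rowSums(d)   # total distance of each taxon to all others

# Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j)
q <- (n - 2) * d - outer(r, r, "+")
diag(q) <- NA     # a taxon cannot be joined with itself

# The pair with the smallest Q value is joined first
pair <- which(q == min(q, na.rm = TRUE), arr.ind = TRUE)[1, ]
taxa[pair]
```

Note that the pair chosen here is not the pair with the smallest raw distance (that would be d and e); the Q criterion also takes into account how far each taxon is from all the others.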

**4.2.** We have now a tree in the **t.prot** variable. Let's get some information about it. How many species are represented on this tree? What are those? Is this a rooted tree?

In [None]:
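Ntip(t.prot)       # number of tips (species) on the tree
t.prot$Nnode       # number of internal nodes
t.prot$tip.label   # names of the species at the tips
is.rooted(t.prot)  # FALSE for an unrooted tree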

**4.3.** As you see, this tree is not rooted. That is a problem, because almost all further methods need a rooted tree. We can use the **outgroup method** to root this tree. From the list of included species we have seen that **species number 7 is Gallus gallus (chicken)**, which is the only bird sequence; all the others are from mammals. The bird is clearly an outgroup relative to the mammals, so we can use it to root the tree:

In [None]:
t.prot<-root(t.prot,outgroup=7,resolve.root=T)
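
The effect of outgroup rooting is easy to follow on a small tree. The sketch below builds a four-taxon toy tree from a Newick string (the tree and the names `toy` and `toy_rooted` are invented) and roots it on one tip, selected here by name rather than by number.

```r
library(ape)

# A toy unrooted tree: three branches meet at the basal node
toy <- read.tree(text = "(A:1,B:1,(C:1,D:1):1);")
Ntip(toy)         # number of tips
is.rooted(toy)    # FALSE

# Root on tip "D"; resolve.root = TRUE makes the root node bifurcating
toy_rooted <- root(toy, outgroup = "D", resolve.root = TRUE)
is.rooted(toy_rooted)   # TRUE
```

Selecting the outgroup by its label can be less error-prone than using its position, since the order of the tips depends on the input file.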

**4.4.** Let's finally see what the graphical representation of the tree looks like:

In [None]:
plot(t.prot)
nodelabels()

**4.5.** If we do not like the exact arrangement, we can rotate the branches around an internal node. The node numbers are the ones displayed by **nodelabels()** above.

In [None]:
t.prot<-rotate(t.prot,9)
plot(t.prot,main="Protein sequences")

### 5. Compare trees from different distance matrices

**5.1.** Now we have all the tools to create comparative plots and see the real differences between the distance calculations. To help the visual inspection, we will plot the trees on the same page and add a scale bar to each tree indicating the distance measure. Let's investigate the mathematical distances first:

In [None]:
par(mfrow=c(2,2))
plot(root(nj(d.rna.n),outgroup=7,resolve.root=T),main="mRNA sequences",sub="N")
add.scale.bar(length=10)
plot(root(nj(d.rna.ts),outgroup=7,resolve.root=T),main="mRNA sequences",sub="TS")
add.scale.bar(length=10)
plot(root(nj(d.rna.tv),outgroup=7,resolve.root=T),main="mRNA sequences",sub="TV")
add.scale.bar(length=10)
plot(root(nj(d.rna.raw),outgroup=7,resolve.root=T),main="mRNA sequences",sub="Raw")
add.scale.bar(length=0.05)
par(mfrow=c(1,1))

**5.2.** And the same for the evolutionary distances:

In [None]:
par(mfrow=c(2,2))
t.rna<-root(nj(d.rna.JC69),outgroup=7,resolve.root=T)
plot(t.rna,main="mRNA sequences",sub="JC69")
add.scale.bar(length=0.05)
plot(root(nj(d.rna.K80),outgroup=7,resolve.root=T),main="mRNA sequences",sub="K80")
add.scale.bar(length=0.05)
plot(root(nj(d.rna.F84),outgroup=7,resolve.root=T),main="mRNA sequences",sub="F84")
add.scale.bar(length=0.05)
plot(root(nj(d.rna.TN93),outgroup=7,resolve.root=T),main="mRNA sequences",sub="TN93")
add.scale.bar(length=0.05)
par(mfrow=c(1,1))

**5.3.** We can also compare two trees directly. For example, are the trees derived from the protein and mRNA sequences the same or not?

In [None]:
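# association matrix: each row links a tip of the first tree to the tip of the same name in the second tree
A<-matrix(t.rna$tip.label,nrow=7,ncol=2)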
cophyloplot(t.rna,t.prot,A,space=25,lty=2)
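
Besides the visual comparison, a numerical measure of topological difference can be useful. The sketch below uses **dist.topo()** from ape on two invented four-taxon trees; as far as I understand its default method, it counts the splits that are present in one tree but not in the other, so identical topologies give zero.

```r
library(ape)

# Two toy trees on the same four tips, grouped differently
tree_a <- read.tree(text = "((A,B),(C,D));")
tree_b <- read.tree(text = "((A,C),(B,D));")

# Topological distance of a tree to itself and to the other tree
d_same <- as.numeric(dist.topo(tree_a, tree_a))
d_diff <- as.numeric(dist.topo(tree_a, tree_b))

c(same = d_same, different = d_diff)
```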

### 6. Saving plots into files

**6.1.** If we want to save these plots into files, we have to use the graphics devices R offers. A device can be the monitor in front of us, or equally a PDF, JPEG, or TIFF file. For example, to save the last co-plot of the two trees into a PDF file, we call the **pdf()** function, after which all plotting commands draw into the file. We then close the file with the **dev.off()** function so that R releases it. After that it can be opened with other software.

In [None]:
pdf("IL6_cophyloplot.pdf",paper="a4")
cophyloplot(t.rna,t.prot,A,space=25,lty=2)
dev.off()
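
The open, draw, close pattern can be tried independently of the tree data. This minimal sketch writes a simple scatter plot into a PDF in R's temporary directory and checks that the file exists; the file name `device_demo.pdf` is an arbitrary example.

```r
# Target file in the temporary directory (hypothetical name)
out_file <- file.path(tempdir(), "device_demo.pdf")

pdf(out_file, paper = "a4")                     # open the PDF device
plot(1:10, main = "Drawn into the PDF device")  # this plot goes to the file
dev.off()                                       # close the device, release the file

file.exists(out_file)
```

If **dev.off()** is forgotten, the file usually stays incomplete and later plots keep going into it instead of appearing on screen.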

**6.2.** A more complex example:

In [None]:
pdf("IL6_trees.pdf",paper="a4")
par(mfrow=c(2,2))
plot(root(nj(d.rna.n),outgroup=7,resolve.root=T),main="mRNA sequences",sub="N")
add.scale.bar(length=10)
plot(root(nj(d.rna.raw),outgroup=7,resolve.root=T),main="mRNA sequences",sub="Raw")
add.scale.bar(length=0.05)
plot(root(nj(d.rna.JC69),outgroup=7,resolve.root=T),main="mRNA sequences",sub="JC69")
add.scale.bar(length=0.05)
plot(t.prot,main="Protein sequences",sub="Similarity")
add.scale.bar(length=0.05)
dev.off()

## Parsimony

This part of the exercise demonstrates the concept of parsimony. As you remember from the lecture, parsimony is used to evaluate how well a given tree fits an observed pattern of character states. We calculate the smallest number of changes (mutations) needed to explain the observed character states on the given tree. Remember, a character state can be a morphological feature, but also an amino acid at a sequence site. In this practice we will use traditional morphological characters, but the method works exactly the same way for nucleotide sequences or proteins.

This exercise enters all data directly into R, so we do not need any extra files. This part of the practice is based on "Introduction to Tree-Thinking in R, using phylogenies of tetrapods" by Nick Matzke from PhyloWiki.
See: http://phylo.wikidot.com/tree-thinking-with-r

## Exercises

### 7. Import phangorn package

**7.1.** We will use two libraries in this part: ape and phangorn. Package **ape** is used for
generating and plotting the phylogenetic tree, while **phangorn** is used for parsimony
related calculations and for reconstructing character states of ancient common ancestors.

In [None]:
library(phangorn)

### 8. Create tree

The first step is to create a tree that we will use as reference throughout this part. We will use a tree published previously in the literature.

**8.1.** We will use the Newick notation to provide the species, the tree topology, and the branch lengths as a string.

In [None]:
vert_plusDinos_newick_str <-  "(shark:471,
(tuna:432,(lungfish:416,(frog:352.6457263,
(((((((((kiwi:71.06707278,seagull:71.06707278):60,Confuciusornis:5)
:20,Archaeopteryx:5):7.1992512,Velociraptor:80):30,Brontosaurus:50):
25,crocodile:213.266324):38.01256425,turtle:251.2788883):20.2733676,
((wall_lizard:134.7070246,(snake:127.5268735,anole_lizard:127.5268735):
7.180151025):85.293,Tuatara:220):51.5522313):42.65854432,(platypus:187.9246145,
(opossum:147.3778793,human:147.3778793):40.5467352):126.2861857):38.43492607):
63.35427375):16):41);"

**8.2.** At this point, the variable **vert_plusDinos_newick_str** contains a string representing the tree. We will use the **read.tree()** function to convert this string to a tree object. Normally this function reads such information from a file, but its **text** argument lets us pass the string directly instead.

In [None]:
vert_plusDinos_phylo <- read.tree(file="", text=vert_plusDinos_newick_str)
vert_plusDinos_phylo$tip.label

**8.3.** As you can see, we have the tree object in the **vert_plusDinos_phylo** variable. This object has several parts, among others **tip.label**, which shows which species are present on this tree. Let’s visualize this tree to get an idea of the subject of our analysis:

In [None]:
plot(vert_plusDinos_phylo)

**8.4.** This might be interesting, but let’s beef up this image by adding the time scale as the horizontal axis, and some text to explain what we are looking at.

In [None]:
plot(vert_plusDinos_phylo)
title("Rooted phylogenetic tree\nof some living and fossil vertebrates")
axisPhylo()
mtext(side=1, text="Ma (millions of years ago)", line=3)

**8.5.** Questions to consider:

8.5.1. Can you see the traditional vertebrate groups on this tree (fishes, amphibians, reptiles, birds, mammals)? Which are monophyletic? Which are polyphyletic?

8.5.2. When someone says, "Actually, birds are a type of dinosaur", what do they mean?

### 9. Let’s investigate how feathered flight appeared in evolution.
We know that on our tree only three species are capable of this kind of flying: the seagull, Confuciusornis, which is an extinct bird-like animal, and the famous Archaeopteryx.

**9.1.** We have to encode this information in a data format that is usable in R. A **named vector** can hold values just like any other vector, but each value is associated with a name of our choice.

In [None]:
feathered_flight_data <- c(0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0)
names(feathered_flight_data) <- c("shark","tuna","lungfish",
                                  "frog","kiwi",
                                  "seagull","Confuciusornis",
                                  "Archaeopteryx","Velociraptor",
                                  "Brontosaurus","crocodile",
                                  "turtle","wall_lizard",
                                  "snake","anole_lizard",
                                  "Tuatara","platypus",
                                  "opossum","human")
feathered_flight_data['tuna']
feathered_flight_data['seagull']
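
Named vectors support a few more handy operations than single look-ups. The sketch below shows them on a small invented vector (`wings` and its species are not part of the dataset).

```r
# A small named vector: 1 = has wings, 0 = no wings (toy data)
wings <- c(bat = 1, whale = 0, sparrow = 1)

# Look up one value by name
wings["bat"]

# Look up several values at once, in the order we ask for
wings[c("sparrow", "whale")]

# Names of all elements that carry the state 1
winged <- names(wings)[wings == 1]
winged
```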

**9.2.** We will turn this named vector into an object of the **phyDat** class, which comes from the phangorn package, to use it in further analysis.

In [None]:
state_names <- c("no feathered flight", "feathered flight")
data_levels <- c(0, 1)
feathered_flight <- phyDat(feathered_flight_data, type = "USER", levels=data_levels)
feathered_flight

**9.3.** Let’s put together this information with the phylogenetic tree. Visualization helps to understand the evolutionary events causing the observed pattern.

In [None]:
state_colors <-  c("brown", "blue")
# for each tip of the tree, find the position of its name in the data vector
nexus_to_tree_tiporder <-  match(x=vert_plusDinos_phylo$tip.label,
                                 table=names(feathered_flight_data))
colors_to_plot <-  state_colors[1+as.numeric(feathered_flight_data)[nexus_to_tree_tiporder]]

plot(vert_plusDinos_phylo, label.offset=15)
axisPhylo()
mtext(side=1, text="Ma (millions of years ago)", line=3)

tiplabels(text=NULL, tip=1:length(vert_plusDinos_phylo$tip.label), col=colors_to_plot, bg=colors_to_plot, pch=21, cex=1)
legend(x="topleft", legend=state_names, fill=state_colors, cex=0.75)
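
The colours only end up on the right tips if the data are reordered to follow the tip order of the tree. How **match()** achieves this can be seen on a toy case where the two orders deliberately differ; all names in this sketch are invented.

```r
# Tip order as it might appear in a tree, and data stored in another order
tip_order  <- c("human", "frog", "tuna")
trait      <- c(tuna = 0, frog = 0, human = 1)
state_cols <- c("brown", "blue")

# For each tip, the position of its name in the data vector
idx <- match(tip_order, names(trait))
idx

# Trait values and colours, now in tip order
trait[idx]
tip_cols <- state_cols[1 + trait[idx]]
tip_cols
```

The first argument of **match()** is the set of names we want to look up, the second is where to look for them; swapping the two gives a different index that is only harmless when both orders happen to be identical.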

**9.4.** Looking at this figure, you should consider the following questions:

9.4.1. According to this phylogeny, and the distribution of the “feathered flight” character, how many times do you think feathered flight evolved? At which branches?

9.4.2. How many times do you think “feathered flight” was lost? At which branches?

**9.5.** As you remember from the lecture, character fit is a score that tells us how many times a character must have appeared or disappeared on a given tree to produce the observed pattern. Let’s calculate this for our tree and the feathered flight data using the **parsimony()** function from **phangorn**.

In [None]:
feathered_flight_fit <- parsimony(tree=vert_plusDinos_phylo,
                             data=feathered_flight,
                             method="fitch",
                             site="pscore")
feathered_flight_fit
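
The counting rule behind Fitch parsimony is short enough to write by hand. The sketch below is a base R illustration on an invented four-taxon tree stored as a nested list; the function `fitch_score` is hypothetical and is not how phangorn implements the calculation.

```r
# Fitch's rule: going from the tips to the root, a node gets the intersection
# of its children's state sets; if that is empty it gets the union instead,
# and one change is counted.
fitch_score <- function(node, states) {
  if (is.character(node)) {                 # a tip: its observed state, no cost
    return(list(set = states[[node]], score = 0))
  }
  left   <- fitch_score(node[[1]], states)
  right  <- fitch_score(node[[2]], states)
  common <- intersect(left$set, right$set)
  if (length(common) > 0) {
    list(set = common, score = left$score + right$score)
  } else {
    list(set = union(left$set, right$set),
         score = left$score + right$score + 1)
  }
}

# Toy tree ((A,B),(C,D)) and two character patterns
toy_tree  <- list(list("A", "B"), list("C", "D"))
clustered <- c(A = 1, B = 1, C = 0, D = 0)   # state 1 confined to one clade
scattered <- c(A = 1, B = 0, C = 1, D = 0)   # state 1 spread over both clades

score_clustered <- fitch_score(toy_tree, clustered)$score
score_scattered <- fitch_score(toy_tree, scattered)$score
c(clustered = score_clustered, scattered = score_scattered)
```

A character confined to one clade needs fewer changes than one scattered across the tree, which is what makes the score a measure of fit between character and tree.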

**9.6.** Questions to be answered:

9.6.1. What is the character fit score according to your previous visual inspection, and according to the calculation?

9.6.2. Are they the same or different? If they are different, can you explain why?

**9.7.** Let’s investigate the actual mutation events. For that, we will have to reconstruct the characters for the internal nodes of the tree and visualize them as we did previously:

In [None]:
vert_plusDinos_phylo <- makeNodeLabel(vert_plusDinos_phylo)
ancestral_states <- anc_pars(tree=vert_plusDinos_phylo, data=feathered_flight, type="MPR", cost=NULL, return="prob")

plotAnc(ancestral_states, i=1, col=state_colors, pos=NULL, cex=0.5)
axisPhylo()

titletxt <- paste0("Mapping of ancestral states under Fitch parsimony")
title(titletxt)
mtext(side=1, text="Ma (millions of years ago)", line=3)
legend(x="topleft", legend=state_names, fill=state_colors, cex=0.5)

parsimony_txt <- paste0("parsimony score\n(# of steps) = ", feathered_flight_fit)
text(x=-8, y=16, labels=parsimony_txt, pos=4, cex=0.7)

**9.8.** Questions we can answer from the plot:

9.8.1. Approximately when did feathered flight evolve? You can get an approximate minimum and maximum date.

9.8.2. Can we estimate when kiwi lost its ability to fly?

### 10. We will repeat this exercise using data on tetrapod viviparity: giving birth to live offspring.
Animals that lay eggs external to the body exhibit “oviparity”. Here, we want to explore whether live birth really is a defining characteristic of mammals.

**10.1.** In our dataset, only humans and opossums do not lay eggs. First, we will encode the data as a **named vector** again, then convert it to the **phyDat** class to use it for the character fit and ancestral state calculations:

In [None]:
viviparity_data <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1)
names(viviparity_data) <- c("shark","tuna","lungfish",
                            "frog","kiwi",
                            "seagull","Confuciusornis",
                            "Archaeopteryx","Velociraptor",
                            "Brontosaurus","crocodile",
                            "turtle","wall_lizard",
                            "snake","anole_lizard",
                            "Tuatara","platypus",
                            "opossum","human")

state_names <- c("oviparity", "viviparity")
data_levels <- c(0, 1)
viviparity <- phyDat(viviparity_data, type = "USER", levels=data_levels)
viviparity

**10.2.** It is time to calculate the character fit for viviparity and to reconstruct the ancestral states:

In [None]:
viviparity_fit <- parsimony(tree=vert_plusDinos_phylo,
                                  data=viviparity, method="fitch", site="pscore")
viviparity_fit
ancestral_states_vivi <- anc_pars(tree=vert_plusDinos_phylo,
                                        data=viviparity, type="MPR",
                                        cost=NULL, return="prob")

**10.3.** And let’s visualize the whole analysis again:

In [None]:
plotAnc(ancestral_states_vivi, i=1, col=state_colors, pos=NULL, cex=0.5)
axisPhylo()

titletxt <- paste0("Mapping of viviparity under Fitch parsimony")
title(titletxt)
mtext(side=1, text="Ma (millions of years ago)", line=3)
legend(x="topleft", legend=state_names, fill=state_colors, cex=0.5)

parsimony_txt_vivi <- paste0("parsimony score\n(# of steps) = ", viviparity_fit)
text(x=-8, y=16, labels=parsimony_txt_vivi, pos=4, cex=0.7)

**10.4.** Questions we want to answer:

10.4.1. How many times has this kind of viviparity evolved?

10.4.2. When did viviparity appear?

10.4.3. Is viviparity a common feature of all mammals? (What about poor platypus?)

10.4.4. Is viviparity a homology or a homoplasy on this tree?

10.4.5. What is the tree length of this given tree considering feathered flight and viviparity together?

### 11. Try to repeat the character fit calculation for the following characters:


**11.1.** Hair on the skin (as opposed to scales or feathers)

**11.2.** Having lungs for breathing

**11.3.** Having warm blood (Dinosaurs: Google and form hypothesis!)