# ETE3 demo for Soltis Lab March 2017

Let's look at the __[ETE Toolkit](http://etetoolkit.org/)__ for working with phylogenetic trees in Python. You can __[download ETE from here](http://etetoolkit.org/download/)__. For this demo, we'll mostly work through the __[ETE tutorial](http://etetoolkit.org/docs/latest/tutorial/index.html)__. <br>

## Thinking about trees generally
As an intro to trees, the tutorial has this to say:<br>
<div class="alert alert-block alert-info">"In bioinformatics, trees are the result of many analyses, such as phylogenetics or clustering. Although each case entails specific considerations, many properties remains constant among them. In this respect, ETE is a python toolkit that assists in the automated manipulation, analysis and visualization of any type of hierarchical trees. It provides general methods to handle and visualize tree topologies, as well as specific modules to deal with phylogenetic and clustering trees."
</div>

## Let's go...
Import ete3 and play with some trees:

In [1]:
import random
from ete3 import Tree

# Load a tree structure from a Newick string. The returned variable 't' is the root node of the tree.
t = Tree("(A:0.5,(B:1,(E:1,D:1):0.5):0.5);" )
print(t)


   /-A
--|
  |   /-B
   \-|
     |   /-E
      \-|
         \-D
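
To make the Newick notation less mysterious, here is a minimal plain-Python sketch of how such a string can be read into a nested structure. It is not how ETE parses Newick; the names `parse_newick` and `leaf_names` are made up for this illustration, and it assumes a well-formed string ending in `;` with only names and branch lengths (no quoting, comments or support values).

```python
def parse_newick(newick):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    text = "".join(newick.split())  # drop any whitespace
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(parse_node())
            pos += 1                      # consume ")"
        # Read the "name:length" label that follows a leaf or a closed clade
        start = pos
        while text[pos] not in ",();":
            pos += 1
        name, _, length = text[start:pos].partition(":")
        return (name, float(length) if length else None, children)

    return parse_node()


def leaf_names(node):
    """Return the tip names below a parsed node, left to right."""
    name, _, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


root = parse_newick("(A:0.5,(B:1,(E:1,D:1):0.5):0.5);")
print(leaf_names(root))   # ['A', 'B', 'E', 'D']
print(root[2][1][1])      # 0.5, the branch length above the (B,(E,D)) clade
```

Each pair of parentheses becomes one internal node, and the text after a closing parenthesis (here `:0.5`) belongs to that internal node.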


### get_common_ancestor
We'll come back to fancy graphical trees later, but for now, we have a decent representation of a tree and can start doing things with it.

We can find the sub-tree that is the common ancestor of two tips. For example, E and B. This is done a __[bit later](http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#find-the-first-common-ancestor)__ in the tutorial.

Remember that sub-trees are basically the same as trees, so we could get the leaves on that sub-tree with <code>get_leaves()</code>:

In [2]:
ancestor = t.get_common_ancestor("E", "B")
print(ancestor)
descendants = ancestor.get_leaves()
print(descendants)


   /-B
--|
  |   /-E
   \-|
      \-D
[Tree node 'B' (0x11305a55), Tree node 'E' (-0x7fffffffeec4ce3f), Tree node 'D' (0x113af516)]
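
Conceptually, finding a common ancestor is just walking up from each tip until the two paths meet. The sketch below shows that idea in plain Python on the same four-tip tree. It is an illustration only, not ETE's implementation, and the internal node names `root`, `n1` and `n2` are invented for the example.

```python
# Child -> parent map for the demo tree (A,(B,(E,D))).
# "root", "n1" and "n2" are made-up labels for the unnamed internal nodes.
PARENT = {"A": "root", "n1": "root", "B": "n1", "n2": "n1", "E": "n2", "D": "n2"}


def path_to_root(node):
    """Return the nodes from `node` up to and including the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def common_ancestor(a, b):
    """Return the first node on a's path to the root that is also on b's path."""
    on_b_path = set(path_to_root(b))
    for node in path_to_root(a):
        if node in on_b_path:
            return node


print(common_ancestor("E", "B"))  # n1, the clade holding B, E and D
print(common_ancestor("E", "D"))  # n2, the clade holding only E and D
```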


### Searching
We can also search in trees, or test if a taxon is in a tree:

In [3]:
print(t.get_leaves_by_name("B"))
for taxon in ["A","X"]:
    if t.get_leaves_by_name(taxon):
        print("%s is in the tree" %(taxon))
    else:
        print("%s is not in the tree" %(taxon))

[Tree node 'B' (0x11305a55)]
A is in the tree
X is not in the tree


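The more general <code>search_nodes()</code> method looks at every node (tips and internal nodes) and matches on node attributes, here the name: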

In [4]:
my_node = t.search_nodes(name = "A")
print (my_node)

[Tree node 'A' (0x113b31c5)]


### Custom searching functions
For more complex searches, you will need to make your own search function. Here's the one from __[this part of the tutorial](http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#search-all-nodes-matching-a-given-criteria)__, modified a bit to print the nodes found (and re-naming the tree to t2 to not get rid of my tree).

In [5]:
def search_by_size(node, size):
    "Finds nodes with a given number of leaves"
    matches = []
    for n in node.traverse():
        if len(n) == size:
            matches.append(n)
    return matches

t2 = Tree()
t2.populate(40)
# returns nodes containing 6 leaves
subtrees= search_by_size(t2, size=6)
for node in subtrees:
    print (node)


      /-aaaaaaaaad
   /-|
  |  |   /-aaaaaaaaae
  |   \-|
--|      \-aaaaaaaaaf
  |
  |   /-aaaaaaaaag
   \-|
     |   /-aaaaaaaaah
      \-|
         \-aaaaaaaaai

   /-aaaaaaaaao
--|
  |   /-aaaaaaaaap
   \-|
     |   /-aaaaaaaaaq
      \-|
        |   /-aaaaaaaaar
         \-|
           |   /-aaaaaaaaas
            \-|
               \-aaaaaaaaat


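Let's try a search function to find branches at or under a set value. Note that the whole tree is returned as well, because the root node's <code>dist</code> also falls at or below the cutoff: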

In [6]:
def find_short_branches(node, length):
    "Finds nodes whose branch length is at or under the given length"
    matches=[]
    for n in node.traverse():
        if n.dist <= length:
            matches.append(n)
    return matches
subtrees= find_short_branches(t, 0.5)
for node in subtrees:
    print (node)


   /-A
--|
  |   /-B
   \-|
     |   /-E
      \-|
         \-D

--A

   /-B
--|
  |   /-E
   \-|
      \-D

   /-E
--|
   \-D


### Shortcuts
<div class="alert alert-block alert-info">"Finally, ETE implements a built-in method to find the first node matching a given name, which is one of the most common tasks needed for tree analysis. This can be done through the operator & (AND). Thus, TreeNode&”A” will always return the first node whose name is “A” and that is under the tree “MyTree”. The syntaxis may seem confusing, but it can be very useful in some situations."</div>

In [7]:
t = Tree("((H:0.3,I:0.1):0.5, A:1, (B:0.4,(C:1,(J:1, (F:1, D:1):0.5):0.5):0.5):0.5);")
# Get the node D in a very simple way
D = t&"D"
# Get the path from D to the root
node = D
path = []
while node.up:
  path.append(node)
  node = node.up
print (t)
# I subtract the D node itself from the total number of visited nodes
print ("There are", len(path)-1, "nodes between D and the root")
# Using parentheses you can use by-operand search syntax as a node
# instance itself
Csparent= (t&"C").up #MAG: Changed name of variable for consistency
Bsparent= (t&"B").up
Jsparent= (t&"J").up
# I check if nodes belong to certain partitions
print ("It is", Csparent in Bsparent, "that C's parent is under B's ancestor")
print ("It is", Csparent in Jsparent, "that C's parent is under J's ancestor")



      /-H
   /-|
  |   \-I
  |
--|--A
  |
  |   /-B
   \-|
     |   /-C
      \-|
        |   /-J
         \-|
           |   /-F
            \-|
               \-D
There are 4 nodes between D and the root
It is True that C's parent is under B's ancestor
It is False that C's parent is under J's ancestor


## Checking monophyly


In [8]:
t =  Tree("((((((a, e), i), o),h), u), ((f, g), j));")
print (t)



                  /-a
               /-|
            /-|   \-e
           |  |
         /-|   \-i
        |  |
      /-|   \-o
     |  |
   /-|   \-h
  |  |
  |   \-u
--|
  |      /-f
  |   /-|
   \-|   \-g
     |
      \-j


In [9]:
# We can check how, indeed, all vowels are not monophyletic in the
# previous tree, but polyphyletic (a foreign label breaks its monophyly)
print (t.check_monophyly(values=["a", "e", "i", "o", "u"], target_attr="name"))


(False, 'polyphyletic', {Tree node 'h' (0x113b50b7)})


In [10]:
# however, the following set of vowels are monophyletic
print (t.check_monophyly(values=["a", "e", "i", "o"], target_attr="name"))


(True, 'monophyletic', set())


In [11]:
# A special case of polyphyly, called paraphyly, is also used to
# define certain type of grouping. See this wikipedia article for
# disambiguation: http://en.wikipedia.org/wiki/Paraphyly
print (t.check_monophyly(values=["i", "o"], target_attr="name"))

(False, 'paraphyletic', {Tree node 'a' (0x113b50a2), Tree node 'e' (0x113b5094)})
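
The core of a monophyly test can be sketched in plain Python: find the smallest clade that contains all the target tips, then see whether it also contains anything else. The code below is only an illustration on nested tuples, not ETE's implementation. The names `leaves`, `smallest_clade` and `is_monophyletic` are invented here, it assumes every target is actually in the tree, and it does not try to tell paraphyly from polyphyly.

```python
# The vowel tree from above, written as nested tuples.
TREE = (((((("a", "e"), "i"), "o"), "h"), "u"), (("f", "g"), "j"))


def leaves(node):
    """Return the set of tip names under a node."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def smallest_clade(node, targets):
    """Return the tip set of the smallest clade containing all targets."""
    if not isinstance(node, str):
        for child in node:
            if targets <= leaves(child):
                return smallest_clade(child, targets)
    return leaves(node)


def is_monophyletic(tree, targets):
    """Return (True/False, set of foreign tips inside the smallest clade)."""
    targets = set(targets)
    foreign = smallest_clade(tree, targets) - targets
    return (not foreign, foreign)


print(is_monophyletic(TREE, ["a", "e", "i", "o", "u"]))  # (False, {'h'})
print(is_monophyletic(TREE, ["a", "e", "i", "o"]))       # (True, set())
```

The foreign tips found this way match the third element of the tuples ETE printed above.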


In [12]:
t =  Tree("((((((4, e), i), o),h), u), ((3, 4), (i, june)));")
# we annotate the tree using external data
colors = {"a":"red", "e":"green", "i":"yellow",
          "o":"black", "u":"purple", "4":"green",
          "3":"yellow", "1":"white", "5":"red",
          "june":"yellow"}
for leaf in t:
    leaf.add_features(color=colors.get(leaf.name, "none"))
print (t.get_ascii(attributes=["name", "color"], show_internal=False))




                  /-4, green
               /-|
            /-|   \-e, green
           |  |
         /-|   \-i, yellow
        |  |
      /-|   \-o, black
     |  |
   /-|   \-h, none
  |  |
  |   \-u, purple
--|
  |      /-3, yellow
  |   /-|
  |  |   \-4, green
   \-|
     |   /-i, yellow
      \-|
         \-june, yellow


In [13]:
print ("Green-yellow clusters:")
# And obtain clusters exclusively green and yellow
for node in t.get_monophyletic(values=["green", "yellow"], target_attr="color"):
   print (node.get_ascii(attributes=["color", "name"], show_internal=False))


Green-yellow clusters:

      /-green, 4
   /-|
--|   \-green, e
  |
   \-yellow, i

      /-yellow, 3
   /-|
  |   \-green, 4
--|
  |   /-yellow, i
   \-|
      \-yellow, june


## Node Annotation
<div class="alert alert-block alert-info">"Every node contains three basic attributes: name (TreeNode.name), branch length (TreeNode.dist) and branch support (TreeNode.support). These three values are encoded in the newick format. However, any extra data could be linked to trees. This is called tree annotation.

The TreeNode.add_feature() and TreeNode.add_features() methods allow to add extra attributes (features) to any node. The first allows to add one one feature at a time, while the second can be used to add many features with the same call.

Once extra features are added, you can access their values at any time during the analysis of a tree. To do so, you only need to access to the TreeNode.feature_name attributes.

Similarly, TreeNode.del_feature() can be used to delete an attribute."</div>



In [14]:
# Creates a tree
t = Tree( '((H:0.3,I:0.1):0.5, A:1, (B:0.4,(C:0.5,(J:1.3, (F:1.2, D:0.1):0.5):0.5):0.5):0.5);' )

# Let's locate some nodes using the get common ancestor method
ancestor=t.get_common_ancestor("J", "F", "C")
# the search_nodes method (I take only the first match )
A = t.search_nodes(name="A")[0]
# and using the shortcut for finding nodes by name
C= t&"C"
H= t&"H"
I= t&"I"

# Let's now add some custom features to our nodes. add_features can be
# used to add many features at the same time.
C.add_features(vowel=False, confidence=1.0)
A.add_features(vowel=True, confidence=0.5)
ancestor.add_features(nodetype="internal")

# Or, using the oneliner notation
(t&"H").add_features(vowel=False, confidence=0.2)

# But we can automate this (note that this will overwrite the previous
# values). Iterate over leaves only: internal nodes have an empty name,
# and "" in "AEIOU" is True, so t.traverse() would flag them as vowels.
for leaf in t.iter_leaves():
   if leaf.name in "AEIOU":
      leaf.add_features(vowel=True, confidence=random.random())
   else:
      leaf.add_features(vowel=False, confidence=random.random())

# Now we use this information to analyze the tree.
print ("This tree has", len(t.search_nodes(vowel=True)), "vowel nodes")
print ("Which are", [leaf.name for leaf in t.iter_leaves() if leaf.vowel==True])

This tree has 2 vowel nodes
Which are ['I', 'A']


In [16]:
# But features may refer to any kind of data, not only simple
# values. For example, we can calculate some values and store them
# within nodes.
#
# Let's detect nodes under "ancestor" with distance higher than
# 1. Note that I'm traversing a subtree which starts from "ancestor"
matches = [leaf for leaf in ancestor.traverse() if leaf.dist>1.0]

# And save this pre-computed information into the ancestor node
ancestor.add_feature("long_branch_nodes", matches)

# Prints the precomputed nodes
print ("These are nodes under ancestor with long branches", \
   [n.name for n in ancestor.long_branch_nodes])

# We can also use the add_feature() method to dynamically add new features.
label = input("custom label:")
value = input("custom label value:")
ancestor.add_feature(label, value)
print ("Ancestor has now the [", label, "] attribute with value [", value, "]")

These are nodes under ancestor with long branches ['J', 'F']
custom label:Test
custom label value:45
Ancestor has now the [ Test ] attribute with value [ 45 ]


## Comparing Trees



In [21]:
t1 = Tree('(((a,b),c), ((e, f), g));')
t2 = Tree('(((a,c),b), ((e, f), g));')

# Note: I changed this from the ETE tutorial because my version of ETE3 returns 7 values.
# I think these names are correct based on the docs, but it's hard to say for sure...
rf, rf_max, common_attrs, edges_t1, edges_t2, discarded_edges_t1, discarded_edges_t2 = t1.robinson_foulds(t2)
print (t1, t2)
print ("RF distance is %s over a total of %s" %(rf, rf_max))
print ("Partitions in tree1 that were not found in tree2:", edges_t1 - edges_t2)
print ("Partitions in tree2 that were not found in tree1:", edges_t2 - edges_t1)



         /-a
      /-|
   /-|   \-b
  |  |
  |   \-c
--|
  |      /-e
  |   /-|
   \-|   \-f
     |
      \-g 
         /-a
      /-|
   /-|   \-c
  |  |
  |   \-b
--|
  |      /-e
  |   /-|
   \-|   \-f
     |
      \-g
RF distance is 2 over a total of 8
Partitions in tree1 that were not found in tree2: {('a', 'b')}
Partitions in tree2 that were not found in tree1: {('a', 'c')}
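
To see where the RF distance of 2 comes from, here is a plain-Python sketch for rooted trees written as nested tuples. It lists the clades of each tree as sorted tuples of tip names and counts the clades found in only one of them. This is an illustration of the idea, not ETE's code; the helper name `clades` is invented here, and the sketch skips the maximum value and the unrooted case.

```python
def clades(tree):
    """Return the set of clades with more than one tip, as sorted name tuples."""
    found = set()

    def collect(node):
        if isinstance(node, str):
            return (node,)
        members = tuple(sorted(name for child in node for name in collect(child)))
        found.add(members)
        return members

    collect(tree)
    return found


T1 = ((("a", "b"), "c"), (("e", "f"), "g"))
T2 = ((("a", "c"), "b"), (("e", "f"), "g"))

c1, c2 = clades(T1), clades(T2)
rf = len(c1 ^ c2)      # clades present in exactly one of the two trees

print(rf)              # 2
print(c1 - c2)         # {('a', 'b')}  only in T1
print(c2 - c1)         # {('a', 'c')}  only in T2
```

The two trees differ only in which pair groups first among a, b and c, so each contributes exactly one unshared clade.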
