- a distance matrix must:
    1) be symmetric
    2) be non-negative
    3) satisfy the triangle inequality
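
The three conditions above can be checked mechanically. A minimal sketch, assuming the matrix is a plain list of lists; the helper name `is_distance_matrix` is hypothetical and not part of the original notes:

```python
def is_distance_matrix(D):
    """True if D is symmetric, non-negative and satisfies the triangle inequality."""
    n = len(D)
    for i in range(n):
        for j in range(n):
            # conditions 1 and 2: symmetry and non-negativity
            if D[i][j] != D[j][i] or D[i][j] < 0:
                return False
            # condition 3: going through any k is never shorter than D[i][j]
            for k in range(n):
                if D[i][j] > D[i][k] + D[k][j]:
                    return False
    return True

D = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
print(is_distance_matrix(D))
```

The check is cubic in n, which is fine for the small matrices used in these notes.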
    
- __tree__: connected graph without cycles

- __leaves__: nodes having degree 1

- __internal nodes__: nodes with degree > 1

- __parent(j)__: for a leaf j, the only node connected to j by an edge

- __limb__ : an edge connecting a leaf to its parent

Prove:
- every tree with at least 2 nodes has at least 2 leaves

- every tree with n nodes has n-1 edges

- __rooted tree__: a tree that has one node called __root__ and the edges in the tree automatically inherit an implicit orientation away from the root

- __unrooted tree__: tree without a designated root
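
The definitions of leaves and internal nodes, and the two claims to prove, can be illustrated on a small example. A sketch, assuming the tree is given as a list of `(node, node, weight)` triples; the tree is the 6-node sample that appears in the networkx cell of these notes:

```python
from collections import defaultdict

# each undirected edge listed once: (node, node, weight)
edges = [(0, 4, 11), (1, 4, 2), (2, 5, 6), (3, 5, 7), (4, 5, 4)]

# degree of a node = number of edges touching it
degree = defaultdict(int)
for u, v, _ in edges:
    degree[u] += 1
    degree[v] += 1

leaves = sorted(node for node, d in degree.items() if d == 1)
internal = sorted(node for node, d in degree.items() if d > 1)

print(leaves)                          # [0, 1, 2, 3]
print(internal)                        # [4, 5]
print(len(edges) == len(degree) - 1)   # n nodes, n-1 edges
print(len(leaves) >= 2)                # at least 2 leaves
```

This only checks the claims on one tree; it is not a proof.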

In this chapter, we define the length of a path in a tree as the sum of the lengths of its edges (rather than the number of edges on the path). As a result, the evolutionary distance between two present-day species corresponding to leaves i and j in a tree T is equal to the length of the unique path connecting i and j, denoted $d_{i, j}(T)$.

__Distance Between Leaves Problem__

Compute the distances between leaves in a weighted tree.

__Given__: An integer n followed by the adjacency list of a weighted tree with n leaves.


__Return__: A space-separated n x n matrix $(d_{i, j})$, where $d_{i, j}$ is the length of the path between leaves i and j.

In [1]:
import queue
#7a dist between leaves
# https://github.com/egeulgen/Bioinformatics_Textbook_Track/blob/master/solutions/BA7A.py
class Node:
    def __init__(self, label):
        self.label = label
        self.linked_nodes = set()

class Tree:
    def __init__(self):
        self.nodes_dict = {}
    
    def add_node(self, label):
        if label in self.nodes_dict:
            return self.nodes_dict[label]
    
        node = Node(label)
        self.nodes_dict[label] = node
        return node

    # function to construct tree from adj list
    def construct_tree(self, adj_list):
        for line in adj_list:
            labels, weight = line.split(':')
            weight = int(weight)
            label1, label2 = [int(x) for x in labels.split('->')]
            
            node1 = self.add_node(label1)
            node2 = self.add_node(label2)
            
            node1.linked_nodes.add((label2, weight))
            node2.linked_nodes.add((label1, weight))
            
    def distance(self, label_a, label_b):
        visited = [False] * len(self.nodes_dict)
        distance = [0] * len(self.nodes_dict)
        
        Q = queue.Queue()
        distance[label_a] = 0
        
        Q.put(label_a)
        visited[label_a] = True
        while not Q.empty():
            x = Q.get()
            for label2, weight in self.nodes_dict[x].linked_nodes:
                if not visited[label2]:
                    distance[label2] = distance[x] + weight
                    Q.put(label2)
                    visited[label2] = True
        return distance[label_b]
    
    def distance_matrix_between_leaves(self, n_leaves):
        distance_mat = [[0 for _ in range(n_leaves)] for _ in range(n_leaves)]
        for i in range(n_leaves):
            for j in range(n_leaves):
                distance_mat[i][j] = self.distance(i, j)
        return distance_mat
        

In [8]:
test_file = 'rosalind_ba7a.txt'
# read the whole file; the context manager closes it afterwards
with open(test_file, 'r') as f:
    n_adj_list = f.read().splitlines()

# first line is the number of leaves, the rest is the adjacency list
n = int(n_adj_list[0])
adj_list = n_adj_list[1:]

t = Tree()
t.construct_tree(adj_list)
result = t.distance_matrix_between_leaves(n)
# for row in result:
#     print(" ".join(map(str, row)))

In [53]:
import networkx as nx
adj_list_nx = []
for weighted_edges in adj_list:
    node_1 = int(weighted_edges.split('->')[0])
    node_2 = int(weighted_edges.split(':')[0].split('->')[1])
    weight = int(weighted_edges.split(':')[1])
    adj_list_nx.append((node_1, node_2, weight))
#print(adj_list_nx[:5])

test_edges = [(0, 4, 11), (1, 4, 2), (2, 5, 6), (3, 5, 7), (4, 0, 11), (4, 1, 2), (4, 5, 4), 
             (5, 4, 4), (5, 3, 7), (5, 2, 6)]
graph = nx.DiGraph()

# add weighted edges
graph.add_weighted_edges_from(test_edges)
#paths = sorted(nx.all_simple_paths(graph))

# get leaves
leaves = [x for x in graph.nodes() if graph.out_degree(x) == 1 and graph.in_degree(x) ==1]

# create edges from leaves
edges_from_leaves = list(set([(i, j) for i in leaves for j in leaves]))
paths_from_leaves = {}
for edges in edges_from_leaves:
    paths_from_leaves[edges] = list(nx.all_simple_edge_paths(graph, source = edges[0], target = edges[1]))

weights_from_leaves = {edges: 0 for edges in paths_from_leaves}
for edges in paths_from_leaves:
    if paths_from_leaves[edges] == []:
        weights_from_leaves[edges] = 0
    else:
        connecting_edges = paths_from_leaves[edges][0]
        for edge in connecting_edges:
            weights_from_leaves[edges] += graph.get_edge_data(edge[0], edge[1])['weight']
weights_from_leaves


{(0, 0): 0,
 (0, 1): 13,
 (0, 2): 21,
 (0, 3): 22,
 (1, 0): 13,
 (1, 1): 0,
 (1, 2): 12,
 (1, 3): 13,
 (2, 0): 21,
 (2, 1): 12,
 (2, 2): 0,
 (2, 3): 13,
 (3, 0): 22,
 (3, 1): 13,
 (3, 2): 13,
 (3, 3): 0}

A weighted unrooted tree T fits a distance matrix D if $d_{i,j}(T) = D_{i,j}$ for every pair of leaves $i$ and $j$

a distance matrix is __additive__ if there exists a tree that fits this matrix and non-additive otherwise
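
Whether a given tree fits a given matrix can be tested directly from the definition. A minimal sketch, assuming leaves are labeled 0..n-1 and the tree is stored as a dict of `(neighbor, weight)` lists; the names `leaf_distances` and `fits` are hypothetical:

```python
def leaf_distances(adj, n_leaves):
    """Distances between all pairs of leaves; adj maps node -> [(neighbor, weight)]."""
    dist = []
    for src in range(n_leaves):
        d = {src: 0}
        stack = [src]
        # a tree has a unique path to every node, so one traversal is enough
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in d:
                    d[v] = d[u] + w
                    stack.append(v)
        dist.append([d[t] for t in range(n_leaves)])
    return dist

def fits(adj, n_leaves, D):
    """True if d_{i,j}(T) == D[i][j] for every pair of leaves."""
    return leaf_distances(adj, n_leaves) == D

edges = [(0, 4, 11), (1, 4, 2), (2, 5, 6), (3, 5, 7), (4, 5, 4)]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, []).append((v, w))
    adj.setdefault(v, []).append((u, w))

D = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
print(fits(adj, 4, D))
```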

a path in a tree is __non-branching__ if every node other than the beginning and ending node of the path has degree equal to 2

a non-branching path is __maximal__ if it is not a subpath of an even longer non-branching path

a __simple tree__ is a tree with no nodes of degree 2

--> If a matrix is additive, then there exists a _unique_ simple tree fitting this matrix

Denote $Tree(D)$ as the simple tree fitting the additive distance matrix D

* Prove that every simple tree with n leaves has at most n-2 internal nodes
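
A quick check of the simple-tree definition and of the n-2 bound on an example. A sketch, assuming an edge-list representation; `node_degrees` and `is_simple` are hypothetical helper names:

```python
def node_degrees(edges):
    """Map each node to its degree, given (node, node, weight) triples."""
    degree = {}
    for u, v, _ in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree

def is_simple(edges):
    """A tree is simple if no node has degree 2."""
    return all(d != 2 for d in node_degrees(edges).values())

sample = [(0, 4, 11), (1, 4, 2), (2, 5, 6), (3, 5, 7), (4, 5, 4)]
path = [(0, 1, 3), (1, 2, 4)]  # node 1 has degree 2

deg = node_degrees(sample)
n_leaves = sum(1 for d in deg.values() if d == 1)
n_internal = sum(1 for d in deg.values() if d > 1)

print(is_simple(sample), is_simple(path))
print(n_internal <= n_leaves - 2)
```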

__Distance-Based Phylogeny Problem:__

Reconstruct an evolutionary tree fitting a distance matrix.

__Input__: A distance matrix.

__Output__: A tree fitting this distance matrix.

a natural first step in solving this problem would be to assume that the two closest species with respect to the distance matrix D correspond to __neighbors__ in Tree(D), i.e. that the minimum value of $D_{i, j}$ corresponds to leaves i and j having the same parent

the __minimum element__ of a matrix refers here to its minimum __off-diagonal__ element, i.e. the minimum of $D_{i, j}$ over all $i \neq j$

__Theorem__: every simple tree with at least three nodes has a pair of neighboring leaves (proof on p. 12)

for neighboring leaves i and j sharing a parent node m, for every other leaf k in the tree:

$d_{k, m} = \frac{D_{i,k} + D_{j, k} - D_{i, j}}{2}$

in the case when deg(m) = 3, removing leaves i and j from the tree turns m into a leaf and thus reduces the total number of leaves by one --> this is equivalent to removing rows i and j as well as columns i and j from D, then adding a new row and column corresponding to their parent m, where the distances from m to the other leaves are computed according to the above formula
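
The reduction step can be sketched as follows, assuming i and j are already known to be neighbors. The helper name `merge_neighbors` is hypothetical, and placing the parent m in the last row and column is a choice made for this example only:

```python
def merge_neighbors(D, i, j):
    """Drop rows/columns i and j from D and append a row/column for their parent m."""
    keep = [k for k in range(len(D)) if k not in (i, j)]
    # d_{k,m} = (D_{i,k} + D_{j,k} - D_{i,j}) / 2 for every remaining leaf k
    to_parent = [(D[i][k] + D[j][k] - D[i][j]) / 2 for k in keep]
    reduced = [[D[a][b] for b in keep] + [to_parent[idx]]
               for idx, a in enumerate(keep)]
    reduced.append(to_parent + [0])
    return reduced

D = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
# leaves 0 and 1 share parent 4 in the sample tree (0->4:11, 1->4:2)
print(merge_neighbors(D, 0, 1))
```

For leaf 2 the formula gives (21 + 12 - 13) / 2 = 10, which agrees with the path 2->5->4 of length 6 + 4 in the sample tree.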

--> recursive algo for the distance-based phylogeny problem:
- find a pair of neighboring leaves i and j by selecting the min $D_{i, j}$ in the distance matrix

- replace i and j with their parent, and recompute the distances from this parent to all other leaves as described above

- solve the distance-based phylogeny problem for the smaller tree

- add the previously removed leaves i and j back to the tree

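yet this will fail: the minimum element of D does not necessarily correspond to a pair of neighboring leaves in Tree(D)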
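
The 4 x 4 matrix used in these notes is itself a counterexample. A small sketch; the `parent` dict is read off the sample tree (edges 0->4, 1->4, 2->5, 3->5) and is written out by hand here:

```python
D = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
parent = {0: 4, 1: 4, 2: 5, 3: 5}  # parents of the leaves in the tree fitting D

n = len(D)
# minimum off-diagonal element together with its position
value, i, j = min((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))

print(value, i, j)             # 12 1 2
print(parent[i] == parent[j])  # False
```

The closest pair is (1, 2), but leaf 1 hangs off node 4 and leaf 2 hangs off node 5, so merging them would build the wrong tree.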

rather than looking for a pair of neighbors in Tree(D), we will instead reduce the size of the tree by trimming its leaves one at a time

- given a leaf j in a tree, denote $LIMBLENGTH(j)$ as the length of the limb connecting j with its parent

- edges that aren't limbs must connect two internal nodes and are called __internal edges__

__Limb length theorem__: Given an additive matrix D and a leaf j, 
$LIMBLENGTH(j) = \min_{i, k} \frac{D_{i, j} + D_{j, k} - D_{i, k}}{2}$, where the minimum is taken over all pairs of leaves i and k other than j

__Limb Length Problem__

Find the limb length for a leaf in a tree.

__Given__: An integer n, followed by an integer j between 0 and n - 1, followed by a space-separated additive distance matrix D (whose elements are integers).

__Return__: The limb length of the leaf in Tree(D) corresponding to row j of this distance matrix (use 0-based indexing).

In [8]:
# 7b find limb length

def LimbLength(j, dist_mat):
    other_leaves = [i for i in range(len(dist_mat)) if i != j]
    
    limb_length = []
    
    # try every pair of distinct leaves i, k (both different from j)
    for idx_i in range(len(other_leaves) - 1):
        for idx_k in range(idx_i + 1, len(other_leaves)):
            i = other_leaves[idx_i]
            k = other_leaves[idx_k]
            limb_length.append((dist_mat[i][j] + dist_mat[j][k] - dist_mat[i][k])/2)

    return min(limb_length)

In [10]:
n = 4
j = 1
dist_mat = [[0, 13, 21, 22], [13, 0,12,13], [21,12,0,13], [22,13,13,0]]

LimbLength(j, dist_mat)

2.0

In [12]:
test_file = 'rosalind_ba7b.txt'
n_j_dist_mat = open(test_file, 'r').read().splitlines()
n = int(n_j_dist_mat[0])
j = int(n_j_dist_mat[1])
dist_mat = [[int(x) for x in line.split()] for line in n_j_dist_mat[2:]]

LimbLength(j, dist_mat)

289.0

__Additive Phylogeny Problem__

Construct the simple tree fitting an additive matrix.

__Given__: n and a tab-delimited n x n additive matrix.

__Return__: A weighted adjacency list for the simple tree fitting this matrix.

Note on formatting: The adjacency list must have consecutive integer node labels starting from 0. The n leaves must be labeled 0, 1, ..., n-1 in order of their appearance in the distance matrix. Labels for internal nodes may be labeled in any order but must start from n and increase consecutively.
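
A sketch of how such an adjacency list could be written out. It assumes the `a->b:w` line format that the parsing code earlier in these notes reads, and that each undirected edge is listed in both directions; `format_adjacency_list` is a hypothetical helper:

```python
def format_adjacency_list(edges):
    """Turn (node, node, weight) triples into sorted 'a->b:w' lines, both directions."""
    directed = [(u, v, w) for u, v, w in edges] + [(v, u, w) for u, v, w in edges]
    return [f"{u}->{v}:{w}" for u, v, w in sorted(directed)]

# leaves are 0..3, internal nodes start at n = 4
edges = [(0, 4, 11), (1, 4, 2), (2, 5, 6), (3, 5, 7), (4, 5, 4)]
for line in format_adjacency_list(edges):
    print(line)
```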

In [None]:
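def AdditivePhylogeny(D, n):
    # partial: only the base case and the "bald" matrix step (0-based indexing)
    if n == 2:
        # the tree is a single edge between leaves 0 and 1
        return D[0][1]

    # limb length of the last leaf, computed on the leading n x n block of D
    limbLength = LimbLength(n - 1, [row[:n] for row in D[:n]])
    # shorten the limb of leaf n-1 to zero, keeping D symmetric
    for j in range(n - 1):
        D[j][n - 1] = D[j][n - 1] - limbLength
        D[n - 1][j] = D[j][n - 1]
    # remaining steps (attachment point, recursion, re-attaching the leaf) are not implemented here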
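
The cell above stops after the limb is shortened. One possible way to complete the recursion is sketched below. Assumptions: the matrix is additive with positive edge weights, the tree is stored as a dict of dicts `{node: {neighbor: weight}}`, and all function names are hypothetical. It follows the usual trim, recurse, re-attach scheme rather than any particular reference solution.

```python
from itertools import combinations

def limb_length(D, j, n):
    """Limb length of leaf j, using only the leading n x n block of D."""
    others = [x for x in range(n) if x != j]
    return min((D[i][j] + D[j][k] - D[i][k]) / 2 for i, k in combinations(others, 2))

def find_path(tree, src, dst):
    """Nodes on the unique path from src to dst (iterative DFS)."""
    stack = [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst:
            return path
        for nb in tree[node]:
            if nb not in path:
                stack.append((nb, path + [nb]))
    return None

def additive_phylogeny(D):
    """Return the tree as {node: {neighbor: weight}}; internal labels start at len(D)."""
    D = [row[:] for row in D]   # work on a copy, the caller's matrix is untouched
    next_label = [len(D)]       # mutable counter for new internal nodes

    def build(n):
        if n == 2:
            return {0: {1: D[0][1]}, 1: {0: D[0][1]}}
        last = n - 1
        limb = limb_length(D, last, n)
        for j in range(last):           # "bald" matrix: shorten the limb to 0
            D[j][last] -= limb
            D[last][j] = D[j][last]
        # leaves i, k whose connecting path passes through the attachment point
        i, k = next((i, k) for i, k in combinations(range(last), 2)
                    if D[i][k] == D[i][last] + D[last][k])
        x = D[i][last]                  # distance from i to the attachment point
        tree = build(n - 1)
        path = find_path(tree, i, k)
        travelled = 0
        for a, b in zip(path, path[1:]):
            w = tree[a][b]
            if travelled + w == x:      # attachment point is an existing node
                v = b
                break
            if travelled + w > x:       # attachment point lies inside edge (a, b)
                v = next_label[0]
                next_label[0] += 1
                del tree[a][b], tree[b][a]
                tree[v] = {a: x - travelled, b: travelled + w - x}
                tree[a][v] = x - travelled
                tree[b][v] = travelled + w - x
                break
            travelled += w
        tree[v][last] = limb            # re-attach the trimmed leaf
        tree[last] = {v: limb}
        return tree

    return build(len(D))

D = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
tree = additive_phylogeny(D)
for u in sorted(tree):
    for v in sorted(tree[u]):
        print(f"{u}->{v}:{int(tree[u][v])}")
```

On the 4 x 4 matrix used in these notes this recovers the sample tree: limbs 0->4 (11), 1->4 (2), 2->5 (6), 3->5 (7) and the internal edge 4->5 (4). Weights are kept as floats internally because the limb length formula divides by 2.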