Please see my [blog post](https://genomejigsaw.wordpress.com/2015/09/09/building-phylogenetic-trees-with-binary-traits/) for an introduction to this algorithm.

## Perfect phylogeny and the Gusfield algorithm

Let's say we have a matrix _M_ with _n_ rows (samples) and _m_ columns (features), where each entry records the presence (1) or absence (0) of a trait. A trait might be a genotype class (e.g. 1 = homozygous AA or BB, 0 = heterozygous AB) or a non-sequence-based trait such as DNA methylation status (e.g. 1 = methylated, 0 = unmethylated). The table below shows such a matrix, where columns C1 to C10 are the features and rows S1 to S4 are the sample groups or individuals.

|    | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 |
|----|----|----|----|----|----|----|----|----|----|-----|
| S1 | 1  | 1  | 0  | 1  | 0  | 1  | 1  | 1  | 1  | 1   |
| S2 | 0  | 1  | 0  | 1  | 0  | 0  | 0  | 1  | 0  | 1   |
| S3 | 0  | 1  | 1  | 1  | 0  | 0  | 0  | 0  | 0  | 0   |
| S4 | 0  | 0  | 0  | 0  | 1  | 0  | 0  | 0  | 0  | 0   |

The Gusfield algorithm implements a test that determines whether a matrix of traits can be represented as a phylogenetic tree. This is the test for a 'perfect phylogeny', defined as a tree _T_ in which:

1. each variation (or group of variations with identical column values) corresponds to at most one edge
2. each sample has at most one leaf
3. there is a unique path of edges to any one leaf

A perfect phylogeny simply means that we can construct a valid phylogenetic tree in which all features evolve down the tree and ancestral features are not spontaneously re-acquired. To implement the test for perfect phylogeny on a matrix of binary features, we perform the following operations:

* Each column in _M_ is read as a binary number (for instance, column C1 with the values 1000 becomes 8). The columns are then sorted in decreasing order of these values, and duplicated columns are discarded. Let’s call the resulting matrix _M_' (M prime).
* Now we construct a new matrix _k_, where each row lists the features present in the corresponding sample. Once all of a sample's features have been listed, we terminate the row with a ‘#’ and pad the remainder with 0s.
* Build the corresponding tree from the matrix and remove the terminating ‘#’ edges.
* Test for the 3 criteria of a perfect phylogeny.
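
As a quick illustration of the binary scoring in the first step, here is a minimal sketch. The column patterns are written as strings read from the top row down, and the names `columns`, `scores` and `ordered` are only illustrative:

```python
# Column patterns read top to bottom (the first sample is the most significant bit).
columns = ["1000", "1110", "0001", "1100", "1110"]

# Interpret each pattern as a base-2 integer, e.g. "1000" -> 8.
# Using a dict also collapses the duplicated "1110" pattern.
scores = {col: int(col, 2) for col in columns}

# Sort the distinct patterns by decreasing score.
ordered = sorted(scores, key=scores.get, reverse=True)
print(ordered)  # ['1110', '1100', '1000', '0001']
```
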

To create the _M_' matrix for the feature matrix shown above, we can code this up in Python like so:

```python
import numpy as np

m = np.array([[1, 1, 0, 1, 0, 1, 1, 1, 1, 1],
              [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
              [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

# rotate M so that each row of m_prime is one column of M
m_prime = np.rot90(m)

# keep only unique combinations (list() is needed because map is lazy in Python 3)
m_prime = np.unique(list(map(lambda x: '.'.join(map(str, x)), m_prime)))
m_prime = np.array([list(map(int, x.split('.'))) for x in m_prime])

# compute the binary score of each column
binary_strings = []
for col in m_prime:
    col_string = '0b' + ''.join(map(str, col))
    binary_strings.append(int(col_string, 2))

# sort by decreasing binary score
order = np.argsort(binary_strings)[::-1]
m_prime = m_prime[order]

m_prime = np.rot90(m_prime)[::-1]  # rotate back so that rows are samples again
print(m_prime)
```

```
[[1 1 1 0 0]
 [1 1 0 0 0]
 [1 0 0 1 0]
 [0 0 0 0 1]]
```


To test whether a perfect phylogeny exists, we now construct a matrix _k_ that shows the path of edges for each sample (or leaf). To turn our matrix _M_' into _k_, we list the features present in each row and then terminate with a '#', followed by 0s.

```python
import string

ncol = len(m_prime[0])
k = np.empty([0, ncol], dtype='U15')  # unicode strings, so Python 3 prints 'a' rather than b'a'
features = np.array(list(string.ascii_lowercase[:ncol]))

for row in m_prime:  # 'row' avoids overwriting the input matrix m
    row_feats = features[row != 0]  # features in the row
    mrow = np.full(ncol, '0', dtype='U15')

    for idx, feature in enumerate(row_feats):
        mrow[idx] = feature

    # terminate the list of features with '#'
    n_feat = len(row_feats)
    if n_feat < ncol:
        mrow[n_feat] = '#'

    k = np.append(k, [mrow], axis=0)

print(k)
```

```
[['a' 'b' 'c' '#' '0']
 ['a' 'b' '#' '0' '0']
 ['a' 'd' '#' '0' '0']
 ['e' '#' '0' '0' '0']]
```
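
The tree-building step is not coded in this walkthrough. As a rough sketch, assuming the _k_ matrix printed above and sample names S1 to S4, each row can be inserted into a nested dictionary so that shared prefixes become shared edges. The helper names `build_tree` and `show` are hypothetical and not part of the original code:

```python
# The k matrix from above, copied as plain lists so the sketch is self-contained.
k_rows = [["a", "b", "c", "#", "0"],
          ["a", "b", "#", "0", "0"],
          ["a", "d", "#", "0", "0"],
          ["e", "#", "0", "0", "0"]]
samples = ["S1", "S2", "S3", "S4"]


def build_tree(rows, sample_names):
    """Insert each row's feature path into a nested dict (a trie)."""
    root = {"children": {}, "samples": []}
    for name, row in zip(sample_names, rows):
        node = root
        for feature in row:
            if feature == "#":  # end of this sample's path; the '#' edge is dropped
                break
            node = node["children"].setdefault(
                feature, {"children": {}, "samples": []})
        node["samples"].append(name)  # the sample sits where its path ends
    return root


def show(node, depth=0):
    """Return one indented line per edge, annotated with any samples ending there."""
    lines = []
    for feature, child in node["children"].items():
        label = "  " * depth + feature
        if child["samples"]:
            label += " -> " + ", ".join(child["samples"])
        lines.append(label)
        lines.extend(show(child, depth + 1))
    return lines


tree = build_tree(k_rows, samples)
print("\n".join(show(tree)))
```

With the matrix above, this prints an edge `a` leading to `b` (where S2 ends) and on to `c` (S1), a sibling edge `d` (S3), and a separate edge `e` (S4). Because the ‘#’ edges are dropped, S2 is attached to the node below `b` rather than to a leaf of its own.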


A perfect phylogeny exists if each leaf is reached by a unique vector of features. One way to test this in code is to check whether any feature (a, b, c, etc.) appears in more than one column of matrix _k_. Because duplicate columns were removed in an earlier step, we do not need to test separately for unique vectors of features leading to each leaf.

```python
locations = []
for feature in features:
    # collect every column position of k at which this feature appears
    present_at = set()
    for k_i in k:
        present_at.update(np.where(k_i == feature)[0])
    locations.append(present_at)

# a feature found at more than one position rules out a perfect phylogeny
loc_test = np.array([len(loc_list) > 1 for loc_list in locations])
if np.any(loc_test):
    print('No phylogeny found!')
else:
    print('Success! Found phylogeny!')
```

```
Success! Found phylogeny!
```
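
As an independent cross-check, and not part of the procedure described here, a well-known equivalent condition for binary traits with an all-zero ancestor is that no two columns conflict: for every pair of features, the sets of samples carrying them must be either disjoint or nested. The sketch below tests this directly. The function name `has_perfect_phylogeny` is hypothetical, and the matrix is the _M_' printed earlier:

```python
from itertools import combinations

import numpy as np


def has_perfect_phylogeny(matrix):
    """Return True if no pair of columns shows all of (1,1), (1,0) and (0,1)."""
    matrix = np.asarray(matrix, dtype=bool)
    for i, j in combinations(range(matrix.shape[1]), 2):
        a, b = matrix[:, i], matrix[:, j]
        both = np.any(a & b)      # some sample has both features
        only_a = np.any(a & ~b)   # some sample has only the first
        only_b = np.any(~a & b)   # some sample has only the second
        if both and only_a and only_b:
            return False          # overlapping but not nested: conflict
    return True


m_prime_check = [[1, 1, 1, 0, 0],
                 [1, 1, 0, 0, 0],
                 [1, 0, 0, 1, 0],
                 [0, 0, 0, 0, 1]]
print(has_perfect_phylogeny(m_prime_check))  # True

# A small conflicting example: the two features overlap without nesting.
print(has_perfect_phylogeny([[1, 1], [1, 0], [0, 1]]))  # False
```

This pairwise check is quadratic in the number of columns, so it is slower than the sorted-column approach above, but it is easy to reason about on small matrices.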
