In [1]:
import os
import numpy as np
import toytree
import strange
import ast
from numba import jit
import toyplot

Goal: take a sequence of Newick trees (organized in a **strange** directory) and, for each tree, determine whether a specific bipartition is present.

Then examine the fragment lengths over which each split is preserved.

First, make a tree as usual.

In [2]:
tree = toytree.rtree.unittree(ntips=13, treeheight=3, seed=42)
tree.draw();

Define a coalseq object and make a treeseq.

In [3]:
sim = strange.Coalseq(
    tree,
    'testdir',
    recombination_rate=1e-9,
    length=1000000,
)
sim.make_treeseq()

Directory 'testdir' created.


In [4]:
sim.treeseq.num_trees

9539

Write out all of these newick genetrees.

In [5]:
sim.write_trees()

Directory 'testdir/ms_genetrees' created.


For this analysis we don't actually need to write out the sequences, which saves us some time.

In [6]:
#sim.write_seqs()

In [7]:
#sim.build_seqs(hdf5=True)

Instead, we will now use "write_clades()", which makes a new "clades" directory containing one file per gene tree. Each file holds the list of clades for that particular gene tree.

In [8]:
sim.write_clades()

Directory 'testdir/clades' created.


Now we're going to explore the persistence of each clade. So first: how many distinct clades are there among all the gene trees?

In [85]:
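unique_clades = set()
for fname in os.listdir('testdir/clades/'):
    with open('testdir/clades/' + fname, 'r') as f:
        data = f.read().splitlines()
    # each line is a Python literal listing the tips of one clade
    unique_clades.update(tuple(sorted(ast.literal_eval(line))) for line in data)
# tuples are hashable, so the set drops duplicates; sort for a reproducible order
full_list = [list(clade) for clade in sorted(unique_clades)]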

Show some of them:

In [75]:
full_list[0:10]

[['1', '10'],
 ['1', '10', '11'],
 ['1', '10', '11', '12', '13'],
 ['1', '10', '11', '12', '13', '2'],
 ['1', '10', '11', '12', '13', '2', '3'],
 ['1', '10', '11', '12', '13', '2', '3', '4'],
 ['1', '10', '11', '12', '13', '2', '3', '4', '5'],
 ['1', '10', '11', '12', '13', '2', '3', '4', '5', '6'],
 ['1', '10', '11', '12', '13', '2', '3', '4', '5', '6', '7'],
 ['1', '10', '11', '12', '13', '2', '3', '4', '5', '6', '7', '8']]

How many total?

In [76]:
len(full_list)

1481

Wow -- a lot.

Let's sort the list by clade size -- just for fun.

In [122]:
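# dtype=object keeps each clade as a Python list inside a 1-D array (the lists differ in length)
full_list = np.array(full_list, dtype=object)[np.argsort([len(i) for i in full_list])]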

In [123]:
full_list

array([list(['1', '10']), list(['3', '8']), list(['3', '9']), ...,
       list(['1', '10', '11', '12', '13', '2', '3', '4', '5', '7', '8', '9']),
       list(['1', '10', '11', '12', '13', '2', '3', '4', '5', '6', '8', '9']),
       list(['1', '10', '12', '13', '2', '3', '4', '5', '6', '7', '8', '9'])],
      dtype=object)

Now we want to go through and record which of these clades (by index) are present in which gene trees.

In [126]:
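# Object mode: the inputs are Python lists of strings held in an object array,
# which numba cannot compile in nopython mode.
@jit(forceobj=True)
def get_equal_list(big_list, little_list):
    # return the index of the first entry of big_list equal to little_list
    for i in range(len(big_list)):
        if big_list[i] == little_list:
            return i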
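The lookup above scans `full_list` once per query. A minimal sketch of a dictionary-based alternative follows, assuming each clade is a sorted list of tip-name strings. The helper `build_clade_index` and the toy `clades` list are hypothetical names used only for illustration and are not part of **strange**.

```python
def build_clade_index(clades):
    """Map each clade (as a tuple of tip names) to its row index."""
    # Lists are not hashable, so each clade is converted to a tuple.
    return {tuple(clade): row for row, clade in enumerate(clades)}

# Toy stand-in for full_list (hypothetical data).
clades = [['1', '10'], ['3', '8'], ['1', '10', '11']]
clade_index = build_clade_index(clades)

# Constant-time lookup of the row for one clade.
row = clade_index[tuple(['3', '8'])]
print(row)
```

Building the dictionary once and reusing it for every gene tree avoids repeating the scan for each clade.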

Now make a large array to hold the presence or absence of each clade in each gene tree.

In [131]:
clade_pres = np.zeros((len(full_list),len(os.listdir('testdir/clades/'))),dtype=np.int8)

In [132]:
clade_pres.shape

(1481, 9539)

So here, each row corresponds to a clade and each column to a consecutive gene tree. Since we sorted the clades by size, the smaller clades (e.g. of two taxa) will be near the top of the array, and the larger clades (e.g. of 12 taxa) will be near the bottom.

We'll change a cell from 0 to 1 if that clade (row) is present in that gene tree (col).

In [139]:
for i in range(len(os.listdir('testdir/clades/'))):
    with open('testdir/clades/'+str(i)+'.txt','r') as f:
        data=f.read().splitlines()
    test = [sorted(ast.literal_eval(dat)) for dat in data]
    for clade in test:
        row = get_equal_list(full_list,clade)
        clade_pres[row,i] = 1

Now we have a huge numpy array where each column corresponds to a particular gene tree, and each row corresponds to a possible clade. Cells with "1" in this array indicate that the particular clade is found in the particular gene tree.
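As a quick check on an array like this, row and column sums give simple summaries. The sketch below uses a small made-up matrix (`toy_pres` is hypothetical data, not output from the simulation) in the same layout: clades in rows, gene trees in columns.

```python
import numpy as np

# Toy presence/absence matrix: 3 clades (rows) x 4 gene trees (columns).
toy_pres = np.array([[1, 1, 1, 0],
                     [0, 0, 1, 0],
                     [1, 1, 1, 1]], dtype=np.int8)

# Number of gene trees in which each clade appears (one value per row).
trees_per_clade = toy_pres.sum(axis=1)

# Number of listed clades found in each gene tree (one value per column).
clades_per_tree = toy_pres.sum(axis=0)

print(trees_per_clade)
print(clades_per_tree)
```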

We can visualize this using toyplot matrix:

In [169]:
toyplot.matrix(clade_pres[10:20,0:1000],width = 1300, height = 300);

So you can see that some clades tend to be present across many consecutive gene trees (they aren't broken up very often).
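One way to quantify this persistence is to measure the runs of consecutive 1s in a row of the array. The sketch below is a minimal version under stated assumptions: `run_lengths` is a hypothetical helper, the example row is made up, and the lengths are counted in gene trees rather than base pairs. Converting them to fragment lengths would require the genomic interval covered by each gene tree, which is not shown here.

```python
import numpy as np

def run_lengths(presence):
    """Return the length of each run of consecutive 1s in a 0/1 vector."""
    # Pad with zeros so runs touching either end are still bounded.
    padded = np.concatenate(([0], np.asarray(presence, dtype=np.int8), [0]))
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)   # a 0 -> 1 step opens a run
    ends = np.flatnonzero(changes == -1)    # a 1 -> 0 step closes a run
    return ends - starts

# Made-up presence vector for one clade across nine gene trees.
row = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1], dtype=np.int8)
print(run_lengths(row))  # runs of 2, 3 and 1 consecutive gene trees
```

Applied to a row such as `clade_pres[10]`, this would give the number of consecutive gene trees over which that clade persists.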