In [1]:
import mat
import pandas as pd
import numpy as np
import seaborn as sns
import ete3

The matreePy API is a wrapper on the Mutation_Annotated_Tree namespace and select matUtils functions, including the Tree and Node classes. 

In [2]:
%%time
# Loading takes several seconds, as it does with the command-line tools.
t = mat.MATree("public-latest.all.masked.pb.gz")

CPU times: user 23.7 s, sys: 1.68 s, total: 25.4 s
Wall time: 17.1 s


In [4]:
t.get_parsimony_score()

3370883

As a comparison point for efficiency, let's use the popular Python phylogenetics package ete3. ete3 only works with Newick strings, not with mutation-annotated trees directly.

In [5]:
nwk = t.get_newick_string()

In [6]:
%%time
# format=1 tells ete3 to expect Newick with internal node names.
etetree = ete3.Tree(nwk.decode("UTF-8"), format=1)

KeyboardInterrupt: 

ete3 takes six times longer to load *just* the Newick tree, without any mutations. The power of the MAT library is not to be underestimated!
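Outside a notebook the `%%time` magic is not available, so a comparison like this needs an explicit timer. The block below is a minimal sketch using only the standard library. The helper name `wall_timer` and the two stand-in workloads are illustrative assumptions, not part of matreePy or ete3. In practice the two `with` blocks would wrap the two loading calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def wall_timer(label, results):
    """Record the wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}

# Stand-in workloads (hypothetical); replace with the calls being compared.
with wall_timer("small", timings):
    sum(range(10_000))
with wall_timer("large", timings):
    sum(range(1_000_000))

# Guard against a zero denominator on very coarse clocks.
ratio = timings["large"] / max(timings["small"], 1e-9)
print(f"large/small wall-time ratio: {ratio:.1f}x")
```

A single run is noisy, so repeating each measurement a few times and comparing the minimum is usually more reliable.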

The API can extract subtrees with requested attributes, much like matUtils extract, and save the results to a new protobuf file.

In [None]:
%%time
omicron = t.get_clade("BA.1")
print("Omicron Total Parsimony:",omicron.get_parsimony_score())
omicron.save_pb("omicron_only.pb")

The API can traverse the tree in breadth-first or depth-first order and provides a Python-readable MATNode class!

In [None]:
%%time
allnodes = t.depth_first_expansion()
len(allnodes)
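To make the two traversal orders concrete, here is a small pure-Python sketch on a toy tree. The tree, its node names, and both functions are hypothetical illustrations. The exact visiting order that the wrapper returns is not stated here, so this only shows the general difference between the two strategies.

```python
from collections import deque

# Toy tree as an adjacency dict (hypothetical node names, not a real MAT).
TOY_TREE = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
    "A1": [], "A2": [], "B1": [],
}

def depth_first(tree, start):
    """Pre-order depth-first: visit a node, then each child subtree in turn."""
    order, stack = [], [start]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(tree[node]))
    return order

def breadth_first(tree, start):
    """Breadth-first: visit all nodes at one depth before going deeper."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

dfs_order = depth_first(TOY_TREE, "root")
bfs_order = breadth_first(TOY_TREE, "root")
print(dfs_order)  # ['root', 'A', 'A1', 'A2', 'B', 'B1']
print(bfs_order)  # ['root', 'A', 'B', 'A1', 'A2', 'B1']
```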

In [None]:
%%time
# How does this compare to ete3 traversal time?
for node in etetree.traverse('postorder'):
    pass

We can already see that, despite tracking the mutations assigned to each node, our wrapper is ~5x faster than the equivalent functions in ete3. A great sign for usability!

In [None]:
help(allnodes[250])

The node class contains getter methods for each of the original C++ node class attributes, allowing for Python-level parsing and selection of nodes from a set of MATNode objects.

In [None]:
leaves = [n for n in allnodes if n.is_leaf()]
len(leaves)
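The same selection pattern extends to splitting a node list into groups in a single pass. The sketch below uses a hypothetical `StubNode` class so that it runs without a loaded tree. Only the `is_leaf()` method mirrors what the notebook uses; the `name` and `children` fields are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class StubNode:
    """Hypothetical stand-in for a MATNode; only is_leaf() mirrors the notebook."""
    name: str
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children

def split_leaves(nodes):
    """Partition nodes into (leaves, internal) using each node's is_leaf()."""
    leaf_nodes, internal_nodes = [], []
    for n in nodes:
        (leaf_nodes if n.is_leaf() else internal_nodes).append(n)
    return leaf_nodes, internal_nodes

stub_nodes = [
    StubNode("root", ["n1", "s3"]),
    StubNode("n1", ["s1", "s2"]),
    StubNode("s1"), StubNode("s2"), StubNode("s3"),
]
stub_leaves, stub_internal = split_leaves(stub_nodes)
print(len(stub_leaves), len(stub_internal))  # 3 2
```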

Finally, the API can also support Cython-only functions for particular analytical applications. Here is one example: a function that counts individual mutation types across all nodes and loads them into a Python dictionary!

In [None]:
%%time
mcount = t.count_mutations()
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally.
sns.barplot(x=list(mcount.keys()), y=list(mcount.values()))
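Beyond plotting, a dictionary of counts is easy to rank and normalize in plain Python. The sketch below assumes the dictionary maps mutation-type strings to integer counts. The key format (`"C>T"`) and the numbers in `example_counts` are invented for illustration and are not results from the tree above.

```python
def top_fractions(counts, n=3):
    """Return the n most frequent keys with their share of the total count."""
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(key, value / total) for key, value in ranked[:n]]

# Invented example data; real keys and counts come from the tree being analyzed.
example_counts = {"C>T": 50, "G>T": 25, "A>G": 15, "T>C": 10}

for mutation, fraction in top_fractions(example_counts):
    print(f"{mutation}: {fraction:.1%}")
```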

Our API also provides a translation method, which computes and stores the amino acid change resulting from each nucleotide mutation on the tree. Once translations have been computed, any Nodes obtained from the tree automatically carry these changes as AAChange class objects. (This does significantly slow traversal and MATNode generation.)

In [3]:
t.translate('ncbiGenes.gtf','NC_045512v2.fa')

In [None]:
%%time
translated_nodes = t.depth_first_expansion()

In [9]:
translated_nodes[10].translation[0].aa

'V2157I'
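A string such as `'V2157I'` packs a reference residue, a position, and an alternate residue together. The sketch below splits such a string with a regular expression. It assumes every string follows the `<ref><position><alt>` pattern seen in the output above; the `ParsedChange` type and `parse_aa_change` function are hypothetical helpers, not part of the wrapper.

```python
import re
from typing import NamedTuple, Optional

class ParsedChange(NamedTuple):
    """Hypothetical container for the parts of an amino acid change string."""
    ref: str
    position: int
    alt: str

# One residue letter (or '*' for stop), digits, then one residue letter or '*'.
_AA_PATTERN = re.compile(r"^([A-Z*])(\d+)([A-Z*])$")

def parse_aa_change(text: str) -> Optional[ParsedChange]:
    """Split e.g. 'V2157I' into its parts; return None if the pattern fails."""
    match = _AA_PATTERN.match(text)
    if match is None:
        return None
    ref, position, alt = match.groups()
    return ParsedChange(ref, int(position), alt)

change = parse_aa_change("V2157I")
print(change)  # ParsedChange(ref='V', position=2157, alt='I')
```

Returning `None` for non-matching input keeps the helper safe to use in a list comprehension over many nodes, at the cost of needing an explicit check afterwards.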

Altogether, this wrapper exposes basic and extremely useful functions of the excellent Mutation_Annotated_Tree library and matUtils to Python to allow for efficient and informed analysis.

If you've ever tried to run an analysis on a MAT protobuf file and been frustrated by the linear nature of the command-line tools and the constant need to parse different text files, this wrapper could help you!

I'm seeking feedback on any and all aspects of the wrapper and am collecting it in a Google Doc [here](https://docs.google.com/document/d/1UR82v2xJixnEIEHRh7jZawClOqmsufrPWELygq8lb_Y/edit?usp=sharing). Try it out!