## Reduce redundancy in the set of sequences

The set of sequences retrieved from our full BLAST search is too large for tree construction to be computationally tractable. To get around this problem, we'll use a program called CD-HIT to collapse highly similar sequences, which contribute only redundant information to our maximum likelihood calculation. The `phylogenetics` package has a simple API for running CD-HIT from within Python.
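
To make the idea of a redundancy cutoff concrete, here is a minimal sketch of greedy clustering by sequence identity. It is not CD-HIT's actual algorithm or the `phylogenetics` API: the `identity` and `greedy_cluster` helpers and the toy sequences are hypothetical, and `difflib.SequenceMatcher` is only a crude stand-in for a real alignment-based identity score.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of the shorter sequence covered by matching blocks.

    A rough stand-in for alignment identity, used for illustration only.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], cutoff: float) -> Dict[str, List[str]]:
    """Cluster sequences greedily, longest first.

    Returns a mapping of representative name -> member names.
    """
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            # Join the first cluster whose representative is similar enough.
            if identity(seqs[rep], seqs[name]) >= cutoff:
                clusters[rep].append(name)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Toy sequences, made up for illustration.
toy = {
    "seq_a": "MKTAYIAKQRQISFVKSHFSRQ",
    "seq_b": "MKTAYIAKQRQISFVKSHFSRA",  # one substitution relative to seq_a
    "seq_c": "GGGGGGGGGGWWWWWWWWWWPP",  # unrelated to the other two
}
clusters = greedy_cluster(toy, cutoff=0.85)
print(len(clusters), "clusters from", len(toy), "sequences")
```

Lowering the cutoff merges more sequences into each cluster and so leaves fewer representatives; raising it keeps more of them.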

In [10]:
from phylogenetics.utils import load_homologset

# Wrapper for running CD-HIT from Python
from phylogenetics.cdhit import run_cdhit

We set the redundancy cutoff so that the reduced list contains roughly 300 sequences. Once the clusters are determined, we ideally want each cluster's representative to be a known (not hypothetical) sequence. We'll use the `positive` keyword argument to select sequences that look like alpha-lytic protease.
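
The preference for known sequences can be sketched as a simple ranking over the members of one cluster. This is only an illustration of the idea, not how `run_cdhit` ranks sequences internally: the `pick_representative` helper, its scoring rule, and the example descriptions are all hypothetical.

```python
def pick_representative(members, positive=()):
    """Pick one description from a cluster, preferring positive-keyword matches."""

    def score(description):
        text = description.lower()
        # Count how many positive keywords appear in the description.
        hits = sum(keyword.lower() in text for keyword in positive)
        # Tie-breaker: prefer entries not labeled "hypothetical".
        return (hits, "hypothetical" not in text)

    # max() returns the first best-scoring member, so ties keep cluster order.
    return max(members, key=score)


# Made-up cluster of sequence descriptions, for illustration only.
cluster = [
    "hypothetical protein",
    "alpha-lytic protease precursor",
    "serine protease",
]
representative = pick_representative(cluster, positive=("alpha-lytic",))
print(representative)
```

Note that the keywords are passed as a tuple; a one-element tuple in Python needs a trailing comma, as in `("alpha-lytic",)`.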

In [11]:
redundancy_cutoff = 0.85
representative_names = ("human",)  # trailing comma: without it this is the string "human", not a tuple

# Load the homolog set
homolog_set = load_homologset("../homologs/03_homologs.pickle")

# Run CD-HIT, keeping the highest-ranked sequence in each cluster as its representative
clean_homologs = run_cdhit(homolog_set, redund_cutoff=redundancy_cutoff, positive=representative_names)

print("Number of sequences: " + str(len(clean_homologs.homologs)))

Number of sequences: 366


Save the reduced set of homologs to a file.

In [13]:
clean_homologs.write("../homologs/04_clean_homologs.pickle", format="pickle")