<h2>Preparing a dataset for phylogenetic analysis</h2>

In this recipe, we will download and prepare the dataset used in our analysis. The dataset contains
complete genomes of the Ebola virus, and we will use DendroPy to download and prepare the data.

We will download complete genomes from GenBank; these genomes were collected from various
Ebola outbreaks, including several from the 2014 outbreak.

To run this analysis, we first need to install DendroPy.

In [1]:
pip install dendropy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dendropy
  Downloading DendroPy-4.5.2.tar.gz (15.2 MB)
     |████████████████████████████████| 15.2 MB 7.3 MB/s
Building wheels for collected packages: dendropy
  Building wheel for dendropy (setup.py) ... done
  Created wheel for dendropy: filename=DendroPy-4.5.2-py3-none-any.whl size=453171 sha256=f60f5f68a8b13febcafa3fc260a5334253fea92d1b9739fbf4b7b191d2067660
  Stored in directory: /root/.cache/pip/wheels/9b/c9/0e/1955707a41b3995e9d64768cf8b9b41c74a75f2d6e8d16a61f
Successfully built dendropy
Installing collected packages: dendropy
Successfully installed dendropy-4.5.2


First, let’s specify our data sources using DendroPy, as follows:

In [5]:
import dendropy
from dendropy.interop import genbank


# Here, we have three functions: one to retrieve data from the most recent EBOV outbreak,
# another to retrieve data from the previous EBOV outbreaks, and one to retrieve data from the
# outbreaks of other species.

# Note that the DendroPy GenBank interface provides several different ways to specify lists or
# ranges of records to retrieve. Some lines are commented out. These include the code to download
# more genomes. For our purpose, the subset that we will download is enough.

def get_ebov_2014_sources():
    #EBOV_2014
    #yield 'EBOV_2014', genbank.GenBankDna(id_range=(233036, 233118), prefix='KM')
    yield 'EBOV_2014', genbank.GenBankDna(id_range=(34549, 34563), prefix='KM0')
    
def get_other_ebov_sources():
    #EBOV other
    yield 'EBOV_1976', genbank.GenBankDna(ids=['AF272001', 'KC242801'])
    yield 'EBOV_1995', genbank.GenBankDna(ids=['KC242796', 'KC242799'])
    yield 'EBOV_2007', genbank.GenBankDna(id_range=(84, 90), prefix='KC2427')
    
def get_other_ebolavirus_sources():
    #BDBV
    yield 'BDBV', genbank.GenBankDna(id_range=(3, 6), prefix='KC54539')
    yield 'BDBV', genbank.GenBankDna(ids=['FJ217161'])

    #RESTV
    yield 'RESTV', genbank.GenBankDna(ids=['AB050936', 'JX477165', 'JX477166', 'FJ621583', 'FJ621584', 'FJ621585']) 

    #SUDV
    # Six distinct SUDV accessions (a repeated 'JN638998' entry was dropped)
    yield 'SUDV', genbank.GenBankDna(ids=['KC242783', 'AY729654', 'EU338380',
                                          'JN638998', 'FJ968794', 'KC589025'])
    #yield 'SUDV', genbank.GenBankDna(id_range=(89, 92), prefix='KC5453')    

    #TAFV
    yield 'TAFV', genbank.GenBankDna(ids=['FJ217162'])

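As a rough illustration of how `id_range` and `prefix` combine into accession numbers, the sketch below expands a range by hand. The helper `expand_accessions` is hypothetical (it is not part of DendroPy), and it assumes the range is inclusive at both ends, which is consistent with the sequence counts reported later in this recipe.

```python
def expand_accessions(id_range, prefix):
    """Hypothetical helper: build accession strings from a numeric range.

    Assumption: both ends of id_range are included.
    """
    first, last = id_range
    return ['%s%d' % (prefix, num) for num in range(first, last + 1)]


# The EBOV_2014 source used in this recipe
ebov_2014_ids = expand_accessions((34549, 34563), 'KM0')
print(len(ebov_2014_ids), ebov_2014_ids[0], ebov_2014_ids[-1])
```

Under that assumption, the 2014 source expands to 15 accessions, from KM034549 to KM034563.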
Now, we will create a set of FASTA files; we will use these files here and in future recipes:

We will generate several different FASTA files, which include either all genomes, just EBOV, or
just EBOV samples from the 2014 outbreak.<br> In this chapter, we will mostly use the sample.fasta
file with all genomes.

Note how the DendroPy functions convert the records retrieved from GenBank into FASTA files.
The ID of each sequence in the FASTA file is produced by a lambda function that combines the
species and the year with the GenBank accession number.

DendroPy primer on working with GenBank: https://dendropy.org/primer/genbank.html

CharacterMatrix reference: https://dendropy.org/library/charmatrixmodel.html#dendropy.datamodel.charmatrixmodel.CharacterMatrix

CharacterDataSequence reference: https://dendropy.org/library/charmatrixmodel.html#dendropy.datamodel.charmatrixmodel.CharacterDataSequence
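
To make the labelling scheme described above concrete, here is a minimal sketch that builds the same style of FASTA ID without contacting GenBank. `FakeRecord` and `make_label` are hypothetical stand-ins introduced only for illustration; the `'%s_%s'` pattern is the one used in this recipe.

```python
from collections import namedtuple

# Hypothetical stand-in for a GenBank record; only .accession is used here
FakeRecord = namedtuple('FakeRecord', ['accession'])


def make_label(species, gb):
    """Combine the source name (species, plus year for EBOV) with the accession."""
    return '%s_%s' % (species, gb.accession)


label = make_label('EBOV_2007', FakeRecord(accession='KC242784'))
print(label)  # EBOV_2007_KC242784
```

Keeping the species (and year) at the front of the ID means it can later be recovered by splitting the label on underscores.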

In [7]:
other = open('other.fasta', 'w')
sampled = open('sample.fasta', 'w')

for species, recs in get_other_ebolavirus_sources():
    tn = dendropy.TaxonNamespace()
    char_mat = recs.generate_char_matrix(taxon_namespace=tn,
        gb_to_taxon_fn=lambda gb: tn.require_taxon(label='%s_%s' % (species, gb.accession)))
    char_mat.write_to_stream(other, 'fasta')
    char_mat.write_to_stream(sampled, 'fasta')
other.close()
ebov_2014 = open('ebov_2014.fasta', 'w')
ebov = open('ebov.fasta', 'w')
for species, recs in get_ebov_2014_sources():
    tn = dendropy.TaxonNamespace()
    char_mat = recs.generate_char_matrix(taxon_namespace=tn,
        gb_to_taxon_fn=lambda gb: tn.require_taxon(label='EBOV_2014_%s' % gb.accession))
    char_mat.write_to_stream(ebov_2014, 'fasta')
    char_mat.write_to_stream(sampled, 'fasta')
    char_mat.write_to_stream(ebov, 'fasta')
ebov_2014.close()

ebov_2007 = open('ebov_2007.fasta', 'w')
for species, recs in get_other_ebov_sources():
    tn = dendropy.TaxonNamespace()
    char_mat = recs.generate_char_matrix(taxon_namespace=tn,
        gb_to_taxon_fn=lambda gb: tn.require_taxon(label='%s_%s' % (species, gb.accession)))
    char_mat.write_to_stream(ebov, 'fasta')
    char_mat.write_to_stream(sampled, 'fasta')
    if species == 'EBOV_2007':
        char_mat.write_to_stream(ebov_2007, 'fasta')

ebov.close()
ebov_2007.close()
sampled.close()

Let’s extract four (of the total seven) genes in the virus, as follows:

For each GenBank record, we go through its CDS features, keep those whose gene name is in our list, and write the
corresponding sequences to FASTA files. We put each gene into a different file and only take two virus species.
We also get the translated proteins, which are available in the records for each gene.
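
The coordinate handling in this step can be sketched on its own. The example below assumes the feature location is a plain `start..end` string with 1-based, inclusive coordinates; compound locations such as `join(...)` or `complement(...)` are not handled. The helper names and the toy sequence are made up for illustration.

```python
def parse_simple_location(location):
    """Parse a 'start..end' location string into 1-based inclusive integers.

    Assumption: only plain 'start..end' locations are supported.
    """
    start, end = location.split('..')
    return int(start), int(end)


def slice_feature(sequence_text, location):
    start, end = parse_simple_location(location)
    # Convert 1-based inclusive coordinates to a 0-based Python slice
    return sequence_text[start - 1:end]


toy_genome = 'AACCGGTTAACC'  # made-up sequence, not real data
print(slice_feature(toy_genome, '3..6'))  # CCGG
```

The `start - 1` offset is the same adjustment used when slicing `rec.sequence_text` in the recipe.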

In [5]:
my_genes = ['NP', 'L', 'VP35', 'VP40']

def dump_genes(species, recs, g_hdls, p_hdls):
    for rec in recs:
        for feature in rec.feature_table:
            if feature.key == 'CDS':
                gene_name = None
                for qual in feature.qualifiers:
                    if qual.name == 'gene':
                        if qual.value in my_genes:
                            gene_name = qual.value
                    elif qual.name == 'translation':
                        protein_translation = qual.value
                if gene_name is not None:
                    # First and last numbers of the location string (1-based, inclusive)
                    locs = feature.location.split('.')
                    start, end = int(locs[0]), int(locs[-1])
                    g_hdls[gene_name].write('>%s_%s\n' % (species, rec.accession))
                    p_hdls[gene_name].write('>%s_%s\n' % (species, rec.accession))
                    g_hdls[gene_name].write('%s\n' % rec.sequence_text[start - 1 : end])
                    p_hdls[gene_name].write('%s\n' % protein_translation)

g_hdls = {}
p_hdls = {}
for gene in my_genes:
    g_hdls[gene] = open('%s.fasta' % gene, 'w')
    p_hdls[gene] = open('%s_P.fasta' % gene, 'w')
for species, recs in get_other_ebolavirus_sources():
    if species in ['RESTV', 'SUDV']:
        dump_genes(species, recs, g_hdls, p_hdls)
for gene in my_genes:
    g_hdls[gene].close()
    p_hdls[gene].close()


Let’s create a function to get the basic statistical information from the alignment, as follows:

In [6]:
def describe_seqs(seqs):
    print('Number of sequences: %d' % len(seqs.taxon_namespace))
    print('First 10 taxon sets: %s' % ' '.join([taxon.label for taxon in seqs.taxon_namespace[:10]]))
    lens = []
    for tax, seq in seqs.items():
        lens.append(len([x for x in seq.symbols_as_list() if x != '-']))
    print('Genome length: min %d, mean %.1f, max %d' % (min(lens), sum(lens) / len(lens), max(lens)))

Our function takes a DendroPy DnaCharacterMatrix object and counts the number of
taxa. Then, we extract all the nucleotides per sequence (excluding gaps, identified by -) to
compute the length and report the minimum, mean, and maximum sizes.
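
The length computation can be tried on plain strings without DendroPy. The following is a minimal sketch in which `ungapped_length_stats` is a hypothetical helper and the toy alignment is made up; it mirrors only the gap-stripping and min/mean/max logic.

```python
def ungapped_length_stats(aligned_seqs):
    """Return (min, mean, max) of sequence lengths, ignoring '-' gap symbols."""
    lens = [len(seq.replace('-', '')) for seq in aligned_seqs]
    return min(lens), sum(lens) / len(lens), max(lens)


toy_alignment = ['ACGT--A', 'ACG-TTA', 'A------']  # made-up data
print('min %d, mean %.1f, max %d' % ungapped_length_stats(toy_alignment))
```

Here the ungapped lengths are 5, 6, and 1, so the printed line is `min 1, mean 4.0, max 6`.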

Let’s inspect the sequences of the EBOV genomes and compute the basic statistics, using the function defined earlier:

Calling the function reports 26 sequences with a minimum size of 18,700, a mean size of 18,925.2, and a maximum size of 18,959. This is a small genome when compared to eukaryotes.<br>
Note that the data structure is deleted at the very end. This is because its memory
footprint is quite big (DendroPy is a pure Python library and has some costs in terms of
speed and memory).

In [7]:
ebov_seqs = dendropy.DnaCharacterMatrix.get_from_path('ebov.fasta', schema='fasta', data_type='dna')
print('EBOV')
describe_seqs(ebov_seqs)
del ebov_seqs

EBOV
Number of sequences: 26
First 10 taxon sets: EBOV_2014_KM034549 EBOV_2014_KM034550 EBOV_2014_KM034551 EBOV_2014_KM034552 EBOV_2014_KM034553 EBOV_2014_KM034554 EBOV_2014_KM034555 EBOV_2014_KM034556 EBOV_2014_KM034557 EBOV_2014_KM034558
Genome length: min 18700, mean 18925.2, max 18959


Now, let’s inspect the other Ebola virus genome file and count the number of different species:

In [8]:
print('ebolavirus sequences')
ebolav_seqs = dendropy.DnaCharacterMatrix.get_from_path('other.fasta', schema='fasta', data_type='dna')
describe_seqs(ebolav_seqs)
from collections import defaultdict
species = defaultdict(int)
for taxon in ebolav_seqs.taxon_namespace:
    toks = taxon.label.split('_')
    my_species = toks[0]
    if my_species == 'EBOV':
        ident = '%s (%s)' % (my_species, toks[1])
    else:
        ident = my_species
    species[ident] += 1
for my_species, cnt in species.items():
    print("%20s: %d" % (my_species, cnt))
del ebolav_seqs

ebolavirus sequences
Number of sequences: 18
First 10 taxon sets: BDBV_KC545393 BDBV_KC545394 BDBV_KC545395 BDBV_KC545396 BDBV_FJ217161 RESTV_AB050936 RESTV_JX477165 RESTV_JX477166 RESTV_FJ621583 RESTV_FJ621584
Genome length: min 18796, mean 18892.7, max 18940
                BDBV: 5
               RESTV: 6
                SUDV: 6
                TAFV: 1


The name prefix of each taxon is indicative of the species, and we leverage that to fill a dictionary
of counts.<br>
The output above shows the breakdown by species (with the legend as
Bundibugyo virus=BDBV, Tai Forest virus=TAFV, Sudan virus=SUDV, and Reston virus=RESTV):
we have 1 TAFV, 6 SUDV, 6 RESTV, and 5 BDBV. The EBOV branch of the code is not triggered
here because other.fasta contains no EBOV genomes.
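
The same counting idea can be written with `collections.Counter`. This is a sketch under the assumption that labels follow the `SPECIES_ACCESSION` or `EBOV_YEAR_ACCESSION` pattern used in this recipe; `count_species` is a hypothetical helper, and the labels in the example are a small hand-picked subset, not the full dataset.

```python
from collections import Counter


def count_species(labels):
    """Count labels per species, splitting EBOV further by outbreak year."""
    counts = Counter()
    for label in labels:
        toks = label.split('_')
        if toks[0] == 'EBOV':
            ident = '%s (%s)' % (toks[0], toks[1])
        else:
            ident = toks[0]
        counts[ident] += 1
    return counts


labels = ['BDBV_KC545393', 'RESTV_AB050936', 'RESTV_JX477165', 'EBOV_2014_KM034549']
for ident, cnt in count_species(labels).items():
    print('%20s: %d' % (ident, cnt))
```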

Let’s compute the mean length of each extracted gene:

In [9]:
gene_length = {}
my_genes = ['NP', 'L', 'VP35', 'VP40']

for gene_name in my_genes:
    seqs = dendropy.DnaCharacterMatrix.get_from_path('%s.fasta' % gene_name, schema='fasta', data_type='dna')
    gene_length[gene_name] = []
    for tax, seq in seqs.items():
        # Ungapped length of this sequence
        gene_length[gene_name].append(len([x for x in seq.symbols_as_list() if x != '-']))
for gene, lens in gene_length.items():
    # %d truncates the mean length to an integer
    print('%6s: %d' % (gene, sum(lens) / len(lens)))

    NP: 2218
     L: 6636
  VP35: 990
  VP40: 988


In [None]:
!mkdir unt
!mv *.fasta unt 
!zip -r unt.zip unt