# Looking for coevolution between Streptococcus and Lactobacillus

Coevolution is everywhere, because no species can be considered totally isolated from the others. It is not limited to biology: we perceive it in the way our societies change, in political, scientific and religious ideas, and even in the evolution of software engineering.

The difficulty is knowing how much of the coevolution is due to the interaction between a given pair of species, because the pair cannot be totally isolated from the rest of the universe either. The problem certainly cannot be solved from the bioinformatic perspective alone; it also needs contributions from other branches of knowledge such as microbiology, physics, mathematics and computer theory.

Our approach compares the rates of protein evolutionary change among strains of the Streptococcus and Lactobacillus species linked by symbiosis with the rates among the remaining species of each genus.

We postulate that the proteins whose rate changes most significantly, i.e., that evolve markedly faster or slower than the intra-genus rate, are proteins that could be influenced by the symbiotic environment.

To do so we need to process each species of the symbiotic pair separately. The detailed steps are:

1) Obtain the proteomes of these four groups:
- Genus Streptococcus.
- Strains of Streptococcus thermophilus.
- Genus Lactobacillus.
- Strains of Lactobacillus bulgaricus.

2) For each protein of each group we build the phylogenetic tree and compute the mean branch length. We assume that the trees are ultrametric, or nearly so; this may not hold exactly, but it can serve as a reference. We use the ClustalW multiple alignment method, but other methods such as T-Coffee or MUSCLE could be used too. The branch length is a measure of the substitution rate.

3) For each protein of the first and second groups, we calculate the ratio between its branch length in thermophilus and its branch length in the genus. We also compute the overall mean of all ratios. Finally we select the proteins with the most extreme values (80% percentile, two-tailed).

4) We do the same with Lactobacillus (third and fourth groups). At this stage we have two sets of proteins.

5) We obtain the biological pathways of each set of proteins.

6) The proteins from Lactobacillus and Streptococcus that are involved in the same pathways are the target proteins influenced by the coevolution between the two species. A later microbiological study will be needed to confirm these expectations and to explore the details of this coevolution in depth.
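
Step 2 assumes that the trees are ultrametric or nearly so. One way to check that assumption on a given tree is to compare the root-to-tip distances of its leaves. The following is only a sketch: the function names, the small Newick reader and the 5% tolerance are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

def root_to_tip_distances(newick):
    """Return {leaf_name: distance from the root} for a Newick string with branch lengths."""
    # Split into structural symbols and labels such as "A:0.1" or ":0.2"
    tokens = re.findall(r"[(),;]|[^(),;]+", "".join(newick.split()))
    stack = [[]]     # one list of [name, distance] pairs per open clade
    closed = None    # leaves of the clade just closed by ')', awaiting its branch length
    for token in tokens:
        if token == "(":
            stack.append([])
            closed = None
        elif token == ")":
            closed = stack.pop()
            stack[-1].extend(closed)
        elif token in ",;":
            closed = None
        else:
            if ":" in token:
                name, _, length = token.rpartition(":")
                length = float(length)
            else:
                name, length = token, 0.0
            if closed is not None:
                # Branch length of an internal node: it applies to every leaf below it
                for leaf in closed:
                    leaf[1] += length
                closed = None
            else:
                stack[-1].append([name, length])
    return {name: distance for name, distance in stack[0]}

def is_nearly_ultrametric(newick, tolerance=0.05):
    """True if the root-to-tip distances differ by at most `tolerance` times the largest one."""
    distances = list(root_to_tip_distances(newick).values())
    return max(distances) - min(distances) <= tolerance * max(distances)

# Hypothetical toy tree: every leaf is at distance 3 from the root
toy_tree = "((A:1,B:1):2,C:3);"
print(root_to_tip_distances(toy_tree))
print(is_nearly_ultrametric(toy_tree))
```

The trees written by ClustalW are unrooted, so the "root" here is simply the outermost node of the Newick string; the check is therefore indicative at best.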


## Obtain data from servers

In [1]:
import json
import sys

import requests


In [2]:
def load_taxa(scientific_prefix):
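    """
    Query the EBI Proteins API for the taxa whose scientific name starts with
    scientific_prefix followed by a space. Only the first page (up to 100 taxa) is read.

    Returns:
        (list of int, list of str): taxonomy ids and scientific names
    """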
    requestURL = "https://www.ebi.ac.uk/proteins/api/taxonomy/name/" + scientific_prefix +\
                "%20?pageNumber=1&pageSize=100&searchType=STARTSWITH&fieldName=SCIENTIFICNAME"

    r = requests.get(requestURL, headers={ "Accept" : "application/json"})

    # raise_for_status() raises requests.HTTPError when the request failed
    r.raise_for_status()

    jsonBody = json.loads(r.text)
    taxa = []
    names = []
    for taxonomy in jsonBody["taxonomies"]:
        print(taxonomy['taxonomyId'])
        print(taxonomy['scientificName'])
        taxa.append(taxonomy['taxonomyId'])
        names.append(taxonomy['scientificName'])
    return taxa, names

termophilus_taxa,  termophilus_names = load_taxa("Streptococcus thermophilus")
streptococcus_taxa,  streptococcus_names = load_taxa("Streptococcus")


264199
Streptococcus thermophilus (strain ATCC BAA-250 / LMG 18311)
299768
Streptococcus thermophilus (strain CNRZ 1066)
322159
Streptococcus thermophilus (strain ATCC BAA-491 / LMD-9)
767463
Streptococcus thermophilus (strain ND03)
1042404
Streptococcus thermophilus CNCM I-1630
1051074
Streptococcus thermophilus JIM 8232
1073569
Streptococcus thermophilus MTCC 5460
1073570
Streptococcus thermophilus MTCC 5461
1091038
Streptococcus thermophilus DSM 20617
1187956
Streptococcus thermophilus MN-ZLW-002
1263110
Streptococcus thermophilus CAG:236
1268061
Streptococcus thermophilus DGCC 7710
1408178
Streptococcus thermophilus ASCC 1275
1415776
Streptococcus thermophilus TH1435
1423145
Streptococcus thermophilus TH1436
1433288
Streptococcus thermophilus MTH17CL396
1433289
Streptococcus thermophilus M17PTZA496
1435972
Streptococcus thermophilus TH985
1435974
Streptococcus thermophilus TH982
1435981
Streptococcus thermophilus 1F8CT
1436725
Streptococcus thermophilus TH1477
1302
Streptococcus go
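
The request URL in `load_taxa` is assembled by string concatenation. As a sketch of an alternative, the same URL can be built with the standard library so that spaces and query parameters are encoded explicitly. The helper name `taxonomy_url` is hypothetical; the endpoint and parameter values are the ones used above, including the trailing space appended to the prefix.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.ebi.ac.uk/proteins/api/taxonomy/name/"

def taxonomy_url(scientific_prefix, page_number=1, page_size=100):
    """Build the taxonomy search URL for names starting with `scientific_prefix` plus a space."""
    query = urlencode({
        "pageNumber": page_number,
        "pageSize": page_size,
        "searchType": "STARTSWITH",
        "fieldName": "SCIENTIFICNAME",
    })
    # quote() turns the spaces, including the trailing one, into %20
    return BASE_URL + quote(scientific_prefix + " ") + "?" + query

print(taxonomy_url("Streptococcus thermophilus"))
```

Exposing `page_number` also makes it visible that only one page of 100 results is requested per call.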

In [41]:
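def load_proteome(taxids, size=10, protein=("LDH",)):
    """
    Download from the EBI Proteins API the unreviewed proteins of the given taxids.

    Args:
        taxids: list of taxonomy ids
        size: maximum number of entries to return (-1 is used below to request all of them)
        protein: gene names to filter by; an empty list disables the filter

    Returns:
        str: the proteins in FASTA format
    """
    taxids_str = ",".join(str(x) for x in taxids)
    protein_str = ",".join(protein)
    print(taxids_str)
    requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=" + str(size) + "&taxid=" +\
                    taxids_str + "&reviewed=false"
    if protein:
        requestURL += "&gene=" + protein_str
    print(requestURL)
    r = requests.get(requestURL, headers={ "Accept" : "text/x-fasta"})

    # raise_for_status() raises requests.HTTPError when the request failed
    r.raise_for_status()

    proteome = r.text
    return proteome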

# We limit the number of taxids: 20 is the maximum for the web service, and the 19 kept by the slices below should be enough
termophilus_taxids = termophilus_taxa[0:19]
streptococcus_taxids = streptococcus_taxa[0:19]
print(streptococcus_taxids)
print(termophilus_taxids)

#streptococcus_proteome = load_proteome(streptococcus_taxids, -1, protein = ["LDH", "CAS2", "CAS3"])
#termophilus_proteome = load_proteome(termophilus_taxids, -1, protein = ["LDH", "CAS2", "CAS3"])

streptococcus_proteome_complete = load_proteome(streptococcus_taxids, -1, [])
termophilus_proteome_complete = load_proteome(termophilus_taxids, -1, [])



[1302, 1303, 1304, 1305, 1306, 1307, 1308, 1309, 1310, 1311, 1313, 1314, 1317, 1318, 1319, 1320, 1324, 1325, 1326]
[264199, 299768, 322159, 767463, 1042404, 1051074, 1073569, 1073570, 1091038, 1187956, 1263110, 1268061, 1408178, 1415776, 1423145, 1433288, 1433289, 1435972, 1435974]
1302,1303,1304,1305,1306,1307,1308,1309,1310,1311,1313,1314,1317,1318,1319,1320,1324,1325,1326
https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=-1&taxid=1302,1303,1304,1305,1306,1307,1308,1309,1310,1311,1313,1314,1317,1318,1319,1320,1324,1325,1326&reviewed=false
264199,299768,322159,767463,1042404,1051074,1073569,1073570,1091038,1187956,1263110,1268061,1408178,1415776,1423145,1433288,1433289,1435972,1435974
https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=-1&taxid=264199,299768,322159,767463,1042404,1051074,1073569,1073570,1091038,1187956,1263110,1268061,1408178,1415776,1423145,1433288,1433289,1435972,1435974&reviewed=false
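
The slices above keep only the first taxids because of the limit of 20 taxids per request. If every taxon were needed, one option would be to split the list into batches and issue one request per batch. This is a minimal sketch; the helper name `batches` is hypothetical and the loop that calls `load_proteome` is left commented out because it needs network access.

```python
def batches(items, batch_size=20):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical list of 45 taxonomy ids, used only to show the batch sizes
example_taxids = list(range(1, 46))
print([len(batch) for batch in batches(example_taxids)])

# Possible use with the function defined above (not executed here):
# proteome = "".join(load_proteome(batch, -1, []) for batch in batches(streptococcus_taxa))
```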


In [62]:
import pickle

def dump(obj, obj_name):
    """Pickle obj to the file '<obj_name>.bin'."""
    with open(obj_name + ".bin", mode="wb") as binary_file:
        pickle.dump(obj, binary_file)

def load(obj_name):
    """Return the object pickled in the file '<obj_name>.bin'."""
    with open(obj_name + ".bin", "rb") as binary_file:
        return pickle.load(binary_file)

dump(streptococcus_proteome_complete, "streptococcus_proteome_complete")
dump(termophilus_proteome_complete, "termophilus_proteome_complete")

In [63]:
print(termophilus_proteome_complete[0:1000])
print(streptococcus_proteome_complete[0:1000])

>tr|V6CG59|V6CG59_STRTN Proteolysis tag peptide encoded by tmRNA Strep_therm_CNRZ10 (Fragment) OS=Streptococcus thermophilus (strain ND03) OX=767463 GN=tmRNA Strep_therm_CNRZ10 PE=4 SV=1
AKNTNSYAVAA
>tr|V6CG54|V6CG54_STRTR Proteolysis tag peptide encoded by tmRNA Strep_therm (Fragment) OS=Streptococcus thermophilus MN-ZLW-002 OX=1187956 GN=tmRNA Strep_therm PE=4 SV=1
AKNTNSYAVAA
>tr|V6CG50|V6CG50_STRTR Proteolysis tag peptide encoded by tmRNA Strep_therm_CNRZ10 (Fragment) OS=Streptococcus thermophilus MTCC 5461 OX=1073570 GN=tmRNA Strep_therm_CNRZ10 PE=4 SV=1
AKNTNSYAVAA
>tr|V6BJW4|V6BJW4_STRTR Proteolysis tag peptide encoded by tmRNA Strep_therm_CNRZ10 (Fragment) OS=Streptococcus thermophilus MTCC 5460 OX=1073569 GN=tmRNA Strep_therm_CNRZ10 PE=4 SV=1
AKNTNSYAVAA
>tr|V6CE29|V6CE29_STRT1 Proteolysis tag peptide encoded by tmRNA Strep_therm_CNRZ10 (Fragment) OS=Streptococcus thermophilus (strain CNRZ 1066) OX=299768 GN=tmRNA Strep_therm_CNRZ10 PE=4 SV=1
AKNTNSYAVAA
>tr|A0A1L1QK15|A0A1L1Q
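
The FASTA headers printed above carry the accession, the taxonomy id (`OX=`) and the gene name (`GN=`). As an illustration of how those fields can be pulled apart, here is a sketch with a regular expression. The function name `parse_uniprot_header` is hypothetical, and the pattern only covers headers shaped like the ones shown in this output.

```python
import re

HEADER_PATTERN = re.compile(
    r"^>(?P<db>\w+)\|(?P<accession>\w+)\|(?P<entry>\S+) (?P<description>.*?)"
    r" OS=(?P<organism>.*?) OX=(?P<taxid>\d+)(?: GN=(?P<gene>.*?))?"
    r" PE=(?P<pe>\d+) SV=(?P<sv>\d+)$"
)

def parse_uniprot_header(header):
    """Return a dict with accession, organism, taxid and gene, or None if the header does not match."""
    match = HEADER_PATTERN.match(header.strip())
    if match is None:
        return None
    return {
        "accession": match.group("accession"),
        "organism": match.group("organism"),
        "taxid": int(match.group("taxid")),
        "gene": match.group("gene"),  # None when the header has no GN field
    }

# First header of the output above
example_header = (">tr|V6CG59|V6CG59_STRTN Proteolysis tag peptide encoded by tmRNA "
                  "Strep_therm_CNRZ10 (Fragment) OS=Streptococcus thermophilus (strain ND03) "
                  "OX=767463 GN=tmRNA Strep_therm_CNRZ10 PE=4 SV=1")
print(parse_uniprot_header(example_header))
```

Note that the gene name of this entry contains a space, so a pattern such as `GN=(\w*)` keeps only its first word.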

## Compute substitution rates

### Methods

In [52]:
import re

def proteome2dict(proteome_fasta):
    """
    Returns a dict whose keys are gene names (the upper-cased GN field of the header) and
    whose values are the lists of FASTA records found for that gene across all taxids.
    This is the basis for clustalw alignments and tree generation
    """
    proteome = {}
    key = None  # gene name of the record being read; None if its header has no GN field
    seq = ""
    for line in proteome_fasta.splitlines():
        if len(line) > 0:
            if line[0] == ">":
                # A new header closes the previous record
                if key is not None:
                    proteome.setdefault(key, []).append(seq)
                search_gene_name = re.search(r'GN=(\w*)', line)
                # Records without a gene name are skipped instead of being
                # stored under the gene of the previous record
                key = search_gene_name.group(1).upper() if search_gene_name else None
                seq = line + '\n'
            elif key is not None:
                seq += line + '\n'
    # Store the last record
    if key is not None:
        proteome.setdefault(key, []).append(seq)
    return proteome

In [80]:
# Phylo tree with clustalw. We need to measure the substitution rate.
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from Bio import Phylo
from io import StringIO
import os
from Bio.Align.Applications import ClustalwCommandline
CLUSTALW = r"./clustalw2"
assert os.path.isfile(CLUSTALW), "Clustal W executable missing"
plt.rcParams["figure.figsize"] = (20,30)
matplotlib.rc('font', size=12)
        
def compute_mean_subst_rate(proteome, verbose=False, show_tree=False):
    """
    Align the sequences in '<proteome>.fasta' with ClustalW and return the mean length of
    the non-zero branches of the tree written to '<proteome>.dnd', or -1 if there are none.
    """
    clustalw_cline = ClustalwCommandline(CLUSTALW, infile=proteome + ".fasta")
    stdout, stderr = clustalw_cline()
    with open(proteome + ".dnd", "r") as f:
        s_tree = f.read()
    branch_len = 0
    num_branches = 0
    # In the Newick string every branch length follows a colon
    search_branch_length = re.findall(r':(-?\d+\.?\d*)', s_tree)
    for branch_length in search_branch_length:
        if float(branch_length) != 0:
            branch_len += float(branch_length)
            num_branches += 1
    if num_branches > 0:
        rate = branch_len/num_branches
    else:
        rate = -1
    if verbose: print(branch_len, num_branches, rate)
    if show_tree:
        tree = Phylo.read(proteome + ".dnd", "newick")
        Phylo.draw(tree)
    return rate

def compute_subst_rates(proteome, proteome_keys, proteome_name, verbose=False):
    """
    Compute the mean substitution rate of every protein in proteome_keys that has at least
    three sequences in proteome. A file '<proteome_name>_<protein>.fasta' is written for
    each of them.

    Returns:
        dict of string, float: mean branch length by protein
    """
    subst_rates = {}
    for protein in proteome_keys:
        if verbose: print(protein)
        # Only for proteins with enough sequences to make a tree
        if len(proteome[protein]) >= 3:
            protein_sequence = "".join(proteome[protein])
            fasta_file_name = proteome_name + "_" + protein
            if verbose: print(protein_sequence)
            with open(fasta_file_name + ".fasta", "w") as f:
                f.write(protein_sequence)
            mean_subst_rate = compute_mean_subst_rate(fasta_file_name)
            subst_rates[protein] = mean_subst_rate
    return subst_rates

### Fasta to dictionary

In [64]:
#Proteomes in fasta to dictionaries
RESTART = True
VERBOSE = True

if RESTART:
    termophilus_proteome_fasta = load("termophilus_proteome_complete")
    streptococcus_proteome_fasta = load("streptococcus_proteome_complete")

#print(termophilus_proteome_fasta)
proteome_termophilus = proteome2dict(termophilus_proteome_fasta)
proteome_streptococcus = proteome2dict(streptococcus_proteome_fasta)

dump(proteome_termophilus, "proteome_termophilus")
dump(proteome_streptococcus, "proteome_streptococcus")

if VERBOSE: print(list(proteome_termophilus.keys())[0:100])
if VERBOSE: print(list(proteome_streptococcus.keys())[0:100])

['TMRNA', 'BN551_00358', 'GALT', 'LACZ', 'SBCD', 'GALR', 'DNAN', 'STU0007', 'FTSH', 'MREC', 'ARAT', 'PURL', 'STU0044', 'STU0052', 'STU0053', 'TAG', 'STU0075', 'STU0082', 'LABC', 'STU0110', 'STU0113', 'ILVA', 'STU0161', 'STU0182', 'STU0202', 'STU0208', 'STU0251', 'STU0258', 'STU0267', 'CBIM', 'STU0297', 'TRKH2', 'DNAB', 'SERS', 'STU0330', 'STU0334', 'STU0338', 'STU0339', 'METB1', 'STU0358', 'LIVM', 'HPF', 'FABF', 'FRUR', 'STU0422', 'STU0435', 'PHOH', 'STU0448', 'METS', 'STU0452', 'PRTM', 'ARGB', 'STU0468', 'STU0473', 'STU0474', 'STU0475', 'FTSW', 'DNAH', 'NAGA', 'STU0510', 'STU0516', 'STU0539', 'STU0551', 'STU0557', 'STU0565', 'HLYIII', 'STU0580', 'STU0595', 'MURN', 'NNRD', 'RNR', 'STU0631', 'STU0636', 'AROE', 'MIP', 'CAS2', 'STU0665', 'STU0668', 'STU0672', 'STU0675', 'STU0678', 'STU0679', 'STU0681', 'STU0693', 'STU0695', 'GPMC', 'STU0702', 'STU0704', 'LEMA', 'STU0721', 'APBE', 'DLTX', 'STU0811', 'STU0819', 'STU0829', 'ADCA', 'MUR2', 'STU0876', 'STU0877', 'STHIM']
['SRTA', 'ATLH', 'DBLB

### Substitution rates

In [81]:
# Substitution rates per protein in each group
RESTART = True
VERBOSE = True
# Maximum number of proteins to process
LIMIT = 20 

if RESTART:
    proteome_termophilus = load("proteome_termophilus")
    proteome_streptococcus = load("proteome_streptococcus")

# Compute keys to process: only the keys that are included in both groups
proteome_keys = [key for key in proteome_termophilus if key in proteome_streptococcus]

if VERBOSE: print(proteome_keys[0:LIMIT])
print("Proteins we need to process:", len(proteome_keys))
limit =  min(len(proteome_keys), LIMIT) 
print("Proteins we want to process:", limit)

# Compute branch lengths
subst_rates_groups = {}
subst_rates_groups["termophilus"] = compute_subst_rates(proteome_termophilus, proteome_keys[0:limit],
                                                        "termophilus", False)
subst_rates_groups["streptococcus"] = compute_subst_rates(proteome_streptococcus, proteome_keys[0:limit], 
                                                          "streptococcus", False)

dump(subst_rates_groups, "subst_rates_groups")
if VERBOSE: print(subst_rates_groups)  

['TMRNA', 'GALT', 'LACZ', 'SBCD', 'GALR', 'DNAN', 'FTSH', 'MREC', 'ARAT', 'PURL', 'TAG', 'LABC', 'ILVA', 'CBIM', 'TRKH2', 'DNAB', 'SERS', 'LIVM', 'HPF', 'FABF']
Proteins we need to process: 964
Proteins we want to process: 20
{'termophilus': {'TMRNA': -1, 'GALT': 0.014773333333333333, 'LACZ': 0.00292, 'SBCD': 0.00245625, 'FTSH': 0.00262, 'ILVA': 0.0024, 'SERS': 0.001318888888888889, 'HPF': 0.01282, 'FABF': 0.31672}, 'streptococcus': {'TMRNA': 0.10939888888888888, 'GALT': 0.011721821086261982, 'LACZ': 0.028357828947368432, 'SBCD': 0.05702195652173911, 'GALR': 0.0795148, 'DNAN': 0.01873888888888889, 'FTSH': 0.017297005347593573, 'MREC': 0.06296962962962963, 'ARAT': 0.17196846153846151, 'PURL': 0.012450666666666667, 'TAG': 0.02267730769230769, 'ILVA': 0.012551562499999992, 'CBIM': 0.10676333333333335, 'TRKH2': 0.16780555555555557, 'DNAB': 0.09844512820512821, 'SERS': 0.008190401785714287, 'LIVM': 0.027198333333333328, 'HPF': 0.03110622641509432, 'FABF': 0.029105231788079465}}


## Selection of first set of target proteins
At this step we calculate the ratios and their mean, and select the most extreme ratios (two-tailed, 80% percentile). The code below approximates the tails by keeping the ratios that lie more than 0.8 standard deviations away from the mean.
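
A selection based directly on percentiles could look like the following sketch. It assumes that "two-tailed 80%" means keeping the proteins outside the central 80% of the ratios, that is, below the 10th or above the 90th percentile; that reading, the function name `select_extreme_ratios` and the toy data are assumptions for illustration only.

```python
import statistics

def select_extreme_ratios(ratios, central_mass=0.80):
    """Return the proteins whose ratio falls outside the central `central_mass` of the ratios."""
    values = sorted(ratios.values())
    # With n=100, quantiles() returns the 99 percentile cut points (Python 3.8+)
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    tail = (1.0 - central_mass) / 2.0
    low = cuts[round(tail * 100) - 1]
    high = cuts[round((1.0 - tail) * 100) - 1]
    return sorted(p for p, r in ratios.items() if r < low or r > high)

# Hypothetical ratios 1.0, 2.0, ..., 21.0 for proteins P1 ... P21
toy_ratios = {"P" + str(i): float(i) for i in range(1, 22)}
print(select_extreme_ratios(toy_ratios))
```

Unlike a threshold based on the standard deviation, this always selects roughly the same fraction of proteins, whatever the shape of the distribution.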

### Methods

In [95]:
# Compute branch ratios
# Compute mean of branch ratios and standard deviation
# Obtain the most extreme values.
# These are the proteins whose rate may have slowed down or sped up with respect to the genus
import statistics as stats

def compute_branch_ratios(subst_rates_groups, group1, group2, verbose=False):
    """ 
    For every protein in group1 that has a counterpart in group2, calculate the ratio of
    mean branch lengths (group1 / group2). Proteins whose rate is missing (-1) in either
    group are left out of the result.
    
    Returns:
        dict of string, float: ratios by protein
    """
    ratios = {}
    for protein, rate1 in subst_rates_groups[group1].items():
        if protein in subst_rates_groups[group2]:
            rate2 = subst_rates_groups[group2][protein]
            # The small constant avoids a division by zero
            ratio = rate1 / (rate2 + 0.00001)
            if rate1 != -1 and rate2 != -1:
                ratios[protein] = ratio
            if verbose: print(group1, protein, rate1, rate2, ratio)
    return ratios

def compute_mean_std(ratios, verbose=False):
    """
    Calculate mean and std for ratios
    """
    ratios_list =  list(ratios.values())
    if verbose: print("ratios_list",ratios_list)
    ratios_mean = stats.mean(ratios_list)
    ratios_stdev = stats.stdev(ratios_list)
    return ratios_mean, ratios_stdev

def filter_target_proteins(ratios, mean, std, n_std, verbose=False):
    """
    Select the proteins whose ratio lies more than n_std standard deviations
    above or below the mean
    """
    proteins = []
    for protein in ratios.keys():
        if verbose: print(protein, ratios[protein], mean, std)
        if ratios[protein] > mean + n_std * std or ratios[protein] < mean - n_std * std:
            proteins.append(protein)
    return proteins

### Selection of target proteins

In [101]:
RESTART = True
VERBOSE = False

if RESTART:
    subst_rates_groups = load("subst_rates_groups")

rat = compute_branch_ratios(subst_rates_groups, "termophilus", "streptococcus", VERBOSE)   
if VERBOSE: print(rat)

mean, std = compute_mean_std(rat, VERBOSE)
if VERBOSE: print("\nMean", mean, "Standard deviation", std, "\n")

prot = filter_target_proteins(rat, mean, std, 0.8, VERBOSE)

print("Target proteins:", prot)

Target proteins: ['FABF']
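
The selection of FABF can be reproduced by hand from the rates printed in the substitution rates output. The sketch below copies the eight proteins that have a valid rate in both groups and applies the same ratio and the same 0.8 standard deviation threshold; the variable names are illustrative.

```python
import statistics as stats

# Rates copied from the printed output (proteins with a rate other than -1 in both groups)
thermophilus_rates = {
    "GALT": 0.014773333333333333, "LACZ": 0.00292, "SBCD": 0.00245625,
    "FTSH": 0.00262, "ILVA": 0.0024, "SERS": 0.001318888888888889,
    "HPF": 0.01282, "FABF": 0.31672,
}
genus_rates = {
    "GALT": 0.011721821086261982, "LACZ": 0.028357828947368432,
    "SBCD": 0.05702195652173911, "FTSH": 0.017297005347593573,
    "ILVA": 0.012551562499999992, "SERS": 0.008190401785714287,
    "HPF": 0.03110622641509432, "FABF": 0.029105231788079465,
}

EPSILON = 0.00001  # same constant as in compute_branch_ratios
N_STD = 0.8

ratios = {p: thermophilus_rates[p] / (genus_rates[p] + EPSILON) for p in thermophilus_rates}
mean = stats.mean(list(ratios.values()))
std = stats.stdev(list(ratios.values()))

# Keep the proteins further than N_STD standard deviations from the mean
selected = [p for p, r in ratios.items() if abs(r - mean) > N_STD * std]

for protein, ratio in ratios.items():
    print(protein, round(ratio, 3))
print("mean", round(mean, 3), "std", round(std, 3))
print("Target proteins:", selected)
```

The FABF ratio is close to 11 while the other seven ratios stay below 1.3, so FABF alone dominates both the mean and the standard deviation. With only 20 proteins processed, the result should be read as a smoke test of the pipeline rather than as a finding.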


## Obtain protein pathways from servers
At this stage we obtain the pathways of the proteins in each set.

## Selection of final set of target proteins
At this stage we select the proteins that belong to the same pathways in Lactobacillus and Streptococcus. These are the proteins that could be affected by coevolution.
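
Once each target protein has been mapped to its pathways, the final selection reduces to a set intersection. The sketch below assumes the mapping is available as a dict from protein to a set of pathway identifiers; the function name and all the identifiers in the example are placeholders, not real annotations.

```python
def shared_pathway_proteins(pathways_a, pathways_b):
    """
    Given two dicts protein -> set of pathway ids, return a dict
    pathway id -> (proteins of the first set, proteins of the second set)
    restricted to the pathways present in both sets.
    """
    def by_pathway(mapping):
        inverted = {}
        for protein, pathways in mapping.items():
            for pathway in pathways:
                inverted.setdefault(pathway, set()).add(protein)
        return inverted

    inverted_a = by_pathway(pathways_a)
    inverted_b = by_pathway(pathways_b)
    common = sorted(inverted_a.keys() & inverted_b.keys())
    return {pw: (sorted(inverted_a[pw]), sorted(inverted_b[pw])) for pw in common}

# Placeholder data: S* stand for Streptococcus targets, L* for Lactobacillus targets
streptococcus_pathways = {"S1": {"pathway_1", "pathway_2"}, "S2": {"pathway_3"}}
lactobacillus_pathways = {"L1": {"pathway_2"}, "L2": {"pathway_2", "pathway_4"}}
print(shared_pathway_proteins(streptococcus_pathways, lactobacillus_pathways))
```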

## Generate document outputs

In [24]:
%%bash
#cd /Users/nandoide/Desktop/uni/STRBI.practical
jupyter nbconvert --to=latex --template=~/report.tplx coevolution.ipynb 1> /dev/null
pdflatex -shell-escape coevolution 1> /dev/null

[NbConvertApp] Converting notebook coevolution.ipynb to latex
[NbConvertApp] Writing 41101 bytes to coevolution.tex
