# 01-Generating artificial signal peptides

The number of possible amino acid sequences grows exponentially with peptide length, so exhaustively testing every candidate is not feasible. To avoid this combinatorial explosion, we need an algorithm that narrows the search space to the sequences most likely to function as signal peptides. Approaches from bioinformatics, machine learning, and statistical analysis can all be used for this.

One common approach is to analyze large sets of known signal peptides with bioinformatics methods and identify patterns or features associated with signal peptide function. These features can then be used to predict whether a novel sequence will function as a signal peptide.

Machine learning algorithms can also be trained on large sets of known signal peptides and then used to classify novel sequences. Common algorithms for this purpose include decision trees, random forests, and neural networks.

Another approach is to use statistical analysis to identify the regions of the peptide sequences that are most likely to function as signal peptides. This can be done by analyzing the frequency and distribution of different amino acids in known signal peptides and identifying those that are over-represented or under-represented in these sequences.
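
The frequency analysis described above can be sketched as a per-position frequency table. This is a minimal illustration, not the method used to build the matrix loaded later in this notebook: the sequences in `toy_peptides`, the function name `position_frequencies`, and the choice to pad short sequences with `-` are all assumptions made for the example.

```python
from collections import Counter

import pandas as pd

# Toy, made-up sequences used only for illustration (not real signal peptides).
toy_peptides = ["MKLA", "MRLS", "MKF", "MKLS"]

# 20 standard amino acids plus '-' as a gap/padding symbol (assumed convention).
alphabet = "ACDEFGHIKLMNPQRSTVWY-"


def position_frequencies(peptides, alphabet):
    """Return a positions x symbols table of relative frequencies."""
    max_len = max(len(p) for p in peptides)
    # Pad shorter sequences so every peptide contributes one symbol per position.
    padded = [p.ljust(max_len, "-") for p in peptides]
    rows = []
    for position in range(max_len):
        counts = Counter(p[position] for p in padded)
        rows.append([counts[symbol] / len(padded) for symbol in alphabet])
    return pd.DataFrame(rows, columns=list(alphabet))


freqs = position_frequencies(toy_peptides, alphabet)
print(freqs.loc[:, (freqs > 0).any()].round(2))
```

Each row of `freqs` sums to 1, so a row can be read as a probability distribution over symbols at that position. Residues with a high frequency at a position are candidates for being over-represented there.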

In summary, by developing an algorithm, we can narrow down the search space and identify the sequences that are most likely to function as signal peptides, thus avoiding combinatorial explosion. The algorithm we are showcasing here is based on a combination of bioinformatics, machine learning and statistical analysis.

# Using the NumPy random library

In [16]:
import numpy as np
import pandas as pd

Let's import our df_pwn, which was made in a previous notebook:

In [19]:
df_pwn = pd.read_csv('../data/02_all_signal_peptides/df_pwn_68_positions.csv')
df_pwn

Unnamed: 0,A,C,D,E,F,G,H,I,K,L,...,N,P,Q,R,S,T,V,W,Y,-
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.069811,0.001887,0.002830,0.001887,0.032075,0.021698,0.064151,0.037736,0.208491,0.084906,...,0.022642,0.028302,0.058491,0.193396,0.033962,0.018868,0.052830,0.015094,0.028302,0.000000
2,0.044340,0.007547,0.006604,0.007547,0.152830,0.047170,0.019811,0.050000,0.018868,0.190566,...,0.021698,0.052830,0.020755,0.053774,0.131132,0.053774,0.060377,0.025472,0.022642,0.000000
3,0.057547,0.001887,0.004717,0.004717,0.089623,0.019811,0.016038,0.064151,0.058491,0.175472,...,0.033962,0.041509,0.043396,0.056604,0.163208,0.100000,0.021698,0.010377,0.026415,0.000000
4,0.072642,0.004717,0.004717,0.007547,0.054717,0.032075,0.019811,0.077358,0.053774,0.155660,...,0.041509,0.040566,0.036792,0.045283,0.138679,0.097170,0.043396,0.023585,0.032075,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
63,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000943,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.999057
64,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000
65,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000
66,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000


In [32]:
amino_acids = list(df_pwn.columns.values)

In [28]:
# One list of per-symbol probabilities for each position (row) of df_pwn
list_of_probabilities = df_pwn.values.tolist()

In [31]:
list_of_probabilities[2]

[0.0443396226415094,
 0.0075471698113207,
 0.0066037735849056,
 0.0075471698113207,
 0.1528301886792452,
 0.0471698113207547,
 0.0198113207547169,
 0.05,
 0.0188679245283018,
 0.190566037735849,
 0.0122641509433962,
 0.0216981132075471,
 0.0528301886792452,
 0.020754716981132,
 0.0537735849056603,
 0.1311320754716981,
 0.0537735849056603,
 0.060377358490566,
 0.0254716981132075,
 0.0226415094339622,
 0.0]
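
Because `np.random.choice` rejects a probability vector that does not sum to 1 within a small tolerance, it can be worth checking the rows before sampling. The sketch below uses a hypothetical three-symbol matrix (`toy_probabilities`) in place of `list_of_probabilities`, and assumes no row sums to zero.

```python
import numpy as np

# Hypothetical 3-position, 3-symbol matrix standing in for list_of_probabilities.
toy_probabilities = [
    [1.0, 0.0, 0.0],
    [0.333333, 0.333333, 0.333333],  # rounded values sum to 0.999999
    [0.2, 0.3, 0.5],
]

matrix = np.asarray(toy_probabilities, dtype=float)
row_sums = matrix.sum(axis=1)

# Flag rows whose total differs from 1 by more than an absolute tolerance.
off_rows = np.flatnonzero(~np.isclose(row_sums, 1.0, rtol=0, atol=1e-8))
print("rows needing renormalisation:", off_rows.tolist())

# Renormalise so that every row sums to 1 up to floating-point error.
normalised = matrix / row_sums[:, np.newaxis]
```

The tolerance of `1e-8` is an illustrative choice; values read back from a CSV with six decimal places may drift by more than that.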

In [91]:
def generate_artificial_peptide(list_of_probabilities: list, amino_acids: list) -> str:
    """
    Generate an artificial peptide by sampling one symbol per position.

    Parameters
    ----------
    list_of_probabilities : list of list of float
        One row per position; each row holds the probability of every symbol
        in `amino_acids` at that position and must sum to 1.
    amino_acids : list of str
        Symbols to sample from, in the same order as the probability columns.
        The gap symbol '-' marks the end of the peptide.

    Returns
    -------
    str
        Generated artificial peptide

    Notes
    -----
    The number of rows sets the maximum peptide length; sampling stops at the
    first gap symbol.
    """
    out_str = ''
    for position_probabilities in list_of_probabilities:
        # Draw one symbol according to this position's distribution
        artificial_amino_acid = np.random.choice(amino_acids, p=position_probabilities)

        # A gap means the peptide has ended
        if artificial_amino_acid == '-':
            break

        out_str += artificial_amino_acid
    return out_str


In [98]:
def generate_artificial_peptides(list_of_probabilities: list, amino_acids: list, n_peptides: int) -> pd.DataFrame:
    """
    Generate a dataframe of artificial peptides sampled position by position.

    Parameters
    ----------
    list_of_probabilities : list of list of float
        One row per position; each row holds the probability of every symbol
        in `amino_acids` at that position and must sum to 1.
    amino_acids : list of str
        Symbols to sample from, in the same order as the probability columns.
    n_peptides : int
        Number of peptides to generate

    Returns
    -------
    pd.DataFrame
        Dataframe with one generated peptide per row in the 'sequence' column
        and its length in the 'length' column

    Notes
    -----
    The number of rows in the probability matrix sets the maximum peptide length.
    """
    # Sample each peptide independently
    artificial_peptides = [
        generate_artificial_peptide(list_of_probabilities, amino_acids)
        for _ in range(n_peptides)
    ]

    df = pd.DataFrame(artificial_peptides, columns=['sequence'])
    df['length'] = [len(peptide) for peptide in artificial_peptides]
    return df
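

The functions above draw from NumPy's global random state, so every run produces different peptides. If reproducible output is wanted, one option is to pass a seeded `np.random.default_rng` generator. The sketch below is an assumption about how that could look, not part of the original notebook: `sample_peptide`, `sample_many`, and the toy inputs are hypothetical names and values.

```python
import numpy as np


def sample_peptide(probabilities, symbols, rng):
    """Sample one peptide, stopping at the first gap symbol '-'."""
    residues = []
    for position_probabilities in probabilities:
        symbol = rng.choice(symbols, p=position_probabilities)
        if symbol == "-":
            break
        residues.append(str(symbol))
    return "".join(residues)


# Hypothetical toy inputs for illustration only.
toy_symbols = ["M", "K", "L", "-"]
toy_probabilities = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.2, 0.6, 0.2],
    [0.0, 0.0, 0.0, 1.0],
]


def sample_many(n_peptides, seed):
    """Sample n_peptides peptides from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return [sample_peptide(toy_probabilities, toy_symbols, rng) for _ in range(n_peptides)]


run_a = sample_many(5, seed=42)
run_b = sample_many(5, seed=42)
print(run_a == run_b)  # the same seed gives the same peptides
```

Passing the generator explicitly keeps the sampling independent of any other code that touches the global random state.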


In [105]:
df_100_artificial = generate_artificial_peptides(list_of_probabilities, amino_acids, n_peptides= 100)
df_100_artificial

Unnamed: 0,sequence,length
0,MRLKSFLALLCATMAATCAVV,21
1,MKLAATLLLALLLLLLLRITII,22
2,MRKVNPAWLSATVSASSH,18
3,MRTSPILITLLLVLSGMAV,19
4,MKFASALLILTFGLGA,16
...,...,...
95,MRRTCRSLSASSLLQV,16
96,MRTFHKARVFALSIATA,17
97,MKRYYGLLLLLLVTVVYAPL,20
98,MIMSVLFLVLFSLLTSIAA,19


In [106]:
describe = df_100_artificial["sequence"].describe()
describe 

count                       100
unique                      100
top       MRLKSFLALLCATMAATCAVV
freq                          1
Name: sequence, dtype: object

In [107]:
describe = df_100_artificial["length"].describe()
describe 

count    100.000000
mean      17.860000
std        1.869586
min       14.000000
25%       17.000000
50%       18.000000
75%       19.000000
max       23.000000
Name: length, dtype: float64
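
The length statistics above can be cross-checked against the probability matrix itself. Since sampling stops at the first gap, the expected length equals the sum, over positions, of the probability that no gap has been drawn up to and including that position. The sketch below uses a hypothetical gap column (`toy_gap_probabilities`); with the real data, the `-` column of `df_pwn` would presumably play that role.

```python
import numpy as np

# Hypothetical gap probabilities per position (the '-' column of a matrix like df_pwn).
toy_gap_probabilities = np.array([0.0, 0.0, 0.5, 1.0])

# Probability that the peptide is still growing after each position,
# i.e. that no gap has been drawn up to and including that position.
survival = np.cumprod(1.0 - toy_gap_probabilities)

# The expected length is the sum of those survival probabilities.
expected_length = survival.sum()
print(expected_length)  # 2.5
```

In this toy case the peptide has length 2 or 3 with equal probability, which matches the expected length of 2.5.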

In [2]:
from Bio import SeqIO
import pandas as pd
from itertools import product

In [3]:
consensus_list = 'MKLSSLLLLLLLLLLLLALA'

In [4]:
aa_used = "".join(set(consensus_list[1:]))
aa_used

'ALKS'

In [7]:
aa1 = list(aa_used)

In [8]:
list_of_20_peptides = []

for i in range(0,20): 
    list_of_20_peptides.append(aa1)

In [9]:
len(list_of_20_peptides)

20

In [None]:
%%time
# aa1 holds the four residues; note that 4**19 (about 2.7e11) tuples will not fit in memory as a list
all_combinations = list(product(aa1, repeat=19))


In [None]:
len(all_combinations)

In [1]:
4**19


274877906944
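
The size of the search space can be computed without enumerating it: with k residues and n positions there are k**n sequences. The sketch below counts them and then inspects a few candidates lazily with `itertools.islice`; the names `residues`, `n_positions`, and `first_three` are illustrative choices, not from the original notebook.

```python
from itertools import islice, product

residues = ["A", "K", "L", "S"]  # the four residues found in the consensus above
n_positions = 19                 # positions after the leading M

# Size of the full search space, computed without enumerating it.
n_combinations = len(residues) ** n_positions
print(f"{n_combinations:,}")  # 274,877,906,944

# product() is lazy, so a few candidates can be inspected without building the full list.
first_three = [
    "".join(combo)
    for combo in islice(product(residues, repeat=n_positions), 3)
]
print(first_three)
```

Even with only four residues the space holds roughly 2.7e11 sequences, which is why sampling from a probability matrix is preferred here over exhaustive enumeration.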