# Generating artificial signal peptides

The number of possible amino acid sequences grows exponentially with peptide length (20^n for a peptide of n residues), so an exhaustive search is infeasible. To avoid this combinatorial explosion, an algorithm is needed that narrows down the search space and identifies the sequences most likely to function as signal peptides. Several computational approaches can contribute to this: bioinformatics, machine learning, and statistical analysis.

One common approach is to use bioinformatics methods to analyze large sets of data on known signal peptides and identify patterns or features that are associated with signal peptide function. These features can then be used to predict the function of novel sequences.

Machine learning algorithms can also be used to predict signal peptides. Once trained on large sets of known signal peptides, they can predict the function of novel sequences. Common choices for this purpose include decision trees, random forests, and neural networks.

Another approach is to use statistical analysis to identify the regions of the peptide sequences that are most likely to function as signal peptides. This can be done by analyzing the frequency and distribution of different amino acids in known signal peptides and identifying those that are over-represented or under-represented in these sequences.
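
The frequency analysis described above can be sketched as a position-specific frequency matrix. This is a minimal illustration, not necessarily how the matrix in this notebook was built: the toy sequences, the function name position_frequency_matrix, and the use of '-' as a padding symbol are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy sequences; real signal peptides would come from a data file.
sequences = ["MKLA", "MRLS", "MKFA"]
# 20 standard amino acids plus '-' as a gap/padding symbol.
alphabet = list("ACDEFGHIKLMNPQRSTVWY") + ["-"]


def position_frequency_matrix(seqs, alphabet, n_positions):
    """Return a table with one row per position and one column per symbol."""
    # Pad short sequences with '-' and cut long ones to n_positions.
    padded = [s.ljust(n_positions, "-")[:n_positions] for s in seqs]
    rows = []
    for pos in range(n_positions):
        counts = pd.Series([p[pos] for p in padded]).value_counts()
        # Symbols never seen at this position get a frequency of 0.
        rows.append(counts.reindex(alphabet, fill_value=0) / len(padded))
    return pd.DataFrame(rows).reset_index(drop=True)


pfm = position_frequency_matrix(sequences, alphabet, n_positions=5)
```

Each row of pfm sums to 1, so it can be used directly as a sampling distribution for that position.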

In summary, an algorithm that narrows the search space to the sequences most likely to function as signal peptides avoids the combinatorial explosion. The algorithm showcased here combines bioinformatics, machine learning, and statistical analysis.

# Using random libraries

In [1]:
import numpy as np
import pandas as pd
import random
import os

Let's import our df_pwn, which was made in a previous notebook:

In [34]:
from google.colab import drive 
drive.mount('/content/home')

Drive already mounted at /content/home; to attempt to forcibly remount, call drive.mount("/content/home", force_remount=True).


In [35]:
df_pwn =  pd.read_csv(os.getcwd() + '/home/MyDrive/DTU-MASTER/DTU-Sem4/Thesis/sigpep/df_pwn_68_positions.csv')
df_pwn

Unnamed: 0,A,C,D,E,F,G,H,I,K,L,...,N,P,Q,R,S,T,V,W,Y,-
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.069877,0.001889,0.002833,0.001889,0.032106,0.021719,0.064212,0.037771,0.208687,0.084986,...,0.022663,0.028329,0.058546,0.193579,0.033994,0.018886,0.052880,0.014164,0.028329,0.000000
2,0.044381,0.007554,0.006610,0.007554,0.152975,0.047214,0.019830,0.050047,0.018886,0.190746,...,0.020774,0.052880,0.020774,0.053824,0.131256,0.053824,0.060434,0.025496,0.022663,0.000000
3,0.057602,0.001889,0.004721,0.004721,0.089707,0.019830,0.016053,0.064212,0.058546,0.175637,...,0.033994,0.041549,0.043437,0.056657,0.163362,0.100094,0.021719,0.010387,0.026440,0.000000
4,0.072710,0.004721,0.004721,0.007554,0.054769,0.032106,0.019830,0.077432,0.053824,0.155807,...,0.041549,0.040604,0.036827,0.045326,0.138810,0.096317,0.043437,0.023607,0.032106,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.001889,0.000944,0.000000,0.000000,0.000000,0.997167
59,0.000000,0.000000,0.000000,0.000944,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000944,0.000000,0.000000,0.000000,0.998111
60,0.000944,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000944,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.998111
61,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000


In [36]:
amino_acids = list(df_pwn.columns.values)

In [37]:
list_of_probabilities = []
for i in range(len(df_pwn)): 
    list_of_probabilities.append(df_pwn.loc[i, :].values.tolist())

In [38]:
def generate_artificial_peptide(list_of_probabilities: np.ndarray, amino_acids: np.ndarray, max_length=22) -> str:
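    """
    Generate an artificial peptide by sampling one amino acid per position.
    
    Parameters:
    ----------
    list_of_probabilities : numpy.ndarray
        2-D array (positions x amino acids); each row holds the probabilities
        of the amino acids at that position and must sum to 1
    amino_acids : numpy.ndarray
        1-D array of amino acids, in the same order as the probability columns.
    max_length : int
        Currently unused; sampling runs over every row of list_of_probabilities.
        
    Returns:
    -------
    str
        Generated artificial peptide
        
    Notes:
    ------
    Sampling stops at the first '-' drawn, so the peptide can be shorter than
    the number of rows in list_of_probabilities.
    """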
    out_str = ''
    for i in range(len(list_of_probabilities)):
        # make synthetic signal peptide
        artificial_amino_acid = list(np.random.choice(amino_acids, 1, p=list_of_probabilities[i]))

        if artificial_amino_acid == ['-']: 
            break

        out_str += artificial_amino_acid[0]
    return out_str
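
The sampling step can be made reproducible by passing a seeded NumPy Generator. The sketch below is self-contained and uses a hypothetical four-symbol alphabet and toy probability matrix rather than df_pwn; sample_peptide is an illustrative name, not a function from this notebook.

```python
import numpy as np


def sample_peptide(prob_matrix, alphabet, rng):
    """Draw one residue per position; stop at the first gap symbol '-'."""
    residues = []
    for row in prob_matrix:
        residue = rng.choice(alphabet, p=row)
        if residue == "-":
            break
        residues.append(str(residue))
    return "".join(residues)


# Hypothetical toy data: 4 positions x 4 symbols, each row sums to 1.
alphabet = ["M", "K", "L", "-"]
toy_matrix = np.array([
    [1.0, 0.0, 0.0, 0.0],  # position 0 is always M
    [0.0, 0.5, 0.5, 0.0],  # position 1 is K or L
    [0.0, 0.0, 0.5, 0.5],  # position 2 is L or the peptide ends
    [0.0, 0.0, 0.0, 1.0],  # position 3 always ends the peptide
])

rng = np.random.default_rng(seed=0)
peptides = [sample_peptide(toy_matrix, alphabet, rng) for _ in range(5)]
```

With a fixed seed the same peptides are drawn on every run, which helps when comparing model predictions across sessions.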


In [39]:
def add_dunder_tail(peptide: str, max_lenght: int = 22) -> str:
    '''Pad a peptide with '-' up to max_lenght characters.

    Peptides that are already max_lenght or longer are returned unchanged.
    '''
    if len(peptide) < max_lenght:
        difference = max_lenght - len(peptide)
        sequence = peptide + ('-' * difference)
    else:
        sequence = peptide

    return sequence
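
The same padding can be expressed with the built-in str.ljust. This is a sketch of an equivalent one-liner; pad_peptide is a hypothetical name used only for illustration.

```python
def pad_peptide(peptide: str, max_len: int = 22, fill: str = "-") -> str:
    """Right-pad peptide with fill up to max_len; longer peptides are returned unchanged."""
    return peptide.ljust(max_len, fill)


padded = pad_peptide("MKLA", max_len=8)
unchanged = pad_peptide("MKLAVLSA", max_len=4)
```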

In [40]:
def generate_artificial_peptides(list_of_probabilities: np.ndarray, amino_acids: np.ndarray, n_peptides: int, max_len = 50) -> pd.DataFrame:
    """
    Generate a dataframe of artificial peptides based on a list of probabilities and amino acids.
    
    Parameters:
    ----------
    list_of_probabilities : numpy.ndarray
        2-D array of probability of amino acids in the peptide
    amino_acids : numpy.ndarray
        1-D array of amino acids.
    n_peptides : int
        Number of peptides to generate
        
    Returns:
    -------
    pd.DataFrame
        Dataframe of generated artificial peptides with 'sequence' as column
        
    Notes:
    ------
    The length of the probability array should be same as the length of the peptide.
    """
    artificial_peptides = []
    lengths = [] 
    for i in range(n_peptides): 
        peptide = generate_artificial_peptide(list_of_probabilities,amino_acids, max_length=max_len)
        if len(peptide) <= max_len:
            peptide_w_tail = add_dunder_tail(peptide, max_lenght = max_len)
        else: 
            continue
        
        # save
        lengths.append(len(peptide))                                     
        artificial_peptides.append(peptide_w_tail)

    df = pd.DataFrame(artificial_peptides, columns =['sequence'])
    df['length'] = lengths
    return df


In [41]:
df_signalPP = pd.read_csv('/content/home/MyDrive/DTU-MASTER/DTU-Sem4/Thesis/sigpep/signal_peptides_ML.csv')
df_signalPP

Unnamed: 0,sequence,length,MM_N_peptide_abundance
0,MMVAWWSLFLYGLQVAAPAL,20,1.000000
1,MEAFNLHNFLSSLYILLPFVILANPVH,27,0.417923
2,MLRVSAIFMACLLLATAA,18,0.339312
3,MAVRIARFLGLSTVAYLALANGID,24,0.276919
4,MVSFSSCLRALALGSSVLAVQPVL,24,0.218331
...,...,...,...
1056,MQVKLFYTLALWAPILVS,18,0.000000
1057,MKSLIWALPFIPLAY,15,0.000000
1058,MWPTRSLSSLFFLSLALGSPVS,22,0.000000
1059,MLLPRLSSLLCLAGLATMPVAN,22,0.000000


In [42]:
describe = df_signalPP["sequence"].describe()
describe 

count                     1061
unique                    1058
top       MMVAWWSLFLYGLQVAAPAL
freq                         3
Name: sequence, dtype: object

In [43]:
describe = df_signalPP["length"].describe()
describe 

count    1061.000000
mean       21.388313
std         6.170687
min        12.000000
25%        18.000000
50%        20.000000
75%        24.000000
max        68.000000
Name: length, dtype: float64

In [44]:
def split_peptides_sequences(df_signalPP:pd.DataFrame): 
    '''Split each AA for each position'''
    peptides_split = []
    for k,v in df_signalPP.iterrows(): 
        sequence = []
        for seq in v['sequence']: 
            sequence.append(seq)
        peptides_split.append(sequence)
    
    # make a dataframe
    new_peptides = pd.DataFrame(peptides_split)

    return new_peptides
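
For larger tables, the row-by-row loop can be replaced by mapping each sequence to a list of characters. The sketch below uses a hypothetical two-row DataFrame and is only one possible alternative.

```python
import pandas as pd

# Hypothetical toy input with sequences of different lengths.
toy = pd.DataFrame({"sequence": ["MKL", "MRLSA"]})

# One column per residue; shorter sequences are padded with missing values.
split = pd.DataFrame(toy["sequence"].map(list).tolist(), index=toy.index)
```

As with the loop version, positions beyond the end of a shorter sequence are left as missing values.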

In [45]:
new_peptides = split_peptides_sequences(df_signalPP)
new_peptides

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,58,59,60,61,62,63,64,65,66,67
0,M,M,V,A,W,W,S,L,F,L,...,,,,,,,,,,
1,M,E,A,F,N,L,H,N,F,L,...,,,,,,,,,,
2,M,L,R,V,S,A,I,F,M,A,...,,,,,,,,,,
3,M,A,V,R,I,A,R,F,L,G,...,,,,,,,,,,
4,M,V,S,F,S,S,C,L,R,A,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1056,M,Q,V,K,L,F,Y,T,L,A,...,,,,,,,,,,
1057,M,K,S,L,I,W,A,L,P,F,...,,,,,,,,,,
1058,M,W,P,T,R,S,L,S,S,L,...,,,,,,,,,,
1059,M,L,L,P,R,L,S,S,L,L,...,,,,,,,,,,


# Generate dummy ML model

In [46]:
%%capture 
!pip install h2o

In [47]:
import h2o
from h2o.automl import H2OAutoML

In [48]:
h2o.init(ip="localhost", min_mem_size_GB=8)

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O_cluster_uptime:,38 mins 49 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.40.0.1
H2O_cluster_version_age:,11 days
H2O_cluster_name:,H2O_from_python_unknownUser_nodvpd
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,7.994 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


In [88]:
new_peptides['MM_N_peptide_abundance'] = df_signalPP['MM_N_peptide_abundance']
new_peptides

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,59,60,61,62,63,64,65,66,67,MM_N_peptide_abundance
0,M,M,V,A,W,W,S,L,F,L,...,,,,,,,,,,1.000000
1,M,E,A,F,N,L,H,N,F,L,...,,,,,,,,,,0.417923
2,M,L,R,V,S,A,I,F,M,A,...,,,,,,,,,,0.339312
3,M,A,V,R,I,A,R,F,L,G,...,,,,,,,,,,0.276919
4,M,V,S,F,S,S,C,L,R,A,...,,,,,,,,,,0.218331
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1056,M,Q,V,K,L,F,Y,T,L,A,...,,,,,,,,,,0.000000
1057,M,K,S,L,I,W,A,L,P,F,...,,,,,,,,,,0.000000
1058,M,W,P,T,R,S,L,S,S,L,...,,,,,,,,,,0.000000
1059,M,L,L,P,R,L,S,S,L,L,...,,,,,,,,,,0.000000


In [89]:
df_test = h2o.H2OFrame(new_peptides)
df_test.describe()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,MM_N_peptide_abundance
type,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,real
mins,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0
mean,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.003157757935910452
maxs,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0
sigma,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.03730967489556199
zeros,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,819
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,M,M,V,A,W,W,S,L,F,L,Y,G,L,Q,V,A,A,P,A,L,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0
1,M,E,A,F,N,L,H,N,F,L,S,S,L,Y,I,L,L,P,F,V,I,L,A,N,P,V,H,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.4179234540632222
2,M,L,R,V,S,A,I,F,M,A,C,L,L,L,A,T,A,A,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.339312307761118


In [90]:
for column in df_test.columns:
    if column != 'MM_N_peptide_abundance':
        df_test[column] = df_test[column].asfactor()

In [91]:
# Select the columns we want to train on
feature_cols = [str(i) for i in range(0, 30)]

In [92]:
df_test.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,MM_N_peptide_abundance
type,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,real
mins,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0
mean,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.003157757935910452
maxs,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0
sigma,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.03730967489556199
zeros,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,819
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,M,M,V,A,W,W,S,L,F,L,Y,G,L,Q,V,A,A,P,A,L,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0
1,M,E,A,F,N,L,H,N,F,L,S,S,L,Y,I,L,L,P,F,V,I,L,A,N,P,V,H,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.4179234540632222
2,M,L,R,V,S,A,I,F,M,A,C,L,L,L,A,T,A,A,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.339312307761118


In [93]:
train_ml = True

In [94]:
if train_ml: 
    # Select the columns we want to train on
    feature_cols = [str(i) for i in range(0, 22)]

    # Initialize H2O autoML class
    AutoML = H2OAutoML(
        max_runtime_secs=60,  # 60 seconds; use int(3600 * 1) for 1 hour, or 0 for unlimited time
        max_models=None,  # None =  no limit
        nfolds=0,         # number of folds for k-fold cross-validation (nfolds=0 disables cross-validation)
        seed=1,            # Reproducibility
        sort_metric = "MAE",
        keep_cross_validation_predictions=True 
    )

    # train a model
    AutoML.train(
         x=feature_cols,
         y='MM_N_peptide_abundance',
         training_frame=df_test,
     ) 
    
    best_model = AutoML.get_best_model()
    best_model_name = best_model.key

    # how to save any model
    out_path = '/content/home/MyDrive/DTU-MASTER/DTU-Sem4/Thesis/sigpep/09-AutoML'
    mdl = h2o.get_model(best_model_name)
    h2o.save_model(model=mdl, path=out_path, force=True)

22:52:15.395: _train param, Dropping bad and constant columns: [0]
22:52:16.261: _train param, Dropping bad and constant columns: [0]
22:52:16.710: _train param, Dropping bad and constant columns: [0]
22:52:17.74: _train param, Dropping bad and constant columns: [0]
22:52:20.348: _train param, Dropping bad and constant columns: [0]
22:52:21.895: _train param, Dropping bad and constant columns: [0]
22:52:22.119: _train param, Dropping bad and constant columns: [0]
22:52:22.367: _train param, Dropping bad and constant columns: [0]
22:52:22.649: _train param, Dropping bad and constant columns: [0]
22:52:22.858: _train param, Dropping bad and constant columns: [0]
22:52:23.791: _train param, Dropping bad and constant columns: [0]
22:52:24.124: _train param, Dropping bad and constant columns: [0]

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


In [95]:
generate_artificial_peptide

<function __main__.generate_artificial_peptide(list_of_probabilities: numpy.ndarray, amino_acids: numpy.ndarray, max_length=22) -> str>

In [96]:
new_TO_NATURE_peptides = generate_artificial_peptides(list_of_probabilities, amino_acids, n_peptides= 100, max_len=22 )
new_TO_NATURE_peptides = split_peptides_sequences(new_TO_NATURE_peptides)

df_test = h2o.H2OFrame(pd.concat([new_TO_NATURE_peptides], axis='columns'))
for column in df_test.columns:
    if column != 'MM_N_peptide_abundance':
        df_test[column] = df_test[column].asfactor()
df_test.describe()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
type,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum
mins,,,,,,,,,,,,,,,,,,,,,,
mean,,,,,,,,,,,,,,,,,,,,,,
maxs,,,,,,,,,,,,,,,,,,,,,,
sigma,,,,,,,,,,,,,,,,,,,,,,
zeros,,,,,,,,,,,,,,,,,,,,,,
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
0,M,Q,L,S,S,T,T,T,G,F,S,S,T,-,-,-,-,-,-,-,-,0
1,M,Y,S,I,K,S,L,A,T,A,L,I,T,S,L,L,A,A,P,-,-,0
2,M,R,F,V,A,L,V,S,F,F,A,A,A,L,L,A,A,-,-,-,-,0


In [97]:
best_model = h2o.load_model("/content/home/MyDrive/DTU-MASTER/DTU-Sem4/Thesis/sigpep/09-AutoML/GBM_grid_1_AutoML_2_20230219_225215_model_48")

In [98]:
predicted = best_model.predict(df_test).as_data_frame()
#predicted = predicted.as_data_frame()
predicted

gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%




Unnamed: 0,predict
0,0.002183
1,0.000722
2,0.007038
3,-0.002010
4,0.003519
...,...
93,0.006853
94,0.000459
95,-0.000051
96,0.002800


In [99]:
ml_predictions = predicted['predict'].to_list()

In [100]:
new_TO_NATURE_peptides['predictions'] = ml_predictions
new_TO_NATURE_peptides.sort_values('predictions', ascending = False)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,predictions
80,M,W,L,T,L,I,R,L,F,A,...,V,F,I,S,F,C,-,-,-,0.012502
66,M,A,C,K,I,S,S,T,F,A,...,L,Q,L,A,V,A,L,-,-,0.012195
34,M,L,V,Q,S,L,T,S,F,A,...,V,L,A,V,L,-,-,-,-,0.010488
87,M,Q,P,T,R,S,F,A,I,A,...,Q,T,L,S,L,L,-,-,-,0.007786
17,M,L,F,I,L,A,L,F,A,L,...,H,T,F,-,-,-,-,-,-,0.007043
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50,M,Q,F,T,Y,M,I,I,L,A,...,S,A,G,V,-,-,-,-,-,-0.002154
7,M,F,F,H,R,P,L,L,A,A,...,A,V,V,I,-,-,-,-,-,-0.002249
63,M,R,L,K,G,V,I,R,P,L,...,S,V,A,I,L,I,-,-,-,-0.002771
58,M,K,L,L,V,I,L,A,P,S,...,A,L,I,L,S,A,F,-,-,-0.002776


### SIGNAL_PEPTIDE predictor algorithm

In [101]:
number_of_iterations = 3

In [103]:
def signal_peptide_predictor(list_of_probabilities, amino_acids, n_peptides,  number_of_iterations:int)-> pd.DataFrame:
    '''Generate n_peptides candidates per iteration, score them with the global
    best_model, and keep the 1000 highest-scoring peptides seen so far.'''

    data = pd.DataFrame()
    for i in range(0,number_of_iterations):
        new_TO_NATURE_peptides_1 = generate_artificial_peptides(list_of_probabilities, amino_acids, n_peptides=n_peptides, max_len = 30 )
        new_TO_NATURE_peptides_1 = split_peptides_sequences(new_TO_NATURE_peptides_1)

        df_test = h2o.H2OFrame(new_TO_NATURE_peptides_1)
        # make the df into categorical values
        for column in df_test.columns:
            df_test[column] = df_test[column].asfactor()

        # predict
        predicted = best_model.predict(df_test).as_data_frame()
        new_TO_NATURE_peptides_1['predictions'] = predicted['predict'].to_list()

        if len(data) == 0: 
            data = new_TO_NATURE_peptides_1.copy()
        else: 
            # concat to previous predictions
            data = pd.concat([data, new_TO_NATURE_peptides_1], axis=0)

        # sort and truncate in every iteration, including the first one
        data = data.sort_values('predictions', ascending = False)
        data = data[0:1000]

    return data
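
The generate, score, and keep-the-best loop can be tried without an H2O cluster by swapping in a stand-in scoring function. In this sketch keep_top_k and leucine_fraction are hypothetical names, and the leucine fraction is only a placeholder score, not a model of signal peptide abundance.

```python
import pandas as pd


def keep_top_k(batches, score_fn, k):
    """Score each batch of sequences and retain the k best seen so far."""
    best = None
    for batch in batches:
        scored = pd.DataFrame({
            "sequence": batch,
            "predictions": [score_fn(s) for s in batch],
        })
        # Merge with the previous best and keep only the top k rows.
        combined = scored if best is None else pd.concat([best, scored], ignore_index=True)
        best = combined.nlargest(k, "predictions").reset_index(drop=True)
    return best


def leucine_fraction(seq):
    """Placeholder score: fraction of leucines in the sequence."""
    return seq.count("L") / len(seq)


batches = [["MKLL", "MAAA"], ["MLLL", "MKKA"], ["MLALL"]]
top = keep_top_k(batches, leucine_fraction, k=2)
```

Because only k rows are carried between iterations, memory use stays bounded no matter how many batches are generated.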

In [104]:
%%time
lets_predict_signal_peptides = signal_peptide_predictor(list_of_probabilities, 
                                                        amino_acids,n_peptides =  1000,  
                                                        number_of_iterations = 100)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%





In [105]:
lets_predict_signal_peptides


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,predictions
673,M,V,S,A,Q,L,C,P,F,T,...,-,-,-,-,-,-,-,-,-,0.020977
526,M,V,L,S,N,I,S,L,R,F,...,-,-,-,-,-,-,-,-,-,0.020613
594,M,A,L,F,N,W,R,A,M,F,...,-,-,-,-,-,-,-,-,-,0.020140
687,M,V,V,Q,W,W,C,H,F,L,...,S,-,-,-,-,-,-,-,-,0.020136
116,M,V,C,F,N,L,S,T,F,I,...,-,-,-,-,-,-,-,-,-,0.020085
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45,M,R,A,H,W,A,F,L,F,L,...,-,-,-,-,-,-,-,-,-,0.011949
694,M,V,T,T,S,K,I,A,F,L,...,-,-,-,-,-,-,-,-,-,0.011949
925,M,H,F,F,S,A,S,T,V,A,...,-,-,-,-,-,-,-,-,-,0.011949
101,M,K,L,F,Y,A,C,L,S,V,...,-,-,-,-,-,-,-,-,-,0.011948


----

In [127]:
import h2o
from h2o.automl import H2OAutoML
import pandas as pd

# Start an H2O cluster
h2o.init()

# Load the peptide data as a pandas dataframe
df = pd.read_csv('/content/home/MyDrive/DTU-MASTER/DTU-Sem4/Thesis/sigpep/signal_peptides_ML.csv')

# Split the peptide sequences into separate columns
def split_peptides_sequences(df_signalPP:pd.DataFrame): 
    '''Split each AA for each position'''
    peptides_split = []
    for k,v in df_signalPP.iterrows(): 
        sequence = []
        for seq in v['sequence']: 
            sequence.append(seq)
        peptides_split.append(sequence)

    # make a dataframe
    new_peptides = pd.DataFrame(peptides_split)

    return new_peptides

peptides = split_peptides_sequences(df)
peptides = peptides.iloc[:, :30] # Select only the first 30 columns
peptides = peptides.fillna("-")

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O_cluster_uptime:,1 hour 54 mins
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.40.0.1
H2O_cluster_version_age:,11 days
H2O_cluster_name:,H2O_from_python_unknownUser_nodvpd
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,7.875 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


In [128]:
peptides

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,M,M,V,A,W,W,S,L,F,L,...,-,-,-,-,-,-,-,-,-,-
1,M,E,A,F,N,L,H,N,F,L,...,I,L,A,N,P,V,H,-,-,-
2,M,L,R,V,S,A,I,F,M,A,...,-,-,-,-,-,-,-,-,-,-
3,M,A,V,R,I,A,R,F,L,G,...,N,G,I,D,-,-,-,-,-,-
4,M,V,S,F,S,S,C,L,R,A,...,Q,P,V,L,-,-,-,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1056,M,Q,V,K,L,F,Y,T,L,A,...,-,-,-,-,-,-,-,-,-,-
1057,M,K,S,L,I,W,A,L,P,F,...,-,-,-,-,-,-,-,-,-,-
1058,M,W,P,T,R,S,L,S,S,L,...,V,S,-,-,-,-,-,-,-,-
1059,M,L,L,P,R,L,S,S,L,L,...,A,N,-,-,-,-,-,-,-,-


In [129]:
# Combine the peptide sequences and abundance into an H2O dataframe
h2o_df = h2o.H2OFrame(pd.concat([peptides, df['MM_N_peptide_abundance']], axis=1))
h2o_df

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,MM_N_peptide_abundance
M,M,V,A,W,W,S,L,F,L,Y,G,L,Q,V,A,A,P,A,L,-,-,-,-,-,-,-,-,-,-,1.0
M,E,A,F,N,L,H,N,F,L,S,S,L,Y,I,L,L,P,F,V,I,L,A,N,P,V,H,-,-,-,0.417923
M,L,R,V,S,A,I,F,M,A,C,L,L,L,A,T,A,A,-,-,-,-,-,-,-,-,-,-,-,-,0.339312
M,A,V,R,I,A,R,F,L,G,L,S,T,V,A,Y,L,A,L,A,N,G,I,D,-,-,-,-,-,-,0.276919
M,V,S,F,S,S,C,L,R,A,L,A,L,G,S,S,V,L,A,V,Q,P,V,L,-,-,-,-,-,-,0.218331
M,V,S,F,K,Y,L,G,A,T,A,A,Y,I,L,V,L,A,S,Q,I,T,T,A,L,-,-,-,-,-,0.197361
M,R,F,S,A,I,F,T,L,G,L,A,G,T,A,L,A,T,P,L,V,E,-,-,-,-,-,-,-,-,0.119362
M,M,V,A,W,W,S,L,F,L,Y,G,L,Q,V,A,A,P,A,L,-,-,-,-,-,-,-,-,-,-,0.0176281
M,H,L,P,T,L,V,T,L,A,C,M,A,V,S,A,S,-,-,-,-,-,-,-,-,-,-,-,-,-,0.071338
M,K,I,S,A,A,I,S,T,A,L,L,A,V,S,A,A,-,-,-,-,-,-,-,-,-,-,-,-,-,0.0601995


In [132]:
# Combine the peptide sequences and abundance into an H2O dataframe
h2o_df = h2o.H2OFrame(pd.concat([peptides, df['MM_N_peptide_abundance']], axis=1))

# Split the data into training and validation sets
train, valid = h2o_df.split_frame(ratios=[0.8], seed=42)

# Define the features and target column
x = train.columns[:-1]
y = "MM_N_peptide_abundance"

# Train an H2O AutoML model
aml = H2OAutoML(max_models=10, seed=42, max_runtime_secs=60)
aml.train(x=x, y=y, training_frame=train, validation_frame=valid)

# Make predictions using the best model
best_model = aml.leader
new_peptides_split = new_peptides.fillna('-')
new_peptides_split = new_peptides_split.iloc[:, :30]  # same 30 position columns the model was trained on
new_h2o_df = h2o.H2OFrame(new_peptides_split)
predictions = best_model.predict(new_h2o_df)

# Print the predicted peptide abundances
predicted_peptide_abundances = predictions.as_data_frame()['predict']
print(predicted_peptide_abundances)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |
23:52:40.773: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
23:52:40.777: _train param, Dropping bad and constant columns: [0]


23:52:41.936: _train param, Dropping bad and constant columns: [0]

█
23:52:42.449: _train param, Dropping bad and constant columns: [0]


23:52:43.172: _train param, Dropping bad and constant columns: [0]

██████
23:52:47.794: _train param, Dropping bad and constant columns: [0]

██
23:52:50.358: _train param, Dropping bad and constant columns: [0]
23:52:51.101: _train param, Dropping bad and constant columns: [0]

██
23:52:51.935: _train param, Dropping bad and constant columns: [0]
23:52:53.84: _train param, Dropping bad and constant co

OSError: ignored