# Instructions:

Follow these steps before running the script:


1.  Navigate to `Runtime` > `Change runtime type` and select **GPU** as the hardware accelerator
2.  Download this as a .zip file
3.  Choose `Files` from the side menu
4.  Upload the downloaded .zip file (the first cell expects it to be named `dataset.zip`)
5.  Run the following cells in order





In [0]:
!unzip dataset.zip
!pip install biopython

In [1]:
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)
import os
import numpy as np
import pandas as pd
import evaluation as evaal
import preprocessing as prp
import functions as func
from datetime import datetime
from tensorflow import keras
#print(keras.__version__)
from keras.preprocessing.sequence import pad_sequences

TensorFlow 1.x selected.
1.15.2


Using TensorFlow backend.


### Preprocessing

In [0]:
#@title Please choose the algorithm that you want to use: { form-width: "250px", display-mode: "both" }
algorithm = "SVM" #@param ["SVM", "BiLSTM"]



# ========== Importing the data

X, y = prp.import_data(algorithm)

# =========================== Preprocessing ===============================

# ========== Ordinal encoding

amino_codes = ['0', 'A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M', 'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y']
non_amino_letters = ['B', 'J', 'O', 'U', 'X', 'Z']

amino_mapping = prp.create_mapping(amino_codes)
 
X['mapped_seq'] = prp.integer_encoding(X['seq'], amino_mapping) 

# ========== Sequence padding

X_pad = pad_sequences(X['mapped_seq'], maxlen=3800, padding='post', truncating='post')

# ===================== Dimensionality reduction

# =========== RFE
#reduced_X = prp.dim_reduction_RFE(X_pad, y, 100)
# If you are using a new dataset, run the line above and comment out the lines below.
# Otherwise, to save time, the optimal features are read from a saved file (features_100.csv).

feat_support = pd.read_csv('features_100.csv')

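X = pd.DataFrame(data=X_pad)
# Keep only the columns flagged True in the first column of features_100.csv
selected = np.flatnonzero((feat_support.iloc[:, 0] == True).to_numpy())
reduced_X = X.iloc[:, selected].copy()
# Renumber the kept columns 0..n-1, as the original loop did
reduced_X.columns = range(reduced_X.shape[1])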

if (algorithm == "BiLSTM"):
    # Append the synthetic data to the end of X, in the same order in which their labels were added in preprocessing.py
    reduced_X = prp.add_synthetics("dataset/cytoplasmiccytoplasmicmembrane_synthetic(cyt&cm).txt", reduced_X)
    reduced_X = prp.add_synthetics("dataset/periplasmiccytoplasmicmembrane_synthetic(per_cm&per).txt", reduced_X)
    reduced_X = prp.add_synthetics("dataset/outermembraneextracellular_synthetic(om_ext&ext).txt", reduced_X)
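
The ordinal-encoding and padding steps in the cell above can be sketched in plain Python. This is a minimal illustration, not the code in `preprocessing.py`: the names `create_mapping_sketch`, `integer_encoding_sketch` and `pad_post` are hypothetical, and mapping letters outside `amino_codes` (such as `B` or `X`) to 0 is an assumption. `pad_post` mimics what `pad_sequences` does with `padding='post'` and `truncating='post'`.

```python
# Hypothetical stand-ins for prp.create_mapping / prp.integer_encoding;
# the real implementations in preprocessing.py may differ.
AMINO_CODES = ['0', 'A', 'C', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'M',
               'L', 'N', 'Q', 'P', 'S', 'R', 'T', 'W', 'V', 'Y']


def create_mapping_sketch(codes):
    """Map each code to its position in the list ('0' -> 0, 'A' -> 1, ...)."""
    return {code: index for index, code in enumerate(codes)}


def integer_encoding_sketch(sequences, mapping, unknown=0):
    """Encode each sequence as a list of integers.

    Assumption: letters missing from the mapping become `unknown`.
    """
    return [[mapping.get(residue, unknown) for residue in seq]
            for seq in sequences]


def pad_post(encoded, maxlen, value=0):
    """Cut each sequence at the end, then pad at the end, to `maxlen` items."""
    padded = []
    for seq in encoded:
        kept = seq[:maxlen]
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


mapping = create_mapping_sketch(AMINO_CODES)
encoded = integer_encoding_sketch(["ACX", "MKVLA"], mapping)
padded = pad_post(encoded, maxlen=4)
print(padded)
```

The short sequence is padded with zeros on the right, and the long one loses its last residue, so every row ends up with the same length.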



### Machine Learning models

In [3]:
#@title Please choose the run type: { form-width: "250px" }
run_type = "10_fold_regular" #@param ["10_fold_regular", "10_fold_grand_mean"]
#@markdown * 10_fold_grand_mean runs 10_fold_regular 30 times.
# ========================= Machine learning models ======================

if (algorithm == "SVM"):
    # ========== 1. Support Vector Machine (SVM) ==========
    
    # ==========  Hyperparameter tuning    
    #svm_params = evaal.SVM_tuning(reduced_X, y)
    svm_params = {'C': 50, 'gamma': 0.0001, 'kernel': 'rbf'}
    
    # ==========  Model evaluation using 10-fold cross-validation
    evaal.svm_eval(reduced_X, y, svm_params, run_type)

elif (algorithm == "BiLSTM"):
    # ========== 2. Deep learning ==========
    
    # ========== One-hot encoding    
    X_ohe, y_ohe = prp.one_hot_encoding(reduced_X, y)    
    
    # ========== Bidirectional Long short-term memory networks (Bi-LSTMs)
    # ========== Model evaluation using 10-fold Cross-validation
    evaal.lstm_eval(reduced_X, y_ohe, run_type)
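
The two run types can be pictured with the sketch below. It is only an illustration under assumptions: `k_fold_indices`, `cross_validate`, `grand_mean` and the `score_fold` callback are hypothetical names, and the source does not say whether `evaluation.py` shuffles, stratifies, or reseeds between the 30 repetitions. Here each repetition uses a different shuffle and the grand mean is the average of the per-run means.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k near-equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, score_fold, k=10, seed=0):
    """Return the mean score over k folds.

    `score_fold(train_idx, test_idx)` is a hypothetical callback that trains
    on train_idx and returns a score measured on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(score_fold(train_idx, test_idx))
    return mean(scores)


def grand_mean(n_samples, score_fold, repeats=30, k=10):
    """Repeat k-fold CV with a new shuffle each time and average the means."""
    return mean(cross_validate(n_samples, score_fold, k, seed=r)
                for r in range(repeats))


def toy_score(train_idx, test_idx):
    """Toy scorer: share of the data used for training (no model is fitted)."""
    return len(train_idx) / (len(train_idx) + len(test_idx))


print(round(grand_mean(100, toy_score, repeats=3), 3))
```

With 10 folds, each split trains on nine tenths of the data and tests on the remaining tenth.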



Regular 10-fold cross-validation Bi-LSTM
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 10917 samples, validate on 1213 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 00014: early stopping
Train loss:  0.43309027598630795
Train accuracy:  0.9095455
----------------------------------------------------------------------
Test loss:  0.49349519079712334
Test ac

  _warn_prf(average, modifier, msg_start, len(result))


Train on 10917 samples, validate on 1213 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 00021: early stopping
Train loss:  0.38920731037060746
Train accuracy:  0.9175117
----------------------------------------------------------------------
Test loss:  0.4234881439389813
Test accuracy:  0.914324
Train on 10917 samples, validate on 1213 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 00025: early stopping
Train loss:  0.3911790941168212
Train accuracy:  0.90112543
-----------------------------------------------------------------

  _warn_prf(average, modifier, msg_start, len(result))


Train on 10917 samples, validate on 1213 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 00022: early stopping
Train loss:  0.3664316025733118
Train accuracy:  0.91816634
----------------------------------------------------------------------
Test loss:  0.4369352803874979
Test accuracy:  0.90009195
Train on 10917 samples, validate on 1213 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 3

  _warn_prf(average, modifier, msg_start, len(result))


Train on 10917 samples, validate on 1213 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 00026: early stopping
Train loss:  0.3670106629175702
Train accuracy:  0.9190688
----------------------------------------------------------------------
Test loss:  0.35978926156831004
Test accuracy:  0.9121216
Testing Accuracy: 87.222%
F-score(macro): 0.72
F-score(micro): 0.872
Precision: 0.75
Recall: 0.714

It took 219.59295 seconds to compute.
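
The gap between the macro and micro F-scores above is easier to read with a small worked example. The sketch below is an assumption-labelled illustration, not the code in `evaluation.py`: `f_scores` is a hypothetical helper and the labels `"a"` and `"b"` are made up. For single-label multiclass predictions, micro F1 pools all classes and equals accuracy, while macro F1 averages the per-class scores, so a weak minority class pulls it down.

```python
def f_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true and no predicted samples scores 0.
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        total_tp += tp
        total_fp += fp
        total_fn += fn
    macro = sum(per_class_f1) / len(per_class_f1)
    micro = 2 * total_tp / (2 * total_tp + total_fp + total_fn)
    return macro, micro


# Made-up labels: 8 samples of class "a", 2 of class "b"; one "b" is missed.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 9 + ["b"]
macro, micro = f_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(macro, 3), round(micro, 3), round(accuracy, 3))
```

In this toy case the micro score matches the accuracy, and the macro score is lower because class "b" is only half recovered.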


In [0]:
func.beeep()