<a href="https://colab.research.google.com/github/liisaloel/ss-prediction-project/blob/main/notebooks/Prediction_Nb_LL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Protein Secondary Structure Prediction Program**

In [None]:
# @title Run this cell to import necessary data from GitHub repository
!git clone https://github.com/liisaloel/ss-prediction-project.git

In [None]:
# @title Run this cell to upload a .pssm file
# @markdown Click on 'Upload' to input a .pssm file for prediction:
from ipywidgets import FileUpload
upload_button = FileUpload()
display(upload_button)
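
The layout of `upload_button.value` depends on the installed ipywidgets version: in 7.x it is a dict keyed by filename, while in 8.x it is a tuple of dicts that carry a `name` field and a `memoryview` as `content`. The sketch below shows one way to read the first uploaded file under either layout. `first_upload` is a hypothetical helper that is not part of this notebook, and the two layouts are modelled here with plain Python objects rather than a live widget.

```python
def first_upload(value):
    """Return (filename, content_bytes) for the first uploaded file, or None if empty.

    Hypothetical helper: `value` is assumed to look like FileUpload.value.
    """
    if not value:
        return None
    if isinstance(value, dict):
        # Assumed ipywidgets 7.x layout: {filename: {"content": bytes, ...}}
        name, info = next(iter(value.items()))
    else:
        # Assumed ipywidgets 8.x layout: ({"name": ..., "content": memoryview, ...},)
        info = value[0]
        name = info["name"]
    return name, bytes(info["content"])


# Stand-ins for the two layouts (no widget needed to try the helper)
old_style = {"example.pssm": {"content": b"row data"}}
new_style = ({"name": "example.pssm", "content": memoryview(b"row data")},)

print(first_upload(old_style))
print(first_upload(new_style))
```

Returning `None` for an empty value lets the caller print a reminder instead of raising when no file has been uploaded yet.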

In [None]:
# @title Hit "Run" to make the prediction
# @markdown Information about the prediction program:

# @markdown To prepare the sequence data for prediction, your uploaded .pssm file will be processed.

# @markdown MSA frequency values will be converted to a consistent scale of 0 to 1 to create a sequence profile, while the protein sequence will be one-hot encoded. The sequence and profile will be used to form a secondary structure prediction.

# @markdown The prediction algorithm uses a sliding window, with zero padding added at both ends of the sequence so that every residue gets a full window. The window size used for the model was 17; enter a different value to change it, or keep 17 to use the default:
window_size = 17 # @param {type:"integer"}

# @markdown The model used for secondary structure prediction is a fully connected neural network that was trained and validated beforehand and is imported here for use. The accuracy of the prediction model is 75%.

# @markdown A three-state secondary structure prediction is reported as follows:
# @markdown   - H for alpha helix,
# @markdown   - E for beta sheet,
# @markdown   - C for coil.


# @markdown The code used for handling the data and for creating, training and validating the model can be found in the notebook Project_Notebook.ipynb of the following GitHub repository: https://github.com/liisaloel/ss-prediction-project.git

# Import necessary libraries
import numpy as np
import tensorflow as tf
from tensorflow import keras
from IPython.display import display
from ipywidgets import FileUpload


def parse_pssm(pssm_file):
  amino_acids = 'ACDEFGHIKLMNPQRSTVWY'
  num_aas = len(amino_acids)
  sequence = ''
  profile = []

  # Parsing MSA frequencies from a PSSM file
  pssm_lines = pssm_file.decode('utf-8').split('\n')
  for line in pssm_lines[3:-7]:                 # Iterates over the residue rows, skipping the header and footer
    # Tokens 22 to 41 hold the percentage frequencies; dividing by 100 converts them to a scale of 0 to 1
    profile_line = [float(n) / 100 for n in line.rstrip().split()[22:-2]]
    profile.append(profile_line)
    if len(line) > 6:                           # Check if the line has enough characters
      sequence += line[6]                       # Fetches the residue letter, which sits at index 6 (the 7th character) of each row

  # One-hot encoding the protein sequence
  encoding = np.zeros((len(sequence), num_aas))   # Initialises a 2D array of zeros
  for i, aa in enumerate(sequence):               # Returns an iterator that produces tuples containing both the index and aa
    if aa in amino_acids:
      index = amino_acids.index(aa)               # Finds corresponding index at aa string
      encoding[i, index] = 1                      # 0 is replaced with 1 in the array, at position: seq index x aa string index
    else:
      encoding[i, :] = 0.05                       # If aa is not one of the 20 standard amino acids, fill the entire row with 0.05 to represent unknown/invalid aa

  return encoding, profile


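def predict_ss(change=None):
    # `change` is supplied by upload_button.observe(); it is unused and defaults to None for direct calls
    if not upload_button.value:
        print("Please upload a .pssm file.")
        return
    uploaded_file = next(iter(upload_button.value))
    data = upload_button.value[uploaded_file]['content']
    file_type = uploaded_file.split('.')[-1]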
    X = []

    if file_type == 'pssm':
        # Processing the uploaded PSSM file
        sequence, profile = parse_pssm(data)
        x = np.concatenate((sequence, profile), axis=-1)    # Shape: (sequence length, 40)
        side = (window_size - 1) // 2                       # Number of residues on each side of the central one
        x_pad = np.zeros((side, x.shape[1]))                # Zero padding for both ends of the sequence
        x = np.concatenate((x_pad, x, x_pad), axis=0)

        # Extracting one window per residue, centred on that residue
        X = [x[i-side:i+side+1,:] for i in range(side, len(x)-side)]

        # Converting to numpy array
        X = np.array(X)

        # Making predictions using the loaded model
        prediction = model.predict(X)

        # Decoding the output
        prediction_cat = np.argmax(prediction, axis=1)
        ss_labels = ['H', 'E', 'C']
        predicted_ss = [ss_labels[i] for i in prediction_cat]
        predicted_ss = "".join(predicted_ss)

        print("Predicted secondary structure:")
        print(predicted_ss)
    else:
        print("Please upload a .pssm file.")


# Loading the trained model
model_path = '/content/ss-prediction-project/trained_FCNN.h5'
model = keras.models.load_model(model_path)

# predict_ss will be called automatically whenever the uploaded file changes
upload_button.observe(predict_ss, names=['value'])
# Run the prediction once for the file that has already been uploaded
predict_ss()
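
The windowing and decoding steps of the cell above can be tried on toy data without the trained model. The following is a minimal sketch, assuming a feature matrix of shape (sequence length, 40) and model output columns ordered H, E, C as in `ss_labels`. `sliding_windows` and `decode_states` are illustrative names that do not appear in the repository, and `np.pad` is swapped in for the explicit `np.zeros` padding used in `predict_ss`.

```python
import numpy as np


def sliding_windows(features, window_size=17):
    """Return one zero-padded window per residue: (length, window_size, n_features)."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so each window has a central residue")
    features = np.asarray(features, dtype=float)
    side = (window_size - 1) // 2
    # Add `side` rows of zeros before the first and after the last residue
    padded = np.pad(features, ((side, side), (0, 0)))
    # Window i covers padded rows i .. i + window_size - 1, so residue i sits in the middle
    return np.stack([padded[i:i + window_size] for i in range(len(features))])


def decode_states(probabilities, labels="HEC"):
    """Pick the highest-scoring class in each row and join the labels into a string."""
    return "".join(labels[i] for i in np.argmax(probabilities, axis=1))


# Toy input: 5 residues with 40 features each (20 one-hot + 20 profile columns)
toy = np.arange(5 * 40, dtype=float).reshape(5, 40)
windows = sliding_windows(toy, window_size=17)
print(windows.shape)

# Toy class scores standing in for model.predict(...) output
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8],
                   [0.2, 0.5, 0.3]])
print(decode_states(scores))
```

Each window keeps its residue in the central row, so the first window starts with eight rows of zeros when the window size is 17.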