[Repo Link](https://github.com/liyu95/Deep_learning_examples/blob/master/1.Fully_connected_psepssm_predict_enzyme/predict_enzyme.ipynb)
## Motivation
- Annotation of enzyme function: metagenomics, industrial biotech, diagnosis of enzyme deficiency-caused diseases
- Long time & high cost to experimentally determine the function 
- Algo to determine enzyme function by predicting the Enzyme Commission (EC) number

## Results
- end-to-end feature selection & classification model, automatic & robust feature dimensionality uniformization method
- instead of extracting manually crafted features from enzyme sequences, the model takes the raw sequence encoding as inputs, extracting convolutional and sequential features from the raw encoding based on the classification result to directly improve the prediction performance
- cross-fold validation experiments conducted on 2 large-scale datasets show that DEEPre improves the prediction performance over the previous state-of-the-art methods

<img src = 'https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/bioinformatics/34/5/10.1093_bioinformatics_btx680/2/m_btx680f1.jpeg?Expires=1637011796&Signature=tWmUZfXISbtufiQnpDzbtkZ3HjgbS4Pj0zi0NRhdHcdBgyjzPlIn2Xxxk2omKnYt3-OxFSo8kyn7hHpjSzOgxlt0ZlYrZ1zSxXJF9XVX47IcZGaO-55qb61QoiHEgFglXu6Jc~kNh-d38uSDzYo5pzjZcKbZaFZX2crpm6gzUS2tQczCbFRD8fikHvQDpWROXx7bMgQqxStGiZgE3VD2J2Fu4zqaxvkgg9cTLE5fGpNdS4q0lypSEeB3X6GzaCEFN9w1hng1E2tIUDvn49bJiI0Eq14RiJXMuyL15rd25OZjGWQ9EImABQErkGoh1TfJ5S0zGKL469dTbeaNJdBZPA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA'>

## Datasets
1. Ezypred: a top-down approach for predicting enzyme functional classes and subclasses. Biochem. Biophys. Res. Commun., 364, 53–59 [Shen H.B., Chou K.C., 2007](http://www.ncbi.nlm.nih.gov/pubmed/17931599). Constructed from the ENZYME database, with 40% sequence similarity cutoff (Denoted as **KNN** dataset)
2. Manually constructed (**NEW**) using these steps
  i. The SWISS-PROT (released on September 7, 2016) database was separated into enzymes and non-enzymes based on the annotation.
  
  ii. To guarantee the uniqueness and correctness, enzyme sequences with more than one set of EC numbers or incomplete EC number annotation were excluded.
  
  iii. To avoid fragment data, enzyme sequences annotated with ‘fragment’ or with <50 amino acids were excluded. Enzyme sequences with more than 5000 amino acids were also excluded.
  
  iv. To remove redundancy bias, we used CD-HIT (Fu et al., 2012) with 40% similarity threshold to sift upon the raw dataset, resulting in 22 168 low-homology enzyme sequences.
  
  v. To construct the non-enzyme part, 22 168 non-enzyme protein sequences were randomly collected from the SWISS-PROT (released on September 7, 2016) non-enzyme part, which were also subject to the (ii–iv) steps.

3. [Benchmark dataset](http://www.ncbi.nlm.nih.gov/pubmed/22570420), referred as **COFACTOR**. Non-homologous dataset collected from PDB w/ 2 requirements: (i) the pair-wise sequence similarity within the dataset is below 30%, (ii) there is no self-BLAST hit within the dataset to ensure that there are no enzymes that are homologous to each other in this set

All enzymes in this dataset have experimentally determined 3D structures. To avoud overlaps between the training and testing datasets, sequences contained in both our training dataset and this dataset were removed, which reduced the size of the dataset from 318 to 284

**Background**

According to SWISS-PROT (Bairoch and Apweiler, 2000) (released on September 7, 2016), among the 539 566 manually annotated proteins, 258 733 proteins are enzymes. Such a large number of enzymes are usually classified using the Enzyme Commission (EC) system (Cornish-Bowden, 2014), the most well-known numerical enzyme classification scheme, which specifies the function of an enzyme by four digits. This classification system has a tree structure. After the root of the tree, there are two main nodes, standing for enzyme and non-enzyme proteins, respectively. The enzyme main node extends out six successor nodes, corresponding to the six main enzyme classes: (i) oxidoreductases, (ii) transferases, (iii) hydrolases, (iv) lyases, (v) isomerases and (vi) ligases, represented by the first digit. Each main class node further extends out several subclass nodes, specifying the enzyme’s subclasses, represented by the second digit. With the same logic, the third digit indicates the enzyme’s sub-subclasses and the fourth digit denotes the sub-sub-subclasses. Take Type II restriction enzyme, which is annotated as EC 3.1.21.4, as an example, the ‘3’ denotes that it is an hydrolase; the ‘1’ indicates that it acts on ester bonds; the ‘21’ shows that it is an endodeoxyribonuclease producing 5-phosphomonoesters; and the ‘4’ suggests that it is a Type II site-specific deoxyribonuclease. By predicting the EC numbers precisely, computational methods can annotate the function of enzymes. It should also be noted that a substantial number of enzymes annotated with some reactions in databases such as UniProt or Brenda do not have EC numbers associated, which is out of the scope of this study.

In [1]:
! wget https://github.com/liyu95/Deep_learning_examples/raw/master/1.Fully_connected_psepssm_predict_enzyme/Pfam_model_names_list.pickle
! wget https://github.com/liyu95/Deep_learning_examples/raw/master/1.Fully_connected_psepssm_predict_enzyme/Pfam_name_list_new_data.pickle
! wget https://github.com/liyu95/Deep_learning_examples/raw/master/1.Fully_connected_psepssm_predict_enzyme/Pfam_name_list_non_enzyme.pickle

--2022-01-05 09:46:50--  https://github.com/liyu95/Deep_learning_examples/raw/master/1.Fully_connected_psepssm_predict_enzyme/Pfam_model_names_list.pickle
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/liyu95/Deep_learning_examples/master/1.Fully_connected_psepssm_predict_enzyme/Pfam_model_names_list.pickle [following]
--2022-01-05 09:46:50--  https://raw.githubusercontent.com/liyu95/Deep_learning_examples/master/1.Fully_connected_psepssm_predict_enzyme/Pfam_model_names_list.pickle
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 313298 (306K) [text/plain]
Saving to: ‘Pfam_model_names_list.pickl

### Load related packages


In [2]:
import pickle
import numpy as np
import keras
import tensorflow as tf
from sklearn.model_selection import train_test_split
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import RMSprop

In [3]:
## Helper function to load data
def Pfam_from_file_encoding(name_list_pickle_filename, model_names_list_filename):
  with open(name_list_pickle_filename, 'rb') as f:
    name_list = pickle.load(f)
  
  with open(model_names_list_filename, 'rb') as f:
    model_list = pickle.load(f)
  
  encoding = []

  for i in range(len(name_list)):
    if i % 10000 == 0:
      print('Processing %dth sequence.'%i)
    single_encoding = np.zeros(16306)

    if name_list[i] != []:
      for single_name in name_list[i]:
        single_encoding[model_list.index(single_name)] = 1
    encoding.append(single_encoding)
  
  return encoding

### Load the data

In [4]:
enzyme_feature = Pfam_from_file_encoding('Pfam_name_list_new_data.pickle',
                                         'Pfam_model_names_list.pickle')

non_enzyme_feature = Pfam_from_file_encoding('Pfam_name_list_non_enzyme.pickle',
                                         'Pfam_model_names_list.pickle')


Processing 0th sequence.
Processing 10000th sequence.
Processing 20000th sequence.
Processing 0th sequence.
Processing 10000th sequence.
Processing 20000th sequence.


In [None]:
feature = np.concatenate([enzyme_feature, non_enzyme_feature], axis=0)
label = np.concatenate([np.ones([22168,1]), np.zeros([22168,1])], axis=0).flatten()
label = tf.keras.utils.to_categorical(label,num_classes=2)

Session crashed because of insufficient RAM!