

"Protein secondary structures are traditionally characterized as 3 general states: helix (H), strand (E), and coil (C). From these general three states, the DSSP program [2] proposed a finer characterization of the secondary structures by extending the three states into eight states: 3₁₀ helix (G), α-helix (H), π-helix (I), β-strand (E), bridge (B), turn (T), bend (S), and others (C).

Recently, the focus of secondary structure prediction has been shifted from Q3 prediction to the prediction of 8-state secondary structures, due to the fact that a chain of 8-state secondary structures contains more precise structural information for a variety of applications. The prediction of the 8 states of secondary structures from protein sequences is called a Q8 prediction problem. The Q8 problem is much more complicated than the Q3 problem.

For example, the SC-GSN network [17], the bidirectional long short-term memory (BLSTM) method [18, 19], the deep conditional neural field [20], DCRNN [21], the next-step conditioned deep convolutional neural network (CNN) [22] and the deep inception-inside-inception (Deep3I) network [23] have been widely explored"

17) Zhou J, Troyanskaya OG. Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. In: Proceedings of the 31st International Conference on Machine Learning (ICML). Beijing: PMLR; 2014. p. 745–53.
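
To make the relation between the two alphabets concrete, the sketch below collapses an 8-state string into 3 states and scores a prediction per residue. The grouping in `Q8_TO_Q3` (G, H, I to helix; E, B to strand; everything else to coil) is one commonly used reduction and is an assumption here: neither the quoted paper nor the dataset's `sst3` column is confirmed to use it. The function names and example strings are hypothetical as well.

```python
# Assumed Q8 -> Q3 grouping (a common convention, not taken from the notebook)
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strand and bridge
    "T": "C", "S": "C", "C": "C",  # turn, bend, others
}


def q8_to_q3(sst8):
    """Collapse an 8-state string into a 3-state string."""
    return "".join(Q8_TO_Q3[state] for state in sst8)


def per_residue_accuracy(predicted, observed):
    """Fraction of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        return 0.0
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical example strings, 16 residues each
observed_q8 = "CCHHHHGGGTTEEEEC"
predicted_q8 = "CCHHHHHHHTTEEEBC"

q8_score = per_residue_accuracy(predicted_q8, observed_q8)
q3_score = per_residue_accuracy(q8_to_q3(predicted_q8), q8_to_q3(observed_q8))
print(q8_score, q3_score)
```

In this toy case the prediction only confuses states that fall into the same 3-state group, so it scores 0.75 on the 8-state alphabet but 1.0 after reduction, which illustrates why Q8 is the harder target.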



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing import text, sequence
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from keras.metrics import categorical_accuracy
from keras.utils import to_categorical
import seaborn as sns
from matplotlib import rc


from google.colab import drive
drive.mount('/content/gdrive')

sns.set_style("whitegrid", {'axes.grid' : False})
# Using seaborn's style
plt.style.use('seaborn')
sns.set_context("paper")
# color_palette() only returns a palette; set_palette() makes it the default
sns.set_palette("cubehelix", 8)


# Set the global font to be DejaVu Sans, size 10 (or any other sans-serif font of your choice!)
rc('font',**{'family':'sans-serif','sans-serif':['DejaVu Sans'],'size':9})

# Set the default font style used by matplotlib's mathtext
rc('mathtext',**{'default':'regular'})

%config InlineBackend.figure_format = 'retina'

Mounted at /content/gdrive


In [None]:
MAX_LEN = 128 # maximum length of the sequence 
df = pd.read_csv('/content/gdrive/My Drive/protein_structure.csv')
df.head()

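def triplets(sequences):
    """
    Apply a sliding window of length 3 to each sequence in the input list.
    One window is produced per residue, so the last two windows of each
    sequence are shorter than 3.
    :param sequences: list of sequences
    :return: numpy object array holding one list of windows per sequence
    Usage: split a protein sequence into triplets of amino acids
    """
    # dtype=object because the per-sequence lists have different lengths
    return np.array([[aminoacids[i:i+3] for i in range(len(aminoacids))] for aminoacids in sequences], dtype=object)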


"""
Columns of interest for classification:
seq: the sequence of the peptide
sst8: the eight-state (Q8) secondary structure
sst3: the three-state (Q3) secondary structure
len: the length of the peptide
has_nonstd_aa: whether the peptide contains nonstandard amino acids (B, O, U, X, or Z).
"""
input_seqs, target_seqs = df[['seq', 'sst8']][(df.len <= MAX_LEN) & (~df.has_nonstd_aa)].values.T

print("input seqs", input_seqs)

input seqs ['EDL' 'KCK' 'KAK' ...
 'GQVQLVQSGGGLVQAGGSLRLSCAFSGRTFSMYTMGWFRQAPGKEREFVAANRGRGLSPDIADSVNGRFTISRDNAKNTLYLQMDSLKPEDTAVYYCAADLQYGSSWPQRSSAEYDYWGQGTTVTVSS'
 'MSGYTPDEKLRLQQLRELRRRWLKDQELSPREPVLPPRRMWPLERFWDNFLRDGAVWKNMVFKAYRSSLFAVSHVLIPMWFVHYYVKYHMATKPYTIVSSKPRIFPGDTILETGEVIPPMRDFPDQHH'
 'MSGYTPDEKLRLQQLRELRRRWLKDQELSPREPVLPPRRMWPLERFWDNFLRDGAVWKNMVFKAYRSSLFAVSHVLIPMWFVHYYVKYHMATKPYTIVSSKPRIFPGDTILETGEVIPPMRDFPDQHH']
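
The sliding window used by `triplets` is easiest to see on a short string. The sketch below is a plain-Python restatement of the same slicing for one sequence; the helper name `sliding_windows` and the 5-residue example are hypothetical.

```python
def sliding_windows(sequence, width=3):
    """Return one window per residue; windows near the end are shorter."""
    return [sequence[i:i + width] for i in range(len(sequence))]


def full_windows_only(sequence, width=3):
    """Variant that keeps only complete windows (not what the notebook does)."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


example = "EDLKC"  # hypothetical 5-residue sequence
print(sliding_windows(example))    # ['EDL', 'DLK', 'LKC', 'KC', 'C']
print(full_windows_only(example))  # ['EDL', 'DLK', 'LKC']
```

Keeping the shorter trailing windows means the number of tokens equals the number of residues, so a per-residue label string such as `sst8` stays aligned with the tokenized input.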


In [None]:
# Transform features
tokenizer_encoder = Tokenizer()
input_grams = triplets(input_seqs)
tokenizer_encoder.fit_on_texts(input_grams)
input_data = tokenizer_encoder.texts_to_sequences(input_grams)
input_data = sequence.pad_sequences(input_data, maxlen=MAX_LEN, padding='post')

#Transform targets
tokenizer_decoder = Tokenizer(char_level=True)
tokenizer_decoder.fit_on_texts(target_seqs)
target_data = tokenizer_decoder.texts_to_sequences(target_seqs)
target_data = sequence.pad_sequences(target_data, maxlen=MAX_LEN, padding='post')
target_data = to_categorical(target_data)

X_train, X_test, y_train, y_test = train_test_split(input_data, target_data, test_size=.3, random_state=1)
seq_train, seq_test, target_train, target_test = train_test_split(input_seqs, target_seqs, test_size=.3, random_state=1)
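
The target pipeline above (character-level indexing, post-padding, one-hot encoding) can be sketched in plain Python to show the resulting shapes. This is an illustration under simplifying assumptions, not the Keras implementation: indices are assigned alphabetically here, whereas the Keras `Tokenizer` orders them by frequency, and all helper names are hypothetical.

```python
def build_index(label_strings):
    """Map each label character to an integer, reserving 0 for padding."""
    alphabet = sorted(set("".join(label_strings)))
    return {ch: i + 1 for i, ch in enumerate(alphabet)}


def encode_and_pad(labels, index, max_len):
    """Convert labels to integer ids and pad with zeros on the right."""
    if len(labels) > max_len:
        raise ValueError("sequence longer than max_len")
    ids = [index[ch] for ch in labels]
    return ids + [0] * (max_len - len(ids))


def one_hot(ids, n_classes):
    """One row per position, with a single 1 in the column of its id."""
    return [[1 if col == i else 0 for col in range(n_classes)] for i in ids]


targets = ["CHHE", "CE"]  # hypothetical label strings
index = build_index(targets)
padded = [encode_and_pad(t, index, max_len=5) for t in targets]
# +1 because column 0 is used by the padding id
encoded = [one_hot(p, n_classes=len(index) + 1) for p in padded]
print(index)
print(padded)
```

Each sequence becomes a `max_len` by `n_classes + 1` matrix, with the extra column 0 marking padded positions.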


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
# Transform features
tokenizer_encoder = Tokenizer()
input_grams = triplets(input_seqs)
tokenizer_encoder.fit_on_texts(input_grams)
input_data = tokenizer_encoder.texts_to_sequences(input_grams)
input_data = sequence.pad_sequences(input_data, maxlen=MAX_LEN, padding='post')

#Transform targets
mlb = MultiLabelBinarizer()
target_data = mlb.fit_transform(target_seqs)


X_train, X_test, y_train, y_test = train_test_split(input_data, target_data, test_size=.3, random_state=1)
seq_train, seq_test, target_train, target_test = train_test_split(input_seqs, target_seqs, test_size=.3, random_state=1)
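
Applied to label strings, `MultiLabelBinarizer` treats each string as a set of characters, so every sequence is reduced to a single multi-hot vector that records which states occur in it and discards their positions. The plain-Python sketch below mimics that behaviour on hypothetical inputs; `multi_hot` is an illustrative helper, not part of scikit-learn.

```python
def multi_hot(label_strings):
    """Return the sorted class list and one 0/1 row per label string."""
    classes = sorted(set("".join(label_strings)))
    rows = []
    for labels in label_strings:
        present = set(labels)
        rows.append([1 if c in present else 0 for c in classes])
    return classes, rows


# Hypothetical 8-state strings
classes, rows = multi_hot(["CCHHHC", "CEEETC", "CC"])
print(classes)
print(rows)
```

With this encoding the task becomes predicting which states appear somewhere in a sequence, rather than the state of each residue.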

As the radar chart below shows, three classes dominate the dataset: the ones that also make up the Q3 alphabet (H, E, and C).

In [None]:
property_classes = ['C','E','H','B','G','I','T','S']
property_counts = {}
for x in property_classes:
  property_counts[x] = sum(pd.Series(target_seqs).str.count(x))

In [None]:
import plotly.graph_objects as go

#Trace plot
fig = go.Figure()
fig.add_trace(go.Scatterpolar(
      r=list(property_counts.values()),
      theta=list(property_counts.keys()),
      fill='toself',line_color = '#77accc'
))
fig.update_layout(title={
        'text': "Presence of each of the classes in the dataset",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

In [None]:
from nltk.probability import ConditionalFreqDist
#Conditional frequency distribution of properties:
#for every sequence that contains a given state, count the other states in it
cfdist=ConditionalFreqDist()
for condition in property_classes:
  for labels in target_seqs:
    if condition in labels:
      for state in labels:
        if state != condition:
          cfdist[condition][state] += 1

In [None]:
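import pandas as pd
import plotly.graph_objects as go

# Assumption: prob_matrix is not defined in the cells shown here. It is rebuilt
# below as the row-normalized conditional frequencies stored in cfdist.
prob_matrix = pd.DataFrame(
    {cond: {state: cfdist[cond].freq(state) for state in property_classes}
     for cond in property_classes}).T

#Trace plot
fig = go.Figure()
fig.add_trace(go.Heatmap(z=prob_matrix.to_numpy(), x=prob_matrix.columns, y=prob_matrix.index,colorscale='PuBu'))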
fig.update_layout(title={
        'text': "Conditional Frequency of properties",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

# Multilabel prediction setup


In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Sat Nov 28 19:16:22 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    25W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import tensorflow as tf
from keras.models import Model
from keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed, Bidirectional, LayerNormalization

n_words = len(tokenizer_encoder.word_index) + 1
# One output unit per column of the multi-hot targets built by MultiLabelBinarizer
n_tags = len(mlb.classes_)

# Named `inputs` to avoid shadowing the built-in input()
inputs = Input(shape=(MAX_LEN,))
x = Embedding(input_dim=n_words, output_dim=128, input_length=MAX_LEN)(inputs)
x = LayerNormalization()(x)
x = Bidirectional(LSTM(units=128, return_sequences=True,use_bias=True))(x)
x = Bidirectional(LSTM(units=128, return_sequences=True,use_bias=True))(x)
# The last recurrent layer returns a single vector per sequence
x = Bidirectional(LSTM(units=128,use_bias=True))(x)
y = Dense(n_tags, activation="sigmoid")(x)
model = Model(inputs, y)

model.compile(optimizer="rmsprop", loss='binary_crossentropy', metrics=[tf.keras.metrics.Precision(), 
                                                                        tf.keras.metrics.Recall(),
                                                                        tf.keras.metrics.Hinge()])
history8 = model.fit(X_train, y_train, batch_size=128, epochs=5, validation_data=(X_test, y_test), verbose=1)
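
Because the output layer uses independent sigmoids, each prediction is a vector of per-class probabilities that has to be thresholded to obtain a set of states. The sketch below shows that step and the precision and recall of one prediction in plain Python. The 0.5 threshold mirrors the default of the Keras `Precision` and `Recall` metrics; the class order (alphabetical, as `MultiLabelBinarizer` would produce if all eight states occur), the probabilities, and the helper names are assumptions for illustration.

```python
def decode_states(probabilities, classes, threshold=0.5):
    """Keep every class whose predicted probability exceeds the threshold."""
    return [c for c, p in zip(classes, probabilities) if p > threshold]


def precision_recall(predicted, observed):
    """Precision and recall of one predicted label set against the true set."""
    predicted, observed = set(predicted), set(observed)
    true_positives = len(predicted & observed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(observed) if observed else 0.0
    return precision, recall


classes = ["B", "C", "E", "G", "H", "I", "S", "T"]  # assumed column order
probabilities = [0.10, 0.97, 0.20, 0.60, 0.90, 0.01, 0.55, 0.40]  # made-up output

predicted = decode_states(probabilities, classes)
precision, recall = precision_recall(predicted, observed="CHST")
print(predicted, precision, recall)
```

Here G is a false positive and T is missed, so both precision and recall are 0.75. The Keras metrics pool these counts over all sequences in a batch instead of scoring each sequence separately.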