<a href="https://colab.research.google.com/github/pcummer/deep_learning_short_projects/blob/main/Protein_folding_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/protein-secondary-structure/protein-secondary-structure.train
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/protein-secondary-structure/protein-secondary-structure.test

--2020-11-13 13:25:04--  https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/protein-secondary-structure/protein-secondary-structure.train
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 73489 (72K) [application/x-httpd-php]
Saving to: ‘protein-secondary-structure.train’


2020-11-13 13:25:04 (697 KB/s) - ‘protein-secondary-structure.train’ saved [73489/73489]

--2020-11-13 13:25:04--  https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/protein-secondary-structure/protein-secondary-structure.test
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14586 (14K) [application/x-httpd-php]
Saving to: ‘protein-secondary-

In [10]:
import pandas as pd
import numpy as np

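def parse_to_df(path):
  """Parse a secondary-structure text file into one DataFrame row per protein."""
  with open(path, 'r') as f:
    content = f.readlines()
  rows = []
  amino_acids = []
  structure = []
  for line in content:
    line = line.strip()
    if 'end' in line:
      # An end marker closes the current protein: store it and start a new one.
      rows.append({'amino_acid': amino_acids, 'structure': structure, 'protein_count': len(rows)})
      amino_acids = []
      structure = []
    elif len(line) == 3:
      # Data lines look like "A _": one residue letter, a space, one structure label.
      residue, label = line.split(' ')
      amino_acids.append(residue)
      structure.append(label)
  # Build the frame once at the end; DataFrame.append was removed in pandas 2.0.
  return pd.DataFrame(rows, columns=['amino_acid', 'structure', 'protein_count'])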
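
The parser relies on a simple line-oriented layout. As a minimal sketch, assuming the format implied by the code above (one residue letter and one structure label per line, separated by a space, with a line containing `end` closing each protein), the same logic can be exercised on an in-memory sample. The helper `parse_lines`, the `<>` and `<end>` marker lines, and the sample text are hypothetical and exist only for illustration.

```python
def parse_lines(lines):
    """Group 'residue label' lines into proteins, closing each at an 'end' line.

    Hypothetical helper that mirrors parse_to_df but returns plain lists.
    """
    proteins = []
    residues, labels = [], []
    for line in lines:
        line = line.strip()
        if 'end' in line:
            proteins.append((residues, labels))
            residues, labels = [], []
        elif len(line) == 3:
            residue, label = line.split(' ')
            residues.append(residue)
            labels.append(label)
    return proteins


# Made-up sample in the assumed layout; not taken from the real data set.
sample = """<>
G _
V h
A h
<end>
<>
K e
L e
<end>
"""

proteins = parse_lines(sample.splitlines())
for residues, labels in proteins:
    print(''.join(residues), ''.join(labels))
```

Lines that are neither three characters long nor end markers (such as the `<>` line here) are skipped, which is also how `parse_to_df` treats them.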

Here we load the text files for the train and test splits into easily accessed dataframes. We also assign an integer index to each amino acid and each structure label to replace the text characters.

In [11]:
df_train_raw = parse_to_df('/content/protein-secondary-structure.train')
df_test_raw = parse_to_df('/content/protein-secondary-structure.test')


# Build one vocabulary covering both splits. np.unique returns an ndarray, which
# has no append method, so take the union rather than appending test-only residues.
unique_amino_acids_in_train = np.unique([item for sublist in df_train_raw.amino_acid for item in sublist])
unique_amino_acids_in_test = np.unique([item for sublist in df_test_raw.amino_acid for item in sublist])
unique_amino_acids = np.union1d(unique_amino_acids_in_train, unique_amino_acids_in_test)

# Indices start at 1, leaving 0 unused.
amino_acid_to_index = {x: i for i, x in enumerate(unique_amino_acids, start=1)}

structure_to_index = {'_': 0, 'h': 1, 'e': 2}

df_train_raw['amino_acid_index'] = [[amino_acid_to_index[x] for x in y] for y in df_train_raw.amino_acid]
df_test_raw['amino_acid_index'] = [[amino_acid_to_index[x] for x in y] for y in df_test_raw.amino_acid]
df_train_raw['structure_index'] = [[structure_to_index[x] for x in y] for y in df_train_raw.structure]
df_test_raw['structure_index'] = [[structure_to_index[x] for x in y] for y in df_test_raw.structure]
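
The index assignment can be illustrated on a toy example. This is a minimal sketch with made-up sequences (all `toy_` names are hypothetical); it builds the vocabulary from the union of both splits so that a residue seen only in the test split still gets an index.

```python
# Made-up sequences standing in for the amino_acid column of each split.
toy_train_sequences = [['A', 'C', 'D'], ['C', 'A']]
toy_test_sequences = [['D', 'W']]  # 'W' appears only in the test split

# Union of residues across both splits, sorted for a reproducible mapping.
toy_vocabulary = sorted(
    {residue for seq in toy_train_sequences + toy_test_sequences for residue in seq}
)

# Start at 1 so that 0 stays free; using 0 for padding later is an assumption,
# the notebook itself does not pad.
toy_amino_acid_to_index = {
    residue: i for i, residue in enumerate(toy_vocabulary, start=1)
}

toy_encoded = [[toy_amino_acid_to_index[r] for r in seq] for seq in toy_test_sequences]
print(toy_amino_acid_to_index)
print(toy_encoded)
```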

In [84]:
import tensorflow as tf

class basicGenerator(tf.keras.utils.Sequence):
    def __init__(self, df, shuffle=True, batch_size=1):
        super().__init__()
        self.df = df
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.df) / self.batch_size))

    def on_epoch_end(self):
        if self.shuffle:
            self.df = self.df.sample(frac=1.0)

    def __getitem__(self, index):
        indexes = np.arange(index * self.batch_size, (index + 1) * self.batch_size)
        batch_input = []
        batch_target = []
        for i in indexes:
            amino_acid_sequence = self.df.amino_acid_index.iloc[i]
            label_sequence = self.df.structure_index.iloc[i]
            batch_input.append(amino_acid_sequence)
            batch_target.append(label_sequence)

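        # Proteins differ in length, so stacking only works when every sequence
        # in the batch has the same length (always true for batch_size=1).
        batch_input = np.stack(batch_input)
        batch_target = np.array(batch_target)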

        return batch_input, batch_target


In [87]:
train_generator = basicGenerator(df_train_raw)
test_generator = basicGenerator(df_test_raw)
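
Before training, it can help to look at what a single batch should contain. The sketch below does not use `basicGenerator` itself; it reproduces the `np.stack` / `np.array` step from `__getitem__` on made-up index lists (`toy_inputs`, `toy_targets`, and `make_batch` are hypothetical names) and checks the shapes and label range that the model will receive.

```python
import numpy as np

# Made-up encoded proteins of different lengths.
toy_inputs = [[3, 1, 4, 1, 5], [2, 7]]
toy_targets = [[0, 1, 1, 2, 0], [2, 2]]

def make_batch(index, batch_size=1):
    """Stack the rows belonging to batch `index`, as __getitem__ does."""
    rows = range(index * batch_size, (index + 1) * batch_size)
    batch_input = np.stack([toy_inputs[i] for i in rows])
    batch_target = np.array([toy_targets[i] for i in rows])
    return batch_input, batch_target

x, y = make_batch(0)
print(x.shape, y.shape)  # one protein per batch: (1, length) for both

# Inputs and labels must line up position by position.
assert x.shape == y.shape
# Labels must stay within the three structure classes 0, 1, 2.
assert y.min() >= 0 and y.max() <= 2

# With more than one protein per batch, unequal lengths cannot be stacked.
try:
    make_batch(0, batch_size=2)
except ValueError as err:
    print('cannot stack sequences of different lengths:', err)
```

This is why the generators above keep the default `batch_size=1`; larger batches would need padding or length bucketing, which the notebook does not do.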

Here we test a toy model to confirm that our data loading and formatting has worked as expected.

In [90]:
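import tensorflow as tf

# One integer feature per residue; the sequence length is left variable.
inputs = tf.keras.layers.Input((None, 1))
# Three units, one per structure class ('_', 'h', 'e'), emitted at every position.
output = tf.keras.layers.LSTM(3, return_sequences=True, activation='sigmoid')(inputs)

toy_model = tf.keras.models.Model(inputs=[inputs], outputs=[output])
toy_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])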
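
The loss pairs each integer label with a three-way prediction at the same sequence position. As a rough sketch of the textbook definition (not a reproduction of Keras internals, which may clip or renormalize the predictions), the per-position loss is the negative log of the probability assigned to the true class. The arrays below are made up for illustration.

```python
import numpy as np

# Predictions for one protein of length 2: shape (batch=1, length=2, classes=3).
probs = np.array([[[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]]])
# Integer labels with shape (batch=1, length=2): class 0 ('_'), then class 2 ('e').
labels = np.array([[0, 2]])

# Pick the probability of the true class at every position.
true_class_probs = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
# Sparse categorical cross-entropy: -log p(true class), one value per position.
loss = -np.log(true_class_probs)

print(loss.shape)  # (1, 2)
print(loss.mean())
```

The labels need no one-hot encoding here, which is why `structure_index` can be fed to the model as plain integers.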

In [91]:
toy_model.fit(train_generator, validation_data=test_generator, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f107e882470>