## Introduction
* In this notebook we explore other architectures for classifying proteins from their sequences.
* We will explore four different approaches:
* First, we change parts of the architecture built on the embedding and Conv1D layers.
* Second, we try an LSTM after the embedding layer.
* Third, we try a GRU after the embedding layer.
* Fourth, we try bidirectional LSTMs.

## First, load the data

In [2]:
import tensorflow as tf
tf.VERSION

'1.15.0'

In [3]:
import keras 
keras.__version__

'2.2.4'

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
proteins = pd.read_csv("top_15_proteins.csv")

In [3]:
proteins.head()

Unnamed: 0.1,Unnamed: 0,sequence,classification,num_residues
0,67,TYTTRQIGAKNTLEYKVYIEKDGKPVSAFHDIPLYADKENNIFNMV...,HYDROLASE,286
1,68,TYTTRQIGAKNTLEYKVYIEKDGKPVSAFHDIPLYADKENNIFNMV...,HYDROLASE,286
2,74,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD...,LIGASE,330
3,75,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQD...,LIGASE,330
4,76,KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTF...,HYDROLASE,124


In [4]:
proteins.shape

(283101, 4)

### Transform the labels

In [5]:
from sklearn.preprocessing import LabelBinarizer

# Transform labels to one-hot
lb = LabelBinarizer()
Y = lb.fit_transform(proteins['classification'])

In [6]:
lb.classes_

array(['HYDROLASE', 'HYDROLASE/HYDROLASE INHIBITOR', 'IMMUNE SYSTEM',
       'ISOMERASE', 'LIGASE', 'LYASE', 'OXIDOREDUCTASE', 'RIBOSOME',
       'RIBOSOME/ANTIBIOTIC', 'SIGNALING PROTEIN', 'TRANSCRIPTION',
       'TRANSFERASE', 'TRANSPORT PROTEIN', 'VIRAL PROTEIN', 'VIRUS'],
      dtype='<U29')

In [7]:
Y[34]

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
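
To make the encoding above concrete, here is a minimal pure-Python sketch of a one-hot label encoding. It is not scikit-learn's implementation: `one_hot_labels` is a hypothetical helper, and it assumes the classes are ordered alphabetically, as the `lb.classes_` output suggests.

```python
def one_hot_labels(labels):
    """Return (classes, rows): sorted unique classes and one one-hot row per label."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1  # mark the column that belongs to this label
        rows.append(row)
    return classes, rows


# Toy labels drawn from the class names shown above.
labels = ["LIGASE", "HYDROLASE", "HYDROLASE/HYDROLASE INHIBITOR", "LIGASE"]
classes, rows = one_hot_labels(labels)
print(classes)
print(rows[2])  # the position of the 1 identifies the class
```

In this sketch, `classes[row.index(1)]` recovers the class name from a row. With the fitted binarizer, `lb.inverse_transform` plays the same role.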

### Vectorize the Sequences
* This time, try pre-padding: zeros are added at the start of each sequence (`padding='pre'`)

In [27]:
from keras.preprocessing import text, sequence
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split

# maximum sequence length; longer sequences are truncated
# (pad_sequences defaults to truncating='pre', so the start is dropped)
max_length = 512

# create and fit a character-level tokenizer
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(proteins['sequence'])
# represent each sequence as a list of integer character indices
X = tokenizer.texts_to_sequences(proteins['sequence'])
X = sequence.pad_sequences(X, maxlen=max_length, padding='pre')

In [28]:
tokenizer.word_docs

defaultdict(int,
            {'l': 271251,
             'f': 263330,
             'w': 218243,
             'e': 267895,
             'v': 269066,
             's': 268119,
             'n': 265583,
             'g': 276371,
             'q': 264838,
             't': 269758,
             'c': 213007,
             'd': 266266,
             'a': 277393,
             'r': 270110,
             'i': 267264,
             'y': 262657,
             'k': 270045,
             'p': 267751,
             'm': 258030,
             'h': 258236,
             'u': 5915,
             'x': 4954,
             'z': 3,
             'b': 4,
             'o': 2})

In [29]:
X[4545]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0
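
The vectorization step can be mimicked in a few lines of plain Python. This sketch assumes the tokenizer lowercases the input and ranks characters by frequency (1 = most frequent, 0 reserved for padding), and that padding and truncation both act on the start of the sequence. `fit_char_index` and `encode_and_pad` are hypothetical helpers, not Keras functions, and the tie-breaking rule is specific to this sketch.

```python
from collections import Counter


def fit_char_index(texts):
    """Map each lowercase character to a rank, 1 = most frequent; 0 is left for padding."""
    counts = Counter(ch for text in texts for ch in text.lower())
    # Sort by descending count; ties are broken alphabetically in this sketch only.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: rank for rank, ch in enumerate(ranked, start=1)}


def encode_and_pad(text, char_index, max_length):
    """Encode one sequence, keep its last max_length residues, and left-pad with zeros."""
    ids = [char_index[ch] for ch in text.lower() if ch in char_index]
    ids = ids[-max_length:]                       # 'pre' truncation: drop the start
    return [0] * (max_length - len(ids)) + ids    # 'pre' padding: zeros go first


sequences = ["MKTAYIAK", "KESAAAKF", "TYT"]  # toy fragments, not real dataset rows
char_index = fit_char_index(sequences)
padded = [encode_and_pad(s, char_index, max_length=5) for s in sequences]
print(padded)
```

Short sequences end up right-aligned behind a run of zeros, which is the pattern visible in the `X[4545]` output above.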

### First Model
* Change the embedding size to 16
* Add another Conv1D and max-pooling layer (16 filters, kernel size 2)
* Train for 50 epochs this time

In [1]:
from keras.models import Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, Flatten
from keras.layers import LSTM
from keras.layers.embeddings import Embedding

embedding_dim = 16

# create the model
model = Sequential()
model.add(Embedding(len(tokenizer.word_index)+1, embedding_dim, input_length=max_length))
model.add(Conv1D(filters=64, kernel_size=8, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=32, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Conv1D(filters=16, kernel_size=2, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(15, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Using TensorFlow backend.





NameError: name 'tokenizer' is not defined
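
The recorded run of this cell failed with a `NameError` (it was executed before `tokenizer` existed), so no summary was printed. As a rough stand-in, the sketch below works out the expected output lengths and parameter counts by hand using the standard Conv1D and Dense formulas. The vocabulary size of 26 is an assumption: the 25 characters listed in `tokenizer.word_docs` plus one padding index.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size * in_channels per filter) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


vocab_size = 26      # assumption: 25 tokenizer characters + 1 padding index
embedding_dim = 16
length = 512         # max_length

layers = [("embedding", vocab_size * embedding_dim)]
channels = embedding_dim
for filters, kernel_size in [(64, 8), (32, 4), (16, 2)]:
    layers.append((f"conv1d_{filters}", conv1d_params(kernel_size, channels, filters)))
    channels = filters
    length //= 2     # padding='same' keeps the length; each pool_size=2 layer halves it

flat = length * channels
layers.append(("dense_128", dense_params(flat, 128)))
layers.append(("dense_15", dense_params(128, 15)))

for name, count in layers:
    print(f"{name:<12}{count:>8}")
print("flattened features:", flat)
print("total parameters:", sum(count for _, count in layers))
```

Most of the parameters sit in the first Dense layer, because the flattened convolution output is still 1,024 features wide.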

In [36]:
X.shape,Y.shape

((283101, 512), (283101, 15))

### Split into train and test

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.1)

In [38]:
# hold out X_test and y_test until the end; carve a validation set out of the training data
X_test.shape
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=.1)
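
The sizes produced by the two successive 10% splits can be checked with simple arithmetic. This sketch assumes that `train_test_split` rounds the held-out size up and gives the remainder to the training side. `split_sizes` is a hypothetical helper, and exact fractions are used to avoid floating-point rounding.

```python
import math
from fractions import Fraction


def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(n_samples * test_fraction)
    return n_samples - n_held_out, n_held_out


tenth = Fraction(1, 10)
n_rest, n_test = split_sizes(283101, tenth)      # first split: train+valid vs test
n_train, n_valid = split_sizes(n_rest, tenth)    # second split: train vs valid
print(n_train, n_valid, n_test)
```

Under that assumption the training and validation sizes come out as 229,311 and 25,479, which agrees with the sample counts Keras reports when training starts.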

In [39]:
model.fit(X_train, y_train, 
          validation_data=(X_valid, y_valid), 
          epochs=50, 
          batch_size=128)

Train on 229311 samples, validate on 25479 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x1c963975f8>

Save the model and its training history to examine later.

In [43]:
pd.DataFrame(model.history.history).to_csv('Protein_Classification_Parameter_Tuning_Other_Architectures/ThreeConvLayersHistory.csv')
                                           

In [44]:
model.save('Protein_Classification_Parameter_Tuning_Other_Architectures/ThreeConvLayers.h5')
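
One way to examine a saved history later is to look for the epoch with the lowest validation loss. The sketch below does this with the standard library only. The CSV text is made up purely to show the layout and is not output from this notebook. The column names (`val_loss`, `val_acc`, `loss`, `acc`) are an assumption about what this Keras version logs, and `best_epoch` is a hypothetical helper.

```python
import csv
import io

# Made-up numbers purely for illustration; they are NOT results from this notebook.
example_csv = """,val_loss,val_acc,loss,acc
0,1.90,0.40,2.10,0.35
1,1.70,0.46,1.80,0.43
2,1.75,0.45,1.60,0.50
"""


def best_epoch(csv_file, metric="val_loss"):
    """Return (1-based epoch, value) for the row with the lowest metric value."""
    rows = list(csv.DictReader(csv_file))
    values = [float(row[metric]) for row in rows]
    best = min(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]


epoch, value = best_epoch(io.StringIO(example_csv))
print(epoch, value)
```

For a real run, the same function could be given an open handle to the saved history file instead of the in-memory example.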

In [45]:
del model

### Second Model
* Use an LSTM layer after the embedding layer

In [56]:
from keras.layers import LSTM

embedding_dim = 16
depth = 16  # number of LSTM units

# create the model
model = Sequential()
model.add(Embedding(len(tokenizer.word_index)+1, embedding_dim, input_length=max_length))
model.add(LSTM(depth))
model.add(Dense(15, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 512, 16)           416       
_________________________________________________________________
lstm_5 (LSTM)                (None, 16)                2112      
_________________________________________________________________
dense_23 (Dense)             (None, 15)                255       
Total params: 2,783
Trainable params: 2,783
Non-trainable params: 0
_________________________________________________________________
None
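
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias). The vocabulary size of 26 is inferred from the embedding's 416 parameters divided by the embedding size of 16, and `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * input_dim + units * units + units)


embedding = 26 * 16                       # vocabulary size (incl. padding index) * embedding_dim
lstm = lstm_params(input_dim=16, units=16)
dense = 16 * 15 + 15                      # 16 LSTM outputs -> 15 classes, plus biases
print(embedding, lstm, dense, embedding + lstm + dense)
```

At under 3,000 parameters this model is very small, so the long training time comes from stepping through 512 positions one after another rather than from the model's size.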


Note: training this model takes quite a long time, so use fewer epochs.

In [None]:
model.fit(X_train, y_train, 
          validation_data=(X_valid, y_valid), 
          epochs=15, 
          batch_size=128)

Train on 229311 samples, validate on 25479 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15

In [55]:
del model