# ProWave - WaveNet-based Protein Generation

Authors: Hans Jakob Damsgaard & Lucas Balling

02456 Deep Learning project: ProGen

## Initialization

Run the command below if you have not yet installed the [TAPE project](https://github.com/songlab-cal/tape).

In [1]:
#!pip install tape_proteins

#### Importing needed packages

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.nn.parameter import Parameter
import tape

#### Import the data

We were unable to make the data download script, `download_data.sh`, run from Jupyter, so instead we ran it manually and simply placed the resulting files in the right folder for TAPE to find them. We import all the data in the LMDB format as it is most easily worked with in Python.

In [4]:
from tape.datasets import LanguageModelingDataset

# Data stored under `<data-path>/data`
#data_path = '/Users/lucasballing/Desktop/DeepLearningProject/prowave-main/data/'
data_path = 'E:/Pfam/data/'
train_data   = LanguageModelingDataset(data_path, 'train')
valid_data   = LanguageModelingDataset(data_path, 'valid')
holdout_data = LanguageModelingDataset(data_path, 'holdout')

#### Understanding data features

To get a good understanding of the imported dataset, we plot some of its features and their ranges. TAPE already splits the data into the three required subsets (train, validation, and holdout), so it is also interesting to understand this split.

In [38]:
# Split sizes
print(f'Training data has shape ({len(train_data)}, {len(train_data[0])})')
print(f'Validation data has shape ({len(valid_data)}, {len(valid_data[0])})')
print(f'Holdout data has shape ({len(holdout_data)}, {len(holdout_data[0])})')

# Original data columns
from tape.datasets import LMDBDataset
lmdb_train = LMDBDataset(data_path+'pfam/pfam_train.lmdb')
print(f'File data entries look like this: {lmdb_train[0]}')
del lmdb_train

# Data columns - all subsets are taken from the same overall dataset, so the columns are the same
# From combining information from LMDBDataset and LanguageModelingDataset, we know the columns are
# - IUPAC-encoded protein string
# - Input mask (all ones here; marks real tokens as opposed to padding once sequences are batched)
# - Protein clan
# - Protein family
# The protein ID (i.e., its number within its clan and family) is not included
print(f'Encoded data entries look like this: {train_data[0]}')

Training data has shape (32593668, 4)
Validation data has shape (1715454, 4)
Holdout data has shape (44311, 4)
File data entries look like this: {'primary': 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ', 'protein_length': 36, 'clan': 433, 'family': 9122, 'id': '0'}
Encoded data entries look like this: (array([ 2, 11,  7, 23, 25,  9,  8, 21,  7, 15, 13, 11, 16, 11,  5, 13, 15,
       15, 17, 11,  7, 25, 13, 11, 22, 11, 22, 15, 25,  5,  5, 11,  5, 15,
       13, 23, 20,  3], dtype=int64), array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64), 433, 9122)
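The encoded array above can be mapped back to the raw sequence. The sketch below is a minimal illustration, assuming the token order implied by the printed example (five special tokens at IDs 0 to 4, followed by the letters A to Z without J). The names `SPECIAL`, `VOCAB` and `decode` are hypothetical and not part of TAPE.

```python
import numpy as np

# Assumed vocabulary, inferred from the printed example above:
# IDs 0-4 are special tokens, IDs 5-29 are amino acid letters (no 'J').
SPECIAL = ['<pad>', '<mask>', '<cls>', '<sep>', '<unk>']
VOCAB = SPECIAL + list('ABCDEFGHIKLMNOPQRSTUVWXYZ')

def decode(token_ids):
    """Map token IDs back to a sequence string, skipping special tokens."""
    return ''.join(VOCAB[i] for i in token_ids if VOCAB[i] not in SPECIAL)

# Token IDs copied from the encoded entry printed above
encoded = np.array([ 2, 11,  7, 23, 25,  9,  8, 21,  7, 15, 13, 11, 16, 11,  5, 13, 15,
                    15, 17, 11,  7, 25, 13, 11, 22, 11, 22, 15, 25,  5,  5, 11,  5, 15,
                    13, 23, 20,  3])
decoded = decode(encoded)
print(decoded)
```

Under this assumed vocabulary, the decoded string matches the `primary` field of the file entry, and the two extra tokens (IDs 2 and 3) are the start and end markers that account for the encoded length of 38 versus a `protein_length` of 36.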


In [22]:
# def setify(data, param = 2):
#     res = set()
#     for i in range(len(data)):
#         res.add(data[i][param])
#     return res
# 
# # Clans in splits
# print(f'Unique clans in training data {len(setify(train_data))}')
# print(f'Unique clans in validation data {len(setify(valid_data))}')
# print(f'Unique clans in holdout data {len(setify(holdout_data))}')
# 
# # Families in splits
# print(f'Unique families in training data {len(setify(train_data, 3))}')
# print(f'Unique families in validation data {len(setify(valid_data, 3))}')
# print(f'Unique families in holdout data {len(setify(holdout_data, 3))}')


In [93]:
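from tape.datasets import LMDBDataset
# Path to the raw holdout LMDB file; replace with your own file path
holdout_lmdb_path = '/Users/lucasballing/Desktop/DeepLearningProject/prowave-main/data/pfam/pfam_holdout.lmdb'

# Raw (untokenized) holdout entries
holdout_lmdb = LMDBDataset(holdout_lmdb_path)
# train_data[2][3] is the family ID of the third training entry, not a dataset size
print("Family ID of third training entry:", train_data[2][3])

Family ID of third training entry: 1464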


#### Creating Some Histograms
This section plots histograms of the datasets to visualise the distribution of clan and family IDs.
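Because clan and family IDs are categorical labels rather than measurements, exact per-ID counts can complement a binned histogram. The following is a minimal sketch on a small hypothetical array (`clan_ids` is made-up toy data, not taken from the dataset), using `np.unique` with `return_counts=True`.

```python
import numpy as np

# Hypothetical clan IDs standing in for column 2 of the dataset entries
clan_ids = np.array([433, 433, 12, 433, 12, 7])

# Exact count per distinct ID, with no binning involved
unique_ids, counts = np.unique(clan_ids, return_counts=True)
for clan, n in zip(unique_ids, counts):
    print(f'Clan {clan}: {n} proteins')
```

The same call could be applied to the clan or family column once it has been collected into an array.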

In [9]:
# Collect the clan and family IDs into NumPy arrays so they can be plotted as histograms
# (plain variables are used rather than attributes on the numpy module)
holdout_ids = np.zeros((len(holdout_data), 2))
for x in range(len(holdout_data)):
    _, _, clan, family = holdout_data[x]
    holdout_ids[x] = (clan, family)

train_ids = np.zeros((len(train_data), 2))
for x in range(len(train_data)):
    _, _, clan, family = train_data[x]
    train_ids[x] = (clan, family)

# Holdout set
_ = plt.hist(holdout_ids[:, 0], bins='auto')  # arguments are passed to np.histogram
plt.title("Holdout Proteins - Clan ID")
plt.xlabel('Clan ID')
plt.ylabel('Number of Cases')
plt.show()

_ = plt.hist(holdout_ids[:, 1], bins='auto')  # arguments are passed to np.histogram
plt.title("Holdout Proteins - Family ID")
plt.xlabel('Family ID')
plt.ylabel('Number of Cases')
plt.show()

# Training set
_ = plt.hist(train_ids[:, 0], bins='auto')  # arguments are passed to np.histogram
plt.title("Training Proteins - Clan ID")
plt.xlabel('Clan ID')
plt.ylabel('Number of Cases')
plt.show()

_ = plt.hist(train_ids[:, 1], bins='auto')  # arguments are passed to np.histogram
plt.title("Training Proteins - Family ID")
plt.xlabel('Family ID')
plt.ylabel('Number of Cases')
plt.show()


# Import Tokenizers
#from .tokenizers import TAPETokenizer

# Number of token classes, i.e. the length of each one-hot vector
aminoacids = 29



### One-hot encoding over Protein Sequence

One way to represent a fixed set of symbols is a one-hot encoded vector, which contains 0s in all cells except for a single 1 in the cell that uniquely identifies each amino acid. The positions shown below are illustrative.

| Amino acid        | One-hot encoded vector   |
| ----------------- |--------------------------|
| Alanine (A)       | $= [1, 0, 0, \ldots, 0]$ |
| Cysteine (C)      | $= [0, 1, 0, \ldots, 0]$ |
| Aspartic acid (D) | $= [0, 0, 1, \ldots, 0]$ |

For large vocabularies, such as the words of a natural language, one-hot encodings often become inefficient because each sparse vector is as long as the vocabulary.
A common remedy in that setting is to truncate the vocabulary to the $k$ most used words and represent the rest with a special symbol, $\mathtt{UNK}$, so that rare entities such as names end up as $\mathtt{UNK}$.
Protein sequences do not need this: the vocabulary here has only `aminoacids = 29` entries, so each one-hot vector stays small.
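As an illustration of the encoding described above, the sketch below one-hot encodes a few token IDs with `torch.nn.functional.one_hot`. It is a minimal example, not the notebook's own pipeline. The value `num_classes = 30` is an assumption chosen so that token IDs 0 through 29 all fit, whereas the notebook itself sets `aminoacids = 29`.

```python
import torch
import torch.nn.functional as F

# Assumption: 30 classes, enough to cover token IDs 0 through 29
num_classes = 30

# First few token IDs from the encoded entry shown earlier
token_ids = torch.tensor([2, 11, 7, 23, 25, 9, 8, 21])

# Shape (sequence_length, num_classes); each row contains a single 1
one_hot = F.one_hot(token_ids, num_classes=num_classes).float()
print(one_hot.shape)
```

Taking `argmax` over the last dimension recovers the original token IDs, so the encoding loses no information. The conversion to `float` is there because recurrent layers expect floating-point input.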

#### RNN Network for Protein Generation
This section defines the architecture of the recurrent neural network (an LSTM followed by a linear output layer) used as the backbone of ProWave for protein generation.

In [10]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyRecurrentNet(nn.Module):
    def __init__(self):
        super(MyRecurrentNet, self).__init__()
        
        # Recurrent layer
        self.lstm = nn.LSTM(input_size=aminoacids,
                         hidden_size=50,
                         num_layers=1,
                         bidirectional=False)
        
        # Output layer
        self.l_out = nn.Linear(in_features=50,
                            out_features=aminoacids,
                            bias=False)
        
    def forward(self, x):
        # The LSTM returns the output sequence and the final hidden and cell states
        x, (h, c) = self.lstm(x)
        
        # Flatten output for feed-forward layer
        x = x.view(-1, self.lstm.hidden_size)
        
        # Output layer
        x = self.l_out(x)
        
        return x

net = MyRecurrentNet()
print(net)

MyRecurrentNet(
  (lstm): LSTM(29, 50)
  (l_out): Linear(in_features=50, out_features=29, bias=False)
)
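To make the tensor shapes concrete, the sketch below runs a dummy batch through a network with the same layer sizes. `TinyRecurrentNet` is a hypothetical stand-in defined here so the example is self-contained, and the sequence length and batch size are arbitrary. It assumes the default `nn.LSTM` layout of `(seq_len, batch, features)`.

```python
import torch
import torch.nn as nn

class TinyRecurrentNet(nn.Module):
    """Hypothetical stand-in with the same layer sizes as the printed model."""
    def __init__(self, n_tokens=29, hidden_size=50):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_tokens, hidden_size=hidden_size)
        self.l_out = nn.Linear(hidden_size, n_tokens, bias=False)

    def forward(self, x):
        x, _ = self.lstm(x)                    # (seq_len, batch, hidden_size)
        x = x.view(-1, self.lstm.hidden_size)  # (seq_len * batch, hidden_size)
        return self.l_out(x)                   # (seq_len * batch, n_tokens)

seq_len, batch_size, n_tokens = 12, 4, 29
dummy = torch.zeros(seq_len, batch_size, n_tokens)  # stands in for one-hot encoded input
logits = TinyRecurrentNet()(dummy)
print(logits.shape)
```

Because the output is flattened over time steps and batch entries, any target tensor used with a loss such as cross-entropy would need to be flattened in the same order.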
