![dory_meme](https://jeinson.github.io/images/dory-meme.png)

# Lab 9: RNNs and LSTMs

In this lab, you'll be implementing a Recurrent Neural Network (RNN) with a Long Short-Term Memory (LSTM) layer. As we've seen in class and in the readings, RNNs are designed to recognize patterns in sequence data. They are mainly used in language processing, but can also be used in bioinformatics, where nucleic acid and amino acid sequences are represented as strings of characters.

The goal of this lab is to train an RNN to differentiate between transmembrane proteins and proteins in non-membrane-bound organelles, based on their amino acid sequence. Protein structure is very complicated, but the two types of proteins have marked differences in their biochemical properties. Membrane-bound proteins, for example (see the figure below), must have hydrophilic and hydrophobic regions in alternating order so that they stay embedded in the hydrophobic lipid bilayer of a cell membrane. An LSTM should be able to recognize this pattern and use it to distinguish membrane-bound proteins from non-membrane-bound proteins.

![membrane_protein](https://upload.wikimedia.org/wikipedia/commons/d/db/Polytopic_membrane_protein.png)

In general, RNNs for text processing follow this workflow:

1. Load in and visualize data
2. Process data: create a vocabulary, encode the words, and encode the labels. Remove outliers and weird-looking text pieces
3. Split data into training, testing, and validation sets
4. Define a data loader to feed data into the network
5. Define the LSTM architecture and model class
6. Train the network, and test on user-generated data

The deliverables in this lab will be to complete each step of this pipeline.

### 1. Load and visualize data

To read in protein data in the `fasta` format, install the Biopython library into your Computational Methods conda environment (`pip install biopython`).

In [None]:
from Bio import SeqIO
import numpy as np
import torch

In [None]:
nmb_proteins = [x.seq for x in SeqIO.parse("protein_data/nmb_organelles_filter.fasta", "fasta")]
membrane_proteins = [x.seq for x in SeqIO.parse("protein_data/transmembrane.filter.fasta", "fasta")]

Plot histograms of the lengths of all proteins in both categories, then decide on a maximum length to use for a cutoff to truncate unusually long proteins. 

In [None]:
#### your code here ####
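Before reaching for a plotting library, it can help to bin the lengths by hand and look at the counts. The sketch below is one possible way to do that using only the standard library; `length_histogram`, `bin_width`, and the toy lengths are made up for illustration and are not part of the lab data. For a real plot, the same list of lengths can be passed to the histogram function of whatever plotting library you prefer.

```python
from collections import Counter

def length_histogram(lengths, bin_width=100):
    """Count sequences per length bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))

# Toy lengths standing in for [len(p) for p in nmb_proteins] (assumed values)
toy_lengths = [120, 180, 250, 310, 330, 390, 450, 980, 2400]
bin_width = 100
hist = length_histogram(toy_lengths, bin_width=bin_width)

# Print one row of '#' marks per occupied bin
for start, count in hist.items():
    print(f"{start:>5}-{start + bin_width - 1:<5} {'#' * count}")
```

In this toy example the long tail (the 980 and 2400 entries) stands out right away, which is the kind of thing to look for when picking a cutoff.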

### 2. Process Data

Let's filter out some of the sequences that are really long, since that will make our lives easier down the road. The maximum length chosen for the sequence will also be the number of time steps needed when defining your LSTM, so choose carefully :-) *Removing long sequences is preferable to truncating them, because important information could be contained in the regions that would be cut out.*

Combine the `nmb_proteins` and `membrane_proteins` lists into a single list called `proteins`, and make a corresponding list with the labels. 

In [None]:
time_steps = ### your code here #### 

In [None]:
# the two lists loaded in step 1
proteins = nmb_proteins + membrane_proteins
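As a rough sketch of the filter-then-combine step, the helper below drops every sequence longer than a cutoff and builds a matching label list. The function name, the toy sequences, and the label convention (0 for non-membrane-bound, 1 for membrane-bound) are assumptions for illustration; the lab does not prescribe them.

```python
def filter_and_label(group_a, group_b, max_len):
    """Drop sequences longer than max_len, then concatenate with 0/1 labels."""
    kept_a = [s for s in group_a if len(s) <= max_len]
    kept_b = [s for s in group_b if len(s) <= max_len]
    sequences = kept_a + kept_b
    # Assumed convention: 0 = first group, 1 = second group
    labels = [0] * len(kept_a) + [1] * len(kept_b)
    return sequences, labels

# Toy stand-ins for nmb_proteins and membrane_proteins
toy_nmb = ["MKV", "MKVLAAGIVGLL", "MA"]
toy_membrane = ["MLLAV", "MLLAVLLAVLLAVLL"]

toy_proteins, toy_labels = filter_and_label(toy_nmb, toy_membrane, max_len=10)
print(toy_proteins)
print(toy_labels)
```

Since `len()` also works on Biopython `Seq` objects, the same pattern should carry over to the lists loaded in step 1.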

In NLP, the next task is typically to create a vocabulary of all words used in the training data. For classifying proteins, this process is easy, since each token can only be one of 20 amino acids. (Hint: this vocabulary is defined in the next cell.)

In [None]:
# Bio.Alphabet was removed in Biopython 1.78; IUPACData holds the same 20 letters
from Bio.Data import IUPACData

protein_alphabet = IUPACData.protein_letters
n_aas = len(protein_alphabet)
print(protein_alphabet, n_aas)

####  Encode each amino acid
Make a function that maps each letter in the protein alphabet to a number from 1 to 20. Yes, Python, we actually want to start counting with 1 this time, since 0 will be used as a padding token in the next step! You will also find that some proteins have an 'X' amino acid. This codes for an unknown amino acid residue, so treat that as a 0 as well.

In [None]:
def encodeAA(aa):
    #### your code here ####
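One way to approach the encoding, sketched here with hypothetical names (`AA_TO_INT`, `encode_aa_sketch`, `encode_sequence_sketch`), is a dictionary lookup with 0 as the fallback. The alphabet string is written out by hand so the block runs on its own; it is assumed to match the 20-letter protein alphabet printed above.

```python
# The 20 standard amino acid letters (assumed to match protein_alphabet)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Map each letter to 1..20; 0 is reserved for padding and unknown residues
AA_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}

def encode_aa_sketch(aa):
    """Return 1..20 for a known amino acid, 0 for anything else (e.g. 'X')."""
    return AA_TO_INT.get(aa, 0)

def encode_sequence_sketch(seq):
    """Encode a whole protein (str or Seq) as a list of ints."""
    return [encode_aa_sketch(aa) for aa in str(seq)]

print(encode_sequence_sketch("MAX"))
```

Using `.get(aa, 0)` means any letter outside the 20 standard ones is treated as unknown, not just 'X'. Whether that is what you want is a design choice.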

Now encode all proteins as lists of ints. Encode the labels as 0s and 1s.

In [None]:
encoded_sequences = []
encoded_labels = []
#### your code here ####

#### Pad the data

Make a large matrix that will contain all protein strings in the correct size. Start by defining a large matrix of zeros, and fill it in row by row with the encoded protein sequences, leaving 0s if a protein is smaller than the maximum size. This padding step is typically performed in NLP, to normalize sentences of different lengths.

In [None]:
# one row per encoded protein, one column per time step
protein_features = torch.zeros(len(encoded_sequences), time_steps)
#### your code here ####

In [None]:
protein_features[0:10,:]
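The padding step described above can be sketched as follows. This is a minimal version under a few assumptions: sequences are right-padded (zeros go at the end), the matrix uses an integer dtype so it can later feed a one-hot or embedding layer, and `pad_sequences_sketch` is a made-up helper name. The slice to `time_steps` is only a guard; if long sequences were already removed, it never triggers.

```python
import torch

def pad_sequences_sketch(encoded, time_steps):
    """Right-pad lists of ints with 0 into a (n_sequences, time_steps) matrix."""
    features = torch.zeros(len(encoded), time_steps, dtype=torch.long)
    for row, seq in enumerate(encoded):
        seq = seq[:time_steps]  # guard only; filtering should make this a no-op
        features[row, :len(seq)] = torch.tensor(seq, dtype=torch.long)
    return features

# Toy encoded proteins of different lengths
toy_features = pad_sequences_sketch([[1, 2, 3], [4]], time_steps=4)
print(toy_features)
```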

At this point, it would be a good idea to shuffle the data and labels, in the same order of course, so that the rows aren't grouped into one block of non-membrane-bound proteins followed by one block of membrane-bound proteins.

In [None]:
#### your code here ####
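Shuffling both objects "in the same order" usually comes down to drawing one random permutation of row indices and applying it to both. The sketch below does this with `torch.randperm`; the helper name `shuffle_together` and the fixed seed are assumptions made for reproducibility, not requirements of the lab.

```python
import torch

def shuffle_together(features, labels, seed=0):
    """Apply one random row permutation to both features and labels."""
    labels = torch.as_tensor(labels)
    generator = torch.Generator().manual_seed(seed)
    perm = torch.randperm(features.shape[0], generator=generator)
    return features[perm], labels[perm]

# Toy data where row i is [i, i, i] and its label is i, so alignment is easy to check
toy_features = torch.arange(6).unsqueeze(1).repeat(1, 3)
toy_labels = torch.arange(6)

shuffled_x, shuffled_y = shuffle_together(toy_features, toy_labels)
print(shuffled_x[:, 0].tolist(), shuffled_y.tolist())
```

The two printed lists should be identical, which confirms that each row kept its own label.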

### 3. Split the data into training, validation, and testing
Split the data and labels into 80% for training, 10% for validation, and 10% for testing.

In [None]:
split_frac = 0.8
train_x = 
train_y = 

remaining_x = 
remaining_y = 

valid_x = 
valid_y = 

test_x = 
test_y = 
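For reference, an 80/10/10 split by slicing might look like the sketch below. It assumes the rows have already been shuffled, and `split_80_10_10` is a hypothetical helper. Plain slicing works the same way on Python lists and on torch tensors, so toy lists are used here to keep the block dependency-free.

```python
def split_80_10_10(x, y, split_frac=0.8):
    """Slice x and y into train / validation / test parts (rows assumed shuffled)."""
    n = len(x)
    train_end = int(split_frac * n)
    # Split whatever remains in half: first half validation, second half test
    valid_end = train_end + (n - train_end) // 2
    train = (x[:train_end], y[:train_end])
    valid = (x[train_end:valid_end], y[train_end:valid_end])
    test = (x[valid_end:], y[valid_end:])
    return train, valid, test

toy_x = list(range(20))
toy_y = [i % 2 for i in range(20)]
(tr_x, tr_y), (va_x, va_y), (te_x, te_y) = split_80_10_10(toy_x, toy_y)
print(len(tr_x), len(va_x), len(te_x))
```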

### 4. Define data loaders
Again, this signature feature of PyTorch makes loading data into your model a lot easier during training. The `train_loader` object can be iterated over, and will spit out batches of data in a specified size.

In [None]:
# You get a freebie on this one
from torch.utils.data import DataLoader, TensorDataset

# create Tensor datasets
train_data = TensorDataset(train_x, torch.IntTensor(train_y))
valid_data = TensorDataset(valid_x, torch.IntTensor(valid_y))
test_data = TensorDataset(test_x, torch.IntTensor(test_y))

# dataloaders 
batch_size = 50

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
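To see what a loader actually hands back, the sketch below builds one from random toy tensors (assumed sizes: 120 sequences of 30 time steps) and collects the shape of every batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for train_x / train_y: 120 sequences, 30 time steps each
toy_x = torch.randint(0, 21, (120, 30))
toy_y = torch.randint(0, 2, (120,))

toy_loader = DataLoader(TensorDataset(toy_x, toy_y), shuffle=True, batch_size=50)

# Each iteration yields one (features, labels) pair
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in toy_loader]
print(batch_shapes)
```

Note that the last batch holds only 20 rows, because 120 is not a multiple of 50. Any code that hard-codes the batch size (for example when initializing a hidden state by hand) has to cope with that, or the loader can be created with `drop_last=True`.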

### 5. Define the model!!
Our "unrolled" model will look like this, as we slide along the protein sequence:

![LSTM_model](https://github.com/jeinson/jeinson.github.io/blob/master/images/LSTM%20Diagram.png?raw=true)


In English, the layers of the network are:
1. One-hot encoding layer, which takes an amino acid and converts it into a one-hot vector. Alternatively, you can use an embedding layer, which we haven't covered in detail yet, but we'll get to next week. 
2. LSTM layer: defined by the input size, the hidden state dimension, and the number of stacked layers. The number of time steps it unrolls over is the length of the sequences passed through the net
3. Fully connected layer: this maps the output of the LSTM layer to a desired output size
4. Sigmoid activation layer: This maps the output of the fully connected layer to the space between 0 and 1.
5. Output: The sigmoid output from the last timestep is considered as the final output of the network. 
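The layer list above can be sketched as a small `nn.Module`. This is only one possible reading, not a reference solution: the class name `ProteinLSTMSketch`, the hidden size of 64, and the choice of 21 tokens (20 amino acids plus the 0 padding/unknown code) are all assumptions. Applying the fully connected layer only to the last time step is equivalent to applying it at every step and keeping the last output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProteinLSTMSketch(nn.Module):
    def __init__(self, n_tokens=21, hidden_dim=64, n_layers=1):
        super().__init__()
        self.n_tokens = n_tokens
        # batch_first=True means inputs are (batch, time_steps, features)
        self.lstm = nn.LSTM(input_size=n_tokens, hidden_size=hidden_dim,
                            num_layers=n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time_steps) integer codes, 0 = padding or unknown residue
        one_hot = F.one_hot(x.long(), num_classes=self.n_tokens).float()
        lstm_out, _ = self.lstm(one_hot)   # (batch, time_steps, hidden_dim)
        last_step = lstm_out[:, -1, :]     # (batch, hidden_dim)
        # Sigmoid squashes the single output to a value between 0 and 1
        return torch.sigmoid(self.fc(last_step)).squeeze(1)

model = ProteinLSTMSketch()
toy_batch = torch.randint(0, 21, (4, 30))  # 4 toy sequences, 30 time steps
probs = model(toy_batch)
print(probs.shape)
```

One caveat with this design: if sequences are padded on the right, the last time step of a short protein sees only padding tokens. Padding on the left, or packing the batch with `torch.nn.utils.rnn.pack_padded_sequence`, are two options worth considering.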

The rest of the lab is up to you. Once you've gotten the protein sequences and labels into an ML-ready format, define an RNN class, and train it. Use whatever optimizer works. Try to optimize your model as best as you can, but training may take a while depending on how you decide to implement it. You may have to add a dropout layer to limit overfitting, or use some other tricks to speed up the training. Also, if you think you know a better way to prepare the data for ML, feel free to ignore everything up to this point. The only **requirement** is that you use the provided data, implement a recurrent neural network with PyTorch, and show how your training accuracy changes. *To be honest, I'm still debugging my implementation, so I actually have no idea if this will work.*
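To show how training accuracy changes, one option is to record the fraction of correct predictions per epoch. The loop below is a bare-bones sketch under several assumptions: binary cross-entropy loss, the Adam optimizer, a 0.5 decision threshold, gradient clipping at 5.0, and a deliberately tiny stand-in model (`TinyLSTM`) trained on random toy data. None of these choices come from the lab itself, and random labels will not produce meaningful accuracy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class TinyLSTM(nn.Module):
    """Small stand-in classifier so this block runs on its own."""
    def __init__(self, n_tokens=21, hidden_dim=16):
        super().__init__()
        self.n_tokens = n_tokens
        self.lstm = nn.LSTM(n_tokens, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        one_hot = F.one_hot(x.long(), self.n_tokens).float()
        out, _ = self.lstm(one_hot)
        return torch.sigmoid(self.fc(out[:, -1, :])).squeeze(1)

def train_sketch(model, loader, n_epochs=3, lr=0.01):
    """Train with BCE loss and return the training accuracy of each epoch."""
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for epoch in range(n_epochs):
        model.train()
        n_correct, n_seen = 0, 0
        for x_batch, y_batch in loader:
            optimizer.zero_grad()
            preds = model(x_batch)
            loss = criterion(preds, y_batch.float())
            loss.backward()
            # Clipping is a common guard against exploding gradients in RNNs
            nn.utils.clip_grad_norm_(model.parameters(), 5.0)
            optimizer.step()
            n_correct += ((preds > 0.5) == y_batch.bool()).sum().item()
            n_seen += y_batch.shape[0]
        history.append(n_correct / n_seen)
    return history

torch.manual_seed(0)
toy_x = torch.randint(1, 21, (64, 12))   # 64 random toy sequences
toy_y = torch.randint(0, 2, (64,))       # random labels
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=16, shuffle=True)

history = train_sketch(TinyLSTM(), toy_loader)
print(history)  # one training accuracy per epoch
```

The returned `history` list can be plotted against the epoch number. Running the same accuracy computation over `valid_loader` with `model.eval()` and `torch.no_grad()` would give a validation curve to compare against.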

After training, test your model on some protein sequences from the [protein databank](https://www.rcsb.org/), and report what they were classified as. 

Here are some resources to help you in your quest:
* https://medium.com/dair-ai/building-rnns-is-fun-with-pytorch-and-google-colab-3903ea9a3a79
* https://towardsdatascience.com/sentiment-analysis-using-lstm-step-by-step-50d074f09948
* http://colah.github.io/posts/2015-08-Understanding-LSTMs/
* http://google.com

As always, please ask questions on the Piazza page. I expect this will be challenging, so teamwork is highly encouraged. Have fun!

In [None]:
#### your code here #### 