# Machine Learning/Deep Learning Applied to Protein Design and Modeling

Hopefully these tutorials have given you an appreciation for the breadth and complexity of predicting/designing protein structure and function. The methods we have covered thus far use what we know about proteins as described through physics, chemistry, and empirically derived statistics, combined with clever sampling algorithms and constraints, to find protein sequences/conformations that (hopefully) perform our desired goals. These methods are limited by our understanding of protein sequence/structure/function relationships, which, while extensive, still leaves much to be desired.

Machine learning/deep learning is an exciting avenue of research that takes advantage of the exponentially increasing availability of protein sequence ([Uniprot](https://www.uniprot.org/)), structure ([Protein Data Bank](https://www.rcsb.org/)), and function ([Uniprot](https://www.uniprot.org/), [BRENDA](https://www.brenda-enzymes.org/)) information to *learn* about these features directly from data. Given a large dataset, we can use deep learning algorithms to extract the key features responsible for imparting the structure/function we are interested in modeling.

In this tutorial, we will be focusing on recent applications of neural networks, a key tool in machine learning/deep learning, to predicting and designing proteins with novel structures and functions. This tutorial is meant to be a surface-level introduction to machine learning; we will only introduce the basic underlying mathematical concepts and focus more on applications of neural networks to prediction and design of protein structure and function.

At the end of this tutorial, you will:

* Have a surface-level understanding of different types of neural networks
* Understand how large datasets can be used to predict protein structure/function
* Understand how large datasets can be used to generate new protein structure/function

# Quick and Dirty Introduction to Neural Networks

Here we will provide a quick introduction to the underlying algorithm of neural networks. A great [visual explanation of neural networks](http://3b1b.co/neural-networks) can be found on the 3Blue1Brown YouTube channel. If you prefer another source, [Skymind.ai](https://skymind.ai/wiki/neural-network) also provides a good introduction.

## Neurons are the Building Blocks of Neural Networks

The core building block of a neural network is a neuron. A neuron takes a series of inputs and outputs a single number that either acts as the input for another neuron or acts as the final output of the network.

<br><img src="Assets/Tutorial_6/Neuron.png" width="400" align="center"/><br>

Each neuron consists of the same basic operations:

### 1. Apply weights and bias to inputs

The input to a neuron is a vector, $v$, with a fixed dimension (we will discuss where these vectors come from later). The neuron has a complementary weight vector, $w$, that scales each corresponding value in $v$. This operation can be expressed as the vector multiplication:

<br><center>$v * w^T$</center><br>

where $w^T$ is the transpose of $w$. Another way of expressing this is 

<br><center>$\sum_{i=1}^{n} v_i * w_i$</center><br>

where $n$ is the length of vectors $w$/$v$ and $i$ indexes the elements of $w$/$v$.

This weighted sum collapses the vector input into a single value. A bias $b$ is then added to shift that value.

### 2. Apply an activation function

Once the weights and bias are applied to the input vector, the resulting value is run through an activation function that applies non-linearity. This function can be a [ReLu](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), leaky ReLu, [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function), or [hyperbolic tangent function (tanh)](https://en.wikipedia.org/wiki/Hyperbolic_function). Non-linearity allows neural networks to [approximate essentially any continuous function](http://neuralnetworksanddeeplearning.com/chap4.html) with a finite number of neurons, which is extremely important for learning complex spaces!

The following plot (from [Wikipedia](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#/media/File:Rectifier_and_softplus_functions.svg)) shows the functional form of the ReLu function (in blue):

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6c/Rectifier_and_softplus_functions.svg/2560px-Rectifier_and_softplus_functions.svg.png" width="400" align="center"/>

The activation function produces a single scalar value that serves as the output of the neuron.
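
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `relu` and `neuron` and all numeric values are made up for this example.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keeps positive values, replaces negatives with 0."""
    return np.maximum(0.0, z)

def neuron(v, w, b, activation=relu):
    """Step 1: weighted sum of the inputs plus a bias. Step 2: activation."""
    z = np.dot(v, w) + b
    return activation(z)

# Illustrative values only (not taken from a trained network)
v = np.array([1.0, 2.0, 3.0])    # input vector
w = np.array([0.5, -1.0, 1.0])   # one weight per input

# Weighted sum is 0.5 - 2.0 + 3.0 = 1.5; adding the bias 0.1 gives 1.6
output = neuron(v, w, b=0.1)

# With a bias of -2.0 the pre-activation value is -0.5, which ReLu clips to 0
clipped_output = neuron(v, w, b=-2.0)

# Any other activation can be swapped in, for example tanh
tanh_output = neuron(v, w, b=0.1, activation=np.tanh)
```

Changing only the bias is enough to switch the ReLu neuron between "on" and "off", which is one way to think about what the bias contributes.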

## Combining Neurons into a Network

Here we will describe a fully-connected **feed-forward artificial neural network** with an input layer ($i$), two hidden layers ($h_1$, $h_2$), and an output layer ($o$). Each neuron (blue circles) performs the operation $ReLu(vw^T + b)$ to evaluate and store a scalar output, where $v$ is the input vector for the neuron, $w$ is the weight vector containing weights for each value in $v$, and $b$ is the bias. Edges (the lines connecting nodes) carry the values of weights and biases. The output of each neuron is passed to the next layer along edges, where each edge has a unique weight $w$. Note that the bias nodes are not connected to the previous layer.

<br><img src="Assets/Tutorial_6/FFNN.png" width="600" align="center"/><br>

### Input Layer ($i$)

The input of a neural network is, unsurprisingly, a vector. Input vectors are typically [trained embeddings](https://skymind.ai/wiki/word2vec#embed) or a [one-hot representation](https://en.wikipedia.org/wiki/One-hot) of a desired input. For instance, you can generate a 20-dimensional one-hot encoding for amino acids, where each canonical amino acid is represented by a bit vector with a single 1 at its own position.

Protein embedding examples: [Learned protein embeddings for machine learning](https://doi.org/10.1093/bioinformatics/bty178), [Learning protein sequence embeddings using information from structure](https://arxiv.org/pdf/1902.08661v1.pdf)
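
As a concrete illustration of the one-hot representation, here is a minimal sketch. The alphabetical ordering of the one-letter codes and the helper name `one_hot_encode` are assumptions made for this example; any fixed ordering works as long as it is used consistently.

```python
import numpy as np

# The 20 canonical amino acids in one-letter code.
# The ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: index for index, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return a (len(sequence), 20) array with exactly one 1 per row."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for position, aa in enumerate(sequence):
        encoding[position, AA_TO_INDEX[aa]] = 1.0
    return encoding

# Hypothetical three-residue peptide: Met-Lys-Val
encoded = one_hot_encode("MKV")
```

Each row of `encoded` is one input vector, so a sequence of length $L$ becomes an $L \times 20$ matrix.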

### Hidden Layers ($h_1$, $h_2$)

Hidden layers receive their inputs from either the input layer or other hidden layers. These layers apply their weights and biases to gradually transform inputs into values that yield an intelligible output layer.

### Output Layer ($o$)

The output layer is simply the final numerical result of repeated vector multiplication across all nodes through each layer in the network. This is known as **forward propagation**. The values of the output layer can be turned into a number of things (e.g. probabilities, coordinates, or contact maps) depending on how we further process them. One common method is applying the [Softmax function](https://en.wikipedia.org/wiki/Softmax_function) to convert the output layer into probabilities for a number of classes. This is known as a classification problem, where we want the network to recognize an input vector as one of several predetermined classes by assigning the greatest probability to the correct class.
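
A minimal sketch of the Softmax step is shown below. The raw output values are invented for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something required by the definition.

```python
import numpy as np

def softmax(x):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    shifted = x - np.max(x)   # does not change the result, avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical raw values of a three-neuron output layer
raw_output = np.array([2.0, 1.0, 0.1])

probabilities = softmax(raw_output)
predicted_class = int(np.argmax(probabilities))  # index of the most probable class
```

The largest raw value always receives the largest probability, so the predicted class is simply the position of the maximum.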

### Mathematical Notation

Up to this point, we have been considering the inputs and outputs of single neurons as:

<br><center>$output = ReLu(vw^T + b)$</center><br>

where ReLu is the activation function, $v$ is the input vector to the neuron, $w$ is the corresponding weight vector, and $b$ is the bias. For instance, the weights and bias from layer $h_1$ are evaluated to get the output of the first position in layer $h_2$:

<br><img src="Assets/Tutorial_6/FFNN_vector.png" width="600" align="center"/><br>

This can become quite tedious and slow, especially if our neural network grows to include hundreds of hidden layers with hundreds of nodes each. Luckily, we can take advantage of matrix multiplication to condense the calculation of the next layer into a single operation. 

Let $v$ be a vector (layer) in the neural network, as before. Instead of calculating the output for each node individually, we can concatenate the weight vectors connecting the input nodes to each node in the output layer into a matrix, $W$. The new operation becomes:

<br><center>$v_o = ReLu(W_{io}v_{i})$</center><br>

where $v_o$ is the output layer, $v_i$ is the input layer, and $W_{io}$ is the matrix for the weights connecting $v_i$ to $v_o$.

The figure below represents the operation $h_2 = ReLu(W_{12}h_{1})$:

<br><img src="Assets/Tutorial_6/FFNN_matrix.png" width="600" align="center"/>
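
To see that the matrix form gives the same answer as evaluating each neuron on its own, here is a small sketch. The layer sizes and random weights are arbitrary, and the bias is written as an explicit vector `b2` (an assumption of this example; the equation above folds it into the layer).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

h1 = rng.random(4)                # values of a hypothetical 4-neuron layer h_1
W12 = rng.normal(size=(3, 4))     # row j holds the weight vector of neuron j in h_2
b2 = rng.normal(size=3)           # one bias per neuron in h_2

# One neuron at a time: a dot product per neuron
per_neuron = np.array([relu(np.dot(W12[j], h1) + b2[j]) for j in range(3)])

# Whole layer at once: a single matrix-vector product
all_at_once = relu(W12 @ h1 + b2)

def forward(v, weights, biases):
    """Forward propagation: apply each layer's matrix, bias, and ReLu in turn."""
    for W, b in zip(weights, biases):
        v = relu(W @ v + b)
    return v

# A second, equally hypothetical layer mapping h_2 (3 values) to o (2 values)
W2o = rng.normal(size=(2, 3))
bo = np.zeros(2)
output_layer = forward(h1, [W12, W2o], [b2, bo])
```

The two results are identical; the matrix form is preferred because linear algebra libraries evaluate it far faster than a Python loop over neurons.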

## Learning the Weights/Biases of a Neural Network

So far we have covered how an input vector is converted into an output vector through forward propagation of a neural network, where we apply weights and biases to each layer forward through the network. We have yet to discuss how we determine the weights for the network that produce "correct" outputs. The weights of a network are updated through backpropagation guided by an objective function.

This explanation will be in the context of a supervised learning classification problem, where we have a labelled dataset with different classes and would like to train our network to correctly classify each input.

For reference, [MNIST](http://yann.lecun.com/exdb/mnist/) and [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) are examples of commonly used labelled datasets.

### Objective function

In the context of machine learning, an objective function (also known as a loss function) quantifies how "wrong" the output of the neural network is. Typically, we want to minimize the objective function, where a value of 0 means the network assigned a probability of 1 to the correct class. One common objective function for classification problems is [negative log loss](https://scikit-learn.org/stable/modules/model_evaluation.html#log-loss) (also known as log loss or cross-entropy loss), which takes the form:

<br><center>$\sum_{i=1}^{n}{-y_{i}log(p_i)}$</center><br>

where $n$ is the number of classes, $y_i$ takes a value in $\{0,1\}$ (1 if class $i$ is the expected class, 0 otherwise), $p_i$ is the probability returned by the neural network for class $i$, and $log$ is the natural logarithm. For example, if our neural network returns an output vector of $[0.3, 0.3, 0.4]$ and we know the true classification is $[0, 0, 1]$, we compute the negative log loss as:

<br><center>$loss = - [0 * log(0.3)] - [0 * log(0.3)] - [1 * log(0.4)] = 0.916$</center><br>
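
The same calculation can be checked numerically. This is a small sketch for a single example; the helper name `log_loss` is chosen for this illustration, and it assumes every predicted probability is strictly greater than 0.

```python
import numpy as np

def log_loss(y_true, p_pred):
    """Sum of -y_i * ln(p_i) over all classes, for one training example."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(-np.sum(y_true * np.log(p_pred)))

# True class is the third one; the network gives it a probability of 0.4
loss = log_loss([0, 0, 1], [0.3, 0.3, 0.4])        # -ln(0.4), about 0.916

# A more confident (and correct) prediction is penalized much less
confident_loss = log_loss([0, 0, 1], [0.05, 0.05, 0.9])
```

Only the probability assigned to the true class affects the result, since every other term is multiplied by $y_i = 0$.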

### Backpropagation

Now that we can quantify error, the goal of backpropagation is to iteratively adjust the weights and biases in the neural network to reduce error. [3Blue1Brown](https://youtu.be/Ilg3gGewQ5U) has a fantastic overview of the backpropagation algorithm. [Christopher Olah](https://colah.github.io/posts/2015-08-Backprop/) also provides an intuitive explanation.

Recall the gradient descent algorithm we discussed in Tutorial 3 with regards to Pose minimization in Rosetta. Similarly, the backpropagation algorithm calculates the gradient of the objective function with respect to each individual weight and bias in the network: the partial derivatives for weights/biases that greatly contribute to the error are far from zero, while those for weights/biases that don't contribute as much to the error are closer to zero. Backpropagation gets its name because we use the gradient of layer $n$ to calculate the gradient of layer $n-1$ using the [chain rule](https://en.wikipedia.org/wiki/Chain_rule#Multivariable_case). We work backwards through the network, applying the chain rule to iteratively calculate the gradient of each previous layer until we have calculated the partial derivative of the loss with respect to all weights in the network.

From [Christopher Olah's blog](https://colah.github.io/posts/2015-08-Backprop/), for a set of connected neurons that apply differentiable functions:

<img src="https://colah.github.io/posts/2015-08-Backprop/img/tree-eval-derivs.png" width="400" align="center"/>

Taking the derivative of the output node ($e$ in this example) with respect to every other node calculates the contribution of each neuron to the output/loss function.

<img src="https://colah.github.io/posts/2015-08-Backprop/img/tree-backprop.png" width="400" align="center"/>
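
The backward pass can be traced by hand in plain Python. This sketch assumes the graph in the figures is the expression $e = (a + b)(b + 1)$ evaluated at $a = 2$, $b = 1$, as in Olah's post; the variable names are chosen for this example.

```python
# Forward pass: evaluate every node from the inputs to the output
a, b = 2.0, 1.0
c = a + b        # 3.0
d = b + 1.0      # 2.0
e = c * d        # 6.0

# Backward pass: start at the output and apply the chain rule node by node
de_de = 1.0
de_dc = de_de * d                    # e = c * d, so de/dc = d
de_dd = de_de * c                    # e = c * d, so de/dd = c
de_da = de_dc * 1.0                  # c = a + b, so dc/da = 1
de_db = de_dc * 1.0 + de_dd * 1.0    # b reaches e through both c and d: sum the paths

# Sanity check with a central finite difference on the original expression
def expression(a, b):
    return (a + b) * (b + 1.0)

eps = 1e-6
numeric_de_db = (expression(a, b + eps) - expression(a, b - eps)) / (2 * eps)
```

Note that each intermediate derivative (`de_dc`, `de_dd`) is computed once and reused, which is what makes working backwards efficient in large networks.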

An optimization method such as [Stochastic Gradient Descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) or [Adam](https://arxiv.org/pdf/1412.6980) uses the calculated partial derivatives to update all weights/biases in the network. With SGD, we calculate and average the loss across several training inputs (a mini-batch, e.g. 64 inputs) before calculating the gradient of the average loss with respect to each weight/bias in the network. Weights are updated as follows:

<br><center>$w_t^{ij} = w_{t-1}^{ij} - \alpha * \frac{{\partial}E}{{\partial}w_{t-1}^{ij}}$</center><br>

where $w_t^{ij}$ is the weight between nodes $i$ and $j$ at step $t$ (after update), $w_{t-1}^{ij}$ is the weight between nodes $i$ and $j$ at step $t-1$ (before update), $\alpha$ is the learning rate, and $\frac{{\partial}E}{{\partial}w_{t-1}^{ij}}$ is the partial derivative of the loss function $E$ with respect to $w_{t-1}^{ij}$. The learning rate is a parameter of SGD that determines how quickly the weights of the network are changed. This is typically a very small number (e.g. 0.0001).
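
The update rule can be tried out on a toy problem. The sketch below is an illustration under several assumptions: it fits a single weight and bias to synthetic data, uses mean squared error instead of log loss so the gradients can be written by hand, and uses a learning rate of 0.1, which is much larger than what is typical for deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 2x + 1 plus a little noise
x = rng.uniform(-1.0, 1.0, size=256)
y = 2.0 * x + 1.0 + rng.normal(scale=0.01, size=256)

w, b = 0.0, 0.0      # parameters to learn
alpha = 0.1          # learning rate (large because this is a toy problem)
batch_size = 64      # number of inputs averaged per update

def mean_squared_error(w, b, x, y):
    return float(np.mean((w * x + b - y) ** 2))

initial_loss = mean_squared_error(w, b, x, y)

for epoch in range(200):
    order = rng.permutation(len(x))              # reshuffle the data every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = w * x[idx] + b - y[idx]
        grad_w = np.mean(2.0 * error * x[idx])   # dE/dw averaged over the mini-batch
        grad_b = np.mean(2.0 * error)            # dE/db averaged over the mini-batch
        w -= alpha * grad_w                      # w_t = w_{t-1} - alpha * dE/dw
        b -= alpha * grad_b

final_loss = mean_squared_error(w, b, x, y)
```

After training, `w` and `b` should sit close to the values 2 and 1 used to generate the data. In a real network the gradients come from backpropagation rather than a hand-written formula, but the update line is the same.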

### Training, Test, and Validation Sets
In progress...

## How can we apply Neural Networks to Protein Sequences?

Now that we understand how a basic neural network functions, we can investigate how this basic architecture has been adapted to address different types of problems in the field of protein modeling and design. Don't concern yourself with the math here: the goal is to gain an intuition for how these algorithms learn from structured data to reveal underlying features about proteins.

### Recurrent Neural Networks (RNNs)
There are a number of great online resources for learning how recurrent neural networks work, such as [Skymind.ai](https://skymind.ai/wiki/lstm) and [Christopher Olah's blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). This section will be a brief summary of these resources so we have a foundation to understand papers that use these methods.

Recurrent neural networks add a small tweak to the feed-forward networks we previously discussed. Here is a simplified representation of a feed-forward network:

<img src="Assets/Tutorial_6/FFNN_Simple.png" width="600" align="center"/>

where each column represents a layer in the network and the arrows represent the operations between them. As we saw before, the network starts with an input vector (green) and each hidden layer is evaluated as $v_o = ReLu(W_{io}v_{i} + b_i)$ using the values ($v_i$), weight matrix ($W_{io}$) and bias ($b_i$) from the previous layer. Notice that there is only one input, and weights/values are passed forward in one direction through the network to produce a final layer (the output layer).

In contrast to feed-forward neural networks, *each hidden layer in a recurrent neural network takes an input vector*. A simplified representation of a recurrent neural network:

<img src="Assets/Tutorial_6/RNN_Rollout.png" width="600" align="center"/>

This is extremely powerful for data that takes the form of a sequence, where the network can incorporate and learn underlying features of the sequence at each position. Predictions made by a recurrent neural network take into account previously seen inputs.

Inputs are incorporated at each position in the sequence (here we will think of it as time steps $t$) by concatenating the hidden layer for $t$ with the input vector at $t$. A weight matrix with transition weights from $t$ to $t+1$ transforms this combined vector into the hidden layer for $t+1$ (after applying the activation function of course!).

<img src="Assets/Tutorial_6/RNN_Operation.png" width="600" align="center"/>

At every position in the sequence, the new hidden layer $t+1$ encodes information about the input vector at $t$ as well as all previous inputs. This allows the hidden layer to serve two purposes: not only does it serve as the hidden layer for the next position in the sequence, it also serves as the output of the network at that position. In the following examples, we will refer to the output vector for position $t$ as $x_t$. 

These output vectors are processed to generate predictions for each position in the sequence:

<img src="Assets/Tutorial_6/RNN_perposition.png" width="600" align="center"/>

Alternatively, the output for the last position in the sequence can be processed to make a single prediction for the entire sequence.

The papers we will be reviewing use each of these strategies to make inferences on protein sequences.
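
The recurrence described above can be sketched as follows. The sizes, random weights, and function name `rnn_forward` are assumptions for illustration, and the initial hidden layer is taken to be all zeros. Multiplying one matrix by the concatenated vector is equivalent to using separate input-to-hidden and hidden-to-hidden matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 20, 8, 5    # hypothetical sizes

# One weight matrix acting on the concatenation [hidden layer, input vector]
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

def rnn_forward(inputs, W, b):
    """Run a basic RNN over a sequence and return the hidden layer at every position."""
    h = np.zeros(W.shape[0])                   # initial hidden layer
    outputs = []
    for input_t in inputs:
        combined = np.concatenate([h, input_t])
        h = np.tanh(W @ combined + b)          # new hidden layer, also the output
        outputs.append(h)
    return np.stack(outputs)

# Stand-in for an encoded sequence of length 5 (random values, not real data)
inputs = rng.random((seq_len, input_size))
outputs = rnn_forward(inputs, W, b)

per_position = outputs          # one output vector per position
whole_sequence = outputs[-1]    # last output, used for a single prediction
```

The same weights `W` and `b` are reused at every position; only the hidden layer changes as the network moves along the sequence.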

### Long Short-Term Memory (LSTM) Networks

Long Short-Term Memory networks are a special type of RNN that allows for long-term dependencies in a sequence. Where a basic RNN performs a single operation at each time step \[$h_{t+1} = tanh(W_{xh}x_{t} + W_{hh}h_{t} + b)$, where $x_t$ here denotes the input vector at position $t$\], an LSTM network performs *four* separate operations that use the previous hidden layer and current input to decide what information to (1) forget, (2) remember, and (3) include and pass on in the new hidden layer. LSTMs do this using a vector called the cell state ($C$) that is passed from position to position in the sequence ($C_t$ for position $t$). The cell state serves as a condensed representation of what the network has previously seen in the sequence. At each position, the cell state is updated to forget old information and remember new information using the previous hidden layer and current input vector. The updated cell state is then used to evaluate the new hidden/output layers (remember these are the same thing!).

Here I will be condensing the information from [Christopher Olah's blog post on LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) (you should definitely read this post in full!).

We have previously discussed the architecture of a basic RNN cell \[$h_{t+1} = tanh(W_{xh}x_{t} + W_{hh}h_{t} + b)$\]:

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png" width="600" align="center"/>

An LSTM cell performs additional operations:

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" width="600" align="center"/>

An LSTM maintains a cell state ($C$) that is passed from position to position in the sequence:

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-C-line.png" width="600" align="center"/>

First, the cell decides what information should be forgotten given the previous hidden layer $h_{t-1}$ and the current input layer $x_t$:

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png" width="600" align="center"/>

Second, the cell decides what information should be remembered given the previous hidden layer $h_{t-1}$ and the current input layer $x_t$:

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png" width="600" align="center"/>

The cell state ($C_{t-1}$) is updated to $C_t$ by applying what was chosen to be forgotten and remembered:

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png" width="600" align="center"/>

Finally, the updated cell state $C_{t}$ is used alongside the previous hidden layer $h_{t-1}$ and current input layer $x_t$ to evaluate the hidden layer for the current position ($h_t$):

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png" width="600" align="center"/>

The hidden layer $h_t$ serves as the output for the current position in the sequence as well as the hidden layer for the next position in the sequence.
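
The four operations can be written out as a single step function. This sketch follows the standard LSTM equations shown in the figures from Olah's post; the parameter names, the dictionary layout, the sizes, and the random weights are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM position: returns the new hidden layer h_t and cell state C_t."""
    z = np.concatenate([h_prev, x_t])                    # previous hidden layer + input
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])     # what to forget (0 to 1)
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])     # what to remember (0 to 1)
    C_candidate = np.tanh(params["W_C"] @ z + params["b_C"])  # proposed new content
    C_t = f_t * C_prev + i_t * C_candidate               # updated cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])     # what to pass on
    h_t = o_t * np.tanh(C_t)                             # new hidden/output layer
    return h_t, C_t

rng = np.random.default_rng(0)
input_size, hidden_size = 20, 8                          # hypothetical sizes

params = {}
for gate in ("f", "i", "C", "o"):
    params["W_" + gate] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params["b_" + gate] = np.zeros(hidden_size)

# Run over a short stand-in sequence (random values, not real data)
h = np.zeros(hidden_size)
C = np.zeros(hidden_size)
for x_t in rng.random((5, input_size)):
    h, C = lstm_step(x_t, h, C, params)
```

The forget and input gates multiply the cell state element by element, so each entry of the cell state can be kept, overwritten, or blended independently.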

### Bidirectional RNNs and LSTMs
RNNs/LSTMs are extremely powerful since they are capable of incorporating prior information to make forward predictions. However, there are applications where knowing the full context of the sequence, rather than only what has been seen up to the current sequence position, would greatly improve predictive capability. One solution is to run two RNNs/LSTMs in parallel: one going forward through the sequence and the other going through the sequence in reverse. The output vectors of the two RNNs/LSTMs are concatenated to serve as the final output for the network at each position.

<img src="Assets/Tutorial_6/biRNN_Architecture.png" width="600" align="center"/>

The concatenated output vectors for each position in the input sequence of length $n$ (forward RNN output at step $t$ + reverse RNN output at step $n-t+1$, counting steps from 1) are further processed to make predictions for each position or the entire sequence.

<img src="Assets/Tutorial_6/biRNN_Predictions.png" width="600" align="center"/>
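
A minimal sketch of the bidirectional idea, using two basic RNNs with independent weights. Sizes, weights, and names are assumptions for illustration. The key detail is that the reverse outputs are flipped back so that row $t$ of both arrays refers to the same sequence position before concatenation.

```python
import numpy as np

def rnn_forward(inputs, W, b):
    """Basic RNN: returns the hidden layer at every position of the sequence."""
    h = np.zeros(W.shape[0])
    outputs = []
    for input_t in inputs:
        h = np.tanh(W @ np.concatenate([h, input_t]) + b)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 20, 8, 5    # hypothetical sizes

# Two separate sets of weights: one per direction
W_fwd = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
W_bwd = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_fwd = np.zeros(hidden_size)
b_bwd = np.zeros(hidden_size)

inputs = rng.random((seq_len, input_size))     # stand-in for an encoded sequence

forward_out = rnn_forward(inputs, W_fwd, b_fwd)
# Feed the sequence in reverse, then flip the outputs back into sequence order
backward_out = rnn_forward(inputs[::-1], W_bwd, b_bwd)[::-1]

# One vector per position, containing context from both directions
combined = np.concatenate([forward_out, backward_out], axis=1)
```

At the first position the forward half has seen only one input while the backward half has seen the whole sequence, and the reverse holds at the last position.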

RNNs and LSTMs have [extensive applications](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) in language processing, but here we will see how they are applied to protein sequences.

MAQ RGN

UniRep

Alphafold

## How can we apply Neural Networks to Protein Structures?
### Convolutional Neural Networks (CNN)

Namrata DCGAN

Rahpa Semantics

Facebook