# How to train your DragoNN: 
## Exploring convolutional neural network (CNN) architectures for simulated genomic data

This tutorial will take <1 hour if executed on a GPU. 

## Outline<a name='outline'>
<ol>
    <li><a href=#1>How to use this tutorial</a></li>
    <li><a href=#2>Review of patterns in transcription factor binding sites</a></li>
    <li><a href=#3>Learning to localize homotypic motif density</a></li>
    <li><a href=#4>Simulate training data with simdna</a></li>  
    <li><a href=#4.5>Running dragonn on your own data: starting with FASTA files</a></li>
    <li><a href=#5>Defining CNN architecture</a></li>
    <li><a href=#6>Single layer, single filter model</a></li>
    <li><a href=#7>Single layer, multiple filter model</a></li>
    <li><a href=#9>For further exploration</a></li>
</ol>

## How to use this tutorial<a name='1'>
<a href=#outline>Home</a>

This tutorial runs in a Google Colaboratory notebook, an interactive computational environment that combines live code, visualizations, and explanatory text. The notebook is organized into a series of cells. 

First, we set the runtime to use Python 3 and a GPU. 

![ChangeRuntime](https://github.com/kundajelab/dragonn/blob/cshl/tutorials/tutorial_images/ChangeRuntime.png?raw=true)

![RuntimeType.png](https://github.com/kundajelab/dragonn/blob/cshl/tutorials/tutorial_images/RuntimeType.png?raw=1)

Now that the runtime is set, we can execute the cells in the notebook. You can execute the cells one at a time by clicking inside of them and pressing Shift+Enter. Alternatively, you can run all the cells by clicking the "Run All" button, as demonstrated below. 

![RunAllColab](https://github.com/kundajelab/dragonn/blob/cshl/tutorials/tutorial_images/RunAllCollab.png?raw=1)


You can run the next cell by clicking the play button:

![RunCellArrow](https://github.com/kundajelab/dragonn/blob/cshl/tutorials/tutorial_images/RunCellArrow.png?raw=1)

About half of the cells in this tutorial contain code; the rest contain visualizations and explanatory text. Code, visualizations, and text in cells can be modified, and you are encouraged to modify the code as you advance through the tutorial. You can inspect the implementation of a function used in a cell by following these steps:

![inspecting code](https://github.com/kundajelab/dragonn/blob/cshl/tutorials/tutorial_images/inspecting_code.png?raw=1)


In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# if running locally
# import sys
# sys.path.append('../../')

In [None]:
# run the lines below if you are running this tutorial from Google Colab 
# RESTART NOTEBOOK AFTER RUNNING THIS
!pip install git+https://github.com/kundajelab/dragonn.git@icts

In [None]:
!pip show tensorflow
!pip show dragonn 

In [None]:
# Making sure our results are reproducible
from numpy.random import seed
seed(1234)
from tensorflow.random import set_seed
set_seed(1234)
import tensorflow as tf

We start by loading dragonn's tutorial utilities and reviewing properties of regulatory sequence that transcription factors bind.

In [None]:
# load dragonn tutorial utilities 
from matplotlib import pyplot as plt
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Key properties of regulatory DNA sequences <a name='2'>
<a href=#outline>Home</a>

![sequence properties 1](https://github.com/kundajelab/dragonn/blob/master/paper_supplement/primer_tutorial_images/sequence_properties_1.jpg?raw=1)
![sequence properties 2](https://github.com/kundajelab/dragonn/blob/master/paper_supplement/primer_tutorial_images/sequence_properties_2.jpg?raw=1)

## Learning to localize homotypic motif density <a name='3'>
<a href=#outline>Home</a>

In this tutorial we will learn how to localize a homotypic motif cluster. We will simulate a positive set of sequences with multiple instances of a motif in the center and a negative set of sequences with multiple motif instances positioned anywhere in the sequence:
![homotypic motif density localization](https://github.com/kundajelab/dragonn/blob/master/tutorials/tutorial_images/homotypic_motif_density_localization.jpg?raw=1)
We will then train a binary classification model to classify the simulated sequences. To solve this task, the model will need to learn the motif pattern and whether instances of that pattern are present in the central part of the sequence.

![classification task](https://github.com/kundajelab/dragonn/blob/master/tutorials/tutorial_images/homotypic_motif_density_localization_task.jpg?raw=1)
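To make the task concrete, here is a minimal sketch of how positive and negative sequences of this kind could be generated with the standard library alone. It is not how simdna or DragoNN implement the simulation: the function names are hypothetical, and the motif is a fixed placeholder string rather than instances sampled from the TAL1_known4 PWM.

```python
import random

def random_background(length, gc_fraction, rng):
    """Draw background bases so that G and C together have probability gc_fraction."""
    at, gc = (1 - gc_fraction) / 2, gc_fraction / 2
    return rng.choices("ACGT", weights=[at, gc, gc, at], k=length)

def embed_motifs(seq, motif, n_instances, lo, hi, rng):
    """Overwrite n_instances non-overlapping motif copies that lie fully inside [lo, hi)."""
    placed = []
    while len(placed) < n_instances:
        start = rng.randint(lo, hi - len(motif))  # randint is inclusive at both ends
        if all(abs(start - other) >= len(motif) for other in placed):
            seq[start:start + len(motif)] = motif
            placed.append(start)
    return placed

def simulate(motif, seq_length, center_size, min_count, max_count,
             gc_fraction, positive, rng):
    """Positive: motifs restricted to the central window. Negative: motifs anywhere."""
    seq = random_background(seq_length, gc_fraction, rng)
    n_instances = rng.randint(min_count, max_count)
    if positive:
        lo = (seq_length - center_size) // 2
        hi = lo + center_size
    else:
        lo, hi = 0, seq_length
    starts = embed_motifs(seq, motif, n_instances, lo, hi, rng)
    return "".join(seq), starts

rng = random.Random(1234)
PLACEHOLDER_MOTIF = "CAGATG"  # hypothetical stand-in, not the TAL1_known4 PWM
pos_seq, pos_starts = simulate(PLACEHOLDER_MOTIF, 1000, 150, 2, 4, 0.4, True, rng)
neg_seq, neg_starts = simulate(PLACEHOLDER_MOTIF, 1000, 150, 2, 4, 0.4, False, rng)
```

In the positive sequence every motif start falls inside the central 150 bp window, while in the negative sequence the starts can fall anywhere, so the two classes differ only in where the motifs sit.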

We start by getting the simulation data.

## Getting simulation data <a name='4'>
<a href=#outline>Home</a>


DragoNN provides a set of simulation functions. We will use the **simulate_motif_density_localization** function to simulate homotypic motif density localization. First, we obtain documentation for the simulation parameters.

In [None]:
from dragonn.simulations import * 
from dragonn.vis import * 

In [None]:
print_simulation_info("simulate_motif_density_localization")

Next, we define parameters for a TAL1 motif density localization simulation with 1000 bp sequences, a GC fraction of 0.4, and 2-4 instances of the motif in the central 150 bp of the positive sequences. We simulate a total of 3000 positive and 3000 negative sequences.

In [None]:
motif_density_localization_simulation_parameters = {
    "motif_name": "TAL1_known4",
    "seq_length": 1000,
    "center_size": 150,
    "min_motif_counts": 2,
    "max_motif_counts": 4, 
    "num_pos": 3000,
    "num_neg": 3000,
    "GC_fraction": 0.4}

We get the simulation data by calling the **get_simulation_data** function with the simulation name and the simulation parameters as inputs. 1000 sequences are held out for a test set, 1000 sequences for a validation set, and the remaining 4000 sequences are in the training set.

In [None]:
simulation_data = get_simulation_data("simulate_motif_density_localization",
                                      motif_density_localization_simulation_parameters,
                                      validation_set_size=1000, test_set_size=1000)

simulation_data provides training, validation, and test sets of input sequences X and sequence labels y. The inputs X are matrices with a one-hot-encoding of the sequences:

<img src="https://github.com/kundajelab/dragonn/blob/cshl/tutorials/tutorial_images/one_hot_encoding.png?raw=1" width="500">
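As a rough illustration of the encoding pictured above, the sketch below converts a DNA string into a one-hot array and back using numpy. The helper names are hypothetical (they are not DragoNN functions), and the channel order A, C, G, T is assumed to match the one used in this tutorial.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order: A, C, G, T

def one_hot_encode(seq):
    """Return an array of shape (1, len(seq), 4); unknown bases such as N stay all-zero."""
    encoded = np.zeros((1, len(seq), 4), dtype=np.float32)
    for position, base in enumerate(seq.upper()):
        channel = BASES.find(base)
        if channel >= 0:
            encoded[0, position, channel] = 1.0
    return encoded

def one_hot_decode(encoded):
    """Invert one_hot_encode; all-zero rows are written as N."""
    return "".join(
        BASES[int(np.argmax(row))] if row.any() else "N"
        for row in encoded[0]
    )

example = one_hot_encode("ACGTN")
round_trip = one_hot_decode(example)
```

Stacking N such arrays along a new first axis would give the (N, 1, length, 4) layout described for the training data.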



simulation_data is an object. It contains an attribute called X_train, which is a 4-dimensional numpy array. We can read the "shape" attribute of X_train to get its dimensions. 

In [None]:
simulation_data.X_train.shape

Here are the first 10bp of a sequence in our training data:

In [None]:
#The first dimension indicates the index of the training samples. 
# The second dimension is 1, and is only necessary because we are 
# performing 2D convolutions. We could omit this "dummy" dimension if
# we used 1D convolutions. 
# The third dimension indicates the base index. 
# The fourth dimension indicates the base pair channels: A,C,G,T. 

simulation_data.X_train[0, :, :10, :]

We can convert this one-hot-encoded matrix back into a DNA string:

In [None]:
from dragonn.utils import *
get_sequence_strings(simulation_data.X_train)[0][0:10]

Let's examine the shape of training, validation, and test matrices: 

In [None]:
print(simulation_data.X_train.shape)
print(simulation_data.y_train.shape)

In [None]:
print(simulation_data.X_valid.shape)
print(simulation_data.y_valid.shape)

In [None]:
print(simulation_data.X_test.shape)
print(simulation_data.y_test.shape)

## Running dragonn on your own data: starting with FASTA files <a name='4.5'>
<a href=#outline>Home</a>

If you are running DragoNN on your own data, you can provide the data in FASTA sequence format. We recommend generating 6 FASTA files for model training: 
* Training positives 
* Training negatives 
* Validation positives 
* Validation negatives 
* Test positives 
* Test negatives 

To illustrate how this could be done, we export the one-hot-encoded matrices from **simulation_data** to FASTA files, and then show how these FASTA files can be loaded back into one-hot-encoded matrices.

In [None]:
import numpy as np  # imported explicitly so this cell does not rely on earlier wildcard imports
from dragonn.utils import fasta_from_onehot

#get the indices of positive and negative sequences in the training, validation, and test sets 
train_pos=np.nonzero(simulation_data.y_train==True)
train_neg=np.nonzero(simulation_data.y_train==False)
valid_pos=np.nonzero(simulation_data.y_valid==True)
valid_neg=np.nonzero(simulation_data.y_valid==False)
test_pos=np.nonzero(simulation_data.y_test==True)
test_neg=np.nonzero(simulation_data.y_test==False)

#Generate gzipped  fasta files -- it is always a good idea to gzip your fasta files. This is less 
# important for our tiny example files, but becomes more relevant as the size of the files increases. 
# The fasta_from_onehot function gzips output fasta files. 
fasta_from_onehot(np.expand_dims(simulation_data.X_train[train_pos],axis=1),"X.train.pos.fasta.gz")
fasta_from_onehot(np.expand_dims(simulation_data.X_valid[valid_pos],axis=1),"X.valid.pos.fasta.gz")
fasta_from_onehot(np.expand_dims(simulation_data.X_test[test_pos],axis=1),"X.test.pos.fasta.gz")

fasta_from_onehot(np.expand_dims(simulation_data.X_train[train_neg],axis=1),"X.train.neg.fasta.gz")
fasta_from_onehot(np.expand_dims(simulation_data.X_valid[valid_neg],axis=1),"X.valid.neg.fasta.gz")
fasta_from_onehot(np.expand_dims(simulation_data.X_test[test_neg],axis=1),"X.test.neg.fasta.gz")

Let's examine "X.train.pos.fasta.gz" to verify that it's in the standard gzipped FASTA format. 

In [None]:
! zcat X.train.pos.fasta.gz | head

We can then load fasta format data to generate training, validation, and test splits for our models:

In [None]:
from dragonn.utils import encode_fasta_sequences
X_train_pos=encode_fasta_sequences("X.train.pos.fasta.gz")
X_train_neg=encode_fasta_sequences("X.train.neg.fasta.gz")
X_valid_pos=encode_fasta_sequences("X.valid.pos.fasta.gz")
X_valid_neg=encode_fasta_sequences("X.valid.neg.fasta.gz")
X_test_pos=encode_fasta_sequences("X.test.pos.fasta.gz")
X_test_neg=encode_fasta_sequences("X.test.neg.fasta.gz")

X_train=np.concatenate((X_train_pos,X_train_neg),axis=0)
X_valid=np.concatenate((X_valid_pos,X_valid_neg),axis=0)
X_test=np.concatenate((X_test_pos,X_test_neg),axis=0)


In [None]:
y_train=np.concatenate((np.ones(X_train_pos.shape[0]),
                        np.zeros(X_train_neg.shape[0])))
y_valid=np.concatenate((np.ones(X_valid_pos.shape[0]),
                        np.zeros(X_valid_neg.shape[0])))
y_test=np.concatenate((np.ones(X_test_pos.shape[0]),
                        np.zeros(X_test_neg.shape[0])))


Now, having read in the FASTA files, converted them to one-hot-encoded matrices, and defined label vectors, we are ready to train our model. 
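For readers who want to see the file round trip without DragoNN, the sketch below writes and reads gzipped FASTA using only the Python standard library and then builds a label vector in the same positives-then-negatives order. The function names, record names, and toy sequences are all hypothetical.

```python
import gzip
import os
import tempfile

def write_fasta_gz(path, records):
    """Write (name, sequence) pairs as gzipped FASTA, one sequence per line."""
    with gzip.open(path, "wt") as handle:
        for name, seq in records:
            handle.write(f">{name}\n{seq}\n")

def read_fasta_gz(path):
    """Read gzipped FASTA into a list of (name, sequence); wrapped lines are joined."""
    records, name, chunks = [], None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:], []
            else:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

# Toy records standing in for the positive and negative training sets.
positives = [("pos_0", "ACGTACGT"), ("pos_1", "TTGACCAA")]
negatives = [("neg_0", "GGGGCCCC")]

with tempfile.TemporaryDirectory() as tmp:
    pos_path = os.path.join(tmp, "toy.pos.fasta.gz")
    neg_path = os.path.join(tmp, "toy.neg.fasta.gz")
    write_fasta_gz(pos_path, positives)
    write_fasta_gz(neg_path, negatives)
    loaded = read_fasta_gz(pos_path) + read_fasta_gz(neg_path)

# Labels follow the concatenation order: positives first, then negatives.
labels = [1] * len(positives) + [0] * len(negatives)
```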

# Defining the convolutional neural network model architecture  <a name='5'>
<a href=#outline>Home</a>

A locally connected linear unit in a CNN model can represent a position weight matrix (PWM) (part a). A sequence PWM score is obtained by scoring every window of the sequence against the PWM, thresholding the window scores, and taking the max (part b). The same score can be computed by a CNN model with tiled, locally connected linear units, which amounts to a convolutional layer with a single convolutional filter representing the PWM, followed by ReLU thresholding and max pooling (part c).
    
![dragonn vs pssm](https://github.com/kundajelab/dragonn/blob/cshl/tutorials/tutorial_images/dragonn_and_pssm.jpg?raw=1)
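The equivalence between PWM scanning and a single convolutional filter can be sketched in a few lines of numpy. This is an illustration under simple assumptions, not DragoNN or Keras code: the filter is a hand-built toy that rewards the 3-mer "GAT", there is no padding, and the pooling is a global max over all positions.

```python
import numpy as np

def scan_with_filter(one_hot_seq, weights, bias=0.0):
    """Score every window of a (length, 4) one-hot sequence with a (width, 4) filter,
    apply ReLU thresholding, and take the max over positions."""
    width = weights.shape[0]
    n_positions = one_hot_seq.shape[0] - width + 1
    scores = np.array([
        np.sum(one_hot_seq[i:i + width] * weights) + bias
        for i in range(n_positions)
    ])
    rectified = np.maximum(scores, 0.0)  # ReLU
    return scores, float(rectified.max())  # global max pooling

# Toy filter: +1 for the matching base at each position, -1 for any other base.
motif = "GAT"
weights = -np.ones((len(motif), 4))
for position, base in enumerate(motif):
    weights[position, "ACGT".index(base)] = 1.0

seq = "ACGATTC"
one_hot = np.eye(4)[["ACGT".index(base) for base in seq]]  # shape (7, 4)

window_scores, best_score = scan_with_filter(one_hot, weights)
best_position = int(np.argmax(window_scores))
```

The highest-scoring window starts at index 2, where "GAT" occurs in the toy sequence; every other window scores below zero and is clipped by the ReLU.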


By utilizing multiple convolutional layers with multiple convolutional filters, CNNs can represent a wide range of sequence features in a compositional fashion:
    
![dragonn model figure](https://github.com/kundajelab/dragonn/blob/cshl/tutorials/tutorial_images/dragonn_model_figure.png?raw=1)


We will use the deep learning library [keras](http://keras.io/), a high-level API for the [TensorFlow](https://github.com/tensorflow/tensorflow) framework, to define and train the CNN models. 

In [None]:
# To prepare for model training, we import the necessary functions and submodules from keras
from keras.models import Sequential
from keras.layers import Dropout, Reshape, Dense, Activation, Flatten,Conv2D, MaxPooling2D, BatchNormalization
from keras.callbacks import EarlyStopping

# Single layer, single filter model <a name='6'>
<a href=#outline>Home</a>


We define a simple DragoNN model with one convolutional layer with one convolutional filter, followed by maxpooling of width 35. 

The model parameters are: 

* Input sequence length 1000 
* 1 filter: a neuron that acts as a local pattern detector on the input profile. 
* Convolutional filter width = 10: this parameter defines the width of the filter weights; the model scans the entire input profile for the pattern encoded by the weights of the filter. 
* Max pool of width 35: computes the maximum value per channel in non-overlapping windows of size 35. We add the pooling layer because DNA sequences are typically sparse in terms of the number of positions in the sequence that harbor TF motifs. The pooling layer reduces the size of the output profile of the convolutional layer by summarizing each window with a single statistic. 
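The effect of these parameters on the size of each layer's output can be worked out with a little arithmetic. The sketch below assumes stride-1 convolution with Keras-style "same" or "valid" padding and unpadded max pooling whose stride defaults to the pool width; the helper names are hypothetical.

```python
def conv_output_length(input_length, kernel_width, padding):
    """Output length of a stride-1 convolution under 'same' or 'valid' padding."""
    if padding == "same":
        return input_length
    if padding == "valid":
        return input_length - kernel_width + 1
    raise ValueError(f"unknown padding: {padding}")

def pool_output_length(input_length, pool_width, stride=None):
    """Output length of unpadded max pooling; stride defaults to the pool width."""
    stride = pool_width if stride is None else stride
    return (input_length - pool_width) // stride + 1

conv_len = conv_output_length(1000, 10, "same")   # padding="same" keeps 1000 positions
pooled_len = pool_output_length(conv_len, 35)     # 28 pooled positions
```

Under these assumptions, a single filter with "same" padding yields 28 pooled values per sequence, which is what the final dense layer would receive after flattening.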

In [None]:
#Define the model architecture in keras
one_filter_keras_model=Sequential() 
one_filter_keras_model.add(Conv2D(filters=1,kernel_size=(1,10),padding="same",input_shape=simulation_data.X_train.shape[1::]))
one_filter_keras_model.add(BatchNormalization(axis=-1))
one_filter_keras_model.add(Activation('relu'))
one_filter_keras_model.add(MaxPooling2D(pool_size=(1,35)))
one_filter_keras_model.add(Flatten())
one_filter_keras_model.add(Dense(1))
one_filter_keras_model.add(Activation("sigmoid"))

In [None]:
one_filter_keras_model.summary()

In [None]:
##compile the model, specifying the Adam optimizer, and binary cross-entropy loss. 
one_filter_keras_model.compile(optimizer='adam',
                               loss='binary_crossentropy')

We train the model for up to 150 epochs with an early stopping criterion: if the loss on the validation set does not improve for three consecutive epochs, training is halted. In each epoch, one_filter_keras_model performs a complete pass over the training data and updates its parameters to minimize the loss, which quantifies the error in the model predictions. After each epoch, the performance metrics for one_filter_keras_model on the validation data are stored. 

The performance metrics include balanced accuracy, area under the receiver operating characteristic curve ([auROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)), area under the precision-recall curve ([auPRC](https://en.wikipedia.org/wiki/Precision_and_recall)), and recall at multiple false discovery rates (Recall at [FDR](https://en.wikipedia.org/wiki/False_discovery_rate)).
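To give a feel for two of these metrics, the sketch below computes balanced accuracy and auROC in plain Python on a handful of made-up labels and scores. It is not DragoNN's implementation: the 0.5 decision threshold is an assumption, and the auROC uses the pairwise-ranking definition, which is fine for toy inputs but slow for large ones.

```python
def balanced_accuracy(labels, scores, threshold=0.5):
    """Mean of sensitivity and specificity at a fixed score threshold (assumed 0.5)."""
    true_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    true_neg = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (true_pos / n_pos + true_neg / n_neg)

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

toy_balanced_accuracy = balanced_accuracy(toy_labels, toy_scores)
toy_auroc = auroc(toy_labels, toy_scores)
```

An auROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of positives from negatives, which is why a value that stays near 0.5 during training signals that a model is not learning.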

In [None]:
from dragonn.callbacks import * 
#We define a custom callback to print training and validation metrics while training. 
metrics_callback=MetricsCallback(train_data=(simulation_data.X_train,simulation_data.y_train),
                                 validation_data=(simulation_data.X_valid,simulation_data.y_valid))


We now proceed to train the model. We do this with the keras "fit" function. The "fit" function has a few key parameters: 

* **batch_size** -- the number of samples propagated through the network at once; during training, the weights are updated after each batch. 
* **epochs** -- the maximum number of complete passes over the training set. Because training proceeds in mini-batches, each epoch consists of many weight updates, one per batch, rather than a single update.
* **callbacks** -- a list of Keras callbacks. A callback is a set of functions applied at given stages of the training procedure. You can use callbacks to get a view on internal states and statistics of the model during training.
* **EarlyStopping** -- a Keras callback that is invoked at the end of each epoch. If the monitored quantity (the validation loss by default) has not improved for n consecutive epochs, where n is referred to as the patience, training is interrupted. 
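The interplay of batch size, epochs, and patience can be sketched in plain Python. This mirrors the behavior described above rather than reproducing the Keras implementation; the function names are hypothetical and the validation losses are made-up numbers.

```python
import math

def updates_per_epoch(n_train, batch_size):
    """Number of mini-batch weight updates in one pass over the training set."""
    return math.ceil(n_train / batch_size)

def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a patience rule.

    Training stops at the first epoch where the validation loss has failed to
    improve on the best value seen so far for `patience` consecutive epochs.
    """
    best_loss, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

n_updates = updates_per_epoch(4000, 128)

# Made-up validation losses: the last value is never reached because training stops first.
toy_val_losses = [0.69, 0.60, 0.55, 0.56, 0.57, 0.58, 0.40]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=3)
```

With 4000 training sequences and a batch size of 128, each epoch amounts to 32 weight updates. In the toy loss trace, training stops at epoch index 5 and the best weights come from epoch index 2, which is the epoch that an option such as restore_best_weights=True would roll back to.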


## Visualize the initial parameters 

Next, let's visualize the randomly initialized weights in this model.

### Dense layer

In [None]:
plot_model_weights(one_filter_keras_model)

### Convolutional layer 

In [None]:
W_conv, b_conv = one_filter_keras_model.layers[0].get_weights()

In [None]:
W_conv.shape

In [None]:
b_conv.shape

In [None]:
plot_filters(one_filter_keras_model, simulation_data)

## Model Training

In [None]:
## use the keras fit function to train the model for 150 epochs with early stopping after 3 epochs 
history_one_filter=one_filter_keras_model.fit(x=simulation_data.X_train,
                                  y=simulation_data.y_train,
                                  batch_size=128,
                                  epochs=150,
                                  verbose=1,
                                  callbacks=[EarlyStopping(patience=3,restore_best_weights=True),
                                            metrics_callback],
                                  validation_data=(simulation_data.X_valid,
                                                   simulation_data.y_valid))


### Evaluate the model on the held-out test set 

In [None]:
## Use the keras predict function to get model predictions on held-out test set. 
test_predictions=one_filter_keras_model.predict(simulation_data.X_test)
## Generate a ClassificationResult object to print performance metrics on held-out test set 
print(ClassificationResult(simulation_data.y_test,test_predictions))

### Visualize the model's performance

We can see that the validation loss is not decreasing and the auROC is not increasing, which indicates that this model is not learning. A simple plot of the learning curve, showing the loss on the training and validation data over the course of training, demonstrates this visually:

In [None]:
#import functions for visualization of data 
from dragonn.vis import *

In [None]:
%matplotlib inline

In [None]:
plot_learning_curve(history_one_filter)

## Visualize the learned parameters 

Next, let's visualize the filter learned in this model.

### Dense layer

In [None]:
plot_model_weights(one_filter_keras_model)

### Convolutional layer 

In [None]:
W_conv, b_conv = one_filter_keras_model.layers[0].get_weights()

In [None]:
W_conv.shape

In [None]:
b_conv.shape

In [None]:
plot_filters(one_filter_keras_model, simulation_data)

# Single layer, multi-filter model <a name='7'>
<a href=#outline>Home</a>


We define a simple DragoNN model with one convolutional layer with 15 convolutional filters, followed by maxpooling of width 35. 

The model parameters are: 

* Input sequence length 1000 
* 15 filters: these are neurons that act as local pattern detectors on the input profile. 
* Convolutional filter width = 10: this parameter defines the width of the filter weights; the model scans the entire input profile for the pattern encoded by the weights of each filter. 
* Max pool of width 35: computes the maximum value per channel in non-overlapping windows of size 35. We add the pooling layer because DNA sequences are typically sparse in terms of the number of positions in the sequence that harbor TF motifs. The pooling layer reduces the size of the output profile of the convolutional layer by summarizing each window with a single statistic. 

![simArch1Layer](https://github.com/kundajelab/dragonn/blob/master/tutorials/tutorial_images/SimArch1Layer.png?raw=1)
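The layer sizes listed above determine how many parameters the model has. The sketch below tallies them by hand, assuming standard Keras behavior: a bias per convolutional filter, default "valid" padding for the convolution, and four Batch Normalization parameters per channel, of which the moving mean and moving variance are non-trainable. The helper names are hypothetical.

```python
def conv2d_params(kernel_height, kernel_width, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_height * kernel_width * in_channels * filters + filters

def batchnorm_params(channels):
    """Return (trainable, non_trainable): gamma and beta vs. moving mean and variance."""
    return 2 * channels, 2 * channels

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

filters = 15
conv = conv2d_params(1, 10, 4, filters)                # 4 input channels: A, C, G, T
bn_trainable, bn_non_trainable = batchnorm_params(filters)

conv_positions = 1000 - 10 + 1                         # 'valid' padding, stride 1
pooled_positions = (conv_positions - 35) // 35 + 1     # pool width 35, stride 35
dense = dense_params(pooled_positions * filters, 1)

trainable_total = conv + bn_trainable + dense
non_trainable_total = bn_non_trainable
```

If the layers behave as assumed, these totals (1066 trainable and 30 non-trainable parameters) can be compared against what the model summary reports.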


In [None]:
#Define the model architecture in keras
multi_filter_keras_model=Sequential() 
multi_filter_keras_model.add(Conv2D(filters=15,kernel_size=(1,10),input_shape=simulation_data.X_train.shape[1::]))
multi_filter_keras_model.add(BatchNormalization(axis=-1))
multi_filter_keras_model.add(Activation('relu'))
multi_filter_keras_model.add(MaxPooling2D(pool_size=(1,35), strides=35))
multi_filter_keras_model.add(Flatten())
multi_filter_keras_model.add(Dense(1))
multi_filter_keras_model.add(Activation("sigmoid"))

##compile the model, specifying the Adam optimizer, and binary cross-entropy loss. 
multi_filter_keras_model.compile(optimizer='adam',
                               loss='binary_crossentropy')

In [None]:
multi_filter_keras_model.summary()

"Non-trainable params" refers to the Batch Normalization moving mean and moving variance. These are tracked as running statistics during training rather than updated by gradient descent. 

We train the model for up to 150 epochs with an early stopping criterion: if the loss on the validation set does not improve for 3 consecutive epochs, training is halted. In each epoch, the model performs a complete pass over the training data and updates its parameters to minimize the loss, which quantifies the error in the model predictions. After each epoch, the performance metrics for the model on the validation data are stored. 

The performance metrics include balanced accuracy, area under the receiver operating characteristic curve ([auROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)), area under the precision-recall curve ([auPRC](https://en.wikipedia.org/wiki/Precision_and_recall)), and recall at multiple false discovery rates (Recall at [FDR](https://en.wikipedia.org/wiki/False_discovery_rate)).

In [None]:
from dragonn.callbacks import * 
#We define a custom callback to print training and validation metrics while training. 
metrics_callback=MetricsCallback(train_data=(simulation_data.X_train,simulation_data.y_train),
                                 validation_data=(simulation_data.X_valid,simulation_data.y_valid))


In [None]:
## use the keras fit function to train the model for 150 epochs with early stopping after 3 epochs 
history_multi_filter=multi_filter_keras_model.fit(x=simulation_data.X_train,
                                  y=simulation_data.y_train,
                                  batch_size=128,
                                  epochs=150,
                                  verbose=1,
                                  callbacks=[EarlyStopping(patience=3,restore_best_weights=True),
                                            metrics_callback],
                                  validation_data=(simulation_data.X_valid,
                                                   simulation_data.y_valid))


### Evaluate the model on the held-out test set 

In [None]:
## Use the keras predict function to get model predictions on held-out test set. 
test_predictions=multi_filter_keras_model.predict(simulation_data.X_test)
## Generate a ClassificationResult object to print performance metrics on held-out test set 
print(ClassificationResult(simulation_data.y_test,test_predictions))

### Visualize the model's performance

In [None]:
#import functions for visualization of data 
from dragonn.vis import *

In [None]:
%matplotlib inline

In [None]:
plot_learning_curve(history_multi_filter)

We can see that the training and validation losses both decrease, but the validation loss is somewhat higher than the training loss, which suggests some over-fitting to the training data. 

## Visualize the learned parameters 

Next, let's visualize the filters learned in this model.

### Dense layer

In [None]:
plot_model_weights(multi_filter_keras_model)

### Convolutional layer 

In [None]:
W_conv, b_conv = multi_filter_keras_model.layers[0].get_weights()

In [None]:
W_conv.shape

In [None]:
b_conv.shape

In [None]:
plot_filters(multi_filter_keras_model, simulation_data)

## For further exploration<a name='9'>
<a href=#outline>Home</a>

In this tutorial we explored modeling of homotypic motif density. Other properties of regulatory DNA sequence include:
![sequence properties 3](https://github.com/kundajelab/dragonn/blob/master/tutorials/tutorial_images/sequence_properties_3.jpg?raw=1)
![sequence properties 4](https://github.com/kundajelab/dragonn/blob/master/tutorials/tutorial_images/sequence_properties_4.jpg?raw=1)

DragoNN provides simulations that formulate learning these patterns into classification problems:
![sequence](https://github.com/kundajelab/dragonn/blob/master/tutorials/tutorial_images/sequence_simulations.png?raw=1)

You can view the available simulation functions by running print_available_simulations:

In [None]:
print_available_simulations()