# How to train your DragoNN tutorial 3: 
## Interpreting features induced by DNNs across multiple types of motif grammars

This tutorial is a supplement to the DragoNN manuscript and follows figure 7 in the manuscript. 

This tutorial takes about 1 hour to run on a GPU.

Please complete "Primer Tutorial 1- Exploring model architectures for a homotypic motif density simulation" prior to completing this tutorial. 

The architectures used in this tutorial were determined as optimal by hyperparameter grid search in "Primer Tutorial 3 - CNN Hyperparameter Tuning via Grid Search".


## Outline<a name='outline'>
<ol>
    <li><a href=#1>How to use this tutorial</a></li>
    <li><a href=#2>Defining helper functions for model training and interpretation</a></li>
    <li><a href=#3>Simulating training data with simdna: Review of Tutorial 1</a></li>
    <li><a href=#4>Single Motif</a></li>
    <li><a href=#5>Homotypic motif density detection</a></li>
    <li><a href=#6>Homotypic motif density localization</a></li>
    <li><a href=#7>Multiple motifs (multi-task)</a></li>  
    <li><a href=#8>Heterotypic motifs spatial grammar</a></li>
    <li><a href=#9>Conclusions</a></li>
</ol>
GitHub issues on the dragonn repository with feedback, questions, and discussion are always welcome.

 

## How to use this tutorial<a name='1'>
<a href=#outline>Home</a>

This tutorial uses a Jupyter/IPython Notebook, an interactive computational environment that combines live code, visualizations, and explanatory text. The notebook is organized into a series of cells. You can run the next cell by clicking the play button:
![play button](./primer_tutorial_images/play_button.png)
You can also run all cells in a series by clicking "run all" in the Cell drop-down menu:
![play all button](./primer_tutorial_images/play_all_button.png)
About half of the cells in this tutorial contain code; the others contain visualizations and explanatory text. Code, visualizations, and text in cells can all be modified, and you are encouraged to modify the code as you advance through the tutorial. You can inspect the implementation of a function used in a cell by following these steps:
![inspecting code](./primer_tutorial_images/inspecting_code.png)


In [1]:
#uncomment the lines below if you are running this tutorial from Google Colab 
#!pip install https://github.com/kundajelab/simdna/archive/0.3.zip
#!pip install https://github.com/kundajelab/dragonn/archive/keras_2.2_tensorflow_1.6_purekeras.zip


In [6]:
#To prepare for model training, we import the necessary functions and submodules from keras
from keras.models import Sequential
from keras.layers.core import Dropout, Reshape, Dense, Activation, Flatten
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.optimizers import Adadelta, SGD, RMSprop
import keras.losses
from keras.constraints import maxnorm
from keras.layers.normalization import BatchNormalization
from keras.regularizers import l1, l2
from keras.callbacks import EarlyStopping, History
from keras import backend as K 
K.set_image_data_format('channels_last')

Using TensorFlow backend.


We start by loading dragonn's tutorial utilities.

In [7]:
#load dragonn tutorial utilities 
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


from dragonn.tutorial_utils import *


## Defining helper functions for model training and interpretation  <a name='2'>
<a href=#outline>Home</a>


For each type of simulation, we will perform a consistent set of tasks: 
* Define the optimal model architecture, as determined in Tutorial 2 
* Train the model on simulation data.
* Compute the model's performance on a held-out test set.
* Visualize the model's learning curve on training and validation data. 
* Visualize motif scores for a positive and negative example. 
* Perform in silico mutagenesis for a positive and negative example.
* Compute DeepLIFT scores for a positive and negative example.

To avoid writing the same code for each scenario, we define a helper function, `analyze`, to perform the tasks above.

In [None]:
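def analyze(model,data):
    #placeholder: the training, evaluation, and interpretation steps listed above are not implemented yet
    pass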
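
Of the tasks listed above, in silico mutagenesis is the simplest to illustrate in isolation. The block below is a minimal sketch, not DragoNN's implementation: it works on plain strings rather than one-hot arrays, and `toy_score` with its made-up motif `GATTACA` is a hypothetical stand-in for a trained model's prediction.

```python
def in_silico_mutagenesis(seq, score_fn, alphabet="ACGT"):
    """Score change for every single-base substitution in seq.

    Returns a dict mapping each base to a list with one entry per position.
    Entries for the base already present at a position stay 0.0.
    """
    reference_score = score_fn(seq)
    deltas = {base: [0.0] * len(seq) for base in alphabet}
    for pos, original in enumerate(seq):
        for base in alphabet:
            if base == original:
                continue
            mutated = seq[:pos] + base + seq[pos + 1:]
            deltas[base][pos] = score_fn(mutated) - reference_score
    return deltas


def toy_score(seq, motif="GATTACA"):
    """Hypothetical stand-in for a model prediction: 1.0 if the motif is present."""
    return 1.0 if motif in seq else 0.0


# The motif occupies positions 4-10 of this sequence.
deltas = in_silico_mutagenesis("TTTTGATTACATTTT", toy_score)
print(deltas["C"])
```

Substitutions inside the motif lower the score, while substitutions in the flanking background leave it unchanged. That is the pattern one hopes to see when a trained model has learned the embedded motif.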


## Getting simulation data <a name='3'>
<a href=#outline>Home</a>


DragoNN provides a set of simulation functions. Let's use the **print_available_simulations** function to examine the list of simulations supported by DragoNN:

In [13]:
print_available_simulations()

simulate_differential_accessibility
simulate_heterodimer_grammar
simulate_motif_counting
simulate_motif_density_localization
simulate_multi_motif_embedding
simulate_single_motif_detection


## Single Motif <a name='4'>
<a href=#outline>Home</a>


We begin with single motif detection of the TAL1_known4 motif: 

![TAL1_known4 motif](./primer_tutorial_images/TAL1_known4.png)
Let's find out what parameters are needed for the simulation: 

In [9]:
print_simulation_info("simulate_single_motif_detection")


    Simulates two classes of seqeuences:
        - Positive class sequence with a motif
          embedded anywhere in the sequence
        - Negative class sequence without the motif

    Parameters
    ----------
    motif_name : str
        encode motif name
    seq_length : int
        length of sequence
    num_pos : int
        number of positive class sequences
    num_neg : int
        number of negative class sequences
    GC_fraction : float
        GC fraction in background sequence

    Returns
    -------
    sequence_arr : 1darray
        Array with sequence strings.
    y : 1darray
        Array with positive/negative class labels.
    embedding_arr: 1darray
        Array of embedding objects.
    


In this binary simulation task, we simulate a negative set of 3,000 random 1,500 bp sequences and a positive set of 3,000 random 1,500 bp sequences, each with one instance of the TAL1 motif embedded at a random position. Of these 6,000 sequences, 1,000 are held out for validation and 1,000 for testing.

In [10]:
single_motif_simulation_parameters = {
    "motif_name": "TAL1_known4",
    "seq_length": 1500, 
    "num_pos": 3000,
    "num_neg": 3000,
    "GC_fraction": 0.4}
tal1_data = get_simulation_data("simulate_single_motif_detection",
                                      single_motif_simulation_parameters,
                                      validation_set_size=1000, test_set_size=1000)
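
To make the simulation concrete, here is a simplified, pure-Python sketch of the same idea. It is not how simdna generates data: the function names are hypothetical, the motif is a fixed made-up string (`GATTACA`) rather than a sample from the TAL1_known4 position weight matrix, and negatives are not checked for chance motif matches.

```python
import random


def random_background(length, gc_fraction, rng):
    """Sample a background sequence with the requested expected GC fraction."""
    at, gc = (1.0 - gc_fraction) / 2.0, gc_fraction / 2.0
    # Weights follow the order of the letters in "ACGT".
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def simulate_single_motif(motif, seq_length, num_pos, num_neg, gc_fraction, seed=0):
    """Return (sequences, labels); each positive carries one motif at a random offset."""
    rng = random.Random(seed)
    sequences, labels = [], []
    for _ in range(num_pos):
        background = random_background(seq_length, gc_fraction, rng)
        start = rng.randrange(seq_length - len(motif) + 1)
        sequences.append(background[:start] + motif + background[start + len(motif):])
        labels.append(1)
    for _ in range(num_neg):
        # Negatives are pure background; a chance motif match is not filtered out.
        sequences.append(random_background(seq_length, gc_fraction, rng))
        labels.append(0)
    return sequences, labels


seqs, labels = simulate_single_motif("GATTACA", seq_length=100, num_pos=5,
                                     num_neg=5, gc_fraction=0.4)
print(len(seqs), labels)
```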

We define the convolutional neural network model architecture: 

In [None]:
#Define the optimal model architecture in keras (Refer to Primer Tutorial 2)
tal1_model=Sequential() 
tal1_model.add(Conv2D(filters=5,kernel_size=(1,10),input_shape=tal1_data.X_train.shape[1::]))
tal1_model.add(Activation('relu'))
tal1_model.add(MaxPooling2D(pool_size=(1,10)))
tal1_model.add(Flatten())
tal1_model.add(Dense(1))
tal1_model.add(Activation("sigmoid"))

##compile the model, specifying the Adam optimizer, and binary cross-entropy loss. 
tal1_model.compile(optimizer='adam',
                               loss='binary_crossentropy')
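
It can help to trace how the sequence length changes through this architecture. The sketch below does the arithmetic by hand, using the Keras defaults: stride-1 convolution with `'valid'` padding, and non-overlapping max pooling. It assumes the input has 4 channels (one-hot encoded A, C, G, T) and the 1,500 bp sequence length set in the simulation parameters above.

```python
def conv_output_length(input_length, kernel_width):
    """Length after a stride-1 convolution with 'valid' padding."""
    return input_length - kernel_width + 1


def pool_output_length(input_length, pool_width):
    """Length after non-overlapping max pooling with 'valid' padding."""
    return input_length // pool_width


seq_length, num_filters, kernel_width, pool_width = 1500, 5, 10, 10
num_channels = 4  # assumption: one-hot encoded A, C, G, T

conv_len = conv_output_length(seq_length, kernel_width)   # 1491
pooled_len = pool_output_length(conv_len, pool_width)     # 149
flat_size = pooled_len * num_filters                      # 745

# Trainable parameters: filter weights plus one bias per filter or output unit.
conv_params = kernel_width * num_channels * num_filters + num_filters  # 205
dense_params = flat_size * 1 + 1                                       # 746
print(conv_len, pooled_len, flat_size, conv_params + dense_params)
```

Under these assumptions the model has fewer than a thousand trainable parameters, most of them in the final dense layer.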

In [None]:
analyze(tal1_model,tal1_data)
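
One of the tasks assigned to the helper is measuring performance on a held-out test set. The tutorial does not specify which metrics it reports, so the following is only an illustration of one common choice, the area under the ROC curve. The function name `auroc` is hypothetical. The sketch computes auROC directly as the fraction of positive/negative pairs ranked correctly, counting ties as half.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("auROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities, made up for illustration.
example_auroc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(example_auroc)  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch but slow for large test sets; a rank-based implementation from an established library is the better choice in practice.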

## Homotypic motif density detection <a name='5'>
<a href=#outline>Home</a>

In [None]:
#Define simulation parameters 
density_detection_parameters={
    "motif_name": "TAL1_known4",
    "seq_length": 500,
    "neg_counts":[0,2],
    "pos_counts":[3,5],
    "num_pos": 10000,
    "num_neg": 10000,
    "GC_fraction":0.4
}

#Get simulation data
density_detection_data=get_simulation_data("simulate_motif_counting",
                               density_detection_parameters,
                               validation_set_size=3200,test_set_size=4000)
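
The sketch below illustrates one reading of these parameters: that `neg_counts` and `pos_counts` are inclusive `[min, max]` ranges for the number of motif instances per sequence. This is an assumption about the simulation, not taken from its documentation. The helper names and the motif string `GATTACA` are hypothetical. For brevity, motifs are placed on a fixed grid of disjoint windows and the background is uniform rather than GC-adjusted.

```python
import random


def embed_motifs(background, motif, count, rng):
    """Overwrite `count` non-overlapping windows of the background with the motif."""
    width = len(motif)
    # Candidate starts on a grid so that embedded motifs can never overlap.
    slots = list(range(0, len(background) - width + 1, width))
    seq = list(background)
    for start in rng.sample(slots, count):
        seq[start:start + width] = motif
    return "".join(seq)


def simulate_counting(motif, seq_length, pos_counts, neg_counts, num_pos, num_neg, seed=0):
    """Return a list of (sequence, embedded_count, label) tuples."""
    rng = random.Random(seed)
    data = []
    for label, (low, high), n in ((1, pos_counts, num_pos), (0, neg_counts, num_neg)):
        for _ in range(n):
            background = "".join(rng.choice("ACGT") for _ in range(seq_length))
            count = rng.randint(low, high)  # inclusive on both ends
            data.append((embed_motifs(background, motif, count, rng), count, label))
    return data


data = simulate_counting("GATTACA", 100, [3, 5], [0, 2], num_pos=4, num_neg=4)
for seq, count, label in data:
    print(label, count)
```

Under this reading, both classes can contain the motif, so the model has to learn motif density rather than mere presence.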



In [None]:
#Define the optimal model architecture in keras (Refer to Primer Tutorial 2)
density_detection_model=Sequential() 
density_detection_model.add(Conv2D(filters=3,kernel_size=(1,10),input_shape=density_detection_data.X_train.shape[1::]))
density_detection_model.add(Activation('relu'))
density_detection_model.add(MaxPooling2D(pool_size=(1,10)))
density_detection_model.add(Flatten())
density_detection_model.add(Dense(1))
density_detection_model.add(Activation("sigmoid"))

##compile the model, specifying the Adam optimizer, and binary cross-entropy loss. 
density_detection_model.compile(optimizer='adam',
                               loss='binary_crossentropy')

In [None]:
analyze(density_detection_model,density_detection_data)

## Homotypic motif density localization <a name='6'>
<a href=#outline>Home</a>

In [None]:
#Define simulation parameters 
density_localization_parameters = {
    "motif_name": "TAL1_known4",
    "seq_length": 1000,
    "center_size": 150,
    "min_motif_counts": 2,
    "max_motif_counts": 4, 
    "num_pos": 10000,
    "num_neg": 10000,
    "GC_fraction": 0.4}

#Get simulation data
density_localization_data=get_simulation_data("simulate_motif_density_localization",
                               density_localization_parameters,
                               validation_set_size=3200,test_set_size=4000)
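
A rough sketch of what `center_size` might control follows. It assumes that positive sequences have their motifs confined to the central `center_size` bp while negative sequences have motifs anywhere; this is an assumption about the simulation, and the function name and the motif width of 10 are hypothetical.

```python
import random


def sample_motif_start(seq_length, motif_width, center_size, central, rng):
    """Pick a start so the motif lies fully in the central window, or anywhere."""
    if central:
        window_start = (seq_length - center_size) // 2
        return rng.randrange(window_start, window_start + center_size - motif_width + 1)
    return rng.randrange(0, seq_length - motif_width + 1)


rng = random.Random(0)
central_starts = [sample_motif_start(1000, 10, 150, True, rng) for _ in range(1000)]
anywhere_starts = [sample_motif_start(1000, 10, 150, False, rng) for _ in range(1000)]
print(min(central_starts), max(central_starts))
print(min(anywhere_starts), max(anywhere_starts))
```

If the classes differ only in where the motifs sit, the layers after pooling must retain positional information, which is why the pooled feature map is flattened instead of being reduced to a single global maximum.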



In [None]:
#Define the optimal model architecture in keras (Refer to Primer Tutorial 2)
density_localization_model=Sequential() 
density_localization_model.add(Conv2D(filters=5,kernel_size=(1,10),input_shape=density_localization_data.X_train.shape[1::]))
density_localization_model.add(Activation('relu'))
density_localization_model.add(MaxPooling2D(pool_size=(1,10)))
density_localization_model.add(Flatten())
density_localization_model.add(Dense(1))
density_localization_model.add(Activation("sigmoid"))

##compile the model, specifying the Adam optimizer, and binary cross-entropy loss. 
density_localization_model.compile(optimizer='adam',
                               loss='binary_crossentropy')

In [None]:
analyze(density_localization_model,density_localization_data)

## Multiple motifs (multi-task)<a name='7'>
<a href=#outline>Home</a>

In [None]:
#Define simulation parameters 
multi_motif_parameters = {
    "motif_names": ["CTCF_known1","ZNF143_known2","SIX5_known1"],
    "seq_length": 500,
    "min_num_motifs": 0,
    "max_num_motifs": 1, 
    "num_seqs": 20000,
    "GC_fraction": 0.4}

#Get simulation data
multi_motif_data=get_simulation_data("simulate_multi_motif_embedding",
                               multi_motif_parameters,
                               validation_set_size=3200,test_set_size=4000)



In [None]:
#Define the optimal model architecture in keras (Refer to Primer Tutorial 2)
multi_motif_model=Sequential() 
multi_motif_model.add(Conv2D(filters=20,kernel_size=(1,20),input_shape=multi_motif_data.X_train.shape[1::]))
multi_motif_model.add(Activation('relu'))
multi_motif_model.add(MaxPooling2D(pool_size=(1,10)))
multi_motif_model.add(Flatten())
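#one sigmoid output per motif, since each motif is a separate prediction task
multi_motif_model.add(Dense(len(multi_motif_parameters["motif_names"])))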
multi_motif_model.add(Activation("sigmoid"))

##compile the model, specifying the Adam optimizer, and binary cross-entropy loss. 
multi_motif_model.compile(optimizer='adam',
                               loss='binary_crossentropy')
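
In a multi-task setting, binary cross-entropy is applied to each task's sigmoid output independently and then averaged. The pure-Python sketch below shows that computation on made-up labels and predictions for three tasks. The column order (one column per motif) is an assumption for illustration.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all examples and all tasks."""
    total, n = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            n += 1
    return total / n


# Two sequences, three tasks; values are invented for illustration.
y_true = [[1, 0, 0], [0, 0, 1]]
y_pred = [[0.9, 0.2, 0.1], [0.1, 0.3, 0.8]]
loss = binary_crossentropy(y_true, y_pred)
print(round(loss, 4))
```

Because each task contributes its own term, a sequence can be positive for several motifs at once or for none, unlike a softmax classifier, which forces exactly one class.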

In [None]:
analyze(multi_motif_model, multi_motif_data)

## Heterotypic motifs spatial grammar<a name='8'>
<a href=#outline>Home</a>

In [None]:
#Define simulation parameters 
heterodimer_parameters = {
    "motif1": "SPI1_known4",
    "motif2": "IRF_known1",
    "seq_length": 500,
    "min_spacing": 2,
    "max_spacing": 5, 
    "num_pos": 10000,
    "num_neg": 10000,
    "GC_fraction": 0.4}

#Get simulation data
heterodimer_data=get_simulation_data("simulate_heterodimer_grammar",
                               heterodimer_parameters,
                               validation_set_size=3200,test_set_size=4000)

In [None]:
#Define the model architecture in keras
heterodimer_model=Sequential()
#only the first layer needs input_shape; keras infers the shapes of the later layers
heterodimer_model.add(Conv2D(filters=15,kernel_size=(1,15),input_shape=heterodimer_data.X_train.shape[1::]))
heterodimer_model.add(Activation("relu"))
heterodimer_model.add(Conv2D(filters=15,kernel_size=(1,15)))
heterodimer_model.add(Activation("relu"))
heterodimer_model.add(Conv2D(filters=15,kernel_size=(1,15)))
heterodimer_model.add(Activation("relu"))
heterodimer_model.add(MaxPooling2D(pool_size=(1,35)))
heterodimer_model.add(Flatten())
#single sigmoid output, as in the other binary classification models above
heterodimer_model.add(Dense(1))
heterodimer_model.add(Activation("sigmoid"))

##compile the model, specifying the Adam optimizer, and binary cross-entropy loss. 
heterodimer_model.compile(optimizer='adam',
                               loss='binary_crossentropy')
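
Stacking three convolutional layers widens the stretch of sequence that each unit in the last layer can see. The sketch below computes that receptive field and the length of the pooled output for this architecture, assuming stride-1 convolutions with `'valid'` padding and non-overlapping pooling (the Keras defaults). The function names are hypothetical.

```python
def receptive_field(kernel_widths):
    """Input positions seen by one output unit of stacked stride-1 convolutions."""
    field = 1
    for width in kernel_widths:
        field += width - 1
    return field


def output_length(seq_length, kernel_widths, pool_width):
    """Sequence length after the convolutions and one max-pooling layer."""
    length = seq_length
    for width in kernel_widths:
        length = length - width + 1  # 'valid' padding, stride 1
    return length // pool_width      # non-overlapping max pooling


kernels = [15, 15, 15]
field = receptive_field(kernels)
pooled = output_length(500, kernels, 35)
print(field, pooled)  # 43 13
```

A single 15 bp filter can cover at most one motif plus a little context, whereas a 43 bp receptive field leaves room for two motifs and a short spacer between them, provided their combined width fits. That gives the deeper layers a chance to encode the spacing constraint.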


In [None]:
analyze(heterodimer_model,heterodimer_data)

## Conclusions<a name='9'>
<a href=#outline>Home</a>