# How to train your DragoNN tutorial 2: 
## Interpreting features induced by DNNs across multiple types of motif grammars 

This tutorial is a supplement to the DragoNN manuscript and follows figure 7 in the manuscript. 

This tutorial takes 20 to 30 minutes when executed on a GPU. 

Please complete "PrimerTutorial 1- Exploring model architectures for a homotypic motif density simulation" prior to completing this tutorial. 

## Outline<a name='outline'>
<ol>
    <li><a href=#1>How to use this tutorial</a></li>
    <li><a href=#2>Simulating training data with simdna: Review of Tutorial 1</a></li>
    <li><a href=#3>Single Motif</a></li>
    <li><a href=#4>Homotypic motif density</a></li>
    <li><a href=#5>Homotypic motif density localization</a></li>
    <li><a href=#6>Multiple motifs (multi-task)</a></li>  
    <li><a href=#7>Heterotypic motifs spatial grammar</a></li>
</ol>
GitHub issues on the dragonn repository with feedback, questions, and discussion are always welcome.

 

## How to use this tutorial<a name='1'>
<a href=#outline>Home</a>

This tutorial uses a Jupyter/IPython Notebook, an interactive computational environment that combines live code, visualizations, and explanatory text. The notebook is organized into a series of cells. You can run the next cell by clicking the play button:
![play button](./primer_tutorial_images/play_button.png)
You can also run all cells in a series by clicking "run all" in the Cell drop-down menu:
![play all button](./primer_tutorial_images/play_all_button.png)
Half of the cells in this tutorial contain code; the other half contain visualizations and explanatory text. Code, visualizations, and text in cells can all be modified, and you are encouraged to modify the code as you advance through the tutorial. You can inspect the implementation of a function used in a cell by following these steps:
![inspecting code](./primer_tutorial_images/inspecting_code.png)


In [5]:
#uncomment the lines below if you are running this tutorial from Google Colab 
#!pip install https://github.com/kundajelab/simdna/archive/0.3.zip
#!pip install https://github.com/kundajelab/dragonn/archive/keras_2.2_tensorflow_1.6_purekeras.zip


In [6]:
# Making sure our results are reproducible
from numpy.random import seed
seed(1234)
from tensorflow import set_random_seed
set_random_seed(1234)

We start by loading dragonn's tutorial utilities.

In [7]:
#load dragonn tutorial utilities 
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


from dragonn.tutorial_utils import *


## Getting simulation data <a name='2'>
<a href=#outline>Home</a>


DragoNN provides a set of simulation functions. Let's use the **print_available_simulations** function to examine the list of simulations supported by DragoNN:

In [12]:
print_available_simulations()

simulate_differential_accessibility
simulate_heterodimer_grammar
simulate_motif_counting
simulate_motif_density_localization
simulate_multi_motif_embedding
simulate_single_motif_detection


## Single Motif <a name='3'>
<a href=#outline>Home</a>


We begin with single motif detection of the TAL1_known4 motif: 

![TAL1_known4 motif](./primer_tutorial_images/TAL1_known4.png)
Let's find out what parameters are needed for the simulation: 

In [13]:
print_simulation_info("simulate_single_motif_detection")


    Simulates two classes of seqeuences:
        - Positive class sequence with a motif
          embedded anywhere in the sequence
        - Negative class sequence without the motif

    Parameters
    ----------
    motif_name : str
        encode motif name
    seq_length : int
        length of sequence
    num_pos : int
        number of positive class sequences
    num_neg : int
        number of negative class sequences
    GC_fraction : float
        GC fraction in background sequence

    Returns
    -------
    sequence_arr : 1darray
        Array with sequence strings.
    y : 1darray
        Array with positive/negative class labels.
    embedding_arr: 1darray
        Array of embedding objects.
    


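In this binary simulation task, we simulate a negative set of 3,000 random sequences of 1,500 bp each and a positive set of 3,000 random 1,500 bp sequences, each with one instance of the TAL1 motif embedded at a random position. The background GC fraction is 0.4, and 1,000 sequences each are held out for validation and for testing.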

In [15]:
single_motif_simulation_parameters = {
    "motif_name": "TAL1_known4",
    "seq_length": 1500, 
    "num_pos": 3000,
    "num_neg": 3000,
    "GC_fraction": 0.4}
simulation_data = get_simulation_data("simulate_single_motif_detection",
                                      single_motif_simulation_parameters,
                                      validation_set_size=1000, test_set_size=1000)
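
To build intuition for what this simulation produces, here is a minimal standalone sketch of the same idea: random background sequences with a chosen GC fraction, where the positive class has one motif instance written in at a random position. It is not how simdna or DragoNN implement the simulation. The function names, the `TOY_MOTIF` string, and the small class sizes are all assumptions made for illustration; the real simulation samples motif instances from the TAL1_known4 position weight matrix rather than inserting a fixed string.

```python
import random

def random_background(length, gc_fraction, rng):
    """Draw a random DNA string in which each base is G or C with probability gc_fraction."""
    bases = []
    for _ in range(length):
        if rng.random() < gc_fraction:
            bases.append(rng.choice("GC"))
        else:
            bases.append(rng.choice("AT"))
    return "".join(bases)

def embed_motif(sequence, motif, rng):
    """Overwrite a randomly chosen window of the sequence with the motif."""
    start = rng.randrange(len(sequence) - len(motif) + 1)
    return sequence[:start] + motif + sequence[start + len(motif):]

def simulate_toy_single_motif(motif, seq_length, num_pos, num_neg, gc_fraction, seed=1234):
    """Return (sequences, labels): positives (label 1) carry the motif, negatives (label 0) are pure background."""
    rng = random.Random(seed)
    sequences, labels = [], []
    for _ in range(num_pos):
        background = random_background(seq_length, gc_fraction, rng)
        sequences.append(embed_motif(background, motif, rng))
        labels.append(1)
    for _ in range(num_neg):
        # Negatives are not filtered, so a short motif can still occur by chance.
        sequences.append(random_background(seq_length, gc_fraction, rng))
        labels.append(0)
    return sequences, labels

# Hypothetical placeholder motif string (assumption), not the TAL1_known4 PWM.
TOY_MOTIF = "CAGCTG"
toy_sequences, toy_labels = simulate_toy_single_motif(
    TOY_MOTIF, seq_length=1500, num_pos=30, num_neg=30, gc_fraction=0.4)
print(len(toy_sequences), sum(toy_labels))
```

Because the motif occupies only a few of the 1,500 positions in each positive sequence, a model has to learn to detect it regardless of where it lands, which is the property the convolutional architecture in the next step is meant to exploit.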

We define the convolutional neural network model architecture: 