# How to train your DragoNN tutorial 5: 
## CNN Hyperparameter Tuning via Grid Search 
This tutorial is a supplement to the DragoNN manuscript and follows Figure 6 in the manuscript.

This tutorial takes 30 minutes to 1 hour when executed on a GPU.

## Outline<a name='outline'>
<ol>
    <li><a href=#1>How to use this tutorial</a></li>
    <li><a href=#2>Data simulation and default CNN model performance</a></li>
    <ol>
        <li><a href=#2a>Simple Motif Detection: TAL1, CTCF, ZNF143, SIX5 </a></li>
        <li><a href=#2b>Motif Density Detection</a></li>
        <li><a href=#2c>Motif Density Localization</a></li>
        <li><a href=#2d>Multiple Motif Detection</a></li>
        <li><a href=#2e>Heterodimer Motif Grammar: SPI1_IRF</a></li>
    </ol>
    <li><a href=#3>Training examples</a></li>
    <li><a href=#4>Convolutional Filter Width </a></li>  
    <li><a href=#5>Number of convolution filters</a></li>
    <li><a href=#6>Max Pooling Width</a></li>
</ol>
GitHub issues on the dragonn repository with feedback, questions, and discussion are always welcome.

 

## How to use this tutorial<a name='1'>
<a href=#outline>Home</a>

This tutorial uses a Jupyter/IPython Notebook - an interactive computational environment that combines live code, visualizations, and explanatory text. The notebook is organized into a series of cells. You can run the next cell by clicking the play button:
![play button](./primer_tutorial_images/play_button.png)
You can also run all cells in a series by clicking "run all" in the Cell drop-down menu:
![play all button](./primer_tutorial_images/play_all_button.png)
Half of the cells in this tutorial contain code; the other half contain visualizations and explanatory text. Code, visualizations, and text in cells can be modified, and you are encouraged to modify the code as you advance through the tutorial. You can inspect the implementation of a function used in a cell by following these steps:
![inspecting code](./primer_tutorial_images/inspecting_code.png)


In [23]:
#uncomment the lines below if you are running this tutorial from Google Colab 
#!pip install https://github.com/kundajelab/simdna/archive/0.3.zip
#!pip install https://github.com/kundajelab/dragonn/archive/keras_2.2_tensorflow_1.6_purekeras.zip


In [24]:
# Making sure our results are reproducible
from numpy.random import seed
seed(1234)
from tensorflow import set_random_seed
set_random_seed(1234)

We start by loading dragonn's tutorial utilities and reviewing properties of regulatory sequence that transcription factors bind.

In [25]:
#load dragonn tutorial utilities 
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


from dragonn.tutorial_utils import *


## Data simulation and default CNN model performance <a name='2'>
<a href=#outline>Home</a>


DragoNN provides a set of simulation functions. Let's use the **print_available_simulations** function to examine the list of simulations supported by DragoNN:

In [19]:
print_available_simulations()

simulate_differential_accessibility
simulate_heterodimer_grammar
simulate_motif_counting
simulate_motif_density_localization
simulate_multi_motif_embedding
simulate_single_motif_detection


We design four single-task binary classification tasks using DNA sequences that simulate different
properties of regulatory DNA sequences: single motif, homotypic motif clusters, heterotypic motif
clusters, and heterodimer motif grammars with spatial constraints. We further design a multitask
classification simulation to jointly detect motif instances of 3 distinct TFs (corresponding to 3 binary
classification tasks, one per TF). In each simulation, we embed motif instances with the relevant
constraints in random sequences (G/C frequency = 0.4). We hold out 20% of sequences for a test
set, 16% for a validation set, and use the remaining sequences for training. Motif instances are
reverse complemented with 0.5 probability before they are embedded in the background sequence.
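
The embedding step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the DragoNN or simdna implementation: the helper names are hypothetical, and the fixed string `GATAAGC` is a placeholder motif instance (the actual simulations draw instances from a motif model rather than using one fixed string).

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def random_background(length, gc_fraction, rng):
    """Draw a random DNA sequence whose expected G/C frequency is gc_fraction."""
    at_weight = (1.0 - gc_fraction) / 2.0
    gc_weight = gc_fraction / 2.0
    weights = [at_weight, gc_weight, gc_weight, at_weight]  # order: A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def embed_motif(background, motif, rng, rc_prob=0.5):
    """Overwrite a random window of the background with the motif instance,
    reverse complementing the instance first with probability rc_prob."""
    if rng.random() < rc_prob:
        motif = reverse_complement(motif)
    start = rng.randrange(len(background) - len(motif) + 1)
    embedded = background[:start] + motif + background[start + len(motif):]
    return embedded, start


rng = random.Random(1234)
motif = "GATAAGC"  # placeholder motif instance (assumption, not a real TAL1 site)
seq, start = embed_motif(random_background(500, 0.4, rng), motif, rng)
print(len(seq), start, seq[start:start + len(motif)])
```

A negative example in the single motif task would simply be `random_background(500, 0.4, rng)` with nothing embedded.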

### Simple Motif Detection: TAL1, CTCF, ZNF143, SIX5 <a name='2a'></a>

TAL1_known4:
![TAL1_known4 motif logo](./primer_tutorial_images/TAL1_known4.png)

In [20]:
tal1_parameters = {
    "motif_name": "TAL1_known4",
    "seq_length": 500, 
    "num_pos": 10000,
    "num_neg": 10000,
    "GC_fraction": 0.4}
tal1_data = get_simulation_data("simulate_single_motif_detection",
                                tal1_parameters,
                                validation_set_size=640, test_set_size=800)

CTCF_known1:
![CTCF_known1 motif logo](./primer_tutorial_images/CTCF_known1.png)

In [21]:
ctcf_parameters = {
    "motif_name": "CTCF_known1",
    "seq_length": 500, 
    "num_pos": 10000,
    "num_neg": 10000,
    "GC_fraction": 0.4}
ctcf_data = get_simulation_data("simulate_single_motif_detection",
                                ctcf_parameters,
                                validation_set_size=640, test_set_size=800)

ZNF143_known2: 
![ZNF143_known2 motif logo](./primer_tutorial_images/ZNF143_known2.png)

In [26]:
znf143_parameters = {
    "motif_name": "ZNF143_known2",
    "seq_length": 500,
    "num_pos": 10000,
    "num_neg": 10000,
    "GC_fraction": 0.4}
znf143_data = get_simulation_data("simulate_single_motif_detection",
                                  znf143_parameters,
                                  validation_set_size=640, test_set_size=800)

SIX5_known1:  
![SIX5_known1 motif logo](./primer_tutorial_images/SIX5_known1.png)

In [27]:
six5_parameters = {
    "motif_name": "SIX5_known1",
    "seq_length": 500,
    "num_pos": 10000,
    "num_neg": 10000,
    "GC_fraction": 0.4}
six5_data = get_simulation_data("simulate_single_motif_detection",
                                six5_parameters,
                                validation_set_size=640, test_set_size=800)

### Motif Density Detection <a name='2b'></a>

In this binary simulation task, we simulate 10K 500 bp random sequences with 0-2 instances of a TAL1 motif embedded at random positions and 10K 500 bp random sequences with 3-5 instances of the motif embedded at random positions. To solve this simulation, the model needs to learn the difference in motif counts between the two classes.
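
As a rough illustration of how the two classes differ, the sketch below samples a motif count per sequence. It assumes counts are drawn uniformly within each inclusive range, which the text does not state; `sample_motif_count` is a hypothetical helper, not a DragoNN function.

```python
import random


def sample_motif_count(is_positive, rng):
    """Positives carry 3-5 motif instances, negatives 0-2 (both ranges inclusive).

    Uniform sampling within each range is an assumption of this sketch."""
    low, high = (3, 5) if is_positive else (0, 2)
    return rng.randint(low, high)


rng = random.Random(1234)
labels = [1] * 5 + [0] * 5
counts = [sample_motif_count(bool(label), rng) for label in labels]

# The class label is fully determined by whether the count reaches 3.
recovered = [int(count >= 3) for count in counts]
print(recovered == labels)
```

Because the count ranges do not overlap, a model that can count motif instances accurately can in principle separate the classes perfectly.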

### Motif Density Localization <a name='2c'></a>

In this binary simulation task, we simulate 20K random sequences of length 1 kbp with 2-4 embedded instances of the TAL1 motif. In the positive set of 10K sequences, the motif instances are embedded in the central 150 bp. In the negative set of 10K sequences, the motif instances are embedded at random positions anywhere in the sequence. To solve this simulation, the model needs to learn localization differences of the motif instances.
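
The positional constraint can be sketched as follows. The motif length of 10 bp and the helper name are assumptions made for illustration, and the sketch ignores possible overlaps between multiple instances in the same sequence.

```python
import random

SEQ_LEN = 1000      # 1 kbp sequences, as in the text
CENTER_WIDTH = 150  # central window used for positives
MOTIF_LEN = 10      # hypothetical motif length (assumption)


def sample_motif_start(is_positive, rng):
    """Sample a start position such that the motif fits its allowed region."""
    if is_positive:
        low = (SEQ_LEN - CENTER_WIDTH) // 2      # first base of the central window
        high = low + CENTER_WIDTH - MOTIF_LEN    # last start that still fits
    else:
        low, high = 0, SEQ_LEN - MOTIF_LEN       # anywhere in the sequence
    return rng.randint(low, high)


rng = random.Random(1234)
positive_starts = [sample_motif_start(True, rng) for _ in range(3)]
negative_starts = [sample_motif_start(False, rng) for _ in range(3)]
print(positive_starts, negative_starts)
```

Note that negatives can still place an instance in the central window by chance, so the signal is the concentration of all instances in the center rather than the position of any single one.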

### Multiple Motif Detection <a name='2d'></a>

In this simulation of multiple co-binding TFs, we simulate 20K 500 bp random sequences. For each sequence, we independently embed 0 or 1 instance of motifs corresponding to 3 TFs: CTCF, ZNF143, and SIX5 (see SM). Each sequence has binary labels for 3 tasks corresponding to the presence or absence of a motif instance of each of the three TFs. We train a multitask CNN such that the last layer of the model has three logistic output neurons corresponding to the three separate tasks. To solve this simulation, the model needs to detect all three motifs while sharing parameters.
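
The multitask labels can be pictured as one binary vector per sequence, with one entry per TF. The sketch below assumes each motif is present with probability 0.5, which the text does not specify; `sample_task_labels` is a hypothetical helper.

```python
import random

TASKS = ["CTCF", "ZNF143", "SIX5"]


def sample_task_labels(rng, presence_prob=0.5):
    """Independently decide, per TF, whether one motif instance is embedded.

    presence_prob=0.5 is an assumption of this sketch."""
    return [int(rng.random() < presence_prob) for _ in TASKS]


rng = random.Random(1234)
label_matrix = [sample_task_labels(rng) for _ in range(4)]
for row in label_matrix:
    # Each row is the target for the three logistic output neurons.
    print(dict(zip(TASKS, row)))
```

Each column of this matrix is the target for one of the three logistic output neurons.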

### Heterodimer Motif Grammar: SPI1_IRF <a name='2e'></a>

In this binary simulation task, we simulate 20K 500 bp random sequences with one instance of an SPI1 motif and one instance of an IRF motif (see SM). In the positive set of 10K sequences, the pair of motifs is embedded with a relative spacing of 2-5 bp, at a random position in each sequence. In the negative set, the two motifs are each embedded at random positions with no positional or spacing constraints. To solve this simulation, the model needs to detect both motifs and learn the spacing constraint between them in the positive set. For this simulation, an architecture with a single convolutional layer does not perform well (results not shown). Hence, we use a reference architecture with 3 convolutional layers. Each convolutional layer has 15 filters (size 15, stride 1) and ReLU non-linearity, followed by max-pooling (size 35, stride 35), followed by a fully connected layer with sigmoid non-linearity for binary classification.
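
The spacing grammar can be sketched by sampling start positions for the two motifs. The motif lengths of 10 bp, the fixed left-to-right order in positives, and the helper name are all assumptions made for illustration.

```python
import random


def sample_pair_starts(is_positive, rng, seq_len=500, len_a=10, len_b=10):
    """Return start positions (start_a, start_b) for two non-overlapping motifs.

    Positives: motif B follows motif A after a gap of 2-5 bp.
    Negatives: both motifs are placed at random, rejecting overlaps only."""
    if is_positive:
        gap = rng.randint(2, 5)
        start_a = rng.randrange(seq_len - (len_a + gap + len_b) + 1)
        return start_a, start_a + len_a + gap
    while True:
        start_a = rng.randrange(seq_len - len_a + 1)
        start_b = rng.randrange(seq_len - len_b + 1)
        if start_a + len_a <= start_b or start_b + len_b <= start_a:
            return start_a, start_b


rng = random.Random(1234)
pos_a, pos_b = sample_pair_starts(True, rng)
print("positive gap:", pos_b - (pos_a + 10))
print("negative starts:", sample_pair_starts(False, rng))
```

Since both classes contain both motifs, detecting the motifs alone is not enough; the model has to capture their relative spacing.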

We start with a simple reference CNN architecture that contains a convolutional layer with 10 convolutional filters (size 15, stride 1) and ReLU activations, followed by max-pooling (size 35, stride 35), followed by a fully connected layer with a logistic output neuron for binary classification. Models are trained using the Adam optimizer with early stopping after 7 consecutive epochs without validation loss improvement. Performance (auROC) is recorded on an independent test set.
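
To get a feel for these layer sizes, the arithmetic below computes how many features reach the fully connected layer. It assumes a 500 bp input, unpadded ("valid") convolutions, and, for the 3-layer heterodimer architecture, a single max-pooling step after the last convolution; none of these details are confirmed by the text, so treat the numbers as a sketch.

```python
def conv_output_length(input_length, filter_width, stride=1):
    """Output length of an unpadded ('valid') 1D convolution."""
    return (input_length - filter_width) // stride + 1


def pool_output_length(input_length, pool_width, pool_stride):
    """Output length of an unpadded 1D max-pooling layer."""
    return (input_length - pool_width) // pool_stride + 1


def flattened_features(seq_length, num_conv_layers, num_filters,
                       filter_width=15, pool_width=35, pool_stride=35):
    """Number of values fed to the fully connected layer."""
    length = seq_length
    for _ in range(num_conv_layers):
        length = conv_output_length(length, filter_width)
    return pool_output_length(length, pool_width, pool_stride) * num_filters


print(flattened_features(500, 1, 10))  # reference architecture: 130
print(flattened_features(500, 3, 15))  # 3-layer heterodimer architecture: 195
```

Under these assumptions, the convolution shortens the 500 bp input to 486 positions, pooling reduces that to 13 windows, and 10 filters give 130 inputs to the output neuron.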

We then systematically vary the number of training examples, the size of convolutional filters, the number of convolutional filters, and the size of max pooling to understand the impact of each hyperparameter on prediction performance for each of the simulations.
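
One way to organize such a sweep is to vary a single hyperparameter at a time while holding the others at their reference values. The sketch below only generates the configurations; it does not train anything. The reference values come from the text above (12800 training examples corresponds to the 64% of 20K sequences left after the stated hold-outs), while the candidate values in `SWEEPS` and all names are hypothetical.

```python
REFERENCE = {
    "num_training_examples": 12800,  # 64% of 20K sequences
    "conv_width": 15,
    "num_filters": 10,
    "pool_width": 35,
}

# Hypothetical candidate values; the tutorial's actual grids may differ.
SWEEPS = {
    "num_training_examples": [1000, 4000, 8000, 12800],
    "conv_width": [5, 10, 15, 20, 30],
    "num_filters": [1, 5, 10, 20],
    "pool_width": [5, 15, 35, 50],
}


def sweep_configs(reference, sweeps):
    """Yield (varied_name, config) pairs, changing one hyperparameter at a time."""
    for name, values in sweeps.items():
        for value in values:
            config = dict(reference)
            config[name] = value
            yield name, config


configs = list(sweep_configs(REFERENCE, SWEEPS))
print(len(configs))
print(configs[0])
```

Each configuration would then be used to build and train one model, with the test auROC recorded against the varied value. A full grid over all combinations could be generated with `itertools.product` instead, at a much higher training cost.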