In [1]:
%matplotlib inline
from simdna import synthetic as sn
import avutils
from avutils import file_processing as fp

## Overview

Describe the simulation (number per type of sequence, the three tasks, the different motifs, ENCODE images of the motifs, and also the GC content)

## Loading the raw data

Load the model's test set for analysis.

In [2]:
raw_data = sn.read_simdata_file("sequences.simdata.gz", ids_to_load=fp.read_rows_into_arr("splits/test.txt.gz"))

Let's inspect the contents of the raw_data object.

First, we can access the raw underlying sequences:

In [3]:
raw_data.sequences[:5]

['CTTGTATTGAGTTAGAAACCGCCACGGCACTGCTATGTATGACATTCTAACTAAGTGAGTTATGCGTTGGGTCCTTTATGTGGCATTATCTGGTAATACTTAATTGATGTACTATTTCCTCGACAAAACAGGTGGTGTGGGTGTACCGGTCACCGATAAGGGGGAACTCACCAGATGGTAGTTAACCTATAGAGTCCTGA',
 'TGACAATGGACCCGGTCCGGGTTAGGTACTGATTAGGAACGACCCAGGCCGAGCGACTTATCCCGTTAGATGACGGAATCGTTGTTAGCGGAAAAGAGATAAGAACTTCCCAATTTATACAGATAAGCACACAATGTTTAACGTTCCCGTCTGAGTACTCCTGAATGGGAAGGATATTCTATACTAAGGGATTAGTCGTG',
 'CGTTTGAAAGGAGACCAGGTGGTCCAACCACAGTAGCAGAATATTATGCGGTTGTCGATAGAGCTTTGCTAACAGATGTTTACAGTTATCCTGAAATCGTTTGGACAAACTGTCTTCGTGCGTAATTGCACTGTTATGGCCCAGTACAGGGGATCAGATGGTCATGGCAAAAGGCGTGCCAACGCTCGAACTGGAGTCTT',
 'TATGACGACTGCGTATAGTAGACAGATGGTGGACAATTCCCCGTTACATAAAGGGGTACAGATGGTAAGTCCACAAACCGTAACGGCAACAGATGTTTTGAGGTACAAATATAAGGTCCTGATAAGGAGCCGAGAGCTGACGCGTGCCCAATGAGTACATACGTGATACGAATGCGTGCGCCCGGAGTATGTCAAACCGT',
 'CAGTATCTACTGAAAGGAGAATGCACTTGCCGCAATTAACATCCTCTGATTGCACTTGAGTATTTAACAATATATTATGAGCAAGACGCCGGCCTTGTAAAAGACCAAATATAAGGACAATCTAGGGGCGCGTGACCAAGACTGCATCATATCTTCCAAATTTAGTAGTACGCCGTGG

...as well as the labels:

In [4]:
print(raw_data.labels[:5])

[[1 1 1]
 [0 1 0]
 [0 0 1]
 [1 1 1]
 [0 0 0]]


We can also access the motif embeddings that were inserted into each sequence during the simulation. Can you verify the relationship between the labels and the embeddings?

In [6]:
print("\n".join(", ".join(str(embedding) for embedding in embeddings_one_seq)
                     for embeddings_one_seq in raw_data.embeddings[:5]))

pos-151_GATA_disc1-ACCGATAAGG, pos-123_TAL1_known1-CAAAACAGGTGGTGTG, pos-166_TAL1_known1-CTCACCAGATGGTAGT
pos-94_GATA_disc1-AGAGATAAGA, pos-118_GATA_disc1-ACAGATAAGC, pos-27_GATA_disc1-ACTGATTAGG
pos-149_TAL1_known1-GGGATCAGATGGTCAT, pos-67_TAL1_known1-GCTAACAGATGTTTAC, pos-10_TAL1_known1-GAGACCAGGTGGTCCA
pos-161_GATA_disc1-CGTGATACGA, pos-117_GATA_disc1-CCTGATAAGG, pos-84_TAL1_known1-GGCAACAGATGTTTTG, pos-53_TAL1_known1-GGGTACAGATGGTAAG, pos-17_TAL1_known1-GTAGACAGATGGTGGA
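
One way to check the relationship is to derive labels from the embeddings and compare them with the labels printed earlier. The following is a minimal sketch, not part of simdna. It assumes, based only on the five examples shown here, that task 0 means "both GATA and TAL1 present", task 1 means "GATA present", and task 2 means "TAL1 present". The helpers `motif_name` and `labels_from_embeddings` are hypothetical names introduced for illustration, and the strings are copied from the output above (the fifth sequence has no embeddings).

```python
import re

# String forms of the embeddings printed above, one list per sequence.
embedding_strs = [
    ["pos-151_GATA_disc1-ACCGATAAGG", "pos-123_TAL1_known1-CAAAACAGGTGGTGTG",
     "pos-166_TAL1_known1-CTCACCAGATGGTAGT"],
    ["pos-94_GATA_disc1-AGAGATAAGA", "pos-118_GATA_disc1-ACAGATAAGC",
     "pos-27_GATA_disc1-ACTGATTAGG"],
    ["pos-149_TAL1_known1-GGGATCAGATGGTCAT", "pos-67_TAL1_known1-GCTAACAGATGTTTAC",
     "pos-10_TAL1_known1-GAGACCAGGTGGTCCA"],
    ["pos-161_GATA_disc1-CGTGATACGA", "pos-117_GATA_disc1-CCTGATAAGG",
     "pos-84_TAL1_known1-GGCAACAGATGTTTTG", "pos-53_TAL1_known1-GGGTACAGATGGTAAG",
     "pos-17_TAL1_known1-GTAGACAGATGGTGGA"],
    [],
]

# Labels printed earlier for the same five sequences.
observed_labels = [[1, 1, 1], [0, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]


def motif_name(embedding_str):
    """Extract the motif name, e.g. 'GATA_disc1', from 'pos-151_GATA_disc1-ACC...'."""
    match = re.match(r"pos-(\d+)_(.+)-([ACGT]+)$", embedding_str)
    return match.group(2)


def labels_from_embeddings(embeddings_one_seq):
    """Derive [both, GATA, TAL1] labels under the ASSUMED task definitions."""
    names = {motif_name(e) for e in embeddings_one_seq}
    has_gata = any(name.startswith("GATA") for name in names)
    has_tal1 = any(name.startswith("TAL1") for name in names)
    return [int(has_gata and has_tal1), int(has_gata), int(has_tal1)]


derived_labels = [labels_from_embeddings(e) for e in embedding_strs]
print(derived_labels == observed_labels)
```

In the notebook itself, the same check could be run on the whole test set by building the string lists with `str(embedding)`, as the cell above does, rather than copying them by hand.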



We will need the data in one-hot encoded form, so let's encode it now.

In [8]:
one_hot_data = avutils.util.one_hot_encode_sequences(raw_data.sequences)
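
For readers unfamiliar with one-hot encoding, here is a minimal NumPy sketch of the idea. It is not the avutils implementation: the column order A, C, G, T, the `(length, 4)` output shape, and the all-zero row for unknown characters are all assumptions made for illustration, and `one_hot_encode` is a hypothetical helper name.

```python
import numpy as np

BASES = "ACGT"  # assumed column order; the real encoder may differ


def one_hot_encode(sequence):
    """Return a (len(sequence), 4) array with a single 1 per known base.

    Characters outside ACGT (for example 'N') are left as all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(sequence), len(BASES)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        if base in index:
            encoded[position, index[base]] = 1.0
    return encoded


example = one_hot_encode("GATA")
print(example)
```

Each row corresponds to one position in the sequence and each column to one base, so every row for a known base sums to 1.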

## Loading a DeepLIFT model

Let's now set up a DeepLIFT model. In this tutorial, we start with a Keras model and use the DeepLIFT autoconversion functions to create a DeepLIFT model. A Keras model is not required: if your model was trained with a different package, you can write your own conversion script to put it in the DeepLIFT format, as documented in the DeepLIFT repo at https://github.com/kundajelab/deeplift#under-the-hood. For now, we will stick to the autoconversion.

We will start by loading a Keras graph model (Keras version 0.3.2). Loading the model requires both the weights, stored in the .h5 format, and the configuration, stored in the .yaml format.

In [12]:
import deeplift.conversion.keras_conversion as kc

keras_model_weights = "model_files/record_1_model_9vvXe_modelWeights.h5"
keras_model_yaml = "model_files/record_1_model_9vvXe_modelYaml.yaml"

keras_model = kc.load_keras_model(weights=keras_model_weights, yaml=keras_model_yaml)

We will now convert the Keras model to the DeepLIFT format using the provided autoconversion functions. When converting the model, we need to specify a reference to use.

In [None]:
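# Note: nonlinear_mxts_mode, dense_mxts_mode and reference are not defined
# in the cells above; they must be set before this cell will run.
deeplift_model = kc.convert_graph_model(model=keras_model,
                    nonlinear_mxts_mode=nonlinear_mxts_mode,
                    dense_mxts_mode=dense_mxts_mode,
                    reference=reference)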
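
The `reference` argument above is not defined in this excerpt. One common choice of reference for one-hot encoded DNA is the background frequency of each base. The sketch below computes those frequencies from a list of sequences. It is only an illustration: `background_frequencies` is a hypothetical helper, the toy sequences are made up, and whether `convert_graph_model` accepts a reference in this form is an assumption that should be checked against the DeepLIFT documentation.

```python
from collections import Counter

import numpy as np


def background_frequencies(sequences, bases="ACGT"):
    """Return the fraction of each base (in the given order) across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    # Only count the four bases; other characters (e.g. 'N') are ignored.
    total = sum(counts[base] for base in bases)
    return np.array([counts[base] / total for base in bases])


# Toy input for illustration; in the notebook this could be raw_data.sequences.
toy_sequences = ["GATTACA", "CCGG"]
freqs = background_frequencies(toy_sequences)
print(dict(zip("ACGT", freqs)))
```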