# DFIM: Recovering interactions between embedded motifs

In this example we recover interactions between embedded motifs. We simulate three sets of sequences:

- Class 1: 20,000 sequences with motif A (the SIX5 motif) embedded 1-2 times
- Class 2: 20,000 sequences with motif B (the ELF1 motif) 
- Class 3: 20,000 sequences with motif A and motif B (SIX5 and ELF1)


We train a model that predicts a positive label only when both motifs are present. Thus Class 3 sequences are positive and Class 1 and Class 2 sequences are negative, so the model must learn the interaction between motifs A and B. We also randomly embed motifs C and D (AP1 and TAL1) 0, 1, or 2 times in each of the 60,000 sequences.

Knowing the ground truth, we show that DFIM recovers an interaction between motifs A and B in Class 3 sequences but not between any other pair of embedded motifs.
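To make the simulation design concrete, here is a minimal sketch of how such sequences and labels could be generated. It is not the code used to produce the data files loaded in this notebook. The motif strings, the sequence length, the slot-based placement, and the choice of a single copy of motif B are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholder strings, NOT the real SIX5/ELF1/AP1/TAL1 motifs.
MOTIFS = {"A": "GGGAGTTGT", "B": "CAGGAAGTG", "C": "TGAGTCA", "D": "CAGCTG"}

def simulate(seq_class, length=200, slot=20, rng=rng):
    """Simulate one sequence of the given class (1, 2, or 3).

    Returns (sequence, label, embedded_motif_names).
    """
    names = []
    if seq_class in (1, 3):
        names += ["A"] * int(rng.integers(1, 3))  # motif A embedded 1-2 times
    if seq_class in (2, 3):
        names += ["B"]                            # assumption: one copy of motif B
    for distractor in ("C", "D"):
        names += [distractor] * int(rng.integers(0, 3))  # 0, 1, or 2 copies

    # Random background sequence.
    seq = rng.choice(list("ACGT"), size=length)

    # Give every motif instance its own slot so embeddings never overlap.
    slots = rng.choice(length // slot, size=len(names), replace=False)
    for name, s in zip(names, slots):
        motif = MOTIFS[name]
        offset = int(s) * slot + int(rng.integers(0, slot - len(motif) + 1))
        seq[offset:offset + len(motif)] = list(motif)

    # Positive only when both A and B are present, i.e. Class 3.
    label = int(seq_class == 3)
    return "".join(seq), label, names

seq, label, names = simulate(3)
print(label, sorted(set(names)))
```

Because the label depends on the joint presence of A and B, neither motif alone is predictive, which is what forces the model to learn the interaction.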

In [4]:
# Imports

import os, sys
import numpy as np
import pandas as pd
import gzip

import matplotlib
matplotlib.use('Agg') 
import matplotlib.pyplot as plt

import pickle
import itertools

from keras.models import model_from_json
from Bio import SeqIO

import dfim


Using Theano backend.
Using gpu device 5: GeForce GTX TITAN X (CNMeM is disabled, cuDNN 5105)
  "downsample module has been moved to the theano.tensor.signal.pool module.")


In [5]:
# Specify data files

labels = 'embedded_motif_interaction_data/labels.txt.gz'
fasta_file = 'embedded_motif_interaction_data/sequences.fa.gz'
simdata_file = 'embedded_motif_interaction_data/simulation_metadata.txt.gz'

model_weights = 'embedded_motif_interaction_data/model_weights.h5'
model_architecture = 'embedded_motif_interaction_data/model_architecture.json'


In [6]:
# Load model 

with open(model_architecture, 'r') as f:
    model_json = f.read()
model = model_from_json(model_json)
model.load_weights(model_weights)



In [7]:
# Load sequences

import dfim.util  # `import dfim` alone does not load the `util` submodule

fasta_sequences = SeqIO.parse(gzip.open(fasta_file),'fasta')
sequence_list = []
seq_fasta_list = []

for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    seq_fasta_list.append(sequence)               # keep the raw string
    new_sequence = dfim.util.one_hot_encode(sequence)
    sequence_list.append(new_sequence)            # keep the one-hot array
sequences = np.array(sequence_list)

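For readers unfamiliar with one-hot encoding of DNA, the following is a minimal sketch of the idea. It is not the implementation of `dfim.util.one_hot_encode`; the function name `one_hot_encode_sketch`, the A/C/G/T column order, and the `(length, 4)` output shape are assumptions, and the layout the model expects may differ (for example an extra channel axis).

```python
import numpy as np

def one_hot_encode_sketch(sequence, alphabet="ACGT"):
    """Return a (len(sequence), 4) array with a single 1 per known base.

    Bases outside the alphabet (e.g. N) are left as all-zero rows.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = np.zeros((len(sequence), len(alphabet)), dtype=np.float32)
    for pos, base in enumerate(sequence.upper()):
        col = index.get(base)
        if col is not None:
            encoded[pos, col] = 1.0
    return encoded

example = one_hot_encode_sketch("ACGTN")
print(example.shape)
```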

In [None]:
# Locate the embedded motifs in each sequence from the simulation metadata
seqlet_loc_dict = dfim.util.process_locations_from_simdata(sequences, simdata_file)

# Read in labels
allLabels = pd.read_table(labels).iloc[:, 1:]

# Generate predictions
print('Generating Predictions')
allPredictions = model.predict(sequences)

# Find correct predictions for the task of interest
correct_pred_dict = {}
pred_dict = {}
labels_dict = {}

task = 2
labels_dict[task] = allLabels.iloc[:,task].tolist()
pred_dict[task] = allPredictions[:,0].tolist()
correct_pred_dict[task] = dfim.util.get_correct_predictions(labels_dict[task], pred_dict[task])
correct_pred_list = [el[0] for el in correct_pred_dict[task]]

compute_range = range(50000,51000)
compute_index = [el for el in compute_range if el in correct_pred_list]
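The filtering step above keeps only sequences inside a chosen index window that the model also classifies correctly. The sketch below illustrates that logic on toy data. The helper `correct_predictions_sketch` is hypothetical and is not `dfim.util.get_correct_predictions`; the 0.5 threshold and the `(index, label, prediction)` tuple layout are assumptions, chosen only so that `el[0]` is the sequence index as in the cell above.

```python
def correct_predictions_sketch(labels, predictions, threshold=0.5):
    """Return (index, label, prediction) for every sequence whose
    thresholded prediction matches its label."""
    return [
        (i, y, p)
        for i, (y, p) in enumerate(zip(labels, predictions))
        if int(p >= threshold) == int(y)
    ]

# Toy data: sequences 2 and 3 are misclassified.
toy_labels = [1, 0, 1, 0, 1]
toy_predictions = [0.9, 0.2, 0.4, 0.7, 0.8]

correct = correct_predictions_sketch(toy_labels, toy_predictions)
correct_idx = {el[0] for el in correct}   # a set gives fast membership tests

window = range(0, 4)                      # index window to analyse
selected = [i for i in window if i in correct_idx]
print(selected)
```

Storing the correct indices in a set rather than a list avoids a linear scan for each of the window's entries, which matters when the list holds tens of thousands of indices.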
