# Generating synthetic sequences with embedded regulatory grammars for RNA-binding proteins

### Motivation

RNA-binding proteins (RBPs) interact with RNAs primarily through direct nucleotide contacts, either by recognizing sequence motifs or by binding secondary-structure motifs. Here, I will focus on the former, using a deep convolutional neural network (CNN). However, assessing performance on real data is difficult because the "ground truth" is usually not known.

To test the performance of CNNs, I am going to simulate idealized data in which the sequences contain combinations of "known" binding motifs implanted at specific "known" locations, while the rest of the sequence is generated randomly. This idealized dataset can therefore gauge whether CNNs can indeed recover the motifs and their regulatory grammars in the simplest scenario. It will also provide a baseline understanding of the limitations of this approach, such as how many sequences are needed to recover regulatory grammars of different complexity.

Note that the regulatory grammars generated here are unrealistic; as far as I am aware, there are no databases with accurate annotations of such grammars. Only the binding motifs are derived from actual experimental data. The regulatory grammars are arbitrary, generated from a probabilistic framework with a user-defined level of complexity, i.e. how many motifs interact with each other.

### Simulation model

To generate regulatory grammars, we need to create a framework for the interactions of specific motifs across distinct spatial distances. 

First, we will assume there are only P proteins. From these P proteins, we can generate G regulatory grammars, each of which samples a combination of the M motifs. We can then generate arbitrary distances between neighboring motifs, sampled from an exponential distribution. Once the motif combinations and distances have been set, they constitute the set of regulatory grammars.
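As a rough illustration of this sampling step, the sketch below draws a few grammars with NumPy. The function name `sample_grammars`, the dictionary layout, the fixed number of motifs per grammar, and the mean distance of 15 nucleotides are all assumptions made for the example, not part of the original notebook.

```python
import numpy as np

def sample_grammars(num_motifs, num_grammars, motifs_per_grammar, mean_distance, seed=0):
    """Sample hypothetical regulatory grammars.

    Each grammar is a dict with the chosen motif indices and the integer
    gaps (in nucleotides) that separate neighboring motifs.
    """
    rng = np.random.RandomState(seed)
    grammars = []
    for _ in range(num_grammars):
        # choose a combination of distinct motifs out of the M available
        motif_ids = rng.choice(num_motifs, size=motifs_per_grammar, replace=False)
        # exponential gaps between neighboring motifs, rounded up to at least 1 nt
        gaps = np.ceil(rng.exponential(mean_distance, size=motifs_per_grammar - 1))
        gaps = np.maximum(gaps, 1).astype(int)
        grammars.append({'motifs': motif_ids, 'distances': gaps})
    return grammars

# illustrative values: M = 20 motifs, G = 5 grammars, 3 motifs per grammar
grammars = sample_grammars(num_motifs=20, num_grammars=5,
                           motifs_per_grammar=3, mean_distance=15)
for g in grammars:
    print(g['motifs'], g['distances'])
```

Fixing the number of motifs per grammar is the simplest choice; the "level of complexity" could instead be varied by sampling that number per grammar.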

We can also simulate negative examples by using different motifs with the same distances, the same motifs with different distances, or incomplete grammars. First, we'll just assume a perfect dataset and see how the model performs. Then, we will systematically increase the complexity to see exactly when the model fails.
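The three kinds of negative examples mentioned above could be derived from a positive grammar along the following lines. This is only a sketch: `make_negative`, the mode names, and the grammar dictionary layout are hypothetical, and the default mean distance of 15 is an arbitrary choice.

```python
import numpy as np

def make_negative(grammar, num_motifs, mode, rng, mean_distance=15):
    """Derive a negative grammar from a positive one (hypothetical helper)."""
    motifs = np.array(grammar['motifs'])
    distances = np.array(grammar['distances'])
    if mode == 'different_motifs':
        # keep the distances, draw motifs that the positive grammar does not use
        unused = np.setdiff1d(np.arange(num_motifs), motifs)
        motifs = rng.choice(unused, size=len(motifs), replace=False)
    elif mode == 'different_distances':
        # keep the motifs, redraw the gaps until at least one of them differs
        new = distances.copy()
        while np.array_equal(new, distances):
            new = np.ceil(rng.exponential(mean_distance, size=len(distances)))
            new = np.maximum(new, 1).astype(int)
        distances = new
    elif mode == 'incomplete':
        # drop the last motif together with the gap leading to it
        motifs, distances = motifs[:-1], distances[:-1]
    else:
        raise ValueError('unknown mode: %s' % mode)
    return {'motifs': motifs, 'distances': distances}

rng = np.random.RandomState(0)
positive = {'motifs': np.array([0, 1, 2]), 'distances': np.array([5, 10])}
negatives = {mode: make_negative(positive, 20, mode, rng)
             for mode in ('different_motifs', 'different_distances', 'incomplete')}
```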


#### Create a motif database for Drosophila melanogaster

The data comes from Ray et al. "A compendium of RNA-binding motifs for decoding gene regulation" (http://www.nature.com/nature/journal/v499/n7457/abs/nature12311.html). The link to the motifs I downloaded is here: 

\$ wget http://hugheslab.ccbr.utoronto.ca/supplementary-data/RNAcompete_eukarya/top10align_motifs.tar.gz

\$ tar xzvf top10align_motifs.tar.gz

Here, each file contains a different RBP motif as a position frequency matrix. So, the first step is to compile all of these files into a suitable database. In particular, we parse the motif (position frequency matrix) from each file in motifpath (the downloaded top10align_motifs folder), collect the motifs into a database (a list of arrays), and save it in binary format (motif.pickle):

In [3]:
!wget http://hugheslab.ccbr.utoronto.ca/supplementary-data/RNAcompete_eukarya/top10align_motifs.tar.gz
!tar xzf top10align_motifs.tar.gz

--2016-05-03 11:22:48--  http://hugheslab.ccbr.utoronto.ca/supplementary-data/RNAcompete_eukarya/top10align_motifs.tar.gz
Resolving hugheslab.ccbr.utoronto.ca (hugheslab.ccbr.utoronto.ca)... 142.150.84.64
Connecting to hugheslab.ccbr.utoronto.ca (hugheslab.ccbr.utoronto.ca)|142.150.84.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44389 (43K) [application/x-gzip]
Saving to: ‘top10align_motifs.tar.gz.1’


2016-05-03 11:22:49 (1.76 MB/s) - ‘top10align_motifs.tar.gz.1’ saved [44389/44389]



In [1]:
import os.path
import pandas as pd
import numpy as np
from six.moves import cPickle

motifpath = 'top10align_motifs/'   # directory where motif files are located
motiflist = 'motif.pickle'         # output filename

# get all motif files in motifpath directory (sorted for a reproducible order)
listdir = sorted(os.listdir(motifpath))

# parse motifs: drop the first (position) column and transpose, so that each
# motif is an array with one row per nucleotide and one column per position
motif_set = []
for filename in listdir:
    df = pd.read_table(os.path.join(motifpath, filename))
    motif_set.append(df.iloc[:, 1:].transpose().values)

# save motifs
with open(motiflist, 'wb') as f:
    cPickle.dump(motif_set, f, protocol=cPickle.HIGHEST_PROTOCOL)

Let's simulate some sequences with the following specifications. Note that there are 244 motifs here, so, for simplicity, we will downsample them to create datasets with different levels of complexity.

$N$: total number of sequences 

$M$: total number of motifs ($M \leq 244$) 

$S$: size of each sequence 

$w$: population fraction of each regulatory grammar 

To keep the sequences as clean as possible, we will use a uniform distribution for the position frequency matrix (PFM) of the non-motif regions. We will first set up a hierarchical interaction of the motifs, a so-called "regulatory grammar".

So, first, we will sample a regulatory grammar. This tells us which motifs are present and how far apart they are from each other. Then, we can create a position frequency matrix for this regulatory grammar and simulate a number of sequences from it based on the population fraction $w$.
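To make the two steps concrete, here is a minimal sketch that lays out a grammar as a position frequency matrix over a uniform background and then samples one-hot sequences from it. The helper names `grammar_to_pfm` and `sample_sequences`, the centered placement of the grammar, and the deterministic toy motif are assumptions for illustration; the cells further down use their own layout.

```python
import numpy as np

def grammar_to_pfm(motif_pfms, distances, seq_length):
    """Implant motifs (each a 4 x L array) into a uniform 4 x S background."""
    pfm = np.ones((4, seq_length)) / 4
    total = sum(m.shape[1] for m in motif_pfms) + int(np.sum(distances))
    if total > seq_length:
        raise ValueError('grammar does not fit in the sequence')
    pos = (seq_length - total) // 2          # center the grammar (arbitrary choice)
    for k, m in enumerate(motif_pfms):
        pfm[:, pos:pos + m.shape[1]] = m
        pos += m.shape[1]
        if k < len(distances):
            pos += int(distances[k])         # gap before the next motif
    return pfm

def sample_sequences(pfm, num_samples, rng):
    """Draw one-hot sequences (num_samples x 4 x S) from a PFM."""
    seq_length = pfm.shape[1]
    cum_prob = np.cumsum(pfm, axis=0)                       # 4 x S
    Z = rng.uniform(size=(num_samples, 1, seq_length))      # one draw per position
    # index of the first row whose cumulative probability exceeds the draw
    idx = np.minimum((Z >= cum_prob[None]).sum(axis=1), 3)  # num_samples x S
    one_hot = np.zeros((num_samples, 4, seq_length))
    one_hot[np.arange(num_samples)[:, None], idx, np.arange(seq_length)[None, :]] = 1
    return one_hot

rng = np.random.RandomState(0)
toy_motif = np.eye(4)            # deterministic toy motif: one fixed nucleotide per position
pfm = grammar_to_pfm([toy_motif, toy_motif], distances=[6], seq_length=50)
N, w = 1000, 0.3                 # total sequences and population fraction of this grammar
X = sample_sequences(pfm, int(round(N * w)), rng)
print(X.shape)
```

With several grammars, the same call would be repeated once per grammar with its own fraction $w$, and the resulting arrays concatenated.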

Now that we have generated some simulated data, we should shuffle it and split it into a training set, a cross-validation set, and a test set, with the sequences stored in a one-hot representation.

In [29]:
import h5py
datapath = '/home/peter/Data/Basset'
trainmat = h5py.File(os.path.join(datapath,'er.h5'), 'r')
y_train = np.array(trainmat['train_out'])
y = y_train.astype(np.float32)

In [30]:
np.mean(np.sum(y, axis=1))

13.729081

13.698700000000001

In [6]:
num_seq = 100000       # number of sequences
labels = y[0:num_seq,range(0,50)]
C = np.cov(labels.T)
Cinv = np.linalg.inv(C)
L = np.linalg.cholesky(C)
Linv = np.linalg.inv(L)


def data_subset(y, class_range, negative=True):
    " gets a subset of data in the class_range"
    data_index = []
    for i in class_range:
        index = np.where(y[:, i] == 1)
        data_index = np.concatenate((data_index, index[0]), axis=0)
    unique_index = np.unique(data_index)
    num_data = y.shape[0]
    non_index = np.array(list(set(range(num_data)) - set(unique_index)))
    if negative:
        index = np.concatenate((unique_index, non_index), axis=0)
    else:
        index = unique_index
    return index.astype(int)

class_range = range(0,50)
index = data_subset(y, class_range, negative=False)
labels = y[index[0:num_seq],:]
labels = labels[:,class_range]



nan

In [53]:
num_seq = 100000       # number of sequences
labels = np.random.binomial(1, 13.729/50, (num_seq,50))
np.sum(np.sum(labels, axis=1) == 0).astype(float)/len(labels)

0.0
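The zero fraction above is what one would expect analytically: with 50 independent Bernoulli labels, the probability that a row has no positive label is $(1-p)^{50}$. The short check below computes that value; the variable names are illustrative.

```python
# probability that a sequence receives no positive label at all
p = 13.729 / 50          # per-class probability used in the binomial draw
num_classes = 50

prob_all_zero = (1 - p) ** num_classes
print(prob_all_zero)
```

Multiplying this probability by the number of sequences gives the expected count of label-free rows, which stays well below one even for a few hundred thousand sequences.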

In [54]:
labels.shape

(100000, 50)

In [55]:
import os
import sys
import numpy as np
import h5py

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib as mpl
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True})
matplotlib.rcParams['figure.figsize'] = (100.0, 100.0)

%matplotlib inline

In [56]:
num_motif = labels.shape[1]   # number of motifs (one per label column)

# load motif list from file
motiflist = 'motif.pickle'
with open(motiflist, 'rb') as f:
    motif_set = cPickle.load(f)

# select num_motif random motifs from the complete list of motif_set
motifIndex = np.random.permutation(len(motif_set))[0:num_motif]

# build subset of motifs relevant for dataset
motifs = [motif_set[index] for index in motifIndex]


In [None]:

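# dataset parameters
seq_length = 600      # length of sequence
motif_width = 8       # width budgeted per motif when computing the spacing

seq = []
for label in labels:
    motif_index = np.where(label == 1)[0]     # motifs present in this sequence
    num_index = len(motif_index)
    buffer_size = seq_length - num_index*motif_width
    # integer spacing between motifs (array shapes must be integers)
    ave_spacing = int(buffer_size // (num_index + 1))

    # build the position frequency matrix: uniform background with motifs implanted
    sequence_pwm = np.ones((4, ave_spacing))/4
    for i in motif_index:
        sequence_pwm = np.hstack((sequence_pwm, motifs[i]))
        sequence_pwm = np.hstack((sequence_pwm, np.ones((4, ave_spacing))/4))
    # pad with uniform background up to seq_length
    sequence_pwm = np.hstack((sequence_pwm, np.ones((4, seq_length - sequence_pwm.shape[1]))/4))

    # sample one nucleotide per position (rows ordered A, C, G, U)
    cum_prob = sequence_pwm.cumsum(axis=0)
    Z = np.random.uniform(0, 1, seq_length)
    one_hot_seq = np.zeros((4, seq_length))
    for i in range(seq_length):
        # first row whose cumulative probability exceeds the draw; the min()
        # guards against columns that sum to slightly less than 1
        nt = min(np.sum(Z[i] >= cum_prob[:, i]), 3)
        one_hot_seq[nt, i] = 1
    seq.append(one_hot_seq)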
            
    

In [None]:
seq = np.array(seq)
seq.shape

In [None]:
# fraction of the data assigned to each subset
train_size = 0.7
cross_validation_size = 0.15
test_size = 0.15

split_size = [train_size, cross_validation_size, test_size]

# cumulative boundaries of the three subsets in the shuffled order
cum_index = np.cumsum(np.multiply([0] + split_size, num_seq)).astype(int)
shuffle = np.random.permutation(num_seq)
train_index = shuffle[cum_index[0]:cum_index[1]]
cross_validation_index = shuffle[cum_index[1]:cum_index[2]]
test_index = shuffle[cum_index[2]:cum_index[3]]


In [None]:

# create subsets of data based on indices 
train = (seq[train_index], labels[train_index])
cross_validation = (seq[cross_validation_index], labels[cross_validation_index])
test = (seq[test_index], labels[test_index])


In [None]:
import h5py

# save the three splits to a single HDF5 file; the context manager closes it
with h5py.File("synthetic_random_motifs_100000.hdf5", "w") as f:
    f.create_dataset("trainx", data=train[0])
    f.create_dataset("trainy", data=train[1])
    f.create_dataset("validx", data=cross_validation[0])
    f.create_dataset("validy", data=cross_validation[1])
    f.create_dataset("testx", data=test[0])
    f.create_dataset("testy", data=test[1])
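A quick way to confirm that the file was written as intended is to read it back and compare. The sketch below does this with small toy arrays and a temporary file; the file name `toy_splits.hdf5` and the array shapes are placeholders, not the real dataset.

```python
import os
import tempfile
import numpy as np
import h5py

# toy arrays standing in for one split (shapes are illustrative only)
x = np.zeros((10, 4, 600), dtype=np.float32)
y = np.random.binomial(1, 0.3, (10, 50))

path = os.path.join(tempfile.mkdtemp(), 'toy_splits.hdf5')

# write, using the same dataset naming scheme as above
with h5py.File(path, 'w') as f:
    f.create_dataset('trainx', data=x)
    f.create_dataset('trainy', data=y)

# read back into memory and compare with what was written
with h5py.File(path, 'r') as f:
    x_back = f['trainx'][:]
    y_back = f['trainy'][:]

print(x_back.shape, y_back.shape)
print(np.array_equal(x, x_back) and np.array_equal(y, y_back))
```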
