# Generating synthetic sequences with embedded regulatory grammars for RNA-binding proteins

### Motivation

RNA-binding proteins (RBPs) interact with RNA primarily through direct nucleotide contacts, either at sequence motifs or at secondary-structure motifs. Here, I focus on the former, using a deep convolutional neural network (CNN) to learn sequence motifs. However, assessing a model's performance on real data is difficult because the "ground truth" is usually not known.

To test the performance of CNNs, I simulate idealized data in which each sequence contains a combination of "known" binding motifs implanted at "known" locations, while the rest of the sequence is generated randomly. This idealized dataset can therefore gauge whether CNNs can recover the motifs and their regulatory grammars in the simplest scenario. It also provides a baseline for understanding the limitations of this approach, such as how many sequences are needed to recover regulatory grammars of different complexity.

Note that the regulatory grammars generated here are unrealistic: as far as I am aware, no database provides accurate annotations of such grammars. Only the binding motifs are derived from actual experimental data; the regulatory grammars are arbitrary, generated from a probabilistic framework with a user-defined level of complexity, i.e. how many motifs interact with each other.

### Simulation model

To generate regulatory grammars, we need to create a framework for the interactions of specific motifs across distinct spatial distances. 

First, we will assume there are only P proteins. From these P proteins, we generate G regulatory grammars, each of which is a sampled combination of the M motifs. We then generate arbitrary distances between adjacent motifs, sampled from an exponential distribution. Together, the motif combinations and their distances constitute the set of regulatory grammars.
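
As a sketch of this sampling scheme, the number of motifs in grammar $g$ and the spacers between motifs can be written as below. The symbols $k_g$, $d_{g,i}$, $\lambda$, $\beta$, $d_0$ and $k_{\max}$ are my own notation, chosen to mirror the `interaction_rate`, `distance_scale`, `distance_offset` and `max_motif` options used in the code further down; they are not taken from any reference.

$$
k_g = \min\left(\lceil X_g \rceil,\ k_{\max}\right), \qquad X_g \sim \mathrm{Exponential}(\text{mean} = \lambda)
$$

$$
d_{g,i} = \lceil Y_{g,i} \rceil + d_0, \qquad Y_{g,i} \sim \mathrm{Exponential}(\text{mean} = \beta), \qquad i = 1, \dots, k_g
$$

The ceiling guarantees that every grammar contains at least one motif and that every spacer is at least $d_0 + 1$ nucleotides long.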

We can also simulate negative examples by using different motifs with the same distances, the same motifs with different distances, or incomplete grammars. First, we'll assume a perfect dataset and see how the model performs. Then, we will systematically increase the complexity to see exactly when the model fails.
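
The three kinds of negative examples mentioned above could be derived from a positive grammar roughly as follows. This is only a sketch under my own assumptions: `make_negative_grammars` is a hypothetical helper that is not part of the simulation code below, a grammar is taken to be an array of motif indices plus an array of spacer lengths, and at least two motifs are assumed to be available.

```python
import numpy as np

def make_negative_grammars(gram, dist, num_motif, rng):
    """Return three corrupted variants of one grammar (hypothetical helper).

    gram: 1-D int array of motif indices in [0, num_motif)
    dist: 1-D int array of spacer lengths, same length as gram
    Assumes num_motif >= 2 and len(gram) >= 1.
    """
    gram = np.asarray(gram)
    dist = np.asarray(dist)

    # 1) different motifs, same distances: shift every index by a random
    #    non-zero amount (mod num_motif) so each position changes identity
    shift = rng.randint(1, num_motif, size=len(gram))
    swapped = ((gram + shift) % num_motif, dist.copy())

    # 2) same motifs, different distances: resample spacers with the same
    #    mean until at least one of them differs
    new_dist = dist.copy()
    while np.array_equal(new_dist, dist):
        draw = rng.exponential(scale=dist.mean(), size=len(dist))
        new_dist = np.ceil(draw).astype(int)
    respaced = (gram.copy(), new_dist)

    # 3) incomplete grammar: drop one motif (and its spacer) at random
    keep = np.arange(len(gram)) != rng.randint(len(gram))
    incomplete = (gram[keep], dist[keep])

    return swapped, respaced, incomplete

# example with made-up indices and spacers
rng = np.random.RandomState(0)
swapped, respaced, incomplete = make_negative_grammars(
    np.array([2, 5, 7]), np.array([10, 12, 8]), num_motif=10, rng=rng)
```

Each variant keeps the `(motif indices, spacers)` layout, so it could in principle be turned into a sequence model the same way as a positive grammar.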


#### Create a motif database for Drosophila melanogaster

The data comes from Ray et al. "A compendium of RNA-binding motifs for decoding gene regulation" (http://www.nature.com/nature/journal/v499/n7457/abs/nature12311.html). The link to the motifs I downloaded is here: 

\$ wget http://hugheslab.ccbr.utoronto.ca/supplementary-data/RNAcompete_eukarya/top10align_motifs.tar.gz

\$ tar xzvf top10align_motifs.tar.gz

Each file contains a different RBP motif as a position frequency matrix. So, the first step is to compile all of these files into a suitable database. In particular, we parse the motif (position frequency matrix) from each file in motifpath (the downloaded top10align_motifs folder), collect them into a database (a list of arrays), and save it in binary format (motif.pickle):

In [81]:
import os
import pandas as pd
from six.moves import cPickle

motifpath = 'top10align_motifs/'   # directory where motif files are located
motiflist = 'motif.pickle'         # output filename

# get all motif files in motifpath directory (sorted for a reproducible order)
listdir = sorted(os.listdir(motifpath))

# parse motifs: drop the first column and transpose so that each motif is a
# 4 x width array (rows are assumed to be in A, C, G, U order)
motif_set = []
for filename in listdir:
    df = pd.read_table(os.path.join(motifpath, filename))
    motif_set.append(df.iloc[:, 1:].values.T)

# save motifs
with open(motiflist, 'wb') as f:
    cPickle.dump(motif_set, f, protocol=cPickle.HIGHEST_PROTOCOL)

Let's simulate some sequences according to the specifications below. Note that there are 244 motifs here, so for simplicity we will downsample them to create datasets with different levels of complexity.

$N$: total number of sequences 

$M$: total number of motifs ($M \leq 244$) 

$S$: size of each sequence 

$w$: population fraction of each regulatory grammar 
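
One way to turn the population fractions $w$ into integer sequence counts that sum to exactly $N$ is a multinomial draw. This is a sketch of an alternative technique, not what the simulation code in this notebook does (that code rounds the scaled fractions, so the total can differ slightly from $N$). `counts_from_fractions` is a hypothetical name.

```python
import numpy as np

def counts_from_fractions(w, num_seq, rng):
    """Draw per-grammar sequence counts that sum exactly to num_seq."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # normalize to proper fractions
    return rng.multinomial(num_seq, w)   # integer counts, one per grammar

# example with made-up fractions
rng = np.random.RandomState(0)
counts = counts_from_fractions([0.2, 0.5, 0.3], 1000, rng)
```

The counts fluctuate around `w * num_seq`, which may or may not be desirable depending on whether exact proportions or an exact total matters more.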

To make the sequences as clean as possible, we use a uniform distribution for the position weight matrix (PWM) columns of the 'non-motif' positions. We first set up a hierarchical interaction of the motifs, a so-called "regulatory grammar".

So, first, we sample a regulatory grammar. This tells us which motifs are present and how far apart they are from each other. Then, we create a position frequency matrix for this regulatory grammar and simulate a number of sequences determined by the population fraction $w$.
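
To make the idea of a sequence-level frequency matrix concrete, here is a toy sketch that places a single motif into a uniform background. The helper `embed_motif` and the 3-nucleotide motif are made up for illustration; only the A, C, G, U row order follows the code in this notebook.

```python
import numpy as np

def embed_motif(motif, seq_length, start):
    """Place one 4 x width motif into a uniform 4 x seq_length background."""
    motif = np.asarray(motif, dtype=float)
    width = motif.shape[1]
    if start < 0 or start + width > seq_length:
        raise ValueError("motif does not fit in the sequence")
    pwm = np.ones((4, seq_length)) / 4.0   # uniform 'non-motif' positions
    pwm[:, start:start + width] = motif    # overwrite with the motif columns
    return pwm

# toy motif preferring "UGU" (rows are A, C, G, U); the numbers are made up
toy_motif = np.array([[0.05, 0.05, 0.05],
                      [0.05, 0.05, 0.05],
                      [0.05, 0.85, 0.05],
                      [0.85, 0.05, 0.85]])
pwm = embed_motif(toy_motif, seq_length=12, start=4)
```

Every column of `pwm` is a probability distribution over the four nucleotides, so a sequence can be simulated by sampling each position independently from its column.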

Once the data have been simulated, we shuffle them and split them into a training set, a cross-validation set, and a test set, converting the sequences into a one-hot representation along the way.
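
As a small, self-contained illustration of these two steps (the function names `one_hot` and `split_indices` are mine and are separate from the functions defined in the cell below), one-hot encoding and a shuffled split might look like this:

```python
import numpy as np

def one_hot(seq, alphabet='ACGU'):
    """Return a 4 x len(seq) one-hot array; unknown characters stay all-zero."""
    codes = np.array([alphabet.find(c) for c in seq], dtype=int)
    out = np.zeros((len(alphabet), len(seq)))
    known = codes >= 0
    out[codes[known], np.nonzero(known)[0]] = 1
    return out

def split_indices(n, fractions, rng):
    """Shuffle range(n) and cut it into consecutive chunks by fraction."""
    order = rng.permutation(n)
    # cumulative cut points, rounded to avoid floating-point truncation
    cuts = np.round(np.cumsum(fractions)[:-1] * n).astype(int)
    return np.split(order, cuts)

rng = np.random.RandomState(0)
train_idx, valid_idx, test_idx = split_indices(20, [0.7, 0.15, 0.15], rng)
encoded = one_hot('ACGU')
```

Shuffling before splitting matters here because the simulated sequences are generated grammar by grammar, so the unshuffled data are sorted by label.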

In [165]:
from six.moves import cPickle
import numpy as np

def generate_grammar_model(options):
    """generate a regulatory grammar model: various numbers of motifs with 
    distinct separations."""

    # input options
    motif_set = options[0]   # set of all motifs
    num_motif = options[1]           # number of motifs for data set (to sample from motif_set)
    num_grammar = options[2]           # number of regulatory grammars (combinations of motifs)
    interaction_rate = options[3]  # mean (exponential scale) of the number of motifs per grammar
    distance_scale = options[4]    # mean (exponential scale) of the distance between motifs
    distance_offset = options[5]   # offset addition between motif distances
    max_motif = options[6]         # maximum number of motifs in a grammar

    # select M random motifs from the complete list of motif_set
    motifIndex = np.random.permutation(len(motif_set))[0:num_motif]

    # build subset of motifs relevant for dataset
    motifs = []
    for index in motifIndex:
        motifs.append(motif_set[index])
    
    # generate G regulatory grammars (combinations of motifs + distance between motifs)
    Z = np.ceil(np.random.exponential(scale=interaction_rate, size=num_grammar)).astype(int)
    num_interactions = np.minimum(Z, max_motif)
    grammar = []
    distance = []
    for num in num_interactions:
        # indices refer to positions in the local `motifs` list (0..num_motif-1),
        # so that `grammar` and `motifs` stay consistent downstream
        index = np.random.randint(num_motif, size=num)
        grammar.append(index)
        separation = np.ceil(np.random.exponential(scale=distance_scale, size=num)).astype(int) + distance_offset
        distance.append(separation)

    return [motifs, grammar, distance]


def generate_sequence_pwm(seq_length, gram, dist, motifs):
    """generates the position weight matrix (pwm) for a given regulatory grammar 
    with a string length S"""
    
    # total width of the grammar: each motif plus the spacer that follows it
    motif_width = sum(np.shape(motifs[m])[1] for m in gram)
    grammar_width = motif_width + int(np.sum(dist))

    # random integer start position, leaving a 20-nt margin when there is room
    max_offset = max(seq_length - grammar_width - 20, 1)
    offset = np.random.randint(1, max_offset + 1)

    # build position weight matrix, looking up each motif by its grammar index
    sequence_pwm = np.ones((4, offset))/4.0
    for i in range(len(gram)):
        sequence_pwm = np.hstack((sequence_pwm, motifs[gram[i]]))
        if i < len(dist):
            sequence_pwm = np.hstack((sequence_pwm, np.ones((4, dist[i]))/4.0))

    # fill in the rest of the sequence with uniform distribution to have length
    # seq_length; a grammar that overshoots seq_length is truncated instead
    pad = seq_length - sequence_pwm.shape[1]
    if pad > 0:
        sequence_pwm = np.hstack((sequence_pwm, np.ones((4, pad))/4.0))
    
    return sequence_pwm[:, :seq_length]


def generate_sequence_model(seq_length, model):
    """generate the sequence models (PWMs) for each regulatory grammars. """
    
    motifs = model[0]      # set of motifs
    grammar = model[1]     # set of combinations of motifs
    distance = model[2]    # set of distances between motifs

    # build a PWM for each regulatory grammar
    seq_model = []
    for j in range(len(grammar)):
        seq_model.append(generate_sequence_pwm(seq_length, grammar[j], distance[j], motifs)) 
        
    return seq_model


def simulate_sequence(sequence_pwm):
    """simulate a sequence given a sequence model"""
    
    nucleotide = 'ACGU'

    # sequence length
    seq_length = sequence_pwm.shape[1]

    # generate uniform random number for each nucleotide in sequence
    Z = np.random.uniform(0,1,seq_length)
    
    # calculate cumulative sum of the probabilities
    cum_prob = sequence_pwm.cumsum(axis=0)

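    # go through sequence and find the bin where the random number falls in the
    # cumulative probabilities for each nucleotide
    sequence = ''
    for i in range(seq_length):
        # first row whose cumulative probability exceeds Z[i]; the min() guards
        # against columns that sum to slightly less than 1
        index = min(np.searchsorted(cum_prob[:, i], Z[i], side='right'), 3)
        sequence += nucleotide[index]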

    return sequence


def simulate_data(seq_model, num_seq):
    """simulates N sequences with random population fractions for each sequence 
    model (PWM) of each regulatory grammar """

    # simulate random population fractions and scale to N sequences
    w = np.random.uniform(0, 1, size=len(seq_model))
    w = np.round(w/sum(w)*num_seq)
    popFrac = w.astype(int)
    
    # create a popFrac weighted number of simulation for each regulatory grammar
    label = []
    data = []
    for i in range(len(popFrac)):
        for j in range(popFrac[i]):
            sequence = simulate_sequence(seq_model[i])
            data.append(sequence)
            label.append(i)
            
    return data, label


def convert_one_hot(seq):
    """convert a sequence into a 1-hot representation"""
    
    nucleotide = 'ACGU'
    N = len(seq)
    one_hot_seq = np.zeros((4,N))
    for i in range(N):
        index = [j for j in range(4) if seq[i] == nucleotide[j]]
        one_hot_seq[index,i] = 1
        
    return one_hot_seq


def subset_data(data, label, sub_index):
    """returns a subset of the data and labels based on sub_index"""
    
    # labels run from 0 upward; use the maximum so that a grammar with no
    # simulated sequences does not shrink the one-hot width
    num_labels = int(np.max(label)) + 1
    num_sub = len(sub_index)
    
    sub_set = np.zeros((num_sub, 4, len(data[0])))
    sub_set_label = np.zeros((num_sub, num_labels))
    
    for k, index in enumerate(sub_index):
        sub_set[k] = convert_one_hot(data[index])
        sub_set_label[k, label[index]] = 1

    sub_set_label = sub_set_label.astype(np.uint8)
    
    return (sub_set, sub_set_label)


def split_data(data, label, split_size):
    """split data into train set, cross-validation set, and test set"""
    
    # number of labels
    num_labels = len(np.unique(label))

    # determine indices of each dataset
    N = len(data)
    cum_index = np.cumsum(np.multiply([0, split_size[0], split_size[1], split_size[2]],N)).astype(int) 

    # shuffle data
    shuffle = np.random.permutation(N)

    # training dataset
    train_index = shuffle[range(cum_index[0], cum_index[1])]
    cross_validation_index = shuffle[range(cum_index[1], cum_index[2])]
    test_index = shuffle[range(cum_index[2], cum_index[3])]

    # create subsets of data based on indices 
    train = subset_data(data, label, train_index)
    cross_validation = subset_data(data, label, cross_validation_index)
    test = subset_data(data, label, test_index)
    
    return train, cross_validation, test


def main():

    # dataset parameters
    num_seq = 100000       # number of sequences
    seq_length = 200      # length of sequence
    num_motif = 10        # number of motifs
    num_grammar = 20      # number of regulatory grammars
    filename =  'N=' + str(num_seq) + \
                '_S=' + str(seq_length) + \
                '_M=' + str(num_motif) + \
                '_G=' + str(num_grammar) # output filename
                
    # motif interaction parameters (grammars)
    interaction_rate = 1.5       # mean (exponential scale) of the number of motifs per grammar
    distance_scale = seq_length/15        # mean (exponential scale) of the distance between motifs
    offset = 5                   # offset addition between motif distances
    maxMotif = 5                 # maximum number of motifs in a grammar

    # percentage for each dataset
    train_size = 0.7
    cross_validation_size = 0.15
    test_size = 0.15

    # load motif list from file
    motiflist = 'motif.pickle'
    with open(motiflist, 'rb') as f:
        motif_set = cPickle.load(f)

    # generate regulatory grammar model
    print("Generating motif grammars")
    options = [motif_set, num_motif, num_grammar, 
            interaction_rate, distance_scale, offset, maxMotif]
    model = generate_grammar_model(options)

    # convert this to a sequence position weight matrix for each model
    seq_model = generate_sequence_model(seq_length, model)

    # simulate N sequences based on the position weight matrices
    print("Generating synthetic data")
    data, label = simulate_data(seq_model, num_seq)

    # get indices for each dataset
    print("Splitting dataset into train, cross-validation, and test")
    split_size = [train_size, cross_validation_size, test_size]
    train, cross_validation, test = split_data(data, label, split_size)

    # save the train, cross-validation, and test sets in one-hot representation
    print("Saving dataset")
    with open(filename+'_data.pickle', 'wb') as f:
        cPickle.dump(train, f, protocol=cPickle.HIGHEST_PROTOCOL)
        cPickle.dump(cross_validation, f, protocol=cPickle.HIGHEST_PROTOCOL)
        cPickle.dump(test, f, protocol=cPickle.HIGHEST_PROTOCOL)

    # save the options, grammar model, and sequence models
    print("Saving model")
    with open(filename+'_model.pickle', 'wb') as f:
        cPickle.dump(options, f, protocol=cPickle.HIGHEST_PROTOCOL)
        cPickle.dump(model, f, protocol=cPickle.HIGHEST_PROTOCOL)
        cPickle.dump(seq_model, f, protocol=cPickle.HIGHEST_PROTOCOL)

if __name__ == "__main__":
   main()

Generating motif grammars
Generating synthetic data
Splitting dataset into train, cross-validation, and test
Saving dataset
Saving model


In [None]:


import numpy as np

def munge_data(data):
    """reorganize data for TensorFlow:
    (num_data, 4, seq_length) -> (num_data, seq_length, 4, 1)"""
    data = np.transpose(data, (0, 2, 1))
    return np.reshape(data, np.shape(data) + (1,))
    


Now, we are ready to build and test our deep learning model.

In [91]:
from six.moves import cPickle
import numpy as np

# load training data
# NOTE: main() above writes 'N=<N>_S=<S>_M=<M>_G=<G>_data.pickle'; point this
# at whichever file was actually generated
filename = 'train_data_10000_200_10_20.pickle'
with open(filename, 'rb') as f:
    train = cPickle.load(f)
    cross_validation = cPickle.load(f)
    test = cPickle.load(f)

# separate data
train_set, train_set_label = train
cross_validation_set, cross_validation_set_label = cross_validation
test_set, test_set_label = test

# munge data for deep learning
train_set = munge_data(train_set)
cross_validation_set = munge_data(cross_validation_set)
test_set = munge_data(test_set)

    

In [10]:
import os

outdir = 'data'
if not os.path.isdir(outdir):
    os.mkdir(outdir)
    print("making directory: " + outdir)

making directory: data


In [6]:
ls

Implanting_sequence_motifs.ipynb  process_dataset.py  tensorflow_train.py
keras_train.py                    scrape_motifs.py    top10align_motifs/
motif.pickle                      simulate_data.py


In [2]:
from six.moves import cPickle

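# load the saved model from file
# NOTE: main() above writes 'N=<N>_S=<S>_M=<M>_G=<G>_model.pickle'; point this
# at whichever file was actually generated
filename = 'model_10000_200_10_20.pickle'
with open(filename, 'rb') as f:
    # objects must be loaded in the same order they were dumped in main()
    options = cPickle.load(f)
    model = cPickle.load(f)
    seq_model = cPickle.load(f)

# options = [motif_set, M, G, interaction_rate, distance_scale, offset, maxMotif]
# model = [motifs, grammar, distance]
# seq_model = list of sequence PWMs, one per regulatory grammar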
