## Goal

- train the example RBP model
  - nothing fancy
  - no concise dependencies 
  - just single-task for simplicity - PUM2
- export the model architecture to JSON and the weights to HDF5
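
The export goal could be met with something like the sketch below. It assumes the trained object is a Keras model, so that `to_json()` and `save_weights()` are available; the helper name `export_model` and the file paths are placeholders, not part of concise or Keras.

```python
def export_model(model, json_path, weights_path):
    """Write the architecture to JSON and the weights to a separate file.

    Assumption: `model` is a Keras model exposing `to_json()` and
    `save_weights()`. Both paths are placeholders chosen by the caller.
    """
    # architecture only (layers and their configuration), no weights
    with open(json_path, "w") as f:
        f.write(model.to_json())
    # weights only; use an .h5 / .hdf5 suffix so Keras writes HDF5
    model.save_weights(weights_path)
```

With the model trained later in this notebook, the call would be along the lines of `export_model(m, "model.json", "weights.h5")`.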

## TODO 

- [ ] check how the pre-processing should work
  - Raw BED + FASTA + GTF -> array
    - sequence as 1-hot
    - distances
- [ ]

## Structure

Input files:

- FASTA
- BED
- GTF annotation

Input features:

- 1-hot-encoded sequence
- annotation features & distances
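
For reference, a minimal NumPy sketch of the one-hot encoding step. It is an illustration only and not a replacement for concise's `encodeDNA`, whose handling of padding and ambiguous bases may differ. The function name `one_hot_encode` and the `ACGT` channel order are assumptions.

```python
import numpy as np

VOCAB = "ACGT"  # assumed channel order


def one_hot_encode(seqs, vocab=VOCAB):
    """Encode equal-length sequences as an array of shape (n, length, len(vocab)).

    RNA input is accepted by mapping U to T. Characters outside `vocab`
    (e.g. N) are left as all-zero rows.
    """
    seqs = [s.upper().replace("U", "T") for s in seqs]
    if not seqs:
        return np.zeros((0, 0, len(vocab)), dtype=np.float32)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    index = {base: i for i, base in enumerate(vocab)}
    out = np.zeros((len(seqs), length, len(vocab)), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for j, base in enumerate(seq):
            k = index.get(base)
            if k is not None:  # unknown characters stay all-zero
                out[i, j, k] = 1.0
    return out
```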

I will start with the batch preprocessor (loading the whole dataset at once) and then add a generator.
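
A generator version might look like the sketch below: it reads the CSV in chunks with `pandas.read_csv(..., chunksize=...)` instead of loading everything at once, and applies a user-supplied pre-processing function to each chunk. The names `csv_batches` and `preprocess` are hypothetical; the endless loop follows the convention of Keras-style training generators.

```python
import pandas as pd


def csv_batches(path, preprocess, batch_size=32):
    """Yield `preprocess(chunk)` forever, reading `batch_size` rows at a time.

    `preprocess` is any callable that maps a DataFrame chunk to whatever
    the training loop expects, e.g. an (inputs, targets) pair.
    """
    while True:  # restart from the top of the file after each pass
        produced = False
        for chunk in pd.read_csv(path, chunksize=batch_size):
            produced = True
            yield preprocess(chunk)
        if not produced:
            # guard against looping forever when no chunk was produced
            raise ValueError("no rows found in {}".format(path))
```

The last batch of each pass can be smaller than `batch_size`, and rows are not shuffled; both would need handling before serious use.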

## Open questions

- What is the difference between genomelake and genomedatalayer?
- I really like genomelake - it should become the standard for the pre-processors
  - should be simple enough to use and understand
  - add a nearest feature-point extractor
- test_files should be raw files, not already pre-processed folders

## Pre-processor steps

- Given a range, extract the sequence from the FASTA file
  - Validate the width (or just extract the center)
  - Compute the center range for it
  - pybedtools
    - [...] check the kundaje-lab code for this task
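
The range handling above can be sketched in plain Python. This is only an illustration under stated assumptions: coordinates are 0-based and half-open as in BED, and the genome is an in-memory dict mapping chromosome names to strings, standing in for a real FASTA reader. Both function names are hypothetical.

```python
def resize_to_center(start, end, width):
    """Return a (start, end) interval of `width` bp centered on the input.

    Coordinates are 0-based and half-open, as in BED files.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    center = (start + end) // 2
    new_start = center - width // 2
    return new_start, new_start + width


def extract_sequence(genome, chrom, start, end):
    """Slice `genome[chrom]`, failing loudly if the interval leaves the chromosome.

    `genome` is assumed to be a dict of chromosome name -> sequence string.
    """
    seq = genome[chrom]
    if start < 0 or end > len(seq):
        raise ValueError("interval {}:{}-{} is out of bounds".format(chrom, start, end))
    return seq[start:end]
```

Raising on out-of-bounds intervals, rather than padding, keeps every extracted sequence at exactly the requested width.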

- Load the GTF file and compute the distances to the nearest features
  - [ ] look into the options for doing this in Python
  - see 4_append_other_positions.R
  - just use pandas.DataFrame for it
    - Idea: include it in concise as a pre-processor
    
- Why are they not using HTSeq?    
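
One possible pandas/NumPy take on the nearest-feature distance step, as a sketch only. It assumes the GTF features have already been reduced to a table of single positions, that both tables carry `chrom` and `pos` columns (hypothetical names), and that the query index is unique. Strand is ignored here, which a real pre-processor would have to handle.

```python
import numpy as np
import pandas as pd


def nearest_feature_distance(queries, features):
    """Signed distance (feature - query) to the closest feature on the same chromosome.

    Returns a float Series aligned with `queries`; NaN where the
    chromosome has no feature at all.
    """
    dist = pd.Series(np.nan, index=queries.index, dtype=float)
    feat_by_chrom = {c: np.sort(g["pos"].values) for c, g in features.groupby("chrom")}
    for chrom, grp in queries.groupby("chrom"):
        feat = feat_by_chrom.get(chrom)
        if feat is None:
            continue
        q = grp["pos"].values
        right = np.searchsorted(feat, q)             # first feature >= query
        left = np.clip(right - 1, 0, len(feat) - 1)  # last feature < query
        right = np.clip(right, 0, len(feat) - 1)
        d_left, d_right = feat[left] - q, feat[right] - q
        # pick the closer neighbour; ties go to the upstream (left) feature
        dist.loc[grp.index] = np.where(np.abs(d_left) <= np.abs(d_right), d_left, d_right)
    return dist
```

The binary search keeps the cost at roughly O((n + m) log m) per chromosome instead of comparing every query against every feature.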
  

In [None]:
import pandas as pd

from concise.preprocessing import encodeDNA, EncodeSplines

In [None]:
import concise.layers as cl
import keras.layers as kl
import concise.initializers as ci
import concise.regularizers as cr
from keras.optimizers import Adam
from keras.models import Model
from keras.callbacks import EarlyStopping

In [None]:
# TODO - path = ~/projects-work/concise/data/

In [None]:
def load(split="train", st=None):
    dt = pd.read_csv("../data/RBP/PUM2_{0}.csv".format(split))
    # DNA/RNA sequence
    xseq = encodeDNA(dt.seq) 
    # distance to the poly-A site
    # (.values replaces .as_matrix(), which was removed in pandas 1.0)
    xpolya = dt.polya_distance.values.reshape((-1, 1))
    # response variable
    y = dt.binding_site.values.reshape((-1, 1)).astype("float")
    return {"seq": xseq, "dist_polya_raw": xpolya}, y

def data():
    
    train, valid, test = load("train"), load("valid"), load("test")
    
    # transform the poly-A distance with B-splines
    es = EncodeSplines()
    es.fit(train[0]["dist_polya_raw"])
    train[0]["dist_polya_st"] = es.transform(train[0]["dist_polya_raw"])
    valid[0]["dist_polya_st"] = es.transform(valid[0]["dist_polya_raw"])
    test[0]["dist_polya_st"] = es.transform(test[0]["dist_polya_raw"])
    
    #return load("train"), load("valid"), load("test")
    return train, valid, test

train, valid, test = data()

In [None]:
def model(train, filters=1, kernel_size=9, pwm_list=None, lr=0.001, use_splinew=True, ext_dist=False):
    seq_length = train[0]["seq"].shape[1]
    if pwm_list is None:
        kinit = "glorot_uniform"
        binit = "zeros"
    else:
        kinit = ci.PSSMKernelInitializer(pwm_list, add_noise_before_Pwm2Pssm=True)
        binit = "zeros"
        
    # sequence
    in_dna = cl.InputDNA(seq_length=seq_length, name="seq")
    inputs = [in_dna]
    x = cl.ConvDNA(filters=filters, 
                   kernel_size=kernel_size, 
                   activation="relu",
                   kernel_initializer=kinit,
                   bias_initializer=binit,
                   name="conv1")(in_dna)
    if use_splinew:
        x = cl.SplineWeight1D(n_bases=10, l2_smooth=0, l2=0, name="spline_weight")(x)
        x = kl.GlobalAveragePooling1D()(x)
    else:
        x = kl.AveragePooling1D(pool_size=4)(x)
        x = kl.Flatten()(x)
        
    if ext_dist:    
        # distance
        in_dist = kl.Input(train[0]["dist_polya_st"].shape[1:], name="dist_polya_st")
        x_dist = cl.SplineT()(in_dist)
        x = kl.concatenate([x, x_dist])
        inputs += [in_dist]
    
    # sigmoid output so that binary_crossentropy receives probabilities
    x = kl.Dense(units=1, activation="sigmoid")(x)
    m = Model(inputs, x)
    m.compile(Adam(lr=lr), loss="binary_crossentropy", metrics=["acc"])
    return m

In [None]:
m = model(train, filters=10, use_splinew=False, ext_dist=True)

In [None]:
m.fit(train[0], train[1], epochs=50, validation_data=valid, 
     callbacks=[EarlyStopping(patience=5)])