Stochastic Sequence Propagation - A Keras Model for optimizing DNA, RNA and protein sequences based on a predictor.
A Python API for constructing generative DNA/RNA/protein Sequence PWM models in Keras. Implements a PWM generator (with support for discrete sampling and ST gradient estimation), a predictor model wrapper and a loss model.
- Implements a Sequence PWM Generator as a Keras Model, outputting PWMs, Logits, or random discrete samples from the PWM. These representations can be fed into any downstream Keras model for reinforcement learning.
- Implements a Predictor Keras Model wrapper, allowing easy loading of pre-trained sequence models and connecting them to the upstream PWM generator.
- Implements a Loss model with various useful cost and objectives, including regularizing PWM losses (e.g., soft sequence constraints, PWM entropy costs, etc.)
- Includes visualization code for plotting PWMs and cost functions during optimization (as Keras Callbacks).
SeqProp can be installed by cloning or forking the github repository:
git clone https://github.com/johli/seqprop.git cd seqprop python setup.py install
SeqProp requires the following packages to be installed
- Tensorflow >= 1.13.1
- Keras >= 2.2.4
- Scipy >= 1.2.1
- Numpy >= 1.16.2
- Isolearn >= 0.2.0 (github)
SeqProp provides API calls for building PWM generators and downstream sequence predictors as Keras Models.
A simple generator pipeline for some (imaginary) predictor can be built as follows:
import keras from keras.models import Sequential, Model, load_model import isolearn.keras as iso import numpy as np from seqprop.visualization import * from seqprop.generator import * from seqprop.predictor import * from seqprop.optimizer import * from my.project import load_my_predictor #Function that loads your predictor #Define Loss Function (Fit predicted output to some target) #Also enforce low PWM entropy target = np.zeros((1, 1)) target[0, 0] = 5.6 (Arbitrary target) pwm_entropy_mse = get_target_entropy_sme(pwm_start=0, pwm_end=100, target_bits=1.8) def loss_func(predictor_outputs) : pwm_logits, pwm, sampled_pwm, predicted_out = predictor_outputs #Create target constant target_out = K.tile(K.constant(target), (K.shape(sampled_pwm), 1)) target_cost = (target_out - predicted_out)**2 pwm_cost = pwm_entropy_mse(pwm) return K.mean(target_cost + pwm_cost, axis=-1) #Build Generator Network _, seqprop_generator = build_generator(seq_length=100, n_sequences=1, batch_normalize_pwm=True) #Build Predictor Network and hook it on the generator PWM output tensor _, seqprop_predictor = build_predictor(seqprop_generator, load_my_predictor(), n_sequences=1, eval_mode='pwm') #Build Loss Model (In: Generator seed, Out: Loss function) _, loss_model = build_loss_model(seqprop_predictor, loss_func) #Specify Optimizer to use opt = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999) #Compile Loss Model (Minimize self) loss_model.compile(loss=lambda true, pred: pred, optimizer=opt) #Fit Loss Model loss_model.fit(, np.ones((1, 1)), epochs=1, steps_per_epoch=1000) #Retrieve optimized PWMs and predicted (optimized) target _, optimized_pwm, _, predicted_out = seqprop_predictor.predict(x=None, steps=1)
These examples show how to set up the sequence optimization model, hook it to a predictor, and define various loss models. The examples build on different DNA, RNA and protein design tasks using a wide selection of fitness predictors: APARENT (Bogard et. al., 2019), Optimus 5' (Sample et. al., 2019), DragoNN (Kundaje Lab), MPRA-DragoNN (Movva et. al., 2019), DeepSEA (Zhou et. al., 2015) and trRosetta (Yang et. al., 2020).
Alternative Polyadenylation (APARENT)
Notebook 1a: Generate Target Isoforms (Predict on PWM)
Notebook 1b: Generate Target Isoforms (Predict on Sampled One-hots)
Notebook 2: Generate Target 3' Cleavage (Predict on Sampled One-hots)
Notebook 3a: Evaluate Logit-Normalization
Notebook 3b: Evaluate Logit-Normalization (Different Gradient Estimators)
Notebook 3c: Evaluate Logit-Normalization (Gumbel Sampler)
Notebook 3d: Evaluate Logit-Normalization (Explicit Entropy Penalty)
Notebook 3e: Evaluate Logit-Normalization (Optimizer Settings)
Translational Efficiency (Optimus 5')
CTCF TF Binding (DeepSEA, Dnd41)
Transcriptional Activity (MPRA-DragoNN, SV40, Mean Activity)
SPI1 TF Binding (DragoNN)
Notebook 1a: Evaluate Logit-Normalization
Notebook 1b: Evaluate Logit-Normalization (Different Gradient Estimator)
Notebook 1c: Evaluate Logit-Normalization (Gumbel Sampler)
Notebook 1d: Evaluate Logit-Normalization (Vs. Simulated Annealing)