Skip to content

kundajelab/mpra

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MPRA

Please see updated repository for this project at https://github.com/kundajelab/MPRA-DragoNN/.

This project applies convolutional neural networks to predict output from massively parallel reporter assays (MPRAs), with the aim of systematically decoding regulatory sequence patterns and identifying disease-causing noncodign variants.

Most of the code involved with training and interpreting the models can be found in Jupyter notebooks (compiled with Python 2.7) in deeplearn/scripts/. The final model was trained in Keras 1.2.2 and can be loaded with the load_model or model_from_json methods using the weights (deeplearn/model_files/sharpr_znormed_jul23/record_13_model_bgGhy_modelWeights.h5) and/or architecture files (deeplearn/model_files/sharpr_znormed_jul23/record_13_model_bgGhy_modelJson.json).

Dependencies:

  • numpy, scipy, pandas, seaborn, etc.
  • theano (0.9.0)
  • keras (1.2.2)
  • deeplift (0.5.2)

Here are descriptions of the relevant Jupyter notebooks in deeplearn/scripts/:

  • Sharpr Model Interpretation.ipynb: Scatter plots for replicates (Figure 1C), prediction performances (Figure 2A), performances for specific chromatin states (Figure 2B), exploration of regulatory motifs and grammars learned by the model.
  • GBM Performance Benchmarking.ipynb: Training and performance testing of gradient boosting tree models (Figure S2).
  • Sharpr DeepLIFT Scoring Validation: CENTIPEDE TFBS validation (Figure 2C-D), DeepLIFT motif scores, comparison to Sharpr (Fig. 3).
  • Regulatory Grammar Discovery with Sharpr Models.ipynb: Exploration of predictive TF motif PWMs by comparing to DeepLIFT score profiles (Fig. 4).
  • Variant Scoring.ipynb: Evaluation of SNPpet ISM scores for variant prioritization (Figures 5 and 6, Supp. Figures).
  • ggplot2 visualizations.ipynb: R notebook to generate several of the manuscript's figures.

Some code written to process the Sharpr-MPRA and SuRE-seq datasets is available in the folders "sharpr" and "sureseq" respectively. These scripts process the data and produce structured input/output matrices that are used to train the deep learning models.

Feel free to contact Rajiv Movva (rmovva at mit dot edu) with any questions.

Releases

No releases published

Packages

No packages published

Languages