PRAM manuscript key results and scripts for reproducibility

Introduction

This repository contains the key results reported in PRAM's manuscript and the R scripts to reproduce them on your local machine. We provide results for the 'noise-free' benchmark, the human master set transcript models, and the mouse hematopoietic transcript models. The sections below describe each of them in detail. To reproduce these results, we recommend running all the R scripts on Linux, where we have tested their reproducibility. Please also make sure to set up the dependent files first, before running any other R scripts.

To obtain this repository, use the following command:

git clone https://github.com/pliu55/pram_paper

It will create a directory pram_paper/ that contains the following folders and files:

  • 0_setup/
    • run.R: script to setup dependent software and files
  • 1_benchmark/
    • reported/: results for 'noise-free' benchmark
    • run.R: script for reproducing the results
  • 2_human/
    • reported/: results for human master set transcript models
    • prepareEncodeBam.R and run.R: scripts for reproducing the results
  • 3_mouse/
    • reported/: results for mouse hematopoietic system

Setup dependent files

To reproduce PRAM's results, we need to prepare required software and genomic files first with the following commands:

cd 0_setup/
./run.R

The script run.R will download and install:

  • the latest PRAM package
  • transcript-building software:
    • Cufflinks
    • StringTie
    • TACO
  • human gene annotation from GENCODE v24
  • human genome version hg38

This script requires ~9 GB of hard drive space and takes ~10 minutes using a single 2.1 GHz CPU. All the dependent software and files will be saved in 0_setup/output/.
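Because the scripts in this repository need between ~9 GB and ~500 GB of disk space, it may help to check the free space before starting. The sketch below is not part of the repository: has_free_gb is a hypothetical helper that relies only on the POSIX df and awk utilities.

```shell
# Hypothetical helper (not part of run.R): succeeds when the file system
# that holds DIR has at least NEED_GB gigabytes available.
has_free_gb() {
    dir=$1
    need_gb=$2
    # df -P prints one POSIX-format line per file system; -k reports
    # sizes in 1024-byte blocks. Column 4 is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example, run from the top of the cloned repository:
#   has_free_gb 0_setup 9 || echo "less than 9 GB free for 0_setup/"
```

The same check can be repeated with the space requirement stated for each of the later scripts.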

'Noise-free' benchmark

Key results

Results for the 'noise-free' benchmark test are in the folder 1_benchmark/reported/, with their descriptions listed in the table below:

| File name | Description |
| --- | --- |
| target_transcript_ids.txt | GENCODE v24 transcript IDs for the 1,256 target transcripts |
| plcf.gtf | transcript models predicted by PRAM's pooling + Cufflinks method |
| plst.gtf | transcript models predicted by PRAM's pooling + StringTie method |
| cfmg.gtf | transcript models predicted by PRAM's Cufflinks + Cuffmerge method |
| stmg.gtf | transcript models predicted by PRAM's StringTie + merging method |
| cftc.gtf | transcript models predicted by PRAM's Cufflinks + TACO method |
| model_eval.tsv | precision and recall for transcript models predicted by the above five methods in terms of exon nucleotide (row name: exon_nuc), individual junction (row name: indi_jnc), and transcript structure (row name: tr_jnc) |
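For a quick look at the GTF files listed above, the number of distinct transcript models in each can be counted from the transcript_id attribute. This is a minimal sketch, not part of the repository: count_transcripts is a hypothetical helper, and it assumes every feature line carries a transcript_id "..." attribute, as is conventional in GTF.

```shell
# Hypothetical helper: print the number of distinct transcript_id values
# in a GTF file. gzip -cdf passes uncompressed input through unchanged,
# so both .gtf and .gtf.gz files are accepted.
count_transcripts() {
    gzip -cdf "$1" |
        grep -v '^#' |
        sed -n 's/.*transcript_id "\([^"]*\)".*/\1/p' |
        LC_ALL=C sort -u |
        wc -l |
        tr -d ' '
}

# Example, run from the top of the cloned repository:
#   count_transcripts 1_benchmark/reported/plcf.gtf
```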

Reproducibility

To reproduce the model prediction results, run the following commands:

cd 1_benchmark/
./run.R

The script run.R will:

  • download 'noise-free' input RNA-seq BAM files to 1_benchmark/input/
  • predict transcript models by PRAM's five meta-assembly methods and save prediction results as GTF files in 1_benchmark/output/. Files will be named in the same way as in the table above
  • compare transcript models with GENCODE annotation and save the evaluation results in 1_benchmark/output/model_eval.tsv

The script run.R requires ~23 GB of hard drive space and takes ~3 hours using forty 2.1 GHz CPUs. To match the number of CPUs on your own machine, edit the njob_in_para and nthr_per_job variables in run.R so that njob_in_para * nthr_per_job does not exceed the number of available cores.
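The constraint on njob_in_para and nthr_per_job can be checked before editing run.R. The sketch below is not part of the repository: fits_cpu_budget is a hypothetical helper, and the core count is assumed to come from nproc, which is available on Linux.

```shell
# Hypothetical helper (not part of run.R): succeeds when
# njob_in_para * nthr_per_job fits within the given number of cores.
fits_cpu_budget() {
    njob=$1    # value planned for njob_in_para
    nthr=$2    # value planned for nthr_per_job
    ncore=$3   # cores available, e.g. "$(nproc)" on Linux
    [ $((njob * nthr)) -le "$ncore" ]
}

# Example on a Linux machine:
#   fits_cpu_budget 10 4 "$(nproc)" || echo "reduce njob_in_para or nthr_per_job"
```

For instance, 10 jobs with 4 threads each use exactly the forty CPUs quoted above, whereas 10 jobs with 5 threads each would need fifty.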

Human master set

Key results

Five meta-assembly methods of PRAM were applied to predict intergenic transcript models based on thirty human ENCODE RNA-seq datasets. All five prediction results are saved in 2_human/reported/:

| File name | PRAM method |
| --- | --- |
| plcf.gtf.gz | pooling + Cufflinks |
| plst.gtf.gz | pooling + StringTie |
| cfmg.gtf.gz | Cufflinks + Cuffmerge |
| stmg.gtf.gz | StringTie + merging |
| cftc.gtf.gz | Cufflinks + TACO |

We quantified the expression levels of the transcript models predicted by 'pooling + Cufflinks', together with GENCODE v24-annotated transcripts, in each of the thirty ENCODE RNA-seq datasets. The expression levels (in TPM) can be found in isoforms.tpm.gz.

Reproducibility

To reproduce the model prediction results, run the following commands:

cd 2_human/
./prepareEncodeBam.R
./run.R

The script prepareEncodeBam.R will download thirty human RNA-seq BAM files from ENCODE, index them, and save them in 2_human/input/. It requires ~500 GB of hard drive space and takes ~3 hours using thirty 2.1 GHz CPUs. You can adjust the number of CPUs used through the njob_in_para variable in prepareEncodeBam.R.

The script run.R will predict transcript models in human intergenic regions based on the downloaded BAM files. It requires ~20 GB of hard drive space and takes ~4.5 hours using forty 2.1 GHz CPUs. The number of CPUs can be customized for your own machine in the same way as for the benchmark results. Predicted models will be saved as GTF files in 2_human/output/, named in the same way as in the table above.
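One way to compare a reproduced GTF file with its reported counterpart is to compare feature coordinates only, ignoring the attribute column, since transcript identifiers assigned by the assemblers may differ between runs. This is a sketch under that assumption and not part of the repository: gtf_coords and same_gtf_coords are hypothetical helpers, and whether a rerun should match the reported files exactly is not stated here.

```shell
# Hypothetical helpers (not part of the repository's scripts).

# Print the sorted, de-duplicated coordinates of every feature in a GTF
# file: chromosome, feature type, start, end, and strand
# (columns 1, 3, 4, 5, and 7). Accepts plain or gzip-compressed input.
gtf_coords() {
    gzip -cdf "$1" | grep -v '^#' | cut -f1,3,4,5,7 | LC_ALL=C sort -u
}

# Succeed when two GTF files yield the same checksum over their
# coordinate sets, i.e. they almost certainly describe the same features.
same_gtf_coords() {
    sum_a=$(gtf_coords "$1" | cksum)
    sum_b=$(gtf_coords "$2" | cksum)
    [ "$sum_a" = "$sum_b" ]
}

# Example, run from the top of the cloned repository:
#   same_gtf_coords 2_human/reported/plcf.gtf.gz 2_human/output/plcf.gtf \
#       || echo "coordinate sets differ"
```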

Mouse hematopoietic system

Key results

Three meta-assembly methods of PRAM were applied to predict intergenic transcript models based on thirty-two RNA-seq datasets from the mouse hematopoietic system, followed by selection of transcript models that do not overlap RefSeq genes and have mappability ≥ 0.8. All three prediction results are saved in 3_mouse/reported/:

| File name | PRAM method |
| --- | --- |
| plcf.gtf.gz | pooling + Cufflinks |
| cfmg.gtf.gz | Cufflinks + Cuffmerge |
| cftc.gtf.gz | Cufflinks + TACO |

Reproducibility

PRAM is used to predict intergenic transcript models for the mouse hematopoietic system in the same way as for the human master set. Refer to 2_human/run.R for the usage of PRAM.

We do not provide scripts for automatically reproducing these results because:

  • Some mouse ENCODE RNA-seq datasets, such as ENCSR000CLU and ENCSR000CLY, do not have alignment BAM files available.
  • Other mouse ENCODE RNA-seq datasets, such as ENCSR000CHV and ENCSR000CHY, do have alignment BAM files available, but these were based on GENCODE vM4 rather than vM9, which we used to define known genes and intergenic regions.
  • The mouse RNA-seq alignment BAM files we generated take ~750 GB of hard drive space, which would take users a long time to download.

Therefore, we provide only the results. You are welcome to contact us for details on reproducing them.

Reference

PRAM: a novel pooling approach for discovering intergenic transcripts from large-scale RNA sequencing experiments. Peng Liu, Alexandra A. Soukup, Emery H. Bresnick, Colin N. Dewey, and Sündüz Keleş. Genome Research 2020 https://doi.org/10.1101/gr.252445.119

Contact

Got a question? Please report it at the issues tab in this repository.
