- Introduction
- Setup dependent files
- 'Noise-free' benchmark
- Human master set
- Mouse hematopoietic system
- Reference
- Contact
This repository contains the key results reported in the PRAM manuscript and the R scripts to reproduce them on a user's local machine. We provide results for the 'noise-free' benchmark, the human master set transcript models, and the mouse hematopoietic transcript models. The sections below describe each of them in detail. To reproduce these results, we recommend running all the R scripts on Linux, where we have tested their reproducibility. Please also make sure to set up the dependent files before running any other R scripts.
To obtain this repository, please use the following command:
git clone https://github.com/pliu55/pram_paper
It will create a directory pram_paper/ that contains the following folders and files:
- 0_setup/
  - run.R: script to set up dependent software and files
- 1_benchmark/
  - reported/: results for the 'noise-free' benchmark
  - run.R: script for reproducing the results
- 2_human/
  - reported/: results for human master set transcript models
  - prepareEncodeBam.R and run.R: scripts for reproducing the results
- 3_mouse/
  - reported/: results for the mouse hematopoietic system
To reproduce PRAM's results, first prepare the required software and genomic files with the following commands:
cd 0_setup/
./run.R
The script run.R will download and install:
- the latest PRAM package
- transcript-building software:
- Cufflinks
- StringTie
- TACO
- human gene annotation from GENCODE v24
- human genome version hg38
This script requires ~9 GB of hard drive space and takes ~10 minutes using a single 2.1 GHz CPU. All the dependent software and files will be saved in 0_setup/output/.
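Because the setup step needs roughly 9 GB, it can help to check free disk space before starting. The snippet below is a minimal sketch, not part of the repository: `has_free_gb` is a hypothetical helper name, and it assumes a POSIX `df` and `awk` are available.

```shell
# Hypothetical helper (not part of pram_paper): succeed only if the
# filesystem that holds directory $1 has at least $2 GB of free space.
has_free_gb() {
    local dir="$1" need_gb="$2" avail_kb
    # 'df -Pk' prints one data line per filesystem; column 4 is the
    # available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}

# Example, run from the repository root:
#   has_free_gb . 9 && (cd 0_setup/ && ./run.R)
```

The same check can be reused with the larger space figures quoted for the later scripts.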
Results for the 'noise-free' benchmark test are in the folder 1_benchmark/reported/, with their descriptions listed in the table below:
file name | description |
---|---|
target_transcript_ids.txt | GENCODE v24 transcript IDs for the 1,256 target transcripts |
plcf.gtf | predicted transcript models by PRAM's pooling + Cufflinks method |
plst.gtf | predicted transcript models by PRAM's pooling + StringTie method |
cfmg.gtf | predicted transcript models by PRAM's Cufflinks + Cuffmerge method |
stmg.gtf | predicted transcript models by PRAM's StringTie + merging method |
cftc.gtf | predicted transcript models by PRAM's Cufflinks + TACO method |
model_eval.tsv | precision and recall for transcript models predicted by the above five methods in terms of exon nucleotide (row name: exon_nuc ), individual junction (row name: indi_jnc ), and transcript structure (row name: tr_jnc ) |
To reproduce the model prediction results, run the following commands:
cd 1_benchmark/
./run.R
The script run.R will:
- download 'noise-free' input RNA-seq BAM files to 1_benchmark/input/
- predict transcript models by PRAM's five meta-assembly methods and save the prediction results as GTF files in 1_benchmark/output/. Files will be named in the same way as in the table above
- compare transcript models with GENCODE annotation and save the evaluation results in 1_benchmark/output/model_eval.tsv
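Once run.R has finished, the reproduced files in 1_benchmark/output/ can be compared with those in 1_benchmark/reported/. The following is a sketch only: `compare_outputs` is a hypothetical helper, and it assumes that a byte-for-byte comparison is meaningful. Because the setup step installs the latest PRAM package, small differences are possible; in that case `diff` on the two files shows where they diverge.

```shell
# Hypothetical helper (not part of pram_paper): compare the five GTF files
# and model_eval.tsv between a reported/ and an output/ directory.
# A file missing on either side is reported as "differ".
compare_outputs() {
    local reported="$1" output="$2" f status=0
    for f in plcf.gtf plst.gtf cfmg.gtf stmg.gtf cftc.gtf model_eval.tsv; do
        if cmp -s "$reported/$f" "$output/$f"; then
            echo "same    $f"
        else
            echo "differ  $f"
            status=1
        fi
    done
    return "$status"
}

# Example, run from the repository root:
#   compare_outputs 1_benchmark/reported 1_benchmark/output
```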
The script run.R requires ~23 GB of hard drive space and takes ~3 hours using forty 2.1 GHz CPUs. To match the number of CPUs on your own machine, please edit the njob_in_para and nthr_per_job variables in run.R so that njob_in_para * nthr_per_job does not exceed the number of available cores.
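The constraint njob_in_para * nthr_per_job ≤ available cores can be worked out with integer division. The sketch below uses a hypothetical helper, `max_parallel_jobs`, which is not part of the repository; it only computes a value that you would then enter in run.R by hand.

```shell
# Hypothetical helper (not part of pram_paper): print the largest
# njob_in_para such that njob_in_para * nthr_per_job <= ncores.
# A result of 0 means nthr_per_job itself exceeds the core count
# and should be lowered.
max_parallel_jobs() {
    local ncores="$1" nthr_per_job="$2"
    echo $(( ncores / nthr_per_job ))
}

# Example on Linux, where 'nproc' (GNU coreutils) reports available cores:
#   max_parallel_jobs "$(nproc)" 4
```

For instance, with 40 cores and 4 threads per job the helper prints 10, so njob_in_para = 10 and nthr_per_job = 4 use exactly 40 cores.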
Five meta-assembly methods of PRAM were applied to predict intergenic transcript models based on thirty human ENCODE RNA-seq datasets. All five prediction results are saved in 2_human/reported/:
file name | PRAM method |
---|---|
plcf.gtf.gz | pooling + Cufflinks |
plst.gtf.gz | pooling + StringTie |
cfmg.gtf.gz | Cufflinks + Cuffmerge |
stmg.gtf.gz | StringTie + merging |
cftc.gtf.gz | Cufflinks + TACO |
We quantified the expression levels of the transcript models predicted by 'pooling + Cufflinks', together with GENCODE v24-annotated transcripts, in each of the 30 ENCODE RNA-seq datasets. Their expression levels (in TPM) can be found in isoforms.tpm.gz.
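The reported GTF files are gzip-compressed, so they can be inspected without unpacking them to disk. As a rough sketch, the hypothetical helper below (`count_transcripts`, not part of the repository) counts distinct `transcript_id` values in one file. It assumes the standard nine-column, tab-separated GTF layout with a `transcript_id "..."` attribute in the ninth column.

```shell
# Hypothetical helper (not part of pram_paper): count distinct transcript IDs
# in a gzip-compressed GTF file, skipping '#' comment lines.
count_transcripts() {
    gzip -dc "$1" | awk -F '\t' '
        !/^#/ && match($9, /transcript_id "[^"]+"/) {
            ids[substr($9, RSTART, RLENGTH)] = 1
        }
        END { n = 0; for (k in ids) n++; print n }'
}

# Example, run from the repository root:
#   count_transcripts 2_human/reported/plcf.gtf.gz
```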
To reproduce the model prediction results, run the following commands:
cd 2_human/
./prepareEncodeBam.R
./run.R
The script prepareEncodeBam.R will download thirty human RNA-seq BAM files from ENCODE, index them, and save them in 2_human/input/. It requires ~500 GB of hard drive space and takes ~3 hours using thirty 2.1 GHz CPUs. You can adjust the number of CPUs used via the njob_in_para variable in prepareEncodeBam.R.
The script run.R will predict transcript models in human intergenic regions based on the downloaded BAM files. It requires ~20 GB of space and takes ~4.5 hours using forty 2.1 GHz CPUs. The number of CPUs can be customized for your own machine in the same way as for reproducing the benchmark results.
Predicted models will be saved as GTF files in 2_human/output/. Files will be named in the same way as in the table above.
Three meta-assembly methods of PRAM were applied to predict intergenic transcript models based on thirty-two RNA-seq datasets from the mouse hematopoietic system, followed by selection of transcript models that do not overlap with RefSeq genes and have mappability ≥ 0.8. All three prediction results are saved in 3_mouse/reported/:
file name | PRAM method |
---|---|
plcf.gtf.gz | pooling + Cufflinks |
cfmg.gtf.gz | Cufflinks + Cuffmerge |
cftc.gtf.gz | Cufflinks + TACO |
PRAM is used to predict intergenic transcript models for the mouse hematopoietic system in the same way as for the human master set. You can refer to the script run.R for the human master set for the usage of PRAM.
We do not provide scripts for automatically reproducing the results because:
- Some mouse ENCODE RNA-seq datasets do not have alignment BAM files available, such as ENCSR000CLU and ENCSR000CLY
- Some mouse ENCODE RNA-seq datasets have alignment BAM files available, such as ENCSR000CHV and ENCSR000CHY, but they were based on GENCODE vM4, not vM9, which we used to define known genes and intergenic regions.
- The mouse RNA-seq alignment BAM file we generated takes ~750 GB of hard drive space, which would take a long time for users to download.
Therefore, we simply provide the results instead. You are always welcome to contact us regarding the details of reproducing these results.
PRAM: a novel pooling approach for discovering intergenic transcripts from large-scale RNA sequencing experiments. Peng Liu, Alexandra A. Soukup, Emery H. Bresnick, Colin N. Dewey, and Sündüz Keleş. Genome Research 2020 https://doi.org/10.1101/gr.252445.119
Got a question? Please report it at the issues tab in this repository.