PROBer: A general toolkit for analyzing sequencing-based ‘toeprinting’ assays
PROBer

A general toolkit for analyzing sequencing-based 'toeprinting' assays

Bo Li, Akshay Tambe, Sharon Aviran and Lior Pachter.



Introduction

PROBer is a software package that quantifies chemical modification profiles for a general set of sequencing-based 'toeprinting' assays.

Installation

See INSTALL.md

Usage

Prepare Reference Sequences

To prepare reference sequences, you should run PROBer prepare. Run

PROBer prepare --help

to get usage information.

Estimate toeprinting parameters

To estimate toeprinting parameters, you should run PROBer estimate. Run

PROBer estimate --help

to get usage information.

Allocate iCLIP multi-mapping reads

To allocate multi-mapping reads for iCLIP data, you should run PROBer iCLIP. Run

PROBer iCLIP --help

to get usage information.

Simulation

To simulate reads, you should run PROBer simulate. Run

PROBer simulate --help

to get usage information.

Variation of estimates

PROBer can produce plots that assess the variation of its beta estimates using a two-step procedure: 1) multi-mapping reads are sampled using a collapsed Gibbs sampler; 2) for each transcript, the read counts are bootstrapped and the MAP estimates are recomputed. For computational reasons, PROBer currently provides variation plots for only one transcript at a time.

To generate variation plots, turn on the --run-gibbs <directory> option when you run PROBer estimate.

Then for each transcript of interest, first run PROBer-bootstrap:

Usage: PROBer-bootstrap reference_name input_dir transcript_name num_trials [--primer-length primer_length(default: 6)] [--size-selection-min min_frag_len(required)] [--size-selection-max max_frag_len(required)] [--read-length read_length] [--gamma-init gamma_init(default: 0.0001)] [--beta-init beta_init(default: 0.0001)] [-p number_of_threads] [--no-control] [--seed seed] [-q]

In the above command, input_dir should be the same as the <directory> given to the --run-gibbs option. transcript_name is the name of the transcript you are interested in. This name must match exactly the one recorded in the PROBer reference. num_trials is the number of bootstrap trials you want to perform (50 is recommended). -p sets the number of threads, which should be the same as the number you used in PROBer estimate. All other arguments/options have the same meanings as their counterparts in PROBer estimate.
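As a rough illustration, the call for a single transcript might be assembled as follows. The reference name, primer length, size-selection range, read length and thread count are borrowed from the Example section; the Gibbs directory `test_sample_gibbs` and the transcript name `AT1G01010.1` are hypothetical placeholders chosen only for this sketch. The script prints the command rather than executing it, so it does not depend on PROBer being installed.

```shell
#!/bin/sh
# Dry-run sketch: build a PROBer-bootstrap command line and print it.
# gibbs_dir and transcript are hypothetical placeholders, not PROBer defaults.
reference=arabidopsis/arabidopsis   # reference name, as in the Example section
gibbs_dir=test_sample_gibbs         # must match the <directory> passed to --run-gibbs
transcript=AT1G01010.1              # must match the name stored in the PROBer reference
num_trials=50                       # recommended number of bootstrap trials

# Positional arguments first, then options, following the usage line above.
bootstrap_cmd="PROBer-bootstrap $reference $gibbs_dir $transcript $num_trials --primer-length 6 --size-selection-min 21 --size-selection-max 526 --read-length 37 -p 40"

# Print the command instead of executing it.
echo "$bootstrap_cmd"
```

The option values should mirror whatever was passed to PROBer estimate for the same sample.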

Lastly, run PROBer-generateVariationPlot to generate plots:

Usage: PROBer-generateVariationPlot transcript_name estimates.beta bootstrap.txt percent start_position(1-based) end_position(1-based) output.pdf

In this command, transcript_name should be identical to the one used in PROBer-bootstrap. estimates.beta should be the sample_name.beta file generated by PROBer estimate. bootstrap.txt should be <directory>/transcript_name.txt. percent is the percentage (in the range [0, 100]) used to draw error bars. For example, if percent = 90, the 5th and 95th percentiles of the pooled bootstrap estimates are drawn as the two boundaries of the error bars. start_position and end_position are two 1-based transcript coordinates. Only positions within this interval are plotted. Lastly, output.pdf is the name of the output PDF file.
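The relationship between percent and the two percentiles can be sketched with a little arithmetic. This assumes the interval is centered, so that the tails are split evenly as in the percent = 90 example; the README states only that one case, and the variable names below are illustrative.

```shell
#!/bin/sh
# Sketch: derive the two percentile bounds implied by 'percent',
# assuming a centered interval with equal tails on both sides.
percent=90

# Each tail holds half of the excluded mass: (100 - percent) / 2.
lower=$(awk -v p="$percent" 'BEGIN { printf "%g", (100 - p) / 2 }')
upper=$(awk -v p="$percent" 'BEGIN { printf "%g", 100 - (100 - p) / 2 }')

echo "percent=$percent: error bars span percentiles $lower to $upper"
```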

Get version information

Run

PROBer version

to get version information.

Example

Suppose we have the Arabidopsis genome and gene annotation in two files: 'TAIR10_chr_all.fa' and 'TAIR10_GFF3_genes.gff'. We choose 'arabidopsis' as the reference name and are only interested in mRNA and rRNA. The data are single-end reads with a read length of 37bp, with minus channel reads in 'minus.fq' and plus channel reads in 'plus.fq'. The primer length is 6bp and the size selection range is from 21bp to 526bp. We use the Bowtie aligner to align reads and assume the Bowtie executables are under '/sw/bowtie'. We choose 'test_sample' as the sample name and use 40 cores. In the end, we simulate 10M single-end reads with the output name 'test_sim'.

The commands are listed below:

PROBer prepare --gff3 TAIR10_GFF3_genes.gff --gff3-RNA-pattern mRNA,rRNA --bowtie --bowtie-path /sw/bowtie TAIR10_chr_all.fa arabidopsis/arabidopsis
PROBer estimate -p 40 --primer-length 6 --size-selection-min 21 --size-selection-max 526 --read-length 37 --bowtie-path /sw/bowtie arabidopsis/arabidopsis test_sample --reads plus.fq minus.fq
PROBer simulate arabidopsis/arabidopsis test_sample.temp/test_sample_minus.config test_sample minus 10000000 test_sim
PROBer simulate arabidopsis/arabidopsis test_sample.temp/test_sample_plus.config test_sample plus 10000000 test_sim

Authors

Bo Li wrote PROBer, with substantial technical input from Akshay Tambe, Sharon Aviran and Lior Pachter.

Acknowledgements

Thanks to Harold Pimentel and Páll Melsted for their help with CMake, the website and the markdown documents.

A small part of this project's code is adapted from RSEM.

This project uses the Boost C++ and samtools libraries.

License

PROBer is licensed under the GNU General Public License v3.