-
Notifications
You must be signed in to change notification settings - Fork 4
License
nboley/grit
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
GRIT - A tool for the integrative analysis of RNA-seq type assays. ################################################################################ Install: ################################################################################ Build Dependencies: c compiler python headers cython Runtime Dependencies: scipy networkx pysam In a debian based system, running (sudo) apt-get install gcc python-dev cython python-scipy python-networkx and then (sudo) easy_install pysam should install all of the dependencies. INSTALLATION: -Method 1: easy_install Run: (sudo) easy_install GRIT-2.0.1.tar.gz This should install the dependencies, the grit python module, and the grit script (run_grit.py). However, installing them through your distributions package manager is preferred whenever possible. -Method 2: setup.py 1) unzip the package (tar -zxvf GRIT-VERSION.tar.gz) 2) move into the unzipped directory (cd GRIT-VERSION/) 3) run the setup script ((sudo) python setup.py install) -Method 3: packages Making debian, ubuntu, and redhat packages is on my TODO, but I haven't. If these would be useful, please feel free to send me an email and I'll move it up the priority list. ################################################################################ Tutorial: ################################################################################ There is a tutorial, with data, available on grit-bio.org. ################################################################################ Introduction: ################################################################################ GRIT is designed to use RNA-seq, TES (e.g. poly(A) seq), and TSS ( e.g. CAGE, RAMPAGE) data to build and quantify full length transcript models. When all of these data sources are not available, GRIT can be run by providing a candidate set of TES or TSS sites. In addition, GRIT can merge in reference junctions, and gene boundaries. GRIT can also be run in quantification mode, where it uses a provided GTF file and just estimates transcript expression. In addition, GRIT only works with stranded and paired RNA-seq data. Although it could be modified to work with either, we feel that these data sources are becoming obsolute. ################################################################################ Simple Example: ################################################################################ (to try this, download and unzip http://grit-bio.org/GRIT_example.tar.gz ) The simplest possible GRIT run is: run_grit --rnaseq-reads AdMatedF_Ecl_20days_Heads.biorep1.rnaseq.chr4.bam \ --cage-reads AdMatedF_Ecl_20days_Heads.biorep1.cage.chr4.bam \ --polya-reads AdMatedF_Ecl_20days_Heads.biorep2.passeq.chr4.bam \ --reference flybase-r5.45.chr4.gtf Note that the reference is required to determine the read strands. The following would produce identical results: run_grit --rnaseq-reads AdMatedF_Ecl_20days_Heads.biorep1.rnaseq.chr4.bam \ --rnaseq-read-type backward \ --cage-reads AdMatedF_Ecl_20days_Heads.biorep1.cage.chr4.bam \ --cage-read-type backward \ --polya-reads AdMatedF_Ecl_20days_Heads.biorep2.passeq.chr4.bam \ --polya-read-type forward It is only possible to use a single bam for each data type when using command line options, and there is no notion of a replicate. However, GRIT also accepts a control file, which allows for substantial flexibility. Output: This will output 3 data files: discovered.elements.bed discovered.transcripts.gtf discovered.expression.csv which contain the transcript elements (e..g exons and promoters), the transcripts, and transcript level expression estimates. Additional options: Typically, one would run the above with: --ucsc : formats the .bed and .gtf files so that they can be loaded into the ucsc genome browser --threads : specifies to use this many concurrent processes --fasta : a fasta file, which allows GRIT to run an ORF finder on the discovered transcripts In addition, for samples with relatively low read coverage, it can be useful to include reference elements. Our experience is that junctions are relatively well annotated, so using --reference with --use-reference-junctions can help to improve gene connectivity, and because the reference junctions are assigned a count of zero during quantification, won't bias the expression estimates. We would not recommend using reference TSS and TES's unless there are no other options. We have found them to be of much lower quality than other transcript elements. Instead, looking for publically avbailable CAGE and poly(A)-site-seq data in a matching tissue or cell line seems to perform better (in human, FANTOM and Merck poly(A) data are good resources. In fly, modENCODE has all the data types). ################################################################################ Control File Example: ################################################################################ Instead of the above commands, we could also run: run_grit.py --control AdMatedF_Ecl_20days_Heads.control.txt Here's an example control file. # comment line # *'s indicate merged #sample_type rep_id assay paired stranded read_type filename AdMatedF_20days_Heads rep1 rnaseq true true auto AdMatedF_Ecl_20days_Heads.biorep1.rnaseq.chr4.bam AdMatedF_20days_Heads rep2 rnaseq true true auto AdMatedF_Ecl_20days_Heads.biorep2.rnaseq.chr4.bam AdMatedF_20days_Heads * cage false true auto AdMatedF_Ecl_20days_Heads.biorep1.cage.chr4.bam AdMatedF_20days_Heads * polya false true auto AdMatedF_Ecl_20days_Heads.biorep2.passeq.chr4.bam sample_type - a unique identifier for the biological sample type. rep_id - a unique identifier for the replicate Data from the same sample type is merged for building elements (e.g. junction discovery, exon building, etc.) but transcripts are quantified on individual replicates. A '*' as a rep_id indicates that the data should be used for all replicates for that sample type. If multiple bam files with the same sample_type and rep_id are provided, they will be merged. So, for instance, if you have an RNAseq experiment where one has a read_type of forward, and one has a read_type of backward, then they should each be provided a line with different read types. Again, in practice we would probably run: run_grit --control AdMatedF_Ecl_20days_Heads.control.txt --verbose --ucsc \ -t 16 --reference flybase-r5.45.chr4.gtf --use-refernce-junctions \ --fasta dm_rel_5.fa ################################################################################ Peak Calling Example: ################################################################################ In addition to discovering and building transcript models, GRIT can be used to identify TSS/TES from RAMPAGE/CAGE/PASseq data and an RNA-seq control. The ENCODE consortium uses the call_peaks script distributed with GRIT v2.0.4 to identify RAMPAGE peaks. Installation is identical to isntalling the full GRIT software pacakge, as described above. To try calling peaks using the ENCODE RNA evaluation data (XXX) for download a position sorted and indexed RAMPAGE file (eg XXX), a matching RNAseq sorted and indexed bam file (eg XXX), and a reference annotation in GTF format (eg XXX). To download test data, in a suitable working directory, run: wget -r -nd --no-parent -A '*.bam*,*.gtf' \ http://mitra.stanford.edu/kundaje/nboley/grit-bio.org/ENCODE_RAMPAGE_TEST/; The exact commands for a linux machine with GRIT installed can be found below. Then run: call_peaks --rampage-reads $RAMPAGEReads.sorted.bam \ --rnaseq-reads $RNASeqReads.sorted.bam \ --reference $referenceAnnotation.gtf \ --exp-filter-fraction 0.05 --trim-fraction 0.01 \ --ucsc \ --outfname peaks.gff --outfname-type gff \ --bed-peaks-ofname peaks.bed \ --threads $nThreads Here we describe the individual options used above - the full set of options and descriptions can be found by running call_peaks --help. --rampage-reads: a position sorted, indexed bam file containing rampage reads to identify peaks from. The ENCODE consoritum uses STAR to map the RAMPAGE reads, PCR remove duplicates, identify junctions, etc. Details of the mapping procedure can be found here: https://github.com/ENCODE-DCC/long-rna-seq-pipeline If a reference annotation is not provided, then the user must set the --rampage-read-type option to forward or backward. If this option is set to forward, then if the first read pair maps to the reference genome without being reverse complemented then the read is assumed to have come from a + stranded gene. --rnaseq-reads: a position sorted, indexed bam file containing rnaseq reads from the same sample as the --rampage-reads. The RNA-seq reads help GRIT distinguish noise in a similar way to how a DNA seq experiment is used to control for a ChIP-seq experiment. GRIT currently requires RNAseq reads generated using a stranded protocol. The ENCODE consoritum uses STAR to map the RNAseq reads; details of the ENCODE mapping pipeline can be found here: https://github.com/ENCODE-DCC/long-rna-seq-pipeline If a reference annotation is not provided, then the user must set the --rnaseq-read-type option to forward or backward. If this option is set to forward, then if the first read pair maps to the reference genome without being reverse complemented then the read is assumed to have come from a + stranded gene. --reference: a GTF file containing reference transcripts. GRIT does not require a reference annotation, but providing one helps improve the quality of the result by providing connectivity between transcribed regions. A reference annotation is also required to automatically infer read strand. --exp-filter-fraction: peaks with low expression levels *relative to the gene region in which they lie* will be filtered regardless of their estimated statistical enrichment. This helps the results to be robust to differences between the RAMPAGE and RNAseq experiments. The ENCODE consortium sets this value to 0.05 which means that a peak will be removed if there is another peak in the same gene with an expression value >= 20x higher. 5% is relatively conservative - those primarily focused on new discovery that have properly matched RAMPAGE/RNAseq experiments should feel comfortable lowering this threshold to 1%. If there is not a control RNASeq experiments from a matching biological sample then values as high as 25% may be appropriate. --trim-fraction: Specify how much to trim the edge of peaks. ENCODE uses a value of 0.01, which specifies that each boundary will be trimmed while the trimming has removed less than 1% of the total reads that fall into the peak. --ucsc: Use contig names compatible with the ucsc browser (this typically just involves prepending a chr to the contig names) --outfname: The name of the output file. --outfname-type: The type of the output file. We recommend using the gff output whenever possible as future updates will include additional meta data which will likely break bed backwards compatibility. --bed-peaks-ofname: Write an additional output file in bed format. This is primarily used for downstream analysis tools that require bed input. --threads: The number of processes to run concurrently. --verbose: Open a ncurses interface which allows each stage to be monitored. ## Ouput Formats ############################################################### gff: gff output is standard gff output with the following id files: gene_id: The ID of the GRIT determined gene region this peaks falls in. gene_name: Currenty identical to gene_id - in the future this will contain the name of the nearest matching reference gene. tss_id: A unique identifier for this TSS id. peak_cov: The RAMPAGE read counts falling within this peak. This is useful for identifying peak summits, identifying peak enrichment, identifying binding motifs associated with this peak, etc. bed: The bed output is a standard bed6 with the following additional fields: read countq : int gene_id : string gene_name : string tss_id : string peak_coverage : comma separated floats specifying the observed read coverage across the peak ## Running IDR on peak replicates ############################################## The ENCODE consortium uses the IDR software package ( https://github.com/nboley/idr/) to identify peaks which are reproducible between replicates. Given the bed output from two peak calling runs (e.g. PEAK1.bed and PEAK2.bed) by running: idr --samples REP1.bed REP2.bed # run IDR on the REP1.bed and REP2.bed --input-file-type bed --rank 7 # specifies that the input is a bed format, # and the score track is in column 7 --output-file IDR.peaks.txt # output the peak list and reproducibility # information to IDR.peaks.txt --plot # produce a QC plot, IDR.peaks.txt.png Details about command line options and a description of the output files can be found at https://github.com/nboley/idr/. ## Example commands for ENCODE test data ####################################### For a linux machine with wget, GRIT and IDR installed, in a suitable working directory run: wget -r -nd --no-parent -A '*.bam*,*.gtf' \ http://mitra.stanford.edu/kundaje/nboley/grit-bio.org/ENCODE_RAMPAGE_TEST/; call_peaks --rampage-reads ENCFF001RDX-ENCFF001RDR_rampage_star_marked.chr20.bam \ --rnaseq-reads ENCFF001RFD-ENCFF001RFC_star_genome.chr20.bam \ --reference gencode.v19.annotation.chr20.gtf \ --exp-filter-fraction 0.05 --trim-fraction 0.01 \ --ucsc \ --outfname peaks.rep1.gff --outfname-type gff \ --bed-peaks-ofname peaks.rep1.bed \ --threads 24; call_peaks --rampage-reads ENCFF001RET-ENCFF001RES_rampage_star_marked.chr20.bam \ --rnaseq-reads ENCFF001RFF-ENCFF001RFE_star_genome.chr20.bam \ --reference gencode.v19.annotation.chr20.gtf \ --exp-filter-fraction 0.05 --trim-fraction 0.01 \ --ucsc \ --outfname peaks.rep2.gff --outfname-type gff \ --bed-peaks-ofname peaks.rep2.bed \ --threads 24; idr --samples peaks.rep1.bed peaks.rep2.bed \ --input-file-type bed --rank 7 \ --output-file peaks.IDR.txt --plot; ################################################################################ Command Line Options: ################################################################################ Run 'run_grit --help' for a complete list. They should be well described. ################################################################################ Questions/bug reports: ################################################################################ Please direct questions and bug reports to the mailing list ( http://groups.google.com/group/grit-bio ). I'll try to respond promptly to bug reports, so please do not hesitate to contact me with questions.
About
No description, website, or topics provided.
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published