Yet another bioinformatics tool to detect de novo splice junctions from paired-end RNA-seq reads (human genome only)
LIMITATIONS

So far, SplicePL does NOT use non-uniquely mapped reads in the splice junction analysis. Since we do not know how to resolve these ambiguous reads, we choose not to use them.

SplicePL only works with the human genome right now. The chromosome names are hard-coded (chr1 to chr22, chrX, chrY, and chrM). You are encouraged to modify the source code to make it work for your genome of interest.

INTRODUCTION

SplicePL is a simple BLAT parser and splice site filter, originally hacked together in one day to test the idea of using BLAT as an aligner. After it showed promising results, it was rewritten in Perl 5. We find it a useful tool for detecting de novo splice junctions from paired-end human RNA-seq data, especially for reads longer than 75 nt. With reads this long, novel exon-exon junctions can be identified directly by aligning whole reads to the reference genome with algorithms such as BLAT (Kent, 2002) or SSAHA2 (Ning et al., 2001), rather than by splitting reads into segments as in TopHat (Trapnell et al., 2009) and SpliceMap (Au et al., 2010).

SplicePL detects splice junctions in three stages. First, each read is aligned to the reference genome using BLAT. Second, spliced reads that span exon-exon junctions are identified. Finally, the putative junctions are filtered based on the donor/acceptor splice sites, the intron length, and the flanking sequence length.

In summary, SplicePL is a simple BLAT parser and splice site filter. It is written in Perl and packaged as a Perl module to facilitate testing and installation. The code could be improved a lot by using BioPerl modules; however, for the sake of simplicity and ease of use, we chose to implement it the way it is.

At the time of writing (Jun 2010), SplicePL has higher sensitivity for long reads (at least 100 nt) than TopHat and SpliceMap. However, SpliceMap has better specificity than SplicePL. In our own RNA-seq analysis, we used TopHat, SpliceMap, SplicePL, and GSNAP.
We find that robust results can be obtained from highly expressed transcripts. For lowly expressed transcripts, the junctions reported by different methods tend to disagree more.

PREREQUISITE

Before running SplicePL, please make sure that:

1. Either the BLAT binary or the gfServer/gfClient binaries are installed. The most recent version (Jun 25, 2010) is 34x7. You also need faToTwoBit, which is part of BLAT. For BLAT installation, please go to http://genome.ucsc.edu/FAQ/FAQblat.html#blat3

2. FASTA files for the individual human chromosomes have been downloaded and, in the same folder, a 2bit file has been prepared from chr1.fa, chr2.fa, ..., chr22.fa, chrX.fa, chrY.fa, and chrM.fa. You can download the hg19 assembly (the Feb. 2009 human reference sequence, GRCh37) from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

3. You know the exact amount of RAM and the number of processors in your LINUX server.

HOW TO USE IT

SplicePL.pm is designed to run on a LINUX server with large memory and multiple CPUs. SplicePL has been tested on Ubuntu 10.04 and Red Hat 4.1.2-46. Loading the hg19 assembly into memory requires 3GB of RAM. To run multiple BLAT processes (n=N), you need at least 4.0*N GB of RAM.

Run SplicePL.pm from a LINUX terminal in one of two ways:

1. (Recommended) Use the BLAT program, which uses about 4.0GB of RAM per process for human genome hg19, including the index (1.3GB) plus the sequences (3GB):

    perl SplicePL/SplicePL.pm --forward=read1.fa --reverse=read2.fa --genome=/data/share/db/human/hg19.fa.masked --genome_2bit_filename=hg19.fa.masked.2bit --flanksize=10 --processes=6

2. (Optional, not recommended) Use the gfServer/gfClient programs, which use 1.2GB of RAM for human genome hg19. In the alignment stage, the needed sequences are loaded into RAM:

    perl SplicePL/SplicePL.pm --forward=read1.fa --reverse=read2.fa --genome=/data/share/db/human/hg19.fa.masked --genome_2bit_filename=hg19.fa.masked.2bit --flanksize=10 --processes=14 --gfServer

OUTPUT

The alignment output is stored in the 'temp' directory; the junction files are in the 'output' directory.
The first column in the junction file is the splice junction. It has three parts:

1. chromosome name
2. last nucleotide of the donor exon
3. first nucleotide of the acceptor exon

Coordinates are 1-based: the first base of each chromosome is position 1. For example, a junction may look like

    chr10:102286311-102286732

ACKNOWLEDGEMENT

Using BLAT, instead of a short-read aligner such as Bowtie, to align long reads to the genome was initially suggested by Greg Grant.

REFERENCE

Kent, W. J. (2002) BLAT - the BLAST-like alignment tool. Genome Res 12, 656-664.

Ning, Z., Cox, A. J., and Mullikin, J. C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-1729.

CONTACT

If you have any comments or questions about SplicePL, please contact email@example.com

TO DO

1. convert junction format to bed format [Done. Jul 8, 2010, scripts/junction_to_bed.pl ]
2. filter junctions by read coverage
3. convert BLAT psl to BAM format such that RPKM can be calculated using Cufflinks
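
EXAMPLE: READING A JUNCTION IDENTIFIER

The junction identifier described under OUTPUT can be handled with a few lines of code. The following is a minimal Python sketch, not part of SplicePL and not a description of what scripts/junction_to_bed.pl does. The names Junction, parse_junction, and intron_bed_interval are hypothetical. It assumes that the first coordinate is smaller than the second, as in the example above, and that a BED-style interval for the intron is wanted (0-based start, exclusive end).

```python
import re
from typing import NamedTuple, Tuple


class Junction(NamedTuple):
    """One splice junction in the 'chrN:donor-acceptor' form (1-based)."""
    chrom: str
    donor_end: int        # last nucleotide of the donor exon
    acceptor_start: int   # first nucleotide of the acceptor exon

    @property
    def intron_length(self) -> int:
        # Bases strictly between the two exon boundaries.
        return self.acceptor_start - self.donor_end - 1


# Hypothetical pattern: a chromosome name, a colon, then two integers.
_JUNCTION_RE = re.compile(r"^(chr[0-9A-Za-z_]+):(\d+)-(\d+)$")


def parse_junction(text: str) -> Junction:
    """Parse an identifier such as 'chr10:102286311-102286732'."""
    match = _JUNCTION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a junction identifier: {text!r}")
    chrom = match.group(1)
    donor_end = int(match.group(2))
    acceptor_start = int(match.group(3))
    # Assumption: the first coordinate is the smaller one.
    if acceptor_start <= donor_end:
        raise ValueError(f"unexpected coordinate order: {text!r}")
    return Junction(chrom, donor_end, acceptor_start)


def intron_bed_interval(junction: Junction) -> Tuple[str, int, int]:
    """Return the intron as (chrom, start, end), 0-based with exclusive end.

    The intron covers 1-based positions donor_end + 1 to acceptor_start - 1,
    which is [donor_end, acceptor_start - 1) in BED-style coordinates.
    """
    return (junction.chrom, junction.donor_end, junction.acceptor_start - 1)


if __name__ == "__main__":
    j = parse_junction("chr10:102286311-102286732")
    print(j.chrom, j.intron_length)   # chr10 420
    print(intron_bed_interval(j))     # ('chr10', 102286311, 102286731)
```

The intron length computed this way is the quantity that an intron-length filter would compare against its thresholds. The thresholds themselves are not stated in this README.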