Lisle Mose edited this page Aug 18, 2014 · 64 revisions

Welcome to the UBU wiki!

Notes

To check out the code, you will need git: http://git-scm.com/

On Ubuntu, run: sudo apt-get install git

To clone the repository: git clone git@github.com:mozack/ubu.git

To build, you will need maven: http://maven.apache.org/

On Ubuntu, run: sudo apt-get install maven2

NOTE: HEAD may be unstable. Unless you have special circumstances, please download a pre-built jar.

Available command line actions

  • sam-xlate - Translate from genome to transcriptome coordinates
  • sam-diff - Diff two SAM/BAM files outputting discrepant reads in corresponding SAM/BAM files
  • sam-filter - Filter reads from a paired end SAM or BAM file (only outputs paired reads)
  • sam-summary - Output summary statistics per reference for a SAM/BAM file (Aligned reads only).
  • sam-convert - Convert SAM/BAM file content (i.e. convert quality from phred64 to phred33)
  • sam-junc - Count splice junctions in a SAM or BAM file
  • fastq-format - Format a single FASTQ file (clean up read ids and/or convert quality scoring)
  • sam2fastq - Convert a SAM/BAM file to FASTQ. (Tested against paired end Mapsplice output)

Running

Run java -jar ubu.jar for an up-to-date list of available actions.

Run java -jar ubu.jar [action] for usage details on each action.

512 MB of RAM should be more than enough for the above commands with the exception of sam-xlate. sam-xlate has been tested against hg19 using 3GB of RAM.

Examples:

Write reads specific to first.bam to out1.bam and reads specific to second.bam to out2.bam
java -Xmx512M -jar ubu.jar sam-diff --in1 first.bam --in2 second.bam --out1 out1.bam --out2 out2.bam

Output summary information about alignments.bam (Aligned bases, NM count, error rate, reads, zero mapping quality stats)
java -Xmx512M -jar ubu.jar sam-summary --header --in alignments.bam --out summary.txt

Filter out reads from alignments.bam with mapping quality < 1, insert length > 10000 or containing indels. Write the remaining reads to filtered.bam
java -Xmx512M -jar ubu.jar sam-filter --in alignments.bam --mapq 1 --max-insert 10000 --strip-indels --out filtered.bam
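
The three criteria in the sam-filter example can be pictured as a per-read predicate. The Python sketch below is only an illustration and is not taken from the UBU source: the function name keep_read and its parameters are hypothetical, it assumes the insert length is compared by absolute value (the SAM TLEN field is signed), and it assumes indels are detected from the I and D operations of the CIGAR string. It does not model the paired-read handling that sam-filter performs.

```python
import re


def keep_read(mapq: int, insert_len: int, cigar: str,
              min_mapq: int = 1, max_insert: int = 10000) -> bool:
    """Return True if a read passes all three filters (hypothetical helper)."""
    # Drop reads whose mapping quality is below the threshold.
    if mapq < min_mapq:
        return False
    # Assumption: compare the absolute insert length, since TLEN is signed.
    if abs(insert_len) > max_insert:
        return False
    # CIGAR operations I (insertion) and D (deletion) indicate an indel.
    if re.search(r"[ID]", cigar):
        return False
    return True


if __name__ == "__main__":
    print(keep_read(60, 300, "50M"))        # plain match
    print(keep_read(60, 300, "20M2I28M"))   # contains an insertion
```

Note that a spliced alignment such as 25M1000N25M passes this check, because N (skipped region) is not an indel operation.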

Translate from genome to transcriptome coordinates
java -Xmx3G -jar ubu.jar sam-xlate --bed hg19.bed --order hg19_M_rCRS_ref.transcripts.fa --in sorted_by_ref_and_name.bam --xgtags --reverse --out translated.bam
Sample bed file: https://raw.github.com/mozack/ubu/master/src/test/java/edu/unc/bioinf/ubu/sam/testdata/unc_hg19.bed

Convert phred64.bam from phred64 to phred33 storing output in phred33.bam
java -Xmx512M -jar ubu.jar sam-convert --in phred64.bam --out phred33.bam --phred64to33
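
For readers unfamiliar with the two encodings: phred64 stores each quality score as the ASCII character with code score + 64, and phred33 uses score + 33, so converting from phred64 to phred33 shifts every quality character down by 31. The Python sketch below shows that arithmetic on a single quality string. It is not the UBU implementation, and the function name phred64_to_phred33 is chosen here for illustration only.

```python
PHRED64_OFFSET = 64
PHRED33_OFFSET = 33


def phred64_to_phred33(qualities: str) -> str:
    """Re-encode a phred64 quality string as phred33 (illustrative sketch)."""
    converted = []
    for ch in qualities:
        score = ord(ch) - PHRED64_OFFSET
        # A negative score means the input was not phred64 encoded.
        if score < 0:
            raise ValueError(f"character {ch!r} is below the phred64 range")
        converted.append(chr(score + PHRED33_OFFSET))
    return "".join(converted)


if __name__ == "__main__":
    print(phred64_to_phred33("@hJ"))  # scores 0, 40, 10 -> "!I+"
```

The range check is worth having in any such conversion, since applying it to input that is already phred33 would otherwise silently produce garbage.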

Convert from phred33 to phred64, strip whitespace (and following characters) from read id, and append /1 to read id (if necessary). Input is read from 1.fastq and output is written to formatted_1.fastq.
java -Xmx512M -jar ubu.jar fastq-format --in 1.fastq --out formatted_1.fastq --phred33to64 --strip --suffix /1
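
The read id clean-up described above can be sketched in a few lines of Python. This is an illustration rather than the UBU code: the function name format_read_id is hypothetical, and it assumes that "if necessary" means the suffix is appended only when the id does not already end with it.

```python
def format_read_id(read_id: str, suffix: str = "/1") -> str:
    """Strip everything from the first whitespace on, then ensure the suffix."""
    # Keep only the text before the first run of whitespace.
    parts = read_id.split(None, 1)
    stripped = parts[0] if parts else ""
    # Assumption: append the suffix only when it is not already present.
    if not stripped.endswith(suffix):
        stripped += suffix
    return stripped


if __name__ == "__main__":
    print(format_read_id("@READ1 1:N:0:ACGT"))  # -> @READ1/1
    print(format_read_id("@READ1/1"))           # unchanged
```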

Convert the input BAM file (sorted by name) to FASTQ. Sorting by name is necessary to handle multi-mapped reads.
java -Xmx512M -jar ubu.jar sam2fastq --in sorted_by_name.bam --fastq1 1.fastq --fastq2 2.fastq --end1 /1 --end2 /2

Additional classes of interest:

  • SamReadPairReader - Provides ability to iterate over paired reads in a BAM file. Accounts for multiple mappings.
  • SamMultiMappingReader - Provides the ability to iterate over lists of reads in a BAM file containing multiple mappings.
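
The idea of iterating over lists of reads that share a name can be sketched in Python as follows. This is not a port of SamMultiMappingReader: the Read tuple and the function iter_multi_mappings are hypothetical, and the sketch assumes the input is sorted (or at least grouped) by read name so that all mappings of a read are adjacent.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# Hypothetical minimal record: (read name, reference name, position).
Read = Tuple[str, str, int]


def iter_multi_mappings(reads: Iterable[Read]) -> Iterator[List[Read]]:
    """Yield one list per read name, holding all adjacent mappings of that read."""
    # groupby only merges consecutive items, hence the name-sorted assumption.
    for _name, group in groupby(reads, key=lambda read: read[0]):
        yield list(group)


if __name__ == "__main__":
    example = [("r1", "chr1", 100), ("r1", "chr2", 500), ("r2", "chr1", 300)]
    for mappings in iter_multi_mappings(example):
        print(mappings[0][0], len(mappings))
```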

Contact

lmose at unc dot edu