Tools for early stage alignment file processing
gt1 Merge pull request #75 from EvanTheB/patch-1
Correct docs PGID -> RGID
Latest commit 12231f7 Jul 20, 2018

biobambam2

This package contains tools for processing BAM files, including:

  • bamsormadup: parallel sorting and duplicate marking
  • bamcollate2: reads BAM and writes BAM reordered such that alignments are collated by query name
  • bammarkduplicates: reads BAM and writes BAM with duplicate alignments marked using the BAM flags field
  • bammaskflags: reads BAM and writes BAM while masking (removing) bits from the flags column
  • bamrecompress: reads BAM and writes BAM with a defined compression setting. This tool is capable of multi-threading.
  • bamsort: reads BAM and writes BAM resorted by coordinates or query name
  • bamtofastq: reads BAM and writes FastQ; output can be collated or uncollated by query name

A short list of options is available for each program by calling it with the -h parameter, e.g.

bamsort -h

Source

The biobambam2 source code is hosted on GitHub:

git@github.com:gt1/biobambam2.git

Release packages can be found at

https://github.com/gt1/biobambam2/releases

Please make sure to choose a package containing the word "release" in its name if you intend to compile biobambam2 for production (i.e. non-development) use.

Compilation of biobambam2

biobambam2 needs libmaus2 [https://github.com/gt1/libmaus2]. If libmaus2 is installed in ${LIBMAUSPREFIX}, then biobambam2 can be compiled and installed in ${HOME}/biobambam2 using

- autoreconf -i -f
- ./configure --with-libmaus2=${LIBMAUSPREFIX} \
	--prefix=${HOME}/biobambam2
- make install

The release packages come with a configure script included (making the autoreconf call unnecessary for source obtained via one of those).

Command line arguments

Unlike many other command line tools, biobambam2 commands take most of their options as key=value pairs. An example for sorting a BAM file by query name is:

bamsort SO=queryname <in.bam >out.bam

Using bamsormadup

bamsormadup is a new tool in biobambam2. It has two modes of operation. If SO=coordinate (the default), it expects a name-collated input file, i.e. one in which all reads for one name appear consecutively. It sorts this file by coordinate, marks duplicate reads and read pairs, and outputs the sorted file in bam, cram or sam format. If SO=queryname, it accepts an input file in any order and sorts it by query name. In both cases the program can produce checksums at read set level (like bamseqchksum) and at file level (like md5sum). If the output file is sorted by coordinate and in bam format, the program can also produce a bam index on the fly.

Most stages of bamsormadup are parallelised. The number of threads used can be set with the threads option (e.g. threads=4). If no such option is given, the number of logical CPUs (cores) of the machine is used as the number of threads. Parallel sorting of alignment files is I/O heavy, so a fast I/O system is crucial. We recommend storing all data (input, temporary and output) on solid state storage (SSD).
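The name-collated input expected in the default mode can be produced with bamcollate2. The following is a minimal sketch of chaining the two tools, assuming both read BAM from standard input and write to standard output as in the bamsort example above. The function name, the file names and the DRY_RUN variable are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Sketch: collate by query name, then sort by coordinate and mark duplicates.
# collate_sort_markdup, in.bam, out.bam and DRY_RUN are hypothetical names.
collate_sort_markdup() {
    in_bam=$1
    out_bam=$2
    threads=${3:-4}   # default of 4 threads is an arbitrary choice for this sketch

    if [ -n "${DRY_RUN:-}" ]; then
        # Only print the pipeline so it can be inspected without running the tools.
        printf '%s\n' "bamcollate2 <$in_bam | bamsormadup SO=coordinate threads=$threads >$out_bam"
        return 0
    fi

    # SO=coordinate is the default and is spelled out here for clarity.
    bamcollate2 <"$in_bam" | bamsormadup SO=coordinate threads="$threads" >"$out_bam"
}

# Example call (prints the pipeline instead of running it):
# DRY_RUN=1 collate_sort_markdup in.bam out.bam 8
```

Because both tools stream, no intermediate collated file is written in this arrangement; temporary files created by the tools themselves are not controlled by this sketch.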

bamsormadup and cram output

CRAM output with bamsormadup requires libmaus2 to be built with support for io_lib version 1.3.11 (or newer) or a sufficiently recent svn revision of version 1.3.10. The binaries provided on GitHub have support for CRAM writing. While encoding CRAM, the program looks for reference sequences in the following places:

  • The directory stored in the environment variable REF_CACHE (if any). The md5 hash value is used as a file name within this directory. For instance, if REF_CACHE is set to ${HOME}/ref and a reference sequence (as stated in an SQ header line) has the md5 hash 01234567890123456789012345678901, then the program would look up the file ${HOME}/ref/01234567890123456789012345678901. If REF_CACHE contains fixed-length string placeholders (such as %2s), then the corresponding parts of the hash are inserted at those positions and the rest of the hash is inserted at the end. If for instance REF_CACHE is set to ${HOME}/ref/%2s/%2s/%s, then the program would look for the sequence above in the file ${HOME}/ref/01/23/4567890123456789012345678901. REF_CACHE designates a read and write cache. The program will try to store reference sequences not previously present in this directory if it is given a FastA or gzipped FastA file as a reference (either via the reference command line key or via the UR field of the corresponding sequence line).
  • The list of directories and URL prefixes stored in the environment variable REF_PATH (if any). Multiple paths are separated by the colon symbol ':'. A colon in a path can be escaped by duplicating it, e.g. the URL http://www.ebi.ac.uk/ena/cram/md5/ would be escaped as http:://www.ebi.ac.uk/ena/cram/md5/. The locations given in this list are considered read only. URLs must be specified using the URL= prefix, e.g. URL=http://www.ebi.ac.uk/ena/cram/md5/.
  • A FastA or gzipped FastA file given as the UR parameter in the header of the input file. This file will be scanned to obtain the reference sequence. All newly found sequences will be stored in the REF_CACHE directory if the respective environment variable is set.
  • A FastA file given via the reference key on the command line.
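The REF_CACHE placeholder expansion and the REF_PATH colon escaping described in the list above can be sketched in plain shell. This is an illustration of the rules as stated in this README, not the code used by biobambam2 or io_lib; the function names expand_ref_cache, escape_ref_path_entry and join_ref_path are hypothetical, and only %s and %Ns placeholders are handled.

```shell
#!/bin/sh
# expand_ref_cache TEMPLATE MD5
# Fill each %Ns placeholder with the next N hash characters and %s with
# whatever remains of the hash. Without any placeholder, the hash is used
# as a file name inside the directory.
expand_ref_cache() {
    template=$1
    rest=$2
    case $template in
        *%*) ;;
        *) printf '%s/%s\n' "$template" "$rest"; return 0 ;;
    esac
    out=
    while [ -n "$template" ]; do
        case $template in
            %s*)
                out=$out$rest                 # remainder of the hash
                rest=
                template=${template#%s}
                ;;
            %[0-9]*)
                spec=${template#%}            # e.g. "2s/%2s/%s"
                width=${spec%%s*}             # digits before the first "s"
                template=${spec#*s}           # text after the placeholder
                head=$(printf '%s' "$rest" | cut -c1-"$width")
                out=$out$head
                rest=${rest#"$head"}
                ;;
            *)
                out=$out${template%"${template#?}"}   # copy one literal character
                template=${template#?}
                ;;
        esac
    done
    printf '%s\n' "$out"
}

# escape_ref_path_entry ENTRY
# Double every colon so the entry survives splitting of REF_PATH on ':'.
escape_ref_path_entry() {
    printf '%s\n' "$1" | sed 's/:/::/g'
}

# join_ref_path ENTRY...
# Escape each entry and join the results with single colons.
join_ref_path() {
    joined=
    for entry in "$@"; do
        escaped=$(escape_ref_path_entry "$entry")
        joined=${joined:+$joined:}$escaped
    done
    printf '%s\n' "$joined"
}

# Example calls:
# expand_ref_cache "${HOME}/ref/%2s/%2s/%s" 01234567890123456789012345678901
# join_ref_path /data/ref 'http://www.ebi.ac.uk/ena/cram/md5/'
```

The functions use global shell variables for brevity, so calling them inside a command substitution, as in `path=$(expand_ref_cache ...)`, keeps them from overwriting variables of the same name in a calling script.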

If the program cannot find a required reference sequence for encoding CRAM in any of these locations, then it fails. Note that if a reference sequence is only present in a FastA file then the REF_CACHE environment variable must be set or CRAM encoding will fail in the current version.
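A wrapper can guard against the FastA-only failure case by refusing to start when REF_CACHE is unset. The sketch below assumes that CRAM output is selected with outputformat=cram; this key is not stated in this README, so please confirm it with bamsormadup -h. The function name, the file names and the DRY_RUN variable are hypothetical.

```shell
#!/bin/sh
# Sketch: check REF_CACHE before encoding CRAM against a FastA reference.
# cram_encode, ref.fa, in.bam, out.cram and DRY_RUN are hypothetical names.
# outputformat=cram is an assumption; the reference key is taken from the text.
cram_encode() {
    reference=$1
    in_bam=$2      # expected to be name collated, since SO=coordinate is the default
    out_cram=$3

    if [ -z "${REF_CACHE:-}" ]; then
        echo "REF_CACHE is not set; CRAM encoding from a FastA-only reference would fail" >&2
        return 1
    fi

    if [ -n "${DRY_RUN:-}" ]; then
        # Only print the command so it can be inspected without running the tool.
        printf '%s\n' "bamsormadup outputformat=cram reference=$reference <$in_bam >$out_cram"
        return 0
    fi

    bamsormadup outputformat=cram reference="$reference" <"$in_bam" >"$out_cram"
}

# Example call:
# REF_CACHE="${HOME}/ref/%2s/%2s/%s" DRY_RUN=1 cram_encode ref.fa in.bam out.cram
```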

Using blastnxmltobam

The blastnxmltobam program can be used to transform blastn's XML output to the BAM format. An example is shown below:

makeblastdb -in ref.fasta -dbtype nucl
blastn -outfmt 5 -query query.fasta -db ref.fasta | blastnxmltobam ref.fasta query.fasta >query.bam

Compiling blastnxmltobam requires the xerces-c library (see https://xerces.apache.org/xerces-c/) for XML parsing; it is enabled with the configure switch --with-xerces-c.