SAMsift: advanced filtering and tagging of SAM/BAM alignments using Python expressions.
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.

README.rst

SAMsift

https://travis-ci.org/karel-brinda/samsift.svg?branch=master https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square

SAMsift is a program for advanced filtering and tagging of SAM/BAM alignments using Python expressions.

Getting started

# clone this repo and add it to PATH
git clone http://github.com/karel-brinda/samsift
cd samsift
export PATH=$(pwd)/samsift:$PATH

# filtering: keep only alignments with score >94, save them as filtered.bam
samsift -i tests/test.bam -o filtered.bam -f 'AS>94'
# filtering: keep only unaligned reads
samsift -i tests/test.bam -f 'FLAG & 0x04'
# filtering: keep only aligned reads
samsift -i tests/test.bam -f 'not(FLAG & 0x04)'
# filtering: keep only sequences containing ACCAGAGGAT
samsift -i tests/test.bam -f 'SEQ.find("ACCAGAGGAT")!=-1'
# filtering: keep only sequences containing A and T only (defined using regular expressions)
samsift -i tests/test.bam -f 're.match(r"^[AT]*$", SEQ)'
# filtering: sample alignments with 25% rate
samsift -i tests/test.bam -f 'random.random()<0.25'
# filtering: sample alignments with 25% rate with a fixed RNG seed
samsift -i tests/test.bam -f 'random.random()<0.25' -0 'random.seed(42)'
# filtering: keep only alignments of reads specified in tests/qnames.txt
samsift -i tests/test.bam -0 'q=open("tests/qnames.txt").read().splitlines()' -f 'QNAME in q'
# filtering: keep only first 5000 reads from chr1 and 5000 reads from chr2
samsift -i tests/test.bam -0 'c={"chr1":5000,"chr2":5000}' -f 'c[RNAME]>0' -c 'c[RNAME]-=1' -m nonstop-remove
# tagging: add tags 'ln' with sequence length and 'ab' with average base quality
samsift -i tests/test.bam -c 'ln=len(SEQ);ab=1.0*sum(QUALa)/ln'
# tagging: add a tag 'ii' with the number of the current alignment
samsift -i tests/test.bam -0 'i=0' -c 'i+=1;ii=i'
# updating: removing sequences and base qualities
samsift -i tests/test.bam -c 'a.query_sequence=""'
# updating: switching all reads to unaligned
samsift -i tests/test.bam -c 'a.flag|=0x4;a.reference_start=-1;a.cigarstring="";a.reference_id=-1;a.mapping_quality=0'

Installation

Using Bioconda:

# add all necessary Bioconda channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

# install samsift
conda install samsift

Using PIP from PyPI:

pip install --upgrade samsift

Using PIP from Github:

pip install --upgrade git+https://github.com/karel-brinda/samsift

Command-line parameters

Program: samsift (advanced filtering and tagging of SAM/BAM alignments using Python expressions)
Version: 0.2.5
Author:  Karel Brinda <kbrinda@hsph.harvard.edu>

Usage:   samsift.py [-i FILE] [-o FILE] [-f PY_EXPR] [-c PY_CODE] [-m STR]
                    [-0 PY_CODE] [-d PY_EXPR] [-t PY_EXPR]

Basic options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i FILE               input SAM/BAM file [-]
  -o FILE               output SAM/BAM file [-]
  -f PY_EXPR            filtering expression [True]
  -c PY_CODE            code to be executed (e.g., assigning new tags) [None]
  -m STR                mode: strict (stop on first error)
                              nonstop-keep (keep alignments causing errors)
                              nonstop-remove (remove alignments causing errors) [strict]

Advanced options:
  -0 PY_CODE            initialization [None]
  -d PY_EXPR            debugging expression to print [None]
  -t PY_EXPR            debugging trigger [True]

Algorithm

exec(INITIALIZATION)
for ALIGNMENT in ALIGNMENTS:
        if eval(DEBUG_TRIGER):
                print(eval(DEBUG_EXPR))
        if eval(FILTER):
                exec(CODE)
                print(ALIGNMENT)

Python expressions and code. All expressions and code should be valid with respect to Python 3. Expressions are evaluated using the eval function and code is executed using the exec function. Initialization can be used for importing Python modules, setting global variables (e.g., counters) or loading data from disk. Some modules (namely random and re) are loaded without an explicit request.

Example (printing all alignments):

samsift -i tests/test.bam -f 'True'

SAM fields. Expressions and code can access variables mirroring the fields from the alignment section of the SAM specification, i.e., QNAME, FLAG, RNAME, POS (1-based), MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, and QUAL. Several additional variables are defined to simply accessing some useful information: QUALa stores the base qualities as an integer array; SEQs, QUALs, QUALsa skip soft-clipped bases; and RNAMEi and RNEXTi store the reference ids as integers.

Example (keeping only the alignments with leftmost position <= 10000):

samsift -i tests/test.bam -f 'POS<=10000'

SAMsift internally uses the PySam library and the representation of the current alignment (an instance of the class pysam.AlignedSegment) is available as a variable a. Therefore, the previous example is equivalent to

samsift -i tests/test.bam -f 'a.reference_start+1<=10000'

The a variable can also be used for modifying the current alignment record.

Example (removing the sequence and the bases from every record):

samsift -i tests/test.bam -c 'a.query_sequence=""'

SAM tags. Every SAM tag is translated to a variable with the same name.

Example (removing alignments with a score smaller or equal to the sequence length):

samsift -i tests/test.bam -f 'AS>len(SEQ)'

If CODE is provided, all two-letter variables except re (the Python regex module) are back-translated to tags after the code execution.

Example (adding a tag ab carrying the average base quality):

samsift -i tests/test.bam -c 'ab=1.0*sum(QUALa)/len(QUALa)'

Errors. If an error occurs during an evalution of an expression or an execution of a code (e.g., due to accessing an undefined tag), then SAMsift behavior depends on the specified mode (-m). With the strict mode (-m strict, default), SAMsift will immediately interrupt the computation and report an error. With the -m nonstop-keep option, SAMsift will continue processing the alignments while keeping the error-causing alignments in the output. With the -m nonstop-remove option, all error-causing alignments are skipped and ommited from the output.

Similar programs

  • samtools view can filter alignments based on FLAGS, read group tags, and CIGAR strings.
  • sambamba view supports, in addition to SAMtools, a filtration using simple Perl-like expressions. However, it is not possible to use floats or compare different tags.
  • BamQL provides a simple query language for filtering SAM/BAM files.
  • bamPals adds tags XB, XE, XP and XL.
  • SamJavascript can filter alignments using JavaScript expressions.
  • Picard FilterSamReads can also filter alignments using JavaScript expressions.

Issues

Please use Github issues.

Changelog

See Releases.

Licence

MIT

Author

Karel Brinda <kbrinda@hsph.harvard.edu>