
Spark Sequence Reducer

Spark Sequence Reducer is an application designed to reduce the size of the whole genome reference sequence databases typically used by sequence analysis tools. It is built on Apache Spark and a MapReduce-like model to achieve high scalability, which makes it possible to reduce large sequence databases within an acceptable time frame, something that would otherwise be impractical.

Spark Sequence Reducer uses the taxonomic classification of the sequences in the reference database to find and remove highly similar regions among sequences that share a taxon at a chosen rank. The algorithm stores shared data only once while keeping the unique regions of every sequence, avoiding data loss.
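Conceptually, the reduction is a group-by-taxon step followed by a per-group merge. The PySpark sketch below illustrates that shape only; the two helper functions are simplified stand-ins, not functions exported by this repository.

# Conceptual PySpark sketch of the reduction; helpers are illustrative stand-ins.
from pyspark import SparkContext

def taxid_for_rank(accession, rank):
    # Stand-in: the real tool resolves accessions through
    # nucl_gb.accession2taxid and the names.dmp/nodes.dmp taxonomy.
    return "taxid-for-" + accession  # hypothetical mapping

def merge_group(sequences):
    # Stand-in: the real tool aligns pairs with Stretcher, stores
    # shared regions once, and keeps each sequence's unique regions.
    return sequences

def reduce_by_rank(sc, records, rank="species"):
    # records: (accession, sequence) pairs parsed from the input FASTA
    return (sc.parallelize(records)
              .map(lambda rec: (taxid_for_rank(rec[0], rank), rec))  # key by taxon
              .groupByKey()                                          # one bucket per taxon
              .mapValues(lambda seqs: merge_group(list(seqs))))      # merge each bucket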

Requirements

  • A Python 3 or Python 2 interpreter (Python 3 preferred) on all nodes of your cluster.
  • GCC for C code compilation.
  • A configured Apache Spark installation.

Compiling Stretcher

Spark Sequence Reducer uses Stretcher, a global alignment algorithm, to find highly similar regions in closely related sequences. Stretcher needs only linear space to find the optimal pairwise global alignment. This low memory footprint makes it possible to align a larger number of sequences in parallel as part of the reduction process.

A precompiled dynamic library (stretcher.so) is provided, but you can compile it manually if needed using make:

make compile

Stretcher is part of EMBOSS and has been modified for standalone use. Spark Sequence Reducer wraps Stretcher using ctypes to make the program callable from Python code.
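Loading the shared library with ctypes follows the standard pattern shown below; note that the entry-point name and signature used here are assumptions for illustration, not the documented interface of stretcher.so.

import ctypes

# Load the compiled library; the function name and signature below are
# illustrative assumptions, not stretcher.so's actual interface.
lib = ctypes.CDLL("./stretcher.so")
# Suppose the library exposed: int align(const char *a, const char *b)
lib.align.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
lib.align.restype = ctypes.c_int

score = lib.align(b"ACGTACGT", b"ACGAACGT")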

Configuration instructions

  • Compile stretcher, the global alignment algorithm (See Compiling Stretcher).
  • Download nucl_gb.accession2taxid.gz and taxdump.tar.gz to the data/ directory of Spark Sequence Reducer (either manually or by running ncbitax-download.sh).
  • Extract names.dmp and nodes.dmp from data/taxdump.tar.gz to data/ (you can skip this step if you used ncbitax-download.sh).
  • Run make configure to create and configure secondary files using the taxonomy files.
  • If running on multiple machines, replicate the Spark Sequence Reducer directory to all the nodes in your cluster (see the sketch after this list).
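Put together, a typical setup run might look like the following; the worker host names in the loop are placeholders for your own cluster:

make compile              # build stretcher.so (see Compiling Stretcher)
./ncbitax-download.sh     # download and extract the taxonomy files
make configure            # build the secondary files from the taxonomy files
# Multi-node clusters only; worker1 and worker2 are placeholder host names.
for node in worker1 worker2; do
    rsync -a ./ "$node":~/Spark-Sequence-Reducer/
done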

Running Spark Sequence Reducer

To run the program using spark-submit:

$SPARK_HOME/bin/spark-submit [spark-options] sparkseqreducer.py [-h]
                          [-r [{species,genus,family,order,class,phylum,superkingdom}]]
                          infile outfile

Arguments:

positional arguments:
  infile                Path to the input file containing the reference sequences to
                        reduce.
  outfile               Path to the output file to save the resulting reduced
                        sequences.

optional arguments:
  -h, --help            show this help message and exit
  -r [{species,genus,family,order,class,phylum,superkingdom}], --rank [{species,genus,family,order,class,phylum,superkingdom}]
                        The taxonomic rank to use for the reduction. Must be
                        one of the following: species, genus, family, order,
                        class, phylum or superkingdom

For example, to reduce example.fasta with species selected as the taxonomic rank for reduction, writing the reduced sequences to $HOME/output:

$SPARK_HOME/bin/spark-submit sparkseqreducer.py --rank species example.fasta $HOME/output
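When submitting to a cluster, the usual spark-submit options go before the script; the master URL and resource sizes below are placeholders to adapt to your environment:

$SPARK_HOME/bin/spark-submit \
    --master spark://master-host:7077 \
    --executor-memory 4G \
    --total-executor-cores 16 \
    sparkseqreducer.py --rank species example.fasta $HOME/output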
